How to predict socioeconomic trends with digital traces

The Telefónica Scientific Group has carried out a study in Mexico to find out whether mobile telephony data can be used to predict economic trends.

Social development is typically measured through socioeconomic time series such as the level of employment, the gross domestic product or the consumer price index. Calculating such indicators quickly and efficiently is critical for implementing and evaluating the policies that can change them. Traditionally, national statistics institutes calculate these values from survey data. However, the ubiquitous presence of social networks and mobile phones is generating data that are useful for characterizing social behavior and that may be relevant for calculating and predicting socioeconomic indicators.

For example, in a study using Google searches related to the financial sector, the authors were able to predict the time series of mortgage interest rates. Adding Google searches produced better results than using only the past values of the interest rate series. Also in the financial domain, another study showed that Twitter usage (volume of comments) is strongly correlated with the time series of various economic indicators. Following the same strategy, other authors have shown that the correlation exists not only with the volume of tweets but also with the sentiment (positive or negative) expressed in them, in at least two cases: the price of oil and the DJIA (Dow Jones Industrial Average). In general, the state of the art indicates that, when predicting socioeconomic time series, combining past values with extra information obtained from digital traces considerably improves the prediction.

Following the tracks of digitization

In this context, at the Telefónica Scientific Group we have studied whether mobile telephony traces can help predict these time series. In previous studies we showed that mobile telephony traces are correlated with the socioeconomic levels of regions. In this case, we seek to evaluate whether the time series of socioeconomic values can be predicted using information extracted from the telephony traces.

To do this, we focused our study on Mexico, where the INEGI (National Institute of Statistics and Geography) provided a set of time series for each federal state covering 17 months. The socioeconomic series were: (1) total number of employed persons; (2) total number of workers in private companies; (3) total number of public officials; and (4) total number of subcontracted workers. For the same time window, we calculated two groups of variables from the telephony traces, monthly and at the state level: consumption and mobility. The consumption variables included elements such as the average number of incoming and outgoing calls and their duration. The mobility variables included the number of antennas (BTSs) used during a call, the average distance traveled by the user during a call, the distance between consecutive calls, the average distance traveled in a month, the diameter, and the monthly average radius of gyration (the radius of the area spanned by the towers used during a period of time, weighted by the number of calls).
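The radius of gyration can be illustrated with a minimal sketch. It assumes that tower positions are already projected to planar coordinates in kilometers and that each tower comes with the number of calls it handled for one user in the period; the function name and the input layout are hypothetical, not taken from the study.

```python
import math

def radius_of_gyration(towers):
    """Call-weighted radius of gyration for one user and one period.

    towers: list of (x_km, y_km, n_calls) tuples, one per antenna used.
    Coordinates are assumed to be planar (already projected), in km.
    """
    total_calls = sum(n for _, _, n in towers)
    if total_calls == 0:
        raise ValueError("no calls in the period")
    # Center of mass of the towers, weighted by number of calls.
    cx = sum(x * n for x, _, n in towers) / total_calls
    cy = sum(y * n for _, y, n in towers) / total_calls
    # Weighted mean squared distance from the center of mass.
    mean_sq = sum(
        n * ((x - cx) ** 2 + (y - cy) ** 2) for x, y, n in towers
    ) / total_calls
    return math.sqrt(mean_sq)
```

A user who only ever calls from one tower has a radius of 0; a user who splits calls evenly between two towers 2 km apart has a radius of 1 km. A state-level monthly value would then be an average of this quantity over users.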

A useful first step to get an indication of which time series are predictive is to calculate the cross correlations between the telephony series and the socioeconomic series. This analysis tells us which correlations are statistically significant and at what time lag (interval) between the series they occur. The relevant correlations are therefore those with a negative lag, since they represent telephony series that can anticipate changes in the socioeconomic series before they happen. Table 1 presents the cross correlations between the consumption and distance variables and the socioeconomic time series considered. Only significant correlations are shown, and for each one the lag at which it occurs and the correlation value are given. A positive lag would indicate that the socioeconomic series leads and has predictive capacity over the telephony series.
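As a minimal sketch of this kind of analysis, the following computes the Pearson correlation between a telephony series and a socioeconomic series at several monthly lags. The sign convention (a negative lag means the telephony series leads) matches the text, but the function names are hypothetical and the significance test used in the study is omitted here.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlation(phone, socio, lag):
    """Correlate phone[t + lag] with socio[t].

    lag < 0: the telephony series leads the socioeconomic one.
    lag > 0: the socioeconomic series leads.
    """
    if lag < 0:
        p, s = phone[:lag], socio[-lag:]
    elif lag > 0:
        p, s = phone[lag:], socio[:-lag]
    else:
        p, s = phone, socio
    return pearson(p, s)

def best_lag(phone, socio, max_lag):
    """Return (lag, correlation) with the largest absolute correlation."""
    candidates = [
        (lag, lagged_correlation(phone, socio, lag))
        for lag in range(-max_lag, max_lag + 1)
    ]
    return max(candidates, key=lambda item: abs(item[1]))
```

With only 17 monthly points per series, each extra lag removes one observation from the overlap, so `max_lag` should be kept small.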

Table 1 shows that the consumption variables are correlated at negative lags with the total number of employees and the total number of workers in the private sector. In general, an increase in the number of outgoing calls and their duration is followed by an increase the following month in the number of employees, which may be an indication that telephones are used as a tool to search for work and/or of a greater availability of free time (that is, when fewer people are working, the number of outgoing calls increases). Regarding the mobility variables, those that reflect the total distance traveled during a call or the average distance traveled during a month have a positive correlation at a lag of 1, indicating that an increase in the distances traveled may be an indicator of an increase in the number of employees. The diameter and radius of gyration variables have a negative lag for the number of workers in the private sector and for the number of public workers, which indicates that people who have a job tend to move over a larger area.

To evaluate the predictive power of the telephony time series, we used multivariate autoregressive models. Of the 17 months available in each time series, we used 13 to train the model and 4 to evaluate it. Table 2 presents the goodness of fit of the models (measured using the mean square error), both for the training period and for the validation period, using a single mobile phone series (the one that produced the best result in each case). In the training phase, the model produces very good approximations, whether using the number of outgoing calls or the radius of gyration (the values presented in Table 1 are for the time series of outgoing calls). In the validation phase, we obtain values of around 0.5 to 0.6; here the number of outgoing calls produces the best result for the total number of employees, and the radius of gyration for the other three socioeconomic series.
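The train-and-validate procedure can be sketched with a deliberately simple stand-in model. The example below assumes a first-order autoregression with one lagged telephony variable, socio[t] = a + b·socio[t-1] + c·phone[t-1], fitted by ordinary least squares and evaluated with one-step-ahead forecasts. The model order, the function names and the forecasting scheme are assumptions for illustration; the study does not specify them here.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_arx(socio, phone):
    """Least-squares fit of socio[t] = a + b*socio[t-1] + c*phone[t-1]."""
    rows = [[1.0, socio[t - 1], phone[t - 1]] for t in range(1, len(socio))]
    targets = [socio[t] for t in range(1, len(socio))]
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(3)]
    return solve_linear(xtx, xty)

def forecasts(coef, socio, phone, start, stop):
    """One-step-ahead forecasts for months start..stop-1."""
    a, b, c = coef
    return [a + b * socio[t - 1] + c * phone[t - 1] for t in range(start, stop)]

def mse(actual, predicted):
    """Mean square error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)

def train_test_mse(socio, phone, n_train=13):
    """Fit on the first n_train months, evaluate on the remaining ones."""
    coef = fit_arx(socio[:n_train], phone[:n_train])
    train_pred = forecasts(coef, socio, phone, 1, n_train)
    test_pred = forecasts(coef, socio, phone, n_train, len(socio))
    return coef, mse(socio[1:n_train], train_pred), mse(socio[n_train:], test_pred)
```

Calling `train_test_mse(socio, phone)` on two 17-month series reproduces the 13/4 split described above. Error values are only comparable across series if the series are on a common scale, so some normalization would be needed before comparing them.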

Figure 1 presents, for the total number of workers (a) and the number of subcontracted workers (b), the original time series (solid line) and the output of the trained model (dashed line), covering both the training phase (until February) and the validation phase (last four months). Series (a) is built from the time series of the number of outgoing calls and series (b) from the radius of gyration. In general, when the root mean square error is low, as in case (a), the change in the number of employees can be predicted. When the error is higher, as in case (b) for subcontracted workers, the absolute value is generally underestimated, but the overall trend of the series is still captured.

Positive conclusions

The results obtained indicate that mobile telephony time series add relevant information for the prediction of socioeconomic time series. This opens the door to providing national statistics institutes with new tools for their predictions. The results of this study cannot be directly extrapolated to other markets, mainly because pricing schemes evolve continuously and directly affect the consumption variables, so their predictive capacity may vary. Likewise, the use of other communication alternatives, such as WhatsApp or Skype, affects the calculation of the consumption variables and their predictive capacity. For this reason, it is the mobility variables that offer a truly differential value and that will allow us to develop tools that complement the prediction of socioeconomic time series.


An extended version of the study can be found here.