Modelling human mobility patterns using photographic data shared online

Humans are inherently mobile creatures. The way we move around our environment has consequences for a wide range of problems, including the design of efficient transportation systems and the planning of urban areas. Here, we gather data about the position in space and time of about 16 000 individuals who uploaded geo-tagged images from locations within the UK to the Flickr photo-sharing website. Inspired by the theory of Lévy flights, which has previously been used to describe the statistical properties of human mobility, we design a machine learning algorithm to infer the probability of finding people in geographical locations and the probability of movement between pairs of locations. Our findings are in general agreement with official figures in the UK and on travel flows between pairs of major cities, suggesting that online data sources may be used to quantify and model large-scale human mobility patterns.


Retrieval of data
The following online resources have been used to retrieve data related to our study:
• Geographic coordinates of country boundaries [2].

Distribution of displacement lengths
We can study the distribution of displacement lengths by analysing the difference between the coordinates of geo-tagged photos taken at consecutive times. Fig. 1 shows the empirical complementary cumulative distribution function (ccdf) computed from a random sample of 100 000 displacements. The distribution is heavy-tailed, which is consistent with Lévy flight behaviour. A random sample of the available data has been used to reduce the time needed to compute the ccdf; multiple trials on independent samples confirmed the results shown here.
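The computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the text does not say how the coordinate difference is converted to a length, so the haversine great-circle distance is an assumption here, and the names `haversine_km`, `displacements_km` and `empirical_ccdf` are hypothetical.

```python
import bisect
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def displacements_km(track):
    """Distances between consecutive photos of one user.

    `track` is a list of (timestamp, lon, lat) tuples in any order.
    """
    ordered = sorted(track)  # sort by timestamp
    return [haversine_km(a[1], a[2], b[1], b[2])
            for a, b in zip(ordered, ordered[1:])]


def empirical_ccdf(samples):
    """Return the distinct sample values and P(X > value) for each of them."""
    xs = sorted(samples)
    n = len(xs)
    values = sorted(set(xs))
    probs = [(n - bisect.bisect_right(xs, v)) / n for v in values]
    return values, probs
```

Plotting `probs` against `values` on doubly logarithmic axes is the usual way to inspect a heavy tail; a sub-sample drawn with `random.sample` can be passed to `empirical_ccdf` to mirror the sampling step mentioned in the text.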

Distribution of time intervals between consecutive photos
We can analyse the distribution of time intervals between photos by computing the time elapsed between the timestamps of any two photos uploaded by the same author at consecutive times. Fig. 2 shows the empirical complementary cumulative distribution function (ccdf) computed from a random sample of 100 000 displacements. The distribution is heavy-tailed: most photos are taken within intervals of a few days (e.g., the probability that the time elapsed between consecutive photos exceeds 3 days is about 0.1), but a few photos are taken years apart. A random sample of the available data has been used to reduce the time needed to compute the ccdf; multiple trials on independent samples confirmed the results shown here.
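As a rough illustration of this step, the sketch below groups photos by author, computes the gaps between consecutive timestamps, and evaluates the empirical probability that a gap exceeds a threshold. The input layout (pairs of user identifier and Unix timestamp) and the function names are assumptions made for the example, and the data in the usage lines are toy values, not results from the study.

```python
from bisect import bisect_right
from collections import defaultdict

SECONDS_PER_DAY = 86400.0


def time_gaps_days(photos):
    """Gaps in days between consecutive photos by the same user.

    `photos` is an iterable of (user_id, unix_timestamp) pairs.
    """
    by_user = defaultdict(list)
    for user, timestamp in photos:
        by_user[user].append(timestamp)
    gaps = []
    for stamps in by_user.values():
        stamps.sort()
        gaps.extend((b - a) / SECONDS_PER_DAY
                    for a, b in zip(stamps, stamps[1:]))
    return gaps


def exceedance_probability(samples, threshold):
    """Empirical P(X > threshold), i.e. the ccdf evaluated at `threshold`."""
    xs = sorted(samples)
    return (len(xs) - bisect_right(xs, threshold)) / len(xs)


# Toy data: user "a" has gaps of 1 and 4 days, user "b" has one gap of 10 days.
toy = [("a", 0), ("a", 86400), ("a", 5 * 86400), ("b", 0), ("b", 10 * 86400)]
gaps = time_gaps_days(toy)
p_over_3_days = exceedance_probability(gaps, 3.0)
```

Gaps are never computed across different users, which matches the "same author" condition in the text.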

Derivation of marginal probability distributions
Let $x = [x_{lon}, x_{lat}]$ be the longitude and latitude coordinates of a point. Given the output of an HMM, the probability $p(x|u_n)$ of locating the user $u_n$ in $x$ is given by:

$$p(x|u_n) = \sum_{i=1}^{S_n} p(x|s_i, u_n)\, p(s_i|u_n),$$

where $p(x|s_i, u_n)$ indicates the conditional probability distribution of the $i$-th Gaussian emission learned by the HMM for the user $u_n$, $S_n$ is the total number of states estimated by the DBSCAN algorithm, and $p(s_i|u_n)$ is the prior probability of the state $s_i$ conditional on the user $u_n$. Let $h_{n,i}[m]$ indicate the set of photos uploaded by the user $u_n$ that have been assigned by the Viterbi algorithm to the state $s_i$. We estimate the probability $p(s_i|u_n)$ by counting the number of these photos, $M_{n,i} \overset{\text{def}}{=} |h_{n,i}[m]|$, and dividing it by the total number of photos uploaded by the user:

$$p(s_i|u_n) = \frac{M_{n,i}}{\sum_{j=1}^{S_n} M_{n,j}} \qquad (4.1)$$

Therefore, the model estimates that users are more likely to be found in the areas where they uploaded more photos. By aggregating $p(x|u_n)$ over the entire set of users, we can estimate a probability distribution that describes the likelihood of finding any user in a given area. This will be the marginal probability

$$p(x) = \sum_{n=1}^{N} p(x|u_n)\, p(u_n) = \frac{1}{N} \sum_{n=1}^{N} p(x|u_n),$$

where $N$ is the total number of users in the database, and an equal probability $p(u_n) = 1/N$ is assigned to every user.
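The mixture structure of this derivation can be made concrete with a short sketch. It assumes that each emission is a bivariate Gaussian described by a mean and a 2×2 covariance matrix, and that the per-state photo counts from the Viterbi assignment are already available; all function names and the data layout are hypothetical and not taken from the authors' implementation.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """Density of a bivariate Gaussian at x = (lon, lat).

    `cov` is a symmetric matrix given as ((a, b), (b, c)).
    """
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form with the inverse covariance written out explicitly.
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def state_priors(photo_counts):
    """p(s_i | u_n): photos assigned to state i over all photos of the user."""
    total = sum(photo_counts)
    return [m / total for m in photo_counts]


def user_density(x, emissions, photo_counts):
    """p(x | u_n): emissions weighted by the state priors.

    `emissions` is a list of (mean, cov) pairs, one per state.
    """
    priors = state_priors(photo_counts)
    return sum(p * gaussian_pdf_2d(x, mean, cov)
               for p, (mean, cov) in zip(priors, emissions))


def marginal_density(x, users):
    """p(x): average of p(x | u_n) over users, i.e. p(u_n) = 1/N.

    `users` is a list of (emissions, photo_counts) pairs.
    """
    return sum(user_density(x, e, c) for e, c in users) / len(users)
```

Evaluating `marginal_density` on a grid of longitude and latitude values would give a map of where any user is likely to be found, under the assumptions stated above.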
Let us consider a single user $u_n$ (whenever unambiguous from the context, we will avoid conditioning probabilities on $u_n$ for clarity of notation). The parameters learned by the HMM describe the probability $p(x|s)$ of finding the user in a particular location $x$ given the value of the hidden state $s$, which corresponds to one of the clusters learned by the model, along with the transition probability $p(s_i|s_j)$ between states $s_i$ and $s_j$.
A travel is defined by a pair of locations $x_o$ and $x_d$ that represent the origin and destination positions respectively. We are interested in the travel probability $p(x_d, x_o)$ that quantifies the likelihood of finding the user in locations $x_d$ and $x_o$ at consecutive times. This depends on the latent variables $s_d$ and $s_o$:

$$p(x_d, x_o) = \sum_{s_d \in S} \sum_{s_o \in S} p(x_d, x_o, s_d, s_o),$$

where $S$ is the set of latent states learned for the user $u_n$. We assume that the variables $x_d$ and $x_o$ are independent conditional on the hidden states. Therefore, we can write the travel probability as:

$$p(x_d, x_o) = \sum_{s_d \in S} \sum_{s_o \in S} p(x_d|s_d)\, p(x_o|s_o)\, p(s_d|s_o)\, p(s_o).$$

The probability $p(s_o)$ can be derived from the Viterbi path, and has already been calculated in Eq. (5.1) as $p(s_i|u_n)$. Eq. (5.3) quantifies the likelihood that the user $u_n$ travelled from $x_o$ to $x_d$, while the aggregate travel likelihood can be obtained as the marginal probability:

$$p(x_d, x_o) = \sum_{n=1}^{N} p(x_d, x_o|u_n)\, p(u_n),$$

where $p(u_n)$ describes the probability that we are observing the user $u_n$ travelling between $x_o$ and $x_d$. We can estimate this quantity by counting the number of transitions between different latent states in the Viterbi path of the user, $T_n = |\{ h_n[m] : h_n[m+1] \in s_i,\ h_n[m] \in s_j,\ i \neq j \}|$, and dividing it by the total number of transitions estimated for all users:

$$p(u_n) = \frac{T_n}{\sum_{k=1}^{N} T_k}.$$

Figure 3: Correlation between the percentile rank of the number of travels reported in the NTS survey between pairs of major UK cities and the percentile rank of the probability of travel between the same cities, as learned by the HMM model. There is a moderate but significant correlation between the two variables (Kendall's tau coefficient τ = 0.45, p < 0.001, N = 210). Kendall's tau has been used instead of Pearson's correlation because the data is not normally distributed.
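The travel probability and the user weights derived in this section can be sketched in a few lines. This is an illustrative reading of the formulas, not the authors' code: emissions are passed as plain callables, `transition[o][d]` is assumed to hold $p(s_d|s_o)$, Viterbi paths are assumed to be lists of state indices, and every function name is hypothetical.

```python
def travel_probability(x_o, x_d, emission_pdfs, transition, priors):
    """Per-user travel probability p(x_d, x_o).

    Sums p(x_d|s_d) p(x_o|s_o) p(s_d|s_o) p(s_o) over all state pairs.
    `emission_pdfs[i]` is a callable returning p(x | s_i),
    `transition[o][d]` is p(s_d | s_o), and `priors[o]` is p(s_o).
    """
    total = 0.0
    for o, pdf_o in enumerate(emission_pdfs):
        for d, pdf_d in enumerate(emission_pdfs):
            total += pdf_d(x_d) * pdf_o(x_o) * transition[o][d] * priors[o]
    return total


def count_state_changes(viterbi_path):
    """T_n: consecutive photo pairs assigned to different latent states."""
    return sum(1 for a, b in zip(viterbi_path, viterbi_path[1:]) if a != b)


def user_weights(viterbi_paths):
    """p(u_n) = T_n divided by the transitions of all users.

    Assumes at least one user changes state at least once.
    """
    counts = [count_state_changes(path) for path in viterbi_paths]
    total = sum(counts)
    return [t / total for t in counts]


def aggregate_travel_probability(x_o, x_d, users, viterbi_paths):
    """Marginal travel probability over users.

    `users` is a list of (emission_pdfs, transition, priors) triples,
    aligned with `viterbi_paths`.
    """
    weights = user_weights(viterbi_paths)
    return sum(w * travel_probability(x_o, x_d, *model)
               for w, model in zip(weights, users))
```

Users who never change state receive zero weight, so they contribute to the location marginal of the previous section but not to the aggregate travel probability.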