There's an interesting paper about predicting the geographical co-ordinates of Twitter users based on the kinds of words that they use in their posts. I'd like to do something similar: take text and use it to predict a subject's latitude and longitude.
However, the approach in that paper isn't ideal, because it ignores the sphericity of the Earth. Here's a quote from another paper:
The earth’s surface is continuous, so a natural approach is to predict locations using a continuous distribution. For example, Eisenstein et al. (2010) use Gaussian distributions to model the locations of Twitter users in the United States of America. This appears to work reasonably well for that restricted region, but is likely to run into problems when predicting locations for anywhere on earth; instead, spherical distributions like the von Mises-Fisher distribution would need to be employed.
I'd like to try using the von Mises-Fisher distribution for machine learning tasks with spatial data, but I just don't know how. Does anyone know of a paper or textbook that describes in more detail how to do something like this? I've read about approaches that discretize the earth into little squares, but I'd prefer to avoid that sort of solution if I can. Also, if von Mises-Fisher is the wrong way to go, feel free to suggest a completely different direction.
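To make the question a bit more concrete, here's the kind of building block I imagine being involved: converting lat/lon co-ordinates to unit vectors in 3D and evaluating a von Mises-Fisher density on the sphere. The density formula for the 2-sphere, f(x; mu, kappa) = kappa / (4*pi*sinh(kappa)) * exp(kappa * mu.x), is standard, but the code itself is only my own sketch, not something taken from any of the papers:

```python
import numpy as np

def latlon_to_unit_vector(lat_deg, lon_deg):
    """Embed a (latitude, longitude) pair, in degrees, as a point on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def vmf_log_density(x, mu, kappa):
    """Log density of the von Mises-Fisher distribution on the 2-sphere.

    f(x; mu, kappa) = kappa / (4*pi*sinh(kappa)) * exp(kappa * mu.x),
    where mu is the mean direction (a unit vector) and kappa >= 0 is the
    concentration. Computed in log space so it stays stable for large kappa,
    using log(sinh(k)) = k + log(1 - exp(-2k)) - log(2).
    """
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    return np.log(kappa) - np.log(4.0 * np.pi) - log_sinh + kappa * np.dot(mu, x)

# For example: how likely is London under a distribution centred on Paris?
paris = latlon_to_unit_vector(48.9, 2.4)
london = latlon_to_unit_vector(51.5, -0.1)
print(vmf_log_density(london, mu=paris, kappa=100.0))
```

What I don't know is how to connect building blocks like this to text features in a principled way, which is what I'm hoping a paper or textbook will explain.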
Thanks for your help.
Edit: This question is not an easy one to answer. At this point, I think I've done a fairly exhaustive search of the literature, and I haven't found a paper that both (a) predicts lat/lon co-ordinates from text or other features and (b) doesn't treat the earth as a flat plane. If anyone else has found a paper that meets those two requirements, I would be quite surprised. Anyway, as a last-ditch effort, I'll add a bounty to this question. I'll also list all of the lat/lon prediction papers that I know of below, for others who may research this topic later:
- A latent variable model for geographic lexical variation <-- The one mentioned above in the question
- Estimating User Location in Social Media with Stacked Denoising Auto-encoders <-- Uses deep learning techniques, but I don't see any directional statistics
- Inferring the Origin Locations of Tweets with Quantitative Confidence <-- Returns probability densities, the paper contains a note that mentions "plate carrée"
- Sparse Additive Generative Models of Text <-- See "Application 4" for lat/lon prediction
Edit 2: One of the comments below asked me to briefly describe the model used in the paper by Eisenstein et al. Unfortunately, that paper is quite dense, so I think it makes more sense to summarize a simpler method used by one of the other papers instead.
The simpler paper is based on Gaussian Mixture Models (GMMs). Training consists of tokenizing all of the geotagged tweets to build a list of the words used; if a word is used 1000 times (for example), then there will be 1000 locations in the data that map to that word. Those locations are used to fit a 2D GMM for each word. This collection of fitted GMMs is the trained model.
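In case it helps, here's roughly how I understand that training step, as a sketch. The tokenizer, the data format, the number of mixture components, and the minimum word count are all my own assumptions, not details from the paper:

```python
from collections import defaultdict
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_gmms(geotagged_tweets, n_components=5, min_count=50):
    """Fit one 2D GMM over (lat, lon) for each word.

    geotagged_tweets: iterable of (text, lat, lon) triples (my assumed format).
    min_count: skip rare words, since a GMM can't be fit reliably to only a
    handful of points.
    """
    locations = defaultdict(list)
    for text, lat, lon in geotagged_tweets:
        for word in text.lower().split():   # naive whitespace tokenizer
            locations[word].append((lat, lon))

    word_gmms = {}
    for word, locs in locations.items():
        if len(locs) >= min_count:
            gmm = GaussianMixture(n_components=n_components)
            gmm.fit(np.asarray(locs))
            word_gmms[word] = gmm
    return word_gmms
```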
When the user wishes to estimate the location of a tweet, it is tokenized in the same way, and the GMMs for each of its words are combined. The sum of all of those GMMs can then be searched for the location with the highest probability of having produced the tweet under consideration.
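The prediction step might then look something like the following; evaluating the summed density over a regular lat/lon grid and taking the argmax is my own simplification of "searching" the combined distribution:

```python
import numpy as np

def predict_location(text, word_gmms, grid_step=1.0):
    """Return the (lat, lon) grid cell with the highest summed GMM density."""
    lats = np.arange(-90.0, 90.0, grid_step)
    lons = np.arange(-180.0, 180.0, grid_step)
    grid = np.array([(lat, lon) for lat in lats for lon in lons])

    total_density = np.zeros(len(grid))
    for word in text.lower().split():    # same naive tokenizer as in training
        if word in word_gmms:
            # score_samples returns log densities; exponentiate so that we
            # sum the densities themselves, as in the description above
            total_density += np.exp(word_gmms[word].score_samples(grid))

    return tuple(grid[np.argmax(total_density)])
```

Of course, a grid argmax like this is crude and expensive; the actual papers presumably do something smarter to find the mode.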
Unfortunately, the use of 2D Gaussian distributions could be problematic, since they have no way of taking into account the 'wrapping around' at the International Date Line. As mentioned above, the use of von Mises-Fisher distributions has been suggested as a possible way of resolving this problem, but I'm not knowledgeable enough to say for sure whether it is the best way forward. Any suggestions or ideas would be welcome.
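For what it's worth, fitting a von Mises-Fisher distribution to points on the sphere looks straightforward, which is part of the distribution's appeal to me. Below is a sketch using the well-known approximation for the concentration parameter from Banerjee et al. (2005), "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions"; latlon_to_unit_vector is the helper sketched earlier in the question:

```python
import numpy as np

def fit_vmf(points):
    """Approximate maximum-likelihood fit of a vMF distribution on the 2-sphere.

    points: (n, 3) array of unit vectors, e.g. from latlon_to_unit_vector.
    Uses the kappa approximation of Banerjee et al. (2005):
        kappa ~= r_bar * (d - r_bar**2) / (1 - r_bar**2), with d = 3.
    """
    points = np.asarray(points)
    resultant = points.sum(axis=0)
    r = np.linalg.norm(resultant)
    mu = resultant / r                   # estimated mean direction
    r_bar = r / len(points)              # mean resultant length, in [0, 1)
    kappa = r_bar * (3.0 - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu, kappa
```

So one could imagine simply swapping the per-word 2D GMMs above for per-word vMF distributions (or mixtures of them), but I haven't seen that done in the literature, which is why I'm asking.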