There's an interesting paper about predicting the geographical co-ordinates of Twitter users based on the kinds of words that they use in their posts. I'd like to do something similar: take text and use it to predict a subject's latitude and longitude.
However, the approach in that paper isn't ideal, because it ignores the sphericity of the Earth. Here's a quote from another paper:
The earth’s surface is continuous, so a natural approach is to predict locations using a continuous distribution. For example, Eisenstein et al. (2010) use Gaussian distributions to model the locations of Twitter users in the United States of America. This appears to work reasonably well for that restricted region, but is likely to run into problems when predicting locations for anywhere on earth; instead, spherical distributions like the von Mises-Fisher distribution would need to be employed.
I'd like to try using the von Mises-Fisher distribution for machine learning tasks with spatial data, but I just don't know how. Does anyone know of a paper or textbook that describes in more detail how to do something like this? I've read about approaches that discretize the earth into little squares, but I'd prefer to avoid that sort of solution if I can. Also, if von Mises-Fisher is the wrong way to go, feel free to suggest a completely different direction.
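To make the question a bit more concrete, here's the kind of building block I imagine being involved: converting lat/lon co-ordinates to unit vectors in 3D and evaluating a von Mises-Fisher density on the sphere. The density formula for the 2-sphere, f(x; mu, kappa) = kappa / (4*pi*sinh(kappa)) * exp(kappa * mu.x), is standard, but the code itself is only my own sketch, not something taken from any of the papers:

```python
import numpy as np

def latlon_to_unit_vector(lat_deg, lon_deg):
    """Embed a (latitude, longitude) pair, in degrees, as a point on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def vmf_log_density(x, mu, kappa):
    """Log density of the von Mises-Fisher distribution on the 2-sphere.

    f(x; mu, kappa) = kappa / (4*pi*sinh(kappa)) * exp(kappa * mu.x),
    where mu is the mean direction (a unit vector) and kappa >= 0 is the
    concentration. Computed in log space so it stays stable for large kappa,
    using log(sinh(k)) = k + log(1 - exp(-2k)) - log(2).
    """
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    return np.log(kappa) - np.log(4.0 * np.pi) - log_sinh + kappa * np.dot(mu, x)

# For example: how likely is London under a distribution centred on Paris?
paris = latlon_to_unit_vector(48.9, 2.4)
london = latlon_to_unit_vector(51.5, -0.1)
print(vmf_log_density(london, mu=paris, kappa=100.0))
```

What I don't know is how to connect building blocks like this to text features in a principled way, which is what I'm hoping a paper or textbook will explain.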
Thanks for your help.
Edit: This question is not an easy one to answer. At this point, I think I've done a fairly exhaustive search of the literature, and I haven't found a paper that both (a) predicts lat/lon co-ordinates from text or other features and (b) doesn't treat the earth as a flat plane. If anyone else has found a paper that meets those two requirements, I would be quite surprised. Anyway, as a last-ditch effort, I'll add a bounty to this question. I'll also list all of the lat/lon prediction papers that I know of below, for others who may research this topic later:
- A latent variable model for geographic lexical variation <-- The one mentioned above in the question
- Estimating User Location in Social Media with Stacked Denoising Auto-encoders <-- Uses deep learning techniques, but I don't see any directional statistics
- Inferring the Origin Locations of Tweets with Quantitative Confidence <-- Returns probability densities, the paper contains a note that mentions "plate carrée"
- Sparse Additive Generative Models of Text <-- See "Application 4" for lat/lon prediction
Edit 2: One of the comments below asked me to briefly describe the model used in the paper by Eisenstein et al. Unfortunately, that paper is quite dense, so I think it makes more sense to summarize a simpler method used by one of the other papers instead.
The simpler paper is based on Gaussian Mixture Models (GMMs). Training consists of tokenizing all of the geotagged tweets to build a list of the words used; if a word is used 1000 times (for example), then there will be 1000 locations in the data that map to that word. Those locations are used to fit a 2D GMM for each word. This collection of fitted GMMs is the trained model.
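In case it helps, here's roughly how I understand that training step, as a sketch. The tokenizer, the data format, the number of mixture components, and the minimum word count are all my own assumptions, not details from the paper:

```python
from collections import defaultdict
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_gmms(geotagged_tweets, n_components=5, min_count=50):
    """Fit one 2D GMM over (lat, lon) for each word.

    geotagged_tweets: iterable of (text, lat, lon) triples (my assumed format).
    min_count: skip rare words, since a GMM can't be fit reliably to only a
    handful of points.
    """
    locations = defaultdict(list)
    for text, lat, lon in geotagged_tweets:
        for word in text.lower().split():   # naive whitespace tokenizer
            locations[word].append((lat, lon))

    word_gmms = {}
    for word, locs in locations.items():
        if len(locs) >= min_count:
            gmm = GaussianMixture(n_components=n_components)
            gmm.fit(np.asarray(locs))
            word_gmms[word] = gmm
    return word_gmms
```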
When the user wishes to estimate the location of a tweet, it is tokenized in the same way, and the GMMs for each of its words are combined. The sum of all of those GMMs can then be searched for the location with the highest probability of having produced the tweet under consideration.
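The prediction step might then look something like the following; evaluating the summed density over a regular lat/lon grid and taking the argmax is my own simplification of "searching" the combined distribution:

```python
import numpy as np

def predict_location(text, word_gmms, grid_step=1.0):
    """Return the (lat, lon) grid cell with the highest summed GMM density."""
    lats = np.arange(-90.0, 90.0, grid_step)
    lons = np.arange(-180.0, 180.0, grid_step)
    grid = np.array([(lat, lon) for lat in lats for lon in lons])

    total_density = np.zeros(len(grid))
    for word in text.lower().split():    # same naive tokenizer as in training
        if word in word_gmms:
            # score_samples returns log densities; exponentiate so that we
            # sum the densities themselves, as in the description above
            total_density += np.exp(word_gmms[word].score_samples(grid))

    return tuple(grid[np.argmax(total_density)])
```

Of course, a grid argmax like this is crude and expensive; the actual papers presumably do something smarter to find the mode.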
Unfortunately, the use of 2D Gaussian distributions could be problematic, since they have no way of taking into account the 'wrapping around' at the International Date Line. As mentioned above, the use of von Mises-Fisher distributions has been suggested as a possible way of resolving this problem, but I'm not knowledgeable enough to say for sure whether it is the best way forward. Any suggestions or ideas would be welcome.
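For what it's worth, fitting a von Mises-Fisher distribution to points on the sphere looks straightforward, which is part of the distribution's appeal to me. Below is a sketch using the well-known approximation for the concentration parameter from Banerjee et al. (2005), "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions"; latlon_to_unit_vector is the helper sketched earlier in the question:

```python
import numpy as np

def fit_vmf(points):
    """Approximate maximum-likelihood fit of a vMF distribution on the 2-sphere.

    points: (n, 3) array of unit vectors, e.g. from latlon_to_unit_vector.
    Uses the kappa approximation of Banerjee et al. (2005):
        kappa ~= r_bar * (d - r_bar**2) / (1 - r_bar**2), with d = 3.
    """
    points = np.asarray(points)
    resultant = points.sum(axis=0)
    r = np.linalg.norm(resultant)
    mu = resultant / r                   # estimated mean direction
    r_bar = r / len(points)              # mean resultant length, in [0, 1)
    kappa = r_bar * (3.0 - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu, kappa
```

So one could imagine simply swapping the per-word 2D GMMs above for per-word vMF distributions (or mixtures of them), but I haven't seen that done in the literature, which is why I'm asking.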