I am trying to create a model that predicts an author's age. I'm using (Nguyen et al, 2011) as my basis.
Using a Bag of Words Model I count the occurrences of words per Document (which are Posts from Boards) and create the vector. I am using scikit-learn.
I limit the size of the vector by using as features the top-k (k=number) most frequent used words (stopwords will not be used)
The vectors will be scaled.
X_train = preprocessing.scale(X_train) I train the data on a Linear Regression Model (also tried Lasso)
model = linear_model.LinearRegression() model.fit(X_train, y_train) When I test the model on my test data I get a low r² score(0.01-0.15) but an acceptable MAE score (compared with the paper).
When I run the plot function from scikit-learn's Example, I get this: 
Like in the example, I use the first Feature of the Dataset.
What can I do to improve the r² score and what did I do wrong that the plot looks like this?