Using Linear Regression on text data

Question

I am trying to create a model that predicts an author's age, using Nguyen et al. (2011) as my basis.

Using a bag-of-words model, I count the occurrences of each word per document (the documents are posts from message boards) and build a feature vector from the counts. I am using scikit-learn.

I limit the size of the vector by using only the top-k most frequently used words as features, for some fixed number k (stopwords are excluded).

The vectors are then scaled:

X_train = preprocessing.scale(X_train)

I fit a linear regression model to the training data (I also tried Lasso):

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

When I test the model on my test data, I get a low r² score (0.01-0.15) but an acceptable MAE (compared with the paper).
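For reference, the setup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the posts and ages are invented toy data, `k = 5` is arbitrary, and `StandardScaler` (fitted on the training set and reused on the test set) stands in for the `preprocessing.scale` call shown earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# Toy posts and ages, invented purely for illustration.
train_posts = [
    "homework exam school teacher homework",
    "school exam party weekend",
    "mortgage pension grandchildren garden",
    "pension garden mortgage retirement",
    "job meeting mortgage kids school",
    "party weekend exam homework",
]
train_ages = [15, 17, 62, 68, 41, 16]
test_posts = ["exam homework party", "garden pension retirement"]
test_ages = [16, 65]

k = 5  # number of most frequent words kept as features (assumed value)

# Bag of words: drop English stopwords, keep the k most frequent terms.
vectorizer = CountVectorizer(stop_words="english", max_features=k)
X_train = vectorizer.fit_transform(train_posts).toarray().astype(float)
X_test = vectorizer.transform(test_posts).toarray().astype(float)

# Scale with statistics learned on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, train_ages)
pred = model.predict(X_test_scaled)

r2 = r2_score(test_ages, pred)
mae = mean_absolute_error(test_ages, pred)
print("r2:", r2, "MAE:", mae)
```

On data this small the scores themselves mean nothing; the sketch only shows where the two metrics come from.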

When I run the plot function from scikit-learn's Example, I get this:

As in the example, I use only the first feature of the dataset.

What can I do to improve the r² score and what did I do wrong that the plot looks like this?

Since you scale the word counts and the plot runs from -1 to 8, I suspect that a lot of points are overlapping at word count 0 (-0.5 in your plot). You could improve the plot by adding a little jitter. Then you would see that the majority of the data has word count 0.
– Pieter, Commented May 27, 2016 at 21:04
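A minimal sketch of the jitter idea, assuming simulated data: the Poisson word counts, the age distribution and the jitter width of 0.2 are all made up for illustration and are not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for one word's count per post; most posts have count 0.
word_counts = rng.poisson(lam=0.5, size=500)
ages = rng.normal(loc=35, scale=12, size=word_counts.size)

# Add a small uniform offset so points with the same integer count
# no longer sit exactly on top of each other in a scatter plot.
jitter_width = 0.2
x_jittered = word_counts + rng.uniform(-jitter_width, jitter_width,
                                       size=word_counts.size)

share_zero = np.mean(word_counts == 0)
print("share of posts with word count 0:", share_zero)
```

Passing `x_jittered` instead of the raw counts to `matplotlib.pyplot.scatter`, ideally with a low `alpha`, spreads each vertical stack into a band whose density is visible. The jitter is for plotting only; the model should still be fitted on the original counts.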

Timothy Nodine · Accepted Answer · 2016-05-27 16:41:06Z

The plot doesn't look wrong. Your X axis is the word count of one word, after scaling. The Y axis is age. The vertical stacks result from always having an integer word count; there are 8 stacks corresponding to word counts of 0-7. The blue trend line shows that this word is a weak positive indicator for age.

The plot would be slightly clearer if you did not scale your input. Linear regression doesn't benefit from unit-variance scaling anyway.
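The point about scaling can be checked numerically. The sketch below uses simulated counts and ages (all values are assumptions) and plain NumPy least squares with an intercept, rather than the asker's actual data or model object.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated word counts for three words and a noisy age target.
X = rng.poisson(lam=[0.5, 2.0, 1.0], size=(n, 3)).astype(float)
y = 30 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 5, size=n)


def ols_predictions(features, target):
    """Fit ordinary least squares with an intercept; return fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# Unit-variance scaling, as preprocessing.scale would do.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = ols_predictions(X, y)
pred_scaled = ols_predictions(X_scaled, y)

# The coefficients differ, but the fitted values (and hence r²) do not.
print(np.allclose(pred_raw, pred_scaled))
```

This invariance holds for unpenalized least squares. Penalized models such as Lasso are sensitive to feature scale, so the same check would generally not pass there.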

AdamO · Accepted Answer · 2017-08-16 12:28:46Z

Without labeled axes, we can only speak in vague terms:

A linear regression can be extended to a polynomial regression of possibly unbounded degree. This can be extended one step further to splines, which are piecewise polynomial trends joined continuously. There are ample software implementations of smoothing splines, LOESS, and related methods. The implementation in Python is off-topic for this site, but I'm sure it's out there.
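As a rough illustration of the polynomial extension, the sketch below fits a straight line and a cubic with `numpy.polyfit` and compares their in-sample $r^2$. The data are simulated under assumed values (integer counts 0-7, a weak trend, heavy noise) chosen to resemble the situation described, not the asker's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated single feature (integer word counts 0..7) and a noisy age target.
x = rng.integers(0, 8, size=1000).astype(float)
y = 30 + 0.5 * x + rng.normal(0, 12, size=x.size)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


r2_by_degree = {}
for degree in (1, 3):
    coefs = np.polyfit(x, y, deg=degree)
    r2_by_degree[degree] = r_squared(y, np.polyval(coefs, x))

print(r2_by_degree)
```

Because the cubic model contains the linear one, its in-sample $r^2$ cannot be lower, but when the noise dominates the gain tends to be small.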

Yet, a flexible piecewise-continuous polynomial line only promises minimal improvement in $r^2$. As you can see, for a fixed "X" level, the variability of the "Y" is substantial relative to the overall variability of "Y". Without identifying extra features, or considering mixture models for undetected clusters, there is no way to obtain more granular predictions.
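The limit on $r^2$ can be made explicit with the law of total variance. This is a population-level sketch for a single feature $X$; the symbols below are introduced here for illustration and do not appear in the question.

$$
\operatorname{Var}(Y) = \underbrace{\mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]}_{\text{spread within each stack}} + \underbrace{\operatorname{Var}\big(\mathbb{E}[Y \mid X]\big)}_{\text{spread between stack means}}
$$

For any prediction rule $f(X)$, the mean squared error satisfies $\mathbb{E}\big[(Y - f(X))^2\big] \ge \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]$, with equality when $f(X) = \mathbb{E}[Y \mid X]$, so

$$
r^2 \le 1 - \frac{\mathbb{E}\big[\operatorname{Var}(Y \mid X)\big]}{\operatorname{Var}(Y)} = \frac{\operatorname{Var}\big(\mathbb{E}[Y \mid X]\big)}{\operatorname{Var}(Y)}.
$$

When the within-stack spread is nearly as large as the total spread of $Y$, this ceiling is close to zero no matter how flexible the fitted curve is.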
