Do methods exist other than Ridge Regression and Y ~ X + 0 to prevent OLS from dropping variables?

Question

The goal is to evaluate chess players using a novel analysis system I've been working on -- not all wins are created equal: finding the only move in a razor-sharp position is better than finding the best move when the ten best alternatives are negligibly worse, etc.

Current dataset I'm working with towards proof of concept has 30 players. The design matrix has players as the columns, but each player gets two columns: one for when they're playing as white, one for when they're playing as black. Each row of the design matrix represents half of a match, and 1/0/-1 dummies are used for white/not present/black.

Example: if Player 4 and Player 9 played a match, the design matrix will have two rows for this match. One row will have p4w assigned a "1" and p9b assigned a "-1". The other row will have p4b assigned a "-1" and p9w assigned a "1". All other player columns are 0.
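As a rough sketch of that coding scheme, the two rows for one match could be built as below. The helper name `match_rows` and the column naming (`P4w`, `P4b`, ...) are assumptions for illustration, chosen to match the `P30w`/`P30b` names in the code further down; the Sw/Sb columns are left out.

```r
# Hypothetical helper (not from the original post): build the two
# design-matrix rows for a two-game match between players a and b.
players <- 1:30
cols <- c(paste0("P", players, "w"), paste0("P", players, "b"))

match_rows <- function(a, b) {
  rows <- matrix(0, nrow = 2, ncol = length(cols),
                 dimnames = list(NULL, cols))
  # First half: a plays white (+1), b plays black (-1)
  rows[1, paste0("P", a, "w")] <- 1
  rows[1, paste0("P", b, "b")] <- -1
  # Second half: colours swap
  rows[2, paste0("P", b, "w")] <- 1
  rows[2, paste0("P", a, "b")] <- -1
  rows
}

m <- match_rows(4, 9)
# Show only the four columns that are non-zero for this match
print(m[, c("P4w", "P4b", "P9w", "P9b")])
```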

The result vector is the Engine's score for the player playing as white in that half of the match.

There are also two other columns, Sw and Sb, that attempt to quantify the value of playing white first in a given match, and whether a penalty exists for the player who started as black once they switch to white. Since white always moves first and wins more games than black, the player who starts as black is more likely to be at a disadvantage after the first game.

Using matrix math rather than an R function.

    csv <- read.csv("~/chess.csv", header=TRUE)
    engine <- as.numeric(csv$Engine)

    # ready design matrix/remove dropped variables
    csv$Engine <- NULL
    csv$Sb <- NULL
    csv$P30w <- NULL
    csv$P30b <- NULL

    # readies X and Y
    X <- data.matrix(csv)
    Y <- engine

    # remove copies
    remove(csv)
    remove(engine)

    # Add one column of "1" to X
    one.col <- matrix(1, nrow(X), 1)
    X <- cbind(X, one.col)

    # transposing X
    X.t = t(X)

    # X'X, X'Y
    X.t.X <- X.t %*% X
    X.t.Y <- X.t %*% Y

    # MATHS
    betahat = solve(X.t.X) %*% X.t.Y

Here's the CSV: http://www.filedropper.com/chess_1

Right from the top, I have to drop Sb -- it's redundant. I then am forced to drop a player to defeat the "system is computationally singular" error. In this case, I'm dropping the same player lm() would: the last one.
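To see where the singularity comes from, here is a small check on toy data (4 players in a round robin, not the real CSV, and without the Sw/Sb columns). Under the 1/0/-1 coding described above, every row has exactly one white entry and one black entry, so the white columns sum to the column of ones and the black columns sum to its negative.

```r
# Toy data (an assumption, not the real CSV): 4 players, every pair
# plays a two-game match, coded +1 for white and -1 for black.
n <- 4
cols <- c(paste0("P", 1:n, "w"), paste0("P", 1:n, "b"))
pairs <- t(combn(n, 2))
X <- matrix(0, nrow = 2 * nrow(pairs), ncol = length(cols),
            dimnames = list(NULL, cols))
for (i in seq_len(nrow(pairs))) {
  a <- pairs[i, 1]
  b <- pairs[i, 2]
  X[2 * i - 1, c(paste0("P", a, "w"), paste0("P", b, "b"))] <- c(1, -1)
  X[2 * i,     c(paste0("P", b, "w"), paste0("P", a, "b"))] <- c(1, -1)
}
X1 <- cbind(X, Intercept = 1)

# Row sums over the white columns reproduce the intercept column,
# and row sums over the black columns reproduce its negative.
white_sum <- rowSums(X1[, paste0("P", 1:n, "w")])
black_sum <- rowSums(X1[, paste0("P", 1:n, "b")])
print(all(white_sum == 1))
print(all(black_sum == -1))

# Rank versus number of columns, with and without the intercept.
# In this toy layout both ranks fall short of the column count.
print(c(rank = qr(X1)$rank, ncol = ncol(X1)))
print(c(rank = qr(X)$rank,  ncol = ncol(X)))
```

In this toy layout the matrix is rank deficient even without the column of ones, because each row's entries sum to zero.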

I have no philosophical objections to dropping variables but for the purposes of this, for evaluating players against each other, the incompleteness is troublesome.

Using Ridge Regression "works" to prevent any variable from being dropped, but this is unsatisfying -- do the results then really mean what they should? X + 0 doesn't help matters for this problem either.

Are there any other tools I'm missing? Is ridge regression the right path to take for this problem but, rather than penalize towards zero, penalize towards priors?
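For concreteness, the "penalize towards priors" idea can be sketched in the same matrix style. Minimizing ||Y - Xb||^2 + lambda * ||b - b0||^2 gives b = (X'X + lambda I)^{-1} (X'Y + lambda b0). The function name `ridge_to_prior`, the toy data, the prior values and the value of lambda are all assumptions for illustration.

```r
# Sketch only: ridge regression that shrinks towards a prior vector
# instead of towards zero. All names and numbers here are made up.
ridge_to_prior <- function(X, y, prior, lambda) {
  p <- ncol(X)
  # Solve (X'X + lambda * I) b = X'y + lambda * prior
  solve(crossprod(X) + lambda * diag(p),
        crossprod(X, y) + lambda * prior)
}

# Deliberately singular toy design: column c equals a + b,
# so solve(crossprod(X)) on its own would fail.
X <- cbind(a = c(1, 0, 1, 0), b = c(0, 1, 0, 1), c = 1)
y <- c(2, 1, 2.2, 0.8)
prior <- c(0.5, -0.5, 1.5)

b_small <- ridge_to_prior(X, y, prior, lambda = 0.1)
b_large <- ridge_to_prior(X, y, prior, lambda = 1e8)

print(b_small)  # the data dominate where they are informative
print(b_large)  # essentially the prior
```

With lambda = 0 and a prior of zero this reduces to the OLS normal equations; with a zero prior and lambda > 0 it is ordinary ridge.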

I don't entirely follow how your design matrix is constructed, but in general, if your design matrix is not full rank, then your model is overparametrized: some of your parameters can be written as linear transformations of others. So you can recover these by fitting the reduced system and then solving for them after the fact.
– Andrew M, Commented Dec 5, 2015 at 0:12

Sycorax · Accepted Answer · 2015-12-08 14:34:57Z

The technical term for this problem is linear dependence. As you've discovered, it results in a singular linear system which is not amenable to OLS. In the social sciences, this is commonly referred to as the "dummy variable trap."

But it would appear that for this problem, you don't need to add the intercept column. The rank deficiency of your design matrix appears to be due to the inclusion of an intercept. Since your research interest appears to center on the players themselves, omitting the intercept lets you focus your inquiry and inference directly on the question of player quality.

This doesn't even cause any information loss -- it's two equivalent ways of representing the problem. Consider a simpler case of predicting height based on gender. One method would be to have an intercept and a binary column indicating Male or Female (this example supposes gender is binary, but this is not a necessary assumption -- math doesn't care about gender politics). In this model, the intercept corresponds to the average height of the gender coded as 0. This is the "baseline" category. Then this baseline gender estimate is shifted up or down by the coefficient of the binary feature to give the estimate of the average height for the gender coded as 1. An alternative model would omit the intercept but have two binary columns: one in which male is coded as 1 and one in which female is coded as 1. This predicts a different mean for each gender.

Importantly, the two have equivalent results. In the first model, to get results for each gender, you have the mean and then the mean offset by your gender dummy. In the second model, you can just read off both gender values. Clearly, in a situation where you're immediately interested in all of the features, just go with the fully-dummy model. (Caveat: you can only do this once. Continuing the second height/gender example, you couldn't introduce two additional dummy variables Democrat/Republican because then both gender and political party columns sum to 1, meaning that those pairs of columns are linearly dependent.)
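A minimal sketch of the height/gender example, using made-up heights purely for illustration, shows the two codings giving the same fitted means, and shows the NA that lm() reports when the intercept and both dummies are included together.

```r
# Made-up heights in cm, for illustration only
height <- c(178, 182, 175, 165, 160, 168)
male   <- c(1, 1, 1, 0, 0, 0)
female <- 1 - male

# Model 1: intercept plus one dummy (female is the baseline)
fit_baseline <- lm(height ~ male)
# Model 2: no intercept, one dummy per gender
fit_dummies <- lm(height ~ 0 + male + female)

print(coef(fit_baseline))  # intercept = female mean, male = offset
print(coef(fit_dummies))   # each coefficient is a group mean

# Recover both group means from the baseline model
female_mean <- unname(coef(fit_baseline)["(Intercept)"])
male_mean   <- female_mean + unname(coef(fit_baseline)["male"])

# Intercept and both dummies together are linearly dependent,
# so lm() reports NA for the redundant coefficient
fit_trap <- lm(height ~ male + female)
print(coef(fit_trap))
```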

It took a second read or two, but I get what you mean. Thank you. My follow-up question would be... is there any way to ... extract... the value of the variables I have to drop? A model which doesn't give a value for every player will work most of the time, but none of the time that that player is involved. Additionally, I kind of like the inclusion of an intercept: it lets the outputs be "this is how much better in engine units player X is than the average player". That's a nice output.
– chessmatchesr, Commented Dec 8, 2015 at 5:47
I've revised my answer. The first draft was sloppy, but now it should be clear how to extract estimates for all groups. Importantly, the intercept in the intercept model isn't a mean effect for everyone, as I misstated the first time. The intercept is the mean effect for the baseline category. If you've found my answer helpful, please consider upvoting and/or accepting it.
– Sycorax ♦, Commented Dec 8, 2015 at 14:36
I have upvoted it, I just don't have the necessary reputation yet to change its score! Maybe I'm not asking this question the right way: when I run lm(Y~X), the values for P30w and P30b are NA. When I do OLS through matrix math, the equation won't finish: a player has to be dropped manually (csv$P30w <- NULL; csv$P30b <- NULL) for the matrix math method to obtain results. How do I get the values of P30w and P30b without using ridge regression or machine learning?
– chessmatchesr, Commented Dec 8, 2015 at 20:15
This is described in my answer. These players are dropped due to the degeneracy of your design matrix. The intercept is the estimate for the dropped players. All other players' estimates are benchmarked against the omitted players. — Sycorax
– Sycorax ♦, Commented Dec 8, 2015 at 20:23
The follow-up would then be, how to delineate P30w and P30b from within the intercept? The intercept is 1 number holding 2 answers, how to siphon each half of player 30 out of the intercept? — chessmatchesr
– chessmatchesr, Commented Dec 8, 2015 at 21:30
