I am trying to run svm() (from the e1071 package) on a document-feature matrix produced by the quanteda package. I start by fitting the SVM on the training data:

    svm_fit <- svm(x=dfm_train, y=as.factor(y_train), kernel="radial", cost=10)

where dfm_train is an S4 object of class dfm, produced by the dfm() function in quanteda. Next, I want to apply the resulting model to a validation set. dfm_val is likewise produced by applying dfm() to the validation set observations, after which I make sure its features match the ones in the training dfm:

    dfm_val <- dfm_match(dfm_val, features = featnames(dfm_train))

However, when I run:

    predictions_val <- predict(svm_fit, newx=dfm_val, type="class")

the predict.svm() function ignores the newx input, as it does when the newx columns don't match the dataset it was fitted on. Instead, it predicts on the training set, so the above line gives the same result as:

    predict(svm_fit, type="class")

I have previously used the same pipeline successfully to predict with models fitted with glmnet(), so this problem appears to be specific to svm().
I double-checked that the training and validation sets have the same columns:

    > sum(dfm_val@Dimnames$features != dfm_train@Dimnames$features)
    [1] 0

Here is a minimal reproducible example:
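
    library("textdata")
    library("quanteda")
    library("e1071")

    # AG News data: first 1000 rows for training, next 1000 for validation
    d <- dataset_ag_news()
    d_train <- d[1:1000,]
    d_val <- d[1001:2000,]

    dfm_train <- dfm(tokens(d_train$description))
    y_train <- as.factor(d_train$class)

    # Build the validation dfm and align its features with the training dfm
    dfm_val <- dfm(tokens(d_val$description))
    dfm_val <- dfm_match(dfm_val, features = featnames(dfm_train))

    svm_fit <- svm(x=dfm_train, y=y_train, kernel="radial", cost=10)

    # Prediction that is supposed to use the validation set
    predictions_val <- predict(svm_fit, newx=dfm_val, type="class")
    # Same call without newx, i.e. predictions on the training set
    predictions_train <- predict(svm_fit, type="class")

    # The two tables come out identical
    table(predictions_val)
    table(predictions_train)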
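
One thing that may be worth checking is the argument name. As far as the e1071 documentation goes, predict.svm() takes its data through newdata, whereas predict.glmnet() uses newx. An unrecognised newx would be absorbed by `...`, leaving newdata missing, and in that case the fitted values for the training data are returned. The sketch below illustrates this on toy data; iris stands in for the dfm objects, and x_train, x_val, fit, p_newx and p_newdata are names introduced here purely for illustration.

```r
library(e1071)

# Toy stand-in for the dfm objects: 100 training rows, 50 validation rows
set.seed(1)
idx <- sample(nrow(iris), 100)
x_train <- as.matrix(iris[idx, 1:4])
y_train <- iris$Species[idx]
x_val <- as.matrix(iris[-idx, 1:4])

fit <- svm(x = x_train, y = y_train, kernel = "radial", cost = 10)

# Inspect the formal arguments of the predict method for svm objects;
# "newdata" is listed, "newx" is not
names(formals(getS3method("predict", "svm")))

# newx falls into `...`, so newdata is missing and the training-set
# fitted values come back (one per training row)
p_newx <- predict(fit, newx = x_val)
length(p_newx) == nrow(x_train)

# With newdata the predictions are made on the validation rows
p_newdata <- predict(fit, newdata = x_val)
length(p_newdata) == nrow(x_val)
```

If this is the cause, the call in the question would become predict(svm_fit, newdata = dfm_val). Note that type = "class" is not a named argument of predict.svm() either, so it can be dropped.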