Struggling to integrate sklearn and pandas in simple Kaggle task

Question

I'm trying to use the sklearn_pandas module to extend the work I do in pandas and dip a toe into machine learning, but I'm stuck on an error I don't know how to fix.

I was working through the following dataset on Kaggle.

It's essentially an unheadered table (1000 rows, 40 features) with floating point values.

    import pandas as pd
    from sklearn import neighbors
    from sklearn_pandas import DataFrameMapper, cross_val_score

    path_train = "../kaggle/scikitlearn/train.csv"
    path_labels = "../kaggle/scikitlearn/trainLabels.csv"
    path_test = "../kaggle/scikitlearn/test.csv"

    train = pd.read_csv(path_train, header=None)
    labels = pd.read_csv(path_labels, header=None)
    test = pd.read_csv(path_test, header=None)

    mapper_train = DataFrameMapper([(list(train.columns), neighbors.KNeighborsClassifier(n_neighbors=3))])
    mapper_train

Output:

DataFrameMapper(features=[([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', n_neighbors=3, p=2, weights='uniform'))])

So far so good. But then I try the fit

mapper_train.fit_transform(train, labels)

Output:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-6-e3897d6db1b5> in <module>()
    ----> 1 mapper_train.fit_transform(train, labels)

    //anaconda/lib/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
        409         else:
        410             # fit method of arity 2 (supervised transformation)
    --> 411             return self.fit(X, y, **fit_params).transform(X)
        412
        413

    //anaconda/lib/python2.7/site-packages/sklearn_pandas/__init__.pyc in fit(self, X, y)
        116         for columns, transformer in self.features:
        117             if transformer is not None:
    --> 118                 transformer.fit(self._get_col_subset(X, columns))
        119         return self
        120

    TypeError: fit() takes exactly 3 arguments (2 given)

What am I doing wrong? While the data in this case is all of the same type, I'm planning to build a workflow for a mixture of categorical, nominal and floating point features, and sklearn_pandas seemed to be a logical fit.

Your second import is not correctly indented. I would correct the code myself if the edit was long enough.
– logc, Commented Jul 7, 2014 at 10:08

meyerson · Accepted Answer · 2015-05-27 03:16:56Z

Here is an example of how to get pandas and sklearn to play nice.

Say you have 2 columns that are both strings and you wish to vectorize them, but you have no idea which vectorization params will result in the best downstream performance.

Create the vectorizer:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    to_vect = Pipeline([
        ('vect', CountVectorizer(min_df=1, max_df=.9, ngram_range=(1, 2), max_features=1000)),
        ('tfidf', TfidfTransformer()),
    ])

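Create the DataFrameMapper object:

    from sklearn_pandas import DataFrameMapper

    full_mapper = DataFrameMapper([
        ('col_name1', to_vect),
        ('col_name2', to_vect),
    ])

This is the full pipeline:

    from sklearn.linear_model import SGDClassifier

    # n_iter was renamed max_iter in later scikit-learn releases
    full_pipeline = Pipeline([
        ('mapper', full_mapper),
        ('clf', SGDClassifier(n_iter=15, warm_start=True)),
    ])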

Define the params you want the scan to consider:

    from copy import deepcopy

    full_params = {
        'clf__alpha': [1e-2, 1e-3, 1e-4],
        'clf__loss': ['modified_huber', 'hinge'],
        'clf__penalty': ['l2', 'l1'],
        'mapper__features': [
            [('col_name1', deepcopy(to_vect)),
             ('col_name2', deepcopy(to_vect))],
            [('col_name1', deepcopy(to_vect).set_params(vect__analyzer='char_wb')),
             ('col_name2', deepcopy(to_vect))],
        ],
    }

That's it! Note, however, that mapper__features is a single item in this dictionary, so use a for loop or itertools.product to generate a FLAT list of all the to_vect options you wish to consider. That is a separate task outside the scope of the question.
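For what it's worth, the flat list could be generated along these lines. This is only a minimal sketch and not part of the answer's original code: the helper name `build_feature_grid` and the option values are made up for illustration, and each vectorizer is represented by a plain dict of parameter overrides so the snippet runs without scikit-learn. In real use you would swap each dict for something like `deepcopy(to_vect).set_params(**overrides)`.

```python
from itertools import product

# Hypothetical per-column parameter overrides to try (illustrative values only).
column_options = {
    'col_name1': [{}, {'vect__analyzer': 'char_wb'}],
    'col_name2': [{}, {'vect__ngram_range': (1, 1)}],
}


def build_feature_grid(options):
    """Return a flat list of candidate values for 'mapper__features'.

    Each candidate is a list of (column, overrides) pairs, one pair per column.
    """
    columns = list(options)
    candidates = []
    # product() picks one set of overrides per column and covers every combination.
    for choice in product(*(options[col] for col in columns)):
        candidates.append(list(zip(columns, choice)))
    return candidates


feature_grid = build_feature_grid(column_options)
print(len(feature_grid))  # 2 options x 2 options = 4 candidates
```

The resulting list is what would go under the 'mapper__features' key, one entry per combination.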

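Go on to create the optimal classifier, or whatever else your pipeline ends with:

    # sklearn.grid_search in scikit-learn releases before 0.18
    from sklearn.model_selection import GridSearchCV

    gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)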
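To get a feel for how much work this search does: a grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold. Here is a rough sketch of that expansion using only the standard library. The two `mapper__features` entries are replaced by placeholder strings, which is an assumption made so the snippet runs without scikit-learn.

```python
from itertools import product

# Same keys as full_params; the mapper feature lists are stood in for by labels.
param_grid = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    'mapper__features': ['word_analyzer', 'char_wb_analyzer'],  # placeholders
}

keys = sorted(param_grid)
# One dict per point in the grid: the Cartesian product of all value lists.
combinations = [dict(zip(keys, values))
                for values in product(*(param_grid[k] for k in keys))]

print(len(combinations))  # 3 * 2 * 2 * 2 = 24
```

Every extra entry in 'mapper__features' multiplies that count, so it pays to keep the list of vectorizer variants short.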

logc · Accepted Answer · 2014-07-07 13:42:31Z

I have never used sklearn_pandas, but from reading their source code, it looks like this is a bug on their side. If you look at the function that throws the exception, you will notice that it discards the y argument (y is not even mentioned in the docstring), while the inner fit function expects one more argument, which is probably y:

    def fit(self, X, y=None):
        '''
        Fit a transformation from the pipeline

        X       the data to fit
        '''
        for columns, transformer in self.features:
            if transformer is not None:
                transformer.fit(self._get_col_subset(X, columns))
        return self

I would recommend that you open an issue in their bug tracker.
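The failure mode can be reproduced without either library. The sketch below uses two stand-in classes (`ToyClassifier` and `ToyMapper` are hypothetical names, not part of sklearn or sklearn_pandas): with `forward_y=False` the mapper drops y like the code above, with `forward_y=True` it passes y on. Whether forwarding y is the right upstream fix is my assumption; the point is only to show where the TypeError comes from.

```python
class ToyClassifier:
    """Stand-in for a supervised estimator: fit() requires both X and y."""

    def fit(self, X, y):
        self.n_samples_ = len(X)
        self.classes_ = sorted(set(y))
        return self


class ToyMapper:
    """Stand-in for the mapper; forward_y toggles the behaviour under discussion."""

    def __init__(self, features, forward_y):
        self.features = features
        self.forward_y = forward_y

    def fit(self, X, y=None):
        for columns, transformer in self.features:
            if transformer is None:
                continue
            # Select the requested columns from each row.
            subset = [[row[c] for c in columns] for row in X]
            if self.forward_y:
                transformer.fit(subset, y)   # y reaches the estimator
            else:
                transformer.fit(subset)      # y is dropped, as in the code above
        return self


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y = [0, 1, 0]

try:
    ToyMapper([([0, 1], ToyClassifier())], forward_y=False).fit(X, y)
    error = None
except TypeError as exc:
    error = exc

fixed = ToyMapper([([0, 1], ToyClassifier())], forward_y=True).fit(X, y)
print(type(error).__name__)
```

On Python 3 the message is worded differently from the Python 2.7 traceback in the question, but it is the same TypeError about a missing argument.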

UPDATE:

You can test this if you run your code from IPython. To summarize, if you use the %pdb on magic right before you run the problematic call, the exception is captured by the Python debugger, so you can play around a bit and see that calling the fit function with the label values y[0] does work -- see the last line with the pdb> prompt. (The CSV files are downloaded from Kaggle, except for the largest one which is just a part of the real file).

    In [1]: import pandas as pd
    In [2]: from sklearn import neighbors
    In [3]: from sklearn_pandas import DataFrameMapper, cross_val_score
    In [4]: path_train ="train.csv"
    In [5]: path_labels ="trainLabels.csv"
    In [6]: path_test = "test.csv"
    In [7]: train = pd.read_csv(path_train, header=None)
    In [8]: labels = pd.read_csv(path_labels, header=None)
    In [9]: test = pd.read_csv(path_test, header=None)
    In [10]: mapper_train = DataFrameMapper([(list(train.columns),neighbors.KNeighborsClassifier(n_neighbors=3))])
    In [13]: %pdb on
    In [14]: mapper_train.fit_transform(train, labels)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-14-e3897d6db1b5> in <module>()
    ----> 1 mapper_train.fit_transform(train, labels)

    /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
        409         else:
        410             # fit method of arity 2 (supervised transformation)
    --> 411             return self.fit(X, y, **fit_params).transform(X)
        412
        413

    /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn_pandas/__init__.pyc in fit(self, X, y)
        116         for columns, transformer in self.features:
        117             if transformer is not None:
    --> 118                 transformer.fit(self._get_col_subset(X, columns))
        119         return self
        120

    TypeError: fit() takes exactly 3 arguments (2 given)
    > /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn_pandas/__init__.py(118)fit()
        117             if transformer is not None:
    --> 118                 transformer.fit(self._get_col_subset(X, columns))
        119         return self

    ipdb> l
        113
        114         X       the data to fit
        115         '''
        116         for columns, transformer in self.features:
        117             if transformer is not None:
    --> 118                 transformer.fit(self._get_col_subset(X, columns))
        119         return self
        120
        121
        122     def transform(self, X):
        123         '''

    ipdb> transformer.fit(self._get_col_subset(X, columns), y[0])
    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               n_neighbors=3, p=2, weights='uniform')
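Outside IPython, a roughly equivalent inspection is possible with the standard library alone. `pdb.post_mortem()` opens the debugger at the frame that raised. The non-interactive sketch below instead walks the traceback to that frame and returns its local variables, which is the same information you would look at from the debugger prompt. The function names `failing_frame_locals` and `broken_fit` are invented for this example.

```python
import sys


def failing_frame_locals(func, *args):
    """Call func(*args); if it raises, return the locals of the raising frame."""
    try:
        func(*args)
    except Exception:
        tb = sys.exc_info()[2]
        # Follow the chain to the innermost frame, where the exception was raised.
        while tb.tb_next is not None:
            tb = tb.tb_next
        return dict(tb.tb_frame.f_locals)
    return None


def broken_fit(X, y):
    columns = [0, 1]
    # Deliberate bug for the demo: len() needs an argument.
    return len()


snapshot = failing_frame_locals(broken_fit, [[1.0, 2.0]], [0])
print(sorted(snapshot))
```

For interactive use, calling `pdb.post_mortem()` inside the `except` block drops you at the same frame with a prompt.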

Thanks. I wouldn't have known what had caused it. I only know most of the time it's my work that's at fault :)
– elksie5000, Commented Jul 7, 2014 at 12:56
@elksie5000 : I have added how to debug the call. I hope the last call is what you would expect from a successful call to the function (?). Otherwise, it is always good to know how to step into the code with pdb :)
– logc, Commented Jul 7, 2014 at 13:44
I must admit pdb was something I was looking at again after working through the Python for Data Analysis book by Wes McKinney. I already work in IPython, but had been reasonably happy with print statements. Thank you again.
– elksie5000, Commented Jul 7, 2014 at 15:03
As a side note, the debugger prompt says "ipdb" because it is the ipython debugger - this is an extra install in my setup. Under normal circumstances, it would be the regular pdb that is called. Just noticed this difference.
– logc, Commented Jul 7, 2014 at 15:27
