meyerson
  • 176
  • 1
  • 3

Here is an example of how to get pandas and sklearn to play nicely together, using DataFrameMapper from the sklearn-pandas package.

Say you have two string columns that you wish to vectorize, but you have no idea which vectorization parameters will give the best downstream performance.

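Create the vectorizer:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

to_vect = Pipeline([('vect', CountVectorizer(min_df=1, max_df=.9, ngram_range=(1, 2), max_features=1000)), ('tfidf', TfidfTransformer())])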

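Create the DataFrameMapper object. Give each column its own copy of the vectorizer, so that fitting one column does not overwrite the vocabulary learned from the other:

from copy import deepcopy
from sklearn_pandas import DataFrameMapper

full_mapper = DataFrameMapper([('col_name1', deepcopy(to_vect)), ('col_name2', deepcopy(to_vect)), ('col_name3', None)])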

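This is the full pipeline (recent sklearn releases spell the iteration count max_iter rather than n_iter):

from sklearn.linear_model import SGDClassifier

full_pipeline = Pipeline([('mapper', full_mapper), ('clf', SGDClassifier(max_iter=15, warm_start=True))])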

Define the parameters you want to search over:

full_params = {
    'clf__alpha': [1e-2, 1e-3, 1e-4],
    'clf__loss': ['modified_huber', 'hinge'],
    'clf__penalty': ['l2', 'l1'],
    # each entry below is one complete candidate feature list for the mapper
    'mapper__features': [
        [('col_name1', deepcopy(to_vect)), ('col_name2', deepcopy(to_vect))],
        [('col_name1', deepcopy(to_vect).set_params(vect__analyzer='char_wb')), ('col_name2', deepcopy(to_vect))],
    ],
}

That's it! Note, however, that mapper__features is a single entry in this dictionary, and each candidate value is a complete feature list. Use a for loop or itertools.product to generate a flat list of all the to_vect options you wish to consider. That is a separate task, decoupled from both sklearn and pandas.
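A minimal sketch of how that flat list could be generated with itertools.product. The names vect_options, columns and make_vect are hypothetical helpers introduced here for illustration, and the two analyzer settings are only example choices:

```python
from copy import deepcopy
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

to_vect = Pipeline([
    ('vect', CountVectorizer(min_df=1, max_df=.9, ngram_range=(1, 2), max_features=1000)),
    ('tfidf', TfidfTransformer()),
])

# Hypothetical per-column settings to try (assumption: only the analyzer varies).
vect_options = [
    {'vect__analyzer': 'word'},
    {'vect__analyzer': 'char_wb'},
]
columns = ['col_name1', 'col_name2']


def make_vect(params):
    """Return an independent copy of to_vect with the given parameters applied."""
    return deepcopy(to_vect).set_params(**params)


# One candidate feature list per combination of per-column settings.
feature_options = [
    [(col, make_vect(params)) for col, params in zip(columns, combo)]
    for combo in product(vect_options, repeat=len(columns))
]
```

With two settings for each of two columns this yields four candidate feature lists, which can then be used as the value of 'mapper__features' in the parameter dictionary.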

Go on to create the optimal classifier, or whatever else your pipeline ends with:

from sklearn.model_selection import GridSearchCV

gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)
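Constructing the GridSearchCV object does not run the search; that happens on fit. Below is a rough, self-contained sketch of that last step. To keep it runnable with scikit-learn alone it assumes a single text column and leaves out the DataFrameMapper step; the toy texts, labels, reduced grid and cv=3 are all made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one text column, two balanced classes.
texts = ['red apple', 'green apple', 'ripe banana', 'sweet mango', 'fresh pear', 'juicy plum',
         'fast car', 'blue truck', 'old bus', 'new train', 'small bike', 'big plane']
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=15, random_state=0)),
])
params = {'clf__alpha': [1e-2, 1e-3], 'clf__loss': ['modified_huber', 'hinge']}

gs_clf = GridSearchCV(pipeline, params, cv=3)
gs_clf.fit(texts, labels)  # runs the search and refits the best combination

best_params = gs_clf.best_params_    # dict holding the winning value for each grid key
best_model = gs_clf.best_estimator_  # pipeline refit on all of the data
```

With the mapper in place, the same fit call would take the DataFrame and the target column instead of the list of strings.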