Revisions to Struggling to integrate sklearn and pandas in simple Kaggle task

added 4 characters in body

edited May 27, 2015 at 3:16

Here is an example of how to get pandas and sklearn to play nicely together.

Say you have two string columns that you want to vectorize, but you have no idea which vectorization parameters will give the best downstream performance.

Create the vectorizer:

to_vect = Pipeline([('vect', CountVectorizer(min_df=1, max_df=.9, ngram_range=(1, 2), max_features=1000)), ('tfidf', TfidfTransformer())])

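Create the DataFrameMapper object, giving each column its own copy of the vectorizer so the two columns do not share one fitted instance:

full_mapper = DataFrameMapper([('col_name1', deepcopy(to_vect)), ('col_name2', deepcopy(to_vect))])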

This is the full pipeline:

full_pipeline = Pipeline([('mapper', full_mapper), ('clf', SGDClassifier(n_iter=15, warm_start=True))])

Define the parameters you want the scan to consider:

full_params = {'clf__alpha': [1e-2, 1e-3, 1e-4], 'clf__loss': ['modified_huber', 'hinge'], 'clf__penalty': ['l2', 'l1'], 'mapper__features': [[('col_name1', deepcopy(to_vect)), ('col_name2', deepcopy(to_vect))], [('col_name1', deepcopy(to_vect).set_params(vect__analyzer='char_wb')), ('col_name2', deepcopy(to_vect))]]}

That's it! Note, however, that mapper__features is a single item in this dictionary, so use a for loop or itertools.product to generate a FLAT list of all the to_vect options you wish to consider. That is a separate task outside the scope of the question.
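As a rough illustration of that flat list, here is a minimal sketch using itertools.product. It is plain Python and does not touch sklearn; the option values in vect_options and the helper names expand and feature_candidates are assumptions made up for this example, not part of the original answer.

```python
from itertools import product

# Hypothetical search space for the per-column vectorizer pipeline.
# Keys follow the step__param naming that to_vect would accept.
vect_options = {
    "vect__analyzer": ["word", "char_wb"],
    "vect__ngram_range": [(1, 1), (1, 2)],
}
columns = ["col_name1", "col_name2"]


def expand(options):
    """Return every combination of the option values as a list of dicts."""
    keys = sorted(options)
    value_lists = [options[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


def feature_candidates(columns, options):
    """Build a flat list; each item is one full candidate for the mapper."""
    per_column = expand(options)
    # Pick one settings dict per column, over all combinations of columns.
    return [
        list(zip(columns, combo))
        for combo in product(per_column, repeat=len(columns))
    ]


candidates = feature_candidates(columns, vect_options)
print(len(candidates))  # 4 settings per column, 2 columns: 16 candidates
```

Each (column, settings) pair could then presumably be turned into a mapper entry with deepcopy(to_vect).set_params(**settings), and the resulting list of lists used as the value for 'mapper__features'. Note that the number of candidates grows quickly with the number of columns and options.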

Go on to create the optimal classifier, or whatever else your pipeline ends with:

gs_clf = GridSearchCV(full_pipeline, full_params, n_jobs=-1)

added 21 characters in body

edited May 20, 2015 at 22:09

meyerson

Here is an example of how to get pandas and sklearn to play nice

say you have 2 columns that are both strings and you wish to vectorize - but you have no idea which vectorization params will result in the best downstream performance.

create the vectorizer

to_vect = Pipeline([('vect',CountVectorizer(min_df =1,max_df=.9,ngram_range=(1,2),max_features=1000)), ('tfidf', TfidfTransformer())])

create the DataFrameMapper obj.

full_mapper = DataFrameMapper([ ('col_name1', to_vect), ('col_name2',to_vect) ])

this is the full pipeline

full_pipeline = Pipeline([('mapper',full_mapper),('clf', SGDClassifier(n_iter=15, warm_start=True))])

define the params you want to scan consider

full_params = {'clf__alpha': [1e-2,1e-3,1e-4], 'clf__loss':['modified_huber','hinge'], 'clf__penalty':['l2','l1'], 'mapper__features':[[('col_name1',deepcopy(to_vect)), ('col_name2',deepcopy(to_vect))], [('col_name1',deepcopy(to_vect).set_params(vect__analyzer= 'char_wb')), ('col_name2',deepcopy(to_vect))]]}

Thats it! - note however that mapper_features are a single item in this dictionary - so use a for loop or itertools.product to generate a FLAT list of all to_vect options you wish to consider - but that is a separate task outside the scope of the question.

Go on to create the optimal classifier or whatever else your pipeline ends with

gs_clf = GridSearchCV(full_pipe, full_params, n_jobs=-1)

answered May 19, 2015 at 15:46

meyerson

Here is an example of how to get pandas and sklearn to play nice

say you have 2 columns that are both strings and you wish to vectorize - but you have no idea which vectorization params will result in the best downstream performance.

create the vectorizer

to_vect = Pipeline([('vect',CountVectorizer(min_df =1,max_df=.9,ngram_range=(1,2),max_features=1000)), ('tfidf', TfidfTransformer())])

create the DataFrameMapper obj.

full_mapper = DataFrameMapper([ ('col_name1', to_vect), ('col_name2',to_vect), ('col_name3',None) ])

this is the full pipeline

full_pipeline = Pipeline([('mapper',full_mapper),('clf', SGDClassifier(n_iter=15, warm_start=True))])

define the params you want to scan consider

full_params = {'clf__alpha': [1e-2,1e-3,1e-4], 'clf__loss':['modified_huber','hinge'], 'clf__penalty':['l2','l1'], 'mapper__features':[[('cell',deepcopy(to_vect)), ('fname_str',deepcopy(to_vect))], [('cell',deepcopy(to_vect).set_params(vect__analyzer= 'char_wb')), ('fname_str',deepcopy(to_vect))]]}

Thats it! - note however that mapper_features are a single item in this dictionary - so use a for loop or itertools.product to generate a FLAT list of all to_vect options you wish to consider - but that is separate sklearn,pandas decoupled task.

Go on to create the optimal classifier or whatever else your pipeline ends with

gs_clf = GridSearchCV(full_pipe, full_params, n_jobs=-1)
