I have a dataset with a mixture of text and numbers, i.e. certain columns contain only text and the rest contain integers (or floating-point numbers).

I was wondering whether it is possible to build a pipeline where I can, for example, call LabelEncoder() on the text features and MinMaxScaler() on the numeric columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset, not on selected columns. Is this possible? If so, any pointers would be greatly appreciated.

3 Answers


The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.

Important notes:

  • You have to define your functions with def, since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model

  • You need to initialize FunctionTransformer with validate=False

Something like this:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
)

3 Comments

Any idea why I get 'TypeError: All estimators should implement fit and transform.' if I run your code? scikit-learn 0.19.1
Nevermind, the signature has been changed apparently - I've submitted an edit
How could we handle it if there is one more feature that doesn't need any scaling, along with the above?
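On the last comment's question (a feature that needs no transformation): one option in this FeatureUnion style is a third branch whose FunctionTransformer only selects the column and applies no further step, so it is passed through unchanged. A minimal sketch with hypothetical column names (`height`, `age`, `year`):

```python
import pandas as pd
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

def get_num_cols(df):
    return df[['height', 'age']]

def get_passthrough_cols(df):
    # column that should be left untouched
    return df[['year']]

vec = make_union(
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
    # no scaler after this transformer, so 'year' passes through as-is
    FunctionTransformer(get_passthrough_cols, validate=False),
)

df = pd.DataFrame({'height': [1.0, 2.0], 'age': [10, 20], 'year': [1999, 2001]})
out = vec.fit_transform(df)  # columns: scaled height, scaled age, raw year
```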

Since v0.20, you can use ColumnTransformer to accomplish this.
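For instance, a minimal sketch with hypothetical column names and data (note that inside a ColumnTransformer you would typically use OrdinalEncoder rather than LabelEncoder, which is designed for target labels, not feature columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# hypothetical mixed-type frame
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'fruit': ['apple', 'pear', 'apple'],
    'height': [1.0, 2.0, 3.0],
    'age': [10, 20, 30],
})

# apply one transformer per group of columns
ct = ColumnTransformer([
    ('text', OrdinalEncoder(), ['name', 'fruit']),
    ('num', MinMaxScaler(), ['height', 'age']),
])

out = ct.fit_transform(df)  # shape (3, 4): two encoded, two scaled columns
```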

1 Comment

Could you please provide an example?

An example of ColumnTransformer might help you:

# FOREGOING TRANSFORMATIONS ON 'data' ...

# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)
])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)

# model training
model = pipeline.fit(train_data, train_data['label'])

You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py

