I have a dataset with a mixture of text and numbers, i.e. certain columns contain only text and the rest contain integers (or floating-point numbers).

I was wondering whether it is possible to build a pipeline where I can, for example, call LabelEncoder() on the text features and MinMaxScaler() on the numeric columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset, not on selected columns. Is this possible? If so, any pointers would be greatly appreciated.

3 Answers


The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.

Important notes:

  • You have to define your functions with def, since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model

  • You need to initialize FunctionTransformer with validate=False

Something like this:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, MinMaxScaler

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
)

3 Comments

Any idea why I get 'TypeError: All estimators should implement fit and transform.' if I run your code? scikit-learn 0.19.1
Nevermind, the signature has been changed apparently - I've submitted an edit
How could we handle it if there is one more feature that doesn't need any scaling, along with the above?
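On the last comment's question (a feature that needs no transformation): one option in this FeatureUnion style is a third branch whose FunctionTransformer only selects the column and applies no further step, so it is passed through unchanged. A minimal sketch with hypothetical column names (`height`, `age`, `year`):

```python
import pandas as pd
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

def get_num_cols(df):
    return df[['height', 'age']]

def get_passthrough_cols(df):
    # column that should be left untouched
    return df[['year']]

vec = make_union(
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
    # no scaler after this transformer, so 'year' passes through as-is
    FunctionTransformer(get_passthrough_cols, validate=False),
)

df = pd.DataFrame({'height': [1.0, 2.0], 'age': [10, 20], 'year': [1999, 2001]})
out = vec.fit_transform(df)  # columns: scaled height, scaled age, raw year
```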

Since v0.20, you can use ColumnTransformer to accomplish this.
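For instance, a minimal sketch with hypothetical column names and data (note that inside a ColumnTransformer you would typically use OrdinalEncoder rather than LabelEncoder, which is designed for target labels, not feature columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# hypothetical mixed-type frame
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'fruit': ['apple', 'pear', 'apple'],
    'height': [1.0, 2.0, 3.0],
    'age': [10, 20, 30],
})

# apply one transformer per group of columns
ct = ColumnTransformer([
    ('text', OrdinalEncoder(), ['name', 'fruit']),
    ('num', MinMaxScaler(), ['height', 'age']),
])

out = ct.fit_transform(df)  # shape (3, 4): two encoded, two scaled columns
```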

1 Comment

Could you please provide an example?

An example of ColumnTransformer might help you:

# FOREGOING TRANSFORMATIONS ON 'data' ...

# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)
])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)

# model training
model = pipeline.fit(train_data, train_data['label'])

You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py

