Problem with column transformations inside pipeline

Question

I'm trying to build a pipeline containing several user-defined column transformations. Each new column transformer inherits from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and implements the fit and transform methods. Calling the transformers directly works as expected, but using them as part of a sklearn.pipeline.Pipeline instance fails with confusing errors.

Let's say I have a pandas.DataFrame instance df containing the following data:

           date   genre
    0   9/22/11  horror
    1   1/16/04    NULL
    2  10/11/96    NULL
    3   3/28/13   drama
    4   4/22/94   drama

I want to implement two transformers:

DateTransformer, which converts date strings in df['date'] into a numpy.array instance containing year, month, and day for every row.
GenreTransformer, which for every genre in df['genre'], returns 1 if it is not specified ('NULL'), and -1 otherwise.

Here is my code:

    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin


    class GenreTransformer(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
            x_copy = x.copy()
            # Mark specified genres with -1 and missing ('NULL') ones with 1.
            x_copy[x_copy != 'NULL'] = -1
            x_copy[x_copy == 'NULL'] = 1
            return x_copy.values


    class DateTransformer(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
            # Parse each date string, then split it into year, month, and day columns.
            x_timestamp = x.apply(pd.to_datetime)
            return np.column_stack((
                x_timestamp.apply(lambda t: t.year).values,
                x_timestamp.apply(lambda t: t.month).values,
                x_timestamp.apply(lambda t: t.day).values,
            ))

Both transformers work correctly:

    >>> GenreTransformer().fit_transform(df['genre'])
    array([-1, 1, 1, -1, -1])
    >>> DateTransformer().fit_transform(df['date'])
    array([[2011,    9,   22],
           [2004,    1,   16],
           [1996,   10,   11],
           [2013,    3,   28],
           [1994,    4,   22]])

However, when I merge the transformers using sklearn.compose.ColumnTransformer, and create a pipeline, DateTransformer doesn't work:

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline

    column_transformer = ColumnTransformer(
        transformers=[
            ('date_trans', DateTransformer(), ['date']),
            ('genre_trans', GenreTransformer(), ['genre']),
        ],
        remainder='drop',
    )
    pipe = Pipeline(
        steps=[
            ('union', column_transformer),
            # estimators
        ],
    )

    >>> pipe.fit(df)
    ---------------------------------------------------------------------------
    Traceback (most recent call last)
    ...
    AttributeError: ("'Series' object has no attribute 'year'", 'occurred at index date')

Interestingly, using pandas.Series.apply instead of mask methods inside GenreTransformer.transform and fitting the pipe also fails:

    class GenreTransformer(BaseEstimator, TransformerMixin):
        # ...
        def transform(self, x):
            return x.apply(lambda g: -1 if g != 'NULL' else 1)

    >>> pipe.fit(df)
    ---------------------------------------------------------------------------
    Traceback (most recent call last)
    ...
    ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index genre')

So I guess something goes wrong when pandas.Series.apply is used inside pipelines. Could this be a bug in the scikit-learn source code, or am I doing something incorrectly? If so, could you please point out how to implement column transformers so that I can include them in pipelines?

gmds · Accepted Answer · 2019-05-19 01:03:49Z

4

There is a subtle mistake in your code.

You specified ['date'] as the columns to apply DateTransformer to. Passing a list of column names tells ColumnTransformer to hand the transformer a 2D array-like, which in this case is a DataFrame. DateTransformer.transform, however, is written for a 1D array-like, i.e. a Series.

Therefore, what you did was equivalent to DateTransformer().fit_transform(df[['date']]), when you actually wanted df['date'].
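To see why the list form breaks DateTransformer, it helps to compare what pandas' apply hands to the callback in each case. The sketch below uses a two-row stand-in for df (an assumption made for brevity) and only records the type each callback receives.

```python
import pandas as pd

# Two-row stand-in for the question's df (illustrative data only).
df = pd.DataFrame({"date": ["9/22/11", "1/16/04"], "genre": ["horror", "NULL"]})

series = df["date"]    # 1D: what a scalar column spec selects
frame = df[["date"]]   # 2D: what a list column spec selects

# Series.apply calls the function once per element (a str here).
element_types = series.apply(lambda value: type(value).__name__).tolist()

# DataFrame.apply calls the function once per column by default,
# so the callback receives a whole Series.
column_types = frame.apply(lambda column: type(column).__name__).tolist()

print(element_types)
print(column_types)
```

This matches the tracebacks in the question: inside the pipeline, `t` in `lambda t: t.year` is the entire date column rather than a single timestamp, and `g != 'NULL'` compares a whole Series, whose truth value is ambiguous.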

Accordingly, pass ('date_trans', DateTransformer(), 'date') to ColumnTransformer instead and everything should be fine.
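A minimal end-to-end sketch of that fix, using the data from the question. Two things here are my own assumptions rather than the asker's code: DateTransformer parses with an explicit m/d/yy format, and GenreTransformer is simplified with np.where. The mixin is also listed before BaseEstimator, which is the order scikit-learn itself uses.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class DateTransformer(TransformerMixin, BaseEstimator):
    """Expects a 1D Series of month/day/two-digit-year strings."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Assumption: every date follows the m/d/yy layout shown in the question.
        stamps = pd.to_datetime(x, format="%m/%d/%y")
        return np.column_stack((
            stamps.dt.year.to_numpy(),
            stamps.dt.month.to_numpy(),
            stamps.dt.day.to_numpy(),
        ))


class GenreTransformer(TransformerMixin, BaseEstimator):
    """Accepts 1D or 2D input; the output keeps the input's shape."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Simplified variant: 1 where the genre is 'NULL', -1 otherwise.
        return np.where(np.asarray(x) == "NULL", 1, -1)


df = pd.DataFrame({
    "date": ["9/22/11", "1/16/04", "10/11/96", "3/28/13", "4/22/94"],
    "genre": ["horror", "NULL", "NULL", "drama", "drama"],
})

column_transformer = ColumnTransformer(
    transformers=[
        # Scalar 'date': DateTransformer receives a 1D Series.
        ("date_trans", DateTransformer(), "date"),
        # List ['genre']: GenreTransformer receives a 2D DataFrame.
        ("genre_trans", GenreTransformer(), ["genre"]),
    ],
    remainder="drop",
)

features = column_transformer.fit_transform(df)
print(features)
```

The genre entry keeps its list form on purpose: with a 2D input, this GenreTransformer returns a 2D array, which is what ColumnTransformer expects from each transformer.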

6 Comments

Sanjar Adilov Over a year ago

Thanks, @gmds. Should have paid closer attention to the conventions of sklearn transformers. According to the documentation, TransformerMixin.fit_transform method both accepts and returns 2D array-like. So, it is more convenient to pass list of columns into column transformers. Nonetheless, when I build a new transformer accepting data of shape (x.shape[0], 1) and return transformed 2D numpy array of the same shape, I again get an error.

gmds Over a year ago

@SanjarAdylov Yup, but ColumnTransformer wraps it, so you need not necessarily follow the TransformerMixin convention. Anyway, I'm not really sure what you mean; have you tried my suggested solution? Does it work?

Sanjar Adilov Over a year ago

Yes, @gmds, your solution works well. But actually, I have some other transformers that are expected to return 1D array-like: transform accepts pandas.Series instance x and returns x.apply(my_func). After pipeline fitting, ValueError is raised: The output of the transformer should be 2D (scipy matrix, array, or pandas DataFrame). I then reshape the result so that the transformed x is of the shape (x.shape[0], 1) but it still raises an error.

gmds Over a year ago

@SanjarAdylov That is a problem. However, to help you, we would need an actual example (since it seems the problem you actually brought up in your question has been solved). Accordingly, I suggest either editing your question or asking a new question.
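For the 2D-output error mentioned in the comments above, one possible shape for such a transformer is sketched below. The class name GenreFlagTransformer is hypothetical, and because no failing code was posted, this cannot say what went wrong for the commenter; it only shows a 1D apply result being turned into a single 2D column before it is returned.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class GenreFlagTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical example: element-wise flag on a 1D Series, returned as one 2D column."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # x is a Series because the column is selected with a scalar name below.
        flags = x.apply(lambda g: -1 if g != "NULL" else 1)  # 1D result
        # ColumnTransformer requires 2D output, so add a column axis.
        return flags.to_numpy().reshape(-1, 1)


df = pd.DataFrame({"genre": ["horror", "NULL", "NULL", "drama", "drama"]})

column_transformer = ColumnTransformer(
    transformers=[("genre_flag", GenreFlagTransformer(), "genre")],
)
flags_2d = column_transformer.fit_transform(df)
print(flags_2d.shape)
```

Returning `flags.to_frame()` instead of the reshaped array should also satisfy the check, since a pandas DataFrame counts as 2D output.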

Jyoti Prasad Pal Over a year ago

@SanjarAdylov Could you please share the solution? You said your code was buggy, so how did you fix it? Please post your corrected code; I have a similar requirement.
