I'm trying to build a pipeline containing several user-defined column transformations. When creating a new column transformer, I inherit from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and implement the fit and transform methods. Calling the transformers directly works as expected, but using them as part of a sklearn.pipeline.Pipeline instance fails with confusing errors.

Let's say I have a pandas.DataFrame instance df containing the following data:

           date   genre
    0   9/22/11  horror
    1   1/16/04    NULL
    2  10/11/96    NULL
    3   3/28/13   drama
    4   4/22/94   drama

I want to implement two transformers:

  1. DateTransformer, which converts date strings in df['date'] into a numpy.array instance containing year, month, and day for every row.

  2. GenreTransformer, which for every genre in df['genre'], returns 1 if it is not specified ('NULL'), and -1 otherwise.

Here is my code:

    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin


    class GenreTransformer(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
            # Map every specified genre to -1 and every 'NULL' to 1.
            x_copy = x.copy()
            x_copy[x_copy != 'NULL'] = -1
            x_copy[x_copy == 'NULL'] = 1
            return x_copy.values


    class DateTransformer(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
            # Parse each date string, then stack year, month, and day as columns.
            x_timestamp = x.apply(pd.to_datetime)
            return np.column_stack((
                x_timestamp.apply(lambda t: t.year).values,
                x_timestamp.apply(lambda t: t.month).values,
                x_timestamp.apply(lambda t: t.day).values,
            ))

Both transformers work correctly:

    >>> GenreTransformer().fit_transform(df['genre'])
    array([-1, 1, 1, -1, -1])
    >>> DateTransformer().fit_transform(df['date'])
    array([[2011, 9, 22],
           [2004, 1, 16],
           [1996, 10, 11],
           [2013, 3, 28],
           [1994, 4, 22]])

However, when I merge the transformers using sklearn.compose.ColumnTransformer, and create a pipeline, DateTransformer doesn't work:

    column_transformer = ColumnTransformer(
        transformers=[
            ('date_trans', DateTransformer(), ['date']),
            ('genre_trans', GenreTransformer(), ['genre']),
        ],
        remainder='drop',
    )

    pipe = Pipeline(
        steps=[
            ('union', column_transformer),
            # estimators
        ],
    )

    >>> pipe.fit(df)
    ---------------------------------------------------------------------------
    Traceback (most recent call last)
    ...
    AttributeError: ("'Series' object has no attribute 'year'", 'occurred at index date')

Interestingly, using pandas.Series.apply instead of boolean masks inside GenreTransformer.transform and then fitting the pipe also fails:

    class GenreTransformer(BaseEstimator, TransformerMixin):
        # ...
        def transform(self, x):
            return x.apply(lambda g: -1 if g != 'NULL' else 1)

    >>> pipe.fit(df)
    ---------------------------------------------------------------------------
    Traceback (most recent call last)
    ...
    ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index genre')

So, I guess there is something wrong with using the pandas.Series.apply method inside pipelines. Is there possibly a bug in the scikit-learn source code, or am I doing something incorrectly? If so, could you please point out how to implement column transformers so that I can include them in pipelines?

1 Answer

There is a subtle mistake in your code.

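You specified ['date'] as the columns to apply DateTransformer to. Passing a list of column names tells ColumnTransformer that the transformer expects a 2D array-like, so in this case it receives a DataFrame. DateTransformer, however, is written for a 1D array-like, i.e. a Series. On a DataFrame, apply passes each whole column to the function, so the lambda receives a Series instead of a single timestamp, hence "'Series' object has no attribute 'year'".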

Therefore, what you did was equivalent to DateTransformer().fit_transform(df[['date']]), when you actually wanted df['date'].
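To see the difference outside of any pipeline, here is a minimal sketch using a two-row frame built from the question's data. The names as_series, as_frame, seen_types, and describe are only illustrative and do not come from the original code.

```python
import pandas as pd

df = pd.DataFrame({"date": ["9/22/11", "1/16/04"]})

as_series = df["date"]    # 1D: what DateTransformer was written for
as_frame = df[["date"]]   # 2D: what ColumnTransformer passes for ['date']

# Series.apply calls the function once per element (a Timestamp here).
series_years = as_series.apply(pd.to_datetime).apply(lambda t: t.year)

# DataFrame.apply calls the function once per column, so `t` is a Series.
seen_types = []

def describe(t):
    seen_types.append(type(t).__name__)
    return 0

as_frame.apply(pd.to_datetime).apply(describe)
print(seen_types)
```

Every entry recorded in seen_types is "Series", which is why accessing .year on it fails inside the pipeline.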

Accordingly, pass ('date_trans', DateTransformer(), 'date') to ColumnTransformer instead and everything should be fine.
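Put together, the fix could look like the sketch below. DateTransformer is taken from the question unchanged. GenreTransformer is rewritten here with numpy.where purely for brevity, which is my simplification and not the asker's code; it still receives a DataFrame because ['genre'] remains a list.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


class DateTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # x is a Series because the column is selected with the scalar 'date'.
        x_timestamp = x.apply(pd.to_datetime)
        return np.column_stack((
            x_timestamp.apply(lambda t: t.year).values,
            x_timestamp.apply(lambda t: t.month).values,
            x_timestamp.apply(lambda t: t.day).values,
        ))


class GenreTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # x is a one-column DataFrame, so the result has shape (n_rows, 1).
        return np.where(x == 'NULL', 1, -1)


df = pd.DataFrame({
    'date': ['9/22/11', '1/16/04', '10/11/96', '3/28/13', '4/22/94'],
    'genre': ['horror', 'NULL', 'NULL', 'drama', 'drama'],
})

column_transformer = ColumnTransformer(
    transformers=[
        ('date_trans', DateTransformer(), 'date'),       # scalar -> Series
        ('genre_trans', GenreTransformer(), ['genre']),  # list -> DataFrame
    ],
    remainder='drop',
)
pipe = Pipeline(steps=[('union', column_transformer)])

result = pipe.fit_transform(df)
print(result)
```

The result has one row per input row, with the year, month, and day columns followed by the genre flag.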

6 Comments

Thanks, @gmds. Should have paid closer attention to the conventions of sklearn transformers. According to the documentation, the TransformerMixin.fit_transform method both accepts and returns a 2D array-like. So, it is more convenient to pass a list of columns into column transformers. Nonetheless, when I build a new transformer that accepts data of shape (x.shape[0], 1) and returns a transformed 2D numpy array of the same shape, I again get an error.
@SanjarAdylov Yup, but ColumnTransformer wraps it, so you need not necessarily follow the TransformerMixin convention. Anyway, I'm not really sure what you mean; have you tried my suggested solution? Does it work?
Yes, @gmds, your solution works well. But actually, I have some other transformers that are expected to return 1D array-like: transform accepts pandas.Series instance x and returns x.apply(my_func). After pipeline fitting, ValueError is raised: The output of the transformer should be 2D (scipy matrix, array, or pandas DataFrame). I then reshape the result so that the transformed x is of the shape (x.shape[0], 1) but it still raises an error.
@SanjarAdylov That is a problem. However, to help you, we would need an actual example (since it seems the problem you actually brought up in your question has been solved). Accordingly, I suggest either editing your question or asking a new question.
@SanjarAdylov Could you please share the solution? You said your code was buggy, so how did you fix it? Please help here with your corrected code; I have a similar requirement.
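For the 1D case discussed in these comments, one possible pattern is to select the column with a scalar name (so transform receives a Series) and reshape the result to a single column before returning it. This is only a sketch under that assumption: GenreFlagTransformer is a hypothetical name, and since the commenter's failing code was never posted, this may not match what they tried.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class GenreFlagTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: takes a Series, returns an (n_rows, 1) array."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # Series.apply works element by element, giving a 1D result.
        flags = x.apply(lambda g: 1 if g == 'NULL' else -1)
        # ColumnTransformer requires 2D output, so add a column axis.
        return flags.to_numpy().reshape(-1, 1)


df = pd.DataFrame({'genre': ['horror', 'NULL', 'NULL', 'drama', 'drama']})

column_transformer = ColumnTransformer(
    transformers=[('genre_trans', GenreFlagTransformer(), 'genre')],
    remainder='drop',
)
flags_2d = column_transformer.fit_transform(df)
print(flags_2d.shape)
```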