I'm trying to build a pipeline containing several user-defined column transformations. When creating a new column transformer, I inherit from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin and implement the fit and transform methods. Calling the transformers directly works as expected, but using them as part of a sklearn.pipeline.Pipeline instance fails with confusing errors.
Let's say I have a pandas.DataFrame instance df containing the following data:

           date   genre
    0   9/22/11  horror
    1   1/16/04    NULL
    2  10/11/96    NULL
    3   3/28/13   drama
    4   4/22/94   drama

I want to implement two transformers:

- DateTransformer, which converts date strings in df['date'] into a numpy.array instance containing year, month, and day for every row.
- GenreTransformer, which for every genre in df['genre'], returns 1 if it is not specified ('NULL'), and -1 otherwise.

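For reproducibility, here is a minimal sketch of how the df shown above could be built. It assumes that 'NULL' is stored as a literal string rather than as a missing value, which is how the transformers below treat it.

```python
import pandas as pd

# Sample data from the table above.
# Assumption: 'NULL' is a plain string, not NaN/None.
df = pd.DataFrame({
    'date': ['9/22/11', '1/16/04', '10/11/96', '3/28/13', '4/22/94'],
    'genre': ['horror', 'NULL', 'NULL', 'drama', 'drama'],
})

print(df)
```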
Here is my code:

    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin


    class GenreTransformer(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
            x_copy = x.copy()
            x_copy[x_copy != 'NULL'] = -1
            x_copy[x_copy == 'NULL'] = 1
            return x_copy.values


    class DateTransformer(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
            x_timestamp = x.apply(pd.to_datetime)
            return np.column_stack((
                x_timestamp.apply(lambda t: t.year).values,
                x_timestamp.apply(lambda t: t.month).values,
                x_timestamp.apply(lambda t: t.day).values,
            ))

Both transformers work correctly:

    >>> GenreTransformer().fit_transform(df['genre'])
    array([-1, 1, 1, -1, -1])
    >>> DateTransformer().fit_transform(df['date'])
    array([[2011, 9, 22],
           [2004, 1, 16],
           [1996, 10, 11],
           [2013, 3, 28],
           [1994, 4, 22]])

However, when I merge the transformers using sklearn.compose.ColumnTransformer, and create a pipeline, DateTransformer doesn't work:

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline

    column_transformer = ColumnTransformer(
        transformers=[
            ('date_trans', DateTransformer(), ['date']),
            ('genre_trans', GenreTransformer(), ['genre']),
        ],
        remainder='drop',
    )

    pipe = Pipeline(
        steps=[
            ('union', column_transformer),
            # estimators
        ],
    )

    >>> pipe.fit(df)
    ---------------------------------------------------------------------------
    Traceback (most recent call last)
    ...
    AttributeError: ("'Series' object has no attribute 'year'", 'occurred at index date')

Interestingly, using pandas.Series.apply instead of boolean masks inside GenreTransformer.transform and then fitting the pipe also fails:

    class GenreTransformer(BaseEstimator, TransformerMixin):
        # ...
        def transform(self, x):
            return x.apply(lambda g: -1 if g != 'NULL' else 1)

    >>> pipe.fit(df)
    ---------------------------------------------------------------------------
    Traceback (most recent call last)
    ...
    ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index genre')

So, I guess there is something wrong with using the pandas.Series.apply method inside pipelines. Could this be a bug in the scikit-learn source code, or am I doing something incorrectly? If it is the latter, could you please point out how to implement column transformers so that I can include them in pipelines?
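For what it's worth, both errors can be reproduced with pandas alone, without any pipeline. The sketch below rests on an assumption I have not verified against the scikit-learn source: that the list selectors ['date'] and ['genre'] hand each transformer a one-column DataFrame rather than a Series. Under that assumption, apply calls the lambda once per column (with a whole Series) instead of once per element. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2011-09-22', '2004-01-16', '1996-10-11']),
    'genre': ['horror', 'NULL', 'drama'],
})

series = df['genre']    # 1-D Series: what I pass when calling a transformer directly
frame = df[['genre']]   # one-column DataFrame: ASSUMED to be what the pipeline passes

# Series.apply calls the function once per element (a str here) ...
seen_by_series = set(series.apply(lambda g: type(g).__name__))
# ... while DataFrame.apply calls it once per column, passing a whole Series.
seen_by_frame = frame.apply(lambda col: type(col).__name__).tolist()
print(seen_by_series, seen_by_frame)

# Same lambda as in the apply-based GenreTransformer.transform, on a DataFrame.
try:
    frame.apply(lambda g: -1 if g != 'NULL' else 1)
    value_error = None
except ValueError as exc:
    value_error = str(exc)

# Same attribute access as in DateTransformer.transform, on a DataFrame.
try:
    df[['date']].apply(lambda t: t.year)
    attribute_error = None
except AttributeError as exc:
    attribute_error = str(exc)

print(value_error)
print(attribute_error)
```

If the assumption holds, the messages printed here should match the ones raised by pipe.fit(df), which would point at my transformers rather than at scikit-learn.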