DataFrame drop rows whose column has certain values

Question

I have found quite a few posts that explain how to drop rows based on the values in a single column. However, I have not been able to find one (though it may well exist) that addresses how to drop rows in a dataframe based on the values across multiple columns (34 in this case).

baddata

zip    age  item1  item2  item3  item4  item5  item6  item7  item34
12345  10   1      0      1      1      0      0      1      0
23456  20   10     111    11     1      0      1      9      8
45678  60   1      0      1      1      0      1      0      1

I want to retain only those rows in which all 34 item columns hold '1' or '0' (that is, drop every row where any of the 34 columns holds something other than '1' or '0'). This is what I have tried so far:

baddata = pd.DataFrame(data=dirtydata, columns=['zip','age','item1','item2'...'item34'])

gooddata=baddata.dropna() # some rows have NaN; drops rows with NaN values

option-1:

gooddata[gooddata[['item1','item2'...'item34']].isin([0,1])] #this makes values for zip and age NaN; not sure why?

option-2:

gooddata[(gooddata[['item1','item2'...'item34']].map(len) < 2).any(axis=1)] #also tried replacing 'any' with 'all'; did not work

option-3:

cols_of_interest=['item1','item2'...'item34']
gooddata[gooddata.drop(gooddata[cols_of_interest].map(len) < 2)] #doubtful about the syntax and usage of functions

Let me be clear: you want to drop all rows where the value in item34 is not 0 or 1? Is this what you want? That's it?
– Joe T. Boka, Commented Jun 5, 2016 at 2:22
Joe R - I want to retain only those rows that have values of '0' or '1' for the various items, i.e. remove all rows that have values other than '0' or '1' in cols item1, item2, item3, item4, ... item34.
– ads, Commented Jun 5, 2016 at 2:48
Expected Result:
zip    age  item1  item2  item3  item4  item5  item6  item7  item34
12345  10   1      0      1      1      0      0      1      0
45678  60   1      0      1      1      0      1      0      1
– ads, Commented Jun 5, 2016 at 2:54
@Merlin how do I get the expected result, with good data as shown in row1 and row3? row2 is an example of a row whose items hold values other than 1 or 0 and which must therefore be dropped from the dataframe. Hope I am not making it too confusing.
– ads, Commented Jun 5, 2016 at 3:00

John Karasinski · Accepted Answer · 2016-06-05 03:40:02Z

Start by selecting all the columns after age

df[df.columns[2:]]

   item1  item2  item3  item4  item5  item6  item7  item34
0      1      0      1      1      0      0      1       0
1     10    111     11      1      0      1      9       8
2      1      0      1      1      0      1      0       1

check if their values are 0 or 1

df[df.columns[2:]].isin((0, 1))

   item1  item2  item3  item4  item5  item6  item7  item34
0   True   True   True   True   True   True   True    True
1  False  False  False   True   True   True  False   False
2   True   True   True   True   True   True   True    True

check if all values in the row are True

df[df.columns[2:]].isin((0, 1)).all(axis=1)

0     True
1    False
2     True
dtype: bool

select only these rows

df[df[df.columns[2:]].isin((0, 1)).all(axis=1)]

     zip  age  item1  item2  item3  item4  item5  item6  item7  item34
0  12345   10      1      0      1      1      0      0      1       0
2  45678   60      1      0      1      1      0      1      0       1
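
For anyone who wants to run the four steps end to end, here is a minimal sketch that rebuilds the sample frame from the question. Selecting the item columns by name (the `item_cols` list) instead of by position is my own assumption, not part of the answer; it only avoids relying on zip and age being the first two columns.

```python
import pandas as pd

# Sample data copied from the question (only the item columns shown there).
df = pd.DataFrame(
    {
        "zip": [12345, 23456, 45678],
        "age": [10, 20, 60],
        "item1": [1, 10, 1],
        "item2": [0, 111, 0],
        "item3": [1, 11, 1],
        "item4": [1, 1, 1],
        "item5": [0, 0, 0],
        "item6": [0, 1, 1],
        "item7": [1, 9, 0],
        "item34": [0, 8, 1],
    }
)

# Assumption: every column to validate has a name starting with "item".
item_cols = [c for c in df.columns if c.startswith("item")]

# True for rows where every item column holds 0 or 1.
row_valid = df[item_cols].isin([0, 1]).all(axis=1)

# Boolean indexing with a Series keeps whole rows, zip and age included.
good_rows = df[row_valid]
print(good_rows)
```

Because `row_valid` is a one-dimensional boolean Series, indexing with it selects complete rows and leaves the zip and age values untouched.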

EDIT

Breaking this out a bit more clearly, we have

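relevant_columns = df[df.columns[2:]]
# The original edit used relevant_columns.convert_objects(convert_numeric=True),
# which was removed in pandas 1.0; pd.to_numeric with errors="coerce" does the
# same job (values that cannot be parsed become NaN)
values_as_ints = relevant_columns.apply(pd.to_numeric, errors="coerce")
values_valid = values_as_ints.isin((0, 1))
row_valid = values_valid.all(axis=1)
good_rows = df[row_valid]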

they're of type float, integer, object. I should have clarified that.
You could instead try df[df[df.columns[2:]].astype(int).isin((0, 1)).all(axis=1)].
@karasinski Getting this error message: invalid literal for long() with base 10: ' '
It's difficult to solve your issue, as none of these problems are showing up in the example data you posted above. I edited my answer above, could you try that?
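
The real data was never posted, so the following is only a sketch of how mixed float, integer and object columns could be handled. The `raw` frame, including the blank string and the fourth row, is hypothetical test data made up to mimic the error reported in the comments; `pd.to_numeric` with `errors="coerce"` is used so that unparseable values become NaN instead of raising.

```python
import pandas as pd

# Hypothetical messy input: numbers stored as strings, a float, and a blank string.
raw = pd.DataFrame(
    {
        "zip": [12345, 23456, 45678, 56789],
        "age": [10, 20, 60, 30],
        "item1": ["1", "10", 1.0, " "],
        "item2": [0, "111", "0", 1],
    }
)

item_cols = ["item1", "item2"]

# Coerce each item column to numbers; anything unparseable (such as " ") becomes NaN.
numeric_items = raw[item_cols].apply(pd.to_numeric, errors="coerce")

# NaN is not in [0, 1], so rows with blanks are rejected along with out-of-range values.
row_valid = numeric_items.isin([0, 1]).all(axis=1)

clean = raw[row_valid]
print(clean)
```

Note that the filter is computed on the coerced copy while the rows are taken from `raw`, so the surviving rows keep their original, unconverted values.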

Merlin · Accepted Answer · 2016-06-05 04:29:03Z

Try this:

print(df)

zip    age  item1  item2  item3  item4  item5  item6  item7  item34
12345  10   1      0      1      1      0      0      1      0
23456  20   10     111    11     1      0      1      9      8
45678  60   1      0      1      1      0      1      0      1

dfSlice = df[df.columns[2:]]

def mapZeroOne(x):
    # Return the value if it is 0 or 1; otherwise return None, which shows up as NaN
    if x == 0 or x == 1:
        return x

# Note: DataFrame.applymap was renamed to DataFrame.map in pandas 2.1
dfNa = dfSlice.applymap(mapZeroOne)
print(dfNa)

       item1  item2  item3  item4  item5  item6  item7  item34
12345  1.0    0.0    1.0    1      0      0      1.0    0.0
23456  NaN    NaN    NaN    1      0      1      NaN    NaN
45678  1.0    0.0    1.0    1      0      1      0.0    1.0

dfAge = df[['zip', "age"]]
print(dfAge)

zip    age
12345  10
23456  20
45678  60

df_new = pd.concat([dfAge, dfNa], axis=1)
print(df_new)

zip    age  item1  item2  item3  item4  item5  item6  item7  item34
12345  10   1.0    0.0    1.0    1      0      0      1.0    0.0
23456  20   NaN    NaN    NaN    1      0      1      NaN    NaN
45678  60   1.0    0.0    1.0    1      0      1      0.0    1.0

print(df_new.dropna())

zip    age  item1  item2  item3  item4  item5  item6  item7  item34
12345  10   1.0    0.0    1.0    1      0      0      1.0    0.0
45678  60   1.0    0.0    1.0    1      0      1      0.0    1.0

You may need to change 0 to "0" and 1 to "1" if the values are stored as strings.
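
The same mask-then-dropna idea can be sketched without a per-cell helper. This version swaps in `DataFrame.where` for the `applymap` call, which is my substitution and not what the answer uses, and it works on a trimmed-down copy of the sample data. The last line also reproduces option-1 from the question, which may help explain why zip and age came back as NaN there.

```python
import pandas as pd

# Trimmed-down copy of the question's sample data.
df = pd.DataFrame(
    {
        "zip": [12345, 23456, 45678],
        "age": [10, 20, 60],
        "item1": [1, 10, 1],
        "item2": [0, 111, 0],
        "item4": [1, 1, 1],
    }
)

item_cols = ["item1", "item2", "item4"]
items = df[item_cols]

# Keep cells equal to 0 or 1 and turn every other cell into NaN.
masked_items = items.where(items.isin([0, 1]))

# Re-attach the untouched zip and age columns, then drop rows containing NaN.
df_new = pd.concat([df[["zip", "age"]], masked_items], axis=1).dropna()
print(df_new)

# Option-1 from the question: indexing the whole frame with a boolean DataFrame
# behaves like where(). zip and age are absent from the mask, so they become NaN.
blanked = df[items.isin([0, 1])]
print(blanked)
```

Indexing with a boolean DataFrame masks individual cells, whereas indexing with a boolean Series (one value per row) selects rows, which is the difference between option-1 and the row filter the question is after.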

Why would I only use specific 'item34' for the bitwise operation? Other items could also have invalid data values as represented in row1. Trying to understand the usage of bitwise operation and specific use of only item34
