
I have the following dataframe:

df = pd.DataFrame({'TX':['bob','tim','frank'],'IL':['fred','bob','tim'],'NE':['tim','joe','bob']}) 

I would like to isolate the strings that occur across all columns to generate a list. The expected result is:

output = ['tim','bob'] 

The only way I can think to achieve this is using for loops which I would like to avoid. Is there a built-in pandas function suited to accomplishing this?


3 Answers


You can create a mask of the value counts per column with DataFrame.apply and pd.value_counts, then test which rows have no missing values with DataFrame.all:

m = df.apply(pd.value_counts).notna()
print(m)
          TX     IL     NE
bob     True   True   True
frank   True  False  False
fred   False   True  False
joe    False  False   True
tim     True   True   True

L = m.index[m.all(axis=1)].tolist()
print(L)
['bob', 'tim']
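As a variant sketch (not part of the answer above), the same column intersection can also be computed by reducing numpy's intersect1d over the column arrays; note that intersect1d returns its result sorted:

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Reduce np.intersect1d over the column arrays; the result is sorted and unique.
common = reduce(np.intersect1d, (df[c].values for c in df.columns)).tolist()
print(common)  # ['bob', 'tim']
```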



You can achieve this by pandas.DataFrame.apply() and set.intersection(), like this:

cols_set = list(df.apply(lambda col: set(col.values)).values)
output = list(set.intersection(*cols_set))

The result is the following:

>>> print(output)
['tim', 'bob']

3 Comments

list(set.intersection(*[set(col) for col in df.values])). Summary of the above answer. Achieves the same result in less code.
@nishant, thank you for your comment. However, you are not right. The code could be used for the problem in the question only if it looked like this: list(set.intersection(*[set(col) for col in df.values.T])). The author of the question is interested in values common to every column, not row! Next time, please read the question carefully.
@Jaroslav, yes, you are correct. I missed the transpose of the dataframe (df.T). The actual code would be list(set.intersection(*[set(col) for col in df.T.values]))
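The corrected one-liner from the comments can be checked directly: df.values iterates rows, so the array must be transposed first to iterate column-wise.

```python
import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# df.values yields rows; transposing makes each element a column array.
output = list(set.intersection(*[set(col) for col in df.values.T]))
print(sorted(output))  # ['bob', 'tim']
```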

IIUC,

you can stack all your columns vertically and then use value_counts to count the occurrences of each item; we'll store that in a variable called s.

we then want all names whose number of occurrences equals the maximum, in this instance 3. Thanks to using stack, the column values are now in the index.

s = df.stack().value_counts()
# or, if you want to ignore duplicates column-wise:
# s = df.stack().groupby(level=1).unique().explode().value_counts()
print(s)
tim      3
bob      3
frank    1
fred     1
joe      1

s1 = s[s.eq(s.max())].index.tolist()
print(s1)
['tim', 'bob']

3 Comments

Please explain.
This might fail if the same value appears in one column more than once. For example, say bob appeared twice in the first column: df.stack().value_counts() would give 4 for 'bob', and thus s1 would only return ['bob'], which is wrong according to the question.
correct @nishant, but in the absence of any feedback from the OP it's hard to say what's wrong or right; the above could be corrected with df.stack().groupby(level=1).unique().explode().value_counts()
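The duplicate-safe variant from the comment can be demonstrated on a frame where 'bob' repeats within one column. Comparing against the number of columns (rather than s.max()) is a small extra safeguard added here as an assumption, so the filter still works even when no name appears in every column:

```python
import pandas as pd

# 'bob' appears twice in TX, so a plain df.stack().value_counts()
# would count it 4 times and break the max-based filter.
df = pd.DataFrame({'TX': ['bob', 'bob', 'tim'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Deduplicate per column before counting: group the stacked values by
# the column level of the index, take the unique names in each column,
# flatten, then count.
s = df.stack().groupby(level=1).unique().explode().value_counts()

# A name common to all columns now has a count equal to the column count.
common = s[s.eq(len(df.columns))].index.tolist()
print(sorted(common))  # ['bob', 'tim']
```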
