
I have the following dataframe:

df = pd.DataFrame({'TX':['bob','tim','frank'],'IL':['fred','bob','tim'],'NE':['tim','joe','bob']}) 

I would like to isolate the strings that occur across all columns to generate a list. The expected result is:

output = ['tim','bob'] 

The only way I can think to achieve this is using for loops which I would like to avoid. Is there a built-in pandas function suited to accomplishing this?


3 Answers


You can create a mask of the value counts per column with DataFrame.apply and pd.value_counts, then test which rows have no missing values with DataFrame.all:

m = df.apply(pd.value_counts).notna()
print(m)
          TX     IL     NE
bob     True   True   True
frank   True  False  False
fred   False   True  False
joe    False  False   True
tim     True   True   True

L = m.index[m.all(axis=1)].tolist()
print(L)
['bob', 'tim']
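As a variant sketch (not part of the answer above), the same column intersection can also be computed by reducing numpy's intersect1d over the column arrays; note that intersect1d returns its result sorted:

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Reduce np.intersect1d over the column arrays; the result is sorted and unique.
common = reduce(np.intersect1d, (df[c].values for c in df.columns)).tolist()
print(common)  # ['bob', 'tim']
```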



You can achieve this by pandas.DataFrame.apply() and set.intersection(), like this:

cols_set = list(df.apply(lambda col: set(col.values)).values)
output = list(set.intersection(*cols_set))

The result is the following:

>>> print(output)
['tim', 'bob']

3 Comments

list(set.intersection(*[set(col) for col in df.values])). Summary of the above answer. Achieves the same result in less code.
@nishant, thank you for your comment. However, you are not right. The code could be used for the problem in the question only if it looked like this: list(set.intersection(*[set(col) for col in df.values.T])). The author of the question is interested in values common to every column, not row! Next time, please read the question carefully.
@Jaroslav, yes, you are correct. I missed the transpose of the dataframe (df.T). The actual code would be list(set.intersection(*[set(col) for col in df.T.values]))
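The corrected one-liner from the comments can be checked directly: df.values iterates rows, so the array must be transposed first to iterate column-wise.

```python
import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# df.values yields rows; transposing makes each element a column array.
output = list(set.intersection(*[set(col) for col in df.values.T]))
print(sorted(output))  # ['bob', 'tim']
```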

IIUC,

you can stack all your columns vertically and then use value_counts to count the occurrences of each item; we'll store that in a variable called s.

we then want all names whose number of occurrences equals the maximum, in this instance 3. Thanks to using stack, the column values are now in the index.

s = df.stack().value_counts()
# or, if you want to ignore duplicates column-wise:
# s = df.stack().groupby(level=1).unique().explode().value_counts()
print(s)
tim      3
bob      3
frank    1
fred     1
joe      1

s1 = s[s.eq(s.max())].index.tolist()
print(s1)
['tim', 'bob']

3 Comments

Please explain.
This might fail if the same value appears in one column more than once. For example, say bob appeared twice in the first column: df.stack().value_counts() would give 4 for 'bob', and thus s1 would only return ['bob'], which is wrong according to the question.
correct @nishant, but in the absence of any feedback from the OP it's hard to say what's wrong or right; the above could be corrected with df.stack().groupby(level=1).unique().explode().value_counts()
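The duplicate-safe variant from the comment can be demonstrated on a frame where 'bob' repeats within one column. Comparing against the number of columns (rather than s.max()) is a small extra safeguard added here as an assumption, so the filter still works even when no name appears in every column:

```python
import pandas as pd

# 'bob' appears twice in TX, so a plain df.stack().value_counts()
# would count it 4 times and break the max-based filter.
df = pd.DataFrame({'TX': ['bob', 'bob', 'tim'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Deduplicate per column before counting: group the stacked values by
# the column level of the index, take the unique names in each column,
# flatten, then count.
s = df.stack().groupby(level=1).unique().explode().value_counts()

# A name common to all columns now has a count equal to the column count.
common = s[s.eq(len(df.columns))].index.tolist()
print(sorted(common))  # ['bob', 'tim']
```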
