Value matching between two DataFrames using pandas in Python

Question

Hi, I have two DataFrames like the ones below:

DF1

    Alpha  | Numeric | Special
    and    | 1       | @
    or     | 2       | #
    lol ok | 4       | &

DF2, with a single column:

    Content
    boy or girl
    school @ morn
    pyc LoL ok student
    Chandra

I want to check whether any value from any column of DF1 appears as a keyword in the Content column of DF2. The output should be a new DataFrame.

    df11 = (df1.unstack()
               .reset_index(level=2, drop=True)
               .rename_axis(('col_order', 'col_name'))
               .dropna()
               .reset_index(name='val_low'))

    df22 = (df2['Content'].str.split(expand=True)
                          .stack()
                          .rename('val')
                          .reset_index(level=1, drop=True)
                          .rename_axis('idx')
                          .reset_index())

    df22['val_low'] = df22['val'].str.lower()

    df = (pd.merge(df22, df11, on='val_low', how='left')
            .dropna(subset=['col_name'])
            .sort_values(['idx', 'col_order'])
            .drop_duplicates(['idx']))

    df = (pd.concat([df2, df.set_index('idx')], axis=1)
            .fillna({'col_name': 'Other'})[['val', 'col_name', 'Content']])

But it does not handle keywords that contain a space, such as "lol ok", because Content is split on whitespace before matching.

expected_output_DF

          val col_name             Content
    0      or    Alpha         boy or girl
    1       @  Special       school @ morn
    2  lol ok    Alpha  pyc LoL ok student
    3     NaN    Other             Chandra

Can someone help me with this?

See either of our answers; they both do essentially the same thing.
– cs95, Commented Nov 6, 2017 at 7:21

Bharath M Shetty · Accepted Answer · 2017-11-06 07:21:12Z

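Melt the first DataFrame and convert it to a dict that maps each value to its column name. Then find a keyword in each row of the second DataFrame with a regex built from the dict keys, and look up its column name in the dict, i.e.: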

    import numpy as np

    # df is the first DataFrame (DF1)
    vals = df.melt()
    di = dict(zip(vals['value'], vals['variable']))
    # {'or': 'Alpha', 1: 'Numeric', 2: 'Numeric', 'and': 'Alpha', 4: 'Numeric',
    #  '@': 'Special', '#': 'Special', '&': 'Special', 'lol ok': 'Alpha'}

    # Create a regex pattern based on the dict keys.
    pat = '|'.join(r"\s{}\s".format(x) for x in di.keys())

    # Find the keywords that match the pattern and keep the first match per row.
    df2['val'] = df2['Content'].str.lower().str.findall(pat).apply(
        lambda x: x[0].strip() if len(x) >= 1 else np.nan)

    # Map the values with di and fill NaN with 'other'.
    df2['new'] = df2['val'].map(di).fillna('other')

Output:

                  Content      new     val
    0         boy or girl    Alpha      or
    1       school @ morn  Special       @
    2  pyc LoL ok student    Alpha  lol ok
    3             Chandra    other     NaN
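One caveat with building the pattern straight from the dict keys: each key is inserted as raw regex, so a keyword containing a metacharacter such as `.` or `(` could change the pattern's meaning or fail to compile. A minimal sketch of a safer pattern builder follows. `build_pattern` is a hypothetical helper name and the keyword list is made up for illustration; `re.escape` comes from the standard library.

```python
import re


def build_pattern(keywords):
    """Join keywords into one alternation, escaping regex metacharacters.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # str() lets numeric keys such as 1 or 2 be escaped like text.
    escaped = (re.escape(str(k)) for k in keywords)
    # Same shape as the pattern in the answer: whitespace on both sides.
    return "|".join(r"\s{}\s".format(e) for e in escaped)


# Made-up keywords, including one with regex metacharacters.
pat = build_pattern(["or", "c++", "lol ok", 4])
matches = re.findall(pat, " i like c++ a lot ")
```

The resulting `pat` can be passed to `str.findall` exactly as in the answer above.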

I cannot call `df.melt()` on my DataFrame. I think the syntax should be `pd.melt(df)`; can you confirm and update the solution?
I am getting this error: `TypeError: object of type 'float' has no len()` on `df2['val'] = df2[content_col].str.lower().str.findall(pat).apply(lambda x : x[0].strip() if len(x)>=1 else np.nan)`
My pandas version is '0.19.2'. I tried `fillna()`, but it affects the result in other ways.
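The `TypeError` in the comment above is what `len()` raises on a float: when a row of `Content` is missing, `str.findall` returns NaN for that row instead of a list. Below is a minimal sketch of a guard, assuming the sample data from the question plus one hypothetical missing row. `first_match` is an illustrative helper name, and `pd.melt(df)` is used because the function form is also available in older pandas releases (as far as I know, the `DataFrame.melt` method only arrived in 0.20).

```python
import numpy as np
import pandas as pd

# Sample data from the question; the trailing NaN row is a hypothetical addition.
df = pd.DataFrame({
    "Alpha": ["and", "or", "lol ok"],
    "Numeric": [1, 2, 4],
    "Special": ["@", "#", "&"],
})
df2 = pd.DataFrame({
    "Content": ["boy or girl", "school @ morn", "pyc LoL ok student",
                "Chandra", np.nan]
})

# Function form of melt, which older pandas versions also provide.
vals = pd.melt(df)
di = dict(zip(vals["value"], vals["variable"]))
pat = "|".join(r"\s{}\s".format(x) for x in di.keys())


def first_match(found):
    """Return the first stripped match, or NaN when there is none."""
    # `found` is a list for text rows and a missing value for missing rows.
    if isinstance(found, list) and found:
        return found[0].strip()
    return np.nan


df2["val"] = df2["Content"].str.lower().str.findall(pat).apply(first_match)
df2["new"] = df2["val"].map(di).fillna("other")
```

The `isinstance` check leaves the missing row as NaN in `val`, so it falls into the "other" category without having to fill `Content` beforehand.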

cs95 · Accepted Answer · 2017-11-06 07:38:48Z

Use `str.cat` + `str.extract` to build the pattern and pull out the keyword. Then use `map` to look up the column names, and assign both results as new columns of `df2`.

    import re

    # df is the first DataFrame (DF1)
    i = df.stack().astype(str)
    j = i.reset_index(level=0, drop=True)
    m = dict(zip(j.values, j.index))   # value -> column name
    v = i.str.cat(sep='|')             # alternation of all keywords

    df2['val'] = df2.Content.str.extract(
        r'\s(' + v + r')\s', flags=re.I, expand=False)
    df2['col_name'] = df2['val'].str.lower().map(m).fillna('Other')

    df2

                  Content     val col_name
    0         boy or girl      or    Alpha
    1       school @ morn       @  Special
    2  pyc LoL ok student  LoL ok    Alpha
    3             Chandra     NaN    Other

Details

`i` and `j` are setup variables used to create the mapping.
`m` is the mapping of values to column names.
`v` is the regex pattern that is passed to `str.extract` for keyword extraction. I use `re.I` to ignore case.
I use `map` + `fillna` to map the extracted values to column names via `m`.

    v
    'and|1|@|or|2|#|lol ok|4|&'

    m
    {'#': 'Special',
     '&': 'Special',
     '1': 'Numeric',
     '2': 'Numeric',
     '4': 'Numeric',
     '@': 'Special',
     'and': 'Alpha',
     'lol ok': 'Alpha',
     'or': 'Alpha'}

    df2['val']
    0        or
    1         @
    2    LoL ok
    3       NaN
    Name: val, dtype: object

    df2['col_name']
    0      Alpha
    1    Special
    2      Alpha
    3      Other
    Name: col_name, dtype: object

@cᴏʟᴅsᴘᴇᴇᴅ There are so many new names in the code; you need to update it. I think it should be `df2['val'] =`, and also not just `val.str` but `df2['val'].str`.
@Bharath Oh man, there were a ton of those bugs, but I've fixed them.
@coldspped, I tried your solution, but it does not work when the keyword is at the end of the text. For example, if the keyword in df is "displayed" and df2["content"] is "Policies not getting renewed displayed", the row goes into the Other category.
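The behaviour in the last comment follows from the pattern `\s(...)\s`, which requires a whitespace character on both sides of the keyword, so a keyword at the very start or end of the text cannot match. One possible workaround, sketched under the assumption that keywords should be delimited by whitespace or by the edges of the string, is to swap the two `\s` for the lookarounds `(?<!\S)` and `(?!\S)`. The data comes from the question and the comment; the extra `8` and `%` values only pad the hypothetical fourth row, and `re.escape` plus the longest-first ordering are additions that do not appear in either answer.

```python
import re

import pandas as pd

# Question data plus a hypothetical fourth row carrying the "displayed" keyword.
df = pd.DataFrame({
    "Alpha": ["and", "or", "lol ok", "displayed"],
    "Numeric": [1, 2, 4, 8],
    "Special": ["@", "#", "&", "%"],
})
df2 = pd.DataFrame({
    "Content": [
        "boy or girl",
        "school @ morn",
        "pyc LoL ok student",
        "Chandra",
        "Policies not getting renewed displayed",
    ]
})

# Map each lower-cased keyword to the column it came from.
pairs = pd.melt(df, var_name="col_name", value_name="keyword")
pairs["keyword"] = pairs["keyword"].astype(str).str.lower()
m = dict(zip(pairs["keyword"], pairs["col_name"]))

# Longest keywords first, so a multi-word keyword is tried before a shorter one.
keywords = sorted(m, key=len, reverse=True)
alternation = "|".join(re.escape(k) for k in keywords)

# (?<!\S) and (?!\S) accept whitespace OR the start/end of the string.
pat = r"(?<!\S)(" + alternation + r")(?!\S)"

df2["val"] = df2["Content"].str.extract(pat, flags=re.IGNORECASE, expand=False)
df2["col_name"] = df2["val"].str.lower().map(m).fillna("Other")
```

With this pattern, "and" inside "Chandra" still does not match because it is surrounded by other letters, while "displayed" at the end of the last row is picked up.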
