Python find partial matches of two data frame columns of different lengths

Question

I have two data frame columns of different lengths. DF1:

filename
b~cute
b~cute2
nan
b~cat
b~dog
nan

DF2:

filename
sgjsg~12345~b~cute~v4.jpeg
jgsgj~1233~b~dog~v4.jpeg
jhdjd~1252~b~cat~v4.jpeg
sggsn~1252~b~cute2~v4.jpeg

I am trying to do a partial match to create a separate column in DF1 with the filename as such

filename	filepath
b~cute	sgjsg~12345~b~cute~v4.jpeg
b~cute2	sggsn~1252~b~cute2~v4.jpeg
nan	nan
b~cat	jhdjd~1252~b~cat~v4.jpeg
b~dog	jgsgj~1233~b~dog~v4.jpeg
nan	nan

what if some filename from df1 occurs in multiple records of df2? — RomanPerekhrest
– RomanPerekhrest, Commented Apr 5, 2023 at 19:47

iuvbio · Accepted Answer · 2023-04-05 20:10:43Z

Make a helper function and use it to create the column on DF1.

def match_fn(fn, filenames): if not isinstance(fn, str): return None for filename in filenames: if fn in filename: return filename return None df1.loc[:, "filepath"] = df1.filename.apply(lambda fn: match_fn(fn, df2.filename.values))

Corralien · Accepted Answer · 2023-04-05 20:12:36Z

You can use a regex:

import re # Create pattern and escape regex pat = (df1['filename'].dropna().sort_values(ascending=False) .map(re.escape).str.cat(sep='|')) match = df2['filename'].str.extract(f"({pat})", expand=False) out = df1.merge(df2.rename(columns={'filename': 'filepath'}) .assign(filename=match), on='filename', how='left')

Output:

>>> out filename filepath 0 b~cute sgjsg~12345~b~cute~v4.jpeg 1 b~cute2 sggsn~1252~b~cute2~v4.jpeg 2 NaN NaN 3 b~cat jhdjd~1252~b~cat~v4.jpeg 4 b~dog jgsgj~1233~b~dog~v4.jpeg 5 NaN NaN

Details:

>>> pat 'b\\~dog|b\\~cute2|b\\~cute|b\\~cat'

I forgot to mention why I used sort_values(ascending=False). As regex engine is greedy, you have to put b~cute2 before b~cute in the pattern string else sggsn~1252~b~cute2~v4.jpeg will match with b~cute and not b~cute2.
Thank you! it works amazing on my small data set but it does not work for a larger data set. I will keep testing it
You can slice your first dataframe to build small regex (sort first before slicing) then apply it on df2.

Collectives™ on Stack Overflow

Python find partial matches of two data frame columns of different lengths

2 Answers 2

Comments

3 Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

3 Comments

Related