2

I have two data frame columns of different lengths. DF1:

filename
b~cute
b~cute2
nan
b~cat
b~dog
nan

DF2:

filename
sgjsg~12345~b~cute~v4.jpeg
jgsgj~1233~b~dog~v4.jpeg
jhdjd~1252~b~cat~v4.jpeg
sggsn~1252~b~cute2~v4.jpeg

I am trying to do a partial match to create a separate column in DF1 with the filename as such

filename filepath
b~cute sgjsg~12345~b~cute~v4.jpeg
b~cute2 sggsn~1252~b~cute2~v4.jpeg
nan nan
b~cat jhdjd~1252~b~cat~v4.jpeg
b~dog jgsgj~1233~b~dog~v4.jpeg
nan nan
5
  • My first two tables are not populating correctly Commented Apr 5, 2023 at 19:43
  • what if some filename from df1 occurs in multiple records of df2? Commented Apr 5, 2023 at 19:47
  • @RomanPerekhrest it doesn’t Commented Apr 5, 2023 at 19:52
  • What is nan, a string or NaN? Commented Apr 5, 2023 at 20:01
  • @Corralien Empty NaN Commented Apr 5, 2023 at 20:03

2 Answers 2

1

Make a helper function and use it to create the column on DF1.

def match_fn(fn, filenames): if not isinstance(fn, str): return None for filename in filenames: if fn in filename: return filename return None df1.loc[:, "filepath"] = df1.filename.apply(lambda fn: match_fn(fn, df2.filename.values)) 
Sign up to request clarification or add additional context in comments.

Comments

1

You can use a regex:

import re # Create pattern and escape regex pat = (df1['filename'].dropna().sort_values(ascending=False) .map(re.escape).str.cat(sep='|')) match = df2['filename'].str.extract(f"({pat})", expand=False) out = df1.merge(df2.rename(columns={'filename': 'filepath'}) .assign(filename=match), on='filename', how='left') 

Output:

>>> out filename filepath 0 b~cute sgjsg~12345~b~cute~v4.jpeg 1 b~cute2 sggsn~1252~b~cute2~v4.jpeg 2 NaN NaN 3 b~cat jhdjd~1252~b~cat~v4.jpeg 4 b~dog jgsgj~1233~b~dog~v4.jpeg 5 NaN NaN 

Details:

>>> pat 'b\\~dog|b\\~cute2|b\\~cute|b\\~cat' 

3 Comments

I forgot to mention why I used sort_values(ascending=False). As regex engine is greedy, you have to put b~cute2 before b~cute in the pattern string else sggsn~1252~b~cute2~v4.jpeg will match with b~cute and not b~cute2.
Thank you! it works amazing on my small data set but it does not work for a larger data set. I will keep testing it
You can slice your first dataframe to build small regex (sort first before slicing) then apply it on df2.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.