Python 3 Data Frame string manipulation to extract numbers between 8 to 12 characters [duplicate]

Question

I'm not sure where to start with this one.

I have a list of obsolete items, each with a new item code listed somewhere in the Desc column. Item codes are always between 8 and 12 characters long, so all other numbers in the description should be ignored.

    import pandas as pd

    df1 = pd.DataFrame({'Item_Code': ['00001234', '00012345', '00123456', '01234567'],
                        'Desc': ['Widget1 - Obsolete Use Alternative 56789100',
                                 'Obsolete Widget 2 - Use Alternative 56789100 - Blah Blah Blah',
                                 'Alternative Use 9999999910 - Blah Blah Blah',
                                 'Obsolete use 99999999911']},
                       index=[0, 1, 3, 4])

    print(df1.head(10))

So ideally I'm looking to have the alternative codes in a new column.

Alex · Accepted Answer · 2024-03-11 16:45:58Z


You can use Series.str.extract like so:

df1["Alternative"] = df1["Desc"].str.extract(r"(\d{8,12})")

This applies the regex r"(\d{8,12})" to each row and captures the first run of 8 to 12 digits in the description. The values in the resulting column will be strings unless you convert them to integers.
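As a minimal sketch of the approach above, run against the question's sample frame: the plain pattern will also pick up the first 12 digits of a longer digit run, so a stricter variant with lookarounds is shown alongside it. The names strict_pattern and Alternative_strict are illustrative only and not part of the original answer.

```python
import pandas as pd

# Sample frame from the question.
df1 = pd.DataFrame(
    {
        "Item_Code": ["00001234", "00012345", "00123456", "01234567"],
        "Desc": [
            "Widget1 - Obsolete Use Alternative 56789100",
            "Obsolete Widget 2 - Use Alternative 56789100 - Blah Blah Blah",
            "Alternative Use 9999999910 - Blah Blah Blah",
            "Obsolete use 99999999911",
        ],
    },
    index=[0, 1, 3, 4],
)

# expand=False returns a Series rather than a one-column DataFrame.
df1["Alternative"] = df1["Desc"].str.extract(r"(\d{8,12})", expand=False)

# Assumed stricter variant: the lookarounds reject digit runs longer than 12,
# which the plain pattern would otherwise match partially.
strict_pattern = r"(?<!\d)(\d{8,12})(?!\d)"
df1["Alternative_strict"] = df1["Desc"].str.extract(strict_pattern, expand=False)

print(df1[["Item_Code", "Alternative", "Alternative_strict"]])
```

Keeping the result as strings is probably the safer default here, since converting to integers would drop any leading zeros in a code.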


2 Comments

Lee Murray Over a year ago

This is awesome, but how would I deal with multiple codes in the same cell? For example: "Widget1 - Obsolete Use Alternative 56789100 or 56789101". Ideally, I'd want a new row for each replacement item if possible.

Alex Over a year ago

You would need to use Series.str.extractall, but this returns a DataFrame with a MultiIndex, so it needs more work to join the rows back together.
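One possible way to do that extra work, sketched under the assumption that each replacement code should get its own row: drop the "match" level that extractall adds, then join the matches back onto the original frame by index. The frame df_multi and the names matches, alternatives and result are hypothetical, built from the example in the comment above.

```python
import pandas as pd

# Hypothetical frame: first description taken from the comment, second from the question.
df_multi = pd.DataFrame(
    {
        "Item_Code": ["00001234", "00012345"],
        "Desc": [
            "Widget1 - Obsolete Use Alternative 56789100 or 56789101",
            "Obsolete Widget 2 - Use Alternative 56789100 - Blah Blah Blah",
        ],
    }
)

# extractall returns one row per match, indexed by (original index, match number).
matches = df_multi["Desc"].str.extractall(r"(\d{8,12})")

# Keep the single capture group (column 0) and drop the "match" level
# so the index lines up with df_multi again.
alternatives = matches[0].rename("Alternative").droplevel("match")

# Joining on the index repeats each original row once per extracted code;
# rows with no match are kept with a missing Alternative.
result = df_multi.join(alternatives, how="left").reset_index(drop=True)

print(result[["Item_Code", "Alternative"]])
```

With this data the first item appears twice, once for each replacement code, and the second item appears once.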
