
Given the following data frame:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'A': ['1a', np.nan, '10a', '100b', '0b']})
    df

          A
    0    1a
    1   NaN
    2   10a
    3  100b
    4    0b

I'd like to extract the numbers from each cell (where they exist). The desired result is:

         A
    0    1
    1  NaN
    2   10
    3  100
    4    0

I know it can be done with str.extract, but I'm not sure how.

4 Answers

105

Give it a regex capture group:

    df.A.str.extract(r'(\d+)', expand=False)

Gives you:

    0      1
    1    NaN
    2     10
    3    100
    4      0
    Name: A, dtype: object

The parentheses in (\d+) form a regex capturing group, and \d+ matches one or more consecutive digits. str.extract returns the first such match in each string, or NaN where there is none. Note that this only works for whole numbers, not floats.
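If the column also holds decimal values such as 0.7 mg, one possible extension is to make a fractional part optional in the pattern. This is only a sketch: the sample Series `s` and the pattern below are illustrative assumptions, not part of the original answer, and the pattern ignores signs, thousands separators, and exponents.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mixing whole numbers, a decimal, and a missing value.
s = pd.Series(['1a', np.nan, '10a', '100b', '0.7 mg'])

# Digits, optionally followed by a dot and more digits.
# (?:...) is a non-capturing group, so only the whole number is captured.
# expand=False returns a Series rather than a one-column DataFrame.
extracted = s.str.extract(r'(\d+(?:\.\d+)?)', expand=False)

# The captured values are still strings; convert them to floats (NaN stays NaN).
numbers = pd.to_numeric(extracted)
print(numbers)
```

The conversion with pd.to_numeric is optional; without it, the result keeps the extracted values as strings, as in the output above.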


6 Comments

How could I do it when there is a comma, as in: 6,000 a
@StevenG strip out commas first?
As of 2020, this code gives a FutureWarning. You can get around it by passing expand=False to extract.
This doesn't work if there is a number after letters.
This does not work for my column with numbers and units: 0.7 mg
9

To answer @StevenG's question in the comment above, this should work:

    df.A.str.extract(r'(^\d*)', expand=False)
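Note that (^\d*) stops at the first non-digit character, so for a value like 6,000 a it captures only 6. Two ways of handling the comma are sketched below: stripping commas first, as suggested in the comments, or capturing digits and commas together and removing the commas afterwards. The sample Series `s` and the names `stripped` and `captured` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with thousands separators.
s = pd.Series(['6,000 a', '1a', '12,345,678 b'])

# Option 1: strip the commas first, then extract the digits.
stripped = (
    s.str.replace(',', '', regex=False)
     .str.extract(r'(\d+)', expand=False)
)

# Option 2: capture a digit followed by any run of digits and commas,
# then drop the commas from the captured text.
captured = (
    s.str.extract(r'(\d[\d,]*)', expand=False)
     .str.replace(',', '', regex=False)
)

print(stripped.tolist())
print(captured.tolist())
```

Option 1 removes every comma in the string, including ones outside the number, which may or may not matter for your data.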


8

You can replace the column with the result using the assign method:

    df = df.assign(A=lambda x: x['A'].str.extract(r'(\d+)', expand=False))


3

If you have cases where you have multiple disjoint sets of digits, as in 1a2b3c, in which you would like to extract 123, you can do it with Series.str.replace:

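    >>> df
            A
    0      1a
    1      b2
    2    a1b2
    3  1a2b3c
    >>> # regex=True is required in recent pandas, where the default is a literal match
    >>> df['A'] = df['A'].str.replace(r'\D+', '', regex=True)
    >>> df
         A
    0    1
    1    2
    2   12
    3  123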

You could also work around this with Series.str.extractall and groupby, but I think this approach is easier.
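For comparison, the extractall and groupby route might look roughly like the sketch below. The variable names are illustrative, and this is one possible way to combine the two calls rather than the only one.

```python
import pandas as pd

s = pd.Series(['1a', 'b2', 'a1b2', '1a2b3c'])

# extractall returns one row per match, indexed by
# (original index, match number); column 0 holds the capture group.
matches = s.str.extractall(r'(\d+)')[0]

# Join the matches that came from the same original row.
joined = matches.groupby(level=0).agg(''.join)

# Rows with no digits at all produce no matches, so reindex to restore them as NaN.
joined = joined.reindex(s.index)
print(joined.tolist())
```

The reindex step makes no difference for this sample, since every row contains at least one digit, but it keeps the result aligned with the original Series when some rows have none.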

Hope this helps!
