Pandas groupby function returns NaN values

Question

I have a list of people with fields unique_id, sex, born_at (birthday) and I’m trying to group by sex and age bins, and count the rows in each segment.

Can’t figure out why I keep getting NaN or 0 as the output for each segment.

Here’s the latest approach I've taken...

Data sample:

|-----------|-----|------------|
| unique_id | sex | born_at    |
|-----------|-----|------------|
| 1         | M   | 1963-08-04 |
| 2         | F   | 1972-03-22 |
| 3         | M   | 1982-02-10 |
| 4         | M   | 1989-05-02 |
| 5         | F   | 1974-01-09 |
|-----------|-----|------------|

Code:

df['num_people'] = 1
breakpoints = [18, 25, 35, 45, 55, 65]
df[['sex', 'born_at', 'num_people']].groupby(['sex', pd.cut(df.born_at.dt.year, bins=breakpoints)]).agg('count')

I’ve tried summing as the agg type, removing NaNs from the data series, pivot_table using the same pd.cut function but no luck. Guessing there’s also probably a better way to do this that doesn’t involve creating a column of 1s.

Desired output would be something like this...

The extra born_at column isn't necessary in the output and I'd also like the age bins to be 18 to 24, 25 to 34, etc. instead of 18 to 25, 25 to 35, etc. but I'm not sure how to specify that either.

Can you post a desired output for your data sample?

– user3483203, Commented Jul 12, 2018 at 21:29
Done. Also made some minor changes to the column naming.

– acmyers, Commented Jul 12, 2018 at 22:17

Viktor · Accepted Answer · 2018-07-12 22:25:57Z

I think you missed the calculation of the current age. The bin edges you define (18 to 65) only make sense as ages, not as birth years. The earliest birth year in your sample is 1963, while the largest bin edge is 65, so every birth year falls outside all of the bins and every grouped cell comes out as NaN or zero. So first of all you want to calculate the age:

from datetime import datetime
datetime.now().year - df.birthday.dt.year
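To see the difference this makes, here is a minimal check using the birth years from the question's sample. The reference year 2018 is an assumption (the year the question was posted) so the result stays the same whenever it is run; with `datetime.now().year` the ages would shift over time.

```python
import pandas as pd

breakpoints = [18, 25, 35, 45, 55, 65]
birth_years = pd.Series([1963, 1972, 1982, 1989, 1974])

# Birth years all lie above the largest bin edge (65),
# so pd.cut cannot place any of them and returns NaN for every row.
by_year = pd.cut(birth_years, bins=breakpoints)

# Assumption: a fixed reference year of 2018 instead of the current year.
# The resulting ages (55, 46, 36, 29, 44) all fall inside the bins.
by_age = pd.cut(2018 - birth_years, bins=breakpoints)

print(by_year.isna().all())   # every birth year is unbinned
print(by_age.tolist())        # one interval per row
```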

This information then can be used to group the data (which are previously grouped by gender):

df.groupby(['gender', pandas.cut(datetime.now().year-df.birthday.dt.year, bins=breakpoints)]).agg('count')
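For anyone who wants to run this end to end, here is a rough, self-contained sketch on the five sample rows. It uses the column names from the snippet above (`gender`, `birthday`; in the question's sample these are `sex` and `born_at`). The fixed `reference_year`, the choice to count `unique_id` only, and the explicit `observed=False` are my assumptions, not part of the original one-liner.

```python
import pandas as pd

# Sample rows from the question, with the answer's column names.
df = pd.DataFrame({
    "unique_id": [1, 2, 3, 4, 5],
    "gender": ["M", "F", "M", "M", "F"],
    "birthday": pd.to_datetime(
        ["1963-08-04", "1972-03-22", "1982-02-10", "1989-05-02", "1974-01-09"]
    ),
})
breakpoints = [18, 25, 35, 45, 55, 65]

# Assumption: fixed reference year so the output does not drift over time.
reference_year = 2018
age = reference_year - df["birthday"].dt.year

# Group by gender and by age bin, then count one column.
# observed=False asks pandas to keep age bins that contain no rows.
counts = (
    df.groupby(["gender", pd.cut(age, bins=breakpoints)], observed=False)["unique_id"]
    .count()
)
print(counts)
```

Counting a single column gives one number per group instead of repeating the same count under every column of the frame.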

In order to get rid of the nan cells you simply do a fillna(0) like this:

df.groupby(['gender', pandas.cut(datetime.now().year-df.birthday.dt.year, bins=breakpoints)]).agg('count').fillna(0).rename(columns={'birthday':'count'})
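The question also asked for bins that read 18 to 24, 25 to 34, and so on, and for a way to avoid the column of 1s. One possible variation, sketched here rather than taken from the answer above: pass `right=False` and `labels` to `pd.cut`, and use `size()` to count rows per group. The names `reference_year`, `age_group` and `num_people` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "unique_id": [1, 2, 3, 4, 5],
    "gender": ["M", "F", "M", "M", "F"],
    "birthday": pd.to_datetime(
        ["1963-08-04", "1972-03-22", "1982-02-10", "1989-05-02", "1974-01-09"]
    ),
})

# Assumption: fixed reference year so the output is reproducible.
reference_year = 2018
age = (reference_year - df["birthday"].dt.year).rename("age_group")

# right=False makes the bins left-closed: [18, 25), [25, 35), ...
# so an age of exactly 25 lands in "25 to 34" rather than "18 to 25".
breakpoints = [18, 25, 35, 45, 55, 65]
labels = [f"{lo} to {hi - 1}" for lo, hi in zip(breakpoints[:-1], breakpoints[1:])]
age_group = pd.cut(age, bins=breakpoints, right=False, labels=labels)

# size() counts rows per group directly, so no helper column of 1s is needed.
counts = (
    df.groupby(["gender", age_group], observed=False)
    .size()
    .rename("num_people")
)
print(counts)
```

Keep in mind that a plain year difference treats everyone as if this year's birthday has already happened, which can push someone near a bin edge into the next group. If that matters, compare month and day as well before binning.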

I should be better about testing these things. Working on it. Thanks again.
