I have a list of people with fields unique_id, sex, born_at (birthday) and I’m trying to group by sex and age bins, and count the rows in each segment.

Can’t figure out why I keep getting NaN or 0 as the output for each segment.

Here’s the latest approach I've taken...

Data sample:

| unique_id | sex | born_at    |
|-----------|-----|------------|
| 1         | M   | 1963-08-04 |
| 2         | F   | 1972-03-22 |
| 3         | M   | 1982-02-10 |
| 4         | M   | 1989-05-02 |
| 5         | F   | 1974-01-09 |

Code:

df['num_people'] = 1
breakpoints = [18, 25, 35, 45, 55, 65]
df[['sex', 'born_at', 'num_people']].groupby(['sex', pd.cut(df.born_at.dt.year, bins=breakpoints)]).agg('count')

I’ve tried summing as the agg type, removing NaNs from the data series, pivot_table using the same pd.cut function but no luck. Guessing there’s also probably a better way to do this that doesn’t involve creating a column of 1s.

Desired output would be something like this: [image of the desired output table]

The extra born_at column isn't necessary in the output and I'd also like the age bins to be 18 to 24, 25 to 34, etc. instead of 18 to 25, 25 to 35, etc. but I'm not sure how to specify that either.

  • Can you post a desired output for your data sample? Commented Jul 12, 2018 at 21:29
  • Done. Also made some minor changes to the column naming. Commented Jul 12, 2018 at 22:17

1 Answer

I think you missed calculating the current age. The bin edges you define (18 to 65) only make sense for ages, not for birth years: the earliest birth year in your sample is 1963 and the largest bin edge is 65, so every birth year falls outside the bins and all grouped cells come out as NaN or zero. So first of all you want to calculate the age:

from datetime import datetime
datetime.now().year - df.birthday.dt.year
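
A quick check of that explanation, as a minimal sketch: it rebuilds the question's sample data (so it uses the question's column name born_at rather than birthday) and compares binning the birth years with binning the ages. The fixed reference year 2018 is an assumption made only to keep the result reproducible; the answer itself uses datetime.now().year. Note that a plain year difference only approximates age, since it ignores whether the birthday has already passed in the reference year.

```python
import pandas as pd

# Sample data from the question (column names as posted there).
df = pd.DataFrame({
    "unique_id": [1, 2, 3, 4, 5],
    "sex": ["M", "F", "M", "M", "F"],
    "born_at": pd.to_datetime(
        ["1963-08-04", "1972-03-22", "1982-02-10", "1989-05-02", "1974-01-09"]
    ),
})
breakpoints = [18, 25, 35, 45, 55, 65]

# Binning birth years: every value (1963..1989) lies outside 18..65,
# so pd.cut assigns NaN to every row.
by_year = pd.cut(df["born_at"].dt.year, bins=breakpoints)

# Binning ages instead. reference_year is an assumption (fixed for
# reproducibility); the answer uses datetime.now().year here.
reference_year = 2018
age = reference_year - df["born_at"].dt.year
by_age = pd.cut(age, bins=breakpoints)

print(by_year.isna().all())  # True: no birth year falls into any bin
print(by_age.isna().any())   # False: every age falls into some bin
```
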

The age can then be binned and used as a second grouping key alongside gender:

df.groupby(['gender', pandas.cut(datetime.now().year-df.birthday.dt.year, bins=breakpoints)]).agg('count') 

To replace the NaN cells with zeros, chain a fillna(0) like this (the rename just relabels the birthday column as count):

df.groupby(['gender', pandas.cut(datetime.now().year-df.birthday.dt.year, bins=breakpoints)]).agg('count').fillna(0).rename(columns={'birthday':'count'}) 
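
The question also asks for bins labelled 18 to 24, 25 to 34, and so on, and for a way to count without a helper column of 1s. One possible sketch, not part of the answer above: pass right=False to pd.cut so each bin includes its left edge and excludes its right edge, supply explicit labels, and count rows with size(). The column names follow the question (sex, born_at); the names reference_year, labels, age_group and counts are made up for this illustration, and the fixed year 2018 is an assumption to keep the result reproducible.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "unique_id": [1, 2, 3, 4, 5],
    "sex": ["M", "F", "M", "M", "F"],
    "born_at": pd.to_datetime(
        ["1963-08-04", "1972-03-22", "1982-02-10", "1989-05-02", "1974-01-09"]
    ),
})

# Approximate age as a year difference; reference_year is an assumption
# (use datetime.now().year for the current year).
reference_year = 2018
age = reference_year - df["born_at"].dt.year

breakpoints = [18, 25, 35, 45, 55, 65]
# Build labels "18-24", "25-34", ... from consecutive bin edges.
labels = [f"{lo}-{hi - 1}" for lo, hi in zip(breakpoints[:-1], breakpoints[1:])]

# right=False gives left-closed bins: [18, 25), [25, 35), ...
age_group = pd.cut(age, bins=breakpoints, right=False, labels=labels).rename("age_group")

# size() counts rows per group, so no column of 1s is needed;
# unstack turns the age groups into columns and fills gaps with 0.
counts = (
    df.groupby(["sex", age_group], observed=False)
      .size()
      .unstack(fill_value=0)
)
print(counts)
```

With the sample data this gives one row per sex and one column per age group, with zeros where a segment has no people.
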

2 Comments

ah. thank you for catching that. the groupby is working now
I should be better about testing these things. Working on it. Thanks again.
