
How can one calculate the age over multiple dates for customers where not all dates are populated?

My plan is to compute each customer's minimum age in a temporary table, and then use that table to build a final table in which an age exists for each year for each customer_id.

I know how to build the min_age_table, which holds each person's earliest recorded age and date. I am not sure how to use it to generate a table in which each customer's age runs sequentially backwards and forwards, as shown in the figure below.

enter image description here

I have set up a minimal working example to try to implement this in the BigQuery SQL UI.

    -- CREATE OR REPLACE TABLE `dataset_id.project_id.example_table`
    WITH original_table AS (
      SELECT 'a' as customer_id, '2020-11-01' as snapshot_date, 20 as age
      UNION ALL SELECT 'a', '2020-12-01', 21
      UNION ALL SELECT 'b', '2020-09-01', 25
      UNION ALL SELECT 'b', '2020-10-01', 25
      UNION ALL SELECT 'c', '2020-01-01', 45
    )

    -- select customer_id, min_snapshot_date , age for min table
    SELECT
      --original_table.customer_id,
      MIN(original_table.snapshot_date) AS min_snapshot_date,
      --original_table.age
    FROM original_table

    -- use min table to get an age for 3 years (will need to be able to increase both ways)
  • I would consider deriving an estimated date of birth using something like date_sub(cast(min_date as date), INTERVAL age YEAR), but this may be off by 1 year for some people. You may need to play around with it. Commented Sep 22, 2022 at 21:50
  • Use the formula from the previous comment for each row, date_sub(cast(min_date as date), INTERVAL age YEAR) as X, then calculate per user the difference between its largest and smallest value: date_diff(max(X), min(X), day) as y. If y is greater than 365 days, the user lied about their age. If y is zero, only one date is given, so the user can be up to one year older. 365 - y gives the number of days by which the user can be older. Commented Sep 23, 2022 at 9:00
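
The check described in these comments could be sketched roughly as follows. This is a minimal sketch in BigQuery Standard SQL, assuming the sample data from the question with snapshot_date stored as a DATE rather than a string; the names per_snapshot, x, y, days_of_slack, and is_inconsistent are illustrative and do not come from the original post.

```sql
WITH original_table AS (
  SELECT 'a' AS customer_id, DATE '2020-11-01' AS snapshot_date, 20 AS age UNION ALL
  SELECT 'a', DATE '2020-12-01', 21 UNION ALL
  SELECT 'b', DATE '2020-09-01', 25 UNION ALL
  SELECT 'b', DATE '2020-10-01', 25 UNION ALL
  SELECT 'c', DATE '2020-01-01', 45
),
-- x is the latest birth date compatible with one snapshot
-- (it assumes the birthday falls exactly on the snapshot date)
per_snapshot AS (
  SELECT
    customer_id,
    DATE_SUB(snapshot_date, INTERVAL age YEAR) AS x
  FROM original_table
)
SELECT
  customer_id,
  MIN(x) AS latest_possible_birth_date,
  -- spread between the per-snapshot estimates, in days
  DATE_DIFF(MAX(x), MIN(x), DAY) AS y,
  -- remaining uncertainty: how many days older the customer could be
  365 - DATE_DIFF(MAX(x), MIN(x), DAY) AS days_of_slack,
  -- a spread above one year means the recorded ages contradict each other
  DATE_DIFF(MAX(x), MIN(x), DAY) > 365 AS is_inconsistent
FROM per_snapshot
GROUP BY customer_id
ORDER BY customer_id;
```

The more snapshots a customer has, and the closer they sit to either side of a birthday, the narrower the window for the birth date becomes. The 365-day threshold ignores leap years, so treat values near the boundary with care.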

2 Answers


I have written a simple solution that follows your diagram exactly, and I have annotated it to help you understand what each step does.

    -- Create a temp table to hold the date values
    CREATE TEMP TABLE dates (
      YEAR INT64
    );

    -- insert dates in the desired range - have chosen 2015-2022 here as an example
    INSERT INTO dates
    SELECT EXTRACT(YEAR FROM MY_DATE)
    FROM (
      -- this selects dates from 2015-2022, change 2015 to desired start date and 7 to the desired number of years
      SELECT DATE_ADD('2015-01-01', INTERVAL param YEAR) AS MY_DATE
      FROM unnest(GENERATE_ARRAY(0, 7, 1)) as param
    );

    -- create data in the original table
    WITH original_table AS (
      SELECT 'a' as customer_id, '2020-11-01' as snapshot_date, 20 as age
      UNION ALL SELECT 'a', '2020-12-01', 21
      UNION ALL SELECT 'b', '2020-09-01', 25
      UNION ALL SELECT 'b', '2020-10-01', 25
      UNION ALL SELECT 'c', '2020-01-01', 45
    ),
    -- Select customer_id, min_snapshot_date, age for min table
    min_date_age AS (
      SELECT
        original_table.customer_id,
        DATE(MIN(original_table.snapshot_date)) AS min_snapshot_date,
        MIN(original_table.age) AS min_age
      FROM original_table
      GROUP BY customer_id
    )
    -- select customer id, and derived snapshot year and age
    SELECT
      customer_id,
      YEAR AS derived_snapshot_year,
      min_age - (EXTRACT(YEAR FROM min_snapshot_date) - YEAR) AS derived_age
    FROM min_date_age
    -- cross join to create duplicate rows (one for each desired date)
    CROSS JOIN dates
    ORDER BY min_date_age.customer_id, derived_snapshot_year

Output table:

enter image description here

This reproduces exactly the behavior you described with your diagram. However, as other users have pointed out, the output has a limitation: within a single calendar year a person can technically have two different ages, one before and one after their birthday (with the rare exception of people born on XXXX/01/01).
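
One way to make that ambiguity visible is to report the age at the start and at the end of each year instead of a single value. The following is a rough sketch in BigQuery Standard SQL, not part of the solution above. It assumes snapshot_date is stored as a DATE and it assumes each customer's birth date is the latest date consistent with all of their snapshots, which is only an estimate. The names assumed_birth, assumed_birth_date, age_on_jan_1, and age_on_dec_31 are illustrative.

```sql
WITH original_table AS (
  SELECT 'a' AS customer_id, DATE '2020-11-01' AS snapshot_date, 20 AS age UNION ALL
  SELECT 'a', DATE '2020-12-01', 21 UNION ALL
  SELECT 'b', DATE '2020-09-01', 25 UNION ALL
  SELECT 'b', DATE '2020-10-01', 25 UNION ALL
  SELECT 'c', DATE '2020-01-01', 45
),
-- ASSUMPTION: take the latest birth date that agrees with every snapshot
assumed_birth AS (
  SELECT
    customer_id,
    MIN(DATE_SUB(snapshot_date, INTERVAL age YEAR)) AS assumed_birth_date
  FROM original_table
  GROUP BY customer_id
)
SELECT
  customer_id,
  snapshot_year,
  -- on 1 January the birthday has not happened yet, unless it is 1 January itself
  snapshot_year - EXTRACT(YEAR FROM assumed_birth_date)
    - IF(FORMAT_DATE('%m%d', assumed_birth_date) > '0101', 1, 0) AS age_on_jan_1,
  -- on 31 December the birthday has always happened
  snapshot_year - EXTRACT(YEAR FROM assumed_birth_date) AS age_on_dec_31
FROM assumed_birth
CROSS JOIN UNNEST(GENERATE_ARRAY(2015, 2022)) AS snapshot_year
ORDER BY customer_id, snapshot_year;
```

For customer a in 2020 this gives 20 on 1 January and 21 on 31 December, which agrees with both recorded snapshots. For customer c, whose assumed birthday is 1 January, the two columns are equal.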

EDIT:

Edit in response to the comment:

    -- Create a temp table to hold the date values
    CREATE TEMP TABLE dates (
      YEAR INT64
    );

    -- insert dates in the desired range - have chosen 2015-2022 here as an example
    INSERT INTO dates
    SELECT EXTRACT(YEAR FROM MY_DATE)
    FROM (
      -- this selects dates from 2015-2022, change 2015 to desired start date and 7 to the desired number of years
      SELECT DATE_ADD('2015-01-01', INTERVAL param YEAR) AS MY_DATE
      FROM unnest(GENERATE_ARRAY(0, 7, 1)) as param
    );

    -- create data in the original table
    WITH original_table AS (
      SELECT 'a' as customer_id, '2020-11-01' as snapshot_date, 20 as age
      UNION ALL SELECT 'a', '2020-12-01', 21
      UNION ALL SELECT 'b', '2020-09-01', 25
      UNION ALL SELECT 'b', '2020-10-01', 25
      UNION ALL SELECT 'c', '2020-01-01', 45
    ),
    -- Select customer_id, min_snapshot_date, age for min table
    min_date_age AS (
      SELECT
        original_table.customer_id,
        DATE(MIN(original_table.snapshot_date)) AS min_snapshot_date,
        MIN(original_table.age) AS min_age
      FROM original_table
      GROUP BY customer_id
    ),
    -- select customer id, and derived snapshot year and age
    output AS (
      SELECT
        customer_id,
        YEAR AS derived_snapshot_year,
        min_age - (EXTRACT(YEAR FROM min_snapshot_date) - YEAR) AS derived_age
      FROM min_date_age
      -- cross join to create duplicate rows (one for each desired date)
      CROSS JOIN dates
      ORDER BY min_date_age.customer_id, derived_snapshot_year
    )
    SELECT ...
    FROM other_table
    LEFT JOIN output
      ON other_table.field = output.field

2 Comments

Thank you @JeremySavage, this was a great answer. What if you want to create this table as a sub-table to left join onto another table? How does that change how you write it?
@user4933 - no change, you can just pass the CTE to a LEFT JOIN statement. Of course, if you wanted to save the intermediate data you could just save it to an output table.
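
To illustrate that reply, here is a minimal single-statement sketch in BigQuery Standard SQL. It is an illustration under stated assumptions, not the answerer's exact code: the temp table is replaced by an inline UNNEST(GENERATE_ARRAY(...)) so that everything fits in one WITH clause, snapshot_date is assumed to be a DATE, and other_table with its columns order_year and amount is entirely hypothetical.

```sql
WITH original_table AS (
  SELECT 'a' AS customer_id, DATE '2020-11-01' AS snapshot_date, 20 AS age UNION ALL
  SELECT 'a', DATE '2020-12-01', 21 UNION ALL
  SELECT 'b', DATE '2020-09-01', 25 UNION ALL
  SELECT 'b', DATE '2020-10-01', 25 UNION ALL
  SELECT 'c', DATE '2020-01-01', 45
),
min_date_age AS (
  SELECT
    customer_id,
    MIN(snapshot_date) AS min_snapshot_date,
    MIN(age) AS min_age
  FROM original_table
  GROUP BY customer_id
),
-- one row per customer and year, same arithmetic as in the answer
derived_ages AS (
  SELECT
    customer_id,
    snapshot_year AS derived_snapshot_year,
    min_age - (EXTRACT(YEAR FROM min_snapshot_date) - snapshot_year) AS derived_age
  FROM min_date_age
  CROSS JOIN UNNEST(GENERATE_ARRAY(2015, 2022)) AS snapshot_year
),
-- HYPOTHETICAL table to join onto; replace with your own
other_table AS (
  SELECT 'a' AS customer_id, 2021 AS order_year, 100 AS amount UNION ALL
  SELECT 'd', 2021, 50
)
SELECT
  o.customer_id,
  o.order_year,
  o.amount,
  d.derived_age
FROM other_table AS o
LEFT JOIN derived_ages AS d
  ON o.customer_id = d.customer_id
  AND o.order_year = d.derived_snapshot_year;
```

Customer d has no rows in derived_ages, so the LEFT JOIN keeps that row with a NULL derived_age. To save the intermediate data instead, the derived_ages query could be wrapped in a CREATE OR REPLACE TABLE ... AS statement with a table name of your choosing.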

Interesting question. I would like to post an approach for SQL Server (T-SQL) users:

    ;WITH original_table AS (
        SELECT *
        FROM (VALUES
            ('a', CAST('2020-11-01' AS DATE), 20),
            ('a', CAST('2020-12-01' AS DATE), 21),
            ('b', CAST('2020-09-01' AS DATE), 25),
            ('b', CAST('2020-10-01' AS DATE), 25),
            --('c', CAST('2020-02-01' AS DATE), 46),
            ('c', CAST('2020-01-01' AS DATE), 45)
        ) x (customer_id, snapshot_date, age)
    ),
    min_table AS (
        SELECT
            customer_id,
            MIN(DATEADD(DAY, DATEDIFF(DAY, 0, DATEADD(DAY, -age*364.75, snapshot_date)), 0)) snapshot_date
        FROM original_table
        GROUP BY customer_id
    ),
    years_period AS (
        SELECT *
        FROM (VALUES
            (2019),
            (2020),
            (2021)
        ) x (snapshot_year)
    )
    SELECT
        customer_id,
        y.snapshot_year,
        DATEDIFF(YEAR, m.snapshot_date, DATEADD(YEAR, snapshot_year-1900, 0)) age
    FROM years_period y, min_table m

And the related output:

    customer_id  snapshot_year  age
    a            2019           20
    a            2020           21
    a            2021           22
    b            2019           24
    b            2020           25
    b            2021           26
    c            2019           44
    c            2020           45
    c            2021           46
