Optimise a function with numerous conditions that depends on the previous row in a Python dataframe

Question

I have the following dataframe:

country_ID	ID	direction	date
ESP_1	0	IN	2021-02-28
ENG	0	IN	2021-03-03
ENG	0	OUT	2021-03-04
ESP_2	0	IN	2021-03-05
FRA	1	OUT	2021-03-07
ENG	1	OUT	2021-03-09
ENG	1	OUT	2021-03-10
ENG	2	IN	2021-03-13

I have implemented the following functionality:

    def create_columns_analysis(df):
        df['visit_ESP'] = 0
        df['visit_ENG'] = 0
        df['visit_FRA'] = 0
        list_ids = []
        for i in range(len(df)):
            if df.loc[i,'country_ID'] == 'ENG':
                country_ID_ENG(df, i, list_ids)
            else:
                # case country_ID = {FRA, ESP_1, ESP_2}
                # other methods not specified
        return df

For each row with a specific country_ID, a similarly structured function is applied.

I would like to optimise or simplify the code of the country_ID_ENG function. The country_ID_ENG function is defined as follows:

    def country_ID_ENG(df, i, list_ids):
        # If it is the first time the ID is detected
        if df.loc[i,'ID'] not in list_ids:
            # It adds up to one visit regardless of the direction of the ID
            df.loc[i,'visit_ENG'] = 1
            # Add the ID to the read list
            list_ids.append(df.loc[i, 'ID'])
            # Assigns the error column a start message
            df.loc[i,'error'] = 'ERROR:1'
        # If it is not the first time it detects that ID
        else:
            # Saves the information of the previous row
            prev_row = df.loc[i-1]
            # If the current row direction is 'IN'
            if df.loc[i,'direction'] == 'IN':
                # Add a visit
                df.loc[i,'visit_ENG'] = 1
                # Behaviour dependent on the previous row
                # If the current row direction is 'IN' and previous row is 'IN'
                if prev_row['direction'] == 'IN':
                    if prev_row['country_ID'] == 'FRA':
                        df.loc[i,'error'] = 'ERROR:0'
                    elif prev_row['country_ID'] in ['ESP_1','ESP_2']:
                        df.loc[i,'error'] = 'ERROR:2'
                        df.loc[i,'visit_FRA'] = 1
                    else:
                        df.loc[i,'error'] = 'ERROR:3'
                # If the current row direction is 'IN' and previous row is 'OUT'
                else:
                    if prev_row['country_ID'] == 'ENG':
                        df.loc[i,'error'] = 'ERROR:0'
                    elif prev_row['country_ID'] in ['FRA','ESP_2']:
                        df.loc[i,'error'] = 'ERROR:4'
                        df.loc[i,'visit_FRA'] = 1
                    else:
                        df.loc[i,'error'] = 'ERROR:5'
                        df.loc[i,'visit_ESP'] = 1
                        df.loc[i,'visit_FRA'] = 1
            # If the current row direction is 'OUT'
            else:
                # If the current row direction is 'OUT' and previous row is 'IN'
                if prev_row['direction'] == 'IN':
                    # If it detects an output before an input of the same 'country_ID',
                    # it calculates the visit time
                    if prev_row['country_ID'] == 'ENG':
                        df.loc[i,'mean_time'] = df.loc[i,'date']-prev_row['date']
                        df.loc[i,'error'] = 'ERROR:0'
                    elif prev_row['country_ID'] in ['ESP_1','ESP_2']:
                        df.loc[i,'error'] = 'ERROR:6'
                        df.loc[i,'visit_FRA'] = 1
                        df.loc[i,'visit_ENG'] = 1
                    else:
                        df.loc[i,'error'] = 'ERROR:7'
                        df.loc[i,'visit_ENG'] = 1
                # If the current row direction is 'OUT' and previous row is 'OUT'
                else:
                    df.loc[i,'visit_ENG'] = 1
                    if prev_row['country_ID'] == 'ENG':
                        df.loc[i,'error'] = 'ERROR:8'
                    elif prev_row['country_ID'] in ['FRA','ESP_2']:
                        df.loc[i,'error'] = 'ERROR:9'
                        df.loc[i,'visit_FRA'] = 1
                    else:
                        df.loc[i,'error'] = 'ERROR:10'
                        df.loc[i,'visit_ESP'] = 1
                        df.loc[i,'visit_FRA'] = 1

The above function uses the information from the current row and the previous row (if any) to create new columns for visit_ENG, visit_ESP, visit_FRA, mean_time and error.

For the example dataframe, applying the function country_ID_ENG to the rows whose country_ID is equal to ENG should return the following result:

country_ID	ID	direction	date	visit_ENG	visit_FRA	mean_time	error
ESP_1	0	IN	2021-02-28	0	0	NaN	NaN
ENG	0	IN	2021-03-03	0	1	NaN	ERROR:2
ENG	0	OUT	2021-03-04	0	0	1 days	ERROR:0
ESP_2	0	IN	2021-03-05	0	0	NaN	NaN
FRA	1	OUT	2021-03-07	0	0	NaN	NaN
ENG	1	OUT	2021-03-09	1	1	NaN	ERROR:9
ENG	1	OUT	2021-03-10	1	0	NaN	ERROR:8
ENG	2	IN	2021-03-13	1	0	NaN	ERROR:1

The function is very long, and the other functions for rows with country_ID equal to ESP or FRA will have the same complexity. I would like help simplifying or optimising this function, so that the same approach can be reused when defining the country_ID_ESP and country_ID_FRA functions. I appreciate your help.

Good question, but the title at the moment is quite generic. What is your code actually for? That is what the title of your question should say.
– Greedo, Commented Mar 25, 2022 at 21:45
What is the logic behind the errors? I'm trying to come up with a way to index the errors with a condition. Why do prev_row['country_ID'] == 'ENG' and prev_row['country_ID'] == 'FRA' both give ERROR:0?
– Jason Leaver, Commented Mar 26, 2022 at 2:25
This won't actually run. Your first else can't contain only a comment: Python requires at least a pass.
– Reinderien, Commented Mar 27, 2022 at 12:30
Your example output does not match what your code does. Assuming that - means NaN/None, your first row has zeros in visit_ESP, visit_ENG and visit_FRA, but you've shown -.
– Reinderien, Commented Mar 27, 2022 at 12:37
Your edit didn't really help. Your output is full of discrepancies. I encourage you to copy and paste verbatim and check the results.
– Reinderien, Commented Mar 27, 2022 at 15:45

Jason Leaver · Accepted Answer · 2022-03-26 06:00:56Z

per pandas iteration guidance

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!
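One way to follow this guidance is to avoid writing to the dataframe inside a loop at all: expose the previous row's values as extra columns with `shift`, then derive the new columns from them. The following is only a minimal sketch, assuming the column names from the question; the `with_previous_row` helper and the `prev_` column prefix are made up for illustration.

```python
import pandas as pd


def with_previous_row(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the previous row's values as prev_* columns.

    Hypothetical helper, not part of pandas or of the question's code.
    """
    out = df.copy()
    # shift(1) moves every value down one row, so row i sees row i-1.
    # The first row has no predecessor and gets a missing value.
    for col in ('country_ID', 'direction', 'date'):
        out[f'prev_{col}'] = out[col].shift(1)
    return out


df = pd.DataFrame({
    'country_ID': ['ESP_1', 'ENG', 'ENG'],
    'ID': [0, 0, 0],
    'direction': ['IN', 'IN', 'OUT'],
    'date': pd.to_datetime(['2021-02-28', '2021-03-03', '2021-03-04']),
})
paired = with_previous_row(df)
print(paired[['country_ID', 'prev_country_ID', 'direction', 'prev_direction']])
```

Because the helper works on a copy, the original dataframe is left untouched and every rule can read the current and previous values from the same row.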

suggested

    from typing import Iterable, Tuple

    import pandas as pd

    COLS = ['country_ID', 'ID', 'direction', 'date']
    DATA = [['ESP_1', 0, 'IN', '2021-02-28'],
            ['ENG', 0, 'IN', '2021-03-03'],
            ['ENG', 0, 'OUT', '2021-03-04'],
            ['ESP_2', 0, 'IN', '2021-03-05'],
            ['FRA', 1, 'OUT', '2021-03-07'],
            ['ENG', 1, 'OUT', '2021-03-09'],
            ['ENG', 1, 'OUT', '2021-03-10'],
            ['ENG', 2, 'IN', '2021-03-13']]


    def both_in(country_id: str):
        """where both directions were `IN`"""
        esp, eng, fra = (0, 0, 0)
        if country_id == 'FRA':
            error_code = 0
        elif country_id in ('ESP_1', 'ESP_2'):
            error_code = 2
            fra = 1
        else:
            error_code = 3
        return (esp, eng, fra, f'ERROR:{error_code}')


    def both_out(country_id: str):
        """where both directions were `OUT`"""
        esp, eng, fra = (0, 1, 0)
        if country_id == 'ENG':
            error_code = 8
        elif country_id in ('FRA', 'ESP_2'):
            error_code = 9
            fra = 1
        else:
            error_code = 10
            esp, fra = 1, 1
        return (esp, eng, fra, f'ERROR:{error_code}')


    def in_out(country_id: str):
        """where current was `IN` and previous was `OUT`"""
        esp, eng, fra = (0, 0, 0)
        if country_id == 'ENG':
            error_code = 0
        elif country_id in ('FRA', 'ESP_2'):
            error_code = 4
            fra = 1
        else:
            error_code = 5
            esp = 1
            fra = 1
        return (esp, eng, fra, f'ERROR:{error_code}')


    def out_in(country_id: str):
        """where current was `OUT` and previous was `IN`"""
        esp, eng, fra = (0, 0, 0)
        if country_id == 'ENG':
            error_code = 0
        elif country_id in ('ESP_1', 'ESP_2'):
            error_code = 6
            eng, fra = 1, 1
        else:
            error_code = 7
            eng = 1
        return (esp, eng, fra, f'ERROR:{error_code}')


    def create_columns_analysis(df: pd.DataFrame)->Iterable[Tuple[int,int,int,str,pd.Timestamp]]:
        """create_columns_analysis"""
        # in your logic there are 4 potential driving conditions based on the direction
        # of the current row and the previous row. so we'll make a dictionary that we can
        # index and call the associated functions.
        # keys are (current direction, previous direction)
        direction = {
            ('IN', 'IN'): both_in,
            ('OUT', 'OUT'): both_out,
            ('IN', 'OUT'): in_out,
            ('OUT', 'IN'): out_in,
        }
        # to pair each row with the one before it:
        #   - slice off the last row
        #   - slice off the first row
        #   - and zip the two slices together

        def iter_countires():
            """yields a (ESP,ENG,FRA,ERROR,MEAN_TIME)"""
            list_ids = []
            # because we sliced the first row from the loop, yield that value first

            def first_row(country_id):
                if country_id == 'ENG':
                    return (0, 1, 0, 'ERROR:1', pd.NA)
                return (0, 0, 0, 'ERROR:1', pd.NA)
            yield first_row(df['country_ID'][0])
            for previous, current in zip(df[:-1].itertuples(), df[1:].itertuples()):
                time_delta = current.date-previous.date
                # If it is the first time the ID is detected
                if current.country_ID == 'ENG' and current.ID not in list_ids:
                    list_ids.append(current.ID)
                    yield (0, 1, 0, 'ERROR:1', time_delta)
                elif current.country_ID == 'ENG':
                    # index the dict with (current, previous) direction to get the
                    # function; this order matches the keys and docstrings above
                    conditional_func = (
                        direction[(current.direction, previous.direction)])
                    # call the function and pass previous.country_ID, as that's the
                    # only variable it relies on; unpack those values and tack on
                    # the timedelta
                    yield (*conditional_func(previous.country_ID), time_delta)
                else:
                    # you could create a different conditional func dict if you wanted to
                    # handle country_ID logic differently
                    yield (0, 0, 0, pd.NA, pd.NA)
        return iter_countires()


    def start():
        """start"""
        df = pd.DataFrame(DATA, columns=COLS)
        df['date'] = pd.to_datetime(df['date'])
        df[['visit_ESP', 'visit_ENG', 'visit_FRA', 'error', 'mean_time']]=(
            tuple(create_columns_analysis(df)))
        print(df)


    if __name__ == '__main__':
        start()

result

       country_ID  ID  direction        date  visit_ESP  visit_ENG  visit_FRA    error        mean_time
    0       ESP_1   0         IN  2021-02-28          0          0          0  ERROR:1             <NA>
    1         ENG   0         IN  2021-03-03          0          1          0  ERROR:1  3 days 00:00:00
    2         ENG   0        OUT  2021-03-04          0          0          0  ERROR:0  1 days 00:00:00
    3       ESP_2   0         IN  2021-03-05          0          0          0     <NA>             <NA>
    4         FRA   1        OUT  2021-03-07          0          0          0     <NA>             <NA>
    5         ENG   1        OUT  2021-03-09          0          1          0  ERROR:1  2 days 00:00:00
    6         ENG   1        OUT  2021-03-10          0          1          0  ERROR:8  1 days 00:00:00
    7         ENG   2         IN  2021-03-13          0          1          0  ERROR:1  3 days 00:00:00
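The four helper functions in the suggested code could also be collapsed into a single lookup table, which may be easier to extend when writing the ESP and FRA variants. This is a sketch only: `ENG_RULES` and `lookup_eng` are hypothetical names, the values are transcribed from `both_in`, `both_out`, `in_out` and `out_in`, and the `else` branches are expanded on the assumption that only the four country_ID values from the question ever occur.

```python
from typing import Dict, Tuple

# (visit_ESP, visit_ENG, visit_FRA, error)
Outcome = Tuple[int, int, int, str]

# Keys are (current direction, previous direction, previous country_ID).
# Assumption: country_ID is always one of ENG, FRA, ESP_1, ESP_2.
ENG_RULES: Dict[Tuple[str, str, str], Outcome] = {
    # current IN, previous IN (both_in)
    ('IN', 'IN', 'FRA'): (0, 0, 0, 'ERROR:0'),
    ('IN', 'IN', 'ESP_1'): (0, 0, 1, 'ERROR:2'),
    ('IN', 'IN', 'ESP_2'): (0, 0, 1, 'ERROR:2'),
    ('IN', 'IN', 'ENG'): (0, 0, 0, 'ERROR:3'),
    # current IN, previous OUT (in_out)
    ('IN', 'OUT', 'ENG'): (0, 0, 0, 'ERROR:0'),
    ('IN', 'OUT', 'FRA'): (0, 0, 1, 'ERROR:4'),
    ('IN', 'OUT', 'ESP_2'): (0, 0, 1, 'ERROR:4'),
    ('IN', 'OUT', 'ESP_1'): (1, 0, 1, 'ERROR:5'),
    # current OUT, previous IN (out_in)
    ('OUT', 'IN', 'ENG'): (0, 0, 0, 'ERROR:0'),
    ('OUT', 'IN', 'ESP_1'): (0, 1, 1, 'ERROR:6'),
    ('OUT', 'IN', 'ESP_2'): (0, 1, 1, 'ERROR:6'),
    ('OUT', 'IN', 'FRA'): (0, 1, 0, 'ERROR:7'),
    # current OUT, previous OUT (both_out)
    ('OUT', 'OUT', 'ENG'): (0, 1, 0, 'ERROR:8'),
    ('OUT', 'OUT', 'FRA'): (0, 1, 1, 'ERROR:9'),
    ('OUT', 'OUT', 'ESP_2'): (0, 1, 1, 'ERROR:9'),
    ('OUT', 'OUT', 'ESP_1'): (1, 1, 1, 'ERROR:10'),
}


def lookup_eng(current_dir: str, previous_dir: str, previous_country: str) -> Outcome:
    """Return the outcome for an ENG row whose ID has been seen before."""
    return ENG_RULES[(current_dir, previous_dir, previous_country)]


print(lookup_eng('OUT', 'OUT', 'FRA'))
```

With the rules held as data, a table for ESP or FRA rows would be another dictionary of the same shape, and an unexpected combination raises a `KeyError` instead of silently falling into an `else` branch.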

Thank you very much for your reply. Just a note: the mean_time column only has a value other than NaN if, in this case, the current row is 'OUT', the previous row is 'IN', and both have country_ID == ENG.
– Carola, Commented Mar 26, 2022 at 8:39
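The condition described in this comment can be written as a boolean mask over the current and previous rows. The sketch below is one possible way to do it, not the answer's method; it uses only the first four rows of the example data, and it deliberately ignores the first-time-ID check from the question's code.

```python
import pandas as pd

df = pd.DataFrame({
    'country_ID': ['ESP_1', 'ENG', 'ENG', 'ESP_2'],
    'ID': [0, 0, 0, 0],
    'direction': ['IN', 'IN', 'OUT', 'IN'],
    'date': pd.to_datetime(['2021-02-28', '2021-03-03',
                            '2021-03-04', '2021-03-05']),
})

# Previous row's values, aligned with the current row.
prev = df.shift(1)

# True only where an ENG 'OUT' row directly follows an ENG 'IN' row.
is_eng_visit = (
    (df['country_ID'] == 'ENG') & (df['direction'] == 'OUT')
    & (prev['country_ID'] == 'ENG') & (prev['direction'] == 'IN')
)

# where() keeps the date difference on matching rows and leaves NaT elsewhere.
df['mean_time'] = (df['date'] - prev['date']).where(is_eng_visit)
print(df[['country_ID', 'direction', 'mean_time']])
```

Only the third row satisfies the mask here, so it is the only one that receives a time difference.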
