I have the following dataframe:
| country_ID | ID | direction | date |
|---|---|---|---|
| ESP_1 | 0 | IN | 2021-02-28 |
| ENG | 0 | IN | 2021-03-03 |
| ENG | 0 | OUT | 2021-03-04 |
| ESP_2 | 0 | IN | 2021-03-05 |
| FRA | 1 | OUT | 2021-03-07 |
| ENG | 1 | OUT | 2021-03-09 |
| ENG | 1 | OUT | 2021-03-10 |
| ENG | 2 | IN | 2021-03-13 |
I have implemented the following functionality:
ef create_columns_analysis(df): df['visit_ESP'] = 0 df['visit_ENG'] = 0 df['visit_FRA'] = 0 list_ids = [] for i in range(len(df)): if df.loc[i,'country_ID'] == 'ENG': country_ID_ENG(df, i, list_ids) else: # case country_ID = {FRA, ESP_1, ESP_2} # other methods not specified return df For each row with a specific country_ID, a similarly structured function is applied.
I would like to optimise or simplify the code of the country_ID_ENG function. The country_ID_ENG function is defined as follows:
def country_ID_ENG(df, i, list_ids): # If it is the first time the ID is detected if df.loc[i,'ID'] not in list_ids: # It adds up to one visit regardless of the direction of the ID df.loc[i,'visit_ENG'] = 1 # Add the ID to the read list list_ids.append(df.loc[i, 'ID']) # Assigns the error column a start message df.loc[i,'error'] = 'ERROR:1' # If it is not the first time it detects that ID else: # Saves the information of the previous row prev_row = df.loc[i-1] # If the current row direction is 'IN' if df.loc[i,'direction'] == 'IN': # Add a visit df.loc[i,'visit_ENG'] = 1 # Behaviour dependent on the previous row # If the current row direction is 'IN' and previous row is 'IN' if prev_row['direction'] == 'IN': if prev_row['country_ID'] == 'FRA': df.loc[i,'error'] = 'ERROR:0' elif prev_row['country_ID'] in ['ESP_1','ESP_2']: df.loc[i,'error'] = 'ERROR:2' df.loc[i,'visit_FRA'] = 1 else: df.loc[i,'error'] = 'ERROR:3' # If the current row direction is 'IN' and previous row is 'OUT' else: if prev_row['country_ID'] == 'ENG': df.loc[i,'error'] = 'ERROR:0' elif prev_row['country_ID'] in ['FRA','ESP_2']: df.loc[i,'error'] = 'ERROR:4' df.loc[i,'visit_FRA'] = 1 else: df.loc[i,'error'] = 'ERROR:5' df.loc[i,'visit_ESP'] = 1 df.loc[i,'visit_FRA'] = 1 # If the current row direction is 'OUT' else: # If the current row direction is 'OUT' and previous row is 'IN' if prev_row['direction'] == 'IN': # If it detects an output before an input of the same 'country_ID', # it calculates the visit time if prev_row['country_ID'] == 'ENG': df.loc[i,'mean_time'] = df.loc[i,'date']-prev_row['date'] df.loc[i,'error'] = 'ERROR:0' elif prev_row['country_ID'] in ['ESP_1','ESP_2']: df.loc[i,'error'] = 'ERROR:6' df.loc[i,'visit_FRA'] = 1 df.loc[i,'visit_ENG'] = 1 else: df.loc[i,'error'] = 'ERROR:7' df.loc[i,'visit_ENG'] = 1 # If the current row direction is 'OUT' and previous row is 'OUT' else: df.loc[i,'visit_ENG'] = 1 if prev_row['country_ID'] == 'ENG': df.loc[i,'error'] = 'ERROR:8' elif prev_row['country_ID'] in ['FRA','ESP_2']: df.loc[i,'error'] = 'ERROR:9' df.loc[i,'visit_FRA'] = 1 else: df.loc[i,'error'] = 'ERROR:10' df.loc[i,'visit_ESP'] = 1 df.loc[i,'visit_FRA'] = 1 The above function uses the information from the current row and the previous row (if any) to create new columns for visit_ENG, visit_ESP, visit_FRA, mean_time and error.
For the example dataframe the function, applying the function country_ID_ENG to rows whose country_ID is equal to ENG, should return the following result:
| country_ID | ID | direction | date | visit_ENG | visit_FRA | visit_ESP | mean_time | error |
|---|---|---|---|---|---|---|---|---|
| ESP_1 | 0 | IN | 2021-02-28 | 0 | 0 | 0 | NaN | NaN |
| ENG | 0 | IN | 2021-03-03 | 0 | 1 | 0 | NaN | ERROR:2 |
| ENG | 0 | OUT | 2021-03-04 | 0 | 0 | 0 | 1 days | ERROR:0 |
| ESP_2 | 0 | IN | 2021-03-05 | 0 | 0 | 0 | NaN | NaN |
| FRA | 1 | OUT | 2021-03-07 | 0 | 0 | 0 | NaN | NaN |
| ENG | 1 | OUT | 2021-03-09 | 1 | 1 | 0 | NaN | ERROR:9 |
| ENG | 1 | OUT | 2021-03-10 | 1 | 0 | 0 | NaN | ERROR:8 |
| ENG | 2 | IN | 2021-03-13 | 1 | 0 | 0 | NaN | ERROR:1 |
The function is very long, and the other functions for rows with country_ID equal to ESP or FRA will have the same complexity. I would like you to help me to simplify or optimise the code of this function to also take it into account when defining the country_ID_ESP and country_ID_FRA functions. I appreciate your help.
prev_row['country_ID'] == 'ENG'andprev_row['country_ID'] == 'FRA'anERROR:0. \$\endgroup\$elsecan't only have a comment: Python requires at least apass. \$\endgroup\$-means NaN/None, your first row has zeros invisit_ESP,visit_ENGandvisit_FRA, but you've shown-. \$\endgroup\$