4
\$\begingroup\$

The function serves two purposes as follows:

  1. When dealing with the day of the week, Python will always assign 0 to Sunday and 6 to Saturday and I want to change that. The first part of the code is changing the start of the week from Sunday to users preference, i.e. Monday or Tuesday.

  2. The second section is assigning the correct week number based on the new day of the week. For example, the new day of the week is Tuesday and the week assignment should be from Tuesday to Monday instead of Sunday to Saturday by default.

The code works (obviously), but I want to optimize it. The current code takes about 30 seconds for a dataset with over 2000 rows, which don't seem to be normal. The delay is mostly in the loop, I can't seem to optimize the loop because it's assigning week number based on a condition. The condition is simple, assign week number until it reaches a new merchant then the week resets.

def date_manipulate(df,startday): df['Month']=df.index.strftime("%B") df['DOW']=df.index.strftime("%A") week_offset = collections.OrderedDict() default_week =['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'] temp_week = default_week[startday:] + default_week[0:startday] for index, day in enumerate(temp_week): week_offset[day] = index df.replace({"DOW":week_offset},inplace=True) df=df.reset_index().sort_values(['merchant_name', 'date'],ascending=['True','True']).set_index(['date']) df['temp_date']=df.index.map(lambda x: x.toordinal()) current_merchant=df['merchant_name'][0] offset=list(week_offset.values())[df['DOW'][0]] startdate=df['temp_date'][0]-(offset) enddate=startdate+6 week=1 df['week']=0 df.reset_index(inplace=True) for counter, day in enumerate (df['temp_date']): if df['merchant_name'][counter]==current_merchant: if(df['temp_date'][counter])<=enddate: df.loc[counter,'week']=week else: enddate+=7 week+=1 df.loc[counter,'week']=week else: offset=list(week_offset.values())[df['DOW'][counter]] print(offset) current_merchant=df.loc[counter,'merchant_name'] startdate=df.loc[counter,'temp_date']-(offset) enddate=startdate+6 week=1 df.loc[counter,'week']=week df.set_index('date',inplace=True) return df 
\$\endgroup\$
1
  • 1
    \$\begingroup\$ can you add an example of input and expeced output? \$\endgroup\$ Commented Mar 19, 2018 at 16:31

1 Answer 1

2
\$\begingroup\$

The code doesn't run because the current version of Pandas sort_values needs to accept booleans for ascending. (But those can be omitted entirely because True is the default.)

More broadly: the function fails to declare its signature types, mutates df when it shouldn't, is far too complex, includes several instances of repeated expressions that should be de-duplicated, fails to vectorise, and inappropriately returns a result containing a temporary column.

week is baffling. You claim

The code works (obviously)

I really wouldn't be so sure. week is not the week of the month, nor the week of the year, nor (as one may guess from examining the output) the number of weeks from the first week seen per merchant group. The latter would be calculated via something like

merchant_timedelta = result2.index - ( result2.reset_index() .groupby('merchant_name')['date'] .transform('first') .set_axis(result2.index) ) merchant_weeks = merchant_timedelta // pd.Timedelta(1, unit='W') 

so the original implementation is either simply wrong, or poorly-described and poorly-calculated. I ignore it, which reduces a refactored version to

def date_manipulate_refactored(df: pd.DataFrame, start_day: int) -> pd.DataFrame: result = df.sort_values(by=['merchant_name', 'date']) return result.assign( Month=result.index.strftime('%B'), DOW=(result.index.dayofweek + 1 - start_day) % 7, ) 

This by the way fixes the eight-some warnings that the original code throws:

189787.py:15: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)` df.replace({"DOW": week_offset}, inplace=True) 189787.py:21: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]` current_merchant = df['merchant_name'][0] ... 

This can be tested via something like

import collections import numpy as np import pandas as pd def date_manipulate(df, startday): df['Month'] = df.index.strftime("%B") df['DOW'] = df.index.strftime("%A") week_offset = collections.OrderedDict() default_week = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'] temp_week = default_week[startday:] + default_week[0:startday] for index, day in enumerate(temp_week): week_offset[day] = index df.replace({"DOW": week_offset}, inplace=True) df=df.reset_index().sort_values( ['merchant_name', 'date'], ascending=[True, True], ).set_index(['date']) df['temp_date'] = df.index.map(lambda x: x.toordinal()) current_merchant = df['merchant_name'][0] offset = list(week_offset.values())[df['DOW'][0]] startdate = df['temp_date'][0]-(offset) enddate = startdate+6 week=1 df['week']=0 df.reset_index(inplace=True) for counter, day in enumerate (df['temp_date']): if df['merchant_name'][counter]==current_merchant: if(df['temp_date'][counter])<=enddate: df.loc[counter,'week']=week else: enddate+=7 week+=1 df.loc[counter,'week']=week else: offset=list(week_offset.values())[df['DOW'][counter]] current_merchant=df.loc[counter,'merchant_name'] startdate=df.loc[counter,'temp_date']-(offset) enddate=startdate+6 week=1 df.loc[counter,'week']=week df.set_index('date',inplace=True) return df def date_manipulate_refactored(df: pd.DataFrame, start_day: int) -> pd.DataFrame: result = df.sort_values(by=['merchant_name', 'date']) return result.assign( Month=result.index.strftime('%B'), DOW=(result.index.dayofweek + 1 - start_day) % 7, ) def demo() -> None: rand = np.random.default_rng(seed=0) dates = pd.date_range( name='date', start='2020-01-01T00:00', end='2020-04-01T00:00', freq='12h', ) merchant_points = rand.choice( a=rand.integers( low=ord('a'), high=1 + ord('z'), size=(20, 7), ), size=len(dates), ) df = pd.DataFrame( index=dates, data={ 'merchant_name': [ ''.join(chr(x) for x in line) for line in merchant_points ], }, ) result = date_manipulate(df=df.copy(), startday=5) result2 = date_manipulate(df, startday=5) pd.testing.assert_frame_equal( left=result[['merchant_name', 'Month', 'DOW']], right=result2[['merchant_name', 'Month', 'DOW']], ) if __name__ == '__main__': demo() 
\$\endgroup\$
4
  • \$\begingroup\$ DRY??? df.loc[counter,'week']=week is (unnecessarily) repeated 3x... \$\endgroup\$ Commented Jul 17 at 21:02
  • \$\begingroup\$ @Fe2O3 yes - in OP's code, which I reproduced (nearly) verbatim for regression testing. That isn't something I chose. \$\endgroup\$ Commented Jul 17 at 22:35
  • \$\begingroup\$ Just thought pointing out repetition to OP might lead to slightly better code... Cheers! :-) \$\endgroup\$ Commented Jul 17 at 22:51
  • 1
    \$\begingroup\$ @Fe2O3 You're right. Whereas the refactor deletes those instances, it's still worth calling out in the summary. \$\endgroup\$ Commented Jul 19 at 16:03

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.