I work at a relatively large Swedish retail company, where I am running an initial linear regression to understand the link between the dependent variable, store sales (number of transactions), and the following predictors: traffic (number of people entering the store), staffing (number of employees in the store), and intraday variability in traffic (calculated as the deviation from the daily mean for each store). I have hourly data points for these variables from about 200 stores during 2017, but I form daily averages (per hour) to get more robust statistics. My variables are then:
atran: Average number of transactions per hour in store s during day d
atraf: Average number of customers per hour in store s during day d
astaf: Average number of staff working hours per hour in store s during day d
trafvar: Traffic variability for store s during day d
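A minimal pandas sketch of the aggregation described above, on made-up hourly data. The column names (store, date, hour, transactions, traffic, staff_hours) are hypothetical, and trafvar is assumed here to be the standard deviation of hourly traffic within the day; the exact definition used in the analysis may differ.

```python
import numpy as np
import pandas as pd

# Made-up hourly records: 2 stores x 2 days x 8 opening hours.
rng = np.random.default_rng(1)
idx = pd.MultiIndex.from_product(
    [[1, 2], ["2017-01-02", "2017-01-03"], range(10, 18)],
    names=["store", "date", "hour"],
)
hourly = idx.to_frame(index=False)
hourly["transactions"] = rng.poisson(30, len(hourly))
hourly["traffic"] = rng.poisson(90, len(hourly))
hourly["staff_hours"] = rng.integers(3, 8, len(hourly))

# One row per store and day: hourly averages plus a traffic-variability measure.
# ASSUMPTION: trafvar = within-day standard deviation of hourly traffic.
daily = (
    hourly.groupby(["store", "date"])
    .agg(
        atran=("transactions", "mean"),
        atraf=("traffic", "mean"),
        astaf=("staff_hours", "mean"),
        trafvar=("traffic", "std"),
    )
    .reset_index()
)
print(daily)
```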
I then regress atran on the predictor variables (atraf, astaf and trafvar), including a number of nonlinear terms that I find reasonable. I also include dummies for each store, the 7 weekdays, each month, and a number of events (holidays, promotions, etc.) that might affect our sales. The methodology is heavily influenced by "Effect of Traffic on Sales and Conversion Rates of Retail Stores", O. Perdikaki, S. Kesavan & J. M. Swaminathan, Manufacturing & Service Operations Management, 14 (1), 2012.
So much for the background; on to my question. For my numerical features, the two statsmodels APIs (numerical and formula) give different coefficients, see below. However, this only happens when the astaf^2 x atraf^2 interaction term is included, as seen further down, where the regressions are compared without that variable. Incidentally, that variable is also the only one with a "high" p-value, however one wants to interpret that. I should also say that the nonlinear terms for the OLS API are generated by simple multiplication of the pandas DataFrame columns, with no centering or anything else fancy.
Does anyone have a clue to what's going on here?
Thanks, Robert
Modeling with all variables included:
Statsmodels ols formula API
R^2: 0.9139

Variable                         Coeff   P-value
---------------------------  ---------  --------
atraf                         0.335907  0.000000
I(atraf ** 2)                -0.000553  0.000000
astaf                        -0.739342  0.000000
I(astaf ** 2)                 0.048491  0.000000
astaf:atraf                   0.023846  0.000000
astaf:I(atraf ** 2)           0.000030  0.000000
I(astaf ** 2):atraf          -0.001631  0.000000
I(astaf ** 2):I(atraf ** 2)   0.000000  0.129057
trafvar                      -0.040479  0.000000

Statsmodels OLS API
R^2: 0.9139

Variable           Coeff   P-value
-------------  ---------  --------
atraf           0.341431  0.000000
atraf2         -0.000571  0.000000
astaf          -0.738928  0.000000
astaf2          0.057339  0.000003
astaf_atraf     0.022066  0.000000
astaf_atraf2    0.000036  0.000000
astaf2_atraf   -0.001534  0.000000
astaf2_atraf2  -0.000014  0.235355
trafvar        -0.040439  0.000000

Modeling without astaf^2 x atraf^2-term:
Statsmodels ols formula API
R^2: 0.9139

Variable                 Coeff   P-value
-------------------  ---------  --------
atraf                 0.340940  0.000000
I(atraf ** 2)        -0.000569  0.000000
astaf                -0.662886  0.000000
I(astaf ** 2)         0.043986  0.000000
astaf:atraf           0.022067  0.000000
astaf:I(atraf ** 2)   0.000035  0.000000
I(astaf ** 2):atraf  -0.001502  0.000000
trafvar              -0.040441  0.000000

Statsmodels OLS API
R^2: 0.9139

Variable          Coeff   P-value
------------  ---------  --------
atraf          0.340940  0.000000
atraf2        -0.000569  0.000000
astaf         -0.662886  0.000000
astaf2         0.043986  0.000000
astaf_atraf    0.022067  0.000000
astaf_atraf2   0.000035  0.000000
astaf2_atraf  -0.001502  0.000000
trafvar       -0.040441  0.000000