$\begingroup$

I am trying to model 2020 sales as a function of 2020 calls, where calls take the values (0, 1, 1.5, 2, 2.5, 3, 3.5, ..., 19, 19.5, 20). The scatter plot shows no linear relationship between the variables, and even after scaling both of them the $R^2$ is 0.003.

[Figure: scatter plot of Calls (x-axis) vs. Sales]

My goal is to establish some sort of relationship between the two variables. I have tried weights and a log transformation, but neither improves the fit significantly.

$\endgroup$
  • $\begingroup$ It sounds like your underlying data may be time series. Would time series plots be helpful? Also, what is your goal? Prediction, inference, explanation, variable importance - what will you do with the end result of your analysis? $\endgroup$ Commented Jul 27, 2022 at 7:47
  • $\begingroup$ There are other variables, mostly sales from previous years, so time series components are present, but my specific goal right now is to check the influence of calls on sales in the corresponding year. Once the relationship is established, I would like to predict future values: say I have calls data for 2022, then predict sales for 2022. I hope that clears it up a little. $\endgroup$ Commented Jul 27, 2022 at 7:52
  • $\begingroup$ Hm. If you are interested in forecasting, then I would almost always look at the time series in conjunction with any explanatory variables. It may well be that calls are explanatory once you account for time series dynamics. This may be useful. $\endgroup$ Commented Jul 27, 2022 at 8:01
  • $\begingroup$ For the time being, I need to develop a relationship between the two variables. Should I apply any kind of transformation to the calls data? I know it is frowned upon to transform your IV, but could it help bring out some kind of relationship? $\endgroup$ Commented Jul 27, 2022 at 8:36

1 Answer

$\begingroup$

Per the comments above, in a forecasting context, I would always look at the time series and its associated plots first, and consider any explanatory variables only as a second step.

Transformations of the IV are no problem at all. I would consider splines.

However, there is very little visible relationship between your calls and your sales. If anything, it seems like sales may be highest for calls around 2 or 3. (Whatever a fractional call is.) Splines may be able to capture this relationship. I recommend:

  1. Do not consider $R^2$, which is not a good indicator of prediction accuracy. It's a measure of in-sample fit, which is notoriously misleading in terms of prediction accuracy.
  2. Do consider a holdout sample. In your time series context, this should be taken from the end of the time series.
  3. Compare your model(s) (with a linear predictor of calls, or a spline transformation) against the very simple historical mean model, and perhaps a simple automatically chosen time series forecasting model. These can be surprisingly hard to beat.
  4. Learn more about the relationship between calls and sales. If these are calls made by salespeople, then it stands to reason that they might make more calls in situations where they think sales are more likely. That is, calls are endogenous. If so, it would not make a lot of sense to tell them to increase the number of calls to drive sales, at least not in the context of this model.
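
Points 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's data or a definitive implementation: the data are synthetic, the rows are assumed to be in time order, the spline is a plain cubic truncated-power basis fitted by least squares, and the knot locations, holdout size, and helper names (`spline_basis`, `compare_on_holdout`) are arbitrary choices made for the example. The "simple automatically chosen time series forecasting model" from point 3 is not included here.

```python
import numpy as np


def linear_basis(x):
    """Design matrix with an intercept and a linear term in calls."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x])


def spline_basis(x, knots):
    """Cubic truncated-power spline basis: 1, x, x^2, x^3, (x - k)^3_+ per knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)


def fit_predict(basis_train, y_train, basis_test):
    """Ordinary least squares fit on the training rows, prediction on the test rows."""
    coef, *_ = np.linalg.lstsq(basis_train, y_train, rcond=None)
    return basis_test @ coef


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def compare_on_holdout(calls, sales, n_holdout, knots):
    """Holdout MAE per model. Assumes rows are in time order and n_holdout >= 1;
    the holdout is the LAST n_holdout rows, as recommended for time series."""
    calls = np.asarray(calls, dtype=float)
    sales = np.asarray(sales, dtype=float)
    c_tr, c_te = calls[:-n_holdout], calls[-n_holdout:]
    y_tr, y_te = sales[:-n_holdout], sales[-n_holdout:]
    preds = {
        "historical_mean": np.full(n_holdout, y_tr.mean()),
        "linear": fit_predict(linear_basis(c_tr), y_tr, linear_basis(c_te)),
        "spline": fit_predict(
            spline_basis(c_tr, knots), y_tr, spline_basis(c_te, knots)
        ),
    }
    return {name: mae(y_te, p) for name, p in preds.items()}


# Synthetic data for illustration only (NOT the data from the question):
# a weak bump in sales around 2 to 3 calls, buried in noise.
rng = np.random.default_rng(0)
grid = np.concatenate(([0.0], np.arange(1.0, 20.5, 0.5)))  # 0, 1, 1.5, ..., 20
calls = rng.choice(grid, size=300)
sales = 50 + 10 * np.exp(-0.5 * (calls - 2.5) ** 2) + rng.normal(0, 15, size=300)

results = compare_on_holdout(calls, sales, n_holdout=60, knots=[2.0, 5.0, 10.0])
for name, err in results.items():
    print(f"{name:>16}: holdout MAE = {err:.2f}")
```

If neither the linear nor the spline model has a clearly lower holdout error than the historical mean, calls are probably not adding predictive value in this setup, whatever the in-sample $R^2$ says.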
$\endgroup$
  • $\begingroup$ Historically, sales plotted against calls show an S-shaped curve, as with most sales vs. calls data. Unfortunately, the data I am working with is affected by COVID. However, would it be advisable to use a negative binomial or a Poisson regression model to establish any kind of relationship? $\endgroup$ Commented Jul 27, 2022 at 9:14
