0
$\begingroup$

I have tick-level data for a single trading day of a specific contract and want to run a time series analysis on it. The mid-price at each tick is computed as $\text{MidPrice} = 0.5 \times (\text{Ask}_1 + \text{Bid}_1)$. The data is then split into fixed-length time intervals (hereafter, bins), and within each bin I calculate the average mid-price, denoted $x_t$. I then compute log returns as $r_t = \log(x_t / x_{t-1})$ and apply a rolling window to the series $\{r_t, r_{t-1}, \dots, r_1\}$ to fit an ARIMA model that forecasts the next interval's return, $r_{t+1}$. I initially considered this methodology sound, but my supervisor pointed out a possible information leak: the computation of $r_t$ involves $x_t$, and the target $r_{t+1} = \log(x_{t+1} / x_t)$ also contains $x_t$, which would contaminate the training process with future information.
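To make explicit which observations each fit sees, the rolling scheme can be sketched roughly as below. This is only an illustration: an AR(1) fitted by ordinary least squares stands in for the ARIMA model, and `fit_ar1`, `rolling_forecasts` and `window_size` are hypothetical names introduced here.

```python
import numpy as np

def fit_ar1(window):
    """Least-squares fit of r_k = c + phi * r_{k-1} + e_k on one window."""
    y = window[1:]
    X = np.column_stack([np.ones(len(y)), window[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, y, rcond=None)
    return c, phi

def rolling_forecasts(r, window_size):
    """One-step-ahead forecasts of r[t+1] using only r[t-window_size+1 .. t].

    Returns (forecasts, targets); forecasts[i] is computed without
    access to targets[i].
    """
    r = np.asarray(r, dtype=float)
    forecasts, targets = [], []
    for t in range(window_size - 1, len(r) - 1):
        window = r[t - window_size + 1 : t + 1]  # ends at r[t], inclusive
        c, phi = fit_ar1(window)
        forecasts.append(c + phi * window[-1])   # prediction for r[t+1]
        targets.append(r[t + 1])                 # realised only afterwards
    return np.array(forecasts), np.array(targets)
```

A simple check on such a loop is to overwrite the final return with an arbitrary value and confirm that no forecast changes. If one did, the window would be reaching into the future.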

As a benchmark for the ARIMA model, I forecast the mid-price of the next bin by linear extrapolation, $\hat{x}_{t+1} = x_t + (x_t - x_{t-1})$, and compute the benchmark log return as $\hat{r}_{t+1} = \log(\hat{x}_{t+1} / x_t)$. I then measure accuracy as the fraction of bins in which the sign of the benchmark log return matches the sign of the actual log return $r_{t+1} = \log(x_{t+1} / x_t)$.
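The benchmark accuracy can be computed along the following lines. This is a minimal sketch, assuming the bin prices are positive and that the extrapolated price stays positive; `benchmark_sign_accuracy` is a hypothetical name, and a zero return counts as a match only when both returns are zero.

```python
import numpy as np

def benchmark_sign_accuracy(x):
    """Share of bins where the naive extrapolation gets the return sign right.

    x: 1-D sequence of positive bin prices x_1, ..., x_T.
    """
    x = np.asarray(x, dtype=float)
    x_hat = 2.0 * x[1:-1] - x[:-2]      # x_t + (x_t - x_{t-1})
    r_hat = np.log(x_hat / x[1:-1])     # benchmark log return for t+1
    r_true = np.log(x[2:] / x[1:-1])    # realised log return for t+1
    return float(np.mean(np.sign(r_hat) == np.sign(r_true)))
```

Since $\hat{x}_{t+1} / x_t > 1$ exactly when $x_t > x_{t-1}$, this benchmark amounts to predicting that the next return has the same sign as the current one.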

My supervisor suggested that instead of using the average mid-price of each bin, i.e. $x_t = \frac{1}{N_t}\sum_{i \in \text{bin } t} \text{mid-price}_i$, where $N_t$ is the number of ticks in bin $t$, I should use the last tick's mid-price, i.e. $x_t = \text{mid-price}_{\text{last tick of bin } t}$. I implemented both approaches to compare the results.
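The two constructions of $x_t$ can be produced side by side roughly as follows. This is a sketch under assumptions: the ticks sit in a pandas DataFrame indexed by timestamp, the column names `bid1` and `ask1` and the one-minute bin length are placeholders, and `bin_prices` and `log_returns` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

def bin_prices(ticks, freq="1min"):
    """Return the mean and the last mid-price of each time bin.

    ticks: DataFrame indexed by timestamp with columns 'bid1' and 'ask1'
           (placeholder column names).
    """
    mid = 0.5 * (ticks["ask1"] + ticks["bid1"])
    bins = mid.resample(freq).agg(["mean", "last"])
    return bins.dropna()                 # drop bins that contain no ticks

def log_returns(x):
    """r_t = log(x_t / x_{t-1}) for a price Series x."""
    return np.log(x / x.shift(1)).dropna()
```

With this, `log_returns(bins["mean"])` and `log_returns(bins["last"])` give the two return series being compared.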

With the average mid-price of each bin, the benchmark accuracy is 57.63%, while the ARIMA forecasting accuracy is 60.21%. The difference of 2.58 percentage points between these two results seems to confirm my supervisor's concern.

For the last tick's mid-price of each bin, the benchmark accuracy is 48.51%, while the ARIMA forecasting accuracy is 48.96%.

The difference in accuracy between the two constructions is large. Does my method suffer from information leakage?

$\endgroup$
6
  • 2
    $\begingroup$ I'm not sure what your professor means because, at time $(t+1)$, $x_{t}$ is not future information. $\endgroup$ Commented Jun 26 at 14:43
  • $\begingroup$ @markleeds He means that I want to predict $r_{t+1} = \log(x_{t+1}/x_t)$, while my training dataset's $r_t = \log(x_t/x_{t-1})$ contains information of $x_t$, which can leak information. $\endgroup$ Commented Jun 30 at 8:12
  • $\begingroup$ Hi user398843: $\log(\frac{x_{T+1}}{x_{T}}) = \log(x_{T+1}) - \log(x_{T})$, so if time $T$ is the border line where training stops, then returns from $T$ onwards ALWAYS depend on what the price was in the past, but that's not leaking information. It's just computing how much the price increased since time $T$. The information in the training set is being used to compute the performance in the non-trained dataset. If you don't use that, then you won't know what the performance was. $\endgroup$ Commented Jun 30 at 10:35
  • $\begingroup$ user398843: I didn't find my comment above very insightful so here's a better way to think about it. This comment assumes that, by information leakage, your professor means that you are taking information in the training set and using it as valid information in the non-training data set. This is not what is being done here. All you are doing here is using the information in the training set to calculate the return performance in the non-trained data set. But, as far as I can see, $x_t$ is not being used in any other way. It's like an anchor for performance one step ahead. $\endgroup$ Commented Jun 30 at 20:25
  • $\begingroup$ @markleeds Thank you for your comments. I greatly appreciate it! Do you know why there is a large difference in the accuracy of our prediction when using "the average mid-price of each bin" versus "the last tick's mid-price of each bin" as $x_t$? $\endgroup$ Commented Jul 2 at 7:49

1 Answer

2
$\begingroup$

Although the target that will eventually be realised, $$ r_{t+1} = \log\!\left(\frac{x_{t+1}}{x_t}\right), $$ also contains $x_t$, this does not contaminate training because $x_t$ is already part of the information set $\mathcal{F}_t$ available to any real-world trader at the end of interval $t$.
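On the follow-up about why the two constructions of $x_t$ score so differently, a small simulation may help separate the effect of the bin construction from any leakage. The sketch below is an assumption-laden toy, not the contract data: ticks follow a driftless Gaussian random walk, so nothing is predictable by design, and `simulate`, `sign_momentum_accuracy` and all parameter values are hypothetical.

```python
import numpy as np

def sign_momentum_accuracy(x):
    """Accuracy of predicting sign(x_{t+1} - x_t) with sign(x_t - x_{t-1})."""
    d = np.diff(x)
    return float(np.mean(np.sign(d[1:]) == np.sign(d[:-1])))

def simulate(n_bins=20000, ticks_per_bin=50, seed=0):
    rng = np.random.default_rng(seed)
    # Driftless random walk: tick-to-tick changes are unpredictable by design.
    steps = rng.normal(0.0, 0.001, n_bins * ticks_per_bin)
    ticks = 100.0 + np.cumsum(steps)
    by_bin = ticks.reshape(n_bins, ticks_per_bin)
    x_mean = by_bin.mean(axis=1)   # average mid-price per bin
    x_last = by_bin[:, -1]         # last tick's mid-price per bin
    return sign_momentum_accuracy(x_mean), sign_momentum_accuracy(x_last)

acc_mean, acc_last = simulate()
print(f"mean-based bins: {acc_mean:.3f}, last-tick bins: {acc_last:.3f}")
```

Under this assumption one would expect the last-tick series to score close to 50% and the mean-based series noticeably higher. Averaging a random walk within bins is known to induce positive first-order autocorrelation in the differences of the averages, approaching 0.25 when there are many ticks per bin. A gap of that kind would come from how $x_t$ is built rather than from $x_t$ appearing in both $r_t$ and $r_{t+1}$.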

$\endgroup$
1
  • $\begingroup$ Thank you for your answer. I have added some new information; could you take a look at why there are differences in prediction accuracy when different methods of constructing $x_t$ are used? $\endgroup$ Commented Jun 28 at 13:56
