I have tick-level data for a single trading day of a specific contract and aim to conduct time series analysis on it. The mid-price at each tick is computed as $\text{MidPrice} = 0.5 \times (\text{Ask}_1 + \text{Bid}_1)$. The data is then segmented into fixed-length time intervals (hereafter referred to as bins), within which I calculate the average mid-price, denoted $x_t$. I then compute the log returns $r_t = \log(x_t / x_{t-1})$ and apply a rolling window to the series $\{r_t, r_{t-1}, \ldots, r_1\}$ to fit an ARIMA model that forecasts the next interval's return, $r_{t+1}$. While I initially considered this methodology sound, my supervisor pointed out a possible issue of information leakage: the computation of $r_t$ involves $x_t$, while the predicted value $r_{t+1} = \log(x_{t+1} / x_t)$ also contains $x_t$, thereby contaminating the training process with future information.
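For concreteness, the data-preparation steps above (mid-price per tick, fixed-length bins, per-bin average, log returns) can be sketched as follows. This is a minimal sketch, not my exact code: the column names `ask1`/`bid1` and the `1min` bin length are assumptions, and the resulting return series is what each rolling window would be fed into an ARIMA fit (e.g. via `statsmodels.tsa.arima.model.ARIMA`).

```python
import numpy as np
import pandas as pd

def bin_mid_prices(ticks: pd.DataFrame, freq: str = "1min") -> pd.Series:
    """Average mid-price per fixed-length bin.

    `ticks` is assumed to have a DatetimeIndex and columns 'ask1'/'bid1'
    holding the best ask/bid at each tick (hypothetical names).
    """
    mid = 0.5 * (ticks["ask1"] + ticks["bid1"])  # MidPrice = 0.5*(Ask1+Bid1)
    return mid.resample(freq).mean().dropna()    # x_t = bin average

def log_returns(x: pd.Series) -> pd.Series:
    """r_t = log(x_t / x_{t-1}); the first bin has no return."""
    return np.log(x / x.shift(1)).dropna()
```

Each rolling window of the `log_returns` output is then one training set for a one-step-ahead forecast of $r_{t+1}$.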
I set a benchmark to compare the ARIMA results against: a naive linear extrapolation in which the predicted mid-price of the next bin is $\hat{x}_{t+1} = x_t + (x_t - x_{t-1})$, and the predicted log return is accordingly $\hat{r}_{t+1} = \log(\hat{x}_{t+1} / x_t)$. I then computed the directional accuracy by comparing the signs of the actual log return $r_{t+1}$ and the benchmark log return $\hat{r}_{t+1}$.
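The benchmark's sign accuracy can be computed as below. A sketch under the assumption that `x` is the array of per-bin mid-prices; note that $\operatorname{sign}(\hat{r}_{t+1}) = \operatorname{sign}(x_t - x_{t-1}) = \operatorname{sign}(r_t)$, so this benchmark is equivalent to predicting that the next return has the same sign as the current one:

```python
import numpy as np

def benchmark_sign_accuracy(x) -> float:
    """Directional accuracy of the naive extrapolation x_hat[t+1] = 2*x[t] - x[t-1].

    Predicted return:  log(x_hat[t+1] / x[t])
    Realised return:   log(x[t+1] / x[t])
    """
    x = np.asarray(x, dtype=float)
    pred = np.log((2.0 * x[1:-1] - x[:-2]) / x[1:-1])  # benchmark r_hat_{t+1}
    actual = np.log(x[2:] / x[1:-1])                   # realised r_{t+1}
    return float(np.mean(np.sign(pred) == np.sign(actual)))

# e.g. benchmark_sign_accuracy([1.0, 2.0, 3.0, 2.0]) -> 0.5
# (one continuation predicted correctly, one reversal missed)
```

On a pure random walk this benchmark should score about 50%, which is why a value well above 50% is informative.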
My supervisor suggested that instead of using the average mid-price of each bin, i.e. $x_t = \frac{1}{\#\{\text{ticks in bin}\}} \sum_{i \in \text{ticks in bin}} \text{mid-price}_i$, I should use the last tick's mid-price, i.e. $x_t = \text{mid-price}_{\text{last tick of the bin}}$. I implemented both approaches to compare the results.
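Switching between the two bin definitions is just a change of aggregation function in the resampling step. A sketch, again assuming `ask1`/`bid1` column names:

```python
import pandas as pd

def bin_mid_price(ticks: pd.DataFrame, freq: str = "1min",
                  how: str = "last") -> pd.Series:
    """Per-bin mid-price series.

    how='mean': average mid-price over all ticks in the bin.
    how='last': mid-price of the last tick in the bin (supervisor's suggestion).
    """
    mid = 0.5 * (ticks["ask1"] + ticks["bid1"])
    binned = mid.resample(freq)
    return (binned.mean() if how == "mean" else binned.last()).dropna()
```

The last-tick version samples a single point per bin, whereas the bin average smooths over all ticks, which mechanically induces positive autocorrelation between consecutive returns.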
For the average mid-price of each bin, the benchmark accuracy is 57.63%, while the ARIMA forecasting accuracy is 60.21%. Both figures are well above 50%, and the 2.58-percentage-point difference between them seems to confirm my supervisor's concern.
For the last tick's mid-price of each bin, the benchmark accuracy is 48.51%, while the ARIMA forecasting accuracy is 48.96%.
The difference in accuracy between the two binning schemes is large. Does my method have the problem of information leakage?