This is a follow-up to my previous question.
My task is as follows. I have time series data, and I want to train on each run of 3 consecutive days to predict the following (4th) day. Each day's data is one CSV file of dimension 24x25, and every data point in a CSV file is a pixel.
First, I need to predict day4 (the 4th day) using day1, day2, and day3 (the three preceding consecutive days) as training data, and then calculate the MSE between the predicted day4 data and the original day4 data. Let's call it mse1.
Similarly, I need to predict day5 (the 5th day) using day2, day3, and day4 as training data, and then calculate mse2 (the MSE between the predicted day5 data and the original day5 data).
Then I need to predict day6 (the 6th day) using day3, day4, and day5 as training data, and calculate mse3 (the MSE between the predicted day6 data and the original day6 data).
..........
Finally, I want to predict day93 using day90, day91, and day92 as training data, and calculate mse90 (the MSE between the predicted day93 data and the original day93 data).
I want to use linear regression for this, which gives 90 MSE values for the model.
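To make the windowing concrete, here is a minimal sketch of how I understand the pairing of three consecutive days with the day that follows, together with the per-day MSE. The helper names `make_windows` and `per_day_mse` and the random array standing in for the 93 CSV files are assumptions for illustration only; the sketch builds the pairs and the error measure, not the model.

```python
import numpy as np

def make_windows(data_flattened, window=3):
    """Pair each run of `window` consecutive days with the day that follows.

    data_flattened has shape (num_days, num_pixels).
    """
    num_days = data_flattened.shape[0]
    X = np.array([data_flattened[i - window:i].ravel()
                  for i in range(window, num_days)])
    y = data_flattened[window:]
    return X, y

def per_day_mse(y_true, y_pred):
    """One MSE per target day, averaged over that day's pixels."""
    return np.mean((y_true - y_pred) ** 2, axis=1)

# Synthetic stand-in for 93 days of 24x25 pixels (assumption, not real data)
rng = np.random.default_rng(0)
demo = rng.random((93, 24 * 25))

X_demo, y_demo = make_windows(demo)
print(X_demo.shape, y_demo.shape)  # (90, 1800) (90, 600)
```

Row 0 of `X_demo` holds day1 to day3 and row 0 of `y_demo` holds day4, so the 90 rows correspond to mse1 through mse90.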
```python
import os
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# Paths
data_folder = r'C:\Users\alokj\OneDrive\Desktop\jupyter_proj\All_data'
output_folder = r'C:\Users\alokj\OneDrive\Desktop\jupyter_proj\90_days_merged'

# Ensure the output folder exists
os.makedirs(output_folder, exist_ok=True)

# List all CSV files in the folder
csv_files = [f for f in os.listdir(data_folder) if f.endswith('.csv')]

# Sort the files based on the numeric part extracted from the filename
csv_files = sorted(csv_files, key=lambda x: int(x.split('_Day')[1].split('_')[0]))

# Prepare data
data_list = [pd.read_csv(os.path.join(data_folder, file), header=None).values
             for file in csv_files]
data_array = np.array(data_list)  # Shape: (num_days, 24, 25)

# Flatten the data for easier handling in regression models
num_days, rows, cols = data_array.shape
data_flattened = data_array.reshape(num_days, -1)  # Shape: (num_days, 600)

# Prepare features and target matrix for range(3, num_days)
X = np.array([data_flattened[i-3:i].flatten()
              for i in range(3, num_days)])  # Shape: (num_days-3, 1800)
y = data_flattened[3:num_days]  # Target is the 4th day in each sequence

# Train-test split and validation (separate fixed split)
train_size = int(0.8 * len(X))  # 80% for training
X_train = X[:train_size]
y_train = y[:train_size]
X_test = X[train_size:]
y_test = y[train_size:]

# Scaling the data
scaler_X = MinMaxScaler()
scaler_X.fit(X_train)  # Fit on training set
X_train_scaled = scaler_X.transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

scaler_y = MinMaxScaler()
scaler_y.fit(y_train)  # Fit on training set
y_train_scaled = scaler_y.transform(y_train)
y_test_scaled = scaler_y.transform(y_test)

### Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train_scaled)
y_pred_test_scaled_lr = lr_model.predict(X_test_scaled)
y_pred_test_lr = scaler_y.inverse_transform(y_pred_test_scaled_lr)

# Validation for Days 4 to 93
# (range end is 93, not 96, so that XX has 90 rows and matches yy)
XX = np.array([data_flattened[i-3:i].flatten()
               for i in range(3, 93)])  # Shape: (90, 1800)
yy = data_flattened[3:93]  # Target for validation
yy_pred_lr = lr_model.predict(scaler_X.transform(XX))
yy_pred_lr = scaler_y.inverse_transform(yy_pred_lr)

# Calculate residuals (per-day MSE) for Linear Regression
residuals_lr = [np.mean((yy[i] - yy_pred_lr[i])**2) for i in range(len(yy))]

# Plot residuals
days = [f'Day {i+4}' for i in range(90)]  # Labels from Day 4 to Day 93
plt.figure(figsize=(12, 6))
plt.plot(days, residuals_lr, label='Linear Regression Residuals', marker='o')

# Configure plot
plt.xticks(ticks=range(0, len(days), 2), labels=days[::2], rotation=45, ha='right')
plt.xlabel('Days (Validation Set)')
plt.ylabel('Residuals (MSE)')
plt.title('Residuals for Models (Validation Set)')
plt.legend()
plt.grid(True)

# Save and show plot
plt.savefig(os.path.join(output_folder, 'residuals_plot_models_comparison_with_naive.png'))
plt.show()
```

We know that linear regression models often do not do very well with time series data, because the assumption of independent and identically distributed errors is usually violated.
But in my case, the plot produced by the code above shows the regression model doing exceptionally well (the mean squared error is very low, close to zero). Could somebody check the regression model in my code for mistakes or bugs that I might not be aware of?
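One sanity check that might help narrow this down, sketched here on synthetic random data rather than my real files: fit on the first 80% of the windows, then compare the MSE on those windows with the MSE on the remaining ones. The random data and the variable names (`mse_seen`, `mse_unseen`) are assumptions for illustration, and scaling is left out to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for 93 flattened days (assumption, not real data)
rng = np.random.default_rng(0)
demo = rng.random((93, 600))

# Same windowing as in my code: 3 days in, next day out
X = np.array([demo[i - 3:i].ravel() for i in range(3, 93)])  # (90, 1800)
y = demo[3:]                                                 # (90, 600)

train_size = int(0.8 * len(X))  # 72 windows used for fitting
model = LinearRegression().fit(X[:train_size], y[:train_size])

# Per-day MSE over all 90 windows, then split by seen vs. unseen
mse = np.mean((y - model.predict(X)) ** 2, axis=1)
mse_seen = mse[:train_size].mean()    # windows the model was fitted on
mse_unseen = mse[train_size:].mean()  # windows held out from fitting

print(f"seen: {mse_seen:.3e}, unseen: {mse_unseen:.3e}")
```

With 72 training windows and 1800 features, ordinary least squares can reproduce the training targets exactly, so a near-zero MSE on windows the model was fitted on does not by itself say anything about forecasting skill. Only the held-out windows are informative.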
I implemented the code based on the suggestions in @RobertLong's answer to my question linked above.
Here is the link to the folder with all 93 days of data that I used for the code.
