Plotting timeseries data with multiple categories

Question

I have a dataset from a production line, which is formatted as time series data. There is a batch column, which indicates the name of the batch (str), and there is a phase column which indicates the phase of the production (str). I am working with the datetime as the index of the pandas DataFrame.

I want to plot this data on a time series graph, overlaying the data from each phase and distinguishing each batch (e.g. by colour), with each process variable (temp1, temp2, press1, press2) on a different axis (as per the diagram). How can this be done?

EDIT: for clarity, I need the trends to be plotted against a datetime baseline, otherwise they will not overlay.

Example of the dataset:

| datetime | temp1 | temp2 | press1 | press2 | batch | phase |
|:---- |:--: | :--: | :--: | :--: |:--: |:--: |
| 2023-02-03 15:45:34 | 34.45 | 23.34 | 13.23 | 45.5 | 'D' | '10-Wait' |
| ... | ... | ... | ... | ... | 'D' | ... |
| 2023-02-03 15:55:34 | 36.55 | 22.14 | 18.23 | 78.5 | 'D' | '20-Initialise' |

To create a dataset similar to mine, you can use the following code:

    import numpy as np
    import pandas as pd
    import datetime

    # Random process data sampled every 30 seconds
    date = pd.date_range(start='1/1/2023', end='10/06/2023', freq=datetime.timedelta(seconds=30))
    tags = ['temp1', 'temp2', 'press1', 'press2']
    data = np.random.rand(len(date), len(tags))
    df = pd.DataFrame(data, columns=tags).set_index(date)

    # Seven batches, each starting at a random time and lasting 8 hours
    batches = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
    n = len(batches)
    period_start = pd.to_datetime('1/1/2023')
    period_end = pd.to_datetime('10/06/2023')
    batch_start = (pd.to_timedelta(np.random.rand(n) * ((period_end - period_start).days + 1), unit='D') + period_start)
    batch_end = (batch_start + pd.to_timedelta(8, unit='h'))
    df_batches = pd.DataFrame(data=[batch_start, batch_end], columns=batches, index=['start', 'end']).T

    # Label the rows that fall inside a batch and drop everything else
    for item in batches:
        start_time = df_batches['start'][item]
        end_time = df_batches['end'][item]
        df.loc[((df.index >= start_time) & (df.index <= end_time)), 'batch'] = item
    df.dropna(subset=['batch'], inplace=True)

    # Split each batch into four phases at 20 %, 40 % and 60 % of its length
    df['phase'] = ''
    phases = ['10-Wait', '20-Initialise', '30-Warm', '40-Running']
    for batch in batches:
        wait_len = int(len(df[df['batch'] == batch].index) * 0.2)
        init_len = int(len(df[df['batch'] == batch].index) * 0.4)
        warm_len = int(len(df[df['batch'] == batch].index) * 0.6)
        run_len = int(len(df[df['batch'] == batch].index))
        wait_start = df[df['batch'] == batch].index[0]
        wait_end = df[df['batch'] == batch].index[wait_len]
        init_end = df[df['batch'] == batch].index[init_len]
        warm_end = df[df['batch'] == batch].index[warm_len]
        run_end = df[df['batch'] == batch].index[-1]
        # Assign through df.loc to avoid chained assignment
        df.loc[wait_start:wait_end, 'phase'] = phases[0]
        df.loc[wait_end:init_end, 'phase'] = phases[1]
        df.loc[init_end:warm_end, 'phase'] = phases[2]
        df.loc[warm_end:run_end, 'phase'] = phases[3]
    df.to_csv('stackoverflowqn.csv')

Your index is a datetime, but are you going to observe changes monthly, hourly, yearly, etc.?
– Yilmaz, Commented Aug 31, 2023 at 4:00
I want to overlay the plots against a baseline timescale, not the timescale in the DataFrame index.
– hamslice, Commented Aug 31, 2023 at 4:22

Quang Hoang · Accepted Answer · 2023-08-31 04:49:09Z

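You can use seaborn's FacetGrid like this:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Long format: one row per (time, batch, phase, variable) combination
    df = df.rename_axis(index='time').reset_index().melt(['time','batch','phase'])

    for p, data in df.groupby('phase', group_keys=False):
        print(p)
        # One panel per variable, one line colour per batch
        fg = sns.FacetGrid(data=data, col='variable', col_wrap=2, hue='batch')
        fg.map(sns.lineplot, 'time', 'value')
        plt.show()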

For each phase, you get one figure with a panel per variable and a line per batch.
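The melt call matters here because FacetGrid works on long-format data, with one row per measurement. A minimal sketch on a tiny made-up frame (the values and the two-row size are assumptions for illustration only) shows what the reshaping does:

```python
import pandas as pd

# Tiny made-up wide frame: one column per process variable (illustrative values)
wide = pd.DataFrame(
    {
        "temp1": [34.45, 36.55],
        "press1": [13.23, 18.23],
        "batch": ["D", "D"],
        "phase": ["10-Wait", "20-Initialise"],
    },
    index=pd.to_datetime(["2023-02-03 15:45:34", "2023-02-03 15:55:34"]),
)

# Name the index, turn it into a column, then stack the variable columns
long = wide.rename_axis(index="time").reset_index().melt(["time", "batch", "phase"])

print(long.columns.tolist())  # ['time', 'batch', 'phase', 'variable', 'value']
print(len(long))              # 4 rows: 2 timestamps x 2 variables
```

The resulting variable column is what col='variable' splits into panels, and value is what gets plotted on the y-axis.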

Avish Wagde · Accepted Answer · 2023-08-31 04:13:42Z

I created some sample data similar to yours to show how to plot it. I think this will help you, mate!

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Create a sample DataFrame for illustration purposes
    data = {
        'batch': ['Batch1', 'Batch1', 'Batch1', 'Batch2', 'Batch2', 'Batch2'],
        'phase': ['Phase1', 'Phase2', 'Phase3', 'Phase1', 'Phase2', 'Phase3'],
        'temp1': [100, 110, 105, 95, 105, 98],
        'temp2': [90, 95, 92, 85, 88, 87],
        'press1': [50, 52, 51, 48, 49, 47],
        'press2': [30, 31, 29, 28, 30, 29]
    }

    df = pd.DataFrame(data)
    df['datetime'] = pd.date_range(start='2023-01-01', periods=len(df), freq='D')
    df.set_index('datetime', inplace=True)

    sns.set_style("whitegrid")

    phases = df['phase'].unique()
    batches = df['batch'].unique()
    variables = ['temp1', 'temp2', 'press1', 'press2']  # List of process variables

    num_cols = 2  # Number of columns for the subplot grid
    num_rows = (len(variables) + num_cols - 1) // num_cols

    for phase in phases:
        fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 6 * num_rows))
        plt.subplots_adjust(hspace=0.5)
        plt.suptitle(phase, y=1.02)

        for idx, variable in enumerate(variables):
            row = idx // num_cols
            col = idx % num_cols

            for batch in batches:
                batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
                axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='o')

            axes[row, col].set_title(variable)
            axes[row, col].set_xlabel('Datetime')
            axes[row, col].set_ylabel(variable)
            axes[row, col].legend()

        plt.tight_layout()
        plt.show()
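The row and column arithmetic used for the subplot grid can be checked without plotting anything. The following is a small sketch of how each variable maps onto the grid; the helper names grid_shape and grid_position are hypothetical and not part of the answer above:

```python
def grid_shape(n_items, num_cols):
    """Rows needed to fit n_items into num_cols columns (ceiling division)."""
    num_rows = (n_items + num_cols - 1) // num_cols
    return num_rows, num_cols


def grid_position(idx, num_cols):
    """(row, col) of the idx-th panel when the grid is filled row by row."""
    return divmod(idx, num_cols)


variables = ["temp1", "temp2", "press1", "press2"]
shape = grid_shape(len(variables), num_cols=2)
positions = {v: grid_position(i, 2) for i, v in enumerate(variables)}

print(shape)      # (2, 2)
print(positions)  # {'temp1': (0, 0), 'temp2': (0, 1), 'press1': (1, 0), 'press2': (1, 1)}
```

Note that plt.subplots only returns a 2-D array of axes when both nrows and ncols are greater than 1. With a single row or column, axes[row, col] indexing fails unless squeeze=False is passed.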

This is very close to the solution I am looking for. I think my question was missing one crucial point, picked up by @Yilmaz: I want the plots to be baselined against a timescale, so that the plots are overlaid.
I modified your code above to set a new Timedelta index (with 30 s intervals) when the plots are created. However, it's buggy and throws an error, "IndexError: index 2 is out of bounds for axis 0 with size 2":

    for batch in batches:
        batch_data = df[(df['BxID'] == batch) & (df['Phase'] == phase)]
        new_index = pd.to_timedelta(np.arange(0,len(batch_data.index)*30,30),unit='s')
        batch_data=batch_data.set_index(new_index)
        axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='')

@hamslice I didn't quite understand what you mean by "want the plots to be baselined against a timescale, so the plots will be overlaid". Can you explain in more detail so that I can help? I can't solve your error yet because I don't understand exactly what you are looking for.
In your suggested code you are plotting df.index, which is a datetime. Since the batches are processed sequentially (i.e. there is no overlap between batches) and the trends are plotted at the date and time they were processed, it is not possible to compare batches on the same axis. I believe the way around this is to set a new index using Timedelta. When plotting, the first point on the x-axis should be 0:00, so the first row of the index of the DataFrame should be 0:00. Since all data is recorded at a constant 30 s interval, a 30 s Timedelta step should work.
I think I figured out why it was throwing an error. The new index I created was a TimedeltaIndex, but after converting it to a Series it works well.

Yilmaz · Accepted Answer · 2023-09-01 03:22:41Z

Breaking down the process of plotting a graph step by step can make the task much more manageable.

First, load the data:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    raw_data=pd.read_csv("stackoverflowqn.csv",index_col=[0])

Its index is the date. Reset the index and create a new column "date" with datetime type:

    data=raw_data.reset_index()
    data.columns=['date', 'temp1', 'temp2', 'press1', 'press2', 'batch', 'phase']
    data["date"]=pd.to_datetime(data["date"])

Create the groupby object and get the group names in a list:

    gbo=data.groupby("phase",as_index=False)
    keys=list(gbo.groups.keys())

After that, create a DataFrame for each group:

    list_1=gbo.groups[keys[0]]
    frame_1=data[data.index.isin(list_1)]
    list_2=gbo.groups[keys[1]]
    frame_2=data[data.index.isin(list_2)]
    list_3=gbo.groups[keys[2]]
    frame_3=data[data.index.isin(list_3)]
    list_4=gbo.groups[keys[3]]
    frame_4=data[data.index.isin(list_4)]
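As a more compact alternative (a sketch, not part of the original answer), iterating over the groupby object yields each phase name together with its sub-frame, so the per-phase frames can be collected in a dictionary. The small data frame below is made up for illustration:

```python
import pandas as pd

# Made-up data: one batch, two samples per phase, 30 s apart
data = pd.DataFrame(
    {
        "date": pd.date_range("2023-01-01", periods=8, freq=pd.Timedelta(seconds=30)),
        "temp1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        "batch": ["A"] * 8,
        "phase": ["10-Wait", "10-Wait", "20-Initialise", "20-Initialise",
                  "30-Warm", "30-Warm", "40-Running", "40-Running"],
    }
)

# Each iteration gives (group key, sub-frame); keys come out sorted
frames = {phase: frame for phase, frame in data.groupby("phase")}
keys = list(frames)
frame_1 = frames[keys[0]]

print(keys)  # ['10-Wait', '20-Initialise', '30-Warm', '40-Running']
```

Looking frames up by phase name, for example frames["30-Warm"], avoids having to remember which numbered frame belongs to which phase.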

Setting the date as one of the axes will not look great. Maybe you should add a minute column to one of the frames:

    frame_1["minute"]=frame_1["date"].dt.minute
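One thing to be aware of: .dt.minute returns the minute-of-hour component (0 to 59), so it wraps around every hour and does not start at zero for each batch. If elapsed time is what is needed, one option is to subtract each batch's first timestamp. This is a sketch under the assumption that the phase frame has date and batch columns as above; the sample values and the elapsed_min column name are made up:

```python
import pandas as pd

# Made-up phase frame: batch A crosses an hour boundary, batch B starts later
frame_1 = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2023-02-03 15:58:00", "2023-02-03 15:59:00", "2023-02-03 16:00:00",
             "2023-03-10 08:30:00", "2023-03-10 08:31:00"]
        ),
        "batch": ["A", "A", "A", "B", "B"],
    }
)

# Minute-of-hour: wraps from 59 back to 0 and differs between batches
minute_of_hour = frame_1["date"].dt.minute

# Elapsed minutes since the first sample of each batch
start = frame_1.groupby("batch")["date"].transform("min")
frame_1["elapsed_min"] = (frame_1["date"] - start).dt.total_seconds() / 60

print(minute_of_hour.tolist())          # [58, 59, 0, 30, 31]
print(frame_1["elapsed_min"].tolist())  # [0.0, 1.0, 2.0, 0.0, 1.0]
```

With elapsed_min on the x-axis, every batch starts at zero, so the lines for different batches overlay instead of being spread across the hour.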

Now you have 4 new DataFrames; you just have to plot them. You choose the x and y axes.

    plt.figure(figsize=(16,10))
    plt.suptitle("Main Figure",fontsize=24)

    # 2 x 2 grid; work on the first plot
    plt.subplot(2,2,1)
    # By default lineplot aggregates with estimator="mean"; you might need to change it.
    # By default seaborn line plots also show confidence intervals; remove them with errorbar=None.
    sns.lineplot(data=frame_1,x="minute",y="temp1",hue="batch",errorbar=None).set(title="MINUTE-TEMP1")

    plt.subplot(2,2,2)
    sns.lineplot(data=frame_2,x="temp2",y="press1",hue="batch",errorbar=None).set(title="Title_2")

    plt.subplot(2,2,3)
    sns.lineplot(data=frame_3,x="temp1",y="press1",hue="batch",errorbar=None).set(title="Title_3")

    plt.subplot(2,2,4)
    sns.lineplot(data=frame_4,x="temp1",y="press1",hue="batch",errorbar=None).set(title="Title_4")

    plt.show()

The result looks like this; you decide the x and y axes.

This is very close to the desired solution. Unfortunately, the suggestion to use frame_1["minute"]=frame_1["date"].dt.minute does not work in the way I would like it to. Refer to this screenshot for the issue (imgur.com/a/g4LDYJ2).
I showed minute as an example. I don't understand the dataset or what the research is about. I prepared the process; you just have to plug in the right x and y axes.

hamslice · Accepted Answer · 2023-09-03 01:05:51Z

Credit to @AvishWagde, who definitely broke the back of the problem. The one missing ingredient was having the x-axis of each plot baselined against zero.

The solution to baselining these plots was to create a new Timedelta column which starts from 00:00:00 and goes upwards, in increments of 00:00:30.

In Avish's code he uses:

    for batch in batches:
        batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
        axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='o')

However, since the DataFrame index is a datetime, plotting it on the x-axis does not allow the batches to be compared. As stated, they need to be plotted against a common baseline. Using a Timedelta on the x-axis allows comparison of the process data in each phase, with 00:00:00 taken to be the start of each phase. This dataset was recorded at 30 s intervals, and it is necessary to convert the Timedelta from an Index to a Series, as in this line:

    pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series()

which results in this slight change:

    for batch in batches:
        batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
        baseline_time = pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series()
        batch_data = batch_data.set_index(baseline_time)
        axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='')
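The arange-based baseline relies on no samples being missing. A hedged alternative (not in the original answer) derives the baseline from the timestamps themselves by subtracting the first one; this gives the same result for perfectly regular data and stays correct when a row has been dropped. The timestamps below are made up, with one 30 s sample deliberately missing:

```python
import numpy as np
import pandas as pd

# Made-up phase data for one batch; the sample at 15:46:00 is missing
index = pd.to_datetime(
    ["2023-02-03 15:45:00", "2023-02-03 15:45:30", "2023-02-03 15:46:30"]
)
batch_data = pd.DataFrame({"temp1": [34.45, 35.10, 36.55]}, index=index)

# Baseline from the row count (assumes a perfectly regular 30 s interval)
by_count = pd.to_timedelta(np.arange(0, len(batch_data) * 30, 30), unit="s")

# Baseline from the timestamps (elapsed time since the first row)
by_clock = batch_data.index - batch_data.index[0]

print(list(by_count.total_seconds()))  # [0.0, 30.0, 60.0]
print(list(by_clock.total_seconds()))  # [0.0, 30.0, 90.0]
```

The two baselines only disagree after the gap. Which one is preferable depends on whether the x-axis should represent sample number or real elapsed time.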

For the full working code:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv('stackoverflowqn.csv',index_col=[0])
    df.index = pd.to_datetime(df.index)

    sns.set_style("whitegrid")

    phases = df['phase'].unique()
    batches = df['batch'].unique()
    variables = ['temp1', 'temp2', 'press1', 'press2']  # List of process variables

    num_cols = 2  # Number of columns for the subplot grid
    num_rows = (len(variables) + num_cols - 1) // num_cols

    for phase in phases:
        fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 6 * num_rows))
        plt.subplots_adjust(hspace=0.5)
        plt.suptitle(phase, y=1.02)

        for idx, variable in enumerate(variables):
            row = idx // num_cols
            col = idx % num_cols

            for batch in batches:
                batch_data = df[(df['batch'] == batch) & (df['phase'] == phase)]
                # Replace the datetime index with elapsed time starting at 00:00:00
                baseline_time = pd.to_timedelta(np.arange(0,len(batch_data)*30,30),unit='s').to_series()
                batch_data = batch_data.set_index(baseline_time)
                axes[row, col].plot(batch_data.index, batch_data[variable], label=batch, marker='')

            axes[row, col].set_title(variable)
            axes[row, col].set_xlabel('time')
            axes[row, col].set_ylabel(variable)
            axes[row, col].legend()

        plt.tight_layout()
        plt.show()

[Figures: Wait: 2 x variables; Initialise: 2 x variables; Warm: 2 x variables]
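For reference, the same overlay can also be prepared without the inner batch loop by pivoting, so that each batch becomes a column indexed by elapsed time. This is only a sketch: the elapsed column name is an assumption, the data is made up, and the 30 s step relies on the constant sampling interval stated above.

```python
import pandas as pd

# Made-up data: two batches recorded weeks apart, same phase, 30 s sampling
df = pd.DataFrame(
    {
        "temp1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "batch": ["A", "A", "A", "B", "B", "B"],
        "phase": ["10-Wait"] * 6,
    },
    index=pd.to_datetime(
        ["2023-01-05 10:00:00", "2023-01-05 10:00:30", "2023-01-05 10:01:00",
         "2023-02-11 22:15:00", "2023-02-11 22:15:30", "2023-02-11 22:16:00"]
    ),
)

# Seconds since the start of each (batch, phase) run, assuming 30 s sampling
df["elapsed"] = df.groupby(["batch", "phase"]).cumcount() * 30

# One column per batch, one row per elapsed time, for a single phase and variable
phase_data = df[df["phase"] == "10-Wait"]
overlay = phase_data.pivot(index="elapsed", columns="batch", values="temp1")

print(overlay.index.tolist())    # [0, 30, 60]
print(overlay.columns.tolist())  # ['A', 'B']
```

Calling overlay.plot(ax=axes[row, col]) would then draw one line per batch on a shared elapsed-time axis.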
