Improving time efficiency of code, working with a big Data Set using Python

Question

I’m trying to optimize my code to reduce the time it takes to run through the data set.

I’m working with a .csv file that has three columns: time_UTC, vmag2D and vdir. The data set has around 1,420,000 lines (one million, four hundred and twenty thousand).

The for loop below took around 15 to 20 minutes to run on my Mac with an M1 processor. I’m sure I overcomplicated something for it to take this long (I believe the processor is good enough to run this little piece of code faster).

    import pandas as pd

    path_data = '" *insert a path here* "'
    file = path_data + ' *name of the .csv file* '
    data = pd.read_csv(file)

    time_UTC = []
    vmag2D = []
    vdir = []

    for i in range(len(data)):
        x = data.iloc[i][0]
        x1 = x.split(' ')
        x2 = x1[1].split(';')
        date = x.split(' ')[0]
        time_UTC.append(x2[0])
        vmag2D.append(x2[1])
        vdir.append(x2[2])

The code is parsing each of the lines in the .csv file, and each of them has the same “template”: '1994-01-01 00:05:00;0.52;193'

carlo_barth · Accepted Answer · 2022-08-04 17:22:59Z

It shouldn't be necessary to use any type of for loop for your code. You are reading the CSV using pandas, but you don't seem to specify the correct parameters.

    import pandas as pd

    path_data = '" *insert a path here* "'
    file = path_data + ' *name of the .csv file* '

    df = pd.read_csv(file, sep=';', parse_dates=[0], engine='c', header=None)

    time_UTC = df.iloc[:, 0]
    vmag2D = df.iloc[:, 1]
    vdir = df.iloc[:, 2]

If you run this, your resulting variables (time_UTC, ...) will be of type pandas.Series. You can convert those to list with .to_list() or access the numpy array using .values.
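As a minimal sketch of those conversions, the snippet below reads a few made-up rows in the question's format from an in-memory buffer instead of a file. The `sample` text and the use of `io.StringIO` are assumptions for illustration only. It uses `.to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same "timestamp;vmag2D;vdir" layout as the question.
sample = (
    "1994-01-01 00:05:00;0.52;193\n"
    "1994-01-01 00:10:00;0.48;190\n"
    "1994-01-01 00:15:00;0.55;201\n"
)

# Same parameters as in the answer, minus the explicit engine.
df = pd.read_csv(io.StringIO(sample), sep=";", parse_dates=[0], header=None)

time_UTC = df.iloc[:, 0]  # pandas.Series of timestamps
vmag2D = df.iloc[:, 1]    # pandas.Series of floats
vdir = df.iloc[:, 2]      # pandas.Series of integers

vmag2D_list = vmag2D.to_list()  # plain Python list
vdir_array = vdir.to_numpy()    # NumPy array
```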

Note that I am specifying engine='c' here, so the pandas CSV parser uses its native C implementation, which is faster than the pure-Python one. That matters because you are processing a large file.
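To see the difference between the two engines on your own machine, a rough timing sketch could look like the following. The synthetic `text`, the row count and the helper `read_with` are hypothetical and only serve the illustration; actual timings depend on the machine and the pandas version. As far as I know, pandas already selects the C engine by default when the options used support it, so passing engine='c' mainly makes the choice explicit.

```python
import io
import time

import pandas as pd

# Synthetic data (an assumption for illustration): 50,000 identical rows
# in the question's "timestamp;vmag2D;vdir" layout.
n_rows = 50_000
text = "1994-01-01 00:05:00;0.52;193\n" * n_rows


def read_with(engine):
    """Parse the synthetic text with the given engine; return the frame and elapsed seconds."""
    start = time.perf_counter()
    frame = pd.read_csv(
        io.StringIO(text), sep=";", parse_dates=[0], header=None, engine=engine
    )
    return frame, time.perf_counter() - start


df_c, seconds_c = read_with("c")
df_python, seconds_python = read_with("python")

# Both engines are expected to produce the same table; only the parsing time differs.
pd.testing.assert_frame_equal(df_c, df_python)
print(f"c engine: {seconds_c:.3f} s, python engine: {seconds_python:.3f} s")
```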

Worked perfectly! Exactly what I was looking for! The run time before was 29 minutes and now it is instantaneous! Thank you very much!

Colim · Accepted Answer · 2022-08-04 17:28:09Z

You can split the entire column at once:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193"]*1000})

    # split at space " "
    df[["date", "time vmag vdir"]] = df["all"].str.split(" ", expand=True)

    # split at ";"
    df[["time", "vmag2D", "vdir"]] = df['time vmag vdir'].str.split(';', expand=True)

    date = pd.to_datetime(df["date"]).to_list()
    time_UTC = pd.to_datetime(df["time"]).to_list()
    vmag2D = pd.to_numeric(df["vmag2D"]).to_list()
    vdir = pd.to_numeric(df["vdir"]).to_list()
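If a single combined timestamp is preferred over separate date and time values, a variation on the same idea is to split only at ";" and parse the first part as a whole. This is a sketch under that assumption, not part of the answer's code; the name `parts` and its column labels are chosen here for illustration.

```python
import pandas as pd

# Made-up data: one raw string column, mirroring the answer's example frame.
df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193"] * 1000})

# One vectorized split at ";" yields three string columns.
parts = df["all"].str.split(";", expand=True)
parts.columns = ["time_UTC", "vmag2D", "vdir"]

# Convert each column to a proper dtype with one call per column.
time_UTC = pd.to_datetime(parts["time_UTC"]).to_list()
vmag2D = pd.to_numeric(parts["vmag2D"]).to_list()
vdir = pd.to_numeric(parts["vdir"]).to_list()
```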
