I’m trying to optimize my code to reduce the time it takes to run through the data set.

I’m working with a .csv file that has three columns: time_UTC, vmag2D and vdir. The data set has around 1,420,000 lines (one million, four hundred and twenty thousand).

This for loop took around 15 to 20 minutes to run on my Mac with the M1 processor. I’m sure I overcomplicated something for it to take so long (I believe the processor is good enough to run this little piece of code faster).

    import pandas as pd

    path_data = '" *insert a path here* "'
    file = path_data + ' *name of the .csv file* '
    data = pd.read_csv(file)

    time_UTC = []
    vmag2D = []
    vdir = []

    for i in range(len(data)):
        x = data.iloc[i][0]
        x1 = x.split(' ')
        x2 = x1[1].split(';')
        date = x.split(' ')[0]
        time_UTC.append(x2[0])
        vmag2D.append(x2[1])
        vdir.append(x2[2])

The code is parsing each of the lines in the .csv file, and each of them has the same “template”: '1994-01-01 00:05:00;0.52;193'

2 Answers

It shouldn't be necessary to use any type of for loop for your code. You are reading the CSV using pandas, but you don't seem to specify the correct parameters.

    import pandas as pd

    path_data = '" *insert a path here* "'
    file = path_data + ' *name of the .csv file* '

    # Let pandas split on ';' and parse the first column as datetimes
    df = pd.read_csv(file, sep=';', parse_dates=[0], engine='c', header=None)

    time_UTC = df.iloc[:, 0]
    vmag2D = df.iloc[:, 1]
    vdir = df.iloc[:, 2]

If you run this, the resulting variables (time_UTC, ...) will be of type pandas.Series. You can convert them to lists with .to_list() or access the underlying NumPy array with .values.
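As a small illustration of those two conversions, here is a sketch using a hand-made Series. The values are made up for the example and are not taken from the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for one parsed column.
vmag2D = pd.Series([0.52, 0.61, 0.47], name="vmag2D")

as_list = vmag2D.to_list()   # plain Python list of floats
as_array = vmag2D.values     # underlying NumPy array

print(type(as_list), type(as_array))
```

The list is convenient if the rest of the code already expects lists, while the array is usually the better fit for numerical work.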

Note that I am specifying engine='c' for the pandas CSV parser. This selects the native C parser, which is faster than the pure-Python one; that matters here because you are processing a large file.
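To make the parameters concrete, the following sketch parses a few made-up rows in the question's format from an in-memory string instead of a real file. The sample values and the helper name load are assumptions for illustration, and the rows are assumed to have no header line. It runs the same call with both engines so the results can be compared.

```python
import io
import pandas as pd

# Three made-up rows following the template from the question.
sample = (
    "1994-01-01 00:05:00;0.52;193\n"
    "1994-01-01 00:10:00;0.48;187\n"
    "1994-01-01 00:15:00;0.55;201\n"
)

def load(engine):
    # io.StringIO lets read_csv treat the string as if it were a file.
    return pd.read_csv(io.StringIO(sample), sep=";", parse_dates=[0],
                       engine=engine, header=None)

df_c = load("c")
df_py = load("python")

# Inspect the inferred column types and compare the two results.
print(df_c.dtypes)
print(df_c.equals(df_py))
```

On a file with over a million rows, timing the two engines (for example with time.perf_counter) would show how much the choice matters on your machine.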

1 Comment

Worked perfectly! Exactly what I was looking for! The run time before was 29 minutes and now it is instantaneous! Thank you very much!

You can split the entire column at once:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"all": ["1994-01-01 00:05:00;0.52;193"]*1000})

    # split at space " "
    df[["date", "time vmag vdir"]] = df["all"].str.split(" ", expand=True)

    # split at ";"
    df[["time", "vmag2D", "vdir"]] = df['time vmag vdir'].str.split(';', expand=True)

    date = pd.to_datetime(df["date"]).to_list()
    time_UTC = pd.to_datetime(df["time"]).to_list()
    vmag2D = pd.to_numeric(df["vmag2D"]).to_list()
    vdir = pd.to_numeric(df["vdir"]).to_list()
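As a quick sanity check, the sketch below compares row-by-row parsing with the column-wise split on two made-up rows. The sample data and variable names are assumptions for illustration; the point is only that both approaches should produce the same strings before any type conversion.

```python
import pandas as pd

# Made-up rows in the question's format, held in a single column named "all".
rows = ["1994-01-01 00:05:00;0.52;193", "1994-01-01 00:10:00;0.48;187"]
df = pd.DataFrame({"all": rows})

# Row-by-row parsing, in the spirit of the loop from the question.
loop_time, loop_vmag, loop_vdir = [], [], []
for line in rows:
    time_part, vmag, vdir = line.split(" ")[1].split(";")
    loop_time.append(time_part)
    loop_vmag.append(vmag)
    loop_vdir.append(vdir)

# Column-wise parsing: split once at the space, then at the semicolons.
parts = df["all"].str.split(" ", expand=True)[1].str.split(";", expand=True)

print(parts[0].to_list() == loop_time)
print(parts[1].to_list() == loop_vmag)
print(parts[2].to_list() == loop_vdir)
```

The column-wise version hands each split to pandas for the whole column at once, which avoids the per-row data.iloc[i] lookups in the original loop.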
