
I have a number of large CSV files (each at around two million rows), which have rows of timestamps looking like this:

    16.01.2019 12:52:22
    16.01.2019 12:52:23
    16.01.2019 12:52:24

Given that there's an entry for each second (over the course of about a year), it should be understandable why there are so many rows. I want to be more flexible, which is why I want to split the timestamps into four columns: date, date+hour, date+hour+minute, and date+hour+minute+second, so that I'm able to group timestamps at will. This is how I'm doing it:

    dates = []
    hours = []
    minutes = []
    seconds = []

    i = 0  # initial values
    dates.append(str(get_date(i).date()))
    hours.append(str(get_date(i).hour))
    minutes.append(str(get_date(i).minute))
    seconds.append(str(get_date(i).second))

    for i in range(len(df)):
        if i < len(df) - 1:
            if str(get_date(i).date) < str(get_date(i+1).date):  # dates: YYYY-MM-DD
                dates.append(str(get_date(i+1).date()))
            else:
                dates.append(str(get_date(i).date()))
            if str(get_date(i).hour) < str(get_date(i+1).hour):  # dates+hours: YYYY-MM-DD HH
                hours.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour))
            else:
                hours.append(str(get_date(i).date()) + " " + str(get_date(i).hour))
            if str(get_date(i).minute) < str(get_date(i+1).minute):  # dates+hours+minutes: YYYY-MM-DD HH:mm
                minutes.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute))
            else:
                minutes.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute))
            if str(get_date(i).second) < str(get_date(i+1).second):  # dates+hours+minutes+seconds: YYYY-MM-DD HH:mm:ss
                seconds.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute) + ":" + str(get_date(i+1).second))
            else:
                seconds.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute) + ":" + str(get_date(i).second))

    df["dates"] = dates
    df["hours"] = hours
    df["minutes"] = minutes
    df["seconds"] = seconds

where get_date() is simply a function returning the timestamp with the given index:

    def get_date(i):
        return dt.datetime.strptime(df["timestamp"][i], '%d.%m.%Y %H:%M:%S')

I basically iterate through all entries, put each date/hour/minute/second into a list, and then insert each list into my dataframe as a new column.

I guess this would put me at O(n²), which is obviously not ideal.

Now, doing this on one file (~60MB, 2 million rows) takes half an hour. I personally can't think of another way to do what I want to do, so I just wanted to see if there's anything I can do to reduce the complexity.

edit: Tweaking @Chris' answer for my needs:

    times = bogie_df["timestamp"]  # got an error when passing the map directly to pd.DataFrame, which is why I first converted it into a list
    items = ['year', 'month', 'day', 'hour', 'minute', 'second']
    df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(times))), columns=items)

    # for my desired YYYY-MM-DD format (though attrgetter only returns "1" for January instead of "01")
    df["date"] = df['year'].map(str) + "-" + df["month"].map(str) + "-" + df["day"].map(str)
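A sketch of how the zero-padding issue can be sidestepped by letting pandas format the string directly instead of concatenating the integer attributes (the sample frame below is made up for illustration):

```python
import pandas as pd

# Hypothetical sample frame mirroring the question's timestamp format
df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22", "16.01.2019 12:52:23"]})

ts = pd.to_datetime(df["timestamp"], format="%d.%m.%Y %H:%M:%S")
# strftime zero-pads month and day, so no manual "0" + str(x) fix-up is needed
df["date"] = ts.dt.strftime("%Y-%m-%d")  # → "2019-01-16"
```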
3 Comments
  • Is there a mistake in your code: dates.append(str(get_date(i).date())) vs dates.append(str(get_date(i).date))? You call the get_date() method really a lot. Have you tried saving that result into a variable? And is there any reason for the additional if i < len(df) - 1:? With the range object (I hope you are on Python 3.x) you already only get i from 0 to len(df)-1. Commented May 15, 2019 at 14:55
  • Do you mean saving get_date() into a variable at the beginning of each iteration? That could help, I guess yeah. I edited my question to show what get_date really does. Commented May 15, 2019 at 15:00
  • Since I'm also asking for i+1 during each iteration, the additional if-clause prevents the method from crashing at the end. Commented May 15, 2019 at 15:01

3 Answers


Use operator.attrgetter with pd.to_datetime:

    import pandas as pd
    import operator

    s = pd.Series(["16.01.2019 12:52:22", "16.01.2019 12:52:23", "16.01.2019 12:52:24"])
    items = ['day', 'hour', 'minute', 'second']
    df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(s))), columns=items)

Output:

       day  hour  minute  second
    0   16    12      52      22
    1   16    12      52      23
    2   16    12      52      24

Benchmark:

    large_s = pd.Series(pd.date_range('16.01.2019 12:52:22', periods=2000000, freq='1s').astype(str).tolist())  # make 2M rows of timestamps as str

    %%timeit
    items = ['day', 'hour', 'minute', 'second']
    df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(large_s))), columns=items)
    # 6.77 s ± 54.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

takes about 7 seconds.

Updated:

Instead of manually creating YYYY-mm-dd-formatted strings, let pandas do it for you.

    df = pd.DataFrame(pd.to_datetime(s), columns=['date'])
    items = ['year', 'month', 'day', 'hour', 'minute', 'second']
    df[items] = pd.DataFrame(list(map(operator.attrgetter(*items), df['date'])), columns=items)

Then:

    df['dates'] = df['date'].astype(str).str[:10]

Output (with the month zero-padded):

    0    2019-01-16
    1    2019-01-16
    2    2019-01-16

3 Comments

Wow, game changer! Tweaked it a little bit, as the map function didn't work as a parameter for pd.DataFrame() for me (see my edited question), but that's huge. I couldn't get to 7 seconds, though. Took me 7 minutes, actually. Maybe because my items was bigger?
@Readler My bad, list was missing from the answer :P. For me, both lines (making df, and df['date']) take less than 10 seconds combined, but this will vary hugely based on the hardware. Also, the df['date'] part seems redundant. Let me post some update.
Nice, that's a lot more elegant than my solution! Could you maybe explain how the padding (.str[:10]) works? I made another loop and used map (map(lambda x: "0" + str(x) if len(str(x)) < 2 else x, df[d])), which is a lot more expensive than your solution.

Since comments are limited in length at least following points:

  1. Remove the if i < len(df) - 1: check; you don't need it if you replace your range with range(len(df) - 1).
  2. Save the results of your get_date function:

Before the loop: next_time = get_date(0)

Inside the loop:

    current_time = next_time
    next_time = get_date(i + 1)

This should save you some function calls, but probably pandas has some better way to do such stuff.
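Applied to the question's loop, the caching idea looks roughly like this (a sketch using a small made-up frame; get_date is the OP's helper, and only the "dates" column is shown):

```python
import datetime as dt
import pandas as pd

# Small made-up frame standing in for the OP's data
df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22",
                                 "16.01.2019 12:52:23",
                                 "16.01.2019 12:52:24"]})

def get_date(i):
    return dt.datetime.strptime(df["timestamp"][i], "%d.%m.%Y %H:%M:%S")

dates = [str(get_date(0).date())]   # initial value, as in the question
next_time = get_date(0)
for i in range(len(df) - 1):
    current_time = next_time        # reuse the value parsed last iteration
    next_time = get_date(i + 1)     # each timestamp is now parsed only once
    if current_time.date() < next_time.date():
        dates.append(str(next_time.date()))
    else:
        dates.append(str(current_time.date()))
```

This halves the number of strptime calls per column compared with calling get_date(i) and get_date(i+1) in every iteration, though it is still far slower than a vectorized pd.to_datetime.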

Comments


You do not need that, but instead should add a single column of Timestamp type:

    df['ts'] = pd.to_datetime(df.timestamp, format='%d.%m.%Y %H:%M:%S')

Then you can directly use all the time goodies on that column:

  • df.ts.dt.date: gives the date as a datetime.date
  • df.ts.dt.strftime(format): gives the date as a string formatted with format. For example, df.ts.dt.strftime('%Y-%m-%d %H:%M') is your "minutes" column
  • df.ts.dt.floor(freq='h'): is a timestamp truncated at hour level, for example for grouping
  • ... (just read pandas Datetime Data for reference)
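A minimal sketch of that approach (the sample frame is made up; the 'ts' column and the accessors are as described above):

```python
import pandas as pd

# Made-up sample data mirroring the question's format
df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22",
                                 "16.01.2019 12:52:23",
                                 "16.01.2019 12:52:24"]})
df["ts"] = pd.to_datetime(df["timestamp"], format="%d.%m.%Y %H:%M:%S")

df["dates"] = df["ts"].dt.strftime("%Y-%m-%d")            # YYYY-MM-DD
df["minutes"] = df["ts"].dt.strftime("%Y-%m-%d %H:%M")    # YYYY-MM-DD HH:mm
per_hour = df.groupby(df["ts"].dt.floor("h")).size()      # rows per hour
```

With floor there is no need to materialize string columns at all before grouping; the truncated Timestamps group directly.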

Comments
