
I have a number of large CSV files (each at around two million rows), which have rows of timestamps looking like this:

    16.01.2019 12:52:22
    16.01.2019 12:52:23
    16.01.2019 12:52:24

Given that there's an entry for each second (over the course of about a year), it should be understandable why there are so many rows. I want to be more flexible, which is why I want to split the timestamps into four columns: date, date+hour, date+hour+minute, and date+hour+minute+second, so that I'm able to group timestamps at will. This is how I'm doing it:

    dates = []
    hours = []
    minutes = []
    seconds = []

    i = 0  # initial values
    dates.append(str(get_date(i).date()))
    hours.append(str(get_date(i).hour))
    minutes.append(str(get_date(i).minute))
    seconds.append(str(get_date(i).second))

    for i in range(len(df)):
        if i < len(df) - 1:
            if str(get_date(i).date) < str(get_date(i+1).date):  # dates: YYYY-MM-DD
                dates.append(str(get_date(i+1).date()))
            else:
                dates.append(str(get_date(i).date()))
            if str(get_date(i).hour) < str(get_date(i+1).hour):  # dates+hours: YYYY-MM-DD HH
                hours.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour))
            else:
                hours.append(str(get_date(i).date()) + " " + str(get_date(i).hour))
            if str(get_date(i).minute) < str(get_date(i+1).minute):  # dates+hours+minutes: YYYY-MM-DD HH:mm
                minutes.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute))
            else:
                minutes.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute))
            if str(get_date(i).second) < str(get_date(i+1).second):  # dates+hours+minutes+seconds: YYYY-MM-DD HH:mm:ss
                seconds.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute) + ":" + str(get_date(i+1).second))
            else:
                seconds.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute) + ":" + str(get_date(i).second))

    df["dates"] = dates
    df["hours"] = hours
    df["minutes"] = minutes
    df["seconds"] = seconds

where get_date() is simply a function returning the timestamp with the given index:

    def get_date(i):
        return dt.datetime.strptime(df["timestamp"][i], '%d.%m.%Y %H:%M:%S')

I basically iterate through all entries, put each date/hour/minute/second into a list, and then insert each list into my dataframe as a new column.

I guess this would put me at O(n²), which is obviously not ideal.

Now, doing this on one file (~60MB, 2 million rows) takes half an hour. I personally can't think of another way to do what I want to do, so I just wanted to see if there's anything I can do to reduce the complexity.

edit: Tweaking @Chris' answer for my needs:

    times = bogie_df["timestamp"]  # got an error when passing the map directly to pd.DataFrame, which is why I first converted it into a list
    items = ['year', 'month', 'day', 'hour', 'minute', 'second']
    df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(times))), columns=items)

    # for my desired YYYY-MM-DD format (though attrgetter only returns "1" for January instead of "01")
    df["date"] = df['year'].map(str) + "-" + df["month"].map(str) + "-" + df["day"].map(str)
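A sketch of how the zero-padding issue can be sidestepped by letting pandas format the string directly instead of concatenating the integer attributes (the sample frame below is made up for illustration):

```python
import pandas as pd

# Hypothetical sample frame mirroring the question's timestamp format
df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22", "16.01.2019 12:52:23"]})

ts = pd.to_datetime(df["timestamp"], format="%d.%m.%Y %H:%M:%S")
# strftime zero-pads month and day, so no manual "0" + str(x) fix-up is needed
df["date"] = ts.dt.strftime("%Y-%m-%d")  # → "2019-01-16"
```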
3 Comments
  • Is there a mistake in your code: dates.append(str(get_date(i).date())) vs dates.append(str(get_date(i).date))? You call the get_date() method really a lot. Have you tried saving that result into a variable? And is there any reason for the additional if i < len(df) - 1:? With the range object (I hope you are on Python 3.x) you already only get i from 0 to len(df)-1. Commented May 15, 2019 at 14:55
  • Do you mean saving get_date() into a variable at the beginning of each iteration? That could help, I guess yeah. I edited my question to show what get_date really does. Commented May 15, 2019 at 15:00
  • Since I'm also asking for i+1 during each iteration, the additional if-clause prevents the method from crashing at the end. Commented May 15, 2019 at 15:01

3 Answers


Use operator.attrgetter with pd.to_datetime:

    import pandas as pd
    import operator

    s = pd.Series(["16.01.2019 12:52:22", "16.01.2019 12:52:23", "16.01.2019 12:52:24"])
    items = ['day', 'hour', 'minute', 'second']
    df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(s))), columns=items)

Output:

       day  hour  minute  second
    0   16    12      52      22
    1   16    12      52      23
    2   16    12      52      24

Benchmark:

    large_s = pd.Series(pd.date_range('16.01.2019 12:52:22', periods=2000000, freq='1s').astype(str).tolist())  # make 2M rows of timestamps as str

    %%timeit
    items = ['day', 'hour', 'minute', 'second']
    df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(large_s))), columns=items)
    # 6.77 s ± 54.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

takes about 7 seconds.

Updated:

Instead of manually creating YYYY-mm-dd-formatted strings, let pandas do it for you.

    df = pd.DataFrame(pd.to_datetime(s), columns=['date'])
    items = ['year', 'month', 'day', 'hour', 'minute', 'second']
    df[items] = pd.DataFrame(list(map(operator.attrgetter(*items), df['date'])), columns=items)

Then:

    df['dates'] = df['date'].astype(str).str[:10]

Output (with the month zero-padded):

    0    2019-01-16
    1    2019-01-16
    2    2019-01-16

3 Comments

Wow, game changer! Tweaked it a little bit, as the map function didn't work as a parameter for pd.DataFrame() for me (see my edited question), but that's huge. I couldn't get to 7 seconds, though. Took me 7 minutes, actually. Maybe because my items was bigger?
@Readler My bad, list was missing from the answer :P. For me, both lines (making df, and df['date']) take less than 10 seconds combined, but this will vary hugely based on the hardware. Also, the df['date'] part seems redundant. Let me post some update.
Nice, that's a lot more elegant than my solution! Could you maybe explain how the padding (.str[:10]) works? I made another loop and used map (map(lambda x: "0" + str(x) if len(str(x)) < 2 else x, df[d])), which is a lot more expensive than your solution.

Since comments are limited in length at least following points:

  1. Remove the if i < len(df) - 1: check; you don't need it if you replace your range with range(len(df) - 1).
  2. Save the results of your get_date function:

Before the loop: next_time = get_date(0)

Inside the loop:

    current_time = next_time
    next_time = get_date(i + 1)

This should save you some function calls, but probably pandas has some better way to do such stuff.
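Applied to the question's loop, the caching idea looks roughly like this (a sketch using a small made-up frame; get_date is the OP's helper, and only the "dates" column is shown):

```python
import datetime as dt
import pandas as pd

# Small made-up frame standing in for the OP's data
df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22",
                                 "16.01.2019 12:52:23",
                                 "16.01.2019 12:52:24"]})

def get_date(i):
    return dt.datetime.strptime(df["timestamp"][i], "%d.%m.%Y %H:%M:%S")

dates = [str(get_date(0).date())]   # initial value, as in the question
next_time = get_date(0)
for i in range(len(df) - 1):
    current_time = next_time        # reuse the value parsed last iteration
    next_time = get_date(i + 1)     # each timestamp is now parsed only once
    if current_time.date() < next_time.date():
        dates.append(str(next_time.date()))
    else:
        dates.append(str(current_time.date()))
```

This halves the number of strptime calls per column compared with calling get_date(i) and get_date(i+1) in every iteration, though it is still far slower than a vectorized pd.to_datetime.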

Comments


You do not need that, but instead should add a single column of Timestamp type:

    df['ts'] = pd.to_datetime(df.timestamp, format='%d.%m.%Y %H:%M:%S')

Then you can directly use all the time goodies on that column:

  • df.ts.dt.date: gives the date as a datetime.date
  • df.ts.dt.strftime(format): gives the date as a string formatted with format. For example, df.ts.dt.strftime('%Y-%m-%d %H:%M') is your "minutes" column
  • df.ts.dt.floor(freq='h'): is a timestamp truncated at hour level, for example for grouping
  • ... (just read pandas Datetime Data for reference)
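A minimal sketch of that approach (the sample frame is made up; the 'ts' column and the accessors are as described above):

```python
import pandas as pd

# Made-up sample data mirroring the question's format
df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22",
                                 "16.01.2019 12:52:23",
                                 "16.01.2019 12:52:24"]})
df["ts"] = pd.to_datetime(df["timestamp"], format="%d.%m.%Y %H:%M:%S")

df["dates"] = df["ts"].dt.strftime("%Y-%m-%d")            # YYYY-MM-DD
df["minutes"] = df["ts"].dt.strftime("%Y-%m-%d %H:%M")    # YYYY-MM-DD HH:mm
per_hour = df.groupby(df["ts"].dt.floor("h")).size()      # rows per hour
```

With floor there is no need to materialize string columns at all before grouping; the truncated Timestamps group directly.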

Comments
