I have a number of large CSV files (each around two million rows), which contain timestamps looking like this:

```
16.01.2019 12:52:22
16.01.2019 12:52:23
16.01.2019 12:52:24
```

Given that there's an entry for every second (over the course of about a year), it should be understandable why there are so many rows. I want to be more flexible, which is why I want to split the timestamps into four columns: date (2019-01-16), date+hour (2019-01-16 12), date+hour+minute (2019-01-16 12:52), and date+hour+minute+second (2019-01-16 12:52:22), so that I'm able to group timestamps at will. This is how I'm doing it:
```python
dates = []
hours = []
minutes = []
seconds = []

i = 0  # initial values
dates.append(str(get_date(i).date()))
hours.append(str(get_date(i).hour))
minutes.append(str(get_date(i).minute))
seconds.append(str(get_date(i).second))

for i in range(len(df)):
    if i < len(df) - 1:
        # dates: YYYY-MM-DD
        if str(get_date(i).date) < str(get_date(i+1).date):
            dates.append(str(get_date(i+1).date()))
        else:
            dates.append(str(get_date(i).date()))

        # dates+hours: YYYY-MM-DD HH
        if str(get_date(i).hour) < str(get_date(i+1).hour):
            hours.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour))
        else:
            hours.append(str(get_date(i).date()) + " " + str(get_date(i).hour))

        # dates+hours+minutes: YYYY-MM-DD HH:mm
        if str(get_date(i).minute) < str(get_date(i+1).minute):
            minutes.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute))
        else:
            minutes.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute))

        # dates+hours+minutes+seconds: YYYY-MM-DD HH:mm:ss
        if str(get_date(i).second) < str(get_date(i+1).second):
            seconds.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute) + ":" + str(get_date(i+1).second))
        else:
            seconds.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute) + ":" + str(get_date(i).second))

df["dates"] = dates
df["hours"] = hours
df["minutes"] = minutes
df["seconds"] = seconds
```

where `get_date()` is simply a function returning the timestamp at the given index:
```python
def get_date(i):
    return dt.datetime.strptime(df["timestamp"][i], '%d.%m.%Y %H:%M:%S')
```

I basically iterate through all entries, put each date/hour/minute/second into a list, and then insert the lists into my dataframe.
I guess this would put me at O(n²)? That's obviously not ideal. Right now, doing this on one file (~60 MB, 2 million rows) takes half an hour. I personally can't think of another way to do what I want, so I just wanted to see if there's anything I can do to reduce the complexity.
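For reference, here is a minimal vectorized sketch of the transformation I'm after (a hypothetical alternative, assuming `pd.to_datetime` can parse the whole column in one call and that the `.dt.strftime` accessor is available); it avoids the per-row `strptime` calls entirely:

```python
import pandas as pd

# parse the entire column in one vectorized call instead of row-by-row strptime
ts = pd.to_datetime(df["timestamp"], format='%d.%m.%Y %H:%M:%S')

# truncate to each desired granularity via strftime
df["dates"]   = ts.dt.strftime('%Y-%m-%d')           # YYYY-MM-DD
df["hours"]   = ts.dt.strftime('%Y-%m-%d %H')        # YYYY-MM-DD HH
df["minutes"] = ts.dt.strftime('%Y-%m-%d %H:%M')     # YYYY-MM-DD HH:mm
df["seconds"] = ts.dt.strftime('%Y-%m-%d %H:%M:%S')  # YYYY-MM-DD HH:mm:ss
```

Grouping could then key off any of these columns directly, e.g. `df.groupby("hours")`.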
edit: Tweaking @Chris' answer for my needs:
```python
times = bogie_df["timestamp"]  # got an error when applying map directly to the pd.DataFrame, which is why I first converted it into a list

items = ['year', 'month', 'day', 'hour', 'minute', 'second']
df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(times))), columns=items)

# for my desired YYYY-MM-DD format (though attrgetter only returns "1" for January instead of "01")
df["date"] = df['year'].map(str) + "-" + df["month"].map(str) + "-" + df["day"].map(str)
```
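One possible fix for the missing zero-padding, as a sketch (assuming `Series.str.zfill` is available on the string columns):

```python
# zero-pad month and day so that January becomes "01" instead of "1"
df["date"] = (df["year"].map(str) + "-"
              + df["month"].map(str).str.zfill(2) + "-"
              + df["day"].map(str).str.zfill(2))
```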
Comments:

- `dates.append(str(get_date(i).date()))` vs. `dates.append(str(get_date(i).date))`. You call the `get_date()` method really a lot. Have you tried saving that result into a variable? And is there any reason for the additional `if i < len(df) - 1:`? With the range object (I hope you are on Python 3.x) you already only get `i` from `0, ..., len(df) - 1`.
- Saving `get_date()` into a variable at the beginning of each iteration? That could help, I guess, yeah. I edited my question to show what `get_date()` really does. Since I access `i+1` during each iteration, the additional if-clause prevents the method from crashing at the end.
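A sketch of what these comments suggest: parse each timestamp once per iteration, drop the extra bounds check by shortening the range, and call `.date()` instead of referencing the bound method (only the `dates` column is shown; `cur` and `nxt` are names of my choosing):

```python
for i in range(len(df) - 1):      # range stops before the last row, so no bounds check needed
    cur = get_date(i)             # parse each timestamp once per iteration...
    nxt = get_date(i + 1)         # ...instead of re-calling get_date() for every comparison

    if cur.date() < nxt.date():   # compare date objects, not str() of a bound method
        dates.append(str(nxt.date()))
    else:
        dates.append(str(cur.date()))
```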