
Here is an example of my .csv file:

    date, time, value
    20240112,085917,11
    20240112,085917,22

I used to import it into a DataFrame like this:

    df = pd.read_csv(csv_file, parse_dates=[['date', 'time']]).set_index('date_time')

And I was getting the following structure:

    date_time            value
    2023-10-02 10:00:00     11
    2023-10-02 10:01:00     22

After updating to pandas 2.2.0, I started getting this warning:

FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.

So, to achieve the same result, I now have to do:

    df['datetime'] = df.date.astype(str) + ' ' + df.time.astype(str)
    df['datetime'] = pd.to_datetime(df.datetime, format="%Y%m%d %H%M%S")
    df = df.drop(['date', 'time'], axis=1).set_index('datetime')

Is there any way to do this in the new versions of pandas without string concatenation, which is usually very slow?

2 Comments
  • Not that I'm aware of, but you can accomplish it all in one line: df["datetime"] = pd.to_datetime(arg=df.pop("date").str.cat(df.pop("time"), sep=" ")) (Commented Feb 12, 2024 at 18:44)
  • Why, oh why, do these new to_datetime solutions read so much worse than the now-deprecated feature of read_csv? (Commented May 22, 2024 at 14:54)

1 Answer


Since parsing the date involves strings anyway, and given that your time format has no separators, this seems like the most reasonable option.

You could simplify your code by reading the columns as strings directly and popping them:

    df = pd.read_csv(csv_file, sep=', *', engine='python', dtype={'date': str, 'time': str})
    df['datetime'] = pd.to_datetime(df.pop('date')+' '+df.pop('time'), format="%Y%m%d %H%M%S")
    df = df.set_index('datetime')

NB. if your days and hours/minutes/seconds are reliably padded with zeros, you can use df.pop('date')+df.pop('time') and format="%Y%m%d%H%M%S".

Output:

                         value
    datetime
    2024-01-12 08:59:17     11
    2024-01-12 08:59:17     22
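The zero-padding note above can be sketched as a self-contained example. This is only an illustration under assumptions: `csv_text` and the use of `io.StringIO` are hypothetical stand-ins for the real `csv_file`, and the data is the two-row sample from the question.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the CSV file from the question.
csv_text = "date, time, value\n20240112,085917,11\n20240112,085917,22\n"

# Read both columns as strings so leading zeros (e.g. "085917") are preserved.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=", *",
    engine="python",
    dtype={"date": str, "time": str},
)

# With fixed-width, zero-padded fields, no separator is needed between the parts.
df["datetime"] = pd.to_datetime(
    df.pop("date") + df.pop("time"), format="%Y%m%d%H%M%S"
)
df = df.set_index("datetime")
print(df)
```

Reading `time` as a string matters here: if it were parsed as an integer, `085917` would lose its leading zero and the fixed-width format would no longer match.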

A variant with numeric operations and a timedelta:

    df = pd.read_csv(csv_file, sep=', *', engine='python', dtype={'date': str})
    a = df.pop('time').to_numpy()
    a, s = np.divmod(a, 100)
    h, m = np.divmod(a, 100)
    df['datetime'] = (pd.to_datetime(df.pop('date'))
                      + pd.to_timedelta(h*3600+m*60+s, unit='s')
                     )

This variant is actually much slower (27.7 ms ± 4.11 ms per loop vs. 350 µs ± 44.5 µs per loop for the string approach).
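To check the comparison on your own machine, a rough timing sketch might look like the following. Everything here is an assumption made for illustration: the synthetic 10,000-row data, the helper names `string_approach` and `numeric_approach`, and the use of `timeit`. It is not the setup that produced the figures above, and the results will vary with data size and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 10,000 rows of zero-padded date/time strings.
n = 10_000
dates = pd.Series(["20240112"] * n)
times = pd.Series(["085917"] * n)
# Integer form of the time column, as read_csv would produce without dtype=str.
times_int = times.astype(int).to_numpy()


def string_approach():
    # Concatenate the strings and parse with an explicit format.
    return pd.to_datetime(dates + " " + times, format="%Y%m%d %H%M%S")


def numeric_approach():
    # Split HHMMSS into hours, minutes, seconds with integer arithmetic.
    a, s = np.divmod(times_int, 100)
    h, m = np.divmod(a, 100)
    return pd.to_datetime(dates, format="%Y%m%d") + pd.to_timedelta(
        h * 3600 + m * 60 + s, unit="s"
    )


# Sanity check: both approaches should produce the same timestamps.
assert (string_approach() == numeric_approach()).all()

for fn in (string_approach, numeric_approach):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```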
