edited Jul 4, 2023 at 14:31

user432564

Just to address some of the comments with more background info:

This data is generated by an app that outputs CSV log data to a file. It is not my app, and I have no control over how the app logs. The CSV is unquoted (even when a field contains spaces), and empty fields contain nothing.

I am loading the CSV data directly into a MySQL database. While storing a timezone would generally be a good idea, this data is always timestamped in local time, and when visualizing it (in Grafana) I have no need to store it in UTC and then convert to EDT just for viewing (why convert the time to UTC just to convert it back to EDT?). Plus, each CSV line contains longitude and latitude, so if I later wanted to change the timestamps to UTC, it wouldn't be impossible to work out what the local time was.

The additional formatting I am doing is not much and could probably be done with awk (again, I am not too familiar with its syntax). It doesn't help that the original data needs an ID column added and quotes put around some fields, and that there are two date-time fields in TWO different formats. So my long and terrible pipeline generally looks like this:

cat file | add ID column | format timestamp in second csv field | format timestamp in third csv field | quote any field with spaces | replace empty fields with \N > output file

I had some trouble with MySQL and empty fields, so I added the explicit null marker (\N). There are definitely better ways to do this; once I get the whole process working, I'll go back through and simplify.
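As a rough sketch of how three of those stages (adding the ID column, quoting fields with spaces, and writing \N for empty fields) might collapse into a single awk pass: the function name `prepare_rows` is hypothetical, the ID is assumed to be simply the input line number, and fields are assumed never to contain commas, since the CSV is unquoted. The two timestamp stages are left out here.

```shell
# Hypothetical helper: reads unquoted CSV on stdin, writes load-ready CSV on stdout.
prepare_rows() {
  awk -F, -v OFS=, '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "")
          $i = "\\N"              # \N is what MySQL LOAD DATA treats as NULL by default
        else if ($i ~ / /)
          $i = "\"" $i "\""       # wrap any field containing a space in double quotes
      }
      print NR, $0                # assumption: the ID column is the input line number
    }
  '
}
```

It would be used as `prepare_rows < input.csv > output.csv`, where both file names are placeholders.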

I very much appreciate everyone's responses.

asked Jul 4, 2023 at 2:20

user432564

Changing non-standard date timestamp format in CSV using awk/sed

I have a CSV with a few hundred thousand lines, and I'm trying to change the date format in the second field. I should add that the second field is sometimes not populated at all. The deplorable input format is DayofWeek MonthofYear DayofMonth Hour:Minute:Second Timezone Year.

Example:

Mon Jul 03 14:48:54 EDT 2023

My desired output format is YYYY-MM-DD HH:MM:SS.

Example:

2023-07-03 14:48:54

I am familiar with sed, so I wrote this regex replacement, which gets it into almost the right format; the remaining issue is that the month is a name rather than a number.

sed -E "s/[A-Za-z]{3}\s([A-Za-z]{3})\s([0-9]{2})\s([0-9]{2}:[0-9]{2}:[0-9]{2})\s[A-Z]+\s([0-9]{4})/\4-\1-\2 \3/"

I don't think it's possible to run the date command inside the sed replacement using capture group 1 (but please correct me if I'm wrong).
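One way to stay inside sed without calling date at all, sketched here as an assumption rather than a tested fix for the real data, is to follow the reordering substitution with one substitution per month name. The function name `fix_timestamp_sed` is hypothetical, and spaces are matched literally instead of with `\s` so the expressions do not depend on GNU sed.

```shell
# Hypothetical helper: rewrites the first matching timestamp on each line.
fix_timestamp_sed() {
  sed -E \
    -e 's/[A-Za-z]{3} ([A-Za-z]{3}) ([0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) [A-Z]+ ([0-9]{4})/\4-\1-\2 \3/' \
    -e 's/([0-9]{4})-Jan-([0-9]{2} )/\1-01-\2/' \
    -e 's/([0-9]{4})-Feb-([0-9]{2} )/\1-02-\2/' \
    -e 's/([0-9]{4})-Mar-([0-9]{2} )/\1-03-\2/' \
    -e 's/([0-9]{4})-Apr-([0-9]{2} )/\1-04-\2/' \
    -e 's/([0-9]{4})-May-([0-9]{2} )/\1-05-\2/' \
    -e 's/([0-9]{4})-Jun-([0-9]{2} )/\1-06-\2/' \
    -e 's/([0-9]{4})-Jul-([0-9]{2} )/\1-07-\2/' \
    -e 's/([0-9]{4})-Aug-([0-9]{2} )/\1-08-\2/' \
    -e 's/([0-9]{4})-Sep-([0-9]{2} )/\1-09-\2/' \
    -e 's/([0-9]{4})-Oct-([0-9]{2} )/\1-10-\2/' \
    -e 's/([0-9]{4})-Nov-([0-9]{2} )/\1-11-\2/' \
    -e 's/([0-9]{4})-Dec-([0-9]{2} )/\1-12-\2/'
}
```

Lines whose second field is empty pass through unchanged, because none of the expressions match them.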

I don't know how to go about referencing the month and parsing it with the date command once the sed command finishes, and I think it would be better to do the processing without piping the entire output to another command. This command is just one in a long line of piped commands for formatting the rest of the data.

It seems that maybe awk can do the entire formatting all at once, but I don't really know how to use awk that well.

What's the most efficient way to get the timestamp into the correct format?
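For illustration, here is a minimal awk sketch of the conversion described above, done in one pass with no external date call. It assumes the timestamp is in field 2, that fields never contain commas (the file is unquoted), and that month names are the English three-letter abbreviations; the function name `fix_timestamp` is hypothetical.

```shell
# Hypothetical helper: converts field 2 from
# "Mon Jul 03 14:48:54 EDT 2023" to "2023-07-03 14:48:54".
fix_timestamp() {
  awk -F, -v OFS=, '
    BEGIN {
      # Build a lookup table: month["Jan"] = "01", ..., month["Dec"] = "12"
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
      for (i = 1; i <= n; i++) month[names[i]] = sprintf("%02d", i)
    }
    # Rewrite field 2 only when it has six space-separated parts and a known month;
    # empty or unexpected values are left untouched.
    split($2, t, " ") == 6 && (t[2] in month) {
      $2 = t[6] "-" month[t[2]] "-" sprintf("%02d", t[3]) " " t[4]
    }
    { print }
  '
}
```

It would be used as `fix_timestamp < input.csv > output.csv`, with placeholder file names. The timezone part is simply dropped, which matches the desired output format.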
