Awk date validation

Question

I have an awk script where I need to validate a large number of lines containing dates.

I'm currently using either a regex-based solution for basic validation (without testing for leap years) or a call to the UNIX date command for more accurate validation. The date command works well, but calling a system command for every line is pretty expensive in terms of performance.

I was hoping that someone here might be able to suggest a solution that is both accurate and fast.

Here's an example of my data:

    20140804024614
    20140803190020
    20140803163320
    20140803083222
    20140803170321
    20140803234044
    20140804011857
    20140803204008
    20140803160026
    20140803140120

Thanks.

Are you using GNU awk? If so, see gnu.org/software/gawk/manual/html_node/Time-Functions.html
– Tom Fenech, Commented Nov 5, 2014 at 16:05
Please show your attempts. Maybe you are using an over-complicated system call.
– fedorqui, Commented Nov 5, 2014 at 16:11
Are those additional numbers times? If so, do you need to validate those too? In DST zones 2:30am, for example, would be invalid on the day when the clocks go forward, since at 2am the time jumps forward to 3am. Wouldn't it make sense to include some examples of invalid dates in your sample input and also post the output you'd expect from a tool given that input? At least a statement of which segment of each line is the date would be good, e.g. is the date on the first line 2014-08-04 in YYYY-MM-DD format or 2014-04-08 or 2014-04-02 or something else?
– Ed Morton, Commented Nov 5, 2014 at 18:12
@EdMorton Thanks for your reply. Yes, the other digits are the time; the full format is "yyyyMMddhhmmss". I do not need to worry about time zones since that is already taken into consideration by another system.
– Soucrit, Commented Nov 5, 2014 at 18:58

Ed Morton · Accepted Answer · 2014-11-05 19:35:37Z

Given a whole lot of assumptions about your input file, this is probably all you need to print only the valid dates+times using GNU awk for time functions and gensub():

awk 'strftime("%Y%m%d%H%M%S",mktime(gensub(/(.{4})(..)(..)(..)(..)/,"\\1 \\2 \\3 \\4 \\5 ",""))) == $0' file

It will only work with dates since the epoch.

If you need to print some kind of "valid/invalid" message for each date/time:

    $ cat file
    20140230035900
    20140804024614
    $
    $ awk '{print (strftime("%Y%m%d%H%M%S",mktime(gensub(/(.{4})(..)(..)(..)(..)/,"\\1 \\2 \\3 \\4 \\5 ",""))) == $0 ? "" : "in") "valid:", $0}' file
    invalid: 20140230035900
    valid: 20140804024614

The above works by converting the date+time to seconds since the epoch, then converting those seconds back to a date+time in the original format. If the result is identical to what you started with, the original date was valid. An impossible date such as February 30 is normalized by mktime() to a real date, so the round trip changes it.
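The round trip can also be spelled out step by step. The following is a minimal sketch rather than the answer's own code: the names valid_ts and is_valid are hypothetical, substr() stands in for gensub() so the script only needs an awk that provides mktime() and strftime() (GNU awk does), and TZ=UTC is an added assumption so that a daylight-saving gap in the local zone cannot make a real time look invalid.

```shell
# valid_ts [file...]: print the lines whose first field is a valid
# yyyyMMddhhmmss timestamp on or after the epoch.
# Assumption: awk provides mktime() and strftime() (GNU awk does).
valid_ts() {
    TZ=UTC awk '
    function is_valid(s,    spec, secs) {
        # Exactly 14 characters, digits only.
        if (length(s) != 14 || s ~ /[^0-9]/) return 0
        # mktime() expects "YYYY MM DD HH MM SS".
        spec = substr(s, 1, 4) " " substr(s, 5, 2) " " substr(s, 7, 2)
        spec = spec " " substr(s, 9, 2) " " substr(s, 11, 2) " " substr(s, 13, 2)
        secs = mktime(spec)
        if (secs < 0) return 0      # error, or a time before the epoch
        # mktime() normalizes impossible dates, so for those the
        # round trip no longer reproduces the input.
        return (strftime("%Y%m%d%H%M%S", secs) == s)
    }
    is_valid($1)
    ' "$@"
}
```

For example, `printf '20140230035900\n20140804024614\n' | valid_ts` should print only 20140804024614, because February 30 comes back from the round trip as March 2.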

Hi, this looks like a pretty useful solution. However, it does not seem to work: a date such as "20140230035900", which contains the 30th of Feb, is marked as valid.
FYI, the input file contains a number of columns. The date field is actually split into two columns with other content on either side.
Thanks. The edited version works perfectly. I'll remember to include erroneous data next time.

Juan Diego Godoy Robles · Accepted Answer · 2014-11-05 17:27:33Z

Check this:

    checkFormat () {
        dateV="${1}"
        echo "${dateV}" | gawk '{
            # match() with an array argument is a gawk extension.
            # a[1]=year, a[2]=century, a[3]=month, a[4]=day, a[5]=hour (00-23)
            if (match($0, /^((19|20)[0-9][0-9])(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])([01][0-9]|2[0-3])$/, a)) {
                year = a[1] + 0
                mon  = a[3] + 0
                day  = a[4] + 0
                hour = a[5] + 0
            } else {
                print "KO: " $0
                exit
            }
            if (day == 31 && (mon == 4 || mon == 6 || mon == 9 || mon == 11))
                print "KO: " $0     # 30-day months
            else if (day >= 30 && mon == 2)
                print "KO: " $0     # February never has 30 or 31 days
            else if (mon == 2 && day == 29 && !(year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)))
                print "KO: " $0     # February 29 only in leap years
            else
                print "Correct date !:" $0
        }'
    }

    checkFormat 2014080417
    checkFormat 20140803190035

Usage:

    $ ./checker.sh
    Correct date !:2014080417
    KO: 20140803190035

NOTE: MINUTES and SECONDS will be your task :)
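One way the same calendar rules might be extended to minutes and seconds is sketched below in plain awk, with no gawk extensions and no external commands. The names check_ts, is_leap, days_in and valid are hypothetical, and the sketch assumes the timestamp is the first field, hours run 00-23, and seconds run 00-59 (no leap seconds).

```shell
# check_ts [file...]: label the first field of each line as a valid or
# invalid yyyyMMddhhmmss timestamp using arithmetic only.
check_ts() {
    awk '
    function is_leap(y) {
        return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0))
    }
    function days_in(y, m) {
        if (m == 2) return (is_leap(y) ? 29 : 28)
        return ((m == 4 || m == 6 || m == 9 || m == 11) ? 30 : 31)
    }
    function valid(s,    y, mo, d, h, mi, se) {
        # Exactly 14 characters, digits only.
        if (length(s) != 14 || s ~ /[^0-9]/) return 0
        y  = substr(s, 1, 4) + 0;  mo = substr(s, 5, 2) + 0
        d  = substr(s, 7, 2) + 0;  h  = substr(s, 9, 2) + 0
        mi = substr(s, 11, 2) + 0; se = substr(s, 13, 2) + 0
        if (mo < 1 || mo > 12) return 0
        if (d < 1 || d > days_in(y, mo)) return 0
        return (h <= 23 && mi <= 59 && se <= 59)
    }
    {
        verdict = valid($1) ? "valid" : "invalid"
        print verdict ": " $1
    }
    ' "$@"
}
```

Because nothing is delegated to mktime(), this variant does not depend on the time zone and is not limited to dates since the epoch.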

Check also: http://nixtip.wordpress.com/2011/11/28/an-awk-date-format-validator/

If you just want to check for short months (anything under 31 days), use function __(_) { return _%2~(7<_) }. That's literally all you need. And if you don't care about nawk at all, it can be even more concise: function __(_) { return _%2~7<_ }
Conversely, the long-months check is function ___(_) { return _%2~_<8 }
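The one-liner is terse, so here is a small sketch that unpacks it. It appears to work by comparing two flags: the parity of the month (`_%2`) and whether the month falls after July (`7<_`), with `~` acting as an equality test on the strings "0" and "1". The names short_months, golfed and readable are hypothetical, and February is treated as short simply because it has fewer than 31 days.

```shell
# short_months: print the month numbers (1-12) classified as short,
# one per line, and flag any month where the two versions disagree.
short_months() {
    awk '
    # The trick from the comment, with the names spelled out.
    function golfed(m)   { return (m % 2 ~ (7 < m)) }
    # Same test written as a comparison: odd months are short only
    # after July, even months are short only up to July.
    function readable(m) { return ((m % 2 == 1) == (m > 7)) }
    BEGIN {
        for (m = 1; m <= 12; m++) {
            if (golfed(m) != readable(m)) print "mismatch: " m
            else if (golfed(m)) print m
        }
    }'
}
```

Running `short_months` should list 2, 4, 6, 9 and 11.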

RARE Kpop Manifesto · Accepted Answer · 2023-08-07 15:24:40Z

I think this regex should cover ANY year, with or without left-padded zeros, in lieu of calculating the leap-year indicator:

awk '/(([2468][048]|[13579][26]|(^|[0+-])[48])(00)?|0000|^[+-]?0*)$/'

To my genuine surprise, gawk went much faster with the regex than with the standard formula, nawk is roughly the same speed with either method, and only mawk seems to favor doing the math.
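To see how the pattern lines up with the standard formula, the two can be compared over a range of years. This is a minimal sketch, not a benchmark: the names leap_mismatches, leap_math and leap_regex are hypothetical, the default range of 0 through 9999 is an arbitrary choice, and only unpadded, unsigned years are exercised.

```shell
# leap_mismatches [last]: print every year in 0..last (default 9999) for
# which the regex and the arithmetic rule give different answers.
leap_mismatches() {
    awk -v last="${1:-9999}" '
    # Standard Gregorian rule.
    function leap_math(y) {
        return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0))
    }
    # The pattern from the answer, applied to the year as a string.
    function leap_regex(y) {
        return (y ~ /(([2468][048]|[13579][26]|(^|[0+-])[48])(00)?|0000|^[+-]?0*)$/)
    }
    BEGIN {
        for (y = 0; y <= last; y++)
            if (leap_math(y) != leap_regex(y)) print y
    }'
}
```

An empty result from `leap_mismatches` means the two methods agree for every year in the range.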
