I have an awk script where I need to validate a large number of lines containing dates.

I'm currently using either a regex-based solution for basic validation (it does not check leap years, for example) or a call to the UNIX date command for more accurate validation. The date command works well, but calling a system command for every line is expensive in terms of performance.

I was hoping that someone here might be able to suggest a solution that is both accurate and fast.

Here's an example of my data:

20140804024614
20140803190020
20140803163320
20140803083222
20140803170321
20140803234044
20140804011857
20140803204008
20140803160026
20140803140120

Thanks.

  • Are you using GNU awk? If so, gnu.org/software/gawk/manual/html_node/Time-Functions.html Commented Nov 5, 2014 at 16:05
  • Please show your attempts, maybe you are using an over-complicated call to syscall. Commented Nov 5, 2014 at 16:11
  • How did you make this format work with UNIX date? Commented Nov 5, 2014 at 16:13
  • Are those additional numbers times? If so, do you need to validate those too? In DST zones 2:30am, for example, would be invalid on the day when the clocks go forward since at 2am the time jumps forward to 3am. Wouldn't it make sense to include some examples of invalid dates in your sample input and also post the output you'd expect from a tool given that input? At least a statement of which segment of each line is the date would be good - e.g. is the date on the first line 2014-08-04 in YYYY-MM-DD format or 2014-04-08 or 2014-04-02 or something else? Commented Nov 5, 2014 at 18:12
  • @EdMorton Thanks for your reply. yes the other fields are the time - "yyyyMMddhhmmss". I do not need to worry about time zones since it's already taken into consideration by another system. Commented Nov 5, 2014 at 18:58

3 Answers

3

Given a whole lot of assumptions about your input file, this is probably all you need to print only the valid dates+times using GNU awk for time functions and gensub():

awk 'strftime("%Y%m%d%H%M%S",mktime(gensub(/(.{4})(..)(..)(..)(..)/,"\\1 \\2 \\3 \\4 \\5 ",""))) == $0' file 

It will only work with dates since the epoch.

If you need to print some kind of "valid/invalid" message for each date/time:

$ cat file
20140230035900
20140804024614
$
$ awk '{print (strftime("%Y%m%d%H%M%S",mktime(gensub(/(.{4})(..)(..)(..)(..)/,"\\1 \\2 \\3 \\4 \\5 ",""))) == $0 ? "" : "in") "valid:", $0}' file
invalid: 20140230035900
valid: 20140804024614

The above works by converting the date+time to seconds since the epoch and then converting those seconds back to a date+time in the original format. mktime() normalizes out-of-range values (February 30 becomes March 2, for example), so the result is identical to the input only if the original date+time was valid.
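The same round trip can be wrapped in a function, which may help when the timestamp is one column among many. This is a minimal sketch, not the answer's own code: the names validate_ts, is_valid_ts and AWK_BIN are hypothetical, substr() is swapped in for gensub(), and running with TZ=UTC is an assumption made so that a daylight-saving gap in the local zone cannot make a real time look invalid. Timestamps before the epoch are reported as invalid, in line with the limitation noted above.

```shell
# Prefer gawk; otherwise fall back to the default awk, which must then
# provide mktime() and strftime() itself (they are extensions, not POSIX).
AWK_BIN=$(command -v gawk || command -v awk)

# validate_ts: read yyyyMMddhhmmss timestamps from stdin (first field of
# each line) and label each one valid or invalid.
validate_ts() {
    TZ=UTC "$AWK_BIN" '
    # Hypothetical helper: 1 if ts survives the round trip through
    # mktime() and strftime() unchanged, 0 otherwise.
    function is_valid_ts(ts,    spec, t) {
        if (length(ts) != 14 || ts ~ /[^0-9]/)   # exactly 14 digits
            return 0
        # Build the "YYYY MM DD HH MM SS" string that mktime() expects.
        spec = substr(ts, 1, 4) " " substr(ts, 5, 2) " " substr(ts, 7, 2)
        spec = spec " " substr(ts, 9, 2) " " substr(ts, 11, 2) " " substr(ts, 13, 2)
        t = mktime(spec)
        if (t < 0)                               # error or pre-epoch
            return 0
        return (strftime("%Y%m%d%H%M%S", t) == ts) ? 1 : 0
    }
    {
        label = is_valid_ts($1) ? "valid:" : "invalid:"
        print label, $1
    }
    '
}
```

It could be used as `printf '20140230035900\n20140804024614\n' | validate_ts`. If the date and time sit in separate columns, passing their concatenation (for example `is_valid_ts($3 $4)`, with column numbers chosen to match the real file) should work the same way.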

3 Comments

Hi, this looks like a pretty useful solution. However, it does not seem to work: a date such as "20140230035900", which contains the 30th of Feb, is marked as valid.
FYI, the input file contains a number of columns. The date field is actually split into two columns with other content on either side.
Thanks. The edited version works perfectly. I've noted to add the erroneous data next time.
1

Check this:

checkFormat () {
    dateV="${1}"
    echo "${dateV}" | gawk '{
        # Expected format: yyyyMMddhh
        # a[1]=year, a[2]=century, a[3]=month, a[4]=day, a[5]=hour (00-23)
        if (match($0, /^((19|20)[0-9][0-9])(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])([01][0-9]|2[0-3])$/, a)) {
            year = a[1] + 0
            mon  = a[3] + 0
            day  = a[4] + 0
            hour = a[5] + 0
        } else {
            print "KO: " $0
            exit
        }
        if (day == 31 && (mon == 4 || mon == 6 || mon == 9 || mon == 11))
            print "KO: " $0    # 30-day months
        else if (day >= 30 && mon == 2)
            print "KO: " $0    # February never has 30 or 31 days
        else if (mon == 2 && day == 29 && !(year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)))
            print "KO: " $0    # February 29 only in leap years
        else
            print "Correct date !:" $0
    }'
}

checkFormat 2014080417
checkFormat 20140803190035

Usage:

$ ./checker.sh
Correct date !:2014080417
KO: 20140803190035

NOTE: MINUTES and SECONDS will be your task :)
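One possible way to cover minutes and seconds as well, without any gawk-only features, is to split the 14 digits with substr() and check each part against its range. The following is a sketch under stated assumptions rather than a finished validator: the names check_ts, days_in_month and is_valid_ts are made up for illustration, the year is not restricted to any range, and a seconds value of 60 (leap second) is rejected.

```shell
# check_ts: label each yyyyMMddhhmmss timestamp read from stdin (first
# field of each line) as OK or KO, using only POSIX awk features.
check_ts() {
    awk '
    # Hypothetical helper: number of days in month m of year y.
    function days_in_month(y, m) {
        if (m == 2)
            return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) ? 29 : 28
        return (m == 4 || m == 6 || m == 9 || m == 11) ? 30 : 31
    }
    # Hypothetical helper: 1 if ts is a well-formed, existing date and time.
    function is_valid_ts(ts,    y, mo, d, h, mi, s) {
        if (length(ts) != 14 || ts ~ /[^0-9]/)   # exactly 14 digits
            return 0
        y  = substr(ts, 1, 4) + 0
        mo = substr(ts, 5, 2) + 0
        d  = substr(ts, 7, 2) + 0
        h  = substr(ts, 9, 2) + 0
        mi = substr(ts, 11, 2) + 0
        s  = substr(ts, 13, 2) + 0
        if (mo < 1 || mo > 12 || d < 1 || d > days_in_month(y, mo))
            return 0
        return (h <= 23 && mi <= 59 && s <= 59) ? 1 : 0
    }
    {
        label = is_valid_ts($1) ? "OK:" : "KO:"
        print label, $1
    }
    '
}
```

For example, `printf '20140230035900\n20160229235959\n' | check_ts` reports the first timestamp as KO (February 30) and the second as OK (2016 is a leap year). Because everything runs inside a single awk process, no external command is spawned per line.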

Check also: http://nixtip.wordpress.com/2011/11/28/an-awk-date-format-validator/

2 Comments

If you just want to check for short months (anything < 31 days), just do: function __(_) { return _%2~(7<_) } - that's literally all you need. And if you don't care about nawk at all, it can be even more concise: function __(_) { return _%2~7<_ }
Conversely, the long-months check is function ___(_) { return _%2~_<8 }
0

I think this regex should cover ANY year, with or without left-padded zeros, in lieu of calculating the leap-year indicator :

awk '/(([2468][048]|[13579][26]|(^|[0+-])[48])(00)?|0000|^[+-]?0*)$/' 

To my genuine surprise,

  • gawk went much faster with the regex than the standard formula
  • nawk is roughly the same speed between the 2 methods, and
  • only mawk seems to favor doing math
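Before relying on the regex, it may be worth checking that it agrees with the standard formula. The sketch below counts disagreements between the two; the function names and the range of 1 to 9999 (unpadded years) are arbitrary choices for illustration, and it says nothing about the relative speed of the two methods.

```shell
# compare_leap_checks: count the years in 1..9999 for which the regex and
# the arithmetic rule disagree about leap-year status.
compare_leap_checks() {
    awk '
    # Standard Gregorian rule.
    function leap_by_math(y) {
        return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) ? 1 : 0
    }
    # The regex from the answer, applied to the year as a string.
    function leap_by_regex(y) {
        return ((y "") ~ /(([2468][048]|[13579][26]|(^|[0+-])[48])(00)?|0000|^[+-]?0*)$/) ? 1 : 0
    }
    BEGIN {
        bad = 0
        for (y = 1; y <= 9999; y++)
            if (leap_by_math(y) != leap_by_regex(y))
                bad++
        print "mismatches:", bad
    }
    '
}
```

Running `compare_leap_checks` prints the number of years on which the two methods differ; a result of 0 means they agree over the whole tested range.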
