For a full list of BASHing data blog posts see the index page.
Check the day of year, given a date
The ugliest function in my data-auditing toolkit is "chkday". I wrote it for data tables containing both a date in ISO 8601 format and the corresponding day number for the given year. Here's a simplified version of such a table, called "dates":
ID | ISO_8601_date | day |
aaa | 2017-03-14 | 72 |
bbb | 1995-11-06 | 310 |
ccc | 2000-04-19 | 109 |
The day numbers for aaa and ccc are incorrect, and to find those records I use the horrible "chkday":
chkday() { awk -F"\t" -v isodate="$2" -v dayno="$3" 'NR>1 && $isodate != "" && $dayno != "" {split($isodate,a,"-"); b=strftime("%j",mktime(a[1]" "a[2]" "a[3]" 0 0 0")); if (b != sprintf("%03d",$dayno)) print $isodate FS $dayno FS b}' "$1"; }
"chkday" takes as its 3 arguments the filename of a tab-separated table, the number of the field with the ISO date and the number of the field with the day number to be checked. Where the day number is wrong, "chkday" returns the ISO date, the given day number and the correct day number:
AWK is told with -F"\t" that the table is tab-separated. The variable isodate is defined as the data item in the field with the number given in the second argument (-v isodate="$2"), and the variable dayno as the data item in the field with the number given in the third argument (-v dayno="$3").
AWK is instructed only to look at the lines in the table that come after the header (NR>1) and in which both the isodate and dayno fields are non-empty ($isodate != "" && $dayno != "").
Now for the ugliness. AWK first breaks the ISO 8601 date using the split function (split($isodate,a,"-")) into year, month and day components, storing these in an array "a" where year is "a[1]", month is "a[2]" and day is "a[3]".
These components are fed to the mktime function to convert year, month and day to UNIX epoch in seconds (mktime(a[1]" "a[2]" "a[3]" 0 0 0")).
The mktime output is converted to day number in the given year with the strftime function (strftime("%j",mktime(a[1]" "a[2]" "a[3]" 0 0 0"))) The day number is stored in the variable "b".
The strftime output in "b" is a number from 001 to 365 (or 366 in a leap year). To compare the given day number with the one I've just derived with AWK functions, I first need to ensure that the given day number is also a number with up to 2 leading zeroes. The comparison is therefore done with the given day number reformatted using sprintf (sprintf("%03d",$dayno)). If the two numbers aren't equal (if (b != sprintf("%03d",$dayno))), AWK prints the full ISO date, the given day number and the derived one, all tab-separated (print $isodate FS $dayno FS b).
Before I use "chkday" I check the ISO date field to ensure that all entries with a corresponding day number are valid and correctly formatted as YYYY-MM-DD.
I've often thought that there must be a better way to do this. After all, getting the day number of an ISO date is so easy with the date command:
There are two different ways (that I know of) to put a shell command like date inside an AWK command, and make the shell command output available to AWK. One is to use the system() function:
A second method is to simply run the shell command and save its output with getline in a variable; this is a GNU AWK (gawk) trick:
To see where the mismatches are, I can pipe the outputs to a second AWK command that looks for disagreement between the 2 last fields in the output:
awk -F"\t" 'NR>1 {printf("%s\t%03d\t",$2,$3); system("date -d "$2" +%j")}' dates | awk '$2 != $3'
awk -F"\t" 'NR>1 {"date -d "$2" +%j" |& getline foo; printf("%s\t%03d\t%s\n",$2,$3,foo)}' dates | awk '$2 != $3'
Well, they both work, but I don't think I've prettified my function very much with either of those two approaches:
chkday_system() { awk -F"\t" -v isodate="$2" -v dayno="$3" 'NR>1 && $isodate != "" && $dayno != "" {printf("%s\t%03d\t",$isodate,$dayno); system("date -d "$isodate" +%j")}' "$1" | awk '$2 != $3'; }
chkday_getline() { awk -F"\t" -v isodate="$2" -v dayno="$3" 'NR>1 {"date -d "$isodate" +%j" |& getline foo; printf("%s\t%03d\t%s\n",$isodate,$dayno,foo)}' "$1" | awk '$2 != $3'; }
...so for the time being I'm sticking with "chkday".
Last update: 2020-11-18
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License