For a full list of BASHing data blog posts see the index page.
A quick repair job on a dislocated table
The tab-separated data table I was auditing had 5463 records with 21 fields each, but something was seriously wrong. Here's a hugely simplified version of that table, called "dislocated":
Fld1 | Fld2 | Fld3 | Fld4 | Fld5 | Fld6 | Fld7 |
001 | ccc | LLL | 111 | Qqq, 888 | mmm | hhh |
002 | ggg | PPP | 777 | Rrr | 444 | sss |
003 | vvv | AAA | 333 | Eee, 666 | ddd | hhh |
004 | ooo | EEE | 222 | Iii, 999 | uuu | hhh |
005 | hhh | JJJ | 888 | Vvv, 111 | aaa | hhh |
006 | bbb | CCC | 444 | Sss | 222 | kkk |
007 | iii | RRR | 666 | Bbb | 555 | ddd |
008 | rrr | NNN | 333 | Jjj, 888 | iii | hhh |
009 | sss | HHH | 555 | Www | 333 | xxx |
010 | aaa | TTT | 999 | Qqq, 111 | uuu | hhh |
Somehow the comma[space] in field 5 (above) had been converted to a field separator in certain records. This displaced all the subsequent fields in those records one field to the right, and the last field (always containing "hhh") had been trimmed off.
To continue the audit I needed to fix the real-world table. My quick, Band-Aid® solution was an AWK command (shown here for "dislocated", not for the real-world table):
awk 'BEGIN {FS=OFS="\t"} NR>1 && $5 !~ /,/ \
{print $1,$2,$3,$4,$5", "$6,$7,"hhh"; next} 1' dislocated
The BEGIN statement tells AWK that the input field separator (FS) and the output field separator (OFS) are both the tab character.
The "1" at the end of the command tells AWK to print all lines. It's an AWK shorthand for the usual pattern {action} statement. The pattern in this case is the number "1", which is always true (1 always equals 1) and the action is the default AWK action, which is to print the line.
The line pattern to be treated specially is that field 5 doesn't match a comma ($5 !~ /,/). Since this is also true in the header line (NR is 1), which I don't want to treat specially, the full pattern for special action is NR>1 && $5 !~ /,/.
The first special action is to print the line so that fields 5 and 6 are separated by comma[space] rather than by a tab, and with "hhh" (the standard entry) added as the last tab-separated field. This prints the fields in their correct form and order.
The second special action is "next", telling AWK to go straight to the next line without printing the dislocated original line.
Last update: 2020-07-15
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License