For a full list of BASHing data blog posts see the index page.
How to fix "one2many" data issues
I explained the "one2many" problem in an earlier BASHing data post and showed how to detect it (with the "one2many" function) in that post and in A Data Cleaner's Cookbook. In a nutshell, the problem occurs when each entry in a field (call it field A) should have only one corresponding entry in a second field (B).
Sometimes that doesn't happen. Instead, field B has the wrong entry or no entry at all. Here's a much simplified example, a tab-separated table called "demo":
saleID | date | item | class | kg |
001 | 2021-01-02 | capsicum | vegetable | 11.9 |
002 | 2021-01-02 | banana | fruit | 12.7 |
003 | 2021-01-02 | capsicum | vegetable | 3.7 |
004 | 2021-01-02 | potato | vegetable | 4.1 |
005 | 2021-01-02 | capsicum | vegetable | 6.0 |
006 | 2021-01-02 | potato | fruit | 13.0 |
007 | 2021-01-02 | banana | vegetable | 9.1 |
008 | 2021-01-02 | potato | vegetable | 15.0 |
009 | 2021-01-02 | apple | fruit | 5.6 |
010 | 2021-01-02 | banana | fruit | 7.7 |
011 | 2021-01-02 | pumpkin | vegetable | 8.3 |
012 | 2021-01-02 | pumpkin | vegetable | 5.6 |
013 | 2021-01-02 | apple | fruit | 3.5 |
014 | 2021-01-02 | pumpkin | vegetable | 5.3 |
015 | 2021-01-02 | capsicum | vegetable | 10.3 |
016 | 2021-01-03 | apple | fruit | 12.2 |
017 | 2021-01-03 | pumpkin | vegetable | 12.6 |
018 | 2021-01-03 | potato | vegetable | 4.4 |
019 | 2021-01-03 | apple | fruit | 12.5 |
020 | 2021-01-03 | pumpkin | vegetable | 11.6 |
021 | 2021-01-03 | banana | vegetable | 14.5 |
022 | 2021-01-03 | capsicum | vegetable | 4.1 |
023 | 2021-01-03 | banana | 5.9 | |
024 | 2021-01-03 | potato | vegetable | 4.8 |
025 | 2021-01-03 | apple | fruit | 15.6 |
Are all the items in the right class? "one2many" says no. One of the potato sales has been classed as "fruit", two of the banana orders are classed as "vegetable", and one banana order is missing a class entry:
The fix is simple. First, build a lookup table from the "one2many" results, containing only the correct pairing. With long "one2many" outputs this might best be done in a text editor, but in this demonstration I can do it on the command line:
one2many demo 3 4 | cut -f2- | sed -n '1p;5p' > lookup
Update 2022-02-20. The "one2many" function now adds a blank line between similar sets of data items. In building "lookup", these unneeded blank lines can be deleted, on the command line with awk NF.
Next, put the lookup table in an AWK array, with the item as index string and the correct class as value string. Then get AWK to replace the class entry in the main file with the correct one on every line where the item occurs. This fills in gaps in the class field, replaces incorrect entries and overwrites correct entries (with the correct one, again):
awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1]=$2; next} \
$3 in a {$4=a[$3]} 1' lookup demo > demo_fixed
If instead you want to edit the table one data item at a time, the lookup table can also be used to build a "to do" list:
awk -F"\t" 'FNR==NR {a[$1]=$2; next} \
$3 in a && $4 != a[$3] \
{print $1 FS $3 FS $4}' lookup demo
Last update: 2022-02-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License