banner

For a full list of BASHing data blog posts see the index page.     RSS


How to fix "one2many" data issues

I explained the "one2many" problem in an earlier BASHing data post and showed how to detect it (with the "one2many" function) in that post and in A Data Cleaner's Cookbook. In a nutshell, the problem occurs when each entry in a field (call it field A) should have only one corresponding entry in a second field (B).

Sometimes that doesn't happen. Instead, field B has the wrong entry or no entry at all. Here's a much simplified example, a tab-separated table called "demo":

saleIDdateitemclasskg
0012021-01-02capsicumvegetable11.9
0022021-01-02bananafruit12.7
0032021-01-02capsicumvegetable3.7
0042021-01-02potatovegetable4.1
0052021-01-02capsicumvegetable6.0
0062021-01-02potatofruit13.0
0072021-01-02bananavegetable9.1
0082021-01-02potatovegetable15.0
0092021-01-02applefruit5.6
0102021-01-02bananafruit7.7
0112021-01-02pumpkinvegetable8.3
0122021-01-02pumpkinvegetable5.6
0132021-01-02applefruit3.5
0142021-01-02pumpkinvegetable5.3
0152021-01-02capsicumvegetable10.3
0162021-01-03applefruit12.2
0172021-01-03pumpkinvegetable12.6
0182021-01-03potatovegetable4.4
0192021-01-03applefruit12.5
0202021-01-03pumpkinvegetable11.6
0212021-01-03bananavegetable14.5
0222021-01-03capsicumvegetable4.1
0232021-01-03banana5.9
0242021-01-03potatovegetable4.8
0252021-01-03applefruit15.6

Are all the items in the right class? "one2many" says no. One of the potato sales has been classed as "fruit", two of the banana orders are classed as "vegetable", and one banana order is missing a class entry:

o2m1

The fix is simple. First, build a lookup table from the "one2many" results, containing only the correct pairing. With long "one2many" outputs this might best be done in a text editor, but in this demonstration I can do it on the command line:

one2many demo 3 4 | cut -f2- | sed -n '1p;5p' > lookup

Update 2022-02-20. The "one2many" function now adds a blank line between similar sets of data items. In building "lookup", these unneeded blank lines can be deleted, on the command line with awk NF.

o2m2

Next, put the lookup table in an AWK array, with the item as index string and the correct class as value string. Then get AWK to replace the class entry in the main file with the correct one on every line where the item occurs. This fills in gaps in the class field, replaces incorrect entries and overwrites correct entries (with the correct one, again):

awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1]=$2; next} \
$3 in a {$4=a[$3]} 1' lookup demo > demo_fixed

o2m3

If instead you want to edit the table one data item at a time, the lookup table can also be used to build a "to do" list:

awk -F"\t" 'FNR==NR {a[$1]=$2; next} \
$3 in a && $4 != a[$3] \
{print $1 FS $3 FS $4}' lookup demo

o2m4

Last update: 2022-02-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License