How to find almost-duplicates
Update. For an important update on this topic, please see this BASHing data post.
A Data Cleaner's Cookbook explains how to find partial duplicates in a data table using an AWK array.
Here's an example using the tab-separated table "demo" (copy/paste-able as TSV):
Fld1 | Fld2 | Fld3 | Fld4 | Fld5 | Fld6 | Fld7 | Fld8 |
001 | 7b03 | 020d | 71b7 | 43c4 | 8ffd | f9352b | 102e2d |
002 | 521a | 1da1 | f9eb | 4268 | 9fa8 | fc7357 | 0a31b8 |
003 | e6c3 | 0e9b | dc9f | 448b | b1c4 | 7705ca | 772ab5 |
004 | 36cf | fd59 | 0c62 | 4eb6 | 82d1 | e30076 | ecedbd |
005 | 15c5 | 7874 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
006 | b3fb | 5bad | 3361 | 4259 | a5b0 | 30370c | 953333 |
007 | 15c5 | c3d5 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
008 | 7b03 | 7686 | d264 | 4c34 | b0e4 | 364607 | 5af668 |
009 | 7ee1 | 8a53 | 5cc5 | 4f57 | 9cf5 | ddc735 | 56eee8 |
010 | bd75 | 3324 | 21mz | 41b0 | b1bc | 22964a | b9f2a3 |
011 | 15d7 | 1fb2 | 7223 | 4e8f | 8f1f | 8e6b76 | f60cd1 |
012 | c3cc | ef6c | 70fb | 4a45 | 9428 | f00f73 | 07e92d |
013 | 9ab4 | 991c | 0bd7 | 4f3c | badf | ee145b | 5a6d17 |
014 | 6ad5 | 8395 | 19aa | 43c4 | 9cea | 3a3c90 | e84150 |
015 | 607c | 3753 | 8a69 | 44bf | b41f | ddb1eb | 4a42ff |
016 | 7b03 | f067 | 71b7 | 43c4 | 8ffd | f9352b | 102e2d |
017 | 05f1 | 89f3 | 5067 | 6712 | b1c4 | 3b5245 | 4c4e35 |
018 | e20d | 5346 | 71a8 | 4b26 | a31d | ab914d | e39049 |
019 | 15c5 | 52bd | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
020 | e917 | b879 | 08dd | 4387 | b520 | 814a8a | 10717b |
Fields 1 and 3 of "demo" contain only unique values. Ignoring those two fields, are there any duplicate records in the table? The Cookbook-style command would be:
awk -F"\t" 'FNR==NR {a[$2,$4,$5,$6,$7,$8]++; next} a[$2,$4,$5,$6,$7,$8]>1' demo demo | sort -t $'\t' -k2 -nk1
Fields in "demo" are tab-delimited (-F"\t"). The FNR==NR condition tells AWK to do the first action on the first file, "demo", and the second action on the second file, also "demo", as explained here.
In the first action, the array "a" is filled using the strung-together fields 2 and 4-8 as the index string (a[$2,$4,$5,$6,$7,$8]), with the count of each unique combination of those fields as the stored value (++). After a line's field combination has been added to the array (or its count incremented, if the combination is already there), AWK moves to the next line (next).
In the second action, AWK looks up the stored count for the field combination in the current line. If the count is greater than 1 (meaning that combination appears in more than one record), the line is printed, which is AWK's default action.
All the almost-duplicate lines found by AWK are passed to sort, which in this case sorts first by the second tab-separated field (field 2), then by the serial number in field 1.
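Run on "demo", the command should return the five almost-duplicate records, shown here with the same pipe layout as the table above (the real output is tab-separated):
005 | 15c5 | 7874 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
007 | 15c5 | c3d5 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
019 | 15c5 | 52bd | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
001 | 7b03 | 020d | 71b7 | 43c4 | 8ffd | f9352b | 102e2d |
016 | 7b03 | f067 | 71b7 | 43c4 | 8ffd | f9352b | 102e2d |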
It works, but there's a convenience issue here. I had to write out all the fields except 1 and 3 for the array index. Wouldn't it be simpler to write something like this?
{a[expression for all fields except $1 and $3]++; next} ...
It would be simpler, but unfortunately there's no such expression in GNU AWK: you can't invert a selection of fields. In a recent data audit the issue loomed much larger for me than it does in "demo": there were two unique-entry fields and 137 others, and I was looking for the "almost-duplicates" in those 137 non-unique-entry fields! (See the end of this post.)
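For the record, a small AWK function can do the equivalent job by looping over the fields and skipping the unwanted ones. This is just a sketch of the idea, not part of the workflow described in this post, with fields 1 and 3 hard-coded as the fields to ignore:
awk -F"\t" '
# Build an index string from every field except 1 and 3
function buildkey(    i, k) {
    for (i = 1; i <= NF; i++)
        if (i != 1 && i != 3)
            k = k SUBSEP $i
    return k
}
FNR == NR {a[buildkey()]++; next}    # first pass: count each field combination
a[buildkey()] > 1                    # second pass: print lines whose combination repeats
' demo demo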
It's easy enough to determine whether "almost-duplicates" actually exist in a table using cut, sort and uniq. Here's that process using "demo":
cut -f1,3 --complement demo | sort | uniq -D
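On "demo" this returns two groups of repeated field combinations (three copies of one, two of the other), shown here with the same pipe layout as the table above:
15c5 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
15c5 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
15c5 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0 |
7b03 | 71b7 | 43c4 | 8ffd | f9352b | 102e2d |
7b03 | 71b7 | 43c4 | 8ffd | f9352b | 102e2d |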
Unfortunately this method doesn't return the whole record, so you don't know where in the table these "almost-duplicates" are lurking.
An interesting workaround is based on replacing the unique-entry field contents with the regex expression ".*" ("zero or more of any character"), then getting a list of the duplicate records, de-duplicating that list and searching for each of the resulting patterns in the original table. As with the AWK array method, two passes through the table are required.
In the first pass through "demo", AWK does the replacement of field 1 and field 3 contents with ".*". The result is piped to sort and uniq, with uniq's -d option returning just one example of each of the duplicates:
awk 'BEGIN {FS=OFS="\t"} NR>1 {$1=$3=".*"; print}' demo | sort | uniq -d
AWK needs to be told that both the input and output field separator is a tab (BEGIN {FS=OFS="\t"}) because each line will be rebuilt in the action part of the command. The action applies to lines after the header (NR>1) and consists of setting the values of fields 1 and 3 to the string ".*", then printing the line.
The "-d" option for uniq isn't as familiar to many users as the "-D" option. "-D" prints all duplicate lines (only). "-d" also prints only duplicate lines, but only one example from each group of duplicates.
Now to search for the two records above, which have the regex expression ".*" in place of the unique field 1 and field 3 entries. I can do that either with grep -f, using process substitution to feed the first-pass output to grep as its pattern file, or with a while loop that reads the first-pass output one line at a time and passes each line to AWK for matching in "demo":
grep -f <(awk 'BEGIN {FS=OFS="\t"} NR>1 {$1=$3=".*"; print}' demo | sort | uniq -d) demo | sort -t $'\t' -k2,2 -nk1,1
while IFS= read -r line; do awk -v FIND="$line" '$0 ~ FIND' demo; done < <(awk 'BEGIN {FS=OFS="\t"} NR>1 {$1=$3=".*"; print}' demo | sort | uniq -d) | sort -t $'\t' -k2,2 -nk1,1
In the while loop command, each line in the list of duplicates is read into the shell variable "line" and passed to AWK as the variable FIND (-v FIND="$line"). As AWK goes through "demo" line by line, it checks whether the current line is a regex match for the line from the list of duplicates ($0 ~ FIND), and if it is, the current line is printed. Each of the unique entries in fields 1 and 3 is matched by ".*".
Both these methods work, but they fail spectacularly if there are any strings in the "almost-duplicates" lines that grep or AWK can interpret as regex (other than ".*"). There aren't any such strings in "demo", but there were lots in my audit table, and I had to fall back on the array method (see top of page). The array was indexed with
$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,
$21,$22,$23,$24,$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,
$38,$39,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,
$55,$56,$57,$58,$59,$60,$61,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,
$72,$73,$74,$75,$76,$77,$78,$79,$80,$81,$83,$84,$85,$86,$87,$88,$89,$90,
$91,$92,$93,$94,$95,$96,$97,$98,$99,$100,$101,$102,$103,$104,$105,
$106,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$118,
$119,$120,$121,$122,$123,$124,$125,$126,$127,$128,$129,$130,$131,
$132,$133,$134,$135,$136,$137,$138,$139
where the excluded unique-entry fields were 1 and 82. AWK processed the audit table (139 fields, 810000+ records) in less than 15 seconds on my desktop PC (quad core i5, 8G RAM).
The index string formula for the array comes from echo "\$$(seq -s ",$" 139)", from which I manually deleted "$1," and "$82,".
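To see how that formula works, here it is scaled down to the eight fields of "demo" (my illustration, not from the audit):
echo "\$$(seq -s ",$" 8)"
$1,$2,$3,$4,$5,$6,$7,$8
Deleting "$1," and "$3," from that output leaves $2,$4,$5,$6,$7,$8, the index string used for "demo" at the top of this post.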
Last update: 2021-12-24