Duplicate records differing only in unique identifiers
There's a big data table with lots of fields and lots of records. Each record has one or more unique identifier field entries. How to check for records that are exactly the same, apart from those unique identifiers?
I've been tinkering with this problem for years, and you can read my last, fairly clumsy effort in this BASHing data blog post from 2020. Here I present a much-improved solution, which has also gone into A Data Cleaner's Cookbook as an update.
In 2020, the fastest and most reliable method I used to extract these partial duplicates was with an AWK array and two passes through the table. In the first pass, an array "a" is built with the non-unique-identifier field entries as index string and the tally of each different entry as the value string. In the second pass through the table, AWK looks for records where the value string for the same index string is greater than one, and by default prints the record.
The example table from the 2020 blog post was the TSV "demo":
Fld1 | Fld2 | Fld3 | Fld4 | Fld5 | Fld6 | Fld7 | Fld8
001 | 7b03 | 020d | 71b7 | 43c4 | 8ffd | f9352b | 102e2d
002 | 521a | 1da1 | f9eb | 4268 | 9fa8 | fc7357 | 0a31b8
003 | e6c3 | 0e9b | dc9f | 448b | b1c4 | 7705ca | 772ab5
004 | 36cf | fd59 | 0c62 | 4eb6 | 82d1 | e30076 | ecedbd
005 | 15c5 | 7874 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0
006 | b3fb | 5bad | 3361 | 4259 | a5b0 | 30370c | 953333
007 | 15c5 | c3d5 | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0
008 | 7b03 | 7686 | d264 | 4c34 | b0e4 | 364607 | 5af668
009 | 7ee1 | 8a53 | 5cc5 | 4f57 | 9cf5 | ddc735 | 56eee8
010 | bd75 | 3324 | 21mz | 41b0 | b1bc | 22964a | b9f2a3
011 | 15d7 | 1fb2 | 7223 | 4e8f | 8f1f | 8e6b76 | f60cd1
012 | c3cc | ef6c | 70fb | 4a45 | 9428 | f00f73 | 07e92d
013 | 9ab4 | 991c | 0bd7 | 4f3c | badf | ee145b | 5a6d17
014 | 6ad5 | 8395 | 19aa | 43c4 | 9cea | 3a3c90 | e84150
015 | 607c | 3753 | 8a69 | 44bf | b41f | ddb1eb | 4a42ff
016 | 7b03 | f067 | 71b7 | 43c4 | 8ffd | f9352b | 102e2d
017 | 05f1 | 89f3 | 5067 | 6712 | b1c4 | 3b5245 | 4c4e35
018 | e20d | 5346 | 71a8 | 4b26 | a31d | ab914d | e39049
019 | 15c5 | 52bd | 33dc | 4b20 | b1c4 | 7a1f3b | 8465b0
020 | e917 | b879 | 08dd | 4387 | b520 | 814a8a | 10717b
Fields 1 and 3 of "demo" contain only unique values. Ignoring those two fields, are there any duplicate records in the table? The 2020 command was:
awk -F"\t" 'FNR==NR {a[$2,$4,$5,$6,$7,$8]++; next} \
a[$2,$4,$5,$6,$7,$8]>1' demo demo
And the output can be sorted to suit.
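On "demo", for example, the command finds five partial duplicates, and piping the output through sort on field 2 groups the two duplicate sets together (output below worked out by hand from the "demo" table):

awk -F"\t" 'FNR==NR {a[$2,$4,$5,$6,$7,$8]++; next} \
a[$2,$4,$5,$6,$7,$8]>1' demo demo | sort -t $'\t' -k2,2

005  15c5  7874  33dc  4b20  b1c4  7a1f3b  8465b0
007  15c5  c3d5  33dc  4b20  b1c4  7a1f3b  8465b0
019  15c5  52bd  33dc  4b20  b1c4  7a1f3b  8465b0
001  7b03  020d  71b7  43c4  8ffd  f9352b  102e2d
016  7b03  f067  71b7  43c4  8ffd  f9352b  102e2d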
The method works fine, but what if there are one or two unique identifier fields and 100+ other fields? It's tedious to build an index string like $2,$4,$5,$6,$7,$8....
In my improved method, I don't have to build that string at all. Instead, in the first pass through the table I set the unique identifier field entries to the arbitrary value "1", then use the whole just-modified record ($0) as the array index. In the second pass, I first assign the whole, unmodified record to the variable "x", then again set the unique identifier fields to "1". If the array value for the modified record's index is greater than 1, I print "x":
awk -F"\t" 'FNR==NR {[Set UI field(s) = 1; a[$0]++; next} \
{x=$0; [Set UI field(s) = 1]} a[$0]>1 {print x}' table table
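Here's what that looks like on "demo", where the unique identifiers are in fields 1 and 3, so the placeholders become "$1=1; $3=1". One caution: assigning to a field makes AWK rebuild $0 with the output field separator, which is a space by default. Both passes rebuild records the same way, so the array lookups still match, but adding -v OFS="\t" keeps the rebuilt records unambiguously tab-separated:

awk -F"\t" -v OFS="\t" 'FNR==NR {$1=1; $3=1; a[$0]++; next} \
{x=$0; $1=1; $3=1} a[$0]>1 {print x}' demo demo

This prints the same five records (001, 005, 007, 016 and 019) found by the 2020 command.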
The unique identifier field entries can be set to any arbitrary value. I use "1", but the empty string ("") and the string "hippopotamus" (with quotes) would work just as well. The aim is to make those unique identifiers non-unique.
It's a simple change in the code, but it saves a lot of fiddling!
Update. An elegant one-pass variation of this command was suggested by Janis Papanagnou in 2023. The output is sets of partial duplicate records with each set separated by a blank line:
awk -F"\t" '{x=$0; [Set UI field(s) = 1]; y=$0; a[y]=a[y] RS x; b[y]++} END {for (i in a) if (b[i]>1) print a[i]}' table
Applied to the TSV "demo", with its unique IDs in fields 1 and 3, the one-pass command and its output look like this:
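awk -F"\t" '{x=$0; $1=1; $3=1; y=$0; a[y]=a[y] RS x; b[y]++} \
END {for (i in a) if (b[i]>1) print a[i]}' demo

001  7b03  020d  71b7  43c4  8ffd  f9352b  102e2d
016  7b03  f067  71b7  43c4  8ffd  f9352b  102e2d

005  15c5  7874  33dc  4b20  b1c4  7a1f3b  8465b0
007  15c5  c3d5  33dc  4b20  b1c4  7a1f3b  8465b0
019  15c5  52bd  33dc  4b20  b1c4  7a1f3b  8465b0

Because each a[y] string begins with RS, every set is printed with a leading blank line, which conveniently separates the sets. And because "for (i in a)" traverses the array in no particular order, the sets may appear in either order.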
Last update: 2023-08-19