
Repair job: separate the tandem repeats

I slipped up badly the other day. What I wanted to do, as shown below in a simplified demo table, was put strings 1 and 2 in separate, tab-separated fields:

ID	string1	string2	compared
001	aa bb cc ddd	aa bbb cc ddd	different
002	aa bbb cc ddd	aa bbb cc ddd	same
003	bbb ddd eee	bbb ddd eee	same
004	bbb eee ff	cc eee ff	different
005	bbb cc ff ggg	bbb cc ff ggg	same
006	bbb ggg	bbb h	different

What I got instead was strings 1 and 2 in tandem in the first field, separated by a single space, whenever the 2 strings were identical:

ID	string1	string2	compared
001	aa bb cc ddd	aa bbb cc ddd	different
002	aa bbb cc ddd aa bbb cc ddd		same
003	bbb ddd eee bbb ddd eee		same
004	bbb eee ff	cc eee ff	different
005	bbb cc ff ggg bbb cc ff ggg		same
006	bbb ggg	bbb h	different

If there were only 6 records (as in the demo) I could have repaired the table quickly in a text editor. Alas, there were hundreds of "tandem repeat" records, scattered among hundreds of records in which strings 1 and 2 were different. I needed a fast command-line fix!

Fix 1. Regular expressions? Sigh. OK, I could use sed with the extended regular expression option (-E) to replace

\t(.+) \1\t(\tsame)

A tab character, followed by one or more characters,
followed by a single space, followed by the same characters,
followed by a tab, followed by tab-"same", with the repeating characters
and the tab-"same" captured for backreferencing

with

\t\1\t\1\2

A tab character, followed by the repeating characters,
followed by a tab, followed by the repeating characters,
followed by tab-"same"

Ugly, but it works:
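A minimal sketch of that substitution on the demo table, assuming GNU sed (which understands \t in both the pattern and the replacement). The variable names "broken" and "fixed_sed" are only for illustration:

```shell
# Rebuild the broken demo table; printf reuses the 4-field format for each row.
broken=$(printf '%s\t%s\t%s\t%s\n' \
    ID string1 string2 compared \
    001 'aa bb cc ddd' 'aa bbb cc ddd' different \
    002 'aa bbb cc ddd aa bbb cc ddd' '' same \
    003 'bbb ddd eee bbb ddd eee' '' same \
    004 'bbb eee ff' 'cc eee ff' different \
    005 'bbb cc ff ggg bbb cc ff ggg' '' same \
    006 'bbb ggg' 'bbb h' different)

# Split each tandem repeat into string1 and string2 (GNU sed assumed for \t).
fixed_sed=$(printf '%s\n' "$broken" | sed -E 's/\t(.+) \1\t(\tsame)/\t\1\t\1\2/')
printf '%s\n' "$fixed_sed"
```

Records marked "different" contain no tab-tab-"same" sequence, so the pattern cannot match them and they pass through untouched.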

Fix 2. The method I prefer doesn't involve regular expressions. It relies on AWK's measurement of string length. AWK includes whitespace in its length measure. For a string of 7 characters, whitespace included, half the length is 3.5, and the integer part of 3.5 is 3.
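As a quick check of that arithmetic, here is a small sketch using "aaa aaa" as a stand-in 7-character string (the string and the variable name "lengths" are my own choices for illustration):

```shell
# "aaa aaa" is 3 letters + 1 space + 3 letters = 7 characters.
# Print the length, half the length, and the integer part of the half.
lengths=$(echo "aaa aaa" | awk '{print length($0), length($0)/2, int(length($0)/2)}')
echo "$lengths"   # 7 3.5 3
```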

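In a tandem repeat with a single space as separator, the overall string length will always be an odd number, and AWK can return just the first unit of the repeat with its "substr" substring function. Start the substring at position 0 and take as many characters as the integer value of half the total string length. (AWK actually numbers characters from 1, and a start position of 0 is not handled the same way by every AWK implementation, so a start of 1 is the more portable choice.)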
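A minimal sketch of the substring step, again with "aaa aaa" as an illustrative string. Note that this sketch starts the substring at 1 rather than 0: AWK counts characters from 1, and a start of 0 may behave differently from one AWK implementation to another.

```shell
# Keep only the first unit of the tandem repeat:
# start at character 1 and take int(7/2) = 3 characters.
first_unit=$(echo "aaa aaa" | awk '{print substr($0, 1, int(length($0)/2))}')
echo "$first_unit"   # aaa
```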

Now for the full command. The "BEGIN" statement {FS=OFS="\t"} ensures that whatever AWK does to a line, the fields start out and finish up tab-separated. The condition $4 == "same" means that the action that follows applies only to lines where the fourth field has the entry "same". The action statement {$2=$3=substr($2,0,(int(length($2)/2)))} redefines the second and third fields as equal to the first unit of the tandem repeat in the second field. This fixes my mistake.
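Put together, the pieces described above might look like the sketch below, run against a rebuilt copy of the broken demo table. The variable names "broken" and "fixed" are only for illustration, and the substring start is written as 1 instead of 0 as a portability assumption (AWK counts characters from 1):

```shell
# Rebuild the broken demo table; printf reuses the 4-field format for each row.
broken=$(printf '%s\t%s\t%s\t%s\n' \
    ID string1 string2 compared \
    001 'aa bb cc ddd' 'aa bbb cc ddd' different \
    002 'aa bbb cc ddd aa bbb cc ddd' '' same \
    003 'bbb ddd eee bbb ddd eee' '' same \
    004 'bbb eee ff' 'cc eee ff' different \
    005 'bbb cc ff ggg bbb cc ff ggg' '' same \
    006 'bbb ggg' 'bbb h' different)

# In "same" records, set fields 2 and 3 to the first half of field 2;
# the trailing 1 prints every line, changed or not.
fixed=$(printf '%s\n' "$broken" |
    awk 'BEGIN {FS=OFS="\t"}
         $4 == "same" {$2 = $3 = substr($2, 1, int(length($2)/2))}
         1')
printf '%s\n' "$fixed"
```

With the table in a file, the same AWK program could take the file name as its argument instead of reading from a pipe.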

The mysterious final 1 is common to many AWK commands. It's short for "if 1 is true (the pattern to be looked for), then print the line (the default AWK action)". Since "1" is always true, every line gets printed, in this case with the changes made in the "same" lines by the preceding action. The "1" is just a convention among AWK users. It could be any other non-zero number, like "7612" or "13.03".
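A small sketch of that convention, using two made-up lines of input. Each of the three patterns is a non-zero number, so each one triggers the default print action for every line:

```shell
# Two illustrative input lines.
sample=$(printf 'first line\nsecond line\n')

# Any non-zero number works as an always-true pattern.
out_1=$(printf '%s\n' "$sample" | awk '1')
out_7612=$(printf '%s\n' "$sample" | awk '7612')
out_1303=$(printf '%s\n' "$sample" | awk '13.03')

printf '%s\n' "$out_1" "$out_7612" "$out_1303"
```

A pattern of 0, by contrast, is false, so an AWK program consisting of just 0 prints nothing.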

Last update: 2018-10-26
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License