For a full list of BASHing data blog posts, see the index page.


Repair job: separate the tandem repeats

I slipped up badly the other day. What I wanted to do, as shown below in a simplified demo table, was put strings 1 and 2 in separate, tab-separated fields:

ID   string1         string2         compared
001  aa bb cc ddd    aa bbb cc ddd   different
002  aa bbb cc ddd   aa bbb cc ddd   same
003  bbb ddd eee     bbb ddd eee     same
004  bbb eee ff      cc eee ff       different
005  bbb cc ff ggg   bbb cc ff ggg   same
006  bbb ggg         bbb h           different

What I got instead, whenever the 2 strings were identical, was strings 1 and 2 in tandem in the string1 field, separated by a single space, with the string2 field left empty:

ID   string1                       string2         compared
001  aa bb cc ddd                  aa bbb cc ddd   different
002  aa bbb cc ddd aa bbb cc ddd                   same
003  bbb ddd eee bbb ddd eee                       same
004  bbb eee ff                    cc eee ff       different
005  bbb cc ff ggg bbb cc ff ggg                   same
006  bbb ggg                       bbb h           different

If there were only 6 records (as in the demo) I could have repaired the table quickly in a text editor. Alas, there were hundreds of "tandem repeat" records, scattered among hundreds of records in which strings 1 and 2 were different. I needed a fast command-line fix!


Fix 1. Regular expressions? Sigh. OK, I could use sed with the extended regular expression option (-E) to replace

\t(.+) \1\t(\tsame)
 
A tab character, followed by one or more characters,
followed by a single space, followed by the same characters,
followed by a tab, followed by tab-"same", with the repeating characters
and the tab-"same" captured for backreferencing

with

\t\1\t\1\2
 
A tab character, followed by the repeating characters,
followed by a tab, followed by the repeating characters,
followed by tab-"same"

Ugly, but it works:

[Screenshot: fix1]
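For readers without the screenshot, here is a minimal sketch of how the two expressions above could be combined in a sed substitution. It assumes GNU sed (which understands \t and allows backreferences with -E). The variable names and the three-record sample are hypothetical stand-ins for the demo table, not the original command.

```shell
# Hypothetical demo data: record 002 has the tandem repeat in field 2
# and an empty field 3, so two tabs precede "same".
broken=$(printf '%s\t%s\t%s\t%s\n' \
    "001" "aa bb cc ddd" "aa bbb cc ddd" "different" \
    "002" "aa bbb cc ddd aa bbb cc ddd" "" "same" \
    "006" "bbb ggg" "bbb h" "different")

# Replace tab + "X X" + tab + tab-"same" with tab + X + tab + X + tab-"same".
# \1 is the repeating unit, \2 is the captured tab-"same".
fixed_sed=$(printf '%s\n' "$broken" |
    sed -E 's/\t(.+) \1\t(\tsame)/\t\1\t\1\2/')

printf '%s\n' "$fixed_sed"
```

Only record 002 matches, because only it has the two consecutive tabs before "same"; the other records pass through unchanged.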
 

Fix 2. The method I prefer doesn't involve regular expressions. It relies on AWK's measurement of string length. As shown in the screenshot below, AWK includes whitespace in its length measure. Half of 7 characters is 3.5, and the integer part of 3.5 is 3:

[Screenshot: fix2]
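The arithmetic can be checked with a short sketch like the one below. The 7-character test string "aaa aaa" and the variable name are assumptions chosen for illustration; the string in the screenshot may differ.

```shell
# Print the length of a 7-character string (space included),
# half of that length, and the integer part of the half.
lengths=$(printf 'aaa aaa\n' |
    awk '{n = length($0); print n, n/2, int(n/2)}')

printf '%s\n' "$lengths"
```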

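In a tandem repeat with a single space as separator, the overall string length will always be an odd number (twice the unit length, plus 1 for the space), and AWK can return just the first unit of the repeat with its "substr" substring function. Start the substring character counter at 0 and finish at the integer value of half the total string length. (AWK numbers string positions from 1, so a start of 0 relies on the AWK in use treating it as 1; a start of 1 is the more portable choice.)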

[Screenshot: fix3]
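As a rough illustration of the substring step, the sketch below pulls the first unit out of one of the demo tandem repeats. The variable name is hypothetical, and the substring deliberately starts at 1 rather than 0, since AWK counts string positions from 1 and a start of 0 may behave differently between AWK implementations.

```shell
# "bbb ddd eee bbb ddd eee" is 23 characters long; int(23/2) is 11,
# so the first 11 characters are the first unit of the repeat.
first_unit=$(printf 'bbb ddd eee bbb ddd eee\n' |
    awk '{print substr($0, 1, int(length($0)/2))}')

printf '%s\n' "$first_unit"
```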

Now for the full command, shown in the screenshot below. The "BEGIN" statement {FS=OFS="\t"} ensures that whatever AWK does to a line, it will start out and finish up with the fields tab-separated. The $4 == "same" condition means that the action that follows will apply only to lines where the fourth field has the entry "same". The action statement {$2=$3=substr($2,0,(int(length($2)/2)))} redefines the second and third fields as equal to the first unit of the tandem repeat in the second field. This fixes my mistake.

[Screenshot: fix4]
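Put together, the command might look like the sketch below. The sample records and variable names are hypothetical, and the substring starts at 1 instead of the 0 used in the text, an assumption made here for portability across AWK implementations.

```shell
# Hypothetical demo data: records 002 and 003 have the tandem repeat
# in field 2 and an empty field 3.
broken=$(printf '%s\t%s\t%s\t%s\n' \
    "001" "aa bb cc ddd" "aa bbb cc ddd" "different" \
    "002" "aa bbb cc ddd aa bbb cc ddd" "" "same" \
    "003" "bbb ddd eee bbb ddd eee" "" "same" \
    "006" "bbb ggg" "bbb h" "different")

# On "same" lines, set fields 2 and 3 to the first half of field 2;
# the trailing 1 prints every line, changed or not.
fixed_awk=$(printf '%s\n' "$broken" |
    awk 'BEGIN {FS = OFS = "\t"}
         $4 == "same" {$2 = $3 = substr($2, 1, int(length($2)/2))}
         1')

printf '%s\n' "$fixed_awk"
```

Because the assignment $2 = $3 = ... is evaluated right to left, field 3 receives the substring of the original field 2 before field 2 itself is overwritten.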
 

The mysterious final 1 is common to many AWK commands. It's short for "if 1 is true (the pattern to be looked for), then print the line (the default AWK action)". Since "1" is always true, every line gets printed, in this case with the changes made in the "same" lines by the preceding action. The "1" is just a convention among AWK users. It could be any other non-zero number, like "7612" or "13.03", because AWK treats any non-zero number as true:

[Screenshot: fix5]
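A small sketch of the same idea, using a hypothetical two-line input: a non-zero number as the whole AWK program prints every line, while 0 (false) prints nothing.

```shell
# Any non-zero number is a true pattern, so the default action
# (print the line) runs for every line.
with_7612=$(printf 'a\nb\n' | awk '7612')
with_decimal=$(printf 'a\nb\n' | awk '13.03')

# 0 is false, so no line is printed.
with_zero=$(printf 'a\nb\n' | awk '0')

printf '%s\n' "$with_7612"
```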

Last update: 2018-10-26
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License