Repair job: separate the tandem repeats
I slipped up badly the other day. What I wanted to do, as shown below in a simplified demo table, was put strings 1 and 2 in separate, tab-separated fields:
ID | string1 | string2 | compared |
001 | aa bb cc ddd | aa bbb cc ddd | different |
002 | aa bbb cc ddd | aa bbb cc ddd | same |
003 | bbb ddd eee | bbb ddd eee | same |
004 | bbb eee ff | cc eee ff | different |
005 | bbb cc ff ggg | bbb cc ff ggg | same |
006 | bbb ggg | bbb h | different |
What I got instead was strings 1 and 2 in tandem in the string1 field, separated by a single space, whenever the 2 strings were identical:
ID | string1 | string2 | compared |
001 | aa bb cc ddd | aa bbb cc ddd | different |
002 | aa bbb cc ddd aa bbb cc ddd | | same |
003 | bbb ddd eee bbb ddd eee | | same |
004 | bbb eee ff | cc eee ff | different |
005 | bbb cc ff ggg bbb cc ff ggg | | same |
006 | bbb ggg | bbb h | different |
If there had been only 6 records (as in the demo) I could have repaired the table quickly in a text editor. Alas, there were hundreds of "tandem repeat" records, scattered among hundreds of records in which strings 1 and 2 were different. I needed a fast command-line fix!
Fix 1. Regular expressions? Sigh. OK, I could use sed with the extended regular expression option (-E) to replace
\t(.+) \1\t(\tsame)
A tab character, followed by one or more characters,
followed by a single space, followed by the same characters again,
followed by a tab, followed by another tab and "same" (the field in between is empty),
with the repeating characters and the tab-"same" captured for backreferencing
with
\t\1\t\1\2
A tab character, followed by the repeating characters,
followed by a tab, followed by the repeating characters,
followed by tab-"same"
Ugly, but it works:
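A minimal sketch of Fix 1 on a few of the demo records. It assumes GNU sed (which understands \t and accepts backreferences with -E) and assumes, as the regular expression implies, that the broken records have an empty third field with "same" in the fourth. The variable names demo and fixed are only for illustration:

```shell
# Demo records, tab-separated; the "same" record has its two strings in
# tandem in field 2 and an empty field 3.
demo=$(printf '%s\t%s\t%s\t%s\n' \
  001 'aa bb cc ddd' 'aa bbb cc ddd' different \
  002 'aa bbb cc ddd aa bbb cc ddd' '' same \
  006 'bbb ggg' 'bbb h' different)

# Replace "unit unit<tab><tab>same" with "unit<tab>unit<tab>same".
fixed=$(printf '%s\n' "$demo" | sed -E 's/\t(.+) \1\t(\tsame)/\t\1\t\1\2/')
printf '%s\n' "$fixed"
```

Records without a tandem repeat followed by "same" do not match the pattern, so they pass through unchanged.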
Fix 2. The method I prefer doesn't involve regular expressions. It relies on AWK's measurement of string length. As shown in the screenshot below, AWK includes whitespace in its length measure. Half of 7 characters is 3.5, and the integer part of 3.5 is 3:
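The measurement can be checked with a short command. The string "abc def" here is my own illustrative choice (6 letters plus 1 space), not necessarily the one in the original screenshot:

```shell
# Print the length, half the length, and the integer part of half the length.
measure=$(printf 'abc def\n' | awk '{n = length($0); print n, n/2, int(n/2)}')
printf '%s\n' "$measure"
# prints: 7 3.5 3
```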
In a tandem repeat with a single space as separator, the overall string length will always be an odd number (twice the length of one unit, plus 1 for the space), so the integer value of half the total length is exactly the length of one unit. AWK can return just the first unit of the repeat with its "substr" substring function. AWK numbers the characters in a string from 1. A start position of 0 works in some AWK implementations, but others treat it as a position before the string and return one character fewer, so 1 is the portable choice. Start the substring at character 1 and give it a length equal to the integer value of half the total string length:
Now for the full command, shown in the screenshot below. The "BEGIN" statement {FS=OFS="\t"} ensures that AWK reads each line as tab-separated fields and writes it out tab-separated as well. The $4 == "same" condition means that the action that follows applies only to lines where the fourth field has the entry "same". The action statement {$2=$3=substr($2,1,int(length($2)/2))} redefines the second and third fields as equal to the first unit of the tandem repeat in the second field. This fixes my mistake.
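A sketch of the full command on a few demo records, assuming the broken records have an empty third field and "same" in the fourth. It uses a start position of 1 in substr, since AWK numbers characters from 1. The variable names demo and repaired are only for illustration:

```shell
# Demo records, tab-separated; tandem repeats sit in field 2 of the "same" lines.
demo=$(printf '%s\t%s\t%s\t%s\n' \
  001 'aa bb cc ddd' 'aa bbb cc ddd' different \
  002 'aa bbb cc ddd aa bbb cc ddd' '' same \
  005 'bbb cc ff ggg bbb cc ff ggg' '' same \
  006 'bbb ggg' 'bbb h' different)

# On "same" lines, set fields 2 and 3 to the first half of field 2
# (the trailing 1 prints every line, changed or not).
repaired=$(printf '%s\n' "$demo" |
  awk 'BEGIN {FS=OFS="\t"} $4 == "same" {$2=$3=substr($2,1,int(length($2)/2))} 1')
printf '%s\n' "$repaired"
```

Each repeat here is 27 characters long, so int(27/2) is 13, which is the length of one unit such as "aa bbb cc ddd".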
The mysterious final 1 is common to many AWK commands. It's short for "if 1 is true (the pattern to be looked for), then print the line (the default AWK action)". Since "1" is always true, every line gets printed, in this case with the changes made in the "same" lines by the preceding action. The "1" is just a convention among AWK users. It could be any other non-zero number, like "7612" or "13.03":
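A quick check of that convention on three illustrative lines (the variable names are hypothetical). A non-zero number as the pattern prints every line, while 0 is false and prints nothing:

```shell
lines=$(printf 'a\nb\nc\n')

with_1=$(printf '%s\n' "$lines" | awk '1')          # every line printed
with_7612=$(printf '%s\n' "$lines" | awk '7612')    # same result
with_13_03=$(printf '%s\n' "$lines" | awk '13.03')  # same result
with_0=$(printf '%s\n' "$lines" | awk '0')          # nothing printed

printf '%s\n' "$with_7612"
```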
Last update: 2018-10-26
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License