banner

For a full list of BASHing data blog posts see the index page.     RSS


Targeted string replacements with sed and AWK

Global replacement of A with B with sed or AWK might be a mistake unless you're 100% sure that you really, truly want to replace every instance of A with B in the data file. Even more risky (says he, who has done it more than once to his regret) is globally replacing over a whole set of files:

sed 's/A/B/g' *.txt

target1

Base image found here, source unknown.

It's much safer to do replacements targeted at particular instances of A in a file. This post reviews a few strategies for using sed and AWK for targeted string replacements. I can't consider every possible replacement case, but I hope the examples I give are useful.

For demonstration purposes, here's the tab-separated table "reds". What I'll do is replace the underlined "red" words with "white".

target2
Fld1Fld2Fld3
red winered tapered shift
red hotred dogred flag
red inkred foxred meat
red deerred bookred rice

To replace the first "red" in line 2, I address line number 2 and occurrence number 1 (in that line). If I leave out the occurrence address, sed replaces the first "red" it finds and then stops looking:

target3

sed '2s/red/white/1' reds
sed '2s/red/white/' reds

In line 3, replacing the second "red" in line 3 uses occurrence address "2":

target4

sed '3s/red/white/2' reds

The two replacements in line 4 offer a good demonstration of sed's occurrence-address logic. Once sed has done a replacement in a line, it restarts the count of occurrences. That's why the first command shown below doesn't simply repeat the replacement of the second occurrence. The third command uses a feature of GNU sed: putting "g" after the occurrence address means "that occurrence and all others to the end of the line":

target5

sed '4s/red/white/2;4s/red/white/2' reds
sed '4s/red/white/3;4s/red/white/2' reds
sed '4s/red/white/2g' reds

The line 5 replacement of the first and third "red"s uses occurrence-address logic again:

target6

sed '5s/red/white/1;5s/red/white/2' reds
sed '5s/red/white/3;5s/red/white/1' reds

Combining the commands:

target7

sed '2s/red/white/;3s/red/white/2;4s/red/white/2g;5s/red/white/1;5s/red/white/2' reds

Note that in the second version in the screenshot I enclosed the command in double quotes rather than single quotes, to allow use of BASH's interrupted line feature (" \").

Another approach to targeting is to "hook" the strings to be replaced to their context, in other words to a nearby string. In this case I want to replace the "red"s that precede the words:

wine,dog,fox,meat,deer,rice      #The order doesn't matter

There isn't an easy way to do this with sed, but I can fill an AWK array "a" with the "hook" words as index strings, then do a replacement whenever "red" and a space are followed by one of the index strings in the array:

target8

awk 'FNR==NR {a[$0]; next} {for (i in a) sub("red i","white i")} 1' <(echo "wine,dog,fox,meat,deer,rice" | tr ',' '\n') reds

The array "a" is built from the redirection <(echo "wine,dog,fox,meat,deer,rice" | tr ',' '\n'), where each word becomes an index string in the array. For each of those index strings, AWK substitutes "white "[word] for "red "[word]. The final instruction "1" tells AWK to print all lines in the file.

AWK is also a better choice than sed for replacements if you want to target a string in a particular line and field, without knowing (or caring) if the string to be replaced is elsewhere in the line. For example, to replace just the "red" in "red dog":

target9

awk 'BEGIN {FS=OFS="\t"} NR==3 {sub("red","white",$2)} 1' reds

Finally, here's AWK doing all my required replacements, with the targets grouped by field:

target10

awk 'BEGIN {FS=OFS="\t"} \
> NR==2 || NR==5 {sub("red","white",$1)} \
> NR==3 || NR==4 {sub("red","white",$2)} \
> NR==4 || NR==5 {sub("red","white",$3)} \ > 1' reds


PUZZLE in last post:

moji8

In the second echoed EMPATHY, the "M" isn't a Latin capital letter m (hex 4d). It's a Greek capital mu (hex ce 9c), which isn't on the Windows-1252 codepage. The two "M" letters are homoglyphs and are part of a Unicode triplet of letters that look the same but have different encodings in UTF-8. The third "M" is a Cyrillic capital em (hex d0 9c).

There are 11 letters that form Latin/Greek/Cyrillic triplets: ABEHKMOPTXY. The longest common English word I know that can be made with these 11 letters is EMPATHY.


Last update: 2020-04-08
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License