For a list of BASHing data 2 blog posts see the index page.

Find a word, plus words either side of the matching word

One of grep's great features is that it can return "line context". It can not only find the individual line containing a searched-for pattern, but also one or more lines before that line, after that line, or both.

Here's a text file ("orbs") in which I'll search first for words containing "moon" (case-insensitively), then for the same pattern plus the 1 line before with the -B1 option:

The Sun also rises
Moonlight becomes you
Here comes the sun
Ain't no sunshine when she's gone
The quarter-moon and half-moon are a week apart
You say it's only a paper moon
A place in the sun
Moon, moon, glorious moon

grep -i "[^ ]*moon[^ ]*" orbs
grep -i -B1 "[^ ]*moon[^ ]*" orbs

What grep doesn't do easily is return "word context", like the 1 or 2 words either side of the pattern. There's no option like -wordB1. So how could you get each occurrence of "moon" words (case-insensitively) plus the word before and after? The desired output from "orbs" would be

Moonlight becomes
The quarter-moon and
and half-moon are
paper moon
Moon, moon
Moon, moon, glorious
glorious moon

One approach would be with a GNU AWK command, making special cases of "moon" as first or last word:

awk 'BEGIN {IGNORECASE=1}
$1 ~ /moon/ {print $1,$2}
{for (i=2;i<NF;i++) if ($i ~ /moon/)
print $(i-1),$i,$(i+1)}
$NF ~ /moon/ {print $(NF-1),$NF}' orbs

Note that "IGNORECASE" is a GNU AWK option, and that by default the AWK field separator is one or more spaces.

AWK processes "orbs" line by line and does 3 tests on each line. The first test checks if the first word matches "moon"; if yes, AWK prints that first word and the second one.

The second test ignores the first and last words and loops through the remaining words. If any of them match "moon", AWK prints the word before the matching word, the matching word itself, and the word after the matching word.

The third test checks if the last word matches "moon"; if yes, AWK prints that last word and the one before it.

Two "gotchas" in using AWK to print words by field number are that $i-1 is the whole line if "i" is one, and that AWK will exit with an error message if asked to print a field with a negative field number.

I haven't tried to do the job with grep, because I suspect that would involve some convoluted, nightmarish PCRE regex. There is, however, an ancient UNIX command for extracting keywords in context called ptx. I dusted off its cobwebs and asked it to find "moon" in context, but was unable to tweak a result as neat as the AWK one, even after further processing with sed and/or AWK.

ptx is part of "coreutils", and the clearest explanation I've seen of how it works is in the "coreutils" manual: info '(coreutils) ptx invocation'.

Last update: 2024-08-23
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License