Some regex tests with grep, sed and AWK
In my data work I regularly do searching and filtering with GNU grep (version 3.3), GNU sed (4.7) and GNU AWK (4.2.1). I don't know if they all use the same regex engine, but I've noticed differences in regex speed between these three programs. This post documents some of the differences.
For more on regex engines, see
Roger Qiu's comparison charts
Alex Grebenyuk's DIY regex engine series
Odd and even. I'll start simple, filtering a list of numbers to get just the odd ones:
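A minimal sketch of what such direct searches might look like. The regex [13579]$ is the one named in the timing test; the short seq 10 input is an assumption chosen only to keep the output readable:

```shell
# Direct searches: keep only the lines that end in an odd digit
seq 10 | grep '[13579]$'
seq 10 | sed -n '/[13579]$/p'
seq 10 | awk '/[13579]$/'
```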
Those are direct searches for odd numbers. I can also do an indirect search, finding the numbers that are not what I'm not looking for:
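One plausible form of the indirect searches, sketched under the assumption that "indirect" means rejecting every line whose last character is not an odd digit (the negated bracket expression is my guess, not necessarily the exact regex used):

```shell
# Indirect searches: drop the lines that end in anything other than an odd digit
seq 10 | grep -v '[^13579]$'
seq 10 | sed '/[^13579]$/d'
seq 10 | awk '!/[^13579]$/'
```

On input made up only of digits, these should return the same lines as the direct searches.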
For testing I ran seq 10000000 | [command] > /dev/null 10 times for each command and averaged the run times. Each command looked for numbers ending (or not) in an odd digit ([13579]$):
grep wins, and the direct vs indirect results are strikingly different for sed and AWK.
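The timing procedure described above could be sketched roughly as follows. This is not necessarily the harness used for the results reported here: avg_time is a hypothetical helper name, and the %N format (nanoseconds) assumes GNU date.

```shell
# Hypothetical helper: run "seq N | command > /dev/null" a number of times
# and print the mean wall-clock time per run in seconds.
# Usage: avg_time N RUNS command [args...]
# Example: avg_time 10000000 10 grep '[13579]$'
avg_time() {
    n=$1
    runs=$2
    shift 2
    start=$(date +%s.%N)    # %N is a GNU date extension
    i=0
    while [ "$i" -lt "$runs" ]; do
        seq "$n" | "$@" > /dev/null
        i=$((i + 1))
    done
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" -v r="$runs" \
        'BEGIN {printf "%.3f\n", (e - s) / r}'
}
```

The figure includes the cost of seq and of starting each process, so it is only useful for comparing commands run the same way on the same machine.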
AND AND AND. Finding the line in "demo1" that contains "a" and "b" and "c" is logically straightforward with AWK (see screenshot below). With grep it is most easily done by chaining three commands, and the same can be done with sed. The fourth command shown below is a kind of indirect sed search:
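A sketch of four such constructions. The contents of "demo1" are not reproduced here, so the example builds a small stand-in file (the four lines are invented for illustration), and the indirect sed form is one plausible reading of "indirect": delete every line that lacks any of the three strings.

```shell
# Stand-in for "demo1"; these four lines are made up for illustration
demo=$(mktemp)
printf '%s\n' 'a only' 'a and b' 'c, b, a' 'b with c' > "$demo"

# AWK: all three conditions in one pattern
awk '/a/ && /b/ && /c/' "$demo"
# grep: a chain of three searches
grep 'a' "$demo" | grep 'b' | grep 'c'
# sed: the same chain
sed -n '/a/p' "$demo" | sed -n '/b/p' | sed -n '/c/p'
# sed, indirect: delete lines missing "a", then "b", then "c" (GNU sed syntax)
sed '/a/!d; /b/!d; /c/!d' "$demo"
```

Each of the four commands should print only the line "c, b, a".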
For a speed test I did a search in "tax" for the 38 lines out of ca 6.8M that contain the strings "Insecta", "Linneaus" and "species":
grep wins again, even as a three-command chain, and it's interesting that the two sed constructions were equally fast.
Quantifying. The file "cite" has ca 7.9M lines with lengths from 1 to 11381 characters including spaces. Which lines have from 10 to 20 characters? (There are 876433 such lines.)
For grep, sed and AWK I used the regex quantifier {n,m}, and I also did a two-condition AWK test:
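A sketch of the four line-length tests, run on a small stand-in file rather than on "cite" (the sample lines are invented). It assumes GNU grep and GNU sed with the -E option, and an AWK that supports interval expressions in regexes, as GNU AWK 4.x does by default:

```shell
# Stand-in for "cite": lines of 5, 10, 20 and 42 characters
cite=$(mktemp)
printf '%s\n' 'short' 'ten chars!' 'exactly twenty chars' \
    'this line is longer than twenty characters' > "$cite"

# Regex quantifier {10,20} anchored to the whole line
grep -E '^.{10,20}$' "$cite"
sed -En '/^.{10,20}$/p' "$cite"
awk '/^.{10,20}$/' "$cite"
# Two-condition AWK test using the length function
awk 'length($0) >= 10 && length($0) <= 20' "$cite"
```

Each command should print the 10-character and 20-character lines only.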
The big difference between the two AWK constructions surprised me a bit; I'm guessing that the two calls to the length function slowed the processing.
Summing up. OK, grep is very fast and its syntax is simple. When speed is important in my data processing, it might be worth my splitting out the regex from sed and AWK commands and letting grep do the searching or filtering first.
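As an illustration of splitting out the regex, here is a minimal sketch that sums the odd numbers from 1 to 100 in two ways. The task itself is an assumption chosen for the example, and whether the second form is actually faster on a given file would need timing:

```shell
# Regex filtering and arithmetic both done in AWK
seq 100 | awk '/[13579]$/ {sum += $1} END {print sum}'
# grep does the filtering first; AWK only does the arithmetic
seq 100 | grep '[13579]$' | awk '{sum += $1} END {print sum}'
```

Both commands print 2500.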
Last update: 2021-10-20
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.