banner

For a list of BASHing data 2 blog posts see the index page.    RSS


Another embedded newlines fix

There were 1.4 million records in the file I was checking, but hundreds of scattered records were broken over several lines, and some of those lines were blank. How to fix this big file?

Luckily for me, every record began with exactly the same string, and this particular string didn't appear in other fields. That allowed me to use a neat AWK command illustrated in 2012 by Guru Prasad on his blog The UNIX School. Here's a simple demonstration file, "broken":

this1,aaa,
bbb,
ccc
 
this2,ddd,,eee
this3,
ccc,
ddd,
 
eee
this4,,
fff,
this5,ggg,
hhh,,
this6,,,

The goal here is to get rid of the blank lines and build this 4-field CSV:

this1,aaa,bbb,ccc
this2,ddd,,eee
this3,ccc,ddd,eee
this4,,fff,
this5,ggg,hhh,,
this6,,,

A good solution is to first get rid of the blank lines, then do the newline fixes (explained below):

awk NF broken | awk '/^this/ {if (x) print x; x=""} {x=(!x) ? $0 : x$0} END {print x}'

embedfix

The situation was only a little more complicated with that 1.4M-record file I was fixing, which was a TSV. The line breaks were all within fields rather than between them. I joined the fragments of the broken records with a single space (rather than a comma, as above) without knowing whether there were spaces at the beginning or end of a fragment. I then squeezed any multiple spaces to one space in all the "recombined" records with tr -s " ".

I also used the command with a broken TSV (4.4M records) where the first tab-separated field in each record was expected to fit the regex /^[0-9]+\t/. Before running the command I checked with grep -cP "^[0-9]+\t" that none of the broken record fragments began with that pattern. In other words, the grep output should be the same as the expected number of records, and it was.

The newline-fixing AWK command explained:
 
The first action {if (x) print x; x=""} is taken if the line begins with "this" (/^this/). In plain English the action is "If the variable x is defined, print it, then set x equal to the empty string, or in other words empty the variable x".
 
The first line of the modified "broken" (no blank lines thanks to awk NF) begins with "this" but x has not yet been defined, so the command doesn't print anything. Instead it moves on to the second action {x=(!x) ? $0 : x$0}. This is in ternary format and says "If x is empty, then x is equal to the whole line ($0). If x does have a value, then x is equal to x followed without a space by the whole line (x$0)." For the first line of the modified "broken", this sets x equal to the whole line "this1,aaa,". Nothing is printed yet.
 
The second line of the modified "broken" is a fragment of the record that began with "this1". This second line doesn't begin with "this", so AWK skips the first action and moves on to the second action. This time x isn't empty: it contains "this1,aaa," from the first line. AWK redefines x to be that string followed immediately by the current whole line. x now contains "this1,aaa,bbb,". Still nothing is printed.
 
The third line of the modified "broken" is another "this1" fragment, so the first action is ignored, the second action is taken and x now contains "this1,aaa,bbb,ccc". Still nothing is printed.
 
The fourth line of the modified "broken" begins with "this", again. This time AWK takes the first action and prints "this1,aaa,bbb,ccc", then empties x. Taking the second action, it loads x with "this2,ddd,,eee".
 
The fifth line of the modified "broken" begins with "this", so "this2,ddd,,eee" is printed and x is loaded with "this3,".
 
And so on to the last line of the modified "broken". It begins with "this", so AWK prints "this5,ggg,hhh," and loads x with "this6,,,". The END statement tells AWK to print x, and the job is done.


Next post:
2024-12-06   Munging the Atlas of Living Australia table format


Last update: 2024-11-29
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License