Counterfeit spaces: the NBSP menace
NBSP stands for "no-break space" or "non-breaking space", Unicode code point U+00A0. Like an ordinary space, it's invisible. In UTF-8 encoding it's the two bytes c2 a0 (hex values), and in HTML it's usually written "&nbsp;".
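A quick way to confirm those byte values is to dump the characters as hex. This is only a sketch: it builds the NBSP from octal escapes (\302\240 is c2 a0) so that nothing depends on copy/pasting an invisible character, and it uses od, although any hex dumper would do.

```shell
# An ordinary space is the single byte 20
printf ' ' | od -An -tx1

# A NBSP in UTF-8 is the two bytes c2 a0
printf '\302\240' | od -An -tx1
```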
The Wikipedia article on NBSPs explains very clearly what the character does:
Text-processing software typically assumes that an automatic line break may be inserted anywhere a space character occurs; a non-breaking space prevents this from happening (provided the software recognizes the character). For example, if the text "100 km" will not quite fit at the end of a line, the software may insert a line break between "100" and "km". An editor who finds this behavior undesirable may choose to use a non-breaking space between "100" and "km". This guarantees that the text "100 km" will not be broken: if it does not fit at the end of a line, it is moved in its entirety to the next line.
In other words, it's a formatting character that looks like an ordinary space but really isn't. It's a nuisance in command-line data work, because a command that relies on recognising an ordinary space and doing something about it will fail if the space is a NBSP. The screenshot below shows "Homo sapiens" with a space and with a NBSP. Notice what happens with cut, AWK's split function and grep:
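The same comparison can be tried in a terminal. What follows is a minimal sketch, not necessarily the exact commands from the screenshot. The variable names "plain" and "nbsp" are just for illustration, and the NBSP is built with printf octal escapes so it can't be silently converted to a plain space on the way in.

```shell
# The same name with an ordinary space (hex 20) and with a NBSP (hex c2 a0)
plain='Homo sapiens'
nbsp=$(printf 'Homo\302\240sapiens')

# cut on a space delimiter: first field, or the whole line if no delimiter is found
printf '%s\n' "$plain" | cut -d' ' -f1    # prints Homo
printf '%s\n' "$nbsp" | cut -d' ' -f1     # prints the whole name, unsplit

# AWK's split function returns the number of pieces it made
printf '%s\n' "$plain" | awk '{print split($0, parts, " ")}'    # prints 2
printf '%s\n' "$nbsp" | awk '{print split($0, parts, " ")}'     # prints 1

# grep -c counts matching lines; the pattern is typed with an ordinary space.
# grep exits non-zero when nothing matches, hence the "|| true".
printf '%s\n' "$plain" | grep -c 'Homo sapiens'             # prints 1
printf '%s\n' "$nbsp" | grep -c 'Homo sapiens' || true      # prints 0
```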
The "gremlins" script from A Data Cleaner's Cookbook will detect NBSPs, and the Cookbook also offers several ways to locate NBSPs within a text file. In addition, the following function will replace a NBSP with a mid-level dot on a bright green background:
nbspvis() { sed 's|\xc2\xa0|\x1b[102m\xc2\xb7\x1b[0m|g'; }
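For finding which lines are affected, one simple possibility (not necessarily one of the Cookbook's methods) is to grep for the NBSP bytes with line numbering. The function name "nbsplines" below is made up for this sketch.

```shell
# nbsplines is a hypothetical helper name.
# Prints "line_number:line" for every line of the named file that contains a NBSP.
nbsplines() {
    grep -n "$(printf '\302\240')" "$1"
}
```

Piping the result through the function above, as in "nbsplines names.txt | nbspvis" (where names.txt stands for whatever file is being checked), shows both the line numbers and the highlighted NBSPs.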
When auditing biodiversity datasets I often find NBSPs in scientific names, like "Homo sapiens". Those plain-text datasets don't need formatting, so why are there NBSPs in so many scientific names?
I used to think that dataset compilers were copy/pasting scientific names with NBSPs from online sources. That explanation doesn't seem to work. There aren't any NBSPs in the name sources I've checked that were cited by the compilers. I also haven't found NBSP-infected scientific names in widely used online references, like the Catalogue of Life and Wikipedia.
What's more, copy/pasting NBSPs isn't always successful. NBSPs are retained when copy/pasting into an ODT or ODS document, but are usually converted to plain spaces when pasting into text editors or a terminal emulator. Try it and see:
This line has NBSPs between words.
The source of those scientific name NBSPs is still a mystery. Readers' suggestions welcome!
Rather than do a global replacement of NBSPs when data cleaning, I first delete any that are at the beginning or end of a data item, then replace any internal NBSPs with an ordinary space, in both cases with sed:
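A minimal sketch of that two-step clean-up is given here, under some assumptions that may not hold for every dataset: GNU sed (for -E and the \t escape), UTF-8 input, and data items that are tab-separated fields. The names "NBSP" and "nbspclean" are invented for the example.

```shell
# Assumptions: GNU sed, UTF-8 input, tab-separated data items.
# NBSP holds the raw two-byte character; nbspclean is a hypothetical function name.
NBSP=$(printf '\302\240')

nbspclean() {
    # 1. delete NBSPs at the start of a data item (after line start or a tab)
    # 2. delete NBSPs at the end of a data item (before a tab or line end)
    # 3. replace any remaining (internal) NBSPs with an ordinary space
    sed -E \
        -e "s/(^|\t)(${NBSP})+/\1/g" \
        -e "s/(${NBSP})+(\t|\$)/\2/g" \
        -e "s/${NBSP}/ /g"
}
```

The function reads from standard input, so it can be used as "nbspclean < file > cleaned_file". Trimming the leading and trailing NBSPs first avoids turning them into leading and trailing plain spaces, which would just be a different data-cleaning problem.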
Last update: 2024-03-01
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.