Mojibake bonanza
Dozens of gibberished words! New mojibake puzzles! That's what I found recently in a UTF-8 dataset from the Museum of Comparative Zoology (MCZ) at Harvard University. Below are my attempts at reconstructing a few of the mojibake histories.
UTF-8 > Windows-1252 > UTF-8
In this scenario, a two-byte UTF-8 character is first read as two one-byte characters in a Windows-1252 program. The separate Windows-1252 characters are later converted to their two-byte UTF-8 equivalents. The table below shows hexadecimal values.
Original | UTF-8 | Windows-1252 | UTF-8 | Mojibake |
Herich-Schäffer* | c3 a4 | c3, a4 | c3 83, c2 a4 | Herich-SchÃ¤ffer |
Lefèbvre | c3 a8 | c3, a8 | c3 83, c2 a8 | LefÃ¨bvre |
Médanos | c3 a9 | c3, a9 | c3 83, c2 a9 | MÃ©danos |
Cañon | c3 b1 | c3, b1 | c3 83, c2 b1 | CaÃ±on |
Falcón | c3 b3 | c3, b3 | c3 83, c2 b3 | FalcÃ³n |
Oberthür | c3 bc | c3, bc | c3 83, c2 bc | OberthÃ¼r |
*Gottlieb August Wilhelm Herrich-Schäffer (1799-1874), German entomologist. The spelling here is the one in the MCZ dataset.
In a three-byte version of this sequence, "Aug. trip ‘83" (left single quote; hex e2 80 98 in UTF-8) was read by a Windows-1252 program as the three single characters "â" (e2), "€" (80) and "˜" (98). The three characters were then converted to UTF-8 (hex c3 a2, e2 82 ac and cb 9c), giving "Aug. trip â€˜83".
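The round trip can be reproduced in a few lines of Python. This is only a sketch: the function names are made up for illustration, and Python's cp1252 codec stands in for whatever Windows-1252 program actually handled the MCZ data.

```python
def misread_utf8(text, wrong_codec="cp1252"):
    """Encode as UTF-8, then decode the bytes with a one-byte codec."""
    return text.encode("utf-8").decode(wrong_codec)


def undo_misread(garbled, wrong_codec="cp1252"):
    """Reverse the misreading: recover the bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_codec).decode("utf-8")


def utf8_hex(text):
    """Space-separated hex of the UTF-8 encoding (bytes.hex(sep) needs Python 3.8+)."""
    return text.encode("utf-8").hex(" ")


print(misread_utf8("Médanos"))               # MÃ©danos
print(utf8_hex(misread_utf8("é")))           # c3 83 c2 a9
print(misread_utf8("Aug. trip ‘83"))         # Aug. trip â€˜83
print(undo_misread(misread_utf8("Falcón")))  # Falcón
```

Python's cp1252 codec leaves five byte values (81, 8d, 8f, 90, 9d) undefined, so a character whose UTF-8 encoding contains one of them would raise an error here instead of producing mojibake.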
UTF-8 > Mac OS Roman > UTF-8
A similar scenario, but this time the UTF-8 original was processed as one-byte characters on a Mac.
Original | UTF-8 | Mac OS Roman | UTF-8 | Mojibake |
Volcán | c3 a1 | c3, a1 | e2 88 9a, c2 b0 | Volc√°n |
Jordão | c3 a3 | c3, a3 | e2 88 9a, c2 a3 | Jord√£o |
Açu | c3 a7 | c3, a7 | e2 88 9a, c3 9f | A√ßu |
Tapirapé | c3 a9 | c3, a9 | e2 88 9a, c2 a9 | Tapirap√© |
Felíx | c3 ad | c3, ad | e2 88 9a, e2 89 a0 | Fel√≠x |
Dueñas | c3 b1 | c3, b1 | e2 88 9a, c2 b1 | Due√±as |
Jerónimo | c3 b3 | c3, b3 | e2 88 9a, e2 89 a5 | Jer√≥nimo |
Vanhöffen | c3 b6 | c3, b6 | e2 88 9a, e2 88 82 | Vanh√∂ffen |
Izúcar | c3 ba | c3, ba | e2 88 9a, e2 88 ab | Iz√∫car |
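Rows of this table can be regenerated with Python's mac_roman codec. The helper name table_row is hypothetical, and the codec is assumed to match what the Mac program in question used.

```python
def table_row(original, char, codec="mac_roman"):
    """Build one table row for a word and the accented character it contains."""
    raw = char.encode("utf-8")                       # original UTF-8 bytes
    split_bytes = ", ".join(f"{b:02x}" for b in raw)  # bytes read one at a time
    misread = raw.decode(codec)                      # one character per byte
    reencoded = ", ".join(c.encode("utf-8").hex(" ") for c in misread)
    mojibake = original.replace(char, misread)
    return " | ".join([original, raw.hex(" "), split_bytes, reencoded, mojibake])


print(table_row("Volcán", "á"))   # Volcán | c3 a1 | c3, a1 | e2 88 9a, c2 b0 | Volc√°n
print(table_row("Felíx", "í"))    # Felíx | c3 ad | c3, ad | e2 88 9a, e2 89 a0 | Fel√≠x
```

Passing codec="cp1252" gives the corresponding rows for the Windows-1252 scenario.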
Mac OS Roman > Windows-1252 > UTF-8
"S‹o Paulo" started out as "São Paulo" on a Mac, where the "Latin small a with tilde" has the hex encoding 8b. When read on a Windows machine, hex 8b became the Windows-1252 character "single left-pointing angle quotation mark". That character was then converted to UTF-8 as hex e2 80 b9 in the MCZ dataset.
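The same history as a short Python sketch, with the standard mac_roman and cp1252 codecs assumed to stand in for the Mac and Windows programs:

```python
original = "São Paulo"

mac_bytes = original.encode("mac_roman")       # "ã" becomes the single byte 8b
seen_on_windows = mac_bytes.decode("cp1252")   # 8b is "‹" in Windows-1252
utf8_bytes = seen_on_windows.encode("utf-8")   # "‹" becomes e2 80 b9

print(mac_bytes.hex(" "))      # 53 8b 6f 20 50 61 75 6c 6f
print(seen_on_windows)         # S‹o Paulo
print(utf8_bytes.hex(" "))     # 53 e2 80 b9 6f 20 50 61 75 6c 6f
```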
UTF-8 > Windows-1252 > Mac OS Roman > Windows-1252 > UTF-8
"José" possibly became "JosÌ©" in 4 steps:
- The original was in UTF-8, where "é" is a two-byte character, hex c3 a9
- The string went to a Windows program where each byte was read separately, giving "Ã" (c3) and "©" (a9)
- Next to a Mac, where "Ã" is hex cc, not c3, but "©" is again hex a9
- Back to Windows, where cc a9 was read as "Ì" (cc in Windows-1252) and "©" (a9)
- From Windows to a UTF-8 environment, where "Ì" was converted to hex c3 8c and "©" to c2 a9
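The proposed chain can be checked step by step in Python. The helper name reinterpret is made up for this sketch, and the codecs are assumed to behave like the programs involved.

```python
def reinterpret(text, written_as, read_as):
    """Write text out in one encoding, then read the bytes back in another."""
    return text.encode(written_as).decode(read_as)


start = "José"
on_windows = reinterpret(start, "utf-8", "cp1252")            # "é" read as "Ã" + "©"
back_on_windows = reinterpret(on_windows, "mac_roman", "cp1252")  # Mac bytes cc a9 read as "Ì" + "©"

print(on_windows)                                    # JosÃ©
print(on_windows.encode("mac_roman").hex(" "))       # 4a 6f 73 cc a9
print(back_on_windows)                               # JosÌ©
print(back_on_windows[3:].encode("utf-8").hex(" "))  # c3 8c c2 a9
```

Reproducing the result shows the chain is consistent with the codecs, not that it is what actually happened.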
Bafflers
I can't figure out (yet) what happened in the following 4 cases. Some of the MCZ dataset strings were generated by OCR of specimen labels, so OCR error might be partly to blame.
- In "GroseÂSmith", there's an invisible soft hyphen after the "Â" and the UTF-8 encoding is hex c3 82, c2 ad. There might have been a soft hyphen after the ordinary one in the original "Grose-Smith".
- Gualeguaychú > Gualeguaych£
- Biológica > Biol¢gica
- Herrich-Schä[f?]fer > Herrich-Sch臟fer
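One way to attack cases like these is a brute-force search over pairs of encodings. The sketch below tries every "written as X, read as Y" pair from a candidate list; the list itself is an assumption (a guess at plausible encodings, not anything documented for the MCZ data), and the function name is hypothetical.

```python
from itertools import permutations

# Candidate encodings to try: an illustrative guess, extend as needed.
CODECS = ["utf-8", "cp1252", "latin-1", "mac_roman", "cp437", "cp850", "shift_jis"]


def find_single_misreads(original, garbled, codecs=CODECS):
    """Return (written_as, read_as) pairs that turn original into garbled."""
    hits = []
    for written_as, read_as in permutations(codecs, 2):
        try:
            if original.encode(written_as).decode(read_as) == garbled:
                hits.append((written_as, read_as))
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot represent or read the string
    return hits


for original, garbled in [("Gualeguaychú", "Gualeguaych£"), ("Biológica", "Biol¢gica")]:
    print(original, garbled, find_single_misreads(original, garbled))
```

With this candidate list, both Spanish examples are matched by a DOS code page (cp437 or cp850) being read as a Windows or Latin-1 style encoding. That is one possible history, not a confirmed one, and a search like this cannot account for OCR errors.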
P.S. Another interesting feature of the MCZ dataset is the variety of substitutes for degrees in latitude/longitude figures. I'm glad I didn't have to check whether all of these formats had been correctly converted to decimal degrees (in the decimalLatitude and decimalLongitude fields in the dataset):
10°18'N the true degree symbol, hex c2 b0
25º10'E the "masculine ordinal indicator", hex c2 ba
20˚34'N the "ring above" character, hex cb 9a
20*52'04"S an asterisk
31, 53.104 N a comma
20'22'48"S 148'35'45"E an apostrophe
11.40 S no symbol at all
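A rough Python sketch of how such a check might start, assuming only the four unambiguous degree substitutes (true degree sign, masculine ordinal indicator, ring above, asterisk) are accepted and everything else is set aside for manual review. The names DEGREE_MARKS and to_decimal are made up for illustration.

```python
import re

# Accepted degree markers: degree sign, masculine ordinal, ring above, asterisk.
DEGREE_MARKS = "\u00b0\u00ba\u02da*"

PATTERN = re.compile(
    r"(?P<deg>\d+)[" + re.escape(DEGREE_MARKS) + r"]"
    r"(?P<min>\d+(?:\.\d+)?)'"
    r'(?:(?P<sec>\d+(?:\.\d+)?)")?'
    r"(?P<hemi>[NSEW])"
)


def to_decimal(text):
    """Convert degrees, minutes and optional seconds to decimal degrees.

    Returns None for anything the pattern does not recognise.
    """
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        return None
    value = int(m["deg"]) + float(m["min"]) / 60 + float(m["sec"] or 0) / 3600
    return -value if m["hemi"] in "SW" else value


for sample in ["10°18'N", "25º10'E", "20˚34'N", '20*52\'04"S', "31, 53.104 N", "11.40 S"]:
    print(sample, "->", to_decimal(sample))
```

The comma, apostrophe and no-symbol formats return None here on purpose, since their meaning (decimal degrees, or degrees and decimal minutes) cannot be told from the string alone.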
Last update: 2020-12-16
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License