The Web's most familiar gibberish: ’
I'll bet you've seen this mojibake dozens of times, but in case you haven't, here are some screenshots from my Web browser:
The cause of the gibberish is clear enough. The English possessive "s" was preceded by a right single quote (RSQ; U+2019) in the original text. The RSQ ’ was encoded in UTF-8 as the three bytes e2 80 99 (hexadecimal). One or more programs then read those bytes one at a time as if they were in something like Windows-1252, where e2 is â, 80 is € and 99 is ™. The resulting three characters â € ™ were saved as UTF-8 text, and that is what the webpage presents to the browser.
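The round trip described above can be reproduced in a few lines of Python. This is only a sketch: it assumes the wrong decoder was exactly Windows-1252 (Python's "cp1252" codec), which I can't confirm for any of the pages in question.

```python
# Sketch of the mojibake route, assuming Windows-1252 was the wrong decoder.
rsq = "\u2019"                          # RIGHT SINGLE QUOTATION MARK
utf8_bytes = rsq.encode("utf-8")        # the three bytes e2 80 99
garbled = utf8_bytes.decode("cp1252")   # each byte read as one Windows-1252 character
resaved = garbled.encode("utf-8")       # the three characters saved as UTF-8 again

print(utf8_bytes.hex(" "))   # e2 80 99
print(garbled)               # â€™
print(resaved.hex(" "))      # c3 a2 e2 82 ac e2 84 a2
```

The last line shows what ends up in the webpage: eight bytes that a browser correctly reads, in UTF-8, as three separate characters.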
What happened was not a failure of my Web browser to interpret a RSQ. In each of the four examples shown above, the webpage code tells the browser that the character set to be used is UTF-8, and then presents the RSQ as the three separate characters. Here's the relevant part of the "Washington Doctors" page code:
What program or programs did the re-interpreting of the RSQ? I don't know and it would be hard to find out.
A more interesting question is why the original text put the RSQ before the possessive "s". The correct character to use is called an "apostrophe". Apostrophes have been used this way for hundreds of years in English, and are still recommended for possessives in dictionaries:
[apostrophe]...used before or after s to show possession, as in Helen's house or babies' hands
Cambridge Dictionary
[apostrophe]...used to indicate the omission of letters or figures, the possessive case (as in "John's book"), or the plural of letters or figures (as in "the 1960's")
Merriam-Webster Dictionary
During most of those hundreds of years, typesetters printed an apostrophe not as a straight-up-and-down mark, but as a "typographer's quote", a kind of "uplifted comma":
The straight-up-and-down version of the mark only became acceptable with the rise of mechanical typewriters. In the digital era we can use either a straight-up-and-down mark or a typographer's quote/curly quote/smart quote/RSQ as an apostrophe, but there's a problem: there's no key for the RSQ on an English-language keyboard. You can't type it by hitting the ' key.
Or can you? If I type a phrase in a terminal, a text editor or LibreOffice Writer, the ' that I type is always interpreted by the terminal program or editor as the byte 27 (hexadecimal), which in ASCII and Unicode is called "apostrophe":
However, some text processing programs, like Microsoft Word, will automatically interpret the straight-up-and-down mark on the quotes key as a "smart quote" or "curly quote" and will put a RSQ rather than an apostrophe before the possessive "s". WordPress by default uses "smart quotes". So that familiar gibberish probably starts with an e2 80 99 that an authoring program has used in place of a 27, regardless of the typist's wishes.
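To see why the substitution matters, here is a small Python sketch comparing the same phrase with a typed apostrophe and with a RSQ. The misread() helper is hypothetical (my own name, not part of any library) and again assumes the faulty step is a Windows-1252 reading of UTF-8 bytes.

```python
typed = "Helen's house"        # U+0027, as typed in a terminal or text editor
smart = "Helen\u2019s house"   # U+2019, as substituted by a "smart quotes" feature

def misread(text: str) -> str:
    """Hypothetical helper: UTF-8 bytes wrongly decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

print(typed.encode("utf-8").hex(" "))   # the apostrophe is the single byte 27
print(smart.encode("utf-8").hex(" "))   # the RSQ is the three bytes e2 80 99
print(misread(typed))                   # Helen's house
print(misread(smart))                   # Helenâ€™s house
```

The typed apostrophe is plain ASCII, so it comes through the faulty step unchanged. Only the multi-byte RSQ turns into gibberish.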
An additional wrinkle in this story is that the devisers of Unicode decided that the RSQ (U+2019) is the best choice of character to do the work of an apostrophe. A 2015 blog post attacks this decision with some well-reasoned arguments. The author proposed instead that the apostrophe should be represented by the "modifier letter apostrophe" (U+02BC; UTF-8 bytes ca bc in hexadecimal), which in many fonts looks the same as the RSQ. The post led to a brief bout of soul-searching by Unicoders, but no change.
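The three candidate characters can be compared with Python's standard unicodedata module. In this sketch describe() is a hypothetical helper of my own; the names come from the Unicode database bundled with Python.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Hypothetical helper: code point, Unicode name and UTF-8 bytes (hex) of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))

for ch in ("\u0027", "\u2019", "\u02bc"):
    print(*describe(ch))
# U+0027 APOSTROPHE 27
# U+2019 RIGHT SINGLE QUOTATION MARK e2 80 99
# U+02BC MODIFIER LETTER APOSTROPHE ca bc
```

Note that U+02BC is also multi-byte in UTF-8, so it would be just as exposed to a faulty single-byte reading as the RSQ is.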
Regardless of the Unicode situation, I expect the ’ gibberish to be a permanent fixture of the Web, and writers of the original text can truthfully say "I didn't do it!".
Last update: 2024-11-22
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License