For a full list of BASHing data blog posts see the index page.
Msot popele can undreatnsd tihs setennce
I wrote about this particular kind of text garbling five years ago on the Linux Rain website. For some background, see the last section of this post. My rules are:
- Input is words separated by a single whitespace
- With words of four or more letters, leave the first and last letters alone and scramble the middle letters
- A word to be scrambled can end with a single punctuation mark (e.g. "Clarence,"), but no internal punctuation is allowed (e.g. "Richard's")
In 2015 I offered a script for garbling which relied on BASH string functions. Here I demo a slightly different garbler and a fairly clumsy ungarbler.
Garbler. A simple version of the script ("garbler") would look like this:
#!/bin/bash
tr " " "\n" <<<"$1" \
| while read line; do
if [[ "${#line}" -lt 4 ]]; then
printf "$line\n"
else
printf "${line:0:1}$(printf ${line:1:-1} | fold -w1 | shuf | paste -s -d"\0")${line:(-1)}\n"
fi
done | paste -s -d" "
exit 0
and work like this:
The script uses several different BASH string functions with the string variable "line".
${#line} is the number of characters in the string, while the other three functions extract substrings from "line".
${line:0:1} extracts from position "0" just "1" character, so this function returns the first character in the string.
${line:(-1)} extracts the last character in the string, and is sometimes written ${line: -1}, with a space before the "-1".
${line:1:-1} extracts the substring beginning with the second character and ending with the character before the last character.
The text string is fed to tr, which turns the string into a vertical list of words. A while loop reads each word from the list in turn. If the word is less than 4 letters long (if [[ "${#line}" -lt 4 ]]) it's printf'ed with a following newline. Otherwise the script printf's the first letter in the word, the middle bit scrambled, and the last letter in the word. To scramble the middle bit I first pipe it to fold -w1 to put the letters into a vertical list. This list is shuffled, then pasted back together as a string with no separation between the letters (paste -s -d"\0"), and printf'ed with a following newline. When all words have been processed, they're reassembled into a string with a single whitespace as separator using paste -s -d" ".
However, if the text being garbled contains punctuation at the end of a word, then the punctuation character will stay put and the last letter of the word will get included in the scrambling. I avoid this by inserting an "if/else" for words ending in punctuation:
#!/bin/bash
tr " " "\n" <<<"$1" \
| while read line; do
if [[ "${#line}" -lt 4 ]]; then
printf "$line\n"
else
if [[ "${line:(-1)}" =~ [[:punct:]] ]]; then
printf "${line:0:1}$(printf ${line:1:-2} | fold -w1 | shuf | paste -s -d"\0")${line:(-2)}\n"
else
printf "${line:0:1}$(printf ${line:1:-1} | fold -w1 | shuf | paste -s -d"\0")${line:(-1)}\n"
fi
fi
done | paste -s -d" "
exit 0
${line:(-2)} is the second-to-last character.
${line:1:-2} extracts the substring beginning with the second character and ending with the second-to-last character.
Ungarbler. Ungarbling works much like garbling. The garbled sentence's word are made into a list and each word is piped to a while loop. Words of three or fewer letters pass through unchanged to a new list, and at the end of the loop the list of words is pasted back into a sentence. One difference from garbling is that all punctuation is first removed from the list with sed, to simplify the ungarbling:
tr " " "\n" <<<"$1" | sed 's/[[:punct:]]//g' | while...
The scrambled words in the list are checked against a lookup table to get their unscrambled originals. I built my lookup table, called "wordtable", in two stages from /usr/share/dict/words. First I selected words of four or more letters without an apostrophe:
grep -vP "\x27" /usr/share/dict/words | grep -E "[[:alpha:]]{4,}" > baselist
I then piped the result (72314 words) to an AWK command which printed the word, a tab, and the letters sorted and in lowercase:
awk '{split($0,a,""); \
PROCINFO["sorted_in"] = "@val_str_asc"; \
printf("%s\t",$0); \
for (i in a) printf("%s",tolower(a[i])); \
print ""}' baselist > wordtable
AWK first splits an unscrambled word into separate letters, putting each letter into the array "a" (split($0,a,"")). It then printf's the unscrambled word ($0) followed by a tab, but without a newline. It follows this printf with a letter-by-letter printf of each of the letters in "a", printing each letter in lowercase (for (i in a) printf("%s",tolower(a[i]))). The sorted order in which the letters are printed is set by PROCINFO["sorted_in"] = "@val_str_asc"; see this BASHing data post for an explanation. Finally, AWK prints a newline (print "") before moving on to the next unscrambled word in the list.
To check each scrambled word — called "line" in the while loop — against "wordtable", I do filtering with grep. The first filter looks for a match in "wordtable" for a sorted, lowercase version of "line" (it'll be in field 2 of "wordtable"), and cuts out the corresponding unscrambled word (in field 1):
grep "$(printf ${line,,} | fold -w1 | sort | paste -s -d"\0")" wordtable | cut -f1
The result is filtered to get the unscrambled words having the same number of letters as "line":
grep "$(printf ${line,,} | fold -w1 | sort | paste -s -d"\0")" wordtable | cut -f1 \
| grep -E "^.{${#line}}$"
The third filtering step finds the unscrambled words having the same starting and finishing letters as "line". Because the starting letter of either the scrambled or unscrambled words might be uppercase, grep is used with its "ignore case" option:
grep "$(printf ${line,,} | fold -w1 | sort | paste -s -d"\0")" wordtable | cut -f1 \
| grep -E "^.{${#line}}$" | grep -Ei "^${line:0:1}.+${line:(-1)}$")\n"
At this stage, the "ungarbler" script looks like this (and works OK):
#!/bin/bash
tr " " "\n" <<<$1" \
| sed 's/[[:punct:]]//g' \
| while read line; do
if [[ "${#line}" -lt 4 ]]; then
printf "$line\n"
else
printf "$(grep "$(printf ${line,,} | fold -w1 | sort | paste -s -d"\0")" wordtable | cut -f1 \
| grep -E "^.{${#line}}$" \
| grep -Ei "^${line:0:1}.+${line:(-1)}$")\n"
fi
done | paste -s -d" "
exit 0
There are two bugs lurking here. One is that despite the filtering, there might still be several different possible unscrambled originals in "wordtable". As the script currently stands, these will appear one after the other in the "ungarbler" output. My fix for this problem is to convert newlines to forward slashes with tr, and to delete the final "/" with sed (see final script below). The result is a visual grouping of multiple (possible) original words. If only one original is found, no forward slash is added:
A second bug is that if no original word is found in "wordtable", the script should report that result, and it doesn't. To do this I first save the result of grep filtering and forward-slashing in a variable, "foo" (see final script below). I then test to see if the variable is empty, meaning no original was found in "wordtable". If "foo" is empty, the script prints the scrambled word followed by a question mark in brackets. A side-benefit of this workaround is that words in scrambled text with an ("illegal") apostrophe are flagged as "(?)" words, since "wordtable" was generated by excluding original words with apostrophes.
The final "ungarbler" script:
#!/bin/bash
tr " " "\n" <<<$1" \
| sed 's/[[:punct:]]//g' \
| while read line; do
if [[ "${#line}" -lt 4 ]]; then
printf "$line\n"
else
foo=$(grep "$(printf ${line,,} | fold -w1 | sort | paste -s -d"\0")" wordtable | cut -f1 \
| grep -E "^.{${#line}}$" \
| grep -Ei "^${line:0:1}.+${line:(-1)}$" | tr '\n' '/' | sed 's/\/$//')
if [ -z "$foo" ]; then
printf "${line}(?) "
else
printf "${foo}\n"
fi
done | paste -s -d" "
exit 0
Background. Text garbling became popular in the early 2000's thanks to this widely shared quote (one word here corrected):
Aoccdrnig to a rscheearer at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, olny taht the frist and lsat ltteres are at the rghit pcleas. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by ilstef, but the wrod as a wlohe.
Several programmers have tinkered with this kind of garbling. Jamie Zawinski wrote a garbled Perl script for garbling, with a licence text beginning Premssioin to use, cpoy, mdoify, drusbiitte, and slel this stafowre.... Steve Sachs set up a webpage for online garbling with AWK, and Clément Pit-Claudel offered Python code to ungarble garbled text.
How would "ungarbler" do with the famous quote? See below. Note that the capitalisation and punctuation are gone (and "doesn't" isn't matched), but the text is otherwise correct.
Last update: 2019-12-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License