


Four exercises with data art

A PPM image is a text file with a short header followed by a list of numbers. For a color image, the numbers range from 0 to 255 and every 3 numbers in the list become the decimal RGB values for a single pixel. Building PPMs is an easy way to do data art, as shown below.
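To make the format concrete, here's a minimal hand-written sketch of a 2 x 2 color PPM. The function name tiny_ppm and the four pixel colors are just illustrative choices and aren't used elsewhere in this post.

```shell
# tiny_ppm: print a 2 x 2 plain-text (P3) PPM to stdout.
# The name and the colors are made up for this illustration.
tiny_ppm() {
    printf 'P3\n2 2\n255\n'        # header: format, width and height, maximum value
    printf '%s\n' 255 0 0          # pixel 1 (top left): red
    printf '%s\n' 0 255 0          # pixel 2 (top right): green
    printf '%s\n' 0 0 255          # pixel 3 (bottom left): blue
    printf '%s\n' 255 255 255      # pixel 4 (bottom right): white
}
```

Running tiny_ppm > tiny.ppm should give a file of 15 lines (3 header lines plus 12 color values) that an image viewer with PPM support can open.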


Random numbers. The PPM header will be

P3
300 300
255

which means the image will be 300 x 300 pixels with a maximum color value of 255. For 300 x 300 pixels (90000 in total) I'll need 270000 random numbers between 0 and 255. The best way I know to get those numbers is to use the phenomenally fast jot utility:

(printf "P3\n300 300\n255\n" && jot -r 270000 0 255) > rando.ppm

random

Colorful data noise!

The -r option for jot tells it to generate random numbers from the range "0 255". The number before the range ("270000") is how many to generate. To display "rando.ppm" and the 3 images below I've used GIMP.
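jot isn't installed by default on every system. As a possible fallback, and assuming only a POSIX-style AWK, something like the sketch below should build the same kind of noise image. The function name random_ppm and its two arguments are hypothetical, and I haven't compared its speed with jot.

```shell
# random_ppm WIDTH HEIGHT: print a P3 PPM filled with random color values.
# A hypothetical stand-in for the jot command, using awk's rand().
random_ppm() {
    printf 'P3\n%d %d\n255\n' "$1" "$2"
    awk -v n="$(( $1 * $2 * 3 ))" 'BEGIN {
        srand()                          # seed from the current time
        for (i = 0; i < n; i++)
            print int(rand() * 256)      # rand() is in [0,1), so this gives 0-255
    }'
}
```

With this, random_ppm 300 300 > rando.ppm would write 3 header lines followed by 270000 values.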


Random ASCII characters. The next image was built with a random lot of the visible ASCII characters ("[[:graph:]]"). It's a lot darker than the random numbers image, mainly because the range of ASCII decimal values of the visible characters is only 33-126, so the lighter values between 127 and 255 are missing.

(printf "P3\n300 300\n255\n" && cat /dev/urandom | tr -dc '[:graph:]' | fold -w1 | head -270000 | awk 'BEGIN {for (n=0;n<=255;n++) ord[sprintf("%c",n)]=n} {print ord[$1] + 0}') > ascii.ppm

ascii random

The command
 
    cat /dev/urandom | tr -dc '[:graph:]' | fold -w1 | head -270000
 
generates random characters with /dev/urandom, screens out the invisible ones with tr -dc '[:graph:]', turns that string into a 1-character-wide list with fold -w1 and gets the first 270000 visible characters with head -270000.
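A small version of that pipeline is handy for checking what it produces. This is only a sketch: the function name random_graph_chars is made up, the input redirection replaces cat, and LC_ALL=C is an added precaution (not in the command above) so that tr treats /dev/urandom as plain bytes.

```shell
# random_graph_chars N: print N random visible ASCII characters, one per line.
# Hypothetical helper; LC_ALL=C is an assumption added for byte-wise handling.
random_graph_chars() {
    LC_ALL=C tr -dc '[:graph:]' < /dev/urandom | fold -w1 | head -n "$1"
}
```

For example, random_graph_chars 5 should print 5 lines, each holding a single character with a decimal value between 33 and 126.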
 
The AWK command that follows is one way to convert ASCII characters to their decimal values. Another way is to build an array containing the 255 characters and their corresponding decimal values, then use the array as a lookup table to get a number for each character. I haven't seen a significant speed difference between the two methods, and the one used here is a little easier. In a BEGIN statement AWK first builds an array "ord" whose index string is the ASCII character corresponding to one of the numbers from 0 to 255, and whose value string is the corresponding number. This array is then used to convert characters to numbers. I've used print ord[$1] + 0 so that the output is "0" if for some reason the lookup fails and returns an empty string.
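Here's the same AWK conversion tried on three characters. The wrapper function char_to_decimal is a hypothetical name; the AWK program inside it is the one from the command above.

```shell
# char_to_decimal: read one character per line on stdin and print its
# decimal ASCII value (0 if the lookup in "ord" finds nothing).
char_to_decimal() {
    awk 'BEGIN {for (n=0;n<=255;n++) ord[sprintf("%c",n)]=n} {print ord[$1] + 0}'
}

printf '%s\n' A z '!' | char_to_decimal
# prints 65, 122 and 33, one per line
```

An empty input line has no character to look up, so it comes out as 0.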


Dictionary word characters. For this image I used the word list in /usr/share/dict/words as a source of ASCII characters to turn into decimals for color values. There's some texture here and the bottom of the image looks a bit lighter than the top, probably because decimal ASCII values are larger for lowercase letters towards the "z" end of the alphabetically ordered dictionary list.

(printf "P3\n300 300\n255\n" && paste -s -d"\0" /usr/share/dict/words | fold -w1 | head -270000 | awk 'BEGIN {for (n=0;n<=255;n++) ord[sprintf("%c",n)]=n} {print ord[$1] + 0}') > words.ppm

dictionary words

The values-building part of the command is much like the last one, but the characters come from the list in /usr/share/dict/words. The list is converted to a single long string with no separation between words using paste -s -d"\0".
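To see what the paste step does, it can be tried on a two-word list. A sketch only: the function name join_chars is hypothetical, and it reads from standard input ("-") rather than from the dictionary file.

```shell
# join_chars: read a word list on stdin, join the words into one string
# with no separator, then print that string one character per line.
join_chars() {
    paste -s -d'\0' - | fold -w1
}

printf 'ab\ncd\n' | join_chars
# prints a, b, c and d, one per line
```

The "\0" delimiter here means "no delimiter", so the newline between "ab" and "cd" disappears and fold sees the single string "abcd".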


Electricity data. An intriguing image based on a dataset from the Machine Learning Repository at the University of California, Irvine. The data are a time series of electricity parameters at a house in France, and I've taken the "Global_active_power", "Global_reactive_power" and "Voltage" fields for the first 90000 valid records, scaling each parameter to the 0-255 range.

(printf "P3\n300 300\n255\n" && awk -F"\t" '{$1=int((($1-0.194)/9.078)*255); $2=int(($2/0.874)*255);$3=int((($3-224.68)/27.02)*255)} 1' OFS="\n" power) > power.ppm

electricity

I processed the raw data in "household_power_consumption.txt" to remove the header, delete the invalid records with a "?" character, get the first 90000 valid records, convert the field separator from ";" to a tab and cut out the wanted fields 3-5, building a file called "power":
 
    tail -n +2 household_power_consumption.txt | grep -v "?" | head -90000 | tr ";" "\t" | cut -f3-5 > power
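The cleaning steps can be checked on a few lines before running them on the full file. In this sketch the function name clean_power is hypothetical, and the sample header and records are invented and shortened to six semicolon-separated fields; they are not taken from the real dataset.

```shell
# clean_power: same steps as the command above, reading from stdin.
clean_power() {
    tail -n +2 | grep -v "?" | head -n 90000 | tr ";" "\t" | cut -f3-5
}

# Invented sample: a header, a valid record, an invalid record, a valid record.
printf '%s\n' \
    'date;time;active;reactive;voltage;intensity' \
    '01/01/2007;00:00:00;1.500;0.100;240.000;6.200' \
    '01/01/2007;00:01:00;?;?;?;?' \
    '01/01/2007;00:02:00;2.000;0.000;238.500;8.400' |
clean_power
```

Only the two valid records should survive, each reduced to fields 3, 4 and 5 separated by tabs.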
 
I used datamash to get minimum and range values for the 3 fields in "power":
 
    datamash min 1 range 1 min 2 range 2 min 3 range 3 < power
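If datamash isn't available, AWK can produce similar numbers. This is a rough sketch, not a replacement for datamash: the function name min_range is made up, it assumes tab-separated numeric fields with no header, and it prints minimum and range for every field in one tab-separated line.

```shell
# min_range: print "min range" for each tab-separated field on stdin,
# all on one line, separated by tabs.
min_range() {
    awk -F'\t' '
    NR == 1 { nf = NF; for (i = 1; i <= nf; i++) { min[i] = $i + 0; max[i] = $i + 0 } }
    { for (i = 1; i <= nf; i++) {
          v = $i + 0                      # force a numeric comparison
          if (v < min[i]) min[i] = v
          if (v > max[i]) max[i] = v
      } }
    END { for (i = 1; i <= nf; i++)
              printf "%s%s\t%s", (i > 1 ? "\t" : ""), min[i], max[i] - min[i]
          print "" }'
}
```

It would be used in the same way as the datamash command, for example min_range < power.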
 
In the AWK command shown above I get an integer value in the range 0-255 for parameters 1 and 3 with
 
    int((([value] - minimum)/range)*255)
 
Parameter 2 starts at zero so [value] - minimum is the same as [value].
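The scaling formula can be tried on its own with a small helper. The function name scale255 and the test numbers are hypothetical; the arithmetic is the formula given above. Note that int() truncates rather than rounds.

```shell
# scale255 VALUE MIN RANGE: scale VALUE to an integer between 0 and 255.
scale255() {
    awk -v v="$1" -v min="$2" -v range="$3" \
        'BEGIN {print int(((v - min) / range) * 255)}'
}

scale255 2 2 8     # the minimum, prints 0
scale255 6 2 8     # halfway through the range, 127.5 truncated, prints 127
scale255 10 2 8    # the maximum (minimum + range), prints 255
```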
 
This processing would have generated tab-separated triplets if a tab was the output field separator, but instead I've used a newline as output field separator (OFS="\n") to directly build a list of 270000 color values.




Last update: 2025-02-14
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License