7 ways to get the source code of a webpage
Web scraping seems like an IT field of its own, with a multitude of methods and a wide range of online and desktop tools aimed at getting data from websites.
Sometimes, though, all I want to do is capture a webpage as a raw, unformatted HTML file for my own scripted uses. Sure, I could do that job by going to the page in a browser like Firefox, pressing Ctrl+u to see the page source code, Ctrl+a to select it all, Ctrl+c to copy it all, Ctrl+v to paste it all into a text editor like Geany, then Ctrl+s to save the text file somewhere.
I think I'd prefer to do all that with a simple CLI command, like this:
[program name] [option] [URL] > [text file]
Any of the following will work:
Text-mode browsers
elinks -source [URL] > [text file]
links -source [URL] > [text file]
links2 -source [URL] > [text file]
lynx -source [URL] > [text file]
w3m [URL] -dump_source | zcat > [text file]
Data transfer tools
curl -s [URL] > [text file]
wget -q -O [text file] [URL]
All 7 methods give the same result:
elinks -source https://validator.w3.org > elinksdump
links -source https://validator.w3.org > linksdump
links2 -source https://validator.w3.org > links2dump
lynx -source https://validator.w3.org > lynxdump
w3m https://validator.w3.org -dump_source | zcat > w3mdump
curl -s https://validator.w3.org > curldump
wget -q -O wgetdump https://validator.w3.org
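One way to check that claim without eyeballing seven files is to compare them byte for byte. The following is a minimal sketch; the function name all_same is my own invention, and it assumes the page did not change between fetches (a page with dynamic content could legitimately differ from one request to the next).

```shell
# Report whether every file given as an argument is byte-identical to the first.
# all_same is a hypothetical helper name, not part of any tool listed above.
all_same() {
    first=$1
    shift
    for f in "$@"; do
        # cmp -s prints nothing and exits non-zero as soon as the files differ
        cmp -s "$first" "$f" || {
            printf '%s differs from %s\n' "$f" "$first"
            return 1
        }
    done
    printf 'all %d other files match %s\n' "$#" "$first"
}
```

With the dump files from the commands above, the call would be: all_same elinksdump linksdump links2dump lynxdump w3mdump curldump wgetdump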
Notes
- w3m requests and returns a compressed page-source file, which I uncompress with zcat
- The "-s" (for "silent") option for curl suppresses messages to the terminal
- The "-q" (for "quiet") option for wget likewise suppresses messages to the terminal
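The zcat step in the w3m pipeline relies on the server sending a compressed page. If a server happened to send the page uncompressed, plain zcat would be expected to stop with a "not in gzip format" error. A hedged workaround, assuming GNU gzip, is to add the force option so that unrecognised input is copied through unchanged. The function name gunzip_if_needed below is hypothetical.

```shell
# Decompress stdin if it is gzip data, otherwise pass it through unchanged.
# gunzip_if_needed is a hypothetical name. Assumes GNU gzip, where -f
# combined with -c copies unrecognised input to stdout instead of failing.
gunzip_if_needed() {
    gzip -dcf
}
```

It would slot into the w3m command in place of zcat: w3m [URL] -dump_source | gunzip_if_needed > [text file]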
Some websites refuse or block requests from curl or wget, but they give their page code freely to text-mode browsers.
For scripting purposes I use links -source.
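For a script that has to run on machines where links might not be installed, a fallback wrapper is one option. This is only a sketch: the function names first_available and page_source are made up for illustration, and the order of preference (text-mode browsers first, then curl and wget) is an assumption based on the observation above that some sites are friendlier to text-mode browsers.

```shell
# Print the first command name among the arguments that is installed.
# first_available and page_source are hypothetical helper names.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Write the source of the URL in $1 to stdout with whichever tool is found.
page_source() {
    tool=$(first_available links elinks links2 lynx curl wget) || {
        echo 'no supported tool found' >&2
        return 1
    }
    case $tool in
        # These four text-mode browsers all accept -source
        links|elinks|links2|lynx) "$tool" -source "$1" ;;
        curl) curl -s "$1" ;;
        # "-O -" tells wget to write the page to stdout
        wget) wget -q -O - "$1" ;;
    esac
}
```

Usage follows the same pattern as the one-liners: page_source https://validator.w3.org > pagedump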
Last update: 2024-09-27
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License