


7 ways to get the source code of a webpage

Web scraping seems like an IT field of its own, with a multitude of methods and a wide range of online and desktop tools aimed at getting data from websites.

Sometimes, though, all I want to do is capture a webpage as a raw, unformatted HTML file for my own scripted uses. Sure, I could do that job by going to the page in a browser like Firefox, pressing Ctrl+u to see the page source code, Ctrl+a to select it all, Ctrl+c to copy it all, Ctrl+v to paste it all into a text editor like Geany, then Ctrl+s to save the text file somewhere.

I think I'd prefer to do all that with a simple CLI command, like this:
 
[program name] [option] [URL] > [text file]

Any of the following will work:

   Text-mode browsers
      elinks -source [URL] > [text file]
      links -source [URL] > [text file]
      links2 -source [URL] > [text file]
      lynx -source [URL] > [text file]
      w3m [URL] -dump_source | zcat > [text file]
 
   Data transfer tools
      curl -s [URL] > [text file]
      wget -q -O [text file] [URL]
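
Since the four -source browsers above share the same option, a script can use whichever one happens to be installed. The following is a minimal sketch under that assumption; the function name first_available and the output file name pagedump are made up for illustration.

```shell
# Print the first command from the argument list that is installed.
# Returns non-zero if none of them can be found.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# Usage (needs network access):
#   browser=$(first_available links links2 elinks lynx) &&
#       "$browser" -source https://validator.w3.org > pagedump
```

w3m, curl and wget are left out of the candidate list because their command lines differ from the "-source [URL]" pattern.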

All 7 methods give the same result:

elinks -source https://validator.w3.org > elinksdump
links -source https://validator.w3.org > linksdump
links2 -source https://validator.w3.org > links2dump
lynx -source https://validator.w3.org > lynxdump
w3m https://validator.w3.org -dump_source | zcat > w3mdump
curl -s https://validator.w3.org > curldump
wget -q -O wgetdump https://validator.w3.org
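
One way to confirm that the dumps really are the same is a byte-for-byte comparison. This is a small sketch using cmp; the function name all_same is hypothetical, and the file names are the ones from the commands above.

```shell
# Succeed only if every file given is byte-identical to the first one.
# Reports the first file that differs on stderr.
all_same() {
    first=$1
    shift
    for f in "$@"; do
        cmp -s "$first" "$f" || { printf 'differs: %s\n' "$f" >&2; return 1; }
    done
}

# Usage, after running the seven commands above:
#   all_same elinksdump linksdump links2dump lynxdump w3mdump curldump wgetdump &&
#       echo "all identical"
```

Pages with content that changes on every request (timestamps, session tokens) may differ between dumps even when every tool is working correctly.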


Notes
 
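- w3m requests a compressed page and -dump_source outputs it still compressed, so I uncompress it with zcat (if a server sends the page uncompressed, plain zcat will complain that the input is not in gzip format)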
- The "-s" (for "silent") option for curl suppresses messages to the terminal
- The "-q" (for "quiet") option for wget likewise suppresses messages to the terminal
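
The zcat step in the w3m pipeline assumes the server actually sent gzip-compressed data. Where that is uncertain, a filter along these lines could stand in for zcat. It is a sketch only: maybe_gunzip is a made-up name, and it buffers the page in a temporary file so that it can test the data before deciding what to do.

```shell
# Read stdin; decompress it if it is valid gzip data, otherwise pass it through.
maybe_gunzip() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    if gzip -t < "$tmp" 2>/dev/null; then
        gzip -dc < "$tmp"      # valid gzip stream: uncompress it
    else
        cat "$tmp"             # anything else: output unchanged
    fi
    rm -f "$tmp"
}

# Usage (needs network access):
#   w3m https://validator.w3.org -dump_source | maybe_gunzip > w3mdump
```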
 
Some websites refuse requests from curl or wget, but give their page code freely to text-mode browsers.
 
For scripting purposes I use links -source.
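
As an illustration of that kind of scripted use, here is a hedged sketch that saves the source of every URL in a list. The function names, the FETCHER override variable and the file urls.txt are all assumptions made for the example; only "links -source" comes from the text above.

```shell
# Turn a URL into a safe file name, e.g.
# https://validator.w3.org becomes validator.w3.org.html
url_to_filename() {
    name=${1#*://}      # drop the scheme (https://)
    name=${name%/}      # drop one trailing slash
    printf '%s.html\n' "$(printf '%s' "$name" | LC_ALL=C tr -c 'A-Za-z0-9._-' '_')"
}

# Read URLs (one per line) from stdin and save each page's source in directory $1.
# FETCHER is a hypothetical override for the program used; it defaults to links.
fetch_all() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    while IFS= read -r url; do
        [ -n "$url" ] || continue      # skip blank lines
        # stdin comes from /dev/null so the fetcher cannot swallow the URL list
        "${FETCHER:-links}" -source "$url" \
            > "$outdir/$(url_to_filename "$url")" < /dev/null ||
            printf 'failed: %s\n' "$url" >&2
    done
}

# Usage (needs network access):
#   fetch_all pages < urls.txt
```

Two different URLs can map to the same file name under this scheme (for example, ones that differ only in punctuation), so a real script may need a more careful naming rule.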


Last update: 2024-09-27
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License

