Next: BitTorrent, Previous: Feeds, Up: Integration
A simple HTML web page can be downloaded very easily for sending and viewing it offline afterwards:
$ wget http://www.example.com/page.html
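For example, that single file can then be queued for a neighbour with nncp-file (remote.node being the node alias used throughout the examples below):

$ nncp-file page.html remote.node: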
But most web pages contain links to images, CSS and JavaScript files required for complete rendering. GNU Wget is able to parse such documents and understand page dependencies. You can download the whole page with its dependencies the following way:
$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html
That will create a www.example.com directory with all files necessary to view the page.html web page. You can create a single-file compressed tarball with that directory and send it to the remote node:
$ tar cf - www.example.com | zstd | nncp-file - remote.node:www.example.com-page.tar.zst
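On the receiving node, assuming the tarball has been delivered to that node's incoming directory, it can be unpacked with something like:

$ zstd -d < www.example.com-page.tar.zst | tar xf -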
But there are multi-page articles and whole sites of interest that you may want to get in a single package. You can mirror the whole web site by utilizing wget's recursive feature:
$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent \
    […] \
    http://www.example.com/
There is a standard for creating Web ARChives: WARC. Fortunately again, wget supports it as an output format.
$ wget \
    --warc-file www.example_com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-compression \
    --no-warc-keep-log \
    […] \
    http://www.example.com/
That command will create an uncompressed www.example_com-XXX.warc web archive. By default, WARCs are compressed using gzip, but in the example above we have disabled it to compress with the stronger and faster zstd before sending via nncp-file.
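A minimal sketch of that compression and sending step, reusing the archive name and the remote.node alias from the examples above, could look like this:

$ zstd --rm www.example_com-XXX.warc
$ nncp-file www.example_com-XXX.warc.zst remote.node: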
There is plenty of software acting as an HTTP proxy for your browser, allowing you to view those WARC files. However, you can also extract files from that archive using the warcat utility, producing the usual directory hierarchy:
$ python3 -m warcat extract \
    www.example_com-XXX.warc \
    --output-dir www.example.com-XXX \
    --progress
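As one illustration of the replay/proxy tools mentioned above (pywb is only an assumption here, it is not something this manual names), a WARC can be added to a collection and browsed through pywb's local web server:

$ pip install pywb
$ wb-manager init example-collection
$ wb-manager add example-collection www.example_com-XXX.warc
$ wayback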