# webdump HTML to plain-text converter

Last modification on 2025-04-25

webdump is (yet another) HTML to plain-text converter tool.
It reads HTML in UTF-8 from stdin and writes plain-text to stdout.
## Goals and scope

My main goal for this tool is to convert HTML mails and the HTML content in
RSS feeds to plain-text.

The tool only converts HTML and writes it to stdout, similar to links -dump or
lynx -dump, but simpler and more secure.

* HTML and XHTML will be supported.
* There will be some workarounds and quirks for broken and legacy HTML code.
* It will be usable and secure for reading HTML from mails and RSS/Atom feeds.
* No remote resources which are part of the HTML will be downloaded:
  images, video, audio, etc. But these may be visible as a link reference.
* Data will be written to stdout. Intended for plain-text or a text terminal.
* No support for JavaScript, CSS, frame rendering or form processing.
* No HTTP or network protocol handling: HTML data is read from stdin.
* Listings for references and some options to extract them in a list that is
  usable for scripting. Some references are: link anchors, images, audio, video,
  HTML (i)frames, etc.
* Security: on OpenBSD it uses pledge("stdio", NULL).
* Keep the code relatively small, simple and hackable.
## Features

* Support for word-wrapping.
* A mode to enable basic markup: bold, underline, italic and blink ;)
* Indentation of headers, paragraphs, pre and list items.
* Basic support to query elements or hide them.
* Show link references.
* Show link references and resources such as img, video, audio, subtitles.
* Export link references and resources to a TAB-separated format.
## Usage examples

    url='https://codemadness.org/sfeed.html'
    curl -s "$url" | webdump -r -b "$url" | less
    curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R
    curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R

Yes, all these option flags look ugly. A shell script wrapper could be used :)
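A wrapper along those lines might look like the sketch below. The function name `dump` is made up for illustration. The flags are the ones from the examples above, and the URL is reused as the -b argument. Paging is left to the caller.

```shell
# Hypothetical wrapper: "dump" is an illustrative name, not part of webdump.
# It fetches a URL and converts it with the flags from the examples above.
# Usage: dump URL
dump() {
	curl -s "$1" | webdump -8 -a -i -l -r -b "$1"
}
```

It could then be used as: dump 'https://codemadness.org/sfeed.html' | less -R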
## Practical examples

To use webdump as an HTML to text filter, for example in the mutt mail client,
change in ~/.mailcap:

    text/html; webdump -i -l -r < %s; needsterminal; copiousoutput

In mutt you should then add:

    auto_view text/html

Using webdump as an HTML to text filter for sfeed_curses (otherwise the default …

    SFEED_HTMLCONV="webdump -d -8 -r -i -l -a" sfeed_curses ~/.sfeed/feeds/*
## Query/selector examples

The query syntax using the -s option is a bit inspired by CSS (but much more li…

To get the title from an HTML page:

    url='https://codemadness.org/sfeed.html'
    title=$(curl -s "$url" | webdump -s 'title')
    printf '%s\n' "$title"
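The same query can be wrapped in a small loop to label several pages at once. This is a minimal sketch: the function name `titles` and the TAB-separated output layout are assumptions chosen for illustration, not something webdump provides.

```shell
# Hypothetical helper: print "title<TAB>URL" for each URL given as argument.
# Uses the same -s 'title' query as the example above.
titles() {
	for url in "$@"; do
		title=$(curl -s "$url" | webdump -s 'title')
		printf '%s\t%s\n' "$title" "$url"
	done
}
```

For example: titles 'https://codemadness.org/sfeed.html' 'https://codemadness.org/webdump.html'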
List audio and video-related content from an HTML page, redirect fd 3 to fd 1 (s…

    url="https://media.ccc.de/v/051_Recent_features_to_OpenBSD-ntpd_and_bgp…
    curl -s "$url" | webdump -x -s 'audio,video' -b "$url" 3>&1 >/dev/null …
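The order of the two redirections in that command matters. The sketch below shows the effect without webdump: `emit` is a made-up stand-in that writes one line to stdout and one line to fd 3, and the text it prints is invented rather than real webdump output.

```shell
# Stand-in for a program with two output streams (hypothetical, not webdump):
# normal output goes to stdout, a separate listing goes to fd 3.
emit() {
	printf 'page text\n'
	printf 'listing line\n' >&3
}

# 3>&1 first makes fd 3 a copy of the current stdout (pipe or terminal),
# then >/dev/null discards the normal output. Only the fd 3 data remains.
only_fd3() {
	emit 3>&1 >/dev/null
}

# Reversing the order sends both streams to /dev/null, because fd 3 is
# copied from stdout after stdout was already redirected.
nothing() {
	emit >/dev/null 3>&1
}
```

Running only_fd3 in a pipeline passes on just the listing, which is what makes the fd 3 data convenient for scripting.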
## Clone

    git clone git://git.codemadness.org/webdump
## Browse

You can browse the source code at:

* https://git.codemadness.org/webdump/
* gopher://codemadness.org/1/git/webdump
## Download releases

Releases are available at:

* https://codemadness.org/releases/webdump/
* gopher://codemadness.org/1/releases/webdump

## Build and install

    $ make
    # make install
## Dependencies

* C compiler.
* libc + some BSDisms.
## Trade-offs

All software has trade-offs.

webdump processes HTML in a single pass. It does not buffer the full DOM tree,
although due to the nature of HTML/XML some parts, like attributes, need to be
buffered.

Rendering tables in webdump is very limited. Twibright Links has really nice
table rendering. However, implementing a similar feature in the current design of
webdump would make the code much more complex. Twibright Links
processes a full DOM tree and processes the tables in multiple passes (to
measure the table cells) etc. Of course tables can be nested also, or HTML tab…
that are used for creating layouts (these are mostly older webpages).

These trade-offs and preferences are chosen for now. They may change in the
future. Fortunately there are the usual good suspects for HTML to plain-text
conversion, each with their own chosen trade-offs of course:

* Twibright Links: »http://links.twibright.com/«
* lynx: »https://lynx.invisible-island.net/«
* w3m: »https://w3m.sourceforge.net/«
* xmllint (part of libxml2): »https://gitlab.gnome.org/GNOME/libxml2/-/wikis/h…
* xmlstarlet: »https://xmlstar.sourceforge.net/«