| # webdump HTML to plain-text converter | |
| Last modification on 2025-04-25 | |
| webdump is (yet another) HTML to plain-text converter tool. | |
| It reads HTML in UTF-8 from stdin and writes plain-text to stdout. | |
| ## Goals and scope | |
| My main goal for this tool is to convert HTML mails and HTML content in RSS | |
| feeds to plain-text. | |
| The tool will only convert HTML to stdout, similarly to links -dump or lynx | |
| -dump but simpler and more secure. | |
| * HTML and XHTML will be supported. | |
| * There will be some workarounds and quirks for broken and legacy HTML code. | |
| * It will be usable and secure for reading HTML from mails and RSS/Atom feeds. | |
| * No remote resources which are part of the HTML will be downloaded: | |
| images, video, audio, etc. But these may be visible as a link reference. | |
| * Data will be written to stdout. Intended for plain-text or a text terminal. | |
| * No support for JavaScript, CSS, frame rendering or form processing. | |
| * No HTTP or network protocol handling: HTML data is read from stdin. | |
| * Listings for references and some options to extract them in a list that is | |
| usable for scripting. Some references are: link anchors, images, audio, video, | |
| HTML (i)frames, etc. | |
| * Security: on OpenBSD it uses pledge("stdio", NULL). | |
| * Keep the code relatively small, simple and hackable. | |
| ## Features | |
| * Support for word-wrapping. | |
| * A mode to enable basic markup: bold, underline, italic and blink ;) | |
| * Indentation of headers, paragraphs, pre and list items. | |
| * Basic support to query elements or hide them. | |
| * Show link references. | |
| * Show link references and resources such as img, video, audio, subtitles. | |
| * Export link references and resources to a TAB-separated format. | |
| ## Usage examples | |
| url='https://codemadness.org/sfeed.html' | |
| curl -s "$url" | webdump -r -b "$url" | less | |
| curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R | |
| curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R | |
| Yes, all these option flags look ugly; a shell script wrapper could be used :) | |
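Such a wrapper could look like the minimal sketch below. The function name webdump_url is hypothetical and not part of webdump; the flags are simply the ones used in the usage examples above, and the URL is passed as the base URL (-b) in the same way.

```shell
#!/bin/sh
# webdump_url is a hypothetical helper name, not part of webdump.
# It fetches a page with curl and renders it with the same flags as the
# usage examples, passing the URL itself as the base URL (-b).
#
# Usage: webdump_url 'https://codemadness.org/sfeed.html' | less -R
webdump_url() {
    curl -s "$1" | webdump -8 -a -i -l -r -b "$1"
}
```

The pager is left out of the function on purpose, so the output can also be piped into other tools.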
| ## Practical examples | |
| To use webdump as an HTML to text filter, for example in the mutt mail client, | |
| set the following in ~/.mailcap: | |
| text/html; webdump -i -l -r < %s; needsterminal; copiousoutput | |
| In the mutt configuration you should then add: | |
| auto_view text/html | |
| Using webdump as a HTML to text filter for sfeed_curses (otherwise the default … | |
| SFEED_HTMLCONV="webdump -d -8 -r -i -l -a" sfeed_curses ~/.sfeed/feeds/* | |
| ## Query/selector examples | |
| The query syntax using the -s option is a bit inspired by CSS (but much more li… | |
| To get the title from an HTML page: | |
| url='https://codemadness.org/sfeed.html' | |
| title=$(curl -s "$url" | webdump -s 'title') | |
| printf '%s\n' "$title" | |
| List audio and video-related content from a HTML page, redirect fd 3 to fd 1 (s… | |
| url="https://media.ccc.de/v/051_Recent_features_to_OpenBSD-ntpd_and_bgp… | |
| curl -s "$url" | webdump -x -s 'audio,video' -b "$url" 3>&1 >/dev/null … | |
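The redirection used in the example above (3>&1 >/dev/null) can be tried in isolation. The following is a minimal sketch using a stand-in function, emit, which is hypothetical and not webdump itself: it writes some text to stdout and one TAB-separated line to fd 3. The record layout shown is an assumption for illustration only.

```shell
#!/bin/sh
# Stand-in for a program that writes rendered text to stdout and a
# TAB-separated record to file descriptor 3 (hypothetical, not webdump).
emit() {
    printf 'rendered text\n'
    printf 'video\thttps://example.org/a.webm\n' >&3
}

# Redirections are applied left to right:
#   3>&1        points fd 3 at whatever stdout currently is (pipe or terminal)
#   >/dev/null  then discards the regular stdout
# so only the data written to fd 3 reaches the next command in a pipeline.
only_fd3() {
    emit 3>&1 >/dev/null
}
```

Swapping the order (>/dev/null 3>&1) would send both streams to /dev/null, because fd 3 would then copy the already redirected stdout.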
| ## Clone | |
| git clone git://git.codemadness.org/webdump | |
| ## Browse | |
| You can browse the source-code at: | |
| * https://git.codemadness.org/webdump/ | |
| * gopher://codemadness.org/1/git/webdump | |
| ## Download releases | |
| Releases are available at: | |
| * https://codemadness.org/releases/webdump/ | |
| * gopher://codemadness.org/1/releases/webdump | |
| ## Build and install | |
| $ make | |
| # make install | |
| ## Dependencies | |
| * C compiler. | |
| * libc + some BSDisms. | |
| ## Trade-offs | |
| All software has trade-offs. | |
| webdump processes HTML in a single pass. It does not buffer the full DOM tree, | |
| although due to the nature of HTML/XML some parts, such as attributes, need to | |
| be buffered. | |
| Rendering tables in webdump is very limited. Twibright Links has really nice | |
| table rendering. However implementing a similar feature in the current design of | |
| webdump would make the code much more complex. Twibright Links | |
| processes a full DOM tree and processes the tables in multiple passes (to | |
| measure the table cells) etc. Of course tables can be nested also, or HTML tab… | |
| that are used for creating layouts (these are mostly older webpages). | |
| These trade-offs and preferences are chosen for now. They may change in the | |
| future. Fortunately there are the usual good suspects for HTML to plain-text | |
| conversion, each with their own chosen trade-offs of course: | |
| * twibright links: »http://links.twibright.com/« | |
| * lynx: »https://lynx.invisible-island.net/« | |
| * w3m: »https://w3m.sourceforge.net/« | |
| * xmllint (part of libxml2): »https://gitlab.gnome.org/GNOME/libxml2/-/wikis/h… | |
| * xmlstarlet: »https://xmlstar.sourceforge.net/« |