README: expand README - webdump - HTML to plain-text converter for webpages
git clone git://git.codemadness.org/webdump
---
commit 1232b5b3d77c458704341ac436ff4230a3077007
parent bff9fbe51c0f5f5ac37a46deca1016bb56834dac
Author: Hiltjo Posthuma <[email protected]>
Date: Sun, 15 Oct 2023 13:47:16 +0200
README: expand README
Describe the scope and trade-offs a bit more clearly, because webdump is quite
limited.
Diffstat:
M README | 42 ++++++++++++++++++++++++++++-…
1 file changed, 39 insertions(+), 3 deletions(-)
---
diff --git a/README b/README
@@ -34,11 +34,17 @@ Example:
curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R
+Yes, all these option flags look ugly, a shellscript wrapper could be used :)
+
+
Goals / scope
-------------
-The tool will only render HTML to stdout, similarly to links -dump or
-lynx -dump but simpler and more secure.
+The main goal is to use it for converting HTML mails to plain-text and to
+convert HTML content in RSS feeds to plain-text.
+
+The tool will only convert HTML to stdout, similarly to links -dump or lynx
+-dump but simpler and more secure.
- HTML and XHTML will be supported.
- There will be some workarounds and quirks for broken and legacy HTML code.
@@ -46,8 +52,11 @@ lynx -dump but simpler and more secure.
- No remote resources which are part of the HTML will be downloaded:
images, video, audio, etc. But these may be visible as a link reference.
- Data will be written to stdout. Intended for plain-text or a text terminal.
-- No support for Javascript, CSS, frame rendering or forms.
+- No support for Javascript, CSS, frame rendering or form processing.
- No HTTP or network protocol handling: HTML data is read from stdin.
+- Listings for references and some options to extract them in a list that is
+ usable for scripting. Some references are: link anchors, images, audio, vide…
+ HTML (i)frames, etc.
Features
@@ -62,6 +71,33 @@ Features
- Export link references and resources to a TAB-separated format.
+Trade-offs
+----------
+
+All software has trade-offs.
+
+webdump processes HTML in a single pass. It does not buffer the full DOM tree,
+although due to the nature of HTML/XML some parts, like attributes, need to be
+buffered.
+
+Rendering tables in webdump is very limited. Twibright Links has really nice
+table rendering. Implementing a similar feature in the current design of
+webdump would, however, make the code much more complex. Twibright Links
+processes a full DOM tree and processes the tables in multiple passes (to
+measure the table cells) etc. Of course tables can also be nested, or be used
+for layout in (older) web pages.
+
+These trade-offs and preferences are chosen for now. They may change in the
+future. Fortunately there are the usual good suspects for HTML to plain-text
+conversion (each with their own chosen trade-offs of course):
+
+For example:
+
+- Twibright Links
+- lynx
+- w3m
+
+
Examples
--------
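
The commit notes that a shellscript wrapper could hide the long option list.
A minimal sketch of such a wrapper follows. The function name webdump_url is
hypothetical and not part of webdump; the flags are copied verbatim from the
README example above, and curl is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline from the README example.
# webdump_url is an illustrative name, not something webdump ships.
webdump_url() {
	url="$1"
	if [ -z "$url" ]; then
		echo "usage: webdump_url url" >&2
		return 1
	fi
	# webdump does no network handling itself: curl fetches the page and
	# the HTML is piped to webdump on stdin, with the same flags as the
	# README example.
	curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url"
}
```

With the function sourced into a shell, the README example could then be
shortened to: webdump_url "$url" | less -R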