README - webdump - HTML to plain-text converter for webpages
git clone git://git.codemadness.org/webdump

webdump
-------

HTML to plain-text converter tool.

It reads HTML in UTF-8 from stdin and writes plain-text to stdout.
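
Because the input is expected to be UTF-8, a page in another encoding
probably needs to be re-encoded first. A minimal sketch, assuming iconv(1)
is installed; the function name webdump_charset and its two arguments are
hypothetical and not part of webdump:

```shell
# Hypothetical helper, not part of webdump.
# $1: source charset name as understood by iconv (for example ISO-8859-1)
# $2: path to the HTML file
webdump_charset() {
	# Re-encode to UTF-8 first, then convert the HTML to plain-text.
	iconv -f "$1" -t UTF-8 < "$2" | webdump
}
```

It could then be used as: webdump_charset ISO-8859-1 page.html | less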


Build and install
-----------------

$ make
# make install


Dependencies
------------

- C compiler.
- libc + some BSDisms.


Usage
-----

Example:

url='https://codemadness.org/sfeed.html'

curl -s "$url" | webdump -r -b "$url" | less

curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R

curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | le…


Yes, all these option flags look ugly, a shellscript wrapper could be us…
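
Such a wrapper could look roughly like this. It is only a sketch: the
function name webdump_page is hypothetical, and the flags are copied
unchanged from the second usage example, with the URL passed both to curl
and to the -b option as in that example:

```shell
# Hypothetical wrapper, not shipped with webdump.
# $1: URL of the page to fetch and view
webdump_page() {
	if [ "$#" -ne 1 ]; then
		echo "usage: webdump_page url" >&2
		return 1
	fi
	# Fetch the page, convert it to plain-text and page the result.
	curl -s "$1" | webdump -8 -a -i -l -r -b "$1" | less -R
}
```

It could then be used as: webdump_page 'https://codemadness.org/sfeed.html'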


Goals / scope
-------------

The main goal is to use it for converting HTML mails to plain-text and to
convert HTML content in RSS feeds to plain-text.

The tool will only convert HTML to stdout, similarly to links -dump or l…
-dump but simpler and more secure.

- HTML and XHTML will be supported.
- There will be some workarounds and quirks for broken and legacy HTML c…
- It will be usable and secure for reading HTML from mails and RSS/Atom …
- No remote resources which are part of the HTML will be downloaded:
  images, video, audio, etc. But these may be visible as a link referenc…
- Data will be written to stdout. Intended for plain-text or a text term…
- No support for Javascript, CSS, frame rendering or form processing.
- No HTTP or network protocol handling: HTML data is read from stdin.
- Listings for references and some options to extract them in a list tha…
  usable for scripting. Some references are: link anchors, images, audio…
  HTML (i)frames, etc.


Features
--------

- Support for word-wrapping.
- A mode to enable basic markup: bold, underline, italic and blink ;)
- Indentation of headers, paragraphs, pre and list items.
- Basic support to query elements or hide them.
- Show link references.
- Show link references and resources such as img, video, audio, subtitle…
- Export link references and resources to a TAB-separated format.


Trade-offs
----------

All software has trade-offs.

webdump processes HTML in a single pass. It does not buffer the full DOM…
Although due to the nature of HTML/XML some parts like attributes need t…
buffered.

Rendering tables in webdump is very limited. Twibright Links has really …
table rendering. However, implementing a similar feature in the current
design of webdump would make the code much more complex. Twibright Links
processes a full DOM tree and processes the tables in multiple passes (to
measure the table cells) etc. Of course tables can be nested also, or i…
in (older web) pages that use HTML tables for layout.

These trade-offs and preferences are chosen for now. They may change in the
future. Fortunately there are the usual good suspects for HTML to plain…
conversion (each with their own chosen trade-offs of course).

For example:

- twibright links
- lynx
- w3m


Examples
--------

To use webdump as an HTML to text filter for example in the mutt mail cli…
change in ~/.mailcap:

text/html; webdump -i -l -r < %s; needsterminal; copiousoutput

In mutt you should then add:

auto_view text/html


License
-------

ISC, see LICENSE file.


Author
------

Hiltjo Posthuma <[email protected]>