git clone git://git.codemadness.org/webdump
webdump
-------

HTML to plain-text converter tool.

It reads HTML in UTF-8 from stdin and writes plain-text to stdout.


Build and install
-----------------

    $ make
    # make install


Dependencies
------------

- C compiler.
- libc + some BSDisms.


Usage
-----

Example:

    url='https://codemadness.org/sfeed.html'

    curl -s "$url" | webdump -r -b "$url" | less

    curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R

    curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | le…


Yes, all these option flags look ugly, a shellscript wrapper could be us…
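
A minimal sketch of such a wrapper, written as a shell function. The name
webdump_url is hypothetical and not part of webdump; the flags are copied
verbatim from the second example above, and curl is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with webdump): fetch a URL with curl
# and convert it using the same flags as the README example.
webdump_url() {
	url="${1:-}"
	if [ -z "$url" ]; then
		echo "usage: webdump_url url" >&2
		return 1
	fi
	# The URL is passed again with -b, exactly as in the examples above.
	curl -s "$url" | webdump -8 -a -i -l -r -b "$url"
}
```

It could then be used as: webdump_url 'https://codemadness.org/sfeed.html' | less -R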


Goals / scope
-------------

The main goal is to convert HTML mails and HTML content in RSS feeds to
plain-text.

The tool will only convert HTML to stdout, similarly to links -dump or l…
-dump but simpler and more secure.

- HTML and XHTML will be supported.
- There will be some workarounds and quirks for broken and legacy HTML c…
- It will be usable and secure for reading HTML from mails and RSS/Atom …
- No remote resources which are part of the HTML will be downloaded:
  images, video, audio, etc. But these may be visible as a link referenc…
- Data will be written to stdout. Intended for plain-text or a text term…
- No support for Javascript, CSS, frame rendering or form processing.
- No HTTP or network protocol handling: HTML data is read from stdin.
- Listings for references and some options to extract them in a list tha…
  usable for scripting. Some references are: link anchors, images, audio…
  HTML (i)frames, etc.


Features
--------

- Support for word-wrapping.
- A mode to enable basic markup: bold, underline, italic and blink ;)
- Indentation of headers, paragraphs, pre and list items.
- Basic support to query an element or to hide elements.
- Show link references.
- Show link references and resources such as img, video, audio, subtitle…
- Export link references and resources to a TAB-separated format.


Trade-offs
----------

All software has trade-offs.

webdump processes HTML in a single pass. It does not buffer the full DOM…
Although due to the nature of HTML/XML some parts like attributes need t…
buffered.

Rendering tables in webdump is very limited. Twibright Links has really …
table rendering. However, implementing a similar feature in the current
design of webdump would make the code much more complex. Twibright Links
processes a full DOM tree and processes the tables in multiple passes (to
measure the table cells) etc. Of course tables can be nested also, or i…
in (older web) pages that use HTML tables for layout.

These trade-offs and preferences are chosen for now. They may change in the
future. Fortunately there are the usual good suspects for HTML to plain…
conversion (each with their own chosen trade-offs of course).

For example:

- Twibright Links
- lynx
- w3m


Examples
--------

To use webdump as an HTML to text filter for example in the mutt mail cli…
change in ~/.mailcap:

    text/html; webdump -i -l -r < %s; needsterminal; copiousoutput

In mutt you should then add:

    auto_view text/html


License
-------

ISC, see LICENSE file.


Author
------

Hiltjo Posthuma <[email protected]>