TH DOC2TXT 1
SH NAME
doc2txt, olefs, mswordstrings \- extract printable strings from Microsoft Word documents
SH SYNOPSIS
B doc2txt
[
I file.doc
]
br
B aux/olefs
[
B -m
I mtpt
]
I file.doc
br
B aux/mswordstrings
I /mnt/doc/WordDocument
SH DESCRIPTION
I Doc2txt
is a shell script that uses
I olefs
and
I mswordstrings
to extract the printable text from the body of a Microsoft Word document.
PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, whose layout resembles a scaled-down version of Microsoft's FAT file system.
I Olefs
presents the contents of an Office document as a file system
on
IR mtpt ,
which defaults to
BR /mnt/doc .
I Mswordstrings
parses the
I WordDocument
file inside an Office document, extracting
the text stream.
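PP
The FAT-like layout described above can be illustrated with a short Python sketch
that decodes a few fields of the 512-byte header at the start of an OLE file.
It is not taken from
IR olefs ;
the helper name
I parse_ole_header
is hypothetical, and the field offsets are assumed from the published
compound file header layout rather than from this manual.

```python
import struct

# Signature found in the first eight bytes of an OLE compound document.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def parse_ole_header(data: bytes) -> dict:
    """Hypothetical helper: decode a few fields of an OLE header.

    Offsets are assumptions based on the published header layout:
    sector shifts at byte 30, FAT sector count at byte 44,
    first directory sector at byte 48 (all little-endian).
    """
    if len(data) < 512:
        raise ValueError("OLE header is 512 bytes; input is too short")
    if data[:8] != OLE_MAGIC:
        raise ValueError("not an OLE compound document")

    # Sector sizes are stored as powers of two.
    sector_shift, mini_shift = struct.unpack_from("<HH", data, 30)
    # Number of FAT sectors, then the sector where the directory starts.
    fat_sectors, first_dir = struct.unpack_from("<II", data, 44)

    return {
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_dir_sector": first_dir,
    }
```

The FAT then chains sectors into named streams such as
IR WordDocument ,
much as a FAT file system chains clusters into files.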
SH SOURCE
B /sys/src/cmd/aux/mswordstrings.c
br
B /sys/src/cmd/aux/olefs.c
br
B /rc/bin/doc2txt
SH SEE ALSO
IR strings (1)
br
``Microsoft Word 97 Binary File Format'',
available online at Microsoft's developer home page.
br
``LAOLA Binary Structures'',
IR snake.cs.tu-berlin.de:8081/~schwartz/pmh .