Path: usenet.cise.ufl.edu!newsfeeds.nerdc.ufl.edu!news.magicnet.net!news.maxwell.syr.edu!newsfeed.corridex.com!nntp2.savvis.net!inetarena.com!not-for-mail
From: [email protected] (Jari Aalto+mail.perl)
Newsgroups: comp.lang.perl.announce,comp.lang.perl.modules
Subject: ANNOUNCE: v1998.1204 Squeeze.pm -- Shorten text to pagers and GSM phones
Followup-To: comp.lang.perl.modules
Date: 27 Dec 1998 17:03:28 GMT
Organization: University of Tampere
Lines: 249
Approved: [email protected] (comp.lang.perl.announce)
Message-ID: <[email protected]>
NNTP-Posting-Host: halfdome.holdit.com
X-Disclaimer: The "Approved" header verifies header information for article transmission and does not imply approval of content.
Xref: usenet.cise.ufl.edu comp.lang.perl.announce:203 comp.lang.perl.modules:7192
What's New: Variable SQZ_OPTIMIZE_LEVEL
Title
ANNOUNCE: v1998.1204 Squeeze.pm -- Shorten text to minimum syllables
The version number is based on date format YYYY.MMDD
Download
Home page:
(e.g. ftp://ftp.funet.fi/pub/languages/perl/CPAN/)
CPAN//modules/by-module/Lingua/
Pointers to Perl language interpreters (Win32/Unix etc.)
Perl:
http://language.perl.com/info/software.html
Description
A module that I use to compress text from email before it is
sent to my cellular phone. If you have a pager, you know how
tight the space is, and every character saved is a plus.
A shortened POD page follows. The module's interface functions
and interface variables are not included in this announcement.
I would welcome more text compression rules, so feel free to
suggest more hash entries like:
WORD => CONVERSION
MULTI WORD => CONVERSION
NAME
Squeeze.pm - Shorten text to minimum syllables by using hash and vowel
deletion
REVISION
$Id: Squeeze.pm,v 1.24 1998/10/08 14:58:15 jaalto Exp $
SYNOPSIS
    use Squeeze;                # import only the function
    use Squeeze qw( :ALL );     # import all functions and variables
    use English;

    while (<>)
    {
        print SqueezeText $ARG;
    }
DESCRIPTION
Squeeze English text to the most compact format possible, so that it is
barely readable. You should convert all text to lowercase for maximum
compression, because the optimizations have been designed mostly for
uncapitalised letters.
`Warning: Each line is processed multiple times, so prepare for slow
conversion time'
You can use this module e.g. to preprocess text before it is sent to
electronic media that has some maximum text size limit. For example
pagers have an arbitrary text size limit, typically 200 characters,
which you want to fill as much as possible. Alternatively you may have
GSM cellular phone which is capable of receiving Short Messages (SMS),
whose message size limit is 160 characters. For demonstration of this
module's SqueezeText() function , the description text of this paragraph
has been converted below. See yourself if it's readable (Yes, it takes
some time to get used to). The compress ratio is typically 30-40%
u _n use thi mod e.g. to prprce txt bfre i_s snt to
elrnic mda has som max txt siz lim. f_xmple pag
hv abitry txt siz lim, tpcly 200 chr, W/ u wnt
to fll as mch as psbleAlternatvly u may hv GSM cllar P8
w_s cpble of rcivng Short msg (SMS), WS/ msg siz
lim is 160 chr. 4 demonstrton of thi mods SquezText
fnc , dsc txt of thi prgra has ben cnvd_ blow
See uself if i_s redble (Yes, it tak som T to get usdto
compr rat is tpcly 30-40
And if $SQZ_OPTIMIZE_LEVEL is set to non-zero:
u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
elrnicMdaHasSomMaxTxtSizLim.F_xmplePag
hvAbitryTxtSizLim,Tpcly200Chr,W/UWnt
toFllAsMchAsPsbleAlternatvlyUMayHvGSMCllarP8
w_sCpbleOfRcivngShortMsg(SMS),WS/MsgSiz
limIs160Chr.4DemonstrtonOfThiModsSquezText
fnc,DscTxtOfThiPrgraHasBenCnvd_Blow
SeeUselfIfI_sRedble(Yes,ItTakSomTToGetUsdto
comprRatIsTpcly30-40
A comparison of the two shows:
Original text : 627 characters
Level 0       : 433 characters, reduction 31 %
Level 1       : 345 characters, reduction 45 % (+14 percentage points)
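Judging from the two samples, the level 1 text looks as if it could be derived from the level 0 text by dropping each run of blanks and capitalising the character that follows. The sketch below reproduces that effect on the first level 0 line and recomputes the reduction figures from the character counts. The helper names `join_words` and `reduction` are hypothetical, and this is an illustration under that assumption, not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: drop each run of blanks and uppercase the
# character that follows it (assumed to mimic the level 1 output).
sub join_words {
    my ($text) = @_;
    $text =~ s/[ \t]+(\S)/\u$1/g;
    return $text;
}

# Hypothetical helper: percentage of characters saved, rounded to
# the nearest whole number.
sub reduction {
    my ($original, $squeezed) = @_;
    return sprintf '%.0f', 100 * ($original - $squeezed) / $original;
}

my $level0 = 'u _n use thi mod e.g. to prprce txt bfre i_s snt to';

print join_words($level0), "\n";    # u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
printf "Level 0: %s %%\n", reduction(627, 433);    # Level 0: 31 %
printf "Level 1: %s %%\n", reduction(627, 345);    # Level 1: 45 %
```

Most of the extra saving at level 1 comes from the removed blanks, and the capital letters keep the word boundaries visible.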
There are a few grammar rules which are used to shorten some English
tokens considerably:
Word that has _ is usually a verb
Word that has / is usually a substantive, noun,
pronoun or other non-verb
For example, the tokens below must be understood before the text can be read.
This is not yet like Geek code, because you don't need an external parser
to understand it, just some common sense and time to adapt
yourself to this text. *For a complete, up-to-date list, you have to peek
at the source code*
automatically => 'acly_'
for => 4
for him => 4h
for her => 4h
for them => 4t
for those => 4t
can => _n
does => _s
it is => i_s
that is => t_s
which is => w_s
that are => t_r
which are => w_r
less => -/
more => +/
most => ++
however => h/ver
think => thk_
useful => usful
you => u
your => u/
you'd => u/d
you'll => u/l
they => t/
their => t/r
will => /w
would => /d
with => w/
without => w/o
which => W/
whose => WS/
Time is expressed with capital letters
time => T
minute => MIN
second => SEC
hour => HH
day => DD
month => MM
year => YY
Other capital-letter acronyms
phone => P8
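A table-driven conversion of this kind can be sketched in a few lines. The example below is a minimal illustration only: `%multi`, `%word` and `squeeze_sketch` are hypothetical names, the entries are copied from the list above, and the real module does considerably more (for example vowel deletion). It assumes that multi-word entries must be applied before single words, so that "for him" becomes "4h" rather than "4 him".

```perl
use strict;
use warnings;

# Hypothetical tables holding a few entries from the list above.
my %multi = (
    'for him'  => '4h',
    'it is'    => 'i_s',
    'which is' => 'w_s',
);
my %word = (
    'for'   => '4',
    'can'   => '_n',
    'you'   => 'u',
    'with'  => 'w/',
    'time'  => 'T',
    'phone' => 'P8',
);

# Hypothetical helper: lowercase the text, replace multi-word phrases
# (longest first), then look up each remaining word in the word table.
sub squeeze_sketch {
    my ($text) = @_;
    $text = lc $text;
    for my $phrase (sort { length($b) <=> length($a) } keys %multi) {
        $text =~ s/\b\Q$phrase\E\b/$multi{$phrase}/g;
    }
    return join ' ',
        map { exists $word{$_} ? $word{$_} : $_ } split ' ', $text;
}

print squeeze_sketch('It is time for you'), "\n";       # i_s T 4 u
print squeeze_sketch('You can phone with this'), "\n";  # u _n P8 w/ this
```

Lowercasing first means that a plain hash lookup matches "You" as well as "you".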
EXAMPLES
To add new words, e.g. to the word conversion hash table, define your
custom set and merge it with the existing ones. Do the same for
`%SQZ_WXLATE_MULTI_HASH' and `$SQZ_ZAP_REGEXP', and then start using the
conversion function.
    use English;
    use Squeeze qw( :ALL );

    # Keys containing a hyphen must be quoted; `=>' only auto-quotes
    # plain identifiers.
    my %myExtraWordHash =
    (
          'new-word1' => 'conversion1'
        , 'new-word2' => 'conversion2'
        , 'new-word3' => 'conversion3'
        , 'new-word4' => 'conversion4'
    );

    # First take the existing tables and merge them with my
    # translation table
    my %myCustomWordHash =
    (
          %SQZ_WXLATE_HASH
        , %SQZ_WXLATE_EXTRA_HASH
        , %myExtraWordHash
    );

    my $myXlat = 0;                             # state flag

    while (<>)
    {
        if ( $condition )                       # placeholder: your own test
        {
            SqueezeHashSet \%myCustomWordHash;  # Use MY conversions
            $myXlat = 1;
        }

        if ( $myXlat and $condition )           # placeholder: test for switching back
        {
            SqueezeHashSet "reset";             # Back to default table
            $myXlat = 0;
        }

        print SqueezeText $ARG;
    }
Similarly, you can redefine the multi-word translation table by supplying
another hash reference in the call to SqueezeHashSet(). To kill more
text immediately, in addition to the defaults, just concatenate your regexps to
*$SQZ_ZAP_REGEXP*
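Concatenating an extra alternative to a zap regexp might look like the sketch below. The variable `$zap_regexp` is a hypothetical stand-in for `$SQZ_ZAP_REGEXP`, its default value here is invented for illustration, and the `zap` helper only guesses at how such a pattern could be applied. The announcement shows neither the module's real default nor its internals.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $SQZ_ZAP_REGEXP; this default is invented.
my $zap_regexp = '\b(?:a|an|the)\b';

# Concatenate one more alternative to kill, in addition to the default.
$zap_regexp .= '|\bplease\b';

# Hypothetical helper: delete everything the pattern matches, then
# collapse the leftover blanks.
sub zap {
    my ($text) = @_;
    $text =~ s/$zap_regexp//g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

print zap('the phone is a pager please'), "\n";    # phone is pager
```

With the module loaded, the equivalent step would presumably be appending `|` plus your own pattern to `$SQZ_ZAP_REGEXP` before calling the conversion function.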
KNOWN BUGS
There may be a lot of false conversions. If you think that some word
squeezing went too far, please turn on debugging and send the log to the
maintainer. To see how the conversion goes, e.g. for the word *Messages*:
    use English;
    use Lingua::EN::Squeeze;

    SqueezeDebug( 1, '(?i)Messages' );
    $ARG = "This line has some Messages in it";
    print SqueezeText $ARG;
AVAILABILITY
The author can be reached at
[email protected]. The home page, via a forwarding
service, is at
http://www.netforward.com/poboxes/?jari.aalto or,
alternatively, the absolute URL is
ftp://cs.uta.fi/pub/ssjaaa/ but this
may move without notice. Prefer keeping the forwarding service link in
your bookmarks.
The latest version of this module can be found at
$CPAN/modules/by-module/Lingua/
AUTHOR
Copyright (C) 1998-1999 Jari Aalto. All rights reserved. This program is
free software; you can redistribute it and/or modify it under the same
terms as Perl itself or under the terms of the GNU General Public Licence v2 or
later.