Path: usenet.cise.ufl.edu!newsfeeds.nerdc.ufl.edu!news.magicnet.net!news.maxwell.syr.edu!newsfeed.corridex.com!nntp2.savvis.net!inetarena.com!not-for-mail
From: [email protected] (Jari Aalto+mail.perl)
Newsgroups: comp.lang.perl.announce,comp.lang.perl.modules
Subject: ANNOUNCE: v1998.1204 Squeeze.pm -- Shorten text to pagers and GSM phones
Followup-To: comp.lang.perl.modules
Date: 27 Dec 1998 17:03:28 GMT
Organization: University of Tampere
Lines: 249
Approved: [email protected] (comp.lang.perl.announce)
Message-ID: <[email protected]>
NNTP-Posting-Host: halfdome.holdit.com
X-Disclaimer: The "Approved" header verifies header information for article transmission and does not imply approval of content.
Xref: usenet.cise.ufl.edu comp.lang.perl.announce:203 comp.lang.perl.modules:7192


What's New: Variable SQZ_OPTIMIZE_LEVEL


Title

       ANNOUNCE: v1998.1204 Squeeze.pm -- Shorten text to minimum syllables
       The version number is based on date format YYYY.MMDD

Download

       Home page:

           (eg. ftp://ftp.funet.fi/pub/languages/perl/CPAN/)
           CPAN//modules/by-module/Lingua/

       Pointers to Perl language interpreters (Win32, Unix etc.):
       Perl: http://language.perl.com/info/software.html

Description

       A module that I use to compress text from email before it is
       sent to my cellular phone. If you have a pager, you know how
       tight the space is, and every character saved is a plus.

       A shortened POD page follows. The module's interface functions
       and interface variables are not included in this announcement.

       I would welcome more text compression rules, so feel free to
       suggest more hash entries like:

               WORD       => CONVERSION
               MULTI WORD => CONVERSION

NAME
   Squeeze.pm - Shorten text to minimum syllables by using hash and vowel
   deletion

REVISION
   $Id: Squeeze.pm,v 1.24 1998/10/08 14:58:15 jaalto Exp $

SYNOPSIS
       use Squeeze;            # import only the function
       use Squeeze qw( :ALL ); # import all functions and variables
       use English;

       while (<>)
       {
           print SqueezeText $ARG;
       }


DESCRIPTION
   Squeeze English text to the most compact format possible, so that it
   is barely readable. You should convert all text to lowercase for
   maximum compression, because the optimizations have been designed
   mostly for uncapitalised letters.

       `Warning: Each line is processed multiple times, so prepare for slow
       conversion time'

   You can use this module e.g. to preprocess text before it is sent to
   electronic media that has some maximum text size limit. For example
   pagers have an arbitrary text size limit, typically 200 characters,
   which you want to fill as much as possible. Alternatively you may have
   GSM cellular phone which is capable of receiving Short Messages (SMS),
   whose message size limit is 160 characters. For demonstration of this
   module's SqueezeText() function , the description text of this paragraph
   has been converted below. See yourself if it's readable (Yes, it takes
   some time to get used to). The compress ratio is typically 30-40%

       u _n use thi mod e.g. to prprce txt bfre i_s snt to
       elrnic mda has som max txt siz lim. f_xmple pag
       hv  abitry txt siz lim, tpcly 200 chr, W/ u wnt
       to fll as mch as psbleAlternatvly u may hv GSM cllar P8
       w_s cpble of rcivng Short msg (SMS), WS/ msg siz
       lim is 160 chr. 4 demonstrton of thi mods SquezText
       fnc ,  dsc txt of thi prgra has ben cnvd_ blow
       See uself if i_s redble (Yes, it tak som T to get usdto
       compr rat is tpcly 30-40

   And if $SQZ_OPTIMIZE_LEVEL is set to a non-zero value:

       u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
       elrnicMdaHasSomMaxTxtSizLim.F_xmplePag
       hvAbitryTxtSizLim,Tpcly200Chr,W/UWnt
       toFllAsMchAsPsbleAlternatvlyUMayHvGSMCllarP8
       w_sCpbleOfRcivngShortMsg(SMS),WS/MsgSiz
       limIs160Chr.4DemonstrtonOfThiModsSquezText
       fnc,DscTxtOfThiPrgraHasBenCnvd_Blow
       SeeUselfIfI_sRedble(Yes,ItTakSomTToGetUsdto
       comprRatIsTpcly30-40
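
   Comparing the two outputs line by line suggests that the non-zero
   level drops the whitespace between words and capitalises the first
   character of the following word. The snippet below is only a sketch
   of that observation; the function name join_words_sketch is
   hypothetical and this is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: remove whitespace between words and uppercase
# the first character of the word that followed it.
sub join_words_sketch {
    my ($line) = @_;
    # \u uppercases the next character; non-letters such as "_" stay as-is
    $line =~ s/\s+(\S)/\u$1/g;
    return $line;
}

# First line of the level 0 demonstration text
print join_words_sketch("u _n use thi mod e.g. to prprce txt bfre i_s snt to"), "\n";
# prints: u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
```

   Capital letters take over the job of the spaces as word separators,
   which may be one reason why lowercase input is recommended.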

   A comparison of the two squeezed versions with the original shows:

       Original text   : 627 characters
       Level 0         : 433 characters    reduction 31 %
       Level 1         : 345 characters    reduction 45 %  (+14 improvement)
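
   The reduction figures follow directly from the character counts.
   As a quick check (the helper name reduction_percent is made up for
   this illustration):

```perl
use strict;
use warnings;

# Percentage of characters saved relative to the original length
sub reduction_percent {
    my ($original, $squeezed) = @_;
    return 100 * (1 - $squeezed / $original);
}

printf "Level 0 : reduction %.0f %%\n", reduction_percent(627, 433);
printf "Level 1 : reduction %.0f %%\n", reduction_percent(627, 345);
# prints:
# Level 0 : reduction 31 %
# Level 1 : reduction 45 %
```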

   A few grammar rules are used to shorten some English tokens
   considerably:

       A word that has _ is usually a verb

       A word that has / is usually a substantive, noun,
                       pronoun or other non-verb

   For example, the tokens below must be understood before the text can
   be read. This is not yet like Geek code, because you don't need an
   external parser to understand it, just some common sense and time to
   adapt yourself to the text. *For a complete, up-to-date list, you
   have to peek at the source code.*

       automatically => 'acly_'

       for           => 4
       for him       => 4h
       for her       => 4h
       for them      => 4t
       for those     => 4t

       can           => _n
       does          => _s

       it is         => i_s
       that is       => t_s
       which is      => w_s
       that are      => t_r
       which are     => w_r

       less          => -/
       more          => +/
       most          => ++

       however       => h/ver
       think         => thk_

       useful        => usful

       you           => u
       your          => u/
       you'd         => u/d
       you'll        => u/l
       they          => t/
       their         => t/r

       will          => /w
       would         => /d
       with          => w/
       without       => w/o
       which         => W/
       whose         => WS/

   Time is expressed with capital letters

       time          => T
       minute        => MIN
       second        => SEC
       hour          => HH
       day           => DD
       month         => MM
       year          => YY

   Other capital-letter acronyms

       phone         => P8
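
   To make the table-driven idea concrete, here is a minimal,
   self-contained sketch of how such conversion tables could be
   applied. It does not use the module and is not its implementation:
   the names %word_hash, %multi_hash and squeeze_sketch are
   hypothetical, the entries are a small subset copied from the lists
   above, and vowel deletion is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical single-word table (subset of the list above)
my %word_hash = (
    'can'   => '_n',
    'you'   => 'u',
    'with'  => 'w/',
    'phone' => 'P8',
    'time'  => 'T',
);

# Hypothetical multi-word table (subset of the list above)
my %multi_hash = (
    'it is'    => 'i_s',
    'which is' => 'w_s',
    'for them' => '4t',
);

sub squeeze_sketch {
    my ($text) = @_;

    # Replace multi-word phrases first, longest phrase first, so that
    # their parts are not converted one word at a time.
    for my $phrase (sort { length($b) <=> length($a) } keys %multi_hash) {
        $text =~ s/\b\Q$phrase\E\b/$multi_hash{$phrase}/g;
    }

    # Then look up each remaining lowercase word in the single-word table
    $text =~ s/\b([a-z']+)\b/exists $word_hash{$1} ? $word_hash{$1} : $1/ge;

    return $text;
}

print squeeze_sketch("you can see it is time"), "\n";
# prints: u _n see i_s T
```

   Handling the multi-word table before the single-word table is a
   design choice of this sketch; the order used by the module itself
   should be checked in its source.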

EXAMPLES
   To add new words to, for example, the word conversion hash table,
   define your custom set and merge it with the existing ones. Do the
   same for `%SQZ_WXLATE_MULTI_HASH' and `$SQZ_ZAP_REGEXP', and then
   start using the conversion function.

       use English;
       use Squeeze qw( :ALL );

       #   Keys containing a hyphen must be quoted

       my %myExtraWordHash =
       (
             'new-word1'  => 'conversion1'
           , 'new-word2'  => 'conversion2'
           , 'new-word3'  => 'conversion3'
           , 'new-word4'  => 'conversion4'
       );

       #   First take the existing tables and merge them with my
       #   translation table

       my %myCustomWordHash =
       (
             %SQZ_WXLATE_HASH
           , %SQZ_WXLATE_EXTRA_HASH
           , %myExtraWordHash
       );

       my $myXlat = 0;                             # state flag

       while (<>)
       {
           if ( $condition )
           {
               SqueezeHashSet \%myCustomWordHash;  # Use MY conversions
               $myXlat = 1;
           }

           if ( $myXlat and $condition )
           {
               SqueezeHashSet "reset";             # Back to default table
               $myXlat = 0;
           }

           print SqueezeText $ARG;
       }

   Similarly, you can redefine the multi-word translation table by
   supplying another hash reference in the call to SqueezeHashSet(). To
   kill more text immediately in addition to the default, just
   concatenate your regexps to *$SQZ_ZAP_REGEXP*
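
   As a rough illustration of what concatenating a regexp means here,
   the sketch below uses a local variable in place of the module's
   variable. The default pattern shown, the added alternative and the
   function zap_sketch are all assumptions made for the example; the
   real default value and how the module applies it are in the source
   code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's default zap pattern
my $zap_regexp = '\b(?:a|an|the)\b';

# Append one more alternative with "|" so both patterns are deleted
$zap_regexp .= '|\b(?:hello|cheers)\b';

sub zap_sketch {
    my ($text) = @_;
    $text =~ s/$zap_regexp//gi;   # delete every match outright
    $text =~ s/\s{2,}/ /g;        # collapse the gaps left behind
    $text =~ s/^\s+|\s+$//g;      # trim both ends
    return $text;
}

print zap_sketch("hello send the report"), "\n";
# prints: send report
```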

KNOWN BUGS
   There may be a lot of false conversions. If you think that some word
   squeezing went too far, please turn on debugging and send the log to
   the maintainer. To see how the conversion goes, e.g. for the word
   *Messages*:

       use English;
       use Lingua::EN::Squeeze;

       SqueezeDebug( 1, '(?i)Messages' );

       $ARG = "This line has some Messages in it";
       print SqueezeText $ARG;


AVAILABILITY
   The author can be reached at [email protected]. The home page via
   the forwarding service is at
   http://www.netforward.com/poboxes/?jari.aalto or alternatively the
   absolute URL is ftp://cs.uta.fi/pub/ssjaaa/ but this may move
   without notice. Prefer keeping the forwarding service link in your
   bookmarks.

   The latest version of this module can be found at
   $CPAN/modules/by-module/Lingua/

AUTHOR
   Copyright (C) 1998-1999 Jari Aalto. All rights reserved. This program is
   free software; you can redistribute it and/or modify it under the same
   terms as Perl itself or under the terms of the GNU General Public
   License v2 or later.