Article 11356 of comp.lang.perl:
Path: feenix.metronet.com!news.utdallas.edu!convex!cs.utexas.edu!howland.reston.ans.net!math.ohio-state.edu!jussieu.fr!univ-lyon1.fr!swidir.switch.ch!scsing.switch.ch!news.dfn.de!news.coli.uni-sb.de!sbusol.rz.uni-sb.de!mpi-sb.mpg.de!uwe
From: [email protected] (Uwe Waldmann)
Newsgroups: comp.lang.perl
Subject: Re: Redefining \w and \b possible?
Date: 9 Mar 1994 18:36:38 GMT
Organization: Max-Planck-Institut fuer Informatik
Lines: 27
Distribution: world
Message-ID: <[email protected]>
References: <[email protected]>
Reply-To: [email protected]
NNTP-Posting-Host: mpii02005.ag2.mpi-sb.mpg.de
Originator: uwe@mpii02005
In article <[email protected]>, Stein Kulseth <[email protected]> wrote:
> Here in Norway we are blessed/cursed with three extra vowels.
> When doing pattern matching on Norwegian text it would be very
> nice to have \b and \w accept these as letters. Is this possible?
No, as far as I know (unless Larry has changed it in the meantime).
> If not, how can I write a search pattern that will match Norwegian
> word boundaries at either end and anywhere within a string?
# (a) Put a \000 before and after every word:
s/([A-Za-z0-9_\305\306\330\345\346\370]+)/\000$1\000/g;
# (b) Check for \000 instead of \b.
# For example, s/\b([A-Z])\b/"$1"/g becomes:
s/\000([A-Z\305\306\330])\000/"\000$1\000"/g;
# (c) Don't forget to remove all \000's after you are done:
s/\000//g;
If you run several substitutions in a row, check after each one
that every word boundary is still marked by a \000. It may even be
necessary to repeat steps (c) and (a) in between to readjust the markers.
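
For reference, here is a minimal sketch of steps (a) to (c) as one
runnable script. The sub names (mark_words, unmark_words, remark_words,
quote_single_capitals) and the sample sentence are made up for
illustration; only the three substitutions come from the recipe above.
It assumes the text is Latin-1 bytes and contains no \000 of its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Word characters as Latin-1 octal codes, as in step (a):
# \305 \306 \330 are the uppercase Norwegian vowels,
# \345 \346 \370 the lowercase ones.
my $WORD = 'A-Za-z0-9_\305\306\330\345\346\370';

# Step (a): bracket every run of word characters with \000 markers.
sub mark_words {
    my ($text) = @_;
    $text =~ s/([$WORD]+)/\000$1\000/g;
    return $text;
}

# Step (c): remove all markers again.
sub unmark_words {
    my ($text) = @_;
    $text =~ s/\000//g;
    return $text;
}

# Steps (c)+(a) in one go, for use between substitutions that add or
# join words and so leave the markers out of date.
sub remark_words {
    my ($text) = @_;
    return mark_words(unmark_words($text));
}

# Steps (a)-(c) around the example from step (b): put double quotes
# around every word that is a single capital letter.
sub quote_single_capitals {
    my ($text) = @_;
    $text = mark_words($text);
    $text =~ s/\000([A-Z\305\306\330])\000/"\000$1\000"/g;
    return unmark_words($text);
}

# Hypothetical sample: A-ring, "og", "B", then lowercase words that
# contain a-ring (\345) and o-slash (\370).
my $input  = "\305 og B, men ikke p\345 \370y";
my $output = quote_single_capitals($input);

# Prints raw Latin-1 bytes; how they display depends on the terminal.
print "$output\n";
```

Only the single-letter words get quoted here; the lowercase words
containing \345 and \370 stay untouched because they are marked as
whole words. Call remark_words between two substitutions whenever the
first one may have added, split or joined words.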
--
Uwe Waldmann, Max-Planck-Institut fuer Informatik
Im Stadtwald, D-66123 Saarbruecken, Germany
Phone: +49 681 302-5431, Fax: +49 681 302-5401, E-Mail: [email protected]