Article 11449 of comp.lang.perl:

Path: feenix.metronet.com!news.ecn.bgu.edu!usenet.ins.cwru.edu!howland.reston.ans.net!pipex!sunic!trane.uninett.no!nntp.uio.no!hbf
From: [email protected]
Newsgroups: comp.lang.perl
Subject: Re: Redefining \w and \b possible?
Date: 11 Mar 1994 20:24:18 GMT
Organization: University of Oslo, Norway
Lines: 86
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]>
NNTP-Posting-Host: durin.uio.no
In-reply-to: [email protected]'s message of Thu, 10 Mar 1994 01:44:44 GMT

In article <[email protected]> [email protected] (Larry Wall) writes:

> In Perl 5 you'll just have to do &POSIX::setlocale. In Perl 4 you'd
> have to sneak a setlocale into main() somewhere. But \b and \w are
> defined in terms of isalpha and isdigit, so it oughta work.

Setlocale looks nice for some applications. Problem is, the foreigner
who wrote our locales didn't agree that the Norwegian characters can be
represented as 7-bit "[\]{|}". I can't find any documentation on how
the user can define such a locale, and I suspect there is no reasonably
portable way.

The solution seems to be to add user-defined character classes and
translation tables to the Perl 6 wish list...

In article <[email protected]> [email protected] (Stein Kulseth) writes:

> If not, how can I write a search pattern that will match Norwegian
> word boundaries at either end and anywhere within a string?

Sorry. Rewrite your code so you don't need the delimiters. Prepend and
append a blank to your strings before matching, or split out the words
and call functions on them.
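
A minimal sketch of the blank-padding trick, written for Perl 5 with
strict and warnings. The sub name has_word and the letter class in
$letters are made up for illustration and are not from the original
post. The point is only that a negated character class can replace \b
once both ends of the string are guaranteed to have a non-letter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical letter class: ASCII letters plus the 7-bit Norwegian
# letters [ \ ] and { | }, written with octal range ends so that a
# literal ] never closes the bracket expression early.
my $letters = 'A-\135a-\175';

# Return 1 if $word occurs in $text as a whole word, where a "word
# character" is a member of $letters rather than Perl's \w.
sub has_word {
    my ($text, $word) = @_;
    my $padded = " $text ";    # the blanks stand in for ^ and $
    return $padded =~ /[^$letters]\Q$word\E[^$letters]/ ? 1 : 0;
}

print has_word('bl}b{r og fl|yte', 'bl}b{r'), "\n";   # whole word
print has_word('bl}b{rsyltet|y',   'bl}b{r'), "\n";   # only a prefix
```

The \Q...\E quoting matters here because the 7-bit Norwegian letters
are regex metacharacters when they appear outside a bracket class.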

This is close to what I'm going to use. It translates both ISO 8859-1
and 7-bit Norwegian characters. Since it will run inside a two-level
loop over a 5000-line input file, I'd be very grateful for any hints on
how to speed it up while keeping the usage simple enough for Perl
novices to modify.

# These can be used both in a tr/.../ (inside an eval) and in s/.../.
$upChars = 'A-\135\300-\326\330-\336'; # upcase chars
$toDownChars = 'a-\175\340-\366\370-\376'; # ..tr'ed to downcase
$downChars = 'a-\175\340-\366\370-\376\337\377'; # downcase chars
$toUpChars = 'A-\135\300-\326\330-\336\337\377'; # ..tr'ed to upcase
$norwChars = $upChars . $downChars; # letters
$wordChars = $norwChars . "0-9"; # alphanumerics
# Using \135 instead of ] so s/$upChars/../ won't be confused.

$arg1 = '$_[$[]'; # The argument
eval "
    # Convert (and modify) the args to Norwegian upper/lowercase
    sub upCase   { $arg1 =~ tr/$downChars/$toUpChars/; $arg1; }
    sub downCase { $arg1 =~ tr/$upChars/$toDownChars/; $arg1; }

    # First alphanumeric in the string -> uppercase, rest -> lowercase
    sub Capitalize {
        $arg1 =~ tr/$upChars/$toDownChars/;
        $arg1 =~ s/[$wordChars]/&upCase(\$&)/eo;
        $arg1;
    }

    # First alphanumeric in each word -> uppercase, rest -> lowercase
    sub Casify {
        $arg1 =~ s/([$wordChars])([$wordChars]*)/
            &upCase(\$1) . &downCase(\$2)/geo;
        $arg1;
    }
";
die $@ if $@; # Report syntax errors in the generated code

# Example usage -- convert names to correct case

# In names, these words should be in lowercase
%nameTrans = ('Af', 'af', 'Av', 'av', 'De', 'de', 'Jr', 'jr',
              'Den', 'den', 'Der', 'der', 'Van', 'van', 'Von', 'von');

sub convNamePart {
    local($_) = &downCase(shift);
    s/[$wordChars]/&upCase($&)/eo;
    $nameTrans{$_} || do {
        s/^Mc(.)/'Mc' . &upCase($1)/eo; # Mcneill -> McNeill
        $_;
    }
}

sub convName { $_[$[] =~ s/[$wordChars]+/&convNamePart($&)/geo; $_[$[]; }

while (<>) {
    print &convName($_);
}
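
For readers on Perl 5, here is a hedged sketch of the same idea as the
script above: tr/// does not interpolate variables, so the range
strings are spliced into source text and compiled once with a string
eval. The range strings are copied from the post. The lexical names
$up_case and $down_case and the sub casify are hypothetical stand-ins.
Unlike the original subs, they work on copies and return new strings
instead of modifying their arguments in place, on the assumption that
$& and $1 should not be written to through @_.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Range strings copied from the post (octal escapes, single-quoted so
# the backslashes survive until tr/// or the regex compiler sees them).
my $upChars     = 'A-\135\300-\326\330-\336';
my $toDownChars = 'a-\175\340-\366\370-\376';
my $downChars   = 'a-\175\340-\366\370-\376\337\377';
my $toUpChars   = 'A-\135\300-\326\330-\336\337\377';
my $wordChars   = $upChars . $downChars . '0-9';

# tr/// builds its table at compile time, so the ranges are pasted
# into the source of an anonymous sub and compiled once.
my $up_case = eval "sub { my \$s = shift; \$s =~ tr/$downChars/$toUpChars/; \$s }"
    or die "cannot compile up_case: $@";
my $down_case = eval "sub { my \$s = shift; \$s =~ tr/$upChars/$toDownChars/; \$s }"
    or die "cannot compile down_case: $@";

# Hypothetical counterpart of Casify: first character of each word in
# uppercase, the rest in lowercase. Returns a new string.
sub casify {
    my ($text) = @_;
    $text =~ s/([$wordChars])([$wordChars]*)/$up_case->($1) . $down_case->($2)/ge;
    return $text;
}

print casify('k}re BL]B[R'), "\n";   # prints "K}re Bl}b{r"
```

Because the two translators are compiled once, calling them inside a
nested loop costs a sub call and a table lookup per word, with no
string eval on each iteration.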

--
Hallvard