news.utdallas.edu!wupost!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!gulfaero.com!ux1.cso.uiuc.edu!news.cso.uiuc.edu!ux2.cso.uiuc.edu!ejk Tue Mar 16 20:21:14 CST 1993

news.utdallas.edu!wupost!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!gulfaero.com!ux1.cso.uiuc.edu!news.cso.uiuc.edu!ux2.cso.uiuc.edu!ejk Tue Mar 16 20:21:14 CST 1993
Article: 1612 of comp.lang.perl
Xref: feenix.metronet.com comp.lang.perl:1612
Newsgroups: comp.lang.perl
Path: feenix.metronet.com!news.utdallas.edu!wupost!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!gulfaero.com!ux1.cso.uiuc.edu!news.cso.uiuc.edu!ux2.cso.uiuc.edu!ejk
From: [email protected] (Ed Kubaitis - CCSO)
#Subject: Re: Fuzzy text matching
Date: Tue, 16 Mar 1993 13:57:19 GMT
Message-ID: <[email protected]>
References: <[email protected]>
Sender: [email protected] (Net Noise owner)
Organization: University of Illinois - Urbana
Lines: 50

[email protected] (Daniel Rich) writes:

|I am in search of a perl function that will allow me to "fuzzy" match
|text. I am trying to search a database of names (firstname/middle
|initial/lastname), and am running into problems with people being able
|to spell/type. Has anyone written anything that will do this?

Attached is an old post with a simple fuzzy match that worked ok for my
application.

----------------------------------
Ed Kubaitis ([email protected])
Computing & Communications Services Office - University of Illinois, Urbana
===============================================================================
My application involves finding the beginning of articles matching
articles listed in the table of contents in machine-readable issues of
a publication intended for human readers, not programs. (The problem
is that these documents are apparently hand entered or edited so
titles in the table-of-contents do not always exactly match the titles
in the text: misspellings, abreviations in one place but not the
other, different punctuation, etc.)

Here's a fuzzy match that has (so far) served well. The algorithm is
to calculate the ratio of two letter substrings in string A occuring
anywhere in string B to the total number of two letter substrings in
the shorter of the two strings. A ratio of 0.8 or greater empirically
seems to do what I want. You may want to change the way of dealing
with differences in lengths of the two strings, or try 3 character
substrings, etc. It may not be the best, but it's small, easy to
understand, and fairly fast.
----------------------------------------------------------------------
sub fmatch {
local($A, $B) = @_;
local($l) = (length($A) < length($B)) ? length($A) : length($B);
local($m) = 0;
local($w) = 2;
local($k);

$A eq $B && return(1.0);
$l > $w || return(0.0);

for $k(0..$l-$w) {
local($s) = substr($A, $k, $w);
#---escape r.e. characters in string A---
$s =~ s/([()[\]*|?.{}\\])/\\$1/;
$B =~ $s && $m++;
}

($m/($l-$w) > 0.80) ? 1 : 0;
}