Subj : Dupeloops
To : mark lewis
From : Rob Swindell
Date : Wed Jun 20 2018 11:44 am
Re: Dupeloops
By: mark lewis to Rob Swindell on Wed Jun 20 2018 08:08 am
>
> On 2018 Jun 19 22:43:24, you wrote to me:
>
> >> AFAIK, seenbys and paths are not included in most dupe detection
> >> schemes... other non-changing control lines are fine to be included...
> >> one of the problems comes when some systems sort those control lines on
> >> messages they are passing along... we don't see so much of that like we
> >> did at one time ;)
>
> RS> So some metadata is included in the data that is hashed for dupe
> RS> detection and some is not?
>
> yes...
>
> RS> Are you sure about that?
>
> yes... in fact, and i don't recall who pointed this out to me back in the
> '90s, dbridge does exactly this in a manner of speaking... it takes the
> whole message header plus X bytes immediately following the message header
> and uses all of that as at least part of the checksum calculation... this
> was pointed out to me when i was working on my posting tool and was adding
> MSGID support to it...
>
> i was using a library and just letting it do its thing... some of my test
> posts were reported as dupes when they clearly weren't... IIRC, they were
> detected as dupes because they were posted within the same second... it
> turned out that my MSGID was somewhere in the middle of the control lines
> at the beginning of the message body and only my dbridge using testers were
> seeing this... someone pointed out this thing about dbridge also using X
> bytes from the beginning of the message body in addition to the message
> header so i moved my posting tool's MSGID to the top of the list and no
> more dupes were detected by those dbridge systems...
>
> i don't know what other systems do... there's only a very few that provide
> this information... SBBS is one of them... when i was testing Mystic, there
> was some discussion about dupe detection as james worked to try to figure
> out the best method he liked... i have used fastecho here for decades but i
> don't know what data it uses for its checksums... i do know it uses two
> checksums, though... i know this because i was being nosy one day and
> looking at FE's dupe database file (one for all message areas) with a hex
> viewer and noticed that groups of bytes were repeated all throughout the
> file... i asked about this and was told i found a bug... basically, FE has
> two checksums that it uses for each message and both are supposed to be
> stored in the database... what i found was that only one was being used and
> written to both fields... toby fixed that problem right quick... i just
> don't know what data is used to calculate them...
>
> back in the day, dupe detection formulas were not really shared around...
> maybe a couple of developers talking amongst themselves would tell each
> other what they were doing but this information was not published where
> everyone could find it... it was more or less black majik to a point...
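The header-plus-X-bytes behavior described in the quoted text can be pictured with a rough Python sketch. Everything specific here is an assumption made for illustration: the use of CRC-32, the 32-byte prefix, the flattened header string, and the names PREFIX_LEN, header_prefix_key and make_body. It is not dbridge's actual algorithm, which is not given in this thread.

```python
import zlib

PREFIX_LEN = 32  # hypothetical stand-in for the "X bytes"; real value unknown


def header_prefix_key(header: bytes, body: bytes,
                      prefix_len: int = PREFIX_LEN) -> int:
    """Checksum the whole header plus the first prefix_len bytes of the body."""
    return zlib.crc32(header + body[:prefix_len])


def make_body(serial: str, msgid_first: bool) -> bytes:
    """Build a test body whose control lines differ only in the MSGID serial."""
    msgid = b"\x01MSGID: 1:2/3 " + serial.encode("ascii") + b"\r"
    others = b"\x01CHRS: CP437 2\r\x01TZUTC: -0400\r\x01PID: tool 1.0\r"
    kludges = msgid + others if msgid_first else others + msgid
    return kludges + b"test post\r"


# Two posts made within the same second share an identical header
# (this flattened header layout is made up for the example).
header = b"mark lewis|All|test|20 Jun 18  11:44:00"
```

With the MSGID buried behind the other control lines, make_body("00000001", False) and make_body("00000002", False) produce the same key under this scheme, since the first 32 body bytes are identical. Moving the MSGID to the top (msgid_first=True) puts the differing serial inside the checksummed prefix and the keys diverge.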
To complete the discussion, Synchronet (smblib) actually uses multiple methods
of body text dupe detection:
1. A "legacy" CRC-32 hash of the body text, excluding any metadata (like FTN
   control lines) and any trailing white-space or control characters
2. A tuple of hashes (MD5 digest, CRC-32, and CRC-16) and length (char count)
   of the body text, excluding any metadata and *all* white-space characters
These are in addition to duplicate Internet (RFC-822) compliant Message-ID and
FTN-compliant Message-ID checks.
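The two body-text methods listed above could be sketched in Python roughly as follows. This is not smblib's code: the function names, the rule for spotting metadata lines (SOH-prefixed kludges and SEEN-BY lines), the UTF-8 encoding, and the CRC-16 variant (CRC-CCITT via binascii.crc_hqx) are all assumptions made for illustration.

```python
import binascii
import hashlib
import zlib

# Characters treated as "white-space or control characters" in this sketch:
# everything from NUL through space (an assumption, not smblib's definition).
TRAILING_JUNK = "".join(chr(c) for c in range(0x21))


def strip_metadata(text: str) -> str:
    """Drop FTN-style metadata lines: SOH-prefixed kludges and SEEN-BY lines."""
    kept = [line for line in text.splitlines()
            if not line.startswith(("\x01", "SEEN-BY:"))]
    return "\n".join(kept)


def legacy_key(text: str) -> int:
    """CRC-32 of the body with metadata and trailing junk removed."""
    body = strip_metadata(text).rstrip(TRAILING_JUNK)
    return zlib.crc32(body.encode("utf-8"))


def tuple_key(text: str) -> tuple:
    """(MD5, CRC-32, CRC-16, length) of the body with all white-space removed."""
    data = "".join(strip_metadata(text).split()).encode("utf-8")
    return (hashlib.md5(data).hexdigest(),
            zlib.crc32(data),
            binascii.crc_hqx(data, 0),  # CRC-CCITT; the variant is a guess
            len(data))                  # length in encoded bytes
```

Under these assumptions, two copies of a message that differ only in their control lines and white-space (say, re-wrapped or with different line endings) get the same tuple_key, while legacy_key only forgives differences in metadata and trailing white-space.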
No black majik here. :-)
digital man
Synchronet "Real Fact" #64:
Synchronet PCMS (introduced w/v2.0) is Programmable Command and Menu Structure.
Norco, CA WX: 77.6°F, 57.0% humidity, 8 mph ENE wind, 0.00 inches rain/24hrs