Document: FSC-0047
Version:  001
Date:     28-May-90




        The ^ASPLIT Kludge Line For Splitting Large Messages

                            Pat Terry
                           5:494/4.101




Status of this document:

    This FSC suggests a proposed protocol for the FidoNet(r) community,
    and requests discussion and suggestions for improvements.
    Distribution of this document is unlimited.

    Fido and FidoNet are registered marks of Tom Jennings and Fido
    Software.





Objectives
==========

Several packers place a limit on the size of message that can be
transmitted.  This is often of the order of 14K which, while
sufficient for most purposes, is inadequate for several
applications, and in particular for long messages gated to and
from UUCP land.

A SPLIT/UNSPLIT suite of two programs has been developed, intended
to handle this problem.  SPLIT will split long .MSG format
messages into smaller packets.  After transmission to a remote
site, the packets may be merged by UNSPLIT to recreate the
original message, as closely as possible.  The only differences
are the addition of a kludge line and, possibly, a few line
breaks.

The system ensures that each large message, when split, generates
a collection of small messages, each of which is still valid in
its own right.  If recombination is not effected, the messages
will still be usefully received, and, in particular, split
messages to UUCP should still all get to their destinations,
albeit in parts.

After some weeks of testing, the system seems to be sufficiently
stable and useful to justify making an FSC proposal.



The ^ASPLIT kludge line
=======================

Messages split and joined by this system make use of an ^A kludge
line, which has the form below.  It is proposed in this note that
this become the basis for a "standard".

One of these lines is added to the list of kludges preceding each
part of a split message.  When recombined, a line of this form
remains, for reasons which will appear later.

Generically the lines look like this, in fixed columns (the ^A is
a single control character occupying column 1):

^ASPLIT: date      time     @net/node    nnnnn pp/xx +++++++++++

where
     nnnnn gives the original message number from which the
           components have been derived (cols 41 - 45)
     pp    gives the part number (cols 47 and 48)
     xx    gives the total number of parts (cols 50 and 51)

For example

^ASPLIT: 30 Mar 90 11:12:34 @494/4       123   02/03 +++++++++++
           |      |        |  |          |     |  |  |
           |      |        @  |          |     |  |  |
           Date   Time        Node       MSG   |  |  Eye catcher
           (when split)  (of origin) (at time  |  Total parts
                                     of split) Part number


Thus a large file (existing as 123.MSG when the splitter was run)
originating from 494/4 might be split into 3 parts with the split
lines

^ASPLIT: 30 Mar 90 11:12:34 @494/4       123   01/03 +++++++++++
^ASPLIT: 30 Mar 90 11:12:34 @494/4       123   02/03 +++++++++++
^ASPLIT: 30 Mar 90 11:12:34 @494/4       123   03/03 +++++++++++
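
As an illustration only, the fixed-column layout above could be
generated along the following lines.  This is a sketch and not the
code of the SPLIT program: the function name and its parameters
are hypothetical, and the left-justified message number is
inferred from the example lines.

```python
SOH = "\x01"  # the ^A control character that introduces a kludge line


def format_split_line(stamp, net, node, msg_num, part, total):
    """Build a ^ASPLIT line in the fixed columns shown above.

    `stamp` is the date and time of the split, already formatted as
    in the examples, e.g. "30 Mar 90 11:12:34" (18 characters).
    """
    origin = "@%d/%d" % (net, node)
    # cols 9-26 stamp, 28-39 origin, 41-45 message number,
    # 47-48 part number, 50-51 total parts, then the eye catcher
    return "%sSPLIT: %-18s %-12s %-5d %02d/%02d %s" % (
        SOH, stamp, origin, msg_num, part, total, "+" * 11)


if __name__ == "__main__":
    for part in (1, 2, 3):
        line = format_split_line("30 Mar 90 11:12:34", 494, 4, 123, part, 3)
        print(line.replace(SOH, "^A"))
```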

Columns 9 through 45 are really a "uniquifier".  The nnnnn
message number is just the one the message had when it was split,
and is of no other significance.  Similarly, the system does not
use 4-d addressing for the net/node component, because this is of
no real interest to this application, and requires parsing a file
like BINKLEY.CFG, or similar extra work, to determine the other
components.
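
Because the columns are fixed, the uniquifier and the part numbers
can be picked out by simple slicing.  The following is a minimal
sketch under that assumption; the function name is hypothetical
and the validation is deliberately light.

```python
SOH = "\x01"  # the ^A control character


def parse_split_line(line):
    """Return (uniquifier, part, total) from a ^ASPLIT line, or None.

    Column numbers in the text are 1-based, so column 9 is index 8.
    """
    if not line.startswith(SOH + "SPLIT: ") or len(line) < 51:
        return None
    uniquifier = line[8:45]   # columns 9 through 45
    part = line[46:48]        # columns 47 and 48
    total = line[49:51]       # columns 50 and 51
    if not (part.isdigit() and total.isdigit()):
        return None
    return uniquifier, int(part), int(total)
```

All parts of one split message yield the same uniquifier, so it
can serve directly as a key when the parts are collected.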

This is, admittedly, verbose, but if recombination fails for any
reason (like all the packets not arriving at once) one can still
recombine or examine the relevant pieces manually.  Note also
that the lines are added to messages that are themselves "long",
and the *relative* increase in length is actually very small.
Further justification will be found below.


Splitting large messages
========================

When splitting large messages, the following happens:

The message base is scanned for large messages.

For each of the (few) large messages found that qualify, the
large message is split into parts.  The original FTSC header is
placed in each component part, save that the FileAttach bit (if
any) is removed from the 2nd, 3rd ... parts.  No attempt is made
to modify the To:, or From: fields.  The Subject: field for the
2nd, 3rd ... parts is modified to include a leading part number.

The original kludge lines are retained in the first part.  Most
"leading" kludges, like ^AFMPT, ^ATOPT and ^AINTL, are also
retained in the 2nd, 3rd ... parts.  However, ^AEID and ^AMSGID
lines, if any, are removed from the 2nd, 3rd ... parts.  This is
potentially awkward, but is to avoid "dupe detectors" discarding
the 2nd, 3rd ... parts, and in practice should cause no real
problems.  Large echomail messages originating on a system will
presumably have their ^AEID lines added to the constituent parts
at scanning/packing time on that system (i.e. AFTER splitting),
and other large messages should probably not reach this stage -
they should have been split or discarded earlier.
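
The kludge handling just described might be sketched as follows.
The function name is hypothetical, and the sketch assumes that
^AEID and ^AMSGID are the only leading kludges dropped from the
later parts, which the text above does not strictly guarantee.

```python
SOH = "\x01"  # the ^A control character

# Kludges removed from the 2nd, 3rd ... parts so that dupe
# detectors do not discard those parts.
DROPPED_PREFIXES = (SOH + "EID", SOH + "MSGID")


def kludges_for_part(kludges, part):
    """Return the leading kludge lines to write into part `part`."""
    if part == 1:
        return list(kludges)  # the first part keeps everything
    return [k for k in kludges if not k.startswith(DROPPED_PREFIXES)]
```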

A ^ASPLIT line is added to each part to allow for possible later
recombination.

If the message is addressed to "UUCP" in the FTSC header, the To:
lines at the start of the message text are copied to all parts.

The "body" of the message is then split between the various
parts.  An attempt is made to split at the end of a line in each
case.
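
One possible way of splitting the body at line ends is sketched
below.  It assumes that lines in the message text end with a
carriage return and that the size limit is counted in characters;
the function name and the `limit` parameter are hypothetical, and
the actual threshold used by SPLIT is not stated here.

```python
def split_body(body, limit):
    """Split `body` into chunks of at most `limit` characters.

    Each chunk ends just after a carriage return where possible;
    if no line end fits within the limit, the text is cut at the
    limit itself.
    """
    parts = []
    while len(body) > limit:
        cut = body.rfind("\r", 0, limit) + 1  # just past the last CR that fits
        if cut == 0:
            cut = limit                       # no line end found: hard cut
        parts.append(body[:cut])
        body = body[cut:]
    parts.append(body)
    return parts
```

Joining the chunks in order gives back the original text, since
nothing is added or removed at the cut points.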

The trailing tear line and the ^AVia, ^APATH etc. lines are added
to all parts.


Joining ("unsplitting") messages
================================

When reconstituting large messages, the following happens:

The message base is scanned for messages with ^ASPLIT lines.
A list is made of messages to be unsplit, with each message
having a list of its component parts. If a duplicate component
part is found, it is discarded (thus partially getting around the
problem of any discarded ^AEID lines in the components).
Messages marked "in transit" or "sent" are not eligible for
recombination.  Nor are messages with a split component number of
00, as these will only exist as the result of an earlier
recombination.
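
The scan just described could be sketched as follows.  The
attribute bit values are assumed to be those of the FTS-0001
message header, and the function name and the shape of its input
are hypothetical.  Returning only complete sets is also an
assumption of this sketch.

```python
SENT = 0x0008        # attribute bits, assumed to follow FTS-0001
IN_TRANSIT = 0x0020


def group_parts(messages):
    """Group split messages into complete sets.

    `messages` is a sequence of (attributes, split_line) pairs, one
    per message that carries a ^ASPLIT line.  The result maps
    (uniquifier, total) to a dict of part number -> message index.
    """
    groups = {}
    for index, (attributes, split_line) in enumerate(messages):
        if attributes & (SENT | IN_TRANSIT):
            continue                      # not eligible for recombination
        uniquifier = split_line[8:45]     # columns 9 through 45
        part = int(split_line[46:48])     # columns 47 and 48
        total = int(split_line[49:51])    # columns 50 and 51
        if part == 0:
            continue                      # result of an earlier recombination
        parts = groups.setdefault((uniquifier, total), {})
        parts.setdefault(part, index)     # a duplicate part is ignored
    # Keep only the sets in which every part from 1 to total is present.
    return {key: parts for key, parts in groups.items()
            if sorted(parts) == list(range(1, key[1] + 1))}
```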

For each set of components of messages to be recombined the
following happens:

The first component is examined so as to extract the Kludge
lines, and any UUCP "To: " lines. These, and the FTSC header, are
written out to a new file, with the ^ASPLIT line modified to have
a component number of 00, so as to prevent further splitting
should the splitter program be reapplied to the recombined
message.  If this is not done, large messages can get into a
tedious split-unsplit-split-unsplit... cycle each time the
system is run.
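
With fixed columns, marking a recombined message amounts to
overwriting the part number field.  A minimal sketch, with a
hypothetical function name:

```python
def mark_recombined(split_line):
    """Return the ^ASPLIT line with its part number set to 00.

    The part number occupies columns 47 and 48, which are indices
    46 and 47 of the string; everything else is left untouched.
    """
    return split_line[:46] + "00" + split_line[48:]
```

A splitter that skips messages whose ^ASPLIT line has part number
00 will then leave the recombined message alone.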

The text portions of the first and subsequent parts are then
merged (discarding extra copies of kludges, UUCP "To:" lines and
the like).
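
The merging of the text portions might look roughly like the
sketch below.  It treats each part as a list of lines, and the
tests used to recognise leading and trailing lines are simplified
assumptions (any leading ^A line, UUCP "To: " lines if requested,
and trailing tear, Origin and ^A lines).  All names are
hypothetical.

```python
SOH = "\x01"  # the ^A control character

# Prefixes of lines that may trail the text: tear line, Origin
# line, and ^A lines such as ^APATH and ^AVia.
TRAILER_PREFIXES = ("---", " * Origin:", SOH)


def text_portion(lines, uucp=False):
    """Return the lines of one part without leading kludges, UUCP
    "To: " lines (when `uucp` is true) and trailing lines."""
    start = 0
    while start < len(lines) and (
            lines[start].startswith(SOH)
            or (uucp and lines[start].startswith("To: "))):
        start += 1
    end = len(lines)
    while end > start and lines[end - 1].startswith(TRAILER_PREFIXES):
        end -= 1
    return lines[start:end]


def merge_text(parts, uucp=False):
    """Concatenate the text portions of the parts, in order."""
    merged = []
    for lines in parts:
        merged.extend(text_portion(lines, uucp))
    return merged
```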

Any tear line, Origin, ^APATH, ^AVia lines etc. are appended.

Normally the component files are then automatically deleted.


Justification for "human readable" uniquifier
=============================================

Most systems do not display kludge lines, and the ^ASPLIT line
should be of no real interest.  However, in one particular
application which was using this system, the ^ASPLIT lines were
made visible for messages that could not be recombined (because
they would become too large for gating from FidoNet to another
RFC-822 compliant network), and hence it has been deemed essential
that a "visible" line derived from ^ASPLIT be human readable,
easily spotted, and comprehensible.  For much the same reason,
fixed columns have been used, rather than free format, so that
archaic FORTRAN programmers could easily develop "unsplitters"
after getting all the pieces!  Lastly, in this system the kludge
lines are sorted so that the ^ASPLIT line is the last kludge line
before the message body proper.


Acknowledgements
================

Particular thanks must be expressed to Randy Bush for offering to
test this system in its earliest releases on the very busy 1/5
zonegate, and for suggesting various improvements.  Thanks for
testing are also due to Dave Wilson who operates the 5/1 zonegate
at the other end of the link from Randy, and to Mike Lawrie of
Rhodes Computer Centre for useful suggestions regarding the form
of the ^ASPLIT line acceptable to non-Fido users.


Prototype system
================

A version of SPLIT/UNSPLIT using this system may be FREQ'd
from 1:105/42 or 5:494/4 using the magic name SPLITTER.  As at
this time I have unsubstantiated reports that it does not work
in conjunction with systems running Novell software (I have no
access to Novell).  It works fine using Msged and QMail.