16-Nov-79 06:55:47-EST,1355;000000000001
Mail from SU-SCORE rcvd at 16-Nov-79 0655-EST
Date: 16 Nov 1979 0341-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: system crash bugfix - for release 3 and 4
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.16: ;
cc: Hess at DEC-MARLBORO
Problem:
ILMNRF bug halt, illegal reference is at SAVE8. Quite reproducible.
Diagnosis:
The SAVE% JSYS is insufficiently paranoid.
User did a CSAVE of a core image which has an indirect pointer
to an inferior fork which in turn has a file mapped with no
access. The program was (no surprise) an INTERLISP job. Len
Bosack can explain it in more detail.
Solution: In MEXEC.MAC, after the CALL SETMPG at SAVE3A-1, insert:
SKIP FPG0A ;MAKE DAMN SURE THE PAGE IS READABLE
ERJMP SAVE3B ;IT WASN'T - TRY NEXT PAGE
This isn't necessarily the best solution, since it still does
the SETMPG call, but it's better than trying to put ERJMPs at
all the places in the save code where it is referencing the
window page. A better fix would probably be to make the check
before the SETMPG call more thorough, but I don't know enough
yet to work that out.
I haven't tested this out completely, nor have I checked to see
if a similar bug exists in SSAVE%, but the fix seems to work.
If anybody else finds out anything more about this, I'll be glad
to hear about it.
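
The idea of the fix (probe the page once, and skip it if the probe
faults) can be sketched in Python. This is only a toy model, not monitor
code: the page table, the IllegalReference exception and the function
names are all invented for illustration, with None standing in for a
page mapped with no access.

```python
class IllegalReference(Exception):
    """Stands in for the illegal reference that ends in ILMNRF."""


def read_page(pages, n):
    # Hypothetical accessor: touching a no-access page faults.
    if pages[n] is None:
        raise IllegalReference(n)
    return pages[n]


def save_image(pages):
    saved = {}
    for n in sorted(pages):
        try:
            data = read_page(pages, n)  # the probe (SKIP FPG0A)
        except IllegalReference:
            continue                    # the ERJMP: try the next page
        saved[n] = data
    return saved
```

With pages {0: "a", 1: None, 2: "c"} the model saves pages 0 and 2 and
silently skips page 1 instead of faulting.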
-------
16-Nov-79 10:30:12-EST,1265;000000000001
Mail from SU-SCORE rcvd at 16-Nov-79 1029-EST
Date: 16 Nov 1979 0715-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: another ILMNRF bug fix, this time in IMPDV
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.16: ;
I just discovered another cause of ILMNRF bug halts. IMPSTD
incorrectly assumes that T2 contains a hash slot after the CHKNWP
call, where actually T2 contains randomness. Since CHKNWP saves
the temporaries (which is why T2 doesn't contain anything - even
if it didn't T2 is still wrong), there is also no need to push T4
around that call, and T4 does have the correct index into HSTSTS.
This fix is relevant to release 4 and to any release 3 site
running the long leader NCP (MIT-XX and ISIE, possibly others).
I am surprised I haven't been crashed with this bug before.
At IMPSTD+2, replace:
PUSH P,T4 ; Save index
CALL CHKNWP ; Does it know about RAS/RAR, etc?
JRST [ SKIPGE HSTSTS(T2) ; No, so if we think it's up,
CALL HSTDED ; mark it down.
JRST .+1]
POP P,T4 ; Restore index
IMPSTE: ...
with:
CALL CHKNWP ; Does it know about RAS/RAR, etc?
JRST [ PUSH P,T4 ; No, save index
SKIPGE HSTSTS(T4) ; If we think it's up,
CALL HSTDED ; mark it down
POP P,T4
JRST IMPSTE]
IMPSTE: ...
-------
24-Nov-79 08:37:48-EST,1491;000000000001
Mail from SU-SCORE rcvd at 24-Nov-79 0837-EST
Date: 24 Nov 1979 0528-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: probable fix to the release 4 FILLFW hung job bug
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.16: ;
The following appears to be a fix for the famous release 4 bug of
getting hung in LGOUT waiting for some JFN's FILLFW count to be
zeroed. I put this in yesterday and I've gone a full day without
anybody getting hung this way, which is a first on my system.
In JSYSA.MAC:
At PMAP6+2, before the HLRZ C,FILLFW(A), insert:
NOSKD1 ;SEIZE THE MACHINE
At PMAP3-1, after the HRLM C,FILLFW(A), insert:
OKSKD1 ;ALL CLEAR NOW
In JSYSF.MAC:
At CLZMRC+12 (what an appropriate name!) or so, before the
HLLZ A,FILLFW(JFN), insert:
NOSKD1 ;OWN THE MACHINE WHILE DOING THIS
Three instructions down, after the HRRZS FILLFW(JFN), insert:
OKSKD1
Dave Bell at DEC originally pointed me at the idea of checking
PMAP's FILLFW hacking although he wasn't sure it was really the
problem. I decided to check JSYSF as well. These two places are
the only two which put FILLFW in an AC and hack it there instead
of hacking it directly.
Bell thinks that locking the JFN is probably good enough. Since
locks are only good if everybody respects them, I didn't go that
route. It is, however, probably alright to use OKSKED and NOSKED
instead of OKSKD1 and NOSKD1; I'm just paranoid about these
things and hate needless bug halts.
Mark
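
The hazard being closed here is the usual lost update: a count is copied
into an AC, changed there, and stored back, and the scheduler can run
somebody else in between. A toy Python model follows. The Jfn class, the
function names, and the assumption that one routine increments the count
while the other zeroes it are all invented for illustration; the message
does not say exactly what each routine does to the count.

```python
class Jfn:
    def __init__(self, count):
        self.count = count


def unprotected_interleaving(jfn):
    ac = jfn.count       # routine 1 copies the count into an AC
    jfn.count = 0        # scheduler runs routine 2, which zeroes the count
    jfn.count = ac + 1   # routine 1 resumes and stores its stale copy
    return jfn.count


def protected_sequence(jfn):
    # With scheduling held off, routine 1 finishes before routine 2 runs.
    ac = jfn.count
    jfn.count = ac + 1
    jfn.count = 0        # routine 2 now zeroes the up-to-date count
    return jfn.count
```

Starting from a count of 2, the unprotected interleaving leaves 3 behind
after the zeroing, so anything waiting for zero waits forever; the
protected sequence ends at 0.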
-------
30-Nov-79 02:10:51-EST,1893;000000000001
Mail from SU-SCORE rcvd at 30-Nov-79 0210-EST
Date: 29 Nov 1979 2302-PST
From: g.Meyers at SU-SCORE (Harris A. Meyers)
Subject: bug in 3, 3A and 4 EXEC
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.16:
Ever been in the Mini-EXEC and done an Exec command without a Reset first?
The result is a continuous string of "? Invalid CMBFP pointer" messages. The
cause is that the Exec command merges a new EXEC into the address space, the
COMND JSYS data is in page 0 which gets merged from the file, the flag that
says that the EXEC has been initialized is in another page that is not in
the file, and so it does not get cleared! The attached srccom was our fix;
however, maybe the Mini-EXEC should also do a reset before loading in a program.
LINE 1, PAGE 1
1) ;<3A-EXEC>EXECPR.MAC.2, 29-Nov-79 17:02:26, Edit by MEYERS
1) ;1 move CINITF to page 0
1) ;<3A-EXEC>EXECPR.MAC.3, 2-May-79 09:08:12, Edit by MCLURE
LINE 1, PAGE 1
2) ;<3A-EXEC>EXECPR.MAC.3, 2-May-79 09:08:12, Edit by MCLURE
LINE 4, PAGE 2
1) SRI,< ;1
1) ;1 move CINITF to page 0, this will cause an Exec command in the Mini-Exec
1) ;1 without a Reset to work. The same bug may be exercised by non-privliged
1) ;1 users by starting an EXEC, merging in a new copy, & doing a start. It
1) ;1 will loop forever typing "?invalad CMBFP pointer"
1)
1) CINITF: Z ;NON-ZERO AFTER STARTUP INITIALIZATION COMPLETED
1) >;1 SRI
1)
1) ;STORAGE FOR EXEC COMMAND INTERPRETER
LINE 4, PAGE 2
2) ;STORAGE FOR EXEC COMMAND INTERPRETER
LINE 1, PAGE 4
1) NOSRI,< ;1 move to page 0
1) CINITF: Z ;NON-ZERO AFTER STARTUP INITIALIZATION COMPLETED
1) >;1 NOSRI
1)
LINE 1, PAGE 4
2) CINITF: Z ;NON-ZERO AFTER STARTUP INITIALIZATION COMPLETED
2)
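
Why moving the flag matters can be shown with a toy model of the merge in
Python. Page numbers, dictionary layout and function names are invented
for illustration; only the idea that a merge replaces just the pages
present in the file comes from the description above.

```python
def merge(address_space, file_pages):
    merged = dict(address_space)
    merged.update(file_pages)  # only pages present in the file are replaced
    return merged


def exec_will_initialize(running, exec_file, flag_page):
    # The EXEC runs its startup initialization only if CINITF is zero.
    merged = merge(running, exec_file)
    return merged[flag_page]["CINITF"] == 0


# Old layout: the flag lives in a page (here 5) that is not in the file.
old_running = {0: {"COMND": "set up"}, 5: {"CINITF": 1}}
old_file = {0: {"COMND": "fresh"}}

# New layout: the flag lives in page 0 and is replaced along with it.
new_running = {0: {"COMND": "set up", "CINITF": 1}}
new_file = {0: {"COMND": "fresh", "CINITF": 0}}
```

In the old layout the merged EXEC keeps CINITF set and skips its
initialization even though the COMND data is fresh; in the new layout the
flag is cleared by the merge and initialization runs.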
-------
30-Nov-79 18:53:43-EST,656;000000000001
Mail from SU-SCORE rcvd at 30-Nov-79 1853-EST
Date: 30 Nov 1979 1517-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: Fix for NETRBG bughlts!!
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.18: ;
At last, a fix for infamous NETRBG bug halt when taking the
network down. Credits to KODA@ISID for this one. This fix is
for both release 3 and release 4, and I strongly recommend it.
In IMPDV.MAC, at IMPQOA, change:
IMPQOA: SKIPN IMPORD ; Is output on?
JRST RLNTBF ; No. don't queue it up
to:
IMPQOA: SKIPN IMPORD ; Is output on?
JRST [ PUSH P,T1 ; No. Save AC1
CALL RLNTBF ; And discard the message
POP P,T1
RET]
-------
30-Nov-79 18:54:20-EST,1116;000000000001
Mail from SU-SCORE rcvd at 30-Nov-79 1853-EST
Mail-from: SRI-KL rcvd at 29-Nov-79 2223-PST
Date: 29 Nov 1979 1802-PST
From: LARSON at SRI-KL
Subject: Connect to subdirectory fix
To: MRC at SU-SCORE
Remailed-date: 30 Nov 1979 1518-PST
Remailed-from: Mark Crispin <Admin.MRC at SU-SCORE>
Remailed-to: [SU-SCORE]<Admin.MRC>Tops-20.DIS.18: ;
Here is a fix to allow users to connect to their own subdirectories
without having to enter a password. It will allow user LARSON to
connect to any directory in DSK*:<LARSON.*> provided the structure is
mounted as domestic.
This is the release 3 version.
- - - - - - - - - -
;<3-MONITOR>JSYSA.MAC.1029 23-Nov-79 16:22:30 EDIT BY LARSON
;22 Allow user to connect to subdirectories of his logged in directory
;22 on all domestic structures (including PS:)
- - - - - - - - - -
(at acces8-2):
;22 JRST ACCES8 ;NO. SPECIAL CASE DOESN'T APPLY
jrst [cain t4,"." ;22 was it a .
jumpe t1,.+1 ;22 yes, ok if it was a substring
jrst acces8] ;22 not a substring or . was not next char
SETZ P2, ;YES. INDICATE OK TO DO THIS
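
The test the patch adds amounts to "the logged-in directory name is a
leading substring of the target, and the next character is a dot". A
minimal Python sketch of that rule, with an invented function name and
plain strings standing in for directory names:

```python
def is_own_subdirectory(user_dir, target_dir):
    # True only when target_dir is user_dir, then ".", then more text.
    # LARSON.FOO qualifies for LARSON; LARSONX and LARSON itself do not.
    prefix = user_dir + "."
    return target_dir.startswith(prefix) and len(target_dir) > len(prefix)
```

Checking for the dot is what keeps user LARSON out of a directory such as
LARSONX, which shares the leading substring but is not a subdirectory.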
-------
3-Dec-79 00:01:20-EST,798;000000000001
Mail from SU-SCORE rcvd at 3-Dec-79 0001-EST
Mail-from: BBN-TENEXD rcvd at 2-Dec-79 1320-PST
Date: 2 Dec 1979 1611-EST
From: JBORCHEK at BBN-TENEXD
Subject: Bug fix at impqoa:
To: MRC at SCORE, KODA at ISID
cc: ALLEN, TAPPAN
Remailed-date: 2 Dec 1979 1450-PST
Remailed-from: Mark Crispin <Admin.MRC at SU-SCORE>
Remailed-to: [SU-SCORE]<Admin.MRC>Tops-20.DIS.20: ;
I think that the proper fix would be to save t1 in rlntbf rather than
at each call site, since there may be more than one place where rlntbf
gets called and clobbers something important in t1. I know for a fact
such a place exists in our imp driver. Put it in rlntbf and we won't
ever have to worry about it again.
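
The difference between saving at each call site and saving inside the
routine can be sketched with a toy register file in Python. The
dictionary, the function names, and the value left behind in T1 are
invented for illustration; this is not the driver code.

```python
def rlntbf_raw(regs):
    regs["T1"] = -1        # whatever the routine happens to leave in T1


def rlntbf_saving(regs):
    saved = regs["T1"]     # PUSH P,T1 on entry
    rlntbf_raw(regs)
    regs["T1"] = saved     # POP P,T1 before returning


def caller(release):
    regs = {"T1": 42}      # something important in T1
    release(regs)
    return regs["T1"]
```

Any caller of rlntbf_saving gets T1 back unchanged without having to
know about the problem, which is the point of putting the save in one
place.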
-------
[Note from MRC: I agree with John. I am going to make RLNTBF save
T1 in my system.]
5-Dec-79 02:52:25-EST,764;000000000001
Mail from SU-SCORE rcvd at 5-Dec-79 0252-EST
Date: 4 Dec 1979 2338-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: bug in SIN JSYS
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.20: ;
I just got an answer to my SPR 20-13701 about inconsistent results
being returned by the SIN JSYS when you specified a count and a null
as the terminator. It turns out BYTINA checks for a file less than
5 bytes long, and if so, assumes it has no line numbers, but forgets
to tell the rest of the world.
The patch is to change BYTIA2+3 from a CAIGE T3,T4 to a SKIPA. In
the source code, you can delete the MOVE T3,FILCNT(JFN) as well as
the patched over instruction and change the CAIN to a CAIE.
This patch is valid for both release 3 and release 4.
-------
8-Dec-79 00:46:30-EST,4180;000000000001
Mail from SU-SCORE rcvd at 8-Dec-79 0045-EST
Mail-from: UTAH-20 rcvd at 7-Dec-79 2044-PST
Date: 7 Dec 1979 2128-MST
From: FRANK at UTAH-20 (Randy Frank)
Subject: bug with bat blocks in swapping space
To: Tops-20:
cc: crossland at UTAH-20
Remailed-date: 7 Dec 1979 2048-PST
Remailed-from: Mark Crispin <Admin.MRC at SU-SCORE>
Remailed-to: [SU-SCORE]<Admin.MRC>Tops-20.DIS.21: ;
For about the past two weeks we've been going around in circles with
soft disk errors in our swapping space. They haven't been bad enough
to cause crashes, but have definitely caused performance degradation.
Being good little boys we dutifully put the bad blocks in the BAT
blocks, and were obviously consternated by the fact that the bad pages
were still being used. Sure enough, according to the code in dskalc
swapping pages which happen to be in bat blocks are marked as "used"
in the drum bit table at system initialization time. Unfortunately,
the wrong page in the drum bit table is marked as unavailable!
Evidently (this is Crossland's guess) at some point in the not too
distant past, some code was added at GTCUB3 in PHYSIO which has
the effect of converting logical drum addresses (as found in CST1)
to disk addresses in a non-linear fashion. It does this by basically
allocating the first cylinder's worth of drum addresses to the first
disk in PS:, the next cylinder's worth of drum addresses to the
second disk in PS:, etc. This was a (worthwhile) attempt to split
drum references more evenly over the packs in PS:. Unfortunately,
the reverse of the splitting algorithm was not done in dskalc when it converts
disk addresses to drum addresses for the purpose of marking bat block
pages as unavailable in the drum bit table. The patch to dskalc
to fix this problem follows. As an interesting aside,
if you only have a one pack PS: you won't notice this bug, for
obvious reasons!
Another aside worth knowing: If you are crashing from swapping
related bughlts (caused by real bad spots on the disk), the pages
causing the crashes are NOT added by the monitor to the bat blocks
before the monitor crashes. You must manually add the pages to the
bat blocks. There is a nice little utility for manipulating bat blocks
(printing them out, adding to them, deleting from them) written
by Bill MacCormack at DDC, which I have if anyone wants it. Send me a
msg. There was also a recent useful big buffer article on swapping
related bughlts which should be read if you're suffering from such
problems. If you can't get a copy from your favorite software specialist,
let me know and I'll send you one.
The patch begins at DOBDRM+12
It appears to be valid for both 3A and 4 (only tested on 3A though)
LINE 1, PAGE 1
1) ;PS:<3A-MONITOR-SOURCES>DSKALC.MAC.2 6-Dec-79 16:35:22, Edit by FRANK
1) ; FIX CALCULATION OF LOGICAL DRUM ADDRESS FOR BAT BLOCKS
1) ;<3A.MONITOR>DSKALC.MAC.11, 24-Jul-78 22:56:16, Edit by MCLEAN
LINE 1, PAGE 1
2) ;<3A.MONITOR>DSKALC.MAC.11, 24-Jul-78 22:56:16, Edit by MCLEAN
LINE 31, PAGE 81
1) MOVEM C,CKBNDX ;SAVE BAD PAIR INDEX
1) MOVE C,SDBTYP(D) ;GET ADDR OF DSKSIZ TABLE
1) ; A NOW CONTAINS SECTOR OFFSET WITHIN SWAP SPACE FOR THIS PACK
1) IDIV A,SECCYL(C) ;GET CYLINDER # WITHIN SWAPPING
1) IMUL A,SDBNUM(D) ;MULTIPLY BY NUMBER OF PACK IN THIS STRUCTURE
1) MOVEI C,-SDBUDB(P4) ;GET PACK # + SDB LOCATION
1) SUB C,D ;COMPUTE PACK NUMBER IN STRUCTURE
1) ADD A,C ;OFFSET BY PACK # WITHIN STRUCTURE
1) MOVE C,SDBTYP(D) ;GET ADDR OF DSKSIZ TABLE
1) IMUL A,SECCYL(C) ;CALCULATE LOGICAL DRUM ADDRESS
1) ADD A,B ;GET THIS SWAP ADDRESS
1) TXO A,DRMOB ;SAY IS AN OVERFLOW ADDRESS
1) MOVEM A,CKBDRA ;SAVE ADDRESS TO BE ASSIGNED
LINE 31, PAGE 81
2) MOVEI B,-SDBUDB(P4) ;GET PACK # + SDB LOCATION
2) SUB B,D ;COMPUTE PACK NUMBER IN STRUCTURE
2) IMUL B,SDBNSS(D) ;GET FIRST SWAP SECTOR ON THIS PACK
2) ADD A,B ;GET THIS SWAP ADDRESS
2) TXO A,DRMOB ;SAY IS AN OVERFLOW ADDRESS
2) MOVEM C,CKBNDX ;SAVE BAD PAIR INDEX
2) MOVEM A,CKBDRA ;SAVE ADDRESS TO BE ASSIGNED
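
The arithmetic in the patch can be checked against the splitting scheme
described above with a small Python model. It is a sketch under stated
assumptions: the function and parameter names are invented, addresses are
plain sector numbers starting at zero, and the DRMOB bit and any base
offsets are ignored.

```python
def drum_to_disk(drum_addr, sec_per_cyl, num_packs):
    """Forward split as described for GTCUB3: successive cylinders' worth
    of drum addresses go to successive packs. Returns (pack, offset)."""
    cyl, sector = divmod(drum_addr, sec_per_cyl)
    cyl_on_pack, pack = divmod(cyl, num_packs)
    return pack, cyl_on_pack * sec_per_cyl + sector


def disk_to_drum(pack, offset, sec_per_cyl, num_packs):
    """Reverse of the split, following the arithmetic of the patch."""
    cyl_on_pack, sector = divmod(offset, sec_per_cyl)  # IDIV A,SECCYL(C)
    cyl = cyl_on_pack * num_packs + pack               # IMUL, then ADD A,C
    return cyl * sec_per_cyl + sector                  # IMUL, then ADD A,B


def old_disk_to_drum(pack, offset, swap_sectors_per_pack):
    """The unpatched linear calculation: pack * SDBNSS + offset."""
    return pack * swap_sectors_per_pack + offset
```

With 4 sectors per cylinder and 3 packs, drum address 5 lands on pack 1
at offset 1, and disk_to_drum maps it back to 5, while the old linear
formula (16 swap sectors per pack) gives 17 and marks the wrong page.
With a single pack the two formulas agree, which is why a one pack PS:
never shows the bug.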
-------
25-Dec-79 01:16:01-EST,4112;000000000001
Mail from SU-SCORE rcvd at 25-Dec-79 0115-EST
Mail-from: RUTGERS rcvd at 24-Dec-79 2107-PST
Date: 19 Dec 1979 0207-EST
From: HEDRICK at RUTGERS
Subject: send to NVT's
To: admin.mrc at SU-SCORE
Remailed-date: 24 Dec 1979 2126-PST
Remailed-from: Mark Crispin <Admin.MRC at SU-SCORE>
Remailed-to: [SU-SCORE]<Admin.MRC>Tops-20.DIS.22: ;
We noticed that SEND * would randomly hang. At first we thought
it was an RH11 problem, especially since it cleared up for a while.
Then when we got back on the net, the problem came back. Very
suspicious... Turns out that SEND to a NVT doesn't always work.
What happens is that TTMSG puts the message in the sendall buffer,
and then calls a routine to start output (STRTOU). For normal
F.E. lines, this routine actually starts the output. But NVT's
are odd. In this case, they just set a flag. Then an
asynchronous process checks it now and then. When the flag is
on, the asynchronous process knows some NVT wants to do I/O,
so it then checks all the NVT's. (This seems an excessively
indirect way to do things, doesn't it?) The check consists
of
SOBE
start output
Unfortunately, the SOBE may well skip, since it checks only the
conventional output buffer, and the message is in the sendall
buffer. What happens is that the next time some output would
normally go to the terminal, the output process starts and you
get the send, as well as the normal output. But if nothing is
happening to the terminal, output on it never gets started, and
the message is not delivered. The sender's job is meanwhile
waiting for the character count in the sendall buffer to go to
zero. Needless to say, this doesn't happen. So his job
appears to be hung. It turns out that the wait will eventually
time out. The timeout is .5sec * the number of characters in
the message. At that point, the message gets thrown away, and
the sender's job goes on.
The fix seems to be to start output also if there is a sendall.
There are two choices: the sendall buffer count is non-zero,
or the sendall-in-progress bit is on. The latter seems cleaner
for some minor reasons. Anyway, here is the patch. It is
for 3A. I have also looked at rel. 4 somewhere or other, and
I conjecture that the same bug is present there.
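
The old and new checks, and the sender's timeout, can be modeled in a few
lines of Python. The function names and arguments are invented for
illustration; only the logic (the old check looks at the conventional
output buffer alone, the new one also honors the sendall-in-progress bit,
and the wait times out after .5sec per character) comes from the
description above.

```python
def needs_output_old(output_count, sendall_active):
    # Old check: only the conventional output buffer is examined.
    return output_count > 0


def needs_output_new(output_count, sendall_active):
    # Patched check: a sendall in progress also counts as pending output.
    return sendall_active or output_count > 0


def sender_timeout_seconds(message):
    # The sender waits 0.5 sec per character before giving up.
    return 0.5 * len(message)
```

An idle NVT with a sendall pending fails the old check and passes the new
one. For an 80 character message the sender would otherwise appear hung
for 40 seconds before the message is thrown away.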
The following patch is to IMPDV.MAC. We also put a note in
TTYSRV warning anyone who modifies it that ttsal is also defined
in IMPDV.
It is possible that this won't be much use to other sites. It
is important to us because
- we like to warn all users, via SEND *, when something funny
is going on in the hardware
- SEND is not a privileged command on our system. (i.e.
TTMSG is non-privileged, and the EXEC has a SEND command
that is just like the old ^ESEND, as well as taking
user names as arguments.)
***** CHANGE #1; page 1, line 1; page 1, line 1
---------------------------------
; -*- Macro -*-.=9195
***** CHANGE #2; page 2, line 16; page 2, line 16
;***TEMP-PARAMS***
PIESLC==0
;***TEMP-PARAMS
---------------------------------
;***TEMP-PARAMS***
PIESLC==0
;***TEMP-PARAMS
;[35] offsets in dynamic data in tty area (copied from TTYSRV)
;[35] anything changed here should be changed there, too
ttflg1==0 ;[35]
tt%sal==1b0 ;[35] sendall being done to this line
mskstr ttsal,ttflg1,tt%sal ;[35]
***** CHANGE #3; page 16, line 11; page 16, line 11
MOVE P1,NVTPTR ;COUNT THRU NVT LINES
IMPTS1: HRR T2,P1 ;GET TERMINAL NUMBER IN T2
CALL LCKTTY ;GET ADDRESS OF DYNAMIC DATA
JUMPLE T2,IMPTS2 ;IF NON-STANDARD BLOCK CHECK FOR OUTPUT
PUSH P,T2 ;SAVE ADDRESS OF DYNAMIC DATA
CALL TTSOBE ;ANY OUTPUT?
CALL NETTCS ;YES
---------------------------------
MOVE P1,NVTPTR ;COUNT THRU NVT LINES
IMPTS1: HRR T2,P1 ;GET TERMINAL NUMBER IN T2
CALL LCKTTY ;GET ADDRESS OF DYNAMIC DATA
JUMPLE T2,IMPTS2 ;IF NON-STANDARD BLOCK CHECK FOR OUTPUT
PUSH P,T2 ;SAVE ADDRESS OF DYNAMIC DATA
JN TTSAL,(T2),IMPT1A ;[35] IF IN SENDALL, THERE IS OUTPUT
CALL TTSOBE ;ANY OUTPUT?
IMPT1A: CALL NETTCS ;YES
-------
-------
25-Dec-79 02:33:36-EST,1326;000000000001
Mail from SU-SCORE rcvd at 25-Dec-79 0233-EST
Date: 24 Dec 1979 2301-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: follow up to Chuck Hedrick's message about NVT sendall
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.22: ;
Chuck's patch does indeed fix the problem of sendall to NVT's
hanging until output to the NVT is ready. Actually, any TTMSG to
an NVT will hang this way, and since we also have unprivileged
TTMSG fixing this problem was also desirable for us.
However, I felt that having IMPDV know about the format of
dynamic data wasn't the right way to go about it, so I wrote a
different patch, hitting at the root of the problem in TTSOBE.
In TTYSRV.MAC, change:
TTSBE1: JN TTOTP,(T2),R ;NONSKIP IF OUTPUT IS STILL ACTIVE
to:
TTSBE1: LOAD T1,TSALC,(T2) ;TRY TO GET SOMETHING REASONABLE IN T1
JN <TTSAL,TTOTP>,(T2),R ;NONSKIP IF OUTPUT ACTIVE OR SENDALL
What this does is make SOBE non-skip if there is a pending
sendall at the same time it checks for output active. In
addition, T1 is set up with the size of the sendall buffer, in an
attempt to try to have SOBE always return a defined quantity in
AC1 -- at worst, you'll get 0 in AC1, but that's better than
randomness if output was active; and if it's non-zero then it's
probably the right thing.
-- Mark --
-------
26-Dec-79 03:55:58-EST,727;000000000001
Mail from SU-SCORE rcvd at 26-Dec-79 0355-EST
Date: 26 Dec 1979 0007-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: FTSCTL bug
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.22: ;
Problem: FTSCTL occasionally hangs during ICP when talking to a host
not in the system's hash table (for example, an unknown host) in release
4. It is hanging after sending the socket on the contact socket, but it
doesn't try to open up the telnet sockets.
Diagnosis: The TLO A,(1B1) at ACPT1+1 should be removed. It's a good
idea anyway, if only to speed up ICP. If the FTP contact socket is
wedged, hanging for it isn't going to help things any, and it's a bug
the way it now interacts with the NCP and other hosts.
-------
26-Dec-79 03:56:19-EST,601;000000000001
Mail from SU-SCORE rcvd at 26-Dec-79 0356-EST
Date: 26 Dec 1979 0011-PST
From: Mark Crispin <Admin.MRC at SU-SCORE>
Subject: more on FTSCTL
To: [SU-SCORE]<Admin.MRC>Tops-20.DIS.22: ;
(This is probably for release 4 only)
It takes 5 seconds after the current on-shelf job is given to be an FTP
server before a new one is made. At ZATACH+3, between the SETOM PNDJOB
and JRST REGO you should insert a PUSHJ P,GETJOB to cause a new on-shelf
job to be made as soon as possible.
Otherwise, multiple FTPs trying to get at the same host can only come in
one every 5 seconds or so.
-------