* * * * *

                                 Kludge works

The server that this site was being hosted on. The one that went down last
month [1]? Well, last week it was retrieved and left behind my cubicle with
that neat Zen-like emptiness to it. I trotted it down to the data center at
The Company, powered it up, and set it to constantly read from the hard
drives, as I suspected those were the failure point.
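
(For the curious, the test itself was nothing fancy. Below is a rough sketch
in Python of the general idea; the device path and chunk size are placeholders
of my own, not what actually ran on the server. The point is simply to read
the raw drive end to end, over and over, and complain loudly about any I/O
errors.)

    #!/usr/bin/env python
    # Rough sketch only: read a raw drive front to back, forever, and report
    # any I/O errors.  DEVICE and CHUNK are assumptions, not the real setup.
    import sys

    DEVICE = "/dev/sda"      # hypothetical drive under suspicion
    CHUNK  = 1024 * 1024     # read one megabyte at a time

    def read_pass(path):
        """One full pass over the device; returns False on the first error."""
        offset = 0
        with open(path, "rb") as dev:
            while True:
                try:
                    data = dev.read(CHUNK)
                except IOError as err:
                    print("read error at byte %d: %s" % (offset, err))
                    return False
                if not data:                  # hit the end of the device
                    return True
                offset += len(data)

    if __name__ == "__main__":
        device = sys.argv[1] if len(sys.argv) > 1 else DEVICE
        passes = 0
        while read_pass(device):              # loop until a read fails
            passes += 1
            print("pass %d completed, no errors" % passes)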

Today I checked in on the server and found this spewed all over the console
window:

> eth0: command 0x5800 did not complete! Status=0xffff
> eth0: Resetting the Tx ring pointer.
> NETDEV WATCHDOG: eth0: transmit timed out
> eth0: transmit timed out, tx_status ff status ffff.
>   diagnostics: net ffff media ffff dma ffffffff.
> eth0: Transmitter encountered 16 collisions -- network cable problem?
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
>   Flags; bus-master 1, dirty 47985(1) current 48001(1)
>   Transmit list ffffffff vs. f7cea240.
> eth0: command 0x3002 did not complete! Status=0xffff
>   0: @f7cea200  length 8000002a status 8000002a
>   1: @f7cea240  length 8000002a status 0000002a
>   2: @f7cea280  length 8000002a status 0000002a
>   3: @f7cea2c0  length 8000002a status 0000002a
>   4: @f7cea300  length 8000002a status 0000002a
>   5: @f7cea340  length 8000002a status 0000002a
>   6: @f7cea380  length 8000002a status 0000002a
>   7: @f7cea3c0  length 8000002a status 0000002a
>   8: @f7cea400  length 8000002a status 0000002a
>   9: @f7cea440  length 8000002a status 0000002a
>   10: @f7cea480  length 8000002a status 0000002a
>   11: @f7cea4c0  length 8000002a status 0000002a
>   12: @f7cea500  length 8000002a status 0000002a
>   13: @f7cea540  length 8000002a status 0000002a
>   14: @f7cea580  length 8000002a status 0000002a
>   15: @f7cea5c0  length 8000002a status 8000002a
>
> wait_on_irq, CPU 0:
> irq:  0 [ 0 0 ]
> bh:   1 [ 0 1 ]
> Stack dumps:
> CPU 1:00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>        00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> Call Trace:
>
> CPU 0:c1f31f2c 00000000 00000000 ffffffff 00000000 c0109a5d c029d094 00000000
>        f6a2c000 c0108d22 00000000 c02c5ce4 c02001a4 ecf7a000 c1f31f98 c0115621
>        ecf7a000 f6a2c368 c02c5ce4 c1f31f8c c1f30664 c1f30000 c011c5cf f6a2c000
> Call Trace:    [<C0109A5D>] [<C0108D22>] [<C02001A4>] [<C0115621>] [<C011C5CF>]
>   [<C0125234>] [<C0125070>] [<C0105000>] [<C010588B>] [<C0125070>]
>

eth0, by the way, is the network interface.

The hard drives were still running the test I'd left them to, with no errors
whatsoever. Dan, the network engineer, did admit to borrowing the network
cable that had been plugged into that server earlier this morning, and that
was enough to trigger the slew of errors I saw (and I did thank him for doing
that; otherwise I don't think I would have ever found what the problem was).

Well (and it's here that I wish I'd remembered to take pictures).

The rack-mounted case the system is in is not very tall, and certainly not
tall enough to house a PCI (Peripheral Component Interconnect) card standing
upright. So in one of the PCI slots was a special card with PCI slots on it,
such that you could mount cards parallel to the motherboard (instead of
perpendicular). Only you couldn't use the existing mounting bracket on the
PCI card in question. So the mounting bracket on the network card had been
removed, leaving the card held in place only by the friction between the card
and the slot itself.

Not a stable connection.

Problems with this server only really started when some equipment was removed
from the cabinet [2] it was in, and I'm guessing the network cable was bumped
just enough to make life interesting.

On the plus side, that means the server doesn't need to be replaced or even
rebuilt (thankfully!).

[1] gopher://gopher.conman.org/0Phlog:2005/01/16.1
[2] gopher://gopher.conman.org/0Phlog:2004/12/22.2

Email author at [email protected]