* * * * *

           “Red Alert!” “Where? I don't see any lerts around here!”

Yesterday felt like Friday, mainly because The Weekly Meeting™ was canceled
at the last minute and thus I didn't have to leave The Home Office.

But it was still a weekday today when I got a call rather early from Smirk.
“Sean!” he yelled through the phone into my ear, “not only do we have a
customer down, but we have network traces that show it's our equipment that's
down!” I could hear the Klaxons™ [1] going off in the background.

I mumbled something about calling him back, wandered over to The Home Office,
and started [DELETED-a Level Five Diagnostic Program-DELETED] poking into the
routers.

> Oct 16 11:06:59.490 EDT: %RSP-3-ERROR: MD error 0080000080000000 -Traceback= 60385460 60385B24 60385C48 60386614 603513F4
> Oct 16 11:06:59.494 EDT: %RSP-3-ERROR:  SRAM parity error (bytes 0:7) 80 -Traceback= 60385538 60385B24 60385C48 60386614 603513F4
> Oct 16 11:06:59.494 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SVIP_CYBUSERROR_INTERRUPT: A Cybus Error occured.
> Oct 16 11:06:59.498 EDT: %VIP2 R5K-1-MSG: slot1 CYASIC Error Interrupt register 0xC
> Oct 16 11:06:59.502 EDT: %VIP2 R5K-1-MSG: slot1   Parity Error internal to CYA
> Oct 16 11:06:59.506 EDT: %VIP2 R5K-1-MSG: slot1   Parity Error in data from CyBus
> Oct 16 11:06:59.514 EDT: %VIP2 R5K-1-MSG: slot1 CYASIC Other Interrupt register 0x100
> Oct 16 11:06:59.518 EDT: %VIP2 R5K-1-MSG: slot1   QE HIGH Priority Interrupt
> Oct 16 11:06:59.522 EDT: %VIP2 R5K-1-MSG: slot1   QE RX HIGH Priority Interrupt
> Oct 16 11:06:59.526 EDT: %VIP2 R5K-1-MSG: slot1 CYBUS Error Cmd/Addr 0x8001A00
> Oct 16 11:06:59.530 EDT: %VIP2 R5K-1-MSG: slot1 MPUIntfc/PacketBus Error register 0x0
> Oct 16 11:06:59.534 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SVIP_PMAERROR_INTERRUPT: A PMA Error occured.
> Oct 16 11:06:59.538 EDT: %VIP2 R5K-1-MSG: slot1 PA Bay 0 Upstream PCI-PCI Bridge, Handle=0
> Oct 16 11:06:59.542 EDT: %VIP2 R5K-1-MSG: slot1 DEC21050 bridge chip, config=0x0
> Oct 16 11:06:59.546 EDT: %VIP2 R5K-1-MSG: slot1 (0x00):dev, vendor id    = 0x00011011
> Oct 16 11:06:59.550 EDT: %VIP2 R5K-1-MSG: slot1 (0x04):status, command   = 0x02800147
> Oct 16 11:06:59.554 EDT: %VIP2 R5K-1-MSG: slot1 (0x08):class code, revid  = 0x06040002
> Oct 16 11:06:59.562 EDT: %VIP2 R5K-1-MSG: slot1 (0x0C):hdr, lat timer, cls = 0x00010000
> Oct 16 11:06:59.566 EDT: %VIP2 R5K-1-MSG: slot1 (0x18):sec lat,cls & bus no = 0x08010100
> Oct 16 11:06:59.570 EDT: %VIP2 R5K-1-MSG: slot1 (0x1C):sec status, io base = 0x22807020
> Oct 16 11:06:59.574 EDT: %VIP2 R5K-1-MSG: slot1     Received Master Abort on secondary bus
> Oct 16 11:06:59.578 EDT: %VIP2 R5K-1-MSG: slot1 (0x20):mem base & limit   = 0x01F00000
> Oct 16 11:06:59.582 EDT: %VIP2 R5K-1-MSG: slot1 (0x24):prefetch membase/lim = 0x0000FE00
> Oct 16 11:06:59.586 EDT: %VIP2 R5K-1-MSG: slot1 (0x3C):bridge ctrl     = 0x00030000
> Oct 16 11:06:59.590 EDT: %VIP2 R5K-1-MSG: slot1 (0x40):arb/serr, chip ctrl = 0x00100000
> Oct 16 11:06:59.594 EDT: %VIP2 R5K-1-MSG: slot1 (0x44):pri/sec trgt wait t. = 0x00000000
> Oct 16 11:06:59.598 EDT: %VIP2 R5K-1-MSG: slot1 (0x48):sec write attmp ctr = 0x00FFFFFF
> Oct 16 11:06:59.606 EDT: %VIP2 R5K-1-MSG: slot1 (0x4C):pri write attmp ctr = 0x00FFFFFF
> Oct 16 11:06:59.610 EDT: %VIP2 R5K-1-MSG: slot1 PA Bay 1 Upstream PCI-PCI Bridge, Handle=1
> Oct 16 11:06:59.614 EDT: %VIP2 R5K-1-MSG: slot1 DEC21050 bridge chip, config=0x0
> Oct 16 11:06:59.618 EDT: %VIP2 R5K-1-MSG: slot1 (0x00):dev, vendor id    = 0x00011011
> Oct 16 11:06:59.622 EDT: %VIP2 R5K-1-MSG: slot1 (0x04):status, command   = 0x02800147
> Oct 16 11:06:59.626 EDT: %VIP2 R5K-1-MSG: slot1 (0x08):class code, revid  = 0x06040002
> Oct 16 11:06:59.630 EDT: %VIP2 R5K-1-MSG: slot1 (0x0C):hdr, lat timer, cls = 0x00010000
> Oct 16 11:06:59.634 EDT: %VIP2 R5K-1-MSG: slot1 (0x18):sec lat,cls & bus no = 0x08020200
> Oct 16 11:06:59.638 EDT: %VIP2 R5K-1-MSG: slot1 (0x1C):sec status, io base = 0x2280F0A0
> Oct 16 11:06:59.642 EDT: %VIP2 R5K-1-MSG: slot1     Received Master Abort on secondary bus
> Oct 16 11:06:59.650 EDT: %VIP2 R5K-1-MSG: slot1 (0x20):mem base & limit   = 0x03F00200
> Oct 16 11:06:59.654 EDT: %VIP2 R5K-1-MSG: slot1 (0x24):prefetch membase/lim = 0x0000FE00
> Oct 16 11:06:59.658 EDT: %VIP2 R5K-1-MSG: slot1 (0x3C):bridge ctrl     = 0x00030000
> Oct 16 11:06:59.662 EDT: %VIP2 R5K-1-MSG: slot1 (0x40):arb/serr, chip ctrl = 0x00100000
> Oct 16 11:06:59.666 EDT: %VIP2 R5K-1-MSG: slot1 (0x44):pri/sec trgt wait t. = 0x00000000
> Oct 16 11:06:59.670 EDT: %VIP2 R5K-1-MSG: slot1 (0x48):sec write attmp ctr = 0x00FFFFFF
> Oct 16 11:06:59.674 EDT: %VIP2 R5K-1-MSG: slot1 (0x4C):pri write attmp ctr = 0x00FFFFFF
> Oct 16 11:06:59.678 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SVIP_RELOAD: SVIP Reload is called.
> Oct 16 11:06:59.690 EDT: %VIP2 R5K-3-MSG: slot1 VIP-3-SYSTEM_EXCEPTION: VIP System Exception occurred sig=22, code=0x0, context=0x60A8D368
> Oct 16 11:07:01.714 EDT: %DBUS-3-CXBUSERR: Slot 1, CBus Error
> Oct 16 11:07:01.714 EDT: %DBUS-3-DBUSINTERRSWSET: Slot 1, Internal Error due to VIP crash
> Oct 16 11:07:01.718 EDT: %RSP-3-ERROR: End of MEMD error interrupt processing -Traceback= 60385BF0 60385C48 60386614 603513F4
> Oct 16 11:07:01.842 EDT: %DBUS-3-CXBUSERR: Slot 1, CBus Error
> Oct 16 11:07:01.842 EDT: %DBUS-3-DBUSINTERRSWSET: Slot 1, Internal Error due to VIP crash
> Oct 16 11:07:05.599 EDT: %CBUS-3-CMDTIMEOUT: Cmd timed out, CCB 0x5800FF20, slot 0, cmd code 2 -Traceback= 603E3AF8 603E3FA4 603DC230 603D9F70 603A1E8C 6034462C 602FA6F0 603241C8 603241B4
> Oct 16 11:07:07.623 EDT: %LINK-3-UPDOWN: Interface FastEthernet0/0, changed state to down
> Oct 16 11:07:08.687 EDT: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/0, changed state to down
> Oct 16 11:07:16.527 EDT: %RSP-3-RESTART: cbus complex
>

I've never seen that happen to a Cisco router before.

I called Smirk back. “Smirk, you better get G [our Cisco consultant —Editor]
on the phone to diagnose this issue. I'm out of my league.”

And thus began a few hours of scrambling to get a replacement router for the
customer. By the time I got onsite with a temporary replacement, I was told
it was too late to do the change (the current router was still limping along;
these crashes were happening about every half hour) and that they were
planning on taking the network down at 11:00 tomorrow, so I could do the
replacement then.

Wonderful!

Worse, this isn't the first time this router has had problems. A few weeks
ago we appeared to have a similar issue and ended up replacing one of the
interfaces (the errors weren't nearly as scary then). I remarked at the time
that I had never seen a Cisco router go bad: I've been working with Smirk at
The Company for five years now and this is a first, and even when I worked at
a webhosting company in the late 90s I never saw a bad Cisco router, nor did
I come across one while working at two ISPs (one in the mid 90s, and one
around the turn of the century). Smirk has also informed the customer,
multiple times since then, that they need a redundant router, but the request
never made it past a certain level of management.

Sigh.

[1] http://en.wikipedia.org/wiki/Klaxon
