Alert Management Working Group
Chairperson: Louis Steinberg/IBM

CURRENT MEETING REPORT
Reported by Lee Oattes

AGENDA

o Introduction

o Discussion of draft flow control document

o Preliminary discussion of alert-generation document (note: this was
shelved due to lack of time)

ATTENDEES

1. Bierbaum, Neal/vitam6!bierbaum@vitam6

2. Carter, Glen/[email protected]

3. Cohn, George/[email protected]

4. Cook, John/[email protected]

5. Denny, Barbara/[email protected]

6. Easterday, Tom/[email protected]

7. Edwards, David/[email protected]

8. Fedor, Mark/[email protected]

9. Hunter, Steven/[email protected]

10. Kincl, Norman/[email protected]

11. Malkin, Gary/[email protected]

12. Oattes, Lee/[email protected]

13. Paw, Edison/[email protected]

14. Replogle, Joel/[email protected]

15. Salo, Tim/[email protected]

16. Sheridan, Jim/[email protected]

17. Taft, Vladimir/[email protected]

18. Waldbusser, Steve/[email protected]

19. Wintringham, Dan/[email protected]

20. Steinberg, Louis/[email protected]

MINUTES

1. The meeting of the Alert Management Working Group began with an
introduction from the Chairman (Lou Steinberg).

2. A discussion of several independent implementations of feedback/pin and
polled, logged alerts led to an agreement to adopt these mechanisms in
some form.

3. The following questions were answered by discussion and consensus:

(a) Can we have a read-only alerts_enabled mib object, limiting the
transmission rate of alerts (no shutoff) instead of using feedback? No.
We need a total shutoff mechanism in case a number of alert
generators are "screaming" all at once. The total traffic might be
too much for the manager, and this "stable" situation would not
improve on its own (whereas a disabling mechanism tends to be
self-correcting).
Total shutoff implies the use of a resettable, read-write mib
object.
An automated, timer-based reset mechanism was discussed, but it was
felt that such a system might tend to synchronize the resets of
multiple generators and could still lead to an over-reporting
condition.

(b) Might an automated reset of alerts_enabled from the manager station
create a "blast-off-blast-off..." alert traffic pattern?
Yes, but such a manager would still tend to get only as much
traffic as it could handle: a re-enable is sent only when the
manager isn't swamped (i.e., is capable of sending one).
A manager experiencing such a traffic pattern should readjust its
window before setting alerts_enabled TRUE.
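A minimal Python sketch of the manager-side ordering described in (b). The function name, the `swamped` flag, and the representation of SNMP set requests as (object, value) tuples are all hypothetical; how a new window value is chosen is left to the caller, since the discussion does not say which way it should move.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of one SNMP set request: (mib object, value).
SetOp = Tuple[str, object]


def reenable_ops(swamped: bool, new_window: Optional[int] = None) -> List[SetOp]:
    """Return the ordered sets a manager would issue to re-enable alerts.

    A swamped manager sends nothing, so alerts stay disabled until it can
    cope. If the window is to be readjusted, that set is issued before
    alerts_enabled is set TRUE, so the agent presumably counts against
    the new window as soon as alerts resume.
    """
    if swamped:
        return []
    ops: List[SetOp] = []
    if new_window is not None:
        ops.append(("window", new_window))
    ops.append(("alerts_enabled", True))
    return ops
```

For example, `reenable_ops(False, 60)` yields the window set followed by the alerts_enabled set, while `reenable_ops(True)` yields no sets at all.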

(c) When pin disables alerts due to the generation of many similar
alerts (e.g., link flapping), might we also lose an unrelated alert
from the same system before alerts_enabled is reset?
Yes, but the rate-limiting (as opposed to shutoff) technique has
the same problem: the chance that a single, specific alert gets
through is much lower than the chance that any one of many
identical alerts does.
This problem is minimized by using polled, logged alerts along with
feedback/pin (alerts could still be lost if the log is overwritten).
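A minimal Python sketch of why combining the two mechanisms helps: the generator logs every alert but transmits asynchronously only while alerts_enabled is TRUE, so an unrelated alert raised after pin has tripped is still available to a polling manager until the log overwrites it. The class, attribute, and alert names are hypothetical, and the `sent` list merely stands in for real asynchronous transmission.

```python
from collections import deque


class LoggingAlertGenerator:
    """Hypothetical agent combining a polled log with feedback/pin shutoff.

    Every alert goes into the log; only alerts generated while
    alerts_enabled is TRUE are also sent asynchronously. A log_size of 0
    disables logging, because deque(maxlen=0) discards every append.
    """

    def __init__(self, log_size: int):
        self.alerts_enabled = True
        self.log = deque(maxlen=log_size)  # oldest entries fall off when full
        self.sent = []                     # stand-in for async transmission

    def generate(self, alert: str) -> None:
        self.log.append(alert)
        if self.alerts_enabled:
            self.sent.append(alert)
```

If pin clears alerts_enabled after a flood of "linkDown"/"linkUp" alerts, a later, unrelated alert never reaches `sent`, but it still appears in `log`.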

(d) Should we allow the implementation to decide if alerts are totally
disabled or limited to a max rate? No.
Implementations should be consistent since this affects the way we
manage our alert generators.

(e) Can the alert log in polled, logged alerts be overfilled?
Yes, but the standard suggests that a manager should attempt to
keep the log empty by removing known alerts.
If an individual implementation has no mechanism for removing old
alerts (no set), then the log must wrap when full and the manager
might lose alerts.
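A minimal Python sketch of a fixed-size log that wraps when full, and of a manager keeping it empty by removing alerts it already knows about. The class and method names are hypothetical, `remove` stands in for an SNMP set, and the `overwritten` counter is instrumentation for this example only, not a proposed mib object.

```python
from collections import OrderedDict
from typing import Optional


class AlertLog:
    """Hypothetical fixed-size alert log, keyed by an increasing index."""

    def __init__(self, size: int):
        self.size = size                # 0 means logging is disabled
        self.entries = OrderedDict()    # index -> alert, oldest first
        self.next_index = 1
        self.overwritten = 0            # example instrumentation only

    def add(self, alert: str) -> Optional[int]:
        """Log an alert, wrapping (discarding the oldest) when full."""
        if self.size == 0:
            return None
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # wrap: oldest alert is lost
            self.overwritten += 1
        index = self.next_index
        self.next_index += 1
        self.entries[index] = alert
        return index

    def remove(self, index: int) -> None:
        """Model a manager's set that deletes an alert it already knows."""
        self.entries.pop(index, None)
```

With a two-entry log and no removal, a third alert overwrites the first; if the manager removes each alert once it has read it, the same three alerts fit without loss.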

(f) If using SNMP get-next, do we want the oldest logged element
first, or the newest first?
Clearly the manager wants the oldest first if a full log will
wrap; this gives it the best chance to see the oldest alert (in a
full log) before losing it.
There was no real consensus here, however. It seems this should be
implementation-specific, since it applies only to SNMP and since
the log, being a table, reduces this to the question "are new
table entries added at the top or the bottom of the table?".
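Because the log is a table, the order a manager sees when walking it with get-next follows purely from how row indices are assigned. Below is a small Python model of get-next over a table with a single positive integer index. The function names and both index schemes are hypothetical, including the arbitrary ceiling of 1000 used to make newer rows sort first.

```python
import bisect
from typing import Dict, List, Optional, Tuple


def get_next(table: Dict[int, str], after: int = 0) -> Optional[Tuple[int, str]]:
    """Model SNMP get-next on a table with one positive integer index:
    return the first row whose index is greater than `after`, or None
    at the end of the table."""
    keys = sorted(table)
    pos = bisect.bisect_right(keys, after)
    if pos == len(keys):
        return None
    return keys[pos], table[keys[pos]]


def walk(table: Dict[int, str]) -> List[str]:
    """Repeated get-next from the start of the table, as a manager would."""
    rows = []
    row = get_next(table)
    while row is not None:
        rows.append(row[1])
        row = get_next(table, row[0])
    return rows


alerts = ["first", "second", "third"]  # in order of generation
# New entries added at the "bottom": increasing indices, oldest first.
oldest_first = {seq: a for seq, a in enumerate(alerts, start=1)}
# New entries added at the "top": decreasing indices, newest first.
newest_first = {1000 - seq: a for seq, a in enumerate(alerts, start=1)}
```

Walking `oldest_first` returns the alerts in generation order, while walking `newest_first` returns them reversed, even though get-next itself behaves identically in both cases.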

(g) Can we shrink the log size by stripping out only the "important"
information from each alert?
We could, but we decided we shouldn't. It would require a different
parser at the manager (the entries could not be run through the
alert parser), and we did not know how to decide what information
might be needed (it varies with the protocol and alert type).

(h) How about only logging alerts, and sending an "alert logged" alert
for each new log entry? The manager gets the asynchronous "alert
logged" notice and reads the alert log to determine what happened.
While this is an interesting concept, it was felt that it might
tend to aggravate some of the other logging problems. For example,
if the log is full and not overwriting, the only chance of getting
the alert information is from the asynchronous alert; this scheme
removes that information and replaces it with "see the log".

(i) A discussion of the CPU cycles and memory needed for keeping a log
followed. Since the log size might be settable (to 0), it was felt
that systems could allow managers to disable logging. It was also
felt that the performance and memory hits were not large, but
numbers to confirm this were not available.

4. The following were decided by vote:

(a) Feedback/Pin
Mandatory mib objects:

alerts_enabled    read/write
window (time)     read/optional write
max_alerts        read/optional write
Do not include alert counters as mib objects for this document.
Individual implementors will decide whether they need counts of total
alerts dropped and/or sent, but not everybody likes the idea of adding
more counters as (even optional) mib objects.
Do not allow an optional reduced-rate mode in the over-reporting
condition; require asynchronous alerts to be shut off totally, for the
reasons given in the earlier discussion.
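A minimal Python sketch of how an agent might apply the three mandatory objects. The report fixes only the objects and their access; the counting rule here (a sliding window of `window` seconds, shutting off once `max_alerts` alerts have already been sent within it) is an assumption, as are the class and method names.

```python
from collections import deque


class FeedbackPin:
    """Hypothetical agent-side model of feedback/pin.

    alerts_enabled is read/write; window (seconds, assumed) and
    max_alerts are read/optional write. The sliding-window counting
    rule is an assumption made for this sketch.
    """

    def __init__(self, window: float, max_alerts: int):
        self.alerts_enabled = True
        self.window = window
        self.max_alerts = max_alerts
        self._sent_times = deque()  # send times still inside the window

    def try_send(self, now: float) -> bool:
        """Return True if an alert generated at time `now` may be sent."""
        if not self.alerts_enabled:
            return False  # stays off until the manager re-enables it
        # Forget sends that have aged out of the window.
        while self._sent_times and now - self._sent_times[0] >= self.window:
            self._sent_times.popleft()
        if len(self._sent_times) >= self.max_alerts:
            # Over-reporting: total shutoff, with no reduced-rate mode
            # and no timer-based automatic reset.
            self.alerts_enabled = False
            return False
        self._sent_times.append(now)
        return True

    def set_alerts_enabled(self, value: bool) -> None:
        """Model a manager's SNMP set of alerts_enabled."""
        self.alerts_enabled = value
        if value:
            self._sent_times.clear()  # start counting afresh
```

With `window=10` and `max_alerts=3`, alerts at times 0, 1 and 2 are sent, the alert at time 3 trips the pin, and an alert at time 20 is still suppressed. Only the manager's set of alerts_enabled brings transmission back, matching the decisions against both a reduced-rate mode and a timer-based reset.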

(b) Polled, Logged Alerts
Remove the time field from the table, as most alerts are already
time stamped and the information in an alert should be defined by
the protocol, not by us.