Dell™ OpenManage™ IT Assistant: Understanding Events -- How to Select Events for Monitoring

Enterprise Systems Group (ESG)

Dell™ OpenManage™

Systems Management


Dell White Paper

By Manoj Gujarathi

Systems Engineer



OpenManage Development




Contents

Section 1: Introduction
Section 2: IT Assistant Event Management System Overview
Section 3: Event Categories/Types and Event Source Organization
Section 4: Understanding the Events in IT Assistant
Section 5: Best Practices While Selecting Events to be Monitored
Section 6: Conclusion

Table 1: Agent Applications supporting SNMP and/or DMI
Table 2: Agent Applications supporting SNMP and/or DMI
Table 3: Environmental Event Agents
Table 4: Memory Event Agents
Table 5: ECC Single Bit Error Counts
Table 6: Network Event Agents
Table 7: Operating System Event Agents
Table 8: Other Events Agents
Table 9: Power Events Agents
Table 10: Processor Events Agents
Table 11: Security Events Agents
Table 12: Software Events Agents
Table 13: Storage Events Agents
Table 14: Critical Events in the Storage Categories
Table 15: Baseline Event Types Associated with Critical Failures

Section 1


Introduction
Dell OpenManage IT Assistant is a browser-based tool that monitors and manages Dell servers, desktops, and portables using the industry-standard Simple Network Management Protocol (SNMP), Desktop Management Interface (DMI), and Common Information Model (CIM) protocols. IT Assistant provides a broad set of features designed to help system administrators carry out important systems management operations in a heterogeneous environment of Dell systems. The feature set includes system discovery and status reporting, comprehensive event management, asset and inventory reporting, remote system configuration, and storage management. (For complete details on the features of IT Assistant, refer to the IT Assistant User’s Guide.)
This paper addresses the following topics in detail:

  • Predefined Event Categories in IT Assistant

  • Detailed description of the critical events (SNMP traps or DMI indications) logged by Dell Agents

  • Best practices for configuring events, and the key critical events that Dell recommends monitoring and acting on

This paper is written to help administrators monitor and manage Dell systems that run different Dell-supported agents. It explains the IT Assistant Event Management System and describes the pre-populated events in detail: what they mean and how to decide which events to select for monitoring. It will also help administrators using Dell OpenManage Connections gain more insight into the events logged by Dell Agents.



Section 2


IT Assistant Event Management System Overview
The events described in this paper are based on the pre-populated categories in the IT Assistant Event Management System (EMS). The EMS is a versatile and powerful tool that allows administrators to monitor specific systems for certain types of events occurring at selected times of the day, and to take pre-defined actions when those events occur. EMS allows administrators to create filters to monitor events from different Dell systems, including servers as well as clients (the managed systems that generate the events). These filters are configurable by severity, source node name, and time period, and administrators can associate actions with each filter. For more details, including how to configure EMS to create filters and actions, refer to the Dell OpenManage white paper Configuring and Using the Dell OpenManage IT Assistant Event Management System by Ross Burns.
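The filter idea described above (match on severity, source node name, and time period, then run an associated action) can be illustrated with a minimal Python sketch. This is not IT Assistant's implementation or API: the class names, field names, severity labels, node name, and action below are all hypothetical and exist only to show the matching logic.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, Optional


@dataclass
class Event:
    node: str        # source node name (hypothetical field)
    severity: str    # e.g. "normal", "warning", "critical" (assumed labels)
    event_type: str  # e.g. "Temperature Failure"
    at: time         # time of day the event was received


@dataclass
class EventFilter:
    name: str
    severities: frozenset
    nodes: Optional[frozenset]  # None means "any node"
    start: time                 # start of the monitored time period
    end: time                   # end of the period (no wrap past midnight)
    action: Callable[[Event], str]

    def matches(self, event: Event) -> bool:
        """True when the event falls inside this filter's criteria."""
        in_window = self.start <= event.at <= self.end
        node_ok = self.nodes is None or event.node in self.nodes
        return in_window and node_ok and event.severity in self.severities


def dispatch(event, filters):
    """Run the action of every matching filter and collect the results."""
    return [f.action(event) for f in filters if f.matches(event)]


# Example: act on critical events from one server between midnight and 06:00.
night_filter = EventFilter(
    name="critical-after-hours",
    severities=frozenset({"critical"}),
    nodes=frozenset({"server01"}),
    start=time(0, 0),
    end=time(6, 0),
    action=lambda e: f"page admin: {e.event_type} on {e.node}",
)
```

Calling `dispatch` with a critical event from `server01` at 02:30 runs the action; the same event from another node, or at noon, matches nothing.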


Section 3


Event Categories/Types and Event Source Organization
The Event Categories in IT Assistant EMS allow users to look at the pre-defined event categories and the Event Types for all the events (traps and indications) pre-populated in IT Assistant. The category group names are: Cluster, Environmental, Memory, Network, Operating System, Other, Power, Processor, Security, Software, and Storage. The Event Types are the actual events, generated by one or more agents, that can be monitored by IT Assistant. Users can rename an event type or even change the names of the predefined categories for simplicity and customization. Note: After a name change, users will have to remove the Event Type from the filter and add it again.
The events can consist of SNMP traps and DMI indications. These are called ‘Events’ or ‘Alerts’ interchangeably throughout the IT Assistant documents and user interface screens. To find the source of an event, select the event type and click the ‘Edit’ button. Different agents can generate the same event, and the event's type (SNMP or DMI) indicates whether it is an SNMP trap or a DMI indication, respectively.


DMI Indications

In order for IT Assistant to receive DMI events, the managed node has to “register” with the IT Assistant management station. This registration happens automatically, through the Remote Procedure Call (RPC) mechanism, when IT Assistant discovers the node.



SNMP Traps

Unlike DMI events, SNMP events are not received simply because IT Assistant has discovered the managed node. Users must configure the SNMP service on the managed node to create a community, discover that node through IT Assistant using that community name, and also create trap destinations so that the IT Assistant management system receives the traps. Note: The SNMP service must be restarted for the change to take effect.
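As a rough illustration of the prerequisites listed above, the following Python sketch checks a managed node's SNMP settings and reports which steps are still outstanding. It does not query a real SNMP service: the `SnmpNodeConfig` record, its fields, and the function name are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SnmpNodeConfig:
    """Hypothetical record of a managed node's SNMP service settings."""
    community: str = ""
    trap_destinations: list = field(default_factory=list)
    service_restarted: bool = False  # restarted since the last change?


def missing_steps(cfg, discovery_community, management_station):
    """List the setup steps from the text that are not yet satisfied."""
    steps = []
    if not cfg.community:
        steps.append("create an SNMP community on the managed node")
    elif cfg.community != discovery_community:
        steps.append("discover the node using the node's community name")
    if management_station not in cfg.trap_destinations:
        steps.append("add the management station as a trap destination")
    if not cfg.service_restarted:
        steps.append("restart the SNMP service")
    return steps
```

An empty result means every prerequisite named in the paragraph is met; the order of the returned steps follows the order in which the text introduces them.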

Table 1 provides the list of agents sending indications/traps to IT Assistant.

Table 1: Agent Applications supporting SNMP and/or DMI



Agent Application | SNMP Traps Support | DMI Indication Support
Broadcom Agent
DMTF (only physical container global table)
Dell Array Manager Agent
Dell DRAC2 Card and Agent
Dell OpenManage Client Instrumentation
Dell OpenManage HIP
Dell OpenManage Server Agent
Dell OpenManage IT Assistant
Dell Remote Assistant Server
Fiber Channel Switch Agent
Giganet Agent
Netware
NuView ClusterX and Veritas ClusterX Agent
RAID Agent (PERC and PERC2)
SCSI Agent (CIO)
SNMP Agent Traps
Veritas ClusterX Agent
Windows
Adaptec CI/O Agent
Dell OpenManage Client Instrumentation
Intel NIC Instrumentation
Qlogic Agent
Symbios Agent

The agents in the above table are either standalone applications or are installed by one or more Dell applications.

Distributed Management Task Force (DMTF) tables include: Cooling Device, Disk Controller, Disks, Electrical Current Probe, Indications, Logical Memory, Mass Storage Logical Drives, Motherboard, Physical Container, Physical Memory Area, Portable Battery, Power Supply, Power Unit, Processor, Structure Dependency, System Cache, System Hardware Security, System Reset, Temperature Probe, UPS Battery, and Voltage Probe.

For event monitoring, certain agents are managed only through DMI (e.g. Dell OpenManage Client Instrumentation), only through SNMP (e.g. Dell Array Manager Agent), or through both (e.g. Dell OpenManage Server Agent). The DRAC2 Agent supports in-band SNMP events (originating from the system), while the DRAC2 Card supports out-of-band SNMP events (originating from the card itself).

For more details on the agent versions and IT Assistant versions supporting those events, see the Dell OpenManage white paper Configuring and Using the Dell OpenManage IT Assistant Event Management System by Ross Burns. To see how the events are pre-populated, also check the IT Assistant Database Management Utility (dcdbmng.exe), which ships with IT Assistant.


Section 4


Understanding the Events in IT Assistant
There are close to 800 events in the IT Assistant database, so it can be very difficult to determine what each event type means and which events to monitor. This is especially difficult in event categories where the event type names are similar and the events differ only slightly. This section addresses this topic and provides details organized by the pre-existing categories defined in IT Assistant.
The following are some important points to remember about the pre-populated event types:

  • While describing events, more focus is put on the events related to Dell Instrumentation Agents and, where possible, the event type (traps or indications) is mentioned.

  • The Dell server instrumentation currently shipping is Dell OpenManage Server Agent (version 4.3); the earlier agent was called Dell OpenManage Hardware Instrumentation Package (HIP). Dell OpenManage Client Instrumentation (OMCI) is for client systems only.

  • Dell OpenManage Server Agent instrumentation events up to version 4.3 are supported through SNMP as well as DMI. In the upcoming version 4.4, only SNMP is supported. OMCI 5.x and 6.0 events are supported through DMI only.

  • Certain event types starting with DMTF are DMI indications converted to SNMP traps by the DMI-to-SNMP mapper. More information on these events is available in DMTF documents.

  • To find the description attached to each event type and its source name, select that event type and click the ‘Edit’ button.

  • Because of the large number of pre-populated events in IT Assistant, only important/critical events for monitoring are described in this paper.

In the following sections, the important events under each category are examined.



Cluster Events

The events in this category are generated by the agents listed in Table 2, along with the type of event.


Table 2: Agent Applications supporting SNMP and/or DMI


Event Sources | Event Types
Dell OpenManage Cluster Assistant with ClusterX Application v. 2.x (Source Names: NuView ClusterX) | SNMP traps
Dell OpenManage Cluster Assistant with ClusterX Application v. 3.x (Source Names: Veritas ClusterX) | SNMP traps

The events in this category are SNMP traps generated by the Dell OpenManage Cluster Assistant with ClusterX Application. The events show the event source as NuView ClusterX or Veritas ClusterX, depending on whether the trap is generated by Dell OpenManage Cluster Assistant version 2.x or 3.x, respectively. The event types starting with ‘WLBS’ (Windows Load Balancing Service) are not generated by the basic Dell OpenManage Cluster Assistant with ClusterX Application; they become available if you upgrade to the ClusterX application from Veritas. The following are the details on the critical events:




  • Failure of a node in an MSCS cluster/Failure of an MSCS cluster – These events are generated, respectively, when a node in a Microsoft Cluster Server (MSCS) cluster fails (e.g. because of a system crash), or when the whole cluster is down (e.g. because the storage system is down).

  • Failure of a resource detected – Generated when a resource such as a disk, or an application such as Exchange, fails.

  • Detection of a failure of the private or public cluster interconnects – Generated when a public or private interconnect, such as the cluster heartbeat, fails.

  • Detected the cluster service wrote a critical event to the NT event log – Generated when a monitored resource writes a critical event to the NT event log. This can be monitored to get critical updates from the cluster resources.



Please refer to Dell OpenManage Cluster Assistant with ClusterX Application documents for additional details.

Environmental Events

The events in this category are generated by the agents listed in Table 3, along with the type of event.


Table 3: Environmental Event Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage HIP/Server Agent | SNMP traps, DMI indications
DRAC | SNMP traps

This category consists of the system environmental events related to cooling devices (fans, blowers), temperature, current, voltage sensor/probes etc. Dell Instrumentation Agents and DRAC agents generate the events. The following events are categorized based on the failing device.


Cooling Device Events (Fans, Blowers)


  • Cooling Device Failure, Warning, Normal – These events occur when the fan sensor exceeds its failure or warning threshold for one or more devices. A normal event is logged when the fan sensor for one or more devices returns to a valid range after crossing the warning or failure threshold. Note: The normal event is logged by Server Agent only. For the HIP agent, you may need to monitor ‘Fan Failure returned to Normal’ or ‘Fan Warning returned to Normal’.

  • Fan Enclosure insertion/removal traps can be monitored to discover interventions in the system fan assembly. Only certain Dell servers – including the PowerEdge 4350, 6350 and 6450 – support Fan Enclosure Extended Removal, and the system may shut down as a result of it.

The above are SNMP traps, and details such as device location and readings are provided with the traps.

  • Cooling Device Status Change events for fans, such as Cooling Device Status Change – Critical (Fan), can be monitored to detect a change in fan status when using server instrumentation through DMI.

Temperature Sensor Events


  • Temperature Failure/Warning, Temperature Failure/Warning returned to Normal – These events occur when a temperature sensor on the backplane board, system board, or drive carrier in the specified system exceeds its failure or warning threshold. Crossing the failure threshold could cause the system to shut down. A normal event is logged when the temperature sensor returns to a valid range after crossing such a threshold.

The above events are SNMP traps, and details such as location and the sensor readings are provided with the trap.

  • Temperature Fault – Critical, Non Critical, Non Recoverable – These are DMI indications, generated by Dell server instrumentation, that are equivalent to the traps described above.

DRAC2 temperature events are generated by DRAC2 agent and can be monitored if the managed node has a DRAC2 card configured.
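The warning, failure, and returned-to-normal pattern used by the environmental events can be sketched as a small state machine in Python. This is an illustration only: the threshold values are invented, only upper thresholds are handled, and the behavior on a direct failure-to-warning transition is an assumption rather than documented agent behavior.

```python
def sensor_state(reading, warning, failure):
    """Classify a reading against upper warning and failure thresholds."""
    if reading > failure:
        return "failure"
    if reading > warning:
        return "warning"
    return "normal"


def threshold_events(readings, warning, failure, label="Temperature"):
    """Return one event name for each change in sensor state.

    Event names follow the pattern used in the text, for example
    "Temperature Failure" and "Temperature Failure returned to Normal".
    """
    state = "normal"
    events = []
    for reading in readings:
        new_state = sensor_state(reading, warning, failure)
        if new_state == state:
            continue  # no threshold crossed, so nothing is logged
        if new_state == "normal":
            events.append(f"{label} {state.capitalize()} returned to Normal")
        else:
            events.append(f"{label} {new_state.capitalize()}")
        state = new_state
    return events
```

For readings of 40, 62, 75, and 45 with hypothetical thresholds of 60 (warning) and 70 (failure), the sketch yields a warning event, a failure event, and then a failure-returned-to-normal event.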



Memory Events

The events in this category are generated by the agents listed in Table 4, along with the type of event.


Table 4: Memory Event Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage HIP/Server Agent | SNMP traps, DMI indications
Dell OpenManage Client Instrumentation | DMI indications

As the category name suggests, the events populated here are related to system memory and are generated by Dell Instrumentation and DMTF tables. Dell Server Instrumentation generates traps as well as indications, so if you are monitoring servers through only one of SNMP or DMI, you can monitor just the corresponding subset of these events to avoid duplication and confusion.



Dell Instrumentation SNMP traps


Memory Device Warning/Failure/Non Recoverable – These are memory ECC error traps. They occur when the memory device pre-failure sensor (which monitors memory modules and detects when memory is about to fail) exceeds the warning, critical, or non-recoverable threshold. These thresholds are expressed as ECC single-bit error counts and are listed in Table 5.
Table 5: ECC Single Bit Error Counts


ECC single bit error count more than: | Event
2 | Memory Device Warning
10 | Memory Device Failure (Critical)
20 | Memory Device Non Recoverable

Please note that a Memory Device Non Recoverable event does not mean that system memory has stopped functioning. It signals a severe problem with the memory, or with the hardware or software using it.
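The Table 5 thresholds can be expressed as a short lookup. The Python sketch below maps an ECC single-bit error count to the corresponding event name; the function name is hypothetical, and the paper does not say over what period the agent accumulates the count, so the sketch simply takes the count as given.

```python
# Thresholds from Table 5. An event applies when the single-bit ECC error
# count is MORE THAN the listed value; entries run from most to least severe.
ECC_THRESHOLDS = [
    (20, "Memory Device Non Recoverable"),
    (10, "Memory Device Failure (Critical)"),
    (2, "Memory Device Warning"),
]


def classify_ecc_count(count):
    """Return the Table 5 event for a count, or None below all thresholds."""
    if count < 0:
        raise ValueError("error count cannot be negative")
    for threshold, event in ECC_THRESHOLDS:
        if count > threshold:
            return event
    return None
```

Because the table says "more than", a count of exactly 2, 10, or 20 stays in the lower band: `classify_ecc_count(10)` is still a warning, and `classify_ecc_count(11)` is the first failure.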



Dell Instrumentation DMI Indications:


  • Memory ECC Errors are mapped from the DMTF SystemChassisExtension table, while Memory Errors are mapped from the DMTF Physical Memory Array table. The Memory Errors event types are the ECC error events equivalent to the SNMP events described above.

  • Memory size increased or decreased – Dell OpenManage Client Instrumentation logs these events only when a change in memory size is detected.


Network Events

The events in this category are generated by the agents listed in Table 6, along with the type of event.


Table 6: Network Event Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Intel NIC Instrumentation | DMI indications
Giganet CLAN agent | SNMP traps
Broadcom Agent | SNMP traps
SNMP Agent | SNMP traps

The following are the important Intel NIC events that can be monitored to get critical updates on NIC status.



  • Adapter initialization failure – Failure to open a handle to the adapter miniport driver because of an initialization failure.

  • Intel NIC Link Down – Generated when the network media state is disconnected.

  • LAN Controller hardware Failure – Generated when the hardware status is not ready (because of a failure).

Note: Intel NIC Link Down, Line Down and Cable unplugged/No LAN activity are the same events. Also note that the S/W error event is no longer generated.

  • The following Intel NIC event types are related to teamed NICs and need to be monitored if you have a teamed NIC configuration:
    The last Adapter has lost link. Network connection has been lost; Preferred Primary Adapter has been detected; The team only has one active adapter; Preferred Primary Adapter has taken over

  • NIC Failover Event – Generated by the Broadcom NIC agent. The event types starting with ‘CLAN’ are generated by the Giganet CLAN agent.

  • Please refer to the documents related to the Intel NIC, Broadcom, Giganet CLAN and SNMP agents for more information.

Operating System Events

The events in this category are generated by the agents listed in Table 7, along with the type of event.


Table 7: Operating System Event Agents


Event Sources | Event Types
SNMP Agent | SNMP traps
Windows OS | SNMP traps

There are only two event types in this category, SNMP Cold Start and SNMP Warm Start. These are generic SNMP events logged by the SNMP Agent, the Windows OS, and the Linux OS (in ITA 6.1). A Cold Start trap signifies that the sending protocol entity is reinitializing itself such that the agent's configuration or the protocol entity implementation may be altered; this trap is generated mostly because of a system crash or restart.


A Warm Start trap is generated when SNMP reinitializes without altering the agent configuration. This is mostly because of a normal restart.

In some cases it is important to monitor the ‘cold start’ trap to detect unintended re-initializations.
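Cold Start and Warm Start are two of the generic trap types that SNMPv1 defines in RFC 1157. The Python sketch below lists the generic trap numbers from that RFC and flags cold starts for attention. The helper function names are hypothetical, and the mapping to the event type names "SNMP Cold Start" and "SNMP Warm Start" simply reuses the names given in this paper; it is not taken from the IT Assistant database.

```python
# Generic trap numbers defined for SNMPv1 Trap-PDUs in RFC 1157.
GENERIC_TRAPS = {
    0: "coldStart",
    1: "warmStart",
    2: "linkDown",
    3: "linkUp",
    4: "authenticationFailure",
    5: "egpNeighborLoss",
    6: "enterpriseSpecific",
}

# Event type names as they appear in this paper (illustrative mapping only).
EVENT_TYPE_NAMES = {
    "coldStart": "SNMP Cold Start",
    "warmStart": "SNMP Warm Start",
}


def describe_generic_trap(number):
    """Return the paper's event type name, or the RFC 1157 name otherwise."""
    name = GENERIC_TRAPS.get(number)
    if name is None:
        raise ValueError(f"unknown generic trap number: {number}")
    return EVENT_TYPE_NAMES.get(name, name)


def needs_attention(number):
    """Flag cold starts, which the text links to crashes or restarts."""
    return GENERIC_TRAPS.get(number) == "coldStart"
```

A filter built on this idea would act on trap 0 and let trap 1 pass as routine, which mirrors the recommendation to watch the cold start trap.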



Other Events

The events in this category are generated by the agents listed in Table 8, along with the type of event.


Table 8: Other Events Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Adaptec CI/O | DMI indications
Qlogic | DMI indications
Dell OpenManage HIP/Server Agent | SNMP traps, DMI indications
DRAC | SNMP traps

The remaining events that don’t fit into other pre-defined categories are included in this category. The events consist of Qlogic NIC events, Adaptec CI/O events and some Dell Instrumentation events.


Events in this category are described according to the source agent. This will help you decide whether to monitor these events, depending on whether you are using that agent on the managed node.

Events from Adaptec CI/O Agent


  • Bus Port Error – this event occurs because of errors in bus port, which is the attachment point for the devices connecting to the bus.

  • Enclosure CI/O Event – this event occurs because of error event regarding entity’s enclosure devices.

  • Existing Object is Gone, Existing Object Replaced -- These events are associated with Mass Storage Association, a DMTF group defining the relationship between various components of the storage system.

  • Volume Set Events -- All these events are defined from DMTF Volume Set Group, which is a contiguous block of logical block addresses for reading and writing user data.

  • These events are DMI indications. Please refer to DMTF documents for more details on these events.



Events from Qlogic Agent


Adapter Error, Adapter Warning, Unknown Adapter Event -- These are the critical errors to be monitored if the node is using the Qlogic agent. These events are DMI indications.

Instrumentation Events


  • Container Security Breach, Logical Device Status Change, Physical Device Status Change – These events are related to the status change in the system containers like chassis, sub-chassis, expansion chassis etc, mapped from DMTF Physical Container Global Table. All these events are DMI indications.

  • Redundancy Degraded – The redundancy unit sensor in the main chassis detected that one of the units of redundancy has failed, but the overall unit is still redundant.

  • Redundancy Lost – Generated when one of the components in the redundancy unit is disconnected, has failed, or is not present. You can monitor both of these events for fans, power supplies, etc. All redundancy events are SNMP traps.
    Note: The redundancy units in the system could be power supplies, fans, AC cords, etc. Add this event when monitoring these components.


  • Thermal Shut down – Generated when the system is configured for thermal shutdown and an error event occurs, such as a temperature sensor exceeding the error threshold.
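The difference between Redundancy Degraded and Redundancy Lost described above can be made concrete with a small Python sketch. The function, its parameters, and the counting rule are one interpretation of the descriptions, assumed for illustration; they are not the agent's actual logic.

```python
def redundancy_event(installed, working, required):
    """Pick the redundancy event for a redundancy unit.

    installed: number of units in the redundancy unit (e.g. power supplies)
    working:   number of units currently present and healthy
    required:  minimum number of units the system needs to run

    Returns None when every installed unit is working (no event).
    """
    if required < 1 or not 0 <= working <= installed:
        raise ValueError("inconsistent unit counts")
    if working == installed:
        return None
    if working > required:
        # A unit has failed, but at least one spare unit remains.
        return "Redundancy Degraded"
    # No spare unit remains.
    return "Redundancy Lost"
```

Under this reading, three power supplies with one failed and only one required still leaves a spare (degraded), while two supplies with one failed leaves none (lost).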



Power Events

The events in this category are generated by the agents listed in Table 9, along with the type of event.


Table 9: Power Events Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage HIP/Server Agent | SNMP traps, DMI indications
DRAC2 | SNMP traps
RAID | SNMP traps

This important category consists of many different events related to the power supply, battery, voltage, current, and temperature coming from all Dell instrumentation agents, Dell Remote Assistant agents and Dell RAID agents. The events consist of SNMP traps as well as indications.



Battery Events


  • DMTF:Portable Battery Critical Combined Batteries Charge, DMTF:Portable Battery Maintenance Required -- These events are generated when the combined charge of all portable batteries in a system is running critically low, or if the battery is defective and needs maintenance.

  • DMTF:UPS Battery Utility Power Lost System On Battery and DMTF:UPS Battery Utility Power Up System Off Battery -- These two events are generated when the primary power used by the system is lost and system starts using UPS Battery, and when the power is back and system stops using UPS Battery.

  • Drac2 Battery Good – Associated with the Dell Remote Assistant Card battery condition; occurs when a battery with low charge is recharged above the specified threshold.

  • RAID: Battery Events – You can monitor these events if you use a Dell PowerEdge RAID Controller to monitor your storage devices.

The events described here are generated by DMTF components, DRAC agents, and Dell PowerEdge RAID Controller (PERC) agents. The DRAC and PERC events can be monitored if DRAC or PERC is in use on the Dell systems.


All the above events are SNMP traps.

Electric Current Events


  • Current Warning/Failure and Returned to Normal – Current sensor on the power supply exceeded its warning or failure threshold. A normal event is logged when the current sensor reading is back to normal after crossing such threshold.

  • Current Probe Non Recoverable – Current sensor detected a value from which it cannot recover.



The above events are SNMP traps and additional details, such as location and readings, are provided with these events.

Voltage Events


  • Voltage Warning/Failure and Returned to Normal – When the voltage sensor exceeds the warning or failure range threshold. A normal event is logged when it’s returned to normal after crossing the threshold.

  • Voltage Probe Non Recoverable – when the voltage sensor in the specified system detects a value from which it cannot recover

  • Above events are SNMP traps and additional details, such as location and readings are provided with these events.

  • Voltage Too High – (for DRAC2 agent) This trap is sent each time a voltage channel reading for the Dell Remote Assistant Card goes out of critical range.

Power Supply Events


  • Power Supply Failure, Power Supply Failure returned to Normal – Occurs when a power supply is disconnected or has failed. A normal event is logged when it returns to normal from such a state.

  • Power Supply Lost Redundancy, Power Supply Redundancy Normal -- Occurs when one of the power supply components in the redundancy unit is disconnected, has failed, or is not present. A normal event is logged when it returns from such a state.

  • Power Supply Degraded Redundancy, Power Supply Redundancy Normal -- The redundancy unit sensor in the main chassis detected that one of the power supply units has failed, but the overall power supply is still redundant. IMPORTANT NOTE: These redundancy events are generated only by the Dell HIP instrumentation agent. To monitor the redundancy events for any redundant unit (including power supply) generated by Dell Server Agent instrumentation, please look at the redundancy event types in ‘Other’ category.


All the above events are SNMP traps and additional details, such as location and readings, are provided with these events.


  • Power Supply Status Change events are all DMI indications. The status change events for power supplies, e.g. ‘Power Supply Status Change - Critical (Power Supply)’, can be monitored to detect status changes in power supplies when using DMI.

  • The AC Power Cord events are generated by the Dell Server Agent instrumentation for redundant AC power cords associated with an AC failover switch. This feature is available only on certain Dell servers, such as the PE2500.



Processor Events

The events in this category are generated by the agents listed in Table 10, along with the type of event.


Table 10: Processor Events Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage Client Instrumentation | DMI Indications, SNMP traps

There are no processor-related events generated by Dell Server Instrumentation. The events in this category come from DMTF and the Dell OpenManage Client Instrumentation agent. The critical events in this category are explained below.




  • DMTF: Motherboard processor failure -- Associated with the processor on the system motherboard

  • Processor Failure – Associated with any type of processor in the system

  • DMTF: Processor Configuration Error and Processor Initialization Failure are self-explanatory; Processor System Up is sent when the processor is initialized properly.

  • Number of Processors Increased/Decreased, Processor Type Changed -- The Number of Processors events are SNMP as well as DMI events, while Processor Type Changed is DMI only.
    Only Client Instrumentation generates these events.



These events could be monitored to find out processor configuration/initialization related changes on nodes, though these should not be frequently occurring.

Security Events

The events in this category are generated by the agents listed in Table 11, along with the type of event.


Table 11: Security Events Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage HIP/Server Agent | DMI Indications
SNMP Agent | SNMP traps

The event types described under this category are extremely important events to monitor on remotely managed nodes.





  • DMTF: Physical container configuration error – Occurs when the chassis or another physical container is not properly configured.

  • DMTF: System Hardware Security Container Security Breach, Security Settings Change – Critical, Security Settings Change – OK – These occur when the chassis intrusion sensor detects a chassis intrusion while the system is in operation. A normal event is generated when the intrusion sensor returns to normal.

  • Security Settings Change – Non-Critical, Security Settings Change – Non-Recoverable - These two events are the same, and ITA EMS will be updated accordingly.

These events are generated by Dell OpenManage HIP/Server Agent through SNMP and DMI.

  • SNMP Community Name incorrect – This event is generated if the ‘Send Authentication Trap’ check box is selected on the ‘Security’ tab of the SNMP service properties page and the SNMP agent receives a request with an incorrect community name, or a request that was not sent from an acceptable host.



Software Events

The events in this category are generated by the agents listed in Table 12, along with the type of event.


Table 12: Software Events Agents


Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell OpenManage HIP/Server Agent | DMI Indications
Dell OpenManage IT Assistant | SNMP traps

Dell OpenManage HIP/Server Agent/Client Instrumentation applications and the Dell OpenManage IT Assistant itself generate these events. All the events in this category are either related to system up status or system down status.



Events from Instrumentation Agents


  • System Up – Critical, Non-Critical and Non-recoverable -- All of these are DMI indications generated by Dell instrumentation agents. Typically they are logged when the system starts after a reboot, reset, or crash. The severity, and hence the message text, depends on system health.

  • Note that System Up – Critical and Non-recoverable are the same event with critical severity; you can monitor either of the two, depending on which message you want to see.



Events from IT Assistant


The System Up and The System Down messages (SNMP traps) are the only two events generated by the IT Assistant application itself. These events, especially the System Down message, can be very important to monitor in order to execute an action when a monitored system goes down. Note that it is IT Assistant, not the SNMP agent, that generates these traps. Even if the SNMP agent is not configured correctly in terms of community name or trap destination, you can configure IT Assistant to send out system up and down traps.
IT Assistant detects the status of the system during discovery. If IT Assistant discovers that the system is powered off, the IT Assistant management station (not the managed node, which is powered off) sends out the system down trap, impersonating the managed node as the sender. This makes it possible to set up a filter for the system up or down trap for that managed node and to associate an action with it.
The trapconfig.cfg file installed with IT Assistant must be configured to receive these events.

Storage Events

The events in this category are generated by the agents listed in Table 13, along with the type of event.


Table 13: Storage Events Agents

Event Sources | Event Types
DMTF (through mapper) | SNMP traps
Dell Array Manager Agent | SNMP traps
RAID Agent (PERC, PERC2) | SNMP traps
Dell Remote Assistant Server | SNMP traps
Dell OpenManage Client Instrumentation | SNMP traps
SCSI Agent (CIO) | SNMP traps
Symbios Agent | DMI indications

The events in this category are logged mainly by the Dell Array Manager Agent and Dell PERC agents. Depending on which storage agent application you are using, you should monitor the events from the corresponding agent. This category contains events on array disks, battery backup units, consistency checks and their progress, controllers, disks, enclosures, drives, mirrors, SMART events, RAID drives, UPS, virtual disks, etc. Because of the very high number of events in this category, it isn’t possible to discuss all the important events here. Most of the events are self-explanatory, and depending on the event subtype, you can be selective in deciding which events to monitor. Table 14 shows the important events for each subtype.


Table 14: Critical Events in the Storage Categories


| Storage Category | Critical Event Types to Monitor |
|---|---|
| Array Disks | Array Disk Failed, Array Disk diagnostics failed, Array Disk Format Failed, Array Disk Initialize Failed, Array Disk Rebuild Failed |
| Consistency Check Events | Check Consistency Failed [3], Consistency Check Error On Logical Drive, Consistency Check Failed On Logical Drive, Consistency Check Failed On Physical Device Failure, Container Failure |
| Controllers | Container Failure [2], Controller Dead, Controller Firmware Mismatch, Internal Controller Hung, Internal Controller I960 Processor Specific Error, Internal Controller Strong-ARM Processor Specific Error, Storage Controller Error [SYMBIOS], Storage Device Error [SYMBIOS], System Disconnecting From Absent Controller |
| Disks | Device Failure [2], Enclosure Fan Error [2], Enclosure General Error [2], Enclosure Power Supply Error [2], Enclosure Temperature Abnormal [2], Enclosure Temperature Over User Threshold [2], Hard Disk Failure events, Hard Disk SCSI Bus Reset Failed, Hard Disk Write Recovery Failed |
| Enclosures | Storage Works Enclosure Failed |
| Drives | Error- Rebuild Of Logical Drive Failed, Logical Drive Critical, Logical Drive Initialization Failed, Mirror Drive Failure [2], Physical Drive Missing On Startup, Rebuild Of Logical Drive Failed |
| SMART Events | IDE SMART Pre-Failure [OMCI], SCSI SMART Pre-Failure [OMCI] |
| RAID Drives | Raid Drive Failed [2], Raid Failed On Lack Of Resource [2], RAID: Check Consistency Aborted [1,2], RAID: Initialize Aborted [1,2], RAID: Initialize Failed [1,2], RAID: Physical Drive State Failed [1,2], RAID: Logical Drive State Offline [1,2], RAID: Logical Drive State Degraded [1,2], RAID: Reconstruction Failed [1,2] |
| UPS | Uninterruptible Power Supply Failed |
| Virtual Disks | Virtual Disk Failed, Virtual Disk Format Failed, Virtual Disk Initialize Failed, Virtual Disk Rebuild Failed, Virtual Disk Reconfig Failed, WARM BOOT Failed, Write Back Error |
| Other | Expand Capacity Stopped with error, Error- Rebuild Stopped, Fan Failure, Initialization Canceled, Initialization Failed, Installation Aborted, Over Temperature, Possible Data Loss, Power Supply Failure, SCSI Command Abort, Server Lost Connection Or Down, Temperature Over Safe Limit |

[1] PERC Agent

[2] PERC2 Agent

[3] OpenManage Array Manager, PERC and PERC2 agents

[OMCI] OpenManage Client Instrumentation

[SYMBIOS] Symbios Agent
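
One way to apply Table 14 is to narrow the list to the storage agents actually installed on your systems. The Python sketch below is illustrative only: the dictionary holds a small subset of the table, the agent labels (`"PERC"`, `"PERC2"`, and so on) and the function name `events_for_agents` are assumptions made for the example, and events that carry no agent marker in the table are left out because the table does not say which agent generates them.

```python
# A subset of Table 14. Each event is tagged with the agents named by its
# footnote marker; the string labels themselves are hypothetical.
CRITICAL_STORAGE_EVENTS = {
    "Check Consistency Failed": {"ARRAY_MANAGER", "PERC", "PERC2"},  # marker 3
    "Container Failure": {"PERC2"},                                  # marker 2
    "Mirror Drive Failure": {"PERC2"},                               # marker 2
    "RAID: Physical Drive State Failed": {"PERC", "PERC2"},          # markers 1,2
    "RAID: Logical Drive State Offline": {"PERC", "PERC2"},          # markers 1,2
    "IDE SMART Pre-Failure": {"OMCI"},
    "Storage Controller Error": {"SYMBIOS"},
}


def events_for_agents(installed_agents):
    """Return the tagged critical events that any installed agent can generate."""
    installed = set(installed_agents)
    return sorted(
        name
        for name, agents in CRITICAL_STORAGE_EVENTS.items()
        if agents & installed  # at least one installed agent reports this event
    )


perc_only = events_for_agents(["PERC"])
```

A system that runs only the PERC agent then ends up with the three events marked [1,2] or [3] in this subset, and none of the PERC2-only events.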


Section 5


Best Practices While Selecting Events to be Monitored
If you want to monitor only the important events for the selected Dell agents and carry out a particular action when they occur, you need to know which of the events pre-populated in the IT Assistant database are critical. Keep the following points in mind:


  • Whenever possible, monitor either SNMP traps or DMI indications, but not both. This avoids duplicate events and the confusion that can arise when the same event is worded differently in each protocol or appears in the subset common to both.



  • When creating custom events for monitoring, make sure that the SNMP OIDs or DMI source definitions (such as Associated Group and Event Type) are not the same as those of any existing event. This matters even if you select only the custom event for monitoring and leave the pre-populated duplicate unselected: because of the way the IT Assistant filtering criteria work, the custom event can be ignored. Remember that you can rename the pre-populated events and change their message text to suit your environment.



  • The event types starting with ‘DMTF’ are DMTF indications converted into SNMP traps. Note that these events are not supported by Dell OpenManage Server Agent instrumentation. The Dell Hardware Instrumentation Package agent and Dell OpenManage Client Instrumentation support a subset of these events.



  • To monitor events, you don’t need to know the SNMP OIDs or DMI Associated Groups. You need these only when creating custom events. You can view the OIDs or group names by referring back to the event type.
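
The check for duplicate source definitions recommended above can be sketched as follows. This is a minimal Python illustration, not an IT Assistant API: the `EventDefinition` record, the function names, the OIDs (which use a made-up enterprise number), and the DMI group and event type values are all placeholders. Treating DMI names as case-insensitive is likewise an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventDefinition:
    """Hypothetical record for one event definition."""
    name: str
    protocol: str      # "snmp" or "dmi"
    source_key: tuple  # (oid,) for SNMP; (associated_group, event_type) for DMI


def normalized_key(event):
    """Build a comparable key so trivially different spellings still match."""
    if event.protocol == "snmp":
        # Treat ".1.3.6" and "1.3.6" as the same OID.
        return ("snmp", event.source_key[0].strip().lstrip("."))
    # Assumption: DMI group and type names compare case-insensitively.
    return ("dmi",) + tuple(part.strip().lower() for part in event.source_key)


def find_conflicts(custom, existing_events):
    """Return the names of existing events that share the custom event's source."""
    key = normalized_key(custom)
    return [e.name for e in existing_events if normalized_key(e) == key]


# Placeholder data: neither the OIDs nor the DMI values come from a real agent.
existing = [
    EventDefinition("Cooling Device Failure", "snmp", ("1.3.6.1.4.1.99999.1.0.10",)),
    EventDefinition("Temperature Fault - Critical", "dmi", ("Temperature Probe", "Critical")),
]
custom = EventDefinition("Datacenter fan alarm", "snmp", (".1.3.6.1.4.1.99999.1.0.10",))
conflicts = find_conflicts(custom, existing)
```

A non-empty result means the custom event collides with a pre-populated one. In that case, renaming the pre-populated event and changing its message text is the safer route.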

Table 15 shows the important events associated with some critical failures in the system. Unless stated otherwise, these events are SNMP traps.


Table 15: Baseline Event Types Associated with Critical Failures


| Critical Failure Type | Event Category | Baseline Event Type to Monitor |
|---|---|---|
| System Fan Failure | Environmental | Cooling Device Failure |
| System Fan in Critical Status | Environmental | Cooling Device Malfunction (DMI only), Cooling Device Status Change - Critical (Fan) (DMI only) |
| Power Supply Failure | Power | Power Supply Failure |
| Power Supply (or Fan) Redundancy Lost | Other | Redundancy Lost (for Server Agent instrumentation; this applies to any redundant unit, such as fans or the AC cords for the power switch), Power Supply Lost Redundancy (HIP instrumentation only) |
| Memory Failure | Memory | Memory Device Failure (this actually means that the memory device pre-failure sensor detected a critical value), Memory Errors (DMI only) |
| Memory Pre-Failure Warning | Memory | Memory Device Warning |
| Processor Failure | Processor | DMTF: Motherboard processor failure, Processor Failure (see the ‘Processor Events’ category for details) |
| Temperature Failure (in a system backplane, board, etc.) | Environmental | Temperature Failure, Temperature Fault – Critical (DMI only) |
| Chassis security breach | Security | Security Settings Change – Critical, Container Security Breach – Critical (DMI only, from the ‘Other’ category) |
| System down | Software | The System Down mssg from IT Assistant (only when using IT Assistant to discover the node) |
| NIC Failure (for Intel NIC) | Network | Intel NIC Link Down, LAN Controller hardware Failure |
| NIC Failure (for Broadcom NIC) | Network | NIC Failover Event |
| Drive Failure | Storage | Logical Drive Critical, Physical Drive Missing On Startup, RAID:Physical Drive State Failed, Mirror Drive Failure |
| Disk Failure | Storage | Array Disk Failed, Device Failure, Hard Disk Failed, Virtual Disk Failed, Hard Disk May Fail Soon |
| Failure of a cluster or node in a cluster | Cluster | Failure of a node in an MSCS cluster, Failure of an MSCS cluster |
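
Table 15 can be combined with the advice to monitor only one protocol. The Python sketch below holds a few rows of the table and returns the baseline events for either SNMP or DMI. It is an illustration under assumptions: the tuple layout, the function name `baseline_events`, and the rule that an event not marked "DMI only" is an SNMP trap (taken from the note that introduces the table) are choices made for the example, not part of IT Assistant.

```python
# A few rows of Table 15: (critical failure, category, event type, DMI only?)
BASELINE = [
    ("System Fan Failure", "Environmental", "Cooling Device Failure", False),
    ("System Fan in Critical Status", "Environmental", "Cooling Device Malfunction", True),
    ("Power Supply Failure", "Power", "Power Supply Failure", False),
    ("Memory Failure", "Memory", "Memory Device Failure", False),
    ("Memory Failure", "Memory", "Memory Errors", True),
    ("Temperature Failure", "Environmental", "Temperature Failure", False),
    ("Temperature Failure", "Environmental", "Temperature Fault – Critical", True),
]


def baseline_events(protocol):
    """Group the baseline event types for one protocol by critical failure."""
    if protocol not in ("snmp", "dmi"):
        raise ValueError("protocol must be 'snmp' or 'dmi'")
    want_dmi = protocol == "dmi"
    selected = {}
    for failure, _category, event_type, dmi_only in BASELINE:
        if dmi_only == want_dmi:  # keep only the events for the chosen protocol
            selected.setdefault(failure, []).append(event_type)
    return selected


snmp_baseline = baseline_events("snmp")
dmi_baseline = baseline_events("dmi")
```

Selecting from only one of the two results keeps duplicates such as Memory Device Failure and Memory Errors from both firing for the same fault.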




Section 6


Conclusion
The Event Management System feature of IT Assistant can be used in various ways to monitor different Dell servers and client systems, and it can be customized to suit your environment. You can select all events for all systems without paying attention to individual events, but if you want to be selective, you need to know which events are important to monitor. Knowing these events also helps you set up email or paging actions for critical events. This paper described the important events generated by all supported Dell agents and explained the granularity and low-level details of each of these events. In planned future releases of IT Assistant, these pre-populated events will be cleaned up and streamlined.
Please refer to the DMTF documents (available at http://www.dmtf.org) for more details on DMTF events, and to the following Dell documents for details about the events generated by specific Dell agent applications.

There are User’s Guides available for:



  • Dell OpenManage Server Agent 4.x

  • Dell OpenManage Hardware Instrumentation Package 3.x

  • Dell OpenManage Array Manager 2.x, 3.x

  • Dell OpenManage IT Assistant 6.x

  • Dell OpenManage IT Assistant Database Management Utility



Other useful references include the following:

  • Dell OpenManage Server Agent Message Reference Guide

  • Configuring and Using the Dell OpenManage IT Assistant Event Management System Article



Manoj Gujarathi (Manoj_Gujarathi@dell.com) is a systems engineer at Dell Computer Corporation (http://www.dell.com). He has over four years of experience in system management applications and currently works as a lead engineer for the Dell OpenManage IT Assistant and Dell OpenManage Connections applications. Manoj has a Master’s in Engineering from Washington State University and a Master’s in Computer Science from Texas Tech University. He is a Microsoft Certified Systems Engineer.


Dell, OpenManage, PowerEdge, PowerVault, and PowerApp are trademarks of Dell Computer Corporation.

Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.

©Copyright 2001 Dell Computer Corporation. All rights reserved. Reproduction in any manner whatsoever without the express written permission of Dell Computer Corporation is strictly forbidden. For more information, contact Dell. Dell cannot be responsible for errors in typography or photography.

Information in this document is subject to change without notice.




August 2001

