HOW TO - Device Status Monitoring with ESM - Detailed Explanation by pbrettle in ArcSight Discussions
A common question that I get focuses on what is known as device status monitoring, or the ability to identify when a log source stops sending logs. ArcSight solutions have had this for a number of years, but there has been a lot of confusion around how it operates, where to obtain the content and how to get it to generate an alert.
This post has been produced to clarify what the content is and how to generate an alert with it. Firstly, please do check out the posts below that cover a lot of the background around this and what is available in ArcMC and ESM:
Specifically, we want to dig into ESM so that we can generate alerts. Please note that device status monitoring is only currently available in ESM for the level of alerting that many customers are looking for. ArcMC can do generic monitoring, but cannot do the individual device monitoring that ESM can do.
For this post, I will be addressing the detail around the Activate packages in the links above. I will be digging into the rules, dashboards and alerting mechanism that is provided.
Firstly, a few words around device status monitoring - this is a standard feature that is generated at the SmartConnector level. A simple counter is kept for each individual device, and an internal audit event is generated if no events are received from that device in the time period. It's adjustable per SmartConnector, but you must have a unique device registered with the SmartConnector. What does this mean? It means that the SmartConnector framework will identify devices based on the following factors:
- IP Address
- Device Product / Vendor
The tracking of the device is done at the SmartConnector framework layer. The alerting and tracking side of device status monitoring operates at the ESM and correlation layer. Anyway, enough of the background, let's dig into the content.
A common requirement is to generate an alert, such as below:
Above we can see a notification generated for a device that is no longer reporting, but clearly this could be an email or whatever you want. So how do we generate these? What we need to do is ensure that device status monitoring is enabled for the SmartConnector. You can view the settings in your usual SmartConnector management screen. Here I am showing the ESM Console view of the settings:
You can see that the 'Enable Device Status Monitoring' setting is in milliseconds. Here we have the monitoring set to 60 seconds (60000 milliseconds). What this means is that the SmartConnector will look to identify if it receives NO log messages within the 60 second time range. You can adjust this (and should) based on the event flows of the log sources. However, typically aim for 1-2 minutes as this usually works for most situations.
What this will generate is the infamous 'agent:043' messages. You can simply set up an active channel to view these messages if you wish, but here is an example. A simple filter based on deviceEventClassId = "agent:043" is sufficient for this:
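The same filter condition can be sketched outside of ESM, for example when working through events exported from an active channel. This is only a minimal Python illustration: the dictionary representation of an event, the function names and the second sample event class ID are assumptions made for the example. Only the deviceEventClassId field and the "agent:043" value come from the content described here.

```python
from typing import Iterable, Iterator, Mapping

# The event class ID used by device status monitoring messages.
DEVICE_STATUS_CLASS_ID = "agent:043"


def is_device_status_event(event: Mapping[str, object]) -> bool:
    """Return True when the event is a device status monitoring message."""
    return event.get("deviceEventClassId") == DEVICE_STATUS_CLASS_ID


def device_status_events(events: Iterable[Mapping[str, object]]) -> Iterator[Mapping[str, object]]:
    """Yield only the agent:043 events, like the active channel filter does."""
    return (event for event in events if is_device_status_event(event))


# Hypothetical sample data: events modelled as plain dictionaries.
sample = [
    {"deviceEventClassId": "agent:043", "attackerAddress": "10.0.0.5"},
    {"deviceEventClassId": "example:000", "attackerAddress": "10.0.0.5"},  # made-up ID
]
matches = list(device_status_events(sample))
```

Here `matches` holds only the first sample event, which is what the active channel would show.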
What does an agent:043 message look like and what does it show? Take a look at one below:
The relevant data is in the Device Custom section. Specifically, the following fields:
- deviceCustomNumber1 - this is the total event count for this device since the connector was started
- deviceCustomNumber2 - this is the event count since the last check, i.e. within the last timer interval
- deviceCustomDate1 - the time stamp of the last event received from this device
There are other relevant fields, but I am noting these as they are critical to understand the status and speed of reporting. Please note that you will want to know the attackerHostName and attackerAddress also as they will refer to the log source.
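To make the field meanings concrete, here is a small Python sketch that pulls the fields listed above into one record. The class and function names, the dictionary form of the event and the sample values are all assumptions for illustration. An event is treated as "silent" when deviceCustomNumber2 is zero, which is my reading of the counter description above rather than anything taken from the product itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Mapping, Optional


@dataclass(frozen=True)
class DeviceStatus:
    """Hypothetical summary of one agent:043 event."""
    host_name: str                        # attackerHostName: the log source host
    address: str                          # attackerAddress: the log source address
    total_events: int                     # deviceCustomNumber1: count since connector start
    events_since_last_check: int          # deviceCustomNumber2: count in the last interval
    last_event_time: Optional[datetime]   # deviceCustomDate1: last event from the device

    @property
    def is_silent(self) -> bool:
        """True when nothing was received during the last timer interval."""
        return self.events_since_last_check == 0


def parse_device_status(event: Mapping[str, object]) -> DeviceStatus:
    """Build a DeviceStatus from an event modelled as a dictionary (an assumption)."""
    return DeviceStatus(
        host_name=str(event.get("attackerHostName", "")),
        address=str(event.get("attackerAddress", "")),
        total_events=int(event.get("deviceCustomNumber1", 0)),
        events_since_last_check=int(event.get("deviceCustomNumber2", 0)),
        last_event_time=event.get("deviceCustomDate1"),
    )


# Made-up sample values, not taken from a real system.
status = parse_device_status({
    "attackerHostName": "fw01",
    "attackerAddress": "10.0.0.5",
    "deviceCustomNumber1": 1520,
    "deviceCustomNumber2": 0,
    "deviceCustomDate1": datetime(2024, 1, 1, 11, 58),
})
```

In this sample the device has sent 1520 events in total but none in the last interval, so `status.is_silent` is true.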
If you have installed the default ArcSight Administration package as well as the Active Content detailed above, you will find the following rules present:
Specifically, I will work backwards on this, starting with the rule called 'Alert - Critical Devices Inactive for more than 1 hour'. When triggered, you will see the following correlation event:
When we take a look at the event in detail, we can see a little more information:
The content deliberately puts steps between the audit events and the alert, as a small misconfiguration can otherwise flood the Console and analyst with hundreds of messages. It steps from the messages through two lists and delays reporting, so that we can buffer this and manage the volume of alerts.
If you check in the lists section, you will find the following relevant lists that will be used:
For alerting purposes, we are interested in two - Critical Devices and Critical Monitored Devices. To trigger the monitoring and alerting, you MUST enter the relevant details into the Critical Devices list. Select the list above, right mouse click it and select View Entries to see the list and added entries:
Simply add entries to the list by pressing the + button and adding them as needed. You must enter the hostname and address of the host, but I do recommend checking and confirming the details from the active channel that we created above. Putting the devices in the list will trigger the detailed alerting process. Without them on the list, NOTHING will be generated.
What happens is that the rules check for the generated 'agent:043' messages and confirm whether the devices have been added to the Critical Devices list - if so, then an entry will be added to the Critical Monitored Devices list. That list holds a much more detailed set of elements, as you can see.
This step process means that we refer to the Critical Devices list and when we see these devices reporting, we will add the data to the Critical Monitored Devices list. If we don't see any logs from these sources, we won't generate any alerts - we MUST receive some logs for these devices before any alerting will be triggered.
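The two-list step can be approximated in a few lines of Python. This is a sketch of the idea only, not the actual rule definition: the set and dictionary stand in for the two active lists, the (hostname, address) key mirrors the two values entered into the Critical Devices list, and the function name and sample devices are made up.

```python
from typing import Dict, Mapping, Set, Tuple

DeviceKey = Tuple[str, str]  # (hostname, address), as entered in Critical Devices


def update_monitored(
    critical: Set[DeviceKey],
    monitored: Dict[DeviceKey, dict],
    event: Mapping[str, object],
) -> bool:
    """Copy an agent:043 event into the monitored list if the device is critical.

    Returns True when the monitored list was updated.
    """
    if event.get("deviceEventClassId") != "agent:043":
        return False
    key = (str(event.get("attackerHostName", "")), str(event.get("attackerAddress", "")))
    if key not in critical:
        # Not on the Critical Devices list: nothing is tracked, so nothing can alert.
        return False
    monitored[key] = dict(event)  # keep the detailed data for later checks
    return True


# Hypothetical devices for illustration.
critical_devices: Set[DeviceKey] = {("fw01", "10.0.0.5")}
critical_monitored: Dict[DeviceKey, dict] = {}

tracked = update_monitored(critical_devices, critical_monitored, {
    "deviceEventClassId": "agent:043",
    "attackerHostName": "fw01",
    "attackerAddress": "10.0.0.5",
})
ignored = update_monitored(critical_devices, critical_monitored, {
    "deviceEventClassId": "agent:043",
    "attackerHostName": "web07",
    "attackerAddress": "10.0.0.9",
})
```

The first call tracks fw01 because it is on the critical list. The second is ignored, which is why a device that was never added to Critical Devices can never produce an alert.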
A sample of the Critical Monitored Devices list:
And more data stored on this active list - it's quite comprehensive!
The rule itself operates pretty simply though. The rule simply checks that the count of events in the time period (remember, see above for the deviceCustomNumber details) drops to zero AND the device itself is on the Critical Monitored Devices list:
Please note that the default rule will only actually trigger 60 minutes AFTER the sending of the 'no more logs' agent:043 message. We really don't want this to be triggered just 60 seconds or so after the SmartConnector message. This is the reason for the content operating this way - if there is a 5 minute delay (which can happen), the message will still be sent from the SmartConnector, BUT as long as we get some data within the 60 minute time interval, the alert will not be triggered.
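The delay behaviour can be modelled with a small Python sketch. I am not reproducing the real rule conditions here: the class, its method and the way time is passed in are assumptions, and the only values taken from the content are the zero event count and the 60 minute delay. The point is simply that a device has to stay silent for the whole delay before anything is raised, and any data in between resets the clock.

```python
from datetime import datetime, timedelta
from typing import Dict

ALERT_DELAY = timedelta(minutes=60)  # the default delay described above


class SilenceTracker:
    """Hypothetical model: alert only after a device stays silent for the full delay."""

    def __init__(self, delay: timedelta = ALERT_DELAY) -> None:
        self.delay = delay
        self._silent_since: Dict[str, datetime] = {}

    def observe(self, device: str, events_since_last_check: int, now: datetime) -> bool:
        """Record one status message; return True if an alert should be raised."""
        if events_since_last_check > 0:
            # Data arrived again, so forget any earlier silence.
            self._silent_since.pop(device, None)
            return False
        # Remember when the silence started (only set on the first silent message).
        started = self._silent_since.setdefault(device, now)
        return now - started >= self.delay


tracker = SilenceTracker()
t0 = datetime(2024, 1, 1, 12, 0)  # made-up start time
results = [
    tracker.observe("fw01", 0, t0),                           # silence starts
    tracker.observe("fw01", 0, t0 + timedelta(minutes=5)),    # still within the delay
    tracker.observe("fw01", 12, t0 + timedelta(minutes=6)),   # data resumes, clock resets
    tracker.observe("fw01", 0, t0 + timedelta(minutes=10)),   # silence starts again
    tracker.observe("fw01", 0, t0 + timedelta(minutes=70)),   # 60 minutes of silence
]
```

Only the last observation returns True, because the burst of data at minute 6 cancelled the first period of silence.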
The action tab shows what will happen when the rule triggers though.
In the case of the standard configuration, this will be blank, so it is recommended to attach a relevant message to the notification and to have this sent to a suitable destination. Here in this example, a message is constructed using the $deviceCustomString1 data and sent to the Device Administrators notification group.
If you don't have a notification group, create one. Go to the Notifications section in the Resource Navigator, add a suitable notification group and then add an escalation level by right mouse clicking it and selecting Add Escalation. This will add a default level of 1, which we can customize.
When we go to customize an escalation level, we can now trigger the method we want to use for the notification. By default this is a console message, but you can change this as needed. Sending to the Console will trigger an alert to appear as a notification in the Console itself. For example, as below:
Alternatively, if you want an email sent for this, simply change the destination type to reflect this. Once you select email, it will present the option to enter an email address to send it to. Once you are happy, simply press Apply to save and push the configuration:
What will now happen is the following flow:
- Update connectors to send device status monitoring events
- Add critical devices to the list
- Once data is received from these critical devices, the critical device monitoring list is updated
- Once data stops from a critical device, after 60 minutes, send an alert to the notification channel
- Action the notification when received - IT WILL KEEP SENDING.
You can of course use the dashboards that are provided also:
The above dashboard shows some operational devices and some not operational, so we can see this clearly and easily. However, should there be a major issue, you might end up with the following:
Either way though, see the notifications above (sending to the console) for the alerts that are generated.
I hope this explains the content, what the agent:043 messages are and how we can trigger reasonable and non-flooding messages for device status monitoring and the generation of messages when a device stops sending logs!