Practical Guide to ESM Rules
As defined by the ESM 101 guide, ESM rules are programmed procedures that evaluate events for specific conditions and patterns, and when a match is found actions are triggered. Rules are the centerpiece of the ESM correlation engine.
Rule behavior, performance, and output depend on multiple attributes: conditions, aggregation settings, variables, and actions, as well as the type of rule itself.
This guide will explore some of these attributes as well as some tools (ESM Resources) that can be leveraged when testing rules before they are deployed in a production environment.
What this guide is…
- A document to be used as a reference to help you effectively build your ESM rules
- A quick start guide to help you improve your skills in ESM rules authoring
What this guide is not…
- A written-in-stone guide. Every environment is different, so different conditions and assumptions apply to each one
- A replacement for ArcSight ESM Training
- A replacement for the ESM Console User guide or any other ArcSight official guide
Types of ESM rules
First things first, a brief recap on the types of rules we have available in ESM:
- Lightweight rules: Simple rules, faster than standard rules, that allow only one event condition (no joins, only one type of event). They do not allow aggregation and trigger on every event. Their only available actions are add/remove to an active/session list. They are ideal for tracking events.
- Standard rules: Rules that create correlation events and trigger the defined actions when their conditions (and aggregation settings) are met.
- Pre-persistence rules: Simple rules that allow only one type of event (no joins). They are processed earlier than lightweight and standard rules in the correlation engine. These rules set event field values for incoming base events before they are written to the ESM CORR-e database. These values cannot be modified later, so these rules have to be used carefully.
- Join rules: Standard rules that allow more than 1 type of event. These rules infer a relationship between 2 or more events; for example a join rule can be triggered when a firewall and an antivirus event have the same internal asset as destination. These rules are more resource intensive.
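To make the firewall/antivirus example concrete, the following is a minimal Python sketch of what a join condition checks. It is not how the ESM correlation engine is implemented. The Event class, its field names, the "firewall"/"antivirus" labels, and the 120-second window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    device_product: str        # hypothetical label: "firewall" or "antivirus"
    destination_address: str   # asset targeted by the event
    end_time: float            # seconds since an arbitrary epoch


def join_matches(events, window_seconds=120.0):
    """Return (firewall, antivirus) pairs that share a destination within the window."""
    firewall = [e for e in events if e.device_product == "firewall"]
    antivirus = [e for e in events if e.device_product == "antivirus"]
    pairs = []
    for fw in firewall:
        for av in antivirus:
            same_target = fw.destination_address == av.destination_address
            close_in_time = abs(fw.end_time - av.end_time) <= window_seconds
            if same_target and close_in_time:
                pairs.append((fw, av))
    return pairs
```

The nested comparison hints at why join rules cost more: every candidate from one alias has to be held and compared against the candidates from the other alias.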
Join Rules vs Rule-Active List-Rule approach
Depending on the Use Case being developed, a more performance-effective alternative to Join rules is a Lightweight rule - Active list - Standard rule approach:
1. A lightweight rule looks for a specific event (most of the time a base event) and, when a match is found, writes relevant event field data (such as destination address, destination port, destination zone) to an active list.
2. The active list TTL value defines how long data remains in the active list.
3. A standard rule, as part of its conditions, queries this active list for matches of the event fields relevant to the Use Case.
This approach removes the Join rule restriction of having a reduced timeframe in the aggregation settings tab.
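The three steps can be sketched in Python as a toy model. This is only an illustration under stated assumptions: the ActiveList class, the event dictionaries, and the event names "Port scan detected" and "Malware detected" are hypothetical and do not correspond to any ESM API.

```python
class ActiveList:
    """Toy stand-in for an active list: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> time the entry was last written

    def add(self, key, now):
        self._entries[key] = now

    def contains(self, key, now):
        written = self._entries.get(key)
        if written is None:
            return False
        if now - written > self.ttl_seconds:
            del self._entries[key]  # entry outlived its TTL
            return False
        return True


def lightweight_rule(event, active_list):
    # Step 1: on a matching base event, record the relevant fields.
    if event["name"] == "Port scan detected":
        key = (event["destinationAddress"], event["destinationPort"])
        active_list.add(key, event["time"])


def standard_rule(event, active_list):
    # Step 3: fire only if the event's fields are still in the active list.
    if event["name"] != "Malware detected":
        return False
    key = (event["destinationAddress"], event["destinationPort"])
    return active_list.contains(key, event["time"])
```

In this model the TTL (step 2) plays the role that the aggregation timeframe plays in a join rule, which is why the correlation window can be hours long without keeping events in the rule's memory.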
Creating rule conditions, for the most part, follows the same criteria as creating filter conditions. Filter conditions are detailed in the Practical Guide for ESM filters 1 and Practical Guide for ESM filters 2.
As shown in the image, a key difference from filter conditions is the ability to look for more than 1 type of event (join rules). In this case the rule will trigger when both events are found (following aggregation settings). Instead of defining conditions within the rule's CCE (common conditions editor), we nest filters with the appropriate conditions, as we can then reuse them in other rules or resources.
The Consume After Match option (right-click on each alias, defined by the blue brackets) makes sure an event is evaluated only once and discarded after that evaluation; this reduces the number of correlation events.
Aggregation settings govern how many events must match our conditions within a specific timeframe. As a rule of thumb for performance purposes, this timeframe should be short, the shorter the better, no more than a few minutes. If a longer timeframe is required then consider using the Lightweight rule-Active List-Standard rule strategy instead.
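As a rough illustration of "N matches within a timeframe", the Python sketch below keeps the timestamps of recent matches in a sliding window. The function name and the numbers are assumptions for the example; the real engine's bookkeeping is not documented here.

```python
from collections import deque


def first_threshold_time(match_times, threshold, window_seconds):
    """Return the time at which `threshold` matches first fall within one window, else None."""
    recent = deque()  # timestamps of matches still inside the window
    for t in sorted(match_times):
        recent.append(t)
        # Drop matches that are older than the aggregation timeframe.
        while t - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            return t
    return None
```

For example, first_threshold_time([0, 30, 200, 210, 220], threshold=3, window_seconds=60) returns 220. Every timestamp in the deque stands for an event held in memory, so a longer window means more retained events.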
When aggregating event fields you should only include fields that are needed and/or useful to your Use Case:
In the Aggregate only if these fields are unique box, you should only include fields that are unique (or almost unique) among all other similar events (such as eventID). Event field uniqueness is determined by the context or Use Case. For example, for some Use Cases destination address might be a unique field (X number of hosts infected in the same subnet), but for others it might not (host infected multiple times by the same malware).
Data you want to show up in the correlation event should be explicitly added to the Aggregate only if these fields are identical box. Should you need categorization data or any other event field, such as destination, source or device event fields, these fields need to be added to the aggregation tab (Aggregate only if these fields are identical box) as shown in the image. Local and global variables can be added as well if they need to be evaluated and shown in the correlation event.
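The interplay between the "identical" and "unique" boxes can be approximated with a group-and-count-distinct sketch in Python. The event dictionaries, the field names used as keys, and the aggregate function are hypothetical; the sketch only models the counting logic described above.

```python
from collections import defaultdict


def aggregate(events, identical_fields, unique_fields, threshold):
    """Group events by identical_fields and count distinct unique_fields values per group.

    Returns only the groups whose distinct count reaches the threshold.
    """
    groups = defaultdict(set)
    for event in events:
        group_key = tuple(event[f] for f in identical_fields)
        unique_key = tuple(event[f] for f in unique_fields)
        groups[group_key].add(unique_key)  # repeated values count once
    return {key: len(seen) for key, seen in groups.items() if len(seen) >= threshold}
```

With identical fields (name, destinationZoneName) and unique field destinationAddress, three different hosts infected in the same zone reach a threshold of 3, while one host reported three times counts only once.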
Many rule firings cause many correlation/audit events. If we want to limit rule firings the best triggers to use are OnFirstThreshold, OnTimeUnit or OnTimeWindowExpiration.
Every enabled trigger will produce a correlation event when conditions and aggregation settings are met, even if no action is defined within it.
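To get a feel for how trigger choice affects the number of correlation events, here is a deliberately simplified Python model. It assumes that On Every Event fires once per matching event and that On First Threshold fires once when the threshold is first reached. These semantics are an assumption of the sketch and should be checked against the ESM Console User guide.

```python
def firings_per_trigger(matching_events, threshold):
    """Simplified firing counts for one aggregation window (assumed semantics)."""
    return {
        "On Every Event": matching_events,
        "On First Threshold": 1 if matching_events >= threshold else 0,
    }


def total_correlation_events(matching_events, threshold, enabled_triggers):
    """Each enabled trigger contributes its own correlation events."""
    per_trigger = firings_per_trigger(matching_events, threshold)
    return sum(per_trigger[name] for name in enabled_triggers)
```

Under these assumptions, 50 matching events with a threshold of 5 yield 1 correlation event with only On First Threshold enabled, and 51 when On Every Event is enabled as well.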
When the rule is triggered and the correlation event is created, you can use the Set Event Field action to overwrite an event field if you want an explicit value to appear in that correlation event.
As shown in the image, by using the Set Event Field action, event fields of the correlation event can be overwritten with:
- Fixed value. Any value (string, number) can be set to overwrite the event field. In the example all Category fields are overwritten with fixed (string) values.
- Variables. Local or global variables can be set using the notation $variableInCamelCase. In the example, in the message field the variable $localToGlobalVar is added to the event field.
- Event fields. Event fields can be used to overwrite other event fields, using the notation $fieldInCamelCase. In the example, in the message field the event field $destinationHostName is added to the event field.
The active channel in the Rules Testing section of this guide will show the resulting Message event field used in this example.
When overwriting event fields, the proper data types must be kept: string, integer, date, address.
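The $name notation happens to resemble Python's string.Template syntax, which makes it easy to mimic the substitution and the data type check outside ESM. The sketch below is only an analogy: FIELD_TYPES, render, set_event_field, and the sample values are hypothetical, and ESM performs its own substitution internally.

```python
from string import Template

# Hypothetical subset of event fields and the Python type each should hold.
FIELD_TYPES = {"message": str, "destinationPort": int}


def render(template, values):
    """Replace $name placeholders with variable or event field values."""
    return Template(template).substitute(values)


def set_event_field(event, field, value):
    """Return a copy of the event with `field` overwritten, enforcing its type."""
    expected = FIELD_TYPES[field]
    if not isinstance(value, expected):
        raise TypeError(f"{field} expects {expected.__name__}, got {type(value).__name__}")
    updated = dict(event)
    updated[field] = value
    return updated
```

For instance, rendering "Virus deleted on $destinationHostName ($localToGlobalVar)" with sample values produces a Message string, while assigning the string "445" to destinationPort is rejected because that field is modeled as an integer.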
Categorization is the process of mapping different vendor-specific events to a common set of categories. It allows ArcSight ESM content to be vendor independent, more generic and powerful.
Correlation events should be categorized so they can be easily used by other ESM resources, such as other rules, dashboards and query viewers. In the image above we can see how this correlation event will contain specific categories that will be used to track Virus deleted events across the ESM infrastructure.
Activate Framework Wiki contains examples of how categorization can be applied at the L1 Indicators and Warnings level, so resulting correlation events will be used at the L3 Impact and Threat Analysis layer, using the Data Fusion model.
Local variables can be used within rules to provide more flexibility, but if these variables (same type and fields/data involved) are going to be used by other resources or rules elsewhere in our ESM Server, it is more convenient to create a global variable instead, as global variables are evaluated just once no matter how many rules use them. If you already created a local variable, you can promote it to a global one.
When a rule is evaluated in the correlation engine and at least 1 condition is matched for a particular event, this event is saved in memory. This is called a Partial Match. The time this event is kept in memory depends on the timeframe set in the aggregation tab.
The more partial matches, the more memory is used, which is why it is important to write performance-efficient rules. Rule execution improves when, in AND operators, the most restrictive condition is placed at the top/beginning of the condition list to discard as many events as possible, and, in OR operators, the least restrictive condition is placed at the beginning.
In join rules, a partial match is stored in memory when one of the rule aliases matches one or more events.
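The effect of condition ordering in an AND operator can be demonstrated by counting how many condition checks are performed. The Python sketch below uses made-up events and two hypothetical conditions; it illustrates short-circuit evaluation in general, not the ESM engine's internals.

```python
def count_condition_checks(events, conditions):
    """Evaluate an AND of `conditions` per event; return (matches, checks performed)."""
    matches = 0
    checks = 0
    for event in events:
        for condition in conditions:
            checks += 1
            if not condition(event):
                break  # short-circuit: remaining conditions are skipped
        else:
            matches += 1  # no condition failed
    return matches, checks


# 100 hypothetical events: all are base events, only one is "Virus deleted".
events = [{"name": "Virus deleted" if i == 0 else "Other", "type": "base"} for i in range(100)]


def restrictive(event):
    return event["name"] == "Virus deleted"


def broad(event):
    return event["type"] == "base"
```

Here count_condition_checks(events, [restrictive, broad]) returns (1, 101), while count_condition_checks(events, [broad, restrictive]) returns (1, 200): the same single match, at roughly twice the evaluation work.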
Before testing our rules, a few guidelines:
- ESM rules should be created and tested on a non-production ESM server if one is available. If a non-production ESM server is not available, make sure to create the rule within a folder that is not linked to the Real-Time Rules folder. As we can see in the image, we are creating My Test Rule in a folder different from the personal one (admin's Rules folder), and it is not linked to the Real-Time Rules folder.
- Rules are not evaluated by the ESM correlation engine until they are placed (linked or copied) in the Real-Time Rules folder. You can deploy rules by right-clicking the folder that contains your rules and selecting Deploy Real time. Do not deploy any personal folder into the Real-Time Rules folder.
- Lightweight and Pre-persistence rules create neither correlation nor audit events. You can create these rules as standard rules and, after properly testing them, convert them to lightweight or pre-persistence.
In the process of crafting/testing ESM rules, the first thing we have to validate is that the rule actually triggers on the events we want it to (and not on other events, which would create false positives). This can be done before deploying the rule in real time. To do so, we use the Test button at the bottom-left of the rule edit panel. This option allows us to test the rule against events filtered by an active channel (an existing or new, filtered or unfiltered active channel).
In the following example, My Test Rule is being tested and triggers several times because of the aggregation settings and action triggers specified within the rule. As explained before, its behavior (number of firings) changes when those settings are modified. The active channel also shows the modified Message event field we explained before.
The Rules Status Dashboard (/All Dashboards/ArcSight Administration/ESM/System Health/Resources/Rules) contains valuable resources that will help us test our rules.
The Sortable Rule Stats datamonitor (available in ESM 6.x versions as well as ESM 7 in compact mode) gives us valuable data to measure the performance of the deployed rules. A few things to consider when using this datamonitor:
- Statistics belong to a specific timeframe, shown in the lower-left part of the datamonitor.
- The Time% column shows the CPU time used by every deployed rule. A high Time% does not necessarily mean that the rule is inefficient, but if other resources used to test the rule show decreased performance, the rule attributes should be reviewed.
- ESM automatically disables a rule if certain conditions are met: high CPU time usage, recursion, or an excessive number of generated correlation events.
- A rule's conditions can be modified (as explained in the Practical Guide to ESM Filters - Part 1, by using short-circuit evaluation) if the rule is found to have a high number of partial matches.
The Top Firing Rules datamonitor (part of the same Rules Status dashboard) displays the rules that have fired the most. Many firings might indicate that a rule's conditions are too broad, so many events match them (not all of which might be related to the Use Case), or that too many triggers are enabled in the actions tab (On every Event, On Subsequent Events, and so on). Aggregation settings also impact rule firings.
The Event Throughput Dashboard (/All Dashboards/ArcSight Administration/ESM/System Health/Events/) contains 2 datamonitors that are useful when testing rule performance.
The Event Throughput Statistics datamonitor shows, in the E/s(lmin) column, the current EPS (events per second) received by each connector and by the manager as a whole. If the rule being tested exhibits poor performance, we might see a meaningful drop in the EPS received by the ESM Server.
In this example we can observe a drop from 217 to 28 EPS. This is likely due to the rule that was just enabled. If the EPS drop is sustained and does not recover to the original value, we can disable the rule; if EPS goes up again, that confirms the rule was causing the drop.
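The size of the drop in this example can be worked out with a couple of lines of Python. The helper names and the 50% cut-off used to call a drop "sustained" are assumptions for illustration, not ESM thresholds.

```python
def eps_drop_percent(before, after):
    """Percentage decrease from a baseline EPS to a later EPS reading."""
    if before <= 0:
        raise ValueError("baseline EPS must be positive")
    return (before - after) / before * 100.0


def is_sustained_drop(samples, baseline, min_drop_percent=50.0):
    """True if every later EPS sample stays at least min_drop_percent below baseline."""
    return bool(samples) and all(
        eps_drop_percent(baseline, sample) >= min_drop_percent for sample in samples
    )
```

Going from 217 to 28 EPS is a drop of about 87.1%. If consecutive readings such as 28, 30, and 35 all stay that far below the baseline, the drop counts as sustained and the rule becomes a strong suspect.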
The Event Throughput datamonitor (moving average) will show a decrease in the total events received by the ESM over a longer timeframe.