“Playing in an orchestra is completely different to playing on my own. Sometimes I played, sometimes listened; instead of waiting my turn, I sometimes interrupted another player, sometimes I argued, sometimes agreed.” ― Kevin Crossley-Holland, Heartsong
In my previous entry, I discussed Selecting Use Cases for Automation. Next comes the exciting part of playbook design: technology integration. You still need a disciplined approach, however, if you want to build an effective, efficient playbook.
There are a few factors that you should consider when choosing the technologies you want to leverage inside a playbook. For this blog post, I’m going to focus on the selection and ordering of your available tools, plus the value of using Custom Lists.
With a music orchestra, the results are not pleasing if the members are not carefully selected, coordinated, and timed. If orchestra members lack aligned objectives (e.g. playing the same song), sequencing (e.g. playing the sections in the same order), and synchronization (e.g. following the same beat), the result is a jarring experience for everyone involved.
Security orchestration is much the same as the music orchestra example. If we don’t work in harmony with the rest of the organization, then our role, outputs, and value are diminished.
Let’s look at each of these factors and determine how they could affect your technology selection.
To help explain the challenge I created an example playbook, as shown below.
For this discussion, suppose the trigger is a suspicious IP address alert (IP: 220.127.116.11) and you don’t know if you can trust it. Would you want to just blindly block 220.127.116.11? (Hint: that’s a Google IP address, and unless your organization blocks access to Google, you might not want to make that career-impacting decision.) The right approach in this situation is to validate the alert first.
So great, we need to check out the IP address. We have loads of tools to check an IP address, but before you rush off and query every service you have access to, stop and think about the consequences. Which services do you trust? Which ones provide the best context? Which ones are expensive in terms of resource loading? It might be better to step through a model that doesn’t adversely impact your time-to-detect metrics or break your infrastructure, while also providing the best available information.
What I tend to do in operations is follow an approach that could be defined as the SOEL of automation. Yes, you can groan at the acronym… SOEL stands for Security Operations Event Lifecycle. (The good news is that you can have multiple SOELs for different use cases, just like a cat’s nine lives :-).
Your SOEL is how you process an event in a modern security organization. You need to ask yourself questions like:
- Where do the alerts come from?
- Are they push or pull activities?
- Should I rely on pulling, or is pushing better? (Yes, I will do a blog post about that.)
- When you get the event, when do you create a ticket?
- What type of workflow should you use?
- Is the workflow totally automated, or does a human need to get involved at some point in this particular SOEL?
A good SOEL combines triggers, trusted data sources, people, process flow, interaction, decision points and results.
Referencing the workflow in Figure 1 above, I decide to check Cisco OpenDNS and VirusTotal. These services typically respond quickly and give me a good idea of whether the artifact is potentially good or bad. Notice that I consciously don’t give a definitive verdict of good or bad. I once made the mistake of jumping to a conclusion and basing my decision on a single piece of intel. My advice: always (ALWAYS!) verify and then verify again (in a timely manner). If VirusTotal AND OpenDNS both say an IP address is bad, then I am more confident that the IP address is bad, but I still don’t block it straight away.
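This “require agreement before raising confidence” rule is easy to express in code. Here is a minimal sketch; the two checker functions are hypothetical stand-ins for real OpenDNS and VirusTotal API clients (which would need credentials and error handling in practice), using documentation-range demo IPs:

```python
def check_opendns(ip: str) -> bool:
    """Placeholder: return True if this source flags the IP as bad."""
    return ip in {"203.0.113.10"}  # demo data (TEST-NET-3 range)

def check_virustotal(ip: str) -> bool:
    """Placeholder: return True if this source flags the IP as bad."""
    return ip in {"203.0.113.10", "203.0.113.99"}

def reputation_verdict(ip: str) -> str:
    """Never declare 'bad' on a single source; require agreement."""
    flags = [check_opendns(ip), check_virustotal(ip)]
    if all(flags):
        return "likely-bad"    # high confidence, but still no auto-block
    if any(flags):
        return "suspicious"    # one source only -- verify further
    return "likely-benign"
```

Note that even the two-source “likely-bad” verdict only raises confidence; the block decision still comes later in the workflow.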
Also in figure 1, there is an investigation action that queries “Paul’s Hot IP list” (don’t ask to be added, it doesn’t exist). Would you trust this service? I wouldn’t.
Would you include the action that searches all servers for an IP address? How would the server team react if you performed this action multiple times a day without strong justification, since the action could adversely impact server performance? You should always strive to maintain a good relationship with asset owners across your organization. Impacting the performance of one of their critical servers in an unpredictable and uncontrolled manner is a very quick route to a bad relationship.
What about including the action that searches workstations? In addition to the performance impact, what about the time it would take to complete this action? Perhaps someone has gone on vacation for two weeks and their laptop is offline. How does this factor affect your ability to close a case?
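One way to keep an expensive, slow sweep from holding a case open is to give it a hard time budget and report unreached hosts rather than wait for them. The sketch below assumes a hypothetical `query_endpoint` call standing in for a real EDR query:

```python
import concurrent.futures as cf

def query_endpoint(host: str, ip: str) -> bool:
    """Placeholder: True if `ip` was observed on `host`."""
    return host == "ws-042"

def sweep_workstations(hosts, ip, budget_seconds=30):
    """Collect hits within a hard time budget; hosts that don't answer
    in time are reported, not waited on, so the case can move forward."""
    hits, unreached = [], []
    with cf.ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(query_endpoint, h, ip): h for h in hosts}
        try:
            for fut in cf.as_completed(futures, timeout=budget_seconds):
                if fut.result():
                    hits.append(futures[fut])
        except cf.TimeoutError:
            unreached = [h for f, h in futures.items() if not f.done()]
    return hits, unreached
```

The offline laptop then becomes an explicit `unreached` entry you can follow up on, instead of an invisible blocker on case closure.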
As you can see, your orchestrated workflows must balance response times, costs, value, and thresholds.
Given the options, in this situation I would suggest an economic SOEL triage workflow:
- Determine if others think the IP is dangerous (External Contact)
- Establish a confidence threshold that could impact the severity rating and the level of response (see Custom Lists comments at the end of this post)
- If it is suspect or dangerous, use internal low-cost tools to check for any activity in the environment (Internal Context)
- Based on the available information, escalate, alert, or watch the situation.
- Determine if human decision making is required
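The steps above can be sketched as a single triage function. The helper names, thresholds, and return labels are all assumptions standing in for real playbook actions:

```python
def triage(ip: str, external_reputation, internal_sightings,
           confidence_threshold: int = 2) -> str:
    """Return the next playbook action for a suspicious-IP event."""
    bad_votes = external_reputation(ip)       # step 1: external context
    if bad_votes < confidence_threshold:      # step 2: confidence gate
        return "close-benign"
    sightings = internal_sightings(ip)        # step 3: cheap internal check
    if sightings > 10:                        # step 5: human decision needed
        return "escalate-to-analyst"
    if sightings > 0:                         # step 4: routine response
        return "ticket-server-team"
    return "watch"                            # step 4: monitor only
```

The key design point is the ordering: the cheap, fast external checks gate whether the internal queries run at all.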
Here’s how the workflow might look in reality. As you can see, I start off with just checking external sources since they are fast and provide me with the current assessment of the IP address’ reputation (Maybe skip Paul’s Hot IP List).
After that, I use a Custom List to track the number of times that this IP address has been marked bad, with a time last updated.
Then, using my Custom List, I have a decision point that escalates or notifies an analyst based on the amount of activity. If there are a large number of systems reporting the bad IP the case would be escalated. Otherwise, the system would just send a ticket to the server team to resolve the issue.
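As a rough sketch of that Custom List logic: an in-memory dict stands in here for the platform’s persistent Custom List, and the escalation threshold is an assumed value you would tune for your environment:

```python
from datetime import datetime, timezone

# Stand-in for a persistent Custom List: ip -> {count, last_seen}
hot_ip_list: dict[str, dict] = {}

def record_bad_ip(ip: str) -> None:
    """Increment the sighting count and stamp the last update time."""
    entry = hot_ip_list.setdefault(ip, {"count": 0, "last_seen": None})
    entry["count"] += 1
    entry["last_seen"] = datetime.now(timezone.utc)

def next_action(ip: str, escalation_threshold: int = 5) -> str:
    """Escalate when many systems report the IP; otherwise ticket it."""
    count = hot_ip_list.get(ip, {}).get("count", 0)
    if count >= escalation_threshold:
        return "escalate-case"       # high volume -> analyst involvement
    return "ticket-server-team"      # low volume -> routine ticket
```

The `last_seen` timestamp also lets you age entries out of the list, so a burst of sightings from last quarter doesn’t trigger an escalation today.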
In the example, I also instruct security sensors to watch for any new activity associated with the bad IP to signal that an outbreak is likely occurring.
Imagine trying to do this all at once. The result would be noise, too much data, invalid conclusions, wasted resources, and missed events. By choosing when to do what, you can create a harmonious flow of intelligence and an effective response.
This is only the beginning: we can add more “instruments” and sections to enhance the workflow and make the playbook more impactful.
In summary, you must change your perspective on how you view your security toolkit. Think about the trust level of the data you are using, how quickly it can be retrieved, and how much it costs to perform an action.
VP of Delivery