Thierry Berthier and Thomas Anglade's Podcast: gains and metrics in incident response automation

SOCs (Security Operations Centers) are oftentimes overloaded with alerts to handle. The volume of these security alerts and the number of “false positives” increase steadily as threats become more complex. Analysts spend a lot of time producing reports and defining activity metrics. SOCs must manage dozens of miscellaneous tools from different vendors. The response to incidents is still quite manual and “reactive”. Yet, the longer the reaction times, the greater the impacts. And so, how can we reduce the time between detection and response to incidents?

Among the solutions are the use of machine learning in detecting threats as well as the orchestration and automation of incident response. This is what Thierry Berthier, Researcher and Director of the “Sécurité Intelligence Artificielle” (Security Artificial Intelligence) group of the France AI Hub, and Thomas Anglade, Lead Data Scientist, presented during the 2020 Forensik Conference.

Together, they took stock of the current state of Security Operations Centers (SOCs) to put into perspective the need to orchestrate and automate incident response playbooks, in order to respond effectively to attacks and reduce the risk of data loss. Next, they explained how human intelligence and artificial intelligence work together to:

optimize the detection of potential threats;
reduce false positives;
and continuously improve the investigation of incidents.

To learn more about the “benefits and metrics in incident response” aspect, they accepted our invitation to come on our podcast, INTRASEC, In Fidem’s cybersecurity channel.

During your conference, you emphasized Key Performance Indicators and how to measure them in incident management and response. Can you tell us more?

T. Berthier: Performance, above all, comes down to saving time. Currently, when we have an IT incident, the response is technically carried out by human resources, the teams in place.

Sometimes there are processes that are automated, but they are typically hybrid techniques: part human, part technology. The goal now is to automate incident response as much as possible in order to save people time. That was the main idea of the conference: what are some solutions to save time, reduce incident response times, etc.?

T. Anglade: The problem today is still that the team that oversees the monitoring and surveillance of the network lacks resources and has far too much information to process given those resources and its limited time. At the same time, there are a lot of misconceptions about artificial intelligence. Many people imagine that it works independently and is able to make decisions. In other words, it would be a kind of algorithm able to say with 100% accuracy that there is an attack as soon as it occurs. In reality, we must ask “How can I use these algorithms to optimize the handling of information given the resources I have?” You have to see AI as a tool that allows you to work faster on a project.

So, we have to find a way to be more efficient …

T. Berthier: Exactly, and especially with regards to targeted attacks. The idea is to reduce detection and response times because today the average time is over 100 days. So, between the time at which the attack takes place and when it is detected, more than 100 days can go by. So, the damage has been done for quite some time and so we must absolutely reduce these delays. Automation will help reduce these detection and response times. We really are in a race against the clock. And as Thomas said, artificial intelligence will make it possible to reduce these response times, not for everything, but for anything that can be automated.

Can you give me an idea of the current response mechanism versus the proposed response mechanism?

T. Anglade: The current mechanism is more human-centered. It relies on a SIEM, a monitoring system that stores events and processes them, and on an organization around this SIEM made up of people who bring their expertise and are able to tell when the recorded events appear suspicious. We then come to the “ticketing” system, where I have three levels: level 1, level 2 and level 3.

Level 1: I create a ticket for any event that seems suspicious.
Level 2: I do a deeper analysis, and if the event is potentially an attack, I go to level 3.
Level 3: This is where I qualify the attack and trigger a response.
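As an illustration only, the three levels above can be sketched as a small escalation model. The names Level, Ticket and escalate are hypothetical and do not come from any particular SIEM or ticketing product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    TICKET = 1    # level 1: a ticket is opened for any suspicious event
    ANALYSIS = 2  # level 2: deeper analysis of the event
    RESPONSE = 3  # level 3: the attack is qualified and a response is triggered


@dataclass
class Ticket:
    event: str
    level: Level = Level.TICKET


def escalate(ticket: Ticket, looks_like_attack: bool) -> Ticket:
    """Move a ticket up one level while the analyst still sees a potential attack."""
    if looks_like_attack and ticket.level < Level.RESPONSE:
        return Ticket(ticket.event, Level(ticket.level + 1))
    return ticket
```

In this sketch, tickets sitting at the same level have no order among themselves: the decision at each step is a plain yes or no made by the analyst.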

And so, we have something more hierarchical and in the whole process at the moment, there isn’t much digital aid.

This means we are dealing with systematic processing – I have rules in my SIEM that will say what looks suspicious. Once all the suspicious elements have been identified, I end up with a list of events to deal with according to my levels 1, 2 or 3. But I don’t have a way of prioritizing that tells me one event is more suspicious than another. And the added value of these technologies is to allow the prioritization of all these events so that at each stage, I can know what the things are that are the most suspicious and how best to invest my time in responding to these potential attacks.

Is this evaluation activity based on static rules a bit like in an expert system or are they more dynamic?

T. Anglade: Initially, it was really static rules. They still are often static rules, but it depends on the context, the customer, the company. In fact, the more work we do to understand the habits and behaviors of the networks, the more we can put in place more dynamic rules, hence rules that are specific to each company.

For example, if I know that my updates are done every Thursday between 11 p.m. and midnight and I integrate that into my rule system, I will be able to say that, for that company, any event that does not fit this pattern is suspicious. Or conversely, if I see a lot of activity at that time, it’s not necessarily suspicious, it’s just an OS update.
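To make the Thursday example concrete, here is a minimal sketch of such a contextualized rule. The function names and the threshold of 100 events per minute are assumptions chosen for illustration; a real SIEM would express this in its own rule language.

```python
from datetime import datetime

# Hypothetical per-company context: OS updates run on Thursdays, 11 p.m. to midnight.
UPDATE_WEEKDAY = 3  # datetime.weekday(): Monday is 0, so Thursday is 3
UPDATE_HOUR = 23


def in_update_window(ts: datetime) -> bool:
    """True when the timestamp falls inside the known update window."""
    return ts.weekday() == UPDATE_WEEKDAY and ts.hour == UPDATE_HOUR


def is_suspicious(ts: datetime, events_per_minute: int, threshold: int = 100) -> bool:
    """A burst of activity is suspicious unless it happens during the update window."""
    if events_per_minute <= threshold:
        return False
    return not in_update_window(ts)
```

The same burst of activity is therefore flagged on a Friday night but ignored on a Thursday at 11:30 p.m., which is the kind of company-specific context a purely static rule lacks.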

And so gradually, with the help of data science, we will progress in understanding the network to put in place rules that are increasingly dynamic and contextualized, or even move towards algorithms that are able to observe the network and deduce the right rules. This is what allows us to move from a simple expert system to an augmented expert system where human competencies remain central and the AI brings this part of contextualization.

You mentioned in your conference the large number of false positives. How would AI help solve part of this problem?

T. Anglade: That’s a very good question. In fact, that is almost the heart of the matter today, because it directly affects the process of investigation. You have to assume that it’s absolutely impossible, for AI or humans, to have an algorithm that would only detect attacks. It would be like having an oracle who would know what attacks would happen in the future and who would warn us; it is absolutely impossible. So, in fact, false positives are inherent in the process, and the question becomes how we can have alerts that are accurately qualified. What I want are alerts that are relevant, with as few false positives as possible. And so, with the help of AI, we will be able to contextualize. When I have an alert that comes up thanks to the algorithms, I will go look for data and habits of the system. This is where the notion of prioritization will come into play.

In other words, even if today there are different levels, the AI would add a notion of “scoring” to allow me to better assess the alerts at each level.

T. Anglade: Exactly.

T. Berthier: With a risk-based approach, which will prioritize [risks] according to the probability that the sequence of events corresponds to a real attack sequence. This is what AI allows us to do; it is really a statistical approach that takes risk into account. So, this is where there is added value compared to purely static detection. This doesn’t mean that static rule engines should be thrown out. On the contrary, they are still very effective and there are lots of attacks that are detected by these classic rule engines where there is no machine learning behind them. These machine learning models simply signal events that fall outside the statistical normality.
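The idea of scoring events that fall outside statistical normality can be sketched very simply. The example below uses a plain z-score as a stand-in for the score; this is an assumption made for illustration, not the model the speakers describe, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_score(value: float, history: list[float]) -> float:
    """Distance of a value from the historical mean, in standard deviations."""
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: no spread to measure against, so no score is assigned.
        return 0.0
    return (value - mean(history)) / sigma


def prioritize(alerts: dict[str, float], history: list[float]) -> list[tuple[str, float]]:
    """Rank alerts from the most to the least unusual."""
    scored = [(name, abs(z_score(value, history))) for name, value in alerts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Given a history of normal measurements (for example, logins per hour), the analyst receives a ranked list rather than a flat one and can start with the alert that deviates most from the usual behavior.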

After an event, a new rule is created, clients will have to update to detect such events in the future, etc. Is this update problem something we are trying to solve with the new approach?

T. Anglade: Indeed, with AI we cross a threshold. With a static system, unless you have a very good ability to anticipate, for example an expert who would be able to say “Careful, here are the new vulnerabilities we should prepare for” so that I can put rules in place upstream of these vulnerabilities, [we will be reactive]. And that is practically impossible, because every day I would need someone who is aware of vulnerabilities and able to infer rules upstream. Since it is impossible, with static rules, I will always be a little bit late.

With AI, I’m going to take a step back because I’ll be able to see trends and changes better. So, if the AI feeds those changes back to the operator, saying “Watch out! That sort of thing is going to change!”, the operator can immediately see what the changes are in the network and whether there are new rules to be put in place or whether a detection threshold needs adjusting, for example.

However, the next step is to have something self-learning, but we are not there yet, because you can never do without the human ability to say, “I’m familiar with hackers and the darknet, so I can anticipate what their next attacks will be.” We cannot replace that, but we will be able to allow operators to see changes in the network more quickly.

So, AI gives us smarter tools that help us identify new trends faster and follow them more easily. We continue to work as we did before, but we can trust our tool more because we don’t need to check as often whether the rules are outdated, etc.

T. Berthier: Exactly. And on the machine learning part, we can build models with a learning phase on “business” data, which is very complicated to do with fixed or static rules; it is much more complex. We would have to constantly update the rules for the vulnerabilities that come out every day. It’s not just a physical impossibility, it’s also a mathematical impossibility. Today we can say that absolute security does not exist, so we cannot have systematic rulemaking behind it. AI makes it possible to get around this impossibility: it does the best it can without being completely exhaustive on detection, but at least it adds a statistical approach to detection, and there are plenty of cases that show that this does work.

AI thus plays an important role in detection. The response to incidents is still very manual and reactive. And here we are looking to move more towards orchestration and automation of response playbooks. Will AI also play a role in automating this response?

T. Berthier: Exactly. You should be aware that there are several stages in the response to incidents, roughly six.

First, there is the preparation stage, that is, before the attacks. In the security team, we will agree on a plan in case of an attack. For example: who are we going to look for if our system is down or if part of the system is no longer working.
Then there is the identification. What attacked us? What failure do I see? What service is no longer accessible?
Then the containment stage: is my entire network “down”? All or only a part of it? If I am a large business, are there any parts on the network that have been compromised? And on the contrary, which ones are still working? Once we have been able to do this mapping – and sometimes it is quite complicated to do – especially now with the new attack tools, we will try to contain it. We will put barriers between the healthy and infected part to maintain a minimum activity, even if it is degraded.
Then there is eradication. The compromised areas will have to be thoroughly cleaned. This sometimes means reinstalling everything, looking for backups, copies, etc. And be aware that today even copies are targeted by the new ransomware. The ransomware will even start by targeting backups, before even targeting systems to be sure that if we try to restore the system, we will do so with compromised files. Still, the attackers are very smart. This makes the eradication part very complicated and very expensive.
Then follows the phase of recovering the lost data. Can we get them back? Has everything been recovered or has some information been lost despite the backups? Thus, we must do an inventory.
Finally, once the crisis has passed and the system has been restarted, there is the whole feedback phase: lessons learned and know-how. That’s the last step, and it feeds the loop back in the other direction. In other words, we will use it in preparation and planning to then improve the incident response. It truly is a cycle.
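The six stages above, and the way the last one loops back to the first, can be pictured with a short sketch. The stage names are shorthand chosen for this example, not official terminology from the talk.

```python
from enum import Enum


class Stage(Enum):
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    LESSONS_LEARNED = 6


def next_stage(stage: Stage) -> Stage:
    """Return the following stage; the last one feeds back into preparation."""
    stages = list(Stage)  # Enum members iterate in definition order
    return stages[(stages.index(stage) + 1) % len(stages)]
```

Calling next_stage on the last stage returns the preparation stage again, which is the feedback loop described above.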

In addition, there can be a lot of actors involved, like HR: if the attack came from inside, from an unscrupulous employee, HR will be notified.

The legal side also plays a role. If my system is attacked, the regulations in Europe (GDPR) oblige me to warn my suppliers, my service providers, my partners, and the startups and large groups that I work with. I need to make them aware of the attack, the possible data breach and its side effects. And so, the regulatory aspect is part of the incident response. Essentially, there is the technical aspect, but also everything collateral to it.

If I may summarize, there is a playbook that needs to be applied, actions that need to be taken and at the end there is a review to assess the quality of the response. Does this assessment only inform about the response or also the detection?

T. Berthier: It is used everywhere. It feeds into the business data of the company. It can also be used in other companies. So there really is a valuable feedback loop. Was it the correct response? Did we respond quickly enough? Was our response effective? All of these elements will be stored in databases and will allow us to create new responses and new, more efficient playbooks.

And when we talk about reducing the loop between “finding” and “applying better solutions”, are we talking about these playbooks or something else?

T. Berthier: We are indeed talking about the cycle. I talked about the 6 components, and we will try to automate some of them. Some are easier than others. Others, like quarantine, are quite complicated because you have to be able to differentiate between the part of the network that is still healthy and the part that has been compromised. For a human, it is complicated, and for a machine as well, especially with new threats where the malware itself can embed layers of machine learning. The malware can be very stealthy, stay dormant at first, and then decide depending on the context whether to deploy. This was a demo made at BlackHat 2019 by IBM Security Center.

As such, it becomes difficult because it involves detecting them, even if they haven’t done anything on the system. There is a review of the system and at some point, you have to decide what is going to keep working, whether all the traffic is going to be concentrated on that to keep the business operating.

Essentially, it is very complicated. It can be automated in some components, but not everywhere.

How can companies change their practices to integrate these new techniques?

T. Anglade: Actually, it’s more about setting up a data culture. To move towards automation and machine learning that will allow us to save time and improve the quality of detection, we need to become data driven. This means that we must, just like humans, put data at the heart of our processes because it is data that will allow us to make decisions. To be able to calculate high quality probabilities, they must be based on a large volume of data. To be able to say in a relevant way “Hey! Here I estimate that the risk is 98/100”, you must have seen a lot of cases, thousands probably, to trust the probability.

On the other hand, if I only have 10 cases in my database, and I know that eight were fine and two were bad, that gives me a probability of 80%, but it’s only 8 out of 10, so it’s not a huge level of certainty. So, the more data I collect and the more I put it at the heart of my learning and decision-making, the more I will benefit from all these technologies. What’s really going to decide the overall performance of my ability to detect and respond is how much data I have. That is for right now.
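One way to put a number on that lack of certainty is a confidence interval around the observed proportion. The sketch below uses the Wilson score interval, a standard statistical formula; it is offered as an illustration and is not something the speakers mention, and the function name is chosen for this example.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a proportion (Wilson score interval)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

With 8 cases out of 10, the interval runs from roughly 49% to 94%, so the “80%” says very little. With 8,000 out of 10,000, the same observed proportion is pinned down to within about one percentage point on either side.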

And for tomorrow, it will be the interaction between data, algorithm and user. Once I have set up this machine learning base for the first time, one that understands the context well and surfaces the relevant things in a contextualized way, the second question will be “How do we ensure that these algorithms learn continuously?”.

Because, in fact, these algorithms have learned about a context at a given moment. So, their decisions are relevant at that moment. But if I let time pass, after a year they will no longer be relevant because a lot of things will have changed.

So, that requires continuous data from me, and therefore a resource that is available to do this teaching work with my algorithm.

For businesses that are interested in making this transition, what advice would you give? Where should they start?

T. Anglade: Frankly, I would say that these are more questions of governance and change management. First, there is an issue of people and buy-in, because it will depend on the ability of companies to get employees to buy into these new working methods. And, in my opinion, to do this, there are several biases to overcome. The first bias is on the cyber side: thinking that we are going to be replaced by an algorithm. Telling yourself, “I don’t want an algorithm that is going to take my job, because if it does my job, what will I do?” In reality, it is not going to do so at all; we will just improve the overall effectiveness of the team. I also think that it is important for companies to support employees in mastering these technologies, which are much more focused on data and algorithms. We must get past this idea that artificial intelligence does everything on its own: “I click on a button and it sends me the alerts”. It’s a process that I have to engage with and prepare for: clean the data, ingest it, interact between teams, and bring technical teams, data scientists and people who have cybersecurity skills together to work cohesively. So, if we want this very theoretical and mathematical part to work in a real way, everyone has to understand how it is used, what it is for and who needs to communicate with whom. So, the advice I would give is to support people in adopting these technologies, which are the ones that will work in the coming years.

T. Berthier: To add to that, for companies that have questions, the goal is to develop a culture of risk. This is in fact often what is missing. Always keep in mind the classic impact / risk table, where there are certain risks that are unlikely to occur but which have a significant impact on the business. These processes are well known in heavy manufacturing industries; they must be adapted to cybersecurity. Part of this risk culture is the second question: do I have a business continuity plan? What are the existing plans? Have I thought about what an attack might look like, what impact it might have on my system, and what I will do? And when we get this double culture of risk and continuity, we’re already on the right track and we can look more at the technical side and how we can save time, especially in the response.

We touched on it earlier, but today, with a SIEM tool that embeds UBA (user behavior analytics) and automated response, the gain factor is 40. We are therefore 40 times faster thanks to these dynamic playbooks that integrate personnel, processes and technologies, versus the classic human-only response. This is a pretty significant time saver. For example, with a SIEM, it took 20 days to close an incident; using an automated response, it’s more like 5 days.


The full interview is available in French on Ausha, Spotify, Apple Podcasts, Google Podcasts, Podcast Addict.
