Designing a Vulnerability Management Program

Elliot Graebert
FAUN — Developer Community 🐾
May 17, 2022 · 16 min read


Vulnerability detection and remediation is an essential process for securing any technology-centric business. This post is a deep dive into the topic, following up on The Many Facets of Infrastructure.

There are thousands of resources online about why vulnerability detection and remediation is a good thing, and I am going to assume you have already bought into the idea that “vulnerabilities are bad.” Therefore, this post will cover the following topics.

  • Breadth: What are the six different types of vulnerabilities that need scanning and remediation?
  • Foundation of a good process: In order to minimize overhead and maximize resolution, what key qualities must be baked into the foundation of the process?
  • Key qualities of a scanner: When evaluating a specific scanner, what are the most important qualities to consider?

By knowing the breadth of the domain, some core tenets to build into the foundation of the program, and an evaluation process for a scanner, you’ll be able to suggest and make changes to your vulnerability management program — if you don’t fall asleep first.

What is the vulnerability management process?

A vulnerability management process consists of three core parts, listed below:

  • Detection: Detection is done by a scanner (usually purchased) that checks your assets using vulnerability definitions.
  • Assignment: Assignment handles how the vulnerability results are delivered to the correct engineering group that can take action. This usually takes the form of filing a ticket against the owner of the service.
  • Remediation: Vulnerabilities should either be fixed or purposefully ignored via an exception process. Exceptions take the form of false positives, risk downcasting, or risk acceptance.

Vulnerabilities are everywhere

Sorry to be the bearer of bad news, but it turns out everything is potentially vulnerable — your cloud configuration, your source control practices, your hosts, your containers, your application, your laptop, everything. Unfortunately, the adversaries that want to hurt you know this all too well.

I’ve divided up vulnerabilities into the following six categories.

  • Code vulnerabilities: Detected via static application security testing (SAST), these are in the code itself. They include unpatched libraries, anti-patterns, insecure code constructs, and so on.
  • Deployment vulnerabilities: Deployment vulnerabilities are quickly becoming a favorite among adversaries. There are now scanners that can check for misconfigurations in the CI/CD chain, such as poisoned pipeline execution.
  • Container vulnerabilities: Similar to host vulnerabilities, container vulnerabilities are so closely tied to the application that their detection and remediation processes are often different.
  • Hosting provider vulnerabilities: Hosting provider vulnerabilities are misconfigurations in a cloud provider that directly create an attack path, or violate a fundamental model of network segmentation, the principle of least privilege, data privacy, or something else. The scanner for this type of vulnerability is known as cloud security posture management (CSPM).
  • Host vulnerabilities: Host vulnerabilities cover both unpatched systems and host configurations that violate hardening benchmarks like CIS/STIG. For the purpose of this blog, I am classifying network vulnerabilities as host vulnerabilities since they are detected using similar tooling and processes.
  • Web vulnerabilities: Detected via dynamic application security testing (DAST), these require checking running web applications for vulnerabilities such as the OWASP Top Ten.

Not to double down on the bad news, but there isn’t a one-size-fits-all vulnerability management software that appropriately reduces your risk to all these vulnerabilities. This means that your company might have as many as six contracts with six vendors to deploy six scanners. And people wonder why DevOps and Security folks are so salty.

But there is good news — most scanners are of very high quality, are easy to install, and integrate easily with your existing workflow. Hah, you fool. It’s turtles all the way down. These products are notorious for being difficult to deploy, configure, and maintain. They are also filled with false positives and poor UX, and generally leave a bad taste in your mouth.

In all seriousness, vulnerability scanners are getting much better over time. Companies like Semgrep and Cider are doing some amazing things, and you need to check them out!

Key qualities of a good vulnerability management program

  • Vulnerability detection is frequent and occurs as early in the cycle as possible.
  • False positives, risk downcasting, and risk acceptance are handled by a well-defined, tightly integrated, and fast exception workflow.
  • Vulnerability remediation tasks are part of the normal support queue.

Vulnerability detection is frequent and occurs as early in the cycle as possible

The earlier a vulnerability is detected in the cycle, the cheaper it is to make, test, and roll out a fix.

It often takes multiple attempts to fully resolve a vulnerability. This may occur for several reasons.

  • A weekly patching cycle didn’t cover a vulnerable library that was pinned to a specific version.
  • A vulnerable library was updated in only part of the code base.
  • An HTTPS setting was applied at the wrong proxy layer.

A weekly scan cadence would make iterating on these fixes intolerable. The faster you can validate your fix, the sooner you can go back to doing something more fun (literally anything).

Vulnerabilities can appear without a code change

If you don’t have a lot of experience with vulnerability scanning, you might wonder: why not just run a scanner at commit time and block any pull request that introduces a vulnerability (bad code, an unpatched image, etc.)?

Unfortunately, vulnerability definitions are updated over time, and a piece of code or image that was thought to be safe this morning might now have a known critical vulnerability. If you only scanned at merge time, you would miss this case. You need to periodically re-scan all assets that are deployed to production.

That being said, it can still be valuable to provide vulnerability information in a pull request. The key is to ensure that pull requests are only blocked if they introduce a vulnerability and to include an override option for emergency changes.
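That gate logic can be sketched in a few lines. How the base/head finding sets are produced depends on your scanner, and the "security-override" label is a hypothetical name for the emergency escape hatch:

```python
# Sketch: block a PR only if it introduces NEW findings relative to the
# base branch. The override label is a hypothetical convention, not a
# real scanner feature.

def introduced_findings(base: set[str], head: set[str]) -> set[str]:
    """Findings present on the PR head but absent from the base branch."""
    return head - base

def should_block(base: set[str], head: set[str], labels: set[str],
                 override_label: str = "security-override") -> bool:
    if override_label in labels:
        return False  # emergency changes ship; follow up out-of-band
    return bool(introduced_findings(base, head))
```

Pre-existing findings still get picked up by the periodic production scan; the PR gate only stops regressions.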

False positives, risk downcasting, and risk acceptance are well-defined, tightly integrated, and executed as fast workflows

Vulnerability exceptions come in three forms:

  • Risk downcasting means that a detected vulnerability is legitimate but, through other controls, has been significantly mitigated. You aren’t refusing to fix the issue; you are just saying that due to your other security controls, there is less urgency.
  • False positives refer to when a detected vulnerability is not applicable in the environment and should be ignored. This means you are asserting that the scanner itself is broken or wrong.
  • Risk acceptance is when a detected vulnerability cannot be fixed without jeopardizing the entire system. This is the rarest exception because you are admitting the vulnerability is legitimate, but you are incapable of fixing it.

Key qualities in an exception workflow:

  • Well-defined: Making an exception (downcast, false positive, or acceptance) to a critical vulnerability is a serious decision. You need an engineer that can attest to the state of the infrastructure and an unbiased security engineer that approves the exception. Auditors and customers are going to want to know about your exceptions, so your process needs to be clearly documented and enumerable.
  • Tightly integrated: Once an exception has been approved, it needs to be put into the scanner in a way that will be reflected in the downstream results or automation (e.g., a downcasted vulnerability should show up in the remediation ticket and SLA calculation). You also need to ensure that all exceptions originate from the well-defined process and aren’t just jammed in by a cowboy engineer.
  • Fast: By its nature, patching is a tedious and frustrating task. The last thing you want is unnecessary friction that makes a boring task even slower. Streamline the exception approval process to avoid creating a rift between security and DevOps.

My favorite model for this is an infrastructure-as-code model that uses Terraform to define the exceptions and apply the changes to the vulnerability scanner. The engineer makes a pull request to create the exception, and the security team approves the code. This way, you cover the requirements for documentation and multiple eyes on the change and do so in a way that is familiar to everyone.
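As a rough sketch of that model, assuming (hypothetically) that your scanner exposes a Terraform provider; the resource and attribute names below are invented for illustration:

```hcl
# Hypothetical Terraform resource: most scanners do not ship an official
# provider, so treat these names as purely illustrative.
resource "scanner_exception" "jre_false_positive" {
  vulnerability_id = "CVE-2021-99999"       # illustrative ID
  type             = "false_positive"       # or "downcast" / "acceptance"
  scope            = "hosts:tag=build-fleet"
  justification    = "Scanner matches a bundled JRE that is never executed."
  expires_at       = "2023-01-01"           # forces periodic re-review
}
```

Requiring a security-team review on the pull request (for example, via a CODEOWNERS rule) covers the "unbiased approver" requirement, and the git history becomes your audit trail.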

Vulnerability remediation is part of the normal work queue

DevOps is a field known for cognitive overload because of the vast breadth of services we are expected to be knowledgeable in. Handing a DevOps team six dashboards for six types of scanning and expecting a good remediation experience is foolish.

My proposal is to deliver the vulnerability results into the existing work queues of the engineers who need to fix the issue.

  • Does the DevOps team work out of Jira? Then your cloud vulnerability results need to be delivered as Jira tickets.
  • Does the product team work out of GitHub Issues? Then your code vulnerability results need to be GitHub Issues.

The ideal end state is one where the pebble is in the right shoe. The vulnerability scanner generates the work, automation drops that work into the appropriate queue, and the engineers execute on their queue. Ensure there is a process that escalates to the security team if a vulnerability SLA is breached.
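The routing step can be sketched with a simple owner-to-queue map; the map contents and the filing functions are hypothetical placeholders for your real Jira/GitHub integrations:

```python
# Sketch: deliver each finding into the owning team's existing queue.
# QUEUES and the filer callables are illustrative stand-ins for real
# Jira / GitHub Issues API clients.
QUEUES = {
    "devops":  ("jira", "INFRA"),           # DevOps works out of Jira
    "product": ("github", "acme/webapp"),   # product works out of GitHub Issues
}

def route(finding: dict, filers: dict) -> str:
    """Look up the owner's queue and hand the finding to the right filer."""
    system, target = QUEUES[finding["owner"]]
    return filers[system](target, finding)
```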

Note: While it can be useful context to upload vulnerability metrics to a system like Datadog, a monitoring platform is not ideal for the assignment of remediation tasks that can take months to complete. You’ll end up with monitors that are firing for months, and before long, the team will have snoozed all the alerts.

Every DevOps team will have a ticketing system for tracking the longer-running tasks for exactly this purpose. Most ticketing systems support useful things like auto-assignment, escalations, SLA trackers, dashboards, fields for filtering/searching, notification systems, and so on.

Unfortunately, in my experience, most vulnerability scanners have no ticket integration, or if they do, it doesn’t work quite right for the target use case. At the end of this article, I propose a workflow for automating this process fairly easily.

Key qualities of a vulnerability scanner

The key qualities of a vulnerability scanner are as follows:

  • vulnerability detection efficacy,
  • the process for updating rules/checks,
  • asset management,
  • the API,
  • ticketing integration,
  • overviews,
  • the customization of rules/checks, and
  • the performance implications of rules/check execution.

Vulnerability detection efficacy

It probably goes without saying, but the most important quality of a vulnerability scanner is its ability to detect a vulnerability. If you are unable to detect a vulnerability, none of the other qualities matter. Additionally, if the results are flaky, then the trust factor goes way down.

A high quality scanner:

  • minimizes false negatives and false positives,
  • offers maximum breadth of checks, and
  • produces consistent, reliable results.

When evaluating a new vulnerability scanner, always run it alongside a similar tool and compare results. You should lean strongly toward picking whichever scanner produces the most actionable results.
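A simple way to structure that comparison is to diff the finding sets of both scanners run against the same assets. This assumes finding IDs have been normalized (for example, to CVE IDs) so they are comparable across tools:

```python
# Sketch: diff the finding sets of two scanners run over identical assets.
# Assumes IDs are normalized so that the same finding has the same key
# in both tools.
def compare_scanners(a: set[str], b: set[str]) -> dict[str, set[str]]:
    return {
        "both":   a & b,  # agreed-upon findings: likely true positives
        "only_a": a - b,  # investigate: extra coverage, or noise?
        "only_b": b - a,
    }
```

Spot-check the "only" buckets by hand: unique findings are either valuable extra coverage or false-positive noise, and that ratio tells you which tool is more actionable.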

Process for updating rules/checks

Every scanner relies on an upstream source to provide it with new definitions. There are three major categories of things to consider.

What is the mechanism for the update? Is it manual or automated? How does it work with custom rules? How does it affect existing vulnerability reports? Can you roll back to a previous update? Are there alerts if the update mechanism fails to execute? Are there changelogs?

Are the updates made in a secure manner? Especially for host vulnerability scanners, the scanner itself can have privileged access to your entire fleet. If an adversary can compromise the upstream source, can they escalate into your environment? What processes does the vendor have in place to prevent this? What processes do you have in place to detect this?

How quickly do high or critical vulnerabilities get detection definitions? One of the most critical moments for a vulnerability scanner is the day a CVSS 10.0 vulnerability drops. You want a vendor that has a track record of producing reliable detection mechanisms within hours, not days. Verify this by comparing the timestamps of the vulnerability announcement and the detection definition becoming available. In addition, check whether the vendor needed to revise the detection multiple times, and whether that was due to false negatives or false positives.

When evaluating a new scanner, model how long it would take from a CVSS 10.0 vulnerability being announced to the scanning tool running a check against your assets. If the vendor is a major limiting factor, consider looking elsewhere.
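A back-of-the-envelope version of that model might look like this; every number is an illustrative assumption, not a vendor benchmark:

```python
# Back-of-the-envelope exposure model for a new critical CVE.
# All numbers are illustrative assumptions.
vendor_definition_lag_h = 12  # vendor ships a detection for the new CVE
scan_interval_h         = 24  # you re-scan production daily
triage_and_assign_h     = 4   # automation files and routes the ticket

daily_scan_case_h  = vendor_definition_lag_h + scan_interval_h + triage_and_assign_h
weekly_scan_case_h = vendor_definition_lag_h + 7 * 24 + triage_and_assign_h
# Daily scans: ~40 hours worst case before the ticket lands.
# Weekly scans: ~184 hours, with the scan interval dominating everything.
```

The exercise makes it obvious which term dominates: if the vendor's definition lag is larger than your scan interval, no amount of scan tuning on your side will close the gap.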

Asset management

As a foundation, all vulnerability scanners will need some form of asset management (i.e., how do they determine what needs to be scanned/checked?).

  • Code vulnerability scanners (SASTs) need to know which repositories to scan.
  • Deployment scanners need a list of repositories and continuous integration systems to check.
  • Container scanners need to know which containers in the registry are used in production.
  • Cloud security posture scanners need to know which accounts they are scanning.
  • Host vulnerability scanners need either a list of hosts or a way of identifying IP addresses from a network scan.
  • Web vulnerability scanners (DASTs) need to know which hosts/endpoints need to be scanned.

How assets are discovered, categorized, and used is an important feature point. If asset management is poorly implemented, you can really struggle to operationalize the vulnerability results. For host scanning, if a host with a vulnerability is deleted and replaced with a new host, should the scanner consider the vulnerability resolved? Or should it wait until the new host is scanned?

When evaluating a new vulnerability scanner, look carefully for how it implements asset management and deals with ephemerality. The workflows here are important.
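One way to handle the replaced-host case is to reconcile findings against the current inventory, dropping findings for deleted hosts while treating live-but-unscanned hosts as unknown rather than clean. A sketch, with illustrative field names:

```python
# Sketch: reconcile findings with the live inventory so deleted hosts
# don't leave zombie findings, while new hosts that haven't been scanned
# yet are flagged as unknown rather than assumed clean.
def reconcile(findings: list[dict], live_assets: set[str],
              scanned_assets: set[str]) -> tuple[list[dict], list[str]]:
    # Keep only findings whose asset still exists.
    open_findings = [f for f in findings if f["asset"] in live_assets]
    # Live assets never scanned are an unknown, not a pass.
    unscanned = sorted(live_assets - scanned_assets)
    return open_findings, unscanned
```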

API

Whether it’s for ticketing or for integrating into your continuous integration or delivery process, you are going to want an API that you can hit to pull the latest posture and vulnerability results with enough granular detail that the remediation assignment is easy. Some vulnerability scanners don’t support the level of depth you might need, which can be a problem down the road.

As a real-world example, I was once dealing with a Java vulnerability that affected two different copies of the JRE on the same system (one owned by DevOps, the other by AppOps). In a tool like Tenable, it would show as a single vulnerability result, and the only way to tell that there were two different findings was to regex through the output (gross).

When evaluating a new vulnerability scanner, look at the API to ensure the granularity of vulnerability findings, the ability to delete hosts or vulnerability results, and the performance while pulling the totality of findings.
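Both concerns, paging through the totality of findings and checking their granularity, can be sketched generically. Here `fetch_page` stands in for whatever your scanner's real API call is, and the field names are hypothetical:

```python
# Sketch: drain a paginated findings endpoint, then verify each finding
# is granular enough to be actionable.
def pull_all(fetch_page) -> list[dict]:
    """fetch_page(n) is a stand-in for the scanner's real API call."""
    findings, page = [], 1
    while batch := fetch_page(page):  # an empty page signals the end
        findings.extend(batch)
        page += 1
    return findings

def is_granular(finding: dict) -> bool:
    # Two JRE copies on one host must surface as two findings: same
    # asset and CVE, but different install paths.
    return all(key in finding for key in ("asset", "cve", "path"))
```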

Ticketing integration

Again, what is the point of a vulnerability scanner that produces results that aren’t actioned? You need to ensure you can get actionable results into the hands of the engineers who can fix the problem.

An example of an interesting workflow complication — the DevOps team runs a weekly patching cycle on Sundays. The DevOps team doesn’t want to have a ticket in their queue for a vulnerability that will get fixed automatically next Sunday.

When evaluating the usefulness of a new scanner, look for features such as the following.
  • The scanner breaks up vulnerability results into a set of tickets based on the properties of the vulnerability itself.
  • Tickets have a path to auto-assignment (either set by the scanner or by the ticketing system).
  • Tickets have due dates, SLAs, or the equivalent.
  • False positives, risk acceptance, and risk downcasting are integrated into the ticketing process.
  • There is the ability to delay ticketing until an automated patching process has run.
  • The scanner provides clarity with an ephemeral asset set.
  • The scanner has a mechanism for handling ticket swarm (i.e., it doesn’t file 1,000 tickets for a single host).
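The patch-cycle delay in particular is easy to sketch: only file a ticket when automation cannot fix the finding, or when the finding has already survived a full cycle. The field names and weekly cadence are illustrative assumptions:

```python
import datetime

# Sketch: suppress tickets for findings the automated patch cycle will
# fix on its own. `fix_available`, `first_seen`, and the 7-day cadence
# are illustrative, not any real scanner's schema.
def should_file_ticket(finding: dict, today: datetime.date,
                       cycle_days: int = 7) -> bool:
    if not finding.get("fix_available"):
        return True  # automation can't help; needs human attention
    age_days = (today - finding["first_seen"]).days
    # Survived a full patch cycle: automatic patching didn't fix it.
    return age_days > cycle_days
```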

Overviews

Overviews are needed by at least three different audiences: engineering, security, and compliance.

Engineering teams are focused on the priority of vulnerability remediation. They are going to want an overview that shows only the vulnerabilities that they are responsible for, with a focus on what is affected and the due date. They potentially want to collaborate with people fixing similar vulnerabilities.

Security teams are focused on the severity of the existing vulnerabilities and the duration of the exposure. They want high visibility and alerting for the most critical vulnerabilities, with rich information on what is affected. If a new critical vulnerability drops, they may want to escalate to engineering to fix it at a faster speed than normal.

Compliance teams are focused on the remediation rate (SLAs) of vulnerabilities. From their perspective, it doesn’t matter how many vulnerabilities there are or how risky the current state is as long as all the vulnerabilities are fixed within the required timeframe. Any vulnerability that isn’t remediated in time will require justification and potentially a discussion with auditors, governing bodies, or customers.

When evaluating a new scanner, consider how the tool will work for each of these three groups. There will almost certainly be a gap, so consider the time that will be spent building a workflow.

Customization of rules/checks

The best scanning tools provide flexibility regarding rules/checks. This can be very useful for dealing with false positives or false negatives beyond just ignoring the entire check. However, when there is a brand new critical or zero-day vulnerability that is publicly disclosed, it is absolutely paramount that you are able to submit a fully custom check and run it immediately (think of the epic log4j vulnerability).

Don’t fall for the sales pitch of “of course you can add your own rules.” Dig into the actual process. For one scanner, adding a rule meant creating a check in XML in a rigid library. Worse, creating a custom rule blocked automatic vulnerability definition updates (you were expected to update manually to accommodate your custom modifications). This negated the value generated by the custom rule.

When evaluating a new scanner, ensure that existing rules can be deleted or modified and that custom rules can be easily added in. Pay close attention to the resulting workflow and side effects.

Performance implications of rules/checks execution

This refers not only to how fast checks can be run but also to whether those executions have a detrimental effect on the asset. The most obvious examples would be host vulnerability scanners that slow down your laptop or code vulnerability checkers that add minutes to your CI build. Like most things in life, it’s a tradeoff. The more security tooling there is and the more aggressively you run it, the more productivity hits you’ll take.

What you want to look for is how much control you have over check execution. Do you have the option to run partial checks for only critical vulnerabilities? Can the scanner itself limit its performance footprint? How long does a run take? What would happen if you ran it continually?

A very common question from the operations team is, “What was the impact on performance of the last scan?” Excellent tools should also come with metrics around the execution of the check (duration, performance, etc.). This will save you a world of difficulty in troubleshooting if a production performance issue is caused by a scanner.

When evaluating a new scanner, evaluate which controls exist to measure and limit the performance impact of the scanner. Hand-in-hand with impact on performance is the total duration of a scan, which can go against one of the core tenets of a good vulnerability management program: speed.

Wrap-up: Putting this into action

This post covered the following vulnerability management topics:

  • Vulnerabilities come in six forms: Code, deployment, container, hosting provider, host, and web vulnerabilities.
  • Foundational qualities: Ensure, when building your management program, that vulnerability detection is frequent and occurs early; that the exception resolution workflow engineers encounter is well-defined, tightly integrated, and fast to execute; and that vulnerability remediation tasking is done as part of the normal support queue, not as a separate process.
  • Scanner Qualities: Vulnerability detection efficacy, the process for updating rules/checks, asset management, the API, ticketing integration, overviews, the customization of rules/checks, and the performance implications of rules/check execution.

A good vulnerability management program requires care and feeding and will go significantly better if someone is accountable for the whole thing.

I suggest a measured, incremental approach to making modifications to vulnerability management programs. I’ve seen people go on aggressive campaigns, thinking that all they needed was to get the scanner deployed and results flowing, and their vulnerability problems would be over. The majority of your time needs to be spent on making sure that the assignment phase is smooth, the remediation is cheap, and that there is someone accountable for ensuring that the system is constantly improving.

As for where to get started, the bad guys will target any phase of the pipeline indiscriminately. Nothing is off-limits for them. This means that being great at half the problem is insufficient.

Good luck on your journey! Please hit me up if you find a vulnerability scanner that isn’t soul-sucking!

Appendix: Advice for automating the vulnerability-to-ticket process

If you are lucky, the vulnerability scanner you are using will auto-generate useful tickets, and no automation will be necessary. Unfortunately, DevOps engineers are rarely lucky (maybe that’s why we tend to be salty?). Regardless of whether you are using Tenable for host scanning, Anchore for Docker containers, or Dome9 for cloud posture checking, the following workflow can be applied:

  1. Pull the state of all vulnerabilities in the scanner.
  2. Pull the state of all vulnerability tickets in the task system.
  3. For each vulnerability without a ticket, create a new ticket.
  4. For each vulnerability with a ticket, update the ticket.
  5. For each ticket without a vulnerability, close the ticket.
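The five steps above can be sketched as a single sync loop. Here `scanner` and `tickets` are stand-ins for your real API clients, and the method names are hypothetical:

```python
# Sketch of the five-step scanner-to-ticket sync. Both clients and all
# method names are illustrative placeholders for real integrations.
def sync(scanner, tickets):
    vulns    = {v["key"]: v for v in scanner.list_open_vulnerabilities()}
    existing = {t["vuln_key"]: t for t in tickets.list_vulnerability_tickets()}

    for key, vuln in vulns.items():
        if key not in existing:
            tickets.create(vuln)                 # new vulnerability -> new ticket
        else:
            tickets.update(existing[key], vuln)  # refresh assets, SLA, output
    for key, ticket in existing.items():
        if key not in vulns:
            tickets.close(ticket)                # remediated (or excepted)
```

Because the loop is idempotent, you can run it on a schedule after every scan without worrying about duplicate tickets.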

The content of the ticket is also generally the same across different use cases and includes

  • the name, description, severity, and category of the vulnerability;
  • the list of currently affected assets, including the basic metadata of the asset;
  • the SLA and due date for the vulnerability to be resolved;
  • links to documentation for remediation processes, the false-positive workflow, etc.; and
  • the output from the scanner.

If you follow a pattern like this, the flow is relatively simple.

  • Automation files the ticket.
  • The ticketing system determines who the owner is and routes accordingly.
  • Engineers fix the issue.
  • Unfixed tickets that violate SLA get routed to the security team.

Where things can get tricky but not insurmountable is in how the vulnerabilities get batched together. You obviously don’t want to file a ticket per asset (a vulnerability affecting 100,000 assets shouldn’t create 100,000 tickets). However, it’s often not useful to have just one large ticket because multiple teams are often involved (a vulnerability affecting both laptops and servers will likely need to be resolved by different teams). The trick is to figure out how to create filters to separate assets into collections and create vulnerability tickets for each asset collection.
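That batching can be sketched as a grouping step; the laptop/server rule below is purely illustrative, as real filters might use tags, accounts, or subnets:

```python
from collections import defaultdict

# Sketch: split one vulnerability's asset list into one ticket per team,
# rather than one ticket per asset or one giant ticket for everyone.
def collection_for(asset: dict) -> str:
    # Hypothetical rule: laptops and servers belong to different teams.
    return "endpoint-team" if asset["type"] == "laptop" else "devops-team"

def batch_tickets(assets: list[dict]) -> dict[str, list[str]]:
    buckets: dict[str, list[str]] = defaultdict(list)
    for asset in assets:
        buckets[collection_for(asset)].append(asset["name"])
    return dict(buckets)  # one ticket per collection, not per asset
```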


Director of Engineering at Skydio, Ex-Palantir, Infrastructure and Security Nerd, Gamer, Dad