HAProxy has been around since long before Kubernetes was even a twinkle in Google’s eyes, but now the “world’s fastest and most widely used software load balancer” has made the leap into cloud native computing with the introduction of HAProxy 2.0, which adds a Kubernetes Ingress controller, a Data Plane API, and much more in its efforts to enmesh itself even further into the fabric of modern infrastructure.
“The release of HAProxy 2.0 along with the new HAProxy Data Plane API and HAProxy Kubernetes Ingress Controller marks the culmination of a significant re-architecture of HAProxy to add the flexibility and features needed to optimize support for modern application architectures,” said Willy Tarreau, HAProxy community lead and HAProxy Technologies CTO, in a company statement.
Daniel Corbett, director of product at HAProxy, explained in an interview with The New Stack that many of the features have been in the works since earlier versions, with HAProxy 1.9 serving as “a bit of a technical preview” to 2.0, which has evolved according to market shifts and community feedback.
“We’ve seen this shift in the market over time from hardware-based load balancers to companies and users exploring software-based load balancers. We’re seeing a massive shift right now into containers and Kubernetes and microservices,” said Corbett. “We’ve been observing these trends and listening to user feedback within the general community, and so HAProxy 2.0 has a focus around these cloud and container-based environments. We’ve spent a lot of time working in conjunction with our community to make sure that the product is on the path to supporting cloud native environments.”
According to the company statement, the Data Plane API “represents another important advancement in extensibility for HAProxy, providing true dynamic configuration management in any environment” by exposing “a modern REST API for configuring HAProxy on the fly, including dynamically adding or removing frontends, backends and servers, creating ACL rules, inserting HTTP routing directives, and setting IP and port bindings.” Polyglot extensibility, meanwhile, means the addition of new Stream Processing Offload Engine (SPOE) libraries and examples, which make it easier to extend HAProxy in languages other than C (which HAProxy is written in), such as Golang, Python, Lua and .NET Core.
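As a purely illustrative sketch of the kind of call the Data Plane API exposes, adding a backend on the fly might look something like the following; the host, port, credentials, and configuration `version` parameter here are placeholders, not values from the announcement:

```shell
# Illustrative only: POST a new backend definition to a running
# Data Plane API instance (default setups often listen on :5555).
curl -X POST --user admin:password \
  -H "Content-Type: application/json" \
  -d '{"name": "web_servers", "mode": "http", "balance": {"algorithm": "roundrobin"}}' \
  "http://localhost:5555/v2/services/haproxy/configuration/backends?version=1"
```

The point is the shape of the interaction: configuration changes become versioned REST calls rather than config-file edits and reloads.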
Nick Ramirez, senior content strategist at HAProxy Technologies, explained that HAProxy’s move into cloud native environments and effort to bolster extensibility has opened the software-based load balancer to a new world of opportunity.
“Traditionally, people are familiar with using HAProxy on the edge of their network, and these new architectures are moving proxies into environments like Kubernetes,” said Ramirez. “It’s really thrilling for us to be able to meet them there and to give them these features that they’d been clamoring for, like the Data Plane API. Without an HTTP RESTful API, it’s really difficult for someone to integrate a proxy into something like a service mesh, so now that we have this, it’s just going to open all sorts of doors.”
Service meshes are not new territory for HAProxy, which was actually a part of SmartStack, an early service mesh created by Airbnb, Corbett pointed out. As such, it seems like an obvious next step.
“HAProxy has a history of being involved within service architecture. We’re interested in further expanding our support in these areas,” said Corbett. “HAProxy existed in the original service mesh and we’re going to see what we can do to integrate into the service mesh architectures of today.”
In addition to HAProxy 2.0, the company behind the open source load balancer also announced its inaugural community user conference, HAProxyConf 2019, which will take place in Amsterdam, the Netherlands on Nov. 12 and 13, 2019.
It’s hard to define good SLOs, especially when outcomes aren’t fully under the control of any single party. The authors of today’s paper should know a thing or two about that: Jeffrey Mogul and John Wilkes at Google! John Wilkes was also one of the co-authors of chapter 4, “Service Level Objectives,” in the SRE book, which is good background reading for the discussion in this paper.
The opening paragraph of the abstract does a great job of framing the problem:
Cloud customers want strong, understandable promises (Service Level Objectives, or SLOs) that their applications will run reliably and with adequate performance, but cloud providers don’t want to offer them, because they are technically hard to meet in the face of arbitrary customer behavior and the hidden interactions brought about by statistical multiplexing of shared resources.
When it comes to SLOs, the interests of the customer and the cloud provider are at odds, and so we end up with SLAs (Service Level Agreements) that tie SLOs to contractual agreements.
What are we talking about?
Let’s start out by getting some terms straight: SLIs, SLOs, SLAs, and how they fit together.
A Service Level Indicator (SLI) is something you can measure (e.g. a rate, average, percentile, yield, or durability).
A Service Level Objective (SLO) is a predicate over a set of SLIs. For example, monthly uptime percentage (the SLI) will be at least 99.99%.
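The SLI/SLO distinction can be made concrete in a few lines. This is an illustrative sketch (the function names and the 99.99% target are just the running example, not anything from the paper): the SLI is the measurement, the SLO is a predicate over it.

```python
# SLI: a measurable quantity -- here, monthly uptime percentage.
def monthly_uptime_percentage(downtime_minutes: float,
                              minutes_in_month: float = 30 * 24 * 60) -> float:
    return 100.0 * (1 - downtime_minutes / minutes_in_month)

# SLO: a predicate over one or more SLIs.
def slo_met(sli: float, target: float = 99.99) -> bool:
    return sli >= target

# 3 minutes of downtime in a 30-day month comfortably meets 99.99%.
sli = monthly_uptime_percentage(downtime_minutes=3.0)
print(round(sli, 4), slo_met(sli))
```

An SLA would then be this predicate plus contractual consequences for the cases where it evaluates to false.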
A Service Level Agreement (SLA) is “an SLO plus consequences”: a promise made by a provider that, in exchange for payment, it will meet certain customer-visible SLOs.
When SLOs are tied to SLAs, they tend to end up defining the worst-case behaviour that a customer can possibly tolerate (because anything beyond that triggers the penalty clause). If a provider consistently delivered service right up to the SLO limit, however, it’s unlikely that customers would be very happy. There is a set of Service Level Expectations (SLEs) which need to be met in order to keep customers happy, and these are stricter than the SLOs defined in an SLA. From a cloud provider perspective, these are likely to be internal SLOs: the targets that the provider strives to meet, but is not contractually obligated to meet.
So there are different kinds of SLOs, which the authors argue are best categorised based on the consequences of failing to meet them:
Contractual SLOs, connected to SLAs, for which a failure to meet them usually results in financial penalties
Customer satisfaction SLOs (SLEs), for which a failure to meet them results in unhappy customers
Compositional SLOs are expectations over sets of resources such as “VM failures in two different availability zones are uncorrelated.” These are SLOs that inform a customer’s application design, and failure to meet them may result in invalidated design assumptions — which doesn’t generally turn out well!
Control loop SLOs express the active management actions a provider will take, e.g. shedding of low-priority load will occur on over-utilised network links. Failure to meet a control loop SLO usually results in cascading failures and violation of other SLOs.
Why are SLOs so hard to define?
Creating an SLA seems simple: define one or more SLOs as predicates on clearly-defined measurements (Service Level Indicators, or SLIs), then have the business experts and lawyers agree on the consequences, and you have an SLA. Sadly, in our experience, SLOs are insanely hard to specify. Customers want different things, and they typically cannot describe what they want in terms that can be measured and in ways that a provider can feasibly commit to promising.
Consider for example “monthly uptime percentage for a VM will be at least 99.99%.” How are we measuring uptime? On what granularity (seconds, minutes, calendar months, rolling 30 days,…)? What is ‘up’? The VM is provisioned? The VM is running an OS? The VM is reachable from the Internet? Is a performance brownout an outage? And so on.
For cloud providers things get extra complicated due to multi-tenancy, and the fact that the behaviour of their clients can also impact SLIs. As a simple example, an SLO around network throughput might rely on the customer running software capable of driving the network fast enough. Or a system availability SLO, for availability seen by the end user, may well depend on the user carefully exploiting the available redundancy.
Expressing availability SLOs in terms of ‘nines’ also causes some issues: it hides the difference between many short outages and a few long ones, which is something many customers care about. It also treats all outages as equal, whereas an outage on Black Friday for a retailer is much worse than an outage on a quiet day of the year.
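A quick worked example makes the first problem vivid: a single four-and-a-bit-minute outage and 259 one-second blips produce exactly the same ‘nines’, even though they feel very different to customers. (The numbers here are mine, chosen to sit right at a 99.99% monthly budget.)

```python
SECONDS_IN_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month

def uptime_pct(outage_seconds: list) -> float:
    """The SLI sees only total downtime, not its distribution."""
    return 100.0 * (1 - sum(outage_seconds) / SECONDS_IN_MONTH)

one_long = [259.0]        # one outage of ~4.3 minutes
many_short = [1.0] * 259  # 259 one-second blips

# Both yield an identical uptime percentage of ~99.99%.
print(uptime_pct(one_long), uptime_pct(many_short))
```

A 99.99% monthly SLO treats these two months identically, yet a customer running short request/response traffic might barely notice the blips, while a customer mid-failover during the long outage has a very bad day.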
Is there another way of thinking about this?
The big idea in the paper is to draw lessons from statistics by analogy.
A good statistician will look at what decision needs to be made, define hypotheses to test in order to make the decision, decide how to collect sufficient data without bias (often sampled from the underlying population while staying within a budget), and choose an appropriate method to test the hypotheses against the sample.
In the context of SLAs, the decision is whether or not to invoke the contractual consequences. The problem of measuring SLIs is akin to sample gathering; and choosing a predicate over an SLI is akin to choosing an appropriate method.
Just as “statistician” and “data scientist” are distinct roles that share many, but not all, skills, “SLOgician” is also a distinct role with its own specific skills.
How would a SLOgician approach defining SLOs?
List the good outcomes you want, and the bad outcomes to be avoided
Agree with business decision makers what the consequences should be
Operationalize these outcomes, e.g. deciding on level of network capacity
Decide what data you need to collect in order to decide whether you are suffering from a bad outcome, and what kinds of aggregation are possible. (Analogous to ‘power analysis’ in statistics).
Decide what predicate on the data tells you whether an outcome has happened
Decide how much of the desired data you can collect given your resource budget and check it is enough to actually compute the SLOs
If you don’t have enough data collection budget available you could offer fewer SLOs; accept lower confidence in determining whether SLOs are being met; or dynamically lower measurement rate when an SLO is not at risk of violation.
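The budget-versus-confidence trade-off in those last two steps can be illustrated with a small calculation. Treat each availability probe as a Bernoulli trial and put a confidence interval around the measured SLI; if the interval straddles the SLO target, the probing budget is too small to tell whether the SLO is being met. (This is my illustration of the analogy, not a method from the paper.)

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a measured proportion."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# 10,000 probes with a single failure: the interval still straddles
# the 0.9999 target, so this budget cannot confirm a 99.99% SLO.
lo, hi = wilson_interval(9999, 10000)
print(f"{lo:.6f} .. {hi:.6f}")
```

Roughly speaking, distinguishing "four nines met" from "four nines missed" needs orders of magnitude more samples than most probing budgets allow, which is exactly why the authors stress deciding up front what you can afford to measure.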
One thing that a statistical outlook reminds us of is that SLOs are very rarely black-and-white, and we need to accept a level of uncertainty.
Whose responsibility is it?
My interpretation of the introduction to section 5 in the paper is “we have a problem because our business model seems to depend on us making promises we can’t keep” ;). Or in more technical terms: there are too many SLOs, poorly defined, and depending on decisions outside of the provider’s control. Wouldn’t it be nice if…
… following our analogy with statistics, we could focus less on SLOs that guarantee outcomes, and instead use SLOs as a tool for providers to provide structured guidance about decisions that create or remove risk.
That is, SLOs given by a provider could focus only on risks entirely under the provider’s control. Returning to the availability example, it would be the cloud provider’s responsibility to provide isolated failure zones, but the customer’s responsibility to use them correctly to achieve a desired level of availability.
Instead of focusing on outcomes, we should focus on expectations, and make these expectations bilateral: what service level the customer can expect from the provider (an SLE), and what the provider can expect from the customer (Customer Behavior Expectations, or CBEs). An SLE only applies if its related CBEs are met. Our view is that the customer and provider should each bear part of the risk of unpredictability, and use SLEs and CBEs to explicitly manage the sharing of risks.
So far so good, but one set of issues the authors would like to share responsibility for are those caused by resource sharing. Now if customer A violates their own CBEs and this causes desired SLOs not to be met, that’s fair game in my mind. With CBEs in place…
…one could limit sharing-dependent SLOs to be only compositional, not contractual – that is, sharing-dependent SLOs are offered as guidance: the provider implicitly promises not to undermine well-accepted SLEs, but makes no enforceable promises (SLAs) about sharing-dependent outcomes.
But in noisy neighbour scenarios, customer B violating their own CBEs (or maybe even staying within them!) can impact customer A’s SLOs. I’m not so comfortable with cloud providers side-stepping responsibility for that. After all, they control the isolation mechanisms, the resource overcommitment levels, and so on, and it’s certainly not something the customer can control.
About that recent outage…
I know this has been a longer write-up than usual, but I can’t resist quoting this paragraph about risks that are under the cloud provider’s control, to be juxtaposed with the emerging details of the recent Google Cloud Networking Incident #19009.
If we could ignore resource sharing, we could focus SLOs on various risks that arise from poor engineering or operational practices, such as not repairing control-plane outages; SDN designs that allow short-term control-plane failure to disrupt the data plane; failover mechanisms that do not actually work; operational procedures that create correlated risks, such as simultaneous maintenance on two availability zones in a region; routing network packets along surprisingly long WAN paths.
To be clear though, given how enormously complex these environments are, I think it’s pretty amazing that cloud providers are able to provide the levels of service that they do.
The last word
We do not pretend to have a complete solution to the problems of cloud-SLO definition, but we think such a solution could emerge from re-thinking our use of SLOs, and using the combination of SLEs and CBEs to create harmonious cooperation in normal times… Perhaps the most important lesson we can learn from statistics, however, is humility — that the combination of unpredictable workloads, hard-to-model behaviour of complex shared infrastructures, and the infeasibility of collecting all the necessary metrics means that certain kinds of SLOs are beyond our power to deliver, no matter how much we believe we need them.
Delightfully, at the time I’m writing this if you follow the ‘see my personal page’ link on John Wilkes’ Google profile page, you end up with a 500 Internal Server Error! Looks like a 404 handling misconfiguration. The information you seek can be found at https://john.e-wilkes.com/work.html instead.
A common debate in software development projects is between spending time on improving the quality of the software versus concentrating on releasing more valuable features. Usually the pressure to deliver functionality dominates the discussion, leading many developers to complain that they don't have time to work on architecture and code quality. But the counter-intuitive reality is that internal software quality removes the cruft that slows down developing new features, thus decreasing the cost of enhancing the software.
Infinite Essence: “James” (2018) All images courtesy of Mikael Owunna
Mikael Chukwuma Owunna, a queer Nigerian-Swedish artist raised in Pittsburgh, has spent the past two and a half years photographing Black men and women for a series titled Infinite Essence. Hand-painted using fluorescent paints and photographed in complete darkness, Owunna’s subjects are illuminated by a flash outfitted with a UV filter, which turns their nude bodies into glowing celestial figures.
Owunna tells Colossal that the series was his response to the frequent images and videos of Black people being killed by those sworn to protect them: the police. The photographer’s friends, family members, dancers, and one person he connected with on Instagram serve as models for the project, which is named after an idea from his Igbo heritage. “All of our individual spirits are just one ray of the infinite essence of the sun,” Owunna explains. “By transcending the visible spectrum, I work to illuminate a world beyond our visible structures of racism, sexism, homophobia and transphobia where the black body is free.”
Infinite Essence: “Uche” (2019)
Having struggled with his own body image (and with his identity as a gay African man, which has inspired his previous work), Owunna says that the response to the project has been powerful, both from the public and from the models. “One of the models, Emem, broke down in tears looking at their pictures saying that they had always dreamed of seeing their body adorned with stars and that these images were beyond their wildest imagination,” he said. “They then told me – ‘every black person deserves to see themselves in this way’ and how the experience was life-altering for them.”
After seeing Owunna’s work via an NPR feature, a 60-year-old Black woman told the photographer, “I’ve hated my body all my life, but–for a glorious instant–that photo made me feel good about it.”
To see more of Mikael Owunna’s work and to be informed about his upcoming lectures and exhibitions, follow the artist on Instagram and Twitter.
Companies often require employees to regularly change their passwords for security purposes. PCI compliance, for example, requires that passwords be changed every 90 days. However, NIST, whose guidelines commonly become the foundation for security best practices across countless organizations, recently revised its recommendations around password security. Its Digital Identity Guidelines (NIST 800-63-3) now recommend removing periodic password-change requirements, due to a growing body of research suggesting that frequent password changes actually make security worse. This is because these requirements encourage the use of passwords which are more susceptible to cracking (e.g. incrementing a number or altering a single character) or result in people writing their passwords down.
Unfortunately, many companies have now adapted these requirements to other parts of their IT infrastructure. This is largely due to legacy holdover practices which have crept into modern systems (or simply lingered in older ones), i.e. it’s tech debt. Specifically, I’m talking about practices like username/password credentials that applications or systems, rather than individual end users, use to access resources. These special credentials may even give a system free rein within a network, much like a user might have, especially if the network isn’t segmented (often these companies have adopted a perimeter-security model, relying on a strong outer wall to protect their network). As a result, because they are passwords just like a normal user’s, they are subject to the usual 90-day rotation policy, or whatever the case may be.
Today, I think we can say with certainty that—along with the perimeter-security model—relying on usernames and passwords for system credentials is a security anti-pattern (and really, user credentials should be relying on multi-factor authentication). With protocols like OAuth2 and OpenID Connect, we can replace these system credentials with cryptographically strong keys. But because these keys, in a way, act like username/passwords, there is a tendency to apply the same 90-day rotation policy to them as well. This is a misguided practice for several reasons and is actually quite risky.
First, changing a user’s password is far less risky than rotating an access key for a live, production system. If we’re changing keys for production systems frequently, there is a potential for prolonged outages. The more you’re touching these keys, the more exposure and opportunity for mistakes there is. For a user, the worst case is they get temporarily locked out. For a system, the worst case is a critical user-facing application goes down. Second, cryptographically strong keys are not “guessable” the way a password frequently is. Since they are generated by an algorithm and not intended to be input by a human, they are long and complex. And unlike passwords, keys are not generally susceptible to social engineering. Lastly, if we are requiring keys to be rotated every 90 days, an attacker can still have up to 89 days to do whatever they want in the event of a key being compromised. From a security perspective, this frankly isn’t good enough. It’s security by happenstance. The Twitter thread below describes a sequence of events that occurred after an AWS key was accidentally leaked to a public code repository, which illustrates this point.
To recap that thread, here’s a timeline of what happened:
AWS credentials are pushed to a public repository on GitHub.
55 seconds later, an email is received from AWS telling the user that their account is compromised and a support ticket is automatically opened.
A minute later (2 minutes after the push), an attacker attempts to use the credentials to list IAM access keys in order to perform a privilege escalation. Since the IAM role attached to the credentials is insufficient, the attempt fails and an event is logged in CloudTrail.
The user disables the key 5 minutes and 58 seconds after the push.
24 minutes and 58 seconds after the push, GuardDuty fires a notification indicating anomalous behavior: “APIs commonly used to discover the users, groups, policies and permissions in an account, was invoked by IAM principal some_user under unusual circumstances. Such activity is not typically seen from this principal.”
Given this timeline, rotating access keys every 90 days would do absolutely no good. If anything, it would provide a false sense of security. An attack was made a mere 2 minutes after the key was compromised. It makes no difference if it’s rotated every 90 days or every 9 minutes.
If 90-day key rotation isn’t the answer, what is? The timeline above already hits on it. System credentials, i.e. service accounts, should have very limited permissions following the principle of least privilege. For instance, a CI server which builds artifacts should have a service account which only allows it to push artifacts to a storage bucket and nothing else. This idea should be applied to every part of your system.
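On GCP, for instance, scoping a CI service account down to a single artifact bucket might look like the following sketch; the project, account, and bucket names here are hypothetical:

```shell
# Hypothetical names throughout. Create a dedicated service account for CI.
gcloud iam service-accounts create ci-artifact-pusher \
  --project=example-project \
  --display-name="CI artifact pusher"

# Grant it object-creation rights on one bucket and nothing else,
# following the principle of least privilege.
gsutil iam ch \
  "serviceAccount:ci-artifact-pusher@example-project.iam.gserviceaccount.com:roles/storage.objectCreator" \
  gs://example-artifacts
```

If this key ever leaks, the blast radius is pushing objects into one bucket, not listing IAM keys or deploying to production.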
For things running inside the cloud, such as AWS or GCP, we can usually avoid the need for access keys altogether. With GCP, we rely on service accounts with GCP-managed keys. The keys for these service accounts are not exposed to users at all and are, in fact, rotated approximately every two weeks (Google is able to do this because they own all of the infrastructure involved and have mature automation). With AWS, we rely on Identity and Access Management (IAM) users and roles. The role can then be assumed by the environment without having to deal with a token or key. This situation is ideal because we can avoid key exposure by never having explicit keys in the first place.
For things running outside the cloud, it’s a bit more involved. In these cases, we must deal with credentials somehow. Ideally, we can limit the lifetime of these credentials, such as with AWS’ Security Token Service (STS) or GCP’s short-lived service account credentials. However, in some situations, we may need longer-lived credentials. In either case, the critical piece is using limited-privilege credentials such that if a key is compromised, the scope of the damage is narrow.
The other key component of this is auditing. Both AWS and GCP offer extensive audit logs for governance, compliance, operational auditing, and risk auditing of your cloud resources. With this, we can audit service account usage, detect anomalous behavior, and immediately take action—such as revoking the credential—rather than waiting up to 90 days to rotate it. Amazon also has GuardDuty which provides intelligent threat detection and continuous monitoring which can identify unauthorized activity as seen in the scenario above. Additionally, access credentials and other secrets should never be stored in source code, but tools like git-secrets, GitGuardian, and truffleHog can help detect when it does happen.
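As a toy illustration of what those scanners do (vastly simplified; real tools check many credential patterns and add entropy analysis), a check for strings shaped like AWS access key IDs, which begin with "AKIA" followed by 16 uppercase alphanumerics, might look like:

```python
import re

# AWS access key IDs have a well-known shape: "AKIA" + 16 chars from [0-9A-Z].
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def find_leaked_keys(text: str) -> list:
    """Return any strings in `text` that look like AWS access key IDs."""
    return AWS_KEY_ID.findall(text)

# AWS's documented example key ID, the kind of thing that ends up in a commit.
sample = 'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"  # oops'
print(find_leaked_keys(sample))
```

Wiring a check like this into a pre-commit hook or CI step catches the leak before the push, rather than 55 seconds after.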
Let’s look at a hypothetical CI/CD pipeline as an example which ties these ideas together. Below is the first pass of our proposed pipeline. In this case, we’re targeting GCP, but the same ideas apply to other environments.
CircleCI is a SaaS-based CI/CD solution. Because it’s deploying to GCP, it will need a service account with the appropriate IAM roles. CircleCI has support for storing secret environment variables, which is how we would store the service account’s credentials. However, there are some downsides to this approach.
First, the service account that Circle needs in order to make deploys could require a fairly wide set of privileges, like accessing a container registry and deploying to a runtime. Because it lives outside of GCP, this service account has a user-managed key. While we could use a KMS to encrypt it or a vault that provides short-lived credentials, we ultimately will need some kind of credential that allows Circle to access these services, so at best we end up with a weird Russian-doll situation. If we’re rotating keys, we might wind up having to do so recursively, and the value of all this indirection starts to come into question. Second, these credentials—or any other application secrets—could easily be dumped out as part of the build script. This isn’t good if we wanted Circle to deploy to a locked-down production environment. Developers could potentially dump out the production service account credentials and now they would be able to make deploys to that environment, circumventing our pipeline.
This is why splitting out Continuous Integration (CI) from Continuous Delivery (CD) is important. If, instead, Circle was only responsible for CI and we introduced a separate component for CD, such as Spinnaker, we can solve this problem. Using this approach, now Circle only needs the ability to push an artifact to a Google Cloud Storage bucket or Container Registry. Outside of the service account credentials needed to do this, it doesn’t need to deal with secrets at all. This means there’s no way to dump out secrets in the build because they will be injected later by Spinnaker. The value of the service account credentials is also much more limited. If compromised, it only allows someone to push artifacts to a repository. Spinnaker, which would run in GCP, would then pull secrets from a vault (e.g. Hashicorp’s Vault) and deploy the artifact relying on credentials assumed from the environment. Thus, Spinnaker only needs permissions to pull artifacts and secrets and deploy to the runtime. This pipeline now looks something like the following:
With this pipeline, we now have traceability from code commit and pull request (PR) to deploy. We can then scan audit logs to detect anomalous behavior—a push to an artifact repository that is not associated with the CircleCI service account or a deployment that does not originate from Spinnaker, for example. Likewise, we can ensure these processes correlate back to an actual GitHub PR or CircleCI build. If they don’t, we know something fishy is going on.
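The CI half of such a split pipeline might be configured along these lines. This is a hypothetical sketch (the job name, base image, build script, and bucket are all invented): the build job holds only artifact-push credentials, and no deploy secrets ever enter its environment.

```yaml
# Hypothetical CircleCI config: CI only builds and pushes an artifact.
# Deploy credentials never enter this environment; Spinnaker (CD)
# injects secrets later, outside the build.
version: 2.1
jobs:
  build-and-push:
    docker:
      - image: cimg/base:2023.01
    steps:
      - checkout
      - run: ./build.sh  # produces build/app.tar.gz
      # Service account behind this has storage.objectCreator only.
      - run: gsutil cp build/app.tar.gz gs://example-artifacts/
workflows:
  ci:
    jobs:
      - build-and-push
```

Even a malicious build script can do no more than write objects to the artifact bucket, and every such write is attributable in the audit logs.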
To summarize, requiring frequent rotation of access keys is an outdated practice. It’s a remnant of password policies, which themselves have been increasingly repudiated by security experts. While similar in some ways, keys are fundamentally different from a username and password, particularly in the case of a service account with fine-grained permissions. Without mature practices and automation, rotating these keys frequently is an inherently risky operation that opens up the opportunity for downtime.
Instead, it’s better to rely on tightly scoped (and, if possible, short-lived) service accounts and usage auditing to detect abnormal behavior. This allows us to take action immediately rather than waiting for some arbitrary period to rotate keys where an attacker may have an unspecified amount of time to do as they please. With end-to-end traceability and evidence collection, we can more easily identify suspicious actions and perform forensic analysis.
Note that this does not mean we should never rotate access keys. Rather, we can turn to NIST for its guidance on key management. NIST 800-57 recommends cryptoperiods of 1-2 years for asymmetric authentication keys in order to maximize operational efficiency. Beyond these particular cryptoperiods, the value of rotating keys regularly is in having the confidence you can, in fact, rotate them without incident. The time interval itself is mostly immaterial, but developing this confidence is important in the event of a key actually being compromised. In this case, you want to know you can act swiftly and revoke access without causing outages.
The funny thing about compliance is that, unless you’re going after actual regulatory standards such as FedRAMP or PCI compliance, controls are generally created by the company itself. Compliance auditors mostly ensure the company is following its own controls. So if you hear, “it’s a compliance requirement” or “that’s the way it’s always been done,” try to dig deeper to understand what risk the control is actually trying to mitigate. This allows you to have a dialog with InfoSec or compliance folks and possibly come to the table with better alternatives.
We love the illustrations of Italian artist Virginia Mori (previously) who adds a subtle hint of dark humor to her quirky illustrations of young women and men. Recently the artist has been drawing scenes that revolve around the unconscious thoughts that spring to life while in bed. Each illustration presents an improbable or unique vision of a bedroom—from a bed composed of live grass, to another balanced on the tips of four trees. The illustrations seem to peek into her subjects’ dreams, projecting their hidden hopes or fears onto their surroundings as they slumber. You can see more of her work on her website, and keep updated with future exhibitions on Instagram and Facebook.