Operations

Code Deployments

Service Code

We continuously release small, atomic updates to our multi-tenant services, following the continuous delivery practice called Scaled Trunk-based Development. On average, there are roughly 80-100 releases per month. After ensuring 100% test coverage and a (human) code review, service updates are automatically rolled out to all customers simultaneously. Most releases are either dependency updates, which typically go completely unnoticed, or fixes for issues that some or all customers were having. Features or enhancements can be feature-flagged for single customers or projects, meaning they are only activated on an opt-in basis via configuration.
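As an illustration of the opt-in behavior described above, here is a minimal sketch of a feature-flag check. The `TenantConfig` class, the `is_enabled` helper, and the flag name are hypothetical and are not the actual configuration schema; the only point is that a flagged feature stays off unless a customer's configuration explicitly lists it.

```python
from dataclasses import dataclass, field


@dataclass
class TenantConfig:
    """Hypothetical per-customer configuration (not an actual Adobe schema)."""
    name: str
    enabled_features: set[str] = field(default_factory=set)


def is_enabled(feature: str, tenant: TenantConfig) -> bool:
    # Opt-in semantics: a flagged feature is off unless the tenant lists it.
    return feature in tenant.enabled_features


# One customer has opted in to a (made-up) flag, the other has not.
acme = TenantConfig("acme", {"new-image-pipeline"})
other = TenantConfig("other")
```

With this shape, a new release can ship the flagged code path to everyone while only `acme` actually exercises it.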

Customer Code

Unlike traditional AEM as a Cloud Service, there is no CI/CD pipeline for Edge Delivery Services. Code changes are picked up directly from the branches in your GitHub repository or your own ("bring your own git") repository. Within seconds, every branch is automatically published under its own distinct URL for testing and staging of changes. Production code is typically served from the main branch.
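To make the per-branch URLs more concrete, the sketch below builds a preview URL from an owner, repository, and branch name. The hostname pattern `<branch>--<repo>--<owner>.aem.page` is not stated in this section; it is assumed here and should be checked against the developer documentation. The function name and the sample values are hypothetical.

```python
def branch_preview_url(owner: str, repo: str, branch: str = "main", path: str = "/") -> str:
    """Build a per-branch preview URL.

    Assumes the hostname pattern <branch>--<repo>--<owner>.aem.page,
    which is an assumption and not taken from this section.
    """
    if not path.startswith("/"):
        path = "/" + path
    # Hostnames are case-insensitive, so normalize to lowercase.
    host = f"{branch}--{repo}--{owner}.aem.page".lower()
    return f"https://{host}{path}"


# Hypothetical project: the "feature-nav" branch of acme/website.
url = branch_preview_url("acme", "website", "feature-nav")
```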

We strongly recommend that project developers follow the same Scaled Trunk-based Development model as we do for our services. This way you merge small pull requests into production often, while the quality assurance and review effort stays limited to small change sets. Nobody wants to review and test large pull requests, and long-lived branches with lots of changes tend to be difficult (and dangerous) to merge. For more details, please read our developer best practices.

Service Level Objectives

Thanks to our unique multi-cloud architecture, our services are designed with the highest resilience and availability in mind. But of course this doesn't fully protect us from the odd service outage. We are aiming for these service level objectives and have historically exceeded them:

Delivery Service SLO: 99.99%

Publishing Service SLO: 99.9%

Your Service Level Agreement (SLA) with Adobe covers the exact availability assurances we offer, depending on your contract details. The relevant status reports for that can be found on status.adobe.com.
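To put these percentages into perspective, an availability objective can be converted into a downtime budget. The sketch below does that arithmetic for a 30-day window; the window length and the function name are assumptions chosen for illustration, and the contractual measurement period is defined by your SLA.

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Return the downtime (in minutes) that an availability objective
    allows over a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


delivery_budget = downtime_budget_minutes(99.99)   # Delivery Service SLO
publishing_budget = downtime_budget_minutes(99.9)  # Publishing Service SLO
```

Over 30 days, 99.99% corresponds to about 4.3 minutes of allowed downtime, and 99.9% to about 43 minutes.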

Observability

Logging

All our technical services feed into a centralized and redundant SIEM system, powered by Coralogix and Splunk. On top of this data, alerts fire when errors reach critical levels; the thresholds are continuously adjusted as the services evolve. Our observability infrastructure is operated independently from the operational infrastructure.

Our logging is focussed on perimeter logging, so that the operational properties of the individual services are prioritized over internal state. Depending on the nature of the collected logs, different retention policies are applied, ranging from two weeks to 25 months.

Monitoring

Our observability setup consists of a fine-grained set of highly sensitive synthetic monitors and log-based alerts for all our technical services. The slightest anomaly, be it related to a change we made ourselves or an issue with one of our third-party vendors, immediately alerts our on-call rotation.

We also have extra synthetic monitoring in place for our top 10 customer sites by traffic, which gives us the confidence that our delivery service is performing and scaling as intended.

Incident Management

Our operations team is assigned to a 24/7 on-call rotation split into two 12-hour shifts on two continents. Adobe On-Call ensures prompt notification of on-call engineers on several channels, including phone calls, text messages, and push notifications. We commit to acknowledging every incident within 15 minutes, although in reality we are typically a lot quicker.

We maintain detailed runbooks for each type of incident to ensure we can restore the affected service as fast as possible. Our process includes root cause analysis (RCA) and we publish postmortems for every single incident, no matter how small the customer impact was.

Disaster Recovery

Publishing

The Admin API is currently a single-cloud deployment. In case of an outage in this service, the intended RTO is 12 hours. Adobe regularly conducts disaster recovery tests to validate that both delivery and API services can be restored well within their respective intended RTOs.

Delivery

The Content Hub is where all published content, media, and code are stored. For this tier, we rely on active/active replication in a multi-cloud setup. Unlike traditional approaches to disaster recovery, such as active/standby or multi-region deployments, an active/active multi-cloud setup ensures that any content, media, or code that gets previewed or published is stored redundantly in at least two different cloud providers with different, but functionally identical, software stacks.

In case of an outage, even a global outage of a cloud provider's control plane that would affect a multi-region setup, all content is still available in the second cloud provider and operations can resume without data loss.

The active/active deployment means that during normal operations, the workloads are split roughly equally between our cloud providers. Only when one provider has an outage do the remaining providers pick up its share of the load.
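The behavior described above can be sketched as a small router that alternates between healthy providers and sends all traffic to the remaining ones when a provider is marked down. This is a simplified illustration, not the actual routing implementation; the class name, the provider labels, and the round-robin strategy are assumptions.

```python
from itertools import count


class ActiveActiveRouter:
    """Toy active/active router: round-robin over healthy providers."""

    def __init__(self, providers):
        self.providers = list(providers)
        self.healthy = set(self.providers)
        self._counter = count()

    def mark_down(self, provider):
        # Take a provider out of rotation, e.g. during an outage.
        self.healthy.discard(provider)

    def mark_up(self, provider):
        # Put a known provider back into rotation.
        if provider in self.providers:
            self.healthy.add(provider)

    def route(self):
        # Keep the configured order so the split stays even and deterministic.
        candidates = [p for p in self.providers if p in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy provider available")
        return candidates[next(self._counter) % len(candidates)]


router = ActiveActiveRouter(["cloud-a", "cloud-b"])  # hypothetical labels
normal = [router.route() for _ in range(4)]    # split evenly across both
router.mark_down("cloud-a")
failover = [router.route() for _ in range(4)]  # all requests go to cloud-b
```

Because both providers serve traffic at all times, failover does not require starting up a cold standby; the healthy provider simply absorbs the full load.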

The delivery service itself is also deployed redundantly, so in case of an outage at one cloud provider, Adobe can switch to the other and, where possible, resume delivery almost immediately. The intended recovery time objective (RTO) for this service is 15 minutes.
