AZR333 – Planning for Failure in Cloud Applications

You can count on one of three things failing: hardware, software, or people. One of the most important considerations when moving applications into the public cloud is how to plan for – and mitigate – these failures.

Certainly there are best practices in building any application that help you to handle failures, but what are the practices when your applications run in the public cloud? In this presentation, Wade Wegner will draw upon his years of experience with cloud applications in Windows Azure to share proven practices for handling failure in cloud applications.

Presented by Wade Wegner

Disclaimer: These are conference session notes I compiled during various sessions at Microsoft Tech Ed 2012, September 11-14, 2012. The majority of the content comprises notes taken from the presentation slides, occasionally accompanied by my own narration. Some of the content may be freehand in style. Enjoy… Rob


  • Architectural options for designing highly-available, fault tolerant applications
  • Best practices for these options
  • Multi-availability Zones (AZ)

Cloud Outages

  • AWS 21/4/2011
  • Azure 29/2/2012
  • AWS 14/6/2012
  • AWS 29/6/2012
  • Azure 26/7/2012

The outages listed above had a range of causes: a leap-year date-handling bug (the Azure outage of 29/2/2012), and others caused by lightning and infrastructure issues. In essence, failures can and will occur.

What do we need to consider?

  • Fault Tolerance
  • High availability
  • Disaster recovery

Read your SLAs!

Windows Azure has monthly SLAs, for example. Keep in mind that SLAs rarely reimburse revenue lost due to outages.

‘Compounding SLAs’

  • Different systems have different SLAs, e.g.
    • Azure Compute = 99.95%
    • SQL Azure = 99.9%
    • Azure Storage = 99.9%

Allowed downtime per year: 4.38h + 8.76h + 8.76h
Total potential outage: 21.9 hours per year
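
The hour figures above are the downtime each SLA allows over a 365-day year (8,760 hours), summed as a worst case in which the outages never overlap. A minimal sketch of that arithmetic follows, along with the multiplied availability of a request path that needs all three services at once. The function names are illustrative, and independent failures are an assumption.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime an SLA permits over the period."""
    return (1 - sla_percent / 100) * period_hours

def composite_availability(sla_percents):
    """Availability (percent) of a chain that needs every service up.

    Assumes the services fail independently of one another.
    """
    result = 1.0
    for sla in sla_percents:
        result *= sla / 100
    return result * 100

slas = {"Azure Compute": 99.95, "SQL Azure": 99.9, "Azure Storage": 99.9}

for name, sla in slas.items():
    print(f"{name}: {allowed_downtime_hours(sla):.2f} h/year")

worst_case = sum(allowed_downtime_hours(s) for s in slas.values())
combined = composite_availability(slas.values())
print(f"Summed worst case: {worst_case:.1f} h/year")
print(f"Composite availability: {combined:.2f}%")
```

The summed worst case comes to 21.9 hours a year, and the composite availability works out to roughly 99.75%, lower than any single SLA in the chain.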

Let's define ‘Cloud’

  • Physical data centre behind an API
  • Cloud is a ‘resource pool’ behind an API
  • A cloud is not
    • Azure
    • AWS
  • A cloud is defined by the isolation of resources
  • Sometimes might need to go across Cloud platforms
    • e.g. Azure, AWS, different data centres, different geo-locations

A cloud is a specific data centre (rather than the platform itself)

Define High Availability?

  • Remove all single points of failure
    • Multiple hosts, load balancers, data replication
  • Graceful failover
    • Platforms might provide functionality to support this
    • Sometimes you need to build it

Define Disaster Recovery?

  • Processes or procedures to recover from a failure
    • Network, hardware, software etc
  • Practice and test DR strategies; this takes a lot of time
    • Document, train, rehearse
  • Disaster can occur anywhere

Typical Approach

  • Duplication of infrastructure
  • identical spec
  • cold failover
  • typically either under-provisioned or over-provisioned

DR with Cloud

  • Consider the advantages/features of each platform
    • to support migration, durability, restoration of data
  • Scale up as needed
  • Geo-located
    • Azure: Regions & Fault domains
    • AWS: Regions & availability zones
  • Move applications into separate fault domains

Design for Failure

  • Large scale failures are rare, but happen
  • Applications need to be fault-aware and able to recover
  • Balance cost of tolerance against cost/risk
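
One common way to make an application fault-aware is to retry transient failures with exponential backoff instead of giving up on the first error. The session notes do not prescribe this technique; the sketch below is illustrative only, and `TransientError`, `call_with_retries`, and the default delays are hypothetical names and values.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""

def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the base delay on each attempt, then add random jitter
            # so that many clients do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists so tests can pass a no-op and avoid real waiting. How many attempts are worthwhile is part of the cost/risk balance noted above.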

API Endpoint Differences

  • APIs differ
  • Different resources, billing
  • Network architectures vary (VLANs, security groups)
  • Different storage architecture
  • Abstractions and management vary
  • Each Cloud is unique in various ways

Overcoming Multi-Cloud Pain

  • Design using generic concepts
  • Have tools which translate generic concepts into cloud-specific resources
  • How to share resources across clouds
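
A sketch of what "generic concepts translated by tooling" might look like, with hypothetical names throughout: `ServerSpec` is the generic concept and `translate` stands in for the tooling. The clouds and size names are placeholders, not a mapping to any real provider's instance types or API payloads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServerSpec:
    """Generic, cloud-neutral description of a server (hypothetical)."""
    role: str
    size: str      # generic size: "small" or "large"
    location: str  # generic geography, e.g. "us-east"

# Placeholder lookup table; real size names must be taken from each
# provider's own documentation.
SIZE_MAP = {
    "cloud_a": {"small": "A-SMALL", "large": "A-LARGE"},
    "cloud_b": {"small": "b.small", "large": "b.large"},
}

def translate(spec, cloud):
    """Translate a generic spec into a cloud-specific request dict."""
    try:
        size = SIZE_MAP[cloud][spec.size]
    except KeyError:
        raise ValueError(f"no mapping for size {spec.size!r} on {cloud!r}") from None
    return {"cloud": cloud, "name": f"{spec.role}-{spec.location}", "size": size}

web = ServerSpec(role="web", size="small", location="us-east")
requests = [translate(web, cloud) for cloud in SIZE_MAP]
print(requests)
```

The design is described once in generic terms, and only the lookup table and translator change when another cloud is added.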

Infrastructure Abstraction/Automation

  • Simplify deployments across multiple regions/zones
  • Automate deployments
    • Reproducible, consistent
  • Advanced server and deployment monitoring
    • Some API support, e.g. custom performance counters
    • Azure aggregates a lot of data, performance counters etc
    • Still maturing
  • Automatic scaling and operations (and throttling)
  • Third party services/apps/tools can help
  • Make use of diagnostic information

Automation reduces the cost of maintenance. ScaleExtreme works across clouds.

HA/DR Checklist for Risk Mitigation

  • Determine who owns the design, processes, testing
    • Who will support, and operate the application(s)
  • Develop in-house expertise (or bring help in)
  • Conduct a risk assessment
  • Specify recovery time objectives/recovery point objectives
  • Design for failure (start with application design)
  • Implement HA best practices
    • Balance cost/risk/complexity
    • automate/abstract infrastructure
    • It can be costly to support referential integrity across zones
  • Document operational processes/automations & test them
  • Test the failover and recoveries
  • Unleash the Chaos Monkey!
    • Acknowledge that things do fail

General HA Best Practices

  • Avoid single points of failure (again)
  • Place at least one of each component in different fault domains
  • Maintain sufficient capacity to absorb faults
  • Replicate data across fault domains
  • Monitoring and alerts to automate problem resolution
  • Design stateless applications (to support failover/reboot/relaunch)
    • Avoid internal instance dependencies
  • Make use of platform specific monitoring features
  • Framework services can be slow to respond
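
The rule of placing at least one of each component in a different fault domain can be checked mechanically. A minimal sketch, assuming a hypothetical inventory of `(component, fault_domain)` pairs; it does not call any real platform API.

```python
from collections import defaultdict

def single_points_of_failure(instances, min_domains=2):
    """Return components that run in fewer than min_domains fault domains."""
    domains = defaultdict(set)
    for component, fault_domain in instances:
        domains[component].add(fault_domain)
    return sorted(c for c, d in domains.items() if len(d) < min_domains)

# Hypothetical inventory for illustration.
inventory = [
    ("web", "fd-0"), ("web", "fd-1"),
    ("worker", "fd-0"), ("worker", "fd-0"),  # two instances, same domain
    ("database", "fd-1"),
]
print(single_points_of_failure(inventory))  # ['database', 'worker']
```

Note that `worker` is flagged even though it has two instances: both sit in the same fault domain, so one fault takes out both.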

Some General DR Scenarios

  • Backup/restore
  • Simple Recovery
  • Warm standby
  • Multi-site
  • Multi-cloud


Consider the cost, complexity and risk implications. Each scenario defines a different level of availability and recovery time.

Multi-Cloud: Cold DR


It takes time to spin up the cold DR site. DNS switching can be time-sensitive, even when fully automated. The advantage is reduced running costs.
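
To see why spin-up and DNS timing matter, here is a rough sketch of a worst-case recovery estimate compared against a recovery time objective. All timings are made-up placeholders, and the function and variable names are hypothetical.

```python
def estimated_recovery_minutes(detect, spin_up, dns_ttl):
    """Rough worst case, in minutes, before clients reach the DR site.

    Assumes the steps run one after another: notice the outage, start the
    DR environment, then wait for cached DNS records to expire.
    """
    return detect + spin_up + dns_ttl

# Made-up example timings, not measurements from any platform.
cold_dr = {"detect": 5, "spin_up": 45, "dns_ttl": 10}
rto_minutes = 30  # hypothetical recovery time objective

estimate = estimated_recovery_minutes(**cold_dr)
print(f"Estimated recovery: {estimate} min, meets RTO: {estimate <= rto_minutes}")
```

With these placeholder numbers the estimate is 60 minutes against a 30-minute objective, so a design with this profile would need a shorter spin-up, a lower DNS TTL, or a more relaxed objective.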

Multi-Cloud: Warm DR


A slightly better approach, since data and exports can be replicated ahead of time. The data tier doesn’t need to spin up, just the other tiers. Storage can be partitioned into a separate fault domain, etc. The cost is still fairly minimal, with the same DNS timing issues. The standby DB could be put into read-only mode for reporting etc.

Azure SQL Database: Multi-tenant service.  Export can be put into Azure Storage BLOB and can be replicated to other regions.

Multi-Cloud: Hot DR


Apps are already spun up. Much higher cost, and DNS would still need to fail over.



For designs which can tolerate NO downtime.  Route DNS traffic to different clouds.  Data consistency becomes an issue as real-time production data is being captured in two completely separate clouds.  Is real-time synchronization something which is entirely necessary in this configuration?  High cost.
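
Routing traffic across clouds usually depends on health checks deciding which endpoints receive requests. A minimal sketch of that selection step, assuming a hypothetical `is_healthy` callback and placeholder endpoint names; real DNS or traffic-manager configuration is platform-specific and not shown.

```python
def route_targets(endpoints, is_healthy):
    """Return the endpoints that should receive traffic."""
    healthy = [e for e in endpoints if is_healthy(e)]
    # If every check fails, keep all endpoints rather than serve nothing:
    # the health checker itself may be the broken component.
    return healthy or list(endpoints)

# Hypothetical endpoints and health state for illustration.
endpoints = ["app.cloud-a.example", "app.cloud-b.example"]
status = {"app.cloud-a.example": False, "app.cloud-b.example": True}

print(route_targets(endpoints, lambda e: status[e]))
```

This only decides where traffic goes. It does nothing for the data consistency problem raised above, which has to be solved in the data tier.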

How do I make my service immortal?

  • Hope for the best, plan for the worst
    • Failures do occur, design for them
  • Embrace the cloud mentality
  • Fit for purpose – no one design suits all
    • Analyse requirements, appetite for risk
    • Costs
  • Start easy – build HA first, then expand
    • Start at process and procedures
    • Automation

Open Source/Standards: these need a community push to gain attention.
