AZR333 – Planning for Failure in Cloud Applications

You can count on one of three things failing: hardware, software, or people. One of the most important considerations when moving applications into the public cloud is how to plan for – and mitigate – these failures.

Certainly there are best practices in building any application that help you to handle failures, but what are the practices when your applications run in the public cloud? In this presentation, Wade Wegner will draw upon his years of experience with cloud applications in Windows Azure to share proven practices for handling failure in cloud applications.

Presented by Wade Wegner

Disclaimer: These are conference session notes I compiled during various sessions at Microsoft Tech Ed 2012, September 11-14, 2012. The majority of the content comprises notes taken from the presentation slides accompanied, occasionally, by my own narration. Some of the content may be free hand style. Enjoy… Rob

Introduction

Architectural options for designing highly-available, fault tolerant applications
Best practices for these options
Multi-availability Zones (AZ)

Cloud Outages

AWS 21/4/2011
Azure 29/2/2012
AWS 14/6/2012
AWS 29/6/2012
Azure 26/7/2012

Quite a range of outages are listed above. A leap-year date handling issue caused one of them (Azure, 29/2/2012); others were caused by lightning and infrastructure issues. In essence, failures can and will occur.

What do we need to consider?

Fault Tolerance
High availability
Disaster recovery

Read your SLAs!

Windows Azure has monthly SLAs, for example. Keep in mind that SLAs rarely reimburse for revenue lost due to outages.

‘Compounding SLAs’

If different systems have different SLAs, e.g.
Azure Compute = 99.95%
SQL Azure = 99.9%
Azure Storage = 99.9%

Allowed downtime per year: 4.38h + 8.76h + 8.76h
Total potential outage: 21.9 hours per year
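
The arithmetic above can be reproduced with a short sketch. The SLA percentages come from the notes; the 365-day year and the assumption that the three services fail independently (used for the composite figure) are mine, not the presenter's. Summing the individual allowances gives the worst case where no outages overlap, while multiplying the SLAs gives the availability of an application that needs all three services at once.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year

# SLA figures as quoted in the notes.
slas = {
    "Azure Compute": 0.9995,
    "SQL Azure": 0.999,
    "Azure Storage": 0.999,
}


def allowed_downtime_hours(sla, period_hours=HOURS_PER_YEAR):
    """Hours of downtime an SLA permits over the given period."""
    return (1.0 - sla) * period_hours


def composite_sla(values):
    """Availability of a system that needs every service up at once.

    Assumes the services fail independently of each other.
    """
    result = 1.0
    for value in values:
        result *= value
    return result


# Worst case: outages never overlap, so the allowances simply add up.
worst_case_hours = sum(allowed_downtime_hours(s) for s in slas.values())

# Composite view: multiply the availabilities together.
combined = composite_sla(slas.values())
combined_hours = allowed_downtime_hours(combined)

print(f"Sum of individual allowances: {worst_case_hours:.2f} h/year")
print(f"Composite SLA: {combined:.4%}")
print(f"Downtime implied by composite SLA: {combined_hours:.2f} h/year")
```

Either way, the application as a whole has a noticeably weaker guarantee than any single service it depends on.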

Let's define ‘Cloud’

Physical data centre behind an API
Cloud is a ‘resource pool’ behind an API
A cloud is not
- Azure
- AWS
A cloud is defined by the isolation of resources
Sometimes might need to go across Cloud platforms
- e.g. Azure, AWS, different data centres, different geo-locations

A cloud is a specific data centre (rather than the platform itself)

Define High Availability?

Remove all single points of failure
- Multiple hosts, load balancers, data replication
Graceful failover
- Platforms might provide functionality to support this
- Sometimes you need to build it

Define Disaster Recovery?

Processes or procedures to recover from a failure
- Network, hardware, software etc
Practice and test DR strategies; this takes a lot of time
- Document, train, rehearse
Disaster can occur anywhere

Typical Approach

Duplication of infrastructure
Identical spec
Cold failover
Typically either under-provisioned or over-provisioned

DR with Cloud

Consider the advantages/features of each platform
- to support migration, durability, restoration of data
Scale up as needed
Geo-located
- Azure: Regions & Fault domains
- AWS: Regions & availability zones
Move applications into separate fault domains

Design for Failure

Large scale failures are rare, but happen
Applications need to be fault-aware and able to recover
Balance cost of tolerance against cost/risk
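
One common way to make an application fault-aware is to retry transient failures with exponential backoff. The session notes do not prescribe this technique, so treat it as an illustration only. `TransientError` and `call_with_retries` are hypothetical names, and the attempt count and delays are placeholder values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Run operation(), retrying on TransientError with backoff and jitter.

    The delay ceiling doubles after each failed attempt, capped at max_delay.
    The actual wait is a random value up to that ceiling, so many clients
    recovering at once do not all retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The cap on attempts is the "balance cost of tolerance against cost/risk" part: retrying forever hides real outages, so at some point the failure has to be handed to a fallback or to an operator.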

API Endpoint Differences

APIs differ
Different resources, billing
Network architectures vary (VLANs, security groups)
Different storage architecture
Abstractions and management vary
Each Cloud is unique in various ways

Overcoming Multi-Cloud Pain

Design using generic concepts
Have tools which translate generic concepts into cloud-specific resources
How to share resources across clouds
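
A minimal sketch of the "generic concepts" idea: describe a server once in neutral terms and let a lookup table translate it per cloud. `ServerSpec`, `TRANSLATIONS` and `translate` are hypothetical names, and the size and location pairings are illustrative assumptions rather than an official equivalence between the platforms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Cloud-neutral description of a server."""
    role: str
    size: str      # generic size, e.g. "small" or "medium"
    location: str  # generic location, e.g. "us-east"


# Illustrative mapping only: which sizes and regions count as "equivalent"
# is a design decision, not something either platform defines.
TRANSLATIONS = {
    "azure": {
        "size": {"small": "Small", "medium": "Medium"},
        "location": {"us-east": "East US"},
    },
    "aws": {
        "size": {"small": "m1.small", "medium": "m1.medium"},
        "location": {"us-east": "us-east-1"},
    },
}


def translate(spec, cloud):
    """Turn a generic spec into the names a specific cloud expects."""
    table = TRANSLATIONS[cloud]
    return {
        "role": spec.role,
        "size": table["size"][spec.size],
        "location": table["location"][spec.location],
    }


web = ServerSpec(role="web", size="small", location="us-east")
for cloud in ("azure", "aws"):
    print(cloud, translate(web, cloud))
```

A missing entry raises `KeyError`, which is deliberate: a concept with no translation on the target cloud should fail at design time, not during a failover.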

Infrastructure Abstraction/Automation

Simplify deployments across multiple regions/zones
Automate deployments
- Reproducible, consistent
Advanced server and deployment monitoring
- Some API support, e.g. custom performance counters
- Azure aggregates a lot of data, performance counters etc
- Still maturing
Automatic scaling and operations (and throttling)
Third party services/apps/tools can help
Make use of diagnostic information

Reduced cost of maintenance. ScaleExtreme works across clouds.

HA/DR Checklist for Risk Mitigation

Determine who owns the design, processes, testing
- Who will support, and operate the application(s)
Develop in-house expertise (or bring help in)
Conduct a risk assessment
Specify recovery time objectives/recovery point objectives
Design for failure (start with application design)
Implement HA best practices
- Balance cost/risk/complexity
- automate/abstract infrastructure
- It can be costly to support referential integrity across zones
Document operational processes/automations & test them
Test the failover and recoveries
Unleash the Chaos Monkey!
- Acknowledge that things do fail

General HA Best Practices

Avoid single points of failure (again)
Place at least one of each component in different fault domains
Maintain sufficient capacity to absorb faults
Replicate data across fault domains
Monitoring and alerts to automate problem resolution
Design stateless applications (to support failover/reboot/relaunch)
- Avoid internal instance dependencies
Make use of platform specific monitoring features
Framework services can be slow to respond
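
The first three practices can be checked mechanically. The sketch below audits a deployment for components confined to one fault domain and for components that would fall below their minimum capacity if their busiest fault domain were lost. The deployment, the `fd-0`/`fd-1` labels and the minimum counts are all made-up example data, and `audit` is a hypothetical helper, not a platform API.

```python
from collections import defaultdict

# Hypothetical deployment: one (component, fault_domain) pair per instance.
instances = [
    ("web", "fd-0"), ("web", "fd-1"), ("web", "fd-1"),
    ("worker", "fd-0"), ("worker", "fd-1"),
    ("cache", "fd-0"),
]

# Hypothetical minimum instance count each component needs to carry the load.
required = {"web": 2, "worker": 1, "cache": 1}


def audit(instances, required):
    """Return a list of findings; an empty list means the placement passes."""
    per_component = defaultdict(lambda: defaultdict(int))
    for component, domain in instances:
        per_component[component][domain] += 1

    findings = []
    for component, minimum in required.items():
        domains = per_component[component]
        total = sum(domains.values())
        if len(domains) < 2:
            findings.append(
                f"{component}: single point of failure, found only in {sorted(domains)}"
            )
        # Worst case: the fault domain holding the most instances goes down.
        remaining = total - max(domains.values(), default=0)
        if remaining < minimum:
            findings.append(
                f"{component}: {remaining} instance(s) left after losing a "
                f"fault domain, needs {minimum}"
            )
    return findings


for finding in audit(instances, required):
    print(finding)
```

In this example "web" is spread over two fault domains yet still fails the capacity check, which is the point of "maintain sufficient capacity to absorb faults": spreading instances out is necessary but not sufficient.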

Some General DR Scenarios

Backup/restore
Simple Recovery
Warm standby
Multi-site
Multi-cloud

Consider cost, complexity and risk implications. Each scenario defines a different level of availability and recovery time.

Multi-Cloud: Cold DR

It takes time to spin up the cold DR environment. DNS switching can be time-sensitive, even if fully automated. The upside is reduced running costs.
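
To see why the DNS switch matters, a rough worst-case failover time can be estimated as detection time plus spin-up time plus the DNS TTL, and compared against the recovery time objective. Every number below is a placeholder I chose for illustration, not a measurement from the session, and `FailoverPlan` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class FailoverPlan:
    """Rough worst-case failover estimate; all durations are in seconds."""
    detection_s: int     # time until monitoring declares the primary down
    provisioning_s: int  # time to spin up the tiers that are not yet running
    dns_ttl_s: int       # clients may keep the old record for up to one TTL

    def worst_case_s(self):
        return self.detection_s + self.provisioning_s + self.dns_ttl_s

    def meets(self, rto_s):
        """True if the estimate fits within the recovery time objective."""
        return self.worst_case_s() <= rto_s


# Placeholder figures only: measure your own environment.
cold = FailoverPlan(detection_s=120, provisioning_s=1800, dns_ttl_s=300)
rto_s = 3600  # hypothetical one-hour recovery time objective

print(f"Cold DR worst case: {cold.worst_case_s() / 60:.0f} min")
print(f"Meets RTO: {cold.meets(rto_s)}")
```

The estimate is optimistic in one respect: some resolvers and clients cache DNS records for longer than the TTL, so real traffic can take longer to move than the sum suggests.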

Multi-Cloud: Warm DR

A slightly better approach: data and exports can be replicated, so the data tier doesn’t need to spin up, just the other tiers. Storage can be partitioned into a separate fault domain, etc. Still fairly minimal cost, with the same DNS timeframe issues. The DB could be put into read-only mode for reporting etc.

Azure SQL Database: Multi-tenant service. An export can be put into an Azure Storage blob and replicated to other regions.

Multi-Cloud: Hot DR

Apps are spun up, at much higher cost. DNS would need to fail over.

Multi-Cloud HA

For designs which can tolerate NO downtime. Route DNS traffic to different clouds. Data consistency becomes an issue as real-time production data is being captured in two completely separate clouds. Is real-time synchronization something which is entirely necessary in this configuration? High cost.

How do I make my service immortal?

Hope for the best, plan for the worst
- Failures do occur, design for them
Embrace the cloud mentality
Fit for purpose – no one design suits all
- Analyse requirements, appetite for risk
- Costs
Start easy – build HA first, then expand
- Start at process and procedures
- Automation

Open Source/Standards: needs community push to garner some attention.

Author

Rob Sanders
