AWS Amazon Reseller: A Hands-On Guide to Configuring AWS Multi-AZ and Cross-Region Automated Disaster Recovery Architecture

cloud 2026-05-29

Anyone who works in tech has probably heard the saying: "Everything fails; it's only a matter of time."

In cloud architecture, putting all your eggs in one basket is the cardinal sin. Many teams assume everything will be fine once they move their workloads to AWS (Amazon Web Services). Then the fiber serving one Availability Zone (AZ) is cut, or an entire Region goes offline because of extreme weather, and the service goes down instantly. That's when you realize that "high availability" means nothing on an architecture that was never configured for it.

Today we'll skip the abstract concepts and won't recite the official documentation. Get your AWS account ready and let's go straight to the practical material: a step-by-step walkthrough of a gold-standard architecture that combines Multi-AZ high availability within a single region with cross-region automated disaster recovery.

Phase 1: Multi-AZ within a single region: eliminating single points of failure

The Availability Zone (AZ) is a core AWS concept. A Region (such as Oregon, us-west-2) contains multiple Availability Zones (such as us-west-2a and us-west-2b). Each AZ is physically isolated (independent power, cooling, and networking), but the AZs are connected to one another by high-bandwidth, low-latency fiber.

Our first goal: even if one Availability Zone goes down completely, the remaining zones take over the workload quickly and automatically, with little or no impact visible to users.

1. VPC and subnet design

This is the foundation. Don't put all your servers into a single subnet.

Standard practice: use at least two different Availability Zones (for example, AZ-A and AZ-B) in your VPC.

In each Availability Zone, create one public subnet (for the load balancer) and one private subnet (for the EC2 application servers and the database). That gives you four subnets, with every tier present in both zones.
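
The four-subnet layout can be sketched with nothing more than Python's standard ipaddress module. This is only an illustration: the 10.0.0.0/16 VPC range, the /24 subnet size, and the function name plan_subnets are assumptions, not values from AWS or from this article.

```python
import ipaddress


def plan_subnets(vpc_cidr, azs, new_prefix=24):
    """Carve one public and one private subnet per AZ out of a VPC CIDR block."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = []
    for az in azs:
        for tier in ("public", "private"):
            # Each subnet gets the next free, non-overlapping block.
            plan.append({"az": az, "tier": tier, "cidr": str(next(blocks))})
    return plan


if __name__ == "__main__":
    # Assumed VPC range; the AZ names are the Oregon examples used earlier.
    for subnet in plan_subnets("10.0.0.0/16", ["us-west-2a", "us-west-2b"]):
        print(subnet)
```

Planning the ranges up front like this makes it easy to confirm that no two subnets overlap before you create anything in the console.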

2. Compute layer: ALB load balancing + Auto Scaling group

Don't point users directly at the public IP of a single EC2 instance.

Create an Application Load Balancer (ALB) and attach it to the public subnets in both Availability Zones. The ALB acts as the front door: as traffic arrives, it distributes requests across the back-end servers.

Create an Auto Scaling group (ASG):

Launch template: select your application image (AMI).

Key configuration: when choosing the network, select the private subnets in both zones, and set the desired capacity to 2 (two instances serving traffic under normal conditions).

Underlying logic: AWS balances instances across zones, so it launches one EC2 instance in AZ-A and one in AZ-B. If AZ-A fails one day and its instance goes down, the ASG detects the unhealthy instance and launches a replacement in the healthy AZ-B, while the ALB stops routing to the dead instance. The whole process is automatic.
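
If you would rather script this than click through the console, the same settings map onto the parameters of boto3's create_auto_scaling_group call. The sketch below only builds the parameter dictionary and does not call AWS; the group name, launch template name, subnet IDs, target group ARN, and the "maximum is twice the desired capacity" rule are all placeholder assumptions.

```python
def asg_request(name, launch_template, private_subnet_ids, target_group_arn, desired=2):
    """Build keyword arguments for autoscaling.create_auto_scaling_group."""
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateName": launch_template, "Version": "$Latest"},
        "MinSize": desired,
        "MaxSize": desired * 2,  # assumed headroom, tune for your workload
        "DesiredCapacity": desired,
        # Comma-separated private subnets, one per AZ, so instances spread across zones.
        "VPCZoneIdentifier": ",".join(private_subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Let the ALB's health checks decide when an instance is replaced.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


if __name__ == "__main__":
    # All identifiers below are hypothetical placeholders.
    print(asg_request(
        "web-asg",
        "web-template",
        ["subnet-aaa", "subnet-bbb"],
        "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/web/0123456789abcdef",
    ))
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("autoscaling").create_auto_scaling_group(**params).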

3. Data layer: one-click Multi-AZ for RDS

If a server dies, you can simply launch another one. If the data is lost or corrupted, the company may as well close up shop.

When you create an Amazon RDS instance (MySQL, for example), there is a golden switch called "Multi-AZ deployment". Turn it on without hesitation.

Under the hood: AWS places the primary instance in one Availability Zone (AZ-A) and a synchronous standby in another (AZ-B). Every write to the primary is replicated synchronously to the standby at the block level.

If AZ-A fails, RDS automatically initiates a failover: it promotes the standby to primary and repoints the database endpoint (its DNS name) at the new primary. Your back-end code doesn't need to change a single IP address. Existing connections are dropped, and the application reconnects once the failover completes, typically within one to two minutes.
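
One thing the application still has to do during a failover is reconnect, because open connections to the old primary are dropped. A minimal sketch of a retry-with-backoff helper follows. It assumes your connect function raises ConnectionError on failure; real database drivers raise their own exception types, so adapt the except clause. The attempt count and delays are illustrative.

```python
import time


def connect_with_retry(connect, attempts=6, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))


if __name__ == "__main__":
    # Simulated failover: the first two attempts fail, the third succeeds.
    outcomes = iter([False, False, True])

    def fake_connect():
        if not next(outcomes):
            raise ConnectionError("failover in progress")
        return "connected"

    print(connect_with_retry(fake_connect, sleep=lambda seconds: None))
```

Always connect through the RDS endpoint name rather than a cached IP address, so each retry resolves to whichever instance is currently the primary.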

Phase 2: Cross-region disaster recovery: keeping the enemy a thousand miles away

With Multi-AZ in place, your system is already protected against the vast majority of everyday physical failures. But what about an earthquake-scale event that affects the whole region, such as a region-wide network outage or a regulatory and compliance upheaval? This is where cross-region automated disaster recovery comes in.

Assume the production (primary) region is Tokyo (ap-northeast-1) and the disaster recovery (DR) region is Singapore (ap-southeast-1).

1. Cross-region data replication

The goal is to replicate data from Tokyo to Singapore in near real time.

Database level: on the RDS primary instance in Tokyo, choose Actions -> "Create cross-region read replica" and select Singapore as the destination region. AWS replicates the data asynchronously from Tokyo to Singapore over its global backbone.

File storage level: if you store user images or files in an S3 bucket, enable Cross-Region Replication (CRR) on the bucket so that objects written to the Tokyo bucket are automatically copied to the Singapore bucket. CRR requires versioning to be enabled on both buckets.
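
For teams that script their setup, both replication steps correspond to boto3 calls: rds.create_db_instance_read_replica (issued against the DR region) and s3.put_bucket_replication. The sketch below only builds the request payloads and does not call AWS. The instance identifiers, account number, IAM role, and bucket names are hypothetical placeholders.

```python
def read_replica_request(source_arn, replica_id, source_region="ap-northeast-1"):
    """Keyword arguments for rds.create_db_instance_read_replica in the DR region."""
    return {
        "DBInstanceIdentifier": replica_id,
        # A cross-region source is referenced by its full ARN.
        "SourceDBInstanceIdentifier": source_arn,
        "SourceRegion": source_region,
    }


def s3_replication_config(role_arn, dest_bucket_arn):
    """ReplicationConfiguration payload for s3.put_bucket_replication."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [{
            "ID": "replicate-all-to-dr",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix matches every object
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": dest_bucket_arn},
        }],
    }


if __name__ == "__main__":
    # Hypothetical identifiers for illustration only.
    print(read_replica_request(
        "arn:aws:rds:ap-northeast-1:123456789012:db:prod-mysql", "prod-mysql-dr"))
    print(s3_replication_config(
        "arn:aws:iam::123456789012:role/s3-crr-role", "arn:aws:s3:::my-app-assets-singapore"))
```

Whether delete markers should be replicated is a judgment call; the sketch leaves that off so that an accidental delete in Tokyo does not propagate to Singapore.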

2. Cold-standby infrastructure in the disaster recovery region (Singapore)

In the Singapore region, deploy a matching VPC, ALB, and Auto Scaling group.

Money-saving tip: to keep costs down, set the "minimum capacity" and "desired capacity" of the Singapore Auto Scaling group to 0 (or to 1 low-cost instance for heartbeat testing). In this state Singapore incurs almost no EC2 compute cost; what remains is mainly the read replica, storage, and cross-region replication traffic.

3. The conductor: Route 53 intelligent routing and failover

How do you switch global user traffic from Tokyo to Singapore when disaster strikes? That is the job of AWS's DNS service, Route 53.

In Route 53, configure a failover routing policy for your site's domain name (for example, api.yourcompany.com).

Configure two records: the primary record (Primary) points to the ALB in Tokyo, and the secondary record (Secondary) points to the ALB in Singapore.

Attach a health check: create a Route 53 health check for the Tokyo ALB so that Route 53 probes the homepage of the Tokyo site every 10 seconds (the fast interval; the standard interval is 30 seconds).

Failover logic: if the entire Tokyo region goes down, the health check turns unhealthy after several consecutive failures (three by default). Route 53 then trips the circuit breaker and starts answering DNS queries with the Singapore ALB. Clients switch over as their cached DNS answers expire.
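
The two failover records and the health check can also be expressed as the payloads that boto3's route53.create_health_check and route53.change_resource_record_sets expect. The sketch below builds those payloads without calling AWS. The ALB DNS names, hosted zone IDs, and health check ID are placeholders. Turning off target-health evaluation for the standby is an assumption made here because the standby may have no running instances until it is scaled up.

```python
def health_check_config(domain, path="/", interval=10, failure_threshold=3):
    """HealthCheckConfig payload for route53.create_health_check."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": interval,  # Route 53 accepts 10 or 30 seconds
        "FailureThreshold": failure_threshold,
    }


def failover_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """One alias record with failover routing; role is 'PRIMARY' or 'SECONDARY'."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-alb",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's own hosted zone ID
            "DNSName": alb_dns,
            # Assumption: do not evaluate the standby, it may be scaled to zero.
            "EvaluateTargetHealth": role == "PRIMARY",
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_alb, secondary_alb, health_check_id):
    """ChangeBatch payload; each ALB is a (dns_name, hosted_zone_id) tuple."""
    return {"Changes": [
        failover_record(name, "PRIMARY", *primary_alb, health_check_id=health_check_id),
        failover_record(name, "SECONDARY", *secondary_alb),
    ]}


def detection_seconds(interval=10, failure_threshold=3):
    """Rough time for the health check to turn unhealthy."""
    return interval * failure_threshold
```

With a 10-second interval and a threshold of three failures, detection alone takes roughly 30 seconds before DNS answers begin to change.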

Phase 3: The recovery procedure during a real disaster

Once the Tokyo region is truly unreachable, Route 53 will already have shifted traffic to Singapore. At that point the operations team only needs to perform two final "promotion" steps for the system to fully resume production:

Promote the database: sign in to the AWS console in the Singapore region, select the read replica that was replicating from Tokyo, and choose "Promote read replica". Within a few minutes it breaks the replication link with Tokyo and becomes a standalone primary database that accepts both reads and writes.

One-click wake-up of the compute layer: change the desired capacity of the Singapore Auto Scaling group from 0 to the production count you need (for example, 10), raising the maximum capacity first if it is lower. Within minutes the EC2 instances come up in Singapore and connect to the newly promoted database; their configuration should already point at the Singapore database endpoint.

Traffic flows in, the servers respond, and the database accepts reads and writes. A regional disaster that could bankrupt an ordinary company becomes, with this architecture, a service interruption of minutes rather than days.
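
In a real emergency nobody wants to hunt through the console, so the two manual steps are worth scripting ahead of time. A minimal sketch follows, assuming boto3-style clients are passed in. promote_read_replica and update_auto_scaling_group are real boto3 operations, while the replica name, group name, and the "maximum is twice the capacity" rule are placeholders. The RecordingClient class is a stand-in used here only to dry-run the runbook without touching AWS.

```python
def fail_over_to_dr(rds, autoscaling, replica_id, asg_name, capacity):
    """Promote the DR read replica, then scale the DR Auto Scaling group up."""
    # Step 1: break replication and make the replica a writable primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Step 2: raise min, max, and desired together so the request cannot
    # fail because the desired capacity exceeds the old maximum.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=capacity,
        MaxSize=capacity * 2,  # assumed headroom
        DesiredCapacity=capacity,
    )


class RecordingClient:
    """Dry-run stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def call(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return call


if __name__ == "__main__":
    rds, autoscaling = RecordingClient(), RecordingClient()
    fail_over_to_dr(rds, autoscaling, "prod-mysql-dr", "dr-web-asg", 10)
    print(rds.calls)
    print(autoscaling.calls)
```

For the real thing, the two clients would be created with boto3.client("rds", region_name="ap-southeast-1") and boto3.client("autoscaling", region_name="ap-southeast-1").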

Phase 4: The costs of a high-availability architecture and pitfalls to avoid

Data transfer fees: AWS charges (a small amount) for data transferred between Availability Zones, even within the same VPC. So where possible, keep traffic between your front-end EC2 instances and back-end internal services within the same AZ, and reserve cross-AZ traffic for database replication and synchronization between distributed cluster nodes.

Trade-off between RPO and RTO:

RPO (how much recent data you can afford to lose): because cross-region database replication is asynchronous, at the moment Tokyo goes down there may be a second or two of data that has not yet reached Singapore, and that data is stranded until Tokyo comes back. The business needs to be prepared for data reconciliation.

RTO (how long recovery takes): with the architecture in this article, automation plus minimal manual intervention can keep RTO within roughly 5 to 15 minutes.
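
To see where a figure like 5 to 15 minutes comes from, it helps to add up the stages. The back-of-the-envelope sketch below does that. Only the 10-second health check interval comes from this article; the failure threshold, DNS TTL, replica promotion time, and scale-out time are illustrative assumptions, and human reaction time is not included.

```python
def estimate_rto_seconds(check_interval=10, failure_threshold=3, dns_ttl=60,
                         promote_replica=300, scale_out=300, parallel=True):
    """Rough RTO: detection + DNS cache expiry + the two manual recovery steps."""
    detection = check_interval * failure_threshold
    if parallel:
        # Promotion and scale-out started at the same time.
        manual = max(promote_replica, scale_out)
    else:
        # Scale-out started only after promotion finishes.
        manual = promote_replica + scale_out
    return detection + dns_ttl + manual


if __name__ == "__main__":
    print(estimate_rto_seconds() / 60)                # 6.5 minutes
    print(estimate_rto_seconds(parallel=False) / 60)  # 11.5 minutes
```

Under these assumptions, running the two manual steps in parallel instead of one after the other saves about five minutes, which is a good argument for scripting them.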

Regular failure drills (chaos engineering): the worst thing you can do with a highly available architecture is set it up and then let it gather dust. Many companies build cross-region disaster recovery and then don't touch it for three years. When an incident hits in the fourth year, they discover that the Singapore machine image is long out of date and won't even start. It's a good idea to manually fail Route 53 over to the DR region in the early hours of a weekend every six months and run a real cutover drill.

Summary

Zero single points of failure comes not from luck but from sound architecture design.

Multi-AZ within a region delivers high availability (HA), keeping the service online day to day. Cross-region automated disaster recovery delivers disaster recovery (DR), ensuring the company survives extreme events.

Weld these two lines of defense into your AWS account. From then on, no matter how rough the storm outside, you can sit at your computer, steady as Mount Tai.