High Availability Architecture Design of Amazon Cloud International Station: Disaster Recovery and ELB Load Balancing Across Availability Zones (AZ)

2026-05-19

In the era of cloud computing, "high availability" (HA) is a phrase on the lips of almost every architect. Many people think that as long as the system is moved to Amazon Cloud (AWS), a few EC2 instances are purchased, and a load balancer is put in front, high availability will follow naturally.

However, real production environments often give this blind optimism a resounding slap in the face.

When an AWS Availability Zone (AZ) is paralyzed by a cut fiber-optic cable, a power failure, or a software-defined network (SDN) fault, will your service fail over automatically and seamlessly, or will everything go down together while the customer service phones ring off the hook?

Real high availability is not bought, it is designed. This article drops the slide-deck jargon and uses plain language to take apart the two core high-availability topics on Amazon Cloud International Station: cross-AZ disaster recovery and ELB load balancing.

1. Redefine "High Availability": The Underlying Truth of Region and AZ

Before we can start designing an architecture, we must first understand the underlying physical infrastructure of AWS. This is the foundation of all highly available designs.

Region: A geographically isolated region (e.g., Tokyo, Virginia, Ireland). Regions are so far apart that, barring a catastrophe on a geopolitical scale, a blackout in one region is not expected to affect another.

Availability Zones (AZ): A region contains multiple Availability Zones. Remember: an AZ is not the same as a single data center. An AZ may consist of several data centers that are close to each other, and each AZ has power and networking that are independent of the other AZs in the region.

💡 Core pain point: why is a single-AZ architecture a dead end? To save money, many enterprises expanding overseas keep their web servers and databases in the same AZ (such as ap-northeast-1a). This is like putting all your eggs in a basket that is billed as "bulletproof" but can still break. Once that AZ suffers a backbone network failure, your business goes offline instantly.

Therefore, a cross-AZ (Multi-AZ) architecture is not an "advanced option" but a hard baseline for any production environment.

2. The Traffic Police: The Right Way to Use ELB (Elastic Load Balancing)

To achieve cross-AZ disaster tolerance, the first step is a "traffic police officer" that distributes requests from users around the world evenly and intelligently to servers in different Availability Zones. In AWS, this role is played by ELB (Elastic Load Balancing).

AWS offers a variety of load balancers, but in modern web architectures we focus primarily on ALB (Application Load Balancer) and NLB (Network Load Balancer).

1. ALB vs NLB: Don't Choose the Wrong Weapon

| Characteristics | ALB (Application Load Balancer) | NLB (Network Load Balancer) |
| --- | --- | --- |
| Working OSI layer | Layer 7 (application layer: HTTP/HTTPS) | Layer 4 (transport layer: TCP/UDP/TLS) |
| Core advantages | Smart. Inspects URL paths, cookies, and HTTP headers for advanced routing. Supports SSL/TLS offloading. | Extremely fast. Ultra-low latency, can handle millions of requests per second, and supports static IP addresses. |
| Applicable scenarios | The vast majority of web applications, microservice APIs, and e-commerce sites. | Game gateways, Internet of Things (IoT) receivers, and raw TCP services for non-HTTP protocols. |

For the vast majority of enterprises expanding overseas, ALB is the recommended choice.

2. Cross-Zone Load Balancing

This is the most common pitfall in ELB design.

By default, cross-zone load balancing is enabled for ALB and disabled for NLB. What difference does it make?

In the disabled state: Suppose DNS sends 50% of the traffic to the load balancer node in AZ-A and 50% to the node in AZ-B. Even if there is only one server in AZ-A and four servers in AZ-B, the single AZ-A server must absorb 50% of the traffic on its own, while each AZ-B server handles only 12.5%.

In the enabled state: No matter which AZ's load balancer node the traffic reaches first, ELB distributes requests evenly across all backend instances in all AZs behind it.
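The arithmetic behind the two states can be checked with a short sketch. It assumes DNS splits traffic evenly across one load balancer node per AZ and that every AZ has at least one instance; the function name and the fleet layout are illustrative only.

```python
def per_instance_share(instances_per_az, cross_zone):
    """Return the fraction of total traffic each instance receives, keyed by AZ.

    Assumes DNS splits traffic evenly across the load balancer nodes (one per AZ)
    and that each node spreads its requests evenly over the targets it may use.
    """
    az_count = len(instances_per_az)
    total_instances = sum(instances_per_az.values())
    shares = {}
    for az, count in instances_per_az.items():
        if cross_zone:
            # Every node can reach every target, so all instances share equally.
            shares[az] = 1 / total_instances
        else:
            # A node only forwards to targets in its own AZ.
            shares[az] = (1 / az_count) / count
    return shares


fleet = {"AZ-A": 1, "AZ-B": 4}  # the example from the text
print(per_instance_share(fleet, cross_zone=False))  # {'AZ-A': 0.5, 'AZ-B': 0.125}
print(per_instance_share(fleet, cross_zone=True))   # {'AZ-A': 0.2, 'AZ-B': 0.2}
```

With cross-zone balancing off, the lone AZ-A server carries four times the load of each AZ-B server; with it on, all five carry 20%.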

Conclusion: Unless you have extremely demanding (microsecond-level) latency requirements, it is strongly recommended to keep cross-zone load balancing turned on so that the load is spread evenly across the backend.

3. The Iron Triangle of Cross-AZ Disaster Recovery: Compute, Network, and Storage

ELB, the traffic police, is not enough on its own. The backend lanes (compute), the road network (network), and the warehouse (storage) must also span AZs in order to achieve "imperceptible failover" when an AZ goes down.

1. Compute Layer: An Auto Scaling Group Is the Only Correct Approach

Never manually create two machines and drop them into two AZs. You should use an AWS Auto Scaling Group (ASG). All you need to do is configure a Launch Template and tell the ASG: "I need at least 2 machines and at most 10, spread across the two AZs ap-northeast-1a and ap-northeast-1b."
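As a rough sketch, the instruction to the ASG might translate into the following request for boto3's `create_auto_scaling_group` call. The group name, launch template name, subnet IDs, and target group ARN are all hypothetical placeholders, and the 300-second grace period is an assumed value rather than a recommendation from this article.

```python
def build_asg_request(name, launch_template_name, subnet_ids, target_group_arn):
    """Build keyword arguments for boto3's autoscaling create_auto_scaling_group."""
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": 2,    # "at least 2 machines"
        "MaxSize": 10,   # "at most 10 machines"
        # One private subnet per AZ; the ASG spreads instances across them.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Replace instances that fail the load balancer's health check,
        # not only those that fail EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,  # assumed warm-up time in seconds
    }


def create_asg(request):
    """Send the request to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("autoscaling").create_auto_scaling_group(**request)


request = build_asg_request(
    name="web-asg",                                     # hypothetical
    launch_template_name="web-template",                # hypothetical
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # hypothetical: one in 1a, one in 1b
    target_group_arn=(
        "arn:aws:elasticloadbalancing:ap-northeast-1:123456789012:"
        "targetgroup/web/0123456789abcdef"              # hypothetical
    ),
)
```

Calling `create_asg(request)` would submit it; the builder is kept separate so the configuration can be reviewed or unit-tested without touching AWS.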

Health check: Don't let the ELB merely check whether the EC2 instance is up (a ping). Configure the ELB to call a specific endpoint in your application (such as /health). If this endpoint returns a 500, for example because the database connection has failed, the ELB marks the instance as "unhealthy" and stops sending traffic to it.

Self-healing: When an AZ goes down and every machine in it is lost, the ASG detects the shortfall and automatically launches new machines in another healthy AZ to fill the gap.
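A /health endpoint of the kind described above could look like the following minimal sketch, built on Python's standard `http.server`. The `database_is_reachable` probe is a hypothetical stand-in for whatever dependency check your application needs, and the port is an assumption.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_is_reachable():
    """Hypothetical dependency probe; replace with a cheap real query such as SELECT 1."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy = database_is_reachable()
        body = json.dumps({"status": "ok" if healthy else "degraded"}).encode()
        # 200 keeps the instance in rotation; 500 lets the load balancer
        # mark it unhealthy after enough consecutive failures.
        self.send_response(200 if healthy else 500)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep health-check noise out of the access log


def make_server(port=8080):
    """Create (but do not start) the HTTP server; port 8080 is an assumed default."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)


# To run it: make_server().serve_forever()
```

The target group's health check path would then point at /health. Keep the probe cheap, since the load balancer calls it every few seconds on every instance.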

2. Network Layer: Subnet and the Golden Rule of Routing

When designing a virtual private cloud (VPC), many people end up with a muddled structure. A standard high-availability network architecture should follow the principle of "paired deployment and strict isolation."

Public subnets (Public Subnet): Host the ELB and the NAT gateways, and route to the Internet through the VPC's Internet Gateway. One per AZ.

Private subnets (Private Subnet): Host the actual EC2 application servers. These machines do not need public IPs and must not be directly exposed to the Internet. One per AZ.

The NAT gateway high-availability trap: To save money, many teams build only one NAT gateway in the entire VPC, put it in AZ-A, and let the private EC2 instances in AZ-B detour across AZs to reach the Internet. Once AZ-A goes down, the AZ-B servers are still alive, but they are effectively crippled because they can no longer reach the Internet (no external API calls, no dependency downloads). The right approach: each AZ gets its own independent NAT gateway.
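The NAT gateway trap is easy to audit once the routing layout is written down. The sketch below works on a hand-written inventory rather than live AWS data; the dictionary shapes, subnet IDs, and gateway IDs are hypothetical.

```python
def find_cross_az_nat_routes(private_subnets, nat_gateways):
    """Return IDs of private subnets whose default route uses a NAT gateway in another AZ.

    private_subnets: list of dicts with "id", "az", and "nat_gateway_id"
    nat_gateways:    dict mapping NAT gateway ID -> the AZ it lives in
    """
    risky = []
    for subnet in private_subnets:
        nat_az = nat_gateways.get(subnet["nat_gateway_id"])
        # A missing gateway or one in a different AZ is a single point of failure.
        if nat_az != subnet["az"]:
            risky.append(subnet["id"])
    return risky


# Hypothetical inventory: a single NAT gateway shared by both AZs.
nat_gateways = {"nat-a": "ap-northeast-1a"}
private_subnets = [
    {"id": "subnet-private-a", "az": "ap-northeast-1a", "nat_gateway_id": "nat-a"},
    {"id": "subnet-private-b", "az": "ap-northeast-1b", "nat_gateway_id": "nat-a"},
]
print(find_cross_az_nat_routes(private_subnets, nat_gateways))  # ['subnet-private-b']
```

An empty result means every private subnet exits through a NAT gateway in its own AZ, which is the layout recommended above.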

3. Storage and Database Layer: Say Farewell to Single Points, Embrace Multi-AZ

Compute nodes are stateless and can be relaunched whenever they die. But data is stateful and must not be lost.

Relational database (RDS / Aurora): Enabling Multi-AZ deployment is strongly recommended. AWS automatically creates a synchronously replicated standby instance in another AZ. When the AZ hosting the primary fails, RDS automatically repoints the endpoint's DNS record and promotes the standby to primary, typically within tens of seconds to a couple of minutes depending on the engine and deployment type. Your application does not even need to change its connection string (the endpoint).
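Although the endpoint stays the same, connections that are open during a failover will break, so the application still needs to reconnect. A minimal retry sketch, assuming a caller-supplied `connect` function that opens a connection by endpoint hostname (so DNS is resolved again on each attempt); the attempt count and delays are illustrative values.

```python
import time


def connect_with_retry(connect, attempts=6, base_delay=1.0, max_delay=30.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between failures.

    connect:  zero-argument callable that opens a connection and raises on failure
    retry_on: exception types treated as transient (database drivers usually
              define their own; pass them here)
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Demo with a fake connection that fails twice, as it might mid-failover.
_state = {"calls": 0}

def _fake_connect():
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise ConnectionError("failover in progress")
    return "connected"

_waits = []
print(connect_with_retry(_fake_connect, sleep=_waits.append), _waits)  # connected [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the demo instant; in production the default `time.sleep` applies.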

File storage (EFS vs EBS):

EBS (block storage volume): Bound to a specific AZ. If an EC2 instance in AZ-A dies, you cannot directly attach its EBS volume to an instance in AZ-B.

EFS (Elastic File System): Natively spans AZs. EC2 instances in multiple AZs can read from and write to the same EFS file system at the same time. If your business requires shared file storage (such as a WordPress image upload directory), don't hesitate to choose EFS.

4. Final Walkthrough: A Standard Multi-AZ High-Availability Architecture in Action

To give you a more intuitive feel, let's simulate how a standard user request flows through the AWS cross-AZ high availability architecture.

Scenario: A user visits an e-commerce website at https://example.com

Domain name resolution: When a user initiates a request, AWS Route 53 (smart DNS) resolves the domain name. Because a latency-based or round-robin style routing policy is configured, Route 53 directs traffic to the ALB deployed in the public subnets.

Traffic distribution: The ALB receives the HTTPS request, terminates TLS itself (offloading that work from the backend), and then forwards the request to an EC2 instance in the private subnet of AZ-A or AZ-B according to the cross-zone load balancing policy.

Business processing and data reads and writes: The application on EC2 handles the business logic. When it needs to read from or write to the database, it connects to the RDS primary (located in AZ-A). Meanwhile, RDS automatically replicates the data to the RDS standby (located in AZ-B).

Disaster occurs (simulating a complete network outage in AZ-A):

ELB response: The ALB sees the health checks for the EC2 instances in AZ-A fail repeatedly and removes them from the forwarding list. All traffic then flows to the EC2 instances in AZ-B.

RDS self-healing: AWS detects that the RDS primary is unreachable and automatically initiates a failover. The standby in AZ-B is promoted to the new primary, typically within a minute or two, and the DNS record behind the RDS endpoint is updated automatically. The EC2 instances in AZ-B resume normal reads and writes after a brief burst of errors.

ASG expansion: The Auto Scaling Group notices that the number of healthy instances has dropped below the configured minimum, launches brand-new EC2 instances in the healthy AZ-B, and registers them behind the ALB automatically.
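How quickly the ALB stops routing to the failed AZ depends on the target group's health check settings. As a rule-of-thumb sketch, a target is marked unhealthy after a configured number of consecutive failed checks, so the detection window is roughly the check interval multiplied by that threshold. The two configurations below are illustrative values, not recommendations from this article.

```python
def detection_window_seconds(interval_seconds, unhealthy_threshold):
    """Approximate time before a failing target is marked unhealthy.

    Rule of thumb: the target must fail `unhealthy_threshold` consecutive
    checks, and checks run every `interval_seconds`.
    """
    return interval_seconds * unhealthy_threshold


# Illustrative settings (assumed values): a relaxed and a tighter configuration.
print(detection_window_seconds(30, 2))  # 60
print(detection_window_seconds(10, 2))  # 20
```

A shorter interval reduces the number of users who see errors during the switch, at the cost of more health check requests hitting every instance.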

Result: Throughout the whole process, apart from the few users who send a request at the exact moment of the switch and may hit an error and need a retry (502/504), 99% of users are completely unaware that the backend has just been through a data-center-level disaster and a race against time.

5. Pitfall Guide: Three Fatal Mistakes Architects Often Make

From a large number of real-world failure cases seen while putting this architecture into practice, I have summarized the three hidden reefs that are most easily overlooked:

1. Cross-AZ transfer costs (Cross-AZ Data Transfer Costs)

Inbound traffic to AWS (coming in from the Internet) is free, but within the same region, traffic between different AZs is charged (usually $0.01/GB in each direction).

If your microservices are split too finely and service A (in AZ-A) frequently calls service B (in AZ-B) over RPC, you will receive a frightening network bill at the end of the month.

Optimization: Keep internal traffic within the same AZ wherever possible, and cross AZs only during a disaster recovery switchover.
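A quick estimate shows how fast chatty cross-AZ calls add up. This sketch uses the $0.01/GB figure from the text and assumes the charge applies in both directions; the 500 GB/day volume is a made-up example, and actual prices should be checked against the current AWS price list.

```python
def monthly_cross_az_cost(gb_per_day, price_per_gb=0.01, directions_charged=2, days=30):
    """Estimate the monthly cross-AZ data transfer bill in USD.

    price_per_gb:       per-GB rate from the text (assumed to be current)
    directions_charged: 2 assumes the sender and the receiver are both billed
    """
    return round(gb_per_day * days * price_per_gb * directions_charged, 2)


# Hypothetical workload: service A sends 500 GB per day to service B in another AZ.
print(monthly_cross_az_cost(500))  # 300.0
```

The same volume kept inside one AZ would cost nothing in transfer fees, which is the motivation for the optimization above.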

2. Database "Split-Brain" and Replication Lag

RDS Multi-AZ uses synchronous replication and performs very well. But if you run a self-managed database cluster (such as a heavily customized MySQL MHA setup), network jitter between AZs can easily cause "split-brain," in which both nodes believe they are the primary, leaving the written data inconsistent.

Optimization: Leave professional work to professional tools. For production environments, it is strongly recommended to use RDS or Aurora first and leave the underlying distributed consistency problem to AWS.

3. Forgetting to Test: High Availability on Paper Is Not High Availability

Many teams design a perfect architecture but never run a single disaster recovery drill in the three years after launch. Then, when a real failure hits, they discover that the database IP address is hard-coded in the application, or that a security group (Security Group) does not allow the CIDR range of the other AZ.

Optimization: Practice chaos engineering (Chaos Engineering) on a regular schedule. During off-peak hours, go to the RDS console and manually trigger a failover, or deliberately shut down all the EC2 instances in one AZ, and check whether the system heals itself as expected.
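The manual failover drill can also be scripted so that it runs on a schedule. The sketch below uses boto3's RDS `reboot_db_instance` call with `ForceFailover=True`, which applies to Multi-AZ DB instances (Aurora clusters use a different call). The instance identifier is hypothetical, and the client is passed in so the function can be exercised with a stub before it is pointed at a real database.

```python
def force_rds_failover(rds_client, db_instance_id):
    """Reboot a Multi-AZ RDS instance with failover, promoting the standby.

    rds_client: a boto3 RDS client, e.g. boto3.client("rds")
    """
    return rds_client.reboot_db_instance(
        DBInstanceIdentifier=db_instance_id,
        ForceFailover=True,  # fail over to the standby instead of rebooting in place
    )


# Example drill (requires boto3, credentials, and an off-peak window):
#   import boto3
#   force_rds_failover(boto3.client("rds"), "orders-db")  # hypothetical instance name
```

During the drill, watch application error rates and reconnect times; a hard-coded IP or a missing security group rule shows up immediately.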

Conclusion

In the cloud-native era, disaster tolerance should not be an expensive and tedious physical chore, but a design instinct.

By combining the intelligent traffic distribution of ELB, the stateless self-healing of Auto Scaling, and the cross-AZ synchronous replication of RDS, we have used only a few standard Amazon Cloud International Station services to build a rock-solid architecture that can withstand data-center-level disasters.

Remember, the highest level of high availability is not preventing failures from ever happening. It is that when the failure inevitably arrives, your system merely shudders slightly and keeps moving steadily through the wind and rain.