Not afraid of a data center fire? A plain-language look at the high-availability design behind SLB, Alibaba Cloud's disaster-tolerant load balancer

2026-05-26

When it comes to high-availability architecture, many of us share a misconception: "I have 10 back-end servers and the database has a primary and a standby, so my architecture is as stable as Mount Tai."

But have you ever considered what happens if the "front man" standing in front of those 10 servers, the load balancer (SLB), goes down? No matter how powerful your back-end ECS instances are, user traffic cannot even get through the door, and the entire system is paralyzed instantly. This is a classic single point of failure (SPOF).

As the "main gate" for all inbound traffic, how does Alibaba Cloud's load balancer SLB (now split into the classic CLB and the application-oriented ALB) keep forwarding traffic rock-steady even when a data center is on fire or a backbone link is cut? Today we'll skip the fluff and pop the hood to look at its high-availability chassis.

1. The first layer of armor: the cross-zone "spare tire" mechanism

When you purchase an SLB instance in the Alibaba Cloud console, you will notice that the system always asks you to choose two things: a primary zone and a backup zone. For example, you might select Beijing Zone A as the primary and Beijing Zone B as the backup.

This is SLB's most basic and most important high-availability design: redundancy across data centers.

Normal state (clear primary and backup roles): Under the hood, SLB starts at least two sets of hardware or virtual machine instances for you, one in each data center. In normal operation, 100 percent of traffic goes to the SLB instance in primary zone A. The instance in backup zone B is in "hot standby", like a car's spare tire: idling along and silently watching.

Extreme state (failover within seconds): Suppose the data center in Beijing Zone A suddenly loses power, or its fiber is cut. Alibaba Cloud's underlying health check system reacts within 2-5 seconds and moves the virtual IP (VIP) behind your domain name to the SLB instance in backup zone B.

User perception: a user's request may drop briefly while the connection is broken and re-established, but normal access resumes almost immediately. You do not need to change any DNS records or click a manual switch in the console; the lower layers handle everything automatically.
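The primary/backup behavior described above can be sketched as a toy model. This is only an illustration, not Alibaba Cloud's implementation: the class name `ZonePair`, the `fail_threshold` of 3 probes, and the choice not to fail back automatically are all assumptions made for the sketch.

```python
class ZonePair:
    """Toy model of a primary/backup SLB pair sharing one virtual IP (VIP)."""

    def __init__(self, primary: str, backup: str, fail_threshold: int = 3):
        self.primary = primary
        self.backup = backup
        self.fail_threshold = fail_threshold  # assumed value, not an official one
        self.active = primary                 # zone currently holding the VIP
        self._failures = 0                    # consecutive failed probes of the primary

    def observe(self, primary_alive: bool) -> str:
        """Feed one probe result for the primary zone; return the zone holding the VIP."""
        if primary_alive:
            self._failures = 0
        else:
            self._failures += 1
            if self._failures >= self.fail_threshold:
                # The VIP "drifts" to the backup zone. Failing back to the
                # primary is deliberately left out of this sketch.
                self.active = self.backup
        return self.active


pair = ZonePair("beijing-zone-a", "beijing-zone-b")
for alive in (True, False, False, False):
    print(alive, "->", pair.observe(alive))
```

From the client's point of view the address never changes; only the zone answering behind it does, which is why no DNS change is needed.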

2. The second layer, a "dimensionality-reduction strike": hyperscale clusters and seamless Anycast disaster tolerance

"What if both availability zones in a region go down at the same time?" (The probability is extremely low, but playing devil's advocate is a fine operations tradition.)

This is where SLB's top-level cluster design comes in. Alibaba Cloud's SLB is not a single server fighting alone; behind it sits a huge physical cluster of LVS (Layer 4) + Tengine (Layer 7).

For Layer 4 load balancing (CLB), Alibaba Cloud uses Anycast BGP:

On the backbone network, Alibaba Cloud has multiple core data centers around the world announce the same SLB public IP address at the same time.

As soon as a user's traffic heads toward the Alibaba Cloud network, the carrier's BGP routers automatically steer it to the nearest, healthiest SLB cluster based on "congestion" on the backbone.

If the SLB cluster in one data center goes up in smoke, BGP automatically "detours" traffic to the SLB cluster in another city within a few seconds. This "multi-active" architecture goes beyond the limits of a single region.
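The detour behavior can be illustrated with a small sketch. It is a deliberate simplification: real BGP best-path selection weighs many attributes, while this model uses only a hypothetical hop count per site. The site names, the hop counts, and the function `pick_site` are assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def pick_site(path_lengths: Dict[str, int], announcing: Set[str]) -> Optional[str]:
    """Return the announcing site with the shortest path, or None if none announce."""
    candidates = [site for site in path_lengths if site in announcing]
    if not candidates:
        return None
    # Ties are broken by name so the result is deterministic.
    return min(candidates, key=lambda site: (path_lengths[site], site))


# Hypothetical sites that all announce the same public IP, with hop counts
# as seen from one user's carrier.
paths = {"hangzhou": 2, "beijing": 3, "shenzhen": 5}

print(pick_site(paths, {"hangzhou", "beijing", "shenzhen"}))  # all sites healthy
print(pick_site(paths, {"beijing", "shenzhen"}))              # hangzhou withdrew its route
```

When a site stops announcing the address, it simply drops out of the candidate set and the next-best path wins, with no change on the client side.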

3. The third layer, defense at the micro level: SLB's "life-or-death roll call" for back-end ECS

It is not enough for SLB itself to be highly available; it must also make sure the servers it distributes traffic to are alive. This is where health checks come in.

Many newcomers configure health checks carelessly and end up triggering an "avalanche effect". SLB's health checks exist to save your life:

Layer 4 (TCP) roll call: SLB works like a ruthless time clock, attempting a TCP handshake with your back-end ECS port (for example, port 80) every few seconds. If the handshake succeeds, you are alive; if three consecutive handshakes fail, SLB removes you from rotation within milliseconds and sends you no new traffic.

Layer 7 (HTTP) in-depth physical: Often the port is open but the back-end code is stuck (for example, it returns a 500 error). In this case SLB acts like a browser and requests the URL you specify (for example, /health.html). If the returned status code is not 2xx or 3xx, the server is taken out of rotation immediately.

Self-healing: Once your ECS instance restarts and the code is healthy again, SLB automatically puts it back into rotation after it passes a re-check. The whole process needs zero human intervention.
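The roll call, the status-code rule, and the self-healing step can be combined into a small state machine. This is a minimal sketch rather than SLB's actual code: the names `BackendHealth` and `http_check_passed` are hypothetical, and the default thresholds of 3 failures and 2 successes are the values this article recommends.

```python
class BackendHealth:
    """Tracks whether one backend is in rotation, based on consecutive check results."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_rotation = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result; return True if the backend is in rotation."""
        if check_passed == self.in_rotation:
            self._streak = 0  # the result agrees with the current state
        else:
            self._streak += 1
            needed = self.unhealthy_threshold if self.in_rotation else self.healthy_threshold
            if self._streak >= needed:
                self.in_rotation = not self.in_rotation
                self._streak = 0
        return self.in_rotation


def http_check_passed(status_code: int) -> bool:
    """Layer 7 rule from the text: only 2xx and 3xx responses count as healthy."""
    return 200 <= status_code < 400


backend = BackendHealth()
for status in (200, 500, 500, 500, 200, 200):
    print(status, "->", backend.record(http_check_passed(status)))
```

Requiring several consecutive results before changing state keeps a single slow response from ejecting a healthy server, and keeps a single lucky response from re-admitting a broken one.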

4. A hands-on guide to avoiding pitfalls: how do you keep things from blowing up?

Alibaba Cloud has pushed SLB's own availability about as far as it can go, but if you misconfigure it, all that availability is wasted. Remember the following three iron rules:

1. Backend ECS must be deployed across zones

This is the most common mistake! Many people buy a cross-zone SLB (primary zone A, backup zone B) but, for convenience, put all four back-end ECS instances in zone A.

As a result, when zone A loses power, SLB does fail over to backup zone B, but the SLB instance in zone B looks behind it and finds nothing: there are no servers there. High availability is defeated on the spot.

The right approach: if SLB spans zones A and B, the back-end ECS instances should also be spread evenly across zones A and B.
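A quick sanity check for this rule can be written as a few lines of code. This is a sketch under assumptions: the function `zones_without_backends` is hypothetical, and the zone lists are typed in by hand rather than fetched from any Alibaba Cloud API.

```python
from collections import Counter
from typing import List


def zones_without_backends(slb_zones: List[str], backend_zones: List[str]) -> List[str]:
    """Return the SLB zones that have no back-end server behind them."""
    counts = Counter(backend_zones)
    return [zone for zone in slb_zones if counts[zone] == 0]


# The mistake described above: SLB spans A and B, but all four ECS instances sit in A.
print(zones_without_backends(["zone-a", "zone-b"], ["zone-a"] * 4))

# The recommended layout: two ECS instances in each zone.
print(zones_without_backends(["zone-a", "zone-b"], ["zone-a", "zone-b", "zone-a", "zone-b"]))
```

Any zone in the first result is a zone where failover would land on an empty backend pool.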

2. Do you have to enable session persistence (sticky sessions)?

If your application requires users to log in and keeps that state in server memory, enable session persistence so that requests from the same user always go to the same ECS instance.

But if that ECS instance goes down, the user's session is lost for certain. For real high availability, it is strongly recommended to move sessions out of the servers and into a Redis cache (such as Alibaba Cloud's Redis service), making the back-end ECS instances "stateless". That way, if any ECS instance is killed, SLB can shift traffic to the other servers seamlessly and users notice nothing.
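The idea of externalizing sessions can be shown with a small sketch. To keep it self-contained, an in-memory class stands in for Redis; `SharedSessionStore`, `AppServer`, and the session ID are all hypothetical names, and a real deployment would replace the store with a Redis client.

```python
from typing import Dict, Optional


class SharedSessionStore:
    """In-memory stand-in for an external cache such as Redis (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def set(self, session_id: str, value: dict) -> None:
        self._data[session_id] = value

    def get(self, session_id: str) -> Optional[dict]:
        return self._data.get(session_id)


class AppServer:
    """A stateless back-end server: it keeps no session data in its own memory."""

    def __init__(self, name: str, store: SharedSessionStore) -> None:
        self.name = name
        self.store = store

    def login(self, session_id: str, user: str) -> None:
        self.store.set(session_id, {"user": user})

    def whoami(self, session_id: str) -> Optional[str]:
        session = self.store.get(session_id)
        return session["user"] if session else None


store = SharedSessionStore()
ecs_a = AppServer("ecs-a", store)
ecs_b = AppServer("ecs-b", store)

ecs_a.login("sid-1", "alice")   # the user logs in through ecs-a
print(ecs_b.whoami("sid-1"))    # ecs-a dies; ecs-b still recognizes the session
```

Because neither server owns the session, the load balancer is free to send the next request anywhere, and sticky sessions stop being a requirement.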

3. Set TTL and health check thresholds reasonably

Don't set the health check interval too long. For example, checking every 10 seconds and declaring failure only after 5 consecutive misses means SLB will not notice until the server has been down for nearly a minute, and a large number of users will see errors during that minute. But don't set it too aggressively either (for example, checking every second), or the health check traffic itself can overwhelm your servers under high concurrency.

Recommended settings: response timeout 3 seconds, check interval 2-3 seconds, unhealthy threshold 3, healthy threshold 2.
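The trade-off comes down to simple arithmetic: the time to detect a dead server is roughly the check interval multiplied by the unhealthy threshold. The helper below is a rough estimate under that assumption; it ignores the response timeout and the exact moment of failure within an interval, and the function name is hypothetical.

```python
def detection_window_seconds(interval_s: float, unhealthy_threshold: int) -> float:
    """Approximate time before a dead backend is marked unhealthy."""
    return interval_s * unhealthy_threshold


slow = detection_window_seconds(10, 5)          # the "too long" example from the text
recommended_low = detection_window_seconds(2, 3)
recommended_high = detection_window_seconds(3, 3)

print(slow, recommended_low, recommended_high)
```

With the slow settings a failure goes unnoticed for about 50 seconds; with the recommended settings the window shrinks to about 6 to 9 seconds.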

Summary

The high availability of Alibaba Cloud SLB is a three-dimensional defense system that runs from global BGP routing, to primary and backup instances across data centers, down to second-level health checks on back-end servers.

For operations engineers and architects, SLB is the most cost-effective component in the whole system. You don't need to learn how to configure a complicated Keepalived setup, how to float a virtual IP, or how to maintain an LVS cluster; for a small monthly fee you get a disaster-tolerant gateway built by top architects at a major tech company. Leave the specialist work to SLB. The only thing you need to do is diligently spread your back-end servers across different data centers.