Tencent Cloud Load Balancer (CLB) High Availability Architecture: How to Keep Your Business System "As Stable as Mount Tai"
In the Internet world, two things worry architects and bosses the most: one is that traffic suddenly explodes and the servers are instantly overwhelmed; the other is that a server goes down without warning and every user loses access at once.
To solve this, everyone falls back on "strength in numbers" at the infrastructure level: deploy a few more servers. But then the question arises: with so many servers, which one should a user's request go to? Which one is idle? Which one is barely holding on? Which one is actually "dead"?
This is where the Cloud Load Balancer (CLB for short) comes on stage. It works like an experienced "traffic officer and first responder", standing in front of all the servers and distributing massive user traffic evenly and intelligently to the servers behind it.
Today we will take apart the high availability architecture of Tencent Cloud CLB. No obscure jargon: in plain language and from a real-world perspective, we will look at the hard-core measures Tencent Cloud uses to keep the business "running and dancing" even in extreme disasters.
1. Why is single-machine high availability a "false proposition"?
Before talking about the architecture of CLB, let's first agree on one thing:
Any piece of hardware and any single-point system can fail.
Many start-up teams believe that buying a top-of-the-line luxury car (a cloud server with an extremely high configuration) will make the business stable. In reality, a construction crew may cut the fiber-optic cable, the data center may lose power, a motherboard may burn out, and even a system patch can trigger a blue screen.
Real stability therefore does not come from betting that "it won't break", but from "it broke, a spare took over immediately, and the user never noticed".
Tencent Cloud CLB is essentially a traffic distribution center. If the center itself goes down, even thousands of servers behind it are useless. The high availability design of CLB itself is therefore the lifeline of the entire business system.
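The "spare takes over" argument can be made concrete with a little arithmetic. The sketch below assumes that nodes fail independently and that one healthy node is enough to serve traffic; the 99% single-node figure and the function names are illustrative assumptions, not Tencent Cloud numbers.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent nodes when one healthy node suffices.

    The service is down only if every node is down at the same time.
    """
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # 0.99 is an assumed single-node availability, chosen only for illustration.
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} node(s): availability={a:.6f}, "
              f"downtime={downtime_minutes_per_year(a):.2f} min/year")
```

Real failures are rarely fully independent (shared power, shared network, shared software bugs), which is one reason redundancy is stacked at several levels rather than only inside one data center.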
2. Tencent Cloud CLB's "Three Lines of Defense": From a Single Machine to Cross-City
The secret behind Tencent Cloud CLB's 99.99% or higher availability is that it builds three layers of fortification from the bottom up. Let's look at them layer by layer:
The first line of defense: hot standby and clustering inside the data center
If you buy a basic public network CLB on Tencent Cloud, you may think you only bought one IP address, but in fact Tencent Cloud has prepared an entire high availability cluster for you underneath.
Layer 4 load balancing (based on LVS/DPDK): uses a cluster architecture. Simply put, a whole group of servers carries your traffic. When any physical server suddenly dies, the underlying routing layer (OSPF with ECMP) automatically shifts traffic to the other healthy servers in the cluster within milliseconds.
Layer 7 load balancing (based on Nginx): uses primary/standby or multi-active mode. Once the primary node's heartbeat stops, the standby node takes over.
In plain words:
You think you hired one bodyguard, but a whole bodyguard company is standing behind you. When one of them falls, the next person immediately fills the position, and the user does not even notice a stutter on the web page.
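The Layer 4 behavior described above, where flows are spread across a cluster and shift to the survivors when a node dies, can be sketched as a hash over the healthy node set. This is a simplified illustration of ECMP-style selection, not Tencent Cloud's implementation; `pick_node` and the node names are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple

# (source IP, source port, destination IP, destination port, protocol)
Flow = Tuple[str, int, str, int, str]


def pick_node(flow: Flow, healthy_nodes: Sequence[str]) -> str:
    """Map a flow's 5-tuple onto one of the currently healthy nodes."""
    if not healthy_nodes:
        raise RuntimeError("no healthy load balancer node available")
    digest = hashlib.sha256(repr(flow).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(healthy_nodes)
    return healthy_nodes[index]


if __name__ == "__main__":
    flow = ("203.0.113.7", 51234, "198.51.100.10", 443, "tcp")
    nodes = ["lb-node-1", "lb-node-2", "lb-node-3"]  # hypothetical names
    first = pick_node(flow, nodes)
    # Simulate the chosen node dying: the route is withdrawn, so only the
    # survivors remain in the candidate set and the flow lands on one of them.
    survivors = [n for n in nodes if n != first]
    print(first, "->", pick_node(flow, survivors))
```

A plain modulo like this also remaps some flows that were on healthy nodes when the set changes; production clusters usually add consistent hashing or connection state synchronization to limit that effect.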
The second line of defense: active-active / multi-AZ deployment within the same city
The first line of defense solves the problem of "one machine is broken", but what if something big happens to an entire data center (availability zone)? For example, on a day of heavy rain, data center A is unfortunately flooded and loses power.
To guard against this kind of "black swan" event, Tencent Cloud CLB supports high availability across availability zones.
When you configure CLB, you can choose to deploy it in "Guangzhou Zone 2 (primary)" and "Guangzhou Zone 3 (standby)". The two availability zones are physically separated by tens of kilometers and have independent power and networks.
Under normal circumstances, traffic mainly enters through the primary availability zone.
In extreme cases, when Guangzhou Zone 2 is completely cut off, Tencent Cloud's DNS and underlying network automatically switch all public network traffic to the standby CLB in Guangzhou Zone 3. This process usually completes within a few seconds.
The third line of defense: global/cross-region high availability combined with the Anycast network
If your business is critical enough that it "cannot accept any form of single-region disaster" (such as the network backbone of the entire South China region going down), CLB can also work with Tencent Cloud's Anycast technology to upgrade to global high availability.
To put it simply, the same CLB IP address is announced from data centers in different cities around the world. Users in Beijing enter through the nearby Beijing data center, and users in Shanghai enter through the Shanghai data center. If the Shanghai data center becomes unavailable as a whole, the network automatically pulls the traffic originally headed for Shanghai over to Beijing or Wuhan. This "trading space for safety" approach is currently top-tier disaster protection on the Internet.
3. Stability alone is not enough: how does CLB take care of the back-end servers?
CLB has made itself nearly indestructible, so how does it make sure the business servers behind it (CVM/Lighthouse) do not drop the ball? There are two core mechanisms:
1. Millisecond-level health checks
CLB is like a stern supervisor, knocking on the door of each back-end server every few seconds (by sending a ping, a TCP handshake, or an HTTP request).
Supervisor: "Server 1, are you still alive?"
Server 1: "Alive, everything is normal." (traffic keeps flowing)
A few seconds later...
Supervisor: "Server 1, are you still alive?"
Server 1: (no response, due to a memory overflow)
Supervisor: "Not good, server 1 is down. Take it out of rotation immediately! All subsequent traffic goes to servers 2 and 3!"
Once the administrator repairs server 1 and brings it back online, CLB detects that it has recovered and automatically adds it back to the pool. The entire process is automated, with zero human intervention.
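The probe-and-evict loop above can be sketched as a small state machine. The example assumes that a back end is only flipped after several consecutive probe results, a common way to avoid flapping; the thresholds of 3 and the names `BackendHealth` and `tcp_probe` are illustrative assumptions, not CLB's actual defaults or API.

```python
import socket
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one back end's health from a stream of probe results."""
    unhealthy_threshold: int = 3  # assumed: consecutive failures before eviction
    healthy_threshold: int = 3    # assumed: consecutive successes before re-adding
    healthy: bool = True
    _fail_streak: int = 0
    _ok_streak: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the current health state."""
        if probe_ok:
            self._ok_streak += 1
            self._fail_streak = 0
            if not self.healthy and self._ok_streak >= self.healthy_threshold:
                self.healthy = True   # recovered: back into rotation
        else:
            self._fail_streak += 1
            self._ok_streak = 0
            if self.healthy and self._fail_streak >= self.unhealthy_threshold:
                self.healthy = False  # evicted: stop sending traffic
        return self.healthy


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """TCP-handshake probe: True if a connection can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A scheduler would call `tcp_probe` on a fixed interval for every back end, feed each result into `record`, and only forward traffic to back ends whose `healthy` flag is true.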
2. A variety of intelligent scheduling algorithms (to each according to its ability)
Different industries call for different traffic distribution strategies. CLB offers several options:
Weighted Round Robin (WRR): Suppose you have two machines behind the CLB, one a 4-core 8 GB veteran and the other a 16-core 32 GB powerhouse. You can give the powerhouse a higher weight so that it carries more traffic.
Weighted Least Connections (WLC): Each new request goes to whichever server currently has the fewest active connections. Well suited to games and other long-lived connection workloads.
Source address hash (Source Hash): Picks a fixed server based on the user's IP address. This ensures that the same user keeps reaching the same back-end server for a period of time, which makes it easy to preserve login state (Session).
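The three strategies above can be sketched in a few lines each. This is a minimal illustration, not CLB's internal code: the round robin uses the "smooth" weighted variant known from Nginx, the least-connections rule divides active connections by weight, and all function and server names are hypothetical.

```python
import hashlib
from typing import Dict, List


def weighted_round_robin(weights: Dict[str, int], n: int) -> List[str]:
    """Return the next n picks using smooth weighted round robin."""
    total = sum(weights.values())
    current = {server: 0 for server in weights}
    picks = []
    for _ in range(n):
        for server, weight in weights.items():
            current[server] += weight
        chosen = max(current, key=lambda s: current[s])
        current[chosen] -= total
        picks.append(chosen)
    return picks


def weighted_least_connections(active: Dict[str, int],
                               weights: Dict[str, int]) -> str:
    """Pick the server with the lowest active-connections-to-weight ratio."""
    return min(active, key=lambda s: active[s] / weights[s])


def source_hash(client_ip: str, servers: List[str]) -> str:
    """Map a client IP to a fixed server while the server list is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


if __name__ == "__main__":
    weights = {"small-4c8g": 1, "big-16c32g": 4}
    print(weighted_round_robin(weights, 5))
    print(weighted_least_connections({"small-4c8g": 10, "big-16c32g": 30}, weights))
    print(source_hash("203.0.113.7", ["cvm-1", "cvm-2", "cvm-3"]))
```

With weights 1 and 4, every five picks send four requests to the larger machine and one to the smaller, interleaved rather than in a burst.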
4. Practical Pitfall Guide: How to Really Get the Most out of CLB's High Availability?
Many people buy CLB on Tencent Cloud, yet their business still goes down frequently. Here are some hard-won lessons about the pitfalls that are most often hit in real-world operations:
Don't put all your eggs in one availability zone: you chose a cross-AZ CLB when purchasing, but to save effort you bought all the cloud servers (CVM) mounted behind it in "Guangzhou Zone 2". As a result, when that zone fails, the CLB survives but has no working server behind it and becomes a "commander without troops". The correct approach: if the CLB spans availability zones, the back-end servers should also be spread evenly across different availability zones.
Remember to turn on session persistence, but don't abuse it: if your website requires login, enabling session persistence prevents users from being logged out for no reason when they refresh the page. However, if sessions are held for too long, traffic may be distributed unevenly (heavy users all crowd onto the same machine).
Make good use of graceful shutdown (smooth connection migration): when you need to take a back-end server offline for maintenance, don't unbind it directly. Turn on CLB's graceful shutdown feature, which lets existing users finish what they are doing (such as completing a file upload or a payment) while no new users are sent to that server, achieving truly imperceptible maintenance.
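The graceful shutdown idea in the last tip can be sketched as a back end with a "draining" flag: in-flight connections are allowed to finish, new ones are refused, and the server is safe to remove once the count reaches zero. The class and method names below are hypothetical and only illustrate the behavior, not the CLB console or API.

```python
class DrainingBackend:
    """Models one back-end server that can be drained before maintenance."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.draining = False
        self.active_connections = 0

    def try_accept(self) -> bool:
        """Accept a new connection unless the back end is draining."""
        if self.draining:
            return False
        self.active_connections += 1
        return True

    def finish(self) -> None:
        """Mark one in-flight connection as completed."""
        if self.active_connections > 0:
            self.active_connections -= 1

    def start_drain(self) -> None:
        """Stop taking new connections; existing ones keep running."""
        self.draining = True

    def safe_to_remove(self) -> bool:
        """True once draining has started and no connection is left."""
        return self.draining and self.active_connections == 0


if __name__ == "__main__":
    backend = DrainingBackend("cvm-1")
    backend.try_accept()             # a user starts an upload
    backend.start_drain()            # maintenance begins
    print(backend.try_accept())      # new users are sent elsewhere
    print(backend.safe_to_remove())  # the upload is still in progress
    backend.finish()                 # the upload completes
    print(backend.safe_to_remove())
```

A real deployment would normally pair this with a maximum drain time, so that one stuck connection cannot block maintenance forever.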
5. Conclusion: high availability is really a kind of "insurance"
When the system is calm, CLB looks like an unremarkable relay, and some people even feel that it only adds an extra hop to the network path.
But at a life-or-death moment such as a big sales promotion, an attack, or a physical failure of the underlying hardware, CLB's high availability architecture is the "airbag" that protects your year-end bonus and the company's business. It trades complexity underneath for simplicity on the surface, letting developers focus on writing business logic without worrying about the physical details below.
In this era of "experience first", a little more respect for high availability gives the business more confidence to ride out the wind and waves.

