What should I do if my Alibaba Cloud ECS server goes down? Recovery strategies based on snapshots, images, and cross-zone migration
In day-to-day cloud operations, the platform itself is highly stable, but an outage of a single availability zone (AZ) caused by force majeure such as an earthquake or power failure is still a risk that architecture design must account for. For businesses that, for cost reasons, have not deployed an active-active architecture within one city or a multi-active architecture across regions, an emergency plan for fast data backup and cross-zone migration is the last line of defense for business continuity.
This article describes business recovery solutions based on snapshots, images, and the Server Migration Center (SMC).
1. Core Disaster Recovery Scenarios and Technical Means
For different fault depths, three recovery strategies are commonly used:
| Recovery strategy | Technical principle | Applicable scenarios |
| --- | --- | --- |
| Cloud disk data rollback | Based on ECS snapshots. Restore data to a specific point in time by rolling the disk back to a snapshot, or by attaching a new disk created from the snapshot. | Data deleted by mistake, ransomware, database logic errors. |
| System environment rebuild | Based on custom images. Package the whole instance and quickly restore the operating system and preinstalled environment. | System crashes, corrupted configuration, or batch cloning of an identical runtime environment. |
| Cross-zone migration | Uses SMC (Server Migration Center). Copy the instance and its associated resources to another zone in the same region. | Evacuating the whole workload when an entire availability zone suffers a physical failure (such as a power outage or network interruption). |
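The mapping from fault depth to strategy can be sketched as a small lookup. This is only an illustration: the `Fault` names, the `STRATEGY` table, and the rule that the deepest fault decides the strategy are assumptions made for this example, not part of any Alibaba Cloud API.

```python
from enum import Enum


class Fault(Enum):
    DATA_LOSS = "data deleted, encrypted by ransomware, or logically corrupted"
    SYSTEM_BROKEN = "operating system or configuration unusable"
    ZONE_OUTAGE = "entire availability zone unavailable"


# Hypothetical mapping that mirrors the three strategies listed above.
STRATEGY = {
    Fault.DATA_LOSS: "snapshot rollback",
    Fault.SYSTEM_BROKEN: "rebuild from custom image",
    Fault.ZONE_OUTAGE: "cross-zone migration with SMC",
}

# Faults ordered from shallowest to deepest (an assumption of this sketch).
DEPTH = [Fault.DATA_LOSS, Fault.SYSTEM_BROKEN, Fault.ZONE_OUTAGE]


def choose_strategy(faults):
    """Return the strategy that matches the deepest fault observed."""
    worst = max(faults, key=DEPTH.index)
    return STRATEGY[worst]
```

For example, `choose_strategy([Fault.DATA_LOSS, Fault.ZONE_OUTAGE])` selects the cross-zone migration, because a zone outage makes an in-place rollback impossible.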
2. Solution 1: Cloud Disk-Level Data Recovery (Snapshots)
Snapshots are the most basic and efficient means of data protection on the cloud.
1. Backup strategy
Manual backup: Manually create snapshots of system disks and data disks before major system updates or application releases.
Policy-based backup: Use an automatic snapshot policy to configure the schedule and retention period (for example, a daily early-morning backup retained for 7 days) for unattended backup.
Batch backup: Use CloudOps Orchestration Service (OOS) to take snapshots of hundreds of instances in a single operation.
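As a rough sketch of the "daily backup, retained for 7 days" policy above, the helper below validates a schedule and builds a parameter dictionary. The key names (`autoSnapshotPolicyName`, `timePoints`, `repeatWeekdays`, `retentionDays`) and the value ranges are assumed to follow the ECS CreateAutoSnapshotPolicy API as I understand it; verify them against the current API reference before sending a real request. The function name and the policy name are hypothetical.

```python
import json


def build_auto_snapshot_policy(name, hours, weekdays, retention_days):
    """Build request parameters for an automatic snapshot policy.

    Key names and ranges are assumptions based on the ECS
    CreateAutoSnapshotPolicy API; check the API reference before use.
    """
    if not all(0 <= h <= 23 for h in hours):
        raise ValueError("hours must be in 0..23")
    if not all(1 <= d <= 7 for d in weekdays):
        raise ValueError("weekdays must be in 1..7 (assumed 1 = Monday)")
    if retention_days < 1 and retention_days != -1:
        raise ValueError("retention_days must be >= 1, or -1 to keep forever")
    return {
        "autoSnapshotPolicyName": name,
        # Both lists are sent as JSON-encoded arrays of strings.
        "timePoints": json.dumps([str(h) for h in sorted(hours)]),
        "repeatWeekdays": json.dumps([str(d) for d in sorted(weekdays)]),
        "retentionDays": retention_days,
    }


# Every day at 02:00, keep each snapshot for 7 days.
policy = build_auto_snapshot_policy("daily-2am-keep-7d", [2], range(1, 8), 7)
```

Keeping the validation separate from the API call makes it easy to review the schedule before anything is created in the account.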
2. Recovery in practice
In-place rollback: If the instance is still available, roll the cloud disk back directly to restore data to the point in time of the snapshot. Data written after the snapshot was taken is discarded, so take a fresh snapshot first if any of it may be needed.
Cross-zone attachment: If zone A is unavailable, use the snapshot to create a new cloud disk in zone B and attach it to a new instance in zone B to recover the data outside the failed zone.
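The two recovery paths above share one preliminary step: choosing which snapshot to restore. A minimal sketch, assuming snapshots are represented as dictionaries with a `created` timestamp (the snapshot IDs, field names, and function names are hypothetical):

```python
from datetime import datetime


def pick_snapshot(snapshots, incident_time):
    """Return the newest snapshot taken strictly before the incident, or None."""
    usable = [s for s in snapshots if s["created"] < incident_time]
    return max(usable, key=lambda s: s["created"], default=None)


def recovery_path(zone_available):
    """In-place rollback if the zone is up, otherwise a new disk in another zone."""
    if zone_available:
        return "rollback in place"
    return "create disk in another zone"


snapshots = [
    {"id": "snap-a", "created": datetime(2024, 5, 1, 2, 0)},
    {"id": "snap-b", "created": datetime(2024, 5, 2, 2, 0)},
    {"id": "snap-c", "created": datetime(2024, 5, 3, 2, 0)},
]
incident = datetime(2024, 5, 2, 15, 30)
chosen = pick_snapshot(snapshots, incident)
```

Here `chosen` is `snap-b`: `snap-c` was taken after the incident and would already contain the damaged data.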
3. Solution 2: One-Click Environment Restore (Images)
Images contain not only data, but also the configuration of the operating system, environment variables, and software licenses.
Image creation: We recommend creating a custom image immediately after the business environment has been initialized.
Environment cloning: When the original instance cannot boot because of an intrusion or a corrupted environment, create a new instance from the image so that it works out of the box without reinstalling software.
Cross-zone deployment: Custom images are region-level resources, so you can launch new instances from them in any zone of the same region.
4. Solution 3: Full Evacuation (Cross-Zone Migration with SMC)
When an availability zone is confirmed to have a large-scale failure that cannot be resolved in the short term, migrating the whole workload with the Server Migration Center (SMC) is the best path.
1. Migration Principle
SMC invokes the cross-zone replication capability of the underlying block storage. It not only migrates data, but also synchronizes the specification properties of the instance.
2. Migration Process
Environment preparation: Confirm that the target zone has sufficient resources (vCPU and memory quota), and enable RAM authorization and the snapshot service.
Create a task: In the SMC console, choose Cross-zone Migration and specify the target zone and the new vSwitch (VPC).
Smooth switchover: SMC automatically synchronizes the back-end block storage data. Note: The migration restarts the instance and changes its private IP address. Make sure that your application code and Server Load Balancer (SLB) configuration reference a domain name rather than a hard-coded IP address.
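Because the private IP address changes during migration, it can help to scan configuration files for hard-coded private addresses beforehand. The following is a minimal sketch that uses only the Python standard library; the function name and the sample configuration are made up for illustration.

```python
import ipaddress
import re

# Loose match for dotted quads; real validation is done by ipaddress below.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_private_ips(config_text):
    """Return the sorted private (non-loopback) IPv4 addresses found in the text."""
    found = set()
    for candidate in IPV4_PATTERN.findall(config_text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. "999.1.1.1" matches the regex but is not an address
        if address.is_private and not address.is_loopback:
            found.add(candidate)
    return sorted(found)


sample = """
db_host = 172.16.0.12
cache_host = cache.internal.example.com
upstream = 10.0.3.7:8080
public_api = 8.8.8.8
"""
hardcoded = find_private_ips(sample)
```

Each address reported in `hardcoded` is a candidate for replacement with a domain name before the migration is started.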
5. Fault Drill: How to Verify the Effectiveness of Disaster Recovery?
"A disaster recovery plan that has never been rehearsed is not a real plan." We recommend running the following drills regularly:
Cloud disk damage drill: Manually delete some data, measure how long a full restore from a snapshot takes, and compare the result with your recovery time objective (RTO).
Environment exception drill: Simulate damage to core system files, and verify that the business can be brought back up by reinitializing the system from the image.
SMC evacuation drill: Perform a cross-zone migration during off-peak hours, record how changes to public and private IP addresses affect upstream and downstream services, and refine the switchover process.
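To make drill results comparable over time, the recovery time can be measured the same way on every run. A minimal sketch, assuming the recovery steps are wrapped in a single callable; `measure_rto`, `fake_recover`, and the 300-second target are hypothetical names and values chosen for this example.

```python
import time


def measure_rto(recover, target_seconds, clock=time.monotonic):
    """Run `recover()` and report the elapsed time against the RTO target."""
    start = clock()
    recover()
    elapsed = clock() - start
    return {"elapsed_seconds": elapsed, "within_target": elapsed <= target_seconds}


def fake_recover():
    """Placeholder for the real drill steps (rollback, restart, health check)."""
    time.sleep(0.01)


result = measure_rto(fake_recover, target_seconds=300)
```

The `clock` parameter is injectable so that the measurement logic itself can be tested without waiting for a real recovery.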
Summary
For businesses with modest disaster recovery requirements or tight budgets, "periodic snapshot backup + image-based environment backup + SMC emergency migration" is the most cost-effective solution. It does not require doubling your server costs, yet it provides a reliable means of business recovery in a crisis.

