Site suddenly unreachable? Aliyun ECS CPU at 100%: a troubleshooting and optimization tutorial

cloud 2026-05-28 Views: 66

The website was fine during the day but suddenly hung in the evening. The browser kept spinning and finally reported "504 Gateway Timeout" or "unable to connect".

Alarmed, you hurriedly log in to the Aliyun console and look at the ECS instance monitoring:

The CPU is fully loaded, pinned on a red line at 100%.

Most personal webmasters and ops/dev engineers have run into this situation. Don't panic, and don't rush to restart the server (a restart only treats the symptom, and the CPU will max out again within minutes). Today I'll skip the empty theory and give you a set of iron rules for troubleshooting and optimizing a live production environment. Follow the steps and you can find the culprit in 5 minutes.

Core troubleshooting idea: the three-step locating method

When the CPU is maxed out, the troubleshooting logic should be:

Look at the whole: which process (Nginx, PHP, Java, or a trojan) is swallowing the resources?

Look at the part: which piece of code, which thread, or which SQL statement inside that process is spinning?

Strike hard: once the cause is located, do you optimize the code, add a cache, or kill the process outright?

Step 1: Log in to the server and find the problem process (1 minute)

However sluggish the website is, if SSH still connects, log in right away. If SSH from your local machine hangs and cannot connect, go to the Aliyun console → ECS instance → Remote connection (Workbench) and force a login.

Enter the following command, which is the ultimate weapon for Linux performance troubleshooting:

Bash

top

Once in the top interface, press uppercase P (sort by CPU usage). You will see a dynamic list similar to the following:

Plaintext

  PID USER   PR  NI   VIRT    RES   SHR  S  %CPU  %MEM     TIME+  COMMAND
12345 nginx  20   0   354m    45m   12m  R  98.5   2.3  12:34.56  php-fpm
 6789 mysql  20   0   2.5g   1.2g   24m  S   1.5  60.2  45:12.89  mysqld

Result analysis:

Look at the COMMAND column of the first row:

If it is php-fpm, node, or java: your site's business code has hit an infinite loop, or a sudden traffic surge has exceeded what it can handle.

If it is mysqld: the database has run into slow queries, missing indexes, or lock contention under high concurrency.

If it is nginx or httpd: most likely you are facing malicious request flooding, a CC attack, or aggressive crawling.

If it is a random-looking alphanumeric name (e.g. kdevtmpfsi, miner): look no further, the server has been compromised and is being used as a cryptocurrency miner.
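The mapping from COMMAND to likely cause can be sketched as a small shell helper. This is a minimal sketch, not part of the original procedure: the function names `snapshot_top_cpu` and `classify_command` are made up for illustration, the name list is not exhaustive, and `ps --sort` assumes the procps `ps` found on most Linux distributions.

```shell
#!/usr/bin/env bash
# Non-interactive alternative to top: header plus the 5 busiest processes.
# Assumes procps `ps` (Linux); --sort is not available in every ps.
# Note: ps reports CPU averaged over the process lifetime, so top remains
# the better view for a spike that started a few minutes ago.
snapshot_top_cpu() {
    ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n 6
}

# Map a COMMAND name to the likely cause listed in the result analysis.
# Illustrative only: an unknown name is not proof of compromise.
classify_command() {
    case "$1" in
        mysqld)            echo "database" ;;
        php-fpm*|node|java) echo "application code or traffic surge" ;;
        nginx|httpd)       echo "request flood or crawler" ;;
        *)                 echo "unknown, inspect the binary" ;;
    esac
}

# Hypothetical usage:
#   snapshot_top_cpu
#   classify_command mysqld
```

The helper only narrows down where to look next. The actual diagnosis still depends on the per-scenario steps.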

Step 2: Drill down by scenario and defuse the bomb precisely (3 minutes)

Based on what you saw in top, find the matching scenario below and follow its resolution path.

Scenario A: COMMAND is mysqld (database stuck)

This is a high-frequency scenario. Usually some piece of business code is poorly written and queries hundreds of thousands of rows without an index.

1. Log in to the database and view the SQL currently executing

SQL

mysql -u root -p

-- Execute after login

SHOW PROCESSLIST;

If the statements in the list are truncated, you can use:

SQL

SHOW FULL PROCESSLIST;

2. Catch the culprit

In the output, look for the row whose Time (execution time, in seconds) is large and whose State is Sending data, Sorting for group, or Creating tmp table. Then check which SQL statement appears in its Info column.

Emergency relief: once you spot the offending slow SQL, note its Id and run KILL <Id>; directly (for example, KILL 142;). Freeing the database first usually restores access to the website right away.

Permanent fix: trace this SQL back to the code and add indexes on the columns used in WHERE or JOIN. If it is a join across large tables, consider adding a Redis cache.
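When the process list is long, scanning it by eye is slow. The filter below is a minimal sketch of the "find long-running queries, then KILL by Id" step. It assumes the tab-separated output that the mysql client produces in batch mode, with the standard column order Id, User, Host, db, Command, Time, State, Info. The function name `suggest_kills` and the 30-second threshold are illustrative choices, not from the article.

```shell
#!/usr/bin/env bash
# Read tab-separated SHOW FULL PROCESSLIST output on stdin (header row first)
# and print a KILL statement for each query that has been running for at
# least the given number of seconds. It only prints; it kills nothing.
# Assumed columns: Id User Host db Command Time State Info
suggest_kills() {
    local min_seconds="${1:-10}"
    awk -F'\t' -v min="$min_seconds" '
        NR > 1 && $5 == "Query" && ($6 + 0) >= min {
            printf "KILL %s; -- %ss, %s\n", $1, $6, $7
        }'
}

# Hypothetical usage:
#   mysql -u root -p -B -e "SHOW FULL PROCESSLIST" | suggest_kills 30
```

Printing the statements instead of executing them keeps a human in the loop, which matters because killing the wrong session (a backup or a migration, say) can cause its own outage.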

Scenario B: COMMAND is java (infinite loop inside the program / OOM)

When a Java application's CPU soars, it is usually because a thread is stuck in a while(true) loop or the JVM is running frequent garbage collection (Full GC).

1. Find the thread that consumes the most CPU

Suppose the Java process PID is 12345. Run the following command to see which threads in the process consume the most resources:

Bash

top -Hp 12345

Press P to sort. Suppose the thread consuming the most CPU has the PID 12366.

2. Convert to hexadecimal

Convert the thread PID 12366 to hexadecimal, the form that the grep in the next step searches for:

Bash

printf "%x\n" 12366

# The output result is: 304e

3. Print stack information

Use the jstack tool that ships with the JDK to locate the problematic line of code directly:

Bash

jstack 12345 | grep "304e" -A 20

The terminal prints the Java class name and line number this thread is executing. Most of the time it turns out to be an infinite loop or unbounded recursion. Fix the code and redeploy.
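Steps 2 and 3 can be combined into one helper. This is a sketch under assumptions: the function name `thread_stack` is hypothetical, and it assumes the dump labels threads as `nid=0x<hex>` followed by a space, as in the jstack output the steps above rely on. If your JDK prints the thread ID differently, adjust the pattern.

```shell
#!/usr/bin/env bash
# Print the stack of one thread from a jstack dump read on stdin.
#   $1 = decimal thread ID as shown by `top -Hp <pid>`
#   $2 = number of lines to show after the thread header (default 20)
thread_stack() {
    local nid
    nid="$(printf '%x' "$1")"
    # The trailing space stops 0x304e from also matching 0x304ef.
    grep -A "${2:-20}" -- "nid=0x${nid} "
}

# Hypothetical usage with the PIDs from the steps above:
#   jstack 12345 | thread_stack 12366
```

Matching on `nid=0x...` rather than the bare hex string avoids false hits on memory addresses that happen to contain the same digits.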

Scenario C: COMMAND is nginx / php-fpm (malicious request flooding / CC attack)

If traffic is normally low and the CPU suddenly spikes, take a look at the Nginx access log.

1. Find the IPs with the most requests

Bash

# Suppose your Nginx log is in /var/log/nginx/access.log

awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -n 20

If an unfamiliar IP has made tens of thousands of requests within a few minutes, you are almost certainly being targeted.

2. Block the IP as an emergency measure

Use the Linux built-in firewall or an Aliyun security group to blacklist the IP:

Bash

# Block with iptables

iptables -I INPUT -s <malicious-IP> -j DROP

On Aliyun, you can instead go to the ECS "Security Group Rules" and add an inbound deny (Drop) rule for that IP.
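The one-liner in step 1 counts requests over the entire log file, so an IP that was busy last week can outrank the one attacking right now. A minimal sketch that restricts the count to recent lines follows. The function name `top_ips` and the 50,000-line window are illustrative assumptions, and it assumes the client IP is the first field, as in Nginx's default log format.

```shell
#!/usr/bin/env bash
# Count requests per client IP from access-log lines on stdin and print
# the busiest IPs as "<count> <ip>", highest count first.
#   $1 = how many IPs to show (default 20)
top_ips() {
    awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' |
        sort -k1,1nr -k2,2 |
        head -n "${1:-20}"
}

# Hypothetical usage: only look at the most recent 50,000 requests.
#   tail -n 50000 /var/log/nginx/access.log | top_ips 20
```

If the site sits behind a CDN or load balancer, the first field may be the proxy's address rather than the real client, in which case blocking it would cut off legitimate users.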

Scenario D: An unfamiliar, strange process (server turned into a zombie host / miner)

If you see a strange process taking up 99% of the CPU and its path does not belong to any legitimate software:

Use ls -l /proc/<PID>/exe to see where the malicious program is hiding.

Eradication:

Bash

kill -9 <PID> # Forcibly kill the process

rm -rf <malicious-program-path> # Delete the malicious file

Check for backdoors: attackers usually install scheduled tasks. Run crontab -l to check for cron entries that automatically re-download the malware, and delete them all with crontab -e.
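To speed up the crontab review, a rough filter can highlight entries that download something and pipe it straight into a shell, which is the kind of auto-download task described above. This is a heuristic sketch with a hypothetical function name (`suspicious_cron`). It can miss obfuscated entries and can flag legitimate ones, so every hit still needs to be read by hand.

```shell
#!/usr/bin/env bash
# Print crontab lines (read from stdin) of the form
#   ... curl|wget ... | sh    or    ... curl|wget ... | bash
# Heuristic only: review each hit manually before deleting anything.
suspicious_cron() {
    grep -E '(curl|wget)[^|]*\|[[:space:]]*(ba)?sh' || true
}

# Hypothetical usage:
#   crontab -l | suspicious_cron
#   cat /etc/cron.d/* 2>/dev/null | suspicious_cron
```

Note that `crontab -l` only shows the current user's entries. System-wide locations such as /etc/crontab and /etc/cron.d are worth checking as well.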

The ultimate precaution: how to avoid the next long red line?

Once the cold sweat has dried, put some basic defenses and rate limiting in place so the CPU never gets another chance to max out.

Configure alarms with Aliyun "Cloud Monitoring". Don't wait for users to report that the site won't open before you investigate. In Aliyun Cloud Monitoring, set a rule such as: "When ECS CPU utilization stays above 85% for 5 minutes, send an SMS/DingTalk alarm immediately". Intervene at the first sign of trouble.
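To make the semantics of such a rule concrete, here is a local sketch of the "above the threshold for N consecutive samples" check. It is not how Cloud Monitoring is implemented or configured. The function name `cpu_alarm` and the one-sample-per-minute assumption are purely illustrative.

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) only when every given CPU sample exceeds the
# threshold, i.e. "above THRESHOLD for N consecutive samples".
#   $1 = threshold in percent; remaining args = samples, one per minute
cpu_alarm() {
    local threshold="$1"; shift
    [ "$#" -gt 0 ] || return 1          # no data, no alarm
    local sample
    for sample in "$@"; do
        # awk does the comparison so fractional values such as 98.5 work
        awk -v s="$sample" -v t="$threshold" \
            'BEGIN { exit !(s + 0 > t + 0) }' || return 1
    done
    return 0
}

# Hypothetical usage with five one-minute samples:
#   cpu_alarm 85 98.5 91 88 99 100 && echo "send alarm"
```

Requiring every sample in the window to exceed the threshold is what keeps a single short spike from triggering a notification.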

Configure the maximum number of worker processes for PHP-FPM / Nginx

If the server has 2 cores and 4 GB of RAM, limit max_children (pm.max_children) in php-fpm.conf to about 30-40. That way, even when traffic maxes out, only some users get a 502, and the server itself does not become unreachable over SSH because memory and CPU are completely drained.
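One common rule of thumb for arriving at a number in that range (an assumption here, not something the article states) is to divide the memory you are willing to give PHP-FPM by the average memory of one worker. In the sketch below, the 1536 MB budget is an assumed share of the 4 GB machine, and 45 MB is the RES value of php-fpm in the sample top output earlier. Measure both on your own server before relying on the result.

```shell
#!/usr/bin/env bash
# Rough upper bound for pm.max_children:
#   memory budget for PHP-FPM / average memory per worker
#   $1 = memory budget in MB (assumption: what is left after MySQL, OS, etc.)
#   $2 = average resident size of one php-fpm worker in MB
estimate_max_children() {
    local budget_mb="$1" per_worker_mb="$2"
    [ "$per_worker_mb" -gt 0 ] || return 1
    echo $(( budget_mb / per_worker_mb ))
}

estimate_max_children 1536 45   # prints 34
```

With these assumed inputs the estimate lands inside the 30-40 range suggested above. Heavier workers or a smaller budget would push it lower.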

Make sensible use of "elastic scaling"

If your website or application really is running a promotion, or a trending topic brings in genuinely massive traffic, no amount of single-machine optimization will help. Enable Elastic Scaling (ESS) on Aliyun and configure a rule: when CPU exceeds 80%, it automatically clones and launches a second and third ECS instance to share the traffic, then releases them automatically when the event ends. Let automation absorb the unpredictability of traffic.