Alibaba Cloud Server CPU Full Load Troubleshooting and Resolution Tutorial

cloud 2026-06-02

For operations and development engineers, the most alarming alert to receive in the middle of the night is "CPU utilization of your ECS instance has reached 100%".

A fully loaded CPU means the server has no computing power left. The website stalls, API calls time out, database connections pile up, and the whole business grinds to a halt. Faced with this, novices often go to the console and restart the server. A restart can buy some time in an emergency, but if the root cause is not found, the CPU will be back at 100% five minutes later.

This tutorial shows you how to troubleshoot and resolve a fully loaded CPU on an Alibaba Cloud server calmly and methodically, the way an experienced engineer would.

Stage 1: Rule out false alarms and prepare before "seeing the doctor"

Before you start typing commands, go to the Alibaba Cloud Console, open the monitoring dashboard, and check one key metric: CPU credits for burstable (t-series) instances.

If you bought one of Alibaba Cloud's burstable instances (such as t5 or t6), the server is normally limited to a baseline level of CPU performance (for example, only 20%). When your workload exceeds the baseline, the instance spends CPU credits to get full computing power. Once the credits are exhausted, the server is throttled: the CPU is pinned at a low percentage and the system barely moves.

Solution: If a burstable instance has run out of credits, either enable unlimited mode in the console or upgrade to a general-purpose or compute-optimized instance.

If the instance type is not the limiting factor, some process inside the server really is devouring CPU, and you need to log in to the system and start "catching the thief".

Stage 2: Linux servers, the complete CPU troubleshooting steps

CPU spikes are especially common on Linux servers. The usual causes are bad code, high concurrency, or a mining trojan.

Step 1: Find the culprit that is eating the CPU (top command)

Connect to the server over SSH and run the following command in the terminal:

Bash

top

Once the interactive interface opens, press uppercase P (Shift + p) to sort processes by CPU usage from high to low.

Keep an eye on the first few lines and observe the following three core fields:

PID: The ID number of the process. You will need it later to kill the process.

USER: The user that started the process. If it is www or nginx, the cause is probably a code problem. If it is root and the process name looks strange, suspect a trojan.

COMMAND: The name of the process.

Common COMMAND suspects and what they usually mean:

php-fpm, java, node, python: the business code is stuck in an infinite loop, or unindexed database queries are being hammered under high concurrency and the application is grinding through them.

mysql: The database is executing complex join queries or full table scans.

kswapd0: The system is running out of memory and is frantically moving memory pages to the swap partition on disk, which drives CPU usage up as well.

kinsing, sysrv, or a string of random characters: the server has been compromised. These are classic mining trojans.
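When you want a one-shot snapshot instead of the interactive top screen (for example, to paste into a ticket), the same ranking can be sketched with ps. The helper name rank_by_cpu is hypothetical, and it assumes its input has the columns PID, USER, %CPU, COMMAND with a header on the first line. Note that the %CPU reported by ps is averaged over the lifetime of the process, so it will not match top's live figure exactly.

```shell
# Rank processes by CPU usage from `ps` output read on stdin.
# Assumed input columns: PID USER %CPU COMMAND (header on line 1).
rank_by_cpu() {
    limit="${1:-10}"
    # Drop the header, sort numerically on column 3 (descending),
    # and keep only the top $limit rows.
    tail -n +2 | LC_ALL=C sort -k3,3 -nr | head -n "$limit"
}
```

A possible invocation is `ps -eo pid,user,pcpu,comm | rank_by_cpu 5`, which prints the five heaviest processes with the same PID, USER, and COMMAND fields discussed above.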

Step 2: Dig deeper to see what is going on inside the process

If your own code (such as a Java or PHP process) is maxing out the CPU, knowing the PID is not enough. You need to know which line of code is causing the trouble.

Scenario A: Troubleshooting a fully loaded Java process (a classic in interviews and in production)

Assume that the PID of the fully loaded Java process is 1234.

Find the thread ID (TID) that consumes the most CPU in this process:

Bash

top -Hp 1234

Assume that thread 1256 accounts for 90% of the CPU.

Convert this thread ID to hexadecimal (the Java thread dump shows native thread IDs in hexadecimal):

Bash

printf "%x\n" 1256
# Output: 4e8

Use the jstack tool to print the Java thread stacks, and use grep to pick out the hexadecimal thread ID:

Bash

jstack 1234 | grep -A 20 "4e8"

The screen shows which class and line of code is running (for example com.xx.service.impl.OrderServiceImpl.lambda$0(OrderServiceImpl.java:88)), which makes the infinite loop obvious at a glance.
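The three steps above can be chained into a rough sketch. The helper names hottest_tid and tid_to_hex are hypothetical, and the sketch swaps top -Hp for `ps -L -o tid,pcpu` so the output is easy to parse. It assumes a JDK whose thread dump prints the native thread ID as nid=0x followed by hex. If your JDK prints the native ID in decimal, grep for the decimal TID instead.

```shell
# Pick the thread with the highest %CPU from `ps -L -o tid,pcpu` output
# read on stdin (header on line 1, columns: TID %CPU).
hottest_tid() {
    tail -n +2 | LC_ALL=C sort -k2,2 -nr | head -n 1 | awk '{ print $1 }'
}

# Convert a decimal thread ID to lowercase hexadecimal, e.g. 1256 -> 4e8.
tid_to_hex() {
    printf '%x\n' "$1"
}
```

With PID 1234 as in the example, a possible one-liner is `jstack 1234 | grep -A 20 "nid=0x$(tid_to_hex "$(ps -L -o tid,pcpu -p 1234 | hottest_tid)")"`. Because ps averages CPU over the thread's lifetime, it is worth confirming the hot thread with top -Hp when the spike is recent.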

Scenario B: MySQL is maxing out the CPU

If the mysql process is at the top of the list, log in to the database immediately and execute:

SQL

SHOW FULL PROCESSLIST;

This shows the SQL statements that are currently executing. Focus on the statements with the largest Time value (execution time). If you see a large number of SELECT statements that have been running for a long time without using an index, notify the developers to add indexes, or ask the DBA to kill the offending statements as a temporary measure.
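On a busy server the process list can run to hundreds of rows, so it can help to filter it. The sketch below is one way to do that, assuming the tab-separated output that `mysql --batch` produces for SHOW FULL PROCESSLIST (columns Id, User, Host, db, Command, Time, State, Info). The function name slow_statements and the default threshold are hypothetical.

```shell
# Read tab-separated SHOW FULL PROCESSLIST output on stdin and print
# Id, Time and Info for statements running for at least $1 seconds.
# Assumed columns: Id User Host db Command Time State Info (header on line 1).
slow_statements() {
    min_seconds="${1:-10}"
    awk -F '\t' -v min="$min_seconds" '
        # Skip the header; keep only rows that are actively running a query.
        NR > 1 && $5 == "Query" && $6 + 0 >= min { print $1 "\t" $6 "\t" $8 }
    '
}
```

A possible invocation, assuming client credentials are already configured, is `mysql --batch -e 'SHOW FULL PROCESSLIST' | slow_statements 30`. The first column of each result is the Id that the DBA can pass to MySQL's KILL QUERY statement.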

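Step 3: Act decisively (how to end it cleanly)

Case 1: An ordinary infinite loop in the code

If the business is at risk, you can first kill the process by PID to buy some breathing room. Note that kill -9 gives the process no chance to clean up, so try a plain kill first if the service can still shut down on its own:

Bash

kill -9 <PID>

Then fix the bug in the code as soon as possible.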

Case 2: The server is infected with a mining trojan

Attackers usually install scheduled tasks, so a simple kill -9 does not get rid of the trojan: it comes back to life within seconds.

Check the scheduled tasks: run crontab -l. If you find an unfamiliar download script, delete it immediately with crontab -e.

Check for leftover processes and kill them.

The ultimate weapon: if the trojan has spread widely, the fastest and cleanest solution is to restore the system from a snapshot, or to reinstall the system as described in the previous article.
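For the scheduled task check above, a rough heuristic can speed up the search when a crontab is long. The sketch below flags entries that download something with curl or wget and pipe it straight into a shell. The function name suspicious_cron_lines is hypothetical, and the pattern is an assumption about what such entries often look like, so a clean result does not prove the server is clean.

```shell
# Print crontab lines (read on stdin) that fetch a script with curl or wget
# and pipe it directly into sh or bash. Comment lines are ignored.
suspicious_cron_lines() {
    grep -v '^[[:space:]]*#' | grep -E '(curl|wget)[^|]*\|[[:space:]]*(ba)?sh'
}
```

A possible invocation is `crontab -l | suspicious_cron_lines`. The same filter can be fed the files under /etc/cron.d, since the current user's crontab is not the only place a scheduled task can hide.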

Stage 3: Windows servers, the complete CPU troubleshooting steps

Windows servers are more intuitive: everything can be done through the graphical interface.

Step 1: Open Task Manager

Connect to the server via Remote Desktop.

Right-click the bottom taskbar and select Task Manager.

Click the header of the CPU column to sort it from high to low.

Step 2: Dig deeper with Resource Monitor

Sometimes Task Manager only shows that w3wp.exe (the IIS worker process) or sqlservr.exe is very high, with no further detail.

At the bottom of Task Manager, click Open Resource Monitor.

Switch to the CPU tab.

Here you can see the specific services behind each process. If w3wp.exe is high, the code of a website hosted on IIS has a problem. You can use the Worker Processes view in IIS to check which URL of which website is consuming the resources.

Stage 4: Prevention is better than cure (how to keep the CPU from maxing out again)

After putting out this fire, set up three lines of defense so that you are not woken up in the middle of the night again:

Configure CPU alarm rules in the Alibaba Cloud console: go to CloudMonitor > Application Groups or Host Monitoring and add an alarm for the server. When CPU usage stays at or above 85% for 5 minutes, send an SMS or DingTalk alert to your phone immediately. This lets you step in before usage reaches 100%.

Enable the slow query log on the database: turn on MySQL's slow_query_log and record every SQL statement that takes more than 1 second to execute. Send the log to the developers every day for optimization, so that hidden problems are eliminated early.

Set timeouts in the code: whether it is an external API call, a complex loop, or multithreaded work, a maximum timeout must be set. Do not let an infinite loop hang indefinitely and burn computing power.
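The timeout principle applies to maintenance scripts and cron jobs as much as to application code. As a minimal sketch, assuming the timeout utility from GNU coreutils is installed, a wrapper like the following puts a hard limit on any command. The name run_with_limit is hypothetical.

```shell
# Run a command with a hard time limit using the `timeout` utility.
# Returns the command's own exit status, or a non-zero status
# (124 with GNU timeout) if the limit was reached and the command was stopped.
run_with_limit() {
    limit_seconds="$1"
    shift
    timeout "$limit_seconds" "$@"
}
```

For example, `run_with_limit 30 php /path/to/report.php` would stop a runaway report script after 30 seconds instead of letting it spin. The script path here is only a placeholder. For HTTP calls made with curl, the `--max-time` option serves the same purpose.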

Summary

Troubleshooting a fully loaded CPU is like catching a thief:

First use top or Task Manager to identify the suspect (the PID), then use jstack or PROCESSLIST to gather the evidence (which line of code or which SQL statement), and finally deal with it decisively and set up monitoring alarms.

Don't panic when you run into problems. Follow the standard process, and any bad code or trojan that is choking the system will have nowhere to hide.
