Linux High CPU Usage: Step-by-Step Troubleshooting Guide
Step-by-step guide to diagnosing Linux high CPU usage — using ps, top, and htop to identify the culprit, distinguish user vs kernel vs I/O wait CPU, and resolve the issue in production.
TL;DR
- High load average does not always mean high CPU: check %wa (I/O wait) in top first
- ps aux --sort=-%cpu | head -10 is the fastest way to find the CPU culprit
- Press 1 in top to see the per-core breakdown: one saturated core is easy to miss in the aggregate
- %CPU in ps is a lifetime average: use top for current spikes
- High %sy (kernel CPU) means system call overhead: check for excessive I/O or context switching
- High %st (steal time) on a VM means the hypervisor is throttling your instance
- Never kill -9 first: identify the cause, then decide
Introduction: Not All High CPU Is the Same
An alert fires: CPU is at 85%. Engineers open top and start looking for the process to kill. Half the time, the process they find is not the actual problem — it is a symptom.
Linux high CPU usage troubleshooting starts with understanding what kind of CPU usage you are dealing with. User-space CPU (%us) is different from kernel CPU (%sy), I/O wait (%wa), and steal time (%st). Each points to a different root cause and a different fix.
This guide walks through the exact steps to diagnose high CPU on a Linux system, with real commands and real scenarios.
Step 1: Get the Overview
Before touching anything, understand the scope.
# Uptime and load average
uptime
# 15:42:07 up 14 days, 2:31, 3 users, load average: 6.32, 4.21, 2.87
# How many CPUs do you have?
nproc
grep -c processor /proc/cpuinfo
Interpreting load average:
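- Load average / CPU core count = rough demand per core (on Linux the load average also counts tasks in uninterruptible I/O sleep, so it is not pure CPU utilization)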
- Load 6.32 on 4 cores = 158% — significantly overloaded
- Load 6.32 on 8 cores = 79% — high but not critical
A rising load average (1-min > 15-min) means the system is getting worse. A falling load average means it is recovering.
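The ratio above can be computed directly. The following is a minimal sketch; the function name `load_per_core` is made up for illustration, and the output is load per core as a percentage, not a measured CPU utilization.

```shell
# Load average divided by CPU count, as a rounded percentage.
load_per_core() {
  # $1 = load average (for example the 1-minute value), $2 = number of CPUs
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.0f\n", 100 * load / cpus }'
}

# Possible usage on a live Linux host (reads the 1-minute value from /proc/loadavg):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the figures from the example, `load_per_core 6.32 4` prints 158 and `load_per_core 6.32 8` prints 79.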
Step 2: Identify What Kind of CPU Usage
top
Look at the CPU line immediately:
%Cpu(s): 78.2 us, 8.1 sy, 0.0 ni, 10.4 id, 2.8 wa, 0.0 hi, 0.5 si, 0.0 st
| Reading | What it means | What to do |
|---|---|---|
| High us (>70%) | Application consuming CPU | Find which process (next step) |
| High sy (>20%) | Kernel overhead | Check system calls, context switching |
| High wa (>10%) | I/O wait (disk/NFS) | Not a CPU problem; check storage |
| High st (>5%) | VM steal time | Hypervisor throttling; infrastructure issue |
| Low id (<10%) | System near capacity | Prioritize finding the cause |
Critical insight: If %wa is high and %id is also non-trivial, the system is waiting on I/O. Adding CPU or killing processes will not help. The bottleneck is storage.
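The thresholds in the table can be applied mechanically to a captured CPU line. This is a sketch only: `classify_cpu_line` is a hypothetical helper, the thresholds are the rule-of-thumb values from the table, and it assumes the modern procps-ng layout (`78.2 us, 8.1 sy, ...`) with a dot as the decimal separator.

```shell
# Classify a top "%Cpu(s):" line using the rule-of-thumb thresholds from the table.
classify_cpu_line() {
  # $1 is a line such as: %Cpu(s): 78.2 us, 8.1 sy, 0.0 ni, 10.4 id, ...
  printf '%s\n' "$1" | awk '
    {
      sub(/^[^:]*:/, "")                 # drop the "%Cpu(s):" label
      n = split($0, parts, ",")
      for (i = 1; i <= n; i++) {
        split(parts[i], kv, " ")         # kv[1] = value, kv[2] = field name
        v[kv[2]] = kv[1] + 0
      }
      if (v["us"] > 70) print "user: find the consuming process"
      if (v["sy"] > 20) print "kernel: check syscalls and context switching"
      if (v["wa"] > 10) print "iowait: check storage"
      if (v["st"] > 5)  print "steal: hypervisor contention"
      if (v["id"] < 10) print "near capacity"
    }'
}

# Possible usage (the first batch iteration of top can be unrepresentative, so take the second):
#   classify_cpu_line "$(LC_ALL=C top -bn2 -d1 | grep '^%Cpu' | tail -1)"
```

Several lines can print at once, which is intentional: a host can be both kernel-heavy and near capacity.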
Step 3: Find the Process Consuming CPU
Snapshot with ps
ps aux --sort=-%cpu | head -15
This shows the top CPU consumers. The %CPU column is averaged over the process lifetime — useful for identifying sustained consumers, but it misses brief spikes.
# Watch it update every 2 seconds
watch -n 2 'ps aux --sort=-%cpu | head -10'
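Because ps reports a lifetime average, a current figure needs two samples. One way to get it without top is to read the process's accumulated CPU ticks from /proc twice and divide by the interval. The sketch below assumes Linux; `proc_ticks` and `cpu_pct` are illustrative names, not standard tools.

```shell
# utime + stime (in clock ticks) for one PID, read from /proc/<pid>/stat.
proc_ticks() {
  stat=$(cat "/proc/$1/stat") || return 1
  # comm (field 2) may contain spaces, so drop everything up to the last ") "
  set -- ${stat##*) }
  # after the cut, $1 is field 3 (state), so utime (14) and stime (15) are ${12} and ${13}
  echo $(( ${12} + ${13} ))
}

# Percent of one core used between two tick samples.
cpu_pct() {
  # $1 = earlier ticks, $2 = later ticks, $3 = seconds between samples, $4 = ticks per second
  awk -v a="$1" -v b="$2" -v s="$3" -v hz="$4" \
    'BEGIN { printf "%.1f\n", 100 * (b - a) / hz / s }'
}

# Possible usage for a PID of interest (1234 is a placeholder):
#   hz=$(getconf CLK_TCK)
#   a=$(proc_ticks 1234); sleep 2; b=$(proc_ticks 1234)
#   cpu_pct "$a" "$b" 2 "$hz"
```

As in top, a multi-threaded process can exceed 100 here, since the ticks of all threads are summed.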
Live view with top
top
# P = sort by CPU (should be default)
# Press 1 to see per-core breakdown
Always press 1 in top. On a multi-core server, a single thread saturating one core shows as 25% in aggregate view on a 4-core system. That 25% does not look alarming — but that thread is completely bottlenecked and all work assigned to it is queued.
# After identifying the PID in top, monitor it specifically
top -p 1234
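The per-core view can also be reproduced from two /proc/stat snapshots, which is handy in scripts where an interactive top is not available. This is a sketch under the assumption of the standard Linux /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal); `per_core_busy` is a hypothetical name.

```shell
# Per-core busy percentage between two saved /proc/stat snapshots.
per_core_busy() {
  # $1 = earlier snapshot file, $2 = later snapshot file
  awk '
    /^cpu[0-9]+ / {                       # per-core lines only; skips the aggregate "cpu" line
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user through steal
      idle = $5 + $6                      # idle + iowait
      if (NR == FNR) { t0[$1] = total; i0[$1] = idle; next }   # first file: remember baseline
      dt = total - t0[$1]; di = idle - i0[$1]
      if (dt > 0) printf "%s %.0f\n", $1, 100 * (dt - di) / dt
    }' "$1" "$2"
}

# Possible usage:
#   cat /proc/stat > /tmp/stat.a; sleep 2; cat /proc/stat > /tmp/stat.b
#   per_core_busy /tmp/stat.a /tmp/stat.b
```

A core that prints close to 100 while the others stay low is the saturated-core case described above.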
Filter by user
# If you know which service is slow
ps -u appuser -o pid,%cpu,%mem,comm --sort=-%cpu | head -10
Step 4: Investigate the High-CPU Process
Once you have the PID, understand what it is doing.
Check the full command
ps -p 1234 -o pid,ppid,user,%cpu,stat,cmd
# Shows full command line, not just process name
Check how long it has been running
ps -p 1234 -o pid,etime,%cpu,comm
# etime shows elapsed time since process started
A process that just spiked in the last minute behaves differently from one that has been consuming CPU for hours.
Check CPU usage breakdown for threads
# Show individual threads of a multi-threaded process
ps -L -p 1234 -o pid,tid,pcpu,stat,comm --sort=-pcpu | head -10
# or
top -H -p 1234
# H = thread mode inside top
If one thread is consuming 100% and others are idle, you have a single-threaded bottleneck or a thread stuck in a busy loop. A true deadlock looks different: the threads are blocked and use almost no CPU.
Check what the process is actually doing
# What system calls is it making?
strace -c -p 1234 # summary mode; let it run 10-30 seconds, then press Ctrl-C to print the table
# If output shows high time in futex: thread contention
# If output shows high time in read/write: I/O bound
# If no syscalls dominate: pure compute in user space
# What is the kernel wait channel?
cat /proc/1234/wchan
# schedule = voluntarily yielding CPU (normal)
# futex = waiting on mutex
# poll_schedule_timeout = waiting on I/O polling
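The strace interpretation rules above can be applied to a saved summary. This sketch assumes the default `strace -c` table, which is sorted by % time and ends with a `total` row; `dominant_syscall` and `interpret_syscall` are illustrative helpers, and the mapping only restates the rules of thumb from this step.

```shell
# Print the syscall with the largest share of time in an `strace -c` summary (stdin).
dominant_syscall() {
  # Data rows start with a decimal percentage; the syscall name is the last field
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $NF != "total" { print $NF; exit }'
}

# Map a syscall name to the rule of thumb it suggests.
interpret_syscall() {
  case "$1" in
    futex)                       echo "thread lock contention" ;;
    read|write|pread64|pwrite64) echo "I/O-heavy syscall pattern" ;;
    clone|clone3|fork|vfork)     echo "excessive process or thread spawning" ;;
    "")                          echo "no syscalls recorded: likely pure user-space compute" ;;
    *)                           echo "inspect $1 manually" ;;
  esac
}

# Possible usage with a summary saved to a file (strace writes the table to stderr):
#   strace -c -p 1234 2> /tmp/strace.summary    # stop with Ctrl-C after 10-30 seconds
#   interpret_syscall "$(dominant_syscall < /tmp/strace.summary)"
```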
Step 5: Correlate With Application Behavior
High CPU is a symptom. The cause is in the application logs.
# When did CPU spike? Check logs around that time
journalctl -u myapp --since "15:30:00" --until "15:45:00" --no-pager | grep -iE "warn|error|exception"
# Is it processing a large job?
journalctl -u myapp --since "15:30:00" | grep -i "processing\|batch\|job\|queue"
Pattern matching:
- CPU spike + "batch job started" in logs = expected behavior, not a bug
- CPU spike + "queue backed up" = processing backlog, may need scaling
- CPU spike + no relevant log activity = something unexpected is running
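The three patterns can be turned into a first-pass filter over the log window. This is a rough sketch: the matched phrases are the example strings from the list above, not messages any particular application is known to emit, and `classify_spike_logs` is a hypothetical name.

```shell
# Read log lines on stdin and report which of the three patterns they match.
classify_spike_logs() {
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -qiE 'queue.*(backed up|backlog)'; then
    echo "backlog"
  elif printf '%s\n' "$logs" | grep -qi 'batch job started'; then
    echo "expected batch work"
  else
    echo "no relevant log activity"
  fi
}

# Possible usage (myapp and the time window are placeholders):
#   journalctl -u myapp --since "15:30:00" --until "15:45:00" --no-pager | classify_spike_logs
```

Adjust the patterns to the wording your own service logs; the structure matters more than the exact strings.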
Real Troubleshooting Scenarios
Scenario 1: Java Process at 200% CPU
ps aux --sort=-%cpu | head -5
# app 8823 198.4 4.2 ... java -jar service.jar
Java is at 198%, roughly two full cores, which on a 4-core server is about 50% of total capacity.
# Check thread breakdown
top -H -p 8823
# Shows: 2 threads at ~99%, rest near 0%
Two threads are maxed out. Check Java GC:
# Check if GC is the problem
journalctl -u myapp | grep -i "gc\|garbage" | tail -20
# Or check JVM metrics if available
# jstat -gcutil 8823 1000 10 (if JDK is installed)
High GC activity causes CPU spikes because the JVM is spending time collecting garbage instead of running application code. The fix is usually heap sizing or memory leak investigation.
Scenario 2: High CPU But No Obvious Process
top shows CPU at 80% but no single process over 10%.
# Check context switching rate
vmstat 1 5
# r b cs
# 4 0 8234 <- 8234 context switches/second is high
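High context switching with many moderate-CPU processes points to CPU contention between too many active processes: the system spends a growing share of its time switching contexts instead of doing work. Compare the cs figure against a baseline for the same host, since a normal rate varies widely with core count and workload.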
# Count R-state processes
ps -eo stat | grep -c '^R'   # the count includes the ps process itself
# If more than 2-3x your core count, you have run queue saturation
Scenario 3: CPU Spike in %sy (Kernel Space)
%Cpu(s): 12.3 us, 67.4 sy, 0.0 ni, 18.1 id, 1.8 wa
67% kernel CPU. The OS itself is doing heavy work.
# High sy often means excessive system calls — check which process
strace -c -p <top_pid> 2>&1
# If futex dominates: thread lock contention
# If read/write dominates: I/O-heavy app making many small syscalls
# If clone/fork dominates: process spawning excessively
High %sy from a web server often means too many small I/O operations — sending small packets, writing many tiny files. Batching I/O operations usually resolves it.
Scenario 4: High Steal Time on a VM
%Cpu(s): 15.2 us, 3.1 sy, 0.0 ni, 42.3 id, 0.4 wa, 0.0 hi, 0.2 si, 38.8 st
38.8% steal time. For about 39% of the sampling interval, this VM's vCPUs were ready to run but the hypervisor was running something else.
This is an infrastructure problem, not an application problem. The VM is oversubscribed on the hypervisor.
# Confirm with vmstat
vmstat 1 10 | awk 'NR>2 {print $17}' # st column (17th field in procps vmstat)
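Picking a vmstat column by position is fragile, because the layout can differ between versions. One alternative is to select the column by its header name. The helper below is a sketch; `vmstat_col` is a made-up name, and it assumes the usual two header lines followed by numeric rows.

```shell
# Print the values of one named vmstat column (for example st or cs) from stdin.
vmstat_col() {
  awk -v want="$1" '
    col == 0 {                                   # still looking for the header row
      for (i = 1; i <= NF; i++) if ($i == want) col = i
      next
    }
    $col ~ /^[0-9]+$/ { print $col }             # numeric rows only; skips repeated headers
  '
}

# Possible usage:
#   vmstat 1 10 | vmstat_col st
#   vmstat 1 5  | vmstat_col cs
```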
Actions:
- Move the VM to a less-loaded hypervisor host
- Upgrade to a larger instance type
- Contact cloud provider if on a dedicated instance that should not have steal
What Not To Do
Do not kill -9 the high-CPU process immediately. Understand why it is running hot first. It might be doing legitimate work (batch job, index rebuild, backup). Killing it might:
- Corrupt data mid-write
- Leave lock files that prevent restart
- Cause the parent to respawn it immediately
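If the process does have to be stopped once the cause is understood, a gentler sequence is to send SIGTERM, give it time to clean up, and only then reconsider. The following is a minimal sketch for Linux; `is_running` and `graceful_stop` are hypothetical helpers, and the 10-second default grace period is an arbitrary assumption.

```shell
# Succeeds if the PID exists and is not a zombie (relies on Linux /proc).
is_running() {
  stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
  rest=${stat##*) }            # text after "pid (comm) "; first character is the state
  case "$rest" in
    Z*) return 1 ;;            # zombie: already exited, just not reaped yet
    *)  return 0 ;;
  esac
}

# Send SIGTERM and wait; never escalates to SIGKILL on its own.
graceful_stop() {
  pid=$1
  grace=${2:-10}               # seconds to wait after SIGTERM
  is_running "$pid" || return 0
  kill -TERM "$pid" || return 2
  waited=0
  while [ "$waited" -lt "$grace" ]; do
    is_running "$pid" || return 0
    sleep 1
    waited=$((waited + 1))
  done
  echo "PID $pid still running ${grace}s after SIGTERM; check logs before escalating" >&2
  return 1
}
```

A return code of 1 is the signal to go back to the logs and decide deliberately, rather than reaching for kill -9 by reflex. For services managed by systemd, stopping the unit is usually preferable, since it also prevents an immediate respawn.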
Do not restart the service without checking logs. If you restart without understanding the cause, it will happen again. Check application logs first.
Do not add more servers before diagnosing. High CPU caused by a bug (infinite loop, N+1 query, GC thrash) does not go away with more servers: the same bug runs on every new machine, so you just have the problem in more places.
Quick Reference
# ── OVERVIEW ─────────────────────────────────────────────────────
uptime # load average
nproc # CPU core count
top # live CPU breakdown
# ── FIND THE CULPRIT ─────────────────────────────────────────────
ps aux --sort=-%cpu | head -10 # snapshot top consumers
top # live, press P then 1
watch -n 2 'ps aux --sort=-%cpu | head -5' # watch over time
# ── INVESTIGATE ──────────────────────────────────────────────────
top -H -p <pid> # thread CPU breakdown
strace -c -p <pid> # syscall summary
cat /proc/<pid>/wchan # kernel wait channel
vmstat 1 5 # context switch rate
# ── CORRELATE ─────────────────────────────────────────────────────
journalctl -u service --since "HH:MM" --no-pager # logs around spike time
Conclusion
Linux high CPU usage is almost always one of: a misbehaving application process, CPU contention between too many processes, I/O wait being mistaken for CPU usage, or hypervisor steal time.
The diagnostic order matters: check what type of CPU usage it is first (%us vs %wa vs %sy vs %st), then find which process, then understand why that process is consuming CPU.
top gives you the type. ps gives you the process. Logs give you the reason. Use all three, in that order.