Senior L3 Infrastructure & Security Engineer

Linux Troubleshooting & Performance — From Production

If you've ever stared at a server with 100% CPU, a 502 waterfall, or a hung process that won't die — this blog is for you. Real incidents, real commands, real fixes. No padding.

About Damon

Senior Technical Support Engineer · L3

OPSWAT · Ho Chi Minh City

7+ years working on Linux systems in production — from running server fleets at an Anti-DDoS company to L3 support for enterprise deployments at OPSWAT, where the bugs are always someone else's OS-level problem.

Most of what I write comes from incidents that took too long to diagnose the first time. The goal is to make it faster for you.

What This Blog Covers

Written for engineers who are mid-incident, not mid-tutorial. Every post has the exact commands and the reasoning behind each one — not just the fix, but why it works.

  • Linux performance troubleshooting — CPU, memory, load average, I/O wait
  • Process debugging — ps, top, strace, lsof, process states (D, Z, R)
  • NGINX production issues — 502 errors, upstream keepalive, SSL hardening
  • Security hardening — CIS benchmarks for Ubuntu, RHEL, Windows Server
  • Incident response — log analysis, root cause, postmortem workflows

Common searches that land here: linux high cpu usage, debug linux server, load average explained, nginx 502 under load, linux process monitoring.

Start Here

The three guides most engineers need first

Browse by Topic

Grouped by what you are actually trying to do

Linux Commands

ps, top, htop, strace, ss — the tools you reach for first

Troubleshooting

High CPU, memory leaks, zombie processes, 502 errors

Monitoring & Debugging

strace, lsof, auditd, log analysis

Security & Infrastructure

CIS hardening, firewall, Docker, NGINX config

All Topics

linux troubleshooting · nginx debugging · security hardening · infrastructure

Latest Articles

Most recent Linux and DevOps troubleshooting guides

NGINX 502 Bad Gateway Under Load: Root Causes and Fixes

NGINX 502 errors under load are almost never a simple app crash. This guide covers the real root causes — connection backlog overflow, keepalive misconfiguration, ephemeral port exhaustion — with diagnostic commands and config fixes from production incidents.

12 min read

Log Analysis for Security Investigations: Windows Event Logs and Web Server Access Logs

A practical guide to log analysis for security investigations — Windows Event Viewer, critical Event IDs, Apache access log parsing, and the Linux command-line tools that make manual log analysis fast and effective.

9 min read

Diamond Model of Intrusion Analysis: 4 Core Components Explained (2026)

A technical breakdown of the Diamond Model of Intrusion Analysis — adversary, victim, capability, and infrastructure — with real attack examples, meta-features, and how it compares to the Cyber Kill Chain and MITRE ATT&CK.

19 min read

Cyber Kill Chain: All 7 Phases Explained with Real Attack Examples (2026)

A technical deep-dive into the Cyber Kill Chain — all 7 phases mapped with real attacker techniques, detection indicators, and defensive controls. Includes a full real-world attack walkthrough and Kill Chain vs MITRE ATT&CK comparison.

20 min read

How to Trace Route in Linux: traceroute Examples

Use traceroute in Linux to diagnose network path issues — read hop output, interpret timeouts, use TCP mode to bypass firewalls, and identify where packets are being dropped.

5 min read

Tools & Resources

Beyond the blog: scripts, CLI tools, and guides built from the same production experience. If I find myself doing the same thing manually three times, it becomes a tool.

CLI Tools

sys-monitor and seo-pro-audit — open-source utilities from this blog

Troubleshooting Guides

Deep-dive walkthroughs for the incidents that take the longest to debug

Security Hardening

CIS benchmark implementation guides for Ubuntu, RHEL, and Windows Server
