#troubleshooting

All articles tagged #troubleshooting — practical guides from production experience.

68 posts tagged #troubleshooting · page 8 of 8

December 5, 2024 · 12 min read

Linux TIME_WAIT Explained: Why It Causes Connection Failures and How to Fix It

Linux TIME_WAIT exhausts ephemeral ports and causes ECONNREFUSED under load — even when your app is healthy. Learn what TIME_WAIT is, how to detect port exhaustion with ss and netstat, and the exact sysctl fixes that resolve it.

#linux #networking #troubleshooting #infrastructure #debugging

November 28, 2024 · 10 min read

NGINX Upstream Keepalive Explained: Why Missing It Causes 502 Errors

Missing keepalive in your NGINX upstream block silently kills connections under load. Here's exactly what keepalive does, how TCP connection reuse works, and the production-ready config that stops 502s before they start.

#nginx #infrastructure #production #networking #troubleshooting

October 3, 2024 · 6 min read

Docker Ate My Disk: Fixing Log Rotation Before It Kills Production

How a single verbose container filled a 500GB disk in 72 hours, and the exact daemon.json config that stops it from ever happening again.

#docker #logs #infrastructure #troubleshooting

August 11, 2024 · 7 min read

Reading Logs Like a Detective: A Field Guide to Incident Triage

The exact commands and mental models I use to go from 'something is wrong' to 'I know exactly what happened' in under 15 minutes.

#logs #debugging #incident #troubleshooting #security-ops

May 2, 2024 · 9 min read

strace, lsof, and ss: The Trio That Solves Every Mystery

When logs give you nothing and the debugger isn't an option, these three tools let you see exactly what a running process is doing at the system call level.

#debugging #linux #troubleshooting #production

#troubleshooting Articles — Linux & DevOps Troubleshooting | damonsec.com