Jun Wang writes from the perspective of a product owner working close to delivery, operations, and real-world product risk.
I am not an operations engineer. But when the Soulful Asia website started returning 504 errors and I was handling it alone, AI helped me turn a confusing outage into a structured investigation, a confirmed attack diagnosis, and a step-by-step recovery.
I am not an engineer.
I design products, plan roadmaps, and write requirements. Servers, containers, and logs are usually background terms in a conversation, not the things I naturally want to touch.
But one morning, the Soulful Asia website that I was independently responsible for went down. The site is a core public entry point for the association and also supports membership, event sign-up, and journal content. There was no warning. It simply returned a 504 Gateway Timeout.
And I was handling it alone.
I opened the site and saw a blank page with "504 Gateway Time-out nginx" at the bottom of the browser.
My first thought was simple: maybe I should restart it. I restarted the container, the site recovered briefly, and then failed again within minutes. That immediately told me this was not a random temporary glitch. Something persistent was dragging the service down.
I pasted the exact error into AI and asked what to check. The most useful thing AI gave me was not an instant solution. It was a troubleshooting path. AI said a 504 usually means nginx is waiting too long for an upstream service, so I should first check whether the upstream container was actually healthy by looking at CPU, memory, and process count.
The numbers were shocking: CPU close to 200 percent, memory in gigabytes, and process count in the thousands. For a frontend service, normal CPU should be very low, memory should be a small fraction of that, and process count should be around a few dozen. At that point the situation no longer looked like an ordinary performance problem.
AI told me to inspect the running processes inside the container. Among the normal service processes, there were unfamiliar programs with random-looking names and launch flags that matched background crypto-mining behaviour.
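That inspection step can be sketched as a small filter over `ps aux`-style output. Everything below is illustrative: the container name is a placeholder, the process name `./xr9a` is invented, and the 50 percent threshold is arbitrary; only the pattern matters, an unfamiliar command consuming most of the CPU.

```shell
# Sketch: flag processes above a CPU threshold in `ps aux`-style output.
# In the real incident the listing would come from something like:
#   docker exec <container-name> ps aux
flag_hot_processes() {
  # $1: ps aux output; print COMMAND and %CPU for anything above 50% CPU
  printf '%s\n' "$1" | awk 'NR > 1 && $3 > 50 { print $11, $3 }'
}

# Invented sample output: one normal worker, one suspicious process
sample='USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
www 10 0.3 1.2 1000 500 ? S 09:00 0:01 nginx
www 666 197.0 8.4 9000 4000 ? R 09:05 45:00 ./xr9a'

flag_hot_processes "$sample"
```

The point of a filter like this is not automation; it is that a non-specialist can reduce thousands of process lines to the one or two that actually need explaining.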
At that point the pattern became much clearer: the site was timing out because the container's resources were being consumed by a malicious process.
AI then helped me interpret what I was seeing: this looked like a crypto-mining program using the server's CPU while starving the website of resources.
The first useful question was not "How do I restart it?" It was "Why is this happening?"
That shift mattered. Restarting the container only gave me a short illusion of recovery. AI helped me understand that if the vulnerability was still present, the attacker could come back almost immediately. A restart without a root-cause fix would only restart the problem too.
When I asked how they had gotten in, AI led me toward a likely answer. The day before, I had rebuilt the frontend container using an older framework version to solve an unrelated issue. That version had a known security vulnerability. Because the container was running with very high privileges, a successful exploit gave the attacker too much freedom inside the environment.
I also checked the access logs and found repeated suspicious POST requests coming from the same source with constantly changing user-agent strings. That pattern strongly suggested automated attack traffic rather than normal user behaviour.
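That pattern is easy to surface with standard tools. A minimal sketch, assuming nginx's default "combined" log format; the IP addresses, paths, and user agents below are invented sample data, and the usual log location (`/var/log/nginx/access.log`) is typical rather than confirmed for this setup:

```shell
# Invented sample lines in nginx "combined" access log format
sample_log='203.0.113.9 - - [01/Jan/2025:09:00:01 +0000] "POST /api/upload HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11)"
203.0.113.9 - - [01/Jan/2025:09:00:02 +0000] "POST /api/upload HTTP/1.1" 200 512 "-" "curl/7.88"
198.51.100.4 - - [01/Jan/2025:09:00:03 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Mac)"'

# Count POST requests per client IP. One IP with many POSTs and rotating
# user agents is the automated-attack signature described above.
printf '%s\n' "$sample_log" |
  awk '$6 == "\"POST" { count[$1]++ } END { for (ip in count) print ip, count[ip] }'
```

In practice you would point the same `awk` command at the live access log instead of a sample string.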
This was an uncomfortable but useful lesson. A known security issue that sits in a backlog is still an open window.
I also asked AI a question that many people might overlook: if monitoring showed the container as healthy, why was the website still down? AI explained that the health check was only confirming that the service process responded somehow. It was not checking whether the HTTP response was actually useful for users. The process had not crashed, so the health check kept reporting healthy, even while real requests were timing out.
That was a very important lesson for me: monitoring tells you the answer to the question you chose to ask. It does not automatically tell you whether the system is truly working for users.
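One way to act on that lesson is to make the health check assert something about the response body, not just that a process answered. A minimal sketch; the marker string and the `curl` command in the comment are assumptions for illustration, not the association's real setup:

```shell
# A "deep" health check: healthy means the page contains expected content,
# not merely that some process produced a response.
MARKER="Soulful Asia"   # assumed marker text; pick something stable on the page

check_body() {
  # $1: response body; succeed only when the expected marker is present
  printf '%s' "$1" | grep -q "$MARKER"
}

# In production this function would wrap a real request, for example:
#   body=$(curl -fsS --max-time 5 http://localhost/) && check_body "$body"
```

With a check like this, a 504 error page or a blank response fails the probe, instead of the process's mere existence counting as "healthy".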
AI was especially clear about one step: the rebuild had to avoid old cached layers. If the build tool reused cached image layers, it might also reuse layers that still contained malicious files. A clean rebuild was part of the remediation, not an optional extra.
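In Docker terms, a clean rebuild is a standard flag rather than anything exotic. The image and service names below are placeholders, not the real project's names:

```shell
# Rebuild without reusing any cached image layers, so no layer that might
# contain planted files survives into the new image.
docker build --no-cache -t frontend:clean .

# Or, if the project uses Docker Compose:
docker compose build --no-cache frontend
docker compose up -d frontend

# Optionally remove dangling layers left behind by the old build:
docker image prune -f
```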
After recovery, the outcome was obvious in the runtime numbers: CPU dropped back close to zero, memory usage returned to normal, the process count collapsed back to a few dozen, and the website responded normally again.
One detail stayed with me as a broader product truth: monitoring only answers the question you designed it to ask. If the check is too shallow, the dashboard can look calm while the user experience is already broken.
As a product manager, I naturally think through user journeys and system logic. What this incident taught me is that the same thinking also matters in operations. A health check is a designed product artifact. A recovery path is a designed process. A monitoring gap is often a requirement gap in disguise.
AI did not perform any action for me. What it did was help me turn a chaotic situation into a sequence I could understand, evaluate, and execute without giving up control.
This article is not written for security specialists. It is written for people like me who carry product ownership and still sometimes end up handling real operational problems.
When something goes wrong, you do not need to pretend you understand everything already. You need a good question, a clear troubleshooting structure, and a tool that is willing to reason through the situation with you step by step.
That was the role AI played for me that day.