Jun Wang writes from the perspective of a product owner working close to delivery, operations, and real-world product risk.
I am not an operations engineer. But when the Soulful Asia website started returning 504 errors and I was handling it alone, AI helped me turn a confusing outage into a structured investigation, a confirmed attack diagnosis, and a step-by-step recovery.
I am not an engineer.
I design products, plan roadmaps, and write requirements. Servers, containers, and logs are usually background terms in a conversation, not the things I naturally want to touch.
But one morning, the Soulful Asia website that I was independently responsible for went down. The site is a core public entry point for the association and also supports membership, event sign-up, and journal content. There was no warning. It simply returned a 504 Gateway Timeout.
And I was handling it alone.
I opened the site and saw a blank page with "504 Gateway Time-out nginx" at the bottom of the browser.
My first thought was simple: maybe I should restart it. I restarted the container, the site recovered briefly, and then failed again within minutes. That immediately told me this was not a random temporary glitch. Something persistent was dragging the service down.
I pasted the exact error into AI and asked what to check. The most useful thing AI gave me was not an instant solution. It was a troubleshooting path. AI said a 504 usually means nginx is waiting too long for an upstream service, so I should first check whether the upstream container was actually healthy by looking at CPU, memory, and process count.
The numbers were shocking: CPU close to 200 percent, memory in gigabytes, and process count in the thousands. For a frontend service, normal CPU should be very low, memory should be a small fraction of that, and process count should be around a few dozen. At that point the situation no longer looked like an ordinary performance problem.
AI told me to inspect the running processes inside the container. Among the normal service processes, there were unfamiliar programs with random-looking names and launch flags that matched background crypto-mining behaviour.
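That inspection step can be sketched as a small filter over `ps aux`-style output. Everything below is illustrative: the container name is a placeholder, the process name `./xr9a` is invented, and the 50 percent threshold is arbitrary; only the pattern matters, an unfamiliar command consuming most of the CPU.

```shell
# Sketch: flag processes above a CPU threshold in `ps aux`-style output.
# In the real incident the listing would come from something like:
#   docker exec <container-name> ps aux
flag_hot_processes() {
  # $1: ps aux output; print COMMAND and %CPU for anything above 50% CPU
  printf '%s\n' "$1" | awk 'NR > 1 && $3 > 50 { print $11, $3 }'
}

# Invented sample output: one normal worker, one suspicious process
sample='USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
www 10 0.3 1.2 1000 500 ? S 09:00 0:01 nginx
www 666 197.0 8.4 9000 4000 ? R 09:05 45:00 ./xr9a'

flag_hot_processes "$sample"
```

The point of a filter like this is not automation; it is that a non-specialist can reduce thousands of process lines to the one or two that actually need explaining.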
At that point the pattern became much clearer: the site was timing out because the container's resources were being consumed by a malicious process.
AI then helped me interpret what I was seeing: this looked like a crypto-mining program using the server's CPU while starving the website of resources.
The first useful question was not "How do I restart it?" It was "Why is this happening?"
That shift mattered. Restarting the container only gave me a short illusion of recovery. AI helped me understand that if the vulnerability was still present, the attacker could come back almost immediately. A restart without a root-cause fix would only restart the problem too.
When I asked how they had gotten in, AI led me toward a likely answer. The day before, I had rebuilt the frontend container using an older framework version to solve an unrelated issue. That version had a known security vulnerability. Because the container was running with very high privileges, a successful exploit gave the attacker too much freedom inside the environment.
I also checked the access logs and found repeated suspicious POST requests coming from the same source with constantly changing user-agent strings. That pattern strongly suggested automated attack traffic rather than normal user behaviour.
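That pattern is easy to surface with standard tools. A minimal sketch, assuming nginx's default "combined" log format; the IP addresses, paths, and user agents below are invented sample data, and the usual log location (`/var/log/nginx/access.log`) is typical rather than confirmed for this setup:

```shell
# Invented sample lines in nginx "combined" access log format
sample_log='203.0.113.9 - - [01/Jan/2025:09:00:01 +0000] "POST /api/upload HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11)"
203.0.113.9 - - [01/Jan/2025:09:00:02 +0000] "POST /api/upload HTTP/1.1" 200 512 "-" "curl/7.88"
198.51.100.4 - - [01/Jan/2025:09:00:03 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Mac)"'

# Count POST requests per client IP. One IP with many POSTs and rotating
# user agents is the automated-attack signature described above.
printf '%s\n' "$sample_log" |
  awk '$6 == "\"POST" { count[$1]++ } END { for (ip in count) print ip, count[ip] }'
```

In practice you would point the same `awk` command at the live access log instead of a sample string.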
This was an uncomfortable but useful lesson. A known security issue that sits in a backlog is still an open window.
I also asked AI a question that many people might overlook: if monitoring showed the container as healthy, why was the website still down? AI explained that the health check was only confirming that the service process responded somehow. It was not checking whether the HTTP response was actually useful for users. The process had not crashed, so the health check kept reporting healthy, even while real requests were timing out.
That was a very important lesson for me: monitoring tells you the answer to the question you chose to ask. It does not automatically tell you whether the system is truly working for users.
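One way to act on that lesson is to make the health check assert something about the response body, not just that a process answered. A minimal sketch; the marker string and the `curl` command in the comment are assumptions for illustration, not the association's real setup:

```shell
# A "deep" health check: healthy means the page contains expected content,
# not merely that some process produced a response.
MARKER="Soulful Asia"   # assumed marker text; pick something stable on the page

check_body() {
  # $1: response body; succeed only when the expected marker is present
  printf '%s' "$1" | grep -q "$MARKER"
}

# In production this function would wrap a real request, for example:
#   body=$(curl -fsS --max-time 5 http://localhost/) && check_body "$body"
```

With a check like this, a 504 error page or a blank response fails the probe, instead of the process's mere existence counting as "healthy".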
AI was especially clear about one step: the rebuild had to avoid old cached layers. If the build tool reused cached image layers, it might also reuse layers that still contained malicious files. A clean rebuild was part of the remediation, not an optional extra.
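In Docker terms, a clean rebuild is a standard flag rather than anything exotic. The image and service names below are placeholders, not the real project's names:

```shell
# Rebuild without reusing any cached image layers, so no layer that might
# contain planted files survives into the new image.
docker build --no-cache -t frontend:clean .

# Or, if the project uses Docker Compose:
docker compose build --no-cache frontend
docker compose up -d frontend

# Optionally remove dangling layers left behind by the old build:
docker image prune -f
```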
After recovery, the outcome was obvious in the runtime numbers: CPU dropped back close to zero, memory usage returned to normal, the process count collapsed back to a few dozen, and the website responded normally again.
One detail stayed with me as a broader product truth: monitoring only answers the question you designed it to ask. If the check is too shallow, the dashboard can look calm while the user experience is already broken.
As a product manager, I naturally think through user journeys and system logic. What this incident taught me is that the same thinking also matters in operations. A health check is a designed product artifact. A recovery path is a designed process. A monitoring gap is often a requirement gap in disguise.
AI did not perform any action for me. What it did was help me turn a chaotic situation into a sequence I could understand, evaluate, and execute without giving up control.
This article is not written for security specialists. It is written for people like me who carry product ownership and still sometimes end up handling real operational problems.
When something goes wrong, you do not need to pretend you understand everything already. You need a good question, a clear troubleshooting structure, and a tool that is willing to reason through the situation with you step by step.
That was the role AI played for me that day.