Diagnosing Slow Response Issues in Production Systems

A real-world incident analysis of slow responses in a WeChat platform, covering concurrency spikes, Tomcat limits, and database bottlenecks.

Problem Description

The system is a WeChat third-party platform that hosts Official Accounts and Mini Programs.

One evening around 7:00 PM, multiple account administrators reported that their official accounts kept showing:

“The service provided by this account is temporarily unavailable.”

Manual verification confirmed the issue; it was system-wide, not isolated to a single account.

Problem Investigation

Verify the Issue

Repeated requests all returned the same error message, indicating a global service failure.

Check Web Service Logs

Error Logs

Several exceptions were found in the backend logs:

Error 1:

java.io.IOException: APR error: -32
...

Error 2:

java.lang.IllegalStateException: getOutputStream() has already been called for this response
...

Both exceptions indicate that the client had already closed the connection before the response was fully written (APR error -32 corresponds to a broken pipe). Initially, it was therefore suspected that WeChat was terminating requests: the WeChat gateway expects third-party servers to respond within 5 seconds, and shows the “service temporarily unavailable” message when they do not.

Analyze Access Logs

Response Time Analysis

Access logs showed that request durations did not exceed 5 seconds, so slow server-side processing alone did not explain the errors.
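
One way to verify this is to scan the access log directly. Below is a minimal sketch, assuming a Tomcat AccessLogValve pattern that ends with %D (Tomcat's time-taken token, typically milliseconds) as the last field; the log path and field position are assumptions to adjust for your own setup.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SlowRequestScan {
    public static void main(String[] args) throws IOException {
        long thresholdMs = 5_000; // WeChat's 5-second callback limit
        // Print every request that took longer than the threshold;
        // no output means no request exceeded 5 seconds.
        Files.lines(Paths.get("logs/localhost_access_log.txt")) // assumed path
             .filter(line -> {
                 String[] fields = line.split(" ");
                 try {
                     return Long.parseLong(fields[fields.length - 1]) > thresholdMs;
                 } catch (NumberFormatException e) {
                     return false; // skip malformed lines
                 }
             })
             .forEach(System.out::println);
    }
}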

QPS Analysis

Real-time log inspection revealed:

  • Incoming requests were returning 404
  • Requests were being rejected by Tomcat

Tomcat configuration:

<Connector
    port="31542"
    maxThreads="150"
    acceptCount="200"
    connectionTimeout="40000" />

Key limits:

  • maxThreads: 150 (maximum number of worker threads handling requests concurrently)
  • acceptCount: 200 (queue length for connections waiting for a free worker thread)

Once roughly 350 requests (maxThreads + acceptCount) were in flight, Tomcat refused further connections, as the sketch below illustrates.
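
The rejection behavior can be mimicked with a plain ThreadPoolExecutor. This is an analogy sketch, not Tomcat's actual connector code: a bounded pool of 150 workers with a 200-slot queue turns away everything beyond 350 in-flight tasks, just as the connector settings above cap the server.

import java.util.concurrent.*;

public class CapacityDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                150, 150,                      // like maxThreads="150"
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(200), // like acceptCount="200"
                new ThreadPoolExecutor.AbortPolicy());

        int rejected = 0;
        for (int i = 0; i < 400; i++) {        // simulate the 400+ request spike
            try {
                // each "request" holds a worker thread for one second
                pool.execute(() -> {
                    try { Thread.sleep(1_000); } catch (InterruptedException ignored) {}
                });
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        System.out.println("rejected: " + rejected); // ~50 of 400 turned away
        pool.shutdown();
    }
}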

Concurrency Analysis

Peak concurrency reached 400+ requests per second, exceeding the server's effective ceiling of roughly 350 in-flight requests.

Although concurrency later dropped, the service did not recover, suggesting a deeper bottleneck.

System Resource Check

CPU and memory usage were inspected:

(Charts: CPU status and memory status)

Memory was normal. CPU showed abnormal fluctuations but was not the primary cause.

Database Investigation

DBA analysis revealed:

  • MySQL CPU usage jumped from 5% to 100%
  • Query latency increased dramatically
  • Write pressure spiked during traffic surge

Historical metrics showed:

  • Long-running queries (see the inspection sketch after this list)
  • High row scans
  • CPU saturation under load
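
Such long-running statements can also be surfaced live over JDBC. A minimal sketch, assuming direct access to the incident database with MySQL Connector/J on the classpath (the connection details are placeholders):

import java.sql.*;

public class LongQueryCheck {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://db-host:3306/app"; // placeholder host/schema
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement st = conn.createStatement();
             // list statements that have been running for more than 5 seconds
             ResultSet rs = st.executeQuery(
                 "SELECT id, time, state, info "
               + "FROM information_schema.processlist "
               + "WHERE command <> 'Sleep' AND time > 5 "
               + "ORDER BY time DESC")) {
            while (rs.next()) {
                System.out.printf("#%d %ds [%s] %s%n",
                        rs.getLong("id"), rs.getLong("time"),
                        rs.getString("state"), rs.getString("info"));
            }
        }
    }
}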

Root Cause

Two key issues were identified:

  1. Sudden concurrency spike caused excessive database writes
  2. Slow SQL queries became significantly worse under CPU constraints

This created a feedback loop: high concurrency → DB slowdown → request backlog → further load. The loop also explains why the service did not recover once incoming traffic dropped: the backlog kept the database saturated.

Resolution

Actions taken:

  • Optimized slow SQL queries by adding missing indexes (see the sketch below)
  • Reduced database CPU pressure
  • Cleared request backlog

Service recovered after database optimization.
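
For illustration, a hedged sketch of this kind of fix over a hypothetical wx_message table (the table, columns, and index name are assumptions, not the actual schema): add the missing index, then re-run EXPLAIN to confirm the full table scan is gone.

import java.sql.*;

public class IndexFix {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://db-host:3306/app"; // placeholder connection
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement st = conn.createStatement()) {
            // cover the filter and sort columns of the slow query
            // (hypothetical table and columns)
            st.execute("CREATE INDEX idx_account_time "
                     + "ON wx_message (account_id, created_at)");
            try (ResultSet rs = st.executeQuery(
                    "EXPLAIN SELECT * FROM wx_message "
                  + "WHERE account_id = 42 "
                  + "ORDER BY created_at DESC LIMIT 20")) {
                while (rs.next()) {
                    // expect the access type to move from ALL (full scan)
                    // to ref/range, and the rows estimate to drop sharply
                    System.out.println(rs.getString("type")
                            + " rows=" + rs.getString("rows"));
                }
            }
        }
    }
}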

Lessons Learned

  1. No traffic estimation for newly integrated large accounts
  2. Lack of database slow-query alerts (see the polling sketch after this list)
  3. Insufficiently structured troubleshooting approach
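
As a starting point for lesson 2, a minimal polling sketch: it watches MySQL's cumulative Slow_queries counter and flags any increase between polls. The connection details and the one-minute interval are assumptions; a real deployment would feed a proper monitoring system instead.

import java.sql.*;

public class SlowQueryAlert {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://db-host:3306/app"; // placeholder connection
        long last = -1;
        while (true) {
            try (Connection conn = DriverManager.getConnection(url, "user", "pass");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SHOW GLOBAL STATUS LIKE 'Slow_queries'")) {
                rs.next();
                long current = rs.getLong("Value"); // cumulative counter
                if (last >= 0 && current > last) {
                    System.err.println("ALERT: " + (current - last)
                            + " new slow queries in the last minute");
                }
                last = current;
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }
}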

Recommended troubleshooting order:

Logs → System resources → Database

Summary

The incident was triggered by a high-traffic official account pushing a popular article, causing traffic far beyond expected capacity.

Performance issues are rarely caused by a single factor; they emerge from system-wide interactions under load.