How We Pinpointed a 244ms Latency Spike in a 500k QPS OpenResty Gateway

Recently, we partnered with a leading fintech client on a routine performance evaluation of their core cross-border payment clearing system. The system’s entry point is a high-performance API gateway built on OpenResty that handles billions of requests daily, with peak QPS exceeding 500,000. In the fintech sector, system stability and latency are the lifeblood of business operations, and the client maintains exceptionally stringent Service Level Objectives (SLOs) for critical transaction paths. At first glance, the system appeared to be running smoothly: P50 latency remained stable within 10ms, and all core indicators were healthy.

Despite the healthy P50 figure, a closer look at the latency curve revealed a significant stability risk: periodic spikes pushed latency up to 300ms, exceeding the strict SLA thresholds for critical transaction paths. For a system where OpenResty serves as the critical gateway, this signals performance degradation and also creates a risk of transaction timeouts.
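To see why a median can look healthy while spikes go unnoticed, here is a minimal Python sketch. The latency samples are synthetic and chosen only for illustration (they are not the client's data), and the `percentile` helper is a simple nearest-rank implementation, not what any particular monitoring stack uses.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Smallest rank such that at least pct% of samples are at or below it.
    rank = (len(ordered) * pct + 99) // 100
    return ordered[max(rank, 1) - 1]


# Synthetic data: 985 fast requests at 8 ms and 15 spiking requests at 244 ms.
latencies_ms = [8] * 985 + [244] * 15

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"P50={p50}ms mean={mean:.2f}ms P99={p99}ms max={max(latencies_ms)}ms")
```

With only 1.5% of requests spiking, the P50 does not move at all and the mean shifts by a few milliseconds, while the P99 and the maximum expose the spike directly.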

When Traditional Monitoring Fails to Pinpoint the Root Cause

During this health check, we used OpenResty XRay to perform a non-intrusive deep scan of the client’s production environment. Although the client’s existing monitoring dashboards indicated that the system was running smoothly overall, OpenResty XRay’s analysis quickly uncovered two critical performance risks hidden beneath the seemingly stable averages.

This represents a classic dilemma many have encountered: you might have a general idea of where the problem lies, likely within the Lua code, but you lack the precise details—which specific line, which function, and under what exact conditions it’s triggered.

From Heuristics to the Power of Dynamic Observability

Clearly, the challenge is no longer collecting more monitoring data, but extracting actionable insights from the data already available. To break free from the inefficient “observe-guess-verify” cycle, we need a tool that can safely perform deep investigations in a production environment. This is where OpenResty XRay demonstrates its core value. Its non-intrusive dynamic tracing requires no code modifications and no service restarts, which is a non-negotiable requirement for critical financial systems.

We initiated OpenResty XRay’s automated analysis on one of the client’s high-load production Pods. Within minutes, the initial deep analysis reports were generated, and the answers to the puzzle began to surface.

Pinpointing Performance Hotspots

During an in-depth analysis of a customer’s high-performance gateway cluster, our primary objective was to address a persistent latency spike issue.

Based on these findings, we recommended replacing this function in the critical path with a more modern, JIT-friendly alternative that is available within the framework and designed for high-concurrency environments. Following implementation, the system’s latency spikes were effectively eliminated, and service availability metrics stabilized.

Deep Overhead Analysis

Even after resolving the latency issues, the system’s overall CPU utilization remained above the expected baseline, indicating further optimization opportunities.

We recommended enabling the “compilation cache” option for the relevant regular expression calls, ensuring a “compile once, run many times” approach. This adjustment significantly reduced the CPU footprint of the logging module, freeing up computational resources and thereby increasing the server’s capacity to handle core business logic.
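The “compile once, run many times” idea can be sketched in a few lines of Python. This is an analogy rather than the client's Lua code: the `CompileOnceMatcher` class, the `LOG_PATTERN` regex, and the sample log lines are all hypothetical. In OpenResty itself, the option in question is most likely the `o` (compile-once) flag accepted by the `ngx.re.*` functions, though that is our assumption about which setting applies here.

```python
import re


class CompileOnceMatcher:
    """Caches compiled patterns so each distinct regex is compiled only once."""

    def __init__(self):
        self._cache = {}
        self.compile_count = 0  # how many times a pattern was actually compiled

    def search(self, pattern, text):
        compiled = self._cache.get(pattern)
        if compiled is None:
            # Cache miss: pay the compilation cost a single time.
            compiled = re.compile(pattern)
            self._cache[pattern] = compiled
            self.compile_count += 1
        return compiled.search(text)


# Hypothetical pattern for a log line; not taken from the client's system.
LOG_PATTERN = r"status=(\d{3}) latency_ms=(\d+)"

matcher = CompileOnceMatcher()
lines = [f"status=200 latency_ms={n}" for n in range(10_000)]
slow = sum(
    1 for line in lines
    if int(matcher.search(LOG_PATTERN, line).group(2)) > 9_000
)

print(f"matched {len(lines)} lines, {slow} slow, compiled {matcher.compile_count} time(s)")
```

Python's `re` module keeps an internal cache of its own, so the explicit dictionary here mainly serves to make the technique visible. The point carries over to any regex engine: without a cache, the compilation cost is paid on every call in the hot path.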

The Forgotten PCRE JIT Setting

While the previous two optimizations addressed specific application-level issues, the “Lua-Land” report from OpenResty XRay uncovered a deeper, more systemic problem.

Quantifiable Engineering Efficiency and Resource Optimization Results

Leveraging insights from OpenResty XRay, the customer team implemented a series of optimizations. The effects were immediate and quantifiable:

  1. Latency Spikes Completely Eliminated: Post-optimization, the spikes that had exceeded 300ms disappeared, and the latency curve settled at a consistently stable level.
  2. 30% CPU Cost Savings: After resolving the regex cache issue and globally enabling JIT, the gateway cluster’s overall CPU utilization decreased by approximately 30%, leading to significant reductions in cloud infrastructure costs.
  3. MTTR (Mean Time To Resolution) Significantly Shortened: The diagnosis time for performance issues was dramatically reduced from “weeks of guesswork and meetings” to “minutes of accurate pinpointing.”

Building Continuous Performance Observability

The immediate value of resolving these two performance bottlenecks is clear. However, the deeper insight reinforces a core engineering principle: in high-concurrency, low-latency OpenResty environments, performance issues often lurk within the build system, runtime configurations, and the intricate details of the underlying infrastructure.

When the root cause of a problem extends beyond application code logic, traditional monitoring methods quickly become inefficient. Without dynamic, non-intrusive, deep-level tracing capabilities, even the most seasoned engineers face significantly higher costs in pinpointing these elusive performance regressions.

Building on this experience, the client’s engineering team is planning the next phase of their engineering system optimization. They intend to shift continuous performance analysis left by integrating OpenResty XRay into the benchmark testing phase of their CI/CD pipeline. The goal is for automated benchmark reports to detect performance anomalies stemming from environmental factors, configurations, or compilation processes before the offending code or configuration is merged into the main branch. This initiative marks a shift in mindset from “passive response” to “active defense.”
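As a rough illustration of what such a benchmark gate might look like, the Python sketch below compares a candidate build's latency percentiles against a stored baseline and reports the metrics that regressed. The function name `check_regression`, the 10% tolerance, and all numbers are hypothetical; the article does not describe how the client's pipeline or OpenResty XRay's reports will actually be wired together.

```python
def check_regression(baseline_ms, candidate_ms, tolerance=0.10):
    """Return the metric names where the candidate exceeds the baseline
    by more than the given relative tolerance."""
    regressions = []
    for name, base in baseline_ms.items():
        cand = candidate_ms.get(name)
        if cand is None:
            continue  # metric not measured for the candidate build
        if cand > base * (1 + tolerance):
            regressions.append(name)
    return regressions


# Hypothetical benchmark results, in milliseconds.
baseline = {"p50": 8.0, "p99": 20.0}
candidate = {"p50": 8.2, "p99": 31.0}

failed = check_regression(baseline, candidate)
if failed:
    print("performance regression in: " + ", ".join(failed))
else:
    print("no regression detected")
```

A CI job could exit with a non-zero status whenever `failed` is non-empty, which would block the merge until the regression is investigated.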

We hope this in-depth analysis of two common performance blind spots in the OpenResty environment offers valuable insights and strategies for those of you on the front lines, dedicated to enhancing system stability and efficiency.

What is OpenResty XRay

OpenResty XRay is a dynamic-tracing product that automatically analyzes your running applications to troubleshoot performance problems, behavioral issues, and security vulnerabilities with actionable suggestions. Under the hood, OpenResty XRay is powered by our Y language targeting various runtimes like Stap+, eBPF+, GDB, and ODB, depending on the context.


Yichun Zhang (GitHub handle: agentzh) is the original creator of the OpenResty® open-source project and the CEO of OpenResty Inc.

Yichun is one of the earliest advocates and leaders of “open-source technology”. He has worked at internationally renowned tech companies such as Cloudflare and Yahoo!. He is a pioneer of “edge computing”, “dynamic tracing” and “machine coding”, with over 22 years of programming and 16 years of open source experience. Yichun is well-known in the open-source space as the project leader of OpenResty®, adopted by more than 40 million global website domains.

OpenResty Inc., the enterprise software start-up founded by Yichun in 2017, has customers from some of the biggest companies in the world. Its flagship product, OpenResty XRay, is a non-invasive profiling and troubleshooting tool that significantly enhances and utilizes dynamic tracing technology. Its OpenResty Edge product is a powerful distributed traffic management and private CDN software product.

As an avid open-source contributor, Yichun has contributed more than a million lines of code to numerous open-source projects, including the Linux kernel, Nginx, LuaJIT, GDB, SystemTap, LLVM, and Perl. He has also authored more than 60 open-source software libraries.