Scaling SDK Generation to Handle Millions of Tokens Per Second
How Sideko built a high-performance caching system in Rust that transformed our SDK generation pipeline from hundreds to millions of tokens per second
The Challenge
When we first launched our SDK generation API, we were processing a modest few hundred tokens per second. As demand grew, we quickly hit a wall: the system was spending 80% of its time recompiling the same parsing queries over and over, a bottleneck that could not sustain the growing load.
Understanding Our Query-Driven Architecture
To understand why caching was so critical, it's important to understand what we mean by "queries" in our SDK generation context. Our system performs sophisticated code transformations through structured pattern matching queries - think of them as SQL for source code syntax trees.
These queries allow us to precisely identify and manipulate code patterns like:
Finding function definitions with specific parameters (see the example patterns after this list)
Locating class methods that need SDK wrapper generation
Identifying import statements that require modification
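As a rough illustration, assuming a tree-sitter-style query syntax over the Python grammar (an assumption; the exact query DSL in production differs), patterns for the three cases above could look like this:

```rust
// Illustrative patterns only: tree-sitter-style syntax with tree-sitter-python
// node names. The capture names and shapes are assumptions, not the real set.

// Function definitions with specific parameters.
const FIND_FUNCTION_WITH_PARAMS: &str = r#"
(function_definition
  name: (identifier) @fn.name
  parameters: (parameters (identifier) @fn.param)) @fn
"#;

// Class methods that need SDK wrapper generation.
const FIND_CLASS_METHODS: &str = r#"
(class_definition
  name: (identifier) @class.name
  body: (block (function_definition) @class.method))
"#;

// Import statements that may require modification.
const FIND_IMPORTS: &str = r#"
[(import_statement) (import_from_statement)] @import
"#;
```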
Why did we build such a complex system? To generate extremely high-quality SDKs that our users (or their LLM coding assistants) can annotate, with that hand-written code retained in subsequent automated SDK updates.
Each query gets compiled into an optimized state machine that can efficiently traverse and match against parsed code structures. The compilation step was our bottleneck - a single query could take 10ms to compile, but less than 0.01ms to execute once compiled.
With thousands of concurrent SDK generation requests, each using 20-50 different queries for various transformations (adding parameters, inserting methods, replacing identifiers, etc.), we were spending most of our time recompiling the same patterns repeatedly instead of actually generating SDKs.
Our SDK generation pipeline includes several core transformation operations that each rely heavily on queries. As an example, consider the operation that adds new parameters to a Python function.
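A minimal sketch of what such an operation can look like, assuming a tree-sitter-based engine (the tree-sitter Rust crate, roughly its 0.20-era API, with the tree-sitter-python grammar); the real query DSL and edit machinery are more involved:

```rust
use tree_sitter::{Parser, Query, QueryCursor};

/// Append `new_param` to the parameter list of the function named `fn_name`.
/// (Sketch only: error handling and formatting are simplified.)
fn add_parameter(source: &str, fn_name: &str, new_param: &str) -> Option<String> {
    let language = tree_sitter_python::language();
    let mut parser = Parser::new();
    parser.set_language(language).ok()?;
    let tree = parser.parse(source, None)?;

    // The expensive step: compiling the pattern into a matcher (~10ms).
    let query = Query::new(
        language,
        "(function_definition
            name: (identifier) @fn.name
            parameters: (parameters) @fn.params)",
    )
    .ok()?;

    // The cheap step: running the compiled matcher (<0.01ms).
    let names = query.capture_names();
    let mut cursor = QueryCursor::new();
    for m in cursor.matches(&query, tree.root_node(), source.as_bytes()) {
        let node_for = |tag: &str| {
            m.captures
                .iter()
                .find(|c| names[c.index as usize] == tag)
                .map(|c| c.node)
        };
        let name = node_for("fn.name")?;
        let params = node_for("fn.params")?;
        if name.utf8_text(source.as_bytes()).ok()? != fn_name {
            continue;
        }
        // Splice the new parameter in just before the closing ')'.
        let insert_at = params.end_byte() - 1;
        let sep = if params.named_child_count() > 0 { ", " } else { "" };
        let mut edited = String::with_capacity(source.len() + new_param.len() + 2);
        edited.push_str(&source[..insert_at]);
        edited.push_str(sep);
        edited.push_str(new_param);
        edited.push_str(&source[insert_at..]);
        return Some(edited);
    }
    None
}
```

The compile step in this sketch is exactly what caching removes: build the Query once, then run the cheap matching step thousands of times.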
Each operation requires compiling its query pattern into an efficient matcher before it can process code. In a typical SDK generation request, we might execute thousands of these operations across hundreds of files, with each compilation step adding latency.
We used Flamegraph (https://github.com/brendangregg/FlameGraph) to profile the service and visualize the query compilation bottleneck.
The Solution: A Two-Tier Caching Architecture
With this context in mind, our caching strategy becomes much clearer. We designed a caching system with two distinct layers, each optimized for different query patterns and access requirements.
Static Query Cache: The Foundation
Our first layer handles the bread-and-butter transformation queries that every SDK generation request uses - finding typical function definitions, locating class bodies, identifying import statements, and other fundamental code patterns.
This cache is populated at application startup with all of our commonly-used queries. Since these queries are immutable and used across all sessions, we can safely share them as Arc<Query> for zero-copy access.
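A minimal sketch of what this layer can look like, again assuming tree-sitter queries (the pattern names and contents here are placeholders, not the production set):

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock};
use tree_sitter::Query;

// Placeholder patterns; the production set covers every common transformation.
const STATIC_PATTERNS: &[(&str, &str)] = &[
    ("py.function_definition", "(function_definition name: (identifier) @name) @fn"),
    ("py.class_body", "(class_definition body: (block) @body)"),
    ("py.import", "(import_statement) @import"),
];

static STATIC_QUERIES: OnceLock<HashMap<&'static str, Arc<Query>>> = OnceLock::new();

/// Compile the shared query set once at startup; every later lookup is an
/// Arc clone, so concurrent requests share the same compiled query.
fn static_query(name: &str) -> Option<Arc<Query>> {
    let cache = STATIC_QUERIES.get_or_init(|| {
        let language = tree_sitter_python::language();
        STATIC_PATTERNS
            .iter()
            .map(|(name, pattern)| {
                let query = Query::new(language, pattern)
                    .expect("static patterns are known-good and must compile");
                (*name, Arc::new(query))
            })
            .collect()
    });
    cache.get(name).cloned()
}
```

Because this map is built once and never mutated, lookups need no locking beyond the one-time initialization.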
Dynamic Query Cache: Session-Aware Intelligence
The second layer handles user-specific transformation patterns that can't be predetermined: custom naming conventions, project-specific code patterns, or dynamically generated queries derived from the input code structure.
This cache is session-aware and includes automatic cleanup to prevent memory leaks from abandoned sessions.
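A sketch of this layer under the same assumptions (the SessionId type and exact eviction policy here are illustrative):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tree_sitter::{Language, Query, QueryError};

// Illustrative session key; the real session identifier isn't described here.
type SessionId = u64;

/// Session-scoped cache for queries generated per request or per user.
#[derive(Default)]
struct DynamicQueryCache {
    by_session: Mutex<HashMap<SessionId, HashMap<String, Arc<Query>>>>,
}

impl DynamicQueryCache {
    /// Return the cached compiled query for this session, or compile and cache it.
    fn get_or_compile(
        &self,
        session: SessionId,
        language: Language,
        pattern: &str,
    ) -> Result<Arc<Query>, QueryError> {
        let mut sessions = self.by_session.lock().unwrap();
        let per_session = sessions.entry(session).or_default();
        if let Some(query) = per_session.get(pattern) {
            return Ok(Arc::clone(query)); // cache hit: no recompilation
        }
        // Cache miss: pay the 10-50ms compile cost once per session.
        // (This sketch compiles while holding the lock; hoisting compilation
        // out of the critical section is a straightforward refinement.)
        let compiled = Arc::new(Query::new(language, pattern)?);
        per_session.insert(pattern.to_owned(), Arc::clone(&compiled));
        Ok(compiled)
    }

    /// Drop every query compiled for a session once that session ends.
    fn end_session(&self, session: SessionId) {
        self.by_session.lock().unwrap().remove(&session);
    }
}
```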
Implementation: The Query Resolution Strategy
Our caching strategy follows a clear hierarchy that maximizes cache hits while maintaining flexibility: consult the static cache first, fall back to the session's dynamic cache, and only compile a new query (caching it in the dynamic layer) when both miss.
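Putting the two sketches above together, resolution could look roughly like this (it reuses the hypothetical static_query() and DynamicQueryCache from earlier):

```rust
use std::sync::Arc;
use tree_sitter::{Language, Query, QueryError};

fn resolve_query(
    dynamic: &DynamicQueryCache,
    session: SessionId,
    language: Language,
    name: &str,
    pattern: &str,
) -> Result<Arc<Query>, QueryError> {
    // 1. Shared, immutable queries compiled once at startup.
    if let Some(query) = static_query(name) {
        return Ok(query);
    }
    // 2. Session-scoped queries: reuse if already compiled for this session,
    //    otherwise compile now and cache under the session key.
    dynamic.get_or_compile(session, language, pattern)
}
```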
When a session ends, all associated dynamic queries are automatically cleaned up, preventing memory accumulation over time.
Performance Impact: The Numbers
The results of our caching implementation were dramatic:
| Metric | Improvement |
| --- | --- |
| Throughput | ~200x increase |
| CPU Usage | 58% reduction |
| Memory Efficiency | New capability |
Key Architectural Decisions
We chose Mutex<HashMap<>> over more complex lock-free structures because:
Query compilation is expensive (10-50ms), so lock overhead is negligible by comparison
Simple implementation reduces bugs and maintenance overhead
Excellent performance characteristics for our read-heavy workload
Using Arc<Query> allows us to:
Share compiled queries across multiple concurrent requests
Eliminate redundant memory usage for identical queries
Maintain thread safety without performance penalties
Partitioning dynamic queries by session provides:
Natural cleanup boundaries when sessions end
Prevention of query interference between users
Simplified cache invalidation logic
Lessons Learned
Measure Before Optimizing: We initially tried to optimize query execution when compilation was the real bottleneck.
Cache at the Right Level: Our first attempt cached raw strings, but caching the compiled objects provided much better performance.
Plan for Cleanup: Dynamic caches without cleanup strategies become memory leaks in production.
Monitor Cache Effectiveness: Understanding hit rates helped us tune which queries to include in the static cache.
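Hit-rate tracking doesn't have to be elaborate: a couple of atomic counters per cache layer is enough to see which patterns deserve promotion into the static cache. A sketch, not the actual metrics code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simple hit/miss counters, exported to metrics to guide which query
/// patterns belong in the static cache.
#[derive(Default)]
struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheStats {
    fn record(&self, hit: bool) {
        let counter = if hit { &self.hits } else { &self.misses };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    fn hit_rate(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}
```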
By implementing a thoughtful two-tier caching strategy, we transformed our SDK generation service from a performance bottleneck into a high-throughput system capable of handling millions of tokens per second. The key was recognizing that query compilation, not execution, was our primary constraint.