Scaling SDK Generation to Handle Millions of Tokens Per Second
How Sideko built a high-performance caching system in Rust that transformed our SDK generation pipeline from hundreds to millions of tokens per second
The Challenge
When we first launched our SDK generation API, we were processing a modest few hundred tokens per second. As demand grew, we quickly hit a wall: the system was spending 80% of its time recompiling the same parsing queries over and over, a bottleneck that could not sustain the growing load.
Understanding Our Query-Driven Architecture
To understand why caching was so critical, it's important to understand what we mean by "queries" in our SDK generation context. Our system performs sophisticated code transformations through structured pattern matching queries - think of them as SQL for source code syntax trees.
These queries allow us to precisely identify and manipulate code patterns like:
Finding function definitions with specific parameters (see the example patterns after this list)
Locating class methods that need SDK wrapper generation
Identifying import statements that require modification
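As a rough illustration, assuming a tree-sitter-style query syntax over the Python grammar (an assumption; the exact query DSL in production differs), patterns for the three cases above could look like this:

```rust
// Illustrative patterns only: tree-sitter-style syntax with tree-sitter-python
// node names. The capture names and shapes are assumptions, not the real set.

// Function definitions with specific parameters.
const FIND_FUNCTION_WITH_PARAMS: &str = r#"
(function_definition
  name: (identifier) @fn.name
  parameters: (parameters (identifier) @fn.param)) @fn
"#;

// Class methods that need SDK wrapper generation.
const FIND_CLASS_METHODS: &str = r#"
(class_definition
  name: (identifier) @class.name
  body: (block (function_definition) @class.method))
"#;

// Import statements that may require modification.
const FIND_IMPORTS: &str = r#"
[(import_statement) (import_from_statement)] @import
"#;
```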
Why did we build such a complex system? To generate extremely high-quality SDKs that our users (or their LLM coding assistants) can annotate, with that hand-written code retained in subsequent automated SDK updates.
Each query gets compiled into an optimized state machine that can efficiently traverse and match against parsed code structures. The compilation step was our bottleneck - a single query could take 10ms to compile, but less than 0.01ms to execute once compiled.
With thousands of concurrent SDK generation requests, each using 20-50 different queries for various transformations (adding parameters, inserting methods, replacing identifiers, etc.), we were spending most of our time recompiling the same patterns repeatedly instead of actually generating SDKs.
Our SDK generation pipeline includes several core transformation operations that each rely heavily on queries. As an example, consider the operation that adds new parameters to a Python function.
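A minimal sketch of what such an operation can look like, assuming a tree-sitter-based engine (the tree-sitter Rust crate, roughly its 0.20-era API, with the tree-sitter-python grammar); the real query DSL and edit machinery are more involved:

```rust
use tree_sitter::{Parser, Query, QueryCursor};

/// Append `new_param` to the parameter list of the function named `fn_name`.
/// (Sketch only: error handling and formatting are simplified.)
fn add_parameter(source: &str, fn_name: &str, new_param: &str) -> Option<String> {
    let language = tree_sitter_python::language();
    let mut parser = Parser::new();
    parser.set_language(language).ok()?;
    let tree = parser.parse(source, None)?;

    // The expensive step: compiling the pattern into a matcher (~10ms).
    let query = Query::new(
        language,
        "(function_definition
            name: (identifier) @fn.name
            parameters: (parameters) @fn.params)",
    )
    .ok()?;

    // The cheap step: running the compiled matcher (<0.01ms).
    let names = query.capture_names();
    let mut cursor = QueryCursor::new();
    for m in cursor.matches(&query, tree.root_node(), source.as_bytes()) {
        let node_for = |tag: &str| {
            m.captures
                .iter()
                .find(|c| names[c.index as usize] == tag)
                .map(|c| c.node)
        };
        let name = node_for("fn.name")?;
        let params = node_for("fn.params")?;
        if name.utf8_text(source.as_bytes()).ok()? != fn_name {
            continue;
        }
        // Splice the new parameter in just before the closing ')'.
        let insert_at = params.end_byte() - 1;
        let sep = if params.named_child_count() > 0 { ", " } else { "" };
        let mut edited = String::with_capacity(source.len() + new_param.len() + 2);
        edited.push_str(&source[..insert_at]);
        edited.push_str(sep);
        edited.push_str(new_param);
        edited.push_str(&source[insert_at..]);
        return Some(edited);
    }
    None
}
```

The compile step in this sketch is exactly what caching removes: build the Query once, then run the cheap matching step thousands of times.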
Each operation requires compiling its query pattern into an efficient matcher before it can process code. In a typical SDK generation request, we might execute thousands of these operations across hundreds of files, with each compilation step adding latency.
We used Flamegraph (https://github.com/brendangregg/FlameGraph) to profile the service and visualize the query compilation bottleneck.
The Solution: A Two-Tier Caching Architecture
With this context in mind, our caching strategy becomes much clearer. We designed a caching system with two distinct layers, each optimized for different query patterns and access requirements.
Static Query Cache: The Foundation
Our first layer handles the bread-and-butter transformation queries that every SDK generation request uses - finding typical function definitions, locating class bodies, identifying import statements, and other fundamental code patterns.
This cache is populated at application startup with all of our commonly-used queries. Since these queries are immutable and used across all sessions, we can safely share them as Arc<Query> for zero-copy access.
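A minimal sketch of what this layer can look like, again assuming tree-sitter queries (the pattern names and contents here are placeholders, not the production set):

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock};
use tree_sitter::Query;

// Placeholder patterns; the production set covers every common transformation.
const STATIC_PATTERNS: &[(&str, &str)] = &[
    ("py.function_definition", "(function_definition name: (identifier) @name) @fn"),
    ("py.class_body", "(class_definition body: (block) @body)"),
    ("py.import", "(import_statement) @import"),
];

static STATIC_QUERIES: OnceLock<HashMap<&'static str, Arc<Query>>> = OnceLock::new();

/// Compile the shared query set once at startup; every later lookup is an
/// Arc clone, so concurrent requests share the same compiled query.
fn static_query(name: &str) -> Option<Arc<Query>> {
    let cache = STATIC_QUERIES.get_or_init(|| {
        let language = tree_sitter_python::language();
        STATIC_PATTERNS
            .iter()
            .map(|(name, pattern)| {
                let query = Query::new(language, pattern)
                    .expect("static patterns are known-good and must compile");
                (*name, Arc::new(query))
            })
            .collect()
    });
    cache.get(name).cloned()
}
```

Because this map is built once and never mutated, lookups need no locking beyond the one-time initialization.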
Dynamic Query Cache: Session-Aware Intelligence
The second layer handles user-specific transformation patterns that can't be predetermined: custom naming conventions, project-specific code patterns, or dynamically generated queries derived from the input code structure.
This cache is session-aware and includes automatic cleanup to prevent memory leaks from abandoned sessions.
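A sketch of this layer under the same assumptions (the SessionId type and exact eviction policy here are illustrative):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tree_sitter::{Language, Query, QueryError};

// Illustrative session key; the real session identifier isn't described here.
type SessionId = u64;

/// Session-scoped cache for queries generated per request or per user.
#[derive(Default)]
struct DynamicQueryCache {
    by_session: Mutex<HashMap<SessionId, HashMap<String, Arc<Query>>>>,
}

impl DynamicQueryCache {
    /// Return the cached compiled query for this session, or compile and cache it.
    fn get_or_compile(
        &self,
        session: SessionId,
        language: Language,
        pattern: &str,
    ) -> Result<Arc<Query>, QueryError> {
        let mut sessions = self.by_session.lock().unwrap();
        let per_session = sessions.entry(session).or_default();
        if let Some(query) = per_session.get(pattern) {
            return Ok(Arc::clone(query)); // cache hit: no recompilation
        }
        // Cache miss: pay the 10-50ms compile cost once per session.
        // (This sketch compiles while holding the lock; hoisting compilation
        // out of the critical section is a straightforward refinement.)
        let compiled = Arc::new(Query::new(language, pattern)?);
        per_session.insert(pattern.to_owned(), Arc::clone(&compiled));
        Ok(compiled)
    }

    /// Drop every query compiled for a session once that session ends.
    fn end_session(&self, session: SessionId) {
        self.by_session.lock().unwrap().remove(&session);
    }
}
```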
Implementation: The Query Resolution Strategy
Our caching strategy follows a clear hierarchy that maximizes cache hits while maintaining flexibility: consult the static cache first, fall back to the session's dynamic cache, and only compile a new query (caching it in the dynamic layer) when both miss.
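Putting the two sketches above together, resolution could look roughly like this (it reuses the hypothetical static_query() and DynamicQueryCache from earlier):

```rust
use std::sync::Arc;
use tree_sitter::{Language, Query, QueryError};

fn resolve_query(
    dynamic: &DynamicQueryCache,
    session: SessionId,
    language: Language,
    name: &str,
    pattern: &str,
) -> Result<Arc<Query>, QueryError> {
    // 1. Shared, immutable queries compiled once at startup.
    if let Some(query) = static_query(name) {
        return Ok(query);
    }
    // 2. Session-scoped queries: reuse if already compiled for this session,
    //    otherwise compile now and cache under the session key.
    dynamic.get_or_compile(session, language, pattern)
}
```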
When a session ends, all associated dynamic queries are automatically cleaned up, preventing memory accumulation over time.
Performance Impact: The Numbers
The results of our caching implementation were dramatic:
| Metric | Improvement |
| --- | --- |
| Throughput | ~200x increase |
| CPU Usage | 58% reduction |
| Memory Efficiency | New capability |
Key Architectural Decisions
We chose Mutex<HashMap<>> over more complex lock-free structures because:
Query compilation is expensive (10-50ms), so lock overhead is negligible by comparison
Simple implementation reduces bugs and maintenance overhead
Excellent performance characteristics for our read-heavy workload
Using Arc<Query> allows us to:
Share compiled queries across multiple concurrent requests
Eliminate redundant memory usage for identical queries
Maintain thread safety without performance penalties
Partitioning dynamic queries by session provides:
Natural cleanup boundaries when sessions end
Prevention of query interference between users
Simplified cache invalidation logic
Lessons Learned
Measure Before Optimizing: We initially tried to optimize query execution when compilation was the real bottleneck.
Cache at the Right Level: Our first attempt cached raw strings, but caching the compiled objects provided much better performance.
Plan for Cleanup: Dynamic caches without cleanup strategies become memory leaks in production.
Monitor Cache Effectiveness: Understanding hit rates helped us tune which queries to include in the static cache.
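Hit-rate tracking doesn't have to be elaborate: a couple of atomic counters per cache layer is enough to see which patterns deserve promotion into the static cache. A sketch, not the actual metrics code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simple hit/miss counters, exported to metrics to guide which query
/// patterns belong in the static cache.
#[derive(Default)]
struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheStats {
    fn record(&self, hit: bool) {
        let counter = if hit { &self.hits } else { &self.misses };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    fn hit_rate(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}
```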
By implementing a thoughtful two-tier caching strategy, we transformed our SDK generation service from a performance bottleneck into a high-throughput system capable of handling millions of tokens per second. The key was recognizing that query compilation, not execution, was our primary constraint.