Scaling SDK Generation to Handle Millions of Tokens Per Second

How Sideko built a high-performance caching system in Rust that transformed our SDK generation pipeline from hundreds to millions of tokens per second

The Challenge

When we first launched our SDK generation API, we were processing a modest few hundred tokens per second. As demand grew, we quickly hit a wall: our system was spending 80% of its time recompiling the same parsing queries over and over, a bottleneck that threatened to collapse the service as load grew.

Understanding Our Query-Driven Architecture

To understand why caching was so critical, it's important to understand what we mean by "queries" in our SDK generation context. Our system performs sophisticated code transformations through structured pattern matching queries - think of them as SQL for source code syntax trees.

These queries allow us to precisely identify and manipulate code patterns like:

  • Finding function definitions with specific parameters (see the sketch after this list)

  • Locating class methods that need SDK wrapper generation

  • Identifying import statements that require modification
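
As an illustration of the kind of pattern involved, here is what the first bullet might look like. This sketch assumes a tree-sitter-style S-expression query syntax; the post doesn't say which query engine Sideko actually uses:

// Hypothetical pattern: match a function definition and capture its
// name and parameter list for later manipulation
// (tree-sitter-style syntax assumed, not confirmed by the post)
const FIND_FUNCTION_WITH_PARAMS: &str = r#"
(function_definition
  name: (identifier) @fn.name
  parameters: (parameters) @fn.params)
"#;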

Why build such a complex system? Because it lets us generate extremely high-quality SDKs that our users (or their LLM coding assistants) can annotate by hand, and those annotations are retained in subsequent automated SDK updates.

Each query gets compiled into an optimized state machine that can efficiently traverse and match against parsed code structures. The compilation step was our bottleneck - a single query could take 10ms to compile, but less than 0.01ms to execute once compiled.

With thousands of concurrent SDK generation requests, each using 20-50 different queries for various transformations (adding parameters, inserting methods, replacing identifiers, etc.), we were spending most of our time recompiling the same patterns repeatedly instead of actually generating SDKs.
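
To put rough numbers on that: a request using 30 queries (the middle of that range) pays about 30 × 10 ms = 300 ms in compilation, versus roughly 30 × 0.01 ms = 0.3 ms in execution - a ~1000x overhead on every request when nothing is cached.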

Our SDK generation pipeline includes several core transformation operations that each rely heavily on queries. Here is an example snippet showing how Sideko adds new parameters to a Python function:

AppendNodes::new(
    "Query to find the relevant function params",
    &["timeout: int".into(), "retries: int".into()] // adding two new parameters
)
.delimiter(',');

Each operation requires compiling its query pattern into an efficient matcher before it can process code. In a typical SDK generation request, we might execute thousands of these operations across hundreds of files, with each compilation step adding latency.

We used Flamegraph (https://github.com/brendangregg/FlameGraph) to visualize where CPU time was going and confirm that query compilation was the bottleneck.

The Solution: A Two-Tier Caching Architecture

With this context in mind, our caching strategy becomes much clearer. We designed a caching system with two distinct layers, each optimized for different query patterns and access requirements.

Static Query Cache: The Foundation

Our first layer handles the bread-and-butter transformation queries that every SDK generation request uses - finding typical function definitions, locating class bodies, identifying import statements, and other fundamental code patterns.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use once_cell::sync::Lazy; // `Lazy` assumed to come from the once_cell crate

// Cache key: the target language plus the query source text
type StaticQueryKey = (EditorLanguage, String);

// Global cache for compiled static queries
pub static STATIC_QUERY_CACHE: Lazy<Mutex<HashMap<StaticQueryKey, Arc<Query>>>> =
    Lazy::new(|| Mutex::new(init_static_query_cache()));

// Initialize the static query cache with predefined queries
fn init_static_query_cache() -> HashMap<StaticQueryKey, Arc<Query>> {
    let mut cache = HashMap::new();
    // Precompile all static queries at startup;
    // get_all_static_queries() and compile_query() are defined elsewhere
    for (editor_lang, query_str) in get_all_static_queries() {
        if let Ok(compiled) = compile_query(&editor_lang, &query_str) {
            let key = (editor_lang, query_str.to_string());
            cache.insert(key, Arc::new(compiled));
        }
    }
    cache
}

This cache is populated at application startup with all of our commonly used queries. Since these queries are immutable and shared across every session, an Arc<Query> clone hands out the same compiled query to any number of requests without copying it.
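
A minimal lookup sketch - the query string and the EditorLanguage variant here are invented for illustration:

// Hypothetical lookup against the static cache
let key: StaticQueryKey = (
    EditorLanguage::Python, // variant name assumed
    "(function_definition) @fn".to_string(),
);
let query: Option<Arc<Query>> = STATIC_QUERY_CACHE
    .lock()
    .unwrap()
    .get(&key)
    .cloned(); // clones the Arc pointer, not the compiled query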

Dynamic Query Cache: Session-Aware Intelligence

The second layer handles user-specific transformation patterns that can't be predetermined - custom naming conventions, project-specific code patterns, or dynamically generated queries based on the input code structure:

use std::time::Instant;

// Key: session + language + query text, so entries are never shared across sessions
type DynamicQueryKey = (SessionId, EditorLanguage, String);

// A compiled query plus the last time a lookup touched it
pub struct TimestampedQuery {
    query: Arc<Query>,
    last_used: Instant,
}

pub static DYNAMIC_QUERY_CACHE: Lazy<Mutex<HashMap<DynamicQueryKey, TimestampedQuery>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

This cache is session-aware and includes automatic cleanup to prevent memory leaks from abandoned sessions.

Implementation: The Query Resolution Strategy

Our caching strategy follows a clear hierarchy that maximizes cache hits while maintaining flexibility:

impl QueryCache for CodeEditor {
    fn get_cached_query(&self, query_str: &str) -> EditorResult<Arc<Query>> {
        // First try static cache (fastest path)
        if let Ok(query) = self.get_static_query(query_str) {
            return Ok(query);
        }

        // Fall back to dynamic cache with compilation if needed
        self.get_dynamic_query(query_str)
    }

    fn get_dynamic_query(&self, query_str: &str) -> EditorResult<Arc<Query>> {
        let mut cache = DYNAMIC_QUERY_CACHE.lock().unwrap();
        let key = (
            self.session_id.clone(),
            self.editor_lang.clone(),
            query_str.to_string(),
        );

        // Check cache and update timestamp
        if let Some(timestamped_query) = cache.get_mut(&key) {
            timestamped_query.last_used = Instant::now();
            return Ok(Arc::clone(&timestamped_query.query));
        }

        // Compile and cache new query
        let compiled = self.compile_query(query_str)?;
        let query = Arc::new(compiled);

        cache.insert(key, TimestampedQuery {
            query: Arc::clone(&query),
            last_used: Instant::now(),
        });

        Ok(query)
    }
}
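
Every transformation operation funnels through this resolution path before touching any code. A hedged sketch of a call site (the function and its signature are our invention, not Sideko's actual API):

// Hypothetical call site inside a transformation operation
fn append_params(editor: &CodeEditor, query_str: &str) -> EditorResult<()> {
    // Served by the static cache, the dynamic cache, or compiled as a last resort
    let query = editor.get_cached_query(query_str)?;
    // ... run `query` against the parsed file and apply the edits ...
    Ok(())
}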

When a session ends, all associated dynamic queries are automatically cleaned up, preventing memory accumulation over time.
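
The post doesn't show the cleanup itself, but given the key and timestamp shapes above it can be a couple of retain calls (the function names are ours; SessionId is assumed to be comparable, which it must be to serve as a HashMap key):

// Drop every dynamic query belonging to a session that just ended
pub fn cleanup_session(session_id: &SessionId) {
    let mut cache = DYNAMIC_QUERY_CACHE.lock().unwrap();
    cache.retain(|(sid, _, _), _| sid != session_id);
}

// Optional safety net: evict entries idle for longer than `max_idle`
pub fn evict_stale(max_idle: std::time::Duration) {
    let now = Instant::now();
    let mut cache = DYNAMIC_QUERY_CACHE.lock().unwrap();
    cache.retain(|_, entry| now.duration_since(entry.last_used) < max_idle);
}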

Performance Impact: The Numbers

The results of our caching implementation were dramatic:

Metric            | Improvement
------------------|----------------
Throughput        | ~200x increase
CPU Usage         | 58% reduction
Memory Efficiency | New capability

Key Architectural Decisions

We chose a Mutex-wrapped HashMap over more complex lock-free structures because:

  • Query compilation is expensive (10-50ms), so lock overhead is negligible by comparison

  • Simple implementation reduces bugs and maintenance overhead

  • Excellent performance characteristics for our read-heavy workload

Using Arc<Query> allows us to:

  • Share compiled queries across multiple concurrent requests

  • Eliminate redundant memory usage for identical queries

  • Maintain thread safety without performance penalties

Partitioning dynamic queries by session provides:

  • Natural cleanup boundaries when sessions end

  • Prevention of query interference between users

  • Simplified cache invalidation logic

Lessons Learned

  1. Measure Before Optimizing: We initially tried to optimize query execution when compilation was the real bottleneck.

  2. Cache at the Right Level: Our first attempt cached raw strings, but caching the compiled objects provided much better performance.

  3. Plan for Cleanup: Dynamic caches without cleanup strategies become memory leaks in production.

  4. Monitor Cache Effectiveness: Understanding hit rates helped us tune which queries to include in the static cache; a minimal counter sketch follows.
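
The post doesn't show its instrumentation, but hit-rate tracking can be as small as a few atomic counters bumped on each branch of get_cached_query - a sketch under that assumption:

use std::sync::atomic::{AtomicU64, Ordering};

// Incremented on the corresponding branch of get_cached_query (hypothetical wiring)
pub static STATIC_HITS: AtomicU64 = AtomicU64::new(0);
pub static DYNAMIC_HITS: AtomicU64 = AtomicU64::new(0);
pub static COMPILE_MISSES: AtomicU64 = AtomicU64::new(0);

// Fraction of lookups served without a recompile
pub fn cache_hit_rate() -> f64 {
    let hits = STATIC_HITS.load(Ordering::Relaxed) + DYNAMIC_HITS.load(Ordering::Relaxed);
    let total = hits + COMPILE_MISSES.load(Ordering::Relaxed);
    if total == 0 { 1.0 } else { hits as f64 / total as f64 }
}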

By implementing a thoughtful two-tier caching strategy, we transformed our SDK generation service from a performance bottleneck into a high-throughput system capable of handling millions of tokens per second. The key was recognizing that query compilation, not execution, was our primary constraint.

Scale your DevEx and Simplify Integrations

Time Saved (Automation)

Automate API connections and data flows, eliminating repetitive manual coding.

Ship Cleaner Code

Production-ready, native-quality code: clean, debuggable, custom SDK structures to your standards.

Always Up-to-Date Docs

SDKs and integrations remain consistent with API and language version updates.

Copyright © 2025 Sideko, Inc. All Rights Reserved.