FastAPI in Practice: Building High-Performance APIs
Your FastAPI endpoint returns in 50ms locally, but under production load, response times spike to 2 seconds. You’ve already checked the usual suspects—database queries are optimized, network latency is minimal, and your infrastructure scales horizontally. Yet users are complaining, and your monitoring shows requests queueing up like cars in a traffic jam. The culprit isn’t your database or network—it’s how you’re blocking the async event loop without realizing it.
FastAPI’s async capabilities are one of its biggest selling points. The framework promises high concurrency through Python’s asyncio, handling thousands of simultaneous connections with minimal overhead. But here’s what the tutorials don’t emphasize: slapping async def on your route handlers doesn’t magically make your application asynchronous. In fact, doing this incorrectly creates a performance bottleneck worse than using synchronous code in the first place.
The problem is subtle and insidious. Your local development environment, processing one request at a time, masks the issue entirely. Everything feels fast. But production traffic exposes the truth: a single blocking call inside an async handler doesn’t just slow down one request—it freezes the entire event loop, stalling every concurrent request waiting behind it. Ten users making simultaneous requests experience the cumulative latency of all ten operations running sequentially.
This pattern catches experienced engineers off guard because the code looks correct. No errors, no warnings, just degraded performance that only manifests under load. Understanding why requires looking under the hood at how FastAPI actually handles your endpoint definitions—and the critical difference between truly async operations and synchronous code wearing an async costume.
The Async Illusion: Why Your FastAPI App Isn’t Actually Async
The promise sounds simple: write async def, let asyncio handle the concurrency, and the framework delivers thousands of simultaneous connections. But here’s the uncomfortable truth: most FastAPI applications in production aren’t running asynchronously at all. They’re quietly falling back to thread pool execution, losing the very benefits that made async attractive in the first place.

Understanding this distinction is the difference between an API that handles 10,000 concurrent connections and one that chokes at 100.
The Two Execution Paths
FastAPI routes requests through two fundamentally different execution models based on a single keyword: async.
When you define an endpoint with async def, FastAPI runs it directly on the main asyncio event loop. This is the high-performance path—a single thread can juggle thousands of concurrent requests because it never blocks. Each request yields control back to the event loop while waiting for I/O operations.
When you define an endpoint with plain def, FastAPI assumes you’re running blocking code. It offloads the entire function to a thread pool, specifically Python’s anyio thread pool with a default size of 40 workers. Your “async” framework is now running synchronous code on threads, exactly like Flask or Django.
This fallback exists for good reason—it prevents blocking operations from freezing the event loop. But it comes with severe limitations. Those 40 threads become your concurrency ceiling. Exceed that, and requests queue up, latency spikes, and your p99 response times collapse.
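A quick way to see the two paths is to return the name of the thread each endpoint executes on (a sketch; the exact thread names depend on your server setup):

```python
import threading

from fastapi import FastAPI

app = FastAPI()


@app.get("/async-path")
async def async_path():
    # Runs directly on the event loop thread
    return {"thread": threading.current_thread().name}


@app.get("/sync-path")
def sync_path():
    # Offloaded to a worker thread from the AnyIO thread pool
    return {"thread": threading.current_thread().name}
```

Hit both endpoints and the second one reports a pool thread rather than the event loop’s thread: the silent fallback in action.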
The Blocking Operation Trap
The real danger isn’t choosing def over async def. It’s using async def while unknowingly calling blocking operations. The event loop has no way to preempt running code. When you block, you block everything.
These operations silently destroy your API’s throughput:
- Synchronous database drivers: psycopg2, pymysql, and the default SQLAlchemy engine block the event loop on every query
- File system operations: open(), os.path.exists(), and file reads/writes block until the disk responds
- CPU-intensive work: JSON serialization of large payloads, image processing, or complex computations starve other requests
- Synchronous HTTP clients: requests library calls block the entire event loop during network I/O
- time.sleep(): blocks the thread entirely instead of yielding control
💡 Pro Tip: A single 100ms blocking call in an async def endpoint means zero other requests can be processed during that time—even if you have 1,000 waiting.
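Here is a minimal illustration of the trap and its fix; the external URL is a placeholder:

```python
import asyncio
import time

import httpx
import requests
from fastapi import FastAPI

app = FastAPI()


@app.get("/blocked")
async def blocked():
    # Both calls freeze the event loop; no other request is served meanwhile
    time.sleep(0.1)
    return requests.get("https://api.example.com/status").json()


@app.get("/non-blocking")
async def non_blocking():
    # Both awaits yield control, so concurrent requests keep flowing
    await asyncio.sleep(0.1)
    async with httpx.AsyncClient() as client:
        response = await client.get("https://api.example.com/status")
    return response.json()
```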
The symptom is always the same: your API performs beautifully under light load, then response times explode non-linearly as traffic increases. The event loop becomes a bottleneck that no amount of horizontal scaling can fix.
Recognizing these patterns is step one. The next challenge is detecting when blocking actually occurs in a running application—before your users do.
Detecting Event Loop Blocking in Production
A blocked event loop is silent until it’s catastrophic. Your FastAPI application handles requests smoothly in development, then grinds to a halt under production load. The culprit: synchronous operations masquerading as async code. Detecting these bottlenecks before your users do requires deliberate instrumentation and a systematic approach to monitoring.
Enable asyncio Debug Mode
Python’s asyncio includes a built-in debug mode that logs warnings when coroutines block the event loop for too long. Enable it in your FastAPI application startup:
```python
import asyncio
import logging

from fastapi import FastAPI

logging.basicConfig(level=logging.WARNING)

app = FastAPI()


@app.on_event("startup")
async def enable_async_debugging():
    loop = asyncio.get_running_loop()
    loop.set_debug(True)
    loop.slow_callback_duration = 0.1  # Log callbacks taking >100ms
```

When a synchronous call blocks for longer than slow_callback_duration, asyncio logs a warning that identifies the slow callback or task and where it was created, pointing you at the offending handler. In production, set this threshold based on your latency SLAs—100ms catches egregious blockers without flooding your logs.
The debug mode also enables additional checks that catch common mistakes. It validates that coroutines are properly awaited, detects when callbacks are scheduled from the wrong thread, and warns about resources that aren’t properly closed. While these checks add overhead, they’re invaluable during staging deployments and load testing phases where you need maximum visibility into async behavior.
Build Request Latency Middleware
Debug mode catches blocking calls, but you need visibility into request-level patterns. Custom middleware can watch the event loop while each request is in flight and record how long it was stalled by synchronous work:
```python
import asyncio
import time

from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware


class EventLoopMonitorMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        loop = asyncio.get_running_loop()
        stalled = 0.0

        async def probe(interval: float = 0.05) -> None:
            # Each sleep that overshoots its interval means the event loop
            # was blocked and could not resume this coroutine on time.
            nonlocal stalled
            while True:
                start = loop.time()
                await asyncio.sleep(interval)
                stalled += max(0.0, loop.time() - start - interval)

        probe_task = asyncio.create_task(probe())
        wall_start = time.perf_counter()
        try:
            response = await call_next(request)
        finally:
            probe_task.cancel()
        wall_duration = time.perf_counter() - wall_start

        if stalled > 0.05:  # 50ms of observed loop stalls
            request.state.blocking_detected = True
            print(
                f"Blocking detected: {request.url.path} "
                f"wall={wall_duration:.3f}s blocked={stalled:.3f}s"
            )

        response.headers["X-Blocking-Time"] = f"{stalled:.4f}"
        return response
```

The probe coroutine can only resume when the event loop is free, so any overshoot beyond its sleep interval is time the loop spent blocked while the request was in flight. Export these measurements to your observability platform—spikes in blocking time correlate directly with degraded throughput.
Consider aggregating these measurements by endpoint and tracking percentiles rather than averages. A single endpoint with occasional 500ms blocks might not move your average significantly, but it devastates tail latency for affected users. Set up alerts on the 99th percentile of blocking time to catch intermittent issues before they become systemic problems.
Profile with py-spy for Hidden Blockers
Some blocking calls hide in third-party libraries or deeply nested code paths. The py-spy profiler attaches to running Python processes and samples the call stack without modifying your application:
```bash
pip install py-spy
py-spy top --pid 48291 --subprocesses
```

Run this against your Uvicorn workers during load testing. Functions spending significant time on the main thread outside of await statements are blocking candidates. Look for database drivers, file I/O, serialization libraries, and HTTP clients that don’t use async interfaces. Pay particular attention to JSON serialization of large payloads, DNS resolution calls, and logging handlers that write synchronously to disk.
💡 Pro Tip: Combine py-spy with asyncio.to_thread() wrapping. Profile first to identify blockers, then wrap only the specific calls that need it. Blanket thread offloading adds overhead and obscures the real problem.
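As a sketch of that targeted wrapping (asyncio.to_thread requires Python 3.9+; the report-rendering helper is hypothetical, standing in for whatever blocker py-spy surfaced):

```python
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()


def render_pdf_report(order_id: str) -> bytes:
    # Hypothetical synchronous, CPU- and disk-heavy helper identified by py-spy
    ...


@app.get("/reports/{order_id}")
async def get_report(order_id: str):
    # Only the proven blocker moves to a thread; the rest of the handler stays async
    pdf_bytes = await asyncio.to_thread(render_pdf_report, order_id)
    return Response(content=pdf_bytes, media_type="application/pdf")
```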
For continuous monitoring, integrate yappi or austin into your staging environment. These profilers distinguish between wall time and CPU time, making it clear whether slowdowns come from blocking I/O or CPU-bound work. The distinction matters because the remediation differs: blocking I/O benefits from thread offloading or async alternatives, while CPU-bound work requires process pools or architectural changes.
When analyzing profiler output, focus on functions that appear frequently in samples while the event loop should be idle. These represent stolen cycles that could have served other requests. Document the blocking calls you discover and their measured impact—this baseline helps prioritize remediation and validates fixes.
The instrumentation patterns above surface problems early. Once you know where blocking occurs, the fix often involves choosing the right async-compatible library—starting with your database layer.
Database Operations: The Right Way to Go Async
Database operations represent the most common source of blocking behavior in FastAPI applications. A single synchronous database call can stall your entire event loop, turning your async API into a bottleneck. Getting this right determines whether your application scales to thousands of concurrent requests or collapses under load. The difference between a well-tuned async database layer and a naive implementation can be orders of magnitude in throughput.
Choosing Your Database Driver
The decision between native async drivers and thread pool execution depends on your database and use case. Native async drivers provide direct integration with Python’s event loop, while thread pool execution wraps synchronous code to prevent blocking:
```python
# Native async approach with asyncpg (PostgreSQL)
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,
    max_overflow=10,
    pool_pre_ping=True,
    pool_recycle=3600,
)

AsyncSessionLocal = sessionmaker(
    engine, class_=AsyncSession, expire_on_commit=False
)


async def get_db():
    async with AsyncSessionLocal() as session:
        yield session
```

Native async drivers like asyncpg for PostgreSQL and asyncmy for MySQL provide true non-blocking I/O. They yield control back to the event loop while waiting for database responses, allowing other requests to proceed. This means a single worker can handle hundreds of concurrent database queries without spawning additional threads.
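As a usage sketch, assuming the dependency above and a hypothetical User model, an endpoint borrows a session through Depends:

```python
from fastapi import Depends, FastAPI
from sqlalchemy.ext.asyncio import AsyncSession

app = FastAPI()


@app.get("/users/{user_id}")
async def read_user(user_id: int, db: AsyncSession = Depends(get_db)):
    # The await suspends this handler while asyncpg waits on PostgreSQL,
    # freeing the event loop to serve other requests
    return await db.get(User, user_id)
```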
For databases without mature async drivers, or when integrating with legacy ORMs, use SQLAlchemy’s run_sync or Starlette’s thread pool:
```python
from sqlalchemy import select
from sqlalchemy.orm import Session
from starlette.concurrency import run_in_threadpool


async def get_legacy_data(db: Session):
    # Offload the blocking call to the thread pool
    result = await run_in_threadpool(db.execute, select(LegacyTable))
    return result.scalars().all()
```

💡 Pro Tip: Thread pool execution adds overhead (context switching, memory per thread). Reserve it for legacy integrations. For new projects, choose databases with native async support.
Connection Pool Sizing
Async applications handle more concurrent requests per process than sync counterparts. Your connection pool must accommodate this concurrency without becoming a bottleneck. Undersized pools create artificial chokepoints that negate async benefits entirely:
```python
# Calculate pool size based on expected concurrency
# Rule of thumb: pool_size = (workers * expected_concurrent_requests_per_worker) / safety_factor

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,        # Base connections maintained
    max_overflow=10,     # Additional connections under load
    pool_timeout=30,     # Seconds to wait for available connection
    pool_pre_ping=True,  # Verify connections before use
)
```

A FastAPI worker handling 100 concurrent requests with a pool size of 10 forces 90 requests to wait for connections. Monitor your pool_timeout exceptions in production—they indicate undersized pools. Consider your database server’s maximum connection limit when sizing pools across multiple workers; exceeding this limit causes connection failures that cascade into application-wide outages.
The pool_pre_ping option deserves attention. It validates connections before checkout, preventing errors from stale connections after network interruptions or database restarts. The slight latency cost pays dividends in reliability.
Non-Blocking Transactions
Transactions require careful handling to avoid holding connections longer than necessary. Long-running transactions consume pool resources and can create lock contention that ripples across your application:
```python
from sqlalchemy.ext.asyncio import AsyncSession


async def transfer_funds(
    db: AsyncSession,
    from_account_id: int,
    to_account_id: int,
    amount: float,
):
    async with db.begin():
        # All operations within this block are transactional
        from_account = await db.get(Account, from_account_id, with_for_update=True)
        to_account = await db.get(Account, to_account_id, with_for_update=True)

        if from_account.balance < amount:
            raise InsufficientFundsError()

        from_account.balance -= amount
        to_account.balance += amount
        # Commit happens automatically on block exit
```

The async with db.begin() context manager ensures the transaction completes or rolls back without manual commit calls. The with_for_update=True parameter acquires row-level locks asynchronously, preventing race conditions while allowing other queries to proceed. Exceptions within the block trigger automatic rollback, eliminating a common source of data corruption bugs.
Avoid performing external API calls or heavy computations inside transaction blocks. Every millisecond spent holding a transaction is a millisecond that connection is unavailable to other requests. Structure your code to minimize transaction duration:
```python
# Prepare data BEFORE starting the transaction
enriched_data = await fetch_from_external_api(user_id)
validated_data = validate_and_transform(enriched_data)

# Transaction block contains only database operations
async with db.begin():
    await db.execute(insert(Records).values(validated_data))
```

This pattern—prepare outside, execute inside—keeps transactions short and predictable. It also makes error handling cleaner since validation failures don’t require transaction rollback.
With your database layer properly async, the next challenge is handling external API calls and background tasks without compromising response times.
External API Calls and Background Tasks
External service calls represent one of the most common sources of latency in production APIs. A single slow third-party endpoint can cascade into request timeouts and degraded user experience. The solution lies in proper async HTTP clients, defensive timeout strategies, and knowing when to offload work entirely.
Truly Async HTTP with httpx
The popular requests library blocks the event loop entirely—even when called from an async endpoint. This happens because requests uses synchronous socket operations under the hood, meaning your carefully crafted async endpoint becomes a bottleneck the moment it reaches out to an external service. For async-compatible HTTP calls, httpx provides a drop-in replacement with native async support.
```python
import httpx
from fastapi import HTTPException


class PaymentClient:
    def __init__(self):
        self.base_url = "https://api.stripe.com/v1"
        self.timeout = httpx.Timeout(10.0, connect=5.0)

    async def charge(self, amount: int, token: str) -> dict:
        async with httpx.AsyncClient(timeout=self.timeout) as client:
            response = await client.post(
                f"{self.base_url}/charges",
                data={"amount": amount, "source": token},
                headers={"Authorization": "Bearer sk_live_xxx"},
            )
            if response.status_code != 200:
                raise HTTPException(status_code=502, detail="Payment processing failed")
            return response.json()
```

For high-throughput services, create a shared client instance with connection pooling rather than instantiating per-request. Creating a new client for each request incurs TCP handshake overhead and prevents connection reuse—a significant performance penalty when you’re making hundreds of calls per second:
```python
from contextlib import asynccontextmanager

import httpx

http_client: httpx.AsyncClient | None = None


@asynccontextmanager
async def lifespan(app):
    global http_client
    http_client = httpx.AsyncClient(
        limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
        timeout=httpx.Timeout(30.0, connect=5.0),
    )
    yield
    await http_client.aclose()
```

The max_connections limit prevents your service from overwhelming downstream APIs during traffic spikes, while max_keepalive_connections balances memory usage against connection reuse benefits.
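A sketch of an endpoint reusing that shared client, assuming the lifespan above is registered via FastAPI(lifespan=lifespan) and an illustrative internal inventory URL:

```python
from fastapi import FastAPI

app = FastAPI(lifespan=lifespan)


@app.get("/inventory/{sku}")
async def get_inventory(sku: str):
    # Reuses pooled keep-alive connections instead of opening a new one per request
    response = await http_client.get(f"https://inventory.internal/items/{sku}")
    response.raise_for_status()
    return response.json()
```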
Retry Strategies with Exponential Backoff
Network calls fail. DNS resolution times out, load balancers return 503s during deployments, and rate limits trigger 429 responses. Production code expects this and handles it gracefully through structured retry logic.
```python
import asyncio
from functools import wraps
from typing import Type

import httpx


def async_retry(
    max_attempts: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[Type[Exception], ...] = (httpx.RequestError,),
):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_attempts):
                try:
                    return await func(*args, **kwargs)
                except exceptions as e:
                    last_exception = e
                    if attempt < max_attempts - 1:
                        delay = base_delay * (2 ** attempt)
                        await asyncio.sleep(delay)
            raise last_exception
        return wrapper
    return decorator
```

The exponential backoff pattern (1s, 2s, 4s) prevents retry storms from compounding an already struggling service. Consider adding jitter—a small random delay—to prevent synchronized retries from multiple clients hitting the recovering service simultaneously.
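One way to add that jitter is a small helper the decorator’s wrapper can call in place of the fixed asyncio.sleep(delay) line (a sketch, not part of the original decorator):

```python
import asyncio
import random


async def backoff_with_jitter(attempt: int, base_delay: float = 1.0) -> None:
    # Full jitter: wait a random amount between zero and the exponential ceiling,
    # so clients that failed at the same moment do not retry in lockstep
    ceiling = base_delay * (2 ** attempt)
    await asyncio.sleep(random.uniform(0, ceiling))
```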
💡 Pro Tip: Set aggressive connect timeouts (2-5 seconds) but more generous read timeouts. A service that accepts your connection quickly but responds slowly is behaving normally under load—one that won’t connect at all is likely down.
BackgroundTasks vs Dedicated Queues
FastAPI’s BackgroundTasks runs work after the response returns, but it still executes in the same process. For fire-and-forget operations that complete quickly, it works well:
```python
from fastapi import BackgroundTasks


async def send_confirmation_email(email: str, order_id: str):
    async with httpx.AsyncClient() as client:
        await client.post(
            "https://api.sendgrid.com/v3/mail/send",
            json={
                "to": email,
                "template_id": "order_confirmed",
                "data": {"order_id": order_id},
            },
        )


@app.post("/orders")
async def create_order(order: OrderCreate, background_tasks: BackgroundTasks):
    result = await process_order(order)
    background_tasks.add_task(send_confirmation_email, order.email, result.id)
    return result
```

The key limitation: if your server restarts or crashes before the background task completes, that work is lost. There’s no persistence, no retry mechanism, and no visibility into what’s queued.
Switch to Celery, Dramatiq, or ARQ when you need:
- Task persistence across server restarts
- Retries with dead-letter handling for failed jobs
- Rate limiting to protect downstream services
- Tasks exceeding 30 seconds that risk blocking workers
- Distributed execution across multiple worker processes
The boundary is clear: if losing the task on a deploy or crash is acceptable and execution takes under a few seconds, use BackgroundTasks. For anything else, invest in proper task infrastructure. The operational complexity of running Redis and worker processes pays dividends when your 3 AM pages stop being about lost order confirmations.
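For the durable-queue path, here is a minimal sketch of the same confirmation email as an arq job (the module name, Redis host, and retry count are illustrative; Celery and Dramatiq follow a similar shape):

```python
# tasks.py — start the worker with: arq tasks.WorkerSettings
import httpx
from arq import create_pool
from arq.connections import RedisSettings


async def send_confirmation_email(ctx, email: str, order_id: str):
    # Runs in a separate worker process; survives API restarts and is retried on failure
    async with httpx.AsyncClient() as client:
        await client.post(
            "https://api.sendgrid.com/v3/mail/send",
            json={"to": email, "template_id": "order_confirmed",
                  "data": {"order_id": order_id}},
        )


class WorkerSettings:
    functions = [send_confirmation_email]
    redis_settings = RedisSettings(host="redis.internal")
    max_tries = 5  # attempts before arq gives up on the job


# In the API process (create the pool once, e.g. in the lifespan handler):
async def enqueue_confirmation(email: str, order_id: str) -> None:
    pool = await create_pool(RedisSettings(host="redis.internal"))
    await pool.enqueue_job("send_confirmation_email", email, order_id)
```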
With external dependencies handled defensively, the next challenge is organizing your codebase to maintain this performance discipline as the application grows.
Structuring FastAPI for Scale
A well-architected FastAPI application handles thousands of requests per second while remaining maintainable. Poor structure, however, creates bottlenecks that no amount of horizontal scaling fixes. The patterns you establish early determine whether your codebase scales gracefully or collapses under its own weight.

Router Organization for Large Applications
Flat route files become unmanageable beyond a dozen endpoints. Organize routers by domain, keeping related functionality together while maintaining clear boundaries.
```python
from fastapi import APIRouter, Depends

from app.api.v1.endpoints import users, orders, products, webhooks

api_router = APIRouter()

api_router.include_router(
    users.router,
    prefix="/users",
    tags=["users"],
)
api_router.include_router(
    orders.router,
    prefix="/orders",
    tags=["orders"],
    dependencies=[Depends(require_authentication)],
)
api_router.include_router(
    products.router,
    prefix="/products",
    tags=["products"],
)
api_router.include_router(
    webhooks.router,
    prefix="/webhooks",
    tags=["webhooks"],
    include_in_schema=False,  # Hide internal endpoints from docs
)
```

Apply shared dependencies at the router level rather than repeating them on every endpoint. This reduces boilerplate and ensures consistent security policies across related routes.
Dependency Injection Without Bottlenecks
FastAPI’s dependency injection system is powerful but creates subtle performance issues when misused. The most common mistake: recreating expensive resources on every request.
```python
from functools import lru_cache

from app.core.config import Settings
from app.services.payment import PaymentClient


@lru_cache
def get_settings() -> Settings:
    """Settings loaded once and cached for application lifetime."""
    return Settings()


# Wrong: Creates new client per request
def get_payment_client_slow() -> PaymentClient:
    return PaymentClient(api_key=get_settings().payment_api_key)


# Right: Reuse client with connection pooling
_payment_client: PaymentClient | None = None


async def get_payment_client() -> PaymentClient:
    global _payment_client
    if _payment_client is None:
        _payment_client = PaymentClient(
            api_key=get_settings().payment_api_key,
            pool_size=20,
        )
    return _payment_client
```

For database sessions, yield dependencies properly to ensure cleanup happens even when requests fail:
```python
from collections.abc import AsyncGenerator

from fastapi import Depends
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker


async def get_db_session(
    session_factory: async_sessionmaker = Depends(get_session_factory),
) -> AsyncGenerator[AsyncSession, None]:
    async with session_factory() as session:
        try:
            yield session
            await session.commit()
        except Exception:
            await session.rollback()
            raise
```

Lifespan Management for Shared Resources
FastAPI’s lifespan context manager replaced the deprecated on_startup and on_shutdown events. Use it to initialize connection pools, load ML models, and establish connections to external services.
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.db.session import create_engine, dispose_engine
from app.cache.redis import redis_pool
from app.services.search import SearchIndex


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: Initialize shared resources
    app.state.db_engine = create_engine(pool_size=25, max_overflow=10)
    app.state.redis = await redis_pool.connect("redis://cache.internal:6379")
    app.state.search_index = await SearchIndex.load("/models/search_v2.bin")

    yield  # Application runs here

    # Shutdown: Clean up resources
    await app.state.search_index.close()
    await app.state.redis.close()
    await dispose_engine(app.state.db_engine)


app = FastAPI(lifespan=lifespan, title="Orders API")
```

Access these shared resources through request.app.state in your endpoints, avoiding the overhead of repeated initialization.
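Continuing the snippet above, a sketch of an endpoint reading from request.app.state (the query method on the search index is illustrative):

```python
from fastapi import Request


@app.get("/search")
async def search(q: str, request: Request):
    # Shared resources created once in the lifespan handler; no per-request setup
    index = request.app.state.search_index
    return await index.query(q)
```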
💡 Pro Tip: Store immutable configuration in app.state but never store request-specific data there. Multiple concurrent requests share the same application instance, and mutable shared state causes race conditions.
Structure your project directory to mirror these patterns:
```
app/
├── api/v1/endpoints/
├── core/config.py
├── db/session.py
├── dependencies.py
├── models/
├── services/
└── main.py
```

With proper structure and resource management in place, your FastAPI application handles growth without architectural rewrites. The next challenge is deploying these applications to handle production traffic levels.
Deployment Strategies for High-Traffic APIs
Your FastAPI application performs flawlessly in development, but production demands careful orchestration of workers, containers, and infrastructure. The right deployment configuration transforms a capable API into one that handles thousands of concurrent requests without breaking a sweat. Understanding the trade-offs between different deployment models helps you choose the architecture that matches your scaling requirements and operational constraints.
Uvicorn Workers and the Gunicorn Question
Uvicorn serves as FastAPI’s preferred ASGI server, but running a single worker process leaves performance on the table. For CPU-bound workloads or when you need process-level isolation, wrap Uvicorn with Gunicorn:
```python
import multiprocessing

# Workers = (2 x CPU cores) + 1 for I/O-bound APIs
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
bind = "0.0.0.0:8000"
keepalive = 120
timeout = 30
graceful_timeout = 30
max_requests = 10000
max_requests_jitter = 1000
```

The max_requests setting prevents memory leaks from accumulating by recycling workers after handling a set number of requests. The jitter value randomizes restarts across workers, preventing thundering herd scenarios where all workers recycle simultaneously. For purely async workloads where you trust your code to never block, running Uvicorn directly with --workers provides lower latency by eliminating the Gunicorn intermediary.
When choosing between Gunicorn and native Uvicorn workers, consider your debugging needs. Gunicorn provides superior worker management, including automatic restart of failed workers and detailed process monitoring. Native Uvicorn workers offer simpler deployment but require external process supervision for production reliability.
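Assuming the config above lives in gunicorn_conf.py and the application object is app.main:app, the two launch styles look like this:

```bash
# Gunicorn managing Uvicorn worker processes
gunicorn app.main:app -c gunicorn_conf.py

# Native Uvicorn workers, supervised externally (systemd, Kubernetes, etc.)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4
```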
Serverless Deployment with Mangum
AWS Lambda offers automatic scaling without infrastructure management. The Mangum adapter translates Lambda events into ASGI requests:
```python
from fastapi import FastAPI
from mangum import Mangum

app = FastAPI()


@app.get("/health")
async def health_check():
    return {"status": "healthy"}


# Lambda handler
handler = Mangum(app, lifespan="off")
```

💡 Pro Tip: Set lifespan="off" unless you genuinely need startup/shutdown events. Lambda’s execution model doesn’t align well with persistent lifespan state, and disabling it reduces cold start times by 50-100ms.
Cold starts remain Lambda’s Achilles heel for latency-sensitive APIs. Provisioned concurrency keeps instances warm, but at a cost. Reserve it for endpoints where p99 latency matters. For background processing endpoints or internal APIs where occasional latency spikes are acceptable, on-demand scaling provides significant cost savings. Consider splitting your API surface: latency-critical endpoints on provisioned concurrency, everything else on standard scaling.
Kubernetes Container Sizing
Container resource allocation directly impacts throughput. Undersized containers throttle performance; oversized ones waste cluster resources:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fastapi-app
  template:
    metadata:
      labels:
        app: fastapi-app
    spec:
      containers:
        - name: api
          image: myregistry.io/fastapi-app:v1.2.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 20
```

Start with 2-4 Uvicorn workers per container, matching the CPU limit. A container with a 1-core limit runs optimally with 2 workers—enough parallelism to hide I/O latency without context-switch overhead. Memory requests should account for your worker count; each Uvicorn worker typically consumes 100-200MB depending on your application’s dependencies and in-memory caching.
Horizontal Pod Autoscaler (HPA) scales replicas based on CPU or custom metrics. Target 70% CPU utilization as your scaling threshold; this leaves headroom for traffic spikes while maintaining cost efficiency. For APIs with unpredictable traffic patterns, consider implementing KEDA (Kubernetes Event-Driven Autoscaling) to scale based on queue depth or request rate rather than raw CPU utilization.
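A minimal HPA manifest targeting that Deployment at the 70% threshold might look like this (replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi-app
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```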
With your deployment architecture defined, you need visibility into whether these configurations actually deliver. The next section covers the metrics that reveal your API’s true performance characteristics.
Measuring What Matters: API Performance Metrics
Response time averages lie. A 200ms average tells you nothing about the user experience when 5% of your requests take 3 seconds. Production monitoring for FastAPI applications demands metrics that reveal the full picture of API health and user impact.
Beyond Average Response Time
Three metrics form the foundation of meaningful API performance monitoring:
Percentile latencies (p95, p99) expose the tail of your response time distribution. When your p99 hits 2 seconds while your average sits at 150ms, you’ve identified that 1% of users experience unacceptable performance. This 1% often represents your most engaged users—those making the most requests.
Throughput (requests per second) establishes your baseline capacity. Track this alongside latency to distinguish between “slow because overloaded” and “slow because broken.” A throughput drop with stable latency points to upstream issues; stable throughput with rising latency indicates internal bottlenecks.
Error rates by status code provide early warning signals. A spike in 503s suggests resource exhaustion. Growing 422s indicate API contract violations, possibly from a misbehaving client. Categorize errors to route them to the right team.
Implementing OpenTelemetry Tracing
OpenTelemetry provides vendor-neutral instrumentation that traces requests across service boundaries. For FastAPI applications, the OpenTelemetry Python SDK auto-instruments incoming HTTP requests, database queries, and outgoing HTTP calls.
Traces reveal where time goes within a request. When your p99 spikes, traces show whether the database query, external API call, or your business logic caused the delay. Without distributed tracing, debugging production latency issues becomes archaeological guesswork.
Export traces to your observability backend—Jaeger, Zipkin, or commercial platforms like Datadog and Honeycomb all accept OpenTelemetry data. The instrumentation stays constant; only the exporter configuration changes.
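A minimal setup sketch using the opentelemetry-sdk, opentelemetry-exporter-otlp, and opentelemetry-instrumentation-fastapi packages (the service name and collector endpoint are placeholders):

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = FastAPI()

# Describe this service and ship spans to an OTLP-compatible collector
provider = TracerProvider(resource=Resource.create({"service.name": "orders-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument incoming requests: spans carry route, status code, and timing
FastAPIInstrumentor.instrument_app(app)
```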
Alerting Thresholds That Catch Issues Early
Effective alerting balances sensitivity against noise. Start with these thresholds and tune based on your traffic patterns:
- p99 latency exceeding 3x your p50 indicates emerging performance degradation
- Error rate above 1% over a 5-minute window warrants investigation
- Throughput drop of 20% compared to the same time window last week suggests upstream problems or deployment issues
Alert on symptoms, not causes. “High database connection wait time” tells you what’s broken; “high p99 latency” tells you users are affected. Both matter, but user-facing metrics trigger pages.
💡 Pro Tip: Create dashboard views for different contexts—real-time for incident response, hourly for capacity planning, weekly for trend analysis. The same metrics serve different purposes at different time scales.
With observability in place, you have the foundation to measure the impact of every optimization in your FastAPI application—and catch regressions before users report them.
Key Takeaways
- Audit every async endpoint for blocking calls using asyncio debug mode and replace synchronous libraries with async alternatives
- Size your database connection pools based on concurrent request volume, not worker count, and use async drivers when available
- Configure Uvicorn workers equal to CPU cores for compute-bound workloads, or use a single worker with higher concurrency for I/O-bound APIs