How Lyft Uses ML to Make 100 Million Predictions A Day


Disclaimer: The details in this post have been derived from the articles/videos shared online by the Lyft Engineering Team. All credit for the technical details goes to the Lyft Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Hundreds of millions of machine learning inferences power decisions at Lyft every day. These aren’t back-office batch jobs. They’re live, high-stakes predictions driving every corner of the experience: pricing a ride, flagging fraud, predicting ETAs, and deciding which driver gets which incentive.

Each inference runs under pressure, with a latency budget of single-digit milliseconds, and the aggregate traffic adds up to hundreds of millions of requests per day. Dozens of teams, each with different needs and models, push updates on their own schedules. The challenge is staying flexible without falling apart.

Real-time ML at scale breaks into two kinds of problems: the systems problem of serving predictions within tight latency budgets under heavy traffic, and the organizational problem of letting dozens of teams ship and evolve their models independently.

Early on, Lyft leaned on a shared monolithic service to serve ML models across the company. However, the monolith created more friction than flexibility. Teams couldn’t upgrade libraries independently. Deployments clashed and ownership blurred. Small changes in one model risked breaking another, and incident investigation turned into detective work.

The need was clear: build a serving platform that makes model deployment feel as natural as writing the model itself. It had to be fast, flexible, and team-friendly without hiding the messy realities of inference at scale.

In this article, we’ll look at the architecture Lyft built to meet this requirement and the challenges the team faced along the way.

Architecture and System Components

LyftLearn Serving doesn’t reinvent the wheel. It slots neatly into the microservices foundation already powering the rest of Lyft. The goal wasn’t to build a bespoke ML serving engine from scratch. It was to extend proven infrastructure with just enough intelligence to handle real-time inference, without bloating the system or boxing teams in.

At the core is a dedicated microservice: lightweight, composable, and self-contained. Each team runs its own instance, backed by Lyft's service mesh and container orchestration stack. The result: fast deploys, predictable behavior, and clean ownership boundaries.

Let’s break down this architecture flow diagram:

HTTP Serving Layer

Every request to a LyftLearn Serving service hits an HTTP endpoint first. This interface is built using Flask, a minimalist Python web framework. While Flask alone wouldn’t scale to production workloads, it’s paired with Gunicorn, a pre-fork WSGI server designed for high concurrency.

To make this stack production-grade, Lyft optimized the setup to align with Envoy, the service mesh that sits in front of all internal microservices. These optimizations ensure:

This layer keeps the HTTP interface thin and efficient, just enough to route requests and parse payloads.
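
To give a feel for the kind of tuning involved, here is a minimal gunicorn.conf.py sketch. The values and comments are illustrative assumptions, not Lyft's production settings.

# gunicorn.conf.py -- illustrative values, not Lyft's production settings
bind = "0.0.0.0:8080"     # port the Envoy sidecar proxies traffic to
workers = 4               # pre-forked worker processes
worker_class = "gthread"  # threaded workers so keep-alive connections are honored
threads = 8               # request-handling threads per worker
keepalive = 75            # keep idle connections open longer than the proxy's idle timeout
timeout = 30              # recycle workers stuck well past the latency budget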

Core Serving Library

This is where the real logic lives. The LyftLearn Serving library handles the heavy lifting:

This library is the common runtime used across all teams. It centralizes the “platform contract” so individual teams don’t need to re-implement the basics. But it doesn’t restrict customization.

Custom ML/Predict Code

The core library is dependency-injected with team-owned inference logic. Every team provides two Python functions:

def load(self, file: str) -> Any:
    # Custom deserialization logic for the trained model
    ...

def predict(self, features: Any) -> Any:
    # Custom inference logic using the loaded model
    ...

Source: Lyft Engineering Blog

This design keeps the platform flexible. A team can use any model structure, feature format, or business logic, as long as it adheres to the basic interface. This works because the prediction path is decoupled from the transport and orchestration layers.
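
As a rough illustration of that decoupling, the library's side of the contract might look something like the sketch below. The ModelRunner name and structure are our assumptions for illustration, not LyftLearn Serving's actual internals.

from typing import Any

class ModelRunner:
    """Illustrative wrapper: the platform owns loading, routing, and
    monitoring; the team supplies only the load() and predict() hooks."""

    def __init__(self, user_model: Any, model_path: str):
        self.user_model = user_model
        # Delegate deserialization to the team-owned load() hook at startup.
        self.model = user_model.load(model_path)

    def infer(self, features: Any) -> Any:
        # The real library would also record metrics, log features, and
        # tag the model version around this call.
        return self.user_model.predict(features)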

Third-Party ML Library Support

LyftLearn Serving makes no assumptions about the ML framework. Whether the model uses TensorFlow, PyTorch, LightGBM, XGBoost, or a home-grown solution, it doesn’t matter.

As long as the model loads and predicts through Python, it’s compatible. This lets teams:

Framework choice becomes a modeler’s decision, not a platform constraint.
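
For example, a team shipping a gradient-boosted model could satisfy the interface with a few lines of XGBoost. The class name and feature handling below are hypothetical, shown only to illustrate that the framework choice stays inside the team's own code.

from typing import Any

import pandas as pd
import xgboost as xgb

class RideEtaModel:  # hypothetical team-owned model class
    def load(self, file: str) -> Any:
        # Deserialize the trained booster from the model artifact path.
        self.booster = xgb.Booster()
        self.booster.load_model(file)
        return self.booster

    def predict(self, features: Any) -> Any:
        # Wrap incoming features in a DMatrix and run inference.
        dmatrix = xgb.DMatrix(pd.DataFrame(features))
        return self.booster.predict(dmatrix).tolist()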

Integration with Lyft Infrastructure

The microservice integrates deeply with Lyft’s existing production stack:

This alignment avoids duplicating effort. Teams inherit baseline reliability, visibility, and security from the rest of Lyft’s infrastructure, without needing to configure it themselves.

Isolation and Ownership Principles

When dozens of teams deploy and serve ML models independently, shared infrastructure quickly becomes shared pain. One broken deploy can block five others, a single library upgrade triggers weeks of coordination, and debugging turns into blame-shifting. That’s what LyftLearn Serving was built to avoid.

The foundation of its design is hard isolation by repository, not as a policy, but as a technical boundary enforced at every layer of the stack.

One Repo, One Service, One Owner

Every team using LyftLearn Serving gets its own GitHub repository. This repo isn’t just for code; it defines the entire model-serving lifecycle:

There’s no central repository to manage and no shared runtime to coordinate. If a team needs five models, they can choose to host them in one repo or split them across five.

Independent Deploy Pipelines

Each repo comes with its own deploy pipeline, fully decoupled from the others. This includes:

If one team pushes broken code, it doesn’t affect anyone else. If another needs to hotfix a bug, they can deploy instantly. Isolation removes the need for cross-team coordination during high-stakes production changes.

Runtime Isolation via Kubernetes and Envoy

LyftLearn Serving runs on top of Lyft’s Kubernetes and Envoy infrastructure. The platform assigns each team:

This ensures that runtime faults, whether memory leaks, high CPU usage, or a bad deployment, stay contained within the owning team’s service. A surge in traffic to one team’s model won’t starve resources for another, and a crash in one container doesn’t bring down the wider serving infrastructure.

Tooling: Config Generator

Getting a model into production shouldn’t mean learning multiple configuration formats, wiring up runtime secrets, or debugging broken deploys caused by missing database entries.

To streamline this, LyftLearn Serving includes a Config Generator: a bootstrapping tool that wires up everything needed to go from zero to a working ML serving microservice. Spinning up a new LyftLearn Serving instance involves stitching together pieces from across the infrastructure stack:

Expecting every ML team to hand-craft this setup would be a recipe for drift, duplication, and onboarding delays. The config generator collapses that complexity into a few guided inputs.

The generator runs on Yeoman, a scaffolding framework commonly used for bootstrapping web projects, but customized here for Lyft’s internal systems.

A new team running the generator walks through a short interactive session:

The tool then emits a fully-formed GitHub repo with:

Once the repo is generated and the code is committed, the team gets a functioning microservice, ready to accept models, run inference, and serve real traffic inside Lyft’s mesh. Teams can iterate on the model logic immediately, without first untangling infrastructure.

Model Self-Testing System

Served models can drift when a new dependency sneaks in and subtly changes output behavior. For example, a training script gets updated, but no one notices the prediction shift. Or a container upgrade silently breaks deserialization. By the time someone spots the drop in performance, millions of bad inferences have already shipped.

To fight this, LyftLearn Serving introduces a built-in Model Self-Testing System. It’s a contract embedded inside the model itself, designed to verify behavior at the two points that matter most: before merge and after deploy.

Every model class defines a test_data property: structured sample inputs with expected outputs:

class SampleModel(TrainableModel):
    @property
    def test_data(self) -> pd.DataFrame:
        return pd.DataFrame([
            [[1, 0, 0], 1],  # Input should predict close to 1
            [[1, 1, 0], 1]
        ], columns=["input", "score"])

Source: Lyft Engineering Blog

This isn’t a full dataset. It’s a minimal set of hand-picked examples that act as canaries. If a change breaks expected behavior on these inputs, something deeper is likely wrong. The test data travels with the model binary and becomes part of the serving lifecycle.

Two checkpoints that matter are as follows:

During Deployment Runtime

After a model loads inside a LyftLearn Serving instance, it immediately runs predictions on its test_data. The results:

This catches subtle breakages caused by environment mismatches. For example, a model trained in Python 3.8 but deployed into a Python 3.10 container with incompatible dependencies.
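
Conceptually, the check amounts to replaying test_data through predict() and comparing the results against the expected scores. Here is a minimal sketch, assuming the column names from the sample above and a tolerance chosen by the model owner; the function name and error handling are illustrative, not Lyft's actual code.

import numpy as np

def run_self_test(model, tolerance: float = 0.1) -> None:
    # Replay the model's embedded examples through predict().
    data = model.test_data
    expected = data["score"].to_numpy(dtype=float)
    actual = np.asarray(model.predict(data["input"].tolist()), dtype=float)
    # Fail loudly if outputs drift beyond the acceptable delta.
    if not np.allclose(actual, expected, atol=tolerance):
        raise RuntimeError(f"Model self-test failed: expected {expected}, got {actual}")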

During Pull Requests

When a developer opens a PR in the model repo, CI kicks in. It performs the following activities:

If the outputs shift beyond an acceptable delta, the PR fails, even if the code compiles and the service builds cleanly. The diagram below shows a typical development flow:

Inference Request Lifecycle

A real-time inference system lives in the milliseconds between an HTTP request and a JSON response. That tiny window holds a lot more than model math. It’s where routing, validation, prediction, logging, and monitoring converge.

LyftLearn Serving keeps the inference path slim but structured. Every request follows a predictable, hardened lifecycle that allows flexibility without sacrificing control.

Here’s a step-by-step look at how a request gets served:
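
The handler at the center of that path can be sketched roughly as follows. The route shape, registry, and response fields are assumptions for illustration, not LyftLearn Serving's actual API, and predict() is assumed to return a JSON-serializable value.

from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_REGISTRY: dict = {}  # hypothetical: model_id -> loaded model, populated at startup

@app.route("/infer/<model_id>", methods=["POST"])
def infer(model_id: str):
    payload = request.get_json(force=True)           # parse the JSON request body
    model = MODEL_REGISTRY[model_id]                 # route to the requested model
    prediction = model.predict(payload["features"])  # team-owned inference logic
    # Logging and metrics (latency, model version, features) would be emitted here.
    return jsonify({"model_id": model_id, "prediction": prediction})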

Conclusion

Serving machine learning models in real time isn’t just about throughput or latency. It’s about creating systems that teams can trust, evolve, and debug without friction. 

LyftLearn Serving didn’t emerge from a clean slate or greenfield design. It was built under pressure to scale, to isolate, and to keep dozens of teams moving fast without stepping on each other’s toes.

Several lessons surfaced along the way, and they’re worth understanding:

LyftLearn Serving is still evolving, but its foundations hold. It doesn’t try to hide complexity; it isolates it, and it enforces a contract around how models behave in production.

References:

