Inside Airbnb’s AI-Powered Pipeline to Migrate Tests: Months of Work in Days

Disclaimer: The details in this post have been derived from the articles/videos shared online by the Airbnb Engineering Team. All credit for the technical details goes to the Airbnb Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Code migrations are usually a slow affair. Dependencies change, frameworks evolve, and teams get stuck rewriting thousands of lines that don’t even change product behavior. 

That was the situation at Airbnb. 

Thousands of React test files still relied on Enzyme, a tool that hadn’t kept up with modern React patterns. The goal was clear: move everything to React Testing Library (RTL). However, with over 3,500 files in scope, the effort was estimated at roughly 1.5 years of manual rewrites.

Instead, the team finished it in six weeks.

The turning point was the use of AI, specifically Large Language Models (LLMs), not just as assistants, but as core agents in an automated migration pipeline. By breaking the work into structured, per-file steps, injecting rich context into prompts, and systematically tuning feedback loops, the team transformed what looked like a long, manual slog into a fast, scalable process.

This article unpacks how that migration happened. It covers the structure of the automation pipeline, the trade-offs behind prompt engineering vs. brute-force retries, the methods used to handle complex edge cases, and the results that followed. 


The Need for Migration

Enzyme, adopted in 2015, provided fine-grained access to the internal structure of React components. This approach matched earlier versions of React, where testing internal state and component hierarchy was a common pattern.

By 2020, Airbnb had shifted all new test development to React Testing Library (RTL). 

RTL encourages testing components from the perspective of how users interact with them, focusing on rendered output and behavior, not implementation details. This shift reflects modern React testing practices, which prioritize maintainability and resilience to refactoring.
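
To make the contrast concrete, here is a minimal, hypothetical Counter example of ours (not from Airbnb’s codebase): the Enzyme test asserts on internal component state, while the RTL version asserts only on what the user can see and do.

// Counter.enzyme.test.jsx: Enzyme reaches into implementation details
// of a hypothetical class-based Counter component.
import React from 'react';
import { shallow } from 'enzyme';
import Counter from './Counter';

it('increments the count', () => {
  const wrapper = shallow(<Counter />);
  wrapper.find('button').simulate('click');
  expect(wrapper.state('count')).toBe(1); // internal state: brittle under refactors
});

// Counter.rtl.test.jsx: RTL asserts only on rendered output and behavior.
// (The toBeInTheDocument matcher comes from @testing-library/jest-dom.)
import { render, screen, fireEvent } from '@testing-library/react';
import Counter from './Counter';

it('increments the count', () => {
  render(<Counter />);
  fireEvent.click(screen.getByRole('button', { name: /increment/i }));
  expect(screen.getByText(/count: 1/i)).toBeInTheDocument();
});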

However, thousands of existing test files at Airbnb were still using Enzyme, and migrating them posed several challenges: the two frameworks follow fundamentally different testing philosophies, so there was no simple one-to-one mapping between their APIs, and a manual rewrite of the roughly 3,500 files in scope was estimated to take about 1.5 years of engineering time.

The migration was necessary to standardize testing across the codebase and support future React versions, but it had to be automated to be feasible.

Migration Strategy and Proof of Concept

The first indication that LLMs could handle this kind of migration came during a 2023 internal hackathon. A small team tested whether a large language model could convert Enzyme-based test files to RTL. Within days, the prototype successfully migrated hundreds of files. The results were promising in terms of accuracy as well as speed.

That early success laid the groundwork for a full-scale solution. In 2024, the engineering team formalized the approach into a scalable migration pipeline. The goal was clear: automate the transformation of thousands of test files, with minimal manual intervention, while preserving test intent and coverage.

To get there, the team broke the migration process into discrete, per-file steps that could be run independently and in parallel. Each step handled a specific task, like replacing Enzyme syntax, fixing Jest assertions, or resolving lint and TypeScript errors. When a step failed, the system invoked an LLM to rewrite the file using contextual information.

This modular structure made the pipeline easy to debug, retry, and extend. More importantly, it made it possible to run migrations across hundreds of files concurrently, accelerating throughput without sacrificing quality.

Pipeline Design and Techniques

Here are the key components of the pipeline design and the various techniques involved:

1 - Step-Based Workflow

To scale migration reliably, the team treated each test file as an independent unit moving through a step-based state machine. This structure enforced validation at every stage, ensuring that transformations passed real checks before advancing.

Each file advanced through the pipeline only if the current step succeeded. If a step failed, the system paused progression, invoked an LLM to refactor the file based on the failure context, and then re-validated before continuing.

Key stages in the workflow included:

- Enzyme refactor: converting Enzyme rendering and assertion calls to their RTL equivalents.
- Jest fixes: running the test suite and repairing failing assertions.
- Lint and TypeScript fixes: resolving style and type errors introduced by the rewrite.

This approach worked for the following reasons:

- Each step produced a clear pass/fail signal, so problems surfaced immediately instead of compounding downstream.
- Failures were isolated to a single concern, which kept the LLM’s fixes targeted.
- Files moved through the pipeline independently, so hundreds could be processed in parallel.

This structured approach provided a foundation for automation to succeed at scale. It also set up the necessary hooks for advanced retry logic, context injection, and real-time debugging later in the pipeline.
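
A minimal sketch of what such a per-file state machine might look like, assuming hypothetical validate and llmRewrite helpers (this is our illustration, not Airbnb’s actual code):

// Illustrative per-file, step-based pipeline. llmRewrite is a hypothetical
// wrapper around an LLM API call; step names follow the stages listed above.
import { promises as fs } from 'fs';

interface StepResult { ok: boolean; errors: string; }

interface Step {
  name: string; // e.g. 'enzyme-refactor', 'fix-jest', 'fix-lint-and-ts'
  validate: (source: string) => Promise<StepResult>; // run tests, lint, or tsc
}

declare function llmRewrite(
  source: string, stepName: string, errors: string,
): Promise<string>;

async function migrateFile(filePath: string, steps: Step[]): Promise<void> {
  let source = await fs.readFile(filePath, 'utf8');
  for (const step of steps) {
    let result = await step.validate(source);
    if (!result.ok) {
      // Pause progression: rewrite with the failure context, then re-validate
      // before advancing (retry limits are discussed in the next section).
      source = await llmRewrite(source, step.name, result.errors);
      result = await step.validate(source);
      if (!result.ok) throw new Error(`${step.name} failed for ${filePath}`);
    }
    await fs.writeFile(filePath, source, 'utf8'); // checkpoint after each step
  }
}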

2 - Retry Loops and Dynamic Prompting

Initial experiments showed that deep prompt engineering went only so far.

Instead of obsessing over the perfect prompt, the team leaned into a more pragmatic solution: automated retries with incremental context updates. The idea was simple. If a migration step failed, try again with better feedback until it passed or hit a retry limit.

At each failed step, the system fed the LLM:

- The current version of the test file.
- The specific validation errors from the failed step, such as failing Jest output, lint violations, or TypeScript errors.

This dynamic prompting approach allowed the model to refine its output based on concrete failures, not just static instructions. Instead of guessing at improvements, the model had specific reasons why the last version didn’t pass.

Each step ran inside a loop runner, which retried the operation up to a configurable maximum. This was especially effective for simple to mid-complexity files, where small tweaks (like fixing an import, renaming a variable, or adjusting test structure) often resolved the issue.
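
A simplified version of that loop runner might look like the following, where callLlm is an assumed wrapper around the model API (our sketch, not Airbnb’s implementation):

// Illustrative loop runner: retry a failing step up to a configurable
// maximum, feeding the latest validation errors back into the prompt.
declare function callLlm(prompt: string): Promise<string>; // hypothetical LLM wrapper

async function runWithRetries(
  initialSource: string,
  validate: (source: string) => Promise<{ ok: boolean; errors: string }>,
  maxAttempts = 5,
): Promise<string> {
  let source = initialSource;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await validate(source);
    if (result.ok) return source;
    // Dynamic prompting: each retry carries concrete failure feedback
    // instead of relying on a hand-tuned static instruction.
    const prompt = [
      'Fix this migrated React Testing Library test so all checks pass:',
      `LATEST ERRORS:\n${result.errors}`,
      `CURRENT FILE:\n${source}`,
    ].join('\n\n');
    source = await callLlm(prompt);
  }
  // Validate the final rewrite before giving up.
  if ((await validate(source)).ok) return source;
  throw new Error(`Step still failing after ${maxAttempts} attempts`);
}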

This worked for the following reasons:

- Each retry was grounded in concrete failure output rather than a static instruction.
- Retries were cheap to automate, while hand-tuning a prompt for every failure mode was not.
- Most failures needed only small, localized fixes, which the model handled reliably once it could see the error.

Retrying with context turned out to be a better investment than engineering the “ideal” prompt up front. It allowed the pipeline to adapt without human intervention and pushed a large portion of files through successfully with minimal effort.

3 - Rich Prompt Context

Retry loops handled the bulk of test migrations, but they started to fall short when dealing with more complex files: tests with deep indirection, custom utilities, or tightly coupled setups. These cases needed more than just brute-force retries. They needed contextual understanding.

To handle these, the team significantly expanded prompt inputs, pushing token counts into the 40,000 to 100,000 range. Instead of a minimal diff, the model received a detailed picture of the surrounding codebase, testing patterns, and architectural intent.

Each rich prompt included:

- Sibling test files from the same directory, which shared conventions with the file being migrated.
- Examples of well-written RTL tests.
- The file’s nearest imports.
- The source code of the component under test.
- The test file being migrated itself.

const prompt = [
  'Convert this Enzyme test to React Testing Library:',
  `SIBLING TESTS:\n${siblingTestFilesSourceCode}`,
  `RTL EXAMPLES:\n${reactTestingLibraryExamples}`,
  `IMPORTS:\n${nearestImportSourceCode}`,
  `COMPONENT SOURCE:\n${componentFileSourceCode}`,
  `TEST TO MIGRATE:\n${testFileSourceCode}`,
].join('\n\n');

Source: Airbnb Engineering Blog

The key insight was choosing the right context files, pulling in examples that matched the structure and logic of the file being migrated. Adding more tokens didn’t help unless those tokens carried meaningful, relevant information.

By layering rich, targeted context, the LLM could infer project-specific conventions, replicate nuanced testing styles, and generate outputs that passed validations even for the hardest edge cases. This approach bridged the final complexity gap, especially in files that reused abstractions, mocked behavior indirectly, or followed non-standard test setups.
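
As one illustration of that file-selection step, a helper might pull in sibling tests from the same directory (a hypothetical sketch, not Airbnb’s tooling):

// Illustrative helper: gather sibling test files from the same directory to
// use as in-prompt examples, since they usually share the conventions of the
// file being migrated.
import * as path from 'path';
import { promises as fs } from 'fs';

async function siblingTestSources(testFile: string): Promise<string> {
  const dir = path.dirname(testFile);
  const base = path.basename(testFile);
  const entries = await fs.readdir(dir);
  const siblings = entries.filter(
    (name) => name !== base && /\.test\.(jsx?|tsx?)$/.test(name),
  );
  const sources = await Promise.all(
    siblings.map((name) => fs.readFile(path.join(dir, name), 'utf8')),
  );
  return sources.join('\n\n');
}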

4 - Systematic Cleanup From 75% to 97%

The first bulk migration pass handled 75% of the test files in under four hours. That left around 900 files stuck. These were too complex for basic retries and too inconsistent for a generic fix. Handling this long tail required targeted tools and a feedback-driven cleanup loop.

Two capabilities made this possible:

Migration Status Annotations

Each file was automatically stamped with a machine-readable comment that recorded its migration progress. 

These markers helped identify exactly where a file had failed, whether in the Enzyme refactor, Jest fixes, or final validation.

// MIGRATION STATUS: {"enzyme":"done","jest":{"passed":8,"failed":2}}

Source: Airbnb Engineering Blog

This gave the team visibility into patterns: common failure points, repeat offenders, and areas where LLM-generated code needed help.
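
As an illustration, a small script could aggregate these annotations to rank failure hot spots (a hypothetical sketch; Airbnb’s actual tooling and status format may differ):

// Illustrative sketch: parse MIGRATION STATUS annotations across files and
// tally failures by pipeline step.
import { promises as fs } from 'fs';

async function failureCountsByStep(files: string[]): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  for (const file of files) {
    const source = await fs.readFile(file, 'utf8');
    const match = source.match(/\/\/ MIGRATION STATUS: (.*)/);
    if (!match) continue;
    const status: Record<string, unknown> = JSON.parse(match[1]);
    for (const [step, state] of Object.entries(status)) {
      // A step counts as failed if it isn't marked "done" or reports
      // failing tests, e.g. {"passed":8,"failed":2}.
      const failed =
        typeof state === 'object' && state !== null
          ? ((state as { failed?: number }).failed ?? 0) > 0
          : state !== 'done';
      if (failed) counts.set(step, (counts.get(step) ?? 0) + 1);
    }
  }
  return counts;
}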

Step-Specific File Reruns

A CLI tool allowed engineers to reprocess subsets of files filtered by failure step and path pattern:

$ llm-bulk-migration --step=fix-jest --match=project-abc/**

Source: Airbnb Engineering Blog

This made it easy to focus on fixes without rerunning the full pipeline, which accelerated feedback and kept the scope of each change isolated.

Structured Feedback Loop

To convert failure patterns into working migrations, the team used a tight iterative loop:

- Pick a sample of files stuck at the same step.
- Inspect the failures and refine the prompts and context for that pattern.
- Rerun just those files with the step-specific CLI tool.
- Once the sample passed, sweep the fix across all remaining files matching the pattern.

This method wasn’t theoretical. In practice, it pushed the migration from 75% to 97% completion in just four days. For the remaining ~100 files, the system had already done most of the work. LLM outputs weren’t usable as-is, but served as solid baselines. Manual cleanup on those final files wrapped up the migration in a matter of days, not months.

The takeaway was that brute force handled the bulk, but targeted iteration finished the job. Without instrumentation and repeatable tuning, the migration would have plateaued far earlier.

Conclusion

The results validated both the tooling and the strategy. The first bulk run completed 75% of the migration in under four hours, covering thousands of test files with minimal manual involvement. 

Over the next four days, targeted prompt tuning and iterative retries pushed completion to 97%. The remaining ~100 files, representing the final 3%, were resolved manually using LLM-generated outputs as starting points, cutting down the time and effort typically required for handwritten migrations.

Throughout the process, the original test intent and code coverage were preserved. The transformed tests passed validation, matched behavioral expectations, and aligned with the structural patterns encouraged by RTL. Even for complex edge cases, the baseline quality of LLM-generated code reduced the manual burden to cleanup and review, not full rewrites.

In total, the entire migration was completed in six weeks, with only six engineers involved and modest LLM API usage. Compared to the original 18-month estimate for a manual migration, the savings in time and cost were substantial.

The project also highlighted where LLMs excel:

- Repetitive, well-scoped transformations applied across many similar files.
- Replicating project-specific patterns when given rich, relevant context.
- Producing strong baselines that reduce manual work to review and cleanup, even when the output isn’t perfect.

Airbnb now plans to extend this framework to other large-scale code transformations, such as library upgrades, testing strategy shifts, or language migrations. 

The broader conclusion is clear: AI-assisted development can reduce toil, accelerate modernization, and improve consistency when structured properly, instrumented well, and paired with domain knowledge.

References:

Accelerating Large-Scale Test Migration with LLMs (Airbnb Engineering Blog)