Can large language models (LLMs) reliably suggest fixes for flaky tests?

Sara Verdi
Graphite software engineer


Flaky tests—tests that pass or fail unpredictably—are a persistent challenge in software development. They can obscure real issues, slow down CI pipelines, and erode developer trust in your test suite. When tests fail intermittently, teams waste valuable time investigating false alarms instead of building features.

While large language models (LLMs) like GPT-3.5 Turbo have shown promise in suggesting fixes for flaky tests, their reliability varies considerably. This guide examines when LLMs can be effective, where they fall short, and how to use them responsibly in your development workflow.


Two recent research efforts illustrate both the promise and the limits. The first, FlakyFix, is a framework that uses LLMs to predict the type of fix a flaky test needs and then suggests the corresponding code repair. It focuses on flaky tests where the issue lies within the test code itself, not the production code.

Experimental results show that with proper guidance, LLMs can repair flaky tests with a success rate between 51% and 83%. However, many of these repairs require further refinement—on average, about 16% of the test code needs additional changes to pass.

The effectiveness of FlakyFix depends on the quality of the fix category prediction and the LLM's ability to generate appropriate code modifications.

The second, FlakyDoctor, combines LLMs with program analysis to repair flaky tests. It addresses both order-dependent (OD) and implementation-dependent (ID) flakiness.

In evaluations using 873 confirmed flaky tests from 243 real-world projects, FlakyDoctor achieved success rates of 57% for OD tests and 59% for ID tests.

Importantly, non-LLM components contribute 12–31% to the overall performance, indicating that while LLMs are beneficial, they are not sufficient on their own. Combining AI-powered analysis with traditional program analysis techniques yields the best results.


While LLMs show promise for fixing flaky tests, they have several notable limitations:

  • Complex flakiness causes: Tests involving external systems, network dependencies, or concurrency issues are challenging for LLMs to diagnose and repair accurately. These scenarios often require deep understanding of system architecture and timing.

  • Insufficient context: LLMs may struggle to generate effective fixes without detailed context, such as logs, stack traces, or environment information. The more context you provide, the better the results.

  • Overfitting: LLMs might suggest repairs based on patterns they've seen during training, which may not generalize well to your unique flaky test scenarios or codebase patterns.

  • Maintenance overhead: Suggested fixes may simplify the test code but could introduce new maintenance challenges or obscure the original intent of the test, making it harder for future developers to understand.


To get useful results despite these limitations, start with context. The quality of LLM suggestions depends heavily on the context you provide. Include relevant information such as:

  • Failing logs and stack traces
  • Test environment details (OS, dependencies, versions)
  • Recent code changes that might have introduced the flakiness
  • Test framework versions and configurations

This context helps the LLM understand the issue better and generate more accurate repair suggestions that are specific to your situation.
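As a concrete illustration, here is a minimal sketch of bundling that context into one prompt before sending it to whatever model or provider you use. The file paths, environment fields, and helper name are placeholders rather than part of any specific tool:

```python
# Minimal sketch: gather context for an LLM flaky-test repair request.
# Paths and field names below are illustrative placeholders.
from pathlib import Path

def build_flaky_test_prompt(
    test_source: str,
    failure_log: str,
    stack_trace: str,
    environment: dict[str, str],
    recent_diff: str,
) -> str:
    """Assemble a single prompt containing all the context the model needs."""
    env_lines = "\n".join(f"- {key}: {value}" for key, value in environment.items())
    return (
        "The following test fails intermittently. Suggest a fix to the test code only.\n\n"
        f"## Test source\n{test_source}\n\n"
        f"## Failure log\n{failure_log}\n\n"
        f"## Stack trace\n{stack_trace}\n\n"
        f"## Environment\n{env_lines}\n\n"
        f"## Recent changes (diff)\n{recent_diff}\n"
    )

prompt = build_flaky_test_prompt(
    test_source=Path("tests/test_checkout.py").read_text(),
    failure_log=Path("ci/failure.log").read_text(),
    stack_trace=Path("ci/stacktrace.txt").read_text(),
    environment={"os": "ubuntu-22.04", "python": "3.12", "pytest": "8.2"},
    recent_diff=Path("ci/last_commit.diff").read_text(),
)
# Send `prompt` to your LLM provider of choice.
```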

Use a taxonomy to classify flaky test issues before applying LLM fixes. Common categories include:

  • Timing issues: Race conditions, insufficient waits, timeout problems
  • State management problems: Tests that don't properly clean up or isolate state
  • Resource contention: Tests competing for shared resources like files, ports, or database connections
  • External dependencies: Tests that rely on network calls, third-party services, or file systems

This classification can guide the LLM in generating more targeted fixes that address the root cause rather than masking symptoms.
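A lightweight way to apply such a taxonomy is a keyword heuristic over the failure output, used only to seed the prompt with a suspected category. This is a rough sketch; the categories mirror the list above, and the keywords and example message are illustrative:

```python
# Rough sketch: map a failure message to a flakiness category before prompting the LLM.
from enum import Enum

class FlakinessCategory(Enum):
    TIMING = "timing"
    STATE = "state management"
    RESOURCE = "resource contention"
    EXTERNAL = "external dependency"
    UNKNOWN = "unknown"

# Illustrative keyword lists; extend them with patterns from your own CI logs.
KEYWORDS = {
    FlakinessCategory.TIMING: ["timeout", "timed out", "race", "deadline exceeded"],
    FlakinessCategory.STATE: ["already exists", "stale", "leftover", "not cleaned"],
    FlakinessCategory.RESOURCE: ["address already in use", "locked", "too many open files"],
    FlakinessCategory.EXTERNAL: ["connection refused", "dns", "503", "rate limit"],
}

def classify_failure(failure_text: str) -> FlakinessCategory:
    """Return a best-guess category to include in the repair prompt."""
    lowered = failure_text.lower()
    for category, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return category
    return FlakinessCategory.UNKNOWN

print(classify_failure("requests.exceptions.ConnectionError: Connection refused"))
# FlakinessCategory.EXTERNAL
```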

Pair LLM suggestions with static or dynamic analysis to verify the fixes are sound. This hybrid approach can:

  • Verify test dependencies are properly isolated
  • Ensure timing and delays are deterministic
  • Check for resource leaks or contention
  • Validate that the fix doesn't introduce new issues
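As one example of the static side, the sketch below scans a pytest-style `tests/` directory (an assumed layout) for bare `time.sleep()` calls, a pattern that often signals non-deterministic timing an LLM fix might leave in place:

```python
# Toy static check: flag direct time.sleep() calls in test files.
import ast
from pathlib import Path

def find_sleep_calls(test_dir: str = "tests") -> list[tuple[str, int]]:
    """Return (file, line) pairs where a test file calls time.sleep directly."""
    hits = []
    for path in Path(test_dir).rglob("test_*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "sleep"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "time"
            ):
                hits.append((str(path), node.lineno))
    return hits

for filename, line in find_sleep_calls():
    print(f"{filename}:{line}: bare time.sleep(); prefer an explicit wait or polling loop")
```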

Modern code review tools can help automate this process. For example, Graphite Agent provides AI-powered code review that can catch potential issues in both your test code and the fixes you apply, helping ensure that flaky test repairs don't introduce new problems.

Tip: Use AI code review tools in your pull request workflow to validate both the original test code and any AI-suggested fixes. This adds an extra layer of verification before changes reach your main branch.

After applying an LLM-suggested fix, rigorous validation is essential:

  • Run the test multiple times (at least 50-100 iterations) in the same environment
  • Test across different environments (local, CI, staging)
  • Track metrics like pass/fail rates and execution time
  • Monitor for any new failure patterns

A truly fixed test should pass consistently across all environments. If you still see intermittent failures, the fix may have only reduced the flakiness rate rather than eliminating it.
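A minimal re-run harness along those lines, assuming a pytest suite (the test id and run count below are placeholders):

```python
# Minimal validation sketch: re-run one test many times and report its pass rate.
import subprocess
import sys

TEST_ID = "tests/test_checkout.py::test_applies_discount"  # placeholder test id
RUNS = 100

passes = 0
for _ in range(RUNS):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", TEST_ID],
        capture_output=True,
    )
    passes += result.returncode == 0

print(f"{passes}/{RUNS} runs passed ({passes / RUNS:.0%})")
if passes < RUNS:
    print("Still flaky: the fix reduced the failure rate but did not eliminate it.")
```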

Always have a developer review the LLM's suggestions, especially for critical tests. Human oversight ensures:

  • The fix addresses the root cause rather than masking symptoms
  • The test's value and clarity are maintained
  • The solution aligns with your team's testing philosophy
  • No new edge cases or issues are introduced

Consider treating LLM-suggested fixes the same way you'd treat any code change: require code review, run through CI, and validate before merging.


LLMs can be a valuable tool for suggesting fixes for flaky tests, particularly when the issues are within the test code itself. However, their reliability varies, and they should be used as part of a broader strategy that includes providing rich context, categorizing issues, combining with other analysis tools, validating fixes thoroughly, and ensuring human oversight.

By following these best practices, your team can leverage LLMs effectively to reduce flaky tests in your CI pipeline while mitigating potential risks. Remember that AI is a tool to augment your testing strategy, not replace human judgment and expertise.

Ready to improve your code review process and catch test issues early? Try Graphite Agent to get AI-powered feedback on every pull request, including test code quality and potential flakiness issues.


Flaky tests typically arise from timing issues (race conditions, insufficient waits), state management problems (tests that don't properly clean up between runs), external dependencies (network calls, file systems, databases), or resource contention (multiple tests competing for the same resources). Sometimes flakiness is introduced by changes in the production code that expose previously hidden timing sensitivities in tests.
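To make the most common case concrete, the sketch below shows an insufficient-wait flake and the usual repair: replacing a fixed sleep with a bounded polling loop. `job_queue` and its methods are hypothetical stand-ins for your own code:

```python
import time

# Before: flaky, because it passes only when the job happens to finish within 2 seconds.
def test_job_completes_flaky(job_queue):
    job_queue.submit("rebuild-index")
    time.sleep(2)
    assert job_queue.status("rebuild-index") == "done"

# After: poll with a deadline, so the test fails only if the job never completes in time.
def test_job_completes(job_queue):
    job_queue.submit("rebuild-index")
    deadline = time.monotonic() + 10
    while time.monotonic() < deadline:
        if job_queue.status("rebuild-index") == "done":
            return
        time.sleep(0.1)
    raise AssertionError("job did not complete within 10 seconds")
```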

LLMs can't fix every kind of flaky test. They work best for specific types, particularly those with timing issues or state management problems within the test code itself. For tests that fail due to complex concurrency issues, architectural problems, or external system dependencies, manual investigation by experienced developers is usually more effective. Use LLMs as a first-pass tool, but always validate and review their suggestions.

Run the test many times (50-100+ iterations) in multiple environments before considering it fixed. Track the pass rate over time and compare it to the baseline. A properly fixed flaky test should have a 100% pass rate across environments. If you still see any failures, the root cause may not be fully addressed. Also ensure the fix doesn't just mask the problem by adding excessive delays or overly broad try-catch blocks.

Training or fine-tuning a model on your own flaky tests can improve results, but it requires significant data (many examples of flaky tests and their fixes from your codebase) and resources. A more practical approach is to provide rich context with each request, use prompt engineering to guide the LLM's reasoning, and build up a library of successful fixes that you can reference in future prompts. Many teams find this approach more cost-effective than custom training.
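A sketch of that prompt-library approach, with illustrative entries rather than real fixes from any particular codebase:

```python
# Few-shot prompting from a small library of past fixes, as a lighter-weight
# alternative to custom training. Entries below are illustrative.
FIX_LIBRARY = [
    {
        "symptom": "AssertionError after time.sleep(2) while waiting for an async job",
        "fix": "Replaced the fixed sleep with a 10-second polling loop on job status.",
    },
    {
        "symptom": "Fails only when run after test_bulk_import (leftover database rows)",
        "fix": "Added a fixture that truncates the affected table before each test.",
    },
]

def few_shot_prefix() -> str:
    """Format past fixes as examples to prepend to a new repair prompt."""
    examples = "\n\n".join(
        f"Symptom: {entry['symptom']}\nFix: {entry['fix']}" for entry in FIX_LIBRARY
    )
    return f"Here are flaky-test fixes that worked in this codebase:\n\n{examples}\n\n"

# Prepend few_shot_prefix() to each new repair request so the model sees prior patterns.
```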

Manual fixes typically involve deeper investigation into root causes, better understanding of system architecture, and more context-aware solutions. LLM-suggested fixes are faster and can handle straightforward cases like missing waits or improper state cleanup, but may miss subtle issues or suggest solutions that work but aren't optimal. The best approach combines both: use LLMs for initial suggestions and speed, but validate with human expertise.

Most teams see benefits from LLM-assisted flaky test repair within 2-4 weeks if they focus on the highest-volume flaky tests first. The ROI comes from reduced developer time investigating false failures, faster CI pipelines, and improved developer confidence in the test suite. Track metrics like time spent on flaky test investigations, CI failure rates, and developer satisfaction to measure impact.
