Can large language models (LLMs) reliably suggest fixes for flaky tests?

Sara Verdi
Graphite software engineer


Flaky tests—tests that pass or fail unpredictably—are a persistent challenge in software development. They can obscure real issues, slow down CI pipelines, and erode developer trust in your test suite. When tests fail intermittently, teams waste valuable time investigating false alarms instead of building features.

While large language models (LLMs) like GPT-3.5 Turbo have shown promise in suggesting fixes for flaky tests, their reliability varies considerably. This guide examines when LLMs can be effective, where they fall short, and how to use them responsibly in your development workflow.


Two recent research efforts illustrate both the promise and the limits. The first, FlakyFix, is a framework that uses LLMs to predict the type of fix a flaky test needs and then suggests the corresponding code repair. It focuses on flaky tests where the issue lies within the test code itself, not the production code.

Experimental results show that with proper guidance, LLMs can repair flaky tests with a success rate between 51% and 83%. However, many of these repairs require further refinement—on average, about 16% of the test code needs additional changes to pass.

The effectiveness of FlakyFix depends on the quality of the fix category prediction and the LLM's ability to generate appropriate code modifications.

The second, FlakyDoctor, combines LLMs with program analysis to repair flaky tests. It addresses both order-dependent (OD) and implementation-dependent (ID) flakiness.

In evaluations using 873 confirmed flaky tests from 243 real-world projects, FlakyDoctor achieved success rates of 57% for OD tests and 59% for ID tests.

Importantly, non-LLM components contribute 12–31% to the overall performance, indicating that while LLMs are beneficial, they are not sufficient on their own. Combining AI-powered analysis with traditional program analysis techniques yields the best results.


While LLMs show promise for fixing flaky tests, they have several notable limitations:

  • Complex flakiness causes: Tests involving external systems, network dependencies, or concurrency issues are challenging for LLMs to diagnose and repair accurately. These scenarios often require deep understanding of system architecture and timing.

  • Insufficient context: LLMs may struggle to generate effective fixes without detailed context, such as logs, stack traces, or environment information. The more context you provide, the better the results.

  • Overfitting: LLMs might suggest repairs based on patterns they've seen during training, which may not generalize well to your unique flaky test scenarios or codebase patterns.

  • Maintenance overhead: Suggested fixes may simplify the test code but could introduce new maintenance challenges or obscure the original intent of the test, making it harder for future developers to understand.


To get useful results despite these limitations, start with context. The quality of LLM suggestions depends heavily on the context you provide. Include relevant information such as:

  • Failing logs and stack traces
  • Test environment details (OS, dependencies, versions)
  • Recent code changes that might have introduced the flakiness
  • Test framework versions and configurations

This context helps the LLM understand the issue better and generate more accurate repair suggestions that are specific to your situation.
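As a concrete illustration, here is a minimal sketch of bundling that context into one prompt before sending it to whatever model or provider you use. The file paths, environment fields, and helper name are placeholders rather than part of any specific tool:

```python
# Minimal sketch: gather context for an LLM flaky-test repair request.
# Paths and field names below are illustrative placeholders.
from pathlib import Path

def build_flaky_test_prompt(
    test_source: str,
    failure_log: str,
    stack_trace: str,
    environment: dict[str, str],
    recent_diff: str,
) -> str:
    """Assemble a single prompt containing all the context the model needs."""
    env_lines = "\n".join(f"- {key}: {value}" for key, value in environment.items())
    return (
        "The following test fails intermittently. Suggest a fix to the test code only.\n\n"
        f"## Test source\n{test_source}\n\n"
        f"## Failure log\n{failure_log}\n\n"
        f"## Stack trace\n{stack_trace}\n\n"
        f"## Environment\n{env_lines}\n\n"
        f"## Recent changes (diff)\n{recent_diff}\n"
    )

prompt = build_flaky_test_prompt(
    test_source=Path("tests/test_checkout.py").read_text(),
    failure_log=Path("ci/failure.log").read_text(),
    stack_trace=Path("ci/stacktrace.txt").read_text(),
    environment={"os": "ubuntu-22.04", "python": "3.12", "pytest": "8.2"},
    recent_diff=Path("ci/last_commit.diff").read_text(),
)
# Send `prompt` to your LLM provider of choice.
```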

Use a taxonomy to classify flaky test issues before applying LLM fixes. Common categories include:

  • Timing issues: Race conditions, insufficient waits, timeout problems
  • State management problems: Tests that don't properly clean up or isolate state
  • Resource contention: Tests competing for shared resources like files, ports, or database connections
  • External dependencies: Tests that rely on network calls, third-party services, or file systems

This classification can guide the LLM in generating more targeted fixes that address the root cause rather than masking symptoms.
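A lightweight way to apply such a taxonomy is a keyword heuristic over the failure output, used only to seed the prompt with a suspected category. This is a rough sketch; the categories mirror the list above, and the keywords and example message are illustrative:

```python
# Rough sketch: map a failure message to a flakiness category before prompting the LLM.
from enum import Enum

class FlakinessCategory(Enum):
    TIMING = "timing"
    STATE = "state management"
    RESOURCE = "resource contention"
    EXTERNAL = "external dependency"
    UNKNOWN = "unknown"

# Illustrative keyword lists; extend them with patterns from your own CI logs.
KEYWORDS = {
    FlakinessCategory.TIMING: ["timeout", "timed out", "race", "deadline exceeded"],
    FlakinessCategory.STATE: ["already exists", "stale", "leftover", "not cleaned"],
    FlakinessCategory.RESOURCE: ["address already in use", "locked", "too many open files"],
    FlakinessCategory.EXTERNAL: ["connection refused", "dns", "503", "rate limit"],
}

def classify_failure(failure_text: str) -> FlakinessCategory:
    """Return a best-guess category to include in the repair prompt."""
    lowered = failure_text.lower()
    for category, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return category
    return FlakinessCategory.UNKNOWN

print(classify_failure("requests.exceptions.ConnectionError: Connection refused"))
# FlakinessCategory.EXTERNAL
```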

Pair LLM suggestions with static or dynamic analysis to verify the fixes are sound. This hybrid approach can:

  • Verify test dependencies are properly isolated
  • Ensure timing and delays are deterministic
  • Check for resource leaks or contention
  • Validate that the fix doesn't introduce new issues
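As one example of the static side, the sketch below scans a pytest-style `tests/` directory (an assumed layout) for bare `time.sleep()` calls, a pattern that often signals non-deterministic timing an LLM fix might leave in place:

```python
# Toy static check: flag direct time.sleep() calls in test files.
import ast
from pathlib import Path

def find_sleep_calls(test_dir: str = "tests") -> list[tuple[str, int]]:
    """Return (file, line) pairs where a test file calls time.sleep directly."""
    hits = []
    for path in Path(test_dir).rglob("test_*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "sleep"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "time"
            ):
                hits.append((str(path), node.lineno))
    return hits

for filename, line in find_sleep_calls():
    print(f"{filename}:{line}: bare time.sleep(); prefer an explicit wait or polling loop")
```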

Modern code review tools can help automate this process. For example, Graphite Agent provides AI-powered code review that can catch potential issues in both your test code and the fixes you apply, helping ensure that flaky test repairs don't introduce new problems.

Tip: Use AI code review tools in your pull request workflow to validate both the original test code and any AI-suggested fixes. This adds an extra layer of verification before changes reach your main branch.

After applying an LLM-suggested fix, rigorous validation is essential:

  • Run the test multiple times (at least 50-100 iterations) in the same environment
  • Test across different environments (local, CI, staging)
  • Track metrics like pass/fail rates and execution time
  • Monitor for any new failure patterns

A truly fixed test should pass consistently across all environments. If you still see intermittent failures, the fix may have only reduced the flakiness rate rather than eliminating it.
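A minimal re-run harness along those lines, assuming a pytest suite (the test id and run count below are placeholders):

```python
# Minimal validation sketch: re-run one test many times and report its pass rate.
import subprocess
import sys

TEST_ID = "tests/test_checkout.py::test_applies_discount"  # placeholder test id
RUNS = 100

passes = 0
for _ in range(RUNS):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", TEST_ID],
        capture_output=True,
    )
    passes += result.returncode == 0

print(f"{passes}/{RUNS} runs passed ({passes / RUNS:.0%})")
if passes < RUNS:
    print("Still flaky: the fix reduced the failure rate but did not eliminate it.")
```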

Always have a developer review the LLM's suggestions, especially for critical tests. Human oversight ensures:

  • The fix addresses the root cause rather than masking symptoms
  • The test's value and clarity are maintained
  • The solution aligns with your team's testing philosophy
  • No new edge cases or issues are introduced

Consider treating LLM-suggested fixes the same way you'd treat any code change: require code review, run through CI, and validate before merging.


LLMs can be a valuable tool for suggesting fixes for flaky tests, particularly when the issues are within the test code itself. However, their reliability varies, and they should be used as part of a broader strategy that includes providing rich context, categorizing issues, combining with other analysis tools, validating fixes thoroughly, and ensuring human oversight.

By following these best practices, your team can leverage LLMs effectively to reduce flaky tests in your CI pipeline while mitigating potential risks. Remember that AI is a tool to augment your testing strategy, not replace human judgment and expertise.

Ready to improve your code review process and catch test issues early? Try Graphite Agent to get AI-powered feedback on every pull request, including test code quality and potential flakiness issues.


Flaky tests typically arise from timing issues (race conditions, insufficient waits), state management problems (tests that don't properly clean up between runs), external dependencies (network calls, file systems, databases), or resource contention (multiple tests competing for the same resources). Sometimes flakiness is introduced by changes in the production code that expose previously hidden timing sensitivities in tests.
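To make the most common case concrete, the sketch below shows an insufficient-wait flake and the usual repair: replacing a fixed sleep with a bounded polling loop. `job_queue` and its methods are hypothetical stand-ins for your own code:

```python
import time

# Before: flaky, because it passes only when the job happens to finish within 2 seconds.
def test_job_completes_flaky(job_queue):
    job_queue.submit("rebuild-index")
    time.sleep(2)
    assert job_queue.status("rebuild-index") == "done"

# After: poll with a deadline, so the test fails only if the job never completes in time.
def test_job_completes(job_queue):
    job_queue.submit("rebuild-index")
    deadline = time.monotonic() + 10
    while time.monotonic() < deadline:
        if job_queue.status("rebuild-index") == "done":
            return
        time.sleep(0.1)
    raise AssertionError("job did not complete within 10 seconds")
```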

LLMs can't fix every kind of flaky test. They work best for specific types, particularly those with timing issues or state management problems within the test code itself. For tests that fail due to complex concurrency issues, architectural problems, or external system dependencies, manual investigation by experienced developers is usually more effective. Use LLMs as a first-pass tool, but always validate and review their suggestions.

Run the test many times (50-100+ iterations) in multiple environments before considering it fixed. Track the pass rate over time and compare it to the baseline. A properly fixed flaky test should have a 100% pass rate across environments. If you still see any failures, the root cause may not be fully addressed. Also ensure the fix doesn't just mask the problem by adding excessive delays or overly broad try-catch blocks.

Training or fine-tuning a model on your own flaky tests can improve results, but it requires significant data (many examples of flaky tests and their fixes from your codebase) and resources. A more practical approach is to provide rich context with each request, use prompt engineering to guide the LLM's reasoning, and build up a library of successful fixes that you can reference in future prompts. Many teams find this approach more cost-effective than custom training.
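A sketch of that prompt-library approach, with illustrative entries rather than real fixes from any particular codebase:

```python
# Few-shot prompting from a small library of past fixes, as a lighter-weight
# alternative to custom training. Entries below are illustrative.
FIX_LIBRARY = [
    {
        "symptom": "AssertionError after time.sleep(2) while waiting for an async job",
        "fix": "Replaced the fixed sleep with a 10-second polling loop on job status.",
    },
    {
        "symptom": "Fails only when run after test_bulk_import (leftover database rows)",
        "fix": "Added a fixture that truncates the affected table before each test.",
    },
]

def few_shot_prefix() -> str:
    """Format past fixes as examples to prepend to a new repair prompt."""
    examples = "\n\n".join(
        f"Symptom: {entry['symptom']}\nFix: {entry['fix']}" for entry in FIX_LIBRARY
    )
    return f"Here are flaky-test fixes that worked in this codebase:\n\n{examples}\n\n"

# Prepend few_shot_prefix() to each new repair request so the model sees prior patterns.
```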

Manual fixes typically involve deeper investigation into root causes, better understanding of system architecture, and more context-aware solutions. LLM-suggested fixes are faster and can handle straightforward cases like missing waits or improper state cleanup, but may miss subtle issues or suggest solutions that work but aren't optimal. The best approach combines both: use LLMs for initial suggestions and speed, but validate with human expertise.

Most teams see benefits from LLM-assisted flaky test repair within 2-4 weeks if they focus on the highest-volume flaky tests first. The ROI comes from reduced developer time investigating false failures, faster CI pipelines, and improved developer confidence in the test suite. Track metrics like time spent on flaky test investigations, CI failure rates, and developer satisfaction to measure impact.
