OpenAI SWE Interview: Refactoring and Code Review Round Guide
Estimated read time: 7-9 minutes
Summary: The OpenAI refactoring and code review round is one of the least-discussed assessments in the SWE loop, partly because it does not appear in every candidate's process and partly because it is easy to underestimate. This round tests something distinct from correctness or algorithm design: it tests whether you can read unfamiliar code, identify what is wrong with it, improve it without breaking it, and explain your reasoning clearly. This guide covers what is evaluated, what kinds of code you will be given, and how to approach the round effectively.
TL;DR + FAQ (read this first)
At-a-glance takeaways
- You will be given existing code to read, critique, and improve; this is not a blank-page implementation exercise
- The evaluation weights code-reading ability, judgment about what to change, and communication of reasoning as heavily as the improvements themselves
- Not every issue needs to be fixed; prioritisation is itself a signal
- Preserving existing behaviour while making improvements is a hard requirement, not a nice-to-have
- This round often appears as part of the onsite loop; format varies by team and seniority
Quick FAQ
What is actually being tested here?
Code reading speed, engineering judgment about what matters, the ability to improve code without breaking it, and the ability to explain your decisions clearly. It is less about whether you can write perfect code and more about whether you can engage with imperfect code the way a professional engineer would.
Will the code be in a specific language?
Typically Python or the language most relevant to the role. Confirm with your recruiter. If the language is one you are less comfortable reading, it is worth reviewing common patterns in it before the round.
Do I need to fix everything I find?
No. Part of the signal is whether you can distinguish between issues that matter and issues that are cosmetic or low priority. Trying to fix everything, especially under time pressure, often leads to worse outcomes than fixing the most important things well.
Should I write tests as part of my refactoring?
Yes, where the existing code lacks them. Adding tests that verify behaviour before and after your changes is a strong positive signal. It demonstrates that you refactor safely, not just confidently.
How much explanation is expected?
More than most candidates give. Narrating or documenting your reasoning, especially for non-obvious changes, is a core part of the round. A refactoring that is clearly correct but unexplained is less valuable than one that is explained well.
1) What this round is actually evaluating
The refactoring and code review round tests a cluster of skills that are distinct from what the other coding rounds assess. The emphasis is not on implementing something new; it is on engaging with something existing, and doing so with the judgment of a senior engineer.
Code-reading ability
Before you can improve code, you have to understand it. OpenAI uses this round to assess how quickly and accurately you can build a mental model of an unfamiliar codebase. Candidates who skim and miss structural issues tend to make superficial changes. Candidates who read methodically tend to identify root causes rather than symptoms.
Engineering judgment about what matters
Not all code problems are equally important. A variable named poorly is not in the same category as a race condition or a silent error swallower. This round tests whether you can distinguish between these categories and focus your effort where it has the most impact.
Safe refactoring practice
Improving code without breaking it is a skill. It requires understanding what the existing code is doing, writing tests that verify that behaviour before changing anything, making targeted changes, and verifying that the tests still pass afterwards. Candidates who change code confidently without verifying behaviour tend to be marked down for it.
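As an illustration of that loop, here is a minimal sketch of pinning down behaviour before refactoring. The functions `legacy_total` and `refactored_total` and the `CASES` list are hypothetical, invented for this example; the code you receive in the round will look different.

```python
def legacy_total(prices, discount):
    # Hypothetical "before" code: correct, but clumsier than it needs to be.
    total = 0
    for i in range(len(prices)):
        total = total + prices[i]
    if discount:
        total = total - total * discount
    return round(total, 2)


def refactored_total(prices, discount):
    # Behaviour-preserving rewrite: same inputs must give the same outputs.
    total = sum(prices)
    if discount:
        total -= total * discount
    return round(total, 2)


# Characterisation cases, written down before touching the code.
CASES = [
    ([], 0),
    ([10.0], 0),
    ([10.0, 5.5], 0.1),
    ([1, 2, 3], None),
]


def check_same_behaviour():
    # Compare old and new implementations on every recorded case.
    for prices, discount in CASES:
        assert legacy_total(prices, discount) == refactored_total(prices, discount)
    return True
```

Keeping the old implementation around while comparing outputs is one option among several; recording expected values in ordinary unit tests works just as well once the old function is deleted.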
Communication of reasoning
This round is as much a communication exercise as a technical one. Being able to explain what you found, why it matters, what you changed, and what tradeoffs you made is evaluated directly. Reviewers are looking for the kind of clear, precise commentary that would make a code review useful to the author.
2) Types of code you should expect
The code given in this round typically has a mix of issue categories. Below are the most common types of problems you will be expected to identify and address.
Correctness bugs
Code that produces wrong output for certain inputs, mishandles edge cases, or has logic errors that are not immediately obvious. These are the highest-priority issues and should be identified and fixed first.
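A small, hypothetical example of the kind of edge-case bug this describes: a chunking helper that looks right on evenly divisible input but silently drops the final partial chunk. The function names are made up for illustration.

```python
def chunk_buggy(items, size):
    # Bug: integer division ignores the remainder, so a trailing
    # partial chunk is silently dropped.
    return [items[i * size:(i + 1) * size] for i in range(len(items) // size)]


def chunk_fixed(items, size):
    # Targeted fix: step through the list by `size`, so the last
    # slice picks up whatever is left over.
    return [items[i:i + size] for i in range(0, len(items), size)]


print(chunk_buggy([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4]]
print(chunk_fixed([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```

Note that the two versions agree whenever the length is a multiple of `size`, which is why a happy-path test alone would not catch this.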
Error handling gaps
Functions that do not handle failures, swallow exceptions silently, or crash on unexpected input. This is one of the most common categories in OpenAI refactoring exercises, reflecting the company's emphasis on production-readiness.
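To make "swallowing exceptions silently" concrete, here is a hedged sketch of a before-and-after. The names `load_config_silent`, `load_config`, and `ConfigError` are hypothetical. Note that the "after" version changes what callers observe on bad input, so in a real review this is a change to call out and confirm rather than make quietly.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_config_silent(text):
    # "Before": every failure is swallowed, so the caller cannot tell
    # an empty config from a corrupt one.
    try:
        return json.loads(text)
    except Exception:
        return {}


class ConfigError(ValueError):
    """Raised when config text cannot be parsed (hypothetical name)."""


def load_config(text):
    # "After": catch only the failure we expect, log it, and surface it
    # to the caller with the original exception chained.
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        logger.error("config is not valid JSON: %s", exc)
        raise ConfigError("invalid config") from exc
```

Narrowing `except Exception` to `json.JSONDecodeError` also means unrelated failures, such as passing `None`, are no longer hidden.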
Concurrency and shared state issues
Race conditions, unsynchronised shared state, or incorrect assumptions about execution order. These are subtle and easy to miss on a quick read, which is partly why they appear in refactoring exercises: they test the depth of your code-reading.
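The sketch below shows one common shape of this problem: an unsynchronised read-modify-write on shared state, and a lock-based fix. The class and function names are hypothetical. On a given run the unsafe version may or may not lose updates, which is exactly why such bugs are easy to miss.

```python
import threading


class UnsafeCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        # Read-modify-write is not atomic: two threads can read the same
        # value, and one of the updates is then lost.
        self.value += 1


class SafeCounter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes the read-modify-write a single critical section.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, threads=4, per_thread=5_000):
    # Drive the counter from several threads and return the final value.
    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value
```

With the defaults, `hammer(SafeCounter())` always returns 20000; no such guarantee holds for `UnsafeCounter`.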
Structure and readability problems
Deeply nested logic, functions that do too many things, poorly named variables, and hard-coded values that should be configurable. These are real problems but lower priority than correctness issues. Demonstrate that you know the difference.
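As a rough illustration, the hypothetical function below is rewritten with guard clauses and named constants. The order shape (a dict with `items`, `country`, and `total` keys) and the rates are assumptions made up for this example.

```python
def shipping_cost_nested(order):
    # "Before": four levels of nesting and magic numbers.
    if order is not None:
        if order["items"]:
            if order["country"] == "US":
                if order["total"] >= 50:
                    return 0
                else:
                    return 5
            else:
                return 15
        else:
            return 0
    else:
        return 0


FREE_SHIPPING_THRESHOLD = 50
DOMESTIC_RATE = 5
INTERNATIONAL_RATE = 15


def shipping_cost(order, free_threshold=FREE_SHIPPING_THRESHOLD):
    # "After": early returns handle the special cases first, and the
    # threshold is a parameter instead of a hard-coded value.
    if order is None or not order["items"]:
        return 0
    if order["country"] != "US":
        return INTERNATIONAL_RATE
    if order["total"] >= free_threshold:
        return 0
    return DOMESTIC_RATE
```

Both versions return the same result for every input; the change is purely structural, which is what makes it lower priority than a correctness fix.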
Missing or insufficient tests
Code that has no test coverage, or tests that only verify the happy path without covering edge cases or failure modes. Adding tests is often as valuable as fixing the code itself, since it makes future changes safer.
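A short sketch of what "beyond the happy path" can look like, using the standard library's `unittest`. The function `parse_port` is a hypothetical stand-in for whatever code you are given.

```python
import unittest


def parse_port(text):
    # Hypothetical function under test.
    port = int(text.strip())
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTests(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_whitespace_is_tolerated(self):
        self.assertEqual(parse_port(" 443\n"), 443)

    def test_boundaries(self):
        # Smallest and largest valid values.
        self.assertEqual(parse_port("1"), 1)
        self.assertEqual(parse_port("65535"), 65535)

    def test_out_of_range_is_rejected(self):
        for bad in ("0", "65536", "-1"):
            with self.assertRaises(ValueError):
                parse_port(bad)

    def test_non_numeric_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_port("http")
```

Only the first test covers the happy path; the rest document boundaries and failure modes, which is where regressions tend to hide.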
3) How to read and assess unfamiliar code under time pressure
Reading code quickly and accurately under time pressure is a skill that improves with deliberate practice. Below is a structured approach that works well for this round.
Start with the overall structure before any line of code. What modules or files are present? What are the main functions or classes? What is the entry point? Getting this structural map in your head before reading any implementation saves time and prevents you from getting lost in detail too early.
Read the tests first if they exist. Tests are documentation. They tell you what the code is supposed to do without requiring you to reverse-engineer the implementation. Start there, then read the code with the expected behaviour already in mind.
Follow the data flow. For most code review exercises, the most important things happen along the main data path: what comes in, what transformations happen, and what goes out. Trace this path first. Edge cases, error handling, and secondary paths can be assessed after the main path is understood.
Note issues without fixing them immediately. As you read, note issues you spot without stopping to fix them. This keeps your reading momentum and prevents you from spending all your time on the first problem you find while missing a more serious one later in the code.
Prioritise before you start changing anything. Once you have read the whole thing, rank the issues by severity. Fix correctness bugs first. Then error handling. Then structure. Cosmetic issues last, or not at all if time is limited.
4) How to prioritise which changes to make
One of the clearest signals in this round is whether candidates can distinguish between what matters and what does not. Below is a practical prioritisation framework.
Priority 1: Correctness. Bugs that cause wrong behaviour are always the highest priority. Fix these first, and write or update tests that verify the correct behaviour after the fix.
Priority 2: Safety and reliability. Silent failures, missing error handling, and concurrency issues fall into this category. These may not cause visible wrong output, but they will cause production incidents. Address them after correctness bugs.
Priority 3: Testability and coverage. Adding tests that cover existing behaviour is high-value, and where a change is risky it is worth writing the test before making the change. Tests make all subsequent changes safer.
Priority 4: Structure and readability. Refactoring for clarity, simplifying complex logic, renaming variables, and improving module boundaries. Important for maintainability but lower priority than correctness and safety.
Priority 5: Style and cosmetics. Formatting, minor naming improvements, and other low-impact changes. These are fine to address if time permits, but spending significant time on them at the expense of higher-priority issues is a red flag.
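One way to apply this framework while reading is to jot findings down with a priority and sort them before editing anything. The sketch below is only an illustration of that habit; the `Priority`, `Finding`, and `triage` names are hypothetical, and plain notes on paper serve the same purpose.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Lower number means more urgent, mirroring the five levels above.
    CORRECTNESS = 1
    SAFETY = 2
    TESTABILITY = 3
    STRUCTURE = 4
    STYLE = 5


@dataclass(frozen=True)
class Finding:
    line: int
    priority: Priority
    note: str


def triage(findings, budget):
    """Split findings into (fix now, mention only), most severe first."""
    ranked = sorted(findings, key=lambda f: (f.priority, f.line))
    return ranked[:budget], ranked[budget:]


findings = [
    Finding(40, Priority.STYLE, "inconsistent naming"),
    Finding(12, Priority.SAFETY, "bare except hides failures"),
    Finding(30, Priority.CORRECTNESS, "last chunk dropped"),
]
fix_now, mention_only = triage(findings, budget=2)
print([f.note for f in fix_now])  # ['last chunk dropped', 'bare except hides failures']
```

The second list is still useful: those are the issues to name in your commentary as noticed but deliberately left alone.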
5) Communicating your reasoning effectively
The quality of your explanation matters as much as the quality of your changes in this round. Below are the patterns that produce strong evaluation outcomes.
Name the issue before describing the fix. "This function swallows exceptions silently; I have added explicit error handling and logging" is more useful than just showing the diff. The interviewer or reviewer should understand what you found before seeing what you changed.
Explain why, not just what. "I extracted this logic into a separate function because it was being repeated in three places and was hard to read in context" is more informative than "I refactored this function." The reasoning demonstrates judgment, not just execution.
Acknowledge what you did not change and why. If you identified an issue but chose not to address it, say so and explain your prioritisation. "I noticed the variable naming in this section is inconsistent but prioritised fixing the race condition in the concurrent handler first" is a mature and professional framing.
Be direct about uncertainty. If you are not sure whether something is a bug or intentional behaviour, say so. "This looks like it could cause an issue under concurrent access, but I want to confirm whether thread safety is a requirement here before changing it" is a much stronger signal than either ignoring the issue or making a confident change that turns out to be wrong.
6) Common failure modes
Fixing everything superficially instead of the most important things well. Candidates who make many small cosmetic changes while missing a correctness bug or race condition score poorly. Reviewers are specifically looking for evidence that you can identify what matters.
Refactoring without verifying behaviour. Making changes without first understanding what the code is supposed to do, or without adding tests to verify that it still does it after your changes, is a recognised failure mode. Safe refactoring requires verification.
Explaining changes without explaining reasoning. Listing what you changed without explaining why you made those choices reads like a commit log, not a code review. The reasoning is the most valuable part of your commentary.
Missing the most serious issues. Correctness bugs and error handling gaps are the highest-priority issues, but they are also sometimes subtle. Candidates who miss these while spending time on naming and formatting issues demonstrate inverted priorities.
Not adding any tests. If the code has no tests, adding them is one of the most valuable things you can do. Submitting a refactored codebase without tests, or without at least identifying the absence of tests as a problem, is a missed opportunity.
7) Frequently asked questions
Q: How much time will I have?
A: This varies by format. Live versions typically run 45-60 minutes. If it is asynchronous, the window will be specified. In either case, read before you change, and prioritise before you commit to any particular fix.
Q: Should I rewrite sections from scratch or make targeted changes?
A: Targeted changes are almost always better. A targeted fix with a clear explanation is more informative than a rewrite, and less likely to introduce new bugs. Rewrites also make it harder for the reviewer to understand what you actually changed and why.
Q: What if the code is so bad that it needs to be completely rearchitected?
A: Say so, and explain what a better structure would look like, but do not attempt a full rearchitecture in the time available. Demonstrate that you can identify the structural problem and reason about the solution, even if you cannot implement it fully in the given timeframe.
Q: What language is this round typically in?
A: Python is most common, but it varies by team and role. Confirm with your recruiter before the round so you can review relevant patterns.
Q: Is this round the same for all seniority levels?
A: The code complexity and the depth of judgment expected both scale with seniority. Senior and staff candidates are expected to identify more subtle issues, reason about architectural concerns, and demonstrate a more nuanced understanding of tradeoffs.