OpenAI has just unveiled a new family of AI models under the name GPT-4.1 — including GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.
These models, now available through OpenAI’s API, are designed to excel at coding and following detailed instructions, marking a significant step toward AI-powered software engineering.
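Because the models are available only through the API for now, trying them out amounts to passing the new model names to an existing client. Here is a minimal sketch, assuming the official openai Python SDK and that the model identifiers mirror the announced family names (“gpt-4.1”, “gpt-4.1-mini”, “gpt-4.1-nano”):

```python
# Minimal sketch: calling GPT-4.1 through the Chat Completions API.
# Assumes the official `openai` Python SDK (v1+), an OPENAI_API_KEY in the
# environment, and that the model identifier "gpt-4.1" matches the announced name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # or "gpt-4.1-mini" / "gpt-4.1-nano"
    messages=[
        {"role": "system", "content": "You are a careful software engineering assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
)
print(response.choices[0].message.content)
```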
One of the standout features of these models is their massive 1-million-token context window, which lets them process roughly 750,000 words at once, more than the length of War and Peace. However, the GPT-4.1 models aren’t currently integrated into ChatGPT.

The launch comes as competition heats up in the AI space, with rivals like Google and Anthropic pushing their own high-performing models. Google’s Gemini 2.5 Pro, which also offers a 1-million-token context window, and Anthropic’s Claude 3.7 Sonnet have both achieved strong results on popular coding benchmarks, and Chinese startup DeepSeek is making strides with its updated V3 model.
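For a rough sense of how much text fits into a 1-million-token window, token counts can be estimated locally with OpenAI’s tiktoken library. The sketch below assumes the o200k_base encoding used by recent OpenAI models is a reasonable stand-in for GPT-4.1’s tokenizer; the file path is just an example:

```python
# Rough check of whether a document fits in a 1M-token context window.
# Assumption: o200k_base (used by recent OpenAI models) approximates
# GPT-4.1's actual tokenizer; "war_and_peace.txt" is an example path.
import tiktoken

CONTEXT_WINDOW = 1_000_000

enc = tiktoken.get_encoding("o200k_base")

with open("war_and_peace.txt", encoding="utf-8") as f:
    text = f.read()

tokens = len(enc.encode(text))
words = len(text.split())

print(f"{words:,} words -> {tokens:,} tokens")
print(f"Fits in the context window: {tokens <= CONTEXT_WINDOW}")
```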
OpenAI’s long-term goal is to build a fully capable AI software engineer — or, in the words of the company’s CFO Sarah Friar, an “agentic software engineer” that can handle entire app development cycles, from writing code and debugging to QA and documentation.
GPT-4.1 makes significant progress in that direction. According to OpenAI, the model has been fine-tuned based on direct developer feedback, making it more efficient and reliable in areas such as:
- Frontend coding
- Formatting and structure adherence
- Tool usage consistency (see the sketch after this list)
- Fewer unnecessary edits
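On the tool-usage point, agent-style workflows typically hand the model a set of callable tools and let it decide when to invoke them. Below is a minimal sketch using the Chat Completions tool-calling interface; the run_tests tool and its schema are hypothetical and purely illustrative:

```python
# Minimal sketch of tool calling with GPT-4.1 via the Chat Completions API.
# The `run_tests` tool and its schema are hypothetical, for illustration only.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_tests",  # hypothetical tool exposed by the agent harness
            "description": "Run the project's test suite and return a summary of failures.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Directory containing the tests"},
                },
                "required": ["path"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Run the tests under ./src and summarize any failures."}],
    tools=tools,
)

# If the model chose to call the tool, inspect the structured call it produced.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```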
“These improvements enable developers to build agents that are considerably better at real-world software engineering tasks,” OpenAI said in a statement to TechCrunch.
Benchmark performance
OpenAI claims that GPT-4.1 outperforms its predecessors (GPT-4o and GPT-4o mini) on benchmarks like SWE-bench, which evaluates real-world software engineering tasks. While the full GPT-4.1 model offers higher accuracy, the mini and nano versions prioritize speed and efficiency — with nano being OpenAI’s fastest and cheapest model ever.
- GPT-4.1: $2.00 per million input tokens, $8.00 per million output tokens
- GPT-4.1 mini: $0.40 per million input tokens, $1.60 per million output tokens
- GPT-4.1 nano: $0.10 per million input tokens, $0.40 per million output tokens
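At those rates, per-request cost is simple arithmetic on token counts. A small sketch, using the published per-million-token prices above and made-up request sizes:

```python
# Estimate request cost from the published per-million-token prices above.
# The example request sizes (input/output tokens) are made up for illustration.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request for the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 100,000-token codebase prompt with a 2,000-token reply.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 2_000):.4f}")
```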
According to OpenAI’s internal testing, GPT-4.1 scored between 52% and 54.6% on SWE-bench Verified. By comparison, Google reports 63.8% for Gemini 2.5 Pro and Anthropic reports 62.3% for Claude 3.7 Sonnet on the same benchmark.
GPT-4.1 also performed well on video understanding: in the Video-MME evaluation, it reached 72% accuracy on long videos without subtitles, the highest among the models tested, according to OpenAI.
Limitations and reliability
Despite these advances, OpenAI acknowledges that even GPT-4.1 isn’t perfect. The model can still introduce bugs or fail to fix existing ones, and it becomes less accurate as prompts grow very long: on the company’s own OpenAI-MRCR test, accuracy dropped from 84% at 8,000 tokens to just 50% at 1 million tokens.
Additionally, GPT-4.1 tends to interpret prompts more literally than GPT-4o, so it often needs more precise, explicit instructions to produce the best results; a vague request like “fix this file,” for instance, will generally go further when it spells out which functions to change and what behavior is expected.
Still, the release of GPT-4.1 represents another leap forward in the race toward fully autonomous coding tools, with OpenAI laying the groundwork for its ambitious vision of AI-driven software engineering.