
GPT-5.4: The AI That Finally Learned to Do Your Job (But Still Won’t Make the Coffee)


Today a shiny new AI model called GPT-5.4 arrived in ChatGPT, the API, and Codex, and it’s apparently the most capable and efficient model yet for professional work. There’s also a fancy sibling called GPT-5.4 Pro, which exists for people who look at a normal AI and say, “Nice… but could it be even more intimidating?”

GPT-5.4 combines recent advances in reasoning, coding, and agent-style workflows into one model. It inherits the coding skills of GPT-5.3-Codex and improves how it works with tools, software environments, spreadsheets, presentations, and documents. In other words, it’s basically the coworker who can open Excel, PowerPoint, and a browser without immediately needing to Google “how to merge cells.”

The goal is simple: get complex real work done accurately and efficiently with less back-and-forth. The AI reads the task, does the task, and ideally doesn’t ask 17 follow-up questions like “Just to clarify, do you want the spreadsheet to… exist?”

In ChatGPT, GPT-5.4 Thinking can now show its plan upfront, before it finishes the task. That means users can interrupt it mid-process and steer it in a different direction. It’s like watching someone cook dinner and being able to shout, “WAIT, DON’T ADD THE CINNAMON!” before the disaster becomes permanent.

The model is also better at deep web research. It answers extremely specific questions more reliably and keeps track of context across longer conversations. Essentially, it can now chase obscure internet facts with the persistence of a bored Wikipedia editor at 2 a.m.

For developers, GPT-5.4 introduces something big: native computer-use capabilities. Agents can operate computers, interact with applications, and carry out multi-step workflows across software systems. The model looks at screenshots and issues keyboard and mouse actions in response, which means the AI has finally discovered the ancient human technology known as “clicking stuff.”
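
For the curious, the screenshot-then-act cycle described above can be sketched as a simple loop. This is only an illustration under loud assumptions: `Action`, `FakeDesktop`, `scripted_policy`, and `run_agent` are all made-up names, the "model" is a hard-coded script, and nothing here calls a real API or touches a real mouse.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical action format: one keyboard or mouse step."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class FakeDesktop:
    """Stand-in for a real screen; it only records what was asked of it."""

    def __init__(self):
        self.log = []

    def screenshot(self):
        # A real agent would capture pixels; a label is enough for the sketch.
        return f"frame-{len(self.log)}"

    def perform(self, action):
        self.log.append((action.kind, action.x, action.y, action.text))


def scripted_policy(screenshot, history):
    """Pretend model: replays a fixed script, then signals it is done."""
    script = [
        Action("click", x=120, y=48),
        Action("type", text="quarterly report"),
    ]
    if len(history) < len(script):
        return script[len(history)]
    return Action("done")


def run_agent(take_screenshot, choose_action, perform, max_steps=10):
    """Observe, decide, act, repeat until the policy says 'done'."""
    history = []
    for _ in range(max_steps):
        frame = take_screenshot()
        action = choose_action(frame, history)
        if action.kind == "done":
            break
        perform(action)
        history.append(action)
    return history


desktop = FakeDesktop()
steps = run_agent(desktop.screenshot, scripted_policy, desktop.perform)
print(len(steps), desktop.log)
```

The `max_steps` cap is a design choice worth keeping in any real version: an agent that can click stuff should also know when to stop clicking stuff.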

The model supports up to one million tokens of context, allowing it to plan and execute long tasks while remembering earlier steps. That’s roughly the opposite of the average human meeting, where everyone forgets the goal halfway through.

GPT-5.4 also introduces “tool search.” Previously, if a model had many tools available, all their definitions had to be included in the prompt, which filled the context with thousands of tokens. Now the model can look up tool definitions only when needed. It’s basically the AI equivalent of not carrying every tool in the garage inside your backpack “just in case.”
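
To see why looking things up on demand saves context, here is a toy comparison. Everything in it is an assumption made for illustration: `ToolDefinition`, `ToolRegistry`, the three example tools, and the one-token-per-word counting rule are invented here and say nothing about how the real feature is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDefinition:
    name: str
    description: str
    schema: str  # stand-in for a full parameter schema


class ToolRegistry:
    """Hypothetical registry that hands out full definitions only on request."""

    def __init__(self, tools):
        self._tools = {tool.name: tool for tool in tools}

    def names(self):
        return sorted(self._tools)

    def search(self, query, limit=3):
        """Return up to `limit` tools whose name or description mentions the query."""
        needle = query.lower()
        hits = [
            tool for tool in self._tools.values()
            if needle in tool.name.lower() or needle in tool.description.lower()
        ]
        return sorted(hits, key=lambda tool: tool.name)[:limit]


def rough_token_count(text):
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def definition_cost(tool):
    return rough_token_count(tool.description + " " + tool.schema)


def eager_prompt_cost(tools):
    """Old approach: every definition goes into the prompt."""
    return sum(definition_cost(tool) for tool in tools)


def lazy_prompt_cost(registry, query):
    """Tool-search approach: list the names, then load only the matches."""
    names_cost = sum(rough_token_count(name) for name in registry.names())
    return names_cost + sum(definition_cost(tool) for tool in registry.search(query))


TOOLS = [
    ToolDefinition("send_email", "Send an email message", "to subject body attachments"),
    ToolDefinition("read_calendar", "Read calendar events", "start end timezone"),
    ToolDefinition("upload_file", "Upload a file to storage", "path bucket overwrite"),
]
registry = ToolRegistry(TOOLS)

print("eager:", eager_prompt_cost(TOOLS))
print("lazy: ", lazy_prompt_cost(registry, "email"))
```

With three tiny tools the savings are modest; the gap grows with the number and size of definitions, which is the whole point of leaving the garage at home.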

The result is fewer tokens used and faster responses. Which is great, because nobody likes waiting for a computer to think. Humans are perfectly comfortable wasting their own time, but machines should really know better.

In benchmarks measuring knowledge-work tasks across dozens of professions, GPT-5.4 matched or exceeded industry professionals in 83% of comparisons. These tasks included things like creating sales presentations, financial spreadsheets, scheduling plans, manufacturing diagrams, and short videos. Somewhere out there, a PowerPoint slide just felt a sudden disturbance in the Force.

Companies testing the model say it excels at producing long-horizon deliverables such as slide decks, financial models, and legal analysis. In other words, the AI can now produce the sort of documents that humans lovingly craft over three days and four existential crises.

Spreadsheet skills improved dramatically as well. On internal tests resembling tasks done by junior investment banking analysts, GPT-5.4 scored about 87% compared to roughly 68% for GPT-5.2. This suggests the AI now understands spreadsheets better than most people who aggressively drag formulas until something looks right.

Presentations improved too. Human reviewers preferred GPT-5.4’s presentations 68% of the time thanks to better aesthetics and visual variety. Apparently the model finally learned the golden rule of PowerPoint: if the slide contains fewer than twelve fonts and three stock photos, you’re doing it wrong.

Another improvement is factual accuracy. GPT-5.4 is significantly less likely to produce incorrect claims, with individual statements being about 33% less likely to be false compared to GPT-5.2. That means the AI now hallucinates less often, which is reassuring because hallucinating computers were starting to sound like the plot of a very stressful science-fiction movie.
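
A quick note on what “33% less likely” means: it is a relative reduction, not 33 percentage points. The sketch below works through the arithmetic with a made-up baseline; the 6% starting rate is purely hypothetical and is not a figure from the release.

```python
def reduced_rate(baseline_rate, relative_reduction):
    """Apply a relative reduction to a baseline error rate."""
    return baseline_rate * (1 - relative_reduction)


# Hypothetical baseline: suppose 6 in 100 statements were false before.
hypothetical_baseline = 0.06
new_rate = reduced_rate(hypothetical_baseline, 0.33)

print(f"before: {hypothetical_baseline:.2%}, after: {new_rate:.2%}")
```

Under that assumed baseline, the rate would drop from about 6 false statements per hundred to about 4. That is a real improvement, though not the same thing as “never wrong.”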

On the coding side, GPT-5.4 matches or outperforms GPT-5.3-Codex on software benchmarks while running with lower latency. A “/fast” mode even increases token generation speed by up to 1.5×. This allows developers to iterate on code while staying in flow instead of staring at the screen wondering if the AI fell asleep.

The model also performs especially well at complex frontend tasks, producing more aesthetic and functional interfaces. That’s impressive considering that human developers can spend six hours aligning one button.

An experimental tool called “Playwright Interactive” allows Codex to visually debug apps while building them. The model can actually test its own creations automatically. It’s basically the programming equivalent of writing a book and having your editor live inside your laptop yelling helpful suggestions.

One demonstration had GPT-5.4 generate a fully interactive theme-park simulation game from a single prompt. The game included rides, guest movement, queue systems, park finances, and visual assets. Meanwhile, humans are still arguing about where to place the trash cans in their RollerCoaster Tycoon parks.

Tool usage is another area of improvement. GPT-5.4 is better at deciding when to call external tools and APIs while completing multi-step workflows. In benchmarks where the AI had to coordinate tasks like reading emails, uploading files, grading assignments, and recording results, it completed tasks more accurately and in fewer steps.

The model has also gotten better at persistent web browsing. When searching the internet for difficult information, it performs significantly better than earlier models and can follow leads across multiple sources. Basically, it does research like a determined graduate student who has already consumed three cups of coffee.

Safety work also continued. GPT-5.4 is treated as a high-capability system under cybersecurity safeguards, with monitoring systems and controls designed to prevent misuse. Because cybersecurity knowledge can be used for both defense and offense, the model is deployed cautiously with additional protections.

Researchers also studied whether the model could deliberately hide its reasoning to avoid monitoring. Thankfully, GPT-5.4 showed a low ability to obscure its reasoning, which is exactly what you want from a powerful AI. If a computer ever becomes too good at hiding its thoughts, humanity might start feeling a little uncomfortable.

GPT-5.4 is rolling out across ChatGPT, Codex, and the API. In ChatGPT, GPT-5.4 Thinking replaces GPT-5.2 Thinking for Plus, Team, and Pro users, while GPT-5.4 Pro is available for those who need maximum performance on the hardest tasks.

The model costs slightly more per token than GPT-5.2, though it often uses fewer tokens overall thanks to improved efficiency. In other words, it’s like hiring a more expensive consultant who finishes the job in half the time instead of scheduling seventeen meetings to “circle back.”
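
The consultant math is easy to check. Here is a small sketch with invented numbers: the prices and token counts below are placeholders chosen for illustration, not actual pricing for either model.

```python
def job_cost(price_per_million_tokens, tokens_used):
    """Total cost of one job at a given per-million-token price."""
    return price_per_million_tokens * tokens_used / 1_000_000


def break_even_token_ratio(old_price, new_price):
    """Fraction of the old token count at which the pricier model costs the same."""
    return old_price / new_price


# Placeholder figures, not real pricing.
old_price, new_price = 2.00, 2.50        # dollars per million tokens
old_tokens, new_tokens = 100_000, 70_000

print("old job cost:", job_cost(old_price, old_tokens))
print("new job cost:", job_cost(new_price, new_tokens))
print("break-even ratio:", break_even_token_ratio(old_price, new_price))
```

With these made-up numbers, the pricier model wins as long as it uses fewer than 80% of the tokens the older one needed. Whether it does on your workload is something only your own usage logs can say.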

In short, GPT-5.4 represents a big step forward: stronger reasoning, better coding, improved computer use, more reliable tool integration, and fewer hallucinations. It’s faster, smarter, and better at professional work.

The only remaining feature request is obvious: teaching it how to refill the office coffee machine. Because if an AI can build a theme-park simulator and run spreadsheets better than humans, surely it can figure out where the coffee beans go. ☕
