Chinese AI company Z.ai has launched GLM-5.1, an open-source coding model it says is built for agentic software engineering. The release comes as AI vendors move beyond autocomplete-style coding tools toward systems that can handle software tasks over longer periods with less human input.
Z.ai said GLM-5.1 can sustain performance over hundreds of iterations, a capability it argues sets it apart from models that lose effectiveness in longer sessions.
As one example, the company said GLM-5.1 improved a vector database optimization task over more than 600 iterations and 6,000 tool calls, reaching 21,500 queries per second, about six times the best result achieved in a single 50-turn session.
In a research note, Z.ai said GLM-5.1 outperformed its predecessor, GLM-5, on several software engineering benchmarks and showed particular strength in repo generation, terminal-based problem solving, and repeated code optimization. The company said the model scored 58.4 on SWE-Bench Pro, compared with 55.1 for GLM-5, and above the scores it listed for OpenAI’s GPT-5.4, Anthropic’s Opus 4.6, and Google’s Gemini 3.1 Pro on that benchmark.
GLM-5.1 has been released under the MIT License and is available through Z.ai’s developer platforms, with model weights also published for local deployment, the company said. That may appeal to enterprises seeking more control over how such tools are deployed.
Longer-running coding agents
Z.ai says long-running performance is a key differentiator for the company when compared with models that lose effectiveness in extended sessions.
Analysts say this is because many current models still plateau or drift after a relatively small number of turns, limiting their usefulness on extended, multi-step software tasks.
Pareekh Jain, CEO of Pareekh Consulting, said the industry is now moving beyond tools that can answer prompts toward systems that can carry out longer assignments with less supervision.
The question, Jain said, is no longer, “What can I ask this AI?” but, “What can I assign to it for the next eight hours?”
For enterprises, that raises the prospect of assigning an agent a ticket in the morning and receiving an optimized solution by day’s end, after it has run hundreds of experiments and profiled the code.
“This capability aligns with real needs such as large refactors, migration programs, and continuous incident resolution,” said Charlie Dai, VP and principal analyst at Forrester. “It suggests that long-running autonomous agents are becoming more practical, provided enterprises layer in governance, monitoring, and escalation mechanisms to manage risk.”
Open-source appeal grows
GLM-5.1’s release under the MIT License could be significant, especially for companies in regulated or security-sensitive sectors.
“This matters in four key ways,” Jain said. “First, cost. Pricing is much lower than for premium models, and self-hosting lets companies control expenses instead of paying per use. Second, data governance. Sensitive code and data don’t need to be sent to external APIs, which is important in sectors such as finance, healthcare, and defense. Third, customization. Companies can adapt the model to their own codebases and internal tools without restrictions.”
The fourth factor, according to Jain, is geopolitical risk. Although the model is open source, its links to Chinese infrastructure and entities could still raise compliance concerns for some US companies.
Dai said the MIT License makes it easier for companies to run the model on their own systems while adapting it to internal requirements and governance policies. “For many buyers, this makes GLM-5.1 a viable strategic option alongside commercial models, especially where regulatory constraints, IP sensitivity, or long-term platform control matter most,” Dai said.
Benchmark credibility
Z.ai cited three benchmarks: SWE-Bench Pro, which tests complex software engineering tasks; NL2Repo, which measures repository generation; and Terminal-Bench 2.0, which evaluates real-world terminal-based problem solving.
“These benchmarks are designed to test coding agents’ advanced coding capabilities, so topping these benchmarks reflects strong coding performance, such as reliability in planning-to-execution, less prompt rework, and faster delivery,” said Lian Jye Su, chief analyst at Omdia. “However, they are still detached from typical enterprise realities.”
Su said public benchmarks still don’t capture the messiness of proprietary codebases, legacy systems, and code review workflows. He added that benchmark results come from controlled settings that differ from production, although the gap is closing as more teams adopt agentic setups.
This article originally appeared in ComputerWorld.
