Microsoft open sources AI evaluation framework for enterprise agents

Microsoft has open-sourced an AI evaluation framework that converts natural-language requirements into executable tests, expanding its push into enterprise AI governance as organizations struggle to systematically validate agent behavior before production deployment.

The framework, called ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), generates evaluation scenarios, datasets, metrics, and scorecards from written specifications, product requirements, and governance documents, Microsoft said in a blog post announcing the release.

“Agents fail in ways that are hard to see,” Microsoft wrote in the blog post. “They drift from policy, produce unsafe outputs in edge cases, and behave differently in production than they did in testing. Generic benchmarks do not catch these failures because they are not built around your policies, your agent, or your use case.”

Rather than requiring developers to manually create evaluation suites, ASSERT translates written intent into reusable tests that can be integrated into AI development pipelines, the company said.

With ASSERT, Microsoft is entering an increasingly competitive AI evaluation market that already includes platforms such as LangChain’s LangSmith, Braintrust, Patronus AI, Galileo, Arize AI’s Phoenix, and Promptfoo, which help enterprises benchmark, monitor, and validate large language model applications.

Behavioral testing remains immature

The release comes as enterprises rapidly expand AI agent deployments while formal evaluation practices remain the exception rather than the rule.

“Most organizations, in fact, 99% of them, do not evaluate any AI agents pre-production,” said Anushree Verma, senior director analyst at Gartner.

According to Verma, the industry’s next competitive advantage will depend less on advances in reasoning models than on how effectively organizations simulate and stress-test AI agents before deployment.

“The next competitive moat in agentic AI is not about the sophistication of reasoning models or the underlying architecture,” she said. “It will be about the depth and realism of the training environment through agentic simulation, particularly for mission-critical deployments.”

Gartner estimates that by 2029, more than 75% of domain-specific agents in regulated industries that are designed without agentic simulation will fail to deliver value.

Forrester sees enterprises moving toward behavioral evaluation but says most organizations have yet to make it a formal production requirement.

“Most enterprises are still in an intermediate stage where behavioral evaluation is inconsistently applied rather than treated as a formal production gate,” said Biswajeet Mahapatra, principal analyst at Forrester.

According to Forrester data, more than 45% of organizations are already using AI agents, and another 25% are piloting them, yet many continue to struggle with scaling because of immature governance and limited operational rigor.

“The net is that behavioral evaluation is becoming important, but for most organizations it is still ad hoc or tool-driven rather than a standardized release gate enforced across the lifecycle,” Mahapatra said.

AI judges still need human oversight

Microsoft said ASSERT uses large language models as judges, with model-generated evaluations agreeing with human reviewers 80% to 90% of the time in the company’s internal validation.

That level of agreement can help automate large portions of AI testing, but should not be treated as a standalone governance mechanism, Mahapatra said.

“An 80% to 90% agreement rate with human reviewers indicates strong alignment but is not sufficient as a standalone control for governance or compliance,” he said.
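To make the arithmetic behind an agreement rate concrete, the sketch below compares a judge model's verdicts with human verdicts on the same cases. It is illustrative only: the function name, the labels, and the sample data are hypothetical and are not drawn from ASSERT or from Microsoft's internal validation.

```python
def agreement_rate(judge_labels, human_labels):
    """Return the fraction of cases where judge and human verdicts match."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("at least one labeled case is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical verdicts on ten test cases (not real evaluation data).
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "fail"]

rate = agreement_rate(judge, human)

# Indices of the cases where the judge and the human reviewer disagree.
disagreements = [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]

print(f"agreement: {rate:.0%}, disagreements at cases {disagreements}")
```

In this made-up sample, an 80% agreement rate still leaves two of ten cases where the judge and the human reviewer differ, including one where the judge passed a case the human failed.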

Instead, enterprises should adopt layered oversight where AI evaluates AI at scale while humans retain supervisory accountability for high-risk, regulated, or ambiguous scenarios. Buyers should also watch for bias, consistency issues, and overreliance on a single model acting as both generator and evaluator, he added.
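One way to picture the layered oversight described above is a routing rule that lets automated verdicts stand for low-risk cases and sends the rest to a human queue. The sketch below is a minimal illustration under stated assumptions: the `EvalResult` fields, the risk tiers, the judge confidence score, and the 0.9 threshold are all hypothetical and are not part of ASSERT.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    scenario: str
    risk_tier: str           # hypothetical tiers: "low" or "high"
    judge_verdict: str       # "pass" or "fail", as returned by the judge model
    judge_confidence: float  # assumed 0.0-1.0 score; used as a proxy for ambiguity


def needs_human_review(result, min_confidence=0.9):
    """Escalate high-risk cases and any case the judge is unsure about."""
    return result.risk_tier == "high" or result.judge_confidence < min_confidence


# Hypothetical evaluation results (not real data).
results = [
    EvalResult("greeting tone", "low", "pass", 0.97),
    EvalResult("refund over policy limit", "high", "pass", 0.95),
    EvalResult("ambiguous medical question", "low", "fail", 0.62),
]

# Scenarios a human reviewer should look at; the rest keep the judge's verdict.
human_queue = [r.scenario for r in results if needs_human_review(r)]
print(human_queue)
```

The threshold and tiers would be set by an organization's own governance policy; the point of the sketch is only that the escalation rule sits outside the judge model itself.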

Open source reduces lock-in, not governance risk

Microsoft released ASSERT under the MIT open-source license, allowing organizations to inspect, modify, and integrate the framework into existing AI development workflows.

But open sourcing a framework does not eliminate questions around evaluation neutrality, Mahapatra said.

“Open sourcing under an MIT license reduces lock-in concerns and enables broad interoperability across model ecosystems,” he said. “However, it does not fully eliminate trust or conflict-of-interest questions because the originating vendor still influences how evaluation criteria, scoring logic, and definitions of acceptable behaviour are encoded.”

Instead of relying on a single evaluation framework, enterprises should validate AI systems against multiple evaluation approaches and retain ownership of internal evaluation policies, he said.

Source: InfoWorld
Published: Jun 11, 2026, 8:36:04 AM EDT