CTI-REALM is Microsoft’s open-source benchmark that evaluates AI agents on real-world detection engineering. It measures whether an agent can take cyber threat intelligence (CTI) and produce validated ...
New benchmark shows top LLMs achieve only 29% pass rate on OpenTelemetry instrumentation, exposing the gap between coding ability and real-world SRE work. OTelBench shows that while LLMs are ...
AI chatbots have been linked to serious mental health harms in heavy users, but there have been few standards for measuring whether they safeguard human well-being or just maximize for engagement. A ...
The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as quickly as the models improve. On Monday, Artificial Analysis, an ...
WARSAW, POLAND, January 20, 2026 /EINPresswire.com/ — Quesma, Inc. announced the release of OTelBench, the first comprehensive benchmark for evaluating LLMs on ...