A post on the Alignment Forum reports that models may behave worse when they are aware they are being evaluated, potentially biasing evaluation results. This issue highlights the need for more robust evaluation methods.
Evals
- 56.Models May Behave Worse When Eval Aware (alignmentforum.org)
- 0.
Hugging Face Blog: olmo-eval: An evaluation workbench for the model development loop
- 0.
Researchers propose a method to improve analogical reasoning in AI models by fine-tuning them with retrieval-augmented reinforcement learning. This approach aims to enhance the ability of AI to learn from examples and make connections between concepts.
- 0.
Researchers from Slovakia introduce SkMTEB, a massive text embedding benchmark and model adaptation framework. SkMTEB aims to evaluate the performance of text embedding models on Slovak language datasets.
- 0.Evaluate AI agents systematically with Agent-EvalKit (aws.amazon.com)
AWS Machine Learning Blog: Agent-EvalKit is a framework that provides a structured approach to evaluating AI agents systematically.
- 0.
Researcher Haotao Xie submitted a system report on CCL25-Eval Task 5, introducing a new dataset and a LoRA-fine-tuned Qwen2.5 model.
- 0.Tracing Eval-Awareness Emergence Through Training of OLMo 3 (alignmentforum.org)
Researchers traced the emergence of eval-awareness in OLMo 3 through its training process, according to a post on the Alignment Forum.
- 0.Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech (huggingface.co)
A study benchmarks frontier ASR systems on code-switched speech to assess how well voice agents handle conversations with bilingual customers.
- 0.
Researchers introduced ABC-Bench, a benchmark for evaluating the biosecurity of agentic systems. The benchmark assesses a system's ability to perform tasks while preventing unauthorized access or modification.
- 0.Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football (arxiv.org)
Researchers Andrew Kang and Priya Narasimhan proposed a method for evaluating football passes using 3D trajectory generation and Monte Carlo search. Their approach allows for counterfactual evaluation of passes, considering various scenarios.
- 0.FrontierCode: Benchmarking for Code Quality over Slop (latent.space)
FrontierCode is a coding evaluation that aims to raise the bar for difficulty and quality, with each task taking over 40 hours of work from leading open-source maintainers.
- 0.
Researchers introduce OmniGameArena, a unified benchmark for evaluating VLM game agents using UE5, which measures improvement dynamics.
- 0.
AWS has released a new evaluation method for Amazon Nova Sonic voice agents that allows for large-scale testing without the need for microphones. This method uses a simulated environment to test the agent's performance and accuracy.
- 0.Benchmarks in Leipzig (arxiv.org)
Researchers Andrei Balakin and 47 co-authors published a paper titled 'Benchmarks in Leipzig' on arXiv, a mathematics preprint server, on June 4, 2026.
- 0.
Researchers propose MemDreamer, a hierarchical graph memory and agentic retrieval mechanism for long video understanding. This approach decouples perception and reasoning to improve video analysis.
- 0.
A paper on arXiv introduces a suite of benchmarks to evaluate the performance of frontier LLMs and agentic harnesses across the research lifecycle. The benchmarks aim to assess how well these models assist researchers in various tasks.
- 0.
Lukas Petersson and Axel Backlund of Andon Labs discuss evaluating Claude models from Haiku to Mythos and building leading, lasting frontier evals from scratch.
- 0.
Researchers propose a new method, Self-Augmenting Retrieval, to improve diffusion language models by augmenting them with retrieval-based feedback. This approach aims to enhance model performance and efficiency.
- 0.
Benchmark, a venture capital firm, has raised $2 billion across two new funds, including a $1.25 billion fund focused on late-stage bets, its first growth fund after decades of focusing on new startups. This move comes after a successful late-stage bet on Cerebras delivered big returns.
- 0.OpenAI diverges from Trump's AI EO in a new policy paper (politico.com)
OpenAI proposes mandatory cyber risk evaluations for advanced AI systems, led by CAISI, in a new policy paper that diverges from the White House's voluntary framework and NSA-led approach.
- 0.
Microsoft's MAI-Code-1-Flash achieved a 51% score on the SWE-Bench Pro benchmark with 5 billion active parameters, according to a report on the Hacker News Front Page.
- 0.
SurrealDB 3.x has been benchmarked against Postgres, Mongo, Neo4j, and Redis, with results showing its performance in various workloads and durability tests.
- 0.
OpenAI shares a playbook for independent evaluations of frontier models, emphasizing the importance of considering the model's environment and setup in assessing its performance.
- 0.
Amazon has shut down an internal leaderboard that tracked employees' use of AI tools after workers tried to boost their scores with unnecessary tasks.
- 0.Evaluating Deep Agents using LangSmith on AWS (aws.amazon.com)
AWS Machine Learning Blog: Evaluating Deep Agents using LangSmith on AWS
- 0.
Researchers Shiyu Chen, Tarfah Alrashed, Alon Halevy, and Natasha Noy published a study on the importance of semantic metadata for agents in data retrieval, comparing different approaches.
- 0.ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks (huggingface.co)
Artificial Analysis and IBM Research launched ITBench-AA, a benchmark evaluating models on agentic enterprise IT tasks, with frontier models scoring below 50% on Site Reliability Engineering tasks. The benchmark assesses model performance on Kubernetes incident response, reading logs, tracing dependencies, and identifying root-cause entities.
- 0.
Datacurve released the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, with GPT-5.5 as the leader at 70%. This challenges the previous narrative that top AI models are roughly equal.
- 0.
Nvidia's Vera CPU, featuring 88 in-house-designed Olympus cores, has shown performance competitive with Intel and AMD x86_64 CPUs in early benchmarks, according to Phoronix. The CPU is designed for agentic AI workloads and is set to be released later this year.
- 0.Even (very) noisy LLM evaluators are useful for improving AI agents (tensorzero.com)
Noisy LLM evaluators can still help pick the best variant to deploy and improve it over time, despite limited value for production decisions.
- 0.DeepSWE: A contamination-free benchmark for long-horizon coding agents (deepswe.datacurve.ai)
Researchers introduced DeepSWE, a benchmark to evaluate long-horizon coding agents without contamination from existing solutions, allowing for more accurate assessments. DeepSWE is designed to provide a fair evaluation of coding agents' abilities.
- 0.Confidence Scores for Exam Questions (nomagicpill.substack.com)
Current exams don't measure students' confidence in their answers, so a correct guess looks the same as real knowledge.