Evals

56.

Models May Behave Worse When Eval Aware (alignmentforum.org)

379 points 1 source 1 minute ago cluster

A post on the Alignment Forum reports that models may behave worse when they are aware they are being evaluated. The finding highlights the need for more robust evaluation methods.

evals models
0.

olmo-eval: An evaluation workbench for the model development loop (huggingface.co)

0 points 1 source 1 minute ago cluster

Hugging Face Blog: olmo-eval, an evaluation workbench for the model development loop.

evals models
0.

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers propose a method to improve analogical reasoning in AI models by fine-tuning them with retrieval-augmented reinforcement learning. This approach aims to enhance the ability of AI to learn from examples and make connections between concepts.

ai evals machine-learning
0.

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers from Slovakia introduce SkMTEB, a massive text embedding benchmark and model adaptation framework. SkMTEB aims to evaluate the performance of text embedding models on Slovak language datasets.

evals models
0.

Evaluate AI agents systematically with Agent-EvalKit (aws.amazon.com)

0 points 1 source 1 minute ago cluster

AWS Machine Learning Blog: Agent-EvalKit is a framework that provides a structured approach to evaluating AI agents systematically.

agents evals
0.

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5 (arxiv.org)

0 points 1 source 1 minute ago cluster

Researcher Haotao Xie submitted a system report on CCL25-Eval Task 5, introducing a new dataset and a LoRA-fine-tuned Qwen2.5 model.

ccl25-eval-task-5 evals lora-fine-tuned-qwen2-5
0.

Tracing Eval-Awareness Emergence Through Training of OLMo 3 (alignmentforum.org)

0 points 1 source 1 minute ago cluster

Researchers traced the emergence of eval-awareness in OLMo 3 through its training process, according to a post on the Alignment Forum.

ai-alignment evals olmo-3
0.

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech (huggingface.co)

0 points 1 source 1 minute ago cluster

A study benchmarks frontier ASR systems on code-switched speech to assess whether voice agents can handle bilingual customers.

agents devtools evals
0.

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers introduced ABC-Bench, a benchmark of agentic biological capabilities intended for biosecurity evaluation.

agents evals
0.

Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers Andrew Kang and Priya Narasimhan proposed a method for evaluating football passes using 3D trajectory generation and Monte Carlo search. Their approach allows for counterfactual evaluation of passes, considering various scenarios.

ai evals football
0.

FrontierCode: Benchmarking for Code Quality over Slop (latent.space)

0 points 1 source 1 minute ago cluster

FrontierCode is a coding evaluation that aims to raise the bar for difficulty and quality, with each task taking over 40 hours of work from leading open-source maintainers.

devtools evals
0.

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers introduce OmniGameArena, a unified UE5-based benchmark for VLM game agents that also measures improvement dynamics.

agents evals
0.

Evaluate your Amazon Nova Sonic voice agent at scale, no microphone required (aws.amazon.com)

0 points 1 source 1 minute ago cluster

AWS has released a new evaluation method for Amazon Nova Sonic voice agents that allows for large-scale testing without the need for microphones. This method uses a simulated environment to test the agent's performance and accuracy.

agents evals
0.

Benchmarks in Leipzig (arxiv.org)

0 points 1 source 1 minute ago cluster

Andrei Balakin and 47 co-authors published a paper titled 'Benchmarks in Leipzig' on the arXiv preprint server, under mathematics, on June 4, 2026.

evals mathematics
0.

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers propose MemDreamer, a hierarchical graph memory and agentic retrieval mechanism for long video understanding. This approach decouples perception and reasoning to improve video analysis.

agents evals
0.

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers have introduced a suite of benchmarks that evaluate frontier LLMs and agentic harnesses across the research lifecycle. The benchmarks aim to assess how well these systems can carry out the tasks a researcher performs.

agents evals
0.

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs (latent.space)

0 points 1 source 1 minute ago cluster

Lukas Petersson and Axel Backlund of Andon Labs discuss evaluating Claude models from Haiku to Mythos and building leading, lasting frontier evals from scratch.

claudes evals haiku mythos
0.

Self-Augmenting Retrieval for Diffusion Language Models (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers propose a new method, Self-Augmenting Retrieval, to improve diffusion language models by augmenting them with retrieval-based feedback. This approach aims to enhance model performance and efficiency.

evals models
0.

Benchmark raises $2B across two new funds, including $1.25B late-stage growth fund (wsj.com)

0 points 1 source 1 minute ago cluster

Benchmark, a venture capital firm, has raised $2 billion across two new funds, including a $1.25 billion fund focused on late-stage bets, its first growth fund after decades of focusing on new startups. This move comes after a successful late-stage bet on Cerebras delivered big returns.

evals startups
0.

OpenAI diverges from Trump's AI EO in a new policy paper (politico.com)

0 points 1 source 1 minute ago cluster

OpenAI proposes mandatory cyber risk evaluations for advanced AI systems, led by CAISI, in a new policy paper that diverges from the White House's voluntary framework and NSA-led approach.

evals models policy
0.

Microsoft's MAI-Code-1-Flash Scores 51% SWE-Bench Pro with Just 5B Active Params (microsoft.ai)

0 points 1 source 1 minute ago cluster

Microsoft's MAI-Code-1-Flash achieved a 51% score on the SWE-Bench Pro benchmark with 5 billion active parameters, according to a report on the Hacker News Front Page.

devtools evals
0.

Benchmarking SurrealDB 3.x vs. Postgres, Mongo, Neo4j and Redis (With Fsync) (surrealdb.com)

0 points 1 source 1 minute ago cluster

SurrealDB 3.x has been benchmarked against Postgres, Mongo, Neo4j, and Redis, with results showing its performance in various workloads and durability tests.

database-benchmarking evals
0.

A shared playbook for trustworthy third party evaluations (openai.com)

0 points 1 source 1 minute ago cluster

OpenAI shares a playbook for independent evaluations of frontier models, emphasizing the importance of considering the model's environment and setup in assessing its performance.

evals frontier-models safety
0.

Amazon scraps AI leaderboard to stop workers chasing usage scores (ft.com)

0 points 1 source 1 minute ago cluster

Amazon has shut down an internal leaderboard that tracked employees' use of AI tools after workers tried to boost their scores with unnecessary tasks.

agents evals
0.

Evaluating Deep Agents using LangSmith on AWS (aws.amazon.com)

0 points 1 source 1 minute ago cluster

AWS Machine Learning Blog: Evaluating Deep Agents using LangSmith on AWS

agents evals
0.

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval (arxiv.org)

0 points 1 source 1 minute ago cluster

Researchers Shiyu Chen, Tarfah Alrashed, Alon Halevy, and Natasha Noy published a study on the importance of semantic metadata for agents in data retrieval, comparing different approaches.

agents evals
0.

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks (huggingface.co)

0 points 1 source 1 minute ago cluster

Artificial Analysis and IBM Research launched ITBench-AA, a benchmark evaluating models on agentic enterprise IT tasks, with frontier models scoring below 50% on Site Reliability Engineering tasks. The benchmark assesses model performance on Kubernetes incident response, reading logs, tracing dependencies, and identifying root-cause entities.

agents evals models
0.

Datacurve releases DeepSWE coding benchmark with GPT-5.5 as leader at 70% (venturebeat.com)

0 points 1 source 1 minute ago cluster

Datacurve released the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, with GPT-5.5 leading at 70%. The result challenges the narrative that top AI models are roughly equal.

evals models
0.

Nvidia's Vera CPU Beats Intel and AMD x86_64 CPUs in Initial Benchmarks (phoronix.com)

0 points 2 sources 1 minute ago cluster

Nvidia's Vera CPU, featuring 88 in-house-designed Olympus cores, has shown performance competitive with Intel and AMD x86_64 CPUs in early benchmarks, according to Phoronix. The CPU is designed for agentic AI workloads and is set to be released later this year.

chips evals
0.

Even (very) noisy LLM evaluators are useful for improving AI agents (tensorzero.com)

0 points 1 source 1 minute ago cluster

Noisy LLM evaluators can still help pick the best variant to deploy and improve it over time, despite limited value for production decisions.

agents evals
0.

DeepSWE: A contamination-free benchmark for long-horizon coding agents (deepswe.datacurve.ai)

0 points 1 source 1 minute ago cluster

Researchers introduced DeepSWE, a benchmark to evaluate long-horizon coding agents without contamination from existing solutions, allowing for more accurate assessments. DeepSWE is designed to provide a fair evaluation of coding agents' abilities.

agents evals
0.

Confidence Scores for Exam Questions (nomagicpill.substack.com)

0 points 1 source 10 hours ago cluster

Current exams don't measure how confident students are in their answers, so a correct guess cannot be distinguished from actual knowledge.

evals