Evals

  1. 56.
    379 points 1 sources 1 minutes ago cluster

    Researchers from the Alignment Forum found that models may behave worse when they are aware of evaluation metrics, potentially leading to overfitting and biased results. This issue highlights the need for more robust evaluation methods.

  2. 0.
    0 points 1 sources 1 minutes ago cluster

    Hugging Face Blog: olmo-eval: An evaluation workbench for the model development loop

  3. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers propose a method to improve analogical reasoning in AI models by fine-tuning them with retrieval-augmented reinforcement learning. This approach aims to enhance the ability of AI to learn from examples and make connections between concepts.

  4. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers from Slovakia introduce SkMTEB, a massive text embedding benchmark and model adaptation framework. SkMTEB aims to evaluate the performance of text embedding models on Slovak language datasets.

  5. 0.
    0 points 1 sources 1 minutes ago cluster

    AWS Machine Learning Blog: Evaluate AI agents systematically with Agent-EvalKit. Agent-EvalKit is a systematic evaluation framework for AI agents, providing a comprehensive and structured approach to evaluating AI performance.

  6. 0.
    0 points 1 sources 1 minutes ago cluster

    Researcher Haotao Xie submitted a system report on CCL25-Eval Task 5, introducing a new dataset and a LoRA-fine-tuned Qwen2.5 model.

  7. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers traced the emergence of eval-awareness in OLMo 3 through its training process, according to a post on the Alignment Forum.

  8. 0.
    0 points 1 sources 1 minutes ago cluster

    A study examines code-switched speech from bilingual customers, assessing how well voice agents handle multilingual conversations.

  9. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers introduced ABC-Bench, a benchmark for evaluating the biosecurity of agentic systems. The benchmark assesses a system's ability to perform tasks while preventing unauthorized access or modification.

  10. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers Andrew Kang and Priya Narasimhan proposed a method for evaluating football passes using 3D trajectory generation and Monte Carlo search. Their approach allows for counterfactual evaluation of passes, considering various scenarios.

  11. 0.
    0 points 1 sources 1 minutes ago cluster

    FrontierCode is a coding evaluation that aims to raise the bar for difficulty and quality, with each task taking over 40 hours of work from leading open-source maintainers.

  12. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers introduce OmniGameArena, a unified benchmark that uses UE5 to evaluate VLM game agents and measures their improvement dynamics.

  13. 0.
    0 points 1 sources 1 minutes ago cluster

    AWS has released a new evaluation method for Amazon Nova Sonic voice agents that allows for large-scale testing without the need for microphones. This method uses a simulated environment to test the agent's performance and accuracy.

  14. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers Andrei Balakin and 47 co-authors published a paper titled 'Benchmarks in Leipzig' on arXiv, a mathematics preprint server, on June 4, 2026.

  15. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers propose MemDreamer, a hierarchical graph memory and agentic retrieval mechanism for long video understanding. This approach decouples perception and reasoning to improve video analysis.

  16. 0.
    0 points 1 sources 1 minutes ago cluster

    A paper on arXiv AI introduces a suite of benchmarks to evaluate the performance of frontier LLMs and agentic harnesses across the research lifecycle. The benchmarks aim to assess how well these models can assist researchers in various tasks.

  17. 0.
    0 points 1 sources 1 minutes ago cluster

    Lukas Petersson and Axel Backlund of Andon Labs discuss evaling Claudes from Haiku to Mythos and building leading and lasting frontier evals from scratch.

  18. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers propose a new method, Self-Augmenting Retrieval, to improve diffusion language models by augmenting them with retrieval-based feedback. This approach aims to enhance model performance and efficiency.

  19. 0.
    0 points 1 sources 1 minutes ago cluster

    Benchmark, a venture capital firm, has raised $2 billion across two new funds, including a $1.25 billion fund focused on late-stage bets, its first growth fund after decades of focusing on new startups. This move comes after a successful late-stage bet on Cerebras delivered big returns.

  20. 0.
    0 points 1 sources 1 minutes ago cluster

    OpenAI proposes mandatory cyber risk evaluations for advanced AI systems, led by CAISI, in a new policy paper that diverges from the White House's voluntary framework and NSA-led approach.

  21. 0.
    0 points 1 sources 1 minutes ago cluster

    Microsoft's MAI-Code-1-Flash achieved a 51% score on the SWE-Bench Pro benchmark with 5 billion active parameters, according to a report on the Hacker News Front Page.

  22. 0.
    0 points 1 sources 1 minutes ago cluster

    SurrealDB 3.x has been benchmarked against Postgres, Mongo, Neo4j, and Redis, with results showing its performance in various workloads and durability tests.

  23. 0.
    0 points 1 sources 1 minutes ago cluster

    OpenAI shares a playbook for independent evaluations of frontier models, emphasizing the importance of considering the model's environment and setup in assessing its performance.

  24. 0.
    0 points 1 sources 1 minutes ago cluster

    Amazon has shut down an internal leaderboard that tracked employees' use of AI tools after workers tried to boost their scores with unnecessary tasks.

  25. 0.
    0 points 1 sources 1 minutes ago cluster

    AWS Machine Learning Blog: Evaluating Deep Agents using LangSmith on AWS

  26. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers Shiyu Chen, Tarfah Alrashed, Alon Halevy, and Natasha Noy published a study on the importance of semantic metadata for agents in data retrieval, comparing different approaches.

  27. 0.
    0 points 1 sources 1 minutes ago cluster

    Artificial Analysis and IBM Research launched ITBench-AA, a benchmark evaluating models on agentic enterprise IT tasks, with frontier models scoring below 50% on Site Reliability Engineering tasks. The benchmark assesses model performance on Kubernetes incident response, reading logs, tracing dependencies, and identifying root-cause entities.

  28. 0.
    0 points 1 sources 1 minutes ago cluster

    Datacurve released the DeepSWE coding benchmark, a 113-task test across 91 open-source repositories and five languages, with GPT-5.5 as the leader at 70%. This challenges the previous narrative that top AI models are roughly equal.

  29. 0.
    0 points 2 sources 1 minutes ago cluster

    Nvidia's Vera CPU, featuring 88 in-house-designed Olympus cores, has shown performance competitive with Intel and AMD x86_64 CPUs in early benchmarks, according to Phoronix. The CPU is designed for agentic AI workloads and is set to be released later this year.

  30. 0.
    0 points 1 sources 1 minutes ago cluster

    Noisy LLM evaluators can still help pick the best variant to deploy and improve it over time, despite limited value for production decisions.

  31. 0.
    0 points 1 sources 1 minutes ago cluster

    Researchers introduced DeepSWE, a benchmark to evaluate long-horizon coding agents without contamination from existing solutions, allowing for more accurate assessments. DeepSWE is designed to provide a fair evaluation of coding agents' abilities.

  32. 0.
    Confidence Scores for Exam Questions (nomagicpill.substack.com)
    0 points 1 sources 10 hours ago cluster

    Current exams don't measure students' confidence in their answers, so a correct guess is scored the same as real knowledge.