Large Language Model Benchmarks

Sword Health Launches MindEval, the First Multi-Turn Mental Health Benchmark for Evaluating Large Language Models in Realistic Therapeutic Dialogue

Sword Health is releasing MindEval as an open benchmark, including code, prompts, and human evaluation data. This allows researchers, developers, and clinicians worldwide to test their own systems, ...

Tech Xplore on MSN

Squashing 'fantastic bugs' hidden in AI benchmarks

After reviewing thousands of benchmarks used in AI development, a Stanford team found that 5% could have serious flaws with ...

Science News

A look under the hood of DeepSeek’s AI models doesn’t provide all the answers

A peer-reviewed paper about Chinese startup DeepSeek's models explains their training approach but not how they work through ...

11d

Logical Intelligence Achieves 76 Percent on Putnam Benchmark, Highlighting Shift Beyond Large Language Models to Language-free, Mathematically Grounded Models

Over the last decade, artificial intelligence (AI) has been largely built around large language models (LLMs). These systems are built on language and generate text by predicting one token after another.

EurekAlert!

MathEval: a comprehensive benchmark for evaluating large language models on mathematical reasoning capabilities

Mathematical reasoning is a fundamental aspect of intelligence, encompassing a spectrum from basic arithmetic to intricate problem-solving. Recent investigations into the mathematical abilities of ...

ascopubs.org

RadOncRAG: A Novel Retrieval-Augmented Generation Framework Improves Large Language Model Benchmark Performance in Radiation Oncology

Large language models (LLMs) show promise in assisting knowledge-intensive fields such as oncology, where up-to-date information and multidisciplinary expertise are critical. Traditional LLMs risk ...

Devdiscourse

Legal AI is being misjudged: Benchmarks don’t match real-world law

The review found that legal use cases are routinely broken down into tasks that do not reflect the complexity of legal reasoning, procedure, or decision-making. Many studies transform rich legal ...

Z.ai debuts open source GLM-4.6V, a native tool-calling vision model for multimodal reasoning

Chinese AI startup Zhipu AI, also known as Z.ai, has released its GLM-4.6V series, a new generation of open-source vision-language models ...
