NeurIPS 2025 Paper Results

Benchmark Leaderboard

10 Models × 13 Benchmarks × 5 Frameworks — Complete evaluation results from the EffGen paper

Models Tested: 10 | Benchmarks: 13 | Frameworks: 5 | Hardware: 8× A40 GPUs (46GB)
| Rank | Model | Org / Size | EffGen Avg | Gain vs Best Baseline |
|------|-------|------------|------------|-----------------------|
| 🥇 1 | Qwen2.5-32B-Instruct | Qwen, 32B | 70.97 | +6.0 |
| 🥈 2 | Gemma3-27B | Google, 27B | 69.19 | +5.9 |
| 🥉 3 | GPT-OSS-20B | OpenAI, 20B | 67.82 | +10.1 |
| 4 | Qwen2.5-14B-Instruct | Qwen, 14B | 66.38 | +7.3 |
| 5 | Qwen2.5-7B-Instruct | Qwen, 7B | 63.07 | +11.3 |
| 6 | Gemma3-12B | Google, 12B | 60.92 | +10.9 |
| 7 | Qwen2.5-3B-Instruct | Qwen, 3B | 56.80 | +13.2 |
| 8 | Gemma3-4B | Google, 4B | 53.76 | +12.9 |
| 9 | Qwen2.5-1.5B-Instruct | Qwen, 1.5B | 47.44 | +13.1 |
| 10 | Gemma3-1B | Google, 1B | 33.38 | +10.9 |

Evaluation Setup

From the NeurIPS 2025 paper

Hardware: 8× NVIDIA A40 (46GB VRAM)
Software: Python 3.11, effGen v0.1.2
13 Benchmarks: GSM8K, GSM-PLUS, MATH-500, BB-Easy, BB-Med, BB-Hard, GAIA, SimpleQA, LoCoMo, LongMemEval, ARC-C, ARC-E, CSQA
Frameworks Compared: EffGen, LangChain, AutoGen, Smolagents, Raw Model

Key Finding

EffGen consistently outperforms LangChain, AutoGen, and Smolagents across all 10 models and 13 benchmarks, with the largest gains on smaller models where optimization matters most.

- +13.15% average gain on the 3B model
- Up to 7.7× faster
- 70.97% top score (32B model)
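To illustrate how the leaderboard's two columns relate, the sketch below computes a model's EffGen average over its benchmark scores and its gain over the best competing framework's average. This is a hypothetical reconstruction of the metric, and the toy scores are made up for illustration, not taken from the paper.

```python
def effgen_avg_and_gain(scores):
    """scores maps framework name -> list of per-benchmark accuracies.

    Returns (EffGen average, gain of EffGen over the best other framework),
    both rounded to two decimals, as shown on the leaderboard.
    """
    avgs = {fw: sum(vals) / len(vals) for fw, vals in scores.items()}
    effgen = avgs["EffGen"]
    best_other = max(v for fw, v in avgs.items() if fw != "EffGen")
    return round(effgen, 2), round(effgen - best_other, 2)


# Illustrative (made-up) per-benchmark scores for two benchmarks:
toy = {
    "EffGen":     [72.0, 68.0],
    "LangChain":  [65.0, 63.0],
    "AutoGen":    [60.0, 61.0],
    "Smolagents": [58.0, 59.0],
    "Raw Model":  [55.0, 57.0],
}
print(effgen_avg_and_gain(toy))  # -> (70.0, 6.0)
```

In the real evaluation each list would hold 13 entries (one per benchmark), and the "vs Best" column would be the delta against the strongest of the four competing frameworks.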