NeurIPS 2025 Paper Results

Benchmark Leaderboard

10 Models × 13 Benchmarks × 5 Frameworks — Complete evaluation results from the EffGen paper

Models Tested: 10 | Benchmarks: 13 | Frameworks: 5 | Hardware: 8× A40 GPUs (46GB)
| Rank | Model | Org / Size | EffGen Avg | Gain vs Best Baseline |
|------|-------|------------|------------|-----------------------|
| 🥇 1 | Qwen2.5-32B-Instruct | Qwen, 32B | 70.97 | +6.0 |
| 🥈 2 | Gemma3-27B | Google, 27B | 69.19 | +5.9 |
| 🥉 3 | GPT-OSS-20B | OpenAI, 20B | 67.82 | +10.1 |
| 4 | Qwen2.5-14B-Instruct | Qwen, 14B | 66.38 | +7.3 |
| 5 | Qwen2.5-7B-Instruct | Qwen, 7B | 63.07 | +11.3 |
| 6 | Gemma3-12B | Google, 12B | 60.92 | +10.9 |
| 7 | Qwen2.5-3B-Instruct | Qwen, 3B | 56.80 | +13.2 |
| 8 | Gemma3-4B | Google, 4B | 53.76 | +12.9 |
| 9 | Qwen2.5-1.5B-Instruct | Qwen, 1.5B | 47.44 | +13.1 |
| 10 | Gemma3-1B | Google, 1B | 33.38 | +10.9 |

Evaluation Setup

From the NeurIPS 2025 paper

Hardware: 8× NVIDIA A40 (46GB VRAM)
Software: Python 3.11, effGen v0.1.2
13 Benchmarks: GSM8K, GSM-PLUS, MATH-500, BB-Easy, BB-Med, BB-Hard, GAIA, SimpleQA, LoCoMo, LongMemEval, ARC-C, ARC-E, CSQA
Frameworks Compared: EffGen, LangChain, AutoGen, Smolagents, Raw Model

Key Finding

EffGen consistently outperforms LangChain, AutoGen, and Smolagents across all 10 models and 13 benchmarks, with the largest gains on smaller models where optimization matters most.

- +13.15% average gain on the 3B model
- Up to 7.7× faster
- 70.97% top score (32B model)
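To illustrate how the leaderboard's two columns relate, the sketch below computes a model's EffGen average over its benchmark scores and its gain over the best competing framework's average. This is a hypothetical reconstruction of the metric, and the toy scores are made up for illustration, not taken from the paper.

```python
def effgen_avg_and_gain(scores):
    """scores maps framework name -> list of per-benchmark accuracies.

    Returns (EffGen average, gain of EffGen over the best other framework),
    both rounded to two decimals, as shown on the leaderboard.
    """
    avgs = {fw: sum(vals) / len(vals) for fw, vals in scores.items()}
    effgen = avgs["EffGen"]
    best_other = max(v for fw, v in avgs.items() if fw != "EffGen")
    return round(effgen, 2), round(effgen - best_other, 2)


# Illustrative (made-up) per-benchmark scores for two benchmarks:
toy = {
    "EffGen":     [72.0, 68.0],
    "LangChain":  [65.0, 63.0],
    "AutoGen":    [60.0, 61.0],
    "Smolagents": [58.0, 59.0],
    "Raw Model":  [55.0, 57.0],
}
print(effgen_avg_and_gain(toy))  # -> (70.0, 6.0)
```

In the real evaluation each list would hold 13 entries (one per benchmark), and the "vs Best" column would be the delta against the strongest of the four competing frameworks.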