CLR-Bench Logo

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Junnan Dong, Zijin Hong, Yuanchen Bei, Feiran Huang, Xinrun Wang, Xiao Huang


The Hong Kong Polytechnic University, Jinan University, Singapore Management University

🧷 Code 📜 Paper 🚩 GitHub
Running Example

Sketched Overview

Large language models (LLMs) have demonstrated remarkable performance across various language understanding tasks. While emerging benchmarks have been proposed to evaluate LLMs in domains such as mathematics and computer science, they merely measure accuracy in terms of the final prediction on multi-choice questions. This is insufficient to verify whether an LLM truly understands the reasoning behind a chosen answer. To fill this gap, we present CLR-Bench to comprehensively evaluate LLMs on complex college-level reasoning.

Contributions:

📌 We prioritize 16 challenging college disciplines in computer science and artificial intelligence. The dataset contains 1,018 questions across 5 types, i.e., multi-choice (MC), multi-select (MS), fill-in-blank (FB), open-ended (OE), and true-or-false (TF), each associated with detailed explanations from experts.
⚖ We formalize the evaluation criteria with two novel metrics: Q→A, which measures the performance of direct answer prediction, and Q→AR, which evaluates the joint ability to answer and provide a rationale simultaneously (a minimal scoring sketch follows this list).
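The sketch below illustrates, in Python, one plausible way the two criteria could be scored for a single question. It is a minimal sketch under our own assumptions: the function names (`score_qa`, `score_qar`), the record fields, and the all-or-nothing rationale check are illustrative placeholders, not the exact grading protocol of CLR-Bench.

```python
# Minimal scoring sketch (illustrative only, not the repository's actual API).
# Q->A  : credit for the final answer alone.
# Q->AR : credit only when the answer AND the rationale are both judged correct
#         (an assumption; the real protocol may grade rationales differently).

from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str
    rationale: str


@dataclass
class GoldItem:
    answer: str
    gold_rationale: str


def rationale_is_correct(pred_rationale: str, gold_rationale: str) -> bool:
    """Placeholder judge; a real implementation would compare against the
    expert-written gold rationale with an LLM or human grader."""
    return pred_rationale.strip().lower() == gold_rationale.strip().lower()


def score_qa(pred: Prediction, gold: GoldItem) -> float:
    """Q->A: accuracy of the predicted answer only."""
    return 1.0 if pred.answer.strip().lower() == gold.answer.strip().lower() else 0.0


def score_qar(pred: Prediction, gold: GoldItem) -> float:
    """Q->AR: the prediction counts only if answer and rationale are both correct."""
    answer_ok = score_qa(pred, gold) == 1.0
    rationale_ok = rationale_is_correct(pred.rationale, gold.gold_rationale)
    return 1.0 if (answer_ok and rationale_ok) else 0.0
```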

Key Insights:

🤔 LLMs are potentially 'guessing' the answers to college-level questions instead of truly understanding the rationale, as evidenced by a dramatic drop from Q→A to Q→AR.
🔍 Model size does not inherently guarantee superior reasoning performance, despite larger models often achieving higher accuracy in Q→A. Several smaller models notably exhibit stronger performance in Q→AR, surpassing larger ones by providing accurate and coherent rationales.

In the table below, the left block of per-type columns together with the Q→A column reports direct answer-prediction performance, while the right block together with the Q→AR column reports reasoning performance, i.e., answer and rationale jointly.

| # | Models | MC | MS | TF | FB | OE | Q→A | MC | MS | TF | FB | OE | Q→AR |
|---|--------|------|------|------|------|------|------|------|------|------|------|------|------|
| 1 | qwen2.5-32b-instruct | 80.18 | 75.23 | 63.29 | 44.76 | 52.04 | 63.31 | 55.41 | 45.50 | 56.00 | 48.33 | 11.80 | 42.29 |
| 2 | deepseek-chat | 78.80 | 77.03 | 62.66 | 40.95 | 51.30 | 62.43 | 55.41 | 44.14 | 55.14 | 46.43 | 13.48 | 42.09 |
| 3 | gpt-4o | 84.33 | 76.58 | 63.92 | 54.29 | 34.01 | 60.76 | 53.11 | 44.82 | 57.52 | 51.19 | 8.55 | 41.60 |
| 4 | qwen2.5-72b-instruct | 81.11 | 77.93 | 63.61 | 45.71 | 52.23 | 64.05 | 50.58 | 45.50 | 53.56 | 48.10 | 13.10 | 40.79 |
| 5 | gpt-4-turbo | 82.49 | 78.38 | 63.92 | 50.48 | 45.91 | 63.31 | 51.73 | 45.95 | 49.45 | 51.43 | 8.74 | 39.00 |
| 6 | claude-3-opus | 80.18 | 81.53 | 62.66 | 52.38 | 44.42 | 62.57 | 50.35 | 43.02 | 50.08 | 50.00 | 9.20 | 38.56 |
| 7 | claude-3-sonnet | 76.96 | 76.58 | 59.81 | 42.86 | 48.33 | 60.51 | 50.69 | 44.59 | 48.26 | 45.95 | 11.80 | 38.51 |
| 8 | qwen2.5-72b | 80.18 | 73.42 | 60.44 | 47.62 | 52.04 | 62.52 | 50.00 | 43.02 | 45.97 | 49.52 | 11.99 | 37.89 |
| 9 | phi-3-medium-4k-instruct | 74.65 | 63.96 | 56.01 | 38.10 | 48.88 | 57.12 | 51.50 | 33.56 | 49.68 | 44.05 | 11.34 | 37.60 |
| 10 | qwen2.5-7b-instruct | 70.97 | 63.06 | 58.23 | 37.14 | 49.26 | 56.93 | 52.65 | 34.23 | 50.32 | 41.90 | 8.92 | 37.25 |
| 11 | gemini-1.5-pro | 78.34 | 71.62 | 55.38 | 46.67 | 52.42 | 60.36 | 50.46 | 40.54 | 43.35 | 50.24 | 12.83 | 37.21 |
| 12 | phi-3-mini-4k-instruct | 69.12 | 59.46 | 55.38 | 34.29 | 44.80 | 53.78 | 49.42 | 37.84 | 45.41 | 41.43 | 9.57 | 35.56 |
| 13 | gpt-3.5-turbo | 63.13 | 66.67 | 58.54 | 23.81 | 47.77 | 53.98 | 42.51 | 41.22 | 49.60 | 39.29 | 6.41 | 34.70 |
| 14 | gemma2-9b-it | 72.81 | 50.45 | 55.70 | 43.81 | 40.52 | 53.54 | 45.62 | 32.88 | 47.15 | 44.52 | 7.90 | 34.63 |
| 15 | yi-1.5-34b-chat | 68.66 | 53.60 | 56.96 | 40.95 | 39.96 | 52.95 | 40.90 | 33.33 | 48.34 | 44.76 | 7.16 | 33.87 |
| 16 | llama-3.1-70b-instruct | 80.18 | 74.77 | 61.39 | 40.00 | 44.61 | 60.22 | 44.24 | 38.06 | 45.09 | 38.57 | 7.99 | 33.67 |
| 17 | llama-3.1-70b | 75.58 | 51.80 | 57.59 | 45.71 | 47.77 | 56.97 | 43.89 | 35.81 | 41.46 | 45.48 | 10.50 | 33.60 |
| 18 | qwen2.5-7b | 69.12 | 54.95 | 54.11 | 40.00 | 49.07 | 54.62 | 43.55 | 34.46 | 43.43 | 45.00 | 8.36 | 33.37 |
| 19 | yi-1.5-34b | 70.97 | 58.11 | 57.59 | 41.90 | 48.33 | 56.43 | 40.55 | 35.59 | 42.96 | 43.10 | 7.25 | 32.22 |
| 20 | llama-3-8b-instruct | 60.83 | 49.55 | 52.22 | 29.52 | 47.40 | 50.15 | 43.78 | 26.35 | 41.30 | 36.19 | 9.76 | 31.34 |
| 21 | mixtral-8x7b-instruct-v0.1 | 59.91 | 45.05 | 34.18 | 16.19 | 43.12 | 41.36 | 42.97 | 31.31 | 39.08 | 34.52 | 8.36 | 30.48 |
| 22 | llama-3.1-8b-instruct | 63.13 | 51.80 | 56.96 | 25.71 | 41.82 | 50.49 | 39.06 | 22.97 | 41.85 | 33.57 | 8.55 | 29.54 |
| 23 | mixtral-8x7b-v0.1 | 67.28 | 39.64 | 53.48 | 39.05 | 45.72 | 51.38 | 39.06 | 31.53 | 36.00 | 36.19 | 8.83 | 29.00 |
| 24 | llama-3.2-3b-instruct | 57.14 | 39.19 | 49.05 | 23.81 | 34.94 | 43.37 | 40.55 | 21.40 | 38.29 | 32.62 | 8.09 | 28.36 |
| 25 | yi-1.5-6b | 53.00 | 59.01 | 51.27 | 30.48 | 42.75 | 48.08 | 33.53 | 29.95 | 39.16 | 35.00 | 6.13 | 27.80 |
| 26 | qwen1.5-7b | 58.99 | 54.50 | 47.15 | 37.14 | 38.29 | 47.10 | 36.64 | 33.11 | 32.99 | 35.95 | 6.78 | 27.16 |
| 27 | mistral-7b-v0.1 | 58.06 | 42.79 | 50.63 | 37.14 | 43.87 | 48.18 | 33.06 | 29.05 | 34.10 | 35.24 | 7.16 | 26.33 |
| 28 | mistral-7b-instruct-v0.1 | 57.60 | 45.05 | 40.82 | 21.90 | 42.19 | 43.27 | 35.48 | 27.48 | 34.41 | 33.33 | 6.04 | 26.28 |
| 29 | llama-3.1-8b | 62.67 | 40.54 | 51.58 | 30.48 | 44.98 | 48.82 | 34.22 | 28.83 | 32.52 | 33.57 | 7.99 | 26.11 |
| 30 | openchat-3.5 | 60.83 | 39.64 | 49.68 | 25.71 | 36.25 | 44.94 | 35.94 | 22.75 | 33.86 | 30.24 | 8.46 | 26.01 |
| 31 | yi-1.5-6b-chat | 37.33 | 53.15 | 48.10 | 25.71 | 38.85 | 41.60 | 30.30 | 32.43 | 32.99 | 34.05 | 6.88 | 25.56 |
| 32 | llama-3-8b | 59.91 | 35.14 | 50.95 | 26.67 | 43.31 | 46.61 | 32.37 | 24.10 | 31.09 | 29.29 | 7.53 | 24.19 |
| 33 | deepseek-7b-chat | 49.31 | 27.48 | 45.57 | 20.00 | 31.41 | 38.02 | 30.76 | 18.92 | 34.97 | 27.62 | 5.11 | 23.67 |
| 34 | qwen1.5-7b-chat | 32.26 | 50.00 | 11.71 | 13.33 | 35.32 | 26.67 | 27.42 | 26.58 | 27.22 | 33.57 | 6.41 | 22.35 |
| 35 | llama-2-7b-chat | 38.25 | 36.49 | 39.24 | 13.33 | 35.50 | 35.07 | 26.04 | 16.89 | 31.49 | 26.67 | 5.30 | 21.32 |
| 36 | deepseek-7b-base | 49.77 | 36.49 | 44.94 | 18.10 | 36.62 | 40.08 | 27.88 | 18.47 | 28.16 | 25.24 | 5.39 | 20.73 |
| 37 | llama-2-7b | 54.38 | 15.77 | 44.62 | 19.05 | 36.43 | 38.75 | 26.61 | 18.47 | 27.29 | 24.05 | 6.78 | 20.43 |
| 38 | llama-3.2-1b-instruct | 41.47 | 40.99 | 37.66 | 14.29 | 25.09 | 33.10 | 28.00 | 13.06 | 24.68 | 24.76 | 6.69 | 19.38 |
| 39 | gemma-7b-it | 15.21 | 37.84 | 45.25 | 8.57 | 13.38 | 25.83 | 13.02 | 1.80 | 35.52 | 20.24 | 10.04 | 18.74 |
| 40 | gemma2-9b | 16.13 | 47.75 | 11.71 | 9.52 | 11.34 | 16.26 | 11.29 | 8.33 | 10.60 | 11.67 | 3.62 | 8.77 |

Overall framework and pipeline

Framework Diagram

An overview of our proposed CLR-Bench. Dataset construction: domain experts first curate a condensed hierarchical topic graph to guide the collection of the five types of questions. GPT-4o is then carefully instructed to assist the experts in generating gold rationales. Benchmark evaluation: we formally define standardized criteria for each type of question and the corresponding rationale.
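For concreteness, the snippet below sketches how a single benchmark item might look once the experts have attached a gold rationale. The field names and JSON layout are hypothetical, chosen only to mirror the topic graph, the five question types, and the expert explanations described above; they are not the dataset's actual schema.

```python
# Hypothetical record layout for one CLR-Bench item (field names are our own,
# not the dataset's actual schema). It mirrors the construction described above:
# a discipline and topic from the hierarchical topic graph, one of the five
# question types, the gold answer, and an expert-curated gold rationale.

import json

example_item = {
    "discipline": "Computer Architecture",   # one of the 16 college disciplines
    "topic": "Cache coherence",              # node in the condensed topic graph
    "question_type": "MC",                   # MC | MS | TF | FB | OE
    "question": "Which of the following is a cache-coherence protocol?",
    "choices": ["A. MESI", "B. TCP", "C. RAID", "D. AES"],
    "answer": "A",
    "gold_rationale": "MESI maintains coherence by tracking each cache line as "
                      "Modified, Exclusive, Shared, or Invalid ...",
}

print(json.dumps(example_item, indent=2))
```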


Observations and Findings

Bar Chart

LLMs tend to ‘guess’ answers.
A detailed analysis of Q→AR scores reveals that despite their relatively high accuracy in answering questions (Q→A), many LLMs struggle significantly to provide coherent and accurate rationales.
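To make this gap concrete, the short script below recomputes the absolute drop from Q→A to Q→AR for a few leaderboard entries. The overall scores are copied directly from the table above; the script itself is purely illustrative.

```python
# Overall Q->A and Q->AR scores copied from the leaderboard table above;
# the script simply reports the absolute drop between the two criteria.

leaderboard = {
    "qwen2.5-32b-instruct": (63.31, 42.29),
    "gpt-4o": (60.76, 41.60),
    "gpt-4-turbo": (63.31, 39.00),
    "llama-3.1-70b-instruct": (60.22, 33.67),
}

for model, (qa, qar) in leaderboard.items():
    drop = qa - qar
    print(f"{model:26s} Q->A={qa:5.2f}  Q->AR={qar:5.2f}  drop={drop:5.2f}")
```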


Radar Chart

LLMs are not good at non-MC questions.
We use radar diagrams to visualize the expertise of LLMs across question types. Performance on non-MC questions presents another significant challenge: many models demonstrate reasonable performance on MC, yet their accuracy drops significantly when handling OE or FB questions, which require deeper reasoning and articulation.


Topic Chart

Discipline-specific insights
(i) Context-intensive tasks. Topics such as AI Introduction and Ethics of CS and AI require a solid understanding of conceptual frameworks. All three state-of-the-art LLMs show surprisingly low performance on these topics, with Q→AR scores of only around 12.5% and 6.5%. This suggests challenges in grasping nuanced ethical considerations. Similarly, scores in the AI Introduction domain reflect the models' difficulty in conveying complex ideas coherently. These results highlight that while LLMs possess substantial factual knowledge, they often struggle to articulate and rationalize that knowledge in contextually rich environments.


(ii) Reasoning-intensive tasks. Mathematics and Computer Architecture present a different situation from context-intensive tasks. All three models perform satisfactorily under the Q→A criterion; however, their performance decreases sharply on Q→AR, with an accuracy drop of over 30% on average. This reflects the need for improved training techniques that foster deeper cognition for reasoning-intensive tasks.

Contact

For inquiries or contributions, please contact us at hanson.dong@connect.polyu.hk.