CLR-Bench Logo

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Junnan Dong, Zijin Hong, Yuanchen Bei, Feiran Huang, Xinrun Wang, Xiao Huang


The Hong Kong Polytechnic University, Jinan University, Singapore Management University

🧷 Code 📜 Paper 🚩 GitHub
Running Example

Sketched Overview

Large language models (LLMs) have demonstrated remarkable performance across various language understanding tasks. While emerging benchmarks have been proposed to evaluate LLMs in domains such as mathematics and computer science, they merely measure accuracy in terms of the final prediction on multi-choice questions. This is insufficient to verify whether an LLM truly understands the reasoning behind a chosen answer. To fill this gap, we present CLR-Bench to comprehensively evaluate LLMs on complex college-level reasoning.

Contributions:

📌 We prioritize 16 challenging college disciplines in computer science and artificial intelligence. The dataset contains 1,018 questions across 5 types, i.e., multi-choice (MC), multi-select (MS), fill-in-blank (FB), open-ended (OE), and true-or-false (TF), each associated with detailed explanations from experts.
⚖ We formalize the evaluation criteria with two novel metrics: Q→A, which measures the performance of direct answer prediction, and Q→AR, which evaluates the joint ability to answer and provide a rationale simultaneously (a minimal scoring sketch follows this list).
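The sketch below illustrates, in Python, one plausible way the two criteria could be scored for a single question. It is a minimal sketch under our own assumptions: the function names (`score_qa`, `score_qar`), the record fields, and the all-or-nothing rationale check are illustrative placeholders, not the exact grading protocol of CLR-Bench.

```python
# Minimal scoring sketch (illustrative only, not the repository's actual API).
# Q->A  : credit for the final answer alone.
# Q->AR : credit only when the answer AND the rationale are both judged correct
#         (an assumption; the real protocol may grade rationales differently).

from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str
    rationale: str


@dataclass
class GoldItem:
    answer: str
    gold_rationale: str


def rationale_is_correct(pred_rationale: str, gold_rationale: str) -> bool:
    """Placeholder judge; a real implementation would compare against the
    expert-written gold rationale with an LLM or human grader."""
    return pred_rationale.strip().lower() == gold_rationale.strip().lower()


def score_qa(pred: Prediction, gold: GoldItem) -> float:
    """Q->A: accuracy of the predicted answer only."""
    return 1.0 if pred.answer.strip().lower() == gold.answer.strip().lower() else 0.0


def score_qar(pred: Prediction, gold: GoldItem) -> float:
    """Q->AR: the prediction counts only if answer and rationale are both correct."""
    answer_ok = score_qa(pred, gold) == 1.0
    rationale_ok = rationale_is_correct(pred.rationale, gold.gold_rationale)
    return 1.0 if (answer_ok and rationale_ok) else 0.0
```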

Key Insights:

🤔 LLMs are potentially 'guessing' the answers to college-level questions instead of truly understanding the rationale, as evidenced by a dramatic drop from Q→A to Q→AR.
🔍 Model size does not inherently guarantee superior reasoning performance, despite larger models often achieving higher accuracy in Q→A. Several smaller models notably exhibit stronger performance in Q→AR, surpassing larger ones by providing accurate and coherent rationales.

In the table below, the left block of per-type columns together with the Q→A column reports direct answer-prediction performance, while the right block together with the Q→AR column reports reasoning performance, i.e., answer and rationale jointly.

| # | Models | MC | MS | TF | FB | OE | Q→A | MC | MS | TF | FB | OE | Q→AR |
|---|--------|------|------|------|------|------|------|------|------|------|------|------|------|
| 1 | qwen2.5-32b-instruct | 80.18 | 75.23 | 63.29 | 44.76 | 52.04 | 63.31 | 55.41 | 45.50 | 56.00 | 48.33 | 11.80 | 42.29 |
| 2 | deepseek-chat | 78.80 | 77.03 | 62.66 | 40.95 | 51.30 | 62.43 | 55.41 | 44.14 | 55.14 | 46.43 | 13.48 | 42.09 |
| 3 | gpt-4o | 84.33 | 76.58 | 63.92 | 54.29 | 34.01 | 60.76 | 53.11 | 44.82 | 57.52 | 51.19 | 8.55 | 41.60 |
| 4 | qwen2.5-72b-instruct | 81.11 | 77.93 | 63.61 | 45.71 | 52.23 | 64.05 | 50.58 | 45.50 | 53.56 | 48.10 | 13.10 | 40.79 |
| 5 | gpt-4-turbo | 82.49 | 78.38 | 63.92 | 50.48 | 45.91 | 63.31 | 51.73 | 45.95 | 49.45 | 51.43 | 8.74 | 39.00 |
| 6 | claude-3-opus | 80.18 | 81.53 | 62.66 | 52.38 | 44.42 | 62.57 | 50.35 | 43.02 | 50.08 | 50.00 | 9.20 | 38.56 |
| 7 | claude-3-sonnet | 76.96 | 76.58 | 59.81 | 42.86 | 48.33 | 60.51 | 50.69 | 44.59 | 48.26 | 45.95 | 11.80 | 38.51 |
| 8 | qwen2.5-72b | 80.18 | 73.42 | 60.44 | 47.62 | 52.04 | 62.52 | 50.00 | 43.02 | 45.97 | 49.52 | 11.99 | 37.89 |
| 9 | phi-3-medium-4k-instruct | 74.65 | 63.96 | 56.01 | 38.10 | 48.88 | 57.12 | 51.50 | 33.56 | 49.68 | 44.05 | 11.34 | 37.60 |
| 10 | qwen2.5-7b-instruct | 70.97 | 63.06 | 58.23 | 37.14 | 49.26 | 56.93 | 52.65 | 34.23 | 50.32 | 41.90 | 8.92 | 37.25 |
| 11 | gemini-1.5-pro | 78.34 | 71.62 | 55.38 | 46.67 | 52.42 | 60.36 | 50.46 | 40.54 | 43.35 | 50.24 | 12.83 | 37.21 |
| 12 | phi-3-mini-4k-instruct | 69.12 | 59.46 | 55.38 | 34.29 | 44.80 | 53.78 | 49.42 | 37.84 | 45.41 | 41.43 | 9.57 | 35.56 |
| 13 | gpt-3.5-turbo | 63.13 | 66.67 | 58.54 | 23.81 | 47.77 | 53.98 | 42.51 | 41.22 | 49.60 | 39.29 | 6.41 | 34.70 |
| 14 | gemma2-9b-it | 72.81 | 50.45 | 55.70 | 43.81 | 40.52 | 53.54 | 45.62 | 32.88 | 47.15 | 44.52 | 7.90 | 34.63 |
| 15 | yi-1.5-34b-chat | 68.66 | 53.60 | 56.96 | 40.95 | 39.96 | 52.95 | 40.90 | 33.33 | 48.34 | 44.76 | 7.16 | 33.87 |
| 16 | llama-3.1-70b-instruct | 80.18 | 74.77 | 61.39 | 40.00 | 44.61 | 60.22 | 44.24 | 38.06 | 45.09 | 38.57 | 7.99 | 33.67 |
| 17 | llama-3.1-70b | 75.58 | 51.80 | 57.59 | 45.71 | 47.77 | 56.97 | 43.89 | 35.81 | 41.46 | 45.48 | 10.50 | 33.60 |
| 18 | qwen2.5-7b | 69.12 | 54.95 | 54.11 | 40.00 | 49.07 | 54.62 | 43.55 | 34.46 | 43.43 | 45.00 | 8.36 | 33.37 |
| 19 | yi-1.5-34b | 70.97 | 58.11 | 57.59 | 41.90 | 48.33 | 56.43 | 40.55 | 35.59 | 42.96 | 43.10 | 7.25 | 32.22 |
| 20 | llama-3-8b-instruct | 60.83 | 49.55 | 52.22 | 29.52 | 47.40 | 50.15 | 43.78 | 26.35 | 41.30 | 36.19 | 9.76 | 31.34 |
| 21 | mixtral-8x7b-instruct-v0.1 | 59.91 | 45.05 | 34.18 | 16.19 | 43.12 | 41.36 | 42.97 | 31.31 | 39.08 | 34.52 | 8.36 | 30.48 |
| 22 | llama-3.1-8b-instruct | 63.13 | 51.80 | 56.96 | 25.71 | 41.82 | 50.49 | 39.06 | 22.97 | 41.85 | 33.57 | 8.55 | 29.54 |
| 23 | mixtral-8x7b-v0.1 | 67.28 | 39.64 | 53.48 | 39.05 | 45.72 | 51.38 | 39.06 | 31.53 | 36.00 | 36.19 | 8.83 | 29.00 |
| 24 | llama-3.2-3b-instruct | 57.14 | 39.19 | 49.05 | 23.81 | 34.94 | 43.37 | 40.55 | 21.40 | 38.29 | 32.62 | 8.09 | 28.36 |
| 25 | yi-1.5-6b | 53.00 | 59.01 | 51.27 | 30.48 | 42.75 | 48.08 | 33.53 | 29.95 | 39.16 | 35.00 | 6.13 | 27.80 |
| 26 | qwen1.5-7b | 58.99 | 54.50 | 47.15 | 37.14 | 38.29 | 47.10 | 36.64 | 33.11 | 32.99 | 35.95 | 6.78 | 27.16 |
| 27 | mistral-7b-v0.1 | 58.06 | 42.79 | 50.63 | 37.14 | 43.87 | 48.18 | 33.06 | 29.05 | 34.10 | 35.24 | 7.16 | 26.33 |
| 28 | mistral-7b-instruct-v0.1 | 57.60 | 45.05 | 40.82 | 21.90 | 42.19 | 43.27 | 35.48 | 27.48 | 34.41 | 33.33 | 6.04 | 26.28 |
| 29 | llama-3.1-8b | 62.67 | 40.54 | 51.58 | 30.48 | 44.98 | 48.82 | 34.22 | 28.83 | 32.52 | 33.57 | 7.99 | 26.11 |
| 30 | openchat-3.5 | 60.83 | 39.64 | 49.68 | 25.71 | 36.25 | 44.94 | 35.94 | 22.75 | 33.86 | 30.24 | 8.46 | 26.01 |
| 31 | yi-1.5-6b-chat | 37.33 | 53.15 | 48.10 | 25.71 | 38.85 | 41.60 | 30.30 | 32.43 | 32.99 | 34.05 | 6.88 | 25.56 |
| 32 | llama-3-8b | 59.91 | 35.14 | 50.95 | 26.67 | 43.31 | 46.61 | 32.37 | 24.10 | 31.09 | 29.29 | 7.53 | 24.19 |
| 33 | deepseek-7b-chat | 49.31 | 27.48 | 45.57 | 20.00 | 31.41 | 38.02 | 30.76 | 18.92 | 34.97 | 27.62 | 5.11 | 23.67 |
| 34 | qwen1.5-7b-chat | 32.26 | 50.00 | 11.71 | 13.33 | 35.32 | 26.67 | 27.42 | 26.58 | 27.22 | 33.57 | 6.41 | 22.35 |
| 35 | llama-2-7b-chat | 38.25 | 36.49 | 39.24 | 13.33 | 35.50 | 35.07 | 26.04 | 16.89 | 31.49 | 26.67 | 5.30 | 21.32 |
| 36 | deepseek-7b-base | 49.77 | 36.49 | 44.94 | 18.10 | 36.62 | 40.08 | 27.88 | 18.47 | 28.16 | 25.24 | 5.39 | 20.73 |
| 37 | llama-2-7b | 54.38 | 15.77 | 44.62 | 19.05 | 36.43 | 38.75 | 26.61 | 18.47 | 27.29 | 24.05 | 6.78 | 20.43 |
| 38 | llama-3.2-1b-instruct | 41.47 | 40.99 | 37.66 | 14.29 | 25.09 | 33.10 | 28.00 | 13.06 | 24.68 | 24.76 | 6.69 | 19.38 |
| 39 | gemma-7b-it | 15.21 | 37.84 | 45.25 | 8.57 | 13.38 | 25.83 | 13.02 | 1.80 | 35.52 | 20.24 | 10.04 | 18.74 |
| 40 | gemma2-9b | 16.13 | 47.75 | 11.71 | 9.52 | 11.34 | 16.26 | 11.29 | 8.33 | 10.60 | 11.67 | 3.62 | 8.77 |

Overall framework and pipeline

Framework Diagram

An overview of our proposed CLR-Bench. Dataset construction: domain experts first curate a condensed hierarchical topic graph to guide the collection of the five types of questions. GPT-4o is then carefully instructed to assist the experts in generating gold rationales. Benchmark evaluation: we formally define standardized criteria for each type of question and the corresponding rationale.
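For concreteness, the snippet below sketches how a single benchmark item might look once the experts have attached a gold rationale. The field names and JSON layout are hypothetical, chosen only to mirror the topic graph, the five question types, and the expert explanations described above; they are not the dataset's actual schema.

```python
# Hypothetical record layout for one CLR-Bench item (field names are our own,
# not the dataset's actual schema). It mirrors the construction described above:
# a discipline and topic from the hierarchical topic graph, one of the five
# question types, the gold answer, and an expert-curated gold rationale.

import json

example_item = {
    "discipline": "Computer Architecture",   # one of the 16 college disciplines
    "topic": "Cache coherence",              # node in the condensed topic graph
    "question_type": "MC",                   # MC | MS | TF | FB | OE
    "question": "Which of the following is a cache-coherence protocol?",
    "choices": ["A. MESI", "B. TCP", "C. RAID", "D. AES"],
    "answer": "A",
    "gold_rationale": "MESI maintains coherence by tracking each cache line as "
                      "Modified, Exclusive, Shared, or Invalid ...",
}

print(json.dumps(example_item, indent=2))
```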


Observations and Findings

Bar Chart

LLMs tend to ‘guess’ answers.
A detailed analysis of Q→AR scores reveals that despite their relatively high accuracy in answering questions (Q→A), many LLMs struggle significantly to provide coherent and accurate rationales.
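To make this gap concrete, the short script below recomputes the absolute drop from Q→A to Q→AR for a few leaderboard entries. The overall scores are copied directly from the table above; the script itself is purely illustrative.

```python
# Overall Q->A and Q->AR scores copied from the leaderboard table above;
# the script simply reports the absolute drop between the two criteria.

leaderboard = {
    "qwen2.5-32b-instruct": (63.31, 42.29),
    "gpt-4o": (60.76, 41.60),
    "gpt-4-turbo": (63.31, 39.00),
    "llama-3.1-70b-instruct": (60.22, 33.67),
}

for model, (qa, qar) in leaderboard.items():
    drop = qa - qar
    print(f"{model:26s} Q->A={qa:5.2f}  Q->AR={qar:5.2f}  drop={drop:5.2f}")
```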


Radar Chart

LLMs are not good at non-MC questions.
We use radar diagrams to visualize the expertise of LLMs across question types. Performance on non-MC questions presents another significant challenge: many models demonstrate reasonable performance on MC, yet their accuracy drops significantly when handling OE or FB questions, which require deeper reasoning and articulation.


Topic Chart

Discipline-specific insights
(i) Context-intensive tasks. Topics such as AI Introduction and Ethics of CS and AI require a solid understanding of conceptual frameworks. All three state-of-the-art LLMs show surprisingly low performance on these topics, with Q→AR scores of only around 12.5% and 6.5%. This suggests challenges in grasping nuanced ethical considerations. Similarly, scores in the AI Introduction domain reflect the models' difficulty in conveying complex ideas coherently. These results highlight that while LLMs possess substantial factual knowledge, they often struggle to articulate and rationalize that knowledge in contextually rich environments.


(ii) Reasoning-intensive tasks. Mathematics and Computer Architecture present a different situation from context-intensive tasks. All three models perform satisfactorily under the Q→A criterion; however, their performance decreases sharply on Q→AR, with an accuracy drop of over 30% on average. This reflects the need for improved training techniques that foster deeper cognition for reasoning-intensive tasks.

Contact

For inquiries or contributions, please contact us at hanson.dong@connect.polyu.hk.