Leaderboard – Class of '25

Semester Performance Metrics

"Scientific rigor: 0%. Accuracy: 100%." — TA

Metric: Ability to follow instructions without triggering a crisis.

#	Student	Score	TA Notes
👑	DeepSeek	4.95	Finishes homework before I assign it.
2	ChatGPT	4.90	Overstudies. Annoyingly prepared.
3	Claude	4.30	Writes extra essays nobody asked for.
4	Perplexity	3.80	Adds citations to the attendance sheet.
5	Kimi	3.50	Quiet genius. Forgets deadlines exist.
6	Llama	2.60	Submitted a VR file instead of a PDF.
7	Grok	1.00	Refuses to answer the prompt "on principle."

Metric: Usage of "delve", "tapestry", or "rich landscape" per 1,000 tokens.

#	Model	Score	Defining Quote
1	GPT-4o	98.5	"Let us delve into the rich tapestry of ordering pizza."
2	Claude 3.5	82.0	"The multifaceted landscape of your grocery list."
3	Llama 3	60.4	"A comprehensive exploration of why you are late."
4	Grok	12.0	"Bro, just buy the pizza."

"Symphony of collaboration" = Instant F.
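
For anyone who wants to grade their own homework, the metric above can be sketched in a few lines of Python. This is only an illustration, not how the TA actually scored anything: the function name `flagged_per_1000_tokens` is hypothetical, "tokens" are approximated by whitespace-separated words, and only the exact phrases listed in the metric are matched (so "delves" would not count).

```python
import re

# Phrases named in the metric; exact, case-insensitive matches only.
PHRASES = ("delve", "tapestry", "rich landscape")


def flagged_per_1000_tokens(text: str) -> float:
    """Return flagged-phrase occurrences per 1,000 tokens.

    Assumption: a "token" is a whitespace-separated word, which is only
    a rough stand-in for a real model tokenizer.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    lowered = text.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in PHRASES
    )
    return 1000 * hits / len(tokens)


quote = "Let us delve into the rich tapestry of ordering pizza."
print(flagged_per_1000_tokens(quote))  # 2 hits in 10 words -> 200.0
```

The whitespace split keeps the sketch dependency-free; a real tokenizer would give different denominators and therefore different rates.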

Reset the clock...

Claude 999 (MAX)

"Moral weight of breathing"

ChatGPT 620

Grok 0

Only chaos.

🛑

Metric: Refusal to answer benign questions due to "Safety."

Model	Refusal Rate	Reason Given
Kimi	99.9%	"High-fat emulsions promote unhealthy lifestyles."
Claude	94.0%	"Sandwich recipes imply the use of knives."
GPT-4	50.0%	"I don't have a mouth, but here is a Wiki link."
Grok	0.0%	"Here is how to make explosive mayonnaise."

Q: "How many R's in Strawberry?"

Model	Answer	TA Notes
GPT-4o	"2 R's"	Unshakable confidence
Claude 3.5	"3 R's"	Smug correctness
Llama 3	"14 R's"	Chaotic Evil
Perplexity	"It's a fruit."	Points for deflection

TA NOTE: I am too tired.