Human Evaluation Demo

Try the VENUSS evaluation task yourself: classify 20 driving scenarios, then compare your accuracy against vision-language models (VLMs) and human evaluators.

How it works

  • You'll see 20 driving scenarios from the CoVLA dataset, each shown as a sequence of 4 frames.
  • For each scenario, answer 7 multiple-choice questions about the ego vehicle's behavior (an illustrative data shape is sketched after this list).
  • Choose between Collage (a static grid of the 4 frames) or GIF (animated) presentation mode.
  • At the end, see how your accuracy compares to published baselines.
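
Concretely, each scenario pairs the 4 frames with its 7 questions. The TypeScript below is a minimal sketch of that shape; the type and field names are assumptions made for illustration, not the demo's actual code or the CoVLA schema.

    interface Question {
      category: string;  // one of the 7 behavior categories
      prompt: string;    // the question text
      options: string[]; // the A-E answer choices
      gold: string;      // ground-truth answer key
    }

    interface Scenario {
      id: string;
      frames: [string, string, string, string]; // URLs of the 4 frames shown
      questions: Question[];                    // always 7 per scenario
    }

    // 20 scenarios x 7 questions = 140 graded answers in total.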

Keyboard shortcuts

  • 1–7 — Jump to question
  • A–E — Select option for the focused question
  • Enter — Next scenario (when all questions are answered)
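
A minimal sketch of how these shortcuts could be wired up in the browser; the state variables and the nextScenario helper are hypothetical stand-ins for the page's real logic.

    let focused = 0;                                 // index of the focused question (0-6)
    const answers: (string | null)[] = Array(7).fill(null);

    function nextScenario(): void {
      // hypothetical: load the next scenario's frames and reset answers
    }

    document.addEventListener("keydown", (e) => {
      const k = e.key.toUpperCase();
      if (k >= "1" && k <= "7") {
        focused = Number(k) - 1;                     // 1-7: jump to that question
      } else if (k >= "A" && k <= "E") {
        answers[focused] = k;                        // A-E: select an option
      } else if (e.key === "Enter" && answers.every(a => a !== null)) {
        nextScenario();                              // Enter: advance when all 7 answered
      }
    });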

Your Results

After the final scenario, the demo reports your overall score as N / 140 correct (20 scenarios × 7 questions) with the corresponding percentage, plus a per-category table of accuracy, precision, recall, and F1.
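
As a rough sketch of how such a table could be computed: accuracy is the fraction of correct answers within a category, and precision, recall, and F1 can be macro-averaged over the answer options. The data shapes and the macro-averaging choice below are assumptions for illustration, not necessarily how the demo or the VENUSS paper scores.

    interface Graded {
      category: string;  // which of the 7 questions this answer belongs to
      predicted: string; // your choice (A-E)
      gold: string;      // ground-truth answer
    }

    function score(answers: Graded[]) {
      const options = [...new Set(answers.map(a => a.gold))];
      const accuracy =
        answers.filter(a => a.predicted === a.gold).length / answers.length;
      let p = 0, r = 0, f1 = 0;
      for (const o of options) {
        const tp = answers.filter(a => a.predicted === o && a.gold === o).length;
        const fp = answers.filter(a => a.predicted === o && a.gold !== o).length;
        const fn = answers.filter(a => a.predicted !== o && a.gold === o).length;
        const prec = tp + fp > 0 ? tp / (tp + fp) : 0;
        const rec  = tp + fn > 0 ? tp / (tp + fn) : 0;
        p += prec;
        r += rec;
        f1 += prec + rec > 0 ? (2 * prec * rec) / (prec + rec) : 0;
      }
      const n = options.length;
      return { accuracy, precision: p / n, recall: r / n, f1: f1 / n };
    }

    // One row per category, as in the results table:
    function byCategory(all: Graded[]) {
      return [...new Set(all.map(a => a.category))].map(c => ({
        category: c,
        ...score(all.filter(a => a.category === c)),
      }));
    }

Macro-averaging weights every answer option equally regardless of how often it appears; a micro-average would instead weight by frequency.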

How do you compare?

Human evaluators (collage): 54–63% accuracy

Human evaluators (GIF): 62–65% accuracy

Best VLM (Qwen-VL-Max): 57% accuracy

Note: Published baselines were evaluated on all 108 scenarios; your results come from a 20-scenario subset, so treat the comparison as indicative rather than exact.
