Human Evaluation Demo

Try the VENUSS evaluation task yourself: classify 20 driving scenarios, then compare your accuracy against vision-language models (VLMs) and human evaluators.

How it works

  • You'll see 20 driving scenarios from the CoVLA dataset, each shown as a sequence of 4 frames.
  • For each scenario, answer 7 multiple-choice questions about the ego vehicle's behavior (an illustrative data shape is sketched after this list).
  • Choose between Collage (a static grid of the 4 frames) or GIF (animated) presentation mode.
  • At the end, see how your accuracy compares to published baselines.
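
Concretely, each scenario pairs the 4 frames with its 7 questions. The TypeScript below is a minimal sketch of that shape; the type and field names are assumptions made for illustration, not the demo's actual code or the CoVLA schema.

    interface Question {
      category: string;  // one of the 7 behavior categories
      prompt: string;    // the question text
      options: string[]; // the A-E answer choices
      gold: string;      // ground-truth answer key
    }

    interface Scenario {
      id: string;
      frames: [string, string, string, string]; // URLs of the 4 frames shown
      questions: Question[];                    // always 7 per scenario
    }

    // 20 scenarios x 7 questions = 140 graded answers in total.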

Keyboard shortcuts

  • 1–7 — Jump to question
  • A–E — Select option for the focused question
  • Enter — Next scenario (when all questions are answered)
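
A minimal sketch of how these shortcuts could be wired up in the browser; the state variables and the nextScenario helper are hypothetical stand-ins for the page's real logic.

    let focused = 0;                                 // index of the focused question (0-6)
    const answers: (string | null)[] = Array(7).fill(null);

    function nextScenario(): void {
      // hypothetical: load the next scenario's frames and reset answers
    }

    document.addEventListener("keydown", (e) => {
      const k = e.key.toUpperCase();
      if (k >= "1" && k <= "7") {
        focused = Number(k) - 1;                     // 1-7: jump to that question
      } else if (k >= "A" && k <= "E") {
        answers[focused] = k;                        // A-E: select an option
      } else if (e.key === "Enter" && answers.every(a => a !== null)) {
        nextScenario();                              // Enter: advance when all 7 answered
      }
    });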

Your Results

After the final scenario, the demo reports your overall score as N / 140 correct (20 scenarios × 7 questions) with the corresponding percentage, plus a per-category table of accuracy, precision, recall, and F1.
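
As a rough sketch of how such a table could be computed: accuracy is the fraction of correct answers within a category, and precision, recall, and F1 can be macro-averaged over the answer options. The data shapes and the macro-averaging choice below are assumptions for illustration, not necessarily how the demo or the VENUSS paper scores.

    interface Graded {
      category: string;  // which of the 7 questions this answer belongs to
      predicted: string; // your choice (A-E)
      gold: string;      // ground-truth answer
    }

    function score(answers: Graded[]) {
      const options = [...new Set(answers.map(a => a.gold))];
      const accuracy =
        answers.filter(a => a.predicted === a.gold).length / answers.length;
      let p = 0, r = 0, f1 = 0;
      for (const o of options) {
        const tp = answers.filter(a => a.predicted === o && a.gold === o).length;
        const fp = answers.filter(a => a.predicted === o && a.gold !== o).length;
        const fn = answers.filter(a => a.predicted !== o && a.gold === o).length;
        const prec = tp + fp > 0 ? tp / (tp + fp) : 0;
        const rec  = tp + fn > 0 ? tp / (tp + fn) : 0;
        p += prec;
        r += rec;
        f1 += prec + rec > 0 ? (2 * prec * rec) / (prec + rec) : 0;
      }
      const n = options.length;
      return { accuracy, precision: p / n, recall: r / n, f1: f1 / n };
    }

    // One row per category, as in the results table:
    function byCategory(all: Graded[]) {
      return [...new Set(all.map(a => a.category))].map(c => ({
        category: c,
        ...score(all.filter(a => a.category === c)),
      }));
    }

Macro-averaging weights every answer option equally regardless of how often it appears; a micro-average would instead weight by frequency.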

How do you compare?

Human evaluators (collage): 54–63% accuracy

Human evaluators (GIF): 62–65% accuracy

Best VLM (Qwen-VL-Max): 57% accuracy

Note: Published baselines were evaluated on all 108 scenarios; your results come from a 20-scenario subset, so treat the comparison as indicative rather than exact.
