Experience the VENUSS evaluation task. Classify 20 driving scenarios, then compare your accuracy against VLMs and human evaluators.
| Category | Accuracy |
|---|---|
| Human evaluators (collage) | 54–63% |
| Human evaluators (GIF) | 62–65% |
| Best VLM (Qwen-VL-Max) | 57% |
Note: Published baselines were evaluated on all 108 scenarios. Your results are from a 20-scenario subset.
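For reference, accuracy and the related per-class metrics (precision, recall, F1) can be computed from predictions as follows. This is a minimal, generic sketch; the function name, labels, and arguments are illustrative and not part of the VENUSS codebase:

```python
def classification_metrics(y_true, y_pred, positive):
    """Overall accuracy, plus precision/recall/F1 for one target class."""
    assert len(y_true) == len(y_pred)
    # True positives, false positives, false negatives for the chosen class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Example with hypothetical scenario labels:
acc, prec, rec, f1 = classification_metrics(
    ["safe", "unsafe", "safe", "unsafe"],
    ["safe", "unsafe", "unsafe", "unsafe"],
    positive="unsafe",
)
```

On this example the function returns accuracy 0.75, precision 2/3, recall 1.0, and F1 0.8; over a 20-scenario subset each metric is quantized to multiples of 1/20, which is one reason subset scores can drift from the 108-scenario baselines.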