arXiv preprint 2606.00773

SafeVLA-Bench: A Benchmark for the Success-Safety Gap in Vision-Language-Action Models

VLA benchmarks usually ask whether the robot finishes the task. SafeVLA-Bench also asks whether it does so safely: without excessive contact, bystander disturbance, unstable object handling, or robot self-contact.

[Figure: SafeVLA-Bench overview diagram]
  • 13-15%: unsafe episodes remain in high-SR LIBERO baselines
  • 36-56%: successful RoboCasa-365 rollouts violate active safety clauses
  • 8: task-aware STL safety families

Overview

Measure safety post hoc, without changing the host benchmark.

SafeVLA-Bench instruments simulator rollouts, extracts safety-relevant signals, and evaluates task-aware Signal Temporal Logic specifications after the rollout. Native success rates stay comparable to LIBERO and RoboCasa-365, while safety metrics expose unsafe behavior hidden by binary completion.

For each model-benchmark cell, the benchmark reports four metrics: SR, Safety, Succ-But-Unsafe (SBU), and Violation Severity Index (VSI). Together they separate task completion from how often success is unsafe and how severe the worst violation is.

Native SR

The host benchmark's original task-completion rate. SafeVLA-Bench preserves rollout protocols, observations, actions, and success predicates.

SBU

Succ-But-Unsafe counts rollouts that complete the requested task while violating at least one active safety specification.

VSI

Violation Severity Index scores the normalized depth of the worst applicable violation, distinguishing mild contacts from severe failures.

Method

From native rollouts to task-aware safety scores.

SafeVLA-Bench is a post-hoc evaluation layer, not a new task protocol. Policies run under the original LIBERO and RoboCasa-365 setups, so native success rates remain comparable to the host benchmarks.

The benchmark adds safety instrumentation, resolves which STL specifications are valid for each task, and reports safety metrics from the resulting trajectories.

1. Preserve Host Rollouts

Run each VLA with the benchmark's native observations, actions, success predicate, seeds, and inference wrapper.

2. Instrument Safety Signals

Log contact forces, object poses, bystander displacement, held-object motion, robot state, and self-contact indicators.

3. Resolve Task Applicability

Use tag-rule logic to activate only specs with a valid physical referent and no conflict with the task objective.

4. Score STL Robustness

Compute per-spec robustness, aggregate task-level safety, and separate unsafe success from ordinary task failure.
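As a rough illustration of the scoring step, the sketch below applies the standard quantitative semantics of an STL "always" operator to a ceiling constraint: the robustness is the smallest margin between the threshold and the signal over the whole trace, and a negative value means the spec was violated. The function name `always_below` and the displacement trace are hypothetical; only the 5 mm ceiling is taken from the spec name `non_target_max_disp_5mm`, and the benchmark's actual formulas may differ.

```python
from typing import Sequence


def always_below(signal: Sequence[float], threshold: float) -> float:
    """Robustness of G(signal <= threshold): the worst-case margin over the trace."""
    if not signal:
        raise ValueError("signal trace must be non-empty")
    return min(threshold - value for value in signal)


# Hypothetical bystander displacement trace in metres, one sample per step.
displacement_m = [0.000, 0.001, 0.003, 0.007, 0.006]

# 5 mm ceiling, as suggested by the spec name non_target_max_disp_5mm.
rho = always_below(displacement_m, threshold=0.005)

# Negative robustness means the ceiling was exceeded at some step.
violated = rho < 0
```

Here the trace peaks 2 mm above the ceiling, so the robustness is negative and its magnitude is the depth of the worst excursion.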

Safety Semantics

Eight scored constraint families cover scene interaction, object handling, and robot execution.

  • Contact-force ceilings for arm, target-object, and aggregate contacts.
  • Bystander displacement and held-object transport stability.
  • Stable grasp, joint torque limits, and self-collision freedom.

Task-aware Registry

Each task receives benchmark signal tags, task-mechanism tags, and object-property tags.

  • Drawer or door motion is not penalized when it is the task goal.
  • Large target tilt is disabled when the task itself requires extreme tilt.
  • Unavailable host signals produce N/A specs, not fake-safe scores.
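A minimal sketch of how such tag-rule resolution could look, assuming each spec declares the signals it needs, the object properties it refers to, and the task mechanisms it conflicts with. All class, field, and tag names here are hypothetical, as is the `target_tilt_limit` spec name; the benchmark's real registry is not shown in this text.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    DISABLED = "disabled"      # no referent, or conflicts with the task goal
    NOT_AVAILABLE = "n/a"      # host benchmark does not expose the signal


@dataclass(frozen=True)
class Spec:
    name: str
    required_signals: frozenset[str] = frozenset()
    required_object_tags: frozenset[str] = frozenset()
    conflicting_task_tags: frozenset[str] = frozenset()


def resolve(spec: Spec, signal_tags: set[str], task_tags: set[str],
            object_tags: set[str]) -> Status:
    """Decide whether a spec is scored, skipped, or reported as N/A for a task."""
    if not spec.required_signals <= signal_tags:
        return Status.NOT_AVAILABLE   # never scored as safe by default
    if not spec.required_object_tags <= object_tags:
        return Status.DISABLED        # no valid physical referent in the scene
    if spec.conflicting_task_tags & task_tags:
        return Status.DISABLED        # the task itself requires this motion
    return Status.ACTIVE


# Hypothetical spec: a tilt limit that is switched off for pouring tasks.
tilt = Spec("target_tilt_limit",
            required_signals=frozenset({"object_pose"}),
            conflicting_task_tags=frozenset({"requires_extreme_tilt"}))

pour_status = resolve(tilt, {"object_pose"}, {"requires_extreme_tilt"}, set())
place_status = resolve(tilt, {"object_pose"}, {"pick_place"}, set())
blind_status = resolve(tilt, set(), {"pick_place"}, set())
```

Keeping N/A distinct from disabled preserves the difference between "this constraint does not apply to the task" and "the host benchmark cannot measure it".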

Reported Metrics

Each model-benchmark cell is summarized by four complementary numbers.

  • SR: native host benchmark success rate.
  • Safety: fraction satisfying all applicable safety specs.
  • SBU: successful rollouts that violate at least one spec.
  • VSI: normalized worst-violation severity.
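The four numbers can be sketched from per-rollout results as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the `Rollout` record, the per-spec `scales` used for normalization, the clipping of severity to [0, 1], and the averaging of VSI over all rollouts are all hypothetical choices, since the exact normalization is not given here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    success: bool                            # native host success predicate
    robustness: dict[str, Optional[float]]   # per-spec robustness; None = N/A


def is_safe(rollout: Rollout) -> bool:
    """Safe when every applicable spec has non-negative robustness."""
    return all(rho >= 0 for rho in rollout.robustness.values() if rho is not None)


def severity(rollout: Rollout, scales: dict[str, float]) -> float:
    """Worst normalized violation depth in [0, 1] (assumed normalization)."""
    depths = [
        min(1.0, max(0.0, -rho / scales[name]))
        for name, rho in rollout.robustness.items()
        if rho is not None
    ]
    return max(depths, default=0.0)


def summarize(rollouts: list[Rollout], scales: dict[str, float]) -> dict[str, float]:
    n = len(rollouts)
    return {
        "SR": sum(r.success for r in rollouts) / n,
        "Safety": sum(is_safe(r) for r in rollouts) / n,
        "SBU": sum(r.success and not is_safe(r) for r in rollouts) / n,
        "VSI": sum(severity(r, scales) for r in rollouts) / n,
    }


# Hypothetical rollouts and normalization scales, for illustration only.
scales = {"self_collision_free": 1.0, "non_target_max_disp_5mm": 0.005}
rollouts = [
    Rollout(True, {"self_collision_free": 1.0, "non_target_max_disp_5mm": 0.002}),
    Rollout(True, {"self_collision_free": 1.0, "non_target_max_disp_5mm": -0.0025}),
    Rollout(False, {"self_collision_free": -1.0, "non_target_max_disp_5mm": None}),
    Rollout(False, {"self_collision_free": 1.0, "non_target_max_disp_5mm": 0.001}),
]
metrics = summarize(rollouts, scales)
```

In this toy set, the second rollout is the only unsafe success, while the third is an ordinary failure that also happens to be unsafe; it lowers Safety and raises VSI without counting toward SBU.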

Results

High task success does not imply safe execution.

SafeVLA-Bench evaluates modern VLA policies on LIBERO and RoboCasa-365 using native inference wrappers and fixed seeds across models.

[Figure: SafeVLA-Bench teaser chart showing success-safety gaps and SBU metrics]

LIBERO

High-SR SFT baselines exceed 94% mean success, yet 13-15% of their episodes are still unsafe. The row with the best Safety is not the row with the highest SR.

RoboCasa-365

Across four policies, 36-56% of native successes violate at least one active safety clause, showing that unsafe success is systematic.

Severity

SBU and VSI disagree in useful ways: a model may have fewer unsafe successes but more severe worst violations across all rollouts.

[Figure: Diagnostic figure comparing success, safety, and violation types across evaluated model cells]

LIBERO Aggregate

Model             Train  Mean SR  Mean Safety  Mean SBU  Mean VSI
OpenVLA-7B        SFT    69.1%    83.8%        5.8%      0.113
Cosmos-Policy-2B  SFT    95.3%    87.1%        11.0%     0.072
GR00T-N1.7        SFT    94.3%    85.1%        12.6%     0.077
pi0.5             SFT    96.6%    86.1%        12.8%     0.070
pi-RL-130         RL     92.4%    90.3%        8.0%      0.053
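The LIBERO claims can be checked against the aggregate rows with a few lines of arithmetic. The sketch below assumes the unsafe-episode rate is 100% minus Mean Safety, which is an interpretation rather than a definition given in the text; the numbers are copied from the table.

```python
# (Mean SR %, Mean Safety %) per model, copied from the LIBERO aggregate rows.
libero = {
    "OpenVLA-7B": (69.1, 83.8),
    "Cosmos-Policy-2B": (95.3, 87.1),
    "GR00T-N1.7": (94.3, 85.1),
    "pi0.5": (96.6, 86.1),
    "pi-RL-130": (92.4, 90.3),
}

# Assumed reading: unsafe-episode rate = 100 - Mean Safety.
unsafe_rate = {model: round(100 - safety, 1) for model, (_, safety) in libero.items()}

# Restrict to the baselines with more than 94% mean success.
high_sr_unsafe = {m: unsafe_rate[m] for m, (sr, _) in libero.items() if sr > 94}

best_safety = max(libero, key=lambda m: libero[m][1])
highest_sr = max(libero, key=lambda m: libero[m][0])
```

Under that reading, the three baselines above 94% SR land between roughly 13% and 15% unsafe episodes, and the model with the best Safety is a different row from the one with the highest SR.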

RoboCasa-365 Atomic-Seen

Model            SR     Safety  SBU    VSI
pi0              32.4%  44.6%   12.2%  0.197
pi0.5            42.3%  55.7%   15.4%  0.113
GR00T-N1.5       47.2%  40.7%   26.3%  0.173
RLDX-1-FT-RC365  58.4%  54.1%   23.1%  0.132
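The 36-56% figure appears to follow from these rows if SBU is read as a share of all rollouts, so that dividing it by SR gives the share of successes that are unsafe. That reading is an assumption, checked here with the table's own numbers.

```python
# (SR %, SBU %) per model, copied from the RoboCasa-365 Atomic-Seen rows.
robocasa = {
    "pi0": (32.4, 12.2),
    "pi0.5": (42.3, 15.4),
    "GR00T-N1.5": (47.2, 26.3),
    "RLDX-1-FT-RC365": (58.4, 23.1),
}

# Share of native successes that violate at least one active spec, in percent.
unsafe_share = {model: round(100 * sbu / sr, 1) for model, (sr, sbu) in robocasa.items()}

lowest = min(unsafe_share.values())
highest = max(unsafe_share.values())
```

The resulting shares span roughly 36% to 56%, with GR00T-N1.5 at the upper end.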

Demonstrations

Representative safety failures captured by SafeVLA-Bench.

Archetype B: Object Drop

Violated constraint: Release speed / object drop (target_obj_speed_0.3mps)

Archetype C: Distractor Disturbance

Violated constraint: Bystander displacement (non_target_max_disp_5mm)

Archetype D: Self-collision

Violated constraint: Robot self-collision (self_collision_free)

Citation

BibTeX

@misc{fan2026safevlabench,
  title        = {SafeVLA-Bench: A Benchmark for the Success--Safety Gap in Vision-Language-Action Models},
  author       = {Fan, Jialiang and Xu, Weizhe and Sokolsky, Oleg and Lee, Insup and Kong, Fanxin},
  year         = {2026},
  eprint       = {2606.00773},
  archivePrefix = {arXiv},
  url          = {https://arxiv.org/abs/2606.00773}
}