Modular VLA Systems: Decoupling Perception, Reasoning, and Physics for Scalable Robotics

Project ID: BRJP26100083

Project Overview

This project proposes a decoupled Vision–Language–Action (VLA) architecture for physically grounded robotic intelligence, one that prioritizes modularity, interpretability, and scalable training over monolithic end-to-end learning.

The architecture separates perception, intent reasoning, and physical feasibility into independently optimized components connected through structured interfaces. This design promotes robustness, easier adaptation to new robotic platforms, and safer real-world deployment.

Research Objectives

1. Perceptual World Modeling

The first module leverages VL-JEPA to generate embeddings in a joint vision–language space that continuously encode the current scene.

A dedicated decoder converts these embeddings into a time-varying knowledge graph of scene entities and their relations.

This structured representation provides an interpretable and temporally consistent world model instead of raw feature tokens.
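As a minimal sketch of what such a structured world model could look like (the class, relation names, and query semantics below are illustrative assumptions, not the project's actual interface), the scene can be stored as timestamped (subject, relation, object) triples, so the graph stays temporally consistent and can be queried at any point in time:

```python
from dataclasses import dataclass, field


@dataclass
class TemporalKnowledgeGraph:
    """Time-varying knowledge graph stored as timestamped triples."""
    # Each entry is (timestamp, subject, relation, object).
    triples: list = field(default_factory=list)

    def add(self, t: float, subject: str, relation: str, obj: str) -> None:
        self.triples.append((t, subject, relation, obj))

    def at(self, t: float) -> set:
        """Return the facts that hold at time t: for each (subject, relation)
        pair, keep the most recent object asserted at or before t."""
        latest = {}
        for ts, s, r, o in sorted(self.triples):
            if ts <= t:
                latest[(s, r)] = o
        return {(s, r, o) for (s, r), o in latest.items()}


kg = TemporalKnowledgeGraph()
kg.add(0.0, "cup", "at", "table")
kg.add(1.0, "cup", "at", "gripper")
kg.add(2.0, "cup", "at", "shelf")
print(kg.at(1.5))  # the cup is in the gripper at t = 1.5
```

Because queries are resolved against timestamps rather than raw feature tokens, the same structure supports both the interpretability and the temporal-consistency goals stated above.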

2. Command-to-Trajectory Reasoning

The second module translates natural language or symbolic commands into executable robot actions.

Using the knowledge graph history and the current end-effector trajectory, the system generates a demand trajectory for the end-effector and gripper over time. This effectively converts human intent into kinematically meaningful action plans.
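A toy sketch of the demand-trajectory output format (the function, its arguments, and the linear interpolation are placeholder assumptions; the actual module would condition on the full knowledge graph history rather than a single goal pose):

```python
import numpy as np


def demand_trajectory(current_pos, goal_pos, gripper_open: bool, steps: int = 10):
    """Return a time-ordered list of (end-effector position, gripper state)
    waypoints interpolating from the current pose to a goal pose."""
    current = np.asarray(current_pos, dtype=float)
    goal = np.asarray(goal_pos, dtype=float)
    waypoints = []
    for k in range(steps + 1):
        alpha = k / steps  # interpolation parameter in [0, 1]
        pos = (1 - alpha) * current + alpha * goal
        waypoints.append((pos, gripper_open))
    return waypoints


# "Move above the cup with the gripper open", resolved to a goal pose.
traj = demand_trajectory([0.0, 0.0, 0.0], [0.3, 0.0, 0.2], gripper_open=True)
```

The key design point is the interface, not the interpolation: downstream modules consume a kinematically explicit sequence of end-effector and gripper targets rather than opaque action tokens.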

3. Physics-Aware Verification and Execution

A physics verification module evaluates whether the generated trajectory satisfies robot-specific physical constraints.

Instead of relying purely on learned predictors, this module performs deterministic physics checks using constraint solvers. Feasible trajectories are executed, while infeasible ones generate structured feedback for iterative refinement.
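The check-and-feedback loop above can be illustrated with a deliberately simple deterministic verifier (the constraint values, workspace bounds, and feedback schema here are assumptions for illustration; a real module would invoke constraint solvers over the full robot model):

```python
import numpy as np


def verify_trajectory(waypoints, dt, max_speed=0.5,
                      workspace=((-1.0, 1.0), (-1.0, 1.0), (0.0, 1.5))):
    """Deterministically check a position trajectory against workspace and
    speed limits. Return structured feedback locating the first violation,
    so the reasoning module can refine its plan."""
    prev = None
    for i, pos in enumerate(waypoints):
        p = np.asarray(pos, dtype=float)
        # Workspace bounds: every waypoint must lie inside the axis-aligned box.
        for axis, (lo, hi) in enumerate(workspace):
            if not lo <= p[axis] <= hi:
                return {"feasible": False, "step": i,
                        "reason": "workspace", "axis": axis}
        # Speed limit: finite-difference velocity between consecutive waypoints.
        if prev is not None:
            speed = np.linalg.norm(p - prev) / dt
            if speed > max_speed:
                return {"feasible": False, "step": i,
                        "reason": "speed", "value": float(speed)}
        prev = p
    return {"feasible": True}
```

Returning a structured verdict (which step, which constraint) rather than a bare pass/fail is what makes the iterative refinement loop possible.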

Project Deliverables

Module 1 – Perceptual World Modeling

Module 2 – Command Understanding and Trajectory Generation

Module 3 – Physics-Aware Verification and Execution

Complete Stack Deployment

Research Outputs

Overall outcome: a modular architecture separating perception, reasoning, and physics to improve scalability, flexibility, and cross-platform generalization in robot learning systems.

Supervisors

Aritra Mukherjee
Assistant Professor
BITS Pilani, Hyderabad Campus
Profile Page

Dr. Ehsan Asadi
STEM, RMIT University
Email: ehsan.asadi@rmit.edu.au
Profile Page

Required Skills

Essential

Desirable

Preferred Student Disciplines

Application

Interested candidates can apply through the official program website:

Apply Here