Project ID: BRJP26100083
This project proposes a decoupled Vision–Language–Action (VLA) architecture for physically grounded robotic intelligence, one that prioritizes modularity, interpretability, and scalable training over monolithic end-to-end learning.
The architecture separates perception, intent reasoning, and physical feasibility into independently optimized components connected through structured interfaces. This design promotes robustness, easier adaptation to new robotic platforms, and safer real-world deployment.
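The three-module decomposition and its refinement loop can be sketched as a thin orchestration layer. This is a minimal illustration, not the project's actual interface: the function name `run_pipeline`, the callable signatures, and the retry scheme are all assumptions made for the sketch.

```python
def run_pipeline(perceive, plan, verify, observation, instruction, max_retries=3):
    """Wire perception, intent reasoning, and physics verification together.

    Each module is an independently optimized component passed in as a
    callable; infeasible plans feed structured feedback back to the planner.
    """
    graph = perceive(observation)                  # module 1: scene -> knowledge graph
    feedback = None
    for _ in range(max_retries):
        traj = plan(graph, instruction, feedback)  # module 2: intent -> trajectory
        ok, feedback = verify(traj)                # module 3: deterministic physics check
        if ok:
            return traj
    return None                                    # no feasible plan within budget
```

Because the modules only meet at these structured interfaces, any one of them can be retrained or swapped for a new robotic platform without touching the others.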
The first module leverages VL-JEPA to generate embeddings in a joint vision–language space that continuously encode the current scene.
A dedicated decoder converts these embeddings into a time-varying knowledge graph whose nodes and edges track scene entities and their relations as they evolve. This structured representation provides an interpretable and temporally consistent world model instead of raw feature tokens.
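A minimal sketch of such a time-varying knowledge graph follows. The class names (`Edge`, `SceneGraph`, `WorldModel`) and the subject–relation–object edge format are assumptions for illustration; the actual decoder output schema is a design decision of the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Edge:
    subject: str    # e.g. "cup"
    relation: str   # e.g. "on_top_of"
    obj: str        # e.g. "table"

@dataclass
class SceneGraph:
    timestamp: float
    edges: List[Edge] = field(default_factory=list)

class WorldModel:
    """Accumulates decoded scene graphs into a temporally ordered history."""

    def __init__(self) -> None:
        self.history: List[SceneGraph] = []

    def update(self, timestamp: float, edges: List[Edge]) -> None:
        self.history.append(SceneGraph(timestamp, list(edges)))

    def current(self) -> Optional[SceneGraph]:
        return self.history[-1] if self.history else None
```

Keeping the full history, rather than only the latest graph, is what lets downstream reasoning condition on how the scene has changed over time.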
The second module translates natural language or symbolic commands into executable robot actions.
Using the knowledge graph history and the current end-effector trajectory, the system generates a demand trajectory for the end-effector and gripper over time. This effectively converts human intent into kinematically meaningful action plans.
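As a purely geometric stand-in for the learned intent-to-trajectory module, a demand trajectory can be illustrated as interpolated end-effector poses plus a gripper schedule. The function name, linear interpolation, and open/close convention are assumptions for the sketch, not the proposed method.

```python
import numpy as np

def plan_demand_trajectory(current_pose: np.ndarray,
                           target_pose: np.ndarray,
                           steps: int = 50):
    """Linearly interpolate the end-effector from current to target pose,
    with a gripper command that closes on the final step."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    poses = (1.0 - alphas) * current_pose + alphas * target_pose
    gripper = np.ones(steps)   # 1 = open during approach
    gripper[-1] = 0.0          # 0 = close to grasp at the target
    return poses, gripper
```

The learned module would replace the straight-line interpolation with trajectories conditioned on the knowledge graph history, but the output interface (a timed pose sequence plus gripper commands) stays the same.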
A physics verification module evaluates whether the generated trajectory satisfies the robot's kinematic and dynamic constraints. Instead of relying purely on learned predictors, it performs deterministic physics checks using constraint solvers. Feasible trajectories are executed, while infeasible ones generate structured feedback for iterative refinement.
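The shape of such a deterministic check can be sketched as follows. The specific limits, the position/velocity checks, and the feedback strings are placeholders; a real verifier would draw limits from the robot's datasheet and delegate collision and reachability queries to a constraint solver.

```python
import numpy as np

# Placeholder limits for illustration; real values are robot-specific.
JOINT_POS_LIMIT = 2.9   # rad
MAX_JOINT_VEL = 1.5     # rad/s

def check_trajectory(joint_traj: np.ndarray, dt: float):
    """Deterministic feasibility checks on a (T, n_joints) joint trajectory.

    Returns (feasible, feedback), where feedback names each violated
    constraint so the planner can refine the trajectory.
    """
    feedback = []
    if np.any(np.abs(joint_traj) > JOINT_POS_LIMIT):
        feedback.append("joint position limit violated")
    vel = np.diff(joint_traj, axis=0) / dt
    if np.any(np.abs(vel) > MAX_JOINT_VEL):
        feedback.append("joint velocity limit violated")
    return len(feedback) == 0, feedback
```

Returning named violations rather than a bare pass/fail is what makes the feedback "structured": the planner knows which constraint to relax or replan around.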
Overall outcome: a modular architecture separating perception, reasoning, and physics to improve scalability, flexibility, and cross-platform generalization in robot learning systems.