Project ID: BRJP26100083
This project proposes a decoupled Vision–Language–Action (VLA) architecture for physically grounded robotic intelligence, one that prioritizes modularity, interpretability, and scalable training over monolithic end-to-end learning.
The architecture separates perception, intent reasoning, and physical feasibility into independently optimized components connected through structured interfaces. This design promotes robustness, easier adaptation to new robotic platforms, and safer real-world deployment.
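The three-module decomposition and its refinement loop can be sketched as a thin orchestration layer. This is a minimal illustration, not the project's actual interface: the function name `run_pipeline`, the callable signatures, and the retry scheme are all assumptions made for the sketch.

```python
def run_pipeline(perceive, plan, verify, observation, instruction, max_retries=3):
    """Wire perception, intent reasoning, and physics verification together.

    Each module is an independently optimized component passed in as a
    callable; infeasible plans feed structured feedback back to the planner.
    """
    graph = perceive(observation)                  # module 1: scene -> knowledge graph
    feedback = None
    for _ in range(max_retries):
        traj = plan(graph, instruction, feedback)  # module 2: intent -> trajectory
        ok, feedback = verify(traj)                # module 3: deterministic physics check
        if ok:
            return traj
    return None                                    # no feasible plan within budget
```

Because the modules only meet at these structured interfaces, any one of them can be retrained or swapped for a new robotic platform without touching the others.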
The first module leverages VL-JEPA to generate embeddings in a joint vision–language space that continuously encode the current scene.
A dedicated decoder converts these embeddings into a time-varying knowledge graph whose nodes and edges track scene entities and their relations as they evolve. This structured representation provides an interpretable and temporally consistent world model instead of raw feature tokens.
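A minimal sketch of such a time-varying knowledge graph follows. The class names (`Edge`, `SceneGraph`, `WorldModel`) and the subject–relation–object edge format are assumptions for illustration; the actual decoder output schema is a design decision of the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Edge:
    subject: str    # e.g. "cup"
    relation: str   # e.g. "on_top_of"
    obj: str        # e.g. "table"

@dataclass
class SceneGraph:
    timestamp: float
    edges: List[Edge] = field(default_factory=list)

class WorldModel:
    """Accumulates decoded scene graphs into a temporally ordered history."""

    def __init__(self) -> None:
        self.history: List[SceneGraph] = []

    def update(self, timestamp: float, edges: List[Edge]) -> None:
        self.history.append(SceneGraph(timestamp, list(edges)))

    def current(self) -> Optional[SceneGraph]:
        return self.history[-1] if self.history else None
```

Keeping the full history, rather than only the latest graph, is what lets downstream reasoning condition on how the scene has changed over time.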
The second module translates natural language or symbolic commands into executable robot actions.
Using the knowledge graph history and the current end-effector trajectory, the system generates a demand trajectory for the end-effector and gripper over time. This effectively converts human intent into kinematically meaningful action plans.
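As a purely geometric stand-in for the learned intent-to-trajectory module, a demand trajectory can be illustrated as interpolated end-effector poses plus a gripper schedule. The function name, linear interpolation, and open/close convention are assumptions for the sketch, not the proposed method.

```python
import numpy as np

def plan_demand_trajectory(current_pose: np.ndarray,
                           target_pose: np.ndarray,
                           steps: int = 50):
    """Linearly interpolate the end-effector from current to target pose,
    with a gripper command that closes on the final step."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    poses = (1.0 - alphas) * current_pose + alphas * target_pose
    gripper = np.ones(steps)   # 1 = open during approach
    gripper[-1] = 0.0          # 0 = close to grasp at the target
    return poses, gripper
```

The learned module would replace the straight-line interpolation with trajectories conditioned on the knowledge graph history, but the output interface (a timed pose sequence plus gripper commands) stays the same.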
A physics verification module evaluates whether the generated trajectory satisfies the robot's kinematic and dynamic constraints. Instead of relying purely on learned predictors, it performs deterministic physics checks using constraint solvers. Feasible trajectories are executed, while infeasible ones generate structured feedback for iterative refinement.
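The shape of such a deterministic check can be sketched as follows. The specific limits, the position/velocity checks, and the feedback strings are placeholders; a real verifier would draw limits from the robot's datasheet and delegate collision and reachability queries to a constraint solver.

```python
import numpy as np

# Placeholder limits for illustration; real values are robot-specific.
JOINT_POS_LIMIT = 2.9   # rad
MAX_JOINT_VEL = 1.5     # rad/s

def check_trajectory(joint_traj: np.ndarray, dt: float):
    """Deterministic feasibility checks on a (T, n_joints) joint trajectory.

    Returns (feasible, feedback), where feedback names each violated
    constraint so the planner can refine the trajectory.
    """
    feedback = []
    if np.any(np.abs(joint_traj) > JOINT_POS_LIMIT):
        feedback.append("joint position limit violated")
    vel = np.diff(joint_traj, axis=0) / dt
    if np.any(np.abs(vel) > MAX_JOINT_VEL):
        feedback.append("joint velocity limit violated")
    return len(feedback) == 0, feedback
```

Returning named violations rather than a bare pass/fail is what makes the feedback "structured": the planner knows which constraint to relax or replan around.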
Overall outcome: a modular architecture separating perception, reasoning, and physics to improve scalability, flexibility, and cross-platform generalization in robot learning systems.