Module 4: Multimodal AI for Reasoning & Action
Module 4 connects perception to language and high-level reasoning. You will explore how Vision-Language Models (VLMs) and multimodal LLMs interpret scenes, answer questions, and generate task-relevant representations that can be fed into planners and controllers.
4.1 Perception–Language Grounding
Why Ground Language in Perception?
For a humanoid to follow natural-language commands like:
- “Pick up the red mug near the laptop.”
- “Go to the doorway and wait until a human arrives.”
it must:
- Understand linguistic concepts (“red mug”, “doorway”, “human”)
- Align them with perceptual entities (bounding boxes, masks, poses)
- Resolve ambiguity and reference (which mug, which laptop)
This process is perception–language grounding:
- Text tokens ↔ visual regions
- Object names ↔ detected entities
- Relational phrases ↔ spatial relations in the map
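The token-to-region alignment above can be sketched as nearest-neighbor matching in a shared embedding space. This is a minimal illustration with hand-made toy vectors; a real system would obtain the embeddings from learned CLIP-style text and image encoders.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_phrase(phrase_vec, detections):
    """Return the detection whose visual embedding best matches the phrase."""
    return max(detections, key=lambda d: cosine(phrase_vec, d["embedding"]))

# Toy detections with illustrative embeddings (assumption: in practice these
# come from an image encoder applied to each detected region).
detections = [
    {"label": "red mug",  "bbox": (120, 80, 40, 50), "embedding": np.array([0.9, 0.1, 0.0])},
    {"label": "blue mug", "bbox": (300, 90, 42, 48), "embedding": np.array([0.1, 0.9, 0.0])},
]
phrase = np.array([0.85, 0.15, 0.0])  # stand-in embedding of "the red mug"
best = ground_phrase(phrase, detections)
print(best["label"])
```

The same matching step resolves reference ambiguity: when several detections score similarly, the planner can ask for clarification or use relational cues ("near the laptop") to break the tie.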
4.2 VLM-Based Real-Time Perception
Vision-Language Models (VLMs)
VLMs jointly process images and text to:
- Answer visual questions (“What objects are on the table?”)
- Describe scenes (“A humanoid standing near a red chair.”)
- Identify affordances (“surfaces where a mug could be placed”)
They typically:
- Encode images into visual features (CNN or Vision Transformer)
- Encode text into language features (transformer-based)
- Use cross-attention to relate the two
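The cross-attention step can be sketched numerically. This is a single-head, projection-free simplification; real VLMs use learned query/key/value projections, multiple heads, and many stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, img_feats, d):
    """Text tokens attend over image patches (simplified: no learned projections)."""
    scores = text_feats @ img_feats.T / np.sqrt(d)  # (T, P) attention logits
    weights = softmax(scores, axis=-1)              # each text token distributes over patches
    return weights @ img_feats                      # (T, d) image-conditioned text features

T, P, d = 4, 9, 8                    # e.g., 4 text tokens, 9 image patches, dim 8
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(T, d))
img_feats = rng.normal(size=(P, d))
out = cross_attention(text_feats, img_feats, d)
print(out.shape)
```

Each output row is a weighted mixture of patch features, which is how a token like "mug" comes to carry information about the image region it refers to.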
Real-Time Constraints
For robotic use:
- Latency matters (tens to hundreds of milliseconds per query)
- You may run:
- Lightweight VLMs on the robot GPU
- Heavier models on a local workstation or in the cloud
You will conceptually design:
- ROS 2 nodes that:
- Subscribe to camera images
- Send images and text prompts to a VLM service
- Receive structured outputs about objects, relations, or tasks
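The core logic of such a node can be sketched outside ROS 2 with a stubbed VLM client. The response schema (`label`, `bbox`, `confidence`) and the client interface are assumptions for illustration; a real node would wrap this callback in an rclpy subscriber and call an actual VLM service.

```python
import json
from dataclasses import dataclass

@dataclass
class VLMResult:
    label: str
    bbox: tuple        # (x, y, w, h) in pixels
    confidence: float

def parse_vlm_response(raw_json: str) -> VLMResult:
    """Turn a VLM service's JSON reply into a structured result (schema assumed)."""
    data = json.loads(raw_json)
    return VLMResult(label=data["label"],
                     bbox=tuple(data["bbox"]),
                     confidence=float(data["confidence"]))

def on_image(image, prompt, vlm_client):
    """Callback body: send the frame and prompt to the VLM, return a structured result."""
    raw = vlm_client(image, prompt)  # e.g., an HTTP call to a local VLM server
    return parse_vlm_response(raw)

# Stub client standing in for the real service
stub = lambda img, prompt: '{"label": "red mug", "bbox": [120, 80, 40, 50], "confidence": 0.87}'
result = on_image(None, "Where is the red mug near the laptop?", stub)
print(result.label, result.confidence)
```

Keeping the parsing in a plain function makes it easy to unit-test the interface independently of the robot stack.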
4.3 Connecting Perception to Control
Pipeline: Camera → VLM → Planner → Actuator
An end-to-end multimodal pipeline might look like:
- Camera & perception:
- RGB-D camera feeds detection and mapping modules
- VLM query:
- Text: “Where is the red mug near the laptop?”
- Image: Current or recent camera frame
- VLM output:
- Structured description (e.g., target object bounding box or label)
- Optionally, a symbolic scene graph (objects + relations)
- Planner:
- Converts the target object's image-space detection into a 3D pose using depth or the map
- Plans a collision-free path to that pose
- Controller:
- Executes motion plan with whole-body control
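The planner's first step above, turning a detected bounding box plus depth into a 3D target, can be sketched with the standard pinhole back-projection. The intrinsics values below are illustrative, not from a specific camera.

```python
import numpy as np

def bbox_to_point(bbox, depth, fx, fy, cx, cy):
    """Back-project a bounding-box center and its depth reading into a
    3D point in the camera frame using the pinhole camera model."""
    x, y, w, h = bbox
    u, v = x + w / 2.0, y + h / 2.0  # pixel center of the box
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.array([X, Y, depth])

# Illustrative intrinsics and a detection reported at 1.2 m
p = bbox_to_point((300, 220, 40, 50), depth=1.2,
                  fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(np.round(p, 3))  # camera-frame (X, Y, Z) target in meters
```

The resulting camera-frame point would then be transformed into the robot's planning frame (e.g., via TF in ROS 2) before path planning.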
Scene Graphs and World State
To keep the interface clean:
- Represent perception outputs as a scene graph:
- Nodes: objects, humans, regions
- Edges: spatial and semantic relations (e.g., “on top of”, “near”, “left of”)
- Allow planners and language modules to:
- Query objects by attributes (“the nearest red mug”)
- Use relations to ground instructions
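A scene graph with this kind of query interface can be sketched with plain dictionaries. The node/edge schema here is illustrative; a deployed system might use a graph database or a dedicated scene-graph message type.

```python
# Minimal scene graph: nodes carry attributes, edges carry named relations.
nodes = {
    "mug1":   {"type": "mug", "color": "red",  "pos": (1.0, 0.5)},
    "mug2":   {"type": "mug", "color": "blue", "pos": (3.0, 2.0)},
    "laptop": {"type": "laptop", "pos": (1.2, 0.6)},
}
edges = [("mug1", "near", "laptop")]

def query(node_type, color=None, near=None):
    """Find node ids matching attribute filters and an optional 'near' relation."""
    hits = [nid for nid, a in nodes.items()
            if a["type"] == node_type and (color is None or a.get("color") == color)]
    if near is not None:
        related = {s for s, r, t in edges if r == "near" and t == near}
        hits = [h for h in hits if h in related]
    return hits

print(query("mug", color="red", near="laptop"))
```

Grounding "the red mug near the laptop" then reduces to a graph query, which keeps the language module decoupled from raw perception outputs.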
By the end of Module 4, you should understand:
- How multimodal models interpret scenes and language together
- How to design ROS 2 interfaces that connect VLM outputs to planning and control
- How perception, language, and action come together in a vision–language–action loop