Module 1: Foundations of Perception & Sensor Understanding
Module 1 introduces perception as the bridge from sensors to intelligence. You will learn what it means for a robot to “see,” why perception is framed as state estimation, and how different sensors (RGB, depth, LiDAR, IMU, GPS) complement one another to form a coherent view of the world.
1.1 What is Perception?
Perception is the process of turning raw sensor data into structured understanding that a robot can act on:
- From pixels to objects (e.g., “this patch of pixels is a mug”)
- From ranges to obstacles (e.g., “there is a wall 1.2 m ahead”)
- From IMU readings to pose (e.g., “I am tilted 3° forward”)
Conceptually, perception is a state estimation problem:
- Sensors provide noisy, partial observations of the world
- The robot maintains an internal state (pose, map, object locations)
- Perception algorithms update this state as new data arrives
Perception sits between:
- Sensing (hardware signals) and
- Planning & control (decisions and actions)
Without robust perception, even the best controllers will act on incorrect or incomplete information.
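The state-estimation framing above can be made concrete with a minimal sketch: a one-dimensional Kalman filter fusing noisy range readings into an estimate of the distance to a wall. All numbers here are illustrative, not from a real sensor.

```python
# Minimal sketch of perception as state estimation: a 1D Kalman filter
# fusing noisy range measurements into a distance estimate.

def kalman_update(est, est_var, measurement, meas_var):
    """Fuse one noisy measurement into the current state estimate."""
    k = est_var / (est_var + meas_var)        # Kalman gain
    new_est = est + k * (measurement - est)   # corrected estimate
    new_var = (1 - k) * est_var               # uncertainty shrinks
    return new_est, new_var

est, var = 1.0, 1.0                  # initial guess: 1 m away, very uncertain
for z in [1.25, 1.18, 1.22, 1.19]:   # noisy range readings (meters)
    est, var = kalman_update(est, var, z, meas_var=0.04)
# est converges toward the ~1.2 m readings while var shrinks
```

Each new reading nudges the estimate and reduces its variance; this predict/update loop is the skeleton behind most localization and tracking pipelines.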
1.2 Vision Input Streams
Different sensors offer different trade-offs:
RGB Cameras
- Capture color images at high resolution and frame rates
- Rich semantic information (object categories, textures, human gestures)
- Sensitive to lighting and occlusions
RGB cameras are the primary input for:
- Object detection and segmentation
- Human pose estimation and interaction
- Vision–language models (image–text grounding)
Depth Cameras
Depth sensors (e.g., RealSense) measure distance per pixel:
- Provide a dense depth map aligned with RGB
- Enable:
- 3D reconstruction of scenes
- Precise object localization in 3D
- Safer manipulation and navigation around obstacles
Limitations:
- Range and accuracy degrade with distance
- Sensitive to reflective/transparent surfaces
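To see how a depth map yields 3D structure, here is a minimal sketch of back-projecting a single depth pixel into a camera-frame 3D point with the pinhole model. The intrinsics (fx, fy, cx, cy) are made-up example values; real ones come from the camera's calibration.

```python
# Back-project one depth pixel into a 3D point (pinhole camera model).
# Intrinsics below are illustrative, not from a real device.

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Convert pixel (u, v) with depth in meters to camera-frame (x, y, z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# A pixel near the image center at 1.2 m lies almost on the optical axis:
pt = deproject(u=320, v=240, depth_m=1.2, fx=600.0, fy=600.0, cx=319.5, cy=239.5)
```

Applying this to every pixel of an aligned RGB-D frame produces a colored point cloud, the starting point for 3D reconstruction and grasp planning.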
LiDAR
LiDAR (2D or 3D) emits laser beams and measures time-of-flight:
- Produces point clouds with accurate ranges
- Less sensitive to lighting than cameras
- Excellent for:
- SLAM and mapping
- Obstacle detection for navigation
Trade-offs:
- Lower semantic richness (it “sees shapes,” not colors)
- Sensors can be more expensive and bulky
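A toy example of using a LiDAR point cloud for obstacle detection: keep only points inside a forward "safety corridor" and report the nearest one. The corridor thresholds and the sample cloud are arbitrary illustrative values.

```python
# Nearest-obstacle check on a point cloud in the robot frame
# (x forward, y left, z up). Thresholds are illustrative.

def nearest_obstacle(points, max_range=5.0, half_width=0.4):
    """Return the closest (x, y, z) point inside the forward corridor, or None."""
    hits = [p for p in points
            if 0.0 < p[0] < max_range       # in front of the robot
            and abs(p[1]) < half_width      # inside the corridor width
            and 0.05 < p[2] < 2.0]          # ignore floor and ceiling returns
    return min(hits, key=lambda p: p[0], default=None)

cloud = [(3.0, 0.1, 0.5), (1.2, -0.2, 0.3), (2.0, 1.5, 0.4), (0.8, 0.0, 0.01)]
obstacle = nearest_obstacle(cloud)   # the corridor point 1.2 m ahead
```

Note what the geometry alone gives you: an accurate range to *something*, but no idea whether that something is a wall, a chair, or a person; that semantic gap is what cameras fill.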
IMU (Inertial Measurement Unit)
IMUs measure:
- Linear acceleration
- Angular velocity
- Sometimes magnetic field
Uses:
- Estimating orientation and detecting motion
- Stabilizing walking and balance
- Fusing with cameras for visual–inertial odometry and SLAM (VIO / VI-SLAM)
IMU data alone drifts over time, but combined with cameras and encoders it becomes a powerful stabilizing signal.
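One common way to tame that drift is a complementary filter: integrate the gyro for smooth short-term orientation, and blend in the accelerometer's gravity direction to anchor it long-term. This is a minimal sketch; the 0.98 blend factor and the sensor readings are illustrative assumptions.

```python
import math

# Complementary filter for pitch: gyro integration (smooth but drifting)
# blended with the accelerometer's gravity direction (noisy but drift-free).

def complementary_pitch(pitch, gyro_rate, ax, az, dt, alpha=0.98):
    """One filter step; angles in radians, rates in rad/s."""
    gyro_pitch = pitch + gyro_rate * dt   # integrate angular velocity
    accel_pitch = math.atan2(ax, az)      # pitch implied by gravity vector
    return alpha * gyro_pitch + (1 - alpha) * accel_pitch

pitch = 0.0
for _ in range(100):                      # 1 s of samples at 100 Hz
    # Stationary robot tilted ~3 deg forward: a small gyro bias, with gravity
    # split between the x and z accelerometer axes (illustrative values).
    pitch = complementary_pitch(pitch, gyro_rate=0.001, ax=0.052, az=0.998,
                                dt=0.01, alpha=0.98)
# pitch approaches the ~3 deg tilt implied by the accelerometer
```

The same idea, generalized, is what visual–inertial fusion does: a fast drifting signal corrected by a slower absolute one.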
GPS (Optional)
For outdoor robots:
- GPS provides global position estimates (with meter-level accuracy)
- Often fused with IMU and odometry for robust localization
For this course:
- GPS is optional; focus is on indoor humanoids using cameras, depth, LiDAR, and IMUs.
1.3 Data Types and Representations
Perception algorithms operate on a variety of data structures:
Images and Tensors
- RGB or grayscale images represented as:
- Height × Width × Channels arrays
- Normalized or standardized before feeding into neural networks
- Intermediate representations:
- Feature maps: learned channels encoding edges, textures, shapes
- Embeddings: compact vectors representing images, regions, or objects
These representations are the foundation of deep vision models.
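A small sketch of the standard preprocessing step: scale uint8 pixel values to [0, 1], standardize per channel, and reorder H×W×C to C×H×W. Pure-Python nested lists stand in for an array library here; the mean/std constants are the commonly used ImageNet values, shown only as an example.

```python
# Normalize an H x W x 3 uint8 image and reorder it to channels-first,
# the layout most deep vision models expect. Constants are the widely
# used ImageNet statistics (an example choice, not a requirement).

MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def to_chw_tensor(image_hwc):
    """image_hwc: H x W x 3 nested lists of uint8 -> normalized 3 x H x W."""
    h, w = len(image_hwc), len(image_hwc[0])
    return [[[(image_hwc[y][x][c] / 255.0 - MEAN[c]) / STD[c]
              for x in range(w)] for y in range(h)]
            for c in range(3)]

img = [[(128, 64, 255), (0, 0, 0)]]   # a tiny 1 x 2 RGB "image"
tensor = to_chw_tensor(img)           # 3 channels x 1 row x 2 columns
```

In practice an array library does this in one vectorized call, but the transformation is exactly this: rescale, standardize, transpose.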
Point Clouds and 3D Grids
LiDAR and depth cameras produce:
- Point clouds: sets of 3D points (x, y, z) (optionally with color/intensity)
- Voxel grids or TSDF (Truncated Signed Distance Function) volumes:
- Discretized 3D space where each cell stores occupancy or distance to surfaces
Uses:
- Mapping and collision checking
- 3D object detection and pose estimation
- Reconstruction of rooms and environments
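The jump from a point cloud to a discretized 3D grid can be sketched in a few lines: each occupied cell is the integer index of the cube containing at least one point. The 0.1 m resolution is an arbitrary example value.

```python
# Voxelize a point cloud into a sparse occupancy grid: a set of integer
# cell indices at a fixed resolution (0.1 m here, chosen for illustration).

def voxelize(points, resolution=0.1):
    """points: iterable of (x, y, z) in meters -> set of occupied voxel indices."""
    return {(int(x // resolution), int(y // resolution), int(z // resolution))
            for (x, y, z) in points}

cloud = [(1.23, 0.02, 0.51), (1.27, 0.04, 0.55), (3.00, -0.40, 1.10)]
occupied = voxelize(cloud)
# The first two points fall in the same 10 cm cell, so only two cells are set.
```

A TSDF volume refines this idea: instead of a binary occupied/free flag, each cell stores a signed distance to the nearest surface, which enables smooth surface reconstruction.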
Motion Cues: Optical Flow and Disparity
- Optical flow:
- 2D motion field between consecutive frames
- Useful for tracking objects and estimating ego-motion
- Stereo disparity:
- Difference between left/right camera images
- Converts to depth via triangulation
These signals help:
- Understand dynamic scenes
- Support SLAM and odometry
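The disparity-to-depth conversion is a one-line triangulation: z = f·B/d, where f is the focal length in pixels, B the stereo baseline in meters, and d the disparity in pixels. The f and B values below are illustrative, not from a real rig.

```python
# Stereo triangulation: depth is inversely proportional to disparity.
# focal_px and baseline_m are example values, not a real camera's.

def disparity_to_depth(disparity_px, focal_px=600.0, baseline_m=0.06):
    """Convert a stereo disparity (pixels) to metric depth (meters)."""
    if disparity_px <= 0:
        return float("inf")        # zero disparity means "at infinity"
    return focal_px * baseline_m / disparity_px

near = disparity_to_depth(60.0)    # large disparity -> close (0.6 m)
far = disparity_to_depth(6.0)      # small disparity -> far (6.0 m)
```

The inverse relationship explains a practical limit of stereo: at long range, disparity shrinks toward zero, so small matching errors translate into large depth errors.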
When to Use 2D vs 3D Representations
- 2D-centric (images, feature maps):
- Good for high-level semantics (object categories, attributes, actions)
- Efficient and mature tooling (CNNs, Vision Transformers)
- 3D-centric (point clouds, voxels, meshes):
- Essential for geometry (distances, free space, collisions)
- Critical for navigation, manipulation, and safety
In practice, your humanoid will use both:
- 2D representations for understanding what is in the scene
- 3D representations for understanding where things are and how to move safely
By the end of Module 1, you should have a clear mental model of:
- What your sensors provide
- How those signals are represented internally
- How perception transforms raw data into the world models that planning and control require.