Module 1: Foundations of Perception & Sensor Understanding
Module 1 introduces perception as the bridge from sensors to intelligence. You will learn what it means for a robot to “see,” why perception is framed as state estimation, and how different sensors (RGB, depth, LiDAR, IMU, GPS) complement one another to form a coherent view of the world.
1.1 What is Perception?
Perception is the process of turning raw sensor data into structured understanding that a robot can act on:
- From pixels to objects (e.g., “this patch of pixels is a mug”)
- From ranges to obstacles (e.g., “there is a wall 1.2 m ahead”)
- From IMU readings to pose (e.g., “I am tilted 3° forward”)
Conceptually, perception is a state estimation problem:
- Sensors provide noisy, partial observations of the world
- The robot maintains an internal state (pose, map, object locations)
- Perception algorithms update this state as new data arrives
Perception sits between:
- Sensing (hardware signals) and
- Planning & control (decisions and actions)
Without robust perception, even the best controllers will act on incorrect or incomplete information.
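The state-estimation framing above can be made concrete with a minimal sketch: a one-dimensional Kalman filter fusing noisy range readings into an estimate of the distance to a wall. All numbers here are illustrative, not from a real sensor.

```python
# Minimal sketch of perception as state estimation: a 1D Kalman filter
# fusing noisy range measurements into a distance estimate.

def kalman_update(est, est_var, measurement, meas_var):
    """Fuse one noisy measurement into the current state estimate."""
    k = est_var / (est_var + meas_var)        # Kalman gain
    new_est = est + k * (measurement - est)   # corrected estimate
    new_var = (1 - k) * est_var               # uncertainty shrinks
    return new_est, new_var

est, var = 1.0, 1.0                  # initial guess: 1 m away, very uncertain
for z in [1.25, 1.18, 1.22, 1.19]:   # noisy range readings (meters)
    est, var = kalman_update(est, var, z, meas_var=0.04)
# est converges toward the ~1.2 m readings while var shrinks
```

Each new reading nudges the estimate and reduces its variance; this predict/update loop is the skeleton behind most localization and tracking pipelines.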
1.2 Vision Input Streams
Different sensors offer different trade-offs:
RGB Cameras
- Capture color images at high resolution and frame rates
- Rich semantic information (object categories, textures, human gestures)
- Sensitive to lighting and occlusions
RGB cameras are the primary input for:
- Object detection and segmentation
- Human pose estimation and interaction
- Vision–language models (image–text grounding)
Depth Cameras
Depth sensors (e.g., RealSense) measure distance per pixel:
- Provide a dense depth map aligned with RGB
- Enable:
- 3D reconstruction of scenes
- Precise object localization in 3D
- Safer manipulation and navigation around obstacles
Limitations:
- Range and accuracy degrade with distance
- Sensitive to reflective/transparent surfaces
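To see how a depth map yields 3D structure, here is a minimal sketch of back-projecting a single depth pixel into a camera-frame 3D point with the pinhole model. The intrinsics (fx, fy, cx, cy) are made-up example values; real ones come from the camera's calibration.

```python
# Back-project one depth pixel into a 3D point (pinhole camera model).
# Intrinsics below are illustrative, not from a real device.

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Convert pixel (u, v) with depth in meters to camera-frame (x, y, z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# A pixel near the image center at 1.2 m lies almost on the optical axis:
pt = deproject(u=320, v=240, depth_m=1.2, fx=600.0, fy=600.0, cx=319.5, cy=239.5)
```

Applying this to every pixel of an aligned RGB-D frame produces a colored point cloud, the starting point for 3D reconstruction and grasp planning.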
LiDAR
LiDAR (2D or 3D) emits laser beams and measures time-of-flight:
- Produces point clouds with accurate ranges
- Less sensitive to lighting than cameras
- Excellent for:
- SLAM and mapping
- Obstacle detection for navigation
Trade-offs:
- Lower semantic richness (it “sees shapes,” not colors)
- Sensors can be more expensive and bulky
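A toy example of using a LiDAR point cloud for obstacle detection: keep only points inside a forward "safety corridor" and report the nearest one. The corridor thresholds and the sample cloud are arbitrary illustrative values.

```python
# Nearest-obstacle check on a point cloud in the robot frame
# (x forward, y left, z up). Thresholds are illustrative.

def nearest_obstacle(points, max_range=5.0, half_width=0.4):
    """Return the closest (x, y, z) point inside the forward corridor, or None."""
    hits = [p for p in points
            if 0.0 < p[0] < max_range       # in front of the robot
            and abs(p[1]) < half_width      # inside the corridor width
            and 0.05 < p[2] < 2.0]          # ignore floor and ceiling returns
    return min(hits, key=lambda p: p[0], default=None)

cloud = [(3.0, 0.1, 0.5), (1.2, -0.2, 0.3), (2.0, 1.5, 0.4), (0.8, 0.0, 0.01)]
obstacle = nearest_obstacle(cloud)   # the corridor point 1.2 m ahead
```

Note what the geometry alone gives you: an accurate range to *something*, but no idea whether that something is a wall, a chair, or a person; that semantic gap is what cameras fill.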
IMU (Inertial Measurement Unit)
IMUs measure:
- Linear acceleration
- Angular velocity
- Sometimes magnetic field
Uses:
- Estimating orientation and detecting motion
- Stabilizing walking and balance
- Fusing with cameras for visual–inertial odometry and SLAM (VIO / VI-SLAM)
IMU data alone drifts over time, but combined with cameras and encoders it becomes a powerful stabilizing signal.
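One common way to tame that drift is a complementary filter: integrate the gyro for smooth short-term orientation, and blend in the accelerometer's gravity direction to anchor it long-term. This is a minimal sketch; the 0.98 blend factor and the sensor readings are illustrative assumptions.

```python
import math

# Complementary filter for pitch: gyro integration (smooth but drifting)
# blended with the accelerometer's gravity direction (noisy but drift-free).

def complementary_pitch(pitch, gyro_rate, ax, az, dt, alpha=0.98):
    """One filter step; angles in radians, rates in rad/s."""
    gyro_pitch = pitch + gyro_rate * dt   # integrate angular velocity
    accel_pitch = math.atan2(ax, az)      # pitch implied by gravity vector
    return alpha * gyro_pitch + (1 - alpha) * accel_pitch

pitch = 0.0
for _ in range(100):                      # 1 s of samples at 100 Hz
    # Stationary robot tilted ~3 deg forward: a small gyro bias, with gravity
    # split between the x and z accelerometer axes (illustrative values).
    pitch = complementary_pitch(pitch, gyro_rate=0.001, ax=0.052, az=0.998,
                                dt=0.01, alpha=0.98)
# pitch approaches the ~3 deg tilt implied by the accelerometer
```

The same idea, generalized, is what visual–inertial fusion does: a fast drifting signal corrected by a slower absolute one.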
GPS (Optional)
For outdoor robots:
- GPS provides global position estimates (with meter-level accuracy)
- Often fused with IMU and odometry for robust localization
For this course:
- GPS is optional; focus is on indoor humanoids using cameras, depth, LiDAR, and IMUs.
1.3 Data Types and Representations
Perception algorithms operate on a variety of data structures:
Images and Tensors
- RGB or grayscale images represented as:
- Height × Width × Channels arrays
- Normalized or standardized before feeding into neural networks
- Intermediate representations:
- Feature maps: learned channels encoding edges, textures, shapes
- Embeddings: compact vectors representing images, regions, or objects
These representations are the foundation of deep vision models.
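A small sketch of the standard preprocessing step: scale uint8 pixel values to [0, 1], standardize per channel, and reorder H×W×C to C×H×W. Pure-Python nested lists stand in for an array library here; the mean/std constants are the commonly used ImageNet values, shown only as an example.

```python
# Normalize an H x W x 3 uint8 image and reorder it to channels-first,
# the layout most deep vision models expect. Constants are the widely
# used ImageNet statistics (an example choice, not a requirement).

MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def to_chw_tensor(image_hwc):
    """image_hwc: H x W x 3 nested lists of uint8 -> normalized 3 x H x W."""
    h, w = len(image_hwc), len(image_hwc[0])
    return [[[(image_hwc[y][x][c] / 255.0 - MEAN[c]) / STD[c]
              for x in range(w)] for y in range(h)]
            for c in range(3)]

img = [[(128, 64, 255), (0, 0, 0)]]   # a tiny 1 x 2 RGB "image"
tensor = to_chw_tensor(img)           # 3 channels x 1 row x 2 columns
```

In practice an array library does this in one vectorized call, but the transformation is exactly this: rescale, standardize, transpose.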
Point Clouds and 3D Grids
LiDAR and depth cameras produce:
- Point clouds: sets of 3D points (x, y, z) (optionally with color/intensity)
- Voxel grids or TSDF (Truncated Signed Distance Function) volumes:
- Discretized 3D space where each cell stores occupancy or distance to surfaces
Uses:
- Mapping and collision checking
- 3D object detection and pose estimation
- Reconstruction of rooms and environments
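The jump from a point cloud to a discretized 3D grid can be sketched in a few lines: each occupied cell is the integer index of the cube containing at least one point. The 0.1 m resolution is an arbitrary example value.

```python
# Voxelize a point cloud into a sparse occupancy grid: a set of integer
# cell indices at a fixed resolution (0.1 m here, chosen for illustration).

def voxelize(points, resolution=0.1):
    """points: iterable of (x, y, z) in meters -> set of occupied voxel indices."""
    return {(int(x // resolution), int(y // resolution), int(z // resolution))
            for (x, y, z) in points}

cloud = [(1.23, 0.02, 0.51), (1.27, 0.04, 0.55), (3.00, -0.40, 1.10)]
occupied = voxelize(cloud)
# The first two points fall in the same 10 cm cell, so only two cells are set.
```

A TSDF volume refines this idea: instead of a binary occupied/free flag, each cell stores a signed distance to the nearest surface, which enables smooth surface reconstruction.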
Motion Cues: Optical Flow and Disparity
- Optical flow:
- 2D motion field between consecutive frames
- Useful for tracking objects and estimating ego-motion
- Stereo disparity:
- Difference between left/right camera images
- Converts to depth via triangulation
These signals help:
- Understand dynamic scenes
- Support SLAM and odometry
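The disparity-to-depth conversion is a one-line triangulation: z = f·B/d, where f is the focal length in pixels, B the stereo baseline in meters, and d the disparity in pixels. The f and B values below are illustrative, not from a real rig.

```python
# Stereo triangulation: depth is inversely proportional to disparity.
# focal_px and baseline_m are example values, not a real camera's.

def disparity_to_depth(disparity_px, focal_px=600.0, baseline_m=0.06):
    """Convert a stereo disparity (pixels) to metric depth (meters)."""
    if disparity_px <= 0:
        return float("inf")        # zero disparity means "at infinity"
    return focal_px * baseline_m / disparity_px

near = disparity_to_depth(60.0)    # large disparity -> close (0.6 m)
far = disparity_to_depth(6.0)      # small disparity -> far (6.0 m)
```

The inverse relationship explains a practical limit of stereo: at long range, disparity shrinks toward zero, so small matching errors translate into large depth errors.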
When to Use 2D vs 3D Representations
- 2D-centric (images, feature maps):
- Good for high-level semantics (object categories, attributes, actions)
- Efficient and mature tooling (CNNs, Vision Transformers)
- 3D-centric (point clouds, voxels, meshes):
- Essential for geometry (distances, free space, collisions)
- Critical for navigation, manipulation, and safety
In practice, your humanoid will use both:
- 2D representations for understanding what is in the scene
- 3D representations for understanding where things are and how to move safely
By the end of Module 1, you should have a clear mental model of:
- What your sensors provide
- How those signals are represented internally
- How perception transforms raw data into the world models that planning and control require.