research | Jane Yang

I study egocentric vision, visual representation learning, and perception in naturalistic environments. I combine large-scale egocentric video, computer vision, and behavioral experiments to understand how infants’ everyday visual experience shapes early concept development.

Characterizing infant visual experience

What do infants actually see, and how does it compare to the data that drives modern vision systems? I use naturalistic egocentric video (e.g., the BabyView dataset) to quantify objects, scenes, and activities in infants’ view and compare those statistics to standard vision datasets.

Quantifying infants’ everyday experiences with objects in a large corpus of egocentric videos (CCN 2025) — Automated object detection over 868+ hours of infant headcam video.
Characterizing the inputs to infants’ object category representations (VSS 2026) — How these inputs relate to early object categories; CLIP/DINO embeddings comparing infant experience to datasets like THINGS.
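
As a rough illustration of what comparing category statistics across corpora can look like, the sketch below computes the Jensen-Shannon divergence between two sets of object-category counts. It is a minimal example under stated assumptions, not the pipeline used in the papers above: the function names and the toy counts are hypothetical, and the choice of Jensen-Shannon divergence is an assumption made for illustration.

```python
import math


def normalize(counts, categories):
    """Turn raw category counts into a probability vector over `categories`."""
    total = sum(counts.get(c, 0) for c in categories)
    if total == 0:
        raise ValueError("counts must contain at least one detection")
    return [counts.get(c, 0) / total for c in categories]


def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) between two category-count dicts.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no categories in common.
    """
    categories = sorted(set(counts_a) | set(counts_b))
    p = normalize(counts_a, categories)
    q = normalize(counts_b, categories)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical toy counts, for illustration only (not real data).
headcam_counts = {"cup": 40, "chair": 30, "toy": 30}
reference_counts = {"cup": 10, "chair": 10, "toy": 10, "zebra": 70}
divergence = js_divergence(headcam_counts, reference_counts)
```

A value near 0 would indicate that the two corpora expose a learner to similar category frequencies, while a value near 1 would indicate largely non-overlapping category distributions.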

Visual–linguistic alignment

How well are the visual and linguistic streams aligned in naturalistic settings? I use multimodal models (e.g., CLIP) to measure that alignment in egocentric infant data.

Assessing the alignment between infants’ visual and linguistic experience using multimodal language models (Tan, Yang, et al., 2025) — Vision–language models to assess alignment over time and what makes naturalistic multimodal input learnable.
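
One common way to score visual-linguistic alignment with a CLIP-style model is the cosine similarity between an image embedding and the embedding of the co-occurring utterance. The sketch below assumes the embeddings have already been computed and are passed in as plain lists of floats. The function names are hypothetical, and this is an illustrative simplification, not a description of the method in the paper above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_alignment(frame_embeddings, utterance_embeddings):
    """Average cosine similarity over paired (frame, utterance) embeddings.

    Assumes frame i co-occurs with utterance i; how frames and utterances
    are paired in time is left to the caller.
    """
    if len(frame_embeddings) != len(utterance_embeddings):
        raise ValueError("need one utterance embedding per frame embedding")
    if not frame_embeddings:
        raise ValueError("need at least one pair")
    scores = [
        cosine_similarity(f, u)
        for f, u in zip(frame_embeddings, utterance_embeddings)
    ]
    return sum(scores) / len(scores)
```

Comparing this score against a baseline built from shuffled pairings (frames matched with utterances from other moments) is one way to ask whether the observed alignment exceeds chance.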

Attention, action, and learning

How do attention and action structure learning? For example, how do manual actions create visual saliency and support joint attention, and how does real-time attention relate to language?

Using manual actions to create visual saliency (CogSci 2023) — Manual actions create visual saliency and support joint attention from an “outside-in” perspective.
Learning semantic knowledge based on infant real-time attention and parent in-situ speech (CogSci 2024) — Linking infant gaze and parent speech to ask how attention shapes which semantic information is available during learning.

3D vision and object experience (in development)

How do manipulation and dyadic interaction shape the 3D view statistics that support object learning? I use 3D object reconstruction and 6DoF pose tracking to characterize viewpoint distributions during active vs. passive viewing, compare infant and caregiver view experiences during joint play, and test whether view variability predicts recognition and generalization to novel views and exemplars.
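
One simple way to put a number on "view variability" is the Shannon entropy of the viewpoints from which an object is seen. The sketch below assumes each tracked pose has already been reduced to an (azimuth, elevation) pair in degrees relative to the object. The function name, the 30-degree bin size, and the use of entropy are all illustrative assumptions, since this project is still in development. Equal-angle bins are also not equal-area on the sphere, which a fuller analysis would need to correct for.

```python
import math
from collections import Counter


def view_entropy(views, bin_deg=30.0):
    """Shannon entropy (bits) of binned viewpoints.

    `views` is a sequence of (azimuth_deg, elevation_deg) pairs, with
    elevation in [-90, 90]. Higher entropy means the object was seen
    from a more varied set of viewpoints.
    """
    views = list(views)
    if not views:
        raise ValueError("need at least one view")
    bins = Counter(
        (int((az % 360.0) // bin_deg), int((el + 90.0) // bin_deg))
        for az, el in views
    )
    n = sum(bins.values())
    return -sum((c / n) * math.log2(c / n) for c in bins.values())
```

Under these assumptions, a passively viewed object seen from a single angle would score 0 bits, while an object rotated through many orientations during active play would score higher.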

Methods and open science

Methods: Computer vision (object detection, pose estimation, multimodal embeddings), multimodal data fusion (head-mounted eye trackers, cameras, microphones), and behavioral experiments (e.g., eye-tracking).
Open science: Committed to data-driven, ecologically valid developmental psychology and reproducible pipelines on large-scale naturalistic data.