EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents

Anonymous Authors
Author information is temporarily withheld for anonymous review.
EmbodiedHead teaser figure
We present EmbodiedHead, which generates a real-time head-embodied avatar for LLMs. Unlike dual-audio methods, it uses a single audio stream with explicit listening-speaking state conditioning to achieve unified conversational behavior.

Interactive Demo

Try the live EmbodiedHead system directly in the browser. You can chat with an EmbodiedHead avatar whose replies are powered by Qwen.

If the interactive demo below is temporarily unavailable, you can still try the full experience at https://www.embodiedhead.xyz/.


Abstract

We present EmbodiedHead, a speech-driven talking-head framework that equips LLMs with real-time visual avatars for conversation. A practical embodied avatar must achieve real-time generation, unified listening-speaking behavior, and high rendered visual quality simultaneously. Our framework couples the first Rectified-Flow Diffusion Transformer (DiT) for this task with a differentiable renderer, enabling diverse, high-fidelity generation in as few as four sampling steps. Prior listening-speaking methods rely on dual-stream audio, introducing an interlocutor look-ahead dependency incompatible with causal user–LLM interaction. We instead adopt a single-stream interface with explicit per-frame listening-speaking state conditioning and a Streaming Audio Scheduler, suppressing spurious mouth motion during listening while enabling seamless turn-taking. A two-stage training scheme of coefficient-space pretraining and joint image-domain refinement further closes the gap between motion-level supervision and rendered quality. Extensive experiments demonstrate state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios.
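
The few-step sampling mentioned above can be illustrated with a minimal sketch of Euler integration along a rectified-flow velocity field. This is not the EmbodiedHead model: `euler_sample`, `make_toy_velocity`, and the three-element "motion coefficient" vector are hypothetical names and values chosen for illustration, and the toy velocity field stands in for a trained DiT. The sketch only shows why a nearly straight flow can be integrated in as few as four steps.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample) with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy stand-in for a learned velocity network (assumption, not the paper's DiT).

    On the straight path x_t = (1 - t) * noise + t * target, the velocity is
    target - noise, which equals (target - x_t) / (1 - t) for t < 1.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
target = [0.3, -0.1, 0.8]                    # hypothetical motion coefficients
noise = [random.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(make_toy_velocity(target), noise, num_steps=4)
```

Because the toy path is exactly straight, four Euler steps land on `target` up to floating-point error. A trained model only approximates a straight path, so the number of steps trades speed against accuracy.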

Overview

EmbodiedHead overview figure
Overview of EmbodiedHead. A Rectified-Flow DiT generates speech-driven talking-head animation in a few sampling steps, conditioned on the reference, the timestep, the motion magnitude, and the listening-speaking (LS) state. A streaming scheduler merges user and LLM audio into a single stream, enabling unified listening-speaking behavior.
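
As a rough illustration of the single-stream idea, the sketch below merges per-frame user and LLM audio chunks into one stream and tags each frame with a listening or speaking state. The page does not specify how the Streaming Audio Scheduler works internally, so everything here is an assumption made for illustration: the names `Frame`, `merge_streams`, `LISTENING`, and `SPEAKING`, the rule that LLM audio takes priority, and the use of silence when neither side has audio.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

LISTENING = 0  # avatar listens; the frame carries user audio (or silence)
SPEAKING = 1   # avatar speaks; the frame carries LLM audio


@dataclass
class Frame:
    audio: List[float]  # one chunk of audio samples for this frame
    state: int          # per-frame listening-speaking state


def merge_streams(
    user_chunks: Iterable[Optional[List[float]]],
    llm_chunks: Iterable[Optional[List[float]]],
    chunk_size: int,
) -> Iterator[Frame]:
    """Merge two causal chunk streams into one (hypothetical scheduling rule).

    A chunk of None means that side is silent for the frame. No look-ahead
    is used: each output frame depends only on the current pair of chunks.
    """
    silence = [0.0] * chunk_size
    for user, llm in zip(user_chunks, llm_chunks):
        if llm is not None:
            # Assumed priority: when the LLM has audio, the avatar speaks.
            yield Frame(audio=llm, state=SPEAKING)
        else:
            yield Frame(audio=user if user is not None else silence,
                        state=LISTENING)


user = [[0.1, 0.2], None, None]
llm = [None, [0.5, 0.6], None]
frames = list(merge_streams(user, llm, chunk_size=2))
states = [f.state for f in frames]
```

In this toy run the avatar listens to the user in the first frame, speaks in the second, and returns to listening over silence in the third. A downstream generator could read the state flag to suppress mouth motion on listening frames.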

EmbodiedHead Results

High-fidelity and Liveliness
Reference images for EmbodiedHead results 01 to 09
Comparison
Reference images for EmbodiedHead comparisons 01 to 03
Chatting with EmbodiedHead

BibTeX

@misc{anonymous2026embodiedhead,
  title  = {EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Author information withheld for anonymous review},
  eprint = {2604.17211},
  archivePrefix = {arXiv},
  url    = {https://arxiv.org/abs/2604.17211}
}