Sonal Labs · § ResearchFolio IV · MMXXVI

§ Research · Publication register

Who said
what, when.

Our agenda is narrow by design. Overlap-native speaker diarization is the first research bet — getting attribution right, frame by frame. From that base, we extend to speaker-attributed transcription, memory-object generation, and audio-visual memory.

Paper № 001 · Whitepaper · Sonal Labs Research · January 2026

Overlap-Native Speaker
Diarization.

As the foundation for wearable audio-visual memory systems

Overlapping speech — two or more people speaking simultaneously — remains a critical failure mode in speaker diarization and transcription systems. When overlap occurs, conventional pipelines drop words, truncate turns, or misattribute content to the wrong speaker, fundamentally undermining trust in any downstream automation.

This paper presents Sonal's technical approach: overlap-native speaker diarization designed for a wearable device that captures everyday life conversations — not just meetings, but walks, calls, family moments, errands, worship, and spontaneous interactions.

We survey established overlap-aware techniques including End-to-End Neural Diarization (EEND), which formulates diarization as multi-label classification, and Target-Speaker Voice Activity Detection (TS-VAD), which estimates per-speaker activity conditioned on speaker embeddings. We propose a reproducible baseline architecture built on open foundations, and outline an evaluation methodology that reports performance separately on overlap versus non-overlap regions.
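EEND's multi-label formulation can be sketched in a few lines: the model emits per-frame, per-speaker activity posteriors, so a frame with two active speakers is representable directly rather than as a hard single-speaker assignment. The posterior values and threshold below are illustrative, not output from any Sonal model.

```python
import numpy as np

# EEND emits per-frame, per-speaker activity posteriors (T x S).
# Diarization becomes multi-label classification: each frame may have
# zero, one, or several active speakers, so overlap is representable
# natively. Values are illustrative, not real model output.
posteriors = np.array([
    [0.9, 0.1],   # speaker A only
    [0.8, 0.7],   # overlap: A and B both active
    [0.2, 0.9],   # speaker B only
    [0.1, 0.1],   # silence
])

threshold = 0.5                      # common operating point
activity = posteriors > threshold    # boolean (T x S) activity matrix

speakers_per_frame = activity.sum(axis=1)
overlap_frames = speakers_per_frame > 1   # frames with simultaneous speech

print(overlap_frames)   # [False  True False False]
```

TS-VAD differs in its conditioning, not its output shape: each column of the activity matrix is estimated given an enrolled speaker embedding rather than discovered jointly.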

The framework is the foundation of Presence AI, our overlap-native LLM, and the prerequisite for every Sonal memory product. Everything downstream — summary, report, action — inherits the accuracy of this first layer.

Fig. 1 — Overlap event, 4.5 s excerpt. Speaker A in ink · speaker B in Sonal green.

§ Contributions

What the paper contributes.

I.

Problem formalisation

A precise statement of speaker diarization, overlapping speech, and the four failure modes under overlap: dropped content, misattribution, turn truncation, and speaker fragmentation.

II.

Survey of overlap-aware techniques

EEND (multi-label classification), TS-VAD (per-speaker activity), overlapped speech detection (OSD) as a preprocessing routing step, and diarization-guided source separation — with notes on productisation gaps.

III.

Reproducible baseline architecture

A five-stage system on open foundations: Whisper ASR → OSD → hybrid diarization core → identity stitching → attributed transcript with confidence metadata. Designed for wearable constraints.

IV.

Overlap-stratified evaluation methodology

The critical requirement: report DER, JER, and word-level metrics separately on overlap and non-overlap regions. Aggregate numbers hide the failure mode.
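The final stage of the baseline in contribution III — producing an attributed transcript with confidence metadata — can be sketched as aligning ASR words to diarized turns. Every name and data shape below is a hypothetical placeholder, not Sonal's implementation; stages 1–4 (Whisper ASR, OSD, the diarization core, identity stitching) are assumed to have produced the `words` and `turns` inputs.

```python
from dataclasses import dataclass

@dataclass
class Word:
    start: float   # seconds, from ASR timestamps
    end: float
    text: str

@dataclass
class Turn:
    start: float
    end: float
    speaker: str   # stitched identity label
    overlap: bool  # flagged by the OSD stage

def attribute(words, turns):
    """Assign each word to the turn with the greatest temporal overlap,
    carrying an attribution confidence (fraction of the word covered)."""
    out = []
    for w in words:
        best, best_ov = None, 0.0
        for t in turns:
            ov = min(w.end, t.end) - max(w.start, t.start)
            if ov > best_ov:
                best, best_ov = t, ov
        dur = w.end - w.start
        conf = best_ov / dur if best and dur > 0 else 0.0
        out.append((best.speaker if best else "?", w.text,
                    round(conf, 2), best.overlap if best else False))
    return out

# Illustrative data: two words, two turns, the second turn overlapped.
words = [Word(0.0, 0.4, "hello"), Word(0.5, 1.0, "there")]
turns = [Turn(0.0, 0.45, "A", False), Turn(0.4, 1.1, "B", True)]
print(attribute(words, turns))
```

A real system would resolve ties inside overlap regions with acoustic evidence rather than timestamps alone; the point here is the output contract: speaker, text, confidence, overlap flag.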

§ Benchmarks

What we measure against.

No single dataset captures “everyday life” audio. We evaluate on a portfolio that covers the axes that matter — meetings, telephone, in-the-wild, and acoustic robustness — with overlap-stratified reporting in every case.

AMI Meeting Corpus

~100 h of multimodal meeting recordings with multiple microphones.

VoxConverse

In-the-wild multi-speaker content with reported overlap statistics.

CALLHOME

Unscripted telephone conversations reflecting informal speech patterns.

DIHARD

Robustness-focused challenge with diverse acoustic conditions.
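Overlap-stratified reporting can be sketched at frame level: the same error count, split by whether the reference frame contains overlapped speech. The frames below are illustrative; the point is that an aggregate rate of 25% hides a 50% error rate inside overlap regions.

```python
# Frame-level sketch of overlap-stratified scoring. Each frame is the
# set of reference (or hypothesis) speakers active in it; data is
# illustrative, not from any benchmark.

ref = [{"A"}, {"A"}, {"A", "B"}, {"A", "B"}, {"B"}, set()]
hyp = [{"A"}, {"A"}, {"A"},      {"B"},      {"B"}, set()]

def frame_error(ref_f, hyp_f):
    """Missed + false-alarm + confused speakers in one frame."""
    return len(ref_f - hyp_f) + len(hyp_f - ref_f)

def stratified_rates(ref, hyp):
    buckets = {"overlap": [0, 0], "non_overlap": [0, 0]}  # [errors, speaker-frames]
    for r, h in zip(ref, hyp):
        key = "overlap" if len(r) > 1 else "non_overlap"
        buckets[key][0] += frame_error(r, h)
        buckets[key][1] += max(len(r), 1)
    return {k: e / n for k, (e, n) in buckets.items()}

print(stratified_rates(ref, hyp))  # {'overlap': 0.5, 'non_overlap': 0.0}
```

Standard DER tooling adds forgiveness collars and optimal speaker mapping; the stratification idea — score the two region types separately — is what this sketch isolates.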

§ Roadmap

Four phases, in order.

Phase I.

Overlap-Native Attribution

Current
  • OSD + overlap-aware diarization baseline
  • Overlap-stratified evaluation methodology
  • Published benchmarks and ablation results

Phase II.

Reliable Speaker-Attributed Transcription

Next
  • Improved identity stitching and speaker stability
  • cpWER / tcpWER evaluation for joint diarization + ASR
  • Wearable-specific acoustic adaptation
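The cpWER metric named above can be sketched directly: concatenate each speaker's words into one stream, then minimise word error rate over all mappings of hypothesis speaker labels to reference speakers, so that speaker confusion surfaces as word errors. The transcripts are illustrative, and the sketch assumes equal speaker counts on both sides.

```python
from itertools import permutations

def edit_distance(a, b):
    """Word-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def cpwer(ref, hyp):
    """cpWER sketch. ref, hyp: dicts of speaker label -> word list.
    Assumes the same number of speakers in ref and hyp."""
    ref_streams = list(ref.values())
    n_ref_words = sum(len(s) for s in ref_streams)
    best = float("inf")
    for perm in permutations(hyp.values()):
        errs = sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        best = min(best, errs)
    return best / n_ref_words

# Hypothesis labels differ from reference labels, but the optimal
# speaker mapping recovers a perfect score.
ref = {"A": "we should ship friday".split(), "B": "no monday".split()}
hyp = {"s1": "no monday".split(), "s2": "we should ship friday".split()}
print(cpwer(ref, hyp))   # 0.0
```

tcpWER additionally constrains the mapping with timestamps; the permutation search here is the "minimum-permutation" part both metrics share.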

Phase III.

Memory Object Generation

In design
  • Action items, reminders, follow-ups with evidence grounding
  • Confidence-aware UI with review workflows for uncertain attribution
  • Searchable conversation memory

Phase IV.

Audio-Visual Memory

Horizon
  • Visual context integration for disambiguation and enriched recall
  • Episodic memory retrieval capabilities
  • Audio-visual diarization cues (lip movement, facial activity)

§ Publications

The register.

  • 2026

    001

    Overlap-Native Speaker Diarization as the Foundation for Wearable Audio-Visual Memory Systems

    Sonal Labs Whitepaper · v1.0 · Tim Uzua

    Foundational paper. Draft available on request.

  • forthcoming

    002

    Overlap-stratified evaluation: a protocol for speaker-attributed transcription benchmarks

    Sonal Labs Technical Report

    Benchmark and baselines across AMI, VoxConverse, CALLHOME, and DIHARD.

  • forthcoming

    003

    Presence AI — model card (preview)

    Sonal Labs

    Accompanies the first public Presence release.

  • forthcoming

    004

    Multimodal memory: extending attribution to vision and environment

    Sonal Labs Working Paper

    Direction of travel. Early drafts with research partners.

Research partners

Working in speech, ML systems, or applied memory? Write to us with a short paragraph — we read everything.

research@sonallabs.com