Sonal Labs · Presence AI · Model card (preview) · MMXXVI · Folio III

§ Foundation model

The model that knows who said what.

Presence AI is the research engine at the centre of every Sonal memory system. Its job is attribution — knowing, frame by frame, who is holding the floor and what they said, through overlap, interruption, and repair.

Built on EEND, TS-VAD and open ASR foundations. Measured on overlap-stratified benchmarks, not aggregate numbers that hide the failure mode.
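To make "overlap-stratified" concrete: the sketch below scores a diarization hypothesis on a 10 ms frame grid, reporting miss and false-alarm rates separately for overlapped and single-speaker regions. It is a minimal illustration in plain Python; the frame size, function names and the oracle label-mapping assumption are ours for illustration, not Sonal's evaluation harness.

```python
from collections import Counter

FRAME = 0.01  # 10 ms evaluation grid; an illustrative choice

def to_frames(segments, duration):
    """segments: list of (start, end, speaker), times in seconds.
    Returns one set of active speakers per frame."""
    n = int(duration / FRAME)
    frames = [set() for _ in range(n)]
    for start, end, speaker in segments:
        for i in range(int(start / FRAME), min(int(end / FRAME), n)):
            frames[i].add(speaker)
    return frames

def stratified_rates(reference, hypothesis, duration):
    """Frame-level miss and false-alarm rates, bucketed by whether the
    reference frame is overlapped. Assumes hypothesis speaker labels are
    already mapped onto reference labels, so a wrong label counts as
    both a miss and a false alarm."""
    ref, hyp = to_frames(reference, duration), to_frames(hypothesis, duration)
    stats = {"overlap": Counter(), "single": Counter()}
    for r, h in zip(ref, hyp):
        bucket = stats["overlap" if len(r) > 1 else "single"]
        bucket["ref"] += len(r)
        bucket["miss"] += len(r - h)  # active in reference, not attributed
        bucket["fa"] += len(h - r)    # attributed, but not active
    return {k: {"miss": c["miss"] / max(c["ref"], 1),
                "fa": c["fa"] / max(c["ref"], 1)}
            for k, c in stats.items()}
```

A system that routinely drops the quieter speaker during overlap shows a high miss rate in the overlap bucket while the single-speaker bucket stays clean, which is exactly what an aggregate number averages away.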

Fig. — Presence AI, runtime trace

Bi-directional

Two parallel streams — listener state and speaker projection — resolved at every frame. The model never stops doing both; that's what keeps attribution right.

Overlap-native

Treats overlap as a first-class event

Attribution-first

Who said what, frame by frame

Open foundations

Whisper · EEND · TS-VAD · OSD

Audio-visual ready

Lip movement + facial cues on the roadmap

§ 01 · Why this model

The Cocktail Party Problem.

Picture a consultation. The doctor is halfway through a question when the patient interrupts with the answer. The doctor says “good, good” while the patient is still talking, and asks a follow-up. Five seconds of audio; four speaker-turns; two simultaneous utterances. A conventional system hears one of them — and assigns the wrong person to the clinical fact.

That mis-attribution is the thing that breaks every downstream action. A report that records the patient denying a symptom when it was the doctor who said the words. A deposition transcript that collapses two speakers into one. A care note that remembers the carer's reminder, not the patient's promise.

Presence AI is designed for that frame. It treats overlap as a first-class event to be modelled, not an error to be suppressed. It projects where each voice is going, keeps its hypothesis of who is holding the floor up to date, and recovers gracefully when it gets it wrong — the same way people do.

Get that right, and the memory that follows can be trusted.

§ 02 · What conventional systems break

Four failure modes, named.

From whitepaper § 2.3. These are the failure modes overlap-native diarization has to solve — and the ones any “handles overlap” claim has to be benchmarked against.

i.

Dropped content

One speaker's words are entirely absent from the transcript — often the quieter or shorter utterance.

ii.

Misattribution

Words are assigned to the wrong speaker. Particularly problematic for commitments and decisions.

iii.

Turn truncation

Segment boundaries are placed at overlap onset, cutting a speaker off mid-sentence.

iv.

Speaker fragmentation

A single speaker is split into multiple identities due to embedding instability during mixed speech.

§ 03 · Architecture

Five stages, open foundations.

1.

ASR Foundation

Whisper-based transcription produces word sequences with timestamps — the text content to be attributed.

2.

Overlap Detection (OSD)

Identifies temporal regions with multiple simultaneous speakers, enabling differentiated processing.

3.

Diarization Core

Non-overlap regions use efficient embedding-based clustering; overlap regions employ EEND-style multi-label inference or TS-VAD refinement.

4.

Identity Stitching

Cross-window speaker identity alignment ensures stable speaker labels across the full recording.

5.

Transcript Assembly

Speaker-attributed transcript with explicit overlap markers and confidence metadata — every word, every speaker, every moment of uncertainty. A minimal sketch of this assembly step follows.
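The first four stages are learned models; the fifth is bookkeeping, and it is where the output contract shows. Below is a minimal, self-contained sketch of that assembly step. The AttributedWord record, the assemble function and its input shapes are hypothetical illustrations of the behaviour described above, not the Presence API.

```python
from dataclasses import dataclass

@dataclass
class AttributedWord:
    text: str
    start: float
    end: float
    speaker: str       # stable label from identity stitching (stage 4)
    confidence: float  # attribution confidence, carried, never hidden
    in_overlap: bool   # explicit overlap marker from OSD (stage 2)

def assemble(words, turns, overlap_regions):
    """Stage 5 sketch: attribute each ASR word (stage 1) to the speaker
    turn (stages 3-4) it overlaps most in time, keeping overlap and
    uncertainty explicit.

    words:           [(text, start, end)]
    turns:           [(start, end, speaker, confidence)]
    overlap_regions: [(start, end)] from overlap detection (stage 2)
    """
    def shared(a0, a1, b0, b1):
        return max(0.0, min(a1, b1) - max(a0, b0))

    out = []
    for text, w0, w1 in words:
        # the turn with the greatest temporal overlap with this word
        best = max(turns, key=lambda t: shared(w0, w1, t[0], t[1]))
        covered = shared(w0, w1, best[0], best[1])
        speaker, conf = (best[2], best[3]) if covered > 0 else ("?", 0.0)
        out.append(AttributedWord(
            text, w0, w1, speaker, conf,
            in_overlap=any(shared(w0, w1, s, e) > 0
                           for s, e in overlap_regions)))
    return out
```

The point of the shape: confidence and the overlap flag travel with every word, so a downstream report can tell a clean attribution from a contested one.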

§ 04 · Capabilities

What it does differently.

i.

Turn projection

Continuously predicts where the current turn is going and when it will end, conditioned on prosody, syntax and context. A toy sketch follows this list.

ii.

Floor management

Decides whether to hold, yield, ask to continue, or interrupt — the same decisions a good human interlocutor makes, trained as an explicit objective.

iii.

Repair

Handles self- and other-repair as structure to preserve, not errors to clean up.

iv.

Backchannels

‘mm’, ‘yeah’, ‘right’ — the small sounds of listening. Produced and interpreted in context, without parroting.

v.

Prosody as semantics

Intonation, stress, pace and laughter modelled as meaning. Irony, uncertainty, question-hood — all first-class.

vi.

Multilingual + code-switched

Turn dynamics preserved across languages, including in live interpretation and code-switched dialogue.
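Capability i can be made concrete with a toy model. The sketch below scores end-of-turn probability from three hand-picked prosodic cues; the features, weights and bias are invented for illustration and stand in for what is, in practice, a learned sequence model over prosody, syntax and context.

```python
import math

def turn_end_probability(f0_slope, pause_ms, speech_rate_ratio):
    """Toy end-of-turn projector.

    f0_slope:          pitch slope over the last ~500 ms (falling < 0)
    pause_ms:          duration of the current silence, in milliseconds
    speech_rate_ratio: recent speech rate / speaker baseline (slowing < 1)
    """
    score = (-2.0 * f0_slope                      # falling pitch: ending
             + 0.01 * pause_ms                    # lengthening pause: ending
             - 1.5 * (speech_rate_ratio - 1.0))   # slowing down: ending
    return 1.0 / (1.0 + math.exp(-(score - 3.0)))  # bias: turns usually continue
```

A floor-management policy (capability ii) can then gate on this signal: hold or backchannel while the projected probability is low, and prepare to take the floor as it rises.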

§ 05 · Wearable constraints

Meeting-room models don't survive a pocket.

A wearable device operates under constraints meeting-room systems don't face: variable microphone position (pocket, lapel, ear), motion artefacts, unpredictable acoustic environments, limited on-device compute, and battery-life considerations.

• Adaptive thresholds calibrated to wearable microphone characteristics
• Hybrid on-device / cloud processing with intelligent batching
• Confidence-aware outputs — uncertain attribution is flagged for review, not surfaced as fact (a minimal sketch follows this list)
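What "flagged for review, not surfaced as fact" can mean at the output boundary, as a minimal sketch. The threshold and the rendering are illustrative; in practice the cut-off would be calibrated to the device and microphone placement described above.

```python
REVIEW_THRESHOLD = 0.70  # illustrative; calibrated per device in practice

def render_word(speaker, text, confidence):
    """Render one speaker-attributed word for a downstream record.
    Low-confidence attribution is kept, but marked for human review
    rather than presented as established fact."""
    if confidence >= REVIEW_THRESHOLD:
        return f"{speaker}: {text}"
    # keep the content, surface the doubt
    return f"[review: speaker uncertain, p={confidence:.2f}] {text}"
```

For example, render_word("patient", "no chest pain", 0.41) yields "[review: speaker uncertain, p=0.41] no chest pain" rather than an unqualified clinical statement.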

§ 06 · Direction of travel

Audio first. Vision next.

Attribution doesn't end at the microphone. A pointed finger, a head-nod, a glance at the chart — these resolve ambiguity that speech alone leaves open. Audio-visual TS-VAD research already shows that lip movement and facial cues improve diarization under acoustic overlap.

Our working hypothesis is that memory worth keeping is multimodal. The first Presence release is a speech model. Subsequent releases extend the same attribution objective to vision and environmental context.

Model card (preview) · Sonal Labs · Presence AI · MMXXVI

Modality

Speech in · speech out, with text co-conditioning. Vision co-conditioning on the roadmap.

Access

Evaluation licence for research and design partners.

Safety

Consent, provenance and opt-out are first-class. No training on identified voices without permission. Redaction supported at the record level.

The full model card ships alongside the first public Presence release. Evaluation partners receive an extended version under NDA, including attribution-accuracy metrics on clinical and legal benchmarks.

§ Closing

A model isn't conversational because it has a voice.
It's conversational because it knows who's talking.