Character Embeddings in AI Video: A Technical Primer

How character embeddings actually work in AI video systems — the architecture, the trade-offs, and the open problems.

May 24, 2026·9 min read·technical

This article is for engineers — researchers, ML practitioners, and developers building or evaluating AI video tooling. If you want a non-technical overview of why character consistency matters, start with the complete guide.

Here we’ll walk through how character embedding systems actually work in modern AI video stacks: the architecture, the design decisions, the failure modes, and the open problems we haven’t solved yet.

The problem statement

Given a generative video model M and a character C, we want a procedure such that for any prompt p_i in a sequence p_1, p_2, …, p_n that references C, the generated outputs all preserve C’s identity.

The naive approach, including the character description in every prompt, fails because diffusion sampling is stochastic and prompts describe categories, not identities. Each generation is an independent draw from the distribution of characters that match the description, so identity drifts across draws.

We need a way to condition the model’s output on a specific learned identity, not just a description.

The architecture

A modern character consistency system has six components:

1. Feature extraction       — produce identity embedding from reference
2. Storage                  — persist embedding tied to character_id
3. Negative prompt synthesis — auto-build negative_prompts from drift catalog
4. Conditioning injection   — inject embedding into model conditioning
5. Generation               — diffusion sampling with conditioned model
6. Consistency verification — post-hoc similarity check, regenerate if needed

Let’s walk through each.

1. Feature extraction

On character upload, we run multiple specialized models against the reference image:

Face encoder: ArcFace, FaceNet, or similar. Outputs an identity embedding (typically 512-dim) trained for face recognition. Captures identity features that are largely invariant to pose, lighting, and expression.
Body parser: PIFu or Sapiens for body proportions and posture. Lower-dimensional vector encoding height, build, posture.
Appearance encoder: CLIP image encoder for hair color, skin tone, clothing style. 768-dim semantic embedding.
Style classifier: separately encodes whether the reference is realistic, stylized, animated, etc. Small categorical vector.

These are concatenated (or attended together) into a high-dimensional character embedding e_C. Total dimensionality is typically 1500-3000.
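A minimal sketch of the concatenation variant, assuming each encoder has already returned a plain list of floats. The function names, the per-block L2 normalization, and the toy dimensions are illustrative assumptions rather than the pipeline described here; a real system would more likely operate on NumPy or PyTorch tensors.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_character_embedding(parts: Dict[str, Sequence[float]],
                              order: Sequence[str]) -> List[float]:
    """Concatenate per-encoder vectors in a fixed order.

    Each block is normalized first (an assumption) so that no single
    encoder dominates just because its raw outputs have a larger scale.
    """
    embedding: List[float] = []
    for name in order:
        embedding.extend(l2_normalize(parts[name]))
    return embedding


# Toy stand-ins for encoder outputs; real vectors would be 512-dim, 768-dim, etc.
parts = {
    "face": [3.0, 4.0],
    "body": [1.0, 0.0, 0.0],
    "appearance": [0.0, 2.0],
    "style": [1.0],
}
e_c = build_character_embedding(parts, ["face", "body", "appearance", "style"])
print(len(e_c))  # 8
print(e_c[:2])   # [0.6, 0.8]
```

Keeping the block order fixed matters: downstream components can only slice out the face or appearance portion if every stored embedding uses the same layout.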

Why multiple models instead of one? Because identity has multiple axes that no single encoder captures fully. Face encoders are great at “is this the same face?” but oblivious to body proportions. Body parsers are oblivious to face details. CLIP is great at semantic appearance but loses fine identity. Concatenating the outputs gives you complementary coverage across these axes.

Trade-off: a more complex extraction pipeline means more compute on character upload (~30-90 seconds in our system). For consumer-facing tools this is fine. For high-throughput pipelines, you can pre-compute embeddings once at upload and reference them at generation.

2. Storage

Each character is stored as (character_id, embedding_vector, metadata). Metadata includes:

Source reference image (for debugging and re-extraction)
Owner / project association
Sub-variant pointers (more on this in the form-variant section)
Style anchors (for cross-style work)
Drift mode override list (per-character customizations)

Storage is typically a vector database (Pinecone, Qdrant, Weaviate) or a custom indexed structure. Lookups need to be fast — sub-100ms — because they happen on every generation.
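As a rough illustration of the record shape, here is a sketch using a dataclass and an in-memory dictionary as a stand-in for the vector database. All field and class names are hypothetical; only the (character_id, embedding_vector, metadata) structure and the metadata categories come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CharacterRecord:
    character_id: str
    embedding: List[float]
    source_image_uri: str   # kept for debugging and re-extraction
    owner_id: str           # owner / project association
    variant_ids: List[str] = field(default_factory=list)      # sub-variant pointers
    style_anchors: Dict[str, List[float]] = field(default_factory=dict)
    drift_overrides: List[str] = field(default_factory=list)  # per-character customizations


class InMemoryCharacterStore:
    """Stand-in for a real database; lookups are by exact character_id."""

    def __init__(self) -> None:
        self._records: Dict[str, CharacterRecord] = {}

    def put(self, record: CharacterRecord) -> None:
        self._records[record.character_id] = record

    def get(self, character_id: str) -> Optional[CharacterRecord]:
        return self._records.get(character_id)


store = InMemoryCharacterStore()
store.put(CharacterRecord("char_001", [0.6, 0.8], "refs/char_001.png", "project_a"))
record = store.get("char_001")
print(record.owner_id if record else "missing")  # project_a
```

Note that the generation-time lookup is a key lookup by character_id, not a nearest-neighbor search, so the latency budget is usually easy to meet with any indexed store.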

For privacy-sensitive deployments, embeddings can be stored encrypted with per-tenant keys. Extraction is lossy, so the exact reference image can’t be recovered from the embedding, but published inversion attacks have reconstructed recognizable faces from face-recognition embeddings. Treating embeddings as PII is the right default for systems handling real people.

3. Negative prompt synthesis

This is the non-obvious part of the system, and where most of the engineering work lives.

We maintain a catalog of common drift modes — categorical failure types observed across thousands of generations. For each mode, we have a corresponding negative_prompt fragment that suppresses that failure.

Examples from our catalog:

Drift mode	Negative prompt fragment
Eye color shift (brown → green)	“green eyes, hazel eyes” (when reference is brown)
Jawline narrowing	“narrow jaw, weak chin, soft jawline”
Hairline retreat	“high hairline, thinning hair, receding hairline”
Skin tone warming	“warm skin tone, golden complexion” (when reference is cool)
Asymmetry creep	“asymmetric face, uneven features”
Eye spacing shift	“wide-set eyes, close-set eyes”

Building this catalog requires labeled data. We labeled ~10,000 generations from public AI video tools (Runway, Pika, Sora, etc.) with the specific drift modes that appeared. Clustering produced ~30 distinct modes covering ~85% of observed drift.
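To check how much drift a catalog of a given size covers, one option is to count the labeled modes and measure the share explained by the top k. The sketch below assumes each generation has already been assigned a single mode label (the clustering step itself is not shown), and the labels are toy values for illustration, not data from the labeling effort.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def mode_coverage(labels: Iterable[str], top_k: int) -> Tuple[List[str], float]:
    """Return the top_k most frequent drift modes and the share of labels they cover."""
    counts = Counter(labels)
    total = sum(counts.values())
    top = counts.most_common(top_k)
    covered = sum(n for _, n in top)
    return [mode for mode, _ in top], (covered / total if total else 0.0)


# Toy labels for illustration only.
labels = (["eye_color_shift"] * 5 + ["jawline_narrowing"] * 3
          + ["hairline_retreat"] * 1 + ["other"] * 1)
modes, share = mode_coverage(labels, top_k=2)
print(modes)  # ['eye_color_shift', 'jawline_narrowing']
print(share)  # 0.8
```

Plotting this share against k shows where adding more catalog entries stops paying off.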

For each generation, the system:

Retrieves the character’s reference attributes
Computes the “opposite” of each attribute (e.g., if reference has dark eyes, opposite is light eyes)
Constructs a per-character negative prompt assembling the relevant drift suppressors

The result is a much stronger conditioning signal than vanilla prompt-only generation.
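The three steps above can be sketched as a lookup against the catalog. The fragments are copied from the example table; the dictionary layout, attribute keys, and the choice to treat jawline and asymmetry suppressors as unconditional are assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: attribute -> reference value -> fragment that suppresses
# drift toward the "opposite" of that value.
CONDITIONAL_FRAGMENTS: Dict[str, Dict[str, str]] = {
    "eye_color": {"brown": "green eyes, hazel eyes"},
    "skin_tone": {"cool": "warm skin tone, golden complexion"},
}

# Fragments applied regardless of the reference attributes (an assumption).
UNCONDITIONAL_FRAGMENTS: List[str] = [
    "narrow jaw, weak chin, soft jawline",
    "asymmetric face, uneven features",
]


def synthesize_negative_prompt(attributes: Dict[str, str],
                               overrides: Sequence[str] = ()) -> str:
    """Assemble a per-character negative prompt from the drift catalog."""
    fragments: List[str] = []
    for attr, value in attributes.items():
        fragment = CONDITIONAL_FRAGMENTS.get(attr, {}).get(value)
        if fragment:
            fragments.append(fragment)
    fragments.extend(UNCONDITIONAL_FRAGMENTS)
    fragments.extend(overrides)  # per-character drift mode overrides
    # De-duplicate while preserving order.
    return ", ".join(dict.fromkeys(fragments))


negative = synthesize_negative_prompt({"eye_color": "brown", "skin_tone": "warm"})
print(negative)
```

In this example the skin tone fragment is skipped because the reference is warm, which is the point of conditioning fragments on reference attributes: a suppressor that contradicts the reference would cause drift rather than prevent it.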

4. Conditioning injection

Different video models accept conditioning differently:

Reference-image-based models (most public APIs): you can pass a reference image; we encode the embedding back into a “synthetic reference image” via a learned projection, then pass that.
Text-only conditioning: pass a learned soft-prompt projection of the embedding.
Model-level access (when available): inject the embedding directly into cross-attention layers, similar to IP-Adapter conditioning.

In our experience, model-level injection is far more effective than the reference-image route, but most public APIs don’t expose this depth of access. Working within the API surface available to us, we’ve found that combining a strong negative prompt with a reference-image-encoded embedding gets us 80-90% of the way to model-level injection.

This is partly why building a character consistency layer is meaningful even when you don’t control the underlying model — there’s significant headroom in the conditioning surface area that public APIs already expose.
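One way to organize the three conditioning paths is a small dispatcher that picks the deepest mechanism a target model exposes. The capability names and the preference order (cross-attention, then reference image, then soft prompt) are assumptions for this sketch; the returned strings are placeholders for whatever injection routines a real system would call.

```python
from enum import Enum
from typing import FrozenSet


class Capability(Enum):
    CROSS_ATTENTION_ACCESS = "cross_attention_access"
    REFERENCE_IMAGE = "reference_image"
    TEXT_ONLY = "text_only"


def choose_injection(capabilities: FrozenSet[Capability]) -> str:
    """Pick the deepest conditioning path the target model exposes."""
    if Capability.CROSS_ATTENTION_ACCESS in capabilities:
        return "cross_attention_injection"
    if Capability.REFERENCE_IMAGE in capabilities:
        return "synthetic_reference_image"
    return "soft_prompt_projection"


public_api = frozenset({Capability.REFERENCE_IMAGE, Capability.TEXT_ONLY})
print(choose_injection(public_api))  # synthetic_reference_image
```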

5. Generation

Standard diffusion sampling, with the caveat that the conditioning is now a combination of:

Original prompt (scene, action, framing)
Character embedding (injected via mechanism above)
Negative prompt (auto-synthesized)
Style anchor (if applicable for the segment)

Generation cost is typically 1.0-1.2× a vanilla generation. The marginal cost is small.

6. Consistency verification

After generation, we run a separate identity model (typically the same face encoder used in step 1) against the output. We compute cosine similarity between the output’s identity embedding and the original reference embedding.

Threshold: typically 0.85 cosine similarity. Above threshold, the output is accepted. Below threshold, we trigger regeneration with stricter conditioning (higher negative prompt weight, stronger embedding injection).
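A minimal sketch of the accept-or-regenerate loop, assuming a hypothetical generate(strength) callable that runs one generation and returns the identity embedding of its output. The single strength knob, the step size, and the attempt cap are illustrative assumptions; the 0.85 threshold is the one quoted above.

```python
import math
from typing import Callable, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def generate_with_verification(
    generate: Callable[[float], Sequence[float]],
    reference: Sequence[float],
    threshold: float = 0.85,
    max_attempts: int = 3,
    strength_step: float = 0.25,
) -> Tuple[Optional[Sequence[float]], float, int]:
    """Regenerate with stronger conditioning until the output passes.

    Returns (embedding, similarity, attempts_used). If no attempt passes,
    the best-scoring attempt is returned so the caller can decide what to do.
    """
    best, best_sim = None, -1.0
    strength = 1.0
    for attempt in range(1, max_attempts + 1):
        out = generate(strength)
        sim = cosine_similarity(out, reference)
        if sim > best_sim:
            best, best_sim = out, sim
        if sim >= threshold:
            return out, sim, attempt
        strength += strength_step  # stricter conditioning on the next try
    return best, best_sim, max_attempts


# Fake generator: drifts at strength 1.0, matches the reference at 1.25.
def fake_generate(strength: float) -> Sequence[float]:
    return [1.0, max(0.0, 1.25 - strength) * 4]


result, similarity, attempts = generate_with_verification(fake_generate, [1.0, 0.0])
print(similarity, attempts)  # 1.0 2
```

Capping attempts keeps worst-case cost bounded; what to do with a shot that never passes (surface it, flag it, or drop it) is a product decision outside this sketch.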

This adds ~5-10% generation cost on average (most shots pass first try) and prevents the worst drift cases from reaching the user.
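The overhead figure can be sanity-checked with a short calculation. The sketch assumes every attempt fails the check independently with the same probability and ignores the cost of the verification encoder itself; both are simplifications, since retries use stricter conditioning and should fail less often. The failure rates are inferred from the quoted overhead, not measured values.

```python
def expected_generations(fail_rate: float, max_attempts: int) -> float:
    """Expected generations per shot when each attempt fails independently
    with probability fail_rate and retries stop at max_attempts."""
    if not 0.0 <= fail_rate < 1.0:
        raise ValueError("fail_rate must be in [0, 1)")
    # Attempt k+1 runs only if the previous k attempts all failed.
    return sum(fail_rate ** k for k in range(max_attempts))


for p in (0.05, 0.10):
    overhead = expected_generations(p, max_attempts=3) - 1.0
    print(f"fail_rate={p:.2f} -> overhead {overhead:.1%}")
```

Under these assumptions a first-pass failure rate of 5-10% yields roughly 5-11% extra generations, which is in line with the quoted range.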

What works well, what doesn’t

What works:

30+ shots of a single character with high consistency, on standard scene variation
Character library reuse across projects (one extraction, infinite reuse)
Cross-scene and cross-style consistency (same character_id, same identity across different scenes and styles, within reasonable bounds)
Multi-character scenes with distinct features (different age, gender, ethnicity)

What’s harder:

Form variants: same character but injured, aged, in different clothing. We use sub-embeddings keyed off the master, where the master encodes invariant identity and the sub encodes the delta. Works for moderate variation; breaks at large transformations (e.g., 8-year-old version of the same character).
Identity bleed in multi-character scenes: when two locked characters share a frame and have similar features (both 30-year-old Asian women, for instance), about 10% of generations show partial feature bleed.
Cross-style coherence: locked realistic character placed in a stylized “watercolor” segment. Solved partially via per-segment style anchors, but degradation is visible.
Animal / non-human characters: the same architecture applies, but face encoder quality drops sharply outside human faces.
Long-form coherence beyond ~3 minutes: drift suppression works per-shot, but accumulated subtle differences across 50+ shots can still produce minor visible inconsistency to a careful viewer.
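For the form-variant item above, the master-plus-delta idea can be sketched as follows. Treating the delta as a simple additive offset in embedding space is an assumption made for illustration; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CharacterVariants:
    master: List[float]                                          # invariant identity
    deltas: Dict[str, List[float]] = field(default_factory=dict)  # variant -> offset

    def resolve(self, variant: str = "") -> List[float]:
        """Return master + delta for a named variant, or the master alone."""
        if not variant:
            return list(self.master)
        delta = self.deltas[variant]
        if len(delta) != len(self.master):
            raise ValueError("delta must match master dimensionality")
        return [m + d for m, d in zip(self.master, delta)]


character = CharacterVariants(
    master=[0.6, 0.8, 0.0],
    deltas={"injured": [0.0, 0.0, 0.5]},
)
print(character.resolve("injured"))  # [0.6, 0.8, 0.5]
```

A small additive offset keeps the variant close to the master, which fits the observation that moderate variation works while large transformations break: a child version of a character is unlikely to be a small offset from the adult identity.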

Open research problems

If you’re working in this space, here are problems we’d want to see solved:

Form-variant invariants. What’s the right learned representation that captures identity-invariant facial structure while allowing arbitrary state transformations?
Active drift detection during sampling. Current consistency checks are post-hoc. Can we detect drift during the diffusion process and correct mid-sampling?
Implicit-vs-explicit identity tradeoff. When does training a small per-character LoRA outperform embedding-based conditioning? Where’s the boundary?
Multi-character interaction modeling. How do we capture not just two locked identities but their relationship dynamics in a way that holds across shots?
Identity uncertainty quantification. When the model is unsure about identity, can it surface that uncertainty rather than producing a confident drift?

If you’re working on any of these and want to compare notes, the team behind Juying is genuinely interested. Reach out.

Practical advice for builders

If you’re considering building a character consistency layer for your own product, three pieces of advice:

1. Start with a negative prompt catalog. This is the highest-impact, lowest-cost win. You don’t need model-level access; most public APIs expose a negative prompt. Spend a week labeling 1,000 generations and you’ll have a catalog covering most drift.

2. Don’t underestimate post-hoc verification. Adding a simple “regenerate if similarity < 0.85” loop catches the worst 10% of failures and dramatically improves perceived quality. It’s the cheapest 90/100 → 95/100 quality bump available.

3. Invest in storage early. Treating character embeddings as persistent assets is the architectural insight that compounds. Build the right primitives once and every future feature (style locks, scene libraries, asset reuse) extends naturally.