Character Embeddings in AI Video: A Technical Primer

How character embeddings actually work in AI video systems — the architecture, the trade-offs, and the open problems.


This article is for engineers: researchers, ML practitioners, and developers building or evaluating AI video tooling. If you want a non-technical overview of why character consistency matters, start with the complete guide.

Here we'll walk through how character embedding systems actually work in modern AI video stacks: the architecture, the design decisions, the failure modes, and the open problems we haven't solved yet.

The problem statement

Given a generative video model M and a character C, we want a procedure such that for any prompt p_i in a sequence p_1, p_2, ..., p_n that references C, the generated outputs all preserve C's identity.

The naive approach (include the character description in every prompt) fails because diffusion sampling is stochastic and prompts describe categories, not identities. Each generation is a draw from the distribution of valid characters matching the description; identity drifts across draws.

We need a way to condition the model's output on a specific learned identity, not just a description.

The architecture

A modern character consistency system has six components:

1. Feature extraction       — produce identity embedding from reference
2. Storage                  — persist embedding tied to character_id
3. Negative prompt synthesis — auto-build negative_prompts from drift catalog
4. Conditioning injection   — inject embedding into model conditioning
5. Generation               — diffusion sampling with conditioned model
6. Consistency verification — post-hoc similarity check, regenerate if needed

Let's walk through each.

1. Feature extraction

On character upload, we run multiple specialized models against the reference image, including a face encoder, a body parser, and CLIP.

These are concatenated (or attended together) into a high-dimensional character embedding e_C. Total dimensionality is typically 1500-3000.

Why multiple models instead of one? Because identity has multiple axes that no single encoder captures fully. Face encoders are great at "is this the same face?" but oblivious to body proportions. Body parsers are oblivious to face details. CLIP is great at semantic appearance but loses fine identity. Concatenating gives you complementary coverage.
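As a minimal sketch of the concatenation step, the snippet below joins per-encoder vectors into one e_C. The encoder names, the tiny toy vectors, and the choice to L2-normalize each part before concatenating (so that no single encoder dominates by scale) are assumptions for illustration, not a description of any particular production pipeline.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def build_character_embedding(
    parts: Dict[str, Sequence[float]], order: Sequence[str]
) -> List[float]:
    """Concatenate per-encoder vectors in a fixed order.

    Normalizing each part first is an assumption of this sketch: it keeps
    one encoder's output scale from swamping the others.
    """
    embedding: List[float] = []
    for name in order:
        embedding.extend(l2_normalize(parts[name]))
    return embedding


# Toy stand-ins for real encoder outputs (real ones have hundreds of dims).
parts = {
    "face": [3.0, 4.0],
    "body": [0.0, 2.0, 0.0],
    "clip": [1.0, 1.0, 1.0, 1.0],
}
e_C = build_character_embedding(parts, order=("face", "body", "clip"))
print(len(e_C), e_C)
```

A fixed `order` matters: embeddings are only comparable across characters if every slice of the vector always comes from the same encoder.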

Trade-off: a more complex extraction pipeline means more compute on character upload (~30-90 seconds in our system). For consumer-facing tools this is fine. For high-throughput pipelines, you can pre-compute embeddings once at upload and reference them at generation.

2. Storage

Each character is stored as (character_id, embedding_vector, metadata). Metadata includes:

Storage is typically a vector database (Pinecone, Qdrant, Weaviate) or a custom indexed structure. Lookups need to be fast (sub-100ms) because they happen on every generation.
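To make the record shape concrete, here is a small in-memory sketch of the (character_id, embedding_vector, metadata) tuple and a keyed lookup. The class names and the `display_name` metadata key are hypothetical; a real deployment would put this behind one of the vector databases mentioned above rather than a Python dict.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class CharacterRecord:
    """One stored character: id, identity embedding, free-form metadata."""
    character_id: str
    embedding_vector: List[float]
    metadata: Dict[str, Any] = field(default_factory=dict)


class InMemoryCharacterStore:
    """Dict-backed stand-in for a vector database, keyed by character_id."""

    def __init__(self) -> None:
        self._records: Dict[str, CharacterRecord] = {}

    def put(self, record: CharacterRecord) -> None:
        # Overwrites any existing record with the same character_id.
        self._records[record.character_id] = record

    def get(self, character_id: str) -> Optional[CharacterRecord]:
        # Returns None when the character has not been uploaded yet.
        return self._records.get(character_id)


store = InMemoryCharacterStore()
store.put(CharacterRecord("char_001", [0.6, 0.8], {"display_name": "Example"}))
record = store.get("char_001")
print(record)
```

Because generation only ever needs an exact lookup by `character_id`, a plain key-value read is enough on the hot path; nearest-neighbor search is only needed for features such as finding similar characters.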

For privacy-sensitive deployments, embeddings can be stored encrypted with per-tenant keys. The extraction is a one-way function (you can't reconstruct the reference image from the embedding), but treating embeddings as PII is the right default for systems handling real people.

3. Negative prompt synthesis

This is the non-obvious part of the system, and where most of the engineering work lives.

We maintain a catalog of common drift modes: categorical failure types observed across thousands of generations. For each mode, we have a corresponding negative_prompt fragment that suppresses that failure.

Examples from our catalog:

Drift mode                         Negative prompt fragment
Eye color shift (brown → green)    green eyes, hazel eyes (when reference is brown)
Jawline narrowing                  narrow jaw, weak chin, soft jawline
Hairline retreat                   high hairline, thinning hair, receding hairline
Skin tone warming                  warm skin tone, golden complexion (when reference is cool)
Asymmetry creep                    asymmetric face, uneven features
Eye spacing shift                  wide-set eyes, close-set eyes

Building this catalog requires labeled data. We labeled ~10,000 generations from public AI video tools (Runway, Pika, Sora, etc.) with the specific drift modes that appeared. Clustering produced ~30 distinct modes covering ~85% of observed drift.
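The coverage figure can be checked with simple counting once generations are labeled. The sketch below is only the bookkeeping half (it assumes each labeled generation already carries one drift-mode label, and does not perform the clustering itself); the function name and the ten toy labels are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def modes_for_coverage(labels: Iterable[str], target: float) -> List[Tuple[str, float]]:
    """Return the most frequent drift modes, with cumulative coverage,
    stopping as soon as the cumulative share reaches `target`."""
    counts = Counter(labels)
    total = sum(counts.values())
    selected: List[Tuple[str, float]] = []
    covered = 0
    for mode, count in counts.most_common():
        covered += count
        selected.append((mode, covered / total))
        if covered / total >= target:
            break
    return selected


# Toy labels, not real data: 10 labeled generations across 4 modes.
labels = (
    ["eye_color_shift"] * 5
    + ["jawline_narrowing"] * 3
    + ["hairline_retreat"] * 1
    + ["other"] * 1
)
catalog = modes_for_coverage(labels, target=0.8)
for mode, cumulative in catalog:
    print(f"{mode}: {cumulative:.0%} cumulative")
```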

For each generation, the system:

  1. Retrieves the character's reference attributes
  2. Computes the opposite of each attribute (e.g., if reference has dark eyes, opposite is light eyes)
  3. Constructs a per-character negative prompt by assembling the relevant drift suppressors

The result is a much stronger conditioning signal than vanilla prompt-only generation.
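The three steps above can be sketched as a lookup from reference attributes to suppressor fragments. The fragments are taken from the example catalog; the attribute keys, the conditions "strong" and "low", and the decision to always include the asymmetry suppressors are assumptions made for this illustration.

```python
from typing import Dict, List

# Hypothetical catalog layout: attribute -> reference value -> fragments that
# describe the *opposite* of that value (i.e. the drift to suppress).
DRIFT_CATALOG: Dict[str, Dict[str, List[str]]] = {
    "eye_color": {"brown": ["green eyes", "hazel eyes"]},
    "jawline": {"strong": ["narrow jaw", "weak chin", "soft jawline"]},
    "hairline": {"low": ["high hairline", "thinning hair", "receding hairline"]},
    "skin_tone": {"cool": ["warm skin tone", "golden complexion"]},
}

# Suppressors that do not depend on the reference (assumed always-on here).
ALWAYS_ON: List[str] = ["asymmetric face", "uneven features"]


def build_negative_prompt(reference_attributes: Dict[str, str]) -> str:
    """Assemble a per-character negative prompt from the drift catalog."""
    fragments = list(ALWAYS_ON)
    for attribute, value in reference_attributes.items():
        # Unknown attributes or values simply contribute nothing.
        fragments.extend(DRIFT_CATALOG.get(attribute, {}).get(value, []))
    # De-duplicate while preserving order.
    unique: List[str] = []
    for fragment in fragments:
        if fragment not in unique:
            unique.append(fragment)
    return ", ".join(unique)


negative_prompt = build_negative_prompt({"eye_color": "brown", "skin_tone": "cool"})
print(negative_prompt)
```

Keeping the catalog as data rather than code means new drift modes can be added from labeling results without touching the assembly logic.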

4. Conditioning injection

Different video models accept conditioning differently:

In our experience, API-level injection is far more effective than reference-image-based conditioning, but most public APIs don't expose this depth of access. Working at the API surface area available to us, we've found that combining a strong negative prompt with a reference-image-encoded embedding gets us 80-90% of the way to API-level injection.

This is partly why building a character consistency layer is meaningful even when you don't control the underlying model: there's significant headroom in the conditioning surface area that public APIs already expose.

5. Generation

Standard diffusion sampling, with the caveat that the conditioning is now a combination of:

Generation cost is typically 1.0-1.2× a vanilla generation. The marginal cost is small.

6. Consistency verification

After generation, we run a separate identity model (typically the same face encoder used in step 1) against the output. We compute cosine similarity between the output's identity embedding and the original reference embedding.

Threshold: typically 0.85 cosine similarity. Above threshold, the output is accepted. Below threshold, we trigger regeneration with stricter conditioning (higher negative prompt weight, stronger embedding injection).
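A minimal sketch of this accept-or-regenerate loop follows. The `generate` callable is a hypothetical stand-in that takes a strictness level and returns the identity embedding of its output; how strictness maps to negative prompt weight or injection strength is left out, and treating a score exactly at the threshold as a pass is an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedding = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a degenerate embedding as "no match"
    return dot / (norm_a * norm_b)


def generate_with_verification(
    generate: Callable[[int], Embedding],
    reference: Embedding,
    threshold: float = 0.85,
    max_attempts: int = 3,
) -> Tuple[Embedding, float, int]:
    """Regenerate with increasing strictness until similarity >= threshold.

    Returns (embedding, score, attempts_used). If no attempt passes, the
    best-scoring attempt is returned so the caller can decide what to do.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best_output: Embedding = []
    best_score = -2.0  # below any possible cosine similarity
    for attempt in range(max_attempts):
        output = generate(attempt)  # attempt index doubles as strictness
        score = cosine_similarity(output, reference)
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            return output, score, attempt + 1
    return best_output, best_score, max_attempts


# Toy generator: the first draw drifts, the stricter retry stays close.
reference = [1.0, 0.0]
toy_outputs = [[0.6, 0.8], [0.9, 0.1]]


def toy_generate(strictness: int) -> Embedding:
    return toy_outputs[min(strictness, len(toy_outputs) - 1)]


output, score, attempts = generate_with_verification(toy_generate, reference)
print(f"accepted after {attempts} attempt(s), similarity {score:.3f}")
```

Capping `max_attempts` bounds worst-case cost; returning the best attempt rather than raising lets the product layer choose between showing it, flagging it, or failing the shot.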

This adds ~5-10% generation cost on average (most shots pass first try) and prevents the worst drift cases from reaching the user.
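A rough way to sanity-check this overhead is to compute the expected number of generations per accepted shot. The sketch assumes every attempt passes independently with the same probability and ignores the cost of the verification model itself; both are simplifications, since stricter retries would normally pass more often.

```python
def expected_attempts(pass_rate: float, max_attempts: int) -> float:
    """Expected generations per shot when each attempt passes with
    probability `pass_rate` and we stop after `max_attempts` tries.

    Attempt i+1 only runs if the first i attempts all failed, which
    happens with probability (1 - pass_rate) ** i.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    fail_rate = 1.0 - pass_rate
    return sum(fail_rate ** i for i in range(max_attempts))


for pass_rate in (0.95, 0.90):
    overhead = expected_attempts(pass_rate, max_attempts=3) - 1.0
    print(f"pass rate {pass_rate:.0%}: about {overhead:.1%} extra generation cost")
```

Under these assumptions, a first-try pass rate between roughly 90% and 95% lands in the same range as the 5-10% overhead quoted above.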

What works well, what doesn't

What works:

What's harder:

Open research problems

If you're working in this space, here are problems we'd want to see solved:

  1. Form-variant invariants. What's the right learned representation that captures identity-invariant facial structure while allowing arbitrary state transformations?
  2. Active drift detection during sampling. Current consistency checks are post-hoc. Can we detect drift during the diffusion process and correct mid-sampling?
  3. Implicit-vs-explicit identity tradeoff. When does training a small per-character LoRA outperform embedding-based conditioning? Where's the boundary?
  4. Multi-character interaction modeling. How do we capture not just two locked identities but their relationship dynamics in a way that holds across shots?
  5. Identity uncertainty quantification. When the model is unsure about identity, can it surface that uncertainty rather than producing a confident drift?

If you're working on any of these and want to compare notes, the team behind Juying is genuinely interested. Reach out.

Practical advice for builders

If you're considering building a character consistency layer for your own product, three pieces of advice:

1. Start with the negative prompt catalog. This is the highest-impact, lowest-cost win. You don't need API-level model access; the negative prompt is exposed by every public API. Spend a week labeling 1000 generations and you'll have a catalog covering most drift.

2. Don't underestimate post-hoc verification. Adding a simple "regenerate if similarity < 0.85" loop catches the worst 10% of failures and dramatically improves perceived quality. It's the cheapest 90/100 → 95/100 quality bump available.

3. Invest in storage early. Treating character embeddings as persistent assets is the architectural insight that compounds. Build the right primitives once and every future feature (style locks, scene libraries, asset reuse) extends naturally.


If you're building in this space and want to chat: info@juying.art