GenLovers

How to use HappyHorse reference-to-video (multi-image AI video)

8 min read · Difficulty: Beginner-friendly

Most AI video tools take one image and animate it. HappyHorse reference-to-video (R2V) does something different: you give it several reference images — say, a person, an object they're holding, and an accessory they're wearing — plus a prompt, and it generates a video that combines all of them into a single coherent scene. It's the tool to reach for when the shot you want doesn't exist as one photo, because the subject, the prop, and the outfit are three different pictures.

This guide covers how to structure a multi-image prompt so HappyHorse knows which image is which, the settings worth knowing, and the mistakes that make a combined scene look disjointed instead of natural.

What reference-to-video is for

Image-to-video animates one picture. Reference-to-video composes several. If you want a specific person wearing a specific outfit and holding a specific object — and no single photo has all three — R2V is built for exactly that: it treats each reference image as one ingredient and assembles them into a scene your prompt describes.

This makes it the right tool for combining a character with props, outfits, or accessories from separate shoots, rather than for animating a single existing photo as-is. If you already have one image that shows everything you want, plain image-to-video is simpler and cheaper.

Step-by-step

The workflow has one extra step compared to single-image tools: numbering your references and pointing to them in the prompt.

  1. Gather your reference images

    Pick one clear, sharp image per element you want in the scene — for example, the person, the outfit piece, and the accessory. Each image should show its subject plainly, without clutter. You can use 1 to 9 reference images in a single generation.

  2. Order the images deliberately

    The order you upload images in becomes their number: the first is [Image 1], the second [Image 2], and so on. Decide this order before you write the prompt, since the prompt has to match it exactly.

  3. Write a prompt that names each reference

    Refer to each image by number and describe what it contributes: "the woman in a red qipao from [Image 1]", "unfolding the fan from [Image 2]". This is what tells the model which pixels to pull from which source, rather than inventing its own version of each element.

  4. Describe the scene and camera as a sequence

    Beyond naming the references, describe what happens and how it's shot — a shot type, an action, a camera move — the same way you would for any video prompt. R2V still needs motion and framing direction, not just a list of ingredients.

  5. Set resolution, aspect ratio, and duration

    Choose the output resolution and aspect ratio to match where the video will be used, and set a duration between 3 and 15 seconds (5 is the default and the reliable starting point).

  6. Generate, then check that every reference actually shows up

    Review the result specifically for each named element — did the outfit, the prop, and the person all appear as described? If one reference got ignored, it's almost always a prompt-clarity problem: make the reference to that image and its role in the scene more explicit and re-run.
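One way to keep the prompt numbering and the upload order from drifting apart is to derive both from a single list. The sketch below is only an illustration: the file names, the `number_references` helper, and the idea of assembling the prompt in a script are assumptions, not part of HappyHorse itself.

```python
def number_references(refs: list[tuple[str, str]]) -> dict[str, str]:
    """Map each reference key to its '[Image N]' tag, based on list position."""
    if not 1 <= len(refs) <= 9:
        raise ValueError("R2V accepts 1 to 9 reference images")
    # Position in the list is the upload order, so numbering starts at 1.
    return {key: f"[Image {i}]" for i, (key, _path) in enumerate(refs, start=1)}


# Hypothetical references: (short key, local file to upload).
refs = [
    ("woman", "woman_qipao.jpg"),
    ("fan", "folding_fan.jpg"),
    ("earrings", "tassel_earrings.jpg"),
]

tags = number_references(refs)

# The prompt names each reference and also gives shot, action, and camera move.
prompt = (
    f"Medium shot: the woman in a red qipao from {tags['woman']} "
    f"unfolding the fan from {tags['fan']}; "
    f"the tassel earrings from {tags['earrings']} sway with her head movement. "
    "Slow push-in."
)

# Upload the files in exactly this order so the tags stay valid.
upload_order = [path for _key, path in refs]
```

Reordering the `refs` list then changes the upload order and the tags together, so the two cannot disagree.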

Writing the multi-image prompt

Treat the prompt as stage directions for a small scene, where each reference image is a named prop or performer. Introduce a reference the first time you use it — "the woman in a red qipao from [Image 1]" — so the model has both the visual source and a plain-language description to anchor it to.

Once a reference is introduced, you can describe it acting: "unfolding the fan from [Image 2]", "the tassel earrings from [Image 3] sway with her head movement". This lets you sequence a short scene — an opening shot, an action in the middle, a closing detail — while keeping every element tied back to its source image.

Keep the numbering in your prompt consistent with the upload order. A mismatch between the order you provided images and the order you reference them in the prompt is the most common cause of the wrong element showing up in the wrong place.
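A quick local check can catch some numbering mistakes before a generation is spent on them. The following is a minimal sketch, assuming references are written exactly as `[Image N]`; the `check_prompt` helper is hypothetical and not a HappyHorse feature.

```python
import re

# Matches tags written exactly like "[Image 3]".
IMAGE_TAG = re.compile(r"\[Image (\d+)\]")


def check_prompt(prompt: str, image_count: int) -> dict[str, list[int]]:
    """Compare the image numbers used in a prompt with the number of uploads."""
    referenced = {int(n) for n in IMAGE_TAG.findall(prompt)}
    expected = set(range(1, image_count + 1))
    return {
        # Uploaded images the prompt never mentions.
        "unreferenced": sorted(expected - referenced),
        # Numbers in the prompt with no matching upload.
        "out_of_range": sorted(referenced - expected),
    }


report = check_prompt(
    "The woman in a red qipao from [Image 1] unfolding the fan from [Image 2]; "
    "the tassel earrings from [Image 4] sway with her head movement.",
    image_count=3,
)
```

Here `report` flags image 3 as unreferenced and image 4 as out of range. A check like this cannot detect two valid numbers that were swapped, so the upload order still needs a manual look.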

Recommended settings (baseline)

Start here, then adjust one variable at a time.

Reference images: 1–9 images; sharp, one clear subject per image, shortest side at least 400px (720p+ recommended)
Resolution: 1080P (default) or 720P; 720P is the cheaper, faster option for drafts and iteration
Aspect ratio: 16:9 default; also supports 9:16, 3:4, 4:3, 4:5, 5:4, 1:1, 9:21, 21:9; match your target platform
Duration: 3–15 seconds; 5 seconds is the default and the most reliable length to start from
Watermark: On by default (bottom-right "Happy Horse" mark); can be turned off
Seed: Fixed while tuning so you compare like-for-like; randomize once you're happy with the composition to explore variations
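If you keep settings in a script or notes file, the baseline above can be written down as defaults with a small sanity check. This is a sketch under assumptions: the `R2VSettings` class and its field names are invented for illustration and do not reflect any actual HappyHorse interface; only the allowed values come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

RESOLUTIONS = {"1080P", "720P"}
ASPECT_RATIOS = {"16:9", "9:16", "3:4", "4:3", "4:5", "5:4", "1:1", "9:21", "21:9"}


@dataclass
class R2VSettings:
    # Field names are hypothetical; defaults mirror the baseline above.
    resolution: str = "1080P"
    aspect_ratio: str = "16:9"
    duration: float = 5            # seconds
    watermark: bool = True
    seed: Optional[int] = None     # None means "let the tool randomize"

    def problems(self) -> list[str]:
        """Return a list of human-readable issues; empty means the values are in range."""
        found = []
        if self.resolution not in RESOLUTIONS:
            found.append(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            found.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if not 3 <= self.duration <= 15:
            found.append(f"duration must be 3 to 15 seconds, got {self.duration}")
        return found


# Draft pass: cheaper resolution, fixed seed so runs are comparable.
draft = R2VSettings(resolution="720P", seed=1234)
bad = R2VSettings(aspect_ratio="2:1", duration=20)
```

Keeping the seed fixed in `draft` matches the advice to change one variable at a time; `bad.problems()` reports both the aspect ratio and the duration.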

Choosing reference images that combine well

Consistent lighting and quality across references matter more than you might expect. If one reference is a bright, sharp studio photo and another is a dim, grainy phone snapshot, the model has to reconcile two different qualities of source material in one scene, and the result often looks like a composite rather than one coherent shot.

Show each subject clearly and separately. A reference image where the outfit or object is partially hidden, at an odd angle, or mixed in with other clutter gives the model less to work with than one where it's the clear focus of the frame.

Fewer, stronger references beat many marginal ones. Three sharp, well-chosen images that clearly show the person, the key prop, and the key detail will outperform seven images where several barely add anything and only add room for confusion.
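Lighting and clutter have to be judged by eye, but the size guidance from the settings list (shortest side at least 400px, 720p+ recommended) is easy to check before uploading. The sketch below assumes you already know each image's pixel dimensions and reads "720p+" as a shortest side of at least 720px, which is an interpretation rather than a documented rule; `rate_reference` is a hypothetical helper.

```python
MIN_SHORT_SIDE = 400          # stated minimum for the shortest side
RECOMMENDED_SHORT_SIDE = 720  # assumed reading of "720p+ recommended"


def rate_reference(width: int, height: int) -> str:
    """Classify a reference image by the length of its shortest side."""
    short_side = min(width, height)
    if short_side < MIN_SHORT_SIDE:
        return "too small"
    if short_side < RECOMMENDED_SHORT_SIDE:
        return "usable"
    return "recommended"


# Hypothetical file names with their (width, height) in pixels.
candidates = {
    "woman_qipao.jpg": (1280, 720),
    "folding_fan.jpg": (640, 480),
    "tassel_earrings.jpg": (300, 900),
}

ratings = {name: rate_reference(w, h) for name, (w, h) in candidates.items()}
```

An image rated "too small" is a good first candidate to replace when a combined scene looks mismatched.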

Common problems and fixes

A reference image doesn't appear in the output: the prompt didn't clearly introduce it, or its image number doesn't match its upload position. Re-check the order and make the reference to it more explicit and specific.

Elements look like they don't belong in the same shot: the reference images differ too much in lighting, quality, or resolution. Replace the weakest one with a cleaner, better-lit image.

Wrong element shows up where another was expected: the numbering in the prompt doesn't match the order the images were provided in. Double-check [Image 1], [Image 2], [Image 3] against your actual upload order.

Output feels cluttered or the focus is unclear: too many references competing for attention. Drop to the two or three that matter most for the shot.

Where R2V fits versus single-image tools

If you have one photo that already shows everything you want animated, plain image-to-video (like Wan) is simpler, cheaper, and has one less thing to get wrong. Reach for reference-to-video specifically when the shot you're after is assembled from parts — a person from one photo, an item from another — and no single image contains the whole scene.

The two techniques also compose: you can use R2V to generate a combined starting frame or short clip, then treat its output as the source image for further single-image animation or chaining into a longer sequence.
