GenLovers

How to generate videos with Wan 2.7

10 min read · Difficulty: Intermediate

Wan 2.7 is the premium tier of the Wan image-to-video line, and it changes what a single generation can do. Where earlier versions gave you a few seconds of silent motion, 2.7 produces clips up to around fifteen seconds with native audio — ambient sound or voice — generated in the same pass, and it can optimize your prompt for you before it runs. That combination lets one render carry a small, complete moment instead of a fragment you have to assemble later.

This guide covers what 2.7 adds over 2.5, and the two new skills that come with it: prompting audio, and scripting a longer clip second by second so it stays coherent across its full length. If you have never run a Wan I2V workflow, start with the Wan 2.2 walkthrough for the basics, then come back — everything here builds on that foundation.

What 2.7 adds over 2.5

Native audio. This is the headline. 2.7 generates sound as part of the video — ambient noise that matches the scene, or voice that matches the subject — instead of leaving you a silent clip to dub afterward. Done well, it removes an entire post-production step and makes a clip feel real rather than like a moving photo.

Much longer clips. Where earlier versions capped out around five to eight seconds, 2.7 runs flexibly from about two up to roughly fifteen seconds. Fifteen seconds is long enough to hold a short beginning-middle-end, which is why 2.7 is aimed at finished, watchable content rather than raw motion assets.

Built-in prompt optimization. The model can expand and refine a short prompt for you before generating. This lowers the skill floor — a rough prompt still gives a decent result — but knowing how to write a strong prompt yourself still gives you far more control over the outcome.

The cost of all this is real. 2.7 is the expensive tier and a heavier model, so each second costs more and takes longer than the same second on 2.2. It earns that cost when audio and length are the point; it wastes it when they aren't.

Step-by-step

The core workflow is the familiar I2V flow. The new work is in the prompt — audio and timing — covered in the sections after this.

  1. Load Wan 2.7 and prepare your source image

    Point your tool at the 2.7 I2V model. Prepare the source image exactly as for any Wan version: one clear subject, sharp, cropped to your target aspect ratio. The stronger model still can't invent detail a soft image doesn't contain.

  2. Decide your clip length up front

    Because 2.7 runs from about 2 to 15 seconds and you pay per second, pick the length the moment needs before you write the prompt. A short reaction wants 2-4 seconds; a small self-contained scene wants the longer end. Length shapes how you write the prompt.

  3. Write the motion prompt, and the audio

    Describe the movement in present-progressive verbs as always, then add what should be heard. On an audio model, sound is part of the prompt, not an afterthought. See the audio section below for how.

  4. For longer clips, script it on a timeline

    Past a few seconds, a single sentence of motion isn't enough to fill the clip coherently. Lay the action out second by second so the model knows the sequence. The timeline section below shows the format.

  5. Let prompt optimization help, then take control

    If your tool offers built-in prompt optimization, use it for a first pass to see what a fuller prompt looks like. Then edit it toward what you actually want: the optimizer is a starting point, not the final say.

  6. Generate, review with sound on, iterate

    Always review a 2.7 clip with audio playing, because a clip that looks fine can have mismatched or off sound. If motion is right but audio is wrong, adjust only the audio part of the prompt and re-run.
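
Steps 3 and 6 are easier to follow if the motion text and the audio text live in separate fields, so an audio-only fix cannot disturb the motion you already like. The sketch below is one way to do that in plain Python. `ClipPrompt`, its `render` method, and the "Audio:" label are hypothetical names for illustration, not part of any Wan tool; adapt the joined string to whatever your interface expects.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClipPrompt:
    """Hypothetical container that keeps motion and audio text separate."""
    motion: str
    audio: str

    def render(self) -> str:
        # Join the two layers into the single prompt string you paste into the tool.
        return f"{self.motion} Audio: {self.audio}."


first_try = ClipPrompt(
    motion="She is turning slowly toward the camera.",
    audio="a quiet room tone with distant traffic",
)

# Motion looked right but the sound was off: change only the audio layer.
second_try = replace(first_try, audio="wind moving through trees")

print(second_try.render())
```

Because the dataclass is frozen, `replace` returns a new prompt and leaves the first attempt untouched, which keeps a record of what each run actually used.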

Prompting native audio

Treat sound as a described layer, the same way you describe motion. Name the ambient sound of the scene — "waves rolling onto the shore," "a quiet room tone with distant traffic," "wind moving through trees" — so the audio matches what's on screen. Mismatched sound is more jarring than no sound at all.

For voice, describe the manner, not a script to be read verbatim: "she is speaking softly," "a calm, warm voice." Keep it simple and let the model fit the voice to the subject and motion. Over-specifying a long line of dialogue in a few seconds is the audio equivalent of asking for too much motion.

Keep audio and action in sync. If the subject's mouth moves, the voice should be present; if a wave breaks on screen, the sound should land with it. When in doubt, describe fewer, simpler sounds that clearly belong to what's visible — a clean, matched soundscape beats a busy, drifting one.
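
The advice above, a few named ambient sounds plus a voice manner rather than a script, can be turned into a small helper that refuses busy soundscapes. This is a minimal sketch: `describe_audio`, the "Audio:" label, and the cap of two ambient sounds are assumptions made for illustration, not limits of the model.

```python
from typing import Optional, Sequence

MAX_AMBIENT_SOUNDS = 2  # assumed cap, chosen only to keep the soundscape simple


def describe_audio(ambient: Sequence[str], voice_manner: Optional[str] = None) -> str:
    """Build the audio part of a prompt from ambient sounds and a voice manner."""
    if len(ambient) > MAX_AMBIENT_SOUNDS:
        raise ValueError(f"describe at most {MAX_AMBIENT_SOUNDS} ambient sounds")
    parts = list(ambient)
    if voice_manner:
        # Manner only ("she is speaking softly"), not a line of dialogue.
        parts.append(voice_manner)
    if not parts:
        raise ValueError("name at least one sound that belongs to the scene")
    return "Audio: " + ", ".join(parts) + "."


print(describe_audio(["waves rolling onto the shore"], "she is speaking softly"))
```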

Scripting a longer clip on a timeline

A five-second clip can run on a single sentence of motion. A fifteen-second clip cannot — left to one instruction, the model runs out of direction and the back half drifts or repeats. The fix is to script the clip as a timeline, giving the action second by second.

Write it as timestamps: a line for each second (or every couple of seconds) describing what is happening at that moment. "At 00:00, she is standing at the window looking out. At 00:03, she turns slowly toward the camera. At 00:06, she smiles and steps forward." Each beat hands off to the next, so the model always knows what comes next and the clip reads as one continuous action instead of a loop that lost its way.

Keep each beat small and physically continuous with the one before it — a person can turn, step, and smile in fifteen seconds; they cannot cross a room and change outfits. The timeline is for pacing and sequence, not for cramming in more than the runtime can hold.
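
If you script timelines often, a short formatter keeps the timestamps consistent and catches beats that are out of order or fall outside the clip. The sketch below reproduces the "At 00:00, ..." wording from the example above; `format_timeline` and the `Beat` tuple are hypothetical helpers, and the timestamp style is this guide's convention rather than a syntax the model requires.

```python
from typing import List, Tuple

Beat = Tuple[int, str]  # (start second, what is happening at that moment)


def format_timeline(beats: List[Beat], clip_seconds: int) -> str:
    """Render beats as 'At MM:SS, ...' sentences in chronological order."""
    if not beats:
        raise ValueError("a timeline needs at least one beat")
    starts = [start for start, _ in beats]
    if starts != sorted(set(starts)):
        raise ValueError("beat times must be strictly increasing")
    if starts[0] != 0 or starts[-1] >= clip_seconds:
        raise ValueError("beats must start at 0 and fall inside the clip")
    lines = []
    for start, action in beats:
        minutes, seconds = divmod(start, 60)
        lines.append(f"At {minutes:02d}:{seconds:02d}, {action}.")
    return " ".join(lines)


beats = [
    (0, "she is standing at the window looking out"),
    (3, "she turns slowly toward the camera"),
    (6, "she smiles and steps forward"),
]
print(format_timeline(beats, clip_seconds=8))
```

The checks cover ordering and runtime only. Whether each beat is physically continuous with the previous one is still a judgment you make when writing the actions.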

Recommended settings (baseline)

Start here, then adjust one variable at a time. 2.7 exposes length and audio as first-class choices that earlier versions didn't.

Workflow: Image-to-video (I2V), Wan 2.7 model
Resolution: 720p or 1080p; 1080p costs more per second, so use it only when the output warrants it
Clip length: 2-15 seconds; pick per moment, and remember cost scales directly with seconds
Audio: Native; describe ambient sound and/or voice in the prompt, and leave the clip silent only if you'll replace the sound
Prompt: For clips beyond ~5s, script the action on a second-by-second timeline
Prompt optimization: Optional first pass to expand a rough prompt; edit its output toward your intent
Seed: Fixed while tuning so you compare like-for-like; randomize to explore variations
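
If you keep your settings in a script or notes file, the baseline above can be written down as a small checked record. This is a sketch under assumptions: `BaselineSettings` and its field names are hypothetical and do not correspond to the parameters of any particular Wan interface, the default seed value is arbitrary, and the 5-second timeline threshold simply mirrors the "~5s" guidance.

```python
from dataclasses import dataclass

ALLOWED_RESOLUTIONS = ("720p", "1080p")
MIN_SECONDS, MAX_SECONDS = 2, 15
TIMELINE_THRESHOLD_SECONDS = 5  # beyond this, the guide recommends a timeline


@dataclass(frozen=True)
class BaselineSettings:
    """Hypothetical record of the baseline; field names are illustrative."""
    resolution: str = "720p"
    clip_seconds: int = 5
    native_audio: bool = True
    optimize_prompt: bool = False
    seed: int = 42  # fixed while tuning; the value itself is arbitrary

    def problems(self) -> list:
        # Collect every violation instead of stopping at the first one.
        issues = []
        if self.resolution not in ALLOWED_RESOLUTIONS:
            issues.append(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
        if not MIN_SECONDS <= self.clip_seconds <= MAX_SECONDS:
            issues.append(f"clip length must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
        return issues

    def needs_timeline(self) -> bool:
        return self.clip_seconds > TIMELINE_THRESHOLD_SECONDS


settings = BaselineSettings(resolution="1080p", clip_seconds=12)
print(settings.problems(), settings.needs_timeline())
```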

When a cheaper model is the smarter call

Silent, short assets. If you need loops, animated thumbnails, or quick social clips with no sound and no need for length, 2.2 does it for a fraction of the cost. Paying for 2.7's audio and duration you won't use is pure waste.

High-volume work. When you need many clips and per-clip cost dominates, the cheaper models are the right engine. Reserve 2.7 for the hero pieces where audio and a full fifteen seconds actually change what the viewer gets.

Fast exploration. Burn through your rough ideas on a light, cheap model, then bring only the winner to 2.7 for the finished version with sound.
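
The three cases above reduce to a short decision rule. The sketch below encodes it as a function. `pick_model`, its arguments, and the 8-second boundary for a "short" clip are assumptions for illustration (the boundary borrows the upper end of the five-to-eight-second cap mentioned for earlier versions), and the returned labels are plain descriptions, not model identifiers.

```python
SHORT_CLIP_SECONDS = 8  # assumed boundary for what a cheaper model can cover


def pick_model(needs_audio: bool, clip_seconds: float,
               exploring: bool = False, high_volume: bool = False) -> str:
    """Return a tier label following the rules of thumb in this section."""
    # Rough ideas and bulk work go to the cheaper tier regardless of features.
    if exploring or high_volume:
        return "cheaper model"
    # Audio or a long runtime is what the premium tier is for.
    if needs_audio or clip_seconds > SHORT_CLIP_SECONDS:
        return "Wan 2.7"
    return "cheaper model"


print(pick_model(needs_audio=False, clip_seconds=4))   # silent, short asset
print(pick_model(needs_audio=True, clip_seconds=15))   # hero piece with sound
```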

Common problems and fixes

Audio doesn't match the scene: your prompt described a sound that isn't on screen, or described no sound at all. Name the specific ambient sound that belongs to the visible action, and keep it simple.

Long clip drifts or repeats in the back half: you gave one instruction for too much runtime. Script it as a second-by-second timeline so every beat has direction.

Costs are ballooning: you're iterating at the premium tier. Move trial-and-error to short, silent, lower-resolution runs and reserve full 2.7 renders for finals.
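
A quick back-of-the-envelope calculation shows why moving trial-and-error off the premium tier matters. The rates below are placeholders in arbitrary credits per second, invented purely for illustration; substitute your provider's real pricing. Only the structure of the calculation (runs times seconds times rate) comes from the per-second billing described in this guide.

```python
# Placeholder rates in arbitrary credits per second, NOT real prices.
DRAFT_RATE = 1     # assumed: short, silent, lower-resolution trial run
PREMIUM_RATE = 10  # assumed: full Wan 2.7 render with audio


def total_cost(draft_runs: int, draft_seconds: int,
               final_runs: int, final_seconds: int) -> int:
    """Cost scales directly with seconds, so multiply runs by seconds by rate."""
    drafts = draft_runs * draft_seconds * DRAFT_RATE
    finals = final_runs * final_seconds * PREMIUM_RATE
    return drafts + finals


# Eight attempts, all rendered as full 15-second premium clips.
all_premium = total_cost(0, 0, final_runs=8, final_seconds=15)

# Seven 4-second drafts to settle the idea, then one premium final.
drafts_first = total_cost(7, 4, final_runs=1, final_seconds=15)

print(all_premium, drafts_first)
```

With these made-up rates the draft-first plan costs 178 credits against 1200. The exact ratio depends entirely on real pricing, but the direction of the saving holds whenever the premium rate is higher.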

Optimized prompt changed your intent: the built-in optimizer expanded the prompt in a direction you didn't want. Treat its output as a draft and edit it back toward your goal.