...

Audio nodes

Spaces includes a set of audio nodes that let you generate voiceovers, music, and sound effects — all from text descriptions. Use them to add narration, soundtracks, and ambient audio to your video workflows without recording or licensing anything.

Available audio nodes

Node | What it does
Voiceover | Converts text into natural-sounding speech using AI voices
Music Generator | Composes original music tracks from text descriptions
Sound Effects | Generates AI-powered foley and ambient sounds from text
Video Audio Mix | Combines multiple audio tracks with video

All audio nodes are new additions to Spaces.

Voiceover, Music Generator, and Sound Effects are being rolled out gradually and may not be available in your account yet.

Voiceover

The Voiceover node converts text into natural-sounding speech. Choose from hundreds of AI voices across multiple providers to narrate scripts, create dialogue, or produce voice content.

How to use the Voiceover node

  1. Add the node

    Add a Voiceover node to your Space.

  2. Enter your script

    Type or paste your script into the text field — or connect a Text node to the input port.

  3. Select a model

    Choose ElevenLabs v2, ElevenLabs v3, or Gemini 2.5 Pro.

  4. Pick a voice

    Click the voice chip to open the voice library — browse and preview voices before selecting one.

  5. Adjust parameters

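    Set Speed, Stability, and Similarity Boost if needed. These are the ElevenLabs parameters; Gemini 2.5 Pro supports Temperature, System Instruction, and Language selection.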

  6. Generate

    Set the number of generations from 1 to 10 and run the node.

Voice parameters

Parameter | Range | What it does
Speed | 0.7x to 1.2x | Speaking rate (lower is slower, higher is faster)
Stability | 0 to 1 | Voice consistency (lower is more expressive, higher is more stable)
Similarity Boost | 0 to 1 | How closely the output matches the selected voice
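You set these values on the node card, not in code. As a hedged illustration of the allowed ranges, the sketch below models the ElevenLabs voice settings plus the 1 to 10 generation count as a small Python class that rejects out-of-range values. `VoiceoverSettings`, `check_range`, and the field names are hypothetical and are not part of any Spaces or ElevenLabs API.

```python
from dataclasses import dataclass

# Ranges copied from the Voice parameters table and the Generate step.
SPEED_RANGE = (0.7, 1.2)
STABILITY_RANGE = (0.0, 1.0)
SIMILARITY_RANGE = (0.0, 1.0)
GENERATIONS_RANGE = (1, 10)


def check_range(name: str, value: float, bounds: tuple[float, float]) -> None:
    """Raise ValueError if value falls outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low} to {high}")


@dataclass(frozen=True)
class VoiceoverSettings:
    """Hypothetical model of ElevenLabs Voiceover settings; not a Spaces API."""

    speed: float
    stability: float
    similarity_boost: float
    generations: int

    def __post_init__(self) -> None:
        check_range("speed", self.speed, SPEED_RANGE)
        check_range("stability", self.stability, STABILITY_RANGE)
        check_range("similarity_boost", self.similarity_boost, SIMILARITY_RANGE)
        check_range("generations", self.generations, GENERATIONS_RANGE)


# A slightly slower, more expressive read, generated three times.
draft = VoiceoverSettings(speed=0.9, stability=0.3, similarity_boost=0.8, generations=3)

try:
    VoiceoverSettings(speed=1.5, stability=0.5, similarity_boost=0.5, generations=1)
except ValueError as err:
    print(err)  # speed=1.5 is outside 0.7 to 1.2
```

The bounds are inclusive, so 0.7x, 1.2x, 0, 1, 1, and 10 are all accepted as edge values.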

Input and output

The Voiceover node accepts a Text input (your script) and outputs Audio (the generated voiceover). You can type directly on the card or connect a Text node for dynamic scripts.

Voiceover models

Three AI models are available for voice generation. Each has different strengths — pick the one that matches your project.

Model | Provider | Speed | Quality | Best for
ElevenLabs v2 Turbo | ElevenLabs | Fast | Good | Quick narration, batch processing
ElevenLabs v3 | ElevenLabs | Moderate | High | Final production, emotional narration
Gemini 2.5 Pro | Google | Moderate | High | Multi-language content, conversational tone

ElevenLabs v2 is the fastest option. Use it when you need to generate many voiceovers quickly or iterate on scripts during production. It supports Speed, Stability, and Similarity Boost parameters.

ElevenLabs v3 delivers more natural prosody and intonation than v2 — pacing, emphasis, and emotional tone feel closer to a human read. Use it for final output when quality matters most. Same parameters as v2.

Gemini 2.5 Pro excels at multi-language content and conversational delivery. It supports Temperature, System Instruction, and Language selection. Multi-speaker configuration is possible for dialogue-style output.

Gemini 2.5 Pro is in gradual rollout and may not be available to all users yet.

Quick guide: which model to choose

You need... | Use this
Fast turnaround | ElevenLabs v2
Best audio quality | ElevenLabs v3
Multiple languages | Gemini 2.5 Pro
Emotional, expressive narration | ElevenLabs v3
Conversational or dialogue tone | Gemini 2.5 Pro
Batch processing many clips | ElevenLabs v2

Music Generator

The Music Generator node creates original music from text descriptions. Describe the mood, genre, tempo, and instruments — the AI composes a unique track.

How to use the Music Generator node

  1. Add the node

    Add a Music Generator node to your Space.

  2. Describe your music

    Write a description of the music you want in the prompt field — or connect a Text node to the input port.

  3. Select a model

    Choose Google Lyria or ElevenLabs Music.

  4. Set the duration

    Set a duration of up to 30 seconds for Google Lyria or up to 10 seconds for ElevenLabs Music.

  5. Generate

    Set the number of generations from 1 to 10 and run the node.

Input and output

The Music Generator node accepts a Text input (your music description) and outputs Audio (the generated track).

Music Generator models

Two AI models are available for music generation, each optimized for different use cases.

Model | Provider | Max duration | Best for
Google Lyria | Google | 30 seconds | Background music, soundtracks, ambient, varied genres
ElevenLabs Music | ElevenLabs | 10 seconds | Jingles, intros, sound logos, short loops

Google Lyria excels at longer compositions with natural musical structure. It handles a wide range of genres and can produce pieces with evolving arrangement — intros, builds, and transitions.

ElevenLabs Music is optimized for short, concentrated pieces. The output is clean and well-defined — ideal for branding elements, transitions, and loop-ready clips.

Quick guide: which model to choose

You need... | Use this
More than 10 seconds | Google Lyria
Short, punchy audio | ElevenLabs Music
Varied instrumentation | Google Lyria
Quick generation | ElevenLabs Music
Soundtrack or background music | Google Lyria
Jingle, intro, or sound logo | ElevenLabs Music
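The duration rule in this guide is the one hard constraint: anything over 10 seconds needs Google Lyria, and nothing goes past 30 seconds in one generation. The sketch below encodes that rule plus the "short and punchy" preference as a small Python function. `pick_music_model` and its `short_and_punchy` flag are hypothetical helpers for planning, not a Spaces feature, and treating everything else as Lyria-by-default is a simplification of the table.

```python
# Maximum durations from the Music Generator models table.
MAX_SECONDS = {"Google Lyria": 30, "ElevenLabs Music": 10}


def pick_music_model(seconds: float, short_and_punchy: bool = False) -> str:
    """Suggest a Music Generator model following the quick guide (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    if seconds > MAX_SECONDS["Google Lyria"]:
        raise ValueError("neither model generates more than 30 seconds in one run")
    if seconds > MAX_SECONDS["ElevenLabs Music"]:
        return "Google Lyria"  # only Lyria goes past 10 seconds
    # Both models fit: jingles, intros, and sound logos lean toward ElevenLabs Music,
    # soundtracks and background music toward Google Lyria.
    return "ElevenLabs Music" if short_and_punchy else "Google Lyria"


print(pick_music_model(25))                        # Google Lyria
print(pick_music_model(6, short_and_punchy=True))  # ElevenLabs Music
```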

Sound Effects

The Sound Effects node generates AI-powered audio from text descriptions. Describe any sound — from rain on a tin roof to a spaceship engine humming — and the AI creates it. Use it to add atmosphere and foley to video projects.

How to use the Sound Effects node

  1. Add the node

    Add a Sound Effects node to your Space.

  2. Describe the sound

    Write a description in the prompt field — or connect a Text node to the input port.

  3. Set the duration

    Choose the desired length for the sound effect.

  4. Enable Loop if needed

    Turn on Loop for sounds that need to play continuously, like ambient or background audio.

  5. Generate

    Set the number of generations from 1 to 10 and run the node.

Input and output

The Sound Effects node accepts a Text input (your sound description) and outputs Audio (the generated effect).

Video Audio Mix

The Video Audio Mix node combines multiple audio tracks with a video. Use it as the final step in your audio workflow — connect your voiceover, music, and sound effects, then mix them with your generated or uploaded video.

Typical audio workflow

A common pattern for producing narrated video with a soundtrack in Spaces:

  1. Write your script

    Use a Text node or type directly into the Voiceover node.

  2. Generate voiceover

    The Voiceover node converts your script to natural speech.

  3. Add music

    The Music Generator creates a background track from a mood description.

  4. Layer sound effects

    The Sound Effects node generates ambient audio or foley.

  5. Mix everything

    Connect all audio outputs and your video to a Video Audio Mix node.

You can connect as many audio sources as you need before combining them with the final video.
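One way to picture this workflow is as a directed graph in which every audio node and the video feed into a single Video Audio Mix node. The sketch below writes the five steps as such a graph and uses Python's standard-library `graphlib` (Python 3.9+) to list an order in which each node comes after all of its inputs. The node labels are hypothetical, and this only illustrates the wiring; it is not how Spaces schedules or runs nodes internally.

```python
from graphlib import TopologicalSorter

# Hypothetical node labels; each key maps to the set of nodes whose outputs it receives.
workflow = {
    "Script (Text)": set(),
    "Voiceover": {"Script (Text)"},
    "Music Generator": set(),
    "Rain (Sound Effects)": set(),
    "Thunder (Sound Effects)": set(),
    "Video": set(),  # a generated or uploaded video
    "Video Audio Mix": {
        "Voiceover",
        "Music Generator",
        "Rain (Sound Effects)",
        "Thunder (Sound Effects)",
        "Video",
    },
}

# static_order() yields every node after all of its inputs, so the mix comes last.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)
```

Adding another audio source means adding one more entry and listing it among the mix node's inputs.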

Prompting tips

Good prompts lead to better audio. Here is what works for each node.

Voiceover

Your voiceover prompt is the script itself — write it exactly as you want it spoken. Keep sentences natural and conversational. If you need specific pacing or emphasis, choose the right model and adjust the voice parameters rather than trying to encode delivery instructions in the text.

Music Generator

Include genre, mood, tempo, and instruments for the most control.

Cinematic orchestral piece, dramatic, building tension, strings and brass, 120 BPM

Acoustic folk guitar, warm and nostalgic, fingerpicking style, 90 BPM

For ElevenLabs Music, which produces shorter pieces, keep descriptions focused on a single mood or purpose.

Upbeat electronic jingle, happy, synth lead

Dark ambient drone, eerie, low frequency hum
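A BPM value also tells you how much music fits in a clip. One beat lasts 60 / BPM seconds, so at 120 BPM a beat is 0.5 seconds and a full 30-second Lyria track spans 60 beats. The sketch below does that arithmetic, assuming a 4/4 time signature. It is planning arithmetic only: the source does not say whether either model holds the requested tempo exactly or ends on a bar line.

```python
def beats_in_clip(bpm: float, seconds: float) -> float:
    """Number of beats that fit in a clip: one beat lasts 60 / bpm seconds."""
    return seconds * bpm / 60


def bars_in_clip(bpm: float, seconds: float, beats_per_bar: int = 4) -> float:
    """Number of bars, assuming a fixed time signature (4/4 by default)."""
    return beats_in_clip(bpm, seconds) / beats_per_bar


def seconds_for_bars(bpm: float, bars: int, beats_per_bar: int = 4) -> float:
    """Clip length that holds a whole number of bars at the given tempo."""
    return bars * beats_per_bar * 60 / bpm


print(bars_in_clip(120, 30))    # 15.0 (a full 30-second Lyria track at 120 BPM)
print(bars_in_clip(90, 10))     # 3.75 (a 10-second clip at 90 BPM ends mid-bar)
print(seconds_for_bars(90, 3))  # 8.0
```

For a loop-ready ElevenLabs Music clip, picking a duration that holds a whole number of bars (8 seconds at 90 BPM, for example) may make the loop point easier to hear as natural.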

Sound Effects

Be descriptive and specific. The more detail you include, the more closely the result will match the sound you have in mind.

"Heavy rain on a tin roof with distant thunder" works better than just "rain".

"Busy coffee shop ambiance with quiet conversation and espresso machine" works better than "cafe sounds".

Tips and best practices

Preview voices before committing. The voice library includes sample playback for every voice — listen before you generate.

Lower stability for expression. If your voiceover sounds too flat or robotic, try reducing the Stability parameter. This adds more variation and emotional range to the delivery.

Generate multiple variations. Audio generation — especially music and sound effects — is highly variable. Generate several versions of the same prompt and pick the best one.

Use Loop for ambient sounds. Enable the Loop toggle on the Sound Effects node when you need continuous background audio like rain, traffic, or office ambiance.

Draft with v2, finish with v3. Use ElevenLabs v2 for fast iteration on scripts, then switch to v3 for the final voiceover when quality matters.

Be specific about tempo in music prompts. Adding a BPM value like 120 BPM gives the AI a concrete target and produces more consistent results.

Use multiple Voiceover nodes for dialogue. Add several Voiceover nodes with different voices to create conversation scenes, then combine them with Video Audio Mix.

Combine Sound Effects for rich soundscapes. Layer multiple Sound Effects nodes — for example, rain plus distant traffic plus indoor echo — and mix them together for depth.
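For dialogue built from several Voiceover nodes, it helps to split the script into one line per node while keeping the speaking order, so the clips can be arranged in Video Audio Mix. Spaces does not parse scripts this way; the sketch below is a hypothetical preprocessing step that assumes a "SPEAKER: line" script format. `plan_dialogue` and the voice names are placeholders. For dialogue-style output from a single node, Gemini 2.5 Pro's multi-speaker configuration is the alternative.

```python
import re

# Assumed script convention: one turn per line, written "SPEAKER: what they say".
TURN = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def plan_dialogue(script: str, voices: dict[str, str]) -> list[tuple[str, str]]:
    """Return (voice, line) pairs in speaking order, one pair per Voiceover node."""
    plan = []
    for raw in script.splitlines():
        match = TURN.match(raw)
        if match is None:
            continue  # skip blank lines and directions that have no speaker label
        speaker, line = match.groups()
        if speaker not in voices:
            raise KeyError(f"no voice assigned to speaker {speaker!r}")
        plan.append((voices[speaker], line))
    return plan


script = """HOST: Welcome back to the show.
(music fades)
GUEST: Thanks for having me.
HOST: Let's get started."""

# "Voice A" and "Voice B" are placeholders for voices picked in the voice library.
for voice, line in plan_dialogue(script, {"HOST": "Voice A", "GUEST": "Voice B"}):
    print(f"{voice}: {line}")
```

Each printed pair corresponds to one Voiceover node: paste the line as its script and select the matching voice.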