32blog by Studio Mitsu

Auto-Generate Subtitles with FFmpeg and Whisper

Combine OpenAI's Whisper with FFmpeg to automatically add subtitles to any video. Build a pipeline that handles audio extraction, SRT generation, and subtitle burning.

by omitsu · 10 min read
Tags: FFmpeg, Whisper, subtitles, automation, AI

You can auto-generate subtitles by extracting audio with FFmpeg, transcribing it with OpenAI Whisper, and burning the resulting SRT back into the video — all in a single Python script. No manual transcription needed.

I used to spend hours adding subtitles to tutorial videos. Transcribe, align timestamps, export — repeat for every single video. That workflow doesn't scale.

This article shows you how to combine Whisper with FFmpeg to fully automate subtitle generation. We'll build a pipeline that goes from raw video all the way to a subtitled output in one script: audio extraction → transcription → SRT generation → subtitle burning.

[Pipeline diagram] Input video (MP4/MOV) → FFmpeg audio extraction (16kHz WAV) → Whisper transcription → SRT subtitles → FFmpeg burn-in

Eliminate the Subtitle Bottleneck

The real bottleneck in subtitle creation is transcription. Converting spoken words to text, deciding where to break segments, and assigning timecodes takes up 90% of the total time.

Whisper solves this almost entirely. Pass it an audio file and it handles transcription and timecoding automatically. Output the result as SRT and you can pipe it straight into FFmpeg to burn the subtitles into your video.

You need two things:

  • Python 3.8+ with Whisper installed
  • FFmpeg already installed and in your PATH
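Before running anything, it's worth failing fast if a tool is missing. A minimal sketch using the stdlib (the require helper name is my own, not part of the later script):

```python
import shutil
import sys


def require(cmd: str) -> str:
    """Exit with a clear message if a command is not on PATH."""
    path = shutil.which(cmd)
    if path is None:
        sys.exit(f"{cmd} not found on PATH -- install it before continuing")
    return path


# require("ffmpeg")  # run this at the top of your pipeline script
```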

If you need a refresher on FFmpeg basics, check out Getting Started with FFmpeg first.

Install Whisper and Choose a Model

Install Whisper via pip.

bash
pip install openai-whisper

Once installed, pick a model size. Whisper ships several variants with different speed/accuracy trade-offs.

Model     VRAM    Speed    Accuracy
tiny      ~1GB    Fastest  Low
base      ~1GB    Fast     Moderate
small     ~2GB    Medium   Good
medium    ~5GB    Slow     High
large-v3  ~10GB   Slowest  Best
turbo     ~6GB    Fast     Near-best

The turbo model was released in September 2024. It strips the large-v3 decoder from 32 layers down to 4, making it roughly 8× faster while keeping accuracy close to large-v3. If you have a GPU with 6GB+ VRAM, turbo is the sweet spot for most workflows.

Extract Audio with FFmpeg

Whisper can accept a video file directly, but extracting audio first makes the pipeline faster and cleaner. Use FFmpeg to convert to WAV.

bash
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

What each flag does:

  • -vn — ignore the video stream, process audio only
  • -acodec pcm_s16le — output as WAV (16-bit linear PCM)
  • -ar 16000 — set sample rate to 16kHz (Whisper's recommended input)
  • -ac 1 — downmix to mono (Whisper is more consistent with mono audio)

The resulting audio.wav is what gets passed to Whisper next.
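Since the sample rate and channel count matter to Whisper, you can verify the extracted file before transcribing. A minimal sketch using Python's stdlib wave module (the check_wav helper name is my own):

```python
import wave


def check_wav(path: str) -> None:
    """Raise if a WAV file is not 16kHz mono 16-bit PCM, the format produced above."""
    with wave.open(path, "rb") as w:
        if w.getframerate() != 16000:
            raise ValueError(f"expected 16kHz, got {w.getframerate()}")
        if w.getnchannels() != 1:
            raise ValueError(f"expected mono, got {w.getnchannels()} channels")
        if w.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples (pcm_s16le)")


# check_wav("audio.wav")  # no exception means the file is ready for Whisper
```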

Transcribe Audio and Generate SRT with Whisper

With the audio file ready, run it through Whisper. You can use either the CLI or the Python API.

Using the CLI

bash
whisper audio.wav --model medium --language en --output_format srt --output_dir ./subtitles

Explicitly passing --language is faster and more accurate than letting Whisper auto-detect. The output lands at ./subtitles/audio.srt and looks like this:

1
00:00:00,000 --> 00:00:03,500
Welcome. Today we're going to look at FFmpeg and Whisper.

2
00:00:03,500 --> 00:00:07,200
Let's start by installing Whisper.

Using the Python API

Use the API when you need more control over the output.

python
import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav", language="en")

def format_timestamp(seconds: float) -> str:
    ms = int((seconds % 1) * 1000)
    s = int(seconds) % 60
    m = int(seconds) // 60 % 60
    h = int(seconds) // 3600
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("subtitles/audio.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

print("SRT file written")

Each item in result["segments"] contains start, end (in seconds), and text. The helper formats those into SRT timestamps.
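A few worked values make the conversion concrete (the helper is restated here so the snippet runs on its own):

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to an SRT timestamp: HH:MM:SS,mmm."""
    ms = int((seconds % 1) * 1000)
    s = int(seconds) % 60
    m = int(seconds) // 60 % 60
    h = int(seconds) // 3600
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


print(format_timestamp(0.0))      # 00:00:00,000
print(format_timestamp(3.5))      # 00:00:03,500
print(format_timestamp(3723.25))  # 01:02:03,250
```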

Burn Subtitles into the Video with FFmpeg

Once the SRT file exists, use FFmpeg's subtitles filter to hard-code it into the video.

bash
ffmpeg -i input.mp4 -vf "subtitles=subtitles/audio.srt" -c:a copy output.mp4

To customize the font, size, and color, pass a force_style argument.

bash
ffmpeg -i input.mp4 \
  -vf "subtitles=subtitles/audio.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&HFFFFFF&,OutlineColour=&H000000&,Outline=2'" \
  -c:a copy output.mp4

force_style follows the ASS/SSA style specification. Colors use &HBBGGRR& format — note the BGR order (reversed from HTML's RGB) and the trailing &. Non-Latin scripts should use a font that covers the required character set to avoid rendering issues.
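The BGR ordering is easy to get wrong, so here is a tiny sketch that converts a familiar HTML #RRGGBB value into the force_style format (the html_to_ass_colour name is my own):

```python
def html_to_ass_colour(hex_rgb: str) -> str:
    """Convert HTML '#RRGGBB' to ASS '&HBBGGRR&' (byte order reversed)."""
    rgb = hex_rgb.lstrip("#")
    r, g, b = rgb[0:2], rgb[2:4], rgb[4:6]
    return f"&H{b}{g}{r}&".upper()


print(html_to_ass_colour("#FF0000"))  # &H0000FF& -- red
print(html_to_ass_colour("#FFFFFF"))  # &HFFFFFF& -- white
```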

Combine Everything Into One Script

Here is the full pipeline in a single Python script. Point it at a video file and it outputs a subtitled version.

python
import subprocess
import sys
import os
import whisper


def extract_audio(input_video: str, output_audio: str) -> None:
    """Extract audio from a video file."""
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        output_audio
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Audio extracted: {output_audio}")


def transcribe_to_srt(audio_path: str, srt_path: str, model_name: str = "medium") -> None:
    """Transcribe audio with Whisper and write an SRT file."""
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="en")

    def format_timestamp(seconds: float) -> str:
        ms = int((seconds % 1) * 1000)
        s = int(seconds) % 60
        m = int(seconds) // 60 % 60
        h = int(seconds) // 3600
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(srt_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result["segments"], start=1):
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            text = segment["text"].strip()
            f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

    print(f"SRT written: {srt_path}")


def burn_subtitles(input_video: str, srt_path: str, output_video: str) -> None:
    """Hard-code subtitles into the video."""
    # Assumes srt_path contains only ASCII characters
    vf = f"subtitles={srt_path}"
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-vf", vf,
        "-c:a", "copy",
        output_video
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Subtitles burned: {output_video}")


def main(input_video: str) -> None:
    base = os.path.splitext(input_video)[0]
    audio_path = f"{base}_audio.wav"
    srt_path = f"{base}_subtitles.srt"
    output_video = f"{base}_subtitled.mp4"

    print("=== Subtitle Auto-Generation Pipeline ===")
    extract_audio(input_video, audio_path)
    transcribe_to_srt(audio_path, srt_path)
    burn_subtitles(input_video, srt_path, output_video)

    # Clean up intermediate files
    os.remove(audio_path)
    print(f"\nDone: {output_video}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python subtitle_pipeline.py <input_video>")
        sys.exit(1)
    main(sys.argv[1])

Run it with a single command.

bash
python subtitle_pipeline.py input.mp4

If you want to process an entire folder of videos at once, the FFmpeg Python Batch Automation article covers that pattern in detail.
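As a starting point, collecting the videos in a folder is straightforward with pathlib. A sketch (the find_videos helper and the extension set are my own choices, not from the script above):

```python
import os
from pathlib import Path
from typing import List

# Extensions to treat as video input -- adjust to taste
VIDEO_EXTS = {".mp4", ".mov", ".mkv"}


def find_videos(folder: str) -> List[str]:
    """Collect video files in a folder, sorted for a stable processing order."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.suffix.lower() in VIDEO_EXTS
    )


# for video in find_videos("./videos"):
#     main(video)  # main() from subtitle_pipeline.py
```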

Speed It Up with faster-whisper

If processing time matters — and it always does when you're batching hundreds of videos — check out faster-whisper. It's a reimplementation of Whisper using CTranslate2, and it runs up to 4× faster with significantly less memory.

bash
pip install faster-whisper

python
from faster_whisper import WhisperModel

# format_timestamp is the SRT helper defined earlier in this article
model = WhisperModel("turbo", compute_type="float16")
segments, info = model.transcribe("audio.wav", language="en")

with open("subtitles/audio.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(segments, start=1):
        start = format_timestamp(segment.start)
        end = format_timestamp(segment.end)
        f.write(f"{i}\n{start} --> {end}\n{segment.text.strip()}\n\n")

The API is slightly different (segments is a generator, and attributes are accessed with dots instead of dict keys), but the output is identical. I switched the 32blog video pipeline to faster-whisper and cut processing time from 12 minutes to under 3 minutes for a 30-minute video on an RTX 3060.

FAQ

Does Whisper work without a GPU?

Yes. Whisper runs on CPU, but it's significantly slower. A 10-minute video takes around 30 seconds on a mid-range GPU with the medium model, versus 5–10 minutes on CPU. The turbo model helps a lot even on CPU.

What's the difference between soft subtitles and hard subtitles?

Hard subtitles (burn-in) are permanently embedded in the video pixels — viewers can't turn them off. Soft subtitles are stored as a separate stream inside the container, and players let you toggle them. This article covers hard subtitles with the subtitles filter. For soft subs, use ffmpeg -i input.mp4 -i subs.srt -c copy -c:s mov_text output.mp4.

Can Whisper handle multiple languages in one video?

Whisper's --language flag sets a single language for the entire file. If you omit it, Whisper auto-detects the language, but only from the first 30 seconds of audio, so a file that switches languages mid-stream is still transcribed as one language. If your video mixes languages and precision matters, split the audio at the language boundaries and transcribe each part separately.

How accurate is Whisper compared to paid transcription services?

On clear English speech, Whisper's large-v3 and turbo models rival commercial services like Rev or Otter.ai. Accuracy drops with heavy accents, background noise, or overlapping speakers. For professional use, always review the generated SRT before burning it in.

Can I output VTT instead of SRT?

Yes. Use --output_format vtt with the Whisper CLI, or change the timestamp format in the Python script (VTT uses HH:MM:SS.mmm with a dot, while SRT uses HH:MM:SS,mmm with a comma). VTT is the standard for web video via the <track> element.
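If you already have SRT files, the conversion is mechanical: prepend a WEBVTT header and swap the decimal comma in timestamp lines for a dot. A minimal sketch (the srt_to_vtt helper name is my own):

```python
import re

# Matches an SRT timestamp line: HH:MM:SS,mmm --> HH:MM:SS,mmm
_TS_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})$")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT cue text to WebVTT: add the header, use '.' in timestamps."""
    lines = ["WEBVTT", ""]
    for line in srt_text.splitlines():
        m = _TS_LINE.match(line)
        if m:
            line = f"{m.group(1)}.{m.group(2)} --> {m.group(3)}.{m.group(4)}"
        lines.append(line)
    return "\n".join(lines)
```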

Why extract audio first instead of passing the video directly to Whisper?

Whisper can accept video files, but it still extracts audio internally using FFmpeg. Doing it yourself gives you control over the sample rate (16kHz mono is optimal for Whisper) and avoids re-extracting audio if you need to retry with different model settings.

Does the subtitles filter work on Windows?

Yes, but watch out for path issues. On Windows, the subtitles filter can fail if the file path contains non-ASCII characters (Japanese, spaces, etc.) because of how the Windows C runtime handles file paths. Stick to ASCII-only paths or copy the SRT to a simple location.

What's the best model for Japanese / non-English transcription?

Use medium or large-v3 for non-English languages. The tiny and base models frequently misrecognize proper nouns and produce garbled output in languages like Japanese or Korean. The turbo model is a good middle ground — near large-v3 accuracy at much higher speed.

Wrapping Up

Combining FFmpeg and Whisper automates nearly the entire subtitle workflow.

  • The pipeline is four steps: extract audio, transcribe with Whisper, write SRT, burn with FFmpeg
  • The turbo model is the best default for most use cases — fast and accurate
  • Use faster-whisper for 4× speed improvement
  • Keep file paths ASCII-only on Windows to avoid subtitles filter errors
  • The Python script wraps everything into a single command: python subtitle_pipeline.py input.mp4

Stop spending time on manual transcription. Let Whisper handle it and focus on editing.
