by StudioMitsu · 6 min read

Auto-Generate Subtitles with FFmpeg and Whisper

Combine OpenAI's Whisper with FFmpeg to automatically add subtitles to any video. Build a pipeline that handles audio extraction, SRT generation, and subtitle burning.

Tags: FFmpeg, Whisper, subtitles, automation, AI

Adding subtitles to a video is tedious work. Transcribe the speech, align the timestamps, export the file — doing this by hand every time just doesn't scale.

This article shows you how to combine OpenAI's Whisper with FFmpeg to fully automate subtitle generation. We'll build a pipeline that goes from raw video all the way to a subtitled output in a single script: audio extraction → transcription → SRT generation → subtitle burning.

Eliminate the Subtitle Bottleneck

The real bottleneck in subtitle creation is transcription. Converting spoken words to text, deciding where to break segments, and assigning timecodes accounts for the vast majority of the total time.

Whisper solves this almost entirely. Pass it an audio file and it handles transcription and timecoding automatically. Output the result as SRT and you can pipe it straight into FFmpeg to burn the subtitles into your video.

You need two things:

  • Python 3.8+ with Whisper installed
  • FFmpeg already installed and in your PATH

If you need a refresher on FFmpeg basics, check out Getting Started with FFmpeg first.
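Before diving in, it's worth confirming both prerequisites are actually in place. Here is a minimal preflight sketch (the `preflight` function name is this article's own, not part of any library):

```python
import shutil
import sys


def preflight() -> list:
    """Return a list of missing prerequisites (empty means ready to go)."""
    missing = []
    if sys.version_info < (3, 8):
        missing.append("Python 3.8+")
    # shutil.which mirrors the shell's PATH lookup
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg on PATH")
    return missing


if __name__ == "__main__":
    problems = preflight()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All prerequisites found")
```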

Install Whisper and Choose a Model

Install Whisper via pip.

bash
pip install openai-whisper

Once installed, pick a model size. Whisper ships several variants with different speed/accuracy trade-offs.

Model     VRAM     Speed     Accuracy
tiny      ~1GB     Fastest   Low
base      ~1GB     Fast      Moderate
small     ~2GB     Medium    Good
medium    ~5GB     Slow      High
large     ~10GB    Slowest   Best
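One practical way to read the table: pick the largest model that fits your GPU's memory. As a rough heuristic (the thresholds below come from the table above and are this article's own rule of thumb, not part of Whisper):

```python
def pick_model(vram_gb: float) -> str:
    """Pick the largest Whisper model that fits the available VRAM,
    using the approximate figures from the table above."""
    # (name, approximate VRAM needed in GB), smallest to largest
    models = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]
    fitting = [name for name, need in models if need <= vram_gb]
    # Fall back to the smallest model if nothing fits
    return fitting[-1] if fitting else "tiny"


print(pick_model(6))    # a 6GB card can run "medium"
print(pick_model(0.5))  # falls back to "tiny"
```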

Extract Audio with FFmpeg

Whisper can accept a video file directly, but extracting audio first makes the pipeline faster and cleaner. Use FFmpeg to convert to WAV.

bash
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

What each flag does:

  • -vn — ignore the video stream, process audio only
  • -acodec pcm_s16le — output as WAV (16-bit linear PCM)
  • -ar 16000 — set sample rate to 16kHz (Whisper's recommended input)
  • -ac 1 — downmix to mono (Whisper is more consistent with mono audio)

The resulting audio.wav is what gets passed to Whisper next.
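If you want to verify the extraction produced what Whisper prefers, the standard library's wave module can inspect the file. A minimal sketch (the `check_whisper_input` helper is this article's own name):

```python
import wave


def check_whisper_input(path: str) -> None:
    """Sanity-check that a WAV file matches Whisper's preferred input:
    16 kHz sample rate, mono, 16-bit samples."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz sample rate"
        assert w.getnchannels() == 1, "expected mono audio"
        assert w.getsampwidth() == 2, "expected 16-bit (pcm_s16le) samples"
```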

Transcribe Audio and Generate SRT with Whisper

With the audio file ready, run it through Whisper. You can use either the CLI or the Python API.

Using the CLI

bash
whisper audio.wav --model medium --language en --output_format srt --output_dir ./subtitles

Explicitly passing --language is faster and more accurate than letting Whisper auto-detect. The output lands at ./subtitles/audio.srt and looks like this:

1
00:00:00,000 --> 00:00:03,500
Welcome. Today we're going to look at FFmpeg and Whisper.

2
00:00:03,500 --> 00:00:07,200
Let's start by installing Whisper.

Using the Python API

Use the API when you need more control over the output.

python
import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav", language="en")

def format_timestamp(seconds: float) -> str:
    ms = int((seconds % 1) * 1000)
    s = int(seconds) % 60
    m = int(seconds) // 60 % 60
    h = int(seconds) // 3600
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("subtitles/audio.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

print("SRT file written")

Each item in result["segments"] contains start, end (in seconds), and text. The helper formats those into SRT timestamps.
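As a quick sanity check, the timestamp helper can be exercised on a couple of values (a self-contained copy of the function above):

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int((seconds % 1) * 1000)
    s = int(seconds) % 60
    m = int(seconds) // 60 % 60
    h = int(seconds) // 3600
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


print(format_timestamp(3661.5))  # 01:01:01,500
print(format_timestamp(0.0))     # 00:00:00,000
```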

Burn Subtitles into the Video with FFmpeg

Once the SRT file exists, use FFmpeg's subtitles filter to hard-code it into the video.

bash
ffmpeg -i input.mp4 -vf "subtitles=subtitles/audio.srt" -c:a copy output.mp4

To customize the font, size, and color, pass a force_style argument.

bash
ffmpeg -i input.mp4 \
  -vf "subtitles=subtitles/audio.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&Hffffff,OutlineColour=&H000000,Outline=2'" \
  -c:a copy output.mp4

force_style follows the ASS/SSA style specification. Non-Latin scripts should use a font that covers the required character set to avoid rendering issues.
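One gotcha with those &H color values: ASS/SSA stores colors byte-reversed relative to familiar RGB hex, i.e. blue-green-red order. A tiny hypothetical helper (the `rgb_to_ass` name is this article's own) makes the conversion explicit:

```python
def rgb_to_ass(hex_rgb: str) -> str:
    """Convert an RGB hex string like 'ff0000' (red) into the
    BGR-ordered &H format that ASS/SSA force_style expects."""
    r, g, b = hex_rgb[0:2], hex_rgb[2:4], hex_rgb[4:6]
    return f"&H{b}{g}{r}"


print(rgb_to_ass("ffffff"))  # &Hffffff (white is the same either way)
print(rgb_to_ass("ff0000"))  # &H0000ff (red flips to blue-green-red order)
```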

Combine Everything Into One Script

Here is the full pipeline in a single Python script. Point it at a video file and it outputs a subtitled version.

python
import subprocess
import sys
import os
import whisper


def extract_audio(input_video: str, output_audio: str) -> None:
    """Extract audio from a video file."""
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        output_audio
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Audio extracted: {output_audio}")


def transcribe_to_srt(audio_path: str, srt_path: str, model_name: str = "medium") -> None:
    """Transcribe audio with Whisper and write an SRT file."""
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="en")

    def format_timestamp(seconds: float) -> str:
        ms = int((seconds % 1) * 1000)
        s = int(seconds) % 60
        m = int(seconds) // 60 % 60
        h = int(seconds) // 3600
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(srt_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result["segments"], start=1):
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            text = segment["text"].strip()
            f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

    print(f"SRT written: {srt_path}")


def burn_subtitles(input_video: str, srt_path: str, output_video: str) -> None:
    """Hard-code subtitles into the video."""
    # Keep srt_path simple (ASCII, no ':', '\' or quotes) — the subtitles
    # filter treats those as special characters and needs them escaped
    vf = f"subtitles={srt_path}"
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-vf", vf,
        "-c:a", "copy",
        output_video
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Subtitles burned: {output_video}")


def main(input_video: str) -> None:
    base = os.path.splitext(input_video)[0]
    audio_path = f"{base}_audio.wav"
    srt_path = f"{base}_subtitles.srt"
    output_video = f"{base}_subtitled.mp4"

    print("=== Subtitle Auto-Generation Pipeline ===")
    extract_audio(input_video, audio_path)
    transcribe_to_srt(audio_path, srt_path)
    burn_subtitles(input_video, srt_path, output_video)

    # Clean up intermediate files
    os.remove(audio_path)
    print(f"\nDone: {output_video}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python subtitle_pipeline.py <input_video>")
        sys.exit(1)
    main(sys.argv[1])

Run it with a single command.

bash
python subtitle_pipeline.py input.mp4

If you want to process an entire folder of videos at once, the FFmpeg Python Batch Automation article covers that pattern in detail.

Wrapping Up

Combining FFmpeg and Whisper automates nearly the entire subtitle workflow.

  • The pipeline is four steps: extract audio, transcribe with Whisper, write SRT, burn with FFmpeg
  • Use medium or larger models for non-English audio
  • Keep file paths simple (ASCII, no special characters) to avoid escaping issues with the subtitles filter
  • The Python script wraps everything into a single command: python subtitle_pipeline.py input.mp4

Stop spending time on manual transcription. Let Whisper handle it and focus on editing.