Adding subtitles to a video is tedious work. Transcribe the speech, align the timestamps, export the file — doing this by hand every time just doesn't scale.
This article shows you how to combine OpenAI's Whisper with FFmpeg to fully automate subtitle generation. We'll build a pipeline that goes from raw video all the way to a subtitled output in a single script: audio extraction → transcription → SRT generation → subtitle burning.
## Eliminate the Subtitle Bottleneck
The real bottleneck in subtitle creation is transcription. Converting spoken words to text, deciding where to break segments, and assigning timecodes easily consumes the bulk of the total time.
Whisper solves this almost entirely. Pass it an audio file and it handles transcription and timecoding automatically. Output the result as SRT and you can pipe it straight into FFmpeg to burn the subtitles into your video.
You need two things:
- Python 3.8+ with Whisper installed
- FFmpeg already installed and in your PATH
If you need a refresher on FFmpeg basics, check out Getting Started with FFmpeg first.
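Before running anything, it helps to confirm that both tools are actually reachable from your environment. Here is a minimal sketch using only the standard library (the `check_tool` helper is just an illustrative name, not part of either tool):

```python
import shutil

def check_tool(name: str) -> bool:
    """Return True if an executable with this name is on the PATH."""
    return shutil.which(name) is not None

for tool in ("ffmpeg", "whisper"):
    status = "found" if check_tool(tool) else "MISSING"
    print(f"{tool}: {status}")
```

If either line prints `MISSING`, fix your installation before continuing; the pipeline below assumes both commands work.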
## Install Whisper and Choose a Model

Install Whisper via pip:

```bash
pip install openai-whisper
```
Once installed, pick a model size. Whisper ships several variants with different speed/accuracy trade-offs.
| Model | VRAM | Speed | Accuracy |
|---|---|---|---|
| tiny | ~1GB | Fastest | Low |
| base | ~1GB | Fast | Moderate |
| small | ~2GB | Medium | Good |
| medium | ~5GB | Slow | High |
| large | ~10GB | Slowest | Best |
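A reasonable rule of thumb is to pick the most accurate model that fits your GPU. The numbers below mirror the table above; the `pick_model` helper is just a sketch of that decision, not part of Whisper:

```python
# Approximate VRAM needs in GB, taken from the table above
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
ORDER = ["tiny", "base", "small", "medium", "large"]

def pick_model(vram_budget_gb: float) -> str:
    """Return the most accurate model that fits the given VRAM budget."""
    fitting = [m for m in ORDER if VRAM_GB[m] <= vram_budget_gb]
    return fitting[-1] if fitting else "tiny"

print(pick_model(6))  # a 6GB GPU comfortably runs "medium"
```

The rest of this article uses `medium`, which is a good balance for most single-GPU setups.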
## Extract Audio with FFmpeg

Whisper can accept a video file directly, but extracting audio first makes the pipeline faster and cleaner. Use FFmpeg to convert to WAV:

```bash
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```
What each flag does:
- `-vn` — ignore the video stream, process audio only
- `-acodec pcm_s16le` — output as WAV (16-bit linear PCM)
- `-ar 16000` — set sample rate to 16kHz (Whisper's recommended input)
- `-ac 1` — downmix to mono (Whisper is more consistent with mono audio)
The resulting `audio.wav` is what gets passed to Whisper next.
## Transcribe Audio and Generate SRT with Whisper
With the audio file ready, run it through Whisper. You can use either the CLI or the Python API.
### Using the CLI

```bash
whisper audio.wav --model medium --language en --output_format srt --output_dir ./subtitles
```
Explicitly passing `--language` is faster and more accurate than letting Whisper auto-detect. The output lands at `./subtitles/audio.srt` and looks like this:
```
1
00:00:00,000 --> 00:00:03,500
Welcome. Today we're going to look at FFmpeg and Whisper.

2
00:00:03,500 --> 00:00:07,200
Let's start by installing Whisper.
```
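Before feeding an SRT file to FFmpeg, it can be worth sanity-checking its structure. Here is a hedged sketch of a minimal SRT parser (the `parse_srt` helper and its regex are my own, not part of Whisper or FFmpeg):

```python
import re

# One SRT entry: index, start --> end, then one or more text lines
SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.+?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(text: str) -> list:
    """Split SRT text into {index, start, end, text} entries."""
    return [
        {"index": int(i), "start": s, "end": e, "text": t.strip()}
        for i, s, e, t in SRT_BLOCK.findall(text)
    ]

sample = """1
00:00:00,000 --> 00:00:03,500
Welcome. Today we're going to look at FFmpeg and Whisper.

2
00:00:03,500 --> 00:00:07,200
Let's start by installing Whisper.
"""
entries = parse_srt(sample)
print(len(entries), entries[0]["start"])
```

If the parsed entry count does not match what you expect, the file is likely malformed and FFmpeg's `subtitles` filter may silently drop cues.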
### Using the Python API
Use the API when you need more control over the output.
```python
import os
import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav", language="en")

def format_timestamp(seconds: float) -> str:
    ms = int((seconds % 1) * 1000)
    s = int(seconds) % 60
    m = int(seconds) // 60 % 60
    h = int(seconds) // 3600
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Make sure the output directory exists before writing
os.makedirs("subtitles", exist_ok=True)

with open("subtitles/audio.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        f.write(f"{i}\n{start} --> {end}\n{text}\n\n")

print("SRT file written")
```
Each item in `result["segments"]` contains `start` and `end` times (in seconds) and the segment `text`. The helper formats those into SRT timestamps.
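Whisper sometimes emits segments that are too long to read comfortably on screen. A common subtitling convention caps lines at roughly 42 characters; here is a sketch of a post-processing step built on `textwrap` (the `wrap_caption` helper is my own, and a real implementation would push truncated overflow into the next cue rather than drop it):

```python
import textwrap

def wrap_caption(text: str, width: int = 42, max_lines: int = 2) -> str:
    """Wrap a segment's text to subtitle-friendly line lengths."""
    lines = textwrap.wrap(text, width=width)
    # Keep at most max_lines; anything beyond is dropped in this sketch
    return "\n".join(lines[:max_lines])

seg = "Welcome. Today we're going to look at FFmpeg and Whisper together."
print(wrap_caption(seg))
```

Apply this to each `segment["text"]` before writing the SRT if your segments run long.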
## Burn Subtitles into the Video with FFmpeg

Once the SRT file exists, use FFmpeg's `subtitles` filter to hard-code it into the video:

```bash
ffmpeg -i input.mp4 -vf "subtitles=subtitles/audio.srt" -c:a copy output.mp4
```
To customize the font, size, and color, pass a `force_style` argument:

```bash
ffmpeg -i input.mp4 \
  -vf "subtitles=subtitles/audio.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&Hffffff,OutlineColour=&H000000,Outline=2'" \
  -c:a copy output.mp4
```
`force_style` follows the ASS/SSA style specification. Non-Latin scripts should use a font that covers the required character set to avoid rendering issues.
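The `force_style` string is easy to get subtly wrong when assembled by hand. If you build the filter from Python, a small helper keeps the quoting consistent (the `force_style` function name and `opts` dict are illustrative, not an FFmpeg API):

```python
def force_style(options: dict) -> str:
    """Build the force_style argument for FFmpeg's subtitles filter."""
    style = ",".join(f"{k}={v}" for k, v in options.items())
    return f"force_style='{style}'"

opts = {
    "FontName": "Arial",
    "FontSize": 24,
    "PrimaryColour": "&Hffffff",
    "Outline": 2,
}
print(force_style(opts))
```

Concatenate the result onto `subtitles=path:` to form the full `-vf` value.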
## Combine Everything Into One Script
Here is the full pipeline in a single Python script. Point it at a video file and it outputs a subtitled version.
```python
import subprocess
import sys
import os
import whisper


def extract_audio(input_video: str, output_audio: str) -> None:
    """Extract audio from a video file."""
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        output_audio,
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Audio extracted: {output_audio}")


def transcribe_to_srt(audio_path: str, srt_path: str, model_name: str = "medium") -> None:
    """Transcribe audio with Whisper and write an SRT file."""
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="en")

    def format_timestamp(seconds: float) -> str:
        ms = int((seconds % 1) * 1000)
        s = int(seconds) % 60
        m = int(seconds) // 60 % 60
        h = int(seconds) // 3600
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(srt_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result["segments"], start=1):
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            text = segment["text"].strip()
            f.write(f"{i}\n{start} --> {end}\n{text}\n\n")
    print(f"SRT written: {srt_path}")


def burn_subtitles(input_video: str, srt_path: str, output_video: str) -> None:
    """Hard-code subtitles into the video."""
    # Assumes srt_path contains only ASCII characters
    vf = f"subtitles={srt_path}"
    cmd = [
        "ffmpeg", "-y", "-i", input_video,
        "-vf", vf,
        "-c:a", "copy",
        output_video,
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Subtitles burned: {output_video}")


def main(input_video: str) -> None:
    base = os.path.splitext(input_video)[0]
    audio_path = f"{base}_audio.wav"
    srt_path = f"{base}_subtitles.srt"
    output_video = f"{base}_subtitled.mp4"

    print("=== Subtitle Auto-Generation Pipeline ===")
    extract_audio(input_video, audio_path)
    transcribe_to_srt(audio_path, srt_path)
    burn_subtitles(input_video, srt_path, output_video)

    # Clean up intermediate files
    os.remove(audio_path)
    print(f"\nDone: {output_video}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python subtitle_pipeline.py <input_video>")
        sys.exit(1)
    main(sys.argv[1])
```
Run it with a single command:

```bash
python subtitle_pipeline.py input.mp4
```
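One caveat: because the script calls `subprocess.run` with `capture_output=True`, FFmpeg's error messages are swallowed when a step fails. A sketch of a wrapper that surfaces stderr on failure (the `run_logged` name is my own; the failing command below just simulates an error so the example runs without FFmpeg):

```python
import subprocess
import sys

def run_logged(cmd: list) -> None:
    """Run a command, surfacing its stderr if it fails."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as err:
        print(f"Command failed (exit {err.returncode}):", file=sys.stderr)
        print(err.stderr, file=sys.stderr)
        raise

# Simulate a failing command so this runs without FFmpeg installed
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(2)"]
try:
    run_logged(failing)
except subprocess.CalledProcessError:
    print("caught failure")
```

Swapping `subprocess.run` for a wrapper like this in `extract_audio` and `burn_subtitles` makes pipeline failures much easier to diagnose.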
If you want to process an entire folder of videos at once, the FFmpeg Python Batch Automation article covers that pattern in detail.
## Wrapping Up
Combining FFmpeg and Whisper automates nearly the entire subtitle workflow.
- The pipeline is four steps: extract audio, transcribe with Whisper, write SRT, burn with FFmpeg
- Use medium or larger models for non-English audio
- Keep all file paths ASCII-only to avoid errors from the `subtitles` filter
- The Python script wraps everything into a single command: `python subtitle_pipeline.py input.mp4`
Stop spending time on manual transcription. Let Whisper handle it and focus on editing.