What is the YouTube Transcript Downloader Skill?

The YouTube Transcript Downloader Skill is a tool that allows you to download transcripts, subtitles, and cover images from YouTube videos using just their URL or video ID. It supports various features like multi-language retrieval, translation, chapter segmentation, and speaker identification.

How does this skill get YouTube transcripts without an API key?

The skill directly uses YouTube's InnerTube API to fetch transcripts. If direct access is blocked, it automatically falls back to `yt-dlp` to ensure reliable transcript retrieval without requiring a separate API key.

Can I get transcripts in languages other than English?

Yes, you can specify a comma-separated list of language codes using the `--languages` option. The skill will attempt to fetch transcripts in the specified priority order. You can also translate the transcript into another language using the `--translate` option.

Does it support speaker identification and chapter segmentation?

Yes, the skill supports chapter segmentation from video descriptions using the `--chapters` option. For speaker identification, you can use the `--speakers` option, which outputs a raw file for AI post-processing to label speakers.

What output formats are available for the YouTube transcript?

You can output the transcript in Markdown (`.md`) format, which includes timestamps, chapters, and optional speaker data, or as an SRT (`.srt`) subtitle file, which is compatible with most video players.

Is there a caching mechanism, and how does it work?

Yes, the skill caches video metadata, raw transcript data, and segmented sentences. Subsequent runs for the same video will use this cached data, speeding up processing. You can force a re-fetch with the `--refresh` option.

YouTube Transcript Agent Skill

Get started

Run Your First Task

01 Install
Add the YouTube Transcript Agent Skill to your AI agent.
02 Provide Content
Share a YouTube link and specify your desired output format.
03 Extract Insights
Generate structured transcripts, summaries, key takeaways, and reusable content.

About

The YouTube Transcript Downloader Skill downloads transcripts, subtitles, and cover images from YouTube videos given only a video URL or ID. It is aimed at content creators, researchers, and anyone who needs spoken content as text, and it supports multiple languages, translation, chapter segmentation, and a speaker identification workflow.

The skill reads transcripts through YouTube's InnerTube API and falls back to `yt-dlp` when the direct path is blocked, so no personal API key is needed. It can write Markdown, with timestamps and chapter markers for detailed analysis, or SRT for standard subtitle use. It also segments transcripts by the chapters listed in the video description and provides a workflow for AI-based speaker identification.

The skill caches metadata and transcript data, so later runs on the same video reuse the cache instead of making new network requests. This covers analyzing video content, creating accessible subtitles, translating material for other audiences, and extracting video metadata and thumbnails.

Key features

What makes it powerful

Direct YouTube Access
Accesses YouTube's InnerTube API directly for fast transcript retrieval, automatically falling back to `yt-dlp` if the direct API is blocked, ensuring reliable access without API keys.
Multi-language Support & Translation
Specify preferred languages for transcripts and translate them into a target language, making content accessible to a global audience.
Chapter Segmentation & Speaker Identification
Automatically segments transcripts by video chapters and supports AI post-processing for speaker identification, providing structured and attributed text.
Flexible Output Formats
Generates transcripts in Markdown with timestamps or SRT subtitle files, suitable for various uses from content analysis to video players.
Smart Caching for Efficiency
Caches raw video data, metadata, and segmented transcripts, enabling fast re-formatting and reducing network calls on subsequent requests for the same video.

Use cases

When to reach for it

Generate YouTube Transcripts for Content Analysis
Content creators and researchers can quickly obtain full YouTube transcripts, including timestamps and chapter markers, to analyze video content, extract key information, or repurpose spoken content into text articles.
Create Subtitle Files for Accessibility
Video editors and accessibility specialists can generate SRT subtitle files from YouTube videos, ensuring that content is accessible to hearing-impaired audiences or those who prefer to consume content silently.
Translate Video Content for Global Reach
Marketers and educators can translate YouTube transcripts into multiple languages, expanding the reach of their video content to non-native speakers and improving global engagement.
Extract Metadata and Cover Images
Users can easily extract video metadata and high-quality cover images, useful for cataloging, social media promotion, or creating visual assets related to the video content.

SKILL.md

YouTube Transcript

Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly and automatically falls back to yt-dlp when YouTube blocks the direct API path.

On the first run, the script fetches video metadata and the cover image, then caches the raw data for fast re-formatting.

Script Directory

Scripts live in the scripts/ subdirectory. {baseDir} is the directory that contains this SKILL.md. Resolve the ${BUN_X} runtime as follows:

- If bun is installed, use bun.
- Otherwise, if npx is available, use npx -y bun.
- Otherwise, suggest installing bun.

Replace {baseDir} and ${BUN_X} with the actual values before running any command.
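
The runtime resolution described above can be sketched as a small pure function. This is only an illustration: `resolveBunX`, `buildCommand`, and the example base directory are hypothetical names, and how an agent actually detects bun or npx is left out.

```typescript
// Hypothetical sketch of the ${BUN_X} resolution rule; not part of the skill's scripts.
type Availability = { bun: boolean; npx: boolean };

// Returns the command prefix to use, or null when bun should be suggested instead.
function resolveBunX(available: Availability): string[] | null {
  if (available.bun) return ["bun"];
  if (available.npx) return ["npx", "-y", "bun"];
  return null;
}

// Substitutes {baseDir} and ${BUN_X} into the documented command shape.
function buildCommand(bunX: string[], baseDir: string, args: string[]): string[] {
  return [...bunX, `${baseDir}/scripts/main.ts`, ...args];
}
```

For example, with only npx available and an assumed base directory of `/skills/youtube-transcript`, `buildCommand` yields the argument list for `npx -y bun /skills/youtube-transcript/scripts/main.ts <args>`.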

Script	Purpose
`scripts/main.ts`	Transcript download CLI

Usage

# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>

# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja

# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps

# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters

# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers

# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt

# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans

# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list

# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh

Options

Option	Description	Default
`<url-or-id>`	YouTube URL or video ID (multiple allowed)	Required
`--languages <codes>`	Language codes, comma-separated, in priority order	`en`
`--format <fmt>`	Output format: `text`, `srt`	`text`
`--translate <code>`	Translate to specified language code
`--list`	List available transcripts instead of fetching
`--timestamps`	Include `[HH:MM:SS → HH:MM:SS]` timestamps per paragraph	on
`--no-timestamps`	Disable timestamps
`--chapters`	Chapter segmentation from video description
`--speakers`	Raw transcript with metadata for speaker identification
`--exclude-generated`	Skip auto-generated transcripts
`--exclude-manually-created`	Skip manually created transcripts
`--refresh`	Force re-fetch, ignore cached data
`-o, --output <path>`	Save to specific file path	auto-generated
`--output-dir <dir>`	Base output directory	`youtube-transcript`
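
To illustrate how `--languages`, `--exclude-generated`, and `--exclude-manually-created` could interact, here is a minimal sketch. The `TranscriptInfo` shape and both function names are assumptions for illustration, and preferring a manually created transcript over an auto-generated one within the same language is also an assumption, not documented behavior.

```typescript
// Hypothetical shapes; the real script's internal types are not documented here.
interface TranscriptInfo { languageCode: string; isGenerated: boolean }
interface SelectOptions {
  languages: string[];
  excludeGenerated?: boolean;
  excludeManuallyCreated?: boolean;
}

// Parses a value such as "zh,en,ja" into an ordered list of codes.
function parseLanguages(value: string): string[] {
  return value.split(",").map((code) => code.trim()).filter((code) => code !== "");
}

// Returns the first transcript matching the language priority order, or null.
function pickTranscript(available: TranscriptInfo[], opts: SelectOptions): TranscriptInfo | null {
  const candidates = available.filter(
    (t) => !(opts.excludeGenerated && t.isGenerated) && !(opts.excludeManuallyCreated && !t.isGenerated),
  );
  for (const code of opts.languages) {
    const matches = candidates.filter((t) => t.languageCode === code);
    // Assumption: within one language, a manually created transcript is preferred.
    const match = matches.find((t) => !t.isGenerated) ?? matches[0];
    if (match !== undefined) return match;
  }
  return null; // corresponds to the "No transcript found" case
}
```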

Optional Environment Variables

Variable	Description
`YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER`	Passed to `yt-dlp --cookies-from-browser` during fallback, e.g. `chrome`, `safari`, `firefox`, or `chrome:Profile 1`

Input Formats

Accepts any of these as video input:

Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
Short URL: https://youtu.be/dQw4w9WgXcQ
Embed URL: https://www.youtube.com/embed/dQw4w9WgXcQ
Shorts URL: https://www.youtube.com/shorts/dQw4w9WgXcQ
Video ID: dQw4w9WgXcQ
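
As a rough sketch of how the five input forms above can be reduced to a video ID, consider the helper below. `extractVideoId` is a hypothetical name, not a function from `scripts/main.ts`, and the 11-character ID pattern is an assumption based on the example ID.

```typescript
// Hypothetical helper; assumes IDs are 11 characters from [A-Za-z0-9_-].
const VIDEO_ID = /^[A-Za-z0-9_-]{11}$/;

function extractVideoId(input: string): string | null {
  const trimmed = input.trim();
  if (VIDEO_ID.test(trimmed)) return trimmed; // bare video ID

  let url: URL;
  try {
    url = new URL(trimmed);
  } catch {
    return null; // neither an ID nor a parseable URL
  }

  const host = url.hostname.replace(/^www\./, "");
  const [, first, second] = url.pathname.split("/");
  let candidate: string | null = null;

  if (host === "youtu.be") {
    candidate = first ?? null; // https://youtu.be/<id>
  } else if (host === "youtube.com" || host.endsWith(".youtube.com")) {
    if (first === "watch") {
      candidate = url.searchParams.get("v"); // https://www.youtube.com/watch?v=<id>
    } else if (first === "embed" || first === "shorts") {
      candidate = second ?? null; // /embed/<id> and /shorts/<id>
    }
  }
  return candidate !== null && VIDEO_ID.test(candidate) ? candidate : null;
}
```

Validating the candidate against the ID pattern at the end keeps malformed URLs from producing a bogus ID.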

Output Formats

Format	Extension	Description
`text`	`.md`	Markdown with frontmatter (incl. `description`), title heading, summary, optional TOC/cover/timestamps/chapters/speakers
`srt`	`.srt`	SubRip subtitle format for video players

Output Directory

youtube-transcript/
├── .index.json                          # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                        # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json              # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json        # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg                    # Video thumbnail
    ├── transcript.md                    # Markdown transcript (generated from sentences)
    └── transcript.srt                   # SRT subtitle (generated from raw snippets, if --format srt)

{channel-slug}: Channel name in kebab-case
{title-full-slug}: Full video title in kebab-case
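
The exact slug rules are not specified here, so the following is only one plausible kebab-case conversion. `toKebabSlug` and `transcriptDir` are hypothetical names, and stripping Latin accents while keeping non-Latin letters is an assumption.

```typescript
// Hypothetical kebab-case slug; the script's real rules may differ.
function toKebabSlug(name: string): string {
  return name
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // drop Latin combining accents (assumption)
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\p{M}]+/gu, "-") // runs of other characters become one hyphen
    .replace(/^-+|-+$/g, "") // trim leading and trailing hyphens
    .normalize("NFC");
}

// Builds {output-dir}/{channel-slug}/{title-full-slug}.
function transcriptDir(outputDir: string, channel: string, title: string): string {
  return `${outputDir}/${toKebabSlug(channel)}/${toKebabSlug(title)}`;
}
```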

The --list mode outputs to stdout only (no file saved).

Caching

On first fetch, the script saves:

meta.json — video metadata, chapters, cover image path, language info
transcript-raw.json — raw transcript snippets from YouTube API ({ text, start, duration }[])
transcript-sentences.json — sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split by sentence-ending punctuation (.?!…。？！ etc.), timestamps proportionally allocated by character length, CJK-aware text merging
imgs/cover.jpg — video thumbnail
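
The sentence segmentation described for transcript-sentences.json can be sketched as follows. This is an illustrative approximation, not the script's code: the function names, the CJK character ranges, and the joining rule are assumptions, and a naive punctuation split like this one will also break on decimals such as "3.5".

```typescript
interface Snippet { text: string; start: number; duration: number } // seconds
interface Sentence { text: string; start: string; end: string } // "HH:mm:ss"

const SENTENCE_END = /[.?!…。？！]/;
const CJK = /[\u3040-\u30ff\u3400-\u9fff]/; // kana and common ideographs only (assumption)

function formatTime(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const pad = (n: number): string => String(n).padStart(2, "0");
  return [Math.floor(total / 3600), Math.floor((total % 3600) / 60), total % 60].map(pad).join(":");
}

function segmentSentences(snippets: Snippet[]): Sentence[] {
  let merged = "";
  const times: number[] = []; // times[i]: estimated time of merged.charAt(i)
  let finalEnd = 0;

  for (const s of snippets) {
    const text = s.text.replace(/\s+/g, " ").trim();
    if (text === "") continue;
    // CJK-aware merge: add a joining space only between two non-CJK characters.
    if (merged !== "" && !CJK.test(merged.slice(-1)) && !CJK.test(text.charAt(0))) {
      merged += " ";
      times.push(s.start);
    }
    // Allocate the snippet's duration across its characters proportionally.
    for (let i = 0; i < text.length; i++) {
      times.push(s.start + (s.duration * i) / text.length);
    }
    merged += text;
    finalEnd = Math.max(finalEnd, s.start + s.duration);
  }

  const sentences: Sentence[] = [];
  let begin = 0;
  for (let i = 0; i < merged.length; i++) {
    const isLast = i === merged.length - 1;
    if (!isLast && !SENTENCE_END.test(merged.charAt(i))) continue;
    // Keep runs such as "?!" or "..." attached to the same sentence.
    if (!isLast && SENTENCE_END.test(merged.charAt(i + 1))) continue;
    while (begin < i && merged.charAt(begin) === " ") begin++;
    const text = merged.slice(begin, i + 1).trim();
    if (text !== "") {
      sentences.push({
        text,
        start: formatTime(times[begin] ?? 0),
        end: formatTime(times[i + 1] ?? finalEnd),
      });
    }
    begin = i + 1;
  }
  return sentences;
}
```

Because each character carries an interpolated time, a sentence that spans several snippets starts at the time of its first character and ends where the next sentence begins.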

Subsequent runs for the same video use cached data (no network calls). Use --refresh to force re-fetch. If a different language is requested, the cache is automatically refreshed.

When YouTube returns anti-bot / blocked responses on the direct InnerTube path, the script retries with alternate client identities and then falls back to yt-dlp if available. If fallback is needed but yt-dlp is unavailable, the agent should decide how to make yt-dlp available and continue rather than pushing the installation decision to the user.

SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.
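
A minimal sketch of turning raw `{ text, start, duration }` snippets into SubRip cues is shown below. `srtTime` and `toSrt` are hypothetical names; the script's actual handling of overlapping cues or multi-line text is not documented here and is not modeled.

```typescript
interface Snippet { text: string; start: number; duration: number } // seconds

// Formats seconds as the SRT timestamp HH:MM:SS,mmm.
function srtTime(seconds: number): string {
  const ms = Math.max(0, Math.round(seconds * 1000));
  const pad = (n: number, width = 2): string => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Emits one numbered cue per snippet, separated by blank lines.
function toSrt(snippets: Snippet[]): string {
  return snippets
    .map((s, i) => `${i + 1}\n${srtTime(s.start)} --> ${srtTime(s.start + s.duration)}\n${s.text.trim()}\n`)
    .join("\n");
}
```

Working from the raw snippets rather than the merged sentences keeps each cue aligned with the timing YouTube provided.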

Workflow

When the user provides a YouTube URL and wants the transcript:

1. If the user hasn't specified a language, run with --list first to show the available options.
2. Always single-quote the URL when running the script. zsh treats ? as a glob wildcard, so an unquoted YouTube URL fails with "no matches found". Use 'https://www.youtube.com/watch?v=ID'.
3. By default, run with --chapters --speakers for the richest output (chapters plus speaker identification).
4. The script auto-saves the cached data and the output file, then prints the file path.
5. For --speakers mode, after the script saves the raw file, follow the speaker identification workflow below to post-process it with speaker labels.

When the user only wants a cover image or metadata, running the script with any option also caches meta.json and imgs/cover.jpg.

When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.

Chapter & Speaker Workflow

Chapters (`--chapters`)

The script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves as .md with a Table of Contents. No further processing needed.

If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
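
Parsing chapter lines such as `0:00 Introduction` from a description might look like the sketch below. The regular expression and both function names are assumptions for illustration. A real description can contain time-like lines that are not chapters, which this sketch does not filter out.

```typescript
interface Chapter { title: string; start: number } // start in seconds

// Matches "0:00 Introduction", "12:34 - Topic", or "1:02:03 Wrap-up".
const CHAPTER_LINE = /^\s*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s*[-:]?\s*(.+?)\s*$/;

function parseChapters(description: string): Chapter[] {
  const chapters: Chapter[] = [];
  for (const line of description.split(/\r?\n/)) {
    const m = CHAPTER_LINE.exec(line);
    if (m === null) continue;
    const [, h, min, sec, title] = m;
    chapters.push({
      title: (title ?? "").trim(),
      start: Number(h ?? 0) * 3600 + Number(min) * 60 + Number(sec),
    });
  }
  return chapters;
}

// Finds the chapter containing a given time; assumes chapters are sorted by start.
function chapterFor(chapters: Chapter[], time: number): Chapter | null {
  let current: Chapter | null = null;
  for (const c of chapters) {
    if (c.start <= time) current = c;
  }
  return current;
}
```

When `parseChapters` returns an empty list, a caller would fall back to grouped paragraphs without chapter headings, matching the behavior described above.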

Speaker Identification (`--speakers`)

Speaker identification requires AI processing. The script outputs a raw .md file containing:

YAML frontmatter with video metadata (title, channel, date, cover, description, language)
Video description (for speaker name extraction)
Chapter list from description (if available)
Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)

After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:

1. Read the saved .md file.
2. Read the prompt template at {baseDir}/prompts/speaker-transcript.md.
3. Process the raw transcript following the prompt:
   - Identify speakers using video metadata (title → guest, channel → host, description → names)
   - Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
   - Segment into chapters (use description chapters if available, else create from topic shifts)
   - Format with **Speaker Name:** labels, paragraph grouping (2-4 sentences), and [HH:MM:SS → HH:MM:SS] timestamps
4. Overwrite the .md file with the processed transcript (keep the YAML frontmatter).

When --speakers is used, --chapters is implied — the processed output always includes chapter segmentation.

Error Cases

Error	Meaning
Transcripts disabled	Video has no captions at all
No transcript found	Requested language not available
Video unavailable	Video deleted, private, or region-locked
IP blocked	Too many requests, try again later
Age restricted	Video requires login for age verification
bot detected	The script retries with alternate clients and then falls back to `yt-dlp`. If the fallback tooling is missing, the agent should resolve that itself. If retrieval still fails, set `YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER=safari` (or your own browser) and try again
