YouTube Transcript Agent Skill

by JimLiu · 1K GitHub stars

Download YouTube video transcripts, subtitles, and cover images by URL or video ID. Supports multiple languages, translation, chapters, and speaker identification to enhance content accessibility. Start free in seconds.

youtube · Security scan passed
Result preview

Full Demo

Explore a professional video analysis report powered by this Skill.

Get started

Run Your First Task

  1. Install

     Add the YouTube Transcript Agent Skill to your AI agent.

     [Screenshot: a chat-style log of installing the 'baoyu-youtube-transcript' skill from its GitHub repository with an npx command, rerunning with '-y' to bypass the interactive prompt, targeting the Hermes Agent, and symlinking the skill into the Hermes skills directory.]

  2. Provide Content

     Share a YouTube link and specify your desired output format.

     [Screenshot: a prompt asking for transcript extraction, analysis, and a PDF report; the agent replies that the skill is loaded but that a YouTube URL is still required.]

  3. Extract Insights

     Generate structured transcripts, summaries, key takeaways, and reusable content.

     [Image: report cover slide titled 'This Is How Kids Should Be Learning with AI', Priya Lakhani, TEDNext 2025, subtitled 'TED Talk Analysis & Knowledge Report'.]

Install command

$ npx skills add https://github.com/JimLiu/baoyu-skills/tree/main/skills/baoyu-youtube-transcript

About

The YouTube Transcript Downloader Skill provides a robust solution for extracting comprehensive information from YouTube videos. This skill allows users to effortlessly download full YouTube transcripts, subtitles, and even cover images by simply providing a video URL or ID. It's designed for content creators, researchers, and anyone needing to convert spoken content into text, offering features like multi-language support, translation capabilities, and advanced structuring options.

Leveraging direct access to YouTube's InnerTube API and a smart fallback to `yt-dlp`, the skill ensures reliable and efficient data retrieval without the need for personal API keys. It offers flexible output formats, including Markdown for detailed analysis with timestamps and chapter markers, and SRT for standard subtitle integration. Additionally, it supports chapter segmentation from video descriptions and provides a workflow for AI-powered speaker identification, delivering highly organized and attributed text.

With caching, the skill avoids redundant network requests: later runs on the same video reuse the cached data, so re-formatting needs no re-fetch. Whether you need to analyze video content, create accessible subtitles, translate material for a global audience, or simply extract video metadata and thumbnails, this skill streamlines the process.

Key features

What makes it powerful

  • Direct YouTube Access

    Accesses YouTube's InnerTube API directly for fast transcript retrieval, automatically falling back to `yt-dlp` if the direct API is blocked, ensuring reliable access without API keys.

  • Multi-language Support & Translation

    Specify preferred languages for transcripts and translate them into a target language, making content accessible to a global audience.

  • Chapter Segmentation & Speaker Identification

    Automatically segments transcripts by video chapters and supports AI post-processing for speaker identification, providing structured and attributed text.

  • Flexible Output Formats

    Generates transcripts in Markdown with timestamps or SRT subtitle files, suitable for various uses from content analysis to video players.

  • Smart Caching for Efficiency

    Caches raw video data, metadata, and segmented transcripts, enabling fast re-formatting and reducing network calls on subsequent requests for the same video.

Use cases

When to reach for it

  • Generate YouTube Transcripts for Content Analysis

    Content creators and researchers can quickly obtain full YouTube transcripts, including timestamps and chapter markers, to analyze video content, extract key information, or repurpose spoken content into text articles.

  • Create Subtitle Files for Accessibility

    Video editors and accessibility specialists can generate SRT subtitle files from YouTube videos, ensuring that content is accessible to hearing-impaired audiences or those who prefer to consume content silently.

  • Translate Video Content for Global Reach

    Marketers and educators can translate YouTube transcripts into multiple languages, expanding the reach of their video content to non-native speakers and improving global engagement.

  • Extract Metadata and Cover Images

    Users can easily extract video metadata and high-quality cover images, useful for cataloging, social media promotion, or creating visual assets related to the video content.

SKILL.md

YouTube Transcript

Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly and automatically falls back to yt-dlp when YouTube blocks the direct API path.

Fetches video metadata and cover image on first run, caches raw data for fast re-formatting.

Script Directory

Scripts live in the scripts/ subdirectory. {baseDir} is the directory that contains this SKILL.md. Resolve the ${BUN_X} runtime as follows: if bun is installed, use bun; otherwise, if npx is available, use npx -y bun; otherwise, suggest installing bun. Replace {baseDir} and ${BUN_X} with the actual values before running any command.
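The resolution order above can be sketched as a small pure function. This is only an illustration: `resolveBunX`, `fillTemplate`, and the injected `hasCommand` probe are hypothetical names, not part of the skill, and how a command is detected on the PATH is left to the caller.

```typescript
// Hypothetical helper type: the caller decides how to detect a command
// (for example by probing the PATH); the decision logic itself stays pure.
type CommandProbe = (command: string) => boolean;

// Returns the prefix to substitute for ${BUN_X}, or null when neither
// bun nor npx is available and the user should be asked to install bun.
function resolveBunX(hasCommand: CommandProbe): string | null {
  if (hasCommand("bun")) return "bun";
  if (hasCommand("npx")) return "npx -y bun";
  return null;
}

// Substitutes both placeholders in a command template from this document.
function fillTemplate(template: string, bunX: string, baseDir: string): string {
  return template.split("${BUN_X}").join(bunX).split("{baseDir}").join(baseDir);
}
```

A `null` result corresponds to the "suggest installing bun" branch.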

| Script | Purpose |
| --- | --- |
| `scripts/main.ts` | Transcript download CLI |

Usage

```bash
# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>

# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja

# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps

# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters

# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers

# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt

# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans

# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list

# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh
```

Options

| Option | Description | Default |
| --- | --- | --- |
| `<url-or-id>` | YouTube URL or video ID (multiple allowed) | Required |
| `--languages <codes>` | Language codes, comma-separated, in priority order | en |
| `--format <fmt>` | Output format: text, srt | text |
| `--translate <code>` | Translate to specified language code | |
| `--list` | List available transcripts instead of fetching | |
| `--timestamps` | Include [HH:MM:SS → HH:MM:SS] timestamps per paragraph | on |
| `--no-timestamps` | Disable timestamps | |
| `--chapters` | Chapter segmentation from video description | |
| `--speakers` | Raw transcript with metadata for speaker identification | |
| `--exclude-generated` | Skip auto-generated transcripts | |
| `--exclude-manually-created` | Skip manually created transcripts | |
| `--refresh` | Force re-fetch, ignore cached data | |
| `-o, --output <path>` | Save to specific file path | auto-generated |
| `--output-dir <dir>` | Base output directory | youtube-transcript |

Optional Environment Variables

| Variable | Description |
| --- | --- |
| `YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER` | Passed to yt-dlp `--cookies-from-browser` during fallback, e.g. chrome, safari, firefox, or chrome:Profile 1 |
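As a rough illustration of how such a variable could be turned into yt-dlp arguments, the sketch below maps the environment to an argument list. `cookieArgs` and the `Env` type are hypothetical; the skill's real fallback code may build its yt-dlp invocation differently.

```typescript
// Hypothetical shape for the process environment.
type Env = Record<string, string | undefined>;

// Returns the extra yt-dlp arguments implied by the environment variable,
// or an empty list when it is unset or blank.
function cookieArgs(env: Env): string[] {
  const raw = env["YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER"];
  const browser = raw ? raw.trim() : "";
  return browser ? ["--cookies-from-browser", browser] : [];
}
```

Keeping the value as a single array element means a profile name containing a space, such as chrome:Profile 1, needs no extra quoting when the arguments are passed to a process spawner rather than a shell.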

Input Formats

Accepts any of these as video input:

  • Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
  • Short URL: https://youtu.be/dQw4w9WgXcQ
  • Embed URL: https://www.youtube.com/embed/dQw4w9WgXcQ
  • Shorts URL: https://www.youtube.com/shorts/dQw4w9WgXcQ
  • Video ID: dQw4w9WgXcQ
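The accepted input shapes can be normalized to a video ID with a few patterns. The sketch below is illustrative only: `extractVideoId` is a hypothetical name, and it assumes video IDs are 11 characters drawn from letters, digits, `_`, and `-`, which holds for the examples above but is not stated by the skill.

```typescript
// Assumption: an ID is exactly 11 characters from [A-Za-z0-9_-].
const ID = "[A-Za-z0-9_-]{11}";
const BARE_ID = new RegExp("^" + ID + "$");
const URL_PATTERNS: RegExp[] = [
  new RegExp("[?&]v=(" + ID + ")(?![A-Za-z0-9_-])"),                            // watch?v=ID
  new RegExp("youtu\\.be/(" + ID + ")(?![A-Za-z0-9_-])"),                       // short URL
  new RegExp("youtube\\.com/(?:embed|shorts)/(" + ID + ")(?![A-Za-z0-9_-])"),   // embed and shorts
];

// Returns the video ID for any of the listed input formats, or null.
function extractVideoId(input: string): string | null {
  const trimmed = input.trim();
  if (BARE_ID.test(trimmed)) return trimmed;
  for (const pattern of URL_PATTERNS) {
    const match = pattern.exec(trimmed);
    const id = match ? match[1] : undefined;
    if (id) return id;
  }
  return null;
}
```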

Output Formats

| Format | Extension | Description |
| --- | --- | --- |
| text | .md | Markdown with frontmatter (incl. description), title heading, summary, optional TOC/cover/timestamps/chapters/speakers |
| srt | .srt | SubRip subtitle format for video players |

Output Directory

```text
youtube-transcript/
├── .index.json                          # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                        # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json              # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json        # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg                    # Video thumbnail
    ├── transcript.md                    # Markdown transcript (generated from sentences)
    └── transcript.srt                   # SRT subtitle (generated from raw snippets, if --format srt)
```

  • {channel-slug}: Channel name in kebab-case
  • {title-full-slug}: Full video title in kebab-case

The --list mode outputs to stdout only (no file saved).

Caching

On first fetch, the script saves:

  • meta.json: video metadata, chapters, cover image path, language info
  • transcript-raw.json: raw transcript snippets from YouTube API ({ text, start, duration }[])
  • transcript-sentences.json: sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split by sentence-ending punctuation (.?!…。？！ etc.), timestamps proportionally allocated by character length, CJK-aware text merging
  • imgs/cover.jpg: video thumbnail
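To make "timestamps proportionally allocated by character length" concrete, here is a minimal sketch that splits one snippet into sentences and shares its duration by character count. It is an assumption-laden simplification: `splitSnippet` is a hypothetical name, the punctuation set (including full-width CJK marks) is assumed, times stay in seconds rather than "HH:mm:ss" strings, and the real script also merges text across snippets, which is omitted here.

```typescript
interface Snippet { text: string; start: number; duration: number } // seconds
interface Sentence { text: string; start: number; end: number }     // seconds

// Assumed set of sentence-ending punctuation, including common CJK marks.
const SENTENCE = /[^.?!…。？！]+[.?!…。？！]*/g;

// Splits a snippet at sentence-ending punctuation and gives each sentence
// a share of the snippet's duration proportional to its character count.
function splitSnippet(snippet: Snippet): Sentence[] {
  const parts = (snippet.text.match(SENTENCE) ?? [snippet.text])
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  const totalChars = parts.reduce((sum, part) => sum + part.length, 0);
  if (totalChars === 0) return [];

  let cursor = snippet.start;
  return parts.map((text) => {
    const span = (snippet.duration * text.length) / totalChars;
    const sentence: Sentence = { text, start: cursor, end: cursor + span };
    cursor += span;
    return sentence;
  });
}
```

For a 10-second snippet "Hi. Hello there.", the 3-character sentence gets 2 seconds and the 12-character sentence gets the remaining 8.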

Subsequent runs for the same video use cached data (no network calls). Use --refresh to force re-fetch. If a different language is requested, the cache is automatically refreshed.
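The cache rules in this paragraph reduce to a three-way check. The sketch below is hypothetical: `needsFetch` and the `CachedVideo` shape are invented for illustration, and the real `.index.json` maps a video ID to a directory path only, with language information kept in `meta.json`.

```typescript
// Hypothetical in-memory view of the cache: video ID -> cached entry.
interface CachedVideo { dir: string; language: string }
type Cache = Record<string, CachedVideo | undefined>;

interface FetchRequest { videoId: string; language: string; refresh: boolean }

// True when the network must be used: --refresh was given, the video has
// never been cached, or the cached transcript is in a different language.
function needsFetch(cache: Cache, request: FetchRequest): boolean {
  if (request.refresh) return true;
  const hit = cache[request.videoId];
  if (!hit) return true;
  return hit.language !== request.language;
}
```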

When YouTube returns anti-bot / blocked responses on the direct InnerTube path, the script retries with alternate client identities and then falls back to yt-dlp if available. If fallback is needed but yt-dlp is unavailable, the agent should decide how to make yt-dlp available and continue rather than pushing the installation decision to the user.
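The retry-then-fallback order can be pictured as trying a list of strategies until one succeeds. This is a generic sketch, not the script's implementation: `firstSuccessful` and the attempt names used with it are hypothetical, and real fetchers would be asynchronous rather than synchronous as shown.

```typescript
interface Attempt<T> { name: string; run: () => T }
interface Outcome<T> { value: T; via: string; failures: string[] }

// Runs attempts in order and returns the first result that does not throw,
// together with the name of the winning attempt and earlier failures.
function firstSuccessful<T>(attempts: Attempt<T>[]): Outcome<T> {
  const failures: string[] = [];
  for (const attempt of attempts) {
    try {
      return { value: attempt.run(), via: attempt.name, failures };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(attempt.name + ": " + reason);
    }
  }
  throw new Error("all attempts failed: " + failures.join("; "));
}
```

In this picture, the direct InnerTube request, each alternate client identity, and the yt-dlp fallback would each be one entry in the list, in that order.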

SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.
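Turning raw snippets into SRT cues is mostly a matter of time formatting. The sketch below assumes each cue ends at `start + duration`; the actual script may clamp or adjust cue ends differently, and `formatSrtTime` and `toSrt` are hypothetical names.

```typescript
interface Snippet { text: string; start: number; duration: number } // seconds

// Left-pads a non-negative integer with zeros to the given width.
function pad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// Formats seconds as the SubRip timestamp HH:MM:SS,mmm.
function formatSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const h = Math.floor(totalMs / 3600000);
  const m = Math.floor((totalMs % 3600000) / 60000);
  const s = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "," + pad(ms, 3);
}

// Emits one numbered cue per snippet, separated by blank lines.
function toSrt(snippets: Snippet[]): string {
  return snippets
    .map((sn, i) =>
      String(i + 1) + "\n" +
      formatSrtTime(sn.start) + " --> " + formatSrtTime(sn.start + sn.duration) + "\n" +
      sn.text + "\n")
    .join("\n");
}
```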

Workflow

When the user provides a YouTube URL and wants the transcript:

  1. Run with --list first if the user hasn't specified a language, to show available options
  2. Always single-quote the URL when running the script — zsh treats ? as a glob wildcard, so an unquoted YouTube URL causes "no matches found": use 'https://www.youtube.com/watch?v=ID'
  3. Default: run with --chapters --speakers for the richest output (chapters + speaker identification)
  4. The script auto-saves cached data + output file and prints the file path
  5. For --speakers mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels
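Steps 2 and 3 can be combined into a small command builder that always single-quotes the URL. This is a sketch under assumptions: `buildCommand`, `shellQuote`, and `RunOptions` are hypothetical names, and only flags from the options table are emitted.

```typescript
interface RunOptions {
  list?: boolean;
  chapters?: boolean;
  speakers?: boolean;
  languages?: string[];
  format?: "text" | "srt";
  refresh?: boolean;
}

// POSIX single-quoting: wrap in '...' and escape embedded single quotes,
// so characters like ? and & are never interpreted by zsh or bash.
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Builds the shell command line; bunX is the already-resolved runtime
// prefix (for example "bun" or "npx -y bun") and is left unquoted.
function buildCommand(bunX: string, baseDir: string, url: string, opts: RunOptions = {}): string {
  const parts = [bunX, shellQuote(baseDir + "/scripts/main.ts"), shellQuote(url)];
  if (opts.list) parts.push("--list");
  if (opts.languages && opts.languages.length > 0) parts.push("--languages", opts.languages.join(","));
  if (opts.chapters) parts.push("--chapters");
  if (opts.speakers) parts.push("--speakers");
  if (opts.format === "srt") parts.push("--format", "srt");
  if (opts.refresh) parts.push("--refresh");
  return parts.join(" ");
}
```

When the command is spawned with an argument array instead of a shell string, the quoting step is unnecessary because no glob expansion takes place.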

When the user only wants a cover image or metadata, run the script with any option: every run also caches meta.json and imgs/cover.jpg.

When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.

Chapter & Speaker Workflow

Chapters (--chapters)

The script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves as .md with a Table of Contents. No further processing needed.

If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
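Parsing description lines such as `0:00 Introduction` and mapping a transcript time to its chapter can be sketched as follows. The line pattern is an assumption (a leading `M:SS` or `H:MM:SS` followed by a title), so lines like "10:30 AM meeting" would be misread; the script's real parser may be stricter. `parseChapters` and `chapterIndexAt` are hypothetical names.

```typescript
interface Chapter { start: number; title: string } // start in seconds

// Assumed line shape: optional hours, minutes, seconds, then the title.
const CHAPTER_LINE = /^\s*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.+?)\s*$/;

// Collects every description line that looks like a chapter marker.
function parseChapters(description: string): Chapter[] {
  const chapters: Chapter[] = [];
  for (const line of description.split(/\r?\n/)) {
    const m = CHAPTER_LINE.exec(line);
    if (!m) continue;
    const hours = parseInt(m[1] || "0", 10);
    const minutes = parseInt(m[2] || "0", 10);
    const seconds = parseInt(m[3] || "0", 10);
    chapters.push({ start: hours * 3600 + minutes * 60 + seconds, title: m[4] || "" });
  }
  return chapters;
}

// Index of the chapter containing `time`, or -1 if it precedes the first
// chapter. Assumes chapters are sorted by ascending start time.
function chapterIndexAt(chapters: Chapter[], time: number): number {
  let index = -1;
  chapters.forEach((chapter, i) => {
    if (chapter.start <= time) index = i;
  });
  return index;
}
```

An empty result from `parseChapters` matches the case above where the transcript is emitted without chapter headings.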

Speaker Identification (--speakers)

Speaker identification requires AI processing. The script outputs a raw .md file containing:

  • YAML frontmatter with video metadata (title, channel, date, cover, description, language)
  • Video description (for speaker name extraction)
  • Chapter list from description (if available)
  • Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)

After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:

  1. Read the saved .md file
  2. Read the prompt template at {baseDir}/prompts/speaker-transcript.md
  3. Process the raw transcript following the prompt:
    • Identify speakers using video metadata (title → guest, channel → host, description → names)
    • Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
    • Segment into chapters (use description chapters if available, else create from topic shifts)
    • Format with **Speaker Name:** labels, paragraph grouping (2-4 sentences), and [HH:MM:SS → HH:MM:SS] timestamps
  4. Overwrite the .md file with the processed transcript (keep the YAML frontmatter)

When --speakers is used, --chapters is implied — the processed output always includes chapter segmentation.
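Step 4 of the list above (overwrite the file but keep the YAML frontmatter) is easy to get wrong, so a minimal sketch may help. `replaceBody` is a hypothetical helper, and it assumes the frontmatter is a leading block delimited by `---` lines, as is conventional for Markdown frontmatter.

```typescript
// Matches a leading YAML frontmatter block: '---', content, closing '---'.
const FRONTMATTER = /^---\r?\n[\s\S]*?\r?\n---[ \t]*(?=\r?\n|$)/;

// Returns the original frontmatter followed by the processed transcript.
// If the original has no frontmatter, only the processed body is returned.
function replaceBody(original: string, processed: string): string {
  const body = processed.replace(/^\s+/, "");
  const match = FRONTMATTER.exec(original);
  if (!match) return body;
  return match[0] + "\n\n" + body;
}
```

Reading the saved file, passing its contents and the sub-agent's output through a function like this, and writing the result back preserves the metadata that the script generated.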

Error Cases

| Error | Meaning |
| --- | --- |
| Transcripts disabled | Video has no captions at all |
| No transcript found | Requested language not available |
| Video unavailable | Video deleted, private, or region-locked |
| IP blocked | Too many requests, try again later |
| Age restricted | Video requires login for age verification |
| bot detected | The script retries with alternate clients, then falls back to yt-dlp. If yt-dlp is missing, the agent should make it available itself. If the request still fails, set YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER=safari (or your browser) |

FAQ