NanoSkill
submit your skill

Video Editing Agent SKill

bybrowser-use9KGitHub starsGitHub

Edit videos using AI with Claude Code, automating tasks like cutting filler words, color grading, and burning subtitles. Streamline your video production workflow and get a polished final MP4.

videoSecurity scan passed
Result preview

Full Demo

See an interview video polished into a social media clip with improved pacing, subtitles, enhanced audio.

Get started

Run Your First Task

  1. a simple demonstration of the first step in using Video Editing Agent Skill
    01

    Install

    Add the skill to your agent.

  2. a simple demonstration of the second step in using Video Editing Agent Skill
    02

    Upload Your Video

    Upload an interview, tutorial, or product video and describe the desired edits.

  3. the demo of Video Editing Agent Skill
    03

    Review Output

    Receive the edited version and verify subtitles, transitions, and pacing.

Install command

$ npx skills add https://github.com/browser-use/video-use

About

video-use introduces an innovative approach to video editing, leveraging AI agents like Claude Code to automate tedious and time-consuming tasks. This open-source tool allows users to simply drop raw footage into a folder, interact with an AI agent, and receive a polished `final.mp4` output. It's designed to streamline the video editing process for various content types, from talking heads to montages, by intelligently handling cuts, color grading, and subtitle burning.

The system operates by reading video content through detailed audio transcripts and on-demand visual composites, rather than processing every frame. This method provides the AI with word-boundary precision for editing while significantly reducing computational overhead. Key capabilities include automatically removing filler words and dead space, applying intelligent color grades, ensuring seamless audio transitions, and burning customizable subtitles directly into the video.

Beyond automation, video-use incorporates a robust self-evaluation pipeline. After initial processing, the AI reviews the rendered output at each cut boundary to detect and correct imperfections like visual jumps or audio pops. This ensures that the final video meets high production standards before it's presented to the user. The tool also maintains session memory, allowing users to pick up editing sessions exactly where they left off, enhancing workflow efficiency and consistency.

Key features

What makes it powerful

  • Automated Filler Word Removal

    Automatically cuts out filler words such as 'umm,' 'uh,' false starts, and dead space between takes for cleaner audio.

  • Intelligent Color Grading

    Auto color grades every video segment, offering options like warm cinematic, neutral punch, or custom FFmpeg chains for consistent visual quality.

  • Seamless Audio Fades

    Applies 30ms audio fades at every cut to eliminate pops and ensure smooth transitions between segments.

  • Customizable Subtitle Burning

    Burns subtitles directly into your video in a customizable style, with 2-word UPPERCASE chunks by default, enhancing accessibility and engagement.

  • AI-Powered Self-Evaluation

    The system self-evaluates the rendered output at every cut boundary, catching visual jumps, audio pops, and hidden subtitles before showing you the preview.

Use cases

When to reach for it

  • Producing Professional Talking Head Videos

    Quickly refine talking head footage by removing pauses and filler words, ensuring a concise and engaging presentation.

  • Creating Dynamic Video Montages

    Effortlessly assemble montages with automated cuts, color grading, and animation overlays for a polished, professional look.

  • Streamlining Tutorial Video Production

    Accelerate the creation of tutorial videos by automating repetitive editing tasks, allowing creators to focus on content.

SKILL.md

video-use

Introducing video-use — edit videos with Claude Code. 100% open source.

Drop raw footage in a folder, chat with Claude Code, get final.mp4 back. Works for any content — talking heads, montages, tutorials, travel, interviews — without presets or menus.

What it does

  • Cuts out filler words (umm, uh, false starts) and dead space between takes
  • Auto color grades every segment (warm cinematic, neutral punch, or any custom ffmpeg chain)
  • 30ms audio fades at every cut so you never hear a pop
  • Burns subtitles in your style — 2-word UPPERCASE chunks by default, fully customizable
  • Generates animation overlays via HyperFrames, Remotion, Manim, or PIL — spawned in parallel sub-agents, one per animation
  • Self-evaluates the rendered output at every cut boundary before showing you anything
  • Persists session memory in project.md so next week's session picks up where you left off

Setup prompt

Paste into Claude Code, Codex, Hermes, Openclaw, or any agent with shell access:

Set up https://github.com/browser-use/video-use for me.

Read install.md first to install this repo, wire up ffmpeg, register the skill with whichever agent you're running under, and set up the ElevenLabs API key — ask me to paste it when you need it. Then read SKILL.md for daily usage, and always read helpers/ because that's where the editing scripts live. After install, don't transcribe anything on your own — just tell me it's ready and wait for me to drop footage into a folder.

The agent handles the clone, dependencies, skill registration, and prompts you once for your ElevenLabs API key (grab one at elevenlabs.io/app/settings/api-keys).

Then point your agent at a folder of raw takes:

cd /path/to/your/videos
claude    # or codex, hermes, etc.

For always-on editing from your own VPS or Telegram, run the agent through Browser Use Box. Watch the 15-second demo.

And in the session:

edit these into a launch video

It inventories the sources, proposes a strategy, waits for your OK, then produces edit/final.mp4 next to your sources. All outputs live in <videos_dir>/edit/ — the skill directory stays clean.

Manual install

If you'd rather do it by hand:

# 1. Clone and symlink into your agent's skills directory
git clone https://github.com/browser-use/video-use ~/Developer/video-use
ln -sfn ~/Developer/video-use ~/.claude/skills/video-use        # Claude Code
# ln -sfn ~/Developer/video-use ~/.codex/skills/video-use       # Codex

# 2. Install deps
cd ~/Developer/video-use
uv sync                         # or: pip install -e .
brew install ffmpeg             # required
brew install yt-dlp             # optional, for downloading online sources

# 3. Add your ElevenLabs API key
cp .env.example .env
$EDITOR .env                    # ELEVENLABS_API_KEY=...

How it works

The LLM never watches the video. It reads it — through two layers that together give it everything it needs to cut with word-boundary precision.

<p align="center"> <img src="https://file.nanoskill.ai/timeline-view.svg" alt="timeline_view composite — filmstrip + speaker track + waveform + word labels + silence-gap cut candidates" width="100%"> </p>

Layer 1 — Audio transcript (always loaded). One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events ((laughter), (applause), (sigh)). All takes pack into a single ~12KB takes_packed.md — the LLM's primary reading view.

## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.

Layer 2 — Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. Called only at decision points — ambiguous pauses, retake comparisons, cut-point sanity checks.

Naive approach: 30,000 frames × 1,500 tokens = 45M tokens of noise. Video Use: 12KB text + a handful of PNGs.

Same idea as browser-use giving an LLM a structured DOM instead of a screenshot — but for video.

Pipeline

Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval
                                                              │
                                                              └─ issue? fix + re-render (max 3)

The self-eval loop runs timeline_view on the rendered output at every cut boundary — catches visual jumps, audio pops, hidden subtitles. You see the preview only after it passes.

Design principles

  1. Text + on-demand visuals. No frame-dumping. The transcript is the surface.
  2. Audio is primary, visuals follow. Cuts come from speech boundaries and silence gaps.
  3. Ask → confirm → execute → self-eval → persist. Never touch the cut without strategy approval.
  4. Zero assumptions about content type. Look, ask, then edit.
  5. 12 hard rules, artistic freedom elsewhere. Production-correctness is non-negotiable. Taste isn't.

See SKILL.md for the full production rules and editing craft.

FAQ