video-use is an open-source tool that allows you to edit videos using AI agents like Claude Code. It automates common editing tasks such as cutting filler words, color grading, and adding subtitles.

How does video-use work without the LLM watching the video?

The LLM reads the video through two layers: an audio transcript with word-level timestamps and speaker diarization, and on-demand visual composites (PNGs combining a filmstrip, a waveform, and word labels) for decision points. This keeps token usage low while still allowing cuts at word-boundary precision.

What types of videos can I edit with video-use?

video-use is designed to work with any content type, including talking heads, montages, tutorials, travel vlogs, and interviews. It adapts to your footage without requiring presets or menus.

What are the core features of video-use?

Key features include cutting out filler words and dead space, auto color grading, 30ms audio fades, burning customizable subtitles, generating animation overlays, self-evaluating rendered output, and persisting session memory.

Do I need an ElevenLabs API key to use video-use?

Yes, an ElevenLabs API key is required for audio transcription and speaker diarization, which are crucial for the tool's functionality. You will be prompted to provide it during the setup process.

Can I manually install video-use instead of using the setup prompt?

Yes, you can perform a manual installation by cloning the repository, symlinking it into your agent's skills directory, installing dependencies like FFmpeg, and configuring your ElevenLabs API key in the .env file.

Video Editing Agent Skill

Result preview

Full Demo

See an interview video polished into a social media clip with improved pacing, subtitles, and enhanced audio.

Get started

Run Your First Task

01 Install
Add the skill to your agent.
02 Upload Your Video
Upload an interview, tutorial, or product video and describe the desired edits.
03 Review Output
Receive the edited version and verify subtitles, transitions, and pacing.

About

video-use is an open-source tool that hands the tedious, time-consuming parts of video editing to AI agents such as Claude Code. Drop raw footage into a folder, chat with the agent, and receive a polished `final.mp4`. It works across content types, from talking heads to montages, and handles cuts, color grading, and subtitle burning.

The agent reads video through an audio transcript with word-level timestamps and on-demand visual composites, rather than processing every frame. This gives it word-boundary precision for cuts while keeping token usage low. Key capabilities include removing filler words and dead space, color grading every segment, applying 30ms audio fades at every cut, and burning customizable subtitles directly into the video.

video-use also evaluates its own output. After rendering, the agent inspects every cut boundary for visual jumps, audio pops, and hidden subtitles, then fixes and re-renders (up to three times) before showing you the preview. Session memory is persisted in project.md, so a later session picks up where the previous one left off.

Key features

What makes it powerful

Automated Filler Word Removal
Automatically cuts out filler words such as 'umm,' 'uh,' false starts, and dead space between takes for cleaner audio.
Intelligent Color Grading
Auto color grades every video segment, offering options like warm cinematic, neutral punch, or custom FFmpeg chains for consistent visual quality.
Seamless Audio Fades
Applies 30ms audio fades at every cut to eliminate pops and ensure smooth transitions between segments.
Customizable Subtitle Burning
Burns subtitles directly into your video in a customizable style, with 2-word UPPERCASE chunks by default, enhancing accessibility and engagement.
AI-Powered Self-Evaluation
The system self-evaluates the rendered output at every cut boundary, catching visual jumps, audio pops, and hidden subtitles before showing you the preview.
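
To illustrate the default subtitle style, here is a minimal sketch that groups word-level timestamps into 2-word UPPERCASE cues. The function name `chunk_subtitles` and the per-word timestamps are hypothetical; this is not taken from video-use's own helpers, only one way the described behavior could be produced.

```python
def chunk_subtitles(words, chunk_size=2):
    """Group (text, start, end) word tuples into UPPERCASE subtitle cues.

    Returns a list of (cue_start, cue_end, cue_text) tuples.
    """
    cues = []
    for i in range(0, len(words), chunk_size):
        group = words[i:i + chunk_size]
        text = " ".join(w[0] for w in group).upper()
        # A cue spans from the first word's start to the last word's end.
        cues.append((group[0][1], group[-1][2], text))
    return cues


# Made-up word timings for illustration only.
words = [("We", 6.08, 6.25), ("fixed", 6.25, 6.55), ("this.", 6.55, 6.74)]
for start, end, text in chunk_subtitles(words):
    print(f"{start:.2f}-{end:.2f} {text}")
```

Changing `chunk_size` or dropping the `.upper()` call is the kind of adjustment a custom subtitle style would need.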

Use cases

When to reach for it

Producing Professional Talking Head Videos
Quickly refine talking head footage by removing pauses and filler words, ensuring a concise and engaging presentation.
Creating Dynamic Video Montages
Effortlessly assemble montages with automated cuts, color grading, and animation overlays for a polished, professional look.
Streamlining Tutorial Video Production
Accelerate the creation of tutorial videos by automating repetitive editing tasks, allowing creators to focus on content.

SKILL.md

video-use

Introducing video-use — edit videos with Claude Code. 100% open source.

Drop raw footage in a folder, chat with Claude Code, get final.mp4 back. Works for any content — talking heads, montages, tutorials, travel, interviews — without presets or menus.

What it does

Cuts out filler words (umm, uh, false starts) and dead space between takes
Auto color grades every segment (warm cinematic, neutral punch, or any custom ffmpeg chain)
30ms audio fades at every cut so you never hear a pop
Burns subtitles in your style — 2-word UPPERCASE chunks by default, fully customizable
Generates animation overlays via HyperFrames, Remotion, Manim, or PIL — spawned in parallel sub-agents, one per animation
Self-evaluates the rendered output at every cut boundary before showing you anything
Persists session memory in project.md so next week's session picks up where you left off
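
The 30ms fades in the list above could be expressed with ffmpeg's `afade` audio filter. The sketch below only builds the filter string for one segment; the helper name `fade_filter` is hypothetical, and the source does not say that video-use implements its fades this way.

```python
def fade_filter(segment_duration, fade=0.030):
    """Return an ffmpeg audio filter string that fades a segment in and out.

    segment_duration and fade are in seconds; fade defaults to 30ms.
    """
    if segment_duration < 2 * fade:
        raise ValueError("segment too short for both fades")
    out_start = segment_duration - fade
    return (
        f"afade=t=in:st=0:d={fade:.3f},"
        f"afade=t=out:st={out_start:.3f}:d={fade:.3f}"
    )


# A 2.84s segment: fade in over the first 30ms, fade out over the last 30ms.
print(fade_filter(2.84))
```

The resulting string would be passed to ffmpeg's `-af` option when rendering that segment, so each cut starts and ends at silence and no pop is heard at the join.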

Setup prompt

Paste into Claude Code, Codex, Hermes, Openclaw, or any agent with shell access:

Set up https://github.com/browser-use/video-use for me.

Read install.md first to install this repo, wire up ffmpeg, register the skill with whichever agent you're running under, and set up the ElevenLabs API key — ask me to paste it when you need it. Then read SKILL.md for daily usage, and always read helpers/ because that's where the editing scripts live. After install, don't transcribe anything on your own — just tell me it's ready and wait for me to drop footage into a folder.

The agent handles the clone, dependencies, skill registration, and prompts you once for your ElevenLabs API key (grab one at elevenlabs.io/app/settings/api-keys).

Then point your agent at a folder of raw takes:

cd /path/to/your/videos
claude    # or codex, hermes, etc.

For always-on editing from your own VPS or Telegram, run the agent through Browser Use Box. Watch the 15-second demo.

And in the session:

edit these into a launch video

It inventories the sources, proposes a strategy, waits for your OK, then produces edit/final.mp4 next to your sources. All outputs live in <videos_dir>/edit/ — the skill directory stays clean.

Manual install

If you'd rather do it by hand:

# 1. Clone and symlink into your agent's skills directory
git clone https://github.com/browser-use/video-use ~/Developer/video-use
ln -sfn ~/Developer/video-use ~/.claude/skills/video-use        # Claude Code
# ln -sfn ~/Developer/video-use ~/.codex/skills/video-use       # Codex

# 2. Install deps
cd ~/Developer/video-use
uv sync                         # or: pip install -e .
brew install ffmpeg             # required
brew install yt-dlp             # optional, for downloading online sources

# 3. Add your ElevenLabs API key
cp .env.example .env
$EDITOR .env                    # ELEVENLABS_API_KEY=...

How it works

The LLM never watches the video. It reads it — through two layers that together give it everything it needs to cut with word-boundary precision.

Layer 1 — Audio transcript (always loaded). One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events ((laughter), (applause), (sigh)). All takes pack into a single ~12KB takes_packed.md — the LLM's primary reading view.

## C0103  (duration: 43.0s, 8 phrases)
  [002.52-005.36] S0 Ninety percent of what a web agent does is completely wasted.
  [006.08-006.74] S0 We fixed this.
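
A rough sketch of how word-level timestamps might be packed into phrase lines like the ones above. The names `group_words` and `pack_take`, the 0.5s gap threshold, and the per-word timings are assumptions for illustration; the actual packing logic lives in the repository's helpers and may differ.

```python
def group_words(words, max_gap=0.5):
    """Merge (text, start, end, speaker) words into phrases.

    A new phrase starts when the speaker changes or the silence
    before a word exceeds max_gap seconds (an assumed threshold).
    """
    phrases = []
    for text, start, end, speaker in words:
        last = phrases[-1] if phrases else None
        if last and last["speaker"] == speaker and start - last["end"] <= max_gap:
            last["text"] += " " + text
            last["end"] = end
        else:
            phrases.append(
                {"start": start, "end": end, "speaker": speaker, "text": text}
            )
    return phrases


def pack_take(name, duration, words):
    """Render one take in the packed-transcript layout shown above."""
    phrases = group_words(words)
    lines = [f"## {name}  (duration: {duration:.1f}s, {len(phrases)} phrases)"]
    for p in phrases:
        lines.append(
            f"  [{p['start']:06.2f}-{p['end']:06.2f}] S{p['speaker']} {p['text']}"
        )
    return "\n".join(lines)


# Made-up word timings; only the phrase boundaries echo the example above.
words = [
    ("Ninety", 2.52, 2.90, 0), ("percent", 2.90, 3.30, 0),
    ("We", 6.08, 6.25, 0), ("fixed", 6.25, 6.55, 0), ("this.", 6.55, 6.74, 0),
]
print(pack_take("C0103", 43.0, words))
```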

Layer 2 — Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. Called only at decision points — ambiguous pauses, retake comparisons, cut-point sanity checks.

Naive approach: 30,000 frames × 1,500 tokens = 45M tokens of noise. video-use: 12KB text + a handful of PNGs.
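
The comparison can be checked with a few lines of arithmetic. The frame and token counts come from the text above; the figure of roughly 4 characters per token is a common rule of thumb assumed here, not a number from video-use, and the PNG cost is left out because it depends on how many composites are requested.

```python
# Figures quoted in the text above.
frames = 30_000
tokens_per_frame = 1_500
naive_tokens = frames * tokens_per_frame

# Assumption: ~4 characters per token for a 12KB (about 12,000 character) transcript.
transcript_bytes = 12_000
chars_per_token = 4
transcript_tokens = transcript_bytes // chars_per_token

print(f"naive frame dump: {naive_tokens:,} tokens")
print(f"packed transcript: ~{transcript_tokens:,} tokens")
```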

Same idea as browser-use giving an LLM a structured DOM instead of a screenshot — but for video.

Pipeline

Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval
                                                              │
                                                              └─ issue? fix + re-render (max 3)

The self-eval loop runs timeline_view on the rendered output at every cut boundary — catches visual jumps, audio pops, hidden subtitles. You see the preview only after it passes.
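
The render and self-eval stages can be sketched as a bounded retry loop. Everything here is hypothetical scaffolding: `render`, `find_issues`, and `fix` stand in for the real rendering and inspection steps, and "max 3" is read as at most three re-renders after the initial render, which is an interpretation of the diagram rather than something the source states.

```python
def render_with_self_eval(render, find_issues, fix, edl, max_rerenders=3):
    """Render an EDL, then fix and re-render until no issues remain.

    Returns (output, passed). Stops after max_rerenders re-renders.
    """
    output = render(edl)
    for _ in range(max_rerenders):
        issues = find_issues(output)
        if not issues:
            return output, True
        edl = fix(edl, issues)
        output = render(edl)
    # Re-render budget spent: report whether the last render is clean.
    return output, not find_issues(output)


# Stub stages for illustration: each fix removes one audio pop.
renders = []

def render(edl):
    renders.append(dict(edl))
    return dict(edl)

def find_issues(output):
    return ["audio pop"] * output["pops"]

def fix(edl, issues):
    return {"pops": edl["pops"] - 1}

output, passed = render_with_self_eval(render, find_issues, fix, {"pops": 2})
print(passed, len(renders))
```

Capping the loop keeps a stubborn defect from triggering endless re-renders; what the real tool does when the cap is reached is not described in the source.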

Design principles

Text + on-demand visuals. No frame-dumping. The transcript is the surface.
Audio is primary, visuals follow. Cuts come from speech boundaries and silence gaps.
Ask → confirm → execute → self-eval → persist. Never touch the cut without strategy approval.
Zero assumptions about content type. Look, ask, then edit.
12 hard rules, artistic freedom elsewhere. Production-correctness is non-negotiable. Taste isn't.
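
The "audio is primary" principle can be sketched as follows: given word-level timestamps, drop filler words and split wherever the silence between kept words is long, leaving the time ranges to keep. The filler list, the 0.4s gap threshold, the function name `keep_ranges`, and the sample timings are all assumptions for illustration; in video-use the LLM makes these decisions from the transcript rather than a fixed rule.

```python
FILLERS = {"um", "umm", "uh"}  # assumed filler list, for illustration only


def keep_ranges(words, max_gap=0.4):
    """Turn (text, start, end) words into merged (start, end) ranges to keep.

    Filler words are skipped, and a new range starts whenever the gap
    since the last kept word exceeds max_gap seconds.
    """
    ranges = []
    for text, start, end in words:
        if text.lower().strip(".,!?") in FILLERS:
            continue
        if ranges and start - ranges[-1][1] <= max_gap:
            ranges[-1][1] = end  # extend the current range
        else:
            ranges.append([start, end])
    return [tuple(r) for r in ranges]


# Made-up timings: a filler word, then a long pause before the last word.
words = [
    ("So", 0.00, 0.20), ("umm", 0.20, 0.90), ("we", 0.95, 1.10),
    ("fixed", 1.10, 1.40), ("this.", 3.00, 3.40),
]
print(keep_ranges(words))
```

Because every range boundary is a word boundary, cuts derived this way never land mid-word.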

See SKILL.md for the full production rules and editing craft.
