
AI Auto-Reframe for Vertical Video — How It Works and Why It Matters

A technical breakdown of how AI converts 16:9 video to 9:16 — face detection, subject tracking, crop window calculation, and when the automation fails.

7 min read

TL;DR

AI auto-reframe is the process by which an AI system automatically converts horizontal (16:9) video to vertical (9:16) format by tracking the primary subject — typically a human face or speaker — and dynamically adjusting the crop window to keep them centered throughout the clip. It works through a four-step pipeline: subject detection, subject tracking, crop window calculation, and smoothing. For context on how auto-reframe fits into the broader clip generation workflow, see what is an AI clip generator.

Section 1

The Problem: 16:9 Video on a 9:16 Screen

The fundamental mismatch between horizontal video production and vertical mobile consumption is a structural problem for every video creator. Cameras, monitors, and editing software default to 16:9 (landscape) because that's the traditional television and desktop format. TikTok, Instagram Reels, and YouTube Shorts are consumed on phones held vertically — a 9:16 (portrait) format.

When a 16:9 video is displayed in a 9:16 player, the two common solutions are both bad. Letterboxing shrinks the 16:9 frame to fit inside the 9:16 container and fills the top and bottom with black bars — the subject appears small in the center of the screen, surrounded by dead space. Center crop zooms in to fill the 9:16 frame by cutting the left and right edges — if the subject isn't centered, they're cropped out of frame.
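To make the mismatch concrete, here is a quick sketch of how little of the frame each fallback preserves. The 1920×1080 source and 1080×1920 player dimensions are illustrative assumptions, not values from any specific platform:

```python
# Geometry of a 16:9 video in a 9:16 player, using hypothetical
# standard dimensions: a 1920x1080 source in a 1080x1920 player.

SRC_W, SRC_H = 1920, 1080      # 16:9 source
OUT_W, OUT_H = 1080, 1920      # 9:16 player

# Letterbox: scale the source to fit the player width; bars fill the rest.
letterbox_h = OUT_W * SRC_H // SRC_W          # 1080 * 1080 / 1920 = 607 px
visible_screen = letterbox_h / OUT_H          # fraction of screen showing video

# Center crop: keep full source height, cut a 9:16 slice from the middle.
crop_w = SRC_H * 9 // 16                      # 607 px wide
retained_width = crop_w / SRC_W               # fraction of source width kept

print(f"letterbox fills {visible_screen:.0%} of the screen")   # ~32%
print(f"center crop keeps {retained_width:.0%} of the width")  # ~32%
```

Either way, roughly two-thirds of something is lost: letterboxing wastes two-thirds of the screen, and center cropping discards two-thirds of the frame width.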

Auto-reframe solves this by making the crop intelligent. Instead of a fixed crop position, it tracks where the subject is at every moment in the clip and keeps them centered as they move. The result is a vertical clip where the speaker always appears correctly framed — without letterboxing, without accidentally cropped faces, and without the static zoom of a center crop.

Section 2

How AI Auto-Reframe Works: The Four-Step Pipeline

AI auto-reframe operates through a four-stage pipeline that runs on every frame of the clip. Understanding each stage helps you predict when the output will be clean and when it will need a manual correction pass.

Step 1 — Subject detection

The first step identifies what the primary subject is and where it appears in the initial frame. For most creator content, this is face detection — the model locates human faces in the frame. Google's MediaPipe Face Mesh (a neural landmark model) and OpenCV's classical Haar cascades are two widely used approaches; MediaPipe provides 468 facial landmark points per face, enabling precise bounding box calculation even for non-frontal face angles.

For full-body content (dance, fitness, presentation) where the face may not always be visible, pose estimation models take over — detecting the body skeleton and centering the crop around the torso or center of mass. Product demo content may use object detection models trained on specific product categories to track the primary object in frame rather than a human subject.
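After detection, the pipeline still has to choose one box to frame around. A minimal sketch of that selection step, assuming a generic detector output of `(x, y, w, h)` boxes and using the common largest-box heuristic (tools with audio-video alignment would prefer the active speaker instead):

```python
# Hypothetical post-detection step: given face bounding boxes from any
# detector (MediaPipe, Haar cascade, ...), pick the primary subject.
# Boxes are (x, y, w, h) in pixels; largest area wins in this sketch.

def pick_primary_subject(boxes):
    """Return the largest detected box, or None if nothing was detected."""
    if not boxes:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])  # area = w * h

faces = [(100, 80, 60, 60), (400, 120, 140, 140)]  # two detections
print(pick_primary_subject(faces))  # the larger face: (400, 120, 140, 140)
```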

Step 2 — Tracking the primary subject across frames

Once the subject is detected in the first frame, the tracking algorithm follows it through subsequent frames. Object tracking algorithms — such as SORT (Simple Online and Realtime Tracking) or CSRT (Discriminative Correlation Filter with Channel and Spatial Reliability) — maintain a bounding box around the subject as it moves, predicting its position in each new frame based on its velocity and appearance in previous frames.

Multi-subject scenes introduce the dominant challenge: when two or more people are visible in the frame, the tracker must decide which subject to follow. Most tools use a primary speaker detection heuristic — the person who is actively speaking (detected via audio-video alignment) is treated as the primary subject. When both people are speaking simultaneously, behavior varies: some tools cut between subjects using quick pans; others lock to one subject until the other becomes clearly dominant.
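The association step at the core of SORT-style tracking can be sketched in a few lines: match the previous frame's box to the new detection with the highest intersection over union (IoU). This is a simplification — real SORT also runs a Kalman filter per track to predict motion, which is omitted here:

```python
# Minimal sketch of frame-to-frame association, the core of SORT-style
# tracking: continue a track with the detection that best overlaps it.
# Boxes are (x, y, w, h); the 0.3 IoU threshold is an illustrative value.

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))   # overlap width
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def match_track(prev_box, detections, min_iou=0.3):
    """Continue the track with the best-overlapping detection, if any."""
    best = max(detections, key=lambda d: iou(prev_box, d), default=None)
    return best if best and iou(prev_box, best) >= min_iou else None

prev = (100, 100, 80, 80)
dets = [(110, 105, 80, 80), (500, 60, 90, 90)]
print(match_track(prev, dets))  # → (110, 105, 80, 80)
```

When no detection clears the threshold — the subject left the frame or the detector missed — `match_track` returns `None`, which is exactly the moment a production tracker falls back to its motion prediction or re-runs detection.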

Step 3 — Dynamic crop window calculation

With the subject's bounding box established for each frame, the crop window is calculated. For a 16:9 source converted to 9:16 output at full height, the crop width is 9/16 of the source height — meaning roughly 32% of the original frame width is retained (a 607×1080 window from a 1920×1080 source). The window is slid horizontally so the subject's detected center (typically the midpoint of the face bounding box) is centered in the output frame.

The crop calculation also applies padding to avoid the subject being too tightly framed. A well-configured auto-reframe tool leaves 15–20% of headroom above the subject and frames them in the upper-center of the vertical crop rather than dead center — this feels more natural for talking-head content and leaves room for caption overlays at the bottom.
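The per-frame calculation above can be sketched as follows, assuming a full-height 9:16 slice from a 1080p source (tools that add extra headroom by zooming tighter than full height would also adjust the vertical position, which this sketch omits):

```python
# Sketch of the per-frame crop calculation: center a full-height 9:16
# window on the subject's x-position, clamped to the frame edges.

def crop_window(subject_cx, src_w=1920, src_h=1080):
    """Return (x, y, w, h) of a 9:16 crop centered on subject_cx."""
    crop_w = src_h * 9 // 16                 # 607 px for a 1080p source
    x = subject_cx - crop_w // 2             # center window on subject
    x = max(0, min(x, src_w - crop_w))       # clamp inside the frame
    return (x, 0, crop_w, src_h)

print(crop_window(960))   # subject mid-frame: (657, 0, 607, 1080)
print(crop_window(100))   # subject near left edge: clamps to (0, 0, 607, 1080)
```

The clamp is the detail that matters in practice: when the subject walks toward a frame edge, the window stops at the boundary instead of sliding past it, so the subject drifts off-center rather than the output showing black.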

Step 4 — Smoothing and stabilization

Raw crop window positions calculated frame by frame would produce a jittery output — the crop would shift slightly with every micro-movement of the subject. A smoothing pass applies a low-pass filter or a Kalman filter to the crop position over time, ensuring the window moves smoothly rather than jumping. The smoothing window is typically 10–30 frames, balancing responsiveness to large subject movements against stability for small head motion.

Sudden large movements — a speaker turning to face a second camera, leaning far forward, or moving off-frame entirely — can break the smoother if the motion is larger than the tracking algorithm's predicted range. These are the moments most likely to require manual correction in the review pass.
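The smoothing idea can be sketched with an exponential moving average, one of the simplest low-pass filters. The alpha value is an illustrative choice, not a setting from any particular tool, and production systems often use a Kalman filter or windowed average instead:

```python
# Sketch of the smoothing pass: an exponential moving average acting
# as a low-pass filter on raw per-frame crop x-positions. Lower alpha
# means more stability, higher alpha means faster response.

def smooth(positions, alpha=0.2):
    """Low-pass filter a sequence of crop x-positions."""
    out, prev = [], positions[0]
    for p in positions:
        prev = alpha * p + (1 - alpha) * prev  # blend new reading in slowly
        out.append(round(prev))
    return out

raw = [650, 660, 648, 655, 900]  # small jitter, then a sudden large move
print(smooth(raw))               # jitter is damped; the jump is followed gradually
```

This also illustrates the failure mode described above: the filter that absorbs micro-jitter is the same one that lags behind a sudden large move, which is why a speaker lunging across frame can leave the crop trailing for several frames.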

Auto-Reframe and Captions Included with Every Clip

Paste a YouTube URL into Transcriptr — clips are auto-reframed to 9:16 with word-level captions applied. Free to start.

Try Free
Section 3

When Auto-Reframe Works Well (and When It Fails)

Auto-reframe is highly reliable for a specific content profile and significantly less reliable outside it. Knowing the boundary helps you decide when to trust the automation and when to spend the 30 seconds on manual correction.

Auto-reframe works well for: single-speaker talking-head content with a static or simple background (podcast interviews, YouTube commentary, lecture recordings); content where the speaker remains roughly centered in the 16:9 frame; and clips under 90 seconds where tracking drift doesn't accumulate significantly. For this content profile, auto-reframe output typically requires no manual adjustment.

Auto-reframe struggles with: two or more people simultaneously in frame with equal visual weight; rapid, unpredictable subject movement (action sports, dance, physical comedy); wide establishing shots where the subject is small relative to the full frame; and content with dramatic background motion (green screen glitches, camera shake, fast-cutting B-roll). For these cases, the rule of thumb is: if the clip has two or more speakers in frame or heavy motion, review the auto-reframe output before exporting.

The comparison between auto-reframe and manual crop isn't a quality debate for most creator content — it's a time investment question. For the majority of talking-head clips from interviews and podcasts, auto-reframe output is good enough to post without adjustment. For flagship content — hero clips, brand campaigns, anything going to paid promotion — a manual crop pass is worth the extra time.

Section 4

Auto-Caption Placement After Reframe

Auto-reframe changes the spatial layout of the clip — and that has direct consequences for where captions can be placed. This is one of the most common production errors in AI clip workflows: captions are placed at the optimal position for the 16:9 source frame, and then when the clip is reframed to 9:16, the captions end up in the wrong position relative to the new frame boundaries or the platform's UI overlay area.

On TikTok, the bottom 15% of a 9:16 frame is typically obscured by the platform's UI (username, caption text, action buttons). Instagram Reels has a similar bottom overlay zone. YouTube Shorts uses the bottom 10–12% for its interface. Captions placed too low in the 9:16 frame will be partially or fully hidden when viewed in-feed.

The safe zone for caption placement in a 9:16 clip is between 60% and 85% from the top of the frame — below the speaker's face (which typically occupies the upper 50–60% of the frame after reframe) and above the platform's UI overlay zone. Good auto-caption tools handle this automatically; if your tool allows manual caption positioning, place captions in this zone explicitly. For a deeper guide on caption workflow, see adding captions across all clip types.
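A simple check for the placement rule above, assuming a 1080×1920 output frame (the 60–85% band follows the guideline in the text; exact UI overlay sizes vary by platform and app version):

```python
# Sketch of a caption safe-zone check for a 9:16 frame: the caption's
# top edge should land in the 60-85% band measured from the frame top.

def caption_in_safe_zone(caption_top_px, frame_h=1920, zone=(0.60, 0.85)):
    """True if the caption's top edge falls inside the safe band."""
    lo, hi = zone[0] * frame_h, zone[1] * frame_h
    return lo <= caption_top_px <= hi

print(caption_in_safe_zone(1300))  # ~68% down the frame: True
print(caption_in_safe_zone(1750))  # ~91% down: behind platform UI, False
```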

One practical check: always preview your exported clip in the actual platform's upload interface before publishing. What looks correct in your desktop editor may be partially obscured in the mobile viewer. This 30-second check has prevented more caption errors than any automated positioning logic. Also see viral moment detection for how reframe fits downstream in the full clip generation pipeline.

Key Terms

Aspect ratio

The proportional relationship between a video's width and height, expressed as width:height. 16:9 (landscape, e.g. 1920×1080) is the standard for YouTube long-form and traditional TV. 9:16 (portrait, e.g. 1080×1920) is the standard for TikTok, Reels, and Shorts. 1:1 (square, e.g. 1080×1080) is used for some LinkedIn and Twitter/X content.

Auto-reframe

An AI process that automatically converts a video from one aspect ratio to another by tracking the primary subject and dynamically positioning the crop window to keep them centered. Distinguished from a static crop (which cuts a fixed region) by its dynamic, frame-by-frame adjustment.

Subject tracking

The computer vision technique of following a specific object or person across consecutive video frames. In auto-reframe, subject tracking maintains a bounding box around the primary subject as they move, providing the positional data needed to calculate the crop window position at each frame.

Crop window

The rectangular region of the source video frame that is selected for output. In a 16:9-to-9:16 conversion, the crop window is a 9:16 rectangle positioned within the 16:9 source frame. Auto-reframe dynamically repositions this window to follow the subject; manual crop uses a fixed window position.

Safe zone

The region of a 9:16 frame that is guaranteed to be visible to users, unobscured by platform UI overlays. For TikTok, Reels, and Shorts, the safe zone is approximately the center 70–75% of the frame height — above the bottom UI overlay and below the top platform chrome. Captions and key visual elements should be placed within this zone.

Frequently Asked Questions

Does AI auto-reframe work for all video types?

Auto-reframe works best for single-speaker talking-head content with a static or simple background. It handles multi-speaker content reasonably well when speakers appear sequentially. It struggles with rapid motion, wide-angle B-roll shots, and scenes where the primary subject is unclear. For action content, manual crop review is recommended.

Which AI clip tools have the best auto-reframe?

Transcriptr, OpusClip, Klap, and Submagic all include auto-reframe for YouTube URL clips. For single-speaker interviews and podcast content, the difference is minimal. For content with multiple subjects or heavy motion, the quality gap between tools becomes more noticeable.

Can I manually adjust auto-reframe output?

Yes — most tools including Transcriptr allow you to manually adjust the crop window after auto-reframe runs. This is recommended any time the clip has two or more speakers in frame, involves rapid movement, or uses wide establishing shots. Manual correction takes 30–60 seconds per clip and significantly improves output quality.