How to Automatically Choose the Best Video Upscaling Model for Each Clip | SmartEnhancer
Learn how to build a content-aware system that automatically analyzes videos and selects the optimal upscaling model. No more manual model roulette - get stable, predictable results across anime and real footage with our SmartEnhancer approach.
Verging AI Team
Published on 2025-12-11
12 min read
Updated: 2025-12-11

How to Automatically Choose the Best Video Upscaling Model for Each Clip
Video enhancement and super-resolution have exploded in popularity.
There are dozens of models—Real-ESRGAN variants, anime-focused GANs, Swin Transformer–based models, web-photo fixers, “artistic” enhancers, and ultra-light models for low-end GPUs. Each performs well on a very specific type of content, and fails badly on others.
This creates a practical problem for anyone doing video upscaling at scale:
For any given video, which upscaling / enhancement model should you actually use, and how can you make that choice automatically?
In this article, I’ll walk through how I approached this problem and built SmartEnhancer: a content-aware, quality-aware system that analyzes a video, classifies its content and visual quality, and then automatically selects an appropriate video upscaling model and blend strength.
The goals are practical:
- No manual “model roulette” for each clip
- Stable, predictable results across anime and real footage
- A clear, controllable trade-off between speed and quality
If you work with Real-ESRGAN, Swin2SR, or anime-focused super-resolution models and often ask “which one should I use for this video?”, this system is designed to answer that question automatically.

1. The Video Upscaling Model Zoo: What We’re Choosing Between
Before building any “smart” automatic model selection logic, I spent time benchmarking and understanding the behavior of common video enhancement / super-resolution models.
1.1 Categories of video enhancement models
From real testing, these models naturally fall into a few buckets:
| Category | Representative Models | What They’re Good At |
|---|---|---|
| Real-ESRGAN Crew | real_esrgan_x2/x4/x8, _fp16 variants | General photos, de-noising, old family photos, real footage |
| GANs for Anime / Illustrations | clear_reality_x4, lsdir_x4, nomos8k_sc_x4 | 2D content, anime, CG, sharp lines and edges |
| Web / Blurry Photo Fixers | real_web_photo_x4 | Compressed screenshots, social media JPEGs |
| Artistic Flair / Overkill | real_hatgan_x4, remacri_x4, siax_x4 | Stylized, textured, sometimes hallucinatory results |
| Fast & Light (Low-End Gear) | realistic_rescaler_x4, span_kendata_x4 | Quick previews, low-end hardware |
| Next-Level Architectures | swin2_sr_x4, ultra_sharp_x4, ultra_sharp_2_x4 | More detail via Swin Transformer–style backbones |

1.2 What actually works in practice
A quick summary of key models from real-world use:
clear_reality_x4
- Great for anime / 2D art / CG
- Crisp lines, natural colors, stable edges
- Looks plastic on real photos → avoid on live-action

swin2_sr_x4
- Recovers tiny details on almost anything
- Great for high-quality real footage (architecture, nature)
- Heavy on GPU, slow on weaker cards

real_esrgan_x4 / real_esrgan_x4_fp16
- Reliable for real photos, surveillance, old videos
- FP16 is faster but slightly more artifact-prone
- Solid baseline, sometimes "flat" in micro-detail

Anime-focused models (nomos8k_sc_x4, lsdir_x4)
- Fantastic on anime, comics, cyberpunk art
- Completely wrong for real-world images

ultra_sharp_x4 / ultra_sharp_2_x4
- Rescue tools for blurry footage
- Easy to over-sharpen and generate halos if pushed too hard

real_web_photo_x4
- Very handy for compressed web screenshots
- Can amplify noise if the input is already very noisy

real_esrgan_x8 / real_esrgan_x8_fp16
- 8× upscaling sounds attractive on paper
- In practice: heavy artifacts, warped shapes, "mushy" areas
- I do not recommend it for serious work
From these experiments, it was clear that a one-size-fits-all video upscaling model is unrealistic. The right model depends strongly on:
- Whether the content is anime / 2D or real
- The baseline quality (sharpness, noise, contrast)
- Whether it’s a web screenshot, an old compressed photo, or a clean DSLR-like clip
- Hardware constraints and required throughput (fps)
So the idea for SmartEnhancer was simple:
Let the system measure these properties first, then pick the model that is most likely to work well.
2. Design Goals for SmartEnhancer
Let’s define a few concrete goals for this automatic video enhancement model selector:
- Content-aware
- Distinguish between anime / 2D and real-world video
- Recognize special cases like web screenshots and low-sharpness footage
- Quality-aware
- Estimate sharpness, noise, contrast, and approximate resolution category
- Decide if footage is high-quality or low-quality real content
- Performance-aware
- Use pre-measured FPS for each model as part of the decision
- Allow a simple toggle between:
- normal = speed-first
- high_quality = quality-first
- Explainable decisions
- For each video, log:
- Which content category it was classified into
- Which model was chosen
- Why (features, thresholds, special cases)

3. High-Level Architecture of the Automatic Selector
At a high level, SmartEnhancer works like this:
- Take a target video (or image) path
- Extract a representative frame (for videos, the middle frame)
- Run content analysis
- Anime vs real detection
- Visual quality assessment
- Resolution analysis
- Special-case detection (web screenshot, needs sharpening)
- Assign a content category
- anime
- real_high_quality
- real_low_quality
- web_screenshot
- want_sharper
- Select a model configuration
- Based on category + quality mode (normal / high_quality)
- Each config defines model + blend_strength + description
- Apply configuration
- Store chosen model and blend strength into a shared state
- The actual frame-by-frame enhancement pipeline then uses this configuration
The system exposes two main entry points:
- SmartEnhancer.classify_content_and_select_model(path, quality_mode)
- smart_enhance_video(path, quality_mode), a convenience wrapper that also sets shared state
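To make the hand-off concrete, here is a minimal usage sketch. The import path and the clip path are placeholders rather than the actual package layout, but the two entry points are the ones listed above.

```python
# Minimal usage sketch. The module path "smart_enhancer" and the clip path are
# placeholders; the two entry points are the ones described in this article.
from smart_enhancer import SmartEnhancer, smart_enhance_video

# Convenience wrapper: analyzes the clip and writes the chosen model and blend
# strength into shared state for the enhancement pipeline to pick up.
ok = smart_enhance_video("input/clip_001.mp4", quality_mode="high_quality")

# Or call the class directly if you only want the decision without touching state.
decision = SmartEnhancer().classify_content_and_select_model("input/clip_001.mp4", "normal")
```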

4. Content Classification: Anime vs Real-World Footage
The first major axis in the decision tree is whether the content is anime / 2D or real-world. This strongly influences which video upscaling models are even eligible.
SmartEnhancer supports two layers here:
- ONNX-based CLIP zero-shot classifier (preferred if available)
- Traditional heuristic-based detection (fallback)
4.1 ONNX CLIP zero-shot detection
When the ONNX CLIP model is correctly loaded, the system:
- Extracts or loads an RGB image from the input
- Uses CLIPProcessor to preprocess the image into pixel tensors
- Tokenizes two prompts: "an anime illustration" and "a real photo"
- Runs the text encoder and vision encoder via ONNX Runtime
- Normalizes embeddings and computes similarity scores
- Applies a learned or configured logit_scale
- Converts the logits into probabilities for "anime" vs "real"
The result looks like:
```
{
    "is_anime": True / False,
    "confidence": 0.0–1.0,
    "method": "clip_zero_shot_onnx",
    "anime_prob": ...,
    "real_prob": ...
}
```
If anything fails (files missing, ONNX issues, etc.), the system automatically falls back to the traditional method.
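For readers who want to see roughly what this looks like in code, here is a simplified sketch. The ONNX file names, the input/output tensor names, and the softmax details are assumptions about a typical CLIP export, not the exact implementation used by SmartEnhancer.

```python
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import CLIPProcessor

# Assumed file and tensor names for a typical CLIP ONNX export; adjust to your models.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_session = ort.InferenceSession("models/clip_vision.onnx")
text_session = ort.InferenceSession("models/clip_text.onnx")

def clip_anime_vs_real(image: Image.Image, logit_scale: float = 100.0) -> dict:
    prompts = ["an anime illustration", "a real photo"]
    inputs = processor(text=prompts, images=image, return_tensors="np", padding=True)

    # Run the vision and text encoders separately through ONNX Runtime.
    image_emb = vision_session.run(None, {"pixel_values": inputs["pixel_values"]})[0]
    text_emb = text_session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    })[0]

    # Normalize embeddings, compute cosine-similarity logits, and convert to probabilities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = logit_scale * image_emb @ text_emb.T          # shape (1, 2)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    anime_prob, real_prob = float(probs[0, 0]), float(probs[0, 1])
    return {
        "is_anime": anime_prob > real_prob,
        "confidence": max(anime_prob, real_prob),
        "method": "clip_zero_shot_onnx",
        "anime_prob": anime_prob,
        "real_prob": real_prob,
    }
```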

4.2 Traditional anime heuristics (fallback)
The fallback detector uses a set of handcrafted features from a representative frame:
- Color block ratio
- Measures how much the image is made of large, uniform color regions
- Uses k-means clustering in LAB color space and looks at the largest cluster
- Anime often has higher contiguous flat regions of color
- Color saturation
- Computed from HSV
- Anime tends to have medium to high saturation
- Edge sharpness
- Based on Laplacian variance
- Anime line art has strong, clean edges
These metrics are combined into an anime_score, with several softened thresholds to make the detector more sensitive. For example:
- Medium color-block ratio + medium saturation → bonus score
- Multiple low-to-medium anime-like signals can still push it over the threshold
The final decision:
- Compute anime_score
- Normalize it to a 0–1 confidence
- If anime_score >= 2.5, treat the content as anime
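As a rough illustration of how these signals can be combined, here is a condensed OpenCV sketch. The individual weights and feature normalizations are illustrative; only the >= 2.5 decision threshold is taken from the description above.

```python
import cv2
import numpy as np

def heuristic_anime_score(frame_bgr: np.ndarray) -> dict:
    """Illustrative fallback detector; weights are assumptions, the threshold follows the article."""
    # Color-block ratio: share of pixels in the largest k-means cluster (LAB space).
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).reshape(-1, 3).astype(np.float32)
    sample = lab[np.random.choice(len(lab), min(5000, len(lab)), replace=False)]
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, _ = cv2.kmeans(sample, 8, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    block_ratio = np.bincount(labels.ravel()).max() / len(labels)

    # Saturation from HSV and edge sharpness from Laplacian variance.
    saturation = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)[..., 1].mean() / 255.0
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    score = 0.0
    if block_ratio > 0.25:
        score += 1.5   # large, flat color regions
    if 0.3 <= saturation <= 0.8:
        score += 1.0   # medium-to-high saturation typical for anime
    if sharpness > 300:
        score += 1.0   # strong, clean line-art edges
    if block_ratio > 0.15 and saturation > 0.25:
        score += 0.5   # softened bonus for borderline signals

    return {"is_anime": score >= 2.5, "confidence": min(score / 4.0, 1.0), "anime_score": score}
```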
5. Visual Quality Assessment for Video Upscaling
The second dimension is quality: is this video already good, or does it need heavy restoration?
SmartEnhancer is designed to support a NIMA-like quality model, but currently uses a robust traditional estimator by default.
5.1 Traditional quality metrics
From the representative frame, it computes:
- Sharpness
- Laplacian variance, normalized
- High variance → sharp, low variance → blurry
- Noise level
- High-pass filtered grayscale image
- Standard deviation scaled to [0, 1]
- Contrast
- Standard deviation of grayscale intensities, normalized
- Resolution
- Width and height, plus min dimension
From these raw metrics, it builds a 0–10 quality score starting from a baseline and nudging up or down:
- High sharpness → +1 to +2
- Very low sharpness → −2
- Low noise → +1
- Heavy noise → −1.5
- Reasonable contrast → +1
- Very low or extreme contrast → −1
- Very low resolution → −1.5
- Very high resolution → +1
The score is clamped to [1.0, 10.0], then mapped to:
- high quality
- medium quality
- low quality
using configurable thresholds (default: <4 low, 4–6 medium, >6 high).
This gives the classifier a simple, interpretable signal: is the footage already decent (high), borderline (medium), or fundamentally poor (low)?
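A compact sketch of this estimator, assuming OpenCV, might look like the following. The baseline, normalization constants, and exact nudge sizes are illustrative rather than the production values; the quality-level thresholds follow the defaults above.

```python
import cv2
import numpy as np

def estimate_quality(frame_bgr: np.ndarray) -> tuple[float, str]:
    """Rough 0-10 quality score; the constants here are illustrative."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    height, width = gray.shape

    # Normalized sharpness (Laplacian variance), noise (high-pass std), and contrast.
    sharpness = min(cv2.Laplacian(gray, cv2.CV_64F).var() / 1000.0, 1.0)
    high_pass = gray.astype(np.float32) - cv2.GaussianBlur(gray, (5, 5), 0)
    noise = min(float(np.std(high_pass)) / 50.0, 1.0)
    contrast = min(float(gray.std()) / 80.0, 1.0)

    score = 5.0  # baseline
    if sharpness > 0.5:
        score += 1.5
    elif sharpness < 0.1:
        score -= 2.0
    if noise < 0.2:
        score += 1.0
    elif noise > 0.6:
        score -= 1.5
    if 0.3 < contrast < 0.9:
        score += 1.0
    else:
        score -= 1.0
    if min(height, width) < 480:
        score -= 1.5
    elif min(height, width) >= 1080:
        score += 1.0

    score = float(np.clip(score, 1.0, 10.0))
    level = "low" if score < 4 else "medium" if score <= 6 else "high"
    return score, level
```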

6. Resolution and Special-Case Detection
Beyond content and quality, some specific patterns need special handling for better video enhancement.
6.1 Resolution buckets
The system categorizes resolution into:
- very_low – min dimension < 480
- low – 480–720
- medium – 720–1080
- high – ≥ 1080
This is used for both logging and for fine-tuning when to apply heavier models or adjust expectations.
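Translated directly into code, the bucketing is only a few lines (the helper name is mine):

```python
def resolution_bucket(width: int, height: int) -> str:
    """Bucket resolution by the smaller dimension, mirroring the thresholds above."""
    min_dim = min(width, height)
    if min_dim < 480:
        return "very_low"
    if min_dim < 720:
        return "low"
    if min_dim < 1080:
        return "medium"
    return "high"
```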
6.2 Web screenshot / old photo detection
A dedicated module attempts to recognize web screenshots and old compressed photos by combining:
- JPEG artifact analysis
- Low contrast & low saturation
- Low edge sharpness
- Text / watermark signals
- Common aspect ratios
All these contribute to a web_screenshot_score. If it crosses a conservative threshold, the system flags:
"is_web_screenshot": True
6.3 “Needs sharpening” detection
Separately, the system checks whether the footage is:
- Reasonably high-quality overall
- But with insufficient edge sharpness or poor texture detail
Signals include:
- High quality level but low edge sharpness
- Very low texture-detail metric
- Good contrast but soft edges
- High resolution with unexpectedly fuzzy structure
These feed into a sharpening_score. If it’s high enough, the system sets:
"needs_sharpening": True
This is used to route content into a “sharpening-oriented” category that will prefer ultra_sharp_x4 in high-quality mode.
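The sketch below shows how both special-case flags (sections 6.2 and 6.3) might be assembled from metrics that were already computed earlier in the analysis. The metric names, weights, and thresholds are illustrative, not the production values.

```python
def detect_special_cases(metrics: dict) -> dict:
    """Combine previously computed metrics into the two special-case flags.
    Weights and thresholds here are illustrative."""
    m = metrics  # expects keys such as: jpeg_artifacts, contrast, saturation,
                 # edge_sharpness, texture_detail, quality_level, resolution

    web_score = 0.0
    if m["jpeg_artifacts"] > 0.5:
        web_score += 1.5
    if m["contrast"] < 0.3 and m["saturation"] < 0.3:
        web_score += 1.0
    if m["edge_sharpness"] < 0.2:
        web_score += 0.5
    if m.get("has_text_or_watermark"):
        web_score += 1.0

    sharpen_score = 0.0
    if m["quality_level"] == "high" and m["edge_sharpness"] < 0.3:
        sharpen_score += 1.5
    if m["texture_detail"] < 0.2:
        sharpen_score += 1.0
    if m["contrast"] > 0.4 and m["edge_sharpness"] < 0.3:
        sharpen_score += 0.5
    if m["resolution"] == "high" and m["edge_sharpness"] < 0.25:
        sharpen_score += 0.5

    return {
        "is_web_screenshot": web_score >= 2.5,   # conservative threshold
        "needs_sharpening": sharpen_score >= 2.0,
    }
```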
7. From Analysis to Action: Content Categories
With all the analysis in place, the final classifier is straightforward. Given:
- is_anime (from CLIP or heuristics)
- quality_level (high / medium / low)
- special_cases.is_web_screenshot
- special_cases.needs_sharpening
The logic is:
```python
if is_anime:
    category = "anime"
else:
    if is_web_screenshot:
        category = "web_screenshot"
    elif needs_sharpening:
        category = "want_sharper"
    elif quality_level == "high":
        category = "real_high_quality"
    else:
        category = "real_low_quality"
```
This gives the core category that will be used to select a specific video upscaling model.
8. Model Selection Strategy for Video Enhancement
The actual model decision is made by a mapping table in SmartEnhancer.
Each category has two modes:
- normal – speed-first, "good enough" quality
- high_quality – more aggressive enhancement, slower
8.1 Model mapping overview
Anime / Illustration
- anime.normal
  - Model: span_kendata_x4
  - Description: fast anime enhancement, moderate quality
  - Blend: 80
- anime.high_quality
  - Model: lsdir_x4
  - Description: higher-quality anime enhancement
  - Blend: 85

Real – High Quality
- real_high_quality.normal
  - Model: clear_reality_x4
  - Description: light, clean enhancement, very fast
  - Blend: 70
- real_high_quality.high_quality
  - Model: real_esrgan_x4_fp16
  - Description: stronger detail recovery with good speed
  - Blend: 80

Real – Low / Medium Quality
- real_low_quality.normal
  - Model: clear_reality_x4
  - Description: fast enhancement with some restoration
  - Blend: 75
- real_low_quality.high_quality
  - Model: real_esrgan_x4_fp16
  - Description: more aggressive repair of low-quality footage
  - Blend: 90

Web Screenshots / Old Photos
- web_screenshot.normal
  - Model: clear_reality_x4
  - Description: quick cleanup for screenshots and scans
  - Blend: 75
- web_screenshot.high_quality
  - Model: real_web_photo_x4
  - Description: specialized repair for web-compressed images
  - Blend: 85

"Need Sharpening" Cases
- want_sharper.normal
  - Model: clear_reality_x4
  - Description: moderate sharpening without being too harsh
  - Blend: 70
- want_sharper.high_quality
  - Model: ultra_sharp_x4
  - Description: strong sharpening for very soft footage
  - Blend: 80
Each model also has a small performance profile attached:
```python
{
    "real_esrgan_x4_fp16": {"fps": 3.5, "speed_level": "medium"},
    "clear_reality_x4": {"fps": 11.0, "speed_level": "very_fast"},
    "span_kendata_x4": {"fps": 8.0, "speed_level": "fast"},
    "lsdir_x4": {"fps": 1.0, "speed_level": "slow"},
    "real_web_photo_x4": {"fps": 1.0, "speed_level": "slow"},
    "ultra_sharp_x4": {"fps": 0.8, "speed_level": "slow"},
    # ...
}
```
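Put together, the mapping plus the selection step can be as simple as a dictionary lookup. The sketch below shows two categories and a small slice of the performance profile; the select_model helper name is illustrative, not the class's actual method.

```python
# Condensed version of the mapping above (two categories shown for brevity).
MODEL_MAPPING = {
    "anime": {
        "normal": {"model": "span_kendata_x4", "blend": 80},
        "high_quality": {"model": "lsdir_x4", "blend": 85},
    },
    "real_high_quality": {
        "normal": {"model": "clear_reality_x4", "blend": 70},
        "high_quality": {"model": "real_esrgan_x4_fp16", "blend": 80},
    },
    # ... real_low_quality, web_screenshot, want_sharper follow the same pattern
}

# Subset of the performance profile shown above.
MODEL_PERFORMANCE = {
    "span_kendata_x4": {"fps": 8.0, "speed_level": "fast"},
    "clear_reality_x4": {"fps": 11.0, "speed_level": "very_fast"},
}

def select_model(category: str, quality_mode: str = "normal") -> dict:
    """Look up the configuration for a category and attach its speed profile."""
    config = dict(MODEL_MAPPING[category][quality_mode])
    config["performance"] = MODEL_PERFORMANCE.get(config["model"], {})
    return config
```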

9. Bringing It Together: smart_enhance_video
On top of the SmartEnhancer class, there’s a convenience function:
```python
def smart_enhance_video(target_path: str, quality_mode: str = "normal") -> bool:
    # 1. Log startup and chosen quality mode
    # 2. Create a SmartEnhancer instance
    # 3. Run classify_content_and_select_model(target_path, quality_mode)
    # 4. Read selected_model and blend_strength
    # 5. Push them into shared state:
    #    - frame_enhancer_model
    #    - frame_enhancer_blend
    # 6. Log the final configuration and decision factors
    # 7. Return True on success, False on failure
    ...  # implementation lives in the host application
```
This function does not perform the actual frame-by-frame processing. Instead, it configures the video enhancement pipeline:
- Chooses the correct model for the clip
- Sets a blending strength that balances enhancement with naturalness
- Leaves the actual enhancement implementation to the host system (for example, a video processing app or CLI)
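As a sketch of what that hand-off can look like, the snippet below pushes the decision into a plain dictionary standing in for shared state; in a real host application this would be whatever settings mechanism the pipeline already uses. The result keys follow the selected_model and blend_strength names from the skeleton above, but the surrounding plumbing is hypothetical.

```python
# Stand-in for the host application's shared state (settings object, config module, etc.).
shared_state = {}

def apply_smart_config(target_path: str, quality_mode: str = "normal") -> bool:
    enhancer = SmartEnhancer()
    result = enhancer.classify_content_and_select_model(target_path, quality_mode)
    if not result:
        return False
    # The frame-by-frame enhancement pipeline reads these two keys later.
    shared_state["frame_enhancer_model"] = result["selected_model"]
    shared_state["frame_enhancer_blend"] = result["blend_strength"]
    return True
```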
10. Why Automatic Model Selection Scales Better Than Manual Tuning
Manually picking a model per video is fine for occasional creative work.
But for serious use—batch processing, user-facing tools, or pipelines that run on varied content—you quickly run into problems:
- You cannot maintain a giant if–else tree in your head for every new clip
- Users do not want to learn model names or understand architectures
- The “right” model is usually a function of content type, quality, and performance budget
SmartEnhancer solves this by:
- Turning empirical knowledge of models into a formal mapping
- Using measurable features (sharpness, noise, saturation, block artifacts) instead of guesses
- Making decisions that are transparent in logs and easy to adjust
If you later discover that, for example, swin2_sr_x4 is viable on your A100 cluster for real_high_quality.high_quality, you can simply update the mapping table and keep the rest of the logic intact.
11. Possible Extensions
There are several directions to push this automatic video upscaling model selector further:
- Train a dedicated quality model
- Replace or augment the traditional estimator with a trained NIMA-like model
- Tailor it to your specific dataset (for example, anime-heavy vs real-heavy content)
- Add more content categories
- For example, “noisy night-time footage”, “handheld shaky phone video”, “screen-share UI clips”
- Use multi-frame analysis
- Sample several frames instead of just the middle frame
- Aggregate signals for more stable classification
- Implement reinforcement from user feedback
- Log when users override the chosen model
- Gradually refine thresholds and mappings based on corrections
Even in its current form, SmartEnhancer already removes a huge amount of friction and guesswork from video enhancement workflows.
Conclusion
The core idea behind SmartEnhancer is simple:
Let the video itself tell you which upscaling model it needs.
By combining:
- CLIP-based content classification for anime vs real-world detection
- Multi-metric quality assessment using sharpness, noise, and contrast analysis
- Special-case detection for web screenshots and blurry footage
- Performance-aware model mapping that balances quality with processing speed
the system can automatically choose a video enhancement / super-resolution model that makes sense for a given clip, and do so in a way that’s explainable, configurable, and production-friendly.
If you’re experimenting with video super-resolution models or building your own video processing toolchain, structuring your logic this way—measure first, then choose—is far more scalable than hand-picking a model for each video.
This approach scales from individual creative projects to production pipelines processing thousands of videos daily. The decision logic is transparent, configurable, and continuously improvable as new models become available.
For developers building video processing tools or content creators working with diverse footage types, implementing content-aware model selection transforms video enhancement from an art into a reliable, automated process.
The future of video enhancement isn't just better models—it's smarter model selection.
Ready to enhance your videos with intelligent model selection? Try our video enhancement tool and experience automatic model optimization in action.