Verging
Technology · AI · video-enhancement · automation · machine-learning · real-esrgan · swin2sr · smart-enhancer

How to Automatically Choose the Best Video Upscaling Model for Each Clip | SmartEnhancer

Learn how to build a content-aware system that automatically analyzes videos and selects the optimal upscaling model. No more manual model roulette - get stable, predictable results across anime and real footage with our SmartEnhancer approach.


Verging AI Team

Published on 2025-12-11

12 min read

Updated: 2025-12-11


How to Automatically Choose the Best Video Upscaling Model for Each Clip

Video enhancement and super-resolution have exploded in popularity.

There are dozens of models—Real-ESRGAN variants, anime-focused GANs, Swin Transformer–based models, web-photo fixers, “artistic” enhancers, and ultra-light models for low-end GPUs. Each performs well on a very specific type of content, and fails badly on others.

This creates a practical problem for anyone doing video upscaling at scale:

For any given video, which upscaling / enhancement model should you actually use, and how can you make that choice automatically?

In this article, I’ll walk through how I approached this problem and built SmartEnhancer: a content-aware, quality-aware system that analyzes a video, classifies its content and visual quality, and then automatically selects an appropriate video upscaling model and blend strength.

The goals are practical:

  • No manual “model roulette” for each clip
  • Stable, predictable results across anime and real footage
  • A clear, controllable trade-off between speed and quality

If you work with Real-ESRGAN, Swin2SR, or anime-focused super-resolution models and often ask “which one should I use for this video?”, this system is designed to answer that question automatically.

Video Enhancement Model Zoo Overview - Different AI models for various content types


1. The Video Upscaling Model Zoo: What We’re Choosing Between

Before building any “smart” automatic model selection logic, I spent time benchmarking and understanding the behavior of common video enhancement / super-resolution models.

1.1 Categories of video enhancement models

From real testing, these models naturally fall into a few buckets:

  • Real-ESRGAN Crew (real_esrgan_x2/x4/x8 and their _fp16 variants): general photos, de-noising, old family photos, real footage
  • GANs for Anime / Illustrations (clear_reality_x4, lsdir_x4, nomos8k_sc_x4): 2D content, anime, CG, sharp lines and edges
  • Web / Blurry Photo Fixers (real_web_photo_x4): compressed screenshots, social media JPGs
  • Artistic Flair / Overkill (real_hatgan_x4, remacri_x4, siax_x4): stylized, textured, sometimes hallucinatory results
  • Fast & Light for Low-End Gear (realistic_rescaler_x4, span_kendata_x4): quick previews, low-end hardware
  • Next-Level Architectures (swin2_sr_x4, ultra_sharp_x4, ultra_sharp_2_x4): more detail via Swin Transformer–style backbones

Comparison of different video enhancement model results on anime vs real content

1.2 What actually works in practice

A quick summary of key models from real-world use:

  • clear_reality_x4
    • Great for anime / 2D art / CG
    • Crisp lines, natural colors, stable edges
    • Looks plastic on real photos → avoid on live-action
  • swin2_sr_x4
    • Recovers tiny details on almost anything
    • Great for high-quality real footage (architecture, nature)
    • Heavy on GPU, slow on weaker cards
  • real_esrgan_x4 / real_esrgan_x4_fp16
    • Reliable for real photos, surveillance, old videos
    • FP16 is faster but slightly more artifact-prone
    • Solid baseline, sometimes “flat” in micro-detail
  • Anime-focused models (nomos8k_sc_x4, lsdir_x4)
    • Fantastic on anime, comics, cyberpunk art
    • Completely wrong for real-world images
  • ultra_sharp_x4 / ultra_sharp_2_x4
    • Rescue tools for blurry footage
    • Easy to over-sharpen and generate halos if pushed too hard
  • real_web_photo_x4
    • Very handy for compressed web screenshots
    • Can amplify noise if the input is already very noisy
  • real_esrgan_x8 / _x8_fp16
    • 8× upscaling sounds attractive on paper
    • In practice: heavy artifacts, warped shapes, “mushy” areas
    • I do not recommend it for serious work

From these experiments, it was clear that a one-size-fits-all video upscaling model is unrealistic. The right model depends strongly on:

  • Whether the content is anime / 2D or real
  • The baseline quality (sharpness, noise, contrast)
  • Whether it’s a web screenshot, an old compressed photo, or a clean DSLR-like clip
  • Hardware constraints and required throughput (fps)

So the idea for SmartEnhancer was simple:

Let the system measure these properties first, then pick the model that is most likely to work well.


2. Design Goals for SmartEnhancer

Let’s define a few concrete goals for this automatic video enhancement model selector:

  1. Content-aware
    • Distinguish between anime / 2D and real-world video
    • Recognize special cases like web screenshots and low-sharpness footage
  2. Quality-aware
    • Estimate sharpness, noise, contrast, and approximate resolution category
    • Decide if footage is high-quality or low-quality real content
  3. Performance-aware
    • Use pre-measured FPS for each model as part of the decision
    • Allow a simple toggle between:
      • normal = speed-first
      • high_quality = quality-first
  4. Explainable decisions
    • For each video, log:
      • Which content category it was classified into
      • Which model was chosen
      • Why (features, thresholds, special cases)

SmartEnhancer design goals: content-aware, quality-aware, and performance-aware video enhancement


3. High-Level Architecture of the Automatic Selector

At a high level, SmartEnhancer works like this:

  1. Take a target video (or image) path
  2. Extract a representative frame (for videos, the middle frame)
  3. Run content analysis
    • Anime vs real detection
    • Visual quality assessment
    • Resolution analysis
    • Special-case detection (web screenshot, needs sharpening)
  4. Assign a content category
    • anime
    • real_high_quality
    • real_low_quality
    • web_screenshot
    • want_sharper
  5. Select a model configuration
    • Based on category + quality mode (normal / high_quality)
    • Each config defines model + blend_strength + description
  6. Apply configuration
    • Store chosen model and blend strength into a shared state
    • The actual frame-by-frame enhancement pipeline then uses this configuration

The system exposes two main entry points:

  • SmartEnhancer.classify_content_and_select_model(path, quality_mode)
  • smart_enhance_video(path, quality_mode) — a convenience wrapper that also sets shared state.
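
Step 2 of this flow, pulling out the representative frame, is the only part that touches the video file directly. A minimal OpenCV sketch of that step might look like this (the helper name is mine, not SmartEnhancer's actual API):

import cv2

def middle_frame(video_path: str):
    """Grab the representative frame used for analysis (the middle frame)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(total // 2, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return frame  # BGR ndarray handed to the content and quality analyzers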

SmartEnhancer architecture: from video input to automatic model selection


4. Content Classification: Anime vs Real-World Footage

The first major axis in the decision tree is whether the content is anime / 2D or real-world. This strongly influences which video upscaling models are even eligible.

SmartEnhancer supports two layers here:

  1. ONNX-based CLIP zero-shot classifier (preferred if available)
  2. Traditional heuristic-based detection (fallback)

4.1 ONNX CLIP zero-shot detection

When the ONNX CLIP model is correctly loaded, the system:

  • Extracts or loads an RGB image from the input

  • Uses CLIPProcessor to preprocess the image into pixel tensors

  • Tokenizes two prompts:

    "an anime illustration"
    "a real photo"
    
  • Runs text encoder and vision encoder via ONNX Runtime

  • Normalizes embeddings and computes similarity scores

  • Applies a learned or configured logit_scale

  • Converts logits into probabilities for “anime” vs “real”

The result looks like:

{
    "is_anime": True / False,
    "confidence": 0.0–1.0,
    "method": "clip_zero_shot_onnx",
    "anime_prob": ...,
    "real_prob": ...
}
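
For reference, here is a minimal NumPy sketch of the last few steps (normalization, logit scaling, softmax) that produce this dict. The function name, array shapes, and default logit scale are my assumptions, not SmartEnhancer's actual code:

import numpy as np

def clip_zero_shot_probs(image_emb: np.ndarray, text_embs: np.ndarray,
                         logit_scale: float = 100.0) -> dict:
    """image_emb: (D,) vision embedding; text_embs: (2, D) text embeddings
    for ["an anime illustration", "a real photo"]."""
    # L2-normalize both sides so the dot product becomes a cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)

    logits = logit_scale * (text_embs @ image_emb)  # shape (2,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax over the two prompts

    anime_prob, real_prob = float(probs[0]), float(probs[1])
    return {"is_anime": anime_prob > real_prob,
            "confidence": max(anime_prob, real_prob),
            "method": "clip_zero_shot_onnx",
            "anime_prob": anime_prob,
            "real_prob": real_prob}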

If anything fails (files missing, ONNX issues, etc.), the system automatically falls back to the traditional method.

CLIP-based video content detection: anime vs real-world classification

4.2 Traditional anime heuristics (fallback)

The fallback detector uses a set of handcrafted features from a representative frame:

  • Color block ratio
    • Measures how much the image is made of large, uniform color regions
    • Uses k-means clustering in LAB color space and looks at the largest cluster
    • Anime typically has larger, contiguous regions of flat color
  • Color saturation
    • Computed from HSV
    • Anime tends to have medium to high saturation
  • Edge sharpness
    • Based on Laplacian variance
    • Anime line art has strong, clean edges

These metrics are combined into an anime_score, with several softened thresholds to make the detector more sensitive. For example:

  • Medium color-block ratio + medium saturation → bonus score
  • Multiple low-to-medium anime-like signals can still push it over the threshold

The final decision:

  • Compute anime_score
  • Normalize to a 0–1 confidence
  • If anime_score >= 2.5, treat as anime
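
To make the features concrete, here is a rough OpenCV version of the three signals. The cluster count, downsample size, and scaling are illustrative, not SmartEnhancer's exact values:

import cv2
import numpy as np

def anime_signals(frame_bgr: np.ndarray) -> dict:
    """Illustrative versions of the three handcrafted anime features."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Edge sharpness: variance of the Laplacian (clean line art scores high)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Color saturation: mean of the HSV S channel, scaled to [0, 1]
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    saturation = float(hsv[:, :, 1].mean()) / 255.0

    # Color block ratio: share of pixels in the largest k-means cluster (LAB space)
    small = cv2.resize(frame_bgr, (160, 90))  # downsample so k-means stays fast
    lab = cv2.cvtColor(small, cv2.COLOR_BGR2LAB).reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, _ = cv2.kmeans(lab, 8, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    counts = np.bincount(labels.flatten())
    color_block_ratio = float(counts.max()) / counts.sum()

    return {"sharpness": sharpness,
            "saturation": saturation,
            "color_block_ratio": color_block_ratio}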

5. Visual Quality Assessment for Video Upscaling

The second dimension is quality: is this video already good, or does it need heavy restoration?

SmartEnhancer is designed to support a NIMA-like quality model, but currently uses a robust traditional estimator by default.

5.1 Traditional quality metrics

From the representative frame, it computes:

  • Sharpness
    • Laplacian variance, normalized
    • High variance → sharp, low variance → blurry
  • Noise level
    • High-pass filtered grayscale image
    • Standard deviation scaled to [0, 1]
  • Contrast
    • Standard deviation of grayscale intensities, normalized
  • Resolution
    • Width and height, plus min dimension

From these raw metrics, it builds a 0–10 quality score starting from a baseline and nudging up or down:

  • High sharpness → +1 to +2
  • Very low sharpness → −2
  • Low noise → +1
  • Heavy noise → −1.5
  • Reasonable contrast → +1
  • Very low or extreme contrast → −1
  • Very low resolution → −1.5
  • Very high resolution → +1

The score is clamped to [1.0, 10.0], then mapped to:

  • high quality
  • medium quality
  • low quality

using configurable thresholds (default: <4 low, 4–6 medium, >6 high).

This gives the classifier a simple, interpretable signal: is the footage already decent (high), borderline (medium), or fundamentally poor (low)?
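
Written out, the estimator might look like the sketch below. The baseline, bonuses, and cut-offs follow the description above but are placeholders rather than SmartEnhancer's exact numbers:

import cv2
import numpy as np

def quality_score(frame_bgr: np.ndarray) -> float:
    """Rough 0-10 quality estimate from sharpness, noise, contrast, resolution."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    min_dim = min(gray.shape)

    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()                            # blur indicator
    noise = np.std(gray.astype(np.float32) - cv2.GaussianBlur(gray, (5, 5), 0))  # high-pass residual
    contrast = gray.std() / 255.0

    score = 5.0                                   # neutral baseline
    if sharpness > 500:   score += 2.0
    elif sharpness > 200: score += 1.0
    elif sharpness < 50:  score -= 2.0
    if noise < 3:         score += 1.0
    elif noise > 10:      score -= 1.5
    if 0.15 < contrast < 0.35: score += 1.0
    else:                      score -= 1.0       # very low or extreme contrast
    if min_dim < 480:     score -= 1.5
    elif min_dim >= 1080: score += 1.0

    return float(np.clip(score, 1.0, 10.0))

def quality_level(score: float) -> str:
    # Default thresholds from above: <4 low, 4-6 medium, >6 high
    return "low" if score < 4 else ("medium" if score <= 6 else "high")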

Video quality assessment metrics: sharpness, noise, contrast analysis


6. Resolution and Special-Case Detection

Beyond content and quality, some specific patterns need special handling for better video enhancement.

6.1 Resolution buckets

The system categorizes resolution into:

  • very_low – min dimension < 480
  • low – 480–720
  • medium – 720–1080
  • high – ≥1080

This is used for both logging and for fine-tuning when to apply heavier models or adjust expectations.
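
In code this is just a comparison on the smaller frame dimension; a tiny helper along these lines (the name is mine):

def resolution_bucket(width: int, height: int) -> str:
    """Map the smaller frame dimension to the buckets above."""
    min_dim = min(width, height)
    if min_dim < 480:
        return "very_low"
    if min_dim < 720:
        return "low"
    if min_dim < 1080:
        return "medium"
    return "high"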

6.2 Web screenshot / old photo detection

A dedicated module attempts to recognize web screenshots and old compressed photos by combining:

  • JPEG artifact analysis
  • Low contrast & low saturation
  • Low edge sharpness
  • Text / watermark signals
  • Common aspect ratios

All these contribute to a web_screenshot_score. If it crosses a conservative threshold, the system flags:

"is_web_screenshot": True

6.3 “Needs sharpening” detection

Separately, the system checks whether the footage is:

  • Reasonably high-quality overall
  • But with insufficient edge sharpness or poor texture detail

Signals include:

  • High quality level but low edge sharpness
  • Very low texture-detail metric
  • Good contrast but soft edges
  • High resolution with unexpectedly fuzzy structure

These feed into a sharpening_score. If it’s high enough, the system sets:

"needs_sharpening": True

This is used to route content into a “sharpening-oriented” category that will prefer ultra_sharp_x4 in high-quality mode.
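
The check itself can be sketched the same way; again, the signal names and thresholds here are illustrative:

def detect_needs_sharpening(quality: dict) -> dict:
    """Illustrative check: otherwise decent footage whose edges are unexpectedly soft."""
    soft_edges  = quality.get("sharpness", 1e9) < 150
    decent      = quality.get("level") in ("high", "medium")
    big_frame   = quality.get("min_dim", 0) >= 1080
    low_texture = quality.get("texture_detail", 1.0) < 0.1

    sharpening_score = (
        (2.0 if decent and soft_edges else 0.0) +
        (1.5 if low_texture else 0.0) +
        (1.0 if big_frame and soft_edges else 0.0)
    )
    return {"sharpening_score": sharpening_score,
            "needs_sharpening": sharpening_score >= 2.0}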


7. From Analysis to Action: Content Categories

With all the analysis in place, the final classifier is straightforward. Given:

  • is_anime (from CLIP or heuristics)
  • quality_level (high / medium / low)
  • special_cases.is_web_screenshot
  • special_cases.needs_sharpening

The logic is:

if is_anime:
    category = "anime"
else:
    if is_web_screenshot:
        category = "web_screenshot"
    elif needs_sharpening:
        category = "want_sharper"
    elif quality_level == "high":
        category = "real_high_quality"
    else:
        category = "real_low_quality"

This gives the core category that will be used to select a specific video upscaling model.


8. Model Selection Strategy for Video Enhancement

The actual model decision is made by a mapping table in SmartEnhancer.

Each category has two modes:

  • normal – speed-first, “good enough” quality
  • high_quality – more aggressive enhancement, slower

8.1 Model mapping overview

Anime / Illustration

  • anime.normal
    • Model: span_kendata_x4
    • Description: fast anime enhancement, moderate quality
    • Blend: 80
  • anime.high_quality
    • Model: lsdir_x4
    • Description: higher-quality anime enhancement
    • Blend: 85

Real – High Quality

  • real_high_quality.normal
    • Model: clear_reality_x4
    • Description: light, clean enhancement, very fast
    • Blend: 70
  • real_high_quality.high_quality
    • Model: real_esrgan_x4_fp16
    • Description: stronger detail recovery with good speed
    • Blend: 80

Real – Low / Medium Quality

  • real_low_quality.normal
    • Model: clear_reality_x4
    • Description: fast enhancement with some restoration
    • Blend: 75
  • real_low_quality.high_quality
    • Model: real_esrgan_x4_fp16
    • Description: more aggressive repair of low-quality footage
    • Blend: 90

Web Screenshots / Old Photos

  • web_screenshot.normal
    • Model: clear_reality_x4
    • Description: quick cleanup for screenshots and scans
    • Blend: 75
  • web_screenshot.high_quality
    • Model: real_web_photo_x4
    • Description: specialized repair for web-compressed images
    • Blend: 85

“Need Sharpening” Cases

  • want_sharper.normal
    • Model: clear_reality_x4
    • Description: moderate sharpening without being too harsh
    • Blend: 70
  • want_sharper.high_quality
    • Model: ultra_sharp_x4
    • Description: strong sharpening for very soft footage
    • Blend: 80

Each model also has a small performance profile attached:

{
    "real_esrgan_x4_fp16": {"fps": 3.5, "speed_level": "medium"},
    "clear_reality_x4":    {"fps": 11.0, "speed_level": "very_fast"},
    "span_kendata_x4":     {"fps": 8.0,  "speed_level": "fast"},
    "lsdir_x4":            {"fps": 1.0,  "speed_level": "slow"},
    "real_web_photo_x4":   {"fps": 1.0,  "speed_level": "slow"},
    "ultra_sharp_x4":      {"fps": 0.8,  "speed_level": "slow"},
    # ...
}
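
Condensed into code, the section 8.1 mapping could be a plain dictionary plus a lookup. The exact layout inside SmartEnhancer may differ; the tuple format and function name are mine:

MODEL_MAP = {
    # category: {quality_mode: (model, blend_strength)}
    "anime":             {"normal": ("span_kendata_x4", 80),
                          "high_quality": ("lsdir_x4", 85)},
    "real_high_quality": {"normal": ("clear_reality_x4", 70),
                          "high_quality": ("real_esrgan_x4_fp16", 80)},
    "real_low_quality":  {"normal": ("clear_reality_x4", 75),
                          "high_quality": ("real_esrgan_x4_fp16", 90)},
    "web_screenshot":    {"normal": ("clear_reality_x4", 75),
                          "high_quality": ("real_web_photo_x4", 85)},
    "want_sharper":      {"normal": ("clear_reality_x4", 70),
                          "high_quality": ("ultra_sharp_x4", 80)},
}

def select_model(category: str, quality_mode: str = "normal"):
    """Return (model_name, blend_strength) for a classified clip."""
    return MODEL_MAP[category][quality_mode]

Keeping the mapping declarative like this is what makes the later "just update the table" argument work.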

Model selection strategy matrix: mapping content types to optimal enhancement models


9. Bringing It Together: smart_enhance_video

On top of the SmartEnhancer class, there’s a convenience function:

def smart_enhance_video(target_path: str, quality_mode: str = "normal") -> bool:
    # 1. Log startup and chosen quality mode
    # 2. Create SmartEnhancer instance
    # 3. Run classify_content_and_select_model(target_path, quality_mode)
    # 4. Read selected_model and blend_strength
    # 5. Push them into shared state:
    #       - frame_enhancer_model
    #       - frame_enhancer_blend
    # 6. Log final configuration and decision factors
    # 7. Return True on success, False on failure
    ...

This function does not perform the actual frame-by-frame processing. Instead, it configures the video enhancement pipeline:

  • Chooses the correct model for the clip
  • Sets a blending strength that balances enhancement with naturalness
  • Leaves the actual enhancement implementation to the host system (for example, a video processing app or CLI)
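
In practice, the host pipeline calls the wrapper once per clip and then reads the shared state back before running the real enhancement pass. The module and helper names below are placeholders, not a real API:

import shared_state                               # placeholder for the host's state module
from smart_enhancer import smart_enhance_video    # illustrative import path

if smart_enhance_video("input/clip_042.mp4", quality_mode="high_quality"):
    model = shared_state.frame_enhancer_model     # e.g. "real_esrgan_x4_fp16"
    blend = shared_state.frame_enhancer_blend     # e.g. 80
    run_frame_by_frame_enhancement("input/clip_042.mp4", model=model, blend=blend)  # host's own pipeline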



10. Why Automatic Model Selection Scales Better Than Manual Tuning

Manually picking a model per video is fine for occasional creative work.

But for serious use—batch processing, user-facing tools, or pipelines that run on varied content—you quickly run into problems:

  • You cannot maintain a giant if–else tree in your head for every new clip
  • Users do not want to learn model names or understand architectures
  • The “right” model is usually a function of content type, quality, and performance budget

SmartEnhancer solves this by:

  • Turning empirical knowledge of models into a formal mapping
  • Using measurable features (sharpness, noise, saturation, block artifacts) instead of guesses
  • Making decisions that are transparent in logs and easy to adjust

If you later discover that, for example, swin2_sr_x4 is viable on your A100 cluster for real_high_quality.high_quality, you can simply update the mapping table and keep the rest of the logic intact.


11. Possible Extensions

There are several directions to push this automatic video upscaling model selector further:

  • Train a dedicated quality model
    • Replace or augment the traditional estimator with a trained NIMA-like model
    • Tailor it to your specific dataset (for example, anime-heavy vs real-heavy content)
  • Add more content categories
    • For example, “noisy night-time footage”, “handheld shaky phone video”, “screen-share UI clips”
  • Use multi-frame analysis
    • Sample several frames instead of just the middle frame
    • Aggregate signals for more stable classification
  • Implement reinforcement from user feedback
    • Log when users override the chosen model
    • Gradually refine thresholds and mappings based on corrections

Even in its current form, SmartEnhancer already removes a huge amount of friction and guesswork from video enhancement workflows.


Conclusion

The core idea behind SmartEnhancer is simple:

Let the video itself tell you which upscaling model it needs.

By combining:

  • CLIP-based content classification for anime vs real-world detection
  • Multi-metric quality assessment using sharpness, noise, and contrast analysis
  • Special-case detection for web screenshots and blurry footage
  • Performance-aware model mapping that balances quality with processing speed

the system can automatically choose a video enhancement / super-resolution model that makes sense for a given clip, and do so in a way that’s explainable, configurable, and production-friendly.

If you’re experimenting with video super-resolution models or building your own video processing toolchain, structuring your logic this way—measure first, then choose—is far more scalable than hand-picking a model for each video.

This approach scales from individual creative projects to production pipelines processing thousands of videos daily. The decision logic is transparent, configurable, and continuously improvable as new models become available.

For developers building video processing tools or content creators working with diverse footage types, implementing content-aware model selection transforms video enhancement from an art into a reliable, automated process.

The future of video enhancement isn't just better models—it's smarter model selection.


Ready to enhance your videos with intelligent model selection? Try our video enhancement tool and experience automatic model optimization in action.
