Verging
Technology · AI · video-enhancement · automation · machine-learning · real-esrgan · swin2sr · smart-enhancer

How to Automatically Choose the Best Video Upscaling Model for Each Clip | SmartEnhancer

Learn how to build a content-aware system that automatically analyzes videos and selects the optimal upscaling model. No more manual model roulette - get stable, predictable results across anime and real footage with our SmartEnhancer approach.


Verging AI Team

Published on 2025-12-11

12 min read

Updated: 2025-12-11


How to Automatically Choose the Best Video Upscaling Model for Each Clip

Video enhancement and super-resolution have exploded in popularity.

There are dozens of models—Real-ESRGAN variants, anime-focused GANs, Swin Transformer–based models, web-photo fixers, “artistic” enhancers, and ultra-light models for low-end GPUs. Each performs well on a very specific type of content, and fails badly on others.

This creates a practical problem for anyone doing video upscaling at scale:

For any given video, which upscaling / enhancement model should you actually use, and how can you make that choice automatically?

In this article, I’ll walk through how I approached this problem and built SmartEnhancer: a content-aware, quality-aware system that analyzes a video, classifies its content and visual quality, and then automatically selects an appropriate video upscaling model and blend strength.

The goals are practical:

  • No manual “model roulette” for each clip
  • Stable, predictable results across anime and real footage
  • A clear, controllable trade-off between speed and quality

If you work with Real-ESRGAN, Swin2SR, or anime-focused super-resolution models and often ask “which one should I use for this video?”, this system is designed to answer that question automatically.

Video Enhancement Model Zoo Overview - Different AI models for various content types


1. The Video Upscaling Model Zoo: What We’re Choosing Between

Before building any “smart” automatic model selection logic, I spent time benchmarking and understanding the behavior of common video enhancement / super-resolution models.

1.1 Categories of video enhancement models

From real testing, these models naturally fall into a few buckets:

  • Real-ESRGAN Crew (real_esrgan_x2/x4/x8 and their _fp16 variants): general photos, de-noising, old family photos, real footage
  • GANs for Anime / Illustrations (clear_reality_x4, lsdir_x4, nomos8k_sc_x4): 2D content, anime, CG, sharp lines and edges
  • Web / Blurry Photo Fixers (real_web_photo_x4): compressed screenshots, social media JPGs
  • Artistic Flair / Overkill (real_hatgan_x4, remacri_x4, siax_x4): stylized, textured, sometimes hallucinatory results
  • Fast & Light for Low-End Gear (realistic_rescaler_x4, span_kendata_x4): quick previews, low-end hardware
  • Next-Level Architectures (swin2_sr_x4, ultra_sharp_x4, ultra_sharp_2_x4): more detail via Swin Transformer–style backbones

Comparison of different video enhancement model results on anime vs real content

1.2 What actually works in practice

A quick summary of key models from real-world use:

  • clear_reality_x4
    • Great for anime / 2D art / CG
    • Crisp lines, natural colors, stable edges
    • Looks plastic on real photos → avoid on live-action
  • swin2_sr_x4
    • Recovers tiny details on almost anything
    • Great for high-quality real footage (architecture, nature)
    • Heavy on GPU, slow on weaker cards
  • real_esrgan_x4 / real_esrgan_x4_fp16
    • Reliable for real photos, surveillance, old videos
    • FP16 is faster but slightly more artifact-prone
    • Solid baseline, sometimes “flat” in micro-detail
  • Anime-focused models (nomos8k_sc_x4, lsdir_x4)
    • Fantastic on anime, comics, cyberpunk art
    • Completely wrong for real-world images
  • ultra_sharp_x4 / ultra_sharp_2_x4
    • Rescue tools for blurry footage
    • Easy to over-sharpen and generate halos if pushed too hard
  • real_web_photo_x4
    • Very handy for compressed web screenshots
    • Can amplify noise if the input is already very noisy
  • real_esrgan_x8 / _x8_fp16
    • 8× upscaling sounds attractive on paper
    • In practice: heavy artifacts, warped shapes, “mushy” areas
    • I do not recommend it for serious work

From these experiments, it was clear that a one-size-fits-all video upscaling model is unrealistic. The right model depends strongly on:

  • Whether the content is anime / 2D or real
  • The baseline quality (sharpness, noise, contrast)
  • Whether it’s a web screenshot, an old compressed photo, or a clean DSLR-like clip
  • Hardware constraints and required throughput (fps)

So the idea for SmartEnhancer was simple:

Let the system measure these properties first, then pick the model that is most likely to work well.


2. Design Goals for SmartEnhancer

Let’s define a few concrete goals for this automatic video enhancement model selector:

  1. Content-aware
    • Distinguish between anime / 2D and real-world video
    • Recognize special cases like web screenshots and low-sharpness footage
  2. Quality-aware
    • Estimate sharpness, noise, contrast, and approximate resolution category
    • Decide if footage is high-quality or low-quality real content
  3. Performance-aware
    • Use pre-measured FPS for each model as part of the decision
    • Allow a simple toggle between:
      • normal = speed-first
      • high_quality = quality-first
  4. Explainable decisions
    • For each video, log:
      • Which content category it was classified into
      • Which model was chosen
      • Why (features, thresholds, special cases)

SmartEnhancer design goals: content-aware, quality-aware, and performance-aware video enhancement


3. High-Level Architecture of the Automatic Selector

At a high level, SmartEnhancer works like this:

  1. Take a target video (or image) path
  2. Extract a representative frame (for videos, the middle frame)
  3. Run content analysis
    • Anime vs real detection
    • Visual quality assessment
    • Resolution analysis
    • Special-case detection (web screenshot, needs sharpening)
  4. Assign a content category
    • anime
    • real_high_quality
    • real_low_quality
    • web_screenshot
    • want_sharper
  5. Select a model configuration
    • Based on category + quality mode (normal / high_quality)
    • Each config defines model + blend_strength + description
  6. Apply configuration
    • Store chosen model and blend strength into a shared state
    • The actual frame-by-frame enhancement pipeline then uses this configuration

The system exposes two main entry points:

  • SmartEnhancer.classify_content_and_select_model(path, quality_mode)
  • smart_enhance_video(path, quality_mode) — a convenience wrapper that also sets shared state.
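
Step 2 of this flow, pulling out the representative frame, is the only part that touches the video file directly. A minimal OpenCV sketch of that step might look like this (the helper name is mine, not SmartEnhancer's actual API):

import cv2

def middle_frame(video_path: str):
    """Grab the representative frame used for analysis (the middle frame)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(total // 2, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return frame  # BGR ndarray handed to the content and quality analyzers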

SmartEnhancer architecture: from video input to automatic model selection


4. Content Classification: Anime vs Real-World Footage

The first major axis in the decision tree is whether the content is anime / 2D or real-world. This strongly influences which video upscaling models are even eligible.

SmartEnhancer supports two layers here:

  1. ONNX-based CLIP zero-shot classifier (preferred if available)
  2. Traditional heuristic-based detection (fallback)

4.1 ONNX CLIP zero-shot detection

When the ONNX CLIP model is correctly loaded, the system:

  • Extracts or loads an RGB image from the input

  • Uses CLIPProcessor to preprocess the image into pixel tensors

  • Tokenizes two prompts:

    "an anime illustration"
    "a real photo"
    
  • Runs text encoder and vision encoder via ONNX Runtime

  • Normalizes embeddings and computes similarity scores

  • Applies a learned or configured logit_scale

  • Converts logits into probabilities for “anime” vs “real”

The result looks like:

{
    "is_anime": True / False,
    "confidence": 0.0–1.0,
    "method": "clip_zero_shot_onnx",
    "anime_prob": ...,
    "real_prob": ...
}
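
For reference, here is a minimal NumPy sketch of the last few steps (normalization, logit scaling, softmax) that produce this dict. The function name, array shapes, and default logit scale are my assumptions, not SmartEnhancer's actual code:

import numpy as np

def clip_zero_shot_probs(image_emb: np.ndarray, text_embs: np.ndarray,
                         logit_scale: float = 100.0) -> dict:
    """image_emb: (D,) vision embedding; text_embs: (2, D) text embeddings
    for ["an anime illustration", "a real photo"]."""
    # L2-normalize both sides so the dot product becomes a cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)

    logits = logit_scale * (text_embs @ image_emb)  # shape (2,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax over the two prompts

    anime_prob, real_prob = float(probs[0]), float(probs[1])
    return {"is_anime": anime_prob > real_prob,
            "confidence": max(anime_prob, real_prob),
            "method": "clip_zero_shot_onnx",
            "anime_prob": anime_prob,
            "real_prob": real_prob}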

If anything fails (files missing, ONNX issues, etc.), the system automatically falls back to the traditional method.

CLIP-based video content detection: anime vs real-world classification

4.2 Traditional anime heuristics (fallback)

The fallback detector uses a set of handcrafted features from a representative frame:

  • Color block ratio
    • Measures how much the image is made of large, uniform color regions
    • Uses k-means clustering in LAB color space and looks at the largest cluster
    • Anime typically has larger, contiguous regions of flat color
  • Color saturation
    • Computed from HSV
    • Anime tends to have medium to high saturation
  • Edge sharpness
    • Based on Laplacian variance
    • Anime line art has strong, clean edges

These metrics are combined into an anime_score, with several softened thresholds to make the detector more sensitive. For example:

  • Medium color-block ratio + medium saturation → bonus score
  • Multiple low-to-medium anime-like signals can still push it over the threshold

The final decision:

  • Compute anime_score
  • Normalize to a 0–1 confidence
  • If anime_score >= 2.5, treat as anime
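
To make the features concrete, here is a rough OpenCV version of the three signals. The cluster count, downsample size, and scaling are illustrative, not SmartEnhancer's exact values:

import cv2
import numpy as np

def anime_signals(frame_bgr: np.ndarray) -> dict:
    """Illustrative versions of the three handcrafted anime features."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Edge sharpness: variance of the Laplacian (clean line art scores high)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Color saturation: mean of the HSV S channel, scaled to [0, 1]
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    saturation = float(hsv[:, :, 1].mean()) / 255.0

    # Color block ratio: share of pixels in the largest k-means cluster (LAB space)
    small = cv2.resize(frame_bgr, (160, 90))  # downsample so k-means stays fast
    lab = cv2.cvtColor(small, cv2.COLOR_BGR2LAB).reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, _ = cv2.kmeans(lab, 8, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    counts = np.bincount(labels.flatten())
    color_block_ratio = float(counts.max()) / counts.sum()

    return {"sharpness": sharpness,
            "saturation": saturation,
            "color_block_ratio": color_block_ratio}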

5. Visual Quality Assessment for Video Upscaling

The second dimension is quality: is this video already good, or does it need heavy restoration?

SmartEnhancer is designed to support a NIMA-like quality model, but currently uses a robust traditional estimator by default.

5.1 Traditional quality metrics

From the representative frame, it computes:

  • Sharpness
    • Laplacian variance, normalized
    • High variance → sharp, low variance → blurry
  • Noise level
    • High-pass filtered grayscale image
    • Standard deviation scaled to [0, 1]
  • Contrast
    • Standard deviation of grayscale intensities, normalized
  • Resolution
    • Width and height, plus min dimension

From these raw metrics, it builds a 0–10 quality score starting from a baseline and nudging up or down:

  • High sharpness → +1 to +2
  • Very low sharpness → −2
  • Low noise → +1
  • Heavy noise → −1.5
  • Reasonable contrast → +1
  • Very low or extreme contrast → −1
  • Very low resolution → −1.5
  • Very high resolution → +1

The score is clamped to [1.0, 10.0], then mapped to:

  • high quality
  • medium quality
  • low quality

using configurable thresholds (default: <4 low, 4–6 medium, >6 high).

This gives the classifier a simple, interpretable signal: is the footage already decent (high), borderline (medium), or fundamentally poor (low)?
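
Written out, the estimator might look like the sketch below. The baseline, bonuses, and cut-offs follow the description above but are placeholders rather than SmartEnhancer's exact numbers:

import cv2
import numpy as np

def quality_score(frame_bgr: np.ndarray) -> float:
    """Rough 0-10 quality estimate from sharpness, noise, contrast, resolution."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    min_dim = min(gray.shape)

    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()                            # blur indicator
    noise = np.std(gray.astype(np.float32) - cv2.GaussianBlur(gray, (5, 5), 0))  # high-pass residual
    contrast = gray.std() / 255.0

    score = 5.0                                   # neutral baseline
    if sharpness > 500:   score += 2.0
    elif sharpness > 200: score += 1.0
    elif sharpness < 50:  score -= 2.0
    if noise < 3:         score += 1.0
    elif noise > 10:      score -= 1.5
    if 0.15 < contrast < 0.35: score += 1.0
    else:                      score -= 1.0       # very low or extreme contrast
    if min_dim < 480:     score -= 1.5
    elif min_dim >= 1080: score += 1.0

    return float(np.clip(score, 1.0, 10.0))

def quality_level(score: float) -> str:
    # Default thresholds from above: <4 low, 4-6 medium, >6 high
    return "low" if score < 4 else ("medium" if score <= 6 else "high")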

Video quality assessment metrics: sharpness, noise, contrast analysis


6. Resolution and Special-Case Detection

Beyond content and quality, some specific patterns need special handling for better video enhancement.

6.1 Resolution buckets

The system categorizes resolution into:

  • very_low – min dimension < 480
  • low – 480–720
  • medium – 720–1080
  • high – ≥1080

This is used for both logging and for fine-tuning when to apply heavier models or adjust expectations.
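
In code this is just a comparison on the smaller frame dimension; a tiny helper along these lines (the name is mine):

def resolution_bucket(width: int, height: int) -> str:
    """Map the smaller frame dimension to the buckets above."""
    min_dim = min(width, height)
    if min_dim < 480:
        return "very_low"
    if min_dim < 720:
        return "low"
    if min_dim < 1080:
        return "medium"
    return "high"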

6.2 Web screenshot / old photo detection

A dedicated module attempts to recognize web screenshots and old compressed photos by combining:

  • JPEG artifact analysis
  • Low contrast & low saturation
  • Low edge sharpness
  • Text / watermark signals
  • Common aspect ratios

All these contribute to a web_screenshot_score. If it crosses a conservative threshold, the system flags:

"is_web_screenshot": True

6.3 “Needs sharpening” detection

Separately, the system checks whether the footage is:

  • Reasonably high-quality overall
  • But with insufficient edge sharpness or poor texture detail

Signals include:

  • High quality level but low edge sharpness
  • Very low texture-detail metric
  • Good contrast but soft edges
  • High resolution with unexpectedly fuzzy structure

These feed into a sharpening_score. If it’s high enough, the system sets:

"needs_sharpening": True

This is used to route content into a “sharpening-oriented” category that will prefer ultra_sharp_x4 in high-quality mode.
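
The check itself can be sketched the same way; again, the signal names and thresholds here are illustrative:

def detect_needs_sharpening(quality: dict) -> dict:
    """Illustrative check: otherwise decent footage whose edges are unexpectedly soft."""
    soft_edges  = quality.get("sharpness", 1e9) < 150
    decent      = quality.get("level") in ("high", "medium")
    big_frame   = quality.get("min_dim", 0) >= 1080
    low_texture = quality.get("texture_detail", 1.0) < 0.1

    sharpening_score = (
        (2.0 if decent and soft_edges else 0.0) +
        (1.5 if low_texture else 0.0) +
        (1.0 if big_frame and soft_edges else 0.0)
    )
    return {"sharpening_score": sharpening_score,
            "needs_sharpening": sharpening_score >= 2.0}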


7. From Analysis to Action: Content Categories

With all the analysis in place, the final classifier is straightforward. Given:

  • is_anime (from CLIP or heuristics)
  • quality_level (high / medium / low)
  • special_cases.is_web_screenshot
  • special_cases.needs_sharpening

The logic is:

if is_anime:
    category = "anime"
else:
    if is_web_screenshot:
        category = "web_screenshot"
    elif needs_sharpening:
        category = "want_sharper"
    elif quality_level == "high":
        category = "real_high_quality"
    else:
        category = "real_low_quality"

This gives the core category that will be used to select a specific video upscaling model.


8. Model Selection Strategy for Video Enhancement

The actual model decision is made by a mapping table in SmartEnhancer.

Each category has two modes:

  • normal – speed-first, “good enough” quality
  • high_quality – more aggressive enhancement, slower

8.1 Model mapping overview

Anime / Illustration

  • anime.normal
    • Model: span_kendata_x4
    • Description: fast anime enhancement, moderate quality
    • Blend: 80
  • anime.high_quality
    • Model: lsdir_x4
    • Description: higher-quality anime enhancement
    • Blend: 85

Real – High Quality

  • real_high_quality.normal
    • Model: clear_reality_x4
    • Description: light, clean enhancement, very fast
    • Blend: 70
  • real_high_quality.high_quality
    • Model: real_esrgan_x4_fp16
    • Description: stronger detail recovery with good speed
    • Blend: 80

Real – Low / Medium Quality

  • real_low_quality.normal
    • Model: clear_reality_x4
    • Description: fast enhancement with some restoration
    • Blend: 75
  • real_low_quality.high_quality
    • Model: real_esrgan_x4_fp16
    • Description: more aggressive repair of low-quality footage
    • Blend: 90

Web Screenshots / Old Photos

  • web_screenshot.normal
    • Model: clear_reality_x4
    • Description: quick cleanup for screenshots and scans
    • Blend: 75
  • web_screenshot.high_quality
    • Model: real_web_photo_x4
    • Description: specialized repair for web-compressed images
    • Blend: 85

“Need Sharpening” Cases

  • want_sharper.normal
    • Model: clear_reality_x4
    • Description: moderate sharpening without being too harsh
    • Blend: 70
  • want_sharper.high_quality
    • Model: ultra_sharp_x4
    • Description: strong sharpening for very soft footage
    • Blend: 80

Each model also has a small performance profile attached:

{
    "real_esrgan_x4_fp16": {"fps": 3.5, "speed_level": "medium"},
    "clear_reality_x4":    {"fps": 11.0, "speed_level": "very_fast"},
    "span_kendata_x4":     {"fps": 8.0,  "speed_level": "fast"},
    "lsdir_x4":            {"fps": 1.0,  "speed_level": "slow"},
    "real_web_photo_x4":   {"fps": 1.0,  "speed_level": "slow"},
    "ultra_sharp_x4":      {"fps": 0.8,  "speed_level": "slow"},
    # ...
}
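
Condensed into code, the section 8.1 mapping could be a plain dictionary plus a lookup. The exact layout inside SmartEnhancer may differ; the tuple format and function name are mine:

MODEL_MAP = {
    # category: {quality_mode: (model, blend_strength)}
    "anime":             {"normal": ("span_kendata_x4", 80),
                          "high_quality": ("lsdir_x4", 85)},
    "real_high_quality": {"normal": ("clear_reality_x4", 70),
                          "high_quality": ("real_esrgan_x4_fp16", 80)},
    "real_low_quality":  {"normal": ("clear_reality_x4", 75),
                          "high_quality": ("real_esrgan_x4_fp16", 90)},
    "web_screenshot":    {"normal": ("clear_reality_x4", 75),
                          "high_quality": ("real_web_photo_x4", 85)},
    "want_sharper":      {"normal": ("clear_reality_x4", 70),
                          "high_quality": ("ultra_sharp_x4", 80)},
}

def select_model(category: str, quality_mode: str = "normal"):
    """Return (model_name, blend_strength) for a classified clip."""
    return MODEL_MAP[category][quality_mode]

Keeping the mapping declarative like this is what makes the later "just update the table" argument work.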

Model selection strategy matrix: mapping content types to optimal enhancement models


9. Bringing It Together: smart_enhance_video

On top of the SmartEnhancer class, there’s a convenience function:

def smart_enhance_video(target_path: str, quality_mode: str = "normal") -> bool:
    # 1. Log startup and chosen quality mode
    # 2. Create SmartEnhancer instance
    # 3. Run classify_content_and_select_model(target_path, quality_mode)
    # 4. Read selected_model and blend_strength
    # 5. Push them into shared state:
    #       - frame_enhancer_model
    #       - frame_enhancer_blend
    # 6. Log final configuration and decision factors
    # 7. Return True on success, False on failure
    ...

This function does not perform the actual frame-by-frame processing. Instead, it configures the video enhancement pipeline:

  • Chooses the correct model for the clip
  • Sets a blending strength that balances enhancement with naturalness
  • Leaves the actual enhancement implementation to the host system (for example, a video processing app or CLI)
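
In practice, the host pipeline calls the wrapper once per clip and then reads the shared state back before running the real enhancement pass. The module and helper names below are placeholders, not a real API:

import shared_state                               # placeholder for the host's state module
from smart_enhancer import smart_enhance_video    # illustrative import path

if smart_enhance_video("input/clip_042.mp4", quality_mode="high_quality"):
    model = shared_state.frame_enhancer_model     # e.g. "real_esrgan_x4_fp16"
    blend = shared_state.frame_enhancer_blend     # e.g. 80
    run_frame_by_frame_enhancement("input/clip_042.mp4", model=model, blend=blend)  # host's own pipeline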



10. Why Automatic Model Selection Scales Better Than Manual Tuning

Manually picking a model per video is fine for occasional creative work.

But for serious use—batch processing, user-facing tools, or pipelines that run on varied content—you quickly run into problems:

  • You cannot maintain a giant if–else tree in your head for every new clip
  • Users do not want to learn model names or understand architectures
  • The “right” model is usually a function of content type, quality, and performance budget

SmartEnhancer solves this by:

  • Turning empirical knowledge of models into a formal mapping
  • Using measurable features (sharpness, noise, saturation, block artifacts) instead of guesses
  • Making decisions that are transparent in logs and easy to adjust

If you later discover that, for example, swin2_sr_x4 is viable on your A100 cluster for real_high_quality.high_quality, you can simply update the mapping table and keep the rest of the logic intact.


11. Possible Extensions

There are several directions to push this automatic video upscaling model selector further:

  • Train a dedicated quality model
    • Replace or augment the traditional estimator with a trained NIMA-like model
    • Tailor it to your specific dataset (for example, anime-heavy vs real-heavy content)
  • Add more content categories
    • For example, “noisy night-time footage”, “handheld shaky phone video”, “screen-share UI clips”
  • Use multi-frame analysis
    • Sample several frames instead of just the middle frame
    • Aggregate signals for more stable classification
  • Implement reinforcement from user feedback
    • Log when users override the chosen model
    • Gradually refine thresholds and mappings based on corrections

Even in its current form, SmartEnhancer already removes a huge amount of friction and guesswork from video enhancement workflows.


Conclusion

The core idea behind SmartEnhancer is simple:

Let the video itself tell you which upscaling model it needs.

By combining:

  • CLIP-based content classification for anime vs real-world detection
  • Multi-metric quality assessment using sharpness, noise, and contrast analysis
  • Special-case detection for web screenshots and blurry footage
  • Performance-aware model mapping that balances quality with processing speed

the system can automatically choose a video enhancement / super-resolution model that makes sense for a given clip, and do so in a way that’s explainable, configurable, and production-friendly.

If you’re experimenting with video super-resolution models or building your own video processing toolchain, structuring your logic this way—measure first, then choose—is far more scalable than hand-picking a model for each video.

This approach scales from individual creative projects to production pipelines processing thousands of videos daily. The decision logic is transparent, configurable, and continuously improvable as new models become available.

For developers building video processing tools or content creators working with diverse footage types, implementing content-aware model selection transforms video enhancement from an art into a reliable, automated process.

The future of video enhancement isn't just better models—it's smarter model selection.


Ready to enhance your videos with intelligent model selection? Try our video enhancement tool and experience automatic model optimization in action.
