Visual Layer’s Enrichment Hub lets you generate high-value metadata using pre-trained models tailored for image and video datasets. These enrichment models can:
  • Extract descriptive labels, captions, and tags
  • Enable advanced semantic and object-level search
  • Power downstream filtering, QA, and automation
  • Improve annotation coverage and data understanding

Available Enrichment Models

Visual Layer provides a range of built-in models designed for diverse enrichment tasks:
| Model Name | Task Type | Description |
| --- | --- | --- |
| VL-Object-Detector | Object Detection | Identifies and localizes objects within images or videos by drawing bounding boxes and classifying each detected object. |
| VL-Image-Tagger | Image Classification | Assigns labels or tags to an entire image, categorizing its content for identification and analysis. |
| VL-Face-Detector | Face Detection | Detects faces and extracts facial landmarks for accurate face alignment and recognition workflows. |
| VL-Image-Captioner | Image to Text | Generates descriptive text that summarizes the content and context of the entire image input. |
| VL Advanced Captioner | Image to Text | A state-of-the-art Vision-Language model that generates detailed captions and answers questions about image content (VQA). |
| VL-Object-Captioner | Object to Text | Generates descriptive text that summarizes detected objects and their interactions in the image. |
| NVILA-Lite-2B | Image to Text | A family of open VLMs designed to optimize both efficiency and accuracy for video understanding and multi-image tasks. |
| VL-Image-Semantic-Search | Semantic Image Search | Enhances image search with conceptual queries, identifying content that matches search intent and improving discovery by understanding visual context. |
| Advanced-Object-Search | Semantic Object Search | Finds objects in images or videos based on meaning and context, beyond simple tags. Quickly retrieves relevant objects using natural language queries. |
| Radiology-Image-Search | Semantic Image Search | Enhances image search with radiology understanding, improving discovery by interpreting radiology images and terminology. |
Some models require pre-existing enrichments before they can be applied. These dependencies include:
  • VL-Object-Captioner requires Object Detection to be applied first.
  • Semantic Search models require captions or embeddings from a prior enrichment step.
Labels may come from user annotations, the VL-Object-Detector, or the VL-Image-Tagger.
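The dependency rules above can be sketched as a small ordering step that runs prerequisites before the model you request. This is an illustrative sketch only: the model names come from this page, but the `DEPENDENCIES` map and the `enrichment_order` function are hypothetical, not Visual Layer's actual SDK. In particular, mapping the semantic-search model to `VL-Image-Captioner` as its caption/embedding source is an assumption.

```python
# Hypothetical dependency map based on the rules above.
# The semantic-search -> captioner edge is an assumption for illustration.
DEPENDENCIES = {
    "VL-Object-Captioner": ["VL-Object-Detector"],
    "VL-Image-Semantic-Search": ["VL-Image-Captioner"],
}


def enrichment_order(model, applied=None):
    """Return the models to run, prerequisites first, skipping any
    enrichment that has already been applied to the dataset."""
    applied = set(applied or [])
    order = []

    def visit(m):
        if m in applied or m in order:
            return
        for dep in DEPENDENCIES.get(m, []):
            visit(dep)  # run prerequisites before the model itself
        order.append(m)

    visit(model)
    return order


print(enrichment_order("VL-Object-Captioner"))
# -> ['VL-Object-Detector', 'VL-Object-Captioner']
```

If object detection has already been applied, passing it via `applied` skips the redundant step and only the captioner is scheduled.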

Coming Soon

These models are in development and will be available in the enrichment catalog:
| Model Name | Task Type | Description |
| --- | --- | --- |
| Nv-grounding dino | Object Detection | An open-vocabulary, zero-shot object detection model that accepts natural-language prompts. |
| Advanced-Image-Search | Semantic Image Search | Enhanced conceptual image retrieval using complex queries, identifying content that matches search intent. |
| yolov9 | Object Detection | Object detection model for fast and accurate bounding-box predictions. |

Want Early Access?

Have questions or want to try out upcoming models early? Contact us to request access or learn more.