
Image Understanding in Featrix

Featrix can automatically understand images referenced by URL in your datasets. When enabled, Featrix downloads each image, extracts visual features, reads any text in the image via OCR, and identifies objects — then learns how all of this relates to every other column in your data.

This means that product photos, document scans, user avatars, receipts, screenshots, and any other images become first-class features in your embedding space, on equal footing with text, numbers, and categories.

What Featrix Extracts From Each Image

Every image goes through three analysis pipelines:

Visual Features (DINOv2)

Featrix uses Meta's DINOv2 vision model to produce a dense visual embedding for each image. This captures the overall visual content — composition, color palette, texture, style, subject matter — in a compact numerical representation. Two images of similar-looking products will have similar visual embeddings even if they share no text or labels.

Text in Images (OCR)

Featrix reads text directly from images using optical character recognition. This includes:

  • Product labels and brand names
  • Prices, serial numbers, dates
  • Street signs, receipts, screenshots of text
  • Any visible text in any language

The extracted text is stored in full (for matching against your other columns) and also converted into a semantic embedding that captures its meaning.

Object Detection (ResNet)

Featrix identifies objects in each image using ImageNet classification. For each image, it produces a "bag of objects" — a ranked list of what's in the picture, with confidence scores. For example, a product photo might return:

Object        Confidence
running_shoe  0.92
sneaker       0.05
sandal        0.02

This gives the model a vocabulary of ~1,000 common object types to reason about.
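
As a rough sketch, a ranked "bag of objects" like the one above is what you get by softmaxing a classifier's raw scores and keeping the top entries. This is a toy illustration — the function name, labels, and logit values below are invented, not part of Featrix's API:

```python
import math

def bag_of_objects(logits, labels, top_k=3):
    """Turn raw classifier logits into a ranked (label, confidence) list.

    Softmax normalizes the logits into confidences that sum to 1,
    then we keep the top_k entries, highest confidence first.
    """
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    confidences = [e / total for e in exps]
    ranked = sorted(zip(labels, confidences), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

labels = ["running_shoe", "sneaker", "sandal", "boot"]
logits = [5.1, 2.2, 1.3, 0.4]          # made-up scores for illustration
top = bag_of_objects(logits, labels, top_k=3)
```

The `image_url_object_top_k` setting described later plays the role of `top_k` here: it controls how many of these ranked labels are kept per image.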

How Images Relate to Other Columns

The real power of image understanding is in how Featrix connects image content to the rest of your data. The embedding space learns these cross-modal relationships automatically through its transformer architecture.

Image + Text Columns

Featrix computes several relationship signals between image content and text columns:

  • Word overlap (Jaccard similarity): How many words in the OCR text also appear in a text column value? High overlap suggests the image literally contains the same information as the text field.

  • Substring matching: Is a text column value (or fragment of it) found inside the OCR text? This catches cases like a brand name "Nike" appearing on a shoe in the photo, matching a brand column.

  • Reverse substring matching: Are OCR fragments found in the text column? This catches the opposite direction — text extracted from an image that appears in a description field.

  • Semantic similarity: Even when the exact words differ, Featrix measures meaning overlap between the OCR text and text columns. "Automobile" in the image and "car" in the text column will still register as related.

Example: A product listing has a description column saying "Nike Air Max 90 Running Shoe" and a product photo. OCR reads "Nike Air Max" from the shoe label. Featrix detects high word overlap, exact substring match on "Nike Air Max", and strong semantic similarity — all of which help the model understand this image-text pairing.
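
The lexical signals above (word overlap and substring matching) can be sketched in a few lines of Python — a minimal illustration, not Featrix's implementation:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def substring_match(needle: str, haystack: str) -> bool:
    """Case-insensitive containment check."""
    return needle.lower() in haystack.lower()

ocr_text = "Nike Air Max"
description = "Nike Air Max 90 Running Shoe"
overlap = jaccard(ocr_text, description)          # 0.5 (3 shared of 6 words)
contained = substring_match(ocr_text, description)  # True
```

The semantic-similarity signal is the piece this sketch leaves out: it requires an embedding model rather than string operations, which is why "automobile" vs. "car" needs more than the two functions above.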

Image + Category Columns

Featrix connects detected objects and OCR text to categorical (set) columns:

  • Object-category matching: Do any detected objects semantically match category values? If ResNet detects "laptop" and your category column contains "Electronics", Featrix measures the semantic similarity between these labels and finds a strong match.

  • OCR-category matching: Does text in the image match any category member? If a product label says "Premium" and your tier column has values {Basic, Premium, Enterprise}, that's a direct match.

  • Object diversity: The number of distinct objects detected in an image can correlate with category characteristics — a busy product photo with many items might indicate a "bundle" or "variety pack" category.

Example: A real estate listing has a property_type column with values like "Apartment", "House", "Condo". Photos of houses show detectable objects (lawn, fence, driveway) that differ systematically from apartment photos (hallway, elevator, lobby). Featrix learns these visual-category associations.
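
Object-category matching can be sketched as a nearest-neighbor lookup in embedding space: embed the detected object label and every category value, then pick the closest category by cosine similarity. The tiny embedding table below is invented purely for illustration — a real system uses a learned semantic model:

```python
import math

# Toy 3-dimensional embeddings standing in for a real semantic model.
# The values are fabricated for this example.
TOY_EMBEDDINGS = {
    "laptop":      [0.9, 0.1, 0.0],
    "electronics": [0.8, 0.2, 0.1],
    "furniture":   [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_category(detected_object, categories):
    """Return the category whose embedding is closest to the object's."""
    obj_vec = TOY_EMBEDDINGS[detected_object]
    return max(categories, key=lambda c: cosine(obj_vec, TOY_EMBEDDINGS[c]))

match = best_category("laptop", ["electronics", "furniture"])  # "electronics"
```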

Image + Numeric Columns

Featrix finds connections between images and numeric data:

  • Numbers in images: OCR extracts numbers from images (prices, quantities, measurements, dates) and Featrix checks how close they are to numeric column values. A receipt photo with "$42.99" paired with a total_amount column value of 42.99 is a near-perfect match.

  • Object count: How many objects are in the image? This can correlate with numeric measures like quantity, density, or complexity.

  • Image quality signals: Visual characteristics of the image itself — brightness, contrast, sharpness, color richness, aspect ratio — can correlate with numeric columns. Higher-quality product photos often correlate with higher prices. Professional headshots differ measurably from casual selfies.

Example: An e-commerce dataset has product images and a price column. Products with clean, high-contrast studio photos tend to be more expensive. Products with OCR'd price tags matching the price column confirm the data integrity. Featrix discovers both patterns.
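
The numbers-in-images signal can be sketched with a regular expression: pull numeric values out of the OCR text, then measure how close each one is to the column value. An illustrative toy, not the actual pipeline:

```python
import re

def extract_numbers(ocr_text: str):
    """Pull numeric values (prices, quantities, etc.) out of OCR text."""
    return [float(m) for m in re.findall(r"\d+(?:\.\d+)?", ocr_text)]

def closest_match(numbers, column_value):
    """Smallest absolute difference between any OCR number and the column value."""
    return min(abs(n - column_value) for n in numbers)

receipt_text = "TOTAL $42.99 Thank you! Items: 3"
numbers = extract_numbers(receipt_text)        # [42.99, 3.0]
gap = closest_match(numbers, 42.99)            # 0.0 — a near-perfect match
```

A real extractor would also need to handle thousands separators, negative amounts, and date formats, which this regex deliberately ignores.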

Image + Image Columns

When multiple image columns exist (e.g., front_photo and back_photo, or product_image and lifestyle_image), Featrix can learn relationships between them through visual similarity, shared OCR text, and overlapping detected objects.

Enabling Image Understanding

Image understanding is controlled by your Sphere configuration file (config.json).

Basic Setup

Add to your config.json:

{
  "enable_image_url_detection": true
}

That's it. Featrix will automatically detect columns containing image URLs and process them.

Auto-Detection

Featrix identifies image URL columns by looking for:

  • URL patterns with image file extensions: .jpg, .jpeg, .png, .gif, .svg
  • Column name hints: Columns named image, photo, thumbnail, picture, avatar, logo, icon, img, banner, poster, cover, or similar

If more than half the URLs in a column point to images, it's classified as an image column. Column name hints boost detection confidence.
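
The detection heuristic can be sketched roughly as follows. Only the file-extension check, the name-hint list, and the >50% rule come from the documentation above; the exact way a name hint lowers the threshold is an assumption made up for this sketch:

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg"}
NAME_HINTS = {"image", "photo", "thumbnail", "picture", "avatar",
              "logo", "icon", "img", "banner", "poster", "cover"}

def looks_like_image_url(url: str) -> bool:
    """Check the URL path's extension (query strings are ignored)."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)

def is_image_column(name: str, urls) -> bool:
    """Classify a column: >50% image-looking URLs, with name hints
    lowering the bar (the 0.4 hinted threshold is an assumed detail)."""
    if not urls:
        return False
    ratio = sum(looks_like_image_url(u) for u in urls) / len(urls)
    hinted = any(h in name.lower() for h in NAME_HINTS)
    threshold = 0.4 if hinted else 0.5
    return ratio > threshold

is_image_column("product_photo", ["https://x/a.jpg",
                                  "https://x/b.png",
                                  "https://x/c.html"])   # True
```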

Manual Override

You can also force specific columns to be treated as image URLs using column_overrides in the API:

column_overrides = {
    "product_photo_url": "image_url",
    "thumbnail": "image_url"
}

Configuration Options

Parameter                          Default  Description
enable_image_url_detection         false    Master switch for image understanding
image_url_download_timeout         10       Per-image download timeout in seconds
image_url_max_file_size_mb         20       Skip images larger than this (MB)
image_url_max_workers              8        Parallel download threads
image_url_enable_object_detection  true     Enable ResNet object identification
image_url_object_top_k             10       Number of top objects to keep per image

Example Configuration

{
  "enable_image_url_detection": true,
  "image_url_download_timeout": 15,
  "image_url_max_file_size_mb": 50,
  "image_url_max_workers": 4,
  "image_url_enable_object_detection": true,
  "image_url_object_top_k": 10
}

How It Works Under the Hood

During Data Ingestion

When you upload a dataset, Featrix:

  1. Detects which columns contain image URLs
  2. Downloads all images in parallel (respecting timeout and size limits)
  3. Runs each image through DINOv2 (visual), EasyOCR (text), and ResNet (objects)
  4. Stores the extracted features alongside your data

Images that fail to download or can't be processed are handled gracefully — they become "unknown" values that the model treats like missing data.
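
Step 2 — parallel downloads with graceful failure — can be sketched with a thread pool. The `fake_fetch` stub below stands in for a real HTTP fetch that would honor the configured timeout and size limit; everything here is illustrative, not Featrix's internals:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, fetch, max_workers=8):
    """Fetch every URL in parallel; any failure becomes None,
    which downstream code treats like missing data."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None   # 404, timeout, oversized, etc.
    return results

# Stand-in fetcher for illustration only; a real one would download
# with image_url_download_timeout and reject oversized responses.
def fake_fetch(url):
    if "404" in url:
        raise IOError("not found")
    return b"image-bytes"

results = download_all(["https://cdn.example/a.jpg",
                        "https://cdn.example/404.jpg"], fake_fetch)
```

Making the fetcher an argument keeps the concurrency logic testable without touching the network, which is also why failures are converted to `None` here rather than raised.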

During Training

The extracted features are fed through a learnable ImageEncoder neural network with three branches:

  • A visual branch that projects the visual embedding
  • An OCR branch that projects the text embedding
  • An object branch that projects the object label embeddings

These branches are combined through learned modal gates — the model figures out which modalities matter most for each image. A product photo with prominent text benefits more from OCR; a nature photo benefits more from visual features. This weighting is learned automatically.

The combined image embedding feeds into the same transformer that processes all your other columns, so cross-column attention can discover relationships between images and everything else in your dataset.
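
The modal-gate combination described above can be sketched as a softmax-weighted sum of the three branch outputs. This is a minimal dependency-free sketch — in the real encoder the gate scores are produced by learned layers and trained end to end:

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gated_combine(visual, ocr, objects, gate_logits):
    """Weighted sum of the three modality embeddings.

    gate_logits are per-image scores (learned, in the real model);
    softmax turns them into weights, letting the encoder emphasize
    whichever modality is most informative for this image.
    """
    w_v, w_o, w_b = softmax(gate_logits)
    return [w_v * v + w_o * o + w_b * b
            for v, o, b in zip(visual, ocr, objects)]

visual  = [1.0, 0.0]
ocr     = [0.0, 1.0]
objects = [0.5, 0.5]
# Gate logits heavily favoring the visual branch, e.g. a nature photo:
combined = gated_combine(visual, ocr, objects, gate_logits=[10.0, 0.0, 0.0])
```

With the gate skewed toward the visual branch, `combined` lands almost exactly on the visual embedding; a text-heavy product photo would instead push weight toward the OCR branch.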

Supported Image Formats

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • GIF (.gif) — first frame only
  • SVG (.svg) — rasterized

What Happens When Images Fail

Situation                                 What Happens
Image URL returns 404                     Treated as missing data
Image too large                           Skipped, treated as missing
Download times out                        Treated as missing
No text found by OCR                      OCR features are zero (visual and object features still work)
Object detection disabled                 Only visual and OCR features are used
Mixed column (some image URLs, some not)  If <50% are images, the column is treated as a regular URL column instead

Missing images don't break training — the model learns a default embedding for missing values and continues with the data that is available.

Use Cases

E-Commerce

Product images + price, category, brand, description. Featrix learns that studio-quality photos correlate with premium pricing, that shoe images contain brand logos matching the brand column, and that product descriptions match OCR'd label text.

Real Estate

Property photos + price, square footage, property type, neighborhood. Featrix discovers that visual features (pool, large yard, modern kitchen) predict price ranges, and that property type (house vs. condo) corresponds to distinctive visual patterns.

Document Processing

Scanned documents + metadata columns. OCR extracts text from documents, and Featrix learns which document types contain which text patterns, matching against category and description columns.

Social Media / User Profiles

Profile photos + demographic and behavioral data. Visual style of profile photos correlates with user segments. Avatar choices relate to activity patterns.

Quality Control / Manufacturing

Product inspection photos + defect type, severity, production line. Visual features of defect photos map to defect categories. OCR reads serial numbers matching production records.

Requirements

Image understanding requires these additional packages on compute nodes:

  • easyocr — OCR text extraction
  • torchvision — object detection (typically already installed with PyTorch)

DINOv2 model weights are downloaded automatically on first use (~80 MB, cached).

Performance

  • Image downloads run in parallel (configurable thread count)
  • Visual model inference is GPU-accelerated when available
  • Feature extraction happens once during data ingestion, not during every training epoch
  • A dataset of 10,000 images typically processes in 5-15 minutes depending on download speeds and GPU availability