Automatic Type Handling

Real-world data doesn't fit neat categories. Addresses span multiple columns. Coordinates come in pairs. Some columns are mostly numbers but occasionally contain text. Featrix detects and handles all of these automatically.

The Problem

Traditional ML forces you to decide how to encode every column upfront:

Address components: Do you concatenate them? One-hot encode cities? What about typos in street names?
Lat/long pairs: Treat them as two separate numbers and you lose the geographic relationship.
Mixed-type columns: A "price" field that sometimes says "Call for quote" will either crash your pipeline or silently corrupt the data.
Related attributes: customer_name, customer_id, customer_type are clearly related, but how do you tell the model?

Get these decisions wrong and your model learns garbage. Get them right and you've spent weeks on preprocessing instead of solving your actual problem.

How Featrix Solves This

Automatic Detection

Featrix scans your column names and values to detect patterns:

| Pattern | Example Columns | What Happens |
|---|---|---|
| Address groups | `shipping_addr1`, `shipping_city`, `shipping_state`, `shipping_zip` | Merged into a single composite encoding |
| Coordinate pairs | `store_lat`, `store_long` | Encoded together with geographic features |
| Entity attributes | `customer_name`, `customer_id`, `customer_type` | Relationship-aware encoding; the model learns they belong together |
| Mixed-type values | Column with both `"299.99"` and `"Contact sales"` | Hybrid encoder handles both numeric and text |

No configuration required. The detection runs automatically during training.

Two Encoding Strategies

Merge Strategy (for composite concepts)

Address components, coordinates, and other tightly-coupled columns are merged into a single embedding:

shipping_addr1 + shipping_city + shipping_state + shipping_zip
                              ↓
              [single address embedding vector]

Benefits:

- Captures the address as one semantic concept
- Reduces sequence length for the transformer (faster training)
- Handles missing components gracefully
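To make the merge idea concrete, here is a minimal sketch that collapses the address columns of one row into a single composite value and skips missing components. The function name `merge_address` and the comma-joined string are assumptions for illustration only; they show the shape of the idea, not how Featrix builds its address embedding internally.

```python
def merge_address(row, component_columns):
    """Join the non-missing address components of one row into one string.

    Hypothetical helper: the composite string stands in for whatever
    single input a merged address encoder would consume.
    """
    parts = []
    for col in component_columns:
        value = row.get(col)
        if value is None:
            continue  # a missing component is skipped, not an error
        text = str(value).strip()
        if text:
            parts.append(text)
    return ", ".join(parts)


shipping_columns = ["shipping_addr1", "shipping_city", "shipping_state", "shipping_zip"]
row = {
    "shipping_addr1": "123 Main St",
    "shipping_city": "Springfield",
    "shipping_state": None,  # missing on purpose
    "shipping_zip": "62704",
}
merged = merge_address(row, shipping_columns)
print(merged)
```

Four columns become one value, which is why a merged group occupies a single position in the sequence instead of four.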

Relationship Strategy (for related but distinct attributes)

Entity attributes stay as separate columns but the model knows they're related:

customer_name  →  [embedding + group_marker]
customer_id    →  [embedding + group_marker]
customer_type  →  [embedding + group_marker]

Benefits:

- The model learns that customer_* columns describe the same entity
- Individual components can still be masked during training
- More interpretable: you can see which attribute matters most
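One way to picture the `[embedding + group_marker]` notation is a shared vector that is combined with every column embedding in the group. The sketch below assumes the `+` means element-wise addition and uses toy four-dimensional vectors; both choices, along with the name `add_group_marker`, are illustrative assumptions rather than a description of Featrix internals.

```python
def add_group_marker(embedding, group_marker):
    """Element-wise sum of a column embedding and a shared group vector."""
    if len(embedding) != len(group_marker):
        raise ValueError("embedding and group marker must have the same length")
    return [e + g for e, g in zip(embedding, group_marker)]


# Toy embeddings; real ones would be learned and much wider.
column_embeddings = {
    "customer_name": [0.5, 0.0, 0.25, 0.0],
    "customer_id": [0.0, 0.5, 0.0, 0.0],
    "customer_type": [0.25, 0.25, 0.0, 0.0],
}
customer_marker = [0.0, 0.0, 0.0, 1.0]  # the same vector for every customer_* column

marked = {
    col: add_group_marker(vec, customer_marker)
    for col, vec in column_embeddings.items()
}
```

Each column keeps its own embedding, so it can still be masked or inspected on its own, while the shared component signals group membership.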

Mixed-Type Columns

Some columns are messy. A "quantity" field might be:

- 100 (number)
- "100+" (text indicating "more than 100")
- "TBD" (placeholder)
- null (missing)

Traditional approaches either crash or silently drop the non-numeric values.

Featrix uses hybrid encoders that:

Detect columns with mixed types during analysis
Route numeric values through the scalar encoder
Route text values through the string encoder
Combine both into a unified embedding

The model learns that "100" and 100 are the same, while "TBD" means something different.
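The routing step can be sketched in a few lines. The version below is a simplified illustration, not the Featrix implementation: `route_value` is a hypothetical name, and the rule "anything that parses as a finite float is numeric" is an assumption about where the line is drawn.

```python
import math


def route_value(value):
    """Return (route, payload) for one cell of a mixed-type column."""
    if value is None:
        return ("missing", None)
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        number = float(value)
    else:
        try:
            number = float(str(value).strip())
        except ValueError:
            # Not parseable as a number: send it to the string path.
            return ("text", str(value))
    if math.isfinite(number):
        return ("numeric", number)
    # Values such as "nan" or "inf" are treated as text in this sketch.
    return ("text", str(value))


for cell in [100, "100", "100+", "TBD", None]:
    print(repr(cell), "->", route_value(cell))
```

Because `100` and `"100"` both reach the numeric path with the same payload, a downstream encoder sees them as the same value, while `"100+"` and `"TBD"` keep their text.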

What Gets Detected

Address Patterns

Prefixes: shipping_, billing_, mailing_, home_, work_, delivery_

Components: addr1, addr2, address, street, city, state, province, zip, postal_code, country

Minimum 2 components required to trigger detection.
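The rule above can be approximated as a name-matching pass over the column list. This is a rough sketch using only the prefixes and components listed here; the function name `find_address_groups` and the exact matching rule (prefix followed directly by a component name) are assumptions, and the real detector also looks at values.

```python
from collections import defaultdict

# Prefixes and components copied from the lists above.
ADDRESS_PREFIXES = ("shipping_", "billing_", "mailing_", "home_", "work_", "delivery_")
ADDRESS_COMPONENTS = {
    "addr1", "addr2", "address", "street", "city",
    "state", "province", "zip", "postal_code", "country",
}
MIN_COMPONENTS = 2


def find_address_groups(columns):
    """Group columns by address prefix; keep groups with enough components."""
    groups = defaultdict(list)
    for col in columns:
        name = col.lower()
        for prefix in ADDRESS_PREFIXES:
            if name.startswith(prefix) and name[len(prefix):] in ADDRESS_COMPONENTS:
                groups[prefix].append(col)
                break
    return {p: cols for p, cols in groups.items() if len(cols) >= MIN_COMPONENTS}


columns = [
    "shipping_addr1", "shipping_city", "shipping_state", "shipping_zip",
    "billing_city",   # only one billing_ component, so no billing group
    "order_total",    # no address prefix
]
address_groups = find_address_groups(columns)
print(address_groups)
```

In this example only the `shipping_` group qualifies; the lone `billing_city` column falls below the two-component minimum and would be encoded on its own.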

Coordinate Patterns

Pairs: *_lat + *_long, *_latitude + *_longitude, lat + lng

Examples: hq_lat/hq_long, pickup_latitude/pickup_longitude
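Pair matching of this kind comes down to finding a latitude column and checking that its partner exists. The sketch below covers the three pairings listed above; `find_coordinate_pairs` is a hypothetical helper and works on names only.

```python
# Suffix pairs taken from the patterns above.
LAT_LONG_SUFFIXES = [("_lat", "_long"), ("_latitude", "_longitude")]


def find_coordinate_pairs(columns):
    """Return (latitude_column, longitude_column) tuples found by name."""
    present = set(columns)
    pairs = []
    for col in columns:
        for lat_suffix, long_suffix in LAT_LONG_SUFFIXES:
            if col.endswith(lat_suffix):
                partner = col[: -len(lat_suffix)] + long_suffix
                if partner in present:
                    pairs.append((col, partner))
    # The bare lat + lng pairing has no prefix to match on.
    if "lat" in present and "lng" in present:
        pairs.append(("lat", "lng"))
    return pairs


columns = [
    "hq_lat", "hq_long",
    "pickup_latitude", "pickup_longitude",
    "lat", "lng",
    "store_lat",  # no store_long, so it stays unpaired
]
coordinate_pairs = find_coordinate_pairs(columns)
print(coordinate_pairs)
```

A latitude column without its partner, such as `store_lat` here, is left as an ordinary numeric column in this sketch.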

Entity Patterns

Three or more columns that share a prefix and have different types.

Examples: customer_*, product_*, order_*, employee_*
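As a rough illustration of the entity rule, the sketch below groups columns by the text before the first underscore and keeps groups of three or more columns. It reads "different types" as "at least two distinct types in the group", which is an assumption, as are the type labels and the name `find_entity_groups`.

```python
from collections import defaultdict


def find_entity_groups(column_types, min_columns=3):
    """Find prefixes shared by enough columns of more than one type.

    column_types maps column name -> a type label such as "string".
    """
    by_prefix = defaultdict(list)
    for col in column_types:
        prefix, sep, rest = col.partition("_")
        if sep and rest:  # require text on both sides of the underscore
            by_prefix[prefix].append(col)

    groups = {}
    for prefix, cols in by_prefix.items():
        distinct_types = {column_types[c] for c in cols}
        if len(cols) >= min_columns and len(distinct_types) > 1:
            groups[prefix + "_"] = cols
    return groups


column_types = {
    "customer_name": "string",
    "customer_id": "integer",
    "customer_type": "categorical",
    "order_total": "float",
    "order_date": "date",  # only two order_ columns, below the minimum
}
entity_groups = find_entity_groups(column_types)
print(entity_groups)
```

A real detector would also need to exclude columns already claimed by an address or coordinate group; that step is left out here for brevity.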

Performance Impact

Detection: Runs once during dataset initialization (~100 ms for 100 columns)
Merge strategy: Actually reduces training time (fewer tokens in transformer)
Relationship strategy: Tiny overhead for group embeddings

Datasets with many address or coordinate columns typically train 10-30% faster with merged encodings.
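A back-of-the-envelope calculation shows why merging shortens the sequence. The sketch assumes one token per column and one token per merged group, and it compares only the pairwise self-attention term, which grows with the square of the sequence length. The column counts are made up for illustration, and the ratio overstates end-to-end savings because most other training costs do not shrink quadratically.

```python
def tokens_after_merge(n_columns, merged_group_sizes):
    """Each merged group of k columns becomes one token, saving k - 1 tokens."""
    return n_columns - sum(size - 1 for size in merged_group_sizes)


before = 20                                     # hypothetical column count
after = tokens_after_merge(before, [4, 4, 2])   # two address groups, one lat/long pair
attention_ratio = (after / before) ** 2         # relative pairwise attention work

print(f"tokens: {before} -> {after}")
print(f"relative attention work: {attention_ratio:.2f}")
```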

You Don't Have to Think About This

That's the point. Featrix:

Scans your columns automatically
Detects patterns that indicate related columns
Applies the appropriate encoding strategy
Trains a model that understands these relationships

If you have addresses, coordinates, or entity attributes in your data, Featrix handles them correctly without you specifying anything. If your columns don't match any patterns, everything works exactly as before.

This is one of dozens of decisions Featrix makes automatically—decisions that would take you weeks to implement and tune yourself.