
Automatic Type Handling

Real-world data doesn't fit neat categories. Addresses span multiple columns. Coordinates come in pairs. Some columns are mostly numbers but occasionally contain text. Featrix detects and handles all of these automatically.

The Problem

Traditional ML forces you to decide how to encode every column upfront:

  • Address components: Do you concatenate them? One-hot encode cities? What about typos in street names?
  • Lat/long pairs: Treat them as two separate numbers? Then you lose the geographic relationship.
  • Mixed-type columns: A "price" field sometimes says "Call for quote". Does your pipeline crash, or silently corrupt the value?
  • Related attributes: customer_name, customer_id, customer_type are clearly related, but how do you tell the model?

Get these decisions wrong and your model learns garbage. Get them right and you've spent weeks on preprocessing instead of solving your actual problem.

How Featrix Solves This

Automatic Detection

Featrix scans your column names and values to detect patterns:

| Pattern | Example Columns | What Happens |
| --- | --- | --- |
| Address groups | shipping_addr1, shipping_city, shipping_state, shipping_zip | Merged into single composite encoding |
| Coordinate pairs | store_lat, store_long | Encoded together with geographic features |
| Entity attributes | customer_name, customer_id, customer_type | Relationship-aware encoding; the model learns they belong together |
| Mixed-type values | Column with both "299.99" and "Contact sales" | Hybrid encoder handles both numeric and text |

No configuration required. Detection runs automatically, once, when the dataset is initialized for training.

Two Encoding Strategies

Merge Strategy (for composite concepts)

Address components, coordinates, and other tightly-coupled columns are merged into a single embedding:

shipping_addr1 + shipping_city + shipping_state + shipping_zip
                              ↓
              [single address embedding vector]

Benefits:

  • Captures the address as one semantic concept
  • Reduces sequence length for the transformer (faster training)
  • Handles missing components gracefully
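The sequence-length benefit can be made concrete with a small count. The sketch below assumes one transformer token per column and one token per merged group; that assumption, the function name `sequence_length`, and the example columns `price` and `quantity` are illustrative and not taken from the Featrix API.

```python
def sequence_length(columns, merged_groups):
    """Count tokens when each merged group collapses to one token.

    Assumption: one token per unmerged column, one per merged group.
    """
    merged = set()
    for group in merged_groups:
        merged.update(group)
    unmerged = [name for name in columns if name not in merged]
    return len(unmerged) + len(merged_groups)


columns = [
    "shipping_addr1", "shipping_city", "shipping_state", "shipping_zip",
    "store_lat", "store_long", "price", "quantity",
]
groups = [
    ["shipping_addr1", "shipping_city", "shipping_state", "shipping_zip"],
    ["store_lat", "store_long"],
]

before = sequence_length(columns, [])      # every column is its own token
after = sequence_length(columns, groups)   # address and coordinates merged
print(before, after)
```

Under this assumption, eight columns shrink to four tokens: the two merged groups plus the two columns left alone.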

Relationship Strategy (for related but distinct attributes)

Entity attributes stay as separate columns but the model knows they're related:

customer_name  →  [embedding + group_marker]
customer_id    →  [embedding + group_marker]
customer_type  →  [embedding + group_marker]

Benefits:

  • Model learns customer_* columns describe the same entity
  • Individual components can still be masked during training
  • More interpretable: you can see which attribute matters most

Mixed-Type Columns

Some columns are messy. A "quantity" field might be:

  • 100 (number)
  • "100+" (text indicating "more than 100")
  • "TBD" (placeholder)
  • null (missing)

Traditional approaches either crash or silently drop the non-numeric values.

Featrix uses hybrid encoders that:

  1. Detect columns with mixed types during analysis
  2. Route numeric values through the scalar encoder
  3. Route text values through the string encoder
  4. Combine both into a unified embedding

The model learns that "100" and 100 are the same, while "TBD" means something different.
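The routing step can be sketched in a few lines. This is a minimal illustration of the idea, not Featrix's implementation: the function name `route_value` and the three labels are hypothetical, and the rule "anything that parses as a finite float is numeric" is an assumption.

```python
import math


def route_value(value):
    """Classify one cell as ("missing", None), ("numeric", float) or ("text", str)."""
    if value is None:
        return ("missing", None)
    if isinstance(value, bool):
        # bool is a subclass of int; treat it as text rather than 0/1
        return ("text", str(value))
    if isinstance(value, (int, float)):
        if isinstance(value, float) and math.isnan(value):
            return ("missing", None)
        return ("numeric", float(value))
    text = str(value).strip()
    if text == "":
        return ("missing", None)
    try:
        number = float(text)
    except ValueError:
        return ("text", text)       # e.g. "100+" or "TBD"
    if not math.isfinite(number):
        return ("text", text)       # strings such as "nan" or "inf"
    return ("numeric", number)      # "100" and 100 end up identical


for cell in [100, "100", "100+", "TBD", None]:
    print(repr(cell), "->", route_value(cell))
```

Numeric results would go to a scalar encoder and text results to a string encoder; how the two outputs are combined into one embedding is left out of this sketch.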

What Gets Detected

Address Patterns

Prefixes: shipping_, billing_, mailing_, home_, work_, delivery_

Components: addr1, addr2, address, street, city, state, province, zip, postal_code, country

Minimum 2 components required to trigger detection.
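The address rule above can be approximated with plain string matching. The sketch below uses the prefixes and components listed in this section; the function name `detect_address_groups`, the case-insensitive comparison, and the exact `prefix + component` matching are assumptions made for illustration.

```python
ADDRESS_PREFIXES = ("shipping_", "billing_", "mailing_", "home_", "work_", "delivery_")
ADDRESS_COMPONENTS = {
    "addr1", "addr2", "address", "street", "city",
    "state", "province", "zip", "postal_code", "country",
}


def detect_address_groups(columns, min_components=2):
    """Group columns named <prefix><component>; keep groups with enough members."""
    groups = {}
    for name in columns:
        lowered = name.lower()  # assumption: matching ignores case
        for prefix in ADDRESS_PREFIXES:
            if lowered.startswith(prefix) and lowered[len(prefix):] in ADDRESS_COMPONENTS:
                groups.setdefault(prefix, []).append(name)
                break
    return {p: cols for p, cols in groups.items() if len(cols) >= min_components}


columns = [
    "shipping_addr1", "shipping_city", "shipping_state", "shipping_zip",
    "billing_city", "price",
]
print(detect_address_groups(columns))
```

Here `billing_city` is left alone because a single component does not meet the two-component minimum.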

Coordinate Patterns

Pairs: *_lat + *_long, *_latitude + *_longitude, lat + lng

Examples: hq_lat/hq_long, pickup_latitude/pickup_longitude
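Pairing by suffix is simple enough to sketch directly. The function name `detect_coordinate_pairs` is hypothetical, and the sketch assumes an exact, case-sensitive match on the three pair patterns listed above.

```python
SUFFIX_PAIRS = (("_lat", "_long"), ("_latitude", "_longitude"))


def detect_coordinate_pairs(columns):
    """Return (latitude_column, longitude_column) tuples found in columns."""
    names = set(columns)
    pairs = []
    for name in columns:
        for lat_suffix, lon_suffix in SUFFIX_PAIRS:
            # require a non-empty prefix before the suffix, e.g. "hq" in "hq_lat"
            if name.endswith(lat_suffix) and len(name) > len(lat_suffix):
                partner = name[: -len(lat_suffix)] + lon_suffix
                if partner in names:
                    pairs.append((name, partner))
    if "lat" in names and "lng" in names:  # the bare lat + lng pattern
        pairs.append(("lat", "lng"))
    return pairs


columns = ["hq_lat", "hq_long", "pickup_latitude", "pickup_longitude", "depot_lat"]
print(detect_coordinate_pairs(columns))
```

A latitude column with no matching longitude column, such as `depot_lat` here, is not paired.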

Entity Patterns

Three or more columns that share a prefix and have different types.

Examples: customer_*, product_*, order_*, employee_*
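One possible reading of the entity rule is sketched below. It assumes the prefix is the text before the first underscore, that "different types" means at least two distinct types within the group, and that column types arrive as a plain name-to-type mapping; the function name `detect_entity_groups` and the type labels are hypothetical.

```python
from collections import defaultdict


def detect_entity_groups(column_types, min_columns=3):
    """Group columns by prefix; keep groups with 3+ columns and mixed types."""
    by_prefix = defaultdict(list)
    for name in column_types:
        prefix, sep, rest = name.partition("_")
        if sep and prefix and rest:  # needs text on both sides of the underscore
            by_prefix[prefix].append(name)
    groups = {}
    for prefix, cols in by_prefix.items():
        distinct_types = {column_types[c] for c in cols}
        if len(cols) >= min_columns and len(distinct_types) > 1:
            groups[prefix] = cols
    return groups


column_types = {
    "customer_name": "string",
    "customer_id": "integer",
    "customer_type": "categorical",
    "order_total": "float",
    "order_date": "datetime",
    "price": "float",
}
print(detect_entity_groups(column_types))
```

The `order_*` columns are skipped in this example because only two of them are present. A prefix such as `shipping_` would also satisfy this rule, so a full implementation would need to decide which pattern takes precedence; that ordering is not specified here.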

Performance Impact

  • Detection: Runs once during dataset initialization (~100ms for 100 columns)
  • Merge strategy: Actually reduces training time (fewer tokens in transformer)
  • Relationship strategy: Tiny overhead for group embeddings

Datasets with many address or coordinate columns typically train 10-30% faster with merged encodings.

You Don't Have to Think About This

That's the point. Featrix:

  1. Scans your columns automatically
  2. Detects patterns that indicate related columns
  3. Applies the appropriate encoding strategy
  4. Trains a model that understands these relationships

If you have addresses, coordinates, or entity attributes in your data, Featrix handles them correctly without you specifying anything. If your columns don't match any patterns, everything works exactly as before.

This is one of dozens of decisions Featrix makes automatically—decisions that would take you weeks to implement and tune yourself.