Automatic Type Handling¶
Real-world data doesn't fit neat categories. Addresses span multiple columns. Coordinates come in pairs. Some columns are mostly numbers but occasionally contain text. Featrix detects and handles all of these automatically.
The Problem¶
Traditional ML forces you to decide how to encode every column upfront:
- Address components: Do you concatenate them? One-hot encode cities? What about typos in street names?
- Lat/long pairs: Treat them as two separate numbers and you lose the geographic relationship.
- Mixed-type columns: A "price" field sometimes says "Call for quote". Does your pipeline crash or silently corrupt the data?
- Related attributes: customer_name, customer_id, and customer_type are clearly related, but how do you tell the model?
Get these decisions wrong and your model learns garbage. Get them right and you've spent weeks on preprocessing instead of solving your actual problem.
How Featrix Solves This¶
Automatic Detection¶
Featrix scans your column names and values to detect patterns:
| Pattern | Example Columns | What Happens |
|---|---|---|
| Address groups | shipping_addr1, shipping_city, shipping_state, shipping_zip | Merged into a single composite encoding |
| Coordinate pairs | store_lat, store_long | Encoded together with geographic features |
| Entity attributes | customer_name, customer_id, customer_type | Relationship-aware encoding; the model learns they belong together |
| Mixed-type values | Column with both "299.99" and "Contact sales" | Hybrid encoder handles both numeric and text |
No configuration required. The detection runs automatically during training.
Two Encoding Strategies¶
Merge Strategy (for composite concepts)
Address components, coordinates, and other tightly-coupled columns are merged into a single embedding.
Benefits:
- Captures the address as one semantic concept
- Reduces sequence length for the transformer (faster training)
- Handles missing components gracefully
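To make the merge idea concrete, here is a minimal sketch of one way to collapse several address columns into a single composite value before encoding. It is an illustration only, not Featrix's implementation: the function name `merge_components`, the comma-joined string format, and the sample row are all assumptions, and the real merge may operate on embeddings rather than strings.

```python
def merge_components(row, columns):
    """Join the non-missing values of `columns` into one composite string.

    Hypothetical helper: it only illustrates how four columns can become
    one input, and how a missing component can be skipped.
    """
    parts = [
        str(row[col]).strip()
        for col in columns
        if row.get(col) not in (None, "")
    ]
    return ", ".join(parts)


address_columns = ["shipping_addr1", "shipping_city", "shipping_state", "shipping_zip"]
row = {
    "shipping_addr1": "12 Main St",
    "shipping_city": "Springfield",
    "shipping_state": None,  # missing component is skipped
    "shipping_zip": "01101",
}
composite = merge_components(row, address_columns)
print(composite)  # 12 Main St, Springfield, 01101
```

In this sketch four input columns become one value, which is the sense in which merging shortens the sequence the transformer sees.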
Relationship Strategy (for related but distinct attributes)
Entity attributes stay as separate columns but the model knows they're related:
customer_name → [embedding + group_marker]
customer_id → [embedding + group_marker]
customer_type → [embedding + group_marker]
Benefits:
- Model learns customer_* columns describe the same entity
- Individual components can still be masked during training
- More interpretable—you can see which attribute matters most
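The `[embedding + group_marker]` notation above could be read in more than one way. The sketch below assumes the simplest reading: every column in a group gets the same marker vector added element-wise to its own embedding. The function name, the three-dimensional vectors, and the use of addition rather than concatenation are all assumptions made for illustration.

```python
def add_group_marker(column_embeddings, group_marker):
    """Add one shared marker vector to each column's embedding, element-wise.

    Assumption: "embedding + group_marker" means vector addition. The columns
    stay separate; only a shared offset ties them to the same group.
    """
    return {
        col: [e + g for e, g in zip(vec, group_marker)]
        for col, vec in column_embeddings.items()
    }


# Toy embeddings, chosen only to make the arithmetic easy to follow.
embeddings = {
    "customer_name": [1.0, 0.0, 0.0],
    "customer_id": [0.0, 1.0, 0.0],
    "customer_type": [0.0, 0.0, 1.0],
}
customer_marker = [0.5, 0.5, 0.5]
marked = add_group_marker(embeddings, customer_marker)
```

Because each column keeps its own vector, any one of them can still be masked independently, which matches the second benefit listed above.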
Mixed-Type Columns¶
Some columns are messy. A "quantity" field might be:
- 100 (number)
- "100+" (text indicating "more than 100")
- "TBD" (placeholder)
- null (missing)
Traditional approaches either crash or silently drop the non-numeric values.
Featrix uses hybrid encoders that:
- Detect columns with mixed types during analysis
- Route numeric values through the scalar encoder
- Route text values through the string encoder
- Combine both into a unified embedding
The model learns that "100" and 100 are the same, while "TBD" means something different.
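The routing step can be sketched in a few lines. This is a simplified illustration, not the Featrix hybrid encoder: `route_value` is a hypothetical name, and the rule "anything `float()` can parse is numeric" is an assumption about how the split might be made.

```python
def route_value(value):
    """Return (route, payload), where route is "numeric", "text", or "missing".

    Assumed rule: values that parse as a float go to the scalar path,
    everything else that is not None goes to the string path.
    """
    if value is None:
        return ("missing", None)
    try:
        return ("numeric", float(value))
    except (TypeError, ValueError):
        return ("text", str(value))


quantity_column = [100, "100", "100+", "TBD", None]
routed = [route_value(v) for v in quantity_column]
for original, result in zip(quantity_column, routed):
    print(repr(original), "->", result)
```

Under this rule `100` and `"100"` both land on the numeric path with the same payload, while `"100+"` and `"TBD"` are kept as text instead of being dropped.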
What Gets Detected¶
Address Patterns¶
Prefixes: shipping_, billing_, mailing_, home_, work_, delivery_
Components: addr1, addr2, address, street, city, state, province, zip, postal_code, country
Minimum 2 components required to trigger detection.
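As a rough sketch of how such a rule could be applied to column names, the snippet below groups columns by address prefix and keeps groups with at least two known components. The prefix and component lists are copied from above; the function name and the exact matching logic (lowercased name, prefix followed directly by the component) are assumptions, not Featrix's detection code.

```python
from collections import defaultdict

# Prefixes and components copied from the lists above.
ADDRESS_PREFIXES = ("shipping_", "billing_", "mailing_", "home_", "work_", "delivery_")
ADDRESS_COMPONENTS = {
    "addr1", "addr2", "address", "street", "city",
    "state", "province", "zip", "postal_code", "country",
}
MIN_COMPONENTS = 2


def detect_address_groups(columns):
    """Group columns that start with an address prefix and end in a known component."""
    groups = defaultdict(list)
    for col in columns:
        name = col.lower()
        for prefix in ADDRESS_PREFIXES:
            if name.startswith(prefix) and name[len(prefix):] in ADDRESS_COMPONENTS:
                groups[prefix].append(col)
                break
    # Keep only prefixes that matched enough components.
    return {p: cols for p, cols in groups.items() if len(cols) >= MIN_COMPONENTS}


columns = [
    "shipping_addr1", "shipping_city", "shipping_state", "shipping_zip",
    "billing_zip", "order_total",
]
address_groups = detect_address_groups(columns)
print(address_groups)
```

Here `billing_zip` is not grouped because it is the only billing component, so it falls below the two-component minimum.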
Coordinate Patterns¶
Pairs: *_lat + *_long, *_latitude + *_longitude, lat + lng
Examples: hq_lat/hq_long, pickup_latitude/pickup_longitude
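A name-based pairing rule like this could look roughly as follows. The suffix pairs come from the list above; the function name and the choice to match on names alone (without checking that values fall in valid latitude and longitude ranges) are simplifying assumptions.

```python
# Suffix pairs taken from the list above: (latitude suffix, longitude suffix).
SUFFIX_PAIRS = [("_lat", "_long"), ("_latitude", "_longitude")]


def detect_coordinate_pairs(columns):
    """Return (lat_column, long_column) tuples found in `columns`."""
    present = set(columns)
    pairs = []
    # Bare pair with no prefix.
    if "lat" in present and "lng" in present:
        pairs.append(("lat", "lng"))
    for col in columns:
        for lat_suffix, long_suffix in SUFFIX_PAIRS:
            if col.endswith(lat_suffix):
                partner = col[: -len(lat_suffix)] + long_suffix
                if partner in present:
                    pairs.append((col, partner))
    return pairs


columns = [
    "hq_lat", "hq_long",
    "pickup_latitude", "pickup_longitude",
    "lat", "lng",
    "store_lat",  # no matching store_long, so it stays unpaired
]
coordinate_pairs = detect_coordinate_pairs(columns)
print(coordinate_pairs)
```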
Entity Patterns¶
Same prefix with 3+ columns of different types.
Examples: customer_*, product_*, order_*, employee_*
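The rule "same prefix with 3+ columns of different types" leaves some room for interpretation. The sketch below assumes it means at least three columns sharing the text before the first underscore, with more than one distinct type among them. The function name and the coarse type labels are hypothetical.

```python
from collections import defaultdict

MIN_ENTITY_COLUMNS = 3


def detect_entity_groups(column_types):
    """Find prefixes shared by 3+ columns that do not all have the same type.

    `column_types` maps column name -> coarse type label (e.g. "string", "int").
    Assumption: the prefix is everything up to the first underscore.
    """
    by_prefix = defaultdict(list)
    for col in column_types:
        prefix, sep, _rest = col.partition("_")
        if sep:  # only columns that contain an underscore
            by_prefix[prefix + "_"].append(col)

    groups = {}
    for prefix, cols in by_prefix.items():
        distinct_types = {column_types[c] for c in cols}
        if len(cols) >= MIN_ENTITY_COLUMNS and len(distinct_types) > 1:
            groups[prefix] = cols
    return groups


column_types = {
    "customer_name": "string",
    "customer_id": "int",
    "customer_type": "category",
    "order_total": "float",
}
entity_groups = detect_entity_groups(column_types)
print(entity_groups)
```

A real detector would also need to avoid claiming columns already matched as addresses or coordinates; that precedence is not described here, so the sketch leaves it out.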
Performance Impact¶
- Detection: Runs once during dataset initialization (~100ms for 100 columns)
- Merge strategy: Actually reduces training time (fewer tokens in transformer)
- Relationship strategy: Tiny overhead for group embeddings
Datasets with many address or coordinate columns typically train 10-30% faster with merged encodings.
You Don't Have to Think About This¶
That's the point. Featrix:
- Scans your columns automatically
- Detects patterns that indicate related columns
- Applies the appropriate encoding strategy
- Trains a model that understands these relationships
If you have addresses, coordinates, or entity attributes in your data, Featrix handles them correctly without you specifying anything. If your columns don't match any patterns, everything works exactly as before.
This is one of dozens of decisions Featrix makes automatically—decisions that would take you weeks to implement and tune yourself.