Deep Relationship Discovery¶
Featrix doesn't just encode columns independently—it discovers and exploits relationships between them. Dates relate to seasons. ZIP codes relate to demographics. Lat/long pairs define geographic proximity. Email domains indicate company relationships. Featrix finds these patterns automatically.
The Problem with Independent Encoding¶
Traditional ML treats each column as isolated:
age: 35 → normalize to 0.35
income: 75000 → normalize to 0.75
zip: 10001 → one-hot encode (10,000+ dimensions)
This misses obvious relationships:
- Age and income are correlated (income typically rises with experience)
- ZIP code implies geography, demographics, and economic indicators
- Timestamps have cyclical patterns (hour-of-day, day-of-week, seasonality)
How Featrix Discovers Relationships¶
1. Transformer Self-Attention¶
Every column attends to every other column through multi-head self-attention:
The model learns:
- Income × Education: Higher education correlates with higher income
- Age × Medical History: Older patients have different condition profiles
- Location × Price: Real estate prices vary by geography
Multiple attention heads capture different interaction types in parallel.
2. Type-Aware Relationship Extraction¶
Beyond attention, Featrix computes explicit pairwise features based on column types:
| Column Pair | Relationship Features |
|---|---|
| Scalar × Scalar | Ratio, difference, product, log ratio, correlation |
| Scalar × Category | Per-category statistics, category-weighted scalar |
| Category × Category | Co-occurrence patterns, conditional distributions |
| Timestamp × Scalar | Temporal trends, seasonality patterns |
| Location × Location | Geographic distance, same-region indicator |
| Text × Category | Keyword presence per category |
These relationship tokens feed directly into the joint encoder alongside column embeddings.
3. Group-Biased Attention¶
Related columns get boosted attention scores automatically:
Detected groups:
- billing_street, billing_city, billing_state, billing_zip → address group
- product_price, product_cost, product_margin → financial group
- user_first_name, user_last_name, user_email → identity group
Columns in the same group attend to each other more strongly by default. The bias is learned and can be strengthened or weakened during training.
Specialized Type Understanding¶
Timestamps: 12 Cyclical Features¶
A timestamp isn't a single number. Featrix extracts:
| Feature | Encoding | What It Captures |
|---|---|---|
| Second of minute | sin/cos | Sub-minute patterns |
| Minute of hour | sin/cos | Hourly patterns |
| Hour of day | sin/cos | Daily cycles (business hours, sleep) |
| Day of week | sin/cos | Weekly patterns (weekday vs weekend) |
| Day of month | sin/cos | Monthly billing cycles |
| Week of year | sin/cos | Seasonal patterns |
| Month of year | sin/cos | Annual seasonality |
| Quarter | sin/cos | Business quarters |
| Year (relative) | linear | Long-term trends |
| Is weekend | binary | Weekend indicator |
| Is holiday | binary | Holiday detection |
| Timezone offset | linear | Geographic time zone |
A timestamp like 2024-03-15 14:30:00 becomes a rich 12-dimensional vector that captures "Friday afternoon in Q1" rather than just a Unix timestamp.
Geographic Data: Automatic Enrichment¶
ZIP Codes and FIPS Codes
ZIP codes look like numbers but aren't—10001 isn't "greater than" 10000 in any meaningful sense. Featrix:
- Detects ZIP/FIPS code patterns automatically
- Looks up latitude/longitude coordinates
- Adds geographic features (region, timezone, urban/rural)
- Enables geographic distance calculations with other location columns
Lat/Long Coordinate Pairs
When Featrix detects paired coordinates (*_lat + *_long), it:
- Groups them as a single geographic concept
- Computes distances to other location columns
- Adds reverse-geocoded features (city, state, country)
- Enables radius-based similarity search
Addresses
Address components (street, city, state, zip) are merged into a single composite encoding:
Benefits:
- Captures the address as one semantic concept
- Reduces sequence length for the transformer
- Handles missing components gracefully
URLs and Domains: Structural Decomposition¶
A URL like https://api.example.com/v2/users?limit=100 is parsed into:
| Component | Value | What It Captures |
|---|---|---|
| Protocol | https | Security level |
| Subdomain | api | Service type |
| Domain | example.com | Organization |
| TLD | .com | Domain type |
| Path | /v2/users | Endpoint structure |
| Path depth | 2 | API complexity |
| Query params | limit=100 | Request parameters |
Additionally, Featrix can resolve domains to IP addresses and geolocate them—meaning you can discover relationships between domains and geographic data without explicit enrichment.
Email Addresses: Identity Decomposition¶
An email like john.smith@acme-corp.com becomes:
| Component | Value | What It Captures |
|---|---|---|
| Local part | john.smith | Personal identifier |
| Domain | acme-corp.com | Organization |
| TLD | .com | Domain type |
| Is free email | false | Gmail/Yahoo vs corporate |
| Domain category | business | Personal vs business |
Emails from the same domain cluster together—useful for B2B analytics where company relationships matter.
JSON: Nested Structure Flattening¶
JSON columns are flattened and encoded via child embedding spaces:
{
"address": {"city": "NYC", "zip": "10001"},
"preferences": {"notifications": true, "theme": "dark"}
}
Becomes:
address.city: "NYC" → categorical encoding
address.zip: "10001" → ZIP code encoding (with geo enrichment)
preferences.notifications: true → boolean encoding
preferences.theme: "dark" → categorical encoding
The flattened columns are then encoded using a child embedding space, preserving the hierarchical structure while making it compatible with the main model.
Cross-Column Relationship Examples¶
Example 1: E-commerce Fraud Detection¶
| Column | Type | Relationships Discovered |
|---|---|---|
| order_amount | scalar | Ratio with customer average |
| shipping_zip | ZIP | Distance from billing ZIP |
| order_time | timestamp | Is unusual hour for customer |
| email_domain | Free email vs corporate | |
| device_fingerprint | string | Seen before with this customer |
Featrix automatically learns: "Large orders to a different ZIP at 3am from a free email domain on a new device" = high fraud risk.
Example 2: Real Estate Pricing¶
| Column | Type | Relationships Discovered |
|---|---|---|
| property_lat/long | coordinates | Distance to city center |
| listing_date | timestamp | Seasonal market effects |
| bedrooms | scalar | Price per bedroom |
| neighborhood | category | Location premium |
| description | text | Keywords vs price (e.g., "renovated") |
Featrix learns geographic gradients (price drops with distance from downtown) and seasonal patterns (spring listings command premiums) automatically.
Example 3: Healthcare Risk Prediction¶
| Column | Type | Relationships Discovered |
|---|---|---|
| age | scalar | Age-condition interactions |
| diagnosis_codes | list | Comorbidity patterns |
| visit_date | timestamp | Time since last visit |
| provider_zip | ZIP | Geographic care patterns |
| notes | text | Symptom keywords |
Featrix discovers: "Elderly patients with diabetes and recent ER visits in rural areas" = different risk profile than similar patients in urban areas.
Exhaustive Combinatorial Coverage¶
The random masking strategy isn't just for self-supervision—it's a combinatorial exploration mechanism.
Over hundreds of epochs:
- Every column is masked from every possible combination of other columns
- The model learns to predict every column from every context
- Rare but important relationships surface through repetition
By the end of training, the model has explored the full cross-column signal space.
What This Means for You¶
You don't configure any of this. Featrix:
- Detects column types and related column groups
- Extracts appropriate features for each type
- Computes pairwise relationships where they make sense
- Learns which relationships actually matter for your data
- Encodes everything into a unified embedding space
The result: your data's deep structure is captured automatically, producing embeddings that understand "Friday afternoon orders to a ZIP code 500 miles away" as a meaningful pattern—not just three independent values.