Deep Relationship Discovery

Featrix doesn't just encode columns independently—it discovers and exploits relationships between them. Dates relate to seasons. ZIP codes relate to demographics. Lat/long pairs define geographic proximity. Email domains indicate company relationships. Featrix finds these patterns automatically.

The Problem with Independent Encoding

Traditional ML treats each column as isolated:

age: 35      → normalize to 0.35
income: 75000 → normalize to 0.75
zip: 10001   → one-hot encode (10,000+ dimensions)

This misses obvious relationships:

  • Age and income are correlated (income typically rises with experience)
  • ZIP code implies geography, demographics, and economic indicators
  • Timestamps have cyclical patterns (hour-of-day, day-of-week, seasonality)

How Featrix Discovers Relationships

1. Transformer Self-Attention

Every column attends to every other column through multi-head self-attention:

For each query column A, normalized across all key columns B:
    attention(A, B) = softmax_B(Q_A · K_B^T / √d)

The model learns:

  • Income × Education: Higher education correlates with higher income
  • Age × Medical History: Older patients have different condition profiles
  • Location × Price: Real estate prices vary by geography

Multiple attention heads capture different interaction types in parallel.
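As a rough illustration of the formula above, here is a minimal single-head sketch in plain Python. The query and key vectors are made-up toy numbers, and `column_attention` is a hypothetical helper, not part of the Featrix API; a real model would learn Q and K projections and run several heads in parallel.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def column_attention(queries, keys):
    """Return weights[a][b]: how strongly column a attends to column b.

    queries and keys each hold one d-dimensional vector per column.
    """
    d = len(queries[0])
    weights = []
    for q in queries:
        # Scaled dot product of this column's query with every column's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Normalize across all key columns so each row sums to 1.
        weights.append(softmax(scores))
    return weights

# Toy vectors for three columns (illustrative values only, d = 2).
columns = ["age", "income", "zip"]
Q = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
K = [[1.0, 0.0], [1.0, 0.2], [0.0, 1.0]]

attn = column_attention(Q, K)
for name, row in zip(columns, attn):
    print(name, [round(w, 3) for w in row])
```

In this toy setup the `age` row puts more weight on `income` than on `zip`, because their vectors point in similar directions.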

2. Type-Aware Relationship Extraction

Beyond attention, Featrix computes explicit pairwise features based on column types:

| Column Pair | Relationship Features |
| --- | --- |
| Scalar × Scalar | Ratio, difference, product, log ratio, correlation |
| Scalar × Category | Per-category statistics, category-weighted scalar |
| Category × Category | Co-occurrence patterns, conditional distributions |
| Timestamp × Scalar | Temporal trends, seasonality patterns |
| Location × Location | Geographic distance, same-region indicator |
| Text × Category | Keyword presence per category |

These relationship tokens feed directly into the joint encoder alongside column embeddings.
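To make the Scalar × Scalar row concrete, the sketch below computes the listed features for a pair of positive scalar columns. The function names and the `eps` guard are assumptions for illustration; how Featrix actually computes and tokenizes these features is not specified here.

```python
import math

def scalar_pair_features(a, b, eps=1e-9):
    """Per-row features for two positive scalar values (eps avoids division by zero)."""
    return {
        "ratio": a / (b + eps),
        "difference": a - b,
        "product": a * b,
        "log_ratio": math.log((a + eps) / (b + eps)),
    }

def pearson(xs, ys):
    """Column-level Pearson correlation (defined over whole columns, not single rows)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values for product_price and product_cost.
row_features = scalar_pair_features(100.0, 60.0)
column_corr = pearson([100.0, 80.0, 120.0], [60.0, 50.0, 70.0])
print(row_features, round(column_corr, 3))
```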

3. Group-Biased Attention

Related columns get boosted attention scores automatically:

Detected groups:
- billing_street, billing_city, billing_state, billing_zip → address group
- product_price, product_cost, product_margin → financial group
- user_first_name, user_last_name, user_email → identity group

Columns in the same group attend to each other more strongly by default. The bias is learned and can be strengthened or weakened during training.
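One simple way to picture group-biased attention is a bias matrix added to the attention scores before the softmax. The sketch below assumes groups are detected by a shared name prefix and uses a fixed `strength` constant as a stand-in for the learned bias; both are simplifications, and `detect_groups` and `group_bias` are hypothetical names.

```python
from collections import defaultdict

def detect_groups(columns):
    """Group columns that share the name prefix before the first underscore."""
    by_prefix = defaultdict(list)
    for name in columns:
        prefix = name.split("_", 1)[0]
        by_prefix[prefix].append(name)
    # A prefix shared by only one column is not a group.
    return {prefix: cols for prefix, cols in by_prefix.items() if len(cols) > 1}

def group_bias(columns, groups, strength=1.0):
    """Build bias[i][j], to be added to the pre-softmax score of column i attending to j."""
    group_of = {col: prefix for prefix, cols in groups.items() for col in cols}
    bias = []
    for a in columns:
        row = []
        for b in columns:
            same_group = a != b and a in group_of and group_of[a] == group_of.get(b)
            row.append(strength if same_group else 0.0)
        bias.append(row)
    return bias

columns = ["billing_street", "billing_city", "billing_zip",
           "product_price", "product_cost", "age"]
groups = detect_groups(columns)
bias = group_bias(columns, groups)
print(groups)
```

In a trained model the constant would be a parameter updated by gradient descent, which is how the bias could grow or shrink during training.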

Specialized Type Understanding

Timestamps: 12 Temporal Features

A timestamp isn't a single number. Featrix extracts:

| Feature | Encoding | What It Captures |
| --- | --- | --- |
| Second of minute | sin/cos | Sub-minute patterns |
| Minute of hour | sin/cos | Hourly patterns |
| Hour of day | sin/cos | Daily cycles (business hours, sleep) |
| Day of week | sin/cos | Weekly patterns (weekday vs weekend) |
| Day of month | sin/cos | Monthly billing cycles |
| Week of year | sin/cos | Seasonal patterns |
| Month of year | sin/cos | Annual seasonality |
| Quarter | sin/cos | Business quarters |
| Year (relative) | linear | Long-term trends |
| Is weekend | binary | Weekend indicator |
| Is holiday | binary | Holiday detection |
| Timezone offset | linear | Geographic time zone |

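A timestamp like 2024-03-15 14:30:00 becomes a rich vector built from these 12 features (eight cyclical, two linear, two binary; each sin/cos feature contributes a sine and a cosine value). It captures "Friday afternoon in Q1" rather than just a Unix timestamp.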
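The sin/cos encoding maps a periodic value onto the unit circle so that the end of a cycle sits next to its start (hour 23 is close to hour 0). The sketch below covers a handful of the features from the table; the helper names and the choice of periods are assumptions for illustration, not the exact Featrix implementation.

```python
import math
from datetime import datetime

def cyclical(value, period):
    """Map a value in [0, period) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def encode_timestamp(ts):
    """Encode a subset of the timestamp features listed above."""
    return {
        "hour_of_day": cyclical(ts.hour, 24),
        "day_of_week": cyclical(ts.weekday(), 7),      # Monday = 0
        "month_of_year": cyclical(ts.month - 1, 12),   # January = 0
        "quarter": cyclical((ts.month - 1) // 3, 4),   # Q1 = 0
        "is_weekend": ts.weekday() >= 5,
    }

features = encode_timestamp(datetime(2024, 3, 15, 14, 30))
for name, value in features.items():
    print(name, value)

# Hour 23 is nearer to hour 0 than hour 12 is, unlike with raw integers.
print(math.dist(cyclical(23, 24), cyclical(0, 24)) <
      math.dist(cyclical(12, 24), cyclical(0, 24)))
```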

Geographic Data: Automatic Enrichment

ZIP Codes and FIPS Codes

ZIP codes look like numbers but aren't—10001 isn't "greater than" 10000 in any meaningful sense. Featrix:

  1. Detects ZIP/FIPS code patterns automatically
  2. Looks up latitude/longitude coordinates
  3. Adds geographic features (region, timezone, urban/rural)
  4. Enables geographic distance calculations with other location columns
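Step 1 can be pictured as a pattern check over a column's values. The sketch below is an assumption about how such detection might look, using a US ZIP and ZIP+4 regular expression and a hypothetical `threshold` parameter; the coordinate lookup in step 2 needs a reference dataset and is not shown.

```python
import re

# US ZIP (5 digits) with an optional ZIP+4 suffix.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def looks_like_zip_column(values, threshold=0.95):
    """Return True if nearly all non-empty values match the ZIP pattern."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return False
    matches = sum(1 for v in non_empty if ZIP_PATTERN.match(v))
    return matches / len(non_empty) >= threshold

# Values are kept as strings so leading zeros (e.g. "02139") survive.
print(looks_like_zip_column(["10001", "02139", "94105-1234"]))
print(looks_like_zip_column(["35", "42", "7"]))
```

A pattern check alone cannot tell a ZIP column from a column of five-digit amounts, so a practical detector would also need other signals such as the column name or whether the values resolve to known ZIP codes.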

Lat/Long Coordinate Pairs

When Featrix detects paired coordinates (*_lat + *_long), it:

  1. Groups them as a single geographic concept
  2. Computes distances to other location columns
  3. Adds reverse-geocoded features (city, state, country)
  4. Enables radius-based similarity search

Addresses

Address components (street, city, state, zip) are merged into a single composite encoding:

shipping_street + shipping_city + shipping_state + shipping_zip
                              ↓
              [single address embedding vector]

Benefits:

  • Captures the address as one semantic concept
  • Reduces sequence length for the transformer
  • Handles missing components gracefully
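The merge step can be sketched as joining whichever components are present into one value before it is embedded. The helper below is hypothetical and only shows the merging and the handling of missing parts; the embedding itself is not shown, and the sample row is illustrative.

```python
ADDRESS_PARTS = ("street", "city", "state", "zip")

def merge_address(row, prefix):
    """Join the present address components for a prefix into one string."""
    parts = []
    for part in ADDRESS_PARTS:
        value = row.get(f"{prefix}_{part}")
        # Skip components that are missing or blank instead of failing.
        if value is None or str(value).strip() == "":
            continue
        parts.append(str(value).strip())
    return ", ".join(parts)

row = {
    "shipping_street": "350 5th Ave",
    "shipping_city": "New York",
    "shipping_state": None,   # missing component
    "shipping_zip": "10118",
}
merged = merge_address(row, "shipping")
print(merged)
```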

URLs and Domains: Structural Decomposition

A URL like https://api.example.com/v2/users?limit=100 is parsed into:

| Component | Value | What It Captures |
| --- | --- | --- |
| Protocol | https | Security level |
| Subdomain | api | Service type |
| Domain | example.com | Organization |
| TLD | .com | Domain type |
| Path | /v2/users | Endpoint structure |
| Path depth | 2 | API complexity |
| Query params | limit=100 | Request parameters |

Additionally, Featrix can resolve domains to IP addresses and geolocate them—meaning you can discover relationships between domains and geographic data without explicit enrichment.
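The structural decomposition in the table can be approximated with Python's standard `urllib.parse` module. This is a sketch, not the Featrix parser: `decompose_url` is a hypothetical helper, and the domain split naively assumes a single-label TLD such as `.com` (multi-label suffixes like `.co.uk` would need a public-suffix list). DNS resolution and geolocation are not shown.

```python
from urllib.parse import urlparse, parse_qs

def decompose_url(url):
    """Split a URL into the components listed in the table above."""
    parts = urlparse(url)
    host = parts.hostname or ""
    labels = host.split(".")
    has_domain = len(labels) >= 2
    # Non-empty path segments, e.g. ["v2", "users"].
    segments = [s for s in parts.path.split("/") if s]
    return {
        "protocol": parts.scheme,
        "subdomain": ".".join(labels[:-2]),
        "domain": ".".join(labels[-2:]) if has_domain else host,
        "tld": "." + labels[-1] if has_domain else "",
        "path": parts.path,
        "path_depth": len(segments),
        "query_params": parse_qs(parts.query),
    }

components = decompose_url("https://api.example.com/v2/users?limit=100")
for name, value in components.items():
    print(name, value)
```

Note that `parse_qs` returns each parameter as a list of strings, since a query string may repeat a key.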

Email Addresses: Identity Decomposition

An email like john.smith@acme-corp.com becomes:

| Component | Value | What It Captures |
| --- | --- | --- |
| Local part | john.smith | Personal identifier |
| Domain | acme-corp.com | Organization |
| TLD | .com | Domain type |
| Is free email | false | Gmail/Yahoo vs corporate |
| Domain category | business | Personal vs business |

Emails from the same domain cluster together—useful for B2B analytics where company relationships matter.
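A minimal sketch of this decomposition follows. The free-provider list is a tiny illustrative sample, and treating every non-free domain as "business" is a simplifying assumption; `decompose_email` is a hypothetical helper rather than a Featrix function.

```python
# Small illustrative sample; a real list would be much longer.
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}

def decompose_email(address):
    """Split an email address into the components listed in the table above."""
    address = address.strip()
    if "@" not in address:
        raise ValueError(f"not an email address: {address!r}")
    # Split on the last "@" so unusual local parts stay intact.
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    is_free = domain in FREE_EMAIL_DOMAINS
    return {
        "local_part": local,
        "domain": domain,
        "tld": "." + domain.rsplit(".", 1)[-1] if "." in domain else "",
        "is_free_email": is_free,
        "domain_category": "personal" if is_free else "business",
    }

corporate = decompose_email("john.smith@acme-corp.com")
personal = decompose_email("jane@Gmail.com")
print(corporate)
print(personal)
```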

JSON: Nested Structure Flattening

JSON columns are flattened and encoded via child embedding spaces:

{
  "address": {"city": "NYC", "zip": "10001"},
  "preferences": {"notifications": true, "theme": "dark"}
}

Becomes:

address.city: "NYC"      → categorical encoding
address.zip: "10001"     → ZIP code encoding (with geo enrichment)
preferences.notifications: true → boolean encoding
preferences.theme: "dark" → categorical encoding

Each flattened column is then encoded according to its detected type within a child embedding space, which preserves the hierarchical structure while keeping it compatible with the main model.
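The flattening step itself can be sketched as a short recursive walk that joins nested keys with dots, reproducing the column names shown above. `flatten` is a hypothetical helper; list handling and the child embedding space are out of scope for this sketch.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted key paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

raw = """
{
  "address": {"city": "NYC", "zip": "10001"},
  "preferences": {"notifications": true, "theme": "dark"}
}
"""
flat = flatten(json.loads(raw))
for path, value in flat.items():
    print(f"{path}: {value!r}")
```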

Cross-Column Relationship Examples

Example 1: E-commerce Fraud Detection

| Column | Type | Relationships Discovered |
| --- | --- | --- |
| order_amount | scalar | Ratio with customer average |
| shipping_zip | ZIP | Distance from billing ZIP |
| order_time | timestamp | Is unusual hour for customer |
| email_domain | email | Free email vs corporate |
| device_fingerprint | string | Seen before with this customer |

Featrix automatically learns: "Large orders to a different ZIP at 3am from a free email domain on a new device" = high fraud risk.

Example 2: Real Estate Pricing

| Column | Type | Relationships Discovered |
| --- | --- | --- |
| property_lat/long | coordinates | Distance to city center |
| listing_date | timestamp | Seasonal market effects |
| bedrooms | scalar | Price per bedroom |
| neighborhood | category | Location premium |
| description | text | Keywords vs price (e.g., "renovated") |

Featrix learns geographic gradients (price drops with distance from downtown) and seasonal patterns (spring listings command premiums) automatically.

Example 3: Healthcare Risk Prediction

| Column | Type | Relationships Discovered |
| --- | --- | --- |
| age | scalar | Age-condition interactions |
| diagnosis_codes | list | Comorbidity patterns |
| visit_date | timestamp | Time since last visit |
| provider_zip | ZIP | Geographic care patterns |
| notes | text | Symptom keywords |

Featrix discovers: "Elderly patients with diabetes and recent ER visits in rural areas" = different risk profile than similar patients in urban areas.

Exhaustive Combinatorial Coverage

The random masking strategy isn't just for self-supervision—it's a combinatorial exploration mechanism.

Over hundreds of epochs:

  • Each column is masked against many different combinations of the other columns
  • The model learns to predict each column from a wide range of contexts
  • Rare but important relationships surface through repetition

By the end of training, the model has sampled broadly across the cross-column signal space.
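The exploration effect of random masking can be seen in a small simulation. The sketch below assumes each column is hidden independently with probability `mask_prob` and that at least one column is always hidden and one always visible; these are illustrative assumptions, not the documented Featrix masking schedule, and the sample row is made up.

```python
import random

def mask_row(row, mask_prob, rng):
    """Split a row into visible context columns and masked target columns."""
    columns = list(row)
    masked = {c for c in columns if rng.random() < mask_prob}
    if not masked:
        # Always hide at least one column so there is something to predict.
        masked = {rng.choice(columns)}
    elif len(masked) == len(columns):
        # Always keep at least one column visible as context.
        masked.discard(rng.choice(columns))
    context = {c: v for c, v in row.items() if c not in masked}
    targets = {c: v for c, v in row.items() if c in masked}
    return context, targets

rng = random.Random(0)  # seeded for reproducibility
row = {
    "order_amount": 250.0,
    "shipping_zip": "10001",
    "order_time": "2024-03-15 03:00:00",
    "email_domain": "gmail.com",
}

seen_patterns = set()
for _ in range(2000):
    context, targets = mask_row(row, mask_prob=0.5, rng=rng)
    seen_patterns.add(frozenset(targets))

print(len(seen_patterns), "distinct mask patterns seen;",
      2 ** len(row) - 2, "are possible for", len(row), "columns")
```

With four columns every valid pattern shows up quickly. The number of patterns grows as 2^n - 2, so for wide tables training samples this space over many epochs rather than enumerating it.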

What This Means for You

You don't configure any of this. Featrix:

  1. Detects column types and related column groups
  2. Extracts appropriate features for each type
  3. Computes pairwise relationships where they make sense
  4. Learns which relationships actually matter for your data
  5. Encodes everything into a unified embedding space

The result: your data's deep structure is captured automatically, producing embeddings that understand "Friday afternoon orders to a ZIP code 500 miles away" as a meaningful pattern—not just three independent values.