Column Type Interaction Matrix¶

This document enumerates all possible interaction types between different column types and proposes relationship features for each pair.

Column Types¶

SCALAR - Numeric values (integers, floats)
SET - Categorical sets (one-hot encoded categories)
FREE_STRING - Free-form text strings
LIST_OF_A_SET - Lists of categorical values
VECTOR - Fixed-length numeric vectors
URL - URLs (can be treated as strings with domain/path structure)
JSON - Structured JSON data
TIMESTAMP - Date/time values
EMAIL - Email addresses (can be treated as strings with domain structure)
DOMAIN - Domain names (can be treated as strings)

Interaction Matrix¶

SCALAR × SCALAR¶

Relationship Types: - Ratio: scalar_a / (scalar_b + eps) (division with epsilon) - Difference: scalar_a - scalar_b - Product: scalar_a * scalar_b - Sum: scalar_a + scalar_b - Normalized Difference: (scalar_a - scalar_b) / (scalar_a + scalar_b + eps) (relative difference) - Log Ratio: log(scalar_a + eps) - log(scalar_b + eps) (scale-invariant; requires non-negative inputs) - Power Ratio: scalar_a^p / (scalar_b^p + eps) (for specific powers p ∈ {0.5, 1, 2})

Use Cases: - Age/Income ratio → financial stability indicator - Price/Quantity → unit price - Revenue/Expenses → expense coverage (profit margin itself is (Revenue - Expenses) / Revenue) - Distance/Time → speed

Implementation: - Compute element-wise ratios, differences, products - MLP-ize each relationship type separately - Can limit to top-N pairs by MI or correlation
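As a minimal sketch of the computations listed above, the function below evaluates several SCALAR × SCALAR features for one pair of values. The name scalar_pair_features and the default eps of 1e-8 are assumptions made for illustration; the MLP and top-N selection steps are left out.

```python
import math

EPS = 1e-8  # assumed value; the document only refers to "eps"


def scalar_pair_features(a, b, eps=EPS):
    """Return SCALAR x SCALAR relationship features for one row."""
    return {
        "ratio": a / (b + eps),
        "difference": a - b,
        "product": a * b,
        "sum": a + b,
        "normalized_difference": (a - b) / (a + b + eps),
        # The log ratio assumes both inputs are non-negative.
        "log_ratio": math.log(a + eps) - math.log(b + eps),
    }


features = scalar_pair_features(100.0, 4.0)
for name, value in features.items():
    print(name, value)
```

Adding eps to the denominator guards against division by zero only when the denominator is non-negative; signed inputs would need a different guard.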

SCALAR × SET¶

Relationship Types: - Set Cardinality × Scalar: scalar * |set| (scalar weighted by set size) - Set Membership Indicator × Scalar: For each set member, compute scalar * indicator(member ∈ set) - Set Statistics × Scalar: If set values are numeric-like, compute scalar * mean(set_values), scalar * max(set_values), etc. - Scalar Normalized by Set Size: scalar / (|set| + eps) (per-member average) - Scalar × Set Intersection Size: If multiple sets, scalar * |set_1 ∩ set_2|

Use Cases: - Age × Job Categories → age-weighted job type - Price × Product Categories → category-specific pricing - Count × Set Membership → frequency-weighted features - Revenue × Customer Segments → segment-specific revenue

Implementation: - Extract set cardinality as scalar feature - Extract set membership indicators (one-hot-like) - Compute interactions with scalar - MLP-ize interactions
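The cardinality and membership-indicator interactions might look like the sketch below. The function name, the vocabulary argument (the ordered list of possible set members), and the feature keys are hypothetical and chosen only for this example.

```python
def scalar_set_features(scalar, members, vocabulary, eps=1e-8):
    """Combine a scalar with set cardinality and per-member indicators."""
    present = set(members)
    features = {
        "scalar_x_cardinality": scalar * len(present),
        "scalar_per_member": scalar / (len(present) + eps),
    }
    # One feature per vocabulary entry: scalar * indicator(member in set).
    for member in vocabulary:
        indicator = 1.0 if member in present else 0.0
        features["scalar_x_" + member] = scalar * indicator
    return features


feats = scalar_set_features(
    30.0, {"engineer", "manager"}, ["engineer", "manager", "analyst"]
)
print(feats)
```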

SCALAR × FREE_STRING¶

Relationship Types: - String Length × Scalar: scalar * len(string) (scalar weighted by string length) - String Embedding Statistics × Scalar: scalar * mean(string_embedding), scalar * std(string_embedding) - Scalar Normalized by String Length: scalar / (len(string) + eps) - Scalar × String Token Count: scalar * num_tokens(string)

Use Cases: - Age × Description Length → age-weighted description complexity - Price × Product Description → description-weighted pricing - Count × Text Length → frequency-weighted text features

Implementation: - Extract string statistics (length, token count, embedding stats) - Compute interactions with scalar - MLP-ize interactions

SCALAR × LIST_OF_A_SET¶

Relationship Types: - List Length × Scalar: scalar * len(list) (scalar weighted by list size) - List Cardinality × Scalar: scalar * sum(|set_i| for set_i in list) (total set members) - Scalar Normalized by List Length: scalar / (len(list) + eps) - List Diversity × Scalar: scalar * num_unique_items(list) (scalar weighted by diversity)

Use Cases: - Age × List of Skills → age-weighted skill count - Price × List of Features → feature-count-weighted pricing - Count × List of Tags → tag-frequency-weighted features

Implementation: - Extract list statistics (length, cardinality, diversity) - Compute interactions with scalar - MLP-ize interactions

SCALAR × VECTOR¶

Relationship Types: - Scalar × Vector Element-wise: scalar * vector (scalar scales entire vector) - Scalar × Vector Norm: scalar * ||vector|| (scalar weighted by vector magnitude) - Scalar × Vector Mean: scalar * mean(vector) (scalar weighted by average) - Scalar × Vector Dot Product: If multiple vectors, scalar * (vector_a · vector_b) - Vector Normalized by Scalar: vector / (scalar + eps) (every vector element divided by the scalar)

Use Cases: - Age × Feature Vector → age-weighted feature importance - Price × Embedding Vector → price-weighted semantic features - Count × Vector → frequency-weighted vector features

Implementation: - Extract vector statistics (norm, mean, std) - Compute element-wise products with scalar - MLP-ize interactions

SCALAR × URL¶

Relationship Types: - URL Depth × Scalar: scalar * url_depth (scalar weighted by URL path depth) - Domain Length × Scalar: scalar * len(domain) (scalar weighted by domain name length) - Has Query × Scalar: scalar * indicator(has_query_params) (binary indicator) - Scalar Normalized by URL Length: scalar / (len(url) + eps)

Use Cases: - Age × URL Depth → age-weighted navigation depth - Price × Domain Type → domain-specific pricing - Count × URL Structure → structure-frequency-weighted features

Implementation: - Extract URL features (depth, domain, query params) - Compute interactions with scalar - MLP-ize interactions
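One way to extract the URL features named above is with the standard-library urllib.parse module, as in this sketch. Treating URL depth as the number of non-empty path segments is an assumption, and the function names are illustrative.

```python
from urllib.parse import urlparse


def url_features(url):
    """Extract simple structural features from a URL string."""
    parsed = urlparse(url)
    segments = [seg for seg in parsed.path.split("/") if seg]
    return {
        "url_depth": len(segments),  # assumed definition of depth
        "domain_length": len(parsed.hostname or ""),
        "has_query": 1.0 if parsed.query else 0.0,
        "url_length": len(url),
    }


def scalar_url_features(scalar, url, eps=1e-8):
    """Interact a scalar with the URL features listed in this section."""
    f = url_features(url)
    return {
        "scalar_x_depth": scalar * f["url_depth"],
        "scalar_x_domain_length": scalar * f["domain_length"],
        "scalar_x_has_query": scalar * f["has_query"],
        "scalar_per_url_char": scalar / (f["url_length"] + eps),
    }


print(scalar_url_features(2.0, "https://example.com/a/b/c?x=1"))
```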

SCALAR × JSON¶

Relationship Types: - JSON Depth × Scalar: scalar * json_depth (scalar weighted by nesting depth) - JSON Key Count × Scalar: scalar * num_keys(json) (scalar weighted by key count) - JSON Value Types × Scalar: scalar * num_numeric_values(json), scalar * num_string_values(json) - Scalar Normalized by JSON Size: scalar / (json_size + eps)

Use Cases: - Age × JSON Complexity → age-weighted data complexity - Price × JSON Structure → structure-weighted pricing - Count × JSON Keys → key-frequency-weighted features

Implementation: - Extract JSON statistics (depth, key count, value types) - Compute interactions with scalar - MLP-ize interactions
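The JSON statistics could be computed recursively, as sketched below. The definitions used here (depth counts nested containers, key count includes keys at every level) are assumptions, since the document does not define json_depth or num_keys precisely.

```python
import json


def json_depth(value):
    """Nesting depth: 0 for scalars, 1 for a flat dict or list."""
    if isinstance(value, dict):
        return 1 + max((json_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((json_depth(v) for v in value), default=0)
    return 0


def count_keys(value):
    """Total number of dict keys at all nesting levels."""
    if isinstance(value, dict):
        return len(value) + sum(count_keys(v) for v in value.values())
    if isinstance(value, list):
        return sum(count_keys(v) for v in value)
    return 0


def scalar_json_features(scalar, raw_json):
    """Interact a scalar with JSON depth and key count."""
    doc = json.loads(raw_json)
    return {
        "scalar_x_depth": scalar * json_depth(doc),
        "scalar_x_key_count": scalar * count_keys(doc),
    }


print(scalar_json_features(2.0, '{"a": {"b": 1}, "c": [1, 2]}'))
```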

SCALAR × TIMESTAMP¶

Relationship Types: - Time Delta × Scalar: scalar * delta_time (scalar weighted by time difference) - Age (from timestamp) × Scalar: scalar * age(timestamp) (scalar weighted by age) - Temporal Features × Scalar: scalar * hour_of_day, scalar * day_of_week, scalar * month - Scalar Normalized by Time: scalar / (time_since_epoch + eps)

Use Cases: - Age × Registration Date → account-age-weighted age - Price × Purchase Date → date-weighted pricing - Count × Timestamp → time-frequency-weighted features

Implementation: - Extract temporal features (hour, day, month, age, delta) - Compute interactions with scalar - MLP-ize interactions
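A sketch of the temporal feature extraction, using the standard-library datetime module. Measuring age in days relative to an explicit now argument is an assumption; the document does not state the unit or the reference time.

```python
from datetime import datetime, timezone


def temporal_features(ts, now):
    """Extract calendar features and age (in days) from a timestamp."""
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),  # Monday == 0
        "month": ts.month,
        "age_days": (now - ts).total_seconds() / 86400.0,
    }


def scalar_timestamp_features(scalar, ts, now):
    """Multiply the scalar by each temporal feature."""
    return {
        "scalar_x_" + name: scalar * value
        for name, value in temporal_features(ts, now).items()
    }


ts = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 25, 12, 0, tzinfo=timezone.utc)
print(scalar_timestamp_features(3.0, ts, now))
```

Passing now explicitly keeps the features reproducible; reading the system clock inside the function would make the same row produce different values over time.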

SCALAR × EMAIL¶

Relationship Types: - Email Domain Length × Scalar: scalar * len(email_domain) (scalar weighted by domain length) - Has Subdomain × Scalar: scalar * indicator(has_subdomain) (binary indicator) - Email Length × Scalar: scalar * len(email) (scalar weighted by email length) - Scalar Normalized by Email Length: scalar / (len(email) + eps)

Use Cases: - Age × Email Domain → domain-age correlation - Price × Email Provider → provider-specific pricing - Count × Email Structure → structure-frequency-weighted features

Implementation: - Extract email features (domain, subdomain, length) - Compute interactions with scalar - MLP-ize interactions

SCALAR × DOMAIN¶

Relationship Types: - Domain Length × Scalar: scalar * len(domain) (scalar weighted by domain length) - TLD Type × Scalar: scalar * indicator(tld_type) (categorical: .com, .org, etc.) - Has Subdomain × Scalar: scalar * indicator(has_subdomain) (binary indicator) - Scalar Normalized by Domain Length: scalar / (len(domain) + eps)

Use Cases: - Age × Domain Type → domain-age correlation - Price × Domain Category → category-specific pricing - Count × Domain Structure → structure-frequency-weighted features

Implementation: - Extract domain features (length, TLD, subdomain) - Compute interactions with scalar - MLP-ize interactions

SET × SET¶

Use Cases: - Job Categories × Skills → category-skill overlap - Product Categories × Customer Segments → category-segment alignment - Tags × Categories → tag-category relationships

Implementation: - Compute set operations (intersection, union, difference) - Extract set statistics (cardinality, similarity) - MLP-ize relationship features

SET × FREE_STRING¶

Relationship Types: - Set Cardinality × String Length: |set| * len(string) (size interaction) - Set Membership × String Embedding: For each set member, compute indicator(member ∈ set) * string_embedding - String Contains Set Member: indicator(any(set_member in string)) (binary: does string mention any set member?) - Set Size Normalized by String Length: |set| / (len(string) + eps) - String Embedding × Set Embedding: mean(string_embedding) · mean(set_embedding) (dot product; equals cosine similarity only when both vectors are L2-normalized)

Use Cases: - Job Categories × Description → category-description alignment - Tags × Text → tag-text relevance - Categories × Comments → category-comment relationships

Implementation: - Extract set cardinality and embeddings - Extract string statistics and embeddings - Compute interactions - MLP-ize interactions

SET × LIST_OF_A_SET¶

Use Cases: - Skills × Skill List → skill-list overlap - Categories × Tag List → category-tag relationships - Features × Feature List → feature-list alignment

Implementation: - Compute set-list operations - Extract statistics - MLP-ize interactions

SET × VECTOR¶

Relationship Types: - Set Cardinality × Vector Norm: |set| * ||vector|| - Set Embedding × Vector: mean(set_embedding) · vector (dot product) - Set Size × Vector Mean: |set| * mean(vector) - Vector Normalized by Set Size: vector / (|set| + eps)

Use Cases: - Categories × Feature Vector → category-feature alignment - Tags × Embedding → tag-embedding relationships

Implementation: - Extract set statistics and embeddings - Extract vector statistics - Compute interactions - MLP-ize interactions

SET × URL¶

Relationship Types: - Set Cardinality × URL Depth: |set| * url_depth - Set × URL Domain: indicator(domain in set) (if set contains domains) - Set Size × URL Length: |set| * len(url) - URL Features × Set Embedding: url_features · mean(set_embedding)

Use Cases: - Categories × URL → category-url relationships - Tags × URL → tag-url relevance

Implementation: - Extract URL features - Extract set statistics - Compute interactions - MLP-ize interactions

SET × JSON¶

Relationship Types: - Set Cardinality × JSON Key Count: |set| * num_keys(json) - Set × JSON Keys: indicator(any(set_member in json_keys)) (binary: does JSON have keys matching set?) - Set Size × JSON Depth: |set| * json_depth - JSON Structure × Set Embedding: json_features · mean(set_embedding)

Use Cases: - Categories × JSON Structure → category-json relationships - Tags × JSON Keys → tag-key alignment

Implementation: - Extract JSON features - Extract set statistics - Compute interactions - MLP-ize interactions

SET × TIMESTAMP¶

Relationship Types: - Set Cardinality × Time Delta: |set| * delta_time - Set Embedding × Temporal Features: mean(set_embedding) · temporal_features (dot product; order of operands does not matter) - Set Size × Age: |set| * age(timestamp)

Use Cases: - Categories × Registration Date → category-time relationships - Tags × Event Time → tag-time alignment

Implementation: - Extract temporal features - Extract set statistics - Compute interactions - MLP-ize interactions

SET × EMAIL¶

Relationship Types: - Set Cardinality × Email Length: |set| * len(email) - Set × Email Domain: indicator(email_domain in set) (if set contains domains) - Set Size × Domain Length: |set| * len(email_domain) - Email Features × Set Embedding: email_features · mean(set_embedding)

Use Cases: - Categories × Email Domain → category-domain relationships - Tags × Email Provider → tag-provider alignment

Implementation: - Extract email features - Extract set statistics - Compute interactions - MLP-ize interactions

SET × DOMAIN¶

Relationship Types: - Set Cardinality × Domain Length: |set| * len(domain) - Set × Domain: indicator(domain in set) (if set contains domains) - Set Size × TLD Type: |set| * indicator(tld_type) - Domain Features × Set Embedding: domain_features · mean(set_embedding)

Use Cases: - Categories × Domain → category-domain relationships - Tags × Domain Type → tag-domain alignment

Implementation: - Extract domain features - Extract set statistics - Compute interactions - MLP-ize interactions

FREE_STRING × FREE_STRING¶

Relationship Types: - String Similarity: Cosine similarity between string embeddings - String Length Ratio: len(string_a) / (len(string_b) + eps) - String Length Difference: len(string_a) - len(string_b) - Token Overlap: |tokens(string_a) ∩ tokens(string_b)| (common tokens) - Token Jaccard: |tokens_a ∩ tokens_b| / |tokens_a ∪ tokens_b| - Embedding Distance: ||embedding_a - embedding_b|| (L2 distance) - Embedding Dot Product: embedding_a · embedding_b (unnormalized; equals cosine similarity only for unit-length embeddings) - Substring Indicator: indicator(string_a in string_b or string_b in string_a) (containment)

Use Cases: - Description × Title → description-title alignment - Comment × Review → comment-review similarity - Query × Document → query-document relevance

Implementation: - Extract string statistics (length, token count) - Extract string embeddings - Compute similarity metrics - MLP-ize interactions
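The non-embedding features above can be sketched with plain string operations. Lowercased whitespace tokenization is an assumption made for brevity; a real tokenizer would likely differ, and the embedding-based features are omitted.

```python
def string_pair_features(a, b, eps=1e-8):
    """Length, token-overlap, and containment features for two strings."""
    tokens_a = set(a.lower().split())  # assumed tokenizer: whitespace split
    tokens_b = set(b.lower().split())
    overlap = len(tokens_a & tokens_b)
    union = len(tokens_a | tokens_b)
    return {
        "length_ratio": len(a) / (len(b) + eps),
        "length_difference": len(a) - len(b),
        "token_overlap": overlap,
        # Two empty strings have no tokens; define Jaccard as 0 there.
        "token_jaccard": overlap / union if union else 0.0,
        "substring": 1.0 if (a in b or b in a) else 0.0,
    }


print(string_pair_features("red running shoes", "running shoes"))
```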

FREE_STRING × LIST_OF_A_SET¶

Relationship Types: - String Length × List Length: len(string) * len(list) - String × List Union: indicator(any(list_item in string)) (does string mention any list item?) - String Embedding × List Embedding: string_embedding · mean(list_embedding) - List Diversity × String Length: num_unique_items(list) * len(string)

Use Cases: - Description × Tag List → description-tag alignment - Comment × Category List → comment-category relationships

Implementation: - Extract string statistics and embeddings - Extract list statistics and embeddings - Compute interactions - MLP-ize interactions

FREE_STRING × VECTOR¶

Relationship Types: - String Embedding × Vector: string_embedding · vector (dot product) - String Length × Vector Norm: len(string) * ||vector|| - String Statistics × Vector Mean: mean(string_embedding) · mean(vector) - Vector Normalized by String Length: vector / (len(string) + eps)

Use Cases: - Description × Feature Vector → description-feature alignment - Text × Embedding → text-embedding relationships

Implementation: - Extract string statistics and embeddings - Extract vector statistics - Compute interactions - MLP-ize interactions

FREE_STRING × URL¶

Relationship Types: - String × URL Domain: indicator(domain in string) (does string mention domain?) - String Length × URL Length: len(string) * len(url) - String Embedding × URL Features: string_embedding · url_features - URL Depth × String Length: url_depth * len(string)

Use Cases: - Description × URL → description-url relationships - Comment × Link → comment-link alignment

Implementation: - Extract URL features - Extract string statistics and embeddings - Compute interactions - MLP-ize interactions

FREE_STRING × JSON¶

Relationship Types: - String × JSON Keys: indicator(any(key in string)) (does string mention any JSON key?) - String Length × JSON Key Count: len(string) * num_keys(json) - String Embedding × JSON Structure: string_embedding · json_features - JSON Depth × String Length: json_depth * len(string)

Use Cases: - Description × JSON Structure → description-json relationships - Comment × JSON Data → comment-data alignment

Implementation: - Extract JSON features - Extract string statistics and embeddings - Compute interactions - MLP-ize interactions

FREE_STRING × TIMESTAMP¶

Relationship Types: - String Length × Time Delta: len(string) * delta_time - String Embedding × Temporal Features: string_embedding · temporal_features - Age × String Length: age(timestamp) * len(string) - Temporal Features × String Statistics: temporal_features · mean(string_embedding)

Use Cases: - Comment × Post Date → comment-time relationships - Description × Creation Date → description-time alignment

Implementation: - Extract temporal features - Extract string statistics and embeddings - Compute interactions - MLP-ize interactions

FREE_STRING × EMAIL¶

Relationship Types: - String × Email Domain: indicator(domain in string) (does string mention email domain?) - String Length × Email Length: len(string) * len(email) - String Embedding × Email Features: string_embedding · email_features - Email Domain Length × String Length: len(email_domain) * len(string)

Use Cases: - Comment × Email → comment-email relationships - Description × Email Domain → description-domain alignment

Implementation: - Extract email features - Extract string statistics and embeddings - Compute interactions - MLP-ize interactions

FREE_STRING × DOMAIN¶

Relationship Types: - String × Domain: indicator(domain in string) (does string mention domain?) - String Length × Domain Length: len(string) * len(domain) - String Embedding × Domain Features: string_embedding · domain_features - TLD Type × String Length: indicator(tld_type) * len(string)

Use Cases: - Description × Domain → description-domain relationships - Comment × Domain Type → comment-domain alignment

Implementation: - Extract domain features - Extract string statistics and embeddings - Compute interactions - MLP-ize interactions

LIST_OF_A_SET × LIST_OF_A_SET¶

Relationship Types: - List Intersection: |union(list_a) ∩ union(list_b)| (common items) - List Jaccard: |union_a ∩ union_b| / |union_a ∪ union_b| - List Length Ratio: len(list_a) / (len(list_b) + eps) - List Length Difference: len(list_a) - len(list_b) - List Diversity Ratio: num_unique_items(list_a) / (num_unique_items(list_b) + eps) - List Embedding Similarity: mean(list_a_embedding) · mean(list_b_embedding)

Use Cases: - Skill List × Tag List → skill-tag overlap - Category List × Feature List → category-feature relationships

Implementation: - Compute list operations (union, intersection) - Extract list statistics (length, diversity) - Extract list embeddings - MLP-ize interactions
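A sketch of the union, intersection, and diversity features, assuming each LIST_OF_A_SET value is a Python list whose elements are sets. The helper name flatten is hypothetical and corresponds to union(list) in the formulas above.

```python
def flatten(list_of_sets):
    """Union of all sets in the list, i.e. union(list) in the formulas."""
    items = set()
    for s in list_of_sets:
        items |= set(s)
    return items


def list_pair_features(list_a, list_b, eps=1e-8):
    """Overlap, length, and diversity features for two lists of sets."""
    union_a, union_b = flatten(list_a), flatten(list_b)
    inter = len(union_a & union_b)
    total = len(union_a | union_b)
    return {
        "intersection": inter,
        "jaccard": inter / total if total else 0.0,
        "length_ratio": len(list_a) / (len(list_b) + eps),
        "length_difference": len(list_a) - len(list_b),
        "diversity_ratio": len(union_a) / (len(union_b) + eps),
    }


skills = [{"python", "sql"}, {"sql", "spark"}]
tags = [{"sql"}]
print(list_pair_features(skills, tags))
```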

LIST_OF_A_SET × VECTOR¶

Relationship Types: - List Length × Vector Norm: len(list) * ||vector|| - List Embedding × Vector: mean(list_embedding) · vector - List Diversity × Vector Mean: num_unique_items(list) * mean(vector) - Vector Normalized by List Length: vector / (len(list) + eps)

Use Cases: - Tag List × Feature Vector → tag-feature alignment - Category List × Embedding → category-embedding relationships

Implementation: - Extract list statistics and embeddings - Extract vector statistics - Compute interactions - MLP-ize interactions

LIST_OF_A_SET × URL¶

Relationship Types: - List Length × URL Depth: len(list) * url_depth - List × URL Domain: indicator(domain in union(list)) (if list contains domains) - List Diversity × URL Length: num_unique_items(list) * len(url) - URL Features × List Embedding: url_features · mean(list_embedding)

Use Cases: - Tag List × URL → tag-url relationships - Category List × URL → category-url alignment

Implementation: - Extract URL features - Extract list statistics and embeddings - Compute interactions - MLP-ize interactions

LIST_OF_A_SET × JSON¶

Relationship Types: - List Length × JSON Key Count: len(list) * num_keys(json) - List × JSON Keys: indicator(any(list_item in json_keys)) - List Diversity × JSON Depth: num_unique_items(list) * json_depth - JSON Structure × List Embedding: json_features · mean(list_embedding)

Use Cases: - Tag List × JSON Structure → tag-json relationships - Category List × JSON Keys → category-key alignment

Implementation: - Extract JSON features - Extract list statistics and embeddings - Compute interactions - MLP-ize interactions

LIST_OF_A_SET × TIMESTAMP¶

Relationship Types: - List Length × Time Delta: len(list) * delta_time - List Embedding × Temporal Features: mean(list_embedding) · temporal_features (dot product; order of operands does not matter) - List Diversity × Age: num_unique_items(list) * age(timestamp)

Use Cases: - Tag List × Event Time → tag-time relationships - Category List × Registration Date → category-time alignment

Implementation: - Extract temporal features - Extract list statistics and embeddings - Compute interactions - MLP-ize interactions

LIST_OF_A_SET × EMAIL¶

Relationship Types: - List Length × Email Length: len(list) * len(email) - List × Email Domain: indicator(email_domain in union(list)) (if list contains domains) - List Diversity × Domain Length: num_unique_items(list) * len(email_domain) - Email Features × List Embedding: email_features · mean(list_embedding)

Use Cases: - Tag List × Email Domain → tag-domain relationships - Category List × Email Provider → category-provider alignment

Implementation: - Extract email features - Extract list statistics and embeddings - Compute interactions - MLP-ize interactions

LIST_OF_A_SET × DOMAIN¶

Relationship Types: - List Length × Domain Length: len(list) * len(domain) - List × Domain: indicator(domain in union(list)) (if list contains domains) - List Diversity × TLD Type: num_unique_items(list) * indicator(tld_type) - Domain Features × List Embedding: domain_features · mean(list_embedding)

Use Cases: - Tag List × Domain → tag-domain relationships - Category List × Domain Type → category-domain alignment

Implementation: - Extract domain features - Extract list statistics and embeddings - Compute interactions - MLP-ize interactions

VECTOR × VECTOR¶

Relationship Types: - Dot Product: vector_a · vector_b (cosine similarity when normalized) - L2 Distance: ||vector_a - vector_b|| (Euclidean distance) - L1 Distance: sum(|vector_a - vector_b|) (Manhattan distance, summed over elements) - Cosine Similarity: (vector_a · vector_b) / (||vector_a|| * ||vector_b|| + eps) - Element-wise Product: vector_a * vector_b (Hadamard product) - Element-wise Ratio: vector_a / (vector_b + eps) (element-wise division) - Vector Norm Ratio: ||vector_a|| / (||vector_b|| + eps) - Vector Mean Difference: mean(vector_a) - mean(vector_b)

Use Cases: - Feature Vector × Embedding Vector → feature-embedding alignment - User Vector × Item Vector → user-item similarity - Query Vector × Document Vector → query-document relevance

Implementation: - Compute vector operations (dot product, distance, similarity) - Extract vector statistics (norm, mean, std) - MLP-ize interactions
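The scalar-valued vector operations above can be illustrated in pure Python as follows. This is a sketch for clarity rather than speed; a real implementation would presumably use a tensor library, and the function name is an assumption.

```python
import math


def vector_pair_features(a, b, eps=1e-8):
    """Dot product, distances, and similarity for equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return {
        "dot": dot,
        "l2_distance": math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
        "l1_distance": sum(abs(x - y) for x, y in zip(a, b)),
        "cosine": dot / (norm_a * norm_b + eps),
        "norm_ratio": norm_a / (norm_b + eps),
        "mean_difference": sum(a) / len(a) - sum(b) / len(b),
    }


print(vector_pair_features([3.0, 4.0], [4.0, 3.0]))
```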

VECTOR × URL¶

Relationship Types: - Vector × URL Features: vector · url_features (dot product) - Vector Norm × URL Depth: ||vector|| * url_depth - Vector Mean × URL Length: mean(vector) * len(url) - URL Features Normalized by Vector Norm: url_features / (||vector|| + eps)

Use Cases: - Embedding Vector × URL → embedding-url relationships - Feature Vector × URL Structure → feature-url alignment

Implementation: - Extract URL features - Extract vector statistics - Compute interactions - MLP-ize interactions

VECTOR × JSON¶

Relationship Types: - Vector × JSON Features: vector · json_features (dot product) - Vector Norm × JSON Key Count: ||vector|| * num_keys(json) - Vector Mean × JSON Depth: mean(vector) * json_depth - JSON Features Normalized by Vector Norm: json_features / (||vector|| + eps)

Use Cases: - Embedding Vector × JSON Structure → embedding-json relationships - Feature Vector × JSON Keys → feature-json alignment

Implementation: - Extract JSON features - Extract vector statistics - Compute interactions - MLP-ize interactions

VECTOR × TIMESTAMP¶

Relationship Types: - Vector × Temporal Features: vector · temporal_features (dot product) - Vector Norm × Time Delta: ||vector|| * delta_time - Vector Mean × Age: mean(vector) * age(timestamp) - Temporal Features Normalized by Vector Norm: temporal_features / (||vector|| + eps)

Use Cases: - Embedding Vector × Time → embedding-time relationships - Feature Vector × Temporal Features → feature-time alignment

Implementation: - Extract temporal features - Extract vector statistics - Compute interactions - MLP-ize interactions

VECTOR × EMAIL¶

Relationship Types: - Vector × Email Features: vector · email_features (dot product) - Vector Norm × Email Length: ||vector|| * len(email) - Vector Mean × Domain Length: mean(vector) * len(email_domain) - Email Features Normalized by Vector Norm: email_features / (||vector|| + eps)

Use Cases: - Embedding Vector × Email → embedding-email relationships - Feature Vector × Email Domain → feature-domain alignment

Implementation: - Extract email features - Extract vector statistics - Compute interactions - MLP-ize interactions

VECTOR × DOMAIN¶

Relationship Types: - Vector × Domain Features: vector · domain_features (dot product) - Vector Norm × Domain Length: ||vector|| * len(domain) - Vector Mean × TLD Type: mean(vector) * indicator(tld_type) - Domain Features Normalized by Vector Norm: domain_features / (||vector|| + eps)

Use Cases: - Embedding Vector × Domain → embedding-domain relationships - Feature Vector × Domain Type → feature-domain alignment

Implementation: - Extract domain features - Extract vector statistics - Compute interactions - MLP-ize interactions

URL × URL¶

Relationship Types: - Domain Match: indicator(domain_a == domain_b) (same domain?) - TLD Match: indicator(tld_a == tld_b) (same TLD?) - Path Depth Difference: url_depth_a - url_depth_b - Path Depth Ratio: url_depth_a / (url_depth_b + eps) - URL Length Ratio: len(url_a) / (len(url_b) + eps) - URL Embedding Similarity: url_embedding_a · url_embedding_b

Use Cases: - Source URL × Destination URL → source-destination relationships - Referrer × Current URL → referrer-current alignment

Implementation: - Extract URL features (domain, TLD, depth, length) - Extract URL embeddings - Compute similarity metrics - MLP-ize interactions
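A sketch of the structural URL × URL features using urllib.parse. Taking the TLD as the last dot-separated label of the host is a simplifying assumption (it mishandles multi-label suffixes such as co.uk), and the embedding similarity is omitted.

```python
from urllib.parse import urlparse


def url_parts(url):
    """Return (host, tld, depth) for a URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Naive TLD: last label of the host (assumption, see lead-in).
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    depth = len([seg for seg in parsed.path.split("/") if seg])
    return host, tld, depth


def url_pair_features(url_a, url_b, eps=1e-8):
    """Match indicators and depth/length comparisons for two URLs."""
    host_a, tld_a, depth_a = url_parts(url_a)
    host_b, tld_b, depth_b = url_parts(url_b)
    return {
        "domain_match": 1.0 if host_a == host_b else 0.0,
        "tld_match": 1.0 if tld_a == tld_b else 0.0,
        "depth_difference": depth_a - depth_b,
        "depth_ratio": depth_a / (depth_b + eps),
        "length_ratio": len(url_a) / (len(url_b) + eps),
    }


print(url_pair_features(
    "https://shop.example.com/cart/items",
    "https://blog.example.org/post",
))
```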

URL × JSON¶

Relationship Types: - URL Depth × JSON Depth: url_depth * json_depth - URL × JSON Keys: indicator(any(url_component in json_keys)) - URL Length × JSON Key Count: len(url) * num_keys(json) - URL Features × JSON Structure: url_features · json_features

Use Cases: - URL × JSON Response → url-response relationships - URL Structure × JSON Data → structure-data alignment

Implementation: - Extract URL features - Extract JSON features - Compute interactions - MLP-ize interactions

URL × TIMESTAMP¶

Relationship Types: - URL Depth × Time Delta: url_depth * delta_time - URL Features × Temporal Features: url_features · temporal_features - URL Length × Age: len(url) * age(timestamp) - Temporal Features × URL Embedding: temporal_features · url_embedding

Use Cases: - URL × Access Time → url-time relationships - URL Structure × Event Time → structure-time alignment

Implementation: - Extract URL features - Extract temporal features - Compute interactions - MLP-ize interactions

URL × EMAIL¶

Relationship Types: - URL Domain × Email Domain: indicator(url_domain == email_domain) (same domain?) - URL Depth × Email Length: url_depth * len(email) - URL Features × Email Features: url_features · email_features - Email Domain Length × URL Length: len(email_domain) * len(url)

Use Cases: - URL × Email Domain → url-email relationships - URL Structure × Email Provider → structure-provider alignment

Implementation: - Extract URL features - Extract email features - Compute interactions - MLP-ize interactions

URL × DOMAIN¶

Relationship Types: - URL Domain × Domain: indicator(url_domain == domain) (same domain?) - URL Depth × Domain Length: url_depth * len(domain) - URL Features × Domain Features: url_features · domain_features - TLD Match: indicator(url_tld == domain_tld) (same TLD?)

Use Cases: - URL × Domain → url-domain relationships - URL Structure × Domain Type → structure-domain alignment

Implementation: - Extract URL features - Extract domain features - Compute interactions - MLP-ize interactions

JSON × JSON¶

Relationship Types: - JSON Depth Difference: json_depth_a - json_depth_b - JSON Key Overlap: |json_keys_a ∩ json_keys_b| (common keys) - JSON Key Jaccard: |keys_a ∩ keys_b| / |keys_a ∪ keys_b| - JSON Key Count Ratio: num_keys(json_a) / (num_keys(json_b) + eps) - JSON Structure Similarity: json_features_a · json_features_b

Use Cases: - Request JSON × Response JSON → request-response relationships - JSON Schema × JSON Data → schema-data alignment

Implementation: - Extract JSON features (depth, keys, structure) - Extract JSON embeddings - Compute similarity metrics - MLP-ize interactions

JSON × TIMESTAMP¶

Relationship Types: - JSON Depth × Time Delta: json_depth * delta_time - JSON Features × Temporal Features: json_features · temporal_features - JSON Key Count × Age: num_keys(json) * age(timestamp) - Temporal Features × JSON Embedding: temporal_features · json_embedding

Use Cases: - JSON × Creation Time → json-time relationships - JSON Structure × Event Time → structure-time alignment

Implementation: - Extract JSON features - Extract temporal features - Compute interactions - MLP-ize interactions

JSON × EMAIL¶

Relationship Types: - JSON Key Count × Email Length: num_keys(json) * len(email) - JSON Features × Email Features: json_features · email_features - JSON Depth × Domain Length: json_depth * len(email_domain) - Email Features × JSON Embedding: email_features · json_embedding

Use Cases: - JSON × Email Domain → json-email relationships - JSON Structure × Email Provider → structure-provider alignment

Implementation: - Extract JSON features - Extract email features - Compute interactions - MLP-ize interactions

JSON × DOMAIN¶

Relationship Types: - JSON Key Count × Domain Length: num_keys(json) * len(domain) - JSON Features × Domain Features: json_features · domain_features - JSON Depth × TLD Type: json_depth * indicator(tld_type) - Domain Features × JSON Embedding: domain_features · json_embedding

Use Cases: - JSON × Domain → json-domain relationships - JSON Structure × Domain Type → structure-domain alignment

Implementation: - Extract JSON features - Extract domain features - Compute interactions - MLP-ize interactions

TIMESTAMP × TIMESTAMP¶

Relationship Types: - Time Delta: timestamp_a - timestamp_b (signed time difference; take the absolute value if direction does not matter) - Time Ratio: timestamp_a / (timestamp_b + eps) (relative time) - Age Difference: age(timestamp_a) - age(timestamp_b) (relative ages) - Temporal Feature Differences: hour_a - hour_b, day_a - day_b, month_a - month_b - Same Day Indicator: indicator(day_a == day_b) (same day?) - Same Week Indicator: indicator(week_a == week_b) (same week?) - Same Month Indicator: indicator(month_a == month_b) (same month?)

Use Cases: - Registration Date × Last Login → registration-login relationships - Purchase Date × Ship Date → purchase-ship alignment - Start Time × End Time → duration relationships

Implementation: - Extract temporal features (hour, day, week, month, age) - Compute time differences and ratios - Extract temporal indicators (same day/week/month) - MLP-ize interactions
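The differences and indicators might be computed as in this sketch. It interprets "same day", "same week", and "same month" as the same calendar date, ISO week, and year-month respectively; that interpretation is an assumption, since the formulas above could also be read as comparing the bare day, week, or month number.

```python
from datetime import datetime


def timestamp_pair_features(a, b):
    """Signed delta plus same-day/week/month indicators for two datetimes."""
    return {
        "delta_seconds": (a - b).total_seconds(),
        "hour_difference": a.hour - b.hour,
        "same_day": 1.0 if a.date() == b.date() else 0.0,
        # Compare (ISO year, ISO week) so weeks in different years differ.
        "same_week": 1.0 if a.isocalendar()[:2] == b.isocalendar()[:2] else 0.0,
        "same_month": 1.0 if (a.year, a.month) == (b.year, b.month) else 0.0,
    }


shipped = datetime(2024, 3, 6, 18, 0)
purchased = datetime(2024, 3, 4, 9, 30)
print(timestamp_pair_features(shipped, purchased))
```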

TIMESTAMP × EMAIL¶

Relationship Types: - Age × Email Length: age(timestamp) * len(email) - Temporal Features × Email Features: temporal_features · email_features - Time Delta × Domain Length: delta_time * len(email_domain) - Email Features × Temporal Embedding: email_features · temporal_embedding

Use Cases: - Registration Date × Email Domain → registration-email relationships - Event Time × Email Provider → time-provider alignment

Implementation: - Extract temporal features - Extract email features - Compute interactions - MLP-ize interactions

TIMESTAMP × DOMAIN¶

Relationship Types: - Age × Domain Length: age(timestamp) * len(domain) - Temporal Features × Domain Features: temporal_features · domain_features - Time Delta × TLD Type: delta_time * indicator(tld_type) - Domain Features × Temporal Embedding: domain_features · temporal_embedding

Use Cases: - Registration Date × Domain → registration-domain relationships - Event Time × Domain Type → time-domain alignment

Implementation: - Extract temporal features - Extract domain features - Compute interactions - MLP-ize interactions

EMAIL × EMAIL¶

Relationship Types: - Domain Match: indicator(email_domain_a == email_domain_b) (same domain?) - TLD Match: indicator(tld_a == tld_b) (same TLD?) - Email Length Ratio: len(email_a) / (len(email_b) + eps) - Email Length Difference: len(email_a) - len(email_b) - Domain Length Ratio: len(email_domain_a) / (len(email_domain_b) + eps) - Email Embedding Similarity: email_embedding_a · email_embedding_b

Use Cases: - Sender Email × Recipient Email → sender-recipient relationships - Primary Email × Secondary Email → primary-secondary alignment

Implementation: - Extract email features (domain, TLD, length) - Extract email embeddings - Compute similarity metrics - MLP-ize interactions

EMAIL × DOMAIN¶

Relationship Types: - Email Domain × Domain: indicator(email_domain == domain) (same domain?) - Email Length × Domain Length: len(email) * len(domain) - Email Features × Domain Features: email_features · domain_features - TLD Match: indicator(email_tld == domain_tld) (same TLD?)

Use Cases: - Email × Domain → email-domain relationships - Email Provider × Domain Type → provider-domain alignment

Implementation: - Extract email features - Extract domain features - Compute interactions - MLP-ize interactions
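A sketch of the match indicators and length interaction for EMAIL × DOMAIN. Splitting the address on the last "@" and taking the last dot-separated label as the TLD are simplifying assumptions, and the helper names are illustrative.

```python
def email_domain(email):
    """Return the lowercased part after the last '@', or '' if absent."""
    return email.rsplit("@", 1)[-1].lower() if "@" in email else ""


def naive_tld(domain):
    """Last label of a domain (assumption: ignores multi-label suffixes)."""
    return domain.rsplit(".", 1)[-1].lower() if "." in domain else ""


def email_domain_features(email, domain):
    """Compare an email address against a standalone domain column."""
    e_dom = email_domain(email)
    return {
        "domain_match": 1.0 if e_dom == domain.lower() else 0.0,
        "tld_match": 1.0 if naive_tld(e_dom) == naive_tld(domain) else 0.0,
        "length_product": len(email) * len(domain),
    }


print(email_domain_features("alice@mail.example.com", "example.com"))
```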

DOMAIN × DOMAIN¶

Relationship Types: - Domain Match: indicator(domain_a == domain_b) (same domain?) - TLD Match: indicator(tld_a == tld_b) (same TLD?) - Domain Length Ratio: len(domain_a) / (len(domain_b) + eps) - Domain Length Difference: len(domain_a) - len(domain_b) - Subdomain Relationship: indicator(domain_a is subdomain of domain_b or vice versa) - Domain Embedding Similarity: domain_embedding_a · domain_embedding_b

Use Cases: - Source Domain × Destination Domain → source-destination relationships - Primary Domain × Secondary Domain → primary-secondary alignment

Implementation: - Extract domain features (length, TLD, subdomain) - Extract domain embeddings - Compute similarity metrics - MLP-ize interactions
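The subdomain relationship is easy to get wrong with a plain suffix check, because "notexample.com" ends with "example.com" without being a subdomain of it. The sketch below requires a label boundary; function names and the naive last-label TLD are assumptions, and the embedding similarity is omitted.

```python
def is_subdomain(child, parent):
    """True if child is a strict subdomain of parent (label-aligned)."""
    child = child.lower().rstrip(".")
    parent = parent.lower().rstrip(".")
    return child != parent and child.endswith("." + parent)


def domain_pair_features(domain_a, domain_b, eps=1e-8):
    """Match, length, and subdomain features for two domain names."""
    a, b = domain_a.lower(), domain_b.lower()
    related = is_subdomain(a, b) or is_subdomain(b, a)
    return {
        "domain_match": 1.0 if a == b else 0.0,
        # Naive TLD: last dot-separated label (assumption).
        "tld_match": 1.0 if a.rsplit(".", 1)[-1] == b.rsplit(".", 1)[-1] else 0.0,
        "length_ratio": len(a) / (len(b) + eps),
        "length_difference": len(a) - len(b),
        "subdomain_relationship": 1.0 if related else 0.0,
    }


print(domain_pair_features("api.example.com", "example.com"))
```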

Summary¶

Implementation Strategy¶

1. Type-Aware Relationship Extraction
   - Detect column types from col_types dict
   - Select appropriate interaction functions based on type pairs
   - Compute relationship features for each pair
2. Feature Selection
   - Limit to top-N pairs by MI or correlation (if max_pairwise_ratios is set)
   - Prioritize high-MI pairs for relationship tokens
   - Use upstream hints to weight relationship importance
3. MLP-ization
   - Each relationship type gets its own MLP (or shared MLP per category)
   - Project all relationship features to d_model dimension
   - Normalize outputs for stable training
4. Masking
   - Zero out relationships involving masked columns
   - Handle missing values gracefully (use epsilon for divisions)
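The type-aware dispatch and masking steps could be organized around a registry keyed by type pair, as in the sketch below. Everything except the col_types dict is hypothetical: the registry layout, the function names, and the feature-key format are illustrative and not the actual RelationshipFeatureExtractor API. Feature selection and the MLP projection are omitted.

```python
from itertools import combinations


def scalar_scalar(a, b, eps=1e-8):
    return {"ratio": a / (b + eps), "difference": a - b}


def string_scalar(text, value, eps=1e-8):
    return {
        "scalar_x_length": value * len(text),
        "scalar_per_char": value / (len(text) + eps),
    }


# Keys are type pairs in alphabetical order; values pair the interaction
# function with the feature names it emits (needed to zero masked pairs).
INTERACTIONS = {
    ("SCALAR", "SCALAR"): (scalar_scalar, ("ratio", "difference")),
    ("FREE_STRING", "SCALAR"): (string_scalar, ("scalar_x_length", "scalar_per_char")),
}


def extract_relationships(row, col_types, masked=frozenset()):
    """Compute relationship features for every supported column pair."""
    features = {}
    for col_a, col_b in combinations(sorted(col_types), 2):
        # Order the two columns by type so they match the registry key.
        (type_a, name_a), (type_b, name_b) = sorted(
            [(col_types[col_a], col_a), (col_types[col_b], col_b)]
        )
        entry = INTERACTIONS.get((type_a, type_b))
        if entry is None:
            continue  # no interaction registered for this type pair
        fn, names = entry
        if name_a in masked or name_b in masked:
            values = {n: 0.0 for n in names}  # zero out masked pairs
        else:
            values = fn(row[name_a], row[name_b])
        for n, v in values.items():
            features[name_a + "*" + name_b + ":" + n] = v
    return features


col_types = {"price": "SCALAR", "quantity": "SCALAR", "title": "FREE_STRING"}
row = {"price": 10.0, "quantity": 2.0, "title": "mug"}
print(extract_relationships(row, col_types))
print(extract_relationships(row, col_types, masked={"quantity"}))
```

Sorting each pair by type means only one ordering of every type pair needs a registry entry, which mirrors the document listing each unordered pair once.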

This comprehensive enumeration provides a roadmap for implementing type-aware relationship features in the RelationshipFeatureExtractor.