Column Type Interaction Matrix

This document enumerates the pairwise interactions among the column types listed below (55 unordered pairs, counting each type paired with itself) and proposes relationship features for each pair.

Column Types

  1. SCALAR - Numeric values (integers, floats)
  2. SET - Categorical sets (one-hot encoded categories)
  3. FREE_STRING - Free-form text strings
  4. LIST_OF_A_SET - Lists of categorical values
  5. VECTOR - Fixed-length numeric vectors
  6. URL - URLs (can be treated as strings with domain/path structure)
  7. JSON - Structured JSON data
  8. TIMESTAMP - Date/time values
  9. EMAIL - Email addresses (can be treated as strings with domain structure)
  10. DOMAIN - Domain names (can be treated as strings)
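
The pair count follows directly from the ten types above. As a minimal sketch (the `ColumnType` enum and `interaction_pairs` helper are illustrative names, not part of any existing codebase), the unordered pairs can be enumerated like this:

```python
from enum import Enum
from itertools import combinations_with_replacement


class ColumnType(Enum):
    """The ten column types listed above (the numeric values are arbitrary)."""
    SCALAR = 1
    SET = 2
    FREE_STRING = 3
    LIST_OF_A_SET = 4
    VECTOR = 5
    URL = 6
    JSON = 7
    TIMESTAMP = 8
    EMAIL = 9
    DOMAIN = 10


def interaction_pairs():
    """Return every unordered type pair, including a type paired with itself."""
    return list(combinations_with_replacement(ColumnType, 2))


pairs = interaction_pairs()
print(len(pairs))  # 55
print(pairs[0][0].name, "×", pairs[0][1].name)
```

Ten types give 10 self-pairs plus 45 cross-pairs, which matches the sections of the matrix that follows.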

Interaction Matrix

SCALAR × SCALAR

Relationship Types:
  - Ratio: scalar_a / (scalar_b + eps) (division with epsilon)
  - Difference: scalar_a - scalar_b
  - Product: scalar_a * scalar_b
  - Sum: scalar_a + scalar_b
  - Normalized Difference: (scalar_a - scalar_b) / (scalar_a + scalar_b + eps) (relative difference)
  - Log Ratio: log(scalar_a + eps) - log(scalar_b + eps) (scale-invariant)
  - Power Ratio: scalar_a^p / scalar_b^p (for specific powers p ∈ {0.5, 1, 2})

Use Cases:
  - Age/Income ratio → financial stability indicator
  - Price/Quantity → unit price
  - Revenue/Expenses → revenue-to-expense ratio (a proxy for profit margin)
  - Distance/Time → speed

Implementation:
  - Compute element-wise ratios, differences, products
  - MLP-ize each relationship type separately
  - Can limit to top-N pairs by mutual information (MI) or correlation
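
The scalar-pair formulas above can be sketched for a single row as follows. This is an illustration under stated assumptions: the function name, the dictionary keys, and the `eps` value of 1e-8 are made up here, and the log ratio assumes non-negative inputs.

```python
import math

EPS = 1e-8  # assumed value; the document only names "eps"


def scalar_pair_features(a: float, b: float, eps: float = EPS) -> dict:
    """Compute the SCALAR × SCALAR relationship features for one row.

    Assumes a and b are non-negative, so that the logarithms are defined.
    """
    return {
        "ratio": a / (b + eps),
        "difference": a - b,
        "product": a * b,
        "sum": a + b,
        "normalized_difference": (a - b) / (a + b + eps),
        "log_ratio": math.log(a + eps) - math.log(b + eps),
    }


features = scalar_pair_features(10.0, 4.0)
print(features["difference"])  # 6.0
```

For columns that can be negative, the additive epsilon does not protect against a denominator near zero, so a sign-aware guard would be needed.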


SCALAR × SET

Relationship Types:
  - Set Cardinality × Scalar: scalar * |set| (scalar weighted by set size)
  - Set Membership Indicator × Scalar: For each category in the vocabulary, compute scalar * indicator(category ∈ set)
  - Set Statistics × Scalar: If set values are numeric-like, compute scalar * mean(set_values), scalar * max(set_values), etc.
  - Scalar Normalized by Set Size: scalar / (|set| + eps) (per-member average)
  - Scalar × Set Intersection Size: If multiple sets, scalar * |set_1 ∩ set_2|

Use Cases:
  - Age × Job Categories → age-weighted job type
  - Price × Product Categories → category-specific pricing
  - Count × Set Membership → frequency-weighted features
  - Revenue × Customer Segments → segment-specific revenue

Implementation:
  - Extract set cardinality as scalar feature
  - Extract set membership indicators (one-hot-like)
  - Compute interactions with scalar
  - MLP-ize interactions
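
A minimal sketch of the cardinality and membership interactions, assuming the set column has a known, fixed vocabulary of categories (the `vocabulary` argument and the feature names are hypothetical):

```python
def scalar_set_features(scalar, members, vocabulary, eps=1e-8):
    """Interact one scalar with one set-valued cell.

    `vocabulary` is the assumed list of all categories the column can contain.
    """
    feats = {
        "scalar_x_cardinality": scalar * len(members),
        "scalar_per_member": scalar / (len(members) + eps),
    }
    # One feature per category: the scalar if the category is present, else 0.
    for category in vocabulary:
        feats[f"scalar_x_has_{category}"] = scalar * float(category in members)
    return feats


feats = scalar_set_features(30.0, {"engineer", "manager"},
                            ["engineer", "manager", "analyst"])
print(feats["scalar_x_cardinality"])  # 60.0
```

The membership features grow linearly with vocabulary size, which is one reason to limit them to the most frequent categories.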


SCALAR × FREE_STRING

Relationship Types:
  - String Length × Scalar: scalar * len(string) (scalar weighted by string length)
  - String Embedding Statistics × Scalar: scalar * mean(string_embedding), scalar * std(string_embedding)
  - Scalar Normalized by String Length: scalar / (len(string) + eps)
  - Scalar × String Token Count: scalar * num_tokens(string)

Use Cases:
  - Age × Description Length → age-weighted description complexity
  - Price × Product Description → description-weighted pricing
  - Count × Text Length → frequency-weighted text features

Implementation:
  - Extract string statistics (length, token count, embedding stats)
  - Compute interactions with scalar
  - MLP-ize interactions


SCALAR × LIST_OF_A_SET

Relationship Types:
  - List Length × Scalar: scalar * len(list) (scalar weighted by list size)
  - List Cardinality × Scalar: scalar * sum(|set_i| for set_i in list) (total set members)
  - Scalar Normalized by List Length: scalar / (len(list) + eps)
  - List Diversity × Scalar: scalar * num_unique_items(list) (scalar weighted by diversity)

Use Cases:
  - Age × List of Skills → age-weighted skill count
  - Price × List of Features → feature-count-weighted pricing
  - Count × List of Tags → tag-frequency-weighted features

Implementation:
  - Extract list statistics (length, cardinality, diversity)
  - Compute interactions with scalar
  - MLP-ize interactions


SCALAR × VECTOR

Relationship Types:
  - Scalar × Vector Element-wise: scalar * vector (scalar scales entire vector)
  - Scalar × Vector Norm: scalar * ||vector|| (scalar weighted by vector magnitude)
  - Scalar × Vector Mean: scalar * mean(vector) (scalar weighted by average)
  - Scalar × Vector Dot Product: If multiple vectors, scalar * (vector_a · vector_b)
  - Vector Normalized by Scalar: vector / (scalar + eps) (vector divided by scalar)

Use Cases:
  - Age × Feature Vector → age-weighted feature importance
  - Price × Embedding Vector → price-weighted semantic features
  - Count × Vector → frequency-weighted vector features

Implementation:
  - Extract vector statistics (norm, mean, std)
  - Compute element-wise products with scalar
  - MLP-ize interactions
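
The scalar-vector interactions can be sketched with plain Python lists. The function name and keys are illustrative, and the norm is assumed to be the L2 norm, which the document does not state explicitly.

```python
import math


def scalar_vector_features(scalar, vector, eps=1e-8):
    """Interact one scalar with one fixed-length vector (L2 norm assumed)."""
    norm = math.sqrt(sum(x * x for x in vector))
    mean = sum(vector) / len(vector)
    return {
        "scaled_vector": [scalar * x for x in vector],
        "scalar_x_norm": scalar * norm,
        "scalar_x_mean": scalar * mean,
        "vector_over_scalar": [x / (scalar + eps) for x in vector],
    }


feats = scalar_vector_features(2.0, [3.0, 4.0])
print(feats["scaled_vector"])  # [6.0, 8.0]
```

Note that the element-wise features keep the vector's dimensionality, while the norm and mean interactions each reduce to a single number.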


SCALAR × URL

Relationship Types:
  - URL Depth × Scalar: scalar * url_depth (scalar weighted by URL path depth)
  - Domain Length × Scalar: scalar * len(domain) (scalar weighted by domain name length)
  - Has Query × Scalar: scalar * indicator(has_query_params) (binary indicator)
  - Scalar Normalized by URL Length: scalar / (len(url) + eps)

Use Cases:
  - Age × URL Depth → age-weighted navigation depth
  - Price × Domain Type → domain-specific pricing
  - Count × URL Structure → structure-frequency-weighted features

Implementation:
  - Extract URL features (depth, domain, query params)
  - Compute interactions with scalar
  - MLP-ize interactions


SCALAR × JSON

Relationship Types:
  - JSON Depth × Scalar: scalar * json_depth (scalar weighted by nesting depth)
  - JSON Key Count × Scalar: scalar * num_keys(json) (scalar weighted by key count)
  - JSON Value Types × Scalar: scalar * num_numeric_values(json), scalar * num_string_values(json)
  - Scalar Normalized by JSON Size: scalar / (json_size + eps)

Use Cases:
  - Age × JSON Complexity → age-weighted data complexity
  - Price × JSON Structure → structure-weighted pricing
  - Count × JSON Keys → key-frequency-weighted features

Implementation:
  - Extract JSON statistics (depth, key count, value types)
  - Compute interactions with scalar
  - MLP-ize interactions


SCALAR × TIMESTAMP

Relationship Types:
  - Time Delta × Scalar: scalar * delta_time (scalar weighted by time difference)
  - Age (from timestamp) × Scalar: scalar * age(timestamp) (scalar weighted by age)
  - Temporal Features × Scalar: scalar * hour_of_day, scalar * day_of_week, scalar * month
  - Scalar Normalized by Time: scalar / (time_since_epoch + eps)

Use Cases:
  - Age × Registration Date → account-age-weighted age
  - Price × Purchase Date → date-weighted pricing
  - Count × Timestamp → time-frequency-weighted features

Implementation:
  - Extract temporal features (hour, day, month, age, delta)
  - Compute interactions with scalar
  - MLP-ize interactions
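
As a hedged sketch of the temporal interactions, the example below measures `age(timestamp)` in days relative to an explicit reference time. Both the unit and the `now` parameter are assumptions, since the document does not define how age is computed.

```python
from datetime import datetime


def scalar_timestamp_features(scalar, ts, now):
    """Interact a scalar with basic temporal features of a timestamp.

    `now` is an assumed reference time used to turn the timestamp into an age.
    """
    age_days = (now - ts).total_seconds() / 86400.0
    return {
        "scalar_x_age_days": scalar * age_days,
        "scalar_x_hour": scalar * ts.hour,
        "scalar_x_day_of_week": scalar * ts.weekday(),  # Monday == 0
        "scalar_x_month": scalar * ts.month,
    }


feats = scalar_timestamp_features(
    2.0, datetime(2024, 3, 12, 9, 0), now=datetime(2024, 3, 15, 9, 0)
)
print(feats["scalar_x_age_days"])  # 6.0
```

Passing the reference time explicitly keeps the features reproducible, whereas calling the current clock inside the function would make them drift between runs.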


SCALAR × EMAIL

Relationship Types:
  - Email Domain Length × Scalar: scalar * len(email_domain) (scalar weighted by domain length)
  - Has Subdomain × Scalar: scalar * indicator(has_subdomain) (binary indicator)
  - Email Length × Scalar: scalar * len(email) (scalar weighted by email length)
  - Scalar Normalized by Email Length: scalar / (len(email) + eps)

Use Cases:
  - Age × Email Domain → domain-age correlation
  - Price × Email Provider → provider-specific pricing
  - Count × Email Structure → structure-frequency-weighted features

Implementation:
  - Extract email features (domain, subdomain, length)
  - Compute interactions with scalar
  - MLP-ize interactions


SCALAR × DOMAIN

Relationship Types:
  - Domain Length × Scalar: scalar * len(domain) (scalar weighted by domain length)
  - TLD Type × Scalar: scalar * indicator(tld_type) (categorical: .com, .org, etc.)
  - Has Subdomain × Scalar: scalar * indicator(has_subdomain) (binary indicator)
  - Scalar Normalized by Domain Length: scalar / (len(domain) + eps)

Use Cases:
  - Age × Domain Type → domain-age correlation
  - Price × Domain Category → category-specific pricing
  - Count × Domain Structure → structure-frequency-weighted features

Implementation:
  - Extract domain features (length, TLD, subdomain)
  - Compute interactions with scalar
  - MLP-ize interactions


SET × SET

Relationship Types:
  - Jaccard Similarity: |set_a ∩ set_b| / |set_a ∪ set_b| (overlap ratio)
  - Intersection Size: |set_a ∩ set_b| (number of common elements)
  - Union Size: |set_a ∪ set_b| (total unique elements)
  - Set Difference: |set_a - set_b| (elements in A but not B)
  - Symmetric Difference: |set_a Δ set_b| (elements in exactly one set)
  - Subset Indicator: indicator(set_a ⊆ set_b) (binary: is A subset of B?)
  - Cardinality Ratio: |set_a| / (|set_b| + eps) (size ratio)
  - Cardinality Difference: |set_a| - |set_b| (size difference)

Use Cases:
  - Job Categories × Skills → category-skill overlap
  - Product Categories × Customer Segments → category-segment alignment
  - Tags × Categories → tag-category relationships

Implementation:
  - Compute set operations (intersection, union, difference)
  - Extract set statistics (cardinality, similarity)
  - MLP-ize relationship features
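
The set operations above map directly onto Python's built-in `set` type. The sketch below is illustrative (function and key names are made up here) and assumes a Jaccard similarity of 0.0 when both sets are empty, a convention the document does not specify.

```python
def set_pair_features(a: set, b: set, eps: float = 1e-8) -> dict:
    """Compute the SET × SET relationship features for one row."""
    inter = a & b
    union = a | b
    return {
        # Assumed convention: two empty sets have Jaccard similarity 0.0.
        "jaccard": len(inter) / len(union) if union else 0.0,
        "intersection_size": len(inter),
        "union_size": len(union),
        "difference_size": len(a - b),
        "symmetric_difference_size": len(a ^ b),
        "is_subset": float(a <= b),
        "cardinality_ratio": len(a) / (len(b) + eps),
        "cardinality_difference": len(a) - len(b),
    }


feats = set_pair_features({"python", "sql", "ml"}, {"sql", "ml", "stats", "viz"})
print(feats["jaccard"])  # 0.4
```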


SET × FREE_STRING

Relationship Types:
  - Set Cardinality × String Length: |set| * len(string) (size interaction)
  - Set Membership × String Embedding: For each category in the vocabulary, compute indicator(category ∈ set) * string_embedding
  - String Contains Set Member: indicator(any(set_member in string)) (binary: does string mention any set member?)
  - Set Size Normalized by String Length: |set| / (len(string) + eps)
  - String Embedding × Set Embedding: mean(string_embedding) · mean(set_embedding) (dot product; equals cosine similarity only when both vectors are L2-normalized)

Use Cases:
  - Job Categories × Description → category-description alignment
  - Tags × Text → tag-text relevance
  - Categories × Comments → category-comment relationships

Implementation:
  - Extract set cardinality and embeddings
  - Extract string statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


SET × LIST_OF_A_SET

Relationship Types:
  - Set × List Intersection: |set ∩ (union of list sets)| (overlap with list)
  - Set × List Jaccard: |set ∩ list_union| / |set ∪ list_union|
  - Set Cardinality × List Length: |set| * len(list)
  - Set × List Diversity: |set| * num_unique_items(list)
  - List Contains Set: indicator(set ⊆ list_union) (binary: is set contained in list?)

Use Cases:
  - Skills × Skill List → skill-list overlap
  - Categories × Tag List → category-tag relationships
  - Features × Feature List → feature-list alignment

Implementation:
  - Compute set-list operations
  - Extract statistics
  - MLP-ize interactions


SET × VECTOR

Relationship Types:
  - Set Cardinality × Vector Norm: |set| * ||vector||
  - Set Embedding × Vector: mean(set_embedding) · vector (dot product)
  - Set Size × Vector Mean: |set| * mean(vector)
  - Vector Normalized by Set Size: vector / (|set| + eps)

Use Cases:
  - Categories × Feature Vector → category-feature alignment
  - Tags × Embedding → tag-embedding relationships

Implementation:
  - Extract set statistics and embeddings
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


SET × URL

Relationship Types:
  - Set Cardinality × URL Depth: |set| * url_depth
  - Set × URL Domain: indicator(domain in set) (if set contains domains)
  - Set Size × URL Length: |set| * len(url)
  - URL Features × Set Embedding: url_features · mean(set_embedding)

Use Cases:
  - Categories × URL → category-url relationships
  - Tags × URL → tag-url relevance

Implementation:
  - Extract URL features
  - Extract set statistics
  - Compute interactions
  - MLP-ize interactions


SET × JSON

Relationship Types:
  - Set Cardinality × JSON Key Count: |set| * num_keys(json)
  - Set × JSON Keys: indicator(any(set_member in json_keys)) (binary: does JSON have keys matching set?)
  - Set Size × JSON Depth: |set| * json_depth
  - JSON Structure × Set Embedding: json_features · mean(set_embedding)

Use Cases:
  - Categories × JSON Structure → category-json relationships
  - Tags × JSON Keys → tag-key alignment

Implementation:
  - Extract JSON features
  - Extract set statistics
  - Compute interactions
  - MLP-ize interactions


SET × TIMESTAMP

Relationship Types:
  - Set Cardinality × Time Delta: |set| * delta_time
  - Set × Temporal Features: mean(set_embedding) · temporal_features
  - Set Size × Age: |set| * age(timestamp)
  - Temporal Features × Set Embedding: temporal_features · mean(set_embedding) (same value as Set × Temporal Features, since the dot product is commutative)

Use Cases:
  - Categories × Registration Date → category-time relationships
  - Tags × Event Time → tag-time alignment

Implementation:
  - Extract temporal features
  - Extract set statistics
  - Compute interactions
  - MLP-ize interactions


SET × EMAIL

Relationship Types:
  - Set Cardinality × Email Length: |set| * len(email)
  - Set × Email Domain: indicator(email_domain in set) (if set contains domains)
  - Set Size × Domain Length: |set| * len(email_domain)
  - Email Features × Set Embedding: email_features · mean(set_embedding)

Use Cases:
  - Categories × Email Domain → category-domain relationships
  - Tags × Email Provider → tag-provider alignment

Implementation:
  - Extract email features
  - Extract set statistics
  - Compute interactions
  - MLP-ize interactions


SET × DOMAIN

Relationship Types:
  - Set Cardinality × Domain Length: |set| * len(domain)
  - Set × Domain: indicator(domain in set) (if set contains domains)
  - Set Size × TLD Type: |set| * indicator(tld_type)
  - Domain Features × Set Embedding: domain_features · mean(set_embedding)

Use Cases:
  - Categories × Domain → category-domain relationships
  - Tags × Domain Type → tag-domain alignment

Implementation:
  - Extract domain features
  - Extract set statistics
  - Compute interactions
  - MLP-ize interactions


FREE_STRING × FREE_STRING

Relationship Types:
  - String Similarity: Cosine similarity between string embeddings
  - String Length Ratio: len(string_a) / (len(string_b) + eps)
  - String Length Difference: len(string_a) - len(string_b)
  - Token Overlap: |tokens(string_a) ∩ tokens(string_b)| (common tokens)
  - Token Jaccard: |tokens_a ∩ tokens_b| / |tokens_a ∪ tokens_b|
  - Embedding Distance: ||embedding_a - embedding_b|| (L2 distance)
  - Embedding Dot Product: embedding_a · embedding_b (unnormalized; equals cosine similarity only for L2-normalized embeddings)
  - Substring Indicator: indicator(string_a in string_b or string_b in string_a) (containment)

Use Cases:
  - Description × Title → description-title alignment
  - Comment × Review → comment-review similarity
  - Query × Document → query-document relevance

Implementation:
  - Extract string statistics (length, token count)
  - Extract string embeddings
  - Compute similarity metrics
  - MLP-ize interactions
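
The embedding-free string features can be sketched without any model. The example assumes lowercased whitespace tokenization, which is only one possible tokenizer, and the function and key names are illustrative.

```python
def string_pair_features(s1: str, s2: str, eps: float = 1e-8) -> dict:
    """Compute length, token-overlap and containment features for two strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens_1 = set(s1.lower().split())
    tokens_2 = set(s2.lower().split())
    union = tokens_1 | tokens_2
    overlap = tokens_1 & tokens_2
    return {
        "length_ratio": len(s1) / (len(s2) + eps),
        "length_difference": len(s1) - len(s2),
        "token_overlap": len(overlap),
        "token_jaccard": len(overlap) / len(union) if union else 0.0,
        "substring": float(s1 in s2 or s2 in s1),
    }


feats = string_pair_features(
    "red running shoes", "lightweight red running shoes for trails"
)
print(feats["token_jaccard"])  # 0.5
```

The embedding-based features (cosine similarity, L2 distance) would be computed the same way as the VECTOR × VECTOR features once each string has been embedded.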


FREE_STRING × LIST_OF_A_SET

Relationship Types:
  - String Length × List Length: len(string) * len(list)
  - String × List Union: indicator(any(list_item in string)) (does string mention any list item?)
  - String Embedding × List Embedding: string_embedding · mean(list_embedding)
  - List Diversity × String Length: num_unique_items(list) * len(string)

Use Cases:
  - Description × Tag List → description-tag alignment
  - Comment × Category List → comment-category relationships

Implementation:
  - Extract string statistics and embeddings
  - Extract list statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


FREE_STRING × VECTOR

Relationship Types:
  - String Embedding × Vector: string_embedding · vector (dot product)
  - String Length × Vector Norm: len(string) * ||vector||
  - String Statistics × Vector Mean: mean(string_embedding) · mean(vector)
  - Vector Normalized by String Length: vector / (len(string) + eps)

Use Cases:
  - Description × Feature Vector → description-feature alignment
  - Text × Embedding → text-embedding relationships

Implementation:
  - Extract string statistics and embeddings
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


FREE_STRING × URL

Relationship Types:
  - String × URL Domain: indicator(domain in string) (does string mention domain?)
  - String Length × URL Length: len(string) * len(url)
  - String Embedding × URL Features: string_embedding · url_features
  - URL Depth × String Length: url_depth * len(string)

Use Cases:
  - Description × URL → description-url relationships
  - Comment × Link → comment-link alignment

Implementation:
  - Extract URL features
  - Extract string statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


FREE_STRING × JSON

Relationship Types:
  - String × JSON Keys: indicator(any(key in string)) (does string mention any JSON key?)
  - String Length × JSON Key Count: len(string) * num_keys(json)
  - String Embedding × JSON Structure: string_embedding · json_features
  - JSON Depth × String Length: json_depth * len(string)

Use Cases:
  - Description × JSON Structure → description-json relationships
  - Comment × JSON Data → comment-data alignment

Implementation:
  - Extract JSON features
  - Extract string statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


FREE_STRING × TIMESTAMP

Relationship Types:
  - String Length × Time Delta: len(string) * delta_time
  - String Embedding × Temporal Features: string_embedding · temporal_features
  - Age × String Length: age(timestamp) * len(string)
  - Temporal Features × String Statistics: temporal_features · mean(string_embedding)

Use Cases:
  - Comment × Post Date → comment-time relationships
  - Description × Creation Date → description-time alignment

Implementation:
  - Extract temporal features
  - Extract string statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


FREE_STRING × EMAIL

Relationship Types:
  - String × Email Domain: indicator(domain in string) (does string mention email domain?)
  - String Length × Email Length: len(string) * len(email)
  - String Embedding × Email Features: string_embedding · email_features
  - Email Domain Length × String Length: len(email_domain) * len(string)

Use Cases:
  - Comment × Email → comment-email relationships
  - Description × Email Domain → description-domain alignment

Implementation:
  - Extract email features
  - Extract string statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


FREE_STRING × DOMAIN

Relationship Types:
  - String × Domain: indicator(domain in string) (does string mention domain?)
  - String Length × Domain Length: len(string) * len(domain)
  - String Embedding × Domain Features: string_embedding · domain_features
  - TLD Type × String Length: indicator(tld_type) * len(string)

Use Cases:
  - Description × Domain → description-domain relationships
  - Comment × Domain Type → comment-domain alignment

Implementation:
  - Extract domain features
  - Extract string statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


LIST_OF_A_SET × LIST_OF_A_SET

Relationship Types:
  - List Intersection: |union(list_a) ∩ union(list_b)| (common items)
  - List Jaccard: |union_a ∩ union_b| / |union_a ∪ union_b|
  - List Length Ratio: len(list_a) / (len(list_b) + eps)
  - List Length Difference: len(list_a) - len(list_b)
  - List Diversity Ratio: num_unique_items(list_a) / (num_unique_items(list_b) + eps)
  - List Embedding Similarity: mean(list_a_embedding) · mean(list_b_embedding)

Use Cases:
  - Skill List × Tag List → skill-tag overlap
  - Category List × Feature List → category-feature relationships

Implementation:
  - Compute list operations (union, intersection)
  - Extract list statistics (length, diversity)
  - Extract list embeddings
  - MLP-ize interactions
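
A minimal sketch of the list-pair features, assuming each cell is a plain list of categorical values (possibly with repeats), so that `union(list)` is simply the set of distinct items. If each list element were itself a set, the union step would need to flatten those sets first. All names here are illustrative.

```python
def list_pair_features(list_a, list_b, eps=1e-8):
    """Compute overlap and size features for two lists of categorical values."""
    unique_a, unique_b = set(list_a), set(list_b)
    inter = unique_a & unique_b
    union = unique_a | unique_b
    return {
        "intersection_size": len(inter),
        "jaccard": len(inter) / len(union) if union else 0.0,
        "length_ratio": len(list_a) / (len(list_b) + eps),
        "length_difference": len(list_a) - len(list_b),
        "diversity_ratio": len(unique_a) / (len(unique_b) + eps),
    }


feats = list_pair_features(["py", "sql", "sql", "ml"], ["ml", "viz", "stats"])
print(feats["jaccard"])  # 0.2
```

Length counts repeats while diversity does not, so the two ratios diverge whenever a list contains duplicates.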


LIST_OF_A_SET × VECTOR

Relationship Types:
  - List Length × Vector Norm: len(list) * ||vector||
  - List Embedding × Vector: mean(list_embedding) · vector
  - List Diversity × Vector Mean: num_unique_items(list) * mean(vector)
  - Vector Normalized by List Length: vector / (len(list) + eps)

Use Cases:
  - Tag List × Feature Vector → tag-feature alignment
  - Category List × Embedding → category-embedding relationships

Implementation:
  - Extract list statistics and embeddings
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


LIST_OF_A_SET × URL

Relationship Types:
  - List Length × URL Depth: len(list) * url_depth
  - List × URL Domain: indicator(domain in union(list)) (if list contains domains)
  - List Diversity × URL Length: num_unique_items(list) * len(url)
  - URL Features × List Embedding: url_features · mean(list_embedding)

Use Cases:
  - Tag List × URL → tag-url relationships
  - Category List × URL → category-url alignment

Implementation:
  - Extract URL features
  - Extract list statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


LIST_OF_A_SET × JSON

Relationship Types:
  - List Length × JSON Key Count: len(list) * num_keys(json)
  - List × JSON Keys: indicator(any(list_item in json_keys))
  - List Diversity × JSON Depth: num_unique_items(list) * json_depth
  - JSON Structure × List Embedding: json_features · mean(list_embedding)

Use Cases:
  - Tag List × JSON Structure → tag-json relationships
  - Category List × JSON Keys → category-key alignment

Implementation:
  - Extract JSON features
  - Extract list statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


LIST_OF_A_SET × TIMESTAMP

Relationship Types:
  - List Length × Time Delta: len(list) * delta_time
  - List Embedding × Temporal Features: mean(list_embedding) · temporal_features
  - List Diversity × Age: num_unique_items(list) * age(timestamp)
  - Temporal Features × List Embedding: temporal_features · mean(list_embedding) (same value as List Embedding × Temporal Features, since the dot product is commutative)

Use Cases:
  - Tag List × Event Time → tag-time relationships
  - Category List × Registration Date → category-time alignment

Implementation:
  - Extract temporal features
  - Extract list statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


LIST_OF_A_SET × EMAIL

Relationship Types:
  - List Length × Email Length: len(list) * len(email)
  - List × Email Domain: indicator(email_domain in union(list)) (if list contains domains)
  - List Diversity × Domain Length: num_unique_items(list) * len(email_domain)
  - Email Features × List Embedding: email_features · mean(list_embedding)

Use Cases:
  - Tag List × Email Domain → tag-domain relationships
  - Category List × Email Provider → category-provider alignment

Implementation:
  - Extract email features
  - Extract list statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


LIST_OF_A_SET × DOMAIN

Relationship Types:
  - List Length × Domain Length: len(list) * len(domain)
  - List × Domain: indicator(domain in union(list)) (if list contains domains)
  - List Diversity × TLD Type: num_unique_items(list) * indicator(tld_type)
  - Domain Features × List Embedding: domain_features · mean(list_embedding)

Use Cases:
  - Tag List × Domain → tag-domain relationships
  - Category List × Domain Type → category-domain alignment

Implementation:
  - Extract domain features
  - Extract list statistics and embeddings
  - Compute interactions
  - MLP-ize interactions


VECTOR × VECTOR

Relationship Types:
  - Dot Product: vector_a · vector_b (cosine similarity when normalized)
  - L2 Distance: ||vector_a - vector_b|| (Euclidean distance)
  - L1 Distance: ||vector_a - vector_b||_1 (Manhattan distance, i.e. the sum of absolute element-wise differences)
  - Cosine Similarity: (vector_a · vector_b) / (||vector_a|| * ||vector_b|| + eps)
  - Element-wise Product: vector_a * vector_b (Hadamard product)
  - Element-wise Ratio: vector_a / (vector_b + eps) (element-wise division)
  - Vector Norm Ratio: ||vector_a|| / (||vector_b|| + eps)
  - Vector Mean Difference: mean(vector_a) - mean(vector_b)

Use Cases:
  - Feature Vector × Embedding Vector → feature-embedding alignment
  - User Vector × Item Vector → user-item similarity
  - Query Vector × Document Vector → query-document relevance

Implementation:
  - Compute vector operations (dot product, distance, similarity)
  - Extract vector statistics (norm, mean, std)
  - MLP-ize interactions
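
The vector-pair metrics can be sketched in pure Python for two equal-length vectors. The function name and keys are illustrative; a real pipeline would more likely use a batched array library.

```python
import math


def vector_pair_features(a, b, eps=1e-8):
    """Compute similarity and distance features for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return {
        "dot": dot,
        "l2_distance": math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
        "l1_distance": sum(abs(x - y) for x, y in zip(a, b)),
        "cosine": dot / (norm_a * norm_b + eps),
        "norm_ratio": norm_a / (norm_b + eps),
        "mean_difference": sum(a) / len(a) - sum(b) / len(b),
    }


feats = vector_pair_features([3.0, 4.0], [4.0, 3.0])
print(feats["dot"], feats["l1_distance"])  # 24.0 2.0
```

In this example both vectors have norm 5, so the cosine similarity is 24 / 25 = 0.96 while the raw dot product is 24, which shows why the two are interchangeable only after normalization.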


VECTOR × URL

Relationship Types:
  - Vector × URL Features: vector · url_features (dot product)
  - Vector Norm × URL Depth: ||vector|| * url_depth
  - Vector Mean × URL Length: mean(vector) * len(url)
  - URL Features Normalized by Vector Norm: url_features / (||vector|| + eps)

Use Cases:
  - Embedding Vector × URL → embedding-url relationships
  - Feature Vector × URL Structure → feature-url alignment

Implementation:
  - Extract URL features
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


VECTOR × JSON

Relationship Types:
  - Vector × JSON Features: vector · json_features (dot product)
  - Vector Norm × JSON Key Count: ||vector|| * num_keys(json)
  - Vector Mean × JSON Depth: mean(vector) * json_depth
  - JSON Features Normalized by Vector Norm: json_features / (||vector|| + eps)

Use Cases:
  - Embedding Vector × JSON Structure → embedding-json relationships
  - Feature Vector × JSON Keys → feature-json alignment

Implementation:
  - Extract JSON features
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


VECTOR × TIMESTAMP

Relationship Types:
  - Vector × Temporal Features: vector · temporal_features (dot product)
  - Vector Norm × Time Delta: ||vector|| * delta_time
  - Vector Mean × Age: mean(vector) * age(timestamp)
  - Temporal Features Normalized by Vector Norm: temporal_features / (||vector|| + eps)

Use Cases:
  - Embedding Vector × Time → embedding-time relationships
  - Feature Vector × Temporal Features → feature-time alignment

Implementation:
  - Extract temporal features
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


VECTOR × EMAIL

Relationship Types:
  - Vector × Email Features: vector · email_features (dot product)
  - Vector Norm × Email Length: ||vector|| * len(email)
  - Vector Mean × Domain Length: mean(vector) * len(email_domain)
  - Email Features Normalized by Vector Norm: email_features / (||vector|| + eps)

Use Cases:
  - Embedding Vector × Email → embedding-email relationships
  - Feature Vector × Email Domain → feature-domain alignment

Implementation:
  - Extract email features
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


VECTOR × DOMAIN

Relationship Types:
  - Vector × Domain Features: vector · domain_features (dot product)
  - Vector Norm × Domain Length: ||vector|| * len(domain)
  - Vector Mean × TLD Type: mean(vector) * indicator(tld_type)
  - Domain Features Normalized by Vector Norm: domain_features / (||vector|| + eps)

Use Cases:
  - Embedding Vector × Domain → embedding-domain relationships
  - Feature Vector × Domain Type → feature-domain alignment

Implementation:
  - Extract domain features
  - Extract vector statistics
  - Compute interactions
  - MLP-ize interactions


URL × URL

Relationship Types:
  - Domain Match: indicator(domain_a == domain_b) (same domain?)
  - TLD Match: indicator(tld_a == tld_b) (same TLD?)
  - Path Depth Difference: url_depth_a - url_depth_b
  - Path Depth Ratio: url_depth_a / (url_depth_b + eps)
  - URL Length Ratio: len(url_a) / (len(url_b) + eps)
  - URL Embedding Similarity: url_embedding_a · url_embedding_b

Use Cases:
  - Source URL × Destination URL → source-destination relationships
  - Referrer × Current URL → referrer-current alignment

Implementation:
  - Extract URL features (domain, TLD, depth, length)
  - Extract URL embeddings
  - Compute similarity metrics
  - MLP-ize interactions
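
The structural URL features can be extracted with the standard library's `urllib.parse`. In this sketch the "domain" is taken to be the full hostname and the TLD is naively the last dot-separated label, so multi-part public suffixes such as `co.uk` are not handled. Depth is assumed to be the number of non-empty path segments. All helper names are illustrative.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract simple structural features from one URL."""
    parts = urlparse(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    return {
        "domain": host,
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",  # naive TLD
        "depth": len(segments),
        "length": len(url),
        "has_query": bool(parts.query),
    }


def url_pair_features(url_a: str, url_b: str, eps: float = 1e-8) -> dict:
    """Compute the non-embedding URL × URL relationship features."""
    fa, fb = url_features(url_a), url_features(url_b)
    return {
        "domain_match": float(fa["domain"] == fb["domain"]),
        "tld_match": float(fa["tld"] == fb["tld"]),
        "depth_difference": fa["depth"] - fb["depth"],
        "depth_ratio": fa["depth"] / (fb["depth"] + eps),
        "length_ratio": fa["length"] / (fb["length"] + eps),
    }


feats = url_pair_features(
    "https://shop.example.com/a/b/c?x=1", "https://blog.example.org/post"
)
print(feats["depth_difference"])  # 2
```

The same `url_features` helper also covers the depth, domain, and query-parameter features used by the other URL pairings.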


URL × JSON

Relationship Types:
  - URL Depth × JSON Depth: url_depth * json_depth
  - URL × JSON Keys: indicator(any(url_component in json_keys))
  - URL Length × JSON Key Count: len(url) * num_keys(json)
  - URL Features × JSON Structure: url_features · json_features

Use Cases:
  - URL × JSON Response → url-response relationships
  - URL Structure × JSON Data → structure-data alignment

Implementation:
  - Extract URL features
  - Extract JSON features
  - Compute interactions
  - MLP-ize interactions


URL × TIMESTAMP

Relationship Types:
  - URL Depth × Time Delta: url_depth * delta_time
  - URL Features × Temporal Features: url_features · temporal_features
  - URL Length × Age: len(url) * age(timestamp)
  - Temporal Features × URL Embedding: temporal_features · url_embedding

Use Cases:
  - URL × Access Time → url-time relationships
  - URL Structure × Event Time → structure-time alignment

Implementation:
  - Extract URL features
  - Extract temporal features
  - Compute interactions
  - MLP-ize interactions


URL × EMAIL

Relationship Types:
  - URL Domain × Email Domain: indicator(url_domain == email_domain) (same domain?)
  - URL Depth × Email Length: url_depth * len(email)
  - URL Features × Email Features: url_features · email_features
  - Email Domain Length × URL Length: len(email_domain) * len(url)

Use Cases:
  - URL × Email Domain → url-email relationships
  - URL Structure × Email Provider → structure-provider alignment

Implementation:
  - Extract URL features
  - Extract email features
  - Compute interactions
  - MLP-ize interactions


URL × DOMAIN

Relationship Types:
  - URL Domain × Domain: indicator(url_domain == domain) (same domain?)
  - URL Depth × Domain Length: url_depth * len(domain)
  - URL Features × Domain Features: url_features · domain_features
  - TLD Match: indicator(url_tld == domain_tld) (same TLD?)

Use Cases:
  - URL × Domain → url-domain relationships
  - URL Structure × Domain Type → structure-domain alignment

Implementation:
  - Extract URL features
  - Extract domain features
  - Compute interactions
  - MLP-ize interactions


JSON × JSON

Relationship Types:
  - JSON Depth Difference: json_depth_a - json_depth_b
  - JSON Key Overlap: |json_keys_a ∩ json_keys_b| (common keys)
  - JSON Key Jaccard: |keys_a ∩ keys_b| / |keys_a ∪ keys_b|
  - JSON Key Count Ratio: num_keys(json_a) / (num_keys(json_b) + eps)
  - JSON Structure Similarity: json_features_a · json_features_b

Use Cases:
  - Request JSON × Response JSON → request-response relationships
  - JSON Schema × JSON Data → schema-data alignment

Implementation:
  - Extract JSON features (depth, keys, structure)
  - Extract JSON embeddings
  - Compute similarity metrics
  - MLP-ize interactions
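
A minimal sketch of the structural JSON features follows. It makes two assumptions the document leaves open: depth counts nested containers (a scalar has depth 0, a flat object depth 1), and only top-level keys are compared. The helper names are illustrative.

```python
import json


def json_depth(value) -> int:
    """Nesting depth: 0 for scalars, 1 for a flat dict or list, and so on."""
    if isinstance(value, dict):
        return 1 + max((json_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((json_depth(v) for v in value), default=0)
    return 0


def json_pair_features(text_a: str, text_b: str, eps: float = 1e-8) -> dict:
    """Compare two JSON documents by depth and top-level keys."""
    obj_a, obj_b = json.loads(text_a), json.loads(text_b)
    keys_a = set(obj_a) if isinstance(obj_a, dict) else set()
    keys_b = set(obj_b) if isinstance(obj_b, dict) else set()
    union = keys_a | keys_b
    return {
        "depth_difference": json_depth(obj_a) - json_depth(obj_b),
        "key_overlap": len(keys_a & keys_b),
        "key_jaccard": len(keys_a & keys_b) / len(union) if union else 0.0,
        "key_count_ratio": len(keys_a) / (len(keys_b) + eps),
    }


feats = json_pair_features(
    '{"user": {"id": 1, "tags": ["x"]}, "ts": 5}',
    '{"user": {"id": 2}, "status": "ok"}',
)
print(feats["depth_difference"], feats["key_overlap"])  # 1 1
```

Comparing keys at every nesting level (for example as dotted paths) would be a natural extension when the documents share a schema.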


JSON × TIMESTAMP

Relationship Types:
  - JSON Depth × Time Delta: json_depth * delta_time
  - JSON Features × Temporal Features: json_features · temporal_features
  - JSON Key Count × Age: num_keys(json) * age(timestamp)
  - Temporal Features × JSON Embedding: temporal_features · json_embedding

Use Cases:
  - JSON × Creation Time → json-time relationships
  - JSON Structure × Event Time → structure-time alignment

Implementation:
  - Extract JSON features
  - Extract temporal features
  - Compute interactions
  - MLP-ize interactions


JSON × EMAIL

Relationship Types:
  - JSON Key Count × Email Length: num_keys(json) * len(email)
  - JSON Features × Email Features: json_features · email_features
  - JSON Depth × Domain Length: json_depth * len(email_domain)
  - Email Features × JSON Embedding: email_features · json_embedding

Use Cases:
  - JSON × Email Domain → json-email relationships
  - JSON Structure × Email Provider → structure-provider alignment

Implementation:
  - Extract JSON features
  - Extract email features
  - Compute interactions
  - MLP-ize interactions


JSON × DOMAIN

Relationship Types:
  - JSON Key Count × Domain Length: num_keys(json) * len(domain)
  - JSON Features × Domain Features: json_features · domain_features
  - JSON Depth × TLD Type: json_depth * indicator(tld_type)
  - Domain Features × JSON Embedding: domain_features · json_embedding

Use Cases:
  - JSON × Domain → json-domain relationships
  - JSON Structure × Domain Type → structure-domain alignment

Implementation:
  - Extract JSON features
  - Extract domain features
  - Compute interactions
  - MLP-ize interactions


TIMESTAMP × TIMESTAMP

Relationship Types:
  - Time Delta: timestamp_a - timestamp_b (signed time difference)
  - Time Ratio: timestamp_a / (timestamp_b + eps) (relative time)
  - Age Difference: age(timestamp_a) - age(timestamp_b) (relative ages)
  - Temporal Feature Differences: hour_a - hour_b, day_a - day_b, month_a - month_b
  - Same Day Indicator: indicator(day_a == day_b) (same day?)
  - Same Week Indicator: indicator(week_a == week_b) (same week?)
  - Same Month Indicator: indicator(month_a == month_b) (same month?)

Use Cases:
  - Registration Date × Last Login → registration-login relationships
  - Purchase Date × Ship Date → purchase-ship alignment
  - Start Time × End Time → duration relationships

Implementation:
  - Extract temporal features (hour, day, week, month, age)
  - Compute time differences and ratios
  - Extract temporal indicators (same day/week/month)
  - MLP-ize interactions
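
The timestamp-pair features can be sketched with `datetime`. The example reads "same day", "same week" and "same month" as the same calendar date, ISO week, and year-month respectively. That is one interpretation; comparing only the day-of-week or month number would be another. The function and key names are illustrative, and both inputs are assumed to be in the same time zone.

```python
from datetime import datetime


def timestamp_pair_features(a: datetime, b: datetime) -> dict:
    """Compute delta and calendar-alignment features for two timestamps."""
    return {
        "delta_seconds": (a - b).total_seconds(),  # signed: negative if a < b
        "hour_difference": a.hour - b.hour,
        "same_day": float(a.date() == b.date()),
        # ISO (year, week) pair, so weeks from different years never match.
        "same_week": float(tuple(a.isocalendar())[:2] == tuple(b.isocalendar())[:2]),
        "same_month": float((a.year, a.month) == (b.year, b.month)),
    }


feats = timestamp_pair_features(
    datetime(2024, 3, 15, 18, 30), datetime(2024, 3, 12, 9, 0)
)
print(feats["delta_seconds"])  # 293400.0
```

Swapping the two arguments flips the sign of the delta, which is why a pair such as Start Time × End Time yields a duration only when the ordering is fixed.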


TIMESTAMP × EMAIL

Relationship Types:
  - Age × Email Length: age(timestamp) * len(email)
  - Temporal Features × Email Features: temporal_features · email_features
  - Time Delta × Domain Length: delta_time * len(email_domain)
  - Email Features × Temporal Embedding: email_features · temporal_embedding

Use Cases:
  - Registration Date × Email Domain → registration-email relationships
  - Event Time × Email Provider → time-provider alignment

Implementation:
  - Extract temporal features
  - Extract email features
  - Compute interactions
  - MLP-ize interactions


TIMESTAMP × DOMAIN

Relationship Types: - Age × Domain Length: age(timestamp) * len(domain) - Temporal Features × Domain Features: temporal_features · domain_features - Time Delta × TLD Type: delta_time * indicator(tld_type) - Domain Features × Temporal Embedding: domain_features · temporal_embedding

Use Cases: - Registration Date × Domain → registration-domain relationships - Event Time × Domain Type → time-domain alignment

Implementation: - Extract temporal features - Extract domain features - Compute interactions - MLP-ize interactions
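
The Time Delta × TLD Type term can be read as a one-hot TLD vector scaled by the time delta. A small sketch under that reading follows; TLD_VOCAB and the trailing "other" slot are assumptions, since the document does not specify how TLD types are enumerated.

```python
# Hypothetical TLD vocabulary; one extra slot at the end collects everything else.
TLD_VOCAB = ("com", "org", "net", "edu")

def tld_one_hot(domain):
    """One-hot vector over TLD_VOCAB plus a final 'other' slot."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    vec = [0.0] * (len(TLD_VOCAB) + 1)
    index = TLD_VOCAB.index(tld) if tld in TLD_VOCAB else len(TLD_VOCAB)
    vec[index] = 1.0
    return vec

def delta_by_tld(delta_days, domain):
    """delta_time * indicator(tld_type), computed for every TLD slot at once."""
    return [delta_days * flag for flag in tld_one_hot(domain)]

scaled = delta_by_tld(3.5, "Example.ORG")
```

Only the slot matching the domain's TLD carries the delta; every other slot stays at zero, so a downstream MLP can learn a separate response per TLD type.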


EMAIL × EMAIL

Relationship Types: - Domain Match: indicator(email_domain_a == email_domain_b) (same domain?) - TLD Match: indicator(tld_a == tld_b) (same TLD?) - Email Length Ratio: len(email_a) / (len(email_b) + eps) - Email Length Difference: len(email_a) - len(email_b) - Domain Length Ratio: len(email_domain_a) / (len(email_domain_b) + eps) - Email Embedding Similarity: email_embedding_a · email_embedding_b

Use Cases: - Sender Email × Recipient Email → sender-recipient relationships - Primary Email × Secondary Email → primary-secondary alignment

Implementation: - Extract email features (domain, TLD, length) - Extract email embeddings - Compute similarity metrics - MLP-ize interactions
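
The non-embedding features above could look roughly like this. The helpers split_email and email_pair_features are hypothetical names, and domains are compared case-insensitively, which is an assumption the list above does not state. Embedding similarity is left out because it depends on whichever encoder is in use.

```python
EPS = 1e-8

def split_email(email):
    """Return (local part, domain, TLD), lowercased; empty strings if malformed."""
    local, sep, domain = email.strip().lower().rpartition("@")
    if not sep:  # no "@": treat the whole string as the local part
        return domain, "", ""
    tld = domain.rsplit(".", 1)[-1] if "." in domain else ""
    return local, domain, tld

def email_pair_features(email_a, email_b):
    """Hypothetical EMAIL x EMAIL features."""
    _, dom_a, tld_a = split_email(email_a)
    _, dom_b, tld_b = split_email(email_b)
    return {
        # Empty domains or TLDs never count as a match.
        "domain_match": int(bool(dom_a) and dom_a == dom_b),
        "tld_match": int(bool(tld_a) and tld_a == tld_b),
        "length_ratio": len(email_a) / (len(email_b) + EPS),
        "length_diff": len(email_a) - len(email_b),
        "domain_length_ratio": len(dom_a) / (len(dom_b) + EPS),
    }

same_org = email_pair_features("alice@Example.com", "bob@example.com")
```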


EMAIL × DOMAIN

Relationship Types: - Email Domain × Domain: indicator(email_domain == domain) (same domain?) - Email Length × Domain Length: len(email) * len(domain) - Email Features × Domain Features: email_features · domain_features - TLD Match: indicator(email_tld == domain_tld) (same TLD?)

Use Cases: - Email × Domain → email-domain relationships - Email Provider × Domain Type → provider-domain alignment

Implementation: - Extract email features - Extract domain features - Compute interactions - MLP-ize interactions


DOMAIN × DOMAIN

Relationship Types: - Domain Match: indicator(domain_a == domain_b) (same domain?) - TLD Match: indicator(tld_a == tld_b) (same TLD?) - Domain Length Ratio: len(domain_a) / (len(domain_b) + eps) - Domain Length Difference: len(domain_a) - len(domain_b) - Subdomain Relationship: indicator(domain_a is subdomain of domain_b or vice versa) - Domain Embedding Similarity: domain_embedding_a · domain_embedding_b

Use Cases: - Source Domain × Destination Domain → source-destination relationships - Primary Domain × Secondary Domain → primary-secondary alignment

Implementation: - Extract domain features (length, TLD, subdomain) - Extract domain embeddings - Compute similarity metrics - MLP-ize interactions
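
The subdomain indicator is the one feature here where a naive string check goes wrong: "notexample.com".endswith("example.com") is true even though the two domains are unrelated. A sketch that compares dot-separated labels instead is shown below. The function names are hypothetical, and normalization (lowercasing, dropping a trailing dot) is an assumption.

```python
def normalize_domain(domain):
    """Lowercase and strip whitespace and any trailing root dot."""
    return domain.strip().rstrip(".").lower()

def is_subdomain(child, parent):
    """True if `child` is a strict subdomain of `parent`, compared label by label."""
    child_labels = normalize_domain(child).split(".")
    parent_labels = normalize_domain(parent).split(".")
    if len(child_labels) <= len(parent_labels):
        return False
    return child_labels[-len(parent_labels):] == parent_labels

def domain_pair_features(domain_a, domain_b):
    """Hypothetical DOMAIN x DOMAIN indicator features."""
    a, b = normalize_domain(domain_a), normalize_domain(domain_b)
    return {
        "domain_match": int(a == b),
        "tld_match": int(a.rsplit(".", 1)[-1] == b.rsplit(".", 1)[-1]),
        "subdomain_relationship": int(is_subdomain(a, b) or is_subdomain(b, a)),
    }

related = domain_pair_features("shop.example.com", "Example.com.")
```

This treats the TLD as the last label only; multi-label public suffixes such as co.uk would need a public suffix list, which is outside this sketch.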


Summary

Interaction Categories

  1. Numeric Operations (SCALAR, VECTOR, TIMESTAMP)
     - Ratios, differences, products, sums
     - Normalized differences, log ratios
     - Dot products, distances, similarities

  2. Set Operations (SET, LIST_OF_A_SET)
     - Intersection, union, difference
     - Jaccard similarity, subset indicators
     - Cardinality ratios and differences

  3. String Operations (FREE_STRING, URL, EMAIL, DOMAIN)
     - Length ratios and differences
     - Token overlap, Jaccard similarity
     - Embedding similarities, containment indicators

  4. Structure Operations (JSON, URL, EMAIL, DOMAIN)
     - Depth ratios and differences
     - Key/component overlap
     - Structure similarity

  5. Temporal Operations (TIMESTAMP)
     - Time deltas, age differences
     - Temporal feature differences
     - Same period indicators (day/week/month)

  6. Cross-Type Operations
     - Cardinality × Numeric (set size × scalar)
     - Length × Numeric (string length × scalar)
     - Embedding × Numeric (embedding × scalar)
     - Structure × Numeric (depth × scalar)
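
Jaccard similarity appears under both the set and string categories, and the same helper can serve both once strings are tokenized. A minimal sketch follows; returning 0.0 when both inputs are empty is a convention chosen for the example, not something the document specifies.

```python
def jaccard(items_a, items_b):
    """|A ∩ B| / |A ∪ B| over the distinct items of two iterables."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention when both inputs are empty
    return len(set_a & set_b) / len(union)

# SET / LIST_OF_A_SET columns: compare category members directly.
set_score = jaccard(["red", "blue"], ["blue", "green"])

# String columns: compare whitespace tokens (one possible tokenization).
token_score = jaccard("a b c".split(), "b c d".split())
```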

Implementation Strategy

  1. Type-Aware Relationship Extraction
     - Detect column types from col_types dict
     - Select appropriate interaction functions based on type pairs
     - Compute relationship features for each pair

  2. Feature Selection
     - Limit to top-N pairs by MI or correlation (if max_pairwise_ratios is set)
     - Prioritize high-MI pairs for relationship tokens
     - Use upstream hints to weight relationship importance

  3. MLP-ization
     - Each relationship type gets its own MLP (or shared MLP per category)
     - Project all relationship features to d_model dimension
     - Normalize outputs for stable training

  4. Masking
     - Zero out relationships involving masked columns
     - Handle missing values gracefully, and add epsilon to denominators to avoid division by zero
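
Steps 1, 2, and 4 of the strategy can be sketched together as a small registry keyed by type pair. Everything here is illustrative rather than the RelationshipFeatureExtractor's actual design: the register decorator, the dim argument (needed so masked pairs can be zero-filled without calling the function), and the pair_scores and top_n parameters (standing in for precomputed MI scores and the max_pairwise_ratios limit) are all assumptions. MLP-ization is omitted because it depends on the model framework.

```python
from itertools import combinations

EPS = 1e-8
REGISTRY = {}  # (type_a, type_b) sorted by name -> (feature function, output size)

def register(type_a, type_b, dim):
    """Register a feature function; its arguments arrive in sorted-type order."""
    def decorator(fn):
        REGISTRY[tuple(sorted((type_a, type_b)))] = (fn, dim)
        return fn
    return decorator

@register("SCALAR", "SCALAR", dim=3)
def scalar_scalar(a, b):
    return [a / (b + EPS), a - b, a * b]  # ratio, difference, product

@register("DOMAIN", "SCALAR", dim=1)
def domain_scalar(domain, scalar):
    return [len(domain) * scalar]  # length x numeric

def extract_relationships(row, col_types, masked=frozenset(),
                          pair_scores=None, top_n=None):
    candidates = []
    for x, y in combinations(sorted(col_types), 2):
        # Order the two columns by type name to match the registered signature.
        (type_x, x), (type_y, y) = sorted(((col_types[x], x), (col_types[y], y)))
        if (type_x, type_y) in REGISTRY:
            candidates.append((x, y))
    if pair_scores is not None and top_n is not None:
        # Keep only the highest-scoring pairs; unscored pairs default to 0.
        candidates.sort(key=lambda p: pair_scores.get(frozenset(p), 0.0), reverse=True)
        candidates = candidates[:top_n]
    features = {}
    for x, y in candidates:
        fn, dim = REGISTRY[(col_types[x], col_types[y])]
        if x in masked or y in masked:
            features[(x, y)] = [0.0] * dim  # zero out masked relationships
        else:
            features[(x, y)] = fn(row[x], row[y])
    return features

col_types = {"price": "SCALAR", "qty": "SCALAR", "site": "DOMAIN"}
row = {"price": 10.0, "qty": 4.0, "site": "example.com"}
all_pairs = extract_relationships(row, col_types)
```

Storing the output size next to each function keeps the feature layout fixed whether or not a column is masked, which is what a downstream projection to d_model would need.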

This comprehensive enumeration provides a roadmap for implementing type-aware relationship features in the RelationshipFeatureExtractor.