Column Type Interaction Matrix
This document enumerates the pairwise interactions between column types and proposes relationship features for each pair. Each unordered pair appears once, including a type paired with itself.
Column Types
- SCALAR - Numeric values (integers, floats)
- SET - Categorical sets (one-hot encoded categories)
- FREE_STRING - Free-form text strings
- LIST_OF_A_SET - Lists of categorical values
- VECTOR - Fixed-length numeric vectors
- URL - URLs (can be treated as strings with domain/path structure)
- JSON - Structured JSON data
- TIMESTAMP - Date/time values
- EMAIL - Email addresses (can be treated as strings with domain structure)
- DOMAIN - Domain names (can be treated as strings)
Interaction Matrix
SCALAR × SCALAR
Relationship Types:
- Ratio: scalar_a / (scalar_b + eps) (division with epsilon)
- Difference: scalar_a - scalar_b
- Product: scalar_a * scalar_b
- Sum: scalar_a + scalar_b
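- Normalized Difference: (scalar_a - scalar_b) / (scalar_a + scalar_b + eps) (relative difference; assumes non-negative inputs so the denominator stays positive)
- Log Ratio: log(scalar_a + eps) - log(scalar_b + eps) (scale-invariant; requires non-negative inputs)
- Power Ratio: scalar_a^p / (scalar_b^p + eps) (for specific powers p ∈ {0.5, 1, 2}; p = 1 reduces to the plain ratio)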
Use Cases:
- Age/Income ratio → financial stability indicator
- Price/Quantity → unit price
- Revenue/Expenses → revenue-to-expense ratio (a profitability indicator)
- Distance/Time → speed
Implementation:
- Compute element-wise ratios, differences, products
- MLP-ize each relationship type separately (feed each through its own small MLP)
- Can limit to top-N pairs by mutual information (MI) or correlation
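A minimal sketch of the scalar pair features listed above, for a single pair of values. The function name, the dictionary keys, and the eps value are illustrative assumptions, and the sketch assumes non-negative inputs.

```python
import math

EPS = 1e-8  # assumed value; the text only names an "eps" term


def scalar_pair_features(a: float, b: float, eps: float = EPS) -> dict[str, float]:
    """Relationship features for one (scalar_a, scalar_b) pair.

    Assumes a >= 0 and b >= 0 so the logs, the square roots, and the
    normalized difference stay well defined.
    """
    return {
        "ratio": a / (b + eps),
        "difference": a - b,
        "product": a * b,
        "sum": a + b,
        "normalized_difference": (a - b) / (a + b + eps),
        "log_ratio": math.log(a + eps) - math.log(b + eps),
        "sqrt_ratio": math.sqrt(a) / (math.sqrt(b) + eps),  # power ratio, p = 0.5
    }


features = scalar_pair_features(120.0, 40.0)
print(features["ratio"], features["normalized_difference"])
```

Each value in the returned dictionary would then be fed to its own MLP, as the implementation notes describe.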
SCALAR × SET
Relationship Types:
- Set Cardinality × Scalar: scalar * |set| (scalar weighted by set size)
- Set Membership Indicator × Scalar: For each set member, compute scalar * indicator(member ∈ set)
- Set Statistics × Scalar: If set values are numeric-like, compute scalar * mean(set_values), scalar * max(set_values), etc.
- Scalar Normalized by Set Size: scalar / (|set| + eps) (per-member average)
- Scalar × Set Intersection Size: If multiple sets, scalar * |set_1 ∩ set_2|
Use Cases:
- Age × Job Categories → age-weighted job type
- Price × Product Categories → category-specific pricing
- Count × Set Membership → frequency-weighted features
- Revenue × Customer Segments → segment-specific revenue
Implementation:
- Extract set cardinality as scalar feature
- Extract set membership indicators (one-hot-like)
- Compute interactions with scalar
- MLP-ize interactions
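The cardinality and membership interactions can be sketched as follows. The function name, the explicit `vocabulary` argument (the fixed list of possible set members), and the eps value are assumptions made for illustration.

```python
EPS = 1e-8  # assumed value for the "eps" term


def scalar_set_features(
    scalar: float,
    members: set[str],
    vocabulary: list[str],
    eps: float = EPS,
) -> dict[str, float]:
    """Interactions between a scalar and a categorical set."""
    features = {
        "scalar_x_cardinality": scalar * len(members),
        "scalar_per_member": scalar / (len(members) + eps),
    }
    # One feature per possible member: scalar * indicator(member in set).
    for member in vocabulary:
        features[f"scalar_x_has_{member}"] = scalar * float(member in members)
    return features


features = scalar_set_features(
    50.0, {"retail", "online"}, ["retail", "online", "wholesale"]
)
print(features)
```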
SCALAR × FREE_STRING
Relationship Types:
- String Length × Scalar: scalar * len(string) (scalar weighted by string length)
- String Embedding Statistics × Scalar: scalar * mean(string_embedding), scalar * std(string_embedding)
- Scalar Normalized by String Length: scalar / (len(string) + eps)
- Scalar × String Token Count: scalar * num_tokens(string)
Use Cases:
- Age × Description Length → age-weighted description complexity
- Price × Product Description → description-weighted pricing
- Count × Text Length → frequency-weighted text features
Implementation:
- Extract string statistics (length, token count, embedding stats)
- Compute interactions with scalar
- MLP-ize interactions
SCALAR × LIST_OF_A_SET
Relationship Types:
- List Length × Scalar: scalar * len(list) (scalar weighted by list size)
- List Cardinality × Scalar: scalar * sum(|set_i| for set_i in list) (total set members)
- Scalar Normalized by List Length: scalar / (len(list) + eps)
- List Diversity × Scalar: scalar * num_unique_items(list) (scalar weighted by diversity)
Use Cases:
- Age × List of Skills → age-weighted skill count
- Price × List of Features → feature-count-weighted pricing
- Count × List of Tags → tag-frequency-weighted features
Implementation:
- Extract list statistics (length, cardinality, diversity)
- Compute interactions with scalar
- MLP-ize interactions
SCALAR × VECTOR
Relationship Types:
- Scalar × Vector Element-wise: scalar * vector (scalar scales entire vector)
- Scalar × Vector Norm: scalar * ||vector|| (scalar weighted by vector magnitude)
- Scalar × Vector Mean: scalar * mean(vector) (scalar weighted by average)
- Scalar × Vector Dot Product: If multiple vectors, scalar * (vector_a · vector_b)
- Vector Normalized by Scalar: vector / (scalar + eps) (vector scaled by scalar)
Use Cases:
- Age × Feature Vector → age-weighted feature importance
- Price × Embedding Vector → price-weighted semantic features
- Count × Vector → frequency-weighted vector features
Implementation:
- Extract vector statistics (norm, mean, std)
- Compute element-wise products with scalar
- MLP-ize interactions
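A small pure-Python sketch of the scalar and vector interactions above. The function name and eps value are illustrative; a real pipeline would more likely use an array library, which is an assumption not stated in this document.

```python
import math

EPS = 1e-8  # assumed value for the "eps" term


def scalar_vector_features(scalar: float, vector: list[float], eps: float = EPS) -> dict:
    """Interactions between a scalar and a fixed-length, non-empty vector."""
    norm = math.sqrt(sum(x * x for x in vector))  # L2 norm
    mean = sum(vector) / len(vector)
    return {
        "scaled_vector": [scalar * x for x in vector],
        "scalar_x_norm": scalar * norm,
        "scalar_x_mean": scalar * mean,
        "vector_over_scalar": [x / (scalar + eps) for x in vector],
    }


features = scalar_vector_features(2.0, [3.0, 4.0])
print(features["scaled_vector"], features["scalar_x_norm"])
```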
SCALAR × URL
Relationship Types:
- URL Depth × Scalar: scalar * url_depth (scalar weighted by URL path depth)
- Domain Length × Scalar: scalar * len(domain) (scalar weighted by domain name length)
- Has Query × Scalar: scalar * indicator(has_query_params) (binary indicator)
- Scalar Normalized by URL Length: scalar / (len(url) + eps)
Use Cases:
- Age × URL Depth → age-weighted navigation depth
- Price × Domain Type → domain-specific pricing
- Count × URL Structure → structure-frequency-weighted features
Implementation:
- Extract URL features (depth, domain, query params)
- Compute interactions with scalar
- MLP-ize interactions
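One way to extract the URL features and combine them with a scalar, using only the standard library. The helper names are hypothetical, and the sketch assumes that "domain" means the full hostname and that "depth" counts non-empty path segments.

```python
from urllib.parse import urlsplit

EPS = 1e-8  # assumed value for the "eps" term


def url_structure(url: str) -> dict[str, float]:
    """Structural URL features: path depth, hostname length, query flag, length."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    depth = len([segment for segment in parts.path.split("/") if segment])
    return {
        "url_depth": float(depth),
        "domain_length": float(len(domain)),
        "has_query": float(bool(parts.query)),
        "url_length": float(len(url)),
    }


def scalar_url_features(scalar: float, url: str, eps: float = EPS) -> dict[str, float]:
    """Interactions between a scalar and the structure of a URL."""
    s = url_structure(url)
    return {
        "scalar_x_url_depth": scalar * s["url_depth"],
        "scalar_x_domain_length": scalar * s["domain_length"],
        "scalar_x_has_query": scalar * s["has_query"],
        "scalar_per_url_char": scalar / (s["url_length"] + eps),
    }


features = scalar_url_features(10.0, "https://example.com/shop/items?id=7")
print(features)
```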
SCALAR × JSON
Relationship Types:
- JSON Depth × Scalar: scalar * json_depth (scalar weighted by nesting depth)
- JSON Key Count × Scalar: scalar * num_keys(json) (scalar weighted by key count)
- JSON Value Types × Scalar: scalar * num_numeric_values(json), scalar * num_string_values(json)
- Scalar Normalized by JSON Size: scalar / (json_size + eps)
Use Cases:
- Age × JSON Complexity → age-weighted data complexity
- Price × JSON Structure → structure-weighted pricing
- Count × JSON Keys → key-frequency-weighted features
Implementation:
- Extract JSON statistics (depth, key count, value types)
- Compute interactions with scalar
- MLP-ize interactions
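The JSON statistics are not defined precisely in this document, so the sketch below picks one plausible reading: depth counts nested containers (a leaf has depth 0), and the key count includes keys at every nesting level. The function names are hypothetical.

```python
import json


def json_depth(value) -> int:
    """Nesting depth: 0 for a leaf, 1 for a flat object or array."""
    if isinstance(value, dict):
        return 1 + max((json_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((json_depth(v) for v in value), default=0)
    return 0


def count_keys(value) -> int:
    """Total number of object keys at all nesting levels."""
    if isinstance(value, dict):
        return len(value) + sum(count_keys(v) for v in value.values())
    if isinstance(value, list):
        return sum(count_keys(v) for v in value)
    return 0


def scalar_json_features(scalar: float, raw_json: str) -> dict[str, float]:
    """Interactions between a scalar and simple JSON structure statistics."""
    doc = json.loads(raw_json)
    return {
        "scalar_x_json_depth": scalar * json_depth(doc),
        "scalar_x_num_keys": scalar * count_keys(doc),
    }


features = scalar_json_features(
    3.0, '{"user": {"id": 1, "tags": ["a", "b"]}, "active": true}'
)
print(features)
```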
SCALAR × TIMESTAMP
Relationship Types:
- Time Delta × Scalar: scalar * delta_time (scalar weighted by time difference)
- Age (from timestamp) × Scalar: scalar * age(timestamp) (scalar weighted by age)
- Temporal Features × Scalar: scalar * hour_of_day, scalar * day_of_week, scalar * month
- Scalar Normalized by Time: scalar / (time_since_epoch + eps)
Use Cases:
- Age × Registration Date → account-age-weighted age
- Price × Purchase Date → date-weighted pricing
- Count × Timestamp → time-frequency-weighted features
Implementation:
- Extract temporal features (hour, day, month, age, delta)
- Compute interactions with scalar
- MLP-ize interactions
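A sketch of the temporal interactions, assuming timezone-aware datetimes and an explicit reference time for computing age in days. Both the `reference` argument and the choice of days as the unit are assumptions; the document does not fix either.

```python
from datetime import datetime, timezone


def scalar_timestamp_features(
    scalar: float, timestamp: datetime, reference: datetime
) -> dict[str, float]:
    """Interactions between a scalar and features of one timestamp."""
    age_days = (reference - timestamp).total_seconds() / 86400.0
    return {
        "scalar_x_age_days": scalar * age_days,
        "scalar_x_hour_of_day": scalar * timestamp.hour,
        "scalar_x_day_of_week": scalar * timestamp.weekday(),  # Monday = 0
        "scalar_x_month": scalar * timestamp.month,
    }


registered = datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc)
now = datetime(2024, 3, 25, 14, 30, tzinfo=timezone.utc)
features = scalar_timestamp_features(2.0, registered, now)
print(features)
```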
SCALAR × EMAIL
Relationship Types:
- Email Domain Length × Scalar: scalar * len(email_domain) (scalar weighted by domain length)
- Has Subdomain × Scalar: scalar * indicator(has_subdomain) (binary indicator)
- Email Length × Scalar: scalar * len(email) (scalar weighted by email length)
- Scalar Normalized by Email Length: scalar / (len(email) + eps)
Use Cases:
- Age × Email Domain → domain-age correlation
- Price × Email Provider → provider-specific pricing
- Count × Email Structure → structure-frequency-weighted features
Implementation:
- Extract email features (domain, subdomain, length)
- Compute interactions with scalar
- MLP-ize interactions
SCALAR × DOMAIN
Relationship Types:
- Domain Length × Scalar: scalar * len(domain) (scalar weighted by domain length)
- TLD Type × Scalar: scalar * indicator(tld_type) (categorical: .com, .org, etc.)
- Has Subdomain × Scalar: scalar * indicator(has_subdomain) (binary indicator)
- Scalar Normalized by Domain Length: scalar / (len(domain) + eps)
Use Cases:
- Age × Domain Type → domain-age correlation
- Price × Domain Category → category-specific pricing
- Count × Domain Structure → structure-frequency-weighted features
Implementation:
- Extract domain features (length, TLD, subdomain)
- Compute interactions with scalar
- MLP-ize interactions
SET × SET
Relationship Types:
- Jaccard Similarity: |set_a ∩ set_b| / |set_a ∪ set_b| (overlap ratio; define as 0 when both sets are empty)
- Intersection Size: |set_a ∩ set_b| (number of common elements)
- Union Size: |set_a ∪ set_b| (total unique elements)
- Set Difference: |set_a - set_b| (elements in A but not B)
- Symmetric Difference: |set_a Δ set_b| (elements in exactly one set)
- Subset Indicator: indicator(set_a ⊆ set_b) (binary: is A subset of B?)
- Cardinality Ratio: |set_a| / (|set_b| + eps) (size ratio)
- Cardinality Difference: |set_a| - |set_b| (size difference)
Use Cases:
- Job Categories × Skills → category-skill overlap
- Product Categories × Customer Segments → category-segment alignment
- Tags × Categories → tag-category relationships
Implementation:
- Compute set operations (intersection, union, difference)
- Extract set statistics (cardinality, similarity)
- MLP-ize relationship features
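The set operations map directly onto Python's built-in set type. The sketch below is illustrative: the function name and eps value are assumptions, and the Jaccard similarity of two empty sets is defined as 0 here to avoid a 0/0 division.

```python
EPS = 1e-8  # assumed value for the "eps" term


def set_pair_features(a: set, b: set, eps: float = EPS) -> dict[str, float]:
    """Relationship features for one (set_a, set_b) pair."""
    intersection = a & b
    union = a | b
    return {
        "jaccard": len(intersection) / len(union) if union else 0.0,
        "intersection_size": float(len(intersection)),
        "union_size": float(len(union)),
        "difference_size": float(len(a - b)),
        "symmetric_difference_size": float(len(a ^ b)),
        "is_subset": float(a <= b),
        "cardinality_ratio": len(a) / (len(b) + eps),
        "cardinality_difference": float(len(a) - len(b)),
    }


features = set_pair_features({"python", "sql"}, {"python", "sql", "spark"})
print(features)
```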
SET × FREE_STRING
Relationship Types:
- Set Cardinality × String Length: |set| * len(string) (size interaction)
- Set Membership × String Embedding: For each set member, compute indicator(member ∈ set) * string_embedding
- String Contains Set Member: indicator(any(set_member in string)) (binary: does string mention any set member?)
- Set Size Normalized by String Length: |set| / (len(string) + eps)
- String Embedding × Set Embedding: mean(string_embedding) · mean(set_embedding) (dot product; equals cosine similarity only when both vectors are L2-normalized)
Use Cases:
- Job Categories × Description → category-description alignment
- Tags × Text → tag-text relevance
- Categories × Comments → category-comment relationships
Implementation:
- Extract set cardinality and embeddings
- Extract string statistics and embeddings
- Compute interactions
- MLP-ize interactions
SET × LIST_OF_A_SET
Relationship Types:
- Set × List Intersection: |set ∩ (union of list sets)| (overlap with list)
- Set × List Jaccard: |set ∩ list_union| / |set ∪ list_union|
- Set Cardinality × List Length: |set| * len(list)
- Set × List Diversity: |set| * num_unique_items(list)
- List Contains Set: indicator(set ⊆ list_union) (binary: is set contained in list?)
Use Cases:
- Skills × Skill List → skill-list overlap
- Categories × Tag List → category-tag relationships
- Features × Feature List → feature-list alignment
Implementation:
- Compute set-list operations
- Extract statistics
- MLP-ize interactions
SET × VECTOR
Relationship Types:
- Set Cardinality × Vector Norm: |set| * ||vector||
- Set Embedding × Vector: mean(set_embedding) · vector (dot product)
- Set Size × Vector Mean: |set| * mean(vector)
- Vector Normalized by Set Size: vector / (|set| + eps)
Use Cases:
- Categories × Feature Vector → category-feature alignment
- Tags × Embedding → tag-embedding relationships
Implementation:
- Extract set statistics and embeddings
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
SET × URL
Relationship Types:
- Set Cardinality × URL Depth: |set| * url_depth
- Set × URL Domain: indicator(domain in set) (if set contains domains)
- Set Size × URL Length: |set| * len(url)
- URL Features × Set Embedding: url_features · mean(set_embedding)
Use Cases:
- Categories × URL → category-url relationships
- Tags × URL → tag-url relevance
Implementation:
- Extract URL features
- Extract set statistics
- Compute interactions
- MLP-ize interactions
SET × JSON
Relationship Types:
- Set Cardinality × JSON Key Count: |set| * num_keys(json)
- Set × JSON Keys: indicator(any(set_member in json_keys)) (binary: does JSON have keys matching set?)
- Set Size × JSON Depth: |set| * json_depth
- JSON Structure × Set Embedding: json_features · mean(set_embedding)
Use Cases:
- Categories × JSON Structure → category-json relationships
- Tags × JSON Keys → tag-key alignment
Implementation:
- Extract JSON features
- Extract set statistics
- Compute interactions
- MLP-ize interactions
SET × TIMESTAMP
Relationship Types:
- Set Cardinality × Time Delta: |set| * delta_time
- Set × Temporal Features: mean(set_embedding) · temporal_features
- Set Size × Age: |set| * age(timestamp)
- Temporal Features × Set Embedding: temporal_features · mean(set_embedding) (same value as Set × Temporal Features, because the dot product is commutative)
Use Cases:
- Categories × Registration Date → category-time relationships
- Tags × Event Time → tag-time alignment
Implementation:
- Extract temporal features
- Extract set statistics
- Compute interactions
- MLP-ize interactions
SET × EMAIL
Relationship Types:
- Set Cardinality × Email Length: |set| * len(email)
- Set × Email Domain: indicator(email_domain in set) (if set contains domains)
- Set Size × Domain Length: |set| * len(email_domain)
- Email Features × Set Embedding: email_features · mean(set_embedding)
Use Cases:
- Categories × Email Domain → category-domain relationships
- Tags × Email Provider → tag-provider alignment
Implementation:
- Extract email features
- Extract set statistics
- Compute interactions
- MLP-ize interactions
SET × DOMAIN
Relationship Types:
- Set Cardinality × Domain Length: |set| * len(domain)
- Set × Domain: indicator(domain in set) (if set contains domains)
- Set Size × TLD Type: |set| * indicator(tld_type)
- Domain Features × Set Embedding: domain_features · mean(set_embedding)
Use Cases:
- Categories × Domain → category-domain relationships
- Tags × Domain Type → tag-domain alignment
Implementation:
- Extract domain features
- Extract set statistics
- Compute interactions
- MLP-ize interactions
FREE_STRING × FREE_STRING
Relationship Types:
- String Similarity: Cosine similarity between string embeddings
- String Length Ratio: len(string_a) / (len(string_b) + eps)
- String Length Difference: len(string_a) - len(string_b)
- Token Overlap: |tokens(string_a) ∩ tokens(string_b)| (common tokens)
- Token Jaccard: |tokens_a ∩ tokens_b| / |tokens_a ∪ tokens_b|
- Embedding Distance: ||embedding_a - embedding_b|| (L2 distance)
- Embedding Dot Product: embedding_a · embedding_b (unnormalized similarity; equals cosine similarity only for unit-norm embeddings)
- Substring Indicator: indicator(string_a in string_b or string_b in string_a) (containment)
Use Cases:
- Description × Title → description-title alignment
- Comment × Review → comment-review similarity
- Query × Document → query-document relevance
Implementation:
- Extract string statistics (length, token count)
- Extract string embeddings
- Compute similarity metrics
- MLP-ize interactions
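The non-embedding string features can be sketched without any model. The tokenizer below (lowercase, split on whitespace) is an assumption chosen for brevity, and the function names are hypothetical; the embedding-based features are omitted because they depend on an encoder this document does not specify.

```python
EPS = 1e-8  # assumed value for the "eps" term


def tokens(text: str) -> set[str]:
    """Assumed tokenizer: lowercase, whitespace-separated, de-duplicated."""
    return set(text.lower().split())


def string_pair_features(a: str, b: str, eps: float = EPS) -> dict[str, float]:
    """Length, token-overlap, and containment features for two strings."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    common = tokens_a & tokens_b
    union = tokens_a | tokens_b
    return {
        "length_ratio": len(a) / (len(b) + eps),
        "length_difference": float(len(a) - len(b)),
        "token_overlap": float(len(common)),
        "token_jaccard": len(common) / len(union) if union else 0.0,
        "substring": float(a in b or b in a),
    }


features = string_pair_features(
    "red running shoes", "lightweight red running shoes for trails"
)
print(features)
```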
FREE_STRING × LIST_OF_A_SET
Relationship Types:
- String Length × List Length: len(string) * len(list)
- String × List Union: indicator(any(list_item in string)) (does string mention any list item?)
- String Embedding × List Embedding: string_embedding · mean(list_embedding)
- List Diversity × String Length: num_unique_items(list) * len(string)
Use Cases:
- Description × Tag List → description-tag alignment
- Comment × Category List → comment-category relationships
Implementation:
- Extract string statistics and embeddings
- Extract list statistics and embeddings
- Compute interactions
- MLP-ize interactions
FREE_STRING × VECTOR
Relationship Types:
- String Embedding × Vector: string_embedding · vector (dot product)
- String Length × Vector Norm: len(string) * ||vector||
- String Statistics × Vector Mean: mean(string_embedding) · mean(vector)
- Vector Normalized by String Length: vector / (len(string) + eps)
Use Cases:
- Description × Feature Vector → description-feature alignment
- Text × Embedding → text-embedding relationships
Implementation:
- Extract string statistics and embeddings
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
FREE_STRING × URL
Relationship Types:
- String × URL Domain: indicator(domain in string) (does string mention domain?)
- String Length × URL Length: len(string) * len(url)
- String Embedding × URL Features: string_embedding · url_features
- URL Depth × String Length: url_depth * len(string)
Use Cases:
- Description × URL → description-url relationships
- Comment × Link → comment-link alignment
Implementation:
- Extract URL features
- Extract string statistics and embeddings
- Compute interactions
- MLP-ize interactions
FREE_STRING × JSON
Relationship Types:
- String × JSON Keys: indicator(any(key in string)) (does string mention any JSON key?)
- String Length × JSON Key Count: len(string) * num_keys(json)
- String Embedding × JSON Structure: string_embedding · json_features
- JSON Depth × String Length: json_depth * len(string)
Use Cases:
- Description × JSON Structure → description-json relationships
- Comment × JSON Data → comment-data alignment
Implementation:
- Extract JSON features
- Extract string statistics and embeddings
- Compute interactions
- MLP-ize interactions
FREE_STRING × TIMESTAMP
Relationship Types:
- String Length × Time Delta: len(string) * delta_time
- String Embedding × Temporal Features: string_embedding · temporal_features
- Age × String Length: age(timestamp) * len(string)
- Temporal Features × String Statistics: temporal_features · mean(string_embedding)
Use Cases:
- Comment × Post Date → comment-time relationships
- Description × Creation Date → description-time alignment
Implementation:
- Extract temporal features
- Extract string statistics and embeddings
- Compute interactions
- MLP-ize interactions
FREE_STRING × EMAIL
Relationship Types:
- String × Email Domain: indicator(domain in string) (does string mention email domain?)
- String Length × Email Length: len(string) * len(email)
- String Embedding × Email Features: string_embedding · email_features
- Email Domain Length × String Length: len(email_domain) * len(string)
Use Cases:
- Comment × Email → comment-email relationships
- Description × Email Domain → description-domain alignment
Implementation:
- Extract email features
- Extract string statistics and embeddings
- Compute interactions
- MLP-ize interactions
FREE_STRING × DOMAIN
Relationship Types:
- String × Domain: indicator(domain in string) (does string mention domain?)
- String Length × Domain Length: len(string) * len(domain)
- String Embedding × Domain Features: string_embedding · domain_features
- TLD Type × String Length: indicator(tld_type) * len(string)
Use Cases:
- Description × Domain → description-domain relationships
- Comment × Domain Type → comment-domain alignment
Implementation:
- Extract domain features
- Extract string statistics and embeddings
- Compute interactions
- MLP-ize interactions
LIST_OF_A_SET × LIST_OF_A_SET
Relationship Types:
- List Intersection: |union(list_a) ∩ union(list_b)| (common items)
- List Jaccard: |union_a ∩ union_b| / |union_a ∪ union_b|
- List Length Ratio: len(list_a) / (len(list_b) + eps)
- List Length Difference: len(list_a) - len(list_b)
- List Diversity Ratio: num_unique_items(list_a) / (num_unique_items(list_b) + eps)
- List Embedding Similarity: mean(list_a_embedding) · mean(list_b_embedding)
Use Cases:
- Skill List × Tag List → skill-tag overlap
- Category List × Feature List → category-feature relationships
Implementation:
- Compute list operations (union, intersection)
- Extract list statistics (length, diversity)
- Extract list embeddings
- MLP-ize interactions
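A sketch of the list-level operations, assuming each LIST_OF_A_SET value is a Python list of sets and that num_unique_items means the size of the union of those sets. The function names are hypothetical, and the embedding similarity is left out because it depends on an unspecified encoder.

```python
EPS = 1e-8  # assumed value for the "eps" term


def flatten(list_of_sets: list[set]) -> set:
    """Union of all sets in the list (empty set for an empty list)."""
    return set().union(*list_of_sets)


def list_pair_features(a: list[set], b: list[set], eps: float = EPS) -> dict[str, float]:
    """Relationship features for one (list_a, list_b) pair."""
    union_a, union_b = flatten(a), flatten(b)
    common = union_a & union_b
    combined = union_a | union_b
    return {
        "intersection_size": float(len(common)),
        "jaccard": len(common) / len(combined) if combined else 0.0,
        "length_ratio": len(a) / (len(b) + eps),
        "length_difference": float(len(a) - len(b)),
        "diversity_ratio": len(union_a) / (len(union_b) + eps),
    }


skills = [{"python", "sql"}, {"sql", "spark"}]
tags = [{"sql"}, {"excel"}, {"python"}]
features = list_pair_features(skills, tags)
print(features)
```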
LIST_OF_A_SET × VECTOR
Relationship Types:
- List Length × Vector Norm: len(list) * ||vector||
- List Embedding × Vector: mean(list_embedding) · vector
- List Diversity × Vector Mean: num_unique_items(list) * mean(vector)
- Vector Normalized by List Length: vector / (len(list) + eps)
Use Cases:
- Tag List × Feature Vector → tag-feature alignment
- Category List × Embedding → category-embedding relationships
Implementation:
- Extract list statistics and embeddings
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
LIST_OF_A_SET × URL
Relationship Types:
- List Length × URL Depth: len(list) * url_depth
- List × URL Domain: indicator(domain in union(list)) (if list contains domains)
- List Diversity × URL Length: num_unique_items(list) * len(url)
- URL Features × List Embedding: url_features · mean(list_embedding)
Use Cases:
- Tag List × URL → tag-url relationships
- Category List × URL → category-url alignment
Implementation:
- Extract URL features
- Extract list statistics and embeddings
- Compute interactions
- MLP-ize interactions
LIST_OF_A_SET × JSON
Relationship Types:
- List Length × JSON Key Count: len(list) * num_keys(json)
- List × JSON Keys: indicator(any(list_item in json_keys))
- List Diversity × JSON Depth: num_unique_items(list) * json_depth
- JSON Structure × List Embedding: json_features · mean(list_embedding)
Use Cases:
- Tag List × JSON Structure → tag-json relationships
- Category List × JSON Keys → category-key alignment
Implementation:
- Extract JSON features
- Extract list statistics and embeddings
- Compute interactions
- MLP-ize interactions
LIST_OF_A_SET × TIMESTAMP
Relationship Types:
- List Length × Time Delta: len(list) * delta_time
- List Embedding × Temporal Features: mean(list_embedding) · temporal_features
- List Diversity × Age: num_unique_items(list) * age(timestamp)
- Temporal Features × List Embedding: temporal_features · mean(list_embedding) (same value as List Embedding × Temporal Features, because the dot product is commutative)
Use Cases:
- Tag List × Event Time → tag-time relationships
- Category List × Registration Date → category-time alignment
Implementation:
- Extract temporal features
- Extract list statistics and embeddings
- Compute interactions
- MLP-ize interactions
LIST_OF_A_SET × EMAIL
Relationship Types:
- List Length × Email Length: len(list) * len(email)
- List × Email Domain: indicator(email_domain in union(list)) (if list contains domains)
- List Diversity × Domain Length: num_unique_items(list) * len(email_domain)
- Email Features × List Embedding: email_features · mean(list_embedding)
Use Cases:
- Tag List × Email Domain → tag-domain relationships
- Category List × Email Provider → category-provider alignment
Implementation:
- Extract email features
- Extract list statistics and embeddings
- Compute interactions
- MLP-ize interactions
LIST_OF_A_SET × DOMAIN
Relationship Types:
- List Length × Domain Length: len(list) * len(domain)
- List × Domain: indicator(domain in union(list)) (if list contains domains)
- List Diversity × TLD Type: num_unique_items(list) * indicator(tld_type)
- Domain Features × List Embedding: domain_features · mean(list_embedding)
Use Cases:
- Tag List × Domain → tag-domain relationships
- Category List × Domain Type → category-domain alignment
Implementation:
- Extract domain features
- Extract list statistics and embeddings
- Compute interactions
- MLP-ize interactions
VECTOR × VECTOR
Relationship Types:
- Dot Product: vector_a · vector_b (cosine similarity when normalized)
- L2 Distance: ||vector_a - vector_b|| (Euclidean distance)
- L1 Distance: ||vector_a - vector_b||_1 = sum(|vector_a - vector_b|) (Manhattan distance)
- Cosine Similarity: (vector_a · vector_b) / (||vector_a|| * ||vector_b|| + eps)
- Element-wise Product: vector_a * vector_b (Hadamard product)
- Element-wise Ratio: vector_a / (vector_b + eps) (element-wise division)
- Vector Norm Ratio: ||vector_a|| / (||vector_b|| + eps)
- Vector Mean Difference: mean(vector_a) - mean(vector_b)
Use Cases:
- Feature Vector × Embedding Vector → feature-embedding alignment
- User Vector × Item Vector → user-item similarity
- Query Vector × Document Vector → query-document relevance
Implementation:
- Compute vector operations (dot product, distance, similarity)
- Extract vector statistics (norm, mean, std)
- MLP-ize interactions
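The vector pair features can be sketched in pure Python as below. The function name and eps value are illustrative assumptions, and the sketch requires both vectors to be non-empty and of the same length.

```python
import math

EPS = 1e-8  # assumed value for the "eps" term


def vector_pair_features(a: list[float], b: list[float], eps: float = EPS) -> dict:
    """Relationship features for two equal-length, non-empty vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return {
        "dot": dot,
        "l2_distance": math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
        "l1_distance": sum(abs(x - y) for x, y in zip(a, b)),
        "cosine": dot / (norm_a * norm_b + eps),
        "hadamard": [x * y for x, y in zip(a, b)],
        "norm_ratio": norm_a / (norm_b + eps),
        "mean_difference": sum(a) / len(a) - sum(b) / len(b),
    }


features = vector_pair_features([3.0, 4.0], [6.0, 8.0])
print(features["dot"], features["cosine"])
```

The two example vectors point in the same direction, so the cosine similarity is close to 1 while the dot product is not; this is the distinction between the Dot Product and Cosine Similarity entries.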
VECTOR × URL
Relationship Types:
- Vector × URL Features: vector · url_features (dot product)
- Vector Norm × URL Depth: ||vector|| * url_depth
- Vector Mean × URL Length: mean(vector) * len(url)
- URL Features Normalized by Vector Norm: url_features / (||vector|| + eps)
Use Cases:
- Embedding Vector × URL → embedding-url relationships
- Feature Vector × URL Structure → feature-url alignment
Implementation:
- Extract URL features
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
VECTOR × JSON
Relationship Types:
- Vector × JSON Features: vector · json_features (dot product)
- Vector Norm × JSON Key Count: ||vector|| * num_keys(json)
- Vector Mean × JSON Depth: mean(vector) * json_depth
- JSON Features Normalized by Vector Norm: json_features / (||vector|| + eps)
Use Cases:
- Embedding Vector × JSON Structure → embedding-json relationships
- Feature Vector × JSON Keys → feature-json alignment
Implementation:
- Extract JSON features
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
VECTOR × TIMESTAMP
Relationship Types:
- Vector × Temporal Features: vector · temporal_features (dot product)
- Vector Norm × Time Delta: ||vector|| * delta_time
- Vector Mean × Age: mean(vector) * age(timestamp)
- Temporal Features Normalized by Vector Norm: temporal_features / (||vector|| + eps)
Use Cases:
- Embedding Vector × Time → embedding-time relationships
- Feature Vector × Temporal Features → feature-time alignment
Implementation:
- Extract temporal features
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
VECTOR × EMAIL
Relationship Types:
- Vector × Email Features: vector · email_features (dot product)
- Vector Norm × Email Length: ||vector|| * len(email)
- Vector Mean × Domain Length: mean(vector) * len(email_domain)
- Email Features Normalized by Vector Norm: email_features / (||vector|| + eps)
Use Cases:
- Embedding Vector × Email → embedding-email relationships
- Feature Vector × Email Domain → feature-domain alignment
Implementation:
- Extract email features
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
VECTOR × DOMAIN
Relationship Types:
- Vector × Domain Features: vector · domain_features (dot product)
- Vector Norm × Domain Length: ||vector|| * len(domain)
- Vector Mean × TLD Type: mean(vector) * indicator(tld_type)
- Domain Features Normalized by Vector Norm: domain_features / (||vector|| + eps)
Use Cases:
- Embedding Vector × Domain → embedding-domain relationships
- Feature Vector × Domain Type → feature-domain alignment
Implementation:
- Extract domain features
- Extract vector statistics
- Compute interactions
- MLP-ize interactions
URL × URL
Relationship Types:
- Domain Match: indicator(domain_a == domain_b) (same domain?)
- TLD Match: indicator(tld_a == tld_b) (same TLD?)
- Path Depth Difference: url_depth_a - url_depth_b
- Path Depth Ratio: url_depth_a / (url_depth_b + eps)
- URL Length Ratio: len(url_a) / (len(url_b) + eps)
- URL Embedding Similarity: url_embedding_a · url_embedding_b
Use Cases:
- Source URL × Destination URL → source-destination relationships
- Referrer × Current URL → referrer-current alignment
Implementation:
- Extract URL features (domain, TLD, depth, length)
- Extract URL embeddings
- Compute similarity metrics
- MLP-ize interactions
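A sketch of the structural URL pair features using the standard library. It assumes "domain" is the full hostname and treats the TLD as the last dot-separated label, which does not handle multi-label suffixes such as co.uk. The helper names are hypothetical, and the embedding similarity is omitted.

```python
from urllib.parse import urlsplit

EPS = 1e-8  # assumed value for the "eps" term


def _domain(url: str) -> str:
    """Lowercased hostname, or an empty string if the URL has none."""
    return (urlsplit(url).hostname or "").lower()


def _tld(domain: str) -> str:
    """Naive TLD: the last dot-separated label of the hostname."""
    return domain.rsplit(".", 1)[-1] if "." in domain else ""


def _depth(url: str) -> int:
    """Number of non-empty path segments."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def url_pair_features(a: str, b: str, eps: float = EPS) -> dict[str, float]:
    """Relationship features for one (url_a, url_b) pair."""
    domain_a, domain_b = _domain(a), _domain(b)
    return {
        "domain_match": float(domain_a == domain_b),
        "tld_match": float(_tld(domain_a) == _tld(domain_b)),
        "depth_difference": float(_depth(a) - _depth(b)),
        "depth_ratio": _depth(a) / (_depth(b) + eps),
        "length_ratio": len(a) / (len(b) + eps),
    }


features = url_pair_features("https://example.com/a/b", "https://other.com/a")
print(features)
```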
URL × JSON
Relationship Types:
- URL Depth × JSON Depth: url_depth * json_depth
- URL × JSON Keys: indicator(any(url_component in json_keys))
- URL Length × JSON Key Count: len(url) * num_keys(json)
- URL Features × JSON Structure: url_features · json_features
Use Cases:
- URL × JSON Response → url-response relationships
- URL Structure × JSON Data → structure-data alignment
Implementation:
- Extract URL features
- Extract JSON features
- Compute interactions
- MLP-ize interactions
URL × TIMESTAMP
Relationship Types:
- URL Depth × Time Delta: url_depth * delta_time
- URL Features × Temporal Features: url_features · temporal_features
- URL Length × Age: len(url) * age(timestamp)
- Temporal Features × URL Embedding: temporal_features · url_embedding
Use Cases:
- URL × Access Time → url-time relationships
- URL Structure × Event Time → structure-time alignment
Implementation:
- Extract URL features
- Extract temporal features
- Compute interactions
- MLP-ize interactions
URL × EMAIL
Relationship Types:
- URL Domain × Email Domain: indicator(url_domain == email_domain) (same domain?)
- URL Depth × Email Length: url_depth * len(email)
- URL Features × Email Features: url_features · email_features
- Email Domain Length × URL Length: len(email_domain) * len(url)
Use Cases:
- URL × Email Domain → url-email relationships
- URL Structure × Email Provider → structure-provider alignment
Implementation:
- Extract URL features
- Extract email features
- Compute interactions
- MLP-ize interactions
URL × DOMAIN
Relationship Types:
- URL Domain × Domain: indicator(url_domain == domain) (same domain?)
- URL Depth × Domain Length: url_depth * len(domain)
- URL Features × Domain Features: url_features · domain_features
- TLD Match: indicator(url_tld == domain_tld) (same TLD?)
Use Cases:
- URL × Domain → url-domain relationships
- URL Structure × Domain Type → structure-domain alignment
Implementation:
- Extract URL features
- Extract domain features
- Compute interactions
- MLP-ize interactions
JSON × JSON
Relationship Types:
- JSON Depth Difference: json_depth_a - json_depth_b
- JSON Key Overlap: |json_keys_a ∩ json_keys_b| (common keys)
- JSON Key Jaccard: |keys_a ∩ keys_b| / |keys_a ∪ keys_b|
- JSON Key Count Ratio: num_keys(json_a) / (num_keys(json_b) + eps)
- JSON Structure Similarity: json_features_a · json_features_b
Use Cases:
- Request JSON × Response JSON → request-response relationships
- JSON Schema × JSON Data → schema-data alignment
Implementation:
- Extract JSON features (depth, keys, structure)
- Extract JSON embeddings
- Compute similarity metrics
- MLP-ize interactions
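The key-based JSON pair features can be sketched as below. The sketch assumes that only top-level object keys are compared; whether nested keys should count is not specified in this document. The function names are hypothetical.

```python
import json

EPS = 1e-8  # assumed value for the "eps" term


def top_level_keys(raw_json: str) -> set[str]:
    """Keys of the top-level object, or an empty set for non-object JSON."""
    doc = json.loads(raw_json)
    return set(doc) if isinstance(doc, dict) else set()


def json_pair_features(raw_a: str, raw_b: str, eps: float = EPS) -> dict[str, float]:
    """Key-overlap features for one (json_a, json_b) pair."""
    keys_a, keys_b = top_level_keys(raw_a), top_level_keys(raw_b)
    common = keys_a & keys_b
    union = keys_a | keys_b
    return {
        "key_overlap": float(len(common)),
        "key_jaccard": len(common) / len(union) if union else 0.0,
        "key_count_ratio": len(keys_a) / (len(keys_b) + eps),
    }


request = '{"id": 1, "name": "a"}'
response = '{"id": 2, "name": "b", "status": "ok"}'
features = json_pair_features(request, response)
print(features)
```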
JSON × TIMESTAMP
Relationship Types:
- JSON Depth × Time Delta: json_depth * delta_time
- JSON Features × Temporal Features: json_features · temporal_features
- JSON Key Count × Age: num_keys(json) * age(timestamp)
- Temporal Features × JSON Embedding: temporal_features · json_embedding
Use Cases:
- JSON × Creation Time → json-time relationships
- JSON Structure × Event Time → structure-time alignment
Implementation:
- Extract JSON features
- Extract temporal features
- Compute interactions
- MLP-ize interactions
JSON × EMAIL
Relationship Types:
- JSON Key Count × Email Length: num_keys(json) * len(email)
- JSON Features × Email Features: json_features · email_features
- JSON Depth × Domain Length: json_depth * len(email_domain)
- Email Features × JSON Embedding: email_features · json_embedding
Use Cases:
- JSON × Email Domain → json-email relationships
- JSON Structure × Email Provider → structure-provider alignment
Implementation:
- Extract JSON features
- Extract email features
- Compute interactions
- MLP-ize interactions
JSON × DOMAIN
Relationship Types:
- JSON Key Count × Domain Length: num_keys(json) * len(domain)
- JSON Features × Domain Features: json_features · domain_features
- JSON Depth × TLD Type: json_depth * indicator(tld_type)
- Domain Features × JSON Embedding: domain_features · json_embedding
Use Cases:
- JSON × Domain → json-domain relationships
- JSON Structure × Domain Type → structure-domain alignment
Implementation:
- Extract JSON features
- Extract domain features
- Compute interactions
- MLP-ize interactions
TIMESTAMP × TIMESTAMP
Relationship Types:
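- Time Delta: timestamp_a - timestamp_b (signed time difference; use |timestamp_a - timestamp_b| for the absolute gap)
- Time Ratio: timestamp_a / (timestamp_b + eps) (relative time; informative only when both timestamps are measured from a shared reference point rather than the raw epoch)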
- Age Difference: age(timestamp_a) - age(timestamp_b) (relative ages)
- Temporal Feature Differences: hour_a - hour_b, day_a - day_b, month_a - month_b
- Same Day Indicator: indicator(day_a == day_b) (same day?)
- Same Week Indicator: indicator(week_a == week_b) (same week?)
- Same Month Indicator: indicator(month_a == month_b) (same month?)
Use Cases:
- Registration Date × Last Login → registration-login relationships
- Purchase Date × Ship Date → purchase-ship alignment
- Start Time × End Time → duration relationships
Implementation:
- Extract temporal features (hour, day, week, month, age)
- Compute time differences and ratios
- Extract temporal indicators (same day/week/month)
- MLP-ize interactions
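A sketch of the timestamp pair features with the standard library. It assumes both datetimes share a timezone, reads "same day" as the same calendar date, "same week" as the same ISO year and week, and "same month" as the same year and month; these readings and the function name are assumptions, since the document leaves them open.

```python
from datetime import datetime, timezone


def timestamp_pair_features(a: datetime, b: datetime) -> dict[str, float]:
    """Relationship features for one (timestamp_a, timestamp_b) pair."""
    delta_seconds = (a - b).total_seconds()  # signed: positive when a is later
    return {
        "delta_seconds": delta_seconds,
        "abs_delta_seconds": abs(delta_seconds),
        "hour_difference": float(a.hour - b.hour),
        "same_day": float(a.date() == b.date()),
        "same_week": float(a.isocalendar()[:2] == b.isocalendar()[:2]),
        "same_month": float((a.year, a.month) == (b.year, b.month)),
    }


purchased = datetime(2024, 3, 11, 9, 0, tzinfo=timezone.utc)
shipped = datetime(2024, 3, 13, 17, 0, tzinfo=timezone.utc)
features = timestamp_pair_features(shipped, purchased)
print(features)
```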
TIMESTAMP × EMAIL
Relationship Types:
- Age × Email Length: age(timestamp) * len(email)
- Temporal Features × Email Features: temporal_features · email_features
- Time Delta × Domain Length: delta_time * len(email_domain)
- Email Features × Temporal Embedding: email_features · temporal_embedding
Use Cases:
- Registration Date × Email Domain → registration-email relationships
- Event Time × Email Provider → time-provider alignment
Implementation:
- Extract temporal features
- Extract email features
- Compute interactions
- MLP-ize interactions
TIMESTAMP × DOMAIN
Relationship Types:
- Age × Domain Length: age(timestamp) * len(domain)
- Temporal Features × Domain Features: temporal_features · domain_features
- Time Delta × TLD Type: delta_time * indicator(tld_type)
- Domain Features × Temporal Embedding: domain_features · temporal_embedding
Use Cases: - Registration Date × Domain → registration-domain relationships - Event Time × Domain Type → time-domain alignment
Implementation: - Extract temporal features - Extract domain features - Compute interactions - MLP-ize interactions
EMAIL × EMAIL¶
Relationship Types:
- Domain Match: indicator(email_domain_a == email_domain_b) (same domain?)
- TLD Match: indicator(tld_a == tld_b) (same TLD?)
- Email Length Ratio: len(email_a) / (len(email_b) + eps)
- Email Length Difference: len(email_a) - len(email_b)
- Domain Length Ratio: len(email_domain_a) / (len(email_domain_b) + eps)
- Email Embedding Similarity: email_embedding_a · email_embedding_b
Use Cases: - Sender Email × Recipient Email → sender-recipient relationships - Primary Email × Secondary Email → primary-secondary alignment
Implementation: - Extract email features (domain, TLD, length) - Extract email embeddings - Compute similarity metrics - MLP-ize interactions
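As an illustration of the non-embedding features above, the following sketch splits each address at the last "@" and compares the parts. The helper names and the choice to lowercase domains before comparing are assumptions; embedding similarity is left out because it depends on the embedding model.

```python
EPS = 1e-8  # guards the length ratios against division by zero

def split_email(email: str):
    """Return (local_part, domain, tld), lowercasing the domain."""
    local, _, domain = email.rpartition("@")
    domain = domain.lower()
    tld = domain.rsplit(".", 1)[-1]
    return local, domain, tld

def email_pair_features(a: str, b: str) -> dict:
    """Match indicators and length comparisons for two email addresses."""
    _, dom_a, tld_a = split_email(a)
    _, dom_b, tld_b = split_email(b)
    return {
        "domain_match": float(dom_a == dom_b),
        "tld_match": float(tld_a == tld_b),
        "len_ratio": len(a) / (len(b) + EPS),
        "len_diff": len(a) - len(b),
        "domain_len_ratio": len(dom_a) / (len(dom_b) + EPS),
    }

same = email_pair_features("alice@Example.com", "bob@example.com")
other = email_pair_features("x@a.org", "y@b.com")
```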
EMAIL × DOMAIN¶
Relationship Types:
- Email Domain × Domain: indicator(email_domain == domain) (same domain?)
- Email Length × Domain Length: len(email) * len(domain)
- Email Features × Domain Features: email_features · domain_features
- TLD Match: indicator(email_tld == domain_tld) (same TLD?)
Use Cases: - Email × Domain → email-domain relationships - Email Provider × Domain Type → provider-domain alignment
Implementation: - Extract email features - Extract domain features - Compute interactions - MLP-ize interactions
DOMAIN × DOMAIN¶
Relationship Types:
- Domain Match: indicator(domain_a == domain_b) (same domain?)
- TLD Match: indicator(tld_a == tld_b) (same TLD?)
- Domain Length Ratio: len(domain_a) / (len(domain_b) + eps)
- Domain Length Difference: len(domain_a) - len(domain_b)
- Subdomain Relationship: indicator(domain_a is subdomain of domain_b or vice versa)
- Domain Embedding Similarity: domain_embedding_a · domain_embedding_b
Use Cases: - Source Domain × Destination Domain → source-destination relationships - Primary Domain × Secondary Domain → primary-secondary alignment
Implementation: - Extract domain features (length, TLD, subdomain) - Extract domain embeddings - Compute similarity metrics - MLP-ize interactions
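The Subdomain Relationship indicator is the least obvious feature in this list, so here is a small sketch of one way to compute it alongside the other domain-pair features. The names are hypothetical, and the check assumes a strict subdomain (equal domains do not count) matched on a label boundary, so that notexample.com is not treated as a subdomain of example.com.

```python
EPS = 1e-8  # guards the length ratio against division by zero

def normalize(domain: str) -> str:
    """Lowercase and drop surrounding whitespace and a trailing dot."""
    return domain.strip().lower().rstrip(".")

def is_subdomain(child: str, parent: str) -> bool:
    """True if child is a strict subdomain of parent."""
    child, parent = normalize(child), normalize(parent)
    return bool(parent) and child.endswith("." + parent)

def domain_pair_features(a: str, b: str) -> dict:
    na, nb = normalize(a), normalize(b)
    return {
        "domain_match": float(na == nb),
        "tld_match": float(na.rsplit(".", 1)[-1] == nb.rsplit(".", 1)[-1]),
        "len_ratio": len(na) / (len(nb) + EPS),
        "len_diff": len(na) - len(nb),
        # Symmetric: either domain may be the subdomain of the other.
        "subdomain_rel": float(is_subdomain(na, nb) or is_subdomain(nb, na)),
    }

nested = domain_pair_features("mail.example.com", "Example.com.")
lookalike = domain_pair_features("notexample.com", "example.com")
```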
Summary¶
Interaction Categories¶
- Numeric Operations (SCALAR, VECTOR, TIMESTAMP)
  - Ratios, differences, products, sums
  - Normalized differences, log ratios
  - Dot products, distances, similarities
- Set Operations (SET, LIST_OF_A_SET)
  - Intersection, union, difference
  - Jaccard similarity, subset indicators
  - Cardinality ratios and differences
- String Operations (FREE_STRING, URL, EMAIL, DOMAIN)
  - Length ratios and differences
  - Token overlap, Jaccard similarity
  - Embedding similarities, containment indicators
- Structure Operations (JSON, URL, EMAIL, DOMAIN)
  - Depth ratios and differences
  - Key/component overlap
  - Structure similarity
- Temporal Operations (TIMESTAMP)
  - Time deltas, age differences
  - Temporal feature differences
  - Same period indicators (day/week/month)
- Cross-Type Operations
  - Cardinality × Numeric (set size × scalar)
  - Length × Numeric (string length × scalar)
  - Embedding × Numeric (embedding × scalar)
  - Structure × Numeric (depth × scalar)
Implementation Strategy¶
- Type-Aware Relationship Extraction
  - Detect column types from the col_types dict
  - Select appropriate interaction functions based on type pairs
  - Compute relationship features for each pair
- Feature Selection
  - Limit to top-N pairs by MI or correlation (if max_pairwise_ratios is set)
  - Prioritize high-MI pairs for relationship tokens
  - Use upstream hints to weight relationship importance
- MLP-ization
  - Each relationship type gets its own MLP (or shared MLP per category)
  - Project all relationship features to d_model dimension
  - Normalize outputs for stable training
- Masking
  - Zero out relationships involving masked columns
  - Handle missing values gracefully (use epsilon for divisions)
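The extraction, selection and masking steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the RelationshipFeatureExtractor itself: the INTERACTIONS registry, the function names, the precomputed pair_scores (standing in for MI or correlation) and the max_pairs argument (standing in for max_pairwise_ratios) are all hypothetical, timestamps are represented as plain numbers, and the MLP projection to d_model is omitted.

```python
from itertools import combinations

EPS = 1e-8  # epsilon for divisions

def scalar_scalar(a, b):
    return {"ratio": a / (b + EPS), "difference": a - b, "product": a * b}

def timestamp_timestamp(a, b):
    return {"delta": a - b}

# Hypothetical registry: type pair -> interaction function.
# Cross-type entries would additionally need an ordering convention.
INTERACTIONS = {
    ("SCALAR", "SCALAR"): scalar_scalar,
    ("TIMESTAMP", "TIMESTAMP"): timestamp_timestamp,
}

def select_pairs(columns, pair_scores, max_pairs=None):
    """Rank column pairs by score (e.g. MI) and keep the top max_pairs."""
    pairs = list(combinations(columns, 2))
    pairs.sort(key=lambda p: pair_scores.get(p, 0.0), reverse=True)
    return pairs if max_pairs is None else pairs[:max_pairs]

def extract(row, col_types, pair_scores, masked=frozenset(), max_pairs=None):
    """Compute relationship features, zeroing pairs with masked/missing columns."""
    out = {}
    for a, b in select_pairs(list(col_types), pair_scores, max_pairs):
        fn = INTERACTIONS.get((col_types[a], col_types[b]))
        if fn is None:
            continue  # no interaction defined for this type pair
        va, vb = row.get(a), row.get(b)
        hidden = a in masked or b in masked or va is None or vb is None
        # Call fn with placeholder zeros so the feature names stay stable.
        feats = fn(0.0 if va is None else va, 0.0 if vb is None else vb)
        for name, value in feats.items():
            out[f"{a}|{b}|{name}"] = 0.0 if hidden else value
    return out

col_types = {"price": "SCALAR", "qty": "SCALAR",
             "created": "TIMESTAMP", "updated": "TIMESTAMP"}
row = {"price": 10.0, "qty": 4.0, "created": 100.0, "updated": 160.0}
scores = {("price", "qty"): 0.9, ("created", "updated"): 0.5}
feats = extract(row, col_types, scores, masked={"created"}, max_pairs=2)
```

Keeping masked pairs in the output as zeros, rather than dropping them, keeps the feature layout fixed across rows, which is convenient when the features feed a fixed-width projection.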
This comprehensive enumeration provides a roadmap for implementing type-aware relationship features in the RelationshipFeatureExtractor.