Working with Hidden Metadata Columns¶
Metadata columns let you include data in your records that is stored and returned with search results but never influences the model. They suit IDs, timestamps, descriptions, or any contextual information you need to retrieve but don't want the model to learn from.
The __featrix_meta Prefix¶
Any column whose name starts with __featrix_meta_ (two leading underscores, and an underscore after "meta") is treated as metadata:
color,size,price,__featrix_meta_id,__featrix_meta_notes
red,small,10,SKU-001,Popular item
blue,large,20,SKU-002,New arrival
green,medium,15,SKU-003,Clearance
In this example:
- color, size, price: trained features that influence embeddings
- __featrix_meta_id, __featrix_meta_notes: stored but never trained on
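The split between trained and metadata columns can be reproduced locally to check a file before uploading it. The following is a minimal sketch using only the Python standard library; `META_PREFIX` and `split_columns` are illustrative names, not part of the Featrix API, and the check simply mirrors the prefix rule described in this guide.

```python
import csv
import io

# Prefix as described in this guide (hypothetical constant name).
META_PREFIX = "__featrix_meta_"

CSV_TEXT = """color,size,price,__featrix_meta_id,__featrix_meta_notes
red,small,10,SKU-001,Popular item
blue,large,20,SKU-002,New arrival
green,medium,15,SKU-003,Clearance
"""


def split_columns(header):
    """Return (trained, metadata) column names, preserving header order."""
    trained = [c for c in header if not c.startswith(META_PREFIX)]
    metadata = [c for c in header if c.startswith(META_PREFIX)]
    return trained, metadata


reader = csv.DictReader(io.StringIO(CSV_TEXT))
trained_cols, meta_cols = split_columns(reader.fieldnames)
print("trained:", trained_cols)
print("metadata:", meta_cols)
```

Running a check like this before training makes it easy to spot a column that was meant to be metadata but ended up in the trained list.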
Why Use Metadata Columns?¶
1. Keep IDs Out of Training¶
Record IDs, UUIDs, and transaction IDs are meaningless to the model—they're just noise. But you need them to identify records in your results.
customer_name,revenue,industry,__featrix_meta_customer_id
Acme Corp,500000,manufacturing,CUST-12345
TechStart,250000,software,CUST-67890
2. Preserve Timestamps¶
Timestamps for when records were created shouldn't affect similarity—a customer from January isn't inherently different from one in March. But you may want to know when they joined.
name,plan,monthly_spend,__featrix_meta_signup_date,__featrix_meta_last_login
Alice,premium,199,2024-01-15,2024-03-01
Bob,basic,29,2024-02-20,2024-03-02
3. Add Human-Readable Context¶
Include descriptions, notes, or display names that help you understand results without polluting the embedding space.
sku,category,price,__featrix_meta_display_name,__featrix_meta_warehouse_notes
WDG-001,tools,10.99,Deluxe Widget Set,Shelf A3 - high turnover
GDG-002,electronics,29.99,Smart Gadget Pro,Shelf B7 - fragile
How It Works¶
- During Training: Metadata columns are automatically detected and excluded. They don't participate in type detection, encoding, or embedding generation.
- In Vector Databases: Metadata is stored alongside embeddings in the vector database.
- In Search Results: When you search, metadata columns come back with every matched record.
results = vdb.similarity_search(query, k=5)
for match in results:
    # Trained features
    print(match['record']['category'])
    print(match['record']['price'])
    # Metadata (stored but never trained on)
    print(match['record']['__featrix_meta_id'])
    print(match['record']['__featrix_meta_notes'])
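To make the three stages concrete, here is a toy in-memory store that follows the same pattern: the vector is built from trained features only, while the full record (metadata included) is stored and returned. This is an illustration under stated assumptions, not how Featrix is implemented: `toy_embed`, `ToyVectorStore`, and the `weight` column are hypothetical, and the "embedding" is just the raw numeric features compared with cosine similarity.

```python
import math


def toy_embed(record, feature_order):
    """Stand-in for a trained model: uses only the listed trained features."""
    return [float(record[name]) for name in feature_order]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self, feature_order):
        self.feature_order = feature_order
        self.rows = []  # (embedding, full record including metadata)

    def add_records(self, records):
        for record in records:
            self.rows.append((toy_embed(record, self.feature_order), record))

    def similarity_search(self, query, k=5):
        q = toy_embed(query, self.feature_order)
        scored = [
            {"record": rec, "similarity": cosine(q, emb)}
            for emb, rec in self.rows
        ]
        scored.sort(key=lambda m: m["similarity"], reverse=True)
        return scored[:k]


store = ToyVectorStore(feature_order=["price", "weight"])
store.add_records([
    {"price": 10, "weight": 1, "__featrix_meta_id": "SKU-001"},
    {"price": 20, "weight": 9, "__featrix_meta_id": "SKU-002"},
])

top = store.similarity_search({"price": 11, "weight": 1}, k=1)[0]
print(top["record"]["__featrix_meta_id"], round(top["similarity"], 3))
```

Because `toy_embed` never reads the metadata keys, changing a metadata value cannot change a similarity score, which is the property the list above describes.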
Naming Rules¶
Valid metadata column names:
- __featrix_meta_id
- __featrix_meta_timestamp
- __featrix_meta_anything_you_want
These won't work (not recognized as metadata):
- __featrix_metaid: missing underscore after "meta"
- _featrix_meta_id: single underscore at start
- meta_id: doesn't have the required prefix
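A near-miss name is silently treated as a trained feature, so it can be worth linting column names before training. The sketch below encodes the rules listed above; `is_metadata_column` and `looks_like_near_miss` are hypothetical helpers, and the near-miss heuristic (any name mentioning "featrix_meta" or starting with "meta_") is an assumption chosen to catch the three examples in this section.

```python
META_PREFIX = "__featrix_meta_"


def is_metadata_column(name):
    """True when the name carries the exact metadata prefix."""
    return name.startswith(META_PREFIX)


def looks_like_near_miss(name):
    """Flag names that resemble the prefix but would not be recognized."""
    if is_metadata_column(name):
        return False
    lowered = name.lower()
    return "featrix_meta" in lowered or lowered.startswith("meta_")


columns = [
    "price",
    "__featrix_meta_id",
    "__featrix_metaid",
    "_featrix_meta_id",
    "meta_id",
]

for col in columns:
    if is_metadata_column(col):
        status = "metadata"
    elif looks_like_near_miss(col):
        status = "WARNING: looks like metadata but will be trained on"
    else:
        status = "trained feature"
    print(f"{col}: {status}")
```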
Example: Customer Similarity with Metadata¶
from featrixsphere.api import FeatrixSphere
import pandas as pd

# Data with metadata columns
data = pd.DataFrame({
    'industry': ['software', 'manufacturing', 'retail', 'software'],
    'employee_count': [50, 200, 100, 75],
    'annual_revenue': [5000000, 10000000, 3000000, 8000000],
    '__featrix_meta_company_id': ['C001', 'C002', 'C003', 'C004'],
    '__featrix_meta_account_manager': ['Alice', 'Bob', 'Alice', 'Carol'],
    '__featrix_meta_notes': ['Key account', 'Expanding', 'At risk', 'New customer']
})

featrix = FeatrixSphere()

# Metadata columns are automatically excluded from training
fm = featrix.create_foundational_model(
    name="customer_embeddings",
    df=data
)
fm.wait_for_training()

# Create vector database (metadata is stored)
vdb = fm.create_vector_database(name="customers")
vdb.add_records(data.to_dict('records'))

# Search returns metadata with results
similar = vdb.similarity_search({
    'industry': 'software',
    'employee_count': 60,
    'annual_revenue': 6000000
}, k=3)

for match in similar:
    print(f"Company: {match['record']['__featrix_meta_company_id']}")
    print(f"Manager: {match['record']['__featrix_meta_account_manager']}")
    print(f"Notes: {match['record']['__featrix_meta_notes']}")
    print(f"Similarity: {match['similarity']:.3f}")
    print()
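When results feed a report or UI, it can help to pull the metadata out into plain rows with the prefix stripped. The following sketch assumes matches shaped like the ones in the example above (a `record` dict plus a `similarity` score); `summarize` is a hypothetical helper, and the mock matches and their scores are placeholder values, not real output.

```python
META_PREFIX = "__featrix_meta_"


def summarize(matches):
    """Build one flat row per match: similarity plus de-prefixed metadata."""
    rows = []
    for match in matches:
        row = {"similarity": round(match["similarity"], 3)}
        for key, value in match["record"].items():
            if key.startswith(META_PREFIX):
                row[key[len(META_PREFIX):]] = value
        rows.append(row)
    return rows


# Mock matches with made-up similarity scores, for illustration only.
mock_results = [
    {
        "record": {
            "industry": "software",
            "employee_count": 50,
            "__featrix_meta_company_id": "C001",
            "__featrix_meta_notes": "Key account",
        },
        "similarity": 0.98,
    },
    {
        "record": {
            "industry": "software",
            "employee_count": 75,
            "__featrix_meta_company_id": "C004",
            "__featrix_meta_notes": "New customer",
        },
        "similarity": 0.91,
    },
]

for row in summarize(mock_results):
    print(row)
```

Trained features are left out of the rows on purpose here; add them back if the report needs them.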
vs. ignore_columns¶
Both exclude columns from training, but they serve different purposes:
| | ignore_columns | __featrix_meta_<name> |
|---|---|---|
| Excluded from training | Yes | Yes |
| Stored in vector database | No | Yes |
| Returned in search results | No | Yes |
| Use for | Columns you don't need at all | Columns you need for context |
Use ignore_columns for true throwaway columns. Use metadata columns for data you want to retrieve but not train on.