# Use Case: Similarity Search

Find similar records using semantic understanding of your data.
## When to Use This

- Product recommendations ("customers who bought X also bought Y")
- Customer segmentation and lookalike audiences
- Duplicate detection and deduplication
- Content recommendations
- Finding similar cases for support or legal
## Complete Implementation

```python
import pandas as pd
from featrixsphere.api import FeatrixSphere

featrix = FeatrixSphere()

# 1. Create Foundational Model
fm = featrix.create_foundational_model(
    name="product_embeddings",
    data_file="products.csv",
    ignore_columns=["product_id", "created_at"]
)
fm.wait_for_training()

# 2. Create Vector Database
vdb = fm.create_vector_database(name="product_search")

# 3. Add records to the database
products_df = pd.read_csv("products.csv")
products = products_df.to_dict('records')

vdb.add_records(products, batch_size=500)
print(f"Added {vdb.size()} products to vector database")

# 4. Find similar products
query_product = {
    "category": "electronics",
    "brand": "Sony",
    "price": 299.99,
    "features": "wireless, noise-canceling, bluetooth"
}

similar = vdb.similarity_search(query_product, k=10)

print("Similar products:")
for match in similar:
    print(f"  Score: {match['similarity']:.3f} - {match['record']['name']}")
```
## Reference Records (Positive-Only Matching)

When you only have examples of what you want (no negative examples):

```python
# Create a reference from your ideal customer
ideal_customer = {
    "age": 35,
    "income": 150000,
    "education": "masters",
    "occupation": "software_engineer",
    "interests": "technology, travel"
}

ref = fm.create_reference_record(
    record=ideal_customer,
    name="high_value_customer_profile"
)

# Find similar prospects in your database
similar_prospects = ref.find_similar(k=100, vector_database=prospects_vdb)

for prospect in similar_prospects:
    print(f"Score: {prospect['similarity']:.3f}")
    print(f"  Name: {prospect['record']['name']}")
    print(f"  Email: {prospect['record']['email']}")
```
## Use Case: Product Recommendations

```python
def get_recommendations(purchased_product, vdb, k=5):
    """Get product recommendations based on a purchased product."""
    # Request one extra result in case the purchased product matches itself
    similar = vdb.similarity_search(purchased_product, k=k + 1)

    # Exclude the purchased product itself (use .get() since ad-hoc
    # query records may not carry a product_id)
    recommendations = [
        match for match in similar
        if match['record'].get('product_id') != purchased_product.get('product_id')
    ][:k]
    return recommendations

# User purchased wireless headphones
purchased = {
    "category": "electronics",
    "subcategory": "audio",
    "brand": "Sony",
    "price": 299.99,
    "features": "wireless, noise-canceling"
}

recommendations = get_recommendations(purchased, vdb, k=5)
for rec in recommendations:
    print(f"Recommend: {rec['record']['name']} (similarity: {rec['similarity']:.2f})")
```
## Use Case: Customer Segmentation

```python
# Define segment archetypes
segments = {
    "premium": {"income": 200000, "age": 45, "purchases_per_year": 50},
    "budget": {"income": 40000, "age": 28, "purchases_per_year": 10},
    "frequent": {"income": 80000, "age": 35, "purchases_per_year": 100}
}

# Store the archetypes (tagged with their segment name) in a small vector database
segment_vdb = fm.create_vector_database(name="segment_archetypes")
segment_vdb.add_records(
    [{**archetype, "segment": name} for name, archetype in segments.items()]
)

# Classify a customer into the segment whose archetype it most resembles
def classify_customer(customer, segment_vdb):
    matches = segment_vdb.similarity_search(customer, k=1)
    if not matches:
        return None, 0.0
    return matches[0]['record']['segment'], matches[0]['similarity']

# Going the other direction - pulling the customers that best match one
# segment - is a job for a reference record:
premium_ref = fm.create_reference_record(
    record=segments["premium"],
    name="segment_premium"
)
premium_customers = premium_ref.find_similar(k=100, vector_database=customer_vdb)
```
## Use Case: Duplicate Detection

```python
def find_duplicates(records, vdb, threshold=0.95):
    """Find potential duplicate records."""
    duplicates = []
    for record in records:
        similar = vdb.similarity_search(record, k=5)
        for match in similar:
            # Skip self-match
            if match['record'].get('id') == record.get('id'):
                continue
            if match['similarity'] >= threshold:
                duplicates.append({
                    'record_1': record,
                    'record_2': match['record'],
                    'similarity': match['similarity']
                })
    return duplicates

# Find duplicate customer records
potential_dupes = find_duplicates(customer_records, customer_vdb, threshold=0.9)

print(f"Found {len(potential_dupes)} potential duplicates")
for dupe in potential_dupes:
    print(f"Similarity: {dupe['similarity']:.3f}")
    print(f"  Record 1: {dupe['record_1']['name']} - {dupe['record_1']['email']}")
    print(f"  Record 2: {dupe['record_2']['name']} - {dupe['record_2']['email']}")
```
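Because each record is searched independently, the loop above can report the same pair twice, once in each direction. A small post-processing step in plain Python (not part of the Featrix API) collapses symmetric pairs by keying on the sorted pair of `id` values:

```python
def dedupe_pairs(duplicates):
    """Collapse (A, B) / (B, A) into one entry, keeping the best score."""
    best = {}
    for d in duplicates:
        # Order-independent key for the pair
        key = tuple(sorted([d['record_1']['id'], d['record_2']['id']]))
        if key not in best or d['similarity'] > best[key]['similarity']:
            best[key] = d
    return list(best.values())

# Toy input: the 1-2 pair was reported in both directions
pairs = [
    {'record_1': {'id': 1}, 'record_2': {'id': 2}, 'similarity': 0.97},
    {'record_1': {'id': 2}, 'record_2': {'id': 1}, 'similarity': 0.97},
    {'record_1': {'id': 1}, 'record_2': {'id': 3}, 'similarity': 0.93},
]
unique = dedupe_pairs(pairs)
print(len(unique))  # 2
```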
## Use Case: Lookalike Audience

```python
# Start with your best customers
best_customers = [
    {"age": 32, "income": 85000, "purchases": 25, "category_pref": "electronics"},
    {"age": 28, "income": 95000, "purchases": 30, "category_pref": "fashion"},
    {"age": 35, "income": 120000, "purchases": 45, "category_pref": "home"}
]

# Find lookalikes for each best customer
lookalikes = set()
for i, customer in enumerate(best_customers):
    # Give each temporary reference a unique name
    ref = fm.create_reference_record(record=customer, name=f"temp_ref_{i}")
    similar = ref.find_similar(k=50, vector_database=prospects_vdb)
    for match in similar:
        if match['similarity'] > 0.8:
            lookalikes.add(match['record']['prospect_id'])

print(f"Found {len(lookalikes)} lookalike prospects")
```
## Getting Raw Embeddings

For custom similarity logic or external vector stores:

```python
# Encode records to vectors
records = [
    {"category": "electronics", "price": 100},
    {"category": "clothing", "price": 50}
]

# Full embeddings
embeddings = fm.encode(records)
for e in embeddings:
    print(f"3D projection: {e['embedding']}")        # [x, y, z]
    print(f"Full embedding: {e['embedding_long']}")  # [v1, v2, ..., vN]

# Just 3D vectors (for visualization)
vectors_3d = fm.encode(records, short=True)
```
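If you push the `embedding_long` vectors into an external vector store, you will typically compare them with cosine similarity. Here is a minimal self-contained sketch in plain Python; the two hard-coded vectors stand in for real `embedding_long` outputs, and whether Featrix uses cosine internally is an assumption, not something this document guarantees:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for two records' embedding_long vectors
vec_a = [0.2, 0.8, 0.1, 0.4]
vec_b = [0.1, 0.9, 0.0, 0.5]
print(f"{cosine_similarity(vec_a, vec_b):.3f}")
```

For production workloads, a vector library (NumPy, FAISS, or the store's built-in metric) will be much faster than this pure-Python loop.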
## Understanding Similarity Scores

Similarity scores range from 0 to 1:

| Score | Interpretation |
|---|---|
| 0.95+ | Near-duplicate or extremely similar |
| 0.80 - 0.95 | Very similar, likely same category/type |
| 0.60 - 0.80 | Moderately similar, some shared characteristics |
| 0.40 - 0.60 | Weakly similar |
| < 0.40 | Dissimilar |
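For reporting or logging, these bands translate directly into a small helper; the labels and cutoffs below simply mirror the table and are not part of the Featrix API:

```python
def interpret_score(score):
    """Map a 0-1 similarity score to the bands in the table above."""
    if score >= 0.95:
        return "near-duplicate"
    if score >= 0.80:
        return "very similar"
    if score >= 0.60:
        return "moderately similar"
    if score >= 0.40:
        return "weakly similar"
    return "dissimilar"

print(interpret_score(0.97))  # near-duplicate
print(interpret_score(0.72))  # moderately similar
```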
## Performance Tips

- **Batch record addition**: use the `batch_size` parameter for large datasets.
- **Limit `k` for speed**: only request as many results as you need.
- **Pre-filter when possible**: if you can narrow the search space, create multiple vector databases.
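Pre-filtering can be as simple as routing each query to a per-category database. A sketch of that routing follows; the `_StubVDB` class only stands in for real vector database handles so the example runs standalone, and the `category` key and database names are illustrative assumptions:

```python
class _StubVDB:
    """Stand-in for a Featrix vector database handle."""
    def __init__(self, name):
        self.name = name
    def similarity_search(self, query, k=10):
        return [{"db": self.name, "query": query, "k": k}]

# One database per category, plus a catch-all
vdbs = {
    "electronics": _StubVDB("electronics"),
    "all_products": _StubVDB("all_products"),
}

def route_search(query, vdbs, default="all_products", k=10):
    """Search only the vector database matching the query's category."""
    vdb = vdbs.get(query.get("category"), vdbs[default])
    return vdb.similarity_search(query, k=k)

print(route_search({"category": "electronics"}, vdbs)[0]["db"])  # electronics
print(route_search({"category": "garden"}, vdbs)[0]["db"])       # all_products
```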
## Best Practices

- **Use `ignore_columns` for IDs and timestamps** - Pass `ignore_columns=["id", "created_at"]` when creating your Foundational Model. These columns add noise without semantic value.
- **Thresholds depend on your use case**:
    - Duplicate detection: 0.90+ (you want high confidence)
    - Recommendations: 0.70+ (some variety is good)
    - Lookalike audiences: 0.75+ (similar but not identical)
- **Reference records beat searching for examples** - Instead of finding the "perfect" customer in your data, define one with `create_reference_record()`. You control exactly what "ideal" means.
- **Batch your additions** - When populating a vector database, use `batch_size`.
- **Partition by category when it makes sense** - If you'll never compare electronics to clothing, create separate vector databases. Smaller databases = faster searches.
- **The model handles mixed types automatically** - Don't worry about normalizing prices vs. categories vs. text descriptions. Featrix learns how each column type contributes to similarity.
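If you ever need to batch manually, for example when streaming records in from another system, the client-side chunking that `batch_size` performs for you is straightforward to reproduce. The `add_records` call is shown commented out because it needs a live database:

```python
def chunks(records, size=500):
    """Yield successive batches of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

records = [{"id": i} for i in range(1234)]
batch_sizes = [len(batch) for batch in chunks(records, size=500)]
print(batch_sizes)  # [500, 500, 234]

# With a live vector database this would be:
# for batch in chunks(records, size=500):
#     vdb.add_records(batch)
```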