
Use Case: Similarity Search

Find similar records using semantic understanding of your data.

When to Use This

  • Product recommendations ("customers who bought X also bought Y")
  • Customer segmentation and lookalike audiences
  • Duplicate detection and deduplication
  • Content recommendations
  • Finding similar cases for support or legal

Complete Implementation

import pandas as pd
from featrixsphere.api import FeatrixSphere

featrix = FeatrixSphere()

# 1. Create Foundational Model
fm = featrix.create_foundational_model(
    name="product_embeddings",
    data_file="products.csv",
    ignore_columns=["product_id", "created_at"]
)
fm.wait_for_training()

# 2. Create Vector Database
vdb = fm.create_vector_database(name="product_search")

# 3. Add records to the database
products_df = pd.read_csv("products.csv")
products = products_df.to_dict('records')

vdb.add_records(products, batch_size=500)
print(f"Added {vdb.size()} products to vector database")

# 4. Find similar products
query_product = {
    "category": "electronics",
    "brand": "Sony",
    "price": 299.99,
    "features": "wireless, noise-canceling, bluetooth"
}

similar = vdb.similarity_search(query_product, k=10)

print("Similar products:")
for match in similar:
    print(f"  Score: {match['similarity']:.3f} - {match['record']['name']}")

Reference Records (Positive-Only Matching)

When you only have examples of what you want (no negative examples):

# Create a reference from your ideal customer
ideal_customer = {
    "age": 35,
    "income": 150000,
    "education": "masters",
    "occupation": "software_engineer",
    "interests": "technology, travel"
}

ref = fm.create_reference_record(
    record=ideal_customer,
    name="high_value_customer_profile"
)

# Find similar prospects in your database
# (prospects_vdb is assumed to be a vector database already populated with prospect records)
similar_prospects = ref.find_similar(k=100, vector_database=prospects_vdb)

for prospect in similar_prospects:
    print(f"Score: {prospect['similarity']:.3f}")
    print(f"  Name: {prospect['record']['name']}")
    print(f"  Email: {prospect['record']['email']}")

Use Case: Product Recommendations

def get_recommendations(purchased_product, vdb, k=5):
    """Get product recommendations based on a purchased product."""
    similar = vdb.similarity_search(purchased_product, k=k+1)

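    # Exclude the purchased product itself. The query may not carry a
    # product_id (the example below does not), so look it up safely.
    purchased_id = purchased_product.get('product_id')
    recommendations = [
        match for match in similar
        if purchased_id is None
        or match['record'].get('product_id') != purchased_id
    ][:k]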

    return recommendations

# User purchased wireless headphones
purchased = {
    "category": "electronics",
    "subcategory": "audio",
    "brand": "Sony",
    "price": 299.99,
    "features": "wireless, noise-canceling"
}

recommendations = get_recommendations(purchased, vdb, k=5)
for rec in recommendations:
    print(f"Recommend: {rec['record']['name']} (similarity: {rec['similarity']:.2f})")

Use Case: Customer Segmentation

# Define segment archetypes
segments = {
    "premium": {"income": 200000, "age": 45, "purchases_per_year": 50},
    "budget": {"income": 40000, "age": 28, "purchases_per_year": 10},
    "frequent": {"income": 80000, "age": 35, "purchases_per_year": 100}
}

# Create reference records for each segment
segment_refs = {}
for name, archetype in segments.items():
    segment_refs[name] = fm.create_reference_record(
        record=archetype,
        name=f"segment_{name}"
    )

# Classify customers into segments
# (customer_vdb is assumed to be a vector database already populated with customers)
def classify_customer(customer, segment_refs):
    # NOTE: the lookup below scores each segment by its closest match in
    # customer_vdb; `customer` itself is not yet part of the comparison.
    best_segment = None
    best_score = -1

    for segment_name, ref in segment_refs.items():
        similar = ref.find_similar(k=1, vector_database=customer_vdb)
        # Score of the closest record to this segment's archetype
        score = similar[0]['similarity'] if similar else 0
        if score > best_score:
            best_score = score
            best_segment = segment_name

    return best_segment, best_score

Use Case: Duplicate Detection

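def find_duplicates(records, vdb, threshold=0.95):
    """Find potential duplicate records, reporting each pair once.

    Assumes every record carries a unique, hashable 'id' field.
    """
    duplicates = []
    seen_pairs = set()

    for record in records:
        similar = vdb.similarity_search(record, k=5)

        for match in similar:
            other = match['record']

            # Skip self-match
            if other.get('id') == record.get('id'):
                continue

            # A-B and B-A describe the same pair; keep only the first
            pair = frozenset((record.get('id'), other.get('id')))
            if pair in seen_pairs:
                continue

            if match['similarity'] >= threshold:
                seen_pairs.add(pair)
                duplicates.append({
                    'record_1': record,
                    'record_2': other,
                    'similarity': match['similarity']
                })

    return duplicates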

# Find duplicate customer records
potential_dupes = find_duplicates(customer_records, customer_vdb, threshold=0.9)
print(f"Found {len(potential_dupes)} potential duplicates")

for dupe in potential_dupes:
    print(f"Similarity: {dupe['similarity']:.3f}")
    print(f"  Record 1: {dupe['record_1']['name']} - {dupe['record_1']['email']}")
    print(f"  Record 2: {dupe['record_2']['name']} - {dupe['record_2']['email']}")
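Pairwise results can hide larger clusters: if A matches B and B matches C, all three are probably the same entity. As a rough sketch, the pairs can be merged into groups with a small union-find. The helper name `group_duplicates` is hypothetical, and the example assumes each record has an `id` you can pull out of `record_1` and `record_2`; nothing here is part of the Featrix API.

```python
def group_duplicates(pairs):
    """Merge (id_a, id_b) pairs into groups of connected ids (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Made-up id pairs, e.g. (dupe['record_1']['id'], dupe['record_2']['id'])
example_pairs = [(1, 2), (2, 3), (7, 8)]
example_groups = group_duplicates(example_pairs)
```

Each returned set is one candidate group to review or merge; ids that never appear in a pair are simply absent.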

Use Case: Lookalike Audience

# Start with your best customers
best_customers = [
    {"age": 32, "income": 85000, "purchases": 25, "category_pref": "electronics"},
    {"age": 28, "income": 95000, "purchases": 30, "category_pref": "fashion"},
    {"age": 35, "income": 120000, "purchases": 45, "category_pref": "home"}
]

# Find lookalikes for each best customer
lookalikes = set()
for customer in best_customers:
    ref = fm.create_reference_record(record=customer, name="temp_ref")
    similar = ref.find_similar(k=50, vector_database=prospects_vdb)

    for match in similar:
        if match['similarity'] > 0.8:
            lookalikes.add(match['record']['prospect_id'])

print(f"Found {len(lookalikes)} lookalike prospects")
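Collecting ids in a set discards the scores, so the audience cannot be ranked afterwards. One possible variation, sketched below, keeps the best score seen for each prospect across all seed customers. The helper `merge_lookalikes` is hypothetical; it only assumes the match shape used throughout this page (a `similarity` value and a `record` dict) and the same strict `> 0.8` cutoff as above.

```python
def merge_lookalikes(result_lists, id_field="prospect_id", threshold=0.8):
    """Merge several result lists, keeping each prospect's best score."""
    best = {}
    for matches in result_lists:
        for match in matches:
            score = match["similarity"]
            if score <= threshold:
                continue  # same strict cutoff as the loop above
            prospect_id = match["record"][id_field]
            if score > best.get(prospect_id, 0.0):
                best[prospect_id] = score
    # Highest-scoring prospects first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up results standing in for two ref.find_similar(...) calls
results_a = [
    {"similarity": 0.91, "record": {"prospect_id": "p1"}},
    {"similarity": 0.85, "record": {"prospect_id": "p2"}},
]
results_b = [
    {"similarity": 0.88, "record": {"prospect_id": "p2"}},
    {"similarity": 0.70, "record": {"prospect_id": "p3"}},
]
ranked = merge_lookalikes([results_a, results_b])
```

Here `ranked` is a list of `(prospect_id, best_score)` tuples, which makes it easy to take only the top N prospects for a campaign.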

Getting Raw Embeddings

For custom similarity logic or external vector stores:

# Encode records to vectors
records = [
    {"category": "electronics", "price": 100},
    {"category": "clothing", "price": 50}
]

# Full embeddings
embeddings = fm.encode(records)
for e in embeddings:
    print(f"3D projection: {e['embedding']}")        # [x, y, z]
    print(f"Full embedding: {e['embedding_long']}")  # [v1, v2, ..., vN]

# Just 3D vectors (for visualization)
vectors_3d = fm.encode(records, short=True)
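For the custom similarity logic mentioned above, cosine similarity over the full embedding vectors is one common choice. The sketch below uses only the standard library and assumes the vectors are plain lists of floats, like the `embedding_long` values shown above. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and this page does not say whether Featrix's own scores are computed this way, so treat the numbers as illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates, k=10):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for e['embedding_long'] values
query_vec = [1.0, 0.0, 1.0]
candidate_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 1.0],  # close to the query, but not identical
]
top = rank_by_similarity(query_vec, candidate_vecs, k=2)
```

A brute-force scan like this is fine for small sets; for larger collections an external vector store would normally take over the ranking step.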

Understanding Similarity Scores

Similarity scores range from 0 to 1:

Score          Interpretation
0.95+          Near-duplicate or extremely similar
0.80 - 0.95    Very similar, likely same category/type
0.60 - 0.80    Moderately similar, some shared characteristics
0.40 - 0.60    Weakly similar
< 0.40         Dissimilar
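If you want these bands in code, for logging or quick triage, a small lookup is enough. This is a sketch only: `interpret_score` is a hypothetical helper, and because the table does not say which band owns a boundary value, the code assumes a score exactly on a boundary (for example 0.80) belongs to the higher band.

```python
def interpret_score(score):
    """Map a similarity score in [0, 1] to the bands in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity score must be between 0 and 1")
    if score >= 0.95:
        return "near-duplicate"
    if score >= 0.80:
        return "very similar"
    if score >= 0.60:
        return "moderately similar"
    if score >= 0.40:
        return "weakly similar"
    return "dissimilar"


# Label a few made-up scores
labels = [interpret_score(s) for s in (0.97, 0.85, 0.65, 0.45, 0.10)]
```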

Performance Tips

  1. Batch record addition: Use the batch_size parameter for large datasets

    vdb.add_records(large_dataset, batch_size=1000)
    

  2. Limit k for speed: Only request as many results as you need

    similar = vdb.similarity_search(query, k=10)  # Not k=1000
    

  3. Pre-filter when possible: If you can narrow the search space, create multiple vector databases

    electronics_vdb = fm.create_vector_database(name="electronics_products")
    clothing_vdb = fm.create_vector_database(name="clothing_products")
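Before filling separate databases, the records need to be split by whatever field defines the partition. A minimal sketch using only the standard library follows; `partition_by` is a hypothetical helper, and the sample records and the `category` field are made up for illustration.

```python
from collections import defaultdict


def partition_by(records, key):
    """Group record dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Made-up records standing in for rows loaded from products.csv
sample_products = [
    {"name": "Headphones", "category": "electronics"},
    {"name": "T-shirt", "category": "clothing"},
    {"name": "Speaker", "category": "electronics"},
]
by_category = partition_by(sample_products, "category")

# Each group could then go to its own database, for example:
# electronics_vdb.add_records(by_category["electronics"], batch_size=500)
```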
    

Best Practices

  1. Use ignore_columns for IDs and timestamps - Pass ignore_columns=["id", "created_at"] when creating your Foundational Model. These columns add noise without semantic value.

  2. Thresholds depend on your use case:

       • Duplicate detection: 0.90+ (you want high confidence)
       • Recommendations: 0.70+ (some variety is good)
       • Lookalike audiences: 0.75+ (similar but not identical)

  3. Reference records beat searching for examples - Instead of finding the "perfect" customer in your data, define one with create_reference_record(). You control exactly what "ideal" means.

  4. Batch your additions - When populating a vector database, use batch_size:

    vdb.add_records(records, batch_size=500)

  5. Partition by category when it makes sense - If you'll never compare electronics to clothing, create separate vector databases. Smaller databases = faster searches.

  6. The model handles mixed types automatically - Don't worry about normalizing prices vs. categories vs. text descriptions. Featrix learns how each column type contributes to similarity.
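The per-use-case thresholds listed above can be kept in one place and applied as a simple filter. The sketch below is illustrative: `THRESHOLDS` and `filter_matches` are hypothetical names, the values are the starting points from the list and should be tuned on your own data, and the matches are made-up dicts in the same shape that the search examples on this page return.

```python
# Starting thresholds taken from the list above; tune them on your own data
THRESHOLDS = {
    "duplicate_detection": 0.90,
    "recommendations": 0.70,
    "lookalike_audiences": 0.75,
}


def filter_matches(matches, use_case):
    """Keep matches whose similarity meets the threshold for `use_case`."""
    threshold = THRESHOLDS[use_case]
    return [m for m in matches if m["similarity"] >= threshold]


# Made-up matches in the shape used by the search examples
sample_matches = [
    {"similarity": 0.93, "record": {"name": "A"}},
    {"similarity": 0.78, "record": {"name": "B"}},
    {"similarity": 0.55, "record": {"name": "C"}},
]
likely_duplicates = filter_matches(sample_matches, "duplicate_detection")
recommendable = filter_matches(sample_matches, "recommendations")
```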