Working with Limited Labels and Background Data¶
Real-world ML projects rarely have perfect data. You might have 100,000 customer records but only 500 with known outcomes. You might have one successful example and want to find more like it. You might have labels only for recent data but want to leverage historical patterns.
Featrix is designed for exactly these scenarios.
The Problem with Traditional ML¶
Traditional supervised learning requires labeled data for everything:
Traditional approach:
- 100K customer records
- Only 500 have labels → throw away 99.5K records
- Train on 500 labeled records
- Model has never seen 99.5% of your data distribution
This is wasteful. Your unlabeled data contains valuable information about:
- The full range of customer behaviors
- Edge cases and outliers
- Feature correlations and distributions
- Rare patterns that might only appear in the unlabeled data
How Featrix Solves This¶
Featrix separates two concerns:
- Understanding the data shape (Foundational Model) - trains on ALL data, labeled or not
- Learning the prediction task (Predictor) - trains only on labeled data
Featrix approach:
- 100K customer records
- Foundational Model trains on ALL 100K (no labels needed)
- Predictor trains on 500 labeled records
- Model understands the full data distribution
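In code, the two concerns map directly onto two calls. Here is a minimal sketch using the same calls covered in the scenarios below; the file name and target column are placeholders for your own data:
from featrixsphere import FeatrixSphere
featrix = FeatrixSphere()
# Stage 1: Foundational Model -- learns the shape of ALL records, no labels required
fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv"  # placeholder file name
)
fm.wait_for_training()
# Stage 2: Predictor -- learns the task from whatever labeled rows you have
predictor = fm.create_predictor(target_column="outcome")  # placeholder column name
predictor.wait_for_training()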
The Key Insight: Shape Augmentation¶
The Foundational Model learns the "shape" of your data:
- Where different types of records cluster
- The boundaries between different regions
- The density and distribution of data points
- Rare vs. common patterns
- Correlations and relationships between features
The Predictor then operates in this well-understood space. Even though it only trains on labeled examples, those examples are embedded in a space that understands the full data landscape.
Scenario 1: You Have One Good Example (No Labels)¶
The most common real-world scenario: you have one successful customer, one bestseller, one winning case. You want to find more like it.
Traditional ML: Can't do it. You need labeled data with both positive and negative examples.
Featrix: Create a waypoint. Find similar records. Done.
from featrixsphere import FeatrixSphere
featrix = FeatrixSphere()
# Create Foundational Model from your background data
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv", # 10,000 customers
ignore_columns=["customer_id"]
)
fm.wait_for_training()
# Your one successful example
successful_customer = {
"age": 35,
"income": 50000,
"location": "NYC",
"plan_type": "premium"
}
# Find similar customers - no labels needed
similar = fm.similarity_search(successful_customer, k=100)
# These are your "positive class" candidates
for result in similar:
print(f"Distance: {result.distance:.3f}, Record: {result.record}")
The Foundational Model learned the structure of your customer data. Similarity search finds records that occupy the same region of embedding space as your successful example.
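If you want to review the candidates in bulk, you can load them into a pandas DataFrame. This is a small sketch that assumes each result.record behaves like a plain dict, as the loop above suggests:
import pandas as pd
# Collect the matched records and their distances into one table for review
candidates = pd.DataFrame([r.record for r in similar])
candidates["distance"] = [r.distance for r in similar]
print(candidates.sort_values("distance").head(10))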
Scenario 2: Partial Labels (Most Data Unlabeled)¶
You have 100K records but only 5K have labels. Traditional ML would discard 95K records.
import pandas as pd
# Load all your data
df = pd.read_csv('all_data.csv') # 100K rows
# Mark which rows have labels
df['__featrix_train_predictor'] = df['outcome'].notna() # True for 5K rows
# Create Foundational Model - trains on ALL 100K
fm = featrix.create_foundational_model(
name="full_model",
df=df
)
fm.wait_for_training()
# Create Predictor - trains only on 5K labeled rows
predictor = fm.create_predictor(
target_column='outcome'
)
predictor.wait_for_training()
What happens:
1. Foundational Model training: Learns patterns from all 100K records
    - Understands the full range of feature values
    - Learns correlations across ALL data
    - Recognizes rare patterns that might only appear in unlabeled data
2. Predictor training: Trains on the 5K labeled records
    - Learns that certain outcomes correlate with certain embedding regions
    - But operates in an embedding space that understands the full 100K-record distribution
3. Prediction: When you predict on new data
    - If a new record looks similar to one of the 95K unlabeled records, the embedding space recognizes it
    - The predictor can make informed predictions even in regions where it didn't see labels
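A quick sanity check confirms which rows each stage sees: the Foundational Model trains on every row, while the Predictor uses only the rows where the flag is True.
# The Foundational Model sees every row; the Predictor sees only flagged rows
print(len(df))                                # 100,000 rows total
print(df['__featrix_train_predictor'].sum())  # ~5,000 labeled rows for the Predictor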
Scenario 3: Recent Labels, Historical Data¶
You have 5 years of transaction data but only the last 6 months have fraud labels.
# All transactions (5 years)
df = pd.read_csv('all_transactions.csv') # 1M rows
# Only recent transactions have labels (parse dates so the comparison is robust)
df['__featrix_train_predictor'] = pd.to_datetime(df['date']) >= '2025-06-01'
# Foundational Model learns from ALL 5 years
# Predictor trains on recent labeled data
# Predictions benefit from historical patterns
The embedding space understands transaction patterns from 5 years of data. When a new transaction matches a pattern from 2022, the model recognizes it—even though no 2022 transactions have fraud labels.
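Putting the full flow together uses the same calls as Scenario 2. The model name and the 'is_fraud' label column below are assumptions; substitute your own:
# Foundational Model learns from all 5 years of transactions
fm = featrix.create_foundational_model(
    name="transactions_model",  # assumed model name
    df=df
)
fm.wait_for_training()
# Predictor trains only on the flagged (recent, labeled) rows
predictor = fm.create_predictor(target_column='is_fraud')  # assumed label column
predictor.wait_for_training()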
Scenario 4: High-Quality Subset¶
You have labels for everything, but some labels are higher quality than others.
# Complex filtering for predictor training
df['__featrix_train_predictor'] = (
(df['label_quality'] == 'verified') & # Only verified labels
(df['date'] >= '2024-01-01') & # Recent data
(df['feature_completeness'] > 0.9) # Complete features
)
# Foundational Model: Learns from everything (including messy data)
# Predictor: Trains on pristine subset
The embedding space learns the full data distribution, including edge cases and outliers. The predictor trains only on high-quality labels, avoiding noise.
Why This Works: The Mathematics¶
Without background data:
Embedding space learns: f(5K labeled features) → embedding
Predictor learns: g(embedding) → outcome
Coverage: Only the 5K-row region of feature space
New data outside this region: extrapolation (dangerous)
With background data:
Embedding space learns: f(100K all features) → embedding
Predictor learns: g(embedding) → outcome
Coverage: The FULL 100K-row region of feature space
New data matching unlabeled patterns: interpolation (safe)
The Foundational Model transforms what would be extrapolation (predicting outside training distribution) into interpolation (predicting within a known region of embedding space).
Practical Patterns¶
Pattern 1: Stratified Sampling for Large Datasets¶
# Problem: 1M rows is too much for predictor training
# Solution: Foundational Model sees all 1M, Predictor trains on stratified sample
from sklearn.model_selection import train_test_split
df = pd.read_csv('huge_dataset.csv') # 1M rows
# Stratified 10% sample for predictor training
_, sample_df = train_test_split(
    df,
    test_size=0.1,          # keep 10% of rows for predictor training
    stratify=df['target'],  # preserve the class balance of 'target'
    random_state=42
)
df['__featrix_train_predictor'] = df.index.isin(sample_df.index)
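It's worth verifying that the sample preserves the target distribution before training the predictor:
# Class proportions in the 10% sample should closely match the full dataset
print(df['target'].value_counts(normalize=True))
print(sample_df['target'].value_counts(normalize=True))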
Pattern 2: Rare Event Detection¶
# Problem: Fraud is 0.1% of data
# Solution: Foundational Model learns normal patterns, Predictor focuses on fraud
df['__featrix_train_predictor'] = df['is_fraud'] | (df.index % 100 == 0)
# All fraud cases + 1% of normal cases for balance
# Foundational Model: Understands what "normal" looks like from ALL data
# Predictor: Learns to distinguish fraud from normal
# Result: Better at detecting anomalies because it truly understands normal
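A quick check shows the rebalanced mix the predictor will actually train on. This assumes is_fraud is a boolean or 0/1 column:
# Count fraud vs. normal rows in the predictor training subset
predictor_rows = df[df['__featrix_train_predictor'].astype(bool)]
print(predictor_rows['is_fraud'].value_counts())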
Pattern 3: Cross-Validation with Shared Embedding Space¶
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
    df_fold = df.copy()
    # kf.split returns positional indices, so map them back to index labels
    df_fold['__featrix_train_predictor'] = df_fold.index.isin(df_fold.index[train_idx])
    # Same Foundational Model for all folds
    # Different predictor training splits
    # Validation on held-out fold
The Bottom Line¶
Traditional ML wastes most of your data. Featrix uses all of it.
- One positive example? Find similar records with similarity search.
- 5% labeled? The other 95% improves your embedding space.
- Recent labels only? Historical data still shapes the model.
- Messy labels? Train predictor on clean subset, embedding on everything.
Your unlabeled data isn't useless—it's the foundation that makes your predictions better.