Working with Limited Labels and Background Data

Real-world ML projects rarely have perfect data. You might have 100,000 customer records but only 500 with known outcomes. You might have one successful example and want to find more like it. You might have labels only for recent data but want to leverage historical patterns.

Featrix is designed for exactly these scenarios.

The Problem with Traditional ML

Traditional supervised learning requires labeled data for everything:

Traditional approach:

  • 100K customer records
  • Only 500 have labels → throw away 99.5K records
  • Train on 500 labeled records
  • Model has never seen 99.5% of your data distribution

This is wasteful. Your unlabeled data contains valuable information about:

  • The full range of customer behaviors
  • Edge cases and outliers
  • Feature correlations and distributions
  • Rare patterns that might only appear in the unlabeled data

How Featrix Solves This

Featrix separates two concerns:

  1. Understanding the data shape (Foundational Model) - trains on ALL data, labeled or not
  2. Learning the prediction task (Predictor) - trains only on labeled data

Featrix approach:

  • 100K customer records
  • Foundational Model trains on ALL 100K (no labels needed)
  • Predictor trains on 500 labeled records
  • Model understands the full data distribution

The Key Insight: Shape Augmentation

The Foundational Model learns the "shape" of your data:

  • Where different types of records cluster
  • The boundaries between different regions
  • The density and distribution of data points
  • Rare vs. common patterns
  • Correlations and relationships between features

The Predictor then operates in this well-understood space. Even though it only trains on labeled examples, those examples are embedded in a space that understands the full data landscape.
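The two-stage idea can be sketched outside Featrix with scikit-learn stand-ins: here PCA plays the role of the "shape" learner fit on all rows, and logistic regression plays the predictor fit only on the labeled subset. The data and labels are synthetic; this is an analogy for the architecture, not the Featrix implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 10))                  # 1,000 records, mostly unlabeled
labeled_idx = rng.choice(1000, size=50, replace=False)
y_labeled = (X_all[labeled_idx, 0] > 0).astype(int)  # toy labels for 50 rows

# Stage 1: representation learned from ALL data, labels never consulted
pca = PCA(n_components=3).fit(X_all)

# Stage 2: predictor trained only on the labeled subset, in that space
clf = LogisticRegression().fit(pca.transform(X_all[labeled_idx]), y_labeled)

# Predict for every record, labeled or not
preds = clf.predict(pca.transform(X_all))
print(preds.shape)  # (1000,)
```

The point of the split is visible in the code: `pca` saw all 1,000 rows, so even rows far from the 50 labeled examples land in a region the representation has already mapped.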

Scenario 1: You Have One Good Example (No Labels)

The most common real-world scenario: you have one successful customer, one bestseller, one winning case. You want to find more like it.

Traditional ML: Can't do it. You need labeled data with both positive and negative examples.

Featrix: Create a waypoint. Find similar records. Done.

from featrixsphere import FeatrixSphere

featrix = FeatrixSphere()

# Create Foundational Model from your background data
fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv",  # 10,000 customers
    ignore_columns=["customer_id"]
)
fm.wait_for_training()

# Your one successful example
successful_customer = {
    "age": 35,
    "income": 50000,
    "location": "NYC",
    "plan_type": "premium"
}

# Find similar customers - no labels needed
similar = fm.similarity_search(successful_customer, k=100)

# These are your "positive class" candidates
for result in similar:
    print(f"Distance: {result.distance:.3f}, Record: {result.record}")

The Foundational Model learned the structure of your customer data. Similarity search finds records that occupy the same region of embedding space as your successful example.
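As a rough analogy (again, not the Featrix implementation), the same one-example similarity search can be expressed with scikit-learn's `NearestNeighbors` over synthetic numeric records:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
customers = rng.normal(size=(10_000, 4))  # 10,000 records, 4 numeric features
good_example = customers[0]               # the one successful customer

# Index all records, then ask for the 100 nearest to the good example
nn = NearestNeighbors(n_neighbors=100).fit(customers)
distances, indices = nn.kneighbors(good_example.reshape(1, -1))

print(indices.shape)    # (1, 100)
print(distances[0, 0])  # 0.0 (the example is its own nearest neighbor)
```

No labels appear anywhere in this snippet; the "positive class candidates" are defined purely by proximity in feature space.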

Scenario 2: Partial Labels (Most Data Unlabeled)

You have 100K records but only 5K have labels. Traditional ML would discard 95K records.

import pandas as pd

# Load all your data
df = pd.read_csv('all_data.csv')  # 100K rows

# Mark which rows have labels
df['__featrix_train_predictor'] = df['outcome'].notna()  # True for 5K rows

# Create Foundational Model - trains on ALL 100K
fm = featrix.create_foundational_model(
    name="full_model",
    df=df
)
fm.wait_for_training()

# Create Predictor - trains only on 5K labeled rows
predictor = fm.create_predictor(
    target_column='outcome'
)
predictor.wait_for_training()

What happens:

  1. Foundational Model training: learns patterns from all 100K records

      • Understands the full range of feature values
      • Learns correlations across ALL data
      • Recognizes rare patterns that might only appear in the unlabeled data

  2. Predictor training: trains on the 5K labeled records

      • Learns that certain outcomes correlate with certain embedding regions
      • But operates in an embedding space that understands the full 100K-record distribution

  3. Prediction: when you predict on new data

      • If a new record looks similar to one of the 95K unlabeled records, the embedding space recognizes it
      • The predictor can make informed predictions even in regions where it saw no labels
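The labeled/unlabeled flag itself is plain pandas, and a toy frame makes the mechanics concrete. Column names other than the `__featrix_train_predictor` flag described above are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "outcome": [1, 0, None, None, 1, None, None, None, 0, None],
})

# True wherever a label exists; the predictor trains only on these rows,
# while the Foundational Model still sees all 10
df["__featrix_train_predictor"] = df["outcome"].notna()

print(int(df["__featrix_train_predictor"].sum()))  # 4 labeled rows out of 10
```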

Scenario 3: Recent Labels, Historical Data

You have 5 years of transaction data but only the last 6 months have fraud labels.

# All transactions (5 years)
df = pd.read_csv('all_transactions.csv')  # 1M rows

# Only recent transactions have labels
df['date'] = pd.to_datetime(df['date'])
df['__featrix_train_predictor'] = df['date'] >= '2025-06-01'

# Foundational Model learns from ALL 5 years
fm = featrix.create_foundational_model(name="transaction_model", df=df)
fm.wait_for_training()

# Predictor trains only on recent labeled data
predictor = fm.create_predictor(target_column='is_fraud')
predictor.wait_for_training()

# Predictions benefit from historical patterns

The embedding space understands transaction patterns from 5 years of data. When a new transaction matches a pattern from 2022, the model recognizes it—even though no 2022 transactions have fraud labels.

Scenario 4: High-Quality Subset

You have labels for everything, but some labels are higher quality than others.

# Complex filtering for predictor training
df['__featrix_train_predictor'] = (
    (df['label_quality'] == 'verified') &   # Only verified labels
    (df['date'] >= '2024-01-01') &           # Recent data
    (df['feature_completeness'] > 0.9)       # Complete features
)

# Foundational Model: Learns from everything (including messy data)
# Predictor: Trains on pristine subset

The embedding space learns the full data distribution, including edge cases and outliers. The predictor trains only on high-quality labels, avoiding noise.

Why This Works: The Mathematics

Without background data:

Embedding space learns: f(5K labeled features) → embedding
Predictor learns: g(embedding) → outcome

Coverage: Only the 5K-row region of feature space
New data outside this region: extrapolation (dangerous)

With background data:

Embedding space learns: f(100K all features) → embedding
Predictor learns: g(embedding) → outcome

Coverage: The FULL 100K-row region of feature space
New data matching unlabeled patterns: interpolation (safe)

The Foundational Model transforms what would be extrapolation (predicting outside training distribution) into interpolation (predicting within a known region of embedding space).
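A toy numpy sketch makes the interpolation-vs-extrapolation point measurable: a query far from every labeled point can still sit right next to a background point, so a model anchored to the background distribution is not guessing blindly. The regions and sizes here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
labeled = rng.uniform(0, 1, size=(50, 2))       # small labeled region: [0,1]^2
background = rng.uniform(0, 5, size=(5000, 2))  # full distribution: [0,5]^2

query = np.array([4.0, 4.0])                    # new record outside the labeled region

# Distance to the nearest point in each set
dist_labeled = np.linalg.norm(labeled - query, axis=1).min()
dist_background = np.linalg.norm(background - query, axis=1).min()

print(dist_labeled > dist_background)  # True: background coverage is far better
```

Relative to the labeled set alone, the query is an extrapolation; relative to the background set, it is an interpolation.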

Practical Patterns

Pattern 1: Stratified Sampling for Large Datasets

# Problem: 1M rows is too much for predictor training
# Solution: Foundational Model sees all 1M, Predictor trains on stratified sample

from sklearn.model_selection import train_test_split

df = pd.read_csv('huge_dataset.csv')  # 1M rows

# Stratified sample for predictor (preserves the target's class balance)
_, sample_df = train_test_split(
    df,
    test_size=0.1,  # keep 10% for predictor training
    stratify=df['target'],
    random_state=42
)

df['__featrix_train_predictor'] = df.index.isin(sample_df.index)
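A quick self-contained check, on synthetic data, that the stratified sample really does preserve the target's class balance:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=10_000),
    "target": rng.choice([0, 1], size=10_000, p=[0.9, 0.1]),  # imbalanced target
})

# Keep 10% as the predictor-training sample, stratified on the target
_, sample_df = train_test_split(
    df, test_size=0.1, stratify=df["target"], random_state=42
)

print(len(sample_df))  # 1000 rows (10% of 10,000)
# Class proportions in the sample track the full data closely
print(abs(sample_df["target"].mean() - df["target"].mean()) < 0.01)  # True
```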

Pattern 2: Rare Event Detection

# Problem: Fraud is 0.1% of data
# Solution: Foundational Model learns normal patterns, Predictor focuses on fraud

df['__featrix_train_predictor'] = df['is_fraud'] | (df.index % 100 == 0)
# All fraud cases + 1% of normal cases for balance

# Foundational Model: Understands what "normal" looks like from ALL data
# Predictor: Learns to distinguish fraud from normal
# Result: Better at detecting anomalies because it truly understands normal
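The mask arithmetic is easy to verify on a synthetic frame; all names here mirror the snippet above:

```python
import numpy as np
import pandas as pd

# 10,000 transactions; 0.1% fraud
df = pd.DataFrame({"is_fraud": np.zeros(10_000, dtype=bool)})
df.loc[:9, "is_fraud"] = True                  # rows 0-9 are fraud

# Every fraud case plus every 100th row (~1% of normal cases)
mask = df["is_fraud"] | (df.index % 100 == 0)

print(int(mask.sum()))  # 109 rows: 10 fraud + 100 sampled, minus 1 overlap (row 0)
```

Note that the `index % 100` trick assumes a default RangeIndex; on a frame with a non-integer index you would sample positions instead.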

Pattern 3: Cross-Validation with Shared Embedding Space

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
    df_fold = df.copy()
    # kf.split returns positional indices; map them back to index labels
    df_fold['__featrix_train_predictor'] = df_fold.index.isin(df.index[train_idx])

    # Same Foundational Model for all folds
    # Different predictor training splits
    # Validation on held-out fold

The Bottom Line

Traditional ML wastes most of your data. Featrix uses all of it.

  • One positive example? Find similar records with similarity search.
  • 5% labeled? The other 95% improves your embedding space.
  • Recent labels only? Historical data still shapes the model.
  • Messy labels? Train predictor on clean subset, embedding on everything.

Your unlabeled data isn't useless—it's the foundation that makes your predictions better.