Training Foundational Models

A Foundational Model is the core of Featrix. It learns the structure and relationships in your data through self-supervised learning, creating a rich embedding space that enables high-accuracy predictions.

Quick Start

from featrixsphere.api import FeatrixSphere

featrix = FeatrixSphere()

# Create and train a Foundational Model
fm = featrix.create_foundational_model(
    name="my_model",
    data_file="customers.csv"
)
fm.wait_for_training()

print(f"Training complete! Dimensions: {fm.dimensions}")

Data Requirements

Minimum Requirements

Requirement | Minimum | Recommended
Rows | 100 | 1,000+
Columns | 2 | 5+ (richer relationships)
Samples per class | 10 | 50+ (better minority recall)
Null threshold | ≤90% | <30%
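
Before uploading, you can sanity-check these numbers yourself. A minimal sketch with pandas (pandas is assumed here; it is not required by Featrix):

import pandas as pd

df = pd.read_csv("customers.csv")
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")

# Fraction of nulls in each column, worst first (compare against the thresholds above)
print(df.isnull().mean().sort_values(ascending=False))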

Supported Column Types

Featrix automatically detects and handles these column types:

Type | Example Columns | Encoding Strategy
Numeric | Age, Revenue, Temperature | 20 adaptive strategies with dual-path (continuous + binned)
Categorical | Country, Product_Type | Hybrid learned + BERT embeddings for OOV handling
Text | Description, Comments | BERT embeddings with 7 compression strategies
Timestamp | Created_Date, Order_Time | 12 cyclical features (seconds → years) + timezone
Email | customer_email | Decomposed into domain, TLD, free-email flags
URL/Domain | website, referrer | Parsed into TLD, subdomain, path, query
JSON | metadata | Flattened and encoded via child embedding space
List | "red|green|blue" | Delimiter attention encoding (specify delimiter)

Automatic Column Filtering

Featrix automatically excludes columns that would add noise:

  • Random strings (UUIDs, hashes, transaction IDs): >95% unique values with low semantic similarity
  • All-null columns: No information
  • Uniform columns: Single unique value, no variance
  • Internal columns: Columns starting with __featrix (metadata)

Column Type Overrides

When automatic detection gets it wrong:

fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv",
    column_overrides={
        "zip_code": "set",        # Treat as category, not number
        "product_id": "string",   # Treat as text, not category
        "tags": "string_list",    # Pipe-separated list
        "score": "scalar"         # Force numeric treatment
    },
    string_list_delimiter="|"
)

Common override scenarios:

  • Zip codes: Look numeric but are categorical (10001 isn't "more" than 10000)
  • Product IDs: High cardinality but carry semantic meaning
  • Rating scores: 1-5 scale could be scalar or ordinal categorical

Data Sources

From a Local File

# CSV file
fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv"
)

# Parquet file
fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.parquet"
)

# JSON file
fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.json"
)

From a DataFrame

import pandas as pd

df = pd.read_csv("customers.csv")
fm = featrix.create_foundational_model(
    name="customer_model",
    df=df
)

From S3

fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="s3://my-bucket/customers.parquet"
)

Configuration Options

Ignoring Columns

Some columns shouldn't be used for training—IDs, timestamps that leak information, or columns you want to predict:

fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv",
    ignore_columns=["customer_id", "created_at", "target_column"]
)

Training Epochs

By default, Featrix automatically determines the optimal number of epochs. You can override this:

fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv",
    epochs=100  # Force 100 epochs
)

Foundation Mode (Large Datasets)

For datasets with 100,000+ rows, Featrix automatically enables "foundation mode" which uses chunked iteration and stratified data splits for efficient training. You can also force it for smaller datasets:

fm = featrix.create_foundational_model(
    name="large_model",
    data_file="big_dataset.parquet",
    foundation_mode=True  # Force foundation mode
)

Data Splits

Foundation mode splits your data into four subsets:

Split | Size | Purpose
Warmup | 5% | High-quality rows (<10% nulls) used for initial training stabilization
Train | 80% | Main training data used for learning embeddings
Validation | 10% | Used during training to monitor for overfitting and guide early stopping
Test | 5% | Holdout set never seen during training, used for final evaluation

For example, a 200,000-row dataset yields roughly 10,000 warmup, 160,000 train, 20,000 validation, and 10,000 test rows.

Training Phases

  1. Warmup Phase: The first few epochs use the warmup split—clean rows with minimal missing values. During this phase, the joint encoder is frozen while column encoders learn initial representations. This stabilizes training before the full model starts adapting.

  2. Main Training: After warmup, training proceeds on the full train split with all model components active. The learning rate follows a schedule that ramps up quickly, holds briefly at peak, then decays with small oscillations to explore the loss landscape (an illustrative sketch of such a schedule follows this list).

  3. Validation: Throughout training, validation loss is computed on the held-out validation split to detect overfitting and inform early stopping decisions.

  4. Final Evaluation: After training completes, the test split provides an unbiased estimate of model quality since this data was never used to update weights or make training decisions.
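
The exact schedule is internal to Featrix and not user-configurable; purely as an illustration of the ramp/hold/decay shape described in step 2, a hypothetical schedule might look like:

import math

def illustrative_lr(step: int, total_steps: int, peak_lr: float = 1e-3) -> float:
    """Hypothetical ramp/hold/decay curve; NOT Featrix's actual schedule."""
    warmup_end = int(0.05 * total_steps)  # quick ramp to peak
    hold_end = int(0.10 * total_steps)    # brief hold at peak
    if step < warmup_end:
        return peak_lr * step / max(warmup_end, 1)
    if step < hold_end:
        return peak_lr
    # Decay toward zero with a small oscillation to explore the loss landscape
    progress = (step - hold_end) / max(total_steps - hold_end, 1)
    decay = 0.5 * (1 + math.cos(math.pi * progress))
    wobble = 1 + 0.05 * math.sin(20 * math.pi * progress)
    return peak_lr * decay * wobble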

Custom Metadata

Attach custom metadata to your model (up to 32KB JSON):

fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv",
    user_metadata={
        "project": "customer_churn",
        "version": "1.0",
        "owner": "data-team"
    }
)

Webhooks

Get notified when training completes:

fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv",
    webhooks={
        "training_finished": "https://your-server.com/webhook/training-done",
        "webhook_secret": "your_secret_key"  # Optional, for HMAC verification
    }
)

Webhook Payload

The training_finished webhook sends a POST request with this JSON payload:

{
  "event": "training_finished",
  "timestamp": "2025-01-13T10:30:00Z",
  "foundational_model_id": "abc123",
  "predictor_id": null,
  "data": {
    "status": "succeeded",
    "metrics": {
      "accuracy": 0.94,
      "epochs": 150
    }
  }
}

For failed training, the data object contains "status": "failed" and an "error" message.
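
For example, a failed run might produce a payload like this (the error text here is illustrative):

{
  "event": "training_finished",
  "timestamp": "2025-01-13T10:30:00Z",
  "foundational_model_id": "abc123",
  "predictor_id": null,
  "data": {
    "status": "failed",
    "error": "Data file is empty or could not be parsed"
  }
}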

Headers

  • Content-Type: application/json
  • User-Agent: FeatrixSphere-Webhook/1.0
  • X-Featrix-Signature: sha256=<hmac_digest> (if webhook_secret is provided)

Verifying the Signature

If you provide a webhook_secret, verify the signature on your server:

import hmac
import hashlib

def verify_signature(payload_bytes: bytes, signature_header: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), payload_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"sha256={expected}", signature_header)
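
As a usage sketch, the verifier can sit behind a small web endpoint. Here it is wired into Flask (Flask and the route path are illustrative assumptions, not part of Featrix), reusing verify_signature from above:

from flask import Flask, request, abort

app = Flask(__name__)
WEBHOOK_SECRET = "your_secret_key"  # The same secret passed as webhook_secret

@app.route("/webhook/training-done", methods=["POST"])
def training_done():
    # Verify against the raw request bytes, not the re-serialized JSON
    signature = request.headers.get("X-Featrix-Signature", "")
    if not verify_signature(request.get_data(), signature, WEBHOOK_SECRET):
        abort(401)
    payload = request.get_json()
    print(f"Training finished with status: {payload['data']['status']}")
    return "", 204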

Waiting for Training

Training typically takes 10-30 minutes depending on dataset size. Use wait_for_training() to block until complete:

fm = featrix.create_foundational_model(
    name="customer_model",
    data_file="customers.csv"
)

# Block until training completes
fm.wait_for_training(
    max_wait_time=3600,    # Maximum wait: 1 hour
    poll_interval=10,       # Check every 10 seconds
    show_progress=True      # Print progress updates
)

print(f"Status: {fm.status}")
print(f"Epochs: {fm.epochs}")
print(f"Final loss: {fm.final_loss}")

Checking Training Status

You can check status without blocking:

fm.refresh()  # Refresh state from server

print(f"Status: {fm.status}")  # "training", "done", or "error"
print(f"Ready: {fm.is_ready()}")

if fm.training_progress:
    print(f"Progress: {fm.training_progress}")

Foundational Model Attributes

After training completes:

print(fm.id)              # Session ID (use this to reload later)
print(fm.name)            # Model name
print(fm.status)          # "done"
print(fm.dimensions)      # Embedding dimensions (d_model)
print(fm.epochs)          # Training epochs completed
print(fm.final_loss)      # Final training loss
print(fm.created_at)      # Creation timestamp
print(fm.columns)         # Column names in the model

Loading an Existing Model

If you have a model's session ID, you can load it:

fm = featrix.foundational_model("your-session-id-here")
print(f"Loaded model: {fm.name}, status: {fm.status}")

Listing Your Models

Find models by name prefix:

session_ids = featrix.list_sessions(name_prefix="customer")
for sid in session_ids:
    fm = featrix.foundational_model(sid)
    print(f"{sid}: {fm.name} - {fm.status}")

Training Metrics

Get detailed training metrics:

metrics = fm.get_training_metrics()
print(metrics['loss_history'])
print(metrics['lr_history'])
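
For example, to plot the loss curve (assuming loss_history is a list of per-epoch values; matplotlib is not a Featrix dependency):

import matplotlib.pyplot as plt

metrics = fm.get_training_metrics()

plt.plot(metrics['loss_history'])
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.savefig("loss_curve.png")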

Visualizations

Sphere Preview

Get a 2D projection preview of your embedding space:

png_bytes = fm.get_sphere_preview(save_path="preview.png")

Projections for Custom Visualization

Get 2D/3D projections for your own visualization:

projections = fm.get_projections()

Encoding Records

Convert records to embedding vectors (useful for similarity search, clustering, or custom ML):

# Full results with both 3D and full embeddings
results = fm.encode([
    {"age": 35, "income": 50000},
    {"age": 42, "income": 75000}
])
for r in results:
    print(r["embedding"])       # 3D vector (for visualization)
    print(r["embedding_long"])  # Full embedding vector

# Just 3D vectors
vectors_3d = fm.encode(records, short=True)
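
As a sketch of downstream use, you can compare records by cosine similarity over their full embeddings (numpy is assumed; it is not required by Featrix):

import numpy as np

results = fm.encode([
    {"age": 35, "income": 50000},
    {"age": 42, "income": 75000}
])

a = np.array(results[0]["embedding_long"])
b = np.array(results[1]["embedding_long"])

# Cosine similarity: closer to 1.0 means more similar in embedding space
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {similarity:.3f}")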

What Happens During Training

When you create a Foundational Model, Featrix:

  1. Analyzes your data: Detects column types (numeric, categorical, text, email, timestamp, etc.)
  2. Creates encoders: Builds specialized encoders for each column type
  3. Trains the model: Uses self-supervised learning to discover relationships between columns
  4. Validates quality: Monitors for training issues (collapse, overfitting, gradient problems)
  5. Generates projections: Creates 2D/3D visualizations of the embedding space

All of this happens automatically—you don't configure anything.

Common Issues

Training Takes Too Long

For very large datasets (millions of rows):

  • Use Parquet format instead of CSV (faster loading)
  • Consider sampling if you don't need all rows
  • Foundation mode is automatically enabled for 100K+ rows

Training Fails

Check the error message:

if fm.status == "error":
    print(f"Error: {fm.error_message}")

Common causes:

  • Empty or malformed data file
  • All columns have zero variance (no information)
  • Insufficient data (need at least ~100 rows)

Next Steps

Once your Foundational Model is trained, you can encode records, explore the embedding space, and build predictions on top of it.

Where are my hyperparameters?

There aren't any. That's the point.

Traditional ML requires tuning learning rates, batch sizes, layer dimensions, dropout rates, regularization strengths, and dozens of other parameters. Getting these wrong means poor results; getting them right requires expertise and extensive experimentation.

Featrix eliminates this entirely. The model analyzes your data and automatically configures itself—architecture, learning rate schedules, regularization, and training dynamics are all determined by the structure and statistics of your dataset. Your data is the configuration.

The only training hyperparameter you can optionally specify is epochs, the number of passes over your data. Even this is usually unnecessary; Featrix monitors training progress and stops when the model has learned what it can. For large datasets in foundation mode, the effective number of passes is automatically reduced since each epoch already covers substantial data.

This isn't a limitation—it's a design choice. Hyperparameter tuning is where most ML projects waste time and where most mistakes happen. By making your data the sole input, Featrix ensures reproducible results without requiring ML expertise.