Training Foundational Models¶
A Foundational Model is the core of Featrix. It learns the structure and relationships in your data through self-supervised learning, creating a rich embedding space that enables high-accuracy predictions.
Quick Start¶
from featrixsphere.api import FeatrixSphere
featrix = FeatrixSphere()
# Create and train a Foundational Model
fm = featrix.create_foundational_model(
name="my_model",
data_file="customers.csv"
)
fm.wait_for_training()
print(f"Training complete! Dimensions: {fm.dimensions}")
Data Requirements¶
Minimum Requirements¶
| Requirement | Minimum | Recommended |
|---|---|---|
| Rows | 100 | 1,000+ |
| Columns | 2 | 5+ (richer relationships) |
| Samples per class | 10 | 50+ (better minority recall) |
| Null threshold | 90% max | <30% |
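A quick pre-flight check against these minimums can save a failed training run. A minimal sketch, assuming your data loads into pandas; the thresholds mirror the table above:

import pandas as pd

df = pd.read_csv("customers.csv")

# Check against the minimums in the table above
assert len(df) >= 100, f"Need at least 100 rows, got {len(df)}"
assert len(df.columns) >= 2, f"Need at least 2 columns, got {len(df.columns)}"

# Flag columns above the 90% null threshold
null_fractions = df.isnull().mean()
too_sparse = null_fractions[null_fractions > 0.9]
if not too_sparse.empty:
    print(f"Columns over the null threshold: {list(too_sparse.index)}")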
Supported Column Types¶
Featrix automatically detects and handles these column types:
| Type | Detection | Encoding Strategy |
|---|---|---|
| Numeric | Age, Revenue, Temperature | 20 adaptive strategies with dual-path (continuous + binned) |
| Categorical | Country, Product_Type | Hybrid learned + BERT embeddings for OOV handling |
| Text | Description, Comments | BERT embeddings with 7 compression strategies |
| Timestamp | Created_Date, Order_Time | 12 cyclical features (seconds → years) + timezone |
| Email | customer_email | Decomposed into domain, TLD, free-email flags |
| URL/Domain | website, referrer | Parsed into TLD, subdomain, path, query |
| JSON | metadata | Flattened and encoded via child embedding space |
| List | "red\|green\|blue" | Delimiter attention encoding (specify delimiter) |
Automatic Column Filtering¶
Featrix automatically excludes columns that would add noise:
- Random strings (UUIDs, hashes, transaction IDs): >95% unique values with low semantic similarity
- All-null columns: No information
- Uniform columns: Single unique value, no variance
- Internal columns: Columns starting with __featrix (metadata)
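If you want to anticipate which columns will be dropped, a rough pandas approximation of these rules looks like the following. The semantic-similarity check is internal to Featrix and not reproduced here, so treat this as an approximation:

import pandas as pd

df = pd.read_csv("customers.csv")

for col in df.columns:
    non_null = df[col].dropna()
    if non_null.empty:
        print(f"{col}: all-null, would be excluded")
    elif non_null.nunique() == 1:
        print(f"{col}: uniform (single value), would be excluded")
    elif df[col].dtype == object and non_null.nunique() / len(non_null) > 0.95:
        print(f"{col}: likely a random identifier, may be excluded")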
Column Type Overrides¶
When automatic detection gets it wrong:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
column_overrides={
"zip_code": "set", # Treat as category, not number
"product_id": "string", # Treat as text, not category
"tags": "string_list", # Pipe-separated list
"score": "scalar" # Force numeric treatment
},
string_list_delimiter="|"
)
Common override scenarios:
- Zip codes: Look numeric but are categorical (10001 isn't "more" than 10000)
- Product IDs: High cardinality but carry semantic meaning
- Rating scores: 1-5 scale could be scalar or ordinal categorical
Data Sources¶
From a Local File¶
# CSV file
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv"
)
# Parquet file
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.parquet"
)
# JSON file
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.json"
)
From a DataFrame¶
import pandas as pd
df = pd.read_csv("customers.csv")
fm = featrix.create_foundational_model(
name="customer_model",
df=df
)
From S3¶
fm = featrix.create_foundational_model(
name="customer_model",
data_file="s3://my-bucket/customers.parquet"
)
Configuration Options¶
Ignoring Columns¶
Some columns shouldn't be used for training—IDs, timestamps that leak information, or columns you want to predict:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
ignore_columns=["customer_id", "created_at", "target_column"]
)
Training Epochs¶
By default, Featrix automatically determines the optimal number of epochs. You can override this:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
epochs=100 # Force 100 epochs
)
Foundation Mode (Large Datasets)¶
For datasets with 100,000+ rows, Featrix automatically enables "foundation mode" which uses chunked iteration and stratified data splits for efficient training. You can also force it for smaller datasets:
fm = featrix.create_foundational_model(
name="large_model",
data_file="big_dataset.parquet",
foundation_mode=True # Force foundation mode
)
Data Splits¶
Foundation mode splits your data into four subsets:
| Split | Size | Purpose |
|---|---|---|
| Warmup | 5% | High-quality rows (<10% nulls) used for initial training stabilization |
| Train | 80% | Main training data used for learning embeddings |
| Validation | 10% | Used during training to monitor for overfitting and guide early stopping |
| Test | 5% | Holdout set never seen during training, used for final evaluation |
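Featrix computes these splits server-side; the following pandas sketch only illustrates the proportions and the warmup quality filter (rows with fewer than 10% nulls), not the actual stratified implementation:

import pandas as pd

df = pd.read_parquet("big_dataset.parquet")
n = len(df)
shuffled = df.sample(frac=1, random_state=42)  # shuffle before splitting

# Warmup: 5% of rows, drawn from rows with <10% null values
clean = shuffled[shuffled.isnull().mean(axis=1) < 0.10]
warmup = clean.head(int(0.05 * n))

rest = shuffled.drop(warmup.index)
train = rest.iloc[: int(0.80 * n)]                     # 80% main training data
validation = rest.iloc[int(0.80 * n): int(0.90 * n)]   # 10% overfitting monitor
test = rest.iloc[int(0.90 * n):]                       # ~5% holdout for final evaluation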
Training Phases¶
- Warmup Phase: The first few epochs use the warmup split—clean rows with minimal missing values. During this phase, the joint encoder is frozen while column encoders learn initial representations. This stabilizes training before the full model starts adapting.
- Main Training: After warmup, training proceeds on the full train split with all model components active. The learning rate follows a schedule that ramps up quickly, holds briefly at peak, then decays with small oscillations to explore the loss landscape.
- Validation: Throughout training, validation loss is computed on the held-out validation split to detect overfitting and inform early stopping decisions.
- Final Evaluation: After training completes, the test split provides an unbiased estimate of model quality since this data was never used to update weights or make training decisions.
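The learning-rate shape described for main training (quick ramp, brief hold at peak, decay with small oscillations) can be pictured with a short sketch. This is purely illustrative; the function and its constants are not Featrix internals:

import math

def illustrative_lr(step: int, total_steps: int, peak_lr: float = 1e-3) -> float:
    """Rough shape: ramp up, hold at peak, then decay with small oscillations."""
    warmup_end = int(0.05 * total_steps)
    hold_end = int(0.10 * total_steps)
    if step < warmup_end:
        return peak_lr * step / max(warmup_end, 1)         # fast ramp-up
    if step < hold_end:
        return peak_lr                                      # brief hold at peak
    progress = (step - hold_end) / max(total_steps - hold_end, 1)
    decay = 0.5 * (1 + math.cos(math.pi * progress))        # smooth decay
    wobble = 1 + 0.05 * math.sin(20 * math.pi * progress)   # small oscillations
    return peak_lr * decay * wobble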
Custom Metadata¶
Attach custom metadata to your model (up to 32KB JSON):
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
user_metadata={
"project": "customer_churn",
"version": "1.0",
"owner": "data-team"
}
)
Webhooks¶
Get notified when training completes:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
webhooks={
"training_finished": "https://your-server.com/webhook/training-done",
"webhook_secret": "your_secret_key" # Optional, for HMAC verification
}
)
Webhook Payload¶
The training_finished webhook sends a POST request with this JSON payload:
{
"event": "training_finished",
"timestamp": "2025-01-13T10:30:00Z",
"foundational_model_id": "abc123",
"predictor_id": null,
"data": {
"status": "succeeded",
"metrics": {
"accuracy": 0.94,
"epochs": 150
}
}
}
For failed training, the data object contains "status": "failed" and an "error" message.
Headers¶
Content-Type: application/json
User-Agent: FeatrixSphere-Webhook/1.0
X-Featrix-Signature: sha256=<hmac_digest> (if webhook_secret is provided)
Verifying the Signature¶
If you provide a webhook_secret, verify the signature on your server:
import hmac
import hashlib
def verify_signature(payload_bytes: bytes, signature_header: str, secret: str) -> bool:
expected = hmac.new(secret.encode(), payload_bytes, hashlib.sha256).hexdigest()
return hmac.compare_digest(f"sha256={expected}", signature_header)
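For example, wiring this into a Flask endpoint (Flask is an arbitrary choice here; any framework that exposes the raw request body works the same way):

from flask import Flask, request, abort

app = Flask(__name__)
WEBHOOK_SECRET = "your_secret_key"

@app.route("/webhook/training-done", methods=["POST"])
def training_done():
    signature = request.headers.get("X-Featrix-Signature", "")
    if not verify_signature(request.get_data(), signature, WEBHOOK_SECRET):
        abort(401)
    payload = request.get_json()
    print(f"Training finished: {payload['foundational_model_id']}")
    return "", 204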
Waiting for Training¶
Training typically takes 10-30 minutes depending on dataset size. Use wait_for_training() to block until complete:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv"
)
# Block until training completes
fm.wait_for_training(
max_wait_time=3600, # Maximum wait: 1 hour
poll_interval=10, # Check every 10 seconds
show_progress=True # Print progress updates
)
print(f"Status: {fm.status}")
print(f"Epochs: {fm.epochs}")
print(f"Final loss: {fm.final_loss}")
Checking Training Status¶
You can check status without blocking:
fm.refresh() # Refresh state from server
print(f"Status: {fm.status}") # "training", "done", or "error"
print(f"Ready: {fm.is_ready()}")
if fm.training_progress:
print(f"Progress: {fm.training_progress}")
Foundational Model Attributes¶
After training completes:
print(fm.id) # Session ID (use this to reload later)
print(fm.name) # Model name
print(fm.status) # "done"
print(fm.dimensions) # Embedding dimensions (d_model)
print(fm.epochs) # Training epochs completed
print(fm.final_loss) # Final training loss
print(fm.created_at) # Creation timestamp
print(fm.columns) # Column names in the model
Loading an Existing Model¶
If you have a model's session ID, you can load it:
fm = featrix.foundational_model("your-session-id-here")
print(f"Loaded model: {fm.name}, status: {fm.status}")
Listing Your Models¶
Find models by name prefix:
session_ids = featrix.list_sessions(name_prefix="customer")
for sid in session_ids:
fm = featrix.foundational_model(sid)
print(f"{sid}: {fm.name} - {fm.status}")
Training Metrics¶
Get detailed training metrics:
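The exact call isn't shown here, but the attributes documented under Foundational Model Attributes below cover the basics:

fm.refresh()
print(f"Status: {fm.status}")
print(f"Epochs completed: {fm.epochs}")
print(f"Final loss: {fm.final_loss}")
if fm.training_progress:
    print(f"Progress detail: {fm.training_progress}")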
Visualizations¶
Sphere Preview¶
Get a 2D projection preview of your embedding space:
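The method name below is hypothetical (this page doesn't show the exact call), so treat it as a placeholder and check the API reference:

# Hypothetical method name -- the real call may differ
preview = fm.sphere_preview()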
Projections for Custom Visualization¶
Get 2D/3D projections for your own visualization:
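Assuming a method that returns per-record coordinates (projections and its return shape are assumptions, not confirmed API), a scatter plot could look like:

import matplotlib.pyplot as plt

# Hypothetical method name and return shape -- check the API reference
points = fm.projections(dimensions=2)  # assumed: list of {"x": ..., "y": ...}
plt.scatter([p["x"] for p in points], [p["y"] for p in points], s=4)
plt.title("Embedding space projection")
plt.show()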
Encoding Records¶
Convert records to embedding vectors (useful for similarity search, clustering, or custom ML):
records = [
    {"age": 35, "income": 50000},
    {"age": 42, "income": 75000}
]

# Full results with both 3D and full embeddings
results = fm.encode(records)
for r in results:
    print(r["embedding"])       # 3D vector (for visualization)
    print(r["embedding_long"])  # Full embedding vector

# Just the 3D vectors
vectors_3d = fm.encode(records, short=True)
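Because encode() returns plain vectors, similarity search needs nothing beyond NumPy. A minimal sketch reusing the results from above:

import numpy as np

vectors = np.array([r["embedding_long"] for r in results])
query = vectors[0]

# Cosine similarity of the query against every encoded record
norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
similarities = vectors @ query / norms
print(similarities.argsort()[::-1])  # indices of most-similar records first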
What Happens During Training¶
When you create a Foundational Model, Featrix:
- Analyzes your data: Detects column types (numeric, categorical, text, email, timestamp, etc.)
- Creates encoders: Builds specialized encoders for each column type
- Trains the model: Uses self-supervised learning to discover relationships between columns
- Validates quality: Monitors for training issues (collapse, overfitting, gradient problems)
- Generates projections: Creates 2D/3D visualizations of the embedding space
All of this happens automatically—you don't configure anything.
Common Issues¶
Training Takes Too Long¶
For very large datasets (millions of rows):
- Use Parquet format instead of CSV (faster loading)
- Consider sampling if you don't need all rows
- Foundation mode is automatically enabled for 100K+ rows
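For the sampling route, a quick pandas sketch (the sample size is arbitrary; df= is the same DataFrame parameter shown under Data Sources):

import pandas as pd

df = pd.read_parquet("big_dataset.parquet")
sample = df.sample(n=500_000, random_state=42)  # arbitrary cap; adjust to taste

fm = featrix.create_foundational_model(
    name="sampled_model",
    df=sample
)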
Training Fails¶
Check the error message:
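A minimal status check using the attributes shown earlier; note that the attribute holding the error text is an assumption here, so consult the API reference for the exact name:

fm.refresh()
if fm.status == "error":
    # "error_message" is an assumed attribute name, not confirmed by these docs
    print(getattr(fm, "error_message", "(no error details exposed)"))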
Common causes:
- Empty or malformed data file
- All columns have zero variance (no information)
- Insufficient data (need at least ~100 rows)
Next Steps¶
Once your Foundational Model is trained:
- Train predictors for classification or regression
- Run predictions on new data
- Check safety and quality metrics
Where are my hyperparameters?¶
There aren't any. That's the point.
Traditional ML requires tuning learning rates, batch sizes, layer dimensions, dropout rates, regularization strengths, and dozens of other parameters. Getting these wrong means poor results; getting them right requires expertise and extensive experimentation.
Featrix eliminates this entirely. The model analyzes your data and automatically configures itself—architecture, learning rate schedules, regularization, and training dynamics are all determined by the structure and statistics of your dataset. Your data is the configuration.
The only parameter you can optionally specify is epochs—the number of passes over your data. Even this is usually unnecessary; Featrix monitors training progress and stops when the model has learned what it can. For large datasets in foundation mode, the effective number of passes is automatically reduced since each epoch already covers substantial data.
This isn't a limitation—it's a design choice. Hyperparameter tuning is where most ML projects waste time and where most mistakes happen. By making your data the sole input, Featrix ensures reproducible results without requiring ML expertise.