Training Foundational Models¶
A Foundational Model is the core of Featrix. It learns the structure and relationships in your data through self-supervised learning, creating a rich embedding space that enables high-accuracy predictions.
Quick Start¶
from featrixsphere.api import FeatrixSphere
featrix = FeatrixSphere()
# Create and train a Foundational Model
fm = featrix.create_foundational_model(
name="my_model",
data_file="customers.csv"
)
fm.wait_for_training()
print(f"Training complete! Dimensions: {fm.dimensions}")
Data Requirements¶
Minimum Requirements¶
| Requirement | Minimum | Recommended |
|---|---|---|
| Rows | 100 | 1,000+ |
| Columns | 2 | 5+ (richer relationships) |
| Samples per class | 10 | 50+ (better minority recall) |
| Null threshold | 90% max | <30% |
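A quick pre-flight check against these minimums can save a failed training run. A minimal sketch, assuming your data loads into pandas; the thresholds mirror the table above:

import pandas as pd

df = pd.read_csv("customers.csv")

# Check against the minimums in the table above
assert len(df) >= 100, f"Need at least 100 rows, got {len(df)}"
assert len(df.columns) >= 2, f"Need at least 2 columns, got {len(df.columns)}"

# Flag columns above the 90% null threshold
null_fractions = df.isnull().mean()
too_sparse = null_fractions[null_fractions > 0.9]
if not too_sparse.empty:
    print(f"Columns over the null threshold: {list(too_sparse.index)}")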
Supported Column Types¶
Featrix automatically detects and handles these column types:
| Type | Detection | Encoding Strategy |
|---|---|---|
| Numeric | Age, Revenue, Temperature | 20 adaptive strategies with dual-path (continuous + binned) |
| Categorical | Country, Product_Type | Hybrid learned + BERT embeddings for OOV handling |
| Text | Description, Comments | BERT embeddings with 7 compression strategies |
| Timestamp | Created_Date, Order_Time | 12 cyclical features (seconds → years) + timezone |
| Email | customer_email | Decomposed into domain, TLD, free-email flags |
| URL/Domain | website, referrer | Parsed into TLD, subdomain, path, query |
| JSON | metadata | Flattened and encoded via child embedding space |
| List | "red\|green\|blue" | Delimiter attention encoding (specify delimiter) |
Automatic Column Filtering¶
Featrix automatically excludes columns that would add noise:
- Random strings (UUIDs, hashes, transaction IDs): >95% unique values with low semantic similarity
- All-null columns: No information
- Uniform columns: Single unique value, no variance
- Internal columns: Columns starting with __featrix (metadata)
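If you want to anticipate which columns will be dropped, a rough pandas approximation of these rules looks like the following. The semantic-similarity check is internal to Featrix and not reproduced here, so treat this as an approximation:

import pandas as pd

df = pd.read_csv("customers.csv")

for col in df.columns:
    non_null = df[col].dropna()
    if non_null.empty:
        print(f"{col}: all-null, would be excluded")
    elif non_null.nunique() == 1:
        print(f"{col}: uniform (single value), would be excluded")
    elif df[col].dtype == object and non_null.nunique() / len(non_null) > 0.95:
        print(f"{col}: likely a random identifier, may be excluded")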
Column Type Overrides¶
When automatic detection gets it wrong:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
column_overrides={
"zip_code": "set", # Treat as category, not number
"product_id": "string", # Treat as text, not category
"tags": "string_list", # Pipe-separated list
"score": "scalar" # Force numeric treatment
},
string_list_delimiter="|"
)
Common override scenarios:
- Zip codes: Look numeric but are categorical (10001 isn't "more" than 10000)
- Product IDs: High cardinality but carry semantic meaning
- Rating scores: 1-5 scale could be scalar or ordinal categorical
Data Sources¶
From a Local File¶
# CSV file
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv"
)
# Parquet file
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.parquet"
)
# JSON file
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.json"
)
From a DataFrame¶
import pandas as pd
df = pd.read_csv("customers.csv")
fm = featrix.create_foundational_model(
name="customer_model",
df=df
)
From S3¶
fm = featrix.create_foundational_model(
name="customer_model",
data_file="s3://my-bucket/customers.parquet"
)
Configuration Options¶
Ignoring Columns¶
Some columns shouldn't be used for training—IDs, timestamps that leak information, or columns you want to predict:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
ignore_columns=["customer_id", "created_at", "target_column"]
)
Training Epochs¶
By default, Featrix automatically determines the optimal number of epochs. You can override this:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
epochs=100 # Force 100 epochs
)
Foundation Mode (Large Datasets)¶
For datasets with 100,000+ rows, Featrix automatically enables "foundation mode" which uses chunked iteration and stratified data splits for efficient training. You can also force it for smaller datasets:
fm = featrix.create_foundational_model(
name="large_model",
data_file="big_dataset.parquet",
foundation_mode=True # Force foundation mode
)
Data Splits¶
Foundation mode splits your data into four subsets:
| Split | Size | Purpose |
|---|---|---|
| Warmup | 5% | High-quality rows (<10% nulls) used for initial training stabilization |
| Train | 80% | Main training data used for learning embeddings |
| Validation | 10% | Used during training to monitor for overfitting and guide early stopping |
| Test | 5% | Holdout set never seen during training, used for final evaluation |
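Featrix computes these splits server-side; the following pandas sketch only illustrates the proportions and the warmup quality filter (rows with fewer than 10% nulls), not the actual stratified implementation:

import pandas as pd

df = pd.read_parquet("big_dataset.parquet")
n = len(df)
shuffled = df.sample(frac=1, random_state=42)  # shuffle before splitting

# Warmup: 5% of rows, drawn from rows with <10% null values
clean = shuffled[shuffled.isnull().mean(axis=1) < 0.10]
warmup = clean.head(int(0.05 * n))

rest = shuffled.drop(warmup.index)
train = rest.iloc[: int(0.80 * n)]                     # 80% main training data
validation = rest.iloc[int(0.80 * n): int(0.90 * n)]   # 10% overfitting monitor
test = rest.iloc[int(0.90 * n):]                       # ~5% holdout for final evaluation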
Training Phases¶
- Warmup Phase: The first few epochs use the warmup split—clean rows with minimal missing values. During this phase, the joint encoder is frozen while column encoders learn initial representations. This stabilizes training before the full model starts adapting.
- Main Training: After warmup, training proceeds on the full train split with all model components active. The learning rate follows a schedule that ramps up quickly, holds briefly at peak, then decays with small oscillations to explore the loss landscape.
- Validation: Throughout training, validation loss is computed on the held-out validation split to detect overfitting and inform early stopping decisions.
- Final Evaluation: After training completes, the test split provides an unbiased estimate of model quality since this data was never used to update weights or make training decisions.
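The learning-rate shape described for main training (quick ramp, brief hold at peak, decay with small oscillations) can be pictured with a short sketch. This is purely illustrative; the function and its constants are not Featrix internals:

import math

def illustrative_lr(step: int, total_steps: int, peak_lr: float = 1e-3) -> float:
    """Rough shape: ramp up, hold at peak, then decay with small oscillations."""
    warmup_end = int(0.05 * total_steps)
    hold_end = int(0.10 * total_steps)
    if step < warmup_end:
        return peak_lr * step / max(warmup_end, 1)         # fast ramp-up
    if step < hold_end:
        return peak_lr                                      # brief hold at peak
    progress = (step - hold_end) / max(total_steps - hold_end, 1)
    decay = 0.5 * (1 + math.cos(math.pi * progress))        # smooth decay
    wobble = 1 + 0.05 * math.sin(20 * math.pi * progress)   # small oscillations
    return peak_lr * decay * wobble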
Custom Metadata¶
Attach custom metadata to your model (up to 32KB JSON):
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
user_metadata={
"project": "customer_churn",
"version": "1.0",
"owner": "data-team"
}
)
Webhooks¶
Get notified when training completes:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv",
webhooks={
"training_finished": "https://your-server.com/webhook/training-done",
"webhook_secret": "your_secret_key" # Optional, for HMAC verification
}
)
Webhook Payload¶
The training_finished webhook sends a POST request with this JSON payload:
{
"event": "training_finished",
"timestamp": "2025-01-13T10:30:00Z",
"foundational_model_id": "abc123",
"predictor_id": null,
"data": {
"status": "succeeded",
"metrics": {
"accuracy": 0.94,
"epochs": 150
}
}
}
For failed training, the data object contains "status": "failed" and an "error" message.
Headers¶
Content-Type: application/json
User-Agent: FeatrixSphere-Webhook/1.0
X-Featrix-Signature: sha256=<hmac_digest> (if webhook_secret is provided)
Verifying the Signature¶
If you provide a webhook_secret, verify the signature on your server:
import hmac
import hashlib
def verify_signature(payload_bytes: bytes, signature_header: str, secret: str) -> bool:
expected = hmac.new(secret.encode(), payload_bytes, hashlib.sha256).hexdigest()
return hmac.compare_digest(f"sha256={expected}", signature_header)
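For example, wiring this into a Flask endpoint (Flask is an arbitrary choice here; any framework that exposes the raw request body works the same way):

from flask import Flask, request, abort

app = Flask(__name__)
WEBHOOK_SECRET = "your_secret_key"

@app.route("/webhook/training-done", methods=["POST"])
def training_done():
    signature = request.headers.get("X-Featrix-Signature", "")
    if not verify_signature(request.get_data(), signature, WEBHOOK_SECRET):
        abort(401)
    payload = request.get_json()
    print(f"Training finished: {payload['foundational_model_id']}")
    return "", 204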
Waiting for Training¶
Training typically takes 10-30 minutes depending on dataset size. Use wait_for_training() to block until complete:
fm = featrix.create_foundational_model(
name="customer_model",
data_file="customers.csv"
)
# Block until training completes
fm.wait_for_training(
max_wait_time=3600, # Maximum wait: 1 hour
poll_interval=10, # Check every 10 seconds
show_progress=True # Print progress updates
)
print(f"Status: {fm.status}")
print(f"Epochs: {fm.epochs}")
print(f"Final loss: {fm.final_loss}")
Checking Training Status¶
You can check status without blocking:
fm.refresh() # Refresh state from server
print(f"Status: {fm.status}") # "training", "done", or "error"
print(f"Ready: {fm.is_ready()}")
if fm.training_progress:
print(f"Progress: {fm.training_progress}")
Foundational Model Attributes¶
After training completes:
print(fm.id) # Session ID (use this to reload later)
print(fm.name) # Model name
print(fm.status) # "done"
print(fm.dimensions) # Embedding dimensions (d_model)
print(fm.epochs) # Training epochs completed
print(fm.final_loss) # Final training loss
print(fm.created_at) # Creation timestamp
print(fm.columns) # Column names in the model
Loading an Existing Model¶
If you have a model's session ID, you can load it:
fm = featrix.foundational_model("your-session-id-here")
print(f"Loaded model: {fm.name}, status: {fm.status}")
Listing Your Models¶
Find models by name prefix:
session_ids = featrix.list_sessions(name_prefix="customer")
for sid in session_ids:
fm = featrix.foundational_model(sid)
print(f"{sid}: {fm.name} - {fm.status}")
Training Metrics¶
Get detailed training metrics:
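The exact call isn't shown here, but the attributes documented under Foundational Model Attributes below cover the basics:

fm.refresh()
print(f"Status: {fm.status}")
print(f"Epochs completed: {fm.epochs}")
print(f"Final loss: {fm.final_loss}")
if fm.training_progress:
    print(f"Progress detail: {fm.training_progress}")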
Visualizations¶
Sphere Preview¶
Get a 2D projection preview of your embedding space:
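The method name below is hypothetical (this page doesn't show the exact call), so treat it as a placeholder and check the API reference:

# Hypothetical method name -- the real call may differ
preview = fm.sphere_preview()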
Projections for Custom Visualization¶
Get 2D/3D projections for your own visualization:
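Assuming a method that returns per-record coordinates (projections and its return shape are assumptions, not confirmed API), a scatter plot could look like:

import matplotlib.pyplot as plt

# Hypothetical method name and return shape -- check the API reference
points = fm.projections(dimensions=2)  # assumed: list of {"x": ..., "y": ...}
plt.scatter([p["x"] for p in points], [p["y"] for p in points], s=4)
plt.title("Embedding space projection")
plt.show()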
Encoding Records¶
Convert records to embedding vectors (useful for similarity search, clustering, or custom ML):
records = [
    {"age": 35, "income": 50000},
    {"age": 42, "income": 75000}
]

# Full results with both 3D and full embeddings
results = fm.encode(records)
for r in results:
    print(r["embedding"])       # 3D vector (for visualization)
    print(r["embedding_long"])  # Full embedding vector

# Just the 3D vectors
vectors_3d = fm.encode(records, short=True)
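Because encode() returns plain vectors, similarity search needs nothing beyond NumPy. A minimal sketch reusing the results from above:

import numpy as np

vectors = np.array([r["embedding_long"] for r in results])
query = vectors[0]

# Cosine similarity of the query against every encoded record
norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
similarities = vectors @ query / norms
print(similarities.argsort()[::-1])  # indices of most-similar records first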
What Happens During Training¶
When you create a Foundational Model, Featrix:
- Analyzes your data: Detects column types (numeric, categorical, text, email, timestamp, etc.)
- Creates encoders: Builds specialized encoders for each column type
- Trains the model: Uses self-supervised learning to discover relationships between columns
- Validates quality: Monitors for training issues (collapse, overfitting, gradient problems)
- Generates projections: Creates 2D/3D visualizations of the embedding space
All of this happens automatically—you don't configure anything.
Common Issues¶
Training Takes Too Long¶
For very large datasets (millions of rows):
- Use Parquet format instead of CSV (faster loading)
- Consider sampling if you don't need all rows
- Foundation mode is automatically enabled for 100K+ rows
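For the sampling route, a quick pandas sketch (the sample size is arbitrary; df= is the same DataFrame parameter shown under Data Sources):

import pandas as pd

df = pd.read_parquet("big_dataset.parquet")
sample = df.sample(n=500_000, random_state=42)  # arbitrary cap; adjust to taste

fm = featrix.create_foundational_model(
    name="sampled_model",
    df=sample
)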
Training Fails¶
Check the error message:
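A minimal status check using the attributes shown earlier; note that the attribute holding the error text is an assumption here, so consult the API reference for the exact name:

fm.refresh()
if fm.status == "error":
    # "error_message" is an assumed attribute name, not confirmed by these docs
    print(getattr(fm, "error_message", "(no error details exposed)"))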
Common causes:
- Empty or malformed data file
- All columns have zero variance (no information)
- Insufficient data (need at least ~100 rows)
Next Steps¶
Once your Foundational Model is trained:
- Train predictors for classification or regression
- Run predictions on new data
- Check safety and quality metrics
Where are my hyperparameters?¶
There aren't any. That's the point.
Traditional ML requires tuning learning rates, batch sizes, layer dimensions, dropout rates, regularization strengths, and dozens of other parameters. Getting these wrong means poor results; getting them right requires expertise and extensive experimentation.
Featrix eliminates this entirely. The model analyzes your data and automatically configures itself—architecture, learning rate schedules, regularization, and training dynamics are all determined by the structure and statistics of your dataset. Your data is the configuration.
The only parameter you can optionally specify is epochs—the number of passes over your data. Even this is usually unnecessary; Featrix monitors training progress and stops when the model has learned what it can. For large datasets in foundation mode, the effective number of passes is automatically reduced since each epoch already covers substantial data.
This isn't a limitation—it's a design choice. Hyperparameter tuning is where most ML projects waste time and where most mistakes happen. By making your data the sole input, Featrix ensures reproducible results without requiring ML expertise.