Extending Embedding Spaces with Feature Engineering¶
Overview¶
The Extend ES feature allows you to add new engineered feature columns to an existing EmbeddingSpace without retraining from scratch. This is critical for iterative feature engineering where you want to:
- Preserve existing embeddings (expensive to retrain)
- Add new feature columns discovered during training
- Train only the new columns (much faster)
- Maintain lineage and provenance of features
The Problem This Solves¶
Before: Features Were Ignored¶
┌─────────────────────────────────────────────────────────────────┐
│ PREVIOUS WORKFLOW │
├─────────────────────────────────────────────────────────────────┤
│ 1. Train ES on 30 columns → Creates embeddings │
│ │
│ 2. Train Predictor: │
│ • Loads features from previous run │
│ • Adds 2 new columns to DataFrame │
│ • BUT: No codecs for new columns! │
│ • Dataset only encodes columns with codecs │
│ • ❌ New features SILENTLY IGNORED │
│ │
│ Result: All feature engineering wasted! │
└─────────────────────────────────────────────────────────────────┘
After: Extend ES Concept¶
┌─────────────────────────────────────────────────────────────────┐
│ "EXTEND ES" WORKFLOW │
├─────────────────────────────────────────────────────────────────┤
│ 1. Train ES v1 on 30 columns (50 epochs) │
│ → Creates embeddings for original data │
│ │
│ 2. Train Predictor, discover features: │
│ → younger_borrower │
│ → high_debt_ratio │
│ │
│ 3. Extend ES: Create ES v2 │
│ • Load ES v1 (30 columns) │
│ • Apply features to data (now 32 columns) │
│ • Create codecs for 2 new columns only │
│ • Copy existing encoder weights │
│ • Train for 12 epochs (50/4) to learn new columns │
│ → ES v2 has 32 columns with embeddings! │
│ │
│ 4. Train Predictor on ES v2: │
│ • All 32 columns have codecs │
│ • All columns encoded and used │
│ • ✅ Features actually work! │
└─────────────────────────────────────────────────────────────────┘
How It Works¶
Phase 1: Feature Discovery (Run 1)¶
[Code snippet coming soon.]
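Since the public API has not shipped yet (see Status below), here is a hypothetical sketch of what discovery produces: named features defined as expressions over existing columns. The feature names come from this document; the threshold values and helper functions are illustrative assumptions, not the real Featrix API.

```python
# Hypothetical sketch: what "feature discovery" yields during predictor
# training. younger_borrower / high_debt_ratio come from this doc; the
# cutoffs (30, 0.4) are assumed for illustration.

suggested_features = {
    "younger_borrower": lambda row: int(row["age"] < 30),                   # assumed cutoff
    "high_debt_ratio": lambda row: int(row["debt"] / row["income"] > 0.4),  # assumed cutoff
}

def apply_features(rows, features):
    """Add each engineered column to every row (the 'enrichment' step)."""
    return [{**row, **{name: fn(row) for name, fn in features.items()}}
            for row in rows]

rows = [
    {"age": 25, "debt": 50_000, "income": 60_000},
    {"age": 45, "debt": 10_000, "income": 90_000},
]
enriched = apply_features(rows, suggested_features)
```

The enriched rows now carry 2 extra columns, which is exactly what the extension step consumes.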
Phase 2: Extend ES (Run 2)¶
[Code snippet coming soon.]
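Until the real snippet lands, the core of the extension step can be sketched in plain Python: copy the existing codecs as-is, create codecs only for columns that are new, and record lineage. Class and field names here are illustrative assumptions, not the internal engine's types.

```python
# Hypothetical sketch of "Extend ES": carry over what exists, create codecs
# only for the new columns. Codecs are stubbed as strings for illustration.

class EmbeddingSpace:
    def __init__(self, codecs, version=1, parent=None):
        self.codecs = codecs    # column name -> codec
        self.version = version
        self.parent = parent    # lineage pointer to the ES this one extends

def extend_embedding_space(es, enriched_columns):
    """Build ES v(n+1): keep existing codecs, add codecs for new columns only."""
    new_cols = [c for c in enriched_columns if c not in es.codecs]
    if not new_cols:
        raise ValueError("No new columns found")  # see Troubleshooting
    codecs = dict(es.codecs)                      # existing codecs copied unchanged
    codecs.update({c: f"codec:{c}" for c in new_cols})
    return EmbeddingSpace(codecs, version=es.version + 1, parent=es), new_cols

es_v1 = EmbeddingSpace({f"col_{i}": f"codec:col_{i}" for i in range(30)})
es_v2, added = extend_embedding_space(
    es_v1, list(es_v1.codecs) + ["younger_borrower", "high_debt_ratio"])
```

Note that extension is additive and versioned: v2 keeps a pointer to v1, which is what makes the lineage tracking described later possible.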
Training Strategy¶
The extension training uses a two-phase approach:
Phase 1: New Columns Only (epochs/8)¶
- Freeze existing column encoders
- Train only new column encoders
- Goal: Learn embeddings for new features without disturbing existing ones
Phase 2: Joint Fine-tuning (epochs/8)¶
- Unfreeze all encoders
- Train everything jointly
- Goal: Allow new features to integrate with existing embeddings
Total: epochs/4 (much faster than full retraining)
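The two-phase split above can be expressed as a small schedule helper. The integer rounding is an assumption; the real engine may round differently.

```python
# Sketch of the two-phase extension schedule: epochs/8 frozen + epochs/8 joint
# = epochs/4 total. Floor division is an assumed rounding choice.

def extension_schedule(original_epochs):
    phase1 = max(1, original_epochs // 8)  # new columns only, existing encoders frozen
    phase2 = max(1, original_epochs // 8)  # joint fine-tuning, everything unfrozen
    return {"phase1_new_columns": phase1,
            "phase2_joint": phase2,
            "total": phase1 + phase2}      # ~ original_epochs / 4
```

For a 50-epoch original ES this gives 6 + 6 = 12 epochs, matching the workflow diagram above.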
Example Timeline¶
Original ES: 50 epochs × 30 columns = ~2 hours
Extended ES: 12 epochs × 2 new columns = ~15 minutes
Savings: ~8x faster than retraining from scratch!
API Reference¶
[Code snippet coming soon.]
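The public API is not yet exposed (see Status below), so the following is a hypothetical signature sketch of what an extend call might look like; every name and parameter here is an assumption, not the real interface.

```python
# Hypothetical API sketch only. Returns the plan it would execute rather
# than performing training, so the defaults are easy to see.
from typing import Optional

def extend_es(embedding_space_path: str,
              enriched_data,                   # DataFrame-like data incl. new columns
              epochs: Optional[int] = None,    # default: original_epochs // 4
              original_epochs: int = 50,
              freeze_existing: bool = True     # phase 1 freezes existing encoders
              ) -> dict:
    """Return the extension plan that would be executed (sketch only)."""
    resolved = epochs if epochs is not None else max(10, original_epochs // 4)
    return {"source": embedding_space_path,
            "epochs": resolved,
            "freeze_existing": freeze_existing}
```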
Extension Metadata¶
Every extended ES stores lineage information:
[Code snippet coming soon.]
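As a placeholder for the real snippet, the lineage record might look like the dictionary below. The field names and values are illustrative assumptions based on the capabilities this section describes.

```python
# Illustrative shape of the lineage metadata an extended ES might carry.
# Field names and the timestamp are assumptions, not the real schema.

extension_metadata = {
    "extended_from_version": 1,
    "new_version": 2,
    "added_columns": ["younger_borrower", "high_debt_ratio"],
    "extended_at": "2024-01-01T00:00:00Z",  # placeholder timestamp
    "training_epochs": 12,
}
```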
This enables:
- Tracking which ES version a model was trained on
- Reproducibility: know exactly which features were used
- Lineage: trace back through extension history
- Debugging: understand when and why features were added
Benefits¶
1. Efficiency¶
- Train only new columns, not entire ES from scratch
- 4x-8x faster than full retraining
- Saves GPU time and cost
2. Preservation¶
- Existing embeddings unchanged
- No loss of learned representations
- Stable foundation for iteration
3. Iterative Improvement¶
- Each run builds on previous success
- Compound improvements over time
- Gradual convergence to optimal features
4. Trackability¶
- Complete lineage of each ES version
- Know exactly which features were added when
- Reproducible results
5. Practical¶
- Works with existing feature suggestion system
- Integrates with effectiveness tracking
- Supports real-world workflows
Best Practices¶
1. Validate Features First¶
Only extend the ES with features that measurably improved your metrics.
[Code snippet coming soon.]
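One way to enforce this, sketched below: keep a record of the metric delta each feature produced and only pass features with a positive delta to the extension step. The effectiveness values and the zero threshold are illustrative assumptions.

```python
# Sketch: gate extension on measured improvement. The deltas below are
# synthetic AUC changes observed when each feature was added.

feature_effectiveness = {
    "younger_borrower": +0.021,
    "high_debt_ratio": +0.008,
    "noisy_ratio": -0.004,   # hurt the metric: do not extend with this one
}

validated = [name for name, delta in feature_effectiveness.items() if delta > 0]
```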
2. Use Appropriate Epochs¶
- Rule of thumb: original_epochs / 4
- Minimum: 10 epochs (to properly learn new columns)
- Maximum: original_epochs / 2 (diminishing returns)
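The three rules above combine into a small helper. Note the clamping order is an assumption: for very short original runs the minimum (10) can exceed the maximum (original/2), and this sketch lets the maximum win.

```python
# The epoch rule of thumb as code: original/4, floored at 10, capped at original/2.

def extension_epochs(original_epochs):
    target = original_epochs // 4               # rule of thumb
    target = max(target, 10)                    # minimum to properly learn new columns
    return min(target, original_epochs // 2)    # cap: diminishing returns beyond this
```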
3. Save Versions¶
Keep all ES versions for comparison and rollback:
embedding_space_v1.pkl # Original: 30 columns
embedding_space_v2.pkl # Extended: 32 columns (+younger_borrower, high_debt_ratio)
embedding_space_v3.pkl # Extended: 34 columns (+duration_age_risk_score, ...)
4. Monitor Extension Quality¶
Check that extension training is working.
[Code snippet coming soon.]
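A minimal sketch of such a check, with synthetic loss values: the loss in each extension phase should trend downward, otherwise the new columns are probably not being learned. The helper and its tolerance parameter are illustrative assumptions.

```python
# Sanity check sketch: per-phase training loss should end lower than it started.
# The loss values are synthetic, for illustration only.

def is_improving(losses, tolerance=0.0):
    """True if the final loss is lower than the initial loss by > tolerance."""
    return losses[-1] < losses[0] - tolerance

phase1_losses = [0.92, 0.61, 0.44, 0.39]  # new-column encoders learning
phase2_losses = [0.39, 0.35, 0.34]        # joint fine-tuning settling
```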
5. Document Changes¶
Use feature_metadata to track provenance.
[Code snippet coming soon.]
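In the meantime, per-feature provenance might be recorded as shown below. The field names and values are illustrative assumptions about what feature_metadata could hold.

```python
# Sketch of per-feature provenance in feature_metadata. The definition string
# and metric delta are assumed values for illustration.

feature_metadata = {
    "younger_borrower": {
        "added_in_version": 2,
        "source": "predictor run 1 feature suggestion",
        "definition": "age < 30",           # assumed feature definition
        "validated_metric_delta": +0.021,   # illustrative measured improvement
    },
}
```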
Troubleshooting¶
"No new columns found"¶
Problem: The enriched DataFrame has the same columns as the existing ES
Solution: Check that features were actually applied to your data before extending.
"Could not copy weights for column X"¶
Problem: Column encoder structure changed between versions
Solution: This is usually not fatal; the affected column's encoder is simply randomly initialized and learned during extension training.
"Extended ES performs worse than original"¶
Problem: New features are hurting performance
Solution:
1. Check the feature effectiveness history; only use proven features
2. Try training for more epochs
3. Remove problematic features and re-extend
"Out of memory during extension"¶
Problem: Extended ES is too large for GPU
Solution:
1. Use a smaller batch size
2. Train on CPU (slower, but works)
3. Remove the least effective features before extending
Future Enhancements¶
Possible improvements to the extend ES feature:
- Selective Freezing: More fine-grained control over which layers to freeze
- Gradual Unfreezing: Unfreeze layers progressively during training
- Pruning: Remove ineffective columns during extension
- Multi-Version Merging: Combine features from multiple ES versions
- A/B Testing: Automatically compare extended vs original ES
- Auto-Epoch Determination: ML-based estimate of optimal training epochs
Status¶
Coming to FeatrixSphere API
This functionality is currently available in the internal Featrix engine and will be exposed in the FeatrixSphere API in a future release.