Extending Embedding Spaces with Feature Engineering¶
Overview¶
The Extend ES feature allows you to add new engineered feature columns to an existing EmbeddingSpace without retraining from scratch. This is critical for iterative feature engineering where you want to:
- ✅ Preserve existing embeddings (expensive to retrain)
- ✅ Add new feature columns discovered during training
- ✅ Train only the new columns (much faster)
- ✅ Maintain lineage and provenance of features
The Problem This Solves¶
Before: Features Were Ignored¶
┌─────────────────────────────────────────────────────────────────┐
│ CURRENT BROKEN WORKFLOW │
├─────────────────────────────────────────────────────────────────┤
│ 1. Train ES on 30 columns → Creates embeddings │
│ │
│ 2. Train Predictor: │
│ • Loads features from previous run │
│ • Adds 2 new columns to DataFrame │
│ • BUT: No codecs for new columns! │
│ • Dataset only encodes columns with codecs │
│ • ❌ New features SILENTLY IGNORED │
│ │
│ Result: All feature engineering wasted! 😱 │
└─────────────────────────────────────────────────────────────────┘
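The silent-drop behavior can be shown with a toy stand-in for the encoding step (the column names and dict-of-codecs shape here are illustrative, not the real Featrix internals):

```python
# Toy version of the Dataset's encoding step: it iterates over the codecs
# that were built at ES-training time, so any DataFrame column without a
# codec never reaches the model.
codecs = {"age": float, "income": float}  # built for the original columns

row = {"age": 34, "income": 52000, "younger_borrower": 1}  # enriched row

encoded = {col: codec(row[col]) for col, codec in codecs.items() if col in row}

print(sorted(encoded))                 # ['age', 'income']
print("younger_borrower" in encoded)   # False - silently dropped
```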
After: Extend ES Concept¶
┌─────────────────────────────────────────────────────────────────┐
│ NEW "EXTEND ES" WORKFLOW │
├─────────────────────────────────────────────────────────────────┤
│ 1. Train ES v1 on 30 columns (50 epochs) │
│ → Creates embeddings for original data │
│ │
│ 2. Train Predictor, discover features: │
│ → younger_borrower │
│ → high_debt_ratio │
│ │
│ 3. Extend ES: Create ES v2 │
│ • Load ES v1 (30 columns) │
│ • Apply features to data (now 32 columns) │
│ • Create codecs for 2 new columns only │
│ • Copy existing encoder weights │
│ • Train for 12 epochs (50/4) to learn new columns │
│ → ES v2 has 32 columns with embeddings! │
│ │
│ 4. Train Predictor on ES v2: │
│ • All 32 columns have codecs │
│ • All columns encoded and used │
│ • ✅ Features actually work! │
└─────────────────────────────────────────────────────────────────┘
How It Works¶
Phase 1: Feature Discovery (Run 1)¶
from featrix.neural.input_data_set import FeatrixInputDataSet
from featrix.neural.embedded_space import EmbeddingSpace
from featrix.neural.single_predictor import SinglePredictor
# Train initial ES
dataset = FeatrixInputDataSet(df=train_df, ignore_cols=["target"])
train_data, val_data = dataset.split(fraction=0.2)
es_v1 = EmbeddingSpace(train_data, val_data, n_epochs=50)
es_v1.train(batch_size=128, n_epochs=50)
es_v1.save("embedding_space_v1.pkl")
# Train predictor - discovers features
from featrix.neural.simple_mlp import SimpleMLP
predictor = SimpleMLP.from_config(d_in=es_v1.d_model, d_out=2)
sp = SinglePredictor(es_v1, predictor)
sp.prep_for_training(train_df, 'target', 'set')
await sp.train(n_epochs=100)
# Features suggested and saved to:
# - feature_suggestion_history.json
# - feature_effectiveness.json
Phase 2: Extend ES (Run 2)¶
from featrix.neural.embedded_space import EmbeddingSpace
from featrix.neural.io_utils import load_embedding_space
from featrix.neural.feature_engineer import FeatureEngineer
# Load existing ES
es_v1 = load_embedding_space("embedding_space_v1.pkl")
# Apply discovered features to data
engineer = FeatureEngineer.from_json("qa.save/feature_suggestion_history.json")
enriched_train_df = engineer.fit_transform(train_df, verbose=True)
enriched_val_df = engineer.transform(val_df, verbose=False)
# Extend ES with new features
es_v2 = EmbeddingSpace.extend_from_existing(
existing_es=es_v1,
enriched_train_df=enriched_train_df,
enriched_val_df=enriched_val_df,
n_epochs=12, # 50 / 4 = 12.5 → 12
output_dir="qa.out/featrix_output",
feature_metadata={
"source": "feature_suggestion_history.json",
"applied_features": ["younger_borrower", "high_debt_ratio"]
}
)
# Train extended ES
es_v2.train(batch_size=128, n_epochs=12)
es_v2.save("embedding_space_v2.pkl")
# Train predictor on extended ES
sp2 = SinglePredictor(es_v2, predictor)
sp2.prep_for_training(enriched_train_df, 'target', 'set')
await sp2.train(n_epochs=100)
# NOW the features are actually used!
Alternative: Auto-Extension in SinglePredictor¶
If features were loaded during prep_for_training(), you can create an extended ES directly:
sp = SinglePredictor(es_v1, predictor)
sp.prep_for_training(train_df, 'target', 'set')
# Features loaded automatically
# Create extended ES from loaded features
es_v2 = sp.create_extended_embedding_space(
enriched_train_df=sp.train_df, # Already has features applied
enriched_val_df=sp.val_df,
output_dir="qa.out/featrix_output"
)
# Train extended ES
es_v2.train(batch_size=128, n_epochs=12)
es_v2.save("embedding_space_v2.pkl")
Training Strategy¶
The extension training uses a two-phase approach:
Phase 1: New Columns Only (epochs/8)¶
- Freeze existing column encoders
- Train only new column encoders
- Goal: Learn embeddings for new features without disturbing existing ones
Phase 2: Joint Fine-tuning (epochs/8)¶
- Unfreeze all encoders
- Train everything jointly
- Goal: Allow new features to integrate with existing embeddings
Total: epochs/4 (much faster than full retraining)
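The two phases above can be sketched in plain PyTorch. This is a minimal illustration of the freeze/unfreeze schedule, not the real trainer; `set_trainable_columns` and the toy `nn.Linear` encoders are hypothetical stand-ins:

```python
import torch
from torch import nn

def set_trainable_columns(encoders: nn.ModuleDict, trainable_cols):
    """Freeze every per-column encoder except those named in trainable_cols."""
    for col, enc in encoders.items():
        for p in enc.parameters():
            p.requires_grad = col in trainable_cols

# Toy per-column encoders (stand-ins for the real Featrix encoders)
encoders = nn.ModuleDict({
    "age": nn.Linear(4, 8),
    "income": nn.Linear(4, 8),
    "younger_borrower": nn.Linear(4, 8),  # new column
})

# Phase 1 (epochs/8): only the new column's encoder trains
set_trainable_columns(encoders, {"younger_borrower"})
# ... run the phase-1 training loop here ...

# Phase 2 (epochs/8): unfreeze everything for joint fine-tuning
set_trainable_columns(encoders, set(encoders.keys()))
# ... run the phase-2 training loop here ...
```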
Example Timeline¶
Original ES: 50 epochs × 30 columns = ~2 hours
Extended ES: 12 epochs × 2 new columns = ~15 minutes
Savings: ~8x faster than retraining from scratch!
API Reference¶
EmbeddingSpace.extend_from_existing()¶
@classmethod
def extend_from_existing(
cls,
existing_es: 'EmbeddingSpace',
enriched_train_df: pd.DataFrame,
enriched_val_df: Optional[pd.DataFrame] = None,
n_epochs: int = None, # Default: original_epochs / 4
batch_size: int = None, # Default: existing_es.batch_size
output_dir: str = None,
name: str = None, # Default: f"{existing_es.name}_extended"
feature_metadata: Optional[Dict[str, Any]] = None
) -> 'EmbeddingSpace'
Arguments:
- existing_es: The EmbeddingSpace to extend
- enriched_train_df: Training DataFrame with new columns added
- enriched_val_df: Validation DataFrame (optional, will split if None)
- n_epochs: Training epochs for extension (default: original_epochs / 4)
- batch_size: Training batch size (default: use existing ES)
- output_dir: Output directory for extended ES
- name: Name for extended ES
- feature_metadata: Dict with provenance info about new features
Returns:
- New EmbeddingSpace with extended columns
Process:
1. Identifies new columns not in existing ES
2. Creates InputDataSet from enriched DataFrames
3. Creates new ES with all columns (old + new)
4. Copies encoder weights for existing columns
5. Creates codecs for new columns only
6. Stores extension metadata for lineage tracking
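Steps 4-5 can be sketched with standard PyTorch `state_dict` mechanics. The encoders below are toy stand-ins, not the real ES architecture; the actual copy is handled inside `extend_from_existing()`:

```python
import torch
from torch import nn

# Toy per-column encoders for ES v1 (two columns) and ES v2 (one extra)
old_encoders = nn.ModuleDict({"age": nn.Linear(4, 8), "income": nn.Linear(4, 8)})
new_encoders = nn.ModuleDict({
    "age": nn.Linear(4, 8),
    "income": nn.Linear(4, 8),
    "younger_borrower": nn.Linear(4, 8),  # new column, fresh random init
})

# Step 4: copy weights for the shared columns; strict=False leaves the new
# column's parameters at their random initialization (step 5's codec
# creation is a separate concern, not shown here)
result = new_encoders.load_state_dict(old_encoders.state_dict(), strict=False)
print(result.missing_keys)  # the new column's params were not in v1
```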
SinglePredictor.create_extended_embedding_space()¶
def create_extended_embedding_space(
self,
enriched_train_df: pd.DataFrame,
enriched_val_df: Optional[pd.DataFrame] = None,
n_epochs: int = None,
batch_size: int = None,
output_dir: str = None
) -> 'EmbeddingSpace'
When to use: After prep_for_training() has loaded features
Arguments: Same as extend_from_existing() but uses tracked feature metadata
Returns: New extended EmbeddingSpace
Extension Metadata¶
Every extended ES stores lineage information:
extended_es.extension_metadata = {
"extended_from_es_name": "credit_es_v1",
"extended_from_es_version": {"version": "0.2.4310"},
"extension_date": "2026-01-01T18:30:00-05:00",
"new_columns_added": ["younger_borrower", "high_debt_ratio"],
"original_column_count": 30,
"extended_column_count": 32,
"training_epochs_used": 12,
"feature_metadata": {
"source": "feature_suggestion_history.json",
"applied_features": ["younger_borrower", "high_debt_ratio"],
"load_date": "2026-01-01T18:00:00"
}
}
This enables:
- ✅ Tracking which ES version a model was trained on
- ✅ Reproducibility: know exactly which features were used
- ✅ Lineage: trace back through extension history
- ✅ Debugging: understand when and why features were added
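The lineage fields can be chained to trace any ES back to its root. A minimal pure-Python sketch, assuming each extended ES's metadata carries `extended_from_es_name` as in the dict above (`lineage_chain` is a hypothetical helper):

```python
def lineage_chain(metadata_by_name, leaf_name):
    """Walk extension_metadata links from a leaf ES back to the root."""
    chain = [leaf_name]
    meta = metadata_by_name.get(leaf_name)
    while meta and meta.get("extended_from_es_name"):
        parent = meta["extended_from_es_name"]
        chain.append(parent)
        meta = metadata_by_name.get(parent)
    return chain

# Example: v3 was extended from v2, which was extended from the root v1
history = {
    "credit_es_v3": {"extended_from_es_name": "credit_es_v2"},
    "credit_es_v2": {"extended_from_es_name": "credit_es_v1"},
}
print(lineage_chain(history, "credit_es_v3"))
# ['credit_es_v3', 'credit_es_v2', 'credit_es_v1']
```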
Complete Example¶
Multi-Run Iterative Feature Engineering¶
# ============================================================================
# RUN 1: Initial Training
# ============================================================================
import pandas as pd
from featrix.neural.input_data_set import FeatrixInputDataSet
from featrix.neural.embedded_space import EmbeddingSpace
from featrix.neural.single_predictor import SinglePredictor
from featrix.neural.simple_mlp import SimpleMLP
# Load data
df = pd.read_csv("credit.csv")
# Train initial ES
dataset = FeatrixInputDataSet(df=df, ignore_cols=["target"])
train_data, val_data = dataset.split(fraction=0.2)
es_v1 = EmbeddingSpace(train_data, val_data, n_epochs=50, name="credit_es_v1")
es_v1.train(batch_size=128, n_epochs=50)
es_v1.save("embedding_space_v1.pkl")
# Train predictor
predictor = SimpleMLP.from_config(d_in=es_v1.d_model, d_out=2)
sp1 = SinglePredictor(es_v1, predictor, name="credit_predictor_v1")
sp1.prep_for_training(df, 'target', 'set')
await sp1.train(n_epochs=100)
sp1.save("predictor_v1.pkl")
# Results: AUC = 0.78
# Features suggested: younger_borrower, high_debt_ratio
# ============================================================================
# RUN 2: Extend ES with Features
# ============================================================================
from featrix.neural.embedded_space import EmbeddingSpace
from featrix.neural.single_predictor import SinglePredictor
from featrix.neural.simple_mlp import SimpleMLP
from featrix.neural.io_utils import load_embedding_space
from featrix.neural.feature_engineer import FeatureEngineer
# Load ES v1
es_v1 = load_embedding_space("embedding_space_v1.pkl")
# Apply features
engineer = FeatureEngineer.from_json("qa.save/feature_suggestion_history.json")
train_df_enriched = engineer.fit_transform(df.iloc[:800], verbose=True)
val_df_enriched = engineer.transform(df.iloc[800:], verbose=False)
# Extend ES
es_v2 = EmbeddingSpace.extend_from_existing(
existing_es=es_v1,
enriched_train_df=train_df_enriched,
enriched_val_df=val_df_enriched,
n_epochs=12,
name="credit_es_v2"
)
# Train extended ES
es_v2.train(batch_size=128, n_epochs=12)
es_v2.save("embedding_space_v2.pkl")
# Train predictor on ES v2
predictor = SimpleMLP.from_config(d_in=es_v2.d_model, d_out=2)
sp2 = SinglePredictor(es_v2, predictor, name="credit_predictor_v2")
sp2.prep_for_training(train_df_enriched, 'target', 'set')
await sp2.train(n_epochs=100)
sp2.save("predictor_v2.pkl")
# Results: AUC = 0.84 (+6 points!)
# New features suggested: duration_age_risk_score
# ============================================================================
# RUN 3: Further Extension
# ============================================================================
# Load ES v2 and extend with new feature...
# And so on - each run builds on the previous one!
Benefits¶
1. Efficiency¶
- Train only new columns, not entire ES from scratch
- 4x-8x faster than full retraining
- Saves GPU time and cost
2. Preservation¶
- Existing embeddings unchanged
- No loss of learned representations
- Stable foundation for iteration
3. Iterative Improvement¶
- Each run builds on previous success
- Compound improvements over time
- Gradual convergence to optimal features
4. Trackability¶
- Complete lineage of each ES version
- Know exactly which features were added when
- Reproducible results
5. Practical¶
- Works with existing feature suggestion system
- Integrates with effectiveness tracking
- Supports real-world workflows
Best Practices¶
1. Validate Features First¶
Only extend ES with features that actually improved metrics:
# Check effectiveness history
tracker = FeatureEffectivenessTracker()
tracker.load_history("qa.save/feature_effectiveness.json")
# Only use features that improved by >1%
recommended = tracker.get_recommended_features(
metric='roc_auc',
min_improvement_pct=1.0
)
# Filter features before extending
filtered_suggestions = [s for s in suggestions if s['name'] in recommended]
engineer = FeatureEngineer(suggestions=filtered_suggestions)
2. Use Appropriate Epochs¶
- Rule of thumb: original_epochs / 4
- Minimum: 10 epochs (to properly learn new columns)
- Maximum: original_epochs / 2 (diminishing returns)
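The rule of thumb above can be written as a small helper (hypothetical; the built-in default is simply `original_epochs / 4`):

```python
def extension_epochs(original_epochs: int) -> int:
    """original/4, kept within [10, original/2]; the 10-epoch floor wins
    for very short original runs so new columns still get learned."""
    target = original_epochs // 4
    capped = min(target, original_epochs // 2)
    return max(10, capped)

print(extension_epochs(50))   # 12
print(extension_epochs(100))  # 25
```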
3. Save Versions¶
Keep all ES versions for comparison and rollback:
embedding_space_v1.pkl # Original: 30 columns
embedding_space_v2.pkl # Extended: 32 columns (+younger_borrower, high_debt_ratio)
embedding_space_v3.pkl # Extended: 34 columns (+duration_age_risk_score, ...)
4. Monitor Extension Quality¶
Check that extension training is working:
# After training extended ES
final_loss = es_v2.training_info['final_validation_loss']
logger.info(f"Extended ES validation loss: {final_loss}")
# Should be similar to or better than original ES loss
# If much worse, something went wrong
5. Document Changes¶
Use feature_metadata to track provenance:
feature_metadata = {
"source": "feature_suggestion_history.json",
"applied_features": ["younger_borrower", "high_debt_ratio"],
"feature_effectiveness": {
"younger_borrower": {"roc_auc_improvement": 3.8},
"high_debt_ratio": {"roc_auc_improvement": 2.3}
},
"run_date": "2026-01-01",
"run_by": "data_scientist_name"
}
Troubleshooting¶
"No new columns found"¶
Problem: Enriched DataFrame has same columns as existing ES
Solution: Check that features were actually applied:
print("Original columns:", set(df.columns))
print("Enriched columns:", set(enriched_df.columns))
print("New columns:", set(enriched_df.columns) - set(df.columns))
"Could not copy weights for column X"¶
Problem: Column encoder structure changed between versions
Solution: This is usually not fatal - the affected column's encoder falls back to random initialization and is re-learned during joint fine-tuning
"Extended ES performs worse than original"¶
Problem: New features are hurting performance
Solution:
1. Check the feature effectiveness history - only use proven features
2. Try training for more epochs
3. Remove problematic features and re-extend
"Out of memory during extension"¶
Problem: Extended ES is too large for GPU
Solution:
1. Use a smaller batch size
2. Train on CPU (slower, but works)
3. Remove the least effective features before extending
Future Enhancements¶
Possible improvements to the extend ES feature:
- Selective Freezing: More fine-grained control over which layers to freeze
- Gradual Unfreezing: Unfreeze layers progressively during training
- Pruning: Remove ineffective columns during extension
- Multi-Version Merging: Combine features from multiple ES versions
- A/B Testing: Automatically compare extended vs original ES
- Auto-Epoch Determination: ML-based estimate of optimal training epochs
Related Documentation¶
- FEATURE_EFFECTIVENESS_TRACKING.md - Track which features improve metrics
- ITERATIVE_FEATURE_ENGINEERING.md - Feature suggestion and application workflow
- FEATURE_ENGINEERING_INTEGRATION.md - FeatureEngineer class usage
Status¶
✅ FULLY IMPLEMENTED
- EmbeddingSpace.extend_from_existing() - Core extension method
- SinglePredictor.create_extended_embedding_space() - Convenience wrapper
- Extension metadata tracking
- Lineage preservation
- Codec creation for new columns
- Weight copying for existing columns
Ready for production use