Extending Embedding Spaces with Feature Engineering

Overview

The Extend ES feature allows you to add new engineered feature columns to an existing EmbeddingSpace without retraining from scratch. This is critical for iterative feature engineering where you want to:

  1. ✅ Preserve existing embeddings (expensive to retrain)
  2. ✅ Add new feature columns discovered during training
  3. ✅ Train only the new columns (much faster)
  4. ✅ Maintain lineage and provenance of features

The Problem This Solves

Before: Features Were Ignored

┌─────────────────────────────────────────────────────────────────┐
│ CURRENT BROKEN WORKFLOW                                         │
├─────────────────────────────────────────────────────────────────┤
│ 1. Train ES on 30 columns → Creates embeddings                 │
│                                                                  │
│ 2. Train Predictor:                                             │
│    • Loads features from previous run                           │
│    • Adds 2 new columns to DataFrame                            │
│    • BUT: No codecs for new columns!                            │
│    • Dataset only encodes columns with codecs                   │
│    • ❌ New features SILENTLY IGNORED                           │
│                                                                  │
│ Result: All feature engineering wasted! 😱                      │
└─────────────────────────────────────────────────────────────────┘
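
In other words, any column that lacks a codec is silently skipped at encoding time. A toy illustration of the failure mode (hypothetical names, not the real API):

# Only columns that already have a codec get encoded; the rest are dropped
codecs = {"age": "codec", "income": "codec"}  # built when the ES was trained
enriched_columns = ["age", "income", "younger_borrower"]  # added during the predictor run

encoded = [c for c in enriched_columns if c in codecs]
ignored = [c for c in enriched_columns if c not in codecs]
print(ignored)  # ['younger_borrower'] <- silently ignored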

After: Extend ES Concept

┌─────────────────────────────────────────────────────────────────┐
│ NEW "EXTEND ES" WORKFLOW                                        │
├─────────────────────────────────────────────────────────────────┤
│ 1. Train ES v1 on 30 columns (50 epochs)                       │
│    → Creates embeddings for original data                       │
│                                                                  │
│ 2. Train Predictor, discover features:                          │
│    → younger_borrower                                           │
│    → high_debt_ratio                                            │
│                                                                  │
│ 3. Extend ES: Create ES v2                                      │
│    • Load ES v1 (30 columns)                                    │
│    • Apply features to data (now 32 columns)                    │
│    • Create codecs for 2 new columns only                       │
│    • Copy existing encoder weights                              │
│    • Train for 12 epochs (50/4) to learn new columns            │
│    → ES v2 has 32 columns with embeddings!                      │
│                                                                  │
│ 4. Train Predictor on ES v2:                                    │
│    • All 32 columns have codecs                                 │
│    • All columns encoded and used                               │
│    • ✅ Features actually work!                                 │
└─────────────────────────────────────────────────────────────────┘

How It Works

Phase 1: Feature Discovery (Run 1)

import pandas as pd

from featrix.neural.input_data_set import FeatrixInputDataSet
from featrix.neural.embedded_space import EmbeddingSpace
from featrix.neural.single_predictor import SinglePredictor
from featrix.neural.simple_mlp import SimpleMLP

# Load data and hold out a validation slice
df = pd.read_csv("credit.csv")
train_df, val_df = df.iloc[:800], df.iloc[800:]

# Train initial ES
dataset = FeatrixInputDataSet(df=train_df, ignore_cols=["target"])
train_data, val_data = dataset.split(fraction=0.2)

es_v1 = EmbeddingSpace(train_data, val_data, n_epochs=50)
es_v1.train(batch_size=128, n_epochs=50)
es_v1.save("embedding_space_v1.pkl")

# Train predictor - discovers features during training
predictor = SimpleMLP.from_config(d_in=es_v1.d_model, d_out=2)
sp = SinglePredictor(es_v1, predictor)
sp.prep_for_training(train_df, 'target', 'set')
await sp.train(n_epochs=100)

# Features suggested and saved to:
# - feature_suggestion_history.json
# - feature_effectiveness.json

Phase 2: Extend ES (Run 2)

from featrix.neural.embedded_space import EmbeddingSpace
from featrix.neural.io_utils import load_embedding_space
from featrix.neural.feature_engineer import FeatureEngineer

# Load existing ES
es_v1 = load_embedding_space("embedding_space_v1.pkl")

# Apply discovered features to data
engineer = FeatureEngineer.from_json("qa.save/feature_suggestion_history.json")
enriched_train_df = engineer.fit_transform(train_df, verbose=True)
enriched_val_df = engineer.transform(val_df, verbose=False)

# Extend ES with new features
es_v2 = EmbeddingSpace.extend_from_existing(
    existing_es=es_v1,
    enriched_train_df=enriched_train_df,
    enriched_val_df=enriched_val_df,
    n_epochs=12,  # 50 / 4 = 12.5 → 12
    output_dir="qa.out/featrix_output",
    feature_metadata={
        "source": "feature_suggestion_history.json",
        "applied_features": ["younger_borrower", "high_debt_ratio"]
    }
)

# Train extended ES
es_v2.train(batch_size=128, n_epochs=12)
es_v2.save("embedding_space_v2.pkl")

# Train predictor on extended ES (rebuild to match the new d_model)
predictor = SimpleMLP.from_config(d_in=es_v2.d_model, d_out=2)
sp2 = SinglePredictor(es_v2, predictor)
sp2.prep_for_training(enriched_train_df, 'target', 'set')
await sp2.train(n_epochs=100)

# NOW the features are actually used!

Alternative: Auto-Extension in SinglePredictor

If features were loaded during prep_for_training(), you can create an extended ES directly:

sp = SinglePredictor(es_v1, predictor)
sp.prep_for_training(train_df, 'target', 'set')
# Features loaded automatically

# Create extended ES from loaded features
es_v2 = sp.create_extended_embedding_space(
    enriched_train_df=sp.train_df,  # Already has features applied
    enriched_val_df=sp.val_df,
    output_dir="qa.out/featrix_output"
)

# Train extended ES
es_v2.train(batch_size=128, n_epochs=12)
es_v2.save("embedding_space_v2.pkl")

Training Strategy

The extension training uses a two-phase approach:

Phase 1: New Columns Only (original epochs / 8)

  • Freeze existing column encoders
  • Train only new column encoders
  • Goal: Learn embeddings for new features without disturbing existing ones

Phase 2: Joint Fine-tuning (original epochs / 8)

  • Unfreeze all encoders
  • Train everything jointly
  • Goal: Allow new features to integrate with existing embeddings

Total: original epochs / 4 (much faster than full retraining)
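
A minimal PyTorch-style sketch of this schedule, assuming the ES exposes its per-column encoders as a dict of modules (the column_encoders attribute name is hypothetical):

def train_extension(es, new_columns, extension_epochs):
    # Each phase gets half of the extension budget (epochs/8 + epochs/8)
    phase_epochs = max(1, extension_epochs // 2)

    # Phase 1: freeze existing encoders, train only the new ones
    for col, encoder in es.column_encoders.items():
        for p in encoder.parameters():
            p.requires_grad = col in new_columns
    es.train(batch_size=128, n_epochs=phase_epochs)

    # Phase 2: unfreeze everything and fine-tune jointly
    for encoder in es.column_encoders.values():
        for p in encoder.parameters():
            p.requires_grad = True
    es.train(batch_size=128, n_epochs=phase_epochs)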

Example Timeline

Original ES: 50 epochs × 30 columns = ~2 hours
Extended ES: 12 epochs × 2 new columns = ~15 minutes

Savings: ~8x faster than retraining from scratch!

API Reference

EmbeddingSpace.extend_from_existing()

@classmethod
def extend_from_existing(
    cls,
    existing_es: 'EmbeddingSpace',
    enriched_train_df: pd.DataFrame,
    enriched_val_df: Optional[pd.DataFrame] = None,
    n_epochs: int = None,  # Default: original_epochs / 4
    batch_size: int = None,  # Default: existing_es.batch_size
    output_dir: str = None,
    name: str = None,  # Default: f"{existing_es.name}_extended"
    feature_metadata: Optional[Dict[str, Any]] = None
) -> 'EmbeddingSpace'

Arguments:

  • existing_es: The EmbeddingSpace to extend
  • enriched_train_df: Training DataFrame with new columns added
  • enriched_val_df: Validation DataFrame (optional; split from train if None)
  • n_epochs: Training epochs for extension (default: original_epochs / 4)
  • batch_size: Training batch size (default: use existing ES batch size)
  • output_dir: Output directory for the extended ES
  • name: Name for the extended ES
  • feature_metadata: Dict with provenance info about the new features

Returns: A new EmbeddingSpace with the extended column set.

Process:

  1. Identifies new columns not present in the existing ES
  2. Creates an InputDataSet from the enriched DataFrames
  3. Creates a new ES with all columns (old + new)
  4. Copies encoder weights for existing columns
  5. Creates codecs for the new columns only
  6. Stores extension metadata for lineage tracking
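
Step 4 is the core of the extension. A sketch of the weight copy, again assuming a hypothetical per-column column_encoders dict:

# Reuse trained weights for columns the old and new ES have in common;
# a column that fails falls back to random initialization (see Troubleshooting)
for col in set(old_es.column_encoders) & set(new_es.column_encoders):
    try:
        new_es.column_encoders[col].load_state_dict(
            old_es.column_encoders[col].state_dict()
        )
    except RuntimeError:
        print(f"Could not copy weights for column {col}")  # not fatal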

SinglePredictor.create_extended_embedding_space()

def create_extended_embedding_space(
    self,
    enriched_train_df: pd.DataFrame,
    enriched_val_df: Optional[pd.DataFrame] = None,
    n_epochs: int = None,
    batch_size: int = None,
    output_dir: str = None
) -> 'EmbeddingSpace'

When to use: After prep_for_training() has loaded features

Arguments: Same as extend_from_existing() but uses tracked feature metadata

Returns: New extended EmbeddingSpace

Extension Metadata

Every extended ES stores lineage information:

extended_es.extension_metadata = {
    "extended_from_es_name": "credit_es_v1",
    "extended_from_es_version": {"version": "0.2.4310"},
    "extension_date": "2026-01-01T18:30:00-05:00",
    "new_columns_added": ["younger_borrower", "high_debt_ratio"],
    "original_column_count": 30,
    "extended_column_count": 32,
    "training_epochs_used": 12,
    "feature_metadata": {
        "source": "feature_suggestion_history.json",
        "applied_features": ["younger_borrower", "high_debt_ratio"],
        "load_date": "2026-01-01T18:00:00"
    }
}

This enables:

  • ✅ Tracking which ES version a model was trained on
  • ✅ Reproducibility - know exactly which features were used
  • ✅ Lineage - trace back through extension history
  • ✅ Debugging - understand when/why features were added
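
For example, an ES's provenance can be inspected straight from this dict (a minimal sketch using only the fields shown above):

meta = extended_es.extension_metadata
print(f"Extended from: {meta['extended_from_es_name']}")
print(f"New columns:   {meta['new_columns_added']}")
print(f"Columns:       {meta['original_column_count']} -> {meta['extended_column_count']}")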

Complete Example

Multi-Run Iterative Feature Engineering

# ============================================================================
# RUN 1: Initial Training
# ============================================================================
import pandas as pd

from featrix.neural.input_data_set import FeatrixInputDataSet
from featrix.neural.embedded_space import EmbeddingSpace
from featrix.neural.single_predictor import SinglePredictor
from featrix.neural.simple_mlp import SimpleMLP

# Load data
df = pd.read_csv("credit.csv")

# Train initial ES
dataset = FeatrixInputDataSet(df=df, ignore_cols=["target"])
train_data, val_data = dataset.split(fraction=0.2)

es_v1 = EmbeddingSpace(train_data, val_data, n_epochs=50, name="credit_es_v1")
es_v1.train(batch_size=128, n_epochs=50)
es_v1.save("embedding_space_v1.pkl")

# Train predictor
predictor = SimpleMLP.from_config(d_in=es_v1.d_model, d_out=2)
sp1 = SinglePredictor(es_v1, predictor, name="credit_predictor_v1")
sp1.prep_for_training(df, 'target', 'set')
await sp1.train(n_epochs=100)
sp1.save("predictor_v1.pkl")

# Results: AUC = 0.78
# Features suggested: younger_borrower, high_debt_ratio

# ============================================================================
# RUN 2: Extend ES with Features
# ============================================================================
from featrix.neural.io_utils import load_embedding_space
from featrix.neural.feature_engineer import FeatureEngineer

# Load ES v1
es_v1 = load_embedding_space("embedding_space_v1.pkl")

# Apply features
engineer = FeatureEngineer.from_json("qa.save/feature_suggestion_history.json")
train_df_enriched = engineer.fit_transform(df.iloc[:800], verbose=True)
val_df_enriched = engineer.transform(df.iloc[800:], verbose=False)

# Extend ES
es_v2 = EmbeddingSpace.extend_from_existing(
    existing_es=es_v1,
    enriched_train_df=train_df_enriched,
    enriched_val_df=val_df_enriched,
    n_epochs=12,
    name="credit_es_v2"
)

# Train extended ES
es_v2.train(batch_size=128, n_epochs=12)
es_v2.save("embedding_space_v2.pkl")

# Train predictor on ES v2
predictor = SimpleMLP.from_config(d_in=es_v2.d_model, d_out=2)
sp2 = SinglePredictor(es_v2, predictor, name="credit_predictor_v2")
sp2.prep_for_training(train_df_enriched, 'target', 'set')
await sp2.train(n_epochs=100)
sp2.save("predictor_v2.pkl")

# Results: AUC = 0.84 (+6 points!)
# New features suggested: duration_age_risk_score

# ============================================================================
# RUN 3: Further Extension
# ============================================================================
# Load ES v2 and extend with new feature...
# And so on - each run builds on the previous one!
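
Runs beyond the second all follow the same pattern. A generic sketch of the per-run loop (reusing the Run 2 imports; version numbering is illustrative):

for version in range(2, 4):
    es_prev = load_embedding_space(f"embedding_space_v{version}.pkl")
    engineer = FeatureEngineer.from_json("qa.save/feature_suggestion_history.json")
    train_enriched = engineer.fit_transform(df.iloc[:800], verbose=True)
    val_enriched = engineer.transform(df.iloc[800:], verbose=False)

    es_next = EmbeddingSpace.extend_from_existing(
        existing_es=es_prev,
        enriched_train_df=train_enriched,
        enriched_val_df=val_enriched,
        n_epochs=12,
        name=f"credit_es_v{version + 1}",
    )
    es_next.train(batch_size=128, n_epochs=12)
    es_next.save(f"embedding_space_v{version + 1}.pkl")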

Benefits

1. Efficiency

  • Train only new columns, not entire ES from scratch
  • 4x-8x faster than full retraining
  • Saves GPU time and cost

2. Preservation

  • Existing embeddings unchanged
  • No loss of learned representations
  • Stable foundation for iteration

3. Iterative Improvement

  • Each run builds on previous success
  • Compound improvements over time
  • Gradual convergence to optimal features

4. Trackability

  • Complete lineage of each ES version
  • Know exactly which features were added when
  • Reproducible results

5. Practical

  • Works with existing feature suggestion system
  • Integrates with effectiveness tracking
  • Supports real-world workflows

Best Practices

1. Validate Features First

Only extend ES with features that actually improved metrics:

# Check effectiveness history
tracker = FeatureEffectivenessTracker()
tracker.load_history("qa.save/feature_effectiveness.json")

# Only use features that improved by >1%
recommended = tracker.get_recommended_features(
    metric='roc_auc',
    min_improvement_pct=1.0
)

# Filter the suggestion list (as loaded from feature_suggestion_history.json)
# before extending
filtered_suggestions = [s for s in suggestions if s['name'] in recommended]
engineer = FeatureEngineer(suggestions=filtered_suggestions)

2. Use Appropriate Epochs

  • Rule of thumb: original_epochs / 4
  • Minimum: 10 epochs (to properly learn new columns)
  • Maximum: original_epochs / 2 (diminishing returns)
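
These heuristics are easy to capture in a small helper (hypothetical, not part of the library):

def extension_epochs(original_epochs: int) -> int:
    # Rule of thumb: original / 4, at least 10, at most original / 2
    return min(max(original_epochs // 4, 10), original_epochs // 2)

extension_epochs(50)   # -> 12
extension_epochs(100)  # -> 25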

3. Save Versions

Keep all ES versions for comparison and rollback:

embedding_space_v1.pkl  # Original: 30 columns
embedding_space_v2.pkl  # Extended: 32 columns (+younger_borrower, high_debt_ratio)
embedding_space_v3.pkl  # Extended: 34 columns (+duration_age_risk_score, ...)

4. Monitor Extension Quality

Check that extension training is working:

import logging
logger = logging.getLogger(__name__)

# After training the extended ES
final_loss = es_v2.training_info['final_validation_loss']
logger.info(f"Extended ES validation loss: {final_loss}")

# Should be similar to or better than original ES loss
# If much worse, something went wrong

5. Document Changes

Use feature_metadata to track provenance:

feature_metadata = {
    "source": "feature_suggestion_history.json",
    "applied_features": ["younger_borrower", "high_debt_ratio"],
    "feature_effectiveness": {
        "younger_borrower": {"roc_auc_improvement": 3.8},
        "high_debt_ratio": {"roc_auc_improvement": 2.3}
    },
    "run_date": "2026-01-01",
    "run_by": "data_scientist_name"
}

Troubleshooting

"No new columns found"

Problem: Enriched DataFrame has same columns as existing ES

Solution: Check that features were actually applied:

print("Original columns:", set(df.columns))
print("Enriched columns:", set(enriched_df.columns))
print("New columns:", set(enriched_df.columns) - set(df.columns))

"Could not copy weights for column X"

Problem: Column encoder structure changed between versions

Solution: This is usually not fatal - the affected column's encoder is simply randomly initialized and learned during extension training

"Extended ES performs worse than original"

Problem: New features are hurting performance

Solution:

  1. Check the feature effectiveness history - only use proven features
  2. Try training for more epochs
  3. Remove the problematic features and re-extend

"Out of memory during extension"

Problem: Extended ES is too large for GPU

Solution:

  1. Use a smaller batch size
  2. Train on CPU (slower, but works)
  3. Remove the least effective features before extending
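
For the batch-size route, simply pass a smaller value when training the extension (same train signature used throughout this document):

es_v2.train(batch_size=32, n_epochs=12)  # smaller batches trade speed for memory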

Future Enhancements

Possible improvements to the extend ES feature:

  1. Selective Freezing: More fine-grained control over which layers to freeze
  2. Gradual Unfreezing: Unfreeze layers progressively during training
  3. Pruning: Remove ineffective columns during extension
  4. Multi-Version Merging: Combine features from multiple ES versions
  5. A/B Testing: Automatically compare extended vs original ES
  6. Auto-Epoch Determination: ML-based estimate of optimal training epochs

Related Documentation

  • FEATURE_EFFECTIVENESS_TRACKING.md - Track which features improve metrics
  • ITERATIVE_FEATURE_ENGINEERING.md - Feature suggestion and application workflow
  • FEATURE_ENGINEERING_INTEGRATION.md - FeatureEngineer class usage

Status

FULLY IMPLEMENTED

  • EmbeddingSpace.extend_from_existing() - Core extension method
  • SinglePredictor.create_extended_embedding_space() - Convenience wrapper
  • Extension metadata tracking
  • Lineage preservation
  • Codec creation for new columns
  • Weight copying for existing columns

Ready for production use