Prediction Results: Built-In Safety¶
Every prediction from Featrix includes safety information. If you pay attention to the warnings and errors, you cannot get silently bad results. The system tells you when something is wrong.
The Problem with Black-Box Predictions¶
Most ML systems return a single number—the prediction—and expect you to trust it blindly:
# Typical ML library
prediction = model.predict(record) # Returns 0.85
# Is this reliable? Who knows!
What if:
- The input data is completely different from training data?
- A critical column has an unexpected value?
- The model is extrapolating into unknown territory?
Traditional systems don't tell you. They just return a number.
Featrix Tells You Everything¶
Every Featrix prediction includes:
result = predictor.predict(record)
# The prediction
result.predicted_class # "will_churn"
result.probability # 0.85
result.confidence # 0.70 (distance from decision boundary)
# Safety information
result.guardrails # Per-column warnings
result.ignored_query_columns # Columns you sent that we don't know
result.available_query_columns # Columns we expected
result.prediction_uuid # Unique ID for tracking
Guardrails: Per-Column Safety Checks¶
Before making any prediction, Featrix analyzes each input column:
For Numeric Columns¶
The system compares your value to the training distribution:
| Zone | Z-Score | What It Means | System Response |
|---|---|---|---|
| Normal | 0-1σ | Close to average | OK |
| In Range | 1-2σ | Normal variation | OK |
| Outlier | 2-3σ | Unusual but seen | OK (flagged) |
| Extreme | 3-4σ | Rare in training | Warning |
| Extrapolation | 4-20σ | Outside training | Warning: "prediction may be less accurate" |
| Severe | 20-100σ | Far from training | Warning: "prediction quality uncertain" |
| Clamped | >100σ | Ridiculous value | Error: "prediction unreliable" |
Example:
# Training data had income from $20K-$200K
result = predictor.predict({"income": 50000000}) # $50M
result.guardrails
# {
# "income": "Error: value is extremely far from training data - prediction unreliable (85.2σ, clamped to 100σ)"
# }
The system won't silently fail. It tells you the prediction is unreliable.
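For intuition, here is a minimal sketch of how the z-score bands in the table above map to a zone label. The thresholds are copied from the table; the function name and the mapping itself are illustrative, not Featrix's internal code, which also produces the per-column messages shown above.
def zone_for_z_score(z):
    # Illustrative only: thresholds copied from the zone table above,
    # not from Featrix's implementation.
    z = abs(z)
    if z <= 1:
        return "Normal"
    if z <= 2:
        return "In Range"
    if z <= 3:
        return "Outlier"
    if z <= 4:
        return "Extreme"
    if z <= 20:
        return "Extrapolation"
    if z <= 100:
        return "Severe"
    return "Clamped"

zone_for_z_score(2.5)   # "Outlier" -- unusual but seen in training
zone_for_z_score(35.0)  # "Severe"  -- prediction quality uncertain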
For Categorical Columns¶
The system checks if it has seen the value before:
| Situation | System Response |
|---|---|
| Known value | OK |
| Null/missing | Warning: "categorical value is (null)" |
| Unknown value | Warning: "categorical value 'X' is UNKNOWN: expected one of [...]" |
Example:
# Training data had countries: ["US", "UK", "Canada", "Mexico"]
result = predictor.predict({"country": "Narnia"})
result.guardrails
# {
# "country": "Warning: categorical value 'Narnia' is UNKNOWN: expected one of ['US', 'UK', 'Canada', 'Mexico']"
# }
The prediction still runs (using BERT semantic similarity to find the closest known value), but you're warned that the input is outside the training distribution.
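The matching model is internal to Featrix, but the general idea can be sketched with any sentence-embedding library. The snippet below uses the open-source sentence-transformers package purely as a stand-in to show what "closest known value by semantic similarity" means; it is not Featrix's pipeline.
from sentence_transformers import SentenceTransformer, util

# Stand-in embedding model; Featrix's internal model and matching logic differ.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
known_values = ["US", "UK", "Canada", "Mexico"]

def closest_known_value(value):
    # Embed the unseen category and every known value, pick the best cosine match.
    query = embedder.encode(value, convert_to_tensor=True)
    candidates = embedder.encode(known_values, convert_to_tensor=True)
    scores = util.cos_sim(query, candidates)[0]
    return known_values[int(scores.argmax())]

closest_known_value("United States")  # likely "US"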
For Unknown Columns¶
If you send columns the model doesn't know about:
result = predictor.predict({
    "age": 35,
    "income": 50000,
    "favorite_color": "blue"  # Model never saw this column
})
result.ignored_query_columns
# ["favorite_color"]
result.available_query_columns
# ["age", "income", "city", "plan_type", ...]
The model ignores unknown columns and tells you which ones.
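If you'd rather not send columns that will be ignored, you can filter a record yourself against available_query_columns from a previous result. A small convenience sketch, not a required step:
def filter_to_known_columns(record, result):
    # Keep only the keys the model reports as usable, based on a prior result.
    known = set(result.available_query_columns)
    return {k: v for k, v in record.items() if k in known}

clean_record = filter_to_known_columns(
    {"age": 35, "income": 50000, "favorite_color": "blue"}, result
)
# {"age": 35, "income": 50000}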
Probability Calibration¶
Raw neural network outputs are often overconfident or underconfident. Featrix calibrates probabilities so they mean what they say.
If the calibrated probability for a class is 0.80, approximately 80% of predictions with that probability are actually correct.
Three calibration methods (auto-selected during training):
| Method | Best For |
|---|---|
| Temperature | Models that are uniformly overconfident |
| Platt Scaling | Binary classification with sigmoid miscalibration |
| Isotonic | Complex non-linear calibration patterns |
The model card records which calibration method was used and its effectiveness.
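For intuition, here is a minimal sketch of temperature scaling, the simplest of the three methods: raw logits are divided by a temperature T (fit on held-out data) before the softmax, which softens overconfident outputs. This illustrates the technique only; it is not Featrix's implementation, and the temperature value here is made up.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    # T > 1 softens overconfident outputs; T < 1 sharpens underconfident ones.
    return softmax(np.asarray(logits, dtype=float) / T)

temperature_scale([4.0, 0.0], T=1.0)  # ~[0.982, 0.018] -- raw, overconfident
temperature_scale([4.0, 0.0], T=2.0)  # ~[0.881, 0.119] -- calibrated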
Confidence vs Probability¶
These are different:
- Probability: Raw softmax output for the predicted class (0.0-1.0)
- Confidence: How far from the decision boundary (0.0 = right at boundary, 1.0 = maximally certain)
# Example: threshold=0.5, probability=0.9
# confidence = (0.9 - 0.5) / (1.0 - 0.5) = 0.8
# Example: threshold=0.5, probability=0.55
# confidence = (0.55 - 0.5) / (1.0 - 0.5) = 0.1 (low confidence!)
A prediction can clear the threshold (0.55 > 0.5) yet still have low confidence, because it sits only 0.05 above the decision boundary.
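As a quick sanity check, this tiny helper reproduces the arithmetic above for probabilities at or above the threshold. It is illustrative only; the API already returns result.confidence.
def confidence_from_probability(probability, threshold=0.5):
    # Mirrors the worked examples above; the API computes this for you.
    return (probability - threshold) / (1.0 - threshold)

confidence_from_probability(0.90)  # 0.8
confidence_from_probability(0.55)  # 0.1 (low confidence)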
Interpreting Confidence Levels¶
| Confidence | What It Means | Recommended Action |
|---|---|---|
| 95%+ | Very high confidence | Trust the prediction |
| 80-95% | Confident | Usually correct, minor uncertainty |
| 60-80% | Moderate | Consider additional review |
| 40-60% | Uncertain | Likely needs human review |
| <40% | Low | Definitely needs review |
When to Worry (and When Not To)¶
Don't Worry About¶
- Missing columns: The model handles them gracefully with learned null embeddings
- Unknown categories with semantic similarity: "Senior Software Engineer" works even if only "Software Engineer" was in training
- Minor extrapolation (4-20σ): Predictions are usually fine, just slightly less reliable
Do Worry About¶
- Errors in guardrails: These indicate predictions are unreliable
- Many ignored columns: The model might be missing critical information
- All predictions same class: Check training metrics for embedding collapse
- All low confidence: Model may not have converged
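The last two items are aggregate symptoms, so they are easiest to spot across a batch of results rather than one prediction at a time. A minimal monitoring sketch, using the field names from the result object shown earlier; the threshold is an arbitrary example:
from collections import Counter

def batch_health(results, low_confidence=0.4):
    # Summarize a batch of prediction results to spot a single dominant
    # class or uniformly low confidence.
    classes = Counter(r.predicted_class for r in results)
    confidences = [r.confidence for r in results]
    return {
        "class_distribution": dict(classes),
        "mean_confidence": sum(confidences) / len(confidences),
        "share_low_confidence": sum(c < low_confidence for c in confidences) / len(confidences),
    }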
Using Safety Information in Production¶
Pattern 1: Reject Unreliable Predictions¶
def safe_predict(predictor, record):
    result = predictor.predict(record)

    # Check for errors in guardrails
    for column, warning in (result.guardrails or {}).items():
        if warning.startswith("Error:"):
            return {
                "prediction": None,
                "rejected": True,
                "reason": f"Column '{column}': {warning}"
            }

    return {
        "prediction": result.predicted_class,
        "confidence": result.confidence,
        "warnings": result.guardrails
    }
Pattern 2: Route by Confidence¶
def route_prediction(predictor, record):
    result = predictor.predict(record)

    if result.guardrails and any(w.startswith("Error:") for w in result.guardrails.values()):
        return "human_review"  # Unreliable prediction

    if result.confidence > 0.95:
        return "auto_approve"
    elif result.confidence > 0.70:
        return "standard_review"
    else:
        return "human_review"
Pattern 3: Log Everything for Analysis¶
import json
import logging

logger = logging.getLogger(__name__)

def predict_with_logging(predictor, record, request_id):
    result = predictor.predict(record)

    log_entry = {
        "request_id": request_id,
        "prediction_uuid": result.prediction_uuid,
        "predicted_class": result.predicted_class,
        "confidence": result.confidence,
        "guardrails": result.guardrails,
        "ignored_columns": result.ignored_query_columns
    }

    # Log for later analysis
    logger.info(json.dumps(log_entry))
    return result
The Model Card: Training Quality Warnings¶
The model card includes warnings from training:
model_card = predictor.get_model_card()

# Check training quality
if model_card.get("training_quality_warning"):
    print(f"Warning: {model_card['training_quality_warning']}")

# Check for known issues
for warning in model_card.get("warnings", []):
    print(f"Training issue: {warning}")
Training warnings might include:
- Class imbalance detected
- Embedding collapse during training
- Validation loss still decreasing (might benefit from more epochs)
- Per-class recall issues (one class has very low recall)
Summary: You Can't Get Silently Bad Results¶
Featrix predictions are transparent:
- Guardrails tell you about input data issues (per column)
- Confidence tells you how certain the model is
- Calibration ensures probabilities mean what they say
- Ignored columns tell you what the model couldn't use
- Model card tells you about training quality issues
If you check the guardrails and confidence, you always know when to trust a prediction and when to escalate to human review.
This is the difference between "the model said 0.85" and "the model is 85% confident, with no guardrail warnings, using a well-calibrated probability distribution from a training run with no quality issues."
The second one is actionable. The first is gambling.