Troubleshooting¶
This guide covers common issues, diagnostic procedures, and solutions for MLflow Secrets Auth, including configuration problems, authentication failures, and performance issues.
Quick Diagnostics¶
Check Plugin Status¶
Start troubleshooting with basic status checks:
# Check plugin installation and configuration
mlflow-secrets-auth info
# Run comprehensive diagnostics
mlflow-secrets-auth doctor
# Test against specific MLflow server
mlflow-secrets-auth doctor --dry-run https://your-mlflow-server.com
Enable Debug Logging¶
Enable detailed logging for troubleshooting:
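A minimal sketch of enabling debug output, assuming the MLFLOW_SECRETS_LOG_LEVEL variable used in the later examples of this guide is the one that controls plugin verbosity:

```shell
# Raise plugin log verbosity for the current shell session.
export MLFLOW_SECRETS_LOG_LEVEL="DEBUG"

# Confirm the value that child processes will inherit.
echo "MLFLOW_SECRETS_LOG_LEVEL=$MLFLOW_SECRETS_LOG_LEVEL"
```

With the variable exported, rerun mlflow-secrets-auth doctor to see the detailed output. For a single run, the variable can also be set inline, as in MLFLOW_SECRETS_LOG_LEVEL=DEBUG mlflow-secrets-auth doctor.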
Common Issues¶
Plugin Not Recognized¶
Symptoms¶
- MLflow doesn't use authentication
- No authentication headers in requests
- Plugin appears inactive
Diagnosis¶
# Check if plugin is installed
pip show mlflow-secrets-auth
# Verify entry point registration
python -c "
import pkg_resources
for ep in pkg_resources.iter_entry_points('mlflow.request_auth_provider'):
    print(f'{ep.name}: {ep.module_name}')
"
# Check if any provider is enabled
python -c "
from mlflow_secrets_auth.config import is_provider_enabled
providers = ['vault', 'aws-secrets-manager', 'azure-key-vault']
enabled = [p for p in providers if is_provider_enabled(p)]
print(f'Enabled providers: {enabled}')
"
Solutions¶
- Activate the Plugin (most common issue):
- Install the Plugin:
- Enable a Provider:
- Verify MLflow Version:
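The solutions above could look roughly like the following sketch. The provider names come from the diagnosis step and the package name from the pip check. The MLFLOW_TRACKING_AUTH value is an assumption and a hypothetical name: confirm it against the entry-point name printed by the registration check before relying on it.

```shell
# 1. Install (or reinstall) the plugin and confirm the MLflow version:
#      pip install mlflow-secrets-auth
#      pip show mlflow

# 2. Enable one provider, using a name from the diagnosis step.
export MLFLOW_SECRETS_AUTH_ENABLE="vault"   # or "aws-secrets-manager", "azure-key-vault"

# 3. Assumption: MLflow selects a request auth provider via MLFLOW_TRACKING_AUTH.
#    The value below is a hypothetical name; use the entry-point name
#    reported by the registration check.
export MLFLOW_TRACKING_AUTH="mlflow_secrets_auth"

echo "provider=$MLFLOW_SECRETS_AUTH_ENABLE auth=$MLFLOW_TRACKING_AUTH"
```

After exporting these, mlflow-secrets-auth info should show whether the plugin now considers a provider enabled.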
Authentication Failures¶
Symptoms¶
- "Authentication failed" errors
- 401/403 HTTP responses
- Secret retrieval failures
Provider-Specific Diagnosis¶
Vault:
# Test Vault connectivity
curl -k "$VAULT_ADDR/v1/sys/health"
# Check token validity
vault token lookup "$VAULT_TOKEN"
# Test secret access
vault kv get "$MLFLOW_VAULT_SECRET_PATH"
AWS:
# Check AWS credentials
aws sts get-caller-identity
# Test secret access
aws secretsmanager get-secret-value --secret-id "$MLFLOW_AWS_SECRET_NAME"
# Verify permissions
aws iam simulate-principal-policy \
--policy-source-arn "$(aws sts get-caller-identity --query Arn --output text)" \
--action-names "secretsmanager:GetSecretValue" \
--resource-arns "arn:aws:secretsmanager:region:account:secret:$MLFLOW_AWS_SECRET_NAME-*"
Azure:
# Check Azure login
az account show
# Test Key Vault access
az keyvault secret show \
--vault-name "vault-name" \
--name "$MLFLOW_AZURE_SECRET_NAME"
# Check access policies
az keyvault show \
--name "vault-name" \
--query "properties.accessPolicies"
Solutions¶
- Verify Credentials: Ensure authentication credentials are valid and not expired
- Check Permissions: Verify the service has appropriate permissions to read secrets
- Test Connectivity: Ensure network connectivity to the secret management service
- Validate Configuration: Check all required environment variables are set
Host Not Allowed¶
Symptoms¶
- "Host not allowed" errors
- Authentication works for some URLs but not others
- Wildcard patterns not matching expected hosts
Diagnosis¶
# Check current allowlist configuration
python -c "
from mlflow_secrets_auth.config import get_allowed_hosts
print(f'Allowed hosts: {get_allowed_hosts()}')
"
# Test host matching
python -c "
import fnmatch
from mlflow_secrets_auth.config import get_allowed_hosts
from urllib.parse import urlparse
url = 'https://your-server.com'
hostname = urlparse(url).hostname
allowed = get_allowed_hosts()
if allowed:
    matches = [pattern for pattern in allowed if fnmatch.fnmatch(hostname, pattern)]
    print(f'Host: {hostname}')
    print(f'Patterns: {allowed}')
    print(f'Matches: {matches}')
else:
    print('No host restrictions configured')
"
Solutions¶
- Add Host to Allowlist:
- Use Wildcard Patterns:
- Temporary Testing (not recommended for production):
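As an illustration of the first two solutions, the sketch below sets an allowlist with one exact host and one wildcard pattern, then checks hostnames against it. The hostnames are examples, and host_allowed is a hypothetical helper that approximates the plugin's glob matching with shell patterns. It is not part of the plugin.

```shell
# Exact host plus a wildcard for staging subdomains (example hostnames).
export MLFLOW_SECRETS_ALLOWED_HOSTS="mlflow.company.com,*.staging.company.com"

# Hypothetical helper: returns 0 if the host matches any pattern in the list.
# The body runs in a subshell so the IFS and set -f changes do not leak out.
host_allowed() (
  host="$1"
  IFS=','
  set -f  # disable filename expansion while splitting the list
  for pattern in $MLFLOW_SECRETS_ALLOWED_HOSTS; do
    case "$host" in
      $pattern) exit 0 ;;  # unquoted, so it is treated as a glob pattern
    esac
  done
  exit 1
)

host_allowed "ml.staging.company.com" && echo "ml.staging.company.com: allowed"
host_allowed "external.example.com" || echo "external.example.com: not allowed"
```

The diagnosis script above reports "No host restrictions configured" when the allowlist is empty, so leaving the variable unset is one plausible way to test temporarily. Treat that as an assumption and confirm it against the configuration reference.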
Secret Not Found¶
Symptoms¶
- "Secret not found" errors
- Path or name not found messages
- Empty secret responses
Diagnosis¶
# List available secrets
# Vault
vault kv list secret/
# AWS
aws secretsmanager list-secrets
# Azure
az keyvault secret list --vault-name "your-vault"
# Check secret content
# Vault
vault kv get secret/path/to/secret
# AWS
aws secretsmanager get-secret-value --secret-id "secret-name"
# Azure
az keyvault secret show --vault-name "vault" --name "secret-name"
Solutions¶
- Verify Secret Path/Name: Ensure the path or name exactly matches the stored secret
- Check Secret Format: Verify the secret contains expected fields (token, username, password)
- Validate Permissions: Ensure read access to the specific secret
- Test Different Versions: For AWS/Azure, try different secret versions
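The format check in the list above can be approximated locally once the secret payload has been retrieved. This sketch assumes the secret is stored as a JSON object. The sample payload and the has_field helper are hypothetical, and the grep match is a crude text check, not a JSON parser.

```shell
# Example payload; in practice this would come from the provider CLI output.
secret_json='{"username": "mlflow-user", "password": "example-only"}'

# Hypothetical helper: succeeds if the text contains "<field>": as a key.
has_field() {
  printf '%s' "$1" | grep -Eq "\"$2\"[[:space:]]*:"
}

if has_field "$secret_json" token; then
  echo "found token"
elif has_field "$secret_json" username && has_field "$secret_json" password; then
  echo "found username and password"
else
  echo "secret is missing expected fields"
fi
```

For the sample payload this prints "found username and password". A secret that matches neither branch is a likely cause of empty or rejected credentials.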
Cache Issues¶
Symptoms¶
- Stale authentication credentials
- Authentication works intermittently
- Changes to secrets not reflected
Diagnosis¶
# Check cache status (future feature)
mlflow-secrets-auth info # Shows cache hit rate
# Test with debug logging
MLFLOW_SECRETS_LOG_LEVEL=DEBUG mlflow-secrets-auth doctor
Solutions¶
- Clear Cache (restart application):
- Reduce TTL:
- Force Cache Bust: Authentication failures automatically clear cache
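A minimal sketch of the TTL reduction for the Vault provider, using the MLFLOW_VAULT_TTL_SEC variable that appears in the environment examples later in this guide. Equivalent variables for other providers are not shown here because this page does not name them.

```shell
# Shorten the Vault cache lifetime so rotated secrets are picked up sooner.
# 60 seconds is the value this guide uses for CI; tune it for your workload.
export MLFLOW_VAULT_TTL_SEC=60
echo "Vault cache TTL: ${MLFLOW_VAULT_TTL_SEC}s"
```

Restarting the application after changing the value both clears the existing cache and, presumably, makes the new TTL take effect.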
Network Connectivity¶
Symptoms¶
- Connection timeouts
- DNS resolution failures
- SSL/TLS errors
Diagnosis¶
# Test basic connectivity
# Vault
curl -v "$VAULT_ADDR/v1/sys/health"
# AWS (test any AWS service)
aws sts get-caller-identity
# Azure
curl -v "https://vault-name.vault.azure.net/"
# Check DNS resolution
nslookup vault.company.com
nslookup secretsmanager.us-east-1.amazonaws.com
nslookup vault-name.vault.azure.net
# Test SSL/TLS
openssl s_client -connect vault.company.com:443 -servername vault.company.com
Solutions¶
- Verify Network Access: Ensure outbound connectivity to required endpoints
- Check Firewall Rules: Verify firewall allows HTTPS traffic
- Validate DNS: Ensure DNS resolution works for the secret management service
- Certificate Issues: For Vault, consider VAULT_SKIP_VERIFY=true for testing (not production)
Configuration Errors¶
Symptoms¶
- Missing environment variable errors
- Invalid configuration warnings
- Unexpected default values
Diagnosis¶
# Check all environment variables
env | grep -E "(VAULT_|AWS_|AZURE_|MLFLOW_SECRETS_)"
# Validate configuration
mlflow-secrets-auth info
Solutions¶
- Set Required Variables: Ensure all required environment variables are set
- Check Variable Names: Verify correct spelling and format of environment variables
- Validate Values: Ensure values are in the expected format (URLs, regions, etc.)
Performance Issues¶
Slow Secret Retrieval¶
Symptoms¶
- Long delays in authentication
- Timeouts during secret fetching
- Poor application performance
Diagnosis¶
# Time secret retrieval
time mlflow-secrets-auth doctor
# Check network latency
ping vault.company.com
ping secretsmanager.us-east-1.amazonaws.com
Solutions¶
- Optimize TTL: Increase cache TTL for better performance:
- Use Regional Endpoints: Use the closest regional endpoint
- Network Optimization: Use VPC endpoints, private networking
- Connection Pooling: The plugin automatically reuses connections
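To make the TTL trade-off concrete, the sketch below raises the Vault TTL to the 15-minute value this guide uses for production and estimates how often a secret would be fetched. The estimate assumes a fetch happens only when a cached entry expires, and it ignores the cache busts triggered by authentication failures.

```shell
# Longer cache lifetime: fewer round trips to the secret manager.
export MLFLOW_VAULT_TTL_SEC=900  # 15 minutes

# Rough upper bound on fetches per hour for one process and one secret.
fetches_per_hour=$(( 3600 / MLFLOW_VAULT_TTL_SEC ))
echo "At most $fetches_per_hour fetches per hour"
```

With a TTL of 900 seconds the bound is 4 fetches per hour, compared with 60 per hour at a 60-second TTL. The cost is that rotated secrets can take up to the TTL to be picked up.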
High Memory Usage¶
Symptoms¶
- Excessive memory consumption
- Memory leaks over time
- Out of memory errors
Diagnosis¶
# Monitor memory usage
ps aux | grep python
top -p "$(pgrep -d, -f mlflow)"  # top expects a comma-separated PID list
# Check cache size (future feature)
# mlflow-secrets-auth cache status
Solutions¶
- Reduce Cache TTL: Lower TTL reduces memory usage:
- Monitor Cache: Restart applications periodically if needed
- Update Plugin: Ensure you are using the latest version with memory optimizations
Environment-Specific Issues¶
Development Environment¶
Common Issues¶
- Local secret managers not configured
- Development certificates/tokens
- Network access restrictions
Solutions¶
# Use local Vault for development
export VAULT_ADDR="http://localhost:8200"
export VAULT_TOKEN="dev-only-token"
export MLFLOW_VAULT_SECRET_PATH="secret/dev/mlflow"
# Limit the allowlist to local hosts for development
export MLFLOW_SECRETS_ALLOWED_HOSTS="localhost,127.0.0.1,*.local"
# Enable debug logging
export MLFLOW_SECRETS_LOG_LEVEL="DEBUG"
CI/CD Environment¶
Common Issues¶
- Missing environment variables
- Temporary credentials
- Build pipeline failures
Solutions¶
# Validate configuration in CI
if ! mlflow-secrets-auth doctor; then
    echo "MLflow Secrets Auth configuration failed"
    exit 1
fi
# Use short TTL in CI
export MLFLOW_VAULT_TTL_SEC=60
# Test against staging environment
mlflow-secrets-auth doctor --dry-run "$STAGING_MLFLOW_URL"
Production Environment¶
Common Issues¶
- Credential rotation
- High availability requirements
- Security restrictions
Solutions¶
# Use appropriate TTL for production
export MLFLOW_VAULT_TTL_SEC=900 # 15 minutes
# Enable appropriate logging level
export MLFLOW_SECRETS_LOG_LEVEL="WARNING"
# Use production-grade allowlisting
export MLFLOW_SECRETS_ALLOWED_HOSTS="mlflow.company.com"
# Monitor authentication failures
# (Set up alerting on authentication errors)
Provider-Specific Troubleshooting¶
Vault Troubleshooting¶
KV Version Issues¶
# Check KV engine version
vault secrets list -detailed | grep secret
# Test both KV v1 and v2 paths
vault kv get secret/mlflow/auth # KV v2
vault read secret/mlflow/auth # KV v1
AppRole Issues¶
# Test AppRole login
vault write auth/approle/login \
role_id="$VAULT_ROLE_ID" \
secret_id="$VAULT_SECRET_ID"
# Check role configuration
vault read auth/approle/role/mlflow-secrets-auth
AWS Troubleshooting¶
IAM Issues¶
# Check effective permissions
aws iam get-account-authorization-details
# Test assume role
aws sts assume-role \
--role-arn "$AWS_ROLE_ARN" \
--role-session-name "test-session"
Cross-Region Issues¶
# List secrets in different regions
aws secretsmanager list-secrets --region us-east-1
aws secretsmanager list-secrets --region us-west-2
Azure Troubleshooting¶
Managed Identity Issues¶
# Test managed identity from Azure VM
curl -H "Metadata:true" \
"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://vault.azure.net/"
RBAC vs Access Policy¶
# Check if RBAC is enabled
az keyvault show --name "vault-name" --query "properties.enableRbacAuthorization"
# List RBAC assignments
az role assignment list --assignee "$AZURE_CLIENT_ID"
Advanced Debugging¶
Packet Capture¶
For network-level debugging:
# Capture HTTPS traffic (requires root)
sudo tcpdump -i any -w mlflow-auth.pcap host vault.company.com
# Analyze with Wireshark or tshark
tshark -r mlflow-auth.pcap -Y "tls"  # use "ssl" on Wireshark versions before 3.0
Python Debugging¶
For code-level debugging:
import logging
import mlflow_secrets_auth
# Enable debug logging for all components
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('mlflow_secrets_auth')
logger.setLevel(logging.DEBUG)
# Test provider directly
from mlflow_secrets_auth import SecretsAuthProviderFactory
factory = SecretsAuthProviderFactory()
auth = factory.get_request_auth("https://mlflow.company.com")
print(f"Auth object: {auth}")
Logging Analysis¶
Understanding log output:
# Normal operation
INFO: Secret fetched successfully (provider=vault, cache_hit=false)
INFO: Authentication successful (provider=vault, auth_mode=bearer)
# Cache hits
DEBUG: Cache hit for key: vault:https://vault.company.com:secret/mlflow/auth
INFO: Secret fetched successfully (provider=vault, cache_hit=true)
# Errors
ERROR: Vault authentication failed: 403 Forbidden
WARNING: Host not allowed: external.example.com
ERROR: Secret not found at path: secret/wrong/path
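When reviewing a captured log, counting lines by level is a quick first pass. The sketch below assumes each line starts with its level, as in the samples above. Real output may carry timestamps or other prefixes, in which case the patterns need adjusting. The temporary file stands in for a real log file.

```shell
# Build a small sample log from the lines shown above.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
INFO: Secret fetched successfully (provider=vault, cache_hit=false)
ERROR: Vault authentication failed: 403 Forbidden
WARNING: Host not allowed: external.example.com
ERROR: Secret not found at path: secret/wrong/path
EOF

# Count lines per level; grep -c prints the number of matching lines.
error_count="$(grep -c '^ERROR:' "$log_file")"
warning_count="$(grep -c '^WARNING:' "$log_file")"
echo "errors=$error_count warnings=$warning_count"

rm -f "$log_file"
```

For the sample lines this prints errors=2 warnings=1. A rising error count after a configuration change points back to the authentication and secret path checks earlier in this guide.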
Getting Help¶
Before Asking for Help¶
- Check the Logs: Enable debug logging and review the output
- Run Diagnostics: Use mlflow-secrets-auth doctor
- Test Components: Test each component (network, auth, secrets) individually
- Review Configuration: Double-check all environment variables
- Check Documentation: Review provider-specific documentation
Information to Include¶
When reporting issues, include:
- Plugin Version: pip show mlflow-secrets-auth
- Python Version: python --version
- MLflow Version: pip show mlflow
- Environment: Development/staging/production
- Provider: Vault/AWS/Azure
- Configuration: Relevant environment variables (redacted)
- Error Messages: Full error messages and stack traces
- Debug Logs: Output from debug logging
- Diagnostic Output: Output from mlflow-secrets-auth doctor
Useful Commands for Bug Reports¶
# Collect system information
echo "Plugin version:"
pip show mlflow-secrets-auth
echo "Python version:"
python --version
echo "MLflow version:"
pip show mlflow
echo "Environment variables:"
env | grep -E "(VAULT_|AWS_|AZURE_|MLFLOW_SECRETS_)" | sed 's/=.*/=***/'
echo "Diagnostic output:"
MLFLOW_SECRETS_LOG_LEVEL=DEBUG mlflow-secrets-auth doctor
Prevention¶
Best Practices for Avoiding Issues¶
- Test in Staging: Always test configuration changes in staging first
- Monitor Logs: Set up log monitoring for authentication failures
- Health Checks: Implement health checks using the CLI
- Documentation: Document your configuration and troubleshooting steps
- Regular Updates: Keep the plugin and dependencies updated
Monitoring and Alerting¶
#!/bin/bash
# Health check script
if ! mlflow-secrets-auth doctor > /dev/null 2>&1; then
    echo "ALERT: MLflow Secrets Auth health check failed"
    exit 1
fi

#!/bin/bash
# Performance monitoring script
# Note: %N (nanoseconds) requires GNU date.
start_time=$(date +%s%N)
mlflow-secrets-auth doctor > /dev/null 2>&1
end_time=$(date +%s%N)
duration=$(( (end_time - start_time) / 1000000 )) # milliseconds
if [ "$duration" -gt 5000 ]; then # 5 seconds
    echo "ALERT: MLflow Secrets Auth response time: ${duration}ms"
fi
Configuration Validation¶
#!/bin/bash
# Configuration validation script
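required_vars=()
# Assumes a comma-separated provider list. Wrapping it in commas makes each
# name match exactly; a bare *vault* pattern would also match azure-key-vault,
# and a case statement would stop at the first matching provider.
enabled=",${MLFLOW_SECRETS_AUTH_ENABLE// /},"
if [[ "$enabled" == *,vault,* ]]; then
    required_vars+=("VAULT_ADDR" "MLFLOW_VAULT_SECRET_PATH")
fi
if [[ "$enabled" == *,aws-secrets-manager,* ]]; then
    required_vars+=("AWS_REGION" "MLFLOW_AWS_SECRET_NAME")
fi
if [[ "$enabled" == *,azure-key-vault,* ]]; then
    required_vars+=("MLFLOW_AZURE_KEY_VAULT_URL" "MLFLOW_AZURE_SECRET_NAME")
fi
for var in "${required_vars[@]}"; do
    if [ -z "${!var}" ]; then
        echo "ERROR: Required variable $var is not set"
        exit 1
    fi
done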
echo "Configuration validation passed"
Next Steps¶
- CLI Reference - Command-line tools for diagnostics
- Configuration Reference - Complete configuration options
- Provider Documentation - Provider-specific troubleshooting
- FAQ - Frequently asked questions and answers