Monitoring & Observability Guide
This guide covers metrics, logging, events, and alerting for Tenant Operator.
Getting Started
Accessing Metrics
Endpoint
Tenant Operator exposes Prometheus metrics at :8443/metrics over HTTPS.
Port-forward for local testing:
# Port-forward to metrics endpoint
kubectl port-forward -n tenant-operator-system \
deployment/tenant-operator-controller-manager 8443:8443
# Access metrics (requires valid TLS client or use --insecure)
curl -k https://localhost:8443/metrics
Check if metrics are enabled:
# Check if metrics port is exposed
kubectl get svc -n tenant-operator-system tenant-operator-controller-manager-metrics-service
# Check if ServiceMonitor is deployed (requires prometheus-operator)
kubectl get servicemonitor -n tenant-operator-system
Enabling ServiceMonitor
If using Prometheus Operator, enable ServiceMonitor by uncommenting in config/default/kustomization.yaml:
# Line 27: Uncomment this
- ../prometheus
Then redeploy:
kubectl apply -k config/default
Verify scrape job
After redeploying, confirm that a ServiceMonitor named tenant-operator-controller-manager-metrics-monitor appears and that Prometheus discovers the target.
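A quick way to confirm both, assuming Prometheus was installed in the monitoring namespace and exposes its UI through the prometheus-operated Service (adjust the names to your installation):
# Confirm the ServiceMonitor exists
kubectl get servicemonitor -n tenant-operator-system tenant-operator-controller-manager-metrics-monitor
# Port-forward the Prometheus UI and check Status -> Targets for a tenant-operator entry
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090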
Metrics
Tenant Operator exposes the following custom Prometheus metrics at :8443/metrics.
Controller Metrics
tenant_reconcile_duration_seconds
Histogram of tenant reconciliation duration.
Labels:
- result: success or error
Queries:
# 95th percentile reconciliation time
histogram_quantile(0.95, rate(tenant_reconcile_duration_seconds_bucket[5m]))
# Reconciliation rate
rate(tenant_reconcile_duration_seconds_count[5m])
# Error rate
rate(tenant_reconcile_duration_seconds_count{result="error"}[5m])
Alerts:
- alert: SlowTenantReconciliation
expr: histogram_quantile(0.95, rate(tenant_reconcile_duration_seconds_bucket[5m])) > 30
for: 5m
annotations:
summary: Tenant reconciliation taking > 30s
- alert: TenantReconciliationErrors
expr: rate(tenant_reconcile_duration_seconds_count{result="error"}[5m]) > 0.1
annotations:
summary: High tenant reconciliation error rate
Resource Metrics
tenant_resources_desired
Gauge of desired resources for a tenant.
Labels:
- tenant: Tenant name
- namespace: Tenant namespace
Queries:
# Total desired resources
sum(tenant_resources_desired)
# Per tenant
tenant_resources_desired{tenant="acme-prod-template"}
tenant_resources_ready
Gauge of ready resources for a tenant.
Labels:
- tenant: Tenant name
- namespace: Tenant namespace
Queries:
# Total ready resources
sum(tenant_resources_ready)
# Readiness percentage
sum(tenant_resources_ready) / sum(tenant_resources_desired) * 100
tenant_resources_failed
Gauge of failed resources for a tenant.
Labels:
- tenant: Tenant name
- namespace: Tenant namespace
Alerts:
- alert: TenantResourcesFailed
expr: tenant_resources_failed > 0
for: 5m
annotations:
summary: Tenant {{ $labels.tenant }} has {{ $value }} failed resources
Registry Metrics
registry_desired
Gauge of desired tenant CRs for a registry.
Labels:
- registry: Registry name
- namespace: Registry namespace
Queries:
# Total desired tenants across all registries
sum(registry_desired)
# Per registry
registry_desired{registry="my-saas-registry"}
registry_ready
Gauge of ready tenant CRs for a registry.
Queries:
# Registry health percentage
sum(registry_ready) / sum(registry_desired) * 100
registry_failed
Gauge of failed tenant CRs for a registry.
Alerts:
- alert: RegistryUnhealthy
expr: registry_failed > 0
for: 10m
annotations:
summary: Registry {{ $labels.registry }} has {{ $value }} failed tenants
Apply Metrics
apply_attempts_total
Counter of resource apply attempts.
Labels:
- kind: Resource kind (Deployment, Service, etc.)
- result: success or error
- conflict_policy: Stuck or Force
Queries:
# Apply success rate
sum(rate(apply_attempts_total{result="success"}[5m])) / sum(rate(apply_attempts_total[5m]))
# Applies per kind
sum(rate(apply_attempts_total[5m])) by (kind)
# Conflict policy usage
sum(rate(apply_attempts_total[5m])) by (conflict_policy)
Alerts:
- alert: HighApplyFailureRate
expr: sum(rate(apply_attempts_total{result="error"}[5m])) / sum(rate(apply_attempts_total[5m])) > 0.1
annotations:
summary: "> 10% of apply attempts failing"
Conflict and Failure Metrics
tenant_condition_status
Gauge tracking the status of tenant conditions.
Labels:
- tenant: Tenant name
- namespace: Tenant namespace
- type: Condition type (e.g., Ready, Degraded)
Values:
- 0: False
- 1: True
- 2: Unknown
Queries:
# Check if tenants are ready
tenant_condition_status{type="Ready"} == 1
# Count tenants not ready
count(tenant_condition_status{type="Ready"} != 1)
# List degraded tenants
tenant_condition_status{type="Degraded"} == 1
Alerts:
- alert: TenantNotReady
expr: tenant_condition_status{type="Ready"} != 1
for: 10m
annotations:
summary: Tenant {{ $labels.tenant }} is not ready
tenant_conflicts_total
Counter tracking the total number of resource conflicts encountered.
Labels:
- tenant: Tenant name
- namespace: Tenant namespace
- resource_kind: Kind of resource in conflict (Deployment, Service, etc.)
- conflict_policy: Applied policy (Stuck or Force)
Queries:
# Total conflicts
sum(tenant_conflicts_total)
# Conflicts per tenant
sum(rate(tenant_conflicts_total[5m])) by (tenant)
# Conflicts by resource kind
sum(rate(tenant_conflicts_total[5m])) by (resource_kind)
# Conflicts by policy
sum(rate(tenant_conflicts_total[5m])) by (conflict_policy)
Alerts:
- alert: HighConflictRate
expr: rate(tenant_conflicts_total[5m]) > 0.1
for: 10m
annotations:
summary: High conflict rate for tenant {{ $labels.tenant }}
- alert: NewConflictsDetected
expr: increase(tenant_conflicts_total[5m]) > 0
for: 1m
annotations:
summary: New conflicts detected for tenant {{ $labels.tenant }}
tenant_resources_conflicted
Gauge tracking the current number of resources in conflict state.
Labels:
- tenant: Tenant name
- namespace: Tenant namespace
Queries:
# Total resources in conflict
sum(tenant_resources_conflicted)
# Tenants with conflicts
tenant_resources_conflicted > 0
# Conflict percentage
sum(tenant_resources_conflicted) / sum(tenant_resources_desired) * 100
Alerts:
- alert: TenantResourcesConflicted
expr: tenant_resources_conflicted > 0
for: 10m
annotations:
summary: Tenant {{ $labels.tenant }} has {{ $value }} resources in conflict
tenant_degraded_status
Gauge indicating if a tenant is in degraded state.
Labels:
- tenant: Tenant name
- namespace: Tenant namespace
- reason: Reason for degradation (TemplateRenderError, ConflictDetected, DependencyCycle, etc.)
Values:
- 0: Not degraded
- 1: Degraded
Queries:
# Count degraded tenants
count(tenant_degraded_status == 1)
# List degraded tenants with reasons
tenant_degraded_status{reason!=""} == 1
# Degraded tenants by reason
sum(tenant_degraded_status) by (reason)
Alerts:
- alert: TenantDegraded
expr: tenant_degraded_status > 0
for: 5m
annotations:
summary: Tenant {{ $labels.tenant }} is degraded
description: "Reason: {{ $labels.reason }}"
Controller-Runtime Metrics
Standard controller-runtime metrics:
# Work queue depth
workqueue_depth{name="tenant"}
# Work queue add rate
rate(workqueue_adds_total{name="tenant"}[5m])
# Work queue latency
histogram_quantile(0.95, rate(workqueue_queue_duration_seconds_bucket{name="tenant"}[5m]))
Metrics Collection
Prometheus ServiceMonitor
To enable ServiceMonitor, uncomment the prometheus section in config/default/kustomization.yaml:
# Uncomment this line:
#- ../prometheus
The ServiceMonitor configuration (already available in config/prometheus/monitor.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
control-plane: controller-manager
app.kubernetes.io/name: tenant-operator
app.kubernetes.io/managed-by: kustomize
name: controller-manager-metrics-monitor
namespace: tenant-operator-system
spec:
endpoints:
- path: /metrics
port: https
scheme: https
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
tlsConfig:
insecureSkipVerify: true
selector:
matchLabels:
control-plane: controller-manager
app.kubernetes.io/name: tenant-operator
Note: For production, use cert-manager for metrics TLS by enabling the cert patch in config/default/kustomization.yaml.
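With cert-manager issuing the serving certificate, the ServiceMonitor can verify it instead of skipping verification. A minimal sketch of the endpoint tlsConfig, assuming the certificate (including ca.crt) is stored in a Secret named metrics-server-cert in tenant-operator-system; the Secret name is an assumption, so match it to your cert patch:
endpoints:
- path: /metrics
  port: https
  scheme: https
  bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
  tlsConfig:
    # Verify the metrics serving certificate instead of insecureSkipVerify: true
    serverName: tenant-operator-controller-manager-metrics-service.tenant-operator-system.svc
    ca:
      secret:
        name: metrics-server-cert  # assumed cert-manager Secret
        key: ca.crt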
Manual Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: 'tenant-operator'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- tenant-operator-system
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_control_plane]
action: keep
regex: controller-manager
- source_labels: [__meta_kubernetes_pod_container_port_name]
action: keep
regex: https
Logging
Log Levels
Configure via --zap-log-level:
args:
- --zap-log-level=info  # Options: debug, info, error
Levels:
- debug: Verbose logging (template values, API calls)
- info: Standard logging (reconciliation events)
- error: Errors only
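To change the level on a running installation, edit the manager container args and let the Deployment roll; a minimal sketch (the flag is the same --zap-log-level shown above, but where it sits in your manifests may differ):
# Open the manager Deployment and change --zap-log-level under the container args
kubectl edit deployment -n tenant-operator-system tenant-operator-controller-manager
# Watch the rollout pick up the new level
kubectl rollout status deployment -n tenant-operator-system tenant-operator-controller-manager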
Structured Logging
All logs are structured JSON:
{
"level": "info",
"ts": "2025-01-15T10:30:00.000Z",
"msg": "Reconciliation completed",
"tenant": "acme-prod-template",
"ready": 10,
"failed": 0,
"changed": 2
}
Key Log Messages
Reconciliation Events
"msg": "Reconciliation completed"
"msg": "Reconciliation completed with changes"
"msg": "Failed to reconcile tenant"Resource Events
"msg": "Failed to render resource"
"msg": "Failed to apply resource"
"msg": "Resource not ready within timeout"Registry Events
"msg": "Deleting Tenant (no longer in desired set)"
"msg": "Successfully deleted Tenant"Querying Logs
# All logs
kubectl logs -n tenant-operator-system deployment/tenant-operator-controller-manager
# Follow logs
kubectl logs -n tenant-operator-system deployment/tenant-operator-controller-manager -f
# Errors only
kubectl logs -n tenant-operator-system deployment/tenant-operator-controller-manager | grep '"level":"error"'
# Specific tenant
kubectl logs -n tenant-operator-system deployment/tenant-operator-controller-manager | grep 'acme-prod'
# Reconciliation events
kubectl logs -n tenant-operator-system deployment/tenant-operator-controller-manager | grep "Reconciliation completed"Events
Kubernetes events are emitted for key operations.
Viewing Events
# All Tenant events
kubectl get events --all-namespaces --field-selector involvedObject.kind=Tenant
# Specific Tenant
kubectl describe tenant <name>
# Recent events
kubectl get events --sort-by='.lastTimestamp'
Event Types
Normal Events
- TemplateApplied: Template successfully applied
- TemplateAppliedComplete: All resources applied
- TenantDeleting: Tenant deletion started
- TenantDeleted: Tenant deletion completed
Warning Events
- TemplateRenderError: Template rendering failed
- ApplyFailed: Resource apply failed
- ResourceConflict: Ownership conflict detected
- ReadinessTimeout: Resource not ready within timeout
- DependencyError: Dependency cycle detected
- TenantDeletionFailed: Tenant deletion failed
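Warning events are usually the ones worth watching continuously. Standard kubectl field selectors can narrow the stream to just those for Tenant objects:
# Only Warning events for Tenant objects, most recent last
kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Tenant,type=Warning \
  --sort-by='.lastTimestamp'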
Event Examples
# Success
TemplateAppliedComplete: Applied 10 resources (10 ready, 0 failed, 2 changed)
# Conflict
ResourceConflict: Resource conflict detected for default/acme-app (Kind: Deployment, Policy: Stuck).
Another controller or user may be managing this resource.
# Deletion
TenantDeleting: Deleting Tenant 'acme-prod-template' (template: prod-template, uid: acme) -
no longer in active dataset. This could be due to: row deletion, activate=false, or template change.
Dashboards
Grafana Dashboard
A comprehensive Grafana dashboard is available at: config/monitoring/grafana-dashboard.json
How to import:
- Open Grafana UI
- Go to Dashboards → Import
- Upload config/monitoring/grafana-dashboard.json
- Select your Prometheus datasource
Dashboard includes 10 panels:
- Reconciliation Duration (Percentiles) - P50, P95, P99 latency
- Reconciliation Rate - Success vs Error rate
- Error Rate - Gauge showing current error percentage
- Total Desired Tenants - Sum across all registries
- Total Ready Tenants - Healthy tenant count
- Total Failed Tenants - Failed tenant count
- Resource Counts by Tenant - Stacked area chart per tenant
- Registry Health - Table showing health percentage per registry
- Apply Rate by Kind - Apply attempts by resource type
- Work Queue Depth - Controller queue depths
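If Grafana runs with the dashboard sidecar (for example via kube-prometheus-stack), the dashboard can also be shipped as a labelled ConfigMap instead of imported by hand; a minimal sketch, assuming the sidecar watches the grafana_dashboard label and Grafana lives in the monitoring namespace:
# Package the bundled dashboard JSON as a ConfigMap the sidecar will pick up
kubectl create configmap tenant-operator-dashboard \
  -n monitoring \
  --from-file=config/monitoring/grafana-dashboard.json
kubectl label configmap tenant-operator-dashboard -n monitoring grafana_dashboard="1"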
Sample Queries
Reconciliation Performance:
# P50, P95, P99 latency
histogram_quantile(0.50, rate(tenant_reconcile_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(tenant_reconcile_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(tenant_reconcile_duration_seconds_bucket[5m]))
Resource Health:
# % of resources ready
sum(tenant_resources_ready) / sum(tenant_resources_desired) * 100
Top Failing Tenants:
# Tenants with most failed resources
topk(10, tenant_resources_failed)
Alerting
Prometheus Alert Rules
A comprehensive set of Prometheus alert rules is available at config/prometheus/alerts.yaml.
To deploy the alerts:
# Apply the PrometheusRule resource
kubectl apply -f config/prometheus/alerts.yaml
# Or use kustomize
kubectl apply -k config/prometheus
Alert Categories:
Critical Alerts
- TenantResourcesFailed - Tenant has failed resources
- TenantDegraded - Tenant is in degraded state
- TenantNotReady - Tenant not ready for extended period
- RegistryManyTenantsFailure - Many tenants failing in a registry
Warning Alerts
- TenantResourcesConflicted - Resources in conflict state
- TenantHighConflictRate - High rate of conflicts
- TenantResourcesMismatch - Ready count doesn't match desired
- RegistryTenantsFailure - Some tenants failing
- TenantReconciliationErrors - High error rate
- TenantReconciliationSlow - Slow reconciliation performance
Info Alerts
- TenantNewConflictsDetected - New conflicts detected
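For reference, the info-tier rule mirrors the NewConflictsDetected query shown earlier; a sketch of how it might look (the exact wording in config/prometheus/alerts.yaml may differ):
# New conflicts detected (info)
- alert: TenantNewConflictsDetected
  expr: increase(tenant_conflicts_total[5m]) > 0
  for: 1m
  labels:
    severity: info
  annotations:
    summary: "New conflicts detected for tenant {{ $labels.tenant }}"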
Sample Alert Rules
Critical Alerts
# Tenant has failed resources
- alert: TenantResourcesFailed
expr: tenant_resources_failed > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Tenant {{ $labels.tenant }} has failed resources"
# Tenant is degraded
- alert: TenantDegraded
expr: tenant_degraded_status > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Tenant {{ $labels.tenant }} is in degraded state"
description: "Reason: {{ $labels.reason }}"
# Registry has many failed tenants
- alert: RegistryManyTenantsFailure
expr: registry_failed > 5 or (registry_failed / registry_desired > 0.5 and registry_desired > 0)
for: 5m
labels:
severity: critical
annotations:
summary: "Registry {{ $labels.registry }} has many failed tenants"Warning Alerts
# Resources in conflict
- alert: TenantResourcesConflicted
expr: tenant_resources_conflicted > 0
for: 10m
labels:
severity: warning
annotations:
summary: "Tenant {{ $labels.tenant }} has resources in conflict"
# High conflict rate
- alert: TenantHighConflictRate
expr: rate(tenant_conflicts_total[5m]) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "High conflict rate for tenant {{ $labels.tenant }}"Performance Alerts
# Slow reconciliation
- alert: TenantReconciliationSlow
expr: histogram_quantile(0.95, rate(tenant_reconcile_duration_seconds_bucket[5m])) > 30
for: 10m
labels:
severity: warning
annotations:
summary: "Slow tenant reconciliation"
# High error rate
- alert: TenantReconciliationErrors
expr: rate(tenant_reconcile_duration_seconds_count{result="error"}[5m]) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "High tenant reconciliation error rate"Alert Routing (AlertManager)
Configure AlertManager to route alerts based on severity:
# alertmanager.yml
route:
group_by: ['alertname', 'tenant', 'namespace']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
# Critical alerts to PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
continue: true
# Warning alerts to Slack
- match:
severity: warning
receiver: 'slack'
continue: true
# Info alerts to email
- match:
severity: info
receiver: 'email'
receivers:
- name: 'default'
webhook_configs:
- url: 'http://example.com/webhook'
- name: 'pagerduty'
pagerduty_configs:
- service_key: '<pagerduty-key>'
- name: 'slack'
slack_configs:
- api_url: '<slack-webhook>'
channel: '#tenant-operator-alerts'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
- name: 'email'
email_configs:
- to: 'team@example.com'
from: 'alertmanager@example.com'
Tracing
Distributed Tracing (Future)
Planned for v1.2:
- OpenTelemetry integration
- Trace reconciliation across controllers
- Span for each resource apply
- Database query tracing
Best Practices
1. Monitor Key Metrics
Essential metrics to track:
- Reconciliation duration (P95)
- Error rate
- Resource ready/failed counts
- Registry desired vs ready
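These can be precomputed as recording rules so dashboards and alerts query cheap, stable series; a minimal sketch as a PrometheusRule, assuming the prometheus-operator CRDs are installed (the rule names are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tenant-operator-recording-rules
  namespace: tenant-operator-system
spec:
  groups:
  - name: tenant-operator.recording
    rules:
    - record: tenant_operator:reconcile_duration_seconds:p95
      expr: histogram_quantile(0.95, rate(tenant_reconcile_duration_seconds_bucket[5m]))
    - record: tenant_operator:reconcile_error_ratio
      expr: sum(rate(tenant_reconcile_duration_seconds_count{result="error"}[5m])) / sum(rate(tenant_reconcile_duration_seconds_count[5m]))
    - record: tenant_operator:resources_ready_ratio
      expr: sum(tenant_resources_ready) / sum(tenant_resources_desired)
    - record: tenant_operator:registry_ready_ratio
      expr: sum(registry_ready) / sum(registry_desired)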
2. Set Up Alerts
Minimum recommended alerts:
- Operator down
- High error rate (> 10%)
- Slow reconciliation (P95 > 30s)
- Resources failed (> 0 for 5min)
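The operator-down case has to come from the scrape itself rather than from the operator's own metrics; a hedged sketch, assuming the Prometheus job label matches the metrics Service name shown earlier (check your targets page for the actual value):
# Operator metrics target down or missing
- alert: TenantOperatorDown
  expr: up{job="tenant-operator-controller-manager-metrics-service"} == 0 or absent(up{job="tenant-operator-controller-manager-metrics-service"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Tenant Operator metrics endpoint is down or not being scraped"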
3. Retain Logs
Recommended log retention:
- Debug logs: 1-3 days
- Info logs: 7-14 days
- Error logs: 30+ days
4. Dashboard Review
Weekly review:
- Reconciliation performance trends
- Error patterns
- Resource health
- Capacity planning
5. Event Monitoring
Monitor events for:
- Conflicts (investigate ownership)
- Timeouts (adjust readiness settings)
- Template errors (fix templates)
Troubleshooting Metrics
Metrics Not Available
Problem: curl https://localhost:8443/metrics returns connection refused.
Solution:
Check if metrics port is configured:
kubectl get deployment -n tenant-operator-system tenant-operator-controller-manager -o yaml | grep metrics-bind-address
Should see:
--metrics-bind-address=:8443
Check if port is exposed:
kubectl get deployment -n tenant-operator-system tenant-operator-controller-manager -o yaml | grep -A 5 "ports:"
Should see containerPort 8443.
Check if service exists:
kubectl get svc -n tenant-operator-system tenant-operator-controller-manager-metrics-service
Check operator logs:
kubectl logs -n tenant-operator-system deployment/tenant-operator-controller-manager | grep metrics
No Metrics Data
Problem: Metrics endpoint works but returns no custom metrics.
Solution:
Verify metrics are registered:
curl -k https://localhost:8443/metrics | grep tenant_
Should see:
tenant_reconcile_duration_seconds, tenant_resources_ready, etc.
Trigger reconciliation:
# Apply a test resource
kubectl apply -f config/samples/operator_v1_tenantregistry.yaml
# Wait 30s and check metrics again
curl -k https://localhost:8443/metrics | grep tenant_reconcile_duration_seconds_count
Check if controllers are running:
kubectl logs -n tenant-operator-system deployment/tenant-operator-controller-manager | grep "Starting Controller"
ServiceMonitor Not Working
Problem: Prometheus not scraping metrics.
Solution:
Check if Prometheus Operator is installed:
kubectl get crd servicemonitors.monitoring.coreos.com
Check if ServiceMonitor is created:
kubectl get servicemonitor -n tenant-operator-system
Check ServiceMonitor labels match Prometheus selector:
kubectl get servicemonitor -n tenant-operator-system tenant-operator-controller-manager-metrics-monitor -o yaml
Check Prometheus logs:
kubectl logs -n monitoring prometheus-xyz
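The label-matching step above only helps if you know what the Prometheus instance selects; to see its ServiceMonitor selectors directly (assuming Prometheus is managed by the operator in the monitoring namespace):
# Show which ServiceMonitor labels and namespaces the Prometheus CR selects
kubectl get prometheus -n monitoring -o jsonpath='{.items[*].spec.serviceMonitorSelector}'
kubectl get prometheus -n monitoring -o jsonpath='{.items[*].spec.serviceMonitorNamespaceSelector}'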
TLS Certificate Errors
Problem: x509: certificate signed by unknown authority
Solution:
For development, use --insecure or -k:
curl -k https://localhost:8443/metrics
For production, use cert-manager by enabling the cert patch in config/default/kustomization.yaml:
# Uncomment this line:
#- path: cert_metrics_manager_patch.yaml
See Also
- Performance Guide - Performance tuning
- Troubleshooting Guide - Common issues
