Deploying Jaeger on Kubernetes
Jaeger Architecture Components
Component Overview
┌─────────────────────────────────────────────┐
│  Application Pods                           │
│  ┌──────────────────────────────────────┐   │
│  │ App Container + Jaeger Client SDK    │   │
│  └────────────┬─────────────────────────┘   │
└───────────────┼─────────────────────────────┘
                │ UDP (6831)
                ▼
┌─────────────────────────────────────────────┐
│  Jaeger Agent (DaemonSet)                   │
│  - Receives spans over UDP                  │
│  - Sends batches to the Collector           │
│  - Reduces load on the application          │
└───────────────┬─────────────────────────────┘
                │ gRPC/HTTP
                ▼
┌─────────────────────────────────────────────┐
│  Jaeger Collector (Deployment)              │
│  - Validates and processes spans            │
│  - Writes to storage in batches             │
│  - Scales horizontally                      │
└───────────────┬─────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────┐
│  Storage Backend                            │
│  - Elasticsearch (recommended for prod)     │
│  - Cassandra                                │
│  - Memory (development and testing)         │
│  - Kafka (buffering layer)                  │
└───────────────┬─────────────────────────────┘
                │
                ▼
┌─────────────────────────────────────────────┐
│  Jaeger Query + UI (Deployment)             │
│  - Query API                                │
│  - Web UI                                   │
│  - Service dependency analysis              │
└─────────────────────────────────────────────┘
Deployment Options Compared
| Method | Best for | Complexity | Control |
|---|---|---|---|
| All-in-One | Development, testing | ⭐ | ⭐⭐ |
| Operator | Production | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Helm | Quick deployment | ⭐⭐ | ⭐⭐⭐ |
| Manual YAML | Full customization | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
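Option 1: All-in-One (Development)
Characteristics
- A single container runs every component
- In-memory storage (data is lost on restart)
- Starts quickly, which suits local development
Deployment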
# jaeger-all-in-one.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: observability
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
  labels:
    app: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.51
          ports:
            # Agent receive ports
            - containerPort: 6831
              protocol: UDP
              name: jaeger-thrift
            - containerPort: 6832
              protocol: UDP
              name: jaeger-binary
            # Collector receive ports
            # (container port names are limited to 15 characters)
            - containerPort: 14268
              name: collector-http
            - containerPort: 14250
              name: jaeger-grpc
            # UI port
            - containerPort: 16686
              name: jaeger-query
            # Health check (admin port)
            - containerPort: 14269
              name: admin-http
          env:
            # Enables the Zipkin-compatible endpoint; this variable replaced
            # the older COLLECTOR_ZIPKIN_HTTP_PORT
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /
              port: 14269
            initialDelaySeconds: 5
          readinessProbe:
            httpGet:
              path: /
              port: 14269
            initialDelaySeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
  labels:
    app: jaeger
spec:
  type: ClusterIP
  ports:
    # Agent ports
    - port: 6831
      targetPort: 6831
      protocol: UDP
      name: jaeger-thrift
    - port: 6832
      targetPort: 6832
      protocol: UDP
      name: jaeger-binary
    # Collector ports
    - port: 14268
      targetPort: 14268
      name: jaeger-collector-http
    - port: 14250
      targetPort: 14250
      name: jaeger-grpc
    # Query port
    - port: 16686
      targetPort: 16686
      name: jaeger-query
    - port: 9411
      targetPort: 9411
      name: zipkin
  selector:
    app: jaeger
---
# Ingress (optional)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jaeger-ui
  namespace: observability
spec:
  # Replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  rules:
    - host: jaeger.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: jaeger
                port:
                  number: 16686
Deploy and access:
# Deploy
kubectl apply -f jaeger-all-in-one.yaml
# Verify
kubectl -n observability get pods
kubectl -n observability get svc
# Port-forward to reach the UI
kubectl -n observability port-forward svc/jaeger 16686:16686
# Open in a browser (macOS; use xdg-open on Linux)
open http://localhost:16686
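As a quick smoke test for the all-in-one deployment, the sketch below builds one Zipkin v2 JSON span and posts it to the Zipkin-compatible endpoint that the Service exposes on port 9411. It assumes that endpoint is enabled and that the port has also been forwarded (`kubectl -n observability port-forward svc/jaeger 9411:9411`). The service name `smoke-test` and both helper names are illustrative, not part of Jaeger.

```python
import json
import secrets
import time
import urllib.request


def build_zipkin_span(service: str, name: str, duration_us: int = 1500) -> dict:
    """Return a minimal Zipkin v2 span with random trace and span IDs."""
    return {
        "traceId": secrets.token_hex(16),            # 128-bit ID, 32 hex chars
        "id": secrets.token_hex(8),                  # 64-bit ID, 16 hex chars
        "name": name,
        "timestamp": int(time.time() * 1_000_000),   # epoch microseconds
        "duration": duration_us,                     # microseconds
        "localEndpoint": {"serviceName": service},
    }


def send_spans(spans: list, base_url: str = "http://localhost:9411") -> int:
    """POST a list of spans to /api/v2/spans and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/spans",
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Usage (requires the port-forward above):
#   send_spans([build_zipkin_span("smoke-test", "hello")])
```

After a successful call, `smoke-test` should appear in the service drop-down of the UI.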
Option 2: Jaeger Operator (Recommended for Production)
Installing the Operator
1. Install cert-manager (prerequisite)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Wait for cert-manager to become ready
kubectl -n cert-manager wait --for=condition=ready pod -l app.kubernetes.io/instance=cert-manager --timeout=2m
2. Install the Jaeger Operator
# Create the namespace
kubectl create namespace observability
# Install the Operator
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability
# Verify
kubectl -n observability get pods -l name=jaeger-operator
Production Deployment (Elasticsearch Backend)
1. Deploy Elasticsearch
# elasticsearch.yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: observability
spec:
  ports:
    - port: 9200
      targetPort: 9200
  selector:
    app: elasticsearch
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: observability
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
        # Elasticsearch needs a higher mmap count than the kernel default
        - name: increase-vm-max-map
          image: busybox
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9
          ports:
            - containerPort: 9200
              name: http
            - containerPort: 9300
              name: transport
          env:
            - name: cluster.name
              value: jaeger-cluster
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: ES_JAVA_OPTS
              value: "-Xms1g -Xmx1g"
          resources:
            requests:
              cpu: 500m
              memory: 2Gi
            limits:
              cpu: 1000m
              memory: 3Gi
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
Deploy:
kubectl apply -f elasticsearch.yaml
# Wait for Elasticsearch to become ready
kubectl -n observability wait --for=condition=ready pod -l app=elasticsearch --timeout=5m
# Verify (the URL is quoted so the shell does not interpret "?")
kubectl -n observability exec -it elasticsearch-0 -- curl "http://localhost:9200/_cluster/health?pretty"
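The health response is JSON, so the check is easy to automate. The following is a minimal sketch that reads a `_cluster/health` reply and lists anything that looks wrong for the three-node cluster defined above. The function name and the `expected_nodes` default are assumptions made for this example; `status`, `number_of_nodes` and `unassigned_shards` are fields of the Elasticsearch response.

```python
import json


def summarize_health(raw: str, expected_nodes: int = 3) -> list:
    """Return human-readable problems found in a _cluster/health reply."""
    health = json.loads(raw)
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"status is {status}, expected green")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        problems.append(f"only {nodes} of {expected_nodes} nodes joined")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} shards are unassigned")
    return problems


sample = '{"status": "yellow", "number_of_nodes": 2, "unassigned_shards": 4}'
for problem in summarize_health(sample):
    print(problem)
```

An empty list means the cluster is green with all expected nodes present and every shard assigned.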
2. Deploy Jaeger (production configuration)
# jaeger-production.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
  namespace: observability
spec:
  # Production strategy
  strategy: production
  # Storage configuration
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
        index-prefix: jaeger
        num-shards: 5
        num-replicas: 1
    # Automatically clean up old data
    esIndexCleaner:
      enabled: true
      numberOfDays: 7
      schedule: "55 23 * * *"
  # Collector configuration
  collector:
    replicas: 3
    maxReplicas: 5
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
    autoscale: true
    options:
      collector:
        num-workers: 100
        queue-size: 2000
  # Query configuration
  query:
    replicas: 2
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
    options:
      query:
        base-path: /
  # Agent configuration (DaemonSet)
  agent:
    strategy: DaemonSet
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
  # Ingress configuration
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
      - jaeger.example.com
    tls:
      - secretName: jaeger-tls
        hosts:
          - jaeger.example.com
  # UI configuration
  ui:
    options:
      dependencies:
        menuEnabled: true
      tracking:
        gaID: UA-000000-1 # Google Analytics (optional)
      menu:
        - label: "Docs"
          items:
            - label: "Documentation"
              url: "https://www.jaegertracing.io/docs/"
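The `num-shards`, `num-replicas` and `numberOfDays` settings together determine how many shards the Elasticsearch cluster has to hold. The sketch below does that arithmetic. It assumes Jaeger's default daily rollover with two indices per day (one for spans, one for services); the function name is made up for this example, and the result is an estimate rather than an exact count, since the cleaner runs once a day.

```python
def open_shards(retention_days: int, num_shards: int, num_replicas: int,
                indices_per_day: int = 2) -> int:
    """Shard copies held by daily Jaeger indices inside the retention window.

    indices_per_day=2 assumes one span index and one service index per day.
    """
    copies_per_shard = 1 + num_replicas  # primary plus its replicas
    return retention_days * indices_per_day * num_shards * copies_per_shard


total = open_shards(retention_days=7, num_shards=5, num_replicas=1)
print(total)        # 140
print(total / 3)    # average per node on a three-node cluster
```

With the values from the manifest, that is 140 shard copies spread over three nodes. If this number grows large relative to the heap given to each node, lowering `num-shards` is usually the first lever to try.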
Deploy:
kubectl apply -f jaeger-production.yaml
# Watch the deployment status
kubectl -n observability get jaeger jaeger-prod -w
# List the resources that were created
kubectl -n observability get all -l app.kubernetes.io/instance=jaeger-prod
# Access the UI
kubectl -n observability port-forward svc/jaeger-prod-query 16686:16686
3. Verify the deployment
# Check all components
kubectl -n observability get pods -l app.kubernetes.io/instance=jaeger-prod
# Expected output:
# NAME                         READY   STATUS
# jaeger-prod-collector-xxx    1/1     Running
# jaeger-prod-query-xxx        2/2     Running
# jaeger-prod-agent-xxx        1/1     Running   (DaemonSet)
# Send a test trace: start a throwaway pod with a shell
kubectl -n observability run test-client --rm -it --image=curlimages/curl -- sh
# Inside the pod (test-trace.thrift is a placeholder for a Thrift-encoded batch you provide)
curl -X POST http://jaeger-prod-collector:14268/api/traces \
  -H "Content-Type: application/x-thrift" \
  --data-binary @test-trace.thrift
# Open the UI
open http://jaeger.example.com
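To confirm from a script that the Query component is serving data, one option is to ask it which services it knows about. The sketch below calls `/api/services`, the JSON endpoint the Jaeger UI itself uses; it is an internal API without stability guarantees, so treat this as a convenience check. The sketch assumes the port-forward to `svc/jaeger-prod-query` is active, and the function names are illustrative.

```python
import json
import urllib.request


def parse_services(payload: bytes) -> list:
    """Extract the sorted service names from an /api/services response body."""
    body = json.loads(payload)
    # "data" is null when no service has reported spans yet
    return sorted(body.get("data") or [])


def fetch_services(base_url: str = "http://localhost:16686") -> list:
    """Query the Jaeger UI backend for the list of known services."""
    with urllib.request.urlopen(f"{base_url}/api/services", timeout=5) as resp:
        return parse_services(resp.read())


# Usage (requires the port-forward):
#   print(fetch_services())
```

An empty list right after installation is normal; names appear once applications start reporting spans.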
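Option 3: Helm
Quick deployment (development)
# Add the Helm repository
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
# Deploy All-in-One
# The separate collector, query and Cassandra components are switched off
# because the all-in-one pod replaces them; check the flag names against
# the README of your chart version.
helm install jaeger jaegertracing/jaeger \
  --namespace observability \
  --create-namespace \
  --set provisionDataStore.cassandra=false \
  --set allInOne.enabled=true \
  --set storage.type=memory \
  --set agent.enabled=false \
  --set collector.enabled=false \
  --set query.enabled=false
# Access the UI
kubectl -n observability port-forward svc/jaeger-query 16686:16686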
Production deployment (Elasticsearch)
# Create the values file
cat <<EOF > jaeger-values.yaml
storage:
  type: elasticsearch
  elasticsearch:
    host: elasticsearch
    port: 9200
    scheme: http
provisionDataStore:
  cassandra: false
  # false: reuse the Elasticsearch cluster at storage.elasticsearch.host;
  # set to true only if the chart should provision its own cluster
  elasticsearch: false
  kafka: false
agent:
  enabled: true
  daemonset:
    useHostPort: true
collector:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 80
query:
  replicaCount: 2
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
      - jaeger.example.com
EOF
# Deploy
helm install jaeger jaegertracing/jaeger \
  --namespace observability \
  --create-namespace \
  -f jaeger-values.yaml
# Upgrade
helm upgrade jaeger jaegertracing/jaeger \
  --namespace observability \
  -f jaeger-values.yaml
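Performance Tuning
Collector tuning
spec:
  collector:
    options:
      # Queue size (maximum number of spans buffered in memory)
      collector.queue-size: "5000"
      # Number of workers draining the queue
      collector.num-workers: "100"
      # Memory budget in MiB for dynamic queue sizing (not a batch size)
      collector.queue-size-memory: "1000"
      # Sampling strategies file
      sampling.strategies-file: /etc/jaeger/sampling.json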
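A rough way to choose `collector.queue-size` is to ask how long a traffic burst the queue can absorb before spans are dropped. The sketch below does that arithmetic; the ingest and drain rates and the average span size are assumptions to be replaced with measurements from your own Collector metrics, and both function names are made up for this example.

```python
def seconds_until_full(queue_size: int, ingest_rate: float,
                       drain_rate: float) -> float:
    """Seconds a burst can last before the queue overflows.

    ingest_rate and drain_rate are spans per second; returns infinity
    when the workers keep up with incoming traffic.
    """
    if ingest_rate <= drain_rate:
        return float("inf")
    return queue_size / (ingest_rate - drain_rate)


def queue_memory_mib(queue_size: int, avg_span_bytes: int) -> float:
    """Approximate memory held by a full queue, in MiB."""
    return queue_size * avg_span_bytes / (1024 * 1024)


# A 5000-span queue, bursts of 12000 spans/s, storage absorbing 10000 spans/s
print(seconds_until_full(5000, 12_000, 10_000))   # 2.5
# Memory for 5000 queued spans of about 2 KiB each
print(queue_memory_mib(5000, 2048))               # 9.765625
```

If the tolerated burst is too short, a larger queue buys time at the cost of memory, while more Collector replicas or faster storage raise the drain rate itself.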
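Elasticsearch tuning
storage:
  options:
    es:
      # Bulk request size in bytes
      bulk.size: "5000000"
      # Number of bulk workers
      bulk.workers: "5"
      # Bulk flush interval
      bulk.flush-interval: "200ms"
      # Index shards and replicas
      num-shards: "5"
      num-replicas: "1"
      # Connection pool size (confirm that your Jaeger version accepts this
      # option before using it; an unknown flag prevents the Collector from starting)
      max-connections: "20"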
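A bulk request is flushed when the first of several limits is reached, so it helps to know which limit dominates at a given load before tuning the others. The sketch below compares the time needed to hit each one. It is a simplification that ignores the split across `bulk.workers`, and it assumes a `bulk.actions` limit of 1000 requests, which is not set in the configuration above; the function name is illustrative.

```python
def first_flush_trigger(spans_per_sec: float, avg_doc_bytes: int,
                        bulk_size_bytes: int = 5_000_000,
                        bulk_actions: int = 1000,
                        flush_interval_s: float = 0.2) -> str:
    """Name the limit ("size", "actions" or "interval") that is reached first."""
    if spans_per_sec <= 0 or avg_doc_bytes <= 0:
        raise ValueError("rate and document size must be positive")
    seconds_to_reach = {
        "size": bulk_size_bytes / (spans_per_sec * avg_doc_bytes),
        "actions": bulk_actions / spans_per_sec,
        "interval": flush_interval_s,
    }
    return min(seconds_to_reach, key=seconds_to_reach.get)


print(first_flush_trigger(1_000, 1_000))    # interval
print(first_flush_trigger(10_000, 1_000))   # actions
```

When the interval is the dominant trigger, raising `bulk.size` has no effect; lengthening `bulk.flush-interval` would produce larger and fewer requests at the cost of slightly later visibility of spans.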
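Sampling configuration
# Sampling decisions are made by the client SDK, not by the Agent. The Agent's
# reporter.type option selects the transport to the Collector (grpc) and does
# not accept sampler types. With the Operator, default strategies are served
# to clients through spec.sampling. Use either this block or a custom
# sampling.strategies-file, not both.
spec:
  sampling:
    options:
      default_strategy:
        # Probabilistic sampling
        type: probabilistic
        param: 1 # 1 = sample 100%; 0.1 = sample 10%
        # Rate-limiting sampling
        # type: ratelimiting
        # param: 100 # 100 traces per second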
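The file referenced by the Collector's `sampling.strategies-file` option is plain JSON with a default strategy and optional per-service overrides. The sketch below generates such a file from a dictionary of sampling rates. The helper name and the service name `checkout` are made up for this example; the keys `service_strategies`, `default_strategy`, `type` and `param` follow Jaeger's file-based sampling format.

```python
import json


def build_strategies(default_rate: float, per_service: dict) -> dict:
    """Build a file-based sampling configuration using probabilistic sampling."""

    def check(rate: float) -> float:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"sampling rate {rate} is outside [0, 1]")
        return rate

    return {
        "service_strategies": [
            {"service": name, "type": "probabilistic", "param": check(rate)}
            for name, rate in sorted(per_service.items())
        ],
        "default_strategy": {
            "type": "probabilistic",
            "param": check(default_rate),
        },
    }


# Sample 10% by default but keep every trace of the hypothetical checkout service
print(json.dumps(build_strategies(0.1, {"checkout": 1.0}), indent=2))
```

The output can be stored in a ConfigMap and mounted at the path given to `sampling.strategies-file`.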
Monitoring Jaeger
Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jaeger
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: service-monitor
  endpoints:
    - port: admin-http
      interval: 30s
      path: /metrics
Key metrics
# Spans received by the Collector
rate(jaeger_collector_spans_received_total[5m])
# Spans saved by the Collector
rate(jaeger_collector_spans_saved_total[5m])
# Queue length
jaeger_collector_queue_length
# Save latency (99th percentile)
histogram_quantile(0.99, sum(rate(jaeger_collector_save_latency_bucket[5m])) by (le))
# Dropped spans (drop rate)
rate(jaeger_collector_spans_dropped_total[5m])
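Without Prometheus at hand, the same counters can be read from a saved copy of the Collector's `/metrics` output. The sketch below sums the received and dropped counters across all label sets and reports the share of spans dropped since the process started. It is a simple line parser that assumes samples carry no trailing timestamp; the function names are illustrative.

```python
def sum_metric(text: str, name: str) -> float:
    """Sum every sample of one metric in Prometheus text exposition format."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


def drop_ratio(text: str) -> float:
    """Fraction of received spans that the Collector dropped."""
    received = sum_metric(text, "jaeger_collector_spans_received_total")
    dropped = sum_metric(text, "jaeger_collector_spans_dropped_total")
    return dropped / received if received else 0.0


sample = """\
# TYPE jaeger_collector_spans_received_total counter
jaeger_collector_spans_received_total{svc="a"} 900
jaeger_collector_spans_received_total{svc="b"} 100
jaeger_collector_spans_dropped_total{host="h"} 50
"""
print(drop_ratio(sample))   # 0.05
```

A ratio that stays above zero under normal load points at an undersized queue or slow storage.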
Grafana Dashboard
# Import the Jaeger dashboard from grafana.com by ID
Dashboard ID: 10001
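Troubleshooting
Problem 1: Lost spans
# Check the Collector logs
kubectl -n observability logs -l app.kubernetes.io/component=collector
# Check whether the queue is full. The Collector image may not ship curl,
# so forward the admin port and query it from your machine instead.
kubectl -n observability port-forward <collector-pod> 14269:14269 &
curl -s http://localhost:14269/metrics | grep queue
# Increase the queue size or the number of Collector replicas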
Problem 2: Slow Elasticsearch writes
# Check cluster health
kubectl -n observability exec -it elasticsearch-0 -- \
  curl "http://localhost:9200/_cluster/health?pretty"
# Check index status
kubectl -n observability exec -it elasticsearch-0 -- \
  curl "http://localhost:9200/_cat/indices?v"
# Increase the bulk size or the number of bulk workers
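Problem 3: Slow UI queries
# Check Elasticsearch search performance
kubectl -n observability exec -it elasticsearch-0 -- \
  curl "http://localhost:9200/_cat/thread_pool/search?v"
# Increase the number of Query replicas. Change the Jaeger resource rather
# than scaling the Deployment, because the Operator reverts manual scaling.
kubectl -n observability patch jaeger jaeger-prod --type merge \
  -p '{"spec":{"query":{"replicas":3}}}'
# Optimize the Elasticsearch indices
# - Reduce num-shards
# - Increase caching
# - Use SSD storage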
Summary
Choosing a deployment method
| Environment | Recommended method | Storage |
|---|---|---|
| Local development | All-in-One | Memory |
| Testing | All-in-One/Helm | Memory/ES |
| Production | Operator | Elasticsearch |
| Large scale | Operator | ES + Kafka |
Resource planning
Small (< 1000 RPS):
- Collector: 2 replicas, 500m CPU, 1Gi memory
- Query: 2 replicas, 200m CPU, 512Mi memory
- ES: 3 nodes, 1 CPU, 2Gi memory
Medium (1000-10000 RPS):
- Collector: 3-5 replicas, 1 CPU, 2Gi memory
- Query: 2-3 replicas, 500m CPU, 1Gi memory
- ES: 5 nodes, 2 CPU, 4Gi memory
Large (> 10000 RPS):
- Collector: 5-10 replicas, 2 CPU, 4Gi memory
- Query: 3-5 replicas, 1 CPU, 2Gi memory
- ES: 10+ nodes, 4 CPU, 8Gi memory
- Add Kafka as a buffering layer
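The tiers above can be encoded as a small lookup, for example to choose values in a deployment script. This sketch only restates the numbers from the list; the names `PLANS` and `pick_tier` are made up for this example, and replica ranges are stored as (minimum, maximum) pairs.

```python
PLANS = {
    "small":  {"collector": (2, 2),  "query": (2, 2), "es_nodes": 3,  "kafka": False},
    "medium": {"collector": (3, 5),  "query": (2, 3), "es_nodes": 5,  "kafka": False},
    "large":  {"collector": (5, 10), "query": (3, 5), "es_nodes": 10, "kafka": True},
}


def pick_tier(rps: float) -> str:
    """Map a request rate to the sizing tier described above."""
    if rps < 0:
        raise ValueError("rps must not be negative")
    if rps < 1_000:
        return "small"
    if rps <= 10_000:
        return "medium"
    return "large"


tier = pick_tier(4_000)
print(tier, PLANS[tier])
```

For the large tier, `es_nodes` is the lower bound of the "10+" recommendation.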
The next section covers how to integrate the Jaeger SDK into an application.