Deploying Jaeger on Kubernetes

Jaeger Architecture Components

Component Overview

┌─────────────────────────────────────────────┐
│          Application Pods                   │
│  ┌──────────────────────────────────────┐   │
│  │ App Container + Jaeger Client SDK    │   │
│  └────────────┬─────────────────────────┘   │
└───────────────┼─────────────────────────────┘
                │ UDP (6831)
                ▼
┌─────────────────────────────────────────────┐
│       Jaeger Agent (DaemonSet)              │
│  - Receives spans (UDP)                     │
│  - Batches and forwards to Collector        │
│  - Reduces load on the application          │
└────────────┬────────────────────────────────┘
             │ gRPC/HTTP
             ▼
┌─────────────────────────────────────────────┐
│       Jaeger Collector (Deployment)         │
│  - Validates and processes spans            │
│  - Writes to storage in batches             │
│  - Horizontally scalable                    │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│          Storage Backend                    │
│  - ElasticSearch (recommended for prod)     │
│  - Cassandra                                │
│  - Memory (development and testing)         │
│  - Kafka (buffering layer)                  │
└────────────┬────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────┐
│       Jaeger Query + UI (Deployment)        │
│  - Query API                                │
│  - Web UI                                   │
│  - Service dependency analysis              │
└─────────────────────────────────────────────┘
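
To make the first hop in the diagram concrete, here is a minimal sketch of an application Deployment that points its Jaeger client at the agent on the same node. It assumes the agent DaemonSet exposes UDP 6831 through a host port and that the client library honors the JAEGER_AGENT_HOST and JAEGER_AGENT_PORT environment variables. The name my-app and the image example.com/my-app:latest are placeholders, not part of the original setup.

```yaml
# Hypothetical application Deployment; only the env section matters here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                             # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example.com/my-app:latest   # placeholder image
        env:
        # Resolve the node IP at runtime so spans go to the node-local agent
        - name: JAEGER_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Compact Thrift over UDP, matching the port shown in the diagram
        - name: JAEGER_AGENT_PORT
          value: "6831"
```

Using status.hostIP keeps the traffic on the node, which is the main reason to run the agent as a DaemonSet rather than as a central Service.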

Deployment Method Comparison

Method        Use case               Complexity   Control
All-in-One    Development, testing   ⭐⭐
Operator      Production             ⭐⭐⭐         ⭐⭐⭐⭐
Helm          Quick deployment       ⭐⭐           ⭐⭐⭐
Manual YAML   Full customization     ⭐⭐⭐⭐       ⭐⭐⭐⭐⭐

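Method 1: All-in-One (Development)

Characteristics

  • A single container runs all components
  • In-memory storage (data is lost on restart)
  • Starts quickly; well suited to local development

Deployment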

# jaeger-all-in-one.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: observability

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
  labels:
    app: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.51
        ports:
        # Agent ingest ports
        - containerPort: 6831
          protocol: UDP
          name: jaeger-thrift
        - containerPort: 6832
          protocol: UDP
          name: jaeger-binary
        # Collector ingest ports
        - containerPort: 14268
          name: jaeger-coll-http
        - containerPort: 14250
          name: jaeger-grpc
        # UI port
        - containerPort: 16686
          name: jaeger-query
        # Health check
        - containerPort: 14269
          name: admin-http
        env:
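        # Enable the Zipkin-compatible endpoint; recent releases use
        # COLLECTOR_ZIPKIN_HOST_PORT in place of COLLECTOR_ZIPKIN_HTTP_PORT
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"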
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /
            port: 14269
          initialDelaySeconds: 5
        readinessProbe:
          httpGet:
            path: /
            port: 14269
          initialDelaySeconds: 5

---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
  labels:
    app: jaeger
spec:
  type: ClusterIP
  ports:
  # Agent ports
  - port: 6831
    targetPort: 6831
    protocol: UDP
    name: jaeger-thrift
  - port: 6832
    targetPort: 6832
    protocol: UDP
    name: jaeger-binary
  # Collector ports
  - port: 14268
    targetPort: 14268
    name: jaeger-collector-http
  - port: 14250
    targetPort: 14250
    name: jaeger-grpc
  # Query and Zipkin ports
  - port: 16686
    targetPort: 16686
    name: jaeger-query
  - port: 9411
    targetPort: 9411
    name: zipkin
  selector:
    app: jaeger

---
# Ingress (optional)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jaeger-ui
  namespace: observability
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: jaeger.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: jaeger
            port:
              number: 16686

Deploy and access:

# Deploy
kubectl apply -f jaeger-all-in-one.yaml

# Verify
kubectl -n observability get pods
kubectl -n observability get svc

# Port-forward to reach the UI
kubectl -n observability port-forward svc/jaeger 16686:16686

# Open in a browser
open http://localhost:16686

Method 2: Jaeger Operator (Recommended for Production)

Installing the Operator

1. Install cert-manager (prerequisite)

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Wait for cert-manager to become ready
kubectl -n cert-manager wait --for=condition=ready pod -l app.kubernetes.io/instance=cert-manager --timeout=2m

2. Install the Jaeger Operator

# Create the namespace (skip if it already exists)
kubectl create namespace observability

# Install the Operator
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Verify
kubectl -n observability get pods -l name=jaeger-operator

Production Deployment (ElasticSearch Backend)

1. Deploy ElasticSearch

# elasticsearch.yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: observability
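spec:
  # Headless Service: gives each StatefulSet pod a stable DNS name
  # (elasticsearch-0.elasticsearch, ...) as referenced by discovery.seed_hosts
  clusterIP: None
  ports:
  - port: 9200
    targetPort: 9200
    name: http
  - port: 9300
    targetPort: 9300
    name: transport
  selector:
    app: elasticsearch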

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: observability
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
      - name: increase-vm-max-map
        image: busybox
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9
        ports:
        - containerPort: 9200
          name: http
        - containerPort: 9300
          name: transport
        env:
        - name: cluster.name
          value: jaeger-cluster
        - name: discovery.seed_hosts
          value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: ES_JAVA_OPTS
          value: "-Xms1g -Xmx1g"
        resources:
          requests:
            cpu: 500m
            memory: 2Gi
          limits:
            cpu: 1000m
            memory: 3Gi
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard
      resources:
        requests:
          storage: 10Gi

Deploy:

kubectl apply -f elasticsearch.yaml

# Wait for ES to become ready
kubectl -n observability wait --for=condition=ready pod -l app=elasticsearch --timeout=5m

# Verify (the URL is quoted so the shell does not treat "?" as a glob)
kubectl -n observability exec -it elasticsearch-0 -- curl "http://localhost:9200/_cluster/health?pretty"

2. Deploy Jaeger (production configuration)

# jaeger-production.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
  namespace: observability
spec:
  # Production strategy
  strategy: production
  
  # Storage configuration
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
        index-prefix: jaeger
        num-shards: 5
        num-replicas: 1
    # Automatically clean up old data
    esIndexCleaner:
      enabled: true
      numberOfDays: 7
      schedule: "55 23 * * *"
  
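  # Collector configuration
  collector:
    # `replicas` is left unset: the Operator only creates an HPA
    # when no fixed replica count is given
    minReplicas: 3
    maxReplicas: 5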
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
    autoscale: true
    options:
      collector:
        num-workers: 100
        queue-size: 2000
  
  # Query configuration
  query:
    replicas: 2
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
    options:
      query:
        base-path: /
  
  # Agent configuration (DaemonSet)
  agent:
    strategy: DaemonSet
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
  
  # Ingress configuration
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
      cert-manager.io/cluster-issuer: letsencrypt-prod
    hosts:
    - jaeger.example.com
    tls:
    - secretName: jaeger-tls
      hosts:
      - jaeger.example.com
  
  # UI configuration
  ui:
    options:
      dependencies:
        menuEnabled: true
      tracking:
        gaID: UA-000000-1  # Google Analytics (optional)
      menu:
      - label: "Docs"
        items:
        - label: "Documentation"
          url: "https://www.jaegertracing.io/docs/"

Deploy:

kubectl apply -f jaeger-production.yaml

# Watch the deployment status
kubectl -n observability get jaeger jaeger-prod -w

# List the created resources
kubectl -n observability get all -l app.kubernetes.io/instance=jaeger-prod

# Access the UI
kubectl -n observability port-forward svc/jaeger-prod-query 16686:16686

3. Verify the Deployment

# Check all components
kubectl -n observability get pods -l app.kubernetes.io/instance=jaeger-prod

# Expected output:
# NAME                                    READY   STATUS
# jaeger-prod-collector-xxx               1/1     Running
# jaeger-prod-query-xxx                   2/2     Running
# jaeger-prod-agent-xxx                   1/1     Running (DaemonSet)

# Send a test trace. Run the curl command inside the pod's shell;
# test-trace.thrift stands for a Thrift-encoded batch that you supply.
kubectl -n observability run test-client --rm -it --image=curlimages/curl -- sh
curl -X POST http://jaeger-prod-collector:14268/api/traces \
  -H "Content-Type: application/x-thrift" \
  --data-binary @test-trace.thrift

# Open the UI
open http://jaeger.example.com

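Method 3: Helm Deployment

Quick Deployment (Development)

# Add the Helm repository
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

# Deploy All-in-One; the chart's default Cassandra store and the
# standalone agent, collector, and query components are switched off
helm install jaeger jaegertracing/jaeger \
  --namespace observability \
  --create-namespace \
  --set provisionDataStore.cassandra=false \
  --set allInOne.enabled=true \
  --set storage.type=memory \
  --set agent.enabled=false \
  --set collector.enabled=false \
  --set query.enabled=false

# Access the UI
kubectl -n observability port-forward svc/jaeger-query 16686:16686

Production Deployment (ElasticSearch)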

# Create jaeger-values.yaml
cat <<EOF > jaeger-values.yaml
storage:
  type: elasticsearch
  elasticsearch:
    host: elasticsearch
    port: 9200
    scheme: http

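# storage.elasticsearch.host points at the existing "elasticsearch" Service,
# so the chart is told not to provision a data store of its own
provisionDataStore:
  cassandra: false
  elasticsearch: false
  kafka: false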

agent:
  enabled: true
  daemonset:
    useHostPort: true

collector:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 80

query:
  replicaCount: 2
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
    - jaeger.example.com
EOF

# Deploy
helm install jaeger jaegertracing/jaeger \
  --namespace observability \
  --create-namespace \
  -f jaeger-values.yaml

# Upgrade
helm upgrade jaeger jaegertracing/jaeger \
  --namespace observability \
  -f jaeger-values.yaml

Performance Tuning

Collector Tuning

spec:
  collector:
    options:
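      # Queue size (number of spans)
      collector.queue-size: "5000"
      
      # Number of workers pulling from the queue
      collector.num-workers: "100"
      
      # Memory budget in MiB for dynamic queue sizing
      collector.queue-size-memory: "1000"
      
      # Sampling strategies file
      sampling.strategies-file: /etc/jaeger/sampling.json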

ElasticSearch Tuning

storage:
  options:
    es:
      # Bulk request size in bytes
      bulk.size: "5000000"
      
      # Bulk workers
      bulk.workers: "5"
      
      # Bulk flush interval
      bulk.flush-interval: "200ms"
      
      # Index shards and replicas
      num-shards: "5"
      num-replicas: "1"
      
      # Connection pool
      max-connections: "20"

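Sampling Configuration

# Sampler types are served to client SDKs as remote sampling strategies.
# The agent's reporter.type only selects the transport to the Collector (grpc),
# so with the Operator the strategies are set under spec.sampling.
sampling:
  options:
    default_strategy:
      # Probabilistic sampling
      type: probabilistic
      param: 1  # 1 = sample 100% of traces; 0.1 = 10%

      # Rate-limiting sampling
      # type: ratelimiting
      # param: 100  # 100 traces per second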
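
Sampling strategies can also be defined per service and served centrally, so that clients using a remote sampler pick them up without a redeploy. The fragment below is a sketch of how this might look in the Jaeger custom resource, assuming the Operator's spec.sampling.options field. The service name checkout and both rates are illustrative only; merge the sampling block into the existing resource rather than applying it on its own.

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-prod
  namespace: observability
spec:
  sampling:
    options:
      # Fallback for services without an explicit entry: keep 10% of traces
      default_strategy:
        type: probabilistic
        param: 0.1
      service_strategies:
      # Hypothetical service: cap at 100 traces per second
      - service: checkout
        type: ratelimiting
        param: 100
```

A low default with targeted overrides keeps storage volume predictable while still capturing enough traces from the services under investigation.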

Monitoring Jaeger

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jaeger
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: service-monitor
  endpoints:
  - port: admin-http
    interval: 30s
    path: /metrics

Key Metrics

# Spans received by the Collector
rate(jaeger_collector_spans_received_total[5m])

# Spans saved by the Collector
rate(jaeger_collector_spans_saved_total[5m])

# Queue length
jaeger_collector_queue_length

# Save latency (histogram buckets)
jaeger_collector_save_latency_bucket

# Dropped spans
rate(jaeger_collector_spans_dropped_total[5m])
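
These queries can be turned into alerts. The following is a sketch of a PrometheusRule built from the metric names above, assuming the Prometheus Operator CRDs are installed (as the ServiceMonitor already implies) and that the latency histogram is recorded in seconds. The rule names, the 10m hold time, and the 1s threshold are assumptions to adjust for your environment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: jaeger-collector-alerts   # placeholder name
  namespace: observability
spec:
  groups:
  - name: jaeger-collector
    rules:
    # A sustained drop rate usually means the Collector queue is overflowing
    - alert: JaegerCollectorDroppingSpans
      expr: sum(rate(jaeger_collector_spans_dropped_total[5m])) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Jaeger Collector is dropping spans
    # p99 time to write a span to storage, derived from the histogram buckets
    - alert: JaegerCollectorSlowSaves
      expr: |
        histogram_quantile(0.99,
          sum by (le) (rate(jaeger_collector_save_latency_bucket[5m]))) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Jaeger Collector p99 save latency is above 1s
```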

Grafana Dashboard

# Import the official dashboard
Dashboard ID: 10001

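Troubleshooting

Issue 1: Lost Spans

# Check the Collector logs
kubectl -n observability logs -l app.kubernetes.io/component=collector

# Check whether the queue is full. The Jaeger images typically ship
# without curl, so forward the admin port and query it locally.
kubectl -n observability port-forward <collector-pod> 14269:14269
# In a second terminal:
curl -s http://localhost:14269/metrics | grep queue

# Increase the queue size or the number of Collector replicas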

Issue 2: Slow ElasticSearch Writes

# Check ES cluster health
kubectl -n observability exec -it elasticsearch-0 -- \
  curl "http://localhost:9200/_cluster/health?pretty"

# Check index status
kubectl -n observability exec -it elasticsearch-0 -- \
  curl "http://localhost:9200/_cat/indices?v"

# Increase the bulk size or the number of bulk workers

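Issue 3: Slow UI Queries

# Check ES search performance
kubectl -n observability exec -it elasticsearch-0 -- \
  curl "http://localhost:9200/_cat/thread_pool/search?v"

# Increase the Query replica count. Change the Jaeger resource rather than
# the Deployment, because the Operator reconciles the Deployment back to the spec.
kubectl -n observability patch jaeger jaeger-prod --type merge \
  -p '{"spec":{"query":{"replicas":3}}}'

# Tune the ES indices
# - Reduce num-shards
# - Increase caching
# - Use SSD-backed storage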

Summary

Choosing a Deployment Method

Environment         Recommended method   Storage
Local development   All-in-One           Memory
Testing             All-in-One / Helm    Memory / ES
Production          Operator             ElasticSearch
Large scale         Operator             ES + Kafka

Resource Planning

Small scale (< 1000 RPS):

  • Collector: 2 replicas, 500m CPU, 1Gi memory
  • Query: 2 replicas, 200m CPU, 512Mi memory
  • ES: 3 nodes, 1 CPU, 2Gi memory

Medium scale (1000-10000 RPS):

  • Collector: 3-5 replicas, 1 CPU, 2Gi memory
  • Query: 2-3 replicas, 500m CPU, 1Gi memory
  • ES: 5 nodes, 2 CPU, 4Gi memory

Large scale (> 10000 RPS):

  • Collector: 5-10 replicas, 2 CPU, 4Gi memory
  • Query: 3-5 replicas, 1 CPU, 2Gi memory
  • ES: 10+ nodes, 4 CPU, 8Gi memory
  • Add a Kafka buffering layer
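
For the large-scale tier, the Kafka buffering layer might be wired in through the Operator's streaming strategy, where the Collector produces to a topic and an Ingester consumes from it and writes to ElasticSearch. This is a sketch under assumptions: the resource name, the topic jaeger-spans, and the broker address kafka.observability:9092 are placeholders, and a Kafka cluster is expected to exist already.

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-streaming                      # placeholder name
  namespace: observability
spec:
  strategy: streaming
  collector:
    options:
      kafka:
        producer:
          topic: jaeger-spans                 # assumed topic name
          brokers: kafka.observability:9092   # assumed broker address
  ingester:
    options:
      kafka:
        consumer:
          topic: jaeger-spans                 # must match the producer topic
          brokers: kafka.observability:9092   # assumed broker address
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
```

With this layout, a slow or unavailable ElasticSearch cluster causes spans to accumulate in Kafka instead of being dropped from the Collector queue.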

The next section covers how to integrate the Jaeger SDK into an application.