Istio tracing and correlation with Jaeger and Grafana Loki
Istio is an open source service mesh that layers transparently onto existing distributed applications. Tracing is an important feature for detecting issues early and decreasing troubleshooting time. This article carries on my previous article, Multi-hop tracing with OpenTelemetry in Golang, on Kubernetes with the Istio service mesh.
Components
Istio supports tracing, but does not provide a whole tracing solution; instead, it integrates with one. The Istio core collects only a little Span information via the sidecar proxies (no Trace State, for example). Most of the tracing integration is done by the official Istio dashboard: Kiali.
Without an integrated tracing backend (Jaeger, Zipkin, etc.), Istio's tracing capabilities are very limited. The default tracing backend for Istio is Zipkin (which supports the vendor-agnostic OpenTelemetry Protocol Exporter), but Jaeger is supported too (extra configuration is needed). This article uses the sample Jaeger deployment for Istio. Kiali integrates Prometheus and Grafana, too (the sample deployments are used).
The Grafana Jaeger Data source provides a view similar to the Jaeger Trace view. This view and Loki log items can link to each other, so Traces and logs can be correlated in Grafana.
The typical real-world scenario: an issue is discovered in Kiali and troubleshooting continues in Jaeger. The next step is log analysis. Grafana supports trace-log correlation, so this scenario is easy to follow on the UIs.
Test setup
The same client and app servers are used as in Multi-hop tracing with OpenTelemetry in Golang, but the deployment is different: the services (frontend, backend) are packed into a Docker image and deployed on Kubernetes with Istio. The deployment files and description can be found on a separate branch: https://github.com/pgillich/kind-on-dev/tree/1.24
Kiali vs Jaeger
Let’s take a look at screenshots of Jaeger and Kiali:
As the screenshots above show, Jaeger focuses on particular things, mostly end-to-end troubleshooting (deep Trace and Span info). Kiali focuses on a higher level (it does not show the Service instances/replicas) to detect issues as early as possible (based on metrics, statistics and versions), mostly after a new version is deployed.
Istio (with Kiali) collects the information passively, so it cannot show the client app, which is missing from the Kiali Graph overview figure. My client app sends tracing info directly to Jaeger, so Jaeger can show the client app Span as the root of the Trace.
Kiali focuses on metrics and statistics of the connections between Services (not between Pods), while Jaeger assembles the Spans of a Trace into a tree.
Kiali metrics
Kiali integrates Prometheus, too, and draws several charts from Prometheus queries, for example:
Grafana
This chapter is based on Distributed Tracing in Grafana with Tempo and Jaeger , but with improved configuration.
Trace to log correlation
The Grafana Jaeger Data source lists the matching Traces and shows a Trace similar to the Jaeger Trace view:
Correlation is based on Span attributes and Loki log labels:
See more details at Tracing in Explore.
Log to trace correlation
The simplest way to correlate log items is regex pattern matching configured in the Loki Data source config.
Example for filtering logs:
The traceID log label is detected by an additional Promtail pattern matching.
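As a sketch, a Loki query that filters the example app's logs by the extracted label might look like the following (the label values are hypothetical, for illustration only):

```logql
{component="opentracing-example", traceID="4bf92f3577b34da6a3ce929d0e0e4736"}
```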
Integration
Istio, Kiali, Jaeger, Prometheus and Grafana use each other, so it’s important to configure them properly.
The samples and examples used here don’t fulfill enterprise expectations and aren’t secure: the Kiali, Jaeger, Prometheus and Grafana sample deployments are good for a demo, but must be improved (or replaced) in a production environment.
Configurations can be found in below repos:
Istio
Instead of the default Zipkin collector, the Jaeger Collector endpoint must be set.
Config files: istio-config.yaml, istio-ingress.yaml
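A minimal sketch of the relevant MeshConfig part (the jaeger-collector service name, namespace and sampling rate are assumptions here; check istio-config.yaml in the repo for the actual values):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100   # demo only; sample less in production
        zipkin:
          # Jaeger Collector accepts spans on its Zipkin-compatible port
          address: jaeger-collector.istio-system.svc:9411
```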
Jaeger
Sample Istio addon Jaeger is used. Additional Ingress configuration is needed to access the Jaeger UI.
Config file: telemetry-ingress.yaml
Prometheus
Sample Istio addon Prometheus is used. It’s prepared to scrape the needed metrics for Istio and Kiali. Additional Ingress configuration is needed to access the Prometheus UI.
Config file: telemetry-ingress.yaml
Grafana
Sample Istio addon Grafana is used. It already contains the expected dashboards by Kiali. Additional Ingress configuration is needed to access the Grafana UI.
Config file: telemetry-ingress.yaml
Grafana, Jaeger Data source
The correlation keys are the Span attributes (Jaeger tags) and Loki log labels (including Pod labels). The default Jaeger tags are documented at Jaeger data source / Trace to logs. The implemented example uses the same key as a Pod label, instrumented the following way:
```go
// Client side: the otelhttp transport adds the "component"
// attribute to every outgoing request Span.
httpClient := &http.Client{Transport: otelhttp.NewTransport(
	http.DefaultTransport,
	otelhttp.WithPropagators(otel.GetTextMapPropagator()),
	otelhttp.WithSpanOptions(trace.WithAttributes(
		attribute.String("component", "opentracing-example"),
	)),
)}
```

```go
// Server side: the incoming request Span gets the same attribute.
ctx, span = tr.Start(ctx, "IN HTTP "+r.Method+" "+r.URL.String(),
	// ...
	trace.WithAttributes(
		attribute.String(StateKeyClientCommand, clientCommand),
		attribute.String("component", "opentracing-example"),
	),
)
```
See more details at New in Grafana 8.5: How to jump from traces to Splunk logs.
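A provisioning sketch for the Jaeger Data source with trace-to-logs enabled (the URL, the Loki datasource UID and the tag list are assumptions for this setup):

```yaml
datasources:
  - name: Jaeger
    type: jaeger
    url: http://jaeger-query.istio-system:16686
    jsonData:
      tracesToLogs:
        datasourceUid: loki       # UID of the Loki Data source
        tags: ['component']       # Span attribute matched against Loki labels
```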
Grafana, Loki Data source
The correlation is based on a regex group expression, which parses the Trace ID.
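A provisioning sketch for the Loki Data source with such a derived field (the URL and the Jaeger datasource UID are assumptions; the regex mirrors the idea of the Promtail pattern):

```yaml
datasources:
  - name: Loki
    type: loki
    url: http://loki.istio-system:3100
    jsonData:
      derivedFields:
        - name: TraceID
          # capture group 1 becomes the link value
          matcherRegex: 'TraceID\\":\\"(\w+)'
          datasourceUid: jaeger    # UID of the Jaeger Data source
          url: '$${__value.raw}'
```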
Promtail
Promtail collects the logs for Loki. The match stage below parses the Trace ID:
```yaml
- match:
    selector: '{component="opentracing-example"}'
    stages:
      - regex:
          expression: '.*(?P<trace>TraceID)\\":\\"(?P<traceID>[a-zA-Z0-9]+).*'
      - labels:
          traceID:
```
The config file is: loki-values.yaml
Kiali
Config files: kiali-values.yaml, istio-ingress.yaml
The sample Istio addon Kiali is used. The URLs to Jaeger, Prometheus and Grafana must be set. gRPC to Jaeger is disabled.
Istio sidecar proxy is disabled for Kiali Pod.
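A sketch of the relevant external_services part of the Kiali values (the in-cluster service URLs are assumptions; check kiali-values.yaml in the repo for the actual values):

```yaml
external_services:
  tracing:
    enabled: true
    in_cluster_url: http://jaeger-query.istio-system:16686
    use_grpc: false            # gRPC to Jaeger is disabled
  prometheus:
    url: http://prometheus.istio-system:9090
  grafana:
    enabled: true
    in_cluster_url: http://grafana.istio-system:3000
```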
Instrumentation
In order to match the app traces in Kiali, the service name must follow the service.namespace format in the Tracer Provider attributes.
Config files: deployments/kustomize/
Summary
Istio is a complex service mesh. Any observability solution that helps to detect and discover issues is useful in the deployment+test pipeline and in Operations & Maintenance. It’s hard to detect issues in time without the tracing solutions described above.