Istio tracing and correlation with Jaeger and Grafana Loki

Peter Gillich
FAUN — Developer Community 🐾
Nov 1, 2022


Istio is an open source service mesh that layers transparently onto existing distributed applications. Tracing is an important feature for detecting issues early and reducing troubleshooting time. This article builds on my previous article, Multi-hop tracing with OpenTelemetry in Golang, moving the setup onto Kubernetes with an Istio service mesh.

Istio components

Components

Istio supports tracing, but it does not provide a complete tracing solution; instead, it integrates with one. The Istio core collects only very little Span information via the sidecar proxies (no Trace State, for example). Most of the tracing integration is provided by the official Istio dashboard: Kiali.

Without an integrated tracing backend (Jaeger, Zipkin, etc.), Istio's tracing capabilities are very limited. The default tracing backend for Istio is Zipkin (which supports the vendor-agnostic OpenTelemetry Protocol Exporter), but Jaeger is supported too (extra configuration is needed). This article uses the sample Jaeger deployment for Istio. Kiali integrates Prometheus and Grafana, too (the sample deployments are used).

The Grafana Jaeger Data source offers a view similar to the Jaeger Trace view. This view and Loki log entries can link to each other, so Traces and logs can be correlated in Grafana.

The typical real-world scenario is: an issue is discovered in Kiali and troubleshooting continues in Jaeger. The next step is log analysis. Grafana supports trace-log correlation, so this scenario is easy to follow on the UIs.

Test setup

The same client and app servers are used as in Multi-hop tracing with OpenTelemetry in Golang, but the deployment is different: the services (frontend, backend) are packed into a Docker image and deployed on Kubernetes with Istio. The deployment files and description can be found on a separate branch: https://github.com/pgillich/kind-on-dev/tree/1.24

Kiali vs Jaeger

Let’s take a look at some screenshots of Jaeger and Kiali:

Deep Dependency Graph in Jaeger
Kiali Graph overview

As the screenshots above show, Jaeger focuses on the particulars, mostly for end-to-end troubleshooting (deep Trace and Span info). Kiali focuses on a higher level (it does not show the Service instances/replicas) to detect issues as early as possible (based on metrics, statistics and versions), mostly after a new version is deployed.

Jaeger Trace view
Kiali Application trace (frontend)

Istio (with Kiali) collects the information passively, so it cannot show the client app, which is therefore missing from the Kiali Graph overview figure. My client app sends tracing info directly to Jaeger, so Jaeger can show the client app Span as the root of the Trace.

Kiali focuses on metrics and statistics of the connections between Services (not between Pods), while Jaeger assembles the Span tree of each Trace.

Kiali metrics

Kiali integrates Prometheus, too, and draws several charts from Prometheus queries, for example:

Kiali Application Inbound Metrics (frontend)

Grafana

This chapter is based on Distributed Tracing in Grafana with Tempo and Jaeger, but with an improved configuration.

Trace to log correlation

The Grafana Jaeger Data source lists the matching Traces and shows a Trace similarly to the Jaeger Trace view:

Jaeger Data source with Node graph
Jaeger Data source with Trace view and link to Loki logs

Correlation is based on Span attributes and Loki log labels:

Correlated log items to Trace from Jaeger

See more details at Tracing in Explore.

Log to trace correlation

The simplest way to correlate log items is regex pattern matching configured in the Loki Data source config.

Example for filtering logs:

Loki log filtering by Pod label
Generated Jaeger URL by regex matching

The traceID log label is detected by an additional Promtail pattern matching stage.

Integration

Istio, Kiali, Jaeger, Prometheus and Grafana use each other, so it’s important to configure them properly.

The samples and examples used here don’t fulfill enterprise expectations and aren’t secure. The Kiali, Jaeger, Prometheus and Grafana sample deployments are good for a demo, but must be improved (or replaced) in a production environment.

Configurations can be found in the repos below:

Istio

Instead of the default Zipkin collector, the Jaeger Collector endpoint must be set.

Config files: istio-config.yaml, istio-ingress.yaml
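As a rough sketch (not the exact contents of the repo), assuming the IstioOperator / meshConfig install path and the Zipkin-compatible service created by the sample Jaeger addon, the relevant setting looks like this:

```yaml
# Sketch: point the mesh's tracer at the Jaeger collector's
# Zipkin-compatible endpoint. The service name and 100% sampling
# are assumptions based on the sample Jaeger addon, for demo use.
meshConfig:
  enableTracing: true
  defaultConfig:
    tracing:
      sampling: 100.0
      zipkin:
        address: zipkin.istio-system.svc:9411
```

In production, the sampling rate should be far below 100% to keep the tracing overhead acceptable.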

Jaeger

The sample Istio addon Jaeger is used. Additional Ingress configuration is needed to access the Jaeger UI.

Config file: telemetry-ingress.yaml

Prometheus

The sample Istio addon Prometheus is used. It’s prepared to scrape the metrics needed by Istio and Kiali. Additional Ingress configuration is needed to access the Prometheus UI.

Config file: telemetry-ingress.yaml

Grafana

The sample Istio addon Grafana is used. It already contains the dashboards expected by Kiali. Additional Ingress configuration is needed to access the Grafana UI.

Config file: telemetry-ingress.yaml

Grafana, Jaeger Data source

Jaeger Data source configuration, URL
Jaeger Data source configuration, log correlation options

The correlation keys are the Span attributes (Jaeger tags) and Loki log labels (including Pod labels). The default Jaeger tags are documented at Jaeger data source / Trace to logs. The implemented example uses the same key as the Pod label, instrumented the following way:

httpClient := &http.Client{Transport: otelhttp.NewTransport(
	http.DefaultTransport,
	otelhttp.WithPropagators(otel.GetTextMapPropagator()),
	otelhttp.WithSpanOptions(trace.WithAttributes(
		attribute.String("component", "opentracing-example"),
	)),
)}

ctx, span = tr.Start(ctx, "IN HTTP "+r.Method+" "+r.URL.String(),
	// ...
	trace.WithAttributes(
		attribute.String(StateKeyClientCommand, clientCommand),
		attribute.String("component", "opentracing-example"),
	),
)

See more details at New in Grafana 8.5: How to jump from traces to Splunk logs.
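A trace-to-logs setup like the one on the screenshots can also be provisioned declaratively. The following is only a hedged sketch: the data source name, URL and the loki UID are assumptions; the tag list matches the component attribute used in the instrumentation above:

```yaml
# Sketch of Grafana data source provisioning for trace-to-logs.
# The URL and the "loki" data source UID are assumptions.
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    url: http://tracing.istio-system:16686/jaeger
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['component']
```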

Grafana, Loki Data source

Loki Data source configuration, URL
Loki Data source configuration, Derived fields

The correlation is based on a regex group expression that parses the Trace ID out of the log line.
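The derived field shown on the screenshots can be sketched in provisioning form, too. This is an assumption-laden illustration (the Jaeger data source UID and the exact regex escaping depend on the actual setup):

```yaml
# Sketch: a derived field that turns the Trace ID found in the
# log line into a link to the Jaeger data source (UID assumed).
jsonData:
  derivedFields:
    - name: TraceID
      matcherRegex: 'TraceID\\":\\"(\w+)'
      datasourceUid: jaeger
      url: '$${__value.raw}'
```

Note the doubled `$$` in `url`: in provisioning files a single `$` would be expanded by Grafana’s variable interpolation.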

Promtail

Promtail collects the logs for Loki. The match stage below parses the Trace ID:

- match:
    selector: '{component="opentracing-example"}'
    stages:
      - regex:
          expression: '.*(?P<trace>TraceID)\\":\\"(?P<traceID>[a-zA-Z0-9]+).*'
      - labels:
          traceID:

Config file: loki-values.yaml
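Since Promtail’s regex stage uses Go (RE2) syntax, the expression above can be sanity-checked with a tiny Go program. The sample log line below is made up for illustration; it mimics a JSON-escaped app log as a container runtime may emit it:

```go
package main

import (
	"fmt"
	"regexp"
)

// Same expression as in the Promtail config; Promtail's regex
// stage uses Go (RE2) syntax, so this is a faithful check.
var traceRe = regexp.MustCompile(`.*(?P<trace>TraceID)\\":\\"(?P<traceID>[a-zA-Z0-9]+).*`)

// extractTraceID returns the traceID capture group, or "" if the
// line does not match.
func extractTraceID(line string) string {
	m := traceRe.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[traceRe.SubexpIndex("traceID")]
}

func main() {
	// Hypothetical JSON-escaped log line, as Promtail may receive it.
	line := `{"log":"{\"TraceID\":\"4bf92f3577b34da6a3ce929d0e0e4736\",\"msg\":\"handled\"}"}`
	fmt.Println(extractTraceID(line)) // 4bf92f3577b34da6a3ce929d0e0e4736
}
```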

Kiali

Config files: kiali-values.yaml, istio-ingress.yaml

The sample Istio addon Kiali is used. The URLs to Jaeger, Prometheus and Grafana must be set. gRPC to Jaeger is disabled. The Istio sidecar proxy is disabled for the Kiali Pod.

Instrumentation

In order for Kiali to match the app traces, the service name in the Tracer Provider attributes must follow the service.namespace format.
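One simple way to satisfy this (an assumption for illustration, not necessarily how the linked kustomize files do it) is to pass the name via the standard OpenTelemetry SDK environment variable in the Deployment:

```yaml
# Sketch: service name in service.namespace form, set through the
# standard OpenTelemetry SDK environment variable.
env:
  - name: OTEL_SERVICE_NAME
    value: frontend.demo   # the "demo" namespace is an assumption
```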

Config files: deployments/kustomize/

Summary

Istio is a complex service mesh. Any observability solution that helps detect and diagnose issues is useful both in the deployment+test pipeline and in Operations & Maintenance. It’s hard to detect issues in time without tracing solutions like the ones above.
