Ops @ EECC

An introduction to observability

Three Pillars of Observability

observability: a measure of how well internal states of a system can be inferred from knowledge of its external outputs

logs
metrics
traces

Logs

“An event log is an immutable, timestamped record of discrete events that happened over time.”

Plaintext Log

String orderId = "689";
logger.info("Order {} saved", orderId);

22:53:03.689 [main] INFO  MyApplication - Order 689 saved

Structured Log (JSON)

import static net.logstash.logback.argument.StructuredArguments.v;
…
String orderId = "689";
logger.info("Order {} saved", v("orderId", orderId));

{
  "@timestamp": "2019-04-16T22:53:03.689+02:00",
  "@version": 1,
  "message": "Order 689 saved",
  "logger_name": "MyApplication",
  "thread_name": "main",
  "level": "INFO",
  "level_value": 20000,
  "orderId": "689"
}

Log Architecture

Kibana: Discover

Kibana: Filter by Log Level

Kibana: Filter by Key

Kibana: Visualizations & Dashboards

{
  "@timestamp": "2020-01-30T13:28:20.122+00:00",
  "@version": 1,
  …
  "mercuryRequest": {
    "executionTime": 93,
    "status": "COMPLETED",
    "success": true,
    "errorCodes": [],
    "requestType": "GetStockEntryChangesListResponse"
  }
}

kibana dashboard mg2m

Kibana: Visualizations & Dashboards

Metrics

“Metrics are a numeric representation of data measured over intervals of time.”

httpsessions_active 0.0
datasource_primary_active 0.0
mem 2074624.0
mem_free 1526867.0
processors 2.0
uptime 1.8064965529E10
heap_used 547756.0
threads 27.0
classes 12524.0
http_requests{status=200} 19652.0
http_requests{status=404} 42.0
http_requests{status=500} 2.0

Types of Metrics

counter: a value that can only increase
gauge: a value that can arbitrarily go up and down
histogram: sampled observations counted in buckets (e.g. request durations, response sizes)
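
The three types can be made concrete with a minimal plain-Java sketch. This is an illustration only, not how Micrometer or Prometheus implement them: the class names `SimpleCounter`, `SimpleGauge`, and `SimpleHistogram` are hypothetical, and the histogram keeps per-bucket counts rather than the cumulative buckets that Prometheus exposes.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.DoubleSupplier;

// Illustrative only: hypothetical classes, not a real metrics library.
final class SimpleCounter {
    private final AtomicLong value = new AtomicLong();

    // A counter only ever goes up, so negative increments are rejected.
    void increment(long amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("counter cannot decrease");
        }
        value.addAndGet(amount);
    }

    long value() {
        return value.get();
    }
}

final class SimpleGauge {
    private final DoubleSupplier source;

    // A gauge samples the current value on demand; it may go up or down.
    SimpleGauge(DoubleSupplier source) {
        this.source = source;
    }

    double value() {
        return source.getAsDouble();
    }
}

final class SimpleHistogram {
    private final double[] upperBounds; // ascending, inclusive upper bounds
    private final long[] bucketCounts;  // one extra slot for the overflow bucket

    SimpleHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.bucketCounts = new long[upperBounds.length + 1];
    }

    // Count the observation in the first bucket whose bound is >= value.
    void observe(double value) {
        int i = 0;
        while (i < upperBounds.length && value > upperBounds[i]) {
            i++;
        }
        bucketCounts[i]++;
    }

    long countInBucket(int index) {
        return bucketCounts[index];
    }
}
```

A gauge built as `new SimpleGauge(() -> list.size())` follows the list as it grows and shrinks, whereas a counter would only ever record additions.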

The Four Golden Signals

latency: the time it takes to service a request; distinguish between the latency of successful and failed requests
traffic: a measure of how much demand is being placed on the system (e.g. HTTP requests per second)
errors: the rate of requests that fail (e.g., HTTP 500s)
saturation: how "full" a service is (e.g. in terms of CPU, memory, I/O)
latency increases are often a leading indicator of saturation

also concerned with predictions, e.g. "the database will fill its hard drive in 4 hours"

from landing.google.com/sre/sre-book/toc/
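
Two of the signals above reduce to simple arithmetic, sketched here in plain Java. The class `GoldenSignals` and its method names are hypothetical, and the forecast assumes that disk usage grows linearly, which real systems often do not.

```java
// Illustrative only: hypothetical helper, not part of any monitoring tool.
final class GoldenSignals {
    private GoldenSignals() {
    }

    // Errors: share of requests that failed within some window.
    static double errorRatio(long failedRequests, long totalRequests) {
        if (totalRequests == 0) {
            return 0.0; // no traffic, so nothing failed
        }
        return (double) failedRequests / totalRequests;
    }

    // Saturation forecast: linear extrapolation from two disk usage samples.
    // Returns positive infinity when usage is not growing.
    static double hoursUntilFull(double usedBefore, double usedNow,
                                 double hoursBetween, double capacity) {
        double growthPerHour = (usedNow - usedBefore) / hoursBetween;
        if (growthPerHour <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (capacity - usedNow) / growthPerHour;
    }
}
```

For example, with 400 GB used two hours ago, 600 GB used now, and a 1000 GB disk, `GoldenSignals.hoursUntilFull(400, 600, 2, 1000)` returns 4.0.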

Spring Boot / Micrometer

Includes many auto-configured metrics about

JVM
- memory usage
- garbage collection
- thread utilization
- number of classes loaded/unloaded
CPU
Uptime
HTTP requests
database connections

Add Application Metrics

import io.micrometer.core.instrument.*;
import java.util.ArrayList;
import java.util.List;

…

@Autowired private MeterRegistry registry;

…

// counter
Counter counter = registry.counter("my_counter", "label_key", "some_value");
counter.increment();
counter.increment(42);

// gauge
List<String> list = new ArrayList<>();
registry.gauge("my_gauge", list, List::size);

// timer
Timer timer = registry.timer("my_timer");
timer.record(() -> {
    // my operation
});

Metric Collection & Monitoring Tools

Prometheus Time Series Database

prometheus metric graph

prometheus.eecc.info

Grafana

grafana dashboard

grafana.eecc.info

Prometheus Alerting Rules

Alertmanager Notifications

alertmanager.eecc.info

Prometheus Pull Model

eecc-internal/monitoring
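
In a pull model, each application exposes its current metric values over HTTP and the monitoring server scrapes that endpoint at regular intervals. A minimal sketch of the application side, using only the HTTP server that ships with the JDK: the class `MetricsEndpoint` is hypothetical, the output is a simplified "name value" format without labels, and a real Spring Boot service would rely on its Micrometer integration instead.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative only: hypothetical class, not the Prometheus client library.
final class MetricsEndpoint {
    // Sorted map keeps the rendered output stable between scrapes.
    private final Map<String, Double> values = new ConcurrentSkipListMap<>();

    void set(String name, double value) {
        values.put(name, value);
    }

    // Render one "name value" line per metric.
    String render() {
        StringBuilder text = new StringBuilder();
        for (Map.Entry<String, Double> entry : values.entrySet()) {
            text.append(entry.getKey()).append(' ')
                .append(entry.getValue()).append('\n');
        }
        return text.toString();
    }

    // Expose the current values at /metrics so a scraper can pull them.
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

The application never pushes data anywhere; it only answers when asked, so the scraper controls the collection interval.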

Datadog Time Series Database

mcc9-metrodr.datadoghq.com

Datadog Alerts

Datadog Alert Notifications

Traces

“A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.”

span: represents an individual unit of work done in a distributed system
trace: a collection of spans that share the same ID, representing a unique transaction handled by an application and its constituent services
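
The relationship between spans and traces can be sketched as a small data structure. The types `Span` and `Traces` and their fields are hypothetical and far simpler than the data model of a real tracing tool, which would also record parent-child links between spans.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical types, not the data model of any tracing tool.
final class Span {
    final String traceId;   // shared by all spans of one request
    final String name;      // the unit of work, e.g. a call to one service
    final long startMillis;
    final long endMillis;

    Span(String traceId, String name, long startMillis, long endMillis) {
        this.traceId = traceId;
        this.name = name;
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    long durationMillis() {
        return endMillis - startMillis;
    }
}

final class Traces {
    private Traces() {
    }

    // Group spans by trace ID, so that each map entry is one trace.
    static Map<String, List<Span>> groupByTraceId(List<Span> spans) {
        Map<String, List<Span>> traces = new LinkedHashMap<>();
        for (Span span : spans) {
            traces.computeIfAbsent(span.traceId, id -> new ArrayList<>()).add(span);
        }
        return traces;
    }

    // End-to-end duration of a trace: earliest start to latest end.
    static long totalDurationMillis(List<Span> trace) {
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (Span span : trace) {
            start = Math.min(start, span.startMillis);
            end = Math.max(end, span.endMillis);
        }
        return trace.isEmpty() ? 0 : end - start;
    }
}
```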

Tracing Process

One Platform Trace ID

introduced special HTTP header X-One-Platform-Trace-Id
generated on the initial request
passed on to subsequent requests
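
The header handling above can be sketched in plain Java. Only the header name comes from the slides. The class `TraceIdPropagation`, the use of a random UUID as the ID format, and the plain `Map` standing in for HTTP headers are assumptions. Real header names are case-insensitive, which this sketch ignores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative only: hypothetical helper; the ID format (UUID) is an assumption.
final class TraceIdPropagation {
    static final String HEADER = "X-One-Platform-Trace-Id";

    private TraceIdPropagation() {
    }

    // Reuse the incoming trace ID, or generate one on the initial request.
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(HEADER);
        if (existing != null && !existing.isEmpty()) {
            return existing;
        }
        return UUID.randomUUID().toString();
    }

    // Build the headers for a subsequent request, carrying the same trace ID.
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(HEADER, traceId);
        return headers;
    }
}
```

Each service would call `resolveTraceId` once per incoming request and attach the result to every downstream call and log entry, so that all records of one request share the same ID.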

Tracing Tools

Datadog APM Example

datadog apm
