Ops @ EECC

An introduction to observability

Three Pillars of Observability

observability

a measure of how well internal states of a system can be inferred from knowledge of its external outputs

  • logs

  • metrics

  • traces

Logs

“An event log is an immutable, timestamped record of discrete events that happened over time.”

logs

Plaintext Log

String orderId = "689";
logger.info("Order {} saved", orderId);
22:53:03.689 [main] INFO  MyApplication - Order 689 saved

Structured Log (JSON)

import static net.logstash.logback.argument.StructuredArguments.v;
…
String orderId = "689";
logger.info("Order {} saved", v("orderId", orderId));
{
  "@timestamp": "2019-04-16T22:53:03.689+02:00",
  "@version": 1,
  "message": "Order 689 saved",
  "logger_name": "MyApplication",
  "thread_name": "main",
  "level": "INFO",
  "level_value": 20000,
  "orderId": "689"
}

Log Architecture

elastic logo
log architecture

Kibana: Discover

kibana discover

Kibana: Filter by Log Level

kibana filter by level

Kibana: Filter by Key

kibana filter by key

Kibana: Visualizations & Dashboards

{
  "@timestamp": "2020-01-30T13:28:20.122+00:00",
  "@version": 1,
  …
  "mercuryRequest": {
    "executionTime": 93,
    "status": "COMPLETED",
    "success": true,
    "errorCodes": [],
    "requestType": "GetStockEntryChangesListResponse"
  }
}

 kibana dashboard mg2m

Kibana: Visualizations & Dashboards

kibana dashboard gtin manager

Metrics

“Metrics are a numeric representation of data measured over intervals of time.”

metrics structure
httpsessions_active 0.0
datasource_primary_active 0.0
mem 2074624.0
mem_free 1526867.0
processors 2.0
uptime 1.8064965529E10
heap_used 547756.0
threads 27.0
classes 12524.0
http_requests{status=200} 19652.0
http_requests{status=404} 42.0
http_requests{status=500} 2.0
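As a rough illustration of how such labelled counters can be put to use, the sketch below derives an error ratio from per-status request counts. The class and method names are hypothetical, and treating every 5xx status as an error is an assumption. Because the counters are cumulative, the result is a ratio over the whole process lifetime, not a current rate.

```java
import java.util.Map;

/** Hypothetical helper: derives an error ratio from per-status request counters. */
final class ErrorRatio {

    /**
     * Returns the share of requests that ended with a 5xx status.
     * The map key is the HTTP status code, the value is the cumulative counter.
     */
    static double of(Map<Integer, Double> requestsByStatus) {
        double total = 0.0;
        double errors = 0.0;
        for (Map.Entry<Integer, Double> entry : requestsByStatus.entrySet()) {
            total += entry.getValue();
            if (entry.getKey() >= 500) { // assumption: only server errors count as failures
                errors += entry.getValue();
            }
        }
        return total == 0.0 ? 0.0 : errors / total;
    }
}
```

With the sample values above (19652, 42 and 2 requests), this gives 2 out of 19696 requests, roughly 0.01 %.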

Types of Metrics

counter

a cumulative value that can only increase (or be reset to zero on restart)

gauge

a value that can arbitrarily go up and down

histogram

sampled observations counted in buckets (e.g. request durations, response sizes)
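To make the bucket idea concrete, here is a minimal plain-Java sketch of a histogram with cumulative buckets, similar in spirit to how Prometheus counts observations. The class `BucketHistogram` and its methods are hypothetical and not part of any library.

```java
import java.util.Arrays;

/** Hypothetical histogram with cumulative buckets (each bucket counts values <= its upper bound). */
final class BucketHistogram {
    private final double[] upperBounds; // sorted ascending
    private final long[] counts;        // one extra slot at the end for the +Inf bucket
    private long count;
    private double sum;

    BucketHistogram(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.counts = new long[this.upperBounds.length + 1];
    }

    /** Records one observation, e.g. a request duration in seconds. */
    void observe(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                counts[i]++;
            }
        }
        counts[upperBounds.length]++; // the +Inf bucket sees every observation
        count++;
        sum += value;
    }

    long bucketCount(int index) { return counts[index]; }
    long count() { return count; }
    double sum() { return sum; }
}
```

Observing 0.05, 0.3 and 2.0 with bounds 0.1, 0.5 and 1.0 yields the bucket counts 1, 2, 2 and 3 (the last one being the +Inf bucket).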

The Four Golden Signals

latency

the time it takes to service a request; the latency of successful requests should be tracked separately from that of failed requests

traffic

a measure of how much demand is being placed on the system (e.g. HTTP requests per second)

errors

the rate of requests that fail (e.g., HTTP 500s)

saturation

how "full" a service is (e.g. in terms of CPU, memory, I/O)

latency increases are often a leading indicator of saturation

also concerned with predictions, e.g. "the database will fill its hard drive in 4 hours"
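A prediction such as "the database will fill its hard drive in 4 hours" can be approximated by simple linear extrapolation. The sketch below assumes that usage keeps growing at the rate seen between two measurements. The class and parameter names are hypothetical.

```java
/** Hypothetical helper: linear extrapolation of resource usage. */
final class SaturationForecast {

    /**
     * Estimates the hours left until a resource is full, assuming linear growth.
     * Returns Double.POSITIVE_INFINITY when usage is flat or shrinking.
     */
    static double hoursUntilFull(double usedBefore, double usedNow,
                                 double hoursBetween, double capacity) {
        double growthPerHour = (usedNow - usedBefore) / hoursBetween;
        if (growthPerHour <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (capacity - usedNow) / growthPerHour;
    }
}
```

For example, a 500 GB disk that grew from 400 GB to 420 GB within the last hour has 80 GB left at 20 GB per hour, so `hoursUntilFull(400, 420, 1, 500)` returns 4.0.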

Spring Boot / Micrometer

Includes many auto-configured metrics about

  • JVM

    • memory usage

    • garbage collection

    • thread utilization

    • number of classes loaded/unloaded

  • CPU

  • Uptime

  • HTTP requests

  • database connections

Add Application Metrics

import io.micrometer.core.instrument.*;

…

@Autowired private MeterRegistry registry;

…

// counter
Counter counter = registry.counter("my_counter", "label_key", "some_value");
counter.increment();
counter.increment(42);

// gauge
List<String> list = new ArrayList<>();
registry.gauge("my_gauge", list, List::size);

// timer
Timer timer = registry.timer("my_timer");
timer.record(() -> {
    // my operation
});

Metric Collection & Monitoring Tools

  • Prometheus

  • Grafana

  • Datadog

Prometheus Time Series Database

prometheus metric graph

Grafana

grafana dashboard

Prometheus Alerting Rules

prometheus alerts

Alertmanager Notifications

prometheus alert notification
prometheus alert notification custom

Prometheus Pull Model

prometheus targets

Datadog Time Series Database

datadog dashboard

Datadog Alerts

datadog monitor

Datadog Alert Notifications

datadog alert notification

Traces

“A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.”

trace
span

represents an individual unit of work done in a distributed system

trace

a collection of spans that share the same ID, representing a unique transaction handled by an application and its constituent services
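The relationship between spans and traces can be sketched in a few lines of plain Java: spans carry a trace ID, and a trace is the group of spans sharing that ID. The types below are hypothetical and far simpler than what real tracing tools record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model: a trace is the set of spans that share one trace ID. */
final class TraceGrouping {

    /** One unit of work, tagged with the ID of the trace it belongs to. */
    static final class Span {
        final String traceId;
        final String spanId;
        final String operation;

        Span(String traceId, String spanId, String operation) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.operation = operation;
        }
    }

    /** Groups spans by trace ID, keeping traces in first-seen order. */
    static Map<String, List<Span>> groupByTrace(List<Span> spans) {
        Map<String, List<Span>> traces = new LinkedHashMap<>();
        for (Span span : spans) {
            traces.computeIfAbsent(span.traceId, id -> new ArrayList<>()).add(span);
        }
        return traces;
    }
}
```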

Tracing Process

trace process

One Platform Trace ID

  • introduced special HTTP header X-One-Platform-Trace-Id

  • generated on initial request

  • passed on to subsequent requests

kibana trace id
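The propagation rule above might look roughly like this in plain Java. Only the header name comes from the slides. The class, the use of a UUID as ID format and the exact-match header map are assumptions made for illustration (real HTTP header names are case-insensitive).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch of trace ID handling; not the actual platform implementation. */
final class TraceIdPropagation {
    static final String TRACE_HEADER = "X-One-Platform-Trace-Id";

    /** Reuses the incoming trace ID, or generates one if this is the initial request. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(TRACE_HEADER);
        if (existing != null && !existing.isEmpty()) {
            return existing;
        }
        return UUID.randomUUID().toString(); // assumption: the ID format is a random UUID
    }

    /** Builds the headers for a subsequent request so the trace ID is passed on. */
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }
}
```

Writing the resolved ID into every log event as well is what would make it possible to filter all log lines of one request by key.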

Tracing Tools

  • Jaeger

  • OpenTracing

  • Zipkin

  • Datadog

  • Elastic

Datadog APM Example

datadog apm