- the ability to understand a system by examining its outputs
- metrics
- logs
- traces
faster troubleshooting
- Raise an alarm when something goes wrong
- Know ahead of time the metrics or events we are interested in
What do we want to monitor?
What do we want to alert on?
Capturing detailed information
- record events
- errors
- provide contextual data
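A minimal sketch of the three points above (record events, capture errors, attach contextual data), using only Python's standard `logging` module. The logger name `checkout`, the `context` key, and the field names `order_id` / `user_id` are made up for illustration. JSON output is one common choice because log aggregators can index the fields, but it is not the only option.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra={"context": {...}}` become an attribute
        # on the record; merge them in as contextual data.
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(stream):
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout or a log file
log = build_logger(buffer)

# Record an event with context.
log.info("order placed", extra={"context": {"order_id": "o-123", "user_id": "u-42"}})

# Record an error, including the stack trace.
try:
    1 / 0
except ZeroDivisionError:
    log.exception("payment failed", extra={"context": {"order_id": "o-123"}})
```

Each call produces a single line of JSON, so a collector can filter on `order_id` without parsing free text.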
ELK stack
- Elasticsearch for search
- Logstash for collecting and aggregating logs
- Kibana for visualisation
Distributed Tracing
Ability to track an action end-to-end
- Spans: a logical operation
- Traces: multiple spans
For example: A user submits a request which travels through many backend services
The request goes from A->B->C. The work done in each service (A, B, C) is an individual span, and the trace is the full A->B->C path made up of those spans.
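A rough sketch of how spans could be linked for the A, B, C request above, using only the standard library. The `Span` class and the `start_trace` / `child_of` helpers are hypothetical and are not the API of any tracing library. The idea shown is that every span carries the same trace ID, and each span points at its parent.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One logical operation, e.g. the work done inside one service."""
    name: str
    trace_id: str                     # shared by every span in the trace
    parent_id: Optional[str] = None   # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def start_trace(name):
    """Create the root span and a fresh trace ID."""
    return Span(name=name, trace_id=uuid.uuid4().hex)


def child_of(parent, name):
    """Create a span in the same trace, linked to its parent."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


# The request enters service A, which calls B, which calls C.
a = start_trace("service-A")
b = child_of(a, "service-B")
c = child_of(b, "service-C")

trace = [a, b, c]  # all spans that share one trace_id
for span in trace:
    print(span.name, "parent:", span.parent_id)
```

In a real system the trace ID and parent span ID would travel between services, commonly in request headers, so each service can attach its span to the right trace.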
- used for displaying data
- alerting
- monitor infra
- reduce time to find issues
- view trends in data
collects and stores metrics as time series data
- typically the application exposes metrics via an HTTP endpoint
Core metrics:
counter: only ever increases (resets to zero on restart)
- example: number of requests
- tasks completed
- errors reported
gauge: can increase or decrease (e.g. CPU usage, memory usage, other dynamic data)
- items in a queue
- disk space used
histogram: distribution of values (e.g. latency P95, P98, P99)
- request durations
- response sizes
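To make the difference between the three types concrete, here is a toy model in plain Python. It is a simplification for illustration, not how any particular client library implements them. The class and method names, the bucket bounds, and the sample values are all assumptions.

```python
import bisect


class Counter:
    """Only goes up, e.g. number of requests or errors reported."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Can go up or down, e.g. items in a queue or disk space used."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

    def inc(self, amount=1):
        self.value += amount

    def dec(self, amount=1):
        self.value -= amount


class Histogram:
    """Counts observations per bucket, e.g. request durations."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                        # bucket upper bounds
        self.bucket_counts = [0] * (len(self.bounds) + 1)   # last bucket is +Inf
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # A value equal to a bound falls into that bound's bucket.
        idx = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.total += value

    def quantile_upper_bound(self, q):
        """Smallest bucket bound covering at least fraction q of observations."""
        if self.count == 0:
            return None
        target = q * self.count
        running = 0
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += n
            if running >= target:
                return bound
        return float("inf")


requests = Counter()
queue_depth = Gauge()
latency = Histogram([0.1, 0.5, 1.0])  # seconds; bounds chosen arbitrarily

for seconds in [0.05, 0.2, 0.3, 0.7, 2.0]:
    requests.inc()
    latency.observe(seconds)

queue_depth.set(10)
queue_depth.dec(3)
```

Because a histogram only keeps bucket counts, a percentile read from it is an estimate bounded by the bucket edges, so the choice of bounds controls how precise P95 or P99 can be.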
- Define what you want to expose (request counts, error rates, response times…)
- start metric collection in application
- (most use cases) start an HTTP server which exposes these metrics
- Prometheus will scrape this endpoint
- Grafana will query Prometheus to display content
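The steps above can be sketched with the standard library alone. This is a hand-rolled illustration under stated assumptions; a real application would more likely use a Prometheus client library. The metric names (`app_requests_total`, `app_errors_total`), the `inc` / `render_metrics` / `make_server` helpers, and the port are all made up. The output follows the Prometheus text exposition format (a `# TYPE` line, then `name value`).

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Step 1: define what to expose (hypothetical counters).
_lock = threading.Lock()
_counters = {"app_requests_total": 0, "app_errors_total": 0}


def inc(name, amount=1):
    """Step 2: collect metrics in the application by bumping counters."""
    with _lock:
        _counters[name] += amount


def render_metrics():
    """Render all counters in the Prometheus text exposition format."""
    lines = []
    with _lock:
        for name in sorted(_counters):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {_counters[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Step 3: serve the metrics over HTTP at /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=8000):
    """Build the server without starting it; port 8000 is arbitrary."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling `make_server().serve_forever()` would serve the metrics at `http://127.0.0.1:8000/metrics`. Prometheus would then need a scrape target pointing at that address, and Grafana would read the stored series from Prometheus rather than from the application directly.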