Issue 181

I hope you’re having a lovely weekend and enjoying the weather as we transition into the next season. There were some really interesting articles this week, with a particular emphasis on Machine Learning observability and Prometheus monitoring. Oh, and if you love network monitoring you should definitely check out the new tool that records IP flows to ClickHouse… enjoy! 😺🌈🚲

This issue is sponsored by:

Are your engineers struggling to locate monitoring data quickly? Do they need to retain more data for longer periods of time? Well then, it’s time to level up. Chronosphere dives into the Four Signs It’s Time to Level Up Prometheus at the recent Cloud Native Day Virtual Summit. Watch the session on-demand here.

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Sherlock.io: An Upgraded Machine Learning Monitoring System

Some fascinating insights into eBay’s “Sherlock.io” monitoring system, including its use of machine learning for anomaly detection. Not gonna lie, some of their algorithms broke my brain.

A beginner’s guide to OpenTelemetry

This article looks to be a good jumping-off point for folks curious about OpenTelemetry, with plenty of links to more in-depth resources elsewhere.

Prevent metrics explosion in Prometheus

An unexpected flood of new metrics can ruin anyone’s day. Here’s one pattern for filtering out unwanted metrics in your Prometheus cluster.

Towards Machine Learning Observability at Etsy

I haven’t yet had to manage the monitoring or observability of any machine learning systems, but it’s interesting to see how Etsy’s engineers are tackling this challenging surface area.

AWS Batch Monitoring on Grafana

A useful pattern for monitoring AWS Batch jobs with InfluxDB and Grafana.

Metrics, Tracing, and Logging: Three Methods for Better Observability

Another “what is Observability” article, but one that strikes a nice balance between readability and depth.

Discover how Adevinta migrated a mission-critical Redis cluster without affecting daily operations. In passing, SRE Benjamin Riou took the opportunity to save AWS costs, implement Terraform IaC and ensure extra backups. The state of the art for achieving maximum impact through a single migration. Visit the blog. (SPONSORED)

MLOps: Monitoring phase

Part of a larger series on machine learning operations (“MLOps”), this article focuses on the monitoring considerations of these services. Although the author seems to be targeting data analysts, there are some useful takeaways if you’re tasked with supporting data folks in your company.

Incident Commander Role

If you’ve heard about Incident Commanders but were too shy to ask, here’s a very quick description of the role and its responsibilities.

Observability Concepts you should know

A two-part series covering a variety of observability topics. The first part covers many of the high-level concepts and priorities (from an SRE’s perspective) while the second part dives a bit deeper into tracing and OpenTelemetry.

Grafana 9.1 release

The release of Grafana 9.1 includes a variety of subtle improvements over its predecessor. There’s a clear emphasis on sharing with external parties, but I worry about the potential for data exfiltration given the project’s history of vulnerabilities in its authorization controls.

Tools

eait-itig/flow-collector

“flow-collector aggregates IP flow data for storage in a ClickHouse database.”

Job Opportunities

Cloud Engineer at SparkMeter (US Remote)

Junior Site Reliability Engineer at Sesame Workshop (US Remote)

Senior Platform Engineer at Replicated (Remote)

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor