Issue 181
I hope you’re having a lovely weekend and enjoying the weather as we transition into the next season. There were some really interesting articles this week, with a particular emphasis on Machine Learning observability and Prometheus monitoring. Oh, and if you love network monitoring you should definitely check out the new tool that records IP flows to ClickHouse… enjoy! 😺🌈🚲
This issue is sponsored by:
Are your engineers struggling to locate monitoring data quickly? Do they need to retain more data for longer periods of time? Well then, it’s time to level up. Chronosphere dives into the Four Signs It’s Time to Level Up Prometheus at the recent Cloud Native Day Virtual Summit. Watch the session on-demand here.
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
Sherlock.io: An Upgraded Machine Learning Monitoring System
Some fascinating insights into eBay’s “Sherlock.io” monitoring system, including its use of machine learning for anomaly detection. Not gonna lie, some of their algorithms broke my brain.
A beginner’s guide to OpenTelemetry
This article looks to be a good jumping-off point for folks curious about OpenTelemetry, with plenty of links to more in-depth resources elsewhere.
Prevent metrics explosion in Prometheus
An unexpected flood of new metrics can ruin anyone’s day. Here’s one pattern for filtering out unwanted metrics in your Prometheus cluster.
Towards Machine Learning Observability at Etsy
I haven’t yet had to manage the monitoring or observability of any machine learning systems, but it’s interesting to see how Etsy’s engineers are tackling this challenging surface area.
AWS Batch Monitoring on Grafana
A useful pattern for monitoring AWS Batch jobs with InfluxDB and Grafana.
Metrics, Tracing, and Logging: Three Methods for Better Observability
Another “what is Observability” article, but one that strikes a nice balance between readability and depth.
Discover how Adevinta migrated a mission-critical Redis cluster without affecting daily operations. In passing, SRE Benjamin Riou took the opportunity to save AWS costs, implement Terraform IaC and ensure extra backups. The state of the art for achieving maximum impact through a single migration. Visit the blog. (SPONSORED)
Part of a larger series on machine learning operations (“MLOps”), this article focuses on the monitoring considerations of these services. Although the author seems to be targeting data analysts, there are some useful takeaways if you’re tasked with supporting data folks in your company.
If you’ve heard about Incident Commanders but were too shy to ask, here’s a very quick description of the role and their responsibilities.
Observability Concepts you should know
A two-part series covering a variety of observability topics. The first part covers many of the high-level concepts and priorities (from an SRE’s perspective) while the second part dives a bit deeper into tracing and OpenTelemetry.
The release of Grafana 9.1 includes a variety of subtle improvements over its predecessor. There’s a clear emphasis on sharing with external parties, but I worry about the potential for data exfiltration given the project’s history of vulnerabilities in its authorization controls.
Tools
“flow-collector aggregates IP flow data for storage in a ClickHouse database.”
Job Opportunities
Cloud Engineer at SparkMeter (US Remote)
Junior Site Reliability Engineer at Sesame Workshop (US Remote)
Senior Platform Engineer at Replicated (Remote)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor