Velocity Santa Clara and Monitorama 2016: It’s a Wrap
My head is still spinning with all of the great content, amazing speakers, new tools and technologies that were covered at Monitorama 2016 and Velocity Santa Clara; however, there are still many challenges to overcome.
To get the most out of any experience, sometimes you have to take a step back, get back down to earth, wash away all the spin and hype, and condense the information into something that is actionable and applicable to your business.
Since 1997—when I was given the responsibility of performance monitoring at DoubleClick—I have been chasing the dream of finding the single tool, the single pane, the single screen that will show me in a very simple way my entire infrastructure (which at the time consisted of 17 datacenters, 5000 servers, 2000 network devices, databases, storage arrays, ETL, 90 different applications in 7 different languages, 68+ internet uplinks, etc.); unfortunately, that dream never materialized during my tenure.
After realizing that such a tool did not exist at the time, we wound up using a bunch of tools to try to solve the same problem, which was fine.
For example:
- 2 external synthetic products
- Sitescope to do internal synthetic testing
- Adlex in 2 datacenters to do RUM
- Our own homegrown APM (for application tracing and heartbeat)
- Web Monitoring layer on each application for people to access the app and check it
- Network monitoring from Cisco
And all of this fed into SMARTS, a really cool event correlation solution.
In my 10 years in that role, I spent so much money on software licenses and FTE costs for monitoring, it’s not even funny.
Perhaps one day we will get there. However, what we do is complex, what we monitor is complex, and what we want to know is not simple, so what we ask of monitoring tools is not easy. And the complexity of all of these things is increasing every year, not decreasing.
What monitoring tools (both commercial and open source) have to stop doing is selling a pipe dream. That Swiss army knife, the one tool that can do everything and do it best, does not exist in the monitoring world. IT professionals must stop chasing that pipe dream as well. I almost lost my job one year after spending $3 million on one of those tools. That figure did not include the implementation, which at the time cost $3 in consulting for every $1 spent on software. A year later, I had nothing to show for it besides a bunch of consultants taking up space.
One of the best themes from this year’s Monitorama was the “human” factor: placing a premium on the people who consume the monitoring data and implement monitoring tools. We need to do a better job in certain areas.
These key takeaways include:
- We need to make the tools easier to use.
- It’s crucial to train people on how to install, run, and (most importantly) UNDERSTAND THE MONITORING DATA. This includes the math behind it, such as averages vs. medians and when to use the 95th percentile.
- You still need people to be able to understand the context of an alert. I was pleasantly surprised to come across some unique professional titles at Monitorama like Visibility Engineer and Observability Engineer. They are the people who are able to look at their company’s infrastructure, decipher the data, and make sense of the patterns. In fact, if I had started the QoS team in 1999 at DoubleClick, I would have used those titles as well.
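To make the point about averages, medians, and the 95th percentile concrete, here is a minimal Python sketch. The response times are made up for illustration, and the `percentile` helper is a hypothetical name that uses the simple nearest-rank method; real monitoring tools may interpolate differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 18 fast requests, 2 slow ones.
response_times_ms = [100] * 18 + [2000, 3000]

summary = {
    "mean": statistics.mean(response_times_ms),
    "median": statistics.median(response_times_ms),
    "p95": percentile(response_times_ms, 95),
}
print(summary)
```

With this sample the mean comes out at 340 ms, a value no request actually experienced; the median is 100 ms, which hides the slow requests entirely; and the 95th percentile is 2000 ms, which exposes the tail. Which number belongs on a dashboard depends on the question being asked, and that is the part people need training on.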
One interesting trend in IT monitoring is the emergence of a “size contest.” People are so proud to be collecting millions of metrics per second and monitoring databases that require petabytes.
When I built our own agentless APM solution at DoubleClick in 2000, we started collecting 500,000 metrics per hour, and I remember going back to the entire team and asking them to find a way to cut it down. I thought it was insane: how useful could that much data be, and for how long should it be kept? Everyone wanted to store the data for three to five years, leaving me to justify a 1 PB storage system. The point here is that ROI matters! It’s not about having the biggest monitoring system, but the most efficient and cost-effective one. You cannot have a monitoring system so large and complex that it requires its own monitoring system.
A monitoring tool—or tools—must be fast, reliable, easy to deploy, and easy to repair. They also need to be simple and inexpensive to run; the cost of buying or building tends to be at the forefront of our minds, but keep in mind there is also a cost of running. One company mentioned that their AWS storage cost for just the monitoring data was seven figures a year!
During one of the sessions at Monitorama, Pinterest described the evolution of their monitoring system, and it was incredible. The system became just as complex to run and monitor as the actual applications being monitored. When a monitoring system requires front-end and back-end load balancers, it’s time to stop and ask yourself if it’s worth it.
My advice:
- Put an end to the hype over the “single pane of glass.” People build and buy a bunch of tools to get their job done, and there is nothing wrong with that.
- Monitoring tools exist to answer a few simple but important questions:
  - Is something broken or going to break?
  - Where and why is it broken?
- A monitoring tool needs to be measured by how much it reduces the time to detect an issue, the negative business impact it is able to preempt, and how it can help you fix an outage. Even better is a tool that can tell you all of this before any real user notices something is wrong.
- Keep the end user experience in mind. Internal monitoring is very important, but your users do not live in your datacenter or cloud provider. Think reachability, think transaction health. I have been preaching the Quality of Experience equation for two decades now: QoE = Availability + Transaction Time + Transaction Health.
- We need more people with professional titles containing words like Visibility or Observability.
- We need more standards around methodologies and metrics, and a better understanding of statistics. Is sampling good or bad?
- Performance and monitoring need to be at the front of the Dev cycles, not after the fact or when it hits production.
- We need application developers writing code to have a curiosity about the hardware and network. Just the same, we need Ops guys to have the reverse curiosity. In general, we need the entire IT family to understand each other and the business master we all serve. Ask yourself: Why do we have this application?
- We need to stop comparing the sizes of our monitoring systems and databases, and start talking about how a monitoring project or tool deployment saved time, money, and business, increased revenue, impacted brand, and helped engineers and ops work faster and more efficiently and sleep better at night!
- Add a new abbreviation, borrowed from business, to your lexicon: ROI (Return on Investment), where the investment is the resources, time, and money spent.
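As a rough illustration of the Quality of Experience equation in the list above, here is a small Python sketch that derives the three components from synthetic check results. This is one possible reading, not a standard formula: the `Check` record, its field names, and the choice to report the three components side by side rather than add them into one number are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Check:
    """One synthetic test run from outside the datacenter (hypothetical shape)."""
    reachable: bool      # did the user-facing endpoint respond at all?
    duration_ms: float   # how long the transaction took
    content_ok: bool     # did the response contain what the user expected?


def qoe_components(checks):
    """Summarize availability, transaction time, and transaction health."""
    if not checks:
        raise ValueError("checks must not be empty")
    reached = [c for c in checks if c.reachable]
    availability = len(reached) / len(checks)
    # Time and health only make sense for checks that reached the service.
    transaction_time_ms = median(c.duration_ms for c in reached) if reached else None
    transaction_health = (
        sum(c.content_ok for c in reached) / len(reached) if reached else 0.0
    )
    return {
        "availability": availability,
        "transaction_time_ms": transaction_time_ms,
        "transaction_health": transaction_health,
    }


# Made-up sample: one unreachable run, one run that returned a broken page.
checks = [
    Check(reachable=True, duration_ms=200, content_ok=True),
    Check(reachable=True, duration_ms=400, content_ok=True),
    Check(reachable=True, duration_ms=900, content_ok=False),
    Check(reachable=False, duration_ms=0, content_ok=False),
]
result = qoe_components(checks)
print(result)
```

Keeping the three components separate makes it easier to see which part of the user experience degraded: a service can be fully available and fast while still returning the wrong content.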
I really enjoyed Monitorama 2016; it was my first time there, but certainly not my last. I strongly encourage everyone to attend next year.
Mehdi