This is the information age, or so we hear quite often these days. I say this with little doubt, given the ever-increasing amount of information we produce and the steadily accelerating commoditization of storage media and devices.
If knowledge is power, and information is what knowledge is made of, then understanding is the machinery that converts that power into useful work. Understanding and information are what I want to talk about in this entry. We have been working for some time on metric collection, storage and visualization in the MyRack Manager UI, and there are a number of metrics which we capture and attempt to visualize. Most have to do with the number of bytes moved in some form, be that at the network layer, the services layer, or the cache, which is effectively system memory. We are exposing only a tiny fraction of the information that the system aggregates constantly.
BrickStorOS has a massive metrics storage facility in the kernel, and we can access this data from userspace via a library in C, bindings in Go and Python, etc. Very appropriately named, kstats (kernel statistics) are units of data, or bits of information. There are statistics for all sorts of things one could ever want to know. Many are only useful to developers, at least in their unrefined form, but many others, even with little or no refinement, are useful to the typical administrative user.
One reason BrickStorOS is able to store all this information in the kernel, and there is a lot of it, is that the data is continuously updated in place and no history is retained. In other words, most of the data is counters, and they simply keep increasing without ever being reset. Regardless of whether a system has been online for just a few hours or for years, the amount of memory used to store this information does not change. This is a classic case of: if you don’t use it, you lose it.
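To make the constant-memory point concrete, here is a minimal sketch in Python. It is not how kstat is implemented; the class and the counter names (`nread`, `nwritten`) are hypothetical and chosen only for illustration. The point is that each counter is a single value updated in place, so memory use depends on how many counters exist, not on how many events have been counted.

```python
class CounterTable:
    """A fixed set of counters that only ever grow (illustrative, not kstat)."""

    def __init__(self, names):
        # One integer per counter; no history is kept anywhere.
        self._counters = {name: 0 for name in names}

    def add(self, name, amount=1):
        # Update in place: the previous value is overwritten and lost.
        self._counters[name] += amount

    def snapshot(self):
        # Return a copy of the current values for a collector to read.
        return dict(self._counters)


# Hypothetical counter names, used only for this example.
table = CounterTable(["nread", "nwritten"])
for _ in range(1000):
    table.add("nread", 512)  # a thousand 512-byte reads

current = table.snapshot()
```

After a thousand updates the table still holds exactly two values. Anything we want to know about how those values changed over time has to be captured by whoever reads the snapshots.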
Our current metrics processing engine essentially collects this data at fixed intervals. In the cases where we are trying to measure how many times something occurred between two samples, we only have to store the previous sample and compare it against the latest sample to get our delta. Counters that just grow forever in one direction are very helpful in this way.
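The sampling approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual engine: the `Sample` and `DeltaTracker` names are hypothetical, timestamps are assumed to be in seconds, and the handling of a counter that goes backwards (for example after a reboot) is my own assumption about what a collector would reasonably do.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    timestamp: float  # seconds; the unit is an assumption for this sketch
    value: int        # reading of a counter that is expected to only grow


class DeltaTracker:
    """Keeps only the previous sample and derives deltas from it."""

    def __init__(self) -> None:
        self._last: Optional[Sample] = None

    def update(self, sample: Sample) -> Optional[Tuple[int, float]]:
        """Return (delta, rate_per_second), or None if no delta can be computed."""
        prev, self._last = self._last, sample
        if prev is None:
            return None  # first sample: nothing to compare against yet
        elapsed = sample.timestamp - prev.timestamp
        delta = sample.value - prev.value
        if elapsed <= 0 or delta < 0:
            # Assumption: treat a backwards clock or counter as a restart
            # and skip this interval rather than report a bogus value.
            return None
        return delta, delta / elapsed


tracker = DeltaTracker()
first = tracker.update(Sample(timestamp=0.0, value=1000))
second = tracker.update(Sample(timestamp=10.0, value=6000))
```

Here `first` is `None`, because there is nothing to compare against yet, and `second` is `(5000, 500.0)`: 5000 events over 10 seconds. The tracker never holds more than one sample per counter, which is what makes ever-growing counters so convenient to work with.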
Having all of this information within easy reach is extremely empowering. It feels like we can do so much with it and derive massive amounts of value. And this is where things get really difficult. It is true that we have a lot of data to share, but the downside of that is the curse of choice. There are many different parts of the system that could tell us a lot about what is going on at any given time, and of course the ability to capture frames over time helps us observe changes. But sometimes just knowing that something is changing, regardless of how often it changes or whether it reverts to previous norms, does not necessarily improve our understanding of the environment or of the system itself, and may not get us any closer to solving the root problem.
I am challenging myself to identify data points that I believe are useful enough, for enough people, to be included in our data collection framework. We do in fact store data points, which kstat constantly updates internally, in a database running on the box. Arguably this is the best design: the data comes from the kernel, where we want to minimize the memory footprint, while the database does not need to hold 100% of the data in memory. We also control this by setting limits on the database application, which prevents it from consuming massive amounts of memory.
I find it extremely difficult to point to any one thing and say, "of course everyone should be seeing this information." Largely this is because I recognize that some users will be less equipped than others to understand the meaning of the data. It would be a failure on our part to present users with information that they cannot understand or end up misinterpreting, possibly leading them to a poor decision as a consequence. It is easy to overwhelm someone with information, in particular information that does not entirely make sense. Over the years I have heard a lot of commentary about just how easy, or how complicated, various metrics systems and tools are. Generally, this referred to their setup, care and feeding. That is not, in my view, the biggest difficulty.
Going over the many thousands of data points that we generate routinely, I try to look at each one objectively and ask myself whether most people will understand what it means and how it should be used. This is the process of converting data into knowledge. Data is meaningless when it is not actionable, which is why we so often hear people say, "we give you actionable information!" And in my view, when we collect thousands upon thousands of metrics, picking out of that ocean of information the data that is truly actionable is very difficult indeed.
In this information age, I am seemingly complaining about the blessing of having too much information. The problem, or perhaps the opportunity, is to figure out how to take this data, much of which may be too low-level to really make sense to most people, and express it as something useful and actionable. Something I am noticing about tools that collect data is that they do not necessarily do anything to make information more comprehensible. That is not their purpose, but tools are becoming better at allowing us to slice and dice time-series information and visualize changes in the data. There is a lot of power in being able to spot patterns even if you don’t understand the data. But combining the ability to observe patterns in the timeline with an understanding of what the data means can reveal deeper insights into our systems. This is not true only of BrickStorOS; it is just as applicable to more general computing systems, other storage systems, etc.
In future posts, I hope to pick out specific metrics that our systems collect today and focus on them: why I chose to look at them, their significance, etc. Any one data point on its own is nearly meaningless, yet combined and laid over a timeline, data points become vastly more insightful. Hopefully I will come up with some good examples of data points that are meaningless in isolation but profoundly meaningful when placed onto a timeline.