Something I think deserves attention is what I have been spending a fair bit of time on lately: queueing of IOs. Without getting too deep into the weeds, I have been trying to be a student of the data, developing some intuition for what it is actually telling me and for how we can use it to give users better tools to understand load and project trends. My attention has mostly been at the boundary of ZFS, the filesystem and the driver, which is about as close as we can get to the disks before data physically leaves the HBA(s) on its way to them. I have been researching which metrics could be meaningful for expressing IO load as part of the metrics suite we are continuously trying to improve, and queueing seemed like a reasonable place to look. In this post I want to briefly describe the notion of queues, what the two queues I am specifically focusing on are, and why.
Some basics about queueing are in order. Queues are something we all experience every day; the best examples I can think of are gas stations and checkout lines in stores. If you think about it, there are usually two kinds of wait. We are often not the first to get into the line, so we wait to be serviced, and then, once it is our turn, we are serviced, which takes some time. Say the time it takes to perform this service, for example to check out a single customer, is T. If we are the fourth customer in line, we will wait 3T units of time to be serviced, since three customers are ahead of us, and then spend another T units actually receiving service, in this instance being checked out. Of course, each customer ahead of us waits less if they arrived when the line was shorter, or more if they arrived when it was longer. For instance, we may have entered the line when it was just three customers long, but the people in front of us may have arrived to a much longer line. Additional registers may have opened between the time they arrived and the time we arrived, reducing our wait thanks to a more parallel checkout.
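To put some numbers behind the checkout example, here is a tiny sketch in Python. It is purely illustrative, with made-up values, and it assumes a fixed service time and registers that work in lockstep; the point is just that the wait depends on your position in line and on how many registers are open.

```python
# Toy model of the checkout example: with a fixed service time T and r open
# registers, the customer at position k waits roughly ((k - 1) // r) * T
# before being served, then spends T being served.  Values are invented.

def wait_time(position, service_time, registers=1):
    """Time spent waiting before service begins (fixed service time)."""
    ahead = position - 1
    return (ahead // registers) * service_time

T = 2.0  # minutes per checkout, an assumed constant
print(wait_time(4, T))               # 4th in line, 1 register  -> 6.0
print(wait_time(4, T, registers=2))  # same position, 2 registers -> 2.0
```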
In reality, service times are rarely fixed. Think about checkout at a store: some customers have a large number of small items, some have one or two large items requiring special handling, some pay with cash, some by check, some by card, and so on. Each transaction is unique and requires a different amount of time. Same with cars: some need a little fuel, others a lot more, some are fast to pump, some slow, some drivers ask for help. The experience of those in line is very dynamic and changes constantly. There is a whole science of queueing theory, and I can only encourage learning more about it, but I cannot get into the details here without rewriting one of the many books already written on the subject.
It is also important to note that queues are not infinitely open-ended. Think in terms of a store: in the most extreme cases lines can get so long that they stretch outside the building, and this can lead to breakdowns and outages. The same thing can happen with storage. If a system is inundated with IOs that it cannot process fast enough, at some point the queue cannot grow any larger and things fall over: IOs are dropped and customers are handed IO errors (EIO), forcing them to drop what they were doing and re-queue the IOs for which they received errors. This can create a death spiral where the more IOs get dropped, the more re-queueing happens, the busier the storage system gets, the more IOs get dropped, and so on. You get the idea. Limits on queues have to exist; after all, without limits you may have queues with unbounded IO times, and that cannot be good for applications.
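As a purely illustrative sketch of what a bounded queue plus naive re-queueing looks like (this is not how any particular storage stack implements it, and the names and limits are invented):

```python
# Minimal sketch of a bounded wait queue.  When the queue is full, new IOs
# are rejected (think EIO).  A caller that immediately re-queues rejected
# IOs only adds to the load, which is the death spiral described above.
from collections import deque

class BoundedQueue:
    def __init__(self, limit):
        self.limit = limit
        self.items = deque()

    def enqueue(self, io):
        if len(self.items) >= self.limit:
            raise IOError("EIO: queue full, IO rejected")
        self.items.append(io)

q = BoundedQueue(limit=3)
for io in range(5):
    try:
        q.enqueue(io)
    except IOError as e:
        # A real application should back off here; retrying immediately
        # just pushes the system further into the spiral.
        print(f"io {io}: {e}")
```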
In storage, and shared storage in particular, IOs usually do not arrive in an orderly manner where each can predictably receive service immediately, without waiting. Storage systems are by nature stochastic processes: IO arrivals are effectively random and therefore unpredictable. Because we cannot predict every possible outcome, we create queues, which give us an overflow mechanism for those periods when demand cannot be satisfied without waiting for service. I think of the wait queue as a mechanism for managing periodic overflow.
In practice, even on a moderately loaded system, IOs will spend some amount of time waiting to be serviced and some time being processed. We can roughly map the store checkout example to storage IOs. In fact, at a high level there are two queues: one that IOs enter to wait for service and one that IOs enter to be serviced. We can think of the combined time spent in both queues as the total service time, or latency. All IOs have to go through both queues, even if the wait queue is zero-length; it is a fixed path that IOs must travel. There are many places where IOs spend time before they reach the driver and ultimately the disks, but we will focus on this one place for the purposes of this post.
There are many different types of queues, but the ones we are concerned with here are FIFOs, or First In First Out queues. You can think of each IO as first standing in one line, then being shifted into another line where it is immediately picked up for processing. In general we would like the wait line to be as short as possible, ideally zero, and we would like the run queue to be consistent in length. Let's examine my reasoning for these points in a little more detail.
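To make the two-queue picture concrete, here is a minimal sketch of a wait FIFO feeding a run queue with a limited number of slots. It is purely illustrative and is not how ZFS implements its queues; the slot count and IO names are made up.

```python
# Two FIFO stages: IOs land in the wait queue and move into the run queue
# as soon as one of its limited slots frees up.  Total latency for an IO is
# its time in the wait queue plus its time in the run queue.
from collections import deque

RUN_SLOTS = 2                 # how many IOs can be serviced in parallel
wait_q = deque()              # IOs waiting for service
run_q = []                    # IOs currently being serviced

def submit(io):
    wait_q.append(io)         # every IO enters the wait queue first

def dispatch():
    # shift IOs from wait to run while there is a free slot
    while wait_q and len(run_q) < RUN_SLOTS:
        run_q.append(wait_q.popleft())

def complete(io):
    run_q.remove(io)          # service finished, slot is free again
    dispatch()                # immediately pull the next waiter, if any

for io in ["io-1", "io-2", "io-3", "io-4"]:
    submit(io)
dispatch()
print("running:", run_q, "waiting:", list(wait_q))
complete("io-1")
print("running:", run_q, "waiting:", list(wait_q))
```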
In the wait queue, IOs are effectively doing nothing but waiting for something useful to happen to them. If an IO is a read and it ends up in the wait queue, the consumer on the other end of that IO is waiting for it too, which causes the application that issued it to block, or to set the IO aside and tackle the next thing if possible, which may not require any IO at all. In many cases the application cannot make progress until the IO completes. Waiting is not useful work, and we want to minimize this period. Once the IO enters the run queue, something is actually happening to it: it is picked up, evaluated to determine the required action, and then sent to the lower levels, i.e. the driver, the HBA, and eventually the disks. There are other queues at those levels, and they all work very similarly. At a higher level we can measure the time that all of these lower levels cumulatively spend processing each IO.
Applications, in particular interactive applications that are heavy IO consumers, are usually designed with IO latency in mind, with tricks built in to keep the experience smooth for the user. Because they expect latency, they employ tactics that let them pretend an IO has completed and continue working, while a background process the user never sees deals with the actual latency. Many other tricks are built into filesystems themselves, including network filesystems like NFS, which "pretend" that an IO succeeded, tell the application so, and carry the burden of making sure the IO really does succeed, or that problems are handled correctly if it does not. Queueing on the storage system helps further buffer this pipeline between applications and storage. It is indeed a pipeline with many steps along the way, and the wait and run queues in the filesystem are close to the end, with the driver and hardware being the next steps.
Ideally, each time an application issues an IO, or a series of IOs, the storage is either entirely idle or able to move the IO(s) nearly instantly through the wait queue and into the run queue. This is effectively the best case scenario; no waiting in the wait queue is our most desired outcome. If the run queue can process more IOs at one time than the application issues, then as long as the application keeps issuing IOs at the same rate, the wait queue ends up being a rotary and IOs are processed without delay. As long as there are no IOs in the wait queue, applications experience a pretty consistent latency response. Storage latency is therefore fairly constant, and the buffering that applications do, along with buffering at other points in the system, contributes to a smooth and consistent application experience.
We want periods of time during which the run queue and the wait queue can drain, meaning fewer IOs are coming in and the queues shrink, ideally down to zero. Many shared systems exhibit this ebb and flow: short stretches (bursts) of IO cause the queues to grow, followed by periods during which the queues drain, often down to zero. For a production system this is desirable. Realistically we should expect short periods of deep queues followed by relative slowdowns that let the queues drain. If we were to plot these periods, we would see latency peaks and valleys: at the peaks the queues are deepest and IOs take the longest to complete, and in the valleys the queues are shallow or empty and IOs are processed optimally.
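As a rough illustration of that ebb and flow, the little simulation below feeds a queue in bursts and services a fixed number of IOs per tick. The arrival pattern and service rate are invented, but printing the queue depth per tick shows exactly the peaks and valleys described above.

```python
# Feed a queue in bursts, service a fixed number of IOs per tick, and watch
# the depth rise and fall.  Arrival pattern and service rate are made up.
import random

random.seed(1)
depth = 0
SERVICE_PER_TICK = 3

for tick in range(12):
    burst = tick % 4 == 0            # every 4th tick is a burst
    arrivals = random.randint(6, 9) if burst else random.randint(0, 2)
    depth = max(0, depth + arrivals - SERVICE_PER_TICK)
    print(f"tick {tick:2d}: arrivals={arrivals} queue depth={depth}")
```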
As we keep increasing the demand from one application, or the number of applications seeking to do IO at the same time, at some point the run queue will not be able to handle in parallel as many IOs as the applications issue. In practice, the run queue will be full and the wait queue will begin to build. The more items we add to the wait queue, the greater the latency will be for at least some IOs. If you think about the latency of a specific IO, every IO ahead of it influences its latency, so when you think about latency for one IO, you are indirectly thinking about latency for all IOs that must be processed before it. If some IOs are extraordinarily slow, their processing time is added to the total latency experienced by the IOs behind them, and this sort of variability increases as the run queue and wait queue grow. As we increase queue size, we increase the probability that some IOs will be extremely slow relative to others, because they happen to enter the queue at its longest, or a large number of IOs ahead of them are slow, or many IOs were issued at once by several applications, and so on.
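Put as a back-of-the-envelope sketch, simplified to a single service slot and strict FIFO order: the latency of the Nth IO is the sum of the service times of the IOs ahead of it plus its own, so one straggler inflates everything behind it. The numbers below are made up.

```python
# With one service slot and FIFO order, the latency of the IO at position N
# is the sum of the service times ahead of it plus its own.  A single slow
# IO in the middle inflates the latency of everything behind it.
service_ms = [1, 1, 14, 1, 1]   # note the 14 ms straggler

latency = []
elapsed = 0
for t in service_ms:
    elapsed += t
    latency.append(elapsed)

print(latency)   # [1, 2, 16, 17, 18] -- everything behind the straggler pays for it
```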
This all leads me to conclude, and to validate with measurements, that variability in latency increases as the wait queue grows. Intuitively this should make sense: each added IO contributes a little to the variability in how long it takes the Nth IO in the queue to reach the end of the pipeline. The higher N is, the more influence the IOs ahead of it have. This is where we get into the conversation about dispersion.
Dispersion is a property of data that describes how much the values in a given set of points differ from one another. This is a topic I would like to leave for the next post, but I want to establish some concrete understanding here. If we had, say, this set of numbers: [1, 2, 8, 11], then the lowest value is 1 and the highest is 11. Dispersion is, in a way, a measure of distance. When we have a lot of data points, what we want to know is whether most points are close to each other in value or far apart. If the points are pretty close in value, dispersion is low; if they are very far apart, dispersion is high. What counts as high or low is very context specific, of course. Applied to latency, our goal is very low dispersion, which should mean a steady queue and a steady load on the storage.
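For the toy set above, a couple of common dispersion measures look like this, using Python's statistics module; which measure is most useful for latency is exactly what I want to dig into next time.

```python
# A few simple dispersion measures for the toy data set from the text.
import statistics

points = [1, 2, 8, 11]

spread = max(points) - min(points)   # range: distance between the extremes
mean = statistics.mean(points)
stdev = statistics.pstdev(points)    # population standard deviation

print(f"range={spread} mean={mean} stdev={stdev:.2f}")
# A tightly clustered set like [5, 5, 6, 6] would have a much smaller range
# and standard deviation -- that is what "low dispersion" means here.
```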
Hopefully this was a good summary of queueing more broadly, as well as a good primer for the next post, which will go a little deeper into the whys and the hows, with some actual data to back up these ideas.