BBR: Congestion-Based Congestion Control. Kate Matsudaira: Scalable Web Architecture and Distributed Systems


Measuring bottleneck bandwidth and round-trip propagation time

By all measures, today's Internet cannot move data as fast as it should. Most cellular users in the world experience delays ranging from seconds to minutes; public WiFi hotspots at airports and conferences are even worse. Physicists and climate scientists need to share petabytes of data with colleagues around the world, but they find that their carefully engineered multi-gigabit infrastructure often delivers only a few megabits per second over transcontinental links.

These problems arise from architectural choices made when TCP congestion control was created in the 1980s, when packet loss was taken to mean "congestion." Equating the two was true at the time, but only because of technology limitations, not by definition. As NICs (network interface controllers) evolved from megabit to gigabit speeds and memory chips from kilobytes to gigabytes, the relationship between packet loss and congestion became much weaker.

Loss-based congestion control in today's TCP, even in its most advanced form, CUBIC, is the primary cause of these problems. If bottleneck buffers are large, loss-based congestion control keeps them full, causing bufferbloat (unnecessary network buffering). If the buffers are small, loss-based congestion control misinterprets loss as a congestion signal, leading to low throughput. Fixing these problems requires an alternative to loss-based congestion control. Finding that alternative requires understanding where and how congestion arises.

Congestion and bottleneck

At any time, a (full-duplex) TCP connection has exactly one slowest link, or bottleneck, in each direction. The bottleneck matters for the following reasons:
  • It determines the connection's maximum data-delivery rate. This is a fundamental property of an incompressible flow (for example, picture a six-lane highway at rush hour where an accident has reduced one short stretch to a single lane: traffic upstream of the accident moves no faster than traffic through that lane).
  • It is where persistent queues form. Queues shrink only when the rate leaving the bottleneck exceeds the rate arriving at it. For a connection sending at its maximum rate, every link upstream of the bottleneck has a faster departure rate, so queues migrate to the bottleneck.
Regardless of how many links a connection traverses and what their individual speeds are, from TCP's point of view an arbitrarily complex path behaves as a single link with the same RTT (round-trip time) and bottleneck rate. Two physical constraints, RTprop (round-trip propagation time) and BtlBw (bottleneck bandwidth), bound transfer performance. (If the network path were a physical pipe, RTprop would be its length and BtlBw its diameter.)

Figure 1 shows RTT and delivery rate as a function of the amount of data in flight, that is, inflight (data sent but not yet acknowledged). The blue lines show the RTprop constraint, the green lines the BtlBw constraint, and the red lines the bottleneck buffer. Operation in the gray regions is impossible because it would violate at least one constraint. The transitions between constraints produce three regions (app-limited, bandwidth-limited, and buffer-limited) with qualitatively different behavior.

Figure 1

When there is not enough data in flight to fill the pipe, RTprop determines behavior; otherwise BtlBw dominates. The constraint lines intersect at inflight = BtlBw × RTprop, that is, at the pipe's BDP (bandwidth-delay product). Since the pipe is full beyond this point, the excess creates a queue at the bottleneck, which produces the linear dependence of RTT on inflight data shown in the upper graph. Packets are dropped when the excess exceeds the buffer capacity. Congestion is simply operating persistently to the right of the BDP line, and congestion control is some scheme to bound how far to the right of the BDP line a connection operates on average.
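As a rough numeric illustration of the BDP (the figures here are hypothetical, not taken from the article):

btlbw_bps = 10e6          # assumed bottleneck bandwidth: 10 Mbit/s
rtprop_s  = 0.040         # assumed round-trip propagation time: 40 ms

bdp_bits  = btlbw_bps * rtprop_s        # 400,000 bits
bdp_bytes = bdp_bits / 8                # 50,000 bytes
mss = 1500                              # assumed packet size in bytes
print(bdp_bytes, bdp_bytes / mss)       # -> 50000.0 bytes, roughly 33 packets in flight before a queue builds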

Loss-based congestion control operates at the right edge of the bandwidth-limited region, delivering full bottleneck bandwidth at the cost of high delay and frequent packet loss. When memory was expensive, buffer sizes were only slightly larger than the BDP, which minimized the excess delay of loss-based congestion control. Subsequent drops in memory prices led to bufferbloat and to RTTs of seconds instead of milliseconds.

At the left edge of the bandwidth-limited region there is an operating point with far better conditions than at the right. In 1979 Leonard Kleinrock showed that this operating point is optimal: it maximizes delivered throughput while minimizing delay and loss, both for individual connections and for the network as a whole. Unfortunately, at around the same time Jeffrey Jaffe proved that it was impossible to build a distributed algorithm that converges to this operating point. This result changed the direction of research: instead of searching for a distributed algorithm that achieves Kleinrock's optimal operating point, researchers began exploring alternative approaches to congestion control.

Our team at Google spends time every day examining TCP header captures from all over the world, making sense of anomalies and pathological behavior. Usually we first look for the key path characteristics, RTprop and BtlBw. That these can be inferred from traces suggests that Jaffe's result may be less clear-cut than it once seemed. His result rests on a fundamental measurement ambiguity (for example, whether a measured RTT increase is caused by a longer path, lower bottleneck bandwidth, or more queuing delay from another connection's traffic). Although the ambiguity of any single measurement cannot be eliminated, a connection's behavior over time paints a clearer picture, suggesting measurement strategies designed to resolve the ambiguity.

Combining these measurements with a robust servo loop using recent advances in control systems, one can hope to get a distributed congestion control protocol that reacts to real congestion rather than to packet loss or transient queuing delay, and that converges, with high probability, to Kleinrock's optimal operating point. Thus began our three-year project to develop congestion control based on measuring the two parameters that characterize a path: Bottleneck Bandwidth and Round-trip propagation time, or BBR.

Bottleneck Characteristics

A connection runs at the highest throughput and lowest delay when (rate balance) the packet arrival rate at the bottleneck equals BtlBw and (full pipe) the total data in flight equals the BDP (= BtlBw × RTprop).

The first condition ensures the bottleneck is 100 percent utilized. The second ensures there is enough data to keep the bottleneck from going idle without overfilling the pipe. The rate balance condition alone does not guarantee there is no queue, only that the queue's size stays constant (for example, if a connection starts by sending its 10-packet initial window into a five-packet BDP and then sends at exactly the bottleneck rate, five of the ten initial packets fill the pipe and the surplus forms a standing queue at the bottleneck that cannot dissipate). Likewise, the full-pipe condition does not guarantee there is no queue (for example, a connection sending a BDP of data in BDP/2-sized bursts gets full bottleneck utilization, but with an average queue of BDP/4). The only way to minimize the queue at the bottleneck and along the whole path is to satisfy both conditions at once.
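A minimal sketch of the standing-queue example above, in hypothetical units where the bottleneck drains one packet per tick and the BDP is five packets:

bdp = 5
queue = 10 - bdp          # 10-packet initial window: 5 packets fill the pipe, 5 queue up
for tick in range(20):
    queue += 1            # sender injects one packet per tick (exactly the bottleneck rate)
    queue -= 1            # bottleneck drains one packet per tick
print(queue)              # -> 5: rate balance alone fixes the queue size, it never drains it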

The BtlBw and RTprop values change over the life of a connection, so they must be estimated continuously. TCP already tracks RTT (the interval between sending a data packet and its acknowledgment) because it is needed for loss detection. At any time T,

    RTT(T) = RTprop(T) + eta(T)

where eta >= 0 represents the "noise" added by queues along the path, the receiver's delayed-ack strategy, ack aggregation, and so on. RTprop is a physical property of the path and changes only when the path itself changes. Since path changes happen on time scales much longer than RTprop, an unbiased, efficient estimator at time T is

    RTprop_est(T) = RTprop + min(eta(t)) = min(RTT(t))   for t in [T - W_R, T]

that is, a running minimum over a time window W_R (typically tens of seconds to minutes).

Unlike RTT, nothing in the TCP specification requires implementations to track bottleneck bandwidth, but a good estimate of it can be obtained by tracking the delivery rate. When the ack for a packet returns to the sender, it conveys that packet's RTT and announces how much data was delivered while the packet was in flight. The average delivery rate between sending the packet and receiving its ack is the ratio of the data delivered to the time elapsed:

    deliveryRate = delta_delivered / delta_t

This rate must be less than or equal to the bottleneck rate (the amount delivered is known exactly, so all the uncertainty is in delta_t, which must be greater than or equal to the true arrival interval; therefore the ratio must be less than or equal to the true delivery rate, which is, in turn, bounded above by the bottleneck capacity). Therefore, a windowed maximum of the delivery rate is an efficient, unbiased estimate of BtlBw:

    BtlBw_est(T) = max(deliveryRate(t))   for t in [T - W_B, T]

where the time window W_B is typically six to ten RTTs.
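The running minimum and maximum above are simple windowed filters. A minimal Python sketch, assuming plain (time, value) samples and a linear scan (a real implementation would typically use a more compact windowed min/max structure):

from collections import deque

class WindowedFilter:
    """Running min or max of (time, value) samples over a sliding time window."""
    def __init__(self, window, better):
        self.window = window        # window length in the chosen time unit
        self.better = better        # min for RTprop, max for BtlBw
        self.samples = deque()      # (time, value) pairs, oldest first

    def update(self, now, value):
        self.samples.append((now, value))
        # Drop samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()
        return self.better(v for _, v in self.samples)

# RTprop: running min of RTT over ~10 s; BtlBw: running max of delivery rate over ~10 RTTs.
rtprop_filter = WindowedFilter(window=10.0, better=min)
btlbw_filter  = WindowedFilter(window=10 * 0.040, better=max)   # assuming a 40-ms RTT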

TCP must record the time each packet was sent in order to calculate the RTT. BBR augments these records with the total amount of data delivered so that the arrival of each acknowledgment reports both the RTT and the result of the delivery rate measurement, which the filters convert into estimates of RTprop and BtlBw.

Note that these quantities are completely independent: RTprop can change (for example, on a route change) while the bottleneck stays the same, and BtlBw can change (for example, when a wireless link's rate changes) without the route changing. (This independence is why both must be known in order to match sending behavior to the delivery path.) Since RTprop is visible only to the left of the BDP point and BtlBw only to the right of it in Figure 1, they obey an uncertainty principle: whenever one can be measured, the other cannot. Intuitively, to find the capacity of the pipe it must be overfilled, which creates a queue that hides the pipe's length. For example, an application running a request/response protocol may never send enough data to fill the pipe and observe BtlBw, while a multi-hour bulk transfer may spend its entire life in the bandwidth-limited region and get only a single RTprop sample, from the first packet's RTT. This intrinsic uncertainty principle means that, in addition to estimators that recover the two path parameters, there must be states that track both what can be learned at the current operating point and, as the information grows stale, how to get to an operating point where fresher data can be obtained.

Matching a packet flow to the delivery path

The basic BBR algorithm consists of two parts:

When an acknowledgment (ack) is received

Each ack provides new RTT and average delivery rate measurements, which update the RTprop and BtlBw estimates:

function onAck(packet)
  rtt = now - packet.sendtime
  update_min_filter(RTpropFilter, rtt)
  delivered += packet.size
  delivered_time = now
  deliveryRate = (delivered - packet.delivered) / (delivered_time - packet.delivered_time)
  if (deliveryRate > BtlBwFilter.currentMax || ! packet.app_limited)
    update_max_filter(BtlBwFilter, deliveryRate)
  if (app_limited_until > 0)
    app_limited_until = app_limited_until - packet.size
The if check deals with the uncertainty problem discussed in the previous paragraph: senders can be application-limited, meaning the application runs out of data to fill the network. This is quite common because of request/response traffic. When there is an opportunity to send but no data to send, BBR marks the corresponding bandwidth samples as app_limited (see the send() pseudocode). The code here decides which samples to include in the bandwidth model so that it reflects network limits rather than application limits. BtlBw is a hard upper bound on the delivery rate, so a measured delivery rate larger than the current BtlBw estimate must mean the estimate is too low, whether or not the sample was app-limited; otherwise, application-limited samples are discarded. (Figure 1 shows that in the app-limited region deliveryRate underestimates BtlBw.) These checks prevent the BtlBw filter from being filled with too-low values that would slow down sending.

When data is sent

To match the packet arrival rate at the bottleneck to its departure rate, BBR paces every data packet. BBR must pace at the BtlBw rate: pacing is integral to the design and fundamental to its operation, so pacing_rate is the primary control parameter. A secondary parameter, cwnd_gain, bounds inflight to a small multiple of the BDP to handle common network and receiver pathologies (see the chapter on delayed and stretched ack packets below). Conceptually, the TCP send routine looks like the following code. (On Linux, sending uses the efficient FQ/pacing queuing discipline, which gives BBR line-rate single-connection performance on multi-gigabit links and handles thousands of lower-rate connections with negligible CPU overhead.)

function send(packet)
  bdp = BtlBwFilter.currentMax × RTpropFilter.currentMin
  if (inflight >= cwnd_gain × bdp)
    // wait for ack or retransmission timeout
    return
  if (now >= nextSendTime)
    packet = nextPacketToSend()
    if (! packet)
      app_limited_until = inflight
      return
    packet.app_limited = (app_limited_until > 0)
    packet.sendtime = now
    packet.delivered = delivered
    packet.delivered_time = delivered_time
    ship(packet)
    nextSendTime = now + packet.size / (pacing_gain × BtlBwFilter.currentMax)
  timerCallbackAt(send, nextSendTime)
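To make the pacing interval computed in the last line of send() concrete, here is a small worked example with assumed numbers (1500-byte packets, a 10-Mbps BtlBw estimate, pacing_gain = 1.25):

packet_size_bytes = 1500
btlbw_bps   = 10e6
pacing_gain = 1.25

interval_s = (packet_size_bytes * 8) / (pacing_gain * btlbw_bps)
print(interval_s * 1000)   # -> 0.96 ms between packet departures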
Steady State Behavior

The rate and amount of data sent by BBR depend only on BtlBw and RTprop, so the filters control not just the estimates of the bottleneck constraints but also their application. This creates the novel control loop shown in Figure 2, which plots the RTT (blue), inflight (green), and delivery rate (red) over 700 milliseconds of a 10-Mbps, 40-ms flow. The thick gray line above the delivery rate is the state of the BtlBw max filter. The triangular structures result from BBR cycling the pacing_gain coefficient to probe for an increase in BtlBw. The gain used in each part of the cycle is shown time-aligned with the data it influenced. The gain is applied one RTT earlier, when the data is sent; this is indicated by the horizontal jog in the event-sequence description running up the left side.


Figure 2

BBR minimizes delay by spending most of its time with one BDP in flight, paced at the BtlBw estimate. This moves the bottleneck toward the sender, which makes increases in BtlBw invisible. Consequently, BBR periodically spends an RTprop interval at pacing_gain > 1, which increases the sending rate and the amount of unacknowledged data in flight (inflight). If BtlBw has not changed, a queue builds at the bottleneck and RTT grows while deliveryRate stays the same. (This queue can be removed by sending with a compensating pacing_gain < 1 for the next RTprop.) If BtlBw has increased, deliveryRate increases as well, and the new filter maximum immediately raises the base pacing rate (pacing_rate). For this reason, BBR adapts to a new bottleneck size exponentially fast. Figure 3 shows this effect on a 10-Mbps, 40-ms flow whose BtlBw suddenly jumps to 20 Mbps after 20 seconds of steady operation (left side of the graph) and then drops back to 10 Mbps after another 20 seconds of sustained operation at 20 Mbps (right side of the graph).


Figure 3

(Essentially, BBR is a simple instance of a max-plus control system, a new approach to control based on nonstandard max-plus algebra. This approach allows the adaptation rate [controlled by the max gain] to be independent of queue growth [controlled by the average gain]. Applied to our problem, it reduces to a simple, unconditional control loop in which adaptation to changes in the physical constraints happens automatically through changes in the filters that express those constraints. A conventional system would require many control loops combined into a complex state machine to achieve the same result.)

Behavior when starting a single BBR flow

Existing implementations handle events such as startup, shutdown, and loss recovery with event-specific algorithms and many lines of code. BBR uses the code above (in the previous chapter, "Matching a packet flow to the delivery path") for everything. Events are handled by passing through a sequence of "states," where each state is defined by a table with one or more fixed gain values and exit criteria. Most of the time is spent in the ProbeBW state described in the chapter "Steady State Behavior." The Startup and Drain states are used at connection startup (Figure 4). To handle link bandwidths spanning 12 orders of magnitude, the Startup state implements a binary search for BtlBw, using a gain that doubles the sending rate while the delivery rate keeps increasing. This discovers BtlBw in log2(BDP) round trips but creates up to 2BDP of excess queue along the way. Once Startup finds BtlBw, BBR enters the Drain state, which uses the inverse of Startup's gain to get rid of the excess queue, and then enters ProbeBW as soon as inflight drops to the BDP line.


Figure 4

Figure 4 shows the first second of a 10-Mbps, 40-ms BBR flow. The top graph shows the time and sequence of events along with the progress of the sender (green) and receiver (blue): the amount of data sent and received. The red line shows a CUBIC sender under the same conditions. Vertical gray lines mark transitions between BBR states. The bottom graph shows the round-trip time (RTT) of both connections over time. Note that the time axis of this graph is the ack arrival time, so the graphs may appear shifted in time; in fact the events plotted correspond to the moments when BBR learns about them and reacts.

The bottom graph in Figure 4 contrasts BBR and CUBIC. At first their behavior is nearly identical, but BBR completely drains the queue created at connection startup, while CUBIC cannot: it has no model of the path to tell it how much of the data in flight is excess. CUBIC therefore grows inflight less aggressively, but that growth continues until the bottleneck buffer overflows and starts dropping packets, or until the receiver's limit on unacknowledged data in flight (the TCP receive window) is reached.

Figure 5 shows RTT behavior during the first eight seconds of the connection from Figure 4. CUBIC (red) fills the available buffer, then cycles between 70 and 100 percent full every few seconds. After startup, BBR (green) runs with essentially no queue.


Figure 5

Behavior when multiple BBR flows share a bottleneck

Figure 6 shows how the throughput of several BBR flows converges to a fair share of a 100-Mbps/10-ms bottleneck. The downward-pointing triangular structures are connection ProbeRTT states, whose self-synchronization accelerates final convergence.


Figure 6

The ProbeBW gain cycles (Figure 2) cause larger flows to yield bandwidth to smaller ones, with each flow learning its fair share. This happens fairly quickly (a few ProbeBW cycles), although unfairness can persist when late starters overestimate RTprop because earlier flows have (temporarily) built a queue.

To learn the true RTprop, a flow moves to the left of the BDP line using the ProbeRTT state: if the RTprop estimate has not been updated (that is, by measuring a lower RTT) for many seconds, BBR enters ProbeRTT, reduces the data in flight (inflight) to four packets for at least one round trip, and then returns to its previous state. Large flows entering ProbeRTT drain many packets from the queue, so other flows see a new RTprop (a new minimum RTT). This makes their RTprop estimates expire at the same time, so they all enter ProbeRTT together, which shrinks the queue further, letting even more flows see the new RTprop, and so on. This distributed coordination is the key to both fair bandwidth allocation and stability.

BBR synchronizes flows around the desirable event of an empty bottleneck queue. By contrast, loss-based congestion control synchronizes flows around undesirable events such as periodic queue growth and buffer overflow, amplifying delay and packet loss.

Deployment experience on Google's B4 WAN

Google's B4 network is a high-speed WAN (wide-area network) built from commodity switches. Losses on these shallow-buffered switches come mostly from coincident arrivals of small traffic bursts. In 2015 Google began moving production traffic from CUBIC to BBR. No issues or regressions were reported, and since 2016 all B4 TCP traffic has used BBR. Figure 7 illustrates one reason for the switch: BBR's throughput is consistently 2 to 25 times higher than CUBIC's. We expected even more improvement, but found that 75 percent of BBR connections were limited by the kernel's TCP receive buffer, which network operations staff had deliberately set low (8 MB) to keep CUBIC from flooding the network with megabytes of unacknowledged traffic (dividing 8 MB by a 200-ms intercontinental RTT gives a 335-Mbps ceiling). Manually raising the receive buffer on one US-Europe route instantly raised BBR's data rate to 2 Gbps, while CUBIC stayed at 15 Mbps; this 133-fold relative improvement was predicted by Mathis and colleagues in their 1997 paper.
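The 335-Mbps ceiling quoted above follows directly from dividing the window by the RTT; a quick check, assuming 8 MB here means 8 × 2^20 bytes:

window_bits = 8 * 2**20 * 8       # 8 MB receive window, expressed in bits
rtt_s = 0.200                     # 200-ms intercontinental RTT
print(window_bits / rtt_s / 1e6)  # -> ~335.5 Mbit/s, matching the figure in the text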


Figure 7

Figure 7 shows BBR's relative improvement over CUBIC; the inset shows the cumulative distribution functions (CDFs) of throughput. The measurements come from an active probing service that opens persistent BBR and CUBIC connections to remote data centers and transfers 8 MB of data every minute. The probes cover many B4 routes between North America, Europe, and Asia.

This huge improvement is a direct consequence of BBR not using packet loss as a congestion indicator. To achieve full bandwidth, existing loss-based congestion control requires the loss rate to be below the inverse square of the BDP (for example, fewer than one loss per 30 million packets for a 10-Gbps, 100-ms connection). Figure 8 compares the measured goodput at various packet loss rates. In CUBIC, loss tolerance is a structural property of the algorithm, while in BBR it is a configuration parameter. As the loss rate approaches the ProbeBW peak gain, the probability of measuring a delivery rate equal to the true BtlBw drops sharply, causing the max filter to underestimate it.


Figure 8

Figure 8 compares the goodput of BBR and CUBIC in 60-second flows on a 100-Mbps, 100-ms link with random loss ranging from 0.001 to 50 percent. CUBIC's throughput drops by a factor of 10 at 0.1 percent loss and stalls completely above 1 percent. The maximum possible throughput is the link rate times the fraction delivered (= 1 - lossRate). BBR holds at this limit up to a 5 percent loss rate and stays close to it up to 15 percent.

YouTube edge deployment experience

BBR is being deployed on YouTube and Google.com video servers. Google is running a small experiment in which a small percentage of users are randomly assigned to BBR or CUBIC. Video playback over BBR shows a significant improvement in every user-rated quality-of-experience metric, possibly because BBR's behavior is more consistent and predictable. BBR only slightly improves connection throughput because YouTube already adapts the streaming rate to well below BtlBw to minimize bufferbloat and rebuffering. Even so, BBR reduces median RTT by 53 percent on average globally and by more than 80 percent in developing countries.

Figure 9 shows the median RTT improvement of BBR over CUBIC across 200 million YouTube video playbacks measured on five continents over a one-week period. More than half of the world's 7 billion mobile connections run over 2.5G at between 8 and 114 Kbps and suffer well-documented problems because loss-based congestion control tends to overfill buffers. The bottleneck in these systems is usually between the SGSN (serving GPRS support node) and the mobile device. SGSN software runs on a standard PC platform with ample RAM, so there are often megabytes of buffer between the Internet and the mobile device. Figure 10 compares (emulated) SGSN Internet-to-device latency for BBR and CUBIC. The horizontal lines mark one of the more serious consequences: TCP adapts to long RTTs, except for the SYN packet that initiates a connection, which has a fixed, OS-dependent timeout. When the mobile device is receiving bulk data (for example, an automatic software update) through a large-buffered SGSN, the device cannot establish any new connection to the Internet until the queue empties (the SYN ACK is delayed longer than the fixed SYN timeout).


Figure 9

Figure 10 shows median steady-state RTT versus buffer size on a 128-Kbps, 40-ms link with eight BBR (green) or CUBIC (red) flows. BBR keeps the queue near its minimum regardless of the bottleneck buffer size and the number of active flows. CUBIC flows always fill the buffer, so delay grows linearly with buffer size.


Figure 10

Adaptive bandwidth in mobile cellular networks

Cellular systems adapt the bandwidth allocated to each user based in part on a demand estimate that uses the queue of packets destined for that user. Early versions of BBR were tuned to create very small queues, causing connections to get stuck at low rates. Raising the peak ProbeBW pacing_gain so that larger queues form resulted in fewer stuck connections, showing that for some networks a larger queue is beneficial. With the current peak gain setting, no degradation is apparent on any network compared with CUBIC.

Delayed and stretched ack packets

Cellular, WiFi, and broadband networks often delay and aggregate ack packets. When inflight is limited to one BDP, this causes stalls that reduce throughput. Raising ProbeBW's cwnd_gain to two lets BBR keep sending smoothly at the estimated delivery rate even when acks are delayed by up to one RTT, which largely avoids the stalls.

Token-bucket policers

Our initial YouTube deployment revealed that most of the world's ISPs mangle traffic with token-bucket rate policers. The bucket is usually full at connection startup, so BBR learns the underlying network's BtlBw, but once the bucket empties, every packet sent faster than the bucket fill rate (which is much lower than BtlBw) is dropped. BBR eventually learns this new delivery rate, but the ProbeBW gain cycle causes continuous moderate losses. To minimize the upstream bandwidth wasted and the application latency added by these losses, we added policer detection and an explicit policer model to BBR. We are also actively researching better ways to mitigate the damage done by policers.

Competing with loss-based congestion control

BBR converges toward a fair share of the bottleneck bandwidth whether it competes with other BBR flows or with flows governed by loss-based congestion control. Even as the latter fill the available buffer, ProbeBW still robustly shifts the BtlBw estimate toward the flow's fair share, and ProbeRTT finds an RTprop estimate just high enough for tit-for-tat convergence to that fair share. Unmanaged router buffers larger than a few BDPs, however, let legacy loss-based competitors bloat the queue and grab more than their fair share. Mitigating this is another area of active research.

Conclusion

Rethinking congestion control pays large dividends. Rather than using events such as loss or buffer occupancy, which are only weakly correlated with congestion, BBR starts from Kleinrock's formal model of congestion and its associated optimal operating point. The troublesome conclusion that the crucial delay and bandwidth parameters are "impossible" to determine simultaneously is sidestepped by observing that they can be estimated sequentially. Recent advances in control and estimation theory are then used to build a simple distributed control loop that approaches the optimum, fully utilizing the network while keeping the queue small. Google's BBR implementation is available in the open source Linux kernel and is described in detail in the appendix to this article.

BBR is used on Google's B4 backbone, improving throughput by orders of magnitude over CUBIC. It is also being deployed on Google and YouTube web servers, substantially reducing latency on all five continents tested so far, and most dramatically in developing regions. BBR runs purely on the sender side and requires no changes to the protocol, the receiver, or the network, so it can be deployed incrementally. It depends only on RTT and packet-delivery acknowledgments, so it can be implemented for most Internet transport protocols.

Acknowledgments

The authors are grateful to Len Kleinrock for pointing out how to properly manage congestion. We are indebted to Larry Brakmo for his pioneering work on the Vegas and New Vegas congestion control systems, which foreshadowed many elements of BBR, and for his advice and guidance during the early development of BBR. We would also like to thank Eric Dumazet, Nandita Dukkipati, Jana Iyengar, Ian Swett, M. Fitz Nowlan, David Wetherall, Leonidas Kontothanassis, Amin Vahdat, and the Google BwE and YouTube infrastructure teams for their invaluable help and support.

Appendix - detailed description

State machine for sequential probing

The pacing_gain controls how fast packets are sent relative to BtlBw, and is the key to BBR's intelligence. When pacing_gain is greater than one, inflight increases and the interval between packet arrivals decreases, which moves the connection to the right side in Figure 1. When pacing_gain is less than one, the opposite effect occurs, the connection moves to the left.

BBR uses pacing_gain to implement a simple sequential probing state machine that alternates between probing for more bandwidth and probing for a shorter round-trip time. (It is not necessary to probe for less bandwidth because that is handled automatically by the BtlBw max filter: new measurements reflect the drop, so BtlBw corrects itself as soon as the old measurements time out of the filter. The RTprop min filter handles increases in path length the same way.)

If the bottleneck bandwidth increases, BBR must send faster to discover it. Likewise, if the actual round-trip propagation time changes, the BDP changes, and BBR must send more slowly to get inflight below the BDP in order to measure the new RTprop. Therefore the only way to discover these changes is to run experiments, sending faster to check for a BtlBw increase or sending slower to check for an RTprop decrease. The frequency, magnitude, duration, and structure of these experiments differ depending on what is already known (startup versus steady state) and on the behavior of the sending application (intermittent versus continuous).

Steady state

BBR flows spend the vast majority of their time in the ProbeBW state, probing the bandwidth using a technique called gain cycling, which helps BBR flows achieve high throughput, low queuing delay, and convergence to a fair share of bandwidth. With gain cycling, BBR cycles through a sequence of values for the pacing_gain coefficient. The cycle has eight phases with the gains 5/4, 3/4, 1, 1, 1, 1, 1, 1, and each phase normally lasts one estimated RTprop. This design lets the gain cycle first probe for more bandwidth with a pacing_gain above 1.0, then drain any resulting queue with a pacing_gain an equal amount below 1.0, and then cruise with a short queue using a pacing_gain of exactly 1.0. The average gain across all phases is 1.0 because ProbeBW aims for its average pacing rate to equal the available bandwidth, keeping utilization high without building a queue. Note that while the gain cycle varies pacing_gain, cwnd_gain stays fixed at two, since delayed and stretched ack packets can show up at any time (see the chapter on delayed and stretched ack packets).

Moreover, to improve mixing and fair bandwidth sharing between flows, and to reduce queues when multiple BBR flows share a bottleneck, BBR randomizes the phases of the ProbeBW cycle by choosing the initial phase at random from all phases except the 3/4 one. Why not start the cycle at 3/4? The main purpose of that gain is to drain any queue that may have formed while the 5/4 gain was applied with the pipe already full. When a flow leaves the Drain or ProbeRTT state and enters ProbeBW there is no queue, so the 3/4 gain does no useful work; applying it in that context only costs utilization, which drops to 3/4 for that phase. Thus starting the cycle at 3/4 has only a cost and no benefit, and since every connection enters ProbeBW this way at startup and may stay there for a long time, BBR includes this small optimization.
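A minimal sketch of the gain cycle just described, with illustrative helper names (not taken from the BBR source): eight phases of roughly one RTprop each, starting at a random phase chosen from every phase except the 3/4 drain phase.

import random

PROBE_BW_GAINS = [5/4, 3/4, 1, 1, 1, 1, 1, 1]

def initial_probe_bw_phase():
    # Any phase except the 3/4 drain phase (index 1) may start the cycle.
    return random.choice([i for i in range(len(PROBE_BW_GAINS)) if i != 1])

def advance_probe_bw_phase(phase):
    # Each phase lasts roughly one estimated RTprop, then the cycle wraps around.
    return (phase + 1) % len(PROBE_BW_GAINS)

phase = initial_probe_bw_phase()
pacing_gain = PROBE_BW_GAINS[phase]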

BBR flows cooperate to periodically drain the bottleneck queue using a state called ProbeRTT. In any state other than ProbeRTT itself, if the RTprop estimate has not been updated (that is, by measuring a lower RTT) for more than 10 seconds, BBR enters ProbeRTT and reduces cwnd to a very small value (four packets). After keeping this minimum number of packets in flight for at least 200 ms and one round trip, BBR leaves ProbeRTT and enters either Startup or ProbeBW, depending on whether it estimates that the pipe was already full.
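A simplified sketch of the ProbeRTT entry and exit rules just described, using the constants from the text (10-second filter window, four-packet cwnd, 200-ms dwell); the names are illustrative and the state handling is reduced to the bare minimum.

RTPROP_FILTER_WINDOW_S = 10.0    # enter ProbeRTT if RTprop has not been refreshed for this long
PROBE_RTT_DURATION_S   = 0.2     # hold the reduced cwnd for at least 200 ms and one round trip
PROBE_RTT_CWND_PACKETS = 4

def maybe_enter_probe_rtt(state, now, rtprop_last_updated):
    if state != "ProbeRTT" and now - rtprop_last_updated > RTPROP_FILTER_WINDOW_S:
        return "ProbeRTT", PROBE_RTT_CWND_PACKETS   # new state, new cwnd in packets
    return state, None

def exit_probe_rtt(now, probe_rtt_start, round_trip_done, filled_pipe):
    if now - probe_rtt_start >= PROBE_RTT_DURATION_S and round_trip_done:
        return "ProbeBW" if filled_pipe else "Startup"
    return "ProbeRTT"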

BBR is designed to spend the vast majority of its time (about 98 percent) in ProbeBW and the rest in ProbeRTT, based on a set of trade-offs. ProbeRTT lasts long enough (at least 200 ms) for flows with different RTTs to have overlapping ProbeRTT states, yet short enough to limit the throughput penalty to about 2 percent (200 ms / 10 seconds). The RTprop filter window (10 seconds) is short enough to adapt quickly to changing traffic levels or route changes, but long enough that interactive applications (for example, web pages, remote procedure calls, video chunks) frequently have natural silences or low-rate periods within the window whose flow rate is low enough, or lasts long enough, to drain the queue at the bottleneck. The RTprop filter then opportunistically picks up these RTprop measurements, and RTprop refreshes without requiring ProbeRTT. Thus flows typically need to pay the 2 percent penalty only when multiple bulk flows keep the pipe busy across the entire RTprop window.

Startup Behavior

When a BBR flow starts up, it performs its first (and most rapid) sequential probe/drain cycle. Network link bandwidths span a range of 12 orders of magnitude, from a few bits per second to hundreds of gigabits per second. To learn BtlBw across such a vast range, BBR does a binary search over rate space. This finds BtlBw very quickly (in log2(BDP) round trips), but at the cost of creating up to a 2BDP queue in the final step of the search. The search is performed in BBR's Startup state, and the resulting queue is drained in the Drain state.

First, Startup grows the sending rate exponentially, doubling it each round. To achieve this rapid probing as smoothly as possible, in Startup the pacing_gain and cwnd_gain values are set to 2/ln 2, the minimum value that allows the sending rate to double each round. Once the pipe is full, the cwnd_gain coefficient bounds the queue to (cwnd_gain - 1) × BDP.

While in Startup, BBR judges whether the pipe is full by looking for a plateau in the BtlBw estimate. If it finds that several (three) rounds of trying to double the delivery rate have produced little actual increase (less than 25 percent), it concludes it has reached BtlBw, so it exits Startup and enters Drain. BBR waits three rounds to have solid evidence that the delivery-rate plateau the sender observed is not a temporary artifact of the receive window. Three rounds give the receiver's autotuning time to open the receive window and the BBR sender time to realize that BtlBw should be higher: in the first round the receive-window autotuning algorithm grows the receive window; in the second round the sender fills the larger receive window; in the third round the sender gets delivery-rate samples at the higher rate. This three-round threshold has proven itself in the YouTube deployment.
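A minimal sketch of the plateau check described above, using the thresholds from the text (three rounds, 25 percent growth); the function and field names are illustrative.

STARTUP_GROWTH_TARGET = 1.25   # BtlBw must grow by at least 25% per round to keep Startup going
STARTUP_FULL_ROUNDS   = 3      # rounds without such growth before declaring the pipe full

def check_full_pipe(state, btlbw_now, prior):
    """prior carries btlbw_at_last_round and rounds_without_growth between rounds."""
    if btlbw_now >= prior["btlbw_at_last_round"] * STARTUP_GROWTH_TARGET:
        prior["btlbw_at_last_round"] = btlbw_now
        prior["rounds_without_growth"] = 0
        return state                         # still finding more bandwidth
    prior["rounds_without_growth"] += 1
    if prior["rounds_without_growth"] >= STARTUP_FULL_ROUNDS:
        return "Drain"                       # pipe looks full; drain the Startup queue
    return state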

In the Drain state, BBR quickly empties the queue created during startup by switching to a pacing_gain that is the inverse of the one used in Startup. When inflight falls to the BDP estimate, meaning BBR judges the queue fully drained while the pipe is still full, it exits Drain and enters ProbeBW.

Note that BBR's Startup and CUBIC's slow start both probe the bottleneck capacity exponentially, doubling the sending rate each round; they differ in major ways, however. First, BBR is more robust in discovering the available bandwidth, since it does not stop searching in response to packet loss or (as with CUBIC's Hystart) an increase in delay. Second, BBR grows its sending rate smoothly, whereas CUBIC sends a burst of packets each round (even with pacing) followed by a gap of silence. Figure 4 shows the packets in flight and the RTT observed on each acknowledgment for both BBR and CUBIC.

Responding to Transitional Situations

The network path and the traffic flowing over it can change suddenly and dramatically. To adapt smoothly and robustly, and to reduce packet loss in such situations, BBR uses a number of strategies to realize its core model. First, BBR treats cwnd_gain × BDP as a target that the current cwnd approaches cautiously from below, increasing cwnd at any time by no more than the amount of data just acknowledged. Second, on a retransmission timeout, which means the sender thinks all packets in flight are lost, BBR conservatively reduces cwnd to a single packet and sends a single packet (just as loss-based congestion control algorithms such as CUBIC do). Finally, when the sender detects packet loss but some packets remain in flight, on the first round of loss recovery BBR temporarily lowers the sending rate to the current delivery rate; on the second and later rounds of recovery it ensures the sending rate never exceeds twice the current delivery rate. This significantly reduces transient losses when BBR runs into rate policers or competes with other flows in a buffer on the order of the BDP.
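A simplified sketch of the three strategies in this paragraph (the cwnd target, the retransmission-timeout response, and the recovery-rate cap), with names chosen for illustration and the details reduced to the essentials.

def cwnd_on_ack(cwnd, newly_acked, cwnd_gain, bdp):
    # Approach the cwnd_gain * BDP target from below, growing by at most
    # the amount of newly acknowledged data.
    target = cwnd_gain * bdp
    return min(cwnd + newly_acked, target) if cwnd < target else cwnd

def cwnd_on_rto():
    # Retransmission timeout: everything in flight is presumed lost.
    return 1   # one packet, as in loss-based schemes such as CUBIC

def sending_rate_in_recovery(round_in_recovery, delivery_rate):
    # First recovery round: send at the current delivery rate;
    # later rounds: never more than twice the current delivery rate.
    return delivery_rate if round_in_recovery == 1 else 2 * delivery_rate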


Kate Matsudaira: Scalable Web Architecture and Distributed Systems

Open source software has become a core building block in the creation of some of the world's largest websites. As these websites have grown, best practices and guiding principles for their architectures have emerged. This chapter aims to cover some of the key issues to consider when designing large websites, as well as some of the building blocks used to achieve these goals.

The focus of this chapter is on the analysis of web-based systems, although some of the material may be extrapolated to other distributed systems.

1.1 Principles of building distributed web systems

What exactly does it mean to build and operate a scalable website or application? At a primitive level it is just connecting users with remote resources via the Internet; the part that makes it scalable is that the resources, or access to those resources, are distributed across multiple servers.

Like most things in life, taking time to plan ahead when building a web service can help in the long run; understanding some of the considerations and trade-offs behind big websites can result in smarter decisions when creating smaller ones. Below are some of the key principles that influence the design of large-scale web systems:

  • Availability: Website uptime is critical to the reputation and functionality of many companies. For some of the larger online retailers, being unavailable for even a few minutes can result in thousands or millions of dollars in lost revenue, so designing their systems to be constantly available and resilient to failure is both a fundamental business and technology requirement. High availability in distributed systems requires careful consideration of redundancy for key components, rapid recovery from partial system failures, and graceful degradation when problems occur.
  • Performance: Website performance has become an important metric for most sites. Speed affects user experience and satisfaction, as well as search engine rankings, a factor that directly correlates with audience retention and revenue. As a result, creating a system optimized for fast responses and low latency is key.
  • Reliability: A system needs to be reliable, such that a given request for data consistently returns the same data. If the data changes or is updated, the same request should return the new data. Users need to know that if something is written to or stored in the system, it will persist and can be relied on to be in place for future retrieval.
  • Scalability: When it comes to any large distributed system, size is just one aspect of scale that needs to be considered. Just as important is the effort required to increase capacity to handle greater amounts of load, which is usually referred to as the scalability of the system. Scalability can refer to many different parameters of the system: how much additional traffic it can handle, how easy it is to add storage capacity, or how many more transactions can be processed.
  • Manageability: Designing a system that is easy to operate is another important consideration. The manageability of the system equates to the scalability of "maintenance" and "update" operations. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, the ease of making updates or modifications, and how simple the system is to operate. (That is, does it routinely operate without failures or exceptions?)
  • Cost: Cost is an important factor. This obviously includes hardware and software costs, but it is also important to consider other facets needed to deploy and maintain the system: the amount of developer time required to build it, the amount of operational effort required to run it, and even the amount of training required. Cost is the total cost of ownership.

Each of these principles provides the basis for decisions in designing a distributed web architecture. However, they can also be at odds with one another, such that achieving one comes at the cost of another. A basic example: choosing to address capacity simply by adding more servers (scalability) can come at the price of manageability (you now have to operate an additional server) and cost (the price of the servers).

When developing any kind of web application, it is important to consider these key principles, even if it is to confirm that the project can sacrifice one or more of them.

1.2 Basics

When considering system architecture, there are several issues that need to be addressed, such as which components are worth using, how they fit together, and what trade-offs can be made. Investing money in scaling without a clear need for it is not a smart business decision. However, some forethought in planning can save significant time and resources in the future.

This section covers some of the core factors that are central to almost all large web applications: services, redundancy, segmentation, and failure handling. Each of these factors involves choices and trade-offs, particularly in the context of the principles described in the previous section. To make this concrete, it is best to start with an example.

Example: Image Hosting Application

You've probably posted images online before. For large sites that store and deliver many images, there are challenges in creating a cost-effective, highly reliable architecture that has low response latencies (fast retrieval).

Imagine a system where users can upload their images to a central server, and the images can be requested via a web link or API, just like Flickr or Picasa. For simplicity, let's assume that this application has two key parts: the ability to upload (write) an image to the server, and the ability to query for an image. Efficient uploading certainly matters, but fast delivery when users request an image (for example, images requested for display on a web page or by another application) will be the priority. This is functionality similar to what a web server or CDN (content delivery network) edge server can provide. A CDN server typically stores data objects in many locations, bringing them geographically/physically closer to users and thus improving performance.

Other important aspects of the system:

  • The number of images stored can be unlimited, so storage scalability must be considered from this point of view.
  • There should be low latency for image downloads/requests.
  • If a user uploads an image to the server, its data must always remain intact and accessible.
  • The system must be easy to maintain (manageability).
  • Since image hosting does not generate much profit, the system must be cost-effective.

Another potential problem with this design is that a web server like Apache or lighttpd typically has an upper limit on the number of simultaneous connections it can maintain (the default is around 500, but it can go much higher), and under high traffic, writes can quickly consume that limit. Since reads can be asynchronous, or take advantage of other performance optimizations like gzip compression or chunked transfer encoding, the web server can serve reads faster and switch between clients quickly, handling many more requests per second than its maximum number of connections (with Apache and a maximum connection limit of 500, it is quite possible to serve several thousand read requests per second). Writes, on the other hand, tend to hold the connection open for the duration of the upload, so uploading a 1 MB file could take more than 1 second on most home networks, meaning the web server could handle only 500 such simultaneous writes.
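A back-of-the-envelope version of that write-capacity argument, with an assumed 8-Mbit/s home uplink: a 1 MB upload holds a connection for about a second, so 500 concurrent connections cap the server at roughly 500 writes per second.

upload_bps = 8e6                        # assumed home uplink: 8 Mbit/s
file_bits  = 1 * 2**20 * 8              # 1 MB file in bits
max_conns  = 500                        # default connection limit from the text

seconds_per_upload = file_bits / upload_bps          # ~1.05 s per 1 MB write
writes_per_second  = max_conns / seconds_per_upload  # ~477 simultaneous writes/s
print(seconds_per_upload, writes_per_second)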


Figure 1.2: Read/Write Separation

Anticipating this potential problem suggests splitting image reads and writes into independent services, as shown in Figure 1.2. This allows us not only to scale each of them individually (since it is likely we will always do more reads than writes) but also to keep track of what is going on in each service. Finally, it separates future concerns, making it easier to troubleshoot and diagnose a problem like slow read access.

The advantage of this approach is that we are able to solve problems independently of each other - without having to worry about recording and retrieving new images in the same context. Both of these services still use a global corpus of images, but by using service-specific techniques, they are able to optimize their own performance (for example, by queuing requests, or caching popular images - more on this later). From both a service and cost perspective, each service can be scaled independently as needed. And this is a positive thing, since combining and mixing them could inadvertently affect their performance, as in the scenario described above.

Of course, the above model will work optimally if there are two different endpoints (in fact, this is very similar to several implementations of cloud storage providers and Content Delivery Networks). There are many ways to solve such problems, and in each case a compromise can be found.

For example, Flickr solves this read/write problem by distributing users across different pods, such that each pod serves only a fixed set of users; as the number of users grows, more pods are added to the cluster (see Flickr's scaling presentation,
http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html). In the first example it is easier to scale hardware based on actual usage (the number of reads and writes across the whole system), whereas Flickr scales with its user base (though this assumes equal usage across users, so extra capacity has to be planned for). In the former, an outage or problem with one of the services brings down functionality of the whole system (no one can write files, for example), whereas an outage of one of Flickr's pods affects only the users it serves. In the first example it is easier to perform operations across the entire dataset, such as updating the write service to include new metadata or searching across all image metadata, whereas with the Flickr architecture each pod would need to be updated or searched (or a search service would have to be created to collate the metadata).

As for these systems, there is no panacea, but you should always proceed from the principles described at the beginning of this chapter: determine system needs (read or write load or both, level of parallelism, queries on data sets, ranges, sorting, etc.), conduct comparative benchmark testing of various alternatives, understand potential system failure conditions, and develop a comprehensive plan should failure occur.

Redundancy

To gracefully handle failure, the web architecture must have redundancy in its services and data. For example, if there is only one copy of a file stored on a single server, the loss of that server will mean the loss of the file. This is unlikely to be a positive situation and can usually be avoided by creating multiple copies or backups.

The same principle applies to services. If there is a core piece of functionality for an application, ensuring that multiple copies or versions of it run simultaneously can protect against the failure of a single node.

Creating redundancy in a system removes single points of failure and provides a backup or spare capability in an emergency. For example, if two instances of the same service run in production and one fails fully or partially, the system can fail over to the healthy copy. Failover can happen automatically or require manual intervention.

Another key part of service redundancy is creating a shared-nothing architecture. In such an architecture, each node can operate independently of the others, with no central "brain" managing state or coordinating the activities of the other nodes. This helps a great deal with scalability, since new nodes can be added without special conditions or knowledge. Most importantly, such systems have no single point of failure, making them much more resilient.

For example, in our image server application, all images would have redundant copies on another piece of hardware somewhere (ideally in a different geographic location, in case of a disaster such as an earthquake or a data center fire), and the services providing access to the images would also be redundant, all of them potentially serving requests. (See Figure 1.3.)
Looking ahead, load balancers are a great way to make this possible, but more on that below.


Figure 1.3: Redundant Image Hosting Application

Segmentation

Data sets may be so large that they cannot fit on a single server. It may also be the case that an operation requires too many computing resources, degrading performance and making added capacity necessary. In either case you have two choices: scale vertically or horizontally.

Scaling vertically means adding more resources to an individual server. For a very large data set this might mean adding more (or bigger) hard drives so the whole data set fits on one server. For a compute-heavy operation it could mean moving the computation to a bigger server with a faster CPU or more memory. In each case, vertical scaling is about making the individual resource capable of handling more on its own.

Horizontal scaling, on the other hand, involves adding more nodes. In the case of a large data set, this would mean adding a second server to store part of the total data volume, and for a computing resource, this would mean dividing the work or load across some additional nodes. To take full advantage of the potential of horizontal scaling, it must be implemented as an internal design principle of the system architecture. Otherwise, changing and isolating the context needed for horizontal scaling can be problematic.

The most common way to scale horizontally is to break services up into segments, or shards. These can be distributed such that each logical set of functionality is separate; the split could be by geographic boundaries or by another criterion, such as paying versus non-paying users. The advantage of these schemes is that they provide a service or data store with added capacity.

In our image server example, it is possible to replace the single file server used to store images with multiple file servers, each containing its own unique set of images. (See Figure 1.4.) Such an architecture lets the system fill each file server with images, adding more servers as disks fill up. The design requires a naming scheme that ties an image's filename to the server holding it. An image's name could be derived from a consistent hashing scheme mapped across the servers; or, alternatively, each image could be assigned an incremental ID, so that when a client requests an image, the retrieval service only needs to know the range of IDs stored on each server (like an index).
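Two minimal sketches of the naming schemes mentioned above, with assumed server names and capacities: hashing the image name onto the server list (plain modulo hashing here, not a full consistent-hashing ring) and routing by incremental ID range.

import hashlib

SERVERS = ["files-01", "files-02", "files-03", "files-04"]   # assumed server pool

def server_by_hash(image_key: str) -> str:
    # Hash the image name and map the digest onto the server list.
    digest = hashlib.sha1(image_key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

IMAGES_PER_SERVER = 1_000_000   # assumed capacity per server for the ID-range scheme

def server_by_id_range(image_id: int) -> str:
    # Incremental IDs: each server owns a contiguous range of IDs (an index).
    return SERVERS[image_id // IMAGES_PER_SERVER]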


Figure 1.4: Image hosting application with redundancy and segmentation

Of course, there are difficulties in distributing data or functionality across many servers. One of the key questions is data locality: in distributed systems, the closer the data is to the operation or point of computation, the better the performance of the system. Consequently, spreading data across multiple servers is potentially problematic: whenever the data is needed there is a risk that it is not where it is required, and the server has to perform a costly fetch of the information over the network.

Another potential problem comes in the form of inconsistency. When different services read and write to a shared resource, potentially another service or data store, there is a chance of a race condition - some data is supposed to have been updated, but is read before the update has taken place - and in that case the data is inconsistent. For example, in the image hosting scenario, a race condition could occur if one client sent a request to update an image of a dog, changing its title from "Dog" to "Gizmo", while another client was reading the image. In that situation it is unclear which title, "Dog" or "Gizmo", the second client would receive.
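
For illustration, a minimal sketch of how such a race can play out, using two threads and an in-memory dictionary standing in for the image metadata store (all names here are hypothetical):

import threading
import time

# Hypothetical in-memory metadata store for the image hosting example.
images = {"img42": {"title": "Dog"}}

def update_title(image_id, new_title):
    # Writer: simulates a slow update (for example, one that also has to
    # propagate to another data store).
    record = images[image_id]
    time.sleep(0.01)              # window during which a reader can see stale data
    record["title"] = new_title

def read_title(image_id):
    # Reader: may run before or after the write completes.
    return images[image_id]["title"]

writer = threading.Thread(target=update_title, args=("img42", "Gizmo"))
writer.start()
print(read_title("img42"))        # prints "Dog" or "Gizmo" depending on timing
writer.join()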

There are, of course, obstacles associated with data segmentation, but segmentation allows each problem to be split - by data, by load, by usage patterns, and so on - into manageable chunks. This helps with scalability and manageability, but risk remains. There are many ways to mitigate risk and handle failures; however, in the interest of brevity they are not covered in this chapter. If you want more information on the topic, take a look at the blog post on fault tolerance and monitoring.

1.3. Building blocks for fast and scalable data access

Having looked at some basic principles in distributed system development, let's now move on to a more complex issue - scaling data access.

The simplest web applications, such as LAMP-stack applications, look something like Figure 1.5.


Figure 1.5: Simple web applications

As an application grows, two main challenges arise: scaling access to the application server and to the database. In a highly scalable application design, the web or application server is typically minimized and often embodies a shared-nothing architecture. This makes the application server layer of the system horizontally scalable. As a result of this design, the heavy lifting is pushed down the stack to the database server and supporting services; this layer is where the real scaling and performance challenges come into play.

The rest of this chapter covers some of the most common strategies and techniques for improving the performance and scalability of these types of services by providing fast access to data.


Figure 1.6: Simplified web application

Most systems can be simplified to the scheme in Figure 1.6, which is a good place to start. If you have a lot of data, you want it to be as easy and quick to access as the box of candy in the top drawer of your desk. Though this comparison is overly simplistic, it points to two difficult problems: scalability of data storage and fast access to the data.

For the purposes of this section, let's assume you have many terabytes (TB) of data and you allow users to access small portions of that data at random. (See Figure 1.7.) This is similar to locating an image file somewhere on a file server in the sample image hosting application.


Figure 1.7: Access to specific data

This is especially difficult because loading terabytes of data into memory can be very expensive; it translates directly into disk I/O. Reading from disk is many times slower than reading from RAM - you could say that memory access is as fast as Chuck Norris, while disk access is slower than the queue at the clinic. This speed difference really adds up for large data sets: in raw numbers, memory access is about 6 times faster than disk for sequential reads and about 100,000 times faster for random reads (see Pathologies of Big Data, http://queue.acm.org/detail.cfm?id=1563874). Moreover, even with unique identifiers, solving the problem of finding a small piece of data can be as hard as trying to pick the last chocolate-filled candy out of a box of hundreds of other candies without looking.

Fortunately, there are many approaches that can be taken to simplify things, with the four most important approaches being the use of caches, proxies, indexes, and load balancers. The remainder of this section discusses how each of these concepts can be used to make data access much faster.

Caches

Caching takes advantage of a locality-of-reference principle: recently requested data is likely to be requested again. Caches are used at almost every level of computing: hardware, operating systems, web browsers, web applications and more. A cache is like short-term memory: limited in size, but faster than the original data source and holding the most recently accessed items. Caches can exist at all levels of an architecture, but they are often found at the level closest to the front end, where they are implemented to return data quickly without putting load on the backend.

So how can a cache be used to speed up data access within our example API? In this case, there are several suitable places for a cache. One option is to place it on the request-level nodes, as shown in Figure 1.8.


Figure 1.8: Cache placement on a query-level node

Placing a cache directly on a request-level node allows local storage of response data. Each time a request is made to the service, the node quickly returns the locally cached data if it exists. If it is not in the cache, the request node fetches the data from disk. The cache on a single request-level node can live either in memory (which is very fast) or on the node's local disk (still faster than going to network storage).
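
A small sketch of this lookup order (check the local cache, fall back to disk, then populate the cache), with a plain dictionary standing in for the node-local cache and a placeholder read_from_disk function:

local_cache = {}                   # a plain dict stands in for the node-local cache

def read_from_disk(key):
    # Placeholder for the slow path (local disk or network storage).
    return "<data for %s>" % key

def get(key):
    # Fast path: return locally cached response data if it exists.
    if key in local_cache:
        return local_cache[key]
    # Slow path: fetch from disk, then populate the cache for next time.
    value = read_from_disk(key)
    local_cache[key] = value
    return value

print(get("user_photo_17"))        # first call hits "disk"
print(get("user_photo_17"))        # second call is served from the cache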


Figure 1.9: Cache systems

What happens when you spread caching across many nodes? As you can see in Figure 1.9, if the request level consists of many nodes, each node is likely to have its own cache. However, if your load balancer distributes requests randomly across the nodes, the same request will land on different nodes, increasing cache misses. Two ways around this obstacle are global caches and distributed caches.

Global cache

The meaning of a global cache is clear from the name: all nodes share a single cache space. This involves adding a server, or file store of some kind, that is faster than the original store and that is available to all request-level nodes. Each request node queries the cache exactly as if it were local. This kind of caching scheme can cause problems, because a single cache is very easy to overwhelm as the number of clients and requests grows. At the same time, it is very effective in certain architectures (particularly ones with specialized hardware that makes the global cache very fast, or with a fixed data set that needs to be cached).

There are two standard forms of global cache, shown in the diagrams below. In the first (Figure 1.10), when a response is not found in the cache, the cache itself is responsible for retrieving the missing data from the underlying storage. In the second (Figure 1.11), it is the responsibility of the request nodes to fetch any data that is not found in the cache.


Figure 1.10: Global cache, where the cache is responsible for retrieval


Figure 1.11: Global cache, where request nodes are responsible for retrieval

Most applications that leverage global caches tend to use the first type, where the cache itself manages eviction and fetching of data in order to prevent a flood of client requests for the same data. However, there are cases where the second implementation makes more sense. For example, if the cache is used for very large files, a low cache hit rate would cause the cache buffer to be thrashed by misses; in this situation it helps to keep a large percentage of the overall data set (or the hot data set) in the cache. Another example is an architecture where the files stored in the cache are static and should not be evicted. (This can happen because of performance requirements around data latency - perhaps certain parts of the data need to be very fast for large data sets - where the application logic understands the eviction strategy or the hot spots better than the cache does.)
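
A rough sketch of the first form, a read-through global cache that fetches missing data itself; the LRU eviction policy and the backing_store_read callback are assumptions made for this example:

from collections import OrderedDict

class ReadThroughGlobalCache:
    # A single cache shared by all request nodes; on a miss the cache itself,
    # not the client, fetches the data from the underlying storage.
    def __init__(self, backing_store_read, max_items=1024):
        self._read = backing_store_read   # callable that reads the origin storage
        self._items = OrderedDict()       # kept in least-recently-used order
        self._max = max_items

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)          # mark as recently used
            return self._items[key]
        value = self._read(key)                   # cache retrieves missing data itself
        self._items[key] = value
        if len(self._items) > self._max:
            self._items.popitem(last=False)       # evict the least recently used entry
        return value

# Every request node would call get() exactly as if the cache were local.
cache = ReadThroughGlobalCache(backing_store_read=lambda key: "<data for %s>" % key)
print(cache.get("page-0017.png"))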

Distributed cache

Indexes

Such indexes are often stored in memory or somewhere very local to the incoming client request. Berkeley DB (BDB) and tree data structures, which are commonly used to store data in ordered lists, are ideal for indexed access.

There are often many layers of indexes that act as a map, moving you from one location to the next, and so on, until you get the piece of data you need. (See Figure 1.17.)


Figure 1.17: Multi-level indexes

Indexes can also be used to create multiple other views of the same data. For large data sets, this is a great way to define different filters and views without having to create many additional copies of the data.

For example, let's say the image hosting system mentioned above actually hosts images of book pages, and the service allows clients to query the text in those images, searching all of the text content on a given topic, the same way search engines let you search HTML content. In this case, all of those book images require many, many servers to store the files, and finding a single page to present to the user can be quite involved. First, inverted indexes for querying arbitrary words and word sets need to be readily available; then there is the task of navigating to the exact page and location within that book and retrieving the right image for the results. So in this case the inverted index would map to a location (such as book B), and then B might contain an index with all the words, their locations, and the number of occurrences in each part.

An inverted index, which Index1 in the diagram above might represent, would look something like this: each word or set of words serves as a key pointing to the books that contain it.

The intermediate index would look similar, but would contain only the words, locations, and information for book B. This layered architecture allows each index to take up much less space than if all of that information had to be stored in one big inverted index.
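
A toy sketch of this layered scheme: a top-level inverted index mapping words to books, plus a per-book intermediate index mapping words to page locations. The sample books and pages are invented:

from collections import defaultdict

# Invented corpus: book id -> {page number -> page text}.
books = {
    "B": {1: "the quick brown fox", 2: "the lazy dog"},
    "C": {1: "a dog named gizmo"},
}

word_to_books = defaultdict(set)                          # top-level inverted index
book_word_pages = defaultdict(lambda: defaultdict(list))  # per-book intermediate index

for book_id, pages in books.items():
    for page_no, text in pages.items():
        for word in text.split():
            word_to_books[word].add(book_id)
            book_word_pages[book_id][word].append(page_no)

def search(word):
    # First hop: which books contain the word; second hop: where inside each book.
    return {b: book_word_pages[b][word] for b in word_to_books.get(word, ())}

print(search("dog"))    # pages containing "dog" in books B and C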

And this is a key point in large-scale systems, because even compressed, these indexes can get quite big and expensive to store. Let's assume we have a lot of the world's books in this system - 100,000,000 of them (see the blog post "Inside Google Books") - and that each book is only 10 pages long (to keep the math simple) with 250 words per page. That gives us 250 billion words in total. If we take the average word to be 5 characters, and encode each character with 8 bits (1 byte, even though some characters actually take 2 bytes), so 5 bytes per word, then an index containing each word only once already requires more than a terabyte of storage. So you can see how indexes that also hold other information, such as word sets, data locations, and occurrence counts, can grow very quickly.
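
The same arithmetic written out explicitly (treating 1 TB as 10^12 bytes):

books = 100_000_000           # books in the system
pages_per_book = 10           # simplified page count
words_per_page = 250
bytes_per_word = 5            # 5 characters at 1 byte each

total_words = books * pages_per_book * words_per_page   # 250,000,000,000 words
index_bytes = total_words * bytes_per_word              # 1,250,000,000,000 bytes
print(total_words, index_bytes / 10**12, "TB")          # 1.25 TB, i.e. over a terabyte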

Creating such intermediate indexes and representing the data in smaller chunks makes the big-data problem tractable. Data can be spread across many servers and still be accessed quickly. Indexes are the cornerstone of information retrieval and the basis of today's modern search engines. Of course, this section only scratches the surface of indexing, and there has been a lot of research into how to make indexes smaller, faster, richer in information (such as relevance), and easier to update. (There are challenges with managing race conditions, as well as with the sheer number of updates required to add new data or change existing data, particularly when relevance or scoring is involved.)

Being able to find your data quickly and easily is important, and indexes are the simplest and most effective tool for achieving this goal.

Load Balancers

Finally, another critical piece of any distributed system is the load balancer. Load balancers are a core part of any architecture, because their role is to distribute load across the nodes responsible for serving requests. This allows multiple nodes to transparently serve the same function in a system. (See Figure 1.18.) Their main purpose is to handle a lot of simultaneous connections and route those connections to one of the request nodes, allowing the system to scale to serve more requests simply by adding nodes.


Figure 1.18: Load Balancer

There are many different algorithms for servicing requests, including random node selection, round robin, or even node selection based on certain criteria such as CPU or RAM usage. Load balancers can be implemented as hardware devices or software. Among open source software load balancers, HAProxy is the most widely used.
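
A sketch of two of these selection strategies, round robin and a simple least-loaded choice; the node list and the load metric are hypothetical:

import itertools
import random

nodes = ["app-01", "app-02", "app-03"]                    # hypothetical request nodes
current_load = {"app-01": 3, "app-02": 1, "app-03": 7}    # e.g. open connections

_rotation = itertools.cycle(nodes)

def pick_round_robin():
    # Hand out nodes in a fixed rotating order.
    return next(_rotation)

def pick_random():
    return random.choice(nodes)

def pick_least_loaded():
    # Criteria-based choice: the node with the lowest observed load.
    return min(nodes, key=lambda n: current_load[n])

print(pick_round_robin(), pick_random(), pick_least_loaded())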

In a distributed system, load balancers are often found at the "front edge" of the system, so that all incoming requests pass through them. In a complex distributed system it is quite likely that a request will pass through multiple balancers, as shown in Figure 1.19.


Figure 1.19: Multiple load balancers

Like proxies, some load balancers can also route requests differently depending on the type of request. They are also known as reverse proxies.

One of the challenges with load balancers is managing user-session-specific data. On an e-commerce site, when you have only one client, it is very easy to let users put items into their shopping cart and preserve its contents between visits (which matters, because the likelihood of selling an item rises considerably if, when the user returns, the product is still in the cart). However, if a user is routed to one node for the first session and to a different node on the next visit, inconsistencies can arise, because the new node may know nothing about the contents of that user's cart. (Wouldn't you be upset if you put a pack of Mountain Dew in your cart and came back to find it gone?) One solution is to make sessions "sticky" so that a user is always routed to the same node. However, that makes it much harder to take advantage of reliability features such as automatic failover: in that case the user's cart always has its contents, but if the sticky node becomes unavailable, a special approach is needed and the assumption about the cart's contents no longer holds (although hopefully that assumption is not baked into the application). Of course, this problem can also be solved with other strategies and tools described in this chapter, such as services, and with others besides (browser caches, cookies, and URL rewriting).

If a system has only a few nodes, a technique such as round-robin DNS is likely to be more practical than load balancers, which can be expensive and add an unneeded layer of complexity. In larger systems there are all sorts of scheduling and load-balancing algorithms, from simple ones like random selection or round robin to more sophisticated mechanisms that take usage patterns and performance characteristics into account. All of these algorithms distribute traffic and requests, and they can provide useful reliability tools such as automatic failover or automatic removal of a bad node (for example, when it stops responding to requests). However, these advanced features can make problem diagnosis cumbersome. For example, under heavy load, load balancers will remove nodes that are slow or timing out (because of the barrage of requests), which only makes things worse for the remaining nodes. Extensive monitoring is important here, because overall system traffic and load may appear to be dropping (since the nodes are serving fewer requests) while individual nodes are being stretched to their limits.

Load balancers are a simple way to increase system capacity and, like the other techniques described in this chapter, play an essential role in distributed system architecture. Load balancers also provide the critical function of checking the health of nodes: if a node is unresponsive or overloaded, it can be removed from the pool handling requests, and, thanks to the redundancy of the system, the load is redistributed across the remaining working nodes.

Queues

So far we have looked at a lot of ways to read data quickly. Another important part of scaling the data layer is effective management of writes. When systems are simple, with minimal processing load and small databases, writes can be predictably fast. In more complex systems, however, writes can take an almost indeterminately long time. For example, data may have to be written to several places on different servers or indexes, or the system may simply be under high load. In cases where writes, or any tasks for that matter, take a long time, achieving performance and availability requires building asynchrony into the system. A common way to do that is with a request queue.


Figure 1.20: Synchronous request

Imagine a system in which each client requests that a task be serviced remotely. Each client sends its request to the server, the server completes the tasks as quickly as possible, and it returns the results to the corresponding clients. In small systems where a single server (or logical service) can serve incoming clients as fast as they arrive, this works fine. However, when the server receives more requests than it can handle, each client is forced to wait for the other clients' requests to finish before a response to its own can be generated. This is an example of a synchronous request, depicted in Figure 1.20.

This kind of synchronous behavior can severely degrade client performance: the client is forced to sit idle, doing no work, until it gets a response to its request. Adding more servers does not actually solve the problem; even with effective load balancing in place, it is extremely difficult to achieve the even, fair distribution of load needed to maximize client throughput. Moreover, if the server handling the request is unavailable or has crashed, the client connected to it fails as well. Solving this problem effectively requires an abstraction between the client's request and the actual work performed to serve it.


Figure 1.21: Using Queues to Manage Requests

Enter queues. The way a queue works is very simple: a task arrives, is placed in the queue, and then "workers" pick up the next task as soon as they are free to process it. (See Figure 1.21.) These tasks could be simple writes to a database or something as complex as generating a preview image for a document. When a client submits task requests to a queue, it no longer has to wait for the results; it only needs confirmation that the request was properly received. This confirmation can later serve as a reference to the results of the work when the client asks for them.

Queues allow clients to work in an asynchronous manner, providing a strategic abstraction of the client request and response. On the other hand, in a synchronous system, there is no differentiation between the request and the response, and therefore they cannot be managed separately. In an asynchronous system, the client submits a task, the service responds with a message confirming that the task has been received, and the client can then periodically check the status of the task, only requesting the result once it has completed. While the client is making an asynchronous request, it is free to do other work, and even make asynchronous requests from other services. The latter is an example of how queues and messages work in distributed systems.
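
A minimal sketch of this submit / acknowledge / poll pattern using Python's standard queue module and a single worker thread; the job IDs and the "work" itself are invented for the example:

import queue
import threading
import uuid

tasks = queue.Queue()
results = {}                      # job id -> result, filled in by the worker

def worker():
    while True:
        job_id, payload = tasks.get()          # take the next task when free
        results[job_id] = payload.upper()      # stand-in for the real work
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(payload):
    # The client gets back an acknowledgement (the job id), not the result.
    job_id = str(uuid.uuid4())
    tasks.put((job_id, payload))
    return job_id

def poll(job_id):
    # The client checks later whether the work has finished.
    return results.get(job_id)                 # None while the task is still pending

job = submit("resize image 42")
tasks.join()                                   # wait here only so the demo can print
print(poll(job))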

Queues also provide some protection against service interruptions and failures. For example, it is quite easy to create a very resilient queue that can retry service requests that fail due to momentary server failures. It is preferable to use a queue to enforce quality of service guarantees rather than expose clients to temporary service interruptions, requiring complex and often inconsistent error handling on the client side.

1.4. Conclusion

Designing efficient systems with fast access to large amounts of data is an exciting topic, and there are many good tools that enable all kinds of new applications. This chapter touched on only a few examples; in reality there are many more, and new innovations in this area will keep coming.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

In this article we will outline some concepts to keep in mind when choosing a case fan and discuss the features characteristic of different types of fans. In addition, we will test the 120 mm akasa Amber AK-183-L2B fan, which for more than a year has served as the active element of the Thermaltake Sonic Tower CPU cooler used in our processor and video card testing. It has fully earned the right to be the first subject in our series of fan reviews.

The questions we will try to answer are...

What requirements can be placed on a case fan?

  1. Performance level.
  2. Noise level.
  3. Appearance (presence and type of backlight).
  4. Additional features (PWM control support, a speed controller, anti-vibration mounting).

What are the main differences between case fans?

1. Dimensions: impeller size.

For internal case ventilation, fans with a standard size of 80 mm, 92 mm or 120 mm are used in most cases, since cases already provide mounting holes for them. Clearly, the larger the fan impeller, the more airflow it can move; to match the output of a larger model, a smaller fan has to spin at a higher speed and therefore produces noticeably more noise. This is precisely why large 120 mm fans are popular.

To make these properties clearer, you can compare the characteristics of fan models of the same series from Xinruilian Science & Technology Co. having different sizes:

(Table: size in mm, rotation speed in rpm, airflow in CFM, noise level in dB, and pressure in mm H2O for the 80 mm, 92 mm, and 120 mm models.)

Judging by the characteristics given for these models, we can conclude that at the same rotation speed a 92 mm fan is about 1.65 times more efficient than an 80 mm one, and a 120 mm fan is about twice as efficient as a 92 mm model, given that all of the fans use the same type of impeller.

In addition to impeller diameter, the profile size, or depth, of the fan is equally important. The "classic" profile for ordinary cases is 25 mm. Fans with a smaller profile are called low-profile, and fans with a larger one high-profile. The strength of the airflow, characterized in the specification by the maximum pressure value, depends directly on the profile size.

For example, let's compare the characteristics of two 120 mm models - with a regular profile and a high profile, operating at the same rotation speed.

(Table: size in mm, rotation speed in rpm, airflow in CFM, noise level in dB, and pressure in mm H2O for the regular-profile and high-profile models.)

The table shows that the high-profile fan differs only in its higher maximum airflow pressure. Ordinary computer systems have no need for such excess pressure, so these fans are not used in them. High-profile fans are mostly found in servers; they also have higher rotation speeds and, as a result, fairly high operating noise.

2. Bearing type.

Fans use three main types of bearings: a plain (sleeve) bearing, a combination of one sleeve and one ball bearing, and a design with two ball bearings. Much less common are hydrodynamic bearings, developed separately by individual manufacturers.

Most often, you can determine the type of bearing by the presence of the following indices in the name of the fan model, although it is always advisable to check the official specification:

S (sleeve) - a sleeve bearing, which is essentially a bushing;
C (combo) - one ball bearing and a short bushing;
B (ball) or 2B (2 ball) - two ball bearings.

The simplest and cheapest, but unfortunately not particularly durable, option is the sleeve bearing. It looks like a small copper bushing inside which the impeller shaft rotates. A new, well-lubricated sleeve-bearing fan may be completely silent, but over time this property can be lost: without proper lubrication the bushing wears out fairly quickly, and the fan starts to make noise. Moreover, without lubrication and at elevated temperatures the fan can seize completely. Cheap Chinese power supplies illustrate this vividly: the sleeve-bearing fan that cools the power semiconductors would often stop, and the power supply would fail as a result.

The combined bearing has a comparatively longer service life.

A bearing assembly consisting of two ball bearings is the most expensive, but also the most durable, option. This type of bearing can operate freely at high temperatures. At the same time, among such fans you can often find samples that produce quite loud and unpleasant noise; this is especially typical of cheap fans, because the noise characteristics of the whole design depend directly on the manufacturing quality of the miniature bearings. So if you choose a fan with ball bearings, do not chase the cheapest option - even the more expensive models are rarely truly silent.

3. Motor class.

All case fans are driven by four-pole DC motors. By speed they fall into three categories: low-speed up to 2000 rpm, medium-speed from 2000 to 3000 rpm, and high-speed above 3000 rpm. You can usually tell the class of a fan motor by the index in its name, which is typically printed on the sticker:

L (low) - low-speed (up to 2000 rpm);
M (middle) - medium-speed (2000 to 3000 rpm);
H (high) - high-speed (over 3000 rpm).

What characterizes the data given in the manufacturers' specifications?

Fan speed is measured in revolutions per minute (rpm). Since the speed varies almost in direct proportion to the supply voltage, the value given in the specification corresponds to the rated supply voltage. The higher the rotation speed, the more air the fan moves, but usually the noisier it is.

Airflow can be specified in CFM (cubic feet per minute) or in cubic meters per hour (m³/h); 1 CFM ≈ 1.7 m³/h. This value shows the volume of air "pumped" over a given period of time in the complete absence of resistance to the airflow, that is, with equal air pressure on both sides of the fan. Naturally, the larger this value, the better.
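
A small conversion helper for these units (1 cubic foot is 0.3048³ m³, so 1 CFM is roughly 1.7 m³/h); the 44.8 CFM figure is the akasa fan's rated airflow discussed later in the review:

CUBIC_METERS_PER_CUBIC_FOOT = 0.3048 ** 3      # about 0.0283 m^3

def cfm_to_m3_per_hour(cfm):
    return cfm * CUBIC_METERS_PER_CUBIC_FOOT * 60

def m3_per_hour_to_cfm(m3_per_hour):
    return m3_per_hour / (CUBIC_METERS_PER_CUBIC_FOOT * 60)

print(round(cfm_to_m3_per_hour(1), 2))         # 1 CFM is about 1.7 m^3/h
print(round(cfm_to_m3_per_hour(44.8), 1))      # 44.8 CFM is roughly 76 m^3/h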

The static pressure of the airflow is usually given in millimeters of water column and characterizes the force of the airflow the fan can create.

Recall that pressure is calculated by the formula P = F/S, that is, pressure is the ratio of the force of the airflow to the area on which it acts. The specification gives the maximum pressure, which the fan develops when resistance prevents it from creating any airflow at all.

The full characteristics of a fan can be seen on its performance curve.

The performance curve plots the relationship between airflow and pressure. The highest point of the curve, lying on the pressure axis, is exactly the maximum pressure given in the specification. The lowest point, lying on the airflow axis, corresponds to the maximum airflow the fan delivers when it does not have to build up any pressure. In real conditions, that is, inside a case, the airflow has to overcome some resistance, and every case has its own degree of resistance. On the graph the system resistance is drawn as its own line, and the point where it intersects the fan curve is the operating point of the fan in our notional system.
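
A numeric sketch of finding the operating point, simplifying the fan curve to a straight line between the spec's maximum pressure and maximum airflow, and taking the system resistance as a line P = k·Q with an invented coefficient k:

P_MAX = 1.1      # mm H2O, maximum static pressure from the specification
Q_MAX = 44.8     # CFM, maximum airflow from the specification
K = 0.05         # invented case resistance coefficient, mm H2O per CFM

# Simplified fan curve: pressure falls linearly from P_MAX at zero flow to
# zero at Q_MAX.  System line: required pressure grows as K * Q.
# Operating point: P_MAX * (1 - Q / Q_MAX) = K * Q, solved for Q.
q_op = P_MAX / (K + P_MAX / Q_MAX)
p_op = K * q_op
print("operating point: %.1f CFM at %.2f mm H2O" % (q_op, p_op))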

Fan service life is measured in thousands of hours during which the fan must operate while maintaining its declared characteristics. The author does not know exactly under what operating conditions the quoted figures are achieved, since service life depends directly on operating conditions; the assumption is that the fan can run for the stated number of hours without losing its acoustic qualities. The lifetime depends largely on the type and quality of the bearing: sleeve bearings are usually rated at 30,000 hours, combined bearings at 45,000 hours, and double ball bearings at 60,000 hours and above.

In practice, a sleeve-bearing fan usually runs fairly quietly for about a year, even though manufacturers quote 30,000 hours. So do not take these numbers too literally - they are quite relative. Dig deeper and it turns out the manufacturer also assumed periodic maintenance, that is, re-lubrication, which ordinary users rarely perform.

Now let's return to the subject of our review, the akasa AK-183-L2B fan, and look at its specification:

akasa AK-183-L2B

Size: 120 mm
Rotation speed: 1400 rpm
Airflow: 44.8 CFM
Static pressure: 1.1 mm H2O
Service life: 80,000 h
Bearing type: two ball bearings
Connector: 3-pin
Noise level: 18 dB

The akasa AK-183-L2B fan comes in transparent plastic packaging. A blue cardboard insert at the back of the package sets off the transparent frame and the translucent amber-yellow impeller. Overall, everything is done in the familiar blue-and-yellow corporate colors of the Akasa Group.

On the back of the cardboard insert there is a specification in five different languages.

In addition to the fan, at the very bottom of the package there is a small black cardboard box with an inscription translated as “3-4 pin adapter”.

So it is no surprise that the box contained an adapter that lets you power the fan from a peripheral power connector while keeping a 3-pin connector for monitoring the fan's rotation speed. The small black box also held four screws for mounting the fan in the case.

The akasa AK-183-L2B frame is made of white transparent plastic, and its impeller of amber-yellow translucent plastic. Looking at akasa's product range as a whole, you rarely come across plain, standard solutions, since the company mostly distributes products aimed squarely at modding. But that does not mean that if you do not consider yourself a modder and are used to judging a product purely on practical grounds, this fan and similar products are not for you. When making products for demanding users - modders and overclockers - extra attention is always paid, first of all, to build quality and somewhat higher specifications. So it is no surprise that the consumer qualities of such products are almost always higher than those of ordinary goods.

Externally, the fan, of course, stands out primarily with its beautiful combination of yellow and white colors, and it seems that such a fan simply must have a backlight, but in fact it does not. From an aerodynamic point of view, no innovations or know-how have been implemented in the fan - a simple seven-blade impeller.

The akasa AK-183-L2B is a low-speed model: its nominal rotation speed is 1400 rpm, the maximum airflow reaches 44.8 CFM, and the static pressure is 1.1 mm of water column. The figures are not record-breaking, but that is no surprise, since the first requirement for a case fan in a home system is low noise.

And in this regard, the specification indicates a fairly low noise level of 18 dB. It should be noted that the akasa AK-183-L2B fan is based on two ball bearings, and its service life, according to the manufacturer, is 80,000 hours. The usual value for ball bearings is 60,000 hours, so a slightly increased value gives reason to hope for a better approach to their manufacture, because we have already noted that it is the quality of the ball bearing that determines its noise characteristics.

The fan is powered through a 3-pin connector, which means it does not support pulse-width modulation (PWM) control.

Testing

The practical part consists of two stages: testing the airflow the fan generates at its nominal speed, and assessing its noise by ear.

Test No. 1

In the first test the fan acts as the active element of a Thermalright SI-128 cooler. Testing is carried out in an open case, and the monitored parameters are the temperature of an Intel Core 2 Duo E6300 processor overclocked to 2.8 GHz and the temperature of the motherboard.

Test No. 2

In the second test the fan is used for its intended purpose, as a case fan installed on the rear panel and working as an exhaust. The case is closed during the test and no other fans are running. The monitored parameters are again the temperature of the Intel Core 2 Duo E6300 processor, this time without overclocking and cooled passively by a Thermaltake Sonic Tower (CL-P0071), and the temperature of the motherboard. The result is recorded after a 10-minute run of the CPU stress test in the Everest application.

The test platform configuration with an Intel processor consists of the following components:

Motherboard: Gigabyte GA-965P-DS4 (Intel P965 Express)
CPU: Intel Core 2 Duo E6300 (LGA775, 1.86 GHz, L2 2 MB) @ 2.8 GHz
RAM: 2 x DDR2-800 1024 MB Apacer PC6400
Video card: EVGA GeForce 8600GTS 256 MB DDR3 PCI-E
HDD: Samsung HD080HJ, 80 GB, SATA-300
Optical drive: ASUS DRW-1814BLT SATA
Power supply: Chieftec CFT-500-A12S 500W, 120 mm fan
Case: CODEGEN M603 MidiTower

The akasa AK-183-L2B demonstrates good performance, on par with its competitors. As for the claimed 18 dB noise level, we can confirm it only with our subjective impression: the stated value does indeed match the real noise level well, and at maximum speed the fan emits only a slight hum.

Conclusions.

The akasa AK-183-L2B fan is not merely a pretty piece of modding decoration, as it might seem to a buyer at first glance. It can move a fairly large volume of air and, most importantly, does so at a fairly low noise level. Considering its reasonable price for a high-quality ball-bearing model, it is one of the best in its category in terms of price versus consumer qualities. We will repeat that the akasa AK-183-L2B has been running in our test lab for more than a year, and its noise qualities have remained virtually unchanged over that time.

Advantages:

  • creates a large air flow;
  • low noise level;
  • good price/performance ratio.

The disadvantages include:

  • lack of PWM support.

We express our gratitude to the company PF Service LLC (Dnepropetrovsk) for the equipment provided for testing.

