WAN Optimisation and Application Acceleration
submitted by John Streeter
WAN Optimisation / Application Acceleration appliances terminate links on both sides of a LAN / WAN boundary to take control of TCP traffic. They implement multiple strategies to improve WAN throughput and application responsiveness.
WAN optimisation refers to a group of technologies which are applied to traffic at the transport layer. Common WAN optimisation strategies are compression, caching and protocol optimisation.
Application acceleration technologies are targeted to the application, presentation and session layers of specific applications. Acceleration is commonly provided for CIFS, NFS, HTTP, FTP, LDAP, RTSP, MAPI, POP, SMTP, ICA, database traffic and others.
Compression mitigates the bottleneck effect of traffic on a high-bandwidth network being routed to a low-capacity link. The sending accelerator compresses the data in the packets and retransmits it. The receiving device decompresses and retransmits to the ultimate destination on its LAN interface.
Data suppression is a form of caching that reduces the volume of traffic traversing the WAN. Traffic is cached on both sides of the WAN as data segments. The sending device matches TCP streams received from hosts on its LAN interface against cached segments. If a match is found, the data is dropped and not forwarded over the WAN link; instead a token representing the cached segment is sent. The receiving accelerator uses the token to retrieve the data from its cache and forwards that data to the application on the destination host.
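The token-substitution mechanism can be sketched as follows. This is an illustrative model only, not any vendor's implementation: both appliances keep an identical segment cache keyed by a strong hash, and the names, segment size and token length are assumptions for the example.

```python
import hashlib

SEGMENT_SIZE = 8  # tiny for illustration; real appliances use KB-scale segments

def sender_encode(stream: bytes, cache: dict) -> list:
    """Replace already-cached segments with short tokens; send new segments whole."""
    out = []
    for i in range(0, len(stream), SEGMENT_SIZE):
        seg = stream[i:i + SEGMENT_SIZE]
        token = hashlib.sha256(seg).digest()[:8]
        if token in cache:
            out.append(("token", token))       # segment seen before: send token only
        else:
            cache[token] = seg                 # first sighting: cache it and send the data
            out.append(("data", token, seg))
    return out

def receiver_decode(messages: list, cache: dict) -> bytes:
    """Rebuild the original stream from tokens and literal segments."""
    parts = []
    for msg in messages:
        if msg[0] == "data":
            _, token, seg = msg
            cache[token] = seg                 # mirror the sender's cache
            parts.append(seg)
        else:
            parts.append(cache[msg[1]])        # token lookup against the local cache
    return b"".join(parts)
```

Sending the same payload a second time transmits only tokens, which is where the very high per-segment compression ratios come from.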
The data suppression architecture guarantees that the data the receiving application gets is identical to that sent by the sender, because the actual data from the sender is matched as it is streamed. Data suppression commonly achieves a 1,000:1 compression ratio for a data segment that is not transmitted; a ratio of 50:1 for the entire data flow is commonly achieved.
Protocol optimisation tunes and modifies the standard RFC 793 TCP implementation to maximise bandwidth utilisation and throughput. TCP is designed to achieve reliable delivery at the expense of timely delivery, consequently most protocol optimisations address design inefficiencies in configuration and algorithms that impact WAN performance.
TCP Max Window Size
TCP window size defines the buffer size for TCP connections. A receiving endpoint informs its sender of its buffer capacity via a field in the TCP header called the window. Senders throttle transmission volumes to the receiver's window size, as packets arriving at a full buffer are lost. If the window field has a zero value, the buffer is full and the sender will suspend transmission until it receives a further packet advising of available window capacity.
Endpoints configured with low TCP window sizes are incapable of utilising all the available bandwidth capacity on long-distance high-bandwidth links, known as long fat pipes (LFPs).
Long fat pipes have a loading capacity for data in transit called the Bandwidth Delay Product (BDP). BDP is calculated by multiplying bandwidth in bytes (not bits) by the round-trip time in seconds. A 45 Mbps link with 280 millisecond latency would be calculated by multiplying 5.6 MB/sec (capacity in bytes) by 0.28 seconds. This equals roughly 1.6 MB; it tells us that the fully utilised link would have 1.6 MB of data in flight at any point in time. A 45 Mbps MPLS link from Melbourne to New York operating at capacity would be loaded with, and effectively caching, 1.6 MB of data at any given point in time.
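The calculation is simple enough to express directly; this short sketch reproduces the Melbourne–New York figures from the example above:

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth Delay Product: the bytes 'in flight' on a fully utilised link."""
    return (bandwidth_bps / 8) * rtt_seconds   # bits/sec -> bytes/sec, times RTT

# 45 Mbps link with 280 ms round-trip time
bdp = bdp_bytes(45_000_000, 0.280)
print(f"{bdp / 1_000_000:.2f} MB in flight")   # approximately 1.6 MB
```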
When the BDP exceeds the window size configured on the endpoints, receiving endpoints are unable to accept the volume of data the pipe is capable of delivering. The transmission rate is throttled back to a level the receiver can handle, as advertised via the window value, and the available bandwidth cannot be utilised.
The RFC 793 window size is defined by a 16-bit field that restricts its maximum value to 65,535 bytes. This is not adequate to support current WAN and LAN capabilities, and is addressed by window scaling, an RFC 1323 extension that supports window sizes up to 1 GB. Appliances have a configurable window scale, which should be set to a value two to three times the BDP.
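Window scaling works by left-shifting the raw 16-bit window value by a shift count agreed during connection set-up. The sketch below (illustrative names, not a stack's API) finds the smallest RFC 1323 shift that can express a target window:

```python
import math

MAX_RAW_WINDOW = 65_535     # 16-bit window field limit (RFC 793)
MAX_SCALE_SHIFT = 14        # RFC 1323 caps the shift at 14 (windows up to 1 GB)

def window_scale_shift(target_window_bytes: int) -> int:
    """Smallest RFC 1323 shift whose scaled window covers the target."""
    if target_window_bytes <= MAX_RAW_WINDOW:
        return 0
    shift = math.ceil(math.log2(target_window_bytes / MAX_RAW_WINDOW))
    return min(shift, MAX_SCALE_SHIFT)

# Target two to three times the ~1.6 MB BDP from the example above, say 4 MB:
shift = window_scale_shift(4 * 1024 * 1024)
print(shift, MAX_RAW_WINDOW << shift)   # a shift of 7 covers the 4 MB target
```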
This value is also configurable on servers and workstations, but the optimal value for WAN traffic may incur unacceptable memory utilisation. Additionally, there is a configuration-management cost in ensuring the value is tuned for WAN conditions on all endpoints.
TCP Slow Start
TCP connections initialise in slow start mode to discover bandwidth capacity. TCP segments are increased additively (that is, by a fixed segment-size amount) with each acknowledged transmission. The segment size is defined by the receiving endpoint during connection set-up. A small segment size may require many acknowledged transmissions before the connection is taking full advantage of the WAN link's throughput capacity. This may not be a problem for protocols such as FTP and HTTP, but low-volume chatty protocols such as SIP and ICA can be delayed if it takes a number of acknowledged transmissions to complete each communication.
TCP slow start can be mitigated by either:
- Increasing segment size during TCP slow start
- Starting with a higher cwnd size, which requires real-time awareness of link capacity
The second option is seldom employed because it requires the additional overhead of monitoring capacity on highly variable links. The first option is implemented as a TCP extension called large initial windows (RFC 3390). Large initial windows preserves TCP's bandwidth-discovery capability without adding additional overhead and leverages the growth of the cwnd during slow start.
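RFC 3390 bounds the larger initial window as a simple function of the maximum segment size (MSS): min(4 × MSS, max(2 × MSS, 4380 bytes)). A one-line sketch shows the effect for a few common MSS values:

```python
def rfc3390_initial_window(mss: int) -> int:
    """RFC 3390 upper bound on the initial congestion window, in bytes."""
    return min(4 * mss, max(2 * mss, 4380))

# Compared with a traditional one-segment start, the larger initial window
# saves one or two round trips at the beginning of every connection.
for mss in (512, 1460, 2190):
    print(mss, rfc3390_initial_window(mss))
```

For the common Ethernet MSS of 1460 bytes this permits roughly three segments in the first flight instead of one.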
TCP Congestion Control
Packet loss is a reliable indicator of congestion, as it occurs when packets are discarded by buffers filled to capacity. TCP detects packet loss through the absence of an ACK packet within the round-trip timeout period. In a standard TCP implementation, this triggers the RFC 2581 congestion control mechanism, which immediately cuts the congestion window (cwnd) in half. The cwnd is the number of segments the sender will allow on the wire unacknowledged; when it is reached, transmission is halted until sent packets are acknowledged. The cwnd is then increased by one segment per successful round trip. This aggressive backing-off on transmission rate ensures bandwidth is shared fairly.
This process is called the additive increase / multiplicative decrease (AIMD) algorithm. Its impact on WAN performance can be devastating, as latency is usually orders of magnitude higher than that observed on LANs. The longer round-trip times can lead to cwnd recovery times that are sometimes measured in minutes, compared to milliseconds on a LAN. For example, a 2 MB cwnd that is cut in half on a link with a 512-byte maximum segment size and 280 ms latency will need:
- about 2,000 round trips
- each taking 280 ms
- increasing the cwnd by 512 bytes per round trip
That is over nine minutes to rebuild the window after a single loss event.
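The recovery time in the example can be checked with a few lines of arithmetic:

```python
def aimd_recovery_seconds(cwnd_bytes: int, mss_bytes: int, rtt_seconds: float) -> float:
    """Time to rebuild a halved cwnd at one MSS per round trip (additive increase)."""
    lost = cwnd_bytes // 2              # multiplicative decrease: half the window gone
    round_trips = lost / mss_bytes      # one segment regained per round trip
    return round_trips * rtt_seconds

# The 2 MB cwnd / 512-byte MSS / 280 ms example from the text
t = aimd_recovery_seconds(2 * 1024 * 1024, 512, 0.280)
print(f"{t:.0f} s (~{t / 60:.1f} minutes)")   # roughly 9.5 minutes
```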
On WAN links, the impact of this mechanism can be mitigated by:
- Implementing the TCP selective acknowledgement (SACK) extension and forward error correction (FEC)
- Alternative congestion avoidance algorithms
SACK (RFC 2018) allows the receiver to acknowledge non-contiguous blocks of data, so the sender retransmits only the segments actually lost rather than everything after the gap. FEC includes parity packets in the transmission which can be used by the receiving end to recover data from a lost packet, again circumventing the congestion control mechanism.
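The simplest FEC scheme, used here purely as an illustration, adds one XOR parity packet per group: the receiver can rebuild any single lost packet from the survivors plus the parity, with no retransmission. Real appliance FEC schemes are more elaborate than this sketch.

```python
def parity_packet(packets: list) -> bytes:
    """XOR all equal-length packets in the group into one parity packet."""
    out = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            out[i] ^= b
    return bytes(out)

def recover(received: list, parity: bytes) -> bytes:
    """Reconstruct the single missing packet from the survivors plus parity."""
    # XOR is its own inverse: survivors ^ parity == the missing packet
    return parity_packet(received + [parity])
```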
Advanced congestion avoidance in specialised TCP stacks implements alternatives to the standard RFC 2581 congestion control algorithm. Some examples of these algorithms are:
- TCP BIC – implemented in Linux kernels 2.6.8 – 2.6.18
- TCP CUBIC – implemented in Linux kernels since 2.6.19
- Compound TCP – implemented in Windows Server 2008 and Vista
- High Speed TCP – Defined in RFC 3649 to improve performance in high BDP links
Application acceleration can be reactive or proactive. Reactive acceleration occurs when the device responds to a trigger; proactive acceleration is administratively configured activity.
Application Layer Caching
CIFS, HTTP and media streaming are classic candidates for application layer caching. Each application has specific security, version control and integrity requirements that application caches need to be aware of and work within. Application specific caches commonly store metadata and directory information with cached objects to help support this requirement.
HTTP has built-in cache support via the If-Modified-Since header. Appliances intercept client requests to check whether the requested objects are in cache. If they are, an If-Modified-Since request is sent to the server. The server will respond by sending an updated object (200 OK) if a newer version exists, or will reply with a 304 Not Modified if the object is current. If the object is not in cache, the request is forwarded to the server.
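The appliance's decision logic can be modelled without a network. In this hedged sketch the cache and origin server are plain dictionaries mapping URLs to (last-modified time, body) pairs; the function names and structure are assumptions for the example, not a product's API.

```python
def handle_request(url: str, cache: dict, origin: dict):
    """Serve from cache after revalidation, or fetch from the origin and cache."""
    origin_obj = origin[url]                  # (last_modified, body) held by the server
    if url in cache:
        cached_time, cached_body = cache[url]
        # Conditional GET: "If-Modified-Since: <cached_time>"
        if origin_obj[0] <= cached_time:
            return 304, cached_body           # current: origin answers 304 Not Modified
        cache[url] = origin_obj               # stale: origin returns the newer object
        return 200, origin_obj[1]
    cache[url] = origin_obj                   # not cached: plain GET, then cache it
    return 200, origin_obj[1]
```

The first request for an object crosses the WAN in full; subsequent requests cost only the small conditional exchange until the object changes.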
Each appliance has its own proprietary application intelligence to service the hundreds of message types implemented by CIFS to support file and directory operations. A common approach is to read ahead file blocks so that anticipated requests can be intercepted and serviced directly by the appliance. Cached data segments from files shared by multiple users are verified by the server before being serviced by the cache.
Appliances typically use opportunistic lock functionality to manage and validate cached CIFS data. Opportunistic locks can be batch, exclusive or level 2. Batch affords the greatest acceleration potential, as the client is able to perform all operations granted by its privileges and update the server in batch transactions. Exclusive locks are as unrestrictive as batch locks, but require the client to synchronise any changes with the server in real time. Level 2 locks are implemented when other users have the file open, affording only the potential for some read operations to be handled by the appliance. Appliances also cache CIFS directory traversal data for very short validation periods.
Splitting and Multiplexing
Media streams are intercepted by a server-side appliance and sent as a single stream across WAN links, where they are cached by the receiving appliance. This multiplexes requests from clients streaming the same presentation. The receiving appliance splits the stream to clients on its LAN interface.
Read-ahead is employed to service anticipated future requests from the appliance cache. Appliances need application intelligence to reliably predict likely future data requests. Examples are parsing an HTML page and caching linked objects, or pre-fetching CIFS file blocks after an initial block request is received. Read-ahead effectively batches sequences of requests for small blocks of data, mitigating application exposure to latency.
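The block pre-fetching pattern can be sketched as follows. Everything here is illustrative: the depth, block naming and `fetch_from_server` callback are assumptions standing in for the appliance's CIFS intelligence.

```python
DEPTH = 4   # how many extra blocks to speculatively pull per WAN round trip

def read_block(n: int, cache: dict, fetch_from_server) -> bytes:
    """Serve block n, prefetching the next DEPTH blocks on a cache miss."""
    if n not in cache:
        # One batched request replaces DEPTH + 1 sequential round trips.
        for i, data in enumerate(fetch_from_server(n, DEPTH + 1)):
            cache[n + i] = data
    return cache[n]
```

A client reading a file sequentially then pays WAN latency once per batch instead of once per block.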
More advanced appliances fine tune read-ahead behaviour by learning client behaviour patterns and calculating the probability of future requests based on past behaviour.
Pipelining and Multiplexing
Applications can pipeline multiple transactions through a single TCP socket or multiplex multiple sockets to improve bandwidth throughput on high latency links. Pipelining multiple transactions through a single socket improves throughput because the batched packets are sent ahead of response packets. Packets are evenly distributed across multiplexed connections, minimising delays on heavily trafficked connections by utilising capacity in lighter trafficked connections.
The standard TCP implementation queues all packets and transmits them on a first in, first out (FIFO) basis. Fair queuing allocates equal bandwidth to all application sockets, ensuring applications with short chatty transfers are not degraded by applications that queue large volumes of data. Each socket is given a turn to transmit packets in a rotor arrangement, creating a queue for each socket and a rotating queue of sockets.
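The rotor arrangement is essentially round-robin over per-socket queues, as in this illustrative sketch (simplified to one packet per turn, ignoring packet sizes):

```python
from collections import deque

def fair_schedule(flows: dict) -> list:
    """Interleave packets from per-socket queues in round-robin order."""
    queues = deque((sock, deque(pkts)) for sock, pkts in flows.items())
    sent = []
    while queues:
        sock, q = queues.popleft()
        sent.append((sock, q.popleft()))   # this socket's turn: transmit one packet
        if q:
            queues.append((sock, q))       # still has packets: back of the rotor
    return sent
```

A single-packet "chat" transfer is sent on the second turn even though a bulk flow arrived first with a deep queue, which is the degradation FIFO would have caused.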
Quality of Service
QoS assigns priority to the business’s most important applications. In practice, priority is usually assigned to interactive applications in which user experience is vulnerable to degradation resulting from transmission delays.
Effective WAN Optimisation and Application Acceleration is achieved through the aggregation of multiple strategies to improve WAN performance and reduce loading. Their effective use to consolidate infrastructure in branch locations to a central data centre requires knowledge of the volume and type of traffic to be accelerated and provisioning of adequate bandwidth to handle the anticipated load.