Tag Archives: Messaging

Uncovering Time in the Financial Markets

In this era of low-latency, high-performance electronic and algorithmic trading, vendors, regulators and business strategists continue to misinform and sometimes disinform industry participants with references to time. Vendors, for example, can selectively manipulate their marketing campaigns to suggest dubious sub-millisecond advantages over competitor technologies. Regulators, who continue their ambitious drive to innovate for the twenty-first century industry changes, may get a bit ahead of themselves when they quote temporal constraints without providing the appropriate clock synchronization context. Investment banks and brokerage firms continue to preach the million-dollar advantages of millisecond improvements in their trade lifecycle.

The widespread industry shifts in the financial markets have created an unprecedented and collective awareness and sensitivity to small intervals of time. The fact is that despite driving both regulatory and strategic policies, the quoted measure of these intervals remains another piece of misinformation and sometimes disinformation that misleads and confuses industry stakeholders.

In this three-part series, I’ll first show examples of time’s importance from a financial market regulatory and strategic perspective. Second, I’ll show exactly how and why this time is misinterpreted. Finally, I’ll talk about how clock synchronization techniques can be used to better rationalize the measure of time across system boundaries.

“5 microseconds”…You Said What?

Measuring message latency, especially for the data volumes and latency thresholds expected by Wall Street, is tricky business these days, as we’ve previously covered.

Even trickier is finding clarity in the midst of confusing and too often inaccurate media coverage on the topic, as shown in a recent article from the Securities Industry News. I’m specifically referring to a low-latency monitoring solution vendor who states that today’s algo trading engines require end-to-end network latencies of “less than 5 microseconds with no packet loss”.

In 2008, a colocated trading engine, which minimizes propagation delay, can expect end-to-end latency on the order of a couple of milliseconds at best. End-to-end, in this context, typically refers to the time from when an algo issues a buy/sell order to the time a receiving system acknowledges and executes that order.

Can the latency between these two points really be 5 microseconds? Highly unlikely. There are many reasons for this, which will be covered over time. For now I’ll mention that off-the-shelf clock synchronization solutions, a prerequisite to measuring message latency across system boundaries, just can’t support an accuracy of 5 microseconds.
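
To see why clock synchronization matters at this scale, here is a minimal sketch of how an offset between two clocks leaks into a one-way latency measurement. The 5 microsecond latency and 50 microsecond offset are hypothetical values chosen for illustration, not measurements, and the function name is my own.

```python
# Illustrative numbers only; none of these values come from a real measurement.
TRUE_LATENCY_US = 5.0    # hypothetical true one-way latency
CLOCK_OFFSET_US = 50.0   # hypothetical amount the receiver's clock runs ahead of the sender's


def measured_one_way_latency_us(send_ts_us, true_latency_us, clock_offset_us):
    """Latency computed from a sender timestamp and a receiver timestamp.

    The receiver stamps the arrival with its own clock, which differs from
    the sender's clock by clock_offset_us.
    """
    recv_ts_us = send_ts_us + true_latency_us + clock_offset_us
    return recv_ts_us - send_ts_us


measured = measured_one_way_latency_us(1_000_000.0, TRUE_LATENCY_US, CLOCK_OFFSET_US)
print(f"true latency: {TRUE_LATENCY_US} us, measured latency: {measured} us")
```

The measured figure is the true latency plus the clock offset, so a claim of 5 microseconds across system boundaries only means something if the clocks agree to well within 5 microseconds.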

Designing for Performance on Wall Street – The Need For Speed

Collapsing Time

While the impact has already been enormous, history will show how the shift from floor-based specialist trading to electronic trading changed the way investors, specialists, investment banks, brokers, exchanges and other industry participants make their money. Wall Street as a whole is now firmly entrenched in this new electronic trading frontier and the barriers to entry have shifted from the human imperfections of floor based traders or specialists, to the high-speed, low latency capabilities of profit seeking electronic algorithms.

Low latency in the scope of electronic trading refers to the utilization of high-performance technology that collapses the time between price discovery (i.e. 100 shares of IBM are now available at $100.00) and the execution of orders (i.e. buy or sell) at the newly discovered price. Electronic trading has created a world where the lifecycle of price discovery to trade execution is on the order of single-digit milliseconds.

Time is Money

Previously, I talked about the bandwidth problem. Inability to handle the required bandwidth utilizations of modern market data feeds will certainly cause significant delays in this millisecond-sensitive trade lifecycle, resulting in lost profits. However, the single most important need that has resulted from the unanimous shift to electronic trading is the need for speed, where speed refers to the ability to “see” stock prices as quickly as they appear in the electronic marketplace and similarly the ability to immediately trade on that price before competitors do.

Some of the low-latency design strategies or techniques exhibit the elegant characteristic of solving the bandwidth problem as well as the need for speed. For example, colocating your electronic trading algorithm in the same facility as an exchange’s matching engines (i.e. the systems that execute the buy/sell orders) will not only save your firm the wide-area network infrastructure required to feed market data to your trading algorithm, but will also minimize the propagation delays between market data sources and execution venues. Incredibly, some of these solutions, such as FAST compression, can theoretically address the bandwidth problem, the need for speed, and the storage dilemma.

Low-Latency Approaches

How does Wall Street solve the need for speed? Here are just some of the approaches used to minimize stock trading related delays:

Chip Level Multiprocessors (CMP)

When Intel’s microprocessors started melting because of excessive heat, the multi-core chip industry became mainstream. Smaller multiple cores on a single chip could now permit multi-threaded code to achieve true parallelism while collapsing the time it takes to complete processing tasks. Multi-core chips from Intel and AMD have a strong presence in the capital markets and can achieve remarkable performance as shown in SPEC benchmarks.

An emerging challenge on Wall Street is to deploy microprocessor architectures capable of scaling to the enormous processing required by risk-modeling and algorithmic-trading solutions. If one-core architectures encountered space and heat limitations which eventually led to the introduction of multi-core architectures, what new limitations will emerge? The shared message bus found with existing multi-core processors is one such limitation as the number of cores multiplies. Vendors such as Tilera are innovating around these limitations and you can expect more to follow. Furthermore, evidence is building to support the notion that multi-core microprocessor architectures, and the threading model behind them, are inherently flawed. Multi-core CPUs may provide near-term flexibility for designers and engineers looking to tap more processing power from a single machine. Long term, however, they may be doing more harm than good.

Concurrency

With multiple cores now in place, the software and hardware community are steadily catching up. For example, older versions of Microsoft’s Network Driver Interface Specification (NDIS) would limit protocol processing to a single CPU. NDIS 6.0 introduced a new feature called Receive Side Scaling (RSS) which enables message processing from the NIC to be distributed across the multiple cores on the host server.
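
The idea behind spreading receive processing across cores can be sketched in a few lines. This is an illustration of the concept only, not of NDIS or any NIC driver: the function name, the queue count and the use of CRC32 are all assumptions, and real RSS implementations use their own hash function (commonly a Toeplitz hash).

```python
import zlib

NUM_QUEUES = 4  # hypothetical number of receive queues, one per core


def pick_queue(src_ip, src_port, dst_ip, dst_port, num_queues=NUM_QUEUES):
    """Map a flow's 4-tuple to a receive queue.

    CRC32 is a stand-in hash used for illustration only; it is not the
    hash that RSS hardware uses.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues


# Packets from the same flow always land on the same queue, which keeps
# a flow's packets in order while different flows spread across cores.
flow = ("10.0.0.1", 5000, "10.0.0.2", 6000)
print(pick_queue(*flow), pick_queue(*flow))
```

Hashing on the flow rather than distributing packets round-robin is the design choice worth noting: it trades perfectly even load for in-order delivery within each flow.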

As Herb Sutter explains in his paper “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”, software applications will increasingly need to be concurrent if they want to exploit CPU throughput gains. The problem is that concurrency remains a challenge from an education and training perspective, as described in David A. Patterson’s paper. Conceptually, concurrency can serve the need for speed. In practice, this approach remains a challenging one.

Colocation

Colocation is a fascinating approach towards achieving low latency, mainly because it reconfigures physical proximity between application stacks instead of relying on a sophisticated technology approach. We’ve already shown how it can minimize the bandwidth requirements for a firm’s algorithmic trading platforms, but its biggest accomplishment is to minimize the distance between electronic trading platforms and the systems that execute the trades. Organizations such as BT Radianz have armed their high-performance datacenters with the fastest, highest throughput technology on the planet. When coupled with colocated hosting services, these data centers provide the lowest latency money can buy while opening up new opportunities to translate this value throughout the application stack, starting at the NIC and moving on up.

The exchanges themselves are also using colocation services as a way to attract customers and introduce new sources of revenue. For example, the International Securities Exchange (ISE) offers colocation services while promising 200 microsecond service levels.

Hardware Accelerators

Field Programmable Gate Arrays

The name says it all – an integrated circuit that can be customized for a specific solution domain. Specialized coprocessors have existed for years, handling floating point calculations, video processing and other processing intensive tasks. FPGAs build on this by offering design tools that allow programmers to customize the behavior of the FPGA’s integrated circuit, usually through a high-level programming language which is then “compiled” into the board itself. One example of how FPGA boards are being deployed on Wall Street is the replacement of software feed handlers, the components that read, transform and route market data feeds, with their FPGA equivalents. This approach results in higher throughput and lower latency because message processing is handled by the customized FPGA board instead of the host CPU/OS, saving the precious cycles that would have been required for moving messages up the protocol stack and interrupting the kernel.

ACTIV Financial, a leading vendor of a feed handling solution, claims that the introduction of FPGA accelerators to their feed processing platform reduced the feed processing latency by a factor of ten while allowing them to reduce the servers required to process some US market data feeds from 12 servers, in the software based feed processing approach, to just one server in the FPGA accelerated approach. Celoxica is another firm specializing in FPGA solutions for Wall Street’s electronic trading. Celoxica’s hardware accelerated trading solution promises microsecond latency between host NIC and user application with support for throughput rates reaching 7 million messages per second.

TCP Offload Engine

The idea with TCP Offload Engines (TOE) is for the host operating system to offload processing of TCP messages to hardware located on the network interface card itself, thus decreasing CPU utilization while increasing outbound throughput. Windows Server 2003 includes the Chimney Offload architecture, which defines the hooks required for OEM and third-party hardware vendors to implement layers 1, 2, 3 and 4 of the OSI protocol stack in the NIC itself, before passing the message to the host operating system’s protocol handlers. Similar examples of offload technology include TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO), where the NIC handles the segmenting of large blocks of data into packets.

Network Processing Offload

Coming Soon

Kernel Bypass

Coming soon

High-Performance Interconnections (I/O)

Infiniband

From the Infiniband Trade Association website:

In 1999, two competing input/output (I/O) standards called Future I/O (developed by Compaq, IBM and Hewlett-Packard) and Next Generation I/O (developed by Intel, Microsoft and Sun) merged into a unified I/O standard called InfiniBand. InfiniBand is an industry-standard specification that defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems. InfiniBand is a true fabric architecture that leverages switched, point-to-point channels with data transfers up to 120 gigabits per second, both in chassis backplane applications as well as through external copper and optical fiber connections.

Infiniband technologies also exhibit the characteristic of solving multiple problems facing Wall Street today including bandwidth, latency, efficiency, reliability and data integrity. Visit the Voltaire website for a vendor specific look into the performance benefits of Infiniband on Wall Street.

Please check back in second quarter 2008 when The Techdoer Times presents a detailed look into the many existing and future applications of Infiniband technology.

Remote Direct Memory Access

RDMA is a zero-copy protocol specification for transferring data between memory modules of separate computers without involving either the source or target operating system or CPU, resulting in low-latency and high-throughput computing.

Fibre Channel

Gigabit Ethernet (GbE) & 10 Gigabit Ethernet (10GbE)

AMD HyperTransport (Chip-level)

Intel Common System Interface (Chip-level)

Ethernet Virtual Private Line (EVPL) and Ethernet Virtual Connection (EVC)

Faster Compression

As we mentioned in the bandwidth problem, some firms are relying on innovations in compression as a way to minimize escalating bandwidth costs. FAST is an example of this but there’s more. In our previous postings on measuring the latency in messaging systems we explained how the different components of latency react to variations in packet size or transmission rates. Herein lies the potential latency improvements resulting from the adoption of FAST. FAST can potentially minimize packetization and serialization delays. It is true that the process of compressing messages requires additional CPU cycles and therefore adds to the application delay, however, depending on the nature of the solution, this additional delay may be offset by the savings that result from serializing significantly smaller sized packets onto the wire, potentially 80% smaller. FAST can be incredibly effective at bandwidth reduction and can potentially reduce end-to-end latency as well.
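
The trade-off described above can be put into numbers with a rough sketch. The 50 microsecond codec cost is a hypothetical placeholder rather than a measured FAST figure, and the function names are my own; only the 80% size reduction and the link rates come from this series.

```python
def serialization_delay_s(size_bytes, rate_bps):
    """Time to place a packet of size_bytes onto a link running at rate_bps."""
    return size_bytes * 8 / rate_bps


def net_latency_change_s(original_bytes, size_reduction, codec_cost_s, rate_bps):
    """Change in latency after adding compression; negative means faster overall.

    size_reduction is the fraction removed by compression (0.8 = 80% smaller).
    codec_cost_s is the combined compress and decompress time (hypothetical).
    """
    compressed_bytes = original_bytes * (1 - size_reduction)
    saved = (serialization_delay_s(original_bytes, rate_bps)
             - serialization_delay_s(compressed_bytes, rate_bps))
    return codec_cost_s - saved


CODEC_COST_S = 50e-6  # hypothetical: 50 microseconds of extra CPU time per packet

for name, bps in {"T1": 1_544_000, "Gigabit Ethernet": 1_000_000_000}.items():
    change = net_latency_change_s(1500, 0.8, CODEC_COST_S, bps)
    print(f"{name}: net latency change {change * 1e6:+.1f} microseconds")
```

Under these assumed numbers compression is a clear win on a T1 line, where serialization dominates, and a small loss on Gigabit Ethernet, where the codec cost outweighs the serialization savings. That is the "depending on the nature of the solution" caveat in concrete form.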

Messaging

Messaging technology has evolved greatly to the point where requirements for speed and reliability are no longer in conflict. Publish/Subscribe messaging paradigms can be supported with different levels of service quality, ensuring that latency-sensitive subscribers can forgo message recovery for the sake of speed, while data-completeness sensitive subscribers can rely on extremely fast message recovery built on top of protocols and routing technologies such as UDP and multicast. These real-time messaging technologies also ensure robustness and scalability across a number of downstream subscribers. Cases where slow subscribers begin to “scream” for message retransmission (a.k.a. ‘crying baby’) can be handled individually and gracefully by the messaging layer, ensuring uninterrupted service to other subscribers. Messaging technology vendors include:

Multicast Routing

As mentioned in the bandwidth problem, multicast routing technologies can potentially reduce latency in addition to bandwidth utilization. The latency play results from the fact that multicast packets are rejected or accepted at the Network Interface Card (NIC) level, and not the more CPU expensive kernel level.

Data Grids/Compute Grids

With the industry’s reliance on the timely evaluation of strategic trading and risk models comes the need to access and crunch large amounts of data efficiently. This reliance has spawned innovations in the form of data and compute grids which offer highly-resilient, scalable distributed processing infrastructure on demand for compute intensive as well as data intensive environments. Data grids, in particular, offer a high-performance, highly-resilient middle-tier data layer that sits on top of storage technologies and other information sources but offers ubiquitous data access to enterprise business processes. Key vendors or technologies in this space include the following:

  • Gigaspaces
  • Gemstone
  • Tangosol
  • Intersystems
  • Memcache
  • DataSynapse
  • Terracotta

Collapsing Distributed Processing

    Yet another approach to decreasing the overall end-to-end latency of messaging systems is to collapse the ends, which also minimizes the propagation delays. The closer each distributed processing node is to being within the same process of dependent nodes, the better the overall performance. The rise of Direct Market Access (DMA) approaches where firms connect directly to the exchanges and other providers of market data, instead of third party vendors of the data is an example of this. DMA alone spawned a new market data distribution industry with the net result being end-to-end latency for market data measuring in the low milliseconds, which for a while remained faster than the same data distributed by vendors such as Reuters and Bloomberg.

    Thus far we’ve shown how firms in the capital markets are confronting their bandwidth problem and need for speed. The third category of challenges is the Storage Dilemma facing these firms.

    Measuring Latency – Bringing it all together – Part 5

    Part 4 presented propagation delay, or the delay incurred as packets travel between sending and receiving nodes. The Measuring Latency series has illustrated some of the low-level but nonetheless key contributors to messaging latency. Understanding these contributors can help information technology customers better decide between competing low-latency technologies. There are many other factors that contribute to latency along the message path including those described below.

    Application Latency

    This delay refers to the time it takes applications to route, transform, embellish or apply any other business rule, prior to sending messages to downstream applications. The application’s architectural characteristics are key to minimizing this latency. Threading, pipelining, caching, and direct memory access are just a few of the performance design techniques that can minimize application latency.

    (click here for article series on how Financial firms on Wall Street find innovative techniques that minimize application latency)

    Device Latency

    Routers and switches can add between 30 microseconds and 1000 milliseconds to a message’s overall latency. Configuration options with these devices can add even more latency. Switches, for example, can be configured to forward frames with store-and-forward or cut-through semantics. With store-and-forward, a switch will wait to forward a frame until it has received the entire frame. Cut-through configurations, on the other hand, allow the switch to operate at wire speed by forwarding a frame as soon as the destination address is read.
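
The difference between the two forwarding modes can be estimated with a rough sketch. It assumes the destination address occupies the first 6 bytes of the frame and ignores the preamble and the switch's internal processing time; the function names are mine.

```python
def store_and_forward_delay_s(frame_bytes, rate_bps):
    """Time the switch waits to receive the whole frame before forwarding it."""
    return frame_bytes * 8 / rate_bps


def cut_through_delay_s(rate_bps, bytes_before_forwarding=6):
    """Time the switch waits until the destination address has been read.

    Assumption: the destination address is the first 6 bytes of the frame.
    """
    return bytes_before_forwarding * 8 / rate_bps


GIGABIT_BPS = 1_000_000_000
saf = store_and_forward_delay_s(1518, GIGABIT_BPS)
ct = cut_through_delay_s(GIGABIT_BPS)
print(f"store-and-forward: {saf * 1e6:.3f} us, cut-through: {ct * 1e6:.3f} us")
```

For a full-size 1518 byte frame at gigabit speed this works out to roughly 12 microseconds per store-and-forward hop against a small fraction of a microsecond for cut-through, which adds up across several hops.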

    Before You Start Measuring

    You must choose your endpoints wisely. Your source and destination endpoints will incorporate some or all of the delays I’ve presented in this series. Not knowing which of the latency delays are included in this message path will make it almost impossible to act upon the results.

    Before you measure and collect timestamps, keep in mind the need to synchronize the clocks between message processing nodes. Measuring without clock synchronization, or without understanding the variability between these clocks, will only destroy the integrity of your measurements. Some other considerations:

    • Precision of your timestamps (e.g. will millisecond precision suffice?)
    • Latency of your measuring tools (i.e. how much of the overall latency belongs to your testing tools themselves?)
    • Relevancy of your configuration (i.e. do the software/hardware specs reflect your target environment?)
    • Network congestion (i.e. have other applications/users been locked out of the testing network?)
    • Message rate (i.e. measure at a message rate that reflects your target environment)
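
As one example of checking the latency of the measuring tools themselves, the sketch below estimates how long a single timestamp call takes by reading the clock twice in a row. It uses Python's `time.perf_counter_ns`; the function name and sample count are arbitrary choices, and the result will vary by machine.

```python
import time


def timestamp_overhead_ns(samples=10_000):
    """Estimate the cost of one timestamp call from back-to-back clock readings."""
    deltas = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        deltas.append(end - start)
    deltas.sort()
    return deltas[len(deltas) // 2]  # median gap between consecutive readings


print(f"median timestamp overhead: {timestamp_overhead_ns()} ns")
```

If this overhead is not small compared with the latencies you expect to see, the timestamps themselves are distorting the measurement.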

    Organize your Measurements

    Depending on the frequency of measurements you may find yourself with large amounts of data representing timestamps between your chosen endpoints. As you process through this information use the following statistics to summarize the latency characteristics:

    • mean – The mean is the sum of the observations divided by the number of observations. When referring to the average, we’re referring to the arithmetic mean.
    • median – The number separating the higher half of a sample, a population, or a probability distribution, from the lower half.
    • standard deviation – Describes the spread of data from the mean, it is defined as the square root of the variance.
    • percentile – The value of a variable below which a certain percent of observations fall.

    In order to accurately communicate the latency characteristics of your application you must include statistics such as the arithmetic mean as well as median, standard deviation and percentile statistics to communicate the spread of these latency measures. Some high-performance computing requirements, such as those surrounding electronic trading on Wall Street, are especially sensitive to measures of latency that fall outside the mean.
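
The statistics listed above can be computed with Python's standard library. The samples below are made-up values, including one deliberate outlier, and the nearest-rank percentile is one of several common percentile definitions rather than the only correct one.

```python
import math
import statistics

# Hypothetical latency samples in microseconds; not real measurements.
latencies_us = [110, 95, 102, 98, 105, 101, 99, 2500, 97, 103]


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of the observations at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


mean_us = statistics.mean(latencies_us)
median_us = statistics.median(latencies_us)
stdev_us = statistics.stdev(latencies_us)  # sample standard deviation
p99_us = percentile(latencies_us, 99)

print(f"mean={mean_us} median={median_us} stdev={stdev_us:.1f} p99={p99_us}")
```

With a single slow message in the sample, the mean (341) is more than three times the median (101.5). Reporting the mean alone would hide both the typical case and the outlier, which is why the spread statistics matter.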

    I hope you’ve found this series helpful and I look forward to your comments.

    Good luck!

    Measuring Latency – Propagation Delay – 4

    Part 3 to this series presented serialization delay, or the delay incurred while placing packets of data on the wire. Part 4 of this series on measuring the latency in messaging systems will focus on the latency of these packets as they travel between the sending and receiving nodes.

    Propagation Delay

    The speed of light, circa 186,000 miles per second, is clearly the upper limit for the speed at which packets can travel between sending and receiving nodes. The material used to wire and connect computer networks, be it copper or fiber, further limits the speed at which messages can travel to a fraction of the speed of light, roughly 75%.

    Copper Cabling

    The Telecommunications Industry Association (TIA) has developed standards over the years to address commercial cabling for telecom products and services ensuring minimum quality thresholds are met throughout the world. Cable types are typically characterized with performance attributes like those shown in the table below.

    [Table: propagation characteristics by cable category; not reproduced here]

    The table displays the propagation characteristics of each category of cable. The propagation delay is typically quoted for a 100 meter cable, which represents the maximum recommended distance for cables in a 10/100/1000baseT[X] environment.

    To illustrate, at a high level, how propagation delay affects common messaging applications, imagine a streaming application that generates 2,600 new 100 byte messages per second. One second’s worth of messages from this application would require about 1.4 seconds to travel between nodes in New York and California when traveling at T1 speeds. Circa 1.35 seconds of this time (2,600 messages × 100 bytes / 193,000 bytes per second) would be spent serializing the messages onto the wire, and another 20+ milliseconds (3,200 miles / 75% of the speed of light) would be spent as the messages propagate between New York and California. Additional latencies are incurred as each network device routes the packet along its path. We’ll cover these other latencies in part 5 of this series.
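
The arithmetic in this example can be checked with a short calculation. The constants come from this series (a T1 rate of 1,544,000 bps, a speed of light of 186,000 miles per second and a 75% velocity factor); the variable names are my own.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_000
VELOCITY_FACTOR = 0.75   # fraction of the speed of light achieved in the cable
T1_BPS = 1_544_000

messages_per_second = 2_600
message_bytes = 100
distance_miles = 3_200   # New York to California, as used in the text

# Time to place one second's worth of messages onto the wire.
serialization_s = messages_per_second * message_bytes * 8 / T1_BPS

# Time for a signal to cover the distance at 75% of the speed of light.
propagation_s = distance_miles / (VELOCITY_FACTOR * SPEED_OF_LIGHT_MILES_PER_S)

print(f"serialization: {serialization_s:.3f} s")
print(f"propagation:   {propagation_s * 1e3:.1f} ms")
```

This gives about 1.35 seconds of serialization and about 23 milliseconds of propagation. Since one second of traffic takes longer than one second to serialize, a T1 line could not keep up with this message rate in the first place.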

    Fiber Optic Cabling

    Copper wiring is not the only cable type that is used in modern computer networks. Fiber optic cables make up the backbone wiring technology to many of the world’s computer networks. We read above that various types of copper cables exhibit roughly 5.48 nanosecond propagation delay per meter of cable.   For a practical understanding of the propagation delay characteristics of fiber optic cable, refer to the comments following this post.

    Cable Lengths

    The minimum packet size for Ethernet networks is closely related to the maximum cable length for segmented nodes in these networks.  Click here for a detailed description of this relationship.

    Minimizing Propagation Delay

    Latency-sensitive industries, such as financial services, have relied heavily on tactics such as colocation, which minimizes the geographic distance between nodes along the message path by hosting sender and receiver in the same physical location. As the geographic distance is minimized, so is the propagation delay between nodes.

    In part 5 of this series, we’ll wrap up the topic of measuring latency by covering other factors that contribute to latency.

    Measuring Latency – Serialization Delay – 3

    Part 2 to this series presented packetization delay, or the delay incurred as all systems, along the message path, create and reshape packets. Part 3 of this series on measuring the latency in messaging systems will focus on serialization delay, or the delay in moving packets from the Network Interface Controller’s (NIC) transmit buffer to the wire.

    Minimizing Serialization Delay

    Higher-bandwidth technologies play a much greater role in reducing the serialization delay than does changing the packet size. This is because serialization delay is a function of packet size and transmission rate, expressed as:

    Serialization Delay = Size of Packet (bits) / Transmission Rate (bps)

    A packet size of 1500 bytes, transmitted using T1 technology (1,544,000 bps), would produce a serialization delay of about 8 milliseconds. The same 1500 byte packet using 56K modem technology (57,344 bps) would result in a serialization delay of roughly 200 milliseconds, whereas using Gigabit Ethernet technology (1,000,000,000 bps) would reduce the 1500 byte packet’s serialization delay to 12 microseconds.
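
The formula is simple enough to evaluate directly. The sketch below applies it to the three link rates quoted above; the function and dictionary names are mine.

```python
def serialization_delay_s(packet_bytes, rate_bps):
    """Serialization delay = size of packet (bits) / transmission rate (bps)."""
    return packet_bytes * 8 / rate_bps


LINK_RATES_BPS = {
    "56K modem": 57_344,
    "T1": 1_544_000,
    "Gigabit Ethernet": 1_000_000_000,
}

for name, rate in LINK_RATES_BPS.items():
    delay_ms = serialization_delay_s(1500, rate) * 1e3
    print(f"{name}: {delay_ms:.3f} ms")
```

Halving the packet size halves the delay on any given link, whereas moving from T1 to Gigabit Ethernet cuts it by a factor of more than 600, which is why the transmission rate is the bigger lever.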

    In part 4 of this series, I’ll cover the third of the three latency delays, namely propagation delay.

    Measuring Latency – Packetization Delay – 2

    Part 1 to this series presented 3 types of delays that constitute latency measurements in messaging systems. Packetization delay represents the first type of delay, which we will discuss next. All references to packets below are for the IPv4 definition of packets.

    What is Packetization Delay?

    This delay refers to the time it takes a system to create and fill packets of data for sending over internet protocol related technologies. A packet represents a fundamental unit of data for IP related technologies and the delay is comprised of the time it takes to create the packet’s headers coupled with the time it takes to fill the packet’s upper data layer, or payload, with application specific data.

    Creating Packets

    A packet’s header is structured with 20 bytes of fixed-length fields and 4 bytes of variable-length fields, as shown in the diagram below. The first component of packetization delay is the amount of time it takes the sending system to populate this header information. Generally speaking, this time is negligible compared to the time it takes to populate the upper layer data portion of the packet. The Maximum Transmission Unit, or MTU, represents the largest packet size that any given layer of the IP protocol stack can pass to another layer. The maximum Ethernet frame size is 1518 bytes, which includes the 18 bytes of Ethernet header and trailer information. This results in a 1500 byte MTU for the IP layer, which generally is the largest packet size for IP related technology. Subtracting the 24 bytes of the IP header itself results in a maximum payload of 1476 bytes and a minimum payload of 40 bytes.

    [Figure: IP packet header layout (ippacket11.jpg)]

    ‘Hydrating Packets’

    By ‘hydrating’ I mean the time it takes to fill the upper data layer portion of the packet. This time is regulated by the size of the upper data layer, the rate of message creation on the sending system and the message batching algorithm used in the sender’s protocol implementation.  The batching of multiple messages into a single packet may increase the overall message-latency as the batching algorithm waits for additional messages before sending the packet.  To put this into context, a real-time streaming application that sends 100 50-byte messages per second is generating 125 packets per second if the minimum Ethernet packet size of 64 bytes is used, where 24 of these bytes correspond to the header fields, and the remaining 40 bytes are used to store the upper data layer information.  Disabling the batching of messages could improve latency as packets are sent upon message arrival, however network and CPU resources may get saturated processing the flood of smaller sized packets.
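
The packet-rate arithmetic and the cost of waiting to fill a packet can be sketched as follows. The sketch follows the simplification used above, where messages are packed back-to-back into payloads and may span packets; the function names are my own.

```python
import math


def packets_per_second(msgs_per_s, msg_bytes, payload_bytes):
    """Packets needed each second when messages are packed back-to-back."""
    return math.ceil(msgs_per_s * msg_bytes / payload_bytes)


def fill_time_s(msgs_per_s, msg_bytes, payload_bytes):
    """Time needed to accumulate enough message data to fill one payload."""
    return payload_bytes / (msgs_per_s * msg_bytes)


for payload in (40, 1476):
    pps = packets_per_second(100, 50, payload)
    wait_ms = fill_time_s(100, 50, payload) * 1e3
    print(f"payload {payload} bytes: {pps} packets/s, {wait_ms:.1f} ms to fill")
```

At 100 messages per second, the 40 byte payload produces 125 packets per second with about 8 milliseconds of fill time, while the 1476 byte payload needs only 4 packets per second but takes nearly 300 milliseconds to fill. That is the batching trade-off between resource usage and latency.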

    Packet Size and Latency

    While message size and message rate are directly related to the system’s functional requirements, packet size can be configured to suit the system’s non-functional requirements. One can hypothesize that smaller packet sizes introduce inefficiencies as network and CPU utilization increase in order to process the growth in smaller sized packets. Similarly, larger packet sizes result in more time spent waiting to fill packets, although efficiencies can be gained by network and CPU resources processing significantly fewer packets. Determining the optimal packet size for your messaging application requires thorough testing that highlights the impact this size has on latency.

    Packet Fragmentation

    MTU, or Maximum Transmission Unit, refers to the largest packet size supported by any node along the message path. When a packet’s size is larger than the MTU for a receiving node, the packet needs to be fragmented into smaller packets. This results in packet fragmentation which negatively impacts the message latency for two reasons. First, routers need to perform the fragmentation operation, which costs time and router resources. Second, downstream nodes are now required to process more packets which results in the potential inefficiencies described in the section above.
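
To get a feel for how many extra packets fragmentation creates, here is a rough sketch of the fragment count for an IPv4 packet. It assumes a 20 byte header without options by default (the header size is a parameter, so the 24 byte figure used earlier in this series can be substituted), and it applies the IPv4 rule that every fragment except the last carries a multiple of 8 payload bytes. The function name is mine.

```python
import math


def fragment_count(packet_bytes, mtu_bytes, header_bytes=20):
    """Number of IPv4 fragments needed to carry a packet over a link with this MTU."""
    if packet_bytes <= mtu_bytes:
        return 1  # fits as-is, no fragmentation needed
    payload = packet_bytes - header_bytes
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu_bytes - header_bytes) // 8 * 8
    return math.ceil(payload / per_fragment)


print(fragment_count(1500, 1500))  # fits within the MTU
print(fragment_count(1500, 576))   # must be split for a smaller-MTU hop
print(fragment_count(4000, 1500))  # oversized packet on an Ethernet path
```

Each fragment carries its own header and has to be handled separately downstream, so one oversized packet turns into several packets' worth of processing.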

    In part 3 of this series, I’ll cover the second of the three latency delays, namely serialization delay.

    Measuring Latency – Introduction – 1

    Latency Introduction

    In this 5-part series, we’ll cover the topic of latency in messaging systems. For the purpose of this series, we refer to latency as the time it takes for binary or ASCII messages to travel from a sending source to a receiving destination. There are numerous factors along the message path that ultimately contribute to message latency, which we will cover in this series.

    As many of you know, latency, along with throughput is a key component of the performance equation for any information technology system. Despite marketing hype, latency will never be zero. Incredibly however, modern day hardware and software technologies are providing customers with the ability to achieve remarkably small latency measures. If you want to radically empower your high-performance messaging strategies it is essential you understand how the functions and characteristics of all components along the message path contribute to latency.

    Air Travel Example

    During a recent trip to Argentina, quick math allowed me to estimate travel time between my originating city in New Jersey and my destination, Buenos Aires, to be 15 hours and 7 minutes. These estimates were based on a combination of tacit and explicit knowledge regarding the time it takes to pack, get to the airport, check in, clear security, reported flight times, as well as similar overhead upon arriving at my destination (i.e. retrieve baggage, clear customs and immigration etc). While these estimates held upon reaching the airport, they simply broke down once an 8 hour delay was announced by my chosen air carrier.

    Knowing the different sources of latency in my travel, as well as having empirical data that shows how these sources behave in different situations (i.e. holiday travel vs. weekend travel) can help me better design an itinerary where actual travel time is more closely aligned with estimated travel time.

    This same logic applies to low-latency messaging technologies. Ensuring your low-latency needs are met requires an understanding of the different components of latency along the message path.

    The Delays of Latency

    In this series, we will present 3 types of delays contributing to messaging latency, namely packetization, serialization and propagation delay.
