Monthly Archives: January 2008

Designing for Performance on Wall Street – The Need For Speed

Collapsing Time

While the impact has already been enormous, history will show how the shift from floor-based specialist trading to electronic trading changed the way investors, specialists, investment banks, brokers, exchanges and other industry participants make their money. Wall Street as a whole is now firmly entrenched in this new electronic trading frontier, and the barriers to entry have shifted from the human imperfections of floor-based traders or specialists to the high-speed, low-latency capabilities of profit-seeking electronic algorithms.

Low latency, in the scope of electronic trading, refers to the use of high-performance technology that collapses the time between price discovery (e.g. 100 shares of IBM are now available at $100.00) and the execution of orders (i.e. buy or sell) at the newly discovered price. Electronic trading has created a world where the lifecycle from price discovery to trade execution is on the order of single-digit milliseconds.

Time is Money

Previously, I talked about the bandwidth problem. An inability to handle the bandwidth required by modern market data feeds will certainly cause significant delays in this millisecond-sensitive trade lifecycle, resulting in lost profits. However, the single most important need to result from the industry-wide shift to electronic trading is the need for speed, where speed refers to the ability to “see” stock prices as quickly as they appear in the electronic marketplace and, similarly, the ability to trade on those prices before competitors do.

Some of the low-latency design strategies or techniques exhibit the elegant characteristic of solving the bandwidth problem as well as the need for speed. For example, colocating your electronic trading algorithm in the same facility as an exchange’s matching engines (i.e. the systems that execute the buy/sell orders) will not only save your firm the wide-area network infrastructure required to feed market data to your trading algorithm, but will also minimize the propagation delays between market data sources and execution venues. Incredibly, some of these solutions, such as FAST compression, can theoretically address the bandwidth problem, the need for speed, and the storage dilemma.

Low-Latency Approaches

How does Wall Street solve the need for speed? Here are just some of the approaches used to minimize stock trading related delays:

Chip Level Multiprocessors (CMP)

When Intel’s microprocessors started melting because of excessive heat, the multi-core chip industry became mainstream. Smaller multiple cores on a single chip could now permit multi-threaded code to achieve true parallelism while collapsing the time it takes to complete processing tasks. Multi-core chips from Intel and AMD have a strong presence in the capital markets and can achieve remarkable performance as shown in SPEC benchmarks.

An emerging challenge on Wall Street is to deploy microprocessor architectures capable of scaling to the enormous processing required by risk-modeling and algorithmic-trading solutions. If single-core architectures encountered space and heat limitations which eventually led to the introduction of multi-core architectures, what new limitations will emerge? The shared message bus found in existing multi-core processors is one such limitation as the number of cores multiplies. Vendors such as Tilera are innovating around these limitations, and you can expect more to follow. Furthermore, evidence is building to support the notion that multi-core microprocessor architectures, and the threading model behind them, are inherently flawed. Multi-core CPUs may provide near-term flexibility for designers and engineers looking to tap more processing power from a single machine. In the long term, however, they may be doing more harm than good.


With multiple cores now in place, the software and hardware communities are steadily catching up. For example, older versions of Microsoft’s Network Driver Interface Specification (NDIS) limited protocol processing to a single CPU. NDIS 6.0 introduced a feature called Receive Side Scaling (RSS), which enables message processing from the NIC to be distributed across the multiple cores on the host server.
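To make the idea behind RSS concrete, here is a minimal sketch of how a flow might be mapped to a receive queue, and hence to a core. It is an illustration only: the function name, the addresses and the use of CRC32 are my assumptions. Real RSS implementations typically use a keyed Toeplitz hash computed by the NIC.

```python
import zlib


def rss_queue(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
              n_queues: int) -> int:
    """Map a flow (its address/port 4-tuple) to one of n_queues receive queues.

    Packets of the same flow always land on the same queue, so per-flow
    ordering is preserved while different flows spread across cores.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    # CRC32 is a stand-in for illustration; it is not the hash NICs use.
    return zlib.crc32(key) % n_queues


# Hypothetical flows from eight feed sources to one handler on a 4-core host.
flows = [("10.0.0.1", 40000 + i, "10.0.0.9", 9000) for i in range(8)]
for flow in flows:
    print(flow, "-> queue", rss_queue(*flow, n_queues=4))
```

The property that matters is determinism per flow. The spread across queues depends on the hash and on how many distinct flows there are.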

As Herb Sutter explains in his paper “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”, software applications will increasingly need to be concurrent if they want to exploit CPU throughput gains. The problem is that concurrency remains a challenge from an education and training perspective, as described in David A. Patterson’s paper. Conceptually, concurrency can serve the need for speed. In practice, this approach remains a challenging one.


Colocation

Colocation is a fascinating approach towards achieving low latency, mainly because it reconfigures physical proximity between application stacks instead of relying on a sophisticated technology approach. We’ve already shown how it can minimize the bandwidth requirements for a firm’s algorithmic trading platforms, but its biggest accomplishment is to minimize the distance between electronic trading platforms and the systems that execute the trades. Organizations such as BT Radianz have armed their high-performance datacenters with the fastest, highest-throughput technology on the planet. When coupled with colocated hosting services, these data centers provide the lowest latency money can buy while opening up new opportunities to translate this value throughout the application stack, starting at the NIC and moving on up.

The exchanges themselves are also using colocation services as a way to attract customers and introduce new sources of revenue. For example, the International Securities Exchange (ISE) offers colocation services while promising 200-microsecond service levels.
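A rough back-of-the-envelope sketch shows why distance matters so much. The numbers are assumptions for illustration: light in optical fiber travels at roughly two-thirds of its speed in a vacuum (about 200,000 km/s), and the two distances below are hypothetical.

```python
# Assumption: signal speed in optical fiber, roughly two-thirds of c.
FIBER_SPEED_M_PER_S = 2.0e8


def propagation_delay_us(distance_m: float,
                         speed_m_per_s: float = FIBER_SPEED_M_PER_S) -> float:
    """One-way propagation delay in microseconds over the given distance."""
    return distance_m / speed_m_per_s * 1e6


colocated = propagation_delay_us(100)        # hypothetical 100 m run inside one facility
remote = propagation_delay_us(1_000_000)     # hypothetical 1,000 km WAN path

print(f"colocated: {colocated:.2f} us one way")
print(f"remote:    {remote:.0f} us one way")
```

Under these assumptions the remote path spends about 5 milliseconds in transit each way before any switching, queuing or processing delay is counted, which already consumes a single-digit millisecond budget.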

Hardware Accelerators

Field Programmable Gate Arrays

The name says it all: an integrated circuit that can be customized for a specific solution domain. Specialized coprocessors have existed for years, handling floating point calculations, video processing and other processing-intensive tasks. FPGAs build on this by offering design tools that allow programmers to customize the behavior of the integrated circuit, usually through a high-level programming language which is then “compiled” into the board itself.

One example of how FPGA boards are being deployed on Wall Street is the replacement of software feed handlers, the components that read, transform and route market data feeds, with their FPGA equivalents. This approach results in higher throughput and lower latency because message processing is handled by the customized FPGA board instead of the host CPU/OS, saving the precious cycles that would have been required for moving messages up the protocol stack and interrupting the kernel.

ACTIV Financial, a leading vendor of feed handling solutions, claims that introducing FPGA accelerators to their feed processing platform reduced feed processing latency by a factor of ten, while cutting the servers required to process some US market data feeds from 12 in the software-based approach to just one in the FPGA-accelerated approach. Celoxica is another firm specializing in FPGA solutions for Wall Street’s electronic trading. Celoxica’s hardware-accelerated trading solution promises microsecond latency between host NIC and user application, with support for throughput rates reaching 7 million messages per second.

TCP Offload Engine

The idea with TCP Offload Engines (TOE) is for the host operating system to offload processing of TCP messages to hardware located on the network interface card itself, thus decreasing CPU utilization while increasing outbound throughput. Windows Server 2003 includes the Chimney Offload architecture, which defines the hooks required for OEM and third-party hardware vendors to implement layers 1 through 4 of the OSI protocol stack in the NIC itself, before passing the message to the host operating system’s protocol handlers. Similar examples of offload technology include TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO), where large blocks of data are segmented into packets by the NIC (TSO) or as late as possible in the network stack (GSO).
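For readers unfamiliar with segmentation, the sketch below shows in software the work that segmentation offload takes away from the host CPU: cutting one large block into packet-sized payloads. The function name is hypothetical, and the 1460-byte maximum segment size is an assumption (a common value on standard 1500-byte Ethernet with IPv4 and TCP headers without options).

```python
from typing import List


def segment(payload: bytes, mss: int = 1460) -> List[bytes]:
    """Split one large send buffer into payloads of at most mss bytes.

    With segmentation offload the host hands the large buffer down once;
    without it, the host CPU performs this split for every send.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


block = bytes(64 * 1024)          # a hypothetical 64 KiB application write
segments = segment(block)
print(len(segments), "segments, last one", len(segments[-1]), "bytes")
```

A real implementation also builds per-segment headers, sequence numbers and checksums, which is where much of the saved CPU time comes from.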

Network Processing Offload

Coming Soon

Kernel Bypass

Coming soon

High-Performance Interconnections (I/O)


InfiniBand

From the InfiniBand Trade Association website:

In 1999, two competing input/output (I/O) standards called Future I/O (developed by Compaq, IBM and Hewlett-Packard) and Next Generation I/O (developed by Intel, Microsoft and Sun) merged into a unified I/O standard called InfiniBand. InfiniBand is an industry-standard specification that defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems. InfiniBand is a true fabric architecture that leverages switched, point-to-point channels with data transfers up to 120 gigabits per second, both in chassis backplane applications as well as through external copper and optical fiber connections.

InfiniBand technologies also exhibit the characteristic of solving multiple problems facing Wall Street today, including bandwidth, latency, efficiency, reliability and data integrity. Visit the Voltaire website for a vendor-specific look into the performance benefits of InfiniBand on Wall Street.

Please check back in the second quarter of 2008, when The Techdoer Times presents a detailed look into the many existing and future applications of InfiniBand technology.

Remote Direct Memory Access

RDMA is a zero-copy protocol specification for transferring data between the memory of separate computers without involving the operating system or CPU of either the source or the target, resulting in low-latency, high-throughput computing.

Fibre Channel

Gigabit Ethernet (GbE) & 10 Gigabit Ethernet (10GbE)

AMD HyperTransport (Chip-level)

Intel Common System Interface (Chip-level)

Ethernet Virtual Private Line (EVPL) and Ethernet Virtual Connection (EVC)

Faster Compression

As we mentioned in the bandwidth problem, some firms are relying on innovations in compression as a way to minimize escalating bandwidth costs. FAST is an example of this, but there’s more. In our previous postings on measuring the latency in messaging systems, we explained how the different components of latency react to variations in packet size or transmission rates. Herein lie the potential latency improvements resulting from the adoption of FAST: it can potentially minimize packetization and serialization delays. It is true that compressing messages requires additional CPU cycles and therefore adds to the application delay. However, depending on the nature of the solution, this additional delay may be offset by the savings that result from serializing significantly smaller packets, potentially 80% smaller, onto the wire. FAST can be incredibly effective at bandwidth reduction and can potentially reduce end-to-end latency as well.
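The trade-off can be sketched with simple arithmetic. Serialization delay is the message size in bits divided by the link rate, so compression pays off only when the CPU time it adds is smaller than the serialization time it saves. The link rate, message size and codec cost below are hypothetical figures chosen for illustration, not measurements.

```python
def serialization_delay_us(size_bytes: float, link_bps: float) -> float:
    """Time to clock a message onto the wire, in microseconds."""
    return size_bytes * 8 / link_bps * 1e6


def net_saving_us(raw_bytes: float, reduction: float, link_bps: float,
                  codec_cost_us: float) -> float:
    """Serialization time saved by compression minus the CPU time it adds.

    reduction is the fraction removed (0.8 means 80% smaller).
    A positive result means compression lowers end-to-end latency.
    """
    raw = serialization_delay_us(raw_bytes, link_bps)
    packed = serialization_delay_us(raw_bytes * (1 - reduction), link_bps)
    return (raw - packed) - codec_cost_us


LINK_BPS = 100e6     # assumption: a 100 Mbps link
RAW_BYTES = 250      # assumption: uncompressed message size
CODEC_COST_US = 5.0  # assumption: encode plus decode CPU time

print(serialization_delay_us(RAW_BYTES, LINK_BPS))             # about 20 us uncompressed
print(net_saving_us(RAW_BYTES, 0.8, LINK_BPS, CODEC_COST_US))  # about 11 us net saving
```

On a much faster link the serialization saving shrinks while the codec cost stays the same, which is why the benefit depends on the nature of the solution.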


Messaging

Messaging technology has evolved greatly, to the point where requirements for speed and reliability are no longer in conflict. Publish/subscribe messaging paradigms can be supported with different levels of service quality, ensuring that latency-sensitive subscribers can forgo message recovery for the sake of speed, while subscribers sensitive to data completeness can rely on extremely fast message recovery built on top of protocols and routing technologies such as UDP and multicast. These real-time messaging technologies also ensure robustness and scalability across a number of downstream subscribers. Cases where slow subscribers begin to “scream” for message retransmission (a.k.a. the “crying baby”) can be handled individually and gracefully by the messaging layer, ensuring uninterrupted service to other subscribers. Messaging technology vendors include:

Multicast Routing

As mentioned in the bandwidth problem, multicast routing technologies can potentially reduce latency in addition to bandwidth utilization. The latency benefit results from the fact that multicast packets are accepted or rejected at the Network Interface Card (NIC) level, and not at the more CPU-expensive kernel level.

Data Grids/Compute Grids

With the industry’s reliance on the timely evaluation of strategic trading and risk models comes the need to access and crunch large amounts of data efficiently. This reliance has spawned innovations in the form of data and compute grids which offer highly-resilient, scalable distributed processing infrastructure on demand for compute intensive as well as data intensive environments. Data grids, in particular, offer a high-performance, highly-resilient middle-tier data layer that sits on top of storage technologies and other information sources but offers ubiquitous data access to enterprise business processes. Key vendors or technologies in this space include the following:

  • Gigaspaces
  • Gemstone
  • Tangosol
  • Intersystems
  • Memcache
  • DataSynapse
  • Terracotta

    Collapsing Distributed Processing

    Yet another approach to decreasing the overall end-to-end latency of messaging systems is to collapse the ends, which also minimizes the propagation delays. The closer each distributed processing node is to being within the same process as the nodes it depends on, the better the overall performance. The rise of Direct Market Access (DMA) approaches, where firms connect directly to the exchanges and other providers of market data instead of to third-party vendors of the data, is an example of this. DMA alone spawned a new market data distribution industry, with the net result being end-to-end latency for market data measured in the low milliseconds, which for a while remained faster than the same data distributed by vendors such as Reuters and Bloomberg.

    Thus far we’ve shown how firms in the capital markets are confronting their bandwidth problem and need for speed. The third category of challenges is the Storage Dilemma facing these firms.


    Challenges Facing Virtual Teams

    Virtual teams are responsible for delivering software solutions, just as colocated teams are. However, virtual team members are distributed across the globe and rarely meet in person. They instead rely on chat, video, and voice technologies to collaborate continuously, day to day, towards delivering valuable software solutions.

    Here is the list of challenges facing virtual teams:

    • Language Barriers
    • Trust
    • Team Intimacy
    • Cultural Differences
    • Differences in Communication Styles
    • Orchestrating Across Timezones
    • Effective Work Distribution
    • Effective Productivity Tracking

    Language Barriers

    Years ago, I was on a team that was building a content management system. The team consisted of a group of programmers from four different countries. The code was particularly difficult to maintain because a few of the programmers adopted a variable naming convention using their mother tongue, while others used English.

    In other experiences, I found that language barriers between virtual team members slowed the pace of communication and generally prevented key individuals from speaking up. I remember a project where some team members were forced to switch away from their mother tongue. This completely changed the overall dynamic of the team. Prior to the switch, these individuals spoke up, were proactive in addressing project issues, and were generally more engaged in their day-to-day work. Following the switch, the preferred communication style also shifted from vibrant real-time communication technologies to asynchronous technologies such as email. All told, the language barriers had a significant negative impact on the team.


    Trust

    The distributed nature of virtual teams also makes it difficult to build and maintain professional trust. The old adage “out of sight, out of mind” applies as team members struggle to give and obtain the necessary feedback. To offset these challenges, teams should rely on occasional face-to-face meetings hosted in locations that are practical for all to attend.

    Update, August 1, 2010: ODesk, which provides a marketplace for online work teams, recently developed a new software application called ODesk Team. What’s particularly interesting about this application are the innovative features it offers to help build trust between the buyer and seller. Features such as Time Tracker and Screensnap, although arguably intrusive for established teams, can be effective at building trust between parties who are working together for the first time.

    Team Intimacy

    Intimacy is a good indicator of the strength and health of any relationship, including those in a virtual team. In this context, intimacy refers to the level of caring team members feel for one another’s needs. The distributed nature of virtual teams makes it difficult for members to grow and sustain high levels of professional intimacy. Members who are naturally empathetic and proactive in reaching out to others may, in a remote setting, lack the communication and feedback channels that spur them to action. This will ultimately impact the team’s Trust and overall performance.

    Cultural Differences

    Coming from New York City where working lunches are a normal part of any work day, I didn’t anticipate the negative reaction to my working through lunch during a project in Rome, Italy. This was perceived by fellow team members as an act of competitiveness and ultimately required many trips to the espresso machine to reestablish the Trust between us all.

    Cultural issues are more pronounced in globally distributed teams. Learning and adapting to the various cultural traits of virtual team members will help grow the team’s trust and intimacy, while not losing focus on the universal business culture of satisfying customer needs.

    Differences in Communication Styles

    Synchronous communication technologies, such as chat, may be favored when distributed team members share command of the same spoken language. Asynchronous communication technologies, such as email, may instead be favored when distributed team members lack this command, because they allow more time to process and translate messages. The key is to identify and select the technologies appropriate to the team, as I talked about in my own move to a virtual team.

    The April 24, 2008 New York Times article on making long-distance business partnerships work confirms the importance of selecting the right technologies while managing the privacy issues that may arise from their use.

    The July 30, 2010 article on telepresence robots hints at a new technology just around the corner that may promote a new level of communication between team members.

    Orchestrating Across Timezones

    Having virtual team members situated in different timezones presents both challenges and opportunities. The benefits of multiple timezones are best captured by the expression “follow the sun”, which implies a virtual team structure that permits continuous productivity through the seamless hand-off of work between team members who are leaving for the day and those coming online. Despite the challenges different timezones bring, the ability to orchestrate this hand-off can have many benefits and will allow team members to work the traditional working hours of their timezone.

    On a recent project, for example, a colleague lived and worked in a timezone that was eight hours ahead of our colocated team. We struggled with the timezone difference. In particular, we made the mistake of leaving some of his questions unanswered at the end of our workday. This resulted in nearly a day of lost productivity, as his workday started without the answers he needed to move forward.

    Effective Work Distribution

    The “out of sight, out of mind” adage described in the previous section on Trust can also impact the effective distribution of work across virtual team members. Peter Drucker described the job of a manager as creating productive work and assigning the most effective people to perform it, but the remoteness of virtual teams may make it difficult to understand a team member’s effectiveness. Managers must find new ways to understand worker effectiveness while ensuring the work is distributed evenly and fairly, without overworking virtual team members.

    Effective Productivity Tracking

    The level of productivity tracking needed in a virtual team will depend on the level of trust between its members. Effective virtual teams that have built a solid foundation of trust and effectiveness won’t need to confront this challenge. In teams where members are working together for the first time and are under pressure to deliver fast, closely monitoring productivity may be the only way to mitigate risk while building trust. As I mentioned in the section on Trust, ODesk has introduced a software application called ODesk Team which, in this context of productivity tracking, includes a feature to capture hourly screenshots of the contractor’s work. While this feature would be unquestionably demotivating in high-performance teams, for some virtual teams it might just provide the right level of transparency.


    Designing for Performance on Wall Street – The Bandwidth Problem

    Exploding Message Rates

    Previously, we introduced the industry changes driving the technology performance problems on Wall Street. The first of these problems is growing bandwidth utilization resulting from the exploding message rates in market data feeds. Take, for example, the industry options feed, OPRA. Since 2001, the peak message rate for OPRA rose from 7,000 messages per second (mps) to well over 300,000 mps in 2007. Projections for 2008 are racing towards the 700,000 mps mark. With message sizes averaging 120 bytes, network infrastructure consuming OPRA is pushed to support peak bit rates of 672 million bits per second (Mbps).
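The 672 Mbps figure follows directly from the message rate and the average message size, as this short calculation shows. Applying the 120-byte average to the 2001 and 2007 rates is my assumption, made only to show the growth; the function name is illustrative.

```python
def peak_bit_rate_mbps(messages_per_second: int, avg_message_bytes: int) -> float:
    """Peak payload bit rate in millions of bits per second."""
    return messages_per_second * avg_message_bytes * 8 / 1e6


AVG_MESSAGE_BYTES = 120  # average message size quoted in the text

for label, mps in [("2001 peak", 7_000),
                   ("2007 peak", 300_000),
                   ("2008 projection", 700_000)]:
    print(f"{label}: {peak_bit_rate_mbps(mps, AVG_MESSAGE_BYTES):.2f} Mbps")
```

The calculation counts message payload only. Packet headers and any retransmission traffic would add to the bit rate actually seen on the wire.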

    Footprint Minimizing Approaches

    As expected, these message rates have forced network engineers, system administrators, and software engineers to come up with innovative solutions. The result is that Wall Street today relies heavily on multicast routing as a bandwidth minimizing technology and is looking to FAST, Conflation, Colocation and other approaches towards minimizing bandwidth utilization.

    Multicast Routing

    Multicast is a routing paradigm supported by modern routers that minimizes bandwidth utilization by allowing one or more systems to publish a single stream of packets to one or more receivers. Multicast-enabled routers achieve their efficiency by ensuring there is only one copy of each packet on the network branches they control, if and only if there are subscribers on that branch. For example, if a network branch has two multicast-group subscribers, the network’s router(s) will route a single copy of the stream’s packets to both subscribers. Additionally, on branches where there are subscribers, all non-subscribing nodes can reject the multicast packets at the Network Interface Controller (NIC) level, avoiding costly CPU time and benefiting the need for speed.

    For multicast routing to work, two things must exist: an addressing scheme that permits communicating with multiple receivers, and a subscription mechanism that permits multiple subscribers to join a multicast group.
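Both ingredients are visible in the standard sockets API. The sketch below, a minimal illustration rather than a production feed handler, checks that an address belongs to the IPv4 multicast range and builds the membership request that a subscriber passes to the operating system to join a group. The helper names, the group address 239.1.1.1 and port 5000 are hypothetical.

```python
import ipaddress
import socket
import struct


def is_multicast(group: str) -> bool:
    """True if the address falls in the multicast addressing range."""
    return ipaddress.ip_address(group).is_multicast


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the request: group address followed by the local interface address."""
    if not is_multicast(group):
        raise ValueError(f"{group} is not a multicast address")
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def join_group(group: str, port: int, interface: str = "0.0.0.0") -> socket.socket:
    """Subscribe to a multicast group and return a socket ready for recvfrom()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # The join asks the host to signal group membership to the local router.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, interface))
    return sock


# Usage on a host with a multicast-capable interface (hypothetical group and port):
#   sock = join_group("239.1.1.1", 5000)
#   data, sender = sock.recvfrom(2048)
```

The join itself is left commented out because it needs a multicast-capable network interface to succeed.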

    FAST Compression

    FIX Adapted for STreaming, or FAST, is a highly-efficient way of compressing market data quote and trade messages that focuses on maximizing the compression ratio and minimizing the compression latency. Early proof-of-concepts resulted in compression ratios around 80%, shrinking some trade messages from their original 241 bytes to a compressed size of 29 bytes. FAST compression has the added benefit of compressing faster when data volumes rise.
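One of the building blocks FAST uses is stop-bit encoding, which stores an integer in as few bytes as its magnitude requires: seven data bits per byte, with the high bit set only on the final byte. The sketch below covers unsigned integers only and is a simplified illustration with hypothetical function names, not an implementation of the FAST specification, which also relies on templates and field operators to avoid sending repeated or predictable values at all.

```python
from typing import Tuple


def encode_stop_bit(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, most significant group first."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    groups = []
    while True:
        groups.append(value & 0x7F)
        value >>= 7
        if value == 0:
            break
    groups.reverse()
    groups[-1] |= 0x80          # the stop bit marks the last byte of the field
    return bytes(groups)


def decode_stop_bit(data: bytes) -> Tuple[int, int]:
    """Decode one field; return (value, number of bytes consumed)."""
    value = 0
    for consumed, byte in enumerate(data, start=1):
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, consumed
    raise ValueError("no stop bit found")


for number in (29, 241, 100_000):
    encoded = encode_stop_bit(number)
    print(number, "->", encoded.hex(), f"({len(encoded)} bytes)")
```

Because no length prefix or delimiter is needed, small values such as sizes and price deltas cost a single byte on the wire.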


    Conflation

    Conflation refers to a pub/sub mechanism which allows for throttled and/or finer event-based subscriptions to real-time market data streams. Typically, subscribing to market data directly from the source of the data (i.e. exchanges) rather than from the vendors (e.g. Bloomberg) means losing the sophisticated subscription and filtering mechanisms in favor of lightning-fast data access (although this may all change with NYSE’s acquisition of Wombat Software). The result is having to process significantly more information in order to permit algos and other electronic trading applications to quickly discover best prices before their competitors. Vendors of direct market access distribution platforms have introduced conflation as a method to minimize the amount of data processed by electronic trading applications, while maintaining the minimal latency characteristics of market data taken directly from its source. For example, an electronic algorithm looking to purchase any number of shares in a stock may be overwhelmed if that stock’s price changes 700 times a second. Instead, with conflation, the algo can subscribe to the latest available price and receive a single message during that second. Furthermore, an algo looking to purchase IBM stock only after it crosses the $100.00 mark between 3pm and 4pm may suppress all quote messages for IBM during that period, and instead receive a single message if the event of interest occurs.
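The two behaviors in these examples, throttling to the latest price and firing once on an event of interest, can be sketched in a few lines. The class names and prices are hypothetical, the time-window check is omitted for brevity, and real platforms implement this inside the distribution layer rather than in the subscriber.

```python
from typing import Dict, Optional


class Conflator:
    """Keeps only the latest price per symbol between flushes."""

    def __init__(self) -> None:
        self._latest: Dict[str, float] = {}

    def on_quote(self, symbol: str, price: float) -> None:
        self._latest[symbol] = price          # newer quotes overwrite older ones

    def flush(self) -> Dict[str, float]:
        """Called once per interval: hand over one price per symbol and reset."""
        snapshot, self._latest = self._latest, {}
        return snapshot


class CrossingAlert:
    """Suppresses quotes until the price first crosses above a threshold."""

    def __init__(self, symbol: str, threshold: float) -> None:
        self.symbol = symbol
        self.threshold = threshold
        self.fired = False
        self._last: Optional[float] = None

    def on_quote(self, symbol: str, price: float) -> bool:
        if symbol != self.symbol or self.fired:
            return False
        crossed = self._last is not None and self._last <= self.threshold < price
        self._last = price
        self.fired = crossed
        return crossed


conflator = Conflator()
for i in range(700):                          # 700 price changes within one second
    conflator.on_quote("IBM", 99.00 + i * 0.001)
print(len(conflator.flush()), "message delivered for the interval")

alert = CrossingAlert("IBM", 100.00)
print([alert.on_quote("IBM", p) for p in (99.98, 99.99, 100.00, 100.01, 100.02)])
```

The subscriber trades completeness for volume: every intermediate price is lost, which is acceptable only for strategies that care about the current state rather than the full tick history.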


    Colocation

    Colocation, also known as proximity hosting, refers to the practice whereby firms physically place the systems running their algorithms in the same geographic location as the systems that power the stock exchanges. This practice ensures minimal latency in processing buy/sell orders from the algo engines, and has the additional benefit that the escalating market data volumes feeding these algorithms are contained within the same network infrastructure. Firms that choose to colocate their algos circumvent the need to purchase and deploy expensive WAN infrastructure that can accommodate the insatiable bandwidth requirements of market data feeds. In other words, colocating customers are rewarded with the lowest latency while at the same time ridding themselves of the wide-area network infrastructure otherwise needed to carry market data volumes.

    API Efficiency

    I mentioned how electronic trading, and electronic algorithms specifically, are generating bursts of buy/sell orders that are canceled as quickly as they are submitted. One approach for minimizing the bandwidth utilization of this activity was to introduce batching operations, such as bulk cancels, in the transactional interface (e.g. the FIX Protocol) between broker and electronic marketplace. Another approach was to create generic messages, such as pegged orders, that can “follow” changing market conditions without the need to reissue new orders. Yet another approach was to suppress acknowledgment messages between electronic marketplace and algo, resulting in less bandwidth utilization. What these approaches have in common is that they all require modification to the transactional API between the algo and the electronic marketplace.

    Feed Partitioning

    OPRA remains the largest-volume market data feed for the US capital markets, with 2008 projections racing towards 900,000 messages per second. In 2006, OPRA’s distributor, SIAC, shifted from distributing the feed over 12 separate network lines to a total of 24 lines in an attempt to minimize the per-line bandwidth requirements. This move relieved the per-line bandwidth pressure many subscribers were facing, although processing the entire feed still requires a ridiculous amount of bandwidth.

    The bandwidth problem facing firms in the capital markets is only one part of the problem. Next we’ll show how their need for speed is presenting significant computing challenges as well.


    Designing for Performance on Wall Street

    On Your Marks…

    Since the start of this decade, the US financial stock markets have experienced massive industry changes resulting from regulatory, competitive and innovative forces. These changes have led market participants to engage in an all out arms race towards extremely low-latency, high-throughput performance computing in an attempt to stay competitive.

    Unfortunately, arms races like this one are sometimes driven more by reaction than by vision. Not knowing, for example, how switched-fabric technologies can specifically address your performance troubles should stop anyone from purchasing InfiniBand-based technologies. Thankfully, organizations such as STAC Research have begun to address a much-neglected need for accurate performance measurements across this industry’s technology platforms.

    What’s Driving All This?

    Electronic Trading

    Since the start of the 21st century, the adoption of electronic trading among hedge funds, broker dealers, and investment banks has skyrocketed. In fact, as of 2007, electronic trading on the New York Stock Exchange (NYSE) makes up 60 to 70 percent of the daily volume. Technological innovation and economies of scale have led to the widespread digitization of the stock trader’s profession, resulting in highly advanced trading strategies. This form of trading relies on algorithms that seek profitability by scanning exchanges and other electronic execution venues with storms of buy/sell/cancel/replace orders at near wire speed. Similarly, this form of trading thrives on discovering the best prices before competitors, forcing many of the algo trading engines to circumvent the traditional sources of market data (e.g. Bloomberg, Reuters) and instead connect directly to the source of the bid or ask quotes.


    Regulation

    The Securities and Exchange Commission has designed many new regulations towards protecting investors in this new electronic marketplace. Regulation NMS (or RegNMS), in particular, gives all market participants a chance at the best price for any individual security, on any of the available electronic marketplaces, during market hours. This regulation has resulted in smart order routing strategies where traders, and the algorithms implementing their strategies, fulfill regulatory requirements and at the same time maximize their returns. Similar regulation has been introduced for the European markets in the form of the Markets in Financial Instruments Directive, or MiFID.

    Many other types of regulation have been introduced with the goal of improving transparency in the financial markets. These regulations mandate record retention policies that weigh heavily on the storage capabilities of complying firms. For example, a firm wishing to store each level-1 market data quote disseminated from all the US ECNs, exchanges, and OTC markets is now facing a 600 million quote per day reality (based on January 2008 volume). Assuming off-the-shelf database products and straightforward indexing schemes to store this data means having to allocate 60 GB of storage daily. The point here is that regulation is impacting the industry resulting in changes that expose limits in network bandwidth as well as data storage.
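The storage figures above imply a per-quote footprint that is easy to check, and extending them over a year shows how quickly the requirement grows. The decimal definition of a gigabyte and the figure of 252 trading days per year are my assumptions; growth in quote volume is ignored.

```python
QUOTES_PER_DAY = 600_000_000        # level-1 quotes per day, from the text
DAILY_STORAGE_BYTES = 60 * 10**9    # 60 GB per day, assuming decimal gigabytes
TRADING_DAYS_PER_YEAR = 252         # assumption: typical US trading calendar

bytes_per_quote = DAILY_STORAGE_BYTES / QUOTES_PER_DAY
yearly_terabytes = DAILY_STORAGE_BYTES * TRADING_DAYS_PER_YEAR / 10**12

print(f"implied storage per quote: {bytes_per_quote:.0f} bytes")
print(f"one year at constant volume: {yearly_terabytes:.2f} TB")
```

At roughly 100 bytes per stored quote including index overhead, a single year of level-1 data already exceeds 15 TB before any replication or backup.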


    Innovation

    Technological innovations in hardware, software, and networking have enabled creative new opportunities for discovering liquidity and making money on Wall Street. Market participants of all sizes are maximizing their technology investments and utilizing high-performance, high-bandwidth solutions to stay competitive. Smaller hedge funds, for example, can now adopt relatively inexpensive off-the-shelf solutions to generate startling returns that fill their investment banking brethren with envy.

    High Performance Elements Uncovered

    In this three-part series, I’ll present the elements of high-performance computing in today’s US stock markets, and how these elements are specifically designed to address the performance problems that emerged from the restructuring of this age-old industry. First, I’ll cover exploding message rates in the bandwidth problem, followed by the new low-latency reality in the need for speed. Finally, I’ll show how all this high-performance messaging is leading to the storage dilemma.
