
Challenges in Multi-Core Era – Part 1

A few years ago, in 2005, Herb Sutter published an article in Dr. Dobb’s Journal, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”. In it, he argued that developers would need to embrace concurrency in their software to keep exploiting the exponential throughput gains of microprocessors.

Here we are in 2009, more than four years after Sutter’s article was published. What’s going on? How are we doing? How has the industry evolved to tackle the multi-core revolution?

In this three part series, we’ll answer these questions by exploring the recent multi-core inspired evolution of components throughout the application stack, including microprocessors, operating systems and development platforms.

The New Microprocessors

Microprocessor manufacturers keep adding processing cores. Most machines today have at least a dual-core CPU, and quad-core CPUs are quite popular in servers and advanced workstations. More cores are around the corner.

There is a new free lunch: if your application is designed to take advantage of multi-core and multiprocessor systems, it will scale as the number of cores increases.

Some people say multi-core isn’t useful. Take a look at this simple video. It runs four applications (processes) at the same time on a quad-core CPU. Each application runs on a different physical processing core, as shown in the real-time CPU usage history graph (one independent graph per core). Hence, running four applications takes nearly the same time as running just one: one application takes 6 seconds, four applications take 7 seconds. What you see is what you get; there are no tricks. Multi-core offers more processing power, and it is really easy to test this. However, most software wasn’t developed to take advantage of these parallel architectures within a single application.

There is another simple video showing one application running on a quad-core CPU. First, it runs using a classic, old-fashioned serial programming model, so it uses just one of the four available cores, as shown in the CPU usage history graph. Then the same application runs in a parallelized version, taking less time to do the same job.
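The effect shown in the videos is easy to reproduce in code. Below is a minimal sketch in Python using the standard multiprocessing module: the same CPU-bound job runs several times serially, then once per worker process, so that on a quad-core CPU four jobs can finish in roughly the time of one. The job itself (summing squares) is just a hypothetical stand-in for any compute-heavy task.

```python
import multiprocessing
import time

def busy_work(n):
    # CPU-bound loop; each call keeps one core fully busy.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_serial(jobs, n):
    # Old-fashioned serial model: one job after another on one core.
    return [busy_work(n) for _ in range(jobs)]

def run_parallel(jobs, n):
    # One worker process per job; the OS spreads them across cores.
    with multiprocessing.Pool(processes=jobs) as pool:
        return pool.map(busy_work, [n] * jobs)

if __name__ == "__main__":
    jobs, n = 4, 500_000
    t0 = time.perf_counter()
    serial = run_serial(jobs, n)
    t1 = time.perf_counter()
    parallel = run_parallel(jobs, n)
    t2 = time.perf_counter()
    # On a quad-core machine the parallel run should take roughly a
    # quarter of the serial time; the results are identical either way.
    print(f"serial: {t1 - t0:.2f}s, parallel: {t2 - t1:.2f}s")
```

On a single-core machine both runs take about the same time, which is precisely the point: the hardware, not just the code, determines whether the new free lunch is served.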

In recent years, parallel hardware became mainstream in most developed countries. The big problem is that hardware evolved much faster than software, leaving a large gap between the two. Microprocessors added new features that software developers didn’t exploit. Why did this happen? Because exploiting them was very complex; in fact, it still is. I’ll get back to this later.

Meanwhile, the most widespread model for multiprocessor support, SMP (Symmetric Multi-Processor), is giving way to NUMA (Non-Uniform Memory Access). With SMP, every processor has equal access to memory and I/O, so the shared processor bus becomes a limit to future scalability. With NUMA, each processor accesses the memory close to it faster than the memory farther away. NUMA offers better scalability once the number of processors exceeds four.

With NUMA, computers have more than one system bus, and each available system bus serves a certain set of processors. Hence, each set of processors can access its own memory and its own I/O channels. The sets are still capable of accessing the memory owned by the other sets, with appropriate coordination schemes. However, accessing memory owned by a foreign NUMA node is obviously more expensive than working with the memory reached through the local system bus (the NUMA node’s own memory).

Therefore, NUMA hardware requires different kinds of optimizations. Applications have to be aware of the NUMA hardware and its configuration, so that concurrent tasks and threads that access nearby memory locations run on the same NUMA node. Applications must avoid expensive remote memory accesses and favor concurrency schemes that take memory locality into account.
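One common NUMA optimization is simply pinning a process or thread to the cores of a single node, so its memory allocations and accesses stay local. Below is a minimal, Linux-specific sketch using Python’s os.sched_setaffinity; the node-to-core mapping here is an assumption for illustration only (real code would read the topology from /sys/devices/system/node or use a library such as libnuma):

```python
import os

# Hypothetical mapping for illustration: assume NUMA node 0 owns
# cores 0 and 1, and node 1 owns cores 2 and 3.
NODE_CORES = {0: {0, 1}, 1: {2, 3}}

def pin_to_node(node):
    # Restrict the calling process (pid 0 means "self") to the cores
    # of one NUMA node, keeping its memory traffic on the local bus.
    os.sched_setaffinity(0, NODE_CORES[node])
    return os.sched_getaffinity(0)
```

Pinning alone does not guarantee local allocation, but combined with a first-touch allocation policy (the default on Linux) it keeps a task’s working set on its own node.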

The new free lunch offers many-core scalability. Expect more cores in the coming months and years. Learn about the new microprocessors, be aware of NUMA, and optimize your applications for these powerful new architectures.

The New Specialized Hardware

On the one hand, we have a lot of software that is not taking full advantage of the available hardware power. On the other hand, there are many manufacturers developing additional hardware to offload processing from the main CPU. Does this make sense?

It means you are wasting watts all the time because your software can’t use the hardware it already has, and to work around the problem you have to add extra, expensive hardware to free CPU cycles, even while entire cores sit idle.

TCP/IP Offload Engine (TOE) uses a more powerful NIC (Network Interface Card) or HBA (Host Bus Adapter) microprocessor to process TCP/IP over Ethernet in dedicated hardware. This technique eliminates the need to process TCP/IP in software running over the operating system and consuming cycles from the main CPU. It sounds really attractive, especially when working with 10 Gigabit Ethernet and iSCSI.

CPUs are adding cores, yet modern software is not taking full advantage of them. Still, we are told we need new specialized hardware to handle network I/O. Most drivers don’t even exploit the older parallel processing capabilities based on SIMD (Single Instruction, Multiple Data) available since the Pentium MMX arrived. TCP/IP Offload Engine is a great idea. However, if I own a quad-core CPU with outstanding vectorization capabilities (SSE4.2 and its predecessors), I’d love my TCP/IP stack to take advantage of it.

Vectorization based on SIMD allows a single CPU instruction to process multiple data elements at the same time, which can speed up the execution of complex algorithms many times over. For example, an encryption algorithm requiring thousands of CPU cycles could produce the same results in less than a quarter of those cycles using vectorization instructions.
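The vectorized programming model is easy to glimpse even from a high-level language. In the sketch below, the same computation (a scaled vector addition) is written as a scalar Python loop and as a NumPy expression; NumPy’s element-wise kernels are compiled code that can use SSE-style SIMD instructions to process several elements per machine instruction. This assumes NumPy is installed, and it illustrates the programming model rather than hand-written SSE4.2 code:

```python
import numpy as np

def scalar_saxpy(a, xs, ys):
    # One element per loop iteration: the serial, scalar model.
    return [a * x + y for x, y in zip(xs, ys)]

def vector_saxpy(a, xs, ys):
    # Whole arrays at once: one expression, many elements per
    # underlying machine instruction when SIMD kernels are used.
    return a * np.asarray(xs, dtype=float) + np.asarray(ys, dtype=float)
```

Timing both versions over a few million elements shows the vectorized form running many times faster, the same kind of ratio the encryption example above alludes to.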

Something pretty similar happens with games. Games are always asking for new GPUs. However, most games take advantage of neither multi-core nor vectorization capabilities offered by modern CPUs. I don’t want to buy new hardware because of software inefficiencies. Do you?

Modern GPUs (Graphics Processing Units) are really powerful and offer outstanding processing capacity. There are standards, such as CUDA and OpenCL, that let software developers run general-purpose code on the GPU to offload the main CPU. It sounds really attractive. However, again, most software does not take full advantage of multi-core, and it is difficult to find commercial, mainstream software exploiting the possibilities offered by these modern and quite expensive GPUs. Most modern notebooks don’t include such GPUs. Therefore, I see many limitations to this technique.

Before considering these great but limited capabilities, it seems logical to exploit the main CPU’s full processing capabilities. Most modern notebooks offer dual-core CPUs.

Specialized hardware is very interesting indeed. However, it isn’t available in every modern computer. It seems logical to develop software that takes full advantage of all the power and instruction sets offered by modern multi-core CPUs before adding more specialized and expensive hardware.

In part two of Challenges in Multi-Core Era, I’ll compare the multi-core capabilities of the latest operating systems.

About the author: Gaston Hillar has more than 15 years of experience in IT consulting, IT product development, IT management, embedded systems and computer electronics. He has been actively researching parallel programming, multiprocessor and multicore systems since 1997. He is the author of more than 40 books on computer science and electronics.

Gaston is currently focused on tackling the multicore revolution, researching new technologies and working as an independent IT consultant and freelance author. He contributes to Dr. Dobb’s Parallel Programming Portal, http://www.go-parallel, and is a guest blogger at Intel Software Network.

Gaston holds a Bachelor’s degree in Computer Science and an MBA.


Leap Forward with Rails

Lately I find myself caught in the middle of major sporting events. I started the year in Argentina while the Dakar Rally was making its way through the vibrant streets of Buenos Aires. A few months later I stumbled upon the 2009 Champions League final and Giro d’Italia festivities in Rome. Whether you’re a fan or not, these sporting events provide a good opportunity to learn from the psychology of the world-class athletes and teams behind them.

For instance, a recent article in a pro cycling journal serves as a good reminder that sometimes moving forward requires letting go. As software practitioners, we’ve all probably experienced the frustration that comes with holding on to overly complex development environments, inefficient development processes, or job searches based strictly on prior experience alone. In the latter case, a Java developer may unnecessarily limit a job search to only those positions matching his or her current skill set, not realizing the attractiveness of development opportunities involving newer technologies such as Ruby on Rails. In recent years, the Rails community has made enormous strides toward simplifying software development while making it enjoyable.

In this three-part series, I’ll share my first impressions as I let go of prior knowledge investments in J2EE and .NET in favor of this exciting new ecosystem of software development I refer to as “Ruby Land”. I’ll show how long-standing best practices in software engineering have been injected into the core culture of its practitioners, and where the lines between artist, entrepreneur, and programmer have blurred in favor of promoting the human side of software development, with an acute focus on continuous testing, productivity, and programmer happiness.

Let’s Rewind a Bit

It may be unfair to single out Java, but the start of this decade saw some consensus among the J2EE development community (2002, 2004, 2006) that the technology’s focus on enterprise computing  had rendered it increasingly complex, cumbersome and dissatisfying.  Am I the only technologist who ran the other way with the introduction of EJB application servers?

Java has since evolved by leveraging the strong foundation and support of its Virtual Machine with the introduction of dynamic languages such as JRuby and Groovy.

Nevertheless, by the middle of this decade, a window of opportunity had opened, permitting a stream of defections from development  communities whose technology had disenchanted its practitioners.

An interesting quote from Charles Connell says:

Junior programmers create simple solutions to simple problems. Senior programmers create complex solutions to complex problems. Great programmers find simple solutions to complex problems.

The one defining characteristic of the Ruby/Rails community seems to be the high concentration of great programmers – professionals who continuously seek out simple technologies to help solve their complex problems.  As Martin Fowler put it:

Ruby has a philosophy of an environment that gives a talented developer more leverage, rather than trying to protect a less talented developer from errors. An environment like Ruby thus gives our developers more ability to produce their true value.

Programmers At Work

Each generation of software development has had its share of great programmers. Early on we had the programmers at work – the early pioneers who blazed the trail we continue to walk today.

When I first entered the professional world in the ’90s, I remember admiring names like Gosling, Bosworth, Bray, Wall, Ozzie, Booch, Lee, Linus, Raymond, Sessions, O’Reilly, Box, DeMarco, and McConnell.

By the middle of this decade this list had grown to include people like Joel, Brin, Page, Fowler, and Graham.

The current generation of Ruby/Rails programmers is not limited to engineers and entrepreneurs; it includes artists as well. As newer languages climbed the ladder of abstraction in recent decades, the programming discipline began to appeal to a wider, more diversified audience of professionals. Today you find artists, engineers, and entrepreneurs all involved somehow in building software systems or in promoting and growing the business of software. Names and monikers include DHH, Uncle Bob, Dr. Nic, and Bates.

These programmers have distinguished themselves by providing exceptional solutions, in the form of code, tutorials, and conferences, that keep Ruby/Rails anchored to its core values while helping grow its usefulness and adoption.

Check back soon as we continue this series Leap Forward with Rails.


As the World Turns

These are incredibly exciting times.  World economies are well into their massive transformation resulting from the rising demands of emerging countries.   The Techdoer Times is focused on providing you the insight and knowledge you need to successfully overcome the economic, organizational and engineering challenges of information technology.

Over the coming months we’ll continue to cover the role of high-performance computing on Wall Street.  New trends in outsourcing and consolidation are emerging, so stay tuned.

We’ll also continue to cover the challenges facing highly-productive teams.  Virtual Teams and teleworking are quickly becoming strategic necessities for technology firms looking to gain competitive advantage. Round-the-clock development, the world-wide talent pool, and cost advantages of outsourcing are all strategic levers that can empower any successful technology business.

We all know the invaluable role of knowledge and research in the face of rapidly changing industries, economies and organizations.  We thank our readers for their continued support and we look forward to serving you in the months ahead.

The Techdoer Times

Uncovering Time in the Financial Markets – Precision & Accuracy

In this third and final part to our series Uncovering Time in the Financial Markets we’ll look at clock synchronization techniques for improving the quality of time in the distributed systems that power the trade-lifecycle in the financial markets.

Previously I’ve shown how regulators and business strategy in the financial markets are more sensitive than ever to small intervals of time and why the inherent inaccuracy of time in the trade-lifecycle makes any temporal references potentially misleading, not to mention marketing’s blatant misuse of time in justifying advantages over competitor offerings.

Atomic Time

Time, for our purposes, is the product of a clock that measures the changes of a natural phenomenon, or of an artificial machine, according to the rules of the time standard it is meant to implement. We’re accustomed to dealing with Mean Time, or the Civil Time standard, which is based on the earth’s rotation. Because that rotation is irregular, leap seconds are required to adjust for the natural clock drift that occurs. Specifically, we deal with the International Atomic Time (TAI) and Coordinated Universal Time (UTC) standards, with UTC calculated by subtracting the accumulated leap seconds from TAI.
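The TAI/UTC relationship is simple arithmetic once the leap second count is known. A small sketch, using the offset in effect during 2009, when TAI was ahead of UTC by 34 accumulated leap seconds:

```python
# Accumulated leap seconds as of 2009; this constant grows over
# time as new leap seconds are announced.
TAI_MINUS_UTC_2009 = 34

def tai_to_utc(tai_seconds):
    # UTC lags TAI by the accumulated leap seconds.
    return tai_seconds - TAI_MINUS_UTC_2009

def utc_to_tai(utc_seconds):
    return utc_seconds + TAI_MINUS_UTC_2009
```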

Atomic clocks, which rely on the atomic resonance of, for example, the cesium-133 atom, have become the standard for accurate time. In 1967, the 13th General Conference on Weights and Measures defined the International System (SI) unit of time, the second, in terms of atomic time rather than the motion of the Earth. A second was defined as:

The duration of 9,192,631,770 cycles of microwave light absorbed or emitted by the hyperfine transition of cesium-133 atoms in their ground state undisturbed by external fields.

It turns out that TAI time is based on atomic time and calculated by computing a weighted average of time kept by roughly 300 atomic clocks in over 50 national laboratories around the world. Many of these atomic clocks are cesium clocks.

Computer Time

When software running on a single computer requires a precise version of the current time, it calls the appropriate operating system function, such as gettimeofday on Linux, or the precise QueryPerformanceCounter and less precise GetTickCount on Windows. The values these functions return are based on the system’s local oscillator, which updates the clock counter at a frequency known as the tick rate. The tick rate determines the precision (i.e. resolution) of time. Windows, for example, lets users query the tick rate via the QueryPerformanceFrequency function (note: this requires support from the underlying hardware). A tick rate of 1,000,000 updates per second, for example, allows the clock to support microsecond precision. One challenge for hardware engineers is choosing a tick rate at which the accuracy of the clock can be maintained without overloading the system with the tick events themselves.
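From Python, the precision of these clocks can be inspected directly: time.get_clock_info reports each clock’s resolution, the counterpart of querying the tick rate with QueryPerformanceFrequency on Windows. A small sketch:

```python
import time

def clock_resolutions():
    # Resolution, in seconds, of the commonly used clocks.
    # A resolution of 1e-6 corresponds to a tick rate of
    # 1,000,000 updates per second, i.e. microsecond precision.
    return {name: time.get_clock_info(name).resolution
            for name in ("time", "monotonic", "perf_counter")}
```

The reported resolution varies by operating system and hardware, which is exactly the point: the quality of time is a property of the platform, not of the application asking for it.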

There are numerous factors that cause variations in the frequency of oscillation, including the age of the hardware components, system load, and temperature. These variations are called jitter, and jitter leads to clock drift, which results in inaccurate timings.

Clock Synchronization

The Financial Industry Regulatory Authority (FINRA, formerly NASD) devised Rule 6953 to address the need for accurate time in its Order Audit Trail System (OATS). The rule imposed clock synchronization requirements by stating:

Rule 6953 requires any FINRA member firm that records order, transaction or related data to synchronize all business clocks used to record the date and time of any market event. Clocks, including computer system clocks and manual timestamp machines, must record time in hours, minutes and seconds with to-the-second granularity and must be synchronized to a source that is synchronized to within three seconds of the National Institute of Standards’ (NIST) atomic clock. Clocks must be synchronized once a day prior to the opening of the market, and remain in synch throughout the day. In addition, firms are to maintain a copy of their clock synchronization procedures on-site. Clocks not used to record the date and time of market events need not be synchronized.

The rule is written so it addresses the requirement for accuracy and precision of time in an inherently distributed system like OATS as well as addressing the inaccuracy that can result from clock synchronization itself. Like the jitter I described in a system’s local oscillator, the propagation delay of the clock synchronization signal can also cause jitter.

Here we hit upon the double-edged inaccuracy of distributed time. First, the local system clock will drift (a.k.a. clock drift) compared to the other clocks, necessitating each clock’s synchronization to a shared, accurate time source. Second, each clock in the synchronization scheme will experience varying propagation delays (a.k.a. clock skew) with this time source, potentially resulting in more inaccuracy between clocks.
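The two error sources can be put into numbers with a toy model. The drift rate and propagation delay below are hypothetical values chosen purely for illustration (50 parts per million is a plausible frequency error for a commodity crystal oscillator):

```python
def drifted_reading(true_elapsed_s, drift_ppm):
    # What a local clock reports after true_elapsed_s seconds of real
    # time, given a constant frequency error in parts per million.
    return true_elapsed_s * (1.0 + drift_ppm / 1_000_000.0)

def total_error(true_elapsed_s, drift_ppm, sync_delay_s):
    # Accumulated drift plus the skew introduced by the propagation
    # delay of the synchronization signal itself.
    drift_error = drifted_reading(true_elapsed_s, drift_ppm) - true_elapsed_s
    return drift_error + sync_delay_s
```

At 50 ppm, a clock left alone for a day (86,400 seconds) drifts by 4.32 seconds, already outside Rule 6953’s three-second tolerance, which is why daily synchronization matters.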

A high-quality clock synchronization solution will ensure the accuracy for each node being synchronized by providing a reference source for actual time and disciplining each node’s local clock to be synchronized to this time.

Network Time Protocol

Network Time Protocol is a common clock synchronization protocol standard used on packet-switched networks. It currently stands at version 4.
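At the heart of NTP is a four-timestamp exchange: the client records its send time t0, the server records its receive time t1 and its reply time t2, and the client records the reply’s arrival t3. From these, the client estimates its clock offset from the server and the round-trip delay. A minimal sketch of that arithmetic (the protocol’s filtering, peer-selection, and clock-discipline algorithms are omitted):

```python
def ntp_offset_delay(t0, t1, t2, t3):
    # t0: client transmit, t1: server receive,
    # t2: server transmit, t3: client receive.
    # The offset estimate assumes the network path is symmetric.
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay
```

For example, if the client’s clock is 5 seconds behind the server and the one-way delay is 0.1 seconds each way, the timestamps (100.0, 105.1, 105.2, 100.3) yield an offset of 5.0 seconds and a round-trip delay of 0.2 seconds.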

Check back soon as we show how standard implementations of the Network Time Protocol handle drift and jitter to synchronize the clocks in the machines that power the trade lifecycle.


Uncovering Time in the Financial Markets – Time of the Trade

Previously I listed examples of how small intervals of time are deeply rooted in modern electronic trading strategies and regulations. The process of buying and selling stocks has changed significantly over the centuries, whether in a manual environment involving specialists on the floor of a major equities exchange, or in an automated environment involving two systems with complex decision rules for valuating and purchasing stocks. The outcome, however, remains the same: to exchange shares (equity) of ownership in a particular company between a buyer and a seller at some agreed-upon price. This is fundamentally the process of trading in the equities markets.

It is within this process that regulators and trading organizations have become incredibly sensitive to even the smallest measures of time. The trading process can be broken down into the following steps. First, the buyer or seller specifies the order details (e.g. price, symbol, size), and a trading venue such as an exchange, ECN, or broker-dealer acknowledges the order. The order is then optionally routed to one or more venues, which execute the trade by matching it with a counterparty order before reporting the details of the execution back to a trade-reporting facility.

Vendors, regulators, data providers, and marketers of trading technology mislead when quoting microsecond, millisecond or any temporal measures by failing to describe the inherent inaccuracies of such measures in a distributed systems context. Because the trade process described above is fundamentally distributed across machines whose concept of time is largely subjective, a measured interval of one second between any nodes in this trading process may amount to significantly more or less than one second when measured on the scale of an atomic clock.

To ensure that regulatory or strategic measures of time are in fact accurate, it is necessary to create a single, global understanding of time among the related machines. Clock synchronization refers to the problems caused by clock skew and jitter, and to the solutions that enable a common, more accurate understanding of time, albeit with built-in margins of error.


Uncovering Time in the Financial Markets – Law & Profit

As I mentioned, the measure of small intervals of time in the financial markets is deeply rooted in both modern regulatory policies and electronic trading strategies. The SEC, FINRA and other industry regulators have innovated their way toward temporal constraints that reflect the lightning speed of today’s electronic trading landscape. On the business side, the continuing arms race toward low-latency algorithmic trading platforms is built on the premise that profit comes to those who discover and trade the best available price first. With milliseconds, and now microseconds, separating competing trade requests, industry participants are paying huge premiums for technology that promises even the smallest temporal improvements over competitor offerings.

Regulatory Time

Here is just a small sampling of the temporal references found in today’s electronic-trading compliance requirements:

FINRA Trade Reporting:

…transactions that are subject to NASD Rules 6130(g) and 6130(c) and also required pursuant to an NASD trade reporting rule to be reported within 90 seconds.

SEC Regulation NMS Self-Help:

If a market repeatedly does not respond within one second or less, market participants may exercise “self-help” and avoid that market for purposes of the Order Protection Rule.

OATS Reporting

…Order Sent Timestamp (date and time) is within +/- 3 seconds

SEC Regulation NMS Intermarket Sweep Order Workflow

Answer: Yes, waiting one full second to route a new ISO to an unchanged price at a trading center would qualify as a reasonable policy and procedure under Rule 611(a)(1) to prevent trade-throughs.

SEC Regulation NMS – Flickering Quote Exemption

In addition, Rule 611 provides exceptions for the quotations of trading centers experiencing, among other things, a material delay in providing a response to incoming orders and for flickering quotations with prices that have been displayed for less than one second.

SEC Regulation NMS – 3 Second Quote Window

To eliminate false trade-throughs, the staff calculated trade-through rates using a 3-second window – a reference price must have been displayed one second before a trade and still have been displayed one second after a trade.

At best, these temporal references serve as explicit requirements that drive the necessary software decisions to stay compliant. However, interpreting these time intervals without considering the distributed nature of the trade lifecycle and the ambiguity of time in this context can lead to misinterpretation and confusion.

Profit Time

Similarly, on the business side, there is an unprecedented awareness and profit-sensitivity to small time intervals. Here are some quotes from industry stakeholders:

Chicago Mercantile Exchange

“Traders using CME Globex demand serious speed. If the network is even a few milliseconds slower than 40 milliseconds of response time, they don’t hesitate to notify CME.”

Philadelphia Stock Exchange

“The standard now is sub-one millisecond,” said Philadelphia Stock Exchange CEO Sandy Frucher. “If you get faster than sub-one millisecond you are trading ahead.”

Investment Banks

“Firms are turning to electronic trading, in part because a 1-millisecond advantage in trading applications can be worth millions of dollars a year to a major brokerage firm.”

The TABB Group

“For US equity electronic trading brokerage, handling the speed of the market is of critical importance because latency impedes a broker’s ability to provide best execution. In 2008, 16% of all US institutional equity commissions are exposed to latency risk, totaling $2B in revenue. As in the Indy 500, the value of time for a trading desk is decidedly non-linear. TABB Group estimates that if a broker’s electronic trading platform is 5 milliseconds behind the competition, it could lose at least 1% of its flow; that’s $4 million in revenues per millisecond. Up to 10 milliseconds of latency could result in a 10% drop in revenues. From there it gets worse. If a broker is 100 milliseconds slower than the fastest broker, it may as well shut down its FIX engine and become a floor broker.”

Brokerage House

“Arbitrage trading is critically dependent on trading off valid prices and getting the orders in as fast as possible without overwhelming the exchange gateway and so latency on the market data stream and order entry gateway capacity is a big issue.”

Chi-X/TransactTools Press Release

“TransactTools’ standard benchmark tests found that over 95 percent of messages sent to Chi-X were responded to in an average of 10 milliseconds…with the fastest response time being four milliseconds.  For high volume throughput testing, in which five million messages were generated in total, Chi-X maintained an average roundtrip latency of 18 milliseconds while handling 16,000 messages per second.  Chi-X’s internal latency, which is a measure of the system’s ability to process messages in its core rather than the roundtrip measurement, was measured by Instinet Chi-X at 890 microseconds, or less than one millisecond.”

Millisecond Marketing

With the industry’s increasing awareness of small time intervals, marketers are playing their temporal cards. Vendors of market data distribution platforms, high-performance messaging solutions, complex event processing engines and many of the other high-performance technologies on Wall Street can misinform, and sometimes disinform, about the capabilities of their offerings with respect to time and performance. Suggesting that a vendor’s market data distribution technology offers millisecond or microsecond improvements over a competitor’s offering, without describing the testing context and particularly how clocks were synchronized in reaching the final measure, is unethical. As high-performance trading technologies continue to commoditize, the pressure to show even the most minute temporal improvements will only increase.

Next I’ll describe the lifecycle of a trade request, and why measures of time in this context are inherently inaccurate.


Uncovering Time in the Financial Markets

In this era of low-latency, high-performance electronic and algorithmic trading, vendors, regulators and business strategists continue to misinform, and sometimes disinform, industry participants with references to time. Vendors, for example, can selectively manipulate their marketing campaigns to suggest dubious sub-millisecond advantages over competitor technologies. Regulators, who continue their ambitious drive to innovate for the twenty-first-century industry changes, may get a bit ahead of themselves when they don’t provide the appropriate clock synchronization context for their quoted temporal constraints. Investment banks and brokerage firms continue to preach the million-dollar advantages of millisecond improvements in their trade lifecycle.

The widespread industry shifts in the financial markets have created an unprecedented, collective awareness of and sensitivity to small intervals of time. The fact is that, despite driving both regulatory and strategic policies, the quoted measure of these intervals remains another piece of misinformation, and sometimes disinformation, that misleads and confuses industry stakeholders.

In this three-part series, I’ll first show examples of time’s importance from a regulatory and strategic perspective in the financial markets. Second, I’ll show exactly how and why this time is misinterpreted. Finally, I’ll talk about how clock synchronization techniques can be used to better rationalize the measure of time across system boundaries.


Designing for Performance on Wall Street – The Storage Dilemma

What and How Much?

Prediction, transparency and compliance all come with a heavy storage price these days. Electronic trading applications, and specifically the algorithms that drive them, depend on access to high-quality historical data, both at runtime and at design time, when the data is mined in an effort to identify new patterns that can drive future trading strategies. Risk analysis, a discipline that attempts to valuate a firm’s investments in real time, relies on the ability to detect and predict patterns found by mining the same high-quality historical data. The SEC’s Regulation NMS, which requires that trades execute at the best available price, mandates that firms retain records showing a history of trading at the best price. These records should include both the trades the firm executed and the stock quotes that inspired them.

Previously I showed how electronic trading, regulation and innovation have indirectly resulted in the bandwidth problem and the need for speed. The storage dilemma is closely related. It is a dilemma because deciding what to store, how much of it to store, and how to store it is an inexact science with enormous consequences for storage requirements. Storing each and every level-1 quote disseminated daily by the ECNs, exchanges, and OTC markets can cost 60 GB a day (at January 2008 volumes). Resolving the storage dilemma requires a balanced view of the constraints in the problem and solution domains, as well as a little bit of luck.
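To put that figure in perspective, here is a back-of-the-envelope sketch; the 252 trading days per year is an approximation of the US equity calendar:

```python
GB_PER_DAY = 60              # full level-1 quote feed, January 2008 volumes
TRADING_DAYS_PER_YEAR = 252  # approximate US equity trading calendar

def quote_storage_tb(years):
    # Raw, uncompressed storage needed to retain every level-1 quote.
    return GB_PER_DAY * TRADING_DAYS_PER_YEAR * years / 1024.0
```

One year comes to roughly 15 TB, and a three-year retention window, the kind of period the compliance rules contemplate, approaches 45 TB of raw quote data alone, before any trade data, emails, or backups are counted.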

Compliance History

The storage dilemma was initially driven, and in many ways continues to be driven, by regulatory compliance. The US financial accounting scandals that spilled into the start of this century, the resulting debacle, and the terrorist attacks in 2001 led regulators to write, and in some cases rewrite, the rules surrounding digital communications (voice, chat, email) within and between firms, and the requirement to retain the history trail of communication for all individuals within a firm for a predetermined period of time. The general regulations I’m referring to here include the Sarbanes-Oxley Act of 2002, NASD 3010 & 3110, NYSE 342/440/472, and SEC Rule 17a-4.

These regulations are as ambitious as the technological innovations needed to support them. Requiring that firms store all digital communications for a period of three years, for example, is largely based on the fact that with today’s technology you can. From a regulatory standpoint the capacity and capability exist on paper, but from an implementation standpoint it’s not so easy. For example, we can back up anything these days, but can you easily restore from that backup? The same applies to email archiving. Sure, you can store all emails for all members of your organization as far back as you want, but when regulators come asking for specific email records dating back five years, the true test begins. Amazingly (and I say amazingly because compliance here is largely determined by the design and implementation of the storage/archival system), incredibly large fines are being issued for failure to comply, on the order of hundreds of millions of dollars.

The storage dilemma surrounding these regulatory requirements is further fueled by willing and paranoid compliance departments, whose job it is to ensure a firm’s compliance with all applicable regulations, and sometimes-unwilling IT departments, who are fully aware of the financial, technological and temporal constraints surrounding these ambitious regulations. You’re not supposed to meet these requirements halfway, but there are many reasons to try to compromise. Finding a balance between the regulatory, financial, temporal and technological requirements is just one example of the storage dilemma.

It Gets Worse

Electronic trading has introduced incredible efficiencies into the markets, resulting in lower per-order profit margins. Simultaneously, the structure of the securities themselves (think mortgage-backed securities and credit default swaps) has become so complex that it is almost impossible to value them. How can a firm design the most intelligent, empirically backed electronic trading algorithm, or the most sophisticated empirically backed risk analysis model? The answer is access to high-quality historical data. Deriving intelligence from the mining of market data is a key differentiator in the electronic trading and risk analysis space. When you add the exponentially increasing data volumes we’ve shown in the bandwidth problem, you have another example of the storage dilemma. There are no easy answers for which market data to store, and how much of it to store. The need is clear, and even clearer if you consider that Regulation NMS requires trading firms to capture enough quote and trade data to show they are executing trades at the best prices across all market centers.

If you build it, they will saturate it…

As we’ve shown at the start of this series, the convergence of regulation, innovation and electronic trading has redefined the magnitude of problems and solutions in the capital markets. Technological innovation, however, is the primary driver. Just as mobile communication devices have inspired increasing volumes of email and chat conversations between related and unrelated (i.e. junk mail) parties, innovations in technology have creatively inspired regulators, investors, brokers, investment banks, exchanges and hedge funds to stay ahead of their objectives.

Regulation NMS, in particular, implicitly addresses the sophistication of today’s technology, and shows that regulators can be as innovative as for-profit firms in demanding transparency and fairness. Algorithmic trading’s thirst for predictive intelligence is driven by the necessity to be accurate and fast, as we’ve shown in the need for speed. This transparency and prediction require data, and if you thought email archiving was a lot, wait until you need to store and efficiently mine terabytes upon terabytes of market data (assuming you’ve found the budget or technology to store it all).


“5 microseconds”…You Said What?

Measuring message latency, especially at the data volumes and latency thresholds expected by Wall Street, is tricky business these days, as we’ve previously covered.

Even trickier is finding clarity in the midst of confusing and too often inaccurate media coverage on the topic, as shown in a recent article from the Securities Industry News. I’m specifically referring to a low-latency monitoring solution vendor who states that today’s algo trading engines require end-to-end network latencies of “less than 5 microseconds with no packet loss“.

In 2008, a colocated trading engine, which minimizes propagation delay, can expect end-to-end latency on the order of a couple of milliseconds at best. End-to-end, in this context, typically refers to the time from when an algo issues a buy/sell order to the time a receiving system acknowledges and executes that order.

Can the latency between these two points really be 5 microseconds? Highly unlikely. There are many reasons for this, which will be covered over time. For now I’ll mention that off-the-shelf clock synchronization solutions, a prerequisite to measuring message latency across system boundaries, just can’t support an accuracy of 5 microseconds.
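To see why clock synchronization bounds any one-way latency claim, consider this small sketch. The offset figure is an illustrative assumption: NTP over a LAN is commonly accurate to around a millisecond, far coarser than a 5-microsecond claim.

```python
def measured_one_way_latency_us(true_latency_us, clock_offset_us):
    """A receive timestamp taken on a different machine folds the
    clock offset between the two hosts directly into the result."""
    return true_latency_us + clock_offset_us

true_latency = 5.0    # the claimed 5 microseconds
ntp_offset = 1000.0   # ~1 ms of typical NTP error (assumed figure)

apparent = measured_one_way_latency_us(true_latency, ntp_offset)
print(apparent)  # 1005.0
```

When the synchronization error is two hundred times larger than the quantity being measured, the measurement says more about the clocks than about the network.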


Designing for Performance on Wall Street – The Need For Speed

Collapsing Time

While the impact has already been enormous, history will show how the shift from floor-based specialist trading to electronic trading changed the way investors, specialists, investment banks, brokers, exchanges and other industry participants make their money. Wall Street as a whole is now firmly entrenched in this new electronic trading frontier and the barriers to entry have shifted from the human imperfections of floor based traders or specialists, to the high-speed, low latency capabilities of profit seeking electronic algorithms.

Low latency in the scope of electronic trading refers to the utilization of high-performance technology that collapses the time between price discovery (i.e. 100 shares of IBM are now available at $100.00) and the execution of orders (i.e. buy or sell) at the newly discovered price. Electronic trading has created a world where the lifecycle of price discovery to trade execution is on the order of single-digit milliseconds.

Time is Money

Previously, I talked about the bandwidth problem. An inability to handle the bandwidth utilization required by modern market data feeds will certainly cause significant delays in this millisecond-sensitive trade lifecycle, resulting in lost profits. However, the single most important need resulting from the unanimous shift to electronic trading is the need for speed, where speed refers to the ability to “see” stock prices as quickly as they appear in the electronic marketplace, and similarly the ability to immediately trade on that price before competitors do.

Some of the low-latency design strategies or techniques exhibit the elegant characteristic of solving the bandwidth problem as well as the need for speed. For example, colocating your electronic trading algorithm in the same facility as an exchange’s matching engines (i.e. the systems that execute the buy/sell orders) will not only save your firm the wide-area network infrastructure required to feed market data to your trading algorithm, but will also minimize the propagation delays between market data sources and execution venues. Incredibly, some of these solutions, such as FAST compression, can theoretically address the bandwidth problem, the need for speed, and the storage dilemma.
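The propagation-delay argument for colocation follows directly from the physics. Light in optical fiber travels at roughly two-thirds of its vacuum speed, about 200 km per millisecond; the distances below are illustrative.

```python
SPEED_IN_FIBER_KM_PER_MS = 200.0  # ~2/3 the speed of light (rule of thumb)

def one_way_delay_ms(distance_km):
    """Best-case one-way propagation delay over fiber, ignoring
    switching, queuing and serialization delays."""
    return distance_km / SPEED_IN_FIBER_KM_PER_MS

print(one_way_delay_ms(1000))  # 5.0  -> ~5 ms across 1000 km of fiber
print(one_way_delay_ms(1))     # 0.005 -> ~5 us from a colocated rack
```

No amount of clever software recovers the milliseconds lost to distance, which is why moving the algorithm into the exchange’s facility is such an effective, if low-tech, optimization.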

Low-Latency Approaches

How does Wall Street solve the need for speed? Here are just some of the approaches used to minimize stock trading related delays:

Chip Level Multiprocessors (CMP)

When Intel’s microprocessors started melting because of excessive heat, the multi-core chip industry went mainstream. Multiple smaller cores on a single chip now permit multi-threaded code to achieve true parallelism while collapsing the time it takes to complete processing tasks. Multi-core chips from Intel and AMD have a strong presence in the capital markets and can achieve remarkable performance, as shown in SPEC benchmarks.

An emerging challenge on Wall Street is to deploy microprocessor architectures capable of scaling to the enormous processing required by risk-modeling and algorithmic-trading solutions. If single-core architectures encountered the space and heat limitations that eventually led to the introduction of multi-core architectures, what new limitations will emerge? The shared message bus found in existing multi-core processors is one such limitation as the number of cores multiplies. Vendors such as Tilera are innovating around these limitations, and you can expect more to follow. Furthermore, evidence is building to support the notion that multi-core microprocessor architectures, and the threading model behind them, are inherently flawed. Multi-core CPUs may provide near-term flexibility for designers and engineers looking to tap more processing power from a single machine. Long term, however, they may be doing more harm than good.


With multiple cores now in place, the software and hardware communities are steadily catching up. For example, older versions of Microsoft’s Network Driver Interface Specification (NDIS) limited protocol processing to a single CPU. NDIS 6.0 introduced a feature called Receive Side Scaling (RSS), which enables message processing from the NIC to be distributed across the multiple cores of the host server.

As Herb Sutter explains in his article “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software”, software applications will increasingly need to be concurrent if they want to exploit CPU throughput gains. The problem is that concurrency remains a challenge from an education and training perspective, as described in David A. Patterson’s paper. Conceptually, concurrency can drive the need for speed; in practice, this approach remains a challenging one.
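The kind of concurrency Sutter describes starts with decomposing a task into independent chunks that separate cores can process in parallel. The sketch below illustrates the decompose-then-combine pattern using Python’s thread pool purely for brevity; in CPython, CPU-bound work like this would need processes (or a language without a global interpreter lock) to actually scale across cores.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """Work on one independent chunk of the problem."""
    lo, hi = bounds
    return sum(range(lo, hi))

def concurrent_sum(n, workers=4):
    """Split [0, n) into one chunk per worker, run the chunks
    concurrently, then combine the partial results."""
    step = n // workers
    chunks = [(i * step, n if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(concurrent_sum(1_000_000) == sum(range(1_000_000)))  # True
```

The decomposition is the hard, error-prone part Patterson’s paper worries about: chunks must be genuinely independent, and combining results must not reintroduce a serial bottleneck.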


Colocation

Colocation is a fascinating approach to achieving low latency, mainly because it relies on reconfiguring the physical proximity between application stacks rather than on a sophisticated technology. We’ve already shown how it can minimize the bandwidth requirements of a firm’s algorithmic trading platforms, but its biggest accomplishment is to minimize the distance between electronic trading platforms and the systems that execute the trades. Organizations such as BT Radianz have armed their high-performance datacenters with the fastest, highest-throughput technology on the planet. When coupled with colocated hosting services, these datacenters provide the lowest latency money can buy, while opening up new opportunities to translate this value throughout the application stack, starting at the NIC card and moving on up.

The exchanges themselves are also using colocation services as a way to attract customers and introduce new sources of revenue. For example, the International Securities Exchange (ISE) offers colocation services while promising 200-microsecond service levels.

Hardware Accelerators

Field Programmable Gate Arrays

The name says it all: an integrated circuit that can be customized for a specific solution domain. Specialized coprocessors have existed for years, handling floating-point calculations, video processing and other processing-intensive tasks. FPGAs build on this by offering design tools that allow programmers to customize the behavior of the FPGA’s integrated circuit, usually through a high-level programming language which is then “compiled” into the board itself. One example of how FPGA boards are being deployed on Wall Street is the replacement of software feed handlers, the components that read, transform and route market data feeds, with their FPGA equivalents. This approach yields higher throughput and lower latency because message processing is handled by the customized FPGA board instead of the host CPU/OS, saving the precious cycles that would have been spent moving messages up the protocol stack and interrupting the kernel. ACTIV Financial, a leading vendor of a feed handling solution, claims that the introduction of FPGA accelerators to its feed processing platform reduced feed processing latency by a factor of ten, while allowing it to cut the servers required to process some US market data feeds from 12, in the software-based approach, to just one in the FPGA-accelerated approach. Celoxica is another firm specializing in FPGA solutions for Wall Street’s electronic trading. Celoxica’s hardware-accelerated trading solution promises microsecond latency between host NIC and user application, with support for throughput rates reaching 7 million messages per second.

TCP Offload Engine

The idea with TCP Offload Engines (TOE) is for the host operating system to offload the processing of TCP messages to hardware located on the network interface card itself, decreasing CPU utilization while increasing outbound throughput. Windows Server 2003 includes the Chimney Offload architecture, which defines the hooks required for OEM and third-party hardware vendors to implement layers 1 through 4 of the OSI protocol stack in the NIC itself, before passing the message to the host operating system’s protocol handlers. Similar examples of offload technology include TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO), where the NIC handles the segmenting of large blocks of data into packets.
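To illustrate what TSO/GSO actually offloads, here is a toy sketch of the segmentation step itself: splitting a large application buffer into maximum-segment-size chunks. On a TSO-capable NIC this loop runs in hardware instead of consuming host CPU cycles; the payload size below is a typical figure, assumed for illustration.

```python
MSS = 1460  # typical TCP payload per Ethernet frame (assumed)

def segment(buffer, mss=MSS):
    """Split a byte buffer into maximum-segment-size chunks,
    the work a TSO-capable NIC performs in hardware."""
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]

packets = segment(b"x" * 64_000)
print(len(packets))  # 44
```

A single 64KB send from the application becomes 44 wire-level segments; with offload enabled, the host hands the NIC one buffer and one header template instead of issuing 44 separate sends.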

Network Processing Offload

Coming Soon

Kernel Bypass

Coming Soon

High-Performance Interconnections (I/O)


From the InfiniBand Trade Association website:

In 1999, two competing input/output (I/O) standards called Future I/O (developed by Compaq, IBM and Hewlett-Packard) and Next Generation I/O (developed by Intel, Microsoft and Sun) merged into a unified I/O standard called InfiniBand. InfiniBand is an industry-standard specification that defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems. InfiniBand is a true fabric architecture that leverages switched, point-to-point channels with data transfers up to 120 gigabits per second, both in chassis backplane applications as well as through external copper and optical fiber connections.

InfiniBand technologies also exhibit the characteristic of solving multiple problems facing Wall Street today, including bandwidth, latency, efficiency, reliability and data integrity. Visit the Voltaire website for a vendor-specific look at the performance benefits of InfiniBand on Wall Street.

Please check back in the second quarter of 2008, when The Techdoer Times presents a detailed look into the many existing and future applications of InfiniBand technology.

Remote Direct Memory Access

RDMA is a zero-copy protocol specification for transferring data between the memory modules of separate computers without involving either the source or target operating system or CPU, resulting in low-latency, high-throughput computing.

Fibre Channel

Gigabit Ethernet (GbE) & 10 Gigabit Ethernet (10GbE)

AMD HyperTransport (Chip-level)

Intel Common System Interface (Chip-level)

Ethernet Virtual Private Line (EVPL) and Ethernet Virtual Connection (EVC)

Faster Compression

As we mentioned in the bandwidth problem, some firms are relying on innovations in compression to minimize escalating bandwidth costs. FAST is one example, but there’s more. In our previous postings on measuring latency in messaging systems, we explained how the different components of latency react to variations in packet size and transmission rate. Herein lie the potential latency improvements resulting from the adoption of FAST: it can minimize packetization and serialization delays. It is true that compressing messages requires additional CPU cycles and therefore adds to the application delay; however, depending on the nature of the solution, this delay may be offset by the savings of serializing significantly smaller packets, potentially 80% smaller, onto the wire. FAST can be incredibly effective at bandwidth reduction and can potentially reduce end-to-end latency as well.
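The intuition behind FAST’s compression can be sketched with a toy delta encoder: instead of sending every field of every quote, send only the fields that changed since the previous message. This illustrates the idea only; it is not the actual FAST wire format, and the field names and values are made up.

```python
FIELDS = ("symbol", "bid", "ask", "size")

def encode_delta(prev, curr):
    """Return only the fields that differ from the previous quote."""
    return {f: curr[f] for f in FIELDS if prev is None or curr[f] != prev[f]}

def decode_delta(prev, delta):
    """Rebuild the full quote from the previous quote plus the delta."""
    merged = dict(prev or {})
    merged.update(delta)
    return merged

prev = {"symbol": "IBM", "bid": 99.98, "ask": 100.00, "size": 100}
curr = {"symbol": "IBM", "bid": 99.99, "ask": 100.00, "size": 100}

delta = encode_delta(prev, curr)
print(delta)  # {'bid': 99.99} -- one field on the wire instead of four
assert decode_delta(prev, delta) == curr
```

Because consecutive quotes for the same symbol usually differ in only one or two fields, this style of encoding is where the dramatic size reductions come from, at the cost of the extra encode/decode CPU cycles discussed above.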


Messaging technology has evolved greatly, to the point where requirements for speed and reliability are no longer in conflict. Publish/subscribe messaging paradigms can be supported with different levels of service quality, ensuring that latency-sensitive subscribers can forgo message recovery for the sake of speed, while data-completeness-sensitive subscribers can rely on extremely fast message recovery built on top of layer-3 protocol and routing technologies such as UDP and multicast. These real-time messaging technologies also ensure robustness and scalability across any number of downstream subscribers. Cases where slow subscribers begin to “scream” for message retransmission (the ‘crying-baby’ problem) can be handled individually and gracefully by the messaging layer, ensuring uninterrupted service to other subscribers. A number of vendors compete in this messaging technology space.
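The quality-of-service trade-off described above can be sketched with a toy subscriber that tracks sequence numbers. A latency-sensitive subscriber skips over gaps in the stream, while a completeness-sensitive subscriber asks the feed to retransmit them; the class and message shapes are illustrative, not any vendor’s API.

```python
class Subscriber:
    def __init__(self, recover_gaps):
        self.recover_gaps = recover_gaps
        self.expected = 1
        self.delivered = []
        self.retransmit_requests = []

    def on_message(self, seq, payload):
        if seq > self.expected and self.recover_gaps:
            # Completeness-sensitive: request the missed sequence numbers.
            self.retransmit_requests.extend(range(self.expected, seq))
        # Latency-sensitive subscribers simply move on past the gap.
        self.delivered.append((seq, payload))
        self.expected = seq + 1

fast = Subscriber(recover_gaps=False)
complete = Subscriber(recover_gaps=True)
for seq in (1, 2, 5):  # messages 3 and 4 were dropped in transit
    fast.on_message(seq, "quote")
    complete.on_message(seq, "quote")

print(fast.retransmit_requests)      # []
print(complete.retransmit_requests)  # [3, 4]
```

The point of per-subscriber QoS is that both behaviors coexist on the same feed: the fast subscriber never waits, and the complete subscriber’s recovery traffic never slows anyone else down.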

Multicast Routing

As mentioned in the bandwidth problem, multicast routing technologies can potentially reduce latency in addition to bandwidth utilization. The latency benefit results from the fact that multicast packets are rejected or accepted at the network interface card (NIC) level, rather than at the more CPU-expensive kernel level.
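The mechanism behind that filtering is the multicast group join: the join tells the NIC and driver which group addresses to accept, so packets for other groups are discarded before they reach the application. Here is a minimal sketch using the standard socket API; the group address and port are hypothetical.

```python
import socket
import struct

GROUP, PORT = "239.1.1.1", 5000  # hypothetical market data feed group

def multicast_membership(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq structure passed to IP_ADD_MEMBERSHIP."""
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(interface))

def join_group(group, port):
    """Open a UDP socket and join the multicast group on it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    multicast_membership(group))
    return sock
```

After `join_group(GROUP, PORT)` returns, datagrams addressed to other multicast groups are filtered below the application, which is exactly the CPU saving the paragraph above describes.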

Data Grids/Compute Grids

With the industry’s reliance on the timely evaluation of strategic trading and risk models comes the need to access and crunch large amounts of data efficiently. This reliance has spawned innovations in the form of data and compute grids, which offer highly resilient, scalable distributed processing infrastructure on demand for compute-intensive as well as data-intensive environments. Data grids, in particular, offer a high-performance, highly resilient middle-tier data layer that sits on top of storage technologies and other information sources while providing ubiquitous data access to enterprise business processes. Key vendors and technologies in this space include the following:

  • Gigaspaces
  • Gemstone
  • Tangosol
  • Intersystems
  • Memcache
  • DataSynapse
  • Terracotta
Collapsing Distributed Processing

Yet another approach to decreasing the overall end-to-end latency of messaging systems is to collapse the ends, which also minimizes propagation delays. The closer each distributed processing node is to being within the same process as its dependent nodes, the better the overall performance. The rise of Direct Market Access (DMA), where firms connect directly to the exchanges and other providers of market data instead of going through third-party vendors of the data, is an example of this. DMA alone spawned a new market data distribution industry, with the net result being end-to-end latency for market data measured in the low milliseconds, which for a while remained faster than the same data distributed by vendors such as Reuters and Bloomberg.

Thus far we’ve shown how firms in the capital markets are confronting their bandwidth problem and need for speed. The third category of challenges is the Storage Dilemma facing these firms.
