SEC order explains Knight Capital systems failure

More than a year ago, Knight Capital suffered a loss of nearly half a billion dollars and had to sell itself after defective software resulted in nearly $7 billion of erroneous trades. A few days ago, the US SEC issued an order against Knight Capital that describes exactly what happened:

Knight used software called SMARS, which broke up incoming “parent” orders into smaller “child” orders that were transmitted to various exchanges or trading venues for execution. (para 12)
SMARS used to have a functionality called “Power Peg”. Knight stopped using this functionality in 2003, but the code was neither deleted nor deactivated. A decade later, the code was still sitting in the servers waiting to spring into action if a particular flag was set to “yes”. (para 13 and 14)
“... [A]s child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed ... [and] instructed the code to stop routing child orders after the parent order had been filled completely. ... In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.” (para 14)
In July 2012, the New York Stock Exchange announced that it would launch its new Retail Liquidity Program (RLP) on August 1, 2012. The RLP would enable retail customers to get price improvement for their orders. Knight Capital therefore added new code to SMARS to allow its customers to participate in the RLP. (para 12)
Knight decided that it would now delete the decade-old Power Peg code and replace it with the new RLP code. The flag that had earlier been used to activate the Power Peg code would be repurposed to call the RLP code. (para 13)
“Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.” (para 15)
“On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution. Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers. Although one part of Knight’s order handling system recognized that the parent orders had been filled, this information was not communicated to SMARS.” (para 16)
“While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion. Ultimately, Knight lost over $460 million from these unwanted positions.” (para 1)
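The role of the cumulative quantity check described in paras 14 and 16 can be illustrated with a small sketch. This is not Knight's code: the names ParentOrder, route_child_orders, check_cumulative and max_children are hypothetical, every child order is assumed to fill completely, and the max_children cap exists only so that the broken variant terminates in the example.

```python
from dataclasses import dataclass


@dataclass
class ParentOrder:
    symbol: str
    quantity: int      # total shares the customer wants
    filled: int = 0    # cumulative quantity executed so far


def route_child_orders(parent, child_size, check_cumulative=True, max_children=1000):
    """Send child orders for a parent order and return how many were sent.

    Assumption: each child order fills in full. max_children is a safety
    cap for this illustration only.
    """
    sent = 0
    while sent < max_children:
        remaining = parent.quantity - parent.filled
        if check_cumulative and remaining <= 0:
            break  # parent completely filled: stop routing
        # Without the check, the router keeps sending full-size children
        # regardless of how much has already been executed.
        size = min(child_size, remaining) if check_cumulative else child_size
        parent.filled += size
        sent += 1
    return sent


healthy = ParentOrder("XYZ", 1000)
healthy_sent = route_child_orders(healthy, 100)

broken = ParentOrder("XYZ", 1000)
broken_sent = route_child_orders(broken, 100, check_cumulative=False, max_children=50)
```

With the check in place, routing stops once the parent's 1,000 shares are filled. With the check bypassed, the loop stops only at the artificial cap, by which point far more shares have been executed than the customer ordered.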

It appears to me that there were three failures:

It could be argued that the first failure occurred in 2003 when Knight chose to let executable code lie dormant in the system after it was no longer needed. I would like such code to be commented out or disabled (through a conditional compilation flag) in the source code itself.
I think the biggest failure was in 2005. While making changes to the cumulative quantity function, Knight did not subject the Power Peg code to the full panoply of regression tests. Testing should be mandatory for any code that is left in the system, even if it is in disuse.
The third and perhaps least egregious failure was in 2012 when Knight did not have a second technician review the deployment of the RLP code. Furthermore, Knight did not have written procedures that required such a review.
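The third failure, a missed copy to one of eight servers, is the kind of thing an automated post-deployment check could catch even without a second technician. Below is a minimal sketch that assumes the deployed release on each server can be read back as bytes. The function name find_mismatched_servers and the server names are hypothetical, and nothing in the SEC order says how Knight actually packaged or distributed its code.

```python
import hashlib


def find_mismatched_servers(deployed, expected_release):
    """Return the names of servers whose deployed bytes differ from the release.

    deployed maps a server name to the bytes found on that server.
    """
    expected = hashlib.sha256(expected_release).hexdigest()
    return sorted(
        name
        for name, blob in deployed.items()
        if hashlib.sha256(blob).hexdigest() != expected
    )


# Hypothetical contents standing in for the old and new builds.
old_build = b"smars build containing Power Peg"
new_build = b"smars build containing RLP"

# Seven servers received the new build; the eighth was missed.
servers = {f"smars-{i}": new_build for i in range(1, 8)}
servers["smars-8"] = old_build

mismatched = find_mismatched_servers(servers, new_build)
```

Here mismatched contains only the eighth server. A deployment script could refuse to enable the repurposed flag until this list is empty.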

I am thus in complete agreement with the SEC’s observation that:

Knight also violated the requirements of Rule 15c3-5(b) because Knight did not have technology governance controls and supervisory procedures sufficient to ensure the orderly deployment of new code or to prevent the activation of code no longer intended for use in Knight’s current operations but left on its servers that were accessing the market; and Knight did not have controls and supervisory procedures reasonably designed to guide employees’ responses to significant technological and compliance incidents; (para 9 D)

However, the SEC adopted Rule 15c3-5 only in November 2010. The two biggest failures occurred prior to this rule. Perhaps the SEC found it awkward to levy a $12 million fine for the failure of a technician to copy a file correctly to one out of eight servers. The SEC tries to get around this problem by providing a long litany of other alleged risk management failures at Knight, many of which do not stand up under serious scrutiny.

For example, the SEC says: “Knight had a number of controls in place prior to the point that orders reached SMARS ... However, Knight did not have adequate controls in SMARS to prevent the entry of erroneous orders.” In well designed code, it is good practice to have a number of “asserts” that ensure that inputs are not logically inconsistent (for example, that price and quantity are not negative or that an order date is not in the future). But a piece of code that is called only from other code would not normally implement control checks.

For example, an authentication routine might verify a customer’s password (and any other token in the case of two-factor authentication). Is every routine in the code required to check the password again before it does its work? This is surely absurd.
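To make the earlier point about “asserts” concrete, here is a minimal sketch of the kind of logical-consistency checks I have in mind. The Order type and the validate_order function are hypothetical and are not taken from SMARS or from the SEC order.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Order:
    symbol: str
    price: float
    quantity: int
    order_date: date


def validate_order(order, today=None):
    """Check that an order is not logically inconsistent, then return it."""
    today = today or date.today()
    assert order.price >= 0, "price must not be negative"
    assert order.quantity >= 0, "quantity must not be negative"
    assert order.order_date <= today, "order date must not be in the future"
    return order
```

These checks catch nonsensical inputs, but they are not risk controls: an order for a plausible price and quantity passes them even if it should never have been sent. Note also that Python drops assert statements when run with the -O option, so checks of this kind are a debugging aid and not a substitute for controls elsewhere in the system.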

Posted at 9:45 pm IST on Sun, 20 Oct 2013

Categories: failure, risk management, technology
