* * * * *

         It's hard to break a program when the network keeps breaking

One of my jobs at The Corporation has been to load test a phone network
service by simulating a ton of calls and seeing where our service breaks,
which means I get to write code to simulate a phone network initiating calls. It's
not easy, and no, it's not entirely related to The Protocol Stack From Hell™
(but don't get me wrong—there's still plenty of blame to go around).

Problem the first: generating a given load, a measured set of packets per
second. It's harder than it sounds. Under the operating system we're using,
the shortest interval we can reliably pause for is 1/100 of a second. If I want to
send out 500 messages per second, the best I can do is 5 packets every 0.01
seconds, which isn't the same as one packet every 0.002 seconds (even though
it averages out). The end result tends towards bursty traffic (that is, if I
attempt to control the rate; if I don't bother with that, I tend to break the
phone network connection—more on that below).
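To make the arithmetic concrete, here is a minimal sketch (not the actual test harness) of spreading a target rate over a coarse timer. It assumes a 100 Hz tick, as described above. The names burst_schedule and send_paced are made up for illustration, and send stands in for whatever really transmits a message. Leftover fractions are carried forward, so a rate that doesn't divide evenly by the tick rate still averages out.

```python
import time


def burst_schedule(rate_per_sec, ticks_per_sec, n_ticks):
    """Return how many packets to send on each of n_ticks timer ticks."""
    schedule = []
    owed = 0  # packets owed, scaled by ticks_per_sec so we stay in integers
    for _ in range(n_ticks):
        owed += rate_per_sec
        burst, owed = divmod(owed, ticks_per_sec)  # carry the remainder forward
        schedule.append(burst)
    return schedule


def send_paced(send, rate_per_sec, seconds, ticks_per_sec=100, sleep=time.sleep):
    """Call send() in per-tick bursts, pausing one tick between bursts.

    The time spent inside send() is not subtracted from the pause, so the
    real rate will drift a little below the target.
    """
    for burst in burst_schedule(rate_per_sec, ticks_per_sec, seconds * ticks_per_sec):
        for _ in range(burst):
            send()
        sleep(1.0 / ticks_per_sec)


print(burst_schedule(500, 100, 4))  # [5, 5, 5, 5]
print(burst_schedule(250, 100, 4))  # [2, 3, 2, 3]
```

Either way the traffic leaves in clumps at each tick rather than evenly spaced, which is exactly the burstiness described above.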

Sure, there's some form of congestion control in The Protocol Stack From
Hell™, but attempting to integrate the sample code provided into my testing
program failed—not only do I not get the proper messages, but what I do get
is completely different from the sample program. This is compounded by the
documentation (which everybody agrees is completely worthless) and the fact
that this is the first time I've ever worked on anything remotely related to
telephony. I'm unfamiliar with the protocols, and with the ins and outs of
The Protocol Stack From Hell™ (unlike my manager R, who's worked with this
stuff for the past fifteen to twenty years, but is swamped with other,
manager-type work).

Now the second problem: the testing system is in the same cabinet, and
hooked to the same network switch, as the target system (in fact, I think
they're physically touching each other). But due to the nature of the phone
system, communications between phone network components must go through an
intermediary system known as an STP (Signal Transfer Point); actually, a
pair of STPs (for redundancy). Unfortunately for us, the only STP we have
access to is out in Washington State (where The Corporate Master
Headquarters are stationed), so traffic between our two testing systems
(here in Lower Sheol) goes back and forth across the Internet over a VPN
(Virtual Private Network).

Yeah, what can I say? When I asked about getting an STP a bit closer to us, I
was told it wasn't in the budget (and no wonder—the price is far into the “if
you have to ask, you can't afford it” territory—deep into that territory).

Obligatory Sidebar Links

* Original series on Buffer Bloat
* * First puzzle piece … [1]
* * Browsers and TCP revisited … [2]
* * Home Router Puzzle Piece One—Fun with your switch [3]
* * Home Router Puzzle Piece Two—Fun with wireless [4]
* * The criminal mastermind: bufferbloat! [5]
* * Whose house is of glass, must not throw stones at another [6]
* * Bufferbloat and network neutrality—back to the past … [7]
* * Mitigations versus Solutions of Bufferbloat in Broadband [8]
* * Bufferbloat and congestion collapse—Back to the Future? [9]
* * Mitigations and Solutions of Bufferbloat in Home Routers and Operating
   Systems [10]
* * RED in a Different Light [11]
* * Bufferbloat in 802.11 and 3G Networks [12]

* Recaps and Discussions
* * Whose house is of glass, must not throw stones at another [13]
* * 2011 predictions: One word—bufferbloat. Or is that two words? [14]
* * Buffer Bloat: The calculations [15]



So we're stuck with a six thousand mile round trip for the phone network
traffic. And now we come to the punch line—the Internet is broken [16]. The
quick synopsis: excessive buffering of Internet traffic by various routers is
causing a breakdown of anti-congestion algorithms used by TCP (Transmission
Control Protocol)/IP (Internet Protocol). Now, the article is talking about
buffer bloat in consumer grade equipment, but it's possible that commercial
grade routers are doing the same thing (excessive buffering), which could be
the cause of largish spikes in traffic, as well as increased round trip
latency. If there's a spike in traffic, the STP will attempt to assert flow
control, but if traffic keeps coming in anyway, that's considered an error.
Also, the phone network is very time sensitive, and excessive latencies are
an error condition as well.
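For a rough sense of scale, here is a back-of-the-envelope sketch of the two delays in play. The numbers are assumptions, not measurements of our link: signals in fiber are taken to travel at about 200,000 km per second, and the buffer size and link speed are made-up illustrative values. The function names are hypothetical as well.

```python
FIBER_KM_PER_SEC = 200_000.0  # assumed: roughly two thirds of the speed of light
KM_PER_MILE = 1.609344


def propagation_delay_ms(round_trip_miles):
    """Best-case round trip time from distance alone (no queueing at all)."""
    return round_trip_miles * KM_PER_MILE / FIBER_KM_PER_SEC * 1000.0


def queue_delay_ms(buffered_bytes, link_bits_per_sec):
    """Time a packet waits behind buffered_bytes already queued on a link."""
    return buffered_bytes * 8 / link_bits_per_sec * 1000.0


# The six thousand mile round trip, as a lower bound on latency.
print(round(propagation_delay_ms(6000)), "ms")

# Illustrative only: a full 256,000 byte buffer draining at 1 Mbit/s.
print(round(queue_delay_ms(256_000, 1_000_000)), "ms")
```

Under those assumptions the distance alone costs somewhere near 48 ms per round trip, while a single full buffer of that size adds about two seconds. That's the point of the buffer bloat argument: queueing can dwarf the distance.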

Worse, if the STP receives too many errors from an endpoint, it (like every
other STP on the phone network) is programmed to take that endpoint out of
service. It's hard to say where that point is, but it happens with
frightening regularity when I attempt to do load testing. The packets are
being pushed when suddenly we start receiving either canceled messages, or
delivery failures about five levels down in the protocol stack, which means
one (or both) endpoints have been cut loose from the phone network due to
excessive errors. It then requires manual intervention to restart the entire
stack on both sides.

So, there's bursty traffic due to my attempts at sending a settable amount of
traffic. Then there's the (possible) bursty traffic due to excessive
buffering across the Internet. Oh, and I forgot to mention the licensing
restriction on The Protocol Stack From Hell™ that limits the number of
messages we can send and receive. All that makes it quite difficult to find
the breaking point of the program I'm testing. I keep breaking the
communications channel.

That tends to put a damper on load testing.

[1] http://gettys.wordpress.com/2010/10/02/first-puzzle-piece/
[2] http://gettys.wordpress.com/2010/10/13/browsers-and-tcp-revisited/
[3] http://gettys.wordpress.com/2010/11/29/home-router-puzzle-piece-one-fun-with-your-switch/
[4] http://gettys.wordpress.com/2010/12/02/home-router-puzzle-piece-two-fun-with-wireless/
[5] http://gettys.wordpress.com/2010/12/03/introducing-the-criminal-mastermind-bufferbloat/
[6] http://gettys.wordpress.com/2010/12/06/whose-house-is-of-glasse-must-not-throw-stones-at-another/
[7] http://gettys.wordpress.com/2010/12/07/bufferbloat-and-network-neutrality-back-to-the-past/
[8] http://gettys.wordpress.com/2010/12/08/bufferbloat-mitigations/
[9] http://gettys.wordpress.com/2010/12/09/bufferbloat-and-congestion-collapse-back-to-the-future/
[10] http://gettys.wordpress.com/2010/12/13/mitigations-and-solutions-of-bufferbloat-in-home-routers-and-operating-systems/
[11] http://gettys.wordpress.com/2010/12/17/red-in-a-different-light/
[12] http://gettys.wordpress.com/2011/01/03/aggregate-bufferbloat-802-11-and-3g-networks/
[13] http://news.ycombinator.com/item?id=2002992
[14] http://www.cringely.com/2011/01/2011-predictions-one-word-bufferbloat-or-is-that-two-words/
[15] http://netoptimizer.blogspot.com/2010/12/buffer-bloat-calculations.html
[16] http://gettys.wordpress.com/2010/12/03/introducing-the-criminal-mastermind-bufferbloat/
