Due to my new role, I have been thinking a lot about how modern network architecture has evolved over the years. When I first entered networking, it was the “cool” thing we did so that we could play Doom in our fraternity house with a bunch of other guys, back in the days of the old 10Base-2 connectors. Shortly after that I graduated to a real job pulling and terminating cables and, in some cases, repairing the old baseband systems we can thank IBM for in schools. But really, since cabling standardized around Cat5 and Cat6, haven’t we come a long way? Yes, I realize that things like electrical interference can affect cable, but for the most part, well-terminated copper will run at least 100 meters without a problem.
This got me thinking: why were networks ever built to plan for dropped packets? Did it go back to the original ARPAnet design and the idea that it had to survive a nuclear blast? Yes, we all know that rumor is not true, but part of the ARPAnet design was in fact the ability to work across an unreliable medium. Is this why protocols were built on top to compensate for that problem with reliable delivery? Does that same problem of an unreliable medium still exist today? If not, how would we change our network architectures? These questions have plagued me for the last month or so, and hence I decided to do some research.
Here are a couple of thoughts:
Dropped Packets Suck in the Data Center!
Whether it comes from oversubscription or a microburst in the network, dropped packets in the data center suck. The problem is only going to get worse as network speeds increase. Are you aware that on a 10Gb Ethernet link, a single second of dropped packets can spew ~15 million packets onto the data center floor? Worse yet, we are now in an era of 40Gb Ethernet and soon 100Gb Ethernet. Yes, the math is right: a single second of impairment on 100Gb Ethernet would throw away roughly 150 million packets (assuming the smallest frame size). I don’t care what application you are using; if you lose 150M packets, the user experience is going to be bad.
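To make that back-of-the-envelope math concrete, here is a quick sketch. The 64-byte minimum frame, 8-byte preamble, and 12-byte inter-frame gap are standard Ethernet figures; the link speeds are just the examples from above.

```python
# Rough line-rate packet counts at minimum Ethernet frame size.
# Minimum frame (64 B) + preamble (8 B) + inter-frame gap (12 B) = 84 B on the wire.
BITS_PER_MIN_FRAME = 84 * 8  # 672 bits

for gbps in (10, 40, 100):
    pps = (gbps * 10**9) / BITS_PER_MIN_FRAME
    print(f"{gbps:>3} GbE ~= {pps / 1e6:.1f} million packets per second")

# 10 GbE ~= 14.9 million packets per second
# 40 GbE ~= 59.5 million packets per second
# 100 GbE ~= 148.8 million packets per second
```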
This makes me ask the question: why would you ever build a data center network that is massively oversubscribed? You could suggest that you are playing the economics game…explain that to your CIO when the #1 revenue-producing application is down. You could also hypothesize that it is only oversubscription if everyone talks at the same time…which I understand. The premise I am suggesting is that we should never build a network that only works “in a perfect world.” Believe me, I have seen some of the finest pieces of PowerPoint engineering that in practice fail worse than Ben Affleck and Gigli did in 2003.
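For anyone who has not run the numbers on “massively oversubscribed,” here is a simple sketch using a hypothetical top-of-rack configuration; the port counts and speeds are made up purely for illustration.

```python
# Oversubscription ratio for a hypothetical top-of-rack switch:
# 48 server-facing 10G ports versus 4 spine-facing 40G uplinks.
downlink_gbps = 48 * 10   # 480 Gb/s toward the servers
uplink_gbps = 4 * 40      # 160 Gb/s toward the rest of the data center

ratio = downlink_gbps / uplink_gbps
print(f"Oversubscription: {ratio:.0f}:1")  # 3:1

# If even half the servers burst toward another rack at the same time,
# 240 Gb/s of offered load meets 160 Gb/s of uplink capacity;
# the excess is buffered briefly and then dropped.
```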
Dropped packets degrade the user experience!
In a world that demands instant customer satisfaction from our applications, we can’t afford to drop packets in the data center. Akamai published a study in 2009 showing that users’ new expectation for loading a page was 2 seconds. Worse yet, 40% of users at that time said they would abandon a page that took longer than 3 seconds to load. Gartner claims that by 2014, 80% of data center traffic will be east/west. So I ask the question again: if this is true, why would you build a data center network that is highly oversubscribed for intra-data-center traffic?
In my previous life I was responsible for a legacy application that made 1,200 one-way trips to put data away in a database. Can you imagine the impact that dropped packets would have on that performance? To make matters worse, the application had to perform flawlessly to come anywhere near the 5-second threshold we had set for the entire application (of which our piece was only a smaller portion).
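To put some rough numbers on that, here is a sketch. The per-trip latency and the retransmission timeout are assumptions plugged in for illustration, not measurements from that application.

```python
# Rough impact of packet loss on a chatty, serialized application.
TRIPS = 1200             # one-way trips to the database (from the example above)
TRIP_LATENCY_S = 0.0005  # assume ~0.5 ms per one-way trip inside the data center
TCP_RTO_S = 0.2          # assume a typical ~200 ms minimum TCP retransmission timeout

baseline = TRIPS * TRIP_LATENCY_S
print(f"Network time with zero loss: {baseline:.2f} s")  # ~0.60 s

for drops in (1, 5, 10):
    total = baseline + drops * TCP_RTO_S
    print(f"{drops:>2} timed-out packets: {total:.2f} s of a 5-second budget")

# Even a handful of retransmission timeouts eats a large share of the SLA.
```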
Anyone working in a modern enterprise with legacy applications and a service chain that can be a mile long knows exactly what I am talking about. The truth of the matter is that business applications struggle to meet their SLA metrics even when everything works perfectly…so again, why build a data center network that is highly oversubscribed and likely to drop packets?
There is good news!
The good news is that with standards and modern advances in silicon and signaling, we don’t have to live in the dark ages of data center network architecture. We can in fact build a completely non-blocking architecture that scales to thousands of physical servers and even more virtual servers. As I dive back into the networking world, I am impressed with the advances that have been made in hardware capabilities. I am suggesting that the architectures need to evolve to take better advantage of those capabilities. Certainly the differentiation will come in the software space and in the integration of that software into the business and data systems. Call it what you like – a fabric, a two-tier architecture, leaf/spine – it doesn’t matter. What does matter is that you move with the speed needed to stop dropping packets in the data center.
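To show what “thousands of physical servers” can look like in a non-blocking leaf/spine design, here is a quick sizing sketch. The leaf and spine port counts are hypothetical, chosen only to make the arithmetic concrete.

```python
# Non-blocking (1:1) leaf/spine sizing with hypothetical hardware:
# each leaf has 48 x 10G server ports and 12 x 40G uplinks (480G down, 480G up),
# and each spine is a 64-port 40G switch.
LEAF_SERVER_PORTS = 48
LEAF_UPLINKS = 12          # one 40G link to each of 12 spines
SPINE_PORTS = 64           # each spine port connects to one leaf

spines = LEAF_UPLINKS
max_leaves = SPINE_PORTS                 # each leaf consumes one port per spine
max_servers = max_leaves * LEAF_SERVER_PORTS

assert 48 * 10 == 12 * 40                # downlink bandwidth equals uplink bandwidth: non-blocking
print(f"{spines} spines, up to {max_leaves} leaves, {max_servers} servers at full 10G each")
# 12 spines, up to 64 leaves, 3072 servers at full 10G each
```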
Let’s face it – the application and data teams just expect the network to work. I will argue that the network is the furthest thing from a commodity in any enterprise; just ask any CIO how much work gets done when the network is down. BUT we have to evolve our architectures to fulfill current and future business needs…and I am suggesting that we can start by architecting solutions that don’t drop packets even during the highest periods of business demand! The technology, products, and standards are there…and remember, 150M packets is just the start…what happens as speeds move into the terabit range?