The 621,000 square foot facility will be the social networking giant’s second data centre in Europe and the first to be announced on the continent since conflict over data privacy erupted between the EU and the US.
The facility will be powered by wind energy from Brookfield Renewable,
the company which bought Bord Gáis’s energy business two years ago.
Facebook recently announced its intention to create a further 200 jobs in Dublin in 2016, to add to the 1,300 employees it currently has at its Grand Canal Basin international headquarters.
The more than 1.35 billion people who use Facebook on an ongoing basis
rely on a seamless, “always on” site performance. On the back end, we
have many advanced sub-systems and infrastructures in place that make
such a real-time experience possible, and our scalable, high-performance
network is one of them.
Facebook’s network infrastructure needs to constantly scale and evolve, rapidly adapting to our application needs. The amount of traffic from Facebook to the Internet – we call it “machine to user” traffic – is large and ever increasing, as more people connect and as we create new products and services. However, this type of traffic is only the tip of the iceberg. What happens inside the Facebook data centers – “machine to machine” traffic – is several orders of magnitude larger than what goes out to the Internet.
Our back-end service tiers and applications are distributed and
logically interconnected. They rely on extensive real-time “cooperation”
with each other to deliver a fast and seamless experience on the front
end, customized for each person using our apps and our site. We are
constantly optimizing internal application efficiency, but nonetheless
the rate of our machine-to-machine traffic growth remains exponential,
and the volume has been doubling at an interval of less than a year.
The ability to move fast and support rapid growth is at the core of our infrastructure design philosophy. At the same time, we are always striving to keep our networking infrastructure simple enough that small, highly efficient teams of engineers can manage it. Our goal is to make deploying and operating our networks easier and faster over time, despite the scale and exponential growth.
Next-Generation Data Center for Facebook
For our next-generation data center network design we challenged ourselves to make the entire data center building one high-performance network, instead of a hierarchically oversubscribed system of clusters. We also wanted a clear and easy path for rapid network deployment and performance scalability without ripping out or customizing massive previous infrastructures every time we need to build more capacity.
To achieve this, we took a disaggregated approach: Instead of the large devices and clusters, we broke the network up into small identical units – server pods – and created uniform high-performance connectivity between all pods in the data center.
There is nothing particularly special about a pod – it’s just like a layer3 micro-cluster. The pod is not defined by any hard physical properties; it is simply a standard “unit of network” on our new fabric. Each pod is served by a set of four devices that we call fabric switches, maintaining the advantages of our current 3+1 four-post architecture for server rack TOR uplinks, and scalable beyond that if needed. Each TOR currently has 4 x 40G uplinks, providing 160G total bandwidth capacity for a rack of 10G-connected servers.
What’s different is the much smaller size of our new unit – each pod has
only 48 server racks, and this form factor is always the same for all
pods. It’s an efficient building block that fits nicely into various
data center floor plans, and it requires only basic mid-size switches to
aggregate the TORs. The smaller port density of the fabric switches
makes their internal architecture very simple, modular, and robust, and
there are several easy-to-find options available from multiple sources.
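To make the capacity math concrete, here is a minimal Python sketch of the per-pod numbers described above. The 48-rack pod size, the 4 x 40G TOR uplinks, and the 10G server ports come from the text; the servers-per-rack figure is a hypothetical placeholder used only to show how rack-level oversubscription would be computed.

# Back-of-the-envelope pod capacity, based on the figures in the text.
# SERVERS_PER_RACK is a hypothetical placeholder, not a published number.

GBPS_PER_UPLINK = 40          # each TOR uplink is 40G
UPLINKS_PER_TOR = 4           # one uplink to each of the 4 fabric switches
RACKS_PER_POD = 48            # fixed pod form factor
SERVERS_PER_RACK = 30         # assumption, for illustration only
GBPS_PER_SERVER = 10          # servers attach to the TOR at 10G

tor_uplink_gbps = GBPS_PER_UPLINK * UPLINKS_PER_TOR       # 160G per rack
pod_uplink_gbps = tor_uplink_gbps * RACKS_PER_POD         # 7,680G per pod
tor_server_gbps = SERVERS_PER_RACK * GBPS_PER_SERVER      # assumed rack demand

print(f"TOR uplink capacity: {tor_uplink_gbps} Gbps")
print(f"Pod aggregate uplink capacity: {pod_uplink_gbps / 1000:.2f} Tbps")
print(f"Rack oversubscription (assumed): {tor_server_gbps / tor_uplink_gbps:.2f}:1")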
Another notable difference is how the pods are connected together to form a data center network. For each downlink port to a TOR, we are reserving an equal amount of uplink capacity on the pod’s fabric switches, which allows us to scale the network performance up to statistically non-blocking.
To implement building-wide connectivity, we created four independent “planes” of spine switches, each scalable up to 48 independent devices within a plane. Each fabric switch of each pod connects to each spine switch within its local plane. Together, pods and planes form a modular network topology capable of accommodating hundreds of thousands of 10G-connected servers, scaling to multi-petabit bisection bandwidth, and covering our data center buildings with non-oversubscribed rack-to-rack performance.
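As an illustration of the wiring pattern (not Facebook’s actual tooling), the Python sketch below enumerates the fabric-to-spine links for a hypothetical build-out. The four planes and the up-to-48 spines per plane come from the text; the pod count is an assumption chosen only to show how the link count grows.

# Toy enumeration of the pod/plane wiring described in the text.
# PLANES and SPINES_PER_PLANE come from the post; NUM_PODS is an assumption.

PLANES = 4                 # one plane per fabric-switch position in a pod
SPINES_PER_PLANE = 48      # each plane scales up to 48 spine switches
NUM_PODS = 12              # hypothetical build-out, for illustration only

links = []
for pod in range(NUM_PODS):
    for plane in range(PLANES):
        fabric_switch = f"pod{pod}-fsw{plane}"
        for spine in range(SPINES_PER_PLANE):
            spine_switch = f"plane{plane}-ssw{spine}"
            # each fabric switch connects to every spine in its local plane
            links.append((fabric_switch, spine_switch))

print(f"{len(links)} fabric-to-spine links "
      f"({NUM_PODS} pods x {PLANES} planes x {SPINES_PER_PLANE} spines)")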
Network Technology at Facebook
We took a “top down” approach – thinking in terms of the overall network first, and then translating the necessary actions to individual topology elements and devices.
We were able to build our fabric using standard BGP4 as the only routing protocol. To keep things simple, we used only the minimum necessary protocol features. This enabled us to leverage the performance and scalability of a distributed control plane for convergence, while offering tight and granular routing propagation management and ensuring compatibility with a broad range of existing systems and software. At the same time, we developed a centralized BGP controller that is able to override any routing paths on the fabric by pure software decisions. We call this flexible hybrid approach “distributed control, centralized override.”
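The override idea can be illustrated with a toy route-selection model – this is not Facebook’s controller, just a sketch of the concept: BGP-learned paths are used by default, and a controller-injected path with a higher preference wins the comparison. All names and values below are illustrative.

# Toy model of "distributed control, centralized override".
# Routes learned via BGP are used by default; a controller-injected
# route with a higher preference overrides them in pure software.

routes = {
    # prefix: list of (source, local_pref, next_hop)
    "10.1.0.0/16": [("bgp", 100, "fsw1"), ("bgp", 100, "fsw2")],
}

def best_path(prefix):
    # highest local_pref wins; ties would normally fall through to ECMP
    return max(routes[prefix], key=lambda r: r[1])

print(best_path("10.1.0.0/16"))            # a BGP-learned path

# The central controller overrides the fabric's choice, e.g. to steer
# traffic away from a device, by injecting a more-preferred path.
routes["10.1.0.0/16"].append(("controller", 200, "fsw3"))
print(best_path("10.1.0.0/16"))            # the injected override wins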
The network is all layer3 – from TOR uplinks to the edge. And like all our networks, it’s dual stack, natively supporting both IPv4 and IPv6. We’ve designed the routing in a way that minimizes the use of RIB and FIB resources, allowing us to leverage merchant silicon and keep the requirements to switches as basic as possible.
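One common way to keep RIB and FIB usage small is prefix aggregation. The post does not spell out the addressing scheme, so the sketch below is only a hedged illustration: per-rack prefixes under an assumed pod-wide block are summarized into a few covering aggregates with Python’s standard ipaddress module.

import ipaddress

# Hedged illustration only: the addressing plan here is assumed.
# Give each of the 48 racks in a pod its own IPv4 /24, then summarize
# what the pod needs to announce upward instead of 48 individual routes.

pod_block = ipaddress.ip_network("10.1.0.0/18")           # assumed pod-wide block
rack_prefixes = list(pod_block.subnets(new_prefix=24))[:48]

aggregates = list(ipaddress.collapse_addresses(rack_prefixes))

print(f"{len(rack_prefixes)} rack routes -> {len(aggregates)} aggregates: {aggregates}")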
For most traffic, our fabric makes heavy use of equal-cost multi-path
(ECMP) routing, with flow-based hashing. There are a very large number
of diverse concurrent flows in a Facebook data center, and statistically
we are seeing almost ideal load distribution across all fabric links.
To prevent occasional “elephant flows” from taking over and degrading an
end-to-end path, we’ve made the network multi-speed – with 40G links
between all switches, while connecting the servers on 10G ports on the
TORs. We also have server-side means to “hash away” and route around
trouble spots, if they occur.
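As a sketch of flow-based ECMP hashing – a generic illustration, not Facebook’s actual hash function – the Python below hashes a flow’s 5-tuple to pick one of the equal-cost uplinks, so packets of the same flow always take the same link while many concurrent flows spread across all of them.

import hashlib

# Generic flow-based ECMP illustration (not Facebook's actual hash).
# All packets of one flow hash to the same uplink, preserving ordering;
# many concurrent flows spread statistically across the links.

UPLINKS = ["fsw1", "fsw2", "fsw3", "fsw4"]   # the TOR's 4 x 40G uplinks

def pick_uplink(src_ip, dst_ip, proto, src_port, dst_port):
    five_tuple = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    digest = hashlib.sha256(five_tuple).digest()
    return UPLINKS[int.from_bytes(digest[:4], "big") % len(UPLINKS)]

# Same flow -> same link; a different flow may land on a different link.
print(pick_uplink("10.1.1.5", "10.2.3.7", "tcp", 51512, 443))
print(pick_uplink("10.1.1.5", "10.2.3.7", "tcp", 51512, 443))
print(pick_uplink("10.1.1.6", "10.2.3.7", "tcp", 40000, 443))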



