Global computer networks are immensely beneficial to users, but they can also be immensely difficult for administrators to run. Operating a modern data network – with thousands of computers spread across a wide area – requires juggling myriad systems including power regulation, maintenance and traffic management, not to mention security.
To meet the increasing needs of ever-expanding systems, researchers at Princeton and Microsoft have created an automated system that manages the network’s needs. Called Statesman, the new system acts like an air traffic controller for large computer networks: it constantly monitors the needs of the system and coordinates actions of other systems involved in maintenance and operations.
“Companies that run these large clouds have a scale problem,” said Jennifer Rexford, the Gordon Y.S. Wu Professor in Engineering and one of the developers of Statesman. “The size of the networks keeps getting bigger and bigger.”
Working with the Microsoft Azure networking team, Rexford and graduate student Peng Sun set out to create a network management system that is reliable and adaptable and requires little or no human intervention. To coordinate tasks that run independently, such as power management and network traffic management, the team defined three states for a network: an observed state, a proposed state and a target state. Statesman maintains a current view of the network, which is the observed state, and is also responsible for updating the network to a desired target state.
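As a rough illustration of the three-state idea, the following Python sketch holds each state as a simple mapping and works out which settings still need to change to bring the observed state to the target state. The class name, the function name and the variable keys such as "server-2.power" are hypothetical and chosen only for this example; the article does not describe how Statesman actually stores its states.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkStates:
    """Hypothetical container for the three states named in the article."""
    observed: dict[str, str] = field(default_factory=dict)        # current view of the network
    proposed: list[dict[str, str]] = field(default_factory=list)  # changes requested by other programs
    target: dict[str, str] = field(default_factory=dict)          # state the network should reach


def pending_updates(states: NetworkStates) -> dict[str, str]:
    """Return every target setting that the observed state does not yet match."""
    return {
        key: value
        for key, value in states.target.items()
        if states.observed.get(key) != value
    }


states = NetworkStates(
    observed={"server-1.power": "on", "server-2.power": "on"},
    target={"server-1.power": "on", "server-2.power": "off"},
)
print(pending_updates(states))  # {'server-2.power': 'off'}
```

In this sketch, only the gap between the observed and target states is treated as work to be done, which is one plausible way to read "updating the network to a desired target state."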
Lihua Yuan, a principal engineer with Microsoft, said that another challenge for engineers was developing a system that could easily change as the network changed and also could eliminate errors that would cause problems for users.
“It needs to scale really well, and it also needs to be failure proof,” he said. Yuan said that his engineering team constantly tested the system as it was developed and also ensured that Statesman could be paused at any moment without causing problems for the underlying network.
Yuan described Statesman as a “gatekeeper,” keeping watch over the network and taking action to resolve possible conflicts. When a subordinate system wants to make a change in the network – say a traffic management system wants to send data requests to a different group of servers on the network – it develops a proposed state and sends this to Statesman. Statesman compares the proposed state with changes proposed by other programs and uses a set of rules to determine whether the change can be allowed. If, for example, the traffic manager wanted to use servers that a power management system needed to take offline, the traffic request would be denied.
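The gatekeeper check described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the Proposal class, the conflicts and review functions, the server names and the single rule used here (a server cannot be both used and taken offline) are all hypothetical, and the example assumes the power manager's request has already been accepted. The article does not spell out Statesman's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical proposed change submitted by a management program."""
    app: str
    server: str
    action: str  # either "use" or "take_offline" in this sketch


def conflicts(a: Proposal, b: Proposal) -> bool:
    """Assumed rule: one server cannot be both used and taken offline."""
    return a.server == b.server and {a.action, b.action} == {"use", "take_offline"}


def review(accepted: list[Proposal], incoming: Proposal) -> bool:
    """Allow the incoming proposal only if it conflicts with nothing already accepted."""
    return not any(conflicts(incoming, existing) for existing in accepted)


# Assumption: the power manager's request was accepted first.
accepted = [Proposal("power-manager", "server-7", "take_offline")]
ok_request = Proposal("traffic-manager", "server-3", "use")
bad_request = Proposal("traffic-manager", "server-7", "use")

print(review(accepted, ok_request))   # True: server-3 is not being taken offline
print(review(accepted, bad_request))  # False: server-7 is scheduled to go offline
```

A real system would need many more rules and some way to rank competing programs; the point here is only that each proposal is checked against the others before any change reaches the network.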
“We wanted a system that could manage very large-scale infrastructure automatically and handle conflict and safety issues on its own,” Sun said. “It is not a prototype. It was built for use from day one.”
Statesman began operating in Microsoft data centers last October. The researchers presented the project at the SIGCOMM conference in Chicago last August. Besides Rexford, Sun and Yuan, researchers included Ahsan Arefin, Ratul Mahajan, and Ming Zhang of Microsoft.