High-powered Internet applications typically need teams of experts to maintain them. Not any more, say European researchers who have built a system to create applications that manage and fix themselves.
There is a problem with the programs that let Internet users download videos from a peer-to-peer service, perform scientific research on a grid, or use "cloud computing" to manage a business, and that allow millions of linked devices and applications to work together, says Peter Van Roy, coordinator of the EU-supported SELFMAN project. The problem is that it's getting harder to keep those systems working.
"The central challenge when you build big Internet applications is how to keep them running without having to tweak and manage them all the time," he says.
The SELFMAN team set out three years ago to solve that problem by finding out how to build programs that take care of themselves in the rough-and-tumble Internet environment. "We wanted to make big Internet applications easy so that all the management problems you normally have are handled by the system itself," Van Roy says.
The payoff, he says, will be huge. "It will take the Internet to the next level."
The SELFMAN researchers identified four vital functions that a distributed application needs in order to manage itself: self-configuring, self-tuning, self-healing and self-protecting.
Software is continually being patched, updated or replaced. For a distributed system to configure itself, it needs to keep track of all its components, update them as needed, and make sure that all parts of the system can still talk to each other.
"Our system can ask a component, what version are you? Who are you talking to? It can then replace an old version with a new one as needed," says Van Roy.
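As a rough illustration of that kind of self-configuration, the sketch below asks each component for its version and swaps in a newer one while keeping its connections. The `Component` class, its fields, and `upgrade_outdated` are hypothetical names invented for this example, not part of SELFMAN.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component record: a name, a version, and the peers it talks to."""
    name: str
    version: int
    peers: list = field(default_factory=list)


def upgrade_outdated(components, latest):
    """Replace every component older than `latest`, keeping its peer list.

    Returns the names of the components that were replaced.
    """
    replaced = []
    for i, comp in enumerate(components):
        if comp.version < latest:  # "what version are you?"
            # The new version inherits the old one's connections
            # ("who are you talking to?").
            components[i] = Component(comp.name, latest, list(comp.peers))
            replaced.append(comp.name)
    return replaced


components = [
    Component("store", 1, ["cache"]),
    Component("cache", 2, ["store"]),
]
replaced = upgrade_outdated(components, latest=2)
```

Here only the "store" component is out of date, so it alone is replaced, and it still lists "cache" as a peer afterward.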
Self-tuning means that the system can instantly adjust to changing loads and to components leaving or joining the network.
"Suppose one node is getting overloaded," says Van Roy. "Our load-balancing algorithm allocates new nodes close to that hotspot. It spreads the heat to the other nodes and the hotspot cools down."
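The idea of cooling a hotspot can be sketched in a few lines. This is a simplified illustration, not SELFMAN's load-balancing algorithm: the `Node` class, the load threshold, and the rule that a new neighbor takes half of an overloaded node's load are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    load: int  # hypothetical unit, e.g. requests per second


def rebalance(ring, threshold):
    """Return a new ring in which each overloaded node gets a fresh neighbor.

    The new neighbor is placed right after the hotspot and takes half of
    its load. This is a single pass: a node may still be above the
    threshold afterward if it was extremely overloaded.
    """
    result = []
    spares_used = 0
    for node in ring:
        if node.load > threshold:
            moved = node.load // 2
            spares_used += 1
            result.append(Node(node.name, node.load - moved))
            result.append(Node(f"spare-{spares_used}", moved))
        else:
            result.append(Node(node.name, node.load))
    return result


ring = [Node("a", 20), Node("b", 90), Node("c", 30)]
balanced = rebalance(ring, threshold=50)
```

The total load is unchanged; it is only spread across one more node placed next to the overloaded one.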
The Internet is an unpredictable environment. Routers crash, cables get cut, parts of the system overload and grind to a stop, and components come and go.
"With SELFMAN," Van Roy says, "each node stores some of the data and each piece of data is replicated a certain number of times. If a node crashes, the other nodes detect the crash, find a new node and give it the missing data. The system heals itself."
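A minimal sketch of that healing step follows, assuming a simple table that maps each piece of data to the set of nodes holding a copy. The function name `heal`, the table layout, and the choice of the first available live node as the replacement are illustrative assumptions, not SELFMAN's actual mechanism.

```python
def heal(replicas, nodes, crashed, factor):
    """Restore the replication factor after `crashed` leaves the system.

    `replicas` maps each data key to the set of nodes holding a copy and
    is updated in place. `factor` is the desired number of copies.
    """
    live = [n for n in nodes if n != crashed]
    for holders in replicas.values():
        holders.discard(crashed)  # the crashed node's copy is gone
        for node in live:
            if len(holders) >= factor:
                break
            if node not in holders:
                holders.add(node)  # hand the missing copy to a new node
    return replicas


nodes = ["n1", "n2", "n3", "n4"]
replicas = {"x": {"n1", "n2"}, "y": {"n2", "n3"}}
heal(replicas, nodes, crashed="n2", factor=2)
```

After the call, every key is again held by two live nodes and none of them is the crashed node.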
One of the biggest problems SELFMAN tackled was self-defense. The researchers discovered that a system's security depends on its topology, that is, how nodes are linked to each other. They found that "small world" networks, in which most nodes are not directly linked but any node can reach any other in a few steps, were the safest.
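The "few steps" property can be illustrated with a toy network. The sketch below compares a plain ring of 64 nodes with the same ring plus shortcut links at power-of-two distances; the shortcut pattern is an assumption chosen for the example, not the topology SELFMAN uses.

```python
from collections import deque


def build(n, offsets):
    """Link node i to node (i + off) % n, in both directions, for each offset."""
    links = {i: set() for i in range(n)}
    for i in range(n):
        for off in offsets:
            j = (i + off) % n
            links[i].add(j)
            links[j].add(i)
    return links


def max_hops(links, start=0):
    """Breadth-first search: the largest number of hops from `start` to any node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return max(hops.values())


ring_only = build(64, [1])
with_shortcuts = build(64, [1, 2, 4, 8, 16, 32])

ring_hops = max_hops(ring_only)
shortcut_hops = max_hops(with_shortcuts)
```

On the plain ring the farthest node is 32 hops away. With shortcuts, each node still links directly to only 11 of the other 63 nodes, yet every node is reachable in at most 6 hops.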
"With a small world network, it's easier to detect, isolate, and eject bad nodes," says Van Roy. "The security service observes the system's behavior. If it notices that certain parts of the network are acting abnormally, it takes action."
The SELFMAN team found that building these advanced capabilities into useful applications required a highly structured approach.
The foundation of each application is a structured overlay network. That's a program — itself replicated across the network — that keeps track of all the nodes and connections between them, and can decide when and how to fix problems.
The next level is a replicated storage system. It makes sure that each node has access to the same data, and that data are always replicated to ensure they do not disappear.
The third level houses SELFMAN's transactional problem-solver. It relies on a sophisticated algorithm called Paxos to provide a systematic way of reaching consensus among any number of fallible components.
Van Roy uses the analogy of a transfer between two bank accounts. "If you want to reduce one bank account by 100 euros and add that 100 to another, you want both or nothing," he says. "Each node must see the same data."
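The "both or nothing" rule in that analogy can be sketched as follows. This example shows only the atomic update on a single copy of the data; it does not implement Paxos, which would be the step that gets all replicas to agree on whether the transfer commits. The `transfer` function and the account names are hypothetical.

```python
def transfer(accounts, src, dst, amount):
    """Move `amount` from `src` to `dst`, applying both updates or neither.

    Returns (new_accounts, committed). On failure the original mapping is
    returned untouched, so no half-finished transfer is ever visible.
    """
    if src not in accounts or dst not in accounts or accounts[src] < amount:
        return accounts, False
    staged = dict(accounts)  # work on a copy, never on the live data
    staged[src] -= amount
    staged[dst] += amount
    return staged, True


accounts = {"alice": 150, "bob": 20}
after_first, ok_first = transfer(accounts, "alice", "bob", 100)
after_second, ok_second = transfer(after_first, "alice", "bob", 100)
```

The first transfer commits and leaves alice with 50 and bob with 120. The second is refused because alice no longer has 100 euros, and the balances stay exactly as they were.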
"Getting all this fluid behavior — where even if nodes are crashing or new nodes are coming in or the network has problems it never blocks the system — was a big technical problem," says Van Roy. "We needed Paxos to get it to work."
The SELFMAN architecture and components have been used to build some impressive applications. These include a prize-winning distributed Wikipedia that can handle far more queries than the current version, a commercially successful media streaming service, and a graphics program that lets multiple users collaborate on a design.
Van Roy believes that SELFMAN opens the door to a host of high-powered, flexible, and "unbreakable" Internet applications. "Right now we're just scratching the surface," he says.
The SELFMAN project received funding from the ICT strand of the EU's Sixth Framework Program for research.
From ICT Results