The idea of machines monitoring their own vital signs and fixing themselves when broken has been a longstanding conceit in speculative fiction, frequently employed by writers even before computer scientist John McCarthy coined the term "artificial intelligence" in 1955. The rationale for developing real-world applications of such technology is straightforward. Instead of being paralyzed by system glitches, a self-healing system would automatically identify its problems, isolate them from the rest of the system (if necessary), then either fix those problems or implement workarounds. In theory, systems operating in this way not only could improve fault tolerance, but also could help mitigate IT complexity and automate routine tasks, thereby reducing operating costs.
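The identify, isolate, then fix-or-work-around sequence described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the `Component` class, its `check` and `repair` callbacks, and the `heal` function are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Component:
    """Hypothetical monitored component; every name here is illustrative."""
    name: str
    check: Callable[[], bool]    # returns True when the component is healthy
    repair: Callable[[], bool]   # returns True when the repair attempt succeeded
    isolated: bool = False
    log: List[str] = field(default_factory=list)


def heal(component: Component) -> str:
    """Run one identify / isolate / repair-or-work-around pass."""
    # Identify: a passing health check means there is nothing to do.
    if component.check():
        return "healthy"
    # Isolate: fence the component off from the rest of the system.
    component.isolated = True
    component.log.append("isolated")
    # Fix: attempt the repair, then re-check before returning to service.
    if component.repair() and component.check():
        component.isolated = False
        component.log.append("repaired")
        return "repaired"
    # Workaround: leave the component isolated and carry on without it.
    component.log.append("workaround")
    return "workaround"
```

A real system would run such a pass continuously and across many components; the sketch only shows the order of the steps.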
Technology mainstays such as Hewlett-Packard (HP), IBM, Microsoft, Oracle, and Sun Microsystems are promoting self-healing technology not as some future fantasy, but as a viable solution for some of today's vexing IT problems. Microsoft, for example, rolled out self-healing in Windows Server 2008, which can automatically identify certain file-system corruptions and fix them without user intervention. Oracle, for its part, uses self-healing technology in its latest database products for automated diagnosis and repair. And HP has incorporated the technology in its enterprise servers and software. With these and other implementations offering benefits far more sophisticated than generating simple error reports for already overworked IT staff, self-healing technology appears to be coming of age.
One of the companies conducting a great deal of research in this area is IBM, which launched a major self-healing initiative in 2001. "We have been applying the principles of fault detection, provisioning, configuration, security, and optimization to the complete data-center infrastructure," says Matthew Ellis, vice president of IBM's autonomic computing division. Thanks to research efforts in that division, IBM now uses self-healing technologies in WebSphere, DB2, the Lotus Foundations product line, and its Power servers. Another company investing heavily in self-healing and making headlines in this field is Sun, whose engineers have designed predictive self-healing modules that enable Solaris to self-diagnose and mitigate problems.
Still, one of the major issues that IBM, Sun, and others are facing in this area is hardware failure. In virtualized or clustered systems already implementing redundancy or failover technologies, this issue is less critical. But when it comes to enhancing the capabilities of automation and self-healing technologies, hardware will continue to be a major concern, says Mike Shapiro, a distinguished engineer at Sun credited with leading the effort to design and build the Sun architecture for predictive self-healing. The key, Shapiro says, is to connect the unit of hardware failure to the right layer of software abstraction in order to make a good decision about what to do, such as switching to a redundant component, requesting downtime, or moving the workload to another physical server.
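The kind of decision Shapiro describes can be pictured as a small policy function. The sketch below is purely illustrative and does not reflect how Sun's predictive self-healing is built: the `Action` names, the `choose_action` function, its two inputs, and the order of preference (redundancy first, then migration, then downtime) are all assumptions made for this example.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical responses to a diagnosed hardware fault."""
    TRIGGER_REDUNDANCY = "switch to a redundant component"
    MIGRATE = "move to another physical server"
    REQUEST_DOWNTIME = "request downtime"


def choose_action(has_redundant_unit: bool, can_migrate: bool) -> Action:
    """Pick a response based on what the affected software layer can tolerate.

    Assumed policy: prefer the least disruptive option that is available.
    """
    if has_redundant_unit:
        # A spare exists at this layer, so fail over in place.
        return Action.TRIGGER_REDUNDANCY
    if can_migrate:
        # No spare, but the workload can run on different hardware.
        return Action.MIGRATE
    # Nothing can absorb the fault, so schedule a repair window.
    return Action.REQUEST_DOWNTIME
```

The point of the sketch is that the same hardware fault can lead to different responses depending on what the software running on top of it can tolerate.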
Such decisions made by self-healing technologies no doubt will be critical in the coming years as data centers and desktops become more complex. Judging by the number of companies embracing self-healing strategies, the technology appears positioned to have a major impact on computing, one that its proponents hope will make downtime a thing of the past.