"While many people are focusing on glitzy aspects of technology (e.g. machines that think, etc.) we are mostly ignoring the known problems of managing and engineering complex and large software systems," Leveson wrote.
This is certainly the case at NASA and Lockheed Martin, the company that built the Mars Polar Lander, Leveson said in an interview. The reports released Tuesday by two investigation panels are evidence of the fact, she said. But Leveson added that Lockheed Martin shouldn’t be singled out for criticism when the same problem plagues the entire aerospace industry.
"What depressed me is that people are not working on these problems," Leveson said.
The most likely cause of the lander’s failure, investigators decided, was that a spurious sensor signal associated with the craft’s legs falsely indicated that the craft had touched down when in fact it was some 130 feet (40 meters) above the surface. This caused the descent engines to shut down prematurely and the lander to free-fall out of the Martian sky.
The problem was traced to a single bad line of software code. But that trouble spot is just a symptom of a much larger problem -- software systems are getting so complex, they are becoming unmanageable, Leveson said.
In many cases with spacecraft and aircraft flight systems, the interactions between various components have become so numerous and complicated that engineers simply cannot plan for, manage or even predict them all.
"We got away with this up to a certain point of complexity, but we keep adding complexity and we get to the point where we’re not going to be able to add any more without having these kinds of problems," she said.
It’s the curse of "feature creep" in software engineering, and it is especially dangerous because it is largely invisible.
"With mechanical systems, the design is right there on paper. And you can see it, and you can test it, and you can touch it," Leveson said. "With software it’s much harder to see what they've got. So they can't see that they've made this mess."
The answer is discipline in software design, Leveson said. Engineers need to understand that each new level of complexity shoved into software comes at a cost. She also suggested that more engineers would understand this if they were more familiar with the problems of software coding, and if software writers knew more about aerospace engineering.
Besides stripping down software code, NASA and others who rely on systems where a single failure can kill a mission need to be more rigorous with software testing, Leveson said.
It’s an old problem that she has seen at NASA before.
In the early 1990s, Leveson chaired a committee that looked into NASA’s management of the software that runs the space shuttle fleet.
Almost all the shuttle software errors that have occurred since the first launch in 1981 would have been found if NASA had done more "off-nominal" testing, Leveson said.
"Nominal" is NASA-speak for a mission that goes entirely according to plan and procedure. "Off-nominal" refers to what happens when the unexpected occurs.
NASA and others must begin doing more off-nominal testing and begin designing software that works even when things go wrong, Leveson said.
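As a rough illustration of what an off-nominal test can look like, the Python sketch below checks a touchdown detector against inputs that should never occur on a perfect flight, including a one-off spurious sensor reading. Everything in it is an assumption made for illustration: the function `touchdown_detected`, its `required_consecutive` parameter and the threshold of three samples are hypothetical and are not taken from the lander's flight software or from the investigators' reports.

```python
from typing import Iterable


def touchdown_detected(samples: Iterable[bool], required_consecutive: int = 3) -> bool:
    """Report touchdown only after the leg sensor reads True for
    `required_consecutive` samples in a row (hypothetical rule)."""
    run = 0
    for reading in samples:
        # A False reading resets the run, so isolated spikes are ignored.
        run = run + 1 if reading else 0
        if run >= required_consecutive:
            return True
    return False


# Nominal case: the sensor stays quiet, then reads True once the legs touch.
nominal = [False] * 5 + [True] * 5

# Off-nominal cases: inputs the designers hope never to see.
spurious_spike = [False, False, True, False, False, False]  # one false reading
chattering = [True, False, True, False, True, False]        # noisy sensor
no_data = []                                                # sensor sends nothing

assert touchdown_detected(nominal)
assert not touchdown_detected(spurious_spike)
assert not touchdown_detected(chattering)
assert not touchdown_detected(no_data)
```

The nominal case alone would pass even if the function trusted a single reading. Only the off-nominal inputs show whether a stray signal can be mistaken for a landing.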
There are various types of analyses that engineers and mission managers can do to determine where the worst risks of failure may hide. One such analysis begins by assuming the loss of the spacecraft and working backwards to pinpoint which software system failures could be fatal, then building safeties that protect against those.
It requires a different way of thinking about software design, Leveson said -- one that accepts that engineers may not be able to get everything to work perfectly, but makes sure that nothing awful happens.
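The backward-working analysis described above resembles what safety engineers call a fault tree. A minimal Python sketch, assuming a made-up tree, follows: it starts from the loss of the spacecraft and works down to the smallest combinations of failures that could cause it. The event names and the tree itself are hypothetical and do not come from NASA or from the investigation reports.

```python
def cut_sets(node):
    """Return the sets of basic failures that trigger `node`.
    A node is either a failure name or a (gate, children) pair."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child is enough to trigger the parent.
        return [s for sets in child_sets for s in sets]
    if gate == "AND":
        # Every child must fail, so combine one set from each child.
        combined = [frozenset()]
        for sets in child_sets:
            combined = [a | b for a in combined for b in sets]
        return combined
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Drop any set that strictly contains another, then sort for display."""
    unique = set(cut_sets(node))
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# Hypothetical top event: loss of the spacecraft during descent.
loss_of_spacecraft = ("OR", [
    ("AND", ["spurious_touchdown_signal", "signal_not_filtered"]),
    ("AND", ["altimeter_fault", "backup_altimeter_fault"]),
    "fuel_exhausted",
])

results = minimal_cut_sets(loss_of_spacecraft)
single_points = [s for s in results if len(s) == 1]

for failure_set in results:
    print(sorted(failure_set))
```

In this toy tree, `single_points` holds the failures that could end the mission on their own, which is where a safeguard would matter most.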
For three decades now, since the tragic 1967 fire that killed three astronauts during a launch simulation on the Apollo launch pad, NASA has had a rigorous safety program that it applies to nearly all of its operations, Leveson said.
"[NASA] set up the safety program and they have applied these things to electromechanical systems. But nobody is doing a good job of applying them to software and to the newer technologies," she said.