Guest article by Icarus Interstellar's Donna A. Dulo, senior mathematician, computer scientist, and software/systems engineer for the US Department of Defense. Read more Icarus Interstellar articles on Discovery News.
As your spacecraft cruises expeditiously at warp speed on the edge of the galaxy, you detect a faint blip on your long-range sensors. As you move closer, the blip splits into over a dozen distinct blips, each racing toward you in a tight formation. You sound the alarm, and as your crew takes their stations you realize the worst: you are about to confront a sizable armada of Borg cubes and spheres.
Fortunately, you spare the ship a major confrontation as you maneuver through a small, uncharted wormhole that you had recently detected in your pre-shift navigational planning, and the ship escapes unharmed. You emerge slightly off course but the ship is intact and your crew is safe.
As you begin to recalculate your new route, you detect another alarm. The systems module for ship life support has malfunctioned due to a software error, which emerged during the escape maneuvers. The software has corrupted the habitation systems controller, and you realize that the ship will stop producing breathable air within the next 24 hours.
The backup system has failed as well, since the backup hardware utilizes the same software routines. The rescue pods allow for 48 hours of breathable air, and the on-board mobile units have 8 hours of air per backpack.
You send your best computer scientists and software engineers into the engine room to diagnose the problem. They inform you that, in the best-case scenario, it will take them at least four days to isolate and fix the errors amid the several hundred million lines of code that control the ship and life support systems.
Your situation is now dire. You order the shut off of non-critical systems and send your software team off with the utmost urgency. Then you wait, knowing that the lives of all crew members are now in the hands of your software development team.
Software in Starship Systems
The above scenario demonstrates the vital nature of software in long-term space travel. It raises the question of which enemy is worse: a flotilla of Star Trek baddies or a weakness in the ship’s complex software system?
For those who are familiar with the highly intricate system-of-systems nature of software, the answer is clear: it's the software that poses the greatest risk.
Traveling interstellar distances requires a self-sufficient, self-supporting ship and crew capable of rapid in-house solutions to the gravest engineering situations. The inherent exponential complexity and ultimate fragility of software make it one of the weakest links in long-term survival aboard an interstellar spacecraft.
Imagine a fully operational spacecraft, with hundreds of millions of lines of code and tens or even hundreds of thousands of variables and states. Diagnosing a single error in a line of code is almost impossible in an emergency situation, even with the most advanced automated testing routines. The stress of the situation coupled with the inherent difficulties of both the mathematical logic and the sheer volume of code would strain even the best of engineering teams, who are now working without time on their side. The situation is ominous, but could it have been prevented?
As with the Borg, where you were thinking well in advance and had contingency plans and escape routes mapped out, planning for long-term spacecraft software safety is possible. However, this planning must occur during the development of the ship, as well as during its interstellar operations. The key is a new engineering paradigm called "resilience," and it can quite readily be applied to software engineering and software program development.
Software Long Term Resilience
In a long-term space mission, the limits of the software will be challenged; however, failure is truly not an option.
The software, as well as the crew members utilizing it, must be resilient to cope with all possible safety-critical situations. The concept of resilience as a discipline in engineering emerged in the mid-2000s as a way to mitigate failure in complex systems even when sound engineering practices are followed. Resilience engineering as a concept in software emerged thereafter as a paradigm for software safety that focuses on how people deal with the complexity of a software system to achieve operational success in even the most challenging of situations.
Resilience engineering focuses on the ability of a system to adapt to constantly changing situations and conditions so that a positive state of control over the system is maintained and failure is avoided. Coupled with the system's ability to adapt is the ability of the human element within the system to adapt. Together, a human-machine coupled safety approach emerges, allowing the human element to gain knowledge and foresight into the system's operations and become a proactive part of the system's safety operations.
There are two aspects of software resilience engineering: the resilience of the software itself, achieved through a sound safety-focused development process, and the actual real-time operation of the software with positive human-in-the-loop operations. The overall software operates under the concept that safety is a core value, along with the continual anticipation of potential failure in the software.
This human-centric safety focus helps change the risk equation of the system by providing measures to break the chain of cascading software failure causation while at the same time reducing the fragility of the system. The result is safer, more viable and predictable software performance with users fully involved in software processes and evolution.
Resilience engineering techniques continue to emerge, and focus on logical software redundancy, adaptive intervention techniques, and predictive analysis, among many other sound engineering methodologies. Amidst the engineering structures are sound, well-grounded human resource and operational management protocols designed to focus on the ability of a crew to adapt to changing situations and mitigate even the most challenging software emergencies.
Through the coupling of resilient software engineering techniques with viable technical organizational leadership and team-centric emergency software management, a complex system has the ability to survive a catastrophic failure, or to prevent that failure proactively.
Applying Software Resilience
In our example, the backup life support system failed because it had exactly the same programming as the primary system, and thus in the same situation the backup failed in the same way. A more resilient system would use a different software program and set of algorithms in the backup system to do the same work, so that a single software fault could not disable both.
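The idea of a backup that runs different software can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the oxygen setpoint, the function names (primary_o2_valve, backup_o2_valve, run_with_failover), and the control rules are assumptions, not details of any real life support system. The point is simply that the two routines reach a valve setting by different logic, so a defect in one is unlikely to be repeated in the other.

```python
import math
from typing import Callable, Sequence, Tuple

TARGET_O2 = 0.21  # assumed cabin oxygen fraction setpoint (illustrative)


def primary_o2_valve(o2_fraction: float) -> float:
    """Primary routine (hypothetical): proportional control of valve opening."""
    gain = 10.0
    return min(1.0, max(0.0, gain * (TARGET_O2 - o2_fraction)))


def backup_o2_valve(o2_fraction: float) -> float:
    """Backup routine (hypothetical): a deliberately different, stepwise rule."""
    deficit = TARGET_O2 - o2_fraction
    if deficit <= 0.0:
        return 0.0
    if deficit < 0.02:
        return 0.25
    if deficit < 0.05:
        return 0.5
    return 1.0


def faulty_primary(o2_fraction: float) -> float:
    """Stand-in for a primary routine corrupted by a software error."""
    raise ValueError("habitation controller state corrupted")


Routine = Tuple[str, Callable[[float], float]]


def run_with_failover(routines: Sequence[Routine], o2_fraction: float) -> Tuple[str, float]:
    """Try each routine in order; accept the first sane valve opening in [0, 1]."""
    for name, routine in routines:
        try:
            opening = routine(o2_fraction)
        except Exception:
            continue  # this routine crashed, fall through to the next one
        if not math.isnan(opening) and 0.0 <= opening <= 1.0:
            return name, opening
    raise RuntimeError("all life support routines failed")


if __name__ == "__main__":
    chain = [("primary", faulty_primary), ("backup", backup_o2_valve)]
    print(run_with_failover(chain, 0.18))
```

Had the backup simply reused faulty_primary, as in the opening scenario, run_with_failover would have run out of options and raised an error. With an independently written backup, the same fault leaves a working routine in place.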
A resilient system, too, would have its software more modularized and mathematically provable, providing greater and more viable paths of adaptability, refactoring, and repair. Reduced complexity and more standardized programmatic and algorithmic structures would provide additional safeguards to improve resilience.
Then the human element of a resilient system comes into play. Upon failure of the life support system, the crew on duty would immediately switch the system to the backup components, which house a different set of software routines including a completely different set of mathematical logic.
All crew members have been trained on both the hardware and the software of the ship, so the duty crew is well versed in all types of computational errors and how to handle them. Significant time has now been bought as the software systems engineering team gets to work on repairing the faulty logic of the primary set of software routines as the backup system functions flawlessly.
The repair task is easier because the software is more modularized, is easily hierarchically decomposed, and is thoroughly documented in its design and architecture as well as in its mathematically proven structures. The team is supplemented by a set of second-tier software engineers, who have software engineering as their secondary crew function and are highly trained by the primary software team.
The scenario was well rehearsed in the past during training, and each team player knows their function: coder, verifier, mathematician, tester, and implementer. In a systematically orchestrated engineering management endeavor, a new set of logic is developed and coded for the primary system. Within two days it is verified and tested. It is subsequently implemented with total crew involvement, and the ship is back in full operational order.
By building resilience into a spacecraft's software development and real-time operation, the crew can enhance the survivability of the ship even while facing the most daunting software challenges. Through the development and application of the cutting-edge theories and methodologies of software resilience, the ship will have the tools and the crew to safely negotiate the perils of deep space software operations.
Resilience techniques can also be applied to other forms of engineering as well as to the ship's operations, creating a holistic safety culture that will enhance overall ship survivability. Thus, a resilient spacecraft will be a sustainable long-term spacecraft, destined to traverse the galaxy with endless possibilities for current and future generations. And if confronted, the Borg won't stand a chance.
Icarus Interstellar’s mission is to promote the development of Starship research, for both manned and unmanned vehicles. Software will be a huge part of these future systems and resilience research will help with the ultimate objectives, to first reach the stars and then to move among them, as an interstellar civilization.
This article was provided by Discovery News.