Systems Management – COMMENTARY

Written by David C. Lester, Editor-in-Chief

            RAILWAY TRACK AND STRUCTURES, NOVEMBER 2023 ISSUE, FROM THE DOME COMMENTARY – The adoption of computers and the internet, as we all know, has brought massive changes and efficiencies to the way we operate railroads and conduct many other aspects of life. That same connectivity, though, has made businesses more vulnerable to bad actors. The recent system outages at Norfolk Southern and Canadian National caused concern throughout the industry, but both roads said there was no evidence that the outages were due to security breaches. Indeed, Canadian National reported that its outage was caused by an issue during an upgrade. Unfortunately, when a system outage brings a railroad to its knees, there’s no hiding it. Even if an outage is brought about by a security breach, companies are not likely to advertise it. And a system outage can be caused by a myriad of issues other than a security breach.

            When a railroad experiences a system outage, though, the possibility of a security breach rushes to the forefront of the mind. Make no mistake, cybersecurity threats are out there and an entire industry has sprouted to help companies keep them from happening or to limit their damage if they do. Cyber attacks are among the greatest threats to our business operations, way of life, and ability to function as a society.

            However, when any company, including a railroad, has a system outage, chances are very good that a security breach was not the problem; rather, it was caused by some type of design or management issue that impacted the systems. As most readers know, systems are extremely complex and rely on sophisticated hardware and software to get the job done. What’s more, they require careful maintenance and management. Think about your home computer for a moment. You must periodically upgrade software or hardware, and these upgrades usually present some issues that need troubleshooting and correction. The home computer is not like the old Bell telephone that you could abuse, throw rocks at, and have it keep on going. Those old Bell telephones lasted for decades, while the standard for upgrading both the hardware and software in your home computer is every three to five years. And five years is pushing it.

            Given your experience with a home computer, you can imagine how complex a railroad’s systems are. And note that I use the term “systems,” because modern companies employ a “system of systems”: many computers and networks that do different things but are usually connected to each other and, most often, to the internet. Firewalls and virtual private networks protect these systems from some internet intrusions, but those protections can be breached. The internet connection is necessary because the systems at one company must communicate with those at other companies. But I digress.

            As with the home computer, each of the systems in a big company’s system of systems must be managed and maintained properly. Both hardware and software. The tasks are numerous. Software has bugs that must be fixed. To fix these bugs, more software must be written or existing software must be changed. Once that’s done, testing must be performed to ensure that the fix for one problem didn’t cause something else in the software to “break,” as well as to confirm that the fix actually resolved the problem it was supposed to. It generally takes multiple iterations of testing to verify that a software problem is resolved and that the resolution didn’t cause another software process to fail. There are multiple types of testing, too, but reviewing those is beyond the scope of this column.
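            The fix-and-retest loop described above can be sketched in a few lines of code. Everything here is hypothetical (a made-up fare calculation, not anything from an actual railroad system); the point is simply that the test for the reported bug runs alongside the tests for existing behavior, so a fix that quietly breaks something else gets caught.

```python
# A minimal sketch of the fix-then-regression-test cycle.
# The fare function is purely illustrative; prices are in cents
# to keep the arithmetic exact.

def ticket_price_cents(miles: int) -> int:
    """Fare = flat rate plus a per-mile charge.
    Hypothetical bug fix: an earlier version charged the
    flat rate once per mile instead of once per trip."""
    FLAT_RATE_CENTS = 500
    PER_MILE_CENTS = 12
    return FLAT_RATE_CENTS + PER_MILE_CENTS * miles

# Test for the reported bug: the fix must resolve the original problem.
assert ticket_price_cents(100) == 1700

# Regression tests: behavior that already worked must still hold.
assert ticket_price_cents(0) == 500   # zero miles still charges the flat rate
assert ticket_price_cents(1) == 512
```

In a real shop these assertions would live in a test suite that is re-run in full after every change, which is what makes the “multiple iterations of testing” practical.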

            Periodically, an entire software system must be upgraded to a new version to introduce new functionality, fix bugs, and even work properly on a given type of hardware. Sometimes, a software upgrade may require new or upgraded hardware because the old hardware may not have, among other things, the horsepower to handle processing the new software. The reverse is also true. It could be that hardware needs to be upgraded because it’s reached end of life, but the software being run on that hardware may not be compatible with the new hardware, so a software upgrade is required.

            The core message of this column is that, with all this activity going on, scrupulous testing (which falls under management and maintenance) and careful change control must be practiced to prevent problems. Scrupulous testing requires scrupulous test scripts (documentation of the workflow steps to be completed by the system) that account for just about any situation you can foresee. You can’t catch all problems, but the goal is to catch as many as you can. And there is the issue of changes looking good in a test “environment” or “domain,” yet when the change is made in the live (production) system, something you didn’t or could not catch in testing rears its head and causes problems. Of course, there are many software processes that, if broken, can create problems, but it usually takes a critical one to bring the entire system down. Unless it’s something silly like someone accidentally turning the system off, which I have seen happen (not at a railroad, though).
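            A test script in the sense used above is just a documented sequence of workflow steps, each with an expected result. The sketch below shows the idea in miniature; the “system” here is a stand-in dictionary and the car-inventory steps are invented for illustration, but the shape — run each documented step, compare against the expected outcome, fail loudly on the first mismatch — is the same whether the script is executed by a person or a tool.

```python
# A minimal sketch of a scripted test: each step documents an action,
# its inputs, and the expected result. The system under test is a toy.

def run_script(system, steps):
    """Execute documented workflow steps; report the first failure."""
    for i, (action, args, expected) in enumerate(steps, start=1):
        actual = system[action](*args)
        if actual != expected:
            return f"step {i} FAILED: {action}{args} -> {actual}, expected {expected}"
    return "all steps passed"

# Hypothetical system under test: a toy inventory of cars on a siding.
inventory = []
system = {
    "add_car": lambda car: (inventory.append(car), len(inventory))[1],
    "count": lambda: len(inventory),
}

# The test script: documented steps with expected results.
script = [
    ("add_car", ("NS 12345",), 1),   # one car expected after the first add
    ("add_car", ("CN 67890",), 2),
    ("count", (), 2),
]

print(run_script(system, script))   # -> all steps passed
```

Running the same script first against the test environment and again after the change is promoted is the essence of the change control the column describes; the script, not memory, defines what “working” means.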

            So, while we must all be vigilant about cybersecurity, don’t assume that it is the cause of a system outage when you hear about one. The demands on software and hardware systems, and on the people who maintain and support them, are only going to increase with the advent of data analytics, cloud computing, and artificial intelligence.