Common Stability Misconceptions

Stability is a key factor for any grid or region and has been a point of contention among operators for a long time now. Software is never truly free of bugs, especially as it grows more complex, so stability can easily be overlooked. Thankfully, the current push to modernize parts of the foundation and utilize more recent technology has brought a renewed focus on stability, compatibility and performance. That said, let's take a look at the common stability misconceptions, the data behind them, and why they don't hold up.

Traffic equals instability

The idea that constant traffic causes the eventual collection of more and more garbage that is never properly cleared has been around for a long time. The core concept itself is mostly true, some accumulation of abandoned data can never be fully avoided, but clearing out and freeing up resources has massively improved with the change to a more recent version of the .NET framework (see picture). This major shift was accompanied by various fixes to the consumption and, more importantly, the re-allocation of used resources. The result is an overall reduction in resource usage that often sits in the 30% range. Such a change is definitely noticeable and even measurable: the bigger the resource usage was before these changes, the bigger the gains. Some of our customers are seeing reductions beyond 50% simply from upgrading to more recent versions. That said, there is still some work to be done, and given the nature of the framework some resource leakage, as it is commonly called, still occurs. Levels are much lower compared to a few years ago, but edge cases still exist, and we still recommend refreshing areas with heavy traffic often, especially when the clientele has a tendency to, let's put it mildly, act less gracefully in their self-accessorization (is that even a word?).
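
If you want to see the reclamation behavior for yourself, here is a minimal sketch of how one might watch the managed heap on a .NET runtime. The 512 MB threshold is made up for illustration, and this is a generic monitoring idea, not code from the software itself:

```csharp
using System;

// Minimal sketch: watching managed-heap growth on a .NET runtime.
// The warning threshold is a hypothetical value, not a recommendation.
class HeapWatcher
{
    const long WarnBytes = 512L * 1024 * 1024; // hypothetical 512 MB alert line

    static void Main()
    {
        // GC.GetTotalMemory(false) reads the current managed heap size
        // without forcing a collection, so the check itself stays cheap.
        long used = GC.GetTotalMemory(false);
        Console.WriteLine($"Managed heap: {used / (1024 * 1024)} MB");

        if (used > WarnBytes)
        {
            // A forced, compacting collection is rarely needed on modern
            // runtimes, but it is a quick way to see how much of the usage
            // is reclaimable "garbage" versus genuinely live data.
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);
            Console.WriteLine($"After full GC: {GC.GetTotalMemory(true) / (1024 * 1024)} MB");
        }
    }
}
```

The gap between the two readings is roughly the "abandoned data" the paragraph above talks about; on recent framework versions that gap closes far more aggressively on its own.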

Error means crash

Humans are quite capable of handling errors in their own "programming" or execution of tasks; most programs, however, tend to struggle with that and need a helping hand from their programmers to make sure they don't Windows98 on their users. Handling is often done by quite literally attempting to execute a task and waiting for a return. Should the return not occur, the program simply continues, throwing whatever broke right into the user's face, essentially for them to fix, should they know how. The other common options are the typical "has stopped working" you are all used to, with the subsequent sending of bug reports, which are actually read, believe it or not, or the "write to log and die" method of simply closing with the user none the wiser as to what happened. The more complex a program becomes, the more problematic it can be to employ those latter methods, so proper error handling is vital. When so-called "hard crashes" are reported these days, they can generally be attributed to misuse. The days when a simple error in an item or script caused irrecoverable shutdowns of vital functions are almost gone, and such failures can usually be traced back to something that is easily fixed or, well, completely out of user control. Either way, crashes are becoming less and less frequent with each new piece that is reviewed and brought up to current programming specifications.
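
To make the styles concrete, here is a small sketch of the reactions described above, using a made-up script-loading task as the thing that can fail (the file name is hypothetical):

```csharp
using System;
using System.IO;

// Sketch of the error-handling styles described above.
class ErrorStyles
{
    static void Main()
    {
        try
        {
            // "Attempt to execute a task and wait for a return."
            string script = File.ReadAllText("region-script.lsl"); // hypothetical file
            Console.WriteLine($"Loaded {script.Length} characters.");
        }
        catch (FileNotFoundException ex)
        {
            // Handled: the error is contained and reported, and the
            // program keeps running instead of Windows98-ing on the user.
            Console.Error.WriteLine($"Script missing, skipping: {ex.FileName}");
        }
        catch (Exception ex)
        {
            // "Write to log and die": the last resort for truly
            // unexpected states, leaving a trail for the bug report.
            File.AppendAllText("crash.log", ex + Environment.NewLine);
            Environment.Exit(1);
        }
    }
}
```

The first catch is the goal: the error in one item or script stays contained and vital functions keep running.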

Restart every…

This is an actual configuration option that can be set, though it is probably a lot easier to accomplish through external process control systems. Nonetheless refreshing, as we like to call it, does help to maintain a "refreshed" state that is not impacted by long runtimes. Then again, as you can see from the picture, runtimes that exceed days, weeks and months are no rarity anymore. Even areas with greater resource usage can share the same timeframes for uptime. We are all familiar with the method of fixing a program by simply restarting it, resetting everything to the start and clearing any potentially wrong data out, but that expects whatever is wrong to be able to reset itself and not reload the bad data. As such, making sure such bad data cannot even enter, be that at runtime or after a restart, is a key part of creating long-term stability. The common method of achieving this is defining the types of data carefully and thus not giving incorrect or corrupt data the chance to persist in those types. Maintaining good type definitions throughout a program can be tough, and sometimes you just want a catch-all for some random piece of data you don't want to strictly define, but therein lies a problem: data corruption you cannot detect until accumulation or constant overwriting causes an issue. As such, long-term tests that tax how well a program can clean up after itself are important to verify the true stability of a program and its ability to maintain a clean set of working information.
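
A small sketch of that strict-versus-catch-all difference, with made-up field names, shows where bad data gets rejected. This is a generic illustration of the idea, not the software's actual data model:

```csharp
using System;
using System.Collections.Generic;

// A strictly defined type: bad data is rejected at the door, so it can
// never be stored, reloaded after a restart, or silently overwritten.
readonly struct RegionStats
{
    public int AgentCount { get; }

    public RegionStats(int agentCount)
    {
        if (agentCount < 0)
            throw new ArgumentOutOfRangeException(nameof(agentCount));
        AgentCount = agentCount;
    }
}

class Program
{
    static void Main()
    {
        // Catch-all: anything fits, so corruption only surfaces much
        // later, when something finally tries to use the value.
        var grabBag = new Dictionary<string, object> { ["agents"] = "-3?" };
        Console.WriteLine($"grab bag happily holds: {grabBag["agents"]}");

        // Strict: the invalid value cannot even enter the type.
        try { var stats = new RegionStats(-3); }
        catch (ArgumentOutOfRangeException) { Console.WriteLine("rejected at creation"); }
    }
}
```

With the strict type, a restart genuinely resets to a clean state, because the bad value was never allowed to be saved in the first place.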

I still crash though

And that cannot be fully avoided, as mentioned above. Often the reason comes close to the famous "unfortunate turn of events": circumstances that come together to bring about a state that cannot be handled because it was not even expected in the first place. With a client-server type application, the transport of data between the two is subject to various variables of doom, and the results can be unexpected. Even so, recovery from such states is possible and often not as far away as it seems. A good example is the main difference between the two common transport protocols used, UDP and TCP; you can look up yourself what those two like about each other and what they hate. The basic concept is that unlike UDP, TCP actually checks whether or not data has reached the other side. You can imagine this return trip makes some things slower, so when a tiny piece of information going missing would make no difference, UDP is used instead, either for data that can account for small missing pieces or simply to "fire and forget" because you, the user, are not going to notice the loss. However, in the implementations lie the caveats that can break the camel's back. UDP in itself is just a sender and receiver of data; what you do with the data is up to you. You could, if you wanted to, send data and request verification for it manually, all while still using UDP; that, at least in some form, is what is used to send some of the most requested types of data we encounter. The system is more solid than it sounds, but can still sometimes miss the mark. Normally this is not much of a concern, as the data is simply resent, but during that period other tasks may be held up, and eventually you run into being disconnected due to a timeout. This issue can be compounded by the amount of data that needs to be sent for certain things, so the more you use it, the worse it can get. Thankfully, these days methods exist to cache data and to resend only partial amounts when the whole doesn't reach the other end, and so stability can be increased. It doesn't solve the elephant's riddle, but it makes sure it can't eat all the peanuts.
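
As a rough sketch of that manual-verification idea over UDP, consider the following. The port, payload and retry counts are made up, and real protocols add sequence numbers and partial resends on top; this only shows the send, wait-for-ACK, resend loop and where the timeout pressure described above comes from:

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;

// Sketch of manual acknowledgment over plain UDP: send, wait briefly
// for an "ACK", resend a few times, then give up.
class ReliableUdpSketch
{
    static void Main()
    {
        var peer = new IPEndPoint(IPAddress.Loopback, 9000); // hypothetical peer
        byte[] payload = Encoding.UTF8.GetBytes("chunk 42");

        using var client = new UdpClient();
        client.Client.ReceiveTimeout = 500; // ms to wait for the ACK

        for (int attempt = 1; attempt <= 3; attempt++)
        {
            client.Send(payload, payload.Length, peer); // fire...
            try
            {
                var from = new IPEndPoint(IPAddress.Any, 0);
                byte[] reply = client.Receive(ref from); // ...but don't forget
                if (Encoding.UTF8.GetString(reply) == "ACK")
                {
                    Console.WriteLine($"Delivered on attempt {attempt}.");
                    return;
                }
            }
            catch (SocketException)
            {
                // Timeout: no ACK arrived, so loop and resend. While we
                // wait here, other traffic may be held up; pile up enough
                // of these and you hit the disconnect-by-timeout above.
            }
        }
        Console.WriteLine("Gave up; the peer may be gone.");
    }
}
```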

Who’s to blame?

Blame has been thrown at everyone under the sun, from users to developers to operators and your uncle Joe, but in the end it's a community effort, specifically in terms of educating toward better usage of resources, adequate methods of use and restraint. Nothing's perfect, but if handled carefully and with a bit of understanding, most of it will work just fine. It really is not the fault of any one group, but of what we all do to spread information about things to avoid, things to practice and simple guidelines to follow that ensure a good experience for all, including the poor programming code underneath it all. With enough effort and care the overall experience can be massively positive, but that requires everyone to work together and realize the limitations of software, hardware and people alike. The future will be a bright one, if that is on everyone's mind.