Chasing Bugs

Luck is a foreign concept to software in general, but chance is not and has a tendency to beat the odds when or where you least expect it to. Outside of the ever-lasting and probably never ending saga of trying to find a random number generator that provides infinite entropy there is another ever-lasting chase going on, for bugfixes.

As usage of a piece of software increases the chances of bugs being found increases and so there is an incentive to get as many people testing a piece of software as possible. Even so, the most illusive of bugs may occur so infrequently that hundreds or even thousands may never see it or get the chance to find it repeatedly.

Over the past few years there have been many such instances with OpenSim, but none more interesting than the three we want to present here.

 

Stuck, but not really

Actually the most recent example, at least in direct regards to OpenSim, was an issue that manifested itself as a something quite familiar with those on poor connections; a lack of response from the region one is on. In difference to the usual connection loss, however, there was no logout and other functions continued to operate normally, just walking around was not possible. Well, that’s not actually entirely true either, as any script update would randomly snap you to the position you did actually walk to by pressing the keys, just there were no positional updates sent in the meantime to show this. This issue had many facets and was quite fascinating in how it manifested itself, in what was still functional and how to return to full functionality as well. For example, just having more people on the instance would instantly stop the issue from occurring, but as soon as you were alone and not moving you’d find yourself stuck once more.

Normally the process for getting bugs resolved is relatively simple. First, finding a way to reproduce the bug as reliably as possible so developers can attempt it for themselves and actively debug the code that is run. The second step is finding a solution to the issue that both removes the bug and does not inadvertently create another down the line. The problem with this issue was the fact it only showed up randomly and at a chance of less than 1% of all instances would ever see it occur. Combined with the fact, the only visible actions to the user was existing within the instance for a certain amount of time meant a reproduction setup for this issue was just a matter of chance.

Thankfully through the nature of OpenSim being open source there are other ways to figure out the broken code path. Granted not the prettiest debugging method, but adding console output to tasks running in the background and monitoring them for proper execution is a valid strategy. OpenSim runs a number of things while just visibly idling about and it so happened that one of these tasks ran just a tiny bit too quickly and cleared out the information about a present user before a new information could be added. This minor timing difference is normally not a big deal as there are ways to prevent them from accessing data you are in the process of modifying. This is done by creating a lock on the data and releasing it only after you are done with your modification. With multi-threaded apps this approach prevents other parts of the code from modifying the data in the meantime. This was all that was missing in this case, a few lines of code to stop two lines of code from intersecting, yet it took quite some time to find them.

For anyone curious to read the change, the commit has on the OpenSim git for this change is: c71181ff515b2d2e464dcf769f391cd5eccc4d0f together with the immediately following commit resolved the issue. A minor change that has massive implications in regards to what code is being run and the difference between normal operation and being stuck. Minor changes with a big and visible impact in the instances this bug reared its ugly head.

 

I am not seeing things

A certain level of self-doubt in regards to ones surroundings is useful in many ways, but trusting your senses is one of the first things we learn as human beings. The next issue is one that went from being dismissed as just random packet loss or glitches in code timing to a client-side issue going through the loops all the way to the fundamental client-side code.

It all started years back when this issue first showed up, likely not long after the problematic code was written into the client. Editing an object and changing just its scale without doing any other modification and then coming around back to it 20 minutes later finding all changes reverted. Now you’d easily blame some lost packets, not sending the new scale, or perhaps just a minor issue not properly decoding the information about the scale changes. No matter, changing the scale again and not giving it a second look, if at first you don’t succeed, keep trying, as they say.

Yet it kept happening and with ever greater frequency and magnitude, which eventually resulted in visible reversion right in the viewport. There is an issue and not just some random glitch, an actual problem somewhere in the code, but where? The hunt for the cause first starts where all of them start, going to the codepaths that handle this particular scale change and following it to see if there is anything not behaving as it should. A few attempts at that were made, but nothing was found. After all, most other changes to scale and other attributes of objects work fine and according to the data on the instance everything is processed and stored properly. Therein lies the clue however.

In difference to normal procedure of finding and eliminating bugs in this case the approach of verifying everything that was working, leaving what is broken to be the thing that could not be verified was taken. As some may know as the “Sherlock Holmes approach”. In this case it was determined that everything on the server side of things was working fine and everything was as it should be, but the client side was diverging visually.

One of the things most users learn over time is to regularly clear out the cache in the client to prevent it from getting too bloated or even damaged, so naturally that’s one of the first debugging steps one does when you want to test an issue on the client. Without data in the cache the client is forced to request all data from the server to display visually in the client. When one now, again, edits an object and sees the issue pop up again it becomes fairly clear where the issue must be. Sure enough disabling the relevant cache handling objects and their attributes caused the issue to disappear. We got a winner, the object cache is reverting to the cached object attributes and not refreshing it with the new attributes.

With the problem identified it was now just a matter of finding the actual code and getting it fixed. This turned out to be a much bigger thing than expected though. As the code handling the object attribute updates and cache management was located, the problems with it were immediately apparent, but not why seemingly they had not been visible to most users the various clients. It turns out OpenSim has a slight difference in what it sends for this protocol and so it was found the issue had been masked by a minor piece of data that is normally sent, but not in OpenSim. Subsequently the issue was escalated all the way to Linden Labs given the broken code was out of their feather. It turned out this minor protocol difference had been masking this issue for years and only when OpenSim did not bother sending what could be perceived as unnecessary packets was this issue apparent enough to be found. Mind you it was reproducible no matter the protocol used and so was confirmed and fixed in all clients.

The entire report chain once found to be client side can be seen here: https://jira.firestormviewer.org/browse/FIRE-31325 with links to the tracker of Linden Labs as well(requires login).

 

CTRL + C, CTRL + V

Or as some like to call it, modern journalism. Jokes aside, one of the biggest annoyances when writing complex pieces of code is repeating parts over and over again. Usually this can be avoided with functions and loops, but sometimes there is no way around it, especially if the code does the same thing, just uses slightly different variables.

This particular issue was not actually an issue that was visible or at least not immediately to most users without some serious digging and calculations. As part of our commitment to OpenSim we try to help out with development tasks most developers dread, from updating documentation to fixing typos, but most importantly, code review. In this capacity we set out to determine the causes for some of the compiler warnings, which used to be present when compiling OpenSim. Warnings are not errors, as such their presence alone does not instantly result in problems, but they can hint at runtime issues or other problems that might be encountered when actually using the software. It makes sense to pay attention to them and resolve them just as much as compiler errors and bugs.

In the process of resolving a compiler warning regarding an unused variable something else was found to be in error. As the code surrounding the unused variable was re-used elsewhere for a second time, both iterations were altered and the resulting patch file showed both instances and their apparent similarities, but crucially as well, a lack of difference in one part. The second iteration, using a different variable for its result was actually assigning it to the first variable. This only became apparent after reviewing the patch file and noticing the lack of parity between the changes of the iterations.

It’s not quite clear if this ever resulted in a visible bug, but it nonetheless constitutes a problem. It’s common for developers to struggle with this just as much as we all make typos in writing or misread numbers when doing mathematics, so this slipping by for quite some time is not out of the ordinary either. It helps to have a second pair of eyes looking over things, more so if they are either very critical of everything they read or just happen to have good pattern recognition.

For those interested the change in question is commit: 52947b628019736e6f4ad20d6e75ec5ba3c37e28 as was reported on mantis: http://opensimulator.org/mantis/view.php?id=7612

 

When these things happen they undoubtedly invoke a certain feeling of adventure and quite the dopamine rush when all said and fixed. For those with a passion for software development a lot of the fun is derived from finally seeing something work as intended or just getting a particularly annoying bug out of the way. We hope these technical insights into what sometimes goes on behind the scenes helps with understanding some of the complexity we deal with in regards to OpenSim. It may also foster some appreciation for the work the developers and maintainers of OpenSim provide for free; in their never-ending quest to squash bugs and refactor the codebase. It certainly isn’t something that is always as fun as these example may make it seem to be. Maybe it even provides a guideline for those looking to help with reporting bugs as to what the process is for resolving them and what is needed for that to begin with.