Chasing Bugs

Luck is a foreign concept to software in general, but chance is not, and it has a tendency to beat the odds when or where you least expect it to. Outside of the everlasting and probably never-ending saga of trying to find a random number generator that provides infinite entropy, there is another everlasting chase going on: the chase for bugfixes.

As usage of a piece of software increases, the chance of bugs being found increases as well, so there is an incentive to get as many people testing it as possible. Even so, the most elusive bugs may occur so infrequently that hundreds or even thousands of users never see them or get the chance to reproduce them.

Over the past few years there have been many such instances with OpenSim, but none more interesting than the three we want to present here.


Stuck, but not really

The most recent example, at least as far as OpenSim itself is concerned, was an issue that manifested itself as something quite familiar to those on poor connections: a lack of response from the region one is on. In contrast to the usual connection loss, however, there was no logout and other functions continued to operate normally; only walking around was not possible. Well, that is not entirely true either, as any script update would randomly snap you to the position you had actually walked to by pressing the keys, there just were no positional updates sent in the meantime to show it. This issue had many facets and was quite fascinating in how it manifested itself, in what remained functional and in how full functionality could be restored. For example, just having more people on the instance would instantly stop the issue from occurring, but as soon as you were alone and not moving you would find yourself stuck once more.

Normally the process for getting bugs resolved is relatively simple. The first step is finding a way to reproduce the bug as reliably as possible, so developers can attempt it for themselves and actively debug the code that is run. The second step is finding a solution that both removes the bug and does not inadvertently create another down the line. The problem with this issue was that it only showed up randomly, and less than 1% of all instances would ever see it occur. Combined with the fact that the only visible action required of the user was simply existing within the instance for a certain amount of time, this meant a reproduction setup for the issue was purely a matter of chance.

Thankfully, OpenSim being open source, there are other ways to figure out the broken code path. Granted, it is not the prettiest debugging method, but adding console output to tasks running in the background and monitoring them for proper execution is a valid strategy. OpenSim runs a number of things while visibly just idling about, and it so happened that one of these tasks ran just a tiny bit too quickly and cleared out the information about a present user before new information could be added. Such a minor timing difference is normally not a big deal, as there are ways to prevent other threads from accessing data you are in the process of modifying: you create a lock on the data and release it only after you are done with your modification. In multi-threaded applications this approach prevents other parts of the code from modifying the data in the meantime. That was all that was missing in this case, a few lines of code to stop two code paths from intersecting, yet it took quite some time to find them.
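To illustrate the kind of fix involved, here is a minimal sketch in C#, the language OpenSim is written in; the class and member names are made up for illustration and this is not the actual OpenSim code.

    using System;
    using System.Collections.Generic;

    // Minimal sketch of the race described above, not the actual OpenSim code:
    // one path refreshes the entry for an agent while a background sweep clears
    // stale entries. Without the shared lock the sweep could run between a
    // removal and a re-insert and wipe out a live agent's data.
    class PresenceTracker
    {
        private readonly object _sync = new object();
        private readonly Dictionary<Guid, DateTime> _lastSeen = new Dictionary<Guid, DateTime>();

        // Called whenever fresh information about an agent arrives.
        public void Refresh(Guid agentId)
        {
            lock (_sync)            // both code paths take the same lock...
            {
                _lastSeen[agentId] = DateTime.UtcNow;
            }
        }

        // Background task clearing out entries that have not been refreshed in time.
        public void ExpireOlderThan(TimeSpan maxAge)
        {
            lock (_sync)            // ...so the two can no longer interleave
            {
                DateTime cutoff = DateTime.UtcNow - maxAge;
                var stale = new List<Guid>();
                foreach (var entry in _lastSeen)
                    if (entry.Value < cutoff)
                        stale.Add(entry.Key);
                foreach (var id in stale)
                    _lastSeen.Remove(id);
            }
        }
    }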

For anyone curious to read the change, the commit hash on the OpenSim git is c71181ff515b2d2e464dcf769f391cd5eccc4d0f; together with the immediately following commit it resolved the issue. A minor change with massive implications for which code actually gets run, and for the difference between normal operation and being stuck. Minor changes with a big and visible impact on the instances where this bug reared its ugly head.


I am not seeing things

A certain level of self-doubt regarding one's surroundings is useful in many ways, but trusting your senses is one of the first things we learn as human beings. The next issue is one that went from being dismissed as random packet loss or glitches in code timing to being traced, through many hoops, all the way down to fundamental client-side code.

It all started years back when this issue first showed up, likely not long after the problematic code was written into the client. You would edit an object, change just its scale without doing any other modification, then come back to it 20 minutes later and find all changes reverted. Now you could easily blame some lost packets failing to carry the new scale, or perhaps just a minor issue not properly decoding the information about the scale change. No matter, you change the scale again and do not give it a second look; if at first you don't succeed, keep trying, as they say.

Yet it kept happening, with ever greater frequency and magnitude, which eventually resulted in visible reversion right in the viewport. So there is an issue, not just some random glitch, an actual problem somewhere in the code, but where? The hunt for the cause starts where all such hunts start: going to the code paths that handle this particular scale change and following them to see if anything is not behaving as it should. A few attempts at that were made, but nothing was found. After all, most other changes to scale and other attributes of objects worked fine, and according to the data on the instance everything was processed and stored properly. Therein lies the clue, however.

In contrast to the normal procedure of finding and eliminating bugs, the approach taken in this case was to verify everything that was working, leaving whatever could not be verified as the broken part; some may know this as the "Sherlock Holmes approach". It was determined that everything on the server side was working fine and as it should be, while the client side was diverging visually.

One of the things most users learn over time is to regularly clear out the client cache to prevent it from getting too bloated or even damaged, so naturally that is one of the first debugging steps when testing an issue on the client. Without data in the cache the client is forced to request everything from the server in order to display it. When you then edit an object and see the issue pop up again, it becomes fairly clear where the problem must be. Sure enough, disabling the relevant cache handling objects and their attributes made the issue disappear. We have a winner: the object cache was reverting to the cached object attributes instead of refreshing them with the new attributes.

With the problem identified, it was now just a matter of finding the actual code and getting it fixed. This turned out to be a much bigger undertaking than expected, though. Once the code handling the object attribute updates and cache management was located, the problems with it were immediately apparent, but not why they had seemingly never been visible to most users of the various clients. It turns out OpenSim differs slightly in what it sends for this protocol, and so it was found that the issue had been masked by a minor piece of data that is normally sent, but not by OpenSim. The issue was subsequently escalated all the way to Linden Lab, given that the broken code came from their pen. This minor protocol difference had been masking the issue for years, and only when OpenSim did not bother sending what could be perceived as unnecessary packets did the issue become apparent enough to be found. Mind you, it was reproducible no matter the protocol variant used, and so it was confirmed and fixed in all clients.

The entire report chain, once the issue was found to be client-side, can be seen here: https://jira.firestormviewer.org/browse/FIRE-31325, with links to the Linden Lab tracker as well (requires login).


CTRL + C, CTRL + V

Or, as some like to call it, modern journalism. Jokes aside, one of the biggest annoyances when writing complex pieces of code is repeating parts over and over again. Usually this can be avoided with functions and loops, but sometimes there is no way around it, especially when the code does the same thing, just with slightly different variables.

This particular issue was not actually visible to most users, or at least not immediately and not without some serious digging and calculations. As part of our commitment to OpenSim we try to help out with the development tasks most developers dread, from updating documentation to fixing typos, but most importantly, code review. In this capacity we set out to determine the causes of some of the compiler warnings that used to be present when compiling OpenSim. Warnings are not errors, so their presence alone does not instantly result in problems, but they can hint at runtime issues or other problems that might be encountered when actually using the software. It makes sense to pay attention to them and resolve them just as much as compiler errors and bugs.

In the process of resolving a compiler warning about an unused variable, something else was found to be in error. Because the code surrounding the unused variable was re-used elsewhere a second time, both iterations were altered, and the resulting patch file showed both instances with their apparent similarities, but crucially also a lack of difference in one part. The second iteration, which was supposed to use a different variable for its result, was actually assigning it to the first variable. This only became apparent after reviewing the patch file and noticing the lack of parity between the changes to the two iterations.
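To make the pattern concrete, here is a small made-up C# sketch of this class of mistake; it is not the actual OpenSim code, just an illustration of a copied block whose assignment target was never updated.

    // Illustrative only, not the actual OpenSim code: the second loop was
    // copy-pasted from the first, but the variable receiving the result was
    // not changed, so heightSum silently stays at zero.
    static void SumExtents(double[] widths, double[] heights,
                           out double widthSum, out double heightSum)
    {
        widthSum = 0;
        heightSum = 0;

        foreach (double w in widths)
            widthSum += w;

        foreach (double h in heights)
            widthSum += h;      // bug: should accumulate into heightSum
    }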

It’s not quite clear if this ever resulted in a visible bug, but it nonetheless constitutes a problem. It’s common for developers to struggle with this just as much as we all make typos in writing or misread numbers when doing mathematics, so this slipping by for quite some time is not out of the ordinary either. It helps to have a second pair of eyes looking over things, more so if they are either very critical of everything they read or just happen to have good pattern recognition.

For those interested, the change in question is commit 52947b628019736e6f4ad20d6e75ec5ba3c37e28, as reported on Mantis: http://opensimulator.org/mantis/view.php?id=7612


When these things happen they undoubtedly evoke a certain feeling of adventure and quite the dopamine rush when all is said and fixed. For those with a passion for software development a lot of the fun is derived from finally seeing something work as intended or just getting a particularly annoying bug out of the way. We hope these technical insights into what sometimes goes on behind the scenes help with understanding some of the complexity we deal with in regards to OpenSim. It may also foster some appreciation for the work the developers and maintainers of OpenSim provide for free in their never-ending quest to squash bugs and refactor the codebase. It certainly isn't always as fun as these examples may make it seem to be. Maybe it even provides a guideline for those looking to help with reporting bugs as to what the process for resolving them is and what is needed for that to begin with.

What is the “metaverse”?

Before we get into the meat of that, let's first welcome all the wannabe journalists and scam artists looking to capitalize on this whole craze Facebook has set loose on the world, and perhaps remind them that just because some big shouty lizard man says something, that most certainly doesn't make it a credible source. Do your research and basic journalism before you become the mouthpiece for yet another Theranos.


Now to the fun part. In the realms of our virtual lives the definition of metaverse has historically always been an umbrella for all technologies based around creating a virtual existence in a set world we may or may not be able to shape. This holds true for SecondLife as well as the grids present on the OpenSimulator platform and its derivatives. Strictly speaking, then, the word not only stands for the interconnection between grids, which was once even a thing between SecondLife and OpenSimulator if you can believe that, but also encompasses technologies not based on the same core design. IMVU springs to mind here, although its limitations and aims are somewhat different. Since then many other virtual living simulators have sprung up, some large, some small. We are inclined to still count them towards our collective metaverse as the pillars of what that stands for are often present:


Creation/Creativity

Expression in the form of creating anything, often starting with a blank canvas and an idea, is one of the core parts of these worlds. With other forms of media and entertainment being strictly defined and premade, the freedom to create with almost no boundaries has always set the metaverse apart. It was the main selling point of SecondLife and continues to be a measure for virtual worlds out there, as its specifics often define the ease of entry and extensibility of whatever software is used.

Personal Expression

Whether it is through changing one's own representation within the metaverse or using the tools and creations to form miniature societies, when it comes to just being yourself any platform worth its weight aims to provide as many capabilities for this as possible. Projects, groups, clans or even cults can form from personal expression, and what constraints are put on that, or which tools are provided for it, can be a direct measure of how much a virtual world is ultimately worth.

Community

Extending beyond just family and friends, the people we meet within these virtual worlds form the basis of how welcome we feel within them. Some place a great emphasis on the formation of personal relationships, as was the idea of IMVU initially; others create what can only be described as castes based on the contents of one's wallet. Regardless of that, most tend to hire community managers to handle whatever miniature society forms itself, either to contain or to encourage the individuals therein.


Of course, much like the word "literally" being forcefully changed through blatant misuse, it remains to be seen what will become of the word metaverse, given that the technologies, companies and virtual worlds currently associated with it are themselves in constant flux. Perhaps Facebook, in a shocking twist, will actually define what their vision is at some point, becoming just another part of the metaverse or even shaping a change; in other news: pigs can fly. However, from our perspective the definition is fairly clear and already well established.

On the back of that we now get to the real fun part, probably the main drive behind this article and a source of personal and personnel fun times. Much like this article, the mere inclusion of this word seems to result in a lot of noise from all manner of things, from business proposals to sales pitches and scams; the color in our email inbox is currently full rainbow, happily tagging away at what we receive. As a form of entertainment I take some time out of my day to reply to them, knowing full well the nature of such emails. Since the metaverse is about sharing, let's embrace that spirit and share some of the fun we have had with these emails.


#########################################################################################

This one looks innocent enough until you look at their website and wonder what exactly screams “modern” and “with it” about that. A desperate attempt of the 90s to stay relevant more like.

Not entirely true, but before it ends up costing phone charges. International still ain’t free round here.

Note: Taking care to spell the name of the company you contact correctly might be a good idea, proof reading 🙂

More than happily indulged that one, but after that pretty much radio silence, I don’t expect to hear back.

#########################################################################################

Note: Not pictured the small text footer claiming confidentiality etc. As if it worked that way.

A quick search about them later and the results are not exactly positive.

Spicy doesn’t begin to describe that, but nowadays a reality check once in a while might help. Santa isn’t real either, sorry. Predictably there has been no follow up.

#########################################################################################


We will spare you the obvious spam and scam attempts as they are never much fun and usually don’t even come from a mailserver one can reply to. These are just the highlights so far, we will update when more arrives that is worth sharing.


(Disclaimer: For those pictured here, remember this: The embarrassment may fade over time, but the bullshit remains, so think about what image you want to present to others. Don’t send emails you would not want to receive yourself.)

Drowning in Garbage

Continuing the series of technical deep-dives into the inner workings of OpenSimulator, this time the focus is on the things it doesn't do. Much like the stereotypical pubescent teenager who has trouble breathing over the mountain of gym socks and the hill of dirty shirts sorted loosely by the distance they can be smelled from, OpenSimulator has a tendency not to clean up after itself. Unlike with your local municipality, it isn't a matter of dispatching some burly men and a big skip to collect all that and dump it onto a big pile or burn it. The solutions are more akin to surgically removing kidney stones or trying to teach a Dodo to fly, but let's explore them anyway.

The main culprit of compulsively collecting whatever is thrown its way is, in the case of OpenSimulator, the database holding data for the various aspects that require central coordination. Each individual module deals with one aspect and manages its data in often widely different arrangements. This is to be expected given the history of the entire project and the many people who had often vastly diverging ideas on how best to handle things and what direction they expected the project to head in. This article focuses only on the big offenders and the known points requiring either manual cleaning or at least a watchful eye to prevent issues.

Information

Decentralization brings with it the need to request information directly from the endpoints it originates from. In other words, whenever there is data not directly stored locally on a specific grid, it has to be fetched from wherever it was originally created. This goes for various things, such as information about who created an item, the associated profile data of a user, or anything not yet transferred locally. As the Hypergrid is decentralized in nature, meaning there is no single authority that can provide information, each piece has to be requested from its origin. This is done via an HTTP queue filled with requests for certain data, whose returns are inserted and locally cached where applicable. As things like profiles can change, there is no localized place to save this data for any longer than most instances are running, so each time the data is lost or deemed too old it has to be fetched again. A reasonable system to keep things up to date and make sure you are always getting the correct data, but it falls flat on its face when things don't work.

Each piece of data that is requested from external sources creates an entry in the aforementioned queue and is checked, and waited for, every time. When the remote endpoint does not answer it takes time for the request to be deemed a failure before the next one is executed. When there is a lot of "dead" data this can fill up the queue and stop any more items from being executed, completely halting further requests. While these requests are open they are constantly attempting to fetch data, which reduces network performance and can lead to additional load on an instance. In extreme cases the amount of data requested can clog up the pipes to a point where other critical operations requiring the network are not executed, which has large detrimental effects on a given instance.

Unfortunately, as you may expect, the solution to this is either removing the item causing the requests to be scheduled or altering it so that outgoing requests are no longer required. Usually this means adjusting things directly in the database to remove references to external endpoints entirely, which is problematic for many reasons. Most commonly the creator information of items from remote sources is the culprit, and removing it is questionable.

This is compounded by things such as changing addresses or locations of endpoints and slow or intermittent connections, and it is not just items either. The problem of information requests extends to friends, groups and anything else that requests data over the Hypergrid from remote endpoints.

Groups

The group system, or GroupsV2, has been a point of issues since its introduction. The data it keeps is spread across multiple tables in the database, which is both a blessing and a curse. When it comes to databases and tables you generally want to avoid sending too many queries and reference things in code rather than through combining tables. Equally, smaller and simpler queries will run faster and can provide overall faster fetching of the required data. Which approach is favorable also depends on the size of the data itself, the table structure and even the performance of the machinery that runs the database. No wonder, then, that database design and structuring is something some people focus their entire careers on. In the case of OpenSimulator the current system makes frequent and heavy calls to the database to refresh and provide data, which on the surface seems like a negative aspect. As more data comes in, however, these calls become less of a strain compared to the amount of data they have to sift through before getting to the relevant parts. This means that while the system can be slow, it does not suffer that much from slowing down further as more and more groups are created and people join them. It still is not an ideal approach, and past a certain level it starts slowing down regardless. This is compounded by the dynamic nature of group data, which requires frequent refreshes when making changes.

Assets

Arguably the biggest factor in collecting garbage like a hoarder with both a kleptomaniac disorder and an aversion to garbage cans. The standard asset provider in OpenSimulator, utilizing the database exclusively and saving all data directly to a table, is a ticking timebomb. As asset sizes increase, with more and more mesh objects being imported and more and more complexity added to them, you can quickly run into issues of assets not saving at all. This is because most databases impose limits on the maximum size of any given query or packet being sent to them. Once that limit is exceeded the information is simply rejected and the data is not saved at all. As more users join a grid running this system, each of them requesting their data can easily exceed the maximum number of connections the database can keep open and serve. This ultimately, as databases have to save their data somewhere as well, starts eating into the performance of the hardware itself. Running OpenSimulator on older drives still using spinning disks quickly hits a brick wall here. As access times for the files the database uses increase, so do the queries to the database itself, and once timeouts start to be reached the entire system can quickly become unusable.
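As an illustration of the kind of limit involved, the following sketch assumes the common case of OpenSimulator running on MySQL or MariaDB, where the relevant ceiling is max_allowed_packet; the values shown are examples only and need to be sized to your own assets.

    -- Check the current packet ceiling that a large mesh asset INSERT can hit:
    SHOW VARIABLES LIKE 'max_allowed_packet';

    -- Raise it for the running server (applies to new connections):
    SET GLOBAL max_allowed_packet = 268435456;   -- 256 MiB, example value

    -- Or persistently in my.cnf under the [mysqld] section:
    --   max_allowed_packet = 256M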

Thankfully there is a way around that. OpenSimulator implements an alternative asset system, FSAssets, which utilizes the database in a manner more in line with what databases were designed for in the first place: less a mass storage device and more a reference engine for what's where. As such this system only keeps the information of which specific asset ID can be found at which location. This also allows for de-duplication, the process of checking whether the actual data already exists somewhere and, instead of saving yet another copy, simply pointing the new ID at the already existing data. On the surface this sounds like an ideal solution, but even it has shortcomings. Much like the database itself, the performance of this system depends on how quickly the hardware can provide the information back to the requesting agent, so storage systems still need to be fast and benefit from setups that reduce access times. Combining multiple physical storage devices that contain either the same data or parts of the data, each contributing their access speed and thus increasing overall bandwidth, is paramount to maintaining performance as data grows.

And grow it will, very quickly in some cases, which brings other issues down the line. While databases are designed to retain fast access to data even when reaching millions of entries, the same cannot be said for files on disks. When performing backups, or just moving data from one place to another as you upgrade your storage solution, moving millions of small files in various nested folders can be quite a task in itself. Keeping track of changes to these is an even bigger task, as most versioning systems will just throw in the towel when confronted with the scale of data OpenSimulator can produce in a short amount of time. Designing backups and control of that data is thus just as important as selecting the right hardware, and it should definitely play into hardware selection as well.

Since we are talking about garbage, any asset system has the issue of being a rather simple fella to talk to. Being the hoarder he is, he will keep all data you give him, regardless of whether anyone still even needs it. This means over time it will contain data that is no longer present anywhere, be that on a region, in a user inventory or really anywhere else. Orphaned data is plentiful and almost impossible to find, given you need to cross-reference so many other places. One of the biggest culprits in this regard is notecards, specifically those created by scripts. As a new notecard is created whenever a change is made, yes that's how that works, the old "version" is never discarded; it remains. This means that over time these orphaned notecards, no longer belonging to anyone or any object, tend to stack up in the database.

The only option to cull these is to check on all regions and in all inventories whether a reference to them still exists. This can take quite some time and is only possible if you actually have access to all these databases, in order to make sure you are not removing a notecard that is still in use. Of course, with the connectors between the regions and the grid system it is within the realm of possibility to automate this process and do regular culling of dead notecards or other assets that exhibit similar behavior.
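A hypothetical sketch of such a cross-reference, assuming the stock MySQL schema and that the grid and region tables are reachable from one place; the table and column names (assets, inventoryitems, primitems) follow the default layout but should be verified against your own installation before anything is deleted.

    -- Notecards carry asset type 7; list notecard assets that no inventory item
    -- and no prim inventory entry still references, i.e. candidates for culling.
    SELECT a.id, a.name
    FROM assets a
    LEFT JOIN inventoryitems ii ON ii.assetID = a.id
    LEFT JOIN primitems pi ON pi.assetID = a.id
    WHERE a.assetType = 7
      AND ii.assetID IS NULL
      AND pi.assetID IS NULL;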

The main issue is that crawling over what can quickly grow to millions of assets is both time and resource intensive, and so would be a significant change to the design of these connectors and the grid system itself. As often noted, said system is a bit simple, borderline dumb even, as it mostly concerns itself with synchronizing data among the simulators and users. Adding such routines and regular tasks would significantly increase its resource usage.

Friends

We are all social beings and have a natural desire to flock together. This is generally not an issue within the scope of a single grid itself, but in the scope of the entire metaverse it can backfire somewhat. As social media has shown, letting people know where you are and whether you are available is something we tend to want to broadcast to those interested. OpenSimulator does the same, attempting to let all friends know if and where you are. These calls go out over the Hypergrid system to inform everyone on the friends list. Problems arise when, in the changing nature of the metaverse, some of these remote locations are no longer available. Each request is scheduled internally and then run. If the call succeeds all is well, but each call that fails to reach its target will hold until a timeout is reached. The overall queue for these calls is restricted to prevent overloading the network, but the calls still need to happen. As the queue fills up with requests to be done, the simulator has to "remember" them until they can be executed. That unfortunately takes quite a bit of resources as the queue waits and keeps filling with requests. It also will absolutely finish its queue even if the user causing the calls has already left. What is once in the queue will remain.

Fortunately any failures to reach the other side are logged on the specific simulator they are run on. With access to the console or logs it is a simple task to then search the Friends table for the entries likely to fail and remove them. One should make sure to remove them on both sides, though, as friendships are kept as a pairing, so each one has an entry from the perspective of either participant. Unfortunately there still exist some levels of caching and mechanisms to re-create friendships. It therefore makes sense to also remove the corresponding Calling Cards from the inventories, in order to prevent leaving dead references behind that could produce a renewed entry in the database.
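A hypothetical sketch against the stock MySQL schema, where the Friends table keeps PrincipalID and Friend columns; the dead grid address is a placeholder, and as noted both directions of a pairing need to go.

    -- Find friendship rows pointing at a grid that no longer answers:
    SELECT PrincipalID, Friend
    FROM Friends
    WHERE Friend LIKE '%dead-grid.example.org%'
       OR PrincipalID LIKE '%dead-grid.example.org%';

    -- Once verified, remove both sides of those pairings:
    DELETE FROM Friends
    WHERE Friend LIKE '%dead-grid.example.org%'
       OR PrincipalID LIKE '%dead-grid.example.org%';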

Maptiles

In order to produce the world map for the viewer to show regions, a tile is created for each region, or even multiple tiles for larger regions. These tiles are sent to the grid system and are then used to generate the different scales of map zoom. The tiles are kept indefinitely and so are the generated zoom levels. Nothing here is automatically culled, not even when the region in question is properly shut down; the maptile will remain until a new one is uploaded to that specific spot.

This leaves two methods. One is uploading an empty maptile to the specific spot you want to clear, causing the grid to redraw the zoom levels; this could be added to the shutdown routine of the simulator. The other option, which also has the added benefit of displaying renewed tiles for the regions, is removing all the maptiles and asking all connected simulators to create new ones to be sent to the grid system. The latter method is quite a strain on the system, having to send and receive all that data, but it also makes sure the map displays regions as they actually look.

What remains to be done beyond this, though, is clearing, you guessed it, the database. For legacy viewers and other applications each maptile is additionally stored as a texture asset in the database. These terrainimages, as they are called, consume both entries in the database and, if a file-based asset system is used, files within the asset system. Removing them, assuming no one is uploading similarly named textures, can be a simple matter of going through the database, fetching the correct entries and removing the corresponding assets.

Now, ideally the grid system would obviously do this routinely, or even issue a cleanup when changes to the region list are made, but that ends up in the same bucket of adding a substantial system to it, which would consume resources, issue calls over the network and create a lot more calls to the maptile system. Ultimately though, if left unchecked, and since the new maptiles will always be slightly different, the amount of data in both database and assets can grow quite substantially.

Inventory

A lesser culprit of garbage, but nonetheless something to watch for. By default the deletion of items does not necessarily cause a deletion of the underlying inventory data. Just stuffing things into the trash also, obviously, does not delete an item either, as it can still be restored; the trash is really more of a recycle bin in operation. Regularly emptying the folder, and making sure to enable the ability to actually remove the references from the database and the assets, makes sense. Automated systems for removing items left in the trash for a certain time do not exist and might not be desired either, but please don't use your trash folder as a backup, that's what IARs are for.

Removing trash items based on their age in that folder and removing the corresponding assets through a routine makes the most sense here. Enabling the deletion option for items in the database and assets does bring with it the security risk of such deletion requests being issued externally, if the correct request can be made to the system. Normal security practices fly in the face of the open nature of OpenSimulator and especially the Hypergrid system, so whatever can be done to secure the system against attack is worth considering, even if that means some garbage is collected. In the case of inventory, access to the database from the grid level is a given, so creating cleaning routines that run in a secure manner is well within the realm of possibility.
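A hypothetical sketch of the core query such a routine could start from, assuming the stock MySQL schema where Trash folders carry folder type 14 and inventoryitems store a Unix creationDate; the 30-day threshold is only an example, and a real routine would want to track when an item actually entered the trash rather than when it was created.

    -- Items sitting in a Trash folder that were created more than 30 days ago;
    -- these and their now-unreferenced assets are candidates for removal.
    SELECT ii.inventoryID, ii.inventoryName, ii.assetID
    FROM inventoryitems ii
    JOIN inventoryfolders f ON ii.parentFolderID = f.folderID
    WHERE f.type = 14
      AND ii.creationDate < UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY);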


That's it? Well, not quite. There are quite a few other places garbage collects that can eventually add up to causing issues, but taking care of those is quite a bit more difficult and only really becomes a problem at scales rarely seen in the metaverse. There is still room for improvement in OpenSimulator in regards to the mess it makes of its own data structures. We have begun working out some of these issues already and will continue to do so. In the meantime we hope the OpenSimulator project itself can decide on a course for dealing with the issues mentioned here, but it is obviously not going to be easy deciding what to do or not to do, given the implications of building entire cleaning routines into a system that in some regards is nearing its design limits already. Thankfully, in most cases a skilled grid operator can still manually create their own routines to counteract the problems arising from the increasing garbage collecting in their database and file structures.

Simplifying Payment Gateway Fees

Not that long ago we changed our fee structure to better reflect the actual fees we incur from each payment gateway. The aim was to do away with the flat fee applied regardless of the gateway, which disproportionately imposed quite steep fees for specific gateways that actually had a reasonably small fee structure to begin with. The subsequent shift of the most used payment gateways toward those with lower fees attached showed that this move was also in the interest of customers.

Unfortunately the setup that allowed this has had some issues over time and in some cases created a bit of a mess within the billing department. With our interest in maintaining security and keeping our systems up to date we are looking at a major upgrade for our billing system. This in turn breaks the existing ability to specify the fees for each gateway based on the fees we incur. We have been in contact with both the provider of the billing system and a third-party developer looking to restore the functionality to the system. With both sides blaming each other for the system no longer working, we are faced with the difficult decision to roll back to a flat fee for all gateways.

We are aware this may cause some inconvenience and have thus elected to lower the overall fee below the average gateway fees we incur, shouldering the majority of the change ourselves. Some time in the following weeks, invoices will show a flat 4% gateway fee regardless of the selected payment gateway. We apologize for the inconvenience this may cause and will continue to apply pressure to get the original functionality restored.

Hosting Classic Addons Archive

We are big proponents of preserving software and content that would otherwise be forgotten. Especially when it comes to software that continues to have a use not yet eclipsed by anything else, we feel it is of great importance to preserve it as a testament to what is or once was possible. After all, our aim should be to grow, not to stagnate or even regress to the lesser. Whenever projects aiming to achieve just that run into trouble, especially regarding things we provide to our customers, we feel it is our duty to step in and provide assistance. It not only serves us, who potentially still use the software in question, but also anyone else who might still do so, including our customers.

As such we are happy to announce that we have provided a mirror for the Classic Addons Archive, a project that aims to preserve the grand library of Firefox addons that once existed before Mozilla, suffering a case of what can only be described as a laziness-induced rush of autophobia, removed support for a vast catalogue of addons on the basis of no longer wanting to maintain the interface for them. These addons included such marvelous wonders as TabMix+, DNS Flusher and Session Manager, among many more that found widespread use among power-users and web developers alike. Preserving their legacy and providing a way to access these addons, allowing their use beyond their removal from the official addons page for Firefox, is the aim of the Classic Addons Archive project.

Unfortunately, due to the ongoing economic situation, many of the free offerings that gave the project options to distribute these addons ceased, and retaining them would have required significant investment. The writing was on the wall and so the archive has been unavailable lately. We reached out to the project lead JustOff to figure out how we could help the project, which we use internally as well. After some email traffic we have now set up a mirror for the project to use, which will restore functionality to their addon and once again allow the download and installation of all legacy addons the project managed to archive.

“But if Firefox removed support who would even need them?”

Fortunately there are other like-minded people out there willing to put in the work to not only preserve the addon capabilities, but also actively incorporate security and performance upgrades into a browser. The Waterfox project is just that: a Firefox fork with support for legacy addons, active development and the backing to continue supporting one of the most capable browsers ever built. Those feeling something missing in their life after Mozilla took an axe to Firefox can use Waterfox Classic and make use of the Classic Addons Archive to access the massive library of addons and extend their browser to new heights.

We are happy that we can provide assistance to the Classic Addons Archive, which brings the vital addon database to Waterfox, and thus preserve the legacy of so many addons and the hard work of their creators.

Regarding SEPA Direct Debit

Next to our existing payment gateways we always look for additional payment methods, with the intention of simplifying payments as much as possible. With that in mind we recently enabled direct debit payments via Stripe, our payment provider for credit cards thus far, with the aim of adding an even more direct payment gateway for SEPA payments. Klarna, whose gateway was previously known as SOFORT, offers such a service, which directly integrates with banks to initiate payments instantly. Their method is a lot more convenient for those already using online banking regularly, and the process is instant for us as well, resulting in faster order deployment from our end.

Unfortunately, during the onboarding process with Klarna we were presented with a number of requirements that run directly opposite to what we want to offer our customers. Additionally, unlike all of our other payment processors, they would have required a minimum number of monthly transactions just to offset the monthly fee they impose in addition to the transaction fees themselves. We would have been required to change our invoice templates to place all available methods alongside each other, which would have caused a lot of clutter on them. Transaction fees, which are currently based on the selected payment processor, would have had to be carried entirely by us, amounting to up to 15% of the actual transaction amount depending on the product, which would have been the highest transaction fee imposed by any payment processor we work with.

Given these negative points and further requirements we have elected not to proceed further with the onboarding process with Klarna. If their requirements and fee structure change in the future we may continue with it, but the likelihood of that is unknown to us. We will continue to look for ways to reduce the complexity of our payment gateways and to provide additional ways to pay for services.

Applying The Wrong Concept The Right Way

In the ongoing, albeit somewhat irregular, theme of technical posts we once again want to bring you a deep dive into a topic that is not often touched upon. This time the focus is on applying a microservice or clustering concept to a piece of software that really does its best toddler temper tantrum impression of not wanting to do its homework.

Microservices

In the world of "web applications" the containerization and clustering of applications through various concepts, layers and confusing config files is a landscape full of wonder and pretty explosions. For most of them, let's call them websites from now on since that is what they are, these setups are not all that useful, since they mostly apply to projects of vast scale. Nonetheless some still fall into the trap of pretty buzzwords and promised gains. Supposedly that is easier than blaming oneself for the code not being optimized or the hardware being overloaded as is. Microservices in most cases describe the concept of splitting a large application into smaller parts, each handling a specific task given some input and producing output. This goes along with clustering these parts across vast networks to position them closer to the user and scaling them as markets grow or shrink. For large platforms with thousands of users this makes the most economical sense, since the solution in the past was to simply slap the entire app onto ever-growing hardware, which just did not scale performance and cost all that evenly. Thus the concept of distributing load and splitting things into the smallest parts to make them more efficient has helped the internet grow and certain companies and platforms make billions while slashing their IT budgets.

Load Balancing

Not a new concept by any means, but an ever more important one these days. A single point of ingress for data into an application serving a wide range of potential sources means potential bottlenecks on the horizon. Equally, producing the output from that generally results in a cascade of ever slower processing until you hit the inevitable timeouts. Balancing this load through means of microservices or caching mechanisms is common practice, and not just in the world of websites. Any type of application, down to the very browser you are reading this through, subscribes to the concept of load balancing in one way or another. At the core of the solution is spreading the load across any sort of multiplication that does not rely on other parts to process the data. In most programming languages this is known as asynchronous processing and generally goes along with the object-oriented programming style that allows it to work in the first place. As a concept, the idea is to allow all parts of an application to run and finish on their own time without causing the whole thing to grind to a halt, even if that, in the name of keeping the end results in sync, sometimes cannot be avoided either.

Where does OpenSim come into this though?

This is where it gets really interesting, because OpenSim has been built from the ground up to split individual processing into parts that can run on their own. These individual services are often asynchronous as well and can even be split out and distributed. This design lends itself both to applying the concept of microservices and to the load balancing that comes with it. However, that is easier said than done. As it turns out, the interconnection between the services, for the purpose of once in a while making sure all that asynchronous data actually makes any sense at all, is not a straightforward affair. More so since changes and new features demand direct connections to other services that absolutely cannot wait for anything else to go on.

In the past there were attempts to resolve this by simply creating another OpenSim process running as a sort-of backup, receiving the same data and running it independently. Should its result then arrive faster than that of the main process, it would be used instead. This went along with splitting services out into their own instances as well, but the resulting complexity, and the requirement to test each new change so as not to severely break the chain of data processing, meant this project never really went anywhere beyond a working prototype.

That's not to say the attempt itself did not emphasize the need to maintain the service-based setup of OpenSim. Thankfully, for the less complex part of providing the main services that connect the assortment of simulators into a cohesive world, this has been maintained. What is commonly referred to as the Robust services generally still has the ability to be split and even run as copies of each other. This leaves the door open for applying both the concept of microservices and load balancing to it. Though, as already mentioned, there are a few things that have managed to become rather large pitfalls for anyone looking to attempt it.

Robust, a simpleton with an attitude

To begin let’s go over the goals and requirements.

  • Split as many of the services contained in Robust as possible into their own instances
  • For services with a potential to overload from data ingress or processing, spawn multiple instances and distribute the load between them
  • Set up connections to each instance in a manner that allows for effective load balancing and reduces the complexity of setup for simulators connecting to them

To achieve these goals we can use a few methods already available, some of which require a bit of tinkering, plus some external systems without which nothing would work. Let's go over each part.

Robust

With the aforementioned splitting in mind, the basic configuration file for a single Robust instance already has a list of the services it contains, as well as their definitions further down below. All we thus have to do here is select the services we want to run in each instance and make sure that in the end we have instances for all of them. However, this idea rather quickly gets thrown out the window when looking at the actual service definitions. The problem sits in the connections services have with each other. While a lot of them point to their dependencies via either a local service definition or an external connector, there still exist some that flat out assume a copy of the service is running in the same instance. So the difficulty is now up a notch, trying to find the services that have to go together to share data.
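As a sketch of what that selection looks like: the stock Robust.HG.ini.example lists every handler in a [ServiceList] section, and a trimmed instance simply keeps only the handlers it is meant to run. The module strings below follow the stock example and the port is illustrative.

    [ServiceList]
        ; this instance only runs the asset and inventory handlers...
        AssetServiceConnector = "8003/OpenSim.Server.Handlers.dll:AssetServiceConnector"
        InventoryInConnector = "8003/OpenSim.Server.Handlers.dll:XInventoryInConnector"
        ; ...everything else is commented out here and lives in other instances
        ;GridServiceConnector = "8003/OpenSim.Server.Handlers.dll:GridServiceConnector"
        ;PresenceServiceConnector = "8003/OpenSim.Server.Handlers.dll:PresenceServiceConnector"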

Connectors

Most connected services refer to other services via the direct connection established over the addins present as part of the Robust system. We can see these as DLL files describing each service. However, in order to allow multiple instances, or indeed other parts of the entire software, to communicate, there exist Connectors. These are also DLLs, but their setup is somewhat different in that they provide a remote-bound connection to a service defined not by the addin, but by a URL. This means we can change our service definitions to these Connectors to allow them to connect to a service running in a different instance.
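A hedged sketch of that swap; the exact module strings and URI keys vary between OpenSim versions, so treat the names below as illustrative rather than authoritative.

    [GridService]
        ; local, in-process service definition:
        ;LocalServiceModule = "OpenSim.Services.GridService.dll:GridService"

        ; remote Connector talking to the grid service hosted by another instance:
        LocalServiceModule = "OpenSim.Services.Connectors.dll:GridServicesConnector"
        GridServerURI = "http://grid.example.org:8003"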

Siblings

Splitting everything up into pieces is one part of resolving issues created by overloaded services, but eventually even that is no longer enough to handle the influx of data. As such, applying the idea of load balancing by creating copies of an application becomes a requirement. Unfortunately this presents an issue when we want to make sure the individual copies are still able to share data with other parts, whether that is in the form of connecting multiple services to a single dependency or the other way round. This is where we have to resort to external software to provide a way to group multiple instances of the same service under a common umbrella through which we can establish connections with them.

Includes

When attempting to set up a vast array of instances, each requiring its own little changes to configuration to interconnect properly, we quickly run into less of an issue and more a case of trying not to get brain freeze in the process. Writing a full configuration for each node, requiring hundreds of lines each time to provide all the necessary information for it to run, is tedious and can easily produce mistakes. Thankfully this is something that has already annoyed at least one person before, and to our advantage this person has done something about it. Configurations are capable of loading data from other files and combining them into a fully qualified instance configuration. This means we can configure each service once for connecting locally and once for connecting remotely and simply mix and match the required parts via these includes. We can now simply select what an instance is meant to run as a local service and what it should connect to remotely.
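A sketch of such a per-instance file, assuming an illustrative folder layout: the ini reader used by OpenSim and Robust merges Include- lines into the final configuration, so each node's file stays short and just pulls in the shared pieces it needs.

    ; robust.assets.ini -- per-instance file, paths are illustrative
    [Startup]

    Include-Common = "config/common.ini"                     ; ports, database, constants
    Include-AssetLocal = "config/services/asset.local.ini"   ; the service this node runs itself
    Include-GridRemote = "config/services/grid.remote.ini"   ; everything else via Connectors
    Include-PresenceRemote = "config/services/presence.remote.ini"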

Hacking

This is where it gets complex. In order to reduce the load created by asking the same questions over and over again, some services rely on caches. These will cache a request for certain data, allowing it to be delivered without the need to retrieve it from data storage. Unfortunately these caches are local to the specific service; if we then attempt to multiply this service, there is a chance of cached data conflicting with actual data entered on a sibling. To combat this issue we have to go deep into OpenSim, find the caches and either remove them entirely or change their behavior so they are not in use when multiple instances of a service are being run. In this case the better and more compatible option is to look for each part of the code that either requests or enters data into the caches and make these actions dependent on a flag that either allows them or not, with the latter defaulting back to retrieving or storing data in the database directly, as if the cache had no entry for it.
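A minimal C# sketch of that approach with made-up names, not the actual OpenSim code: every cache access is gated by a single flag, so a service running as one of several siblings can skip its local cache and always hit the database.

    using System;
    using System.Collections.Generic;

    // Sketch only: when useLocalCache is false (siblings exist) every lookup
    // goes straight to the authoritative database and nothing is cached locally.
    class CachedLookupService
    {
        private readonly bool _useLocalCache;
        private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();
        private readonly Func<string, string> _loadFromDatabase;

        public CachedLookupService(bool useLocalCache, Func<string, string> loadFromDatabase)
        {
            _useLocalCache = useLocalCache;
            _loadFromDatabase = loadFromDatabase;
        }

        public string Get(string key)
        {
            if (_useLocalCache && _cache.TryGetValue(key, out string hit))
                return hit;                        // safe only on a single instance

            string value = _loadFromDatabase(key); // authoritative source
            if (_useLocalCache)
                _cache[key] = value;
            return value;
        }
    }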

Long Term

As changes to the main parts of OpenSim are still being made in order to update some of the ancient standards used when it was originally conceived, and as new features keep requiring additional code, the long-term stability of this is still in question. Changes already made to some parts do already cause some instability and require long-term testing as well as further changes to mitigate. As such this setup will likely require further "hacking" and even changes to the setup itself to account for changing service relations. As of yet it is unclear whether changes to the service interrelations will be made to retain or even enhance the ability to split each service, but we certainly hope so. Increasing data sizes and ever more growth will test the infrastructure, and the more a setup can be spread out and its load distributed among the parts, the more solid it will be in the future. As with everything it requires testing and more testing and ever more testing to identify issues, but as OpenSim is still in development that is frankly a given constant already.

The gritty bits

Having completed the crash course in Robust setup let’s create a hypothetical situation realistic enough to warrant creating a solution for.

Say we have to deal with over 10000 users logging in throughout the day, each having thousands of items in their inventory and being an overly active member of the community, chatting and roaming the world with vigor. How do we handle the influx of hundreds of requests per second?

Let’s go over each part.

1. Nginx

Nginx is a webserver with load balancing capabilities through the use of a proxy setup. This sounds complicated, but is actually relatively easy. What we need to do is set up a hostname for each individual type of service we want to run instances of. Then we pass requests from these hostnames onto a set of instances by passing the request over the ports used by those instances. This takes the form of server definitions with a proxy pass to the upstream ports used by the instances.

An example:
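A hedged sketch of what such a definition can look like; the hostnames and ports are placeholders, and the stock Robust port 8003 is only used out of familiarity.

    upstream robust_assets {
        server 127.0.0.1:8103;      # first copy of the asset instance
        server 127.0.0.1:8104;      # second copy, nginx alternates between them
    }

    server {
        listen 8003;
        server_name assets.example.org;

        location / {
            proxy_pass http://robust_assets;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }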

We can do this for all instances, multiple or singular, passing everything over a central port, thus making the configuration of simulator connections relatively easy. Nginx handles routing the requests in a somewhat round-robin style. This means it is not directly aware of the load placed on each copy, but since each request is handed to a different copy in turn, that is likely enough. If necessary we can always add more copies.

2. Robust

In order to make it easier to run a large number of copies, instead of multiplying the binary as a whole we simply treat it as a template to spawn copies from. This requires providing each instance with the information of where its configuration should be loaded from. We do this by adding the inifile parameter to the execution command, pointing it at a single file containing the aforementioned definitions and includes.

An example:
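A sketch of how the copies can be started, assuming a Linux host running Robust under Mono; paths and file names are purely illustrative.

    cd /opt/opensim/bin

    # each copy loads its own per-instance configuration via the inifile parameter
    mono Robust.exe -inifile=/opt/opensim/config/robust.assets.ini
    mono Robust.exe -inifile=/opt/opensim/config/robust.inventory.ini
    mono Robust.exe -inifile=/opt/opensim/config/robust.grid.ini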

Each service is configured as normal, making sure to use the Connectors for the remote counterparts. As mentioned above, this structure looks confusing at first, but it is actually a lot less work, as we simply combine what we need rather than writing the config sections out in each file. Organizing the local connectors for services included in the specific Robust instance we configure, and the remote ones connecting to other Robust instances, into folders makes it easier to see what's what.

3. Simulators

Connecting a simulator to this setup is remarkably easy given the complexity of what it is connected to. For the most part we can use the hostnames to connect the simulator services to their Robust providers. Only for select services, GridInfo in particular, is a more direct connection required. This also goes for external asset servers, which we hope will become more common, as they are the second biggest bottleneck in OpenSim.

A rough example of the setup:
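A rough, hedged sketch of the relevant pieces of a simulator's GridCommon.ini; the URI keys follow the stock example file, the hostnames are placeholders, and the exact layout depends on how the services were split.

    [AssetService]
        AssetServerURI = "http://assets.example.org:8003"

    [InventoryService]
        InventoryServerURI = "http://inventory.example.org:8003"

    [GridService]
        GridServerURI = "http://grid.example.org:8003"

    [GridInfoService]
        ; GridInfo is one of the services that still wants a direct address
        GridInfoURI = "http://grid.example.org:8002"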

Configuration depends on how you set things up and what type of service and instance split is done. As mentioned we don’t need to worry about setting up specific ports for each service as the individual parts are proxied through to their respective endpoints already, which also handles balancing load. The identification is no longer the port, but the hostname itself.

4. Runtime Environment

This section is somewhat optional, but may be of value in the future. A big issue with setting up so many individual services is handling them when restarts and changes are required. As we are dealing with a program that runs independently, we can simply push it to the background and nuke it whenever a restart is desired, but this might incur data loss. A better solution is providing a separate runtime environment for each instance. Under Windows this can easily be accomplished by simply stuffing the window into a corner and forgetting about it, but as Windows is not a recommended platform for running services such as OpenSim, on Linux this is a bit more involved. It is possible to simply send the process to the background as mentioned, but then there is no way to interact with it or get it back other than sending data to it, which gives us no feedback. The better option is to use runtime environments, which are plentiful on Linux, such as Docker or LXC for containers, or more simply things like "screen". The latter provides "windows" we can select at will to interact with each instance, both to send commands and to view the process working. Which one of these works best depends on familiarity and on what level of separation you want for each service.
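A brief sketch using GNU screen, with the same illustrative paths as above: each instance gets a named session that can be reattached at any time to read output or type console commands.

    # start each instance detached, in its own named session
    screen -dmS robust-assets mono Robust.exe -inifile=/opt/opensim/config/robust.assets.ini
    screen -dmS robust-inventory mono Robust.exe -inifile=/opt/opensim/config/robust.inventory.ini

    # list the sessions, reattach to one, detach again with Ctrl-A d
    screen -ls
    screen -r robust-assets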

The Grand Solution

To test this we have created a testing environment running the following setup:

  • All requests routed through Nginx, as previously mentioned, to reduce the complexity of connecting simulators.
  • Each instance having only minor changes to its configuration in the realm of setting the port to use.
  • External and internal routing for service connections is also done with the proxy to reduce complexity of service interconnection and take advantage of full load balancing of all requests.
  • Configurations based primarily on includes rather than full configuration files reducing complexity and clutter
  • Spawning instances of a template binary to reduce complexity of upgrades
  • Retaining simple configuration for simulators to services without the need to specify individual ports
  • Splitting services logically based around encountered load and retaining services that have no ability to remotely connect to integrated services
  • Minimal changes to OpenSim itself

This is obviously not the solution to all potential problems and there is no guarantee future changes won’t break a setup as complex as this. Certainly we hope for the opposite since there is only so much a single fully qualified instance of Robust can do on its own and hitting that limit is not a pretty sight.

Applying the concepts of microservices and load balancing to OpenSim may seem wrong, and there are certainly many obstacles in the way of doing so, but the core of it was always part of the idea behind the service-based setup of Robust, or even OpenSim as a whole. Thus these concepts can work for it as well, despite the issues that exist due to inter-service dependencies and caching setups. It most certainly has a ways to go to truly embrace them, but it is already possible to observe the positive aspects. Whether it is a setup as complex and distributed as shown here or simply splitting out one or two services, the future undoubtedly lies in utilizing them.

Apart from the brief mention of this capability on the official OpenSim wiki and some snippets of configuration options that can be found on the web, this marks the first time it has been fully documented and tested. In the pursuit of fully dissecting and testing it we share the interest with a number of people who have provided information, time and effort in testing as well. It shows the strong community spirit often associated with open-source projects, and we hope to propagate this to anyone reading this article. We want to thank Gimisa Cerise, who was instrumental in kicking this project off by providing the initial basis of configuration options and pointers to information, and who has been pulling apart the hidden and complex inner workings of OpenSim for a long time now. Equally we have to extend thanks to the OpenSim team for providing assistance in tracking down interconnected modules. The continuous effort put into the OpenSim project by everyone involved makes things like this possible in the first place; their ongoing support and work toward the project drives it forward and we are happy to be a part of it and contribute where possible.

We will certainly continue testing and pushing the boundaries of OpenSim to make sure it is prepared for the future and hope this insight into the capabilities it has will provide some positive impact on the metaverse as a whole.