Continuing the series of technical deep-dives into the inner workings of OpenSimulator, this time the focus is on the things it doesn’t do. Much like the stereotypical pubescent teenager who has trouble breathing over the mountain of gym socks and hill of dirty shirts, sorted loosely by the distance they can be smelled from, OpenSimulator has a tendency not to clean up after itself. Unlike with your local municipality, it isn’t a matter of dispatching some burly men and a big skip to collect all that and dump it onto a big pile or burn it. The solutions are more akin to surgically removing kidney stones or trying to teach a Dodo to fly, but let’s explore them anyway.
The main culprit of compulsively collecting whatever is thrown its way is, in the case of OpenSimulator, the database holding data for the various aspects that require central coordination. Each individual module deals with one aspect and manages its data in often widely different arrangements. This can be expected given the history of the entire project and the many people who had often vastly diverging ideas on how best to handle things and what direction they expected the project to head in. This article focuses only on the big offenders and the known points requiring either manual cleaning or at least a watchful eye to prevent issues.
Information
Decentralization brings with it the need to request information directly from the endpoints it originally came from. In other words, whenever there is data not stored locally on a specific grid, it has to be fetched from wherever it was originally created. This goes for various things, such as information about who created an item, the associated profile data of a user, or anything not yet transferred locally. As the Hypergrid is decentralized in nature, meaning there is no single authority that can provide information, each piece has to be requested from its origin. This is done via an HTTP queue filled with requests for certain data; the returned data is then inserted and cached locally where applicable. As things like profiles can change, there is no local place to keep this data for longer than most instances run, so each time the data is lost or deemed too old it has to be fetched again. A reasonable system to keep things up to date and make sure you are always getting the correct data, but it falls flat on its face when things don’t work.
Each piece of data requested from an external source creates an entry in the aforementioned queue and is checked, and waited for, every time. When the remote endpoint does not answer, it takes time for the request to be deemed a failure before the next one is executed. When there is a lot of “dead” data this can fill up the queue and stop further items from being executed, completely halting any additional requests. While these requests are open they are constantly attempting to fetch data, which reduces network performance and can put additional load on an instance. In extreme cases the amount of data requested can clog up the pipes to the point that other critical operations requiring the network are not executed, which has large detrimental effects on the instance.
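To make that failure mode a bit more tangible, here is a minimal sketch, not OpenSimulator’s actual code and with made-up endpoint addresses, of a serial fetch queue in which every dead endpoint burns its full timeout before anything queued behind it gets a chance to run:

```python
# Illustrative sketch only, not OpenSimulator code: a serial fetch queue
# where each request may block for the full timeout before the next one runs.
import queue
import urllib.request
import urllib.error

TIMEOUT = 10  # seconds a single request is allowed to block the queue

pending = queue.Queue()
for url in [
    "http://grid-alive.example.org:8002/profile",   # made-up endpoints
    "http://grid-gone.example.org:8002/profile",    # this one no longer exists
    "http://grid-alive.example.org:8002/creator",
]:
    pending.put(url)

while not pending.empty():
    url = pending.get()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            data = resp.read()                      # would be cached locally
            print(f"ok   {url} ({len(data)} bytes)")
    except (urllib.error.URLError, OSError):
        # A dead endpoint costs up to TIMEOUT seconds before the next request
        # even starts; enough of them and the queue barely moves at all.
        print(f"dead {url}")
```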
Unfortunately, as you may expect, the solution to this is either removing the item causing the requests to be scheduled or altering it so that outgoing requests are no longer required. Usually this means adjusting things directly in the database to remove references to external endpoints entirely, which is problematic for many reasons. Most commonly, the creator information of items from remote sources is the culprit, and removing it is questionable.
This is compounded by things such as endpoints changing addresses or locations, or slow and intermittent connections. And it is not just items either: the problem of information requests extends to friends, groups and anything else that requests data over the Hypergrid from remote endpoints.
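For those who want to at least see the scale of the problem before touching anything, a rough sketch of listing affected inventory entries could look like the following. It assumes a typical MySQL setup with an inventoryitems table and a creatorID column that embeds the home grid address of Hypergrid creators; names differ between versions and connectors, the dead grid address is a placeholder, and everything should be verified against your own schema with a backup at hand:

```python
# Sketch: list inventory entries whose creator information points at a grid
# that no longer answers. Table and column names (inventoryitems, creatorID)
# match a typical MySQL setup but must be checked against your own schema.
import pymysql

DEAD_GRID = "dead-grid.example.org"   # placeholder for the vanished endpoint

conn = pymysql.connect(host="localhost", user="opensim",
                       password="secret", database="opensim")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT inventoryID, inventoryName, creatorID "
            "FROM inventoryitems WHERE creatorID LIKE %s",
            (f"%{DEAD_GRID}%",),
        )
        for item_id, name, creator in cur.fetchall():
            print(item_id, name, creator)
        # Only after reviewing this list would you consider rewriting the
        # creator field (for example to a plain local UUID), and only with
        # a database backup at hand.
finally:
    conn.close()
```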
Groups
The group system, or GroupsV2, has been a point of issues since its introduction. The data it keeps is spread across multiple tables in the database, which is both a blessing and a curse. When it comes to databases and tables, you generally want to avoid sending too many queries and instead reference things in code rather than by combining tables. Equally, smaller and simpler queries run faster and can provide overall quicker fetching of the required data. Which approach is favorable also depends on the size of the data itself, the table structure and even the performance of the machine running the database. No wonder, then, that database design and structuring is something some people focus their entire careers on. In the case of OpenSimulator the current system makes frequent and heavy calls to the database to refresh and provide data, which on the surface seems like a negative. As more data comes in, however, these calls become less of a strain relative to the amount of data they have to sift through before getting to the relevant parts. This means that while the system can be slow, it does not suffer that much from slowing down further as more groups are created and people join them. It is still not an ideal approach, and past a certain level it starts slowing down regardless. This is compounded by the dynamic nature of group data, which requires frequent refreshes when changes are made.
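As a rough illustration of that trade-off, here is a sketch of the two query patterns: a single joined query versus one small query per group. The table names follow the core GroupsV2 MySQL module (os_groups_groups, os_groups_membership) but may not match your connector, and the avatar UUID is a placeholder:

```python
# Sketch of the two query patterns: one joined query versus one query per
# group. Table names follow the core GroupsV2 MySQL module and may differ
# in your connector; the avatar UUID is a placeholder.
import pymysql

AVATAR = "00000000-0000-0000-0000-000000000000"

conn = pymysql.connect(host="localhost", user="opensim",
                       password="secret", database="opensim")
with conn.cursor() as cur:
    # Variant A: a single joined query -- heavier to process, but one round trip.
    cur.execute(
        "SELECT g.GroupID, g.Name FROM os_groups_membership m "
        "JOIN os_groups_groups g ON g.GroupID = m.GroupID "
        "WHERE m.PrincipalID = %s",
        (AVATAR,),
    )
    print(cur.fetchall())

    # Variant B: many small queries -- each cheap on its own, but the number
    # of round trips grows with every membership.
    cur.execute("SELECT GroupID FROM os_groups_membership WHERE PrincipalID = %s",
                (AVATAR,))
    for (group_id,) in cur.fetchall():
        cur.execute("SELECT Name FROM os_groups_groups WHERE GroupID = %s",
                    (group_id,))
        print(cur.fetchone())
conn.close()
```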
Assets
Arguably the biggest factor in collecting garbage, like a hoarder with both a kleptomaniac disorder and an aversion to garbage cans. The standard asset provider in OpenSimulator, which uses the database exclusively and saves all data directly to a table, is a ticking time bomb. As asset sizes increase, with more and more mesh objects being imported and more and more complexity added to them, you can quickly run into issues of assets not saving at all. This is because most databases impose limits on the maximum size of any given query or packet sent to them. Once that limit is exceeded the information is simply rejected and the data is not saved at all. As more users join a grid running this system, each of them requesting their data can easily exceed the maximum number of connections the database can keep open and serve. Ultimately, since databases have to save their data somewhere as well, this starts eating into the performance of the hardware itself. Running OpenSimulator on older drives still using spinning disks quickly hits a brick wall: as access times for the files the database uses increase, so do queries to the database itself, and once timeouts start being reached the entire system can quickly become unusable.
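A quick way to see how close a database-backed asset store is to that wall is to compare the server’s packet limit with the largest stored rows. The sketch below assumes MySQL with the stock assets table and its data blob column; adjust names to whatever your setup actually uses:

```python
# Sketch: compare the server's packet limit with the largest stored assets.
# Assumes the stock MySQL asset table ("assets" with a "data" blob column);
# adjust names to your schema.
import pymysql

conn = pymysql.connect(host="localhost", user="opensim",
                       password="secret", database="opensim")
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
    _, packet_limit = cur.fetchone()
    print("max_allowed_packet:", packet_limit, "bytes")

    # The ten largest assets; anything approaching the limit is a candidate
    # for silently failing to save the next time it grows.
    cur.execute(
        "SELECT id, name, LENGTH(data) AS size "
        "FROM assets ORDER BY size DESC LIMIT 10"
    )
    for asset_id, name, size in cur.fetchall():
        print(asset_id, name, size)
conn.close()
```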
Thankfully there is a way around that. OpenSimulator implements an alternative asset system, FSAssets, which utilizes the database in a manner more in line with what databases were designed for in the first place: less a mass storage device and more a reference engine for what is where. As such, this system only keeps information about which specific asset ID can be found at which location. This also allows for de-duplication, the process of checking whether the actual data already exists somewhere and, instead of saving yet another copy, simply pointing the new ID at the already existing data. On the surface this sounds like an ideal solution, but even it has shortcomings. Much like the database itself, the performance of this system depends on how quickly the hardware can provide the information back to the requesting agent, so the storage still needs to be fast and benefits from setups that reduce access times. Combining multiple physical storage devices that contain either the same data or parts of it, each contributing its access speed and thus increasing overall bandwidth, is paramount to maintaining performance as the data grows.
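The de-duplication idea itself is simple enough to sketch: address the file on disk by a hash of its contents and let the database map asset IDs to that hash, so identical uploads share a single file. The layout below is illustrative only and not FSAssets’ exact on-disk scheme:

```python
# Illustrative sketch of content-addressed, de-duplicated storage; the paths
# and directory layout are made up and not FSAssets' exact on-disk scheme.
import hashlib
import os

STORE = "./fsassets-demo"     # placeholder storage root
id_to_hash = {}               # stands in for the database mapping

def save_asset(asset_id: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    # Fan files out into subdirectories so no single folder grows unbounded.
    path = os.path.join(STORE, digest[:2], digest[2:4], digest)
    if not os.path.exists(path):                    # new content: write once
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(data)
    id_to_hash[asset_id] = digest                   # duplicates just point here
    return path

# Two different asset IDs carrying identical bytes end up as a single file.
save_asset("asset-id-one", b"the same mesh data")
save_asset("asset-id-two", b"the same mesh data")
print(len(id_to_hash), "IDs,", len(set(id_to_hash.values())), "file(s) on disk")
```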
And grow it will, very quickly in some cases, which brings other issues down the line. While databases are designed to retain fast access to data even when reaching millions of entries, the same cannot be said for files on disks. When performing backups, or just moving data from one place to another as you upgrade your storage solution, moving millions of small files in various nested folders can be quite the task in itself. Keeping track of changes to them is an even bigger task, as most versioning systems will simply throw in the towel when confronted with the scale of data OpenSimulator can produce in a short amount of time. Designing backups and control over that data is thus just as important as selecting the right hardware, and should definitely play into hardware selection as well.
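Before designing any backup scheme it helps to know the raw numbers you are dealing with. A small sketch that walks a file-backed asset store and reports file count and total size (the path is a placeholder):

```python
# Sketch: walk a file-backed asset store and report file count and total
# size -- the two numbers that decide whether straightforward copy-based
# backups are still realistic. The path is a placeholder.
import os

STORE = "/srv/opensim/fsassets"   # placeholder storage root

count = 0
total_bytes = 0
for root, _dirs, files in os.walk(STORE):
    for name in files:
        count += 1
        total_bytes += os.path.getsize(os.path.join(root, name))

print(f"{count} files, {total_bytes / 1024**3:.1f} GiB")
```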
Since we are talking about garbage, any asset system has the issue of being a rather simple fella to talk to. Being the hoarder he is, he will keep all data you give him, regardless of whether anyone still needs it. This means over time it will contain data that is no longer present anywhere, be that on a region, in a user inventory or really anywhere else. Orphaned data is plentiful and almost impossible to find, given you need to cross-reference so many other places. One of the biggest culprits in this regard are notecards, specifically those created by scripts. As a new notecard is created whenever a change is made (yes, that is how that works), the old “version” is never discarded; it remains. Over time these orphaned notecards, no longer belonging to anyone or any object, tend to stack up in the database.
The only option to cull these is to check all regions and inventories for whether a reference to them still exists. This can take quite some time and is only possible if you actually have access to all of these databases, in order to make sure you are not removing a notecard that is still in use. Of course, with the connectors between the regions and the grid system, it is within the realm of possibility to automate this process and do regular culling of dead notecards or other assets that exhibit similar behavior.
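A sketch of such a cross-reference could look like the following, listing notecard assets (asset type 7) that neither an inventory item nor an in-world prim inventory still points at. It assumes grid and region tables share one MySQL database (assets, inventoryitems, primitems); in a split setup every region database has to be checked as well, and the output should always be reviewed before anything is deleted:

```python
# Sketch: list notecard assets (assetType 7) that no inventory item and no
# prim inventory still references. Assumes grid and region tables share one
# MySQL database (assets, inventoryitems, primitems); a split setup needs
# every region database checked as well. Review before deleting anything.
import pymysql

conn = pymysql.connect(host="localhost", user="opensim",
                       password="secret", database="opensim")
with conn.cursor() as cur:
    cur.execute(
        "SELECT a.id, a.name FROM assets a "
        "WHERE a.assetType = 7 "
        "  AND NOT EXISTS (SELECT 1 FROM inventoryitems i WHERE i.assetID = a.id) "
        "  AND NOT EXISTS (SELECT 1 FROM primitems p WHERE p.assetID = a.id)"
    )
    for asset_id, name in cur.fetchall():
        print("orphan candidate:", asset_id, name)
conn.close()
```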
The main issue is that crawling over what can quickly grow to millions of assets is both time and resource intensive, and so is making a significant change to the design of these connectors and the grid system itself. As often noted, said system is a bit simple, borderline dumb even, as it mostly concerns itself with synchronizing data among the simulators and users. Adding routines and regular tasks would significantly increase its resource usage.
Friends
We are all social beings and have a natural desire to flock together. This is generally not an issue within the scope of a single grid, but in the scope of the entire metaverse it can backfire somewhat. As social media has shown, letting people know where you are and whether you are available is something we tend to want to broadcast to those interested. OpenSimulator does the same, attempting to let all friends know if and where you are. These calls go out over the HyperGrid system to inform everyone on the friends list. Problems arise when, given the changing nature of the metaverse, some of these remote locations are no longer available. Each request is scheduled internally and then run. If the call succeeds all is well, but each call that fails to reach its target will hold until a timeout is reached. The overall queue for these calls is restricted to prevent overloading the network, but the calls still need to happen. As the queue fills up with requests to be done, the simulator has to “remember” them until they can be executed. That unfortunately takes quite a bit of resources as the queue waits and fills up with requests. It will also absolutely finish its queue, even if the user who caused the calls has already left. What is once in the queue will remain.
Fortunately, any failures to reach the other side are logged on the specific simulator they run on. With access to the console or logs it is then a simple task to search the Friends table for the entries likely to fail and remove them. This should be done on both sides, as friends are kept as a pairing: each friendship has an entry from the perspective of either participant. Unfortunately, there still exist some levels of caching and mechanisms that re-create friendships. It makes sense to also remove the relevant Calling Cards from the inventories, in order to prevent leaving dead references behind that could produce a renewed entry in the database.
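A hedged sketch of that cleanup, assuming the stock Friends table with PrincipalID and Friend columns and using a made-up dead grid address, might look like this; the actual DELETE is left commented out so the list can be reviewed first:

```python
# Sketch: find Friends rows that reference a grid which no longer exists.
# Assumes the stock "Friends" table with PrincipalID/Friend columns; the
# dead grid address is a placeholder. The DELETE is commented out on purpose.
import pymysql

DEAD_GRID = "dead-grid.example.org"
PATTERN = f"%{DEAD_GRID}%"

conn = pymysql.connect(host="localhost", user="opensim",
                       password="secret", database="opensim")
with conn.cursor() as cur:
    cur.execute(
        "SELECT PrincipalID, Friend FROM Friends "
        "WHERE PrincipalID LIKE %s OR Friend LIKE %s",
        (PATTERN, PATTERN),
    )
    for principal, friend in cur.fetchall():
        print("stale pairing:", principal, "<->", friend)

    # Once reviewed, this removes both halves of each pairing:
    # cur.execute("DELETE FROM Friends WHERE PrincipalID LIKE %s OR Friend LIKE %s",
    #             (PATTERN, PATTERN))
    # conn.commit()
conn.close()
```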
Maptiles
In order to produce the world map for the viewer to show regions, a tile is created for each region, or even multiple tiles for larger regions. These tiles are sent to the grid system and are then used to generate the different map zoom levels. The tiles are kept indefinitely, and so are the generated zoom levels. Nothing here is automatically culled, not even when the regions in question are properly shut down; the maptile will remain until a new one is uploaded to that specific spot.
This leaves two methods. The first is uploading an empty maptile to the specific spot you want to clear, causing the grid to redraw the zoom levels; this could be added to the shutdown routine of the simulator. The other option, which has the added benefit of displaying renewed tiles for the regions, is removing all the maptiles and asking all connected simulators to create new ones to send to the grid system. The latter method puts quite a strain on the system, since all that data has to be sent and received, but it also makes sure the map displays regions as they actually look.
What remains to be done beyond this, though, is clearing, you guessed it, the database. For legacy viewers and other applications, each maptile is also stored in the database as a texture asset. These terrainimages, as they are called, consume both entries in the database and, if a file-based asset system is used, files within the asset system. Removing them, assuming no one is uploading similarly named textures, can be a simple matter of going through the database, fetching the correct entries and removing the corresponding assets.
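Something along these lines could do the counting and, once you are confident about the naming, the removal; the 'terrainImage' pattern and table name are assumptions to check against your own database, and with a file-based asset system the files on disk still need to be cleaned up separately:

```python
# Sketch: count (and, once confident, remove) stored map-tile textures by
# name. The 'terrainImage' prefix and the "assets" table are assumptions to
# verify first; with a file-based asset system the files on disk remain and
# need to be removed separately.
import pymysql

conn = pymysql.connect(host="localhost", user="opensim",
                       password="secret", database="opensim")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM assets WHERE name LIKE 'terrainImage%'")
    (count,) = cur.fetchone()
    print(f"{count} stored map tile textures")

    # Only once you are sure no regular texture shares the naming scheme:
    # cur.execute("DELETE FROM assets WHERE name LIKE 'terrainImage%'")
    # conn.commit()
conn.close()
```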
Now, ideally the grid system would routinely do this, or even issue a cleanup when changes to the region list are made, but that ends up in the same bucket of adding a substantial system to it, which would consume resources, issue calls over the network and create a lot more calls to the maptile system. Ultimately though, if left unchecked, and since the new maptiles will always be slightly different, the amount of data in both database and assets can grow quite substantially.
Inventory
A lesser culprit of garbage, but nonetheless something to watch for. By default, the deletion of an item does not necessarily cause its underlying data to be removed. Just stuffing things into the trash also, obviously, does not delete the item either, as it can still be restored; in operation the trash is really more of a recycle bin. It makes sense to regularly empty that folder and to enable the options that actually remove the references from the databases and assets. Automated systems for removing items left in the trash for a certain time do not exist and might not be desired either, but please don’t use your trash folder as a backup; that’s what IARs are for.
Removing trash items based on their age in that folder and removing the corresponding assets through a routine makes the most sense here. Enabling the deletion option for items in the database and assets does bring with it the security risk of such deletion requests being issued externally, if the correct request can be made to the system. Normal security practices fly in the face of the open nature of OpenSimulator and especially the HyperGrid system, so whatever can be done to secure the system against attack is worth considering, even if that means some garbage is collected. In the case of inventory, access to the database at the grid level is a given, so creating cleaning routines that run in a secure manner is well within the realm of possibility.
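A sketch of what such a routine could start from is shown below. OpenSimulator does not record when an item was moved to the trash, so the item’s creation date only serves as a rough proxy here; the schema assumptions (inventoryfolders with folder type 14 for Trash, inventoryitems with a Unix creationDate) should be verified before wiring this into anything automatic:

```python
# Sketch: list inventory items sitting in Trash folders that are older than
# 90 days. OpenSimulator does not store when an item was trashed, so the
# item's creationDate is only a rough proxy. Schema assumptions: folder
# type 14 = Trash in inventoryfolders, Unix creationDate in inventoryitems.
import time
import pymysql

CUTOFF = int(time.time()) - 90 * 24 * 3600

conn = pymysql.connect(host="localhost", user="opensim",
                       password="secret", database="opensim")
with conn.cursor() as cur:
    cur.execute(
        "SELECT i.inventoryID, i.inventoryName, i.assetID "
        "FROM inventoryitems i "
        "JOIN inventoryfolders f ON f.folderID = i.parentFolderID "
        "WHERE f.type = 14 AND i.creationDate < %s",
        (CUTOFF,),
    )
    for item_id, name, asset_id in cur.fetchall():
        print("expired trash:", item_id, name, "-> asset", asset_id)
conn.close()
```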
That’s it? Well, not quite. There are quite a few other places where garbage collects and can eventually add up to causing issues, but taking care of those is quite a bit more difficult and only really becomes a problem at scales rarely seen in the metaverse. There is still room for improvement in OpenSimulator in regard to the mess it makes of its own data structures. We have begun working out some of these issues already and will continue to do so. In the meantime we hope the OpenSimulator project itself can decide on a course for dealing with the issues mentioned here, but it is obviously not going to be easy deciding what to do or not to do, given the implications of building entire cleaning routines into a system that in some regards is nearing its design limits already. Thankfully, in most cases a skilled grid operator can still manually create their own routines to counteract the problems arising from increasing garbage collection in their database and file structures.