Data is beautiful – pt. 1

A fairly common expression not just among the stat majors and math nerds on the internet. On its own such a statement always requires a qualifier though, as with anything that describes the world around us, what data we see doesn’t necessarily provide any further information about what we can observe with our own eyes. It’s the things it cannot show and thus require interpretation that is the most interesting part about statistics. Philosophies aside, statistics are important for various reasons, if only to have some form of overview of what you are dealing with.

With OpenSimulator there exists a lot of information inside the underlying database structure, but a lot of that is either hidden in other information or requires correlation to multiple points of data to return any useful information. Plain statistics and system information used to gauge the health of the installation and the scale at which it operates at are somewhat difficult to grasp, if the concepts of how the database operates and the data within correlates to each other are not well understood. Unfortunately not much of this is documented either, but how about we fix that part?

 

Statistical data

 

Starting with the most common data points of interest to grid owners, the usage statistics of the platform they run. Users, their origin and connections as well as the information that is extrapolated from them.

A big metric for many is the activity on their grid, which is often seen as the most important indicator of success or validity of their egomania. Unfortunately the information available to accurately determine this figure is somewhat limited due to the lack of information being stored within the database. We can more accurately determine the number of active users on a grid through the information the users provide beyond just their name, but this data is only available through the logs. Without much trouble changes can be made to store this into the database and thus provide additional means to increase the accuracy of this particular statistic, but we won’t go that far in this article.

Another, seemingly, useful metric to determine usage is the scale of the installation, which usually is measured in the amount of space it makes available to its users. Unfortunately the information, again, lacks depth to gauge if any of that space is actually usable in the sense that it is open to the public, but this is neither a clear case anyways, nor is there a direct way to determine this without significant changes to information kept in the database.

External connections, made both inbound and outbound, in the form of any information shared within the metaverse is another very useful metric to get an idea of the level of integration a community might have to the wider metaverse. Factors in this are a good indicator on whether there is an active exchange happening between communities and how much your own community interacts with others.

Among all of this information sits a myriad of other statistical data. Some of which is useful for pursuing certain goals or simply to track and watch evolve over time. The big question, that is often somewhat vaguely answered and poorly documented, then is how to extract all this information. There are a multitude of ways to accomplish this, but in order to both look at the information at a glance and more easily retain it over time as historical record and trend information the best way remains to directly query the database for it.

With a single, albeit rather complex, query most of the relevant information can be extracted. Through the use of correlation and filtering it can be enhanced to be more accurate and provide data points not directly tracked. While simpler ways exist to fetch this data, they, unfortunately, also gloss over the options to remove erroneous or duplicate data, so the only way to actually achieve some form of useful and accurate data is to take into account all available data points and correlate them accordingly.

To start, let’s look at the data points we can actually retrieve from the database and what they stand for:

  • online_users – As the name suggest, the currently online users that are actually inworld somewhere, which is determined by being on a valid region
  • active_local_users – The amount of users that have logged into the grid at some point in the last 30 days, which can be qualified by removing duplicates
  • active_visitors – Those that have visited in the last 30 days, which unfortunately cannot easily be deduplicated
  • traffic – A combination of all the locally active and external visitors from above, which is what most refer to as activity
  • total_visitors – How many unique users have visited from external sources, again hard to deduplicate
  • residents – Counting how many have registered an avatar that is not an alternate of an existing avatar.
  • recent_registrations – The count of how many registrations happened in the last 30 days
  • region_count – Just the total number of rows seen in the database
  • region_equivalent – Determining the size of var regions to how many normal region spots they occupy
  • land_area_sqkm – And then taking that to work out how much space that is in square kilometers
  • friendships – We can work out who is friends with who and count only those true friends
  • hg_friendships – Similarly for external users, with the caveat that this is sometimes a bit bugged
  • partnerships – The ones really close to each other
  • groups – Counting all the local and externally known groups
  • known_grids – Extrapolated from the visitors we can determine how many unique other grids have visited

Now to determine each of these data points we can run specific queries to the database to return them.

online_users
SELECT COUNT(*) AS online_users FROM Presence WHERE RegionID != '00000000-0000-0000-0000-000000000000'
Selecting the count from the Presence table, but excluding any entry that shows the RegionID as a null key, because this usually indicates a user has crashed due to a failed teleport or otherwise. Incidentally the “you are already logged” message is caused partly by this information as well, so we should exclude it and ideally even remove any such entries in the table if they remain there for more than a few minutes.

active_local_users
SELECT COUNT(DISTINCT(UserAccounts.Email)) AS active_local_users FROM GridUser INNER JOIN UserAccounts ON GridUser.UserID=UserAccounts.PrincipalID WHERE GridUser.Login > (UNIX_TIMESTAMP() - (30*86400)) AND (GridUser.Logout - GridUser.Login) > 30
In order to count just the unique active users we can match their GridUser entries, which contains the information about their last login time, to the UserAccounts table, which contains the email address associated with the specific avatar. While this isn’t a complete indication of a duplicate, it does provide at least some hints. We can further remove any entries that show the time actually spent inworld to be less than 30 seconds, arguably an amount of time that allows for very little actual activity. Through matching the UUID across the two tables we actively remove the external visitors as they are denoted with their full UUI instead, so no matches and joins are performed for them.

active_visitors
SELECT COUNT(*) AS active_visitors FROM GridUser WHERE UserID LIKE '%http%' AND Logout > (UNIX_TIMESTAMP() - (30*86400))
Then in order to fetch the external visitors for the last 30 days we can filter by looking for entries that contain a UUI and check the timings for those. Using the time of log out to make sure we only look at those that successfully reached our end and logged out properly. Login time may be set even if a visitor never actually reaches our end, so it is not a reliable specifier for the metric we are aiming to extract.

traffic
(active_local_users + active_visitors) AS traffic
Simply combining the locally active users and the visitors then we can determine the overall traffic or active user count.

total_visitors
SELECT COUNT(*) AS total_visitors FROM GridUser WHERE Login > 0 AND UserID LIKE '%http%'
In difference to the activity counts simply filtering by UUI and a login time greater than zero we can see the overall number of avatars that have visited since the creation of the table. Unfortunately without storing more data it is very difficult to deduplicate these entries.

residents
SELECT COUNT(DISTINCT(Email)) AS residents FROM UserAccounts WHERE UserLevel > 0
Similarly to the active users we can remove some duplicates via the email address and obviously we need not count users that have been banned.

recent_registrations
SELECT COUNT(DISTINCT(Email)) AS recent_registrations FROM UserAccounts WHERE Created > (UNIX_TIMESTAMP() - (30*86400))
Removing the duplicates through the email address again and utilizing the timestamp of the account creation to determine the amount of newly registered users in the last 30 days.

region_count
SELECT COUNT(*) AS region_count FROM regions
A simple count of the regions listed in the table.

region_equivalent
SELECT SUM(POWER(sizeX/256, 2)) AS region_equivalent FROM regions
In order to get the actual scale of the world we can look at the reported size of a region and divide it by the standard size. Then square it to get the amount, which we only have to add up.

land_area_sqkm
SELECT (SUM(POWER(sizeX, 2)))/1000000 AS land_area_sqkm FROM regions
Using a similar method we can fetch the total size in square meters occupied by a region, which we then only have to convert to square kilometers by dividing the sum by 1.000.000

friendships
SELECT CAST(COUNT(*) / 2 AS INT) AS friendships FROM Friends Friendship JOIN Friends Friended ON Friendship.PrincipalID = Friended.Friend AND Friendship.Friend = Friended.PrincipalID
In order to count a friendship only as one friendship between two users we obviously have to divide the whole number of them found in the Friends table. This often results in an uneven number though, which can usually be resolved with two methods shown here. Casting the result as an integer forces the number to be a whole number without a comma. The other method is joining the Friends table back into itself in order to check if a friendship actually exists from both perspectives, meaning both parties have accepted the friendship and the corresponding mirrored entries exist in the table, which removes friend requests that have been ignored.

hg_friendships
SELECT CAST(COUNT(*) / 2 AS INT) AS hg_friendships FROM Friends Friendship JOIN Friends Friended ON Friendship.PrincipalID = Friended.Friend AND Friendship.Friend = Friended.PrincipalID WHERE Friendship.PrincipalID LIKE '%http%' OR Friended.PrincipalID LIKE '%http%'
The same can be applied specifically to friendships that connect users across grids. Though it should be noted that in some cases these entries are not properly set on both ends and thus this number is likely not as accurate as it may seem. Much like the GridUser table we can specifically check for the UUI to exclude local users from this number.

partnerships
SELECT COUNT(*) AS partnerships FROM userprofile WHERE useruuid IN (SELECT profilePartner FROM userprofile) AND profilePartner!='00000000-0000-0000-0000-000000000000'
These are similar to friendships in that they can technically exist only on one side, so checking whether both sides have mirrored entries makes sense. OpenSimulator does not internally handle partnering.

groups
SELECT COUNT(*) AS groups FROM os_groups_groups WHERE Location=''
Looking only at the local groups through the filter of excluding any entries that contain a location pointer to the origin grid.

known_grids
SELECT COUNT(DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(UserID, ';', 2), ';', -1)) AS known_grids FROM GridUser WHERE UserID LIKE '%http%'
AND UserID NOT RLIKE 'http\:\/\/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

We can count the number of unique grids that have been utilized by external users to visit our local installation by checking the GridUser table for all the UUI’s contained within. This should be filtered to remove entries that simply include the remote grid address as a direct IP address as the validity, accessibility or existence of installations that use them is questionable at best. This can be accomplished most easily with regex, which, despite being rather slow to execute, is still easiest. Further filtering for accessibility of existence cannot be done with SQL directly and we also cannot determine if an entry is simply a renamed existing one either.

Finally we can combine all these individual queries to provide a single row of statistics, which can be used as a single query to write the results into another table for historical information; or simply to output everything in one go and show the results. To do this each query needs to be turned into a subquery and given a temporary object to write the data to.

SELECT
online_users, active_local_users, active_visitors, (active_local_users + active_visitors) AS traffic, total_visitors, residents, recent_registrations, region_count, region_equivalent, land_area_sqkm, friendships, hg_friendships, partnerships, groups, known_grids
FROM
(SELECT COUNT(*) AS online_users FROM Presence WHERE RegionID != '00000000-0000-0000-0000-000000000000') AS stats1,
(SELECT COUNT(DISTINCT(UserAccounts.Email)) AS active_local_users FROM GridUser INNER JOIN UserAccounts ON GridUser.UserID=UserAccounts.PrincipalID WHERE GridUser.Login > (UNIX_TIMESTAMP() - (30*86400)) AND (GridUser.Logout - GridUser.Login) > 30) AS stats2,
(SELECT COUNT(*) AS active_visitors FROM GridUser WHERE UserID LIKE '%http%' AND Logout > (UNIX_TIMESTAMP() - (30*86400))) AS stats3,
(SELECT COUNT(*) AS total_visitors FROM GridUser WHERE Login > 0 AND UserID LIKE '%http%') AS stats4,
(SELECT COUNT(DISTINCT(Email)) AS residents FROM UserAccounts WHERE UserLevel > 0) AS stats5,
(SELECT COUNT(DISTINCT(Email)) AS recent_registrations FROM UserAccounts WHERE Created > (UNIX_TIMESTAMP() - (30*86400))) AS stats6,
(SELECT COUNT(*) AS region_count FROM regions) AS stats7,
(SELECT SUM(POWER(sizeX/256, 2)) AS region_equivalent FROM regions) AS stats8,
(SELECT (SUM(POWER(sizeX, 2)))/1000000 AS land_area_sqkm FROM regions) AS stats9,
(SELECT CAST(COUNT(*) / 2 AS INT) AS friendships FROM Friends Friendship JOIN Friends Friended ON Friendship.PrincipalID = Friended.Friend AND Friendship.Friend = Friended.PrincipalID) AS stats10,
(SELECT CAST(COUNT(*) / 2 AS INT) AS hg_friendships FROM Friends Friendship JOIN Friends Friended ON Friendship.PrincipalID = Friended.Friend AND Friendship.Friend = Friended.PrincipalID WHERE Friendship.PrincipalID LIKE '%http%' OR Friended.PrincipalID LIKE '%http%') AS stats11,
(SELECT COUNT(*) AS partnerships FROM userprofile WHERE useruuid IN (SELECT profilePartner FROM userprofile) AND profilePartner!='00000000-0000-0000-0000-000000000000') AS stats12,
(SELECT COUNT(*) AS groups FROM os_groups_groups WHERE Location='') AS stats13,
(SELECT COUNT(DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(UserID, ';', 2), ';', -1)) AS known_grids FROM GridUser WHERE UserID LIKE '%http%'
AND UserID NOT RLIKE 'http\:\/\/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}') AS stats14

Given the inherent complexity and size of this data, beyond just for record keeping and trend analysis, storing it elsewhere in regular intervals means the information need not be queried directly from the various sources. Caching is your friend in this case, but should already be at least partially present on the subqueries that have rarely changing data. Though this does depend on the specific software in use.

It is fairly common to provide this information publicly in order to provide transparency, but undoubtedly also to show growth and ever bigger numbers. What measure of success that ultimately provides is left to the interpretation of each individual reader as none of these metrics can quantify the human condition that makes up the vast majority of the experience we all have with a platform such as OpenSimulator. No one wants to be reduced to numbers, but they are certainly fascinating to look at.

Leave a Reply

Your email address will not be published. Required fields are marked *