I'm making this post to provide some more visibility into the recent server stability issues.
Why do we need a new server?
We are upgrading from the old server (ithor, 2015) to a new server (bespin, 2024), primarily to remain on a supported PHP version and corresponding software stack.
The old server runs PHP 7 on Apache via Nginx, on Linux. The old server is hosted directly on AWS EC2.
The new server runs PHP 8 on Apache via Nginx, in Docker, on Linux. Moving to Docker in the stack allows us to match our development processes with the production deployment. The new server is hosted on AWS Lightsail, which is more cost-effective for us.
Overall compute performance does not appear to be an issue on either server; in fact, when database performance is excluded, the new server is significantly faster.
Note that currently both www and www2 are hosted on the same physical server, bespin; the only difference is which code is executed.
Database Basics
Both servers connect to the same database, hosted on MariaDB on RDS. The migration to RDS was completed in April 2021.
The database is about 60 GB in size, comprising roughly 25 GB of active game data (entities, characters, actions) and 35 GB of archival data (events, transactions). The largest game tables are about 12 million rows, while the largest archival tables are about 50 million rows, with a typical retention of about 90 days.
Aside: What is a Vicious Cycle?
In these explanations I use the term vicious cycle to mean any situation where performance degradation causes even more performance degradation. This usually occurs when some task executes slowly while holding important resources; other tasks wait on those resources and also become slow, resulting in everything being slow.
We have some measures in place to try to mitigate vicious cycles:
- Lock Timeouts occur to preemptively kill queries that are unlikely to be successful.
- Lock Timeouts also occur to preemptively kill queries that can cause a vicious resource starvation cycle.
- Deadlock detection kills queries that can never complete successfully due to resource starvation.
Most of the time you encounter one of the issues above, it's because the server has already run out of resources due to a bug and is attempting to recover without entering a vicious cycle.
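As a rough illustration of how application code can cooperate with these safeguards, here is a minimal sketch (the function, timings, and error handling are illustrative, not our actual code) of retrying a transaction when MariaDB kills it with a lock wait timeout (error 1205) or a deadlock (error 1213):

```php
<?php
// Minimal sketch: retry a unit of work when MariaDB kills it with a lock wait
// timeout (1205) or deadlock (1213), rather than letting the failure cascade.
function runWithRetry(PDO $pdo, callable $work, int $maxAttempts = 3): void
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $pdo->beginTransaction();
            $work($pdo);
            $pdo->commit();
            return;
        } catch (PDOException $e) {
            if ($pdo->inTransaction()) {
                $pdo->rollBack();
            }
            $driverCode = $e->errorInfo[1] ?? null;
            if (!in_array($driverCode, [1205, 1213], true) || $attempt === $maxAttempts) {
                throw $e; // not a transient lock problem, or out of retries
            }
            usleep(100000 * $attempt); // back off so the contended lock can clear
        }
    }
}
```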
Infrastructure Issue #1: Database Latency
The database round-trip latency from ithor was approximately 0.4ms, but from bespin it is much higher at roughly 1ms RTT. This has a significant impact on many heavy pages, since a page typically performs many thousands of tiny queries in succession.
This manifests as large pages (such as the scanner page) feeling slow to load: rendering time typically increases from under 1 second to 2-3 seconds, with roughly 98% of that time spent waiting on round trips.
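To make the round-trip cost concrete, here is a hedged sketch (table and column names are made up) of why a page that issues thousands of tiny queries slows down in direct proportion to RTT, and how batching removes most of that cost:

```php
<?php
// Illustrative only: N small queries cost roughly N * RTT regardless of how
// fast each individual query is on the database side.
$pdo = new PDO('mysql:host=db;dbname=game', 'user', 'pass');
$entityIds = range(1, 3000);

// Per-row lookups: ~3000 round trips. At 1ms RTT that is ~3s of pure waiting.
$rows = [];
$stmt = $pdo->prepare('SELECT id, name, type FROM entities WHERE id = ?');
foreach ($entityIds as $id) {
    $stmt->execute([$id]);
    $rows[$id] = $stmt->fetch(PDO::FETCH_ASSOC);
}

// Batched lookup: the same data in a single round trip.
$placeholders = implode(',', array_fill(0, count($entityIds), '?'));
$batch = $pdo->prepare("SELECT id, name, type FROM entities WHERE id IN ($placeholders)");
$batch->execute($entityIds);
$rows = $batch->fetchAll(PDO::FETCH_ASSOC);
```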
We suspect the increased latency is due to VPC transit between Lightsail and RDS. Unfortunately this is not something we can fix by simply upgrading the server size.
Planned Permanent Fix: We will migrate off RDS and find somewhere lower latency to host the database server.
We will either use the Lightsail database offering, or host the database locally on the same physical server as the webserver.
Infrastructure Issue #2: Database Sizing
Since moving to RDS, the database had been running as an m5.xlarge, even though a smaller class (m5.large) should have been sufficient. We resized the database to m5.large at the start of the new server migration. The smaller database class has performed more poorly than expected, most likely exacerbated by the self-inflicted database issues listed below.
Planned Permanent Fix: Resizing the database server appropriately when we migrate the database.
Specifically, our database needs a larger buffer cache to hold more of the game state in memory.
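In MariaDB the buffer cache is the InnoDB buffer pool, so "larger buffer cache" concretely means a bigger innodb_buffer_pool_size. A rough way to sanity-check whether the pool is big enough is the buffer pool hit rate; the sketch below (thresholds illustrative) shows the idea:

```php
<?php
// Rough sketch: estimate how well the InnoDB buffer pool is covering the
// working set. The 99% threshold is illustrative, not a hard rule.
$pdo = new PDO('mysql:host=db;dbname=game', 'user', 'pass');

$status = $pdo->query(
    "SHOW GLOBAL STATUS WHERE Variable_name IN
     ('Innodb_buffer_pool_reads', 'Innodb_buffer_pool_read_requests')"
)->fetchAll(PDO::FETCH_KEY_PAIR);

$diskReads   = (int) $status['Innodb_buffer_pool_reads'];         // had to hit disk
$allRequests = (int) $status['Innodb_buffer_pool_read_requests']; // served overall
$hitRate     = $allRequests > 0 ? 1 - ($diskReads / $allRequests) : 1.0;

if ($hitRate < 0.99) {
    // On a hot OLTP workload this usually means the buffer pool is too small
    // to hold the active game data (~25 GB in our case).
    echo 'Buffer pool hit rate low: ' . round($hitRate * 100, 2) . "%\n";
}
```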
SWC Issue #3: Archival Table Bloat
Archival tables (like events and transactions) have maintenance tasks that trim them to their desired retention length. These trimming jobs typically run as part of the daily jobs.
One large table's trimming job was disabled during some maintenance work and was never re-enabled, causing that table to bloat to over 60GB. This severe bloat likely hurt the performance of the whole database server.
This issue was identified and fixed by truncating the table (on D197).
Planned Permanent Fix: Partitioning for all Archival tables over a certain size.
The archival tables are now large enough that using database partitioning is the only viable permanent solution.
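As a sketch of what this looks like (table, column, and partition names are illustrative), an archival table can be range-partitioned by day so that retention trimming becomes a cheap metadata operation:

```php
<?php
// Illustrative sketch: range-partition an archival table by day so retention
// trimming becomes "drop old partitions" instead of deleting millions of rows.
$pdo = new PDO('mysql:host=db;dbname=game', 'user', 'pass');

// One-time conversion (the partition column must be part of every unique key).
$pdo->exec("
    ALTER TABLE events_archive
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p20250715 VALUES LESS THAN (TO_DAYS('2025-07-16')),
        PARTITION p20250716 VALUES LESS THAN (TO_DAYS('2025-07-17')),
        PARTITION pmax      VALUES LESS THAN MAXVALUE
    )
");

// Daily maintenance: dropping a partition is near-instant and does not scan
// or lock millions of rows the way DELETE ... WHERE created_at < ? does.
$pdo->exec('ALTER TABLE events_archive DROP PARTITION p20250715');
```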
SWC Issue #4: Inefficient Notifier/Events Queries
To generate the flashing notifiers at the top of each page, the events tables are queried using a permission mask so that only relevant events are found.
However, there is an issue when a faction member has no privileges in a large faction: the query to find visible events can end up scanning the entire events table. This slow query can trigger a vicious cycle, since notifiers are generated for every user on many pages, and the resources consumed generating one user's notifiers can block other, unrelated users from generating theirs. Notifiers are already heavily cached.
A workaround was implemented to limit the events scan to a one-week window.
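A sketch of what that bound looks like (table, column, and parameter names are illustrative, not the real schema):

```php
<?php
// Illustrative sketch of the workaround: bound the notifier lookup to a
// one-week window so a permission mask that matches little or nothing cannot
// force a scan of the entire events table.
$pdo = new PDO('mysql:host=db;dbname=game', 'user', 'pass');
$factionId = 42;
$permissionMask = 0b0001;

$stmt = $pdo->prepare('
    SELECT COUNT(*) FROM events
    WHERE faction_id = :faction
      AND (visibility_mask & :mask) <> 0
      AND created_at >= :since
');
$stmt->execute([
    'faction' => $factionId,
    'mask'    => $permissionMask,
    'since'   => (new DateTimeImmutable('-7 days'))->format('Y-m-d H:i:s'),
]);
$unseenEvents = (int) $stmt->fetchColumn();
```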
Planned Permanent Fix: Partitioning the events tables will reduce the performance impact of regenerating the notifiers.
This will reduce the chance that an events lookup query can use too many resources.
Planned Permanent Fix: Decouple notifiers from the database, to prevent notifier failures from cascading between players.
This will reduce the chance that one player failing to generate notifiers affects other, unrelated players.
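One way to think about the decoupling (this is a sketch of the general approach, not the actual implementation; it assumes an APCu-style local cache and made-up table names) is to serve notifier counts from a short-lived cache and fall back to the last known value if the database lookup fails:

```php
<?php
// Sketch: serve notifier counts from a short-lived cache, and if the database
// lookup fails, fall back to the last known value instead of blocking the page.
function notifierCount(PDO $pdo, int $characterId): int
{
    $key = "notifiers:$characterId";
    $cached = apcu_fetch($key, $hit);
    if ($hit) {
        return (int) $cached;
    }
    try {
        $stmt = $pdo->prepare('SELECT COUNT(*) FROM events WHERE character_id = ? AND seen = 0');
        $stmt->execute([$characterId]);
        $count = (int) $stmt->fetchColumn();
    } catch (PDOException $e) {
        // Degrade gracefully: a stale count (or zero) is better than letting one
        // player's failed lookup tie up resources that other players need.
        return (int) apcu_fetch("$key:stale");
    }
    apcu_store($key, $count, 30);       // fresh value, short TTL
    apcu_store("$key:stale", $count);   // long-lived fallback copy
    return $count;
}
```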
SWC Issue #5: Character Creation Table Scans
When a new character was created, or certain operations on handles occurred, several queries were executed that lock up the main player accounts table. Anything that locks a full table has the potential to cause a vicious cycle, especially since access to the player accounts table is required to generate most pages.
A workaround has been implemented to disable these queries.
Planned Permanent Fix: Modifications to the player accounts table will be decoupled from main page rendering.
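A sketch of the direction (the queue table and function names are hypothetical): push the lock-heavy handle checks onto a queue that a background job drains, so no page render has to take a lock on the accounts table:

```php
<?php
// Hypothetical sketch: instead of running handle checks that lock the whole
// accounts table during character creation, enqueue the work and let a
// background job process it outside the request path.
function enqueueHandleAudit(PDO $pdo, int $accountId, string $newHandle): void
{
    // Cheap, indexed insert; no scan or table lock on the accounts table itself.
    $stmt = $pdo->prepare(
        'INSERT INTO handle_audit_queue (account_id, handle, queued_at) VALUES (?, ?, NOW())'
    );
    $stmt->execute([$accountId, $newHandle]);
}
// A cron/daily job then drains the queue in small batches, so any expensive
// scans happen in one controlled place instead of during page rendering.
```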
SWC Issue #6: PHP Upgrade Backwards Incompatibilities
Due to upgrading from PHP 7 to 8, there are a large number of code incompatibilities that may show up as page crashes or errors. The most visible of these were issues in vendor code. As of Y25D197 all known compatibility issues are fixed or suppressed.
Most PHP 8 issues that arose could be mitigated by using www2.
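For context, these are the kinds of behaviour changes between PHP 7 and PHP 8 that surface as new errors after an upgrade (illustrative examples, not excerpts from our code):

```php
<?php
// Illustrative PHP 7 -> 8 behaviour changes, not excerpts from our codebase.

// 1. Loose string/number comparison changed: true in PHP 7, false in PHP 8.
var_dump(0 == "foo");

// 2. Division by zero: a warning in PHP 7, a DivisionByZeroError in PHP 8.
try {
    $x = 1 / 0;
} catch (DivisionByZeroError $e) {
    echo 'caught: ' . $e->getMessage() . "\n";
}

// 3. Invalid argument types to internal functions: a warning (returning null)
//    in PHP 7, a TypeError in PHP 8.
try {
    strlen([]);
} catch (TypeError $e) {
    echo 'caught: ' . $e->getMessage() . "\n";
}
```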
SWC Issue #7: Clicking Thundering Herd Issues
A number of pages and buttons (and recent events) allow the required processing to be multiplied by repeated clicking or refreshing. The most recent server outage (today's) was caused by this.
A few examples:
- Travel buttons can be clicked multiple times causing the same processing to occur multiple times.
- Inventory and WS endpoints can load in parallel.
- Ajax/Javascript functionality can often load multiple times in parallel.
Planned Permanent Fix: The UI will be improved to reduce the chance of multiple in-flight requests being sent.
Currently it is very easy to unintentionally send multiple requests when using the UI, which causes unnecessary server load.
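The fix itself is UI-side, but as an illustration of the same idea on the server (hypothetical helper names, APCu assumed as the local cache), duplicate in-flight requests can be rejected with a short-lived per-player lock:

```php
<?php
// Hypothetical server-side companion to the UI fix: reject a request when the
// same player already has the same action in flight. Uses APCu's atomic add
// as a short-lived lock; key names and TTL are illustrative.
function tryBeginAction(int $characterId, string $action, int $ttlSeconds = 10): bool
{
    // apcu_add() only succeeds if the key does not already exist.
    return apcu_add("inflight:$characterId:$action", 1, $ttlSeconds);
}

function endAction(int $characterId, string $action): void
{
    apcu_delete("inflight:$characterId:$action");
}

// Usage in a handler, e.g. a travel button:
$characterId = 42;
if (!tryBeginAction($characterId, 'travel')) {
    exit('Travel is already in progress.');
}
try {
    // ... perform the travel processing exactly once ...
} finally {
    endAction($characterId, 'travel');
}
```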
Herdfest Issue #8: Slow Scanners
During Herdfest, a large number of character squads within the same room highlighted issues with the scanner pages. Specifically, rendering some scanner page details required a large amount of server resources.
The scanner was modified to fall back to a simpler version that hid details until manually loaded. This version of the scanner will show up in any room above a certain size.
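A sketch of how that fallback works (the threshold and the rendering logic here are illustrative):

```php
<?php
// Illustrative sketch of the scanner fallback: above a certain room population,
// render a lightweight summary and defer per-entity details until requested.
const SCANNER_DETAIL_LIMIT = 100; // illustrative threshold

function renderScanner(array $entitiesInRoom): string
{
    if (count($entitiesInRoom) > SCANNER_DETAIL_LIMIT) {
        // Cheap path: counts only; details are loaded on demand per entity.
        return 'Scanner: ' . count($entitiesInRoom) . " contacts (details hidden until loaded)\n";
    }
    // Normal path: full details for every entity in the room.
    $out = '';
    foreach ($entitiesInRoom as $e) {
        $out .= "{$e['name']} ({$e['type']})\n";
    }
    return $out;
}
```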
The quest team moved quickly to redesign the second part of Herdfest to reduce the number of concurrent players per room, and this significantly improved server performance.
Possible Future Improvement: Making scanner detail loading more streamlined, such as automatically loading details when they are interacted with.
Herdfest Issue #9: Slow Combat Resolution
During Herdfest, a large amount of concurrent combat caused timer resolution to slow down. We run combat synchronously, since the outcome of combat can make large changes (such as the destruction of entities). Combat resolution took approximately 2.2 seconds, and the large number of players fighting concurrently meant that the tick became saturated at around 300-400 players repeatedly attacking.
We found several potential improvements, but none were deployed during Herdfest due to the overall risk of the changes. These improvements have since been deployed, with combat resolution currently taking approximately 1.0 seconds.
Conclusion
Thanks for your patience while we resolve these issues.