Update On Server Bugs Affecting Recent Events

Robloxxbetter

Hey chess community. We wanted to share an update on two recent server bugs that impacted both the Bullet Chess Championship and the Clash of Claims.

In the Clash of Claims, there were two separate issues. The first was that GM Kramnik’s freshly unboxed computer had not finished setting up, which left its local system clock inaccurate. This led to clock errors in two games, both of which were discarded from the event. Our team has since added code to handle inaccurate local system clocks.
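This post does not describe the actual fix, so the following is only a minimal sketch of one common way to tolerate a wrong local clock: estimate the offset between server time and the local reading from a request/reply round trip (often called Cristian’s algorithm), and trust the server’s timeline rather than the local wall clock. The names `ClockSample`, `estimate_offset_ms`, and `best_offset_ms` are hypothetical and not part of any Chess.com code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClockSample:
    sent_ms: int      # local reading when the client sent the request
    server_ms: int    # server timestamp carried in the reply
    received_ms: int  # local reading when the reply arrived


def estimate_offset_ms(sample: ClockSample) -> int:
    """Estimate (server time - local reading), assuming symmetric network delay."""
    round_trip = sample.received_ms - sample.sent_ms
    # The server stamped its time roughly half a round trip before the reply arrived.
    return sample.server_ms + round_trip // 2 - sample.received_ms


def best_offset_ms(samples: List[ClockSample]) -> int:
    """Use the sample with the shortest round trip, which bounds the error most tightly."""
    best = min(samples, key=lambda s: s.received_ms - s.sent_ms)
    return estimate_offset_ms(best)


# Hypothetical samples: the local clock is about 400 seconds behind the server.
samples = [
    ClockSample(sent_ms=100_000, server_ms=500_000, received_ms=100_200),
    ClockSample(sent_ms=101_000, server_ms=501_020, received_ms=101_040),
]
offset = best_offset_ms(samples)
```

With an offset like this, a client could display remaining time as a local reading plus `offset` instead of relying on a system clock that may be wrong.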

The second issue was that certain moves were not correctly delivered to opponents. It happened twice, but it materially affected only one game, in which Kramnik didn’t receive the move 26.Qa5. Kramnik should be awarded the win in that game. In the other game, the system recovered and only game observers - not players - had a move delay. That instance was only noticed afterward when we reviewed the logs.

When this bug appeared in the Clash of Claims, we immediately started investigating. At first, we thought that the Live Chess Server was overwhelmed by the high volume of observers in the games and a record number of people trying to challenge the players while they were playing. That turned out to be correlated, not causal. The actual root cause was a bug in the underlying game connection code. Specifically, when certain Android clients reset their HTTP/2 connection, there was a chance that the move publication action would fail at the client-server level. If just one of those clients watching the game disconnected, the move-sending channel could stop sending moves to everyone - players and observers alike. This is why games with many viewers were far more likely to encounter the bug. Looking at the data from the last two weeks, we see that 95% of users impacted by this issue were on Android 5 or Android 6; the remaining 5% were top players being observed by users with these older Android devices.

  • The code that handles this real-time connection is part of CometD, the open-source “Highly Scalable Clustered Web Messaging” system that we use in our live chess server. We updated our server with the latest patch on June 13 and have not seen any of these issues since that time. (See chart below.)
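The failure mode described above, where one disconnecting observer stalls delivery for everyone, can be illustrated with a small sketch. This is not CometD’s code or the actual patch; it only shows the general idea of isolating each subscriber’s delivery so that one failed send does not stop the rest. The function `broadcast` and the subscriber names are hypothetical.

```python
from typing import Callable, Dict, List

# A subscriber is modeled as a callable that sends one move to one client.
Subscriber = Callable[[str], None]


def broadcast(move: str, subscribers: Dict[str, Subscriber]) -> List[str]:
    """Deliver a move to every subscriber and return the ids whose delivery failed."""
    failed: List[str] = []
    for sub_id, send in subscribers.items():
        try:
            send(move)
        except ConnectionError:
            # Record the failure and keep going, so that one broken
            # connection cannot block players and other observers.
            failed.append(sub_id)
    return failed


received: List[str] = []


def flaky_observer(move: str) -> None:
    # Stands in for a client whose connection was reset mid-delivery.
    raise ConnectionError("connection reset")


subscribers: Dict[str, Subscriber] = {
    "white": received.append,
    "old_android_observer": flaky_observer,
    "black": received.append,
}
failed = broadcast("26.Qa5", subscribers)
```

Here both players still receive the move even though the observer’s delivery raises, and the failed id is returned so the caller can retry or drop that connection.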

Unfortunately, before we were able to update to the latest version of CometD, three games in the Bullet Chess Championship were also affected: Naroditsky vs Nakamura, Firouzja vs Nihal, and Firouzja vs Sevian. In the first two games the system recovered with no impact. In the third game, Sam Sevian won on time because Alireza didn’t receive the move. We decided to discard the game, and Sam went on to lose the match. This was understandably very frustrating for the players, especially for Sam, who showed great sportsmanship.

Much to our dismay, this bug has been around for several years, but it flew under the radar because most of the top games on our platform are either watched via chess.com/events (which circumvents the entire issue by working as a buffer for client connections) or observed through a newer server that doesn’t rely on CometD. Unfortunately, it took a very high-stakes event to give us the data and prompt the investigation needed to uncover the root cause.

We are sorry that these bugs affected these events, and that they have also affected players from time to time for so long, but we are glad to have fixed the root of the issues. We strive for the smoothest and best possible connections for all players - professional and casual alike - which is why we will continue to invest in more telemetry, logging, and monitoring for all of our events and players.