Jan 12, 2021 Perpetual swap system failure report

Published on Jan 15, 2021. Updated on Apr 12, 2024.

1. System downtime description:
On Jan. 12, the OKX perpetual swap trading service was suspended twice due to system downtime, first at 7:19 am UTC and again at 9:36 am UTC.
After troubleshooting, we identified the cause of each downtime:

For the first downtime, at 7:19 am UTC, OKX found that a system upgrade for perpetual trading had started at 6:30 am UTC. After that upgrade, a configuration error prevented the TBT channel for perpetual swaps from passing the depth data. This triggered an emergency mechanism, which suspended the perpetual trading service.

For the second downtime, at 9:36 am UTC, OKX detected that after the WebSocket push system upgrade, an abnormality occurred in the components shared by the push system and the perpetual trading system. This interrupted the processing of perpetual transactions and led to the suspension of the perpetual trading service.

Timeline of the first downtime on Jan. 12:
At 6:30 am UTC, the perpetual swap system upgrade was initiated as planned.
At 6:41 am UTC, the perpetual swap system upgrade was completed.
At 6:42 am UTC, OKX detected the configuration error that prevented the TBT channel from passing the trading depth data and began an urgent repair.
At 7:19 am UTC, OKX suspended the perpetual trading service and started system maintenance.
At 7:39 am UTC, the system maintenance was completed and the perpetual trading service resumed.

Timeline of the second downtime on Jan. 12:
At 9:00 am UTC, the WebSocket system upgrade was initiated as planned.
At 9:32 am UTC, the WebSocket system upgrade was completed.
At 9:33 am UTC, OKX detected an abnormality in the perpetual trading system and began an urgent repair.
At 9:36 am UTC, OKX suspended the perpetual trading service and started system maintenance.
At 10:10 am UTC, the system maintenance was completed and the perpetual trading service resumed.
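
For readers who want the length of each suspension, the two timelines can be reduced to simple arithmetic. The sketch below only uses the suspension and resumption times stated in this report. The helper name `outage_minutes` is illustrative and not part of any OKX tooling.

```python
from datetime import datetime


def outage_minutes(suspended: str, resumed: str) -> int:
    """Return whole minutes between two same-day UTC times given as 'H:MM'."""
    fmt = "%H:%M"
    delta = datetime.strptime(resumed, fmt) - datetime.strptime(suspended, fmt)
    return int(delta.total_seconds() // 60)


# (suspended, resumed) times in UTC, copied from the two timelines in this report.
outages = {
    "first": ("7:19", "7:39"),
    "second": ("9:36", "10:10"),
}

durations = {name: outage_minutes(*span) for name, span in outages.items()}
total = sum(durations.values())

print(durations)  # {'first': 20, 'second': 34}
print(total)      # 54
```

By this calculation, perpetual trading was suspended for 20 minutes in the first downtime and 34 minutes in the second, 54 minutes in total.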

2. What are we doing to ensure the stability of the OKX platform?

OKX provides 24/7 trading services and is dedicated to making its trading system as stable and smooth as possible. Given the complexity of a high-performance trading system and the abnormalities that can arise unexpectedly, we cannot guarantee that the system will work perfectly at all times. We are nevertheless working to improve system stability and minimize the probability of downtime in several ways, including:

1). We are strengthening engineering quality assurance and optimizing the test system. Code for new functions can be launched only after it has run stably for a period of time in demo trading.
2). We are upgrading the architecture to provide high availability across multiple servers in various regions, which reduces downtime caused by hardware and software problems.
3). We will implement hot upgrades in a stateless way, which reduces the impact of upgrades on user transactions.

3. How to get updates from OKX?

(1) As soon as we detect a failure, we will publish a failure notification on the Status page.
(2) If a system upgrade is scheduled, we will publish a notification on the Status page and notify users via market and community channels (API user community + regular user community). API users can also receive these updates by subscribing to the System/Status channel.
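
As a rough illustration of what subscribing to a status channel could look like for an API user, the sketch below builds a subscription request and checks whether an incoming message is a status push. The endpoint URL, the channel name `status`, and the message shapes are assumptions based on the general style of the OKX WebSocket API and are not taken from this report. Confirm them against the current API documentation before use. The code does not open a network connection.

```python
import json

# Assumed values, not taken from this report. Verify against the current API docs.
PUBLIC_WS_URL = "wss://ws.okx.com:8443/ws/v5/public"
STATUS_CHANNEL = "status"


def build_subscribe_message(channel: str = STATUS_CHANNEL) -> str:
    """Build the JSON text a client would send to request a channel subscription."""
    return json.dumps({"op": "subscribe", "args": [{"channel": channel}]})


def is_status_update(raw_message: str, channel: str = STATUS_CHANNEL) -> bool:
    """Return True if a received message is a data push for the given channel.

    Assumes pushes look like {"arg": {"channel": ...}, "data": [...]}.
    """
    message = json.loads(raw_message)
    same_channel = message.get("arg", {}).get("channel") == channel
    return same_channel and bool(message.get("data"))


if __name__ == "__main__":
    # A WebSocket client would send this text to PUBLIC_WS_URL after connecting.
    print(build_subscribe_message())
```

A client would typically pass every received text frame to `is_status_update` and alert its operators, or pause its trading strategy, when the function returns True.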