The stability of online services built on cloud platforms became a major concern after the large-scale failure of the AWS Seoul region in November last year. Traditionally, high-availability (HA) designs add redundancy within a region so that the failure of some systems does not interrupt the service. However, HA within a single region cannot cope with a problem that affects the whole region. A multi-region disaster recovery (DR) system is therefore needed.
This article introduces the multi-region DR system applied to the PallyCon cloud service, which automatically responds to large-scale cloud platform failures and minimizes the damage.
Introducing PallyCon DR System
The PallyCon DR system uses the AWS Seoul region as the main system under normal conditions. When the health check of the Seoul region detects a failure of the main system, the service is automatically switched to the backup system in the Tokyo region.
|Cycle|30 seconds (can be reduced to a minimum of 10 seconds)|
|Method|Call a specific API, such as the DRM license request URL, to check whether the region's database connection is normal|
|Failover condition|If a service failure is detected continuously for 3 minutes, the service is switched to the Tokyo region. Once the Seoul region recovers and a normal service state is detected continuously for 3 minutes, the service returns to the Seoul region.|
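The failover and failback conditions in the table can be read as a small state machine. The following is a minimal sketch of that reading, not PallyCon's actual implementation. It assumes that "continuously for 3 minutes" at a 30-second cycle means six consecutive check results; the class name `FailoverState` and the region labels `"seoul"` and `"tokyo"` are hypothetical.

```python
from dataclasses import dataclass

CHECK_INTERVAL_S = 30   # health-check cycle from the table
SWITCH_WINDOW_S = 180   # 3 minutes of continuous detection
# Assumption: a 3-minute window is approximated as 6 consecutive checks.
THRESHOLD = SWITCH_WINDOW_S // CHECK_INTERVAL_S


@dataclass
class FailoverState:
    """Tracks which region serves traffic (hypothetical helper)."""
    active: str = "seoul"
    streak: int = 0  # consecutive results that argue for switching regions

    def record(self, seoul_healthy: bool) -> str:
        """Feed one health-check result and return the active region."""
        # A result argues for a switch when Seoul is active but unhealthy,
        # or when Tokyo is active but Seoul has become healthy again.
        wants_switch = (self.active == "seoul") != seoul_healthy
        # Any contrary result resets the streak, so detection is continuous.
        self.streak = self.streak + 1 if wants_switch else 0
        if self.streak >= THRESHOLD:
            self.active = "tokyo" if self.active == "seoul" else "seoul"
            self.streak = 0
        return self.active
```

Using the same threshold in both directions mirrors the table, where failover and failback each require 3 minutes of consistent results. A single healthy check in the middle of an outage resets the counter, which keeps short glitches from triggering a region switch.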
DR Server Architecture and Restrictions
The database used by the PallyCon service is replicated in real time to a cross-region replica. When the service runs in the Tokyo region because of a failure, it operates in a 'Read Only' state in which existing information can still be queried and licenses can still be issued. This backup system minimizes the impact of a regional failure on PallyCon's customers.
However, new data such as content packaging information cannot be written during the failure, because multi-master operation of the database across regions is not supported.
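One way to picture the 'Read Only' restriction is a guard that rejects write operations while the backup region is serving. This is only an illustrative sketch: the operation names, the function `check_operation`, and the region labels are hypothetical and do not come from the PallyCon API.

```python
class ReadOnlyModeError(RuntimeError):
    """Raised when a write is attempted while the backup region is serving."""


# Hypothetical operation name, used only for this illustration.
WRITE_OPERATIONS = {"register_packaging_info"}


def check_operation(operation: str, active_region: str,
                    primary_region: str = "seoul") -> None:
    """Allow reads everywhere; allow writes only in the primary region."""
    if active_region != primary_region and operation in WRITE_OPERATIONS:
        raise ReadOnlyModeError(
            f"'{operation}' is unavailable while {active_region} is active"
        )


# License issuance only reads existing data, so it passes during failover.
check_operation("issue_license", active_region="tokyo")
```

Rejecting writes explicitly, rather than letting them fail against the replica, would give clients a clear signal that they can retry after the service returns to the main region.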
By default, the backup system in the Tokyo region runs one instance of each major server, but auto-scaling expands it automatically according to traffic.
Automatic Configuration and Deployment of DR Infrastructure
The PallyCon DR system uses AWS CloudFormation or Terraform to handle repetitive DR infrastructure configuration and deployment quickly and accurately.
This setup enables asynchronous copying of changed resources such as AMIs, and automates configuration and deployment tasks such as updating the launch configuration after an AMI change and deploying auto-scaling.
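As an illustration of the AMI copy step, the sketch below uses the AWS SDK's EC2 `copy_image` call instead of CloudFormation or Terraform, so it shows the underlying operation rather than the tooling named above. The function name `copy_ami_to_backup` and the AMI ID and name in the usage comment are hypothetical. The client is passed in so that the function itself does not depend on how credentials are set up.

```python
from typing import Any


def copy_ami_to_backup(ec2_backup: Any, source_image_id: str, name: str,
                       source_region: str = "ap-northeast-2") -> str:
    """Request a copy of an AMI from the main region into the backup region.

    `ec2_backup` is expected to be an EC2 client bound to the backup region.
    The default source region, ap-northeast-2, is Seoul.
    """
    # copy_image is issued against the destination region and returns
    # immediately; the copy itself completes asynchronously.
    response = ec2_backup.copy_image(
        Name=name,
        SourceImageId=source_image_id,
        SourceRegion=source_region,
    )
    return response["ImageId"]


# Usage with boto3 (requires AWS credentials; values are placeholders):
#   import boto3
#   tokyo = boto3.client("ec2", region_name="ap-northeast-1")
#   new_ami = copy_ami_to_backup(tokyo, "ami-0123456789abcdef0", "web-server")
```

The returned image ID is the value that a subsequent launch configuration update in the backup region would reference once the copy becomes available.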
The use of cloud platforms such as AWS for online services is becoming commonplace, and the transition from on-premises solutions to cloud-based software as a service (SaaS) is accelerating. Beyond protecting your own service, you also need to make sure that the SaaS solutions it depends on are covered by a disaster recovery system.
The PallyCon multi-region DR system can recover the PallyCon service in about three minutes even when a major failure affects an entire region of the AWS cloud platform. This minimizes the impact on your business continuity and finances.