Earlier this year, Target had a major database outage as a result of an error made during regular retail information system (RIS) maintenance. During the outage, Target was unable to process consumer credit card transactions for over an hour, resulting in a frustrating experience for shoppers. In 2019, when high availability is the norm, why do retailers still experience outages? And what can retailers like Target do to avoid costly downtime in advance of the holiday season?
Understanding Resilient Systems And High Availability
Any downtime, even one second, damages brand reputation and consumer trust and leads to lost transactions, which cuts into revenue. This is especially important as Amazon continues to dominate the retail industry. The way to compete with Amazon is by providing consumers with frictionless, fast and efficient service, especially during the holiday season.
Historically, highly available systems were single machines with redundant, hot-swappable power supplies, disk drives and even CPUs. Today there are better approaches to high availability than hardening a single machine. Retailers now have access to services that achieve high availability by running on large clusters of machines, where any node in the cluster can fail without impacting the consumer experience.
This means that a disaster affecting one node does not stop the retailer's entire infrastructure from processing transactions. Resilient, distributed retail systems that are built to fail over in cloud-native ways are far less prone to outages like the one Target experienced.
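To make the idea of node-level failover concrete, here is a minimal sketch in Python. The `Node` class, the `process_with_failover` function and the node names are hypothetical and exist only for illustration; a production cluster would add health checks, load balancing and retries with backoff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """One machine in a hypothetical cluster (illustrative only)."""
    name: str
    healthy: bool = True

    def process(self, transaction_id: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{transaction_id} processed by {self.name}"


def process_with_failover(nodes: List[Node], transaction_id: str) -> str:
    """Try each node in turn; fail only if every node is down."""
    for node in nodes:
        try:
            return node.process(transaction_id)
        except ConnectionError:
            continue  # this node failed, so try the next one
    raise RuntimeError("all nodes are unavailable")


cluster = [Node("node-a"), Node("node-b"), Node("node-c")]
cluster[0].healthy = False  # simulate the failure of a single node
result = process_with_failover(cluster, "txn-1001")
print(result)  # txn-1001 processed by node-b
```

The shopper's transaction still completes because the request simply moves to the next healthy node. The outage becomes visible only if every node fails at once.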
It’s 2019. Why Do RIS Outages Still Happen?
Good question. Outages happen for many reasons, including network configuration errors made by humans, denial-of-service attacks and natural disasters. While none of these causes is going away any time soon, some RIS outages caused by human error can be averted, and data outages due to disasters can be made history.
One approach retailers can use to prevent costly outages is to make retail information systems resilient. This means moving legacy workloads such as point-of-sale, inventory and shopper behavior data to the cloud and taking the following steps:
- Eliminate maintenance windows: Maintenance windows are designed to minimize user impact, but any planned downtime is an unacceptable consumer experience. Look for an RIS that keeps point-of-sale systems up and eliminates all mandatory maintenance downtime.
- Automate resilience testing: Simulating disasters (and recovering from them) shouldn’t be a manual process. Companies that build this into their everyday processes, as Netflix does, have a working model that makes emergency protocols the norm rather than an edge case.
- Adopt a self-healing database: Choosing a database that self-organizes, self-heals and automatically rebalances is a crucial component of resilience. By making component failure an expected event that a system can handle gracefully, retailers will be prepared when a data center goes down for whatever reason.
- Select an RIS with built-in redundancy to ensure data replicates: To survive disasters, retailers need to replicate the data stored in their RIS and have redundancy built into the system to eliminate single points of failure. Prepare for disaster by having data transactionally replicated to multiple data centers as part of normal operation.
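Two of the points above, replicating data to multiple data centers and automating resilience testing, can be sketched together in a few lines of Python. Everything here is an assumption made for illustration: the `DataCenter` class, the `replicated_write` and `failure_drill` functions and the site names are hypothetical. The sketch uses a simple majority rule; real systems rely on a consensus protocol and also bring lagging replicas back up to date, which this example does not attempt.

```python
class DataCenter:
    """A hypothetical data center holding one replica of the data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.records: dict = {}

    def write(self, key: str, value: str) -> bool:
        """Store the value and acknowledge, unless the site is offline."""
        if not self.online:
            return False
        self.records[key] = value
        return True


def replicated_write(data_centers, key: str, value: str) -> bool:
    """Report success only if a majority of replicas acknowledge the write.

    Simplification: a write that misses the majority is not rolled back.
    """
    acks = sum(dc.write(key, value) for dc in data_centers)
    return acks > len(data_centers) // 2


def failure_drill(data_centers) -> bool:
    """Take each data center offline in turn and check writes still succeed."""
    survived = True
    for i, dc in enumerate(data_centers):
        dc.online = False  # simulate a disaster at this site
        survived = replicated_write(data_centers, f"drill-{i}", "ok") and survived
        dc.online = True   # bring the site back before the next round
    return survived


sites = [DataCenter("east"), DataCenter("central"), DataCenter("west")]
drill_passed = failure_drill(sites)
print(drill_passed)  # True: any single site can fail without blocking writes
```

With three sites, losing any one still leaves two acknowledgments, which is a majority. Running a drill like this on a schedule, rather than by hand, is what turns disaster recovery from an edge case into routine.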
While it might seem daunting to migrate RIS data to the cloud, it is an effective way to keep processing credit card transactions during the busy holiday season without experiencing a costly outage. In order to compete, retailers need to consider adopting a high availability platform that supports their RIS and helps them move merchandise.
Peter Mattis is the co-founder and CTO of Cockroach Labs, where he works on a bit of everything, from low-level optimization of code to refining the overall design. He has worked on distributed systems for most of his career, designing and implementing the original Gmail backend search and storage system at Google and designing and implementing Colossus, the successor to Google's original distributed file system. In his university days, he was one of the original authors of the GIMP and is still amazed when people tell him they use it frequently.