It’s summer, and that means ecommerce businesses are once again gearing up for the peak selling periods of back-to-school and the holidays. As in years past, they are making sure their infrastructure is ready to cater to temporary spikes in demand by adding the proper computing resources.
But another principle is quickly gaining in significance and deserves equal attention: observability, or an organization’s ability to analyze, manage and make sense of the huge volumes of data (logs, metrics and traces) being generated in order to anticipate and uncover behavioral issues across a distributed system. With masses of data flowing into an organization, especially during peak sales periods, observability becomes critical for making informed decisions.
To date, most ecommerce businesses have adhered to the traditional centralized observability approach, where teams must collect, ingest and index data before they can ask questions of it. Such an approach may have worked as recently as a few years ago, but it no longer suffices for a mission-critical ecommerce operation.
First, as architectures have grown increasingly distributed (through containers, the cloud, microservices and more), they are generating ever larger data volumes, far too much for humans to make sense of on their own, or even for machines to analyze without performance issues. Second, in ecommerce, minutes of downtime can equate to millions of lost dollars; Amazon, for example, loses an estimated $220,000 per minute of downtime.
Even if your business doesn’t risk losing that much, outages cause considerable collateral damage to brand reputation, customer satisfaction and more. Outages and other performance degradations must be avoided, and ideally sniffed out proactively to the fullest extent possible.
In ecommerce, with thousands or millions of dollars on the line, getting data as close to real time as possible is vitally important. Think of situations like “flash sales,” where a sale may be active for less than an hour. In such a context, the time needed to centralize data puts an organization behind the eight ball when it comes to shortening mean time to repair (MTTR), which ideally should approach zero.
Another challenge is the volume-based limits imposed by many of these centralized data repositories. We often see this challenge push organizations toward one of two extremes. Some end up spending an inordinate amount on capacity that gets consumed by high-volume, low-value data. At the other extreme are companies that drop or filter out logs, making sometimes arbitrary decisions about which datasets to analyze and which to neglect, to the point that they create blind spots.
The answer lies in striking a critical balance: being able to observe and fully leverage all of an organization’s datasets, but in a highly efficient manner in terms of both cost and manpower. As you consider your observability approach with peak sales periods approaching, here are some questions you should keep in mind:
- How much observability is enough? Ideally an organization should be able to observe all its data, and a good rule of thumb is that the more stringent the SLA, the more observability is required. Most organizations simply cannot afford not to have “eyes in all places.” At the end of the day, having an eye on all of one’s data is the only way to achieve a complete picture of application and service behavior.
- How can we find the signals in our data while shedding the noise? You want to observe all of your data, but that doesn’t change the fact that a large portion of this data will be system ‘noise’ that doesn’t require remediation or action. One approach that delivers the best of both worlds involves analyzing all data at its source, as it’s being created, and then dynamically ingesting it: automatically identifying valuable data (such as anomalous datasets or data tied to a production deployment) and routing it to hot, searchable storage, while making the rest available in less-expensive cold storage.
- How can observability best be implemented to support the quickest possible debugging and troubleshooting processes? This answer also lies in analyzing data at its source, as it’s being created. Machine learning techniques can be deployed alongside this analysis to surface anomalies that would otherwise go unpredicted, doing so without running queries and in near-real time, capturing the exact raw data organizations need to debug and highlighting the affected components. In contrast to centralizing data, which often entails more manual toil to sift through it, decentralization makes it much easier to identify not just the existence of a ‘needle,’ but exactly where that needle is in the ‘haystack.’
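To make the idea of analyzing data at its source more concrete, here is a minimal Python sketch of a per-source router that decides whether a window of log data should go to hot or cold storage. It is an illustration only, not a description of any particular product: the `EdgeRouter` class, its parameters, the use of a simple z-score on error counts as the anomaly signal, and the `during_deploy` flag are all assumptions chosen for the example. A real system would likely use richer signals and more robust statistics.

```python
from collections import deque
from statistics import mean, pstdev


class EdgeRouter:
    """Hypothetical per-source analyzer that picks a storage tier for each time window."""

    def __init__(self, window=30, threshold=3.0, min_baseline=5):
        # Rolling baseline of recent per-window error counts (window size is an assumption).
        self.history = deque(maxlen=window)
        self.threshold = threshold        # z-score above which a window is treated as anomalous
        self.min_baseline = min_baseline  # windows required before any anomaly scoring happens

    def route(self, error_count, during_deploy=False):
        """Return "hot" for anomalous or deployment-related windows, otherwise "cold"."""
        anomalous = False
        if len(self.history) >= self.min_baseline:
            mu = mean(self.history)
            # Floor the spread at 1.0 (an assumed choice) so a flat baseline
            # neither divides by zero nor flags tiny fluctuations.
            sigma = max(pstdev(self.history), 1.0)
            anomalous = (error_count - mu) / sigma > self.threshold
        # The current window joins the baseline only after it has been scored.
        self.history.append(error_count)
        return "hot" if anomalous or during_deploy else "cold"


router = EdgeRouter()
counts = [2, 3, 2, 4, 3, 2, 40, 3]  # made-up error counts per one-minute window
tiers = [router.route(c) for c in counts]
print(tiers)
```

In this made-up series, only the window with 40 errors is routed to hot storage; the quiet windows before and after it go to cold storage, where they remain available but cost less to keep. Passing `during_deploy=True` forces a window to hot storage regardless of its score, mirroring the idea that data tied to a production deployment is worth keeping searchable.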
Each year, it’s impossible to predict which ecommerce businesses might experience an unwelcome hiccup during a peak sales period. But observability strategies are playing an increasingly critical role, and the good news is that companies should no longer feel forced to compromise between inspecting all of their data and achieving the shortest possible mean time to detect (MTTD) and MTTR. With novel approaches, organizations can have it all: maximum visibility into emerging hotspots while keeping costs and manual toil in check, ultimately enabling the delivery of consistently superior digital experiences that delight customers.
Ozan Unlu is the CEO and Founder of Edge Delta.