There is a very bad joke that someone once told me about an elderly man with a very long, nearly unpronounceable name. All of his life he was forced to spell his name, to tell people how to pronounce his name, and to discuss the origins of his name. One day, he decided to have his name legally changed to the single word “Odd.” His friends and family insisted that he rethink his decision, but he was adamant. No one would ever have trouble spelling his name or pronouncing his name, even if he did have to explain it. Sadly, several days after he changed his name, he passed away. To honor his wishes, his family had his gravestone engraved with the single word “Odd.” To this day, as people walk past that one gravestone, many stop and say “that’s odd!” (insert groans and boos here). The story has a point, though. Things that are odd are often not so easily marked. In fact, in a world of Big Data, unprecedented world events, and previously inconceivable technologies touching nearly all parts of life, how do we find what is truly “unusual”? The term “anomaly” in data science is often used to describe data that is divergent from what might be expected. It turns out, finding anomalies in an environment replete with unprecedented change is no small task.
What’s Odd: Defining anomalies in data
The most basic form of anomaly detection is attribute-based. We define what we are looking for in terms of attributes, and then look for anything that departs significantly from that definition. For example, we might have a manufacturing process that looks for defective parts on a production line by finding anything that is outside of specification limits. Such a definition is useful, but it turns out we can do better.
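As a concrete illustration of the attribute-based approach, here is a minimal sketch that flags any part whose measured attributes fall outside fixed specification limits. The attribute names and limits are hypothetical examples, not values from any real process.

```python
# Minimal attribute-based anomaly check: a part is flagged as defective
# when any measured attribute falls outside its specification limits.
# The attribute names and limits below are hypothetical examples.

SPEC_LIMITS = {
    "diameter_mm": (0.95, 1.05),
    "length_mm": (99.0, 101.0),
}

def is_defective(part: dict) -> bool:
    """Return True if any measured attribute is outside its spec limits."""
    return any(
        not (low <= part[attribute] <= high)
        for attribute, (low, high) in SPEC_LIMITS.items()
    )

parts = [
    {"diameter_mm": 1.00, "length_mm": 100.2},  # within spec
    {"diameter_mm": 1.08, "length_mm": 100.0},  # diameter out of spec
]
print([is_defective(p) for p in parts])  # [False, True]
```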
In our imaginary production environment, suppose we are making round wire of a certain length. If the wire is too oblate (not circular in cross-section), or too short or too long by a certain percentage of the specification, we reject it as defective. The problem with this approach is that we find the defective wire only after we have made it, and at that point it is too late to intervene. It turns out that if we look deeper into the process, we can detect the wire as it is beginning to stretch or skew, before it has moved outside the specification limits, and then intervene to avoid making defective wire in the first place. Looking for different anomalies may be more useful. Second-order measurements (rate of change, characteristics of the source of the data) can be very powerful for detecting an environment in which we are likely to find anomalies.
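One way to act on that idea, assuming the process emits a stream of in-spec measurements, is to watch the rate of change rather than the raw values and raise an alarm when the trend exceeds a threshold. The window size, slope threshold, and data below are illustrative assumptions, not recommended settings.

```python
import numpy as np

def drift_alarm(measurements, window=10, max_slope=0.003):
    """Second-order check: fit a line to the most recent window of readings
    and alarm when the slope (change per part) exceeds a threshold, so we
    can intervene before any individual part leaves the spec limits."""
    recent = np.asarray(measurements[-window:], dtype=float)
    if len(recent) < window:
        return False  # not enough history yet
    slope = np.polyfit(np.arange(window), recent, deg=1)[0]
    return abs(slope) > max_slope

# Diameters drifting upward while every reading is still inside a
# hypothetical 0.95-1.05 mm specification band.
diameters = [1.000, 1.002, 1.005, 1.009, 1.014,
             1.020, 1.027, 1.033, 1.040, 1.046]
print(drift_alarm(diameters))  # True: drift caught before a defect is produced
```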
Modern methods of anomaly detection are far more complex. They can involve thousands of attributes and examine data in many dimensions. Anomalies can be defined as divergence from an objective function (e.g., regression-based methods) or discovered through learning and heuristics (e.g., non-regressive, recursive, or cognitive approaches).
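As one illustration of the learning-based end of that spectrum, the sketch below applies scikit-learn’s IsolationForest to synthetic multidimensional data. It is a minimal sketch rather than an endorsement of any particular method, and the contamination rate is an assumed tuning choice.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=0)

# Synthetic "normal" process data in five dimensions, plus a few outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 5))
outliers = rng.normal(loc=6.0, scale=1.0, size=(5, 5))
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies in the data.
model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)  # +1 = inlier, -1 = flagged anomaly
print(int((labels == -1).sum()), "points flagged as anomalous")
```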
There are many situations in modern business where anomaly detection in data is critical. Among them are applications for finding opportunity faster than the competition. Putting aside equity and commodity markets, drug research, and other fields where such techniques have been de rigueur for quite some time, there are exciting new greenfield applications for finding anomalies. For example, anomaly detection is currently being used in social listening to help companies understand subtle shifts in brand perception, the effectiveness of marketing efforts, and competitive focus.
On the risk side, anomaly detection is being used in many exciting ways. Traditional applications include forensics, credit management, and law enforcement. Recently, there has been tremendous focus on using anomaly detection to discover new and emerging cyber threats and for counter-terrorism.
Anomaly detection informs how we find new risk and new opportunity. It is essential that we continue to improve our skills in order to keep pace with ever-increasing amounts of data and ever more disruptive technological evolution. The most important advice is to continually re-evaluate what anomalies are being sought and how environments are changing, and to continuously challenge whether we are using the best methods.
Even Anomalies Have Anomalies: Challenges to anomaly detection
There are many commercially available and open source tools that can be helpful in anomaly detection. Before diving into your data, however, it is important to consider some aspects of the environment that may render the process less effective.
Consider our simple wire manufacturing example again. Imagine that we are looking at the process data and find that, although we are making good product, the data trends toward the lower end of the specification range at some times and toward the upper end at others. Experts will quickly consider two root causes for such a situation. First, there may be a multimodal effect: essentially, two or more distributions showing up in the data. This could happen, for example, if the process were running in two shifts and one operator used the machinery differently than the other. A second consideration would immediately occur to an analyst in such a situation: the data may be missing an unmeasured parameter (for example, the temperature of the environment) that, when measured and considered along with the original data, explains the process variation. There are many other possible explanations, but these two illustrate that the process of anomaly detection and intervention is not a simple one. Expert advice is often required.
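A quick diagnostic for the first root cause, assuming the process data can be labeled by shift, is simply to compare summary statistics per shift; a clear separation between the groups suggests a multimodal effect. The column names and readings below are hypothetical.

```python
import pandas as pd

# Hypothetical process data: wire diameter readings labeled by shift.
df = pd.DataFrame({
    "shift": ["A"] * 5 + ["B"] * 5,
    "diameter_mm": [0.962, 0.958, 0.965, 0.960, 0.963,   # shift A runs low
                    1.041, 1.038, 1.044, 1.040, 1.043],  # shift B runs high
})

# Per-shift statistics make a two-operator (multimodal) effect visible
# even though every individual reading may still be within specification.
print(df.groupby("shift")["diameter_mm"].describe()[["mean", "std", "min", "max"]])
```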
Many attributes of data variation can undermine anomaly detection. Among them are the “V’s” of Big Data (volume, velocity of change, variety, differing intrinsic value, and inconsistent veracity). All of these attributes can confound efforts to understand anomalies. For example, consider volume. If we are looking for a rarely occurring anomaly in a vast and increasing amount of data (a good example is searching the cosmos for electromagnetic signals that might indicate intelligent life), then recognizing the anomaly is not the problem; finding it in a sea of otherwise “noisy” data is the challenge in and of itself. Rather than finding a needle in a haystack, such problems are often described as finding a particular needle in a stack of needles.
Another confounding aspect of anomaly detection is change. Clearly, we live in a time of unprecedented change in technology (and the data it produces) and in the world itself (and the data it produces). Many of the tricks of the past, most notably regression-based methods and exhaustive analysis of entire data sets, are simply insufficient or wholly inappropriate in the face of massive and disruptive change. It is also critical to note that sometimes change in data is endemic, and sometimes the change is induced by the attempt at anomaly detection itself. So-called “observer effects” in data are particularly challenging for anomaly detection. An example would be detecting malefactors who change their behavior when they realize that they may be detected.
Anomaly detection is not a push-the-button exercise. It is crucial to consider environmental and behavioral aspects and to have expert involvement to build appropriately robust processes.
Anomalies Yet to Come: The future of anomaly detection
So far, we have looked at how we come to understand and define anomalies, and at some of the ways the process can become more complex or confounded. The future, it seems, has additional surprises for us to consider in the form of emerging technologies and behaviors that will produce new types of anomalies tomorrow.
One such example is quantum computing. In traditional digital computing, information is stored in bits, each of which is either zero or one. In quantum computing, information is stored in qubits, which exist in superpositions of zero and one described by complex probability amplitudes. In the quantum computers of today, qubits are funny things. Quantum algorithms are not useful for all types of problems, but for certain classes of problems they are very promising. How would we understand an anomaly in data that itself can only be understood in complex, often probabilistic modalities? How do we see something unusual in a set of data that is… all unusual? This explanation is intentionally over-simplified, but even so it seems clear that quantum anomaly detection will require new vocabulary and processes that modern computer science has not yet finished developing. We will need to think differently, and to behave differently, as we consider where to focus within the data produced by the quantum-enabled environments of the future.
Another, more concrete example of future anomalies comes from the Internet of Things (IoT). Today, IoT devices largely talk back to applications that either control them or receive their data, and anomaly detection works much as described above. In the future, as IoT becomes more autonomous, more self-discovering, and more performant, anomaly detection may only be possible by observing a rapidly changing system. Much like understanding illness in a living organism, anomaly detection in future IoT may require an understanding of signs and symptoms, and clinical interventions to discover the best treatments.
New technologies and behaviors are rapidly changing the field of advanced anomaly detection. The methods and understanding must continue to advance in ways that we do not yet fully comprehend. This field is a new frontier for Computer and Data Science.
Anomalies are a critical part of the future of understanding our mounting data and our rapidly changing environment. As we continue to produce data at a rate that is arguably unmeasurable, we must think about how we know what is different, unusual, or suspect. Our future depends on our ability to learn new ways of understanding the data we create and how we use it.