Data Essentialism

Inactivate, Archive or Delete … How Much Data to Keep?

About two decades ago, data management was considered strictly a technology function due to the cost of data storage, memory, and support. Unkempt data meant far higher costs and investments in these areas than it does today. The precipitous drop in storage and memory costs has allowed more data (and more versions of the same data) to swim around data environments much longer than a vigilant technology team would have allowed in prior years. Though our ability to store large amounts of data for trending, machine learning, and more is generally considered a good thing, deciding what to keep active versus inactive, absent a data strategy, has become a bigger hurdle than the operational cost of storing data. What has actually happened, in my view, is that we’ve opened a Pandora’s box of potential confusion for users due to the compounding prevalence of duplicate, dated, and (gasp!) unimportant data.

The challenge we are now facing is determining what data to keep, who is affected, why we need to keep data, and how to go about deciding. Although I am a self-proclaimed data hoarder, I have had to control that urge so I can strategically support my stakeholders with “just the right amount” of data. Certain policies allow us to keep legacy data, but beyond this, we are entering the journey of data essentialism. The notion of an essential property is closely tied to the idea of necessity.

Factors of Data Essentialism

ROT (Redundant, Outdated, or Trivial) Data – TechTarget defines ROT as data “that an organization continues to retain even though the information that is documented has no business or legal value.” Identifying ROT data is one of the first steps we can take in deciding which data need to go. Surfacing and managing ROT data is an ongoing process. An effective master data program will not only help minimize the presence of redundant, outdated, or trivial data in your ecosystem but will also provide a systematic, rule-based way to keep such data out.

Strategies such as master data enrichment, refresh, and recertification help you keep data fresh and identify records that need to be put out of circulation. For instance, if you have a regular cadence (pro tip: at least once a quarter) of master data enrichment and data refresh, your master data program will be able to pinpoint, from an outsider’s perspective, the bulk of your data that is still relevant. A unique entity identifier, such as the D-U-N-S® Number, also scales the identification of potential redundancy.
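
To make this concrete, here is a minimal Python sketch of a ROT scan along those lines; the record layout, the duns and last_certified fields, and the 90-day window are assumptions for illustration, not a prescribed schema.

```python
from datetime import date, timedelta

# Illustrative master records; field names (duns, last_certified) are assumptions.
records = [
    {"id": 1, "duns": "804735132", "name": "Acme Corp",        "last_certified": date(2023, 1, 10)},
    {"id": 2, "duns": "804735132", "name": "ACME Corporation", "last_certified": date(2021, 6, 2)},
    {"id": 3, "duns": "069032677", "name": "Globex Inc",       "last_certified": date(2023, 3, 5)},
]

REFRESH_WINDOW = timedelta(days=90)  # quarterly cadence suggested above
today = date(2023, 4, 1)

seen = set()
for rec in records:
    flags = []
    if rec["duns"] in seen:                                 # same entity identifier -> potential redundancy
        flags.append("redundant")
    seen.add(rec["duns"])
    if today - rec["last_certified"] > REFRESH_WINDOW:      # stale -> candidate for recertification
        flags.append("outdated")
    if flags:
        print(rec["id"], rec["name"], "->", ", ".join(flags))
```

In practice, the redundancy check would run against your matched entity identifiers in the master data platform rather than an in-memory list, but the logic is the same.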

Transaction Data Relevance – In my opinion, the most important reason to scale and maintain master data is to articulate, link, and aggregate not only our defined customers and prospects but also our interactions with them and theirs with us. These are what we call “transaction data,” and they could include billing, bookings, engagements, etc. Traditionally, the more transaction data available, the richer your analysis will be – that is, in a perfect world. However, there are aspects of transaction data you need to take into consideration, such as their quality and their application to your purpose. Incomplete and low-quality transaction data could misrepresent outcomes and lead to inaccurate decisions. Work with business stakeholders to identify pertinent transaction data. In fact, it would be prudent to have a strategy for exposing that transaction data to users, both to comply with policies and to enable effective reporting. The data you keep should have a well-defined purpose for your given use cases. Failing that, you create noise, confusion, and potential compliance and security vulnerabilities.

Use your available master data as guidance in helping business users manage transaction data. Do the aggregations and attributions make sense? Do they provide value? Are there transaction data that are orphans? Addressing these questions with business users can help define reporting relevance. Expose data strategically and only keep data that have value.
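
As an illustration, an orphan check against master data could be as simple as the sketch below; the customer identifiers and record shapes are hypothetical.

```python
# Hypothetical master identifiers and transactions; real systems would pull these from source tables.
master_ids = {"C001", "C002", "C003"}

transactions = [
    {"txn_id": "T-100", "customer_id": "C001", "amount": 1200.00},
    {"txn_id": "T-101", "customer_id": "C009", "amount": 350.00},   # no matching master record
    {"txn_id": "T-102", "customer_id": "C002", "amount": 980.00},
]

# An orphan is a transaction whose customer cannot be linked back to master data.
orphans = [t for t in transactions if t["customer_id"] not in master_ids]
for t in orphans:
    print(f"Orphan transaction {t['txn_id']}: customer {t['customer_id']} not in master data")
```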

Data-Driven Decisions – This is a good place to start when shaping a strategy for which data to expose and keep. If the datasets you have do not help the business make sound and needed decisions, it might be wise to limit their exposure to users. For instance, if the business requires data going back only 12 years for reporting, anything beyond that might be considered immaterial. This cut-off point can provide the demarcation line in defining time-relevant data for decision-making.
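
A retention cut-off like that 12-year reporting example can be applied with a straightforward filter; in this sketch the booked_on field, the dates, and the window are illustrative assumptions.

```python
from datetime import date

RETENTION_YEARS = 12               # cut-off taken from the reporting requirement above
today = date(2024, 1, 1)
cutoff = date(today.year - RETENTION_YEARS, today.month, today.day)

rows = [
    {"id": 1, "booked_on": date(2023, 5, 17)},
    {"id": 2, "booked_on": date(2009, 8, 3)},   # older than the retention window
]

in_scope   = [r for r in rows if r["booked_on"] >= cutoff]
immaterial = [r for r in rows if r["booked_on"] < cutoff]
print("keep for reporting:", [r["id"] for r in in_scope])
print("candidates to inactivate or archive:", [r["id"] for r in immaterial])
```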

Data Regulations Compliance – Understand the policies around data access, operations, and storage. At the very least, ensure that your data management strategy meets external regulations, and communicate the outcomes to your users. Failure to comply leaves your organization vulnerable to sanctions from governing bodies as well as to would-be security threats.

Inactivate – Archive – Delete

The topics above show that proving the value of data is the most significant part of data essentialism. If data do not provide value or serve a purpose, they need to be reassessed to determine whether they should remain part of the ecosystem. With the right process and technology, removing them from your data ecosystem doesn’t have to be a binary action (i.e., delete or don’t delete). We now have options, as data needs and the data landscape may change. Here’s a brief explanation of each:

Inactivate – This is also known as a “logical delete” or “soft delete.” Using a binary indicator field (e.g., Active_Status = “Y” or “N”), the data management team can control the exposure of certain data to its users. This is useful for data that need to be removed from use (for data quality reasons, for instance) but kept for compliance or operational purposes. The advantages of this process are the ease of reactivation and the immediate impact once the indicator is switched.
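
Here is a minimal sketch of that indicator-based approach, assuming a simple in-memory record set and the Active_Status field described above.

```python
# Minimal soft-delete sketch; Active_Status mirrors the indicator field described above.
customers = [
    {"id": "C001", "name": "Acme Corp",  "Active_Status": "Y"},
    {"id": "C002", "name": "Globex Inc", "Active_Status": "Y"},
]

def inactivate(records, record_id):
    """Logical delete: flip the indicator instead of removing the row."""
    for rec in records:
        if rec["id"] == record_id:
            rec["Active_Status"] = "N"

def active_view(records):
    """What downstream users see: only rows with the indicator set to 'Y'."""
    return [rec for rec in records if rec["Active_Status"] == "Y"]

inactivate(customers, "C002")
print(active_view(customers))    # reactivation is just setting the flag back to "Y"
```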

Archive – This process has two pertinent steps. The first is to back up the data to a separate location – in most cases, outside the main environment – where they can be systematically accessed, potentially for a recovery process. The second is the physical deletion from the active environment. This process is beneficial when there is a requirement to remove data for purposes of data quality, compliance, or efficiency. Afterward, your active data environment no longer holds the data, but they can still be accessed later, allowing flexibility in case they are needed again.
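
A rough sketch of the two archive steps, assuming a file-based archive location purely for illustration:

```python
import json
from pathlib import Path

# Illustrative active store; the archive path and file format are assumptions.
active_store = {"C002": {"name": "Globex Inc", "status": "dormant"}}
archive_dir = Path("archive")
archive_dir.mkdir(exist_ok=True)

def archive(record_id):
    record = active_store[record_id]
    # Step 1: back up the record outside the main environment
    (archive_dir / f"{record_id}.json").write_text(json.dumps(record))
    # Step 2: physically remove it from the active environment
    del active_store[record_id]

archive("C002")
# Recovery later is a matter of reading archive/C002.json back into the active store.
```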

Delete – This is the physical removal of data without keeping a copy or having a backup. It is the best option when data quality or compliance requires that data be permanently removed. For instance, under GDPR, once a person or organization requests absolute deletion of their information from your data ecosystem, you are required to physically remove their data.
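
And a minimal sketch of a hard delete across stores, with illustrative store names and no copy retained:

```python
# Hypothetical stores holding the data subject's information.
customer_master = {"C002": {"name": "Jane Doe", "email": "jane@example.com"}}
transaction_log = [
    {"txn_id": "T-101", "customer_id": "C002"},
    {"txn_id": "T-102", "customer_id": "C001"},
]

def hard_delete(customer_id):
    """Remove the subject's data everywhere, keeping no copy or backup."""
    customer_master.pop(customer_id, None)
    transaction_log[:] = [t for t in transaction_log if t["customer_id"] != customer_id]

hard_delete("C002")
print(customer_master, transaction_log)
```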

Value and Compliance

Retaining data boils down to value and policy compliance. The cost of keeping data is much more reasonable than it was 20 years ago. Knowing which data are needed to make decisions will help you decide which data to keep; understanding why you keep your data, however, will give you efficiencies in data access and reduce your exposure to potential data breaches and regulatory risk. Be guided by these principles and your data quality goals to help you determine how much data to keep.