Data Talks, Episode 6: Dealing With Duplicates

Host: George L'Heureux, Principal Consultant, Data Strategy

Guest: Donald Folk, Principal Data Advisor

In this episode of Data Talks, we discuss different ways of addressing the very real problem of duplicate data. 

Just like incomplete and inaccurate data, duplicate data can have negative implications for sales, marketing, and finance teams – indeed, for anyone within the organization who uses data. Duplicates can increase the cost of marketing programs, create sales conflicts, and cause inefficiencies throughout the organization.

There are many causes of duplicate data, including human error, subjectivity in entry, and collecting data from multiple sources. Sometimes, as our expert shares, it's even the result of a well-intentioned internal program. But reasons and repercussions aside, there are things we can do, not only to fix duplicates once they occur, but to prevent them from happening in the first place. Our expert also explains how the D-U-N-S Number plays an important role as the unique identifier for consolidating those duplicates.

 

Episode 6: Dealing With Duplicates

George L'Heureux:
Hello everyone. This is Data Talks presented by Dun & Bradstreet. I'm your host, George L'Heureux. I'm a principal consultant for data strategy in the advisory services team at Dun & Bradstreet. In advisory services, our team is dedicated to helping our clients maximize the value of their relationship with D&B through expert advice and consultation. And on Data Talks, I chat every episode with one of the expert advisors at D&B about a topic that can help consumers of our data and our services get more value. Today's guest expert is Don Folk. Don is a data strategy consultant at D&B and Don, how long have you been with Dun & Bradstreet?

Don Folk:
I am surprised to say this, but it's 23 years this month.

George L'Heureux:
And tell me a little bit about what you do in your current role as a data strategy consultant.

Don Folk:
So my current role is really around making sure the customer understands the value of D&B data, the process flows, everything like that. I'm specifically an expert in matching and in identifying what we're going to talk about today, duplicates, but also in using the D&B data assets to the fullest capabilities that we can.

George L'Heureux:
And tell me a little bit about how you got to this point in your career. What was the path that made you interested in this and got you into this role?

Don Folk:
Sure. So over 23 years, as you'd expect, there were definitely a lot of positions that I shifted through. Starting in our delivery organizations, where I would actually create the deliverables that went out to our clients, I understood how clients really asked for our data and used our data. From there I shifted over into our content organization and really understood what it meant to be a vendor for D&B, with D&B purchasing their data, whatever it may be, and I would actually help ingest that vendor's data into the workflow. So I have a pretty holistic view of data in general. I understand it from the customer side, but also from the vendor's side.

George L'Heureux:
And I think that that puts you in a really good spot to talk about the topic that you and I had agreed on here. It was really important for people to hear about, which is the idea of duplicates. And one of the things that our team deals with, with just about every client, is how many duplicates we're seeing in their data. Why is that even a thing? Why do we care?

Don Folk:
Yeah. I mean, realistically even D&B has duplicates in the database. We have best-in-class processes that we use to identify and resolve those duplicates, but every database in the known world has duplicates. It's really just a by-product of collecting data from multiple sources.

George L'Heureux:
And so they're there but I imagine the reason that we talk about them is that they can cause problems, right?

Don Folk:
Very true.

George L'Heureux:
What kind of problems do we think about when we're talking about duplicate records and the impacts that they can have?

Don Folk:
Yeah. If I could sum it up in one word, it's really confidence. If duplicates are in the data, it creates a lack of confidence across your organization, from your sales organization to accounts receivable, accounts payable, and supplier management. All of those activities have a different structure and exposure to duplicates. But again, it's really just that they're there. We know they're there, and it creates that lack of confidence whenever one of your salespeople finds them in your repository.

George L'Heureux:
And if you've got more than one of something in your data store, in your database, the chance exists for me to go grab one and you to grab another, and we may not realize that we have two completely different views of the same customer. That's one of the challenges you're talking about there.

Don Folk:
Yeah, absolutely. And if we look at it from a marketing perspective, that isn't always the worst case. But if we tie it to accounts payable or accounts receivable, where there are important decisions tracked at each of those independent levels, you could potentially see payable or receivable dollar amounts tied to both of those duplicate accounts. So definitely, having those disparate duplicate views is a significant problem.

George L'Heureux:
Right. I mean, at that point you're not just talking about an extra record here or there, but you're talking about things that could roll up and eventually impact financial filings.

Don Folk:
Absolutely. And those are the concerns that you need to be mindful of as a data supplier, data collector, data aggregator, for sure.

George L'Heureux:
Okay. So then what do we do? How do we address the problem of duplicate data?

Don Folk:
Well, there's definitely a multi-faceted approach. First, you have to understand your use case. If it's strictly marketing, your exposure level is lessened. But if it's, like I said, accounts payable, accounts receivable, or supplier management, you have to be more aware of it. The key to this is really getting that D-U-N-S Number, our unique identifier. You have to get that D-U-N-S Number on as many records as you possibly can within your own repository. That's the key. That's your first step in identifying the duplication.

George L'Heureux:
So how does that help? We get the D-U-N-S Number on all these records. What do we do next? How does that D-U-N-S Number help us?

Don Folk:
Yeah, so that D-U-N-S Number is the unique key that permits us to say, "This specific business entity looks and feels like this record within the D&B file." If you have multiple records or entries in your repository with the same D-U-N-S Number, that's the definition of a duplicate. Now there are reasons that can occur, and I'm sure we'll talk about that in a few minutes, but definitely that unique D-U-N-S Number will be the identifier that you can consolidate and collapse on to identify those duplicates.
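The collapse-on-identifier rule Don describes – multiple rows sharing one D-U-N-S Number define a duplicate – can be sketched in a few lines of plain Python. The record layout and the `duns` field name below are hypothetical illustrations, not an actual D&B schema:

```python
from collections import defaultdict

def find_duplicates(records, key="duns"):
    """Group records by a unique identifier and return only the
    groups with more than one entry, i.e. the duplicates."""
    groups = defaultdict(list)
    for rec in records:
        if rec.get(key):  # skip rows with no identifier assigned
            groups[rec[key]].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical repository rows; "duns" stands in for the D-U-N-S Number field.
crm = [
    {"id": 1, "name": "Acme Corp",  "duns": "150483782"},
    {"id": 2, "name": "ACME Corp.", "duns": "150483782"},  # same entity, second row
    {"id": 3, "name": "Globex",     "duns": "987654321"},
]
dupes = find_duplicates(crm)
print(dupes)  # one group: the two rows sharing D-U-N-S "150483782"
```

Records with no identifier fall outside this check entirely, which is why the unmatched set discussed next needs separate treatment.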

George L'Heureux:
So we know that that's going to help for the large majority of records that customers have in their databases. For anything that aligns with a D-U-N-S Number, they're going to be able to see whether or not there's D-U-N-S Number overlap in that set. But there are some records that, for various reasons, aren't going to get a D-U-N-S Number. How do we help with those? What can be done with that set of records?

Don Folk:
Yeah. So I think the first part is understanding why they can't have a D-U-N-S Number assigned. Is it lacking information? Or is it because the information that's supplied is in a structure that confuses the match engine to a degree that we can't support the D-U-N-S Number assignment process? So first, let's do a review of your data to try to figure out exactly why we can't get a D-U-N-S Number. And then secondarily, if we come up with valid reasons why we don't have D-U-N-S Number assignments, that's when we have to start thinking a little bit differently about how to identify duplicates within that universe.

George L'Heureux:
And you and I have talked before about how the presence of a D-U-N-S Number actually has a bit of a multiplier effect. Not only are we able to get the value out of the D-U-N-S Number itself, but we've seen over time with clients that those records that don't have a D-U-N-S Number actually have a higher incidence rate of duplicates in the dataset.

Don Folk:
Yeah, without a doubt. And there are many reasons behind that, but the primary theme is this: if there's missing information that prohibits us from assigning a D-U-N-S Number, that probably makes it more complex to identify whether we're looking at the right business, and we could generate duplicate entries in the database trying to mitigate some of that. So it's definitely the lack of information needed to link to a D-U-N-S Number that makes a record more prone to duplication within the repository. For sure.

George L'Heureux:
I've always found that really interesting. But let's say that we get down to that set of duplicates that we know are there now, whether through a D-U-N-S Number or through other methods for records that couldn't get a D-U-N-S Number. Once you've identified all those dupes, how do you go about resolving them?

Don Folk:
And quite honestly, that's the hard part of the whole equation. Identification, through D-U-N-S or through other means, is probably the easiest aspect of this. The resolution process is very client specific, because we have to be mindful that all of the information being carried along with that specific entity within your own repository needs to be consolidated. We need to actually collect that information – accounts payable, accounts receivable, all of the supplier-based information. All of it has to be collected and consolidated into one single view of a single client of yours.
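One way to picture the consolidation Don describes is a simple survivorship merge over a group of rows that share a D-U-N-S Number. The field-level rules below (first non-empty value wins; monetary balances are summed) are illustrative choices for the sketch, not D&B's actual resolution process, and the field names are hypothetical:

```python
def consolidate(group):
    """Collapse a group of duplicate rows into one survivor record.
    Illustrative strategy: keep the first non-empty value seen for
    each descriptive field, and sum the monetary balance fields so
    no payable/receivable dollars are lost in the merge."""
    balance_fields = ("accounts_receivable", "accounts_payable")
    survivor = {}
    for rec in group:
        for field, value in rec.items():
            if field in balance_fields:
                survivor[field] = survivor.get(field, 0) + value
            elif field not in survivor or not survivor[field]:
                survivor[field] = value
    return survivor

# Two rows for the same entity, each carrying part of the picture.
group = [
    {"duns": "150483782", "name": "Acme Corp", "accounts_receivable": 1200, "accounts_payable": 0},
    {"duns": "150483782", "name": "",          "accounts_receivable": 800,  "accounts_payable": 350},
]
survivor = consolidate(group)
print(survivor)
# {'duns': '150483782', 'name': 'Acme Corp', 'accounts_receivable': 2000, 'accounts_payable': 350}
```

In practice the merge rules are exactly the "client specific" part: which source wins per field, and which amounts may be summed versus deduplicated, depends on the use case.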

George L'Heureux:
So you talked about it depending on the client. What are some ways that clients might need to, let's call it, personalize that resolution process beyond just aggregating the data into a single record?

Don Folk:
Yeah. Like I said, you definitely have to be mindful of which specific use case you're going after. It potentially could be a manual review, or it could be outsourcing of that consolidation once you have identified the duplicates. Creating that survivor view, that single cherished record, is something that you have to be mindful of. We can certainly help you with that, in defining what that resolution process would look like. But again, it's definitely something you need to be aware of; it's probably the most complex component of this topic.

George L'Heureux:
So with the potential downsides of having duplicates in your data, and obviously the related benefits of taking care of them, identifying and resolving them, why does it even remain a problem? Why isn't everyone staying on top of issues like this?

Don Folk:
Really, it's around the complexity of this. A great example I worked on with a client: there was an initiative their sales team put into place where individual sales team members would get an additional bonus if they brought on new supplier clients. The outcome was that the sales teams created new business records for existing suppliers. So the client intentionally created an initiative to grow sales, but indirectly created a duplicate problem, because the sales teams just put the same records in twice. Each looked like a new record, but it was an existing business, and that created the duplicate effect.

George L'Heureux:
So let's talk about standards, guidelines, best practices, what are some best practices that companies can use to perform duplicate resolution and really know when they've made a difference or when the juice is no longer worth the squeeze?

Don Folk:
Yeah. So first, if you have the D-U-N-S Number assigned, definitely do some analysis to figure out exactly what your duplication rate is within the D-U-N-S Number universe. Knowing that those are probably the most pristine records within your database, if your percentage is over a certain threshold, then you should be concerned. So certainly, as a first step, look into your D-U-N-S Number universe to see what your duplicate rates are.

George L'Heureux:
Do you have a feel for what a percentage above which people should really be concerned is, or is that another one of the things that really depends on a client and their particular use case?

Don Folk:
It definitely depends on the client's specific use case, but a general rule of thumb is 5%. If you're exceeding a 5% duplication rate, you've really got a problem that you need to address sooner rather than later. And that's a general statement. Best-in-class is generally around one to three percent, in my industry experience, but anything above five is definitely something you need to address sooner rather than later.
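As a rough sketch of the rate check Don suggests, a duplication rate can be computed as the share of rows that are redundant copies of an already-seen D-U-N-S Number. This particular definition of the rate, and the sample values, are assumptions for illustration only:

```python
def duplication_rate(duns_values):
    """Percent of rows that are redundant copies: total rows minus
    distinct D-U-N-S Numbers, over total rows."""
    total = len(duns_values)
    distinct = len(set(duns_values))
    return 100.0 * (total - distinct) / total if total else 0.0

# Hypothetical column of D-U-N-S Numbers: 10 rows, 8 distinct entities.
duns_column = ["150483782", "150483782", "987654321", "111111111",
               "222222222", "222222222", "333333333", "444444444",
               "555555555", "666666666"]
rate = duplication_rate(duns_column)
print(f"{rate:.0f}% duplicate")  # (10 - 8) / 10 -> 20%
if rate > 5:
    print("Above the 5% rule of thumb: address sooner rather than later.")
```

A rate like this 20% would sit well above the 5% threshold Don mentions, while a best-in-class file would score in the one-to-three percent range under the same measure.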

George L'Heureux:
Okay. So Don, as we wrap up: someone out there listening or watching might be hearing this and wondering whether their database has this issue, how prevalent it is, and what impact it might be having. What do they do as a first step?

Don Folk:
So like I said, definitely understand what that duplication rate is by just looking at your own database where there is a D-U-N-S Number assigned. If there isn't, I would recommend reaching out to the consultant team. We can certainly help you with some best demonstrated practices and the best means of identifying those duplicates. But again, I think it's really just looking into your own database where there is a D-U-N-S Number, to start to figure out exactly what that rate would look like for your specific use case.

George L'Heureux:
Well, hey, Don, I really appreciate you taking time to sit and chat with me about this topic and sharing your expertise over your many years of work here at D&B with everyone who's watching or listening, helping them understand the importance of identifying and resolving duplicates.

Don Folk:
Well, thanks for having me. I really enjoyed talking about this; it's a passion of mine.

George L'Heureux:
Our guest expert today again has been Don Folk, a data strategy consultant here at Dun & Bradstreet. And this is Data Talks. We hope that you've enjoyed today's discussion and if you have, we encourage you to please share it with a colleague or a friend, let them know about the show. And if you'd like more information about things that we've discussed on today's episode, please visit www.dnb.com or talk to your company's D&B specialist today. I'm George L'Heureux, thanks for joining us. Until next time.