Many challenges in natural language processing (NLP) involve natural language understanding, and that has prompted a question D&B’s data scientists are asking themselves and other experts in the field. Can computational linguistics, the discipline concerned with the interactions between computers and our human natural languages, be utilised to improve data and, specifically, the good old Standard Industrial Classification (SIC) code?
First, a little bit of history about the SIC. When a business registers at a registry, it is asked to provide an industry classification code. This is the ‘as registered’ SIC, and the important point to note is that it is self-assigned, chosen from a pre-defined list of categories and their corresponding codes. So, what is the problem? The problem is that the categories on that list are often not terribly useful when you want to really understand your customer base. For example, a common code is ‘Miscellaneous Business services elsewhere unclassified’. Not very useful.
Here at D&B, we really feel that data science can help improve this situation and, ultimately, the data we provide to our customers. We have therefore embarked on some small tests using semantic search, which seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear, to generate more relevant results. We have then used neural modelling, which takes basic business information, SIC descriptions and content from the web, to help us improve granularity.
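To make the matching idea a little more concrete, here is a minimal sketch in Python of ranking category descriptions against a free-text business description. It is not the approach used in our tests: it swaps in plain TF-IDF weighting with cosine similarity, which only matches on shared words, whereas a true semantic search would also capture meaning and context. The category labels and descriptions, and the function names, are invented purely for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical category labels and descriptions, invented for illustration.
# They are not real SIC codes or official SIC wording.
CATEGORIES = {
    "CAT-A": "computer programming and software development services",
    "CAT-B": "restaurants, cafes and takeaway food service",
    "CAT-C": "retail sale of clothing and footwear in stores",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the documents."""
    documents = list(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    n = len(documents)
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector; terms unseen in the category list are dropped."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors held as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_categories(description, categories=CATEGORIES):
    """Return (label, score) pairs, best match first."""
    idf = build_idf(categories.values())
    query = vectorize(description, idf)
    scored = [
        (label, cosine(query, vectorize(text, idf)))
        for label, text in categories.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for label, score in rank_categories(
        "We develop bespoke software and apps for small businesses"
    ):
        print(label, round(score, 3))
```

A word-matching sketch like this cannot tell that ‘restaurant’ and ‘restaurants’ are the same thing, let alone that ‘app studio’ implies software. That gap is exactly what semantic search and neural models aim to close.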
To date, our tests have been relatively small, but the results are really encouraging. If we can get this to work in a larger-scale trial with the same quality of output, the opportunities will be huge. Next up is that pesky robot to do the cleaning. However, I am reliably informed that this is still some way off, as the contextual piece is rather more difficult to master.