Dovile Kliusovaite

Step-by-Step Guide to Automating Data Cleansing Using Text Mining

Have you ever considered what the quality of your data is? You should, as data quality is a critical part of any business operation. When working with customers, we often encounter unstructured and chaotic data, including data with no headers, empty fields, and duplicate records. How do you go about structuring and cleaning your data? By data cleansing. Data cleansing allows companies to keep their data up to date and to uncover gaps in their day-to-day processes. You could, of course, go through this process manually, which is tedious and time-consuming. But what if there were a way to automate this process? In this blog we will look into the automated validation of a company's name and address information by applying text mining.

Automated Data Cleansing

To validate whether a physical address exists and whether the information is complete, the Places API is a great option. If the address is nonexistent or incomplete, you can automatically update the information in your database. Sounds scary? Well, it is not. Let me explain. A single combination of a name and address from your database is queried using the Places API. The search returns the best existing match based on the input. For example, when inputting “McCoy”, “Torenallee Eindhoven”, the search returns “McCoy & Partners”, “Torenallee 45, 5617 BA Eindhoven”. Next, assign a score to the data quality and create an update rule to automate the cleansing process.
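To make this concrete, here is a minimal sketch of such a lookup in Python. It assumes the Google Places Text Search endpoint and an API key stored in PLACES_API_KEY; adapt the request and the fields you extract to your own setup.

```python
import requests

PLACES_API_KEY = "your-api-key"  # assumption: a valid Places API key

def lookup_place(name: str, address: str):
    """Query the Places Text Search API with a name and address combination
    and return the best existing match, if any."""
    response = requests.get(
        "https://maps.googleapis.com/maps/api/place/textsearch/json",
        params={"query": f"{name} {address}", "key": PLACES_API_KEY},
        timeout=10,
    )
    results = response.json().get("results", [])
    if not results:
        return None  # nonexistent or unresolvable name/address combination
    best = results[0]
    return {"name": best["name"], "address": best["formatted_address"]}

# Example from above: "McCoy", "Torenallee Eindhoven" is expected to resolve
# to "McCoy & Partners", "Torenallee 45, 5617 BA Eindhoven".
print(lookup_place("McCoy", "Torenallee Eindhoven"))
```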

Below is a detailed guide on how to use text mining to automate data cleansing.

The best tools for automated data cleansing

Which tools can be used for automated data cleansing? Programming languages such as Python or R are the most used among data scientists; however, RapidMiner is a great option if you have no coding experience. Generally, many great tools are available and new ones emerge every day. There are plenty of best-tool guides and blogs online, so we will not discuss tools further in this blog.

Making your computer work with textual data

To make sure that the computer can interpret and work with textual data, you need to pre-process it. Which pre-processing steps are required depends heavily on the data and the goal you wish to achieve. When implementing automated validation of names and addresses, the data consists of short combinations of words. Here the order of words is very important, and misspellings may be intentional, e.g. “Perficient”, “Technologent”, “7Eleven”. Taking these characteristics into consideration, we need to avoid pre-processing techniques such as word replacement (a technique to correct misspelled words) and stemming and lemmatization (techniques that strip a word to its stem and group it with other words of similar meaning). Common text mining practice is to filter out frequently used words such as ‘the’, ‘a’, ‘is’, etc., since these words do not add valuable information and would completely dominate the analysis. Another widespread practice is to filter punctuation and special characters out of the text to declutter the data. Converting your text to lower case decreases the risk of interpreting capitalized versions of words as new words.
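As an illustration, here is a minimal pre-processing sketch in Python that follows these choices: lower-casing, stripping punctuation and filtering stop words, but deliberately no spell correction, stemming or lemmatization. The stop-word list is only an example.

```python
import re

# Illustrative stop-word list; extend it for your own data and language.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}

def preprocess(text: str) -> list:
    """Lower-case, strip punctuation/special characters and remove stop words.
    No spell correction, stemming or lemmatization, so intentional spellings
    such as 'Technologent' or '7Eleven' stay intact."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and special characters
    return [token for token in text.split() if token not in STOP_WORDS]

print(preprocess("McCoy & Partners, Torenallee 45, 5617 BA Eindhoven"))
# ['mccoy', 'partners', 'torenallee', '45', '5617', 'ba', 'eindhoven']
```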

Numerically representing words

However, arguably the most important and most difficult part of text mining is representing the words numerically. Numerically representing unstructured textual data makes it mathematically computable. There are numerous techniques to achieve this, from basic ones like one-hot encoding to neural-network-driven ones such as neural probabilistic language models.

For an automated check on a company's name and address, a tokenization method was used. Tokenization is the process of cutting data into meaningful parts, called tokens. Tokens are then embedded into a vector space; in other words, each token is mapped to a vector, a numerical representation of a word. Typically, words with similar meanings will have vector representations that are close together in the embedding space.
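As a sketch of what this can look like in practice, the example below tokenizes two strings into character n-grams and turns them into vectors with scikit-learn's CountVectorizer. The n-gram choice is an illustrative assumption rather than the exact method used in the original project; character n-grams simply work well for short, noisy strings such as company names and addresses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams of 2-4 characters as tokens (illustrative choice).
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))

database_record = "mccoy torenallee eindhoven"
api_result = "mccoy partners torenallee 45 5617 ba eindhoven"

# Inspect the first few tokens produced for the database record.
tokens = vectorizer.build_analyzer()(database_record)
print(tokens[:5])  # first few character n-gram tokens of "mccoy"

# fit_transform builds the shared vocabulary and returns one vector per string.
vectors = vectorizer.fit_transform([database_record, api_result])
print(vectors.shape)  # (2, number of distinct n-grams)
```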

Stay in control using a similarity score

At this stage we have pre-processed our text and given it a numerical representation. Now we can check how similar the two names and addresses are: the name and address from the database and the name and address returned by the Places API. Since our textual information is now a vector, this is done by calculating the distance between two vectors. Distance metrics such as Hamming distance require both sequences to be of equal length, and metrics such as Euclidean and Manhattan distance are sensitive to differences in magnitude, which a short database entry and a longer API result will typically have. Cosine similarity, on the other hand, measures the angle between two vectors and is independent of their magnitude, which makes it well suited for comparing texts of different lengths. The higher the cosine similarity score, the more similar the information is. Using the similarity score, you can set up a rule or threshold. For example, if the similarity score is higher than 80%, automatically allow the update. Is the score lower? Ask a person to review the change. This way you update thousands of records in a couple of minutes while staying fully in control.
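Here is a minimal sketch of this step, reusing character n-gram vectors and scikit-learn's cosine_similarity; the 80% threshold is the example value from the text and should be tuned on your own data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.80  # example threshold from the text; tune it on your own data

def similarity(db_value: str, api_value: str) -> float:
    """Cosine similarity between a database value and the Places API result,
    using character n-gram vectors so both strings share one vector space."""
    vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    vectors = vectorizer.fit_transform([db_value, api_value])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

score = similarity("mccoy torenallee eindhoven",
                   "mccoy partners torenallee 45 5617 ba eindhoven")

if score >= THRESHOLD:
    print(f"Score {score:.2f}: update the record automatically")
else:
    print(f"Score {score:.2f}: flag the record for manual review")
```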

Ask us about future data cleansing initiatives

With this blog, you will be able to update your customer database. Read more about text mining in this blog: Text mining simply shines through business science. Or watch the McCoy TV episode on cleansing your data using data science.