Integral Solutions - IT solutions for companies

28.01.2018 | Piotr Krzeszewski – Data Scientist

Importance of Data Quality in the world of Machine Learning

Machine learning is not possible without data. Data contains the information we need, lets us ask questions, and enables us to find answers that turn into insights of high business value. What characteristics should data have so that it can be used effectively in the machine learning process? Below I analyze several properties of good data. For each one, I try to justify and illustrate its importance with examples. As it turns out, even seemingly minor oversights can have a significant impact on the outcome of the entire project, and data problems hit the work of the Data Science team particularly hard. Finally, I present tools that help automate data handling and minimize the costs of bad data.

Characteristics of good data

Completeness

The data we use should be complete on at least three main levels:

  • Variable
    Does the training set include all the variables that may affect our question? Imagine a large bank that would like to find customers who are ready to take out a loan, for example to support marketing. Unfortunately, someone decided that the loan date does not matter and that only the time elapsed since the last loan is important, so the date was left out. The Data Scientist would then have a hard task, because it would be difficult to spot, for example, a group of customers who borrow for the holidays. A complete data set would allow this additional, valuable information to be monetized.

  • Range
    Does our dataset cover the full range of each variable? Suppose we want to build a model that predicts antenna utilization for a mobile network operator. If we had data only from autumn and winter, it would be difficult to make correct predictions for stations located in seaside towns and in places that host famous summer festivals.

  • Record
    Does each record contain all the data that is available? Suppose our goal is to personalize a marketing campaign for customers of a large chain of clothing stores. If gender information were missing for a large share of the customers, we would expect the resulting model to be noticeably worse than one trained on the complete data set.
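The range-level and record-level checks can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the field names (`customer_id`, `gender`, `month`) and the sample records are invented for the example and do not come from any real system.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 1, "gender": "F", "month": 1},
    {"customer_id": 2, "gender": None, "month": 2},
    {"customer_id": 3, "gender": "M", "month": 11},
    {"customer_id": 4, "gender": None, "month": 12},
]

def missing_share(rows, fields):
    """Return, per field, the fraction of rows where it is missing (None or absent)."""
    if not rows:
        return {}
    missing = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) is None:
                missing[field] += 1
    return {field: missing[field] / len(rows) for field in fields}

def uncovered_values(rows, field, expected):
    """Return the expected values of `field` that never occur in the data."""
    seen = {row[field] for row in rows if row.get(field) is not None}
    return sorted(set(expected) - seen)

# Record level: how often is each field empty?
print(missing_share(records, ["customer_id", "gender", "month"]))
# Range level: which months of the year are not represented at all?
print(uncovered_values(records, "month", range(1, 13)))
```

In this invented sample, half of the records lack gender and the months from March to October are not covered, which mirrors the "autumn and winter only" situation described above.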

Uniqueness

Duplicate data is usually a serious problem. Duplicates can appear in two places:

  • Duplicate features
    Duplicated features are a problem for many algorithms used in Data Science. Firstly, they increase the running time, because the same data must be processed more than once. Secondly, the algorithm may have trouble "deciding" which of the features is more important, which leads to poorer and less stable (less reproducible) results. The same feature expressed in several units also counts as a duplicate (e.g. customer height in meters and in centimeters, or the net and gross value of a bill).

  • Duplicate records
    Duplicates of individual observations are a less common problem, but if they make up a noticeable part of the data, they can skew the results. Imagine that we are looking for information about young bank customers who open deposits, and that one customer living in the countryside has opened several dozen of them. If we run the analysis on a set that contains multiple duplicates of this customer's data, we may obtain a result that does not reflect reality, suggesting that customers from the countryside open many times more deposits than their peers from the city.
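Both kinds of duplicates can be detected automatically. The sketch below is one possible approach, not a definitive method: exact duplicate records are counted, and features are flagged as redundant when their Pearson correlation is almost exactly 1 or -1 (which is what the same quantity in two units produces). The column names, sample values and the 0.999 threshold are assumptions made for the example.

```python
import math
from collections import Counter

def duplicate_records(rows):
    """Return each record that occurs more than once, together with its count."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_feature_pairs(columns, threshold=0.999):
    """Return pairs of column names whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                pairs.append((a, b))
    return pairs

# Invented sample data: height stored twice in different units, plus an unrelated feature.
columns = {
    "height_m": [1.62, 1.75, 1.80, 1.68],
    "height_cm": [162, 175, 180, 168],
    "age": [34, 51, 29, 45],
}
rows = [
    {"customer": "A", "deposit": 1000},
    {"customer": "A", "deposit": 1000},
    {"customer": "B", "deposit": 500},
]

print(redundant_feature_pairs(columns))
print(duplicate_records(rows))
```

A high correlation only suggests that two columns carry the same information; whether one of them should be dropped remains a decision for the analyst.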

Timeliness

Another class of problems can be caused by using outdated data. In what situations can this have a negative impact on the effectiveness of our model?

Take, for example, a chain of stores that offered loyalty cards to its customers in 2006-2009. Management would like to forecast the demand for some luxury products. With data from 2006 to mid-2008, it would be very difficult to predict sales in 2009, when consumption fell significantly because of the economic crisis. This situation also shows that the Data Scientist needs to know about factors that are not visible in the dataset but still affect the analysis.

You should also ensure that the data set includes the most recent data available. It would be unacceptable to ask a Data Scientist to prepare a model that selects the best marketing campaign for each client without providing data on the results of recent campaigns carried out for these clients.

Two conclusions can be drawn from the above examples:

  • It may be necessary to refresh the model regularly so that it includes current data. The frequency should be chosen to fit the problem being modeled: factory production may change from week to week, while for agricultural businesses a quarterly or semi-annual cycle may be sufficient.
  • A Data Scientist cannot use data they do not know about or cannot access. The organization must therefore put mechanisms in place that spread knowledge about existing data sets. One option is to appoint a person responsible for this (e.g. a dedicated data administrator), but larger organizations will need automated solutions.
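The first conclusion can be turned into a simple automated check. The sketch below assumes a refresh interval is agreed per model and compares it with the age of the newest training record; the model names and the intervals are hypothetical and only echo the weekly and quarterly examples above.

```python
from datetime import date, timedelta

# Hypothetical refresh intervals; the right values depend on the problem being modeled.
REFRESH_INTERVAL = {
    "factory_output": timedelta(weeks=1),
    "crop_yield": timedelta(days=90),
}

def needs_refresh(model_name, newest_record, today):
    """Return True when the newest training record is older than the model's refresh interval."""
    return today - newest_record > REFRESH_INTERVAL[model_name]

today = date(2018, 1, 28)
# Newest record is 18 days old, the interval is 7 days.
print(needs_refresh("factory_output", date(2018, 1, 10), today))
# Newest record is 58 days old, the interval is 90 days.
print(needs_refresh("crop_yield", date(2017, 12, 1), today))
```

A check like this can run on a schedule and alert the team before a model starts working with stale data.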

Accuracy

For many people who work in Data Science, this is the most important property of data. Accurate data is necessary to obtain a valid result. Unfortunately, errors can occur in many places and for various reasons. Some examples:

  • incorrect data entered into the system (by users or employees),
  • errors in transferring data between systems,
  • errors in loading data into analysis tools,
  • errors in passing data to machine learning algorithms.

Accuracy problems can occur at any stage of data handling and processing. Some errors will be difficult to detect and practically impossible to correct (especially when the data is entered by system users). Data handling is susceptible to human error, so it is worth automating all processes related to data delivery and processing.

There are other aspects of keeping data accurate as well. Are the columns well described? Will the Data Scientist be able to determine the meaning of each variable? Take, for example, a column named quarter_1_sales_sum. Although the name seems to describe the contents well, legitimate questions remain:

  • What year does this data refer to?
  • Is it net or gross sales?
  • Does this take into account contracts concluded during this period, or payments booked?
  • How are returns of goods handled in this variable?
  • Does this variable have the same meaning for individual customers and for companies (especially in terms of paying VAT)?

Of course, not all of these questions will be relevant in every situation, and additional information about the dataset will dispel some doubts. Still, such ambiguities lead to errors and delays. A Data Scientist often has to track down the people responsible for the data just to understand what it means.
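One way to avoid such questions is to keep a machine-readable description next to the data. The sketch below shows a possible data dictionary entry for quarter_1_sales_sum in which each field answers one of the questions listed above. The class, its fields and all example values are invented for illustration; they are not a standard and do not describe any real dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnDoc:
    """One data dictionary entry; each field answers one question about the column."""
    name: str
    period: str            # which year or period the values refer to
    tax_basis: str         # "net" or "gross"
    recognition: str       # "contract_signed" or "payment_booked"
    returns_handling: str  # how returned goods are reflected
    owner: str             # who to ask when something is unclear

# Invented example values, for illustration only.
DATA_DICTIONARY = [
    ColumnDoc(
        name="quarter_1_sales_sum",
        period="Q1 of the reporting year",
        tax_basis="net",
        recognition="payment_booked",
        returns_handling="returns subtracted in the quarter of the return",
        owner="sales-data-team",
    ),
]

def undocumented(column_names, dictionary):
    """Return the columns that have no entry in the data dictionary."""
    documented = {doc.name for doc in dictionary}
    return [name for name in column_names if name not in documented]

print(undocumented(["quarter_1_sales_sum", "customer_segment"], DATA_DICTIONARY))
```

Running the check before a dataset is handed over makes missing descriptions visible early, instead of during the analysis.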

Ordinary measurement inaccuracy is also worth mentioning. If we analyze materials, or if some metrics are not measured precisely, what does the error of these variables look like? Is it constant over time? Is information about the inaccuracy reliable, and is it provided with the data?

Consistency

An often overlooked property of data (one that is usually just assumed to hold) is consistency. As with the previous properties, violating it can cause problems in various situations. When preparing for a Data Science analysis, we should find answers to the following questions:

  • Are we able to link data from different systems to one object (e.g. a specific customer, a unique product)?
  • Do the systems we rely on use the same data formats (e.g. telephone number, address format)?
  • Do all datasets used have the same level of precision (e.g. a location is down to a country in one system and down to a city borough in another)?
  • Does the data use the same units? It is relatively easy to tell the difference between kilograms and tons, but a mismatch between kilometers and miles can go unnoticed and lead to errors that are difficult to correct in the future.
  • Is time data expressed in the same way? Does it use the same time zone?

Problems with inconsistent data have been known for a long time. A well-known example is NASA's Mars Climate Orbiter, a probe lost in 1999 because imperial units were mixed up with SI units.
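Unit and time zone mismatches are easiest to avoid when every value is converted to one agreed representation as soon as it is loaded. The following is a minimal sketch under that assumption: distances are normalized to kilometers and timestamps to UTC, and anything with an unknown unit or without a time zone is rejected rather than guessed. The function names and the unit codes are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# One international mile is exactly 1.609344 km.
TO_KM = {"km": 1.0, "mi": 1.609344}

def to_km(value, unit):
    """Convert a distance to kilometers; fail loudly on an unknown unit."""
    try:
        return value * TO_KM[unit]
    except KeyError:
        raise ValueError(f"unknown distance unit: {unit!r}") from None

def to_utc(timestamp):
    """Convert a time zone aware datetime to UTC; reject naive timestamps."""
    if timestamp.tzinfo is None:
        raise ValueError("timestamp has no time zone, refusing to guess")
    return timestamp.astimezone(timezone.utc)

# A reading recorded at 10:00 in a UTC+1 zone.
local_reading = datetime(2018, 1, 28, 10, 0, tzinfo=timezone(timedelta(hours=1)))

print(to_km(10, "mi"))
print(to_utc(local_reading).isoformat())
```

Rejecting ambiguous input is a deliberate choice here: a loud failure at load time is cheaper than a silent kilometers-versus-miles error discovered after the model is in use.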

Importance

The last property of good data we want to mention is its importance. As before, it can be considered on different levels.

First, I would like to focus on an aspect that is not directly related to data but is crucial for the profitability of the entire process. Before starting data analysis, consider whether the question we are putting to the Data Science team is really important in a business sense. Are there other issues that could generate greater savings or profits? Remember that the Data Science team often has extensive technical and business experience gained in various projects. It is certainly worth asking the team's opinion, as it can point to issues where further work could prove the most profitable.

Importance should also be considered in the context of data preparation and use. Has the Data Scientist received all the data that could be used in the analysis? Remember that Data Scientists know best which data is worth using, so they should make this decision. Take the example of a bank that would like to select a group of customers worth calling with a new offer. There are many types of data to consider: personal details, credit history, responses to previous contact attempts. Interestingly, in one similar project it turned out that user behavior on the website was very important. Omitting an important data set can doom the project to failure from the very beginning.

Impact of data quality on the Data Science team

I started this post by saying that machine learning is not possible without data. However, I would say it is possible without good data. A good Data Scientist should be able to handle data that is inaccurate, poorly described, inconsistent, and full of duplicates. The analysis can also be carried out on incomplete or outdated data, although this will affect the result.

However, all this costs time. It is often said that data processing makes up about 80% of the entire Data Science process. If a Data Scientist has to spend extra hours detecting duplicates or checking the units used, or days pinning down the meaning of individual variables, the cost of the entire process increases significantly. What is worse, the first results that could help direct further work will appear much later.

Providing bad data for analysis also has negative effects beyond the project. Firstly, a Data Scientist will quickly lose trust in the data if it turns out that they have to take care of its quality themselves. They will feel forced to check the quality of all data, even the parts that are of better quality. Secondly, it can affect the morale of the Data Science team. If a significant amount of time goes into processing a data set during one project and the same work has to be repeated for the next analysis, the task becomes tiring and demotivating.

Addressing data quality issues

So what should we do to ensure that our data is of good quality?

Two things are needed. First, the organization must designate people who are responsible for data quality. Second, these people need to be supported by appropriate tools that ensure data is handled correctly.

In any organization that cares about data, there are several solutions to consider:

  • Data catalog. A tool that holds information about the data sets owned by the organization. With it, an administrator or Data Scientist can check which datasets are available and what they contain. A data catalog also helps to secure data and to make sure that responsibility for individual data sets is clearly defined.
  • Master data management tools. Every organization has some data that matters most to it (such as lists of customers, products or facilities). This data is typically used in multiple systems. You cannot afford inconsistencies in it, that is, "several versions of the truth".
  • Data quality tools. They let you monitor data quality, find problems, and define your own rules to ensure that all data is of the right quality.
  • A data processing engine that supports automated operations and ensures that data is up to date.
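To give a feeling for what "setting your own rules" can mean in practice, here is a very small rule runner in plain Python. It is only a sketch of the idea and not a description of any particular product; the rule names, the 9-digit phone format and the sample rows are assumptions made for the example.

```python
def run_rules(rows, rules):
    """Apply every rule to every row and collect (row index, rule name) for each failure."""
    failures = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures.append((index, name))
    return failures

def phone_has_9_digits(row):
    """Hypothetical format rule: the phone number is a string of exactly 9 digits."""
    phone = row.get("phone")
    return isinstance(phone, str) and phone.isdigit() and len(phone) == 9

def amount_non_negative(row):
    """Hypothetical value rule: the amount is present and not negative."""
    amount = row.get("amount")
    return amount is not None and amount >= 0

rules = {
    "phone_has_9_digits": phone_has_9_digits,
    "amount_non_negative": amount_non_negative,
}

# Invented sample rows: the second has a malformed phone, the third a negative amount.
rows = [
    {"phone": "123456789", "amount": 10.0},
    {"phone": "12-345", "amount": 5.0},
    {"phone": "987654321", "amount": -1.0},
]

print(run_rules(rows, rules))
```

Dedicated data quality tools add scheduling, reporting and ownership on top of this, but the core idea is the same: rules are written once and then applied automatically to every new batch of data.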
