Rapid technological development has dramatically increased the amount of data that almost every organization uses. Back in the early 1990s, Teradata boasted of building Walmart's first system with a capacity of 1 TB (1,000 GB). Today, the databases and data warehouses behind popular websites such as YouTube are far larger than that.
Sometimes data quality itself hinders the entire process, and we need the support of computers: data mining uses algorithms to analyze unstructured data. In this article, I describe the most important of them. Read on to learn what data mining is and which techniques are used to find relevant data in large sets drawn from multiple sources.
What is data mining?
Data mining is the process of discovering rules, relationships, and patterns in collected information to obtain knowledge. It is a technological combination of traditional methods of data analysis (i.e. relatively well-known statistics) with contemporary algorithms and Artificial Intelligence solutions and ways to process large volumes of data using one or more computing units.
Data mining process
“Every dataset, every database, every spreadsheet has a story to tell” – Stuart Frankel, CEO of Narrative Science
Data mining is one of the inseparable elements of KDD (Knowledge Discovery in Databases) – the process of discovering the knowledge gathered in databases. This process includes:
- Determining the purpose of the analysis – understanding the problem, familiarizing oneself with the data, and business needs.
- Data integration – combining information from different sources, sometimes with a different structure and different data models.
- Preprocessing of data – removing human errors, typos, and empty values; aligning data types for individual fields; finding and removing duplicates.
- Data transformation – the next stage of processing, focused on the requirements of further exploration. It involves selecting the potentially useful columns and parts of the data according to the predetermined purpose – simplifying the data as much as possible.
- Selecting exploration methods and choosing the right algorithm – this point is described further in the article.
- Data mining – according to the definition above, data mining makes it possible to search for rules, dependencies and patterns.
- Interpretation and visualization – understanding the results obtained and making them understandable for business; creating tables, writing down conclusions, documenting the process, and justifying the means used.
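The cleaning steps above (removing duplicates, handling empty values, aligning types) can be sketched in a few lines of Python. This is a minimal illustration on invented records; the field names ("name", "age") and the sentinel value for missing ages are hypothetical choices, not part of any standard.

```python
# Minimal preprocessing sketch: deduplicate, fill missing values, coerce types.
raw = [
    {"name": "Alice", "age": "34"},
    {"name": "Alice", "age": "34"},   # duplicate row
    {"name": "Bob", "age": None},     # missing value
]

# Deduplicate while preserving the original order.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Replace missing ages with a sentinel (-1) and cast the rest to int.
clean = [
    {"name": r["name"], "age": int(r["age"]) if r["age"] is not None else -1}
    for r in deduped
]
print(clean)  # [{'name': 'Alice', 'age': 34}, {'name': 'Bob', 'age': -1}]
```

In a real pipeline, the same ideas would typically be expressed with a dataframe library, but the logic – dedup, impute, cast – stays the same.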
Types of data mining techniques and methods
There are currently two main groups of exploration methods to choose from for the purposes of our analysis:
- Predictive methods
- Descriptive methods
Data mining techniques
Within each of them, three data mining techniques are commonly distinguished – the most popular approaches to exploration. Of course, this area is constantly developing and there are more algorithms and approaches, so in this article we will focus on the key ones. Below, I will explain each group and method and give examples for better understanding.
1. Predictive methods – focus on predicting a result based on the values of other input data. The results of these methods are called target or dependent values, while the attributes used to obtain them are called independent or explanatory values.
The methods used by data miners include:
- Classification technique – operates with algorithms focused, as the name suggests, on classifying data objects. It is used when the dependent value is discrete (categorical). It has been used, for example, to diagnose diseases in patients based on previously classified cases.
Examples of algorithms: the naive Bayes classifier, logistic regression, K-nearest neighbors, decision trees, support vector machine.
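To make classification concrete, here is a from-scratch sketch of one of the algorithms listed above, K-nearest neighbors, on invented "patient" data. The features (temperature, heart rate) and labels are made up purely for illustration.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of (features, label) pairs."""
    by_distance = sorted(train, key=lambda fl: math.dist(fl[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented examples: (body temperature, heart rate) -> diagnosis label.
train = [
    ((36.6, 70), "healthy"), ((36.7, 72), "healthy"), ((36.5, 68), "healthy"),
    ((39.1, 95), "ill"), ((38.8, 99), "ill"), ((39.4, 102), "ill"),
]
print(knn_classify(train, (39.0, 98)))  # ill
```

Note that in practice features on different scales (degrees vs. beats per minute) should be normalized before computing distances.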
- Prediction technique – predicts the most likely values for the data received. Models created with this technique can be thought of as continuous functions fitted to the input data. It could be used, for example, in market research on workers' earnings in given sectors: based on education, years of experience, origin, and other demographic factors, the average wage can be estimated.
Examples of algorithms: linear regression, ridge regression, polynomial approximation.
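The wage-estimation example above maps naturally onto the simplest of these algorithms, linear regression. Below is an ordinary-least-squares fit with one explanatory variable, written from scratch; the data (years of experience vs. salary in thousands) is invented for illustration.

```python
# Ordinary least squares with one explanatory variable.
xs = [1, 2, 3, 4, 5]          # years of experience (invented)
ys = [40, 45, 52, 55, 61]     # salary in thousands (invented)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(round(slope, 2), round(predict(6), 1))  # 5.2 66.2
```

A real multi-factor model (education, origin, etc.) would use multiple regression, but the fitting principle – minimizing squared prediction error – is the same.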
- Time-Series Analysis – a technique that produces results based on the analysis of data that changes over time. If the step (the time interval between observations) is irregular, dedicated methods for unevenly spaced time series are needed.
Examples of algorithms: autoregressive integrated moving average (ARIMA), moving average, and exponential smoothing.
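Two of the listed smoothers, the moving average and (simple) exponential smoothing, are easy to show from scratch. The monthly series below is invented for illustration.

```python
# Two simple time-series smoothers on an invented monthly series.
series = [10, 12, 13, 12, 15, 16, 18, 17, 19, 21]

def moving_average(data, window=3):
    """Average each value with its (window - 1) predecessors."""
    return [sum(data[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(data))]

def exponential_smoothing(data, alpha=0.5):
    """Blend each new observation with the running level; higher alpha
    reacts faster to recent values."""
    level = data[0]
    out = [level]
    for x in data[1:]:
        level = alpha * x + (1 - alpha) * level
        out.append(level)
    return out

print(moving_average(series)[:3])        # ≈ [11.67, 12.33, 13.33]
print(exponential_smoothing(series)[:3]) # [10, 11.0, 12.0]
```

ARIMA adds autoregressive and differencing terms on top of this moving-average idea and is best used through a statistics library rather than written by hand.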
2. Descriptive methods – try to derive patterns (correlations, trends, clusters, anomalies, etc.) from the input values that describe the relationships within the data. These methods characterize, in a general sense, the properties of the input data: they discover patterns and relationships, group the data appropriately, and detect characteristic anomalies. However, to draw specific conclusions, additional work is needed to prepare and visualize the data properly. Techniques include:
- Discovering Associations – the technique of discovering patterns that describe strong relationships between items in a data set. Examples include finding groups of genes with similar properties, or analyzing a customer's basket to plan product placement (e.g., so that a customer buying bread passes the butter shelf on their way to the checkout).
Examples of algorithms: Apriori, ECLAT
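The core idea behind Apriori – count how often itemsets co-occur and keep those above a minimum support – can be sketched for item pairs in a few lines. The baskets below are invented market-basket data.

```python
from itertools import combinations
from collections import Counter

# Invented market-basket data.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

min_support = 3  # a pair must appear in at least 3 baskets
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
frequent = {pair: c for pair, c in pair_counts.items() if c >= min_support}
print(frequent)  # {('bread', 'butter'): 3}
```

Full Apriori extends this level by level (pairs, triples, ...), pruning any candidate whose subsets are not themselves frequent.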
- Clustering technique – creates a finite number of collections (categories) based on the data and its similar features; the number of such categories follows from the similarity of the data. It can be used, for example, in team sports to look for similarities between players on a given team, providing a basis for new tactics before the next game.
Examples of algorithms: K-means clustering, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
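A bare-bones version of K-means clustering fits in a short function: assign each point to its nearest centroid, then move each centroid to the mean of its cluster, and repeat. The 2-D "player statistics" below are invented, and the fixed iteration count is a simplification (real implementations stop when assignments no longer change).

```python
import math
import random

def kmeans(points, k=2, iterations=10, seed=0):
    """Minimal k-means sketch: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

# Two clearly separated groups of invented 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(kmeans(points)))  # two centroids, one near each group
```

BIRCH takes a different route – building a compact tree summary of the data first – which makes it better suited to very large datasets.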
- Detection of changes and deviations – a technique that looks for fragments of a data set that differ significantly from the rest. Such fragments are called anomalies or outliers. Good techniques in this class combine a high detection rate with a low false-alarm rate. They are used, for example, in AML (Anti-Money Laundering) and for monitoring changes in ecosystems.
Examples of algorithms: the K-NN algorithm, Bayesian networks, and hidden Markov models (HMM).
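Before reaching for those models, the simplest deviation detector is a z-score test: flag any value more than a few standard deviations from the mean. The transaction amounts below are invented, and the threshold of 2 is an arbitrary illustrative choice.

```python
from statistics import mean, stdev

# Invented transaction amounts; one is suspiciously large.
amounts = [120, 95, 130, 110, 105, 98, 5000, 115]

mu, sigma = mean(amounts), stdev(amounts)
threshold = 2.0  # flag values more than 2 standard deviations from the mean
outliers = [x for x in amounts if abs(x - mu) / sigma > threshold]
print(outliers)  # [5000]
```

The model-based approaches listed above (Bayesian networks, HMMs) go further: they score how unlikely an observation is under a learned model of normal behavior, rather than relying on a single global mean.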
Which model to choose and how to effectively explore the data?
The answer to that question is, of course, “it depends”.
Data mining is only part of the whole KDD chain, and the previous part of the article shows how extensive the topic is. Everything that happened to the data along the way will influence our decisions. However, there are certain patterns that can be followed: if we know the results should be discrete, we will use classifiers; if they should be numerical, we will use regression-based prediction techniques. And if we have no expectations about the data, or only general ones, and simply want to learn something from it, we will use descriptive techniques.
Application of a data mining algorithm
We have determined which method and technique to use – now we need to answer the question of which algorithm to apply to build a data model. There is no definite answer. Personally, I have come across an approach in which we choose the algorithms we comprehend – and which we know will “understand” our data best – and then compare them. However, if there is no opportunity to try several solutions, because production conditions or other constraints do not allow it, the experience of the developer or team implementing the solution will decide.
Why use data mining?
Data mining is an extensive area that fits into the currently important trends of Big Data, Data Science, and building a data-driven organization; in fact, a separate article could be written about each of the algorithms mentioned above. Companies that want to be data-driven collect data and invest in analytics tools such as Microsoft Power BI, Tableau, or Qlik Sense, which visualize the conclusions produced by the methods I have described in a form that is comprehensible to everyone. Such solutions help to find trends and hidden relationships between data from different sources, extract their potential, and turn raw data into useful information that allows us to make accurate decisions. Many companies use them not only to analyze historical data but also to make forecasts and thereby, for example, increase sales.