Definition, Function, Process and Stages of Data Mining
Data mining is a process that uses statistical techniques, mathematics, artificial intelligence, and machine learning to extract and identify useful information and related knowledge from large databases (Turban et al., 2005). Several other terms carry the same meaning as data mining, namely knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, business intelligence, data archaeology, and data dredging (Larose, 2005).
The ability of data mining to find valuable business information in a very large database is analogous to mining precious metals from the ground. This technology is used for:
- Predicting trends and business behavior: data mining automates the process of finding predictive information in large databases.
- Discovering previously unknown patterns: data mining sweeps through the database and identifies previously hidden patterns in a single pass.
- Supporting critical decisions, especially strategic ones.
Here are some data mining definitions from several sources (Larose, 2005):
- Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technology as well as mathematical and statistical techniques.
- Data mining is the analysis of observational data sets to find unexpected relationships and to summarize the data in new ways that are understandable and useful to the data owner.
- Data mining is an interdisciplinary field that brings together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the problem of extracting information from large databases.
- Data mining is the process of extracting potentially useful information that is implicitly contained in a set of data in a database.
Data Mining Functions
Data mining helps users obtain useful information and increase their knowledge. It has four basic functions:
- Prediction function. The process of finding patterns in data by using several variables to predict other variables whose type or value is unknown.
- Description function. The process of finding the important characteristics of the data in a database.
- Classification function. The process of finding a model or function that describes the classes or concepts in the data. The model describes important data and can be used to predict future data trends.
- Association function. The process of finding relationships among the attribute values in a data set.
Data Mining Process
The processes commonly carried out in data mining are description, prediction, estimation, classification, clustering, and association. Each is explained in detail below (Larose, 2005):
Description aims to identify patterns that appear repeatedly in the data and to turn those patterns into rules and criteria that experts in the application domain can easily understand. The rules produced must be easy to understand in order to effectively increase the level of knowledge in the system. Descriptive tasks are often needed in postprocessing to validate and explain the results of the data mining process. Postprocessing ensures that only valid and useful results are passed on to interested parties.
Prediction is similar to classification, except that the data are classified according to behavior or values expected in the future. Examples of predictive tasks are predicting a drop in the number of customers in the near future and predicting stock prices three months ahead.
Estimation is almost the same as prediction, except that the target variable is numerical rather than categorical. The model is built from complete records, which provide the value of the target variable along with the values of the predictor variables. Then, for new observations, the value of the target variable is estimated from the values of the predictor variables. For example, the systolic blood pressure of a hospital patient can be estimated from the patient's age, gender, weight, and blood sodium level. The relationship between systolic blood pressure and the predictor variables, learned from the complete records, produces the estimation model.
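As a rough illustration of estimation, the sketch below fits a straight line by ordinary least squares and uses it to estimate systolic blood pressure for a new patient. It is a simplification under stated assumptions: only one predictor (age) is used instead of the four named above, the records are invented, and the names `fit_line` and `estimate` are hypothetical.

```python
# Invented training records for illustration only:
# (age in years, systolic blood pressure in mmHg).
records = [(30, 118.0), (40, 124.0), (50, 130.0), (60, 136.0), (70, 142.0)]

def fit_line(pairs):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# "Learning" step: build the estimation model from complete records.
intercept, slope = fit_line(records)

def estimate(age):
    """Estimate the numerical target for a new observation."""
    return intercept + slope * age

print(round(estimate(55), 1))
```

With more than one predictor, the same idea extends to multiple regression, which is usually done with a numerical library rather than by hand.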
Classification is the process of finding a model or function that describes data and separates it into classes. Classification involves examining the characteristics of an object and assigning the object to one of a set of previously defined classes.
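One simple way to picture classification is a nearest-centroid classifier: each predefined class is summarized by the average of its training records, and a new object is assigned to the class whose average is closest. This is only one of many possible techniques, chosen here for brevity; the records, class labels, and function names below are invented for illustration.

```python
import math

# Invented labeled records: (income, debt) in thousands -> predefined class.
training = [
    ((50.0, 5.0), "low_risk"),
    ((60.0, 8.0), "low_risk"),
    ((20.0, 15.0), "high_risk"),
    ((25.0, 20.0), "high_risk"),
]

def class_centroids(rows):
    """Average the feature values of the records in each class."""
    sums = {}
    for (x, y), label in rows:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def classify(point, centroids):
    """Assign the point to the class with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

centroids = class_centroids(training)
print(classify((55.0, 6.0), centroids))
print(classify((22.0, 18.0), centroids))
```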
Clustering groups data into classes of similar objects without relying on predefined class labels. A cluster is a collection of records that are similar to one another and dissimilar to the records in other clusters. The aim is to produce groups whose members resemble each other. The greater the similarity of objects within a cluster and the greater the difference between clusters, the better the quality of the cluster analysis.
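The following sketch shows the idea with k-means, a widely used clustering algorithm, applied to one-dimensional data. It is an assumption-laden toy: the text does not prescribe k-means, the ages are invented, the starting centers are chosen by hand, and `kmeans_1d` is a hypothetical helper name.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

ages = [21, 23, 25, 60, 62, 65]  # invented data, no class labels
centers, clusters = kmeans_1d(ages, centers=[20, 70])
print(centers)
print(clusters)
```

No class labels are supplied; the two groups emerge only from how close the values are to each other.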
The task of association in data mining is to find attributes that occur together. In the business world it is more commonly called market basket analysis. Association seeks to uncover rules that quantify the relationship between two or more attributes.
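Two measures commonly used to quantify an association rule are support (the share of transactions containing all items in the rule) and confidence (how often the consequent appears when the antecedent does). The text above does not name these measures, so treat the sketch below as one conventional way to score a rule; the baskets are invented.

```python
# Invented shopping baskets for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of the whole rule divided by support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Score the rule "bread -> milk".
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```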
Stages of Data Mining
The data mining process runs through several stages: selecting target data from the source data, preprocessing to improve data quality, transformation, data mining itself, and finally interpretation and evaluation, which yield output in the form of new knowledge. The details are explained as follows (Fayyad, 1996):
1. Data selection
Data must be selected from the set of operational data before the information mining stage in KDD begins. The selected data to be used in the data mining process is stored in a file, separate from the operational database.
2. Pre-processing / cleaning
Before the data mining process can be carried out, the data that is the focus of KDD needs to be cleaned. The cleaning process includes, among other things, removing duplicate data, checking for inconsistent data, and correcting errors in the data.
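The three cleaning actions just mentioned can be sketched on a handful of records. Everything in this example is assumed for illustration: the field names, the rule that ages outside 0 to 120 are errors, and the choice to mark such values as missing rather than guess a correction.

```python
# Invented raw records with typical quality problems.
raw_rows = [
    {"id": 1, "city": "Jakarta", "age": 34},
    {"id": 1, "city": "Jakarta", "age": 34},     # exact duplicate
    {"id": 2, "city": " jakarta ", "age": 29},   # inconsistent spelling
    {"id": 3, "city": "Bandung", "age": -5},     # impossible value
]

def clean(rows):
    """Normalize text, flag impossible ages, and drop duplicate records."""
    seen = set()
    cleaned = []
    for row in rows:
        fixed = dict(row)
        # Make inconsistent spellings comparable.
        fixed["city"] = fixed["city"].strip().title()
        # Treat out-of-range ages as errors and mark them as missing.
        if not 0 <= fixed["age"] <= 120:
            fixed["age"] = None
        # Use the record's contents as a key to detect duplicates.
        key = tuple(sorted(fixed.items(), key=lambda kv: kv[0]))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned

cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```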
3. Transformation
Coding is a transformation process applied to the selected data so that it is suitable for the data mining process. Coding in KDD is a creative process and depends heavily on the type or pattern of information being sought in the database.
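Two common examples of such coding are rescaling numeric values to a fixed range and turning categories into 0/1 indicator columns. The text does not say which transformations to use, so the sketch below simply assumes these two; the data and the helper names `min_max_scale` and `one_hot` are hypothetical.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest is 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode each category as a list of 0/1 indicators."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

weights = [50, 75, 100]      # invented numeric attribute
genders = ["F", "M", "F"]    # invented categorical attribute

scaled = min_max_scale(weights)
categories, encoded = one_hot(genders)
print(scaled)
print(categories, encoded)
```

Which coding is appropriate depends on the mining technique that follows; distance-based methods, for instance, are sensitive to the scale of each attribute.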
4. Data mining
Data mining is the process of finding patterns or interesting information in the selected data using a particular technique or method. The techniques, methods, and algorithms available in data mining vary greatly. The choice of the right method or algorithm depends on the purpose of the KDD process as a whole.
5. Interpretation / evaluation
The patterns of information produced by the data mining process need to be presented in a form that interested parties can easily understand. This stage of the KDD process is called interpretation. It includes checking whether the patterns or information found contradict previously known facts or hypotheses.
Turban, E. 2005. Decision Support Systems and Intelligent Systems, Indonesian Edition, Volume 1. Yogyakarta: Andi.
Larose, Daniel T. 2005. Discovering Knowledge in Data: An Introduction to Data Mining. John Wiley & Sons, Inc.
Fayyad, Usama. 1996. Advances in Knowledge Discovery and Data Mining. MIT Press.