The text enables analysts to design and perform their own data-mining experiments using their knowledge of the methodologies and techniques provided here. The book emphasizes the selection of appropriate methodologies and data-analysis software, as well as parameter tuning.
These critically important, qualitative decisions can be made only with a deeper understanding of what each parameter means and what role it plays in a given technique, and that understanding is what this book offers. This volume is primarily intended as a data-mining textbook for computer science, computer engineering, and computer information systems majors at the graduate level. Senior undergraduate students with the appropriate background can also successfully comprehend all topics presented here. Researchers, students, and industry professionals alike will find here the rationale, the methods, and the practice needed to apply essential data-mining methodologies to problems in their own fields of interest.
Now updated: the systematic introductory guide to modern analysis of large data sets. As data sets continue to grow in size and complexity, there has been an inevitable move towards indirect, automatic, and intelligent data analysis in which the analyst works via more complex and sophisticated software tools.
Kantardzic has won awards for several of his papers, has been published in numerous refereed journals, and has been an invited presenter at various conferences. He has also been a contributor to numerous books.

Modern science and engineering are based on using first-principle models to describe physical, biological, and social systems. In many domains, however, the underlying first principles are unknown, or the systems under study are too complex to be mathematically formalized. With the growing use of computers, there is a great amount of data being generated by such systems.
Thus there is currently a paradigm shift from classical modeling and analyses based on first principles to developing models and the corresponding analyses directly from data. We have gradually grown accustomed to the fact that there are tremendous volumes of data filling our computers, networks, and lives. Government agencies, scientific institutions, and businesses have all dedicated enormous resources to collecting and storing data.
In reality, only a small amount of these data will ever be used because, in many cases, the volumes are simply too large to manage, or the data structures themselves are too complicated to be analyzed effectively. How could this happen? The primary reason is that the original effort to create a data set is often focused on issues such as storage efficiency; it does not include a plan for how the data will eventually be used and analyzed. The need to understand large, complex, information-rich data sets is common to virtually all fields of business, science, and engineering.
In the business world, corporate and customer data are becoming recognized as a strategic asset. The entire process of applying a computer-based methodology, including new techniques, for discovering knowledge from data is called data mining. Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods.
Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers. In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans.
Therefore, it is possible to put data-mining activities into one of two categories: predictive data mining, which produces a model of the system described by the given data set, and descriptive data mining, which produces new, nontrivial information based on the available data set. On the predictive end of the spectrum, the goal is to produce a model that can be used to perform classification, prediction, estimation, or similar tasks; on the descriptive end of the spectrum, the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets. The relative importance of prediction and description for particular data-mining applications can vary considerably.
The goals of prediction and description are achieved by using data-mining techniques, explained later in this book, for the following primary data-mining tasks:
Classification. Discovery of a predictive learning function that classifies a data item into one of several predefined classes.
Regression. Discovery of a predictive learning function that maps a data item to a real-value prediction variable.
Clustering. A common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.
Summarization. An additional descriptive task that involves methods for finding a compact description for a set or subset of data.
Dependency Modeling. Finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set.
Change and Deviation Detection. Discovering the most significant changes in the data set.
The more formal approach, with graphical interpretation of data-mining tasks for complex and large data sets and illustrative examples, is given in Chapter 4. Current introductory classifications and definitions are given here only to give the reader a feeling of the wide spectrum of problems and tasks that may be solved using data-mining technology.
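As a small illustration of the two ends of this spectrum, the sketch below contrasts a predictive task (classification) with a descriptive one (clustering). It uses scikit-learn and its bundled Iris data purely as assumed, freely available examples; the text itself does not prescribe any particular tool.

```python
# Illustrative sketch only: scikit-learn is an assumed tool, not one mandated by the text.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Predictive task (classification): learn a function that maps features to predefined classes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy on unseen data:", clf.score(X_test, y_test))

# Descriptive task (clustering): identify a finite set of categories without using the labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
```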
The success of a data-mining engagement depends largely on the amount of energy, knowledge, and creativity that the designer puts into it. In essence, data mining is like solving a puzzle. The individual pieces of the puzzle are not complex structures in and of themselves. Taken as a collective whole, however, they can constitute very elaborate systems. As you try to unravel these systems, you will probably get frustrated, start forcing parts together, and generally become annoyed at the entire process, but once you know how to work with the pieces, you realize that it was not really that hard in the first place.
The same analogy can be applied to data mining. In the beginning, the designers of the data-mining process probably did not know much about the data sources; if they did, they would most likely not be interested in performing data mining. Individually, the data seem simple, complete, and explainable. But collectively, they take on a whole new appearance that is intimidating and difficult to comprehend, like the puzzle. Therefore, being an analyst and designer in a data-mining process requires, besides thorough professional knowledge, creative thinking and a willingness to see problems in a different light.
Data mining is one of the fastest growing fields in the computer industry. Once a small interest area within computer science and statistics, it has quickly expanded into a field of its own. Since data mining is a natural activity to be performed on large data sets, one of the largest target markets is the entire data-warehousing, data-mart, and decision-support community, encompassing professionals from such industries as retail, manufacturing, telecommunications, health care, insurance, and transportation.
In the business community, data mining can be used to discover new purchasing trends, plan investment strategies, and detect unauthorized expenditures in the accounting system. It can improve marketing campaigns and the outcomes can be used to provide customers with more focused support and attention. Data-mining techniques can be applied to problems of business process reengineering, in which the goal is to understand interactions and relationships among business practices and organizations.
Many law enforcement and special investigative units, whose mission is to identify fraudulent activities and discover crime trends, have also used data mining successfully. For example, these methodologies can aid analysts in the identification of critical behavior patterns in the communication interactions of narcotics organizations, the monetary transactions of money laundering and insider trading operations, the movements of serial killers, and the targeting of smugglers at border crossings. Data-mining techniques have also been employed by people in the intelligence community who maintain many large data sources as a part of the activities relating to matters of national security.
Appendix B of the book gives a brief overview of the typical commercial applications of data-mining technology today. Despite a considerable level of overhype and strategic misuse, data mining has not only persevered but matured and adapted for practical use in the business world. Is data mining a form of statistics enriched with learning theory or is it a revolutionary new concept?
In our view, most data-mining problems and corresponding solutions have roots in classical data analysis. Data mining has its origins in various disciplines, of which the two most important are statistics and machine learning. Statistics has its roots in mathematics; therefore, there has been an emphasis on mathematical rigor, a desire to establish that something is sensible on theoretical grounds before testing it in practice.
In contrast, the machine-learning community has its origins very much in computer practice. This has led to a practical orientation, a willingness to test something out to see how well it performs, without waiting for a formal proof of effectiveness. If the place given to mathematics and formalizations is one of the major differences between statistical and machine-learning approaches to data mining, another is the relative emphasis they give to models and algorithms. Modern statistics is almost entirely driven by the notion of a model.
This is a postulated structure, or an approximation to a structure, which could have led to the data. Basic modeling principles in data mining also have roots in control theory, which is primarily applied to engineering systems and industrial processes. The problem of determining a mathematical model for an unknown system (also referred to as the target system) by observing its input-output data pairs is generally referred to as system identification.
System identification generally involves two top-down steps: structure identification and parameter identification. In the first step, we need to apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is to be conducted. In the second step, optimization techniques are applied to determine the parameters of the selected model so that it best fits the available data. In general, system identification is not a one-pass process: both structure and parameter identification need to be done repeatedly until a satisfactory model is found. This iterative process is represented graphically in Figure 1. Typical steps in every iteration are as follows: specify and parameterize a class of models, perform parameter identification to choose the parameters that best fit the data, and conduct validation tests to see if the model identified responds correctly to an unseen data set (often referred to as the test, validating, or checking data set).
The process terminates once the results of the validation test are satisfactory.
(Figure: Block diagram for parameter identification.)
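A minimal sketch of this identification loop, under purely hypothetical assumptions (a synthetic one-input system, polynomial model structures of increasing order, and NumPy as the computational tool), might look as follows: a structure is chosen, its parameters are fit to the identification data, and a checking data set decides which candidate to keep.

```python
# Hypothetical sketch of iterative structure/parameter identification; NumPy is an assumed tool.
import numpy as np

rng = np.random.default_rng(0)
u = np.linspace(-1, 1, 200)                                  # observed system input
y = 1.5 * u**3 - 0.5 * u + rng.normal(0, 0.05, u.size)       # observed output (true law unknown to the modeler)

# Split into identification (training) data and checking (validation) data.
idx = rng.permutation(u.size)
train, check = idx[:150], idx[150:]

best_order, best_err = None, np.inf
for order in range(1, 8):                                    # structure identification: candidate model classes
    coeffs = np.polyfit(u[train], y[train], order)           # parameter identification for this structure
    err = np.mean((np.polyval(coeffs, u[check]) - y[check]) ** 2)  # validation test on unseen data
    if err < best_err:
        best_order, best_err = order, err

print("selected model order:", best_order, "validation MSE:", round(best_err, 5))
```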
While we know a great deal about the structures of most engineering systems and industrial processes, in the vast majority of target systems where we apply data-mining techniques these structures are totally unknown, or they are so complex that it is impossible to obtain an adequate mathematical model. Therefore, new techniques were developed for parameter identification, and they are today a part of the spectrum of data-mining techniques. The term pattern has different meanings in different communities. In pattern recognition it refers to the vector of measurements characterizing a particular object, which is a point in a multidimensional data space. In data mining, a pattern is simply a local model.
In this book we refer to n-dimensional vectors of data as samples. Data mining is a process of discovering various models, summaries, and derived values from a given collection of data. Even in some professional environments there is a belief that data mining simply consists of picking and applying a computer-based tool to match the presented problem and automatically obtaining a solution. This is a misconception based on an artificial idealization of the world. There are several reasons why this is incorrect. One reason is that data mining is not simply a collection of isolated tools, each completely different from the other and waiting to be matched to the problem.
A second reason lies in the notion of matching a problem to a technique. Only very rarely is a research question stated sufficiently precisely that a single and simple application of the method will suffice. In fact, what happens in practice is that data mining becomes an iterative process. One studies the data, examines it using some analytic technique, decides to look at it another way, perhaps modifying it, and then goes back to the beginning and applies another data-analysis tool, reaching either better or different results.
This can go around many times; each technique is used to probe slightly different aspects of data—to ask a slightly different question of the data. What is essentially being described here is a voyage of discovery that makes modern data mining exciting. Still, data mining is not a random application of statistical and machine-learning methods and tools. It is important to realize that the problem of discovering or estimating dependencies from data or discovering totally new data is only one part of the general experimental procedure used by scientists, engineers, and others who apply standard steps to draw conclusions from the data.
The general experimental procedure adapted to data-mining problems involves the following steps:
1. State the problem and formulate the hypothesis.
2. Collect the data.
3. Preprocess the data.
4. Estimate the model.
5. Interpret the model and draw conclusions.
The first step is to state the problem and formulate the hypothesis. Most data-based modeling studies are performed in a particular application domain. Hence, domain-specific knowledge and experience are usually necessary in order to come up with a meaningful problem statement. Unfortunately, many application studies tend to focus on the data-mining technique at the expense of a clear problem statement.
In this step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There may be several hypotheses formulated for a single problem at this stage. The first step requires combined expertise in the application domain and in data-mining modeling. In practice, it usually means a close interaction between the data-mining expert and the application expert.
In successful data-mining applications, this cooperation does not stop in the initial phase; it continues during the entire data-mining process. The second step, collecting the data, is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler); this is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process; this is known as an observational setting. An observational setting, namely, random data generation, is assumed in most data-mining applications.
Typically, the sampling distribution is completely unknown after data are collected, or it is partially and implicitly given in the data-collection procedure. It is very important, however, to understand how data collection affects its theoretical distribution, since such a priori knowledge can be very useful for modeling and, later, for the final interpretation of results.
Also, it is important to make sure that the data used for estimating a model and the data used later for testing and applying a model come from the same unknown sampling distribution. If this is not the case, the estimated model cannot be successfully used in a final application of the results. The third step is data preprocessing, which usually includes at least two common tasks: outlier detection (and removal) and scaling, encoding, and selection of features. Outliers are unusual data values that are not consistent with most observations; such nonrepresentative samples can seriously affect the model produced later. There are two strategies for dealing with outliers: detect and eventually remove them as a part of the preprocessing phase, or develop robust modeling methods that are insensitive to them. Features in a data set may also differ widely in scale and units; therefore, it is recommended to scale them and bring all features to the same weight for further analysis.
Also, application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling. These two classes of preprocessing tasks are only illustrative examples of a large spectrum of preprocessing activities in a data-mining process. Data-preprocessing steps should not be considered as completely independent from other data-mining phases.
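A minimal sketch of the two preprocessing tasks named above, assuming an illustrative z-score rule for flagging outliers and simple standardization for scaling (neither the rule nor the threshold is prescribed by the text), is given below.

```python
# Illustrative preprocessing sketch: z-score outlier filtering and feature scaling (NumPy assumed).
import numpy as np

data = np.array([
    [1.2, 200.0], [0.9, 180.0], [1.1, 210.0], [1.0, 190.0],
    [1.3, 205.0], [0.8, 185.0], [1.1, 195.0], [1.0, 200.0],
    [9.0, 195.0],   # the first feature of this sample is far outside the usual range
])

# Outlier detection: flag any sample more than 2 standard deviations from a feature's mean.
z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))
clean = data[(z < 2).all(axis=1)]

# Scaling: bring both features to the same weight (zero mean, unit variance).
scaled = (clean - clean.mean(axis=0)) / clean.std(axis=0)
print("kept samples:", clean.shape[0], "of", data.shape[0])
print(scaled.round(2))
```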
In every iteration of the data-mining process, all activities, together, could define new and improved data sets for subsequent iterations. Generally, a good preprocessing method provides an optimal representation for a data-mining technique by incorporating a priori knowledge in the form of application-specific scaling and encoding. More about these techniques and the preprocessing phase in general will be given in Chapters 2 and 3, where we have functionally divided preprocessing and its corresponding techniques into two subphases: data preparation and data reduction. The fourth step is to estimate the model. The selection and implementation of the appropriate data-mining technique is the main task in this phase.
This process is not straightforward; usually, in practice, the implementation is based on several models, and selecting the best one is an additional task.
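In code, this often amounts to estimating several candidate models on the same data and comparing their validated performance. The sketch below is only illustrative: scikit-learn, the wine data set, and the three candidate models are assumed choices, and cross-validated accuracy is just one possible selection criterion.

```python
# Illustrative model-selection sketch; scikit-learn and the candidate models are assumed choices.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Estimate each candidate on the same data and keep the one with the best cross-validated accuracy.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("selected model:", best)
```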
The basic principles of learning and discovery from data are given in Chapter 4 of this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are applied to perform a successful learning process from data and to develop an appropriate model. The final step is to interpret the model and draw conclusions.
In most cases, data-mining models should help in decision making. Note that the goals of accuracy of the model and accuracy of its interpretation are somewhat contradictory. Usually, simple models are more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models, yet a user does not want hundreds of pages of numerical results: he or she cannot understand, summarize, interpret, or use them for successful decision making.
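The tension between accuracy and interpretability can be seen even in a toy comparison. The following sketch, with scikit-learn as an assumed tool and an arbitrary bundled data set, contrasts a shallow decision tree, whose few rules can be read directly, with a larger ensemble that typically scores better but cannot be summarized as compactly.

```python
# Illustrative sketch of the interpretability/accuracy trade-off; scikit-learn is an assumed tool.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

simple = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
ensemble = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("shallow tree accuracy:", round(simple.score(X_test, y_test), 3))
print("random forest accuracy:", round(ensemble.score(X_test, y_test), 3))

# The shallow tree can be printed as a handful of human-readable rules;
# the 200-tree forest usually scores higher but has no comparably compact interpretation.
print(export_text(simple))
```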
Even though the focus of this book is on steps 3 and 4 in the data-mining process, we have to understand that they are just two steps in a more complex process. All phases, separately, and the entire data-mining process, as a whole, are highly iterative, as shown in Figure 1. A good understanding of the whole process is important for any successful application. No matter how powerful the data-mining method used in step 4 is, the resulting model will not be valid if the data are not collected and preprocessed correctly, or if the problem formulation is not meaningful.
Our ability to analyze and understand massive data sets (what we call large data) is far behind our ability to gather and store them. Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data-acquisition technologies, have made it possible to gather and store incredible volumes of data. Large databases of digital information are ubiquitous. Scientific instruments can easily generate terabytes of data in a short period of time and store them in the computer.
One example is the hundreds of terabytes of DNA, protein-sequence, and gene-expression data that biological science researchers have gathered at steadily increasing rates. The information age, with the expansion of the Internet, has caused an exponential growth in information sources and also in information-storage units. An illustrative example, the growth of Internet hosts, is given in Figure 1. It is estimated that the digital universe consumed approximately exabytes in , and it is projected to be 10 times that size by . Inexpensive digital and video cameras have made available huge archives of images and videos.
The prevalence of radio frequency identification (RFID) tags, or transponders, due to their low cost and small size, has resulted in the deployment of millions of sensors that transmit data regularly. E-mails, blogs, transaction data, and billions of Web pages create terabytes of new data every day. There is a rapidly widening gap between data-collection and data-organization capabilities and the ability to analyze the data. Current hardware and database technology allows efficient, inexpensive, and reliable data storage and access.
However, whether the context is business, medicine, science, or government, the data sets themselves, in their raw form, are of little direct value. What is of value is the knowledge that can be inferred from the data and put to use.
This knowledge can be used to introduce new, targeted marketing campaigns with a predictable financial return, as opposed to unfocused campaigns. The root of the problem is that the data size and dimensionality are too large for manual analysis and interpretation, or even for some semiautomatic computer-based analyses. A scientist or a business manager can work effectively with a few hundred or thousand records.
Effectively mining millions of data points, each described with tens or hundreds of characteristics, is another matter. What are the solutions? Can you simply work harder with classical tools? Yes, but how long can you keep up when the limits are very close? Can you keep hiring more analysts and buying more powerful machines? Maybe, if you can afford it. But then you are not competitive in the market. The only real solution will be to replace classical data analysis and interpretation methodologies (both manual and computer-based) with a new data-mining technology. In theory, most data-mining methods should be happy with large data sets. Large data sets have the potential to yield more valuable information. If data mining is a search through a space of possibilities, then large data sets suggest many more possibilities to enumerate and evaluate.
The potential for increased enumeration and search is counterbalanced by practical limitations. Besides the computational complexity of the data-mining algorithms that work with large data sets, a more exhaustive search may also increase the risk of finding some low-probability solutions that evaluate well for the given data set, but may not meet future expectations.
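This risk can be demonstrated on entirely synthetic data: when thousands of random candidate features are searched, one of them will, by chance alone, correlate well with the target on the given (small) data set, yet show no relationship on new data. The sketch below is a hypothetical illustration of that effect, not a method from the text.

```python
# Sketch of how an exhaustive search over many random candidates finds a spuriously "good" one.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_candidates = 30, 1000, 5000

y_train = rng.normal(size=n_train)        # target with no real relationship to any candidate
y_test = rng.normal(size=n_test)

# Evaluate thousands of random candidate features on the small training set.
best_corr, best_seed = 0.0, None
for seed in range(n_candidates):
    x = np.random.default_rng(seed).normal(size=n_train)
    corr = abs(np.corrcoef(x, y_train)[0, 1])
    if corr > best_corr:
        best_corr, best_seed = corr, seed

# The winning candidate looks informative on the given data set ...
print("best training correlation:", round(best_corr, 2))

# ... but the same candidate, evaluated on new data, shows no real relationship.
x_new = np.random.default_rng(best_seed).normal(size=n_test)
print("correlation on new data:", round(abs(np.corrcoef(x_new, y_test)[0, 1]), 2))
```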
To prepare adequate data-mining methods, we have to analyze the basic types and characteristics of data sets. The first step in this analysis is systematization of data with respect to their computer representation and use.
Data that are usually the source for a data-mining process can be classified into structured data, semi-structured data, and unstructured data. Structured data consist of well-defined fields and are most often presented in a tabular form. Examples of semi-structured data are electronic images of business documents, medical reports, executive summaries, and repair manuals. The majority of Web documents also fall into this category.
An example of unstructured data is a video recorded by a surveillance camera in a department store. Such visual and, in general, multimedia recordings of events or processes of interest are currently gaining widespread popularity because of reduced hardware costs. This form of data generally requires extensive processing to extract and structure the information contained in it.
Structured data are often referred to as traditional data, while semi-structured and unstructured data are lumped together as nontraditional data (also called multimedia data).
Most of the current data-mining methods and commercial tools are applied to traditional data. However, the development of data-mining tools for nontraditional data, as well as interfaces for its transformation into structured formats, is progressing at a rapid rate. The standard model of structured data for data mining is a collection of cases.
Potential measurements, called features, are specified, and these features are uniformly measured over many cases. Usually the representation of structured data for data-mining problems is in a tabular form, or in the form of a single relation (the term used in relational databases), where columns are features of objects stored in a table and rows are values of these features for specific entities. A simplified graphical representation of a data set and its characteristics is given in Figure 1.
In the data-mining literature, we usually use the terms samples or cases for rows. Many different types of features (attributes or variables), that is, fields, in structured data records are common in data mining.
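A minimal sketch of such a tabular data set, using pandas as one possible (assumed) tool, with columns as features of mixed types and rows as samples:

```python
# Illustrative tabular data set: each column is a feature, each row a sample (pandas is an assumed tool).
import pandas as pd

samples = pd.DataFrame({
    "age": [34, 52, 41, 29],                          # numeric feature
    "income": [48000, 81000, 63000, 39000],           # numeric feature
    "region": ["east", "west", "east", "south"],      # categorical feature
    "default": ["no", "yes", "no", "no"],             # a possible dependent (output) variable
})

print(samples)          # rows are samples/cases, columns are features
print(samples.dtypes)   # stored data type of each feature
```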
Not all of the data-mining methods are equally good at dealing with different types of features. There are several ways of characterizing features. One way of looking at a feature (or, in a formalization process, the more often used term, variable) is to see whether it is an independent variable or a dependent variable, that is, whether or not it is a variable whose values depend upon values of other variables represented in a data set.
This is a model-based approach to classifying variables. All dependent variables are accepted as outputs from the system for which we are establishing a model, and independent variables are inputs to the system, as represented in Figure 1. There are some additional variables that influence system behavior, but the corresponding values are not available in a data set during a modeling process.
The reasons differ from case to case, but the effect is the same: a real system, besides input (independent) variables X and output (dependent) variables Y, often has unobserved inputs Z.
These are usually called unobserved variables, and they are the main cause of ambiguities and estimations in a model. Large data sets, including those with mixed data types, are a typical initial environment for application of data-mining techniques. When a large amount of data is stored in a computer, one cannot rush into data-mining techniques, because the important problem of data quality has to be resolved first. Also, it is obvious that a manual quality analysis is not possible at that stage.
Therefore, it is necessary to prepare a data-quality analysis in the earliest phases of the data-mining process; usually it is a task to be undertaken in the data-preprocessing phase. The quality of data could limit the ability of end users to make informed decisions. It has a profound effect on the image of the system and determines the corresponding model that is implicitly described. Using the available data-mining techniques, it will be difficult to undertake major qualitative changes in an organization based on poor-quality data; also, to make sound new discoveries from poor-quality scientific data will be almost impossible.
There are a number of indicators of data quality that have to be taken care of in the preprocessing phase of a data-mining process:
1. The data should be accurate. The analyst has to check that the name is spelled correctly, the code is in a given range, the value is complete, and so on.
2. The data should be stored according to data type. The analyst must ensure that the numerical value is not presented in character form, that integers are not in the form of real numbers, and so on.
3. The data should have integrity. Updates should not be lost because of conflicts among different users; robust backup and recovery procedures should be implemented if they are not already part of the database management system (DBMS).
4. The data should be consistent. The form and the content should be the same after integration of large data sets from different sources.
5. The data should not be redundant. In practice, redundant data should be minimized, and reasoned duplication should be controlled, or duplicated records should be eliminated.
6. The data should be timely. The time component of data should be recognized explicitly from the data or implicitly from the manner of its organization.
7. The data should be well understood. Naming standards are a necessary but not the only condition for data to be well understood.
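Several of these indicators can be checked mechanically during preprocessing. The sketch below is a hypothetical illustration: the column names, value ranges, and rules are invented for the example and are not taken from the text.

```python
# Hypothetical data-quality checks for accuracy, type conformance, redundancy, and completeness.
import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "age": [34, -5, 41, None],                                    # -5 is out of range, None is missing
    "signup_date": ["2020-03-01", "2020-04-15", "2020-04-15", "not a date"],
})

issues = {
    "out-of-range age": int(((records["age"] < 0) | (records["age"] > 120)).sum()),
    "missing values": int(records.isna().sum().sum()),
    "duplicate customer_id": int(records["customer_id"].duplicated().sum()),
    "unparseable dates": int(pd.to_datetime(records["signup_date"], errors="coerce").isna().sum()),
}
print(issues)
```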