Purpose of this paper

Published: November 30, 2015 Words: 2595

Introduction

The purpose of this paper is to present the concept of knowledge creation by outlining its definition and use, while also considering the state of the art in knowledge discovery in databases (KDD). It is not intended to provide an in-depth introduction to each approach; rather, it aims to acquaint the reader with KDD approaches and their potential uses. The paper begins with a brief treatment of knowledge creation, followed by knowledge discovery.

Knowledge creation can be defined as the formation of new ideas through interactions between explicit and tacit knowledge in individual human minds. As defined by Ikujiro Nonaka, it consists of socialization (tacit to tacit), externalization (tacit to explicit), combination (explicit to explicit), and internalization (explicit to tacit). The phrase knowledge discovery in databases, on the other hand, was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro, 1991) to emphasize that knowledge is the end product of a data-driven discovery, and it has been popularized in the Artificial Intelligence (AI) and machine-learning fields. Some view KDD as the overall process of discovering useful knowledge from data, while data mining refers to a particular step in this process. In the KDD process, additional steps such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the mining results are essential to ensure that useful knowledge is derived from the data.

Currently, the amount of data being collected in databases far exceeds one's ability to reduce and analyze it without automated analysis techniques. Many scientific and transactional business databases grow at a phenomenal rate; for example, a single system such as the astronomical survey application SCICAT is expected to exceed 3 terabytes of data at completion (Fayyad et al, 1996). KDD is the field evolving to provide automated analysis solutions.

Having considered knowledge creation, one now turns to knowledge discovery. It can be defined as ``the non-trivial extraction of implicit, unknown, and potentially useful information from data'' (Frawley et al, 1991). Fayyad et al (1996) draw a clearer distinction between data mining and knowledge discovery. Under their conventions, the knowledge discovery process takes the raw results from data mining (the process of extracting trends or patterns from data) and carefully and accurately transforms them into useful and understandable information. Typically, this information is not retrievable by standard techniques but is uncovered through the use of AI techniques. KDD is a growing field, and there are many knowledge discovery methodologies in use and under development; some are generic, while others are domain-specific.

Knowledge Discovery in Databases or KDD

What does knowledge discovery mean in the context of information systems? Traditionally, the emphasis has been on an individual's role in gathering information and creating new knowledge. Anyone who does literature-based research can therefore appreciate the aphorism of Georg Christoph Lichtenberg, an 18th-century German physicist better known for his wit: "Lesen heisst borgen, daraus erfinden abtragen" [To read means to borrow; to create out of one's readings is paying off one's debts]. While information retrieval systems may be used to locate relevant documents, most researchers accomplish the subsequent knowledge discovery process on their own. The field of AI, where discussion of knowledge discovery dates to the 1980s, offers a different vision for the future. Edward Feigenbaum ("Toward the Library of the Future," Long Range Planning, 22(1):122, 1989) contrasts the libraries of today ("warehouses of passive objects" where "books and journals sit on shelves waiting for us to use our intelligence to find them, to interpret them, and cause them finally to divulge their stored knowledge") with a library of the future in which "books" would interact and collaborate with the user. This knowledge system would involve an intelligent computer agent interacting with one or more people. About ten years later, work on knowledge discovery, capture and creation seeks to find new ways to foster such human-computer and human-human collaboration.

Data as raw material: data are the smallest units of measure. The word 'data' is technically the plural of datum but is often used as a singular. Data are the components of information; they may be the 1's and 0's of computer memory, names and addresses in a demographic file, or the raw facts and figures before interpretation. They are stored in databases, and data processing is the electronic manipulation of data. Data mining, or knowledge discovery in databases (KDD), involves manipulation of data from structured databases. A variety of methods are used to evaluate data for relevant relationships that could yield new knowledge; the intent is to find valid, novel, potentially useful and ultimately understandable patterns in data. The goals of data mining can include prediction and description. Prediction makes use of existing variables in the database to predict unknown or future values of interest, while description focuses on finding patterns in the data for subsequent presentation and user interpretation. The work thus draws on machine learning, pattern recognition, statistics and visualization techniques.

The concept of knowledge discovery can be better understood through the study introduced by University of Illinois professor Linda Smith. In Knowledge Discovery, Capture and Creation, she discusses what knowledge discovery means in the context of information systems and outlines the range of techniques being used to support it. She introduces two categorizations of knowledge that can serve as a framework for considering what types of knowledge are found through knowledge discovery techniques and what is omitted. In her work, she discusses text as raw material and data as raw material, and asserts that knowledge management in business settings is concerned with knowledge capture: finding ways to make tacit knowledge explicit, or creating expert directories to foster knowledge sharing through human-human collaboration.

Considering KDD, although there are many approaches to it, six common and essential elements qualify each as a knowledge discovery technique. The following basic features are shared by all KDD techniques (adapted from Fayyad et al, 1996 and Frawley, 1991):

For these methods, large amounts of data are required to provide sufficient information from which to derive additional knowledge. Because so much data is required, processing efficiency is essential to obtain good results. Foremost among the requirements is accuracy, which is essential to assure that discovered knowledge is valid. In addition, the results should be presented in a manner that is understandable by humans. A major premise of KDD is that knowledge is discovered using intelligent learning techniques that sift through the data in an automated process. Finally, for a technique to be considered useful in terms of knowledge discovery, the discovered knowledge must have potential value to the user.

KDD provides the capability to discover new and meaningful information by using existing data. It exceeds the human capacity to analyze large data sets. The amount of data that requires processing and analysis in a large database exceeds human capabilities, and the difficulty of accurately transforming raw data into knowledge surpasses the limits of traditional databases. Therefore, the full utilization of stored data depends on the use of knowledge discovery techniques.

The potential future applications of KDD are far-reaching. KDD may be used as a means of information retrieval, in the same manner that intelligent agents perform information retrieval on the web. It is envisaged that new patterns or trends in data may be discovered using these techniques. KDD may also be used as a basis for the intelligent interfaces of tomorrow, by adding a knowledge discovery component to a database engine or by integrating KDD with spreadsheets and visualizations.

The KDD process involves using the database along with any required selection, pre-processing, subsampling, and transformations of it; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of enumerated patterns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge.
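The stages above (selection and cleaning, transformation, data mining, evaluation) can be sketched as a minimal pipeline. The records, derived attribute, and usefulness threshold below are illustrative assumptions, not drawn from any published system:

```python
# Minimal sketch of the KDD process: selection -> transformation ->
# data mining -> evaluation. All data and thresholds are illustrative.

raw_records = [
    {"age": 25, "income": 30000, "bought": True},
    {"age": 40, "income": 90000, "bought": True},
    {"age": None, "income": 50000, "bought": False},  # incomplete record
    {"age": 35, "income": 70000, "bought": True},
]

# 1. Selection / cleaning: drop records with missing attributes.
selected = [r for r in raw_records if None not in r.values()]

# 2. Transformation: derive a discrete attribute for mining.
for r in selected:
    r["high_income"] = r["income"] >= 60000

# 3. Data mining: enumerate a simple pattern (co-occurrence rate).
matches = [r for r in selected if r["high_income"] and r["bought"]]
support = len(matches) / len(selected)

# 4. Evaluation: keep the pattern only if it clears a usefulness threshold.
pattern_is_knowledge = support >= 0.5
print(support, pattern_is_knowledge)
```

Each stage feeds the next, which is why errors in selection or cleaning propagate into the mined patterns.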

KDD Techniques

Learning algorithms are an integral part of KDD. These techniques may either be supervised or unsupervised. In general, it is found that supervised learning techniques enjoy a better success rate as defined in terms of usefulness of discovered knowledge. According to (Brachman & Anand, 1996), learning algorithms are complex and generally considered the hardest part of any KDD technique.
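The supervised/unsupervised contrast can be illustrated with a deliberately tiny sketch: a nearest-neighbour predictor that uses known labels, against a clustering step that uses none. The data points, labels, and splitting rule are invented for illustration:

```python
# Toy contrast between supervised and unsupervised learning.
# Data, labels, and the distance measure are illustrative assumptions.

points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
labels = ["low", "low", "low", "high", "high", "high"]

def predict_supervised(x):
    """1-nearest-neighbour: relies on the known labels (supervision)."""
    nearest = min(range(len(points)), key=lambda i: abs(points[i] - x))
    return labels[nearest]

def cluster_unsupervised(xs):
    """Split points around the midpoint of the range: no labels used."""
    cut = (min(xs) + max(xs)) / 2
    return [0 if x < cut else 1 for x in xs]

print(predict_supervised(1.1))        # -> "low"
print(cluster_unsupervised(points))   # -> [0, 0, 0, 1, 1, 1]
```

The supervised routine can state what a new point means ("low"), while the unsupervised one only reports structure, which echoes why supervised techniques tend to yield more directly useful knowledge.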

Machine discovery is considered one of the earliest fields that have contributed to KDD (Fayyad et al, 1996). While machine discovery relies solely on an autonomous approach to information discovery, KDD typically combines automated approaches with human interaction to assure accurate, useful, and understandable results.

There are many different approaches that are classified as KDD techniques. Among them are the quantitative approaches, such as the probabilistic and statistical approaches and also approaches that utilize visualization techniques. There are classification approaches such as Bayesian classification, inductive logic, data cleaning/pattern discovery, and decision tree analysis. Other known approaches include deviation and trend analysis, genetic algorithms, neural networks, and hybrid approaches that combine two or more techniques.

Because of the ways that these techniques can be used and combined, there is some disagreement on how these techniques should be categorized. For example, the Bayesian approach may be logically grouped with probabilistic approaches, classification approaches, or visualization approaches. For the sake of organization, each approach described here is included in the group that it seemed to fit best. However, this selection is not intended to imply a strict categorization.

Probabilistic Approach

These techniques utilize graphical representation models to compare different knowledge representations and are based on probabilities and data independencies. They are useful for applications involving uncertainty and for applications structured such that a probability may be assigned to each ``outcome'' or bit of discovered knowledge. Probabilistic techniques may be used in diagnostic systems and in planning and control systems (Buntine, 1996). Automated probabilistic tools are available both commercially and in the public domain.

Statistical Approach

The statistical approach uses rule discovery and is based on data relationships. An ``inductive learning algorithm can automatically select useful join paths and attributes to construct rules from a database with many relations'' (Hsu & Knoblock, 1996). In this type of approach, induction is used to generalize patterns in the data and to construct rules from the noted patterns. Online analytical processing (OLAP) is an example of a statistically-oriented approach. Automated statistical tools are available both commercially and in the public domain.

An example of a statistical application is determining that all transactions in a sales database that start with a specified transaction code are cash sales. The system would note that, of all the transactions in the database, only 60% are cash sales; it may therefore accurately conclude that the remaining 40% are collectibles.
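The cash-sale example above amounts to inducing a proportion from transaction records. A minimal sketch, with made-up transaction codes standing in for the "specified transaction code":

```python
# Sketch of the cash-sale example: induce proportions from records.
# The transaction codes ("CS-" for cash sales) are invented.

transactions = [
    {"code": "CS-101"}, {"code": "CS-102"}, {"code": "CS-103"},
    {"code": "CR-201"}, {"code": "CR-202"},
]

cash = [t for t in transactions if t["code"].startswith("CS-")]
cash_share = len(cash) / len(transactions)   # 3/5 -> 60% cash sales
other_share = 1 - cash_share                 # -> 40% collectibles
print(cash_share, other_share)
```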

Classification Approach

Classification is probably the oldest and most widely-used of all the KDD approaches (Quinlan, 1993). This approach groups data according to similarities or classes. There are many types of classification techniques and numerous automated tools available.

The Bayesian approach to KDD ``is a graphical model that uses directed arcs exclusively to form an [sic] directed acyclic graph'' (Buntine, 1996). Although the Bayesian approach uses probabilities and a graphical means of representation, it is also considered a type of classification.

Bayesian networks are typically used when the uncertainty associated with an outcome can be expressed in terms of a probability. This approach relies on encoded domain knowledge and has been used for diagnostic systems. Other pattern recognition applications, including the Hidden Markov Model, can be modelled using a Bayesian approach (Buntine, 1996). Automated tools are available both commercially and in the public domain.
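The core of the Bayesian approach, expressing the uncertainty of an outcome as a probability conditioned on evidence, can be sketched with Bayes' rule over a toy diagnostic table. The symptom/condition data are invented for illustration and do not come from any cited system:

```python
# Minimal Bayes'-rule calculation over a toy diagnostic data set.
# The (symptom, condition) training pairs are illustrative assumptions.
from collections import Counter

data = [("fever", "flu"), ("fever", "flu"), ("fever", "cold"),
        ("cough", "cold"), ("cough", "cold"), ("cough", "flu")]

def posterior(symptom, condition):
    """P(condition | symptom) via Bayes' rule on observed counts."""
    counts = Counter(c for _, c in data)
    joint = sum(1 for s, c in data if s == symptom and c == condition)
    likelihood = joint / counts[condition]                # P(symptom | condition)
    prior = counts[condition] / len(data)                 # P(condition)
    evidence = sum(1 for s, _ in data if s == symptom) / len(data)  # P(symptom)
    return likelihood * prior / evidence

print(round(posterior("fever", "flu"), 3))  # -> 0.667
```

A Bayesian network generalizes this to many variables by encoding their conditional independencies in the directed acyclic graph mentioned above.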

Pattern discovery and data cleaning is another type of classification that systematically reduces a large database to a few pertinent and informative records (Guyon, 1996). If redundant and uninteresting data are eliminated, the task of discovering patterns in the data is simplified. This approach works on the premise of the old adage ``less is more''. Pattern discovery and data cleaning techniques are useful for reducing enormous volumes of application data, such as those encountered when analyzing automated sensor recordings. Once the sensor readings are reduced to a manageable size using a data cleaning technique, the patterns in the data may be more easily recognized. Automated tools using these techniques are available both commercially and in the public domain.
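The sensor-recording scenario above can be sketched as a reduction step that discards near-duplicate readings while preserving anomalies. The readings and tolerance are invented for illustration:

```python
# Sketch of data cleaning as reduction: collapse redundant sensor
# readings so the interesting pattern stands out. Values are invented.

readings = [20.0, 20.1, 20.0, 20.1, 35.5, 20.0, 20.1]

def clean(values, tolerance=0.5):
    """Drop consecutive readings that barely differ from the last kept one."""
    kept = [values[0]]
    for v in values[1:]:
        if abs(v - kept[-1]) > tolerance:
            kept.append(v)
    return kept

print(clean(readings))  # -> [20.0, 35.5, 20.0]
```

Seven readings collapse to three, and the 35.5 anomaly, the pattern worth discovering, survives the reduction.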

The decision tree approach uses production rules, builds a directed acyclic graph based on data premises, and classifies data according to its attributes. This method requires that data classes be discrete and predefined (Quinlan, 1993). According to Fayyad et al (1996), the primary use of this approach is for predictive models that may be appropriate for either classification or regression techniques. Tools for decision tree analysis are available commercially and in the public domain.
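A decision tree is just production rules arranged so each attribute test branches toward a discrete, predefined class. A hand-built sketch (the attributes and classes are invented, and a real system would induce the tree from data rather than hard-code it):

```python
# Hand-built decision tree over discrete, predefined classes.
# Attributes ("outlook", "humidity", "windy") and rules are illustrative.

def classify(record):
    """Production rules arranged as a small decision tree."""
    if record["outlook"] == "sunny":
        if record["humidity"] == "high":
            return "stay-in"
        return "play"
    return "stay-in" if record["windy"] else "play"

print(classify({"outlook": "sunny", "humidity": "normal"}))  # -> play
print(classify({"outlook": "rain", "windy": True}))          # -> stay-in
```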

Deviation and Trend Analysis

Pattern detection by filtering important trends is the basis for this KDD approach. Deviation and trend analysis techniques are normally applied to temporal databases. A good application for this type of KDD is the analysis of traffic on large telecommunications networks.

The US telecommunications company AT&T uses such a system to locate and identify circuits that exhibit deviations (faulty behaviour) (Sasisekharan et al, 1996). The sheer volume of data requiring analysis makes an automated technique imperative. Trend-type analysis might also prove useful for astronomical and oceanographic data, which are time-based and voluminous. Public domain tools are available for this approach.
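A minimal sketch of deviation detection on a temporal series: flag points that stray from a trailing-window mean. The traffic figures, window size, and threshold are invented, not taken from the AT&T system:

```python
# Sketch of deviation detection on a temporal series: flag values
# that stray from a trailing-window mean. All numbers are invented.

def deviations(series, window=3, threshold=5.0):
    """Indices whose value deviates from the mean of the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        mean = sum(series[i - window:i]) / window
        if abs(series[i] - mean) > threshold:
            flagged.append(i)
    return flagged

traffic = [100, 102, 101, 99, 150, 100, 103]  # circuit load per hour
print(deviations(traffic))  # -> [4, 5, 6]
```

Note that the spike at index 4 also perturbs the trailing mean for the next two points, so a production system would typically combine this with a more robust baseline.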

Hybrid Approach

A hybrid approach to KDD combines more than one approach and is also called a multi-paradigmatic approach. Although implementation may be more difficult, hybrid tools are able to combine the strengths of various approaches. Some commonly used methods combine visualization techniques, induction, neural networks, and rule-based systems to achieve the desired knowledge discovery. Deductive databases and genetic algorithms have also been used in hybrid approaches. Hybrid tools are available commercially and in the public domain.

Other Approaches

Neural networks may also be used as a method of knowledge discovery. They are particularly useful for pattern recognition and are sometimes grouped with the classification approaches. Tools are available both commercially and in the public domain. Genetic algorithms, also used for classification, are similar to neural networks although they are typically considered more powerful. Tools for the genetic approach are available commercially.
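The smallest neural network, a single perceptron, already shows the pattern-recognition flavour of the approach: weights are adjusted from labelled examples until the learned rule classifies them correctly. The training data (the OR pattern), learning rate, and epoch count are illustrative choices:

```python
# Toy single-neuron (perceptron) learner, as a sketch of the
# neural-network approach. Data, rate, and epochs are illustrative.

def train_perceptron(samples, epochs=20, rate=0.1):
    """Adjust two weights and a bias from labelled (inputs, target) pairs."""
    w = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + bias > 0 else 0
            err = target - out
            w[0] += rate * err * x1
            w[1] += rate * err * x2
            bias += rate * err
    return w, bias

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(or_data)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
print([predict(*x) for x, _ in or_data])  # -> [0, 1, 1, 1]
```

Real KDD tools use multi-layer networks trained by backpropagation, but the same learn-from-examples loop is at their core.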

Conclusions

KDD is a rapidly expanding field that promises great applicability, and knowledge discovery purports to be the new database technology for the coming years. The need for automated discovery tools has caused an explosion in the number and type of tools available commercially and in the public domain. The Software web site (Piatetsky-Shapiro, n.d.) is updated frequently and is intended to be an exhaustive listing of currently available KDD tools.

It is anticipated that commercial database systems of the future will include KDD capabilities in the form of intelligent database interfaces, and some types of information retrieval may benefit from the use of KDD techniques. Due to the potential applicability of knowledge discovery in so many diverse areas, there are growing research opportunities in this field. Many of these opportunities are discussed in (Piatetsky-Shapiro, n.d.), a newsletter with regular contributions from many of the best-known authors of KDD literature.

References