Data Mining And Data Warehousing Information Technology Essay


The concept of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs it incurred. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same data. The process of gathering, cleaning, and integrating data from various sources, usually long-standing operational systems (usually referred to as legacy systems), was typically replicated in part for each environment. Moreover, the operational systems were frequently re-examined as new decision support requirements emerged. Often, new requirements necessitated gathering, cleaning, and integrating new data from the operational systems that were logically related to previously gathered data.

Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture; rather, multiple architectures exist to support various environments and situations. The worthiness of an architecture can be judged by how well the conceptualization aids the building, maintenance, and usage of the data warehouse.

One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:

Operational database layer:

The source data for the data warehouse - An organization's Enterprise Resource Planning systems fall into this layer.

Data access layer:

The interface between the operational and informational access layer - Tools to extract, transform, load data into the warehouse fall into this layer.

Metadata layer:

The data directory - This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

Informational access layer:

The data accessed for reporting and analysis, and the tools for reporting and analyzing data - business intelligence tools fall into this layer. The well-known Inmon-Kimball differences over design methodology concern this layer.

MULTIDIMENSIONAL DATA MODEL:

1. Conceptual modeling of data warehouses

(a) Star schema

(b) Snowflake schema

2. OLAP operations

1. STAR SCHEMA:

The star schema (sometimes referred to as a star join schema) is the simplest style of data warehouse schema. It consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.

STAR MODEL:

Dimension tables have a simple primary key, while fact tables have a compound primary key composed of the relevant dimension keys.

It is common for dimension tables to consolidate redundant data and be in second normal form, while fact tables are usually in third normal form because all data depend on either one dimension or all of them, not on combinations of a few dimensions.

The star schema is a way to implement a multi-dimensional database (MDDB) using ordinary relational tables.

Another reason for using a star schema is its simplicity from the users' point of view: queries tend to be simple because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies on other tables that are possible in a more normalized snowflake schema.

EXAMPLE:
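The original figure is not reproduced here; as a stand-in, the following is a minimal sketch of a star schema for retail sales in Python with sqlite3. All table and column names (fact_sales, dim_date, dim_product, dim_store, and so on) are invented for illustration, not taken from any particular system.

import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Three dimension tables, each with a simple primary key.
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- One central fact table whose compound primary key is composed of the dimension keys.
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units_sold INTEGER,
    revenue    REAL,
    PRIMARY KEY (date_id, product_id, store_id)
);
""")

# A typical star query: one join per dimension, no deeper indirection.
rows = conn.execute("""
    SELECT s.region, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_store s ON f.store_id = s.store_id
    JOIN dim_date  d ON f.date_id  = d.date_id
    GROUP BY s.region, d.month
""").fetchall()

Note how every join radiates directly from the fact table, which is what gives the schema its star shape.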

2. SNOWFLAKE SCHEMA:

A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension represented by a single table.

When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake shape starts to emerge. The "snowflaking" effect affects only the dimension tables, not the fact tables.

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not normalized much, and are frequently designed at a level of normalization short of third normal form.

SNOWFLAKE MODEL:

The decision whether to employ a star schema or a snowflake schema should consider the relative strengths of the database platform in question and the query tool to be employed. Star schemas should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature. Snowflake schemas are often better for more sophisticated query tools that isolate users from the raw table structures, and for environments having numerous queries with complex criteria.

EXAMPLE:
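Again the original figure is omitted; the sketch below reworks the hypothetical product dimension from the star example into normalized product and category tables, which is the characteristic snowflaking step (all names remain invented for illustration).

import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- The category attributes move out of the product dimension into their own
-- table, removing the redundancy a denormalized star dimension would carry.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT, department TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT,
                           category_id INTEGER REFERENCES dim_category(category_id));
CREATE TABLE fact_sales   (product_id  INTEGER REFERENCES dim_product(product_id),
                           units_sold  INTEGER, revenue REAL);
""")

# Reaching category-level attributes now takes an extra join through the
# normalized dimension - the indirect dependency the star schema avoids.
rows = conn.execute("""
    SELECT c.department, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON f.product_id  = p.product_id
    JOIN dim_category c ON p.category_id = c.category_id
    GROUP BY c.department
""").fetchall()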

Benefits:

If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate.

A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will therefore be easier to implement.

A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organization of the database.

Some users may wish to submit queries to the database which, using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snowflaking would typically be required to permit simple query tools to form such a query, especially if provision for these forms of query was not made when the data warehouse was first designed.

3. OLAP OPERATIONS:

Online analytical processing, or OLAP, is an approach to quickly answering multi-dimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting, and similar areas. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad-hoc queries with a rapid execution time. They borrow aspects of navigational databases and hierarchical databases that are faster than relational databases.

1. ROLL UP

2. DRILL DOWN

3. SLICE AND DICE

4. PIVOT

1. ROLL UP:

A roll-up summarizes data by climbing up the concept hierarchy of one or more dimensions, computing higher-level aggregates from the detail data. To do this, a computational relationship or formula (typically a sum or count) might be defined.
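As a minimal sketch (using pandas on an invented sales table; none of these names come from the text), rolling up climbs the location hierarchy from city to region by summing the detail rows:

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "city":    ["Oslo", "Bergen", "Rome", "Milan"],
    "revenue": [100, 150, 200, 250],
})

# Roll up: climb from city to region; the city detail is aggregated away.
by_region = sales.groupby("region", as_index=False)["revenue"].sum()
print(by_region)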

2. DRILL DOWN/UP:

Drilling down or up is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down).

3. SLICE AND DICE:

Slice: A slice is a subset of a multi-dimensional array corresponding to a single value for one or more members of the dimensions not in the subset.

Dice: The dice operation is a slice on more than two dimensions of a data cube (or more than two consecutive slices).
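Continuing the same invented pandas sketch from the roll-up example, a slice fixes a dimension to one member, while a dice constrains several dimensions at once:

# Slice: fix one dimension to a single member.
north_only = sales[sales["region"] == "North"]

# Dice: select a sub-cube by constraining two or more dimensions together.
sub_cube = sales[sales["region"].isin(["North", "South"])
                 & sales["city"].isin(["Oslo", "Rome"])]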

4. PIVOT:

To change the dimensional orientation of a report or page display.
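In the same invented sketch, a pivot rotates one dimension from rows into columns:

# Pivot: rotate the city dimension into the columns of the report.
pivoted = sales.pivot_table(index="region", columns="city",
                            values="revenue", aggfunc="sum")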

OLAP TYPES:

Multidimensional OLAP

Relational OLAP

Hybrid OLAP

1. Multidimensional OLAP:

MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in optimized multi-dimensional array storage, rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube - the operation known as processing.
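As a minimal sketch of the array-storage idea (the dimension sizes and values are invented), a MOLAP engine addresses each measure by its position along the dimensions and precomputes aggregates during processing:

import numpy as np

# Invented cube: revenue indexed by (day, product, store) position.
cube = np.zeros((365, 100, 20))
cube[0, 5, 2] = 199.0          # one cell: day 0, product 5, store 2

# "Processing": precompute aggregates, e.g. total revenue per product.
totals_by_product = cube.sum(axis=(0, 2))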

2. Relational OLAP:

ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information. ROLAP depends on a specialized schema design.
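A minimal sketch of that idea, assuming the invented fact_sales and dim_store tables from the star schema example above (the summary table name is equally hypothetical):

# An ordinary relational table holds the aggregated information, so common
# queries can read it instead of scanning the detail rows.
conn.executescript("""
CREATE TABLE agg_sales_by_region (region TEXT PRIMARY KEY, total_revenue REAL);

INSERT INTO agg_sales_by_region
SELECT s.region, SUM(f.revenue)
FROM fact_sales f
JOIN dim_store s ON f.store_id = s.store_id
GROUP BY s.region;
""")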

3. Hybrid OLAP:

There is no clear agreement across the industry as to what constitutes "Hybrid OLAP", except that a database will divide data between relational and specialized storage. For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data, and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data.

ONLINE TRANSACTION PROCESSING (OLTP):

Online transaction processing, or OLTP, refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions. OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automatic teller machine (ATM) for a bank is an example of a commercial transaction processing application.

The technology is used in a number of industries, including banking, airlines, mail-order, supermarkets, and manufacturing. Applications include electronic banking, order processing, employee time clock systems, e-commerce, and e-Trading. The most widely used OLTP system is probably IBM's CICS.
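The essential property such systems rely on is the atomic transaction: a group of updates that succeeds or fails as a whole. Here is a minimal sketch with Python's sqlite3, using an invented two-account transfer in the spirit of the ATM example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 100.0)])

try:
    with conn:  # one atomic transaction: commit on success, roll back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
except sqlite3.Error:
    pass  # after a rollback, both balances are left unchanged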

Benefits:

Online transaction processing has two key benefits: simplicity and efficiency. Reduced paper trails and faster, more accurate forecasts for revenues and expenses are both examples of how OLTP makes things simpler for businesses.

Disadvantages:

As with any information processing system, security and reliability are considerations. Online transaction systems are generally more susceptible to direct attack and abuse than their offline counterparts. And when organizations choose to rely on OLTP, as with any other technology, operations can be severely impacted by reliability problems.

DATA CLEANING:

Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Typical problems include the following (a small sketch follows the list):

1. Missing values

2. Noisy data

3. Inconsistent data
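A minimal sketch of the first two steps with pandas on an invented table (the column names, median fill, and Tukey-fence clipping are illustrative choices, not prescribed by the text):

import pandas as pd
import numpy as np

raw = pd.DataFrame({"age":    [25, np.nan, 31, 29, 250],      # a gap and an outlier
                    "income": [30000, 42000, np.nan, 39000, 41000]})

# 1. Missing values: fill each gap with the column median.
clean = raw.fillna(raw.median())

# 2. Noisy data: clip values outside the Tukey fences (1.5 x IQR).
q1, q3 = clean["age"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["age"] = clean["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)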

DATA TRANSFORMATION:

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

It can involve the following (a normalisation sketch follows the list):

1. Smoothing

2. Aggregation

3. Normalisation

4. Generalisation

5. Attribute construction
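As one concrete instance, min-max normalisation (item 3) rescales an attribute onto [0, 1]; the income column below is invented:

import pandas as pd

df = pd.DataFrame({"income": [30000, 42000, 39000, 41000]})

# Min-max normalisation: map the attribute linearly onto [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)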

DATA REDUCTION:

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces essentially the same analytical results (a dimensionality-reduction sketch follows the strategy list below).

Strategies:

1. Data cube Aggregation

2. Attribute subset Selection

3. Dimensionality reduction

4. Numerosity reduction
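A minimal sketch of strategy 3, dimensionality reduction, via principal component analysis in plain NumPy (the two-attribute data set is invented):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Centre the data, then project onto the top principal component.
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)   # rows of vt are components
X_reduced = Xc @ vt[:1].T                           # 6 x 1 instead of 6 x 2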

DATA AGGREGATION:

Data aggregation is any process in which information is expressed in summary form for purposes such as reporting or analysis. Ineffective data aggregation is currently a major factor limiting query performance. With up to 90 percent of all reports containing aggregate information, it becomes clear why proactively implementing an aggregation solution can generate significant performance benefits, giving companies the opportunity to enhance their analysis and reporting capabilities.

Seven Criteria for Selecting an Effective Aggregation Solution:

1. Enterprise-class solution

2. Flexible architecture

3. Performance

4. Scalability

5. Fast implementation

6. Efficient use of hardware and software resources

7. Price/performance

DATA MINING:

Data mining is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

While data mining can be used to uncover patterns in data samples, it is important to be aware that the use of non-representative samples of data may produce results that are not indicative of the domain. Similarly, data mining will not find patterns that may be present in the domain, if those patterns are not present in the sample being "mined". There is a tendency for insufficiently knowledgeable "consumers" of the results to attribute "magical abilities" to data mining, treating the technique as a sort of all-seeing crystal ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that pattern is representative of the whole population from which that data was drawn. Hence, an important part of the process is the verification and validation of patterns on other samples of data.

The term data mining has also been used in a related but negative sense, to mean the deliberate searching for apparent but not necessarily representative patterns in large amounts of data. To avoid confusion with the other sense, the terms data dredging and data snooping are often used. Note, however, that dredging and snooping can be (and sometimes are) used as exploratory tools when developing and clarifying hypotheses.

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is a classic instance of associative mining (a small sketch follows this list).

Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
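A minimal sketch of association mining in the beer-diaper spirit: computing support and confidence for one candidate rule over invented transactions, in plain Python:

# Invented market-basket transactions.
baskets = [{"beer", "diapers", "chips"},
           {"beer", "diapers"},
           {"milk", "diapers"},
           {"beer", "chips"}]

n = len(baskets)
both = sum(1 for b in baskets if {"diapers", "beer"} <= b)
diapers = sum(1 for b in baskets if "diapers" in b)

support = both / n            # fraction of all baskets containing both items
confidence = both / diapers   # estimate of P(beer | diapers)
print(f"diapers -> beer: support={support:.2f}, confidence={confidence:.2f}")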

Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology professionals.

Analyze the data with application software.

Present the data in a useful format, such as a graph or table.

LEVELS OF DATA MINING:

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-squared Automatic Interaction Detection (CHAID). Both provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
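As a sketch of the CART idea, scikit-learn's DecisionTreeClassifier builds binary splits of exactly this kind; the tiny training set here is invented:

from sklearn.tree import DecisionTreeClassifier

# Invented records: [age, income]; label 1 = responded to an offer, 0 = did not.
X = [[25, 30000], [40, 60000], [35, 52000], [23, 28000]]
y = [0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)  # CART-style 2-way splits
print(tree.predict([[30, 45000]]))                    # classify a new record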

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
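A matching sketch with scikit-learn's KNeighborsClassifier, reusing the same invented records:

from sklearn.neighbors import KNeighborsClassifier

X = [[25, 30000], [40, 60000], [35, 52000], [23, 28000]]
y = [0, 1, 1, 0]

# Classify a new record by majority vote among its k = 3 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[30, 45000]]))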

Rule induction: The extraction of useful if-then rules from data, based on statistical significance.

Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.

APPLICATIONS OF DATA MINING:

1. Telecommunication

2. Retail industries

3. Financial data analysis

4. Surveillance / Mass surveillance

5. National Security Agency

6. Quantitative structure-activity relationship

7. Customer analytics

8. Police-enforced ANPR in the UK

9. Stellar wind (code name)

10. Banking

CONCLUSION:

This paper has shown how a data warehouse lets an organization treat its computing environment as a large, well-organized mass storage system for decision support. Using this concept, a sector such as banking can run its operations smoothly. The warehouse's chief merit is that it accepts all kinds of data from the operational environment, overcoming the shortcomings of ordinary file storage, and it also provides good security. For these reasons, data warehouses are generally preferred over plain file storage.