Study About Computer Ethics And The Internet Computer Science Essay

Published: November 9, 2015 Words: 4059

When most people think of ethics or morality, they think of well-defined standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues. Ethical standards also include those that enjoin the virtues of honesty, compassion, and loyalty, as well as standards relating to rights, such as the right to life, the right to freedom from injury, and the right to privacy. Computer ethics is a set of moral principles that regulate the use of computers. Common issues in computer ethics include intellectual property rights (such as copyrighted electronic content), privacy concerns, and the ways computers affect society. As technology advances, computers continue to have a greater impact on society. Computer ethics therefore promotes discussion of how much influence computers should have in areas such as artificial intelligence and human communication. As the world of computers evolves, computer ethics continues to create ethical standards that address the issues raised by new technologies.

The World Wide Web in its current form, in which hyperlinks connect documents and heterogeneous data is brought together from distributed sources, has raised a number of concerns about privacy, copyright, and intellectual property. The benefits of such a Web are plentiful, but threats to personal information, for example on social networking sites, also abound. Privacy is essential for the proper functioning of a liberal, democratic society.

Another area where computer ethics plays an important part is the abundant data generated from various sources. Data mining technology has emerged as a means of identifying patterns and trends in such large quantities of data. With the increasing use of data mining in the public and private sectors, privacy assumes paramount importance. Significant research is required on how to extract valuable knowledge from data while preventing private or sensitive information from leaking during the data mining process. We provide an overview of privacy-preserving association rule mining, one of the most popular pattern discovery methods in the new and rapidly emerging research area of privacy-preserving data mining. Various proposals and algorithms have been designed for it in recent years.

Analysis of sequential patterns is currently one of the most active areas of research in data mining. Sequential pattern mining is commonly defined as finding the complete set of frequent subsequences in a set of sequences. For example, consider the sales database of a bookstore. A discovered sequential pattern could be "70% of people who bought Harry Potter also bought The Lord of the Rings at a later time". This information can be useful for shelf placement, promotions, and so on. Sequential pattern mining finds applications in business and e-commerce, for example in analyzing click patterns on a website, increasing sales and promotions, and targeted marketing.

We have also focused on new techniques proposed for privacy-preserving sequential pattern mining. W. Ouyang et al. have proposed a randomized approach and a data perturbation approach, which are simple, easy to implement, and give rather good precision of support reconstruction. The latter approach does not decrease the support of true frequent sequences and can easily be combined with existing sequential pattern mining algorithms. The work by A. Mhatre et al. demonstrates the effect of adding noisy data on a sequential pattern mining algorithm over a progressive database, which is scalable from a single-node system to a multi-party scenario. We summarize the various proposals and algorithms designed in the research area of privacy-preserving data mining, survey existing techniques, and analyze their advantages and disadvantages.

I Introduction

Data mining technology has emerged as a means of identifying patterns and trends in large quantities of data. For instance, by analyzing purchase lists, shopping centers have concluded that male customers who buy diapers often buy beer as well, and they exploit this relation between diapers and beer by rearranging these goods. A similar relation holds between milk and bread. Rearranging goods after such analysis not only makes shopping more convenient for customers but also increases spending. Credit card centers of banks extract the behavioral features and consumption patterns of high-value clients from large volumes of transaction data in order to find potential clients, stimulate clients' consumption, and create more opportunities for cross-selling.

However, data mining also brings some problems. For example, credit card centers may intentionally or unintentionally leak sensitive client information while mining client data. With the popularity of the Internet, more and more information can be obtained in electronic form, so the need for people to keep their private information confidential is becoming increasingly urgent. According to statistics, about one-fifth of Internet users are unwilling to provide personal information to a Web site even when privacy protection measures are in place, and more than half of those surveyed are willing to provide their information only when good privacy-preserving measures exist. Almost half of the potential consumers who browse Internet shops have given up on Internet shopping because they worry that their privacy is not protected. Ensuring personal privacy in data mining has therefore become a problem that must be addressed. Significant research is required on how to extract valuable knowledge from data while preventing private or sensitive information from leaking during the data mining process. Research on privacy-preserving data mining and knowledge discovery has been developed for this purpose.

The association rule mining problem was first proposed by Agrawal et al. [1]. To make a publicly available system secure, we must ensure not only that private sensitive data have been trimmed out, but also that certain inference channels have been blocked. The association rule mining problem has been researched extensively under privacy constraints, and many effective methods for privacy-preserving association rule mining have been proposed [2-13]. However, most of these methods may cause some information loss and side effects in the sensitive rule hiding process, such as non-sensitive rules that are falsely hidden and spurious rules that are falsely generated. An essential problem in this context is therefore the trade-off between data utility and disclosure risk.

Sequential pattern mining is commonly defined as finding the complete set of frequent subsequences in a set of sequences [14]. It provides a means of discovering meaningful sequential patterns in a large quantity of data. For example, consider the sales database of a bookstore. A discovered sequential pattern could be "70% of people who bought Harry Potter also bought The Lord of the Rings at a later time". The bookstore can use this information for shelf placement, promotions, and so on.

This paper is organized as follows. Section II gives a classification framework for privacy-preserving algorithms in data mining. Section III provides an overview of privacy-preserving association rule mining, one of the most popular pattern discovery methods in the new and rapidly emerging research area of privacy-preserving data mining. Section IV summarizes the proposals and algorithms designed for privacy-preserving sequential pattern mining.

II Data Mining

The goal of data mining is to find knowledge, and knowledge is presented through patterns. Association rule mining is the most frequently used method in data mining; it finds associations between data and different objects by discovering potential dependencies among the data. Classification and clustering sort objects by characterizing what different objects have in common. Privacy-preserving algorithms are usually realized by combining a data mining algorithm with a data processing technique. Y. Shen et al. [15] describe two main ways of classifying privacy-preserving techniques: (1) According to data distribution, privacy-preserving techniques are divided into those for centralized data and those for distributed data. The latter can be further divided into techniques for horizontally partitioned and vertically partitioned data. Distributed privacy-preserving data mining algorithms are realized through secure multi-party computation (SMC), as discussed by Y. Shen et al. in [16]. (2) According to the data mining algorithm, privacy-preserving techniques are divided by the mining task they protect, such as association rule mining, classification, and clustering.

Two basic methods [17] of privacy preservation for association rules have been put forward by Y. Shen et al. [15]. The first prevents association rules from being produced by hiding frequent itemsets; the second prevents important rules from being produced by pushing the confidence of those rules below the minimum confidence specified by users. Stanley R. M. Oliveira and Osmar R. Zaiane [18] proposed a heuristic privacy-preserving method that protects sensitive rules with a single-scan algorithm. This algorithm sanitizes the data by removing part of the information in order to hide sensitive rules, and it does not add any noise to the raw database. Association rule mining on a centralized database has several disadvantages: network traffic is given little consideration, mining efficiency is low, and space complexity is high. Data perturbation is very efficient for data mining in a centralized environment, but it causes problems in a distributed environment. Jaideep Vaidya [13] proposed a privacy-preserving association rule algorithm for vertically partitioned data, which obtains the support counts of itemsets by securely computing the scalar product of vectors representing sub-itemsets.

Typical classification methods in data mining include distance-based classification, decision tree classification, and Bayesian classification. R. Agrawal and R. Srikant [10] proposed a privacy-preserving algorithm that first adds noise to the raw data in a way that still allows the data distribution to be recovered, then uses a reconstruction technique to estimate a distribution similar to that of the raw data and builds a decision tree from it. Du and Zhan [19] proposed a privacy-preserving k-nearest neighbor classification algorithm for vertically partitioned data. For distributed data, decision tree classification constructs the tree by exchanging intermediate computation results; decision tree algorithms include ID3, C4.5, and C5.0. Lindell and Pinkas [20] first proposed a privacy-preserving distributed ID3 decision tree algorithm, which adopts a secure computation tool and involves the participation of a semi-honest third party. In addition, based on the Bayesian classification algorithm, M. Kantarcioglu and Clifton [12] established a naive Bayes classification model for horizontally partitioned data that preserves privacy through the secure sum method.

Privacy-preserving clustering based on data perturbation hides the real sensitive data by transforming the data before clustering analysis is performed. Privacy-preserving clustering based on SMC, in contrast, constructs a secure multi-party protocol so that each participating party, while holding only its own private information, learns the overall clustering result. Oliveira and Zaiane [21] proposed the rotation-based transformation (RBT) method, which applies an isometric transformation to points in multi-dimensional space and achieves an excellent privacy-preserving result. Merugu and Ghosh [22] proposed a privacy-preserving method for distributed clustering analysis, in which each site fits suitable parameters to its raw or perturbed data locally and transmits the parameters to a central site, which accomplishes high-quality distributed clustering through suitable samples.
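The isometry property behind a rotation-based transformation can be illustrated with a small sketch. It assumes two-dimensional points and a single rotation angle chosen by the data owner; the function names and the sample data are hypothetical and are not taken from [21]. The point of the example is only that a rotation changes every coordinate while leaving all pairwise distances, and hence any distance-based clustering, unchanged.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate(points: List[Point], theta_degrees: float) -> List[Point]:
    """Rotate every 2-D point about the origin by theta_degrees."""
    theta = math.radians(theta_degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points: List[Point]) -> List[float]:
    """Euclidean distances between all unordered pairs of points."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


# Hypothetical records with two numeric attributes each.
original = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]
# The angle (37 degrees) is an arbitrary illustrative choice.
released = rotate(original, 37.0)

before = pairwise_distances(original)
after = pairwise_distances(released)
distances_preserved = all(abs(a - b) < 1e-9 for a, b in zip(before, after))
```

Because the released coordinates differ from the originals while the distances agree, a clustering algorithm that depends only on distances would produce the same clusters on either data set.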

III Association Rules

Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items [23]. Let $D$ be a database of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. Each transaction is associated with an identifier, called its TID. A transaction $T$ is said to contain $A$ if and only if $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ holds in the transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that contain $A \cup B$. The rule $A \Rightarrow B$ has confidence $c$ in the transaction set $D$, where $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$. That is,

$$\mathrm{Sup}(A \Rightarrow B) = P(A \cup B) = \frac{|A \cup B|}{|D|} \qquad (1)$$

$$\mathrm{Conf}(A \Rightarrow B) = P(B \mid A) = \frac{|A \cup B|}{|A|} \qquad (2)$$

where $|A|$ is the support count of the itemset $A$ in the set of transactions $D$, also denoted sup_count($A$), and $|D|$ is the total number of transactions in $D$. $A$ occurs in a transaction $T$ if and only if $A \subseteq T$. Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset, and an itemset that contains $k$ items is a $k$-itemset. Itemsets that satisfy min_sup are called frequent itemsets. All strong association rules result from frequent itemsets.
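The definitions of support, confidence, and strong rules can be made concrete with a short sketch. The transaction database, the item names, and the thresholds below are hypothetical, and the functions are a direct, unoptimized reading of the definitions rather than a mining algorithm such as Apriori.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(a: Iterable[str], b: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of all transactions that contain both A and B."""
    return support_count(set(a) | set(b), transactions) / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing A that also contain B."""
    count_a = support_count(a, transactions)
    if count_a == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support_count(set(a) | set(b), transactions) / count_a


def is_strong(a, b, transactions, min_sup: float, min_conf: float) -> bool:
    """A rule is strong if it meets both thresholds."""
    return (support(a, b, transactions) >= min_sup
            and confidence(a, b, transactions) >= min_conf)


# Hypothetical toy database of four transactions.
D = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"milk", "bread"}),
    frozenset({"diapers", "bread"}),
]

sup = support({"diapers"}, {"beer"}, D)       # 2 of 4 transactions
conf = confidence({"diapers"}, {"beer"}, D)   # 2 of the 3 diaper transactions
strong = is_strong({"diapers"}, {"beer"}, D, min_sup=0.5, min_conf=0.6)
```

With these illustrative thresholds the rule diapers ⇒ beer is strong, since its support is 0.5 and its confidence is 2/3.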

According to the privacy protection technologies discussed in [24], privacy-preserving association rule mining algorithms can at present be divided into three categories [25]: heuristic-based techniques, reconstruction-based techniques, and cryptography-based techniques.

Heuristic-based techniques modify selected values in the data set while taking data security and privacy into account. Heuristic-based modification includes perturbation, which alters an attribute value to a new value (e.g., changing a 1-value to a 0-value, or adding noise), and blocking, which replaces an existing attribute value with a "?". The basic principle in choosing the transaction or item to modify is to affect the original database as little as possible. Dasseni et al. [4] extend the sanitization of sensitive large itemsets to the sanitization of sensitive rules. Oliveira and Zaiane [5] aim at balancing privacy and disclosure of information by trying to minimize the impact on sanitized transactions, or else to minimize the number of accidentally hidden rules and ghost rules. Wang et al. [26] propose a matrix-based sanitization approach to hide sensitive patterns. It is the first paper to consider avoiding forward-inference attacks [27], which are also avoided in the sanitized database generated by its sanitization process. Oliveira et al. [8] propose a novel method that modifies databases to hide sensitive patterns: multiplying the original database by a sanitization matrix yields a sanitized database that protects the private content. The method can avoid forward-inference attacks. The paper in [28] describes a technique that uses a queue and a random number generator to generate items so that each item is added to transactions with approximately equal frequency.

Among replacement-based techniques, Saygin et al. [29] discuss specific examples of using an uncertain symbol in association rule mining, in which case support and confidence are replaced by a support interval and a confidence interval. Agrawal et al. improve on the distribution reconstruction technique presented in [30] by using the Expectation Maximization (EM) method, and they propose novel metrics for the quantification and measurement of privacy-preserving data mining algorithms. The paper in [31] presents a new generalization framework based on the concept of personalized anonymity, which performs the minimum generalization needed to satisfy everybody's requirements and provides different degrees of privacy protection for different records of a data table. The paper in [32] proposes a personalized anonymity model based on the (α,k)-anonymization model to resolve the problem of privacy self-management, together with a corresponding anonymization method that uses local recoding and sensitive attribute generalization.
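The idea of hiding a sensitive itemset by modifying as few values as possible can be sketched as follows. This is a deliberately naive illustration, not any of the cited algorithms: it assumes the data owner names a "victim" item to delete, and it removes that item from just enough supporting transactions to push the support count of the sensitive itemset below a minimum count. The function name, parameters, and data are hypothetical.

```python
from typing import List, Set


def hide_itemset(transactions: List[Set[str]], sensitive: Set[str],
                 victim: str, min_count: int) -> List[Set[str]]:
    """Return a sanitized copy in which `sensitive` is supported by
    fewer than `min_count` transactions.

    Only the `victim` item is removed, and only from as many supporting
    transactions as needed, so the rest of the database is untouched.
    """
    if victim not in sensitive:
        raise ValueError("victim must be one of the sensitive items")
    sanitized = [set(t) for t in transactions]  # work on a copy
    supporting = [t for t in sanitized if sensitive <= t]
    # Number of supporting transactions that must lose the victim item.
    excess = len(supporting) - (min_count - 1)
    for t in supporting[:max(excess, 0)]:
        t.discard(victim)
    return sanitized


def count(itemset: Set[str], transactions: List[Set[str]]) -> int:
    """Support count of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical database in which {a, b} is frequent (count 3).
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
D_sanitized = hide_itemset(D, sensitive={"a", "b"}, victim="b", min_count=2)
```

In this toy run the support count of {a, b} drops from 3 to 1, which is below the assumed minimum count of 2, while the count of item a alone is unaffected. The side effect on item b (its count falls from 3 to 1) is an instance of the information loss that the heuristics above try to minimize.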

A number of recently proposed techniques address privacy preservation by perturbing the data and reconstructing the distributions at an aggregate level in order to perform association rule mining. In 2000, Agrawal et al. [10] first proposed a Bayesian method for reconstructing the distribution of perturbed numeric data. Dakshi Agrawal and Charu Aggarwal [33] then improved on the Bayesian-based reconstruction procedure by using an Expectation Maximization (EM) algorithm for distribution reconstruction. The work presented in [34] deals with binary and categorical data in the context of association rule mining. Both papers consider randomization techniques that offer privacy while maintaining high utility for the data set. One approach to randomizing transactions is to generalize Warner's "randomized response" method: before sending a transaction to the server, the client takes each item and, with probability p, replaces it with a new item not originally present in the transaction. This process is called uniform randomization. The algorithm is applied to categorical data, and the key task is mining the frequent itemsets. Shariq J. Rizvi et al. [35] present a scheme called MASK, which attempts to simultaneously provide a high degree of privacy to the user and retain a high degree of accuracy in the mining results. The scheme is based on a simple probabilistic distortion of binary data, employing random numbers generated from a pre-defined distribution function. The work in [34] hides the original data set using the "select-a-size" and "cut-and-paste" randomization operators and then estimates itemset support counts from the transformed data in order to identify frequent itemsets. Taking into account the considerable time and space required to reconstruct the distributions, Rizvi later optimized the MASK algorithm.
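The principle of distorting binary data and then reconstructing support at the aggregate level can be illustrated for a single item. The sketch below is a Warner-style randomized response on one binary column, offered as a simplified assumption-laden illustration and not as the MASK algorithm or the operators of [34]: each bit is kept with probability p and flipped otherwise. If the true support is s, the expected observed support is s·p + (1 − s)(1 − p), which can be solved for s. All names and parameter values are hypothetical.

```python
import random
from typing import List


def distort(bits: List[int], p: float, rng: random.Random) -> List[int]:
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]


def reconstruct_support(distorted: List[int], p: float) -> float:
    """Estimate the true fraction of 1s from the distorted column.

    Solves observed = s * p + (1 - s) * (1 - p) for s.
    """
    if p == 0.5:
        raise ValueError("p = 0.5 destroys all information about the data")
    observed = sum(distorted) / len(distorted)
    return (observed - (1 - p)) / (2 * p - 1)


# Hypothetical column: the item occurs in 30% of 50,000 transactions.
true_bits = [1] * 15000 + [0] * 35000
rng = random.Random(0)          # fixed seed so the run is repeatable
released = distort(true_bits, p=0.9, rng=rng)

estimate = reconstruct_support(released, p=0.9)
```

Individual released bits are unreliable (each has a 10% chance of being wrong in this setting), yet the reconstructed support stays close to the true value of 0.3 because the estimate averages over many transactions.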

Many cryptography-based approaches have been proposed in the context of privacy-preserving data mining. Cryptography-based approaches such as secure multi-party computation (SMC) guarantee that, at the end of the computation, no party knows anything except its own input and the results. The work in [36] presents four SMC-based methods that can support privacy-preserving data mining: the secure sum, the secure set union, the secure size of set intersection, and the scalar product. Theory for performing linear regression on vertically partitioned data has also been developed. Sanil et al. [37, 38] describe two different perspectives: the paper in [37] relies on quadratic optimization to solve for the coefficients, while the paper in [38] uses a form of secure matrix multiplication to calculate the off-diagonal blocks of the full-data covariance matrix. Another way of computing the support count uses the secure size of set intersection method described in [36]. If the transactions are vertically partitioned across the sites, this problem can be solved by generating and computing a set of independent linear equations [13]. The work in [39] develops a log-linear model approach for strictly vertically partitioned databases and a more general secure logistic regression for problems involving partially overlapping databases with measurement error. Kantarcioglu and Clifton [40] use secure multi-party computation for transactions that are horizontally partitioned across sites, and present algorithms that incorporate cryptographic techniques to minimize the shared information without incurring much overhead in the mining process. The paper in [41] proposes an efficient distributed algorithm, FDM (Fast Distributed Mining of association rules), for mining association rules.
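Of the four building blocks listed above, the secure sum is the easiest to sketch. The following single-process simulation assumes the commonly described ring protocol: the first site masks its value with a random number, every other site adds its own value to the running total modulo a public modulus, and the first site removes the mask at the end. The function name, the modulus, and the local counts are hypothetical, and the sketch assumes semi-honest sites that do not collude; it is not a hardened implementation.

```python
import random
from typing import List


def secure_sum(local_values: List[int], modulus: int, rng: random.Random) -> int:
    """Simulate a ring-based secure sum over the sites' local values.

    Assumes every value and the true total lie in [0, modulus).
    """
    if not local_values:
        raise ValueError("at least one site is required")
    if any(not 0 <= v < modulus for v in local_values) or sum(local_values) >= modulus:
        raise ValueError("values and their total must be smaller than the modulus")

    # Site 1 picks a secret random mask and sends (mask + v1) mod m onward.
    mask = rng.randrange(modulus)
    running = (mask + local_values[0]) % modulus

    # Each remaining site sees only a masked running total and adds its value.
    for value in local_values[1:]:
        running = (running + value) % modulus

    # The total returns to site 1, which subtracts its mask.
    return (running - mask) % modulus


# Hypothetical local support counts of one itemset at three sites.
counts = [5, 12, 7]
total = secure_sum(counts, modulus=1000, rng=random.Random(1))
```

Each intermediate message is uniformly distributed over the modulus from the receiving site's point of view, so a single site learns nothing about the earlier sites' counts, while the global support count (24 here) is still obtained exactly.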

IV Sequential Pattern Mining

In sequential pattern mining, we are given a database of customer transactions. Each transaction consists of the following fields: customer-ID, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction-time. We do not consider the quantities of items bought in a transaction: each item is a binary variable representing whether the item was bought or not. An itemset is a non-empty set of items. A sequence is a list of itemsets ordered by time. The support for a sequence is defined as the fraction of all customers who support this sequence. The problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern.

A pattern-set is a non-empty set of patterns. A sequence is an ordered list of pattern-sets. Without loss of generality, we assume that the set of patterns is mapped to a set of contiguous integers. We denote a pattern-set $a$ by $(a_1 a_2 \ldots a_n)$, where $a_j$ is a pattern. We denote a sequence $S$ by $\langle s_1 s_2 \ldots s_n \rangle$, where $s_j$ is a pattern-set. A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is contained in another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$. For example, the sequence < (3) (4 5) (8) > is contained in < (7) (3 8) (9) (4 5 6) (8) >, since (3) $\subseteq$ (3 8), (4 5) $\subseteq$ (4 5 6), and (8) $\subseteq$ (8). However, the sequence < (3) (5) > is not contained in < (3 5) > (and vice versa). The former represents patterns 3 and 5 occurring one after the other, while the latter represents patterns 3 and 5 occurring together.
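The containment relation and the support of a sequence can be checked mechanically. The sketch below represents a sequence as a list of Python sets and uses a greedy left-to-right match, which is sufficient for deciding containment; the customer database is a hypothetical toy example, and the function names are illustrative only.

```python
from typing import Dict, List, Set

Sequence = List[Set[int]]


def is_contained(a: Sequence, b: Sequence) -> bool:
    """True if each element of `a` is a subset of a distinct element of `b`,
    with the matched elements of `b` appearing in the same order."""
    j = 0
    for element in a:
        # Advance to the earliest remaining element of b that covers `element`.
        while j < len(b) and not element <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next element of `a` must match strictly later in `b`
    return True


def sequence_support(candidate: Sequence, customers: Dict[int, Sequence]) -> float:
    """Fraction of customers whose sequence contains `candidate`."""
    hits = sum(1 for seq in customers.values() if is_contained(candidate, seq))
    return hits / len(customers)


# The examples from the text.
ex1 = is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}])
ex2 = is_contained([{3}, {5}], [{3, 5}])
ex3 = is_contained([{3, 5}], [{3}, {5}])

# Hypothetical customer sequences keyed by customer-ID.
customers = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
s1 = sequence_support([{30}, {90}], customers)
s2 = sequence_support([{30}, {40, 70}], customers)
```

In this toy database both < (30) (90) > and < (30) (40 70) > are supported by two of the five customers, so with an assumed minimum support of 25% both would be frequent sequences.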

W. Ouyang et al. have proposed a randomized approach [43] and a data perturbation approach [42], which are simple, easy to implement, and give rather good precision of support reconstruction. The latter approach does not decrease the support of true frequent sequences and can easily be combined with existing sequential pattern mining algorithms. The work by A. Mhatre et al. [44] demonstrates the effect of adding noisy data on a sequential pattern mining algorithm over a progressive database, which is scalable from a single-node system to a multi-party scenario.

V Conclusion

Ethics refers to well-defined standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues. Ethical standards also include those that enjoin the virtues of honesty, compassion, and loyalty, as well as standards relating to rights, such as the right to life, the right to freedom from injury, and the right to privacy. Computer ethics is a set of moral principles that regulate the use of computers. Common issues in computer ethics include intellectual property rights (such as copyrighted electronic content), privacy concerns, and the ways computers affect society. As technology advances, computers continue to have a greater impact on society, and computer ethics continues to create ethical standards that address the issues raised by new technologies. The benefits of the World Wide Web are plentiful, but threats to personal information, for example on social networking sites, also abound. Another area where computer ethics plays an important part is the abundant data generated from various sources. Data mining technology has emerged as a means of identifying patterns and trends in such large quantities of data. With the increasing use of data mining in the public and private sectors, privacy assumes paramount importance. Significant research is required on how to extract valuable knowledge from data while preventing private or sensitive information from leaking during the data mining process. We have provided an overview of privacy-preserving association rule mining, one of the most popular pattern discovery methods in the new and rapidly emerging research area of privacy-preserving data mining. Various proposals and algorithms have been designed for it in recent years.

Analysis of sequential patterns is currently one of the most active areas of research in data mining. Sequential pattern mining is commonly defined as finding the complete set of frequent subsequences in a set of sequences. For example, consider the sales database of a bookstore. A discovered sequential pattern could be "70% of people who bought Harry Potter also bought The Lord of the Rings at a later time". This information can be useful for shelf placement, promotions, and so on. Sequential pattern mining finds applications in business and e-commerce, for example in analyzing click patterns on a website, increasing sales and promotions, and targeted marketing.

We have also focused on one of the most active areas of research in data mining, sequential pattern mining. W. Ouyang et al. have proposed a randomized approach and a data perturbation approach, which are simple, easy to implement, and give rather good precision of support reconstruction. The latter approach does not decrease the support of true frequent sequences and can easily be combined with existing sequential pattern mining algorithms. The work by A. Mhatre et al. demonstrates the effect of adding noisy data on a sequential pattern mining algorithm over a progressive database, which is scalable from a single-node system to a multi-party scenario. We have summarized the various proposals and algorithms designed in the research area of privacy-preserving data mining, surveyed existing techniques, and analyzed their advantages and disadvantages.