Online social networks have witnessed enormous growth in users over the past few years. However, they have also become a center of attraction for spammers. Tracing spammers and determining who is related to whom on a large scale is a complex problem, and since spammers communicate covertly, they cannot be identified by analyzing a simple graph of the social network. In this paper, we associate people on the basis of their spatio-temporal co-occurrence, i.e., people frequently communicating with each other at the same place and time, and use these associations to find the users involved in malicious communication.
Early social networking websites primarily focused on bringing groups of people together to interact through chat rooms and to share information through personal homepages. From around 2002, however, these social networking sites emerged as some of the most popular sites on the web. By allowing users to publicize and share content, social network sites become exposed to different types of malicious action, such as message spamming. Spammers are motivated to broadcast pornography or to promote certain content.
These social network websites have become the center of internet users' attention by creating a versatile, common platform that allows people to connect with others who share their interests, activities, political views, language, cultural and religious values, or nationality. Apart from internet users, these websites have also attracted researchers, who try to understand the reasons behind users' engagement with them and how to extract useful information from them.
The purpose of this introduction is to provide the background that motivates our research on these social network websites.
1.1 Social Network:
A social network is a group of social entities, i.e., individuals or organizations called "nodes" or "actors", which are linked by certain relationships, such as friendship, kinship, dislike, sexual relationships, or trading relationships, called "ties", "edges", or "links". A social network site is thus a site that allows users to publish a profile of themselves and to form relations with other people; beyond that, it also allows individuals to articulate and make visible their social network, which distinguishes it from other computer-mediated communications.
Social network analysis offers many useful insights: one can identify important nodes, such as nodes with many connections or those with the greatest effect on the network [Kempe, 03] [Richardson, 02], or find subsets of the network that show interesting patterns [Agrawal, 03] [Mukherjee, 04]. Moreover, it helps us identify which nodes form the core of the network and which lie on the periphery, where the clusters are, and who belongs to them.
Knowledge of a social network is useful in many contexts. For example, viral marketing takes advantage of the relationships between existing and potential customers to boost sales of products and services [Kempe, 03] [Richardson, 02], and one can identify groups of people involved in malicious communication. Moreover, members of a social network can benefit by getting to know others and expanding their social circle [Mukherjee, 04].
Despite these many uses, analyzing a social network to find a required pattern is a difficult job. In this paper, we computationally mine a social network from spatio-temporal data, i.e., data associated with location and time, to find the group of users involved in malicious activities. In particular, we focus on spatio-temporal co-occurrence to establish ties between users and thereby detect that group.
1.2 Our contribution
[state your contribution in 10 lines and
give a good architecture diagram here]
1.3 Paper organization
[list the organization of your paper, e.g., section 1 introduces the paper, section 2 covers previous work …]
2. Previous works
Existing spam and malicious-content filters are currently mostly static in nature, such as black and white listing [Resig, 04] or rule-based filters. One technique for detecting malicious content is the Bayesian classifier, a statistical approach that focuses mainly on textual features. An example is given in [Androutsopoulos, 00], which describes how it can be applied to filter malicious content. The fundamental idea is to compute the probability that a piece of data is malicious based on the words it contains. Another method, related to the technique used in this paper, was proposed by Gee [Gee, 03], in which latent semantic indexing (LSI) is used to filter malicious content; however, [Gee, 03] uses only textual features of email.
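To make the word-probability idea concrete, the following is a minimal sketch of a naive Bayesian filter with Laplace smoothing; the training messages, class names, and function names are purely illustrative and not part of any cited system.

```python
from collections import Counter
import math

def train_counts(messages, labels):
    """Count word occurrences separately for each class ('spam'/'ham')."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter(labels)
    for text, label in zip(messages, labels):
        counts[label].update(text.lower().split())
    return counts, totals

def spam_probability(text, counts, totals, alpha=1.0):
    """P(spam | words) via Bayes' rule with Laplace smoothing."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    log_odds = math.log(totals["spam"] / totals["ham"])  # prior odds
    n_spam = sum(counts["spam"].values())
    n_ham = sum(counts["ham"].values())
    for word in text.lower().split():
        p_spam = (counts["spam"][word] + alpha) / (n_spam + alpha * len(vocab))
        p_ham = (counts["ham"][word] + alpha) / (n_ham + alpha * len(vocab))
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

A message is then flagged as malicious when the returned probability exceeds a chosen decision threshold (e.g., 0.5).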
In this section, we first discuss conventional LSI as used in search engines [Langville, 05].
2.1 Latent Semantic Indexing:
Latent semantic indexing (LSI), or latent semantic analysis, as discussed in [Deerwester, 90], is a statistical technique that computes statistical correlations between all terms and documents in a corpus, in an attempt to overcome the problems of conventional keyword matching.
In general, it involves creating a weighted term-document matrix, performing singular value decomposition (SVD) on it, and using the resulting matrices to find the concepts contained in the text.
A vector space model is created in the form of a term-document matrix. Singular value decomposition (SVD) is applied to estimate term usage across all documents in the corpus, deriving in the process conceptual indices that approximate the underlying word usage structure. Then, to avoid the noisy effects of excessive variability in vocabulary usage, the SVD-derived matrices are reduced to an arbitrary k dimensions ([Cristiani,02], [Landauer, 97], [Landauer, 98]). [Deerwester, 90] provides an excellent background on LSI and its SVD-based derivation.
The end result is a condensed vector for each term and document that is a linear combination of data from every other matrix cell ([Landauer, 97]). Retrieval and searching are performed using the database of singular values and vectors obtained from the reduced SVD matrices. Studies have shown that these vectors are robust, effective indicators of meaning and enjoy higher recall than searching with individual terms alone ([Gee, 01], [Landauer, 97], [Landauer, 98]).
One issue with LSI is that it does not support the ad-hoc addition of new documents once the semantic set has been generated. Any update to any cell value changes the coefficients in every other word vector, as SVD uses all linear relations in its assigned dimensionality to induce vectors that predict every text sample in which a word occurs ([Landauer, 97]).
To compare two vectors, the dot product is used to generate the cosine of the angle between them. [Cristiani,02] provides a concise summary of how this is computed. A cosine of 1 signifies that the two vectors (be they term or document) are considered exactly similar (which is different from identical), while a cosine of -1 signifies that they are theoretically completely dissimilar. New test documents not previously included in the semantic set can also be used for comparison, by combining the vectors of their constituent terms.
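The pipeline just described, building a term-document matrix, truncating its SVD to k concepts, folding new documents into the concept space, and comparing vectors by cosine, can be sketched as follows. This is a minimal NumPy illustration; the function names and matrix contents are ours, not from the cited works.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Truncated SVD of the weighted term-document matrix (terms x docs)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]  # k-dimensional concept space

def fold_in(doc_vec, Uk, sk):
    """Project a new document's term-count vector into the concept space."""
    return doc_vec @ Uk / sk

def cosine(a, b):
    """Cosine of the angle between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Columns of the truncated Vt are the training documents' concept-space representations; a folded-in query can be compared against them directly with `cosine`.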
2.2 Mining Social Network:
Apart from co-occurrence, there are other criteria on which ties between actors can be formed, such as similarity and communication, and this information can likewise be used to analyze the network.
Similarity is the degree of likeness between two or more concepts or objects, i.e., sharing the same properties or features. Generally, friends tend to be alike [Carley, 91]; from this we conclude that people who share common features tend to be associated with one another in some way. For example, websites with similar text and links represent groups of related individuals [Adamic, 03].
Communication, the exchange of information or resources, is normally observed between related people. This means that communication can be used to infer associations between people. For example, emails [Schwartz, 93] and instant messages [Resig, 04] can be used to trace associations, since such communications are directed, i.e., from sender to receiver.
Co-occurrence presupposes that if different entities co-occur more frequently than they occur alone, then they are associated. With regard to space and time, co-occurrence can be defined in different ways: basic co-occurrence (when neither time nor space is considered), temporal co-occurrence (when only time is considered), spatial co-occurrence (when only space is considered), and spatio-temporal co-occurrence (when both space and time are considered). Current work mainly follows a time-series approach: the data associated with each node is treated as a collection of time series, and the distance between time series is computed with a distance function such as Euclidean distance [Wang, 03] or LCSS [Vlachos, 02]. If the distance is less than a certain threshold, the time series are considered similar and the corresponding nodes are considered to co-occur.
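The Euclidean variant of this test can be sketched in a few lines. This is a simplified illustration under the assumption of equal-length activity series; the function name and threshold value are hypothetical.

```python
import numpy as np

def euclidean_cooccurrence(series_a, series_b, threshold):
    """Two equal-length activity time series are considered to co-occur
    when their Euclidean distance falls below the given threshold."""
    dist = float(np.linalg.norm(np.asarray(series_a, float) -
                                np.asarray(series_b, float)))
    return dist, dist < threshold
```

LCSS-based variants replace the distance function but keep the same thresholding logic.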
3. Mining Social Networks from Spatio-Temporal Events
An event is an instance of co-occurring actors (nodes) in a social network. To identify the spatio-temporal behavior of actors, we focus on a common behavior in social networks: actors group together when they interact. By this we mean that a social event results in spatio-temporal co-occurrence.
A spatio-temporal event is a spatio-temporal co-occurrence that results from social relations and communications. More formally, a spatio-temporal event can be defined as follows.
Definition: For an event to be a spatio-temporal event, it must satisfy the following conditions:
Actors involved in the event must be at the same location.
The difference between the actors' times of participation in the event must be less than a certain threshold (δt).
The number of actors in the event must be at least two.
The first two conditions specify the constraints for spatio-temporal co-occurrence. Both can be made as restrictive or as lenient as desired, depending on how one wishes to associate people. Exact co-occurrence is neither possible nor practical; by allowing this tolerance to vary, an appropriate value can be selected depending on need.
The third condition is a general one that holds for all events: for an event to occur, at least two actors must interact.
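The three conditions above can be sketched as a simple pairwise scan over interaction records. This is only an illustrative sketch (a real implementation would index by location and time rather than enumerate all pairs); the tuple layout and sample data are hypothetical.

```python
from itertools import combinations

def extract_events(interactions, dt):
    """interactions: list of (actor, location, timestamp) tuples.
    Emits an event whenever two distinct actors (condition 3) appear at
    the same location (condition 1) with timestamps less than dt apart
    (condition 2)."""
    events = []
    for (a1, loc1, t1), (a2, loc2, t2) in combinations(interactions, 2):
        if a1 != a2 and loc1 == loc2 and abs(t1 - t2) < dt:
            events.append((a1, a2, loc1))
    return events
```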
Once we create spatio-temporal events, we assign each a weight between 0 and 1. With the help of these weights, we can predict associations between actors. To be more precise, we set a threshold, and only events whose weight is greater than the threshold are selected. We follow the same algorithmic approach as [Lauw, 09].
After joining actors on the basis of spatio-temporal co-occurrences, we find the actors involved in malicious activities. To do so, we use latent semantic indexing (LSI).
4. Latent Semantic Indexing (LSI)
To apply latent semantic indexing to malicious content filtering, each post, i.e., each text message in our case, is considered a document, and each post is represented as a vector in space.
To create the vector space model, we represent each post by its feature values. This leads to an important question: which features of a post are significant for classifying its text as malicious or not?
In our case, we use a text-based feature set, consisting of the words extracted from posts. Document frequency thresholding is used to reduce dimensionality.
4.1 Dimensionality Reduction:
When tokenizing text, one frequently faces very high-dimensional data, which is difficult to handle. Document frequency thresholding therefore plays an important role in reducing dimensionality: terms with very high or very low frequencies are discarded. In particular, terms that occur in all documents provide no discriminative information. An overview of different term selection techniques is given in [Yang, 97].
4.1.2 Document Frequency Thresholding:
The basic idea is that very frequent terms play no significant role in distinguishing malicious from non-malicious text. The biggest advantage of document frequency thresholding is that it requires no class information; it can therefore be used in clustering, where little or no class information about the data is available.
Document frequency thresholding proceeds as follows:
First, an upper bound is set, usually at 0.5, so that all terms occurring in more than half of the documents are omitted.
Then, the lower bound is set dynamically according to the preferred number of features.
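These two steps can be sketched as follows; the function name, the alphabetical tie-break, and the sample parameters are our own illustrative choices.

```python
from collections import Counter

def df_filter(docs, upper=0.5, n_features=None):
    """Document frequency thresholding: drop terms occurring in more
    than `upper` fraction of documents, then keep the `n_features`
    most frequent remaining terms (the dynamic lower bound)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # document frequency, not raw counts
    n = len(docs)
    kept = {t: c for t, c in df.items() if c / n <= upper}
    ranked = sorted(kept, key=lambda t: (-kept[t], t))
    return ranked if n_features is None else ranked[:n_features]
```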
Once the dimensionality has been reduced using document frequency thresholding, we can also apply the information gain technique to improve performance.
The basic idea of information gain is to measure how well each feature separates the given data. Information gain, along with other feature selection techniques, is described in [Yang, 97].
4.2 Training and Classification:
Training consists of indexing the malicious and non-malicious text and computing the singular value decomposition; classification consists of indexing the query and retrieving the closest message from the training set. Each new message is converted into a query vector over the chosen feature set and compared against the training messages: if the closest training message is malicious, the new message is classified as malicious, otherwise as non-malicious. Distance is measured by the angle between the query vector and the training vectors [Langville, 05].
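A minimal sketch of this nearest-neighbor classification in the LSI concept space follows; the function names, the choice of k, and the toy term-document matrix are illustrative assumptions, not the exact implementation of the cited works.

```python
import numpy as np

def train_lsi(term_doc, k):
    """Index the training messages: truncated SVD of the term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def classify(query_vec, Uk, sk, Vtk, train_labels):
    """Fold the query into concept space and return the label of the
    training message at the smallest angle (largest cosine)."""
    q = query_vec @ Uk / sk
    docs = Vtk.T  # one row per training message, in concept space
    cosines = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return train_labels[int(np.argmax(cosines))]
```

The new message thus inherits the label (malicious or non-malicious) of its closest indexed neighbor.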
5. Experimental Evaluations
For our experiments we use data from Facebook. The dataset consists of users' personal information and posts.
We first create a spatio-temporal database from this data, i.e., we store only information about each user's location and time of interaction. Once we have the spatio-temporal database, we associate people on the basis of spatio-temporal co-occurrences for different values of the threshold (δt). Graph 1 shows the result when the threshold (δt) is very small (e.g., 60 sec).
Figure number: title
The resulting graph shows a large deviation from the original graph because it includes interactions that occur accidentally. These wrong associations are shown with red lines. Graph 2 shows the result when the threshold (δt) is very large (e.g., 28800 sec, or 8 hrs).
Figure number: title
The resulting graph also deviates from the original graph because it misses interactions between friends who communicate only for a short time. We therefore have to select a value that neither admits accidental interactions nor misses important data; with an appropriate value, we obtain a graph that closely resembles the correct associations, as shown in Graph 3.
Figure number: title
Once we set the appropriate value, we enlarge our dataset in order to better find the circle of people involved in malicious communication. After associating people on the basis of spatio-temporal co-occurrences, we find the malicious text sent by particular individuals. For this, we apply latent semantic indexing; after finding a user involved in malicious activities, shown in the graph below with a large circle, we are able to find the other users who are also involved in these activities.
Figure number: title
6. Conclusion
In this paper we proposed spatio-temporal mining of social networks to identify the circle of users involved in malicious communication, with the help of latent semantic indexing. There are many opportunities for future work. Our current approach could be fine-tuned by investigating other factors that may improve the quality with which the malicious circle of people is identified. This approach could also be used by different organizations to improve their performance. [extend the conclusion and future work part to at least 10 more lines]
References [Must add some more like 5-6 references of 2008-2009]
[Adamic, 03] L. A. Adamic and E. Adar, Friends and neighbors on the web, Social Networks, 25 (2003), pp. 211-230.
[Agrawal, 03] R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu, Mining newsgroups using networks arising from social behavior, WWW, (2003), pp. 688-703.
[Androutsopoulos, 00]Androutsopoulos, J. Koutsias, K. Chandrinos, and C. D. Spyropoulos, An experimental comparison of naive Bayesian and keyword based anti-spam filtering with personal e-mail messages, in SIGIR '00: proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval,2000, pp. 160-167.
[Carley, 91]K. Carley, A theory of group stability, American Sociological Review, 56 (1991), pp. 331-354.
[Cristiani,02] N. Cristiani and B. Scholkopf, "Support Vector Machines and Kernel Methods: The New Generation of Learning Machines". AI Magazine. Fall 2002 (Vol 23, no. 3), pp. 31-41, 2002.
[Deerwester, 90] S. Deerwester, S.T. Dumais, , G.W. Furnas, T.K. Landauer, and R. Harshman, Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41:391-407, 1990.
[Gee, 03] K. Gee, Using latent semantic indexing to filter spam, in ACM Symposium on Applied Computing, Data Mining Track, 2003, pp. 460-464.
[Gee, 01]Gee, K. Text Classification Using Latent Semantic Indexing. Master's thesis. The University of Texas at Arlington, 2001.
[Kempe, 03] D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the spread of influence through a social network, KDD,(2003), pp. 137-146.
[Landauer, 97]T.K. Landauer. and S.T. Dumais, A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge.Psychological Review, 104, 211-240, 1997.
[Landauer,98] T.K. Landauer, P.W. Foltz, D. Laham . An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284, 1998.
[Langville, 05]A. N. Langville, The linear algebra behind search engines, in Journal of Online Mathematics and its Applications (JOMA), 2005, Online Module,2005.
[Lauw, 09]H. W. Lauw, E. P. Lim, T. T. Tan and H. H. Pang, Mining Social Network from Spatio-Temporal Events, 2009, pp. 88-89.
[Mukherjee, 04] M. Mukherjee and L. B. Holder, Graph-based data mining on social networks, in Workshop on Link Analysis and Group Detection (in conj. with KDD), (2004).
[Boyd, 04] D. M. Boyd, Friendster and publicly articulated social networking, in Conf. on Human Factors and Computing Systems, (2004), pp. 1279-1282.
[Resig, 04] J. Resig, S. Dawara, C. M. Homan, and A. Teredesai, Extracting social networks from instant messaging populations, in Workshop on Link Analysis and Group Detection (in conj. with KDD), (2004).
[Richardson, 02] M. Richardson and P. Domingo, Mining knowledge-sharing sites for viral marketing, KDD, (2002), pp. 61-70.
[Schwartz, 93] M. F. Schwartz and D. C. M. Wood, Discovering shared interests using graph analysis, CACM, 36 1993),pp. 78-89.
[Vlachos, 02] M. Vlachos, G. Kollios, and D. Gunopulos, Discovering similar multidimensional trajectories, ICDE, (2002), pp. 673-684.
[Wang, 03] Y. Wang, E. Lim, and S. Hwang, On mining group patterns of mobile users, DEXA, (2003), pp. 287-296.
[Yang, 97]Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, in Proceedings of ICML-97, 14th International Conference on Machine Learning, D. H. Fisher, ed., Nashville, US, 1997, Morgan Kaufmann Publishers, San Francisco, US, pp. 412-420.