Tajinder Singh1, Madhu Kumari2, Daya Sagar Gupta3* and Nikolai Siniak4
1 Department of CSE, SLIET Longowal, India
2 Department of CSE, Chandigarh University, India
3 Department of CSE, RGIPT Amethi, India
4 Private Institute of Management and Business, Belarus
Abstract
Despite the enormous use of social media platforms for many purposes, which provides opportunities to analyze and study the social behavior of users, the role of text mining has not been fully explored. Text mining is a way to discover interesting patterns in data; its aim is to use the discovered patterns to explain current behavior or to predict future outcomes. Multiple disciplines contribute to crawling text and discovering the required textual patterns, including mathematical modeling, computer science, and data mining and warehousing, to name a few. Embeddings also play a key role here, and under the umbrella of machine learning, the Internet of Things (IoT) is keeping pace at the individual level in predicting behavior in terms of security, privacy, analysis, and prediction. This chapter explains the role of such strategies in social media text analysis for finding knowledgeable patterns. It also describes the areas of social media that are accessible to text mining through IoT-enabled, machine-learning-based services. The outcomes can serve as a baseline for future machine-learning-based IoT research in emerging applications.
Keywords: Text mining, social media, embeddings, clustering, pre-processing, Internet of Things (IoT)
Nowadays social media acts as a double-edged sword for sharing information, given its low cost, ease of use, and explosive growth. The explosion of live multimedia streams attracts online social users to contribute to real-time phenomena and become part of a vast crowd, as they want to participate by publishing information in the form of text, video, images, or shared URLs. However, the immense quantity of text available on social media, along with redundant information, demands considerable effort to process the collected social text stream. Removing unwanted information and keywords from the collected text is a very common practice. Social users often do not spell words correctly, which gives rise to several kinds of ambiguity: lexical, semantic, and syntactic. Various machine learning approaches are available that help to remove unwanted information from social text and to identify the key theme in the extracted social patterns. Consequently, text mining of online social media is expanding rapidly. In the current era a new chapter of text mining has begun, as a vast amount of text is posted on various kinds of social media such as Twitter, Facebook, forums, and blogs. Because the task is knowledge intensive, numerous types of knowledgeable patterns can be analyzed. The text is collected over a certain period of time, and various analysis tools are used to seek appropriate information from the social media resources through the classification and examination of mesmerizing patterns. The collected text usually comes in an unstructured format, and the lack of knowledge about how to deal with such irregular patterns of text has forced researchers to scrub and normalize the data.
A dynamic and attractive source can change the whole picture, as it can be a major resource for generating new patterns of information. Its dynamic nature also helps users to get timely updates whenever required, which offers a baseline not only for social users but also for those who want to participate and keep track of daily updates. This explosion in social media gives users full freedom to participate and to be a part of the current trend. Critical analysis at regular intervals is also needed to keep track of social users' activities and to assess their plausibility from a variety of perspectives. Numerous methods and training approaches, commonly used to share with new users and to analyze the reappearance of text, have been addressed and discussed [1].
All these mechanisms diffuse information widely and help social media to grow rapidly. With the remarkable growth and evolution of the World Wide Web, the number of online users is also increasing day by day. This exponential growth shows no sign of ending, and these users participate in various social media activities and generate a huge corpus of text. Multiple resources are available that combine and assemble collected text for analysis, such as the Leipzig Corpora Collection [2] and many more. The collected text requires pre-processing [3], after which various features can be extracted and used in further research areas such as opinion/sentiment mining [4, 5], polarity disambiguation in sentiment analysis [6], detection, classification, and analysis of social media events [7–9], burst analysis and trend forecasting [10], subject/topic tracking [11], recommendation methods [12], machine learning applications in the Internet of Things (IoT), and many more. Bag of Words (BOW) is a simple and practical approach to representing text. Semantic information is also carried forward with the text representation so that the analysis can be performed correctly. Such analysis helps users to understand the nature of the text and makes it possible to convert a large corpus of text into numeric data. Multiple influential mining mechanisms are available that perform this task well and extract the required information [13, 14]. A few features that can be present in social text/documents can be described as:
Thus, from the above features it is observed that text pre-processing plays an important role, and that extracting the actual meaning of the collected text streams is both challenging and interesting. Therefore, the next section explains the role of pre-processing, which provides information in various domains.
Non-standard and out-of-vocabulary words have their own role in social media, and their negative impact on decision making is quite challenging. The discussion above shows how demanding pre-processing tasks are for emerging methodologies. In general, ISBN is used as an abbreviation of “International Standard Book Number” [18], but someone may use the same abbreviation for another purpose. Abbreviations of this kind, whose sense changes from one context to another, make the pre-processing task more difficult. Besides social media, social apps such as mobile phone messaging also contain non-standard or out-of-vocabulary words. Such words are very common and handy, e.g. hru/how are you, ttyl/talk to you later, cu/see you [19], and many more. In every social medium such words are increasing rapidly, and this is now very common practice. Typos such as hiy/hey, bey/bye, and c u soon/see you soon, and repeated characters as in byeeeeee/bye, helllooooo/hello, and many more, are part of the informal abbreviations of social media. The influence of such words in social media is critical and needs to be addressed.
Performance and accuracy are degraded by non-standard and out-of-vocabulary words. Parsing is a good way to improve performance and accuracy, as it helps to handle out-of-vocabulary words [22]. Many researchers have applied parsing to forums [24] and to diverse text [25]. The authors claim that parsing gives the best results, but when errors occur or performance degrades, the major reasons are unwanted punctuation, false tokenization, and out-of-vocabulary or non-standard words. Thus, we can say that wrong POS tagging generates errors in parsing [19]. Such issues also arise in machine translation, where the machine's interpretation can be wrong because of the negative influence of out-of-vocabulary words. Let us consider an example of a non-standard sentence that is pre-processed to recover the actual form of its words.
Before pre-processing: Finally he got a chance 2 wrk 2gther on a project.
After pre-processing: Finally, he got a chance to work together on a project.
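The before/after example above can be reproduced with a simple lookup-based normalizer. The following is a minimal sketch and not a method taken from the cited works: the `SLANG` table and the `normalize` function are hypothetical names introduced here, and a realistic table would be far larger and context-aware (for instance, the digit 2 does not always stand for “to”). Punctuation such as the comma after “Finally” is not restored by this sketch.

```python
import re

# Hypothetical lookup table; entries are taken from the examples in this chapter.
SLANG = {
    "2": "to",          # assumption: only valid when "2" is used as a word
    "wrk": "work",
    "2gther": "together",
    "hru": "how are you",
    "ttyl": "talk to you later",
    "cu": "see you",
}

def normalize(text):
    """Replace every alphanumeric token found in SLANG by its standard form."""
    def replace_token(match):
        token = match.group(0)
        # Tokens that are not in the table are returned unchanged.
        return SLANG.get(token.lower(), token)
    return re.sub(r"[A-Za-z0-9]+", replace_token, text)

print(normalize("Finally he got a chance 2 wrk 2gther on a project."))
```

Because the lookup ignores context, any literal number 2 in the text would also be rewritten, which is one reason the context-aware approaches discussed later are needed.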
The role of capitalized words in named entity recognition (NER), which are also a kind of non-standard word, cannot be ignored. It is observed that with the increase of slang, non-standard, and out-of-vocabulary words on social media, the share of non-standard words is also growing. Sometimes slang has no global impact but does have local impact, and it cannot be ignored in text mining. In text mining all of these act as noise, and their influence degrades the quality and performance of the text mining task. On Twitter, #tags and @tags also play a significant role and are important to consider; in [26] the authors work on a domain adaptation problem based on a tagging mechanism. So we can say that tagging and its influence on contextual information are important to consider for successful text mining. For this purpose, Section 11.3 discusses modern text pre-processing approaches along with their application domains.
In text mining research, pre-processing is required to convert non-standard and out-of-vocabulary words into their actual words. Sometimes contextual information is lost during pre-processing, so it is also very important to capture the contextual information in a suitable form. In social media text analytics, contextual information is usually not considered, which leads to wrong decisions. In the current era, sentiment analysis, polarity disambiguation in sentiment analysis, event detection, and classification are very common research areas, and new events happen on social media every day. It is necessary to find the contextual information in order to determine the exact meaning of a particular keyword [19]. For a human it is easy to understand and recognize contextual information, but in machine translation it becomes a difficult task, and effort is required to understand and deal with this situation. Previous studies show that this is a very complex task and that such keywords also demand additional information, which is usually ignored.
Previous studies of this text mining problem commonly use a context-insensitive lexical approach. The main aim of this approach is to reduce or remove repeated characters to recover the exact word. It works very well for minimizing repeated characters, but it makes it difficult to analyze the actual sense and exact meaning of the word, including the contextual information. Let us consider an example to understand this scenario. The first example is “Police will charge rupees 5 thousand fine for not fastening seat belt in a car”, whereas the second is “I am absolutely fyn, What about you?”.
In these examples, fine has two senses: in the first it denotes money, and in the second (written as fyn) it denotes a person's feelings/emotions. From this example we can say that pre-processing must consider both context and ambiguity in social media text. Several approaches that handle ambiguity and also take care of contextual information are explained below. Machine translation based and spell checking based pre-processing have been in common use for many years. Besides these two, we also explain the recent approaches used in pre-processing and contextual polarity disambiguation.
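The repeated-character reduction used by context-insensitive lexical approaches can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of any cited work: the tiny `VOCAB` set and the `squeeze` function are hypothetical, and each run of repeated characters is tried at length one or two until a vocabulary word is found.

```python
from itertools import groupby, product

# Hypothetical vocabulary; a real system would use a full dictionary.
VOCAB = {"bye", "hello", "see", "soon", "fine"}

def squeeze(word, vocab=VOCAB):
    """Shorten runs of repeated characters until the word is in the vocabulary."""
    if word in vocab:
        return word
    # Split the word into runs, e.g. "helllooooo" -> h, e, lll, ooooo.
    runs = [(ch, len(list(group))) for ch, group in groupby(word)]
    # A run of length >= 2 may stand for one or two copies of the character.
    options = [[ch] if n == 1 else [ch, ch * 2] for ch, n in runs]
    for combination in product(*options):
        candidate = "".join(combination)
        if candidate in vocab:
            return candidate
    return word  # no vocabulary match: leave the word unchanged

print(squeeze("byeeeeee"), squeeze("helllooooo"))
```

The sketch recovers a surface form only. It has no way of telling the two senses of fine apart, which is exactly the limitation of context-insensitive approaches described above.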
Similarly, a machine learning approach based on character translation is commonly used for text pre-processing [31]. Character-based machine translation can handle sparsity easily. In social media text mining, the sparse nature of words and the presence of non-standard words make it very difficult to find a particular keyword in the collected data. Therefore, a character-level machine translation approach is used for informal text normalization. When this approach is applied, the database should be updated from time to time as users introduce new words with new abbreviations and types. For example, if “whr” is defined in the database as where, it will be recognized; otherwise the approach will fail to understand the word because the information does not exist at the character-level translation.
Dictionaries are usually built from large lists of unigrams, such as the Google N-gram text corpus. A likelihood approach is used: it depends on how the user chooses the prior information and on the resulting probability that two words match. Distributional patterns of words are also based on the probability of matching words. In this approach a misspelled query word is matched against the dictionary, and various matching patterns are generated. The probability of finding the correct pattern is then computed. On social media, however, several non-standard words are totally different from their standard forms and lie outside the range of a spell checker, so a standard distribution of matching patterns has to be applied. As discussed above, repeated characters can be corrected, but several words in social media text cannot be detected by spell checkers and are not available to a dictionary-based approach (e.g. f9, b4, str8).
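A dictionary-based corrector of the kind described above can be sketched with a unigram count table and candidates one edit away from the query. This is a minimal illustration in the style of well-known edit-distance spell checkers, not the method of a specific cited work; the `COUNTS` table is a hypothetical stand-in for a large unigram list, and the counts play the role of the prior.

```python
import string

# Hypothetical unigram counts standing in for a large n-gram corpus.
COUNTS = {"work": 120, "word": 90, "where": 80, "before": 60, "fine": 40, "straight": 10}

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts=COUNTS):
    """Return the most frequent dictionary word within one edit, else the input."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # outside the reach of the spell checker
    return max(candidates, key=counts.get)

for query in ["wrk", "whre", "f9", "b4", "str8"]:
    print(query, "->", correct(query))
```

With this table, forms such as f9, b4, and str8 come back unchanged, which mirrors the limitation noted in the text: their standard forms are too far from the written form for an edit-distance search to reach.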
Various other approaches are based on the same sequence of steps. In [36] a machine learning approach is used for text pre-processing and applied to short text messages such as tweets and SMS to find non-standard words; its performance was computed and analyzed on standard data sets. In the same way, [37] designs a source-channel-based text processing approach. Four factors are included in this approach: an orthographic factor, a phonetic factor, a contextual factor, and short-form expansion. The study showed that the sources of error in Twitter data can be reduced. The authors converted various abbreviations into actual words, such as “cu/see you”, a frequently occurring case covered by their model.
In [40] the authors use an unsupervised machine learning methodology to pre-process text. An approach based on random walks is applied to noisy unlabeled text. Contextual information and similarity are also derived from the collected sequence of text.
In this approach a bipartite graph G(W, C, E) is generated, in which ‘W’ contains all nodes representing pre-processed keywords and noisy out-of-vocabulary keywords, ‘C’ stands for the contextual information, and ‘E’ represents the edges. The edge weight is updated on the basis of word frequency: each time the same word occurs again during the analysis stage, the weight is incremented by one.
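The construction of the bipartite graph G(W, C, E) can be sketched as below. This only illustrates the weight update described above and does not implement the random walk of [40]. Two details are assumptions made for this sketch: a context is taken to be the pair of neighbouring tokens, and `<s>` and `</s>` are hypothetical sentence-boundary markers.

```python
from collections import defaultdict

def build_bipartite(sentences):
    """Build word nodes W, context nodes C, and weighted edges E."""
    weights = defaultdict(int)  # E: (word, context) -> weight
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            # Assumption: the context of a word is its left and right neighbour.
            context = (tokens[i - 1], tokens[i + 1])
            # Each further occurrence of the word in this context adds one.
            weights[(tokens[i], context)] += 1
    words = {word for word, _ in weights}
    contexts = {context for _, context in weights}
    return words, contexts, dict(weights)

W, C, E = build_bipartite(["see you tomorrow", "see u tomorrow", "see you tomorrow"])
print(E[("you", ("see", "tomorrow"))], E[("u", ("see", "tomorrow"))])
```

In this toy input the noisy token u and the standard token you share the context node (see, tomorrow), which is the kind of link a random walk over the graph could exploit to propose you as the normalized form of u.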
Another way of pre-processing text is a parsing strategy. Parsers are widely used by researchers to find the exact meaning of non-standard words in the collected corpus. A parsing-based approach is applied in [41], whose main aim is to analyze non-standard words and convert them into standard words. Parser performance is directly associated with normalization performance, so evaluation-based approaches can be used to measure the performance of a given approach with respect to existing approaches. Several measures are available to compute the performance of pre-processing approaches; precision, recall, and F-measure are widely used and accepted. A modern sequential text pre-processing approach is given in [42], with the main focus on English lexicons. Various strategies are applied to analyze the non-standard words, and in the first phase these words are converted into standard words. After the analysis phase, the domain information of the text is analyzed in the next phase, which enables further processing of the text.
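The evaluation measures named above follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure as their harmonic mean. The sketch below computes them from raw counts. How true positives, false positives, and false negatives are counted for a normalization task varies between studies, so the counts in the usage line are purely illustrative.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts: 6 correct normalizations, 2 wrong ones, 4 missed.
print(precision_recall_f(tp=6, fp=2, fn=4))
```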
Among modern approaches to text pre-processing, an integrated, unsupervised statistical methodology is also very popular and is used by various researchers to normalize text. This approach is used in [43], which focuses on converting out-of-vocabulary words into standard words. Random features are selected from the collected text and a log-linear method is applied to extract the exact meaning of the words. In real-time settings we usually deal with text streams; in the case of Twitter, a collected social text stream is represented as S = {s1, s2, …, sn}. The main aim of generating such a stream is to identify the exact meaning of the keywords in the tweets, which are represented as tweet = {t1, t2, …, tn}. A similar approach is used by various social media text analysis methods to analyze the meaning of words. Finding real-time events under different time strategies is also part of social text stream analysis, since the stream is dynamic in nature and changes over time. Table 11.1 summarizes various text pre-processing techniques.
Table 11.1 Summary of text pre-processing approaches.
Reference and year | Major contribution | Methodology | Scoring |
---|---|---|---|
[28, 47, 55] | Open-source toolkit and SMT decoder. | Machine Learning Approach (Supervised) | |
[29, 17] | NLP (Natural Language Processing) | Machine Learning Approach (Supervised) | |
[30, 41, 58] | LI-rule, LS-rule, SMT. | Numerical Statistical based machine translation (SMT) | |
[10, 19, 33, 40, 49] | Language model (5-gram); pre-processing based on dictionary generation | Machine Learning Approach (Unsupervised) | |
[43, 61] | Log-linear model and UNLOL | Machine Learning Approach (Unsupervised) | |
[44, 52] | Language models | Machine Learning Approach (Unsupervised) | |
Temporal and continuous extraction of social text from a social text stream underpins every kind of textual feature. For extracting social text from Twitter, an API is provided that helps to extract social text patterns based on a user's query. In most cases the extraction covers a period ‘T’, a closed time interval [time1, time2], over which the collected tweets are written as T = (t1, t2, …, ti, …, tr, …, tn). A stream of social text is a sequence ‘S’ collected over successive intervals, S = {s1, s2, …, si, …, sr, …, sn}. An element si of S may hold a tweet ti, and this tweet can be connected or linked with other entities in a social network. To analyze the actual connections and links between tweets, various clustering mechanisms are used. Clustering is an easy way to categorize a stream of text into classes of similar items [44]. The social text stream S = (s1, s2, …, si, …, sr, …, sn) is distributed among a set of clusters C′ = (C1, C2, …, Cj), where C1, C2, …, Cj ⊆ S and S = C1 ∪ C2 ∪ … ∪ Cj. Each object si belongs to one of the clusters in C′.
A similarity-based function can also be used to assign a cluster to every incoming piece of text on the basis of a similarity index. If a new keyword arrives that matches no existing cluster, a new cluster is created; otherwise the incoming text is assigned to an existing cluster, which increases the popularity of that cluster and marks it as a happening topic.
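The assignment rule described above can be sketched as a single-pass clustering loop. The sketch rests on assumptions that are not from the source: similarity is measured as Jaccard overlap between token sets, and the `threshold` of 0.3 is an arbitrary illustrative value.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def single_pass(stream, threshold=0.3):
    """Assign each incoming text to the most similar cluster or open a new one."""
    clusters = []  # each cluster: {"terms": set of tokens, "members": list of texts}
    for text in stream:
        tokens = set(text.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(tokens, cluster["terms"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(text)  # the existing cluster grows
            best["terms"] |= tokens
        else:
            clusters.append({"terms": set(tokens), "members": [text]})
    return clusters

stream = ["earthquake hits city", "strong earthquake hits coast", "celebrity wedding today"]
for cluster in single_pass(stream):
    print(len(cluster["members"]), cluster["members"])
```

The number of members in a cluster is a simple proxy for the popularity mentioned above: a cluster that keeps attracting incoming text corresponds to an active event.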
In social media, an event is anything happening in the surroundings. If an event is popular it will be very active on every kind of social media, whereas other events will be ignored. As a simple example, the wedding of a celebrity will act as a big event, whereas the wedding of an ordinary citizen will be commonplace. Therefore, it is important to know the global context of an event, since local events do not receive the same recognition in social media history as global events [45]. From this discussion, a social happening can be categorized in two ways:
Social text stream: It can be represented as ‘S’ and is used to extract the tweets ‘T’ in a specific time interval such as [time1, time2]. The extracted tweets form a chain of text, represented as T = (t1, t2, …, ti, …, tr, …, tn). In other words, social text streams combine continuous and discrete time intervals in which various keywords may be linked with each other directly or indirectly. Information about the sender and receiver is to be recorded, and this scenario may vary from network to network.
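The relation between the closed interval [time1, time2] and the segments s1, …, sn of the stream can be made concrete with a small sketch. It assumes that tweets arrive as (timestamp, text) pairs and that segments have a fixed width `step`; both are assumptions made for illustration, and no real Twitter API call is shown.

```python
import math
from datetime import datetime, timedelta

def split_stream(tweets, time1, time2, step):
    """Split tweets falling in [time1, time2] into consecutive segments of width step."""
    n = max(1, math.ceil((time2 - time1) / step))
    segments = [[] for _ in range(n)]
    for timestamp, text in tweets:
        if time1 <= timestamp <= time2:  # closed interval
            # A tweet exactly at time2 is placed in the last segment.
            index = min(int((timestamp - time1) / step), n - 1)
            segments[index].append(text)
    return segments

start = datetime(2020, 1, 1, 0, 0)
end = datetime(2020, 1, 1, 3, 0)
tweets = [
    (datetime(2020, 1, 1, 0, 10), "a"),
    (datetime(2020, 1, 1, 1, 30), "b"),
    (datetime(2020, 1, 1, 3, 0), "c"),
    (datetime(2020, 1, 1, 4, 0), "d"),  # outside the interval, dropped
]
print(split_stream(tweets, start, end, timedelta(hours=1)))
```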
When text is extracted from a social text stream, it is full of noise such as unwanted symbols, abbreviations, short forms of specific words, and possibly additional information in the form of tags and emoticons. For this reason a pre-processing approach is required to clean the social text stream. We have studied various machine learning based methods that can clean the text efficiently. Various methods and techniques exist that classify and detect events from social text, and those methods can also be applied to perform pre-processing.
After pre-processing, the processed text is stored on the basis of similarity. For this purpose, clustering is used to place similar texts in a single cluster, and the remaining texts are organized accordingly. Existing studies usually use dynamic clustering [46]. Various applications such as topic detection, event classification, event detection, trend analysis, and many more depend on clustering [47, 48]. In previous studies many authors have applied clustering to various text mining applications; in [48] and [49], for instance, clustering is applied to event detection and tracking. Single-pass clustering is implemented to analyze the text, and an adaptive filtering mechanism is used to understand the evolving nature of social events.
Both supervised and unsupervised clustering are also used for content analysis. In [50] the authors used a machine learning mechanism, whereas [51] studies clustering based on machine learning approaches to identify events and their evolving nature. In social media, extracting patterns from a social text stream usually involves three steps, which are as follows:
In [52] the authors developed a clustering-based approach for event burst detection. A spatial approach is used to analyze the numerous features of the collected text, whereas [53] studies a feature-pivot approach based on clustering over a time window. The time window helps to analyze various kinds of social patterns, which change as the time window changes [54]. In [55] a spatial clustering mechanism is applied to Flickr text, and in [56] k-means clustering combined with DWT is used to extract events from images. LDA-based methods are also very popular in social text analysis and can be applied after pre-processing. Various authors have used Bayesian filters to analyze social text and extract features such as location and time [57, 58]. Such applications help to study unplanned events such as earthquakes and tsunamis very well [59, 60]. A language-based model called the Latent Dirichlet Allocation Category Language Model is designed for topic modeling and also helps to categorize text into different classes.
So, it is clear from the discussion that pre-processing is the backbone of every application of text mining. Without pre-processing, further analysis is not possible, and it is better to understand the role of pre-processing so that an efficient and superior approach can be followed. We now list some issues that emerge from the discussion and exist in current studies.
In this chapter, therefore, we have explained the role of the text pre-processing task in various application domains of text mining by considering various social text streams. Embeddings also play a key role in social media text analysis and cannot be ignored. Weighted word embeddings can be used to build effective vector representations of very short fragments of text. Embeddings can be used in social media from various perspectives, and weighted embeddings are used widely and effectively. With embeddings it is very easy to apply a clustering mechanism that combines similar keywords into a group. Thus, embeddings are fast and efficient to use, and for this reason the next section explains the use of embeddings in the text mining domain.
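A weighted embedding of a short text fragment can be sketched as the weighted average of its word vectors. Everything concrete in the block is hypothetical: the two-dimensional `VECTORS`, the `WEIGHTS` (which could be IDF-like scores), and the function names. Real embeddings would come from a trained model and have many more dimensions.

```python
import math

# Hypothetical 2-dimensional word vectors and per-word weights.
VECTORS = {"earthquake": [1.0, 0.0], "tsunami": [0.8, 0.6], "wedding": [0.0, 1.0]}
WEIGHTS = {"earthquake": 2.0, "tsunami": 1.0, "wedding": 1.0}

def embed(text, vectors=VECTORS, weights=WEIGHTS):
    """Weighted average of the vectors of the known words in text."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return None  # no known word, no representation
    dim = len(next(iter(vectors.values())))
    total = sum(weights.get(t, 1.0) for t in tokens)
    return [sum(weights.get(t, 1.0) * vectors[t][i] for t in tokens) / total
            for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

fragment = embed("earthquake tsunami")
print(cosine(fragment, embed("earthquake")), cosine(fragment, embed("wedding")))
```

Once fragments are represented as vectors, a similarity such as the cosine above can be used by a clustering mechanism to group similar keywords, as noted in the text.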
Embeddings are used to represent a social text stream or corpus in vector form. Figure 11.1 describes the general architecture of an embedding.
Embeddings are usually classified into three main classes on the basis of topic. Figure 11.2 shows a way in which text can be retrieved and ranked with the help of embeddings.
Latent semantic indexing (LSI): LSI is widely used to identify and analyze hidden concepts and structure in the collected text. In this method each keyword is represented as a vector. Semantic relationships between documents and terms are analyzed to identify hidden relationships within the social text. To produce knowledgeable patterns, pre-processing is also required at the initial stage to clean the text.
Figure 11.3 depicts LSI (Latent Semantic Indexing). In this representation, the SVD (Singular Value Decomposition) is computed in the first stage. In Figure 11.3, ‘D’ is a representation of ‘M’ in ‘r’ dimensions, ‘T’ is a matrix for transforming new documents, and ∑ is a diagonal matrix that gives the relative importance of the dimensions [61].
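In the usual formulation of LSI, the roles of ‘T’, ∑, and ‘D’ described above correspond to a rank-r truncated SVD. The block below states that standard decomposition under two assumptions not given in the source: ‘M’ is taken to be an m × n term-document matrix with terms as rows, and a new document is folded in by the standard projection.

```latex
% Rank-r truncated SVD of the term-document matrix M (terms as rows: assumption)
\[
M \approx M_r = T \, \Sigma \, D^{\top},
\qquad
T \in \mathbb{R}^{m \times r},\quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\quad
D \in \mathbb{R}^{n \times r},\quad
\sigma_1 \ge \dots \ge \sigma_r > 0
\]

% Folding a new document vector d (length m) into the r-dimensional space
\[
\hat{d} = \Sigma^{-1} T^{\top} d
\]
```

Each row of D is then the r-dimensional representation of one document, and the singular values on the diagonal of ∑ rank the dimensions by importance.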
The overall scenario of the model is represented as:
Here, f(q, d) stands for the score between a query q and an individual text d. The weight matrix is the parameter to be learnt. For training, a normalized text stream/corpus is considered. For a given query ‘q’, both related and unrelated texts are analyzed. We would like to choose ‘S’ such that related texts are ranked above unrelated ones, and the retrieved information can be merged as given below:
Recurrent neural network: Recurrent neural networks (RNNs) are attracting huge interest in the field of text mining. RNNs remain widely applied in language modeling because of their neurons, which can access an internal memory that maintains information about the preceding state. Owing to this property, contextual information related to every keyword can be preserved. RNNs are called recurrent because they perform the same function for every element of a sequence, with the output depending on the preceding computations. The distributed hidden state of an RNN helps to store a large amount of information about the past efficiently. In addition, the non-linear dynamics allow the hidden state to be updated in complex ways. Figure 11.4 shows a folded RNN and Figure 11.5 describes a complete network.
In Figure 11.4, ‘xt’ is the input, ‘st’ is the hidden state at time ‘t’, and ‘ot’ is the output. In mathematical form, an RNN can be described as:
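A common textbook formulation with these symbols is s_t = tanh(U x_t + W s_(t-1)) and o_t = softmax(V s_t). The weight matrices U, W, and V are not named in the text above and are an assumption of this sketch, as is the choice of tanh and softmax. The forward pass can be written with plain Python lists:

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(value - top) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, s_prev, U, W, V):
    """One step: s_t = tanh(U x_t + W s_prev), o_t = softmax(V s_t)."""
    pre = [a + b for a, b in zip(matvec(U, x_t), matvec(W, s_prev))]
    s_t = [math.tanh(p) for p in pre]
    o_t = softmax(matvec(V, s_t))
    return s_t, o_t

def rnn_forward(inputs, U, W, V):
    """Run the recurrence over a sequence, starting from a zero hidden state."""
    state = [0.0] * len(W)
    outputs = []
    for x_t in inputs:
        state, o_t = rnn_step(x_t, state, U, W, V)
        outputs.append(o_t)
    return state, outputs

# Hypothetical 2x2 weights, chosen only to make the example concrete.
U = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, -1.0], [-1.0, 1.0]]
final_state, outputs = rnn_forward([[1.0, 0.0], [0.0, 1.0]], U, W, V)
print(final_state, outputs[-1])
```

The final hidden state depends on the earlier input through the W term, which is the mechanism by which an RNN preserves contextual information about preceding keywords.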
In text mining experiments we can either use an existing text corpus or extract text from social media, called a text stream. Current research is based on real-time social text streams, which help to analyze informative social patterns for multiple purposes. Therefore, in this section we present a Twitter text stream. Because it is publicly accessible, we also used this text stream for our experiments.
Table 11.2 Summary of collected statistics.
Statistics | Total values |
---|---|
Total feature entities | 3453214 |
Total types of feature’s categories | 67540 |
Connected features | 34398 |
Categories for features | 5430 |
Average token features | 564 |
Average predicates associated with features | 432 |
Average total literals associated with features | 289 |
We performed a series of experiments to understand the role of embeddings in social text streams. For this purpose we configured the model by implementing it on various social text streams. We implemented and analyzed the results for the Skip-Gram-based and word2vec optimization steps.
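For readers unfamiliar with Skip-Gram, the training data of such a model consists of (target, context) word pairs drawn from a sliding window. The sketch below generates these pairs only. It is not the implementation used in the experiments, the window size is an illustrative assumption, and the optimization step itself (for example negative sampling) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        low = max(0, i - window)
        high = min(len(tokens), i + window + 1)
        for j in range(low, high):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("strong earthquake hits the coast".split(), window=1))
```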
This chapter has explained the standard extraction of insights from social text streams for event evaluation. We have also explained how well it applies to grammatically complex languages and local slang. Various processes and embedding mechanisms were studied and identified as belonging to the standard structure of social media analysis, such as text pre-processing, embeddings, clustering, and extraction of event-related keywords from the collected text. From the above discussion it is observed that embeddings are very useful in social text analysis, and from the experimental studies we can say that various features associated with the collected text can be distinguished from disjunctively written languages. It is also observed that using embeddings on a Twitter social text stream has a positive impact on accuracy. In the future, we are interested in extending the embedding models to analyze the impact using large volumes of data sets and social streams for pre-processing and keyword extraction.