Abstract: In this paper we present a survey of sentiment, polarity and function analysis of citations. This area has developed rapidly in recent years but still has plenty of room for growth and further research. The amount of scientific information on the Web makes it necessary to innovate in the analysis of the influence of the work of peers and leaders in the scientific community. We present an overview of general concepts and review contributions to the solution of related problems, such as context identification and function and polarity classification, in order to identify some trends and suggest possible future research directions.
The number of scientific publications grows exponentially each year. To understand how a topic has evolved, researchers and scientists need to locate and access relevant contributions among large amounts of available electronic material that can only be navigated through citations.
A citation is a text that references previous work for different purposes, such as comparing, contrasting, criticizing, agreeing with, or acknowledging different sources. Citations in scientific texts are usually numerous; they connect pieces of research and relate authors across time and among communities.
Citation analysis is a way of evaluating the impact of an author, a published work or a scientific medium. [35] established that there are two types of research in the field of citation analysis of research papers: citation counting to evaluate impact [7, 12, 14, 22, 41] and citation content analysis [17]. The advantages of citation counting are its simplicity and the experience accumulated in scientometric applications, but many authors have pointed out its weaknesses. One limitation is that the count does not differentiate between the weights of high-impact and low-impact citing papers. PageRank [27] partially solved this problem with a rating algorithm. [33] proposed co-citation analysis to supplement the qualitative method with a similarity measure between works A and B, obtained by counting the number of documents that cite both of them.
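The co-citation measure described above can be illustrated with a minimal sketch. The function name `cocitation_counts` and the toy citation data are assumptions made for illustration only; they are not taken from [33].

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(references):
    """Count, for each pair of cited works, how many documents cite both."""
    counts = Counter()
    for cited in references.values():
        # Sort the de-duplicated list so each unordered pair has one key.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy data: citing document -> works it cites.
references = {
    "doc1": ["A", "B", "C"],
    "doc2": ["A", "B"],
    "doc3": ["B", "C"],
}

counts = cocitation_counts(references)
print(counts[("A", "B")])  # number of documents citing both A and B
```

A higher count for a pair suggests that the citing community treats the two works as related, independently of their individual citation counts.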
Myriam Hernández A.: Escuela Politécnica Nacional, Facultad de Ingeniería de Sistemas, Quito, Ecuador
José M. Gómez: Universidad de Alicante, Dpto de Lenguajes y Sistemas Informáticos, Alicante, España
This research work has been partially funded by the Spanish Government and the European Commission through the projects ATTOS (TIN2012-38536-C03-03), LEGOLANG (TIN2012-31224), SAM (FP7-611312) and FIRST (FP7-287607).
Recently, this type of researcher impact measure has been widely criticized. Bibliometric studies [29] show that incomplete, erroneous or controversial papers are the most cited. This can create perverse incentives for new researchers, who may be tempted to publish research that is wrong or not yet complete because doing so will earn them a higher number of citations [23]. It also affects the quality of very prestigious journals such as Nature, Science or Cell, because they know that accepting controversial articles is a very profitable way to increase citation numbers. Reviews such as the one conducted by a recent Nobel Prize laureate [30] emphasize this fact. Moreover, as claimed by [32], the quantity of articles is more influential than their quality, or than the relationship between papers with a higher number of citations and the number of citations that they, in turn, receive [40]. In this context, one can also recall the recent case of the Brazilian journals that used self-references to skew the JCR index [38].
Another limitation of this method is that a citation is interpreted as an author being influenced by the work of another, without specifying the type of influence [43], which can be misleading about the true impact of a citation [8, 23, 26, 29, 31, 42]. To better understand the influence of a scientific work, it is advisable to broaden the range of indicators to take into account factors such as the author’s disposition towards the reference. For instance, a cited work that is criticized should not carry the same weight as one that is used as the starting point of a piece of research.
These problems are compounded by the growing importance of impact indexes for researchers’ careers. Pressure to publish seems to be the cause of increased fraud in scientific literature [11]. For these reasons, it is becoming more important to correct these problems and to look for more complete metrics to evaluate researchers’ relevance, taking into account other “quality” factors, one of them being the intention of the researcher when citing the work of others.
Automatic analysis of subjective criteria present in a text is known as Sentiment Analysis (SA). It is a current research topic in the area of natural language processing in the field of opinion mining and its scope includes monitoring emotions in fields as diverse as marketing, political science and economics. It is proposed that SA be applied in the study of bibliographic citations, as part of citation content analysis, to detect the intention and disposition of the citing author to the cited work, and to give additional information to complement the calculation of the estimated impact of a publication to enhance its bibliometric analysis [2, 43].
Citation content analysis examines the qualitative/subjective features of bibliographic references, treating the words surrounding them as a context window whose size can range from a sentence to several paragraphs. Qualitative analysis includes syntactic and semantic language relationships, obtained through speech and natural language processing, and the explicit and implicit linguistic choices in the text, in order to infer the citation function and the feelings of the author regarding the cited work [43]. According to [36], there is a strong connection between citation function and sentiment classification.
A combination of quantitative and qualitative/subjective analysis would give a more complete perspective of the impact of publications in the scientific community [2]. Several methods for subjective citation analysis have been proposed, but more work is needed to achieve better results in the detection, extraction and handling of citation content, and to characterize more accurately the profile of scientists and the criticism or acceptance of their work. [6] states that when a paper is criticized, the citation should have a lower or a negative weight in the calculation of bibliometric measures.
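A minimal sketch of how polarity could enter a bibliometric count is given below. The weight values and the name `weighted_citation_score` are hypothetical; the text only states that criticized citations should receive a lower or negative weight, not what the weights should be.

```python
# Hypothetical weights: only their ordering (negative < neutral < positive)
# reflects the idea in the text; the numeric values are assumptions.
POLARITY_WEIGHTS = {"positive": 1.0, "neutral": 0.5, "negative": -0.5}


def weighted_citation_score(polarities, weights=POLARITY_WEIGHTS):
    """Sum one weight per citation instead of counting each citation as 1."""
    return sum(weights[p] for p in polarities)


# Polarity labels of the citations received by one hypothetical paper.
citations = ["positive", "positive", "neutral", "negative"]

raw_count = len(citations)
score = weighted_citation_score(citations)
print(raw_count, score)
```

With these assumed weights, a paper that is cited mostly to be criticized scores lower than one with the same raw count cited approvingly.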
In recent years, there has been explosive development in the field of sentiment analysis [21] applied to several types of text in social media, newspapers, etc., to detect sentiments in words, phrases, sentences and documents. Although results about the polarity or the function of a citation would have important applications in bibliometrics [2] and citation-based summarization [28], relatively little work has been done in this field. Some approaches from general sentiment analysis could be applied to citation analysis, but citations have special features that must be considered. According to [5], a citation sentiment recognizer differs from a general sentiment analyzer because of the singular characteristics of citations.
In this paper, we evaluate the development of subjective, sentiment and polarity citation analysis during recent years in four lines of research that are closely related: context identification, detection of implicit citations, polarity classification and purpose (function) classification. We consider proposed approaches and trace possible future work in these areas.
[36] stated that subjectivity, sentiment or polarity analysis is a more complex problem when applied to citations than in other domains. A citation sentiment recognizer differs from a general sentiment analyzer because of the singular characteristics of scientific text, which make it more difficult to detect the author’s intention and the motivation behind the citation. Some of these differences are:
Furthermore, in scientific literature the citing author’s disposition towards a cited work is often expressed indirectly, and often with academic expressions that are particular to each field of knowledge. Efforts have been made to connect sentiments to technical terms and to develop general sentiment lexicons for science [34].
Several works have sought to identify the optimal size of the context window in order to detect the sentences that refer to a citation. Most citation classification approaches are not of general application but are heavily oriented towards specific scientific domains [10]. Almost all the research in this field corresponds to Computer Science and Language Technologies, and some to Life Sciences and Biomedical topics. The most frequently used technique is supervised machine learning. [19] implemented a supervised tool with an algorithm that uses co-reference chains for extracting citations, with a Support Vector Machine (SVM) classifier handling citation strings in the MUC-7 dataset. [35] used supervised training methods with n-grams (unigrams and bigrams) and other features such as proper nouns, previous and next sentence, position, and orthographic characteristics. Two algorithms were used for the classifier: Maximum Entropy (ME) and SVM. As a result, they performed subjectivity analysis and classified text as citing or non-citing. They confirmed through experimentation that using a context composed of the sentences around the citation leads to better identification of citation sentences. They used the ACL Anthology Corpus.
[5] worked on a supervised approach with context windows of different lengths. They developed an SVM classifier that uses citations as feature sets, with n-grams of length 1 to 3 as well as dependency triplets as features. They obtained a citation sentiment classification using an annotated context window of 4 sentences. Their method detected implicit and explicit citations in that context window, taking into account the section of the article in which the citations appear (abstract, introduction, results, etc.). They specified four categories of citation sentiment: polarity, manner, certainty and knowledge type.
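The n-gram features mentioned above can be sketched as follows. This is only an illustration of extracting word n-grams of length 1 to 3 from a citation sentence; the tokenizer, the function name `ngram_features` and the example sentence are assumptions, and the dependency triplets and the SVM training step are not reproduced.

```python
import re
from collections import Counter


def ngram_features(sentence, max_n=3):
    """Return counts of word n-grams of length 1..max_n for one sentence."""
    # Simplistic tokenizer (an assumption): lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


features = ngram_features("This method outperforms the baseline")
print(sorted(features))
```

Each distinct n-gram would become one dimension of the feature vector given to a classifier.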
Conditional Random Fields (CRFs) are a technique used to tag, segment and extract information from documents, and have been used by some authors to identify context. [3] implemented a CRF-based citation context extraction method and developed a context extraction tool, CitContExt. This software detects six non-context sentence types: background, issues, gaps, description, current work outcome and future work; and seven context sentence types: cited work identifies gaps, cited work overcomes gaps, uses outputs from cited works, results with cited work, compare works of cited work, shortcomings in cited work, issue related cited work.
[2] addressed the problem of identifying the fragments of a citing sentence that are related to a target reference, which they called the reference scope. They treated this as a problem distinct from context identification because they considered citing sentences that cite multiple papers. They compared three methods: word classification with SVM, sequence labeling with CRF, and segment classification. Segment classification, which splits the sentence into segments at punctuation and coordinating conjunctions, achieved the best performance. They used a corpus of 19,000 NLP papers from the ACL Anthology Network (AAN) corpus.
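The splitting step of segment classification can be sketched as below. The conjunction list (and, but, or), the punctuation set and the example sentence are assumptions chosen for illustration; the sketch only produces candidate segments and does not implement the classifier that [2] applied to them.

```python
import re

# Assumed boundary set: commas, semicolons, colons, and the coordinating
# conjunctions "and", "but", "or" (a non-exhaustive, illustrative list).
_BOUNDARY = re.compile(
    r"\s*[,;:]\s*(?:(?:and|but|or)\b\s*)?"  # punctuation, optional conjunction
    r"|\s+(?:and|but|or)\s+"                # bare conjunction between words
)


def split_into_segments(sentence):
    """Split a citing sentence into candidate segments at the boundaries."""
    return [s.strip() for s in _BOUNDARY.split(sentence) if s and s.strip()]


sentence = (
    "Smith [1] used SVM, while Jones [2] relied on CRF "
    "and Lee [3] proposed rules."
)
segments = split_into_segments(sentence)
print(segments)
```

Each resulting segment would then be labeled as inside or outside the scope of the target reference.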
[28] experimented with probabilistic inference to extract citation context, identifying even non-explicit citing sentences, for citation-based summarization. They modeled sentences as a CRF to detect the context of data patterns, employed a Belief Propagation mechanism to detect likely context sentences, and used SVM to classify sentences as being inside or outside the context. The best results were obtained for a window in which each sentence is connected to 4 sentences on each side.
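The window structure mentioned above can be illustrated with a short sketch that selects up to 4 neighboring sentences on each side of a given sentence. It shows only the neighborhood selection; the CRF model, Belief Propagation and SVM steps are not reproduced, and the function name `context_window` and the placeholder sentences are assumptions.

```python
def context_window(sentences, index, k=4):
    """Return the sentence at `index` plus up to k sentences on each side."""
    start = max(0, index - k)                  # clip at document start
    end = min(len(sentences), index + k + 1)   # clip at document end
    return sentences[start:end]


# Placeholder sentences standing in for a tokenized document.
sentences = [f"s{i}" for i in range(12)]
window = context_window(sentences, 5)
print(window)
```

Near the beginning or end of a document the window is simply truncated rather than padded.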
[34] implemented a context extraction method with a graphical interface that displays citation and co-citation contexts as global and within-specialty maps, with links labeled by sentiment, so that they can be analyzed in terms of sentiment and content. The corpora were processed using a bag of sample cue words that denote sentiment or function, counting their frequency of appearance. They used a subset of 20 papers from the citation summary data of the AAN [28].
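Counting cue words in a citation context can be sketched as follows. The cue word lists here are invented for illustration and are not the lexicon used in [34]; the names `CUE_WORDS` and `cue_word_counts` are likewise assumptions.

```python
import re
from collections import Counter

# Hypothetical cue words; a real lexicon would be much larger.
CUE_WORDS = {
    "positive": {"effective", "successful", "improves"},
    "negative": {"fails", "limited", "however"},
}


def cue_word_counts(text, cues=CUE_WORDS):
    """Count how many tokens of the text belong to each cue-word category."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for label, words in cues.items():
        counts[label] = sum(1 for token in tokens if token in words)
    return counts


context = "Their parser is effective; however, it fails on long sentences."
counts = cue_word_counts(context)
print(counts)
```

The per-category frequencies could then be used to label the link between the citing and cited papers.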
Citation purpose (function) classification and citation polarity classification are research topics that have been explored in recent years. An automated solution to this problem was first developed by [36] and applied to 116 articles randomly chosen from an annotated corpus from the Computation and Language E-Print Archive. They used a supervised technique to automatically annotate citation text, using machine learning to classify 12 citation functions aggregated into four categories. This work did not include a subsequent classification of unlabeled text.
[25] used a supervised probabilistic classification model based on dependency trees with CRFs to distinguish positive and negative polarity, applied to a dataset formed by four corpora for sentiment detection in Japanese.
[10] used semi-supervised automatic data annotation with an ensemble-style self-training algorithm. Classification was aspect-based, with Naïve Bayes as the main underlying technique. Citations were classified as background, fundamental idea, technical basis or performance comparison, over randomly selected papers from the ACL ARC.
[4] implemented an aspect-based algorithm in which each citation had a feature set inside an SVM. The corpus was processed using WEKA, resulting in a citation polarity classification.
[18] applied a faceted algorithm using the Moravcsik and Murugesan annotation scheme and the Stanford ME classifier. The approach uses a four-facet scheme and context lengths of 1, 2 and 3 sentences. The conceptual vs. operational facet asks “Is this an idea or a tool?”; the organic vs. perfunctory facet distinguishes citations that form the foundations of the citing work from more superficial citations; the evolutionary vs. juxtapositional facet detects the relationship between the citing and cited papers (based on vs. alternative work); and the confirmative vs. negational facet captures the completeness and correctness of the cited work according to the citing work.
[24] implemented a hybrid algorithm that used discourse as a tree-model and analyzed POS as regular expressions to obtain citation relations of contrast and corroboration.
[9] obtained XML from PDF files using PDFX and then extracted citations by means of XSLT. They retrieved citation functions from the citation context with the CiTalO tool and two ontologies: CiTO2Wordnet and CiTOfunctions. The following are some of the positive, neutral and negative functions they used: agrees with, cites, cites as authority, cites a data source, cites as evidence, cites as metadata document, cites as potential solution, cites as recommended reading, cites as related, confirms, corrects, critiques, derides.
[20] used an ME-based system trained on annotated data to automatically classify functions into three categories linked to sentiment polarity: Positive (agreement or compatibility), Neutral (background), and Negative (shows weakness of cited work).
[16] implemented a citation extraction algorithm through pattern matching. They detected polarity (positive, negative and neutral) with CiTalO, a software tool they developed using a combination of ontology learning from natural language, sentiment analysis, word sense disambiguation (with SVM), and ontology mapping. The corpus consisted of 18 papers published in the seventh volume of Balisage Proceedings (2011), about Web markup technologies.
[37] developed an unsupervised bootstrapping algorithm for identifying concepts and categorizing them into two categories: Techniques and Applications.
[13] established a faceted classification of citation links in networks using SVM with linear kernels. They used three mutually exclusive categories: functional, perfunctory, and a class for ambiguous cases.
[1] used a trained classification model and an SVM classifier with a linear kernel, applied to the AAN dataset.
We have seen that the most used classification approach is machine learning, followed by dictionary-based approaches. Hybrid methods, such as a combination of machine learning with rule-based algorithms or of dictionary methods with NLP, are rarely implemented despite their good results in general sentiment analysis, probably because of their complexity.
In citation analysis, sentiments are represented in a binary way, as polarity, and not in discrete scales. Functions are categorized in diverse forms that can always be mapped to their polarity.
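As an illustration of mapping functions to polarity, the sketch below encodes the three-way grouping reported for [20] (agreement or compatibility as positive, background as neutral, weakness as negative). The label spellings and the helper name `polarity_of` are assumptions, not identifiers from that work.

```python
# Function labels follow the categories described for [20]; the exact
# string spellings are assumed for this sketch.
FUNCTION_TO_POLARITY = {
    "agreement": "positive",
    "compatibility": "positive",
    "background": "neutral",
    "weakness": "negative",
}


def polarity_of(function_label):
    """Map a citation function label to its polarity class."""
    return FUNCTION_TO_POLARITY[function_label.lower()]


print(polarity_of("Background"))
```

Other function schemes would need their own table, but the lookup itself stays the same.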
In citation context identification, reported precision varies from approximately 77% to 83%. In citation purpose and polarity analysis, reported precision varies greatly, ranging from 26% to 91%. Nevertheless, the values obtained with the different methods cannot be directly compared, because the experimental settings and evaluation methodologies diverge substantially among the studies.
In sentiment, polarity and function citation analysis there are open problems and interesting areas for future research such as:
In this paper we have presented a survey of sentiment, polarity and function analysis of citations.
Although work in this specific area has increased in recent years, there are still open problems that have not been solved and need to be investigated. There are not enough open corpora that researchers can work on in a shared way, and there is no common framework that would make results comparable with each other so that conclusions can be reached about the efficiency of different techniques. In this field it is necessary to develop conditions that allow and motivate collaborative work.