Myriam Hernández A. and José M. Gómez

Sentiment, Polarity and Function Analysis in Bibliometrics: A Review

Abstract: In this paper we present a survey of sentiment, polarity and function analysis of citations. This is an interesting area that has developed considerably in recent years but still has plenty of room for growth and further research. The amount of scientific information on the Web makes it necessary to innovate in the analysis of the influence of the work of peers and leaders in the scientific community. We present an overview of general concepts and review contributions to related problems such as context identification and function and polarity classification, in order to identify some trends and suggest possible future research directions.

1 Introduction

The number of publications in science grows exponentially each passing year. To understand the evolution of a topic, researchers and scientists need to locate and access relevant contributions among large amounts of electronic material that can only be navigated through citations.

A citation is a text that references previous work with different purposes, such as comparing, contrasting, criticizing, agreeing, or acknowledging different sources. Citations in scientific texts are usually numerous, connect pieces of research and relate authors across time and among communities.

Citation analysis is a way of evaluating the impact of an author, a published work or a scientific venue. [35] established that there are two types of research in the field of citation analysis of research papers: citation counting to evaluate impact [7, 12, 14, 22, 41] and citation content analysis [17]. The advantages of citation counting are its simplicity and the experience accumulated in scientometric applications, but many authors have pointed out its weaknesses. One limitation is that the count does not differentiate between the weights of high- and low-impact citing papers. PageRank [27] partially solved this problem with a rating algorithm. [33] proposed co-citation analysis to supplement citation counting with a similarity measure between works A and B, obtained by counting the number of documents that cite both.

Myriam Hernández A.: Escuela Politécnica Nacional, Facultad de Ingeniería de Sistemas, Quito, Ecuador, e-mail: [email protected]

José M. Gómez: Universidad de Alicante, Dpto de Lenguajes y Sistemas Informáticos, Alicante, España, e-mail: [email protected]

This research work has been partially funded by the Spanish Government and the European Commission through the projects ATTOS (TIN2012-38536-C03-03), LEGOLANG (TIN2012-31224), SAM (FP7-611312) and FIRST (FP7-287607).

Recently, this type of impact measure has been widely criticized. Bibliometric studies [29] show that incomplete, erroneous or controversial papers are among the most cited. This can generate perverse incentives for new researchers, who may be tempted to publish even though their research is flawed or not yet complete, because this way they will receive a higher number of citations [23]. In fact, it also affects the quality of very prestigious journals such as Nature, Science or Cell, which know that accepting controversial articles is a profitable way to increase citation numbers. Statements such as those made by a recent Nobel laureate [30] emphasize this fact. Moreover, as claimed by [32], the quantity of articles is more influential than their quality, as is the relationship between papers with a higher number of citations and the number of citations that they, in turn, receive [40]. In this context one can also recall the recent case of Brazilian journals that used self-references to skew the JCR index [38].

Another limitation of this method is that a citation is interpreted as one author being influenced by the work of another, without specifying the type of influence [43], which can be misleading concerning the true impact of a citation [8, 23, 26, 29, 31, 42]. To better understand the influence of a scientific work it is advisable to broaden the range of indicators to take into account factors such as the author's disposition towards the reference because, for instance, a cited work that is criticized should not have the same weight as one that is used as the starting point of a research effort.

These problems are compounded by the growing importance of impact indexes for researchers' careers. Pressure to publish seems to be the cause of increased fraud in scientific literature [11]. For these reasons, it is becoming more important to correct these problems and look for more complete metrics to evaluate researchers' relevance, taking into account many other “quality” factors, one of them being the intention of the researcher when citing the work of others.

Automatic analysis of the subjective criteria present in a text is known as Sentiment Analysis (SA). It is a current research topic in natural language processing, in the field of opinion mining, and its scope includes monitoring emotions in fields as diverse as marketing, political science and economics. It has been proposed that SA be applied to the study of bibliographic citations, as part of citation content analysis, to detect the intention and disposition of the citing author towards the cited work, and to provide additional information that complements the calculation of the estimated impact of a publication, enhancing its bibliometric analysis [2, 43].

Citation content analysis examines the qualitative/subjective features of bibliographic references, considering the words surrounding them as a context window whose size can fluctuate from a sentence to several paragraphs. Qualitative analysis covers syntactic and semantic language relationships, through discourse and natural language processing, as well as the explicit and implicit linguistic choices in the text, in order to infer the citation function and the feelings of the author regarding the cited work [43]. According to [36], there is a strong connection between citation function and sentiment classification.

A combination of quantitative and qualitative/subjective analysis would give a more complete perspective of the impact of publications in the scientific community [2]. Some methods for subjective citation analysis have been proposed by different authors, but more work is needed to achieve better results in the detection, extraction and handling of citation content, and to characterize more accurately the profile of scientists and the criticism or acceptance of their work. [6] states that when a paper is criticized, it should have a lower or even a negative weight in the calculation of bibliometric measures.

In recent years there has been an explosive development in the field of sentiment analysis [21] applied to several types of text in social media, newspapers, etc., to detect sentiment in words, phrases, sentences and documents. Although results on the polarity or the function of a citation would have important applications in bibliometrics [2] and citation-based summarization [28], relatively little work has been done in this field. Some approaches from general sentiment analysis can be applied to citation analysis, but citations have special features that must be considered. According to [5], a citation sentiment recognizer differs from general sentiment analysis because of the singular characteristics of citations.

In this paper, we evaluate the development of subjective, sentiment and polarity citation analysis in recent years along four closely related lines of research: context identification, detection of implicit citations, polarity classification and purpose (function) classification. We review proposed approaches and trace possible future work in these areas.

2 Citation Sentiment Analysis

[36] stated that subjectivity, sentiment or polarity analysis is a more complex problem when applied to citations than in other domains. A citation sentiment recognizer differs from a general sentiment analyzer because of the singular characteristics of scientific text, which make it more difficult to detect the author's intention and the motivation behind the quote. Some of these differences are:

  – Authors do not always state the citation purpose explicitly. Sentiment in citations is often hidden to avoid explicit criticism, especially when it cannot be quantitatively justified [15].
  – Most citation sentences do not express sentiment; they are neutral and objective because they are merely descriptive or factual.
  – Negative polarity is often expressed as the result of a comparison with the author's own work, and is presented in the paper's evaluation sections.
  – There is a specific science-related sentiment lexicon, primarily composed of technical terms that diverge among scientific fields [39].
  – The citation context fluctuates from one clause to several paragraphs. Delimiting the scope of a citation is a problem not yet satisfactorily solved, and it involves lexical, semantic and discourse considerations. [5] stated that most of the work in subjective citation analysis covers only the citation sentence and stressed the necessity of covering a wider citation context to detect the polarity or function of a citation more accurately. Moreover, [35, 36] pointed out that not all related mentions of a paper are concentrated in the area surrounding the citation, and expressed the need for natural language processing methods capable of finding citations and linking them with qualifying sentences that are not around the citation's physical location.
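The context-scope problem in the last point can be made concrete. The sketch below is a minimal illustration, not any cited author's method: it finds sentences containing an explicit citation marker and returns a symmetric window of sentences around each one. The window radius, the regular expression and all names here are our own assumptions for illustration.

```python
import re

# Matches bracketed markers like "[12]" or parenthetical ones like "(Smith et al., 2010)".
CITATION_RE = re.compile(r"\[\d+\]|\([A-Z][a-z]+ et al\.,? \d{4}\)")

def citing_indices(sentences):
    """Indices of sentences that contain an explicit citation marker."""
    return [i for i, s in enumerate(sentences) if CITATION_RE.search(s)]

def context_window(sentences, idx, radius=2):
    """The citing sentence plus up to `radius` sentences on each side."""
    return sentences[max(0, idx - radius):idx + radius + 1]
```

As the last point above notes, any fixed window like this one misses qualifying sentences located far from the citation, which is precisely the open problem.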

Furthermore, in scientific literature the citing author's disposition towards a cited work is often expressed in indirect ways, frequently with academic expressions that are particular to each knowledge field. Efforts have been made to connect sentiments to technical terms and to develop general sentiment lexicons for science [34].

3 Citation Context Identification and Detection of Implicit Citations

Different works have sought to identify the optimal size of context windows in order to detect the sentences referring to a citation. Most citation classification approaches are not of general application but are heavily oriented towards specific science domains [10]. Almost all the research in this field corresponds to the Computer Science and Language Technologies fields, and some to Life Science and Biomedical topics. The most frequently used techniques are supervised machine learning algorithms. [19] implemented a supervised tool with an algorithm using co-reference chains for extracting citations, with a Support Vector Machine (SVM) classifier handling citation strings in the MUC-7 dataset. [35] used supervised training methods with n-grams (unigrams and bigrams) and other features such as proper nouns, previous and next sentence, position, and orthographic characteristics. Two algorithms were used for the classifier: Maximum Entropy (ME) and SVM. With these they performed subjectivity analysis and classified text as citing or non-citing. Working on the ACL Anthology Corpus, they confirmed through experimentation that using a context composed of the sentences around the citation leads to a better identification of citation sentences.
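The supervised citing/non-citing setup of [35] can be sketched with the same unigram and bigram features, though for a self-contained example we substitute a from-scratch multinomial Naive Bayes for their SVM/ME classifiers; the toy training sentences are invented for illustration.

```python
import math
from collections import Counter

def ngram_features(sentence, n_max=2):
    """Unigram and bigram features, as used by the supervised approaches above."""
    toks = sentence.lower().split()
    feats = list(toks)
    for n in range(2, n_max + 1):
        feats += [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return feats

class NaiveBayesCitingClassifier:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in for SVM/ME)."""

    def fit(self, sentences, labels):
        self.priors = Counter(labels)
        self.counts = {label: Counter() for label in self.priors}
        self.vocab = set()
        for sent, label in zip(sentences, labels):
            feats = ngram_features(sent)
            self.counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, sentence):
        feats = ngram_features(sentence)

        def log_score(label):
            # Log prior plus smoothed log likelihood of each feature.
            total = sum(self.counts[label].values()) + len(self.vocab)
            score = math.log(self.priors[label])
            for f in feats:
                score += math.log((self.counts[label][f] + 1) / total)
            return score

        return max(self.priors, key=log_score)
```

A real system would also add the positional and orthographic features mentioned above; this sketch only shows the lexical core of the pipeline.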

[5] worked on a supervised approach with context windows of different lengths. They developed a classifier using citations as feature sets with SVM, with n-grams of length 1 to 3 as well as dependency triplets as features. They obtained a citation sentiment classification using an annotated context window of 4 sentences. Their method detected implicit and explicit citations in that context window, taking into account the section of the article where citations appear (abstract, introduction, results, etc.). They specified four categories of citation sentiment: polarity, manner, certainty and knowledge type.

Conditional Random Fields (CRFs) are a technique used to tag, segment and extract information from documents, and they have been used by some authors to identify context. [3] implemented a CRF-based citation context extraction and developed a context extraction tool, CitContExt. This software detects six non-context sentence types: background, issues, gaps, description, current work outcome and future work; and seven context sentence types: cited work identifies gaps, cited work overcomes gaps, uses outputs from cited work, results with cited work, compares works with cited work, shortcomings in cited work, and issue related to cited work.

[2] addressed the problem of identifying the fragments of a citing sentence that are related to a target reference, which they called the reference scope. They approached context identification as a different problem because they considered citing sentences that cite multiple papers. They used and compared three methods: word classification with SVM, sequence labeling with CRF, and segment classification. Segment classification, executed by splitting the sentence into segments at punctuation marks and coordinating conjunctions, was the algorithm that achieved the best performance. They used a corpus formed by 19,000 NLP papers from the ACL Anthology Network (AAN) corpus.
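The splitting step of the best-performing method above can be sketched directly; the exact delimiter inventory used by [2] is an assumption here.

```python
import re

# Split at punctuation or at coordinating conjunctions (delimiter list is illustrative).
SPLIT_RE = re.compile(r"[,;:]|\s+(?:and|but|or|while|whereas)\s+")

def reference_scope_segments(citing_sentence):
    """Candidate segments of a citing sentence; each segment can then be
    classified as inside or outside the scope of a given target reference."""
    return [seg.strip() for seg in SPLIT_RE.split(citing_sentence) if seg.strip()]
```

Each resulting segment is then classified per target reference, which is what makes the approach robust for sentences citing multiple papers.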

[28] experimented with probabilistic inference to extract citation context, identifying even non-explicit citing sentences for citation-based summarization. They modeled sentences as a CRF to detect the context of data patterns, employed a Belief Propagation mechanism to detect likely context sentences, and used SVM to classify sentences as being inside or outside the context. The best results were obtained for a window in which each sentence is connected to 4 sentences on each side.

[34] implemented a context extraction method with a graphical interface in the form of global and within-specialty maps that show links labeled by sentiment for citation and co-citation contexts, to be analyzed in terms of sentiment and content. The corpora were processed using a bag of sample cue words that denote sentiment or function, counting their frequency of appearance. They used a subset of 20 papers from the citation summary data of the AAN [28].

4 Citation Purpose Classification and Citation Polarity Classification

The fields of citation purpose (function) classification and citation polarity classification are interesting research topics that have been explored in recent years. An automated solution to this problem was first developed by [36], applied to 116 articles randomly chosen from an annotated corpus from the Computation and Language E-Print Archive. They used a supervised machine learning technique to automatically annotate citation text, classifying 12 citation functions aggregated into four categories. This work did not include a subsequent classification of unlabeled text.

[25] used a supervised probabilistic classification model based on a dependency tree with CRFs to distinguish positive and negative polarity, applied to a dataset formed by four corpora for sentiment detection in Japanese.

[10] used semi-supervised automatic data annotation with an ensemble-style self-training algorithm. Classification was aspect-based, with Naïve Bayes as the main basic technique. Citations were classified as background, fundamental idea, technical basis or performance comparison, on randomly selected papers from the ACL ARC.

[4] implemented an aspect-based algorithm in which each citation had a feature set inside an SVM. The corpus was processed using WEKA, resulting in citation polarity classification.

[18] applied a faceted algorithm using the Moravcsik and Murugesan annotation scheme and the Stanford ME classifier. This approach uses a four-facet scheme and context lengths of 1, 2 and 3 sentences. The conceptual vs. operational facet asks “Is this an idea or a tool?”; the organic vs. perfunctory facet distinguishes the citations that form the foundations of the citing work from more superficial citations; the evolutionary vs. juxtapositional facet detects the relationship between the citing and cited papers (based on vs. alternative work); and the confirmative vs. negational facet captures the completeness and correctness of the cited work according to the citing work.

[24] implemented a hybrid algorithm that modeled discourse as a tree and analyzed POS tags with regular expressions to obtain citation relations of contrast and corroboration.

[9] obtained XML from PDF files using PDFX and then extracted citations by means of XSLT. They retrieved citation functions from the citation context with the CiTalO tool and two ontologies, CiTO2Wordnet and CiTOfunctions. Some of the positive, neutral and negative functions they used are: agrees with, cites, cites as authority, cites a data source, cites as evidence, cites as metadata document, cites as potential solution, cites as recommended reading, cites as related, confirms, corrects, critiques, and derides.

[20] used an ME-based system, trained on annotated data, to automatically classify functions into three categories linked to sentiment polarity: positive (agreement or compatibility), neutral (background), and negative (shows weakness of the cited work).

[16] implemented a citation extraction algorithm based on pattern matching. They detected polarity (positive, negative and neutral) with CiTalO, a software they developed using a combination of techniques for ontology learning from natural language, sentiment analysis, word sense disambiguation (with SVM), and ontology mapping. The corpus used was 18 papers about Web markup technologies published in the seventh volume of the Balisage Proceedings (2011).

[37] built an unsupervised bootstrapping algorithm for identifying concepts and categorizing them into two classes: Techniques and Applications.
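The general bootstrapping idea behind such concept extraction can be illustrated with a toy loop. This sketch is our own minimal reconstruction of the technique, not the algorithm of [37]: starting from seed concept terms, it learns cue words that precede known terms and then harvests new terms appearing after those cues.

```python
import re

def bootstrap_terms(sentences, seed_terms, rounds=2):
    """Minimal bootstrapping loop: alternately learn cue words that precede
    known terms, then harvest new terms that follow those cue words."""
    terms = set(seed_terms)
    cues = set()
    for _ in range(rounds):
        # Learn cues: the word immediately before any known term.
        for sent in sentences:
            for term in terms:
                for m in re.finditer(r"(\w+)\s+" + re.escape(term), sent):
                    cues.add(m.group(1).lower())
        # Harvest: the word immediately after any learned cue.
        for sent in sentences:
            for cue in cues:
                for m in re.finditer(r"\b" + re.escape(cue) + r"\s+(\w+)", sent, re.IGNORECASE):
                    terms.add(m.group(1))
    return terms
```

Real systems constrain the learned cues and score candidate terms to avoid semantic drift; this toy version shows only the alternation between pattern learning and term harvesting.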

[13] established a faceted classification of citation links in networks using SVM with linear kernels. They used three mutually exclusive categories: functional, perfunctory, and a class for ambiguous cases.

[1] used a trained classification model with an SVM classifier with a linear kernel, applied to the AAN dataset.

5 Discussion

We have seen that the most used classification tools are machine learning methods, followed by dictionary-based approaches. Hybrid methods, such as a combination of machine learning and rule-based algorithms, or of dictionary methods with NLP, are seldom implemented despite their good results in general sentiment analysis, probably because of their complexity.

In citation analysis, sentiment is represented in a binary way, as polarity, and not on discrete scales. Functions are categorized in diverse forms that can always be mapped to a polarity.
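The mapping from function categories to polarity can be expressed directly. The sketch below uses a few of the function labels listed by [9] in Section 4; the assignment of each label to a polarity class is our illustrative assumption, not a published mapping.

```python
# Illustrative mapping from citation-function labels (a subset of those in [9])
# to the binary-plus-neutral polarity scheme used across the surveyed systems.
FUNCTION_TO_POLARITY = {
    "agrees with": "positive",
    "confirms": "positive",
    "cites as evidence": "neutral",
    "cites as related": "neutral",
    "corrects": "negative",
    "critiques": "negative",
    "derides": "negative",
}

def polarity_of(function_label):
    """Collapse a fine-grained citation function into a polarity class."""
    return FUNCTION_TO_POLARITY.get(function_label, "neutral")
```

Because such a mapping is many-to-one, polarity results from different function taxonomies become comparable, which is one reason polarity remains the common denominator across systems.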

In citation context identification, reported precision varies from approximately 77% to 83%. In citation purpose and polarity analysis, precision varies greatly, with a wide range that goes from 26% to 91%. Nevertheless, the values obtained with the different methods cannot be directly compared because the experimental settings and evaluation methodologies diverge substantially among the studies.

In sentiment, polarity and function citation analysis there are open problems and interesting areas for future research such as:

  – Determination of the optimal size of context windows.
  – Detection of all references to a cited work, including those outside the context window, using NLP techniques and discourse analysis.
  – Detection of non-explicit citing sentences for citation sentiment analysis.
  – Development and application of domain-independent techniques.
  – Development and application of hybrid algorithms to obtain better performance.
  – Development of a common framework for experimental comparison among different techniques: algorithms, datasets, feature selection.
  – Facet analysis to classify the sentiment, function or polarity of separate aspects and functions of the same work.

6 Conclusions

In this paper we have presented a survey of sentiment, polarity and function analysis of citations.

Although work in this specific area has increased in recent years, there are still open problems that need to be investigated. There are not enough open corpora that can be shared among researchers, and there is no common framework that facilitates obtaining results that are comparable with each other, so as to reach conclusions about the efficiency of the different techniques. It is necessary to develop conditions that allow and motivate collaborative work in this field.

Bibliography

  • [1] Amjad Abu-Jbara, Jefferson Ezra, and Dragomir Radev. Purpose and polarity of citation: Towards NLP-based bibliometrics. In Proceedings of NAACL-HLT, pages 596–606, 2013.
  • [2] Amjad Abu-Jbara and Dragomir Radev. Reference scope identification in citing sentences. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 80–90, Stroudsburg, PA, USA, June 2012. Association for Computational Linguistics.
  • [3] M. A. Angrosh, Stephen Cranefield, and Nigel Stanger. Conditional random field based sentence context identification: enhancing citation services for the research community. In Proceedings of the First Australasian Web Conference, volume 144, pages 59–68, Adelaide, Australia, January 2013. Australian Computer Society, Inc.
  • [4] Awais Athar. Sentiment analysis of citations using sentence structure-based features. In Proceedings of the ACL 2011 Student Session, pages 81–87, Stroudsburg, PA, USA, June 2011. Association for Computational Linguistics.
  • [5] Awais Athar and Simone Teufel. Context-enhanced citation sentiment detection. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 597–601, Montreal, Canada, June 2012. Association for Computational Linguistics.
  • [6] Susan Bonzi. Characteristics of a Literature as Predictors of Relatedness Between Cited and Citing Works. Journal of the American Society for Information Science, 33(4):208–216, September 1982.
  • [7] Christine L Borgman and Jonathan Furner. Scholarly communication and bibliometrics. Annual Review of Information Science and Technology, 36(1):2–72, February 2005.
  • [8] Björn Brembs and Marcus Munafò. Deep Impact: Unintended consequences of journal rank. January 2013.
  • [9] Paolo Ciancarini, Angelo Di Iorio, Andrea Giovanni Nuzzolese, Silvio Peroni, and Fabio Vitali. Semantic Annotation of Scholarly Documents and Citations. In Matteo Baldoni, Cristina Baroglio, Guido Boella, and Roberto Micalizio, editors, AI*IA, volume 8249 of Lecture Notes in Computer Science, pages 336–347. Springer, 2013.
  • [10] Cailing Dong and Ulrich Schäfer. Ensemble-style Self-training on Citation Classification. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 623–631, Chiang Mai, 2011. Asian Federation of Natural Language Processing.
  • [11] Ferric C Fang, R Grant Steen, and Arturo Casadevall. Misconduct accounts for the majority of retracted scientific publications. Proceedings of the National Academy of Sciences of the United States of America, 109(42):17028–33, October 2012.
  • [12] E Garfield. Citation Analysis as a Tool in Journal Evaluation: Journals can be ranked by frequency and impact of citations for science policy studies. Science, 178(4060):471–479, November 1972.
  • [13] Han Xu and Eric Martin. Using Heterogeneous Features for Scientific Citation Classification. In Proceedings of the 13th Conference of the Pacific Association for Computational Linguistics. Pacific Association for Computational Linguistics, 2013.
  • [14] J E Hirsch. An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46):16569–72, November 2005.
  • [15] K. Hyland. Writing Without Conviction? Hedging in Science Research Articles. Applied Linguistics, 17(4):433–454, December 1996.
  • [16] Angelo Di Iorio, Andrea Giovanni Nuzzolese, and Silvio Peroni. Towards the Automatic Identification of the Nature of Citations. In SePublica, pages 63–74, 2013.
  • [17] Isaac G. Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, pages 661–667, Marrakech, Morocco, 2008. European Language Resources Association.
  • [18] Charles Jochim and Hinrich Schütze. Towards a Generic and Flexible Citation Classifier Based on a Faceted Classification Scheme. In Proceedings of COLING 2012, pages 1343–1358, 2012.
  • [19] Dain Kaplan, Ryu Iida, and Takenobu Tokunaga. Automatic extraction of citation contexts for research paper summarization: a coreference-chain based approach. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, pages 88–95, Suntec, Singapore, August 2009. Association for Computational Linguistics.
  • [20] Xiang Li, Yifan He, Adam Meyers, and Ralph Grishman. Towards Fine-grained Citation Function Classification. In RANLP, pages 402–407, 2013.
  • [21] B Liu and L Zhang. A survey of opinion mining and sentiment analysis. Mining Text Data, pages 415–463, 2012.
  • [22] T Luukkonen, O Persson, and G Sivertsen. Understanding Patterns of International Scientific Collaboration. Science, Technology & Human Values, 17(1):101–126, January 1992.
  • [23] Eve Marder, Helmut Kettenmann, and Sten Grillner. Impacting our young. Proceedings of the National Academy of Sciences of the United States of America, 107(50):21233, December 2010.
  • [24] Adam Meyers. Contrasting and Corroborating Citations in Journal Articles. Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 460–466, 2013.
  • [25] Tetsuji Nakagawa, Kentaro Inui, and Sadao Kurohashi. Dependency tree-based sentiment classification using CRFs with hidden variables. In Proceeding HLT ’10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 786–794, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
  • [26] Joshua M Nicholson and John P A Ioannidis. Research grants: Conform and be funded. Nature, 492(7427):34–6, December 2012.
  • [27] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab, November 1999.
  • [28] Dragomir R. Radev, Pradeep Muthukrishnan, and Vahed Qazvinian. The ACL Anthology Network corpus. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, pages 54–61, Suntec, Singapore, August 2009. Association for Computational Linguistics.
  • [29] Filippo Radicchi. In science “there is no bad publicity”: papers criticized in comments have high scientific impact. Scientific reports, 2:815, January 2012.
  • [30] Ian Sample. Nobel winner declares boycott of top science journals, September 2013.
  • [31] Michael Schreiber. A case study of the arbitrariness of the h-index and the highly-cited-publications indicator. Journal of Informetrics, 7(2):379–387, April 2013.
  • [32] Donald Siegel and Philippe Baveye. Battling the paper glut. Science (New York, N.Y.), 329(5998):1466, September 2010.
  • [33] Henry Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265–269, July 1973.
  • [34] Henry Small. Interpreting maps of science using citation context sentiments: a preliminary investigation. Scientometrics, 87(2):373–388, February 2011.
  • [35] Kazunari Sugiyama, Tarun Kumar, Min-Yen Kan, and Ramesh C Tripathi. Identifying citing sentences in research papers using supervised learning. In 2010 International Conference on Information Retrieval & Knowledge Management (CAMP), pages 67–72. IEEE, March 2010.
  • [36] Simone Teufel, Advaith Siddharthan, and Dan Tidhar. Automatic classification of citation function. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 103–110, July 2006.
  • [37] Chen-Tse Tsai, Gourab Kundu, and Dan Roth. Concept-based analysis of scientific literature. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management– CIKM ’13, pages 1733–1738, New York, New York, USA, October 2013. ACM Press.
  • [38] Richard Van Noorden. Brazilian citation scheme outed. Nature, 500(7464):510–1, August 2013.
  • [39] Mateja Verlic, Gregor Stiglic, Simon Kocbek, and Peter Kokol. Sentiment in Science – A Case Study of CBMS Contributions in Years 2003 to 2007. In 2008 21st IEEE International Symposium on Computer-Based Medical Systems, pages 138–143. IEEE, June 2008.
  • [40] Gregory D Webster, Peter K Jonason, and Tatiana O Schember. Hot topics and popular papers in evolutionary psychology: Analyses of title words and citation counts in Evolution and Human Behavior, 1979–2008. Evolutionary Psychology, 7(3):348–362, 2009.
  • [41] Howard D White and Katherine W McCain. Visualizing a discipline: An author co-citation analysis of information science, 1972–1995. Journal of the American Society for Information Science, 49(4):327–355, 1998.
  • [42] Neal S Young, John P A Ioannidis, and Omar Al-Ubaydli. Why current publication practices may distort science. PLoS medicine, 5(10):e201, October 2008.
  • [43] Guo Zhang, Ying Ding, and Staša Milojević. Citation content analysis (cca): A framework for syntactic and semantic analysis of citation content. November 2012.