Natural language processing is no free lunch

S. Wagner    University of Stuttgart, Stuttgart, Germany

Abstract

Today’s operating systems, with personal assistants such as Siri and Cortana, show the impressive progress natural language processing (NLP) has made. They make it seem as if all technical and methodological challenges of NLP have been solved. As many artefacts in software engineering are full of natural language, the possible applications are endless. As it turns out, however, using NLP is no free lunch. In this chapter, we offer a brief check on how, and how not, to apply NLP in software analytics.

Keywords

Natural language processing (NLP); Part-of-speech tagging; Topic modeling; Stemming; Level of abstraction; Clones


We recently applied NLP to the documentation of software systems. Our starting point was the observation that tools such as JavaDoc and Doxygen allow software developers to document source code in a structured and versatile way. This has led to a considerable amount of documentation in today’s programs, especially of interfaces. The documentation focuses, however, on the level of functions/methods and classes. Documentation on the component level is often missing.

So why can’t we use the lower-level documentation to generate the component documentation? One problem is that most of this documentation is written in natural language (apart from some annotations for authors or return values). How can we decide which parts of the class and method comments are important for describing the component they belong to?

I decided to team up with an NLP expert and colleague, Sebastian Padó, and to explore this issue in a master’s thesis [1]. While we could apply topic modeling to the comments on the class level and generate meaningful topics, the practical usefulness of the results is unclear. For example, for the Java package java.io, the most probable topic it produced was: “buffer, stream, byte, read, method, field, data, write, output, class, serial, input, written…”. This gives an idea of what the package is about, yet it cannot replace a description created by a human.

Let us discuss what it takes to create such results and how we could improve them.

Natural Language Data in Software Projects

Most data we deal with in software projects is textual in some form. The central artefact in a software project is the source code. It is textual, but written in a formal language, and its analysis has been studied thoroughly. Beyond the source code, however, we find a lot of natural language data in today’s software projects. For example, we have textual documentation for the user or of the software architecture. We also find textual data in commit messages and issue descriptions. Even in the source code, as we saw before, we have natural language in the comments. We need to make more use of this rich source of information to support developers and guide projects. Yet, how can we deal with data that is as fuzzy as natural language?

Natural Language Processing

NLP has made huge progress over the last decades. NLP researchers have developed a wide range of algorithms and tools to deal with large text corpora and give various insights into the meaning of natural language texts. For example, part-of-speech tagging gives you the grammatical use of each word (verb, noun, or determiner), and topic modeling extracts the most probable topics for documents in a text corpus. Access to these research results is easy, as there are several great textbooks (eg, [2]) and open-source libraries (eg, from the Stanford NLP group: http://nlp.stanford.edu/software/) available. They provide us with the means to analyse the natural language data from our software projects.
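To make this concrete, here is a minimal sketch of part-of-speech tagging, using the Python library NLTK as one alternative to the Stanford tools mentioned above; the example sentence is made up:

# Minimal part-of-speech tagging sketch using NLTK.
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

comment = "Reads the next byte of data from the input stream."
tokens = nltk.word_tokenize(comment)
print(nltk.pos_tag(tokens))
# Prints (token, tag) pairs, eg, ('byte', 'NN'), ('stream', 'NN'), ...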

How to Apply NLP to Software Projects

But how are we going to apply all that in a software project? The first thing to do is to understand the algorithms used and tune them so that they fit the problem. Binkley et al. [3], for example, provide insights and an interesting discussion on how to tune the parameters of the topic modeling algorithm latent Dirichlet allocation (LDA) in the context of software artefacts. Yet, as the natural language data, as well as the goals of its analysis, are diverse and specific to your context, I will focus on four further good practices to follow in general.
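As an illustration of what such tuning looks like, the following sketch runs LDA over a few invented JavaDoc-style comments using the Python library gensim; num_topics, passes, and alpha are exactly the kind of parameters that need tuning:

# Sketch: topic modeling of (invented) JavaDoc-style comments with gensim's LDA.
from gensim import corpora, models

comments = [
    "reads the next byte of data from the input stream",
    "writes the specified byte to the output stream",
    "returns the number of bytes that can be read from the buffer",
]
texts = [c.split() for c in comments]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=2,      # tune: number of topics
    passes=10,         # tune: training passes over the corpus
    alpha="auto",      # tune: document-topic density
)
for topic_id, terms in lda.print_topics():
    print(topic_id, terms)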

Do Stemming First

The first step in any analysis is extracting and cleaning the data. This is also the case with natural language data. Often, we need to extract it from Microsoft Word documents or PDFs. As soon as we have the plain text data, it is usually a good idea to use a stemmer. Stemming is the process of removing morphological and inflexional endings from words. This allows an easier analysis, as all words with the same word stem can be treated equally. For example, in most cases I don’t care about the difference between “read” and “reading”; in the topic modeling example, both should lead to the same topic.
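A minimal sketch of stemming with NLTK’s implementation of the Porter stemmer; the word list is just for illustration:

# Sketch: stemming JavaDoc-style terms with the Porter stemmer from NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["read", "reading", "reads", "written", "writing"]:
    print(word, "->", stemmer.stem(word))
# read -> read, reading -> read, reads -> read,
# written -> written (irregular forms are not handled), writing -> write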

In my experience, a simple stemming algorithm such as the one by Porter [4], as implemented in the Stanford NLP library mentioned above, is sufficient for English texts. In some applications and for other languages, stemming might not be enough. For those cases, you should look into lemmatization, which employs dictionaries and morphological analysis to return the base form of a word. More details can be found, for example, in Manning et al. [2].
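For comparison, here is a minimal lemmatization sketch using NLTK’s WordNet-based lemmatizer, which also handles irregular forms such as “written”; the part-of-speech hint "v" (verb) is needed for good results:

# Sketch: lemmatization with NLTK's WordNet lemmatizer,
# which maps irregular forms to their base form.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time download of the dictionary

lemmatizer = WordNetLemmatizer()
for word in ["reading", "written", "wrote"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# reading -> read, written -> write, wrote -> write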

Check the Level of Abstraction

As natural language texts do not follow a formal grammar and do not have formal semantics, the level of abstraction can vary widely between text documents, even if we look at the same type of artefact. This became very apparent when we applied clone detection to natural language requirements specifications [5]. The intention was to find parts of the specifications that had been created by copy and paste. We found that the level of cloning varied enormously, from several specifications with almost no clones up to specifications with more than 50% cloned text. The main reason was the level of detail in which the specifications described the system. The specifications with high cloning described very concrete details, such as messages on a network. The specifications with low cloning described the functionality in an abstract way. It is probably not a good idea to analyse such different textual data further in the same way. So be sure you know what you are looking for. It might be helpful to cluster artefacts even if they are of the same type.

For example, we might be interested in whether the topics used in our requirements specifications differ. This could give us hints about whether we usually specify similar aspects. Yet, as we just saw, the levels of abstraction differ considerably between typical requirements specifications. An analysis of the topics across all of those specifications might not be very insightful. Using the degree of cloning to cluster them with a clustering algorithm such as k-means, however, could help us group them into useful levels of abstraction. Within each of these clusters, the topics are probably more homogeneous. For example, specifications with a low level of abstraction might include topics such as protocols, messages, and hardware platforms, while high-level specifications might talk more about use cases, features, and interactions.
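A minimal sketch of such a clustering with scikit-learn, assuming we have already measured the clone coverage (the fraction of cloned text) of each specification; the numbers are invented:

# Sketch: grouping requirements specifications by their degree of cloning
# with k-means (clone coverage values are made up for illustration).
import numpy as np
from sklearn.cluster import KMeans

clone_coverage = np.array([0.02, 0.05, 0.08, 0.35, 0.51, 0.60]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(clone_coverage)
print(kmeans.labels_)           # cluster assignment per specification
print(kmeans.cluster_centers_)  # eg, one "abstract" and one "detailed" cluster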

Besides using a clustering algorithm, it can also be a good idea to apply manual analysis here (see also Section “Don’t Discard Manual Analysis of Textual Data”, which follows). An experienced requirements engineer is probably able to quickly cluster the specifications along their level of abstraction. While this takes more effort, it would be a more direct analysis of abstraction than the degree of cloning and could provide further insights as well.

Don’t Expect Magic

Even though the methods, algorithms, and tools in NLP are of impressive quality today, don’t expect that everything can be used out of the box and will provide perfect results. It is still necessary to try alternative algorithms and tune them during the analysis. More importantly, however, the algorithms can only give you results as good as the text they work on. We found in the master’s thesis [1] that we can generate topics with useful terms, but that the results depend strongly on the quality of the JavaDoc comments. The analysis of the official Java library worked well, as those classes are well documented. The analysis of other open-source code with fewer comments, however, yielded less usable topics. Being able to provide such additional uses of comments might encourage better commenting and create a self-reinforcing loop. Yet, in practice we often find missing, outdated, or sparse documentation. Hence, don’t expect too much up front, but also don’t be discouraged: continued analysis could pay off in the longer term.

Don’t Discard Manual Analysis of Textual Data

So what can we do if the automatic NLP analysis does not give us fully satisfactory results? I believe it is often helpful to keep the human in the loop. This could mean a fully qualitative analysis of the natural language data [6] to create a complete human interpretation. Great insights can come from such an analysis. Humans can make connections based on their own knowledge and experience that are not available to computers. Furthermore, they are able to formulate the results in a way that is easily accessible to other humans. Yet, such an analysis takes a lot of effort. A middle ground could be to use manual feedback early and often [7]. We applied topic modeling to user stories to classify them. Initially, our automatic analysis had rather poor precision, but two feedback rounds with practitioners increased it to 90%. Hence, I believe such a combination of automatic analysis with manual feedback could be beneficial in many contexts.

So what does a systematic manual analysis look like? Generally, we apply well-proven techniques of qualitative analysis from areas such as sociology. In its most basic form, this means coding the textual data: you attach a code, or tag, to a piece of the text to be analysed. This piece can be a single word, a sentence, or a whole paragraph. For example, we have analysed part of a survey on problems in requirements engineering with such methods [6]. Practitioners gave free-text answers on which problems they experience in requirements engineering, how these problems manifest in their development process, and what they consider to be their causes.

First, we added codes to pieces of text describing each of these aspects (problem, cause, and effect) with very little abstraction. Through several reviews and revisions, we then grouped the low-level codes into more abstract ones to arrive at more general problems, causes, and effects. For example, we coded the answer “The communication to customer is done not by technicians, but by lawyers.” as the problem Weak Communication. This gave us a deep understanding of the answers that an automated technique would hardly be able to achieve. Yet, the manual analysis must be accompanied by continuous discussion and reviews by other people to avoid too much subjectivity. Furthermore, one has to weigh the depth of insight against the large amount of effort necessary.
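As a simple illustration of how such codes can be recorded and grouped, here is a sketch using plain Python data structures; the second answer and the low-level codes are invented for illustration:

# Sketch: recording qualitative codes for free-text survey answers and
# grouping low-level codes under more abstract ones (examples are shortened).
from collections import defaultdict

coded_answers = [
    ("The communication to customer is done not by technicians, but by lawyers.",
     "communication via intermediaries"),
    ("Requirements are only discussed at the start of the project.",
     "infrequent communication"),
]

# Mapping from low-level codes to more abstract problem categories,
# created and revised manually in review rounds.
abstraction = {
    "communication via intermediaries": "Weak Communication",
    "infrequent communication": "Weak Communication",
}

grouped = defaultdict(list)
for answer, code in coded_answers:
    grouped[abstraction[code]].append(answer)

for problem, answers in grouped.items():
    print(problem, ":", len(answers), "answers")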

Summary

Natural language data is omnipresent in software projects and contains rich information. NLP provides us with interesting tools to analyse this data fully automatically. Yet, as always, there is no free lunch. The textual data must be extracted and cleaned, potentially clustered according to the level of abstraction, and the analysis often has to be complemented with human analysis and feedback to be practically useful. Nevertheless, I am certain we will see much more interesting research in this area in the future.

References

[1] Strobel PH. Automatische Zusammenfassung von Quellcode-Kommentaren [Automatic summarization of source code comments]. MSc thesis. University of Stuttgart; 2015.

[2] Manning C., Raghavan P., Schütze H. Introduction to information retrieval. New York: Cambridge University Press; 2008.

[3] Binkley D., Heinz D., Lawrie D.J., Overfelt J. Understanding LDA in source code analysis. In: Proc. international conference on program comprehension (ICPC 2014); ACM; 2014:26–36.

[4] Porter M.F. An algorithm for suffix stripping. Program. 1980;14(3):130–137.

[5] Juergens E., Deissenboeck F., Feilkas M., Hummel B., Schätz B., Wagner S., et al. Can clone detection support quality assessment of requirements specifications? In: Proc. 32nd international conference on software engineering (ICSE’10); ACM; 2010.

[6] Wagner S., Méndez Fernández D. Analysing text in software projects. In: Bird C., Menzies T., Zimmermann T., eds. Art and science of analysing software data. Waltham: Morgan Kaufmann; 2015.

[7] Vetrò A., Ognawala S., Méndez Fernández D., Wagner S. Fast feedback cycles in empirical software engineering research. In: Proc. 37th international conference on software engineering (ICSE’15); IEEE; 2015.
