Open from the beginning

G. Gousios, Radboud University Nijmegen, Nijmegen, The Netherlands

Abstract

In research, we are obsessed with open access. We take extra steps to make our papers available to the public, we spend extra time producing preprints, technical reports, and blog posts to make our research accessible, and we lobby noncollaborating publishers to play along. We are not so zealous with the artifacts that comprise our research; source code, data, and documentation are treated as second-class citizens that nobody publishes and nobody wants to have a look at.

Keywords

Open access; Alitheia core; GHTorrent; Openness

The problem in this business isn’t to keep people from stealing your ideas; it is making them steal your ideas.

Howard H. Aiken

In research, we are obsessed with open access. We take extra steps to make our papers available to the public, we spend extra time producing preprints, technical reports, and blog posts to make our research accessible, and we lobby noncollaborating publishers to play along. We are not so zealous with the artifacts that comprise our research; source code, data, and documentation are treated as second-class citizens that nobody either publishes or wants to have a look at.

I believe this is a grave mistake and leads to missed opportunities in increasing the impact of our research.

But why is open access to all research artifacts important for software data scientists? Let me share with you two stories.

Alitheia Core

My involvement with empirical software engineering started in 2005. My PhD supervisor wanted to create a plot of the Maintainability Index metric over the whole lifetime of the FreeBSD project to include in his book. The initial version was a hacky shell-and-Perl-script solution, similar to what most repository miners came up with at the time. After creating the plot, he thought that it would be interesting to have a tool that could analyze the full lifetime of projects through repository mining and combine metrics at will. This gave birth to the Alitheia Core project, in which a joint group of about 15 engineers and researchers set out to build a software analysis platform that would allow anyone to submit a project repository for analysis.

What we came up with was a rather sophisticated repository mining tool, Alitheia Core, which was composed of analysis plug-ins and offered a wealth of services, such as parsers, automatic parallelization, and even cluster operation. Alitheia Core was built in Java, using the latest and greatest technologies of the time, e.g., object-relational mapping for database access and REST for its web APIs. It also featured no fewer than two web interfaces and an Eclipse plug-in. When it was announced, in mid-2008, it was probably the most technologically advanced repository mining tool available. Along with Alitheia Core, we also delivered a curated dataset of about 750 OSS repositories, including some of the biggest available at the time. After the end of the project, we offered all source code and datasets to the software analytics community.

In numbers, the project looked like a resounding success: around 20 papers were published, four PhD students wrote their dissertations based on it, and, more importantly, we could do studies with one or two orders of magnitude more data than the average study at the time. Unfortunately, though, outside of the project consortium, Alitheia Core had limited impact. From what we know, only one external user cared to install it and only two researchers managed to produce a publication using it. By any account, Alitheia Core was impressive technology, but not a successful project.

GHTorrent

Fast forward to mid-2011; GitHub’s popularity had begun to skyrocket. They had just made version 2 of their API available, and I thought that this was my chance to finally teach myself some Ruby and some distributed systems programming. I went ahead and wrote scripts that monitored GitHub’s event timeline and parsed the events into two databases, MySQL and MongoDB. The monitoring of the event timeline and the parsing were decoupled through a queue server, so we could have multiple monitors and parsers working on a cluster. While all this might sound interesting from an architectural viewpoint, the initial implementation was rather uninteresting, technology-wise: just a couple of scripts that, in a loop, would poll a queue and then update two databases by recursively retrieving information from the web. The scripts were released as open source software on GitHub in November 2011, while data dumps of both databases were offered through BitTorrent. This marked the birth of the GHTorrent project.
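To make the retrieval/parsing split concrete, here is a minimal sketch in Ruby, and emphatically not the actual GHTorrent code: one thread polls the public GitHub event timeline (the /events endpoint of the REST API), pushes the raw events onto an in-process queue that stands in for the external queue server, and a worker thread pops them and hands them to placeholder routines (store_raw and store_relational, names invented here) that stand in for the MongoDB and MySQL writers.

    # A minimal sketch of the decoupling described above: the in-process queue
    # stands in for the external queue server, and the two store_* methods are
    # placeholders for the MongoDB (raw JSON) and MySQL (relational) writers.
    require 'net/http'
    require 'json'

    queue = Thread::Queue.new

    # Retrieval side: fetch one page of the public event timeline and
    # enqueue each event.
    producer = Thread.new do
      events = JSON.parse(Net::HTTP.get(URI('https://api.github.com/events')))
      events.each { |event| queue << event }
      queue << nil # signal the end of this polling round
    end

    # Placeholder writers (hypothetical names, printing instead of storing).
    def store_raw(event)
      puts "raw: #{event['id']}"
    end

    def store_relational(event)
      puts "rel: #{event['type']}"
    end

    # Parsing side: dequeue events and hand them to both storage back ends.
    consumer = Thread.new do
      while (event = queue.pop)
        store_raw(event)
        store_relational(event)
      end
    end

    [producer, consumer].each(&:join)

In the real setup the queue was a separate server, so monitors and parsers could run as independent processes on different machines; the sketch only shows the shape of the loop.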

A paper describing the approach and the initial implementation was sent to the 2012 Mining Software Repositories conference, but it failed to impress the reviewers much. “The work is messy and often confuses the reader in terms of what have they done and how they have done [sic],” one reviewer wrote. “The experiment seems foley, results somewhat lacking [sic],” another reviewer added. The paper also included some slightly embarrassing plots, e.g., ones featuring holes in the data collection, due to “a bug in the event mirroring script, which manifested in both event retrieval nodes.” Nevertheless, the paper was accepted.

Shortly after the conference, something incredible happened: I witnessed a client connecting to our BitTorrent server. Not only did it connect, it also downloaded the full data dump of the MySQL dataset! This marked the first external user of GHTorrent, merely days after the end of the conference where we presented it. The paper that resulted from this download was published in early 2013, even before my second GHTorrent paper. This motivated me to take GHTorrent more seriously; I worked together with the initial users to fix any issues they had and was prompt in answering their questions. On the technical side, we (students, the community, and I) implemented services to access an almost-live version of both databases, a dataset slicer, and various interactive analyses and visualizations.

Since its public availability, GHTorrent has grown a lot: as of this writing (January 2016), it hosts more than 9.5 TB of data. Vasilescu’s paper marked the beginning of a high-speed (by academic standards) uptake of GHTorrent: more than 60 papers have been written using it (one third of all GitHub studies, according to one paper), while at least 160 users have registered for the online query facilities. GitHub themselves proposed GHTorrent as a potential dataset for their third annual data challenge, while Microsoft funded it to run on Azure.

Why the Difference?

The relative indifference that the research community reserved for Alitheia Core is in stark contrast with the fast uptake of GHTorrent. But why? Why did the community embrace a hacky, obviously incomplete and, in some cases, downright erroneous dataset while it almost ignored a polished, complete, and fully documented research platform? Let us consider some potential reasons:

• GitHub is hot as a research target: This is obviously true. But when Alitheia Core was done, SourceForge was also very hot as a research dataset, as evidenced by the number of projects that targeted it. Moreover, the unaffiliated GitHub Archive project offers a more easily accessible version of a subset of GHTorrent, so researchers could have just used that and ignored GHTorrent altogether.

• The “not invented here” syndrome: Researchers (and, to a lesser extent, practitioners) are very reluctant to use each other’s code. This is not entirely unfounded: code, especially research code, is prone to issues that may be hard to debug. Researchers also know that dealing with inconsistencies in data is even worse; still, they flocked to GHTorrent.

• Ad hoc solutions work best: Data analytics is a trial-and-error process; researchers need to iterate fast, and an interactive environment coupled with a database makes them more productive than a rigid, all-encompassing platform.

While there is some truth in each of the points above, I believe that the main reason is openness.

Alitheia Core was developed using an opaque process that the project members certainly enjoyed, but one that was obviously not in touch with what users wanted. GHTorrent, on the other hand, grew together with its users. As any entrepreneur or innovation expert will attest, it is very hard for an innovative product or service to be adopted by a community; adoption is much easier if the product grows organically with that community. Moreover, it is extremely difficult to dazzle users with feature lists (except perhaps if you are Apple): users, especially tech-savvy ones, put a high value on construction transparency and compatibility with their work habits.

Be Open or Be Irrelevant

To me, the difference is a clear win for the open source process and its application to research. Open access to all research artifacts from the very beginning can only be a good thing: attracting collaborators, spreading research results, replication, and the advancement of science in general are all things for which open access is a prerequisite. There is not much at risk, either: the only thing we risk with open access is that our research will be ignored, and if it is, that already may say something about its relevance.

If I have learned anything from this experience, it is the following three things:

• Offer a minimum viable product: Offer the smallest piece of functionality that makes sense and let people create by building on top of it. This can be data plus some code, or just data. Make it easy for others to build on what you offer: ensure that your tools are easy to install and that your data is well documented. Be frank in the documentation; accurately state the limitations and the gotchas of your code and data.

• Infrastructures are overrated: The effort required to learn how infrastructure code works should not be overlooked; it must be repaid with correspondingly big gains, and there is always the risk of deprecation. This is why only very few infrastructure tools survive the test of openness. The Unix experience should be our guide: make tools and services that do one thing well, accept text, and return text.

• Open now trumps open when it is done: Finally, and most importantly, no one is going to wait for you to perfect what you are doing. Opening your research up early on is not a sign of sloppiness; it is a sign of trust in your work and an acknowledgment that done is better than perfect. Don’t be afraid that someone will steal your idea: if someone invests time in stealing it, it is a great idea and you have a head start. Open access is an absolute must for the adoption and wide dissemination of research results, and it should happen as early as possible.

To cut a long story short: make your research open from the beginning!

References

[1] Gousios G., Spinellis D. Conducting quantitative software engineering studies with Alitheia Core. Empir Softw Eng. 2014;19(4):885–925.

[2] Gousios G. The GHTorrent dataset and tool suite. In: Proceedings of the 10th working conference on mining software repositories; 2013:233–236.

[3] Vasilescu B., Filkov V., Serebrenik A. StackOverflow and GitHub: associations between software development and crowdsourced knowledge. In: 2013 ASE/IEEE international conference on social computing; IEEE; 2013:188–195.
