Case Study 4

Developing a portal framework for humanities scholars

Joan A. Smith

Abstract.

This study focuses on the decision-making points encountered through the various stages of the project, emphasising throughout the importance and necessity of long-term sustainability.

Keywords

community

development tools

Emory University Library

humanities

portal

Southcomb

sustainability

Introduction

This case study concerns a three-year grant-funded investigation1 conducted by Emory University Library to explore inter-institutional scholarly portal services supporting research in the humanities. Officially titled A Cyberinfrastructure for Scholars, a key goal of the research project was the development of a suite of software tools to create and maintain humanities-oriented search portals.2 A portal is a user-configurable software interface that is typically used to integrate several tools together into one dashboard-like screen. An example of a popular portal is iGoogle,3 which has numerous widgets (small applications) that can be custom-arranged into a screen layout that makes sense for the individual user (see Figure CS4.1). Another important feature of a portal is that other widgets can be built by community members at large. As such, it acts as a framework for the community to use and extend, evolving to meet the community’s changing needs.


Figure CS4.1 An iGoogle portal page configured with a selection of widgets

The project team created a new humanities portal called ‘SouthComb’, designed to provide a comprehensive and faceted search across scholarly information sources. The tools that were developed in support of the SouthComb portal enabled harvesting, automatic classification and meta-searching for information originating across multiple resources including the World Wide Web, OAI-compliant Open Archives, various library catalogues and other digital information collections. Even though the source code for the portal was released as an open source project under Google Code,4 the project did not attract sufficient community support to become self-sustaining and the principal website5 has been in a static state since 2009. From an economic perspective, the institution felt there was insufficient return on investment (ROI) to maintain the software in-house, and website usage was too low to justify committing manpower to ongoing content development.

Project objectives (mission)

The project’s primary mission was to improve the research process for humanities scholars by providing better technology tools and implementing a compelling, working example. Three objectives were established to achieve the mission:

1. Build a sustainable combined search portal service. This online community portal for the interdisciplinary field of Southern Studies includes a combined resource search system and various participatory services, and has been proposed as a low-cost subscription-based service. The key features demanded by scholars were embedded in the software developed during the project, which was released as open source via Google Code’s free project hosting service. With a rich feature set in high demand by customers, the product could be sustainable through a modest mixed-revenue stream.

2. Improve networked access to humanities collections of the US South. The use of technology in humanities research is inconsistent, even at institutions like Emory University, which has a well-established reputation in this area. In part the problem has been the accessibility of materials via the usual discovery channels (web and catalogue searches). A key initiative of the portal project was improving the available tools and techniques for exposing, organising and discovering humanities collections. As a result, these features had the highest priority when planning the portal’s development.

3. Explore sustainable models for the advancement of scholarly Cyberinfrastructure. Sustainability is a recognised issue for academic research projects, particularly those in the humanities. Our intent was to creatively address the sustainability challenges of our own production system, thereby advancing digital library project sustainability more broadly.

As the iGoogle example (Figure CS4.1) illustrates, users can add, delete and arrange widgets in a way that makes personal, visual sense. The goal of the Cyberinfrastructure project was to create a portal for humanities scholars that would aggregate research resources (widgets) from across a broad spectrum of information sources.

As part of the project’s mission, the Principal Investigators also sought answers to a number of questions common to research libraries today:

■ Are there useful categories of services and functions that can serve to organise thinking about such projects during initial planning efforts?

■ Can we identify effective strategies for institutionalising technology-based research services for long-term operations?

■ Can we define a generalisable process for inter-institutional development of a complex engineering project?

These questions are especially applicable to humanities research, where there have so far been few tools that leverage technology’s capabilities. Answers to these questions could help establish principles of practice applicable to the rapidly evolving digital humanities research setting.

Building the portal (experiences to date)

Technology and engineering considerations

The technical work of our project went through several different implementation phases, beginning with software evaluation, moving to prototype implementations and finally transitioning to a production phase during which we launched several staged releases of the SouthComb service. The technical requirements were established during the first year of the project and included the following core goals:

1. Be sustainable

2. Be easily manageable

3. Be easily reusable

4. Harvest OAI

5. Harvest web materials

6. Automatically classify records

7. Conduct meta-searching

8. Provide organised and uniform access to records

Goals 4 to 7 are geared toward the user community, intending to provide a ‘value-added’ component that would ideally drive up the user base, amortising the cost of overall development among a large number of users. Success in this dimension depends upon correctly targeting user needs and building a strong community. In contrast, the first three goals relate directly to the economic feasibility of any project, and can be significantly impacted by engineering decisions. For example, obscure development tools or environments can require a longer familiarisation timeline for new members coming into the development team. On the other hand, if the existing engineering staff come from non-traditional backgrounds, choosing enterprise-class technologies can extend development time beyond the funding lifecycle and greatly complicate long-term maintenance.
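Goal 4 (harvest OAI) follows the OAI-PMH protocol, in which a repository returns batches of Dublin Core records in response to verbs such as ListRecords. The following is a minimal sketch of that kind of record parsing using only the Python standard library; the sample response and record values are hypothetical, and a real harvester would also issue the HTTP requests and follow resumptionTokens across batches:

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OAI-PMH and Dublin Core specifications.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# A hypothetical, truncated ListRecords response for illustration only.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:rec1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Southern Studies Quarterly</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Extract (identifier, title) pairs from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        identifier = rec.find(OAI + "header/" + OAI + "identifier").text
        title = rec.find(".//" + DC + "title").text
        records.append((identifier, title))
    return records

print(parse_list_records(SAMPLE_RESPONSE))
# [('oai:example.org:rec1', 'Southern Studies Quarterly')]
```

Even this small example shows why goal 8 (organised, uniform access) matters: records harvested from different repositories arrive under the same Dublin Core fields and can be normalised into one index.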

The portal team evaluated a variety of development environments and software tools (including crawlers and classifiers) that might contribute to achieving the goals. To improve overall productivity, we explored different development processes and methodologies, notably Waterfall and Agile, revising our original concept of a distributed engineering team in favour of a single in-house team to minimise communication complexity. We also sought to reuse software from our prior projects, but those components proved unsatisfactory, in part because they had not been designed for a production-level environment.

Numerous engineering challenges arose throughout the course of product development. We prototyped the portal with popular products like jBoss, Jetspeed and uPortal, but found the implementations bulky, requiring areas of expertise not well aligned with those of our engineering staff. As a result, we went through several engineering restarts. Another consideration was that we wanted to use open source products, particularly those that did not have a licence fee. Eventually, it became clear that we needed to build the environment using established tools (Lucene, SOLR and others) rather than inventing our own, and to develop the portal using lightweight frameworks like Ruby on Rails, which supports rapid application development. We also switched database engines, migrating from MySQL to PostgreSQL in order to improve XML integration. The final beta product was composed of five core elements:

■ a front-end for the user application built with Rails;

■ an administrative front-end to activate harvests, which was also built with Rails;

■ the Java-based harvester to crawl and harvest content;

■ a PostgreSQL database to store the content;

■ the Lucene/Solr tools to perform searches and content indexing.
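To illustrate how a front-end can drive the Lucene/Solr element for faceted search, the sketch below builds a query URL using standard Solr select-handler parameters (q, rows, wt, facet, facet.field). The base URL, core name and field names are hypothetical, not taken from the SouthComb code:

```python
from urllib.parse import urlencode

def solr_search_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr /select URL with optional field faceting.

    Uses standard Solr query parameters; repeated facet.field entries
    request facet counts for each named field.
    """
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))
    return base_url + "/select?" + urlencode(params)

# Hypothetical core and field names for illustration.
url = solr_search_url("http://localhost:8983/solr/southcomb",
                      "civil rights", facet_fields=("collection", "subject"))
print(url)
```

Keeping search behind one small helper like this is also a sustainability measure: if the index engine changes, only the helper needs rewriting, not the Rails front-ends that call it.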

Project staffing and management

The project experienced many changes in staffing during the three-plus years of its existence, including a change in Principal Investigator during the final months. There were numerous technical staff changes as well. For example, a sequence of ten engineers worked on the code for periods ranging from five to 18 months, averaging less than one year per programmer. Every funded project role experienced at least one turnover during the performance period, so that by the end of the grant period none of the original (first-year) project members remained. Given the typical approach to academic research projects, in which the lion’s share of the work is accomplished by students, this is not unusual. Nonetheless, costs and sustainability are greatly impacted by so much turnover, due in part to the time required for new team members to become productive on the project and to the loss of ‘project memory’ with each team member’s departure.

These were not the only major changes in project participants. We initially conducted user testing with Emory faculty, staff and students, all of whom volunteered their time. Numerous issues with the portal led us to shift user testing from this more random group to a set of targeted users, many from within our own library community, who could give more specific feedback on the portal. Community contributors carved time from their regular duties, which is yet another uncounted financial investment in the portal. The team retained an advisory board whose time was also unpaid, and held several conferences where scholars from other institutions provided helpful insights and suggestions for the project. All of this participatory effort represents a significant, unpriced investment in the project that does not appear on the balance sheet.

Deploying the portal

As each beta version of the product was deployed, performance issues and software problems were uncovered that needed to be addressed. Because portal components were built on different languages (notably Java and Rails) and depended on a variety of open source packages (Lucene, SOLR, Heritrix, Apache Web Server and others), deploying each release was a complex task of dependency analysis and version-conflict resolution. The lack of dedicated performance experts together with project code complexity made it harder to evaluate the source of performance bottlenecks exposed during beta user tests. In contrast, commercial groups typically invest a considerable proportion of the overall development cycle into deployment planning before the first release is in beta. Server capacity, performance tuning, security considerations and quality assurance are all exercised by performance engineering experts to determine the optimal configuration of the deployed product’s environment, particularly if the company will host the software for its users (as we were doing for the SouthComb portal).

Continuous feature and user interface (UI) changes added complications, in part because with each UI change a new software development tool was added and/or the overall development environment was fundamentally modified. This practice adds to the cost of software development because the engineers need time to adapt to the changes and to determine whether there are cascading effects, that is, whether other parts of the software will be impacted by the change or the deployment server hardware will need upgrading.

This last aspect, the target deployment server(s), has economic implications that may be overlooked in the academic library. Server costs are typically expressed as one-time costs (X dollars to buy server brand ABC) with other considerations embedded in routine institution operating costs. For example, network bandwidth and power use add a predictable amount to the lifecycle cost of a server and are typically aggregated at the departmental budget level into a network provider monthly fee and the building’s total electricity bill. But the real economic impact of each server is in the ongoing system administration time that it will require and consequences arising from such administration. For each deployed machine, system engineers need to apply operating system and software product patches, perform backup operations, manage user access and accounts, and – perhaps most importantly – monitor overall system security through the machine’s logs, patterns of access and other security vulnerabilities. In many cases, addressing a security issue can create a problem with underlying software. One example seen in our institution’s experience was the transition from PHP’s6 version 4.x series to the 5.x series where code written for PHP 4.x (which had a critical security issue) failed when the server was upgraded to the PHP 5.x version. Engineering costs accrue in the process of either reworking the broken code, rewriting the product from scratch or archiving (abandoning) the product altogether, but these are seldom accounted for in the budget.
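The lifecycle costs described above can be made concrete with a back-of-the-envelope calculation. All figures in this sketch are hypothetical and purely illustrative; the point it makes is that amortised hardware is often dwarfed by recurring system administration time:

```python
def annual_server_cost(hardware_cost, lifespan_years,
                       power_and_bandwidth, admin_hours_per_month,
                       admin_hourly_rate):
    """Rough annual cost of one deployed server.

    Hardware is amortised over its expected lifespan; system
    administration time (patching, backups, account management,
    log and security monitoring) is treated as a recurring cost.
    """
    amortised_hardware = hardware_cost / lifespan_years
    admin = admin_hours_per_month * 12 * admin_hourly_rate
    return amortised_hardware + power_and_bandwidth + admin

# Illustrative figures: a $4,500 server amortised over 3 years,
# $600/year for power and bandwidth, 6 admin hours/month at $50/hour.
cost = annual_server_cost(4500, 3, 600, 6, 50)
print(cost)  # 5700.0
```

In this (invented) example the one-time purchase price accounts for barely a quarter of the annual cost, which is exactly the portion that departmental budgets tend to capture.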

Lessons learned

The project goals (as listed earlier) were at least superficially achieved with the deployment of the SouthComb web portal in 2007–8. We built a feature-rich, combined-search portal service, improved network access to various humanities collections of the southern United States and examined issues pertaining to sustaining a digital library portal. In retrospect, however, we did not fully meet the technology goals we set for the project, which contributed to a short ‘shelf-life’ for the portal. That is, the portal has not been maintained beyond its initial production version nor has a user community emerged to demand a new release. We learned several lessons from this experience that helped with subsequent projects:

■ True cost estimation. According to our financial report to the grant agency, we were less than 1 per cent over our project budget at the conclusion of the award period. That figure, however, masks hidden costs: a lot of programming time was lost in the search for the optimal development environment, which equates to an unplanned increase in product development cost. We rewrote key portions of the code each time we switched our development tool set, and in some cases had to resolve bugs occurring from incompatible tool set versions.

■ Process management. We erred in not matching our engineering process to our development process, spending about a third of the performance period using the Waterfall approach. Although we did change to an Agile methodology, it would have helped us to discuss and evaluate software process methodologies before we began the coding effort. We might have made more forward progress on the portal and had fewer code rewrites, thus getting a better return on our engineering investment.

■ Usability. Feedback from target users is essential to creating a viable product. We had input from a wide range of users and user types, but maintaining a small core group of test users would have been more helpful. Another approach that would have helped is that of ‘wireframing’ the product. This gives the test user a sense of product flow before the actual software is written and can reduce total programming time by providing an approved ‘spec’ for the programmer to follow.

■ Deployability. We integrated a wide variety of tools into the portal, including Lucene, SOLR, Heritrix, a PostgreSQL database server and XML-based components. Such a complex deployment environment introduces potential versioning conflicts that can delay or even prevent releasing the product to the public. It also complicates the process of identifying sources of security breaches, adding to the maintenance cost over time. Finally, system administrators need to be very familiar with each of the installed components and understand how to tune the server for optimum performance of those tools without compromising other services. The time involved in the server administration functions should be included when planning the product’s deployment.

■ Sustainability. There are many factors involved in sustaining a digital library. From a simple budget perspective, the costs of keeping a server running (electricity and bandwidth) are quickly calculated, and the basic system administration cost is also predictable. What is not readily estimated is the maintenance factor. How much expertise is needed to keep the product running when underlying systems (OS, web server, compiler/interpreter and so on) are upgraded? Higher expertise typically costs more in annual salary and is also harder to replace in the case of staff turnover. Can the digital library coexist with other services, or does it require dedicated resources? Resource-friendly, low-demand digital libraries equate to lower costs. How large is the user base and/or how unique is the digital library? Ultimately, sustaining the digital library may be more a question of academic obligation than of plain cost assessment, but if the cost is high, other services offered by the institution may have to be scaled back to offset the cost of the mission-critical product.

Recommendations (key messages for other practitioners)

The following recommendations are geared toward organisations that are building new products intended for wide use in a production environment.

■ Know before you go. Every software development language has both strengths and weaknesses. Recognise the pros and cons of each language and/or framework under consideration, then make a choice and stick with it through at least a full version 1.0 release. Where possible, use tools that have broad community acceptance so that:

(a) you can leverage the web community for technical support; and

(b) you will have a larger pool of programmer candidates from which to choose, should you need to augment the team.

■ Keep IT simple. The target deployment environment (which includes web and database servers, compilers/interpreters and the system’s operating system) should be as close to your institution’s existing IT services as possible. This will simplify maintenance, make deployments easier to debug and improve long-term sustainability of the project.

■ Plan early, plan often. Newer software process models like the Spiral and Agile methodologies are plan-oriented, calling for frequent team meetings where current status is reviewed, the software task list is re-prioritised, assigned and time estimated, and delivery dates are adjusted based on current progress. These planning-intensive methodologies can improve software development because (a) they give all team members frequent feedback on progress; (b) changes to the software design can occur incrementally; and (c) costs can be contained by modifying or eliminating planned features in a timely fashion.

■ Identify the minimum viable product (MVP). The key benefit of custom software development (whatever features you want) is also a disadvantage, since it is difficult to predict user interest in each of the special features. Time and money might be spent on a feature that is rarely used rather than improving the highest-demand features. The MVP strategy deploys a fully operational core product together with features that are non-working links (typically during beta tests). Logs track user clicks on the non-operational items (which can have a ‘coming soon’ message, for example) and those that are frequently clicked can be implemented for the version 1.0 release. This approach can help resolve disagreements over which features will be important to users and allow the team to focus energy on those in highest demand.

■ Budget carefully. Developing digital library software is expensive. Students can play a key role (thus reducing cost in some areas), but technical leadership and project management should be filled by the institution’s permanent staff. If the project is likely to be widely used, sufficient in-house system administration time needs to be budgeted into the project ahead of the planned deployment date. Some time should also be allocated for ongoing, post-deployment product support.

■ Commit to a core team. Successful digital library software development requires engineering expertise and a consistent vision. Every time the lead programmer or project manager changes, the vision for the product will change. This inevitably adds to the cost of the product since the new leader needs time to become familiar with the new role, even if already familiar with the product. Wherever possible, assign temporary members (student programmers, for example) to low-complexity tasks that have short ramp-up time in order to maximise their productivity.

■ Know your organisation’s limitations. Software requires long-term care, even if no further development is planned beyond the first production release. If your organisation experiences frequent changes in IT staffing, or if the IT staff are already overwhelmed with maintenance of the current in-house systems, the project is likely to have a short shelf-life. IT staff will have increasing difficulty keeping the product operational with each update to core system environments.
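The MVP click-tracking strategy recommended above can be sketched in a few lines: tally clicks on the ‘coming soon’ placeholder links and promote the most-clicked candidates into the version 1.0 feature list. The log format and feature names here are hypothetical:

```python
from collections import Counter

def rank_candidate_features(click_log, threshold=2):
    """Tally clicks on placeholder ('coming soon') features and return
    those clicked at least `threshold` times, most popular first.

    `click_log` is a hypothetical flat list of feature names, one entry
    per user click recorded during beta testing.
    """
    counts = Counter(click_log)
    return [(feature, n) for feature, n in counts.most_common()
            if n >= threshold]

# Simulated beta-test log of clicks on non-operational links.
log = ["citation-export", "saved-searches", "citation-export",
       "rss-feeds", "citation-export", "saved-searches"]
print(rank_candidate_features(log))
# [('citation-export', 3), ('saved-searches', 2)]
```

The threshold and ranking rule are design choices: a real project would normalise by traffic and test duration before deciding which features earn development time.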

Conclusions

All software development projects go through the common phases of conceptualisation, implementation design, programming and deployment. In cases where the final product will have a very limited audience – whether because of limited public access or lack of broad appeal – it can often be fully realised with only a few in-house contributors or with rotating members such as students from the university’s Computer Science Department. On the other hand, creating a robust, production-quality software framework is a complex engineering project which requires the full range of software engineering staff, including software quality assurance and security engineers, technical documenters and performance/capacity tuning engineers – in addition to the usual programmers, user interface designers, project managers and subject matter experts. Whether or not it is practical for the institution to devote this much manpower to a single project depends on its local mission and role in the larger academic world. Instead of calculating value solely on the size of the user community, a project’s ROI can be calculated using critical academic impact factors such as collection uniqueness or institutional mission. From a practical perspective, project teams should plan to use their standard development tools for the majority of the software tasks, limiting experimental development to a manageable proportion so that coding delays or functionality limitations do not add to development costs or impact product deployment.


1. Funded by the Andrew W. Mellon Foundation from 2005 to 2009.

2. The original project proposal, implementation and development plan were created and managed by Katherine Skinner (now of Educopia) and Martin Halbert (now at the University of North Texas). The economic assessment and recommendations regarding the project’s future were made by this author as Principal Investigator during the final term of the project.

3. http://www.google.com/ig

4. Source code for the project is available at: http://code.google.com/p/southcomb.

5. http://southcomb.org

6. A scripting language. See: http://www.php.net/.
