Chapter 18
Evaluating the Performance of Public Programs

Kathryn E. Newcomer

Since the beginning of the twenty-first century, the demand for systematic data about the efforts and results of public and nonprofit programs has increased due to the convergence of actions by many institutions with stakes in the delivery of public services. In countries across the world, governments at all levels, foundations, funders of international development endeavors (e.g., World Bank and the US Agency for International Development), think tanks, and academics with interests in improving the delivery of services have all called for more rigorous evaluation studies and data to support evidence-based policy and evidence-based management (Kettl, 2005; Pollitt & Bouckaert, 2000; Moynihan, 2008; Hatry, 2008; Newcomer, 2008, 1997). This chapter describes the current context affecting the demand and supply of evaluation in the public sector and offers guidance for managers on how to meet the evaluation challenge.

Both the level and sophistication of the dialogue about the merit of measuring and reporting on the operations and results of public programs and services have increased over the past two decades (Martin & Kettner, 2009; Packard, 2010). Many resources have been put into evaluation, even while critics question whether performance data or evaluation studies are even used to improve services, inform budgeting, or facilitate learning within service providers (Alexander, Brudney, & Yang, 2010; Radin, 2006, 2009; Ebrahim, 2005; Pollitt & Bouckaert, 2000).

While measuring programmatic performance is virtually ubiquitous in public agencies and nongovernmental organizations (NGOs) across the world, decisions regarding what to measure and how to measure it have been affected by a number of societal trends, as well as seminal events that have shaped public deliberations to privilege some types of measures over others. Public program performance is itself an amorphous concept open to a multitude of operational definitions and may be measured and interpreted in a variety of ways by different stakeholders (Moynihan, 2008; Newcomer, 1997). In addition, selection of measures and studies tends to occur in a politically charged arena. The risks of having evaluations or performance data used to embarrass governments, criticize programs or policies, reduce funding, or force change are likely to weigh heavily on those charged with making the measurement decisions (Moynihan, 2008; Newcomer, 1997).

Stakeholders have increasingly called for public agencies not merely to measure workloads and accomplishments under their control, typically referred to as outputs, but to measure the outcomes, or impact, of the government efforts. Proponents of “reinventing government” through adopting performance measurement, along with other new public management market-oriented tools (Pollitt & Bouckaert, 2000; Kettl, 2005), and of evidence-based policy and evidence-based management increasingly have called for the use of outcome data to assess the effectiveness of programs and policies (Olsson, 2007; Perrin, 2006; Nutley, Walter, & Davies, 2007).

Public programs are created in a highly visible and politicized environment. Evaluation is an attempt to dispassionately measure their effects. However, a dispassionate measure is difficult to achieve because of the complexity of the effort and the many and diverse interests involved. Reasoned selection of evaluation objectives, evaluators, and evaluation tools can increase the likelihood that the fruits of evaluation efforts benefit public programs.

Program administrators and overseers need to contemplate what they wish to achieve through the evaluation efforts they undertake or sponsor. Several strategic issues should be considered to enhance the objectivity and usefulness of program evaluation:

  • Who can and should define what effective program performance looks like?
  • How might mission-oriented outcomes or impacts be measured?
  • What information should be collected to measure program performance?
  • What will valid and reliable evaluative information cost (in terms of both economic and political costs)?
  • How can evaluative information best be used to improve programs?

These issues are being discussed at all levels of government (and in nonprofit organizations) as more and more people inside and outside government are demanding evidence that government programs work.

The nature of the strategy employed to judge government performance reflects the current political and economic climate, as well as the values brought to the task by both the requesters and the providers of evaluation information. Differences in the location and intentions of the officials requesting evaluations, the questions raised, the location and training of the evaluators, and the resources available lead to the use of different evaluation strategies and tools.

This chapter provides an overview of evaluation practice at all levels of government. It describes the context for evaluation, examines current evaluation practice, and offers advice on how evaluation may be used to improve government performance.

Context for Evaluation of Government Performance

Some remarkable changes in thinking about the way the performance of public and nonprofit agencies should be measured and evaluated have occurred in the past two decades. The most significant changes in the backdrop for evaluation practice have been executive and congressional initiatives at the federal level of government that require programmatic data and evidence; local governmental efforts to collect data to show what citizens are getting for their money; the rise of demands for evidence-based policy from critics outside government, along with a fixation on randomized controlled trials as the best way to produce such evidence; and increasing calls for more convincing evidence of the effectiveness of international development efforts.

The notion of assessing the effectiveness of programs to inform budgeting was introduced at the federal level in the executive branch in the 1960s with the Planning, Programming and Budgeting System in the Department of Defense (Schick, 1966). Effectiveness, not outcomes, was the operative term as federal budget offices examined programs for the next several decades.

Local governments, especially budget offices, moved to measure their efforts and accomplishments with guidance from the Urban Institute and the International City Management Association in the 1970s. Results and outcomes became the terms reflecting efforts undertaken in local governments in the 1980s as they issued reports (and later constructed websites) showing what citizens were getting for their tax monies.

Congress first called for the provision of nonfinancial program performance and results data in agency financial statements in the Chief Financial Officers Act of 1990. These reporting requirements were then expanded in the Government Management Reform Act of 1994. Perhaps the most important legislative initiative was the Government Performance and Results Act of 1993 (GPRA), which required all federal agencies to have strategic plans, performance goals, and performance reporting and to use evaluation. The inclusion of the term results in the title of the law reflected the public dialogue inspired by the best-selling 1992 book, Reinventing Government, by David Osborne and Ted Gaebler, as well as many other advocates of new public management reforms that included calls for managing by results (i.e., outcomes). Since the enactment of GPRA, dozens of federal laws have been passed that require performance measures in specific policy arenas, and GPRA reporting requirements were strengthened with the GPRA Modernization Act of 2010 (US Government Accountability Office, 2008a, 2008b).

In the early 2000s, evaluation and measurement practice was affected by demands for evidence-based policy in the public sector. It is hard to trace the origins of the adjective evidence-based, but a confluence of influential events at the turn of the century heralded increased public enchantment with the term. The establishment of the Campbell Collaboration in 2000, the Coalition for Evidence-Based Policy in 2001, and the What Works Clearinghouse at the US Department of Education in 2002 were some of the more publicized commitments made by social scientists and government to advance the use of systematic collection and analysis of research to inform decision making in the public sector (US Government Accountability Office, 2009).

The assumptions underlying the promotion of evidence-based decision making in government and in the nonprofit sector are that the more rigorous the social science research design, the more credible the evaluation findings, and that systematic reviews of rigorous evaluation studies of the same intervention can produce especially credible findings and models of demonstrated evidence-based interventions for dissemination.

There are widely recognized criteria by which the rigor of evaluation studies can be rated, but evaluation professionals do not view all criteria as equal. Most advocates of systematic reviews tend to believe that true experiments are far better than any other design. The term for experimental designs that has become fashionable in the twenty-first century is randomized controlled trials (RCTs), since that is the term used in medical research for tests of the efficacy of new drugs. Many of those supporting evidence-based policy view RCTs as the gold standard for designing any research or evaluation study.

During the Obama administration, the terms results and evidence have been used when referring to the ongoing performance data collected and examined in quarterly data-driven reviews that mimic the CompStat model popularized in police departments (Hatry & Davies, 2011; US Government Accountability Office, 2013). Obama's Office of Management and Budget (OMB) also has promoted the use of data analytics (US OMB, 2011), that is, sophisticated analyses of performance data and administrative data to inform decision making, an approach popularized by Michael Lewis's 2003 best-selling book Moneyball.

Since 2010, OMB also has been advocating more rigorous evaluation work to supply strong evidence on the extent to which specific programs work (i.e., produce results). It has presented a tiers-of-evidence framework (Preliminary/Exploratory, Moderate/Suggestive, and Strong Causal) that communicates which evaluation research designs are deemed more likely to produce valid data.

The Obama OMB has publicly voiced support for rigorous program evaluation more prominently than previous administrations. A series of office memoranda from OMB between 2009 and 2013 signaled that performance measurement and evaluation were to be used to produce evidence on what works (e.g., US OMB, 2013). OMB established a cross-agency federal work group to develop common evidence standards; signaled to agencies that it is more likely to fund evidence-based programs; established chief evaluation officers at the US Department of Labor and the Centers for Disease Control; focused on improving access to data and linking of data across program and agencies; called for more collaborative evaluations both across agencies and service providers in different sectors; and offered training on evaluation expectations to agency staff starting in fall 2013. Program evaluation seems to have been given a facelift by the federal government.

Internationally, governments and other funders have been looking for several decades for evidence of what development strategies work and have paid for external evaluators to conduct impact studies to collect data to shed light on solutions for complex social problems. Calls for evidence of development results have led to more investment in evaluation and monitoring (Savedoff, Levine, & Birdsall, 2006). One leader in developing an evidence base is the International Initiative for Impact Evaluation (3ie), which was established in 2008 to fund impact evaluations in development work, reflecting the worldwide enchantment with the assessment of impact to ultimately inform decision making by both development funders and implementers. 3ie, a US nonprofit organization, is but one of many organizations seeking answers about what works in fostering development (Lipskey & Noonan, 2009).

The search for rigorous evidence of international development effectiveness has led to many calls for RCTs in that arena as well. A dual-pronged movement has emerged: a push for increased transparency about where and how funds are allocated, including making contracts public and tracking funds from the funder to the community; and a push for "bang for the buck," demonstrating results notably through impact evaluation and systematic reviews using experimental designs. The two prongs are somewhat distinct, requiring input monitoring and output auditing on the one hand and impact assessment on the other.

While there is disagreement within the evaluation profession regarding the sanctity of RCTs and the relative weight to be given to different evaluation designs, the acceptance of the value of the “evidence-based” label is widespread. Significant implications of the prevalence of the public acceptance of the goal of rigorous evidence for evaluation practice include higher demands placed on those reporting outcomes or evaluation findings to demonstrate the quality of the evidence they produce; lack of a clear, shared understanding about when evidence is good enough; and, given the homage paid to RCTs, more uncertainty among both evaluators and audiences about how to produce high-level evidence in fieldwork where random assignment is not an option. It is harder to produce compelling “evidence” about public sector performance than it has ever been before.

Over the past two decades, budgeting professionals, police chiefs (e.g., CompStat), public health professionals (e.g., Healthy People 2020), the United Way of America, foundations, accrediting organizations, intermediaries (e.g., Charity Navigator and America Achieves), and social scientists intent on using evidence to improve public policies have undertaken initiatives that have shaped the public appetite and preferences for what information about government and nonprofits should be collected. Policymakers and practitioners have been calling for evidence-based practice in health care delivery since well before the turn of the twenty-first century. The evidence-based medicine component of the Affordable Care Act (ACA) highlighted this demand (http://www.ncbi.nlm.nih.gov/pubmed/21860057). The ACA is well funded and promotes the use of RCTs in identifying effective medical practices.

It is highly unlikely that demand for high-quality evidence about policy and program results will diminish any time soon, though the willingness to fund the costs of collecting the data needed is less certain. Politicians and the general citizenry want to know what works, even while they are not as committed to funding the efforts needed to answer that question.

Evaluation Practice in the Twenty-First Century

Program evaluation refers to the application of analytical skills to measure and judge the operations and results of public programs, policies, and organizations. Program evaluators employ systematic data collection, analysis, and judgment to address a multitude of questions about programs and policies. Evaluation practice encompasses routine measurement of program efforts and outcomes (i.e., performance measurement and monitoring), as well as one-shot studies across the life cycle of programs.

Evaluators draw from models and tools in the social sciences, such as economics and psychology, to analyze programs and policies, typically with a goal of improving them. They can advise public managers on what to measure, how to analyze and interpret diverse kinds of evidence, and how to use that information to inform learning and decision making in the context of uncertainty and bounded rationality.

Evaluation includes many kinds of approaches, including quick-turnaround, one-shot studies and surveys, as well as the collection and analysis of data on a routine basis—annually, quarterly, or more frequently—to try to grapple with questions of how programs and policies are working. Different sorts of questions are raised across a program's life cycle. For example, most evaluators would agree that evaluation ideally should be planned and designed before a program goes into effect. The reality is that such preplanning is rare. Typically programs and policies have been implemented well before evaluations or performance measures are requested by a principal, funder, or other stakeholder.

Perhaps the most important evaluation skill is framing the most relevant questions to ask about programs and policies. Michael Quinn Patton (2011), among the best-known evaluation theorists, has said that the gold standard in evaluation is selecting the appropriate evaluation method (or methods) to answer particular questions and serve intended uses. Evaluation practice not only entails selecting tools; it also includes skills in valuing, matching methods to evaluative questions, engaging stakeholders, and investigating program and policy contexts.

Program managers may use information collected as part of evaluation efforts to improve programs, and external overseers may use it to inform or justify funding decisions. Many different criteria may be used to assess the performance of programs; allocation-oriented inquiries, for example, may focus on service delivery, cost-effectiveness, or program impact.

Program evaluation tends to be retrospective and involves collecting and analyzing information on past program performance. Managers and overseers use the information to make judgments on the value or worth of programs to inform future resource allocation (a summative approach), improve program operations (a formative approach), or both of these purposes. The evaluation strategy employed in a specific instance should reflect who wants to know what about a program, as well as who collects the information. The political agenda of requesters and the current economic stresses influencing program funding also affect the strategy used and the specific questions raised.

Evaluation expertise typically is employed to address at least one of three broad objectives: (1) describe program activities or the problem addressed, (2) probe program implementation and program targeting, or (3) measure program impact and identify side effects of the program. Table 18.1 displays the sorts of questions that are addressed when collecting data to fulfill each of these three objectives, as well as the evaluation designs used to address the questions. In practice, one evaluation effort may attempt to meet two or more of these distinct objectives.

Table 18.1 Matching Designs to the Evaluation Questions

Evaluation objective: Describe program activities or the problem addressed
Illustrative questions:
  • What activities are conducted? By whom?
  • How extensive and costly are the programs?
  • Whom do the programs serve?
  • How do activities vary across program components, providers, or subgroups of clients?
  • Has the program been implemented sufficiently to be evaluated?
Possible designs:
  • Performance measurement
  • Exploratory evaluations
  • Evaluability assessments

Evaluation objective: Probe program implementation and program targeting
Illustrative questions:
  • To what extent has the program been implemented as designed?
  • With replication of previously successful interventions, how closely is the intervention being implemented with fidelity to the original design?
  • What feasibility or management challenges hinder successful implementation?
  • To what extent have program activities, services, or products focused on appropriate (mandated) issues or problems?
  • To what extent have programs, services, or products reached appropriate populations or organizations?
  • To what extent is the program implemented in compliance with the law and regulations?
  • To what extent do current targeting practices leave significant needs (problems) not addressed or clients not reached?
Possible designs:
  • Multiple case studies
  • Implementation or process evaluations
  • Performance audits
  • Compliance audits

Evaluation objective: Evaluate program impact and identify side effects of the program
Illustrative questions:
  • Has the program produced results consistent with its purpose (mission)?
  • How have effects varied across program components, approaches, providers, and client subgroups?
  • Which components or providers have consistently failed to show an impact?
  • To what extent have program activities had important positive or negative side effects for program participants or citizens outside the program?
  • Is the program strategy more (or less) effective in relation to its costs?
  • Is the program more cost-effective than other programs serving the same purpose?
  • What are the average effects across different implementations of the intervention?
Possible designs:
  • Experimental designs (RCTs)
  • Difference-in-differences designs
  • Propensity score matching (PSM)
  • Statistical adjustments with regression estimates of effects
  • Multiple time series designs
  • Regression discontinuity designs
  • Cost-effectiveness studies
  • Benefit-cost analysis
  • Systematic reviews
  • Meta-analyses

Framing Evaluation Questions

In framing evaluation questions or identifying what to measure, developing the theory of change or logic underlying a program or project is typically a sound starting place (Frechtling, 2007; Wholey, Hatry, & Newcomer, 2010). Modeling programmatic elements, that is, the causal mechanisms that are expected to produce desired results and the contextual factors that may mediate or moderate program processes, is helpful not only for designing program evaluation studies but also for determining what to measure on a routine basis and then for analyzing and interpreting what the performance data mean. The analytical categories of program operations and results are not always clear to all stakeholders, who may have different perspectives on what constitutes a program output (what the program actually did) versus a program outcome (what happened as a result of the program's actions). Engaging stakeholders in conversations about how programs are intended to operate, and about the factors outside the agency's control that may hinder achievement of results, is therefore extremely helpful.
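
To make the modeling exercise concrete, the sketch below (in Python, purely for illustration) represents the logic of a hypothetical job-training program as a simple data structure; the program name and every element listed are assumptions invented for this example, not drawn from any actual program or from this chapter's sources.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LogicModel:
    """Minimal representation of a program's theory of change (illustrative only)."""
    program: str
    inputs: List[str] = field(default_factory=list)      # resources consumed by the program
    activities: List[str] = field(default_factory=list)  # what the program does
    outputs: List[str] = field(default_factory=list)     # direct products of the activities
    outcomes: List[str] = field(default_factory=list)    # changes expected to result from the outputs
    context: List[str] = field(default_factory=list)     # external factors that may mediate or moderate results

# Hypothetical job-training program, invented for illustration
job_training = LogicModel(
    program="City job-training program",
    inputs=["grant funding", "training staff", "classroom space"],
    activities=["skills workshops", "one-on-one coaching"],
    outputs=["participants trained", "coaching sessions delivered"],
    outcomes=["participants employed six months after completion", "higher earnings"],
    context=["local labor market conditions", "transportation access"],
)

for stage in ("inputs", "activities", "outputs", "outcomes", "context"):
    print(f"{stage}: {', '.join(getattr(job_training, stage))}")
```

Laying out the elements this way makes it easier for stakeholders to see where outputs end and outcomes begin, and which contextual factors sit outside the agency's control.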

A program's stakeholders are frequently interested in learning about how a program is being implemented. With any intergovernmental program, for example, many interesting things may happen as federal, state, and local collaborators attempt to get a program in place and operating as intended (see Barnow, 2000, about evaluating job training programs). Key stakeholders may need to know whether implementation is adhering to process rules and regulations or whether the program's design is a good fit for the environment in which it has been placed. Many important evaluation questions are "how" questions. How well is an agency meeting particular regulations in terms of equity or fairness and targeting? Evaluators or program staff may ask the people affected by a program about their experience in order to assess their satisfaction, whether their needs are being met, and the viability of the program's design. Typically a mixture of qualitative data collection methods, such as interviewing and on-site observations, is needed to answer the "how" questions. Current practice is to use a combination of qualitative and quantitative data collection methods in evaluation work (Greene, 2007).

When considering whether and how to replicate or scale up programs previously found to be successful in certain circumstances, questions to address include these:

  • What were the circumstances when the model was implemented; that is, how successful was it for whom, where, and when?
  • Are these circumstances sufficiently similar to other populations, locations, and times to support replication or scaling up?
  • What has occurred in replicating the model that was not anticipated?
  • What are the implications of those consequences for adherence to high-fidelity replication or facilitating appropriate adaptation?
  • What has been learned about implementing the model and adapting it in a different setting?

An organization also may need to ask what happened: What are the program outcomes in both the short run and the longer run? Why didn't a program achieve its intended goals or outcomes? What may be some of the program's unintended outcomes? And ultimately, what was the program's impact? When the term impact is used in evaluation, it involves thinking about the counterfactual. That is, what would have happened if the program had not been in place? Another way to express this question is: What is the net effect of the program? Since factors apart from the program may be affecting the intended outcomes, evaluators may need to control for the influence of other relevant factors.
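
As a rough, hypothetical illustration of the counterfactual logic, the sketch below estimates a program's net effect as the difference between participants' mean outcome and the mean outcome of a comparison group standing in for the counterfactual; the outcome figures are invented, and in practice the credibility of such an estimate depends entirely on how well the comparison group approximates what participants would have experienced without the program.

```python
from statistics import mean

# Invented outcome data (e.g., share of clients employed); illustrative only
participants = [0.62, 0.58, 0.71, 0.66, 0.60]  # outcomes observed with the program
comparison = [0.48, 0.52, 0.55, 0.50, 0.47]    # comparison group standing in for the counterfactual

# Net effect: observed outcome minus an estimate of what would have happened without the program
net_effect = mean(participants) - mean(comparison)
print(f"Estimated net effect: {net_effect:.3f}")
```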

Depending in part on the scope of a program, experimental designs (RCTs) may be expensive for impact evaluation and less likely to be used due to the resources and time required or to ethical concerns. A variety of research designs that attempt to rule out other explanations for a program's observed impact by employing comparison groups, such as difference-in-differences designs, propensity score matching, multiple-time-series designs, and regression discontinuity designs, have shown promise in evaluation efforts (Henry, 2010).
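
One of the comparison-group designs named above, difference-in-differences, can be illustrated with a few lines of arithmetic. The sketch below uses invented before-and-after group means; the estimate is only as credible as the assumption that the program and comparison groups would have followed parallel trends in the absence of the program.

```python
# Invented before/after mean outcomes for two groups; illustrative only
program_before, program_after = 54.0, 63.0        # group exposed to the program
comparison_before, comparison_after = 52.0, 56.0  # comparison group

program_change = program_after - program_before            # change in the program group (9.0)
comparison_change = comparison_after - comparison_before   # change in the comparison group (4.0)

# Difference-in-differences: the change beyond what the comparison group experienced
did_estimate = program_change - comparison_change
print(f"Difference-in-differences estimate: {did_estimate:.1f}")  # 5.0
```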

Since impact evaluations are costly and require a longer time period than most other evaluation approaches, conducting a meta-evaluation or a meta-analysis of previous program evaluations is a less costly way to assess program impacts. A meta-evaluation, or systematic review, involves gathering all known evaluations of a program, systematically analyzing the strength of the methodology employed in each, and comparing the findings. Meta-evaluations allow systematic identification of the effects that methodological decisions (e.g., sample size, time period) have on conclusions in impact evaluations. In a federal system of government, different jurisdictions serve as natural laboratories for innovative programs and policies adopted at different times, so it can be useful to scan experiences across jurisdictions to summarize the effects of interventions or policies. And if a large number of evaluations have used the same outcome measures and the same means of measuring program components, then outcomes can be averaged in a meta-analysis across the studies and a summary assessment of impacts made (Lipsey & Wilson, 2000).
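
As a hedged illustration of how effects might be averaged across studies, the sketch below applies a common fixed-effect (inverse-variance weighting) calculation to invented effect sizes and standard errors; actual meta-analyses involve many additional steps, such as screening studies for methodological quality and testing for heterogeneity.

```python
# Invented (effect size, standard error) pairs from prior evaluations of the same intervention
studies = [(0.30, 0.10), (0.15, 0.05), (0.45, 0.20)]

# Fixed-effect pooling: weight each study by the inverse of its variance (1 / SE^2),
# so more precise studies contribute more to the pooled estimate
weights = [1 / se ** 2 for _, se in studies]
pooled_effect = sum(w * effect for (effect, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"Pooled effect: {pooled_effect:.3f} (SE {pooled_se:.3f})")
```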

Table 18.2 offers a set of evaluation design principles that convey the importance of designing the evaluation to address the questions, as well as the need for highly valid and reliable data and well-supported inferences to provide credible and compelling data and findings.

Table 18.2 Basic Principles of Evaluation Design

  • Framing the most appropriate questions to address in any evaluation work is the key.
  • Clear and answerable evaluation questions should drive the design decisions.
  • Design decisions should be made to provide appropriate data and comparisons to address the evaluation questions.
  • Decisions about both measurement and design should be made to bolster the methodological integrity of the results.
  • During design deliberations, careful consideration should be given to strengthening inferences about both findings and the likely success of recommendations.
  • All decisions made during evaluation design and reporting should be characterized by strong methodological integrity.

The Complex Evaluation Ecosystem

Evaluation operates in a complex political context. It is instigated by politicians and managers for a variety of reasons and is undertaken by a variety of actors. To clarify how the motivations underlying evaluation and the providers used may affect programs and resources, evaluation practice is divided here in accordance with three underlying political and managerial motivations: problem-based investigations, performance assessment, and impact evaluation. The implications of these three different motivations for the way evaluation is carried out, likely program staff responses, and the resources consumed are displayed in table 18.3.

Table 18.3 Implications of Different Political Motivations for Evaluation

Problem-based investigations (What went wrong?)
  • Focus: Inputs (e.g., financial transactions) or program processes
  • Evaluation provider: Inspector general offices and GAO in the federal government; state and local auditors; contractors
  • Reaction of program staff: Defensive to hesitant
  • Data sources required: Typically limited to agency records and staff
  • Time period required for evaluation effort: Short (six weeks to six months)
  • Resources required for evaluation effort: Potentially few for evaluators; opportunity costs for program staff
  • Evaluation tools typically employed: Financial, performance, or compliance audits

Performance assessment (How is it working?)
  • Focus: Program operations, level of efforts (i.e., outputs), and short-term outcomes
  • Evaluation provider: Agency evaluation or budget staffs; contractors; and state and local budget staff or auditors
  • Reaction of program staff: Hesitant to compliant
  • Data sources required: Typically limited to agency records and staff
  • Time period required for evaluation effort: Short to medium (typically a few months or longer), but depends on questions
  • Resources required for evaluation effort: Depends on questions; opportunity costs for program staff
  • Evaluation tools typically employed: Ongoing performance measurement systems; process or implementation evaluations; performance audits

Impact evaluations (Did it work?)
  • Focus: Outcomes and net effects of programs
  • Evaluation provider: Contractors; agency evaluation staffs; foundations; and state and local auditors
  • Reaction of program staff: Defensive to compliant
  • Data sources required: Agency records and staff and many external sources, such as beneficiaries and other service delivery partners
  • Time period required for evaluation effort: Long (at least one year or longer)
  • Resources required for evaluation effort: Extensive; opportunity costs for program staff
  • Evaluation tools typically employed: Experimental designs; comparison group designs; time series designs; cost-effectiveness studies; benefit-cost analyses; comparative case studies; meta-evaluations; meta-analyses

Problem-Based Investigation

Program leaders or overseers who feel that there are problems due to fraud, inadequate management, or poor program design may turn to program evaluation (or auditing) as a means of identifying the roots of the problem and perhaps providing possible solutions. Auditors such as those within the US Government Accountability Office (GAO) or inspector general (IG) offices at the federal level or their counterpart offices at the state and local levels are typically tasked with these inquiries. They tend to view their job as one of probing alleged improprieties or lack of compliance with procedural requirements. Since program staff are an essential source of information for evaluators, their perceptions represent another difficult-to-quantify aspect of evaluation that can potentially skew the results. When auditors are brought in, program staff are likely to feel that an assumption exists on the part of some public officials or managers that a problem, probably financial in nature, exists in their program. The GAO and IG auditors, and their counterparts at the state and local level, are therefore not generally greeted with enthusiasm by program staff. In fact, the assumption that these auditors are more concerned with uncovering problems than with improving programs is quite commonly held.

Performance Assessment

Leaders and overseers responding to demands for evidence that government works, as well as program or budget staff interested in obtaining data on accomplishments that extend beyond financial data, are likely to view the evaluation task as a means of monitoring, and perhaps improving, program operations. The GPRA, along with many state and local initiatives, presses program staff and evaluators within public agencies to collect data on performance on an ongoing basis.

The resources consumed by programs and the number of goods and services provided are the easiest program facets to track because agency records typically contain these data. Outputs are typically measurable, but their measurement may entail costs not covered in program budgets. Although outputs may be easy to track with computerized information processing systems, those systems may not have been designed with adequate flexibility to capture the elements or comparisons that program staff or oversight groups identify as the most crucial outputs for ascertaining how well a program is working. Assessments of how well large intergovernmental programs are being implemented, especially programs that are spread across states and entail more resources, are typically contracted out to organizations such as the Urban Institute and Mathematica. Program staff do not generally greet performance assessment with enthusiasm, but they are not as intimidated by it as by problem-based investigations.

Impact Evaluation

Political or career leaders who need data to justify expensive or politically vulnerable programs may well ask for evidence on program impact. Impact evaluations measure the extent to which a program causes change in the desired direction in a target population. The key challenge in successfully implementing an impact evaluation is to ensure that the impact attributed to the program is due to the program and not other factors. These approaches tend to be expensive and difficult to implement, and the possibility that the findings may not demonstrate that the program works may be intimidating to staff, who are asked to devote even more of their own time to the endeavor.

Certain technical skills are necessary for measuring program impact; thus, trained evaluation staff as well as outside evaluation contractors are likely to be called on for these projects. Such projects typically involve systematic data collection in the field from program recipients, and the data analysis will likely involve sophisticated statistics. Virtually all impact evaluations for governments at all levels are contracted out to organizations such as the Urban Institute and Mathematica.

When questions about the cost-effectiveness of programs are asked in connection with reauthorization or appropriations processes, program staff may well be defensive. In this context, there are probably few questions that will not cause program staff and other supportive groups, such as program beneficiaries, to be concerned about the fate of their program.

Cost-effectiveness studies and benefit-cost analyses have become popular among politicians who are asking for evidence that helps to compare the costs of achieving desired outcomes through different programs. (For an example, see an overview of state use of benefit-cost analyses in Pew-MacArthur, 2013.) Cost-effectiveness studies relate the costs of a given program activity to measures of specific (intended or unintended) program outcomes. The program's objectives need to be clear enough that the costs of achieving specific objectives can be identified. For example, if the objective of a particular program is to prevent high school dropouts, the cost per dropout prevented or the cost per percentage-point increase in the graduation rate needs to be identified so that such costs can be used to compare different programs attempting to achieve the same outcomes.

Since cost-effectiveness studies do not require that dollar values be attached to all program benefits, they are easier to conduct than benefit-cost analyses, which require that all costs associated with programs be calculated as well as the benefits. The Washington State Institute for Public Policy has drawn much national attention, including from the Obama administration and the US Congress, for its path-breaking work in using benefit-cost analyses to compare social programs.
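
To illustrate the arithmetic behind the two approaches discussed above, the sketch below works through a hypothetical dropout-prevention example: a cost-effectiveness ratio (cost per dropout prevented), which requires no dollar value on the benefit, and a simple benefit-cost ratio, which does. All figures, including the assumed dollar benefit per dropout prevented, are invented for illustration.

```python
# Invented figures for a hypothetical dropout-prevention program; illustrative only
program_cost = 500_000.0    # total program cost in dollars
dropouts_prevented = 125    # estimated net outcome relative to the counterfactual

# Cost-effectiveness: cost per unit of outcome; no dollar value placed on the benefit itself
cost_per_dropout_prevented = program_cost / dropouts_prevented
print(f"Cost per dropout prevented: ${cost_per_dropout_prevented:,.0f}")

# Benefit-cost: requires monetizing the benefit (here, an assumed dollar value per dropout prevented)
assumed_benefit_per_dropout_prevented = 9_000.0
total_benefits = dropouts_prevented * assumed_benefit_per_dropout_prevented
benefit_cost_ratio = total_benefits / program_cost
print(f"Benefit-cost ratio: {benefit_cost_ratio:.2f}")
```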

There are many ways to frame and deploy evaluation in the public sector, and the way that evaluation is deployed matters to the politicians and leaders who call for the data, the government staff who are called on to provide or collect the data, and the citizens who pay for the evaluation work. The question frequently raised by all three of these groups is, do the benefits exceed the costs of evaluation?

Using Evaluation to Improve Performance

Improving government performance is the ultimate goal of program evaluation. Effectively measuring a program's performance and communicating what is learned through the evaluation process are the two key challenges facing program staff and evaluators in this era of high demand for evidence-based policy and evidence-based management in government. Identifying and reliably measuring performance measures that clearly reflect program missions are fundamental and necessary (but not by themselves sufficient) tasks with which program managers at all levels of government are currently grappling. Calls for measuring program outcomes and impacts are also ubiquitous in the public sector.

Most important, taking a holistic view on how evaluation should be organized and deployed in a public agency has implications across the organization, affecting strategy, learning, performance, and ongoing improvement of internal operations. Program evaluation, including routine measurement and the strategic use of evaluation studies and cost-effectiveness analyses, should be designed in a comprehensive and coordinated manner to ensure that the information gained can and will be used by the appropriate leaders and staff to improve program operations (Nielsen & Hunter, 2013). The evaluation community provides much guidance on strategies to employ so that evaluative information is more likely to be used (Kirkhart, 2000; Patton, 2008).

Certain practices can help program staff and evaluators design evaluations or performance measurement systems that can and will be useful. The ten points that follow highlight some of the more important issues that program officials committed to improving program performance through the effective use of evaluation should consider:

  1. Program staff and evaluators should first identify the intended audience for any evaluation effort. The identity and information needs of the audiences most critical to the program's future should be clarified before any data collection efforts proceed. Those involved in evaluation design should identify who will use the resulting information and for what purposes, and they should investigate pertinent background issues. For example, are budget offices or congressional or state legislative committees asking for cost-effectiveness data to support decisions? Anticipating how information might be used may greatly affect project or system design.
  2. Adequate stakeholder engagement and timing are critical in evaluation design and use. In designing evaluation projects or measurement systems, program staff and evaluators should engage relevant stakeholders to ensure that evaluation efforts, especially identification of performance measures, are deemed feasible and legitimate. But timing matters; the designers must anticipate when program decisions will be made, so the necessary data will be available in a timely fashion. The timing of performance measurement should be considered carefully. For example, collecting outcome data on clients of substance abuse programs may have to wait for more than a year after program completion so that the treatment has had enough time to affect their behavior; in this situation, annual data collection may not be appropriate. Timing is also important as evaluators develop their recommendations. Recommendations are more likely to be implemented if they include appropriate time frames for program improvements.
  3. Evaluation data should be relevant to the information needs of the primary audience. Program staff and evaluators should work together to create clear user-oriented questions in their design of evaluation efforts. Similarly, performance measures should be understandable, clearly linked to program mission, and deemed legitimate by the intended audience. Criteria for assessing program performance should be clarified. Agreement should be secured from the intended audience as to what constitutes timely, adequate, efficient, and effective performance so that any judgments offered based on these criteria will be considered relevant and legitimate. Anticipating how evaluation information will be received is critical to effective planning (Mohan & Sullivan, 2007).
  4. Early in the design phase, program staff and evaluators should collaborate to decide what data and reporting will look like and ensure that they meet the audience's expectations. Considerations such as whether quantitative or qualitative data (or both) should be collected and how much precision there should be with numerical data should be thought through early on, because they will affect decisions about data collection and analysis techniques. The desired method of presentation should be clarified early on as well, as different data analysis techniques permit different sorts of numerical and graphic presentations. Although graphics cannot compensate for weak or irrelevant data, effective presentation can enhance the utility of relevant data immeasurably. Presentation options may be precluded if foresight is not exercised in designing the evaluation. The new field of data visualization has raised expectations for reporting, but plenty of guidance and software are readily available (Evergreen, 2014).
  5. Competence is an essential quality of effective evaluation efforts. Competence refers here to both the individuals performing the evaluation and the methods they use to collect and analyze data. Reports and other documentation of evaluation and measurement efforts should provide clear, user-friendly accounts describing how evaluation efforts were designed and implemented. This information will help convey to the audience the accuracy and completeness of the information provided. When there are multiple audiences for evaluation findings, they have differing expectations of the level of detail that should be provided about the methods employed to collect and analyze data. For example, university researchers and think tanks have different expectations from members of Congress regarding how much information on evaluation methodology should be included in reports they review. Not all reports or documents need to provide extensive detail about methodology, but such evidence should be available. Documentation should also be available to demonstrate that an effort was made to ensure accurate measurements, that any sampling procedures used were pertinent, and that the logic supporting findings and interpretations is defensible. Evaluators are ethically bound to report sufficient information on methodology to support their work.
  6. The specificity, number, and format of recommendations stemming from evaluation efforts should be appropriate for the target audience. In order to develop useful recommendations, evaluators should consider their structure and content early in the evaluation process. Relevant program staff should be consulted early on to ensure that recommendations are sufficiently feasible. Feasibility and competence are the two factors that are most essential in ensuring that recommendations have a chance of being implemented (Wholey et al., 2010). Recommendations should, to the extent it is politically feasible, clearly specify which office should take exactly what actions and within what time frame.
  7. Reporting vehicles should be tailored to reflect the communication preferences of the different target audiences. Findings may need to be packaged in several different formats. Effective dissemination of evaluation results means that the right people get the right information in the right format to inform their decision making. Decision makers can improve program performance only if they receive information that they can understand and is relevant to their needs.
  8. Much work is required on the part of evaluators to ensure that evaluation information is used. Evaluators should not assume that once they present their data or report findings, the target audience will by itself make fruitful use of this information. Communication skills and diplomacy on the part of evaluators are critical to ensure data and reports are used. Evaluators should attempt to develop good working relationships with program staff and other pertinent decision makers early in the evaluation process so that when it comes time to communicate results, understanding and appreciation will be forthcoming. Evaluators should note, however, that appreciation of the need to improve programs will not by itself be sufficient to ensure recommendations are implemented.
  9. Pertinent decision makers should ensure that incentives are provided to both individuals and organizational units to encourage the use of data and study findings to improve program performance. The methodological challenges accompanying the use of evaluation tools are actually minor in comparison to the political challenges. Public programs have many masters, and thus many views on what to measure and how to judge performance arise. A key obstacle typically hindering efforts to improve performance is uncertainty about what happens if performance targets are not met or recommendations are not implemented. What accountability for results actually means needs to be clear, particularly on the part of those charged with measuring performance.
  10. The bottom line is that leadership buy-in and visible, consistent support of the use of evaluation across the organization is essential. Political will is necessary for evaluation efforts to be used to improve government. Those in leadership positions must believe in evaluation, and they must provide both the resources and the consistent organizational support necessary to ensure that the managers at all levels appreciate the objectives and appropriate use of evaluation. Only if those in charge send clear signals and provide the needed incentives can evaluation be implemented and used effectively to improve government.

Summary

The climate for evaluation of public programs is turbulent and challenging. Politicians at all levels of government are calling for data that demonstrate the value added to society by government programs.

Documenting program performance is challenging because interested parties typically have different ideas about what to measure. Identifying the most useful evaluation questions and the most relevant performance indicators requires that agreement be secured within agencies and among diverse stakeholders involved in implementing complex intergovernmental programs—not an easy task. Many different sources of evaluation expertise are available to help sort through the issues faced in assessing government performance, but there are no simple solutions to address increasingly complex problems.

With a high level of demand for evaluation information and a diverse range of evaluation providers available, there are many different evaluation tools and approaches currently in use at all levels of government and in the nonprofit sector. It is possible to categorize these approaches as motivated by one of three political or managerial objectives: problem-based approaches, where auditors or evaluators go in to investigate problems with program management or design; performance assessment, where evaluators or program staff collect data on performance on an intermittent or ongoing basis; and impact evaluations, undertaken by evaluators responding to funders' calls for evidence that programs are having the intended impact. Evaluation is less costly and easier to accomplish if the focus is on counting program resources or documenting program processes rather than measuring program results. Systematically assessing program impacts or changes in environmental conditions is much more challenging.

The ultimate challenge for both program staff and evaluators is to ensure that evaluations are used to improve government performance. To enhance this likelihood, program staff and evaluators should:

  • Identify and understand the audience for the information and deliver it in a timely fashion
  • Create clear user-oriented questions and identify clear and legitimate performance indicators
  • Envision early on what the reporting should look like to help in making design decisions
  • Use appropriate professionally recognized methods of data collection and analysis
  • Develop specific, clear, useful, and feasible recommendations deemed realistic by program staff
  • Tailor reporting vehicles to the communication preferences of the different target audiences
  • Work together to see that appropriate officials use the information provided
  • Provide incentives to program staff to implement the recommended improvements

However, even the best-designed and effectively communicated evaluation efforts will not automatically translate into improved government performance. Political will is the necessary ingredient. Political and senior career leaders must believe in evaluation, and they must signal that in their actions as well as their words. Only leadership buy-in can help ensure that the education and incentives are provided to program staff and oversight bodies so that they will appreciate and use evaluative information to improve public programs.
