Workshop on A Retrospective on Evaluation in the U.S. Federal Public Sector

U.S. Department of State Third Annual Conference on Program Evaluation - Methods Track
Washington, DC
June 9, 2010

MR. PATTERSON: Good morning. Rex Patterson here from the Office of Strategic and Performance Planning in the Department of State. And I want to welcome you to the first in the series of our methods track workshops.

And leading the discussion for this session will be Robert Johnston Shea. He's a director in Grant Thornton's Global Public Sector and a member of the Organizational Improvement Team that leads a number of performance management engagements, including at the Departments of Agriculture and Homeland Security. He was most recently at the Office of Management and Budget as an associate director for Administration and Government Performance.

And in addition to managing OMB's internal operations, he led the President's Performance Improvement Initiative, administered the Program Assessment Rating Tool, and advised on government human capital policy, and led interagency collaborations in the areas of food safety and implementation of the Federal Funding Accountability and Transparency Act.

MR. SHEA: We actually can probably dispense with the whole --

MR. PATTERSON: Okay, so I'll let you --

MR. SHEA: I'll let you know. It goes on and on.


MR. PATTERSON: But it's so good. Anyway, a lot of good information here. But again, I'll let you introduce the other distinguished members of your team and then we can get started and that will leave us some time for discussion.


MR. SHEA: Okay, great. Our first speaker, John Baron, has stepped out for a minute. So I'll just introduce the topic. And it will be a mystery to him.


The -- we're very proud to be a part of this conference. The American Evaluation Association defines evaluation as assessing the strengths and weaknesses of programs.

We started without you, John.

MR. BARON: Fair enough.

MR. SHEA: The -- assessing the strengths and weaknesses of programs, policies, personnel, products, and organizations to improve their effectiveness. There's not a common -- not a single agreed-to definition of evaluation, but I think that's a very good start. But evaluation has a pretty checkered past in the federal public sector, as we've heard the last day or so.

Evaluation as a government priority grew steadily throughout the last century. GAO, the -- first called the Government -- General Accounting Office, now the Government Accountability Office, was created in 1921 and, though begun as an audit institution, has expanded -- like many audit institutions, expanded its mission and focus to include more and more assessment of program effectiveness.

Inspectors general had been in existence in the public sector since the founding of our country. But it wasn't until 1978 that they were created in statute, at which point there were 12. Today there are over 70, and many of them believe their mission to be at least in part evaluating program effectiveness.

At some point too, an LBJ executive order requiring agencies to adopt a programming, planning, and budgeting system lent greater legitimacy to evaluation. In the federal public sector, many agencies have program analysis and evaluation offices that are responsible for leading evaluations in agencies, and they are explicitly to use those evaluations in their resource allocation decisions. Other agencies have large quasi-independent entities, the sole purpose of which is to conduct and oversee program evaluation efforts.

Federal investment in evaluation has certainly fluctuated, but policy attention to evaluation has remained strong. Congress today shows increasing interest in evaluating the effectiveness of promising programs including evaluation requirements and funding in some program legislation.

The Bush Administration asked all federal programs to develop evaluation strategies and assess the independence and rigor of their evaluations. The Obama Administration, as we've heard, has launched a major initiative to enhance the body of impact evaluations and the evaluation capacity of agency staff.

We've got two leading policy makers in the area with us today to discuss these developments. John Baron is the president of the Coalition for Evidence-Based Policy. Dan Rosenbaum is a senior economist in the Economic Policy Division of the Office of Management and Budget. You've got their biographies with you. So I'll just hand it over to John to begin, then to Dan, and then we'll take your questions.

MR. BARON: Thanks. (Inaudible.) Okay, I'm going to (inaudible).


MR. SHEA: He had his Wheaties today.

MR. BARON: (Inaudible.)

MR. SHEA: Do you want me to get this?

MR. BARON: Yeah, if you wouldn't mind. Okay, I'm going to just say a word about our organization by way of introduction. Then I'm going to discuss two things. One is the rationale for evidence-based approaches and why we believe initiatives like those that Ruth just talked about and the OMB-led rigorous evaluation initiative are so important in development assistance and diplomacy.

So first is the rationale, and second, to offer a few brief thoughts on what we think it takes to do this effectively: to use rigorous evidence about what works to actually increase the effectiveness of development assistance and diplomacy.

So okay, can you all hear me back there?

MR. SHEA: That's for audio recordings.

MR. BARON: Oh, fair enough. Okay. We are the Coalition for Evidence-Based Policy, a nonprofit, nonpartisan organization. Our mission, in one sentence, is to increase government effectiveness in a number of different areas, mostly areas of social policy, through rigorous evidence about what works. We have a bipartisan board that includes Bob Solow, a Nobel laureate in economics at MIT; David Ellwood, dean of the Kennedy School at Harvard University; David Kessler, the former FDA commissioner, and others.

We're not affiliated with any programs or program models. So we serve as a neutral, independent source of expertise to Congress, OMB, and the federal agencies on evidence-based programs. Our work is funded independently by the MacArthur Foundation and others. And our work with Congress and the executive branch has helped advance some concrete reforms.

A recent evaluation of our work conducted for the William T. Grant Foundation found that over the past five years, the Coalition has successfully influenced legislative language, increased funding for evidence-based evaluations and programs, helped shape OMB's Program Assessment Rating Tool -- that was some of the work that we did with Robert and the Bush Administration -- and raised the level of debate in the policy process regarding standards of evidence.

The Coalition has established a generally positive reputation as a rigorous, responsive, and impartial advocate for evidence-based approaches, primarily at the federal level. And this is our board of advisors.

Okay, now turning to the rationale for evidence-based approaches and rigorous evaluation in development assistance and diplomacy. I'm going to talk first about development assistance.

A problem that evidence-based approaches seek to address in this area is that development agencies, developing country governments, and other organizations spend tens of billions of dollars each year on development, yet there is very little scientifically valid evidence about which strategies -- which development strategies that they fund are truly effective in reducing poverty and improving health and other key outcomes and which are not.

That was a central finding of the evaluation gap report that Ruth Levine was involved in back in 2006. Based on its comprehensive review of the evaluation literature, it found quote, "For most types of programs, a body of scientific evidence about effectiveness is lacking. For almost all projects currently in operation or in the pipeline, virtually no credible information will be generated about program impact." And early World Bank reviews had reached similar conclusions.

Now, recently as we just heard, there has been a recognition of this problem in a number of different places; the World Bank, USAID, Congress, the Millennium Challenge Corporation, and elsewhere. And there have been some important new initiatives to address it.

For example, in the -- there's been a recent congressional effort in the supplemental appropriations bill that was enacted into law last year. There is a sense of Congress that the Secretary of the Treasury shall, quote, "Seek to ensure that multilateral development banks rigorously evaluate the development impact of selected bank projects and emphasize use of random assignment in conducting such evaluations where appropriate and to the extent feasible."

So that was a congressional nudge. We had provided some input to the Foreign Relations Committee on that. But there are a lot of new initiatives, as you know and as have been discussed, that have been going on in this area.

Now, the question came up this morning, you know, is this just a -- it wasn't quite phrased this way -- is this sort of a passing fad and the next administrator or the next administration is going to come in and the whole thing is going to go away.

We would suggest that rigorous evaluation actually holds a key to bringing rapid progress to social policy and, in particular, development assistance and diplomacy. And I'm going to illustrate why we believe that with a few examples.

Now, for each of these examples, we're going to do this in an interactive way that I have never before attempted in a presentation. I'm going to give you some examples of interventions, mostly in development assistance that were rigorously evaluated and I'm going to ask you to guess whether you think they were found effective or not effective. And then I'll give you the results of the evaluation. Okay, so listen carefully, please.

An example from development policy: Six years ago, researchers conducted a rigorous evaluation in Indonesia of road construction projects to determine whether grassroots monitoring of the road projects would reduce corruption and waste.

So this study randomly assigned 600 villages in Indonesia that received funding for road projects to one group in which villagers were invited to village meetings where the project officials publicly accounted for how they were spending the funds and the villagers were encouraged to anonymously report corruption. Okay, so that was the treatment. It was grassroots monitoring in over 300 of these villages.

In the control group, they didn't do anything new other than the usual approaches, whatever they were, without grassroots monitoring to reduce corruption.

At the end of the study, the researchers measured the corruption and waste in each of the road projects by having a team of independent engineers estimate how much the projects should have cost based on the materials that were used and so on, and then compared that to what was actually charged for the project.
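The outcome measure described here, what was charged compared with what independent engineers say the project should have cost, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the dollar figures are invented, and expressing the gap as a share of the amount charged is one plausible choice, not necessarily the formula the researchers used.

```python
def excess_share(charged: float, engineer_estimate: float) -> float:
    """Share of the charged amount that the engineers' estimate does not account for."""
    return (charged - engineer_estimate) / charged


def mean_excess(projects):
    """Average excess share over a list of (charged, engineer_estimate) pairs."""
    shares = [excess_share(charged, estimate) for charged, estimate in projects]
    return sum(shares) / len(shares)


# Hypothetical (charged, engineer_estimate) pairs, invented for illustration only.
treatment_projects = [(100.0, 80.0), (120.0, 90.0), (90.0, 72.0)]
control_projects = [(100.0, 70.0), (110.0, 77.0), (95.0, 66.5)]

# Because villages were randomly assigned, the difference in group means
# is the estimate of the intervention's effect on waste and corruption.
gap = mean_excess(control_projects) - mean_excess(treatment_projects)

print(f"treatment mean excess share: {mean_excess(treatment_projects):.3f}")
print(f"control mean excess share:   {mean_excess(control_projects):.3f}")
print(f"estimated reduction:         {gap:.3f}")
```

The same comparison applies to either intervention in the talk: only the definition of the treatment group changes, not the way waste is measured.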

Okay, so that's the first -- grassroots monitoring was the first intervention that was rigorously evaluated. Here's the second one. And then I'm going to ask you to vote on their effectiveness.

Also from Indonesia road construction projects, another group of 600 villages were randomly assigned to a group -- one group where it was announced that all road projects, 100 percent, would be audited by the government compared to the control group where there wasn't increased auditing. There was the usual amount of auditing, which is about 4 percent of projects.

So the second intervention was increased auditing. And again, they measured the amount of waste or corruption at the end by having the team of independent engineers say how much the project should have cost and comparing it to what was charged.

All right, for the grassroots monitoring, how many people believe that was effective in reducing corruption and waste? Maybe -- okay, about five people here.

For the auditing, how many believe that was effective? Okay, about six or seven. Well, how many believe neither was? How many just aren't answering? Let me -- okay, it turns out the grassroots monitoring had no effect, no impact on the amount of corruption and waste. The intensive auditing reduced project corruption and waste by about 30 percent compared to the control group and more than paid for itself in reduced cost of the projects, in the money saved, which is an important -- a policy-important finding.

Okay, two more examples, one from U.S. domestic policy. In 1980, the Department of Labor in the United States launched a demonstration project where they provided a subsidy to employers to hire disadvantaged workers like welfare recipients. And the way that it worked was that the Department gave these disadvantaged -- the program gave these disadvantaged workers a voucher that they could hand to a prospective employer. And the employer could cash that voucher with the government -- it was a sizable voucher -- if they hire that worker.

This demonstration was set up as a randomized experiment where some workers got the voucher and a control group did not. They were left to the usual devices to find a job.

All right, effective -- how many people vote -- think this was an effective intervention, that it increased -- the employment rate of the treatment group, the voucher group, compared to the control group? Okay, about five. How many think it was not effective? About eight. And some people raised their hands a couple times.

As it turns out, it was -- it backfired. The -- at the end of the experiment, the control groups -- a 60 percent -- the workers in the control group were 60 percent more likely to have landed a job than the workers in the treatment group who got the voucher. Because what had happened was that the voucher had stigmatized the workers in the eyes of the potential employers. The employers saw them as damaged goods that needed a subsidy to be hired.

So the first thing that evidence-based approaches can offer is a way, as with the grassroots monitoring and the vouchers, to identify things that you think might work, that may be backed by expert opinion or, in the case of the vouchers, by straightforward economic theory, but which, when tested in the field with a rigorous evaluation, turn out not to work for reasons that hadn't been anticipated.

And based on the experience in every field where rigorous evaluations like this are conducted, it turns out that there are a lot of things that everybody thinks are going to work, but don't work. A lot of surprises.

Okay, and the last example I'm going to give and ask you to vote on is this example. This is a health example in India. One hundred and thirty-four poor, rural villages in India were randomly assigned to a group that received a reliable vaccination camp that was held once a month. The nurse showed up on schedule, it was well-publicized, and offered vaccinations to the children.

A second group that also receives -- a second group that also -- villages that also receive this reliable vaccination camp, plus a small incentive, a bag of lentils and some metal plates given to the families every time they brought a child in for a vaccination. It was worth about $3 total.

And a third group of villages that didn't receive -- was randomly assigned to a group that didn't receive any of the -- any new intervention.

How many people think that the camp by itself was effective, the vaccination camp? Okay, a couple folks. How many people think the camp plus the incentive was more effective than the camp? Okay, a lot of folks.

Well, both of you are right. This was really a blockbuster finding that was just published in the British Medical Journal. It found that the -- at the end of 18 -- at 18 months after the study started, 6 percent of the children in the control villages were fully vaccinated. 18 percent of the children, three times as many, in the reliable camp villages -- reliable vaccination villages were vaccinated. And 39 percent in the camp plus the small incentive were vaccinated. So that small incentive made a very large difference in the vaccination rate.
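The arithmetic behind "three times as many" can be checked directly from the three rates quoted. The short Python sketch below uses only those percentages; the dictionary keys and the `compare` helper are hypothetical names introduced for illustration.

```python
# Percent of children fully vaccinated at 18 months, as quoted in the talk.
rates = {"control": 6, "camp": 18, "camp_plus_incentive": 39}


def compare(group: str, baseline: str):
    """Return (percentage-point difference, ratio) of one group against another."""
    return rates[group] - rates[baseline], rates[group] / rates[baseline]


for group, baseline in [
    ("camp", "control"),
    ("camp_plus_incentive", "camp"),
    ("camp_plus_incentive", "control"),
]:
    points, ratio = compare(group, baseline)
    print(f"{group} vs {baseline}: +{points} points, {ratio:.2f}x")
```

Reporting both the percentage-point gain and the ratio helps here, because the camp alone triples a very low baseline while the incentive adds the larger absolute gain.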

So to bring it home here, part of what rigorous evaluation can offer is a way to identify at least a few interventions like this that have a very large, important impact on people's lives.

Between those two extremes that I've talked about, things that have been shown effective and things shown not effective, lies the vast majority of programs and strategies and interventions, probably 98 percent or more, where governments are spending billions of dollars -- millions, billions of dollars in some cases, and nobody really knows which of them work and don't work.

And so the central goal -- I've given these examples -- of many of the evidence-based reforms in the United States has been twofold. One is to increase funding for rigorous evaluations in order to grow the number of research-proven interventions. And that's the initiative that Dan -- an initiative that Dan is helping to lead at OMB and which he'll talk about.

And the second set of evidence-based reforms -- the second goal of many of the evidence-based reforms in the United States has been to bring strong incentives and assistance for program grantees, those who get the funds, to adopt what's been shown effective and put it into widespread use.

And we believe if you can do those two things, grow the number of proven interventions through rigorous evaluation, like the vaccination incentive, and create strong incentives for their widespread use, you could bring, you know, tangible success like we've had in medicine to areas of social and behavioral policy, including diplomacy and development assistance.

This is just a quick overview -- Dan's going to talk -- I believe is going to talk a little bit more about these -- of new evidence-based initiatives, mostly in domestic policy, where our organization has helped inform, and in some cases shape, the initiatives. One is the OMB-led government-wide evaluation initiative that Dan is leading.

A second is a number of initiatives in the United States to scale up what's been shown to be effective. And there are a number of new Obama Administration initiatives in this area that have recently been enacted into law. These are in domestic policy.

Although they are Obama Administration initiatives, several of these, including the home visitation initiative, and to a large extent, the rigorous evaluation initiative, respond in the -- or had -- or grew out of related initiatives in the Bush Administration under Robert's purview.

So now, let me just say a word about what kinds of rigorous evidence are needed to increase government effectiveness. In our advice, in our work with federal agencies and Congress and others on evaluation strategies, we advocate many different kinds of evaluation methods, including, for example, implementation studies to tell you whether an intervention like -- a strategy like a parent training program, for instance, is operating as it's designed to operate.

You know, are parents showing up for the training? Are the main elements of the training curriculum being taught? Does it change parents' behavior, that kind of thing?

We also support small-scale preliminary studies, both randomized and non-randomized, to identify promising interventions that merit evaluation in the more rigorous methods, the large random assignment studies that I just described. And we generally advise agencies to sponsor large, definitive randomized controlled trials of a program on a large scale when the program has been found these kinds -- only when these programs have been found in these preliminary studies to actually be well-implemented and promising.

But a central theme of our work, consistent with the recent National Academy of Sciences recommendation, is that evidence of effectiveness generally, to quote the Academy report, "Cannot be considered definitive unless it's ultimately confirmed in multiple well-conducted, randomized, controlled trials, even if based on the next strongest designs."

And let me just ask you -- I'm not going to go into detail here -- but why is -- what is it about random assignment that's important? I mean, why is it considered the strongest method, although there may be some good second bests?

Why is it considered the strongest method, generally, of establishing a program's effectiveness in medicine and education and other fields?

QUESTION: It produces the highest and it's also, all things equal, kind of has to even out the chance that (inaudible)?

MR. BARON: Yes, that's right. It's basically what you say. It's -- the technical term is reduces chance of bias, but also the process of random assignment ensures to a high degree of confidence that the two groups, a program group and a control group, are equivalent in all factors you can see, like age, sex, education, poverty background, and so on, as well as things you can't see and may not be able to control for with other methods, like people's motivation or creativity.

It's basically the law of large numbers and helps ensure that kind of confidence. Where a random assignment study is not possible, there are good alternatives -- well, second best alternatives.
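The law-of-large-numbers point can be illustrated with a small simulation. The sketch below is hypothetical throughout: it assumes an unobserved trait such as motivation is standard normal, splits the sample at random into two equal groups, and reports how far apart the group averages end up. The function name `imbalance` and the sample sizes are illustrative choices, not anything from the talk.

```python
import random


def imbalance(n: int, seed: int = 0) -> float:
    """Absolute gap in an unobserved trait between two randomly formed groups of n/2 each."""
    rng = random.Random(seed)
    # Hypothetical unobserved trait (for example, motivation), one value per person.
    trait = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Random assignment: shuffle, then split into program and control halves.
    rng.shuffle(trait)
    half = n // 2
    program, control = trait[:half], trait[half:]
    return abs(sum(program) / len(program) - sum(control) / len(control))


for n in (20, 200, 2000, 20000):
    print(f"n = {n:>5}: gap in average trait = {imbalance(n):.4f}")
```

With small samples the two groups can differ noticeably by chance; as the sample grows the gap tends toward zero, even though the trait was never measured or controlled for.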

Not -- can't get you to definitive evidence, but basically there are a number of studies that suggest that where you have two groups that are basically highly similar -- even though it wasn't random assignment that created the program group and comparison group, highly similar in key characteristics and are not formed -- generally not formed through self-selection.

It's not like the program group all volunteered for the program and the control group did not, because even though the two groups may look the same in their characteristics, in that situation, the program group -- the fact that they volunteered may indicate a higher level of motivation than the control group, which could account for the different outcomes between the two.

So studies like this which are sort of well-matched comparison group studies are a second best alternative when random assignment cannot be done.
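The self-selection problem can be made concrete with a simulation in which the program has, by construction, no effect at all. Everything in this sketch is an assumption made for illustration: outcomes are motivation plus noise, people with above-average motivation volunteer, and the function name `simulate` is hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(n: int = 20000, seed: int = 1):
    """Return (naive volunteer-vs-rest gap, randomized gap) when the true effect is zero."""
    rng = random.Random(seed)
    # Each person: (unobserved motivation, random noise). Outcome = motivation + noise.
    people = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

    # Self-selected design: the more motivated half volunteers for the program.
    volunteers = [m + e for m, e in people if m > 0]
    others = [m + e for m, e in people if m <= 0]
    naive_gap = mean(volunteers) - mean(others)

    # Randomized design: a coin flip, unrelated to motivation, decides who is enrolled.
    flips = [rng.random() < 0.5 for _ in people]
    program = [m + e for (m, e), enrolled in zip(people, flips) if enrolled]
    control = [m + e for (m, e), enrolled in zip(people, flips) if not enrolled]
    randomized_gap = mean(program) - mean(control)

    return naive_gap, randomized_gap


naive_gap, randomized_gap = simulate()
print(f"volunteers vs non-volunteers: {naive_gap:.2f}")
print(f"random assignment:            {randomized_gap:.2f}")
```

The volunteer comparison shows a large apparent effect that is entirely due to motivation, while the randomized comparison stays close to the true value of zero.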

I think my time is nearly done. So let me just offer one last thought here. I've talked about development assistance; how might one rigorously evaluate -- let me just say this. To my knowledge, a random assignment study has not yet been conducted in diplomacy. That doesn't mean it can't be done.

A few years ago, it hadn't been done in development policy either. Fifty years ago, you had the -- basically, the first randomized controlled trials in medicine.

So here's one -- and it can't be done in every instance for every type of diplomacy, but there's some cases where you -- I think they are clearly feasible. Suppose, for example, you wanted to evaluate a strategy -- public diplomacy strategy to undermine, you know, population support for terrorist ideology where that kind of ideology is widespread.

Some strategies that have been proposed are to fund schools that are run by moderates or to conduct media campaigns on the harm that terrorism causes innocent civilians. Those are a couple of strategies that might be rigorously evaluated.

One could do something similar to what was done in those development examples: Identify about 60 communities where support for violence against Americans exists; randomly assign half of them to receive the strategy, the media campaign, or whatever; the other half to a control group.

And then at the end -- end of the study -- measure support for terrorist ideology, for violence against the United States, in the treatment and comparison or control communities.
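The design just outlined, about 60 communities with half assigned at random to the strategy, can be sketched as follows. The community labels, the function names, and the idea of summarizing each community with a single survey score are hypothetical placeholders; no data are implied.

```python
import random


def assign(communities, seed: int = 0):
    """Randomly split the communities into equal treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(communities)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def difference_in_means(treatment_scores, control_scores) -> float:
    """Estimated effect: average end-of-study score in treatment minus control."""
    return (sum(treatment_scores) / len(treatment_scores)
            - sum(control_scores) / len(control_scores))


# Hypothetical community identifiers.
communities = [f"community_{i:02d}" for i in range(60)]
treatment, control = assign(communities)
print(len(treatment), "treatment communities;", len(control), "control communities")
```

At the end of the study, the survey scores collected in each group would be passed to `difference_in_means`; a negative value would indicate lower measured support in the communities that received the strategy.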

And that concludes my remarks. Hopefully, they offer some food for thought.

MR. SHEA: Now, Dan Rosenbaum with OMB.

MR. ROSENBAUM: Hi. First of all, I want to thank John. I think when they write the history of the kind of movement towards evidence-based policy, I think John will get one of the, you know, maybe one of the top billings in kind of that movement.

And before I talk a little bit about -- I want to talk a little bit about my background, which I think is actually somewhat helpful in thinking through some of my comments here.

I actually have only been in government for three years. I was a tenured economics professor three years ago and doing kind of statistical, kind of data-oriented things. And then went to the Council of Economic Advisers for a year. Liked it well enough that I actually gave up tenure to join OMB as a career civil servant.

And then on the side, I actually do statistical analysis for the Cleveland Cavaliers basketball team. So if you catch me yawning, it's because I have a lot of things on my plate these days with free agency coming up and LeBron coming to a decision soon.

And I think part of the reason I wanted to talk a little bit about that is because I think I like to picture what the Administration is doing and what the federal government is doing in the broader picture that I think empirical evidence is becoming more and more important in all kinds of different spheres.

Private industry is using evidence in a way that it hasn't in the past. I mean, data is so much cheaper to collect. Computer programs, software programs are more powerful. And so more and more people are getting used to using empirical evidence in kind of real important decision making.

I've been part of an industry that's been transformed by empirical evidence as part of the sports industry. Foundations are seeing empirical evidence as being more important. You know, things -- folks like the Poverty Action Lab are, you know, real leaders in kind of thinking about how to use empirical evidence. Organizations like the UN and World Bank are increasingly kind of using empirical evidence and evaluation.

And then certainly federal governments are playing a big role in state and local probably as well and not just in the U.S., but also in other countries as kind of Ruth mentioned in the previous talk. And again, this is not just something that the Obama Administration has started. This is something that has a long history that, you know, Robert kind of talked about.

And I'm not going to get into the whole history, but you certainly have the GPRA efforts in the 90s and then certainly the PART -- you know, I think there were lots of -- you know, there were pros and cons about the PART. But I think it really focused agencies a lot on kind of results and evaluation in some very, very helpful ways that have been very helpful in kind of the Administration's ongoing efforts.

So now, let me kind of branch into what the current Administration's doing. And first, I want to talk about, as kind of Ruth talked about in the last talk, this is a remarkably kind of data-driven administration.

I mean, it's remarkable for me as a kind of former academic economist how many of, kind of, the leaders of, you know, my field or other empirical, you know, research areas are actually running around these days as policy officials. And so they have a huge, pretty insatiable demand for kind of empirical evidence. And they're willing to change their minds based upon that empirical evidence.

And that's -- those are the ingredients for, you know, really kind of getting something done in terms of building a more kind of evidence-based policy on that -- and this goes all the way up to certainly the director of OMB.

I mean, we have economists -- pretty much empirical economists -- at the top two positions at OMB. And then the President himself has obviously -- has kind of made it pretty clear that he's pretty empirical in his own right. And so this is -- I mean, I think that it's something that's pretty central to what this Administration does.

And I guess I don't really have to say this since I'm not one of the policy officials. I mean, just as a career person, I mean, it seems pretty evident that evidence is pretty important to this group.

I think -- and I want to -- I'll talk a little bit more about the specific things that I work on in evaluation here in a minute. But I want to talk about other efforts that are kind of very related to this as well. I think Shelley Metzenbaum talked yesterday about a lot of the performance measurement efforts that are going on. And I think there are a lot of synergies between those efforts and what goes on in the evaluation efforts.

There's also sometimes -- they're not all synergies. There's some -- there are sometimes where they butt up or are at odds. But for the most part, I think those are efforts that are very, very helpful in the evaluation efforts as well.

There are a lot of efforts to kind of make administrative data, a lot of State administrative data, much more accessible for researchers and for evaluation purposes. And combined with that and related to that, there are kind of efforts in the statistical or survey areas.

Also, you know, obviously one of the first pieces of doing good evaluation is having good data. And so I think folks have recognized that. I'm going to talk a little bit about the tiered evidence approaches. And then I'll lastly talk about the Program Evaluation Initiative.

The tiered evidence-based approaches, I think, have been a very nice response to a lot of the political issues with trying to make political decisions more evidence-based. And what they've generally done is instead of having -- instead of forcing Congress to make a decision about which programs or which interventions are the ones that definitely work, they set up a structure, whereas -- where evidence is used to try to determine which programs will get the most funding. But then lots of other programs that may not have the strongest evidence will also get some funding, and then they'll be continually evaluated so that they have some hopes, if those programs actually do work, that in the future they could move up to a higher tier and get higher levels of funding.

I think this approach, you know, has been very good at getting around a lot of the political arguments when you have certain constituencies in favor of one type of program or one type of intervention. I think, you know, having them all put on this equal footing in one of these tiered approaches has been very helpful.
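The tiered structure can be pictured as a simple allocation rule. The sketch below is purely illustrative: the tier names, the budget shares, the equal split within a tier, and the program names are all hypothetical assumptions, not the rules of any actual initiative mentioned in the talk.

```python
from dataclasses import dataclass

# Hypothetical share of the total budget reserved for each evidence tier.
TIER_SHARE = {"strong": 0.60, "moderate": 0.30, "preliminary": 0.10}


@dataclass
class Program:
    name: str
    tier: str  # one of the keys in TIER_SHARE


def allocate(programs, budget: float) -> dict:
    """Split each tier's share of the budget equally among the programs in that tier.

    A tier with no programs leaves its share unallocated in this sketch.
    """
    grants = {}
    for tier, share in TIER_SHARE.items():
        members = [p for p in programs if p.tier == tier]
        for program in members:
            grants[program.name] = budget * share / len(members)
    return grants


programs = [
    Program("A", "strong"),
    Program("B", "moderate"),
    Program("C", "preliminary"),
    Program("D", "preliminary"),
]
before = allocate(programs, 100.0)

# Suppose a later evaluation strengthens the evidence for program C.
programs[2].tier = "moderate"
after = allocate(programs, 100.0)

print("before:", before)
print("after: ", after)
```

Under these assumed rules every program receives some funding, the best-evidenced tier receives the most, and a program whose evaluation results improve moves up and receives more in the next round.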

So they've been adopted in kind of areas of home visitation, teen pregnancy, the Social Innovation Fund, the Investing in Innovation Fund. I mean, these are mostly in kind of health and human services and education areas. But I think this is a model that is likely to get adopted in lots of other places.

And, you know, I think it's a model that may be very -- there may be places where it would be useful to adopt it in, you know, areas, you know, that the State Department, you know, would work with. So I'll talk now a little bit more about the program evaluation initiative.

This was something -- there was a memo that came out from the OMB director last October that outlined, kind of, three different prongs of this evaluation initiative. One was an online inventory of planned, ongoing, and recently completed evaluations. The idea here is if you make kind of evaluation activities known, in particular before the results come out, it's much harder for agencies to hide those results if they get results that aren't politically favorable. And so that's part of the idea.

And then, you know, and also I think they're making this online. It's just, you know, these are evaluations that have been paid for with taxpayer money. And so, you know, I think there's a certain obligation that the public get to see what the results of these evaluations are.

A second piece of this is putting together an evaluation working group that would be an interagency evaluation working group that would share -- that's sharing kind of best practices and, you know, across the various different agencies.

I mean, one of the things I think that folks already knew at OMB, but have kind of learned again is there's a tremendous amount of heterogeneity both across agencies and within agencies. And it's only kind of through, you know, talking to folks that I think we kind of learn how best it can deal with a lot of that heterogeneity. And I'll come to that probably in later on the talk.

The third piece was a -- ended up being approximately a hundred million dollars of new funding for evaluations or demonstrations and for building agency capacity for agencies to do evaluation. And this was new money that was kind of on top of regular agency budgets.

You know, the way that we ended up giving this money out to agencies was completely separate from the rest of the budget process. All of the proposals that came in went into a common pool. We had evaluation experts from across the Executive Office of the President look at these.

We used CEA pretty heavily, you know, in this exercise. And the idea here was to try to fund the best evaluations that kind of answered important, you know, politically relevant questions.

And so let me get -- kind of expand on that point a little bit more and talk a little bit about the -- kind of some of the objectives we're looking for in terms of what we're trying to kind of do with evaluation.

And again, with the backdrop of the recognition that there is a tremendous amount of heterogeneity across program types about how you're going to go about evaluating various different programs or interventions, and then just the capacity of different agencies to do evaluation, the willingness of various agencies to do evaluation. These are things that, you know, I think we are very, very cognizant of in kind of everything that we do.

I think one of the things that we really have tried to orient a lot of this around is I think the first place is to really start with the questions. You know, we're not a method out searching for some sort of program to evaluate.

The key issue is trying to figure out what are the important questions that need to be answered. And this is where a lot of the performance measurement activities are very synergistic, because a lot of what those efforts do are trying to identify what are the important kind of metrics, what are the important kind of goals, you know, that agencies have. And so this is where they are very synergistic.

And so I think one of the things here is it's not -- overall effectiveness of a program is not always the most politically actionable question. Sometimes programs are going to pretty much get funded regardless of what the evidence suggests. In those kind of cases, it's a waste of resources in a lot of ways to evaluate the overall effectiveness of the program.

In that kind of situation, it may make a lot more sense in a world of limited resources to evaluate one intervention versus another for that particular kind of program. And so -- and we're very cognizant of if this evidence turns out one way or the other way, is this actually going to influence important kind of budget or management or policy decision.

If that -- the answer's no on all three of those fronts -- we don't have unlimited evaluation dollars -- it would be better off to spend that money somewhere else. Once we start with a question, then the next thing is to kind of use the most rigorous research design for that question. And this is where the heterogeneity across programs and agencies becomes really important and there's no one kind of approach that's going to work in all kinds of situations.

But I think -- and then I think two other things that kind of inform -- I think, some of the research design choices that folks make are, you know, what kind of evidence is already out there on, you know, that question.

If the evidence is pretty thin and weak and there's very little there, it actually makes it a little bit more possible to consider a kind of wider range of research designs, because there may not be -- you know, you don't have to necessarily overturn because of some strong evidence.

On the other hand, if there's a lot of evidence out there maybe going in both directions, in those kind of cases, you're really probably going to have to have -- you know, you're going to have to invest in the most rigorous evaluation design that you can or else it's probably just going to be another study in a whole mass of studies and have very little influence on kind of what gets done.

The second thing, and it's kind of related to this, is how firmly set are people's minds, you know, kind of politically about the particular issue. If folks are pretty much dead set for or against on a particular issue, those are going to be cases where you have to get really, really, really strong evidence. You probably need an RCT or something like that or you're not going to be able to change people's minds.

But there's a lot of situations where, you know, maybe one intervention versus another or something along those lines where minds are not so strong -- you know, people don't have such strong priors. And there may be -- you know, may be situations where a wider range of evaluation designs might actually still be able to influence decisions.

Very cognizant in everything we're doing of kind of bang for the buck. You know, we don't want to be spending 20 million on evaluation for a program that's 30 million. You know, those kinds of things are very important and so that's something we thought about a lot.

Another thing I think that doesn't get enough focus -- I think there's a lot of focus on kind of whether or not evaluations are biased. Not enough focus sometimes on whether they have enough power or large enough sample sizes. And so -- and I think too often we funded evaluations with really low power and really small samples.

And so what we end up with is a situation going in where there's almost no chance to find statistically significant effects even if the program actually does work. And that's a disaster. I mean that is just a disaster, because what that does is the consumers of this evidence are not sophisticated enough to interpret that as we wasted our time and money doing that evaluation. They interpret that as the program has failed.

And that is a disaster, because it sets program offices and good-meaning people against evaluation efforts, because they think, you know, that we've set them up kind of to fail.

And so we're, you know, I mean, there's times where I've -- you know, we've been -- you know, we've actually told agencies not to do a particular evaluation, because we just felt like it was -- they were just basically setting that program up to fail.

There was no chance, unless the effect sizes were, you know, 20 times what were reasonable that there was going to be much of a chance to actually find anything, because, you know, there were -- you know, it was small sample size or really wasn't that big of a difference in intervention between the treatment and control group, you know, any of the issues that kind of relate to the power of an evaluation.
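The point about power and sample size can be made concrete with a rough calculation. What follows is a minimal sketch, not anything presented at the session: it assumes a simple two-arm design with equal group sizes, a two-sided test, and a normal approximation. The function names and the numbers used (a standardized effect of 0.2, and 50 or 400 participants per arm) are illustrative assumptions only.

```python
from math import sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def power_two_arm(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means.

    effect_size is the standardized mean difference (treatment minus
    control, divided by the common standard deviation). This is a
    hypothetical helper using a normal approximation.
    """
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is effect_size.
    shift = effect_size * sqrt(n_per_arm / 2)
    # Probability of landing in either rejection region.
    return (1 - _Z.cdf(z_crit - shift)) + _Z.cdf(-z_crit - shift)


def minimum_detectable_effect(n_per_arm, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with the requested power
    (same normal approximation, ignoring the negligible far tail)."""
    return (_Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)) * sqrt(2 / n_per_arm)


if __name__ == "__main__":
    for n in (50, 400):  # illustrative sample sizes per arm
        print(n, round(power_two_arm(0.2, n), 3),
              round(minimum_detectable_effect(n), 3))
```

Under these assumptions, a study with 50 people per arm has roughly a 17 percent chance of detecting a true standardized effect of 0.2, while 400 per arm gets to roughly 80 percent. That is the situation described above: a small study of a program that works will usually come back "not significant," and readers will take that as the program having failed.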

Another thing we're very cognizant of is trying to build evaluation capacity in the agencies. You know, there's a lot of different dimensions to that. You know, there's just relative levels of expertise. But there's also trying to build up the ability for evaluation offices to communicate with program offices or policy officials. I think there's a lot of efforts to try to empower evaluation offices, you know, as much as we can.

And then I think the other thing is, it's funny as a career person, I'm the one who ends up probably being the biggest one in the band box for this, but I actually think, you know, we really need to envelop this into kind of this broader perspective of, you know -- it's not just the folks out of the agencies, the folks in the executive office of the president of OMB that we need to change the minds of here.

You know, this really needs to be an effort where we're getting to Congress, we're getting to the decision makers that Congress listens to. And then we need -- because ultimately, you know, we can't just build it and they will come, you know, kind of thing in terms of have a lot of good evaluation evidence and then hope that it gets used.

Ultimately, this only works if Congress or maybe the folks that Congress listens to are kind of demanding this kind of evidence. And, you know, I know when I say this, people like start laughing at me. But I mean, I've been part of a sports industry that, you know, five, ten years ago, you know, folks laughed at folks who were, you know, trying to use statistical evidence to help folks make decisions.

And now, if you're not doing a little bit with statistical evidence, people are looking at you like you're not doing due diligence. And you're kind of behind the curve. And so, you know, I think -- you know, Congress is a much harder nut to crack than, you know, the sports industry, but I think, you know, the world does change.

And as more and more and more people get used in their everyday life to empirical evidence playing a bigger role in the decisions they make, you know, it's going to impact Congress as well. And so I think, you know, I think that's the broader place where all this fits into.

And so I guess the last point I want to make is I think this is obviously -- for me, this is a very exciting time to kind of be part of, you know, the Administration or, you know, part of the federal government, because I just think there's a lot of really exciting things going on in the kind of area of evaluation evidence. You know, and I'm even willing to argue that this is -- of my two jobs, this is actually the more exciting of the two. And that's obviously saying a lot.

So -- and they're actually two jobs -- I always joke they end up being really the same thing. I'm, you know, in one case I'm trying to get a sports team to use more empirical evidence. The other one, I'm trying to get the federal government to use more empirical evidence. There's a lot of synergies to the two of them. So I don't know if you folks have questions.

MR. SHEA: Can we squeeze in a few questions, Rex?

MR. PATTERSON: We have time for questions. I would ask those at the table. If you're going to ask a question, please turn on your table mic and do so. For those of you sitting at the chairs along the back of the room, we have a wireless mic we can pass around. So please wait for the microphone (inaudible).

QUESTION: First of all, thanks. Thanks to both speakers. My question is for John. One of your -- in one of your slides, you had a bullet that said that the large-scale rigorous randomized trials should be reserved, if I remember correctly, for programs that show definite promise, I think words to that effect.

Do you not advocate evaluating programs for which -- that show promise evidence doesn't already exist? I mean, why just limit evaluations to those programs that maybe are more guaranteed to come back with a good result?

MR. BARON: A couple thoughts on that. I think, you know, there are limited evaluation resources that have been mentioned. And it's important when -- we think it's important when an agency or a funder is thinking about where to invest those scarce resources to do it in a way that is most likely to build -- to produce a finding that something works and can have a big impact on people's lives and can influence policy decisions.

Empirically, when these evaluations have been done, there are a fair amount of null findings of no effects; in some cases it has had adverse effects. And findings of true effectiveness even for interventions that look promising tend to be the exception. They exist, but they are harder to find.

And so given that there are limited resources, the approach that we generally advocate is to focus sort of those large, more expensive randomized evaluations on programs that you know are well-implemented -- that should be a precondition, otherwise you're not going to find any impacts -- and also that are backed by sufficiently promising evidence in most cases.

On the other hand, there are other, you know, smaller scale studies, most short-term randomized trials, or smaller trials or comparison group studies and other methods that can be used where something doesn't have that kind of earlier evidence. And if it's found effective there, then it might be useful to test in the most definitive type of evaluation.

QUESTION: But it sounds like that's guaranteed to not produce anything other than good news. I mean, if there are well-established programs that people -- that may not be effective, isn't it just as important to find out what those are so that despite the intent not to have programs be whacked by the budget process if things don't turn out to be well?

If some programs truly don't work, isn't it just as important to discover that so that resources could be diverted to something a little bit more effective?

MR. BARON: Well, that's the usual discovery. That is the usual discovery when a rigorous evaluation is undertaken in almost any field. In medicine, a lot of things -- most things that are rigorously evaluated in a definitive trial are found not to work.

I -- from our standpoint, we advocate evaluation more as a -- less as an accountability measure to tell is our program working, not working, increase funding, decrease funding, less of that, and more as a tool to identify --

QUESTION: How to do things.

MR. BARON: -- the relatively few proven approaches that can really have a big impact on people's lives and could be scaled up to increase, you know, the government's overall effectiveness.

QUESTION: Okay, thanks.

MR. PATTERSON: I think he's -- Rob, quick question. Yeah.

QUESTION: Okay, hi. Kathy Newcomer from George Washington. Real quick questions to Dan. So are the evaluations online already? Or when will they be online? And is the Evaluation Interagency Taskforce working yet? And you know, are they like meeting orderly or whatever?

MR. ROSENBAUM: We're in the process of kind of sending out our requests to the agencies for the information that will be in the online inventory. So that's actually -- the top members of OMB administration have been kind of pushing me harder than probably I wanted to be pushed to get that done really, really quickly. But it's getting done very quickly.

We've met -- we haven't met with kind of what I think ultimately will constitute the evaluation working group, which I don't know if we're necessarily going to have.

I think we're going to -- that group is probably going to be something that gets defined from whatever the topic happens to be that we might be meeting about at any given time. But we have met a couple times with a steering committee of folks from the agencies that we're using to kind of help us figure out how to best work with kind of the wider group of agencies.

So that's -- both things are kind of ongoing -- you know, are kind of -- have kind of -- some of the very early starts of both have already started.

QUESTION: Graham Orrell, USAID. Are any of you or do you know anyone who is addressing the problem of sort of access?

At AID, we have the Development Experience Clearinghouse. It has about 200,000 documents on it. None of these are really searchable by Google. I think other donors are pretty much the same.

I don't see anyone that's sort of analyzing our evaluations or any evaluations by level of rigor so that when you do need to do an information search if you're developing a project or something, you know, you can easily get a library, but have very little sort of knowhow.

MR. ROSENBAUM: You know, I don't think we know of anything -- I mean, this is part of the motivation for the online inventory so that, you know, if we're not doing this or, you know, other groups have the ability to kind of look at these evaluation results, because part of this request would be, you know, links to ultimate reports that come out and things like that.

So I think that's the hope here is by having more transparency, there will be other groups that will be able to weigh in on some of this evaluation evidence so that more -- so it's more useful.

QUESTION: A quick question for everybody. You know, we're dealing with a surprising number of organizations, large organizations, billion dollar budgets, 10,000 FTE mainly in the regulatory and inspection areas of the government who tell us, number one, they were not interested in taking any of OMB's money, any of the hundred million dollars, because they were worried about what the results would be.

Number two, they are not conducting evaluations because they can't see how they would develop a control group given the nature of their business.

And number three, that planning and measurement is not linked to evaluation in any way. But what's your message for these kind of organizations? How do we start cracking this large segment of the government who is really ignoring this message?

MR. ROSENBAUM: You know, it's tricky. I mean, I don't think there's any kind of silver bullet here. I mean, first of all, some of the program evaluations that are part of -- you know, the evaluation initiative, actually were for regulatory activities. So some of that's not -- some of those arguments are actually not true. But some of it is.

I mean, there are activities that the federal government does that are much, much harder to kind of think about evaluating and thinking about how to set up some sort of reasonable counterfactual.

And so I think one of the ways in which we're trying -- I think we're trying to communicate this -- that this is not some sort of one-size-fits-all approach and that, you know, there are a lot -- like you mentioned, there are a lot of dollars attached to some of those programs. And if we -- you know, a lot of the non-empirical evidence that you use to make decisions there is probably pretty weak as well.

And so maybe we don't need the -- you know, maybe if it's not possible to have some sort of gold standard evaluation there, there may be other ways of evaluating those programs that still are as good or better than whatever sorts of evidence we're currently using to make decisions, you know, in those kind of cases.

So I think the broader strategy is -- I mean, I think a lot of this gets enveloped up in some of the performance measurement activities as well. But I think it's really just kind of communicating that empirical evidence is really important and then this flexibility to try to work with those agencies or subagencies, you know, that might have more difficult programs or activities to try to evaluate.

MR. BARON: Can I just add a brief item to that which is, you know, there are early movers, folks in, you know, different fields and different agencies that have -- and then there are later movers?

I think a couple things. One of the things that I think is going to -- hopefully will help make this irreversible, getting back to the question that was asked earlier, and make it more widespread are examples of proven effectiveness, things like the vaccination study, where a relatively simple change can make a big difference in people's lives.

There are a number -- there are a few examples like that now in social policy. Some of them have formed the rationale and the basis for the scale up initiatives that Dan talked about. And those are the things -- the same thing as doing healthcare. Everybody knows that medical research has produced enormous benefits, vaccines and so on.

Showing success, I think, is the key. Things that work in a rigorous study is the key to sort of political sustainability of the whole effort.

And finally, I'd just add, you know, sometimes people say, "Ah, you can't do a rigorous evaluation in this area. How could you possibly randomize?" Well, if you thought about maybe five, six years ago and you asked somebody, "Well, how in the heck could you do a large-scale randomized trial to reduce corruption and waste in developing countries," people would have thought, oh, you can't do that, and dismissed it.

Somebody went ahead and did it, those Indonesia examples I talked about. There are some examples in regulatory policy where it's been done. So a few examples showing that it has been done -- it's feasible -- can help to get others to do it.

MR. SHEA: Unfortunately, we're out of time. But we very much appreciate your coming to our session. Thanks.