Sunday, October 16, 2005

On Grade Inflation

Two unrelated events have called my attention to grade inflation recently. One was a post on Grinnell Plans from a current student who had seen a chart demonstrating the upward drift of Grinnell's grades over the last decade. Essentially, the overall mean GPA has risen from roughly a B grade to roughly a B+ grade--a large change for such a short time, and as I understand it, a fairly typical change over the same stretch in many colleges and universities. The second was a detailed post by Steven Willett on NASSR-L, an email list populated by a couple of thousand people interested in Romanticism, mostly graduate students and professors in the field. Willett is a contrarian and a traditionalist who frequently attacks the state of his profession on the list; in this post, he resisted arguments minimizing the existence and consequences of grade inflation by citing a range of studies on the issue. One of those studies caught my attention because it resisted the moralizing I find tiresome on both sides of inflation debates and offered some insight into the mechanisms of grade inflation. This is Willett's quotation of the summary of that study, by Donald G. Freeman, published in 1999 in the Journal of Economic Education:

"My hypothesis is that, given equal money prices per credit hour
across disciplines, departments manage their enrollments by 'pricing'
their courses with grading standards commensurate with the market
benefits of their courses, as measured by expected incomes.

"I analyzed grade divergence using a cross-section of 59 fields of
study from a recently published survey of college graduates by the
National Center for Education Statistics, A Descriptive Summary of
1992–93 Bachelor’s Degree Recipients: 1 year later (NCES 1996). The
survey tracks 1992–93 college graduates to determine outcomes from
postsecondary education, including returns to investment in
education. Using this sample, I found evidence consistent with the
economic explanation of grade divergence: Graduates from high-grading
fields of study have lower earnings than graduates from low-grading
fields of study. This is true even when controlling for factors such
as student ability and experience" (344-45).

Fascinating! Other bits from Willett's post (drawing on other sources) flesh out some of the details underlying this hypothesis: music and education departments tend to give particularly high grades, for instance, and the latest wave of grade inflation has affected the humanities more than the hard sciences, but English and biology in particular more than mathematics. It seems to me that the place of education among particularly high-grading disciplines deserves a good deal of consideration--and has perhaps received consideration that I simply haven't read. I'll extend that disclaimer to what follows; my speculations may be supported or contradicted by research I don't know. This isn't one of the books I'm writing.

So here's a starting point. Grade inflation is real, across the board in higher education. Giving higher grades produces higher evaluations for teachers when other factors are controlled (as other studies show). Grade inflation varies by discipline. And grade inflation comes in spurts, one of which occurred roughly around 1970 and one in the last ten years.

I find Freeman's hypothesis--that departments whose majors generally earn little money compensate by awarding high grades--fascinating and largely supported by my intuitions. However, I am prompted to look for further explanations for three reasons. The first is a bad reason: Freeman's hypothesis does not match how I've seen professors talk about their grading. I call this a bad reason because of the obvious potential for self-deception or deceptive self-marketing here. The second is that I know some exceptions to the rule off the top of my head: when I was at Penn, the ultra-prestigious Wharton School (business) had a reputation for giving high undergraduate grades, and indeed, a web search confirms that its introductory course has a mandated median grade of B+ in each section--especially high for an introductory course, where grades are generally lower than in advanced courses. Similar cases abound in related areas, such as the most prestigious law schools, where students with the highest expected earnings get very high grades. The third is that the logic of expected earnings does not apply to institutions; the most prestigious colleges and universities, whose graduates have the highest expected salaries, have experienced grade inflation along with everyone else. For all these reasons, I suspect Freeman is largely correct but that other factors are also in play.

(Side note: I feel no professional self-interest in this issue. My grades are a little lower than average for Grinnell, as I suspect my department's are, and student comments about my grading reflect that. I am neither an apologist for today's grading levels nor an indulger of nostalgia for yesterday's lower ones. I do want to understand how and why my profession employs grades.)

I offer three hypotheses about those other factors:

1. The growing emphasis on revision allows students in some courses to receive higher grades given the same talent, application, and academic standards. I claim no original insight here, but I mention this factor because so many discussions of grade inflation assume that higher grades must imply better student work or lower academic standards. Allowing students to earn higher grades through revision, however, allows teachers to award higher grades while still feeling that students have received honest feedback on their work. Since many pedagogical studies support the learning outcomes of revision-based writing, this can produce a kind of guiltless grade inflation. I'll come back to this point.

2. Elite colleges and universities can use grade inflation to shift employers and graduate schools from statistical evaluations of transcripts to a self-serving prestige market. If every college and university enforced a strict 2.0 median grade, evaluators would compare transcripts by using implicit prestige adjustments--perhaps a 2.5 GPA from a highly selective institution would be roughly equivalent to a 3.1 at a less selective institution. I've seen the application of this kind of unofficial adjustment many times. If practically everyone graduates from Harvard with honors, however (as is the case), then Harvard has created a situation where most of its students cannot be outperformed in transcript reviews. Shifting all grades close to 4.0 forces evaluators to discount grades themselves, thus increasing the importance of institutional reputation. Harvard has a great deal to gain from grade inflation, and less selective institutions can only play along--if UMass intentionally lowers grades as Harvard inflates them, UMass only hurts its graduates even more relative to Harvard's. Colleges and universities that have the highest stake in maintaining the importance of institutional prestige also have a strong incentive to keep overall grades high. And the least selective institutions face pressure to keep marginal students enrolled (to maintain government support based on enrollment levels).

3. The recent inflation of grades coincides with a significant weakening of tenure. Most college courses are now taught by people who are not tenured or tenure-track. Teachers who are untenured but on the tenure track (including me, for whatever that's worth) may feel some pressure to use high grades to raise the level of student evaluations, but that pressure is limited by the relatively large sample of evaluations and the many other factors that go into tenure reviews. I would find a reputation for low standards much more dangerous to my tenure prospects than slightly lower average teaching evaluations. I know circumstances vary, but I think the key here is graduate and adjunct teachers whose piecework employment depends heavily on the student evaluations of any given semester. Such teachers often see their professional lives in the hands of administrators unconstrained by full review processes, administrators who need to care a great deal about student and parent satisfaction and not as much about teachers' other contributions to their institutions and professions. If grade levels are a small but significant factor in student evaluations of teaching, piecework teachers are under enormous pressure to give higher grades out of real or perceived self-preservation.

Taking all these factors into consideration, I offer my own hypothesis about the grade inflation of the last decade. We are seeing the confluence of multiple, independent incentives that all point in the direction of higher grades: a dramatic increase in reliance on teachers with tenuous employment, defensible mechanisms of raising grades without changing underlying standards, and institutional incentives for every kind of institution to keep overall grades high.

Tuesday, October 04, 2005

Writing on Research Leave

For scholars who want or need to publish their research findings, no question produces more opinions, self-doubts, and superstitions than this: given the demands of full-time teaching and personal or family life, how do you get the writing done? I imagine the same kind of question applies to anyone who tries to get long-term projects done when those projects compete for time with smaller, deadline-driven tasks.

Most commentary on this question runs along the lines of Tara Gray’s findings in Publish & Flourish: write daily for 15 to 30 minutes and share your progress with someone, says Gray. Her research (and other similar research) backs up the idea that writing a little at a time consistently will produce more than writing in isolated big blocks of time. In a non-academic context, Jeff Covey’s idea of the Progressive Dash relies on many of the same principles. Start the day with a minute or two of attention to all your priorities, says Covey, and then allocate more and more time to them as the day goes along. Both Gray and Covey depend on the notion that simply keeping some kind of momentum is more than half the battle of completing projects. My experience certainly confirms those conclusions, though I have not always been able to live out my convictions and write during heavy teaching semesters.

Right now, however, I want to answer a different question that grows out of the luxury of having a year to write without teaching: given the demonstrated effectiveness of doing a little research work every day, how can we apply the same ideas to making the most of a dedicated stretch of research time?

Early in my year of leave, I’m trying to learn from my experiences as they come to maximize my writing during the rest of my leave. The past two weeks really set me thinking. In the first, I got some work done, but my baby got sick in the middle of the week, a few other complications (logistical and psychological) set me back, and I didn’t do as much as I wanted to. In the second, I had perhaps the most productive week I’ve ever had in the U.S., working through and taking the notes I needed on about 5,000 pages of commentary. (As any researcher will understand, that doesn’t mean I read 5,000 pages carefully. I went through a stack of material, found what I needed, and paid close attention to those sections.) The work happened as I was taking care of my baby son—while he was sleeping, sometimes while he was playing happily in front of me, and while he was at day care for three or four hours a day.

I’ve worked that well in London before, spending long days at the British Library. Other researchers tell me of similar experiences, where going to a new place for a research trip allows them to overcome their usual limits. So why did the home routine suddenly become as effective as a remote archive for me last week?

My guess, oddly, is that the baby—crucially, during a relatively happy and healthful week—substituted for the plane trip to London. Archival trips work so well, I propose, because they give a scholar a sense of time being both abundant and scarce: abundant in that days are set aside entirely for research, scarce in that the scholar knows that the demands of daily life will wake from their sleep in a few days or weeks. The same combination of abundance and scarcity could explain the effectiveness of the day-by-day approaches I cited above: writing 15-30 minutes a day takes the pressure off any given writing session, since writing days become abundant when one writes every day, but the brevity of each session enforces scarcity—at the end of such a short session, a writer almost always wants to say more.

A research leave provides a sense of abundant, unbroken time for writing. That’s a luxury but a frequently overwhelming one because of the tendency to write to exhaustion. It’s easy to write until you’re sick of writing—and therefore to feel sick of writing most of the time you’re not doing it. The baby takes care of that problem by providing major and largely unpredictable interruptions to my day. When I stop writing, I stop because I have to, and I generally want to do more. I keep thinking about the work, sometimes scribbling an idea with one hand while holding the baby in the other. Even though I can end up spending 6-10 hours on my work, the baby ensures that time feels scarce anyway.

I don’t recommend baby care as a productivity enhancer, but I do think some of what happened in that exceptionally smooth week might carry over into a more general approach. Keeping a sense of other priorities, writing in short sessions, having expected but somewhat unpredictable interruptions, and generally avoiding the feeling of quitting from exhaustion might contribute to maintaining the productivity of short writing blocks in the more open-ended context of research leave. I’ll be thinking about this more as the year progresses.

Monday, October 03, 2005

Baseball MVP talk: quality, value, and chance

For a starting point, I'll take this column by Sean McAdam supporting David Ortiz over Alex Rodriguez for MVP in the American League.

Now, this is an unusually stupid column. A writer who says that "it's impossible to imagine that anyone could be more valuable to his team than David Ortiz is to the Boston Red Sox" is simply not taking language seriously. Sadly, however, the column does seem to reflect the level of thinking among most writers who explain their votes--and the writers elect the MVP.

First, I'm going to articulate what I think would be the traditional "stathead" position on McAdam's column, a position I support almost entirely, and then I'll explain a complication I've come to consider in the statheaded approach.

The most fundamental problem with McAdam's argument is that he's using statistics as an advocate rather than as an analyst. He cites a hodgepodge of stats, ranging from those that do a good job of measuring individual hitting production (slugging percentage) to traditional triple-crown stats that have long been shown to be lacking because they depend on teammates' performance (RBI) and exclude important information such as a hitter's walks and doubles. McAdam's standard is simply to cite the evidence that makes Ortiz look good. One name for that approach is intellectual dishonesty. Another is sports opinion journalism.

The problem is not that some sports opinion writers say thoughtless things or twist evidence to make their cases. They are paid to generate readership (or viewership), and partisan columns can serve that purpose well. But the need for a writer to present an original angle in a debate is directly at odds with the writer's function as a voter in the awards race. To analytical purists, the awards would ideally reflect the application of the best analytical practices we know of; thoughtful people can disagree about the details of the standards, but they must agree that an even-handed account of available evidence is the only reasonable starting point. But sports opinion writers can't do that, for reasons I'll return to.

Baseball offers analysts more objective evidence about individual performance than other sports do. In football, the performance of running backs depends on that of everyone else on the team--the rest of the offense has to create running opportunities, the coaches need to call running plays, and the defense needs to maintain control of the game to avoid a desperate pass-based comeback attempt. Baseball's pitchers and hitters, however, are almost entirely on their own, and the team-based elements of their performance are fairly easy to recognize and disregard in the data generated by baseball's uniquely long seasons. Therefore, statheads say that we can and should factor out statistics that depend on team performance (pitchers' W-L records, hitters' RBIs and runs scored) and test measures of individual performance based on their demonstrable effectiveness.

For hitters, the quick statheaded way to account for nearly all of offensive production is to add on-base percentage and slugging percentage together to create a stat called OPS, for "on-base plus slugging." As it happens, this year's MVP race is a no-brainer by that standard: Rodriguez led Ortiz easily in on-base percentage (so McAdam didn't mention that stat), and he also overtook Ortiz in slugging at the very end of the season, finally leading Ortiz in OPS, 1.036 to .999.

If Ortiz were a valuable defensive player, his contributions could still justify an MVP award, but, of course, defense is also in Rodriguez's favor, as he played a solid third base every day while Ortiz did not take the field. Because defense hurts his argument, McAdam writes, "Defense has never been much of a factor in MVP voting. If it were, Ozzie Smith, Mark Belanger and Bill Mazeroski would have been serious contenders. They weren't." But this is patent silliness: it's simple and accurate to say that hitting is more important than fielding, but fielding still counts for something--especially when one player plays a skill position, allowing his team to pack more offense into its lineup, and the other clogs the DH hole, robbing his team of offensive flexibility.

For all these reasons, Rodriguez clearly had the better individual season, and the fact that I like the Red Sox and Ortiz better than the Yankees and Rodriguez won't change that. A good stathead applies the same standards every year and knows why those standards are better than others. By those standards, the MVP is A-Rod's, hands down. And the infuriating problem with the situation is that his case will be damaged because it's too easy to make: Rodriguez was widely considered the best player in the AL before the season started, and he played better than anybody else. Nobody's going to attract readers with that storyline. And that's why I believe that sports journalists should be stripped of their voting power; the conflict of interest is too great to overcome when voters explain their logic in print for money.
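
A quick aside for readers who want the OPS arithmetic spelled out. Here's a minimal sketch in Python; the on-base and slugging formulas are the standard ones, but the stat line in the example is invented for illustration, not anyone's actual 2005 numbers.

```python
# OPS = on-base percentage + slugging percentage.
# The stat line below is invented for illustration.

def obp(h, bb, hbp, ab, sf):
    """On-base percentage: times on base per plate appearance."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def slg(singles, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at-bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab

# A hypothetical hitter: 180 hits (100 singles, 40 doubles, 1 triple,
# 39 home runs), 90 walks, 5 HBP, 550 at-bats, 5 sacrifice flies.
on_base = obp(h=180, bb=90, hbp=5, ab=550, sf=5)
slugging = slg(singles=100, doubles=40, triples=1, hr=39, ab=550)
print(round(on_base + slugging, 3))  # prints 1.039
```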

Now here's a twist, where I'm going to diverge a little from statheaded methods. I've addressed the distinction between individual and team-dependent stats, but there's a third category: situational stats, which, for hitters, generally measure performance in "clutch" situations, variously defined: in the pennant race, at the ends of close games, with runners on, and so forth. Some such stats are easily dismissed: in a one-run game, a home run in the first inning is not less valuable than a home run in the ninth, even if the latter is more memorable. The more interesting question is how we should evaluate a single that drives in two runs versus a single with two outs and nobody on.

The statheaded approach, grounded in a lot of careful analysis, has been to contend that the two singles should count the same. At the major league level, hitters do not seem to have special "clutch" abilities; good and bad clutch performance in a given season seems to result mostly or entirely from chance variations rather than special psychological characteristics. If two hitters have similar seasons and one happens to drive in more runs (because of timely hitting rather than more opportunities), statheads say that the difference essentially doesn't count because the hitter could not control it. You shouldn't get credit for luck.
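
To see why I say chance, here's a toy simulation of my own (not the method of the studies I'm alluding to): give a hitter whose true ability is .300 in every situation 120 "clutch" at-bats per season, and watch how much his clutch average bounces around from season to season by luck alone. The .300 ability and 120 at-bats are invented parameters.

```python
import random

# A hitter whose true ability is .300 in every situation gets 120
# "clutch" at-bats per season; any spread in his season-by-season
# clutch averages below is pure chance.
random.seed(42)  # fixed seed so the run is reproducible

TRUE_AVG = 0.300
CLUTCH_AB = 120

season_averages = []
for _ in range(10):
    hits = sum(1 for _ in range(CLUTCH_AB) if random.random() < TRUE_AVG)
    season_averages.append(hits / CLUTCH_AB)

print(" ".join(f"{avg:.3f}" for avg in season_averages))
# Typical runs span roughly .230 to .370 -- gaps that sportswriters
# would happily narrate as choking or clutch heroics.
```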

And that was my position, without reservations, for a long time. But about four years ago, in research summarized here, Voros McCracken introduced what he calls DIPS, based on a compelling thesis that pitchers can control a few factors consistently (strikeouts, walks, and home runs allowed), but the number of fair balls that drop for hits against them is largely random. The details are beside the point here; the short version is that McCracken introduced the idea that we can separate a pitcher's performance from his results: if two pitchers each allow four runs per nine innings (and all else is equal), McCracken's method might tell us that one of them was lucky and one unlucky--they had the same results, but one pitched better.
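
To make that separation concrete, here's a rough sketch of a defense-independent estimate. The weights follow the commonly used FIP formula, a later descendant of McCracken's work rather than his original DIPS method, and the two stat lines and league constant are invented for illustration.

```python
LEAGUE_CONSTANT = 3.10  # assumed; calibrates the estimate to a league's run environment

def fip(hr, bb, k, ip, constant=LEAGUE_CONSTANT):
    """Estimate runs allowed per nine innings from the outcomes a
    pitcher controls: home runs, walks, and strikeouts."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant

# Two hypothetical pitchers with identical results (4.00 runs per
# nine innings over 200 innings) but different underlying performances:
lucky = fip(hr=25, bb=75, k=90, ip=200)     # weak peripherals
unlucky = fip(hr=12, bb=40, k=180, ip=200)  # strong peripherals

print(f"lucky:   actual 4.00, estimated {lucky:.2f}")
print(f"unlucky: actual 4.00, estimated {unlucky:.2f}")
```

The first pitcher's estimate comes out well above his actual 4.00 and the second's well below it: identical results concealing different performances.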

This insight is extremely valuable to people investing in baseball players for the future--you want the guy who really pitched better on your team next year, not the guy who got lucky. The consequences of this approach raise a troubling issue for individual awards based on the past, however: these two pitchers were, demonstrably, equally valuable to their teams, but we can reasonably say that one of them pitched better. And the logic underlying everything I said above is that being better and being more valuable are the same. By traditional stathead logic, in which we credit players only for achievements stripped of demonstrably random effects, we could give Cy Young awards based on normalized hypothetical results for pitchers rather than what opposing hitters actually did against them.

I'm not ready to do that, so to be consistent, I must entertain this question: if David Ortiz was blessed by fate in ways that enabled his performance to benefit his team because of chance, should he get a little credit for that? By McCracken's logic, I'm giving that kind of credit every time I compare pitchers by ERA.

Honestly, I still don't want to give Ortiz bonus points for pleasing Fate, and I certainly don't think such credit should overcome a clear-cut MVP choice like that of Rodriguez over Ortiz. But I do think our new insights into evaluating performances separately from the results they produce raise serious theoretical questions about statistical analysis of sports performance.