At the NJEA Convention on Thursday, three members of the NJ State Board of Education met with teachers in a town-hall-style meeting. They fielded brief questions and tried, as best they could, to answer. You can probably imagine the most prominent topics – SGOs, PARCC, testing, opt-out, evaluations.

I sat and listened for a while, and then something about over-testing and PARCC clicked in my head. Earlier in that morning’s plenary session, Pasi Sahlberg spoke about some of the problems with American education. One thing he said, in reference to testing, was that Americans had this bizarre fixation with testing each and every student in order to measure achievement, rather than using some kind of population sampling. [If you missed the session, check out his book Finnish Lessons or this TEDx talk]

I had heard this criticism before (along with a proposed solution), though I can’t remember where. Either way, with Sahlberg’s talk fresh in my mind, I got in line to ask a question: if we want to know whether or not New Jersey’s students are up to par, why don’t we use sampling methodology to test part of the student population instead of wasting time and resources testing the entire population?

I wasn’t entirely satisfied with the response I got, and I think that was partly because the 30-second questioning format (along with my nervousness in front of the crowd) didn’t allow me to clearly articulate what I was suggesting. So now that I’m back home sitting on my couch drinking coffee, it seems like a good time to put those thoughts on paper with a little more clarity.

How and Why We Test Today

First, let’s take a look at how we test students today.

Starting this year, NJ will administer the PARCC assessment – with 9 total testing sessions – to students in grades 3 through 11. Why do we do this?

One thing that Acting Commissioner Hespe repeatedly said in his presentation and Q&A session Friday morning at the convention was that it would give us data to inform instruction. While assessment can be very helpful in informing instruction (think appropriate formative assessment), I am doubtful that the results of the PARCC assessment will be helpful in any way. The results are not timely, and frankly I can do a much better job checking in with my students on a day-to-day basis in my class.

Next up, we’ve got high-stakes consequences for teachers. We want to measure “teacher quality” through the proxy of test scores. The “theory” is that good teachers produce better gains in test scores than poor teachers, so some quants can derive a formula to assign value to each of us through our students’ test scores. In New Jersey, that statistical wizardry will produce SGPs that factor into teachers’ evaluations. It would take me all day to enumerate the reasons why this is foolish, but at the end of the day the simple truth is that a good evaluator who observes a teacher and talks with his or her students can make a much better judgement about that teacher’s quality.

Thirdly, we want to create high-stakes consequences for students. We need “graduation standards,” and we can’t give a diploma to just anybody. There are two problems with this. First, it is ludicrous to base a student’s graduation on performance on a single test. Scores vary from day to day, and some students performing near the cut-off will end up failing (a false negative) even though on a different day they could easily have scored above it. But more to the point, the state of NJ routinely ignores that rigorous graduation standard anyway. This year’s juniors can use a score of 400 on the SAT to demonstrate proficiency (a thoroughly mediocre score that I’m sure does not equate to a “passing” score on the more rigorous PARCC), and schools can submit portfolios for students who fail to pass any of the available tests. These portfolio appeals have consistently – like the SRA process before them – helped schools graduate students who could not pass the HSPA and are not, in the language of the Department of Education, college and career ready.

Finally, the state needs a way to gauge the effectiveness of its educational system. In order to make judgements and comparisons across our hundreds of school districts, it needs a simple quantitative measure. This is where standardized tests become useful. While an individual student’s score can be highly variable from day to day, that variability is simply noise in the data that disappears when you test enough students. For every student who has a bad day, another has a good day. Test enough students, and you’ll get pretty reliable data about how the student population as a whole is performing.
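That claim is easy to sanity-check. Below is a quick simulation (my own illustration with made-up numbers, not anything the state actually runs) showing how day-to-day noise washes out as the sample grows:

```python
import random

random.seed(1)

TRUE_MEAN = 500        # hypothetical statewide "true" average scale score
STUDENT_SPREAD = 40    # how much students genuinely differ from one another
DAILY_NOISE = 50       # good-day/bad-day swing for a single student

def observed_score():
    """One student's score on one test day: true ability plus daily noise."""
    true_ability = random.gauss(TRUE_MEAN, STUDENT_SPREAD)
    return true_ability + random.gauss(0, DAILY_NOISE)

for n in (10, 100, 1_000, 10_000):
    sample_mean = sum(observed_score() for _ in range(n)) / n
    print(f"n = {n:>6}: sample mean = {sample_mean:.1f}")

# The printed means drift toward 500 as n grows: for every student who
# has a bad day, another has a good day, and the noise cancels out.
```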

Of these four purposes, I begin with the simple assumption that standardized tests – like the PARCC – serve only the last one legitimately. They have no legitimate role in formative assessment, teacher evaluation, or student graduation. But they can reliably measure aggregate performance and help us make judgements and comparisons about how the system as a whole is functioning.

An Alternative Solution: Sampling Methodology

Ironically enough, when Acting Commissioner Hespe met with teachers in his own town hall meeting Friday morning, he demonstrated quite perfectly why we don’t need to test our entire student population.

In his presentation, Hespe purported to show that New Jersey was doing a good job educating its students, but not a good enough one. First, he showed us NAEP data that ranked New Jersey in the top five states nationwide. Sounds good, right? Then he showed us the proficiency breakdown in that data to explain that only 30% to 40% of New Jersey’s students were college and career ready (according to NAEP’s cut score). The data in the chart below comes from the NAEP report on the 2013 12th Grade Math assessment.

[Graph: NAEP results for 12th Grade Math, 2013]

Set aside any doubts you might have about test validity for the moment, and assume that the NAEP data is accurate and valid. Here’s the interesting thing about NAEP – it only tests a small fraction of the students in any given state. By drawing representative samples of students – through randomization and stratification – NAEP lets us draw fairly reliable conclusions about the entire state population without actually testing all of it.

For example, in 2013 NAEP administered a mathematics test across the country to 4th graders. In New Jersey, approximately 3,100 students out of a total population of 98,000 took the assessment. For the grade 8 test, 2,800 out of 102,000 students were assessed. For more info, see the summary data tables here.

By testing around 3% of our student population, we can measure performance across the state. And the sample is large enough to let us draw conclusions about various subgroups – by race/ethnicity, free or reduced-price lunch status, students with disabilities, English language learners, etc.
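If it seems implausible that roughly 3,100 students can stand in for 98,000, the math backs it up. Here’s a rough margin-of-error calculation; for simplicity I’m assuming a simple random sample and an illustrative 40% proficiency rate (NAEP’s actual stratified design does a bit better than this):

```python
import math

n = 3_100      # NAEP's 2013 grade 4 math sample in New Jersey
N = 98_000     # approximate total grade 4 population
p = 0.40       # illustrative proficiency rate

# Standard error of a proportion, with a finite population correction.
se = math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

# 95% confidence interval half-width, i.e. the margin of error.
moe = 1.96 * se
print(f"margin of error: +/- {moe:.1%}")   # roughly +/- 1.7 percentage points
```

In other words, testing about 3% of students pins down the statewide proficiency rate to within a couple of percentage points, which is plenty of precision for judging the system as a whole.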

So What Should We Do?

Now, the National Assessment of Educational Progress isn’t itself the solution to our problem. I’m not suggesting we drop all standardized tests and just take the NAEP every few years.

But it is a perfect road map for developing a better testing format for New Jersey (or any other state).

Let’s assume for the moment that the PARCC is a valid test and that it accurately measures students’ mastery of the Common Core State Standards. Big leap, I know, but put away the pitchforks for a minute and just go with it.

The test contains 9 segments per grade level, each one (and each question) presumably aligned to various standards. Just as there’s no need for every student to take the test, there’s no need for every student to take every part.

Using the same sampling methodology, draw a sample of 3% to 5% of New Jersey’s students to take each segment. Then use the same statistical wizardry to produce reports detailing which standards we are meeting and which we aren’t.
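Here’s a minimal sketch of how that draw might work. Everything in it – the 5% rate, the roster fields, the stratum definition – is a placeholder of my own, not an official design; in practice the psychometricians would choose the strata:

```python
import random

random.seed(2)

N_SEGMENTS = 9        # PARCC's 9 testing segments per grade
SAMPLE_RATE = 0.05    # 5% of students per segment (placeholder rate)

def draw_assignments(roster):
    """Assign each sampled student exactly one of the 9 segments.

    `roster` is a list of dicts like {"id": 17, "district": "Springfield",
    "subgroup": "FRL"}; the field names are placeholders. Grouping students
    into strata before sampling keeps every segment's sample representative.
    """
    strata = {}
    for student in roster:
        key = (student["district"], student["subgroup"])
        strata.setdefault(key, []).append(student)

    assignments = {seg: [] for seg in range(1, N_SEGMENTS + 1)}
    for members in strata.values():
        random.shuffle(members)
        # Very small strata may round to zero here; a real design would
        # handle that edge case explicitly.
        per_segment = round(SAMPLE_RATE * len(members))
        # Deal out disjoint slices so no student takes more than one segment.
        for seg in range(1, N_SEGMENTS + 1):
            start = (seg - 1) * per_segment
            assignments[seg].extend(members[start:start + per_segment])
    return assignments

# With SAMPLE_RATE = 0.05 and 9 segments, about 45% of students take one
# short segment each; the rest take nothing at all that year.
```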

You’ve got your benchmark of success and your measures of the achievement gap, and no single student has to take more than about 2 hours of a test in any given year. In fact, since nine segments at 3% to 5% each means testing only about a third to half of the student population in any given year, an individual student would sit for a test for only about 2 hours every two years, on average.

Why Should We Do This…?

I’m sure the state of New Jersey is in no rush to modify its testing regime to look like this. But it’s still worth considering all the benefits of such a plan, which I think are numerous.

It Provides Data. Say what you will about No Child Left Behind and the increased emphasis on testing, but one beneficial outcome is that we have more data about the achievement gap. Using a test format like the one outlined above would still provide this data, allowing us to draw conclusions about educational achievement and equity across the board, but it would do so in a more judicious and efficient way.

It Removes Stakes for Students. Standardized tests should never be used for high-stakes decisions – like placing a student in an honors class or deciding whether he or she graduates. With this format, no individual child has a test score. Instead, each child simply contributes to the overall picture of the state’s achievement. That means a lot less stress on students and less potential for administrators to misuse scores.

It Reduces Time Spent Testing. A common response from the BoE members on Thursday was that we’ve always had testing, so there’s nothing wrong with testing. True, but… the aggregate time spent testing students – and not teaching them – has increased consistently. This system is simply much more efficient. It collects data as quickly as possible, leaving as much time as possible for teaching and learning.

It Reduces Technology Needs. Another criticism of PARCC is that the computer-based nature of the test requires immense investments in technology. I for one would love to see every student have a laptop and/or tablet, but we don’t live in a magical world where education funding falls from trees (or where WiFi networks are stable). By reducing the size of the population being tested in any given school, you drastically reduce the need for technology.

It Reduces Testing Costs. The PARCC assessment costs approximately $30 per student. Statewide, if we are going to test about 100,000 students in each of grades 3 through 11, that’s a total of about $27 million (900,000 students x $30/student). In my school, with about 1,200 tested students, that’s around $36,000. Cut the testing volume to 5%, and you’re instead talking $1.35 million statewide, or $1,800 for my building. I’m pretty sure I can find a better use for the other 95% of those dollars – like hiring more teachers or paying into my pension fund.
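For the skeptics, here’s that arithmetic spelled out (the $30 figure is the approximate per-student cost cited above, and I’m assuming costs scale linearly with the volume of test segments administered):

```python
COST_PER_STUDENT = 30         # approximate PARCC cost per student
STUDENTS_PER_GRADE = 100_000  # rough NJ enrollment per grade
GRADES_TESTED = 9             # grades 3 through 11

current = COST_PER_STUDENT * STUDENTS_PER_GRADE * GRADES_TESTED
sampled = current * 0.05      # test 5% of the volume instead of all of it

print(f"current statewide cost: ${current:,}")               # $27,000,000
print(f"sampled statewide cost: ${sampled:,.0f}")            # $1,350,000
print(f"freed up:               ${current - sampled:,.0f}")  # $25,650,000
```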

At the end of the day, it just makes sense. It solves a real problem – collecting data about student achievement – in a statistically reliable but efficient way. But alas, I doubt I’ll ever live to see the day that the education “reformers” decide to do something this sensible.