# Massive Scale Data Mining For Education

By Greg Linden

November 10, 2010

Let's say that, in the near future, tens of millions of students start learning math using online software.  As they work, our logs fill with a massive new data stream: millions of students doing billions of exercises.

In these logs, we will see some students struggle with some problems, then overcome them.  Others will struggle with those same problems and fail.  There will be paths of learning in the data, some of which quickly reach mastery, others of which go off in the weeds.

At Amazon.com a decade ago, we studied the trails people made as they moved through our Web site.  We looked at the probability that people would click on links to go from one page to another.  We watched the trails people took through our site and where they went astray.  As people shopped, we learned how to make shopping easier for other shoppers in the future.

Similarly, Google and Microsoft learn from people using Web search.  When people find what they want, Google notices.  When other people do that same search later, Google has learned from earlier searchers, and makes it easier for the new searchers to get where they want to go.

Beyond a single search, the search giants watch what people look for over time as they do many searches, what they eventually find or whether they find nothing, where they navigate to after searching, and learn to push future searchers on to the more successful paths trod by those before them.

So, let's say we have millions of students learning math on computers.  Let's say we have massive new logs of what these students are doing and how well they are doing.  What would a big Internet company do with this data?  What would be the Googly thing to do with these logs?  What would massive scale data mining look like for students?

We could learn that students who have difficulty solving one problem also tend to have trouble with another.  For example, perhaps students who have difficulty with the problem (3x - 7 = 3) also have difficulty with (2x - 13 = 5).
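To make that concrete, here is a minimal sketch of how such a relationship might be measured.  It assumes a hypothetical log of (student, problem, struggled) records, and the sample data is invented for illustration.  It compares how often students struggle with the second problem when they struggled with the first against how often students struggle with it overall.

```python
from collections import defaultdict

def co_difficulty(attempts, problem_a, problem_b):
    """Compare P(struggle on b | struggled on a) with the baseline P(struggle on b).

    attempts: iterable of (student, problem, struggled) tuples (a hypothetical
    log format). Only students who tried both problems are counted.
    Returns (conditional_rate, baseline_rate), or None if there is no data.
    """
    by_student = defaultdict(dict)
    for student, problem, struggled in attempts:
        by_student[student][problem] = struggled

    both = [r for r in by_student.values() if problem_a in r and problem_b in r]
    struggled_a = [r for r in both if r[problem_a]]
    if not both or not struggled_a:
        return None

    conditional = sum(r[problem_b] for r in struggled_a) / len(struggled_a)
    baseline = sum(r[problem_b] for r in both) / len(both)
    return conditional, baseline

# Invented sample log, for illustration only.
attempts = [
    ("s1", "3x-7=3", True),  ("s1", "2x-13=5", True),
    ("s2", "3x-7=3", True),  ("s2", "2x-13=5", True),
    ("s3", "3x-7=3", False), ("s3", "2x-13=5", False),
    ("s4", "3x-7=3", False), ("s4", "2x-13=5", True),
    ("s5", "3x-7=3", False), ("s5", "2x-13=5", False),
]
conditional, baseline = co_difficulty(attempts, "3x-7=3", "2x-13=5")
print(conditional, baseline)
```

A conditional rate well above the baseline would suggest the two problems share a difficulty.  With real logs, we would want far more students per pair before trusting the gap.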

We could then learn clusters of problems that all will be difficult for someone to solve if they have the same misunderstanding of an underlying concept.  For example, perhaps many students who cannot solve (3x - 7 = 3) and similar problems are confused about how to move the -7 to the other side of the equation.
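One way this clustering might look, as a rough sketch: treat two problems as linked when largely the same students struggle with both, then group linked problems together.  The Jaccard similarity, the 0.5 threshold, and the sample problems and students below are all assumptions made for illustration, not a claim about the right method.

```python
from itertools import combinations

def cluster_problems(struggles, threshold=0.5):
    """Group problems whose sets of struggling students overlap heavily.

    struggles: dict mapping problem -> set of student ids who struggled with it
    (a hypothetical summary of the logs). Two problems are linked when the
    Jaccard similarity of their sets is >= threshold; clusters are the
    connected components of those links, tracked with a small union-find.
    """
    parent = {p: p for p in struggles}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in combinations(struggles, 2):
        union = struggles[a] | struggles[b]
        if not union:
            continue
        jaccard = len(struggles[a] & struggles[b]) / len(union)
        if jaccard >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in struggles:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())

# Invented sample data, for illustration only.
struggles = {
    "3x-7=3":  {"s1", "s2", "s3"},
    "2x-13=5": {"s1", "s2", "s4"},
    "x/2=4":   {"s7", "s8"},
    "x/3=5":   {"s7", "s8", "s9"},
}
clusters = cluster_problems(struggles)
print(clusters)
```

Each resulting cluster is a candidate for a shared underlying misunderstanding.  Deciding what that misunderstanding actually is would still take a closer look at the problems themselves.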

And, we could discover the problems in that cluster that are particularly likely to teach that concept well, breaking students out of the misunderstanding so that they can then solve all the problems they previously found so difficult.  For example, perhaps students who have difficulty with (3x - 7 = 3) and similar problems are usually able to solve that problem when presented first with the easier problems (x - 5 = 0) and (2x - 3 = 1).
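As a sketch of how we might look for such helper problems, the snippet below compares the success rate on a target problem for students who saw a candidate helper first against those who did not.  The per-student history format and the sample data are hypothetical.  This is only an observational comparison, so a gap here is a hint worth testing, not proof that the helper problem teaches anything.

```python
def helper_effect(histories, helper, target):
    """Compare first-attempt success on target with and without the helper first.

    histories: dict mapping student -> ordered list of (problem, solved) events
    (a hypothetical log format). Returns a pair of success rates
    (saw_helper_first, did_not), each None if no students fall in that group.
    """
    with_helper, without_helper = [], []
    for events in histories.values():
        seen_helper = False
        for problem, solved in events:
            if problem == target:
                group = with_helper if seen_helper else without_helper
                group.append(solved)
                break  # only the first attempt at the target counts
            if problem == helper:
                seen_helper = True

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else None

    return rate(with_helper), rate(without_helper)

# Invented sample histories, for illustration only.
histories = {
    "s1": [("x-5=0", True), ("3x-7=3", True)],
    "s2": [("x-5=0", True), ("3x-7=3", True)],
    "s3": [("3x-7=3", False)],
    "s4": [("3x-7=3", False), ("x-5=0", True)],
    "s5": [("x-5=0", False), ("3x-7=3", False)],
    "s6": [("3x-7=3", True)],
}
with_rate, without_rate = helper_effect(histories, "x-5=0", "3x-7=3")
print(with_rate, without_rate)
```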

Then, we could learn paths through clusters of problems that are particularly effective and rapid for students.  Teachers might think that one concept should always be taught before another, but what if the data shows us otherwise?  What if we reorder the problems and students learn faster?

We could even learn personalized, individualized paths for effective and rapid learning.  Some students might start on a generic path, show early mastery, and jump ahead.  Others might struggle with one type of problem or another.  Each time a student struggles, we will try them on problems that might be a path for them to learn the underlying concepts and succeed.  We will know these paths because so many others before struggled, some of whom found success.

As we experiment, as millions of students try different exercises, we forget the paths that consistently lead to continued struggles, remember the ones that lead to rapid mastery, and, as new students come in, we put them on the successful paths we have seen before.
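One simple way to sketch this experiment-and-remember loop is an epsilon-greedy strategy: mostly send new students down the path with the best mastery rate seen so far, and occasionally try another path to keep learning.  That strategy is my choice for illustration, not the only option.  The class, the path names, and the simulated mastery rates below are all hypothetical.

```python
import random

class PathSelector:
    """Epsilon-greedy choice among candidate learning paths (illustrative only)."""

    def __init__(self, paths, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng if rng is not None else random.Random()
        self.trials = {p: 0 for p in paths}
        self.successes = {p: 0 for p in paths}

    def mastery_rate(self, path):
        n = self.trials[path]
        return self.successes[path] / n if n else 0.0

    def choose(self):
        paths = list(self.trials)
        untried = [p for p in paths if self.trials[p] == 0]
        if untried:
            return untried[0]              # try every path at least once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(paths)  # explore: keep experimenting
        return max(paths, key=self.mastery_rate)  # exploit: best path so far

    def record(self, path, mastered):
        self.trials[path] += 1
        self.successes[path] += int(mastered)

def simulate(true_rates, students=1000, seed=0):
    """Run simulated students through the selector; true_rates are made up."""
    rng = random.Random(seed)
    selector = PathSelector(list(true_rates), epsilon=0.1, rng=rng)
    for _ in range(students):
        path = selector.choose()
        selector.record(path, rng.random() < true_rates[path])
    return selector

selector = simulate({"path_a": 0.3, "path_b": 0.6, "path_c": 0.5})
for path, n in selector.trials.items():
    print(path, n, round(selector.mastery_rate(path), 2))
```

Paths that keep leading to struggles end up with few students, while paths that lead to mastery get most of the traffic.  A real system would also need to personalize by student and handle paths whose value changes over time.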

It would be student modeling on a heretofore unseen scale.  From tens of millions of students, we automatically learn tens of thousands of models, little trails of success for future students to follow.  We experiment, try different students on different problems, discover which exercises cause similar difficulties, and which help students break out of those difficulties.  We learn paths in the data and models of the students.  We learn to teach.