Monday, July 08, 2013

Free as in Beer: My Recent MOOC experience

I last sat in a conventional classroom in December of 1994. I was taking a Topology Final as a Ph.D. student (Mathematics, if you don't recognize 'Topology') at IU. I had already decided to quit the program and go find a job outside of the academic world, so why I was there is anybody's guess. I think I worried that an especially crappy grade would haunt me for the rest of my life.

So, for nearly 20 years, I've been educated either through my own initiative or via training courses for work. I've attended the occasional workshop, like a recent one at Bloominglabs about using a lathe and a mill. I've been completely outside of the conventional college world of classes and tests. Eventually I stopped having bad dreams about being late for a Final Exam, not knowing exactly where it was being held, and not having studied or even attended classes. Once I started running in races, these were replaced by anxiety dreams about showing up late for a race.

Two months ago I decided to try out a MOOC, specifically 'Intro to Data Science' on Coursera, taught by Professor Bill Howe of the University of Washington. As somebody with a (very distant) math past who works with databases all day, I was drawn to learning about technologies I don't get to use during the workday, like MapReduce, Hadoop, and scikit-learn (the machine learning toolkit for Python). It was also a chance to venture into a classroom setting without paying big money like some friends who've decided to go for MBAs (personally, I have zero interest in pursuing an MBA, but I did admire their dedication and devotion to keeping up with the challenge).

Even though the class was free, and I was one of 70,000 people who signed up, the idea of not doing well for whatever reason did provoke some anxiety. For one assignment, students submitted code which was then run by the autograder, nicknamed 'Darth Grader' by students on the forum. It suffered under the load, and there was often a long wait before getting results like 'you didn't calculate a value for @JonasBrothers' (because I had removed punctuation, including the very meaningful '@' symbol, from the Twitter data). One night I had a new anxiety dream for the MOOC era, where I refreshed my browser to find my scores had all been accidentally converted to zero by the autograder.
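For the curious, the '@' mishap can be sketched in a few lines of Python. This is not my actual submission, just a minimal illustration: the function names and the choice of which characters to keep ('@', '#', and '_') are my own assumptions, not anything the assignment specified.

```python
import string

# Characters that carry meaning in tweets and should survive cleaning.
# This particular set is an assumption made for illustration.
KEEP = "@#_"

# Translation tables: one deletes ALL punctuation, one spares the KEEP set.
_STRIP_ALL = str.maketrans("", "", string.punctuation)
_STRIP_SOME = str.maketrans(
    "", "", "".join(ch for ch in string.punctuation if ch not in KEEP)
)


def naive_clean(term):
    """Remove every punctuation character, including '@' (the mistake)."""
    return term.translate(_STRIP_ALL)


def careful_clean(term):
    """Remove punctuation but leave Twitter-significant characters alone."""
    return term.translate(_STRIP_SOME)


if __name__ == "__main__":
    print(naive_clean("@JonasBrothers!"))    # the mention loses its '@'
    print(careful_clean("@JonasBrothers!"))  # the mention stays intact
```

With the naive version, a mention and an ordinary word become indistinguishable, which is presumably why the grader could not find a value for the term it expected.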

One Sunday I realized in the middle of the day that a quiz about MapReduce was due that afternoon. This stress was compounded by the fact that the service we were supposed to use, JSMapReduce, was suffering under the load much as Darth Grader had a few weeks previously. The discussion forums were lifesavers, and some people suggested just running the job on your own machine using a Python library that simulated MapReduce (I say simulated because everything was being run in one process, pretty much missing the whole point of MapReduce, which is to split a load over a crudload of servers).
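To give a flavor of what 'simulated' means here, below is a rough single-process sketch of the MapReduce pattern in Python. It is not the library from the course; the names run_local, mapper, and reducer are hypothetical, and the word count job is just the usual toy example.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in a document."""
    for word in text.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce step: add up all the counts collected for one word."""
    return word, sum(counts)


def run_local(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce in one process.

    A real cluster would spread the map and reduce calls over many
    machines; here the 'shuffle' is just a dictionary of lists.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


if __name__ == "__main__":
    docs = [("doc1", "the cat sat"), ("doc2", "The dog sat")]
    print(run_local(docs, mapper, reducer))
```

The mapper and reducer never see each other's data directly, which is what would let a real framework farm them out to separate servers. Running it all in one loop keeps the logic testable on a laptop and nothing more.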

In general I found the forums to be the most worthwhile and surprisingly beneficial part of the experience. I am as skeptical as anyone of crowdsourcing, and you'd expect that with 70,000 students signed up for a course, the forums would be chock full of noise and cluelessness. This was not really the case at all. People shared knowledge and experience (but, for the most part, followed the rules and did not share code). This helped a bunch with the optional AWS project (run a MapReduce job to crunch a TB of data). I found out that there's a $100 grant available to students, so I didn't have to pay for the services out of my own pocket (keeping the course truly free), and in a thread people compared notes about how many nodes they used and how to tweak settings when setting up a job. There were also helpful discussions about setting up and running Pig (a high-level query language for Hadoop MapReduce jobs, sort of like SQL, but only sort of) on your own machine, so I was able to debug my Pig scripts locally without having to pay for time on Amazon Elastic MapReduce (on AWS). (In the end, I racked up only $8 worth of charges against my $100 credit - we were warned it could cost up to $20.)

Some critics say the forums are no match for the rapid-fire, face-to-face discussions you can have at a University. That is, if you're not as introverted as I was in my University days. I was lucky enough to have some accessible professors, although in retrospect what that offered was a mixed bag. When people get jazzed about the fact that you're hanging on their every word, they can veer off into weird political or racist directions. They can give you really horrible advice, like the advisor who told me not to take a graph theory class. I will probably expound on this in a future post.

I did enjoy the assignments, although several other students hated the open-ended requirements in some cases (for example: 'participate in a Kaggle competition'). I thought the openness was kind of fitting, given the subject matter. Data Scientists have to figure out what the data is telling them without a set of hard-and-fast, consultant-friendly requirements.

The Kaggle assignment was fun and humbling. There is a tutorial-like competition on Kaggle (a website where Data Scientists and wannabe Data Scientists compete for money and glory solving problems in scientific or business domains) about 'Predict Based On These Variables If A Person Survived the Sinking Of The Titanic', which walks a person through examples with Excel, Python, and scikit-learn (also Python - it's a Machine Learning toolkit). I say it was humbling because a kind of hokey and hacky Python example in the tutorial did a better job making predictions than the more impressive sounding 'Random Forest', unless you, the competitor, applied a whole lot of what's called 'Feature Engineering' to the problem to figure out how to deal with missing data and how best to use the info provided. The assignment was due too soon - I would have liked to dig into that more. As it was, it was something of a 'here's a firehose of new tools, good luck!' experience. The Chief Data Scientist at Kaggle, Jeremy P. Howard, made an appearance in an 'Ask Me Anything' in the forums, and this thread was as valuable as any lecture.

The course ended a couple of weeks ago. I completed all the required assignments, but I still don't know if I got a certificate of completion. I'm not sure what I will do with such a certificate, but as I've never been a Mayor Of Starbucks on 4square I would like to have some sort of virtual recognition or mark of greatness. The ending did feel a bit like a fizzling out, really.

Since the course finished I've resumed tinkering on some Bloominglabs-related things, like my open-ended Arduino-based bike computer and playing with our new(ish) laser cutter some more. I find that since I started doing that stuff, a purely academic exercise like completing an assignment is not as rewarding, because ultimately these often feel like meals you clear away when you're done before moving on. On the other hand, without more structure, peer pressure, etc., there is a tendency for my hacking projects to be very open-ended and never really reach some point of completion (I tend to be best at 'completing things' if there's an upcoming opportunity to show the thing off). It's a trade-off.

Ultimately I give Coursera a passing grade because I'm going back for more. I'm taking 'Maps and the Geospatial Revolution', starting July 17. Join me if you want to make some cool maps.

Some handy references:

http://www.katyjordan.com/MOOCproject.html
http://www.educause.edu/ero/article/retention-and-intention-massive-open-online-courses-depth-0
http://www.nytimes.com/2013/04/21/opinion/sunday/grading-the-mooc-university.html?pagewanted=all&_r=1&