Toolbox for learning machine learning and data science

Posted: September 6th, 2012 | Filed under: Uncategorized

Recently I jumped in and taught myself how to do medium-sized data exploration and machine learning. (Excel-sized < My Data Set < Big Data)

If you are a real data scientist or expert, skip this. It isn’t for you. But if you have a good data set and want to start playing with it and learning some of the tools of modern data analytics, perhaps this can save you some time.

Matlab vs. R vs. Python

If you work at a university or big company, maybe you have access to Matlab, which is apparently great, but expensive. I didn’t.

A physicist I was working with knew and used R. Apparently it is incredibly powerful and has many of the most cutting-edge algorithms. But I found the syntax baffling and the documentation copious but written for mathematicians instead of hackers. Overall, it was difficult and frustrating.

Python, on the other hand, is a dream. The language is easy. The documentation is copious and comprehensible. The online community is awesome. And in addition to all the analysis, you can do data munging as well. Python was my pick, and I think it was the right one.

The next step is picking the packages to support Python.

Python Packages for Analysis

I’m sure there are a lot of different choices with pluses and minuses, but this set served me very well, came after reasonable research, and never let me down. So it’s a good starting point.

  • Python 2.7 (vs. 3.x) – It feels weird to use an older version, but all the packages work with 2.x, and some might not work with 3.x, so I went with 2.x and never had a problem.
  • NumPy & SciPy – these are the core packages for scientific computation, array manipulation, etc. They are the base.
  • Matplotlib – this allows you to do lots of easy visual analysis, charting, etc. Very powerful, but also easy for easy things. A histogram is a couple of lines.
  • IPython – this is the base interactive shell for Python scientific computing. The key is you can easily run commands from the shell as you experiment. I played with others, but I wish I had switched to IPython sooner.
  • Console2 (Windows) – If you use Windows, get the free Console2 and hook IPython (and Git, and Cygwin, etc.) up to it.
  • Pandas – Pandas implements R’s “dataframe” concept in Python. Basically it wraps NumPy arrays so you can reference rows and columns by labels instead of just numbers. Overall, I had mixed feelings about Pandas. It’s undoubtedly powerful, and when you are good at it, you can do things simply and elegantly. But I also found that the vast majority of my debugging and tinkering was in Pandas. So I guess I’d say skip it at the start, but know it’s out there in case you feel you need it.
  • Statsmodels – This is a statistics package that did the trick for my limited usage.
  • Scikit-learn – A great package with lots of machine learning algorithms, easy to use, and great documentation.
    • Orange – Orange is another ML package which is very highly recommended, and also has a GUI. But I found it very un-pythonic to my novice eyes. I was trying to figure out how to read in a dict of my features and got lost in a seemingly endless list of Orange classes. When I looked at Sklearn, it was a one-liner. So Sklearn was great for me, but you might want to check out Orange.
  • NLTK – Natural Language Toolkit – useful for some base machine learning (though sklearn is better), but good if you are doing any NLP.
  • Excel – Don’t forget Excel! I would often do my analysis in Python, then dump some results into Excel for quick, easy data exploration. If you are really good with the above tools, you probably don’t need it. But Excel is so easy that I found it a really valuable tool.
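
To make a few of the claims in the list concrete, here is a minimal sketch of the histogram, the labeled array, and the dict-of-features step. The data and every variable name are made up for illustration, and the scikit-learn part uses DictVectorizer as one plausible way to get the "one-liner"; it is not necessarily the exact call I used at the time. It is written against current releases of these packages rather than the 2012 versions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up sample data, purely for illustration.
rng = np.random.RandomState(0)
heights = rng.normal(loc=170.0, scale=10.0, size=500)

# Matplotlib: a histogram really is a couple of lines.
counts, bin_edges, _ = plt.hist(heights, bins=20)
plt.close()  # in an interactive session, plt.show() would display it instead

# Pandas: wrap a NumPy array so rows and columns can be addressed by label.
raw = np.array([[170.0, 65.0], [182.0, 80.0]])
people = pd.DataFrame(raw, index=["alice", "bob"],
                      columns=["height_cm", "weight_kg"])
bob_weight = people.loc["bob", "weight_kg"]  # same cell as raw[1, 1]

# Scikit-learn: turn a list of feature dicts into a numeric matrix.
records = [
    {"city": "London", "temperature": 12.0},
    {"city": "Dubai", "temperature": 33.0},
]
vectorizer = DictVectorizer(sparse=False)
features = vectorizer.fit_transform(records)  # string values become 0/1 columns
```

After this runs, vectorizer.feature_names_ lists the generated column names, and features can be passed to a scikit-learn estimator's fit method.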

While I’m at it, here are some good documentation sources I used:

Also, if you need to clean up your data to get it into a usable state, you might try Data Wrangler or Google Refine. I love the concept of both (and Wrangler is wicked-cool), but both were buggy for me, and if you are good with Python, you can just use that instead.
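
As a rough idea of what "just use Python" can look like for cleanup, here is a small sketch using only the standard library. The messy input, the column names, and the clean_rows helper are all hypothetical, and the code assumes a current Python 3 rather than the 2.7 mentioned above.

```python
import csv
import io

# Hypothetical messy input, standing in for a real file.
raw_text = """name, age ,city
 Alice ,34, London
Bob,,Paris
 carol ,29,  berlin
"""


def clean_rows(text):
    """Strip stray whitespace, fix capitalization, and map blank ages to None."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = []
    for row in reader:
        # Header names can carry stray spaces too, so strip keys and values.
        row = {key.strip(): value.strip() for key, value in row.items()}
        cleaned.append({
            "name": row["name"].title(),
            "age": int(row["age"]) if row["age"] else None,
            "city": row["city"].title(),
        })
    return cleaned


rows = clean_rows(raw_text)
```

Each entry in rows is a plain dict, so it can go straight into whatever analysis step comes next.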

Happy data exploring!


9 Comments on “Toolbox for learning machine learning and data science”

  1. 1 Nathan said at 10:21 am on September 6th, 2012:

    Great list! Good explanation of why to use Python 2.7 instead of 3.

    I also like pip to install packages quickly and easily as well as virtualenv to try out those new packages.

  2. 2 Maximus said at 10:23 am on September 6th, 2012:

    Take a look at ConEmu – alternative console for windows

  3. 3 Sherjil Ozair said at 12:52 pm on September 6th, 2012:

    You didn’t mention Python’s pickle or cPickle libraries. Machine Learning is practically impossible without them.

  4. 4 Ryan said at 2:32 pm on September 7th, 2012:

    Great list.

    Sherjil’s correct, but I think most of us would forget to include it because it is so seamless. “shelve” is another good library that is very similar.

    To echo what Nathan said, Pip has made my life so much easier. virtualenv is genius, but for some reason I cannot find a reason to use it yet.

  5. 5 Dmytro said at 2:42 pm on September 7th, 2012:

    Weka is another good one. It has a nice GUI, and it is written in Java, so it can be easily plugged into a Java project.

  7. 7 Bo Beulah said at 11:36 am on September 18th, 2012:

    This is exactly what I was looking for, many thanks.

  8. 8 Artificial Intelligence Blog · Ross Rosen’s list of Machine Learning Tools for Phython said at 6:15 am on January 28th, 2013:

    [...] out Ross Rosen’s collection of tools for machine learning with [...]

  9. 9 Machine Learning library with ROS | James's Research journey said at 1:31 am on December 29th, 2013:

    [...] Source: [...]
