“Is there any online reading or courses I can do to get into data analysis?”
At my workplace, I get asked the question above. The question is usually posed by people typically with a finance background, who’s working as a management consultant. In this post I propose a learning path for such people to “get into data analysis”. I will assume that the prospective student someone with decent Excel skills, not afraid of a VLOOKUP or a touch of VB, and can throw together decent plots / dashboards using the same Microsoft package, but has little or no knowledge of programming / command line operations.
A data scientist can be defined by Drew Conway‘s Data Science Venn diagram which suggests that data scientists must have a solid mathematical background, skills in coding and computer hacking, and a healthy mix of subject matter expertise.
The courses mentioned below are by no means a “over a weekend” type of engagement – if you are serious about entering the world of data science as a profession, allow yourself at least 3-6 months to complete and study the content of the courses below.
Learn to program.
R and Python are the two primary scripting languages that are taking over the world of data science. There is very little that can not be done with knowledge of these two languages, and I would recommend getting to grips with both during your learning. R is a statistical programming language that has a huge number of packages available for every function you could think of. Python is a more general language that has data science capabilities built through the numpy and scipy libraries.
Try R Start your journey into R and data visualisation with the “Try R” free online course from CodeSchool.com. Learn the basic syntax and get loading and plotting small data sets. Computing for Data Analysis Augment your fundamental R knowledge with “Computing for Data Analysis” at Coursera.org. Python Track Take a trip into Python and get top grips with the basic syntax with the Python track at Codeacademy.com Introduction to Computer Science Expand this preliminary Python know-how with a fully blown project to create a working search engine at Udacity.com’s Introduction to Computer Science
Learn some maths.
Data scientists are one part statisticians. To gather meaningful information from large data sets requires skills in summarising and correlating variables on a regular basis. A solid understanding of the maths behind statistical transformations and machine learning techniques ensures that results are valid and immune to scrutiny. Note that a lot of the necessary statistics and maths knowledge can be picked up from the machine learning-focussed courses.
Introduction to Statistics Start off with some preliminary statistics at Udacity.com’s “Introduction to Statistics” Statistics One Go a bit deeper with “Statistics One” from Princeton at Coursera.org
Learn machine learning and data visualisation.
The core information that separates data scientists from data analysts is the ability to move beyond reporting and applying more sophisticated analytical techniques to model variance, extract meaning, and predict variables of interest, using your data.
Data Analysis Start with the excellent “Data Analysis” course at Coursera.org that will give you direct experience in loading, visualising, and modelling of real data sets using R. This course is considerably more advanced than the previous “computing for data analysis”, and covers some data analysis techniques, and focuses on teaching students how to structure data analysis reports. Machine Learning Make sure that you take the brilliant and MOOC-starting “Machine Learning” or “ml-class” course with Andrew Ng at Coursera.org. Python skills are a must for this course that covers linear algebra, regression, neural networks, support vector machine, and recommender systems among others. Andrew Ng provides an excellent background for the topics that are covered. Artificial Intelligence for Robotics Sebastian Thrun‘s “Artificial Intelligence for Robotics” class is a brilliant introduction to more applied machine learning techniques such as the Kalman Filter and Particle Filters. While perhaps slightly off-topic, the course has a range of interesting and worthwhile Python-based exercises that will only add to your learning journey. Algorithms / Neural Networks More detailed specific-topic courses can be taken in Algorithms and Network Analysis at Udacity, or Neural Networks for Machine Learning - both of which I’ve personally found useful. The Neural Networks course dips into the realm of “Deep Learning”, a hot, but advanced, topic in machine learning at Google and Facebook at the moment. Introduction to Hadoop and MapReduce At some point, you’re going to need to wet your toes with some Big Data, Hadoop, and MapReduce knowledge – Get a basic introduction with “Introduction to Hadoop and MapReduce” at Udacity.com, in conjunction with Cloudera.
When you have completed the majority of the courses listed above, you’ll be in a very strong position to put your knowledge to use. And practice is the key. Get on Kaggle, download a data set, and get involved!!