So, you want to become a data scientist, or maybe you already are one and want to expand your toolkit. You have landed at the right place. The aim of this page is to provide a comprehensive learning path for people new to Python for data analysis, covering the steps you need to learn along the way. If you already have some background, or don’t need all the components, feel free to adapt the path to your own needs and let us know what you changed.
Before starting your journey, the first question to answer is:
Why use Python?
or
How would Python be useful?
Watch the first 30 minutes of this talk from Jeremy, founder of DataRobot, at PyCon 2014, Ukraine, to get an idea of how useful Python could be.
Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to just download Anaconda from Continuum.io (or download it from this URL: http://www.continuum.io/downloads). It comes packaged with most of what you will ever need. The major downside of taking this route is that you will need to wait for Continuum to update their packages, even when an update to the underlying libraries is already available. If you are a beginner, that should hardly matter.
If you face any challenges in installing, you can find more detailed instructions for various operating systems here.
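Once the installation has finished, a quick sanity check can confirm that the main libraries are importable. The snippet below is a minimal sketch using only the standard library; the package list is an assumption chosen for illustration, not an official Anaconda manifest.

```python
import importlib.util
import sys

# Packages commonly used for data analysis. This list is an assumption
# for illustration, not the official contents of any distribution.
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("Python version:", sys.version.split()[0])
for name, available in check_packages(PACKAGES).items():
    print(f"{name:12s} {'found' if available else 'MISSING'}")
```

If a package is reported as missing, it is probably easier to install it now than to discover the gap halfway through a tutorial.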
You should start by understanding the basics of the language, libraries and data structures. The Python track from Codecademy is one of the best places to start your journey. By the end of this course, you should be comfortable writing small scripts in Python and should also understand classes and objects.
Specifically learn: Lists, Tuples, Dictionaries, List comprehensions, Dictionary comprehensions
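To give a feel for these structures, here is a short sketch that touches each of them once. The variable names and values are made up for illustration.

```python
# A list is ordered and mutable.
scores = [82, 95, 67, 74]
scores.append(88)

# A tuple is ordered and immutable, which suits fixed records.
point = (3, 4)
x, y = point  # tuple unpacking

# A dictionary maps keys to values.
ages = {"alice": 31, "bob": 27}
ages["carol"] = 45

# List comprehension: build a new list by filtering an existing one.
passed = [s for s in scores if s >= 75]

# Dictionary comprehension: build a dictionary in a single expression.
name_lengths = {name: len(name) for name in ages}

print(passed)        # [82, 95, 88]
print(name_lengths)  # {'alice': 5, 'bob': 3, 'carol': 5}
```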
Assignment: Solve the Python tutorial questions on HackerRank. These should get you thinking in terms of Python scripting.
Alternate resources: If interactive coding is not your style of learning, you can also look at The Google Class for Python. It is a two-day class series and also covers some of the parts discussed later.
You will need to use regular expressions a lot for data cleansing, especially if you are working on text data. The best way to learn regular expressions is to go through the Google class and keep this cheat sheet handy.
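As a small taste of regular expressions in data cleansing, the sketch below tidies one messy text field with Python's built-in re module. The sample string and the email pattern are assumptions made for illustration; the pattern is deliberately simple and is not a complete email validator.

```python
import re

# A made-up messy field, as you might find in scraped or hand-entered data.
raw = "  Contact: John.Doe@Example.com , phone (555) 123-4567!!  "

# Collapse runs of whitespace into single spaces and trim the ends.
cleaned = re.sub(r"\s+", " ", raw).strip()

# Pull out anything that looks like an email address, then normalise case.
emails = [e.lower() for e in re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", cleaned)]

# Keep only the digits to get a bare phone number.
digits = re.sub(r"\D", "", cleaned)

print(cleaned)
print(emails)  # ['john.doe@example.com']
print(digits)  # 5551234567
```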
Assignment: Do the baby names exercise
If you still need more practice, follow this tutorial for text cleaning. It will challenge you on various steps involved in data wrangling.
This is where the fun begins! Here is a brief introduction to various libraries. Let’s start practicing some common operations.
You can also look at Exploratory Data Analysis with Pandas and Data munging with Pandas
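To show what data munging with Pandas tends to look like, here is a minimal sketch on a tiny made-up table. The column names and values are invented for illustration, and filling missing values with the column mean is just one possible choice.

```python
import pandas as pd

# A small, made-up data set with one missing value.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
    "sales": [250.0, 300.0, None, 150.0, 350.0],
})

# Quick exploration: summary statistics and missing-value counts.
print(df.describe())
print(df.isnull().sum())

# Munging: fill the missing value with the column mean.
mean_sales = df["sales"].mean()
df["sales"] = df["sales"].fillna(mean_sales)

# Aggregation: total sales per city.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

The same three steps (explore, clean, aggregate) come up again and again in the tutorials linked above.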
Additional Resources:
Assignment: Solve this assignment from the CS109 course at Harvard.
Go through this lecture from CS109. You can ignore the initial 2 minutes, but what follows after that is awesome! Follow this lecture up with this assignment.
Now, we come to the meat of this entire process. Scikit-learn is the most useful library in Python for machine learning. Here is a brief overview of the library. Go through lecture 10 to lecture 18 of the CS109 course from Harvard. You will go through an overview of machine learning, supervised learning algorithms like regression, decision trees and ensemble modeling, and unsupervised learning algorithms like clustering. Follow individual lectures with the assignments from those lectures.
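Before diving into the lectures, it may help to see how little code a first model takes. The sketch below fits one supervised model (a decision tree) and one unsupervised model (k-means clustering) on the iris data set that ships with scikit-learn. It assumes a reasonably recent scikit-learn release, and the parameter values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in data set: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Supervised learning: hold out 30% of the rows to test the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
accuracy = accuracy_score(y_test, tree.predict(X_test))
print("Decision tree accuracy on held-out data:", accuracy)

# Unsupervised learning: group the same rows into 3 clusters,
# without showing the algorithm the true labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
print("Cluster assigned to the first 10 rows:", cluster_labels[:10])
```

Most scikit-learn estimators follow this same fit and predict pattern, so the code carries over to the other algorithms covered in the lectures.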
Additional Resources:
Assignment: Try out this challenge on Kaggle
Congratulations, you made it!
You now have all the technical skills you need. From here it is a matter of practice, and what better place to practice than competing with fellow data scientists on Kaggle? Go, dive into one of the live competitions currently running on Kaggle and give everything you have learnt a try!
Now that you have learnt most machine learning techniques, it is time to give deep learning a shot. There is a good chance that you already know what deep learning is, but if you still need a brief intro, here it is.
I am myself new to deep learning, so please take these suggestions with a pinch of salt. The most comprehensive resource is deeplearning.net. You will find everything there: lectures, datasets, challenges, tutorials. You can also give the course from Geoff Hinton a try to understand the basics of neural networks.
P.S. In case you need to use Big Data libraries, give Pydoop and PyMongo a try. They are not included here because the Big Data learning path is an entire topic in itself.
One of the common problems people face in learning R is the lack of a structured path. They don’t know where to start, how to proceed or which track to choose. Though there is an abundance of good free resources available on the Internet, this can be overwhelming as well as confusing.
After digging through endless resources & archives, here is a comprehensive Learning Path on R to help you learn R from scratch. This will help you learn R quickly and efficiently. Time to have fun while lea-R-ning!
Before starting your journey, the first question to answer is: Why use R? or How would R be useful?
Watch this 90-second video from Revolution Analytics to get an idea of how useful R could be. Incidentally, Revolution Analytics just got acquired by Microsoft.
Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to just download the basic version of R and detailed installation instructions from CRAN (Comprehensive R Archive Network).
You can then install various other packages. There are 9000 packages in R, so this can get confusing. Accordingly, we will guide you to install just the basic R packages first. Here is a link to understand packages through CRAN Task Views. You can then select the subtype of packages that you are interested in.
How to install a package: http://www.r-bloggers.com/installing-r-packages/
Some important packages to learn about: http://blog.yhathq.com/posts/10-R-packages-I-wish-I-knew-about-earlier.html
You should install these three GUIs with all dependent packages.
You should also install RStudio. It makes R coding much easier and faster, as it allows you to type multiple lines of code, handle plots, install and maintain packages, and navigate your programming environment much more productively.
Assignment:
You should start by understanding the basics of the language, libraries and data structures. The R track from Datacamp is one of the best places to start your journey. In particular, see the free Introduction to R course at https://www.datacamp.com/courses/introduction-to-r. By the end of this course, you should be comfortable writing small scripts in R and should also understand data analysis. Alternatively, you can also see Code School for R at http://tryr.codeschool.com/
If you want to learn R offline in your own time, you can use the interactive package swirl from http://swirlstats.com
Specifically learn: read.table, data frames, table, summary, describe, loading and installing packages, data visualization using plot command
Assignment:
Alternate resources: If interactive coding is not your style of learning, you can also look at The Two Minute Tutorials on R at http://www.twotorials.com/ . It is a video series and also covers some of the parts discussed here. You can also read a comprehensive blog post titled 50 functions to help you clear a job interview in R here.
You will need data manipulation techniques a lot for data cleansing, especially if you are working on text data. The best way to learn them is to go through the text manipulation and numerical manipulation exercises. You can learn about connecting to databases through the RODBC package and running SQL queries against data frames through the sqldf package.
Assignment:
If you still need more practice, you can sign up for a $25/month subscription at Datacamp that gives you access to all tutorials. Please also go through the slides for plyr here.
This is where the fun begins! Here is a brief introduction to various libraries. Let’s start practicing some common operations.
Additional Resources:
Now, we come to the most valuable skill for a data scientist, which is data mining and machine learning. You can see a very comprehensive set of resources on data mining in R at http://www.rdatamining.com/ . The rattle package helps a great deal by providing an easy-to-use Graphical User Interface (GUI). You can also read a free, open source, easy-to-understand book at http://togaware.com/datamining/survivor/index.html
You will go through an overview of algorithms like regressions, decision trees, ensemble modeling and clustering. You can also see the various machine learning options available in R by seeing the relevant CRAN view here.
Additional Resources:
Congratulations, you made it!
You now have all the technical skills you need.
Now that you have learnt most of data analytics using R, it is time to give some advanced topics a shot. There is a good chance that you already know many of these, but have a look at these tutorials too.
P.S. In case you need to use Big Data a lot, please also have a look at the RevoScaleR package from Revolution Analytics. It is commercial, but academic usage is free. An example project is given here.