Where to go from here

The goal of this book was to introduce you to the world of machine learning and prepare you to become a machine learning practitioner. Now that you are familiar with the fundamental algorithms, you might want to investigate some topics in more depth.

Although it is not necessary to understand all of the details of all of the algorithms we implemented in this book, knowing some of the theory behind them might just make you a better data scientist.

If you are looking for more advanced material, then you might want to consider some of the following classics:
  • Stephen Marsland, Machine Learning: An Algorithmic Perspective, Second Edition, Chapman and Hall/CRC, ISBN 978-1-4665-8328-3, 2014
  • Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, ISBN 978-0-387-31073-2, 2007
  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer, ISBN 978-0-387-84857-0, 2016

When it comes to software libraries, we have already learned about two essential ones: OpenCV and scikit-learn. Python is great for quickly trying out and evaluating models, but larger web services and applications are more commonly written in Java or C++.

One example of a C++ package is Vowpal Wabbit (VW), which comes with its own command-line interface. For running machine learning algorithms on a cluster, people often use MLlib, a Scala library built on top of Spark. If you are not married to Python, you might also consider using R, another common language among data scientists. R is a language designed specifically for statistical analysis and is famous for its visualization capabilities and the availability of many (often highly specialized) statistical modeling packages.
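
If you do stick with Python, MLlib is also accessible through the pyspark package (its newer, DataFrame-based pyspark.ml API). The following is a minimal sketch, not taken from this book, of fitting a logistic regression model on a toy dataset with MLlib from Python; it assumes a working local Spark installation with pyspark available:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # Start a local Spark session (assumes Spark is installed).
    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # A toy two-class dataset: (label, feature vector) rows.
    data = spark.createDataFrame(
        [(0.0, Vectors.dense([0.0, 1.1])),
         (1.0, Vectors.dense([2.0, 1.0])),
         (0.0, Vectors.dense([0.1, 1.2])),
         (1.0, Vectors.dense([2.1, 0.9]))],
        ["label", "features"])

    # Fit a logistic regression classifier and inspect its weights.
    model = LogisticRegression(maxIter=10).fit(data)
    print(model.coefficients)

    spark.stop()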

No matter which software you choose going forward, I guess the most important advice is to keep practicing your skills. But you already knew that. There are a number of excellent datasets out there just waiting for you to analyze:

  • Throughout this book, we made great use of the example datasets that are built into scikit-learn. In addition, scikit-learn provides a way to load datasets from external services, such as mldata.org (see the sketch after this list). Refer to http://scikit-learn.org/stable/datasets/index.html for more information.
  • Kaggle is a company that hosts a wide range of datasets as well as competitions on its website, http://www.kaggle.com. Competitions are often hosted by a variety of companies, nonprofit organizations, and universities, and the winner can take home some serious monetary prizes. A disadvantage of competitions is that they already provide a particular metric to optimize, and usually a fixed, preprocessed dataset.
  • The OpenML platform (http://www.openml.org) hosts over 20,000 datasets with over 50,000 associated machine learning tasks.
  • Another popular choice is the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php), hosting over 370 popular and well-maintained datasets through a searchable interface.
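
As a quick illustration of the first and third options above, here is a minimal sketch of loading a built-in scikit-learn dataset and downloading a dataset from OpenML. It assumes a reasonably recent scikit-learn; fetch_openml was added in version 0.20, while older versions instead offered fetch_mldata for mldata.org:

    from sklearn.datasets import load_iris, fetch_openml

    # Built-in example dataset, shipped with scikit-learn.
    iris = load_iris()
    print(iris.data.shape, iris.target.shape)

    # Download the 'mnist_784' dataset from openml.org (cached
    # locally after the first call, so subsequent runs are fast).
    mnist = fetch_openml('mnist_784', version=1)
    print(mnist.data.shape)
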
Finally, if you are looking for more example code in Python, a number of excellent books nowadays come with their own GitHub repositories.