To get the most out of this book
by Stefan Jansen
Hands-On Machine Learning for Algorithmic Trading
To get the most out of this book
All you need for this book is a basic understanding of Python and machine learning techniques.