Tutorial
#sp500 #python #classificationmodel
These are my takeaways from a video tutorial that uses #Python and Jupyter Notebook to walk through building and evaluating a #ClassificationModel that predicts the daily direction of #SP500 movement from historical data.
- The ticker for #SP500 in the Python package yfinance is ^GSPC
- Train / test split for time-series data: split at a cut-off date rather than at random, so the model is trained only on data that precedes the test period. For example, use the earliest 70% of the data as the training set and the most recent 30% as the test set.
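A minimal sketch of the chronological split described above. It uses synthetic prices rather than a yfinance download, and the helper name `chronological_split` and its 70% default are illustrative assumptions, not taken from the tutorial.

```python
import numpy as np
import pandas as pd


def chronological_split(df, train_fraction=0.7):
    """Split a date-indexed frame into an earlier (train) and a later (test) part."""
    df = df.sort_index()
    cutoff = int(round(len(df) * train_fraction))
    return df.iloc[:cutoff], df.iloc[cutoff:]


# Synthetic stand-in for daily price history (not real S&P 500 data).
dates = pd.bdate_range("2020-01-01", periods=100)
prices = pd.DataFrame({"Close": np.linspace(100.0, 120.0, len(dates))}, index=dates)

train, test = chronological_split(prices)

# Every training date precedes every test date, so no future data leaks into training.
print(len(train), len(test), train.index.max() < test.index.min())
```

A random shuffle split would mix future rows into the training set, which is why the cut-off is positional here.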
- Use RandomForestClassifier to train this classification model. Why?
- More accurate, less over-fitting: a random forest fits many individual decision trees on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting
- Non-linear relationships: it can handle cases where the predictor variables and the target variable have a non-linear relationship
- Parameters used in the tutorial are
- n_estimators: number of decision trees. Higher values generally make the model more accurate, up to a point, but training is slower.
- min_samples_split: minimum number of samples a node must hold before it can be split. Higher values guard against over-fitting, though a value that is too high can cost accuracy.
- random_state: setting this to the same value on every run ensures that each run produces the same model
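A small sketch of how these three parameters are passed to scikit-learn's RandomForestClassifier, and of what fixing `random_state` buys. The data is synthetic and the parameter values are illustrative assumptions, not necessarily the ones used in the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic features and a binary target (illustrative, not market data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)


def make_model():
    # Illustrative values (assumptions): 50 trees, and a node needs at least
    # 50 samples before it may be split.
    return RandomForestClassifier(n_estimators=50, min_samples_split=50, random_state=1)


# Train two models on the same earlier rows with the same seed.
model_a = make_model().fit(X[:350], y[:350])
model_b = make_model().fit(X[:350], y[:350])

# With the same random_state, both forests give the same probabilities.
same = np.array_equal(model_a.predict_proba(X[350:]), model_b.predict_proba(X[350:]))
print("identical predictions with the same random_state:", same)
```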
- Target variable is a binary variable that indicates the daily direction of #SP500 price change
- #SP500 historical ratio of up days to down days: 53.6% up days vs 46.3% down days, using 10 years of data up to May 2022
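One way to build such a target and the up-day benchmark, sketched on a handful of made-up closing prices. The definition used here (1 when the next day's close is higher than today's) and the column names `Tomorrow` and `Target` are assumptions for illustration.

```python
import pandas as pd

# Made-up closing prices (not real index data).
close = pd.Series(
    [100.0, 101.0, 100.5, 102.0, 101.0, 103.0],
    index=pd.bdate_range("2022-01-03", periods=6),
    name="Close",
)

data = close.to_frame()
data["Tomorrow"] = data["Close"].shift(-1)  # next trading day's close
data["Target"] = (data["Tomorrow"] > data["Close"]).astype(int)
data = data.iloc[:-1]  # the last row has no next-day close, so drop it

# Share of up days: the benchmark a model's precision is compared against.
up_ratio = data["Target"].mean()
print(data["Target"].tolist(), up_ratio)
```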
- Two rounds of predictor variables were tried, and their results compared using precision_score. Precision is a metric commonly used in classification problems: it measures the share of positive predictions that are correct (here, the share of predicted up days that really were up days). Mean Squared Error (MSE), by contrast, measures the average squared difference between predicted and actual values in a regression problem, which is why precision rather than MSE fits this model.
- 1st round: using the daily price variables Close, Volume, Open, High and Low as predictors, the model's precision_score is 53.5%, slightly below the benchmark, which is the #SP500 historical up-day percentage of 53.6%
- 2nd round: using 2 sets of derived rolling aggregate variables, each set derived over 5 time horizons: Closed_Ratio_N = Close Price / rolling average Close Price over N days, and Trend_N = rolling sum of the target variable over N days, where N is 2, 5, 60, 250 or 1000 trading days. The prediction threshold was also raised from 0.5 to 0.6, so the model predicts an up day only when it is more confident. The resulting precision_score increased to 57.4%, higher than the benchmark of 53.6%, suggesting that this set of derived variables has some predictive value.
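The second round can be sketched roughly as below, on a synthetic random-walk price series rather than real #SP500 data. Several details are assumptions made for illustration: the 1000-day horizon is left out to keep the sample small, the trend is shifted by one day so that it only counts days already known, and the model parameters are arbitrary. No particular precision value should be expected from random data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

# Synthetic random-walk closes (not real market data).
rng = np.random.default_rng(42)
dates = pd.bdate_range("2015-01-01", periods=1500)
close = 100 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, len(dates))))
data = pd.DataFrame({"Close": close}, index=dates)
data["Target"] = (data["Close"].shift(-1) > data["Close"]).astype(int)
data = data.iloc[:-1]  # last row has no next-day close

horizons = [2, 5, 60, 250]  # assumption: 1000 omitted to keep the example small
predictors = []
for h in horizons:
    ratio_col = f"Closed_Ratio_{h}"
    trend_col = f"Trend_{h}"
    # Today's close relative to its rolling average over h days.
    data[ratio_col] = data["Close"] / data["Close"].rolling(h).mean()
    # Number of up days over the previous h days; shift(1) is an assumption
    # that keeps the current day's own target out of the sum.
    data[trend_col] = data["Target"].shift(1).rolling(h).sum()
    predictors += [ratio_col, trend_col]
data = data.dropna()

# Chronological 70 / 30 split.
cutoff = int(round(len(data) * 0.7))
train, test = data.iloc[:cutoff], data.iloc[cutoff:]

model = RandomForestClassifier(n_estimators=50, min_samples_split=50, random_state=1)
model.fit(train[predictors], train["Target"])

# Predict "up" only when the estimated probability reaches the threshold.
prob_up = model.predict_proba(test[predictors])[:, 1]
preds_default = (prob_up >= 0.5).astype(int)
preds_strict = (prob_up >= 0.6).astype(int)

precision = precision_score(test["Target"], preds_strict, zero_division=0)
print("up calls at 0.5:", preds_default.sum(), "at 0.6:", preds_strict.sum())
print("precision at 0.6:", precision)
```

Raising the threshold can only reduce the number of up-day calls, which is the trade-off behind the more selective predictions: fewer signals in exchange for a chance at higher precision.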