Coding#

Data Preparation#

  • We will use the yfinance package to import stock data from Yahoo Finance, along with the pandas and NumPy packages.

import yfinance as yf
import pandas as pd
import numpy as np
  • We will use the following constants throughout this notebook. Using dictionaries allows us to store key–value pairs within a single data structure.

STOCK_DICT = {'Apple': 'AAPL', 'Tesla': 'TSLA', 'Amazon': 'AMZN', 'Visa': 'V', 'Microsoft': 'MSFT'}
START = '2015-1-1'
END = '2020-12-31'
  • It is good practice to keep all the Close values of the stocks we are considering in a single DataFrame, as this makes it easier to access them.

df = pd.DataFrame()

for name, symbol in STOCK_DICT.items():
    df[name] = yf.Ticker(symbol).history(start=START, end=END).Close
    
df.head().round(2)

If the code above does not work due to a YFRateLimitError, you can load the data from the following URL using the pandas read_csv() function.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datasmp/datasets/refs/heads/main/close_stock_data_raw.csv',
                parse_dates = ['Date'])
df['Date'] = pd.to_datetime(df['Date'], utc=True)
df.set_index('Date', inplace=True)
df.head()
Apple Tesla Amazon Visa Microsoft
Date
2015-01-02 05:00:00+00:00 24.261047 14.620667 15.4260 61.462486 39.933056
2015-01-05 05:00:00+00:00 23.577578 14.006000 15.1095 60.105762 39.565842
2015-01-06 05:00:00+00:00 23.579796 14.085333 14.7645 59.718475 38.985107
2015-01-07 05:00:00+00:00 23.910429 14.063333 14.9210 60.518589 39.480442
2015-01-08 05:00:00+00:00 24.829128 14.041333 15.0230 61.330284 40.641880
  • To remove the time portion from the date values, we first reset the index (row labels). This converts the index, which contains the Date values, into a column. Then, we use .dt.date to extract only the date part from each value in that column.

df.reset_index(inplace=True)
df['Date'] = df.Date.dt.date
df.set_index('Date', inplace=True)
df.head().round(2)
Apple Tesla Amazon Visa Microsoft
Date
2015-01-02 24.26 14.62 15.43 61.46 39.93
2015-01-05 23.58 14.01 15.11 60.11 39.57
2015-01-06 23.58 14.09 14.76 59.72 38.99
2015-01-07 23.91 14.06 14.92 60.52 39.48
2015-01-08 24.83 14.04 15.02 61.33 40.64
  • The info() method in pandas provides basic information about a DataFrame, such as the number of entries, column names, non-null counts, and data types.

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1510 entries, 2015-01-02 to 2020-12-30
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Apple      1510 non-null   float64
 1   Tesla      1510 non-null   float64
 2   Amazon     1510 non-null   float64
 3   Visa       1510 non-null   float64
 4   Microsoft  1510 non-null   float64
dtypes: float64(5)
memory usage: 70.8+ KB
  • The built-in len() function returns the number of rows in a DataFrame.

len(df)
1510
  • A DataFrame’s shape attribute returns the number of rows and columns as a tuple.

df.shape
(1510, 5)
  • The describe() method in pandas provides basic descriptive statistics for each column of a DataFrame. These include

    • number of values (count)

    • mean

    • standard deviation

    • minimum value

    • 25th percentile (25% of the values are less than or equal to this value)

    • 50th percentile (also called the median)

    • 75th percentile (75% of the values are less than or equal to this value)

    • maximum value

df.describe()
Apple Tesla Amazon Visa Microsoft
count 1510.000000 1510.000000 1510.000000 1510.000000 1510.000000
mean 45.501393 30.976507 68.798972 117.331868 92.857470
std 24.912472 37.139818 39.538892 45.824109 50.984855
min 20.624054 9.578000 14.347500 57.134926 34.501617
25% 27.004882 15.139666 36.388124 73.924759 48.993045
50% 38.962189 18.944000 59.735750 108.241325 79.382622
75% 51.278791 23.162333 91.450748 159.760967 127.795855
max 133.190170 231.666672 176.572495 211.000885 222.111893
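  • As a quick check, the percentile values reported by describe() can be reproduced with NumPy, which by default uses the same linear-interpolation definition of a percentile as pandas. For example, the 25th percentile of the Apple column:

np.percentile(df['Apple'], 25)   # should match the 25% value shown by describe() above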

Log Return#

The log return \(r_t\) is calculated as:

\(\displaystyle r_t = \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln(P_t) - \ln(P_{t-1})\)

where \(P_t\) is the current price, \(P_{t-1}\) is the previous price, and \(\ln\) denotes the natural logarithm.

Why Use Log Returns?

  • Time Additivity

    • Log returns can be summed across time.

    • Example: The log return over a year is just the sum of monthly log returns. This is not true for simple (arithmetic) returns.

    • \(r_t + r_{t+1} = \ln(P_t) - \ln(P_{t-1}) + \ln(P_{t+1}) - \ln(P_{t}) = \ln(P_{t+1}) - \ln(P_{t-1})\)

  • Statistical Properties

    • Log returns often approximate a normal distribution better than simple returns, which is useful for many statistical and financial models.

  • Symmetry

    • Percentage changes (simple returns) are asymmetric: a +10% gain and a −10% loss don’t cancel out.

    • If \(P\) is the initial value, then after a +10% gain followed by a −10% loss the final value is \(P\times 1.1\times 0.9 = 0.99P\).

    • Correspondingly, \(\ln(1.1) + \ln(0.9) = \ln(0.99) \approx -0.01\).

    • The difference is in how returns combine mathematically.

    • With simple returns, you need to multiply growth factors: \((1+r_1)(1+r_2)\)

    • With log returns, you just add them: \(r_1 + r_2\)

    • Log returns are symmetric in relative changes, which makes them easier to analyze; a quick numerical check of these properties follows below.
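
  • The additivity and symmetry properties above can be verified numerically. The following is a quick sanity check (not part of the data pipeline), using a made-up price series that gains 10% and then loses 10%:

p = np.array([100.0, 110.0, 99.0])   # toy prices: +10% then -10%
log_r = np.log(p[1:] / p[:-1])       # per-period log returns
log_r.sum()                          # ln(1.1) + ln(0.9) = ln(0.99), about -0.01
np.log(p[-1] / p[0])                 # total log return over the whole period: same value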

In the following code, the shift(n) method moves the values of a column down by n rows, so that in each row the shifted values represent past values.

df_toy = pd.DataFrame([1,2,3,4], columns=['Initial'], index=['day1', 'day2', 'day3', 'day4' ])
df_toy['shift_1'] = df_toy.Initial.shift(1)
df_toy['shift_2'] = df_toy.Initial.shift(2)
df_toy
Initial shift_1 shift_2
day1 1 NaN NaN
day2 2 1.0 NaN
day3 3 2.0 1.0
day4 4 3.0 2.0
  • Now we will use the shift() method to calculate the log returns.

df_log = np.log(df/df.shift(1))
df_log.dropna(inplace=True)
df_log.head().round(3)
Apple Tesla Amazon Visa Microsoft
Date
2015-01-05 -0.029 -0.043 -0.021 -0.022 -0.009
2015-01-06 0.000 0.006 -0.023 -0.006 -0.015
2015-01-07 0.014 -0.002 0.011 0.013 0.013
2015-01-08 0.038 -0.002 0.007 0.013 0.029
2015-01-09 0.001 -0.019 -0.012 -0.015 -0.008
  • The base of np.log() is Euler’s number \(e\), which is a mathematical constant similar to \(\pi\) and approximately equal to 2.718.

np.log(10) 
2.302585092994046
np.e
2.718281828459045
np.log(np.e) 
1.0

Percentage Change#

If you prefer to use percentage changes instead of log returns, you can use the pandas pct_change() method.

df_pct = df.pct_change()
df_pct.dropna(inplace=True)
df_pct.head().round(3)
Apple Tesla Amazon Visa Microsoft
Date
2015-01-05 -0.028 -0.042 -0.021 -0.022 -0.009
2015-01-06 0.000 0.006 -0.023 -0.006 -0.015
2015-01-07 0.014 -0.002 0.011 0.013 0.013
2015-01-08 0.038 -0.002 0.007 0.013 0.029
2015-01-09 0.001 -0.019 -0.012 -0.015 -0.008

Lagged Data#

One way to predict future stock prices is by using a certain number of previous values. To do this, we will prepare a DataFrame that contains the log returns along with their lagged versions up to a certain window (lag) length.

  • The following function generates lagged values as new columns.

    • The parameter data is the DataFrame that contains values for various stocks.

    • The parameter name specifies the column name (the stock for which lagged data will be generated as a new DataFrame).

    • The parameter lag defines the number of lags up to which lagged columns will be generated.

def lag_func(data, name, lag):
    df_lag = pd.DataFrame(data[name])                  # start with the selected stock's column
    for i in range(1, lag+1):
        df_lag[f'lag_{i}'] = df_lag[name].shift(i)     # value from i rows earlier
        df_lag.dropna(inplace=True)                    # remove rows containing NaN values
    return df_lag
df_log.Visa.head(10)
Date
2015-01-05   -0.022321
2015-01-06   -0.006464
2015-01-07    0.013309
2015-01-08    0.013323
2015-01-09   -0.014934
2015-01-12   -0.001959
2015-01-13    0.002918
2015-01-14   -0.020220
2015-01-15   -0.009554
2015-01-16    0.007164
Name: Visa, dtype: float64
lag_func(df_log, 'Visa', 2).head()
Visa lag_1 lag_2
Date
2015-01-08 0.013323 0.013309 -0.006464
2015-01-09 -0.014934 0.013323 0.013309
2015-01-12 -0.001959 -0.014934 0.013323
2015-01-13 0.002918 -0.001959 -0.014934
2015-01-14 -0.020220 0.002918 -0.001959
  • We can generate a dictionary with stock names as keys and DataFrames (containing the log returns and their lagged versions) as the corresponding values, allowing us to keep all the data in a single dictionary.

df_dict = {}
for name in STOCK_DICT.keys():
  df_dict[name] = lag_func(df_log, name, 10)
df_dict.keys()
dict_keys(['Apple', 'Tesla', 'Amazon', 'Visa', 'Microsoft'])
df_dict['Visa'].head()
Visa lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10
Date
2015-03-25 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022 0.018079 -0.001698
2015-03-26 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022 0.018079
2015-03-27 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022
2015-03-30 0.001829 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943
2015-03-31 -0.003815 0.001829 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944

Input and Output Data#

In this section, we will prepare the input and output data that will be used to build the Machine Learning models for a single stock (Visa). You can apply the same process to multiple stocks together by using a for loop.

Input#

  • The input data consists of the lagged values, which represent past returns, and the output data consists of the current log returns (the column labeled with the name of the stock, here Visa).

df_visa = df_dict['Visa']
df_visa.head()
Visa lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10
Date
2015-03-25 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022 0.018079 -0.001698
2015-03-26 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022 0.018079
2015-03-27 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022
2015-03-30 0.001829 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943
2015-03-31 -0.003815 0.001829 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944
  • iloc[:, 1:] selects all rows and all columns starting from index 1 (the second column), which in this case is the lag_1 column.

  • This provides all the lagged values, representing the past log returns, and will be used as the input data.

df_visa.iloc[:,1:].head()
lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10
Date
2015-03-25 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022 0.018079 -0.001698
2015-03-26 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022 0.018079
2015-03-27 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943 -0.017022
2015-03-30 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944 0.014943
2015-03-31 0.001829 -0.000762 -0.002132 -0.020629 0.000298 -0.004908 0.008941 -0.001608 0.011914 -0.016944

The values attribute removes the labels (row and column indices) and returns the underlying data as a NumPy array, similar to a matrix.

X = df_visa.iloc[:,1:].values
type(X), X.shape
(numpy.ndarray, (1454, 10))

Output#

Output data in Machine Learning can generally take two different forms: continuous or categorical.

  • Continuous data means that output values can take any value within a range, such as a price, a percentage change, or a log return. These values are numerical and can be measured on a continuous scale.

  • Categorical data, on the other hand, represents discrete classes or labels, such as increasing vs. decreasing, or buy, hold, sell. These values do not represent magnitudes but categories.

  • When the output data is continuous, we use regressor algorithms, and the task is called regression. When the output data is categorical, we use classifier algorithms, and the task is called classification.

  • Therefore, it is essential to clearly identify the type of output data before building a model, so that we can choose the most appropriate Machine Learning algorithm depending on whether the task is regression or classification.

Regression#

Log return values are continuous data. Therefore, if the output variable in a model is log returns, the output is continuous, and regression algorithms should be used.

yR = df_visa['Visa'].values
yR.shape
(1454,)
yR[:5]
array([-0.02062898, -0.00213218, -0.0007624 ,  0.00182934, -0.0038147 ])

Classification#

If we want to perform classification and predict the behavior of the stock price, whether it increases or decreases, we need to convert the output data into labels (increasing and decreasing). We will encode these two labels numerically as 0 (decreasing or flat) and 1 (increasing).

yC = np.where(yR > 0, 1, 0)
yC
array([0, 0, 0, ..., 1, 1, 1])

NumPy’s bincount() function counts the number of occurrences of each non-negative integer in an array. For example, if the output array contains only 0s and 1s, np.bincount() will return the count of 0s as the first element and the count of 1s as the second element.

# example
np.bincount([0,0,1,1,1,2])
array([2, 3, 1])
  • The number of 0s and 1s in yC.

np.bincount(yC)
array([642, 812])
  • You can also use the Counter() class from the collections module to count the number of occurrences of 0 and 1.

import collections
collections.Counter(yC)
Counter({1: 812, 0: 642})

Split Data#

In the following code, we will split the entire dataset into three different subsets:

  • Training Data (90%): This portion of the data is used to train and build the Machine Learning models.

  • Validation Data (5%): Defined by the validation ratio vr in the code below, this subset is used to evaluate and compare different models that were built using the training data. It helps in selecting the best-performing model and tuning hyperparameters.

  • Test Data (5%): The remaining portion of the data is used to assess the performance of the final model chosen based on validation results. This set is never used during training or model selection, only for the final performance check.

N = len(df_visa) # total number of rows

tr = 0.90 # train ratio
vr = (1-tr)/2 # validation ratio

ts = int(N*tr) # training size
vs = int(N*vr) # validation size

X_train, yR_train, yC_train = X[:ts], yR[:ts], yC[:ts]
X_valid, yR_valid, yC_valid = X[ts:ts+vs], yR[ts:ts+vs], yC[ts:ts+vs]
X_test , yR_test , yC_test  = X[ts+vs:], yR[ts+vs:], yC[ts+vs:]
X_train.shape, yR_train.shape, yC_train.shape
((1308, 10), (1308,), (1308,))
X_valid.shape, yR_valid.shape, yC_valid.shape
((72, 10), (72,), (72,))
X_test.shape, yR_test.shape, yC_test.shape
((74, 10), (74,), (74,))

Machine Learning#

In this section, we will use different Machine Learning algorithms to make predictions and evaluate their performance on the validation set for comparison.

For the models that we import from scikit-learn, including K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Multi-layer Perceptron (MLP), the fit() method performs the training step by using the input and output data from the training dataset.

  • Once trained, the predict() method generates predictions for the given input values.

  • For regression models, we use the Root Mean Squared Error (RMSE) to measure the difference between the predicted values and the actual values.

  • For classification models, the score() method combines the prediction step and evaluation, returning the accuracy score, which is the proportion of correctly classified samples (in this case, trading days).

This approach allows us to fairly compare different models by using the appropriate evaluation metric for the type of problem (regression or classification).
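
  • As a quick reference, both metrics can also be written out directly with NumPy. The following is a minimal sketch using made-up toy arrays, not the actual model outputs:

y_true = np.array([0.010, -0.020, 0.005])
y_pred = np.array([0.008, -0.015, 0.002])
np.sqrt(np.mean((y_true - y_pred)**2))   # RMSE: square root of the mean squared error

labels_true = np.array([1, 0, 1, 1])
labels_pred = np.array([1, 0, 0, 1])
np.mean(labels_true == labels_pred)      # accuracy: fraction of correctly classified samples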

For more information on Machine Learning algorithms, please check the following online book: Introduction to Machine Learning.

KNN#

For more information on KNN, please see the KNN chapter.

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error as rmse
knnR = KNeighborsRegressor()
knnR.fit(X_train, yR_train)
pred_valid = knnR.predict(X_valid)
rmse(pred_valid, yR_valid)
0.01686889097515809
knnC = KNeighborsClassifier()
knnC.fit(X_train, yC_train)
knnC.score(X_valid, yC_valid)
0.5277777777777778

Decision Tree#

For more information on Decision Trees, please see the Decision Tree chapter.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
dtR = DecisionTreeRegressor(random_state=0)
dtR.fit(X_train, yR_train)
pred_valid = dtR.predict(X_valid)
rmse(pred_valid, yR_valid)
0.023515643972095466
dtC = DecisionTreeClassifier()
dtC.fit(X_train, yC_train)
dtC.score(X_valid, yC_valid)
0.5416666666666666

Random Forest#

For more information on Random Forests, please see the Random Forest chapter.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
rfR = RandomForestRegressor(random_state=0)
rfR.fit(X_train, yR_train)
pred_valid = rfR.predict(X_valid)
rmse(pred_valid, yR_valid)
0.017484014073654453
rfC = RandomForestClassifier()
rfC.fit(X_train, yC_train)
rfC.score(X_valid, yC_valid)
0.5416666666666666

MLP#

For more information on the Multi-layer Perceptron, please see the MLP chapter.

from sklearn.neural_network import MLPClassifier, MLPRegressor
mlpR = MLPRegressor(random_state=0)
mlpR.fit(X_train, yR_train)
pred_valid = mlpR.predict(X_valid)
rmse(pred_valid, yR_valid)
0.015456051693813764
mlpC = MLPClassifier(random_state=0, max_iter=500)
mlpC.fit(X_train, yC_train)
mlpC.score(X_valid, yC_valid)
0.7083333333333334

Test Data#

Among the models we’ve evaluated so far, the MLP shows the best performance on the validation data. Now, let’s check the performance of this best model (MLP) on the test set. Since model selection is complete and the separate validation set is no longer needed, let’s retrain the MLP on the combined training + validation data and then evaluate it on the test data.

X_train.shape, X_valid.shape
((1308, 10), (72, 10))
  • NumPy’s vstack() function stacks arrays vertically, meaning it places one array below the other.

X_train_valid = np.vstack([X_train, X_valid])
X_train_valid.shape
(1380, 10)
yR_train.shape, yR_valid.shape
((1308,), (72,))
  • NumPy’s hstack() function stacks arrays horizontally, meaning it places one array to the right of the other.

yR_train_valid = np.hstack([yR_train, yR_valid])
yR_train_valid.shape
(1380,)
yC_train_valid = np.hstack([yC_train, yC_valid])
yC_train_valid.shape
(1380,)
mlpR = MLPRegressor(random_state=0)
mlpR.fit(X_train_valid, yR_train_valid)
pred_test = mlpR.predict(X_test)
rmse(pred_test, yR_test)
0.016404914243523118
mlpC = MLPClassifier(random_state=0, max_iter=1000)
mlpC.fit(X_train_valid, yC_train_valid)
mlpC.score(X_test, yC_test)
0.5540540540540541

Keras#

Keras is a high-level neural network library that runs on top of TensorFlow, offering a user-friendly API, similar in spirit to scikit-learn, for constructing neural networks in Python.

from tensorflow import keras

Feedforward (FNN / MLP)#

In feedforward neural network structures, data flows from the input to the output by passing through the neurons without returning to a previous neuron.

Regression#

The following model takes 10 input features (the 10 lagged returns) and consists of two hidden layers with 100 and 200 neurons respectively, followed by an output layer with a single neuron.

model = keras.models.Sequential([
    keras.layers.Input((10,)),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(200, activation='relu'),
    keras.layers.Dense(1)])
model.summary()
Model: "sequential_9"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_27 (Dense)                │ (None, 100)            │         1,100 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_28 (Dense)                │ (None, 200)            │        20,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_29 (Dense)                │ (None, 1)              │           201 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 21,501 (83.99 KB)
 Trainable params: 21,501 (83.99 KB)
 Non-trainable params: 0 (0.00 B)

In Keras, the compile() step is where you configure the model for training.

  • The optimizer, metrics, and loss function can all be specified.

In the following code, we set only the loss function to mean squared error.

model.compile(loss = 'mse')

For scikit-learn algorithms, the fit() method is used to train the model. Typically, the model is trained only on the training set, while the validation set is kept separate and not used directly during fitting.

  • The validation set plays a key role in the training process:

  • It is used to evaluate the model’s performance on unseen data while the training is still in progress.

  • It helps detect overfitting, since performance on training data may improve even while performance on validation data deteriorates.

  • It is commonly used for hyperparameter tuning, either manually or through techniques like GridSearchCV or RandomizedSearchCV, where cross-validation splits serve as validation sets.

In contrast, frameworks like Keras allow you to pass a validation set directly in the fit() method (e.g., validation_data=(x_val, y_val)), so that performance on the validation set is reported at the end of each training epoch.

model.fit(X_train, yR_train, validation_data=(X_valid, yR_valid));
41/41 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 2.5526e-04 - val_loss: 2.7879e-04
model.predict(X_test[:5])
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
array([[-0.00121496],
       [-0.00020361],
       [-0.00067773],
       [ 0.0009219 ],
       [ 0.00194272]], dtype=float32)
yR_test_predict = model.predict(X_test)
rmse(yR_test_predict , yR_test)
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 976us/step
0.016397975497796584

Classification#

Sigmoid#

In the binary case, meaning there are only two classes in the output, the output-layer activation can be chosen as the sigmoid function, which returns a single value representing the probability of being in class 1.
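
  • For reference, the sigmoid function is \(\sigma(x) = \frac{1}{1+e^{-x}}\), which maps any real number to the interval (0, 1). A minimal NumPy sketch, not used elsewhere in this notebook:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))   # squashes any real number into (0, 1)

sigmoid(0.0)   # 0.5: the model is equally unsure about the two classes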

model = keras.models.Sequential([
    keras.layers.Input((10,)),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(200, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')]) 
model.summary()
Model: "sequential_7"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_21 (Dense)                │ (None, 100)            │         1,100 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_22 (Dense)                │ (None, 200)            │        20,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_23 (Dense)                │ (None, 1)              │           201 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 21,501 (83.99 KB)
 Trainable params: 21,501 (83.99 KB)
 Non-trainable params: 0 (0.00 B)
model.compile(loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, yC_train, validation_data=(X_valid, yC_valid));
41/41 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.5386 - loss: 0.6911 - val_accuracy: 0.5972 - val_loss: 0.6898
model.predict(X_test[:5])
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
array([[0.55796814],
       [0.5584805 ],
       [0.5597468 ],
       [0.55970466],
       [0.56186867]], dtype=float32)
yC_test_pred = np.where(model.predict(X_test)>0.5, 1, 0)
yC_test_pred[:5]
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 
array([[1],
       [1],
       [1],
       [1],
       [1]])

Softmax#

The softmax activation function is used for multi-class classification tasks and returns the probability of each class.
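
  • Softmax exponentiates each raw output and divides by the sum of the exponentials, so the results are positive and add up to 1. A minimal NumPy sketch, independent of the Keras model below:

z = np.array([0.2, 1.0])        # raw outputs (logits) for the two classes
np.exp(z) / np.exp(z).sum()     # class probabilities that sum to 1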

model = keras.models.Sequential([
    keras.layers.Input((10,)),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(200, activation='relu'),
    keras.layers.Dense(2, activation='softmax')]) # binary case: only two classes
model.summary()
Model: "sequential_8"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_24 (Dense)                │ (None, 100)            │         1,100 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_25 (Dense)                │ (None, 200)            │        20,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_26 (Dense)                │ (None, 2)              │           402 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 21,702 (84.77 KB)
 Trainable params: 21,702 (84.77 KB)
 Non-trainable params: 0 (0.00 B)
model.compile(loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, yC_train, validation_data=(X_valid, yC_valid));
41/41 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5283 - loss: 0.6912 - val_accuracy: 0.5972 - val_loss: 0.6914
model.predict(X_test[:5])
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
array([[0.43344268, 0.56655735],
       [0.4247901 , 0.5752099 ],
       [0.42437348, 0.57562655],
       [0.42359856, 0.5764014 ],
       [0.4213894 , 0.57861066]], dtype=float32)
yC_test_pred = [np.argmax(i) for i in model.predict(X_test)]
yC_test_pred[:5]
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 874us/step
[1, 1, 1, 1, 1]

Recurrent Neural Network (RNN)#
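
In a recurrent neural network, the input is processed as a sequence: a hidden state is carried from one time step to the next, so earlier values in the sequence can influence the prediction made at later steps.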

Regression#

model = keras.models.Sequential([
    keras.layers.Input((None,1)),
    keras.layers.SimpleRNN(100, return_sequences=True),
    keras.layers.SimpleRNN(200),
    keras.layers.Dense(1)])
model.summary()
Model: "sequential_12"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ simple_rnn_4 (SimpleRNN)        │ (None, None, 100)      │        10,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ simple_rnn_5 (SimpleRNN)        │ (None, 200)            │        60,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_32 (Dense)                │ (None, 1)              │           201 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 70,601 (275.79 KB)
 Trainable params: 70,601 (275.79 KB)
 Non-trainable params: 0 (0.00 B)
model.compile(loss = 'mse')
model.fit(X_train, yR_train, validation_data=(X_valid, yR_valid));
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0403 - val_loss: 2.8208e-04
model.predict(X_test[:5])
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step
array([[0.00105258],
       [0.00105076],
       [0.00093284],
       [0.00070539],
       [0.00073511]], dtype=float32)
yR_test_predict = model.predict(X_test)
rmse(yR_test_predict , yR_test)
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step 
0.01601122283427286

Long Short-Term Memory (LSTM)#
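
An LSTM is a recurrent architecture whose gating mechanisms (input, forget, and output gates) control what information is kept in or removed from the cell state, which helps the network retain information over longer sequences than a simple RNN.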

Regression#

model = keras.models.Sequential([
    keras.layers.Input((None,1)),
    keras.layers.LSTM(100, return_sequences=True),
    keras.layers.LSTM(200),
    keras.layers.Dense(1)])
model.summary()
Model: "sequential_16"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm_12 (LSTM)                  │ (None, None, 100)      │        40,800 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_13 (LSTM)                  │ (None, 200)            │       240,800 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_36 (Dense)                │ (None, 1)              │           201 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 281,801 (1.07 MB)
 Trainable params: 281,801 (1.07 MB)
 Non-trainable params: 0 (0.00 B)
model.compile(loss = 'mse')
model.fit(X_train, yR_train, validation_data=(X_valid, yR_valid));
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 4.6015e-04 - val_loss: 3.3226e-04
model.predict(X_test[:5])
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 93ms/step
array([[0.00542205],
       [0.00540088],
       [0.00542711],
       [0.00555244],
       [0.00558671]], dtype=float32)
yR_test_predict = model.predict(X_test)
rmse(yR_test_predict , yR_test)
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step  
0.01676784618012104

Regression Model Construction#

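  • The following function builds an LSTM regression model with a configurable number of hidden layers (n_hiddens) and neurons per layer (n_neurons), so that these hyperparameters can be tuned later.
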
def build_model(n_hiddens=2, n_neurons=100, input_shape=1):
  model = keras.models.Sequential()
  model.add(keras.layers.InputLayer(shape=[None, input_shape]))

  for layer in range(n_hiddens-1):
    model.add(keras.layers.LSTM(n_neurons, return_sequences=True))  # intermediate LSTM layers return the full sequence

  model.add(keras.layers.LSTM(n_neurons))  # last LSTM layer returns only its final output
  model.add(keras.layers.Dense(1))         # single neuron for the regression output
  model.compile(loss='mse')
  return model
build_model(n_hiddens=5, n_neurons=1).summary()
Model: "sequential_17"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm_14 (LSTM)                  │ (None, None, 1)        │            12 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_15 (LSTM)                  │ (None, None, 1)        │            12 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_16 (LSTM)                  │ (None, None, 1)        │            12 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_17 (LSTM)                  │ (None, None, 1)        │            12 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_18 (LSTM)                  │ (None, 1)              │            12 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_37 (Dense)                │ (None, 1)              │             2 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 62 (248.00 B)
 Trainable params: 62 (248.00 B)
 Non-trainable params: 0 (0.00 B)
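  • The scikeras KerasRegressor wrapper gives a Keras model built by a function like build_model a scikit-learn style interface (fit(), predict(), get_params()), which is what allows it to be used with scikit-learn utilities such as RandomizedSearchCV below.
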
from scikeras.wrappers import KerasRegressor
lstm_keras_reg = KerasRegressor(build_model, n_hiddens=2, n_neurons=100)
lstm_keras_reg.fit(X_train, yR_train, validation_data=(X_valid, yR_valid), batch_size=16, epochs=2)

yR_test_predict = model.predict(X_test)
rmse(yR_test_predict , yR_test)
Epoch 1/2
82/82 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 4.0597e-04 - val_loss: 3.9824e-04
Epoch 2/2
82/82 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 3.4419e-04 - val_loss: 2.7584e-04
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
0.01676784618012104

RandomizedSearchCV#
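
RandomizedSearchCV samples n_iter hyperparameter combinations from param_distribs and evaluates each one with cv-fold cross-validation on the training data; the best combination found is then available through the best_params_ attribute.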

from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_hiddens": [1, 2, 3],
    "n_neurons": [50, 100, 150],
}

rnd_search_cv = RandomizedSearchCV(lstm_keras_reg, param_distribs, n_iter=5, cv=3)
rnd_search_cv.fit(X_train, yR_train, validation_data=(X_valid, yR_valid))
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - loss: 3.9042e-04 - val_loss: 2.8660e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - loss: 4.8826e-04 - val_loss: 3.3580e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - loss: 2.3440e-04 - val_loss: 6.1862e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - loss: 4.9173e-04 - val_loss: 2.8557e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - loss: 5.8294e-04 - val_loss: 3.8028e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - loss: 3.5741e-04 - val_loss: 3.7224e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - loss: 4.0899e-04 - val_loss: 3.9309e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - loss: 4.6213e-04 - val_loss: 3.1873e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - loss: 2.9718e-04 - val_loss: 2.8406e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 5.5350e-04 - val_loss: 3.9404e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - loss: 6.0026e-04 - val_loss: 5.7858e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - loss: 3.8712e-04 - val_loss: 8.2294e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - loss: 4.6308e-04 - val_loss: 4.4643e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 5.5476e-04 - val_loss: 2.8565e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
28/28 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - loss: 2.9283e-04 - val_loss: 3.8419e-04
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
41/41 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - loss: 4.7282e-04 - val_loss: 4.5706e-04
RandomizedSearchCV(cv=3,
                   estimator=KerasRegressor(model=<function build_model at 0x2b50ba3e0>, n_hiddens=2, n_neurons=100),
                   n_iter=5,
                   param_distributions={'n_hiddens': [1, 2, 3],
                                        'n_neurons': [50, 100, 150]})
rnd_search_cv.best_params_
{'n_neurons': 100, 'n_hiddens': 2}