Stock Prediction with BERT (2)
Using embeddings from a pre-trained BERT model in MXNet, this post shows how to predict the DJIA's adjusted closing prices.
Code Implementation
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
path = 'embedding_files/'
max_embedding = pd.read_json(path+'max_embedding.json')
min_embedding = pd.read_json(path+'min_embedding.json')
mean_embedding = pd.read_json(path+'mean_embedding.json')
sum_embedding = pd.read_json(path+'sum_embedding.json')
djia = pd.read_csv('data/DJIA_table.csv')
djia = djia.loc[:, ['Date', 'Open', 'Adj Close']].sort_values('Date').set_index('Date')
I only needed the Date, Open, and Adj Close columns from the DJIA data.
open_price = djia[['Open']]
adj_close_price = djia[['Adj Close']]
djia.head()
Date          Open          Adj Close
2008-08-08    11432.089844  11734.320312
2008-08-11    11729.669922  11782.349609
2008-08-12    11781.700195  11642.469727
2008-08-13    11632.809570  11532.959961
2008-08-14    11532.070312  11615.929688
max_embedding.head(1)
Date          Max
2008-08-08    [0.809297204, 0.5163459778, 0.3755577505, 0.59...
Since each value in the list is a feature, I redefined the dataframe by splitting the list into a separate column for each value.
def transform_data(tbl):
    # Expand each row's list of 768 values into separate feature columns
    tbl = pd.DataFrame(tbl.iloc[:, 0].tolist())
    tbl = tbl.set_index(djia.index)
    return tbl
max_embedding = transform_data(max_embedding)
min_embedding = transform_data(min_embedding)
sum_embedding = transform_data(sum_embedding)
mean_embedding = transform_data(mean_embedding)
max_embedding.head(1)
Date          0         1         2         3         ...  764       765       766       767
2008-08-08    0.809297  0.516346  0.375558  0.592091   ...  0.586465  0.883279  0.854595  0.175066

(1 row × 768 columns)
max_embedding.shape, open_price.shape, adj_close_price.shape
((1989, 768), (1989, 1), (1989, 1))
Next, I split the data into training and test sets, holding out the last 300 days as the test set.
def split_test(embedding, test_size):
    # The last `test_size` rows (most recent dates) become the test set
    embedding_test = embedding.iloc[-test_size:, :]
    embedding = embedding.iloc[:-test_size, :]
    return embedding_test, embedding
test_size = 300
max_embedding_test, max_embedding = split_test(max_embedding, test_size)
min_embedding_test, min_embedding = split_test(min_embedding, test_size)
sum_embedding_test, sum_embedding = split_test(sum_embedding, test_size)
mean_embedding_test, mean_embedding = split_test(mean_embedding, test_size)
combined_embedding = pd.concat((mean_embedding, max_embedding, min_embedding, sum_embedding), axis=1)
combined_embedding_test = pd.concat((mean_embedding_test, max_embedding_test, min_embedding_test, sum_embedding_test), axis=1)
open_test, open_price = split_test(open_price, test_size)
adj_close_test, adj_close_price = split_test(adj_close_price, test_size)
max_embedding.shape, combined_embedding.shape, open_price.shape, adj_close_price.shape
((1689, 768), (1689, 3072), (1689, 1), (1689, 1))
combined_embedding is another dataset I tried, to see how using all four embedding types together affects the prediction.
So there are a total of five models, each trained on different data.
Model Definition
Finally, below are my custom data loader and model.
def data_loader(data, batch_size, num_iter=100):
    # x : Embedding Values
    # y : Open Price
    # z : Adj Close Price
    x = data[0]
    y = data[1]
    z = data[2]
    # num_iter mini-batch iterations per epoch
    for _ in range(num_iter):
        idx = np.random.choice(np.arange(x.shape[0]), size=batch_size, replace=False)
        batch_x = x.iloc[idx, :]
        batch_y = y.iloc[idx]
        batch_z = z.iloc[idx]
        yield batch_x, batch_y, batch_z
class get_model():
    def __init__(self, learning_rate=1e-3, dropout_rate=.5):
        self.learning_rate = learning_rate
        self.dropout_rate = dropout_rate
        # BERT Embedding
        self.x = tf.placeholder(tf.float32, shape=(None, 768))
        # Open Price
        self.y = tf.placeholder(tf.float32, shape=(None, 1))
        # Adj Close Price
        self.z = tf.placeholder(tf.float32, shape=(None, 1))
        self.pred = self.run_model()
        self.loss = tf.sqrt(tf.losses.mean_squared_error(self.z, self.pred), name='loss')
        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name='optimizer').minimize(self.loss)
        self.saver = tf.train.Saver()

    def run_model(self):
        # Dense model
        layer1 = tf.contrib.layers.fully_connected(self.x, 1000)
        layer1 = tf.nn.dropout(layer1, rate=self.dropout_rate)
        layer1 = tf.layers.batch_normalization(layer1)
        layer2 = tf.contrib.layers.fully_connected(layer1, 500)
        layer2 = tf.nn.dropout(layer2, rate=self.dropout_rate)
        layer2 = tf.layers.batch_normalization(layer2)
        # Coefficient indicating how much the day's articles impact the open price
        layer3 = tf.contrib.layers.fully_connected(layer2, 1)
        layer4 = layer3 * self.y
        layer5 = tf.contrib.layers.fully_connected(layer4, 100)
        layer5 = tf.nn.dropout(layer5, rate=self.dropout_rate)
        layer5 = tf.layers.batch_normalization(layer5)
        output = tf.contrib.layers.fully_connected(layer5, 1)
        return output
The first through third layers extract a value indicating how much the given articles affect the same day's open price. That value is then multiplied by the open price (layer4) before the remaining layers produce the adjusted-close prediction.
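To make that multiplication step concrete, here is a tiny NumPy illustration of what layer4 = layer3 * self.y computes for a mini-batch of three days. The coefficient values are purely hypothetical, not outputs of the model.

# Toy illustration (hypothetical values) of layer4 = layer3 * self.y.
# coef stands in for the per-day impact value produced by layers 1-3;
# open_batch holds the matching open prices.
coef = np.array([[1.01], [0.98], [1.02]])
open_batch = np.array([[11432.1], [11729.7], [11781.7]])
scaled_open = coef * open_batch   # layers 5+ map this scaled open price to an Adj Close prediction
print(scaled_open)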
get_data below concatenates the embedding with the open price and splits them into training and validation sets.
def get_data(embedding):
    # The last column of X is the open price; everything before it is the embedding
    X = pd.concat((embedding, open_price), axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, adj_close_price, test_size=.2)
    return [X_train.iloc[:, :-1], X_train.iloc[:, -1:], y_train], [X_valid.iloc[:, :-1], X_valid.iloc[:, -1:], y_valid]
mean_data_train, mean_data_valid = get_data(mean_embedding)
max_data_train, max_data_valid = get_data(max_embedding)
min_data_train, min_data_valid = get_data(min_embedding)
sum_data_train, sum_data_valid = get_data(sum_embedding)
combined_data_train, combined_data_valid = get_data(combined_embedding)
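As a quick sanity check (my addition, not in the original notebook), one mini-batch can be drawn from data_loader to confirm the shapes match the model's placeholders.

# Draw a single mini-batch and verify the shapes line up with the placeholders.
batch_x, batch_y, batch_z = next(data_loader(mean_data_train, batch_size=16))
print(batch_x.shape, batch_y.shape, batch_z.shape)   # expected: (16, 768) (16, 1) (16, 1)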
Train
data_name = {'mean_embedding':[mean_data_train, mean_data_valid],
'max_embedding':[max_data_train, max_data_valid],
'min_embedding':[min_data_train, min_data_valid],
'sum_embedding':[sum_data_train, sum_data_valid]}
def train_model(embedding_name, learning_rate=1e-5, epochs=300, batch_size=16, dropout_rate=.5, load_params=True,
                verbose=True, save_model=True):
    data_train, data_valid = data_name[embedding_name]
    tf.reset_default_graph()
    model = get_model(learning_rate=learning_rate, dropout_rate=dropout_rate)
    # For plots
    train_losses = []
    valid_losses = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        if load_params:
            # Load Model
            try:
                print(f'------------- Attempting to Load {embedding_name} Model -------------')
                model.saver.restore(sess, f'./model/{embedding_name}_model.ckpt')
                print(f'------------- {embedding_name} Model Loaded -------------')
            except:
                print('Training New Model')
        else:
            print('Training New Model')
        # Train Model
        print('\n------------- Training Model -------------\n')
        for epoch in range(epochs):
            for x, y, z in data_loader(data_train, batch_size=batch_size):
                # x : embedding, y : open price, z : adj close price
                train_loss, _ = sess.run([model.loss, model.optimizer], feed_dict={model.x: x,
                                                                                   model.y: y,
                                                                                   model.z: z})
            valid_loss = sess.run(model.loss, feed_dict={model.x: data_valid[0],
                                                         model.y: data_valid[1],
                                                         model.z: data_valid[2]})
            # print losses
            if verbose:
                print(f'Epoch {epoch+1}/{epochs}, Train RMSE Loss {train_loss}, Valid RMSE Loss {valid_loss}')
            # Save Model at every 20 epochs
            if save_model:
                if (epoch+1) % 20 == 0 and epoch > 0:
                    if not os.path.exists('./model'):
                        os.mkdir('./model/')
                    model.saver.save(sess, f"./model/{embedding_name}_model.ckpt")
                    print('\n------------- Model Saved -------------\n')
            train_losses.append(train_loss)
            valid_losses.append(valid_loss)
    return model, train_losses, valid_losses
# Possible Names : mean_embedding, max_embedding, min_embedding, sum_embedding
epochs = 300
learning_rate = 1e-4
mean_model, mean_train_loss, mean_valid_loss = train_model('mean_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)
max_model, max_train_loss, max_valid_loss = train_model('max_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)
min_model, min_train_loss, min_valid_loss = train_model('min_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)
sum_model, sum_train_loss, sum_valid_loss = train_model('sum_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)
Running the four models took about 25 minutes on a Surface Pro 4.
New Model for Combined Dataset
Since combined_embedding has a different shape (3,072 features instead of 768), I created a new model.
class combined_model():
    def __init__(self, learning_rate=1e-3, dropout_rate=.5):
        self.learning_rate = learning_rate
        self.dropout_rate = dropout_rate
        # BERT Embedding (all four embedding types concatenated)
        self.x = tf.placeholder(tf.float32, shape=(None, 3072))
        # Open Price
        self.y = tf.placeholder(tf.float32, shape=(None, 1))
        # Adj Close Price
        self.z = tf.placeholder(tf.float32, shape=(None, 1))
        self.pred = self.run_model()
        self.loss = tf.sqrt(tf.losses.mean_squared_error(self.z, self.pred), name='loss')
        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name='optimizer').minimize(self.loss)
        self.saver = tf.train.Saver()

    def run_model(self):
        # Dense layers
        layer1 = tf.contrib.layers.fully_connected(self.x, 1000)
        layer1 = tf.nn.dropout(layer1, rate=self.dropout_rate)
        layer1 = tf.layers.batch_normalization(layer1)
        layer2 = tf.contrib.layers.fully_connected(layer1, 500)
        layer2 = tf.nn.dropout(layer2, rate=self.dropout_rate)
        layer2 = tf.layers.batch_normalization(layer2)
        # Coefficient of impact values
        layer3 = tf.contrib.layers.fully_connected(layer2, 1)
        layer4 = layer3 * self.y
        layer5 = tf.contrib.layers.fully_connected(layer4, 100)
        layer5 = tf.nn.dropout(layer5, rate=self.dropout_rate)
        layer5 = tf.layers.batch_normalization(layer5)
        output = tf.contrib.layers.fully_connected(layer5, 1)
        return output
Training the combined model took about 4 minutes on a Surface Pro 4.
tf.reset_default_graph()
model = combined_model(learning_rate=1e-4, dropout_rate=.5)
epochs = 300
combined_train_losses = []
combined_valid_losses = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    try:
        print(f'------------- Attempting to Load Combined Model -------------')
        model.saver.restore(sess, f'./model/combined_model.ckpt')
        print(f'------------- Combined Model Loaded -------------')
    except:
        print('Training New Model')
    # Train Model
    print('\n------------- Training Model -------------\n')
    for epoch in range(epochs):
        for x, y, z in data_loader(combined_data_train, batch_size=16):
            train_loss, _ = sess.run([model.loss, model.optimizer], feed_dict={model.x: x,
                                                                               model.y: y,
                                                                               model.z: z})
        valid_loss = sess.run(model.loss, feed_dict={model.x: combined_data_valid[0],
                                                     model.y: combined_data_valid[1],
                                                     model.z: combined_data_valid[2]})
        if epoch % 20 == 0:
            print(f'Epoch {epoch+1}/{epochs}, Combined Train RMSE Loss {train_loss}, Combined Valid RMSE Loss {valid_loss}')
            if not os.path.exists('./model'):
                os.mkdir('./model/')
            model.saver.save(sess, f"./model/combined_model.ckpt")
            print('\n------------- Model Saved -------------\n')
        combined_train_losses.append(train_loss)
        combined_valid_losses.append(valid_loss)
------------- Attempting to Load Combined Model -------------
Training New Model
------------- Training Model -------------
Epoch 1/100, Train RMSE Loss 14868.9697265625, Valid RMSE Loss 12097.6728515625
------------- Model Saved -------------
Epoch 21/100, Train RMSE Loss 3311.416015625, Valid RMSE Loss 3776.328125
------------- Model Saved -------------
Epoch 41/100, Train RMSE Loss 2456.6015625, Valid RMSE Loss 2350.877685546875
------------- Model Saved -------------
Epoch 61/100, Train RMSE Loss 2424.5302734375, Valid RMSE Loss 2064.84033203125
------------- Model Saved -------------
Epoch 81/100, Train RMSE Loss 1417.530029296875, Valid RMSE Loss 2210.999755859375
------------- Model Saved -------------
Losses of each model
Here are the loss plots for each model.
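A minimal sketch of how such plots can be drawn from the loss lists returned by train_model (the exact styling here is my own, not necessarily what the original plots used):

# Plot train vs. validation RMSE per epoch for every embedding type.
losses = {'mean': (mean_train_loss, mean_valid_loss),
          'max': (max_train_loss, max_valid_loss),
          'min': (min_train_loss, min_valid_loss),
          'sum': (sum_train_loss, sum_valid_loss),
          'combined': (combined_train_losses, combined_valid_losses)}
fig, axes = plt.subplots(len(losses), 1, figsize=(8, 18), sharex=True)
for ax, (name, (train, valid)) in zip(axes, losses.items()):
    ax.plot(train, label='Train RMSE')
    ax.plot(valid, label='Valid RMSE')
    ax.set_title(f'{name} embedding')
    ax.legend()
plt.tight_layout()
plt.show()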
Prediction of each model
Now let's predict.
embedding_data = {'mean_embedding':mean_embedding_test,
'max_embedding':max_embedding_test,
'min_embedding':min_embedding_test,
'sum_embedding':sum_embedding_test}
def predict_model(embedding_name):
    tf.reset_default_graph()
    data = embedding_data[embedding_name]
    model = get_model(learning_rate=1e-5)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Load Model
        try:
            print(f'------------- Attempting to Load {embedding_name} Model -------------')
            model.saver.restore(sess, f'./model/{embedding_name}_model.ckpt')
            print('------------- Model Loaded -------------')
        except:
            pass
        pred = sess.run(model.pred, feed_dict={model.x: data,
                                               model.y: open_test})
    return model, pred
mean_model, mean_pred = predict_model('mean_embedding')
max_model, max_pred = predict_model('max_embedding')
sum_model, sum_pred = predict_model('sum_embedding')
min_model, min_pred = predict_model('min_embedding')
------------- Attempting to Load mean_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/mean_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load max_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/max_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load sum_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/sum_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load min_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/min_embedding_model.ckpt
------------- Model Loaded -------------
tf.reset_default_graph()
model = combined_model(learning_rate=1e-5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Load Model
    try:
        print(f'------------- Attempting to Load Combined Model -------------')
        model.saver.restore(sess, f'./model/combined_model.ckpt')
        print('------------- Model Loaded -------------')
    except:
        pass
    combined_pred = sess.run(model.pred, feed_dict={model.x: combined_embedding_test,
                                                    model.y: open_test})
------------- Attempting to Load Combined Model -------------
INFO:tensorflow:Restoring parameters from ./model/combined_model.ckpt
------------- Model Loaded -------------
mean_pred = mean_pred.flatten()
max_pred = max_pred.flatten()
min_pred = min_pred.flatten()
sum_pred = sum_pred.flatten()
combined_pred = combined_pred.flatten()
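To compare the models quantitatively and visualize the predictions against the actual prices, the following sketch (my addition, not from the original notebook) computes the test-set RMSE of each model and plots two of the predictions:

# Test-set RMSE of each model against the actual Adj Close prices.
actual = adj_close_test.values.flatten()
for name, pred in [('mean', mean_pred), ('max', max_pred), ('min', min_pred),
                   ('sum', sum_pred), ('combined', combined_pred)]:
    rmse = np.sqrt(np.mean((pred - actual) ** 2))
    print(f'{name} embedding test RMSE: {rmse:.2f}')

# Visual comparison on the 300-day test window.
plt.figure(figsize=(12, 6))
plt.plot(actual, label='Actual Adj Close')
plt.plot(mean_pred, label='mean embedding prediction')
plt.plot(combined_pred, label='combined embedding prediction')
plt.legend()
plt.show()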
Conclusion
We can see that the predicted values have high variance and fluctuate quite a bit. However, the models were still able to capture the general trend of the prices. Since it is impossible to predict prices with 100% accuracy, models like these should only be used as a general guide.
One way to improve the model is to set a threshold that limits how much the predicted price can change over a day. For example, we can set it to 10,000 so that the prediction cannot move by more than that amount. A minimal sketch of that idea follows.
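This sketch is my own interpretation of the threshold: each prediction is clipped so it cannot move more than a fixed amount away from the same day's open price.

# Clip predictions to stay within +/- max_change of the same day's open price.
max_change = 10000          # threshold from the example above; a smaller value would be more realistic
open_values = open_test.values.flatten()
clipped_pred = np.clip(mean_pred, open_values - max_change, open_values + max_change)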
Also, the news I used may not be (and most likely is not) related to the DJIA. Using news that is closely related to the index could also improve performance.
You can find the checkpoints I saved and all of the code here.
Thank you for reading the post, and if there is any mistake, please let me know!