Stock Prediction with BERT (2)

Using news-headline embeddings from a pre-trained BERT model in MXNet, this post shows how to predict the DJIA's adjusted closing prices.
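The embeddings themselves were generated in the previous post. As a refresher, a minimal sketch of the pooling step, assuming the MXNet-based bert-embedding package and one day's list of headlines, might look like the following (the embed_day function is illustrative, not the original code):

import numpy as np
from bert_embedding import BertEmbedding

bert = BertEmbedding()  # 12-layer, 768-dimensional uncased BERT by default

def embed_day(headlines):

    # bert(...) returns a (tokens, token_vectors) pair per sentence;
    # stack every token vector for the day, then pool along the token axis
    token_vectors = np.vstack([np.array(vecs) for _, vecs in bert(headlines)])

    return {'max': token_vectors.max(axis=0),
            'min': token_vectors.min(axis=0),
            'mean': token_vectors.mean(axis=0),
            'sum': token_vectors.sum(axis=0)}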

Code Implementation

import os

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
path = 'embedding_files/'

max_embedding = pd.read_json(path+'max_embedding.json')
min_embedding = pd.read_json(path+'min_embedding.json')
mean_embedding = pd.read_json(path+'mean_embedding.json')
sum_embedding = pd.read_json(path+'sum_embedding.json')

djia = pd.read_csv('data/DJIA_table.csv')
djia = djia.loc[:, ['Date', 'Open', 'Adj Close']].sort_values('Date').set_index('Date')

I only needed the Date, Open, and Adj Close columns from the DJIA data.

open_price = djia[['Open']]
adj_close_price = djia[['Adj Close']]
djia.head()

                    Open     Adj Close
Date
2008-08-08  11432.089844  11734.320312
2008-08-11  11729.669922  11782.349609
2008-08-12  11781.700195  11642.469727
2008-08-13  11632.809570  11532.959961
2008-08-14  11532.070312  11615.929688

max_embedding.head(1)

                                                          Max
2008-08-08  [0.809297204, 0.5163459778, 0.3755577505, 0.59...

Since each value in the list is a feature, I redefined the dataframe by splitting the list into separate columns.

def transform_data(tbl):

    # Expand each row's list of 768 values into separate columns,
    # keeping the DJIA dates as the index
    tbl = pd.DataFrame(tbl.iloc[:, 0].tolist())
    tbl = tbl.set_index(djia.index)

    return tbl
max_embedding = transform_data(max_embedding)

min_embedding = transform_data(min_embedding)

sum_embedding = transform_data(sum_embedding)

mean_embedding = transform_data(mean_embedding)
max_embedding.head(1)

                   0         1         2         3         4        5         6         7         8         9  ...       758       759       760       761       762       763       764       765       766       767
Date
2008-08-08  0.809297  0.516346  0.375558  0.592091  0.372241  0.27578  0.672928  0.902444  1.321722  0.690093  ...  0.414205  0.687436  0.144865  0.403365  0.304636  0.796824  0.586465  0.883279  0.854595  0.175066

[1 rows x 768 columns]

max_embedding.shape, open_price.shape, adj_close_price.shape
((1989, 768), (1989, 1), (1989, 1))

Next, I split the data into training and test sets, holding out the last 300 days for testing.

def split_test(embedding, test_size):

    # Hold out the most recent `test_size` rows as the test set
    embedding_test = embedding.iloc[-test_size:, :]
    embedding = embedding.iloc[:-test_size, :]

    return embedding_test, embedding
test_size = 300

max_embedding_test, max_embedding = split_test(max_embedding, test_size)
min_embedding_test, min_embedding = split_test(min_embedding, test_size)
sum_embedding_test, sum_embedding = split_test(sum_embedding, test_size)
mean_embedding_test, mean_embedding = split_test(mean_embedding, test_size)

combined_embedding = pd.concat((mean_embedding, max_embedding, min_embedding, sum_embedding), axis=1)
combined_embedding_test = pd.concat((mean_embedding_test, max_embedding_test, min_embedding_test, sum_embedding_test), axis=1)

open_test, open_price = split_test(open_price, test_size)
adj_close_test, adj_close_price = split_test(adj_close_price, test_size)
max_embedding.shape, combined_embedding.shape, open_price.shape, adj_close_price.shape
((1689, 768), (1689, 3072), (1689, 1), (1689, 1))

combined_embedding is another dataset I tried, to see how using all four embeddings' features together affects the prediction.

So there are five models in total, each trained on a different dataset.

Model Definition

Finally, below are my custom data loader and model.

def data_loader(data, batch_size, num_iter=100):

    # x : Embedding Values
    # y : Open Price
    # z : Close Price

    x = data[0]
    y = data[1]
    z = data[2]

    # num_iter iterations per epoch
    # mini batch

    for _ in range(num_iter):

        idx = np.random.choice(np.arange(x.shape[0]), size=batch_size, replace=False)

        batch_x = x.iloc[idx, :]
        batch_y = y.iloc[idx]
        batch_z = z.iloc[idx]

        yield batch_x, batch_y, batch_z
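For instance, a quick smoke test of the loader with dummy frames (illustrative only, not part of the original run):

dummy = [pd.DataFrame(np.random.rand(50, 768)),
         pd.DataFrame(np.random.rand(50, 1)),
         pd.DataFrame(np.random.rand(50, 1))]

batch_x, batch_y, batch_z = next(data_loader(dummy, batch_size=16, num_iter=1))
batch_x.shape, batch_y.shape, batch_z.shape  # ((16, 768), (16, 1), (16, 1))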


class get_model():

    def __init__(self, learning_rate=1e-3, dropout_rate=.5):

        self.learning_rate = learning_rate
        self.dropout_rate = dropout_rate

        # BERT Embedding
        self.x = tf.placeholder(tf.float32, shape=(None, 768))
        # Open Price
        self.y = tf.placeholder(tf.float32, shape=(None, 1))
        # Adj Close Price
        self.z = tf.placeholder(tf.float32, shape=(None, 1))

        self.pred = self.run_model()

        self.loss = tf.sqrt(tf.losses.mean_squared_error(self.z, self.pred), name='loss')

        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name='optimizer').minimize(self.loss)

        self.saver = tf.train.Saver()

    def run_model(self):

        # Dense model
        layer1 = tf.contrib.layers.fully_connected(self.x, 1000)
        layer1 = tf.nn.dropout(layer1, rate=self.dropout_rate)
        layer1 = tf.layers.batch_normalization(layer1)

        layer2 = tf.contrib.layers.fully_connected(layer1, 500)
        layer2 = tf.nn.dropout(layer2, rate=self.dropout_rate)
        layer2 = tf.layers.batch_normalization(layer2)

        # layer3 produces a single coefficient indicating how much the
        # day's articles impact that day's open price
        layer3 = tf.contrib.layers.fully_connected(layer2, 1)

        # Scale the open price by the learned coefficient
        layer4 = layer3 * self.y

        layer5 = tf.contrib.layers.fully_connected(layer4, 100)
        layer5 = tf.nn.dropout(layer5, rate=self.dropout_rate)
        layer5 = tf.layers.batch_normalization(layer5)

        output = tf.contrib.layers.fully_connected(layer5, 1)

        return output

Layers one through three extract a coefficient indicating how much the given day's articles affect that day's open price. Layer4 multiplies this coefficient by the open price, and the remaining layers map the scaled value to the adjusted close; in other words, pred = g(c(x) * open), where c(x) is the scalar from layers 1-3 and g is the head formed by layer5 and the output layer.

get_data below concatenates the embedding with the open price and splits them into training and validation sets. (Note that train_test_split shuffles rows, so the validation days are sampled at random rather than held out chronologically; the 300-day test set from earlier remains strictly chronological.)

def get_data(embedding):

    X = pd.concat((embedding, open_price), axis=1)

    X_train, X_valid, y_train, y_valid = train_test_split(X, adj_close_price, test_size=.2)

    return [X_train.iloc[:, :-1], X_train.iloc[:, -1:], y_train], [X_valid.iloc[:, :-1], X_valid.iloc[:, -1:], y_valid]
mean_data_train, mean_data_valid = get_data(mean_embedding)
max_data_train, max_data_valid = get_data(max_embedding)
min_data_train, min_data_valid = get_data(min_embedding)
sum_data_train, sum_data_valid = get_data(sum_embedding)

combined_data_train, combined_data_valid = get_data(combined_embedding)

Train

data_name = {'mean_embedding':[mean_data_train, mean_data_valid],
            'max_embedding':[max_data_train, max_data_valid],
            'min_embedding':[min_data_train, min_data_valid],
            'sum_embedding':[sum_data_train, sum_data_valid]}

def train_model(embedding_name, learning_rate=1e-5, epochs=300, batch_size=16, dropout_rate=.5, load_params=True,
               verbose=True, save_model=True):

    data_train, data_valid = data_name[embedding_name]

    tf.reset_default_graph()

    model = get_model(learning_rate=learning_rate, dropout_rate=dropout_rate)

    # For plots
    train_losses = []
    valid_losses = []


    with tf.Session() as sess:

        sess.run(tf.global_variables_initializer())

        if load_params:
        # Load Model
            try:
                print(f'------------- Attempting to Load {embedding_name} Model -------------')
                model.saver.restore(sess, f'./model/{embedding_name}_model.ckpt')
                print(f'------------- {embedding_name} Model Loaded -------------')
            except Exception:
                print('Training New Model')
        else:
            print('Training New Model')

        # Train Model
        print('\n------------- Training Model -------------\n')
        for epoch in range(epochs):

            for x, y, z in data_loader(data_train, batch_size=batch_size):

                train_loss, _ = sess.run([model.loss, model.optimizer], feed_dict={model.x:x, 
                                                                             model.y:y, 
                                                                             model.z:z})

            # x : embedding, y : open price, z : close price
            valid_loss = sess.run(model.loss, feed_dict={model.x:data_valid[0], 
                                                         model.y:data_valid[1], 
                                                         model.z:data_valid[2]})

            # print losses
            if verbose:
                print(f'Epoch {epoch+1}/{epochs},  Train RMSE Loss {train_loss}, Valid RMSE Loss {valid_loss}')



            # Save Model at every 20 epochs
            if save_model:

                if (epoch + 1) % 20 == 0:
                    if not os.path.exists('./model'):
                        os.mkdir('./model')

                    model.saver.save(sess, f"./model/{embedding_name}_model.ckpt")
                    print('\n------------- Model Saved -------------\n')

            train_losses.append(train_loss)
            valid_losses.append(valid_loss)


    return model, train_losses, valid_losses
# Possible Names : mean_embedding, max_embedding, min_embedding, sum_embedding
epochs = 300
learning_rate = 1e-4

mean_model, mean_train_loss, mean_valid_loss = train_model('mean_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)
max_model, max_train_loss, max_valid_loss = train_model('max_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)
min_model, min_train_loss, min_valid_loss = train_model('min_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)
sum_model, sum_train_loss, sum_valid_loss = train_model('sum_embedding', epochs=epochs, learning_rate=learning_rate, load_params=False, verbose=False)

Training all four models took about 25 minutes on a Surface Pro 4.

New Model for Combined Dataset

Since combined_embedding has a different input shape (3,072 features instead of 768), I created a new model.

class combined_model():

    def __init__(self, learning_rate=1e-3, dropout_rate=.5):

        self.learning_rate = learning_rate
        self.dropout_rate = dropout_rate

        # BERT Embedding
        self.x = tf.placeholder(tf.float32, shape=(None, 3072))
        # Open Price
        self.y = tf.placeholder(tf.float32, shape=(None, 1))
        # Adj Close Price
        self.z = tf.placeholder(tf.float32, shape=(None, 1))

        self.pred = self.run_model()

        self.loss = tf.sqrt(tf.losses.mean_squared_error(self.z, self.pred), name='loss')

        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name='optimizer').minimize(self.loss)

        self.saver = tf.train.Saver()

    def run_model(self):

        # Dense layers
        layer1 = tf.contrib.layers.fully_connected(self.x, 1000)
        layer1 = tf.nn.dropout(layer1, rate=self.dropout_rate)
        layer1 = tf.layers.batch_normalization(layer1)

        layer2 = tf.contrib.layers.fully_connected(layer1, 500)
        layer2 = tf.nn.dropout(layer2, rate=self.dropout_rate)
        layer2 = tf.layers.batch_normalization(layer2)

        # Coefficient of impact values
        layer3 = tf.contrib.layers.fully_connected(layer2, 1)

        layer4 = layer3 * self.y

        layer5 = tf.contrib.layers.fully_connected(layer4, 100)
        layer5 = tf.nn.dropout(layer5, rate=self.dropout_rate)
        layer5 = tf.layers.batch_normalization(layer5)

        output = tf.contrib.layers.fully_connected(layer5, 1)

        return output
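In hindsight, the only difference from get_model is the placeholder width, so a single class taking the input dimension as an argument would avoid the duplication. A sketch (not the code I actually ran):

class price_model():

    def __init__(self, input_dim=768, learning_rate=1e-3, dropout_rate=.5):

        self.dropout_rate = dropout_rate

        # Only the embedding placeholder's width differs between the two models
        self.x = tf.placeholder(tf.float32, shape=(None, input_dim))
        self.y = tf.placeholder(tf.float32, shape=(None, 1))
        self.z = tf.placeholder(tf.float32, shape=(None, 1))

        self.pred = self.run_model()
        self.loss = tf.sqrt(tf.losses.mean_squared_error(self.z, self.pred))
        self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(self.loss)
        self.saver = tf.train.Saver()

    def run_model(self):

        layer = tf.contrib.layers.fully_connected(self.x, 1000)
        layer = tf.layers.batch_normalization(tf.nn.dropout(layer, rate=self.dropout_rate))
        layer = tf.contrib.layers.fully_connected(layer, 500)
        layer = tf.layers.batch_normalization(tf.nn.dropout(layer, rate=self.dropout_rate))

        coef = tf.contrib.layers.fully_connected(layer, 1)  # impact coefficient
        scaled = coef * self.y                              # scale the open price

        head = tf.contrib.layers.fully_connected(scaled, 100)
        head = tf.layers.batch_normalization(tf.nn.dropout(head, rate=self.dropout_rate))

        return tf.contrib.layers.fully_connected(head, 1)

# e.g. price_model(input_dim=768) or price_model(input_dim=3072)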

The combined model took about 4 minutes to train on a Surface Pro 4.

tf.reset_default_graph()

model = combined_model(learning_rate=1e-4, dropout_rate=.5)
epochs = 300

combined_train_losses = []
combined_valid_losses = []

with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())

    try:
        print('------------- Attempting to Load Combined Model -------------')
        model.saver.restore(sess, './model/combined_model.ckpt')
        print('------------- Combined Model Loaded -------------')

    except Exception:
        print('Training New Model')

    # Train Model
    print('\n------------- Training Model -------------\n')
    for epoch in range(epochs):

        for x, y, z in data_loader(combined_data_train, batch_size=16):

            train_loss, _ = sess.run([model.loss, model.optimizer], feed_dict={model.x:x, 
                                                                         model.y:y, 
                                                                         model.z:z})

        valid_loss = sess.run(model.loss, feed_dict={model.x:combined_data_valid[0], 
                                                     model.y:combined_data_valid[1], 
                                                     model.z:combined_data_valid[2]})

        if epoch % 20 == 0:

            print(f'Epoch {epoch+1}/{epochs},  Combined Train RMSE Loss {train_loss}, Combined Valid RMSE Loss {valid_loss}')

            if not os.path.exists('./model'):
                os.mkdir('./model/')

            model.saver.save(sess, f"./model/combined_model.ckpt")
            print('\n------------- Model Saved -------------\n')

        combined_train_losses.append(train_loss)
        combined_valid_losses.append(valid_loss)
------------- Attempting to Load Combined Model -------------
Training New Model

------------- Training Model -------------

Epoch 1/100,  Train RMSE Loss 14868.9697265625, Valid RMSE Loss 12097.6728515625

------------- Model Saved -------------

Epoch 21/100,  Train RMSE Loss 3311.416015625, Valid RMSE Loss 3776.328125

------------- Model Saved -------------

Epoch 41/100,  Train RMSE Loss 2456.6015625, Valid RMSE Loss 2350.877685546875

------------- Model Saved -------------

Epoch 61/100,  Train RMSE Loss 2424.5302734375, Valid RMSE Loss 2064.84033203125

------------- Model Saved -------------

Epoch 81/100,  Train RMSE Loss 1417.530029296875, Valid RMSE Loss 2210.999755859375

------------- Model Saved -------------

Losses of each model

Here are the loss plots for each model.
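They can be reproduced from the loss lists that train_model returns; a minimal sketch for one model:

plt.figure(figsize=(10, 5))
plt.plot(mean_train_loss, label='Train RMSE')
plt.plot(mean_valid_loss, label='Valid RMSE')
plt.title('mean_embedding Loss')
plt.xlabel('Epoch')
plt.ylabel('RMSE')
plt.legend()
plt.show()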

Prediction of each model

Now let's predict.

embedding_data = {'mean_embedding':mean_embedding_test,
                 'max_embedding':max_embedding_test,
                 'min_embedding':min_embedding_test,
                 'sum_embedding':sum_embedding_test}

def predict_model(embedding_name):

    tf.reset_default_graph()

    data = embedding_data[embedding_name]

    # Note: dropout has no training flag in this graph, so it stays active at prediction time
    model = get_model(learning_rate=1e-5)

    with tf.Session() as sess:

        sess.run(tf.global_variables_initializer())

        # Load Model
        try:
            print(f'------------- Attempting to Load {embedding_name} Model -------------')
            model.saver.restore(sess, f'./model/{embedding_name}_model.ckpt')
            print('------------- Model Loaded -------------')

        except Exception:
            # Fall back to randomly initialized weights if no checkpoint exists
            pass


        pred = sess.run(model.pred, feed_dict={model.x:data, 
                                                    model.y:open_test})

    return model, pred
mean_model, mean_pred = predict_model('mean_embedding')
max_model, max_pred = predict_model('max_embedding')
sum_model, sum_pred = predict_model('sum_embedding')
min_model, min_pred = predict_model('min_embedding')
------------- Attempting to Load mean_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/mean_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load max_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/max_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load sum_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/sum_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load min_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/min_embedding_model.ckpt
------------- Model Loaded -------------
tf.reset_default_graph()

model = combined_model(learning_rate=1e-5)

with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())

    # Load Model
    try:
        print('------------- Attempting to Load Combined Model -------------')
        model.saver.restore(sess, './model/combined_model.ckpt')
        print('------------- Model Loaded -------------')

    except Exception:
        # Fall back to randomly initialized weights if no checkpoint exists
        pass


    combined_pred = sess.run(model.pred, feed_dict={model.x:combined_embedding_test, 
                                                model.y:open_test})
------------- Attempting to Load Combined Model -------------
INFO:tensorflow:Restoring parameters from ./model/combined_model.ckpt
------------- Model Loaded -------------
mean_pred = mean_pred.flatten()
max_pred = max_pred.flatten()
min_pred = min_pred.flatten()
sum_pred = sum_pred.flatten()

combined_pred = combined_pred.flatten()
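The predictions can then be plotted against the actual adjusted closing prices of the 300-day test window; a minimal sketch:

plt.figure(figsize=(12, 6))
plt.plot(adj_close_test.values.flatten(), label='Actual Adj Close')
plt.plot(mean_pred, label='mean_embedding')
plt.plot(max_pred, label='max_embedding')
plt.plot(min_pred, label='min_embedding')
plt.plot(sum_pred, label='sum_embedding')
plt.plot(combined_pred, label='combined')
plt.xlabel('Test Day')
plt.ylabel('Price')
plt.legend()
plt.show()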

Conclusion

We can see that the predicted values have high variance and fluctuate considerably; part of this noise likely comes from dropout staying active at prediction time, since the graph has no training flag. However, the models were still able to capture the general trend of the prices. Since it is impossible to predict prices with 100% accuracy, models like these should only be used as a general guide.

One way to improve the model is to add a threshold that limits how much the predicted price can change in a day. For example, we could cap the move at 10,000 points so a prediction never deviates from the open price by more than that amount; a sketch of this idea is shown below.
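A minimal sketch of that post-processing step (the clip_predictions helper and its limit argument are illustrative, not part of the original code):

def clip_predictions(pred, open_prices, limit=10000):

    # Keep each prediction within `limit` points of that day's open price
    opens = open_prices.values.flatten()

    return np.clip(pred, opens - limit, opens + limit)

clipped_mean_pred = clip_predictions(mean_pred, open_test)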

Also, the news I used may not be (and most likely is not) directly related to the DJIA. Using news that is closely related to the index could also improve performance.

You can find the checkpoints I saved and all of the code here.

Thank you for reading the post, and if I made any mistakes, please let me know!
