Stock Prediction with BERT (2)

Using pre-trained BERT embeddings from MXNet, this post shows how to predict the DJIA's adjusted closing prices.

Code Implementation

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

# Pre-computed BERT sentence embeddings, pooled four different ways
path = 'embedding_files/'

max_embedding = pd.read_json(path + 'max_embedding.json')
min_embedding = pd.read_json(path + 'min_embedding.json')
mean_embedding = pd.read_json(path + 'mean_embedding.json')
sum_embedding = pd.read_json(path + 'sum_embedding.json')

# Keep only the needed columns and index the prices by date
djia = pd.read_csv('data/DJIA_table.csv')
djia = djia.loc[:, ['Date', 'Open', 'Adj Close']].sort_values('Date').set_index('Date')

I only needed the Date, Open, and Adj Close columns from the djia dataframe.

open_price = djia[['Open']]
adj_close_price = djia[['Adj Close']]

                    Open     Adj Close
Date
2008-08-08  11432.089844  11734.320312
2008-08-11  11729.669922  11782.349609
2008-08-12  11781.700195  11642.469727
2008-08-13  11632.809570  11532.959961
2008-08-14  11532.070312  11615.929688

Each embedding dataframe stores one pooled BERT vector per date as a list. Here is the head of max_embedding:

                                                          Max
Date
2008-08-08  [0.809297204, 0.5163459778, 0.3755577505, 0.59...

Since each value in the list is a feature, I redefined the dataframe by splitting each list into its own set of columns.
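A minimal sketch of that step, assuming the pooled vector sits in a column named Max and the frame is indexed by Date (the same pattern applies to the other three embeddings):

# Each list becomes a row of 768 separate feature columns;
# the Date index is preserved.
max_embedding = pd.DataFrame(
    max_embedding['Max'].tolist(),
    index=max_embedding.index
)

The expanded dataframe then has 768 numbered columns per date: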

                 0         1         2         3         4        5         6         7         8         9
Date
2008-08-08  0.809297  0.516346  0.375558  0.592091  0.372241  0.27578  0.672928  0.902444  1.321722  0.690093

(columns 10 to 757 omitted)

               758       759       760       761       762       763       764       765       766       767
Date
2008-08-08  0.414205  0.687436  0.144865  0.403365  0.304636  0.796824  0.586465  0.883279  0.854595  0.175066

Next, I split them into training and testing sets.

combined_embedding is another dataset I built to see how using all of their features together affects the open price.

So there are a total of five different models, each trained on different data.
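A minimal sketch of these two steps, assuming the four expanded embedding frames share the same Date index; the variable names and the chronological (unshuffled) split are my own choices, not necessarily the original ones:

# Concatenate all four pooled embeddings side by side
combined_embedding = pd.concat(
    [max_embedding, min_embedding, mean_embedding, sum_embedding],
    axis=1
)

# Hold out the last 20% of days for testing, keeping time order
train_emb, test_emb = train_test_split(max_embedding, test_size=0.2, shuffle=False)
train_price, test_price = train_test_split(djia['Adj Close'], test_size=0.2, shuffle=False)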

Model Definition

Finally, below is my custom data loader and model.

The first through third layers extract a value that indicates how much the given day's articles affect the same day's open price.

get_data below concatenates the embedding with the open price and splits them into training and validation sets.
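A minimal sketch of what get_data and such a model could look like, assuming a plain Keras Sequential network that takes the 768 embedding features plus the open price as input and predicts Adj Close; the layer sizes and hyperparameters here are my own guesses, not the exact original configuration.

def get_data(embedding, open_price, adj_close_price, val_ratio=0.1):
    # Concatenate the pooled embedding with the same day's open price
    features = pd.concat([embedding, open_price], axis=1).values
    target = adj_close_price.values
    # Split off the last portion of days for validation (time order kept)
    return train_test_split(features, target, test_size=val_ratio, shuffle=False)

def build_model(n_features):
    # The first three Dense layers condense the article embedding into a
    # small signal; the final layer maps it to the predicted price.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu', input_shape=(n_features,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model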

Train

Running the four models took about 25 minutes on a Surface Pro 4.
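A sketch of a training loop over the four embedding datasets, assuming the get_data and build_model helpers sketched above; the epoch count, batch size, and checkpoint paths are assumptions.

histories = {}
datasets = {'max': max_embedding, 'min': min_embedding,
            'mean': mean_embedding, 'sum': sum_embedding}

for name, embedding in datasets.items():
    X_train, X_val, y_train, y_val = get_data(embedding, open_price, adj_close_price)
    model = build_model(X_train.shape[1])
    # Keep the loss curves so they can be plotted later
    histories[name] = model.fit(X_train, y_train,
                                validation_data=(X_val, y_val),
                                epochs=100, batch_size=32, verbose=0)
    # Hypothetical checkpoint path
    model.save_weights(f'checkpoints/{name}_model.ckpt')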

New Model for Combined Dataset

Since combined_embedding has a different shape, I created a new model for it.
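A sketch of what the combined model could look like: the combined embedding has four times as many features (4 x 768, plus the open price), so I assume a wider first layer. The sizes are again my own guesses.

def build_combined_model(n_features):
    # Same idea as before, with a wider first layer to absorb the
    # 4 x 768 concatenated embedding plus the open price
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation='relu', input_shape=(n_features,)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

X_train, X_val, y_train, y_val = get_data(combined_embedding, open_price, adj_close_price)
combined_model = build_combined_model(X_train.shape[1])
combined_history = combined_model.fit(X_train, y_train,
                                      validation_data=(X_val, y_val),
                                      epochs=100, batch_size=32, verbose=0)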

The combined model took about 4 minutes on a Surface Pro 4.

Losses of each model

Here are the resulting loss plots.
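A minimal sketch of the plotting step, assuming the histories dictionary and combined_history from the training sketches above:

# Training/validation loss curves for the four embedding models
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, (name, history) in zip(axes.ravel(), histories.items()):
    ax.plot(history.history['loss'], label='train')
    ax.plot(history.history['val_loss'], label='validation')
    ax.set_title(name)
    ax.legend()
plt.show()

# Loss curve for the combined model
plt.plot(combined_history.history['loss'], label='train')
plt.plot(combined_history.history['val_loss'], label='validation')
plt.legend()
plt.show()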

[Figure: Losses for each model]
[Figure: Loss for the combined model]

Prediction of each model

Now let's predict.
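A sketch of the prediction and plotting step for one of the models, assuming the validation split from get_data and the hypothetical checkpoint paths used above; the same pattern applies to the other models.

# Predict adjusted close prices on the held-out data and compare to the truth
X_train, X_val, y_train, y_val = get_data(mean_embedding, open_price, adj_close_price)
model = build_model(X_train.shape[1])
model.load_weights('checkpoints/mean_model.ckpt')  # hypothetical path

predictions = model.predict(X_val)

plt.plot(y_val, label='actual')
plt.plot(predictions, label='predicted')
plt.legend()
plt.show()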

[Figure: Predictions for each model]
[Figure: Predictions for the combined model]

Conclusion

We can see that the predicted values have high variance and fluctuate considerably. However, the models were still able to capture the general trend of the prices. Since it is impossible to predict prices with 100% accuracy, models like these should only be used as a general guide.

One way to improve the model is to set a threshold that limits how much the predicted price can change in a day. For example, we could set it to 10,000 so that the prediction never moves by more than that amount.
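A minimal sketch of that idea, clipping each predicted day-over-day change to a maximum magnitude; the function and its names are hypothetical, and the threshold is just the example figure above.

def clip_daily_change(predictions, previous_prices, threshold=10_000):
    # Limit how far each prediction may move from the previous day's price
    change = np.clip(predictions - previous_prices, -threshold, threshold)
    return previous_prices + change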

Also, the news I used may not be (and most likely is not) directly related to the DJIA. Using news that is closely related to it could also improve performance.

You can find the checkpoints I saved and all of the code here.

Thank you for reading the post, and if I made any mistakes, please let me know!
