Stock Prediction with BERT (2)
Using embeddings from a pre-trained BERT model in MXNet, this post shows how to predict the DJIA's adjusted closing prices.
Code Implementation
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

path = 'embedding_files/'
max_embedding = pd.read_json(path+'max_embedding.json')
min_embedding = pd.read_json(path+'min_embedding.json')
mean_embedding = pd.read_json(path+'mean_embedding.json')
sum_embedding = pd.read_json(path+'sum_embedding.json')
djia = pd.read_csv('data/DJIA_table.csv')
djia = djia.loc[:, ['Date', 'Open', 'Adj Close']].sort_values('Date').set_index('Date')

I only needed the Date, Open, and Adj Close columns from the djia data.
open_price = djia[['Open']]
adj_close_price = djia[['Adj Close']]

Date        Open          Adj Close
2008-08-08  11432.089844  11734.320312
2008-08-11  11729.669922  11782.349609
2008-08-12  11781.700195  11642.469727
2008-08-13  11632.809570  11532.959961
2008-08-14  11532.070312  11615.929688
Date        Max
2008-08-08  [0.809297204, 0.5163459778, 0.3755577505, 0.59...
Since each value in the list is a feature, I redefined the dataframe so that each value gets its own column.
Date        0         1         2         3         4         5         6         7         8         9         ...  758       759       760       761       762       763       764       765       766       767
2008-08-08  0.809297  0.516346  0.375558  0.592091  0.372241  0.27578   0.672928  0.902444  1.321722  0.690093  ...  0.414205  0.687436  0.144865  0.403365  0.304636  0.796824  0.586465  0.883279  0.854595  0.175066
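The column split described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the helper name expand_embedding and the toy three-value rows are assumptions made for the example (the real rows hold 768 values each).

```python
import pandas as pd

# Toy stand-in for max_embedding: one list-valued column indexed by date.
# Real rows hold 768 values; three are used here to keep the example small.
max_embedding = pd.DataFrame(
    {"Max": [[0.81, 0.52, 0.38], [0.10, 0.20, 0.30]]},
    index=pd.Index(["2008-08-08", "2008-08-11"], name="Date"),
)


def expand_embedding(df, column):
    """Turn a column of equal-length lists into one numeric column per feature."""
    # tolist() gives a list of lists, which DataFrame reads as rows x features.
    return pd.DataFrame(df[column].tolist(), index=df.index)


max_features = expand_embedding(max_embedding, "Max")
print(max_features.shape)  # (2, 3)
```

The Date index is carried over unchanged, so the expanded features can still be aligned with the price data by date.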
Next, I separated them into training and testing sets.
combined_embedding is another dataset I tried, to see how using all of their features together affects the open price.
So there are now a total of five models, each trained on different data.
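As a rough sketch of these two steps, the snippet below builds a combined feature table and holds out the most recent rows for testing. Everything here is an assumption for illustration: the toy four-feature embeddings, the helper split_by_date, the 20% test fraction, and the choice of a chronological split are not taken from the original code.

```python
import numpy as np
import pandas as pd

dates = pd.Index(
    ["2008-08-08", "2008-08-11", "2008-08-12", "2008-08-13", "2008-08-14"],
    name="Date",
)
rng = np.random.default_rng(0)
names = ["max", "min", "mean", "sum"]

# Toy stand-ins for the four expanded embeddings (4 features instead of 768).
embeddings = [pd.DataFrame(rng.random((len(dates), 4)), index=dates) for _ in names]

# Place the four feature sets side by side; keys label each block of columns.
combined_embedding = pd.concat(embeddings, axis=1, keys=names)
print(combined_embedding.shape)  # (5, 16)


def split_by_date(df, test_fraction=0.2):
    """Hold out the most recent rows for testing (assumed chronological split)."""
    n_test = int(round(len(df) * test_fraction))
    cut = len(df) - n_test
    return df.iloc[:cut], df.iloc[cut:]


train, test = split_by_date(combined_embedding)
print(len(train), len(test))  # 4 1
```

With the real 768-dimensional embeddings, the combined table would have 4 x 768 = 3072 feature columns, which is why it cannot share a model with the single-embedding datasets.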
Model Definition
Finally, below is my custom data loader and model.
The first through third layers extract a value that indicates how much the given articles affect the same day's open price.
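To make the layer description concrete, here is one possible layout written with tf.keras. It is only a sketch under stated assumptions: the class name NewsPriceModel, the hidden sizes (256 and 64), and the final layer that merges the news value with the open price are all hypothetical, since the original layer definitions are not shown here.

```python
import numpy as np
import tensorflow as tf


class NewsPriceModel(tf.keras.Model):
    """Hypothetical layout: three layers reduce the embedding to one value."""

    def __init__(self, embed_dim=768):
        super().__init__()
        self.embed_dim = embed_dim
        # Layers 1-3: squeeze the news embedding down to a single value.
        # The hidden sizes are placeholders, not taken from the post.
        self.news_1 = tf.keras.layers.Dense(256, activation="relu")
        self.news_2 = tf.keras.layers.Dense(64, activation="relu")
        self.news_3 = tf.keras.layers.Dense(1)
        # Final layer: map [news value, open price] to the adjusted close.
        self.head = tf.keras.layers.Dense(1)

    def call(self, inputs):
        # Each input row is an embedding followed by the same day's open price.
        news = inputs[:, : self.embed_dim]
        open_price = inputs[:, self.embed_dim :]
        news_value = self.news_3(self.news_2(self.news_1(news)))
        return self.head(tf.concat([news_value, open_price], axis=1))


model = NewsPriceModel()
batch = np.zeros((4, 769), dtype="float32")  # 768 embedding values + open price
print(model(batch).shape)  # (4, 1)
```

Keeping the open price out of the first three layers means they only ever see the news embedding, which matches the idea of a separate "news effect" value.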
get_data below concatenates each embedding with the open price and splits the result into training and validation sets.
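A data loader along these lines might look like the sketch below. The signature, the val_size default, and shuffle=False (which keeps rows in date order) are assumptions for illustration; only the general behavior (concatenate, then split) comes from the description above. The prices are the five rows shown earlier in the post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def get_data(embedding, open_price, adj_close_price, val_size=0.2):
    """Join features with the open price, then split into train and validation."""
    # Align on the shared Date index; the open price becomes the last column.
    features = pd.concat([embedding, open_price], axis=1, join="inner")
    targets = adj_close_price.loc[features.index]
    # shuffle=False keeps the rows in date order (an assumption, not confirmed).
    return train_test_split(
        features.to_numpy(dtype="float32"),
        targets.to_numpy(dtype="float32"),
        test_size=val_size,
        shuffle=False,
    )


dates = pd.Index(
    ["2008-08-08", "2008-08-11", "2008-08-12", "2008-08-13", "2008-08-14"],
    name="Date",
)
# Toy 3-feature embedding; the real ones have 768 columns.
embedding = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3), index=dates)
open_price = pd.DataFrame(
    {"Open": [11432.089844, 11729.669922, 11781.700195, 11632.809570, 11532.070312]},
    index=dates,
)
adj_close_price = pd.DataFrame(
    {"Adj Close": [11734.320312, 11782.349609, 11642.469727, 11532.959961, 11615.929688]},
    index=dates,
)

x_train, x_val, y_train, y_val = get_data(embedding, open_price, adj_close_price)
print(x_train.shape, x_val.shape)  # (4, 4) (1, 4)
```

The inner join drops any date that is missing from either table, so features and targets always line up row by row.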
Train
Training the four models took about 25 minutes on a Surface Pro 4.
New Model for Combined Dataset
Since combined_embedding has a different shape, I created a new model for it.
Training the combined model took about 4 minutes on a Surface Pro 4.
Losses of each model
Here are the loss plots for each model.


Prediction of each model
Now let's predict.


Conclusion
The predicted values have high variance and fluctuate a lot. However, the models were still able to capture the general trend of the prices. Since it is impossible to predict prices with 100% accuracy, models like these should only be used as a general guide.
One way to improve a model is to set a threshold that limits how much the price can change in a day. For example, with a threshold of 10,000, the predicted price would never move by more than that amount.
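One simple way to apply such a threshold is to clip each prediction after the fact. The sketch below assumes that "change over a day" means the gap between the predicted adjusted close and the same day's open price; the function name limit_daily_change and the sample numbers are made up for illustration.

```python
import numpy as np


def limit_daily_change(predicted_close, open_price, max_change):
    """Clip each prediction to within max_change of that day's open price."""
    predicted_close = np.asarray(predicted_close, dtype=float)
    open_price = np.asarray(open_price, dtype=float)
    return np.clip(predicted_close, open_price - max_change, open_price + max_change)


opens = [11432.09, 11729.67]
raw_predictions = [25000.0, 11700.0]  # the first one is far outside the limit
limited = limit_daily_change(raw_predictions, opens, max_change=10_000)
print(limited)
```

Here the first prediction is pulled back to the open price plus 10,000, while the second is already within the limit and stays unchanged.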
Also, the news I used may not be (and most likely is not) directly related to the DJIA. Using news that is closely related to it could also improve performance.
You can find the checkpoints I saved and all the code here.
Thank you for reading the post and if there is any mistake I made, please let me know!