Since each value in the embedding list is a feature, I redefined the dataframe by splitting the list into one column per value.
(Dataframe output: 768 feature columns indexed 0 through 767, one row per date. For example, the row for 2008-08-08 begins 0.809297, 0.516346, 0.375558, … and ends …, 0.883279, 0.854595, 0.175066.)
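Expanding a list-valued column into one column per feature can be done in a single pandas call; here is a minimal sketch (the column names and toy values are illustrative, not the ones from my notebook):

```python
import pandas as pd

# Each row holds one day's 768-dimensional embedding stored as a list.
df = pd.DataFrame({
    "Date": ["2008-08-08", "2008-08-11"],
    "embedding": [[0.1] * 768, [0.2] * 768],
})

# Expand the list column so every embedding value becomes its own
# feature column, numbered 0 through 767, indexed by date.
features = pd.DataFrame(df["embedding"].tolist(), index=df["Date"])

print(features.shape)  # (2, 768)
```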
Next, I split the data into training and test sets.
combined_embedding is another dataset I created to see how using all of the features together affects the open price.
So there are now a total of five models, each trained on different data.
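The four per-day embeddings, and the combined one, can be sketched with NumPy: each is an aggregation (mean, max, min, sum) over the day's article vectors, and the combined dataset concatenates all four, which is why it has a different shape. The array names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Suppose one trading day has 25 news articles, each embedded as a 768-d vector.
articles = rng.normal(size=(25, 768))

# Four ways to collapse a day's articles into a single feature vector.
mean_emb = articles.mean(axis=0)   # shape (768,)
max_emb = articles.max(axis=0)
min_emb = articles.min(axis=0)
sum_emb = articles.sum(axis=0)

# The combined dataset stacks all four aggregations side by side.
combined_emb = np.concatenate([mean_emb, max_emb, min_emb, sum_emb])
print(combined_emb.shape)  # (3072,)
```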
Model Definition
Finally, below is my custom data loader and model.
The first through third layers extract a value that indicates how much the given day's articles affect that day's open price.
get_data below concatenates the embeddings with the open prices and splits them into training and validation sets.
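A minimal sketch of what get_data and the data loader could look like; my actual implementation may differ, and the chronological 80/20 split ratio here is an assumption:

```python
import numpy as np

def get_data(embedding, open_price, close_price, valid_ratio=0.2):
    """Chronologically split (embedding, open, close) into train/validation sets."""
    split = int(len(embedding) * (1 - valid_ratio))
    train = (embedding[:split], open_price[:split], close_price[:split])
    valid = (embedding[split:], open_price[split:], close_price[split:])
    return train, valid

def data_loader(data, batch_size=16):
    """Yield (x, y, z) mini-batches: embedding, open price, close price."""
    x, y, z = data
    for i in range(0, len(x), batch_size):
        yield x[i:i + batch_size], y[i:i + batch_size], z[i:i + batch_size]

# Toy usage with 100 trading days of fake data.
emb = np.arange(100 * 768, dtype=float).reshape(100, 768)
open_p = np.linspace(10000, 11000, 100)
close_p = open_p + 50
train, valid = get_data(emb, open_p, close_p)
print(len(train[0]), len(valid[0]))  # 80 20
```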
Train
Training the four models took about 25 minutes on a Surface Pro 4.
New Model for Combined Dataset
Since combined_embedding has a different shape, I created a new model for it.
The combined model took about 4 minutes on a Surface Pro 4.
Losses of each model
Here are the resulting loss plots.
Losses for each model
Loss for Combined model
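The loss values being plotted are RMSE in index points, i.e. the square root of the mean squared difference between predicted and actual prices. A quick NumPy reference (a minimal sketch, not the plotting code):

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Being off by 100 points on each of two days gives an RMSE of 100.
print(rmse([10000.0, 11000.0], [10100.0, 10900.0]))  # 100.0
```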
Prediction of each model
Now let's predict.
Prediction for each model
Prediction for Combined model
Conclusion
We can see that the predicted values have high variance and fluctuate considerably. However, the models were still able to capture the general trend of the prices. Since it is impossible to predict prices with 100% accuracy, models like these should only be used as a general guide.
One way to improve the model is to set a threshold that limits how much the predicted price can change in a single day. For example, with a threshold of 10,000, the prediction would never move by more than that amount from one day to the next.
Also, the news I used may not be (and most likely is not) closely related to the DJIA. Using news that is directly related to it could also improve performance.
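Such a cap on day-over-day movement could be applied as a post-processing step on the predictions; a sketch, assuming the predictions are a 1-D array ordered by date (the function name is mine, not from the notebook):

```python
import numpy as np

def clip_daily_change(pred, max_change=10_000.0):
    """Limit how far each prediction may move from the previous (clipped) value."""
    out = np.asarray(pred, dtype=float).copy()
    for i in range(1, len(out)):
        change = out[i] - out[i - 1]
        # Clamp the day-over-day change to [-max_change, +max_change].
        out[i] = out[i - 1] + np.clip(change, -max_change, max_change)
    return out

# A 15,000-point jump gets clipped to 10,000; the next move is within bounds.
print(clip_daily_change([10000.0, 25000.0, 12000.0]))  # [10000. 20000. 12000.]
```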
You can find the saved checkpoints and all of the code here.
Thank you for reading the post, and if you spot any mistakes, please let me know!
import os
import tensorflow as tf  # TensorFlow 1.x API

data_name = {'mean_embedding': [mean_data_train, mean_data_valid],
             'max_embedding': [max_data_train, max_data_valid],
             'min_embedding': [min_data_train, min_data_valid],
             'sum_embedding': [sum_data_train, sum_data_valid]}

def train_model(embedding_name, learning_rate=1e-5, epochs=300, batch_size=16,
                dropout_rate=.5, load_params=True, verbose=True, save_model=True):
    data_train, data_valid = data_name[embedding_name]
    tf.reset_default_graph()
    model = get_model(learning_rate=learning_rate, dropout_rate=dropout_rate)
    # For plots
    train_losses = []
    valid_losses = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        if load_params:
            # Load Model
            try:
                print(f'------------- Attempting to Load {embedding_name} Model -------------')
                model.saver.restore(sess, f'./model/{embedding_name}_model.ckpt')
                print(f'------------- {embedding_name} Model Loaded -------------')
            except Exception:
                print('Training New Model')
        else:
            print('Training New Model')
        # Train Model
        print('\n------------- Training Model -------------\n')
        for epoch in range(epochs):
            # x : embedding, y : open price, z : close price
            for x, y, z in data_loader(data_train, batch_size=batch_size):
                train_loss, _ = sess.run([model.loss, model.optimizer],
                                         feed_dict={model.x: x,
                                                    model.y: y,
                                                    model.z: z})
            valid_loss = sess.run(model.loss, feed_dict={model.x: data_valid[0],
                                                         model.y: data_valid[1],
                                                         model.z: data_valid[2]})
            # Print losses
            if verbose:
                print(f'Epoch {epoch+1}/{epochs}, Train RMSE Loss {train_loss}, Valid RMSE Loss {valid_loss}')
            # Save the model every 20 epochs
            if save_model and (epoch + 1) % 20 == 0:
                if not os.path.exists('./model'):
                    os.mkdir('./model/')
                model.saver.save(sess, f'./model/{embedding_name}_model.ckpt')
                print('\n------------- Model Saved -------------\n')
            train_losses.append(train_loss)
            valid_losses.append(valid_loss)
    return model, train_losses, valid_losses
tf.reset_default_graph()
model = combined_model(learning_rate=1e-4, dropout_rate=.5)
epochs = 300
combined_train_losses = []
combined_valid_losses = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    try:
        print('------------- Attempting to Load Combined Model -------------')
        model.saver.restore(sess, './model/combined_model.ckpt')
        print('------------- Combined Model Loaded -------------')
    except Exception:
        print('Training New Model')
    # Train Model
    print('\n------------- Training Model -------------\n')
    for epoch in range(epochs):
        for x, y, z in data_loader(combined_data_train, batch_size=16):
            train_loss, _ = sess.run([model.loss, model.optimizer],
                                     feed_dict={model.x: x,
                                                model.y: y,
                                                model.z: z})
        valid_loss = sess.run(model.loss, feed_dict={model.x: combined_data_valid[0],
                                                     model.y: combined_data_valid[1],
                                                     model.z: combined_data_valid[2]})
        # Print losses and save the model every 20 epochs
        if epoch % 20 == 0:
            print(f'Epoch {epoch+1}/{epochs}, Combined Train RMSE Loss {train_loss}, Combined Valid RMSE Loss {valid_loss}')
            if not os.path.exists('./model'):
                os.mkdir('./model/')
            model.saver.save(sess, './model/combined_model.ckpt')
            print('\n------------- Model Saved -------------\n')
        combined_train_losses.append(train_loss)
        combined_valid_losses.append(valid_loss)
------------- Attempting to Load Combined Model -------------
Training New Model
------------- Training Model -------------
Epoch 1/100, Train RMSE Loss 14868.9697265625, Valid RMSE Loss 12097.6728515625
------------- Model Saved -------------
Epoch 21/100, Train RMSE Loss 3311.416015625, Valid RMSE Loss 3776.328125
------------- Model Saved -------------
Epoch 41/100, Train RMSE Loss 2456.6015625, Valid RMSE Loss 2350.877685546875
------------- Model Saved -------------
Epoch 61/100, Train RMSE Loss 2424.5302734375, Valid RMSE Loss 2064.84033203125
------------- Model Saved -------------
Epoch 81/100, Train RMSE Loss 1417.530029296875, Valid RMSE Loss 2210.999755859375
------------- Model Saved -------------
embedding_data = {'mean_embedding': mean_embedding_test,
                  'max_embedding': max_embedding_test,
                  'min_embedding': min_embedding_test,
                  'sum_embedding': sum_embedding_test}

def predict_model(embedding_name):
    tf.reset_default_graph()
    data = embedding_data[embedding_name]
    model = get_model(learning_rate=1e-5)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Load Model
        try:
            print(f'------------- Attempting to Load {embedding_name} Model -------------')
            model.saver.restore(sess, f'./model/{embedding_name}_model.ckpt')
            print('------------- Model Loaded -------------')
        except Exception:
            pass
        pred = sess.run(model.pred, feed_dict={model.x: data,
                                               model.y: open_test})
    return model, pred
------------- Attempting to Load mean_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/mean_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load max_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/max_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load sum_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/sum_embedding_model.ckpt
------------- Model Loaded -------------
------------- Attempting to Load min_embedding Model -------------
INFO:tensorflow:Restoring parameters from ./model/min_embedding_model.ckpt
------------- Model Loaded -------------
tf.reset_default_graph()
model = combined_model(learning_rate=1e-5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Load Model
    try:
        print('------------- Attempting to Load Combined Model -------------')
        model.saver.restore(sess, './model/combined_model.ckpt')
        print('------------- Model Loaded -------------')
    except Exception:
        pass
    combined_pred = sess.run(model.pred, feed_dict={model.x: combined_embedding_test,
                                                    model.y: open_test})
------------- Attempting to Load Combined Model -------------
INFO:tensorflow:Restoring parameters from ./model/combined_model.ckpt
------------- Model Loaded -------------