Text Generation

A deep learning model to generate headlines, given starting word(s)

Recurrent Network

What is it and Why?

A Recurrent Neural Network (RNN) is a deep learning model widely used in Natural Language Processing (NLP). It is typically used to predict the next word (or several words) that follow a given context.

For example, given the sentence 'My cat likes to drink [.......]', one person might choose 'milk' while another goes with 'water'. A model is trained on a massive amount of phrases and uses them to output the word that is most likely to come next.
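As a rough sketch of what 'most likely' means, the model assigns a probability to every candidate word and picks the best one. The numbers below are made up purely for illustration:

```python
import numpy as np

# Made-up probabilities for the word that follows 'My cat likes to drink'
# (illustrative only, not real model output).
candidate_words = ['milk', 'water', 'coffee', 'sand']
probabilities = np.array([0.55, 0.30, 0.10, 0.05])

# The prediction is simply the word with the highest probability.
print(candidate_words[int(np.argmax(probabilities))])  # -> milk
```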

More details about NLP will be covered in later posts; here I will only show what recurrent model I built and how.

The goal is to generate headlines given one or more starting words. The models shown later are not trained long enough for their generated text to be very good, but they are still able to demonstrate how such a model predicts.

The following is an example of a phrase generated with 'As an example,' as the starting words.

generate_headlines(model_20_750, ["As an example,"], input_length=20, rnn_size=750)
['<START> as an example , apple takes the market game today and says they will get a big hit microsoft']
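The generation itself is a simple loop that keeps asking the model for the next word until the headline reaches its full length. The sketch below is only my approximation of what `generate_headlines` does; the helpers `word_to_index` and `index_to_word` (built in the data section below) and the exact model call are assumptions, not the real API.

```python
def generate_headline_greedy(model, start_words, input_length=20):
    """Greedy sketch: repeatedly append the most likely next word.

    Assumes `model(indices)` returns a probability distribution over the
    vocabulary for the next position (an assumption, not the real API).
    """
    tokens = ['<START>'] + start_words.lower().split()
    indices = [word_to_index.get(t, word_to_index['<UNK>']) for t in tokens]

    while len(indices) < input_length:
        next_word_probs = model(indices)               # distribution over the vocabulary
        indices.append(int(next_word_probs.argmax()))  # greedy: take the most likely word

    return ' '.join(index_to_word[i] for i in indices)
```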

Data Exploration

The data I used is from UC Berkeley's CS182 (Designing, Visualizing and Understanding Deep Neural Networks). I'm not sure if I can share the data publicly without their permission, so I'm only sharing my code and model weights. A similar dataset can be downloaded from Kaggle, such as A Million News Headlines uploaded by Rohk.

print(f'Number of data samples is {len(dataset)}')
print(f'Number of vocabulary is {len(vocabulary)}')
Number of data samples is 89514
Number of vocabulary is 10000

Vocabulary is a list of the unique words that appear at least once in the preprocessed dataset.

Dataset is a dictionary with the elements: cut, mask, numerized, title, and url.

The dataset is preprocessed so that all the titles (headlines) have the same fixed length of 20 tokens. Shorter titles are padded to match this length. To distinguish the padding from actual words, the boolean mask element is used.

  1. cut : indicates whether a sample is training or validation

  2. title : News Headlines that we are going to use for both training and evaluation

  3. url : The origin of each headline

  4. mask : If True, it is an actual word in a headline. If False, it is a padding

  5. numerized : Maps each word to its index in the vocabulary.

START marks the beginning of a headline. PAD marks the padding added to a headline to match the fixed length. UNK marks an unknown word, which happens when a word used in a sample is not in the vocabulary.
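Putting these fields together, a single preprocessed sample might look roughly like this (all values are made up for illustration):

```python
# Illustrative sample following the fields described above (values are made up).
sample = {
    'cut':       'train',                          # training / validation split
    'title':     'apple takes the market game today',
    'url':       'https://example.com/headline',   # placeholder URL
    # 20 positions in total: <START> + 6 words + 13 <PAD> tokens
    'numerized': [0, 812, 45, 131, 977, 2954, 67] + [1] * 13,
    'mask':      [True] * 7 + [False] * 13,
}

assert len(sample['numerized']) == len(sample['mask']) == 20
```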

The first thing to do is to make a mapping from word to index. We cannot pass string values directly into a deep network, so any non-numeric values must first be converted. With such a mapping dictionary, each unique word gets its own integer value.

So the model predicts the index of a word in the vocabulary, instead of guessing the word as a string directly.
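A minimal version of that mapping, assuming `vocabulary` is the list described above:

```python
# Each unique word gets its own integer: its position in the vocabulary list.
word_to_index = {word: index for index, word in enumerate(vocabulary)}
```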

Another thing I need is a way to translate those integers back to their original words so I can read them.

Note that any non-alphabetic character is treated as a separate word, so 'facebook?' is translated to 'facebook ?'.
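The reverse lookup is just the mapping flipped around; a sketch:

```python
# Reverse mapping: from an index back to the word it stands for.
index_to_word = {index: word for word, index in word_to_index.items()}

def translate(indices):
    """Turn a list of indices back into a readable, tokenized string."""
    return ' '.join(index_to_word[i] for i in indices)

# Because punctuation is its own token, 'facebook?' comes back as 'facebook ?'.
```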

I also need a function that numerizes a given sentence.

All characters are lowercased to keep the vocabulary smaller.
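A numerizing helper could then look like this; the punctuation splitting here is a simplified stand-in for the real preprocessing:

```python
import re

def numerize(sentence):
    """Lowercase, split punctuation into separate tokens, and map words to indices."""
    tokens = re.findall(r"[a-z0-9']+|[^\sa-z0-9']", sentence.lower())
    unknown = word_to_index['<UNK>']
    return [word_to_index.get(token, unknown) for token in tokens]

# numerize('Facebook?') -> the indices for ['facebook', '?']
```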

Code

Class Definition
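The original class definition is in the full code linked at the end. Conceptually it is an embedding layer, a recurrent layer, and a projection back onto the vocabulary. The PyTorch sketch below captures that idea, but the framework, the layer choices (an LSTM here), and the names are my assumptions rather than the exact original:

```python
import torch.nn as nn

class HeadlineRNN(nn.Module):
    """Embedding -> recurrent layer -> per-position logits over the vocabulary."""

    def __init__(self, vocab_size, rnn_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, rnn_size)
        self.rnn = nn.LSTM(rnn_size, rnn_size, batch_first=True)
        self.output = nn.Linear(rnn_size, vocab_size)

    def forward(self, token_indices):
        embedded = self.embedding(token_indices)  # (batch, seq_len, rnn_size)
        hidden, _ = self.rnn(embedded)            # (batch, seq_len, rnn_size)
        return self.output(hidden)                # (batch, seq_len, vocab_size)
```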

Train
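Training is standard next-word prediction: at each position the model predicts the following token, and the padded positions are excluded from the loss using the mask. This is only a sketch of one training step, building on the class sketch above; the optimizer and loss settings are assumptions.

```python
import torch
import torch.nn as nn

model = HeadlineRNN(vocab_size=10000, rnn_size=750)   # the sketch class from above
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(reduction='none')

def training_step(batch_indices, batch_mask):
    """One step on a batch of numerized headlines; both tensors have shape (batch, 20)."""
    inputs, targets = batch_indices[:, :-1], batch_indices[:, 1:]
    logits = model(inputs)                                   # (batch, 19, vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    # Average the loss over real words only, ignoring <PAD> positions.
    mask = batch_mask[:, 1:].reshape(-1).float()
    loss = (loss * mask).sum() / mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```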

Losses of Each Model

As rnn_size (the size of the embedding) increases, the model overfits faster, which we can see in the graphs as the validation loss starting to rise. However, judging by the plots, a small rnn_size also stops learning earlier than larger sizes, so this parameter should be chosen carefully by exploring more combinations.

Prediction

Headline Generation

The link to the weights is below, so if you want, you can download them, try other starting words, or train the models further.

Conclusion

There are already many better models than this one, such as bi-LSTMs, BERT, and Transformers, which can predict and generate much longer sentences. For example, the models in this post are unidirectional: they only read phrases from left to right, whereas bidirectional models read both from left to right and from right to left and can therefore learn patterns better.

Although the models here are trained with a maximum length of 20 words, padding the dataset to a longer length would allow generating much longer sentences.

As mentioned above, I'm not sure if I can share the data without their consent, but you can still get the weights of each model from my drive and view the full code on my github.

Again, thank you for reading, and if you find any errors or typos, or have any suggestions, please let me know.
