Natural language processing relies on tokenization, a method of encoding text in numeric form. For example, we can assign a number to each unique word in our data.
We can do this encoding using TensorFlow like so:
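A minimal sketch of this encoding, assuming a TensorFlow install that still ships the legacy `tensorflow.keras.preprocessing.text.Tokenizer` class. The three sample sentences are made up for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative sample data (an assumption, not a real dataset).
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
]

# num_words caps the vocabulary used when text is later converted to sequences.
tokenizer = Tokenizer(num_words=20)

# Build the vocabulary; more frequent words receive smaller numbers.
tokenizer.fit_on_texts(sentences)

# word_index maps each lowercased word to its integer token.
word_index = tokenizer.word_index
print(word_index)
```

Printing `word_index` shows a dictionary of lowercased words mapped to integers starting at 1. Note that 'dog!' does not get its own entry.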
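The Tokenizer above only tokenizes the most common words: a num_words setting of 20 keeps the 19 most frequent, because index 0 is reserved. The full word-to-number mapping, which covers every word seen, is available as the word_index property.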
A Tokenizer instance also handles punctuation and casing. By default it lowercases the text and strips punctuation, so "dog!" and "Dog" map to the same token.
Once we’ve got tokens, we can represent sentences as ordered sequences of numbers.
The Tokenizer has a method called texts_to_sequences, which converts each sentence into its sequence of tokens.
We can use it like so:
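A minimal sketch, assuming the legacy Keras `Tokenizer` is available in your TensorFlow install. The sample sentences are illustrative:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative sample data (an assumption, not a real dataset).
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

# Each sentence becomes a list of integers, one per word, in word order.
sequences = tokenizer.texts_to_sequences(sentences)

for sentence, sequence in zip(sentences, sequences):
    print(sentence, '->', sequence)
```

Each output list has the same number of entries as its sentence has words, and a repeated word such as "love" gets the same number everywhere it appears.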
Sometimes a Tokenizer encounters words it has never seen before and has no token for them. By default, the Tokenizer simply leaves these words out. The problem with this is that we lose the original sentence lengths.
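The length loss can be seen in a short sketch, assuming the legacy Keras `Tokenizer`. Both the training and test sentences are made up for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative training data (an assumption, not a real dataset).
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

# These sentences contain words the tokenizer never saw during fitting:
# "really", "loves" and "manatee".
test_data = [
    'i really love my dog',
    'my dog loves my manatee',
]
test_sequences = tokenizer.texts_to_sequences(test_data)

# Unseen words are silently dropped, so the sequences are shorter
# than the five-word sentences they came from.
for text, sequence in zip(test_data, test_sequences):
    print(len(text.split()), 'words ->', len(sequence), 'tokens')
```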
To get around this issue, we can set an OOV (out-of-vocabulary) token, which will be used in place of unseen words. We can set this to any string that we wouldn’t expect in our text, such as <OOV>, when we instantiate the Tokenizer:
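A minimal sketch of the `oov_token` argument, assuming the legacy Keras `Tokenizer`. The sentences are illustrative:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative training data (an assumption, not a real dataset).
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

# The OOV token is added to the vocabulary and takes index 1.
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)

test_data = [
    'i really love my dog',
    'my dog loves my manatee',
]
test_sequences = tokenizer.texts_to_sequences(test_data)

# Unseen words are replaced by the OOV index instead of being dropped,
# so each sequence keeps the length of its sentence.
oov_index = tokenizer.word_index['<OOV>']
print(oov_index)
print(test_sequences)
```

With the placeholder in place, both test sentences keep their five-token length, and "really", "loves" and "manatee" all map to the same OOV index.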
Another issue when using a neural network is that we typically need to provide inputs of the same shape, but sentences vary in length. To get around this, we can simply pad our sequences:
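A minimal sketch of padding with `pad_sequences`, assuming the legacy `tensorflow.keras.preprocessing` modules are available. The sentences are illustrative:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative sample data (an assumption, not a real dataset).
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# With default settings, every sequence is padded with zeros at the front
# until it is as long as the longest sequence in the batch.
padded = pad_sequences(sequences)

print(padded.shape)
print(padded)
```

The result is a NumPy array with one row per sentence and one column per position in the longest sentence.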
We can see that the shorter sentences have been padded with 0s on the left. If we want the 0s at the end, or want to set a maximum length and control which end gets truncated, we can specify those too:
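A minimal sketch of those options, assuming the legacy `pad_sequences` helper. The `maxlen=5` value is chosen only for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative sample data (an assumption, not a real dataset).
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(
    sequences,
    padding='post',     # put the zeros at the end instead of the start
    truncating='post',  # cut long sequences from the end
    maxlen=5,           # every row ends up exactly 5 tokens long
)

print(padded)
```

Here the four-word sentences gain a single trailing 0, while the seven-word sentence loses its last two tokens.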
Once we’ve got our data into a structure like the one above, we can use a neural network to build a model.