NLP with TensorFlow
Natural Language Processing with TensorFlow for Python
Updated: 03 September 2023
From this Playlist
Tokenization and Sequence Analysis
Natural language processing makes use of something called tokenization: a method of encoding text in a numeric form. An example of this would be assigning each unique word in our data an associated number.
We can do this encoding using TensorFlow like so:
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I am a person',
    'What am I?',
    "What is a person!",
    "Hey, how are you doing today?"
]

tokenizer = Tokenizer(num_words=20)
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)
# {'i': 1, 'am': 2, 'a': 3, 'person': 4, 'what': 5, 'is': 6, 'hey': 7, 'how': 8,
#  'are': 9, 'you': 10, 'doing': 11, 'today': 12}
```
The above `Tokenizer` will only keep the 20 most common words when converting text to sequences; the full list of words it has seen is available via its `word_index` property. A `Tokenizer` instance will also catch and correctly handle punctuation, stripping it out and lower-casing the words, as the output above shows.
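Note that `num_words` does not shrink `word_index` itself; it only caps which words are used when converting text to sequences. A minimal sketch to illustrate (the sentences here are made up purely for the example):

```python
# num_words=3 keeps only the 2 most common words (index 0 is reserved),
# but word_index still records every word seen during fitting
small_tokenizer = Tokenizer(num_words=3)
small_tokenizer.fit_on_texts(['the cat sat on the mat', 'the cat ran'])

print(small_tokenizer.word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'ran': 6}

print(small_tokenizer.texts_to_sequences(['the cat sat']))
# Only words with an index below num_words survive: [[1, 2]]
```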
Once we’ve got tokens, we can represent sentences as sequences of ordered numbers. The `Tokenizer` contains a function called `texts_to_sequences` which will convert a sentence into its sequence/token representation. We can use it like so:
```python
sequences = tokenizer.texts_to_sequences(sentences)

print(sequences)
# [[1, 2, 3, 4], [5, 2, 1], [5, 6, 3, 4], [7, 8, 9, 10, 11, 12]]
```
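The `Tokenizer` can also go the other way with `sequences_to_texts`, which maps sequences back to text (casing and punctuation are lost, since they were stripped during fitting). For example:

```python
# Map the token sequences back to their (lower-cased) words
print(tokenizer.sequences_to_texts(sequences))
# ['i am a person', 'what am i', 'what is a person', 'hey how are you doing today']
```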
Sometimes a `Tokenizer` may get words that it’s never seen before and may not know how to handle. By default, the `Tokenizer` will just leave these out. The problem with doing this is that we lose the sentence lengths. To get around this issue, we can set an OOV (out-of-vocabulary) token, which will be used in place of missing words instead. We can set this to any string that we wouldn’t expect in our text, such as `<OOV>`, when we instantiate the `Tokenizer`:
```python
tokenizer = Tokenizer(num_words=20, oov_token='<OOV>')
```
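After re-fitting, `<OOV>` takes index 1 and every other word’s index shifts up by one; unseen words then map to it instead of being dropped. A quick sketch (the word "robot" is just an illustration):

```python
tokenizer.fit_on_texts(sentences)

# 'robot' was never seen during fitting, so it maps to the OOV token (index 1)
print(tokenizer.texts_to_sequences(['I am a robot']))
# [[2, 3, 4, 1]]
```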
Another issue when using a neural network is that we typically need to provide data of the same shape to the network; however, different sentences can have different lengths. To get around this, we can simply pad our sequences:
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ... (tokenizer re-fitted with the OOV token and sequences regenerated as above)

padded = pad_sequences(sequences)

print(padded)
# [[ 0  0  2  3  4  5]
#  [ 0  0  0  6  3  2]
#  [ 0  0  6  7  4  5]
#  [ 8  9 10 11 12 13]]
```
We can see that the shorter sentences have been padded with `0` on the left. If we want the `0`s at the end, or want to set a maximum length or truncation, we can specify those too:
```python
padded = pad_sequences(
    sequences,
    padding='post',
    truncating='post',
    maxlen=5
)

print(padded)
# [[ 2  3  4  5  0]
#  [ 6  3  2  0  0]
#  [ 6  7  4  5  0]
#  [ 8  9 10 11 12]]
```
Once we’ve got our data into a structure like the above, we can make use of a neural network to build a model. We’ll build a model using the News Headlines Dataset for Sarcasm Detection. The code for the model is below:
```python
import json
import numpy as np

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000
sentence_max = 100

headlines = []
labels = []

# The dataset stores one JSON object per line
with open('./nlp/Sarcasm_Headlines_Dataset.json') as f:
    lines = f.readlines()
    for line in lines:
        datapoint = json.loads(line)
        headlines.append(datapoint['headline'])
        labels.append(datapoint['is_sarcastic'])

test_size = int(len(headlines) * 0.2)

# TF needs this to be an array
headlines_test = np.array(headlines[0:test_size])
headlines_train = np.array(headlines[test_size:])

labels_test = np.array(labels[0:test_size])
labels_train = np.array(labels[test_size:])

# Fit the tokenizer on the training set only, so the test set stays unseen
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(headlines_train)


def preprocess_sentences(sentences):
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences, maxlen=sentence_max)
    return padded


padded_train = preprocess_sentences(headlines_train)
padded_test = preprocess_sentences(headlines_test)

embedding_dim = 16

model = keras.Sequential([
    # Embedding creates a vector that will represent each input
    keras.layers.Embedding(vocab_size, embedding_dim,
                           input_length=sentence_max),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

epoch_num = 30

history = model.fit(x=padded_train, y=labels_train, epochs=epoch_num,
                    validation_data=(padded_test, labels_test), verbose=2)

sentences_new = [
    "OMG this weather is perfect",
    "Absolutely wonderful weather we're having"
]

padded_new = preprocess_sentences(sentences_new)

result = model.predict(padded_new)

for i in range(len(result)):
    print(f'Sentence: {sentences_new[i]}\nSarcasm: {result[i]}')
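```

Since the final layer is a sigmoid, each entry of `result` is a probability between 0 and 1. A minimal sketch of turning those scores into hard labels, assuming the usual 0.5 cut-off (a convention, not something the model dictates):

```python
# Threshold the sigmoid outputs to get hard sarcastic / not-sarcastic labels
for sentence, score in zip(sentences_new, result):
    label = 'sarcastic' if score[0] > 0.5 else 'not sarcastic'
    print(f'{sentence!r}: {label} (p={score[0]:.3f})')
```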