NLP with TensorFlow

Natural Language Processing with TensorFlow for Python

Updated: 03 September 2023

Natural Language Processing

From this Playlist

Tokenization and Sequence Analysis

Natural language processing relies on tokenization, a method of encoding text in numeric form. For example, we can assign each unique word in our data its own number
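To make the idea concrete, here is a minimal plain-Python sketch of a word-to-number mapping. It is only an illustration and not how any particular library implements tokenization; the helper name build_vocab is made up for this example, and it numbers words in order of first appearance.

```python
def build_vocab(sentences):
    """Assign each unique lowercase word the next free integer, starting at 1."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


vocab = build_vocab(["I am a person", "What am I"])
print(vocab)
# {'i': 1, 'am': 2, 'a': 3, 'person': 4, 'what': 5}
```

Index 0 is left unused here on purpose, since 0 is commonly reserved for padding.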

We can do this encoding using TensorFlow like so:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I am a person',
    'What am I?',
    "What is a person!",
    "Hey, how are you doing today>"
]

tokenizer = Tokenizer(num_words=20)
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)
# {'i': 1, 'am': 2, 'a': 3, 'person': 4, 'what': 5, 'is': 6, 'hey': 7, 'how': 8,
# 'are': 9, 'you': 10, 'doing': 11, 'today': 12}

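The num_words=20 argument limits how many words are used when text is later converted to sequences: only the most frequent words (those whose index is below num_words) are kept. The full word-to-index mapping is always available as the word_index property, ordered from most to least frequent

By default a Tokenizer instance also lowercases the text and strips punctuation, which is why 'I?' and 'person!' map to the same tokens as 'I' and 'person'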
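The punctuation handling can be approximated in plain Python. This is a sketch of the idea rather than the Tokenizer's actual code: the FILTERS string is modeled on the default value of the Tokenizer's filters argument (treat the exact characters as an assumption and check the Keras documentation), and normalize is a made-up helper name.

```python
# Characters to strip; modeled on the Tokenizer's default `filters` argument
# (an assumption for this sketch, check the Keras docs for the exact default).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def normalize(text):
    """Lowercase the text, replace filtered characters with spaces, and split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


print(normalize("Hey, how are you doing today>"))
# ['hey', 'how', 'are', 'you', 'doing', 'today']
print(normalize("What am I?"))
# ['what', 'am', 'i']
```

Note that the apostrophe is not in the filter list, so a word like "we're" stays a single token in this sketch.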

Once we’ve got tokens we can represent sentences as sequences of ordered numbers

The Tokenizer has a method called texts_to_sequences, which converts each sentence into its sequence of tokens

We can use it like so:

sequences = tokenizer.texts_to_sequences(sentences)

print(sequences)
# [[1, 2, 3, 4], [5, 2, 1], [5, 6, 3, 4], [7, 8, 9, 10, 11, 12]]

Sometimes a Tokenizer will encounter words that it has never seen before and will not know how to handle them. By default the Tokenizer just leaves these words out. The problem with doing this is that we lose the original sentence lengths

To get around this issue, we can set an OOV (out-of-vocabulary) token, which is used in place of unknown words. We can set this to any string that we wouldn’t expect in our text, such as <OOV>, when we instantiate the Tokenizer:

tokenizer = Tokenizer(num_words=20, oov_token='<OOV>')
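The difference between dropping unknown words and replacing them can be sketched in plain Python. The vocabulary and the to_sequence helper below are hypothetical and only illustrate the behaviour described above; the sketch reserves index 1 for the OOV token, which is also why every other word index shifts up by one once an OOV token is set.

```python
OOV_INDEX = 1  # index reserved for unknown words in this sketch

# Hypothetical vocabulary; index 1 is kept free for the OOV token.
word_index = {"<OOV>": OOV_INDEX, "i": 2, "am": 3, "a": 4, "person": 5}


def to_sequence(words, word_index, use_oov):
    """Map words to indices, either dropping or replacing unknown words."""
    sequence = []
    for word in words:
        if word in word_index:
            sequence.append(word_index[word])
        elif use_oov:
            sequence.append(OOV_INDEX)
        # Without an OOV token the unknown word is silently skipped.
    return sequence


words = ["i", "am", "a", "happy", "person"]
print(to_sequence(words, word_index, use_oov=False))
# [2, 3, 4, 5]
print(to_sequence(words, word_index, use_oov=True))
# [2, 3, 4, 1, 5]
```

With the OOV token the output keeps all five positions, so the sentence length survives even though "happy" is unknown.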

Another issue when using a neural network is that we typically need to provide data of the same shape to the network, but different sentences can have different lengths. To get around this we can simply pad our sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# ...

padded = pad_sequences(sequences)

print(padded)
# [[ 0  0  2  3  4  5]
#  [ 0  0  0  6  3  2]
#  [ 0  0  6  7  4  5]
#  [ 8  9 10 11 12 13]]

We can see that the shorter sentences have been padded with 0 on the left. If we want the 0s at the end or want to set a maximum length or truncation, we can specify those too:

padded = pad_sequences(
    sequences,
    padding='post',
    truncating='post',
    maxlen=5
)

print(padded)
# [[ 2  3  4  5  0]
#  [ 6  3  2  0  0]
#  [ 6  7  4  5  0]
#  [ 8  9 10 11 12]]
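To see what the padding and truncating options do, here is a plain-Python sketch of the same idea working on lists. The pad function is a made-up stand-in, not the Keras implementation; it defaults to 'pre' for both options, which is my understanding of the pad_sequences defaults.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate each sequence to a common length (pure-Python sketch)."""
    if maxlen is None:
        # Default to the length of the longest sequence
        maxlen = max((len(seq) for seq in sequences), default=0)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[2, 3, 4, 5], [6, 3, 2], [8, 9, 10, 11, 12, 13]]
print(pad(sequences))
# [[0, 0, 2, 3, 4, 5], [0, 0, 0, 6, 3, 2], [8, 9, 10, 11, 12, 13]]
print(pad(sequences, maxlen=5, padding="post", truncating="post"))
# [[2, 3, 4, 5, 0], [6, 3, 2, 0, 0], [8, 9, 10, 11, 12]]
print(pad(sequences, maxlen=5))
# [[0, 2, 3, 4, 5], [0, 0, 6, 3, 2], [9, 10, 11, 12, 13]]
```

The last call shows why the truncating option matters: with 'pre' the start of the longest sentence is dropped, with 'post' the end is.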

Once our data is in the structure shown above, we can use a neural network to build a model

We’ll build a model using the News Headlines Dataset for Sarcasm Detection

The code for the model is below:

import json
import numpy as np

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000
sentence_max = 100

headlines = []
labels = []

# The dataset file holds one JSON object per line
with open('./nlp/Sarcasm_Headlines_Dataset.json') as f:
    for line in f:
        datapoint = json.loads(line)
        headlines.append(datapoint['headline'])
        labels.append(datapoint['is_sarcastic'])

# Hold out the first 20% of the data for testing
test_size = int(len(headlines) * 0.2)

# TF needs this to be an array
headlines_test = np.array(headlines[0:test_size])
headlines_train = np.array(headlines[test_size:])

labels_test = np.array(labels[0:test_size])
labels_train = np.array(labels[test_size:])

# Fit the tokenizer on the training data only, so that words which appear
# only in the test set are treated as out-of-vocabulary
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(headlines_train)


def preprocess_sentences(sentences):
    # Convert text to token sequences, then pad/truncate to a fixed length
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences, maxlen=sentence_max)
    return padded


padded_train = preprocess_sentences(headlines_train)
padded_test = preprocess_sentences(headlines_test)

embedding_dim = 16

model = keras.Sequential([
    # Embedding creates a vector that will represent each input
    keras.layers.Embedding(vocab_size, embedding_dim,
                           input_length=sentence_max),
    # Average the word vectors into a single vector per sentence
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(24, activation='relu'),
    # A single sigmoid output: the predicted probability of sarcasm
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

epoch_num = 30

history = model.fit(x=padded_train, y=labels_train, epochs=epoch_num,
                    validation_data=(padded_test, labels_test), verbose=2)

sentences_new = [
    "OMG this weather is perfect",
    "Absolutely wonderful weather we're having"
]

padded_new = preprocess_sentences(sentences_new)

result = model.predict(padded_new)

# Each prediction is a one-element array holding the sarcasm probability
for sentence, score in zip(sentences_new, result):
    print(f'Sentence: {sentence}\nSarcasm: {score[0]:.3f}')
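The first two layers of the model can be pictured as a table lookup followed by an average. The plain-Python sketch below illustrates that idea with a tiny, hand-written embedding table; the values and the embed_and_average helper are hypothetical, and in the real model the vectors are learned during training.

```python
# Hypothetical embedding table: one 2-dimensional vector per token index.
# In the real model these values are learned during training.
embedding_table = [
    [0.0, 0.0],  # index 0 (padding)
    [1.0, 3.0],  # index 1
    [3.0, 1.0],  # index 2
    [2.0, 2.0],  # index 3
]


def embed_and_average(sequence, table):
    """Look up a vector for each token, then average over the sequence."""
    vectors = [table[token] for token in sequence]
    dim = len(table[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


print(embed_and_average([1, 2, 3], embedding_table))
# [2.0, 2.0]
print(embed_and_average([0, 1, 2, 3], embedding_table))
# [1.5, 1.5]
```

The second call shows that a padding position is averaged in along with the real words in this sketch. Whatever the sentence length, the result is a single fixed-size vector, which is what the Dense layers need as input.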