Secrets of the TensorFlow Tokenizer's word_index 0
Recently I have been picking up TensorFlow 2.0.
In the example code, when calculating the total word count, `total_words` should be `len(tokenizer.word_index) + 1`, otherwise the `to_categorical` function will not work. We can easily find out that `word_index` is the dictionary that maps words to unique integer values, and that the values start from 1: the 0 is reserved.
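A quick sketch illustrates this (the two-sentence corpus here is made up; the exact indices depend on word frequencies in your texts):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["Hi there!", "Hi there my friend"])

print(tokenizer.word_index)
# e.g. {'hi': 1, 'there': 2, 'my': 3, 'friend': 4} -- values start at 1, never 0
total_words = len(tokenizer.word_index) + 1   # + 1 to account for the reserved 0
```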
The reason for this is very straightforward, but it took me a long time to work it out.
Consider a list of sentences whose maximum length is 6, one of which is "Hi there!". When doing the padding, all the pads are filled with 0, so the padded sequence is something like `[0, 0, 0, 0, 34, 371]`.
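The indices 34 and 371 above are illustrative; a tiny made-up corpus shows the same shape with smaller indices:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["Hi there!", "How are you doing my friend"]  # max length is 6

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)     # pads on the left with 0 by default
print(padded[0])                      # e.g. [0 0 0 0 1 2] for "Hi there!"
```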
The `word_index` is the one shown above. If the values in `word_index` started from 0, say `"my": 0`, the padding part could not be reversed properly, since a pad would be indistinguishable from that word. Therefore, the 0 is intentionally reserved, and when reversing a list of values back to words, the 0 is mapped to `None`.
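Here is a sketch of that reversal, building a reverse dictionary from `word_index` (TF 2.x also ships `tokenizer.index_word` ready-made), reusing the same toy corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["Hi there!", "How are you doing my friend"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
padded = pad_sequences(tokenizer.texts_to_sequences(sentences))

# 0 never appears as a value in word_index, so the reverse lookup
# returns None for every pad position.
reverse_word_index = {v: k for k, v in tokenizer.word_index.items()}
print([reverse_word_index.get(i) for i in padded[0]])
# e.g. [None, None, None, None, 'hi', 'there']
```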
Meanwhile, `to_categorical` has a `num_classes` argument, which defaults to the maximum label value plus one when not given, and the padding value 0 simply becomes class 0 in the one-hot encoding. So `Tokenizer` cooperates with `to_categorical` very well.
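To see why the `+ 1` matters: valid class indices run from 0 to `num_classes - 1`, so a vocabulary whose largest index equals `len(word_index)` needs a one-hot width of `len(word_index) + 1`. A small sketch with hypothetical indices:

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 34, 371])   # hypothetical word indices; 0 is the pad
total_words = 372                 # i.e. len(tokenizer.word_index) + 1

one_hot = to_categorical(labels, num_classes=total_words)
print(one_hot.shape)              # (3, 372)

# With num_classes=371 (no + 1), the label 371 would fall outside the
# valid range 0..370 and to_categorical would fail with an IndexError.
```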