Secrets of the TensorFlow Tokenizer's word_index 0
Recently I have been picking up TensorFlow 2.0.
In the example code, when calculating the total word count, `total_words` should be `len(tokenizer.word_index) + 1`, otherwise the `to_categorical` function will not work. We can easily find out that `word_index` is the dictionary that maps words to unique integer values, and that the values start from 1: the 0 is reserved.
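A quick sketch illustrates this (the two-sentence corpus here is made up; the exact indices depend on word frequencies in your texts):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["Hi there!", "Hi there my friend"])

print(tokenizer.word_index)
# e.g. {'hi': 1, 'there': 2, 'my': 3, 'friend': 4} -- values start at 1, never 0
total_words = len(tokenizer.word_index) + 1   # + 1 to account for the reserved 0
```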
The reason for this is very straightforward, but it took me a long time to work it out.
Consider a list of sentences whose maximum length is 6, one of which is "Hi there!". When doing the padding, all the pads are filled with 0, so the padded sequence is something like `[0, 0, 0, 0, 34, 371]`.
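The indices 34 and 371 above are illustrative; a tiny made-up corpus shows the same shape with smaller indices:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["Hi there!", "How are you doing my friend"]  # max length is 6

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)     # pads on the left with 0 by default
print(padded[0])                      # e.g. [0 0 0 0 1 2] for "Hi there!"
```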
The `word_index` is the one shown above. If the values in `word_index` started from 0, say `"my": 0`, the padding part could not be reversed properly, since a pad would be indistinguishable from that word. Therefore, the 0 is intentionally reserved, and when reversing a list of values back to words, the 0 is mapped to `None`.
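Here is a sketch of that reversal, building a reverse dictionary from `word_index` (TF 2.x also ships `tokenizer.index_word` ready-made), reusing the same toy corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["Hi there!", "How are you doing my friend"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
padded = pad_sequences(tokenizer.texts_to_sequences(sentences))

# 0 never appears as a value in word_index, so the reverse lookup
# returns None for every pad position.
reverse_word_index = {v: k for k, v in tokenizer.word_index.items()}
print([reverse_word_index.get(i) for i in padded[0]])
# e.g. [None, None, None, None, 'hi', 'there']
```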
Meanwhile, `to_categorical` has a `num_classes` argument, which defaults to the maximum label value plus one when not given, and the padding value 0 simply becomes class 0 in the one-hot encoding. So `Tokenizer` cooperates with `to_categorical` very well.
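To see why the `+ 1` matters: valid class indices run from 0 to `num_classes - 1`, so a vocabulary whose largest index equals `len(word_index)` needs a one-hot width of `len(word_index) + 1`. A small sketch with hypothetical indices:

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 34, 371])   # hypothetical word indices; 0 is the pad
total_words = 372                 # i.e. len(tokenizer.word_index) + 1

one_hot = to_categorical(labels, num_classes=total_words)
print(one_hot.shape)              # (3, 372)

# With num_classes=371 (no + 1), the label 371 would fall outside the
# valid range 0..370 and to_categorical would fail with an IndexError.
```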