Brown Water Python: Better Docs for the Python tokenize Module.

The tokenize module in the Python standard library is very powerful, but its documentation is somewhat limited. In the spirit of Thomas Kluyver’s Green Tree Snakes project, which provides similar extended documentation for the ast module, I am providing here some extended documentation for effectively working with the tokenize module.

Python Versions Supported

The contents of this guide apply to Python 3.5 and up. Several minor changes were made to the tokenize module in various Python versions between 3.5 and 3.8, and they have been noted where appropriate.

The tokenize module tokenizes code according to the version of Python that it is being run under. For example, some new syntax features in 3.6 affect tokenization (in particular, f-strings and underscores in numeric literals). Take 123_456. This will tokenize as a single token in Python 3.6+, NUMBER (123_456), but in Python 3.5, it tokenizes as two tokens, NUMBER (123) and NAME (_456) (see the reference for the NUMBER token type for more info).
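To see this for yourself, here is a minimal sketch that tokenizes 123_456 under whichever Python version runs it. The helper name token_summary is made up for this illustration and is not part of the tokenize module.

```python
import io
import tokenize


def token_summary(source):
    """Return (token type name, token string) pairs for a source string.

    token_summary is a hypothetical helper written for this example.
    """
    # tokenize.tokenize() expects a readline callable that returns bytes.
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.tokenize(readline)
    ]


summary = token_summary("123_456\n")
for name, string in summary:
    print(name, repr(string))

# On Python 3.6+, the summary contains a single NUMBER token with the
# string '123_456'. On Python 3.5 you would instead see NUMBER '123'
# followed by NAME '_456', as described above.
```

Running the same script under two different interpreters is an easy way to confirm that the result depends on the Python version running tokenize, not on the code being tokenized.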

Most of what is written here will also apply to earlier Python 3 versions, with obvious exceptions (like tokens that were added for new syntax), though none of it has been tested.

I don’t have any interest in supporting Python 2 in this guide. Python 2 has officially reached the end of its life, so you should strongly consider writing new code for Python 3 only.

With that being said, I will point out one important difference in Python 2: the tokenize() function in Python 2 prints the tokens instead of returning them. Use the generate_tokens() function instead, which works like tokenize() in Python 3 (see the docs).

>>> # Python 2.7 tokenize example
>>> import tokenize
>>> import io
>>> for tok in tokenize.generate_tokens(io.BytesIO('1 + 2').readline):
...     print tok
(2, '1', (1, 0), (1, 1), '1 + 2')
(51, '+', (1, 2), (1, 3), '1 + 2')
(2, '2', (1, 4), (1, 5), '1 + 2')
(0, '', (2, 0), (2, 0), '')

Another difference is that the result of this function in Python 2 is a regular tuple, not a namedtuple, so you will not be able to use attributes to access the members. Instead, use something like for toknum, tokval, start, end, line in tokenize.generate_tokens(...): (this pattern also works in Python 3; see the Usage section).
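For comparison, here is a small Python 3 sketch of the two access styles. It assumes the source is provided as bytes and uses tokenize.tokenize(); the variable names are chosen only for this example.

```python
import io
import tokenize

source = b"1 + 2\n"

# Style 1: unpack each token as a plain 5-tuple. This is the only option
# in Python 2, and it still works in Python 3.
unpacked = []
for toknum, tokval, start, end, line in tokenize.tokenize(io.BytesIO(source).readline):
    unpacked.append((toknum, tokval, start, end))

# Style 2: use the namedtuple attributes (Python 3 only).
by_attr = [
    (tok.type, tok.string, tok.start, tok.end)
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]

# Both styles see exactly the same data.
print(unpacked == by_attr)
```

Tuple unpacking keeps code portable between Python 2 and 3, while attribute access tends to be easier to read when only one or two fields are needed.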


Contributions are welcome. So are questions. My goal here is to help people understand the tokenize module, so if something is not clear, please let me know. If you see something written here that is wrong, please make a pull request correcting it. I’m not an expert on tokenize. Most of what is written here comes from trial and error and from reading the source code.