Brown Water Python: Better Docs for the Python
tokenize module in the Python standard library is very powerful, but its
documentation is somewhat
limited. In the spirit of Thomas Kluyver’s Green Tree
Snakes project, which provides
similar extended documentation for the
ast module, I am providing here some
extended documentation for effectively working with the
Python Versions Supported¶
The contents of this guide apply to Python 3.5 and up. Several minor changes
were made to the
tokenize module in various Python versions between 3.5
and 3.8, and they have been noted where appropriate.
tokenize module tokenizes code according to the version of Python that
it is being run under. For example, some new syntax features in 3.6 affect
tokenization (in particular,
and underscores in numeric
123_456. This will tokenize as a single token in Python 3.6+,
123_456), but in Python 3.5, it tokenizes as two tokens,
_456) (see the reference for the
token type for more info).
Most of what is written here will also apply to earlier Python 3 versions, with obvious exceptions (like tokens that were added for new syntax), though none of it has been tested.
I don’t have any interest in supporting Python 2 in this guide. Its lifetime has officially come to an end, so you should strongly consider being Python 3-only for new code that is written.
With that being said, I will point out one important difference in Python 2:
tokenize() function in Python 2 prints the tokens instead of
returning them. Instead, you should use the
which works like
tokenize() in Python 3 (see the docs).
>>> # Python 2.7 tokenize example >>> import tokenize >>> import io >>> for tok in tokenize.generate_tokens(io.BytesIO('1 + 2').readline): ... print tok ... (2, '1', (1, 0), (1, 1), '1 + 2') (51, '+', (1, 2), (1, 3), '1 + 2') (2, '2', (1, 4), (1, 5), '1 + 2') (0, '', (2, 0), (2, 0), '')
Another difference is that the result of this function in Python 2 is a
regular tuple, not a
namedtuple, so you will not be able to use attributes
to access the members. Instead use something like
for toknum, tokval, start, end, line in tokenize.generate_tokens(...): (this pattern can be used in
Python 3 as well, see the Usage section).
Contributions are welcome.
So are questions. My
goal here is to help people to understand the
tokenize module, so if
something is not clear, please let me
know. If you see
something written here that is wrong, please make a pull
request correcting it.
I’m not an expert at
tokenize. I mainly know what is written here from trial
and error and from reading the source
Table of Contents¶
- What is Tokenization?
- The Token Types
- Helper Functions