The Token Types#
Every token produced by the tokenizer has a type. These types are represented
by integer constants. The actual integer value of the token constants is not
important (except for N_TOKENS), and should never be used or
relied on. Instead, refer to tokens by their variable names, and use the
tok_name dictionary to get the name of a token type. The exact integer value
could change between Python versions, for instance, if new tokens are added or
removed (and indeed, in recent versions of Python, they have). In the examples
below, the token number shown in the output is the number from Python 3.9.
The reason the token types are represented this way is that the actual
tokenizer used by the Python interpreter is not the tokenize module; it is a
much more efficient, but equivalent, implementation written in C. C does not
have an object system like Python. Instead, enumerated types are represented
by integers (in fact, tokenizer.c has a large array of the token types, and
the integer value of each token is its index in that array). The tokenize
module is written in pure Python, but the token type values and names mirror
those from the C tokenizer, with three exceptions: COMMENT, NL, and ENCODING.
All token types are defined in the token module, but the tokenize module
does from token import *, so they can be imported from tokenize as well.
Therefore, it is easiest to just import everything from tokenize.
Furthermore, the aforementioned COMMENT, NL, and
ENCODING tokens are not importable from token prior to Python 3.7,
only from tokenize.
The tok_name Dictionary#
The dictionary tok_name maps the tokens back to their names:
>>> import tokenize
>>> tokenize.STRING
3
>>> tokenize.tok_name[tokenize.STRING] # Can also use token.tok_name
'STRING'
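As a small illustration of referring to token types by name rather than by number, the following sketch tallies the tokens in a source string by their tok_name entry. The helper name count_token_types is hypothetical and is introduced only for this example.

```python
import collections
import io
import tokenize


def count_token_types(source):
    """Return a Counter mapping token type names (not integers) to counts."""
    counts = collections.Counter()
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        # Look the name up in tok_name instead of hard-coding integer values,
        # which may differ between Python versions.
        counts[tokenize.tok_name[tok.type]] += 1
    return counts
```

Because the result is keyed by name, code that consumes it keeps working even if the underlying integer values change.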
The Tokens#
To simplify the sections below, the following utility function is used in all the examples:
>>> import io
>>> def print_tokens(s):
...     for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
...         print(tok)
ENDMARKER#
This is always the last token emitted by tokenize(), unless it raises an
exception. The string and line attributes are always ''.
The start and end lines are always one more than the total number of lines
in the input, and the start and end columns are always 0.
For most applications it is not necessary to explicitly worry about
ENDMARKER, because tokenize() stops iteration after the last token is
yielded.
>>> print_tokens('x + 1\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='x', start=(1, 0), end=(1, 1), line='x + 1\n')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='x + 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 4), end=(1, 5), line='x + 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='x + 1\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(1, 0), end=(1, 0), line='')
NAME#
The NAME token type is used for any Python identifier, as well as every
keyword. Keywords are Python names that are reserved, that is, they cannot be
assigned to, such as for, def, and True.
>>> print_tokens('a or α\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='a', start=(1, 0), end=(1, 1), line='a or α\n')
TokenInfo(type=1 (NAME), string='or', start=(1, 2), end=(1, 4), line='a or α\n')
TokenInfo(type=1 (NAME), string='α', start=(1, 5), end=(1, 6), line='a or α\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 6), end=(1, 7), line='a or α\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
To tell if a NAME token is a keyword, use keyword.iskeyword() on the string.
>>> import keyword
>>> keyword.iskeyword('or')
True
As a side note, internally, the tokenize module uses the str.isidentifier()
method to test if a token should be a NAME token. The full rules for what
makes a valid identifier are somewhat complicated, as they involve a large
table of Unicode characters.
One should always use the str.isidentifier() method to test if a string is a
valid Python identifier, combined with a keyword.iskeyword() check. Testing
if a string is an identifier using regular expressions is highly
discouraged.
>>> 'α'.isidentifier()
True
>>> 'or'.isidentifier()
True
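The combined check recommended above can be sketched as a one-line helper. The name is_usable_name is hypothetical; it simply pairs the two standard library calls already mentioned.

```python
import keyword


def is_usable_name(s):
    """True if s is a valid identifier that is not a reserved keyword."""
    return s.isidentifier() and not keyword.iskeyword(s)
```

For instance, 'α' and '_456' pass this check, while 'or' (a keyword) and '1x' (not an identifier) do not.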
NUMBER#
The NUMBER token type is used for any numeric literal, including (decimal) integer literals,
binary, octal, and hexadecimal integer literals, floating point numbers
(including scientific notation), and imaginary number literals (like 1j).
>>> print_tokens('10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='10', start=(1, 0), end=(1, 2), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='0b101', start=(1, 5), end=(1, 10), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 11), end=(1, 12), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='0o10', start=(1, 13), end=(1, 17), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 18), end=(1, 19), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='0xa', start=(1, 20), end=(1, 23), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='-', start=(1, 24), end=(1, 25), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='1.0', start=(1, 26), end=(1, 29), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 30), end=(1, 31), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='1e1', start=(1, 32), end=(1, 35), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 36), end=(1, 37), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='1j', start=(1, 38), end=(1, 40), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 40), end=(1, 41), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
Note that even though an expression like 1+2j evaluates to a single complex
value, it tokenizes as NUMBER (1), OP (+), NUMBER (2j).
>>> print_tokens('1+2j\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1+2j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 1), end=(1, 2), line='1+2j\n')
TokenInfo(type=2 (NUMBER), string='2j', start=(1, 2), end=(1, 4), line='1+2j\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 4), end=(1, 5), line='1+2j\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
Invalid numeric literals may tokenize as multiple numeric literals.
>>> print_tokens('012\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='0', start=(1, 0), end=(1, 1), line='012\n')
TokenInfo(type=2 (NUMBER), string='12', start=(1, 1), end=(1, 3), line='012\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 3), end=(1, 4), line='012\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('0x1.0\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='0x1', start=(1, 0), end=(1, 3), line='0x1.0\n')
TokenInfo(type=2 (NUMBER), string='.0', start=(1, 3), end=(1, 5), line='0x1.0\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='0x1.0\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('0o184\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='0o1', start=(1, 0), end=(1, 3), line='0o184\n')
TokenInfo(type=2 (NUMBER), string='84', start=(1, 3), end=(1, 5), line='0o184\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='0o184\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
One advantage of using tokenize over ast is that floating point numbers
are not rounded at the tokenization stage, so it is possible to access the
full input.
>>> 1.0000000000000001
1.0
>>> print_tokens('1.0000000000000001\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1.0000000000000001', start=(1, 0), end=(1, 18), line='1.0000000000000001\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 18), end=(1, 19), line='1.0000000000000001\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> import ast
>>> ast.dump(ast.parse('1.0000000000000001'))
'Module(body=[Expr(value=Constant(value=1.0))], type_ignores=[])'
This can be used, for instance, to wrap floating point numbers with a type
that supports arbitrary precision, such as decimal.Decimal. See the examples.
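A minimal sketch of that idea follows, modeled on the approach shown in the standard library's tokenize documentation. The function name wrap_floats is hypothetical, and the test for "is a float literal" (a NUMBER token containing a '.' that is not imaginary) is a simplifying assumption that ignores literals such as 1e1.

```python
import io
import tokenize
from decimal import Decimal


def wrap_floats(source):
    """Rewrite float literals such as 1.1 into Decimal('1.1') calls."""
    result = []
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        is_float = (tok.type == tokenize.NUMBER
                    and '.' in tok.string
                    and not tok.string.lower().endswith('j'))
        if is_float:
            # Pass the literal's original text as a string so that no
            # rounding happens before Decimal sees it.
            result.extend([
                (tokenize.NAME, 'Decimal'),
                (tokenize.OP, '('),
                (tokenize.STRING, repr(tok.string)),
                (tokenize.OP, ')'),
            ])
        else:
            result.append((tok.type, tok.string))
    # Given 2-tuples, untokenize() does not preserve the original spacing.
    # It returns bytes because the first token is ENCODING.
    return tokenize.untokenize(result).decode('utf-8')
```

The rewritten source can then be evaluated in a namespace that provides Decimal, for example eval(wrap_floats('1.1 + 2.2\n').strip(), {'Decimal': Decimal}).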
In Python >=3.6, numeric literals can have underscore separators, like
123_456.
>>> # Python 3.6+ only.
>>> print_tokens('123_456\n')
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='123_456', start=(1, 0), end=(1, 7), line='123_456\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 7), end=(1, 8), line='123_456\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
In Python 3.5, this will tokenize as two tokens, NUMBER (123) and NAME
(_456) (and will not be syntactically valid in any context).
>>> # The behavior in Python 3.5
>>> print_tokens('123_456\n')
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='123', start=(1, 0), end=(1, 3), line='123_456\n')
TokenInfo(type=1 (NAME), string='_456', start=(1, 3), end=(1, 7), line='123_456\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 7), end=(1, 8), line='123_456\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
In the examples
we will see how to use tokenize to backport this feature to Python 3.5.
STRING#
The STRING token type matches any string literal, including single-quoted and
double-quoted strings, triple-quoted strings using either quote character
(i.e., multi-line strings, or “docstrings”), raw, “unicode”, bytes, and
f-strings (Python 3.6+).
>>> print_tokens("""
... "I" + 'love' + '''tokenize'''
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=3 (STRING), string='"I"', start=(2, 0), end=(2, 3), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=54 (OP), string='+', start=(2, 4), end=(2, 5), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=3 (STRING), string="'love'", start=(2, 6), end=(2, 12), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=54 (OP), string='+', start=(2, 13), end=(2, 14), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=3 (STRING), string="'''tokenize'''", start=(2, 15), end=(2, 29), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 29), end=(2, 30), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')
Note that even though Python implicitly concatenates string literals,
tokenize tokenizes them separately.
>>> print_tokens('"this is" " fun"\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string='"this is"', start=(1, 0), end=(1, 9), line='"this is" " fun"\n')
TokenInfo(type=3 (STRING), string='" fun"', start=(1, 10), end=(1, 16), line='"this is" " fun"\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 16), end=(1, 17), line='"this is" " fun"\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
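Since adjacent literals arrive as separate STRING tokens, implicit concatenation can be detected by looking for two STRING tokens in a row. The following is a rough sketch; the name implicit_concatenations is hypothetical, and skipping NL and COMMENT tokens is an assumption made so that literals split across lines inside parentheses are also found.

```python
import io
import tokenize


def implicit_concatenations(source):
    """Yield (first, second) pairs of STRING tokens that are directly adjacent."""
    previous = None
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            # These do not separate literals inside a bracketed expression.
            continue
        if (tok.type == tokenize.STRING and previous is not None
                and previous.type == tokenize.STRING):
            yield previous, tok
        previous = tok
```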
In the case of raw, “unicode”, bytes, and f-strings, the string prefix is included in the tokenized string.
>>> print_tokens(r"rb'\hello'" + '\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="rb'\\hello'", start=(1, 0), end=(1, 10), line="rb'\\hello'\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 10), end=(1, 11), line="rb'\\hello'\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
f-strings (Python 3.6+) are parsed as a single STRING token.
>>> # Python 3.6+ only.
>>> print_tokens('f"{a + b}"\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string='f"{a + b}"', start=(1, 0), end=(1, 10), line='f"{a + b}"\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 10), end=(1, 11), line='f"{a + b}"\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
The string.Formatter.parse() method can be used to parse format strings
(including f-strings).
>>> a = 1
>>> b = 2
>>> # f-strings are Python 3.6+ only
>>> f'a + b is {a + b!r}'
'a + b is 3'
>>> import string
>>> list(string.Formatter().parse('a + b is {a + b!r}'))
[('a + b is ', 'a + b', '', 'r')]
To get the string value from a tokenized string literal (i.e., to strip away
the quote characters), use ast.literal_eval(). This is recommended over
trying to strip the quotes manually, which is error prone, or using raw
eval, which can execute arbitrary code in the case of an f-string.
>>> ast.literal_eval("rb'a\\''")
b"a\\'"
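Putting this together with the tokenizer, the sketch below collects the evaluated value of every STRING token in a piece of source. The helper name string_values is hypothetical. Note that ast.literal_eval() does not accept f-strings, so this sketch assumes the input contains none.

```python
import ast
import io
import tokenize


def string_values(source):
    """Return the evaluated value of every STRING token in source."""
    values = []
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        if tok.type == tokenize.STRING:
            # literal_eval handles the quotes, prefixes such as r and b,
            # and escape sequences without executing arbitrary code.
            values.append(ast.literal_eval(tok.string))
    return values
```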
Error Behavior#
If a single quoted string is unclosed, the opening string delimiter is
tokenized as ERRORTOKEN, and the remainder is tokenized as if
it were not in a string.
>>> print_tokens("'unclosed + string\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=60 (ERRORTOKEN), string="'", start=(1, 0), end=(1, 1), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='unclosed', start=(1, 1), end=(1, 9), line="'unclosed + string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 10), end=(1, 11), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='string', start=(1, 12), end=(1, 18), line="'unclosed + string\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 18), end=(1, 19), line="'unclosed + string\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
This behavior can be useful for handling error situations. For example, if you
were to build a syntax highlighter using tokenize, you might not necessarily
want an unclosed string to highlight the rest of the document as a string
(such things are common in “live” editing environments).
However, if a triple quoted string (i.e., multi-line string, or “docstring”)
is not closed, tokenize will raise TokenError
when it reaches it.
>>> print_tokens("'an ' + '''unclosed multi-line string\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="'an '", start=(1, 0), end=(1, 5), line="'an ' + '''unclosed multi-line string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 6), end=(1, 7), line="'an ' + '''unclosed multi-line string\n")
Traceback (most recent call last):
...
raise TokenError("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (1, 8))
This behavior can be annoying to deal with in practice. For many applications,
the correct way to handle this scenario is to consider that since the unclosed
string is multi-line, the remainder of the input from where the
TokenError is raised is inside the unclosed string.
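One way to apply this is to collect tokens until the exception occurs and keep the position it reports. This is only a sketch: the name tokenize_until_error is hypothetical, and it assumes the exception's arguments are a message followed by a (row, column) pair, as in the traceback above.

```python
import io
import tokenize


def tokenize_until_error(source):
    """Return (tokens, error_position); error_position is None on success."""
    tokens = []
    readline = io.BytesIO(source.encode('utf-8')).readline
    try:
        for tok in tokenize.tokenize(readline):
            tokens.append(tok)
    except tokenize.TokenError as exc:
        # The second argument is the (row, column) reported by the tokenizer.
        _message, position = exc.args
        return tokens, position
    return tokens, None
```

A caller such as a syntax highlighter could then treat everything from the reported position to the end of the input as string contents.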
As a final note, beware that it is possible to construct string literals that
tokenize without any errors, but raise SyntaxError when parsed by the
interpreter.
>>> print_tokens(r"'\N{NOT REAL}'" + '\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="'\\N{NOT REAL}'", start=(1, 0), end=(1, 14), line="'\\N{NOT REAL}'\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 14), end=(1, 15), line="'\\N{NOT REAL}'\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> eval(r"'\N{NOT REAL}'")
Traceback (most recent call last):
...
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-11: unknown Unicode character name
To test if a string literal is valid, you can use the ast.literal_eval()
function, which is safe to use on untrusted input.
>>> ast.literal_eval(r"'\N{NOT REAL}'")
Traceback (most recent call last):
...
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-11: unknown Unicode character name
NEWLINE#
The NEWLINE token type represents a newline character (\n or \r\n) that
ends a logical line of Python code. Newlines that do not end a logical line of
Python code use NL.
>>> print_tokens("""\
... def hello():
...     return 'hello world'
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def hello():\n')
TokenInfo(type=1 (NAME), string='hello', start=(1, 4), end=(1, 9), line='def hello():\n')
TokenInfo(type=54 (OP), string='(', start=(1, 9), end=(1, 10), line='def hello():\n')
TokenInfo(type=54 (OP), string=')', start=(1, 10), end=(1, 11), line='def hello():\n')
TokenInfo(type=54 (OP), string=':', start=(1, 11), end=(1, 12), line='def hello():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 12), end=(1, 13), line='def hello():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line="    return 'hello world'\n")
TokenInfo(type=1 (NAME), string='return', start=(2, 4), end=(2, 10), line="    return 'hello world'\n")
TokenInfo(type=3 (STRING), string="'hello world'", start=(2, 11), end=(2, 24), line="    return 'hello world'\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 24), end=(2, 25), line="    return 'hello world'\n")
TokenInfo(type=6 (DEDENT), string='', start=(3, 0), end=(3, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')
Windows-style newlines (\r\n) are tokenized as a single token.
>>> print_tokens("1\n2\r\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='1\n')
TokenInfo(type=2 (NUMBER), string='2', start=(2, 0), end=(2, 1), line='2\r\n')
TokenInfo(type=4 (NEWLINE), string='\r\n', start=(2, 1), end=(2, 3), line='2\r\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')
Starting in the Python 3.6.7 and 3.7.1 patch releases, NEWLINE is emitted at
the end of all tokenize input even if it doesn’t end in a newline (with the
exception of lines that are just comments). This change was made for
consistency with the internal tokenizer.c module used by Python itself.
Implicit NEWLINEs have the line set to ''. This change was not made to
Python 3.5, which is already in “security fix only” mode. The examples
in this document all use the 3.6.7+ behavior. If consistency is desired, one
can always force the input to end in a newline (this is why every example in
this document has a newline at the end).
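Forcing a trailing newline can be sketched as a thin wrapper around tokenize(). The name tokenize_with_final_newline is hypothetical, and leaving empty input untouched is a choice made for this example.

```python
import io
import tokenize


def tokenize_with_final_newline(source):
    """Tokenize source, appending a newline first if it lacks one."""
    if source and not source.endswith(('\n', '\r')):
        source += '\n'
    readline = io.BytesIO(source.encode('utf-8')).readline
    return list(tokenize.tokenize(readline))
```

With this wrapper, the final NEWLINE token always comes from a real newline character in the input, so its string is '\n' regardless of the Python patch release.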
INDENT#
DEDENT#
The INDENT token type represents the indentation for indented blocks. The
indentation itself (the text from the beginning of the line to the first
nonwhitespace character) is in the string attribute. INDENT is emitted
once per block of indented text, not once per line.
The DEDENT token type represents a dedentation. Every INDENT token is
matched by a corresponding DEDENT token. The string attribute of DEDENT
is always ''. The start and end positions of a DEDENT token are the
first position in the line after the indentation (even if there are multiple
consecutive DEDENTs).
Consider the following pseudo-example:
>>> print_tokens("""
... 1
...     2
...     3
...         4
... 5
...
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 0), end=(2, 1), line='1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 1), end=(2, 2), line='1\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    2\n')
TokenInfo(type=2 (NUMBER), string='2', start=(3, 4), end=(3, 5), line='    2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 5), end=(3, 6), line='    2\n')
TokenInfo(type=2 (NUMBER), string='3', start=(4, 4), end=(4, 5), line='    3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 5), end=(4, 6), line='    3\n')
TokenInfo(type=5 (INDENT), string='        ', start=(5, 0), end=(5, 8), line='        4\n')
TokenInfo(type=2 (NUMBER), string='4', start=(5, 8), end=(5, 9), line='        4\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 9), end=(5, 10), line='        4\n')
TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line='5\n')
TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line='5\n')
TokenInfo(type=2 (NUMBER), string='5', start=(6, 0), end=(6, 1), line='5\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(6, 1), end=(6, 2), line='5\n')
TokenInfo(type=62 (NL), string='\n', start=(7, 0), end=(7, 1), line='\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(8, 0), end=(8, 0), line='')
There is one INDENT before the 2-3 block, one INDENT before 4, and two
DEDENTs before 5.
INDENT is not used for indentations on line continuations.
>>> print_tokens("""
... (1 +
...     2)
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=54 (OP), string='(', start=(2, 0), end=(2, 1), line='(1 +\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 1), end=(2, 2), line='(1 +\n')
TokenInfo(type=54 (OP), string='+', start=(2, 3), end=(2, 4), line='(1 +\n')
TokenInfo(type=62 (NL), string='\n', start=(2, 4), end=(2, 5), line='(1 +\n')
TokenInfo(type=2 (NUMBER), string='2', start=(3, 4), end=(3, 5), line='    2)\n')
TokenInfo(type=54 (OP), string=')', start=(3, 5), end=(3, 6), line='    2)\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 6), end=(3, 7), line='    2)\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')
Indentation can be any number of spaces or tabs. The only restriction is that
every unindented indentation level must match a previous outer indentation
level. If an unindent does not match an outer indentation level, tokenize()
raises IndentationError.
>>> print_tokens("""
... def countdown(x):
... \tassert x>=0
... \twhile x:
... \t\tprint(x)
... \t\tx -= 1
... \tprint('Go!')
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 0), end=(2, 3), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='countdown', start=(2, 4), end=(2, 13), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string='(', start=(2, 13), end=(2, 14), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 14), end=(2, 15), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=')', start=(2, 15), end=(2, 16), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=':', start=(2, 16), end=(2, 17), line='def countdown(x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='def countdown(x):\n')
TokenInfo(type=5 (INDENT), string='\t', start=(3, 0), end=(3, 1), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='assert', start=(3, 1), end=(3, 7), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='x', start=(3, 8), end=(3, 9), line='\tassert x>=0\n')
TokenInfo(type=54 (OP), string='>=', start=(3, 9), end=(3, 11), line='\tassert x>=0\n')
TokenInfo(type=2 (NUMBER), string='0', start=(3, 11), end=(3, 12), line='\tassert x>=0\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 12), end=(3, 13), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='while', start=(4, 1), end=(4, 6), line='\twhile x:\n')
TokenInfo(type=1 (NAME), string='x', start=(4, 7), end=(4, 8), line='\twhile x:\n')
TokenInfo(type=54 (OP), string=':', start=(4, 8), end=(4, 9), line='\twhile x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 9), end=(4, 10), line='\twhile x:\n')
TokenInfo(type=5 (INDENT), string='\t\t', start=(5, 0), end=(5, 2), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='print', start=(5, 2), end=(5, 7), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string='(', start=(5, 7), end=(5, 8), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(5, 8), end=(5, 9), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string=')', start=(5, 9), end=(5, 10), line='\t\tprint(x)\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 10), end=(5, 11), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(6, 2), end=(6, 3), line='\t\tx -= 1\n')
TokenInfo(type=54 (OP), string='-=', start=(6, 4), end=(6, 6), line='\t\tx -= 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(6, 7), end=(6, 8), line='\t\tx -= 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(6, 8), end=(6, 9), line='\t\tx -= 1\n')
TokenInfo(type=6 (DEDENT), string='', start=(7, 1), end=(7, 1), line="\tprint('Go!')\n")
TokenInfo(type=1 (NAME), string='print', start=(7, 1), end=(7, 6), line="\tprint('Go!')\n")
TokenInfo(type=54 (OP), string='(', start=(7, 6), end=(7, 7), line="\tprint('Go!')\n")
TokenInfo(type=3 (STRING), string="'Go!'", start=(7, 7), end=(7, 12), line="\tprint('Go!')\n")
TokenInfo(type=54 (OP), string=')', start=(7, 12), end=(7, 13), line="\tprint('Go!')\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(7, 13), end=(7, 14), line="\tprint('Go!')\n")
TokenInfo(type=6 (DEDENT), string='', start=(8, 0), end=(8, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(8, 0), end=(8, 0), line='')
>>> print_tokens("""
... def countdown(x):
... \tassert x>=0
... \twhile x:
... \t\tprint(x)
... \t\tx -= 1
... print('Go!')
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 0), end=(2, 3), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='countdown', start=(2, 4), end=(2, 13), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string='(', start=(2, 13), end=(2, 14), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 14), end=(2, 15), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=')', start=(2, 15), end=(2, 16), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=':', start=(2, 16), end=(2, 17), line='def countdown(x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='def countdown(x):\n')
TokenInfo(type=5 (INDENT), string='\t', start=(3, 0), end=(3, 1), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='assert', start=(3, 1), end=(3, 7), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='x', start=(3, 8), end=(3, 9), line='\tassert x>=0\n')
TokenInfo(type=54 (OP), string='>=', start=(3, 9), end=(3, 11), line='\tassert x>=0\n')
TokenInfo(type=2 (NUMBER), string='0', start=(3, 11), end=(3, 12), line='\tassert x>=0\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 12), end=(3, 13), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='while', start=(4, 1), end=(4, 6), line='\twhile x:\n')
TokenInfo(type=1 (NAME), string='x', start=(4, 7), end=(4, 8), line='\twhile x:\n')
TokenInfo(type=54 (OP), string=':', start=(4, 8), end=(4, 9), line='\twhile x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 9), end=(4, 10), line='\twhile x:\n')
TokenInfo(type=5 (INDENT), string='\t\t', start=(5, 0), end=(5, 2), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='print', start=(5, 2), end=(5, 7), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string='(', start=(5, 7), end=(5, 8), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(5, 8), end=(5, 9), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string=')', start=(5, 9), end=(5, 10), line='\t\tprint(x)\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 10), end=(5, 11), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(6, 2), end=(6, 3), line='\t\tx -= 1\n')
TokenInfo(type=54 (OP), string='-=', start=(6, 4), end=(6, 6), line='\t\tx -= 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(6, 7), end=(6, 8), line='\t\tx -= 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(6, 8), end=(6, 9), line='\t\tx -= 1\n')
Traceback (most recent call last):
...
print('Go!')
^
IndentationError: unindent does not match any outer indentation level
The level of indentation at a particular point in the token stream can be
determined by incrementing and decrementing a counter for each INDENT and
DEDENT token, or if the exact indentation spacing is required, by
maintaining a stack of the INDENT strings (recall that Python allows
different indentation levels to use a different number of spaces). See the
examples.
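The stack approach might look like the following sketch. The function name indentation_of_numbers is hypothetical, and reporting only NUMBER tokens is a simplification chosen to match the numbered pseudo-example earlier in this section.

```python
import io
import tokenize


def indentation_of_numbers(source):
    """Return (number, depth, indentation) for every NUMBER token in source."""
    stack = []  # INDENT strings of the blocks that are currently open
    found = []
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        if tok.type == tokenize.INDENT:
            stack.append(tok.string)
        elif tok.type == tokenize.DEDENT:
            stack.pop()
        elif tok.type == tokenize.NUMBER:
            # len(stack) is the nesting depth; the top of the stack is the
            # full indentation text of the current block.
            found.append((tok.string, len(stack), stack[-1] if stack else ''))
    return found
```

If only the depth is needed, the stack can be replaced by an integer counter that is incremented on INDENT and decremented on DEDENT.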
RARROW#
ELLIPSIS#
The RARROW and ELLIPSIS tokens tokenize as OP. However, due to a
bug present in Python versions prior to
3.7, the exact_type attribute of these tokens will be OP instead of the
correct type.
>>> # Python 3.5 and 3.6 behavior
>>> for tok in tokenize.tokenize(io.BytesIO(b'def func() -> list: ...\n').readline):
...     print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string))
ENCODING ENCODING 'utf-8'
NAME NAME 'def'
NAME NAME 'func'
OP LPAR '('
OP RPAR ')'
OP OP '->'
NAME NAME 'list'
OP COLON ':'
OP OP '...'
NEWLINE NEWLINE '\n'
ENDMARKER ENDMARKER ''
This bug has been fixed in Python 3.7.
>>> # Python 3.7+ behavior
>>> for tok in tokenize.tokenize(io.BytesIO(b'def func() -> list: ...\n').readline):
...     print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string))
ENCODING ENCODING 'utf-8'
NAME NAME 'def'
NAME NAME 'func'
OP LPAR '('
OP RPAR ')'
OP RARROW '->'
NAME NAME 'list'
OP COLON ':'
OP ELLIPSIS '...'
NEWLINE NEWLINE '\n'
ENDMARKER ENDMARKER ''
OP#
OP is a generic token type for all operators, delimiters, and the ellipsis
literal. This does not include characters and operators that are not
recognized by the parser (these are parsed as ERRORTOKEN).
When using tokenize(), the token type for an operator, delimiter, or
ellipsis literal token will be OP. To get the exact token type, use the
exact_type property of the namedtuple. tok.exact_type is equivalent to
tok.type for the remaining token types (with two exceptions, see the notes
below).
>>> import io
>>> for tok in tokenize.tokenize(io.BytesIO(b'[1+2]\n').readline):
...     print(tokenize.tok_name[tok.type], repr(tok.string))
ENCODING 'utf-8'
OP '['
NUMBER '1'
OP '+'
NUMBER '2'
OP ']'
NEWLINE '\n'
ENDMARKER ''
>>> for tok in tokenize.tokenize(io.BytesIO(b'[1+2]\n').readline):
...     print(tokenize.tok_name[tok.exact_type], repr(tok.string))
ENCODING 'utf-8'
LSQB '['
NUMBER '1'
PLUS '+'
NUMBER '2'
RSQB ']'
NEWLINE '\n'
ENDMARKER ''
The following table lists all exact OP types and their corresponding
characters.
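| Exact token type | String value |
|---|---|
| LPAR | ( |
| RPAR | ) |
| LSQB | [ |
| RSQB | ] |
| COLON | : |
| COMMA | , |
| SEMI | ; |
| PLUS | + |
| MINUS | - |
| STAR | * |
| SLASH | / |
| VBAR | \| |
| AMPER | & |
| LESS | < |
| GREATER | > |
| EQUAL | = |
| DOT | . |
| PERCENT | % |
| LBRACE | { |
| RBRACE | } |
| EQEQUAL | == |
| NOTEQUAL | != |
| LESSEQUAL | <= |
| GREATEREQUAL | >= |
| TILDE | ~ |
| CIRCUMFLEX | ^ |
| LEFTSHIFT | << |
| RIGHTSHIFT | >> |
| DOUBLESTAR | ** |
| PLUSEQUAL | += |
| MINEQUAL | -= |
| STAREQUAL | *= |
| SLASHEQUAL | /= |
| PERCENTEQUAL | %= |
| AMPEREQUAL | &= |
| VBAREQUAL | \|= |
| CIRCUMFLEXEQUAL | ^= |
| LEFTSHIFTEQUAL | <<= |
| RIGHTSHIFTEQUAL | >>= |
| DOUBLESTAREQUAL | **= |
| DOUBLESLASH | // |
| DOUBLESLASHEQUAL | //= |
| AT | @ |
| ATEQUAL | @= |
| RARROW | -> |
| ELLIPSIS | ... |
| COLONEQUAL | := |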
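The list of exact OP types for the running interpreter can also be produced programmatically. This is a sketch that assumes Python 3.8 or later, where the token module exposes the EXACT_TOKEN_TYPES dictionary (operator string to token number); newer Python versions may list operators that are not discussed here.

```python
import token

# Each row is (token number, exact type name, operator string), ordered by
# token number.
rows = sorted(
    (number, token.tok_name[number], string)
    for string, number in token.EXACT_TOKEN_TYPES.items()
)

for number, name, string in rows:
    print(f'{name:<18} {string}')
```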
AWAIT#
ASYNC#
The AWAIT and ASYNC token types are used to tokenize the await and
async keywords in Python 3.5 and 3.6. They do not exist in Python 3.7+.
In Python 3.5 and 3.6, await and async are pseudo-keywords. To ease the
transition to the new keywords, await and async were kept as valid variable
names outside of async def blocks.
>>> # This is valid Python in Python 3.5 and 3.6. It isn't in Python 3.7+.
>>> async = 1
>>> print_tokens("async = 1\n")
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='async', start=(1, 0), end=(1, 5), line='async = 1\n')
TokenInfo(type=53 (OP), string='=', start=(1, 6), end=(1, 7), line='async = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 8), end=(1, 9), line='async = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='async = 1\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
To support this, in Python 3.5 and 3.6 when async def is encountered,
tokenize keeps track of its indentation level, and all await and async
tokens that are nested under it are tokenized as AWAIT and ASYNC,
respectively (including the async from the async def). Otherwise, await
and async are tokenized as NAME, as in the example above.
>>> # This is the behavior in Python 3.5 and 3.6
>>> print_tokens("""
... async def coro():
...     async with lock:
...         await f()
... await = 1
... """)
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=58 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=55 (ASYNC), string='async', start=(2, 0), end=(2, 5), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 6), end=(2, 9), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='coro', start=(2, 10), end=(2, 14), line='async def coro():\n')
TokenInfo(type=53 (OP), string='(', start=(2, 14), end=(2, 15), line='async def coro():\n')
TokenInfo(type=53 (OP), string=')', start=(2, 15), end=(2, 16), line='async def coro():\n')
TokenInfo(type=53 (OP), string=':', start=(2, 16), end=(2, 17), line='async def coro():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='async def coro():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    async with lock:\n')
TokenInfo(type=55 (ASYNC), string='async', start=(3, 4), end=(3, 9), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='with', start=(3, 10), end=(3, 14), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='lock', start=(3, 15), end=(3, 19), line='    async with lock:\n')
TokenInfo(type=53 (OP), string=':', start=(3, 19), end=(3, 20), line='    async with lock:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 20), end=(3, 21), line='    async with lock:\n')
TokenInfo(type=5 (INDENT), string='        ', start=(4, 0), end=(4, 8), line='        await f()\n')
TokenInfo(type=54 (AWAIT), string='await', start=(4, 8), end=(4, 13), line='        await f()\n')
TokenInfo(type=1 (NAME), string='f', start=(4, 14), end=(4, 15), line='        await f()\n')
TokenInfo(type=53 (OP), string='(', start=(4, 15), end=(4, 16), line='        await f()\n')
TokenInfo(type=53 (OP), string=')', start=(4, 16), end=(4, 17), line='        await f()\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 17), end=(4, 18), line='        await f()\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='await = 1\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='await = 1\n')
TokenInfo(type=1 (NAME), string='await', start=(5, 0), end=(5, 5), line='await = 1\n')
TokenInfo(type=53 (OP), string='=', start=(5, 6), end=(5, 7), line='await = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(5, 8), end=(5, 9), line='await = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 9), end=(5, 10), line='await = 1\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(6, 0), end=(6, 0), line='')
In Python 3.7, async and await became proper keywords and are tokenized as
NAME, like all other keywords. The AWAIT and ASYNC token types were also
removed from the token module in that release.
>>> # This is the behavior in Python 3.7+
>>> print_tokens("""
... async def coro():
...     async with lock:
...         await f()
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='async', start=(2, 0), end=(2, 5), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 6), end=(2, 9), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='coro', start=(2, 10), end=(2, 14), line='async def coro():\n')
TokenInfo(type=54 (OP), string='(', start=(2, 14), end=(2, 15), line='async def coro():\n')
TokenInfo(type=54 (OP), string=')', start=(2, 15), end=(2, 16), line='async def coro():\n')
TokenInfo(type=54 (OP), string=':', start=(2, 16), end=(2, 17), line='async def coro():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='async def coro():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='async', start=(3, 4), end=(3, 9), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='with', start=(3, 10), end=(3, 14), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='lock', start=(3, 15), end=(3, 19), line='    async with lock:\n')
TokenInfo(type=54 (OP), string=':', start=(3, 19), end=(3, 20), line='    async with lock:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 20), end=(3, 21), line='    async with lock:\n')
TokenInfo(type=5 (INDENT), string='        ', start=(4, 0), end=(4, 8), line='        await f()\n')
TokenInfo(type=1 (NAME), string='await', start=(4, 8), end=(4, 13), line='        await f()\n')
TokenInfo(type=1 (NAME), string='f', start=(4, 14), end=(4, 15), line='        await f()\n')
TokenInfo(type=54 (OP), string='(', start=(4, 15), end=(4, 16), line='        await f()\n')
TokenInfo(type=54 (OP), string=')', start=(4, 16), end=(4, 17), line='        await f()\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 17), end=(4, 18), line='        await f()\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')
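Because keywords such as async and await are reported as plain NAME tokens in Python 3.7+, code that needs to tell keywords apart from identifiers has to check the token string itself. A minimal sketch using the standard keyword module follows; the helper name keyword_tokens is hypothetical.

```python
import io
import keyword
import tokenize


def keyword_tokens(source):
    """Return the strings of all NAME tokens that are hard keywords."""
    readline = io.BytesIO(source.encode('utf-8')).readline
    return [
        tok.string
        for tok in tokenize.tokenize(readline)
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
    ]


found = keyword_tokens('async def coro():\n    await f()\n')
print(found)
```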
In Python 3.8, the ASYNC and AWAIT tokens were re-added to the token
module, but they are not emitted by default (the behavior is the same as in
3.7). They exist only to facilitate the new feature_version flag to
ast.parse(), which allows
parsing Python as older versions would.
TYPE_IGNORE#
TYPE_COMMENT#
TYPE_IGNORE and TYPE_COMMENT are included here for completeness, since
they are in the tok_name dictionary. They are
used in the C tokenizer to tokenize type comments, but the Python tokenizer
does not yet tokenize them; type comments are emitted as ordinary COMMENT
tokens. This presumably will change in the future, as the Python tokenizer is
generally made to have the same behavior as the C tokenizer. Both token types
were added in Python 3.8.
>>> print_tokens('# type: ignore\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (COMMENT), string='# type: ignore', start=(1, 0), end=(1, 14), line='# type: ignore\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 14), end=(1, 15), line='# type: ignore\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
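Since type comments arrive as COMMENT tokens, a tool that cares about them has to inspect the comment text itself. The following is only a sketch: the regular expression and the helper name find_type_comments are assumptions made for illustration, and the pattern is deliberately simpler than the rules used by real type checkers.

```python
import io
import re
import tokenize

# Simplified, hypothetical pattern: "#", optional spaces, "type:", the rest.
TYPE_COMMENT_RE = re.compile(r'#\s*type:\s*(.*)')


def find_type_comments(source):
    """Return (line number, text after 'type:') for each type comment."""
    results = []
    readline = io.BytesIO(source.encode('utf-8')).readline
    for tok in tokenize.tokenize(readline):
        if tok.type == tokenize.COMMENT:
            match = TYPE_COMMENT_RE.match(tok.string)
            if match:
                results.append((tok.start[0], match.group(1).strip()))
    return results


type_comments = find_type_comments('x = []  # type: list\ny = 1  # type: ignore\n')
print(type_comments)
```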
SOFT_KEYWORD#
SOFT_KEYWORD is also included here for completeness. It is in the
tok_name dictionary, but it is not actually used
anywhere in the tokenize module, and cannot be emitted as a token from
tokenize(). It is used in the C tokenizer as part of the PEG parser to handle
“soft keywords” like match and case that are only keywords when used in
certain contexts. It was added in Python 3.10.
>>> # Python 3.10+
>>> print_tokens('match = 1')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='match', start=(1, 0), end=(1, 5), line='match = 1')
TokenInfo(type=54 (OP), string='=', start=(1, 6), end=(1, 7), line='match = 1')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 8), end=(1, 9), line='match = 1')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 9), end=(1, 10), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens("""\
... match x:
...    case 1:
...        pass
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='match', start=(1, 0), end=(1, 5), line='match x:\n')
TokenInfo(type=1 (NAME), string='x', start=(1, 6), end=(1, 7), line='match x:\n')
TokenInfo(type=54 (OP), string=':', start=(1, 7), end=(1, 8), line='match x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 8), end=(1, 9), line='match x:\n')
TokenInfo(type=5 (INDENT), string='   ', start=(2, 0), end=(2, 3), line='   case 1:\n')
TokenInfo(type=1 (NAME), string='case', start=(2, 3), end=(2, 7), line='   case 1:\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 8), end=(2, 9), line='   case 1:\n')
TokenInfo(type=54 (OP), string=':', start=(2, 9), end=(2, 10), line='   case 1:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 10), end=(2, 11), line='   case 1:\n')
TokenInfo(type=5 (INDENT), string='       ', start=(3, 0), end=(3, 7), line='       pass\n')
TokenInfo(type=1 (NAME), string='pass', start=(3, 7), end=(3, 11), line='       pass\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 11), end=(3, 12), line='       pass\n')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')
ERRORTOKEN#
The ERRORTOKEN type is used for any character that isn’t recognized. Inputs
that tokenize ERRORTOKENs cannot be valid Python, but this token type is
used so that applications that process tokens can do error recovery, as the
remainder of the input stream is tokenized normally. It can also be used to
process extensions to Python syntax (see the
examples). Every unrecognized
character is tokenized separately.
>>> print_tokens("1!!\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1!!\n')
TokenInfo(type=60 (ERRORTOKEN), string='!', start=(1, 1), end=(1, 2), line='1!!\n')
TokenInfo(type=60 (ERRORTOKEN), string='!', start=(1, 2), end=(1, 3), line='1!!\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 3), end=(1, 4), line='1!!\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('💯\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=60 (ERRORTOKEN), string='💯', start=(1, 0), end=(1, 1), line='💯\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='💯\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
ERRORTOKEN is also used for single-quoted strings that are not closed before
a newline. See the STRING section for more information.
>>> print_tokens("'unclosed + string\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=60 (ERRORTOKEN), string="'", start=(1, 0), end=(1, 1), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='unclosed', start=(1, 1), end=(1, 9), line="'unclosed + string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 10), end=(1, 11), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='string', start=(1, 12), end=(1, 18), line="'unclosed + string\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 18), end=(1, 19), line="'unclosed + string\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
If the string is continued and unclosed, the entire string is tokenized as an error token. Otherwise only the start quote delimiter is.
>>> print_tokens(r"""
... 'unclosed \
... continued string
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=60 (ERRORTOKEN), string="'unclosed \\\ncontinued string\n", start=(2, 0), end=(3, 17), line="'unclosed \\\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')
In the case of uncontinued unclosed single quoted strings, the spaces before
the string are also tokenized as ERRORTOKEN:
>>> print_tokens("'an' +  'unclosed string\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="'an'", start=(1, 0), end=(1, 4), line="'an' +  'unclosed string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 5), end=(1, 6), line="'an' +  'unclosed string\n")
TokenInfo(type=60 (ERRORTOKEN), string=' ', start=(1, 6), end=(1, 7), line="'an' +  'unclosed string\n")
TokenInfo(type=60 (ERRORTOKEN), string=' ', start=(1, 7), end=(1, 8), line="'an' +  'unclosed string\n")
TokenInfo(type=60 (ERRORTOKEN), string="'", start=(1, 8), end=(1, 9), line="'an' +  'unclosed string\n")
TokenInfo(type=1 (NAME), string='unclosed', start=(1, 9), end=(1, 17), line="'an' +  'unclosed string\n")
TokenInfo(type=1 (NAME), string='string', start=(1, 18), end=(1, 24), line="'an' +  'unclosed string\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 24), end=(1, 25), line="'an' +  'unclosed string\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
This doesn’t apply to unclosed continued strings:
>>> print_tokens(r"""
... 'an' +  'unclosed\
... continued string
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=3 (STRING), string="'an'", start=(2, 0), end=(2, 4), line="'an' +  'unclosed\\\n")
TokenInfo(type=54 (OP), string='+', start=(2, 5), end=(2, 6), line="'an' +  'unclosed\\\n")
TokenInfo(type=60 (ERRORTOKEN), string="'unclosed\\\ncontinued string\n", start=(2, 8), end=(3, 17), line="'an' +  'unclosed\\\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')
Therefore, code that handles ERRORTOKEN specifically for unclosed strings
should check tok.string[0] in '"\''.
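That check can be wrapped in a small predicate. This is a sketch with a hypothetical helper name, starts_unclosed_string; it is exercised on hand-built TokenInfo objects that mirror the outputs shown above, so it does not depend on how a particular Python version tokenizes invalid input.

```python
import tokenize


def starts_unclosed_string(tok):
    """True if tok is an ERRORTOKEN whose string begins with a quote."""
    # Slicing with [:1] avoids an IndexError if the string is ever empty.
    return tok.type == tokenize.ERRORTOKEN and tok.string[:1] in ('"', "'")


# Hand-built tokens copied from the examples in this section.
quote_tok = tokenize.TokenInfo(
    tokenize.ERRORTOKEN, "'", (1, 0), (1, 1), "'unclosed + string\n")
bang_tok = tokenize.TokenInfo(
    tokenize.ERRORTOKEN, '!', (1, 1), (1, 2), '1!!\n')

print(starts_unclosed_string(quote_tok), starts_unclosed_string(bang_tok))
```

The same predicate covers both the lone opening quote of an uncontinued string and the whole-string ERRORTOKEN of a continued one, since both begin with the quote character.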
COMMENT#
The COMMENT token type represents a comment. If a comment spans multiple
lines, each line is tokenized separately.
>>> print_tokens("""
... # This is a comment
... # This is another comment
... f() # This is a third comment
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=61 (COMMENT), string='# This is a comment', start=(2, 0), end=(2, 19), line='# This is a comment\n')
TokenInfo(type=62 (NL), string='\n', start=(2, 19), end=(2, 20), line='# This is a comment\n')
TokenInfo(type=61 (COMMENT), string='# This is another comment', start=(3, 0), end=(3, 25), line='# This is another comment\n')
TokenInfo(type=62 (NL), string='\n', start=(3, 25), end=(3, 26), line='# This is another comment\n')
TokenInfo(type=1 (NAME), string='f', start=(4, 0), end=(4, 1), line='f() # This is a third comment\n')
TokenInfo(type=54 (OP), string='(', start=(4, 1), end=(4, 2), line='f() # This is a third comment\n')
TokenInfo(type=54 (OP), string=')', start=(4, 2), end=(4, 3), line='f() # This is a third comment\n')
TokenInfo(type=61 (COMMENT), string='# This is a third comment', start=(4, 4), end=(4, 29), line='f() # This is a third comment\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 29), end=(4, 30), line='f() # This is a third comment\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')
The COMMENT token type exists only in the standard library Python
implementation of tokenize. The C implementation used by the interpreter
ignores comments. In Python versions prior to 3.7, COMMENT is only
importable from the tokenize module. In 3.7, it is added to the token
module as well.
NL#
The NL token type represents newline characters (\n or \r\n) that do not
end logical lines of code. Newlines that do end logical lines of Python code
are tokenized using the NEWLINE token type.
There are two situations where newlines are tokenized as NL:
Newlines that end lines that are continued after unclosed braces.
>>> print_tokens("""(1 +
... 2)
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='(1 +\n')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 1), end=(1, 2), line='(1 +\n')
TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='(1 +\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 4), end=(1, 5), line='(1 +\n')
TokenInfo(type=2 (NUMBER), string='2', start=(2, 0), end=(2, 1), line='2)\n')
TokenInfo(type=54 (OP), string=')', start=(2, 1), end=(2, 2), line='2)\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 2), end=(2, 3), line='2)\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')
Newlines that end empty lines or lines that only have comments.
>>> print_tokens("""
... # Comment line
...
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=61 (COMMENT), string='# Comment line', start=(2, 0), end=(2, 14), line='# Comment line\n')
TokenInfo(type=62 (NL), string='\n', start=(2, 14), end=(2, 15), line='# Comment line\n')
TokenInfo(type=62 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')
Note that newlines that are escaped (preceded with \) are treated like
whitespace, that is, they do not tokenize at all. Consequently, you should
always use the line numbers in the start and end attributes of the
TokenInfo namedtuple. Never try to determine line numbers by counting
NEWLINE and NL tokens.
>>> print_tokens('1 + \\\n2\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 + \\\n')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='1 + \\\n')
TokenInfo(type=2 (NUMBER), string='2', start=(2, 0), end=(2, 1), line='2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 1), end=(2, 2), line='2\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')
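The difference can be seen by computing the line number of the 2 token both ways. This is a small sketch; the variable names are chosen only for illustration.

```python
import io
import tokenize

source = b'1 + \\\n2\n'  # "1 + \" on line 1, "2" on line 2

newlines_seen = 0
for tok in tokenize.tokenize(io.BytesIO(source).readline):
    if tok.type == tokenize.NUMBER and tok.string == '2':
        line_from_start = tok.start[0]         # taken from the token itself
        line_from_counting = newlines_seen + 1  # naive NEWLINE/NL counting
    if tok.type in (tokenize.NEWLINE, tokenize.NL):
        newlines_seen += 1

print(line_from_start, line_from_counting)
```

The escaped newline produces no token, so the counting approach places 2 on line 1, while tok.start reports line 2.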
The NL token type exists only in the standard library Python implementation
of tokenize. The C implementation used by the interpreter only has
NEWLINE. In Python versions prior to 3.7, NL is only
importable from the tokenize module. In 3.7, it was added to the token module
as well.
ENCODING#
ENCODING is a special token type that represents the encoding of the input.
It is always the first token emitted by tokenize(), unless the detected
encoding is invalid, in which case tokenize() raises SyntaxError. The encoding
is detected from either a PEP 263 formatted comment in one of
the first two lines of the input (like # -*- coding: utf-8 -*-; such
comments are still tokenized as COMMENT as well) or a
Unicode BOM character.
The detected encoding is in the string attribute of the TokenInfo.
ENCODING is the only token type where tok.string does not appear literally
in the input. The default encoding is utf-8.
If you only want to detect the encoding and nothing else, use
detect_encoding(). If you
only need the encoding to pass to open(), use
tokenize.open().
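A minimal sketch of detect_encoding() on in-memory bytes is shown below. The helper name sniff_encoding is hypothetical; detect_encoding() itself takes a readline callable and returns the encoding together with the lines it had to read.

```python
import io
import tokenize


def sniff_encoding(data):
    """Return the encoding tokenize would report for the given bytes."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


default_encoding = sniff_encoding(b'x = 1\n')
declared_encoding = sniff_encoding(b'# -*- coding: ascii -*-\nx = 1\n')
print(default_encoding, declared_encoding)
```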
The start and end line and column numbers for ENCODING will always be
(0, 0).
>>> print_tokens("# The default encoding is utf-8\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (COMMENT), string='# The default encoding is utf-8', start=(1, 0), end=(1, 31), line='# The default encoding is utf-8\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 31), end=(1, 32), line='# The default encoding is utf-8\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens("# -*- coding: ascii -*-\n")
TokenInfo(type=63 (ENCODING), string='ascii', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (COMMENT), string='# -*- coding: ascii -*-', start=(1, 0), end=(1, 23), line='# -*- coding: ascii -*-\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 23), end=(1, 24), line='# -*- coding: ascii -*-\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
The ENCODING token is not typically used when processing tokens. However,
because it is always the first token, every other token in the stream is
guaranteed to have a preceding token, which can simplify code that looks at
the previous token (see the
examples).
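As a sketch of that pattern, the generator below pairs every token after ENCODING with its predecessor, so the loop body never has to handle a missing previous token. The name with_previous is hypothetical.

```python
import io
import tokenize


def with_previous(source):
    """Yield (previous_token, token) for every token after ENCODING."""
    tokens = tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline)
    previous = next(tokens)  # always the ENCODING token
    for tok in tokens:
        yield previous, tok
        previous = tok


pairs = list(with_previous('f()\n'))
for previous, tok in pairs:
    print(tokenize.tok_name[previous.type], '->', tokenize.tok_name[tok.type])
```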
Strictly speaking, the string of every token in a token stream should be
decodable by the encoding of the ENCODING token (e.g., if the encoding is
ascii, the tokens cannot include any non-ASCII characters).
The ENCODING token type exists only in the standard library Python
implementation of tokenize. The C implementation used by the interpreter
detects the encoding separately. In Python versions prior to 3.7, ENCODING
is only importable from the tokenize module. In 3.7, it was added to the token
module as well.
N_TOKENS#
The number of token types (not including
NT_OFFSET or itself).
In Python 3.5 and 3.6, token.N_TOKENS and tokenize.N_TOKENS are different,
because COMMENT, NL, and ENCODING are in
tokenize but not in token. In these versions, N_TOKENS is also not in
the tok_name dictionary.
The value of N_TOKENS varies between Python versions. Python 3.7 removed the
AWAIT and ASYNC tokens. Python 3.8 added the new tokens
COLONEQUAL, TYPE_IGNORE, and
TYPE_COMMENT, and re-added AWAIT and
ASYNC. In Python 3.10, SOFT_KEYWORD was added.
>>> # In Python 3.5 and 3.6
>>> tokenize.N_TOKENS
60
>>> # In Python 3.7
>>> tokenize.N_TOKENS
58
>>> # In Python 3.8 and 3.9
>>> tokenize.N_TOKENS
63
>>> # In Python 3.10 and 3.11
>>> tokenize.N_TOKENS
64