The Token Types#

Every token produced by the tokenizer has a type. These types are represented by integer constants. The actual integer value of the token constants is not important (except for N_TOKENS), and should never be used or relied on. Instead, refer to tokens by their variable names, and use the tok_name dictionary to get the name of a token type. The exact integer value could change between Python versions, for instance, if new tokens are added or removed (and indeed, in recent versions of Python, they have). In the examples below, the token numbers shown in the output are those from Python 3.10.

The reason the token types are represented this way is that the actual tokenizer used by the Python interpreter is not the tokenize module; it is a much more efficient, but equivalent, implementation written in C. C does not have an object system like Python's, so enumerated types are represented by integers (specifically, tokenizer.c has a large array of the token types, and the integer value of each token is its index in that array). The tokenize module is written in pure Python, but the token type values and names mirror those from the C tokenizer, with three exceptions: COMMENT, NL, and ENCODING.

All token types are defined in the token module, but the tokenize module does from token import *, so they can be imported from tokenize as well. Therefore, it is easiest to just import everything from tokenize. Furthermore, the aforementioned COMMENT, NL, and ENCODING tokens are not importable from token prior to Python 3.7, only from tokenize.
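
For example, the constants imported from either module are interchangeable:

>>> import token
>>> import tokenize
>>> tokenize.NAME == token.NAME
True
>>> token.tok_name[tokenize.NAME]
'NAME'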

The tok_name Dictionary#

The dictionary tok_name maps the tokens back to their names:

>>> import tokenize
>>> tokenize.STRING
3
>>> tokenize.tok_name[tokenize.STRING] # Can also use token.tok_name
'STRING'

The Tokens#

To simplify the sections below, the following utility function is used for all of the examples:

>>> import io
>>> def print_tokens(s):
...     for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
...         print(tok)

ENDMARKER#

This is always the last token emitted by tokenize(), unless it raises an exception. The string and line attributes are always ''. The start and end lines are always one more than the total number of lines in the input, and the start and end columns are always 0.

For most applications it is not necessary to explicitly worry about ENDMARKER, because tokenize() stops iteration after the last token is yielded.

>>> print_tokens('x + 1\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='x', start=(1, 0), end=(1, 1), line='x + 1\n')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='x + 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 4), end=(1, 5), line='x + 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='x + 1\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(1, 0), end=(1, 0), line='')

NAME#

The NAME token type is used for any Python identifier, as well as every keyword. Keywords are Python names that are reserved, that is, they cannot be assigned to, such as for, def, and True.

>>> print_tokens('a or α\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='a', start=(1, 0), end=(1, 1), line='a or α\n')
TokenInfo(type=1 (NAME), string='or', start=(1, 2), end=(1, 4), line='a or α\n')
TokenInfo(type=1 (NAME), string='α', start=(1, 5), end=(1, 6), line='a or α\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 6), end=(1, 7), line='a or α\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

To tell if a NAME token is a keyword, use keyword.iskeyword() on the string.

>>> import keyword
>>> keyword.iskeyword('or')
True

As a side note, internally, the tokenize module uses the str.isidentifier() method to test if a token should be a NAME token. The full rules for what makes a valid identifier are somewhat complicated, as they involve a large table of Unicode characters. One should always use the str.isidentifier() method to test if a string is a valid Python identifier, combined with a keyword.iskeyword() check. Testing if a string is an identifier using regular expressions is highly discouraged.

>>> 'α'.isidentifier()
True
>>> 'or'.isidentifier()
True
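
A small helper combining the two checks might look like this (the name is_valid_name is just for illustration):

>>> def is_valid_name(s):
...     """Return True if s can be used as a variable name."""
...     return s.isidentifier() and not keyword.iskeyword(s)
>>> is_valid_name('α')
True
>>> is_valid_name('or')
False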

NUMBER#

The NUMBER token type is used for any numeric literal, including (decimal) integer literals, binary, octal, and hexadecimal integer literals, floating point numbers (including scientific notation), and imaginary number literals (like 1j).

>>> print_tokens('10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='10', start=(1, 0), end=(1, 2), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='0b101', start=(1, 5), end=(1, 10), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 11), end=(1, 12), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='0o10', start=(1, 13), end=(1, 17), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 18), end=(1, 19), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='0xa', start=(1, 20), end=(1, 23), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='-', start=(1, 24), end=(1, 25), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='1.0', start=(1, 26), end=(1, 29), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 30), end=(1, 31), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='1e1', start=(1, 32), end=(1, 35), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 36), end=(1, 37), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=2 (NUMBER), string='1j', start=(1, 38), end=(1, 40), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 40), end=(1, 41), line='10 + 0b101 + 0o10 + 0xa - 1.0 + 1e1 + 1j\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

Note that even though an expression like 1+2j produces a single complex number, it is not a single literal; it tokenizes as NUMBER (1), OP (+), NUMBER (2j).

>>> print_tokens('1+2j\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1+2j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 1), end=(1, 2), line='1+2j\n')
TokenInfo(type=2 (NUMBER), string='2j', start=(1, 2), end=(1, 4), line='1+2j\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 4), end=(1, 5), line='1+2j\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

Invalid numeric literals may tokenize as multiple numeric literals.

>>> print_tokens('012\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='0', start=(1, 0), end=(1, 1), line='012\n')
TokenInfo(type=2 (NUMBER), string='12', start=(1, 1), end=(1, 3), line='012\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 3), end=(1, 4), line='012\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('0x1.0\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='0x1', start=(1, 0), end=(1, 3), line='0x1.0\n')
TokenInfo(type=2 (NUMBER), string='.0', start=(1, 3), end=(1, 5), line='0x1.0\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='0x1.0\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('0o184\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='0o1', start=(1, 0), end=(1, 3), line='0o184\n')
TokenInfo(type=2 (NUMBER), string='84', start=(1, 3), end=(1, 5), line='0o184\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 5), end=(1, 6), line='0o184\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

One advantage of using tokenize over ast is that floating point numbers are not rounded at the tokenization stage, so the full precision of the input is available.

>>> 1.0000000000000001
1.0
>>> print_tokens('1.0000000000000001\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1.0000000000000001', start=(1, 0), end=(1, 18), line='1.0000000000000001\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 18), end=(1, 19), line='1.0000000000000001\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> import ast
>>> ast.dump(ast.parse('1.0000000000000001')) 
'Module(body=[Expr(value=Constant(value=1.0))], type_ignores=[])'

This can be used, for instance, to wrap floating point numbers with a type that supports arbitrary precision, such as decimal.Decimal. See the examples.
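
For instance, here is a minimal sketch of that idea using tokenize.untokenize() (the name wrap_decimals is hypothetical, the generated source assumes Decimal has been imported at runtime, the float detection is deliberately simplistic, and untokenize()'s two-tuple mode will not preserve the original spacing):

>>> def wrap_decimals(s):
...     """Rewrite float literals as Decimal('...') calls (rough sketch)."""
...     result = []
...     for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
...         lowered = tok.string.lower()
...         # Simplistic float check: a NUMBER that isn't hex/octal/binary or
...         # imaginary, and that has a decimal point or an exponent.
...         is_float = (tok.type == tokenize.NUMBER
...                     and not lowered.startswith(('0x', '0o', '0b'))
...                     and not lowered.endswith('j')
...                     and ('.' in lowered or 'e' in lowered))
...         if is_float:
...             # Pass the literal through as a string so no precision is lost.
...             result.extend([(tokenize.NAME, 'Decimal'), (tokenize.OP, '('),
...                            (tokenize.STRING, repr(tok.string)), (tokenize.OP, ')')])
...         else:
...             result.append((tok.type, tok.string))
...     # untokenize() returns bytes when the stream includes an ENCODING token.
...     return tokenize.untokenize(result).decode('utf-8')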

In Python >=3.6, numeric literals can have underscore separators, like 123_456.

>>> # Python 3.6+ only.
>>> print_tokens('123_456\n') 
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='123_456', start=(1, 0), end=(1, 7), line='123_456\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 7), end=(1, 8), line='123_456\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

In Python 3.5, this will tokenize as two tokens, NUMBER (123) and NAME (_456) (and will not be syntactically valid in any context).

>>> # The behavior in Python 3.5
>>> print_tokens('123_456\n') 
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='123', start=(1, 0), end=(1, 3), line='123_456\n')
TokenInfo(type=1 (NAME), string='_456', start=(1, 3), end=(1, 7), line='123_456\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 7), end=(1, 8), line='123_456\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

In the examples we will see how to use tokenize to backport this feature to Python 3.5.

STRING#

The STRING token type matches any string literal, including single-quoted and double-quoted strings, triple-quoted strings of either kind (i.e., multi-line strings, or “docstrings”), raw strings, “unicode” strings, byte strings, and f-strings (Python 3.6+).

>>> print_tokens("""
... "I" + 'love' + '''tokenize'''
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=3 (STRING), string='"I"', start=(2, 0), end=(2, 3), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=54 (OP), string='+', start=(2, 4), end=(2, 5), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=3 (STRING), string="'love'", start=(2, 6), end=(2, 12), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=54 (OP), string='+', start=(2, 13), end=(2, 14), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=3 (STRING), string="'''tokenize'''", start=(2, 15), end=(2, 29), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 29), end=(2, 30), line='"I" + \'love\' + \'\'\'tokenize\'\'\'\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')

Note that even though Python implicitly concatenates string literals, tokenize tokenizes them separately.

>>> print_tokens('"this is" " fun"\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string='"this is"', start=(1, 0), end=(1, 9), line='"this is" " fun"\n')
TokenInfo(type=3 (STRING), string='" fun"', start=(1, 10), end=(1, 16), line='"this is" " fun"\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 16), end=(1, 17), line='"this is" " fun"\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

In the case of raw, “unicode”, bytes, and f-strings, the string prefix is included in the tokenized string.

>>> print_tokens(r"rb'\hello'" + '\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="rb'\\hello'", start=(1, 0), end=(1, 10), line="rb'\\hello'\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 10), end=(1, 11), line="rb'\\hello'\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

f-strings (Python 3.6+) are parsed as a single STRING token.

>>> # Python 3.6+ only.
>>> print_tokens('f"{a + b}"\n') 
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string='f"{a + b}"', start=(1, 0), end=(1, 10), line='f"{a + b}"\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 10), end=(1, 11), line='f"{a + b}"\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

The string.Formatter.parse() method can be used to parse format strings (including f-strings).

>>> a = 1
>>> b = 2
>>> # f-strings are Python 3.6+ only
>>> f'a + b is {a + b!r}' 
'a + b is 3'
>>> import string
>>> list(string.Formatter().parse('a + b is {a + b!r}'))
[('a + b is ', 'a + b', '', 'r')]

To get the string value from a tokenized string literal (i.e., to strip away the quote characters), use ast.literal_eval(). This is recommended over trying to strip the quotes manually, which is error prone, or using eval(), which can execute arbitrary code in the case of an f-string.

>>> ast.literal_eval("rb'a\\''")
b"a\\'"

Error Behavior#

If a single quoted string is unclosed, the opening string delimiter is tokenized as ERRORTOKEN, and the remainder is tokenized as if it were not in a string.

>>> print_tokens("'unclosed + string\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=60 (ERRORTOKEN), string="'", start=(1, 0), end=(1, 1), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='unclosed', start=(1, 1), end=(1, 9), line="'unclosed + string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 10), end=(1, 11), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='string', start=(1, 12), end=(1, 18), line="'unclosed + string\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 18), end=(1, 19), line="'unclosed + string\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

This behavior can be useful for handling error situations. For example, if you were to build a syntax highlighter using tokenize, you might not necessarily want an unclosed string to highlight the rest of the document as a string (such things are common in “live” editing environments).

However, if a triple-quoted string (i.e., a multi-line string, or “docstring”) is not closed, tokenize() raises TokenError when it reaches the end of the input.

>>> print_tokens("'an ' + '''unclosed multi-line string\n") 
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="'an '", start=(1, 0), end=(1, 5), line="'an ' + '''unclosed multi-line string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 6), end=(1, 7), line="'an ' + '''unclosed multi-line string\n")
Traceback (most recent call last):
    ...
    raise TokenError("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (1, 8))

This behavior can be annoying to deal with in practice. For many applications, the correct way to handle this scenario is to treat everything from the position where the TokenError is raised to the end of the input as being inside the unclosed multi-line string.
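
A minimal sketch of that approach (the function name is hypothetical; it collects what it can and reports where the unclosed string starts):

>>> def tokenize_with_recovery(s):
...     """Return (tokens, unclosed_start): unclosed_start is the (row, col) of
...     an unclosed multi-line string, or None if tokenization succeeded."""
...     tokens = []
...     try:
...         for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
...             tokens.append(tok)
...     except tokenize.TokenError as exc:
...         # Everything from exc.args[1] to the end of the input is inside
...         # the unclosed string.
...         return tokens, exc.args[1]
...     return tokens, None
>>> tokenize_with_recovery("'an ' + '''unclosed multi-line string\n")[1]
(1, 8)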

As a final note, beware that it is possible to construct string literals that tokenize without any errors, but raise SyntaxError when parsed by the interpreter.

>>> print_tokens(r"'\N{NOT REAL}'" + '\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="'\\N{NOT REAL}'", start=(1, 0), end=(1, 14), line="'\\N{NOT REAL}'\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 14), end=(1, 15), line="'\\N{NOT REAL}'\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> eval(r"'\N{NOT REAL}'")
Traceback (most recent call last):
  ...
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-11: unknown Unicode character name

To test if a string literal is valid, you can use the ast.literal_eval() function, which is safe to use on untrusted input.

>>> ast.literal_eval(r"'\N{NOT REAL}'")
Traceback (most recent call last):
  ...
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-11: unknown Unicode character name

NEWLINE#

The NEWLINE token type represents a newline character (\n or \r\n) that ends a logical line of Python code. Newlines that do not end a logical line of Python code use NL.

>>> print_tokens("""\
... def hello():
...     return 'hello world'
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def hello():\n')
TokenInfo(type=1 (NAME), string='hello', start=(1, 4), end=(1, 9), line='def hello():\n')
TokenInfo(type=54 (OP), string='(', start=(1, 9), end=(1, 10), line='def hello():\n')
TokenInfo(type=54 (OP), string=')', start=(1, 10), end=(1, 11), line='def hello():\n')
TokenInfo(type=54 (OP), string=':', start=(1, 11), end=(1, 12), line='def hello():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 12), end=(1, 13), line='def hello():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line="    return 'hello world'\n")
TokenInfo(type=1 (NAME), string='return', start=(2, 4), end=(2, 10), line="    return 'hello world'\n")
TokenInfo(type=3 (STRING), string="'hello world'", start=(2, 11), end=(2, 24), line="    return 'hello world'\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 24), end=(2, 25), line="    return 'hello world'\n")
TokenInfo(type=6 (DEDENT), string='', start=(3, 0), end=(3, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')

Windows-style newlines (\r\n) are tokenized as a single token.

>>> print_tokens("1\n2\r\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='1\n')
TokenInfo(type=2 (NUMBER), string='2', start=(2, 0), end=(2, 1), line='2\r\n')
TokenInfo(type=4 (NEWLINE), string='\r\n', start=(2, 1), end=(2, 3), line='2\r\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')

Starting with the Python 3.6.7 and 3.7.1 patch releases, NEWLINE is emitted at the end of all tokenize input even if it doesn’t end in a newline (with the exception of lines that are just comments). This change was made for consistency with the internal C tokenizer used by Python itself. Implicit NEWLINE tokens have the line attribute set to ''. This change was not made to Python 3.5, which was already in security-fix-only mode. The examples in this document all use the 3.6.7+ behavior. If consistency is desired, one can always force the input to end in a newline (this is why every example in this document has a newline at the end).

INDENT#

DEDENT#

The INDENT token type represents the indentation for indented blocks. The indentation itself (the text from the beginning of the line to the first nonwhitespace character) is in the string attribute. INDENT is emitted once per block of indented text, not once per line.

The DEDENT token type represents a dedentation. Every INDENT token is matched by a corresponding DEDENT token. The string attribute of DEDENT is always ''. The start and end positions of a DEDENT token are the first position in the line after the indentation (even if there are multiple consecutive DEDENTs).

Consider the following pseudo-example:

>>> print_tokens("""
... 1
...     2
...     3
...         4
... 5
...
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 0), end=(2, 1), line='1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 1), end=(2, 2), line='1\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    2\n')
TokenInfo(type=2 (NUMBER), string='2', start=(3, 4), end=(3, 5), line='    2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 5), end=(3, 6), line='    2\n')
TokenInfo(type=2 (NUMBER), string='3', start=(4, 4), end=(4, 5), line='    3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 5), end=(4, 6), line='    3\n')
TokenInfo(type=5 (INDENT), string='        ', start=(5, 0), end=(5, 8), line='        4\n')
TokenInfo(type=2 (NUMBER), string='4', start=(5, 8), end=(5, 9), line='        4\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 9), end=(5, 10), line='        4\n')
TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line='5\n')
TokenInfo(type=6 (DEDENT), string='', start=(6, 0), end=(6, 0), line='5\n')
TokenInfo(type=2 (NUMBER), string='5', start=(6, 0), end=(6, 1), line='5\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(6, 1), end=(6, 2), line='5\n')
TokenInfo(type=62 (NL), string='\n', start=(7, 0), end=(7, 1), line='\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(8, 0), end=(8, 0), line='')

There is one INDENT before the 2-3 block, one INDENT before the 4, and two DEDENTs before the 5.

INDENT is not used for indentations on line continuations.

>>> print_tokens("""
... (1 +
...     2)
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=54 (OP), string='(', start=(2, 0), end=(2, 1), line='(1 +\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 1), end=(2, 2), line='(1 +\n')
TokenInfo(type=54 (OP), string='+', start=(2, 3), end=(2, 4), line='(1 +\n')
TokenInfo(type=62 (NL), string='\n', start=(2, 4), end=(2, 5), line='(1 +\n')
TokenInfo(type=2 (NUMBER), string='2', start=(3, 4), end=(3, 5), line='    2)\n')
TokenInfo(type=54 (OP), string=')', start=(3, 5), end=(3, 6), line='    2)\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 6), end=(3, 7), line='    2)\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

Indentation can be any number of spaces or tabs. The only restriction is that each dedent must return to an indentation level that matches a previous outer indentation level. If an unindent does not match an outer indentation level, tokenize() raises IndentationError.

>>> print_tokens("""
... def countdown(x):
... \tassert x>=0
... \twhile x:
... \t\tprint(x)
... \t\tx -= 1
... \tprint('Go!')
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 0), end=(2, 3), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='countdown', start=(2, 4), end=(2, 13), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string='(', start=(2, 13), end=(2, 14), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 14), end=(2, 15), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=')', start=(2, 15), end=(2, 16), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=':', start=(2, 16), end=(2, 17), line='def countdown(x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='def countdown(x):\n')
TokenInfo(type=5 (INDENT), string='\t', start=(3, 0), end=(3, 1), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='assert', start=(3, 1), end=(3, 7), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='x', start=(3, 8), end=(3, 9), line='\tassert x>=0\n')
TokenInfo(type=54 (OP), string='>=', start=(3, 9), end=(3, 11), line='\tassert x>=0\n')
TokenInfo(type=2 (NUMBER), string='0', start=(3, 11), end=(3, 12), line='\tassert x>=0\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 12), end=(3, 13), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='while', start=(4, 1), end=(4, 6), line='\twhile x:\n')
TokenInfo(type=1 (NAME), string='x', start=(4, 7), end=(4, 8), line='\twhile x:\n')
TokenInfo(type=54 (OP), string=':', start=(4, 8), end=(4, 9), line='\twhile x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 9), end=(4, 10), line='\twhile x:\n')
TokenInfo(type=5 (INDENT), string='\t\t', start=(5, 0), end=(5, 2), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='print', start=(5, 2), end=(5, 7), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string='(', start=(5, 7), end=(5, 8), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(5, 8), end=(5, 9), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string=')', start=(5, 9), end=(5, 10), line='\t\tprint(x)\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 10), end=(5, 11), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(6, 2), end=(6, 3), line='\t\tx -= 1\n')
TokenInfo(type=54 (OP), string='-=', start=(6, 4), end=(6, 6), line='\t\tx -= 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(6, 7), end=(6, 8), line='\t\tx -= 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(6, 8), end=(6, 9), line='\t\tx -= 1\n')
TokenInfo(type=6 (DEDENT), string='', start=(7, 1), end=(7, 1), line="\tprint('Go!')\n")
TokenInfo(type=1 (NAME), string='print', start=(7, 1), end=(7, 6), line="\tprint('Go!')\n")
TokenInfo(type=54 (OP), string='(', start=(7, 6), end=(7, 7), line="\tprint('Go!')\n")
TokenInfo(type=3 (STRING), string="'Go!'", start=(7, 7), end=(7, 12), line="\tprint('Go!')\n")
TokenInfo(type=54 (OP), string=')', start=(7, 12), end=(7, 13), line="\tprint('Go!')\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(7, 13), end=(7, 14), line="\tprint('Go!')\n")
TokenInfo(type=6 (DEDENT), string='', start=(8, 0), end=(8, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(8, 0), end=(8, 0), line='')
>>> print_tokens("""
... def countdown(x):
... \tassert x>=0
... \twhile x:
... \t\tprint(x)
... \t\tx -= 1
...     print('Go!')
... """) 
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 0), end=(2, 3), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='countdown', start=(2, 4), end=(2, 13), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string='(', start=(2, 13), end=(2, 14), line='def countdown(x):\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 14), end=(2, 15), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=')', start=(2, 15), end=(2, 16), line='def countdown(x):\n')
TokenInfo(type=54 (OP), string=':', start=(2, 16), end=(2, 17), line='def countdown(x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='def countdown(x):\n')
TokenInfo(type=5 (INDENT), string='\t', start=(3, 0), end=(3, 1), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='assert', start=(3, 1), end=(3, 7), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='x', start=(3, 8), end=(3, 9), line='\tassert x>=0\n')
TokenInfo(type=54 (OP), string='>=', start=(3, 9), end=(3, 11), line='\tassert x>=0\n')
TokenInfo(type=2 (NUMBER), string='0', start=(3, 11), end=(3, 12), line='\tassert x>=0\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 12), end=(3, 13), line='\tassert x>=0\n')
TokenInfo(type=1 (NAME), string='while', start=(4, 1), end=(4, 6), line='\twhile x:\n')
TokenInfo(type=1 (NAME), string='x', start=(4, 7), end=(4, 8), line='\twhile x:\n')
TokenInfo(type=54 (OP), string=':', start=(4, 8), end=(4, 9), line='\twhile x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 9), end=(4, 10), line='\twhile x:\n')
TokenInfo(type=5 (INDENT), string='\t\t', start=(5, 0), end=(5, 2), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='print', start=(5, 2), end=(5, 7), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string='(', start=(5, 7), end=(5, 8), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(5, 8), end=(5, 9), line='\t\tprint(x)\n')
TokenInfo(type=54 (OP), string=')', start=(5, 9), end=(5, 10), line='\t\tprint(x)\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 10), end=(5, 11), line='\t\tprint(x)\n')
TokenInfo(type=1 (NAME), string='x', start=(6, 2), end=(6, 3), line='\t\tx -= 1\n')
TokenInfo(type=54 (OP), string='-=', start=(6, 4), end=(6, 6), line='\t\tx -= 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(6, 7), end=(6, 8), line='\t\tx -= 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(6, 8), end=(6, 9), line='\t\tx -= 1\n')
Traceback (most recent call last):
  ...
    print('Go!')
    ^
IndentationError: unindent does not match any outer indentation level

The level of indentation at a particular point in the token stream can be determined by incrementing and decrementing a counter for each INDENT and DEDENT token, or, if the exact indentation spacing is required, by maintaining a stack of the INDENT strings (recall that Python allows different indentation levels to use a different number of spaces). See the examples, and the sketch below.
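
Here is a minimal sketch of the counter approach (the helper name is made up for illustration):

>>> def tokens_with_levels(s):
...     """Yield (indentation_level, token) pairs by counting INDENT/DEDENT."""
...     level = 0
...     for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
...         if tok.type == tokenize.INDENT:
...             level += 1
...         elif tok.type == tokenize.DEDENT:
...             level -= 1
...         yield level, tok
>>> [(level, tok.string) for level, tok in tokens_with_levels('if x:\n    pass\n')
...  if tok.type == tokenize.NAME]
[(0, 'if'), (0, 'x'), (1, 'pass')]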

RARROW#

ELLIPSIS#

The RARROW (->) and ELLIPSIS (...) tokens tokenize as OP, like all other operators. However, due to a bug in Python versions prior to 3.7, the exact_type attribute of these tokens is OP instead of the correct exact type.

>>> # Python 3.5 and 3.6 behavior
>>> for tok in tokenize.tokenize(io.BytesIO(b'def func() -> list: ...\n').readline):
...     print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string)) 
ENCODING ENCODING 'utf-8'
NAME NAME 'def'
NAME NAME 'func'
OP LPAR '('
OP RPAR ')'
OP OP '->'
NAME NAME 'list'
OP COLON ':'
OP OP '...'
NEWLINE NEWLINE '\n'
ENDMARKER ENDMARKER ''

This bug has been fixed in Python 3.7.

>>> # Python 3.7+ behavior
>>> for tok in tokenize.tokenize(io.BytesIO(b'def func() -> list: ...\n').readline):
...     print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string)) 
ENCODING ENCODING 'utf-8'
NAME NAME 'def'
NAME NAME 'func'
OP LPAR '('
OP RPAR ')'
OP RARROW '->'
NAME NAME 'list'
OP COLON ':'
OP ELLIPSIS '...'
NEWLINE NEWLINE '\n'
ENDMARKER ENDMARKER ''

OP#

OP is a generic token type for all operators, delimiters, and the ellipsis literal. It does not include characters and operators that are not recognized by the parser (those are tokenized as ERRORTOKEN).

When using tokenize(), the token type for an operator, delimiter, or ellipsis literal token will be OP. To get the exact token type, use the exact_type property of the namedtuple. tok.exact_type is equivalent to tok.type for the remaining token types (with two exceptions, see the notes below).

>>> import io
>>> for tok in tokenize.tokenize(io.BytesIO(b'[1+2]\n').readline):
...     print(tokenize.tok_name[tok.type], repr(tok.string))
ENCODING 'utf-8'
OP '['
NUMBER '1'
OP '+'
NUMBER '2'
OP ']'
NEWLINE '\n'
ENDMARKER ''
>>> for tok in tokenize.tokenize(io.BytesIO(b'[1+2]\n').readline):
...     print(tokenize.tok_name[tok.exact_type], repr(tok.string))
ENCODING 'utf-8'
LSQB '['
NUMBER '1'
PLUS '+'
NUMBER '2'
RSQB ']'
NEWLINE '\n'
ENDMARKER ''

The following table lists all exact OP types and their corresponding characters.

Exact token type    String value
----------------    ------------
LPAR                (
RPAR                )
LSQB                [
RSQB                ]
COLON               :
COMMA               ,
SEMI                ;
PLUS                +
MINUS               -
STAR                *
SLASH               /
VBAR                |
AMPER               &
LESS                <
GREATER             >
EQUAL               =
DOT                 .
PERCENT             %
LBRACE              {
RBRACE              }
EQEQUAL             ==
NOTEQUAL            !=
LESSEQUAL           <=
GREATEREQUAL        >=
TILDE               ~
CIRCUMFLEX          ^
LEFTSHIFT           <<
RIGHTSHIFT          >>
DOUBLESTAR          **
PLUSEQUAL           +=
MINEQUAL            -=
STAREQUAL           *=
SLASHEQUAL          /=
PERCENTEQUAL        %=
AMPEREQUAL          &=
VBAREQUAL           |=
CIRCUMFLEXEQUAL     ^=
LEFTSHIFTEQUAL      <<=
RIGHTSHIFTEQUAL     >>=
DOUBLESTAREQUAL     **=
DOUBLESLASH         //
DOUBLESLASHEQUAL    //=
AT                  @
ATEQUAL             @=
RARROW [1]          ->
ELLIPSIS [1]        ...
COLONEQUAL [2]      :=

AWAIT#

ASYNC#

The AWAIT and ASYNC token types are used to tokenize the await and async keywords in Python 3.5 and 3.6. They are not used in Python 3.7+ (see below).

In Python 3.5 and 3.6, await and async are pseudo-keywords. To ease the transition to the new keywords, await and async were kept as valid variable names outside of async def blocks.

>>> # This is valid Python in Python 3.5 and 3.6. It isn't in Python 3.7+.
>>> async = 1 
>>> print_tokens("async = 1\n") 
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='async', start=(1, 0), end=(1, 5), line='async = 1\n')
TokenInfo(type=53 (OP), string='=', start=(1, 6), end=(1, 7), line='async = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 8), end=(1, 9), line='async = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='async = 1\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

To support this, in Python 3.5 and 3.6 when async def is encountered, tokenize keeps track of its indentation level, and all await and async tokens that are nested under it are tokenized as AWAIT and ASYNC, respectively (including the async from the async def). Otherwise, await and async are tokenized as NAME, as in the example above.

>>> # This is the behavior in Python 3.5 and 3.6
>>> print_tokens("""
... async def coro():
...     async with lock:
...         await f()
... await = 1
... """) 
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=58 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=55 (ASYNC), string='async', start=(2, 0), end=(2, 5), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 6), end=(2, 9), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='coro', start=(2, 10), end=(2, 14), line='async def coro():\n')
TokenInfo(type=53 (OP), string='(', start=(2, 14), end=(2, 15), line='async def coro():\n')
TokenInfo(type=53 (OP), string=')', start=(2, 15), end=(2, 16), line='async def coro():\n')
TokenInfo(type=53 (OP), string=':', start=(2, 16), end=(2, 17), line='async def coro():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='async def coro():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    async with lock:\n')
TokenInfo(type=55 (ASYNC), string='async', start=(3, 4), end=(3, 9), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='with', start=(3, 10), end=(3, 14), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='lock', start=(3, 15), end=(3, 19), line='    async with lock:\n')
TokenInfo(type=53 (OP), string=':', start=(3, 19), end=(3, 20), line='    async with lock:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 20), end=(3, 21), line='    async with lock:\n')
TokenInfo(type=5 (INDENT), string='        ', start=(4, 0), end=(4, 8), line='        await f()\n')
TokenInfo(type=54 (AWAIT), string='await', start=(4, 8), end=(4, 13), line='        await f()\n')
TokenInfo(type=1 (NAME), string='f', start=(4, 14), end=(4, 15), line='        await f()\n')
TokenInfo(type=53 (OP), string='(', start=(4, 15), end=(4, 16), line='        await f()\n')
TokenInfo(type=53 (OP), string=')', start=(4, 16), end=(4, 17), line='        await f()\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 17), end=(4, 18), line='        await f()\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='await = 1\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='await = 1\n')
TokenInfo(type=1 (NAME), string='await', start=(5, 0), end=(5, 5), line='await = 1\n')
TokenInfo(type=53 (OP), string='=', start=(5, 6), end=(5, 7), line='await = 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(5, 8), end=(5, 9), line='await = 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 9), end=(5, 10), line='await = 1\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(6, 0), end=(6, 0), line='')

In Python 3.7, async and await are proper keywords and are tokenized as NAME, like all other keywords, and the AWAIT and ASYNC token types were removed from the token module.

>>> # This is the behavior in Python 3.7+
>>> print_tokens("""
... async def coro():
...     async with lock:
...         await f()
... """) 
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='async', start=(2, 0), end=(2, 5), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 6), end=(2, 9), line='async def coro():\n')
TokenInfo(type=1 (NAME), string='coro', start=(2, 10), end=(2, 14), line='async def coro():\n')
TokenInfo(type=54 (OP), string='(', start=(2, 14), end=(2, 15), line='async def coro():\n')
TokenInfo(type=54 (OP), string=')', start=(2, 15), end=(2, 16), line='async def coro():\n')
TokenInfo(type=54 (OP), string=':', start=(2, 16), end=(2, 17), line='async def coro():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 17), end=(2, 18), line='async def coro():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='async', start=(3, 4), end=(3, 9), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='with', start=(3, 10), end=(3, 14), line='    async with lock:\n')
TokenInfo(type=1 (NAME), string='lock', start=(3, 15), end=(3, 19), line='    async with lock:\n')
TokenInfo(type=54 (OP), string=':', start=(3, 19), end=(3, 20), line='    async with lock:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 20), end=(3, 21), line='    async with lock:\n')
TokenInfo(type=5 (INDENT), string='        ', start=(4, 0), end=(4, 8), line='        await f()\n')
TokenInfo(type=1 (NAME), string='await', start=(4, 8), end=(4, 13), line='        await f()\n')
TokenInfo(type=1 (NAME), string='f', start=(4, 14), end=(4, 15), line='        await f()\n')
TokenInfo(type=54 (OP), string='(', start=(4, 15), end=(4, 16), line='        await f()\n')
TokenInfo(type=54 (OP), string=')', start=(4, 16), end=(4, 17), line='        await f()\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 17), end=(4, 18), line='        await f()\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')

In Python 3.8, the ASYNC and AWAIT tokens have been re-added to the token module, but they are not emitted by tokenize() (the behavior is the same as in 3.7). They exist only to facilitate the new feature_version flag to ast.parse(), which allows parsing Python source as older versions would.

TYPE_IGNORE#

TYPE_COMMENT#

TYPE_IGNORE and TYPE_COMMENT are included here for completeness, since they are in the tok_name dictionary. They are used in the C tokenizer to tokenize type comments, but the pure Python tokenizer does not yet tokenize them. This will presumably change in the future, as the Python tokenizer is generally made to have the same behavior as the C tokenizer. They were added in Python 3.8.

>>> print_tokens('# type: ignore\n') 
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (COMMENT), string='# type: ignore', start=(1, 0), end=(1, 14), line='# type: ignore\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 14), end=(1, 15), line='# type: ignore\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

SOFT_KEYWORD#

SOFT_KEYWORD is also included here for completeness. It is in the tok_name dictionary, but it is not actually used anywhere in the tokenize module, and cannot be emitted as a token from tokenize(). It is used by the C tokenizer as part of the PEG parser to handle “soft keywords” like match and case, which are only keywords when used in certain contexts. It was added in Python 3.10.

>>> # Python 3.10+
>>> print_tokens('match = 1') 
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='match', start=(1, 0), end=(1, 5), line='match = 1')
TokenInfo(type=54 (OP), string='=', start=(1, 6), end=(1, 7), line='match = 1')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 8), end=(1, 9), line='match = 1')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 9), end=(1, 10), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens("""\
... match x:
...    case 1:
...        pass
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='match', start=(1, 0), end=(1, 5), line='match x:\n')
TokenInfo(type=1 (NAME), string='x', start=(1, 6), end=(1, 7), line='match x:\n')
TokenInfo(type=54 (OP), string=':', start=(1, 7), end=(1, 8), line='match x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 8), end=(1, 9), line='match x:\n')
TokenInfo(type=5 (INDENT), string='   ', start=(2, 0), end=(2, 3), line='   case 1:\n')
TokenInfo(type=1 (NAME), string='case', start=(2, 3), end=(2, 7), line='   case 1:\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 8), end=(2, 9), line='   case 1:\n')
TokenInfo(type=54 (OP), string=':', start=(2, 9), end=(2, 10), line='   case 1:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 10), end=(2, 11), line='   case 1:\n')
TokenInfo(type=5 (INDENT), string='       ', start=(3, 0), end=(3, 7), line='       pass\n')
TokenInfo(type=1 (NAME), string='pass', start=(3, 7), end=(3, 11), line='       pass\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 11), end=(3, 12), line='       pass\n')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

ERRORTOKEN#

The ERRORTOKEN type is used for any character that isn’t otherwise recognized. Input that produces ERRORTOKENs cannot be valid Python, but this token type exists so that applications that process tokens can do error recovery, since the remainder of the input stream is tokenized normally. It can also be used to process extensions to Python syntax (see the examples). Every unrecognized character is tokenized separately.

>>> print_tokens("1!!\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1!!\n')
TokenInfo(type=60 (ERRORTOKEN), string='!', start=(1, 1), end=(1, 2), line='1!!\n')
TokenInfo(type=60 (ERRORTOKEN), string='!', start=(1, 2), end=(1, 3), line='1!!\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 3), end=(1, 4), line='1!!\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens('💯\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=60 (ERRORTOKEN), string='💯', start=(1, 0), end=(1, 1), line='💯\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='💯\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

ERRORTOKEN is also used for single-quoted strings that are not closed before a newline. See the STRING section for more information.

>>> print_tokens("'unclosed + string\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=60 (ERRORTOKEN), string="'", start=(1, 0), end=(1, 1), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='unclosed', start=(1, 1), end=(1, 9), line="'unclosed + string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 10), end=(1, 11), line="'unclosed + string\n")
TokenInfo(type=1 (NAME), string='string', start=(1, 12), end=(1, 18), line="'unclosed + string\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 18), end=(1, 19), line="'unclosed + string\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

If the string is continued (with a backslash before the newline) and unclosed, the entire string is tokenized as a single ERRORTOKEN. Otherwise, only the opening quote delimiter is.

>>> print_tokens(r"""
... 'unclosed \
... continued string
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=60 (ERRORTOKEN), string="'unclosed \\\ncontinued string\n", start=(2, 0), end=(3, 17), line="'unclosed \\\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

In the case of unclosed single-quoted strings that are not continued, the spaces before the string are also tokenized as ERRORTOKEN:

>>> print_tokens("'an' +  'unclosed string\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=3 (STRING), string="'an'", start=(1, 0), end=(1, 4), line="'an' +  'unclosed string\n")
TokenInfo(type=54 (OP), string='+', start=(1, 5), end=(1, 6), line="'an' +  'unclosed string\n")
TokenInfo(type=60 (ERRORTOKEN), string=' ', start=(1, 6), end=(1, 7), line="'an' +  'unclosed string\n")
TokenInfo(type=60 (ERRORTOKEN), string=' ', start=(1, 7), end=(1, 8), line="'an' +  'unclosed string\n")
TokenInfo(type=60 (ERRORTOKEN), string="'", start=(1, 8), end=(1, 9), line="'an' +  'unclosed string\n")
TokenInfo(type=1 (NAME), string='unclosed', start=(1, 9), end=(1, 17), line="'an' +  'unclosed string\n")
TokenInfo(type=1 (NAME), string='string', start=(1, 18), end=(1, 24), line="'an' +  'unclosed string\n")
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 24), end=(1, 25), line="'an' +  'unclosed string\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

This doesn’t apply to unclosed continued strings:

>>> print_tokens(r"""
... 'an' +  'unclosed\
... continued string
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=3 (STRING), string="'an'", start=(2, 0), end=(2, 4), line="'an' +  'unclosed\\\n")
TokenInfo(type=54 (OP), string='+', start=(2, 5), end=(2, 6), line="'an' +  'unclosed\\\n")
TokenInfo(type=60 (ERRORTOKEN), string="'unclosed\\\ncontinued string\n", start=(2, 8), end=(3, 17), line="'an' +  'unclosed\\\n")
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

Therefore, code that handles ERRORTOKEN specifically for unclosed strings should check tok.string[0] in '"\''.
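
For example (the helper name is just for illustration; note how it filters out the space ERRORTOKENs from the previous example):

>>> def is_unclosed_string_start(tok):
...     """Return True if this ERRORTOKEN marks the start of an unclosed string."""
...     return tok.type == tokenize.ERRORTOKEN and tok.string[0] in '"\''
>>> toks = tokenize.tokenize(io.BytesIO(b"'an' +  'unclosed string\n").readline)
>>> [tok.string for tok in toks if is_unclosed_string_start(tok)]
["'"]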

COMMENT#

The COMMENT token type represents a comment. A comment never spans more than one line; consecutive comment lines each produce a separate COMMENT token.

>>> print_tokens("""
... # This is a comment
... # This is another comment
... f() # This is a third comment
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=61 (COMMENT), string='# This is a comment', start=(2, 0), end=(2, 19), line='# This is a comment\n')
TokenInfo(type=62 (NL), string='\n', start=(2, 19), end=(2, 20), line='# This is a comment\n')
TokenInfo(type=61 (COMMENT), string='# This is another comment', start=(3, 0), end=(3, 25), line='# This is another comment\n')
TokenInfo(type=62 (NL), string='\n', start=(3, 25), end=(3, 26), line='# This is another comment\n')
TokenInfo(type=1 (NAME), string='f', start=(4, 0), end=(4, 1), line='f() # This is a third comment\n')
TokenInfo(type=54 (OP), string='(', start=(4, 1), end=(4, 2), line='f() # This is a third comment\n')
TokenInfo(type=54 (OP), string=')', start=(4, 2), end=(4, 3), line='f() # This is a third comment\n')
TokenInfo(type=61 (COMMENT), string='# This is a third comment', start=(4, 4), end=(4, 29), line='f() # This is a third comment\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 29), end=(4, 30), line='f() # This is a third comment\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')

The COMMENT token type exists only in the standard library Python implementation of tokenize. The C implementation used by the interpreter ignores comments. In Python versions prior to 3.7, COMMENT is only importable from the tokenize module. In 3.7, it is added to the token module as well.

NL#

The NL token type represents newline characters (\n or \r\n) that do not end logical lines of code. Newlines that do end logical lines of Python code are tokenized using the NEWLINE token type.

There are two situations where newlines are tokenized as NL:

  1. Newlines that end lines that are continued inside unclosed braces (implicit line continuations).

    >>> print_tokens("""(1 +
    ... 2)
    ... """)
    TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
    TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='(1 +\n')
    TokenInfo(type=2 (NUMBER), string='1', start=(1, 1), end=(1, 2), line='(1 +\n')
    TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='(1 +\n')
    TokenInfo(type=62 (NL), string='\n', start=(1, 4), end=(1, 5), line='(1 +\n')
    TokenInfo(type=2 (NUMBER), string='2', start=(2, 0), end=(2, 1), line='2)\n')
    TokenInfo(type=54 (OP), string=')', start=(2, 1), end=(2, 2), line='2)\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 2), end=(2, 3), line='2)\n')
    TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')
    
  2. Newlines that end empty lines or lines that only have comments.

    >>> print_tokens("""
    ... # Comment line
    ...
    ... """)
    TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
    TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
    TokenInfo(type=61 (COMMENT), string='# Comment line', start=(2, 0), end=(2, 14), line='# Comment line\n')
    TokenInfo(type=62 (NL), string='\n', start=(2, 14), end=(2, 15), line='# Comment line\n')
    TokenInfo(type=62 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n')
    TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')
    

Note that newlines that are escaped (preceded with \) are treated like whitespace, that is, they do not tokenize at all. Consequently, you should always use the line numbers in the start and end attributes of the TokenInfo namedtuple. Never try to determine line numbers by counting NEWLINE and NL tokens.

>>> print_tokens('1 + \\\n2\n')
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 + \\\n')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='1 + \\\n')
TokenInfo(type=2 (NUMBER), string='2', start=(2, 0), end=(2, 1), line='2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 1), end=(2, 2), line='2\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')

The NL token type exists only in the standard library Python implementation of tokenize. The C implementation used by the interpreter only has NEWLINE. In Python versions prior to 3.7, NL is only importable from the tokenize module. In 3.7, it is added to the token module as well.

ENCODING#

ENCODING is a special token type that represents the encoding of the input. It is always the first token emitted by tokenize(), unless the detected encoding is invalid, in which case tokenize() raises SyntaxError. The encoding is detected from either a PEP 263 formatted comment in one of the first two lines of the input (like # -*- coding: utf-8 -*-; such comments are still tokenized as COMMENT tokens as well), or a Unicode BOM character.

The detected encoding is in the string attribute of the TokenInfo. ENCODING is the only token type where tok.string does not appear literally in the input. The default encoding is utf-8.

If you only want to detect the encoding and nothing else, use detect_encoding(). If you only need the encoding to pass to open(), use tokenize.open().
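
For example, detect_encoding() returns the detected encoding along with any lines it read from the input in order to detect it:

>>> encoding, lines = tokenize.detect_encoding(io.BytesIO(b'# -*- coding: ascii -*-\n').readline)
>>> encoding
'ascii'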

The start and end line and column numbers for ENCODING will always be (0, 0).

>>> print_tokens("# The default encoding is utf-8\n")
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (COMMENT), string='# The default encoding is utf-8', start=(1, 0), end=(1, 31), line='# The default encoding is utf-8\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 31), end=(1, 32), line='# The default encoding is utf-8\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
>>> print_tokens("# -*- coding: ascii -*-\n")
TokenInfo(type=63 (ENCODING), string='ascii', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (COMMENT), string='# -*- coding: ascii -*-', start=(1, 0), end=(1, 23), line='# -*- coding: ascii -*-\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 23), end=(1, 24), line='# -*- coding: ascii -*-\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

The ENCODING token is not typically used when processing tokens, but its guaranteed presence as the first token can be useful, as it ensures that every other token has at least one token before it in the stream (see the examples).

Strictly speaking, the string of every token in a token stream should be decodable by the encoding of the ENCODING token (e.g., if the encoding is ascii, the tokens cannot include any non-ASCII characters).

The ENCODING token type exists only in the standard library Python implementation of tokenize. The C implementation used by the interpreter detects the encoding separately. In Python versions prior to 3.7, ENCODING is only importable from the tokenize module. In 3.7, it is added to the token module as well.

N_TOKENS#

The number of token types (not including NT_OFFSET or itself).

In Python 3.5 and 3.6, token.N_TOKENS and tokenize.N_TOKENS are different, because COMMENT, NL, and ENCODING are in tokenize but not in token. In these versions, N_TOKENS is also not in the tok_name dictionary.

The value of N_TOKENS varies between Python versions. Python 3.7 removed the AWAIT and ASYNC tokens. Python 3.8 added the new tokens COLONEQUAL, TYPE_IGNORE, and TYPE_COMMENT, and re-added AWAIT and ASYNC. In Python 3.10, SOFT_KEYWORD was added.

>>> # In Python 3.5 and 3.6
>>> tokenize.N_TOKENS 
60
>>> # In Python 3.7
>>> tokenize.N_TOKENS 
58
>>> # In Python 3.8 and 3.9
>>> tokenize.N_TOKENS 
63
>>> # In Python 3.10 and 3.11
>>> tokenize.N_TOKENS 
64