Usage
=====

The `tokenize` module has several quirks which make it complicated to work
with (in my opinion, more complicated than necessary, but it is what it is).
The primary motivation of this guide is to document these quirks and
behaviors, as such a document would have been very helpful to me when I first
started using the module. Most of these behaviors were learned from
experimentation and from reading the
[source code](https://github.com/python/cpython/blob/master/Lib/tokenize.py).
I do not know which of these behaviors can be considered API guarantees; the
CPython developers may decide to change them in future Python versions. With
that being said, the CPython developers are generally very conservative about
changes to the standard library that might break downstream code, even for
major releases. I will try to keep this guide updated as new Python versions
are released.
[Issue reports](https://github.com/asmeurer/brown-water-python/issues) and
[pull requests](https://github.com/asmeurer/brown-water-python/pulls) are
most welcome.

(calling-syntax)=
## Calling Syntax

The first thing you'll notice when using `tokenize()` is that its calling API
is rather odd. It does not accept a string. It does not accept a file-like
object either. Rather, it requires **the `readline` method of a bytes-mode
file-like object**. The bytes-mode part is important. If a file is opened in
text mode (`'r'` instead of `'br'`), `tokenize()` will fail with an
exception:

```py
>>> import tokenize
>>> with open('example.py') as f: # Incorrect, the default mode is 'r', not 'br'
...     for tok in tokenize.tokenize(f.readline):
...         print(tok)
Traceback (most recent call last):
...
TypeError: startswith first arg must be str or a tuple of str, not bytes
>>> with open('example.py', 'br') as f:
...     for tok in tokenize.tokenize(f.readline):
...         print(tok)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (COMMENT), string='# This is a an example file to be tokenized', start=(1, 0), end=(1, 43), line='# This is a an example file to be tokenized\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 43), end=(1, 44), line='# This is a an example file to be tokenized\n')
TokenInfo(type=62 (NL), string='\n', start=(2, 0), end=(2, 1), line='\n')
TokenInfo(type=1 (NAME), string='def', start=(3, 0), end=(3, 3), line='def two():\n')
TokenInfo(type=1 (NAME), string='two', start=(3, 4), end=(3, 7), line='def two():\n')
TokenInfo(type=54 (OP), string='(', start=(3, 7), end=(3, 8), line='def two():\n')
TokenInfo(type=54 (OP), string=')', start=(3, 8), end=(3, 9), line='def two():\n')
TokenInfo(type=54 (OP), string=':', start=(3, 9), end=(3, 10), line='def two():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 10), end=(3, 11), line='def two():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(4, 0), end=(4, 4), line='    return 1 + 1\n')
TokenInfo(type=1 (NAME), string='return', start=(4, 4), end=(4, 10), line='    return 1 + 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(4, 11), end=(4, 12), line='    return 1 + 1\n')
TokenInfo(type=54 (OP), string='+', start=(4, 13), end=(4, 14), line='    return 1 + 1\n')
TokenInfo(type=2 (NUMBER), string='1', start=(4, 15), end=(4, 16), line='    return 1 + 1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 16), end=(4, 17), line='    return 1 + 1\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')
```

To tokenize a string, you must encode it to `bytes`, create an `io.BytesIO`
object, and use the `readline` method of that object. Note that if you are
starting with a `str` that has already been decoded, you may use the
[`generate_tokens()`](generate_tokens) helper function (but be aware of the
[differences](generate_tokens) between `tokenize()` and `generate_tokens()`).
Either way, if you find yourself doing this often, it may be useful to define
a helper function.

```py
>>> import io
>>> def tokenize_string(s):
...     return tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline)
>>> for tok in tokenize_string('hello + tokenize\n'):
...     print(tok)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='hello', start=(1, 0), end=(1, 5), line='hello + tokenize\n')
TokenInfo(type=54 (OP), string='+', start=(1, 6), end=(1, 7), line='hello + tokenize\n')
TokenInfo(type=1 (NAME), string='tokenize', start=(1, 8), end=(1, 16), line='hello + tokenize\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 16), end=(1, 17), line='hello + tokenize\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```

The reason for this API is that `tokenize()` is a streaming API, which works
line-by-line on a Python document. This is also why the `tokenize()` function
returns a generator. The typical pattern when using `tokenize()` is to
iterate over it with a `for` loop (see the next section). If you finish
whatever you are doing before you reach the [`ENDMARKER`](endmarker) token,
you can break out of the loop early, without tokenizing the rest of the
document. It is not recommended to convert the `tokenize()` generator into a
list, except for debugging purposes. Not only is this inefficient, it also
makes it impossible to deal with [exceptions](exceptions): if an exception is
raised partway through the input, `list()` discards every token that was
emitted before it.

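For example, here is a sketch of a small helper that pulls out the first
`NAME` token and then abandons the generator, so the rest of the input is
never tokenized (`first_name()` is hypothetical, defined here only for
illustration):

```py
>>> def first_name(s):
...     # Returning from inside the loop abandons the generator early;
...     # nothing after the first NAME token is ever tokenized.
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.NAME:
...             return tok.string
>>> first_name('hello + tokenize\n')
'hello'
```
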
## `TokenInfo`

The `tokenize()` generator yields `TokenInfo` namedtuple objects, with the
following fields:

```py
>>> tokenize.TokenInfo._fields
('type', 'string', 'start', 'end', 'line')
```

The meaning of each field is outlined [below](tokeninfo-fields).

There are two ways to work with `TokenInfo` objects. One is to unpack the
tuple, typically in the `for` statement:

```py
>>> for toknum, tokval, start, end, line in tokenize_string('hello + tokenize\n'):
...     print("toknum:", toknum, "tokval:", repr(tokval), "start:", start, "end:", end, "line:", repr(line))
toknum: 63 tokval: 'utf-8' start: (0, 0) end: (0, 0) line: ''
toknum: 1 tokval: 'hello' start: (1, 0) end: (1, 5) line: 'hello + tokenize\n'
toknum: 54 tokval: '+' start: (1, 6) end: (1, 7) line: 'hello + tokenize\n'
toknum: 1 tokval: 'tokenize' start: (1, 8) end: (1, 16) line: 'hello + tokenize\n'
toknum: 4 tokval: '\n' start: (1, 16) end: (1, 17) line: 'hello + tokenize\n'
toknum: 0 tokval: '' start: (2, 0) end: (2, 0) line: ''
```

By convention, unused values are often replaced by `_`. You can also unpack
the `start` and `end` tuples directly.

```py
>>> for _, tokval, (start_line, start_col), (end_line, end_col), _ in tokenize_string('hello + tokenize\n'):
...     print("{tokval!r} on lines {start_line} to {end_line} on columns {start_col} to {end_col}".format(tokval=tokval, start_line=start_line, end_line=end_line, start_col=start_col, end_col=end_col))
'utf-8' on lines 0 to 0 on columns 0 to 0
'hello' on lines 1 to 1 on columns 0 to 5
'+' on lines 1 to 1 on columns 6 to 7
'tokenize' on lines 1 to 1 on columns 8 to 16
'\n' on lines 1 to 1 on columns 16 to 17
'' on lines 2 to 2 on columns 0 to 0
```

The other way is to use the `TokenInfo` object as-is and access its fields as
attributes. I like using `tok` as the variable name for the tokens. `token`
can be confused with the
[module name](https://docs.python.org/3/library/token.html), so I don't
recommend using that (even though I recommend only importing `tokenize`,
which includes all the names from `token`).

```py
>>> for tok in tokenize_string('hello + tokenize\n'):
...     print("type:", tok.type, "string:", repr(tok.string), "start:", tok.start, "end:", tok.end, "line:", repr(tok.line))
type: 63 string: 'utf-8' start: (0, 0) end: (0, 0) line: ''
type: 1 string: 'hello' start: (1, 0) end: (1, 5) line: 'hello + tokenize\n'
type: 54 string: '+' start: (1, 6) end: (1, 7) line: 'hello + tokenize\n'
type: 1 string: 'tokenize' start: (1, 8) end: (1, 16) line: 'hello + tokenize\n'
type: 4 string: '\n' start: (1, 16) end: (1, 17) line: 'hello + tokenize\n'
type: 0 string: '' start: (2, 0) end: (2, 0) line: ''
```

One advantage of this second way is that the `TokenInfo` object contains an
additional attribute, `exact_type`, which contains the exact token type of an
[`OP` token](op). However, this can also be determined from the `string`. The
first form is less verbose, but the second form avoids errors from getting
the attributes in the wrong order. Which form you should use depends on how
you weigh these tradeoffs. I personally recommend the second form (`for tok
in tokenize(...): ... tok.type, etc.`), unless you have the
`(toknum, tokstr, start, end, line)` order memorized.

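For instance, here is a short sketch that uses `exact_type` to see exactly
which operator an `OP` token is, together with `tok_name` (which maps token
type numbers to their names):

```py
>>> for tok in tokenize_string('hello + tokenize\n'):
...     if tok.type == tokenize.OP:
...         # tok_name maps token type numbers to their names
...         print(repr(tok.string), tokenize.tok_name[tok.exact_type])
'+' PLUS
```
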
(tokeninfo-fields)=
### `TokenInfo` Fields

#### `type`

The token types are outlined in detail in the [Token Types](tokens.md)
section.

#### `string`

The chunk of code that is tokenized. For token types where the string is
meaningless, such as [`ENDMARKER`](endmarker), the string is empty. For the
[`ENCODING`](encoding) token, the string is the detected encoding, which does
not appear literally in the code; this is why the line and column numbers for
`ENCODING` are 0 and its `line` is the empty string.

(start-and-end)=
#### `start` and `end`

`start` and `end` are `(line number, column number)` tuples giving the
positions of the start and end of the tokenized string. **Line numbers start
at 1 and column numbers start at 0**. The line and column numbers for the
[`ENCODING`](encoding) token, which is always the first token emitted, are
both `(0, 0)`.

Because Python tuples compare lexicographically (i.e., `(a, b) < (c, d)` is
equivalent to `a < c or (a == c and b < d)`), these tuples can be compared
directly to determine which comes earlier in the input. The `start` and `end`
tuples as emitted from `tokenize()` are always nondecreasing (that is,
`start <= end` will always be True for a single `TokenInfo`, and
`tok0.start <= tok1.start` and `tok0.end <= tok1.end` will be True for
consecutive `TokenInfo`s, `tok0` and `tok1`).

You should always use the `start` and `end` tuples to deduce line and column
information, rather than trying to compute it yourself from the token
strings: `tokenize()` skips over syntactically irrelevant whitespace, which
can even include newlines (in particular, escaped newlines; see [`NL`](nl)).

#### `line`

`line` gives the full line that the token comes from. This is useful for
reconstructing the whitespace between tokens (never assume that the
whitespace between tokens consists only of space characters; it could also
contain escaped newlines or tabs). `line` can also be useful for providing
contextual error messages relating to the tokenization.

(exceptions)=
## Exceptions

`tokenize()` has two failure modes: [`ERRORTOKEN`](errortoken) and
exceptions. When a non-fatal error occurs, some text is tokenized as an
[`ERRORTOKEN`](errortoken) and tokenizing continues on the remainder of the
input. This happens, for instance, for unrecognized characters, such as `$`,
and for unclosed single-quoted strings. See the [`ERRORTOKEN`](errortoken)
and [`STRING`](error-behavior) references for more information.

Other failures are fatal: tokenization cannot continue and an exception is
raised. Depending on what you are doing, you may want to catch the exception
and deal with it, or let it bubble up to the caller. The following are the
exceptions that can be raised by `tokenize()`. An exception other than these
likely indicates incorrect [input](calling-syntax).

(syntaxerror)=
### `SyntaxError`

`SyntaxError` is raised when the input has an invalid encoding. The encoding
is detected using the [`detect_encoding()`](detect-encoding) function, which
uses either a [PEP 263](https://www.python.org/dev/peps/pep-0263/) formatted
comment at the beginning of the input (like `# -*- coding: utf-8 -*-`) or a
Unicode BOM character. If both are present but disagree (a BOM implies
UTF-8), or if the PEP 263 encoding is invalid, `SyntaxError` is raised.

```py
>>> tokenize.tokenize(io.BytesIO(b"# -*- coding: invalid -*-\n").readline)
Traceback (most recent call last):
...
SyntaxError: unknown encoding: invalid
```

Note one difference between `tokenize()` and
[`generate_tokens()`](generate_tokens) here: `generate_tokens()` assumes the
input has already been decoded, since it works on a `readline` method that
returns strings instead of bytes. As a result, it ignores `coding` comments:

```py
>>> for t in tokenize.generate_tokens(io.StringIO("# -*- coding: invalid -*-\n").readline):
...     print(t)
TokenInfo(type=61 (COMMENT), string='# -*- coding: invalid -*-', start=(1, 0), end=(1, 25), line='# -*- coding: invalid -*-\n')
TokenInfo(type=62 (NL), string='\n', start=(1, 25), end=(1, 26), line='# -*- coding: invalid -*-\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```

This is why it is important to use `tokenize()` with raw bytes whenever the
encoding is not necessarily known, e.g., when reading Python code from a
file.

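If you only need to know which encoding `tokenize()` would use, without
tokenizing anything, you can call [`detect_encoding()`](detect-encoding)
directly on the same kind of `readline` argument; it returns the detected
encoding along with the raw lines it read to find it. A small illustration:

```py
>>> tokenize.detect_encoding(io.BytesIO(b"# -*- coding: utf-8 -*-\nx = 1\n").readline)
('utf-8', [b'# -*- coding: utf-8 -*-\n'])
```
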
See also [`ENCODING`](encoding).

(tokenerror)=
### `TokenError`

`tokenize.TokenError` is raised in two situations, and the only way to
distinguish the two is to inspect the exception message. In both cases,
`TokenError` is raised when the end of the input (`EOF`) is reached without a
delimiter having been closed. Tokens before the end of the input are still
emitted, so it is typically desirable to process the tokens from `tokenize()`
and handle `TokenError` as an end-of-input condition. The second argument of
the exception (`e.args[1]`) is a tuple of the [start](start-and-end) line and
column number for the part of the input that was not tokenized due to the
exception.

1. **EOF in multi-line string**

   This happens if a triple-quoted string (i.e., a multi-line string, or
   "docstring") is not closed by the end of the input. This exception can be
   detected by checking `'string' in e.args[0]`.

   ```py
   >>> for tok in tokenize.tokenize(io.BytesIO(b"""
   ... def f():
   ...     '''
   ...     unclosed docstring
   ... """).readline):
   ...     print(tok) # doctest: +SKIP
   TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
   TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
   TokenInfo(type=1 (NAME), string='def', start=(2, 0), end=(2, 3), line='def f():\n')
   TokenInfo(type=1 (NAME), string='f', start=(2, 4), end=(2, 5), line='def f():\n')
   TokenInfo(type=54 (OP), string='(', start=(2, 5), end=(2, 6), line='def f():\n')
   TokenInfo(type=54 (OP), string=')', start=(2, 6), end=(2, 7), line='def f():\n')
   TokenInfo(type=54 (OP), string=':', start=(2, 7), end=(2, 8), line='def f():\n')
   TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 8), end=(2, 9), line='def f():\n')
   TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line="    '''\n")
   Traceback (most recent call last):
   ...
   tokenize.TokenError: ('EOF in multi-line string', (3, 4))
   ```

2. **EOF in multi-line statement**

   This error occurs when the input ends with an unclosed brace (parenthesis,
   bracket, or curly brace). Note that `tokenize` does not necessarily stop
   tokenizing as soon as the input is syntactically invalid, as it only has
   limited knowledge of Python syntax. In fact, in the current
   implementation, this exception is only raised after all tokens have been
   emitted, except possibly [`DEDENT`](dedent) tokens. This exception can be
   detected by checking `'statement' in e.args[0]`.

   ```py
   >>> for tok in tokenize.tokenize(io.BytesIO(b"""
   ... (1 +
   ... def f():
   ...     pass
   ... """).readline):
   ...     print(tok) # doctest: +SKIP
   TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
   TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
   TokenInfo(type=54 (OP), string='(', start=(2, 0), end=(2, 1), line='(1 +\n')
   TokenInfo(type=2 (NUMBER), string='1', start=(2, 1), end=(2, 2), line='(1 +\n')
   TokenInfo(type=54 (OP), string='+', start=(2, 3), end=(2, 4), line='(1 +\n')
   TokenInfo(type=62 (NL), string='\n', start=(2, 4), end=(2, 5), line='(1 +\n')
   TokenInfo(type=1 (NAME), string='def', start=(3, 0), end=(3, 3), line='def f():\n')
   TokenInfo(type=1 (NAME), string='f', start=(3, 4), end=(3, 5), line='def f():\n')
   TokenInfo(type=54 (OP), string='(', start=(3, 5), end=(3, 6), line='def f():\n')
   TokenInfo(type=54 (OP), string=')', start=(3, 6), end=(3, 7), line='def f():\n')
   TokenInfo(type=54 (OP), string=':', start=(3, 7), end=(3, 8), line='def f():\n')
   TokenInfo(type=62 (NL), string='\n', start=(3, 8), end=(3, 9), line='def f():\n')
   TokenInfo(type=1 (NAME), string='pass', start=(4, 4), end=(4, 8), line='    pass\n')
   TokenInfo(type=62 (NL), string='\n', start=(4, 8), end=(4, 9), line='    pass\n')
   Traceback (most recent call last):
   ...
   tokenize.TokenError: ('EOF in multi-line statement', (5, 0))
   ```

The following example shows how `TokenError` might be caught and processed.
See also the [examples](examples.md).

```py
>>> def tokens_with_errors(s):
...     try:
...         for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
...             print(tok)
...     except tokenize.TokenError as e:
...         if 'string' in e.args[0]:
...             print("TokenError: Unclosed multi-line string starting at", e.args[1])
...         elif 'statement' in e.args[0]:
...             print("TokenError: Unclosed brace(s)")
...         else:
...             # Unrecognized TokenError. Shouldn't happen
...             raise
>>> tokens_with_errors("""
... def f():
...     '''
...     unclosed docstring
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='def', start=(2, 0), end=(2, 3), line='def f():\n')
TokenInfo(type=1 (NAME), string='f', start=(2, 4), end=(2, 5), line='def f():\n')
TokenInfo(type=54 (OP), string='(', start=(2, 5), end=(2, 6), line='def f():\n')
TokenInfo(type=54 (OP), string=')', start=(2, 6), end=(2, 7), line='def f():\n')
TokenInfo(type=54 (OP), string=':', start=(2, 7), end=(2, 8), line='def f():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 8), end=(2, 9), line='def f():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line="    '''\n")
TokenError: Unclosed multi-line string starting at (3, 4)
>>> tokens_with_errors("""
... (1 +
... def f():
...     pass
... """)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=54 (OP), string='(', start=(2, 0), end=(2, 1), line='(1 +\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 1), end=(2, 2), line='(1 +\n')
TokenInfo(type=54 (OP), string='+', start=(2, 3), end=(2, 4), line='(1 +\n')
TokenInfo(type=62 (NL), string='\n', start=(2, 4), end=(2, 5), line='(1 +\n')
TokenInfo(type=1 (NAME), string='def', start=(3, 0), end=(3, 3), line='def f():\n')
TokenInfo(type=1 (NAME), string='f', start=(3, 4), end=(3, 5), line='def f():\n')
TokenInfo(type=54 (OP), string='(', start=(3, 5), end=(3, 6), line='def f():\n')
TokenInfo(type=54 (OP), string=')', start=(3, 6), end=(3, 7), line='def f():\n')
TokenInfo(type=54 (OP), string=':', start=(3, 7), end=(3, 8), line='def f():\n')
TokenInfo(type=62 (NL), string='\n', start=(3, 8), end=(3, 9), line='def f():\n')
TokenInfo(type=1 (NAME), string='pass', start=(4, 4), end=(4, 8), line='    pass\n')
TokenInfo(type=62 (NL), string='\n', start=(4, 8), end=(4, 9), line='    pass\n')
TokenError: Unclosed brace(s)
```

(indentationerror)=
### `IndentationError`

`tokenize()` raises `IndentationError` if an unindent does not match an outer
indentation level.

```py
>>> for tok in tokenize.tokenize(io.BytesIO(b"""
... if x:
...     pass
...   f
... """).readline):
...     print(tok) # doctest: +SKIP
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if x:\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 3), end=(2, 4), line='if x:\n')
TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if x:\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    pass\n')
TokenInfo(type=1 (NAME), string='pass', start=(3, 4), end=(3, 8), line='    pass\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 8), end=(3, 9), line='    pass\n')
Traceback (most recent call last):
...
  File "<tokenize>", line 4
    f
    ^
IndentationError: unindent does not match any outer indentation level
```

This error is difficult to recover from. If you need to handle tokenizing
input with invalid indentation, my best recommendation is to instead use the
[parso](https://parso.readthedocs.io/en/latest/) library, which does not
raise `IndentationError` (it also does not raise any of the other exceptions
discussed here). See also the [discussion](parso) of parso in the
alternatives section.

This is the only indentation error `tokenize` cares about. It does not care
about other syntactically invalid constructs, such as inconsistent use of
tabs and spaces for indentation.

```py
>>> for tok in tokenize.tokenize(io.BytesIO(b"""
... if x:
...     \tpass
... \t    pass
... """).readline):
...     print(tok)
TokenInfo(type=63 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=62 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=1 (NAME), string='if', start=(2, 0), end=(2, 2), line='if x:\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 3), end=(2, 4), line='if x:\n')
TokenInfo(type=54 (OP), string=':', start=(2, 4), end=(2, 5), line='if x:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 5), end=(2, 6), line='if x:\n')
TokenInfo(type=5 (INDENT), string='    \t', start=(3, 0), end=(3, 5), line='    \tpass\n')
TokenInfo(type=1 (NAME), string='pass', start=(3, 5), end=(3, 9), line='    \tpass\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 9), end=(3, 10), line='    \tpass\n')
TokenInfo(type=5 (INDENT), string='\t    ', start=(4, 0), end=(4, 5), line='\t    pass\n')
TokenInfo(type=1 (NAME), string='pass', start=(4, 5), end=(4, 9), line='\t    pass\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 9), end=(4, 10), line='\t    pass\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')
>>> exec("""
... if x:
...     \tpass
... \t    pass
... """)
Traceback (most recent call last):
...
  File "<string>", line 4
    pass
    ^
TabError: inconsistent use of tabs and spaces in indentation
```

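Putting the pieces of this section together: if you are tokenizing input that
might be invalid, you will generally want to guard against all of the above
exceptions. Here is a minimal sketch (the `robust_tokens()` helper is
hypothetical, not part of `tokenize`; note that `IndentationError` is a
subclass of `SyntaxError`, so catching `SyntaxError` covers both):

```py
>>> def robust_tokens(s):
...     # Collect as many tokens as possible, reporting any failure.
...     toks = []
...     try:
...         for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
...             toks.append(tok)
...     except (SyntaxError, tokenize.TokenError) as e:
...         print("Tokenization stopped:", e.args[0])
...     return toks
>>> len(robust_tokens("x = 1\n"))
6
>>> bad = robust_tokens("if x:\n    pass\n  f\n") # doctest: +SKIP
Tokenization stopped: unindent does not match any outer indentation level
```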