Examples

Here are some examples that use tokenize.


To simplify the examples, the following helper function is used.

>>> import tokenize
>>> import io
>>> def tokenize_string(s):
...     """
...     Generator of tokens from the string s
...     """
...     return tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline)

Processing Tokens

These examples show different ways that tokens can be processed.

inside_string()

inside_string(s, row, col) takes the Python code s and determines whether the position (row, col) is inside a STRING token.

To simplify the example for the purposes of illustration, inside_string returns True even if (row, col) is on the quote delimiter or prefix character (like r or f) of the string.

>>> def inside_string(s, row, col):
...     """
...     Returns True if (row, col) is inside a string in s, False otherwise.
...
...     row starts at 1 and col starts at 0.
...     """
...     try:
...         for toknum, tokval, start, end, _ in tokenize_string(s):
...             if toknum == tokenize.ERRORTOKEN and tokval[0] in '"\'':
...                 # There is an unclosed string. We haven't gotten to the
...                 # position yet, so it must be inside this string
...                 return True
...             if start <= (row, col) <= end:
...                 return toknum == tokenize.STRING
...     except tokenize.TokenError as e:
...         # Uncompleted docstring or braces.
...         # 'string' in the exception means uncompleted multi-line string
...         return 'string' in e.args[0] and (row, col) >= e.args[1]
...
...     return False

Let’s walk through the code.

We don’t want the function to raise TokenError on uncompleted delimiters or unclosed multi-line strings, so we wrap the loop in a try block.

try:

Next we have the main loop. We don’t use the line attribute, so we use _ instead to indicate it isn’t used.

    for toknum, tokval, start, end, _ in tokenize_string(s):

The idea is to loop through the tokens until we find one that contains (row, col), that is, (row, col) is between the token's start and end positions. This may not actually happen, for instance, if (row, col) is inside whitespace that isn't tokenized.

The first thing to check for is an ERRORTOKEN caused by an unclosed single-quoted string, that is, a (non multi-line) string that is terminated by a newline, like "an unclosed string. If we encounter one and haven't reached our (row, col) yet, we assume (row, col) is inside this unclosed string. This implicitly makes the rest of the document part of the unclosed string. We could also easily modify this to only assume the rest of the line is inside the unclosed string.

         if toknum == tokenize.ERRORTOKEN and tokval[0] in '"\'':
             # There is an unclosed string. We haven't gotten to the
             # position yet, so it must be inside this string
             return True

Now we have the main condition. If the (row, col) is between the start and end of a token, we have gone as far as we need to.

         if start <= (row, col) <= end:

That token is either a STRING token, in which case, we should return True, or it is another token type, which means our (row, col) is not on a STRING token and we can return False. This can be written as simply:

             return toknum == tokenize.STRING

Now the exceptional case. If we see a TokenError, we don’t want the function to fail.

except tokenize.TokenError as e:

Remember that there are two possibilities for a TokenError. If 'statement' is in the error message, there is an unclosed brace somewhere. This case only happens when tokenize() has reached the end of the token stream, so if the above checks haven’t returned True yet, then (row, col) must not be inside a STRING token, so we should return False.

If 'string' is inside the error message, there is an unclosed multi-line string. In this case, we want to check if we are inside this string. We can check the start of the multi-line string in the TokenError. Remember that the message is in e.args[0] and the start is in e.args[1]. So we should return True in this case if the (row, col) are after the e.args[1], and False otherwise.

This logic can all be written succinctly as

    # Uncompleted docstring or braces.
    # 'string' in the exception means uncompleted multi-line string
    return 'string' in e.args[0] and (row, col) >= e.args[1]

Finally, if we reach the end of the token stream without returning anything, it means we never found a STRING that is on our (row, col).

return False
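Since the function leans on the TokenError arguments, it may help to inspect them directly. The sketch below (the helper name consume is mine, not from the text) exhausts the token stream for an unterminated multi-line string; the exact message wording has varied across Python versions, but 'string' appears in it for this case:

```python
import io
import tokenize

def consume(code):
    # Exhaust the token stream, returning the TokenError args if one is raised.
    try:
        for tok in tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline):
            pass
    except tokenize.TokenError as e:
        return e.args
    return None

args = consume("'''an unclosed multi-line string")
# args[0] is the message; args[1] is the position where the problem starts.
print(args)
```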

Here are some test cases to verify that the code is correct.

>>> # Basic test. Remember that lines start at 1 and columns start at 0.
>>> inside_string("print('a string')", 1, 4) # 't' in print
False
>>> inside_string("print('a string')", 1, 9) # 'a' in 'a string'
True
>>> # Note: because our input uses """, the first line is empty
>>> inside_string("""
... "an unclosed single quote string
... 1 + 1
... """, 2, 4) # 'u' in 'unclosed'
True
>>> # Check for whitespace right before TokenError
>>> inside_string("""
...  '''an unclosed multi-line string
... 1 + 1
... """, 1, 0) # the space before '''
False
>>> # Check inside an unclosed multi-line string
>>> inside_string("""
...  '''an unclosed multi-line string
... 1 + 1
... """, 1, 4) # 'a' in 'an'
True
>>> # Check for whitespace between tokens
>>> inside_string("""
... def hello(name):
...     return 'hello %s' % name
... """, 2, 3) # ' ' before hello(name)
False
>>> # Check TokenError from unclosed delimiters
>>> inside_string("""
... def hello(name:
...     return 'hello %s' % name
... """, 4, 0) # Last character in the input
False
>>> inside_string("""
... def hello(name:
...     return 'hello %s' % name
... """, 3, 12) # 'h' in 'hello'
True

Exercises

  • Modify inside_string to return False if (row, col) is on a prefix or quote character. For instance, in rb'abc' it should only return True on the abc part of the string. (This is more challenging than it may sound. Be sure to write lots of test cases.)

  • Right now, if (row, col) is a whitespace character that is not tokenized, the loop will pass over it and tokenize the entire input before returning False. Make this more efficient.

  • Only consider characters to be inside an unclosed single-quoted string if they are on the same line.

  • Write a version of inside_string() using parso’s tokenizer (parso.python.tokenize.tokenize()).

line_numbers()

Let’s go back to our motivating example from the tokenize vs. Alternatives section, a function that prints the line numbers of every function definition. Our function looked like this (rewritten to use our tokenize_string() helper):

>>> def line_numbers(s):
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.NAME and tok.string == 'def':
...             print(tok.start[0])

As we noted, this function works, but it doesn’t handle any of our error conditions.

Looking at our exceptions list, SyntaxError and IndentationError are unrecoverable, so we will just let them bubble up. However, TokenError simply means that the input had an unclosed brace or multi-line string. In the former case, the tokenization reaches the end of the input before the exception is raised, and in the latter case, the remainder of the input is inside the unclosed multi-line string, so we can safely ignore TokenError in either case.

>>> def line_numbers(s):
...     try:
...         for tok in tokenize_string(s):
...             if tok.type == tokenize.NAME and tok.string == 'def':
...                 print(tok.start[0])
...     except tokenize.TokenError:
...         pass

Finally, let’s consider ERRORTOKEN due to unclosed single-quoted strings. Our motivation for using tokenize to solve this problem is to handle incomplete or invalid Python (otherwise, we should use the ast implementation, which is much simpler). Thus, it makes sense to treat unclosed single-quoted strings as if they were closed at the end of the line.

>>> def line_numbers(s):
...     try:
...         skip_line = -1
...         for tok in tokenize_string(s):
...             if tok.start[0] == skip_line:
...                 continue
...             elif tok.start[0] >= skip_line:
...                 # reset skip_line
...                 skip_line = -1
...             if tok.type == tokenize.ERRORTOKEN and tok.string in '"\'':
...                 # Unclosed single-quoted string. Ignore the rest of this line
...                 skip_line = tok.start[0]
...                 continue
...             if tok.type == tokenize.NAME and tok.string == 'def':
...                 print(tok.start[0])
...     except tokenize.TokenError:
...         pass

Here are our original test cases, plus some additional ones for our added behavior.

>>> code = """\
... def f(x):
...     pass
...
... class MyClass:
...     def g(self):
...         pass
... """
>>> line_numbers(code)
1
5
>>> code = '''\
... FUNCTION_SKELETON = """
... def {name}({args}):
...     {body}
... """
... '''
>>> line_numbers(code) # no output
>>> code = """\
... def f():
...     '''
...     an unclosed docstring.
... """
>>> line_numbers(code)
1
>>> code = """\
... def f(: # Unclosed parenthesis
...     pass
... """
>>> line_numbers(code)
1
>>> code = """\
... def f():
...     "an unclosed single-quoted string. It should not match this def
... def g():
...     pass
... """
>>> line_numbers(code)
1
3

Indentation Level

INDENT and DEDENT tokens are always balanced in the token stream (unless there is an IndentationError), so it is easy to detect the indentation level of a block of code by incrementing and decrementing a counter.

>>> def indentation_level(s, row, col):
...     """
...     Returns the indentation level of the code at (row, col)
...     """
...     level = 0
...     try:
...         for tok in tokenize_string(s):
...             if tok.start >= (row, col):
...                 return level
...             if tok.type == tokenize.INDENT:
...                 level += 1
...             if tok.type == tokenize.DEDENT:
...                 level -= 1
...     except tokenize.TokenError:
...         # Ignore TokenError (we don't care about incomplete code)
...         pass
...     return level

To demonstrate the function, let’s apply it to itself.

>>> indentation_level_source = '''\
... def indentation_level(s, row, col):
...     """
...     Returns the indentation level of the code at (row, col)
...     """
...     level = 0
...     try:
...         for tok in tokenize_string(s):
...             if tok.start >= (row, col):
...                 return level
...             if tok.type == tokenize.INDENT:
...                 level += 1
...             if tok.type == tokenize.DEDENT:
...                 level -= 1
...     except tokenize.TokenError:
...         # Ignore TokenError (we don't care about incomplete code)
...         pass
...     return level
... '''
>>> # Use a large column number so it always looks at the fully indented line.
>>> for i in range(1, indentation_level_source.count('\n') + 2): 
...     print(indentation_level(indentation_level_source, i, 100), indentation_level_source.split('\n')[i-1])
0 def indentation_level(s, row, col):
1     """
1     Returns the indentation level of the code at (row, col)
1     """
1     level = 0
1     try:
2         for tok in tokenize_string(s):
3             if tok.start >= (row, col):
4                 return level
3             if tok.type == tokenize.INDENT:
4                 level += 1
3             if tok.type == tokenize.DEDENT:
4                 level -= 1
1     except tokenize.TokenError:
1         # Ignore TokenError (we don't care about incomplete code)
2         pass
1     return level
0

An oddity worth mentioning: the comment near the end of the function is not considered indented more than the except line. Remember that the tokenizer ignores comment-only lines when computing indentation, so INDENT and DEDENT tokens are only emitted for lines containing real code.
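This is easy to verify directly: a comment indented deeper than the surrounding code never produces an INDENT token. A quick sketch (my own example, not from the text):

```python
import io
import tokenize

code = (
    "if True:\n"
    "            # a deeply indented comment\n"
    "    pass\n"
)
toks = list(tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline))
# Only the 'pass' line produces an INDENT; the comment-only line yields
# COMMENT and NL tokens but never changes the indentation level.
indent_count = sum(1 for t in toks if t.type == tokenize.INDENT)
print(indent_count)  # 1
```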

Mismatched Parentheses

This example shows how to use a very important tool when processing tokens: a stack. A stack is a data structure that operates in Last In, First Out (LIFO) order. A stack has two basic operations: push, which adds an item to the stack, and pop, which removes the most recently added item.

In Python, a stack is usually implemented using a list. The push method is list.append() and the pop method is list.pop().

>>> stack = []
>>> stack.append(1)
>>> stack.append(2)
>>> stack.pop()
2
>>> stack.append(3)
>>> stack.pop()
3
>>> stack.pop()
1

Stacks are important because they allow keeping track of nested structures. Going down one level of nesting can be represented by pushing something on the stack, and going up can be represented by popping. The stack allows keeping track of the opening of the nested structure to ensure it properly matches the closing.

In particular, for a set of parentheses, such as (((())())()), a stack can be used to check if they are properly balanced. Every time we encounter an opening parenthesis, we push it on the stack, and every time we encounter a closing parenthesis, we pop the stack. If the stack is empty when we try to pop it, or if it still has items when we finish processing all the parentheses, it means they are not balanced. Otherwise, they are. In this case, we could simply use a counter like we did for the previous example, and make sure it doesn’t go negative and ends at 0, but a stack is required to handle more than one type of brace, like (())([])[], which, of course, is the case in Python.
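Before applying this to tokens, the stack-based balance check just described can be sketched on a plain string of brace characters (a simplified illustration of the idea, not the function developed below):

```python
def is_balanced(s):
    # Push each opening brace; a closing brace must match the most
    # recently pushed opener. Balanced iff every close matches and the
    # stack ends empty.
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

print(is_balanced('(((())())())'))  # True
print(is_balanced('(())([])[]'))    # True
print(is_balanced('([)]'))          # False
```

Note that the counter approach cannot catch '([)]', but the stack does, because it remembers which kind of brace was opened most recently.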

Here is an example showing how to use a stack to find all the mismatched parentheses or braces in a piece of Python code. The function handles (), [], and {} type braces.

>>> braces = {
...     tokenize.LPAR: tokenize.RPAR, # ()
...     tokenize.LSQB: tokenize.RSQB, # []
...     tokenize.LBRACE: tokenize.RBRACE, # {}
... }
...
>>> def matching_parens(s):
...     """
...     Find matching and mismatching parentheses and braces
...
...     s should be a string of (partial) Python code.
...
...     Returns a tuple (matching, mismatching).
...
...     matching is a list of tuples of matching TokenInfo objects for
...     matching parentheses/braces.
...
...     mismatching is a list of TokenInfo objects for mismatching
...     parentheses/braces.
...     """
...     stack = []
...     matching = []
...     mismatching = []
...     try:
...         for tok in tokenize_string(s):
...             exact_type = tok.exact_type
...             if exact_type == tokenize.ERRORTOKEN and tok.string[0] in '"\'':
...                 # There is an unclosed string. If we do not break here,
...                 # tokenize will tokenize the stuff after the string
...                 # delimiter.
...                 break
...             elif exact_type in braces:
...                 stack.append(tok)
...             elif exact_type in braces.values():
...                 if not stack:
...                     mismatching.append(tok)
...                     continue
...                 prevtok = stack.pop()
...                 if braces[prevtok.exact_type] == exact_type:
...                     matching.append((prevtok, tok))
...                 else:
...                     mismatching.insert(0, prevtok)
...                     mismatching.append(tok)
...             else:
...                 continue
...     except tokenize.TokenError:
...         # Either unclosed brace (what we are trying to handle here), or
...         # unclosed multi-line string (which we don't care about).
...         pass
...
...     matching.reverse()
...
...     # Anything remaining on the stack is mismatching. Keep the mismatching
...     # list in order.
...     stack.reverse()
...     mismatching = stack + mismatching
...     return matching, mismatching
>>> matching, mismatching = matching_parens("('a', {(1, 2)}, ]")
>>> matching 
[(TokenInfo(..., string='{', ...), TokenInfo(..., string='}', ...)),
 (TokenInfo(..., string='(', ...), TokenInfo(..., string=')', ...))]
>>> mismatching 
[TokenInfo(..., string='(', ...), TokenInfo(..., string=']', ...)]

Exercise

Add a flag, allow_intermediary_mismatches, which when True, allows an opening brace to still be considered matching if it is closed with the wrong brace but later closed with the correct brace (False would give the current behavior, that is, once an opening brace is closed with the wrong brace it—and any unclosed braces before it—cannot be matched).

For example, consider '[ { ] }'. Currently, all the braces are considered mismatched.

>>> matching, mismatching = matching_parens('[ { ] }')
>>> matching
[]
>>> mismatching 
[TokenInfo(..., string='[', ...),
 TokenInfo(..., string='{', ...),
 TokenInfo(..., string=']', ...),
 TokenInfo(..., string='}', ...)]

With allow_intermediary_mismatches set to True, the { and } should be considered matching.

>>> matching, mismatching = matching_parens('[ { ] }',
... allow_intermediary_mismatches=True) 
>>> matching 
[(TokenInfo(..., string='{', ...),
 TokenInfo(..., string='}', ...))]
>>> mismatching 
[TokenInfo(..., string='[', ...),
 TokenInfo(..., string=']', ...)]

Furthermore, with '[ { ] } ]' only the middle ] would be considered mismatched (with the current version, all would be mismatched).

>>> matching, mismatching = matching_parens('[ { ] } ]',
... allow_intermediary_mismatches=True) 
>>> matching 
[(TokenInfo(..., string='[', ...), TokenInfo(..., string=']', start=(1, 8), ...)),
 (TokenInfo(..., string='{', ...), TokenInfo(..., string='}', ...))]
>>> mismatching 
[TokenInfo(..., string=']', start=(1, 4), ...)]
>>> # The current version, which would be allow_intermediary_mismatches=False
>>> matching, mismatching = matching_parens('[ { ] } ]')
>>> matching
[]
>>> mismatching 
[TokenInfo(..., string='[', ...),
 TokenInfo(..., string='{', ...),
 TokenInfo(..., string=']', ...),
 TokenInfo(..., string='}', ...),
 TokenInfo(..., string=']', ...)]

The current behavior (allow_intermediary_mismatches=False) is a more technically correct version, but allow_intermediary_mismatches=True would provide more useful feedback for applications that might use this function to highlight mismatching braces, as it would be more likely to highlight only the “mistake” braces.

Since this exercise is relatively difficult, I’m providing the solution. I recommend trying to solve it yourself first, as it will really force you to understand how the function works.


No really, do try it yourself first. At least think about it.


Replace

mismatching.insert(0, prevtok)
mismatching.append(tok)

with

if allow_intermediary_mismatches:
    stack.append(prevtok)
else:
    mismatching.insert(0, prevtok)
mismatching.append(tok)

In this code block, tok is a closing brace and prevtok is the most recently found opening brace (stack.pop()). Under the current code, we append both braces to the mismatching list (keeping their order), and we continue to do that with allow_intermediary_mismatches=False. However, if allow_intermediary_mismatches=True, we instead put the prevtok back on the stack, and still put the tok in the mismatching list. This allows prevtok to be matched by a closing brace later.

For example, suppose we have ( } ). We first append ( to the stack, so the stack is ['(']. Then, when we get to }, we pop ( from the stack and see that it doesn't match. If allow_intermediary_mismatches=False, we consider these both to be mismatched and add them to the mismatched list in the correct order (['(', '}']). If allow_intermediary_mismatches=True, though, we only add '}' to the mismatched list and put ( back on the stack.

Then we get to ). In the allow_intermediary_mismatches=False case, the stack will be empty, so it will not be considered matching, and thus be placed in the mismatching list (the if not stack: block prior to the code we modified). In the allow_intermediary_mismatches=True case, the stack is ['('], so prevtok will be (, which matches the ), so they are both put in the matching list.

Modifying Tokens

These examples show some ways that you can modify the token stream.

The general pattern we will apply here is to get the token stream from tokenize(), modify it in some way, and convert it back to a bytes string with untokenize().

When new tokens are added, untokenize() does not maintain whitespace between tokens in a human-readable way. Doing this is possible, by keeping track of column offsets, but we will not bother with it here except where it is convenient. See the discussion in the untokenize() section.
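The difference is easy to demonstrate. With full TokenInfo 5-tuples (positions intact), untokenize() reproduces the source exactly; once 2-tuples appear, it falls back to a compatibility mode that makes up its own spacing. A sketch (my own example):

```python
import io
import tokenize

code = "a  =  1\n"
toks = list(tokenize.tokenize(io.BytesIO(code.encode('utf-8')).readline))

# Full TokenInfo tuples: spacing is reconstructed from the column offsets,
# so the original source round-trips exactly.
exact = tokenize.untokenize(toks).decode('utf-8')
print(repr(exact))   # 'a  =  1\n'

# 2-tuples of (type, string): untokenize() switches to compatibility mode
# and inserts its own whitespace, losing the original spacing.
compat = tokenize.untokenize([(t.type, t.string) for t in toks]).decode('utf-8')
print(repr(compat))  # spacing differs from the input
```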

Converting ^ to **

Python’s syntax uses ** for exponentiation, although many might expect it to use ^ instead. ^ is actually the XOR operator.

>>> bin(0b101 ^ 0b001)
'0b100'

Suppose you don’t care about XOR, and want to allow ^ to represent exponentiation. You might think to use the ast module and replace BitXor nodes with Pow, but this will not work, because ^ has a different precedence than **:

>>> import ast
>>> ast.dump(ast.parse('x**2 + 1')) 
"Module(body=[Expr(value=BinOp(left=BinOp(left=Name(id='x', ctx=Load()), op=Pow(), right=Constant(value=2)), op=Add(), right=Constant(value=1)))], type_ignores=[])"
>>> ast.dump(ast.parse('x^2 + 1')) 
"Module(body=[Expr(value=BinOp(left=Name(id='x', ctx=Load()), op=BitXor(), right=BinOp(left=Constant(value=2), op=Add(), right=Constant(value=1))))], type_ignores=[])"

This is difficult to read, but it basically says that x**2 + 1 is parsed like (x**2) + 1 and x^2 + 1 is parsed like x^(2 + 1). There’s no way to distinguish the two in the AST representation, because it does not keep track of redundant parentheses.

We could do a simple s.replace('^', '**'), but this would also replace any occurrences of ^ in strings and comments.

Instead, we can use tokenize. The replacement is quite easy to do:

>>> def xor_to_pow(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.exact_type == tokenize.CIRCUMFLEX: # CIRCUMFLEX is ^
...             result.append((tokenize.OP, '**'))
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
>>> xor_to_pow('x^2 + 1')
'x**2 +1 '

Because we are replacing a 1-character token with a 2-character token, untokenize() removes the original whitespace and replaces it with its own. An exercise for the reader is to redefine the column offsets for the new token and all subsequent tokens on that line to avoid this issue.

Wrapping floats with decimal.Decimal

This example is modified from the example in the standard library docs for tokenize. It is a good example for modifying tokens because the logic is not too complex, and it is something that is not possible to do with other tools such as the ast module, because ast does not keep the full precision of floats as they are in the input.
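To see why ast cannot help here, note that float literals are converted to Python float objects during parsing, before any transformation could read the original digits. A quick sketch (my own example):

```python
import ast

# The extra digits are rounded away as soon as the source is parsed;
# the AST only stores the resulting float object, not the literal text.
node = ast.parse('1.000000000000000000000000000000001', mode='eval')
print(node.body.value)  # 1.0
```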

>>> def float_to_decimal(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         # A float is a NUMBER token with a . or e (scientific notation)
...         if tok.type == tokenize.NUMBER and ('.' in tok.string or 'e' in tok.string.lower()):
...             result.extend([
...                 (tokenize.NAME, 'Decimal'),
...                 (tokenize.OP, '('),
...                 (tokenize.STRING, repr(tok.string)),
...                 (tokenize.OP, ')')
...             ])
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)

Here is the function in action:

>>> 1e-1000 + 1.000000000000000000000000000000001
1.0
>>> float_to_decimal('1e-1000 + 1.000000000000000000000000000000001')
"Decimal ('1e-1000')+Decimal ('1.000000000000000000000000000000001')"

Notice that because new tokens were added as length 2 tuples, the whitespace of the result is not the same as the input, and does not really follow PEP 8.

The transformed code can produce arbitrary precision decimals. Note that the decimal module still requires setting the context precision high enough to avoid rounding the input. An exercise for the reader is to extend float_to_decimal to determine the required precision automatically.

>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 1001
>>> eval(float_to_decimal('1e-1000 + 1.000000000000000000000000000000001'))
Decimal('1.0000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001')

Extending Python’s Syntax

Because tokenize() emits ERRORTOKEN on any unrecognized operators, it can be used to add extensions to the Python syntax. This can be challenging to do in general, as you may need to do significant parsing of the tokens to ensure that your new “operator” has the correct precedence.

You can find some more advanced examples of extending Python’s syntax in SymPy’s parser module, for example, implicit multiplication (x y → x*y), implicit function application (sin x → sin(x)), factorial notation (x! → factorial(x)), and more.

Emoji Math

The below example is relatively simple. It allows the “emoji” mathematical symbols ➕, ➖, ➗, and ✖ to be used instead of their ASCII counterparts.

>>> emoji_map = {
...     '➕': '+',
...     '➖': '-',
...     '➗': '/',
...     '✖': '*',
... }
>>> def emoji_math(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.type == tokenize.ERRORTOKEN and tok.string in emoji_map:
...             new_tok = (tokenize.OP, emoji_map[tok.string], *tok[2:])
...             result.append(new_tok)
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
>>> emoji_math('1 ➕ 2 ➖ 3➗4✖5')
'1 + 2 - 3/4*5'

Because we are replacing a single character with a single character, we can use 5-tuples and keep the column offsets intact, making untokenize() maintain the whitespace of the input.

Note

These emoji may often appear as two characters, for instance, ✖ may often appear instead as ✖️, which is ✖ (HEAVY MULTIPLICATION X) + (VARIATION SELECTOR-16). The VARIATION SELECTOR-16 is an invisible character which forces it to render as an emoji. The above example does not include the VARIATION SELECTOR-16. An exercise for the reader is to modify the above function to work with this.

Backporting Underscores in Numeric Literals

Python 3.6 added a new syntactic feature that allows underscores in numeric literals. To quote the docs, “single underscores are allowed between digits and after any base specifier. Leading, trailing, or multiple underscores in a row are not allowed.”

For example,

>>> # Python 3.6+ only
>>> 123_456 
123456

We can write a function using tokenize to backport this feature to Python 3.5. Some experimentation shows how the new literals tokenize in Python 3.5:

  • 123_456 tokenizes as NUMBER (123) and NAME (_456). In general, underscores in the middle of a number will tokenize like this.

  • Additionally, underscores are allowed after base specifiers, like 0x_1. This also tokenizes as NUMBER (0) and NAME (x_1). Since NUMBER and NAME tokens cannot appear next to one another in valid Python, we can simply combine them when they do.

  • Finally, if an underscore appears before the . in a floating point literal, like 1_2.3_4, it will tokenize as NUMBER (1), NAME (_2), NUMBER (.3), NAME (_4).

Note that in this example, the ENCODING token allows us to access result[-1] unconditionally, as we know there must always be at least this one token already processed before any NAME token. This is a useful property of the ENCODING token that we can often take advantage of.
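This guarantee is easy to check: ENCODING is always the first token yielded, even for completely empty input. A small sketch (my own example):

```python
import io
import tokenize

def first_token(code):
    # Return the first token of the stream for the given bytes of source.
    return next(tokenize.tokenize(io.BytesIO(code).readline))

# Even with no source at all, the stream starts with an ENCODING token,
# so appending tokens to a list and inspecting result[-1] is always safe.
print(tokenize.tok_name[first_token(b"x = 1\n").type])  # ENCODING
print(tokenize.tok_name[first_token(b"").type])         # ENCODING
```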

We do some basic checks to disallow spaces before underscores and double underscores, neither of which is allowed in Python 3.6. But for simplicity, this function takes a garbage in, garbage out approach. Invalid syntax in Python 3.6, like 123a_bc, will transform to something that is still invalid syntax in Python 3.5 (123abc).

>>> import sys
>>> def underscore_literals(s):
...     if sys.version_info >= (3, 6):
...         return s
...
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.type == tokenize.NAME and result[-1].type == tokenize.NUMBER:
...             # Check that there are no spaces between the tokens
...             # e.g., 123 _456 is not allowed, and there aren't multiple
...             # consecutive underscores, e.g., 123__456 is not allowed.
...             if result[-1].end == tok.start and '__' not in tok.string:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string.replace('_', ''),
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                     )
...                 result[-1] = new_tok
...                 continue
...         if tok.type == tokenize.NUMBER and tok.string[0] == '.' and result[-1].type == tokenize.NUMBER:
...             # Float with underscore before the ., like 1_2.0, which is 1, _2, .0
...             if result[-1].end == tok.start:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string,
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...         if tok.exact_type == tokenize.DOT and result[-1].type == tokenize.NUMBER:
...             # Like 1_2. which becomes 1, _2, .
...             if result[-1].end == tok.start:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string,
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...
...         result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...

Note that by reusing the start and end positions, we are able to make untokenize() keep the whitespace, even though characters were removed. untokenize() only uses the differences between end and start to determine how many spaces to add between tokens, not their absolute values (remember that untokenize() only requires the start and end tuples to be nondecreasing; it doesn’t care whether the actual column values are correct).

>>> s = '1_0 + 0b_101 + 0o_1_0 + 0x_a - 1.0_0 + 1e1 + 1.0_0j + 1_2.3_4 + 1_2.'
>>> # In Python 3.5
>>> underscore_literals(s) 
'10 + 0b101 + 0o10 + 0xa - 1.00 + 1e1 + 1.00j + 12.34 + 12.'