Examples
========

Here are some examples that use `tokenize`. To simplify the examples, the following helper function is used.

```py
>>> import tokenize
>>> import io
>>> def tokenize_string(s):
...     """
...     Generator of tokens from the string s
...     """
...     return tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline)
```

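To get a feel for what this helper yields, here is a quick, illustrative loop over a one-line input. The output shown in the comments is only approximate; the exact stream (in particular the implicit trailing `NEWLINE`) varies slightly between Python versions.

```py
for tok in tokenize_string("x = 1"):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
# Prints something like:
# ENCODING 'utf-8'
# NAME 'x'
# EQUAL '='
# NUMBER '1'
# NEWLINE ''
# ENDMARKER ''
```
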
## Processing Tokens

These examples show different ways that tokens can be processed.

### `inside_string()`

`inside_string(s, row, col)` takes the Python code `s` and determines whether the position `(row, col)` is inside a `STRING` token. To simplify the example for the purposes of illustration, `inside_string()` returns `True` even if `(row, col)` is on the quote delimiter or prefix character (like `r` or `f`) of the string.

```py
>>> def inside_string(s, row, col):
...     """
...     Returns True if (row, col) is inside a string in s, False otherwise.
...
...     row starts at 1 and col starts at 0.
...     """
...     try:
...         for toknum, tokval, start, end, _ in tokenize_string(s):
...             if toknum == tokenize.ERRORTOKEN and tokval[0] in '"\'':
...                 # There is an unclosed string. We haven't gotten to the
...                 # position yet, so it must be inside this string
...                 return True
...             if start <= (row, col) <= end:
...                 return toknum == tokenize.STRING
...     except tokenize.TokenError as e:
...         # Uncompleted docstring or braces.
...         # 'string' in the exception means uncompleted multi-line string
...         return 'string' in e.args[0] and (row, col) >= e.args[1]
...
...     return False
```

Let's walk through the code. We don't want the function to raise [`TokenError`](tokenerror) on uncompleted delimiters or unclosed multi-line strings, so we wrap the loop in a `try` block.

```py
try:
```

Next we have the main loop. We don't use the `line` attribute, so we use `_` instead to indicate it isn't used.

```py
for toknum, tokval, start, end, _ in tokenize_string(s):
```

The idea is to loop through the tokens until we find one that contains `(row, col)` (that is, `(row, col)` lies between the token's [`start`](start-and-end) and [`end`](start-and-end) positions). This may not actually happen, for instance, if `(row, col)` is inside whitespace that isn't tokenized.

The first thing to check for is an [`ERRORTOKEN`](errortoken) caused by an unclosed single-quoted string. If an unclosed single-quoted (not multi-line) string is encountered, that is, one that is closed by a newline, like `"an unclosed string`, and we haven't reached our `(row, col)` yet, then we assume our `(row, col)` is inside this unclosed string. This implicitly makes the rest of the document part of the unclosed string. We could also easily modify this to only assume the rest of the line is inside the unclosed string.

```py
if toknum == tokenize.ERRORTOKEN and tokval[0] in '"\'':
    # There is an unclosed string. We haven't gotten to the
    # position yet, so it must be inside this string
    return True
```

Now we have the main condition. If the `(row, col)` is between the [`start`](start-and-end) and [`end`](start-and-end) of a token, we have gone as far as we need to.

```py
if start <= (row, col) <= end:
```

That token is either a `STRING` token, in which case we should return `True`, or it is another token type, which means our `(row, col)` is not on a `STRING` token and we can return `False`. This can be written simply as:

```py
return toknum == tokenize.STRING
```

Now the exceptional case. If we see a [`TokenError`](tokenerror), we don't want the function to fail.

```py
except tokenize.TokenError as e:
```

Remember that there are two possibilities for a [`TokenError`](tokenerror). If `'statement'` is in the error message, there is an unclosed brace somewhere. This case only happens when `tokenize()` has reached the end of the token stream, so if the above checks haven't returned `True` yet, then `(row, col)` must not be inside a `STRING` token, so we should return `False`.

If `'string'` is in the error message, there is an unclosed multi-line string. In this case, we want to check if we are inside this string. We can check the start of the multi-line string in the [`TokenError`](tokenerror). Remember that the message is in `e.args[0]` and the start is in `e.args[1]`. So we should return `True` in this case if `(row, col)` is after `e.args[1]`, and `False` otherwise. This logic can all be written succinctly as

```py
# Uncompleted docstring or braces.
# 'string' in the exception means uncompleted multi-line string
return 'string' in e.args[0] and (row, col) >= e.args[1]
```

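To make this concrete, here is roughly what those `args` contain for an unclosed multi-line string. The exact message text is an implementation detail of CPython's `tokenize`, so don't rely on anything beyond the checks used above.

```py
try:
    for tok in tokenize_string("'''an unclosed\nmulti-line string"):
        pass
except tokenize.TokenError as e:
    print(e.args)
# Prints something like:
# ('EOF in multi-line string', (1, 0))
```
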
Finally, if we reach the end of the token stream without returning anything, it means we never found a `STRING` that is on our `(row, col)`.

```py
return False
```

Here are some test cases to verify that the code is correct:

```py
>>> # Basic test. Remember that lines start at 1 and columns start at 0.
>>> inside_string("print('a string')", 1, 4) # 't' in print
False
>>> inside_string("print('a string')", 1, 9) # 'a' in 'a string'
True
>>> # Note: because our input uses """, the first line is empty
>>> inside_string("""
... "an unclosed single quote string
... 1 + 1
... """, 2, 4) # 'u' in 'unclosed'
True
>>> # Check for whitespace right before TokenError
>>> inside_string("""
... '''an unclosed multi-line string
... 1 + 1
... """, 1, 0) # the space before '''
False
>>> # Check inside an unclosed multi-line string
>>> inside_string("""
... '''an unclosed multi-line string
... 1 + 1
... """, 2, 3) # 'a' in 'an'
True
>>> # Check for whitespace between tokens
>>> inside_string("""
... def hello(name):
...     return 'hello %s' % name
... """, 2, 3) # ' ' before hello(name)
False
>>> # Check TokenError from unclosed delimiters
>>> inside_string("""
... def hello(name:
...     return 'hello %s' % name
... """, 4, 0) # Last character in the input
False
>>> inside_string("""
... def hello(name:
...     return 'hello %s' % name
... """, 3, 12) # 'h' in 'hello'
True
```

#### Exercises

- Modify `inside_string` to return `False` if `(row, col)` is on a prefix or quote character. For instance, in `rb'abc'` it should only return `True` on the `abc` part of the string. (*This is more challenging than it may sound. Be sure to write lots of test cases.*)
- Right now, if `(row, col)` is a whitespace character that is not tokenized, the loop will pass over it and tokenize the entire input before returning `False`. Make this more efficient.
- Only consider characters to be inside an unclosed single-quoted string if they are on the same line.
- Write a version of `inside_string()` using [parso](https://parso.readthedocs.io/en/latest/)'s tokenizer (`parso.python.tokenize.tokenize()`).

(line-numbers)=
### `line_numbers()`

Let's go back to our motivating example from the [`tokenize` vs. Alternatives](alternatives) section, a function that prints the line numbers of every function definition. [Our function](tokenize) looked like this (rewritten to use our `tokenize_string()` helper):

```py
>>> def line_numbers(s):
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.NAME and tok.string == 'def':
...             print(tok.start[0])
```

As we noted, this function works, but it doesn't handle any of our [error](errortoken) [conditions](exceptions). Looking at our exceptions list, [`SyntaxError`](syntaxerror) and [`IndentationError`](indentationerror) are unrecoverable, so we will just let them bubble up. However, [`TokenError`](tokenerror) simply means that the input had an unclosed brace or multi-line string. In the former case, the tokenization reaches the end of the input before the exception is raised, and in the latter case, the remainder of the input is inside the unclosed multi-line string, so we can safely ignore [`TokenError`](tokenerror) in either case.

```py
>>> def line_numbers(s):
...     try:
...         for tok in tokenize_string(s):
...             if tok.type == tokenize.NAME and tok.string == 'def':
...                 print(tok.start[0])
...     except tokenize.TokenError:
...         pass
```

Finally, let's consider [`ERRORTOKEN`](errortoken) due to unclosed single-quoted strings. Our motivation for using `tokenize` to solve this problem is to handle incomplete or invalid Python (otherwise, we should use the [`ast` implementation](ast), which is much simpler). Thus, it makes sense to treat unclosed single-quoted strings as if they were closed at the end of the line.

```py
>>> def line_numbers(s):
...     try:
...         skip_line = -1
...         for tok in tokenize_string(s):
...             if tok.start[0] == skip_line:
...                 continue
...             elif tok.start[0] >= skip_line:
...                 # reset skip_line
...                 skip_line = -1
...             if tok.type == tokenize.ERRORTOKEN and tok.string in '"\'':
...                 # Unclosed single-quoted string. Ignore the rest of this line
...                 skip_line = tok.start[0]
...                 continue
...             if tok.type == tokenize.NAME and tok.string == 'def':
...                 print(tok.start[0])
...     except tokenize.TokenError:
...         pass
```

Here are our original test cases, plus some additional ones for our added behavior.

```py
>>> code = """\
... def f(x):
...     pass
...
... class MyClass:
...     def g(self):
...         pass
... """
>>> line_numbers(code)
1
5

>>> code = '''\
... FUNCTION_SKELETON = """
... def {name}({args}):
...     {body}
... """
... '''
>>> line_numbers(code) # no output

>>> code = """\
... def f():
...     '''
...     an unclosed docstring.
... """
>>> line_numbers(code)
1

>>> code = """\
... def f(: # Unclosed parenthesis
...     pass
... """
>>> line_numbers(code)
1

>>> code = """\
... def f():
...     "an unclosed single-quoted string. It should not match this def
... def g():
...     pass
... """
>>> line_numbers(code)
1
3
```

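Both `inside_string()` and `line_numbers()` lean on how `tokenize()` reports an unclosed single-quoted string on the Python versions this chapter targets: it emits an [`ERRORTOKEN`](errortoken) whose string is just the quote character, and then keeps tokenizing the rest of the line as ordinary tokens. Here is a rough way to see that for yourself (output abbreviated; newer CPython versions handle this case differently):

```py
for tok in tokenize_string('"an unclosed string\n1 + 1'):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
# Prints something like:
# ENCODING 'utf-8'
# ERRORTOKEN '"'
# NAME 'an'
# NAME 'unclosed'
# NAME 'string'
# NEWLINE '\n'
# NUMBER '1'
# PLUS '+'
# NUMBER '1'
# NEWLINE ''
# ENDMARKER ''
```
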
(indentation-level)=
### Indentation Level

[`INDENT`](indent) and [`DEDENT`](dedent) tokens are always balanced in the token stream (unless there is an [`IndentationError`](indentationerror)), so it is easy to detect the indentation level of a block of code by incrementing and decrementing a counter.

```py
>>> def indentation_level(s, row, col):
...     """
...     Returns the indentation level of the code at (row, col)
...     """
...     level = 0
...     try:
...         for tok in tokenize_string(s):
...             if tok.start >= (row, col):
...                 return level
...             if tok.type == tokenize.INDENT:
...                 level += 1
...             if tok.type == tokenize.DEDENT:
...                 level -= 1
...     except tokenize.TokenError:
...         # Ignore TokenError (we don't care about incomplete code)
...         pass
...     return level
```

To demonstrate the function, let's apply it to itself.

```py
>>> indentation_level_source = '''\
... def indentation_level(s, row, col):
...     """
...     Returns the indentation level of the code at (row, col)
...     """
...     level = 0
...     try:
...         for tok in tokenize_string(s):
...             if tok.start >= (row, col):
...                 return level
...             if tok.type == tokenize.INDENT:
...                 level += 1
...             if tok.type == tokenize.DEDENT:
...                 level -= 1
...     except tokenize.TokenError:
...         # Ignore TokenError (we don't care about incomplete code)
...         pass
...     return level
... '''
>>> # Use a large column number so it always looks at the fully indented line.
>>> for i in range(1, indentation_level_source.count('\n') + 2): # doctest: +NORMALIZE_WHITESPACE
...     print(indentation_level(indentation_level_source, i, 100), indentation_level_source.split('\n')[i-1])
0 def indentation_level(s, row, col):
1     """
1     Returns the indentation level of the code at (row, col)
1     """
1     level = 0
1     try:
2         for tok in tokenize_string(s):
3             if tok.start >= (row, col):
4                 return level
3             if tok.type == tokenize.INDENT:
4                 level += 1
3             if tok.type == tokenize.DEDENT:
4                 level -= 1
1     except tokenize.TokenError:
1         # Ignore TokenError (we don't care about incomplete code)
2         pass
1     return level
0
```

An oddity worth mentioning: the comment near the end of the function is not considered indented more than the `except` line. Remember that [comments](comment) are ignored for the purposes of indentation tracking, so [`INDENT`](indent) tokens only appear on lines with real code.

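A rough way to see this for yourself is to tokenize a snippet where a comment is indented more deeply than the surrounding code; the [`INDENT`](indent) only appears for the line with actual code on it (the output in the comments is approximate and may differ slightly between Python versions):

```py
code = "if x:\n        # a deeply indented comment\n    pass\n"
for tok in tokenize_string(code):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.COMMENT):
        print(tokenize.tok_name[tok.type], tok.start)
# Prints something like:
# COMMENT (2, 8)
# INDENT (3, 0)
# DEDENT (4, 0)
```
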
(mismatched-parentheses)=
### Mismatched Parentheses

This example shows how to use a very important tool when processing tokens, a stack. A [*stack*](https://en.wikipedia.org/wiki/Stack_(abstract_data_type)) is a data structure that operates as Last In, First Out. A stack has two basic operations, *push*, which adds something to the stack, and *pop*, which removes the most recently added item. In Python, a stack is usually implemented using a list. The push method is `list.append()` and the pop method is `list.pop()`.

```py
>>> stack = []
>>> stack.append(1)
>>> stack.append(2)
>>> stack.pop()
2
>>> stack.append(3)
>>> stack.pop()
3
>>> stack.pop()
1
```

Stacks are important because they allow keeping track of nested structures. Going down one level of nesting can be represented by pushing something on the stack, and going up can be represented by popping. The stack allows keeping track of the opening of the nested structure to ensure it properly matches the closing.

In particular, for a set of parentheses, such as `(((())())())`, a stack can be used to check if they are properly balanced. Every time we encounter an opening parenthesis, we push it on the stack, and every time we encounter a closing parenthesis, we pop the stack. If the stack is empty when we try to pop it, or if it still has items when we finish processing all the parentheses, it means they are not balanced. Otherwise, they are. In this case, we could simply use a counter like we did for the previous example, and make sure it doesn't go negative and ends at 0, but a stack is required to handle more than one type of brace, like `(())([])[]`, which, of course, is the case in Python.

Here is an example showing how to use a stack to find all the mismatched parentheses or braces in a piece of Python code. The function handles `()`, `[]`, and `{}` type braces.

```py
>>> braces = {
...     tokenize.LPAR: tokenize.RPAR,     # ()
...     tokenize.LSQB: tokenize.RSQB,     # []
...     tokenize.LBRACE: tokenize.RBRACE, # {}
... }
...
>>> def matching_parens(s):
...     """
...     Find matching and mismatching parentheses and braces
...
...     s should be a string of (partial) Python code.
...
...     Returns a tuple (matching, mismatching).
...
...     matching is a list of tuples of matching TokenInfo objects for
...     matching parentheses/braces.
...
...     mismatching is a list of TokenInfo objects for mismatching
...     parentheses/braces.
...     """
...     stack = []
...     matching = []
...     mismatching = []
...     try:
...         for tok in tokenize_string(s):
...             exact_type = tok.exact_type
...             if exact_type == tokenize.ERRORTOKEN and tok.string[0] in '"\'':
...                 # There is an unclosed string. If we do not break here,
...                 # tokenize will tokenize the stuff after the string
...                 # delimiter.
...                 break
...             elif exact_type in braces:
...                 stack.append(tok)
...             elif exact_type in braces.values():
...                 if not stack:
...                     mismatching.append(tok)
...                     continue
...                 prevtok = stack.pop()
...                 if braces[prevtok.exact_type] == exact_type:
...                     matching.append((prevtok, tok))
...                 else:
...                     mismatching.insert(0, prevtok)
...                     mismatching.append(tok)
...             else:
...                 continue
...     except tokenize.TokenError:
...         # Either unclosed brace (what we are trying to handle here), or
...         # unclosed multi-line string (which we don't care about).
...         pass
...
...     matching.reverse()
...
...     # Anything remaining on the stack is mismatching. Keep the mismatching
...     # list in order.
...     stack.reverse()
...     mismatching = stack + mismatching
...     return matching, mismatching
```

```py
>>> matching, mismatching = matching_parens("('a', {(1, 2)}, ]")
>>> matching # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
[(TokenInfo(..., string='{', ...), TokenInfo(..., string='}', ...)),
 (TokenInfo(..., string='(', ...), TokenInfo(..., string=')', ...))]
>>> mismatching # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
[TokenInfo(..., string='(', ...), TokenInfo(..., string=']', ...)]
```

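For contrast, here is a fully balanced input. Everything ends up in `matching` and nothing in `mismatching`:

```py
>>> matching, mismatching = matching_parens("f(x, [y])")
>>> matching # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
[(TokenInfo(..., string='(', ...), TokenInfo(..., string=')', ...)),
 (TokenInfo(..., string='[', ...), TokenInfo(..., string=']', ...))]
>>> mismatching
[]
```
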
#### Exercise

Add a flag, `allow_intermediary_mismatches`, which, when `True`, allows an opening brace to still be considered matching if it is closed with the wrong brace but later closed with the correct brace (`False` would give the current behavior, that is, once an opening brace is closed with the wrong brace it---and any unclosed braces before it---cannot be matched).

For example, consider `'[ { ] }'`. Currently, all the braces are considered mismatched.

```py
>>> matching, mismatching = matching_parens('[ { ] }')
>>> matching
[]
>>> mismatching # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
[TokenInfo(..., string='[', ...), TokenInfo(..., string='{', ...),
 TokenInfo(..., string=']', ...), TokenInfo(..., string='}', ...)]
```

With `allow_intermediary_mismatches` set to `True`, the `{` and `}` should be considered matching.

```py
>>> matching, mismatching = matching_parens('[ { ] }',
...     allow_intermediary_mismatches=True) # doctest: +SKIP
>>> matching # doctest: +SKIP
[(TokenInfo(..., string='{', ...), TokenInfo(..., string='}', ...))]
>>> mismatching # doctest: +SKIP
[TokenInfo(..., string='[', ...), TokenInfo(..., string=']', ...)]
```

Furthermore, with `'[ { ] } ]'` only the middle `]` would be considered mismatched (with the current version, all would be mismatched).

```py
>>> matching, mismatching = matching_parens('[ { ] } ]',
...     allow_intermediary_mismatches=True) # doctest: +SKIP
>>> matching # doctest: +SKIP
[(TokenInfo(..., string='[', ...), TokenInfo(..., string=']', start=(1, 8), ...)),
 (TokenInfo(..., string='{', ...), TokenInfo(..., string='}', ...))]
>>> mismatching # doctest: +SKIP
[TokenInfo(..., string=']', start=(1, 4), ...)]

>>> # The current version, which would be allow_intermediary_mismatches=False
>>> matching, mismatching = matching_parens('[ { ] } ]')
>>> matching
[]
>>> mismatching # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
[TokenInfo(..., string='[', ...), TokenInfo(..., string='{', ...),
 TokenInfo(..., string=']', ...), TokenInfo(..., string='}', ...),
 TokenInfo(..., string=']', ...)]
```

The current behavior (`allow_intermediary_mismatches=False`) is the more technically correct version, but `allow_intermediary_mismatches=True` would provide more useful feedback for applications that might use this function to highlight mismatching braces, as it would be more likely to highlight only the "mistake" braces.

Since this exercise is relatively difficult, I'm providing the solution. I recommend trying to solve it yourself first, as it will really force you to understand how the function works.

No really, do try it yourself first. At least think about it.

Replace

```py
mismatching.insert(0, prevtok)
mismatching.append(tok)
```

with

```py
if allow_intermediary_mismatches:
    stack.append(prevtok)
else:
    mismatching.insert(0, prevtok)
mismatching.append(tok)
```

In this code block, `tok` is a closing brace and `prevtok` is the most recently found opening brace (`stack.pop()`). Under the current code, we append both braces to the `mismatching` list (keeping their order), and we continue to do that with `allow_intermediary_mismatches=False`. However, if `allow_intermediary_mismatches=True`, we instead put the `prevtok` back on the stack, and still put the `tok` in the `mismatching` list. This allows `prevtok` to be matched by a closing brace later.

For example, suppose we have `( } )`. We first append `(` to the stack, so the stack is `['(']`. Then, when we get to `}`, we pop `(` from the stack and see that it doesn't match. If `allow_intermediary_mismatches=False`, we consider these both to be mismatched, and add them to the `mismatching` list in the correct order (`['(', '}']`). If `allow_intermediary_mismatches=True`, though, we only add `'}'` to the `mismatching` list and put `(` back on the stack.

Then we get to `)`. In the `allow_intermediary_mismatches=False` case, the stack will be empty, so it will not be considered matching, and thus be placed in the `mismatching` list (the `if not stack:` block prior to the code we modified). In the `allow_intermediary_mismatches=True` case, the stack is `['(']`, so `prevtok` will be `(`, which matches the `)`, so they are both put in the `matching` list.

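For comparison, this is what the unmodified function (equivalent to `allow_intermediary_mismatches=False`) gives for the `( } )` example that was just walked through:

```py
>>> matching, mismatching = matching_parens('( } )')
>>> matching
[]
>>> mismatching # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
[TokenInfo(..., string='(', ...), TokenInfo(..., string='}', ...),
 TokenInfo(..., string=')', ...)]
```
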
## Modifying Tokens

These examples show some ways that you can modify the token stream. The general pattern we will apply here is to get the token stream from `tokenize()`, modify it in some way, and convert it back to a bytes string with [`untokenize()`](untokenize). When new tokens are added, [`untokenize()`](untokenize) does not maintain whitespace between tokens in a human-readable way. Doing this is possible by keeping track of column offsets, but we will not bother with it here except where it is convenient. See the discussion in the [`untokenize()`](untokenize) section.

### Converting `^` to `**`

Python's syntax uses `**` for exponentiation, although many might expect it to use `^` instead. `^` is actually the [XOR operator](https://docs.python.org/3/reference/expressions.html#binary-bitwise-operations).

```py
>>> bin(0b101 ^ 0b001)
'0b100'
```

Suppose you don't care about XOR, and want to allow `^` to represent exponentiation. You might think to use the `ast` module and replace [`BitXor`](https://greentreesnakes.readthedocs.io/en/latest/nodes.html#BitXor) nodes with [`Pow`](https://greentreesnakes.readthedocs.io/en/latest/nodes.html#Pow), but this will not work, because `^` has a different precedence than `**`:

```py
>>> import ast
>>> ast.dump(ast.parse('x**2 + 1')) # doctest: +SKIP35, +SKIP36, +SKIP37, +SKIP38
"Module(body=[Expr(value=BinOp(left=BinOp(left=Name(id='x', ctx=Load()), op=Pow(), right=Constant(value=2)), op=Add(), right=Constant(value=1)))], type_ignores=[])"
>>> ast.dump(ast.parse('x^2 + 1')) # doctest: +SKIP35, +SKIP36, +SKIP37, +SKIP38
"Module(body=[Expr(value=BinOp(left=Name(id='x', ctx=Load()), op=BitXor(), right=BinOp(left=Constant(value=2), op=Add(), right=Constant(value=1))))], type_ignores=[])"
```

This is difficult to read, but it basically says that `x**2 + 1` is parsed like `(x**2) + 1` and `x^2 + 1` is parsed like `x^(2 + 1)`. There's no way to distinguish the two in the AST representation, because it does not keep track of redundant parentheses.

We could do a simple `s.replace('^', '**')`, but this would [also replace](regular-expressions) any occurrences of `^` in strings and comments. Instead, we can use `tokenize`. The replacement is quite easy to do:

```py
>>> def xor_to_pow(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.exact_type == tokenize.CIRCUMFLEX: # CIRCUMFLEX is ^
...             result.append((tokenize.OP, '**'))
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
>>> xor_to_pow('x^2 + 1')
'x**2 +1 '
```

Because we are replacing a 1-character token with a 2-character token, [`untokenize()`](untokenize) removes the original whitespace and replaces it with its own. An exercise for the reader is to redefine the column offsets for the new token and all subsequent tokens on that line to avoid this issue.

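Despite the odd spacing, the result is still perfectly valid Python. For instance, it can be evaluated:

```py
>>> eval(xor_to_pow('x^2 + 1'), {'x': 3})
10
```
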
(wrapping-floats-with-decimal-decimal)=
### Wrapping floats with `decimal.Decimal`

This example is modified from the [example in the standard library docs](https://docs.python.org/3/library/tokenize.html#examples) for `tokenize`. It is a good example for modifying tokens because the logic is not too complex, and it is something that is not possible to do with other tools such as the `ast` module, because `ast` does not keep the full precision of floats as they are in the input.

```py
>>> def float_to_decimal(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         # A float is a NUMBER token with a . or e (scientific notation)
...         if tok.type == tokenize.NUMBER and ('.' in tok.string or 'e' in tok.string.lower()):
...             result.extend([
...                 (tokenize.NAME, 'Decimal'),
...                 (tokenize.OP, '('),
...                 (tokenize.STRING, repr(tok.string)),
...                 (tokenize.OP, ')')
...             ])
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
```

It works like this:

```py
>>> 1e-1000 + 1.000000000000000000000000000000001
1.0
>>> float_to_decimal('1e-1000 + 1.000000000000000000000000000000001')
"Decimal ('1e-1000')+Decimal ('1.000000000000000000000000000000001')"
```

Notice that because new tokens were added as length 2 tuples, the whitespace of the result is not the same as the input, and does not really follow [PEP 8](https://www.python.org/dev/peps/pep-0008/).

The transformed code can produce arbitrary precision decimals. Note that the `decimal` module still requires setting the context precision high enough to avoid rounding the input. An exercise for the reader is to extend `float_to_decimal` to determine the required precision automatically. The 1001-digit result is abbreviated below.

```py
>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 1001
>>> eval(float_to_decimal('1e-1000 + 1.000000000000000000000000000000001')) # doctest: +ELLIPSIS
Decimal('1.000...0001')
```

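As an aside, this is a good place to see why the `ast` module cannot do this transformation: by the time a float literal reaches the AST, it has already been converted to an ordinary `float`, so the extra precision is gone.

```py
>>> import ast
>>> ast.literal_eval('1.000000000000000000000000000000001')
1.0
>>> ast.literal_eval('1e-1000')
0.0
```
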
(extending-pythons-syntax)=
### Extending Python's Syntax

Because `tokenize()` emits [`ERRORTOKEN`](errortoken) on any unrecognized operators, it can be used to add extensions to the Python syntax. This can be challenging to do in general, as you may need to do significant parsing of the tokens to ensure that your new "operator" has the correct precedence.

You can find some more advanced examples of extending Python's syntax in SymPy's [parser module](https://github.com/sympy/sympy/blob/master/sympy/parsing/sympy_parser.py), for example, implicit multiplication (`x y` ⮕ `x*y`), implicit function application (`sin x` ⮕ `sin(x)`), factorial notation (`x!` ⮕ `factorial(x)`), and more.

#### Emoji Math

The example below is relatively simple. It allows the "emoji" mathematical symbols ➕, ➖, ➗, and ✖ to be used instead of their ASCII counterparts.

```py
>>> emoji_map = {
...     '➕': '+',
...     '➖': '-',
...     '➗': '/',
...     '✖': '*',
... }
>>> def emoji_math(s):
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.type == tokenize.ERRORTOKEN and tok.string in emoji_map:
...             new_tok = (tokenize.OP, emoji_map[tok.string], *tok[2:])
...             result.append(new_tok)
...         else:
...             result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
>>> emoji_math('1 ➕ 2 ➖ 3➗4✖5')
'1 + 2 - 3/4*5'
```

Because we are replacing a single character with a single character, we can use 5-tuples and keep the column offsets intact, making [`untokenize()`](untokenize) maintain the whitespace of the input.

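Since the transformed source is ordinary Python, it can be evaluated directly:

```py
>>> eval(emoji_math('1 ➕ 2 ➖ 3➗4✖5'))
-0.75
```
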
```{note}
These emoji may often appear as two characters, for instance, ✖ may often appear instead as ✖️, which is ✖ (`HEAVY MULTIPLICATION X`) + (`VARIATION SELECTOR-16`). The `VARIATION SELECTOR-16` is an invisible character which forces it to render as an emoji. The above example does not include the `VARIATION SELECTOR-16`. An exercise for the reader is to modify the above function to work with this.
```

(backporting-underscores)=
#### Backporting Underscores in Numeric Literals

Python 3.6 added a new syntactic feature that allows [underscores in numeric literals](https://docs.python.org/3/whatsnew/3.6.html#pep-515-underscores-in-numeric-literals). To quote the docs, "single underscores are allowed between digits and after any base specifier. Leading, trailing, or multiple underscores in a row are not allowed." For example,

```py
>>> # Python 3.6+ only
>>> 123_456 # doctest: +SKIP35
123456
```

We can write a function using `tokenize` to backport this feature to Python 3.5. Some experimentation shows how the new literals tokenize in Python 3.5:

- `123_456` tokenizes as [`NUMBER`](number) (`123`) and [`NAME`](name) (`_456`). In general, underscores in the middle of a number will tokenize like this.
- Additionally, underscores are allowed after base specifiers, like `0x_1`. This also tokenizes as [`NUMBER`](number) (`0`) and [`NAME`](name) (`x_1`). Since [`NUMBER`](number) and [`NAME`](name) tokens cannot appear next to one another in valid Python, we can simply combine them when they do.
- Finally, if an underscore appears before the `.` in a floating point literal, like `1_2.3_4`, it will tokenize as [`NUMBER`](number) (`1`), [`NAME`](name) (`_2`), [`NUMBER`](number) (`.3`), [`NAME`](name) (`_4`).

Note that in this example, the [`ENCODING`](encoding) token allows us to access `result[-1]` unconditionally, as we know there must always be at least this token already processed before any [`NAME`](name) token. This is often a useful property of [`ENCODING`](encoding) to take advantage of.

We do some basic checks here to not allow spaces before underscores or double underscores, which are not allowed in Python 3.6. But for simplicity, this function takes a [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out) approach. Invalid syntax in Python 3.6, like `123a_bc`, will transform to something that is still invalid syntax in Python 3.5 (`123abc`).

```py
>>> import sys
>>> def underscore_literals(s):
...     if sys.version_info >= (3, 6):
...         return s
...
...     result = []
...     for tok in tokenize_string(s):
...         if tok.type == tokenize.ENCODING:
...             encoding = tok.string
...         if tok.type == tokenize.NAME and result[-1].type == tokenize.NUMBER:
...             # Check that there are no spaces between the tokens
...             # e.g., 123 _456 is not allowed, and there aren't multiple
...             # consecutive underscores, e.g., 123__456 is not allowed.
...             if result[-1].end == tok.start and '__' not in tok.string:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string.replace('_', ''),
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...         if tok.type == tokenize.NUMBER and tok.string[0] == '.' and result[-1].type == tokenize.NUMBER:
...             # Float with underscore before the ., like 1_2.0, which is 1, _2, .0
...             if result[-1].end == tok.start:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string,
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...         if tok.exact_type == tokenize.DOT and result[-1].type == tokenize.NUMBER:
...             # Like 1_2. which becomes 1, _2, .
...             if result[-1].end == tok.start:
...                 new_tok = tokenize.TokenInfo(
...                     tokenize.NUMBER,
...                     result[-1].string + tok.string,
...                     result[-1].start,
...                     tok.end,
...                     tok.line,
...                 )
...                 result[-1] = new_tok
...                 continue
...
...         result.append(tok)
...     return tokenize.untokenize(result).decode(encoding)
...
```

Note that by reusing the [`start`](start-and-end) and [`end`](start-and-end) positions, we are able to make [`untokenize()`](untokenize) keep the whitespace, even though characters were removed. [`untokenize()`](untokenize) only uses the differences between [`end`](start-and-end) and [`start`](start-and-end) to determine how many spaces to add between tokens, not their absolute values (remember that [`untokenize()`](untokenize) only requires the [`start`](start-and-end) and [`end`](start-and-end) tuples to be nondecreasing; it doesn't care if the actual column values are correct).

```py
>>> s = '1_0 + 0b_101 + 0o_1_0 + 0x_a - 1.0_0 + 1e1 + 1.0_0j + 1_2.3_4 + 1_2.'
>>> # In Python 3.5
>>> underscore_literals(s) # doctest: +ONLY35
'10 + 0b101 + 0o10 + 0xa - 1.00 + 1e1 + 1.00j + 12.34 + 12.'
```