Skip to main content

Moving Away from Python 2

About a month ago I tweeted this:

EDIT: Some people have started working on making this happen. See https://python3statement.github.io/.

For those of you who don't know, Python 2.7 is slated to reach end-of-life in 2020 (originally, it was slated to end in 2015, but it was extended in 2014, due to the extraordinary difficulty of moving to a newer version). "End-of-life" means absolutely no more support from the core Python team, even for security updates.

I'm writing this post because I want to clarify why I think this should be done, and to clear up some misconceptions, the primary one being that this represents library developers being antagonistic against those who want or have to use Python 2.

I'm writing this from my perspective as a library developer. I'm the lead developer of SymPy, and I have sympathies for developers of other libraries.1 I say this because my idea may seem a bit in tension with "users" (even though I hate the "developer/user" distinction).

Python 2

There are a few reasons why I think libraries should drop (and announce that they will drop) Python 2 support by 2020 (actually earlier, say 2018 or 2019, depending on how core the library is).

First, library developers have to be the leaders here. This is apparent from the historical move to Python 3 up to this point. Consider the three (not necessarily disjoint) classes of people: CPython core developers, library developers, and users. The core developers were the first to move to Python 3, since they were the ones who wrote it. They were also the ones who provided the messaging around Python 3, which has varied over time. In my opinion, it should have been and should be more forceful.2 Then you have the library developers and the users. A chief difference here is that users are probably going to be using only one version of Python. In order for them to switch that version to Python 3, all the libraries that they use need to support it. This took some time, since library developers saw little impetus to support Python 3 when no one was using it (Catch 22), and to worsen the situation, versions of Python older than 2.6 made single codebase compatibility almost impossible.

Today, though, almost all libraries support Python 3, and we're reaching a point where those that don't have forks that do.

But it only happened after the library developers transitioned. I believe libraries need to be the leaders in moving away from Python 2 as well. It's important to do this for a few reasons:

  • Python 2.7 support ends in 2020. That means all updates, including security updates. For all intents and purposes, Python 2.7 becomes an insecure language to use at that point in time.

  • Supporting two major versions of Python is technical debt for every project that does it. While writing cross compatible code is easier than ever, it still remains true that you have to remember to add __future__ imports to the top of every file, to import all relevant builtins from your compatibility file or library, and to run all your tests in both Python 2 and 3. Supporting both versions is a major cognitive burden to library developers, as they always have to be aware of important differences in the two languages. Developers on any library that does anything with strings will need to understand how things work in both Python 2 and 3, and the often obscure workarounds required for things to work in both (pop quiz: how do you write Unicode characters to a file in a Python 2/3 compatible way?).

  • Some of Python 3's new syntax features (i.e., features that are impossible to use in Python 2) only matter for library developers. A great example of this is keyword-only arguments. From an API standpoint, almost every instance of keyword arguments should be implemented as keyword-only arguments. This avoids mistakes that come from the antipattern of passing keyword arguments without naming the keyword, and allows the argspec of the function to be expanded in the future without breaking API.3

The second reason I think library developers should agree to drop Python 2 support by 2020 is completely selfish. A response that I heard on that tweet (as well as elsewhere), was that libraries should provide carrots, not sticks. In other words, instead of forcing people off of Python 2, we should make them want to come to Python 3. There are some issues with this argument. First, Python 3 already has tons of carrots. Honestly, not being terrible at Unicode ought to be a carrot in its own right.4

If you don't deal with strings, or do but don't care about those silly foreigners with weird accents in their names, there are other major carrots as well. For SymPy, the fact that 1/2 gives 0 in Python 2 has historically been a major source of frustration for new users. Imagine writing out 1/2*x + x**(1/2)*y*z - 3*z**2 and wondering why half of what you wrote just "disappeared" (granted, this was worse before we fixed the printers). While integer/integer not giving a rational number is a major gotcha for SymPy, giving a float is infinitely better than giving what is effectively the wrong answer. Don't use strings or integers? I've got more.

Frankly, if these "carrots" haven't convinced you yet, then I'll wager you're not really the sort of person who is persuaded by carrots.

Second, some "carrots" are impossible unless they are implemented in libraries. While some features can be implemented in 2/3 compatible code and only work in Python 3 (such as @ matrix multiplication), others, such as keyword-only arguments, can only be implemented in code that does not support Python 2. Supporting them in Python 2 would be a net deficit of technical debt (one can imagine, for instance, trying to support keyword-only arguments manually using **kwargs, or by using some monstrous meta-programming).

Third, as I said, I'm selfish. Python 3 does have carrots, and I want them. As long as I have to support Python 2 in my code, I can't use keyword-only arguments, or extended argument unpacking, or async/await, or any of the dozens of features that can't be used in cross compatible code.

A counterargument might be that instead of blocking users of existing libraries, developers should create new libraries which are Python 3-only and make use of new exciting features of Python 3 there. I agree we should do that, but existing libraries are good too. I don't see why developers should throw out all of a well-developed library just so they can use some Python features that they are excited about.

Legacy Python

A lot of people have taken to calling Python 2 "legacy Python". This phrase is often used condescendingly and angers a lot of people (and indeed, this blog post is the first time I've used it myself). However, I think Python 2 really should be seen this way, as a "legacy" system. If you want to use it, for whatever your reasons, that's fine, but just as you shouldn't expect to get any of the newest features of Python, you shouldn't expect to be able to use the newest versions of your libraries. Those libraries that have a lot of development resources may choose to support older Python 2-compatible versions with bug and/or security fixes. Python 2 itself will be supported for these until 2020. Those without resources probably won't (keep in mind that you're using open source libraries without paying money for them).

I get that some people have to use Python 2, for whatever reasons. But using outdated software comes at a cost. Libraries have borne this technical debt for the most part thus far, but they shouldn't be expected to bear it forever. The debt will only increase, especially as the technical opportunity cost, if you will, of not being able to use newer and shinier versions of Python 3 grows. The burden will have to shift at some point. Those with the financial resources may choose to offload this debt to others,5 say, by backporting features or bugfixes to older library versions that support Python 2 (or by helping to move code to Python 3).

I want to end by pointing out that if you are, for whatever reason, still using Python 2, you may be worried that if libraries become Python 3-only and start using Python 3 features, won't that break your code? The answer is no. Assuming package maintainers mark the metadata on their packages correctly, tools like pip and conda will not install non-Python 2 compatible versions into Python 2.

If you haven't transitioned yet, and want to know more, a good place to start is the official docs. I also highly recommend using conda environments, as it will make it easy to separate your Python 2 code from your Python 3 code.

Footnotes


  1. With that being said, the opinions here are entirely my own, and are don't necessarily represent those of other people, nor do they represent official SymPy policy (no decisions have been made by the community about this at this time). 

  2. It often feels like core Python itself doesn't really want people to use Python 3. It's little things, like docs links that redirect to Python 2, or PEP 394, which still says that the python should always point to Python 2. 

  3. In Swift, Apple's new language for iOS and OS X, function parameter names are effectively "keyword-only" by default

  4. As an example of this, in conda, if you use Python 2 in the root environment, then installing into a path with non-ASCII characters is unsupported. This is common on Windows, because Windows by default uses the user's full name as the username, and the default conda install path is in the user directory.

    This is unsupported except in Python 3, because to fix the issue, every single place in conda where a string appears would have to be changed to use a unicode string in Python 2. The basic issue is that things like 'π' + u'i' raise UnicodeDecodeError in Python 2 (even though 'π' + 'i', u'π' + 'i', and u'π' + u'i' all work fine). You can read a more in-depth description of the problem here. Incidentally, this is also why you should never use from __future__ import unicode_literals in Python 2, in my opinion.

    I no longer work on conda, but as far as I know, the issue remains unfixed. Of course, this whole thing works just fine if conda is run in Python 3. 

  5. If that legitimately interests you, I hear Continuum may be able to help you. 

What happens when you mess with hashing in Python

This post is based off a Jupyter notebook I made in 2013. You can download the original here. That notebook was based off a wiki page on the SymPy wiki, which in turn was based on a message to the SymPy mailing list.

What is hashing?

Before we start, let's have a brief introduction to hashing. A hash function is a function that maps a set of objects to a set of integers. There are many kinds of hash functions, which satisfy many different properties, but the most important property that must be satisfied by any hash function is that it be a function (in the mathematical sense), that is, if two objects are equal, then their hash should also be equal.

Usually, the set of integers that the hash function maps to is much smaller than the set of objects, so that there will be multiple objects that hash to the same value. However, generally for a hash function to be useful, the set of integers should be large enough, and the hash function well distributed enough that if two objects hash to the same value, then they are very likely to be equal.

To summarize, a hash function must satisfy the property:

  • If two objects are equal, then their hashes should be equal.

Additionally, a good hash function should satisfy the property:

  • If two objects have the same hash, then they are likely to be the same object.

Since there are generally more possible objects than hash values, two objects may hash to the same value. This is called a hash collision, and anything that deals with hashes should be able to deal with them.

This won't be discussed here, but an additional property that a good hash function should satisfy to be useful is this:

  • The hash of an object should be cheap to compute.

What is it used for?

If we have a hash function that satisfies the above properties, then we can use it to create from a collection of objects something called a hash table. Suppose we have a collection of objects, and given any object, we want to be able to compute very quickly if that object belongs to our collection. We could store these objects in an ordered array, but then to determine if it is in the array, we would have to search potentially through every element of the array (in other words, an \(O(n)\)) algorithm.

With hashing, we can do better. We create what is known as a hash table. Instead of storing the objects in an ordered array, we create an array of buckets, each corresponding to some hash values. We then hash each object, and store it into the array corresponding to its hash value (if there are more hash values than buckets, we distribute them using a second hash function, which can be as simple as taking the modulus with respect to the number of buckets, % n).

This image from Wikipedia shows an example.

img

To determine if an object is in a hash table, we only have to hash the object, and look in the bucket corresponding to that hash. This is an \(O(1)\) algorithm, assuming we have a good hash function, because each bucket will generally hold very few objects, possibly even none.

Note: there are some additional things that need to be done to handle hash collisions, but the basic idea is the same, and as long as there aren't too many hash collisions, which should happen if hash values are evenly distributed and the size of the hash table is large compared to the number of objects stored in it, the average time to determine if an object is in the hash table is still \(O(1)\).

Hashing in Python

Python has a built in function that performs a hash called hash(). For many objects, the hash is not very surprising. Note, the hashes you see below may not be the same ones you see if you run the examples, because Python hashing depends on the architecture of the machine you are running on, and, in newer versions of Python, hashes are randomized for security purposes.

>>> hash(10)
10
>>> hash(()) # An empty tuple
3527539
>>> hash('a')
12416037344

In Python, not all objects are hashable. For example

>>> hash([]) # An empty list
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

This is because Python has an additional restriction on hashing:

  • In order for an object to be hashable, it must be immutable.

This is important basically because we want the hash of an object to remain the same across the object's lifetime. But if we have a mutable object, then that object itself can change over its lifetime. But then according to our first bullet point above, that object's hash has to change too.

This restriction simplifies hash tables. If we allowed an object's hash to change while it is in a hash table, we would have to move it to a different bucket. Not only is this costly, but the hash table would have to notice that this happened; the object itself doesn't know that it is sitting in a hash table, at least not in the Python implementation.

In Python, there are two objects that correspond to hash tables, dict and set. A dict is a special kind of hash table called an associative array. An associative array is a hash table where each element of the hash table points to another object. The other object itself is not hashed.

Think of an associative array as a generalization of a regular array (like a list). In a list, objects are associated to nonnegative integer indices, like

>>> l = ['a', 'b', 7]
>>> l[0]
'a'
>>> l[2]
7

In an associative array (i.e., a dict) we can index objects by anything, so long as the key is hashable.

>>> d = {0: 'a', 'hello': ['world']}
>>> d[0]
'a'
>>> d['hello']
['world']

Note that only the keys need to be hashable. The values can be anything, even unhashable objects like lists.

The uses for associative arrays are boundless. dict is one of the most useful data types in the Python language. Some example uses are

  • Extension of list with "missing values". For example, {0: 'a', 2: 7} would correspond to the above list l with the value 'b' corresponding to the key 1 removed.

  • Representation of a mathematical function with a finite domain.

  • A poor-man's database (the Wikipedia image above is an associative array mapping names to telephone numbers).

  • Implementing a Pythonic version of the switch-case statement.

The other type of hash table, set, more closely matches the definition I gave above for a hash table. A set is just a container of hashable objects. sets are unordered, and can only contain one of each object (this is why they are called "sets," because this matches the mathematical definition of a set).

In Python 2.7 or later, you can create a set with { and }, like {a, b, c}. Otherwise, use set([a, b, c]).

>>> s = {0, (), '2'}
>>> s
{0, '2', ()}
>>> s.add(1)
>>> s
{0, 1, '2', ()}
>>> s.add(0)
>>> s
{0, 1, '2', ()}

A final note: set and dict are themselves mutable, and hence not hashable! There is an immutable version of set called frozenset. There are no immutable dictionaries.

>>> f = frozenset([0, (), '2'])
>>> f
frozenset({0, '2', ()})
>>> hash(f)
-7776452922777075760
>>> # A frozenset, unlike a set, can be used as a dictionary key
>>> d[f] = 'a set'
>>> d
{0: 'a', frozenset({0, '2', ()}): 'a set', 'hello': ['world']}

Creating your own hashable objects

Before we move on, there is one final thing we need to know about hashing in Python, which is how to create hashes for custom objects. By default, if we create an object, it will be hashable.

>>> class Nothing(object):
...     pass
...
>>> N = Nothing()
>>> hash(N)
270498113

Implementation-wise, the hash is just the object's id, which corresponds to its position in memory. This satisfies the above conditions: it is (extremely) cheap to compute, and since by default objects in Python compare unequal to one another, objects with different hashes will be unequal.

>>> M = Nothing()
>>> M == N
False
>>> hash(M)
270498117
>>> hash(M) == hash(N)
False

To define a hash function for an object, define the __hash__ method.

>>> class HashToOne(object):
...     def __hash__(self):
...         return 1
...
>>> HTO = HashToOne()
>>> hash(HTO)
1

To set an object as not hashable, set __hash__ to None.

>>> class NotHashable(object):
...     __hash__ = None
...
>>> NH = NotHashable()
>>> hash(NH)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'NotHashable'

Finally, to override the equality operator ==, define __eq__.

>>> class AlwaysEqual(object):
...     def __eq__(self, other):
...         if isinstance(other, AlwaysEqual):
...             return True
...         return False
...
>>> AE1 = AlwaysEqual()
>>> AE2 = AlwaysEqual()
>>> AE1 == AE2
True

One of the key points that I hope you will take away from this post is that if you override __eq__ and you want a hashable object, you must also override __hash__ to agree. Note that Python 3 will actually require this: in Python 3, if you override __eq__, it automatically sets __hash__ to None, making the object unhashable. You need to manually override __hash__ to make it hashable again. But that's as far as Python goes in enforcing these rules, as we will see below. In particular, Python will never actually check that your __hash__ actually agrees with your __eq__.

Messing with hashing

Now to the fun stuff. What happens if we break some of the invariants that Python expects of hashing. Python expects two key invariants to hold

  1. The hash of an object does not change across the object's lifetime (in other words, a hashable object should be immutable).

  2. a == b implies hash(a) == hash(b) (note that the reverse might not hold in the case of a hash collision).

As we shall see, Python expects, but does not enforce either of these.

Example 1: Mutating a hash

Let's break rule 1 first. Let's create an object with a hash, and then change that object's hash over its lifetime, and see what sorts of things can happen.

>>> class Bad(object):
...     def __init__(self, hash): # The object's hash will be hash
...         self.hash = hash
...     def __hash__(self):
...         return self.hash
...
>>> b = Bad(1)
>>> hash(b)
1
>>> d = {b:42}
>>> d[b]
42
>>> b.hash = 2
>>> hash(b)
2
>>> d[b]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: <__main__.Bad object at 0x1047e7438>

Here, we implicitly changed the hash of b by mutating the attribute of b that is used to compute the hash. As a result, the object is no longer found in a dictionary, which uses the hash to find the object.

The object is still there, we just can't access it any more.

>>> d
{<__main__.Bad object at 0x1047e7438>: 42}

Note that Python doesn't prevent me from doing this. We could make it if we want (e.g., by making __setattr__ raise AttributeError), but even then we could forcibly change it by modifying the object's __dict__. We could try some more fancy things using descriptors, metaclasses, and/or __getattribute__, but even then, if we knew what was happening, we could probably find a way to change it.

This is what is meant when people say that Python is a "consenting adults" language. You are expected to not try to break things, but generally aren't prevented from doing so if you try.

Example 2: More mutation

Let's try something even more crazy. Let's make an object that hashes to a different value each time we look at the hash.

>>> class DifferentHash(object):
...     def __init__(self):
...         self.hashcounter = 0
...     def __hash__(self):
...         self.hashcounter += 1
...         return self.hashcounter
...
>>> DH = DifferentHash()
>>> hash(DH)
1
>>> hash(DH)
2
>>> hash(DH)
3

Obviously, if we use DH as a key to a dictionary, then it will not work, because we will run into the same issue we had with Bad. But what about putting DH in a set.

>>> DHset = {DH, DH, DH}
>>> DHset
{<__main__.DifferentHash at 0x101f79f50>,
 <__main__.DifferentHash at 0x101f79f50>,
 <__main__.DifferentHash at 0x101f79f50>}

Woah! We put the exact same object in a set three times, and it appeared all three times. This is not what is supposed to happen with a set.

>>> {1, 1, 1}
{1}

What happens when we do stuff with DHset?

>>> DHset.remove(DH)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: <__main__.DifferentHash object at 0x1047e75f8>

That didn't work, because set.remove searches for an object by its hash, which is different by this point.

Now let's make a copy of DHset. The set.copy method will create a shallow copy (meaning that the set container itself will be different, according to is comparison, but the objects themselves will the same, according to is comparison).

>>> DHset2 = DHset.copy()
>>> DHset2 == DHset
True

Everything is fine so far. This object is only going to cause trouble if something recomputes its hash. But remember that the whole reason that we had trouble with something like Bad above is that Python doesn't recompute that hash of an object, unless it has to. So let's do something that will force it to do so: let's pop an object from one of the sets and add it back in.

>>> D = DHset.pop()
>>> DHset.add(D)
>>> DHset
{<__main__.DifferentHash at 0x101f79f50>,
 <__main__.DifferentHash at 0x101f79f50>,
 <__main__.DifferentHash at 0x101f79f50>}
>>> DHset2
{<__main__.DifferentHash at 0x101f79f50>,
 <__main__.DifferentHash at 0x101f79f50>,
 <__main__.DifferentHash at 0x101f79f50>}
>>> DHset == DHset2
False

There we go. By removing it from the set, we made the set forget about its hash, so it had to be recomputed when we added it again. This version of DHset now has a DH with a different hash than it had before. Thinking back to set being a hash table, in this DHset, the three DH objects are in different "buckets" than they were in before. DHset.__eq__(DHset2) notices that the bucket structure is different right away and returns False.

By the way, what hash value are we up to these days?

>>> hash(DH)
9

Example 3: When a == b does not imply hash(a) == hash(b)

Now let's look at point 2. What happens if we create an object with __eq__ that disagrees with __hash__. For Python 2, we already have made a class like this, the AlwaysEqual object above. Instances of AlwaysEqual will always compare equal to one another, but they will not have the same hash, because they will use object's default __hash__ of id. For Python 3, however, AlwaysEqual is automatically set as unhashable because we overrode __eq__ without also overriding __hash__.

This blog post originally used the AE1 and AE2 objects we created above for the next example, but to make it work in both Python 2 and 3, let's create a custom AlwaysEqual subclass that is hashable.

>>> class AlwaysEqualHashable(AlwaysEqual):
...     def __hash__(self):
...         return id(self)
...
>>> AE1 = AlwaysEqualHashable()
>>> AE2 = AlwaysEqualHashable()
>>> hash(AE1)
270498221
>>> hash(AE2)
270498197
>>> hash(AE1) == hash(AE2)
False
>>> AE1 == AE2
True
>>> {AE1, AE2}
{<__main__.AlwaysEqualHashable at 0x101f79950>,
 <__main__.AlwaysEqualHashable at 0x101f79ad0>}

We can already see that we have broken one of the key properties of a set, which is that it does not contain the same object twice (remember that AE1 and AE2 should be considered the "same object" because AE1 == AE2 is True).

This can lead to subtle issues. For example, suppose we had a list and we wanted to remove all the duplicate items from it. An easy way to do this is to convert the list to a set and then convert it back to a list.

>>> l = ['a', 'a', 'c', 'a', 'c', 'b']
>>> list(set(l))
['a', 'b', 'c']

Now, this method is obviously not going to work for a list of AlwaysEqualHashable objects.

>>> AE3 = AlwaysEqualHashable()
>>> l = [AE1, AE1, AE3, AE2, AE3]
>>> list(set(l))
[<__main__.AlwaysEqualHashable at 0x102c1d590>,
 <__main__.AlwaysEqualHashable at 0x101f79ad0>,
 <__main__.AlwaysEqualHashable at 0x101f79950>]

Actually, what happened here is that the equality that we defined on AlwaysEqual was essentially ignored. We got a list of unique items by id, instead of by __eq__. You can imagine that if __eq__ were something a little less trivial, where some, but not all, objects are considered equal, that this could lead to very subtle issues.

But there is an issue with the above algorithm. It isn't stable, that is, it removes the ordering that we had on the list. We could do this better by making a new list, and looping through the old one, adding elements to the new list if they aren't already there.

>>> def uniq(l):
...     newl = []
...     for i in l:
...         if i not in newl:
...             newl.append(i)
...     return newl
...
>>> uniq(['a', 'a', 'c', 'a', 'c', 'b'])
['a', 'c', 'b']
>>> uniq([AE1, AE1, AE3, AE2, AE3])
[<__main__.AlwaysEqualHashable at 0x101f79ad0>]

This time, we used in, which uses ==, so we got only one unique element of the list of AlwaysEqual objects.

But there is an issue with this algorithm as well. Checking if something is in a list is \(O(n)\), but we have an object that allows checking in \(O(1)\) time, namely, a set. So a more efficient version might be to create a set alongside the new list for containment checking purposes.

>>> def uniq2(l):
...     newl = []
...     newlset = set()
...     for i in l:
...         if i not in newlset:
...             newl.append(i)
...             newlset.add(i)
...     return newl
...
>>> uniq2(['a', 'a', 'c', 'a', 'c', 'b'])
['a', 'c', 'b']
>>> uniq2([AE1, AE1, AE3, AE2, AE3])
[<__main__.AlwaysEqualHashable at 0x101f79ad0>,
 <__main__.AlwaysEqualHashable at 0x102c1d590>,
 <__main__.AlwaysEqualHashable at 0x101f79950>]

Bah! Since we used a set, we compared by hashing, not equality, so we are left with three objects again. Notice the extremely subtle difference here. Basically, it is this:

>>> AE1 in {AE2}
False
>>> AE1 in [AE2]
True

Set containment uses hashing; list containment uses equality. If the two don't agree, then the result of your algorithm will depend on which one you use!

By the way, as you might expect, dictionary containment also uses hashing, and tuple containment uses equality:

>>> AE1 in {AE2: 42}
False
>>> AE1 in (AE2,)
True

Example 4: Caching hashing

If you ever want to add subtle bizarreness to a system, add some sort of caching, and then do it wrong.

As we noted in the beginning, one important property of a hash function is that it is quick to compute. A nice way to achieve this for heavily cached objects is to cache the value of the cache on the object, so that it only needs to be computed once. The pattern (which is modeled after SymPy's Basic) is something like this:

>>> class HashCache(object):
...     def __init__(self, arg):
...         self.arg = arg
...         self.hash_cache = None
...     def __hash__(self):
...         if self.hash_cache is None:
...             self.hash_cache = hash(self.arg)
...         return self.hash_cache
...     def __eq__(self, other):
...         if not isinstance(other, HashCache):
...             return False
...         return self.arg == other.arg
...

HashCache is nothing more than a small wrapper around a hashable argument, which caches its hash.

>>> hash('a')
12416037344
>>> a = HashCache('a')
>>> hash(a)
12416037344

For ordinary Python builtins, simply recomputing the hash will be faster than the attribute lookup used by HashCache. Note: This uses the %timeit magic from IPython. %timeit only works when run in IPython or Jupyter.

>>> %timeit hash('a')
10000000 loops, best of 3: 69.9 ns per loop
>>> %timeit hash(a)
1000000 loops, best of 3: 328 ns per loop

But for a custom object, computing the hash may be more computationally expensive. As hashing is supposed to agree with equality (as I hope you've realized by now!), if computing equality is expensive, computing a hash function that agrees with it might be expensive as well.

As a simple example of where this might be useful, consider a highly nested tuple, an object whose hash that is relatively expensive to compute.

>>> a = ()
>>> for i in range(1000):
...     a = (a,)
...
>>> A = HashCache(a)
>>> %timeit hash(a)
100000 loops, best of 3: 9.61 µs per loop
>>> %timeit hash(A)
1000000 loops, best of 3: 325 ns per loop

So far, we haven't done anything wrong. HashCache, as you may have noticed, has __eq__ defined correctly:

>>> HashCache(1) == HashCache(2)
False
>>> HashCache(1) == HashCache(1)
True

But what happens if we mutate a HashCache. This is different from examples 1 and 2 above, because we will be mutating what happens with equality testing, but not the hash (because of the cache).

In the below example, recall that small integers hash to themselves, so hash(1) == 1 and hash(2) == 2.

>>> a = HashCache(1)
>>> d = {a: 42}
>>> a.arg = 2
>>> hash(a)
1
>>> d[a]
42

Because we cached the hash of a, which was computed as soon as we created the dictionary d, it remained unchanged when modified the arg to be 2. Thus, we can still find the key of the dictionary. But since we have mutated a, the equality testing on it has changed. This means that, as with the previous example, we are going to have issues with dicts and sets keeping unique keys and entries (respectively).

>>> a = HashCache(1)
>>> b = HashCache(2)
>>> hash(a)
1
>>> hash(b)
2
>>> b.arg = 1
>>> a == b
True
>>> hash(a) == hash(b)
False
>>> {a, b}
{<__main__.HashCache at 0x102c32050>, <__main__.HashCache at 0x102c32450>}
>>> uniq([a, b])
[<__main__.HashCache at 0x102c32050>]
>>> uniq2([a, b])
[<__main__.HashCache at 0x102c32050>, <__main__.HashCache at 0x102c32450>]

Once we mutate b so that it compares equal to a, we start to have the same sort of issues that we had in example 3 with AlwaysEqualHashable. Let's look at an instant replay.

>>> a = HashCache(1)
>>> b = HashCache(2)
>>> b.arg = 1
>>> print(a == b)
True
>>> print(hash(a) == hash(b))
True
>>> print({a, b})
{<__main__.HashCache object at 0x102c32a10>}
>>> print(uniq([a, b]))
[<__main__.HashCache object at 0x102c32a50>]
>>> print(uniq2([a, b]))
[<__main__.HashCache object at 0x102c32a50>]

Wait a minute, this time it's different! Comparing it to above, it's pretty easy to see what was different this time. We left out the part where we showed the hash of a and b. When we did that the first time, it cached the hash of b, making it forever be 2, but when we didn't do it the second time, the hash had not been cached yet, so the first time it is computed (in the print(hash(a) == hash(b)) line), b.arg has already been changed to 1.

And herein lies the extreme subtlety: if you mutate an object with that hashes its cache like this, you will run into issues only if you had already called some function that hashed the object somewhere. Now just about anything might compute the hash of an object. Or it might not. For example, our uniq2 function computes the hash of the objects in its input list, because it stores them in a set, but uniq does not:

>>> a = HashCache(1)
>>> b = HashCache(2)
>>> uniq2([a, b])
>>> b.arg = 1
>>> print(a == b)
True
>>> print(hash(a) == hash(b))
False
>>> print({a, b})
{<__main__.HashCache object at 0x102c32c50>, <__main__.HashCache object at 0x102c32c10>}
>>> print(uniq([a, b]))
[<__main__.HashCache object at 0x102c32c50>]
>>> print(uniq2([a, b]))
[<__main__.HashCache object at 0x102c32c50>, <__main__.HashCache object at 0x102c32c10>]
>>> a = HashCache(1)
>>> b = HashCache(2)
>>> uniq([a, b])
>>> b.arg = 1
>>> print(a == b)
True
>>> print(hash(a) == hash(b))
True
>>> print({a, b})
{<__main__.HashCache object at 0x102c32c90>}
>>> print(uniq([a, b]))
[<__main__.HashCache object at 0x102c32bd0>]
>>> print(uniq2([a, b]))
[<__main__.HashCache object at 0x102c32bd0>]

The moral of this final example is that if you are going to cache something, that something had better be immutable.

Conclusion

The conclusion is this: don't mess with hashing. The two invariants above are important. Let's restate them here,

  1. The hash of an object must not change across the object's lifetime (in other words, a hashable object should be immutable).

  2. a == b implies hash(a) == hash(b) (note that the reverse might not hold in the case of a hash collision).

If you don't follow these rules, you will run into very subtle issues, because very basic Python operations expect these invariants.

If you want to be able to mutate an object's properties, you have two options. First, make the object unhashable (set __hash__ = None). You won't be able to use it in sets or as keys to a dictionary, but you will be free to change the object in-place however you want.

A second option is to make all mutable properties non-dependent on hashing or equality testing. This option works well if you just want to cache some internal state that doesn't inherently change the object. Both __eq__ and __hash__ should remain unchanged by changes to this state. You may also want to make sure you use proper getters and setters to prevent modification of internal state that equality testing and hashing does depend on.

If you choose this second option, however, be aware that Python considers it fair game to swap out two identical immutable (i.e., hashable) objects at any time. If a == b and a is hashable, Python (and Python libraries) are free to replace a with b anywhere. For example, Python uses an optimization on strings called interning, where common strings are stored only once in memory. A similar optimization is used in CPython for small integers. If store something on a but not b and make a's hash ignore that data, you may find that some function that should return a may actually return b. For this reason, I generally don't recommend this second option unless you know what you are doing.

Finally, to keep invariant 2, here are some tips:

  • Make sure that the parts of the object that you use to compare equality are not themselves mutable. If they are, then your object cannot itself be immutable. This means that if a == b depends on a.attr == b.attr, and a.attr is a list, then you will need to use a tuple instead (if you want a to be hashable).

  • You don't have to invent a hash function. If you find yourself doing bitshifts and XORs, you're doing it wrong. Reuse Python's builtin hashable objects. If the hash of your object should depend on the hash of a and b, define __hash__ to return hash((a, b)). If the order of a and b does not matter, use hash(frozenset([a, b])).

  • Don't cache something unless you know that the entire cached state will not be changed over the lifetime of the cache. Hashable objects are actually great for caches. If they properly satisfy invariant 1, and all the state that should be cached is part of the hash, then you will not need to worry. And the best part is that you can just use dict for your cache.

  • Unless you really need the performance or memory gains, don't make your objects mutable. This makes programs much harder to reason about. Some functional programming languages take this idea so far that they don't allow any mutable objects.

  • Don't worry about the situation where hash(a) == hash(b) but a != b. This is a hash collision. Unlike the issues we looked at here, hash collisions are expected and checked for in Python. For example, our HashToOne object from the beginning will always hash to 1, but different instances will compare unequal. We can see that the right thing is done in every case with them.

    >>> a = HashToOne()
    >>> b = HashToOne()
    >>> a == b
    False
    >>> hash(a) == hash(b)
    True
    >>> {a, b}
    {<__main__.HashToOne at 0x102c32a10>, <__main__.HashToOne at 0x102c32cd0>}
    >>> uniq([a, b])
    [<__main__.HashToOne at 0x102c32cd0>, <__main__.HashToOne at 0x102c32a10>]
    >>> uniq2([a, b])
    [<__main__.HashToOne at 0x102c32cd0>, <__main__.HashToOne at 0x102c32a10>]
    

    The only concern with hash collisions is that too many of them can remove the performance gains of dict and set.

  • Conversely, if you are writing something that uses an object's hash, remember that hash collisions are possible and unavoidable.

    A classic example of a hash collision is -1 and -2. Remember I mentioned above that small integers hash to themselves:

    >>> hash(1)
    1
    >>> hash(-3)
    -3
    

    The exception to this is -1. The CPython interpreter uses -1 as an error state, so -1 is not a valid hash value. Hence, hash(-1) can't be -1. So the Python developers picked the next closest thing.

    >>> hash(-1)
    -2
    >>> hash(-2)
    -2
    

    If you want to check if something handles hash collisions correctly, this is a simple example. I should also note that the fact that integers hash to themselves is an implementation detail of CPython that may not be true in alternate Python implementations.

  • Finally, we didn't discuss this much here, but don't assume that the hash of your object will be the same across Python sessions. In Python 3.3 and up, hash values of strings are randomized from a value that is seeded when Python starts up. This also affects any object whose hash is computed from the hash of strings. In Python 2.7, you can enable hash randomization with the -R flag to the interpreter. The following are two different Python sessions.

    >>> print(hash('a'))
    -7750608935454338104
    
    >>> print(hash('a'))
    8897161376854729812