A quick md5sum equivalent in python.

May 29, 2010  

This post will show you how to write a function to compute md5 sum of a file using the hashlib module, the with statement and being memory efficient by not reading the whole file in memory.

from __future__ import with_statement
from hashlib import md5

def md5sum(filename, buf_size=8192):
    m = md5()
    # the with statement makes sure the file will be closed
    with open(filename, 'b') as f:
        # We read the file in small chunk until EOF
        data = f.read(buf_size)
        while data:
            # We had data to the md5 hash
            m.update(data)
            data = f.read(buf_size)
    # We return the md5 hash in hexadecimal format
    return m.hexdigest()

if __name__ == '__main__':
    import sys
    print md5sum(sys.argv[1])

Now let’s see how quick it is against the real md5sum using a test file of 10Go!

The real md5sum:

$ time md5sum /data/testfile
b215f7bf5b09fa3e9848a6a66f3f3172  /data/testfile

real    0m31.148s
user    0m27.738s
sys     0m3.408s

The python version of md5sum:

$ time python md5sum.py /data/testfile
b215f7bf5b09fa3e9848a6a66f3f3172

real    0m27.791s
user    0m24.514s
sys     0m3.276s

The python based version is almost 4 seconds quicker than the C based version!