This post shows how to write a function that computes the MD5 sum of a file using the hashlib module and the with statement, while staying memory efficient by reading the file in chunks instead of loading it entirely into memory.
    from __future__ import with_statement
    from hashlib import md5

    def md5sum(filename, buf_size=8192):
        m = md5()
        # the with statement makes sure the file will be closed
        with open(filename, 'rb') as f:
            # We read the file in small chunks until EOF
            data = f.read(buf_size)
            while data:
                # We add the data to the md5 hash
                m.update(data)
                data = f.read(buf_size)
        # We return the md5 hash in hexadecimal format
        return m.hexdigest()

    if __name__ == '__main__':
        import sys
        # sys.argv[0] is the script name; the file to hash is the first argument
        print(md5sum(sys.argv[1]))
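The same chunked read can also be written with the two-argument form of iter(), which keeps calling a function until it returns a sentinel value. This is only a sketch of an alternative spelling, not a faster method; the name md5sum_iter is chosen here for illustration.

```python
from functools import partial
from hashlib import md5


def md5sum_iter(filename, buf_size=8192):
    """Return the hex MD5 digest of a file, reading buf_size bytes at a time."""
    m = md5()
    with open(filename, 'rb') as f:
        # iter(callable, sentinel) calls f.read(buf_size) repeatedly and
        # stops when it returns the sentinel b'' (end of file).
        for chunk in iter(partial(f.read, buf_size), b''):
            m.update(chunk)
    return m.hexdigest()
```

Only one chunk of at most buf_size bytes is held in memory at a time, just as in the while loop version.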
Now let's see how quick it is against the real md5sum, using a 10 GB test file!
The real md5sum:
    $ time md5sum /data/testfile
    b215f7bf5b09fa3e9848a6a66f3f3172 /data/testfile

    real    0m31.148s
    user    0m27.738s
    sys     0m3.408s
The Python version of md5sum:
    $ time python md5sum.py /data/testfile
    b215f7bf5b09fa3e9848a6a66f3f3172

    real    0m27.791s
    user    0m24.514s
    sys     0m3.276s
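The Python-based version is a little over 3 seconds quicker than the C-based version in this run (27.8 s versus 31.1 s of real time)! Keep in mind that hashlib does the actual hashing in C, so the Python code is only driving the read loop.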
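To repeat this kind of measurement from inside Python, for example to try different buffer sizes, something like the following rough sketch could be used. The helper name timed_md5 is hypothetical, and wall-clock timing with time.time() is an assumption made for simplicity; it is not how the figures above were obtained.

```python
import time
from hashlib import md5


def timed_md5(filename, buf_size=8192):
    """Return (hex digest, elapsed seconds) for a chunked MD5 of filename."""
    m = md5()
    start = time.time()
    with open(filename, 'rb') as f:
        data = f.read(buf_size)
        while data:
            m.update(data)
            data = f.read(buf_size)
    # Elapsed wall-clock time, comparable to the "real" line printed by time
    return m.hexdigest(), time.time() - start
```

Calling timed_md5('/data/testfile', buf_size=65536), for instance, would show whether a larger buffer makes a difference on a given machine. Repeated runs on the same file may be served from the operating system's cache, so a single run is only a rough indication.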