This post shows how to write a function that computes the MD5 sum of a file using the hashlib module and the with statement, while staying memory efficient by not reading the whole file into memory.
from __future__ import with_statement
from hashlib import md5

def md5sum(filename, buf_size=8192):
    m = md5()
    # the with statement makes sure the file will be closed
    with open(filename, 'rb') as f:
        # We read the file in small chunks until EOF
        data = f.read(buf_size)
        while data:
            # We add the data to the md5 hash
            m.update(data)
            data = f.read(buf_size)
    # We return the md5 hash in hexadecimal format
    return m.hexdigest()

if __name__ == '__main__':
    import sys
    print md5sum(sys.argv[1])
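The same chunked-read technique carries over to Python 3 with only small changes. Here is a minimal sketch (not part of the original post) that replaces the explicit while loop with iter() and a sentinel:

from hashlib import md5

def md5sum(filename, buf_size=8192):
    m = md5()
    with open(filename, 'rb') as f:
        # iter() with a b'' sentinel keeps calling f.read(buf_size)
        # until it returns an empty bytes object at EOF
        for chunk in iter(lambda: f.read(buf_size), b''):
            m.update(chunk)
    return m.hexdigest()

if __name__ == '__main__':
    import sys
    print(md5sum(sys.argv[1]))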
Now let’s see how fast it is compared to the real md5sum, using a 10 GB test file!
The real md5sum:
$ time md5sum /data/testfile
b215f7bf5b09fa3e9848a6a66f3f3172 /data/testfile
real 0m31.148s
user 0m27.738s
sys 0m3.408s
The Python version of md5sum:
$ time python md5sum.py /data/testfile
b215f7bf5b09fa3e9848a6a66f3f3172
real 0m27.791s
user 0m24.514s
sys 0m3.276s
The Python-based version is more than 3 seconds quicker than the C-based version!