Calmcode - scikit save: checksum

Checksum against Pickles


To protect ourselves against an evil .joblib file we might consider calculating a checksum of the file. A checksum is a hash value computed from the contents of a file. Two different files are extremely unlikely to share the same hash value, so comparing a file's checksum against a known-good value lets us detect whether the file has been tampered with.

The code below demonstrates how you might calculate a checksum for a file using Python.

import hashlib

def calc_checksum(path):
    """Print and return the MD5 hex digest of the file at `path`."""
    md5_hash = hashlib.md5()

    # Read the raw bytes; the hash is computed over the exact file contents.
    with open(path, "rb") as f:
        content = f.read()
    md5_hash.update(content)
    digest = md5_hash.hexdigest()
    print(digest)
    return digest

calc_checksum("pipe.joblib") # 04a415025a812c2a69cb3552d83ee275
calc_checksum("pipe-evil.joblib") # 0b119f868ac251eee25af5c4b0c2064d
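
A checksum only protects us if we compare it against a trusted value before loading the file. The sketch below is one possible way to do that, not the lesson's own code. It swaps in SHA-256 for the MD5 hash used above, since MD5 is known to be vulnerable to deliberate collisions, and it reads the file in chunks. The names `file_sha256`, `checksum_matches`, and `TRUSTED_DIGEST` are hypothetical and chosen for illustration.

```python
import hashlib
import hmac


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large file never sits fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def checksum_matches(path, expected_hex):
    """Return True only if the file's digest equals the trusted digest."""
    # hexdigest() is lowercase, so normalise the expected value first.
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(file_sha256(path), expected_hex.lower())


# Hypothetical usage, assuming TRUSTED_DIGEST was recorded when the
# pipeline was saved and is stored separately from the file:
#
# if checksum_matches("pipe.joblib", TRUSTED_DIGEST):
#     pipe = joblib.load("pipe.joblib")
# else:
#     raise ValueError("checksum mismatch, refusing to load")
```

The check happens before any unpickling takes place, which matters because loading the file is the step that can execute malicious code.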

While this approach has merit, it only works if you keep track of a known-good checksum and store it somewhere an attacker cannot change it along with the file. So we may want to consider other tactics as well.