Let's assume that once again we've got a trained pipeline saved in a file called pipe.joblib. Then we can load the pipeline directly via
from joblib import load
trained = load("pipe.joblib")
trained.predict(["hello world"])
But let's consider what's happening. We're loading in a file that will be turned into a Python object. That means that arbitrary Python code might be running just by loading this. That's a huge security risk! If somebody tampered with that file, all sorts of bad things might happen.
Example
For example, the file might contain an object that behaves just like a pipeline but is running bad code as a side effect.
from joblib import dump

class EvilThing:
    def predict(self, X):
        # Side effect: a real attack would run something harmful here.
        print("fooled you!")
        return [1 for _ in X]

evil_pipe = EvilThing()
dump(evil_pipe, "pipe-evil.joblib")
The pipe-evil.joblib file will now contain malicious code. But you can still load it without realising it.
trained = load("pipe-evil.joblib")
trained.predict(["hello world"])
The example that we've shown here is relatively innocent, but a real attack would not be. Just by loading an untrusted .joblib file you can give an attacker access to the server and risk leaking data to a third party.
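To make the "just from loading" part concrete, here is a minimal sketch using the standard-library pickle module, which joblib relies on under the hood. The class name RunsOnLoad is hypothetical, and the payload is deliberately harmless. The point is that the callable returned by `__reduce__` runs during loading, before anyone calls `.predict()`.

```python
import pickle


class RunsOnLoad:
    """Hypothetical class that tells pickle how to 'rebuild' itself."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it on load.
        # Here it is a harmless eval; an attacker could pick any callable.
        return (eval, ("'code ran during load'.upper()",))


payload = pickle.dumps(RunsOnLoad())

# No method is called on the result: loading alone triggers the eval.
result = pickle.loads(payload)
print(result)
```

Note that the loaded object is not even a RunsOnLoad instance. It is whatever the callable returned, which shows how much control the file has over what happens at load time.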
Serialization Attacks
This type of security leak falls in the "serialization attack" category. These kinds of attacks abuse the fact that objects need to be loaded from disk and they can lead to a lot of damage. If you're interested in a detailed demo of such an attack, you might appreciate this YouTube video.
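Since the risk described above comes from a file being tampered with, one possible precaution is to record a checksum when you save the file and compare it before loading. The sketch below is an illustration under assumptions, not a complete defence: the helper names sha256_of and is_untampered are hypothetical, and the check only helps if the expected digest is stored somewhere an attacker cannot change.

```python
import hashlib
import hmac


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large model files don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_untampered(path, expected_digest):
    """Compare the file's current digest with the one recorded at save time."""
    return hmac.compare_digest(sha256_of(path), expected_digest)


# Hypothetical usage, assuming `expected` was recorded right after dumping:
#
#     if is_untampered("pipe.joblib", expected):
#         trained = load("pipe.joblib")
```

A matching checksum only tells you the file has not changed since the digest was recorded. It says nothing about whether the file was trustworthy in the first place.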