How to bring your own machine learning models to databases
Introduction
At MindsDB, we have focused on democratizing machine learning (ML) for a while now. Not only do we believe ML should be done at the data layer (i.e. your database), but we also think that offering a compelling automated machine learning pipeline can be of great value for people who are not ML experts.
However, our community has grown quickly these past few months, and we have seen new members that do have great machine learning expertise repeatedly ask the same question:
How can I bring my own model into the MindsDB ecosystem?
For a while, we offered an answer that was not ideal for many: “our in-house AutoML engine Lightwood is flexible enough to incorporate custom model logic, so you should look into that”. This is a detour that some ML-savvy users found counterintuitive, or simply did not have the time to learn properly.
We’re happy to announce that, as of release 22.3.2.0, you can bring your own model to MindsDB. In this blog post, we’ll show how you can leverage this beta feature to deploy a pre-existing natural language processing model written in Keras, a popular deep learning framework. As some of you may know, we love PyTorch and leverage it in our AutoML engine, so we thought using a different framework would make for a challenging test drive.
The rest of this article is structured as follows: first, we’ll explore the basics of the Bring Your Own Model (a.k.a. BYOM) feature and how you can use it. Then, we will briefly explain the machine learning model we’ll be bringing into a database. Finally, we’ll actually train the model, connect it to MindsDB and call it via SQL to get some predictions. Buckle up!
Requirements
To follow this tutorial, you will need a Python 3.7 or 3.8 virtual environment with the following packages:
MindsDB ≥ 22.3.2.0 — pip install mindsdb
Keras ≥ 2.8.0 — pip install tensorflow
MLflow ≥ 1.15.0 — pip install mlflow
NOTE: We recommend these versions as they’re the ones we used to write this post, but it’s possible that you’ll get this example to work with older versions of some/all of the packages above.
You will also need Conda, as MLflow uses it to generate reproducible environments. Additionally, the tutorial assumes you have a working SQL database instance to upload a couple of datasets, and a SQL database access tool like DBeaver to interact with it.
Finally, you should also download the data from this Kaggle competition, as we’ll be using it to train a model and then get predictions for the test split.
BYOM basics
The Bring Your Own Model feature is fairly straightforward. Even though it supports both MLflow and Ray Serve APIs, in this post we will focus on the former.
The BYOM feature assumes you have the code for some machine learning model that you would like to deploy, and interact with, from the database. If you’re using MLflow, you will need to first train and save the model with it.
Once this is done, serving the model will enable access through a REST API endpoint. The BYOM feature essentially bridges the gap between your database and the predictor without ever leaving the data layer. In practice, this means you get to call the model and access predictions as if they were just another table in your database.
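To make that bridge concrete, here is a minimal sketch of the kind of HTTP exchange MindsDB performs on your behalf. It assumes an MLflow 1.x model is already being served at the default http://localhost:5000 (we will set one up later in this post) and that it expects a column named text; the payload uses the pandas-records orientation accepted by the MLflow scoring server:

import requests

# Hypothetical direct call to the MLflow scoring endpoint; MindsDB will issue
# equivalent requests for us once the predictor is linked via SQL.
response = requests.post(
    'http://localhost:5000/invocations',
    json=[{'text': 'The tsunami is coming, seek high ground'}],
    headers={'Content-Type': 'application/json; format=pandas-records'},
)
print(response.json())  # e.g. [1]

In other words, your database never needs to speak HTTP; MindsDB translates SQL queries into calls like this one.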
Let’s bring a model!
For this tutorial, we are going to focus on the Kaggle competition “Natural Language Processing with Disaster Tweets”. The dataset contains a bunch of tweets, each with their own ID, location, and related keywords. The task is to predict whether a tweet is related to a disaster, which is denoted by the “target” column in the training set. Hence, this is a (supervised) binary classification task.
As the focus of this post is not the model itself, we’ll keep it simple and leverage the nice example model by Kaggle Grandmaster Shahules, built with the Keras deep learning framework. It works by taking pre-trained GloVe embeddings of the words in pre-processed/cleaned tweets, aggregating them, and passing them as input to an LSTM neural network that learns to gauge whether a tweet references a disaster or not.
We will not be inspecting the code for the model here (please refer to the link at the end of the article for more details), but as a first step we need to train the model and store it. Remember that, as this is a Keras model, it is really simple to achieve this:
# we train the model
model.fit(...)

# then save it
model.save(model_path)
Let’s focus on how the model should be wrapped to use MLflow. This particular use case is complex because data preprocessing is required to transform actual tweets into the initial embeddings that the model takes. These embeddings are not really meant to be pre-loaded from a feature store to a table on a per-tweet basis, but rather loaded and aggregated on demand as you trigger new inferences (otherwise, said table would be huge and highly redundant). So, when calling the predictor we actually need to remove a bunch of noise in the tweet (URLs, punctuation, emojis), tokenize the remaining words, and pad the sequence because our model expects input sequences of constant length. Whew.
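The actual helpers (preprocess_df and create_corpus) come from the Kaggle notebook and are not reproduced here, but as a rough, hypothetical sketch, the cleaning step boils down to something like this:

import re
import string

# Illustrative only: the real cleaning logic lives in the Kaggle notebook's helpers
def clean_tweet(text: str) -> str:
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # strip URLs
    text = re.sub(r'<.*?>', '', text)  # strip HTML tags
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # strip emojis / non-ASCII characters
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    return text.lower().strip()

print(clean_tweet('The tsunami is coming!! seek high ground https://t.co/xyz'))
# -> 'the tsunami is coming seek high ground'

The cleaned text is then tokenized with the Keras tokenizer fitted at training time and padded to a fixed length, which is exactly what the wrapper below does at inference time.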
To serve models with custom inference logic like the above in MLflow, we need to subclass mlflow.pyfunc.PythonModel:
class Model(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Here we fetch and expose the model and any
        # other requirements for `predict()` calls.
        # `context` will have any artifacts passed to save_model()
        ...

    def predict(self, context, model_input):
        # 1. We modify `model_input` at will
        # 2. Call the model and return predictions
        ...
As you can see, the context loader has to load whatever artifacts are required to predict, model included. This can be achieved by passing a dictionary with all relevant information. Additionally, as MLflow will build a conda environment to guarantee reproducibility, we have to specify how this environment should look:
# these will be accessible inside the Model() wrapper
artifacts = {
    'model': model_path,
    'tokenizer_path': tokenizer_path,
}

# specs for the environment that will be created when serving the model
conda_env = {
    'name': 'nlp_keras_env',
    'channels': ['defaults'],
    'dependencies': [
        'python=3.8',
        'pip',
        {
            'pip': [
                'mlflow',
                'tensorflow',
                'cloudpickle',
                'nltk',
                'pandas',
                'numpy',
                'scikit-learn',
                'tqdm',
            ],
        },
    ],
}
Now, let’s implement the final Model wrapper to replicate the pre-processing done at training, ensuring that inference works with well-formed inputs:
import pickle

import numpy as np
import mlflow
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# `preprocess_df`, `create_corpus` and `MAX_LEN` are defined in the training script


class Model(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # we use paths in the context to load everything
        self.model_path = context.artifacts['model']
        self.model = load_model(self.model_path)
        with open(context.artifacts['tokenizer_path'], 'rb') as f:
            self.tokenizer = pickle.load(f)

    def predict(self, context, model_input):
        # preprocess input, tokenize, pad, and call the model
        df = preprocess_df(model_input)
        corpus = create_corpus(df)
        sequences = self.tokenizer.texts_to_sequences(corpus)
        tweet_pad = pad_sequences(sequences,
                                  maxlen=MAX_LEN,
                                  truncating='post',
                                  padding='post')
        df = tweet_pad[:df.shape[0]]
        y_pre = self.model.predict(df)
        y_pre = np.round(y_pre).astype(int).flatten().tolist()
        return list(y_pre)
With this, we can ask MLflow to actually store the trained model in some path of our preference:
mlflow.pyfunc.save_model(
    path="nlp_kaggle",
    python_model=Model(),
    conda_env=conda_env,
    artifacts=artifacts
)
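Before serving, it can be worth reloading the saved model to check that the wrapper works end to end. A quick, hypothetical sanity check (assuming the preprocessing helpers accept a DataFrame with a text column) could look like:

import pandas as pd
import mlflow.pyfunc

# reload the model we just saved and score a single arbitrary tweet
loaded = mlflow.pyfunc.load_model('nlp_kaggle')
sample = pd.DataFrame({'text': ['Forest fire near La Ronge Sask. Canada']})
print(loaded.predict(sample))  # expected: a list with a single 0/1 label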
Finally, serving is simple. Go to the directory from which you ran the training script and execute mlflow models serve --model-uri ./nlp_kaggle. If all goes well, your model should be ready to be called! But wait, we actually wanted to do this from our database with SQL, remember? Here’s where MindsDB comes in.
USE MindsDB;
MindsDB has its own MySQL and HTTP APIs. To start it, run python -m mindsdb, which will spin up the MySQL API by default. Then, you need to add a connection to MindsDB from your SQL access tool (e.g. DBeaver). For this step, I will refer you to our documentation, but it is a rather simple procedure.
Once that’s done, you can link the MLflow model with these SQL statements:
USE mindsdb;

CREATE PREDICTOR nlp_kaggle_mlflow
PREDICT `target`
USING
    url.predict='http://localhost:5000/invocations',
    format='mlflow',
    dtype_dict={"text": "rich text", "target": "binary"};
As you can see, linking is easy. We only need to specify the name by which we’ll access this model, the name of the column it will predict, the URL at which the model can be reached (i.e. the /invocations endpoint) and, last but not least, the data type of each column (target included).
Now, to get predictions from the model there are two main paths. On the one hand, you can directly pass input data using the WHERE clause:
SELECT target
FROM mindsdb.nlp_kaggle_mlflow
WHERE text='The tsunami is coming, seek high ground';
However, most of the time what you will actually want in a deployed context is to pass a batch of data so the model emits a batch of predictions.
For this, you should have your database linked to MindsDB. As an example, here’s how a Postgres connection (let’s call it db_byom) would look:
CREATE DATASOURCE db_byom
WITH
    engine='postgres',
    parameters={
        "user": "user_name",
        "port": 3307,
        "password": "password",
        "host": "127.0.0.1",
        "database": "postgres"
    };
We will create a table called nlp_kaggle_test inside a test database with the following schema (a sketch of one way to ingest the dataset’s test split into it follows right after):
id INT,
keyword VARCHAR(255),
location VARCHAR(255),
text VARCHAR(5000),
target INT
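How you load the data is up to you; as a hedged sketch, one way to push the Kaggle test.csv into that table with pandas and SQLAlchemy (credentials mirroring the db_byom connection above, and assuming a psycopg2 driver plus a schema called test) could be:

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string matching the parameters used for db_byom
engine = create_engine('postgresql://user_name:password@127.0.0.1:3307/postgres')

# test.csv comes from the Kaggle competition download; the target column stays empty
test_df = pd.read_csv('test.csv')[['id', 'keyword', 'location', 'text']]
test_df.to_sql('nlp_kaggle_test', engine, schema='test', if_exists='append', index=False)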
Once the test split has been ingested into this table, MindsDB can actually JOIN it with our model, like so:
SELECT
ta.text,
tb.target as predicted
FROM db_byom.test.nlp_kaggle_test as ta
JOIN mindsdb.nlp_kaggle_mlflow as tb;
The output will look something like this:
+------------------------------------------+-----------+
| `text` | predicted |
+------------------------------------------+-----------+
| London is cool ;) | 0 |
| Forest fire near La Ronge Sask. Canada | 1 |
| [...] try to bring the heavy. #metal | 0 |
| People get #wildfires evacuation orders | 1 |
| ACCIDENT PROPERTY DAMAGE; PINER RD/HO... | 1 |
+------------------------------------------+-----------+
Seems like the predictor did learn a thing or two about disaster tweets!
Conclusion and next steps
Hopefully, this blog post has given you an idea of how any model can be served and then linked with MindsDB to generate predictions inside the data layer.
This can be pretty powerful for ML-in-the-loop workflows if you use tools like DBT, effectively simplifying the deployment of your machine learning models.
If you have ML models of your own and are tired of bringing their results back into the database, we suggest you check out the MindsDB repo, give it a try (and a star), and use your own models with this feature! We also recommend joining our community Slack; our team will be more than happy to help with any questions that may arise, and feedback is always appreciated.
As for what’s next, we’re hard at work simplifying the steps required to integrate alternative datasources, ML modeling tools, and serving tools (e.g. Ray Serve, which is already available), as well as improving the overall user experience and journey.
Links
All things MindsDB may be interesting to you: our homepage, our documentation, GitHub repository, and finally our Slack community, where you will be able to ask questions and interact with our community!
The full script can be found in markdown format in this tutorial.
— About the author: Patricio Cerda Mardini, Machine Learning Research Engineer @ MindsDB.
As a master’s student at PUC Chile, he focused on machine learning methods for human–robot interaction and recommendation systems, areas in which he has authored several academic publications. Prior to joining MindsDB, he also interned at EY Chile as a computer vision researcher.