Building a Semantic Search Knowledge Base with MindsDB

Cover Image for Building a Semantic Search Knowledge Base with MindsDB

Written by Andriy Burkov, Ph.D. & Author, MindsDB Advisor

What happens when a developer searches for "how to make async HTTP calls" but your documentation says "asynchronous network requests"? Traditional keyword search fails—even though the content is exactly what they need.

This is the fundamental limitation of keyword search: it matches words, not meaning.

In this tutorial, we'll build a semantic search system using MindsDB that understands user intent. Using 2 million Stack Overflow posts, we'll create knowledge bases with two different vector storage backends—PGVector and FAISS- and compare their performance.

What You'll Learn:

  • How MindsDB knowledge bases convert text into searchable vectors

  • Setting up pgvector (PostgreSQL-based) and FAISS (Facebook AI Similarity Search) storage

  • Combining semantic search with metadata filters

  • Building an AI agent that uses your knowledge base to answer questions

Prerequisites:

  • A MindsDB account (cloud or self-hosted)

  • PostgreSQL database with the Stack Overflow dataset

  • An OpenAI API key for embeddings

How Semantic Search Works

Before we dive in, let's understand the key difference between keyword and semantic search:

AspectKeyword SearchSemantic Search
MatchingExact wordsMeaning/intent
Query: "async HTTP"Misses "asynchronous requests"Finds both
Handles synonyms
Understands context

Semantic search works by:

  1. Embedding: Converting text into numerical vectors using an embedding model

  2. Storing: Saving these vectors in a vector database

  3. Querying: Converting the search query to a vector and finding the closest matches

MindsDB handles all of this through its Knowledge Base abstraction.

Installing Dependencies

We need two packages:

  • mindsdb_sdk: Python client for interacting with MindsDB servers

  • pandas: For working with query results as DataFrames

!pip install mindsdb_sdk pandas

3. Connecting to the MindsDB Cloud Instance

import mindsdb_sdk

# Connect to your MindsDB instance
server = mindsdb_sdk.connect(
    'YOUR_MINDSDB_URL',  # e.g., 'https://cloud.mindsdb.com' for MindsDB Cloud
    login='YOUR_USERNAME',
    password='YOUR_PASSWORD'
)
print("Connected to MindsDB server")
Connected to MindsDB server

4. Connecting to the Data Source

def run_query(sql, success_msg="Query executed successfully"):
    """Execute a SQL query and handle 'already exists' errors gracefully."""
    try:
        result = server.query(sql).fetch()
        print(success_msg)
        return result
    except RuntimeError as e:
        if "already exists" in str(e).lower():
            print("Resource already exists - skipping")
        else:
            raise
        return None

# Connect to your PostgreSQL database containing Stack Overflow data
run_query("""
    CREATE DATABASE pg_sample
    WITH ENGINE = "postgres",
    PARAMETERS = {
        "user": "YOUR_PG_USER",
        "password": "YOUR_PG_PASSWORD",
        "host": "YOUR_PG_HOST",
        "port": "5432",
        "database": "sample"
    }
""", "Created pg_sample database connection")
Created pg_sample database connection

Let's verify the connection by exploring the data. Check the dataset size:

# Get total row count
count = server.query("SELECT COUNT(*) as cnt FROM pg_sample.stackoverflow_2m").fetch()
print(f"Dataset size: {count['cnt'].iloc[0]:,} rows")
Dataset size: 2,000,000 rows

Show 10 records:

# Test sample data
df = server.query("SELECT * FROM pg_sample.stackoverflow_2m LIMIT 10").fetch()

# Display as a nice table (in Jupyter notebooks)
from IPython.display import display
display(df)
idPostTypeIdAcceptedAnswerIdParentIdScoreViewCountBodyTitleContentLicenseFavoriteCountCreationDateLastActivityDateLastEditDateLastEditorUserIdOwnerUserIdTags
072NaN4.0522NaNAn explicit cast to double like this isn't n...NoneCC BY-SA 4.0NaN2008-07-31T22:17:57.8832019-10-21T14:03:54.6072019-10-21T14:03:54.6075496973.09.0None
1911404.0NaN2199784860.0Given a DateTime representing a person's bir...How do I calculate someone's age based on a Da...CC BY-SA 4.00.02008-07-31T23:40:59.7432023-02-02T18:38:32.6132022-07-27T22:34:36.3203524942.01.0c#,.net,datetime
21111248.0NaN1644197314.0Given a specific DateTime value, how do I di...Calculate relative time in C#CC BY-SA 4.00.02008-07-31T23:55:37.9672022-09-05T11:26:30.1872022-07-10T00:19:55.23716790137.01.0c#,datetime,time,datediff,relative-time-span
3141NaNNaN491173083.0What is the difference between [Math.Floor()](...Difference between Math.Floor() and Math.Trunc...CC BY-SA 3.00.02008-08-01T00:59:11.1772022-04-22T08:59:43.8172017-02-25T17:42:17.8106495084.011.0.net,math
46131.0NaN31923465.0I have an absolutely positioned div containi...Why did the width collapse in the percentage w...CC BY-SA 4.00.02008-07-31T22:08:08.6202021-01-29T18:46:45.9632021-01-29T18:46:45.9639134576.09.0html,css,internet-explorer-7
5122NaN11.0347NaNHere's how I do it\n\n```\nvar ts = new TimeSp...NoneCC BY-SA 4.0NaN2008-07-31T23:56:41.3032020-06-13T10:30:44.3972020-06-13T10:30:44.397238419.01.0None
6131NaNNaN701277780.0Is there a standard way for a web server to be...Determine a user's timezoneCC BY-SA 4.00.02008-08-01T00:42:38.9032022-03-29T07:31:31.3202020-12-03T03:37:56.313584192.09.0html,browser,timezone,user-agent,timezone-offset
7417.0NaN79470633.0I want to assign the decimal variable "trans" ...How to convert Decimal to Double in C#?CC BY-SA 4.00.02008-07-31T21:42:52.6672022-09-08T05:07:26.0332022-09-08T05:07:26.03316124033.08.0c#,floating-point,type-conversion,double,decimal
817126.0NaN19885547.0How do I store binary data in [MySQL](http://e...Binary Data in MySQLCC BY-SA 3.00.02008-08-01T05:09:55.9932020-12-03T03:37:51.7632020-12-03T03:37:51.763584192.02.0mysql,database,binary-data,data-storage
924149.0NaN193101180.0If I have a trigger before the update on a tab...Throw an error preventing a table update in a ...CC BY-SA 4.00.02008-08-01T12:12:19.3502021-01-29T12:57:17.1532021-01-29T12:57:17.15314152908.022.0mysql,database,triggers

The Stack Overflow dataset contains 2 million posts—both questions (PostTypeId=1) and answers (PostTypeId=2). Key columns include:

  • Id - Unique identifier for each post

  • Body - The content we'll make semantically searchable

  • Title - The title of the post (questions only)

  • Tags - Programming language and topic tags (e.g., python, javascript)

  • Score - Community voting score—useful for prioritizing high-quality content

  • ViewCount - Popularity metric for filtering

  • PostTypeId - Type of post (1=question, 2=answer)

  • AcceptedAnswerId - ID of the accepted answer (for questions)

  • CreationDate, LastActivityDate, LastEditDate - Timestamps

This rich metadata allows us to combine semantic understanding with traditional filters—for example, finding Python questions about async programming with a score above 10.

4. Setting Up Vector Storage Backends

MindsDB supports multiple vector storage options. We'll set up both pgvector and a recently added FAISS and will compare how quick they are.

PGVector (PostgreSQL Extension)

PGVector is a PostgreSQL extension for vector similarity search. It's ideal when you want to keep vectors alongside your relational data.

# Create pgvector database connection
run_query("""
    CREATE DATABASE pg_vector
    WITH ENGINE = "pgvector",
    PARAMETERS = {
        "user": "YOUR_PG_USER",
        "password": "YOUR_PG_PASSWORD",
        "host": "YOUR_PG_HOST",
        "port": "5432",
        "database": "vector"
    }
""", "Created pg_vector database connection")
Created pg_vector database connection

FAISS is a library for efficient similarity search developed by Facebook AI Research. It's optimized for fast similarity search on large datasets.

# Create FAISS database connection
run_query("""
    CREATE DATABASE db_faiss
    WITH ENGINE = 'duckdb_faiss',
    PARAMETERS = {
        "persist_directory": "/home/ubuntu/faiss"
    }
""", "Created db_faiss database connection")
Created db_faiss database connection

Choosing Between PGVector and FAISS

FeaturepgvectorFAISS
Best forIntegration with existing PostgreSQLMaximum query speed
PersistenceNative PostgreSQL storageFile-based
ScalabilityGood (PostgreSQL limits)Excellent (billions of vectors)
Setup complexityRequires PostgreSQL extensionStandalone
Query speedGood (~19s for 2M vectors)Excellent (~5s for 2M vectors)

For this tutorial, we'll implement both so you can see the performance difference firsthand.

5. Creating Knowledge Bases

Now we have a table with relational data and two vector stores to keep the embedding vectors. We are ready to create knowledge bases using both storage backends.

The knowledge base will:

  • Use OpenAI's text-embedding-3-small model for generating embeddings

  • Store the post Body as searchable content

  • Include metadata fields for filtering results

Knowledge Base with PGVector Storage

def kb_exists(kb_name):
    """Check if a knowledge base already exists."""
    try:
        result = server.query("SELECT name FROM information_schema.knowledge_bases").fetch()
        return kb_name in result['name'].values
    except Exception:
        return False

# Create pgvector knowledge base
if kb_exists("kb_stack_vector"):
    print("kb_stack_vector already exists - skipping creation")
else:
    run_query("""
        CREATE KNOWLEDGE_BASE kb_stack_vector
        USING
            storage = pg_vector.stack,
            embedding_model = {
                "provider": "openai",
                "model_name": "text-embedding-3-small"
            },
            content_columns = ['Body'],
            metadata_columns = [
                "PostTypeId",
                "AcceptedAnswerId",
                "ParentId",
                "Score",
                "ViewCount",
                "Title",
                "ContentLicense",
                "FavoriteCount",
                "CreationDate",
                "LastActivityDate",
                "LastEditDate",
                "LastEditorUserId",
                "OwnerUserId",
                "Tags"
            ]
    """, "Created kb_stack_vector knowledge base")
Created kb_stack_vector knowledge base

Knowledge Base with FAISS Storage

# Create FAISS knowledge base
if kb_exists("kb_stack_faiss"):
    print("kb_stack_faiss already exists - skipping creation")
else:
    run_query("""
        CREATE KNOWLEDGE_BASE kb_stack_faiss
        USING
            storage = db_faiss.stack,
            embedding_model = {
                "provider": "openai",
                "model_name": "text-embedding-3-small"
            },
            content_columns = ['Body'],
            metadata_columns = [
                "PostTypeId",
                "AcceptedAnswerId",
                "ParentId",
                "Score",
                "ViewCount",
                "Title",
                "ContentLicense",
                "FavoriteCount",
                "CreationDate",
                "LastActivityDate",
                "LastEditDate",
                "LastEditorUserId",
                "OwnerUserId",
                "Tags"
            ]
    """, "Created kb_stack_faiss knowledge base")
Created kb_stack_faiss knowledge base

Understanding the Parameters

ParameterDescription
storageSpecifies the vector database connection and table name
embedding_modelConfiguration for the embedding model (provider and model name)
content_columnsColumns to embed and make semantically searchable
metadata_columnsColumns available for filtering (not embedded, but stored)

6. Loading Data into Knowledge Bases

Now we'll insert the Stack Overflow data into our knowledge bases. This process:

  1. Fetches data from the source table in batches

  2. Generates embeddings for content columns using the OpenAI API

  3. Stores vectors and metadata in the vector database

Loading Data into PGVector Knowledge Base

def is_kb_empty(kb_name):
    """Check if a knowledge base is empty (fast - only fetches 1 row)."""
    result = server.query(f"SELECT id FROM {kb_name} LIMIT 1").fetch()
    return len(result) == 0

if is_kb_empty("kb_stack_vector"):
    print("kb_stack_vector is empty - starting data insertion...")
    server.query("""
        INSERT INTO kb_stack_vector
        SELECT * FROM pg_sample.stackoverflow_2m 
        USING 
            batch_size = 1000, 
            track_column = id
    """).fetch()
    print("Data insertion started for kb_stack_vector")
else:
    print("kb_stack_vector is not empty - skipping data insertion")
Data insertion started for kb_stack_vector

Loading Data into FAISS Knowledge Base

if is_kb_empty("kb_stack_faiss"):
    print("kb_stack_faiss is empty - starting data insertion...")
    server.query("""
        INSERT INTO kb_stack_faiss
        SELECT * FROM pg_sample.stackoverflow_2m 
        USING 
            batch_size = 1000, 
            track_column = id
    """).fetch()
    print("Data insertion started for kb_stack_faiss")
else:
    print("kb_stack_faiss is not empty - skipping data insertion")
Data insertion started for kb_stack_faiss

Wait until the data insertion is complete.

7. Querying the Knowledge Bases

Once data is loaded, you can perform semantic searches combined with metadata filtering.

Search for content related to "8-bit music" (finds semantically similar content):

import time

# Semantic search on pgvector KB
start = time.time()
results_vector = server.query("""
    SELECT * FROM kb_stack_vector 
    WHERE content = '8-bit music'
    AND Tags LIKE '%python%'
    LIMIT 10
""").fetch()
elapsed_vector = time.time() - start
print(f"pgvector query time: {elapsed_vector:.2f} seconds")
display(results_vector)

# Semantic search on FAISS KB
start = time.time()
results_faiss = server.query("""
    SELECT * FROM kb_stack_faiss 
    WHERE content = '8-bit music'
    AND Tags LIKE '%python%'
    LIMIT 10
""").fetch()
elapsed_faiss = time.time() - start
print(f"FAISS query time: {elapsed_faiss:.2f} seconds")
display(results_faiss)
pgvector query time: 19.21 seconds
idchunk_idchunk_contentdistancerelevanceContentLicenseViewCountLastEditDateScoreAcceptedAnswerIdOwnerUserIdLastActivityDateTagsLastEditorUserIdPostTypeIdParentIdTitleFavoriteCountCreationDatemetadata
011182661118266:Body:1of2:0to971Im trying to engineer in python a way of trans...0.6054470.622879CC BY-SA 2.51694.02009-07-13T08:32:20.7970NaNNaN2010-03-17T15:16:17.060python,audio12855.01NoneList of values to a sound fileNaN2009-07-13T08:27:25.393{'Tags': 'python,audio', 'Score': 0, 'Title': ...
1974071974071:Body:1of1:0to791I have a mosquito problem in my house. This wo...0.6152570.619097CC BY-SA 2.555695.02017-05-23T12:32:21.50744974291.051197.02020-02-12T22:24:39.977python,audio,mp3,frequency-1.01NonePython library for playing fixed-frequency sound0.02009-06-10T07:05:02.037{'Tags': 'python,audio,mp3,frequency', 'Score'...
219670401967040:Body:1of1:0to224I am confused because there are a lot of progr...0.6269040.614665CC BY-SA 2.56615.0None71968691.0237934.02021-08-10T10:40:59.217python,audioNaN1NoneHow can i create a melody? Is there any sound-...0.02009-12-27T21:04:34.243{'Tags': 'python,audio', 'Score': 7, 'Title': ...
311182661118266:Body:2of2:972to1430The current solution I'm thinking of involves ...0.6274420.614461CC BY-SA 2.51694.02009-07-13T08:32:20.7970NaNNaN2010-03-17T15:16:17.060python,audio12855.01NoneList of values to a sound fileNaN2009-07-13T08:27:25.393{'Tags': 'python,audio', 'Score': 0, 'Title': ...
413448841344884:Body:1of1:0to327I want to learn how to program a music applica...0.6439570.608289CC BY-SA 2.52205.02017-05-23T12:11:22.60771346272.0164623.02022-04-14T09:12:07.197python,perl,waveform-1.01NoneProgramming a Self Learning Music Maker0.02009-08-28T03:28:03.937{'Tags': 'python,perl,waveform', 'Score': 7, '...
523765052376505:Body:1of2:0to968Write a function called listenToPicture that t...0.6452140.607824CC BY-SA 2.53058.02010-03-04T02:28:26.7030NaN285922.02010-03-06T05:27:48.017python,image,audio34397.01NoneHow do I loop through every 4th pixel in every...NaN2010-03-04T02:26:22.603{'Tags': 'python,image,audio', 'Score': 0, 'Ti...
622268532226853:Body:1of1:0to877I'm trying to write a program to display PCM d...0.6541620.604536CC BY-SA 2.512425.0None72226907.0210920.02015-07-25T11:16:16.747python,audio,pcmNaN1NoneInterpreting WAV Data0.02010-02-09T05:01:25.703{'Tags': 'python,audio,pcm', 'Score': 7, 'Titl...
715611041561104:Body:1of1:0to306Is there a way to do this? Also, I need this t...0.6680740.599494CC BY-SA 2.51303.02020-06-20T09:12:55.06011561314.0151377.02012-01-29T00:01:18.230python,pygame,pitch-1.01NonePlaying sounds with python and changing their ...NaN2009-10-13T15:44:54.267{'Tags': 'python,pygame,pitch', 'Score': 1, 'T...
813829981382998:Body:4of4:2649to3382```\n¼ éíñ§ÐÌëÑ » ¼ ö ® © ’\n0 1\n2 10\n3 10\n...0.6706540.598568CC BY-SA 3.012497.02011-06-09T06:00:51.243181383721.06946.02015-06-04T17:13:43.323python,unicode6946.01Nonelatin-1 to ascii0.02009-09-05T10:44:40.167{'Tags': 'python,unicode', 'Score': 18, 'Title...
918376861837686:Body:1of2:0to950I wish to take a file encoded in UTF-8 that do...0.6759990.596659CC BY-SA 3.03016.02011-10-15T13:17:24.5202NaNNaN2011-10-15T13:17:24.520python,c,utf-8,compression12113.01NoneCompressing UTF-8(or other 8-bit encoding) to ...0.02009-12-03T04:43:05.963{'Tags': 'python,c,utf-8,compression', 'Score'...
FAISS query time: 5.04 seconds
iddistancechunk_idchunk_contentrelevanceContentLicenseViewCountLastEditDateScoreAcceptedAnswerIdOwnerUserIdParentIdLastEditorUserIdLastActivityDateTagsPostTypeIdFavoriteCountTitleCreationDatemetadata
011182660.6054681118266:Body:1of2:0to971Im trying to engineer in python a way of trans...0.622871CC BY-SA 2.51694.02009-07-13T08:32:20.7970NaNNaNNone12855.02010-03-17T15:16:17.060python,audio1NaNList of values to a sound file2009-07-13T08:27:25.393{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
19740710.615225974071:Body:1of1:0to791I have a mosquito problem in my house. This wo...0.619109CC BY-SA 2.555695.02017-05-23T12:32:21.50744974291.051197.0None-1.02020-02-12T22:24:39.977python,audio,mp3,frequency10.0Python library for playing fixed-frequency sound2009-06-10T07:05:02.037{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
219670400.6269231967040:Body:1of1:0to224I am confused because there are a lot of progr...0.614657CC BY-SA 2.56615.0None71968691.0237934.0NoneNaN2021-08-10T10:40:59.217python,audio10.0How can i create a melody? Is there any sound-...2009-12-27T21:04:34.243{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
311182660.6274611118266:Body:2of2:972to1430The current solution I'm thinking of involves ...0.614454CC BY-SA 2.51694.02009-07-13T08:32:20.7970NaNNaNNone12855.02010-03-17T15:16:17.060python,audio1NaNList of values to a sound file2009-07-13T08:27:25.393{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
413448840.6439551344884:Body:1of1:0to327I want to learn how to program a music applica...0.608289CC BY-SA 2.52205.02017-05-23T12:11:22.60771346272.0164623.0None-1.02022-04-14T09:12:07.197python,perl,waveform10.0Programming a Self Learning Music Maker2009-08-28T03:28:03.937{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
523765050.6451922376505:Body:1of2:0to968Write a function called listenToPicture that t...0.607832CC BY-SA 2.53058.02010-03-04T02:28:26.7030NaN285922.0None34397.02010-03-06T05:27:48.017python,image,audio1NaNHow do I loop through every 4th pixel in every...2010-03-04T02:26:22.603{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
622268530.6541122226853:Body:1of1:0to877I'm trying to write a program to display PCM d...0.604554CC BY-SA 2.512425.0None72226907.0210920.0NoneNaN2015-07-25T11:16:16.747python,audio,pcm10.0Interpreting WAV Data2010-02-09T05:01:25.703{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
715611040.6680551561104:Body:1of1:0to306Is there a way to do this? Also, I need this t...0.599501CC BY-SA 2.51303.02020-06-20T09:12:55.06011561314.0151377.0None-1.02012-01-29T00:01:18.230python,pygame,pitch1NaNPlaying sounds with python and changing their ...2009-10-13T15:44:54.267{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
813829980.6706681382998:Body:4of4:2649to3382```\n¼ éíñ§ÐÌëÑ » ¼ ö ® © ’\n0 1\n2 10\n3 10\n...0.598563CC BY-SA 3.012497.02011-06-09T06:00:51.243181383721.06946.0None6946.02015-06-04T17:13:43.323python,unicode10.0latin-1 to ascii2009-09-05T10:44:40.167{'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...
918376860.6759861837686:Body:1of2:0to950I wish to take a file encoded in UTF-8 that do...0.596664CC BY-SA 3.03016.02011-10-15T13:17:24.5202NaNNaNNone12113.02011-10-15T13:17:24.520python,c,utf-8,compression10.0Compressing UTF-8(or other 8-bit encoding) to ...2009-12-03T04:43:05.963{'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...

Analyzing the Results

Notice how the search for "8-bit music" returned posts about:

  • Converting values to sound files

  • Playing fixed-frequency sounds

  • Creating melodies programmatically

None of these posts contain the exact phrase "8-bit music," yet they're all semantically relevant to chiptune/retro audio generation. This is the power of semantic search.

Also note the 4x speed improvement with FAISS (5 seconds vs 19 seconds for pgvector). For production systems with high query volumes, this difference is significant.

Combined Semantic and Metadata Filtering

Find AJAX-related posts tagged with jQuery that have high view counts:

# pgvector: Semantic search with metadata filters
start = time.time()
results = server.query("""
    SELECT * FROM kb_stack_vector 
    WHERE content = 'ajax'
        AND Tags LIKE '%jquery%'
        AND ViewCount > 1000.0
        AND relevance > 0.6
    LIMIT 10
""").fetch()
print(f"pgvector query time: {time.time() - start:.2f} seconds")
display(results)

# FAISS: Semantic search with metadata filters
start = time.time()
results = server.query("""
    SELECT * FROM kb_stack_faiss 
    WHERE content = 'ajax'
        AND Tags LIKE '%jquery%'
        AND ViewCount > 1000.0
        AND relevance > 0.6
    LIMIT 10
""").fetch()
print(f"FAISS query time: {time.time() - start:.2f} seconds")
display(results)
pgvector query time: 5.76 seconds
idchunk_idchunk_contentdistancerelevanceContentLicenseViewCountLastEditDateScoreAcceptedAnswerIdOwnerUserIdLastActivityDateTagsLastEditorUserIdPostTypeIdParentIdTitleFavoriteCountCreationDatemetadata
014006371400637:Body:28of32:25627to26627o.ajax({type:"POST",url:E,data:G,success:H,dat...0.4272650.700641CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.02013-08-05T16:07:54.400javascript,jquery8590.01NoneStop reload for ajax submitted formNaN2009-09-09T16:12:46.057{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
114006371400637:Body:30of32:27488to28356O=false;T.onload=T.onreadystatechange=function...0.4537640.687870CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.02013-08-05T16:07:54.400javascript,jquery8590.01NoneStop reload for ajax submitted formNaN2009-09-09T16:12:46.057{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
214006371400637:Body:27of32:24691to25626rn this},serialize:function(){return o.param(t...0.4546290.687460CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.02013-08-05T16:07:54.400javascript,jquery8590.01NoneStop reload for ajax submitted formNaN2009-09-09T16:12:46.057{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
314247741424774:Body:2of2:934to1745var self = this;\n $.ajax({\n ...0.4614860.684235CC BY-SA 2.53601.0None11426940.0173350.02020-06-08T10:43:45.037jquery,loopsNaN1NoneLoop with 8 timesNaN2009-09-15T02:02:58.927{'Tags': 'jquery,loops', 'Score': 1, 'Title': ...
414006371400637:Body:31of32:28357to29238N=function(X){if(J.readyState==0){if(P){clearI...0.4621910.683905CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.02013-08-05T16:07:54.400javascript,jquery8590.01NoneStop reload for ajax submitted formNaN2009-09-09T16:12:46.057{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
5546344546344:Body:2of3:902to1764var before = function() { $(loading).show() ;...0.4632580.683407CC BY-SA 2.51463.02009-02-13T16:17:38.1700546642.02755.02009-02-13T16:37:59.867javascript,jquery,ajax2755.01NoneUsing jQuery, how can I store the result of a ...0.02009-02-13T15:25:00.963{'Tags': 'javascript,jquery,ajax', 'Score': 0,...
612796251279625:Body:2of3:782to1754```\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ...0.4688820.680790CC BY-SA 3.01130.02016-12-03T07:00:58.21301279881.058375.02016-12-03T07:00:58.213events,jquery,getjson6637668.01NoneTrouble with jQuery Ajax timingNaN2009-08-14T19:06:28.043{'Tags': 'events,jquery,getjson', 'Score': 0, ...
714006371400637:Body:32of32:29239to30048L(){if(M.complete){M.complete(J,R)}if(M.global...0.4689440.680761CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.02013-08-05T16:07:54.400javascript,jquery8590.01NoneStop reload for ajax submitted formNaN2009-09-09T16:12:46.057{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
817756251775625:Body:5of9:3144to4049}\n\n}\n</script>\n\n\n\n<script type=...0.4727230.679014CC BY-SA 2.52100.02009-11-21T14:46:00.25011776406.0212889.02009-11-21T19:03:52.070jquery,form-submit212889.01NonejQuery - Multiple form submission trigger unre...0.02009-11-21T14:32:41.383{'Tags': 'jquery,form-submit', 'Score': 1, 'Ti...
914006371400637:Body:26of32:23690to24690nclick")}o(function(){var L=document.createEle...0.4777840.676689CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.02013-08-05T16:07:54.400javascript,jquery8590.01NoneStop reload for ajax submitted formNaN2009-09-09T16:12:46.057{'Tags': 'javascript,jquery', 'Score': 2, 'Tit...
FAISS query time: 2.50 seconds
iddistancechunk_idchunk_contentrelevanceContentLicenseViewCountLastEditDateScoreAcceptedAnswerIdOwnerUserIdParentIdLastEditorUserIdLastActivityDateTagsPostTypeIdFavoriteCountTitleCreationDatemetadata
014006370.4272431400637:Body:28of32:25627to26627o.ajax({type:"POST",url:E,data:G,success:H,dat...0.700651CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.0None8590.02013-08-05T16:07:54.400javascript,jquery1NaNStop reload for ajax submitted form2009-09-09T16:12:46.057{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
114006370.4537691400637:Body:30of32:27488to28356O=false;T.onload=T.onreadystatechange=function...0.687867CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.0None8590.02013-08-05T16:07:54.400javascript,jquery1NaNStop reload for ajax submitted form2009-09-09T16:12:46.057{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
214006370.4545891400637:Body:27of32:24691to25626rn this},serialize:function(){return o.param(t...0.687479CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.0None8590.02013-08-05T16:07:54.400javascript,jquery1NaNStop reload for ajax submitted form2009-09-09T16:12:46.057{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
314247740.4614691424774:Body:2of2:934to1745var self = this;\n $.ajax({\n ...0.684243CC BY-SA 2.53601.0None11426940.0173350.0NoneNaN2020-06-08T10:43:45.037jquery,loops1NaNLoop with 8 times2009-09-15T02:02:58.927{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
414006370.4622331400637:Body:31of32:28357to29238N=function(X){if(J.readyState==0){if(P){clearI...0.683886CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.0None8590.02013-08-05T16:07:54.400javascript,jquery1NaNStop reload for ajax submitted form2009-09-09T16:12:46.057{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
55463440.463237546344:Body:2of3:902to1764var before = function() { $(loading).show() ;...0.683416CC BY-SA 2.51463.02009-02-13T16:17:38.1700546642.02755.0None2755.02009-02-13T16:37:59.867javascript,jquery,ajax10.0Using jQuery, how can I store the result of a ...2009-02-13T15:25:00.963{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
612796250.4688541279625:Body:2of3:782to1754```\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ...0.680803CC BY-SA 3.01130.02016-12-03T07:00:58.21301279881.058375.0None6637668.02016-12-03T07:00:58.213events,jquery,getjson1NaNTrouble with jQuery Ajax timing2009-08-14T19:06:28.043{'ContentLicense': 'CC BY-SA 3.0', 'LastActivi...
714006370.4689311400637:Body:32of32:29239to30048L(){if(M.complete){M.complete(J,R)}if(M.global...0.680767CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.0None8590.02013-08-05T16:07:54.400javascript,jquery1NaNStop reload for ajax submitted form2009-09-09T16:12:46.057{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
817756250.4727401775625:Body:5of9:3144to4049}\n\n}\n</script>\n\n\n\n<script type=...0.679007CC BY-SA 2.52100.02009-11-21T14:46:00.25011776406.0212889.0None212889.02009-11-21T19:03:52.070jquery,form-submit10.0jQuery - Multiple form submission trigger unre...2009-11-21T14:32:41.383{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...
914006370.4777851400637:Body:26of32:23690to24690nclick")}o(function(){var L=document.createEle...0.676688CC BY-SA 2.52741.02009-09-09T16:16:59.43021400656.0107129.0None8590.02013-08-05T16:07:54.400javascript,jquery1NaNStop reload for ajax submitted form2009-09-09T16:12:46.057{'ContentLicense': 'CC BY-SA 2.5', 'LastActivi...

Understanding Query Results

The query returns these columns:

ColumnDescription
idOriginal document ID
chunk_idIdentifier for the text chunk
chunk_contentThe actual text content
metadataJSON object with all metadata fields
distanceVector distance (lower = more similar)
relevanceRelevance score (higher = more relevant, 0-1)

Filtering by Relevance

Get only highly relevant results:

The Power of Combined Filtering

The query we just ran demonstrates MindsDB's hybrid search capability:

SELECT * FROM kb_stack_faiss 
WHERE content = 'ajax'              -- Semantic match
    AND Tags LIKE '%jquery%'        -- Metadata filter
    AND ViewCount > 1000            -- Popularity threshold
    AND relevance > 0.6             -- Quality threshold

This finds posts that:

  1. Are semantically similar to "ajax" (not just keyword matches)

  2. Are tagged with jQuery

  3. Have significant engagement (>1000 views)

  4. Meet a minimum relevance score

This combination is impossible with traditional search and would require complex custom code with raw vector databases.

def run_query_ignore_exists(sql, success_msg="Query executed successfully"):
    """Execute a query, silently ignoring 'already exists' errors."""
    try:
        result = server.query(sql).fetch()
        print(success_msg)
        return result
    except RuntimeError as e:
        return None  # Silently ignore
# Create MindsDB Agent
run_query_ignore_exists("""
    drop agent stackoverflow_agent
""", "Dropped stackoverflow_agent")

run_query("""
    CREATE AGENT stackoverflow_agent
    USING
        model = {
            "provider": "openai",
            "model_name": "gpt-4.1"
        },
        data = {
            "knowledge_bases": ["mindsdb.kb_stack_faiss"]
        },
        prompt_template = '
            You are a helpful programming assistant. 
            mindsdb.kb_stack_faiss is a knowledge base that contains Stack Overflow questions and answers.
            Use this knowledge to provide accurate, helpful responses to programming questions.
            Include code examples when relevant.
            You must base your answer on the Stack Overflow questions and answers extracted from mindsdb.kb_stack_faiss.
            If you failed to get the results from mindsdb.kb_stack_faiss, answer I could not get the results from mindsdb.kb_stack_faiss.
            Print the chunk ID for each question and answer you based your answer on.
            IMPORTANT: Use a limit of 100 in your query to the knowledge base.
        '
""", "Created stackoverflow_agent")
Dropped stackoverflow_agent
Created stackoverflow_agent
# Query the agent
start = time.time()
response = server.query("""
    SELECT answer
    FROM stackoverflow_agent 
    WHERE question = 'Compare JavaScript to TypeScript for building web services'
""").fetch()
print(f"Agent response time: {time.time() - start:.2f} seconds\n")
print(response['answer'].iloc[0])
Agent response time: 63.44 seconds

To compare JavaScript and TypeScript for building web services, let's look at insights from Stack Overflow posts (see chunk IDs for reference):

**JavaScript:**
- JavaScript is a dynamic, weakly typed, prototype-based language with first-class functions ([1253285:Body:1of1:0to384](https://stackoverflow.com/posts/1253285)).
- It is the default language for web development, both on the client (browser) and, with Node.js, on the server ([870980:Body:1of1:0to133](https://stackoverflow.com/posts/870980)).
- JavaScript is flexible and widely supported, but its lack of static typing can lead to runtime errors and makes large codebases harder to maintain.

**TypeScript:**
- While not directly mentioned in the top results, TypeScript is a superset of JavaScript that adds static typing and modern language features. It compiles to JavaScript, so it runs anywhere JavaScript does.
- TypeScript helps catch errors at compile time, improves code readability, and is especially beneficial for large projects or teams.

**Web Services:**
- JavaScript (with Node.js) is commonly used to build RESTful APIs and web services ([208051:Body:1of1:0to147](https://stackoverflow.com/posts/208051)).
- TypeScript is increasingly popular for the same purpose, as it provides all the benefits of JavaScript plus type safety and better tooling (e.g., autocompletion, refactoring).

**Summary Table:**

| Feature         | JavaScript                        | TypeScript                          |
|-----------------|----------------------------------|-------------------------------------|
| Typing          | Dynamic, weakly typed            | Static typing (optional)            |
| Tooling         | Good, but less type-aware        | Excellent (autocompletion, refactor)|
| Learning Curve  | Lower                            | Slightly higher (due to types)      |
| Error Checking  | Runtime                          | Compile-time + runtime              |
| Ecosystem       | Huge, universal                  | Same as JS, plus TS-specific tools  |
| Maintainability | Can be challenging in large code | Easier in large codebases           |

**Conclusion:**  
- For small projects or rapid prototyping, JavaScript is sufficient and easy to start with.
- For larger projects, teams, or when maintainability and reliability are priorities, TypeScript is generally preferred.

References:  
- [1253285:Body:1of1:0to384](https://stackoverflow.com/posts/1253285)  
- [870980:Body:1of1:0to133](https://stackoverflow.com/posts/870980)  
- [208051:Body:1of1:0to147](https://stackoverflow.com/posts/208051)  

If you want more specific code examples or a deeper dive into either technology, let me know!

Conclusion

We've built a complete semantic search system that:

  • Processes 2 million Stack Overflow posts

  • Supports both pgvector and FAISS backends

  • Combines semantic search with metadata filtering

  • Powers an AI agent for natural language queries

Key Takeaways

  1. FAISS is much faster than pgvector for pure search queries

  2. Metadata filtering lets you narrow results by tags, scores, dates

  3. Knowledge bases abstract complexity—no need to manage embeddings manually

  4. Agents can leverage knowledge bases for RAG-style applications

Next Steps

  • Try different embedding models

  • Add more data sources

  • Build a chat interface

  • Explore different chunking strategies

You can watch the recording of this demo by registering here to receive the link.