
rename embedding and instruct models #1481

Open · wants to merge 2 commits into master
2 changes: 1 addition & 1 deletion packages/pgml-rds-proxy/README.md
@@ -76,7 +76,7 @@ SELECT
FROM
dblink(
'postgresml',
'SELECT * FROM pgml.embed(''intfloat/e5-small'', ''embed this text'') AS embedding'
'SELECT * FROM pgml.embed(''intfloat/e5-small-v2'', ''embed this text'') AS embedding'
) AS t1(embedding real[384]);
```

7 changes: 3 additions & 4 deletions pgml-apps/pgml-chat/pgml_chat/main.py
@@ -123,7 +123,7 @@ def handler(signum, frame):
"--chat_completion_model",
dest="chat_completion_model",
type=str,
default="HuggingFaceH4/zephyr-7b-beta",
default="meta-llama/Meta-Llama-3-8B-Instruct",
)

parser.add_argument(
@@ -195,9 +195,8 @@ def handler(signum, frame):
)

splitter = Splitter(splitter_name, splitter_params)
model_name = "hkunlp/instructor-xl"
model_embedding_instruction = "Represent the %s document for retrieval: " % (bot_topic)
model_params = {"instruction": model_embedding_instruction}
model_name = "intfloat/e5-small-v2"
model_params = {}

model = Model(model_name, "pgml", model_params)
pipeline = Pipeline(args.collection_name + "_pipeline", model, splitter)
@@ -122,14 +122,14 @@ LIMIT 5;

PostgresML provides a simple interface to generate embeddings from text in your database. You can use the [`pgml.embed`](https://postgresml.org/docs/guides/transformers/embeddings) function to generate embeddings for a column of text. The function takes a transformer name and a text value. The transformer will automatically be downloaded and cached on your connection process for reuse. You can see a list of good candidate models for generating embeddings on the [Massive Text Embedding Benchmark leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

Since our corpus of documents (movie reviews) are all relatively short and similar in style, we don't need a large model. [`intfloat/e5-small`](https://huggingface.co/intfloat/e5-small) will be a good first attempt. The great thing about PostgresML is you can always regenerate your embeddings later to experiment with different embedding models.
Since the documents in our corpus (movie reviews) are all relatively short and similar in style, we don't need a large model. [`intfloat/e5-small-v2`](https://huggingface.co/intfloat/e5-small-v2) will be a good first attempt. The great thing about PostgresML is you can always regenerate your embeddings later to experiment with different embedding models.

It takes a couple of minutes to download and cache the `intfloat/e5-small` model to generate the first embedding. After that, it's pretty fast.
It takes a couple of minutes to download and cache the `intfloat/e5-small-v2` model to generate the first embedding. After that, it's pretty fast.

Note how we prefix the text we want to embed with either `passage:` or `query:`. The e5 model requires us to prefix our data with `passage:` if we're generating embeddings for our corpus, and `query:` if we want to find semantically similar content.

```postgresql
SELECT pgml.embed('intfloat/e5-small', 'passage: hi mom');
SELECT pgml.embed('intfloat/e5-small-v2', 'passage: hi mom');
```
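
The `query:` side works the same way from any Postgres client. Here's a minimal sketch, assuming a PostgresML connection string in the `DATABASE_URL` environment variable and the `psycopg2` driver:

```python
import os

import psycopg2

# Connect to a PostgresML database; DATABASE_URL is assumed to be set.
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

# Embed a search query with the `query:` prefix.
cur.execute(
    "SELECT pgml.embed('intfloat/e5-small-v2', %s)",
    ("query: hi mom",),
)
embedding = cur.fetchone()[0]
print(len(embedding))  # e5-small-v2 produces 384-dimensional embeddings
```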

This is a pretty powerful function, because we can pass any arbitrary text to any open source model, and it will generate an embedding for us. We can benchmark how long it takes to generate an embedding for a single review, using client-side timings in Postgres:
@@ -147,7 +147,7 @@ Aside from using this function with strings passed from a client, we can use it
```postgresql
SELECT
review_body,
pgml.embed('intfloat/e5-small', 'passage: ' || review_body)
pgml.embed('intfloat/e5-small-v2', 'passage: ' || review_body)
FROM pgml.amazon_us_reviews
LIMIT 1;
```
@@ -171,7 +171,7 @@ Time to generate an embedding increases with the length of the input text, and v
```postgresql
SELECT
review_body,
pgml.embed('intfloat/e5-small', 'passage: ' || review_body) AS embedding
pgml.embed('intfloat/e5-small-v2', 'passage: ' || review_body) AS embedding
FROM pgml.amazon_us_reviews
LIMIT 1000;
```
@@ -190,7 +190,7 @@ We can also do a quick sanity check to make sure we're really getting value out
SELECT
review_body,
pgml.embed(
'intfloat/e5-small',
'intfloat/e5-small-v2',
'passage: ' || review_body,
'{"device": "cpu"}'
) AS embedding
@@ -224,6 +224,12 @@ You can also find embedding models that outperform OpenAI's `text-embedding-ada-002`

The current leading model is `hkunlp/instructor-xl`. Instructor models take an additional `instruction` parameter which includes context for the embeddings use case, similar to prompts before text generation tasks.

!!! note

"intfloat/e5-small-v2" surpassed the quality of instructor-xl, and should be used instead, but we've left this documentation available for existing users

!!!

Instructions can provide a "classification" or "topic" for the text:

#### Classification
@@ -325,7 +331,7 @@ BEGIN

UPDATE pgml.amazon_us_reviews
SET review_embedding_e5_large = pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'passage: ' || review_body
)
WHERE id BETWEEN i AND i + 10
@@ -44,7 +44,7 @@ The Switch Kit is an open-source AI SDK that provides a drop in replacement for
const pgml = require("pgml");
const client = pgml.newOpenSourceAI();
const results = client.chat_completions_create(
"HuggingFaceH4/zephyr-7b-beta",
"meta-llama/Meta-Llama-3-8B-Instruct",
[
{
role: "system",
@@ -65,7 +65,7 @@ console.log(results);
import pgml
client = pgml.OpenSourceAI()
results = client.chat_completions_create(
"HuggingFaceH4/zephyr-7b-beta",
"meta-llama/Meta-Llama-3-8B-Instruct",
[
{
"role": "system",
@@ -96,7 +96,7 @@ print(results)
],
"created": 1701291672,
"id": "abf042d2-9159-49cb-9fd3-eef16feb246c",
"model": "HuggingFaceH4/zephyr-7b-beta",
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"object": "chat.completion",
"system_fingerprint": "eecec9d4-c28b-5a27-f90b-66c3fb6cee46",
"usage": {
@@ -113,7 +113,7 @@ We don't charge per token, so OpenAI “usage” metrics are not particularly re

!!!

The above is an example using our open-source AI SDK with zephyr-7b-beta, an incredibly popular and highly efficient 7 billion parameter model.
The above is an example using our open-source AI SDK with Meta-Llama-3-8B-Instruct, an incredibly popular and highly efficient 8 billion parameter model.

Notice there is near one to one relation between the parameters and return type of OpenAI’s `chat.completions.create` and our `chat_completion_create`.
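
To make the correspondence concrete, here's a minimal side-by-side sketch; the OpenAI call uses their standard Python SDK, and the message contents are purely illustrative:

```python
import pgml
from openai import OpenAI

# Illustrative messages; the exact content doesn't matter for the comparison.
messages = [
    {"role": "system", "content": "You are a friendly chatbot"},
    {"role": "user", "content": "What is a panda?"},
]

# OpenAI's client (reads OPENAI_API_KEY from the environment).
openai_results = OpenAI().chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
)

# PostgresML's open-source AI SDK: same messages, an open-weights model.
pgml_results = pgml.OpenSourceAI().chat_completions_create(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    messages,
)
```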

@@ -119,7 +119,7 @@ vars:
splitter_name: "recursive_character"
splitter_parameters: {"chunk_size": 100, "chunk_overlap": 20}
task: "embedding"
model_name: "intfloat/e5-base"
model_name: "intfloat/e5-small-v2"
query_string: 'Lorem ipsum 3'
limit: 2
```
@@ -129,7 +129,7 @@ Here's a summary of the key parameters:
* `splitter_name`: Specifies the name of the splitter, set as "recursive\_character".
* `splitter_parameters`: Defines the parameters for the splitter, such as a chunk size of 100 and a chunk overlap of 20.
* `task`: Indicates the task being performed, specified as "embedding".
* `model_name`: Specifies the name of the model to be used, set as "intfloat/e5-base".
* `model_name`: Specifies the name of the model to be used, set as "intfloat/e5-small-v2".
* `query_string`: Provides a query string, set as 'Lorem ipsum 3'.
* `limit`: Specifies a limit of 2, indicating the maximum number of results to be processed.

@@ -137,7 +137,7 @@ We can find a customer that our embeddings model feels is close to the sentiment
```postgresql
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'query: I love all Star Wars, but Empire Strikes Back is particularly amazing'
)::vector(1024) AS embedding
)
@@ -214,7 +214,7 @@ Now we can write our personalized SQL query. It's nearly the same as our query f
-- create a request embedding on the fly
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'query: Best 1980''s scifi movie'
)::vector(1024) AS embedding
),
@@ -127,9 +127,7 @@ cp .env.template .env
```bash
OPENAI_API_KEY=<OPENAI_API_KEY>
DATABASE_URL=<POSTGRES_DATABASE_URL starts with postgres://>
MODEL=hkunlp/instructor-xl
MODEL_PARAMS={"instruction": "Represent the document for retrieval: "}
QUERY_PARAMS={"instruction": "Represent the question for retrieving supporting documents: "}
MODEL=intfloat/e5-small-v2
SYSTEM_PROMPT=<> # System prompt used for OpenAI chat completion
BASE_PROMPT=<> # Base prompt used for OpenAI chat completion for each turn
SLACK_BOT_TOKEN=<SLACK_BOT_TOKEN> # Slack bot token to run Slack chat service
@@ -332,7 +330,7 @@ Once the discord app is running, you can interact with the chatbot on Discord as

### PostgresML vs. Hugging Face + Pinecone

To evaluate query latency, we performed an experiment with 10,000 Wikipedia documents from the SQuAD dataset. Embeddings were generated using the intfloat/e5-large model.
To evaluate query latency, we performed an experiment with 10,000 Wikipedia documents from the SQuAD dataset. Embeddings were generated using the intfloat/e5-small-v2 model.

For PostgresML, we used a GPU-powered serverless database running on NVIDIA A10G GPUs, with the client in the us-west-2 region. For HuggingFace, we used their inference API endpoint running on NVIDIA A10G GPUs in the us-east-1 region, with the client in the same region. Pinecone was used as the vector search index for the HuggingFace embeddings.

4 changes: 2 additions & 2 deletions pgml-cms/blog/speeding-up-vector-recall-5x-with-hnsw.md
@@ -45,7 +45,7 @@ Let's run that query again:
```postgresql
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'query: Best 1980''s scifi movie'
)::vector(1024) AS embedding
)
@@ -100,7 +100,7 @@ Now let's try the query again utilizing the new HNSW index we created.
```postgresql
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'query: Best 1980''s scifi movie'
)::vector(1024) AS embedding
)
4 changes: 2 additions & 2 deletions pgml-cms/blog/the-1.0-sdk-is-here.md
@@ -50,7 +50,7 @@ const pipeline = pgml.newPipeline("my_pipeline", {
text: {
splitter: { model: "recursive_character" },
semantic_search: {
model: "intfloat/e5-small",
model: "intfloat/e5-small-v2",
},
},
});
@@ -90,7 +90,7 @@ pipeline = Pipeline(
"text": {
"splitter": {"model": "recursive_character"},
"semantic_search": {
"model": "intfloat/e5-small",
"model": "intfloat/e5-small-v2",
},
},
},
@@ -124,7 +124,7 @@ We'll start with semantic search. Given a user query, e.g. "Best 1980's scifi mo
```postgresql
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'query: Best 1980''s scifi movie'
)::vector(1024) AS embedding
)
@@ -171,7 +171,7 @@ Generating a query plan more quickly and only computing the values once, may mak
There's some good stuff happening in those query results, so let's break it down:

* **It's fast** - We're able to generate a request embedding on the fly with a state-of-the-art model, and search 5M reviews in 152ms, including fetching the results back to the client 😍. You can't even generate an embedding from OpenAI's API in that time, much less search 5M reviews in some other database with it.
* **It's good** - The `review_body` results are very similar to the "Best 1980's scifi movie" request text. We're using the `intfloat/e5-large` open source embedding model, which outperforms OpenAI's `text-embedding-ada-002` in most [quality benchmarks](https://huggingface.co/spaces/mteb/leaderboard).
* **It's good** - The `review_body` results are very similar to the "Best 1980's scifi movie" request text. We're using the `intfloat/e5-small-v2` open source embedding model, which outperforms OpenAI's `text-embedding-ada-002` in most [quality benchmarks](https://huggingface.co/spaces/mteb/leaderboard).
* Qualitatively: the embeddings understand our request for `scifi` being equivalent to `Sci-Fi`, `sci-fi`, `SciFi`, and `sci fi`, as well as `1980's` matching `80s` and `80's` and is close to `seventies` (last place). We didn't have to configure any of this and the most enthusiastic for "best" is at the top, the least enthusiastic is at the bottom, so the model has appropriately captured "sentiment".
* Quantitatively: the `cosine_similarity` of all results is high and tight, 0.90-0.95 on a scale from -1 to 1 (a short sketch of this metric follows this list). We can be confident we recalled very similar results from our 5M candidates, even though it would take 485 times as long to check all of them directly.
* **It's reliable** - The model is stored in the database, so we don't need to worry about managing a separate service. If you repeat this query over and over, the timings will be extremely consistent, because we don't have to deal with things like random network congestion.
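
Here's the short sketch of that similarity metric: compute a query and a passage embedding with `pgml.embed`, then take their cosine similarity. It assumes a PostgresML connection string in `DATABASE_URL`, plus the `psycopg2` and `numpy` packages:

```python
import os

import numpy as np
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

def embed(text: str) -> np.ndarray:
    cur.execute("SELECT pgml.embed('intfloat/e5-small-v2', %s)", (text,))
    return np.array(cur.fetchone()[0])

query = embed("query: Best 1980's scifi movie")
passage = embed("passage: The Empire Strikes Back is the best 80s sci-fi movie")

# Cosine similarity ranges from -1 to 1; values near 1 mean "very similar".
similarity = query @ passage / (np.linalg.norm(query) * np.linalg.norm(passage))
print(round(float(similarity), 4))
```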
@@ -254,7 +254,7 @@ Now we can quickly search for movies by what people have said about them:
```postgresql
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'Best 1980''s scifi movie'
)::vector(1024) AS embedding
)
@@ -312,7 +312,7 @@ SET ivfflat.probes = 300;
```postgresql
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'Best 1980''s scifi movie'
)::vector(1024) AS embedding
)
@@ -401,7 +401,7 @@ SET ivfflat.probes = 1;
```postgresql
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'query: Best 1980''s scifi movie'
)::vector(1024) AS embedding
)
@@ -457,7 +457,7 @@ SQL is a very expressive language that can handle a lot of complexity. To keep t
-- create a request embedding on the fly
WITH request AS (
SELECT pgml.embed(
'intfloat/e5-large',
'intfloat/e5-small-v2',
'query: Best 1980''s scifi movie'
)::vector(1024) AS embedding
),
@@ -58,7 +58,7 @@ class EmbedSmallExpression(models.Expression):
self.embedding_field = field

def as_sql(self, compiler, connection, template=None):
return f"pgml.embed('intfloat/e5-small', {self.embedding_field})", None
return f"pgml.embed('intfloat/e5-small-v2', {self.embedding_field})", None
```

And that's it! In just a few lines of code, we're generating and storing high quality embeddings automatically in our database. No additional setup is required, and all the AI complexity is taken care of by PostgresML.
@@ -70,7 +70,7 @@ Django Rest Framework provides the bulk of the implementation. We just added a `M
```python
results = TodoItem.objects.annotate(
similarity=RawSQL(
"pgml.embed('intfloat/e5-small', %s)::vector(384) &#x3C;=> embedding",
"pgml.embed('intfloat/e5-small-v2', %s)::vector(384) &#x3C;=> embedding",
[query],
)
).order_by("similarity")
@@ -115,7 +115,7 @@ In return, you'll get your to-do item alongside the embedding of the `descriptio

The embedding contains 384 floating point numbers; we removed most of them in this blog post to make sure it fits on the page.

You can try creating multiple to-do items for fun and profit. If the description is changed, so will the embedding, demonstrating how the `intfloat/e5-small` model understands the semantic meaning of your text.
You can try creating multiple to-do items for fun and profit. If the description is changed, so will the embedding, demonstrating how the `intfloat/e5-small-v2` model understands the semantic meaning of your text.

### Searching

6 changes: 3 additions & 3 deletions pgml-cms/docs/api/client-sdk/README.md
@@ -80,7 +80,7 @@ const pipeline = pgml.newPipeline("sample_pipeline", {
text: {
splitter: { model: "recursive_character" },
semantic_search: {
model: "intfloat/e5-small",
model: "intfloat/e5-small-v2",
},
},
});
@@ -98,7 +98,7 @@ pipeline = Pipeline(
"text": {
"splitter": { "model": "recursive_character" },
"semantic_search": {
"model": "intfloat/e5-small",
"model": "intfloat/e5-small-v2",
},
},
},
Expand All @@ -111,7 +111,7 @@ await collection.add_pipeline(pipeline)

The pipeline configuration is a key/value object, where the key is the name of a column in a document, and the value is the action the SDK should perform on that column.

In this example, the documents contain a column called `text` which we are instructing the SDK to chunk the contents of using the recursive character splitter, and to embed those chunks using the Hugging Face `intfloat/e5-small` embeddings model.
In this example, the documents contain a column called `text`. We instruct the SDK to chunk its contents with the recursive character splitter and to embed those chunks using the Hugging Face `intfloat/e5-small-v2` embeddings model.
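
The same pattern extends to multiple columns. As an illustrative sketch (the `title` and `text` column names are hypothetical, and the `Pipeline` import is assumed to come from the `pgml` Python package), a pipeline could full-text index a short column while chunking and embedding a longer one:

```python
from pgml import Pipeline

pipeline = Pipeline(
    "sample_pipeline_v2",  # hypothetical pipeline name
    {
        "title": {
            # Full-text index the title without chunking or embedding it.
            "full_text_search": {"configuration": "english"},
        },
        "text": {
            # Chunk the text column, then embed each chunk.
            "splitter": {"model": "recursive_character"},
            "semantic_search": {"model": "intfloat/e5-small-v2"},
        },
    },
)
```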

### Add documents

14 changes: 4 additions & 10 deletions pgml-cms/docs/api/client-sdk/document-search.md
@@ -10,17 +10,14 @@ This section will assume we have previously run the following code:
const pipeline = pgml.newPipeline("test_pipeline", {
abstract: {
semantic_search: {
model: "intfloat/e5-small",
model: "intfloat/e5-small-v2",
},
full_text_search: { configuration: "english" },
},
body: {
splitter: { model: "recursive_character" },
semantic_search: {
model: "hkunlp/instructor-base",
parameters: {
instruction: "Represent the Wikipedia document for retrieval: ",
}
model: "intfloat/e5-small-v2",
},
},
});
@@ -36,17 +33,14 @@ pipeline = Pipeline(
{
"abstract": {
"semantic_search": {
"model": "intfloat/e5-small",
"model": "intfloat/e5-small-v2",
},
"full_text_search": {"configuration": "english"},
},
"body": {
"splitter": {"model": "recursive_character"},
"semantic_search": {
"model": "hkunlp/instructor-base",
"parameters": {
"instruction": "Represent the Wikipedia document for retrieval: ",
},
"model": "intfloat/e5-small-v2",
},
},
},