
Commit a5d9eca

Korvus x Firecrawl blog post (postgresml#1600)
1 parent 7b6069b commit a5d9eca

File tree: 3 files changed, +235 −0 lines changed

pgml-cms/blog/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 # Table of contents

 * [Home](README.md)
+* [Korvus x Firecrawl: Rag in a single query](korvus-firecrawl-rag-in-a-single-query.md)
 * [A Speed Comparison of the Most Popular Retrieval Systems for RAG](a-speed-comparison-of-the-most-popular-retrieval-systems-for-rag.md)
 * [Korvus The All-in-One RAG Pipeline for PostgresML](introducing-korvus-the-all-in-one-rag-pipeline-for-postgresml.md)
 * [Semantic Search in Postgres in 15 Minutes](semantic-search-in-postgres-in-15-minutes.md)
pgml-cms/blog/korvus-firecrawl-rag-in-a-single-query.md

Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
---
description: How to perform all-in-one RAG over any website with Firecrawl and Korvus.
featured: true
tags: [engineering]
image: ".gitbook/assets/Blog-Image_Korvus-Firecrawl.jpg"
---

# Korvus x Firecrawl: RAG in a single query

<div align="left">

<figure><img src=".gitbook/assets/silas.jpg" alt="Author" width="100"><figcaption></figcaption></figure>

</div>

Silas Marvin

July 30, 2024

We’re excited to share a quick guide on how you can use the power of Korvus’ single-query RAG along with Firecrawl to quickly and easily stand up a retrieval augmented generation system with data from any website.

You’ll learn how to:

1. Use Firecrawl to efficiently scrape web content (we’re using our blog as an example)
2. Process and index the scraped data using Korvus's Pipeline and Collection
3. Perform vector search, text generation and reranking (RAG) in a single query, using open-source models

[Firecrawl](https://firecrawl.dev) is a nifty web scraper that turns websites into clean, structured markdown data — perfect for creating a knowledge base for RAG applications.

[Korvus](https://github.com/postgresml/korvus) is the Python, JavaScript, Rust or C SDK for PostgresML. It handles the heavy lifting of document processing, vector search, and response generation in a single query.

[PostgresML](https://postgresml.org) is an in-database ML/AI engine built by the ML engineers at Instacart. It lets you train, test and deploy models right inside Postgres. With Korvus, you can get all the efficiencies of in-database machine learning without SQL or database management.

These three tools are all you’ll need to deploy a flexible and powerful RAG stack grounded in web data. Since your data is stored right where you're performing inference, you won’t need a vector database or an additional framework like LlamaIndex or Langchain to tie everything together. Mo’ microservices = more problems.

Let’s dive in!

## Getting Started

To follow along, you will need to set both the `FIRECRAWL_API_KEY` and `KORVUS_DATABASE_URL` environment variables.

Sign up at [firecrawl.dev](https://www.firecrawl.dev/) to get your `FIRECRAWL_API_KEY`.

The easiest way to get your `KORVUS_DATABASE_URL` is by signing up at [postgresml.org](https://postgresml.org), but you can also host Postgres with the `pgml` and `pgvector` extensions yourself.

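Before running anything, it can be worth a quick sanity check that both variables are visible to your Python process. This is just a minimal sketch using the standard library, not part of the original example:

```python
import os

# Fail fast if either required environment variable is missing
for var in ("FIRECRAWL_API_KEY", "KORVUS_DATABASE_URL"):
    if not os.environ.get(var):
        raise RuntimeError(f"Please set the {var} environment variable")
```
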
### Some Imports

First, let's break down the initial setup and imports:

```python
from korvus import Collection, Pipeline
from firecrawl import FirecrawlApp
import os
import time
import asyncio
from rich import print

# Initialize the FirecrawlApp with your API key
firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
```

Here we're importing `korvus`, `firecrawl`, and some other convenient libraries, and initializing the `FirecrawlApp` with an API key stored in an environment variable. This setup allows us to use Firecrawl for web scraping.

### Defining the Pipeline and Collection

Next, we define our Pipeline and Collection:

```python
pipeline = Pipeline(
    "v0",
    {
        "markdown": {
            "splitter": {"model": "markdown"},
            "semantic_search": {
                "model": "mixedbread-ai/mxbai-embed-large-v1",
            },
        },
    },
)
collection = Collection("fire-crawl-demo-v0")

# Add our Pipeline to our Collection
async def add_pipeline():
    await collection.add_pipeline(pipeline)
```

This Pipeline configuration tells Korvus how to process our documents. It specifies that we'll be working with markdown content, using a markdown-specific splitter, and the `mixedbread-ai/mxbai-embed-large-v1` model for semantic search embeddings.

See the [Korvus guide to constructing Pipelines](https://postgresml.org/docs/open-source/korvus/guides/constructing-pipelines) for more information on Collections and Pipelines.

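In this post we register the Pipeline inside `main()` below, but if you prefer to do that registration as a standalone setup step, a minimal sketch could be:

```python
# Register the Pipeline with the Collection as a one-off setup step
# (the full example below calls add_pipeline() inside main() instead)
asyncio.run(add_pipeline())
```
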
### Web Crawling with Firecrawl

The `crawl()` function demonstrates how to use Firecrawl to scrape a website:

```python
def crawl():
    crawl_url = "https://postgresml.org/blog"
    params = {
        "crawlerOptions": {
            "excludes": [],
            "includes": ["blog/*"],
            "limit": 250,
        },
        "pageOptions": {"onlyMainContent": True},
    }
    job = firecrawl.crawl_url(crawl_url, params=params, wait_until_done=False)
    while True:
        print("Scraping...")
        status = firecrawl.check_crawl_status(job["jobId"])
        if not status["status"] == "active":
            break
        time.sleep(5)
    return status
```

This function initiates a crawl of the PostgresML blog, focusing on blog posts and limiting the crawl to 250 pages. It then periodically checks the status of the crawl job until it's complete.

As an alternative to sleeping, we could set the `wait_until_done` parameter to `True` and the `crawl_url` method would block until the data is ready.

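A minimal sketch of that blocking variant is below. It reuses the same URL and parameters; note that, depending on the Firecrawl SDK version, the blocking call may return the scraped data directly rather than a status object, so `main()` would need a small adjustment:

```python
def crawl_blocking():
    # Same crawl as above, but crawl_url blocks until the job finishes,
    # so there is no need to poll check_crawl_status ourselves
    params = {
        "crawlerOptions": {"excludes": [], "includes": ["blog/*"], "limit": 250},
        "pageOptions": {"onlyMainContent": True},
    }
    return firecrawl.crawl_url(
        "https://postgresml.org/blog", params=params, wait_until_done=True
    )
```
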
### Processing and Indexing the Crawled Data

After crawling the website, we need to process and index the data for efficient searching. This is done in the `main()` function:

```python
async def main():
    # Add our Pipeline to our Collection
    await add_pipeline()

    # Crawl the website
    results = crawl()

    # Construct our documents to upsert
    documents = [
        {"id": data["metadata"]["sourceURL"], "markdown": data["markdown"]}
        for data in results["data"]
    ]

    # Upsert our documents
    await collection.upsert_documents(documents)
```

This code does the following:
1. Adds the previously defined pipeline to our collection.
2. Crawls the website using the `crawl()` function.
3. Constructs a list of documents from the crawled data, using the source URL as the ID and the markdown content as the document text.
4. Upserts these documents into the collection. The pipeline automatically splits the markdown and generates embeddings for each chunk, storing it all in Postgres.

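For clarity, here is the shape of a single document we upsert; the values are purely illustrative:

```python
# Illustrative example of one upserted document (values are made up)
example_document = {
    "id": "https://postgresml.org/blog/some-post",  # the page's sourceURL
    "markdown": "# Some post\n\nThe page content as markdown...",  # scraped content
}
```
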
### Performing RAG

With our data indexed, we can now perform RAG:

```python
async def do_rag(user_query):
    results = await collection.rag(
        {
            "CONTEXT": {
                "vector_search": {
                    "query": {
                        "fields": {
                            "markdown": {
                                "query": user_query,
                                "parameters": {
                                    "prompt": "Represent this sentence for searching relevant passages: "
                                },
                            }
                        },
                    },
                    "document": {"keys": ["id"]},
                    "rerank": {
                        "model": "mixedbread-ai/mxbai-rerank-base-v1",
                        "query": user_query,
                        "num_documents_to_rerank": 100,
                    },
                    "limit": 5,
                },
                "aggregate": {"join": "\n\n\n"},
            },
            "chat": {
                "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a question and answering bot. Answer the users question given the context succinctly.",
                    },
                    {
                        "role": "user",
                        "content": f"Given the context\n<context>\n:{{CONTEXT}}\n</context>\nAnswer the question: {user_query}",
                    },
                ],
                "max_tokens": 256,
            },
        },
        pipeline,
    )
    return results
```

This function combines vector search, reranking, and text generation to provide context-aware answers to user queries. It uses the Meta-Llama-3.1-405B-Instruct model for text generation.

This query can be broken down into 4 steps:
1. Perform a vector search, finding the 100 best-matching chunks for the `user_query`
2. Rerank the results of the vector search using the `mixedbread-ai/mxbai-rerank-base-v1` cross-encoder and limit the results to 5
3. Join the reranked results with `\n\n\n` and substitute them in place of the `{{CONTEXT}}` placeholder in the messages
4. Perform text generation with `meta-llama/Meta-Llama-3.1-405B-Instruct`

This is a complex query, and there are many more options and parameters that can be tuned. See the [Korvus guide to RAG](https://postgresml.org/docs/open-source/korvus/guides/rag) for more information on the `rag` method.

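If you want to test `do_rag` on its own (assuming the collection has already been crawled and indexed by `main()` above), a quick one-off call might look like this; the sample question is just an illustration:

```python
# One-off RAG call outside the interactive loop (sample question is illustrative)
answer = asyncio.run(do_rag("What is Korvus?"))
print(answer)
```
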
### All Together Now

To tie everything together, we use an interactive loop in our `main()` function:

```python
async def main():
    # ... (previous code for setup and indexing)

    # Now we can search
    while True:
        user_query = input("\n\nquery > ")
        if user_query == "q":
            break
        results = await do_rag(user_query)
        print(results)

asyncio.run(main())
```

This loop allows users to input queries and receive RAG-powered responses based on the crawled and indexed content from the PostgresML blog.

## Wrapping up

We've demonstrated how to create a powerful RAG system using [Firecrawl](https://firecrawl.dev) and [Korvus](https://github.com/postgresml/korvus) – but it’s just a small example of the simplicity of doing RAG in-database, with fewer microservices.

It’s faster, cheaper and easier to manage than the common approach to RAG (Vector DB + frameworks + moving your data to the models). But don’t take our word for it. Try out Firecrawl and Korvus on PostgresML, and see the performance benefits yourself. And as always, let us know what you think.
