---
description: How to perform all-in-one RAG over any website with Firecrawl and Korvus.
featured: true
tags: [engineering]
image: ".gitbook/assets/Blog-Image_Korvus-Firecrawl.jpg"
---

# Korvus x Firecrawl: RAG in a single query

<div align="left">

<figure><img src=".gitbook/assets/silas.jpg" alt="Author" width="100"><figcaption></figcaption></figure>

</div>

Silas Marvin

July 30, 2024

We’re excited to share a quick guide on how you can use the power of Korvus’ single-query RAG along with Firecrawl to quickly and easily stand up a retrieval augmented generation system with data from any website.

You’ll learn how to:

1. Use Firecrawl to efficiently scrape web content (we’re using our blog as an example)
2. Process and index the scraped data using Korvus's Pipeline and Collection
3. Perform vector search, text generation and reranking (RAG) in a single query, using open-source models

[Firecrawl](https://firecrawl.dev) is a nifty web scraper that turns websites into clean, structured markdown data — perfect to create a knowledge base for RAG applications.

[Korvus](https://github.com/postgresml/korvus) is the Python, JavaScript, Rust or C SDK for PostgresML. It handles the heavy lifting of document processing, vector search, and response generation in a single query.

[PostgresML](https://postgresml.org) is an in-database ML/AI engine built by the ML engineers at Instacart. It lets you train, test and deploy models right inside Postgres. With Korvus, you can get all the efficiencies of in-database machine learning without SQL or database management.

These three tools are all you’ll need to deploy a flexible and powerful RAG stack grounded in web data. Since your data is stored right where you're performing inference, you won’t need a vector database or an additional framework like LlamaIndex or Langchain to tie everything together. Mo’ microservices = more problems.

Let’s dive in!

## Getting Started

To follow along, you will need to set both the `FIRECRAWL_API_KEY` and `KORVUS_DATABASE_URL` environment variables.

Sign up at [firecrawl.dev](https://www.firecrawl.dev/) to get your `FIRECRAWL_API_KEY`.

The easiest way to get your `KORVUS_DATABASE_URL` is by signing up at [postgresml.org](https://postgresml.org), but you can also host Postgres with the `pgml` and `pgvector` extensions yourself.
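
If you want a quick sanity check before running anything, a small snippet like the one below will fail early when either variable is missing. This is just a convenience, not part of the Korvus or Firecrawl APIs:

```python
import os

# Fail fast if either required environment variable is missing
for var in ("FIRECRAWL_API_KEY", "KORVUS_DATABASE_URL"):
    if not os.environ.get(var):
        raise RuntimeError(f"Please set the {var} environment variable")
```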

### Some Imports

First, let's break down the initial setup and imports:

```python
from korvus import Collection, Pipeline
from firecrawl import FirecrawlApp
import os
import time
import asyncio
from rich import print

# Initialize the FirecrawlApp with your API key
firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
```

Here we're importing `korvus`, `firecrawl`, and some other convenient libraries, and initializing the `FirecrawlApp` with an API key stored in an environment variable. This setup allows us to use Firecrawl for web scraping.

### Defining the Pipeline and Collection

Next, we define our Pipeline and Collection:

```python
pipeline = Pipeline(
    "v0",
    {
        "markdown": {
            "splitter": {"model": "markdown"},
            "semantic_search": {
                "model": "mixedbread-ai/mxbai-embed-large-v1",
            },
        },
    },
)
collection = Collection("fire-crawl-demo-v0")

# Add our Pipeline to our Collection
async def add_pipeline():
    await collection.add_pipeline(pipeline)
```

This Pipeline configuration tells Korvus how to process our documents. It specifies that we'll be working with markdown content, using a markdown-specific splitter, and the `mixedbread-ai/mxbai-embed-large-v1` model for semantic search embeddings.

See the [Korvus guide to constructing Pipelines](https://postgresml.org/docs/open-source/korvus/guides/constructing-pipelines) for more information on Collections and Pipelines.

### Web Crawling with Firecrawl

The `crawl()` function demonstrates how to use Firecrawl to scrape a website:

```python
def crawl():
    crawl_url = "https://postgresml.org/blog"
    params = {
        "crawlerOptions": {
            "excludes": [],
            "includes": ["blog/*"],
            "limit": 250,
        },
        "pageOptions": {"onlyMainContent": True},
    }
    job = firecrawl.crawl_url(crawl_url, params=params, wait_until_done=False)
    while True:
        print("Scraping...")
        status = firecrawl.check_crawl_status(job["jobId"])
        if status["status"] != "active":
            break
        time.sleep(5)
    return status
```

This function initiates a crawl of the PostgresML blog, focusing on blog posts and limiting the crawl to 250 pages. It then periodically checks the status of the crawl job until it's complete.

Alternatively, instead of polling and sleeping, we could set the `wait_until_done` parameter to `True`, and the `crawl_url` method would block until the data is ready.
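
A minimal sketch of that blocking variant, reusing the same options, might look like this. Note that in this mode the return value is the scraped data itself rather than a job to poll, so check the Firecrawl docs for the exact shape before wiring it into `main()`:

```python
def crawl_blocking():
    crawl_url = "https://postgresml.org/blog"
    params = {
        "crawlerOptions": {"includes": ["blog/*"], "limit": 250},
        "pageOptions": {"onlyMainContent": True},
    }
    # Blocks until the crawl finishes and returns the scraped pages directly
    return firecrawl.crawl_url(crawl_url, params=params, wait_until_done=True)
```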

### Processing and Indexing the Crawled Data

After crawling the website, we need to process and index the data for efficient searching. This is done in the `main()` function:

```python
async def main():
    # Add our Pipeline to our Collection
    await add_pipeline()

    # Crawl the website
    results = crawl()

    # Construct our documents to upsert
    documents = [
        {"id": data["metadata"]["sourceURL"], "markdown": data["markdown"]}
        for data in results["data"]
    ]

    # Upsert our documents
    await collection.upsert_documents(documents)
```

This code does the following:
1. Adds the previously defined pipeline to our collection.
2. Crawls the website using the `crawl()` function.
3. Constructs a list of documents from the crawled data, using the source URL as the ID and the markdown content as the document text.
4. Upserts these documents into the collection. The pipeline automatically splits the markdown and generates embeddings for each chunk, storing it all in Postgres.
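
If you want to confirm the upsert worked, you can read a few documents back out of the collection. This is an optional sanity check; the exact `get_documents` arguments and result shape shown here are assumptions, so consult the Korvus docs if they differ:

```python
async def check_upsert():
    # Fetch a few stored documents to confirm the upsert succeeded
    documents = await collection.get_documents({"limit": 3})
    for document in documents:
        print(document["document"]["id"])
```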

### Performing RAG

With our data indexed, we can now perform RAG:

```python
async def do_rag(user_query):
    results = await collection.rag(
        {
            "CONTEXT": {
                "vector_search": {
                    "query": {
                        "fields": {
                            "markdown": {
                                "query": user_query,
                                "parameters": {
                                    "prompt": "Represent this sentence for searching relevant passages: "
                                },
                            }
                        },
                    },
                    "document": {"keys": ["id"]},
                    "rerank": {
                        "model": "mixedbread-ai/mxbai-rerank-base-v1",
                        "query": user_query,
                        "num_documents_to_rerank": 100,
                    },
                    "limit": 5,
                },
                "aggregate": {"join": "\n\n\n"},
            },
            "chat": {
                "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a question and answering bot. Answer the users question given the context succinctly.",
                    },
                    {
                        "role": "user",
                        "content": f"Given the context\n<context>\n:{{CONTEXT}}\n</context>\nAnswer the question: {user_query}",
                    },
                ],
                "max_tokens": 256,
            },
        },
        pipeline,
    )
    return results
```

This function combines vector search, reranking, and text generation to provide context-aware answers to user queries. It uses the Meta-Llama-3.1-405B-Instruct model for text generation.

This query can be broken down into 4 steps:
1. Perform vector search, finding the 100 best-matching chunks for the `user_query`
2. Rerank the results of the vector search using the `mixedbread-ai/mxbai-rerank-base-v1` cross-encoder and limit the results to 5
3. Join the reranked results with `\n\n\n` and substitute them in place of the `{{CONTEXT}}` placeholder in the messages
4. Perform text generation with `meta-llama/Meta-Llama-3.1-405B-Instruct`

This is a complex query, and there are more options and parameters that can be tuned. See the [Korvus guide to RAG](https://postgresml.org/docs/open-source/korvus/guides/rag) for more information on the `rag` method.
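
As a rough sketch of calling it outside the interactive loop below: the generated text typically comes back under a `rag` key alongside the retrieved sources, but treat those key names as assumptions and check the structure of `results` against the Korvus RAG guide:

```python
async def ask(question):
    results = await do_rag(question)
    # Assumed result shape: generated answers under "rag", retrieved chunks under "sources"
    print(results["rag"][0])

# Example (hypothetical question): asyncio.run(ask("What is PostgresML?"))
```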

### All Together Now

To tie everything together, we use an interactive loop in our `main()` function:

```python
async def main():
    # ... (previous code for setup and indexing)

    # Now we can search
    while True:
        user_query = input("\n\nquery > ")
        if user_query == "q":
            break
        results = await do_rag(user_query)
        print(results)

asyncio.run(main())
```

This loop allows users to input queries and receive RAG-powered responses based on the crawled and indexed content from the PostgresML blog.

## Wrapping up

We've demonstrated how to create a powerful RAG system using [Firecrawl](https://firecrawl.dev) and [Korvus](https://github.com/postgresml/korvus) – but it’s just a small example of the simplicity of doing RAG in-database, with fewer microservices.

It’s faster, cheaper and easier to manage than the common approach to RAG (Vector DB + frameworks + moving your data to the models). But don’t take our word for it. Try out Firecrawl and Korvus on PostgresML, and see the performance benefits yourself. And as always, let us know what you think.