Skip to content

Commit f49dc6b

Browse files
author
Montana Low
committed
sync title
1 parent 3df72c8 commit f49dc6b

File tree

1 file changed

+118
-0
lines changed

1 file changed

+118
-0
lines changed
Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
<h1>Postgres Full Text Search is <del>Good Enough</del> the Best!</h1>
2+
3+
<p class="author">
4+
<img width="54px" height="54px" src="/images/team/montana.jpg" />
5+
Montana Low<br/>
6+
August 25, 2022
7+
</p>
8+
9+
Normalized data is a powerful tool leveraged by 10x engineering organizations. If you haven't read [Postgres Full Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/) you should, unless you're willing to take that statement at face value, without the code samples to prove it. We'll go beyond that claim in this post, but to reiterate the main points, Postgres supports:
10+
11+
- Stemming
12+
- Ranking / Boost
13+
- Support Multiple languages
14+
- Fuzzy search for misspelling
15+
- Accent support
16+
17+
This is good enough for most of the use cases out there, without introducing any additional concerns to your application. But, if you've ever tried to deliver relevant search results at scale, you'll realize that you need a lot more than these fundamentals. ElasticSearch has all kinds of best in class features, like a modified version of BM25 that is state of the art (developed in the 1970's), which is one of the many features you need beyond the Term Frequency (TF) based ranking that Postgres uses... but, _the ElasticSearch approach is a dead end_ for 2 reasons:
18+
19+
1. Trying to improve search relevance with statistics like TF-IDF and BM25 is like trying to make a flying car. What you want is a helicopter instead.
20+
2. Computing inverse document frequency for BM25 brutalizes your search indexing performance, which leads to a [host of follow on issues via distributed computation](https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing), for the originally dubious reason.
21+
22+
<figure markdown>
23+
<center markdown>
24+
![Flying Car](/blog/images/delorean.jpg)
25+
</center>
26+
<figcaption>What we were promised</figcaption>
27+
</figure>
28+
29+
Academics have spent decades inventing many algorithms that use orders of magnitude more compute eking out marginally better results that often aren't worth it in practice. Not to generally disparage academia, their work has consistently improved our world, but we need to pay attention to tradeoffs. SQL is another acronym similiarly pioneered in the 1970's. One difference between SQL and BM25 is that everyone has heard of the former before reading this blog post, for good reason.
30+
31+
If you actually want to meaningfully improve search results, you generally need to add new data sources. Relevance is much more often revealed by the way other things **_relate_** to the document, rather than the content of the document itself. Google proved the point 23 years ago. Pagerank doesn't rely on the page content itself as much as it uses metadata from _links to the pages_. We live in a connected world and it's the interplay among things that reveal their relevance, whether that is links for websites, sales for products, shares for social posts... It's the greater context around the document that matters.
32+
33+
> _If you want to improve your search results, don't rely on expensive O(n*m) word frequency statistics. Get new sources of data instead. It's the relational nature of relevance that underpins why a relational database forms the ideal search engine._
34+
35+
Postgres made the right call to avoid the costs required to compute Inverse Document Frequency in their search indexing, given its meager benefit. Instead, it offers the most feature-complete relational data platform. [Elasticsearch will tell you](https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html) you can't join data in a **_naively distributed system_** at read time, because it is prohibitively expensive. Instead you'll have to join the data eagerly at indexing time, which is even more prohibitively expensive. That's good for their business since you're the one paying for it, and it will scale until you're bankrupt.
36+
37+
What you really should do, is leave the data normalized inside Postgres, which will allow you to join additional, related data at query time. It will take multiple orders of magnitude less compute to index and search a normalized corpus, meaning you'll have a lot longer (potentially forever) before you need to distribute your workload, and then maybe you can do that intelligently instead of naively. Instead of spending your time building and maintaining pipelines to shuffle updates between systems, you can work on new sources of data to really improve relevance.
38+
39+
With PostgresML, you can now skip straight to full on machine learning when you have the related data. You can load your feature store into the same database as your search corpus. Each data source can live in its own independent table, with its own update cadence, rather than having to reindex and denormalize entire documents back to ElasticSearch, or worse, large portions of the entire corpus, when a single thing changes.
40+
41+
With a single SQL query, you can do multiple passes of re-ranking, pruning and personalization to refine a search relevance score.
42+
43+
- basic term relevance
44+
- embedding similarities
45+
- XGBoost or LightGBM inference
46+
47+
These queries can execute in milliseconds on large production-sized corpora with Postgres's multiple indexing strategies. You can do all of this without adding any new infrastructure to your stack.
48+
49+
The following full blown example is for demonstration purposes only of a 3rd generation search engine. You can test it for real in the PostgresML Gym to build up a complete understanding.
50+
51+
<center markdown>
52+
[Try the PostgresML Gym](https://gym.postgresml.org/){ .md-button .md-button--primary }
53+
</center>
54+
55+
```sql title="search.sql" linenums="1"
56+
WITH query AS (
57+
-- construct a query context with arguments that would typically be
58+
-- passed in from the application layer
59+
SELECT
60+
-- a keyword query for "my" OR "search" OR "terms"
61+
tsquery('my | search | terms') AS keywords,
62+
-- a user_id for personalization later on
63+
123456 AS user_id
64+
),
65+
first_pass AS (
66+
SELECT *,
67+
-- calculate the term frequency of keywords in the document
68+
ts_rank(documents.full_text, keywords) AS term_frequency
69+
-- our basic corpus is stored in the documents table
70+
FROM documents
71+
-- that match the query keywords defined above
72+
WHERE documents.full_text @@ query.keywords
73+
-- ranked by term frequency
74+
ORDER BY term_frequency DESC
75+
-- prune to a reasonably large candidate population
76+
LIMIT 10000
77+
),
78+
second_pass AS (
79+
SELECT *,
80+
-- create a second pass score of cosine_similarity across embeddings
81+
pgml.cosine_similarity(document_embeddings.vector, user_embeddings.vector) AS similarity_score
82+
FROM first_pass
83+
-- grab more data from outside the documents
84+
JOIN document_embeddings ON document_embeddings.document_id = documents.id
85+
JOIN user_embeddings ON user_embeddings.user_id = query.user_id
86+
-- of course we be re-ranking
87+
ORDER BY similarity_score DESC
88+
-- further prune results to top performers for more expensive ranking
89+
LIMIT 1000
90+
),
91+
third_pass AS (
92+
SELECT *,
93+
-- create a final score using xgboost
94+
pgml.predict('search relevance model', ARRAY[session_level_features.*]) AS final_score
95+
FROM second_pass
96+
JOIN session_level_features ON session_level_features.user_id = query.user_id
97+
)
98+
SELECT *
99+
FROM third_pass
100+
ORDER BY final_score DESC
101+
LIMIT 100;
102+
```
103+
104+
If you'd like to play through an interactive notebook to generate models for search relevance in a Postgres database, try it in the Gym. An exercise for the curious reader, would be to combine all three scores above into a single algebraic function for ranking, and then into a fourth learned model...
105+
106+
<center>
107+
<video controls autoplay loop muted width="90%" style="box-shadow: 0 0 8px #000;">
108+
<source src="https://static.postgresml.org/postgresml-org-static/gym_demo.webm" type="video/webm">
109+
<source src="https://static.postgresml.org/postgresml-org-static/gym_demo.mp4" type="video/mp4">
110+
<img src="/images/demos/gym_demo.png" alt="PostgresML in practice" loading="lazy">
111+
</video>
112+
</center>
113+
114+
<center markdown>
115+
[Try the PostgresML Gym](https://gym.postgresml.org/){ .md-button .md-button--primary }
116+
</center>
117+
118+
Many thanks and ❤️ to all those who are supporting this endeavor. We’d love to hear feedback from the broader ML and Engineering community about applications and other real world scenarios to help prioritize our work.

0 commit comments

Comments
 (0)