Skip to content

pgml.transform() docs fixes #1428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions pgml-cms/docs/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,10 @@
* [SQL extension](api/sql-extension/README.md)
* [pgml.embed()](api/sql-extension/pgml.embed.md)
* [pgml.transform()](api/sql-extension/pgml.transform/README.md)
* [Fill Mask](api/sql-extension/pgml.transform/fill-mask.md)
* [Question Answering](api/sql-extension/pgml.transform/question-answering.md)
* [Fill-Mask](api/sql-extension/pgml.transform/fill-mask.md)
* [Question answering](api/sql-extension/pgml.transform/question-answering.md)
* [Summarization](api/sql-extension/pgml.transform/summarization.md)
* [Text Classification](api/sql-extension/pgml.transform/text-classification.md)
* [Text classification](api/sql-extension/pgml.transform/text-classification.md)
* [Text Generation](api/sql-extension/pgml.transform/text-generation.md)
* [Text-to-Text Generation](api/sql-extension/pgml.transform/text-to-text-generation.md)
* [Token Classification](api/sql-extension/pgml.transform/token-classification.md)
Expand Down
10 changes: 2 additions & 8 deletions pgml-cms/docs/api/sql-extension/pgml.embed.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ pgml.embed(
|----------|-------------|---------|
| transformer | The name of a Hugging Face embedding model. | `intfloat/e5-large-v2` |
| text | The text to embed. This can be a string or the name of a column from a PostgreSQL table. | `'I am your father, Luke'` |
| kwargs | Additional arguments that are passed to the model. | |
| kwargs | Additional arguments that are passed to the model during inference. | |

### Examples

Expand All @@ -43,7 +43,7 @@ SELECT * FROM pgml.embed(
{% endtab %}
{% endtabs %}

#### Generate embeddings from a table
#### Generate embeddings inside a table

SQL functions can be used as part of a query to insert, update, or even automatically generate column values of any table:

Expand Down Expand Up @@ -96,9 +96,3 @@ LIMIT 1;
{% endtabs %}

This query will return the quote with the most similar meaning to `'Feel the force!'` by generating an embedding of that quote and comparing it to all other embeddings in the table, using vector cosine similarity as the measure of distance.

## Performance

First time `pgml.embed()` is called with a new model, it is downloaded from Hugging Face and saved in the cache directory. Subsequent calls will use the cached model, which is faster, and if the connection to the database is kept open, the model will be reused across multiple queries without being unloaded from memory.

If a GPU is available, the model will be automatically loaded onto the GPU and the embedding generation will be even faster.
81 changes: 60 additions & 21 deletions pgml-cms/docs/api/sql-extension/pgml.transform/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ The `pgml.transform()` function comes in two flavors, task-based and model-based

### Task-based API

The task-based API automatically chooses a model to use based on the task:
The task-based API automatically chooses a model based on the task:

```postgresql
pgml.transform(
Expand All @@ -37,22 +37,34 @@ pgml.transform(
)
```

| Argument | Description | Example |
|----------|-------------|---------|
| task | The name of a natural language processing task. | `text-generation` |
| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
| inputs | Array of prompts to pass to the model for inference. | `['Once upon a time...']` |
| Argument | Description | Example | Required |
|----------|-------------|---------|----------|
| task | The name of a natural language processing task. | `'text-generation'` | Required |
| args | Additional kwargs to pass to the pipeline. | `'{"max_new_tokens": 50}'::JSONB` | Optional |
| inputs | Array of prompts to pass to the model for inference. Each prompt is evaluated independently and a separate result is returned. | `ARRAY['Once upon a time...']` | Required |

#### Example
#### Examples

{% tabs %}
{% tab title="SQL" %}
{% tabs %}
{% tab title="Text generation" %}

```postgresql
SELECT *
FROM pgml.transform(
task => 'text-generation',
inputs => ARRAY['In a galaxy far far away']
);
```

{% endtab %}
{% tab title="Translation" %}

```postgresql
SELECT *
FROM pgml.transform (
'translation_en_to_fr',
'How do I say hello in French?',
FROM pgml.transform(
task => 'translation_en_to_fr',
inputs => ARRAY['How do I say hello in French?']
);
```

Expand All @@ -61,7 +73,7 @@ FROM pgml.transform (

### Model-based API

The model-based API requires the name of the model and the task, passed as a JSON object, which allows it to be more generic:
The model-based API requires the name of the model and the task, passed as a JSON object. This allows it to be more generic and support more models:

```postgresql
pgml.transform(
Expand All @@ -71,16 +83,41 @@ pgml.transform(
)
```

| Argument | Description | Example |
|----------|-------------|---------|
| task | Model configuration, including name and task. | `{"task": "text-generation", "model": "mistralai/Mixtral-8x7B-v0.1"}` |
| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
| inputs | Array of prompts to pass to the model for inference. | `['Once upon a time...']` |
<table class="table-sm table">
<thead>
<th>Argument</th>
<th>Description</th>
<th>Example</th>
</thead>
<tbody>
<tr>
<td>model</td>
<td>Model configuration, including name and task.</td>
<td>
<div class="code-multi-line font-monospace">
'{
<br>&nbsp;&nbsp;"task": "text-generation",
<br>&nbsp;&nbsp;"model": "mistralai/Mixtral-8x7B-v0.1"
<br>}'::JSONB
</div>
</td>
</tr>
<tr>
<td>args</td>
<td>Additional kwargs to pass to the pipeline.</td>
<td><code>'{"max_new_tokens": 50}'::JSONB</code></td>
</tr>
<tr>
<td>inputs</td>
<td>Array of prompts to pass to the model for inference. Each prompt is evaluated independently.</td>
<td><code>ARRAY['Once upon a time...']</code></td>
</tr>
</table>

#### Example

{% tabs %}
{% tab title="SQL" %}
{% tab title="PostgresML SQL" %}

```postgresql
SELECT pgml.transform(
Expand All @@ -89,8 +126,9 @@ SELECT pgml.transform(
"model": "TheBloke/zephyr-7B-beta-GPTQ",
"model_type": "mistral",
"revision": "main",
"device_map": "auto"
}'::JSONB,
inputs => ['AI is going to change the world in the following ways:'],
inputs => ARRAY['AI is going to'],
args => '{
"max_new_tokens": 100
}'::JSONB
Expand Down Expand Up @@ -138,11 +176,12 @@ PostgresML currently supports most NLP tasks available on Hugging Face:
| [Token classification](token-classification) | `token-classification` | Classify tokens in a text. |
| [Translation](translation) | `translation` | Translate text from one language to another. |
| [Zero-shot classification](zero-shot-classification) | `zero-shot-classification` | Classify a text without training data. |
| Conversational | `conversational` | Engage in a conversation with the model, e.g. chatbot. |

### Structured inputs

## Performance
Both versions of the `pgml.transform()` function also support structured inputs, formatted with JSON. Structured inputs are used with the conversational task, e.g. to differentiate between the system and user prompts. Simply replace the text array argument with an array of JSONB objects.

Much like `pgml.embed()`, the models used in `pgml.transform()` are downloaded from Hugging Face and cached locally. If the connection to the database is kept open, the model remains in memory, which allows for faster inference on subsequent calls. If you want to free up memory, you can close the connection.

## Additional resources

Expand Down
58 changes: 49 additions & 9 deletions pgml-cms/docs/api/sql-extension/pgml.transform/fill-mask.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,29 +2,69 @@
description: Task to fill words in a sentence that are hidden
---

# Fill Mask
# Fill-Mask

Fill-mask refers to a task where certain words in a sentence are hidden or "masked", and the objective is to predict what words should fill in those masked positions. Such models are valuable when we want to gain statistical insights about the language used to train the model.
Fill-Mask is a task where certain words in a sentence are hidden or "masked", and the objective for the model is to predict what words should fill in those masked positions. Such models are valuable when we want to gain statistical insights about the language used to train the model.

## Example

{% tabs %}
{% tab title="SQL" %}

```sql
SELECT pgml.transform(
task => '{
"task" : "fill-mask"
}'::JSONB,
inputs => ARRAY[
'Paris is the <mask> of France.'
'Paris is the &lt;mask&gt; of France.'

]
) AS answer;
```

_Result_
{% endtab %}

{% tab title="Result" %}

```json
[
{"score": 0.679, "token": 812, "sequence": "Paris is the capital of France.", "token_str": " capital"},
{"score": 0.051, "token": 32357, "sequence": "Paris is the birthplace of France.", "token_str": " birthplace"},
{"score": 0.038, "token": 1144, "sequence": "Paris is the heart of France.", "token_str": " heart"},
{"score": 0.024, "token": 29778, "sequence": "Paris is the envy of France.", "token_str": " envy"},
{"score": 0.022, "token": 1867, "sequence": "Paris is the Capital of France.", "token_str": " Capital"}]
{
"score": 0.6811484098434448,
"token": 812,
"sequence": "Paris is the capital of France.",
"token_str": " capital"
},
{
"score": 0.050908513367176056,
"token": 32357,
"sequence": "Paris is the birthplace of France.",
"token_str": " birthplace"
},
{
"score": 0.03812871500849724,
"token": 1144,
"sequence": "Paris is the heart of France.",
"token_str": " heart"
},
{
"score": 0.024047480896115303,
"token": 29778,
"sequence": "Paris is the envy of France.",
"token_str": " envy"
},
{
"score": 0.022767696529626846,
"token": 1867,
"sequence": "Paris is the Capital of France.",
"token_str": " Capital"
}
]
```

{% endtab %}
{% endtabs %}

### Additional resources

- [Hugging Face documentation](https://huggingface.co/tasks/fill-mask)
Original file line number Diff line number Diff line change
@@ -1,10 +1,15 @@
---
description: Retrieve the answer to a question from a given text
description: Retrieve the answer to a question from a given text.
---

# Question Answering
# Question answering

Question Answering models are designed to retrieve the answer to a question from a given text, which can be particularly useful for searching for information within a document. It's worth noting that some question answering models are capable of generating answers even without any contextual information.
Question answering models are designed to retrieve the answer to a question from a given text, which can be particularly useful for searching for information within a document. It's worth noting that some question answering models are capable of generating answers even without any contextual information.

## Example

{% tabs %}
{% tab title="SQL" %}

```sql
SELECT pgml.transform(
Expand All @@ -18,7 +23,9 @@ SELECT pgml.transform(
) AS answer;
```

_Result_
{% endtab %}

{% tab title="Result" %}

```json
{
Expand All @@ -28,3 +35,11 @@ _Result_
"answer": "İstanbul"
}
```

{% endtab %}
{% endtabs %}


### Additional resources

- [Hugging Face documentation](https://huggingface.co/tasks/question-answering)
57 changes: 25 additions & 32 deletions pgml-cms/docs/api/sql-extension/pgml.transform/summarization.md
Original file line number Diff line number Diff line change
@@ -1,53 +1,46 @@
---
description: Task of creating a condensed version of a document
description: Task of creating a condensed version of a document.
---

# Summarization

Summarization involves creating a condensed version of a document that includes the important information while reducing its length. Different models can be used for this task, with some models extracting the most relevant text from the original document, while other models generate completely new text that captures the essence of the original content.

## Example

{% tabs %}
{% tab title="SQL" %}

```sql
SELECT pgml.transform(
task => '{"task": "summarization",
"model": "sshleifer/distilbart-cnn-12-6"
task => '{
"task": "summarization",
"model": "google/pegasus-xsum"
}'::JSONB,
inputs => array[
'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'
]
inputs => array[
'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018,
in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government
of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880,
or about 18 percent of the population of France as of 2017.'
]
);
```

_Result_
{% endtab %}
{% tab title="Result" %}

```json
[
{
"summary_text": "Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . The city is the centre and seat of government of the region and province of Île-de-France, or Paris Region . Paris Region has an estimated 18 percent of the population of France as of 2017 ."
}
{
"summary_text": "The City of Paris is the centre and seat of government of the region and province of le-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017."
}
]
```

You can control the length of summary\_text by passing `min_length` and `max_length` as arguments to the SQL query.
{% endtab %}
{% endtabs %}

```sql
SELECT pgml.transform(
task => '{"task": "summarization",
"model": "sshleifer/distilbart-cnn-12-6"
}'::JSONB,
inputs => array[
'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'
],
args => '{
"min_length" : 20,
"max_length" : 70
}'::JSONB
);
```
### Additional resources

```json
[
{
"summary_text": " Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . City of Paris is centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated 12,174,880, or about 18 percent"
}
]
```
- [Hugging Face documentation](https://huggingface.co/tasks/summarization)
- [google/pegasus-xsum](https://huggingface.co/google/pegasus-xsum)
Loading