Skip to content

chore: add array operators to SQLGlot compiler #1852

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 25, 2025
Merged

Conversation

sycai
Copy link
Contributor

@sycai sycai commented Jun 25, 2025

In addition, I polished the OpRegistration class such that full type hints are supported for each compilation function. This is achieved by relaxing the constraints of function signatures.

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jun 25, 2025
@sycai sycai added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 25, 2025
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 25, 2025
@sycai sycai requested a review from chelsea-lin June 25, 2025 17:36
@sycai sycai marked this pull request as ready for review June 25, 2025 17:36
@sycai sycai requested review from a team as code owners June 25, 2025 17:36

@UNARY_OP_REGISTRATION.register(ops.ArrayIndexOp)
def _(op: ops.ArrayIndexOp, expr: TypedExpr) -> sge.Expression:
offset = sge.Anonymous(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compile_explode also need SAFE_OFFSET but present as different way. Maybe we should define them in the same way:

sge.Bracket(
this=column,
expressions=[unnested_offset_alias],
safe=True,
offset=False,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use sge.func to call a specific function instead of a generic one currently referred to as sge.Anonymous?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

safe_offset prefer upper case for any SQL preserved keywords.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Done!


@UNARY_OP_REGISTRATION.register(ops.ArraySliceOp)
def _(op: ops.ArraySliceOp, expr: TypedExpr) -> sge.Expression:
slice_idx = sqlglot.to_identifier("slice_idx")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sql ops use next(self.uid_gen.get_uid_stream("bfcol_")) to generate any new columns but this uid_gen instance is not passed to the scalar op compiler yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That feels like an overkill. I don't think we need to worry about name collisions in this subquery of unnesting an array with offsets , because none of these fields inherently has a name. Plus, the "slice_idx" and "el" are more meaningful names.

As for injecting uid_gen to the scalar compiler in the general sense, let's only do that when it's necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. Both slice_idx and el are both local naming and won't be joined with other column names yet, according to the generated SQL below. Yes, I am okay to keep as current so far.

    ARRAY(
      SELECT
        el
      FROM UNNEST(`bfcol_1`) AS el WITH OFFSET AS slice_idx
      WHERE
        slice_idx >= 1 AND slice_idx < 5
    ) AS `bfcol_4`

@sycai sycai requested a review from chelsea-lin June 25, 2025 18:38
@sycai sycai enabled auto-merge (squash) June 25, 2025 18:51
@sycai sycai merged commit c88a825 into main Jun 25, 2025
18 of 25 checks passed
@sycai sycai deleted the sycai_scalar_compiler branch June 25, 2025 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants