stratified random sampling #1290

ChuckHend · 2024-01-18T18:25:50Z

Implements a stratified random sampling strategy, and sets that as the new default for test_sampling.

montanalow

Thanks for hammering this out!

pgml-extension/src/orm/snapshot.rs

montanalow · 2024-01-18T21:25:34Z

pgml-extension/src/orm/sampling.rs

 #[derive(PostgresEnum, Copy, Clone, Eq, PartialEq, Debug, Deserialize)]
 #[allow(non_camel_case_types)]
 pub enum Sampling {
    random,
    last,
+    stratified_random,


Thinking ahead to other stratification strategies that use columns other than the target y_column_name to guarantee you have true out-of-sample rows: e.g. user_id may be used to train with multiple instances from a particular user, but you want to ensure there is no data leakage where the model just memorizes user_ids rather than learning the more abstract pattern. So user_id should be excluded as a feature, but used for stratification.

I think that could work as an additional stratified_column_name parameter to train. In that case though, the sampling wouldn't be stratified_random. So we'd need to add a different stratified type, or we could just call this stratified, and if you don't specify a column, it's random by the y_column_name. If you do specify stratification column(s), then those column(s) get removed from features, and strictly used for stratification.

This can happen in a follow-up PR, I'm just commenting so we get a forward-looking name on the sampling strategy.
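
To make the user_id idea concrete, here is a rough sketch of one possible reading: hold out whole groups so that no user appears on both sides of the split. The function name `group_holdout_split`, its parameters, and the choice to hold out the last groups in sorted order are all assumptions for illustration, not code from this PR or from pgml.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not part of this PR): keep every row of a group,
/// such as one user_id, on the same side of the train/test split.
pub fn group_holdout_split(group_ids: &[u64], test_fraction: f64) -> (Vec<usize>, Vec<usize>) {
    // Distinct group ids in sorted order, so the split is deterministic.
    let groups: Vec<u64> = group_ids
        .iter()
        .copied()
        .collect::<BTreeSet<u64>>()
        .into_iter()
        .collect();

    // Number of whole groups to hold out, capped at the number of groups.
    let n_test = (((groups.len() as f64) * test_fraction).round() as usize).min(groups.len());

    // Hold out the last `n_test` groups for simplicity; a real implementation
    // would presumably choose them at random.
    let held_out: BTreeSet<u64> = groups[groups.len() - n_test..].iter().copied().collect();

    // Route each row index by its group membership.
    let mut train = Vec::new();
    let mut test = Vec::new();
    for (row, group) in group_ids.iter().enumerate() {
        if held_out.contains(group) {
            test.push(row);
        } else {
            train.push(row);
        }
    }
    (train, test)
}
```

With `group_ids = [1, 1, 2, 2, 3, 3, 4, 4]` and `test_fraction = 0.25`, one of the four users is held out entirely, so the model is evaluated only on rows from a user it never saw. The grouping column itself would still need to be dropped from the features separately.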

Good ideas/points. I changed it to just stratified. For this PR, there is no way to specify the columns to stratify by; stratified only uses y_column_name. Adding an optional parameter later that switches it from y_column_name to something else should be a non-breaking change.
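
For readers unfamiliar with the term, stratifying by y_column_name means each target value contributes roughly the same share of its rows to the test set. The sketch below illustrates that technique only; `stratified_split`, its signature, and the small dependency-free shuffle are assumptions made for the example and are not the implementation in this PR (real code would more likely use the `rand` crate).

```rust
use std::collections::BTreeMap;

/// SplitMix64 step: a tiny dependency-free PRNG, used here only so the
/// example is self-contained.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Fisher-Yates shuffle; the modulo introduces a negligible bias that is
/// acceptable for a sketch.
fn shuffle(items: &mut [usize], state: &mut u64) {
    for i in (1..items.len()).rev() {
        let j = (next_u64(state) % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Hypothetical helper: split row indices into (train, test) so that each
/// label sends about `test_fraction` of its rows to the test set.
pub fn stratified_split(labels: &[i64], test_fraction: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    // Group row indices by label; BTreeMap keeps iteration order stable.
    let mut by_label: BTreeMap<i64, Vec<usize>> = BTreeMap::new();
    for (row, label) in labels.iter().enumerate() {
        by_label.entry(*label).or_default().push(row);
    }

    let mut state = seed;
    let mut train = Vec::new();
    let mut test = Vec::new();
    for rows in by_label.values_mut() {
        // Shuffle within the stratum, then cut it at the requested fraction.
        shuffle(rows, &mut state);
        let n_test = (((rows.len() as f64) * test_fraction).round() as usize).min(rows.len());
        test.extend_from_slice(&rows[..n_test]);
        train.extend_from_slice(&rows[n_test..]);
    }
    (train, test)
}
```

With eight rows of class 0, four rows of class 1, and `test_fraction = 0.25`, the test set gets two rows of class 0 and one row of class 1, preserving the 2:1 class ratio. A purely random split of the same data could by chance leave the minority class out of the test set altogether.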

montanalow · 2024-01-18T21:32:09Z

This will also need a migration in sql/ to update the Postgres sampling enum.

ChuckHend · 2024-01-19T04:35:22Z

This will also need a migration in sql/ to update the Postgres sampling enum.

It looks like 2.8.2 hasn't gone out yet, so I put the migration in ./sql/pgml--2.8.1--2.8.2.sql. I can create a 2.8.3 migration if that would be preferred though.

ChuckHend · 2024-01-19T14:16:32Z

pgml-extension/sql/pgml--2.8.1--2.8.2.sql

+
+-- src/orm/sampling.rs:6
+-- pgml::orm::sampling::Sampling
+DROP TYPE IF EXISTS pgml.Sampling;
+CREATE TYPE pgml.Sampling AS ENUM (
+	'random',
+	'last',
+	'stratified'
+);


I still need to do some more testing to make sure this migration works as intended.

@montanalow, are there integration tests that assert migrations work? I didn't see any...

There aren't integration tests. I typically run test/test.sql to populate a database with the previous version, then run alter extension pgml update. You'll want to use alter type add value... here rather than dropping the enum, since dropping it would only work on an empty database.

montanalow reviewed Jan 18, 2024

ChuckHend commented Jan 19, 2024

ChuckHend marked this pull request as ready for review January 29, 2024 14:46

ChuckHend closed this Feb 29, 2024

ChuckHend force-pushed the master branch from 8a9a92b to 347168a Compare February 29, 2024 02:38

ChuckHend mentioned this pull request Feb 29, 2024

Stratified sampling #1336

Merged
