stratified random sampling #1290

Closed

ChuckHend
Contributor

Implements a stratified random sampling strategy, and sets that as the new default for test_sampling.
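For readers unfamiliar with the technique, here is a minimal sketch of a stratified random train/test split, not the code from this PR. It groups rows by label, shuffles each group, and sends the same fraction of every group to the test set. The names `stratified_split` and `SplitMix64` are hypothetical, and the small PRNG is only a stand-in so the example needs no external crates; a real implementation would more likely use the `rand` crate.

```rust
use std::collections::BTreeMap;

/// Tiny deterministic PRNG (SplitMix64), used here only to avoid dependencies.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Fisher-Yates shuffle (modulo bias is ignored for brevity).
    pub fn shuffle<T>(&mut self, items: &mut [T]) {
        for i in (1..items.len()).rev() {
            let j = (self.next_u64() % (i as u64 + 1)) as usize;
            items.swap(i, j);
        }
    }
}

/// Splits row indices into (train, test) so that each label contributes
/// roughly `test_size` (a fraction in 0.0..=1.0) of its rows to the test set.
pub fn stratified_split(labels: &[i64], test_size: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    // Group row indices by label; BTreeMap keeps iteration order deterministic.
    let mut strata: BTreeMap<i64, Vec<usize>> = BTreeMap::new();
    for (row, label) in labels.iter().enumerate() {
        strata.entry(*label).or_default().push(row);
    }

    let mut rng = SplitMix64(seed);
    let (mut train, mut test) = (Vec::new(), Vec::new());
    for rows in strata.values_mut() {
        rng.shuffle(rows);
        // Round per stratum so each class keeps its share of the test set.
        let n_test = (((rows.len() as f64) * test_size).round() as usize).min(rows.len());
        test.extend_from_slice(&rows[..n_test]);
        train.extend_from_slice(&rows[n_test..]);
    }
    (train, test)
}
```

With 80 rows of one class and 20 of another and `test_size = 0.25`, the test set receives 20 and 5 rows respectively, so the class ratio is preserved on both sides of the split.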

Contributor

@montanalow montanalow left a comment

Thanks for hammering this out!

#[derive(PostgresEnum, Copy, Clone, Eq, PartialEq, Debug, Deserialize)]
#[allow(non_camel_case_types)]
pub enum Sampling {
    random,
    last,
    stratified_random,
Contributor

Thinking ahead to other stratification strategies: some will use columns other than the target y_column_name to guarantee truly out-of-sample rows. For example, a training set may contain multiple rows for a particular user_id, and you want to ensure there is no data leakage where the model just memorizes user_ids rather than learning the more abstract patterns. In that case, user_id should be excluded as a feature but still used for stratification.

I think that could work as an additional stratified_column_name parameter to train. In that case, though, the sampling wouldn't be stratified_random, so we'd need to add a different stratified type. Alternatively, we could just call this one stratified: if you don't specify a column, it stratifies randomly by y_column_name. If you do specify stratification column(s), those column(s) are removed from the features and used strictly for stratification.

This can happen in a follow up PR, I'm just commenting so we get a forward looking name on the sampling strategy.
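To make the proposed semantics concrete, here is a rough sketch of the column resolution described above. The function `resolve_stratification` and its signature are hypothetical and do not exist in PostgresML; only the `stratified_column_name` idea and `y_column_name` come from this thread.

```rust
/// Hypothetical helper: decides which columns drive stratification and which
/// remain as features. Returns (stratify_by, feature_columns).
pub fn resolve_stratification(
    all_columns: &[&str],
    y_column_name: &str,
    stratified_column_names: Option<&[&str]>,
) -> (Vec<String>, Vec<String>) {
    // Default: stratify by the target column when nothing is specified.
    let stratify_by: Vec<String> = match stratified_column_names {
        Some(cols) if !cols.is_empty() => cols.iter().map(|c| c.to_string()).collect(),
        _ => vec![y_column_name.to_string()],
    };

    // Features exclude the target and any explicit stratification columns.
    let features: Vec<String> = all_columns
        .iter()
        .filter(|c| **c != y_column_name && !stratify_by.iter().any(|s| s.as_str() == **c))
        .map(|c| c.to_string())
        .collect();

    (stratify_by, features)
}
```

Because the extra argument is optional and defaults to the target column, callers that never pass it would keep today's behavior, which is what would make the later addition non-breaking.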

Contributor Author

Good points. I changed the name to just stratified. For this PR, there is no way to specify the columns to stratify by; stratified only uses y_column_name. Adding an optional parameter later that switches it from y_column_name to something else should be a non-breaking change.

@montanalow
Contributor

This will also need a migration in sql/ to update the Postgres sampling enum.

@ChuckHend
Contributor Author

> This will also need a migration in sql/ to update the Postgres sampling enum.

It looks like 2.8.2 hasn't gone out yet, so I put the migration in ./sql/pgml--2.8.1--2.8.2.sql. I can create a 2.8.3 migration instead if that would be preferred.

Comment on lines 28 to 36

-- src/orm/sampling.rs:6
-- pgml::orm::sampling::Sampling
DROP TYPE IF EXISTS pgml.Sampling;
CREATE TYPE pgml.Sampling AS ENUM (
    'random',
    'last',
    'stratified'
);
Contributor Author

I still need to do some more testing to make sure this migration works as intended.

@montanalow, are there integration tests that assert migrations work? I didn't see any...

Contributor

@montanalow montanalow Jan 19, 2024

There aren't integration tests. I typically run test/test.sql to populate a database with the previous version, then run ALTER EXTENSION pgml UPDATE. You'll want to use ALTER TYPE ... ADD VALUE ... here rather than dropping the enum, since dropping it would only work on an empty database.
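As an illustration of the additive approach, the sketch below builds the statements such a migration would contain: one `ALTER TYPE ... ADD VALUE` per label the existing enum lacks, leaving existing labels (and any rows that use them) untouched. The helper `enum_migration` is hypothetical and not part of PostgresML; `ADD VALUE IF NOT EXISTS` is standard Postgres syntax.

```rust
/// Hypothetical helper: emits one `ALTER TYPE ... ADD VALUE` statement for each
/// desired label missing from the existing enum. Labels are assumed to contain
/// no single quotes, so no escaping is done here.
pub fn enum_migration(type_name: &str, existing: &[&str], desired: &[&str]) -> Vec<String> {
    desired
        .iter()
        .filter(|label| !existing.contains(*label))
        .map(|label| format!("ALTER TYPE {} ADD VALUE IF NOT EXISTS '{}';", type_name, label))
        .collect()
}
```

For the enum in this PR, passing the existing labels `random` and `last` and the desired labels `random`, `last`, and `stratified` yields a single statement that adds `'stratified'` to `pgml.sampling`.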

@ChuckHend ChuckHend marked this pull request as ready for review January 29, 2024 14:46
@ChuckHend ChuckHend closed this Feb 29, 2024
@ChuckHend ChuckHend mentioned this pull request Feb 29, 2024