stratified random sampling #1290
Thanks for hammering this out!
pgml-extension/src/orm/sampling.rs
Outdated
#[derive(PostgresEnum, Copy, Clone, Eq, PartialEq, Debug, Deserialize)]
#[allow(non_camel_case_types)]
pub enum Sampling {
    random,
    last,
    stratified_random,
Thinking ahead to other stratification strategies that use columns other than the target y_column_name to guarantee you have true out-of-sample rows: e.g. user_id may be used to train with multiple instances from a particular user, but you want to ensure there is no data leakage where the model just memorizes user_ids rather than learning something more abstract. So user_id should be excluded as a feature, but used for stratification.
I think that could work as an additional stratified_column_name parameter to train. In that case, though, the sampling wouldn't be stratified_random. So we'd need to add a different stratified type, or we could just call this stratified, and if you don't specify a column, it's random by the y_column_name. If you do specify stratification column(s), then those column(s) get removed from features and are used strictly for stratification.
This can happen in a follow-up PR; I'm just commenting so we get a forward-looking name on the sampling strategy.
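To make the user_id case concrete, here is a minimal sketch of a split that keeps every row of a given key on one side of the train/test boundary (what is often called a group-wise holdout). It is not code from this PR: the function name `group_split`, the `u64` key type, and the deterministic "first groups go to test" rule are all assumptions for illustration, and a real implementation would presumably shuffle the groups first.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of pgml): split row indices into
/// (train, test) so that all rows sharing a group key, e.g. a user_id,
/// land on the same side of the split.
pub fn group_split(groups: &[u64], test_fraction: f64) -> (Vec<usize>, Vec<usize>) {
    // Collect row indices per group key. BTreeMap keeps iteration order
    // deterministic, which makes the sketch easy to reason about.
    let mut by_group: BTreeMap<u64, Vec<usize>> = BTreeMap::new();
    for (row, key) in groups.iter().enumerate() {
        by_group.entry(*key).or_default().push(row);
    }

    // Approximate number of rows we want in the test set.
    let target = (groups.len() as f64 * test_fraction).round() as usize;

    let mut train = Vec::new();
    let mut test = Vec::new();
    // Assign whole groups to the test set until the target is reached;
    // every remaining group goes to train. Assumption: no shuffling here.
    for (_, rows) in by_group {
        if test.len() < target {
            test.extend(rows);
        } else {
            train.extend(rows);
        }
    }
    (train, test)
}
```

With keys `[1, 1, 2, 2, 3, 3, 4, 4]` and a test fraction of 0.25, both rows of key 1 go to test and the rest go to train, so no key appears on both sides. The key column itself would still have to be dropped from the features separately, as the comment above suggests.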
Good ideas/points. I changed it to just stratified. For this PR, there will be no ability to specify the columns to stratify by; stratified only uses y_column_name. I think adding an optional parameter in the future that changes it from y_column_name to something else should be a non-breaking change.
This will also need a migration in |
It looks like 2.8.2 hasn't gone out yet so I put the migration in |
-- src/orm/sampling.rs:6
-- pgml::orm::sampling::Sampling
DROP TYPE IF EXISTS pgml.Sampling;
CREATE TYPE pgml.Sampling AS ENUM (
    'random',
    'last',
    'stratified'
);
I still need to do some more testing to make sure this migration works as intended.
@montanalow, are there integration tests that assert migrations work? I didn't see any...
There aren't integration tests. I typically run the test/test.sql to populate a database with the previous version, then alter extension pgml update. You'll want to alter type add value... here rather than dropping the enum, as that would only work on an empty database.
Implements a stratified random sampling strategy, and sets that as the new default for test_sampling.
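For readers unfamiliar with the technique, the following is a rough sketch of stratified random sampling on the target column: rows are grouped by label, shuffled within each group, and each group contributes about the same fraction of its rows to the test set. This is not the implementation from this PR. The names `stratified_split` and `Lcg` are hypothetical, labels are assumed to be integers, and the small linear congruential generator only stands in for a proper RNG so the sketch needs no external crates.

```rust
use std::collections::BTreeMap;

/// Tiny deterministic generator so the sketch has no dependencies.
/// Assumption: real code would use a proper random number generator.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Hypothetical helper (not part of pgml): returns (train, test) row
/// indices where each label contributes roughly `test_fraction` of its
/// rows to the test set.
pub fn stratified_split(
    labels: &[i64],
    test_fraction: f64,
    seed: u64,
) -> (Vec<usize>, Vec<usize>) {
    // One stratum per distinct label value.
    let mut by_label: BTreeMap<i64, Vec<usize>> = BTreeMap::new();
    for (row, label) in labels.iter().enumerate() {
        by_label.entry(*label).or_default().push(row);
    }

    let mut rng = Lcg(seed);
    let mut train = Vec::new();
    let mut test = Vec::new();
    for (_, mut rows) in by_label {
        // Fisher-Yates shuffle within the stratum.
        for i in (1..rows.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            rows.swap(i, j);
        }
        // Take the same fraction from every stratum, clamped to its size.
        let n_test = ((rows.len() as f64 * test_fraction).round() as usize).min(rows.len());
        test.extend_from_slice(&rows[..n_test]);
        train.extend_from_slice(&rows[n_test..]);
    }
    (train, test)
}
```

With eight rows labeled 0, four rows labeled 1, and a test fraction of 0.25, the test set gets two rows of label 0 and one row of label 1, so the class ratio is preserved on both sides of the split. A plain random sample of three rows could easily miss the minority label entirely.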