-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[WIP | First Contribution] add first pass at TimeSeriesInitialSplit #23931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sklearn/model_selection/_split.py
Outdated
TRAIN: [0] TEST: [1] | ||
TRAIN: [0 1] TEST: [2] | ||
TRAIN: [0 1 2] TEST: [3] | ||
TRAIN: [0 1 2 3] TEST: [4] | ||
TRAIN: [0 1 2 3 4] TEST: [5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand this API with initial=21
, what is the initial=21
doing here?
Also the default test size is 7 and there is only one sample in the test set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the response. I should explain this better.
Assume the blue dots represents dates to train on and the red dots are for testing.
In this picture the initial size of the blue dots are much larger than the size of the testing dots. (6 blue dots vs 1 red). With each iteration train size increases by 1 blue dot.
My PR allows this kind of flexibility. A user may set a larger initial training period, which is not bound by the size that the training set is increases by on each iteration. It decouples the initial size from the increment size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and 21 represents 7 days a week times 3 (a three week training period by default). I am assuming people could use this function in a forecast setting for daily forecasts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming people could use this function in a forecast setting for daily forecasts.
This assumption is not always valid. I've seen the time series splitter used for hourly data, etc.
In that case with initial=21
, should the first split have training indices [0, 1, 2, 3, 4, ..., 20]
, and the test set be [21, 22, .., 27]
because test_size=7
?
It looks like you want a walk forward as in #23780 but with an additional min_train_size
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fair enough. I am not married to initial=21
. You let me know what is preferred and I will make the change. The functionality is much more important to me than the initial args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you want a walk forward as in #23780 but with an additional min_train_size.
I am afraid that min_train_size
means that I want a consistent train size. That is what I do not want. I want one train size for the initial training then a second train size for each iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you sign off on this then I can keep working. I remember seeing a contributors document. I would have to see what the expectation is for writing tests and documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not signing off on this because the API is not finalized and unclear. Let's move the discussion to the issue.
After the API is clear, we can evaluate if we want this in scikit-learn.
Closing this PR such that we have a clear API decision before to go forward. |
Reference Issues/PRs
WIP #23923
What does this implement/fix? Explain your changes.
Add functionality to do time series cross validation with an initial window.
Any other comments?
I have added the code, but I would like to add tests as well. I am just having difficulty setting up the test environment.