Skip to content

[WIP | First Contribution] add first pass at TimeSeriesInitialSplit #23931

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

alexhallam
Copy link

Reference Issues/PRs

WIP #23923

What does this implement/fix? Explain your changes.

Add functionality to do time series cross validation with an initial window.

Any other comments?

I have added the code, but I would like to add tests as well. I am just having difficulty setting up the test environment.

Comment on lines 1145 to 1149
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
Copy link
Member

@thomasjpfan thomasjpfan Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this API with initial=21, what is the initial=21 doing here?

Also the default test size is 7 and there is only one sample in the test set.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the response. I should explain this better.

image

Assume the blue dots represents dates to train on and the red dots are for testing.

In this picture the initial size of the blue dots are much larger than the size of the testing dots. (6 blue dots vs 1 red). With each iteration train size increases by 1 blue dot.

My PR allows this kind of flexibility. A user may set a larger initial training period, which is not bound by the size that the training set is increases by on each iteration. It decouples the initial size from the increment size.

Copy link
Author

@alexhallam alexhallam Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and 21 represents 7 days a week times 3 (a three week training period by default). I am assuming people could use this function in a forecast setting for daily forecasts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming people could use this function in a forecast setting for daily forecasts.

This assumption is not always valid. I've seen the time series splitter used for hourly data, etc.

In that case with initial=21, should the first split have training indices [0, 1, 2, 3, 4, ..., 20], and the test set be [21, 22, .., 27] because test_size=7?

It looks like you want a walk forward as in #23780 but with an additional min_train_size.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fair enough. I am not married to initial=21. You let me know what is preferred and I will make the change. The functionality is much more important to me than the initial args.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you want a walk forward as in #23780 but with an additional min_train_size.

I am afraid that min_train_size means that I want a consistent train size. That is what I do not want. I want one train size for the initial training then a second train size for each iteration.

Copy link
Author

@alexhallam alexhallam Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you sign off on this then I can keep working. I remember seeing a contributors document. I would have to see what the expectation is for writing tests and documentation.

Copy link
Member

@thomasjpfan thomasjpfan Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not signing off on this because the API is not finalized and unclear. Let's move the discussion to the issue.

After the API is clear, we can evaluate if we want this in scikit-learn.

@glemaitre
Copy link
Member

Closing this PR such that we have a clear API decision before to go forward.

@glemaitre glemaitre closed this Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants