Logistic Regression and Dataset Structure

I hope I am asking this question the right way. I have access to play-by-play data, so my question is more about the best approach and how to construct the data properly.

What I want to do is calculate the probability of winning an NHL game given the score and the time remaining in regulation. I figure I could use logistic regression, but I am not sure what the dataset should look like. Would I have multiple observations per game, one for every slice of time I am interested in? Or would I have one observation per game and fit separate models per slice of time? Is logistic regression even the right way to go?

Any help you can provide will be very much appreciated!

Best regards.


Do a logistic regression with covariates “play time” and “goals(home team) – goals(away team)”. You will need an interaction between these terms, since a 2-goal lead at the halfway point of the game has a much smaller effect than a 2-goal lead with only 1 minute left. Your response is “victory (home team)”.
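Written out, the model described above could look like the following. This is only a sketch: the symbols t (play time elapsed), d (home goals minus away goals) and the β coefficients are my own notation, not something fixed by the data.

$$\operatorname{logit} P(\text{home win}) = \beta_0 + \beta_1 t + \beta_2 d + \beta_3 \, (t \cdot d)$$

Under this form, the effect of one extra goal of lead on the log-odds is β2 + β3·t, so it is allowed to change as the game goes on. Without the interaction term β3, a lead would count the same at every point in the game.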

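Don’t just assume linearity for this; fit a smoothly varying coefficient model for the effect of “goals(home team) – goals(away team)”. For example, in R you could use mgcv’s gam function with family = binomial and a model formula like win_home ~ s(time_remaining, by=lead_home). Make lead_home into a factor, so that you get a different effect of time_remaining for every value of lead_home. Because mgcv centers each smooth when the by variable is a factor, you would normally also add lead_home as a main effect, i.e. win_home ~ lead_home + s(time_remaining, by=lead_home).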
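In formula terms, a factor-by smooth of this kind corresponds roughly to the model below. The notation (k for a level of lead_home, r for time_remaining, α and f) is assumed for illustration, and it presumes a separate intercept for each lead level is included.

$$\operatorname{logit} P(\text{home win} \mid \text{lead} = k,\ \text{time remaining} = r) = \alpha_k + f_k(r)$$

Here each f_k is a smooth function estimated from the data, so the time profile for a 1-goal lead is free to have a different shape from the profile for a 2-goal lead.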

I would create multiple observations per game, one for every slice of time you are interested in.
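As a rough illustration of that layout, here is a minimal Python sketch that expands one game into one row per time slice. The function name game_to_rows, the (elapsed_seconds, team) event format, and the column names are all assumptions made for the example; real play-by-play data will need its own parsing.

```python
from bisect import bisect_right


def game_to_rows(game_id, goal_events, home_won,
                 slice_seconds=60, regulation_seconds=3600):
    """Expand one game into one observation per time slice.

    goal_events is a hypothetical format: a list of
    (elapsed_seconds, team) tuples with team "home" or "away".
    """
    events = sorted(goal_events)           # order goals by time
    times = [t for t, _ in events]
    rows = []
    for elapsed in range(0, regulation_seconds, slice_seconds):
        # Count goals scored at or before this point in the game.
        n = bisect_right(times, elapsed)
        home = sum(1 for _, team in events[:n] if team == "home")
        away = n - home
        rows.append({
            "game_id": game_id,
            "time_remaining": regulation_seconds - elapsed,
            "lead_home": home - away,
            "win_home": int(home_won),     # same outcome on every row
        })
    return rows


# Made-up example game: goals at 5, 25, 40 and ~58 minutes.
example_goals = [(300, "home"), (1500, "away"), (2400, "home"), (3500, "home")]
rows = game_to_rows("G1", example_goals, home_won=True, slice_seconds=600)
for row in rows:
    print(row)
```

Every row from the same game carries the same win_home value, so the rows are not independent observations. Keeping game_id in the table makes it possible to account for that later, for example by grouping or resampling by game.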

Source: Link, Question Author: Btibert3, Answer Author: fabians
