I am trying to fit a linear model using “hour of the day” as a parameter. What I’m struggling with is that I’ve found two possible ways to handle this:
Dummy encoding for every hour of the day
Encoding the hour as a cyclic variable
I don’t quite understand the use cases of both approaches and thus I am not certain which one will lead to a better outcome.
The data I’m using is from this Kaggle challenge. The goal is to predict NYC taxi fares. The given attributes are pickup and dropoff coordinates, pickup datetime, passenger count and the fare amount.
I extracted the hour of the day to take possible congestion into account and am trying to incorporate it into my model. I should probably also mention that I’m pretty inexperienced.
Dummy encoding would destroy any proximity measure (and ordering) among hours. For example, the distance between 1 PM and 9 PM would be the same as the distance between 1 PM and 1 AM. That makes it harder for the model to express something like “around 1 PM”.
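To make the point about distances concrete, here is a minimal sketch, assuming NumPy and plain Euclidean distance between the encoded vectors. The helper names `one_hot_hour` and `dummy_distance` are made up for this illustration and are not part of any library.

```python
import numpy as np

def one_hot_hour(hour):
    """Return a 24-element dummy (one-hot) vector for an hour in 0..23."""
    vec = np.zeros(24)
    vec[hour] = 1.0
    return vec

def dummy_distance(h1, h2):
    """Euclidean distance between the dummy encodings of two hours."""
    return float(np.linalg.norm(one_hot_hour(h1) - one_hot_hour(h2)))

# 1 PM vs 9 PM, 1 PM vs 1 AM, and 1 PM vs 2 PM all come out identical:
# any two different hours differ in exactly two positions.
d_far = dummy_distance(13, 21)
d_half_day = dummy_distance(13, 1)
d_adjacent = dummy_distance(13, 14)
```

All three distances equal the square root of 2, so under this encoding an adjacent hour is no "closer" than one half a day away.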
Even leaving them as is, i.e. as numbers in 0-23, would be a better approach than dummy encoding in my opinion. But this has a catch as well: 00:01 and 23:59 would be seen as very distant, although they are actually only two minutes apart. To remedy this, your second listed approach, i.e. cyclic variables, is used. Cyclic variables map hours onto a circle (like a 24-hour mechanical clock) so that the ML algorithm can see the neighbours of individual hours.
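One common way to build such cyclic variables is a sine/cosine pair, which places each hour on the unit circle. The following is only a sketch under that assumption, using NumPy; the function names `encode_hour` and `cyclic_distance` are hypothetical, chosen for this example.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point (sin, cos) on the unit circle."""
    angle = 2.0 * np.pi * np.asarray(hour, dtype=float) / 24.0
    return np.sin(angle), np.cos(angle)

def cyclic_distance(h1, h2):
    """Euclidean distance between two hours after the sin/cos mapping."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return float(np.hypot(s1 - s2, c1 - c2))

# Midnight's neighbours on both sides are now equally close,
# while noon is the farthest point from midnight.
d_23_to_0 = cyclic_distance(23, 0)
d_0_to_1 = cyclic_distance(0, 1)
d_0_to_12 = cyclic_distance(0, 12)
```

Both resulting columns (sine and cosine) would go into the model together as features. Either one alone is ambiguous, because two different hours can share the same sine or the same cosine value.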