I am working on a problem where I want to predict log(spend) of a customer using linear regression.
I am considering what features to use as inputs and wondering whether it would be OK to use the percentile of a variable as an input.
For example, I could use the company's revenue as an input. What I’m wondering is whether I could use the company's revenue percentile instead.
Another example would be a categorical industry classifier (NAICS) – if I were to look at median spend per NAICS code and then assign each NAICS code to a ‘NAICS Percentile’, would that be a valid explanatory variable I could use?
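To make the NAICS idea concrete, here is a minimal sketch of what I have in mind. The data, the name `naics_percentiles`, and the "share of codes at or below" definition of percentile are all made up for illustration. Since this feature is built from spend itself, I assume it would have to be computed on the training data only, otherwise the target leaks into the inputs.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy data: (naics_code, spend) per customer.
customers = [
    ("11", 100.0), ("11", 300.0),
    ("23", 50.0), ("23", 70.0), ("23", 90.0),
    ("52", 1000.0), ("52", 3000.0),
    ("62", 400.0),
]

def naics_percentiles(rows):
    """Map each NAICS code to the percentile rank of its median spend."""
    by_code = defaultdict(list)
    for code, spend in rows:
        by_code[code].append(spend)
    medians = {code: median(values) for code, values in by_code.items()}
    n = len(medians)
    # Assumed definition: share of codes whose median spend is <= this code's.
    return {
        code: 100.0 * sum(m <= med for m in medians.values()) / n
        for code, med in medians.items()
    }

naics_pct = naics_percentiles(customers)
```

Each customer would then get the value in `naics_pct` for its NAICS code as a single numeric column.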
Just wondering if there are any issues to be aware of when using percentiles? Is it in some ways equivalent to a type of feature scaling?
If your model entails some sort of contest among firms over revenue, you can use the percentile. The log of the percentile seems more meaningful to me, since quantiles are not going to be linear in value, or so I imagine.
In this story, you include ln(%), where % is the percentage of firms with revenues below the observed firm's. The story is that firms with high revenues have better reputations than firms with low revenues, and that this relation of “having more than the competition” is what matters, not the level of revenue itself. I could see this as an important part of firm recognition and branding.
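A rough sketch of how such a feature might be computed, using only the standard library. The revenues and the function name `log_pct_below` are hypothetical, and the 0.5 offset is my own assumption to keep the lowest-revenue firm from producing ln(0). It uses the share (0 to 1) rather than the percentage; the two differ only by the constant ln(100), which the regression intercept absorbs.

```python
import math
from bisect import bisect_left

# Hypothetical firm revenues, one per observation.
revenues = [120.0, 45.0, 300.0, 80.0, 1000.0]

def log_pct_below(values):
    """Return ln(share of firms with strictly lower revenue) for each firm."""
    ordered = sorted(values)
    n = len(ordered)
    features = []
    for v in values:
        below = bisect_left(ordered, v)  # number of firms with revenue < v
        # Assumption: add 0.5 so the smallest firm does not yield ln(0).
        share = (below + 0.5) / n
        features.append(math.log(share))
    return features

features = log_pct_below(revenues)
```

The resulting values depend only on each firm's rank, so rescaling all revenues by a positive constant leaves the feature unchanged.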