One variable in my data has 80% missing values. The data are missing because of non-existence (the variable is how much bank loan the company owes). I came across an article saying that the dummy variable adjustment method is the solution to this problem. Does that mean I need to transform this continuous variable into a categorical one?
Is this the only solution? I do not want to drop this variable because I think it is theoretically important to my research question.
Are the data “missing” in the sense of being unknown or does it just mean there is no loan (so the loan amount is zero)? It sounds like the latter, in which case you need an additional binary dummy to indicate whether there is a loan. No transformation of the loan amount is needed (apart, perhaps, from a continuous re-expression, such as a root or started log, which might be indicated by virtue of other considerations).
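As a minimal sketch of that recoding, assuming the missing entries are stored as NaN and really do mean "no loan" (the array values and the names `loan_raw`, `has_loan`, and `loan_amount` are hypothetical, not from the question):

```python
import numpy as np

# Hypothetical loan amounts; NaN marks companies with no bank loan.
loan_raw = np.array([np.nan, 250.0, np.nan, np.nan, 1200.0])

# Indicator I: 1 where a loan exists, 0 otherwise.
has_loan = (~np.isnan(loan_raw)).astype(int)

# Loan amount X: keep observed amounts, set the non-existent ones to 0.
loan_amount = np.where(np.isnan(loan_raw), 0.0, loan_raw)

print(has_loan)
print(loan_amount)
```

Both columns then enter the model together; the loan amount stays continuous and is not binned into categories.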
This works well in a regression. A simple example is a conceptual model of the form
dependent variable (Y) = loan amount (X) + constant.
With the addition of a loan indicator (I), the regression model is
Y = β0 + βI I + βX X + ϵ,
with ϵ representing random errors with zero expectations. The coefficients are interpreted as:
β0 is the expectation of Y for no-loan situations, because those are characterized by X=0 and I=0.
βX is the marginal change in Y with respect to the amount of the loan (X).
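βI+β0 is the intercept for the cases with loans, because those are characterized by I=1; βI alone is therefore the shift in intercept between loan and no-loan cases.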
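These interpretations can be checked on simulated data. The sketch below is only an illustration under assumed values: the sample size, the 20% loan share, the "true" coefficients, and the noise level are all made up, and ordinary least squares via `np.linalg.lstsq` stands in for whatever regression software is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data (all values hypothetical): about 20% of companies have a loan.
I = (rng.random(n) < 0.2).astype(float)
X = I * rng.uniform(1.0, 10.0, size=n)  # amount is 0 whenever I == 0

# Assumed "true" coefficients for the simulation.
beta0, betaI, betaX = 2.0, 1.5, 0.5
Y = beta0 + betaI * I + betaX * X + rng.normal(0.0, 0.1, size=n)

# Design matrix with columns: constant, indicator I, loan amount X.
A = np.column_stack([np.ones(n), I, X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
b0_hat, bI_hat, bX_hat = coef

print("estimated beta0, betaI, betaX:", b0_hat, bI_hat, bX_hat)
print("mean of Y among no-loan cases:", Y[I == 0].mean())
print("intercept for loan cases     :", b0_hat + bI_hat)
```

In this setup the fitted β0 coincides with the average of Y over the no-loan cases, while the slope βX is estimated only from the cases that have a loan, since X does not vary among the others.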