Dealing with datasets with a variable number of features

What are some approaches for classifying data with a variable number of features?

As an example, consider a problem where each data point is a vector of (x, y) points, and we don’t have the same number of points for each instance. Can we treat each (x, y) pair as a feature? Or should we just summarize the points somehow so that each data point has a fixed number of features?

Answer

You can treat the absent points as missing values. For example, assume a vector has at most 20 (x, y) pairs and a particular instance has only 5 (x, y) pairs. In this case, treat the remaining 15 pairs as missing, and then apply a standard procedure for missing values.

These standard procedures may be:

  • Use a model that handles missing values natively; decision tree models, for example, should be able to cope with them.
  • Replace each missing value with the mean of the corresponding column.
  • Use some simple model to ‘predict’ the missing values.
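To make the padding and mean-replacement idea concrete, here is a minimal sketch in plain Python. The names `pad_points` and `impute_column_means`, the toy data, and the small `MAX_PAIRS` value are all made up for illustration (the example above uses 20), so treat it as one possible way to do it rather than a reference implementation.

```python
import math

MAX_PAIRS = 3  # assumed upper bound on (x, y) pairs per instance


def pad_points(points, max_pairs=MAX_PAIRS):
    """Flatten a list of (x, y) pairs into a fixed-length row, with NaN for absent pairs."""
    if len(points) > max_pairs:
        raise ValueError("instance has more pairs than max_pairs")
    flat = [coord for pair in points for coord in pair]
    return flat + [math.nan] * (2 * max_pairs - len(flat))


def impute_column_means(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # Falling back to 0.0 for a column with no observed values is an arbitrary choice.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)] for row in rows]


# Hypothetical toy data: three instances with 1, 2 and 3 points.
instances = [
    [(1.0, 2.0)],
    [(3.0, 4.0), (5.0, 6.0)],
    [(5.0, 6.0), (7.0, 8.0), (9.0, 10.0)],
]
padded = [pad_points(p) for p in instances]  # every row now has 2 * MAX_PAIRS columns
filled = impute_column_means(padded)         # no NaN left; usable by any standard classifier
```

If you go with a model that handles missing values itself, you would pass `padded` to it directly and skip the imputation step.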

But as @jonsca points out, if the presence or absence of a given point helps in classifying the data, you could instead build several models, each of them trained on the instances with a particular number of points.
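A rough sketch of the one-model-per-point-count idea follows. The nearest-centroid classifier is only a stand-in I picked to keep the example dependency-free; any classifier could take its place. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def flatten(points):
    """Turn [(x1, y1), (x2, y2), ...] into [x1, y1, x2, y2, ...]."""
    return [coord for pair in points for coord in pair]


def fit_centroids(rows, labels):
    """Stand-in classifier: store the mean feature vector of each class."""
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in by_label.items()
    }


def predict_centroid(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], row)),
    )


def fit_per_count(instances, labels):
    """Train one model for each distinct number of (x, y) pairs."""
    groups = defaultdict(lambda: ([], []))
    for points, label in zip(instances, labels):
        rows, ys = groups[len(points)]
        rows.append(flatten(points))
        ys.append(label)
    return {n: fit_centroids(rows, ys) for n, (rows, ys) in groups.items()}


def predict_per_count(models, points):
    """Dispatch to the model trained on instances with the same number of pairs."""
    # Raises KeyError if no training instance had this many pairs.
    return predict_centroid(models[len(points)], flatten(points))


# Hypothetical toy data: two instances with one point, two with two points.
train = [
    [(0.0, 0.0)],
    [(10.0, 10.0)],
    [(0.0, 0.0), (1.0, 1.0)],
    [(10.0, 10.0), (9.0, 9.0)],
]
train_labels = ["a", "b", "a", "b"]
models = fit_per_count(train, train_labels)
```

Keep in mind that each model only sees the instances with its own point count, so rarely occurring counts end up with very little training data.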

Attribution
Source: Link, Question Author: jergason, Answer Author: jb.
