Dealing with datasets with a variable number of features

What are some approaches for classifying data with a variable number of features?

As an example, consider a problem where each data point is a vector of x and y points, and we don’t have the same number of points for each instance. Can we treat each pair of x and y points as a feature? Or should we just summarize the points somehow so each data point has a fixed number of features?


You can treat the absent points as missing values. For example, assume each vector has at most 20 (x, y) pairs and a particular instance has only 5; treat the remaining 15 pairs as missing, and then apply the standard procedures for missing values.

These standard procedures include:

  • Use a model that handles missing values naturally; decision tree models, for example, should be able to cope with them.
  • Replace each missing value with the mean of the corresponding column.
  • Use a simple model to ‘predict’ the missing values.
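As a rough illustration of the padding idea combined with mean imputation, here is a minimal sketch in plain Python. The names `MAX_PAIRS`, `pad_instance`, and `impute_column_means` are hypothetical, the cap of 3 pairs is chosen only to keep the example small, and NaN is assumed as the marker for a missing coordinate.

```python
import math

MAX_PAIRS = 3  # hypothetical cap on (x, y) pairs per instance


def pad_instance(points, max_pairs=MAX_PAIRS):
    """Flatten [(x, y), ...] into a fixed-length row, using NaN for missing pairs."""
    if len(points) > max_pairs:
        raise ValueError("instance has more pairs than max_pairs")
    row = [coord for pair in points for coord in pair]
    row += [math.nan] * (2 * max_pairs - len(row))
    return row


def impute_column_means(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # Fall back to 0.0 if a column has no observed values (an assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]


# Three instances with 1, 2, and 3 points respectively.
instances = [
    [(1.0, 2.0)],
    [(3.0, 4.0), (5.0, 6.0)],
    [(5.0, 6.0), (7.0, 8.0), (9.0, 10.0)],
]
padded = [pad_instance(p) for p in instances]
filled = impute_column_means(padded)
```

After this step every row in `filled` has the same length, so it can be passed to any classifier that expects a fixed number of features. A model that accepts missing values directly could be given `padded` instead.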

But as @jonsca points out, if the presence or absence of a given point helps in classifying the data, you could instead build several models, each of which handles instances with a particular number of points.
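One way to picture the "one model per number of points" idea is the sketch below. It groups training instances by how many points they have and fits a separate model for each group. The nearest-centroid classifier is only a stand-in chosen to keep the example dependency-free; `fit_per_count` and `predict` are hypothetical names, and the toy data is made up for illustration.

```python
from collections import defaultdict


def flatten(points):
    """Turn [(x, y), ...] into a flat list [x1, y1, x2, y2, ...]."""
    return [coord for pair in points for coord in pair]


def fit_per_count(instances, labels):
    """Fit one nearest-centroid model per point count.

    Returns {n_points: {label: centroid}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for points, label in zip(instances, labels):
        grouped[len(points)][label].append(flatten(points))
    models = {}
    for n, by_label in grouped.items():
        models[n] = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in by_label.items()
        }
    return models


def predict(models, points):
    """Classify with the model trained on instances of the same length."""
    centroids = models[len(points)]  # raises KeyError for an unseen point count
    x = flatten(points)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy data: two instances with one point, two instances with two points.
instances = [
    [(0.0, 0.0)],
    [(10.0, 10.0)],
    [(0.0, 0.0), (1.0, 1.0)],
    [(9.0, 9.0), (10.0, 10.0)],
]
labels = ["a", "b", "a", "b"]
models = fit_per_count(instances, labels)

one_point_prediction = predict(models, [(1.0, 1.0)])
two_point_prediction = predict(models, [(8.0, 8.0), (9.0, 9.0)])
```

Note the trade-off: each per-count model sees only a fraction of the training data, and an instance with a point count that never appeared in training cannot be classified without some fallback.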

Source : Link , Question Author : jergason , Answer Author : jb.
