For example, suppose one wants to predict house prices and has two input features: the length and the width of the house. Sometimes one also includes ‘derived’ polynomial input features, such as area, which is length * width.
1) What is the point of including derived features? Shouldn’t a neural network learn the connection between length, width and price during training? Why isn’t the third feature, area, redundant?
In addition, sometimes I also see that people run genetic selection algorithms on the input features in order to reduce their number.
2) What is the point of reducing the input features if they all contain useful information? Shouldn’t the neural network assign appropriate weights to each input feature according to its importance? What is the point of running genetic selection algorithms?
1): Including derived features is a way to inject expert knowledge into the training process, and so to accelerate it. For example, I work with physicists a lot in my research. When I’m building an optimization model, they’ll give me 3 or 4 parameters, but they usually also know certain forms that are supposed to appear in the equation. For example, I might get variables n and l, but the expert knows that n∗l is important. By including it as a feature, I save the model the extra effort of finding out that n∗l is important. Granted, sometimes domain experts are wrong, but in my experience, they usually know what they’re talking about.
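To make the point concrete, here is a minimal sketch using the house example from the question. Everything in it is an assumption for illustration: the data is synthetic, the price is made to depend on area only, and an ordinary least-squares fit stands in for the network because it makes the effect easy to see. A neural network could eventually approximate the product on its own, but it would have to spend capacity and training time to do so.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: price is proportional to area (length * width).
length = rng.uniform(5.0, 20.0, size=200)
width = rng.uniform(5.0, 20.0, size=200)
price = 1000.0 * length * width


def linear_fit_rmse(features, target):
    """Fit least squares with an intercept and return the training RMSE."""
    X = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ coef
    return float(np.sqrt(np.mean(residual ** 2)))


# Raw features only: the model has to approximate a product with a plane.
rmse_raw = linear_fit_rmse([length, width], price)

# Raw features plus the derived feature supplied by the "expert".
rmse_derived = linear_fit_rmse([length, width, length * width], price)

print(f"RMSE without area: {rmse_raw:.2f}")
print(f"RMSE with area:    {rmse_derived:.6f}")
```

With the derived feature the fit is essentially exact, while the raw-feature fit leaves a large error. That gap is the work the model is spared when the expert hands over n∗l directly.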
2): There are two reasons I know of for this. First, if you are supplied with thousands of features (as often happens with real-world data) and are short on CPU time for training (also a common occurrence), you can use one of a number of feature selection algorithms to pare down the feature space in advance. The principled approaches often use information-theoretic measures to select the features with the highest predictive power.
Second, even if you can afford to train on all the data and all the features you have, neural networks are often criticized for being ‘black box’ models. Reducing the feature space in advance can help to mitigate this issue. For example, a user looking at the NN cannot easily tell whether a weight of 0.01 means “0, but the optimization process didn’t quite get there” or “This feature is important, but has to be reduced in value prior to use”. Removing useless features before training makes this less of an issue, because any feature that remains was already judged to be informative.
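As a rough illustration of the first reason, the sketch below ranks features by a histogram estimate of their mutual information with the target and keeps the top two. This is a simple filter method, not the genetic selection mentioned in the question, and the data, the bin count, the helper name `mutual_information`, and the choice of k = 2 are all assumptions made for the example.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information (in nats) of two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint distribution over the bins
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    mask = pxy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))


rng = np.random.default_rng(0)

# Hypothetical data: 10 candidate features, only columns 2 and 7 drive the target.
X = rng.normal(size=(5000, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=5000)

# Score every feature against the target, then keep the k highest-scoring ones.
scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
k = 2
selected = np.sort(np.argsort(scores)[::-1][:k])

X_reduced = X[:, selected]  # this smaller matrix is what the network would train on
print("scores:", np.round(scores, 3))
print("selected feature indices:", selected.tolist())
```

Scoring each feature independently is cheap, which is what makes it attractive when CPU time is short, but it can miss features that only matter in combination. Wrapper methods such as genetic selection search over subsets instead, at a much higher cost.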