I have some data that exist on a graph G=(V,E). The vertices belong to one of two classes yi∈{−1,1}, and I’m interested in training an SVM to distinguish between the two classes. One appropriate kernel for this is the diffusion kernel, K=exp(−βL), where L is the Laplacian of G and β is a tuning parameter.
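For concreteness, here is a small sketch of how such a diffusion kernel can be computed. The toy adjacency matrix `A` and the value of `beta` are illustrative assumptions, not part of the question:

```python
# Sketch: computing the diffusion kernel K = exp(-beta * L) on a toy graph.
# The adjacency matrix A below is an illustrative example, not real data.
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # combinatorial graph Laplacian
beta = 0.5                   # diffusion tuning parameter
K = expm(-beta * L)          # matrix exponential: symmetric, positive definite
```

Since `L` is symmetric positive semi-definite, `K = exp(-beta * L)` is a valid (positive definite) kernel matrix for any `beta > 0`.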

Tuning the SVM requires selection of hyperparameters, so I have to tune θ=(β,C). Conventionally, we use cross-validation for this problem, but that doesn’t seem appropriate here, since omitting a vertex i from G changes the whole graph, possibly even increasing the number of connected components! If the number of connected components changes, some vertices become unreachable from others, and we are faced with a very different set of data than we began with. That is, not only are we missing the removed vertex i, but we’re also missing information about all other vertices j in the graph which were adjacent to that vertex.

The basic notion of cross-validation is that we would like to approximate how the model will perform when it is presented with new data. In standard problems, omitting some of your data for testing does not change the values of the remaining training data. However, in the case of graph data, it’s not clear what it means for the model to see “new” data in the CV setting. Omitting vertices or edges has the potential to change the data entirely. For example, imagine a graph S=(VS,ES) which is a k-star graph, in which one central vertex has k edges to k other vertices, and every other vertex has 1 edge. Omitting the central vertex to construct the training data S∗ will entirely disconnect the graph, and the kernel matrix will be diagonal! Of course, it will still be possible to train a model on the data in S∗. What is less clear is what it means to then test the out-of-sample performance of the resulting model. Does one recompute the kernel matrix for S, and provide that to make predictions?
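The k-star example can be verified directly: once the centre is removed, the remaining graph has no edges, its Laplacian is the zero matrix, and the diffusion kernel collapses to the identity. This is a small sketch; the star size `k` and `beta` are arbitrary choices:

```python
# Sketch: removing the centre of a k-star disconnects the graph, and the
# diffusion kernel of the leftover (edgeless) graph is the identity matrix.
import numpy as np
from scipy.linalg import expm

k = 5
A = np.zeros((k + 1, k + 1))
A[0, 1:] = 1                 # vertex 0 is the centre of the star
A[1:, 0] = 1
A_train = A[1:, 1:]          # omit the central vertex: no edges remain
L_train = np.diag(A_train.sum(axis=1)) - A_train   # all-zero Laplacian
K_train = expm(-0.5 * L_train)                     # exp(0) = I: diagonal kernel
```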

Or, alternatively, does one begin by computing the kernel matrix of S in its entirety and omit rows and columns as necessary to produce the kernel matrix used for estimating the SVM? This presents its own conceptual problems, since the inclusion of the central node in S means that every vertex is reachable from every other vertex, and the kernel matrix is dense. Does this inclusion mean that there is information leakage across folds, biasing the cross-validation output? On the one hand, information about the omitted central node is still present, since its inclusion makes the graph connected. On the other hand, we know nothing about the labels y of the omitted nodes, so we may be comfortable that we are getting reasonably unbiased out-of-sample estimates from performing CV in this manner.

How does one select hyperparameters for problems of this type? Is CV imperfect-but-acceptable, or do we need specialized methods? Is hyperparameter tuning even possible at all in my context?

**Answer**

Disclaimer: I’m not very familiar with graph kernels, so this answer might be based on wrong assumptions. I agree that omitting vertices while computing the kernel matrix is suboptimal. That said, I’m not sure that cross-validation is necessarily problematic. Is your learning context transduction or induction?

Overall, I am not convinced that computing the kernel matrix for a given β based on *all* data (i.e., both train and test) necessarily creates an information leak. If computing the kernel based on all data turns out to be okay, you can then train models in a typical CV setup, using the relevant blocks of the (precomputed) full kernel matrix for training and testing.

This approach would enable you to jointly optimize β and C, for example via libraries like Optunity, where β is used to compute the kernel based on all data and C is used to train models on the training folds exclusively.
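A minimal sketch of this setup, assuming a toy random graph and labels: the full kernel is computed once per β from all vertices, and only the train/train and test/train blocks are sliced out inside each fold. The answer mentions Optunity; a plain grid over (β, C) is used here purely for illustration:

```python
# Sketch: precompute the full diffusion kernel per beta, then cross-validate
# an SVM with a precomputed kernel by slicing train/test blocks.
# The random graph and labels below are toy data, not from the question.
import numpy as np
from scipy.linalg import expm
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 20
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                                    # random symmetric adjacency
L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian
y = np.r_[np.ones(10), -np.ones(10)]           # toy labels in {-1, 1}

best_params, best_score = None, -np.inf
for beta in [0.1, 1.0]:
    K = expm(-beta * L)                        # kernel built from ALL vertices
    for C in [0.1, 1.0, 10.0]:
        scores = []
        cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
        for tr, te in cv.split(K, y):
            clf = SVC(C=C, kernel="precomputed")
            clf.fit(K[np.ix_(tr, tr)], y[tr])           # train/train block
            scores.append(clf.score(K[np.ix_(te, tr)], y[te]))  # test/train block
        if np.mean(scores) > best_score:
            best_params, best_score = (beta, C), np.mean(scores)
```

Note that only β enters the kernel computation on the full graph, while C is used solely when fitting on the training folds, matching the division of labor described above.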

**Attribution**
*Source: Link, Question Author: Sycorax, Answer Author: Marc Claesen*