I have a binary classification problem where the classes are slightly unbalanced (25%-75% distribution). I have a total of around 35 features after some feature engineering, and the features I have are mostly continuous variables. I tried fitting a Logistic Model, an RF model, and an XGB Model. They all seem to give me the same performance. My understanding is that XGB Models generally fare a little better than Logistic Models for these kinds of problems. But, in my case I have no improvements with the boosting model over the logistic model even after tuning it a lot. I am wondering about the reasons why that could be the case?
There is no reason for us to expect that a particular type of model A has to perform better than another type of model B in every possible use-case. This extends to what is observed here; while XGBoost models do tend to be successful and generally provide competitive results, they are not guaranteed to be better than a logistic regression model in every setting.
Gradient boosting machines (the general family of methods XGBoost is a part of) are great but not perfect; for example, gradient boosting approaches usually have poorer probability calibration than logistic regression models (see Niculescu-Mizil & Caruana (2005) Obtaining Calibrated Probabilities from Boosting for more details). More generally, certain models are inherently more data-demanding, so maybe the dataset available is simply not expressive enough; van der Ploeg et al. (2014) Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints investigates this point really nicely.
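To make the calibration point concrete, here is a minimal sketch assuming scikit-learn, with `GradientBoostingClassifier` standing in for XGBoost and a synthetic dataset mimicking the setup described in the question (35 mostly continuous features, roughly 25%/75% class balance); all sizes and hyperparameters are illustrative assumptions, not part of the original question:

```python
# Sketch: comparing probability calibration of a logistic model vs a
# boosted-tree model via the Brier score (lower = better calibrated).
# GradientBoostingClassifier stands in for XGBoost; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss

# Imbalanced binary problem (~75%/25%), 35 continuous features
X, y = make_classification(n_samples=4000, n_features=35, n_informative=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("boosting", GradientBoostingClassifier(random_state=0))]:
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_te, proba):.4f}")
```

On real data the gap can go either way, which is exactly why it is worth measuring rather than assuming.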
Finally, we should evaluate the performance of an algorithm rigorously by using resampling approaches (e.g. 100 times 5-fold cross-validation) to get some measurement of the variability in the performance of the algorithm. Maybe on a particular hold-out set two algorithms have very similar performance, but the variability of their estimates is massively different. That has serious implications for when we deploy our model in the future or use it to draw conclusions about future performance.
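A minimal sketch of that repeated cross-validation idea, again assuming scikit-learn with `GradientBoostingClassifier` standing in for XGBoost on synthetic data (the 5x5 scheme here is scaled down from the 100x5 suggested above purely to keep the example fast):

```python
# Sketch: repeated stratified cross-validation to estimate not just mean
# performance but its variability across resampled fits. All data and
# hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1500, n_features=35, n_informative=10,
                           weights=[0.75, 0.25], random_state=1)
# 5-fold CV repeated 5 times -> 25 scores per model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=1)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("boosting", GradientBoostingClassifier(random_state=1))]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    # The spread (std) is the key quantity a single hold-out set hides
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the two models' score distributions overlap heavily, "same performance on one hold-out set" is exactly what we should expect to see.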