How to choose between a dummy variable and the amount of a variable that has a lot of zeros

I am trying to include several income types in a regression model (specifically a logit model). Some of these variables have many zeros (typically some types of capital income). How do I choose between including a dummy variable for whether the individual has this particular type of income or not … Read more
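One option sometimes used for zero-heavy regressors is to enter both the indicator and the amount, so the model can separate "has this income at all" from "how much". The sketch below only builds the two columns; the function name `income_features`, the log transform, and the treatment of non-positive amounts as "no income" are assumptions for illustration, not something stated in the question.

```python
import math

def income_features(amount):
    """Return (has_income, log_amount) for one income amount.

    Assumption: amounts <= 0 are treated as "no income of this type",
    and the amount enters the model on a log scale.
    """
    has_income = 1.0 if amount > 0 else 0.0
    # log(amount) is only defined for positive amounts; use 0.0 otherwise so
    # the dummy alone carries the effect of having no income of this type.
    log_amount = math.log(amount) if amount > 0 else 0.0
    return has_income, log_amount

# Hypothetical capital incomes for four individuals.
capital_income = [0.0, 0.0, 1500.0, 40.0]
design_columns = [income_features(a) for a in capital_income]
```

Both columns can then be passed to whatever logit routine is in use; comparing that fit against the dummy-only and amount-only fits is one way to decide between them.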

How to simplify/optimize model if dataset size is small?

Let’s say we have a dataset of 250 samples and we have engineered 30 attributes based on the domain knowledge. We perform a 10-fold CV and estimate the performance of our process of creating a model. Now as you know, it’s likely that some of the 30 attributes are redundant and getting rid of some … Read more

How to predict the probability of a discrete choice problem?

I am looking for a discrete choice model (e.g. a logit) to describe the reaction of a pedestrian when a car is arriving. Depending on the severity of the conflict and other geometrical parameters, he has to decide what to do. There are input parameters like: distance, speed of the car, acceleration of the car, … Read more

model selection, mixture of Gaussians

I have data and I want to decide whether it comes from a 5-modal normal distribution or a 2-modal normal distribution. In other words, I want to check whether it has 2 peaks or 5. I can estimate the μ and σ of each Gaussian, but I don’t know how to calculate the likelihood of each model. When I … Read more
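As a minimal sketch of the likelihood calculation asked about here: for a one-dimensional Gaussian mixture, the log-likelihood is the sum over data points of the log of the weighted component densities. The mixing weights are an assumption (the question mentions only μ and σ), the function names are hypothetical, and BIC is shown as one common way to compare the 2- and 5-component fits, not as the only criterion.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_log_likelihood(data, weights, mus, sigmas):
    """Log-likelihood of 1-D data under a Gaussian mixture.

    weights must sum to 1; weights, mus and sigmas have one entry per component.
    Note: a point extremely far from every component can underflow to
    density 0, in which case math.log raises ValueError.
    """
    total = 0.0
    for x in data:
        density = sum(w * normal_pdf(x, m, s)
                      for w, m, s in zip(weights, mus, sigmas))
        total += math.log(density)
    return total

def bic(log_lik, n_components, n_obs):
    """BIC for a 1-D mixture: each component has a mean and a sigma,
    plus (n_components - 1) free mixing weights. Lower is better."""
    n_params = 3 * n_components - 1
    return n_params * math.log(n_obs) - 2.0 * log_lik
```

Because the 5-component model always fits at least as well as a nested 2-component one, comparing raw likelihoods favours the larger model; a penalised criterion such as BIC is one way to account for the extra parameters.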

Non-linear fitting with uncertainty in dependent and independent variable

This is related to this question. What is the best strategy/package in R/python/Mathematica to fit a non-linear model to data with uncertainty measures in both variables? Elaborating: For every data point, I have several measures of two variables, X and Z. The different measures can be treated as independent replicates. I compute $\tilde{X}$ and $\tilde{Z}$ … Read more

Small data set with more independent variables than observations: what to do? Should I apply Lasso?

In my data set I have only 6 observations and 9 independent variables. Can I use Lasso regression or multiple regression in this situation, when there are more independent variables than observations?
Data:
y x1 x2 x3 x4 x5 x6 x7 x8 x9
6142.8 90.25 164.19 15 0.91 0.88 2.99 0.5 7.255 15
8174.2 126.9 … Read more

Optimism bias – alternative references

In Hastie et al.’s book Elements of Statistical Learning, there are two subsections covering in-sample prediction error and optimism bias (section 7, p. 228-230). Hastie et al. explain that, defining the in-sample prediction error as (1), we can take the expected value of this quantity w.r.t. Y, given X, and get (2), with $\overline{\mathrm{err}}$ the … Read more

Where is studying the Bispectrum useful?

In cosmology it is well known that studying the bispectrum of the large scale structure of the universe is a powerful way to distinguish different models of cosmic initial conditions. I had assumed that the bispectrum had endless applications in many areas of science and data analysis more generally but when I tried to search … Read more

Does the complexity of a basis function correspond to the number of DOF required to select it?

My general question is whether the number of DOF lost when selecting a basis function from a library depends on the complexity of the basis function. More specifically: we try to fit a curve, which is perturbed during a certain time. To model the perturbation we use a library of basis functions starting from the … Read more

Pearson $\chi^2$ residuals vs deviance

In a GLM, how should I interpret the difference between using the sum of the model’s Pearson residuals and the model’s deviance to assess the fit of my model? Is the former more “flexible” since I can estimate a dispersion parameter? I feel like the deviance provides a more parametric assessment since it takes the form of the … Read more
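To make the two quantities concrete, here is a small sketch that computes both from observed responses and fitted means. It assumes a Poisson family, which the question does not specify; the function names and the example values are hypothetical. The Pearson statistic is taken as the sum of squared Pearson residuals, and the dispersion estimate is that statistic divided by the residual degrees of freedom.

```python
import math

def pearson_chi2(y, mu):
    """Sum of squared Pearson residuals for a Poisson GLM, where Var(y) = mu."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu)).

    The y * log(y / mu) term is taken as 0 when y == 0 (its limiting value).
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

def dispersion_estimate(y, mu, n_params):
    """Pearson-based dispersion: chi-square over residual degrees of freedom."""
    return pearson_chi2(y, mu) / (len(y) - n_params)

# Hypothetical observed counts and fitted means from a one-parameter model.
y_obs = [2, 0, 5]
mu_hat = [2.0, 1.0, 4.0]
chi2 = pearson_chi2(y_obs, mu_hat)
dev = poisson_deviance(y_obs, mu_hat)
phi = dispersion_estimate(y_obs, mu_hat, n_params=1)
```

The deviance is built from the log-likelihood of the assumed family, while the Pearson statistic uses only the mean and variance function, which is why the two can differ noticeably on the same fit.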