## Approach for highly unbalanced panel dataset

I have a highly unbalanced panel data set for which I want to construct a predictive model. Because the response variable consists of count data, I chose a negative binomial regression model. Since there may be multiple observations per individual over time, I would expect the observations per individual to be possibly …

## Log and inverse data transformation in linear regression model

I am studying the behaviour of particulate matter (PM10) concentration in response to changes in rain and temperature. My data were not normally distributed, so I had to transform them; I tried a log transformation and an inverse transformation. The adjusted R-squared is 0.07918 for the log transformation and 0.1002 for the inverse transformation. Now, according to the rule, I must …

## Residuals strongly positively correlated with response variable in linear regression

I ran a multiple linear regression on a dataset of 412 observations, with one response variable (Y) and 25 explanatory variables (X1-X25). Y and most of the Xs are not normally distributed. In addition, several of the Xs are correlated with one another. The plot shows that the residuals are strongly positively correlated with Y and weakly correlated with fitted Y …

## Decorrelating Systems of Random Variables

Suppose we have a dataset that includes dozens of attributes that are all correlated with each other. I would like to better understand which variables affect which other variables in a causal way, preferably as a system of regressions. The classical approach would be to set up a structured model, identifying manually which attributes are endogenous and …

## Estimating burstiness in a time series: Dependence on overall frequency?

I have a dataset of time series normalized for length. A working R example is below. I am trying to estimate burstiness within the series, i.e. to compare series by how clustered their events are. Here I just used a simple index of dispersion (variance of intervals divided by mean of intervals). For this …
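The index of dispersion mentioned in this excerpt can be sketched as follows. This is a minimal Python illustration rather than the asker's R code: the function name `interval_dispersion`, the toy event times, and the choice of population variance (rather than sample variance) are all assumptions made for the example.

```python
from statistics import mean, pvariance


def interval_dispersion(event_times):
    """Variance of inter-event intervals divided by their mean.

    Hypothetical helper: uses the population variance, which is an
    assumption; the excerpt does not say which variance estimator is used.
    """
    times = sorted(event_times)
    # Intervals between consecutive events.
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three events to compare intervals")
    return pvariance(intervals) / mean(intervals)


# Toy data (made up): evenly spaced events versus two tight clusters.
regular = [0, 10, 20, 30, 40]
bursty = [0, 1, 2, 70, 71]

regular_index = interval_dispersion(regular)
bursty_index = interval_dispersion(bursty)
```

Evenly spaced events give an index of zero, while clustered events give a large value. Note that this ratio carries the units of the intervals, so rescaling the time axis rescales the index by the same factor.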

## Correlation matrix test: is this code correct or is it missing a multiple comparisons correction?

I have $m$ variables $x_1,\dots,x_m$, measured in $N$ independent tests $\{x_{i1},\dots,x_{im}\}_{i=1}^N$, leading to the design matrix $X$. I noticed that the demo script `corrplot_intro.r` from the R package corrplot includes a nice function, `cor.mtest` (reported below), which computes the pairwise correlations for $x_1,\dots,x_m$ in the sample $X$ and reports the corresponding $p$-values: cor.mtest <- function(mat, …
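To make the multiple comparisons issue concrete: $m$ variables produce $m(m-1)/2$ pairwise tests, and one common way to account for that is the Holm step-down adjustment. The sketch below is a plain Python illustration, not the `cor.mtest` code from corrplot; the function name `holm_adjust` and the three p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment of a list of p-values.

    Hypothetical helper for illustration; returns adjusted p-values
    in the same order as the input.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for the 3 pairs among m = 3 variables.
n_variables = 3
n_pairs = n_variables * (n_variables - 1) // 2
raw_p = [0.01, 0.04, 0.03]
adjusted_p = holm_adjust(raw_p)
```

With these made-up inputs, the raw value 0.04 would pass a 0.05 threshold on its own but no longer does after adjustment, which is the kind of difference the question is asking about.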

## Multiple testing correction for pairwise correlations

I have a dataset comprising two $n \times d$ matrices, $A$ and $B$. Each consists of $n$ observations of $d$ angles. The angles are taken from two different meshes of planar triangles that are in correspondence, such that columns $A_i$ and $B_i$ both represent $n$ measurements of angle $\alpha_i$. Many columns of $A$ …

## How to compute the correlation between a qualitative and a quantitative variable? [duplicate]

This question already has answers here: "Correlations with unordered categorical variables" (6 answers); "What is Polychoric Correlation Coefficient intuitively?" (1 answer). Closed 5 years ago. My study is about the correlation between a person's gender and his or her score on a questionnaire about gender stereotyping. The higher the score in the questionnaire means he …

## Defend autocorrelation in VAR model

I am creating an unrestricted VAR model with 9 variables and 12 lags (the lag order was determined by LR, FPE and AIC, and is in line with theory). But the model still has some autocorrelation: the p-values at some lags are below 0.05, while at others they are above 0.05. I am afraid …

## Too high statistical significance on random data

I am trying to understand a statistics problem that I encountered. Reproducible R code is below; all libraries listed are needed. What I do is: create two tables of size $101 \times 1000$, where there are $100$ inputs (a vector of delayed time series points) and $1$ output (a one-step-ahead series point). The tables are …