I have a continuous random variable X that can easily be sampled. I don't make any other assumptions about X. Let's say I have sampled X and constructed the set S. We can assume that S is as big as needed.
I want to approximate its probability distribution. By this I mean that I would like to "guess" a probability distribution such that sampling it gives me a set of values T that is statistically equivalent to S. I understand that this is still a vague question, so I am happy with any practical solution.
I guess the obvious solution is to approximate the PDF by the “histogram” of S. I assume that if S is big enough, the approximation will be good enough. But is there anything more clever that can be done?
Is there any known and trusted method to do that? For example, can I use the first few moments to improve my guess?
The histogram approximation might be better than you think. The simplest "histogram" approximation is to use a discrete distribution with a point mass of $1/n$ at each of the $n$ observations. This is the empirical distribution, and the corresponding CDF $\hat F_n$ is the empirical cumulative distribution function (ECDF). With iid data, the ECDF enjoys a number of properties, one of which is the Dvoretzky–Kiefer–Wolfowitz inequality:

$$P\left(\sup_x \left|\hat F_n(x) - F(x)\right| > \epsilon\right) \le 2e^{-2n\epsilon^2}.$$
This means that the probability of the largest deviation exceeding some $\epsilon$ decreases exponentially in $n$. Since you have access to lots of samples, you can make this probability tiny even for a very small $\epsilon$.
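As a quick sanity check of the DKW bound, here is a sketch in Python: it samples from a known distribution (a standard normal, purely as a stand-in for X), computes the sup deviation between the ECDF and the true CDF, and compares it to the bound. Variable names like `dkw_bound` are mine, not standard.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000
s = np.sort(rng.normal(size=n))  # pretend X ~ N(0, 1) so we know the true F

# The ECDF is a step function; its largest deviation from the true CDF is
# attained just before or just after one of the observations.
ecdf_hi = np.arange(1, n + 1) / n   # ECDF value just after each point
ecdf_lo = np.arange(0, n) / n       # ECDF value just before each point
true_cdf = norm.cdf(s)
sup_dev = max(np.max(np.abs(ecdf_hi - true_cdf)),
              np.max(np.abs(ecdf_lo - true_cdf)))

eps = 0.02
dkw_bound = 2 * np.exp(-2 * n * eps**2)  # P(sup deviation > eps) <= this
print(f"sup deviation = {sup_dev:.4f}, DKW bound for eps={eps}: {dkw_bound:.2e}")
```

With $n = 10{,}000$ the bound for $\epsilon = 0.02$ is already below $10^{-3}$, so the observed sup deviation should comfortably sit under $\epsilon$.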
Sampling from $\hat F_n$ is equivalent to taking a bootstrap sample from your data, and the quality of $\hat F_n$ as an estimator of $F$ is a big part of why bootstrapping works so well.
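Concretely, drawing from $\hat F_n$ is just resampling your data with replacement. A minimal sketch, where `s` stands in for your observed set S (here drawn from an exponential purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.exponential(scale=2.0, size=5_000)  # hypothetical observed data S

# A draw of T from \hat F_n = a bootstrap sample: uniform sampling from S
# with replacement, i.e. each observation has probability 1/n per draw.
t = rng.choice(s, size=len(s), replace=True)

# T should be statistically close to S, e.g. in its low-order moments.
print(np.mean(s), np.mean(t))
```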
There are lots of other options too, though, such as kernel density estimators. If you have access to lots of samples, then many strategies will work, since large-sample properties will be kicking in and essentially any consistent estimator will perform well.
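If you want a smooth estimate rather than a discrete one, a kernel density estimate is a one-liner with SciPy, and `gaussian_kde` can also generate new samples directly. A sketch, again with a synthetic stand-in for S:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
s = rng.normal(loc=3.0, scale=1.5, size=5_000)  # hypothetical sample S

# Fit a KDE (sum of Gaussian kernels, bandwidth chosen by Scott's rule
# by default), then draw a new set T from the smoothed estimate.
kde = gaussian_kde(s)
t = kde.resample(5_000, seed=3)[0]  # resample returns shape (d, size)

print(np.mean(t), np.std(t))
```

One caveat of the Gaussian KDE: sampling from it adds kernel noise, so T has slightly larger variance than S (inflated by the squared bandwidth), which may or may not matter for your application.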