What exactly is building a statistical model?
These days, as I apply for research and consulting jobs, the term “building a model” or “modelling” often comes up. The term sounds cool, but what exactly does it refer to? How do you build a model?
I looked up predictive modelling, which includes k-nn and logistic regression.
I’ll take a crack at this. I’m not a statistician by any means, but I end up doing a lot of ‘modeling’, both statistical and non-statistical.
First let’s start with the basics:
What IS a model exactly?
A model is a representation of reality albeit highly simplified. Think of a wax/wood ‘model’ for a house. You can touch/feel/smell it. Now a mathematical model is a representation of reality using numbers.
What is this ‘reality’, I hear you ask? Okay. Think of this simple situation: the governor of your state implements a policy saying that a pack of cigarettes will cost $100 for the next year. The ‘aim’ is to deter people from buying cigarettes, thereby decreasing smoking, thereby making smokers healthier (because they’d quit).
After 1 year the governor asks you: was this a success? How can you tell? Well, you capture data like the number of packs sold per day or per year, survey responses, any measurable data you can get your hands on that is relevant to the problem. You’ve just begun to ‘model’ the problem. Now you want to analyze what this ‘model’ says. That’s where statistical modeling comes in handy. You could compute a simple correlation or draw a scatter plot to see what the model ‘looks like’. You could get fancier and try to establish causality, i.e., whether the price increase really did lead to a decrease in smoking or whether confounding factors were at play (maybe it was something else altogether and your model missed it).
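To make the ‘simple correlation’ step concrete, here is a rough sketch in plain Python. The price and sales figures are made up purely for illustration (they are not real data), and the helper names `pearson_r` and `fit_line` are my own. It computes the Pearson correlation and a least-squares line; it says nothing about causality or confounders.

```python
from math import sqrt

# Hypothetical numbers, invented for illustration only (not real data):
# pack price in dollars and packs sold per day (in thousands).
prices = [8, 10, 15, 25, 40, 60, 80, 100]
packs_sold = [95, 92, 85, 70, 55, 40, 30, 22]


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


r = pearson_r(prices, packs_sold)
slope, intercept = fit_line(prices, packs_sold)
print(f"correlation r = {r:.3f}")
print(f"packs_sold ~ {intercept:.1f} + ({slope:.3f}) * price")
```

A strongly negative `r` would only tell you that sales fell as price rose in this data. Whether the price caused the drop is the harder question, and that is where the confounding factors come in.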
Now, this model is constructed according to a ‘set of rules’ (more like guidelines), i.e., what is/isn’t legal or what does/doesn’t make sense. You should know what you are doing and how to interpret the results of the model. Building, executing and interpreting it requires basic knowledge of statistics. In the example above you need to know about correlation and scatter plots, regression (univariate and multivariate) and other stuff. I suggest an absolutely fun and informative read on understanding statistics intuitively: What is a p-value anyway? It’s a humorous intro to statistics and will teach you ‘modeling’ along the way, from simple to advanced (i.e., linear regression). Then you can go on and read other stuff.
So, remember that “All models are wrong, but some are useful” (a line usually attributed to George Box). A model is a simplified representation of reality. You can’t possibly consider everything, but you must know what to consider and what to leave out in order to have a good model that gives you meaningful results.
It doesn’t stop here. You can create models to simulate reality too! That is, how a bunch of numbers will change over time (say). These numbers map to some meaningful interpretation in your domain. You can also create models to mine your data and see how the various measures relate to each other (the application of statistics here may be questionable, but don’t worry about that for now). Example: you look at a store’s monthly grocery sales and realize that whenever beer is bought, so is a pack of diapers (you build a model that runs through the data set and shows you this association). It may be weird, but it may imply that it’s mostly fathers buying these over the weekend while babysitting their kids. Put diapers near the beer and you may increase your sales! Aaah! Modeling 🙂
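Here is roughly what ‘a model that runs through the data set and shows you this association’ could look like. This is only a sketch: the baskets are invented for illustration, and `support`, `confidence` and `lift` are the usual association-rule measures written by hand rather than taken from any particular library.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "diapers"},
    {"beer", "chips"},
    {"milk", "diapers", "beer"},
    {"bread", "milk"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to how common `consequent` is overall (> 1 suggests an association)."""
    return confidence(antecedent, consequent) / support(consequent)


print("support(beer, diapers)    =", support({"beer", "diapers"}))
print("confidence(beer->diapers) =", round(confidence({"beer"}, {"diapers"}), 3))
print("lift(beer->diapers)       =", round(lift({"beer"}, {"diapers"}), 3))
```

A lift above 1 means beer and diapers show up together more often than you would expect if they were bought independently. It does not tell you why, so the ‘fathers on the weekend’ story is still your interpretation and not something the numbers prove.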
These are just examples and by no means a reference for professional work. You basically build models to understand or estimate how reality will function (or did function) and to make better decisions based on the outputs. Statistics or not, you’ve probably been doing modeling all your life without realizing it. Best of luck 🙂