Using text mining/natural language processing tools for econometrics

I am not sure whether this question is fully appropriate here; if not, please delete it.

I am a grad student in economics. For a project that investigates issues in social insurance, I have access to a large number of administrative case reports (>200k) dealing with eligibility evaluations. These reports can possibly be linked to individual administrative information. I want to extract information from these reports that can be used in quantitative analysis, ideally going beyond simple keyword/regex searches using grep/awk etc.

How useful is Natural Language Processing for this? What are other useful text-mining approaches? From what I understand this is a large field, and most likely some of the reports would have to be transformed to be used as a corpus. Is it worth investing some time to become acquainted with the literature and methods? Can it be helpful and has something similar been done before? Is it worth it in terms of the rewards, i.e. can I extract potentially useful information using NLP for an empirical study in economics?

There is possibly funding to hire somebody to read and prep some of the reports. This is a larger project and there is a possibility to apply for more funding. I can provide more details about the topic if strictly necessary. One potential complication is that the language is German, not English.

Regarding qualifications, I am mostly trained in econometrics, and have some knowledge about computational statistics at the level of the Hastie et al. book. I know Python, R, Stata, and could probably get familiar with Matlab quickly. Given the libraries, I assume Python is the tool of choice for this. No training at all in qualitative methods if this is relevant, but I know some people I could reach out to.

I am glad for any input on this: whether it is potentially useful and, if so, where to start reading and which tools to focus on in particular.


I think it would benefit you to define what information you want to extract from the data. Simple keyword/regex searches may actually be very fruitful for you. I work in insurance and we use this sort of text mining quite frequently. It’s arguably naive and definitely imperfect, but it is a relatively good start (or close approximation) to what we’re generally interested in.
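To make the keyword/regex idea concrete, here is a minimal sketch of turning each report into 0/1 indicator variables that could later be merged with the administrative data. The German keywords, the variable names, and the two sample reports are all made up for illustration; the real patterns depend entirely on what you decide to extract.

```python
import re

# Hypothetical patterns: placeholders for whatever concepts matter to the study.
PATTERNS = {
    "mentions_rejection": re.compile(r"\babgelehnt\b|\bablehnung\b", re.IGNORECASE),
    "mentions_medical_report": re.compile(r"\bgutachten\b", re.IGNORECASE),
}


def keyword_flags(text):
    """Return one 0/1 indicator per pattern for a single report."""
    return {name: int(bool(pattern.search(text))) for name, pattern in PATTERNS.items()}


# Made-up sample reports keyed by a hypothetical case ID.
reports = {
    "case_001": "Der Antrag wurde abgelehnt. Ein Gutachten liegt vor.",
    "case_002": "Der Antrag wurde bewilligt.",
}

# One row of indicators per case, ready to be joined on the case ID.
flags = {case_id: keyword_flags(text) for case_id, text in reports.items()}
```

Whole-word matching like this will miss inflected or compound forms, which are common in German, so the patterns would likely need widening after looking at real examples.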

But to my main point, in order to figure out whether your chosen method is appropriate, I’d recommend defining what exactly you want to extract from the data; that’s the hardest part, in my opinion.

It may be interesting to tabulate the unique words across all of the reports and look at the frequencies of the 1,000 or so most common ones. This may be computationally expensive (depending on your RAM/processor), but if I were exploring the data without much knowledge about it, this is where I’d start (others may offer different views).
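A word-frequency pass of this kind can be sketched with the Python standard library alone. The tokenizer below is a deliberately crude assumption (lowercased runs of letters, umlauts and ß included), and the sample sentences are invented; a real run would loop over the report files instead.

```python
import re
from collections import Counter

# Runs of Unicode letters only: excludes digits, underscores and punctuation.
TOKEN = re.compile(r"[^\W\d_]+")


def top_words(texts, n=1000):
    """Count lowercased word tokens over an iterable of texts; return the n most common."""
    counts = Counter()
    for text in texts:
        # Update per document so only the vocabulary counts stay in memory.
        counts.update(token.lower() for token in TOKEN.findall(text))
    return counts.most_common(n)


# Made-up sample texts standing in for the reports.
sample = ["Der Antrag wurde abgelehnt.", "Der Antrag wurde bewilligt."]
result = top_words(sample, n=3)
```

Because `texts` can be any iterable, the reports could be read one at a time from disk, which keeps memory use tied to the size of the vocabulary and not to the size of the corpus.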

Hope that helps.

Question author: ilprincipe. Answer author: Francisco Arceo.
