I want to build a crime index and political instability index based on news stories

I have this side project where I crawl local news websites in my country, and I want to build a crime index and a political instability index.
I have already covered the information-retrieval part of the project. My plan is to do:

  • Unsupervised topic extraction.
  • Near duplicates detection.
  • Supervised classification of incident type and level (crime/political – high/medium/low).

I will use Python and sklearn and have already researched the algorithms I can use for those tasks. I think step 2 could give me a relevancy factor for a story: the more newspapers publish about a story or topic, the more relevant it is for that day.
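One common way to sketch step 2 with sklearn is TF-IDF vectors plus pairwise cosine similarity, counting how many near-duplicates each story has as a crude relevancy signal. The headlines and the similarity threshold below are made up for illustration:

```python
# Sketch: near-duplicate detection via TF-IDF + cosine similarity.
# Sample headlines and the 0.5 threshold are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stories = [
    "Robbery reported in downtown market",
    "Downtown market robbery reported by police",
    "Parliament debates new budget proposal",
]

vectors = TfidfVectorizer().fit_transform(stories)
sim = cosine_similarity(vectors)

# For each story, count how many OTHER stories look like near-duplicates;
# that count can serve as the per-day relevancy factor described above.
THRESHOLD = 0.5
relevancy = (sim > THRESHOLD).sum(axis=1) - 1  # subtract the self-match
print(relevancy)  # first two stories match each other, the third stands alone
```

For a large crawl, all-pairs cosine similarity is quadratic; MinHash/LSH (e.g. the `datasketch` library) scales better for the same task.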

My next step is to build the monthly, weekly, and daily indices (nation-wide and per city) based on the features that I have, and I'm a little lost here because the "instability sensitivity" might increase over time. I mean, the index for the biggest instability incident of last year could end up lower than the index for this year. I'm also unsure whether to use a fixed 0-100 scale or not.
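One way to keep a 0-100 scale while letting the baseline drift is to normalize each day's severity-weighted count against a rolling window rather than a fixed all-time maximum. The column names, dates, and severity weights below are hypothetical, not from the question:

```python
# Sketch: daily index from classified incidents, normalized against a
# rolling one-year maximum. Weights and data are illustrative assumptions.
import pandas as pd

incidents = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01",
                            "2023-01-02", "2023-01-03"]),
    "severity": ["high", "low", "medium", "high"],  # classifier output
})

weights = {"low": 1, "medium": 3, "high": 9}
daily = (incidents.assign(w=incidents["severity"].map(weights))
                  .groupby("date")["w"].sum()
                  .asfreq("D", fill_value=0))

# Dividing by a rolling max keeps the index in 0-100 while letting the
# reference level drift, so this year's "100" need not equal last year's.
window_max = daily.rolling("365D", min_periods=1).max()
index = 100 * daily / window_max
print(index)
```

Weekly and monthly indices follow the same pattern with `daily.resample("W").sum()` or `resample("ME").sum()` before normalizing; a per-city version just adds the city to the `groupby`.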

Later I would like to be able to predict incidents from this, e.g. whether the succession of events over the last weeks is leading to a major incident. But for now I will be happy with getting the classification working and building the index model.

I would appreciate any pointer to a paper, relevant readings or thoughts.

PS: Sorry if the question does not belong here.

UPDATE: I haven't "made it" yet, but recently there was news about a group of scientists working on a system to predict events using news archives; they released a relevant paper, Mining the Web to Predict Future Events (PDF).


Consider variations on the Gini coefficient.

It is normalized, and its output ranges from 0 to 1.


Why the Gini coefficient is "cool", or at least potentially appropriate:

It is a measure of inequality or inequity. It is used as a scale-free measure to characterize the heterogeneity of scale-free networks, including infinite and random ones. It is also useful in building CART trees, because it measures the splitting power of a particular data split.

Because of its range:

  • there are fewer round-off errors; values whose magnitude is far from 1.0 tend to suffer numeric issues.
  • it is human readable, and more humanly accessible. Humans have a more concrete grasp of quantities near one than they do of billions.

Because it is normalized:

  • comparisons of scores are meaningful: a 0.9 in one country means the same level of relative non-uniformity as a 0.9 in any other country.
  • it is normalized against the Lorenz curve of perfect uniformity, so the values directly indicate how the distribution of interest relates to that curve.
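As a concrete illustration of those properties, here is a minimal Gini computation over, say, incident counts per city; the counts are made up, and the function uses the standard sorted-values formula rather than anything from the answer itself:

```python
# Sketch: Gini coefficient of a distribution (e.g. incidents per city).
# Example counts are illustrative assumptions.
import numpy as np

def gini(values):
    """Gini coefficient via the sorted-values (mean-difference) formula.

    Returns 0 for a perfectly uniform distribution and tends toward 1
    as the mass concentrates in a single observation.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return (2 * (ranks * x).sum()) / (n * x.sum()) - (n + 1) / n

print(gini([10, 10, 10, 10]))  # uniform spread across cities -> 0.0
print(gini([0, 0, 0, 40]))     # all incidents in one city -> close to 1
```

Note that with n observations the maximum attainable value is (n - 1)/n, which is one reason "variations on" the plain coefficient (e.g. small-sample corrections) can matter.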


Source: Link. Question author: R. Max. Answer author: EngrStudent.