I have a side project where I crawl local news websites in my country, and I want to build a crime index and a political instability index.
I have already covered the information retrieval part of the project. My plan is to do:
- Unsupervised topic extraction.
- Near duplicates detection.
- Supervised classification of incident type and severity (crime/political, high/medium/low).
I will use Python and scikit-learn, and I have already researched the algorithms I can use for those tasks. I think step 2 could give me a relevance factor for a story: the more newspapers publish about a story or topic, the more relevant it is for that day.
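As one possible shape for that relevance factor, here is a minimal pure-Python sketch that groups near-duplicate stories by character-shingle Jaccard similarity and counts cluster sizes. The function names and the `threshold` value are illustrative assumptions, not a recommendation:

```python
def shingles(text, k=3):
    """Lower-cased character k-shingles of a headline or article."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

def relevance_counts(stories, threshold=0.6):
    """Greedily cluster near-duplicate stories; a story's relevance
    for the day is its cluster size (how many outlets ran it)."""
    clusters = []  # list of (representative shingle set, count)
    for s in stories:
        sh = shingles(s)
        for i, (rep, n) in enumerate(clusters):
            if jaccard(sh, rep) >= threshold:
                clusters[i] = (rep, n + 1)
                break
        else:
            clusters.append((sh, 1))
    return [n for _, n in clusters]
```

In practice you would likely replace the greedy pairwise loop with MinHash/LSH (or TF-IDF cosine similarity in scikit-learn) once the corpus grows, since this version is quadratic in the number of stories per day.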
My next step is to build the monthly, weekly and daily indexes (nation-wide and per city) based on the features I have, and I'm a little lost here, since the "instability sensitivity" might increase over time. I mean, the index for last year's major instability incident could end up lower than this year's index. I'm also unsure whether to use a fixed 0-100 scale or not.
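One way to handle that drift is to avoid a fixed global scale and instead express each day's raw score as a percentile rank over a trailing window, so the 0-100 index adapts as the baseline level of instability shifts. A minimal sketch, assuming one raw score per day and an illustrative 365-day window:

```python
from collections import deque

def rolling_percentile_index(raw_scores, window=365):
    """Map each day's raw score to 0-100 as its percentile rank
    within the trailing `window` days, so the scale adapts as the
    baseline level of instability drifts over time."""
    history = deque(maxlen=window)
    indexed = []
    for score in raw_scores:
        if history:
            rank = sum(1 for h in history if h <= score) / len(history)
        else:
            rank = 0.5  # no history yet: treat it as a typical day
        indexed.append(round(100 * rank, 1))
        history.append(score)
    return indexed
```

The trade-off is exactly the one you describe: a percentile index measures "unusual relative to the recent past", so last year's worst incident and this year's worst incident can both score near 100 even if this year is objectively worse. If you need cross-year comparability, keep the raw (unnormalized) series alongside the indexed one.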
Later I would like to be able to predict incidents from this, e.g. whether the succession of events in the last few weeks is leading up to a major incident. But for now I will be happy to get the classification working and to build the index model.
I would appreciate any pointers to papers, relevant readings or thoughts.
PS: Sorry if the question does not belong here.
UPDATE: I haven't "made it" yet, but there was recently news about a group of scientists working on a system to predict events using news archives, and they released a relevant paper: Mining the Web to Predict Future Events (PDF).
Consider variations on the Gini coefficient.
It is normalized, and its output ranges from 0 to 1.
Why the Gini coefficient is "cool", or at least potentially appropriate:
- It is a measure of inequality or inequity.
- It is used as a scale-free measure to characterize the heterogeneity of scale-free networks, including infinite and random networks.
- It is useful in building CART trees, because it measures the splitting power of a particular data split.
Because of its range:
- there are fewer round-off errors; ranges far away from 1.0 tend to suffer from numeric issues.
- it is human readable and more accessible: humans have a more concrete grasp of quantities near one than they do of billions.
Because it is normalized:
- comparisons of scores are meaningful: a 0.9 in one country means the same level of relative non-uniformity as a 0.9 in any other country.
- it is normalized against the Lorenz curve for perfect uniformity, so its values directly indicate how far a distribution of interest departs from that curve.
- http://data.worldbank.org/indicator/SI.POV.GINI
- http://research3.bus.wisc.edu/file.php/129/Papers/Gini27April2011.pdf
- http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp
- http://www2.unine.ch/files/content/sites/imi/files/shared/documents/papers/Gini_index_fulltext.pdf
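For reference, the coefficient itself is cheap to compute. A minimal pure-Python sketch using the standard sorted-order identity (the function name and edge-case handling are mine, not from the links above):

```python
def gini(values):
    """Gini coefficient of a list of non-negative values:
    0 = perfectly uniform, approaching 1 = maximally concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sum of pairwise absolute differences, via the sorted-order
    # identity: G = sum_i (2i - n - 1) * x_i / (n * sum(x)).
    cum = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs))
    return cum / (n * total)
```

For the index use case, `values` could be, say, per-city incident counts for a day, giving a 0-1 measure of how geographically concentrated that day's instability is.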