I am overseeing the input of data from primary literature into a database. The data entry process is error prone, particularly because users must interpret experimental design, extract data from graphics and tables, and transform results to standardized units.
Data are input into a MySQL database through a web interface. Over 10k data points from > 20 variables, > 100 species, and > 500 citations have been included so far. I need to run checks of the quality of not only the variable data, but also the data contained in lookup tables, such as the species associated with each data point, the location of the study, etc.
Data entry is ongoing, so QA/QC will need to be run intermittently. The data have not yet been publicly released, but we are planning to release them in the next few months.
Currently, my QA/QC involves three steps:
- a second user checks each data point.
- I visually inspect a histogram of each variable for outliers.
- users report questionable data after spurious results are obtained.
My questions:

1. Are there guidelines that I can use for developing a robust QA/QC procedure for this database?
2. The first step is the most time consuming; is there anything that I can do to make this more efficient?
This response focuses on the second question, but in the process a partial answer to the first question (guidelines for a QA/QC procedure) will emerge.
By far the best thing you can do is check data quality at the time entry is attempted. The user checks and reports are labor-intensive and so should be reserved for later in the process, as late as is practicable.
Here are some principles, guidelines, and suggestions, derived from extensive experience (with the design and creation of many databases comparable to and much larger than yours). They are not rules; you do not have to follow them to be successful and efficient; but they are all here for excellent reasons and you should think hard about deviating from them.
Separate data entry from all intellectually demanding activities. Do not ask data entry operators simultaneously to check anything, count anything, etc. Restrict their work to creating a computer-readable facsimile of the data, nothing more. In particular, this principle implies the data-entry forms should reflect the format in which you originally obtain the data, not the format in which you plan to store the data. It is relatively easy to transform one format to another later, but it’s an error-prone process to attempt the transformation on the fly while entering data.
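To make this concrete, here is a minimal sketch of the "transform later, by script" idea: the operator enters rows exactly as the paper's table reports them (say, biomass in g/m²), and a separate, documented script converts them to the standardized storage format. All field names and units here are assumptions for illustration, not your actual schema.

```python
# Hypothetical sketch: rows mirror the source paper's table (biomass in
# g/m^2); a script converts them to the storage format (kg/ha).
# Field names are assumptions, not a real schema.

def to_storage(entered_row):
    """Convert one as-entered row to the standardized storage format."""
    G_PER_M2_TO_KG_PER_HA = 10.0  # 1 g/m^2 = 10 kg/ha (1 ha = 10^4 m^2)
    return {
        "species": entered_row["species"].strip(),
        "value_kg_ha": entered_row["biomass_g_m2"] * G_PER_M2_TO_KG_PER_HA,
        "source_page": entered_row["page"],  # part of the audit trail
    }

row = {"species": " Pinus strobus", "biomass_g_m2": 250.0, "page": 12}
print(to_storage(row))  # value_kg_ha: 2500.0
```

Because the conversion lives in one script rather than in each operator's head, it can be checked once, documented, and rerun if a conversion factor turns out to be wrong.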
Create a data audit trail: whenever anything is done to the data, starting at the data entry stage, document this and record the procedure in a way that makes it easy to go back and check what went wrong (because things will go wrong). Consider filling out fields for time stamps, identifiers of data entry operators, identifiers of sources for the original data (such as reports and their page numbers), etc. Storage is cheap, but the time to track down an error is expensive.
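An audit trail can be as simple as an append-only log of who did what, when, and from which source. The sketch below uses an in-memory list to keep the example self-contained; in practice this would be a database table. The field and operator names are made up for illustration.

```python
import datetime

audit_log = []  # in practice an append-only database table, not a list

def record(action, operator, source, details):
    """Append one audit entry; earlier entries are never edited or deleted."""
    audit_log.append({
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action,
        "operator": operator,   # person or script responsible
        "source": source,       # e.g. citation plus page number
        "details": details,
    })

record("entry", "operator_a", "Smith 2010, p. 143", "entered 10 rows")
record("unit_conversion", "script:to_kg_ha", "Smith 2010", "g/m2 -> kg/ha")
```

When a spurious value surfaces months later, this log answers "which operator entered it, from which page, and which scripts touched it since" without guesswork.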
Automate everything. Assume any step will have to be redone (at the worst possible time, according to Murphy’s Law), and plan accordingly. Don’t try to save time now by doing a few “simple steps” by hand.
In particular, create support for data entry: make a front end for each table (even a spreadsheet can do nicely) that provides a clear, simple, uniform way to get data in. At the same time the front end should enforce your “business rules:” that is, it should perform as many simple validity checks as it can. (E.g., pH must be between 0 and 14; counts must be positive.) Ideally, use a DBMS to enforce relational integrity checks (e.g., every species associated with a measurement really exists in the database).
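As a sketch of what such front-end checks look like, here is a validator that applies the rules mentioned above to one entered row. The rules and field names are illustrative assumptions; in a production system the range checks would live in the entry form and the species check would be a foreign-key constraint in the DBMS.

```python
def check_row(row, known_species):
    """Return a list of rule violations for one entered row (empty = OK)."""
    errors = []
    if not (0 <= row["ph"] <= 14):            # physical-range rule
        errors.append("pH out of range [0, 14]")
    if row["count"] <= 0:                     # counts must be positive
        errors.append("count must be positive")
    if row["species"] not in known_species:   # referential integrity
        errors.append("unknown species: " + row["species"])
    return errors

species_table = {"Pinus strobus", "Quercus rubra"}
bad = {"ph": 15.2, "count": 3, "species": "Pinus strobus"}
print(check_row(bad, species_table))  # flags the out-of-range pH
```

Rejecting (or at least flagging) such rows at entry time is far cheaper than discovering them during analysis.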
Constantly count things and check that counts exactly agree. E.g., if a study is supposed to measure attributes of 10 species, make sure (as soon as data entry is complete) that 10 species really are reported. Although checking counts is simple and uninformative, it’s great at detecting duplicated and omitted data.
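A count check is trivial to automate. The sketch below compares the number of data points entered per study against the number the study is supposed to contribute; the study names and expected totals are hypothetical.

```python
from collections import Counter

def count_mismatches(expected, rows):
    """Return {study: (expected, actual)} wherever the counts disagree."""
    actual = Counter(r["study"] for r in rows)
    return {s: (expected[s], actual.get(s, 0))
            for s in expected if actual.get(s, 0) != expected[s]}

expected = {"Smith 2010": 10, "Jones 2012": 4}
rows = [{"study": "Smith 2010"}] * 10 + [{"study": "Jones 2012"}] * 3
print(count_mismatches(expected, rows))  # {'Jones 2012': (4, 3)}
```

A nonempty result means a row was duplicated or omitted somewhere, and points directly at the study to re-examine.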
If the data are valuable and important, consider independently double-entering the entire dataset. This means that each item will be entered at separate times by two different non-interacting people. This is a great way to catch typos, missing data, and so on. The cross-checking can be completely automated. This is faster, better at catching errors, and more efficient than 100% manual double checking. (The data entry “people” can include devices such as scanners with OCR.)
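The cross-check itself is a small, fully automatic comparison. This sketch assumes each record carries a shared key (here a hypothetical `record_id`) and reports every disagreement between the two independent entries:

```python
def cross_check(entry_a, entry_b, key="record_id"):
    """Report disagreements between two independent entries of the same data."""
    a = {r[key]: r for r in entry_a}
    b = {r[key]: r for r in entry_b}
    problems = []
    for rid in sorted(set(a) | set(b)):
        if rid not in a or rid not in b:
            problems.append((rid, "present in only one entry"))
        elif a[rid] != b[rid]:
            problems.append((rid, "values disagree"))
    return problems

first  = [{"record_id": 1, "value": 2.5}, {"record_id": 2, "value": 7.1}]
second = [{"record_id": 1, "value": 2.5}, {"record_id": 2, "value": 7.7}]
print(cross_check(first, second))  # [(2, 'values disagree')]
```

Only the flagged records need human attention, which is why independent double entry plus an automated diff beats having a second person eyeball every data point.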
Use a DBMS to store and manage the data. Spreadsheets are great for supporting data entry, but get your data out of the spreadsheets or text files and into a real database as soon as possible. This prevents all kinds of insidious errors while adding lots of support for automatic data integrity checks. If you must, use your statistical software for data storage and management, but seriously consider using a dedicated DBMS: it will do a better job.
After all data are entered and automatically checked, draw pictures: make sorted tables, histograms, scatterplots, etc., and look at them all. These are easily automated with any full-fledged statistical package.
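Visual inspection can also be backed by an automated screen that pre-sorts candidates for a human to look at. The sketch below flags values far outside the interquartile range; the multiplier and the example data are assumptions, and flagged values are candidates for inspection, not automatic rejection.

```python
def flag_outliers(values, k=3.0):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for human review.
    k is an assumed threshold; tune it per variable."""
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]  # crude quartiles, fine for a screen
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

heights_m = [1.2, 1.5, 1.4, 1.6, 1.3, 1.5, 14.0]  # 14.0: likely decimal slip
print(flag_outliers(heights_m))  # [14.0]
```

Run against every variable after each batch of entry, a screen like this surfaces misplaced decimal points and unit-conversion slips before anyone analyzes the data.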
Do not ask people to do repetitive tasks that the computer can do. The computer is much faster and more reliable at these. Get into the habit of writing (and documenting) little scripts and small programs to do any task that cannot be completed immediately. These will become part of your audit trail and they will enable work to be redone easily. Use whatever platform you’re comfortable with and that is suitable to the task. (Over the years, depending on what was available, I have used a wide range of such platforms and all have been effective in their way, ranging from C and Fortran programs through AWK and SED scripts, VBA scripts for Excel and Word, and custom programs written for relational database systems, GIS, and statistical analysis platforms like R and Stata.)
If you follow most of these guidelines, approximately 50%-80% of the work in getting data into the database will be database design and writing the supporting scripts. It is not unusual to be 90% of the way through such a project's timeline with less than 50% of the data entered, yet still finish on time: once everything is set up and has been tested, the data entry and checking can be amazingly efficient.