Strategy for editing comma separated value (CSV) files

When I work on data analysis projects, I often store data in comma- or tab-delimited (CSV, TSV) files. Data like this often belongs in a dedicated database management system, but for many of my applications that would be overdoing things.

I can edit CSV and TSV files in Excel (or presumably any other spreadsheet program).
This has benefits:

  • spreadsheets make it easy to enter data

There are also several problems:

  • Working with CSV and TSV files triggers a wide range of warning messages about various features being lost, about only the active sheet being saved, and so forth. This is annoying if you just want to open the file and make a small change.
  • It performs many “supposedly intelligent” conversions.
    For example, if you enter 12/3, it assumes that you want to enter a date.
    UPDATE: I should have mentioned that the date example is just one of many examples;
    most problems seem to be related to inappropriate conversion.
    In particular, text fields that look like numbers or dates cause problems.
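
One way to sidestep such conversions entirely (at least when the data ends up in R) is to read every column as text. A minimal sketch, using a throwaway temporary file with a text field that looks like a fraction:

```r
# Sketch: colClasses="character" makes read.csv keep every field
# verbatim, so a value like "12/3" is never reinterpreted as a date.
tf <- tempfile(fileext = ".csv")
writeLines(c("id,ratio", "1,12/3"), tf)
d <- read.csv(tf, colClasses = "character")
d$ratio  # the literal string "12/3", untouched
```

Individual columns can then be converted explicitly (as.numeric, as.Date) once you know they are safe.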

Alternatively, I could work directly with the text file
in a standard text editor. This ensures that what I enter is what is recorded.
However, it is a very awkward way to enter data
(columns don’t line up;
it’s difficult to enter data into multiple cells at once; etc.).

Question

  • What is a good strategy for working with CSV or TSV data files?
    i.e., what strategy makes it easy to enter and manipulate the data while also
    ensuring that what you enter is actually interpreted correctly?

Answer

  1. If you are comfortable with R, you can create your basic data.frame and then use the fix() function on it to input data. Along the same lines as #5, once you set up the data.frame you can use a series of readLines(n=1) (or whatever) calls to get your data in, validate it, and then provide the opportunity to add the next row. Then leave the fixing to fix(). See an implemented example below using scan().

  2. Another option in Excel would be messy, but you could type in 12/9, then have another column evaluate =IFERROR(MONTH(DataEntryCell)/DAY(DataEntryCell),DataEntryCell). But then you’ll have to maintain both the Excel sheet AND the CSV file, and all of the complaining as you write the CSV will persist.

  3. Alternatively, so long as your fields are relatively short and of consistent length, a regular text editor should serve you well with TSV. You can always load the file in Excel when you are done and check that the number of columns in each row is what you expect it to be.
  4. Emacs is available on a number of platforms and probably has something just for this, e.g. http://www.emacswiki.org/emacs/CsvMode.
  5. If you are a hearty soul, programming up something quick in a programming language to do the data entry is trivial; the data editing will be a lot harder.
  6. A quick Google search shows software built for just this purpose, but none of the free options seemed to be any good.
  7. It sounds insane, but someone on Super User suggested editing tables in Access and then exporting them as CSV… that is just crazy enough to work.
  8. It doesn’t stop Excel from complaining when you save as .csv, but if you type a single apostrophe before your data entry, Excel leaves the field alone in terms of auto-formatting. Nicely, this (in Office 2007 at least) doesn’t leave apostrophes in the CSV file.
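
The column-count check from item 3 can also be done without opening Excel. A sketch in R, assuming a comma-separated file: count.fields() reports the number of fields on each line, so malformed rows stand out.

```r
# Sketch: rows whose field count differs from the header's are
# likely malformed (a stray or missing delimiter somewhere).
tf <- tempfile(fileext = ".csv")
writeLines(c("x,y,z", "1,2,3", "4,5"), tf)  # last row is one field short
n <- count.fields(tf, sep = ",")
which(n != n[1])  # indices of suspect lines; here, line 3
```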

Update:
I’ve been poking around a lot on this problem because it is an issue I also have. So far the best/easiest solution for data entry I’ve seen is KillinkCSV. It isn’t free software; it is shareware with a 30-day trial and a reasonable price (~$27). I’m not sure how much I trust it for editing existing CSVs though – I handed it an insanely large (and presumably well-formatted) CSV and it failed to read all of the rows. However, it seemed to work well for one that was reasonably large (20 MB), and the problem with the larger file may be user error on my part.

R Example:

#This function takes a what argument like in scan, 
#a list with the types to be used, see usage example 
#at the end of this code block
#dataEntry will keep reading in values until 
#the values it reads in matches what is in 
#"terminateon".
#limitations: Many
dataEntry <- function(what,terminateon)
{
  CONTINUE <- TRUE #Make sure we start the loop
  data <- NULL #Create empty data so that the data.frame can define itself
  ti <- NULL
  while(CONTINUE)
  {
    ti <- NULL    
    ti <- tryCatch(
      {as.data.frame(scan(what=what, nlines=1, multi.line=FALSE, comment.char="", quiet=TRUE))},
      error=function(e) {print("Error in data entry! Line not stored.")
                         return(NULL)},
      warning=function(w) {print("Error in data entry! Line not stored.")
                           return(NULL)}
    ) #Try getting the data according to the parameters in 'what' one row at a time.
    if (!is.null(ti))
    {
      if ((ncol(ti)==length(what)) & (nrow(ti)==1)) {
        data <- rbind(data,ti) #If there wasn't an error, add ti to the previous value  
      } else {
        print("Too many or not enough values on previous entry.")
        print("Tail of current data:")
        print(tail(data))
      }
    }
    if (!is.null(ti) && all(ti == terminateon)) 
    {
      CONTINUE <- FALSE
      data <- data[-c(nrow(data)),]
    } #if we've received the final value we won't continue and the last row is invalid so we remove it
  }
  return(data)
}

dataEntry(list(x=integer(), y=numeric(), z=character()),terminateon=c(999,999,"Z"))
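
To round-trip the entered data, the result can then be written back out; write.csv quotes character columns by default, so text fields survive a later spreadsheet import. A sketch, with a stand-in data frame in place of dataEntry()'s return value and a temporary file as the destination:

```r
# Sketch: write.csv's default quote=TRUE wraps character columns in
# quotes, so a value like "12/3" is stored as text, not reinterpreted.
d <- data.frame(x = 1L, y = 2.5, z = "12/3", stringsAsFactors = FALSE)
out <- tempfile(fileext = ".csv")
write.csv(d, out, row.names = FALSE)
readLines(out)[2]  # 1,2.5,"12/3" -- the text field is quoted
```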

Attribution
Source : Link , Question Author : Jeromy Anglim , Answer Author : russellpierce
