On a recent project, I had to implement a CSV parser that would gracefully handle malformed files. I’m talking about files with unescaped quotes, wacky UTF-8 chars, and various other abominations of nature.
I originally assumed FasterCSV would handle this automagically, but it turns out that the library’s most commonly used methods are pretty strict when it comes to handling CSV files.
For example, parsing a malformed file one line at a time will raise an exception before any rows are yielded to the block:
FasterCSV.foreach("malformed.csv") do |row|
  # use row here...
end
Not cool! I managed to get around this by manually looping over each row and rescuing FasterCSV::MalformedCSVError if it gets raised:
FasterCSV.open("malformed.csv", "rb") do |output|
  loop do
    begin
      break unless row = output.shift
      # use row here...
    rescue FasterCSV::MalformedCSVError => e
      # handle malformed row here...
    end
  end
end
Anyone have a better way to do this?
Matthew, I just wanted to thank you for your code bit. I ran into the same problem with FasterCSV today and this really helped.
Thanks for the idea. The foreach syntax seems so convenient until you actually need to deal with errors. Wish there was a simpler way!
Thanks for the solution. It's a cool way to get around it, but imo FasterCSV is just not cool.
But how do you actually capture the offending row so you can at least log it? You'd think something like output.gets would do it, but I can't figure it out. Not a big fan of FasterCSV at this point.
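One way to get at the raw text (a sketch, not anything FasterCSV documents for this): read the lines yourself and hand each one to parse_line, so the original string is still in hand when the parser chokes. This assumes Ruby 1.9+, where FasterCSV was merged into the standard library as CSV; the sample data is made up.

```ruby
require "csv"

# FasterCSV became Ruby 1.9's stdlib CSV, so CSV stands in for it here;
# the exception class carries over as CSV::MalformedCSVError.
FasterCSV = CSV unless defined?(FasterCSV)

# Hypothetical input: the second line has an unescaped quote.
lines = [
  "a,b,c\n",
  "\"bad \"quote\",x,y\n",
  "d,e,f\n"
]

good_rows = []
bad_lines = []

# Reading raw lines ourselves (here from an Array; File#each_line works
# the same way) means the offending text is available when parsing fails.
# Caveat: this one-line-at-a-time approach breaks on quoted fields that
# contain embedded newlines.
lines.each do |line|
  begin
    good_rows << FasterCSV.parse_line(line)
  rescue FasterCSV::MalformedCSVError
    bad_lines << line # log or inspect the raw malformed row here
  end
end
```

The trade-off versus the shift-based loop above is that you give up FasterCSV's handling of multi-line quoted fields in exchange for access to the raw text of each bad row.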
That really does look nasty. I’m wondering if it would be better to use #parse inside of a begin/rescue block and then just work with each row as an Array outside of FasterCSV.
Eric, what would that look like?
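For what it's worth, a minimal sketch of what Eric describes might look like this (again assuming Ruby 1.9+'s CSV as a stand-in for FasterCSV; the data string is hypothetical):

```ruby
require "csv"

# FasterCSV became Ruby 1.9's stdlib CSV, so CSV stands in for it here.
FasterCSV = CSV unless defined?(FasterCSV)

data = "a,b,c\nd,e,f\n" # hypothetical input

begin
  # Parse the whole document up front; rows come back as plain Arrays.
  rows = FasterCSV.parse(data)
rescue FasterCSV::MalformedCSVError => e
  # The catch: one bad row aborts the entire parse, so every row is
  # lost, not just the malformed one.
  rows = []
end

# From here on we work with ordinary Arrays, outside of FasterCSV.
rows.each do |row|
  # use row here...
end
```

Note that this only moves the begin/rescue outside the loop; it doesn't recover the good rows from a file that contains any malformed ones.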