Friday, May 23, 2008

Large Scale Data Munging

I don't believe I've ever posted a link to one of my favorite blogs, Datawocky. Great article there today on super-large-scale data munging:

Yet the Map Reduce paradigm has its limitations. The biggest problem is that it involves writing code for each analysis. This limits the number of companies and people that can use this paradigm. The second problem is that joins of different data sets is hard. The third problem is that Map Reduce works on files and produces files; after a while the number of files multiplies and it becomes difficult to keep track of things. What's lacking is a metadata layer, such as the catalog in database systems. Don't get me wrong; I love Map Reduce, and there are applications that don't need these things, but increasingly there are applications that do.

-Why the World Needs a New Database System
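
To make the point about joins concrete, here is a rough sketch of what a "reduce-side" join looks like when you have to write it by hand. It is plain Python that imitates the map, shuffle, and reduce steps in memory, not code for any real Map Reduce framework. The `users` and `orders` data sets and every function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical inputs: (user_id, name) and (order_id, user_id, amount).
users = [(1, "ada"), (2, "grace")]
orders = [(100, 1, 9.5), (101, 1, 3.0), (102, 3, 7.25)]

def map_users(record):
    """Emit the join key plus a tag saying which data set the value came from."""
    user_id, name = record
    yield user_id, ("user", name)

def map_orders(record):
    """Same idea for orders: key on user_id, tag the value as an order."""
    _order_id, user_id, amount = record
    yield user_id, ("order", amount)

def shuffle(pairs):
    """Group values by key, which a real framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(key, values):
    """Pair every user name with every order amount that shares the key."""
    names = [v for tag, v in values if tag == "user"]
    amounts = [v for tag, v in values if tag == "order"]
    for name in names:
        for amount in amounts:
            yield key, name, amount

def run_join(users, orders):
    """Run map, shuffle, and reduce in sequence and collect the joined rows."""
    pairs = []
    for record in users:
        pairs.extend(map_users(record))
    for record in orders:
        pairs.extend(map_orders(record))
    results = []
    for key, values in sorted(shuffle(pairs).items()):
        results.extend(reduce_join(key, values))
    return results

joined = run_join(users, orders)
print(joined)
```

This is only an inner join: the user with no orders and the order with no matching user both drop out. In a database the same thing is a single JOIN clause, which is roughly the gap the quote is pointing at.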
