Wednesday, September 1, 2010

Hadoop record boundary

I am working on building an infrastructure for analyzing a vast number of emails.
I packed the emails into archive chunks of 2 GB each.

When I concatenated the emails into a chunk, there had to be a sync mark to distinguish the boundary between one email and the next.
So I put a sync mark between each pair of emails.

However, when I put these files on the DFS and ran a MapReduce job, I was worried about records that straddle the boundary between two input splits, and I was curious how Hadoop deals with them.

I found the answer in the Hadoop FAQ:
http://wiki.apache.org/hadoop/FAQ#A23
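
As I understand the FAQ answer, the record reader for a split seeks to the start of the split, skips ahead to the first sync mark, and keeps reading until it passes the first sync mark after the end of the split. The first split of a file starts reading immediately. Here is a small Python sketch of that rule on an in-memory chunk. It is only a simulation and not Hadoop code: the marker value and the function names are made up for illustration, and it assumes the marker never occurs inside an email.

```python
# Hypothetical sync marker; assumed never to appear inside an email body.
SYNC = b"\n--EMAIL-SYNC--\n"


def build_chunk(emails):
    """Concatenate emails with a sync mark between each pair."""
    return SYNC.join(emails)


def read_split(chunk, start, length):
    """Return the records owned by the split [start, start + length).

    A record belongs to the split in which the sync mark preceding it
    begins. The very first record has no preceding mark and belongs to
    the split that starts at offset 0.
    """
    end = start + length
    if start == 0:
        pos = 0  # the first split starts reading immediately
    else:
        mark = chunk.find(SYNC, start)  # first sync mark at or after start
        if mark == -1 or mark >= end:
            return []  # no record begins inside this split
        pos = mark + len(SYNC)

    records = []
    while True:
        nxt = chunk.find(SYNC, pos)
        if nxt == -1:
            records.append(chunk[pos:])  # last record in the chunk
            break
        # Reading may run past `end` to finish the current record.
        records.append(chunk[pos:nxt])
        if nxt >= end:
            break  # the next record belongs to a later split
        pos = nxt + len(SYNC)
    return records


def read_all_splits(chunk, split_size):
    """Read every fixed-size split in order and collect the records."""
    out = []
    for start in range(0, len(chunk), split_size):
        length = min(split_size, len(chunk) - start)
        out.extend(read_split(chunk, start, length))
    return out
```

Whatever split size is chosen, each email should come back exactly once, even when a split boundary falls in the middle of an email or in the middle of a sync mark.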
