I am working on building an infrastructure for analyzing a vast number of emails.
I packed the emails into archive chunks of 2 GB each.
The problem was that when I concatenated the emails into a chunk, there had to be a sync mark to distinguish the boundary between one email and the next.
So I put sync marks between the emails.
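
A minimal sketch of this concatenation step, in Python rather than my actual code. The marker value `SYNC` and the helpers `build_chunk` and `split_chunk` are hypothetical names chosen for illustration. The sketch assumes the marker never occurs inside an email body and rejects any email that contains it.

```python
# Hypothetical sync marker; any byte sequence works as long as it
# never appears inside an email.
SYNC = b"\x1eEMAIL-SYNC\x1f"


def build_chunk(emails):
    """Concatenate raw emails (bytes) with a sync mark between each pair."""
    for email in emails:
        if SYNC in email:
            raise ValueError("email contains the sync marker")
    return SYNC.join(emails)


def split_chunk(chunk):
    """Recover the individual emails from a chunk built by build_chunk."""
    return chunk.split(SYNC)


if __name__ == "__main__":
    chunk = build_chunk([b"From: a@example.com\n\nhello", b"From: b@example.com\n\nbye"])
    print(len(split_chunk(chunk)))
```

A real chunk would be written to a file incrementally instead of being built in memory, but the boundary logic stays the same.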
However, when I put these files on the DFS and run a MapReduce job, an email record can straddle the boundary between two input splits. I was worried about how such records are handled and curious about how Hadoop deals with them.
I found the answer in the Hadoop FAQ:
http://wiki.apache.org/hadoop/FAQ#A23
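
As I read that FAQ entry, the RecordReader of each split skips ahead to the first sync mark after the start of the split (the first split of a file starts immediately) and keeps reading until it reaches the first sync mark after the end of the split, so every record is processed by exactly one mapper. The sketch below simulates that rule in plain Python. It is not Hadoop code, and `SYNC`, `read_split`, and `read_all` are hypothetical names made up for illustration. It assumes the marker never occurs inside an email.

```python
# Hypothetical sync marker, assumed never to appear inside an email.
SYNC = b"\x1eEMAIL-SYNC\x1f"


def read_split(data, start, end):
    """Return the records owned by the byte range [start, end).

    A record belongs to the split in which its preceding sync mark
    begins; the very first record belongs to the split starting at 0.
    """
    records = []
    if start == 0:
        pos = 0  # the first split starts reading immediately
    else:
        idx = data.find(SYNC, start)  # first sync mark at or after start
        if idx == -1 or idx >= end:
            return records  # no record begins in this split
        pos = idx + len(SYNC)
    while True:
        nxt = data.find(SYNC, pos)
        if nxt == -1:
            records.append(data[pos:])  # last record in the file
            break
        records.append(data[pos:nxt])  # may run past `end`
        if nxt >= end:
            break  # the next record belongs to a later split
        pos = nxt + len(SYNC)
    return records


def read_all(data, split_size):
    """Read every fixed-size split in order and collect the records."""
    records = []
    for start in range(0, len(data), split_size):
        records.extend(read_split(data, start, start + split_size))
    return records


if __name__ == "__main__":
    emails = [b"first mail", b"second, longer mail body", b"third"]
    chunk = SYNC.join(emails)
    # No record is lost or duplicated, whatever the split size.
    print(read_all(chunk, 7) == emails)
```

The point of the design is that a reader is allowed to read past the end of its own split to finish the record it started, while the reader of the next split discards everything before its first sync mark.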