Problem description
I have a 4-phase MapReduce job, as follows:
Input -> Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> Reduce4 -> Output
I noticed that Hadoop has a ChainMapper class, which can chain several mappers into one big mapper and save the disk I/O cost between map phases. There is also a ChainReducer class, but it is not a real "chain reducer". It can only support jobs of the form:
[Map+ / Reduce Map*]
I know I can set up four MR jobs for my task and use the default (identity) mapper for the last three. But that costs a lot of disk I/O, since each reducer has to write its result to disk so that the following mapper can read it. Is there any other built-in Hadoop feature for chaining my reducers to lower the I/O cost?
I am using Hadoop 1.0.4.
Recommended answer
I don't think you can feed the output of one reducer directly into another reducer. I would go with this:
Input-> Map1 -> Reduce1 ->
Identity mapper -> Reduce2 ->
Identity mapper -> Reduce3 ->
Identity mapper -> Reduce4 -> Output
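To make the data flow in that pipeline concrete, here is a small in-memory sketch in plain Java. It does not use any Hadoop classes; the class name ChainedReduceSketch, the helper methods, and the word-count style sample data are all made up for illustration, and only two of the four reduce stages are shown. Each "job" is an identity map, a shuffle that groups values by key, and a reducer. Note that the first reducer emits a new key, which is what gives the following reduce stage something to regroup.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// In-memory illustration only: no Hadoop classes are involved.
class ChainedReduceSketch {

    // Build one key/value record.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    // Identity mapper: passes every record through unchanged.
    static List<Map.Entry<String, Integer>> identityMap(List<Map.Entry<String, Integer>> records) {
        return new ArrayList<>(records);
    }

    // Shuffle: group values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> records) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return groups;
    }

    // One "job": identity map, shuffle, then one reducer call per key.
    static List<Map.Entry<String, Integer>> runJob(
            List<Map.Entry<String, Integer>> input,
            BiFunction<String, List<Integer>, Map.Entry<String, Integer>> reducer) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> g : shuffle(identityMap(input)).entrySet()) {
            output.add(reducer.apply(g.getKey(), g.getValue()));
        }
        return output;
    }

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Two chained stages over sample data standing in for Map1's output.
    static List<Map.Entry<String, Integer>> runExample() {
        List<Map.Entry<String, Integer>> map1Output = new ArrayList<>();
        map1Output.add(pair("apple", 1));
        map1Output.add(pair("avocado", 1));
        map1Output.add(pair("apple", 1));
        map1Output.add(pair("banana", 1));

        // Reduce1: count per word, then re-key each count by the word's first letter.
        List<Map.Entry<String, Integer>> stage1 =
                runJob(map1Output, (word, counts) -> pair(word.substring(0, 1), sum(counts)));

        // Reduce2: total per first letter.
        return runJob(stage1, (letter, counts) -> pair(letter, sum(counts)));
    }

    public static void main(String[] args) {
        System.out.println(runExample()); // prints [a=3, b=1]
    }
}
```

In a real Hadoop run, the list handed from one runJob call to the next corresponds to the reducer output written to disk between jobs, which is the I/O cost the question is about.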
In the Hadoop 2.x series, you can chain mappers before the reducer with ChainMapper and chain further mappers after the reducer with ChainReducer, all within a single job.