Hadoop Programming Tips (6) --- Handling Large Numbers of Small Data Files with CombineFileInputFormat

Test environment: Hadoop 2.4

Use case: when a job has to process a very large number of small data files, this technique makes the processing more efficient.

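Principle: CombineFileInputFormat can pack several small data files into a single split when the input splits are computed. Each split is handled by its own Mapper, and a Mapper that processes only a small amount of data is inefficient, because task startup overhead dominates the actual work. With the default input formats, every input file yields at least one split, so a job over many small files launches many short-lived Mappers and runs inefficiently.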
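The effect on the number of splits can be illustrated with a small toy model. This is only a sketch: it uses no Hadoop classes, the file sizes and the 128 MB cap are made-up values, and the greedy packing below is a simplification (the real CombineFileInputFormat also takes node and rack locality into account). The names `SplitPackingSketch`, `splitsPerFile` and `packFiles` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of split planning; it does not use any Hadoop classes. */
class SplitPackingSketch {

    /** One split per file: what happens when every file is smaller than a block. */
    static int splitsPerFile(long[] fileSizes) {
        return fileSizes.length;
    }

    /**
     * Greedy packing: keep adding files to the current split until adding the
     * next one would exceed maxSplitBytes, then start a new split.
     */
    static List<List<Long>> packFiles(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] fileSizes = {2_000L, 3_000L};      // two small files, sizes in bytes (made up)
        long maxSplitBytes = 128L * 1024 * 1024;  // hypothetical 128 MB cap per split
        System.out.println("per-file splits: " + splitsPerFile(fileSizes));
        System.out.println("combined splits: " + packFiles(fileSizes, maxSplitBytes).size());
    }
}
```

With two small files, the per-file model produces two splits (two Mappers), while the packed model produces one.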

Example:

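This example builds on the previous post, "Hadoop Programming Tips (5) --- Custom Input Format Class InputFormat". The only difference is that the input this time consists of two files, both of them small.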

Custom input format, CustomCombineFileInputFormat:

package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

/**
 * Input format that packs several small files into one split.
 * @author fansy
 */
public class CustomCombineFileInputFormat extends CombineFileInputFormat<Text, Text> {

	@Override
	public RecordReader<Text, Text> createRecordReader(InputSplit split,
			TaskAttemptContext context) throws IOException {
		// CombineFileRecordReader creates one CustomCombineReader per file in the combined split
		return new CombineFileRecordReader<Text, Text>((CombineFileSplit) split, context, CustomCombineReader.class);
	}
}

Custom record reader, CustomCombineReader:

package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Reads one file of a combined split; only the initialization differs from CustomReader.
 * @author fansy
 */
public class CustomCombineReader extends RecordReader<Text, Text> {

	// position of this reader's file inside the combined split
	private int index;
	// reader from the previous post that does the actual record parsing
	private CustomReader in;

	// CombineFileRecordReader requires a constructor with exactly this signature
	public CustomCombineReader(CombineFileSplit split, TaskAttemptContext cxt, Integer index) {
		this.index = index;
		this.in = new CustomReader();
	}

	@Override
	public void initialize(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException {
		CombineFileSplit cfsplit = (CombineFileSplit) split;
		// Build a FileSplit for the file at this index. Use getLength(index), the length of
		// this file, not getLength(), which is the total length of the whole combined split.
		FileSplit fileSplit = new FileSplit(cfsplit.getPath(index), cfsplit.getOffset(index),
				cfsplit.getLength(index), cfsplit.getLocations());
		in.initialize(fileSplit, context);
	}

	// The remaining methods delegate to the wrapped CustomReader.

	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		return in.nextKeyValue();
	}

	@Override
	public Text getCurrentKey() throws IOException, InterruptedException {
		return in.getCurrentKey();
	}

	@Override
	public Text getCurrentValue() throws IOException, InterruptedException {
		return in.getCurrentValue();
	}

	@Override
	public float getProgress() throws IOException, InterruptedException {
		return in.getProgress();
	}

	@Override
	public void close() throws IOException {
		in.close();
	}
}

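As you can see, this class reuses the CustomReader class from the previous post. Only the initialization method changes: it builds a FileSplit for the file at the given index of the combined split and hands it to CustomReader, which is what allows several small files to share one split. For CustomReader itself, see the previous post, "Hadoop Programming Tips (5) --- Custom Input Format Class InputFormat".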
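The delegation described above can be sketched without Hadoop. In the toy model below, a combined split holds the contents of several files, a per-file reader only sees the file at its index, and a wrapper drains the readers one after another, which is roughly the role CombineFileRecordReader plays for CustomCombineReader. All class and method names here (`CombinedReadSketch`, `ToyCombinedSplit`, `ToyFileReader`, `readAll`) are hypothetical stand-ins, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy model of reading a combined split file by file; no Hadoop classes involved. */
class CombinedReadSketch {

    /** Stand-in for a combined split: one list of lines per small file. */
    static final class ToyCombinedSplit {
        final List<List<String>> files;

        ToyCombinedSplit(List<List<String>> files) {
            this.files = files;
        }

        int numFiles() {
            return files.size();
        }
    }

    /** Stand-in for the per-file reader: it only sees the file at its index. */
    static final class ToyFileReader {
        private final Iterator<String> lines;
        private String current;

        ToyFileReader(ToyCombinedSplit split, int index) {
            this.lines = split.files.get(index).iterator();
        }

        boolean next() {
            if (!lines.hasNext()) {
                return false;
            }
            current = lines.next();
            return true;
        }

        String current() {
            return current;
        }
    }

    /** Stand-in for the chaining wrapper: drains reader 0, then reader 1, and so on. */
    static List<String> readAll(ToyCombinedSplit split) {
        List<String> records = new ArrayList<>();
        for (int index = 0; index < split.numFiles(); index++) {
            ToyFileReader reader = new ToyFileReader(split, index);
            while (reader.next()) {
                records.add(reader.current());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        ToyCombinedSplit split = new ToyCombinedSplit(Arrays.asList(
                Arrays.asList("a1", "a2"),
                Arrays.asList("b1")));
        // A single consumer (one Mapper in the real job) sees the records of both files.
        System.out.println(readAll(split));
    }
}
```

The point of the per-index constructor is that each reader stays as simple as a single-file reader, while the wrapper decides when to move on to the next file.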

In the driver class, only one line needs to change (again, see the previous post):

job.setInputFormatClass(CustomCombineFileInputFormat.class);

Two experiments were run: the first reads the input with CombineFileInputFormat, the second with TextInputFormat.

Results:

First, the difference is visible in the terminal output:

[Screenshot: terminal output of the two job runs]

For the same two input files, job 096 has only one split, while job 097 has two.

The change in the number of Mappers is also visible on the job monitoring page.

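Summary: CombineFileInputFormat is very useful in practice and can noticeably improve efficiency when a job processes a large number of small files. In typical big-data applications, however, the input files are usually large already, so this technique applies mainly to special cases.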

Share, grow, and enjoy.

If you repost this article, please credit the blog: http://blog.csdn.net/fansy1990
