Chinese Word Segmentation and Word-Frequency Counting on Hadoop: A Hands-On Exercise

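First, some recommended reading: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's write-up on ranking the popularity of character names in wuxia novels is a lot of fun, so I am going to imitate it and try the same thing here.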

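This exercise differs from the original in the following ways:

　　0) The original uses Hadoop Streaming; here the job is written directly against the MapReduce framework in Java.

　　1) A different Chinese word segmenter: here I use IKAnalyzer, whose home page is http://code.google.com/p/ik-analyzer/.

　　2) The text used here is "射雕英雄传" (The Legend of the Condor Heroes). Ha, something has to change.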

0) Start from the WordCount example source and modify its Mapper so that map() uses IKAnalyzer's segmenter.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

  // Mapper: segments each input line with IKAnalyzer and emits (word, 1).
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Text.getBytes() returns the backing array, which can be longer than
      // getLength(), and a charset-less InputStreamReader uses the platform
      // default. Decoding through toString() avoids both problems.
      Reader read = new StringReader(value.toString());
      // Second argument true selects IKAnalyzer's "smart" segmentation mode.
      IKSegmenter iks = new IKSegmenter(read, true);
      Lexeme t;
      while ((t = iks.next()) != null) {
        word.set(t.getLexemeText());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    // On Hadoop 2.x and later this constructor is deprecated in favor of
    // Job.getInstance(conf, "word count").
    Job job = new Job(conf, "word count");
    job.setJarByClass(ChineseWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
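One detail worth watching in a mapper like this is how the bytes of a Text value are turned into characters. Hadoop's Text.getBytes() returns the backing array, of which only the first getLength() bytes are valid, and an InputStreamReader created without a charset falls back to the platform default. The sketch below is not Hadoop code: it uses a plain byte array as a stand-in for a reused Text buffer, and decodeValid is a hypothetical helper named only for this illustration. It shows why decoding the whole array can leak characters from a previous, longer line.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TextDecodeSketch {

    // Hypothetical helper: decodes only the first `length` bytes as UTF-8.
    static String decodeValid(byte[] buffer, int length) {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a reused buffer: it first held a long line, then a short
        // line was copied over the front without shrinking the array.
        byte[] longLine = "郭靖与黄蓉".getBytes(StandardCharsets.UTF_8);
        byte[] shortLine = "黄蓉".getBytes(StandardCharsets.UTF_8);
        byte[] buffer = Arrays.copyOf(longLine, longLine.length);
        System.arraycopy(shortLine, 0, buffer, 0, shortLine.length);

        // Decoding the whole array also decodes the stale trailing bytes.
        String wholeArray = new String(buffer, StandardCharsets.UTF_8);
        // Decoding only the valid prefix yields the current line.
        String validPrefix = decodeValid(buffer, shortLine.length);

        System.out.println("whole array : " + wholeArray);
        System.out.println("valid prefix: " + validPrefix);
    }
}
```

In a real mapper the simplest equivalent is to read from value.toString(), which decodes exactly getLength() bytes as UTF-8.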

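1) So, that is done, and it runs fine in the local plugin-based simulation environment. Package it (bundling the segmenter jar) and send it to the cluster.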

hadoop fs -put chinese_in.txt chinese_in.txt
hadoop jar WordCount.jar chinese_in.txt out0
...mapping reducing...
hadoop fs -ls ./out0
hadoop fs -get out0/part-r-00000 words.txt

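2) Post-processing the data:

2.1) Sorting

head words.txt
tail words.txt
# First try: sort on the count column (compared as text, ascending)
sort -k2 words.txt >0.txt
head 0.txt
tail 0.txt
# Second try: reverse order, still compared as text
sort -k2r words.txt>0.txt
head 0.txt
tail 0.txt
# Final version: reverse numeric order, highest count first
sort -k2rn words.txt>0.txt
head -n 50 0.txt

2.2) Extracting the targets

# Keep only words at least two characters long
awk '{if(length($1)>=2) print $0}' 0.txt >1.txt

2.3) Presenting the result

# Take the top 50 and prefix each line with its rank
head -n 50 1.txt | sed = | sed 'N;s/\n//'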

1郭靖   6427
2黄蓉   4621
3欧阳   1660
4甚么   1430
5说道   1287
6洪七公 1225
7笑道   1214
8自己   1193
9一个   1160
10师父  1080
11黄药师        1059
12心中  1046
13两人  1016
14武功  950
15咱们  925
16一声  912
17只见  827
18他们  782
19心想  780
20周伯通        771
21功夫  758
22不知  755
23欧阳克        752
24听得  741
25丘处机        732
26当下  668
27爹爹  664
28只是  657
29知道  654
30这时  639
31之中  621
32梅超风        586
33身子  552
34都是  540
35不是  534
36如此  531
37柯镇恶        528
38到了  523
39不敢  522
40裘千仞        521
41杨康  520
42你们  509
43这一  495
44却是  478
45众人  476
46二人  475
47铁木真        469
48怎么  464
49左手  452
50地下  448

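Quite a few of the words that are not character names are interesting too, for example: 5 说道 ("said"), 7 笑道 ("said with a laugh"), 12 心中 ("in one's heart"), 17 只见 ("one saw"), 22 不知 ("not knowing"), 30 这时 ("at this moment"), 49 左手 ("left hand").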
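The sort, filter, and numbering steps of section 2 can also be sketched in Java, which may be handy where the awk length() result depends on the awk implementation and locale (characters versus bytes). This is only an illustrative sketch under assumptions: the class TopWordsSketch and the method topWords are hypothetical names, the input is assumed to be tab-separated "word, count" lines as written by the reducer, and the single-character sample row is made up to exercise the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TopWordsSketch {

    // One "word<TAB>count" record.
    static final class Entry {
        final String word;
        final int count;
        Entry(String word, int count) { this.word = word; this.count = count; }
    }

    // Keeps words with at least minLength characters, sorts them by count in
    // descending order, and returns at most `limit` rows prefixed with a rank.
    static List<String> topWords(List<String> lines, int minLength, int limit) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) continue; // skip malformed lines
            String word = parts[0];
            // Count code points, not bytes or UTF-16 units.
            if (word.codePointCount(0, word.length()) < minLength) continue;
            entries.add(new Entry(word, Integer.parseInt(parts[1].trim())));
        }
        entries.sort(Comparator.comparingInt((Entry e) -> e.count).reversed());

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            Entry e = entries.get(i);
            rows.add((i + 1) + "\t" + e.word + "\t" + e.count);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample rows; the single-character entry and its count are made up.
        List<String> sample = Arrays.asList(
            "的\t9000", "黄蓉\t4621", "郭靖\t6427", "洪七公\t1225");
        for (String row : topWords(sample, 2, 50)) {
            System.out.println(row);
        }
    }
}
```

Unlike the sed pipeline, this sketch puts a tab between the rank and the word, so the rank stays separable from the word in later processing.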
