
Implementing the Decision Tree Algorithm with MapReduce

Author: xi曦 | Source: Internet | 2022-03-02 09:04

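This article walks through an implementation of the decision tree algorithm on MapReduce. The code is shared below for reference; it is written against Hadoop's older org.apache.hadoop.mapred API (MapReduceBase, JobConf, OutputCollector).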

First, implement the Mapper for the C4.5 decision tree algorithm. The code is as follows:

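import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Split and ObjectSerializable are project classes that are not shown in this article.
public class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text attValue = new Text();
  private int i;
  private String token;
  public static int no_Attr;
  // The split (tree node) this pass works on, deserialized from the job configuration
  public Split split = null;

  public int size_split_1 = 0;

  @Override
  public void configure(JobConf conf) {
    try {
      split = (Split) ObjectSerializable.unSerialize(conf.get("currentsplit"));
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    size_split_1 = Integer.parseInt(conf.get("current_index"));
  }

  @Override
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString(); // the input record as a string
    StringTokenizer itr = new StringTokenizer(line);
    int index = 0;
    String attr_value = null;
    // Every token except the last is an attribute; the last is assumed to be the class label
    no_Attr = itr.countTokens() - 1;
    String attr[] = new String[no_Attr];
    boolean match = true;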
    for (i = 0; i
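
The Mapper listing above breaks off inside map. As a rough illustration of what a mapper of this shape usually does in a C4.5 job (a sketch under stated assumptions, not the author's missing code): keep only the records that satisfy the attribute tests on the path to the current node, then emit one key per remaining attribute, each with a count of 1. The class MapperSketch, the method emit, the parameters usedIndex and usedValue, and the key layout "attrIndex attrValue classLabel" are all hypothetical, and Hadoop types are replaced by plain Java collections so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-alone sketch; not part of the original project.
class MapperSketch {

    /**
     * Returns the keys a mapper would emit (each with count 1) for one input line.
     * usedIndex/usedValue describe the path to the current node: attribute
     * usedIndex.get(k) must equal usedValue.get(k) for the record to belong to it.
     */
    static List<String> emit(String line, List<Integer> usedIndex, List<String> usedValue) {
        List<String> keys = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        if (itr.countTokens() < 2) {
            return keys;                       // need at least one attribute and a label
        }
        int noAttr = itr.countTokens() - 1;    // assumed layout: last token is the class label
        String[] attr = new String[noAttr];
        for (int i = 0; i < noAttr; i++) {
            attr[i] = itr.nextToken();
        }
        String classLabel = itr.nextToken();

        // Drop records that do not belong to the current node.
        for (int k = 0; k < usedIndex.size(); k++) {
            if (!attr[usedIndex.get(k)].equals(usedValue.get(k))) {
                return keys;
            }
        }
        // One key per attribute that has not been split on yet.
        for (int i = 0; i < noAttr; i++) {
            if (!usedIndex.contains(i)) {
                keys.add(i + " " + attr[i] + " " + classLabel);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Integer> idx = Arrays.asList(0);
        List<String> val = Arrays.asList("sunny");
        System.out.println(emit("sunny hot high weak no", idx, val));  // [1 hot no, 2 high no, 3 weak no]
        System.out.println(emit("rain mild high weak yes", idx, val)); // []
    }
}
```

With keys of this form, a summing reducer yields, for every attribute value, how many records of each class fall under the current node.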

Next, implement the Reducer for the C4.5 decision tree algorithm. The code is as follows:

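import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

  static int cnt = 0;
  ArrayList<String> ar = new ArrayList<String>();
  String data = null;
  private static int currentIndex;

  @Override
  public void configure(JobConf conf) {
    currentIndex = Integer.valueOf(conf.get("currentIndex"));
  }

  @Override
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    // sum counts how often a class occurs in the subset obtained by splitting on an attribute
    while (values.hasNext()) {
      sum += values.next().get();
    }
    // Emit the total for this key
    output.collect(key, new IntWritable(sum));

    String data = key + " " + sum;
    ar.add(data);
    // Rewrite the intermediate file with everything collected so far
    writeToFile(ar);
    ar.add("\n");
  }

  public static void writeToFile(ArrayList<String> text) {
    cnt++;
    Path input = new Path("C45/intermediate" + currentIndex + ".txt");
    Configuration conf = new Configuration();
    // try-with-resources closes the writer even if a write fails
    try (BufferedWriter bw = new BufferedWriter(
        new OutputStreamWriter(FileSystem.get(conf).create(input, true)))) {
      for (String str : text) {
        bw.write(str);
      }
      bw.newLine();
    } catch (Exception e) {
      System.out.println("File is not creating in reduce: " + e);
    }
  }
}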
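
The counts written by the reducer are the raw material for choosing a split. The sketch below shows how entropy and the C4.5 gain ratio can be computed from such counts for a single attribute. It is a minimal illustration, not the article's GainRatio class (which is not shown): the class GainRatioSketch, its methods, and the nested-map layout (attribute value, then class label, then count) are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the article's own GainRatio class is not shown.
class GainRatioSketch {

    /** Adds n records with the given attribute value and class label. */
    static void add(Map<String, Map<String, Integer>> counts, String value, String label, int n) {
        counts.computeIfAbsent(value, k -> new HashMap<>()).merge(label, n, Integer::sum);
    }

    /** Shannon entropy (base 2) of a collection of counts. */
    static double entropy(Iterable<Integer> counts) {
        double total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * counts maps attribute value -> (class label -> count) for one attribute.
     * Returns gain / splitInfo, or 0 when the split information is 0.
     */
    static double gainRatio(Map<String, Map<String, Integer>> counts) {
        Map<String, Integer> classTotals = new HashMap<>();
        Map<String, Integer> valueTotals = new LinkedHashMap<>();
        int n = 0;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int valueTotal = 0;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                classTotals.merge(c.getKey(), c.getValue(), Integer::sum);
                valueTotal += c.getValue();
            }
            valueTotals.put(e.getKey(), valueTotal);
            n += valueTotal;
        }
        if (n == 0) return 0.0;
        // Weighted entropy of the subsets produced by the split.
        double conditional = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            double weight = valueTotals.get(e.getKey()) / (double) n;
            conditional += weight * entropy(e.getValue().values());
        }
        double gain = entropy(classTotals.values()) - conditional;
        double splitInfo = entropy(valueTotals.values());
        return splitInfo == 0.0 ? 0.0 : gain / splitInfo;
    }

    public static void main(String[] args) {
        // Counts for the "outlook" attribute of the classic 14-row play-tennis data set.
        Map<String, Map<String, Integer>> outlook = new LinkedHashMap<>();
        add(outlook, "sunny", "yes", 2);
        add(outlook, "sunny", "no", 3);
        add(outlook, "overcast", "yes", 4);
        add(outlook, "rain", "yes", 3);
        add(outlook, "rain", "no", 2);
        System.out.println("gain ratio = " + gainRatio(outlook));
    }
}
```

For the play-tennis counts in main, the gain is about 0.247 and the split information about 1.577, giving a gain ratio of roughly 0.156. Dividing by the split information is what distinguishes C4.5 from plain information gain: it penalizes attributes that fragment the data into many small subsets.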

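Finally, write the Main class that launches the MapReduce job. Building the tree takes several passes, with one job run per node being processed. The code is as follows: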

package com.hackecho.hadoop;
 
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.PropertyConfigurator;
import org.dmg.pmml.MiningFunctionType;
import org.dmg.pmml.Node;
import org.dmg.pmml.PMML;
import org.dmg.pmml.TreeModel;
 
// MapReduce's role here is to partition the data into subsets according to the attribute values
public class Main extends Configured implements Tool {

  // The split (tree node) currently being processed
  public static Split currentsplit = new Split();
  // The splits produced so far
  public static List<Split> splitted = new ArrayList<Split>();
  // current_index is the position of the split currently being processed
  public static int current_index = 0;

  public static ArrayList ar = new ArrayList();

  public static List leafSplits = new ArrayList();

  public static final String PROJECT_HOME = System.getProperty("user.dir");

  public static void main(String[] args) throws Exception {
    PropertyConfigurator.configure(PROJECT_HOME + "/conf/log/log4j.properties");
    splitted.add(currentsplit);
    // splitted now holds currentsplit, so its size is 1

    Path c45 = new Path("C45");
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(c45)) {
      fs.delete(c45, true);
    }
    fs.mkdirs(c45);
    int res = 0;
    int split_index = 0;
    // Gain ratio of the attribute under consideration
    double gainratio = 0;
    // Best gain ratio found so far
    double best_gainratio = 0;
    // Entropy
    double entropy = 0;
    // Class label
    String classLabel = null;
    // Number of attributes; the value read from MapClass is immediately overridden with a hard-coded 4
    int total_attributes = MapClass.no_Attr;
    total_attributes = 4;
    // Number of splits
    int split_size = splitted.size();
    // Gain ratio calculator
    GainRatio gainObj;
    // New node produced by a split
    Split newnode;

    while (split_size > current_index) {
      currentsplit = splitted.get(current_index);
      gainObj = new GainRatio();
      res = ToolRunner.run(new Configuration(), new Main(), args);
      System.out.println("Current NODE INDEX . ::" + current_index);
      int j = 0;
      int temp_size;
      gainObj.getcount();
      // Compute the entropy of the current node
      entropy = gainObj.currNodeEntophy();
      // Get the majority class label at the current node
      classLabel = gainObj.majorityLabel();
      currentsplit.classLabel = classLabel;

      if (entropy != 0.0 && currentsplit.attr_index.size() != total_attributes) {
        System.out.println("");
        System.out.println("Entropy NOTT zero  SPLIT INDEX::  " + entropy);
        best_gainratio = 0;
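        // Compute the gain ratio of each attribute.
        // NOTE: the loop header and its guard were garbled in the source. The form below is
        // reconstructed from the surviving braces; the skip test on attr_index and the call
        // gainObj.gainratio(j, entropy) are assumptions, not confirmed by the source.
        for (j = 0; j < total_attributes; j++) {
          if (!currentsplit.attr_index.contains(j)) {
            gainratio = gainObj.gainratio(j, entropy);
            if (gainratio >= best_gainratio) {
              split_index = j;
              best_gainratio = gainratio;
            }
          }
        }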
 
        // split_index is the index of the attribute chosen for the split
        // attr_values_split is the string formed by joining the values that attribute takes
        String attr_values_split = gainObj.getvalues(split_index);
        StringTokenizer attrs = new StringTokenizer(attr_values_split);
        int number_splits = attrs.countTokens(); // number of splits possible with the selected attribute
        String red = "";
        System.out.println(" INDEX :: " + split_index);
        System.out.println(" SPLITTING VALUES " + attr_values_split);

        // For each value of the chosen attribute, partition the data at this node into a child subset
        for (int splitnumber = 1; splitnumber <= number_splits; splitnumber++) {
          temp_size = currentsplit.attr_index.size();
          newnode = new Split();
          for (int y = 0; y
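
The driver listing is cut off at this point, but its overall shape is visible: the list splitted acts as a work queue, each pass handles the node at current_index, and a node that is not pure appends one child per attribute value. The following stand-alone sketch imitates that control flow in memory. It is an illustration under assumptions, not the author's code: WorklistSketch, Node, matches, and grow are hypothetical names, the MapReduce job is replaced by a plain scan over the rows, the split attribute is simply the first unused one rather than the one with the best gain ratio, and a leaf takes the first label seen rather than the majority label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory imitation of the multi-pass driver loop.
class WorklistSketch {

    /** A tree node identified by the attribute tests on the path from the root. */
    static class Node {
        final List<Integer> attrIndex = new ArrayList<>();
        final List<String> attrValue = new ArrayList<>();
        String classLabel;   // stays null for internal nodes in this sketch
    }

    static boolean matches(String[] row, Node node) {
        for (int k = 0; k < node.attrIndex.size(); k++) {
            if (!row[node.attrIndex.get(k)].equals(node.attrValue.get(k))) return false;
        }
        return true;
    }

    /** Grows the node list pass by pass; each row holds the attributes followed by the label. */
    static List<Node> grow(List<String[]> rows, int totalAttributes) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node());                       // root: no tests yet
        int current = 0;
        while (nodes.size() > current) {             // same shape as the loop in Main
            Node node = nodes.get(current);
            // In the article this step is one MapReduce job; here it is a plain scan.
            Set<String> labels = new LinkedHashSet<>();
            for (String[] row : rows) {
                if (matches(row, node)) labels.add(row[totalAttributes]);
            }
            if (labels.size() <= 1 || node.attrIndex.size() == totalAttributes) {
                // Leaf: pure subset or no attributes left. Takes the first label seen.
                node.classLabel = labels.isEmpty() ? null : labels.iterator().next();
            } else {
                // Simplification: split on the first unused attribute.
                int split = 0;
                while (node.attrIndex.contains(split)) split++;
                Set<String> values = new LinkedHashSet<>();
                for (String[] row : rows) {
                    if (matches(row, node)) values.add(row[split]);
                }
                for (String v : values) {            // one child per attribute value
                    Node child = new Node();
                    child.attrIndex.addAll(node.attrIndex);
                    child.attrValue.addAll(node.attrValue);
                    child.attrIndex.add(split);
                    child.attrValue.add(v);
                    nodes.add(child);
                }
            }
            current++;
        }
        return nodes;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"sunny", "hot", "no"},
            new String[] {"sunny", "mild", "yes"},
            new String[] {"rain", "hot", "yes"});
        for (Node n : grow(rows, 2)) {
            System.out.println(n.attrIndex + " " + n.attrValue + " -> " + n.classLabel);
        }
    }
}
```

Because children are appended to the end of the list and processed in order, the tree is built breadth-first, and the loop ends once every queued node has been visited. In the real job each visit costs a full pass over the data, which is why the number of MapReduce runs grows with the number of nodes.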

That is all for this article. I hope it helps with your studies.
