Reading HBase Data from Spark with Java on Windows 10

The setup used in this article is Spark 2.1 + Hadoop 2.7.3 + HBase 1.3.0.

Installing Hadoop

1. Download Hadoop from the official website and unpack it. Create a new system environment variable HADOOP_HOME and set its value to the directory where Hadoop was unpacked. Then download the bin directory from https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1 and use it to replace the local Hadoop bin directory (these are the winutils binaries Hadoop needs on Windows).
2. Open \hadoop-2.7.3\etc\hadoop\hadoop-env.cmd and change the value of JAVA_HOME to your JDK installation directory.
For details, see http://blog.csdn.net/kokjuis/article/details/53537029
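
To confirm that Hadoop is wired up correctly, you can run the version command from the Hadoop directory (a quick check; it should report 2.7.3):

bin>hadoop version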

Installing HBase

1. Download HBase, then change conf/hbase-site.xml to the following (note that hbase.rootdir must point at the same HDFS address that Hadoop's core-site.xml configures, hdfs://localhost:9000 here):

<configuration>
    <property>
        <name>hbase.master</name>
        <value>localhost:6000</value>
    </property>
    <property>
        <name>hbase.master.maxclockskew</name>
        <value>180000</value>
    </property>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>false</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>localhost</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/hbase</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

2. Edit conf/hbase-env.cmd
Set JAVA_HOME, just as when deploying Hadoop:

set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_111

3. Stop Hadoop:

sbin>stop-all.cmd

4. Format the Hadoop NameNode:

bin>hdfs namenode -format

5. Start Hadoop:

sbin>start-all.cmd

6. Start HBase:

bin>start-hbase.cmd

7. Open the HBase shell:

bin>hbase shell
  • Note: the HBase shell cannot be used unless both Hadoop and HBase have been started.
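
A quick way to verify that both are actually running is the JDK's jps tool, which lists Java processes; after the steps above you should see entries such as NameNode, DataNode, and HMaster:

bin>jps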

Inserting data with the HBase shell

1. Create a table, where student is the table name and info is the column family name. This column family will contain the name, age, and gender columns:

hbase> create 'student', 'info'
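
You can confirm that the table and its column family were created with the describe command:

hbase> describe 'student'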

2. Insert data:

// First, insert the first student record into the student table
hbase> put 'student','1','info:name','Xueqian'
hbase> put 'student','1','info:gender','F'
hbase> put 'student','1','info:age','23'
// Then insert the second student record
hbase> put 'student','2','info:name','Weiliang'
hbase> put 'student','2','info:gender','M'
hbase> put 'student','2','info:age','24'

3. View the inserted data:

// To view a single row, use the following command
hbase> get 'student','1'
// To view all of the data, use the following command
hbase> scan 'student'
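
The same single-row lookup can also be done programmatically. The following is a minimal sketch using the HBase 1.x Java client API (GetStudent is a hypothetical class name; it assumes the HBase client jars are on the classpath and HBase is running as configured above):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetStudent {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("student"))) {
            // Equivalent of: get 'student','1'
            Result result = table.get(new Get(Bytes.toBytes("1")));
            String name = Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println("name: " + name); // expected output: Xueqian
        }
    }
}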

Installing Spark

Download Spark and unpack it; that is all that is needed.
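
As a quick sanity check (not required for the rest of this article), you can launch the interactive shell from the Spark directory:

bin>spark-shell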

Reading HBase from Java

1. Create an hbase directory under %SPARK_HOME%/jars and copy the jar files from HBase's lib directory into it.
2. In IntelliJ, add that folder to the Dependencies list in Project Structure.
3. Use Java to read the student table we just populated in HBase:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;

/**
 * Created by shying on 2017/5/9.
 */
public class Test_hbase {

    public static Configuration configuration;
    public static Connection connection;
    public static Admin admin;
    // Initialize the HBase connection
    public static void init(){
        configuration = HBaseConfiguration.create();
        configuration.set("hbase.rootdir","hdfs://localhost:9000/hbase");
        try{
            connection = ConnectionFactory.createConnection(configuration);
            admin = connection.getAdmin();
        }catch (IOException e){
            e.printStackTrace();
        }
    }
    // Close the HBase connection
    public static void close(){
        try{
            if(admin != null){
                admin.close();
            }
            if(null != connection){
                connection.close();
            }
        }catch (IOException e){
            e.printStackTrace();
        }
    }

    // List the tables that exist in HBase through the admin API
    public static void listTables() throws IOException {
        init();
        HTableDescriptor[] hTableDescriptors = admin.listTables();
        for(HTableDescriptor hTableDescriptor :hTableDescriptors){
            System.out.println(hTableDescriptor.getNameAsString());
        }
        close();
    }

    public static void main(String[] args) throws IOException {

        // List tables through the native HBase client API
        listTables();
        // Read the HBase data into a Spark RDD
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("hbase test");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir","hdfs://localhost:9000/hbase");
        // Build the Scan: rows "1" (inclusive) to "3" (exclusive), restricted to info:name
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("1"));
        scan.setStopRow(Bytes.toBytes("3"));
        scan.addFamily(Bytes.toBytes("info"));
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));

        try {
            String tableName = "student";
            conf.set(TableInputFormat.INPUT_TABLE, tableName);

            // Serialize the Scan so it can be passed to TableInputFormat
            ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
            String scanToString = Base64.encodeBytes(proto.toByteArray());

            conf.set(TableInputFormat.SCAN, scanToString);
            JavaPairRDD<ImmutableBytesWritable, Result> myRDD =
                    sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                            ImmutableBytesWritable.class, Result.class);
            // Print the number of rows read
            System.out.println("count: " + myRDD.count());
            // Convert each Result to the value of info:name and save the RDD to the test folder
            JavaRDD<String> result = myRDD.map(x -> Bytes.toString(
                    x._2().getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            result.saveAsTextFile("./test");

        }
        catch (Exception e) {
            System.out.println(e);

        }
        sc.close();
    }
}

  • Note: running the program frequently throws java.net.ConnectException; this is normal.

Finally, the content of part-00000 under the test folder is:

Xueqian
Weiliang
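
If you only want to inspect the values on the driver instead of writing files, collecting the same RDD works too (a small sketch reusing the JavaRDD<String> result from the program above; it also needs a java.util.List import, and collect() pulls everything to the driver, so only use it on small data):

        List<String> names = result.collect();
        names.forEach(System.out::println);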

Reading MySQL from Spark

1. Download MySQL Connector/J from https://dev.mysql.com/downloads/connector/j/5.1.html

2. Add the downloaded jar to the project's dependencies.

3. Read MySQL from Spark in Java:

        // Create the SparkSession used below
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("mysql test")
                .getOrCreate();
        Dataset<Row> jdbcDF = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://localhost/database?user=root&password=secret")
                .option("dbtable", "database.user")
                .option("driver", "com.mysql.jdbc.Driver")
                .load();
        jdbcDF.show();
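
Writing a DataFrame back to MySQL goes through the same jdbc data source. A minimal sketch, assuming a writable target table database.user_copy (a hypothetical name) and an org.apache.spark.sql.SaveMode import:

        jdbcDF.write()
                .format("jdbc")
                .option("url", "jdbc:mysql://localhost/database?user=root&password=secret")
                .option("dbtable", "database.user_copy") // hypothetical target table
                .option("driver", "com.mysql.jdbc.Driver")
                .mode(SaveMode.Append) // append rows rather than failing if the table exists
                .save();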

References:
Setting up a big data test environment on Win10 without Cygwin (1) - Hadoop
Setting up a big data test environment on Win10 without Cygwin (2) - HBase
Reading HBase content from Spark (Java)
Spark 2.1.0 tutorial: reading and writing HBase data
