This article uses the following setup: Spark 2.1 + Hadoop 2.7.3 + HBase 1.3.0.
1. Download Hadoop 2.7.3 from the official website and extract it, then create a new system environment variable HADOOP_HOME whose value is the directory Hadoop was extracted to (a programmatic alternative is sketched after these steps). Download the bin directory from https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1 and use it to replace the local Hadoop bin directory.
2. Open \hadoop-2.7.3\etc\hadoop\hadoop-env.cmd and change the JAVA_HOME value to your JDK installation directory.
For details see http://blog.csdn.net/kokjuis/article/details/53537029
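If you prefer not to rely on the HADOOP_HOME environment variable, the lookup for bin\winutils.exe can also be satisfied from code by setting the hadoop.home.dir system property before any Hadoop or Spark code runs. A minimal sketch; the D:\hadoop-2.7.3 path is an assumption, replace it with your own extraction directory:

public class WinutilsCheck {
    public static void main(String[] args) {
        // Assumption: Hadoop 2.7.3 was extracted to D:\hadoop-2.7.3 and winutils.exe sits in its bin folder.
        // Setting hadoop.home.dir has the same effect as the HADOOP_HOME environment variable.
        System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.3");
        System.out.println("hadoop.home.dir = " + System.getProperty("hadoop.home.dir"));
    }
}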
1. Download HBase, then modify conf/hbase-site.xml as follows:
<configuration>
  <property>
    <name>hbase.master</name>
    <value>localhost:6000</value>
  </property>
  <property>
    <name>hbase.master.maxclockskew</name>
    <value>180000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/hbase</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
2. Modify conf/hbase-env.cmd
Set JAVA_HOME, just as in the Hadoop deployment:
set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_111
3. Stop Hadoop:
sbin>stop-all.cmd
4. Format the Hadoop NameNode:
bin>hdfs namenode -format
5. Start Hadoop:
sbin>start-all.cmd
6. Start HBase:
bin>start-hbase.cmd
7. Open the HBase shell:
bin>hbase shell
1. Create a table: student is the table name and info is the column family, which will hold the name, age, and gender columns. (A Java client equivalent of these shell commands is sketched after this list.)
hbase> create 'student', 'info'
2. Insert data
// First insert the first student record into the student table
hbase> put 'student','1','info:name','Xueqian'
hbase> put 'student','1','info:gender','F'
hbase> put 'student','1','info:age','23'
// Then insert the second student record
hbase> put 'student','2','info:name','Weiliang'
hbase> put 'student','2','info:gender','M'
hbase> put 'student','2','info:age','24'
3. View the inserted data
// To view a single row, use:
hbase> get 'student','1'
// To view all rows, use:
hbase> scan 'student'
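The same operations can also be done from Java through the HBase 1.x client API instead of the shell. The following sketch is not part of the original walkthrough; it assumes the hbase-site.xml settings above (ZooKeeper on localhost), writes the row with key "1", and reads the info:name column back.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class StudentPutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // matches hbase-site.xml above
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("student"))) {
            // Equivalent of the three shell 'put' commands for row key "1"
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Xueqian"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes("F"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("23"));
            table.put(put);

            // Equivalent of: get 'student','1'
            Result result = table.get(new Get(Bytes.toBytes("1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}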
Download Spark and simply extract it.
1. Create an hbase directory under %SPARK_HOME%/jars and copy the jar files from HBase's lib directory into it.
2. In IntelliJ, add that folder as a dependency under Project Structure.
3. Read the student table we just populated in HBase from Java:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;
/**
 * Created by shying on 2017/5/9.
 */
public class Test_hbase {
    public static Configuration configuration;
    public static Connection connection;
    public static Admin admin;

    // Initialize the HBase connection
    public static void init() {
        configuration = HBaseConfiguration.create();
        configuration.set("hbase.rootdir", "hdfs://localhost:9000/hbase");
        try {
            connection = ConnectionFactory.createConnection(configuration);
            admin = connection.getAdmin();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Close the HBase connection
    public static void close() {
        try {
            if (admin != null) {
                admin.close();
            }
            if (null != connection) {
                connection.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // List the existing HBase tables through the HBase client API
    public static void listTables() throws IOException {
        init();
        HTableDescriptor[] hTableDescriptors = admin.listTables();
        for (HTableDescriptor hTableDescriptor : hTableDescriptors) {
            System.out.println(hTableDescriptor.getNameAsString());
        }
        close();
    }

    public static void main(String[] args) throws IOException {
        // Read table metadata through the HBase client API
        listTables();

        // Read HBase data into a Spark RDD
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("hbase test");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://localhost:9000/hbase");

        // Scan rows "1" (inclusive) to "3" (exclusive), restricted to column info:name
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("1"));
        scan.setStopRow(Bytes.toBytes("3"));
        scan.addFamily(Bytes.toBytes("info"));
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));

        try {
            String tableName = "student";
            conf.set(TableInputFormat.INPUT_TABLE, tableName);
            ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
            String scanToString = Base64.encodeBytes(proto.toByteArray());
            conf.set(TableInputFormat.SCAN, scanToString);
            JavaPairRDD<ImmutableBytesWritable, Result> myRDD =
                    sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                            ImmutableBytesWritable.class, Result.class);
            // Print the number of rows read
            System.out.println("count: " + myRDD.count());
            // Convert each Result to the value of info:name and save it to the test folder
            JavaRDD<String> result = myRDD.map(
                    x -> Bytes.toString(x._2().getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            result.saveAsTextFile("./test");
        } catch (Exception e) {
            System.out.println(e);
        }
        sc.close();
    }
}
The resulting part-00000 file in the test folder contains:
Xueqian
Weiliang
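For reference, Spark can also write back to HBase through TableOutputFormat and saveAsNewAPIHadoopDataset. The sketch below is not part of the original walkthrough; it assumes the same local setup and adds one made-up row (key "3", name "Pingping") to the student table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class WriteHbaseSketch {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local").setAppName("hbase write test"));

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // matches the local setup above
        conf.set(TableOutputFormat.OUTPUT_TABLE, "student");
        Job job = Job.getInstance(conf);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);

        // Build one Put per input record; "3,Pingping" is made-up example data
        JavaPairRDD<ImmutableBytesWritable, Put> puts = sc
                .parallelize(Arrays.asList("3,Pingping"))
                .mapToPair(line -> {
                    String[] parts = line.split(",");
                    Put put = new Put(Bytes.toBytes(parts[0]));
                    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(parts[1]));
                    return new Tuple2<>(new ImmutableBytesWritable(), put);
                });

        puts.saveAsNewAPIHadoopDataset(job.getConfiguration());
        sc.close();
    }
}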
1. Download MySQL Connector/J from https://dev.mysql.com/downloads/connector/j/5.1.html
2. Add the downloaded jar to the project dependencies.
3. Read MySQL from Spark in Java:
SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("mysql test")
        .getOrCreate();
Dataset<Row> jdbcDF = spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost/database?user=root&password=secret")
        .option("dbtable", "database.user")
        .option("driver", "com.mysql.jdbc.Driver")
        .load();
jdbcDF.show();
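Writing back goes through the same jdbc data source. A minimal sketch continuing from jdbcDF above; database.user_copy is an assumed target table:

// Append the rows of jdbcDF into an assumed table database.user_copy
jdbcDF.write()
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost/database?user=root&password=secret")
        .option("dbtable", "database.user_copy")
        .option("driver", "com.mysql.jdbc.Driver")
        .mode("append")
        .save();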
References:
Win10不需要Cygwin搭建大数据测试环境(1)-Hadoop
Win10不需要Cygwin搭建大数据测试环境(2)-HBase
Spark读取HBase内容_Java
Spark2.1.0入门:读写HBase数据