HBase操作

基本操作

创建表

Examples:

  hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
  hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
  hbase> # The above in shorthand would be the following:
  hbase> create 't1', 'f1', 'f2', 'f3'
  hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
  hbase> create 't1', 'f1', {SPLITS => ['10', '20', '30', '40']}
  hbase> create 't1', 'f1', {SPLITS_FILE => 'splits.txt'}
  hbase> # Optionally pre-split the table into NUMREGIONS, using
  hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)
  hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

create 'dbname:newsinfo_anticheat_user_tag_data', 'user', 'device'

获得表的描述

hbase(main):006:0> describe "dbname:newsinfo_anticheat_user_tag_data"

插入几条记录

put 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224', 'user:phone', '190eb638-185d-3e58-a009-fcd69f67b8ac'
put 'dbname:newsinfo_anticheat_user_tag_data', '596f51ba4e94d9000170e1ff_20181224', 'user:phone', '17086385281'

查看所有数据scan

hbase(main):014:0> scan 'dbname:newsinfo_anticheat_user_tag_data'
ROW                                              COLUMN+CELL
596f51ba4e94d9000170e1ff_20181223               column=user:phone, timestamp=1545646705395, value=17086385281

596f51ba4e94d9000170e1ff_20181224               column=user:phone, timestamp=1545646716425, value=17086385281

获得数据 get

获得一行的所有数据

get 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224'

获得某行，某列族的所有数据

get 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224','user'

获得某行，某列族，某列的所有数据

get 'dbname:newsinfo_anticheat_user_tag_data', '5a483b8769e9560001f9d1b9_20181224','user:dt'

预分区

默认情况下，在创建HBase表的时候会自动创建一个region分区，当导入数据的时候，所有的HBase客户端都向这一个region写数据，直到这个region足够大了才进行切分。一种可以加快批量写入速度的方法是通过预先创建一些空的regions，这样当数据写入HBase时，会按照region分区情况，在集群内做数据的负载均衡。
命令方式:

create ‘t1’, ‘f1’, {NUMREGIONS => 15, SPLITALGO => ‘HexStringSplit’}
也可以使用api的方式:
bin/hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f info
参数：
test_table是表名
HexStringSplit 是split 方式
-c 是分10个region
-f 是family

这样就可以将表预先分为15个区，减少数据达到storefile 大小的时候自动分区的时间消耗，并且还有以一个优势，就是合理设计rowkey 能让各个region 的并发请求平均分配(趋于均匀) 使IO 效率达到最高，但是预分区需要将filesize 设置一个较大的值，设置哪个参数呢, hbase.hregion.max.filesize 这个值默认是10G 也就是说单个region 默认大小是10G,
这个参数的默认值在0.90 到0.92到0.94.3各版本的变化：256M--1G--10G
但是如果MapReduce Input类型为TableInputFormat 使用hbase作为输入的时候，就要注意了，每个region一个map，如果数据小于10G 那只会启用一个map 造成很大的资源浪费，这时候可以考虑适当调小该参数的值，或者采用预分配region的方式，并将检测如果达到这个值，再手动分配region。

HBase已有表与Phoenix映射

使用phoenix 视图方式映射

初始创建

查看HBASE 已有表dbname:newsinfo_anticheat_tag_data

hbase(main):003:0> scan ' dbname:newsinfo_anticheat_tag_data'
ROW                                              COLUMN+CELL
 596f51ba4e94d9000170e1ff_20181223               column=user:dt, timestamp=1545647095916, value=20181223
 596f51ba4e94d9000170e1ff_20181223               column=user:phone, timestamp=1545646705395, value=17086385281
 596f51ba4e94d9000170e1ff_20181224               column=user:dt,

phoenix 4.10 版本后，对列映射做了优化，采用一套新的机制，不在基于列名方式映射到 hbase。如果只做查询，强烈建议使用phoenix 视图方式映射，删除视图不影响 hbase 源数据，语法如下：

0: jdbc:phoenix:dsrv2.heracles.sohuno.com,dme>
use "dbname";
create view "newsinfo_anticheat_user_tag_data"("ROW" varchar primary key, "user"."dt" varchar , "user"."phone" varchar) ;

把HBASE中的ROW当作主键
表名和列族以及列名需要用双引号括起来，因为HBase是区分大小写的，如果不用双引号括起来的话Phoenix在创建表的时候会自动将小写转换为大写字母

Hbase新增列后重新映射

Hbase shell

新添加列user.did_count
put 'dbname:test2', '5a483b8769e9560001f9d1b9_20181224', 'user:did_count', '100'

Phoneix sql

#删除视图
drop view "newsinfo_anticheat_user_tag_data";

#重新创建视图，加上新增列
use "dbname";
create view "newsinfo_anticheat_user_tag_data"("ROW" varchar primary key, "user"."dt" varchar , "user"."phone" varchar, "user"."did_count" varchar) ;

重新查询有了新数据

使用phoenix 表方式映射

创建映射表

必须要表映射，需要禁用列映射规则（会降低查询性能），如下：

use "dbname";
create table "newsinfo_anticheat_user_tag_data"("ROW" varchar primary key, "user"."dt" varchar , "user"."phone" varchar)  column_encoded_bytes=0;

注意：删除映射表时，hbase对应数据也会被删除，慎用删除表操作！！！

Phoneix二级索引

创建二级索引

create index my_index2 on MY_TABLE(V1) include(v2);

HBase增加列

put 'dbname:test2', '5a483b8769e9560001f9d1b9_20181224', 'user:did_count', '100'

Phoneix导出csv文件

[@dudbname103113.heracles.sohuno.com ~]$ /opt/work/phoenix-4.13.1-HBase-1.3/bin/sqlline.py --help
usage: sqlline.py [-h] [-v VERBOSE] [-c COLOR] [-fc FASTCONNECT]
                  [zookeepers] [sqlfile]

Launches the Apache Phoenix Client.

positional arguments:
  zookeepers            The ZooKeeper quorum string
  sqlfile               A file of SQL commands to execute

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        Verbosity on sqlline.
  -c COLOR, --color COLOR
                        Color setting for sqlline.
  -fc FASTCONNECT, --fastconnect FASTCONNECT
                        Fetch all schemas on initial connection

编辑一个sqlfile，写入待查询sql语句，如下所示：

[@dudbname103113.heracles.sohuno.com ~]$ cat /home/dbname/data/dev_xdf/phoneix/sqlfile.txt
use "dbname";

select * from "newsinfo_anticheat_blacklist_data" limit 5;

执行命令导出数据到csv文件，如下：

/opt/work/phoenix-4.13.1-HBase-1.3/bin/sqlline.py dsrv2.heracles.sohuno.com,dmeta2.heracles.sohuno.com,drm2.heracles.sohuno.com,dmeta1.heracles.sohuno.com /home/dbname/data/dev_xdf/phoneix/sqlfile.txt >> /home/dbname/data/dev_xdf/phoneix/sqlout.csv

查看导出csv文件内容：

[@dudbname103113.heracles.sohuno.com ~]$ cat /home/dbname/data/dev_xdf/phoneix/sqlout.csv
997/997 (100%) Done
+---------------------------+-----+---------------------------------------+-----------+
|          rowkey           | dt  |                  did                  |  did_dt   |
+---------------------------+-----+---------------------------------------+-----------+
| 51bd61df  |     | 76b46d96-a986-36b6-864c-4558d250e6ad  | 20190114  |
| 58a272ec427ad50001849b2b  |     | dd4768f4-5928-37bc-93d9-d700595434a1  | 20190129  |
| 58c246850c0e580001f4104a  |     | 9a79728e-3d0c-3f39-bad4-40a093ec27ea  | 20190122  |
| 596628720c0e5800018c8d2c  |     | f92979768431f3bc6dcf352ac67e5e5d      | 20190129  |
| 5969a1a5c6e6dd0001a27f35  |     | bb60dc61-69ec-3eaa-9ade-772122c8ac88  | 20190129  |
+---------------------------+-----+---------------------------------------+-----------+

Phoneix查询

0: jdbc:phoenix:dsrv2.heracles.sohuno.com,dme> select * from "device_tag_data" where "dt"='20190210' and TO_NUMBER("user_count")>10 LIMIT 2;
+------------------------------------------------+-----------+-------------+----------+------------+--------------+-----------+--------------+-----+
|                      ROW                       |    dt     | user_count  | fenshen  | gyroscope  | android_ver  | hotcloud  | ad_exposure  | ad_ |
+------------------------------------------------+-----------+-------------+----------+------------+--------------+-----------+--------------+-----+
| 03215fc6-5520-3e49-b1e5-f6d72024fa62_20190210  | 20190210  | 11          | 1        |            |              | 1         |              |     |
| 0a340e16-f0dd-3be0-8632-8ac1895d2e6a_20190210  | 20190210  | 12          | 1        |            | 8.1.0        | 1         |              |     |
+------------------------------------------------+-----------+-------------+----------+------------+--------------+-----------+--------------+-----+

写hbase

Spark Bulkload方式

def main(args: Array[String]): Unit = {

  //日期定义
  val today = args(0)
  val hbaseTabName=args(1)
  val hdfsTmpPath = args(2)

  //创建SparkSession
  val sparkconf = new SparkConf().setAppName("UserMetrics")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(sparkconf)
  val spark = SparkSessionSingleton.getInstance(sc.getConf)

  //配置hbase参数
  val conf = new Configuration()
  conf.set("hbase.zookeeper.quorum", "zklist")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set("zookeeper.znode.parent", "/hbase ")
  conf.set(TableOutputFormat.OUTPUT_TABLE, hbaseTabName)

  //获取rdd数据
  val device _rdd = get_device_rdd(spark,today)

  //spark bulkload导用户指标数据入hbase
  spark_bulkload_to_hbase(device_rdd,conf,hdfsTmpPath)

  sc.stop()
}

def get_device_rdd(spark: SparkSession,today: String) :RDD[(ImmutableBytesWritable, Put)] = {
  val sql = ""
  val data = spark.sql(sql)

  val ad_exposure_click_rdd = data.rdd.map(record => {
    val did = record.getString(0)
    val rowkey = did+"_"+today
    val put = new Put(Bytes.toBytes(rowkey))

    try{
      put.addColumn(Bytes.toBytes("device"), Bytes.toBytes("dt"), Bytes.toBytes(today))
      val user_count = record.get(1)
      put.addColumn(Bytes.toBytes("device"), Bytes.toBytes("user_count"), Bytes.toBytes(user_count.toString))
    }catch {
      case e: Exception => println(s"${e.printStackTrace()}")
    }
    (new ImmutableBytesWritable, put)
  })
  ad_exposure_click_rdd
}

/**
  * spark bulkload导dataframe数据入hbase
  */
def spark_direct_bulkload_to_hbase(rdd: RDD[(ImmutableBytesWritable, Put)], conf: Configuration,path: String): Unit ={
  val job = Job.getInstance(conf)
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Result])
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

  //save to hbase hfile
  job.getConfiguration.set("mapred.output.dir", path)
  rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
}

/**
  * spark 先生成hfiles在调用 bulkload导数据入hbase
  */
def spark_bulkload_to_hbase(rdd: RDD[(ImmutableBytesWritable, Put)], conf: Configuration,path: String,hbaseTabName: String): Unit ={
  val myTable = new HTable(conf, hbaseTabName)
  // Save Hfiles on HDFS
  rdd.saveAsNewAPIHadoopFile(path, classOf[ImmutableBytesWritable], classOf[Result], classOf[TableOutputFormat[ImmutableBytesWritable]], conf)

  //Bulk load Hfiles to Hbase
  val bulkLoader = new LoadIncrementalHFiles(conf)
  bulkLoader.doBulkLoad(new Path(path), myTable)
}

Hbase API方式

data.rdd.foreachPartition(
  partitionRecords => {
    val conn = getHBaseConn(hbaseTabName) // 获取Hbase连接
    val tName = TableName.valueOf(hbaseTabName)
    val table = conn.getTable(tName)
    partitionRecords.foreach(record => {

      val userid = record.getString(0)
      val rowkey = userid + "_" + today
      val put = new Put(Bytes.toBytes(rowkey))

      try {
        put.addColumn(Bytes.toBytes("user"), Bytes.toBytes("dt"), Bytes.toBytes(today))

        val ad_exposure = record.get(1)
        if (ad_exposure != None && ad_exposure != null) {
          put.addColumn(Bytes.toBytes("user"), Bytes.toBytes("ad_exposure"), Bytes.toBytes(ad_exposure.toString))
        }
        val ad_click_rate = record.get(2)
        if (ad_click_rate != None && ad_click_rate != null) {
          put.addColumn(Bytes.toBytes("user"), Bytes.toBytes("ad_click_rate"), Bytes.toBytes(ad_click_rate.toString))
        }
        Try(table.put(put)).getOrElse(table.close())//将数据写入HBase，若出错关闭table
      } catch {
        case e: Exception => println(s"================================ ${e.printStackTrace()}")
      }
    })
    table.close()//分区数据写入HBase后关闭连接
    conn.close()
  })

Hbase数据导出

Hbase shell

通过查询条件过滤导出

hbase(main):006:0> scan 'dbname:newsinfo_anticheat_blacklist_data",{COLUMNS => 'user:dt',LIMIT=>1}
ROW                                        COLUMN+CELL               59c925dbcb8e580001ddd21c                  column=user:dt, timestamp=1547805110793, value=20190114

导出到文件：

echo " scan 'dbname:newsinfo_anticheat_blacklist_data',{COLUMNS => 'user:dt',LIMIT=>1}" | hbase shell > ./hbase.csv

通过hive导出

有时候我们需要把已存在Hbase中的用户画像数据导到hive里面查询，也就是通过hive就能查到hbase里的数据。但是我又不想使用sqoop或者DataX等工具倒来倒去。这时候可以在hive中创建关联表的方式来查询hbase中的数据。
HBase中建表，然后Hive中建一个外部表，这样当Hive中写入数据后，HBase中也会同时更新。

用hive映射表访问hbase数据

在hbase中创建表后,我们只能在hbase shell中使用scan查询数据,这对于熟悉SQL的使用者不怎么习惯,不过我们可以在hive中创建与hbase表的映射来访问hbase表中的数据,例子如下:

1.这里hbase中的表dbname: device_tag_data已经存在

hbase(main):067:0> scan "dbname:device_tag_data",LIMIT=>1
ROW                                COLUMN+CELL
0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:dt, timestamp=1547637050603, value=20190114
0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:gyroscope, timestamp=1547631713244, value=103
0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:user_count, timestamp=1547631631653, value=1                                     0000039f-2d6e-3140-bb97-0d7294cfa4fe_2019 column=device:android_ver, timestamp=1547638332436, value=6.0

2.创建hive映射表关联hbase

CREATE EXTERNAL TABLE dbname.device_tag_data(
key string,
dt string,
user_count string,
fenshen string,
gyroscope string,
android_ver string,
hotcloud string,
ad_exposure string,
ad_click_rate string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" =
":key,device:dt,device:user_count,device:fenshen,device:gyroscope,device:android_ver,device:hotcloud,device:ad_exposure,device:ad_click_rate")
TBLPROPERTIES("hbase.table.name" = "dbname:device_tag_data");

主要是配置hbase.table.name和hbase.columns.mapping，一个是hbase表名，一个是hbase字段和hive字段的一一映射，然后就可以从hive中读写hbase数据。
注意hbase.columns.mapping后面的字段直接不能出现空格和换行。

3.通过hive查询数据

0: jdbc:hive2://10.31.103.113:10003/dbname> select * from dbname.device_tag_data where key='0000039f-2d6e-3140-bb97-0d7294cfa4fe_20190114';

注意:
这里我们访问的dbname.device_tag_data表是虚表,数据是存储在hbase中的。Hive 与HBase集成，直接从Hive里面连HBase的数据库进行查询，虽然没有做专门的Benchmark, 但总感觉直接对HBase进行查询操作不怎么靠谱，如果我们要频繁做很多类型的数据分析，那HBase的压力一定会倍增。为此我们可以再建立一个新的hive空表，把查询出来的数据全部导入到新表当中，以后的所有数据分析操作在新表中完成。

4.创建hive表

CREATE TABLE dbname.device_tag_data2(
key string,
dt string,
user_count string,
fenshen string,
gyroscope string,
android_ver string,
hotcloud string,
ad_exposure string,
ad_click_rate string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://ns/user/dbname/hive/online/device_tag_data2';

5.将hbase中的表数据加载到本地表

INSERT OVERWRITE TABLE dbname.device_tag_data2 select * from dbname.device_tag_data;

至此大功告成！
以后所有复杂的数据查询和数据分析都可以在新hive表中完成。

XIAO的博客

HBase和Phoneix使用示例

HBase操作

基本操作

创建表

获得表的描述

插入几条记录

查看所有数据scan

获得数据 get

获得一行的所有数据

获得某行，某列族的所有数据

获得某行，某列族，某列的所有数据

预分区

HBase已有表与Phoenix映射

使用phoenix 视图方式映射

初始创建

Hbase新增列后重新映射

使用phoenix 表方式映射

创建映射表

Phoneix二级索引

HBase增加列

Phoneix导出csv文件

Phoneix查询

写hbase

Spark Bulkload方式

Hbase API方式

Hbase数据导出

Hbase shell

通过hive导出

用hive映射表访问hbase数据