Hive

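📋 Introduction: a data warehouse tool built on Hadoop. Put simply, it maps structured text files to database tables and offers SQL-like queries over them
🧠 Essence: it translates SQL into MapReduce jobs
🚗 Use case: offline (batch) data analysis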

Hive SQL

Recommended database client: DBeaver, which can connect to Hive on Hadoop as well as many other databases
Recommended book: 《离线和实时大数据开发实战》 by 朱松岭

Hive Databases

Creating a database

-- Syntax
CREATE (DATABASE|SCHEMA) [ IF NOT EXISTS ] database_name
	[ COMMENT database_comment ]
	[ LOCATION hdfs_path ]
	[ WITH DBPROPERTIES (property_name=property_value, ...) ];

-- Example
create database demo;
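
To make the optional clauses in the syntax above concrete, here is a minimal sketch that uses all of them at once. The comment text, the HDFS path and the property names and values are made up for illustration.

```sql
-- sketch: every optional clause of CREATE DATABASE in one statement
-- (comment, location and properties below are hypothetical example values)
create database if not exists demo
  comment 'scratch database for the examples in this post'
  location '/user/hive/warehouse/demo.db'
  with dbproperties ('creator' = 'me', 'purpose' = 'demo');
```

With if not exists the statement is safe to rerun: it does nothing when the database is already there.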

Listing all databases

show databases;

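Dropping a database

drop database demo;

By default Hive refuses to drop a database that still contains tables. If the database has tables, use the following statement instead

drop database demo cascade;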
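
A slightly more defensive variant, as a sketch: if exists keeps the statement from failing when the database is already gone, and restrict is simply the default behavior written out.

```sql
-- fails if demo still contains tables (restrict is the default)
drop database if exists demo restrict;

-- drops the tables first, then the database
drop database if exists demo cascade;
```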

Switching databases

use demo;

Viewing database information

describe database demo;
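
Two small companions to the statements above, shown as a sketch against the same demo database: one confirms which database the session is using, the other also prints the database properties.

```sql
use demo;
select current_database();        -- the database the session is currently in
describe database extended demo;  -- like describe database, plus the DBPROPERTIES
```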

Hive Table DDL

Creating a table

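CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
   [(col_name data_type [COMMENT col_comment], ...)] -- column names and types
   [COMMENT table_comment]  -- table description
   [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]  -- partitions; queries that filter on a partition column read only the matching partitions
   [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]  -- buckets; a row's bucket is the hash of the column modulo the number of buckets
   [ROW FORMAT row_format] -- field and collection delimiters
   [STORED AS file_format] -- storage file format
   [LOCATION hdfs_path] -- HDFS location of the data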
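
Here is a sketch of how these clauses combine in a single statement. The table page_views and its columns are hypothetical and exist only to show the order of the clauses.

```sql
-- hypothetical table: one row per page view, partitioned by day, bucketed by user
create table if not exists page_views (
  user_id bigint comment 'visitor id',
  url     string comment 'requested page'
)
comment 'one row per page view'
partitioned by (ymd string comment 'day, e.g. 20181209')
clustered by (user_id) sorted by (user_id asc) into 8 buckets
row format delimited fields terminated by '\t'
stored as textfile;
```

Note that the partition column ymd is declared only in partitioned by, not in the column list; Hive still exposes it as a column in queries.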

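Creating a managed (internal) table with a custom delimiter (the default is '\001')

Table with primitive types only

-- caution: ROW FORMAT DELIMITED uses only the first character of a delimiter, so '||' is effectively read as '|'
create table demo (id int, name string) row format delimited fields terminated by '||';

The data file looks like this:
66||jerry
77||tomas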
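
One way to make the two-character delimiter in the sample above split the fields as intended is MultiDelimitSerDe, since ROW FORMAT DELIMITED only looks at a single delimiter character. This is a minimal sketch, assuming the hive-contrib jar is available to your Hive installation. The table name demo_multi is made up, and in newer Hive releases the class has moved out of the contrib package, so check the class name against your version.

```sql
-- sketch: a multi-character field delimiter via MultiDelimitSerDe
-- (assumes hive-contrib is on the classpath; demo_multi is a hypothetical name)
create table demo_multi (id int, name string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties ('field.delim' = '||');
```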

Table with complex types, example 1

create table demo (name string, hobby array<string> ) 
	row format delimited fields terminated by '\t' 
	collection items terminated by '||';

The data file looks like this (name and hobbies are separated by a tab):
jerry movie||game||music
tomas movie||code||basketball
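
Once the data is loaded, the array column can be read by index or flattened into rows. A short sketch against the demo table above (the aliases first_hobby, hobby_count, t and h are arbitrary):

```sql
select name,
       hobby[0]    as first_hobby,  -- array indexes start at 0
       size(hobby) as hobby_count   -- number of elements in the array
from demo;

-- one output row per (name, hobby) pair
select name, h
from demo
lateral view explode(hobby) t as h;
```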

Table with complex types, example 2

create table demo (id int, name string, hobby_score map<string, int>) 
	row format delimited fields terminated by '\t'
	collection items terminated by '||'
	map keys terminated by ':';

The data file looks like this (the three fields are separated by tabs):
66 jerry movie:5||game:5||music:4
77 tomas movie:4||game:5||basketball:3
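
A map column is read by key, and it can also be exploded into one row per entry. A sketch against the table above (the aliases are arbitrary):

```sql
select name,
       hobby_score['movie']  as movie_score,  -- null when the key is missing
       map_keys(hobby_score) as hobbies       -- array of all keys in the map
from demo;

-- one output row per (name, hobby, score) entry
select name, hobby, score
from demo
lateral view explode(hobby_score) t as hobby, score;
```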

Creating an external table

create external table demo_ext (id int, name string, hobby string) row format delimited fields terminated by '||';
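
External tables are usually pointed at a directory that already holds the files. A sketch, in which the table name demo_ext_loc and the path /data/demo_ext are hypothetical:

```sql
-- sketch: external table over an existing HDFS directory (path is an assumed example)
create external table if not exists demo_ext_loc (id int, name string, hobby string)
row format delimited fields terminated by '\t'
location '/data/demo_ext';

-- dropping an external table removes only the metadata; the files under the location stay
drop table demo_ext_loc;
```

That last point is the practical difference from a managed table, where drop table deletes the data as well.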

Copying a table (structure only, no data)

create table demo_copy like demo;
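
For contrast, here is a sketch of two ways to copy the rows too. The name demo_full_copy is made up; demo_copy is the empty copy created with like.

```sql
-- create table as select: new table plus the rows in one step
create table demo_full_copy as
select * from demo;

-- or fill the empty structural copy afterwards
insert into table demo_copy
select * from demo;
```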

Altering a table

Renaming a table

alter table demo_old rename to demo_new;

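Renaming a column

-- the column type must be restated; string here stands in for the column's actual type
alter table demo change old_col new_col string;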

Adding a column

alter table demo add columns (newcol int comment 'description of the new column');
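
After an alter it helps to look at what the table has become. A sketch of three statements that do this, run against the demo table:

```sql
describe demo;            -- column names, types and comments
describe formatted demo;  -- adds location, table type, SerDe and storage details
show create table demo;   -- the DDL that would recreate the table
```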

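Dropping a table

Delete both the table's data and its metadata (for an external table, only the metadata is removed)

drop table demo;

Delete only the table's data

truncate table demo;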
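
Both operations also have a per-partition form. The sketch below uses the partitioned userlog table that is created later in this post; the partition value is just an example.

```sql
-- empty one partition but keep it registered
truncate table userlog partition (ymd='20181209');

-- remove the partition itself (for a managed table its files go with it)
alter table userlog drop if exists partition (ymd='20181209');

-- if exists avoids an error when the table is already gone
drop table if exists demo;
```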

Inserting data into a table

Loading data into a table

Database: hive
Local log file path: /usr/local/openresty/nginx/logs/toupload/user_defined.log.2018*
HDFS path: hdfs://user/hive/warehouse/hive.db/
-- LOAD syntax
LOAD DATA
[LOCAL]  -- add LOCAL for a path on the local file system, omit it for an HDFS path
INPATH 'filepath' -- path of the file(s) to load
[OVERWRITE] -- whether to replace the data already in the table or partition
INTO TABLE tablename [PARTITION (col=value, col_1=value_1, ...)]

-- create a table in the hive database, partitioned by a custom ymd column
create table userlog (
  msec string, remote_addr string, status int, body_bytes_sent string,
  u_domain string, u_url string, u_title string, u_referrer string,
  u_sh string, u_sw string, u_cd string, u_lang string, http_user_agent string,
  u_account string, u_avalue string, u_type0 string, u_type1 string, u_type2 string)
partitioned by (ymd string)
row format delimited fields terminated by '||';

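-- load one day's data
load data local inpath '/usr/local/openresty/nginx/logs/toupload/user_defined.log.20181209*' into table userlog partition (ymd='20181209');

If the log files have already been shipped to HDFS (note that LOAD DATA without LOCAL moves the files into the table's directory rather than copying them)
HDFS log path: hdfs://flume/tailout/18-12-09/
```SQL
load data inpath '/flume/tailout/18-12-09/*' into table userlog partition (ymd='20181209');
```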
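
-- A quick way to check what a load produced (a sketch that reuses the userlog
-- table and the partition value from the load statements above)
show partitions userlog;
select count(*) from userlog where ymd = '20181209';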

Inserting query results into a Hive table (stored on HDFS)

-- INSERT syntax
INSERT
(INTO|OVERWRITE)  -- into: append to the table; overwrite: replace the table's data
TABLE demo
select_statement...

-- create the result table
create table userlogresult(timesec string,ip string) 
partitioned by (ymd string) 
row format delimited fields terminated by '||';

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- insert the result of querying userlog into userlogresult
insert overwrite table userlogresult partition (ymd)
select msec as timesec,remote_addr as ip,ymd from userlog;
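
-- For comparison, a static-partition insert (a sketch, not from the original setup):
-- the partition value is named up front, so the dynamic-partition settings are not
-- needed and ymd is left out of the select list. The date is only an example.
insert overwrite table userlogresult partition (ymd='20181209')
select msec as timesec, remote_addr as ip
from userlog
where ymd = '20181209';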

Hive Table DML

Well... as everyone knows, querying tables is the hardest part, so I'll cover it separately in the next chapter