目录
数据结构
Customer表
Transaction表
Store表
Review表
上传数据
创建目录用于存放数据
把本地文件上传到HDFS上
创建外部表
创建数据库
创建表
数据清洗
对transaction_details中的重复数据生成新ID
过滤掉store_review中没有评分的数据
找出PII (personal information identification‘个人信息识别’) 数据进行加密
重新组织transaction数据按照日期YYYY-MM(按月)做分区
Customer分析
1.找出顾客最常用的信用卡
2.找出户客资料中排名前五的职位名称
3.在美国女性最常用的信用卡
4.按性别和国家进行客户统计
Transaction
1.计算每月总收入
2.计算每个季度的总收入
3.按年计算总收入
4.按工作日计算总收入
5.按时间段计算总收入(需要清理数据)
6.按时间段计算平均消费
7.按工作日计算平均消费
8.计算年、月、日的交易总数
9.找出交易量最大的10个客户
10.找出消费最多的前10位顾客
11.计该期间交易数量最少的用户
12.算每个季度的独立客户总数
13.算每周的独立客户总数
14.算整个活动客户平均花费的最大值
15.计每月花费最多的客户
16.计每月访问次数最多的客户
17.总价找出最受欢迎的5种产品
18.据购买频率找出最畅销的5种产品
19.据客户数量找出最受欢迎的5种产品
Store分析
1.按客流量找出最受欢迎的商店
2.根据顾客消费价格找出最受欢迎的商店
3.根据顾客交易情况找出最受欢迎的商店
4.根据商店和唯一的顾客id获取最受欢迎的产品
5.获取每个商店的员工与顾客比
6.按年和月计算每家店的收入
7.按店铺制作总收益饼图
8.找出每个商店最繁忙的时间段
9.找出每家店的忠实顾客
10.根据每位员工的最高收入找出明星商店
Review分析
1.在ext_store_review中找出存在冲突的交易映射关系
2.了解客户评价的覆盖率
3.根据评分了解客户的分布情况
4.根据交易了解客户的分布情况
5.客户给出的最佳评价是否总是同一家门店
使用工具zeppelin
数据结构
Customer表
customer_details | details |
---|---|
customer_id | Int,1-500 |
first_name | string |
last_name | string |
string,suchaswillddy@gmail.com | |
gender | string,Maleorfemale |
address | string |
country | string |
language | string |
job | string,jobtitle/position |
credit_type | string,creditcardtype,suchasvisa |
credit_no | string,creditcardnumber |
Transaction表
transaction_details | details |
---|---|
transaction_id | Int,1-1000 |
customer_id | Int,1-500 |
store_id | Int,1-5 |
price | decimal,suchas5.08 |
product | string,thingsbought |
date | string,whentopurchase |
time | string,whattimetopurchase |
Store表
store_details | details |
---|---|
store_id | Int,1-5 |
store_name | string |
employee_number | Int,在store有多少employee |
Review表
store_review | details |
---|---|
stransaction_id | Int,1-8000 |
store_id | Int,1-5 |
review_store | Int,1-5 |
上传数据
创建目录用于存放数据
%shhdfs dfs -mkdir -p /shopping/data/customerhdfs dfs -mkdir -p /shopping/data/storehdfs dfs -mkdir -p /shopping/data/reviewhdfs dfs -mkdir -p /shopping/data/transaction
把本地文件上传到HDFS上
%shcd /opt/stufile/storetransactionls -alhdfs dfs -put ./customer_details.csv /shopping/data/customerhdfs dfs -put ./store_details.csv /shopping/data/storehdfs dfs -put ./store_review.csv /shopping/data/reviewhdfs dfs -put ./transaction_details.csv /shopping/data/transaction
创建外部表
创建数据库
%hive-- 如果存在此表则删除drop database if exists shopping cascade;--创建数据库create database if not exists shopping;
创建表
%hivecreate external table if not exists ext_customer_detail( customer_id string, first_name string, last_name string, email string, gender string, address string, country string, language string, job string, credit_type string, credit_no string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' location '/shopping/data/customer' tblproperties ('skip.header.line.count' = '1');create external table if not exists ext_transaction_details( transaction_id string, customer_id string, store_id string, price decimal(8, 2), product string, purchase_date string, purchase_time string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' location '/shopping/data/transaction' tblproperties ('skip.header.line.count' = '1');drop table ext_store_details;create external table if not exists ext_store_details( store_id string, store_name string, employee_number string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' location '/shopping/data/store' tblproperties ('skip.header.line.count' = '1');drop table ext_store_review;create external table if not exists ext_store_review( store_id string, transaction_id string, review_score string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' location '/shopping/data/review' tblproperties ('skip.header.line.count' = '1');
TIP:
OpenCSVSerde 默认的分隔符 (separator)、quote 以及逃逸字符(escape characters )分别为 、” 以及 ‘
如果我们查看表结构的时候,我们会发现如果 row format serde 为 org.apache.hadoop.hive.serde2.OpenCSVSerde,不管你建表的时候指定字段是什么类型,其显示的都是 string 类型
tblproperties (‘skip.header.line.count’ = ‘1’);从外部表导入数据跳过(忽略)首行。
数据清洗
对transaction_details中的重复数据生成新ID
%hiveuse shopping;with basetb as ( select row_number() over (partition by transaction_id order by transaction_id) as rn , transaction_id , customer_id , store_id , price , product , purchase_date , purchase_time , substr(purchase_date, 0, 7) as purchase_month from ext_transaction_details), basetb2 as ( select `if`(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) transaction_id , customer_id , store_id , price , product , purchase_date , purchase_time , purchase_month from basetb)select *from basetb2where transaction_id like '%fix%';
因为查询的数据比较多,显示不全所有要在后面加一个查询
过滤掉store_review中没有评分的数据
查出有评分的数据
%hiveuse shopping;create view if not exists vm_store_review asselect *from ext_store_reviewwhere review_score '';select * from vm_store_review;
找出PII (personal information identification‘个人信息识别’) 数据进行加密
%hiveuse shopping;drop view customer_detail;create view vm_customer_details asselect customer_id,first_name,unbase64(last_name)as last_name,unbase64(email) as email,gender,unbase64(address) as address,unbase64(concat(unbase64(credit_no),'hello')) as credit_no --二次加密from ext_customer_details;
重新组织transaction数据按照日期YYYY-MM(按月)做分区
%hiveuse shopping;set hive.exec.dynamic.partition.mode=nonstrict;with basetb as ( select row_number() over (partition by transaction_id order by transaction_id) as rn , transaction_id , customer_id , store_id , price , product , purchase_date , purchase_time , substr(purchase_date, 0, 7) as purchase_month from ext_transaction_details)insert overwrite table transaction_details partition (purchase_month)select `if`(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) transaction_id , customer_id , store_id , price , product , purchase_date , purchase_time , purchase_monthfrom basetb;show partitions transaction_details;
Customer分析
1.找出顾客最常用的信用卡
%hiveuse shopping;select credit_type, max(credit_type) countsfrom ext_customer_detailsgroup by credit_typeorder by counts desc;
2.找出户客资料中排名前五的职位名称
select job, count(job) countsfrom ext_customer_detailsgroup by joborder by counts desclimit 5;
3.在美国女性最常用的信用卡
%hiveuse shopping;select credit_type, count(credit_type) countsfrom ext_customer_detailswhere gender = 'Female' and country='United States'group by credit_typeorder by counts desc;
4.按性别和国家进行客户统计
%hiveuse shopping;select country, gender,count(customer_id)from ext_customer_detailsgroup by gender, country;
Transaction
1.计算每月总收入
%hiveuse shopping;select substr(purchase_date,0,7) month,round(sum(price),2) sumfrom ext_transaction_detailsgroup by substr(purchase_date,0,7);
2.计算每个季度的总收入
with basetb as ( select concat_ws('-', cast(year(purchase_date) as string), cast(quarter(purchase_date) as string)) as year_quarter, price from transaction_details)select year_quarter ,sum(price) sumMoney from basetb group by year_quarter;
3.按年计算总收入
%hiveuse shopping;select year(purchase_date) years , round(sum(price),2) sumfrom ext_transaction_detailsgroup by year(purchase_date);
4.按工作日计算总收入
%hiveuse shopping;with basetb as ( select `dayofweek`(purchase_date) weekday, price from transaction_details)select case when (weekday - 1) = 1 then '星期一' when (weekday - 1) = 2 then '星期二' when (weekday - 1) = 3 then '星期三' when (weekday - 1) = 4 then '星期四' when (weekday - 1) = 5 then '星期五' end as weekday, sum(price) sumfrom basetbgroup by weekdayhaving weekday between 2 and 6;
5.按时间段计算总收入(需要清理数据)
%hiveuse shopping;with basetb1 as ( select price, purchase_time, case when purchase_time like '%AM' then split(purchase_time, '\\s+')[0] when purchase_time like '%PM' then concat_ws(':',cast(`if`( (cast(split(purchase_time, ':')[0] as int) + 12) == 24,0, (cast(split(purchase_time, ':')[0] as int) + 12)) as string) , split(split(purchase_time, ':')[1], '\\s+')[0]) else purchase_time end time_format from transaction_details), basetb2 as (select price, purchase_time, (cast(split(time_format, ':')[0] as decimal(4, 2)) + cast(split(time_format, ':')[1] as decimal(4, 2)) / 60) purchase_time_num from basetb1), basetb3 as (select price, purchase_time, `if`(purchase_time_num > 5 and purchase_time_num 8 and purchase_time_num 11 and purchase_time_num 13 and purchase_time_num 18 and purchase_time_num <= 22, 'evening','night'))))) as time_bucket from basetb2)select time_bucket, sum(price) sumfrom basetb3group by time_bucket;
6.按时间段计算平均消费
%hiveuse shopping;with basetb1 as ( select price, purchase_time, case when purchase_time like '%AM' then split(purchase_time, '\\s+')[0] when purchase_time like '%PM' then concat_ws(':',cast(`if`( (cast(split(purchase_time, ':')[0] as int) + 12) == 24,0, (cast(split(purchase_time, ':')[0] as int) + 12)) as string) , split(split(purchase_time, ':')[1], '\\s+')[0]) else purchase_time end time_format from transaction_details), basetb2 as (select price, purchase_time, (cast(split(time_format, ':')[0] as decimal(4, 2)) + cast(split(time_format, ':')[1] as decimal(4, 2)) / 60) purchase_time_num from basetb1), basetb3 as (select price, purchase_time, `if`(purchase_time_num > 5 and purchase_time_num 8 and purchase_time_num 11 and purchase_time_num 13 and purchase_time_num 18 and purchase_time_num <= 22, 'evening','night'))))) as time_bucket from basetb2)select time_bucket, avg(price) avgfrom basetb3group by time_bucket;
7.按工作日计算平均消费
%hiveuse shopping;with basetb as ( select `dayofweek`(purchase_date) weekday, price from transaction_details)select case when (weekday - 1) = 1 then '星期一' when (weekday - 1) = 2 then '星期二' when (weekday - 1) = 3 then '星期三' when (weekday - 1) = 4 then '星期四' when (weekday - 1) = 5 then '星期五' end weekday, avg(price) avgfrom basetbgroup by weekdayhaving weekday between 2 and 6;
8.计算年、月、日的交易总数
%hiveuse shopping;select distinct purchase_date, purchase_month, year(purchase_date), count(1) over (partition by year(purchase_date)) years, count(1) over (partition by year(purchase_date),month(purchase_date)) months, count(1) over (partition by year(purchase_date),month(purchase_date),day(purchase_date)) daysfrom transaction_details;
9.找出交易量最大的10个客户
%hiveuse shopping;selectcustomer_id,count(1) as numfrom transaction_detailsgroup by customer_idorder by num desclimit 10
10.找出消费最多的前10位顾客
%hiveuse shopping;select customer_id, sum(price) as sumfrom transaction_detailsgroup by customer_idorder by sum desclimit 10
11.计该期间交易数量最少的用户
use shopping;select customer_id, count(transaction_id)from transaction_detailsgroup by customer_idorder by count(transaction_id)limit 1;
12.算每个季度的独立客户总数
%hiveuse shopping;with basetb as ( select distinct concat_ws('-', cast(year(purchase_date) as string), cast(quarter(purchase_date) as string)) as year_quarter, customer_id from transaction_details)select year_quarter, count(customer_id) countsfrom basetbgroup by year_quarter;
13.算每周的独立客户总数
%hiveuse shopping;with basetb as ( select distinct concat(cast(year(purchase_date) as string), '-', cast(weekofyear(purchase_date)as string)) weeks, customer_id from transaction_details)select weeks, count(customer_id) countsfrom basetbgroup by weeks;
14.算整个活动客户平均花费的最大值
%hiveuse shopping;select customer_id,avg(price) avgsfrom transaction_detailsgroup by customer_idorder by avgs desclimit 1;
15.计每月花费最多的客户
%hiveuse shopping;with basetb as ( select purchase_month, customer_id, sum(price) sum_price from transaction_details group by purchase_month, customer_id), basetb2 as ( select purchase_month, customer_id, sum_price, row_number() over (partition by purchase_month order by sum_price desc ) rn from basetb)select purchase_month, customer_id, sum_pricefrom basetb2where rn = 1;
16.计每月访问次数最多的客户
%hiveuse shopping;with basetb as ( select purchase_month, customer_id, count(customer_id) counts from transaction_details group by purchase_month, customer_id), basetb2 as ( select purchase_month, customer_id, counts, row_number() over (partition by purchase_month order by counts desc ) rn from basetb)select purchase_month, customer_id, countsfrom basetb2where rn = 1;
17.总价找出最受欢迎的5种产品
select product,sum(price) sumfrom transaction_detailsgroup by productorder by sum desclimit 5;
18.据购买频率找出最畅销的5种产品
select product,count(1) countsfrom transaction_detailsgroup by productorder by counts desclimit 5;
19.据客户数量找出最受欢迎的5种产品
select product,count(distinct customer_id) countsfrom transaction_detailsgroup by productorder by counts desclimit 5
Store分析
1.按客流量找出最受欢迎的商店
%hiveuse shopping;select store_name, count(distinct customer_id) countsfrom transaction_details td join ext_store_details esd on td.store_id = esd.store_idgroup by store_nameorder by counts desc;
2.根据顾客消费价格找出最受欢迎的商店
%hiveuse shopping;select store_name, sum(price) sumsfrom transaction_details td join ext_store_details esd on td.store_id = esd.store_idgroup by store_nameorder by sums desc;
3.根据顾客交易情况找出最受欢迎的商店
%hiveuse shopping;select store_name, count(td.store_id) countsfrom transaction_details td join ext_store_details esd on td.store_id = esd.store_idgroup by store_nameorder by counts desc;
4.根据商店和唯一的顾客id获取最受欢迎的产品
%hiveuse shopping;with basetb as ( select store_id, product, count(distinct customer_id) counts from transaction_details group by store_id, product), basetb2 as ( select store_id, product, counts, rank() over (partition by store_id order by counts desc ) as rn from basetb)select store_name, product, countsfrom basetb2 tb2 join ext_store_details esd on tb2.store_id = esd.store_idwhere rn = 1;
5.获取每个商店的员工与顾客比
%hiveuse shopping;with t1 as (select count(1) c1, store_id from transaction_details td group by td.store_id)select t1.store_id, esd.store_name, concat(substring(cast(esd.employee_number / t1.c1 as decimal(9, 8)) * 100.0, 0, 4), '%') proportionfrom t1 join ext_store_details esd on t1.store_id = esd.store_id;
6.按年和月计算每家店的收入
%hiveuse shopping;select distinct *from ( select store_id, year(purchase_date) year, sum(price) over (partition by year(purchase_date)) sum_year, month(purchase_date) month, sum(price) over (partition by year(purchase_date),month(purchase_date)) sum_month from transaction_details ) tb;
7.按店铺制作总收益饼图
%hiveuse shopping;select store_id,sum(price)from transaction_detailsgroup by store_id
8.找出每个商店最繁忙的时间段
%hiveuse shopping;with basetb1 as ( select store_id, customer_id, purchase_time, case when purchase_time like '%AM' then split(purchase_time, '\\s+')[0] when purchase_time like '%PM' then concat_ws(':', cast(`if`( (cast(split(purchase_time, ':')[0] as int) + 12) == 24, 0, (cast(split(purchase_time, ':')[0] as int) + 12)) as string) , split(split(purchase_time, ':')[1], '\\s+')[0]) else purchase_time end time_format from transaction_details), basetb2 as (select store_id, customer_id, purchase_time, (cast(split(time_format, ':')[0] as decimal(4, 2)) + cast(split(time_format, ':')[1] as decimal(4, 2)) / 60) purchase_time_num from basetb1), basetb3 as (select store_id, customer_id, purchase_time, `if`(purchase_time_num > 5 and purchase_time_num 8 and purchase_time_num 11 and purchase_time_num 13 and purchase_time_num 18 and purchase_time_num <= 22, 'evening', 'night'))))) as time_bucket from basetb2)select esd.store_name, tb3.time_bucket, count(customer_id) countsfrom basetb3 tb3 join ext_store_details esd on tb3.store_id = esd.store_idgroup by esd.store_name, time_bucket;
9.找出每家店的忠实顾客
购买次数大于5,认为他是忠实粉丝
%hiveuse shopping;select *from ( select store_id, customer_id, count(1) counts from transaction_details group by store_id, customer_id) tbwhere tb.counts > 5;
10.根据每位员工的最高收入找出明星商店
%hiveuse shopping;with base as ( select store_id, sum(price) s from transaction_details group by store_id )select base.store_id, base.s / store.employee_number enfrom base join ext_store_details store on base.store_id = store.store_idorder by en desclimit 1;
Review分析
1.在ext_store_review中找出存在冲突的交易映射关系
%hiveuse shopping;with basetb as ( select row_number() over (partition by transaction_id) as row_number1, * from vm_store_review)select row_number1, a.transaction_id, a.store_id, b.store_id, a.review_score, b.review_scorefrom basetb a join vm_store_review b on a.transaction_id = b.transaction_idwhere row_number1 > 1;
2.了解客户评价的覆盖率
%hiveuse shopping;with t1 as ( select count(1) c1 from ext_store_review where review_score ''), t2 as ( select count(1) c2 from ext_store_review where review_score = '' )select concat(cast((c1 - c2) / c1 * 100 as decimal(4, 2)), '%') Coveragefrom t1 join t2;
3.根据评分了解客户的分布情况
%hiveuse shopping;select concat(round(sum(case review_score when '1' then 1 else 0 end) / count(*) * 100, 2), '%') as one_score, concat(round(sum(case review_score when '2' then 1 else 0 end) / count(*) * 100, 2), '%') as two_score, concat(round(sum(case review_score when '3' then 1 else 0 end) / count(*) * 100, 2), '%') as three_score, concat(round(sum(case review_score when '4' then 1 else 0 end) / count(*) * 100, 2), '%') as four_score, concat(round(sum(case review_score when '5' then 1 else 0 end) / count(*) * 100, 2), '%') as five_scorefrom ext_store_review;
4.根据交易了解客户的分布情况
根据总金额
%hiveuse shopping;select country, sum(price) sum_pricefrom transaction_details td join ext_customer_details cd on td.customer_id = cd.customer_idgroup by cd.country;
5.客户给出的最佳评价是否总是同一家门店
%hiveuse shopping;select store_id, customer_id, count(customer_id) countsfrom transaction_details td join ext_store_review esr on esr.transaction_id = td.transaction_idwhere esr.review_score = 5group by td.store_id, td.customer_id;