电商数仓系统(二)
...大约 55 分钟
电商数仓系统(二)
##一、DWS/DWT准备
1.1 业务术语
-- 1. 用户
a、统计流量相关的需求:使用设备id,如统计活跃用户
b、统计业务相关的指标,使用用户id,如统计下单数量
--2. 新增用户
a、如果一个用户首次打开某APP,那这个用户定义为新增用户;
b、卸载再安装的设备,不会被算作一次新增。
c、新增用户包括日新增用户、周新增用户、月新增用户。
-- 3. 活跃用户
a、打开应用的用户即为活跃用户,不考虑用户的使用情况。
b、每天一台设备打开多次会被计为一个活跃用户
-- 4. 周活跃用户
a、某个自然周(月)内启动过应用的用户;
b、该周(月)内的多次启动只记一个活跃用户
-- 5. 月活跃率
a、当月活跃用户 / 总用户数
-- 6. 沉默用户
a、用户仅在安装当天(次日)启动一次,后续时间无再启动行为
b、指标可以反映新增用户质量和用户与APP的匹配程度
-- 7. 版本分布
a、不同版本的周内各天新增用户数,活跃用户数和启动次数
b、判断APP各个版本之间的优劣和用户行为习惯
-- 8. 本周回流用
a、上周未启动应用, 本周启动了应用的用户,且不是本周新增用户
-- 9. 连续n周
a、连续n周, 每周至少启动一次
-- 10. 忠诚用户
a、连续活跃5周以上的用户
-- 10. 连续活跃用户
a、连续两周以上的活跃用户
-- 11. 近期流失用户
a、连续n(2-4周)周没有启动应用的用户
-- 12. 留存用户
a、某段时间内的新增用户,经过一段时间后,仍然使用应用的被认作是留存用户;这部分用户占当时新增用户的比例即是留存率。
b、例如,5月份新增用户200,这200人启动情况:
6月份:启动人数为100人,5月份新增用户一个月后的留存率是50%
7月份:启动人数为80人,5月份新增用户二个月后的留存率是40%
8月份:启动人数为100人,5月份新增用户三个月后的留存率是50%
c、留存用户一般是统计留存率,同时必须有两个参数:哪个月份的几月的留存率
-- 13. 用户新鲜度
a、每天启动应用的新老用户比例
b、用户新鲜度 = 新用户数量 / 当日活跃用户数
-- 14. 单次使用时长
a、每次启动使用的时间长度
-- 15. 日使用时长
a、累计一天内的使用时间长度。
-- 16. 启动次数计算标准
'背景':IOS平台应用退到后台,是真的退出应用,后台会释放资源
Android平台应用退到后台,并不会立即退出应用,因为应用和应用之间存在互相唤醒的问题。
'统计方式':
IOS平台应用退到后台就算一次独立的启动
Android平台我们规定,两次启动之间的间隔小于30秒,被计算一次启动
1.2 系统函数
1.2.1 celloect_set
-- 1. 作用:将一列数据收集变成一行数据,返回一个数组
-- 2. 区别:
collect_set : 会去重
collect_list: 不会去重
-- 3. 案例:
1. 数据准备:
drop table if exists stud;
create table stud (name string, area string, course string, score int);
2. 插入数据:
insert into table stud values('zhang3','bj','math',88);
insert into table stud values('li4','bj','math',99);
insert into table stud values('wang5','sh','chinese',92);
insert into table stud values('zhao6','sh','chinese',54);
insert into table stud values('tian7','bj','chinese',91);
3. 查询数据:
1. select course , collect_set(area),avg(score) from stud group by course
'打印结果':
course _c1 _c2
chinese ["sh","bj"] 79.0
math ["bj"] 93.5
2. select course ,collset_list(area),avg(score) from stud group by course
'打印结果'
course _c1 _c2
math ["bj","bj"] 93.5
chinese ["sh","sh","bj"] 79.0
1.2.2 nvl
-- 1. 函数:
nvl(表达式1,表达式2)
-- 2. 返回值:
只能传递两个参数,从左往右找,找到一个非null的值,就返回,如果两个表示式都为null,则返回null
-- 3. 参数说明:
1.表达式类型:数字型、字符型和日期型
2.两个表达式的数据类型必须相同
-- 4. 案例:
1. select nvl(10,"lianzp"); =>结果:10
2. select nvl(date_add('2020-06-25',-1),1);
报错:The first and seconds arguments of function NLV should have the same type
3. select 'lianzp'+10; => 结果为null
1.2.3 coalesce
-- 1. coalesce(a1, a2, ...)
-- 2. 作用:- Returns the first non-null argument,返回第一不为null的值
-- 3. 和nvl主要区别是:
a、可以有多个参数
-- 4. 要求所有参数的类型保持一致:
The expressions after COALESCE should all have the same type
1.2.4日期函数(重要)
-- 1. date_format函数(根据格式整理日期)
hive (gmall)> select date_format('2020-06-14','yyyy-MM');
=>2020-06
-- 2. date_add函数(加减日期)
hive (gmall)> select date_add('2020-06-14',-1);
=> 2020-06-13
hive (gmall)> select date_add('2020-06-14',1);
=> 2020-06-15
-- 3. next_day(start_date, day_of_week)
1. 返回当期日期的下一个周几。
2. day_of_week:可选参数:-- 可以写前面两个字母,实际开发中,周一和周末使用频率最高
星期一到星期日的英文(Monday,Tuesday、Wednesday、Thursday、Friday、Saturday、Sunday)
Returns the first date which is later than start_date and named as indicated.
3. 需求:
a、需求1:返回下一个周一
select next_day('2020-07-07','mo');
=>2020-07-13
b、需求2:返回本周末:求出下一个周一再减1天
select date_add(next_day('2020-07-07','mo'),-1);
=> 2020-07-12
c、需求3:返回上个周日,求出下一个周一再减8天
select date_add(next_day('2020-07-07','mo'),-8);
=> 2020-07-05
-- 4. last_day(date)
返回当前日期的最后一天
Returns the last day of the month which the date belongs to
select last_day('2020-07-07'); => 2020-07-31
-- 5. 当前日期:current_date
select current_day;
=>2020-07-07
二、DWS/DWT层
2.1 DWS/DWT思想
-- 1. dwd层和dws/dwt的区别?
DWD层建表时不要考虑用户的需求,而dws和dwd层,以用户需求为驱动,统计各个维度的相关指标;
-- 2. dws和dwt都是建宽表,建宽表的目的是优化查询,减少重复查询的步骤。
-- 3. 用户后续的需求是什么样的呢?
主要是报表,体现的指标有个数、次数、金额等指标。
'体现的方式':维度 + 时间 + 指标
'案例':
a、地区维度:北京地区今天的下单金额
b、商品维度:1号商品今天下单数量
-- 4. 建宽表的思路:
1. 一个维度作为一个宽表;
2. 字段:以维度作为关键字 + 和该维度相关的所有事实表的度量值
-- 5. DWS和DWT的主要区别:
1. dws的数据来源于DWD层,站在维度的角度,看事实表的度量值,统计每个主题当天的行为
2. dwt的主要数据来源dws,站在维度的角度,看事实表的开始时间、结束时间、累积到现在的度量值,累积一段时间的度量值,
统计每个主题累计行为,如统计最近7天的下单金额。
-- 6. 宽表是应对一些常用的需求,并不是所有的需求都会包含进来,如果一些特殊需求,可以去到dwd层获取数据
2.2 用户行为数据
2.2.1 每日设备DWS层
- 建表语句
drop table if exists dws_uv_detail_daycount;
create external table dws_uv_detail_daycount
(
`mid_id` string,
`brand` string,
`model` string,
`login_count` bigint COMMENT '活跃次数',
`page_stats` array<struct<page_id:string,page_count:bigint>> COMMENT '页面访问统计'
)
partitioned by(dt string)
stored as parquet
location '/warehouse/gmall/dws/dws_uv_detail_daycount';
- 加载数据
with
tmp_start_log as (
select
mid_id ,
brand ,
model,
count(*) login_count
from dwd_start_log
where dt = '2020-06-25'
group by mid_id,brand,model
),
tmp_page_log as (
select
mid_id ,
brand ,
model,
collect_set(named_struct("page_id",page_id,"page_count",page_count)) page_stats
from (
select
mid_id ,
brand ,
model,
page_id,
count(*) page_count
from dwd_page_log
where dt ='2020-06-25'
group by mid_id ,brand,model,page_id
)tmp
group by mid_id ,brand,model
)
insert overwrite table dws_uv_detail_daycount partition(dt='2020-06-25')
select
tmp_start_log.mid_id ,
tmp_start_log.brand ,
tmp_start_log.model,
login_count,
page_stats
from tmp_start_log
join tmp_page_log
on tmp_start_log.mid_id = tmp_page_log.mid_id
and tmp_start_log.brand = tmp_page_log.brand
and tmp_start_log.model = tmp_page_log.model
2.2.2 设备主题DWT层
- 建表语句
drop table if exists dwt_uv_topic;
create external table dwt_uv_topic
(
`mid_id` string,
`brand` string,
`model` string,
`login_date_first` string comment '首次活跃时间',
`login_date_last` string comment '末次活跃时间',
`login_day_count` bigint comment '当日活跃次数',
`login_count` bigint comment '累积活跃天数'
)
stored as parquet
location '/warehouse/gmall/dwt/dwt_uv_topic';
- 插入数据
insert overwrite table dwt_uv_topic
select
nvl(old.mid_id,new.mid_id),
nvl(old.brand,new.brand),
nvl(old.model,new.model),
if(old.login_date_first is null and new.login_count > 0 ,'2020-06-25',old.login_date_first),
if(new.login_count > 0 ,'2020-06-25',old.login_date_last),
nvl(new.login_count,0),
nvl(old.login_count,0) + if(new.login_count> 0 , 1 ,0)
from dwt_uv_topic old
full join (
select
mid_id ,
brand ,
model,
login_count
from dws_uv_detail_daycount
where dt ='2020-06-25'
)new
on old.mid_id = new.mid_id
2.3 业务数据
2.3.1 会员行为
2.3.1.1 会员DWS层
-- 1. 建表过程:
'准备': 事实表:订单、订单详情、优惠券领用、支付、退款、收藏、加购物车、评价
'第一步':找到和用户维度相关的所有事实表:订单、订单详情、优惠券领用、支付、退款、收藏、加购物车、评价
'第二步': 找到这些事实表的所有度量值字段:
'第三步':维度主键 + 第二步获取的字段作为dws宽表的字段。
-- 备注:理论上上述所有事实表都需要进行统计,但是在本案例中,只统计了订单、订单详情表、支付,加购物车四个事实表的数据。
-- 2. 数据的来源:来自于DWD层
-- 3. 表中的数据说明:
1. 是分区表,每个分区为当天的数据
2. 每一行数据代表一个用户当天的行为
-- 4. 根据建表字段,确定每个字段来自于哪张表中
-- 5. 数据存储格式:列式存储 + lzo压缩
- 建表语句
drop table if exists dws_user_action_daycount;
create external table dws_user_action_daycount
(
user_id string comment '用户 id',
login_count bigint comment '登录次数',
cart_count bigint comment '加入购物车次数',
order_count bigint comment '下单次数',
order_amount decimal(16,2) comment '下单金额',
payment_count bigint comment '支付次数',
payment_amount decimal(16,2) comment '支付金额',
order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> comment '下单明细统计'
) COMMENT '每日用户行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action_daycount/'
tblproperties ("parquet.compression"="lzo");
- 第一步:确定各个字段来自哪个表
user_id string comment '用户 id',--从dwd_start_log获取
login_count bigint comment '登录次数', --从dwd_start_log获取
cart_count bigint comment '加入购物车次数',--dwd_action_log
order_count bigint comment '下单次数',--dwd_fact_order_info
order_amount decimal(16,2) comment '下单金额',--dwd_fact_order_info
payment_count bigint comment '支付次数',--dwd_fact_order_info
payment_amount decimal(16,2) comment '支付金额',--dwd_fact_order_info
order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> comm
-- dwd_fact_order_detail
说明:1. 加入购车的数据,需要去启动日志中获取,因为加购事实表是每天一个快照,不保留中间操作的过程,所以去到日志action
获取,点击一次加购物车操作,记一次加入购物车的次数
2. 用户登录次数,从启动日志中获取,启动一次算作今天登录一次,但是要注意有些登录不是会员,所以需要过滤user_id为
null的数据;
- 第二步:确定dws一行数据代表什么意思?
用户行为表的一行数据代表:一个用户今天登陆次数、加购物车数量、下单数量、下单金额等数据
- 第三步:确定dwd层所有相关表的同步策略
'dwd_start_log':每日新增数据
'dwd_action_log':每日新增数据
'dwd_fact_order_info':事务型
'dwd_fact_order_detail':事务型
'dwd_fact_payment_info':事务型
- 第四步:从各个表中获取对应的字段
-- 1. 获取用户id和登录次数
select
user_id,
count(*) login_count
from dwd_start_log
where dt='2020-06-25'
and user_id is not null
group by user_id
---------------------------------------------
-- 2. 获取加购物车的次数
select
user_id,
count(*) cart_count
from dwd_action_log
where dt='2020-06-25'
and user_id is not null
and action_id='cart_add'
group by user_id
------------------------------------------
-- 3. 获取下单的次数和下单金额
select
user_id,
count(*) order_count
sum(final_total_amount) order_amount
from dwd_fact_order_info
where dt = '2020-06-25'
group by user_id
------------------------------------------------
-- 4. 获取支付的金额和支付次数
select
user_id,
count(*) payment_count,
sum(payment_amount) payment_amount
from dwd_fact_payment_info
where dt = '2020-06-25'
group by user_id
----------------------------------------------
-- 5. 获取下单明细
select
user_id,
collect_set(named_struct("sku_id",sku_id,"sku_num",sku_num,"order_count",order_count,"order_amount",order_amount))order_detail_stats
from (
select
user_id,
sku_id,
sum(sku_num) sku_num,
count(*) order_count,
cast(sum(final_amount_d) as demical(20,2)) order_amount
from dwd_fact_order_detail
where dt = '2020-06-25'
group by user_id,sku_id
)tmp
group by user_id
- 第五步:组装以后插入数据
with
tmp_start_log as (
select
user_id,
count(*) login_count
from dwd_start_log
where dt='2020-06-25'
and user_id is not null
group by user_id
),
tmp_action_log as (
select
user_id,
count(*) cart_count
from dwd_action_log
where dt='2020-06-25'
and user_id is not null
and action_id='cart_add'
group by user_id
),
tmp_order_info as (
select
user_id,
count(*) order_count,
sum(final_total_amount) order_amount
from dwd_fact_order_info
where dt = '2020-06-25'
group by user_id
),
tmp_payment_info as (
select
user_id,
count(*) payment_count,
sum(payment_amount) payment_amount
from dwd_fact_payment_info
where dt = '2020-06-25'
group by user_id
),
tmp_order_detail as (
select
user_id,
collect_set(named_struct("sku_id",sku_id,"sku_num",sku_num,"order_count",order_count,"order_amount",order_amount)) order_detail_stats
from (
select
user_id,
sku_id,
sum(sku_num) sku_num,
count(*) order_count,
cast(sum(final_amount_d) as decimal(20,2)) order_amount
from dwd_fact_order_detail
where dt = '2020-06-25'
group by user_id,sku_id
)tmp
group by user_id
)
insert overwrite table dws_user_action_daycount partition (dt='2020-06-25')
select
tmp_start_log.user_id ,
login_count ,
nvl(cart_count,0) ,
nvl(order_count,0) ,
nvl(order_amount,0.0) ,
nvl(payment_count,0) ,
nvl(payment_amount,0.0) ,
order_detail_stats
from tmp_start_log
left join tmp_action_log on tmp_start_log.user_id = tmp_action_log.user_id
left join tmp_order_info on tmp_start_log.user_id = tmp_order_info.user_id
left join tmp_payment_info on tmp_start_log.user_id = tmp_payment_info.user_id
left join tmp_order_detail on tmp_start_log.user_id = tmp_order_detail.user_id
2.3.1.2 会员DWT层
-- 1. 说明:
1. dwt和dws层的字段基本是一一对应的。
2. dws是当天的数据,dwt是累积数据,累积涉及到时间,比如累积7天,累积3天
-- 2. dwt和dws的维度字段,我们也可以直接放到表中,后续统计需求可以用到。
-- 3. dwt数据说明:
1. 是维度全量表
2. 数据源来自于dws层
3. 不是分区表
-- 4. 如何维护dwt表,即如何向这个全量表中插入数据?
1. dwt的数据每天都需要进行更新;
2. 更新涉及到新数据和老数据
3. 更新方式:
a、取累积时间周期的数据;
b、如果是累积字段,使用聚合函数求值
c、如果非累积字段,使用判断语句求当天的数据
d、然后新旧数据今天合并,由于旧数据不是分区数据
那么累积数据直接使用新表累积值,而非累积字段,采用更新的方式
- 建表语句
drop table if exists dwt_user_topic;
create external table dwt_user_topic
(
user_id string comment '用户id',
login_date_first string comment '首次登录时间',
login_date_last string comment '末次登录时间',
login_count bigint comment '累积登录天数',
login_last_30d_count bigint comment '最近30日登录天数',
order_date_first string comment '首次下单时间',
order_date_last string comment '末次下单时间',
order_count bigint comment '累积下单次数',
order_amount decimal(16,2) comment '累积下单金额',
order_last_30d_count bigint comment '最近30日下单次数',
order_last_30d_amount bigint comment '最近30日下单金额',
payment_date_first string comment '首次支付时间',
payment_date_last string comment '末次支付时间',
payment_count decimal(16,2) comment '累积支付次数',
payment_amount decimal(16,2) comment '累积支付金额',
payment_last_30d_count decimal(16,2) comment '最近30日支付次数',
payment_last_30d_amount decimal(16,2) comment '最近30日支付金额'
)COMMENT '用户主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_user_topic/'
tblproperties ("parquet.compression"="lzo");
- 分析
旧表:dwt前一天的数据
新表:从dws层获取的累积30天的数据
user_id string comment '用户id', --
login_date_first string comment '首次登录时间',--如果旧表中有登录时间就使用旧表中的时间,否则使用当天时间
login_date_last string comment '末次登录时间',-- 如果新表中今天的登入次数大于0,那么末次登录时间使用今天时间,否则使用旧时间
login_count bigint comment '累积登录天数', -- 如果新表中的登入次数大于0,那么旧表中数据 + 1
login_last_30d_count bigint comment '最近30日登录天数', -- 使用新表中累积计算数据,如果为null,则选择0
order_date_first string comment '首次下单时间',-- 如果旧表中首次下单时间为null且新表中下单次数大于0,那么使用今天时间,否则使用旧表数据
order_date_last string comment '末次下单时间',-- 如果新表的下单数据大于0,则使用当天时间,否则使用旧表数据
order_count bigint comment '累积下单次数',-- 将旧表下单数据和新表下单次数直接相加
order_amount decimal(16,2) comment '累积下单金额',--将旧表下单金额和新表下单金额相加
order_last_30d_count bigint comment '最近30日下单次数',--直接使用新表数据,如果新表数据为null,则使用0
order_last_30d_amount bigint comment '最近30日下单金额',--直接使用新表数据,如果新表数据为null,则使用0
payment_date_first string comment '首次支付时间',
payment_date_last string comment '末次支付时间',
payment_count decimal(16,2) comment '累积支付次数',
payment_amount decimal(16,2) comment '累积支付金额',
payment_last_30d_count decimal(16,2) comment '最近30日支付次数',
payment_last_30d_amount decimal(16,2) comment '最近30日支付金额'
- 装载数据
insert overwrite table dwt_user_topic
select
nvl(old.user_id,new.user_id) ,
nvl(old.login_date_first,'2020-06-25'),
if(new.user_id is not null , '2020-06-25',old.login_date_last),
nvl(old.login_count,0) + if(new.user_id is not null,1,0),
nvl(new.login_last_30d_count,0),
if(old.order_date_first is null and new.order_count > 0 , '2020-06-25',old.order_date_first),
if(new.order_count > 0,'2020-06-25',old.order_date_first),
nvl(old.order_count,0) + nvl(new.order_count,0),
nvl(old.order_amount,0) + nvl(new.order_amount,0),
nvl(new.order_last_30d_count,0),
nvl(new.order_last_30d_amount,0),
if(old.payment_date_first is null and new.payment_count > 0 ,'2020-06-25',old.payment_date_first),
if(new.payment_count > 0 ,'2020-06-25',old.payment_date_last),
nvl(old.payment_count,0) + nvl(new.payment_count,0),
nvl(old.payment_amount,0) + nvl(new.payment_amount,0),
nvl(new.payment_last_30d_count,0),
nvl(new.payment_last_30d_amount,0)
from dwt_user_topic old
full join (
select
user_id,
sum(if(dt='2020-06-25',login_count,0)) login_count,--当天登录次数
sum(if(dt='2020-06-25',cart_count,0)) cart_count,--当天加入购物车次数
sum(if(dt='2020-06-25',order_count,0)) order_count,--当天下单次数
sum(if(dt='2020-06-25',order_amount,0.0)) order_amount,--当天下单金额
sum(if(dt='2020-06-25',payment_count,0)) payment_count,--当天支付次数
sum(if(dt='2020-06-25',payment_amount,0.0)) payment_amount,--当天支付金额
sum(if(login_count > 0,1,0)) login_last_30d_count,--累积30天登录次数
sum(order_count) order_last_30d_count,--最近30日下单次数
sum(order_amount) order_last_30d_amount,--最近30日下单金额
sum(payment_count) payment_last_30d_count,--最近30日下单次数
sum(payment_amount) payment_last_30d_amount--最近30日下单金额
from dws_user_action_daycount
where dt > date_add(dt , -30)
group by user_id
)new
on old.user_id = new.user_id
2.3.2 商品行为
2.3.2.1 商品DWS层
-- 解析:
1. 表的字段如何创建
'准备': 事实表:订单、订单详情、优惠券领用、支付、退款、收藏、加购物车、评价
'第一步':找到和商品这个维度相关的所有事实表:订单详情、支付、退款、收藏、加购物车、评价
'第二步':获取上述事实表的并取其度量值
'第三步':将商品维度id + 第二步度量值的字段作为dws层的字段,创建商品行为的dws层表
2. 如何向表中插入数据?
'第一步':根据dws层表的字段,确定每个字段来自于哪个表
'第二步':确定第一步中所有表的同步策略,确定表中存储的数据是什么及每行数据存储的是什么
'第三步':一个一个字段来获取最后进行合并
3. 表保存什么数据
4. 表中每行数据是什么
5. 存储格式
1. 分区表,每个分区保留当前最新的数据
2. 列式存储 + lzo压缩
- 建表语句
drop table if exists dws_sku_action_daycount;
create external table dws_sku_action_daycount
(
sku_id string comment 'sku_id',
order_count bigint comment '被下单次数',
order_num bigint comment '被下单件数',
order_amount decimal(16,2) comment '被下单金额',
payment_count bigint comment '被支付次数',
payment_num bigint comment '被支付件数',
payment_amount decimal(16,2) comment '被支付金额',
refund_count bigint comment '被退款次数',
refund_num bigint comment '被退款件数',
refund_amount decimal(16,2) comment '被退款金额',
cart_count bigint comment '被加入购物车次数',
favor_count bigint comment '被收藏次数',
appraise_good_count bigint comment '好评数',
appraise_mid_count bigint comment '中评数',
appraise_bad_count bigint comment '差评数',
appraise_default_count bigint comment '默认评价数'
) COMMENT '每日商品行为'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_sku_action_daycount/'
tblproperties ("parquet.compression"="lzo");
- 分析过程
==================================================
-- 1. 相关表:订单详情事实表
2. 同步策略:事务型事实表,以订单的创建时间为分区数据
3. 保存的数据:一个分区保存当天所有的下单的信息
3. 相关字段如下:
order_count bigint comment '被下单次数',
order_num bigint comment '被下单件数',
order_amount decimal(16,2) comment '被下单金额',
select
sku_id,
count(*) order_count,--被下单次数
sum(sku_num) order_num, --被下单件数
sum(original_amount_d) order_amount --被下单金额
from dwd_fact_order_detail
where dt= '2020-06-25'
group by sku_id
-------------------------------------------------
-- 1. 相关表:支付事实表
2. 同步策略:事务型事实表,以订单支付时间为分区
3. 保存的数据:一个分区保存当天所有的支付信息
4. 相关字段如下:
5. 说明: 有一种情况,昨天晚上下单,但是今天才支付,那么下单的sku_id不在支付中,这样可能会导致数据丢失
payment_count bigint comment '被支付次数',
payment_num bigint comment '被支付件数',
payment_amount decimal(16,2) comment '被支付金额',
select
sku_id,
count(*) payment_count,
sum(final_amount_d) payment_amount
from dwd_fact_order_detail
where dt='2020-06-25'
and order_id in (
select
order_id
from dwd_fact_payment_info
where (dt='2020-06-25'
or dt=date_add('2020-06-25',-1))
and date_format(payment_time,'yyyy-MM-dd')='2020-06-25'
)
group by sku_id
-------------------------------------------
-- 1. 相关表:退款事实表
2. 同步策略:事务型事实表,退款时间为分区
3. 保存的数据:一个分区保存当天所有退款的数据
4. 相关字段如下:
refund_count bigint comment '被退款次数',
refund_num bigint comment '被退款件数',
refund_amount decimal(16,2) comment '被退款金额',
select
sku_id,
sum(refund_count) refund_num,
sum(refund_amount) refund_amount
from dwd_fact_order_refund_info
where dt = '2020-06-25'
group by sku_id
-----------------------------------------------
-- 1. 相关表:加购车
2. 同步策略:周期型快照事实表
3. 保存的数据:每天一个快照
4. 相关字段如下:
cart_count bigint comment '被加入购物车次数',
select
sku_id,
count(*) cart_count
from dwd_fact_cart_info
where create_time = '2020-06-25'
group by sku_id
-------------------------------------------
-- 1. 相关表:收藏表
2. 同步策略:周期型快照事实表
3. 保存的数据:每天一个快照
4. 相关字段如下:
favor_count bigint comment '被收藏次数',
select
sku_id,
count(*) favor_count
from dwd_fact_favor_info
where create_time = '2020-06-25'
group by sku_id
-----------------------------------------------
-- 1. 相关表:评价事实表
2. 同步策略:事务型事实表,以订单支付时间为分区
3. 保存的数据:每个分区保留当天的评价数据
4. 相关字段如下:
appraise_good_count bigint comment '好评数',
appraise_mid_count bigint comment '中评数',
appraise_bad_count bigint comment '差评数',
appraise_default_count bigint comment '默认评价数'
select
sku_id,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count,
from dwd_fact_comment_info
where dt = '2020-06-25'
group by sku_id
- 插入数据
with
tmp_order as
(
select
sku_id,
count(*) order_count,
sum(sku_num) order_num,
sum(final_amount_d) order_amount
from dwd_fact_order_detail
where dt='2020-06-25'
group by sku_id
),
tmp_payment as
(
select
sku_id,
count(*) payment_count,
sum(sku_num) payment_num,
sum(final_amount_d) payment_amount
from dwd_fact_order_detail
where dt='2020-06-25'
and order_id in
(
select
id
from dwd_fact_order_info
where (dt='2020-06-25'
or dt=date_add('2020-06-25',-1))
and date_format(payment_time,'yyyy-MM-dd')='2020-06-25'
)
group by sku_id
),
tmp_refund as
(
select
sku_id,
count(*) refund_count,
sum(refund_num) refund_num,
sum(refund_amount) refund_amount
from dwd_fact_order_refund_info
where dt='2020-06-25'
group by sku_id
),
tmp_cart as
(
select
item sku_id,
count(*) cart_count
from dwd_action_log
where dt='2020-06-25'
and user_id is not null
and action_id='cart_add'
group by item
),tmp_favor as
(
select
item sku_id,
count(*) favor_count
from dwd_action_log
where dt='2020-06-25'
and user_id is not null
and action_id='favor_add'
group by item
),
tmp_appraise as
(
select
sku_id,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count
from dwd_fact_comment_info
where dt='2020-06-25'
group by sku_id
)
insert overwrite table dws_sku_action_daycount partition(dt='2020-06-25')
select
sku_id,
sum(order_count),
sum(order_num),
sum(order_amount),
sum(payment_count),
sum(payment_num),
sum(payment_amount),
sum(refund_count),
sum(refund_num),
sum(refund_amount),
sum(cart_count),
sum(favor_count),
sum(appraise_good_count),
sum(appraise_mid_count),
sum(appraise_bad_count),
sum(appraise_default_count)
from
(
select
sku_id,
order_count,
order_num,
order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_order
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
payment_count,
payment_num,
payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_payment
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
refund_count,
refund_num,
refund_amount,
0 cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_refund
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_cart
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_favor
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
0 favor_count,
appraise_good_count,
appraise_mid_count,
appraise_bad_count,
appraise_default_count
from tmp_appraise
)tmp
group by sku_id;
2.3.2.2 商品DWT层
- 建表语句
drop table if exists dwt_sku_topic;
create external table dwt_sku_topic
(
sku_id string comment 'sku_id',
spu_id string comment 'spu_id',
order_last_30d_count bigint comment '最近30日被下单次数',
order_last_30d_num bigint comment '最近30日被下单件数',
order_last_30d_amount decimal(16,2) comment '最近30日被下单金额',
order_count bigint comment '累积被下单次数',
order_num bigint comment '累积被下单件数',
order_amount decimal(16,2) comment '累积被下单金额',
payment_last_30d_count bigint comment '最近30日被支付次数',
payment_last_30d_num bigint comment '最近30日被支付件数',
payment_last_30d_amount decimal(16,2) comment '最近30日被支付金额',
payment_count bigint comment '累积被支付次数',
payment_num bigint comment '累积被支付件数',
payment_amount decimal(16,2) comment '累积被支付金额',
refund_last_30d_count bigint comment '最近三十日退款次数',
refund_last_30d_num bigint comment '最近三十日退款件数',
refund_last_30d_amount decimal(16,2) comment '最近三十日退款金额',
refund_count bigint comment '累积退款次数',
refund_num bigint comment '累积退款件数',
refund_amount decimal(16,2) comment '累积退款金额',
cart_last_30d_count bigint comment '最近30日被加入购物车次数',
cart_count bigint comment '累积被加入购物车次数',
favor_last_30d_count bigint comment '最近30日被收藏次数',
favor_count bigint comment '累积被收藏次数',
appraise_last_30d_good_count bigint comment '最近30日好评数',
appraise_last_30d_mid_count bigint comment '最近30日中评数',
appraise_last_30d_bad_count bigint comment '最近30日差评数',
appraise_last_30d_default_count bigint comment '最近30日默认评价数',
appraise_good_count bigint comment '累积好评数',
appraise_mid_count bigint comment '累积中评数',
appraise_bad_count bigint comment '累积差评数',
appraise_default_count bigint comment '累积默认评价数'
)COMMENT '商品主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_sku_topic/'
tblproperties ("parquet.compression"="lzo");
- 插入数据
insert overwrite table dwt_sku_topic
select
nvl(new.sku_id,old.sku_id),
sku_info.spu_id,
nvl(new.order_count30,0),
nvl(new.order_num30,0),
nvl(new.order_amount30,0),
nvl(old.order_count,0) + nvl(new.order_count,0),
nvl(old.order_num,0) + nvl(new.order_num,0),
nvl(old.order_amount,0) + nvl(new.order_amount,0),
nvl(new.payment_count30,0),
nvl(new.payment_num30,0),
nvl(new.payment_amount30,0),
nvl(old.payment_count,0) + nvl(new.payment_count,0),
nvl(old.payment_num,0) + nvl(new.payment_count,0),
nvl(old.payment_amount,0) + nvl(new.payment_count,0),
nvl(new.refund_count30,0),
nvl(new.refund_num30,0),
nvl(new.refund_amount30,0),
nvl(old.refund_count,0) + nvl(new.refund_count,0),
nvl(old.refund_num,0) + nvl(new.refund_num,0),
nvl(old.refund_amount,0) + nvl(new.refund_amount,0),
nvl(new.cart_count30,0),
nvl(old.cart_count,0) + nvl(new.cart_count,0),
nvl(new.favor_count30,0),
nvl(old.favor_count,0) + nvl(new.favor_count,0),
nvl(new.appraise_good_count30,0),
nvl(new.appraise_mid_count30,0),
nvl(new.appraise_bad_count30,0),
nvl(new.appraise_default_count30,0) ,
nvl(old.appraise_good_count,0) + nvl(new.appraise_good_count,0),
nvl(old.appraise_mid_count,0) + nvl(new.appraise_mid_count,0),
nvl(old.appraise_bad_count,0) + nvl(new.appraise_bad_count,0),
nvl(old.appraise_default_count,0) + nvl(new.appraise_default_count,0)
from
dwt_sku_topic old
full outer join
(
select
sku_id,
sum(if(dt='2020-06-25', order_count,0 )) order_count,
sum(if(dt='2020-06-25',order_num ,0 )) order_num,
sum(if(dt='2020-06-25',order_amount,0 )) order_amount ,
sum(if(dt='2020-06-25',payment_count,0 )) payment_count,
sum(if(dt='2020-06-25',payment_num,0 )) payment_num,
sum(if(dt='2020-06-25',payment_amount,0 )) payment_amount,
sum(if(dt='2020-06-25',refund_count,0 )) refund_count,
sum(if(dt='2020-06-25',refund_num,0 )) refund_num,
sum(if(dt='2020-06-25',refund_amount,0 )) refund_amount,
sum(if(dt='2020-06-25',cart_count,0 )) cart_count,
sum(if(dt='2020-06-25',favor_count,0 )) favor_count,
sum(if(dt='2020-06-25',appraise_good_count,0 )) appraise_good_count,
sum(if(dt='2020-06-25',appraise_mid_count,0 ) ) appraise_mid_count ,
sum(if(dt='2020-06-25',appraise_bad_count,0 )) appraise_bad_count,
sum(if(dt='2020-06-25',appraise_default_count,0 )) appraise_default_count,
sum(order_count) order_count30 ,
sum(order_num) order_num30,
sum(order_amount) order_amount30,
sum(payment_count) payment_count30,
sum(payment_num) payment_num30,
sum(payment_amount) payment_amount30,
sum(refund_count) refund_count30,
sum(refund_num) refund_num30,
sum(refund_amount) refund_amount30,
sum(cart_count) cart_count30,
sum(favor_count) favor_count30,
sum(appraise_good_count) appraise_good_count30,
sum(appraise_mid_count) appraise_mid_count30,
sum(appraise_bad_count) appraise_bad_count30,
sum(appraise_default_count) appraise_default_count30
from dws_sku_action_daycount
where dt >= date_add ('2020-06-25', -30)
group by sku_id
)new
on new.sku_id = old.sku_id
left join
(select * from dwd_dim_sku_info where dt='2020-06-25') sku_info
on nvl(new.sku_id,old.sku_id)= sku_info.id;
2.3.4 活动统计
2.3.4.1 活动DWS层
- 建表
drop table if exists dwt_activity_topic;
create external table dwt_activity_topic(
`id` string COMMENT '编号',
`activity_name` string COMMENT '活动名称',
`activity_type` string COMMENT '活动类型',
`start_time` string COMMENT '开始时间',
`end_time` string COMMENT '结束时间',
`create_time` string COMMENT '创建时间',
`display_day_count` bigint COMMENT '当日曝光次数',
`order_day_count` bigint COMMENT '当日下单次数',
`order_day_amount` decimal(20,2) COMMENT '当日下单金额',
`payment_day_count` bigint COMMENT '当日支付次数',
`payment_day_amount` decimal(20,2) COMMENT '当日支付金额',
`display_count` bigint COMMENT '累积曝光次数',
`order_count` bigint COMMENT '累积下单次数',
`order_amount` decimal(20,2) COMMENT '累积下单金额',
`payment_count` bigint COMMENT '累积支付次数',
`payment_amount` decimal(20,2) COMMENT '累积支付金额'
) COMMENT '活动主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwt/dwt_activity_topic/'
tblproperties ("parquet.compression"="lzo");
- 插入数据
with
tmp_op as
(
select
activity_id,
sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',1,0)) order_count,
sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) order_amount,
sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',1,0)) payment_count,
sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) payment_amount
from dwd_fact_order_info
where (dt='2020-06-25' or dt=date_add('2020-06-25',-1))
and activity_id is not null
group by activity_id
),
tmp_display as
(
select
item activity_id,
count(*) display_count
from dwd_display_log
where dt='2020-06-25'
and item_type='activity_id'
group by item
),
tmp_activity as
(
select
*
from dwd_dim_activity_info
where dt='2020-06-25'
)
insert overwrite table dws_activity_info_daycount partition(dt='2020-06-25')
select
nvl(tmp_op.activity_id,tmp_display.activity_id),
tmp_activity.activity_name,
tmp_activity.activity_type,
tmp_activity.start_time,
tmp_activity.end_time,
tmp_activity.create_time,
tmp_display.display_count,
tmp_op.order_count,
tmp_op.order_amount,
tmp_op.payment_count,
tmp_op.payment_amount
from tmp_op
full outer join tmp_display on tmp_op.activity_id=tmp_display.activity_id
left join tmp_activity on nvl(tmp_op.activity_id,tmp_display.activity_id)=tmp_activity.id;
2.3.4.2 活动DWT层
- 建表
drop table if exists dwt_activity_topic;
create external table dwt_activity_topic(
`id` string COMMENT '编号',
`activity_name` string COMMENT '活动名称',
`activity_type` string COMMENT '活动类型',
`start_time` string COMMENT '开始时间',
`end_time` string COMMENT '结束时间',
`create_time` string COMMENT '创建时间',
`display_day_count` bigint COMMENT '当日曝光次数',
`order_day_count` bigint COMMENT '当日下单次数',
`order_day_amount` decimal(20,2) COMMENT '当日下单金额',
`payment_day_count` bigint COMMENT '当日支付次数',
`payment_day_amount` decimal(20,2) COMMENT '当日支付金额',
`display_count` bigint COMMENT '累积曝光次数',
`order_count` bigint COMMENT '累积下单次数',
`order_amount` decimal(20,2) COMMENT '累积下单金额',
`payment_count` bigint COMMENT '累积支付次数',
`payment_amount` decimal(20,2) COMMENT '累积支付金额'
) COMMENT '活动主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwt/dwt_activity_topic/'
tblproperties ("parquet.compression"="lzo");
- 插入数据
insert overwrite table dwt_activity_topic
select
nvl(new.id,old.id),
nvl(new.activity_name,old.activity_name),
nvl(new.activity_type,old.activity_type),
nvl(new.start_time,old.start_time),
nvl(new.end_time,old.end_time),
nvl(new.create_time,old.create_time),
nvl(new.display_count,0),
nvl(new.order_count,0),
nvl(new.order_amount,0.0),
nvl(new.payment_count,0),
nvl(new.payment_amount,0.0),
nvl(new.display_count,0)+nvl(old.display_count,0),
nvl(new.order_count,0)+nvl(old.order_count,0),
nvl(new.order_amount,0.0)+nvl(old.order_amount,0.0),
nvl(new.payment_count,0)+nvl(old.payment_count,0),
nvl(new.payment_amount,0.0)+nvl(old.payment_amount,0.0)
from
(
select
*
from dwt_activity_topic
)old
full outer join
(
select
*
from dws_activity_info_daycount
where dt='2020-06-25'
)new
on old.id=new.id;
2.3.5 地区统计
2.3.5.1 地区DWS层
- 建表语句
drop table if exists dws_area_stats_daycount;
create external table dws_area_stats_daycount(
`id` bigint COMMENT '编号',
`province_name` string COMMENT '省份名称',
`area_code` string COMMENT '地区编码',
`iso_code` string COMMENT 'iso编码',
`region_id` string COMMENT '地区ID',
`region_name` string COMMENT '地区名称',
`login_count` string COMMENT '活跃设备数',
`order_count` bigint COMMENT '下单次数',
`order_amount` decimal(20,2) COMMENT '下单金额',
`payment_count` bigint COMMENT '支付次数',
`payment_amount` decimal(20,2) COMMENT '支付金额'
) COMMENT '每日地区信息表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_area_stats_daycount/'
tblproperties ("parquet.compression"="lzo");
- 插入数据
with
tmp_login as
(
select
area_code,
count(*) login_count
from dwd_start_log
where dt='2020-06-25'
group by area_code
),
tmp_op as
(
select
province_id,
sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',1,0)) order_count,
sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) order_amount,
sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',1,0)) payment_count,
sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-25',final_total_amount,0)) payment_amount
from dwd_fact_order_info
where (dt='2020-06-25' or dt=date_add('2020-06-25',-1))
group by province_id
)
insert overwrite table dws_area_stats_daycount partition(dt='2020-06-25')
select
pro.id,
pro.province_name,
pro.area_code,
pro.iso_code,
pro.region_id,
pro.region_name,
nvl(tmp_login.login_count,0),
nvl(tmp_op.order_count,0),
nvl(tmp_op.order_amount,0.0),
nvl(tmp_op.payment_count,0),
nvl(tmp_op.payment_amount,0.0)
from dwd_dim_base_province pro
left join tmp_login on pro.area_code=tmp_login.area_code
left join tmp_op on pro.id=tmp_op.province_id;
2.3.5.2 地区DWT层
- 建表语句
drop table if exists dwt_area_topic;
create external table dwt_area_topic(
`id` bigint COMMENT '编号',
`province_name` string COMMENT '省份名称',
`area_code` string COMMENT '地区编码',
`iso_code` string COMMENT 'iso编码',
`region_id` string COMMENT '地区ID',
`region_name` string COMMENT '地区名称',
`login_day_count` string COMMENT '当天活跃设备数',
`login_last_30d_count` string COMMENT '最近30天活跃设备数',
`order_day_count` bigint COMMENT '当天下单次数',
`order_day_amount` decimal(16,2) COMMENT '当天下单金额',
`order_last_30d_count` bigint COMMENT '最近30天下单次数',
`order_last_30d_amount` decimal(16,2) COMMENT '最近30天下单金额',
`payment_day_count` bigint COMMENT '当天支付次数',
`payment_day_amount` decimal(16,2) COMMENT '当天支付金额',
`payment_last_30d_count` bigint COMMENT '最近30天支付次数',
`payment_last_30d_amount` decimal(16,2) COMMENT '最近30天支付金额'
) COMMENT '地区主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwt/dwt_area_topic/'
tblproperties ("parquet.compression"="lzo");
- 插入数据
insert overwrite table dwt_area_topic
select
nvl(old.id,new.id),
nvl(old.province_name,new.province_name),
nvl(old.area_code,new.area_code),
nvl(old.iso_code,new.iso_code),
nvl(old.region_id,new.region_id),
nvl(old.region_name,new.region_name),
nvl(new.login_day_count,0),
nvl(new.login_last_30d_count,0),
nvl(new.order_day_count,0),
nvl(new.order_day_amount,0.0),
nvl(new.order_last_30d_count,0),
nvl(new.order_last_30d_amount,0.0),
nvl(new.payment_day_count,0),
nvl(new.payment_day_amount,0.0),
nvl(new.payment_last_30d_count,0),
nvl(new.payment_last_30d_amount,0.0)
from
(
select
*
from dwt_area_topic
)old
full outer join
(
select
id,
province_name,
area_code,
iso_code,
region_id,
region_name,
sum(if(dt='2020-06-25',login_count,0)) login_day_count,
sum(if(dt='2020-06-25',order_count,0)) order_day_count,
sum(if(dt='2020-06-25',order_amount,0.0)) order_day_amount,
sum(if(dt='2020-06-25',payment_count,0)) payment_day_count,
sum(if(dt='2020-06-25',payment_amount,0.0)) payment_day_amount,
sum(login_count) login_last_30d_count,
sum(order_count) order_last_30d_count,
sum(order_amount) order_last_30d_amount,
sum(payment_count) payment_last_30d_count,
sum(payment_amount) payment_last_30d_amount
from dws_area_stats_daycount
where dt>=date_add('2020-06-25',-30)
group by id,province_name,area_code,iso_code,region_id,region_name
)new
on old.id=new.id;
2.4 脚本
2.4.1 DWS脚本
- 在/home/atguigu/bin目录下创建脚本dwd_to_dws.sh
[atguigu@hadoop102 bin]$ vim dwd_to_dws.sh
- 脚本内容
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set mapreduce.job.queuename=hive;
with
tmp_start as
(
select
mid_id,
brand,
model,
count(*) login_count
from ${APP}.dwd_start_log
where dt='$do_date'
group by mid_id,brand,model
),
tmp_page as
(
select
mid_id,
brand,
model,
collect_set(named_struct('page_id',page_id,'page_count',page_count)) page_stats
from
(
select
mid_id,
brand,
model,
page_id,
count(*) page_count
from ${APP}.dwd_page_log
where dt='$do_date'
group by mid_id,brand,model,page_id
)tmp
group by mid_id,brand,model
)
insert overwrite table ${APP}.dws_uv_detail_daycount partition(dt='$do_date')
select
nvl(tmp_start.mid_id,tmp_page.mid_id),
nvl(tmp_start.brand,tmp_page.brand),
nvl(tmp_start.model,tmp_page.model),
tmp_start.login_count,
tmp_page.page_stats
from tmp_start
full outer join tmp_page
on tmp_start.mid_id=tmp_page.mid_id
and tmp_start.brand=tmp_page.brand
and tmp_start.model=tmp_page.model;
with
tmp_login as
(
select
user_id,
count(*) login_count
from ${APP}.dwd_start_log
where dt='$do_date'
and user_id is not null
group by user_id
),
tmp_cart as
(
select
user_id,
count(*) cart_count
from ${APP}.dwd_action_log
where dt='$do_date'
and user_id is not null
and action_id='cart_add'
group by user_id
),tmp_order as
(
select
user_id,
count(*) order_count,
sum(final_total_amount) order_amount
from ${APP}.dwd_fact_order_info
where dt='$do_date'
group by user_id
) ,
tmp_payment as
(
select
user_id,
count(*) payment_count,
sum(payment_amount) payment_amount
from ${APP}.dwd_fact_payment_info
where dt='$do_date'
group by user_id
),
tmp_order_detail as
(
select
user_id,
collect_set(named_struct('sku_id',sku_id,'sku_num',sku_num,'order_count',order_count,'order_amount',order_amount)) order_stats
from
(
select
user_id,
sku_id,
sum(sku_num) sku_num,
count(*) order_count,
cast(sum(final_amount_d) as decimal(20,2)) order_amount
from ${APP}.dwd_fact_order_detail
where dt='$do_date'
group by user_id,sku_id
)tmp
group by user_id
)
insert overwrite table ${APP}.dws_user_action_daycount partition(dt='$do_date')
select
tmp_login.user_id,
login_count,
nvl(cart_count,0),
nvl(order_count,0),
nvl(order_amount,0.0),
nvl(payment_count,0),
nvl(payment_amount,0.0),
order_stats
from tmp_login
left outer join tmp_cart on tmp_login.user_id=tmp_cart.user_id
left outer join tmp_order on tmp_login.user_id=tmp_order.user_id
left outer join tmp_payment on tmp_login.user_id=tmp_payment.user_id
left outer join tmp_order_detail on tmp_login.user_id=tmp_order_detail.user_id;
with
tmp_order as
(
select
sku_id,
count(*) order_count,
sum(sku_num) order_num,
sum(final_amount_d) order_amount
from ${APP}.dwd_fact_order_detail
where dt='$do_date'
group by sku_id
),
tmp_payment as
(
select
sku_id,
count(*) payment_count,
sum(sku_num) payment_num,
sum(final_amount_d) payment_amount
from ${APP}.dwd_fact_order_detail
where dt='$do_date'
and order_id in
(
select
id
from ${APP}.dwd_fact_order_info
where (dt='$do_date'
or dt=date_add('$do_date',-1))
and date_format(payment_time,'yyyy-MM-dd')='$do_date'
)
group by sku_id
),
tmp_refund as
(
select
sku_id,
count(*) refund_count,
sum(refund_num) refund_num,
sum(refund_amount) refund_amount
from ${APP}.dwd_fact_order_refund_info
where dt='$do_date'
group by sku_id
),
tmp_cart as
(
select
item sku_id,
count(*) cart_count
from ${APP}.dwd_action_log
where dt='$do_date'
and user_id is not null
and action_id='cart_add'
group by item
),tmp_favor as
(
select
item sku_id,
count(*) favor_count
from ${APP}.dwd_action_log
where dt='$do_date'
and user_id is not null
and action_id='favor_add'
group by item
),
tmp_appraise as
(
select
sku_id,
sum(if(appraise='1201',1,0)) appraise_good_count,
sum(if(appraise='1202',1,0)) appraise_mid_count,
sum(if(appraise='1203',1,0)) appraise_bad_count,
sum(if(appraise='1204',1,0)) appraise_default_count
from ${APP}.dwd_fact_comment_info
where dt='$do_date'
group by sku_id
)
insert overwrite table ${APP}.dws_sku_action_daycount partition(dt='$do_date')
select
sku_id,
sum(order_count),
sum(order_num),
sum(order_amount),
sum(payment_count),
sum(payment_num),
sum(payment_amount),
sum(refund_count),
sum(refund_num),
sum(refund_amount),
sum(cart_count),
sum(favor_count),
sum(appraise_good_count),
sum(appraise_mid_count),
sum(appraise_bad_count),
sum(appraise_default_count)
from
(
select
sku_id,
order_count,
order_num,
order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_order
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
payment_count,
payment_num,
payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_payment
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
refund_count,
refund_num,
refund_amount,
0 cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_refund
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
cart_count,
0 favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_cart
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
favor_count,
0 appraise_good_count,
0 appraise_mid_count,
0 appraise_bad_count,
0 appraise_default_count
from tmp_favor
union all
select
sku_id,
0 order_count,
0 order_num,
0 order_amount,
0 payment_count,
0 payment_num,
0 payment_amount,
0 refund_count,
0 refund_num,
0 refund_amount,
0 cart_count,
0 favor_count,
appraise_good_count,
appraise_mid_count,
appraise_bad_count,
appraise_default_count
from tmp_appraise
)tmp
group by sku_id;
with
tmp_login as
(
select
area_code,
count(*) login_count
from ${APP}.dwd_start_log
where dt='$do_date'
group by area_code
),
tmp_op as
(
select
province_id,
sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',1,0)) order_count,
sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) order_amount,
sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',1,0)) payment_count,
sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) payment_amount
from ${APP}.dwd_fact_order_info
where (dt='$do_date' or dt=date_add('$do_date',-1))
group by province_id
)
insert overwrite table ${APP}.dws_area_stats_daycount partition(dt='$do_date')
select
pro.id,
pro.province_name,
pro.area_code,
pro.iso_code,
pro.region_id,
pro.region_name,
nvl(tmp_login.login_count,0),
nvl(tmp_op.order_count,0),
nvl(tmp_op.order_amount,0.0),
nvl(tmp_op.payment_count,0),
nvl(tmp_op.payment_amount,0.0)
from ${APP}.dwd_dim_base_province pro
left join tmp_login on pro.area_code=tmp_login.area_code
left join tmp_op on pro.id=tmp_op.province_id;
with
tmp_op as
(
select
activity_id,
sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',1,0)) order_count,
sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) order_amount,
sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',1,0)) payment_count,
sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) payment_amount
from ${APP}.dwd_fact_order_info
where (dt='$do_date' or dt=date_add('$do_date',-1))
and activity_id is not null
group by activity_id
),
tmp_display as
(
select
item activity_id,
count(*) display_count
from ${APP}.dwd_display_log
where dt='$do_date'
and item_type='activity_id'
group by item
),
tmp_activity as
(
select
*
from ${APP}.dwd_dim_activity_info
where dt='$do_date'
)
insert overwrite table ${APP}.dws_activity_info_daycount partition(dt='$do_date')
select
nvl(tmp_op.activity_id,tmp_display.activity_id),
tmp_activity.activity_name,
tmp_activity.activity_type,
tmp_activity.start_time,
tmp_activity.end_time,
tmp_activity.create_time,
tmp_display.display_count,
tmp_op.order_count,
tmp_op.order_amount,
tmp_op.payment_count,
tmp_op.payment_amount
from tmp_op
full outer join tmp_display on tmp_op.activity_id=tmp_display.activity_id
left join tmp_activity on nvl(tmp_op.activity_id,tmp_display.activity_id)=tmp_activity.id;
"
$hive -e "$sql"
2.4.2 DWT脚本
- 在/home/atguigu/bin目录下创建脚本dws_to_dwt.sh
[atguigu@hadoop102 bin]$ vim dws_to_dwt.sh
- 脚本内容
#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set mapreduce.job.queuename=hive;
insert overwrite table ${APP}.dwt_uv_topic
select
nvl(new.mid_id,old.mid_id),
nvl(new.model,old.model),
nvl(new.brand,old.brand),
if(old.mid_id is null,'$do_date',old.login_date_first),
if(new.mid_id is not null,'$do_date',old.login_date_last),
if(new.mid_id is not null, new.login_count,0),
nvl(old.login_count,0)+if(new.login_count>0,1,0)
from
(
select
*
from ${APP}.dwt_uv_topic
)old
full outer join
(
select
*
from ${APP}.dws_uv_detail_daycount
where dt='$do_date'
)new
on old.mid_id=new.mid_id;
insert overwrite table ${APP}.dwt_user_topic
select
nvl(new.user_id,old.user_id),
if(old.login_date_first is null and new.login_count>0,'$do_date',old.login_date_first),
if(new.login_count>0,'$do_date',old.login_date_last),
nvl(old.login_count,0)+if(new.login_count>0,1,0),
nvl(new.login_last_30d_count,0),
if(old.order_date_first is null and new.order_count>0,'$do_date',old.order_date_first),
if(new.order_count>0,'$do_date',old.order_date_last),
nvl(old.order_count,0)+nvl(new.order_count,0),
nvl(old.order_amount,0)+nvl(new.order_amount,0),
nvl(new.order_last_30d_count,0),
nvl(new.order_last_30d_amount,0),
if(old.payment_date_first is null and new.payment_count>0,'$do_date',old.payment_date_first),
if(new.payment_count>0,'$do_date',old.payment_date_last),
nvl(old.payment_count,0)+nvl(new.payment_count,0),
nvl(old.payment_amount,0)+nvl(new.payment_amount,0),
nvl(new.payment_last_30d_count,0),
nvl(new.payment_last_30d_amount,0)
from
${APP}.dwt_user_topic old
full outer join
(
select
user_id,
sum(if(dt='$do_date',login_count,0)) login_count,
sum(if(dt='$do_date',order_count,0)) order_count,
sum(if(dt='$do_date',order_amount,0)) order_amount,
sum(if(dt='$do_date',payment_count,0)) payment_count,
sum(if(dt='$do_date',payment_amount,0)) payment_amount,
sum(if(login_count>0,1,0)) login_last_30d_count,
sum(order_count) order_last_30d_count,
sum(order_amount) order_last_30d_amount,
sum(payment_count) payment_last_30d_count,
sum(payment_amount) payment_last_30d_amount
from ${APP}.dws_user_action_daycount
where dt>=date_add( '$do_date',-30)
group by user_id
)new
on old.user_id=new.user_id;
insert overwrite table ${APP}.dwt_sku_topic
select
nvl(new.sku_id,old.sku_id),
sku_info.spu_id,
nvl(new.order_count30,0),
nvl(new.order_num30,0),
nvl(new.order_amount30,0),
nvl(old.order_count,0) + nvl(new.order_count,0),
nvl(old.order_num,0) + nvl(new.order_num,0),
nvl(old.order_amount,0) + nvl(new.order_amount,0),
nvl(new.payment_count30,0),
nvl(new.payment_num30,0),
nvl(new.payment_amount30,0),
nvl(old.payment_count,0) + nvl(new.payment_count,0),
nvl(old.payment_num,0) + nvl(new.payment_count,0),
nvl(old.payment_amount,0) + nvl(new.payment_count,0),
nvl(new.refund_count30,0),
nvl(new.refund_num30,0),
nvl(new.refund_amount30,0),
nvl(old.refund_count,0) + nvl(new.refund_count,0),
nvl(old.refund_num,0) + nvl(new.refund_num,0),
nvl(old.refund_amount,0) + nvl(new.refund_amount,0),
nvl(new.cart_count30,0),
nvl(old.cart_count,0) + nvl(new.cart_count,0),
nvl(new.favor_count30,0),
nvl(old.favor_count,0) + nvl(new.favor_count,0),
nvl(new.appraise_good_count30,0),
nvl(new.appraise_mid_count30,0),
nvl(new.appraise_bad_count30,0),
nvl(new.appraise_default_count30,0) ,
nvl(old.appraise_good_count,0) + nvl(new.appraise_good_count,0),
nvl(old.appraise_mid_count,0) + nvl(new.appraise_mid_count,0),
nvl(old.appraise_bad_count,0) + nvl(new.appraise_bad_count,0),
nvl(old.appraise_default_count,0) + nvl(new.appraise_default_count,0)
from
(
select
sku_id,
spu_id,
order_last_30d_count,
order_last_30d_num,
order_last_30d_amount,
order_count,
order_num,
order_amount ,
payment_last_30d_count,
payment_last_30d_num,
payment_last_30d_amount,
payment_count,
payment_num,
payment_amount,
refund_last_30d_count,
refund_last_30d_num,
refund_last_30d_amount,
refund_count,
refund_num,
refund_amount,
cart_last_30d_count,
cart_count,
favor_last_30d_count,
favor_count,
appraise_last_30d_good_count,
appraise_last_30d_mid_count,
appraise_last_30d_bad_count,
appraise_last_30d_default_count,
appraise_good_count,
appraise_mid_count,
appraise_bad_count,
appraise_default_count
from ${APP}.dwt_sku_topic
)old
full outer join
(
select
sku_id,
sum(if(dt='$do_date', order_count,0 )) order_count,
sum(if(dt='$do_date',order_num ,0 )) order_num,
sum(if(dt='$do_date',order_amount,0 )) order_amount ,
sum(if(dt='$do_date',payment_count,0 )) payment_count,
sum(if(dt='$do_date',payment_num,0 )) payment_num,
sum(if(dt='$do_date',payment_amount,0 )) payment_amount,
sum(if(dt='$do_date',refund_count,0 )) refund_count,
sum(if(dt='$do_date',refund_num,0 )) refund_num,
sum(if(dt='$do_date',refund_amount,0 )) refund_amount,
sum(if(dt='$do_date',cart_count,0 )) cart_count,
sum(if(dt='$do_date',favor_count,0 )) favor_count,
sum(if(dt='$do_date',appraise_good_count,0 )) appraise_good_count,
sum(if(dt='$do_date',appraise_mid_count,0 ) ) appraise_mid_count ,
sum(if(dt='$do_date',appraise_bad_count,0 )) appraise_bad_count,
sum(if(dt='$do_date',appraise_default_count,0 )) appraise_default_count,
sum(order_count) order_count30 ,
sum(order_num) order_num30,
sum(order_amount) order_amount30,
sum(payment_count) payment_count30,
sum(payment_num) payment_num30,
sum(payment_amount) payment_amount30,
sum(refund_count) refund_count30,
sum(refund_num) refund_num30,
sum(refund_amount) refund_amount30,
sum(cart_count) cart_count30,
sum(favor_count) favor_count30,
sum(appraise_good_count) appraise_good_count30,
sum(appraise_mid_count) appraise_mid_count30,
sum(appraise_bad_count) appraise_bad_count30,
sum(appraise_default_count) appraise_default_count30
from ${APP}.dws_sku_action_daycount
where dt >= date_add ('$do_date', -30)
group by sku_id
)new
on new.sku_id = old.sku_id
left join
(select * from ${APP}.dwd_dim_sku_info where dt='$do_date') sku_info
on nvl(new.sku_id,old.sku_id)= sku_info.id;
insert overwrite table ${APP}.dwt_activity_topic
select
nvl(new.id,old.id),
nvl(new.activity_name,old.activity_name),
nvl(new.activity_type,old.activity_type),
nvl(new.start_time,old.start_time),
nvl(new.end_time,old.end_time),
nvl(new.create_time,old.create_time),
nvl(new.display_count,0),
nvl(new.order_count,0),
nvl(new.order_amount,0.0),
nvl(new.payment_count,0),
nvl(new.payment_amount,0.0),
nvl(new.display_count,0)+nvl(old.display_count,0),
nvl(new.order_count,0)+nvl(old.order_count,0),
nvl(new.order_amount,0.0)+nvl(old.order_amount,0.0),
nvl(new.payment_count,0)+nvl(old.payment_count,0),
nvl(new.payment_amount,0.0)+nvl(old.payment_amount,0.0)
from
(
select
*
from ${APP}.dwt_activity_topic
)old
full outer join
(
select
*
from ${APP}.dws_activity_info_daycount
where dt='$do_date'
)new
on old.id=new.id;
insert overwrite table ${APP}.dwt_area_topic
select
nvl(old.id,new.id),
nvl(old.province_name,new.province_name),
nvl(old.area_code,new.area_code),
nvl(old.iso_code,new.iso_code),
nvl(old.region_id,new.region_id),
nvl(old.region_name,new.region_name),
nvl(new.login_day_count,0),
nvl(new.login_last_30d_count,0),
nvl(new.order_day_count,0),
nvl(new.order_day_amount,0.0),
nvl(new.order_last_30d_count,0),
nvl(new.order_last_30d_amount,0.0),
nvl(new.payment_day_count,0),
nvl(new.payment_day_amount,0.0),
nvl(new.payment_last_30d_count,0),
nvl(new.payment_last_30d_amount,0.0)
from
(
select
*
from ${APP}.dwt_area_topic
)old
full outer join
(
select
id,
province_name,
area_code,
iso_code,
region_id,
region_name,
sum(if(dt='$do_date',login_count,0)) login_day_count,
sum(if(dt='$do_date',order_count,0)) order_day_count,
sum(if(dt='$do_date',order_amount,0.0)) order_day_amount,
sum(if(dt='$do_date',payment_count,0)) payment_day_count,
sum(if(dt='$do_date',payment_amount,0.0)) payment_day_amount,
sum(login_count) login_last_30d_count,
sum(order_count) order_last_30d_count,
sum(order_amount) order_last_30d_amount,
sum(payment_count) payment_last_30d_count,
sum(payment_amount) payment_last_30d_amount
from ${APP}.dws_area_stats_daycount
where dt>=date_add('$do_date',-30)
group by id,province_name,area_code,iso_code,region_id,region_name
)new
on old.id=new.id;
"
$hive -e "$sql"
三、数仓搭建ADS层
3.0 ADS层思想
-- 1. 解题思路:
第一步:分析指标,明确指标具体的含义,不可有歧义
第二步:确定数据来源,寻找数据的方式,先从dwt层 -> dws -> dwd 层依次往回找,找到能满足需求的表
第三表: 找步骤2表的方式:找相关主题的表,因为在ads层的统计指标,也是按照维度 + 时间 + 指标的方式今天统计。
-- 2. 统计流量,使用用户行为数据,统计和业务相关的指标,使用业务数据。
-- 3. 老生常谈:知道每一张表中存储了什么数据。每行数据代表什么意思。关键、关键、关键。
3.1 设备主题
3.1.1 活跃设备数
-- 1. 什么是活跃设备
a、打开应用的用户即为活跃用户,不考虑用户的使用情况。
b、每天一台设备打开多次会被计为一个活跃用户
-- 2. 需求:
'日活':当日活跃的设备数
'周活':当周活跃的设备数,在这一周内,多次活跃也计算为1次
'月活':当月活跃的设备数,在这一月内,多次活跃也计算为1次
- 建表
drop table if exists ads_uv_count;
create external table ads_uv_count(
`dt` string COMMENT '统计日期',
`day_count` bigint COMMENT '当日用户数量',
`wk_count` bigint COMMENT '当周用户数量',
`mn_count` bigint COMMENT '当月用户数量',
`is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
`is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
) COMMENT '活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_uv_count/';
- 插入数据
insert into ads_uv_count
select
'2020-06-25' dt,
day_count,
wk_count,
mn_count ,
if('2020-06-25'=date_add(next_day('2020-06-25','mo'),-1),'Y','N') is_weekend ,
if('2020-06-25'=last_day('2020-06-25'),'Y','N') is_monthend
from
(
select
'2020-06-25' dt,
sum(if(login_day_count > 0 , 1 , 0 )) day_count,
sum(if(login_date_last >= date_add(next_day('2020-06-25','mo'),-7) , 1 , 0)) wk_count,
sum(if(date_format(login_date_last,'yyyy-MM') = date_format('2020-06-25','yyyy-MM'),1,0)) mn_count
from dwt_uv_topic
)tmp
3.1.2 每日新增设备
- 建表
drop table if exists ads_new_mid_count;
create external table ads_new_mid_count
(
`create_date` string comment '创建时间' ,
`new_mid_count` BIGINT comment '新增设备数量'
) COMMENT '每日新增设备信息数量'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_mid_count/';
- 插入数据
insert into table ads_new_mid_count
select
'2020-06-25',
count(*)
from dwt_uv_topic
where login_date_first='2020-06-25';
3.1.3 留存率
-- 1. 什么是留存率?
a、某段时间内的新增用户,经过一段时间后,仍然使用应用的被认作是留存用户;这部分用户占当时新增用户的比例即是留存率。
b、例如,5月份新增用户200,这200人启动情况:
6月份:启动人数为100人,5月份新增用户一个月后的留存率是50%
7月份:启动人数为80人,5月份新增用户二个月后的留存率是40%
8月份:启动人数为100人,5月份新增用户三个月后的留存率是50%
c、留存用户一般是统计留存率,同时必须有两个参数:哪个月份的几月的留存率
-- 2. 本案例需要统计的指标是:
计算每天的1、2、3日留存率
-- 3. 实现方式
第一步:统计当天所有的活跃用户
第二步:统计昨天的1日留存率,求出昨天的新用户但是今天上线的用户/昨天的新用户
第三步:统计前天的2日留存率,求出前天的新用户但是今天上线的用户/前天的新用户
所以需要统计的数量有:
1. 昨天的新用户但是今天上线的用户
2. 昨天的新用户
3. 前天的新用户但是今天上线的用户
4. 前天的新用户
![image-20200708203545115](https://lian-zp.oss-cn-shenzhen.aliyuncs.com/pic GO/20200709012005.png)
- 建表
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
`stat_date` string comment '统计日期',
`create_date` string comment '设备新增日期',
`retention_day` int comment '截止当前日期留存天数',
`retention_count` bigint comment '留存数量',
`new_mid_count` bigint comment '设备新增数量',
`retention_ratio` decimal(16,2) comment '留存率'
) COMMENT '每日用户留存情况'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
- 插入数据
with tmp_uv_topic
as (
select
'2020-06-26' stat_date,
sum(if(login_date_first = date_add('2020-06-26',-1),1,0)) 1day_count , -- 昨天的新用户
sum(if(login_date_first = date_add('2020-06-26',-1) and login_date_last = '2020-06-26',1,0)) new_1day ,--昨天的新用户但是今天上线的用户
sum(if(login_date_first = date_add('2020-06-26',-2),1,0)) 2day_count , -- 前天的新用户
sum(if(login_date_first = date_add('2020-06-26',-2) and login_date_last = '2020-06-26',1,0)) new_2day ,--前天的新用户但是今天上线的用户
sum(if(login_date_first = date_add('2020-06-26',-3),1,0)) 3day_count , -- 大前天的新用户
sum(if(login_date_first = date_add('2020-06-26',-3) and login_date_last = '2020-06-26',1,0)) new_3day --大前天的新用户但是今天上线的用户
from dwt_uv_topic
)
insert into ads_user_retention_day_rate
select
'2020-06-26' stat_date, --统计日期
date_add('2020-06-26',-1) create_date,--设备新增日期
1 retention_day,--截止当前日期留存天数
new_1day retention_count,--留存数量
1day_count new_mid_count,--设备新增数量
new_1day * 100 /1day_count retention_ratio--留存率
from tmp_uv_topic
union all
select
'2020-06-26' stat_date,--统计日期
date_add('2020-06-26',-2) create_date,--设备新增日期
2 retention_day,--截止当前日期留存天数
new_2day retention_count,--留存数量
2day_count new_mid_count,--设备新增数量
new_2day * 100 / 2day_count retention_ratio --留存率
from tmp_uv_topic
union all
select
'2020-06-26' stat_date,--统计日期
date_add('2020-06-26',-3) create_date,--设备新增日期
3 retention_day,--截止当前日期留存天数
new_3day retention_count,--留存数量
3day_count new_mid_count,--设备新增数量
new_3day * 100 / 3day_count retention_ratio --留存率
from tmp_uv_topic
3.1.4 沉默用户数
-- 1. 什么是沉默用户?
只在安装当天启动过,且启动时间是在7天前
-- 2. 实现过程
1. 统计首次活跃时间 = 最后末次活跃时间,且最后活跃时间在7天前的用户
- 建表
drop table if exists ads_silent_count;
create external table ads_silent_count(
`dt` string COMMENT '统计日期',
`silent_count` bigint COMMENT '沉默设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_silent_count';
- 插入数据
insert into table ads_silent_count
select
'2020-06-25',
count(*)
from dwt_uv_topic
where login_date_first=login_date_last
and login_date_last < date_add('2020-06-25',-7);
3.1.5 本周回流用户数
-- 1. 什么是本周回流用户?
上周未活跃,本周活跃的设备,且不是本周新增设备
-- 2. 实现步骤:
第一步:获取本周活跃的用户且不是本周新增的用户
第二步:获取上周的活跃的用户
第三步:第一步获取的用户减少第二步获取的用户就是本周回流的用户
- 建表
drop table if exists ads_back_count;
create external table ads_back_count(
`dt` string COMMENT '统计日期',
`wk_dt` string COMMENT '统计日期所在周',
`wastage_count` bigint COMMENT '回流设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_back_count';
- 插入数据
insert into ads_back_count
select
'2020-06-25' dt ,
concat(date_add(next_day('2020-06-25','MO'),-7),'_', date_add(next_day('2020-06-25','MO'),-1)) wk_dt,
count(*)
from dwt_uv_topic
where login_date_last >= date_add(next_day('2020-06-25','mo'),-7)
and login_date_first < date_add(next_day('2020-06-25','mo'),-7)
and mid_id not in (
select
mid_id
from dws_uv_detail_daycount
where dt < date_add(next_day('2020-06-25','mo'),-7)
and dt >= date_add(next_day('2020-06-25','mo'),-14)
)
3.1.6 流失用户
-- 1. 什么是流失用户
最近7天未活跃的设备
-- 2. 实现步骤
第一步:获取最近活跃时间小于7天
- 建表
drop table if exists ads_wastage_count;
create external table ads_wastage_count(
`dt` string COMMENT '统计日期',
`wastage_count` bigint COMMENT '流失设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_wastage_count';
- 插入数据
insert into table ads_wastage_count
select
'2020-06-25' dt ,
count(*)
from dwt_uv_topic
where login_date_last < date_add('2020-06-25',-7)
3.1.7 最近连续三周活跃用户数
-- 1. 实现步骤
第一步: 从dws层获取前一周、前两周以及当前周的所有活跃的用户
第二步: 然后进行内连接,能连接上的,则说明这连续的3周都活跃了,最后按照用户进行分组去重后求count。
- 建表
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count(
`dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
`wk_dt` string COMMENT '持续时间',
`continuity_count` bigint COMMENT '活跃次数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';
- 插入数据
with 1wk_mid
as (
select
mid_id
from dws_uv_detail_daycount
where dt >= date_add(next_day('2020-06-25','mo'),-7)
group by mid_id
),
2wk_mid as (
select
mid_id
from dws_uv_detail_daycount
where dt >= date_add(next_day('2020-06-25','mo'),-14)
and dt < date_add(next_day('2020-06-25','mo'),-7)
group by mid_id
),
3wk_mid as (
select
mid_id
from dws_uv_detail_daycount
where dt >= date_add(next_day('2020-06-25','mo'),-21)
and dt < date_add(next_day('2020-06-25','mo'),-14)
group by mid_id
)
insert into ads_continuity_wk_count
select
'2020-06-25',
concat(date_add(next_day('2020-06-25','MO'),-7*3),'_',date_add(next_day('2020-06-25','MO'),-1)),
count(*)
from 1wk_mid
join 2wk_mid on 1wk_mid.mid_id = 2wk_mid.mid_id
join 3wk_mid on 2wk_mid.mid_id = 3wk_mid.mid_id
3.1.8 最近七天内连续三天活跃
-- 1. 思路:
第一步:从dws层获取最近7天的数据,对数据按照用户分组进行开窗,按照活跃时间进行降序排序
第二步:使用活跃时间减去排名,获取一列
第三步:按照用户和第三步的列进行分组,求count(*) >= 3的用户
第四步:分组去重
第五步:求7天内连续3天的活跃用户
- 建表
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count(
`dt` string COMMENT '统计日期',
`wk_dt` string COMMENT '最近7天日期',
`continuity_count` bigint
) COMMENT '连续活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';
- 分析
-- 第一步:从dws层获取最近7天的数据,对数据按照用户分组进行开窗,按照活跃时间进行降序排序
select
mid_id,
dt ,
row_number() over(partition by mid_id order by dt ) rk
from dws_uv_detail_daycount
where dt > date_add('2020-06-25',-7) --t1
-- 第二步:使用活跃时间减去排名,获取一列
select
mid_id,
dt ,
date_add(dt,-rk) dt_rk
from t1 --t2
-- 第三步:按照用户和第三步的列进行分组,求count(*) > =3的用户
select
mid_id
from t2
group by mid_id,dt_rk
having count(*) >= 3--t3
-- 第四步:分组去重
select
mid_id
from t3
group by mid_id --t4
--第五步:求7天内连续3天的活跃用户
select
'2020-06-25' dt,
concat(date_add('2020-06-25',-6),'_','2020-06-25') wk_dt ,
count(*) continuity_count
from t4
- 插入数据
insert into ads_continuity_uv_count
select
'2020-06-25' dt,
concat(date_add('2020-06-25',-7),'_','2020-06-25') wk_dt ,
count(*) continuity_count
from (
select
mid_id
from (
select
mid_id
from (
select
mid_id,
dt ,
date_add(dt,rk) dt_rk
from (
select
mid_id,
dt ,
row_number() over(partition by mid_id order by dt ) rk
from dws_uv_detail_daycount
where dt > date_add('2020-06-25',-7)
and dt < '2020-06-25'
)t1
)t2
group by mid_id,dt_rk
having count(*) >= 3
)t3
group by mid_id
)t4
3.2 会员主题
3.2.1 会员主题信息
-- 几个指标说明
1. 总付费会员数:指付费的会员数量
2. 会员活跃率 = 今天会员活跃数量 / 总的会员数量
3. 会员付费率 = 今天会员付费人数 / 总的会员数量
4. 会员新鲜度 = 今天新增会员数量 / 今天活跃的会员数量
- 建表
drop table if exists ads_user_topic;
create external table ads_user_topic(
`dt` string COMMENT '统计日期',
`day_users` string COMMENT '活跃会员数',
`day_new_users` string COMMENT '新增会员数',
`day_new_payment_users` string COMMENT '新增消费会员数',
`payment_users` string COMMENT '总付费会员数',
`users` string COMMENT '总会员数',
`day_users2users` decimal(16,2) COMMENT '会员活跃率',
`payment_users2users` decimal(16,2) COMMENT '会员付费率',
`day_new_users2users` decimal(16,2) COMMENT '会员新鲜度'
) COMMENT '会员主题信息表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_topic';
- 插入数据
insert into ads_user_topic
select
'2020-06-25' dt ,--统计日期
sum(if(login_date_last = '2020-06-25',1,0)) day_users,--今天活跃会员数
sum(if(login_date_first = '2020-06-25',1,0)) day_new_users,--今天新增会员数
sum(if(payment_date_first = '2020-06-25',1,0)) day_new_payment_users,--今天新增消费会员数
sum(if(payment_count > 0,1,0)) payment_users , --总付费会员数
count(*) users ,-- 总会员数
sum(if(login_date_last = '2020-06-25',1,0))*100/count(*) day_users2users ,--会员活跃率
sum(if(payment_count > 0,1,0))*100/count(*) payment_users2users ,--会员付费率
sum(if(login_date_first = '2020-06-25',1,0))*100/sum(if(login_date_last = '2020-06-25',1,0)) day_new_users2users --会员新鲜度
from dwt_user_topic
3.2.2 漏斗分析
-- 1. 什么是漏斗分析?
其实是指转化率,同一个会员同一次操作业务过程为:
浏览页面 -> 进入详情的页面 -> 加入购物车 -> 下单 -> 支付
理论上从左往右的人数是越来越少,所以称之为漏斗
-- 2. 说明:
理论上:进入详情页面的方式有很多,不止是通过浏览页面,所以我们在计算从浏览页面以后进入详情页面应该是连续的,才能算作从浏览
页面到页面详情的转换率,但是在本案例中,比较简单粗暴,处理的方式是:
-- 3. 统计方式:
1. '统计浏览首页的人数':统计今天浏览过详情页面的人数,
同一个人多次浏览首页算作1次,数据来源每日设备行为:dwd_page_log
2. '浏览商品详情页面的人数':统计今天浏览过详情页面的人数,
同一个用户多次浏览商品详情算作1次,数据来源每日设备行为:dwd_page_log
3. '加购物车的人数':统计今天加购物车的人数,数据来源:每日会员行为dws_user_action_daycount
4. '下单人数':统计今天下单的人数,数据来源:每日会员行为dws_user_action_daycount
5. '支付人数':统计今天支付的人数,数据来源:每日会员行为dws_user_action_daycount
- 建表
drop table if exists ads_user_action_convert_day;
create external table ads_user_action_convert_day(
`dt` string COMMENT '统计日期',
`home_count` bigint COMMENT '浏览首页人数',
`good_detail_count` bigint COMMENT '浏览商品详情页人数',
`home2good_detail_convert_ratio` decimal(16,2) COMMENT '首页到商品详情转化率',
`cart_count` bigint COMMENT '加入购物车的人数',
`good_detail2cart_convert_ratio` decimal(16,2) COMMENT '商品详情页到加入购物车转化率',
`order_count` bigint COMMENT '下单人数',
`cart2order_convert_ratio` decimal(16,2) COMMENT '加入购物车到下单转化率',
`payment_amount` bigint COMMENT '支付人数',
`order2payment_convert_ratio` decimal(16,2) COMMENT '下单到支付的转化率'
) COMMENT '用户行为漏斗分析'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_action_convert_day/';
- 插入数据
with
tmp_action as (
select
'2020-06-25' dt,
sum(if(cart_count > 0,1,0)) cart_count,--加入购物车的人数
sum(if(order_count > 0,1,0)) order_count , --下单人数
sum(if(payment_count > 0,1,0)) payment_count --支付人数
from dws_user_action_daycount
where dt ='2020-06-25'
),
tmp_page as (
select
'2020-06-25' dt , --统计日期
sum(if(array_contains(pages,'home'),1,0)) home_count, --浏览首页人数
sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count, --浏览商品详情页人数
sum(if(array_contains(pages,'good_detail'),1,0)) * 100 /sum(if(array_contains(pages,'home'),1,0)) home2good_detail_convert_ratio --首页到商品详情转化率
from (
select
mid_id,
--对用户进行分组,过滤出今天进入首页和详情页的用户,获取当天用户的页面行为,去重后放到一个集合中
-- 那么一行数据有如下2种情况
-- 用户 page_id
-- 243 ["good_detail","home"]
-- 63 ["home"]
collect_set(page_id) pages
from dwd_page_log
where dt = '2020-06-25'
and page_id in ('home','good_detail')
group by mid_id
)tmp
)
insert into ads_user_action_convert_day
select
'2020-06-25' dt , --统计日期
home_count, --浏览首页人数
good_detail_count, --浏览商品详情页人数
home2good_detail_convert_ratio ,--首页到商品详情转化率
cart_count,--加入购物车的人数
cart_count *100/good_detail_count good_detail2cart_convert_ratio,--商品详情页到加入购物车转化率
order_count , --下单人数
order_count *100/cart_count cart2order_convert_ratio ,--加入购物车到下单转化率
payment_count, --支付人数
payment_count * 100 / order_count order2payment_convert_ratio --下单到支付的转化率
from tmp_action
join tmp_page on tmp_action.dt = tmp_page.dt
3.3 商品主题
3.3.1 商品个数信息
-- 1. 需求分析:
`sku_num` string COMMENT 'sku个数',
统计到目前为止的sku数量
`spu_num` string COMMENT 'spu个数'
统计到目前为止的spu数量
- 建表
drop table if exists ads_product_info;
create external table ads_product_info(
`dt` string COMMENT '统计日期',
`sku_num` string COMMENT 'sku个数',
`spu_num` string COMMENT 'spu个数'
) COMMENT '商品个数信息'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_info';
- 插入数据
with
tmp_sku as(
select
'2020-06-25' dt,
count(*) sku_num --sku个数
from dwt_sku_topic
),
tmp_spu as(
select
'2020-06-25' dt,
count(*) spu_num --spu个数
from (
select
'2020-06-25' dt,
spu_id
from dwt_sku_topic
group by spu_id
)tmp
)
insert into ads_product_info
select
'2020-06-25' dt,
sku_num,
spu_num
from tmp_sku
join tmp_spu on tmp_sku.dt = tmp_spu.dt
3.3.2 商品销量排名
-- 1. 商品销量排名:
1. 是按照什么规则进行排名?本需求是按照当天的产品的支付金额的大小进行排名
- 建表
drop table if exists ads_product_sale_topN;
create external table ads_product_sale_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`payment_amount` bigint COMMENT '销量'
) COMMENT '商品个数信息'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_sale_topN';
- 插入数据
insert into ads_product_sale_topN
select
'2020-06-25' dt,
sku_id,
payment_amount
from dws_sku_action_daycount
where dt = '2020-06-25'
order by payment_amount desc
limit 10
3.3.3 商品收藏排名
- 建表
drop table if exists ads_product_favor_topN;
create external table ads_product_favor_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`favor_count` bigint COMMENT '收藏量'
) COMMENT '商品收藏TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_favor_topN';
- 插入数据
insert into table ads_product_favor_topN
select
'2020-06-25' dt,
sku_id,
favor_count
from
dws_sku_action_daycount
where
dt='2020-06-25'
order by favor_count desc
limit 10;
3.3.4 商品加入购物车排名
- 建表
drop table if exists ads_product_cart_topN;
create external table ads_product_cart_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`cart_count` bigint COMMENT '加入购物车次数'
) COMMENT '商品加入购物车TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_cart_topN';
- 插入数据
insert into table ads_product_cart_topN
select
'2020-06-25' dt,
sku_id,
cart_count
from
dws_sku_action_daycount
where
dt='2020-06-25'
order by cart_count desc
limit 10;
3.3.5 商品退款率排名(近30天)
- 建表
drop table if exists ads_product_refund_topN;
create external table ads_product_refund_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`refund_ratio` decimal(16,2) COMMENT '退款率'
) COMMENT '商品退款率TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_product_refund_topN';
- 插入数据
insert overwrite table ads_product_refund_topN
select
'2020-06-25',
sku_id,
refund_last_30d_count/payment_last_30d_count*100 refund_ratio
from dwt_sku_topic
order by refund_ratio desc
limit 10;
3.3.6 商品差评率
- 建表
drop table if exists ads_appraise_bad_topN;
create external table ads_appraise_bad_topN(
`dt` string COMMENT '统计日期',
`sku_id` string COMMENT '商品ID',
`appraise_bad_ratio` decimal(16,2) COMMENT '差评率'
) COMMENT '商品差评率TopN'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_appraise_bad_topN';
- 插入数据
insert into table ads_appraise_bad_topN
select
'2020-06-25' dt,
sku_id,
appraise_bad_count/(appraise_good_count+appraise_mid_count+appraise_bad_count+appraise_default_count) appraise_bad_ratio
from
dws_sku_action_daycount
where
dt='2020-06-25'
order by appraise_bad_ratio desc
limit 10;
3.4 营销主题
3.4.1 下单数目统计
- 建表
drop table if exists ads_order_daycount;
create external table ads_order_daycount(
dt string comment '统计日期',
order_count bigint comment '单日下单笔数',
order_amount bigint comment '单日下单金额',
order_users bigint comment '单日下单用户数'
) comment '每日订单总计表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_order_daycount';
插入数据
insert into ads_order_daycount
select
'2020-06-25' dt ,
sum(order_count) order_count, --单日下单笔数
sum(order_amount ) order_amount ,--单日下单金额
sum(if(order_count > 0 , 1 , 0)) --单日下单用户数
from dws_user_action_daycount
where dt = '2020-06-25'
分析
insert into table ads_order_daycount
select
'2020-06-14',
sum(order_count),
sum(order_amount),
sum(if(order_count>0,1,0))
from dws_user_action_daycount
where dt='2020-06-14';
3.4.2 支付信息统计
- 建表
drop table if exists ads_payment_daycount;
create external table ads_payment_daycount(
dt string comment '统计日期',
order_count bigint comment '单日支付笔数',
order_amount bigint comment '单日支付金额',
payment_user_count bigint comment '单日支付人数',
payment_sku_count bigint comment '单日支付商品数',
payment_avg_time decimal(16,2) comment '下单到支付的平均时长,取分钟数'
) comment '每日订单总计表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_payment_daycount';
- 插入数据
with
tmp_user as (
select
'2020-06-25' dt , --统计日期
sum(payment_count) order_count, --单日支付笔数
sum(payment_amount) order_amount,--单日支付金额
sum(if(order_count > 0 , 1 , 0)) payment_user_count --单日支付人数
from dws_user_action_daycount
where dt='2020-06-25'
),
tmp_action as (
select
'2020-06-25' dt,
sum(if(payment_amount > 0,1,0)) payment_sku_count --单日支付商品数
from dws_sku_action_daycount
where dt= '2020-06-25'
),
tmp_order as (
select
'2020-06-25' dt,
sum(unix_timestamp(payment_time)-unix_timestamp(create_time))/count(*)/60 payment_avg_time --下单到支付的平均时长,取分钟数
from dwd_fact_order_info
where dt='2020-06-25'
and payment_time is not null
)
insert into ads_payment_daycount
select
'2020-06-25' dt , --统计日期
order_count, --单日支付笔数
order_amount,--单日支付金额
payment_user_count ,--单日支付人数
payment_sku_count ,--单日支付商品数
payment_avg_time --下单到支付的平均时长,取分钟数
from tmp_user
join tmp_action on tmp_user.dt = tmp_action.dt
join tmp_order on tmp_action.dt = tmp_order.dt
3.4.3 品牌复购率(月复购)
- 建表
drop table ads_sale_tm_category1_stat_mn;
create external table ads_sale_tm_category1_stat_mn
(
tm_id string comment '品牌id',
category1_id string comment '1级品类id ',
category1_name string comment '1级品类名称 ',
buycount bigint comment '购买人数',
buy_twice_last bigint comment '两次以上购买人数',
buy_twice_last_ratio decimal(16,2) comment '单次复购率',
buy_3times_last bigint comment '三次以上购买人数',
buy_3times_last_ratio decimal(16,2) comment '多次复购率',
stat_mn string comment '统计月份',
stat_date string comment '统计日期'
) COMMENT '复购率统计'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_sale_tm_category1_stat_mn/';
- 插入数据
with
tmp_order as
(
select
user_id,--用户di
order_stats_struct.sku_id sku_id,--商品id
order_stats_struct.order_count order_count --商品被购买的次数
-- 使用侧写的方式,将一个用户当天购买的每个商品的明细(数组)进行侧写,形成多行
from dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct
where date_format(dt,'yyyy-MM')=date_format('2020-06-25','yyyy-MM') --取当天的数据
),
tmp_sku as
(
select
id,--商品id
tm_id, -- 品牌id
category1_id, -- 一次品类的id
category1_name --一次品类的名字
from dwd_dim_sku_info
where dt='2020-06-25'
)
insert into table ads_sale_tm_category1_stat_mn
select
tm_id,
category1_id,
category1_name,
sum(if(order_count>=1,1,0)) buycount,
sum(if(order_count>=2,1,0)) buyTwiceLast,
sum(if(order_count>=2,1,0))/sum( if(order_count>=1,1,0)) buyTwiceLastRatio,
sum(if(order_count>=3,1,0)) buy3timeLast ,
sum(if(order_count>=3,1,0))/sum( if(order_count>=1,1,0)) buy3timeLastRatio ,
date_format('2020-06-25' ,'yyyy-MM') stat_mn,
'2020-06-25' stat_date
from
(
select
tmp_order.user_id, --用户
tmp_sku.category1_id, -- 一级品类
tmp_sku.category1_name, -- 一级品类的名字
tmp_sku.tm_id, -- 品牌
sum(order_count) order_count -- 购买同一品牌数量
from tmp_order
join tmp_sku
on tmp_order.sku_id=tmp_sku.id
-- 按照用户 + 1级品类 + 品牌id分组,得到一个用户购买某一个1级品类的某一品牌的个数
group by tmp_order.user_id,tmp_sku.category1_id,tmp_sku.category1_name,tmp_sku.tm_id
)tmp
-- 按照品牌进行 +
group by tm_id, category1_id, category1_name;
3.5 地区主题
3.5.1 地区主题信息
- 建表
drop table if exists ads_area_topic;
create external table ads_area_topic(
`dt` string COMMENT '统计日期',
`id` bigint COMMENT '编号',
`province_name` string COMMENT '省份名称',
`area_code` string COMMENT '地区编码',
`iso_code` string COMMENT 'iso编码',
`region_id` string COMMENT '地区ID',
`region_name` string COMMENT '地区名称',
`login_day_count` bigint COMMENT '当天活跃设备数',
`order_day_count` bigint COMMENT '当天下单次数',
`order_day_amount` decimal(16,2) COMMENT '当天下单金额',
`payment_day_count` bigint COMMENT '当天支付次数',
`payment_day_amount` decimal(16,2) COMMENT '当天支付金额'
) COMMENT '地区主题宽表'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_area_topic/';
- 插入数据
insert into table ads_area_topic
select
'2020-06-25',
id,
province_name,
area_code,
iso_code,
region_id,
region_name,
login_day_count,
order_day_count,
order_day_amount,
payment_day_count,
payment_day_amount
from dwt_area_topic;
分析
insert into table ads_area_topic
select
'2020-06-14',
id,
province_name,
area_code,
iso_code,
region_id,
region_name,
login_day_count,
order_day_count,
order_day_amount,
payment_day_count,
payment_day_amount
-- 直接导入dwt的地区表
from dwt_area_topic;
3.6 导入脚本
- 在/home/atguigu/bin目录下创建脚本dwt_to_ads.sh
[atguigu@hadoop102 bin]$ vim dwt_to_ads.sh
- 脚本内容
#!/bin/bash
hive=/opt/module/hive/bin/hive
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set mapreduce.job.queuename=hive;
insert into table ${APP}.ads_uv_count
select
'$do_date' dt,
daycount.ct,
wkcount.ct,
mncount.ct,
if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') ,
if(last_day('$do_date')='$do_date','Y','N')
from
(
select
'$do_date' dt,
count(*) ct
from ${APP}.dwt_uv_topic
where login_date_last='$do_date'
)daycount join
(
select
'$do_date' dt,
count (*) ct
from ${APP}.dwt_uv_topic
where login_date_last>=date_add(next_day('$do_date','MO'),-7)
and login_date_last<= date_add(next_day('$do_date','MO'),-1)
) wkcount on daycount.dt=wkcount.dt
join
(
select
'$do_date' dt,
count (*) ct
from ${APP}.dwt_uv_topic
where date_format(login_date_last,'yyyy-MM')=date_format('$do_date','yyyy-MM')
)mncount on daycount.dt=mncount.dt;
insert into table ${APP}.ads_new_mid_count
select
login_date_first,
count(*)
from ${APP}.dwt_uv_topic
where login_date_first='$do_date'
group by login_date_first;
insert into table ${APP}.ads_silent_count
select
'$do_date',
count(*)
from ${APP}.dwt_uv_topic
where login_date_first=login_date_last
and login_date_last<=date_add('$do_date',-7);
insert into table ${APP}.ads_back_count
select
'$do_date',
concat(date_add(next_day('$do_date','MO'),-7),'_', date_add(next_day('$do_date','MO'),-1)),
count(*)
from
(
select
mid_id
from ${APP}.dwt_uv_topic
where login_date_last>=date_add(next_day('$do_date','MO'),-7)
and login_date_last<= date_add(next_day('$do_date','MO'),-1)
and login_date_first<date_add(next_day('$do_date','MO'),-7)
)current_wk
left join
(
select
mid_id
from ${APP}.dws_uv_detail_daycount
where dt>=date_add(next_day('$do_date','MO'),-7*2)
and dt<= date_add(next_day('$do_date','MO'),-7-1)
group by mid_id
)last_wk
on current_wk.mid_id=last_wk.mid_id
where last_wk.mid_id is null;
insert into table ${APP}.ads_wastage_count
select
'$do_date',
count(*)
from
(
select
mid_id
from ${APP}.dwt_uv_topic
where login_date_last<=date_add('$do_date',-7)
group by mid_id
)t1;
insert into table ${APP}.ads_user_retention_day_rate
select
'$do_date',--统计日期
date_add('$do_date',-1),--新增日期
1,--留存天数
sum(if(login_date_first=date_add('$do_date',-1) and login_date_last='$do_date',1,0)),--$do_date的1日留存数
sum(if(login_date_first=date_add('$do_date',-1),1,0)),--$do_date新增
sum(if(login_date_first=date_add('$do_date',-1) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-1),1,0))*100
from ${APP}.dwt_uv_topic
union all
select
'$do_date',--统计日期
date_add('$do_date',-2),--新增日期
2,--留存天数
sum(if(login_date_first=date_add('$do_date',-2) and login_date_last='$do_date',1,0)),--$do_date的2日留存数
sum(if(login_date_first=date_add('$do_date',-2),1,0)),--$do_date新增
sum(if(login_date_first=date_add('$do_date',-2) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-2),1,0))*100
from ${APP}.dwt_uv_topic
union all
select
'$do_date',--统计日期
date_add('$do_date',-3),--新增日期
3,--留存天数
sum(if(login_date_first=date_add('$do_date',-3) and login_date_last='$do_date',1,0)),--$do_date的3日留存数
sum(if(login_date_first=date_add('$do_date',-3),1,0)),--$do_date新增
sum(if(login_date_first=date_add('$do_date',-3) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-3),1,0))*100
from ${APP}.dwt_uv_topic;
insert into table ${APP}.ads_continuity_wk_count
select
'$do_date',
concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)),
count(*)
from
(
select
mid_id
from
(
select
mid_id
from ${APP}.dws_uv_detail_daycount
where dt>=date_add(next_day('$do_date','monday'),-7)
and dt<=date_add(next_day('$do_date','monday'),-1)
group by mid_id
union all
select
mid_id
from ${APP}.dws_uv_detail_daycount
where dt>=date_add(next_day('$do_date','monday'),-7*2)
and dt<=date_add(next_day('$do_date','monday'),-7-1)
group by mid_id
union all
select
mid_id
from ${APP}.dws_uv_detail_daycount
where dt>=date_add(next_day('$do_date','monday'),-7*3)
and dt<=date_add(next_day('$do_date','monday'),-7*2-1)
group by mid_id
)t1
group by mid_id
having count(*)=3
)t2;
insert into table ${APP}.ads_continuity_uv_count
select
'$do_date',
concat(date_add('$do_date',-6),'_','$do_date'),
count(*)
from
(
select mid_id
from
(
select mid_id
from
(
select
mid_id,
date_sub(dt,rank) date_dif
from
(
select
mid_id,
dt,
rank() over(partition by mid_id order by dt) rank
from ${APP}.dws_uv_detail_daycount
where dt>=date_add('$do_date',-6) and dt<='$do_date'
)t1
)t2
group by mid_id,date_dif
having count(*)>=3
)t3
group by mid_id
)t4;
insert into table ${APP}.ads_user_topic
select
'$do_date',
sum(if(login_date_last='$do_date',1,0)),
sum(if(login_date_first='$do_date',1,0)),
sum(if(payment_date_first='$do_date',1,0)),
sum(if(payment_count>0,1,0)),
count(*),
sum(if(login_date_last='$do_date',1,0))/count(*),
sum(if(payment_count>0,1,0))/count(*),
sum(if(login_date_first='$do_date',1,0))/sum(if(login_date_last='$do_date',1,0))
from ${APP}.dwt_user_topic;
with
tmp_uv as
(
select
'$do_date' dt,
sum(if(array_contains(pages,'home'),1,0)) home_count,
sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count
from
(
select
mid_id,
collect_set(page_id) pages
from ${APP}.dwd_page_log
where dt='$do_date'
and page_id in ('home','good_detail')
group by mid_id
)tmp
),
tmp_cop as
(
select
'$do_date' dt,
sum(if(cart_count>0,1,0)) cart_count,
sum(if(order_count>0,1,0)) order_count,
sum(if(payment_count>0,1,0)) payment_count
from ${APP}.dws_user_action_daycount
where dt='$do_date'
)
insert into table ${APP}.ads_user_action_convert_day
select
tmp_uv.dt,
tmp_uv.home_count,
tmp_uv.good_detail_count,
tmp_uv.good_detail_count/tmp_uv.home_count*100,
tmp_cop.cart_count,
tmp_cop.cart_count/tmp_uv.good_detail_count*100,
tmp_cop.order_count,
tmp_cop.order_count/tmp_cop.cart_count*100,
tmp_cop.payment_count,
tmp_cop.payment_count/tmp_cop.order_count*100
from tmp_uv
join tmp_cop
on tmp_uv.dt=tmp_cop.dt;
insert into table ${APP}.ads_product_info
select
'$do_date' dt,
sku_num,
spu_num
from
(
select
'$do_date' dt,
count(*) sku_num
from
${APP}.dwt_sku_topic
) tmp_sku_num
join
(
select
'$do_date' dt,
count(*) spu_num
from
(
select
spu_id
from
${APP}.dwt_sku_topic
group by
spu_id
) tmp_spu_id
) tmp_spu_num
on
tmp_sku_num.dt=tmp_spu_num.dt;
insert into table ${APP}.ads_product_sale_topN
select
'$do_date' dt,
sku_id,
payment_amount
from
${APP}.dws_sku_action_daycount
where
dt='$do_date'
order by payment_amount desc
limit 10;
insert into table ${APP}.ads_product_favor_topN
select
'$do_date' dt,
sku_id,
favor_count
from
${APP}.dws_sku_action_daycount
where
dt='$do_date'
order by favor_count desc
limit 10;
insert into table ${APP}.ads_product_cart_topN
select
'$do_date' dt,
sku_id,
cart_count
from
${APP}.dws_sku_action_daycount
where
dt='$do_date'
order by cart_count desc
limit 10;
insert into table ${APP}.ads_product_refund_topN
select
'$do_date',
sku_id,
refund_last_30d_count/payment_last_30d_count*100 refund_ratio
from ${APP}.dwt_sku_topic
order by refund_ratio desc
limit 10;
insert into table ${APP}.ads_appraise_bad_topN
select
'$do_date' dt,
sku_id,
appraise_bad_count/(appraise_good_count+appraise_mid_count+appraise_bad_count+appraise_default_count) appraise_bad_ratio
from
${APP}.dws_sku_action_daycount
where
dt='$do_date'
order by appraise_bad_ratio desc
limit 10;
insert into table ${APP}.ads_order_daycount
select
'$do_date',
sum(order_count),
sum(order_amount),
sum(if(order_count>0,1,0))
from ${APP}.dws_user_action_daycount
where dt='$do_date';
insert into table ${APP}.ads_payment_daycount
select
tmp_payment.dt,
tmp_payment.payment_count,
tmp_payment.payment_amount,
tmp_payment.payment_user_count,
tmp_skucount.payment_sku_count,
tmp_time.payment_avg_time
from
(
select
'$do_date' dt,
sum(payment_count) payment_count,
sum(payment_amount) payment_amount,
sum(if(payment_count>0,1,0)) payment_user_count
from ${APP}.dws_user_action_daycount
where dt='$do_date'
)tmp_payment
join
(
select
'$do_date' dt,
sum(if(payment_count>0,1,0)) payment_sku_count
from ${APP}.dws_sku_action_daycount
where dt='$do_date'
)tmp_skucount on tmp_payment.dt=tmp_skucount.dt
join
(
select
'$do_date' dt,
sum(unix_timestamp(payment_time)-unix_timestamp(create_time))/count(*)/60 payment_avg_time
from ${APP}.dwd_fact_order_info
where dt='$do_date'
and payment_time is not null
)tmp_time on tmp_payment.dt=tmp_time.dt;
with
tmp_order as
(
select
user_id,
order_stats_struct.sku_id sku_id,
order_stats_struct.order_count order_count
from ${APP}.dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct
where date_format(dt,'yyyy-MM')=date_format('$do_date','yyyy-MM')
),
tmp_sku as
(
select
id,
tm_id,
category1_id,
category1_name
from ${APP}.dwd_dim_sku_info
where dt='$do_date'
)
insert into table ${APP}.ads_sale_tm_category1_stat_mn
select
tm_id,
category1_id,
category1_name,
sum(if(order_count>=1,1,0)) buycount,
sum(if(order_count>=2,1,0)) buyTwiceLast,
sum(if(order_count>=2,1,0))/sum( if(order_count>=1,1,0)) buyTwiceLastRatio,
sum(if(order_count>=3,1,0)) buy3timeLast ,
sum(if(order_count>=3,1,0))/sum( if(order_count>=1,1,0)) buy3timeLastRatio ,
date_format('$do_date' ,'yyyy-MM') stat_mn,
'$do_date' stat_date
from
(
select
tmp_order.user_id,
tmp_sku.category1_id,
tmp_sku.category1_name,
tmp_sku.tm_id,
sum(order_count) order_count
from tmp_order
join tmp_sku
on tmp_order.sku_id=tmp_sku.id
group by tmp_order.user_id,tmp_sku.category1_id,tmp_sku.category1_name,tmp_sku.tm_id
)tmp
group by tm_id, category1_id, category1_name;
insert into table ${APP}.ads_area_topic
select
'$do_date',
id,
province_name,
area_code,
iso_code,
region_id,
region_name,
login_day_count,
order_day_count,
order_day_amount,
payment_day_count,
payment_day_amount
from ${APP}.dwt_area_topic;
"
$hive -e "$sql"
ods层:24张
log:1张
业务数据:23张。
dwd层:19张
log:5张
业务数据:6张维度表,8张事实表
dws和dwt层:12张
维度:设备、会员、商品、地区、时间、活动:12张
ads层:20张
设备:8张
会员:2张
商品:6张
营销:3张
地区:1张
共计:20 + 12 + 19 + 24 = 75张
另外:
临时表:拉链表:1张
时间表:1张
赞助