12、SQL窗口函数
SQL窗口函数
一. 什么是窗口函数
基本含义
窗口限定一个范围,它可以理解为满足某些条件的记录集合,窗口函数也就是在窗口范围内执行的函数。
基本语法
窗口函数有over关键字,指定函数执行的范围,可分为三部分:分组子句(partition by),排序子句(order by),窗口子句(rows)
<函数名> over (partition by <分组的列> order by <排序的列> rows between <起始行> and <终止行>)
注意Mysql8才支持窗口函数
演示表格
cid(班级id) | sname(学生姓名) | score(分数) |
---|---|---|
001 | 张三 | 78 |
001 | 李四 | 82 |
002 | 小明 | 90 |
001 | 王五 | 67 |
002 | 小红 | 85 |
002 | 小刚 | 62 |
演示脚本
CREATE TABLE SQL_5 (
cid varchar(4),
sname varchar(4),
score int
);
insert into SQL_5 (cid, sname, score) values ('001', '张三', 78);
insert into SQL_5 (cid, sname, score) values ('001', '李四', 82);
insert into SQL_5 (cid, sname, score) values ('002', '小明', 90);
insert into SQL_5 (cid, sname, score) values ('001', '王五', 67);
insert into SQL_5 (cid, sname, score) values ('002', '小红', 85);
insert into SQL_5 (cid, sname, score) values ('002', '小刚', 62);
二. 窗口的确定
例子:
select *, sum(score) over (partition by cid order by score rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;
分组子句(partition by)
不分组可以写成partition by null或者直接不写
后面可以跟多个列, 如 partition by cid, sname
注意 partition by与group by的区别
1)前者不会压缩行数但是后者会
2)后者只能选取分组的列和聚合的列
也就是说group by 后生成的结果集与原表的行数和列数都不同
排序子句(order by)
不排序可以写成order by null 或者直接不写
asc或不写表示升序,desc表示降序
后面可以跟多个列, 如 order by cid, sname
窗口子句(rows)
窗口子句的描述
起始行: N preceding/unbounded preceding
当前行: current row
终止行: N following/unbounded following
举例:
rows between unbounded preceding and current row 从之前所有的行到当前行
rows between 2 preceding and current row 从前面两行到当前行
rows between current row and unbounded following 从当前行到之后所有的行
rows between current row and 1following 从当前行到后面一行
注意:
排序子句后面缺少窗口子句,窗口规范默认是 rows between unbounded preceding and current row
排序子句和窗口子句都缺失,窗口规范默认是 rows between unbounded preceding and unbounded following
总体流程
通过partition by 和 order by 子句确定大窗口( 定义出上界unbounded preceding和下界unbounded following)
通过row 子句针对每一行数据确定小窗口(滑动窗口)
对每行的小窗口内的数据执行函数并生成新的列
三. 函数分类
排序类
rank, dense_rank, row_number
-- 【排序类】
-- 按班级分组后打上序号 不考虑并列
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' from SQL_5;
-- 按班级分组后作跳跃排名 考虑并列
select *, rank() over (partition by cid order by score desc) as '跳跃可并列排名' from SQL_5;
-- 按班级分组后作连续排名 考虑并列
select *, dense_rank() over (partition by cid order by score desc) as '连续可并列排名' from SQL_5;
-- 合并起来对比
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' ,
rank() over (partition by cid order by score desc) as '跳跃可并列排名',
dense_rank() over (partition by cid order by score desc) as '连续可并列排名'
from SQL_5;
聚合类
sum. avg, count, max, min
-- 【聚合类】
-- 让同一班级每个学生都知道班级总分是多少
select *, sum(score) over (partition by cid) as '班级总分' from SQL_5;
-- 或者可以写成
select *, sum(score) over (partition by cid rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;
-- 计算同一班级,每个同学和比他分数低的同学的累计总分是多少
select *, sum(score) over (partition by cid order by score) '累加分数' from SQL_5;
-- 或者可以写成 其中rows between ... and 是规定窗口大小
select *, sum(score) over (partition by cid order by score rows between unbounded preceding and current row) as '累加分数' from SQL_5;
跨行类
lag, lead
-- 【跨行类】
-- lag/lead 函数 参数1:比较的列 参数2: 偏移量 参数3:找不到的默认值
-- 同一班级内,成绩比自己低一名的分数是多少
select *, lag(score, 1) over (partition by cid order by score) as '低一名的分数' from SQL_5;
-- 或者写成
select *, lag(score, 1, 0) over (partition by cid order by score) as '低一名的分数' from SQL_5;
-- 同一班级内,成绩比自己高2名的分数是多少
select *, lead(score, 2) over (partition by cid order by score) as '高两名的分数' from SQL_5;
四. 相关题目
表格
cid | sname | course | score |
---|---|---|---|
001 | 张三 | 语文 | 78 |
002 | 小刚 | 语文 | 71 |
001 | 李四 | 数学 | 56 |
002 | 小明 | 数学 | 54 |
... | ... | ... | ... |
脚本
CREATE TABLE SQL_6 (
cid varchar(4),
sname varchar(4),
course varchar(10),
score int
);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '语文', 78);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '语文', 71);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '数学', 56);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '数学', 97);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '数学', 54);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '数学', 67);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '数学', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '语文', 80);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '数学', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '语文', 58);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '语文', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '语文', 60);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '英语', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '英语', 50);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '地理', 59);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '地理', 88);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '地理', 45);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '地理', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '地理', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '英语', 81);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '地理', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '英语', 55);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '英语', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '英语', 61);
分组内topN
问题1:求出每个学生成绩最高的三条记录
select * from
(
select *, row_number() over (partition by sname order by score desc) as rn from SQL_6
) temp
where rn <= 3
公式:
select * from
(
select *, row_number() over (partition by 分组列 order by 比较列) as rn from table
) as tmp
where rn <= N;
汇总分析
问题2:找出每门课程都高于班级课程平均分的学生
可以拆解成以下几个问题:
1)求出每个班级,每门课程的平均分
with
-- 1) 求出每个班级,每门课程的平均分
t1 as
(
select *,
avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),
2)将学生每门课程的成绩与所在班级的对应课程平均分相减,结果大于0就说明该学生的这门成绩高于课程平均分
t2 as (
select *,
score - avg as 'del'
from t1
)
3)“找出每门课程都高于班级课程平均分的学生”说明对于学生来说,最小的“相减结果”都是大于0的
select sname from t2
group by sname
having min(del) > 0;
合并后的SQL语句
with
t1 as
(
select *,
avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),
t2 as (
select *,
score - avg as 'del'
from t1
)
select sname from t2
group by sname
having min(del) > 0;
-- 或者
select sname from (
select *,
score - avg as 'del'
from (
select *,
avg(score) over (partition by cid, course) as 'avg'
from SQL_6
) t1
) t2
group by sname
having min(del) > 0;
表格
empno | ename | hire_date | salary | dept_no |
---|---|---|---|---|
001 | Adam | 2018-03-01 | 1000 | A |
002 | Bill | 2021-03-01 | 1200 | A |
003 | Cindy | 2016-03-01 | 1500 | A |
004 | Danny | 2020-03-01 | 5000 | A |
005 | Eason | 2020-03-01 | 4000 | B |
006 | Fred | 2018-03-01 | 3500 | B |
007 | Gary | 2017-03-01 | 1800 | B |
008 | Hugo | 2020-03-01 | 2500 | B |
脚本
CREATE TABLE SQL_7 (
empno varchar(4),
ename varchar(10),
hire_date varchar(10),
salary int,
dept_no varchar(2)
);
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('001', 'Adam', '2018-03-01', 1000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('002', 'Bill', '2021-03-01', 1200, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('003', 'Cindy', '2016-03-01', 1500, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('004', 'Danny', '2020-03-01', 5000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('005', 'Eason', '2020-03-01', 4000, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('006', 'Fred', '2018-03-01', 3500, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('007', 'Gary', '2017-03-01', 1800, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('008', 'Hugo', '2020-03-01', 4500, 'B');
select * from SQL_7;
分组内topN
问题一:求出每个部门工资最高的前三名员工
select * from
(
select *, row_number() over (partition by dept_no order by salary desc) as rn from SQL_7
) as tmp
where rn <= 3;
汇总分析
问题二:计算这些员工的工资占所属部门总工资的百分比
with
t1 as (
select * , sum(salary) over (partition by dept_no) as 'sum_sal' from SQL_7
),
t2 as (
select *, round(salary*100/sum_sal,2) as 'percentage' from t1
)
select * from t2;
问题三:对各部门员工的工资进行从小到大排序,排名前30%为低层,30%-80%为中层,高于80%为高层,并打上标签
with
t1 as (
select * , row_number() over (partition by dept_no order by salary) as cnt,
count(empno) over (partition by dept_no) as 'sum' from SQL_7
),
t2 as (
select *, round(cnt/sum,2) as 'percentage' from t1
),
t3 as (
select *, case when percentage <= 0.3 then '低层'
when percentage <= 0.8 then '中层'
when percentage <= 1 then '高层' end as 'label'
from t2
)
select empno, ename, hire_date, salary, dept_no, label from t3;
问题四:统计每年入职总数以及截至本年累计入职总人数(本年总入职人数 + 本年之前所有年的总入职人数之和)
select year(hire_date) as hire_year, count(empno) as cnt
from SQL_7
group by year(hire_date) order by hire_year;
with t1 as (
select year(hire_date) as hire_year, count(empno) as cnt from SQL_7 group by year(hire_date) order by hire_year
)
select *, sum(cnt) over(partition by null rows between unbounded preceding and current row) as sum from t1;
五. 技巧
1)分组内topN公式
select * from
(
select *, row_number() over (partition by 分组列 order by 比较列) as rn from table
) as tmp
where rn <= N;
窗口函数 -> 生成辅助列(相当于高级语言的临时变量)
with 语句 -> 生成临时表(相当于高级语言的局部方法)
把复杂的问题拆分成多个子问题并用临时表去表达