1.2版DB版本1.1.DB2 11.51.2.Oracle 19c1.3.PostgreSQL 121.4.MySQL 8.01.5.SQL Server 20172.子查询2.1.如果想创建虚拟表,以便对其执行包含窗口函数或聚合函数的查询,那么最简单的做法无疑是使用子查询2.2.只需编写一个查询并将其放在括号内,然后再编写另一个使用它的查询2.3.有些 RDBMS 要求给子查询表指定别名,有些则不要求这样做2.3.1.Oracle 不要求2.4.示例2.4.1.sql
select max(HeadCount) as HighestJobHeadCount from(select job,count(empno) as HeadCountfrom empgroup by job) head_count_tab
3.通用表表达式3.1.为了克服子查询的一些局限性,引入了 CTE3.1.1.让 SQL 支持递归3.2.CTE 的可读性更高3.2.1.sql
with head_count_tab (job,HeadCount) as(select job,count(empno)from empgroup by job)select max(HeadCount) as HighestJobHeadCountfrom head_count_tab
3.3.递归 CTE 来计算前 20 个斐波那契数3.3.1.sql
with recursive workingTable (fibNum, NextNumber, index1)as(select 0,1,1union allselect fibNum+nextNumber,fibNUm,index1+1from anchorwhere index1<20)select fibNum from workingTable as fib
3.3.2.关键字 RECURSIVE,它在 MySQL、Oracle 和 PostgreSQL 中必不可少,但在 SQL Server 和 DB2 中是可选的3.3.3.index1 列旨在简化在 WHERE 子句中指定返回行数的工作3.3.4.WHERE 子句至关重要,如果没有它,查询将不会终止4.根据发音比较字符串4.1.匹配单词4.1.1.匹配拼写正确和拼写错误的单词4.1.2.匹配拼写方式不同(比如英式拼写和美式拼写)的单词4.1.3.匹配由不同字符串表示的单词4.1.3.1.查找拼写不同但发音相同的字符串4.2.函数 SOUNDEX 将字符串转换为英语发音4.2.1.SOUNDEX 会保留第一个字母,并将其他字母替换为数字4.2.2.发音相似的字母将被替换为相同的数字4.3.SQL Server 函数 DIFFERENCE 会使用 SOUNDEX 对两个字符串进行比较,并返回表示相似程度的数字 0~44.4.sql
a_name----1 Johnson2 Jonson3 Jonsen4 Jensen5 Johnsen6 Shakespeare7 Shakspear8 Shaekspir9 Shakespar
4.4.1.sql
select an1.a_name as name1, an2.a_name as name2, SOUNDEX(an1.a_name) as Soundex_Name from author_names an1 join author_names an2 on (SOUNDEX(an1.a_name)=SOUNDEX(an2.a_name) and an1.a_name not like an2.a_name)
5.查找与模式不匹配的文本5.1.sql
select emp_id, text from employee_commentEMP_ID TEXT---------- ------------------------------------------------------------7369 126 Varnum, Edmore MI 48829, 989 313-53517499 1105 McConnell Court Cedar Lake MI 48812 Home: 989-387-4321 Cell: (237) 438-3333
5.2.列出其中电话号码格式不正确的行5.3.sql
select emp_id, textfrom employee_commentwhere regexp_like(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}') and regexp_like( regexp_replace(text, '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}',''), '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}') EMP_ID TEXT---------- ------------------------------------------------------------ 7369 126 Varnum, Edmore MI 48829, 989 313-5351 7844 989-387.5359 9999 906-387-1698, 313-535.8886
6.使用绝对中位差找出异常值6.1.值存在疑问的原因6.1.1.数据收集方式有问题6.1.1.1.记录值的仪表存在误差6.1.2.数据输入错误导致的6.1.3.数据生成时环境出现异常6.1.3.1.意味着数据点是正确的,但应谨慎根据数据得出任何结论6.2.检测异常数据的常用方法6.2.1.计算数据的标准偏差,并将超过 3 倍标准偏差(或其他类似距离)的数据点视为异常数据6.2.2.如果数据不符合正态分布,则这种方法可能错误地识别异常数据,而当数据分布不对称,或者如果你远离平均值,数据就不像正态分布那样变得稀疏时更是如此6.3.偏差是中值与各个值的绝对差6.4.绝对中位差是偏差的中值6.5.Oracle6.5.1.sql
withDeviation (Deviation) as(select abs(sal-median(sal))from emp),MAD (MAD) as(select median(Deviation)from Deviation )select abs(sal-median)/MAD, sal, ename, jobFROM MAD join emp
6.6.SQL Server6.6.1.sql
with median (median)as(select distinct percentile_cont(0.5) within group(order by sal) over()from emp),Deviation (Deviation) as(Select abs(sal-median)from emp join median on 1=1),MAD (MAD) as(select DISTINCT PERCENTILE_CONT(0.5) within group(order by deviation) over()from Deviation )select abs(sal-median)/MAD, sal, ename, jobfrom MAD join emp on 1=1
6.7.DB26.8.PostgreSQL6.9.PERCENTILE_CONT 被视为聚合函数,而不是窗口函数6.9.1.sql
with median (median)as(select percentile_cont(0.5) within group(order by sal)from emp),devtab (deviation) as(select abs(sal-median)from emp join median),MedAbsDeviation (MAD) as(select percentile_cont (0.5) within group(order by deviation)from devtab)select abs(sal-median)/MAD, sal, ename, jobFROM MedAbsDeviation join emp
6.10.MySQL6.10.1.sql
with rank_tab (sal, rank_sal) as (select sal, cume_dist() over (order by sal)from emp),inter as(select sal, rank_sal from rank_tabwhere rank_sal>=0.5unionselect sal, rank_sal from rank_tabwhere rank_sal<=0.5),medianSal (medianSal) as(select (max(sal)+min(sal))/2from inter),deviationSal (Sal,deviationSal) as(select Sal,abs(sal-medianSal)from emp join medianSalon 1=1),distDevSal (sal,deviationSal,distDeviationSal) as(select sal,deviationSal,cume_dist() over (order by deviationSal)from deviationSal),DevInter (DevInter, sal) as(select min(deviationSal), salfrom distDevSalwhere distDeviationSal >= 0.5unionselect max(DeviationSal), salfrom distDevSalwhere distDeviationSal <= 0.5),MAD (MedianAbsoluteDeviance) as(select abs(emp.sal-(min(devInter)+max(devInter))/2)from emp join DevInter on 1=1)select emp.sal,MedianAbsoluteDeviance,(emp.sal-deviationSal)/MedianAbsoluteDeviancefrom (emp join MAD on 1=1) join deviationSal on emp.sal=deviationSal.sal
6.10.2.没有提供函数 MEDIAN 或 PERCENTILE_CONT6.10.3.需要在 CTE 中使用两个子查询6.11.计算中值,然后计算这个中值与各个值的绝对偏差的中值,即绝对中位差6.12.使用查询来找出每个值相对于中值的偏差与绝对中位差的比值6.13.可以像使用标准偏差那样使用这些比值了6.13.1.如果一个值相对于中值的偏差是绝对中位差的 3 倍以上,就可以认为它是异常值6.14.优点6.14.1.即便数据不呈正态分布,它依然有效6.14.2.即便数据分布不平衡,绝对中位差给出的答案依然合理7.使用本福特法则查找反常数据7.1.检测不像异常值那样显而易见的反常数据的一种方式是查看数字位的出现频率,这种频率通常符合本福特法则7.2.本福特法则最常用于检测数据造假7.2.1.在数据集中人为地添加伪造的数字7.2.2.用于检测不符合预期规律的数据7.3.本福特法则7.3.1.计算数字位的期望分布7.3.2.将其与实际分布进行比较7.4.sql
withFirstDigits (FirstDigit)as(select left(cast(SAL as CHAR),1) as FirstDigit from emp),TotalCount (Total)as (select count(*) from emp),ExpectedBenford (Digit,Expected)as (select ID,(log10(ID + 1) - log10(ID)) as expected from t10 where ID < 10)select count(FirstDigit),Digit,coalesce(count(*)/Total,0) as ActualProportion,ExpectedFrom FirstDigits Join TotalCount Right Join ExpectedBenford on FirstDigits.FirstDigit=ExpectedBenford.Digitgroup by Digitorder by Digit
7.5.最终的结果集包含 4 列数据,分别是第一位的预测频率、第一位的实际频率、本福特法则预测的前几位的频率,以及前几位的实际频率8.SQL 不像专用包 SAS、统计编程语言 R 和 Python 统计库那样提供了完备的统计工具9.SQL Server9.1.PIVOT操作符9.1.1.sql
DEPT_10 DEPT_20 DEPT_30 DEPT_40------- ---------- ---------- ---------- 3 5 6 0
9.1.2.sql
select [10] as dept_10, [20] as dept_20, [30] as dept_30, [40] as dept_40 from (select deptno, empno from emp) driver pivot ( count(driver.empno) for driver.deptno in ( [10],[20],[30],[40] ) ) as empPivot
9.1.2.1.不使用CASE表达式或额外的连接操作9.2.UNPIVOT操作符9.2.1.sql
ACCOUNTING RESEARCH SALES OPERATIONS---------- ---------- ---------- ---------- 3 5 6 0
9.2.2.sql
DNAME CNT-------------- ----------ACCOUNTING 3RESEARCH 5SALES 6OPERATIONS 0
9.2.3.sql
select DNAME, CNT from ( select [ACCOUNTING] as ACCOUNTING, [SALES] as SALES, [RESEARCH] as RESEARCH, [OPERATIONS] as OPERATIONS from ( select d.dname, e.empno from emp e,dept d where e.deptno=d.deptno ) driver pivot ( count(driver.empno) for driver.dname in ([ACCOUNTING],[SALES],[RESEARCH],[OPERATIONS]) ) as empPivot ) new_driver unpivot (cnt for dname in (ACCOUNTING,SALES,RESEARCH,OPERATIONS) ) as un_pivot