HW1

Submission requirements:

Please submit your solutions to our class website.

Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.

(b) Starting with the base cuboid [date, spectator, location, game]，what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles” />step 1. Roll–up on date from date_key to allstep 2. Roll–up on spectator from spectator_key to statusstep 3. Roll–up on location from location_key to location_namestep 4. Roll–up on game from game_key to allstep 5. Dice with “status=student” and “location_name=Los Angeles”

优点

位图索引是一种高效的索引结构，在查询、过滤等方面上，由于进行的是位运算，所以比常规的查询方式快很多。例如在本仓库中，假设对于spectator表的子列status，我们有：

spectator_key	status	gender
0	学生	男
1	成人	女
2	学生	男
3	学生	女
4	老人	女

status就可以建立以下位图索引：

status="学生" : 10110status="成人" : 01000status="老人" : 00001

gender可以建立以下位图索引：

gender="男": 10100gender="女": 01011

例如，我们想要查询学生，只需要用10110去过滤原始数据就行。

我们想混合查询，比如同时查询status="学生"和gender="男"的数据，只需要进行并操作就行了：

10110 & 10100 = 10100

可以大大提高计算速度。

此外，位图索引可以在一定程度上绕开原始数据，进一步提高处理速度。例如，我们想统计满足上面条件的人数，只需要:

ans=0x=(10110&10100)while x:x&=(x-1)ans+=1

缺点

位图索引比较适合枚举类型，也就是离散型变量，对于连续变量，位图索引并不适用，往往需要先做离散化。比如本仓库中，phone number字段可能就不太适合(也许这个字段没有存在的必要？)

而当属性列非常多时，我们做位图索引的开销也比较大。

Q2．某电子邮件数据库中存储了大量的电子邮件。请设计数据仓库的结构，以便用户从多个维度进行查询和挖掘。

Q3. Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:

age	23	23	27	27	39	41	47	49	50	52	54	54	56	57	58	58	60	61
%fat	9.5	26.5	7.8	17.8	31.4	25.9	27.4	27.2	31.2	34.6	42.5	28.8	33.4	30.2	34.1	32.9	41.2	35.7

(a) Calculate the mean, median, and standard deviation of age and %fat.

 age %fatmean 46.44444428.783333std13.218624 9.254395median51.0 30.7

(b) Draw the boxplots for age and %fat.

(d) Normalize age based on min-max normalization.

x=data["age"]y=data['%fat']X=(x-x.min())/(x.max()-x.min())Y=(y-y.min())/(y.max()-y.min())print(X,Y)

Result is:

0 0.0000001 0.0000002 0.1052633 0.1052634 0.4210535 0.4736846 0.6315797 0.6842118 0.7105269 0.763158100.815789110.815789120.868421130.894737140.921053150.921053160.973684171.000000

(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated” />print(np.corrcoef(x,y))print(“相关系数” ,stats.pearsonr(x,y)[0])

Result is

[[1.0.8176188] [0.8176188 1. ]]相关系数 0.8176187964565874

I think they are positively correlated.

(f) Smooth the fat data by bin means, using a bin depth of 6.

def mean(x):return round(sum(x)/len(x),2)N_y=sorted(y)bins=[[]]for j in N_y:bins[-1].append(j)if len((v:=bins[-1]))==6:v[:]=[mean(v)]*len(v)bins.append([])for i,jin enumerate(bins[:-1]):print("bin %d is :"%(i+1),j)

bin 1 is : [19.12, 19.12, 19.12, 19.12, 19.12, 19.12]bin 2 is : [30.32, 30.32, 30.32, 30.32, 30.32, 30.32]bin 3 is : [36.92, 36.92, 36.92, 36.92, 36.92, 36.92]

(g) Smooth the fat data by bin boundaries, using a bin depth of 6.

这里因为我们是对排好序的数据做处理，所以可以通过二分法进行优化，获取中间分界。

def close(x,a,b):# 是否靠近下界return (x-a)<=(b-x)def boundary(x):Min=x[0]Max=x[-1]l,r=0,len(x)-1while l<=r:mid=(r-l)//2+lif close(x[mid],Min,Max):if not close(x[mid+1],Min,Max):l=midbreakl=mid+1else:if close(x[mid-1],Min,Max):l=midbreakr=mid-1return [[Min]*l+[Max]*(len(x)-l)]N_y=sorted(y)bins=[[]]for j in N_y:bins[-1].append(j)if len((v:=bins[-1]))==6:v[:]=boundary(v)bins.append([])for i,jin enumerate(bins[:-1]):print("bin %d is :"%(i+1),j)

bin 1 is : [[7.8, 7.8, 27.2, 27.2, 27.2, 27.2]]bin 2 is : [[27.4, 27.4, 32.9, 32.9, 32.9, 32.9]]bin 3 is : [[33.4, 33.4, 33.4, 33.4, 42.5, 42.5]]

文章版权归作者所有，未经允许请勿转载。

THE END

文章

国科大数据挖掘课程HW1

HW1

Submission requirements:

Please submit your solutions to our class website.

云原生之深入解析Thanos在EKS多集群架构上存储多个集群Prometheus

strcmp函数详解看这一篇就够了-C语言（函数实现、使用用法举例、作用、与strncmp的区别）

Java中csv文件读写分析

c语言文件操作函数的讲解；fopen()；fputs()；fgets()；fflush()；fsync()；fileno()；open()；fclose()；等函数的含义及其作用；

pytorch进阶学习（二）：使用DataLoader读取自己的数据集

2023年度数学建模竞赛汇总