The cause: the data underneath the Databricks cluster sits on HDFS-style storage. Even with Spark as the read/write engine, if small files are not compacted in time, the small-file problem still burns a lot of resources and the cluster gets slower and slower. The main approach currently in use: scheduled compaction plus deletion of old table versions.
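To gauge how fragmented a Delta table actually is before and after compaction, DESCRIBE DETAIL reports the file count and total size. A minimal sketch; the database and table names here are placeholders, not from the script below:

# Inspect a Delta table's file count and size to judge small-file fragmentation.
# "demo_db.events" is a placeholder name used only for illustration.
detail = spark.sql("DESCRIBE DETAIL demo_db.events").collect()[0]
print("numFiles:", detail["numFiles"])        # many small files -> worth running OPTIMIZE
print("sizeInBytes:", detail["sizeInBytes"])  # total table size on storage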
1、The Python script is below; anyone who needs it can use it as a reference.
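The script loops over a database_list variable that is not shown here. Assuming you want to cover every database visible to the cluster, one way to build it is a simple sketch like this (positional access row[0] is used because the output column name differs across Spark versions, "databaseName" vs. "namespace"):

# Build the list of database names the maintenance loops will walk through.
database_list = [row[0] for row in spark.sql("SHOW DATABASES").collect()]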
# Compact: run OPTIMIZE on every table of every database in database_list
for database_name in database_list:
    sqlQueryShowTables = "SHOW TABLES FROM {0}".format(database_name)
    tablesDF = spark.sql(sqlQueryShowTables).collect()
    for table in tablesDF:
        sqlOptimizeTable = "OPTIMIZE {0}.{1}".format(database_name, table['tableName'])
        try:
            spark.sql(sqlOptimizeTable)
            print("INFO: Optimize table {0}.{1} completed.".format(database_name, table['tableName']))
        except Exception as e:
            print("ERROR: Optimize table {0}.{1} failed: {2}".format(database_name, table['tableName'], e))
# Delete old versions: run VACUUM on every table, retaining 168 hours of history
for database_name in database_list:
    sqlQueryShowTables = "SHOW TABLES FROM {0}".format(database_name)
    tablesDF = spark.sql(sqlQueryShowTables).collect()
    for table in tablesDF:
        sqlVACUUMTable = "VACUUM {0}.{1} RETAIN 168 HOURS".format(database_name, table['tableName'])
        try:
            spark.sql(sqlVACUUMTable)
            print("INFO: VACUUM table {0}.{1} completed.".format(database_name, table['tableName']))
        except Exception as e:
            print("ERROR: VACUUM table {0}.{1} failed: {2}".format(database_name, table['tableName'], e))
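A note on the retention value: RETAIN 168 HOURS keeps seven days of history, which matches Delta Lake's default retention window. Choosing a shorter window would require disabling the spark.databricks.delta.retentionDurationCheck.enabled safety check, so 168 hours is a conservative choice here.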
2、Then just configure a scheduled trigger for the notebook in Workflows.
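The schedule is easiest to set up in the Workflows UI, but it can also be created programmatically. A rough sketch against the Jobs API 2.1 follows; the job name, notebook path, cluster id, and cron expression are all placeholders I am assuming for illustration, not values from this post:

import requests

# Placeholders: replace with your workspace URL and a personal access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "delta-maintenance",  # hypothetical job name
    "tasks": [
        {
            "task_key": "optimize_and_vacuum",
            "notebook_task": {"notebook_path": "/Repos/maintenance/optimize_vacuum"},  # placeholder path
            "existing_cluster_id": "<cluster-id>",  # placeholder cluster id
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00, Quartz syntax
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

resp = requests.post(
    "{0}/api/2.1/jobs/create".format(DATABRICKS_HOST),
    headers={"Authorization": "Bearer " + TOKEN},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success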