xxl-job is a popular distributed job scheduling platform. It is a centralized scheduling component: all scheduling is coordinated by xxl-job-admin, and the core scheduling logic lives in JobScheduleHelper.

Scheduling Mechanics and Problem Analysis

The scheduling thread scheduleThread polls the database for jobs that have missed their trigger time or are about to be triggered, and processes them. Ideally, each job lands in the corresponding slot queue of the time ring just before its trigger time. This processing must be fast enough; otherwise jobs miss their trigger times or fire unevenly. If there are many densely scheduled jobs and a polling round takes too long, jobs are not placed into the ring slots in time and miss their schedule: they may be fired directly on the next polling round (misfire compensation), or pile up in the ring and fire back-to-back the next time the ring thread reads that slot queue, since the slot queue is a plain List with no de-duplication. Processing of each individual job must also be fast, or a single polling round of the scheduling thread takes too long overall.
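
For orientation, the stock single-threaded loop looks roughly like this (a condensed sketch of the upstream JobScheduleHelper; locking, error handling and the persistence step are omitted):

// Condensed sketch of the stock scheduleThread loop (upstream xxl-job).
while (!scheduleThreadToStop) {
    long nowTime = System.currentTimeMillis();
    // pre-read: jobs due within the next 5 seconds (PRE_READ_MS)
    List<XxlJobInfo> scheduleList =
            xxlJobInfoDao.scheduleJobQuery(nowTime + PRE_READ_MS, preReadCount);
    for (XxlJobInfo jobInfo : scheduleList) {
        if (nowTime > jobInfo.getTriggerNextTime() + PRE_READ_MS) {
            // missed by more than 5s: apply the misfire strategy, then refresh next trigger time
        } else if (nowTime > jobInfo.getTriggerNextTime()) {
            // missed by less than 5s: trigger immediately, then refresh next trigger time
        } else {
            // on time: push into the ring slot (triggerNextTime / 1000 % 60)
        }
    }
    // persist refreshed trigger times, then sleep to align with the next second
}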

The time-ring thread ringThread is much simpler: once per second it reads the slot queue for the current second and hands the jobs off for asynchronous execution. This is fast; even single-threaded, the efficiency is acceptable.

To keep scheduling consistent, the scheduling center uses a DB lock so that across the cluster each scheduled run is triggered exactly once. The lock is an exclusive select ... for update. It is very coarse, which becomes a performance problem under heavy job loads; worse, if a network failure or a crash leaves the transaction uncommitted and the lock unreleased, all job scheduling blocks, restarting xxl-job-admin does not help, and someone has to release the lock in the database by hand. Under heavy load, even a clustered xxl-job-admin deployment effectively degrades to single-threaded serial execution because of this one lock. We hit the unreleased-lock problem in our own business: our database takes a full backup at 4 a.m. every day, the backup also takes locks, and this occasionally leaves the xxl-job lock unreleased. The high frequency of this lock also interferes with the backup itself.
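
For reference, the stock lock is a single row in the xxl_job_lock table, taken at the top of every polling round and held until the scheduling transaction commits:

SELECT * FROM xxl_job_lock WHERE lock_name = 'schedule_lock' FOR UPDATE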

In short, xxl-job's scheduling design has real weaknesses: with large job volumes or suboptimal database operations, scheduling efficiency suffers. The fix is not complicated: first, shard the scheduling work and run it on multiple threads; second, replace the exclusive database lock with a distributed lock that is more efficient, safer, and has fewer side effects, such as a Redis distributed lock.

Optimizing Scheduling Efficiency: Ideas and Design

Schedule Sharding

Sharding the scheduling work is straightforward. The original query pages through all jobs by trigger status and next trigger time; we shard it on xxl_job_info's id by adding a condition of the form id % #{shardingTotal} = #{shardingNow}. For example, with 3 shards, shard 1 handles ids 1, 4, 7, and so on.

SELECT *
FROM xxl_job_info AS t
WHERE t.trigger_status = 1
  AND t.trigger_next_time <= #{maxNextTime}
ORDER BY id ASC
LIMIT #{pagesize}

After the change:

SELECT *
FROM xxl_job_info AS t
WHERE t.trigger_status = 1
  AND t.trigger_next_time <= #{maxNextTime}
  AND t.id % #{shardingTotal} = #{shardingNow}
ORDER BY id ASC
LIMIT #{pagesize}

The scheduling thread changes from a single thread to a pool of threads, one thread per shard:

final XxlJobAdminConfig adminConfig = XxlJobAdminConfig.getAdminConfig();
// total shard count, externalized configuration
final int scheduleShardingCount = adminConfig.getScheduleShardingCount();
// scheduling thread pool, one thread per shard
scheduleExecutor = new ThreadPoolExecutor(scheduleShardingCount,
        scheduleShardingCount, 300, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(),
        new NamedThreadFactory("xxl-job-scheduler", true),
        new ThreadPoolExecutor.AbortPolicy());
for (int i = 0; i < scheduleShardingCount; i++) {
    final int lockFlag = i;
    scheduleExecutor.execute(() -> {
        // ... preamble
        // per-shard scheduling logic, using the sharding query
        List<XxlJobInfo> scheduleList = adminConfig.getXxlJobInfoDao()
                .scheduleJobQueryWithSharding(nowTime + PRE_READ_MS, preReadCount,
                        scheduleShardingCount, lockFlag);
        // ... schedule handling
    });
}

With multiple scheduling threads, synchronization becomes a concern: the time-ring slot queues are switched from List to BlockingQueue, which keeps the synchronization simple.

private static volatile Map<Integer, BlockingQueue<Integer>> ringData = new ConcurrentHashMap<>();

Pushing a job into a time-ring slot accordingly becomes a BlockingQueue add:

private void pushTimeRing(int ringSecond, int jobId) {
    // push async ring
    final BlockingQueue<Integer> queue =
            ringData.computeIfAbsent(ringSecond, k -> new LinkedBlockingQueue<>());
    queue.add(jobId);
    if (logger.isDebugEnabled()) {
        logger.debug(">>>>>>>>>>> xxl-job, schedule push time-ring : " + ringSecond
                + " = " + new ArrayList<>(queue));
    }
}

The ring thread's reading of the slot queues changes accordingly:

// second data
int nowSecond = Calendar.getInstance().get(Calendar.SECOND);
// check one slot back as well, in case processing ran long and skipped a tick
for (int i = 0; i < 2; i++) {
    final BlockingQueue<Integer> queue = ringData.get((nowSecond + 60 - i) % 60);
    if (queue != null && !queue.isEmpty()) {
        queue.drainTo(ringItemData);
    }
}
// ring trigger
if (logger.isDebugEnabled()) {
    logger.debug(">>>>>>>>>>> xxl-job, time-ring beat : " + nowSecond
            + " = " + Collections.singletonList(ringItemData));
}
if (ringItemData.size() > 0) {
    // do trigger
    for (int jobId : ringItemData) {
        JobTriggerPoolHelper.trigger(jobId, TriggerTypeEnum.CRON, -1, null, null, null);
    }
    // clear
    ringItemData.clear();
}

Replacing the Distributed Lock

A replacement lock has to address lock granularity and automatic release. Distributed locks are well-trodden ground: a Redis lock can be built on setnx/ex (or a Lua script) or taken straight from Redisson's lock API; a Zookeeper lock can use Curator's InterProcessMutex. Whether a database optimistic lock is feasible is left for the reader to explore. A Redis lock is simple and fast, and it is the approach I recommend. In our project, however, the component already integrates Dubbo with Zookeeper as the registry, and we did not want to pull in Redis as well, so we went with the Zookeeper distributed lock.
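
For comparison, here is a minimal sketch of the Redis option, assuming the Jedis client; the key name, token handling and expiry are illustrative only:

import java.util.Collections;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisLockSketch {

    // SET key value NX EX seconds: acquire atomically, with auto-expiry so a
    // crashed holder can never block scheduling forever
    public boolean tryLock(Jedis jedis, String lockKey, String token, int expireSeconds) {
        return "OK".equals(jedis.set(lockKey, token, SetParams.setParams().nx().ex(expireSeconds)));
    }

    // Release only if we still own the lock; the compare-and-delete must be
    // atomic, hence the Lua script
    public void unlock(Jedis jedis, String lockKey, String token) {
        String script = "if redis.call('get', KEYS[1]) == ARGV[1] "
                + "then return redis.call('del', KEYS[1]) else return 0 end";
        jedis.eval(script, Collections.singletonList(lockKey), Collections.singletonList(token));
    }
}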

With a Zookeeper lock, acquire/release efficiency matters, because the scheduling threads poll continuously. If each thread is pinned to its own lock key, contention only happens between instances (processes). How can we reduce it further? Looking closely, this is not a scenario that needs frequent lock churn, so the simple answer is to hold the lock longer. Each scheduling thread's workload is already fixed by the sharding strategy, so acquiring the lock once and reusing it across many polling rounds works fine, unless the instance fails. And if the instance does fail, the lock is released automatically, so no harm done; a Zookeeper lock acquired once and used many times fits this scenario well, and the low churn also eases the load on Zookeeper. Apart from the two rounds that acquire and release the lock, no round pays any locking overhead, which also shortens processing time. We control the maximum hold time with an externalized setting; when it expires, the lock is released and the scheduling threads compete for it again. The same reasoning applies equally to a DB lock or a Redis lock.
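
Distilled from the scheduler loop in the complete code at the end of this article, the per-round lock handling amounts to:

// Acquire once, then reuse the lock across polling rounds.
if (lockAcquireTimeMap.get(lockFlag) == null) {
    if (lock.acquire(10, TimeUnit.SECONDS)) {
        lockAcquireTimeMap.put(lockFlag, System.currentTimeMillis());
    } else {
        continue;  // another instance owns this shard; try again next round
    }
}

// ... one polling round of scheduling work ...

// Release only after the configured maximum hold time, letting peers compete again.
Long acquireTime = lockAcquireTimeMap.get(lockFlag);
if (acquireTime != null
        && System.currentTimeMillis() - acquireTime > adminConfig.getLockMaxSeconds() * 1000L) {
    lock.release();
    lockAcquireTimeMap.remove(lockFlag);
}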

Define the CuratorFramework used by the Zookeeper distributed lock:

@Configuration
public class CuratorConfig {

    @Value("${zookeeper.address}")
    private String zookeeperAddress;

    @Value("${zookeeper.timeout:20000}")
    private int zookeeperTimeout;

    @Bean
    public CuratorFramework curatorFramework() {
        CuratorFramework curatorFramework = CuratorFrameworkFactory.builder()
                .connectString(zookeeperAddress)
                .sessionTimeoutMs(zookeeperTimeout)
                .connectionTimeoutMs(zookeeperTimeout)
                .retryPolicy(new ExponentialBackoffRetry(2000, 10))
                .build();
        curatorFramework.start();
        return curatorFramework;
    }
}
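
The basic acquire/release pattern of Curator's InterProcessMutex looks like this (a minimal per-call sketch; the scheduler threads below instead hold the lock across rounds, as described above):

public void runWithLock(CuratorFramework curatorFramework) throws Exception {
    InterProcessMutex lock = new InterProcessMutex(curatorFramework, "/xxl-job/schedule_lock_0");
    if (lock.acquire(10, TimeUnit.SECONDS)) {  // bounded wait instead of blocking forever
        try {
            // ... critical section ...
        } finally {
            lock.release();  // the underlying ephemeral znode also vanishes if this JVM dies
        }
    }
}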

Complete Code

Externalized configuration:

adminConfig.getScheduleShardingCount(); // total shard count

adminConfig.getLockMaxSeconds(); // maximum lock hold time, in seconds
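
A minimal sketch of how these two settings might be exposed on XxlJobAdminConfig; the property names and defaults here are illustrative assumptions, not fixed by the article:

@Value("${xxl.job.schedule.shardingCount:3}")   // hypothetical property name
private int scheduleShardingCount;

@Value("${xxl.job.schedule.lockMaxSeconds:60}") // hypothetical property name
private int lockMaxSeconds;

public int getScheduleShardingCount() {
    return scheduleShardingCount;
}

public int getLockMaxSeconds() {
    return lockMaxSeconds;
}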

/**
 * @author xuxueli 2019-05-21
 */
public class JobScheduleHelper {
    private static final Logger logger = LoggerFactory.getLogger(JobScheduleHelper.class);

    private static final JobScheduleHelper instance = new JobScheduleHelper();

    public static JobScheduleHelper getInstance() {
        return instance;
    }

    public static final long PRE_READ_MS = 5000;    // pre read

    private Thread ringThread;
    private volatile boolean scheduleThreadToStop = false;
    private volatile boolean ringThreadToStop = false;
    private static volatile Map<Integer, BlockingQueue<Integer>> ringData = new ConcurrentHashMap<>();
    private volatile ExecutorService scheduleExecutor;
    private final Map<Integer, InterProcessMutex> lockMap = new ConcurrentHashMap<>();
    private final Map<Integer, Long> lockAcquireTimeMap = new ConcurrentHashMap<>();

    public void start() {
        final XxlJobAdminConfig adminConfig = XxlJobAdminConfig.getAdminConfig();
        final int scheduleShardingCount = adminConfig.getScheduleShardingCount();
        scheduleExecutor = new ThreadPoolExecutor(scheduleShardingCount,
                scheduleShardingCount, 300, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(), new NamedThreadFactory("xxl-job-scheduler", true),
                new ThreadPoolExecutor.AbortPolicy());
        for (int i = 0; i < scheduleShardingCount; i++) {
            final int lockFlag = i;
            scheduleExecutor.execute(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(5000 - System.currentTimeMillis() % 1000);
                } catch (InterruptedException e) {
                    if (!scheduleThreadToStop) {
                        logger.error(e.getMessage(), e);
                    }
                }
                logger.info(">>>>>>>>> init xxl-job admin scheduler {} success.", lockFlag);

                // pre-read count: treadpool-size * trigger-qps (each trigger cost 50ms, qps = 1000/50 = 20)
                int preReadCount = (adminConfig.getTriggerPoolFastMax()
                        + adminConfig.getTriggerPoolSlowMax()) * 20;

                // zookeeper lock
                InterProcessMutex lock = null;
                while (!scheduleThreadToStop) {
                    // define lock
                    try {
                        lock = lockMap.computeIfAbsent(lockFlag,
                                k -> new InterProcessMutex(adminConfig.getCuratorFramework(),
                                        "/xxl-job/schedule_lock_" + k));
                    } catch (Throwable e) {
                        logger.warn(">>>>>>>>>>> xxl-job scheduler {}, failed to define distributed lock", lockFlag, e);
                        try {
                            TimeUnit.SECONDS.sleep(10);
                        } catch (InterruptedException interruptedException) {
                            logger.error(interruptedException.getMessage(), interruptedException);
                        }
                        continue;
                    }

                    // check lock
                    try {
                        // if this thread does not hold the lock, try to acquire it; otherwise keep polling
                        if (lockAcquireTimeMap.get(lockFlag) == null) {
                            if (lock.acquire(10, TimeUnit.SECONDS)) {
                                logger.info("acquired distributed lock: /xxl-job/schedule_lock_" + lockFlag);
                                lockAcquireTimeMap.put(lockFlag, System.currentTimeMillis());
                            } else {
                                // TimeUnit.SECONDS.sleep(10);
                                continue;
                            }
                        }
                        // verify lock ownership
                        if (!lock.isOwnedByCurrentThread()) {
                            lockAcquireTimeMap.remove(lockFlag); // clear the acquire timestamp
                            logger.warn(">>>>>>>>>>> xxl-job scheduler {}, current thread does not own the distributed lock", lockFlag);
                            try {
                                TimeUnit.SECONDS.sleep(10);
                            } catch (InterruptedException e) {
                                logger.error(e.getMessage(), e);
                            }
                            continue;
                        }
                    } catch (Exception e) {
                        logger.warn(">>>>>>>>>>> xxl-job scheduler {}, polling distributed lock failed", lockFlag, e);
                        continue;
                    }

                    // Scan Job
                    final long start = System.currentTimeMillis();
                    boolean preReadSuc = true;
                    try {
                        // 1、pre read
                        long nowTime = System.currentTimeMillis();
                        // sharding query
                        List<XxlJobInfo> scheduleList = adminConfig.getXxlJobInfoDao()
                                .scheduleJobQueryWithSharding(nowTime + PRE_READ_MS, preReadCount,
                                        scheduleShardingCount, lockFlag);
                        if (scheduleList == null || scheduleList.size() <= 0) {
                            preReadSuc = false;
                            continue;
                        }

                        // 2、push time-ring
                        final List<Integer> jobIds = new ArrayList<>();
                        for (XxlJobInfo jobInfo : scheduleList) {
                            try {
                                // time-ring jump
                                if (nowTime > jobInfo.getTriggerNextTime() + PRE_READ_MS) {
                                    // 2.1、trigger-expire > 5s: pass && make next-trigger-time
                                    logger.warn(">>>>>>>>>>> xxl-job, schedule misfire, jobId = "
                                            + jobInfo.getId());

                                    // 1、misfire match
                                    MisfireStrategyEnum misfireStrategyEnum = MisfireStrategyEnum.match(
                                            jobInfo.getMisfireStrategy(), MisfireStrategyEnum.DO_NOTHING);
                                    if (MisfireStrategyEnum.FIRE_ONCE_NOW == misfireStrategyEnum) {
                                        // FIRE_ONCE_NOW 》 trigger
                                        JobTriggerPoolHelper.trigger(jobInfo.getId(),
                                                TriggerTypeEnum.MISFIRE, -1, null, null, null);
                                        if (logger.isDebugEnabled()) {
                                            logger.debug(">>>>>>>>>>> xxl-job, schedule push trigger : jobId = "
                                                    + jobInfo.getId());
                                        }
                                    }

                                    // 2、fresh next
                                    refreshNextValidTime(jobInfo, new Date());
                                } else if (nowTime > jobInfo.getTriggerNextTime()) {
                                    // 2.2、trigger-expire < 5s: direct-trigger && make next-trigger-time
                                    // 1、trigger
                                    JobTriggerPoolHelper.trigger(jobInfo.getId(),
                                            TriggerTypeEnum.CRON, -1, null, null, null);
                                    if (logger.isDebugEnabled()) {
                                        logger.debug(">>>>>>>>>>> xxl-job, schedule push trigger : jobId = "
                                                + jobInfo.getId());
                                    }

                                    // 2、fresh next
                                    refreshNextValidTime(jobInfo, new Date());

                                    // next-trigger-time in 5s, pre-read again
                                    if (jobInfo.getTriggerStatus() == 1
                                            && nowTime + PRE_READ_MS > jobInfo.getTriggerNextTime()) {
                                        // 1、make ring second
                                        int ringSecond = (int) ((jobInfo.getTriggerNextTime() / 1000) % 60);
                                        // 2、push time ring
                                        pushTimeRing(ringSecond, jobInfo.getId());
                                        // 3、fresh next
                                        refreshNextValidTime(jobInfo, new Date(jobInfo.getTriggerNextTime()));
                                    }
                                } else {
                                    // 2.3、trigger-pre-read: time-ring trigger && make next-trigger-time
                                    // 1、make ring second
                                    int ringSecond = (int) ((jobInfo.getTriggerNextTime() / 1000) % 60);
                                    // 2、push time ring
                                    pushTimeRing(ringSecond, jobInfo.getId());
                                    // 3、fresh next
                                    refreshNextValidTime(jobInfo, new Date(jobInfo.getTriggerNextTime()));
                                }
                                // record the successfully handled jobId
                                jobIds.add(jobInfo.getId());
                            } catch (Throwable e) {
                                logger.error("P3|XXLJobFail|job schedule failed|{},{}|jobId:{},msg={}",
                                        jobInfo.getJobTag(), jobInfo.getJobDesc(),
                                        jobInfo.getId(), e.getMessage(), e);
                            }
                        }

                        // 3、update trigger jobInfo
                        for (XxlJobInfo jobInfo : scheduleList) {
                            if (jobIds.contains(jobInfo.getId())) {
                                adminConfig.getXxlJobInfoDao().scheduleUpdate(jobInfo);
                            }
                        }
                    } catch (Throwable e) {
                        logger.error(">>>>>>>>>>> xxl-job, JobScheduleHelper#scheduleThread error, "
                                + "scheduleThreadToStop={}", scheduleThreadToStop, e);
                    } finally {
                        try {
                            if (lock.isOwnedByCurrentThread()) {
                                final Long acquireTime = lockAcquireTimeMap.get(lockFlag);
                                if (acquireTime != null
                                        && System.currentTimeMillis() - acquireTime
                                        > adminConfig.getLockMaxSeconds() * 1000L) {
                                    // lock held for the maximum time, release it
                                    lock.release();
                                    lockAcquireTimeMap.remove(lockFlag);
                                    logger.info("released distributed lock: /xxl-job/schedule_lock_" + lockFlag);
                                }
                            }
                        } catch (Exception e) {
                            logger.warn("failed to check/release lock", e);
                        }

                        long cost = System.currentTimeMillis() - start;
                        // Wait seconds, align second
                        if (cost < 1000) {  // scan-overtime, not wait
                            try {
                                // pre-read period: success > scan each second; fail > skip this period;
                                TimeUnit.MILLISECONDS.sleep((preReadSuc ? 1000 : PRE_READ_MS)
                                        - System.currentTimeMillis() % 1000);
                            } catch (InterruptedException e) {
                                if (!scheduleThreadToStop) {
                                    logger.error(e.getMessage(), e);
                                }
                            }
                        }
                    }
                }

                try {
                    if (lock != null && lock.isOwnedByCurrentThread()) {
                        lock.release();
                        lockAcquireTimeMap.remove(lockFlag);
                        logger.info("released distributed lock: /xxl-job/schedule_lock_" + lockFlag);
                    }
                } catch (Exception e) {
                    logger.warn("failed to check/release lock", e);
                }
                logger.info(">>>>>>>>>>> xxl-job, JobScheduleHelper#scheduleThread {} stop", lockFlag);
            });
        }

        // ring thread
        ringThread = new Thread(() -> {
            final List<Integer> ringItemData = new ArrayList<>(1024);
            while (!ringThreadToStop) {
                // align second
                try {
                    TimeUnit.MILLISECONDS.sleep(1000 - System.currentTimeMillis() % 1000);
                } catch (InterruptedException e) {
                    if (!ringThreadToStop) {
                        logger.error(e.getMessage(), e);
                    }
                }
                try {
                    // second data
                    int nowSecond = Calendar.getInstance().get(Calendar.SECOND);
                    // check one slot back as well, in case processing ran long and skipped a tick
                    for (int i = 0; i < 2; i++) {
                        final BlockingQueue<Integer> queue = ringData.get((nowSecond + 60 - i) % 60);
                        if (queue != null && !queue.isEmpty()) {
                            queue.drainTo(ringItemData);
                        }
                    }
                    // ring trigger
                    if (logger.isDebugEnabled()) {
                        logger.debug(">>>>>>>>>>> xxl-job, time-ring beat : " + nowSecond
                                + " = " + Collections.singletonList(ringItemData));
                    }
                    if (ringItemData.size() > 0) {
                        // do trigger
                        for (int jobId : ringItemData) {
                            JobTriggerPoolHelper.trigger(jobId, TriggerTypeEnum.CRON, -1, null, null, null);
                        }
                        // clear
                        ringItemData.clear();
                    }
                } catch (Throwable e) {
                    logger.error(">>>>>>>>>>> xxl-job, JobScheduleHelper#ringThread error, "
                            + "ringThreadToStop={}", ringThreadToStop, e);
                }
            }
            logger.info(">>>>>>>>>>> xxl-job, JobScheduleHelper#ringThread stop");
        });
        ringThread.setDaemon(true);
        ringThread.setName("xxl-job-admin-JobScheduleHelper#ringThread");
        ringThread.start();
    }

    private static void refreshNextValidTime(XxlJobInfo jobInfo, Date fromTime) {
        Date nextValidTime = null;
        try {
            nextValidTime = generateNextValidTime(jobInfo, fromTime);
        } catch (Exception e) {
            logger.warn("P3|XXLJobFail|job failed|{},{}|jobId:{}, computing next trigger time failed, "
                            + "job auto-disabled, scheduleType={}, scheduleConf={}, errMsg={}",
                    jobInfo.getJobTag(), jobInfo.getJobDesc(), jobInfo.getId(), jobInfo.getScheduleType(),
                    jobInfo.getScheduleConf(), e.getMessage(), e);
            jobInfo.setTriggerStatus(0);
            jobInfo.setTriggerLastTime(0);
            jobInfo.setTriggerNextTime(0);
            return;
        }
        if (nextValidTime != null) {
            jobInfo.setTriggerLastTime(jobInfo.getTriggerNextTime());
            jobInfo.setTriggerNextTime(nextValidTime.getTime());
        } else {
            jobInfo.setTriggerStatus(0);
            jobInfo.setTriggerLastTime(0);
            jobInfo.setTriggerNextTime(0);
            logger.warn(">>>>>>>>>>> xxl-job, refreshNextValidTime fail for job: jobId={}, "
                            + "scheduleType={}, scheduleConf={}",
                    jobInfo.getId(), jobInfo.getScheduleType(), jobInfo.getScheduleConf());
        }
    }

    private void pushTimeRing(int ringSecond, int jobId) {
        // push async ring
        final BlockingQueue<Integer> queue =
                ringData.computeIfAbsent(ringSecond, k -> new LinkedBlockingQueue<>());
        queue.add(jobId);
        if (logger.isDebugEnabled()) {
            logger.debug(">>>>>>>>>>> xxl-job, schedule push time-ring : " + ringSecond
                    + " = " + new ArrayList<>(queue));
        }
    }

    public void toStop() {
        // 1、stop schedule
        scheduleThreadToStop = true;
        try {
            TimeUnit.SECONDS.sleep(1);  // wait
        } catch (InterruptedException e) {
            logger.error(e.getMessage(), e);
        }
        if (scheduleExecutor != null) {
            scheduleExecutor.shutdown();
        }

        // if has ring data
        boolean hasRingData = false;
        if (!ringData.isEmpty()) {
            for (int second : ringData.keySet()) {
                BlockingQueue<Integer> tmpData = ringData.get(second);
                if (tmpData != null && tmpData.size() > 0) {
                    hasRingData = true;
                    break;
                }
            }
        }
        if (hasRingData) {
            try {
                TimeUnit.SECONDS.sleep(8);
            } catch (InterruptedException e) {
                logger.error(e.getMessage(), e);
            }
        }

        // stop ring (wait job-in-memory stop)
        ringThreadToStop = true;
        try {
            TimeUnit.SECONDS.sleep(1);
        } catch (InterruptedException e) {
            logger.error(e.getMessage(), e);
        }
        if (ringThread.getState() != Thread.State.TERMINATED) {
            // interrupt and wait
            ringThread.interrupt();
            try {
                ringThread.join();
            } catch (InterruptedException e) {
                logger.error(e.getMessage(), e);
            }
        }

        logger.info(">>>>>>>>>>> xxl-job, JobScheduleHelper stop");
    }

    // ---------------------- tools ----------------------

    public static Date generateNextValidTime(XxlJobInfo jobInfo, Date fromTime) throws Exception {
        ScheduleTypeEnum scheduleTypeEnum = ScheduleTypeEnum.match(jobInfo.getScheduleType(), null);
        if (ScheduleTypeEnum.CRON == scheduleTypeEnum) {
            return new CronExpression(jobInfo.getScheduleConf()).getNextValidTimeAfter(fromTime);
        } else if (ScheduleTypeEnum.FIX_RATE == scheduleTypeEnum
                /*|| ScheduleTypeEnum.FIX_DELAY == scheduleTypeEnum*/) {
            return new Date(fromTime.getTime() + Integer.parseInt(jobInfo.getScheduleConf()) * 1000);
        }
        return null;
    }

}