Frequently setting CGroup parameters triggers a Linux kernel bug that leaves running CGroup tasks unscheduled

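1. Notes

1> This post describes a problem I ran into on Linux at work: a process using CGroup stayed in the R state but did not execute, did not exit, and could not be killed. After digging into it, the cause turned out to be a kernel bug in CGroup.

2> After finding the bug, I reported it to RedHat as a vulnerability last year. Unfortunately it was not accepted, for reasons I do not know, so I am publishing it here on my blog.

3> The two previous posts, 《极简cfs公平调度算法》 (a minimal look at the CFS fair scheduling algorithm) and 《极简组调度-CGroup如何限制cpu》 (a minimal look at group scheduling: how CGroup limits CPU), were written as background for this one. You need the basics of Linux process scheduling and CGroup control to follow how this kernel bug comes about.

4> The kernel debugging tool used here is crash. Its commands are documented on the official site, so I will not cover them: https://crash-utility.github.io/help.html

2. The problem

2.1 Code that triggers the bug

2.1.1 Code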

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <iostream>
#include <string>

using namespace std;

std::string sub_cgroup_dir("/sys/fs/cgroup/cpu/test");

// common helpers
bool is_dir(const std::string& path)
{
    struct stat statbuf;
    if (stat(path.c_str(), &statbuf) == 0)
    {
        if (0 != S_ISDIR(statbuf.st_mode))
        {
            return true;
        }
    }
    return false;
}

// write an integer to a file, e.g. a cgroup control file
bool write_file(const std::string& file_path, int num)
{
    FILE* fp = fopen(file_path.c_str(), "w");
    if (fp == NULL)
    {
        return false;
    }

    std::string write_data = to_string(num);
    fputs(write_data.c_str(), fp);
    fclose(fp);
    return true;
}

// current time in milliseconds
long get_ms_timestamp()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return (tv.tv_sec * 1000 + tv.tv_usec / 1000);
}

// cgroup: create the sub cgroup if needed and move this process into it
bool create_cgroup()
{
    if (is_dir(sub_cgroup_dir) == false)
    {
        if (mkdir(sub_cgroup_dir.c_str(), S_IRWXU | S_IRGRP) != 0)
        {
            cout << "mkdir cgroup dir fail" << endl;
            return false;
        }
    }

    int pid = getpid();
    cout << "pid is " << pid << endl;
    std::string procs_path = sub_cgroup_dir + "/cgroup.procs";
    return write_file(procs_path, pid);
}

bool set_period(int period)
{
    std::string period_path = sub_cgroup_dir + "/cpu.cfs_period_us";
    return write_file(period_path, period);
}

bool set_quota(int quota)
{
    std::string quota_path = sub_cgroup_dir + "/cpu.cfs_quota_us";
    return write_file(quota_path, quota);
}

// thread body: burn cpu, sleeping 1 ms every `interval` ms
// param: interval in ms
void* thread_func(void* param)
{
    int i = 0;
    int interval = (long)param;
    long last = get_ms_timestamp();

    while (true)
    {
        i++;
        if (i % 1000 != 0)
        {
            continue;
        }

        long current = get_ms_timestamp();
        if ((current - last) >= interval)
        {
            usleep(1000);
            last = current;
        }
    }

    pthread_exit(NULL);
}

void test_thread()
{
    const int k_thread_num = 10;
    pthread_t pthreads[k_thread_num];

    for (int i = 0; i < k_thread_num; i++)
    {
        // cast through long so the int fits a pointer-sized value
        if (pthread_create(&pthreads[i], NULL, thread_func, (void*)(long)(i + 1)) != 0)
        {
            cout << "create thread fail" << endl;
        }
        else
        {
            cout << "create thread success,tid is " << pthreads[i] << endl;
        }
    }
}

// argv[1] : period
// argv[2] : quota
int main(int argc, char* argv[])
{
    if (argc < 3)
    {
        cout << "usage : ./inactive_timer $period $quota" << endl;
        return -1;
    }

    int period = stoi(argv[1]);
    int quota = stoi(argv[2]);
    cout << "period is " << period << endl;
    cout << "quota is " << quota << endl;

    test_thread();
    if (create_cgroup() == false)
    {
        cout << "create cgroup fail" << endl;
        return -1;
    }

    int i = 0;
    while (true)
    {
        if (i > 20)
        {
            i = 0;
        }
        i++;

        // wait i ms (1 to 21), then rewrite period and quota
        long current = get_ms_timestamp();
        long last = current;
        while ((current - last) < i)
        {
            usleep(1000);
            current = get_ms_timestamp();
        }

        set_period(period);
        set_quota(quota);
    }

    return 0;
}


2.1.2 Compile

g++ -std=c++11 trigger_cgroup_timer_inactive.cpp -o inactive_timer -lpthread

2.1.3 Run the program on a CentOS 7.0~7.5 system

./inactive_timer 100000 10000

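2.1.4 The code above does three main things

1> Puts its own process under CGroup CPU control
2> Repeatedly sets the CGroup's cpu.cfs_period_us and cpu.cfs_quota_us
3> Starts 10 threads that consume CPU

2.1.5 《极简组调度-CGroup如何限制cpu》 already covered how CGroup limits CPU: within each period of length cfs_period_us, the processes in the CGroup may use at most cfs_quota_us of CPU time. If the group uses more than cfs_quota_us within a period, it is throttled, that is, removed from the CFS run queue. It then waits for the timer to start the next period and unthrottle it, at which point it is put back on the run queue. This is how CPU usage is limited.

2.2 Symptoms

1> After the program has run for a few minutes, all threads stay in the running state, but none of them is actually executing any more, and CPU usage stays at 0.

2> The thread stacks show that every task is in the return path of a system call.

3> crash shows that the state of the main thread, 32764, really is "running", but the CFS run queue of the rq on the corresponding CPU 0 contains no runnable task.

4> The se of the task is not on the rq, and the cfs_rq is shown as throttled. As explained in 《极简组调度-CGroup如何限制cpu》, one period after a throttle (100 ms in this program) the CGroup timer refills the quota and unthrottles the group, putting the group se back on the rq. Here the group stays throttled and never recovers, so the timer is the natural suspect.

5> Looking at the period timer in the cfs_bandwidth of the task group, its state is 0, i.e. HRTIMER_STATE_INACTIVE. This is the problem. Normally this timer is active. When it is inactive, the group cfs_rq on each CPU can no longer be given quota, and once the quota is used up the corresponding se is removed from the rq. The task is still runnable, but because it is not on the rq it is never scheduled.

3. Cause

3.1 Linux hrtimers are one-shot: after expiring, a timer must be armed again before it can fire again. Searching the code shows that period_timer is armed in __start_cfs_bandwidth(), by calling start_bandwidth_timer().

The key point is that when cfs_b->timer_active is non-zero, __start_cfs_bandwidth() does not arm period_timer, which matches the symptoms. So when can cfs_b->timer_active be non-zero while the timer is not armed?

3.2 Setting the quota or period of a CGroup eventually reaches tg_set_cfs_bandwidth(), which sets cfs_b->timer_active to 0 and then calls __start_cfs_bandwidth():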

tg_set_cfs_quota()
    tg_set_cfs_bandwidth()
        /* restart the period timer (if active) to handle new period expiry */
        if (runtime_enabled && cfs_b->timer_active) {
            /* force a reprogram */
            cfs_b->timer_active = 0;
            __start_cfs_bandwidth(cfs_b);
        }

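Look closely at the code above and consider the following scenario:

1> Thread A sets the quota or period of the CGroup. It sets cfs_b->timer_active to 0 and calls __start_cfs_bandwidth(). Before it reaches hrtimer_cancel() at line 580 of __start_cfs_bandwidth(), the CPU switches to thread B.

2> Thread B also calls __start_cfs_bandwidth() and runs it to completion: it sets cfs_b->timer_active to 1 and calls start_bandwidth_timer() to arm the timer. The CPU then switches back to thread A.

3> Thread A resumes and calls hrtimer_cancel(), which disarms period_timer. At line 585 of __start_cfs_bandwidth() it finds that cfs_b->timer_active is 1 and returns immediately, without arming period_timer again.

3.3 Searching for callers of __start_cfs_bandwidth() shows that the timer interrupt calls update_curr(), which eventually calls assign_cfs_rq_runtime() to check the CGroup's CPU quota usage and decide whether to throttle. When cfs_b->timer_active is 0, this path also calls __start_cfs_bandwidth(), which is exactly the thread B code above. It therefore races with thread A, the thread setting the CGroup, and the timer ends up disarmed.

1> [Figure: complete code execution flow]

2> Once the timer is disarmed, cfs_b->timer_active is still 1 because thread B set it in 3.2. So on the next timer interrupt, assign_cfs_rq_runtime() wrongly concludes that the timer is active and does not call __start_cfs_bandwidth() to arm it again. The throttled group se is then never unthrottled and never put back on the rq.

3.4 Summary

Frequently setting the CGroup configuration races in __start_cfs_bandwidth() with the timer interrupt path that checks the group quota. As a result, period_timer is cancelled and never armed again, the tasks controlled by the CGroup can no longer be given CPU quota, and they stop being scheduled.

3.5 Recovery

Knowing the cause, we can also see that tg_set_cfs_quota() calls __start_cfs_bandwidth(), which cancels the timer and then arms it again, so the timer callback can unthrottle the group. Manually setting cpu.cfs_period_us or cpu.cfs_quota_us of the affected CGroup once is therefore enough to get it running again.

4. Fix

Versions 3.10.0-693 and later do not have this problem. Comparing against the 2.6.32 code (right-hand side of the original comparison figure), the 3.10.0-693 code (left-hand side) changes hrtimer_cancel() to hrtimer_try_to_cancel() and puts both that call and the cfs_b->timer_active check under the spinlock. This removes the case where period_timer is still cancelled after cfs_b->timer_active has been set to 1. Judging by the mailing list discussion of this fix, it was written for a different problem and fixed this one along the way, so I missed my chance to submit a patch to Linux.

ref : https://gfiber.googlesource.com/kernel/bruno/+/09dc4ab03936df5c5aa711d27c81283c6d09f495

5. Exploiting the bug

1> In China, a large number of companies still run CentOS6 and CentOS7.0~7.5, and all of these systems contain this bug. Anything that uses CGroup to limit CPU may trigger it and interrupt the service, and a restart will not necessarily recover it.

2> Once the bug is triggered, the task is already in the running state, so kill has no effect: the task is never scheduled and therefore cannot be killed. This can be used to attack arbitrary software (antivirus software, for example), leaving it unable to run and unable to restart, since many programs allow only a single instance of themselves. Even if the target does not use CGroup, an attacker can create one for it and attack it that way.

3> Because this is a Linux kernel bug, it is hard to notice and to diagnose once triggered: the process state is running, so the intuitive conclusion is that the process is still executing normally.

This is an original article by the blogger. If you repost it, please credit http://www.cnblogs.com/organic/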
