Performance testing C/C++ programs on Linux with the rdtsc instruction

What is RDTSC

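RDTSC stands for "Read Time-Stamp Counter". It is a CPU instruction supported by virtually all current Intel and AMD CPUs. It loads the processor's current time-stamp counter value into the EDX:EAX register pair, where software can read it.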
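To make the EDX:EAX description concrete, here is a minimal sketch, assuming GCC or Clang on an x86-64 machine. It issues the instruction through inline assembly and joins the two 32-bit halves into one 64-bit value. The function name read_tsc is made up for this illustration.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original article): executes RDTSC and
// combines the two 32-bit halves that the instruction leaves in EDX:EAX.
static inline std::uint64_t read_tsc() {
    std::uint32_t lo, hi;
    // "=a" binds lo to EAX (low 32 bits), "=d" binds hi to EDX (high 32 bits).
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Calling read_tsc() twice and subtracting the results gives the number of ticks that elapsed in between. The __rdtsc() intrinsic from <x86intrin.h> returns the same combined value without hand-written assembly.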

Advantages of RDTSC

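RDTSC is a built-in CPU instruction, and a single CPU instruction generally takes only a few to a few dozen CPU cycles. RDTSC therefore lets you collect performance data with very little overhead, which makes it a handy tool for instruction-level performance measurement.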

History of RDTSC

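Originally, RDTSC read the actual number of clock ticks since the CPU was started [1]. The drawback is that the CPU frequency is not always constant, especially on CPUs that scale their frequency dynamically or are overclocked, so measuring the same program twice by its tick count could give very different tick differences. On multi-core CPUs there was a further problem: "the time stamp counters of these cores are not necessarily synchronized, so after a process migrates between cores, the rdtsc result may no longer be meaningful. [2]"
To overcome this, Intel processors [3] implemented a constant-rate time stamp counter, first in later Pentium 4 models and, in its invariant form, from the Nehalem microarchitecture onward. With it, the time stamp counter increments at a constant rate regardless of frequency scaling or multiple cores, so RDTSC can once again serve as a tool for performance measurement.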

Checking whether the CPU supports a constant TSC

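Run the following command on Linux:
grep -m 1 ^flags /proc/cpuinfo | sed 's/ /\n/g' | grep -E "constant_tsc|nonstop_tsc"
If both constant_tsc and nonstop_tsc appear in the output, the CPU provides a TSC that is suitable for timing with RDTSC. constant_tsc means the TSC ticks at a constant rate, and nonstop_tsc means the TSC does not stop while the CPU sleeps. Generally speaking, the number of ticks per second is related to the CPU frequency; in most cases it is close to the CPU's nominal frequency rather than its current operating frequency.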
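The same check can be done from inside a program. The following is a minimal sketch, assuming a Linux system that exposes /proc/cpuinfo in the usual format. The helper names has_flag, first_flags_line and tsc_is_constant_and_nonstop are invented for this example.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: true if `flag` appears as a whole word in `line`.
bool has_flag(const std::string& line, const std::string& flag) {
    std::istringstream words(line);
    std::string word;
    while (words >> word) {
        if (word == flag) return true;
    }
    return false;
}

// Hypothetical helper: returns the first line starting with "flags" from a
// cpuinfo-style file, or an empty string if the file has no such line.
std::string first_flags_line(const std::string& path = "/proc/cpuinfo") {
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "flags") == 0) return line;
    }
    return "";
}

// True only if both TSC flags discussed above are present.
bool tsc_is_constant_and_nonstop(const std::string& flags_line) {
    return has_flag(flags_line, "constant_tsc") &&
           has_flag(flags_line, "nonstop_tsc");
}
```

A program could call tsc_is_constant_and_nonstop(first_flags_line()) at startup and fall back to clock_gettime when it returns false. Matching whole words matters here, because a plain substring search for "tsc" would also match "constant_tsc".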

Caveats when using RDTSC

Out-of-order execution

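RDTSC is an ordinary, non-serializing CPU instruction, so out-of-order (OoO) execution can move it relative to the surrounding instructions, and part of the code being measured may then run outside the timed region. You therefore usually need barriers before and after the instruction: a compiler memory barrier to stop the compiler from reordering, plus a fence such as LFENCE to stop the CPU from reordering.

#include <stdint.h>
#include <iostream>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang on x86)

// Reads the TSC. The empty asm statements stop the compiler from moving code
// across the read; the LFENCEs stop the CPU from doing so, because RDTSC by
// itself is not a serializing instruction.
static inline uint64_t perf_counter(void)
{
    __asm__ __volatile__("" : : : "memory");
    _mm_lfence();
    uint64_t r = __rdtsc();
    _mm_lfence();
    __asm__ __volatile__("" : : : "memory");

    return r;
}

// Placeholder for the code under test.
void someFunction() {

}

int main()
{
    uint64_t t1 = perf_counter();
    someFunction();
    uint64_t t2 = perf_counter();
    std::cout << t2 - t1 << " counts" << std::endl;

    return 0;
}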
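An alternative way to order the read is the RDTSCP instruction, which waits for earlier instructions to finish before it samples the counter. The sketch below assumes GCC or Clang on an x86-64 CPU that supports RDTSCP (the rdtscp flag in /proc/cpuinfo). The function name perf_counter_ordered is made up for this illustration and is not part of the original listing.

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtscp, _mm_lfence (GCC/Clang on x86)

// Hypothetical helper: reads the TSC with RDTSCP, which does not execute
// until all earlier instructions have completed.
static inline std::uint64_t perf_counter_ordered(unsigned int* aux = nullptr) {
    unsigned int tsc_aux = 0;
    std::uint64_t ticks = __rdtscp(&tsc_aux);
    _mm_lfence();  // keep later instructions from starting before the read
    if (aux != nullptr) {
        *aux = tsc_aux;  // raw IA32_TSC_AUX value; its meaning is OS-defined
    }
    return ticks;
}
```

RDTSCP also returns the contents of the IA32_TSC_AUX register. If the operating system stores a CPU identifier there, comparing the value from the start and the end of a measurement is one way to notice that the thread migrated to another core in between. Treat that as an assumption to verify on the target system.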

Converting to time

Since the ticks measure the value of the time stamp counter, converting them to time requires a calculation like the following [4]:

# seconds = # cycles / frequency
Note: frequency is given in Hz, where: 1,000,000 Hz = 1 MHz
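As a small worked example of the formula above, the following sketch converts a tick count to seconds. The function name ticks_to_seconds is invented here, and the frequency passed in is assumed to be the TSC frequency, which the caller has to know or measure.

```cpp
#include <cstdint>

// Hypothetical helper: applies seconds = cycles / frequency.
// frequency_hz must be the frequency of the counter that produced `ticks`.
double ticks_to_seconds(std::uint64_t ticks, double frequency_hz) {
    return static_cast<double>(ticks) / frequency_hz;
}
```

For instance, 4,400,000,000 ticks on a counter running at 2.2 GHz (2,200,000,000 Hz) correspond to 2 seconds.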

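However, the TSC frequency and the CPU frequency are usually not identical. If a program is measured with the rdtsc instruction, it is therefore best not to convert the result to time and to compare raw TSC values consistently instead. Obtaining the actual TSC frequency is fairly complicated; the discussion in [5] is a useful reference.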

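A program like the one in [6] gives a rough idea of how many ticks make up one nanosecond. Do not expect TSC values converted this way to yield exact times, though.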

// from https://stackoverflow.com/questions/39151049/on-a-cpu-with-constant-tsc-and-nonstop-tsc-why-does-my-time-drift
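#include <stdint.h>   // uint64_t
#include <time.h>     // clock_gettime, clock_getres, timespec
#include <unistd.h>   // sleep
#include <cstdlib>    // std::atoi, exit
#include <iomanip>    // std::setprecision
#include <iostream>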

static inline unsigned long rdtscp_start(void) {
  unsigned long var;
  unsigned int hi, lo;

  __asm volatile ("cpuid\n\t"
          "rdtsc\n\t" : "=a" (lo), "=d" (hi)
          :: "%rbx", "%rcx");

  var = ((unsigned long)hi << 32) | lo;
  return (var);
}

static inline unsigned long rdtscp_end(void) {
  unsigned long var;
  unsigned int hi, lo;

  __asm volatile ("rdtscp\n\t"
          "mov %%edx, %1\n\t"
          "mov %%eax, %0\n\t"
          "cpuid\n\t"  : "=r" (lo), "=r" (hi)
          :: "%rax", "%rbx", "%rcx", "%rdx");

  var = ((unsigned long)hi << 32) | lo;
  return (var);
}

/*see https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
 */

using std::cout;
using std::cerr;
using std::endl;

#define CLOCK CLOCK_REALTIME

uint64_t to_ns(const timespec &ts);   // Converts a struct timespec to ns (since epoch).
void view_ticks_per_ns(int runs =10, int sleep =10);

int main(int argc, char **argv) {
    int runs = 10, sleep = 10;
    if (argc != 1 && argc != 3) {
        cerr << "Usage: " << argv[0] << " [ RUNS SLEEP ] \n";
        exit(1);
    } else if (argc == 3) {
        runs = std::atoi(argv[1]);
        sleep = std::atoi(argv[2]);
    }

    view_ticks_per_ns(runs, sleep);
}

void view_ticks_per_ns(int RUNS, int SLEEP) {
// Prints out stream of RUNS tsc ticks per ns, each calculated over a SLEEP secs interval.
    timespec clock_start, clock_end;
    unsigned long tsc1, tsc2, tsc_start, tsc_end;
    unsigned long elapsed_ns, elapsed_ticks;
    double rate; // ticks per ns from each run.

    clock_getres(CLOCK, &clock_start);
    cout <<  "Clock resolution: " << to_ns(clock_start) << "ns\n\n";

    cout << " tsc ticks      " << "ns      " << " tsc ticks per ns\n";
    for (int i = 0; i < RUNS; ++i) {
        tsc1 = rdtscp_start();
        clock_gettime(CLOCK, &clock_start);
        tsc2 = rdtscp_end();
        tsc_start = (tsc1 + tsc2) / 2;

        sleep(SLEEP);

        tsc1 = rdtscp_start();
        clock_gettime(CLOCK, &clock_end);
        tsc2 = rdtscp_end();
        tsc_end = (tsc1 + tsc2) / 2;

        elapsed_ticks = tsc_end - tsc_start;
        elapsed_ns = to_ns(clock_end) - to_ns(clock_start);
        rate = static_cast<double>(elapsed_ticks) / elapsed_ns;

        cout << elapsed_ticks << " " << elapsed_ns << " " << std::setprecision(12) << rate << endl;
    }
}

constexpr uint64_t BILLION {1000000000};

uint64_t to_ns(const timespec &ts) {
  return ts.tv_sec * BILLION + ts.tv_nsec;
}

The program prints:

$ ./viewRates 
Clock resolution: 1ns

 tsc ticks      ns       tsc ticks per ns
22000198660 10000061710 2.20000628976
22000234368 10000078247 2.20000622241
22000189429 10000057811 2.20000622444
22000253367 10000086877 2.20000622371
22000117926 10000025294 2.2000062279
22000283315 10000100490 2.20000622364
22000192366 10000059184 2.20000621608
22000177888 10000052623 2.20000621171
22000191482 10000058825 2.20000620666
22000189772 10000058126 2.20000618944

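The tsc ticks per ns value is about 2.2, which matches the CPU's 2.2 GHz frequency. Converting ticks to nanoseconds with that nominal value would still be imprecise: the measured rate is not exactly 2.2 and changes slightly from run to run, so the error accumulates over time. The tick counts of the individual 10-second runs also differ noticeably, although much of that comes from sleep() not sleeping for exactly 10 seconds.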
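To put a number on that imprecision, the sketch below compares the time estimated from a tick count with the time the OS clock reported. The function name conversion_error_ns is made up for this illustration, and the inputs in the note that follows are taken from the first row of the output above.

```cpp
#include <cstdint>

// Hypothetical helper: how far off (in ns) a conversion based on an assumed
// ticks-per-ns rate is, compared with the elapsed time the OS clock reported.
double conversion_error_ns(std::uint64_t ticks, std::uint64_t measured_ns,
                           double assumed_ticks_per_ns) {
    double estimated_ns = static_cast<double>(ticks) / assumed_ticks_per_ns;
    return estimated_ns - static_cast<double>(measured_ns);
}
```

With the first row (22000198660 ticks, 10000061710 ns) and an assumed rate of 2.2 ticks per ns, the estimate is 10000090300 ns, which is about 28590 ns longer than the measured interval. That is roughly 2.9 microseconds of error per second, small for a single short measurement but significant if ticks are used to keep time over long periods.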


References:
[1]: https://en.wikipedia.org/wiki/Time_Stamp_Counter
[2]: https://www.cnblogs.com/ralphjzhang/archive/2012/01/09/2317463.html
[3]: https://en.wikipedia.org/wiki/List_of_Intel_processors
[4]: https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf
[5]: https://stackoverflow.com/questions/35123379/getting-tsc-rate-from-x86-kernel
[6]: https://stackoverflow.com/questions/39151049/on-a-cpu-with-constant-tsc-and-nonstop-tsc-why-does-my-time-drift
