What Really Causes the Performance Cost of Virtual Functions

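Motivation

Many people know that virtual functions carry a performance cost.

Far fewer can say what actually causes that cost.

I think the underlying mechanism is worth digging into, for two reasons.

First, it gives a better understanding of how virtual functions work.

Second, once the root cause is understood, it can point the way for performance work in other areas.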

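How virtual functions work

[Figure: virtual-function]

Virtual functions are implemented through a per-class virtual table (vtable).
When a virtual function is called through a pointer or reference, the target is not fixed at compile time. The compiler already knows the function is virtual, so it emits code that loads the vtable pointer stored in the object, looks up the function's slot in the vtable, and calls the address found there. A non-virtual call skips all of this and jumps straight to a fixed address.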
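The dispatch steps described above can be written out by hand. The following is a minimal sketch, not what any particular compiler generates: `Shape`, `VTable`, `call_area` and the other names are invented for illustration, and a real vtable layout is an implementation detail of the compiler.

```cpp
// Hand-written emulation of virtual dispatch (all names are illustrative).
struct Shape;

struct VTable {
  int (*area)(const Shape*);  // one slot per virtual function
};

struct Shape {
  const VTable* vptr;  // plays the role of the hidden vtable pointer
  int w;
  int h;
};

int rect_area(const Shape* s) { return s->w * s->h; }
int tri_area(const Shape* s) { return s->w * s->h / 2; }

// One table per "class", shared by all of its objects.
const VTable kRectVTable = {rect_area};
const VTable kTriVTable = {tri_area};

// Roughly what a virtual call does: load vptr, load the slot, call indirectly.
int call_area(const Shape* s) { return s->vptr->area(s); }
```

The point of the sketch is that `call_area` contains two dependent loads followed by an indirect call, so its cost depends on whether those loads hit the cache and whether the CPU can guess the call target.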

Examples

Virtual vs. non-virtual calls

Call a virtual function and a non-virtual function 100,000 times each, separately, and compare the results.

class A {
  public:
  virtual void foo() {}
};

class B : public A {
  public:
  int a;
  int c;
  virtual void foo() {
    a = 10;
  }
  void boo() {
    c = 10;
  }
};

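Calling the virtual function 100,000 times

void VirtualFuntionCall(benchmark::State& state) {
  for (auto _ : state) {
    B b;
    // The source listing is cut off after "i"; the loop below is reconstructed.
    for (int i = 0; i < 100000; i++) { b.foo(); benchmark::DoNotOptimize(b.a); }
  }
}
BENCHMARK(VirtualFuntionCall);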

 
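Calling the non-virtual function 100,000 times

void NonVirtualFuntionCall(benchmark::State& state) {
  for (auto _ : state) {
    B b;
    // The source listing is cut off after "i"; the loop below is reconstructed.
    for (int i = 0; i < 100000; i++) { b.boo(); benchmark::DoNotOptimize(b.c); }
  }
}
BENCHMARK(NonVirtualFuntionCall);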
 
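Results:

[Figure: virtual-cmp-vir-2-non]

The difference between the virtual and the non-virtual call is tiny, small enough to ignore.
Does that mean virtual functions are not as expensive as commonly believed?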
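Cost of scattered object access

Build 100,000 objects of one subclass, contiguous in memory, in a vector, and call the virtual function on each of them in memory order. Then shuffle the order in which the 100,000 objects are visited and make the same 100,000 virtual calls. Compare the two results.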
 class A {
  public:
  virtual void foo() {}
};

class B : public A {
  public:
  int a;
  int c;
  virtual void foo() {
    a = 10;
  }
};
 
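Calling the virtual function 100,000 times in memory order

void VirtualFuntionSeqCall(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    // Template arguments were lost from the source listing; B and B* are
    // inferred from the description. size is 100000 (see the later listings).
    std::vector<B> arr;       // objects stored contiguously
    arr.resize(size);
    std::vector<B*> target;   // pointers in the same order as memory
    for(auto& item: arr){
      target.push_back(&item);
    }

    for(auto item: target){
      item->foo();
      benchmark::DoNotOptimize(item);
    }

  }
}
// Register the function as a benchmark
BENCHMARK(VirtualFuntionSeqCall);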
 
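Calling the virtual function 100,000 times in shuffled order

void VirtualFuntionRandomCall(benchmark::State& state) {
  for (auto _ : state) {
    // Template arguments were lost from the source listing; B and B* are
    // inferred from the description. size is 100000 (see the later listings).
    std::vector<B> arr;
    arr.resize(size);
    std::vector<B*> target;
    for(auto& item: arr){
      target.push_back(&item);
    }
    // The original used srand(1) and std::random_shuffle, which was removed
    // in C++17. std::shuffle needs <algorithm> and <random>.
    std::mt19937 rng(1);
    std::shuffle(target.begin(), target.end(), rng);
    for(auto item: target){
      item->foo();
      benchmark::DoNotOptimize(item);
    }

  }
}
// Register the function as a benchmark
BENCHMARK(VirtualFuntionRandomCall);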
 
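Results (a 2.5x difference in speed):

[Figure: virtual-random-virtual]

The virtual calls are slow here because consecutive calls touch objects that are not adjacent in memory. Each call has to load the object's vtable pointer and then write its field, so visiting the objects in random order causes a large number of cache misses, and that is what drags performance down.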
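One way to act on this finding is to put a batch of pointers back into address order before calling through them. The sketch below is an illustration under assumptions, not code from the benchmark: `Base`, `Derived`, `make_order`, `restore_address_order` and `sum_values` are hypothetical names, and it only helps when the objects really do live in one contiguous pool.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

struct Base {
  virtual ~Base() = default;
  virtual int value() const { return 0; }
};
struct Derived : Base {
  int v = 1;
  int value() const override { return v; }
};

// Collect pointers to the pooled objects, optionally in shuffled order.
std::vector<Base*> make_order(std::vector<Derived>& pool, bool shuffled) {
  std::vector<Base*> order;
  order.reserve(pool.size());
  for (auto& obj : pool) order.push_back(&obj);
  if (shuffled) {
    std::mt19937 rng(1);  // fixed seed so runs are repeatable
    std::shuffle(order.begin(), order.end(), rng);
  }
  return order;
}

// Sort pointers by address so the traversal walks memory front to back.
void restore_address_order(std::vector<Base*>& order) {
  std::sort(order.begin(), order.end(), std::less<Base*>());
}

// One virtual call per object, in whatever order the vector dictates.
long long sum_values(const std::vector<Base*>& order) {
  long long sum = 0;
  for (const Base* p : order) sum += p->value();
  return sum;
}
```

Both orders compute the same sum; only the memory access pattern differs. Whether the sort pays for itself depends on how many times the batch is traversed afterwards.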
  
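Performance differences caused by compiler optimization

Write a short virtual function and a non-virtual function that does the same thing, call each in a 100,000-iteration loop, and look at how compiler optimization differs between the two cases.

#include <benchmark/benchmark.h>
#include <vector>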
class A {
public:
  int offset;
  void set(int o) {
    offset = o;
  }
  int get() {
    return offset;
  }
  virtual int v_get() {
    return offset;
  }

};


const int size = 100000;
 
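Calling the virtual function 100,000 times

void virtualFunction(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    // Template arguments and loop bodies were lost from the source listing;
    // they are reconstructed here from the surrounding text.
    std::vector<A> arr;
    arr.resize(size);
    for (int i = 0; i < size; i++) arr[i].set(i);
    std::vector<int> ret;
    ret.resize(size);
    for (int i = 0; i < size; i++) ret[i] = arr[i].v_get();
    benchmark::DoNotOptimize(ret);
  }
}
BENCHMARK(virtualFunction);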
 
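Calling the non-virtual function 100,000 times

void normalFunction(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    // Template arguments and loop bodies were lost from the source listing;
    // they are reconstructed here from the surrounding text.
    std::vector<A> arr;
    arr.resize(size);
    for (int i = 0; i < size; i++) arr[i].set(i);
    std::vector<int> ret;
    ret.resize(size);
    for (int i = 0; i < size; i++) ret[i] = arr[i].get();
    benchmark::DoNotOptimize(ret);
  }
}
BENCHMARK(normalFunction);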
 
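Results:

[Figure: virtual-inline]

A call dispatched through the vtable generally cannot be inlined, because the compiler does not know the target at compile time. It can only inline when it is able to prove the object's dynamic type.
A non-virtual function can be inlined, which then opens the door to further loop optimizations.
So for short functions, a non-virtual function performs much better than a virtual one.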
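A case where the compiler can prove the dynamic type is a class marked `final`. The sketch below contrasts the two situations; `Reader`, `FixedReader`, `sum_dynamic` and `sum_static` are hypothetical names, and whether a given compiler actually inlines the second loop depends on the compiler and its optimization flags.

```cpp
struct Reader {
  virtual ~Reader() = default;
  virtual int get() const = 0;
};

// final: no class can derive from FixedReader, so a FixedReader& can only
// refer to a FixedReader object.
struct FixedReader final : Reader {
  int offset = 0;
  int get() const override { return offset; }
};

// Dynamic type unknown here: each get() is an indirect call via the vtable.
int sum_dynamic(const Reader& r, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) sum += r.get();
  return sum;
}

// Static type is final: the compiler is allowed to call FixedReader::get
// directly, which makes inlining and loop optimization possible.
int sum_static(const FixedReader& r, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) sum += r.get();
  return sum;
}
```

Both functions return the same value for the same object; the difference is only in what the optimizer is permitted to assume.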
  
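Jump destination guess and virtual functions

Calling randomly chosen subclasses

Build several subclasses and set up three scenarios: call the subclasses' virtual function in sequential blocks, in round-robin order, and in random order.
Compare the results of the three scenarios.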
 class A {
 public:
 virtual void foo() {}
};

class B : public A {
 public:
 int a;
 virtual void foo() {
   a = 10;
 }
};

class C : public A {
 public:
 int a;
 virtual void foo() {
   a = 20;
 }
};
const int size = 100000;
 
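Calling randomly chosen subclasses 100,000 times

void randomVirtualFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    std::vector<A*> arr;
    B b;
    C c;
    // The loop that fills arr was lost from the source listing; this version
    // is reconstructed from the description (pick b or c at random).
    for (int i = 0; i < size; i++) {
      if (rand() % 2 == 0) arr.push_back(&b); else arr.push_back(&c);
    }
    for (auto item : arr) {
      item->foo();
      benchmark::DoNotOptimize(item);
    }
  }
}
// Register the function as a benchmark
BENCHMARK(randomVirtualFuntion);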
 
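Calling the subclasses round-robin 100,000 times

void roundRobinVirtualFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    std::vector<A*> arr;
    B b;
    C c;
    // The loop that fills arr was lost from the source listing; this version
    // is reconstructed from the description (alternate between b and c).
    for (int i = 0; i < size; i++) {
      if (i % 2 == 0) arr.push_back(&b); else arr.push_back(&c);
    }
    for (auto item : arr) {
      item->foo();
      benchmark::DoNotOptimize(item);
    }
  }
}
// Register the function as a benchmark
BENCHMARK(roundRobinVirtualFuntion);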
 
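Calling the subclasses sequentially 100,000 times

void seqVirtualFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    std::vector<A*> arr;
    B b;
    C c;
    auto half_size = size/2;
    // The loop that fills arr was lost from the source listing; this version
    // is reconstructed from the description (first half b, second half c).
    for (int i = 0; i < size; i++) {
      if (i < half_size) arr.push_back(&b); else arr.push_back(&c);
    }
    for (auto item : arr) {
      item->foo();
      benchmark::DoNotOptimize(item);
    }
  }
}
// Register the function as a benchmark
BENCHMARK(seqVirtualFuntion);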
 
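Results (a 3.5x difference in speed):

[Figure: virtual-random-children]

When the CPU reaches a call instruction whose target is not a fixed address but has to be computed or loaded, it does not wait for the address to become available. It guesses the target and starts executing from there speculatively. If the guess turns out to be wrong, the speculative work is thrown away and execution restarts at the correct function.
When the subclasses are called sequentially or in a regular pattern, the CPU predicts the target correctly.
When they are called in random order, the CPU cannot predict the target, so performance drops.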
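This behavior is not specific to virtual functions. Function pointers show the same thing.

Calling randomly chosen function pointers

Define two functions that do the same kind of work and call them through function pointers.
Set up three scenarios: call them in random order, in sequential blocks, and in round-robin order.
Compare the results.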
 const int size = 100000;
int b = 0;
void foo1() {
  b = 20;
}
void foo2() {
  b = 30;
}
 
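Calling function pointers in random order 100,000 times

void randomFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    // Listing truncated in the source; everything from here is reconstructed.
    std::vector<void (*)()> arr;
    for (int i = 0; i < size; i++) arr.push_back(rand() % 2 == 0 ? foo1 : foo2);
    for (auto fn : arr) { fn(); benchmark::DoNotOptimize(b); }
  }
}
BENCHMARK(randomFuntion);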
 
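Calling function pointers sequentially 100,000 times

void seqFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    // Listing truncated in the source; the element type and loops are reconstructed.
    std::vector<void (*)()> arr;
    auto half_size = size/2;
    for (int i = 0; i < size; i++) arr.push_back(i < half_size ? foo1 : foo2);
    for (auto fn : arr) { fn(); benchmark::DoNotOptimize(b); }
  }
}
BENCHMARK(seqFuntion);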
 
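Calling function pointers round-robin 100,000 times

void roundRobinFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    // Listing truncated in the source; everything from here is reconstructed.
    std::vector<void (*)()> arr;
    for (int i = 0; i < size; i++) arr.push_back(i % 2 == 0 ? foo1 : foo2);
    for (auto fn : arr) { fn(); benchmark::DoNotOptimize(b); }
  }
}
BENCHMARK(roundRobinFuntion);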
 
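Results (a 3.5x difference in speed):

[Figure: virtual-function-order]

The function-pointer results look much like the virtual-function results.
This shows that whenever a call instruction's target address has to be computed, the CPU relies on target prediction to keep things fast.
The slowdown caused by random call order is therefore not specific to virtual functions.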
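Performance differences caused by the CPU instruction cache

Build virtual functions with a large body.
Set up two scenarios: call the subclasses' virtual functions in round-robin order and in sequential blocks.
Compare the results.

#include <benchmark/benchmark.h>
#include <vector>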
#define VARIABLES       \
    V(1, a* a* a)       \
    V(2, a / 3)         \
    V(3, a / 5)         \
    V(4, a / (a - 1))  




class A {
  public:
  virtual int long_virtual_function(std::vector<int>& v) {
    return 0;
  }
};

class B : public A {
  public:
  virtual int long_virtual_function(std::vector<int>& v) override {
    int sum = 20;
    for (int i = 0; i < v.size(); i++) {
        int a = v[i];
        if (a == 0) {
            sum += a;
        }
#define V(num, expr)                \
        else if (a == num) {            \
            sum += (expr) - (expr) / 2; \
        }
        VARIABLES
#undef V
    }
    return sum;
  }


};

class C : public A {
  public:
    virtual int long_virtual_function(std::vector<int>& v) override {
      int sum = 10;
      for (int i = 0; i < v.size(); i++) {
          int a = v[i];
          if (a == 0) {
              sum += a;
          }
  #define V(num, expr)                \
          else if (a == num) {            \
              sum += (expr) - (expr) / 2; \
          }
          VARIABLES
  #undef V
      }
      return sum;
    }
};
const int size = 10000;
 
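Calling the virtual functions in round-robin order (size = 10000 calls)

void roundRobinVirtualFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    std::vector<A*> arr;
    B b;
    C c;
    // The loop that fills arr was lost from the source listing; this version
    // is reconstructed from the description (alternate between b and c).
    for (int i = 0; i < size; i++) {
      if (i % 2 == 0) arr.push_back(&b); else arr.push_back(&c);
    }
    std::vector<int> number;
    // reserve(), not resize(): the original resize(1000) followed by
    // push_back left 1000 leading zeros and produced 2000 elements.
    number.reserve(1000);
    for(int i=0; i<1000; i++){
      number.push_back(i);
    }
    for(auto item: arr) {
      auto ret = item->long_virtual_function(number);
      benchmark::DoNotOptimize(ret);
    }
  }
}
// Register the function as a benchmark
BENCHMARK(roundRobinVirtualFuntion);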
 
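Calling the virtual functions sequentially (size = 10000 calls)

void seqVirtualFuntion(benchmark::State& state) {
  for (auto _ : state) {
    srand (1);
    std::vector<A*> arr;
    B b;
    C c;
    auto half_size = size/2;
    // The loop that fills arr was lost from the source listing; this version
    // is reconstructed from the description (first half b, second half c).
    for (int i = 0; i < size; i++) {
      if (i < half_size) arr.push_back(&b); else arr.push_back(&c);
    }
    std::vector<int> number;
    // reserve(), not resize(): the original resize(1000) followed by
    // push_back left 1000 leading zeros and produced 2000 elements.
    number.reserve(1000);
    for(int i=0; i<1000; i++){
      number.push_back(i);
    }
    for(auto item: arr) {
      auto ret = item->long_virtual_function(number);
      benchmark::DoNotOptimize(ret);
    }
  }
}
// Register the function as a benchmark
BENCHMARK(seqVirtualFuntion);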
 
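With 4 branches (small function)

[Figure: virtual-cache-small]

With 900 branches (large function)

[Figure: virtual-cache-big]

For the small function, round-robin and sequential calling perform about the same.
For the large function, sequential calling is clearly much faster than round-robin.
The reason is that part of the CPU cache is dedicated to instructions. With sequential calls, one subclass's function stays in the instruction cache until all of its calls are done, and only then gets evicted. With round-robin calls, each function body is large enough to fill the cache, so the two functions may keep evicting each other, and performance drops.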
  
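Conclusions

The slowdown from virtual functions and the slowdown from function pointers come from the same mechanism.
When calling virtual functions in bulk, keep the memory access cache-friendly and avoid visiting objects in random order.
Calls through the vtable generally cannot be inlined by the compiler, so short functions are better left non-virtual.
For virtual functions with large bodies, avoid round-robin calling; call them in sequential blocks instead.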
