MXNet is an open-source deep learning framework whose core is implemented in C++. Building it from source requires several other open-source libraries; this article gives a brief description of each of MXNet's dependencies:
1. OpenBLAS: short for Open Basic Linear Algebra Subprograms. It is an open-source, optimized, high-performance multi-core BLAS library that covers matrix-matrix, matrix-vector, and vector-vector operations. It is licensed under BSD-3-Clause, so commercial use is permitted; at the time of writing, the latest release is 0.3.3. The source code is hosted on GitHub and is actively maintained by Zhang Xianyi and others.
OpenBLAS is a high-performance open-source BLAS implementation based on GotoBLAS2 1.13 (BSD version), initiated by the Parallel Software and Computational Science Laboratory of the Institute of Software, Chinese Academy of Sciences.
BLAS is an API standard that specifies numerical libraries for basic linear algebra operations, such as vector and matrix multiplication. First published in 1979, it is used as a building block for larger numerical packages (such as LAPACK) and is widely used in high-performance computing.
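For instance, a matrix-matrix multiply (a level-3 BLAS operation) goes through OpenBLAS's C interface as below; this is a minimal self-contained sketch, separate from the test program that follows:

#include <cstdio>
#include <cblas.h>

int main()
{
  // row-major 2 x 3 matrix A and 3 x 2 matrix B; C holds the 2 x 2 product
  double A[6] = { 1, 2, 3, 4, 5, 6 };
  double B[6] = { 1, 0, 0, 1, 1, 1 };
  double C[4] = { 0, 0, 0, 0 };

  // C = 1.0 * A * B + 0.0 * C
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              2, 2, 3, 1.0, A, 3, B, 2, 0.0, C, 2);

  for (int i = 0; i < 2; ++i)
    printf("%.1f %.1f\n", C[2 * i], C[2 * i + 1]); // expected: 4.0 5.0 / 10.0 11.0
  return 0;
}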
The test code is as follows (openblas_test.cpp):
#include "openblas_test.hpp"
#include
#include
int test_openblas_1()
{
int th_model = openblas_get_parallel();
switch (th_model) {
case OPENBLAS_SEQUENTIAL:
printf("OpenBLAS is compiled sequentially.\n");
break;
case OPENBLAS_THREAD:
printf("OpenBLAS is compiled using the normal threading model\n");
break;
case OPENBLAS_OPENMP:
printf("OpenBLAS is compiled using OpenMP\n");
break;
}
int n = 2;
double* x = (double*)malloc(n*sizeof(double));
double* upperTriangleResult = (double*)malloc(n*(n + 1)*sizeof(double) / 2);
for (int j = 0; j
The execution result is as follows:
2. DLPack: a single header file, dlpack.h. DLPack is an open in-memory tensor structure for sharing tensors between frameworks such as TensorFlow, PyTorch, and MXNet without any data copying.
The dlpack.h file defines two enumerations and four structs (a usage sketch follows the list):
Enum DLDeviceType: the supported device types, including CPU, CUDA GPU, OpenCL, Apple GPU, AMD GPU, and so on.
Enum DLDataTypeCode: the supported data type categories: signed int, unsigned int, and float.
Struct DLContext: the device context for tensors and operators; its members are the device type and the device id.
Struct DLDataType: the data type of the tensor; its members are code (the base type, which must be one of the DLDataTypeCode values), bits (the number of bits, such as 8, 16, or 32), and lanes (the number of lanes of the type).
Struct DLTensor: a plain tensor object that does not manage memory. Its members are the data pointer (void*), the DLContext, the number of dimensions, the DLDataType, the tensor's shape, the tensor's strides, and the byte offset from the data pointer to the start of the data.
Struct DLManagedTensor: manages the memory of a DLTensor (the producer attaches a deleter callback that the consumer invokes when it no longer needs the tensor).
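A minimal sketch of how these pieces fit together, assuming the dlpack.h layout described above (the function name test_dlpack_wrap is made up for illustration): wrapping an existing CPU buffer in a DLTensor, with no copy:

#include <cstdint>
#include "dlpack.h"

int test_dlpack_wrap()
{
  float data[6] = { 0, 1, 2, 3, 4, 5 }; // existing buffer; the DLTensor does not own it
  int64_t shape[2] = { 2, 3 };

  DLTensor tensor;
  tensor.data = data;              // raw data pointer
  tensor.ctx.device_type = kDLCPU; // device type
  tensor.ctx.device_id = 0;        // device id
  tensor.ndim = 2;                 // number of dimensions
  tensor.dtype.code = kDLFloat;    // base type, from DLDataTypeCode
  tensor.dtype.bits = 32;          // 32-bit float
  tensor.dtype.lanes = 1;          // scalar (non-vectorized) type
  tensor.shape = shape;
  tensor.strides = nullptr;        // nullptr means compact row-major layout
  tensor.byte_offset = 0;
  // the tensor can now be handed to any DLPack-aware framework without copying;
  // a DLManagedTensor would additionally carry this DLTensor plus a deleter callback
  return 0;
}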
3. MShadow: short for Matrix Shadow, a lightweight CPU/GPU matrix and tensor template library implemented in C++/CUDA. It consists entirely of .h and .cuh files, so it can be used by simply including the headers. Note: if MSHADOW_STAND_ALONE is not added to the project's preprocessor definitions, additional CBLAS, MKL, or CUDA support is required. Defining MSHADOW_STAND_ALONE to avoid the extra dependencies leaves some functions unimplemented; in dot_engine-inl.h, for example, some function bodies contain the statement: LOG(FATAL) << "Not implemented!";
For the test here, the MSHADOW_STAND_ALONE macro is left off and only the MSHADOW_USE_CBLAS macro is enabled.
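In other words, the compile-time configuration assumed by the test below boils down to the following sketch (the macros can equally be set in the project's preprocessor definitions):

// set before including any mshadow header (or via /D with MSVC, -D with gcc/clang)
#define MSHADOW_USE_CBLAS 1  // back matrix products with CBLAS (link against OpenBLAS)
#define MSHADOW_USE_MKL 0    // do not use MKL
// MSHADOW_STAND_ALONE stays undefined, so no BLAS-backed function degenerates to LOG(FATAL)
#include "mshadow/tensor.h"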
The test code is as follows (mshadow_test.cpp):
#include "mshadow_test.hpp"
#include
#include
#include "mshadow/tensor.h"
// reference: mshadow source code: mshadow/guide
int test_mshadow_1()
{
  // initialize tensor engine before using tensor operation
  mshadow::InitTensorEngine<mshadow::cpu>();
  // assume we have a float space
  float data[20];
  // create a 2 x 5 x 2 tensor, from existing space
  mshadow::Tensor<mshadow::cpu, 3> ts(data, mshadow::Shape3(2, 5, 2));
  // take first subscript of the tensor
  mshadow::Tensor<mshadow::cpu, 2> mat = ts[0];
  // Tensor object is only a handle, assignment means they have same data content
  // we can specify content type of a Tensor, if not specified, it is float by default
  mshadow::Tensor<mshadow::cpu, 2> mat2 = mat;
  mat = mshadow::Tensor<mshadow::cpu, 1>(data, mshadow::Shape1(10)).FlatTo2D();
  // shape of matrix, note size order is same as numpy
  fprintf(stdout, "%u X %u matrix\n", mat.size(0), mat.size(1));
  // initialize all elements to zero
  mat = 0.0f;
  // assign some values
  mat[0][1] = 1.0f; mat[1][0] = 2.0f;
  // elementwise operations
  mat += (mat + 10.0f) / 10.0f + 2.0f;
  // print out matrix; note: mat2 and mat are handles (pointers) to the same data
  for (mshadow::index_t i = 0; i < mat.size(0); ++i) {
    for (mshadow::index_t j = 0; j < mat.size(1); ++j) {
      fprintf(stdout, "%.2f ", mat2[i][j]);
    }
    fprintf(stdout, "\n");
  }
  // implicit_dot: matrix product without an explicit temporary; VectorDot: inner product
  mshadow::TensorContainer<mshadow::cpu, 2> lhs(mshadow::Shape2(2, 3)), rhs(mshadow::Shape2(2, 3)), ret(mshadow::Shape2(2, 2));
  lhs = 1.0;
  rhs = 1.0;
  ret = mshadow::expr::implicit_dot(lhs, rhs.T());
  mshadow::VectorDot(ret[0].Slice(0, 1), lhs[0], rhs[0]);
  fprintf(stdout, "vdot=%f\n", ret[0][0]);
  int cnt = 0;
  for (mshadow::index_t i = 0; i < ret.size(0); ++i) {
    for (mshadow::index_t j = 0; j < ret.size(1); ++j) {
      fprintf(stdout, "%.2f ", ret[i][j]);
    }
    fprintf(stdout, "\n");
  }
  fprintf(stdout, "\n");
  // fill lhs with 0, 1, 2, ...
  for (mshadow::index_t i = 0; i < lhs.size(0); ++i) {
    for (mshadow::index_t j = 0; j < lhs.size(1); ++j) {
      lhs[i][j] = cnt++;
      fprintf(stdout, "%.2f ", lhs[i][j]);
    }
    fprintf(stdout, "\n");
  }
  fprintf(stdout, "\n");
  // mat_choose_row_element picks one element per row, at the column given by index
  mshadow::TensorContainer<mshadow::cpu, 1> index(mshadow::Shape1(2)), choosed(mshadow::Shape1(2));
  index[0] = 1; index[1] = 2;
  choosed = mshadow::expr::mat_choose_row_element(lhs, index);
  for (mshadow::index_t i = 0; i < choosed.size(0); ++i) {
    fprintf(stdout, "%.2f ", choosed[i]);
  }
  fprintf(stdout, "\n");
  // mat_fill_row_element writes choosed back at the indexed positions
  mshadow::TensorContainer<mshadow::cpu, 2> recover_lhs(mshadow::Shape2(2, 3)), small_mat(mshadow::Shape2(2, 3));
  small_mat = -100.0f;
  recover_lhs = mshadow::expr::mat_fill_row_element(small_mat, choosed, index);
  for (mshadow::index_t i = 0; i < recover_lhs.size(0); ++i) {
    for (mshadow::index_t j = 0; j < recover_lhs.size(1); ++j) {
      fprintf(stdout, "%.2f ", recover_lhs[i][j] - lhs[i][j]);
    }
  }
  fprintf(stdout, "\n");
  // one_hot_encode expands the index vector into a one-hot matrix
  rhs = mshadow::expr::one_hot_encode(index, 3);
  for (mshadow::index_t i = 0; i < rhs.size(0); ++i) {
    for (mshadow::index_t j = 0; j < rhs.size(1); ++j) {
      fprintf(stdout, "%.2f ", rhs[i][j]);
    }
    fprintf(stdout, "\n");
  }
  fprintf(stdout, "\n");
  // take selects rows of weight by index: an embedding lookup
  mshadow::TensorContainer<mshadow::cpu, 1> idx(mshadow::Shape1(3));
  idx[0] = 8;
  idx[1] = 0;
  idx[2] = 1;
  mshadow::TensorContainer<mshadow::cpu, 2> weight(mshadow::Shape2(10, 5));
  mshadow::TensorContainer<mshadow::cpu, 2> embed(mshadow::Shape2(3, 5));
  for (mshadow::index_t i = 0; i < weight.size(0); ++i) {
    for (mshadow::index_t j = 0; j < weight.size(1); ++j) {
      weight[i][j] = i;
    }
  }
  embed = mshadow::expr::take(idx, weight);
  for (mshadow::index_t i = 0; i < embed.size(0); ++i) {
    for (mshadow::index_t j = 0; j < embed.size(1); ++j) {
      fprintf(stdout, "%.2f ", embed[i][j]);
    }
    fprintf(stdout, "\n");
  }
  fprintf(stdout, "\n\n");
  // take_grad scatters the gradient of take back into the weight matrix
  weight = mshadow::expr::take_grad(idx, embed, 10);
  for (mshadow::index_t i = 0; i < weight.size(0); ++i) {
    for (mshadow::index_t j = 0; j < weight.size(1); ++j) {
      fprintf(stdout, "%.2f ", weight[i][j]);
    }
    fprintf(stdout, "\n");
  }
fprintf(stdout, "upsampling\n");
#ifdef small
#undef small
#endif
mshadow::TensorContainer small(mshadow::Shape2(2, 2));
small[0][0] = 1.0f;
small[0][1] = 2.0f;
small[1][0] = 3.0f;
small[1][1] = 4.0f;
mshadow::TensorContainer large(mshadow::Shape2(6, 6));
large = mshadow::expr::upsampling_nearest(small, 3);
for (mshadow::index_t i = 0; i < large.size(0); ++i) {
for (mshadow::index_t j = 0; j < large.size(1); ++j) {
fprintf(stdout, "%.2f ", large[i][j]);
}
fprintf(stdout, "\n");
}
small = mshadow::expr::pool(large, small.shape_, 3, 3, 3, 3);
// shutdown tensor enigne after usage
for (mshadow::index_t i = 0; i < small.size(0); ++i) {
for (mshadow::index_t j = 0; j < small.size(1); ++j) {
fprintf(stdout, "%.2f ", small[i][j]);
}
fprintf(stdout, "\n");
}
fprintf(stdout, "mask\n");
mshadow::TensorContainer mask_data(mshadow::Shape2(6, 8));
mshadow::TensorContainer mask_out(mshadow::Shape2(6, 8));
mshadow::TensorContainer mask_src(mshadow::Shape1(6));
mask_data = 1.0f;
for (int i = 0; i < 6; ++i) {
mask_src[i] = static_cast(i);
}
mask_out = mshadow::expr::mask(mask_src, mask_data);
for (mshadow::index_t i = 0; i < mask_out.size(0); ++i) {
for (mshadow::index_t j = 0; j < mask_out.size(1); ++j) {
fprintf(stdout, "%.2f ", mask_out[i][j]);
}
fprintf(stdout, "\n");
}
mshadow::ShutdownTensorEngine();
return 0;
}
// user defined unary operator addone
struct addone {
  // map can be template function
  template<typename DType>
  MSHADOW_XINLINE static DType Map(DType a) {
    return a + static_cast<DType>(1);
  }
};
// user defined binary operator max of two
struct maxoftwo {
  // map can also be normal functions,
  // however, this can only be applied to float tensor
  MSHADOW_XINLINE static float Map(float a, float b) {
    if (a > b) return a;
    else return b;
  }
};

int test_mshadow_2()
{
  // initialize tensor engine before using tensor operation, needed for CuBLAS
  mshadow::InitTensorEngine<mshadow::cpu>();
  // create a stream on device 0
  mshadow::Stream<mshadow::cpu> *stream_ = mshadow::NewStream<mshadow::cpu>(0);
  mshadow::Tensor<mshadow::cpu, 2> mat = mshadow::NewTensor<mshadow::cpu>(mshadow::Shape2(2, 3), 0.0f, stream_);
  mshadow::Tensor<mshadow::cpu, 2> mat2 = mshadow::NewTensor<mshadow::cpu>(mshadow::Shape2(2, 3), 0.0f, stream_);
  mat[0][0] = -2.0f;
  // apply the user defined operators through F<OP>
  mat = mshadow::expr::F<maxoftwo>(mshadow::expr::F<addone>(mat) + 0.5f, mat2);
  for (mshadow::index_t i = 0; i < mat.size(0); ++i) {
    for (mshadow::index_t j = 0; j < mat.size(1); ++j) {
      fprintf(stdout, "%.2f ", mat[i][j]);
    }
    fprintf(stdout, "\n");
  }
  mshadow::FreeSpace(&mat); mshadow::FreeSpace(&mat2);
  mshadow::DeleteStream(stream_);
  // shutdown tensor engine after usage
  mshadow::ShutdownTensorEngine<mshadow::cpu>();
  return 0;
}
The execution result of test_mshadow_2 is as follows:
4. DMLC-Core: short for Distributed Machine Learning Common Codebase. It is the base module shared by all DMLC projects and provides the common building blocks for constructing efficient and scalable distributed machine learning libraries.
The test code is as follows (dmlc_test.cpp):
#include "dmlc_test.hpp"
#include
#include
#include
#include
#include
// reference: dmlc-core/example and dmlc-core/test
struct MyParam : public dmlc::Parameter<MyParam> {
  float learning_rate;
  int num_hidden;
  int activation;
  std::string name;
  // declare parameters in header file
  DMLC_DECLARE_PARAMETER(MyParam) {
    DMLC_DECLARE_FIELD(num_hidden).set_range(0, 1000)
        .describe("Number of hidden unit in the fully connected layer.");
    DMLC_DECLARE_FIELD(learning_rate).set_default(0.01f)
        .describe("Learning rate of SGD optimization.");
    DMLC_DECLARE_FIELD(activation).add_enum("relu", 1).add_enum("sigmoid", 2)
        .describe("Activation function type.");
    DMLC_DECLARE_FIELD(name).set_default("mnet")
        .describe("Name of the net.");
    // user can also set nhidden besides num_hidden
    DMLC_DECLARE_ALIAS(num_hidden, nhidden);
    DMLC_DECLARE_ALIAS(activation, act);
  }
};
// register it in cc file
DMLC_REGISTER_PARAMETER(MyParam);
int test_dmlc_parameter()
{
  // simulate command line arguments of the form key=value
  int argc = 4;
  const char* argv[4] = {
#ifdef _DEBUG
    "E:/GitCode/MXNet_Test/lib/dbg/x64/ThirdPartyLibrary_Test.exe",
#else
    "E:/GitCode/MXNet_Test/lib/rel/x64/ThirdPartyLibrary_Test.exe",
#endif
    "num_hidden=100",
    "name=aaa",
    "activation=relu"
  };
  MyParam param;
  std::map<std::string, std::string> kwargs;
  for (int i = 0; i < argc; ++i) {
    char name[256], val[256];
    if (sscanf(argv[i], "%[^=]=%[^\n]", name, val) == 2) {
      kwargs[name] = val;
    }
  }
  fprintf(stdout, "Docstring\n---------\n%s", MyParam::__DOC__().c_str());
  fprintf(stdout, "start to set parameters ...\n");
  param.Init(kwargs);
  fprintf(stdout, "-----\n");
  fprintf(stdout, "param.num_hidden=%d\n", param.num_hidden);
  fprintf(stdout, "param.learning_rate=%f\n", param.learning_rate);
  fprintf(stdout, "param.name=%s\n", param.name.c_str());
  fprintf(stdout, "param.activation=%d\n", param.activation);
  return 0;
}
namespace tree {
struct Tree {
  virtual void Print() = 0;
  virtual ~Tree() {}
};
struct BinaryTree : public Tree {
  virtual void Print() {
    printf("I am binary tree\n");
  }
};
struct AVLTree : public Tree {
  virtual void Print() {
    printf("I am AVL tree\n");
  }
};
// registry to get the trees
struct TreeFactory
    : public dmlc::FunctionRegEntryBase<TreeFactory, std::function<Tree*()> > {
};

#define REGISTER_TREE(Name)                                       \
  DMLC_REGISTRY_REGISTER(::tree::TreeFactory, TreeFactory, Name)  \
  .set_body([]() { return new Name(); })

DMLC_REGISTRY_FILE_TAG(my_tree);
}  // namespace tree

// usually this sits in a separate file
namespace dmlc {
DMLC_REGISTRY_ENABLE(tree::TreeFactory);
}

namespace tree {
// register the trees, can be in separate files
REGISTER_TREE(BinaryTree)
.describe("This is a binary tree.");
REGISTER_TREE(AVLTree);
DMLC_REGISTRY_LINK_TAG(my_tree);
}
int test_dmlc_registry()
{
  // construct a binary tree
  tree::Tree *binary = dmlc::Registry<tree::TreeFactory>::Find("BinaryTree")->body();
  binary->Print();
  // construct an AVL tree
  tree::Tree *avl = dmlc::Registry<tree::TreeFactory>::Find("AVLTree")->body();
  avl->Print();
  delete binary;
  delete avl;
  return 0;
}
The execution result of test_dmlc_parameter is as follows:
5. TVM: a compiler stack for deep learning systems. It aims to close the gap between deep learning frameworks and performance- and efficiency-oriented hardware backends, working together with the frameworks to provide end-to-end compilation to different backends. Besides dlpack and dmlc-core, TVM also depends on HalideIR. Moreover, building TVM currently produces a pile of C2440 and C2664 errors (MSVC errors about failing to convert one type to another). Since compiling the MXNet sources currently only requires the files in the c_api, core, and pass directories under nnvm/src in the TVM tree, debugging the TVM library is deferred for now.
6. OpenCV: optional; the build process is described at https://blog.csdn.net/fengbingchun/article/details/84030309
7. CUDA and cuDNN: optional; the build process is described at https://blog.csdn.net/fengbingchun/article/details/53892997
GitHub: https://github.com/fengbingchun/MXNet_Test