Setup Intel Optane DCPMM in KVM/QEMU Guests

We use the Intel Optane Data Center Persistent Memory Module as the NVDIMM device here. Intel Optane DC Persistent Memory (DCPMM or PMEM) is a new generation of non-volatile memory (NVM) technology that is fast enough for processors to access stored data directly, without high latency and without a significant reduction in performance.

We can use DCPMM in a guest VM through the QEMU vNVDIMM device. QEMU has supported vNVDIMM since version 2.6, but we suggest using it only on SLES 12 SP5, SLES 15 SP2, and later.

1. Intel Optane DCPMM

Intel Optane DCPMM uses the DDR-T protocol and has two different operating modes: Memory Mode and App Direct Mode.

1.1 Memory Mode

[Figure 1: Memory Mode]

In this mode, DCPMM behaves like DRAM and loses its data after a power loss. The regular DRAM in the system acts as a cache for the DCPMM.

For example, with 1 TB of DCPMM and 128 GB of DRAM, the system recognizes only the 1 TB of memory provided by the DCPMM; the 128 GB of DRAM becomes a cache.

1.2 APP Direct Mode

[Figure 2: App Direct Mode]

In this mode, DCPMM works as an NVDIMM device, but only for applications that explicitly support it. Such an application stores its data on the device, and the data is not lost even when the system is powered off.

QEMU uses this mode to expose DCPMM to guest VMs.

2. Virtual NVDIMM

The vNVDIMM device in QEMU is backed by a memory backend (i.e. memory-backend-file or memory-backend-ram). The current vNVDIMM device supports only the persistent memory mode, not the block window mode.

A simple way to create a vNVDIMM device at startup time is via the following command-line options:

 -machine pc,nvdimm
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
 -device nvdimm,id=nvdimm1,memdev=mem1

Where:

  • the nvdimm machine option enables the vNVDIMM feature.

  • slots=$N should be equal to or larger than the total number of normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here. If the hotplug feature is required, reserve enough slots for the eventual total number of RAM and vNVDIMM devices.

  • maxmem=$MAX_SIZE should be equal to or larger than the total size of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be >= $RAM_SIZE + $NVDIMM_SIZE here.

  • object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE creates a backend storage of size $NVDIMM_SIZE on the file $PATH. All accesses to the virtual NVDIMM device go to the file $PATH. share=on/off controls the visibility of guest writes. If share=on, guest writes are applied to the backend file; if another guest uses the same backend file with share=on, those writes are visible to it as well. If share=off, guest writes are not applied to the backend file and are therefore invisible to other guests.

  • device nvdimm,id=nvdimm1,memdev=mem1 creates a virtual NVDIMM device whose storage is provided by the above memory backend device.

 

Multiple vNVDIMM devices can be created if multiple pairs of "-object" and "-device" options are provided.
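
For example, two vNVDIMM devices backed by two separate files can be defined as follows (a sketch with hypothetical backend paths):

 -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/nvdimm1,size=4G
 -device nvdimm,id=nvdimm1,memdev=mem1
 -object memory-backend-file,id=mem2,share=on,mem-path=/tmp/nvdimm2,size=4G
 -device nvdimm,id=nvdimm2,memdev=mem2

Remember that slots and maxmem in the "-m" option must account for all of the defined devices.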

For the above command-line options, if the guest OS has the proper NVDIMM driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device in persistent memory mode whose size is $NVDIMM_SIZE.

 

QEMU v2.7.0 and later implement label support for vNVDIMM devices. To enable labels on a vNVDIMM device, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

 

QEMU v2.8.0 and later implement hotplug support for vNVDIMM devices. Similar to RAM hotplug, vNVDIMM hotplug is performed with the two monitor commands "object_add" and "device_add".

3. Configure Intel Optane DCPMM for KVM/QEMU Guest VMs

3.1 Install ipmctl and ndctl in the host

ipmctl:   This is a tool to configure and manage DCPMM directly.

ndctl:      (Non-Volatile Device Control) This is a command-line tool used to manage the libnvdimm subsystem in the Linux kernel.

sudo zypper in ipmctl
sudo zypper in ndctl
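
As a quick sanity check before changing the provisioning goal, you can verify that the modules are visible to both tools (the output varies by platform):

$ sudo ipmctl show -dimm
$ sudo ndctl list --dimms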

3.2 Set DCPMM to App Direct mode. A reboot is needed after this step.

$ sudo ipmctl create -goal PersistentMemoryType=AppDirect

The following configuration will be applied:
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size 
==================================================================
 0x0000   | 0x0001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
Do you want to continue? [y/n] y
Created following region configuration goal
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size 
==================================================================
 0x0000   | 0x0001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
A reboot is required to process new memory allocation goals.

3.3 After the reboot, we can check the region information as below:

$ sudo ipmctl show -region

 SocketID | ISetID             | PersistentMemoryType | Capacity    | FreeCapacity | HealthState 
=================================================================================================
 0x0000   | 0x8b63c3d056fb8888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
 0x0001   | 0x3dddc3d0e20a8888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
 0x0002   | 0x7345c3d0e3d28888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
 0x0003   | 0xd797c3d0aeed8888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
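
The same regions can also be inspected from the ndctl side (region numbering depends on the platform):

$ sudo ndctl list --regions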

3.4 Create a namespace with ndctl; alternatively, a custom size can be specified, as in the second example below.

$ sudo ndctl create-namespace --region=region0
{
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"dev",
    "size":405872312320,
    "uuid":"69579483-0f2f-4f45-bc22-da7048569d22",
    "sector_size":512,
    "align":2097152,
    "blockdev":"pmem0"
}

$ sudo ndctl create-namespace --region=region0 --size=36g
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"35.44 GiB (38.05 GB)",
  "uuid":"69579483-0f2f-4f45-bc22-da7048569d22",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
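
If the namespace layout needs to be changed later, it can be disabled and destroyed before being recreated (a sketch; the namespace name depends on your configuration):

$ sudo ndctl disable-namespace namespace0.0
$ sudo ndctl destroy-namespace namespace0.0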

3.5 Initialize the device we just created, then format and mount the filesystem; the procedure is the same as for a normal block device.

$ sudo mkfs.xfs /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=48770944 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=195083776, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=95255, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

$ sudo mkdir /pmemfs0

$ sudo mount -o dax /dev/pmem0 /pmemfs0
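
To make this mount persistent across host reboots, an /etc/fstab entry along the following lines can be added (a sketch; adjust the device and options to your setup):

/dev/pmem0  /pmemfs0  xfs  dax  0 0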

3.6 Create the vNVDIMM backend on the host

3.6.1 Normal file

$ truncate -s 10G /pmemfs0/nvdimm 

We need to add "unarmed=on" to the QEMU command line, because a regular file backend cannot guarantee the persistence of guest writes.

3.6.2 DAX device

$ ndctl create-namespace -f -e namespace0.0 -m devdax 

The resulting /dev/dax0.0 device can be used directly in the "mem-path" option.
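
For example (a sketch; the size should match the namespace capacity, and device dax typically requires 2 MB alignment):

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=$NVDIMM_SIZE,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1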

3.6.3 DAX file

$ ndctl create-namespace -f -e namespace0.0 -m fsdax

Create a partition pmem0p1 on /dev/pmem0, for example with parted or fdisk.
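
A minimal sketch, assuming a single partition spanning the whole device (any partitioning tool works):

$ sudo parted -s /dev/pmem0 mklabel gpt mkpart primary 2MiB 100%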

$ mount -o dax /dev/pmem0p1 /mnt

Create or copy a disk image file in /mnt with qemu-img or dd, and use its path as the "mem-path" option.
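
For example (a sketch with a hypothetical image name and size):

$ sudo qemu-img create -f raw /mnt/nvdimm-guest.img 10G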

3.7 Create a QEMU guest VM and attach the backend we just created as a vNVDIMM device.

$ sudo qemu-system-x86_64 \
  -m 4G,slots=4,maxmem=32G \
  -smp 4 \
  -machine pc,accel=kvm,nvdimm=on \
  -enable-kvm \
  -boot order=cd \
  -vnc :0 \
  -vga virtio \
  -net nic \
  -net user,hostfwd=tcp::2222-:22 \
  -object memory-backend-file,id=mem1,share=on,mem-path=/pmemfs0/nvdimm,size=10G,align=2M \
  -device nvdimm,memdev=mem1,unarmed=on,id=nv1,label-size=2M \
  -cdrom /home/SLE-15-SP2-Full-x86_64-GM-Media1.iso \
  -hda /home/VMachines/sles15-sp2.qcow2 \
  -serial stdio

4. Setup DCPMM inside Guest VM

4.1 Detect the DCPMM device inside the guest VM. Make sure the libnvdimm.ko driver is loaded and ndctl is installed.

$ ssh 'user'@localhost -p2222  # change 'user' to the user you created, or "root"

$ lsmod | grep libnvdimm
libnvdimm             192512  5 dax_pmem,dax_pmem_core,nd_btt,nd_pmem,nfit

$ sudo zypper in ndctl

$ sudo ndctl list
{
  "dev":"namespace0.0",
  "mode":"raw",
  "size":10735321088,
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

4.2 Configure the device as we did on the host:

$ sudo ndctl create-namespace -f -e namespace0.0 -m fsdax
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map": "dev",
  "size":"9.84 GiB (10.57 GB)",
  "uuid":"3ac049f8-bcf6-4e30-b74f-7ed33916e405"
  "raw_uuid:"b9942cdd-502c-4e31-af51-46ac5d5a1f93"
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

$ sudo mkfs.xfs -f /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=644864 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=2579456, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

$ sudo mkdir /pmemfs0

$ sudo mount -o dax /dev/pmem0 /pmemfs0

4.3 Create a file on the /dev/pmem0 device, then restart the guest OS to verify that the file is persistent.

$ echo "12345" | sudo tee /pmemfs0/test > /dev/null
$ sudo reboot
$ sudo mount -o dax /dev/pmem0 /pmemfs0
$ sudo cat /pmemfs0/test
12345

5. Advanced Features of vNVDIMM in QEMU

5.1 NVDIMM Labels

Labels contain metadata to describe the NVDIMM features and namespace configuration. They are stored on each NVDIMM in a reserved area called the label storage area. The exact location of the label storage area is NVDIMM-specific. QEMU v2.7.0 and later store labels at the end of backend storage. If a memory backend file, which was previously used as the backend of a vNVDIMM device without labels, is now used for a vNVDIMM device with labels, the data in the label area at the end of the file will be inaccessible to the guest. If any useful data (e.g. the metadata of the file system) was stored there, the latter usage may result in guest data corruption (e.g. breakage of the guest file system):

-device nvdimm,id=nvdimm1,memdev=mem1,label-size=256K

Label information can be accessed with the ndctl utility, which needs to be installed inside the guest.
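
For example, the following commands (a sketch; the DIMM name nmem0 depends on enumeration order) list the virtual DIMMs and dump their label area from inside the guest:

$ sudo ndctl list --dimms
$ sudo ndctl read-labels -j nmem0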

5.2 NVDIMM HotPlug

QEMU v2.8.0 and later implement the hotplug support for dynamically adding vNVDIMM devices to a running guest. Similarly to the RAM hotplug, the vNVDIMM hotplug is accomplished by two monitor console commands "object_add" and "device_add".

The following commands add another 4GB vNVDIMM device to the guest using the qemu monitor interface:

 (qemu) object_add memory-backend-file,id=mem3,share=on,mem-path=/virtual-machines/qemu/f27nvdimm2,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem3

Each hotplugged vNVDIMM device consumes one memory slot. Users should always ensure that the memory option "-m ...,slots=N" reserves enough slots.

5.3 NVDIMM  IO Alignment

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping address to the page size (getpagesize(2)) by default. However, some backends may require a different alignment; QEMU v2.12.0 and later provide the 'align' option to memory-backend-file so that users can specify the proper alignment.

The following command-line options use /dev/dax0.0 as the backend of a vNVDIMM device with 2 MB alignment:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1

5.4 NVDIMM  Persistence

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure which allows the platform to communicate what features it supports related to NVDIMM data persistence. Users can provide a persistence value to a guest via the optional "nvdimm-persistence" machine command line option:

 -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

  • "mem-ctrl" - The platform supports flushing dirty data from the memory controller to the NVDIMMs in the event of power loss.

  • "cpu" - The platform supports flushing dirty data from the CPU cache to the NVDIMMs in the event of power loss. This implies that the platform also supports flushing dirty data through the memory controller on power loss.

6. Conclusion

We set up the Intel Optane DC persistent memory in App Direct mode using the ipmctl and ndctl tools, and then applied similar steps inside the guest VM.

Applications will be able to utilize the persistent memory using standard file APIs or the Persistent Memory Development Kit (PMDK).  

 

Reference:

  1. https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt
  2. https://software.intel.com/content/www/us/en/develop/articles/provision-intel-optane-dc-persistent-memory-for-kvm-qemu-guests.html
  3. https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/virtualization/qemu#create-a-guest-vm
