We use the Intel Optane Data Center Persistent Memory Module as the NVDIMM device here. Intel Optane DC Persistent Memory (DCPMM, or PMEM) is a new generation of non-volatile memory (NVM) technology that is fast enough for processors to access stored data directly, without high latency and without a significant loss of performance.
We can use DCPMM in a guest VM through the QEMU vNVDIMM device. QEMU has supported vNVDIMM since version 2.6, but we suggest using it only on SLES 12 SP5, SLES 15 SP2, and later.
Intel Optane DCPMM uses the DDR-T protocol and has two different working modes: Memory Mode and App Direct Mode.
In Memory Mode, DCPMM behaves like DRAM: it loses its data when power is lost, and the regular DRAM acts as a cache for the DCPMM.
For example, with 1 TB of DCPMM and 128 GB of DRAM, the system recognizes only the 1 TB of DCPMM as memory; the 128 GB of DRAM becomes a cache.
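For comparison, provisioning the modules entirely in Memory Mode is also done with ipmctl; a minimal sketch (the percentage value here is an assumption, and a reboot is required just as for the App Direct goal shown later in this document):

$ sudo ipmctl create -goal MemoryMode=100   # use 100% of DCPMM capacity as volatile memory; reboot afterwards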
In App Direct Mode, DCPMM works as an NVDIMM device, but only for applications that explicitly support it. Such an application stores data on the device, and the data is not lost even when the system is powered off.
QEMU uses this mode to provide vNVDIMM devices to guest VMs.
The vNVDIMM device in QEMU is provided by a memory backend (i.e. memory-backend-file or memory-backend-ram). The current vNVDIMM device implements only the persistent memory mode, not the NVDIMM block window mode.
A simple way to create a vNVDIMM device at startup time is via the following command-line options:
-machine pc,nvdimm -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
-object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
-device nvdimm,id=nvdimm1,memdev=mem1
Where:
- The "nvdimm" machine option enables the vNVDIMM feature.
- "slots=$N" should be equal to or larger than the total number of normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here. If the hotplug feature is required, assign enough slots for the eventual total number of RAM and vNVDIMM devices.
- "maxmem=$MAX_SIZE" should be equal to or larger than the total size of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be >= $RAM_SIZE + $NVDIMM_SIZE here.
- "-object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE" creates a backend storage of size $NVDIMM_SIZE on the file $PATH. All accesses to the virtual NVDIMM device go to the file $PATH. "share=on/off" controls the visibility of guest writes. If "share=on", guest writes are applied to the backend file; if another guest uses the same backend file with "share=on", those writes are visible to it as well. If "share=off", guest writes are not applied to the backend file and are thus invisible to other guests.
- "-device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM device whose storage is provided by the above memory backend device.
Multiple vNVDIMM devices can be created if multiple pairs of "-object" and "-device" are provided.
For the above command-line options, if the guest OS has the proper NVDIMM driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to detect an NVDIMM device in persistent memory mode with size $NVDIMM_SIZE.
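Inside such a guest, a quick sanity check (a sketch, assuming the guest kernel ships the ACPI NFIT driver) might look like:

$ dmesg | grep -i nfit      # the ACPI NFIT table describing the vNVDIMM should be parsed at boot
$ ls /dev/pmem*             # a pmem block device should appear once the namespace is active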
QEMU v2.7.0 and later implement label support for vNVDIMM devices. To enable labels on a vNVDIMM device, simply add the "label-size=$SZ" option to "-device nvdimm", e.g.
-device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K
QEMU v2.8.0 and later implement the hotplug support for vNVDIMM devices. Similar to the RAM hotplug, the vNVDIMM hotplug is implemented by two monitor commands "object_add" and "device_add".
ipmctl: a tool to configure and manage Intel Optane DCPMM modules directly.
ndctl (Non-Volatile Device Control): a command-line tool used to manage the libnvdimm subsystem in the Linux kernel.
sudo zypper in ipmctl
sudo zypper in ndctl
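Before provisioning anything, it can be worth confirming that both tools can see the modules; a quick sanity check could be:

$ sudo ipmctl show -dimm        # list each DCPMM module with its capacity and health state
$ sudo ndctl list --dimms       # list the NVDIMMs known to the kernel's libnvdimm subsystem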
$ sudo ipmctl create -goal PersistentMemoryType=AppDirect
The following configuration will be applied:
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
 0x0000   | 0x0001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
Do you want to continue? [y/n] y
Created following region configuration goal
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
 0x0000   | 0x0001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0000   | 0x0111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0001   | 0x1111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0002   | 0x2111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3001 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3011 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3101 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
 0x0003   | 0x3111 | 0.000 GiB  | 416.000 GiB    | 0.000 GiB
A reboot is required to process new memory allocation goals.
$ sudo ipmctl show -region
SocketID | ISetID             | PersistentMemoryType | Capacity    | FreeCapacity | HealthState
=================================================================================================
0x0000   | 0x8b63c3d056fb8888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
0x0001   | 0x3dddc3d0e20a8888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
0x0002   | 0x7345c3d0e3d28888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
0x0003   | 0xd797c3d0aeed8888 | AppDirect            | 384.000 GiB | 0.000 GiB    | Healthy
$ sudo ndctl create-namespace --region=region0
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":405872312320,
  "uuid":"69579483-0f2f-4f45-bc22-da7048569d22",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
$ sudo ndctl create-namespace --region=region0 --size=36g
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"35.44 GiB (38.05 GB)",
  "uuid":"69579483-0f2f-4f45-bc22-da7048569d22",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}
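To double-check the result, the namespaces that were just created can be listed; the output has the same JSON shape as above:

$ sudo ndctl list -N            # -N/--namespaces lists all active namespaces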
$ sudo mkfs.xfs /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=48770944 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=195083776, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=95255, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mkdir /pmemfs0
$ sudo mount -o dax /dev/pmem0 /pmemfs0
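A quick way to verify that the filesystem really is mounted with DAX enabled:

$ mount | grep pmem0            # the mount options should include "dax" (shown as "dax=always" on newer kernels)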
$ truncate -s 10G /pmemfs0/nvdimm
Because a regular file backend cannot guarantee the persistence of guest writes, we need to add "unarmed=on" to the "-device nvdimm" option on the QEMU command line.
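This mirrors the full command line shown later in this document; the option is simply appended to the -device nvdimm entry:

-device nvdimm,id=nvdimm1,memdev=mem1,unarmed=on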
$ ndctl create-namespace -f -e namespace0.0 -m devdax
The resulting /dev/dax0.0 device can then be used directly as the "mem-path" option.
$ ndctl create-namespace -f -e namespace0.0 -m fsdax
Create a partition pmem0p1 on /dev/pmem0.
$ mount -o dax /dev/pmem0p1 /mnt
Create or copy a disk image file into /mnt with qemu-img or dd.
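For example (the image path and size here are just placeholders), a fresh guest disk image could be created with:

$ sudo qemu-img create -f qcow2 /mnt/sles15-sp2.qcow2 40G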
$ sudo qemu-system-x86_64 \
    -m 4G,slots=4,maxmem=32G \
    -smp 4 \
    -machine pc,accel=kvm,nvdimm=on \
    -enable-kvm \
    -boot order=cd \
    -vnc :0 \
    -vga virtio \
    -net nic \
    -net user,hostfwd=tcp::2222-:22 \
    -object memory-backend-file,id=mem1,share,mem-path=/pmemfs0/nvdimm,size=10G,align=2M \
    -device nvdimm,memdev=mem1,unarmed=on,id=nv1,label-size=2M \
    -cdrom /home/SLE-15-SP2-Full-x86_64-GM-Media1.iso \
    -hda /home/VMachines/sles15-sp2.qcow2 \
    -serial stdio
$ ssh 'user'@localhost -p2222   # change 'user' to the user you created, or "root"
$ lsmod | grep libnvdimm
libnvdimm             192512  5 dax_pmem,dax_pmem_core,nd_btt,nd_pmem,nfit
$ sudo zypper in ndctl
$ sudo ndctl list
{
  "dev":"namespace0.0",
  "mode":"raw",
  "size":10735321088,
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}
$ sudo ndctl create-namespace -f -e namespace0.0 -m fsdax
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"9.84 GiB (10.57 GB)",
  "uuid":"3ac049f8-bcf6-4e30-b74f-7ed33916e405",
  "raw_uuid":"b9942cdd-502c-4e31-af51-46ac5d5a1f93",
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}
$ sudo mkfs.xfs -f /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=644864 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=2579456, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o dax /dev/pmem0 /pmemfs0
$ sudo echo "12345" > /pmemfs0/test
$ sudo reboot
$ sudo mount -o dax /dev/pmem0 /pmemfs0
$ sudo cat /pmemfs0/test
12345
Labels contain metadata to describe the NVDIMM features and namespace configuration. They are stored on each NVDIMM in a reserved area called the label storage area. The exact location of the label storage area is NVDIMM-specific. QEMU v2.7.0 and later store labels at the end of backend storage. If a memory backend file, which was previously used as the backend of a vNVDIMM device without labels, is now used for a vNVDIMM device with the label, the data in the label area at the end of the file will be inaccessible to the guest. If any useful data (e.g. the meta-data of the file system) was stored there, the latter usage may result in guest data corruption (e.g. breakage of the guest file system):
-device nvdimm,id=nvdimm1,memdev=mem1,label-size=256K
Label information can be accessed using the ndctl command-line utility, which needs to be installed within the guest.
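For example (assuming the guest's ndctl version supports JSON output for this subcommand, and that nmem0 is the DIMM name ndctl reports), the raw label area of a DIMM can be dumped with:

$ sudo ndctl read-labels -j nmem0   # dump the namespace label area of the given NVDIMM as JSON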
QEMU v2.8.0 and later implement the hotplug support for dynamically adding vNVDIMM devices to a running guest. Similarly to the RAM hotplug, the vNVDIMM hotplug is accomplished by two monitor console commands "object_add" and "device_add".
The following commands add another 4 GB vNVDIMM device to the guest using the QEMU monitor interface:
(qemu) object_add memory-backend-file,id=mem3,share=on,mem-path=/virtual-machines/qemu/f27nvdimm2,size=4G
(qemu) device_add nvdimm,id=nvdimm2,memdev=mem3
Each hotplugged vNVDIMM device consumes one memory slot. Users should always ensure that the memory option "-m ...,slots=N" specifies enough slots.
QEMU uses mmap to map vNVDIMM backends and aligns the mapping address to the page size (getpagesize) by default. QEMU v2.12.0 and later provide an 'align' option to memory-backend-file that allows users to specify the proper alignment.
The following options use /dev/dax0.0 as the backend of a vNVDIMM device with 2 MB alignment:
-object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
-device nvdimm,id=nvdimm1,memdev=mem1
ACPI 6.2 Errata A added support for a new Platform Capabilities Structure which allows the platform to communicate what features it supports related to NVDIMM data persistence. Users can provide a persistence value to a guest via the optional "nvdimm-persistence" machine command line option:
-machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu
There are currently two valid values for this option:
"mem-ctrl" - The platform supports flushing dirty data from the memory controller to the NVDIMMs in the event of power loss.
"cpu" - The platform supports flushing dirty data from the CPU cache to the NVDIMMs in the event of power loss. This implies that the platform also supports flushing dirty data through the memory controller on power loss.
We set up the Intel Optane DC Persistent Memory in App Direct Mode using the ipmctl and ndctl tools, and then applied similar steps inside the guest VM.
Applications can then use the persistent memory through standard file APIs or the Persistent Memory Development Kit (PMDK).
Reference: