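I first NFS-mounted a folder from the host machine on the target board. When copying the contents of that folder onto the target with cp, I observed the following:
Copying the HELLO folder works fine; the largest file in the HELLO folder is a little over 13K.
But transferring larger files, for example files of 20K or more, fails with: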
NFS: server 192.168.0.94 not responding,still trying...
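Searching online, I found the thread below to be the most plausible explanation. I suspect the problem lies with NFS and is related to the size of the NIC ring or buffer...
How to fix it:
From https://bugzilla.redhat.com/show_bug.cgi?id=222662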
Description From YangKun 2007-01-15 11:05:32 EDT
Description of problem:
In the Red Hat Hardware Test Suite (HTS), we run a UDP test in both the host OS and
FV guests (using the NFS network test). Both the FV guest and the host OS are
clients; another machine on our LAN acts as the NFS server. The mount
options are "rw,intr,rsize=32768,wsize=32768,udp". If we mount in the host OS,
everything works well, but if we mount in the FV guest, it prints a
"nfs: server xxxx not responding, still trying" error message. If I remove
the "udp" option from the option list, everything works well in both the
host OS and the FV guest. I'm not sure whether this is a Xen bug. SELinux is
enabled (enforcing) and the firewall is turned on (default settings) on the host OS and
FV guest; both SELinux and the firewall are turned off/disabled on the server machine.
Version-Release number of selected component (if applicable):
RHEL5-Server-20070105.0-i386
How reproducible:
always
Steps to Reproduce:
1. create a FV
2. boot the FV
3. mount use the above mount options
------- Comment #1 From YangKun 2007-01-15 11:07:57 EDT -------
*** Bug 220584 has been marked as a duplicate of this bug. ***
------- Comment #2 From Stephen Tweedie 2007-01-15 11:19:41 EDT -------
Can you please try to narrow things down with tcpdump to see if you can identify
where packets are being lost?
Are you mounting with identical NFS options in both the dom0 and the HVM domU?
Does tcpdump in the dom0 give you any clue as to packets being malformed or dropped?
Thanks.
------- Comment #3 From YangKun 2007-01-16 06:53:24 EDT -------
Yes, I mount with identical NFS options in both the dom0 and the HVM domU.
Actually, we wrote a script to do the test. What the NFS test does is:
1) mount
2) make some dirs under the mounted dir
3) cp some files into those dirs
4) umount
5) mount again
6) cp those files back and cmp the original files with the copied-back
ones, to check whether any errors happened during file copying.
I saw the "nfs: server xxxx not responding, still trying" error message between
steps 5) and 6).
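The copy-and-compare part of the test described above can be sketched in Python as follows. This is only an illustration, not the HTS script: the function name `copy_and_verify` and its parameters are hypothetical, and the umount/mount cycle between the two copies is left out because it needs root privileges and a real NFS server.

```python
import filecmp
import shutil
from pathlib import Path


def copy_and_verify(source_files, mounted_dir, scratch_dir):
    """Copy files into mounted_dir, copy them back, and compare byte by byte.

    Returns the names of files whose round-trip copy differs from the original.
    mounted_dir stands in for the NFS mount point (an assumption of this sketch).
    """
    mounted_dir = Path(mounted_dir)
    scratch_dir = Path(scratch_dir)
    mounted_dir.mkdir(parents=True, exist_ok=True)
    scratch_dir.mkdir(parents=True, exist_ok=True)

    mismatches = []
    for src in map(Path, source_files):
        remote = mounted_dir / src.name
        shutil.copyfile(src, remote)      # cp the file into the mounted dir
        # The real test unmounts and mounts again at this point.
        back = scratch_dir / src.name
        shutil.copyfile(remote, back)     # cp the file back
        # shallow=False forces a content comparison, like cmp
        if not filecmp.cmp(src, back, shallow=False):
            mismatches.append(src.name)
    return mismatches
```

An empty return value means every file survived the round trip unchanged.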
Interestingly, I found the following :
o I saw this error message only when mounting with the "udp" option and only in
the FV guest;
o sometimes the test just passes without printing any error; this
direct-pass ratio is about 30%;
o sometimes the test prints this error but passes anyway after a while
(the waiting time varies, maybe short, maybe long); this happens in about 69%
of runs;
o sometimes the test seems never to pass (about 1% of runs); it keeps
printing this error interleaved with "nfs: server x.x.x.x OK" messages, like
the following:
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 not responding, still trying
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 OK
nfs: server 10.66.0.87 not responding, still trying
....
I checked with tcpdump and saw nothing special, though I'm not sure:
-----------------------------------
09:32:01.242319 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.242322 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.639221 arp who-has dhcp-0-117.pek.redhat.com tell dhcp-0-087.pek.redhat.com
09:32:01.639237 arp reply dhcp-0-117.pek.redhat.com is-at 00:16:3e:00:3e:10 (oui Unknown)
09:32:01.639336 arp who-has dhcp-0-117.pek.redhat.com tell dhcp-0-087.pek.redhat.com
09:32:01.639339 arp reply dhcp-0-117.pek.redhat.com is-at 00:16:3e:00:3e:10 (oui Unknown)
09:32:01.639439 IP dhcp-0-087.pek.redhat.com.nfs > dhcp-0-117.pek.redhat.com.2827293577: reply ok 1472 read
09:32:01.639447 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.639449 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.redhat.com: udp
09:32:01.639455 IP dhcp-0-087.pek.redhat.com > dhcp-0-117.pek.
-----------------------------------
Is it possible that FS-Cache is involved? I ask because I saw the "FS-Cache: Loaded" and
"FS-Cache: netfs 'nfs' registered for caching" messages when I first mounted (in
step 1).
Thanks.
------- Comment #4 From YangKun 2007-01-16 06:57:13 EDT -------
"dhcp-0-087.pek.redhat.com" is the NFS server machine(10.66.0.87).
"dhcp-0-117.pek.redhat.com" is the FV guest(10.66.0.117).
------- Comment #5 From Stephen Tweedie 2007-01-16 07:26:35 EDT -------
Can you please try with a smaller nfs blocksize in the guest?
The default NIC emulated in FV mode is an RTL-8139, which has only 64k of ring
buffer. I suspect that with 32k blocksize it's simply too much for that
hardware to keep up with the large NFS packets involved. With a tcp mount that
doesn't matter, because TCP will work out the best window size automatically;
but with udp, lose one fragment of a 32k packet and there's no recovery other
than a complete NFS retry. Get more than one packet at a time doing that same
recovery and you'll repeatedly overflow the hardware buffer.
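To make the fragmentation argument above concrete, here is a rough Python sketch of how many IP fragments one NFS-over-UDP reply needs and how likely the whole datagram is to arrive intact. The 1500-byte MTU, the 128-byte RPC/NFS header overhead, and the 1% per-fragment loss rate are assumptions chosen for illustration; none of them were measured in this thread.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options
UDP_HEADER = 8   # bytes


def fragments_for(rsize, mtu=1500, rpc_overhead=128):
    """Number of IP fragments needed for one UDP datagram carrying rsize bytes.

    rpc_overhead is a guessed size for the RPC/NFS reply headers (assumption).
    """
    # Fragment payloads must be a multiple of 8 bytes (1480 for a 1500-byte MTU).
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    datagram = UDP_HEADER + rpc_overhead + rsize
    return math.ceil(datagram / per_fragment)


def datagram_survival(fragment_loss, fragments):
    """Probability that every fragment arrives, assuming independent losses."""
    return (1.0 - fragment_loss) ** fragments


for rsize in (8192, 12288, 16384, 32768):
    n = fragments_for(rsize)
    p = datagram_survival(0.01, n)  # 1% loss per fragment is an assumed figure
    print(f"rsize={rsize:5d}: {n:2d} fragments, survival at 1% loss = {p:.3f}")
```

Under these assumptions the first fragment carries 8 bytes of UDP header plus 1472 bytes of data, which would be consistent with the "reply ok 1472 read" line in the tcpdump output. Because UDP has no per-fragment retransmission, one lost fragment forces the whole block to be requested again, so larger blocks are hit harder.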
------- Comment #6 From YangKun 2007-01-17 00:35:01 EDT -------
I tried 3 smaller blocksizes (8k, 12k, 16k), running each blocksize 10 times.
The results are as follows:
Blocksize 8k ( rsize=wsize=8192 )
---------------------------
Run# Result
---------------------------
1 direct PASS
2 direct PASS
3 direct PASS
4 direct PASS
5 direct PASS
6 direct PASS
7 direct PASS
8 direct PASS
9 direct PASS
10 direct PASS
Blocksize 12k ( rsize=wsize=12288 )
---------------------------
Run# Result
---------------------------
1 direct PASS
2 direct PASS
3 direct PASS
4 direct PASS
5 direct PASS
6 direct PASS
7 direct PASS
8 direct PASS
9 direct PASS
10 direct PASS
Blocksize 16k ( rsize=wsize=16384 )
---------------------------
Run# Result
---------------------------
1 prompt out Error message, can PASS after a short time wait
2 prompt out Error message, can PASS after a long time wait
3 direct PASS
4 prompt out Error message, can PASS after a short time wait
5 direct PASS
6 prompt out Error message, can PASS after a short time wait
7 prompt out Error message, can PASS after a short time wait
8 direct PASS
9 prompt out Error message, can PASS after a short time wait
10 prompt out Error message, can PASS after a short time wait
Well, I think it is the "64k ring buffer" that causes this issue. Is there a way to
increase this ring buffer in the FV guest?
Is there a maximum "UDP-mount-safe" block size for the RTL-8139?
Thanks very much.
------- Comment #7 From Stephen Tweedie 2007-01-17 09:18:12 EDT -------
What's causing the issue is that the test is non-portable and is requesting
something that is not going to work on all hardware. I expect it will fail on
real rtl-8139 hardware just as badly as on the emulated virtual NIC. Modern
NICs are unlikely to have the same limits.
The 64k ring buffer is hard-coded into the guest IO model.
As for maximum safe blocksize, I'm not sure: that's really an NFS question. I
think by default we enable 5 nfs rpc threads in parallel; if that translates to
5 blocks max outstanding at once, then that would be consistent with 12k passing
and 16k failing. But you'd have to check with an NFS expert.
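The arithmetic behind that guess can be checked with a few lines of Python. This sketch simply takes the comment's own numbers at face value: a 64k ring buffer and 5 blocks outstanding at once. Whether 5 RPC threads really map to 5 blocks in flight is, as the comment says, an unverified assumption.

```python
RING_BUFFER_BYTES = 64 * 1024   # RTL-8139 ring buffer size quoted in the thread
RPC_SLOTS = 5                   # assumed number of blocks outstanding at once


def worst_case_in_flight(block_size, slots=RPC_SLOTS):
    """Bytes that could arrive back to back if every slot has a block pending."""
    return block_size * slots


def fits_in_ring(block_size, slots=RPC_SLOTS, ring_bytes=RING_BUFFER_BYTES):
    """True if the worst-case burst fits in the ring buffer (headers ignored)."""
    return worst_case_in_flight(block_size, slots) <= ring_bytes


for kib in (8, 12, 16, 32):
    size = kib * 1024
    verdict = "fits" if fits_in_ring(size) else "overflows"
    print(f"{kib:2d}k blocks: {worst_case_in_flight(size) // 1024:3d}k in flight, {verdict}")
```

With these numbers 12k blocks give 60k in flight, just under the 64k ring, while 16k blocks give 80k, which matches the pass/fail pattern in the test results.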
------- Comment #8 From YangKun 2007-01-17 18:53:29 EDT -------
OK, thanks. We have decided to change our test to mount with 12k. I'll close
this bug then.
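For reference, the mount options from the original report with the block size lowered to 12k can be assembled as below. This is a small sketch that only builds the command line and does not run it; the helper names, the export path /export, and the mount point /mnt/nfs are hypothetical placeholders.

```python
def nfs_mount_options(block_size, transport="udp"):
    """Build the option string used in the report, with a chosen block size."""
    return f"rw,intr,rsize={block_size},wsize={block_size},{transport}"


def mount_command(server, export, mount_point, block_size=12288):
    """Return the mount command as an argument list (not executed here)."""
    return [
        "mount", "-t", "nfs",
        "-o", nfs_mount_options(block_size),
        f"{server}:{export}",
        mount_point,
    ]


# /export and /mnt/nfs are placeholder paths for illustration only.
print(" ".join(mount_command("10.66.0.87", "/export", "/mnt/nfs")))
```

Passing transport="tcp" instead would correspond to the other workaround mentioned in the thread, dropping the "udp" option.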
Thanks very much for your help :-)