NBU version: 7.5
Media server: Windows Server 2008 R2
Backup content: SQL Server data
Tape library: IBM 3584
The Activity Monitor shows the following:
Info nbjm(pid=7004) started backup (backupid=xxxx_1379096131) job for client xxxx, policy centralDWH, schedule full on storage unit xxxx-hcart2-robot-tld-0
9/14/2013 2:15:33 AM - started process bpbrm (14008)
9/14/2013 2:15:34 AM - connecting
9/14/2013 2:15:34 AM - connected; connect time: 00:00:00
9/14/2013 2:20:15 AM - Error bpbrm(pid=14008) from client xxxx: ERR - command failed: none of the requested files were backed up (2)
9/14/2013 2:20:15 AM - Error bpbrm(pid=14008) from client xxxx: ERR - bphdb exit status = 2: none of the requested files were backed up
9/14/2013 2:20:41 AM - Info dbclient(pid=18520) ERR - Error in GetConfiguration: 0x80770003.
9/14/2013 2:20:41 AM - Info dbclient(pid=18520) CONTINUATION: - The api was waiting and the timeout interval had elapsed.
9/14/2013 2:20:46 AM - Info dbclient(pid=18520) ERR - Error in VDS->Close: 0x80770004.
9/14/2013 2:20:47 AM - Info dbclient(pid=18520) CONTINUATION: - An abort request is preventing anything except termination actions.
9/14/2013 2:20:47 AM - Info dbclient(pid=18520) INF - OPERATION #1 of batch C:\Program Files\Veritas\NetBackup\DbExt\MsSql\centralDWH.bch FAILED with STATUS 1 (0 is normal). Elapsed time = 310(310) seconds.
9/14/2013 2:20:49 AM - Info dbclient(pid=18520) INF - Results of executing <C:\Program Files\Veritas\NetBackup\DbExt\MsSql\centralDWH.bch>:
9/14/2013 2:20:49 AM - Info dbclient(pid=18520) <0> operations succeeded. <1> operations failed.
9/14/2013 2:20:49 AM - Info dbclient(pid=18520) INF - The following object(s) were not backed up successfully.
9/14/2013 2:20:49 AM - Info dbclient(pid=18520) INF - CentralDWH
SQL Server log for the same period:
Date | Source | Severity | Message
09/14/2013 02:20:15 | Backup | Unknown | BACKUP failed to complete the command BACKUP DATABASE CentralDWH. Check the backup application log for detailed messages.
09/14/2013 02:20:15 | Backup | Unknown | Error: 3041
09/14/2013 02:04:57 | Backup | Unknown | BACKUP failed to complete the command BACKUP DATABASE CentralDWH. Check the backup application log for detailed messages.
09/14/2013 02:04:57 | Backup | Unknown | Error: 3041
Analysis:
First, these entries in the log
Error bpbrm(pid=14008) from client xxxx: ERR - command failed: none of the requested files were backed up (2)
Error bpbrm(pid=14008) from client xxxx: ERR - bphdb exit status = 2: none of the requested files were backed up
show that the bch script failed to run and did not find any database files to back up.
Next, this part
9/14/2013 2:20:41 AM - Info dbclient(pid=18520) ERR - Error in GetConfiguration: 0x80770003.
9/14/2013 2:20:41 AM - Info dbclient(pid=18520) CONTINUATION: - The api was waiting and the timeout interval had elapsed.
9/14/2013 2:20:46 AM - Info dbclient(pid=18520) ERR - Error in VDS->Close: 0x80770004.
9/14/2013 2:20:47 AM - Info dbclient(pid=18520) CONTINUATION: - An abort request is preventing anything except termination actions.
9/14/2013 2:20:47 AM - Info dbclient(pid=18520) INF - OPERATION #1 of batch C:\Program Files\Veritas\NetBackup\DbExt\MsSql\centralDWH.bch FAILED with STATUS 1 (0 is normal). Elapsed time = 310(310) seconds.
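shows that NBU's connection to VDI timed out. The VDI timeout normally defaults to 300 seconds. Because the request for the database files got no response, the script timed out after 300 seconds and VDI reported the error. At the same time, the Windows Server event log recorded an error with the same information: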
SQLVDI: Loc=SignalAbort. Desc=Client initiates abort
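The reasoning above (a VDI error code plus an elapsed time just over 300 seconds points to a timeout) can be checked mechanically. The following is a minimal sketch, not a NetBackup tool: the function name `summarize` and the regular expressions are illustrative assumptions, and the sample text is copied from the Activity Monitor output quoted earlier.

```python
import re

# Default VDI connection timeout, as reported by dbclient
# ("Connection timeout is <300> seconds").
VDI_TIMEOUT_SECONDS = 300

# Matches e.g. "ERR - Error in GetConfiguration: 0x80770003."
HEX_ERROR = re.compile(r"ERR - Error in ([\w>-]+): (0x[0-9A-Fa-f]{8})")
# Matches e.g. "Elapsed time = 310(310) seconds."
ELAPSED = re.compile(r"Elapsed time = (\d+)\(\d+\) seconds")


def summarize(log_text, timeout=VDI_TIMEOUT_SECONDS):
    """Collect VDI error codes and compare the elapsed time with the timeout."""
    errors = HEX_ERROR.findall(log_text)
    match = ELAPSED.search(log_text)
    elapsed = int(match.group(1)) if match else None
    timed_out = elapsed is not None and elapsed >= timeout
    return {"errors": errors, "elapsed": elapsed, "timed_out": timed_out}


# Raw string so the Windows path backslashes are kept literally.
sample = r"""
9/14/2013 2:20:41 AM - Info dbclient(pid=18520) ERR - Error in GetConfiguration: 0x80770003.
9/14/2013 2:20:46 AM - Info dbclient(pid=18520) ERR - Error in VDS->Close: 0x80770004.
9/14/2013 2:20:47 AM - Info dbclient(pid=18520) INF - OPERATION #1 of batch C:\Program Files\Veritas\NetBackup\DbExt\MsSql\centralDWH.bch FAILED with STATUS 1 (0 is normal). Elapsed time = 310(310) seconds.
"""

print(summarize(sample))
```

An elapsed time at or just above the configured timeout, together with 0x80770003, is consistent with the timeout explanation; a much shorter elapsed time would suggest looking elsewhere.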
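Since the script had not run, I checked the bch script and found nothing wrong with it. I then reran the policy manually and NBU failed again, although this time the script was not the problem: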
INF - Created VDI object for SQL Server instance <xxxx>. Connection timeout is <300> seconds.
ERR - Error in GetConfiguration: 0x80770003.
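After the VDI object was created, 300 seconds passed and Error in GetConfiguration 0x80770003 appeared again. So the problem seems to lie with the VDI object, which the NBU client presumably creates by calling SQLVDI.DLL.
Next, the dbclient log. This log is only written if a dbclient folder has been created under NetBackup\logs: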
<2> logconnections: BPRD CONNECT FROM media-ip.62961 TO master-ip.1556 fd = 1268
<4> DBConnect: INF - Logging into SQL Server with DSN <NBMSSQL_34284_37776_1>, SQL userid <sa> handle <0x0080d1b0>.
<4> CDBbackrec::InitDeviceSet(): INF - Created VDI object for SQL Server instance <instance>. Connection timeout is <300> seconds. --- the VDI object is created here
<2> vnet_pbxConnect: pbxConnectEx Succeeded
<2> logconnections: BPRD CONNECT FROM media-ip.62962 TO master-ip.1556 fd = 1396
<2> vnet_pbxConnect: pbxConnectEx Succeeded
<2> logconnections: BPRD CONNECT FROM media-ip.62963 TO master-ip.1556 fd = 952
<4> CGlobalInformation::VCSVirtualNameList: INF - Veritas Cluster Server is not installed. --- Veritas Cluster Server is not installed
<1> CGlobalInformation::VCSVirtualNameList: CONTINUATION: - The system cannot find the path specified. --- path not found
<4> getServerName: Read server name from nb_master_config: xxxxx
<4> CDBIniParms::CDBIniParms: INF - NT User is Administrator
<4> DBConnect: INF - Logging into SQL Server with DSN <NBMSSQL_temp_23736_9600_1>, SQL userid <sa> handle <0x0065acf0>. --- sa logs in to SQL Server with handle 0x0065acf0
<4> DBConnect: INF - Logging into SQL Server with DSN <NBMSSQL_temp_23736_9600_1>, SQL userid <sa> handle <0x0065c260>. --- sa logs in to SQL Server with handle 0x0065c260
<4> CGlobalInformation::CreateDSN: INF - A successful connection to SQL Server <xxxx\instance> has been made using Trusted security with DSN <NBMSSQL_temp_23736_9600_1> using standard userid <sa>.
<4> DBDisconnect: INF - Logging out of SQL Server with handle <0x0065c260> --- handle 0x0065c260 logs out
<4> DBConnect: INF - Logging into SQL Server with DSN <NBMSSQL_temp_23736_9600_1>, SQL userid <sa> handle <0x0065c690>. --- another sa login
<4> DBDisconnect: INF - Logging out of SQL Server with handle <0x0065c690> --- followed immediately by a logout
<4> SQLEnumerator: INF - Enumerated SQL hosts: SERVER:Server={BJDSQLCLUSTER\instance};UID:Login ID=?;PWD:Password=?;Trusted_Connection:Use Integrated Security=?;*APP:AppName=?;*WSID:WorkStation ID=?
01:17:34.156 [23736.9600] <4> SQLEnumerator: INF - Could not enumerate Local SQL host/instance using SQLBrowseConnectW --- the local SQL host and instance could not be enumerated with SQLBrowseConnect, which is used to discover and enumerate the values needed to connect to a database (host name, instance name, and so on)
<4> CGlobalInformation::SQLEnumerator: INF - Hosts and instances retrieved from host list string
<4> CGlobalInformation::SQLEnumerator: INF - host: mediaserver
<4> CGlobalInformation::SQLEnumerator: INF - instance: xxxx
<4> CGlobalInformation::SQLEnumerator: INF - host: BJDSQLCLUSTER
<4> CGlobalInformation::SQLEnumerator: INF - instance: xxxxx
<4> CGlobalInformation::CreateDSN: INF - A successful connection to SQL Server <xxxx\instance> has been made using Trusted security with DSN <NBMSSQL_23736_9600_2> using standard userid <sa>. --- the host name and instance were found in the host list and the connection succeeded, so the NBU client did reach the database instance; next, why the backup still failed
------------------------------ divider ------------------------------
<4> StartupProcess: INF - Starting: <C:\Program Files\Veritas\NetBackup\bin\admincmd\bppllist.exe -byclient mediaserver>
(More login messages and successful database connections follow here; omitted.)
<4> getServerName: Read server name from nb_master_config: masterserver
<2> vnet_pbxConnect: pbxConnectEx Succeeded
<2> logconnections: BPRD CONNECT FROM media-ip.62996 TO master-ip.1556 fd = 960 --- bprd on the media server connects to the master
<16> writeToServer: ERR - send() to server on socket failed: --- sending on the socket failed
<16> dbc_RemoteWriteFile: ERR - could not write progress status message to the NAME socket
<16> CDBbackrec::InitDeviceSet_Part2(): ERR - Error in GetConfiguration: 0x80770003. --- the same error as in the Activity Monitor
01:22:09.551 <1> CDBbackrec::InitDeviceSet_Part2(): CONTINUATION: - The api was waiting and the timeout interval had elapsed.
<2> vnet_pbxConnect: pbxConnectEx Succeeded
<2> logconnections: BPRD CONNECT FROM media-ip.63001 TO master-ip.1556 fd = 1400
01:22:09.703 <4> KillAllThreads: INF - Killing group #0
01:22:09.704 [34284.33648] <4> KillAllThreads: INF - Killing group #0
01:22:09.704 <4> KillAllThreads: INF - Issuing SignalAbort to MS SQL Server VDI --- the message seen in the Windows event log
01:22:09.704 [34284.33416] <4> KillAllThreads: INF - Killing group #0
01:22:09.704 [34284.32560] <4> KillAllThreads: INF - Killing group #0
01:22:12.709 <2> vnet_pbxConnect: pbxConnectEx Succeeded
01:22:12.710 <2> logconnections: BPRD CONNECT FROM media-ip.63002 TO master-ip.1556 fd = 1276
01:22:14.546 <16> writeToServer: ERR - send() to server on socket failed:
<16> dbc_RemoteWriteFile: ERR - could not write progress status message to the NAME socket
<16> CDBbackrec::FreeDeviceSet(): ERR - Error in VDS->Close: 0x80770004.
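So the cause appears to be that bprd could not write the progress status to the NAME socket. Communication between the media server and the master server failed, which in turn led to the VDI timeout.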
http://www.symantec.com/business/support/index?page=content&id=TECH182435
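This article says that in version 7.1 the message dbc_RemoteWriteFile - RemoteWriteFile status = 0 can be ignored when the status is 0, and that it will be fixed in the next release. I am on 7.5, though, so this does not seem to be my problem.
http://www.symantec.com/docs/TECH146444 This article mentions that a SQL Server patch updated SQLVDI.DLL and caused backups to fail. That is not my problem either.
http://www.symantec.com/connect/forums/having-problem-mssql-agent-backup This thread suggests two methods:
1. Kill the dbbackex.exe process. 2. Increase the client connect time, that is, Client Read Timeout. You can also add VDITIMEOUTSECONDS XXXX to the bch script (see the NetBackup for Microsoft SQL Server Administrator's Guide for this parameter) to set the timeout for NBU's connection to VDI.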
Note:
Before running another backup, ensure the following log folders exist on media server: bptm and bpbrm. If backup still fails after increasing media server timeouts, please check a new set of logs: dbclient on SQL client, bptm and bpbrm on media server.
Solution
After I added VDITIMEOUTSECONDS 1800 to the script, a manual backup succeeded.
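As an illustration of the change, here is a small sketch that adds or updates the VDITIMEOUTSECONDS keyword in batch-file text. It rests on assumptions: the contents of centralDWH.bch are not shown in this post, so the sample batch below is hypothetical, and the sketch assumes the keyword belongs inside each operation, before its ENDOPER line. The helper name `set_vdi_timeout` is made up for this example. Check the NetBackup for Microsoft SQL Server Administrator's Guide before editing a real batch file.

```python
def set_vdi_timeout(batch_text, seconds):
    """Return batch_text with VDITIMEOUTSECONDS set to `seconds` in each operation.

    Assumption: one keyword per line, each operation closed by an ENDOPER line.
    """
    out = []
    seen = False  # True once the current operation has the keyword
    for line in batch_text.splitlines():
        words = line.split()
        keyword = words[0].upper() if words else ""
        if keyword == "VDITIMEOUTSECONDS":
            # Replace an existing value rather than adding a second line.
            out.append(f"VDITIMEOUTSECONDS {seconds}")
            seen = True
            continue
        if keyword == "ENDOPER":
            if not seen:
                out.append(f"VDITIMEOUTSECONDS {seconds}")
            seen = False  # reset for the next operation in the file
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical batch contents; placeholders match the redacted names in this post.
sample_batch = (
    'OPERATION BACKUP\n'
    'DATABASE "CentralDWH"\n'
    'SQLHOST "xxxx"\n'
    'NBSERVER "xxxx"\n'
    'ENDOPER TRUE\n'
)

print(set_vdi_timeout(sample_batch, 1800))
```

Running the helper twice gives the same text, so it is safe to apply to a file that already carries the keyword.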
Remarks:
For error codes 0x80770003 and 0x80770004, http://www.sqlbackuprestore.com/vdierrors.htm gives detailed explanations of the VDI error messages:
0x80770003 (-2139684861) | The api was waiting and the timeout interval had elapsed. Similar to the above example, this can happen when the backup application has waited a set amount of time waiting for SQL Server to respond to its backup request, but did not receive any response.
0x80770004 (-2139684860) | An abort request is preventing anything except termination actions. An example of this error is when the backup software has encountered a critical error, and has issued an abort request to the VDI.
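Each code is listed in two forms because the hexadecimal value and the negative decimal value are the same 32-bit number read as unsigned and as signed. A short sketch of the conversion, useful when a log shows only one form (the function name `hresult_to_signed` is illustrative):

```python
def hresult_to_signed(code):
    """Interpret a 32-bit error code as a signed (two's complement) integer."""
    code &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the top bit is set, the signed value is the unsigned value minus 2**32.
    return code - 0x100000000 if code & 0x80000000 else code


for code in (0x80770003, 0x80770004):
    print(hex(code), hresult_to_signed(code))
```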
http://www.symantec.com/business/support/index?page=content&id=TECH38369
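Postscript
Backup flow: NBU policy -> NBU backup script -> media server VDI -> media server DB process
The media server runs the local script and communicates through VDI with a group of backup processes inside SQL Server. Each database being backed up has three such processes. When the backup finishes, the processes should be destroyed and the media server notified through VDI, after which the media server completes the backup.
If the SQL Server backup processes cannot finish within N seconds (N being the timeout set in the script), they cannot notify the media server through VDI, and NBU treats the backup as failed. If those processes still exist when a second backup is attempted, that backup fails as well.
A slow backup may be caused by poor performance of the SQL Server host, which makes these processes run slowly.
Thoughts
The performance of the SQL Server host should be improved.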