近日,连续收到ASM磁盘dismount,并且是错误“Waited 15 secs for write IO to PST”的问题,这是ASM特有的心跳超时检测,ASM instance会定期检查每个asm disk是不是能正常反

近日,连续收到ASM磁盘dismount,并且是错误“Waited 15 secs for write IO to PST”的问题,这是ASM特有的心跳超时检测,ASM instance会定期检查每个asm disk是不是能正常反馈。所以决定针对这个问题,做个小总结。

在文档ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1) 中有下面一段描述:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generally this kind messages comes in ASM alertlog file on below situations,

Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,

thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.

By the way the heart beat delays are sort of ignored for external redundancy diskgroup.

ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,

but the heart beat delays do not dismount external redundancy diskgroup directly.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~

上面描述,可以理解为下面几点:

1. ASM实例会定期检查每一个磁盘组的磁盘状态,是否通信正常;

2. 这个检查,只是针对normal和high冗余模式,对于external冗余,不会遇到这个错误;

3. 默认情况是15s超时,也就是说15s磁盘组还是没有对ASM实例响应的话,就会dismount磁盘组。


而遇到这个问题的客户,都是使用光纤网络存储,在存储网络出现问题的情况下,会引发这个错误的出现。也就是说,在ASM定期发出检查信息的时候,如果磁盘没有在15s内反馈的话,我就认为磁盘已经无法访问。

针对这个错误,我尝试在测试环境测试,由于测试环境是VMware的虚拟机,在物理层面删除磁盘,并不会引发这个问题。原因是在同一个主机上的磁盘被异常删除后,ASM的读取操作会立即返回系统层面的IO错误,而不需要去等待错误“Waited 15 secs for write IO to PST”的超时。

所以,我总结这个错误,只会出现在共享的ASM磁盘,不在物理主机的本地,而是在存储网络中,ASM发出去的检测信息,不能及时被反馈,才会出现这个错误。这时,可能是存储主机,存储网络,甚至存储磁盘的问题,anyway,我ASM没有收到我需要的确认信息,我认为你有问题,如果有问题的磁盘数够多,达到影响数据完整性了,那我ASM就要dismount这个磁盘组了。


这里对于“Waited 15 secs for write IO to PST”错误信息,根据文档1581684.1介绍,是在11.2.0.3.0之后出现的。同时在文档中有描述,如何手动修改这个检测超时的时间,可以通过参数_asm_hbeatiowait来控制:

alter system set "_asm_hbeatiowait"= scope=spfile sid='*';


为了确认,这个参数是在11.2.0.3之后出现的,我将全部数据库版本都查询一遍,具体可以参考下面信息:

======================10.2=====================
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Prod
PL/SQL Release 10.2.0.5.0 - Production
CORE 10.2.0.5.0 Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production
 
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%undo%' order by ksppinm;
hidden parameter value
-------------------------------------------------------------------------------- ----------
_asm_acd_chunks 1
_asm_allow_only_raw_disks TRUE
_asm_allow_resilver_corruption FALSE
_asm_ausize 1048576
_asm_blksize 4096
_asm_direct_con_expire_time 120
_asm_disk_repair_time 14400
_asm_droptimeout 60
_asm_emulmax 10000
_asm_emultimeout 0
_asm_fob_tac_frequency 3
hidden parameter value
-------------------------------------------------------------------------------- ----------
_asm_instlock_quota 0
_asm_kfdpevent 0
_asm_libraries ufs
_asm_maxio 1048576
_asm_skip_resize_check FALSE
_asm_stripesize 131072
_asm_stripewidth 8
_asm_wait_time 18
_asmlib_test 0
_asmsid asm
21 rows selected.

======================11.2.0.1=====================
sqlplus / as sysdba
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatwaitquantum 2

======================11.2.0.2=====================
 $ sqlplus / as sysdba
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining
and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatwaitquantum 2

在11.2.0.3.0之后才有这个参数出现,也就是说ASM实例对磁盘超时的检测是在11.2.0.3之后才出现的
======================11.2.0.3=====================
sys@R11203> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%undo%' order by ksppinm;
hidden parameter value
hidden parameter value
-------------------------------------------------- --------------------
_asm_hbeatiowait 15
_asm_hbeatwaitquantum 2

======================11.2.0.4=====================
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - Production
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%undo%' order by ksppinm;
hidden parameter value
-------------------------------------------------------------------------------- ---------
_asm_hbeatiowait 15 <<<<<<<<<<<<<<<<<<<<
_asm_hbeatwaitquantum 2
 
 ======================12.1.0.1=====================
 $ sqlplus / as sysdba
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatiowait 15
_asm_hbeatwaitquantum 2

在12.1.0.2之后,这个参数默认值被调整为120s
 ======================12.1.0.2=====================
 $ sqlplus / as sysdba
 
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;
hidden parameter value
--------------------------------------------------------------------------------
_asm_hbeatiowait 120
_asm_hbeatwaitquantum 2
登录后复制



希望总结的这个知识点,对你有帮助。日常中,经常感叹,这个问题很简单,但是不sure,测试过后,记录下来,以备查询。

09-10 07:46