Oracle Multitenant: Impact of a Pluggable Database (PDB) Failure

Introduction

Today I will discuss the impact of a pluggable database failure (media failure in particular) on the other pluggable databases and on the parent container database.

In this article series, I attempt to answer the concerns and questions raised by my dear friend Nassyam Basha in his post PDB is Painful to CDB any cost – 12c ?

In this first section, I will simply repeat the demonstration that Nassyam Basha showed in that article.

That demonstration showed that when a datafile belonging to a pluggable database (PDB) goes missing or becomes corrupt, the CKPT background process (which updates the controlfile and datafile headers and signals DBWR to flush dirty buffers to disk) terminates the container database (CDB$ROOT) instance, which in turn brings down every pluggable database attached to that container.

This is a serious concern for the multitenant architecture, and it raises an obvious question about the capability of pluggable databases. It is as if your business ran on a cloud infrastructure in which the failure of a single component brought down the entire infrastructure. That is clearly undesirable, and such a cloud would hardly be a recommended place to deploy.

Demonstration (based on the referenced article)

Let's take a look at a simple simulation of the problem.

I have a container database PRODCDB, which is hosting 4 (four) pluggable databases.

sys@PRODCDB> select name,open_mode,cdb from v$database;

NAME      OPEN_MODE            CDB
--------- -------------------- ---
PRODCDB   READ WRITE           YES

sys@PRODCDB> select name,dbid,open_mode from v$pdbs;

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PDB$SEED                       4103948816 READ ONLY
PRODPDB1                       4276769587 READ WRITE
PRODPDB2                       4149756065 READ WRITE
PRODPDB3                       4199535790 READ WRITE
PRODPDB4                       4072086604 READ WRITE

Let's simulate a failure by deleting one of the datafiles belonging to a particular PDB. I am arbitrarily choosing the pluggable database PRODPDB4.

sys@PRODCDB> show con_name

CON_NAME
------------------------------
PRODPDB4

sys@PRODCDB>  select tablespace_name,file_id,file_name from dba_data_files order by 1,2;

TABLESPACE_NAME    FILE_ID FILE_NAME
--------------- ---------- --------------------------------------------------------------------------------
SYSAUX                  33 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/sysaux.300.864131435
SYSTEM                  32 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/system.299.861526861
USERS                   43 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/users.375.864212453
USERS                   45 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf

Let's introduce an artificial datafile loss.

11:15:47 sys@PRODCDB> !ls -lrt /app/oracle/data/prodcdb/prodpdb4_users_2.dbf
-rw-r----- 1 oracle oinstall 10493952 Nov 21 11:12 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf

11:16:06 sys@PRODCDB> !rm /app/oracle/data/prodcdb/prodpdb4_users_2.dbf

11:16:23 sys@PRODCDB> !ls -lrt /app/oracle/data/prodcdb/prodpdb4_users_2.dbf
ls: /app/oracle/data/prodcdb/prodpdb4_users_2.dbf: No such file or directory

Now, let's force a checkpoint so that CKPT has to update the datafile headers.

11:16:40 sys@PRODCDB> alter system checkpoint;
ERROR:
ORA-03114: not connected to ORACLE


alter system checkpoint
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 7177
Session ID: 20 Serial number: 9

My container database PRODCDB has terminated. Here are the errors that were logged in the alert log.

Fri Nov 21 11:16:49 2014
Errors in file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_ckpt_5848.trc:
ORA-63999: data file suffered media failure
ORA-01116: error in opening database file 45
ORA-01110: data file 45: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Fri Nov 21 11:16:49 2014
Errors in file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_ckpt_5848.trc:
ORA-63999: data file suffered media failure
ORA-01116: error in opening database file 45
ORA-01110: data file 45: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
USER (ospid: 5848): terminating the instance due to error 63999
Fri Nov 21 11:16:50 2014
System state dump requested by (instance=1, osid=5848 (CKPT)), summary=[abnormal instance termination].
System State dumped to trace file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_diag_5828.trc
Dumping diagnostic data in directory=[cdmp_20141121111650], requested by (instance=1, osid=5848 (CKPT)), summary=[abnormal instance termination].

Now, trying to start the container database (CDB) PRODCDB results in the following errors.

idle> startup
ORACLE instance started.

Total System Global Area  521936896 bytes
Fixed Size                  2290264 bytes
Variable Size             264244648 bytes
Database Buffers          251658240 bytes
Redo Buffers                3743744 bytes
Database mounted.
ORA-01157: cannot identify/lock data file 45 - see DBWR trace file
ORA-01110: data file 45: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'

The container database (CDB) was able to MOUNT. However, while opening the database, it (DBWR in particular) could not find the datafile (FILE# 45 in our case) belonging to the pluggable database PRODPDB4.

The quickest way to bring back the container along with its pluggable databases is to take the lost datafile (FILE# 45 in our case) OFFLINE immediately and then open the container and the pluggable databases. We can recover the lost file later from an available backup.

First, identify the pluggable database to which the lost datafile belongs.

idle>  select name,dbid,open_mode from v$pdbs  where con_id=(select CON_ID from v$datafile where file#=45);

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PRODPDB4                       4072086604 MOUNTED

Now, take the lost datafile OFFLINE by logging in to the respective pluggable database.

idle> alter session set container=PRODPDB4;

Session altered.

idle> alter database datafile 45 offline;

Database altered.

Now, log in to the container database (CDB$ROOT) and open it (it is already in MOUNT state from the last STARTUP attempt).

idle> show con_name

CON_NAME
------------------------------
CDB$ROOT

idle> alter database open;

Database altered.

Now, we can open the attached pluggable databases as follows.

idle> alter pluggable database all open;

Pluggable database altered.


sys@PRODCDB> select name,dbid,open_mode from v$pdbs;

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PDB$SEED                       4103948816 READ ONLY
PRODPDB1                       4276769587 READ WRITE
PRODPDB2                       4149756065 READ WRITE
PRODPDB3                       4199535790 READ WRITE
PRODPDB4                       4072086604 READ WRITE

So, our container as well as the pluggable databases are back ONLINE. However, we have yet to restore and recover the missing file belonging to the pluggable database PRODPDB4; until then, the file stays OFFLINE and its data is unavailable. We can use RMAN to restore and recover the datafile. I am skipping that part here.
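For completeness, the skipped recovery could look roughly like the sketch below. It is only an outline under stated assumptions: a usable RMAN backup of file 45 and the required archived logs exist, RMAN is connected to the root as target, and the file and PDB names are the ones from this demonstration. The RMAN commands appear as comments because they are run from the RMAN client rather than from SQL*Plus.

```sql
-- Step 1 (RMAN client, connected to the CDB root as TARGET).
-- Assumes a valid backup of datafile 45 and the needed archived logs exist.
--   RESTORE DATAFILE 45;
--   RECOVER DATAFILE 45;

-- Step 2 (SQL*Plus): bring the recovered file back online from inside
-- the PDB that owns it, mirroring how it was taken offline earlier.
ALTER SESSION SET CONTAINER = PRODPDB4;
ALTER DATABASE DATAFILE 45 ONLINE;

-- Step 3: confirm the file status after recovery.
SELECT file#, status, name
FROM   v$datafile
WHERE  file# = 45;
```

Bringing the file online only succeeds once media recovery has been applied up to the current point, so the order of the steps matters.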

The article I referred to at the beginning asked the following questions.


1) if in case of system datafile lost of Pluggable database, what happens to CDB?
2) If i shutdown the PDB, will it impact to any other PDB part of CDB?
3) You can drop PDB anytime, then why CDB can’t stop and startup in case of system or user datafile lost of PDB?
4) If you have 10 PDB’s of one CDB, if there is lost of any single datafile of PDB(pdb1) and i have scenario to startup and shutdown my CDB and other PDBs except damaged(pdb1) why i can’t startup CDB or other PDB’s?

I will try to give quick answers to these questions.

1) Answer: It depends (the CDB may terminate). Note that I did not say “it will terminate”. Stay tuned for the explanation.
2) Answer: Not at all. Issuing SHUTDOWN inside a PDB is equivalent to closing that PDB with the ‘ALTER PLUGGABLE DATABASE CLOSE’ command, which has no impact on other PDBs.

sys@PRODCDB> show con_name

CON_NAME
------------------------------
PRODPDB4

sys@PRODCDB> shutdown
Pluggable Database closed.

sys@PRODCDB>  alter session set container=CDB$ROOT;

Session altered.

sys@PRODCDB> alter pluggable database PRODPDB3 close;

Pluggable database altered.

sys@PRODCDB> select name,dbid,open_mode from v$pdbs;

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PDB$SEED                       4103948816 READ ONLY
PRODPDB1                       4276769587 READ WRITE
PRODPDB2                       4149756065 READ WRITE
PRODPDB3                       4199535790 MOUNTED
PRODPDB4                       4072086604 MOUNTED

3) Answer: This is because there is only one CONTROLFILE, shared by the CDB and all of its PDBs. For the CDB to open, DBWR must be able to identify every ONLINE datafile listed in that controlfile.
4) Answer: As the demonstration above shows, we can still open the CDB and all of its PDBs even when a datafile from one PDB is lost, once that datafile has been taken OFFLINE.

Explanation on the Instance Termination

I was a bit curious here. I was not ready to believe that Oracle would offer a cloud infrastructure (the multitenant pluggable database architecture) that does not behave like a cloud. There are savvy programmers at Oracle writing the core of this multitenant architecture, and they would undoubtedly have tested these scenarios before releasing it. I was fairly sure that we were missing something in our simulation that was causing the cloud (the container database) to fail.

With this thought in mind, I started researching and eventually found the cause in MOS Note Doc ID 1605755.1.

Here is the logic behind the instance termination by CKPT. Prior to Oracle version 11.2.0.2, a media failure on any datafile (other than one in the SYSTEM tablespace) resulted in that datafile being taken OFFLINE, provided the database was in ARCHIVELOG mode. In 11.2.0.2, however, Oracle introduced the fix for Bug 7691270, Crash the DB in case of write errors (rather than just offline files), after which a media failure on a datafile leads to instance termination.

This behaviour is controlled by a new hidden parameter, _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE, which takes the following values:

_DATAFILE_WRITE_ERRORS_CRASH_INSTANCE=TRUE (default): Any datafile media failure causes instance termination when a database process tries to write to that datafile.
_DATAFILE_WRITE_ERRORS_CRASH_INSTANCE=FALSE: Restores the previous (pre-11.2.0.2) behaviour and takes the datafile OFFLINE instead (provided the database is in ARCHIVELOG mode and the datafile is not from the SYSTEM tablespace).
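Since the FALSE setting only takes a datafile OFFLINE when the database is in ARCHIVELOG mode, it is worth confirming the log mode before relying on it. A minimal check, assuming a session with access to V$DATABASE, might be:

```sql
-- LOG_MODE should report ARCHIVELOG for the offline-on-error behaviour
-- described above to apply.
SELECT name, log_mode
FROM   v$database;
```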

Avoiding the Instance Termination

Now, let's test the same scenario after restoring the earlier media failure behaviour.

Check the value of the hidden parameter _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE (it is TRUE by default).

sys@PRODCDB> SELECT a.ksppinm Param , b.ksppstvl SessionVal ,
  2  c.ksppstvl InstanceVal, a.ksppdesc Descr
  3  FROM
  4  x$ksppi a , x$ksppcv b , x$ksppsv c
  5  WHERE
  6  a.indx = b.indx AND
  7  a.indx = c.indx AND
  8  a.ksppinm LIKE '/_datafile_write_errors_crash_instance%' escape '/'
  9  ;

PARAM                                    SESSIONVAL INSTANCEVAL     DESCR
---------------------------------------- ---------- --------------- --------------------------------------------------
_datafile_write_errors_crash_instance    TRUE       TRUE            datafile write errors crash instance

Now, let's set the hidden parameter _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE to FALSE so that the instance does not crash when a database process tries to write to a datafile that has suffered a media failure.

sys@PRODCDB> alter system set "_datafile_write_errors_crash_instance"=FALSE;

System altered.


sys@PRODCDB> SELECT a.ksppinm Param , b.ksppstvl SessionVal ,
  2  c.ksppstvl InstanceVal, a.ksppdesc Descr
  3  FROM
  4  x$ksppi a , x$ksppcv b , x$ksppsv c
  5  WHERE
  6  a.indx = b.indx AND
  7  a.indx = c.indx AND
  8  a.ksppinm LIKE '/_datafile_write_errors_crash_instance%' escape '/'
  9  ;

PARAM                                    SESSIONVAL INSTANCEVAL     DESCR
---------------------------------------- ---------- --------------- --------------------------------------------------
_datafile_write_errors_crash_instance    FALSE      FALSE           datafile write errors crash instance

Let's simulate a media failure again. This time, I expect the database to keep running even after the media failure.

sys@PRODCDB> select tablespace_name,file_id,file_name,online_status from dba_data_files order by 1,2;

TABLESPACE_NAME    FILE_ID FILE_NAME                                                                     ONLINE_
--------------- ---------- ----------------------------------------------------------------------------- -------
SYSAUX                  33 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/sysaux.300.864131435  ONLINE
SYSTEM                  32 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/system.299.861526861  SYSTEM
USERS                   46 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/users.375.864302607   ONLINE
USERS                   47 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf                                 ONLINE


sys@PRODCDB> !rm /app/oracle/data/prodcdb/prodpdb4_users_2.dbf

sys@PRODCDB> !ls -lrt /app/oracle/data/prodcdb/prodpdb4_users_2.dbf
ls: /app/oracle/data/prodcdb/prodpdb4_users_2.dbf: No such file or directory

Let's perform a checkpoint, since it tries to write to the datafiles and the controlfile to update their header information. If the checkpoint succeeds, we can be confident that the instance keeps running without impacting the remaining pluggable databases or their container database.

 
sys@PRODCDB> select sysdate from dual;

SYSDATE
--------------------
22-NOV-2014 13:51:08

sys@PRODCDB> alter system checkpoint;

System altered.

sys@PRODCDB> /

System altered.

As expected, the checkpoint worked fine this time with _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE set to FALSE. Let's check the status of the datafile that suffered the media failure.

sys@PRODCDB>  select tablespace_name,file_id,file_name,online_status from dba_data_files order by 1,2;

TABLESPACE_NAME    FILE_ID FILE_NAME                                                                     ONLINE_
--------------- ---------- ----------------------------------------------------------------------------- -------
SYSAUX                  33 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/sysaux.300.864131435  ONLINE
SYSTEM                  32 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/system.299.861526861  SYSTEM
USERS                   46 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/users.375.864302607   ONLINE
USERS                   47 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf                                 RECOVER

sys@PRODCDB>

As we can see, the datafile (FILE# 47 in our case) is now in RECOVER (OFFLINE) state. We can now recover it without impacting the other pluggable databases or the container database.
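Another way to confirm which files are waiting for media recovery is to query V$RECOVER_FILE. The query below is a small sketch, assuming it is run as a privileged user in the affected PDB (or in the root); it joins to V$DATAFILE only to show the file name next to the recovery status.

```sql
-- List datafiles that currently need media recovery, with the reason
-- Oracle recorded for each one.
SELECT r.file#, d.name, r.online_status, r.error, r.change#
FROM   v$recover_file r
JOIN   v$datafile d ON d.file# = r.file#
ORDER  BY r.file#;
```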

This time, the alert log simply reported the media failure, along with a notification that the datafile (FILE# 47 in our case) was being taken OFFLINE.

Sat Nov 22 13:51:14 2014
Beginning global checkpoint up to RBA [0x46.8150.10], SCN: 2422023
Sat Nov 22 13:51:14 2014
Errors in file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_ckpt_10263.trc:
ORA-01171: datafile 47 going offline due to error advancing checkpoint
ORA-01116: error in opening database file 47
ORA-01110: data file 47: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Completed checkpoint up to RBA [0x46.8150.10], SCN: 2422023
Beginning global checkpoint up to RBA [0x46.8154.10], SCN: 2422029
Completed checkpoint up to RBA [0x46.8154.10], SCN: 2422029

Oracle states in the note that this new undocumented parameter was introduced as a fix. In my opinion, however, it behaves more like a bug than a fix (especially for the multitenant architecture), considering that the fix is neither documented nor published.

Reference

PDB is Painful to CDB any cost – 12c ?
Media Failure Of Any PDB Datafile Crashes The Complete CDB (Doc ID 1605755.1)
Bug 7691270 Crash the DB in case of write errors (rather than just offline files)
