Friday, May 5, 2017

Fractured Blocks- ORA-01578 & tail checks

Recently dealed with an ORA-1578 error, which could be an indication of a Fractured Block..
Because of this error, the database could not be opened.
The complete Oracle error stack recorded in the trace file, was like the following;

Corrupt block relative dba: 0x0101342a (file 4, block 78890)
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 4, block # 78890)
ORA-01110: data file 4: '/ortest/testdata/system04.dbf'

So, without a need to do a further diagnostic, it was obvious that a block in the system04.dbf was corrupted. 
However; as the error was ORA-01578, this made me think that the problem was due to a physical corruption, which was a result of an OS or Hardware IO error.. 
In addition, this physical corruption seemed to be the cause of a fractured block.

I like to explain the Fractured block everytime it is asked, because i find it interesting.. It is somehow related with the other stacks below Oracle .. ( OS, Storage and so on)

As oracle IO is done through the OS. The size of a particular IO depends on the OS Block size, which is 512 bytes for Linux.

So, if we have a 8K Block Sized Oracle Database and if we modify one block of it; (oracle does its modifications on block level), then we actually do multiple IOs. (when we look to this incident from the OS perspective)

Almost all the backup consistency related solutions are built by considering this. For example, hot backup mode (Backup mode) is designed to guarantee the consistency, to guarantee the recovery eventhough there may be fractured blocks in the backed up files.

So, a fractured block means that the database block is incomplete/physically corrupted.

Suppose we are doing a DB block IO. It is 8K right? So, considering our OS block size is 512bytes, a single DB block IO means multiple OS IO (16 IOs..) . 

Suppose OS does 15 IOs and during the 16th IO it encounteres a problem and can not do the 16th IO.. 

Now suppose; Oracle can not even catch it, or we as Oracle couldn't get the return, or let's say; could not handle it.. (or OS says it is ok, altough it is not ok..)

At this moment, our Block is physically corrupt and it is inconsistent, since it has one part coming from the future , and one part coming from the past..

On the other hand; Oracle knows about this danger and it has mechanisms to check this.

In order to check this, Oracle updates a field in the end of each block. The size of this field is 4 bytes. 

Oracle writes down a number into these 4 bytes for checking the consistency of the tail of the block in the subsequent updates.

That is, Oracle updates this 4 byte field, which is located in the tail of each block, before writing down the block to disk completely.  (before giving block to OS -- for IO)

This value stored in a 4 byte field is called the tailchk..I t is a number, which is built by using the information stored in the block itself (which oracle reads from the head and remembers).

If the tailchk is aligned with the related data that is stored in block header, Oracle ensures that at least the tail of the block is consistent with the head.

Tailchk value is computed using the the last 4 digits of the SCN base, the Block Type and the SCN sequence Number.
Additional info can be found in Fractured Block explanation (Doc ID 1392417.1).

Note: The tailcheck only verifies if the header and the tail of the block have been written correctly. The complete(whole) block verification can be achieved using the DB_BLOCK_CHECKSUM..

So in our case, as seen below, the block type= 0 , last 4 digits of scn is 0001 and seq is 0x1..
So, our tailchk value must be -> 0x00010001, however; it is 0x19b90601, as you see below.
(--block type is 0 .. This is wrong in the first place, but still wanted to show you how we calculate the tailchk--)

Corrupt block relative dba: 0x0101342a (file 4, block 78890)
Bad header found during buffer read
Data in bad block:
 type: 0 format: 3 rdba: 0x0000000c
 last change scn: 0x056e.ea020001 seq: 0x1 flg: 0x06
 spare1: 0x0 spare2: 0x0 spare3: 0x0
 consistency value in tail: 0x19b90601
 check value in block header: 0x747
 computed block checksum: 0x8577
Reread of rdba: 0x0101342a (file 4, block 78890) found same corrupted data
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 4, block # 78890)
ORA-01110: data file 4: '/ortest/testdata/system04.dbf'

Also, computed block checksum approves that this block is corrupt, as it is different than 0x0.
Actually the block type is 0 here. Type 0 is an unknown type. So this blog is really a corrupt one. It is not only the computed block checksum that proves this.  (computed block checksum is meaningless actually.. because the type is 0) 
In addition, the content of block is actually for a different block. Rdba value is different than the requested dba value.
So this blog is both corrupted and misplaced.

The methods that Oracle suggests for fixing these kinds of problems, can be seen below;

Note 144911.1 RMAN : Block-Level Media Recovery - Concept & Example
Note 342972.1 How to perform Block Media Recovery (BMR) when backups are not taken by RMAN ORA-1578 Methods to Skip Block Corruption (Doc ID 2199133.1)

Above 2 methods rely on backups.. So we need to have RMAN or non-RMAN backups in order to fix these kinds of problems using the methods above.

However, what if we don't have backups?

Well, there are 4 things that I can say on this ->

1)We can recreate the index, if the fractured block is an index block.

2)We can use methods for skipping block corruption (documented in 2199133.1)

3)We can use "bbed" utility to update the consistency value in tail and computed block checksum properly in order to at least start our database. (using bbed at on our own risk) -- this method is a very advanced one.

4)There are cases for specific scenarios.. For example this one -> Error ORA-1578 reported on the database after changing compatible.rdbms >= 11 and resizing a datafile (Doc ID 965751.1)Taking a look to Oracle Support is always a good idea.

No comments :

Post a Comment