Jan22 disk failures

Disk failures

  • Typically, as DBMS implementers, we assume secondary storage just works.
  • Classic recovery doesn't address media (disk) failures.

But sometimes a disk does fail:

  • Catastrophic plant failure --- handle via geographic distribution.
  • Simple hardware failure --- "RAID" (redundant array of inexpensive disks).

How do we detect if a disk is giving bad values?

  • Checksums are common.
  • Use a parity bit for each byte --- e.g. n = 8 bits: 1 (parity bit) 00110010
  • For a given block, count the # of 0's and the # of 1's
    --- add one more bit at the end so that the total # of 1's is even.
    If I perform a checksum, what is the chance that a b-bit error will go uncaught?
    --- one-bit error: always caught.
    --- two-bit error, e.g. switching 10 to 01: cannot be caught.
    --- three-bit error ...
    Overall, about a 50% chance that I catch a random error of b bits in a block of size n:
    parity catches the error exactly when b is odd.
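
The parity check above can be sketched in a few lines of Python. This is only an illustration, not a real disk checksum: the helper names (`parity_bit`, `add_parity`, `check`, `flip`) are made up for this example, and the parity bit is written in front of the data bits as in the 00110010 example (its position does not affect the check).

```python
def parity_bit(bits: str) -> str:
    """Return the even-parity bit for a string of '0'/'1' characters."""
    return "1" if bits.count("1") % 2 == 1 else "0"

def add_parity(bits: str) -> str:
    """Prefix the data bits with their parity bit (placement is an assumption)."""
    return parity_bit(bits) + bits

def check(word: str) -> bool:
    """True if the stored word (parity bit + data) has an even number of 1's."""
    return word.count("1") % 2 == 0

def flip(word: str, *positions: int) -> str:
    """Simulate a disk error by inverting the bits at the given positions."""
    chars = list(word)
    for p in positions:
        chars[p] = "1" if chars[p] == "0" else "0"
    return "".join(chars)

word = add_parity("00110010")      # three 1's in the data, so the parity bit is 1
one_bit = flip(word, 3)            # odd number of flips: check fails (caught)
two_bit = flip(word, 3, 4)         # "10 -> 01"-style double flip: check passes (missed)
three_bit = flip(word, 1, 2, 3)    # odd again: caught

print(check(word), check(one_bit), check(two_bit), check(three_bit))
```

Any error that flips an even number of bits leaves the count of 1's even, so it passes the check unnoticed; that is the weakness the notes point out.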

So the idea is to make blocks small, so that there is a good chance at least one of the blocks catches an error. In the extreme (1-bit blocks) you mirror, which is costly.
In practice, use larger blocks and accept some chance of missed error.
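
To get a feel for this trade-off, the sketch below enumerates every nonzero error pattern on 12 data bits and counts how many are caught when each sub-block carries its own parity bit. The function name `detected_fraction`, the 12-bit size, and the simplification that only data bits (never the parity bits themselves) get corrupted are all assumptions made for the illustration.

```python
from itertools import product

def detected_fraction(n_data: int, block_size: int) -> float:
    """Fraction of nonzero error patterns on n_data bits that are caught
    when every block_size-bit sub-block has its own even-parity bit.
    Assumes the parity bits themselves are never corrupted."""
    assert n_data % block_size == 0
    caught = total = 0
    for pattern in product((0, 1), repeat=n_data):   # 1 = this bit was flipped
        if not any(pattern):
            continue                                 # nothing flipped: not an error
        total += 1
        blocks = (pattern[i:i + block_size] for i in range(0, n_data, block_size))
        # A sub-block's parity check fails iff an odd number of its bits flipped.
        if any(sum(block) % 2 == 1 for block in blocks):
            caught += 1
    return caught / total

for size in (12, 6, 4, 1):
    parity_bits = 12 // size                         # storage overhead
    print(size, parity_bits, round(detected_fraction(12, size), 4))
```

With one 12-bit block, about half of the patterns are caught at the cost of a single parity bit; with 1-bit blocks every pattern is caught, but there is one parity bit per data bit, which is the mirroring case.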
