Jan 29 page data

page size: typically ~100 KB, extreme 256 MB

First question: row store or column store?

row store--whole records are stored contiguously --- classical method

page:
| header | rec1 | rec2 | rec3 .....
most useful for transaction processing.
Downside: when there are a large number of columns, much of the information in a row is useless if you only need to access a subset of data.

column store: ---increasingly common organization for databases.
Page:
header|||||||||| <---at I for Rec1, at I for Rec2, .....at I for RecN
Down side: records are at different pages. have to access multiple pages to get a row.
Up side: analytical process: tables are really really wide.

Useful for different circumstances.

What should be in a header?

Info about table : 1. name, 2. schema (attributes)
Timestamp (for Concurrency Control and Recover CCR)
Info about Storage format
Pointers to start of each record. (bad: complicated. Good: may reduce CPU) -- useful as there might be variable length and you cannot simply get record by offset * unit_length
Number of recs
Pointer to first unused slot ...

What should be included in a record header

Time stamp for CCR
Pointer to next record (do this to trade space for CPU, otherwise you will need to parse the next record and parsing is expensive)
Pointers to attributes
Valid or not? A bit or byte to tell if the record is valid or not. --- To delete, simply put a bit there. --valid or not byte

In next assignment, use provided code (page storage) to store bunch of records

"Pointers" in DBMS

C pointer is a memory address.
smart pointer is the same with a counter associated with it.
Types of pointers:

Classic pointers: memory locations. --DBMS support code (Binary Tree to store the data etc.). In header, not using classical pointer ()move one page to another place and a classical pointer breaks.
"Offset pointer": not absolute locations; store a distance from the specific memory location. --> assuming having the location of the header. an attribute is 28 bytes away.
Off page pointer (points to DB data not on the same page). Why? Big reason: in DB structures (e.g. B tree) requires off page pointer

2)3) need to translate to the memory locations. 3) is bit difficult will be covered later.

What info do we like to include in an Off-page pointer / what data goes into an off-page pointer

ID (identifier) for page

IP address (could be a distributed system), file, page position) (drawback: this might change)
Logical: file, page ID. --- you have to have some sort of server allow you to look up actual location with page ID. --- slower because you will need two searches.

location of data on page. (might be an offset Ora "logical" location)

Jan 31

Swizzling - refers to process of converting logical pointers to C-style memory addresses

when to swizzle and when to "un" swizzle?

Fully on demand swizzling(every time we defer, we swizzle)
pro - easy! QUIZ some one want to access a page, they call getBytes, what they have is a logical pointer, in our system, finalize with c-style pointer with a swizzling process
Con:expensive (repeatedly go and lookup pointer)
Automatic swizzling: automatic means the system swizzles every pointer when loaded into RAM
- pro: easy from application programmer's view. I just dereference things, don't need to look up (the logical 2 physical table). Also looks efficient.
To implement:
- need to handle eviction (swizzle ptr to null)
- need to handle buffer cache misses (looks like on demand)
- also need a lookup table to keep track of all the pointer in the RAM: so can handle page movements and evictions.
third solution: Hybrid (Once a pointer is dereferenced, its management is automatic )

Organizing records and pages into files:Big topic that going to spend a week to cover

go through several file organizing

(1) Heap (as database people, we really refers an unordered "pile" of records)
- inserts: simplest implementation: write to end of last page in file.
- Finds: iterate or scan from start to finish
- Deletes: remove rec(s) from page, periodically (Garbage Collection)
* Pros:
1. heap is simple; easy to iterate them;
2.can have good performance for large file;
3. good for adding data
* Cons:
1. Bad for "point" finds that access with a specific keys

(2) Sorted file: at records sorted according to some "search key" (one or more user supplied attributes)

inserts: Buffer a large batch, then periodically re-sort.
Question: how to sort a large (bigger than RAM) file? --answered by student (merge sort)
TPMMS (two-phase multi-way merge sort)
TWO phases: sort phase and merge phase
sort phase: while some portion of file is still unsorted:
1) Read in R pages from input file (R < B # numb buffered pages available)
2) Sort all of the records on these pages.
3) Write new pages (w. sorted records -- called a "run") to disk