index

0 intro

0.1 types of pages

SQL Server divides storage space into pages.

  • IAM (Index Allocation Map):
    stores the location of the extents that belong to a table or index (like a mall directory)
    stores info about pages
    used to find the pages of heaps and indexes

  • data:
    holds the actual data rows from tables
    CI leaf level
    HEAP

  • index:
    used by an index to point to other nodes or to data;
    pointers to the intermediate or leaf level
    CI root/intermediate
    NCI root/intermediate/leaf

  • LOB pages:
    Large-object data is stored in these pages. Each page is still 8 KB; a single LOB value can be up to 2 GB. Used by columns of type VARCHAR(MAX), NVARCHAR(MAX), IMAGE,
    TEXT, and by Column Store Indexes
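
To see these page categories on a real table, one option is to join the catalog views sys.partitions, sys.indexes and sys.allocation_units. This is a minimal sketch, assuming the table dbo.Person_B30 from the class code exists in the current database; any other table name works. IAM pages are not listed separately by this query.

```sql
-- Pages per allocation unit for one table (dbo.Person_B30 is an assumed name).
SELECT au.type_desc       AS alloc_unit_type,  -- IN_ROW_DATA, LOB_DATA, ROW_OVERFLOW_DATA
       i.type_desc        AS index_type,       -- HEAP, CLUSTERED, NONCLUSTERED
       au.total_pages,
       au.used_pages,
       au.total_pages * 8 AS total_kb          -- every page is 8 KB
FROM sys.partitions AS p
JOIN sys.indexes AS i
  ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units AS au
  ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)       -- in-row and row-overflow data
  OR (au.type = 2       AND au.container_id = p.partition_id)  -- LOB data
WHERE p.object_id = OBJECT_ID('dbo.Person_B30')
ORDER BY i.index_id, au.type;
```

A table with no (MAX), TEXT or IMAGE columns normally shows only IN_ROW_DATA rows here.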

0.1.1 page

  • each page: 8 KB
  • holds rows from only one table

8 KB: the smallest unit of storage (only data from a single table is allowed per page)

0.2 a page has 3 sections

0.2.1 header:

  1. page number
  2. next & previous pages
  3. extent number
  4. space remaining in the page
  5. the header itself takes 96 bytes

0.2.2 data rows

  1. rows from table

0.2.3 offset

  1. an offset is a number that records where a row starts on the page; if there are 10 rows in a page, there will be 10 offset values.

PS: the record size can be worked out from the column data types
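
As a rough illustration of working out the record size from the data types, the declared byte width of each column can be read from the catalog. A minimal sketch, assuming the table dbo.Person_B30 from the class code exists; the real on-page row is somewhat larger because each row also carries its own overhead bytes.

```sql
-- Declared width of every column in one table (table name is an assumption).
SELECT c.name       AS column_name,
       t.name       AS type_name,
       c.max_length AS max_bytes,   -- -1 means a (MAX) type
       c.is_nullable
FROM sys.columns AS c
JOIN sys.types   AS t
  ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.Person_B30')
ORDER BY c.column_id;
```

Summing max_bytes for the fixed-length columns gives a lower bound on the row size; variable-length columns such as VARCHAR(50) only use what the stored value needs.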

0.3 extent

An extent is the next smallest unit of storage after a page. It holds 8 pages, so its size is 8 * 8 KB = 64 KB.

extent -----> 8 pages ----> 64 KB
row identifier (RID): file + page + slot (offset) value; these notes write it as extent + page + offset
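
The physical location of each row can be displayed with the %%physloc%% virtual column and the sys.fn_PhysLocFormatter function. Both are undocumented SQL Server features, so treat this as an exploration aid only. The sketch assumes the table dbo.Person_B30 from the class code exists.

```sql
-- Show where the first few rows physically live.
-- The rid column is formatted as (file:page:slot).
SELECT TOP (10)
       sys.fn_PhysLocFormatter(%%physloc%%) AS rid,
       ID,
       FirstName
FROM dbo.Person_B30;
```

Rows that share the same file and page numbers sit on the same 8 KB page; the slot number is the row's position in that page's offset array.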

0.3.1 type of extent

  • mixed extent:
    Multiple tables share the extent/pages within the extent.
  • uniform extent:
    All the pages are allocated to 1 table.

0.4 heap

  • a table without a clustered index

a clustered table: a table with a clustered index

- problem: table scan (every page has to be read, which is slow)
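
Heaps can be listed from the catalog, because a heap is recorded in sys.indexes with index_id = 0. A minimal sketch that lists every user table in the current database that has no clustered index:

```sql
-- index_id 0 = heap, index_id 1 = clustered index.
SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
       OBJECT_NAME(i.object_id)        AS table_name
FROM sys.indexes AS i
WHERE i.index_id = 0
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY schema_name, table_name;
```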

0.5 index keys

1.1 what is an index?

It is a database object that sorts data logically or physically so that it can be retrieved faster.

  • Database objects used to sort data and reduce data fetch time
  • Operate like the index in a book
  • When created, an index builds a dynamic Balance Tree (B-tree)
  • Keys =/= Indexes
  • Tables without a Clustered Index are called HEAP tables
  • An index key can use a max of 32 columns or 900 bytes of data (16 columns before SQL Server 2016; from 2016 a nonclustered key can be up to 1,700 bytes)
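
The indexes on a table, together with their key columns, can be listed from sys.indexes and sys.index_columns. A minimal sketch, assuming the table dbo.Person_B30 from the class code exists. The is_primary_key flag shows that a key constraint and an index are tracked as separate properties.

```sql
-- One row per index column; key columns first, then INCLUDE columns.
SELECT i.name                AS index_name,
       i.type_desc,          -- HEAP, CLUSTERED, NONCLUSTERED, ...
       i.is_unique,
       i.is_primary_key,
       ic.key_ordinal,       -- position in the key, 0 for INCLUDE columns
       c.name                AS column_name,
       ic.is_included_column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.Person_B30')
ORDER BY i.index_id, ic.is_included_column, ic.key_ordinal;
```

A heap with no indexes returns no rows, because the heap entry in sys.indexes has no index columns.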

1.1.1 nci vs ci

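  • logical sort: the index points you to the right location to find the data
    like the alphabetical index at the back of a book
    NCI: nonclustered index

  • physical sort: the data sits right next to the index, so it can only be sorted one way
    that is why there can be only one clustered index

Why can we have only one clustered index?
It is the physical storage, sorted in one direction (the data can be stored in only one order),
like the yellow pages phone book.
CI: clustered index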

1.2 Balance Tree (b tree)

https://blog.staynoob.cn/post/2017/06/btree-data-structure/

1.2.1 3 types of levels

  • root node:
    each index has exactly one, and the root node is only 1 page
  • intermediate nodes:
    an index can have multiple intermediate nodes
    optional, depending on the amount of data
  • leaf nodes:
    each index has only one leaf level; the leaf level can have multiple pages
  • Each node is one 8 KB page
    8060 B at most for a data row
    132 B for the page header (96 B) and row-offset overhead
    8192 B in total

  • Each index created will have a Balance Tree structure, but the type of index determines how data is stored in that Balance Tree

  • Clustered Indexes store the data in the leaf pages and sort it based on the key values of the column you choose.

Question: why only one clustered index? Because the leaf pages hold the table's data itself, there is no other room to store the data in a second order.

  • Non-Clustered Indexes will NOT store data in the Leaf Pages, instead they’ll point to the rows they’re referencing
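
The levels of a B-tree can be inspected with sys.dm_db_index_physical_stats in DETAILED mode. A minimal sketch, assuming dbo.Person_B30 from the class code exists and has a clustered index (index_id = 1):

```sql
-- One row per level of the clustered index.
SELECT index_depth,   -- total number of levels in the tree
       index_level,   -- 0 = leaf level, highest value = root
       page_count,    -- pages at this level
       record_count   -- rows (leaf) or pointers (upper levels) at this level
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                      -- current database
         OBJECT_ID('dbo.Person_B30'),  -- assumed table name
         1,                            -- index_id 1 = clustered index
         NULL,
         'DETAILED')
WHERE alloc_unit_type_desc = 'IN_ROW_DATA'
ORDER BY index_level DESC;
```

The row with the highest index_level is the root and should report a single page. A small table may show only two levels, which matches the note that intermediate nodes are optional.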

1.3 Clustered Index

  • A clustered index physically moves the data from the table into its Balance Tree, in key order
  • The physical order of the data now matches the logical order
  • Data is sorted in ascending order on the chosen column, which becomes the clustering key
  • This is why there can only be 1 Clustered Index on a table: data can only be physically sorted and stored once
  • CREATE CLUSTERED INDEX name ON tablename (col1, col2)

1.4 Non-Clustered on HEAP

  • A Non-Clustered Index on a HEAP table has its B-Tree use pointers to locate where the data is
  • Non-Clustered Indexes on a HEAP table use Row Identifiers to say logically where the data is, without physically moving it
  • Since Non-Clustered Indexes do not physically move or sort data, there can be many on a single table, currently up to 999
  • Row Identifiers use a numbering system to find data: File – Page – Slot (written as Extent – Page – Offset in these notes)

file: which data file the page lives in
page: the page number within that file (each extent is divided into 8 pages)
offset (slot): the position of the row on that page

  • CREATE NONCLUSTERED INDEX name ON tablename (col1, col2)

note:
on an index page the format is: NCI key on the left, row identifier (extent-page-offset) on the right

1.4.1 two way to bring data

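  • table scan
  • NCI seek + RID lookup
    (SQL Server picks whichever costs less; it depends on how many records match
    e.g. with a single matching record it will be NCI seek + RID lookup; with many matching records it will be a table scan)

When answering this question, mention both.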
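
The cost difference between the two access paths can be observed by forcing each one with a table hint and comparing logical reads. A minimal sketch, assuming dbo.Person_B30 is a heap with the nonclustered index IX_Person_B30_FirstName from the class code; the hints are for experimenting, not a recommendation for production queries.

```sql
SET STATISTICS IO ON;   -- logical reads are printed with the messages

-- 1. Let the optimizer choose the cheaper plan.
SELECT * FROM dbo.Person_B30
WHERE FirstName = 'Abe';

-- 2. Force a scan of the heap: INDEX(0) means "use the base table".
SELECT * FROM dbo.Person_B30 WITH (INDEX(0))
WHERE FirstName = 'Abe';

-- 3. Force the NCI; with SELECT * every match then needs a RID lookup.
SELECT * FROM dbo.Person_B30 WITH (INDEX(IX_Person_B30_FirstName))
WHERE FirstName = 'Abe';

SET STATISTICS IO OFF;
```

Repeating the test with a name that matches many rows usually shows the forced NCI plan reading more pages than the scan, which is why SQL Server switches to a table scan in that case.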


1.5 Non-Clustered on Clustered

  • A Non-Clustered Index on a table with a Clustered Index has to fetch the data from the B-Tree of the CI. SQL Server seeks the NCI first, then goes in through the root of the CI and down to the CI leaf pages where the row is stored
  • Instead of Row Identifiers, the NCI uses Key Identifiers: the clustering key of the CI is what locates the row
  • The key value is stored in the leaf pages of the NCI to point to the data. The key values depend on the CI columns being used
  • If looking for the name of an employee whose ID is 3, the NCI leaf entry might say Bruce : 3

notes
How is a value found?
way 1: scan
way 2: non-clustered index seek + key lookup


1.6 Seek vs Scan in Indexes

https://blogs.msdn.microsoft.com/apgcdsd/2012/08/01/sql-serverscan-seek/

1.6.1 seek

  • Seek out the direct data needed
  • Faster and uses less resources if the query is designed for seeking data
  • Goal for indexes is to provide the ability to Seek

1.6.1.1 clustered index seek

If SQL Server knows exactly where (in which page) the data is and reads only those leaf pages that hold the required data, it is called a CI Seek.
Usually this is the best way to bring data.
This happens when the column used for the index is used in the WHERE clause.
The root node holds values of the index key (ID).

1.6.1.2 non clustered index seek

For an NCI the columns in the SELECT list matter: when the columns used are all in the NCI and the search condition is on the index column, SQL Server goes for an NCI Seek.

1.6.2 scan

  • Scan a lot of data to find the information being sought
  • Slower than seeking, though it still depends on the nature of the query

If the WHERE condition searches on something other than the clustered index key (and no other index covers it), it is a scan.

1.6.2.1 table scan

  • when SQL Server has to go through all pages that belong to the table (the table is a heap with no clustered index) to retrieve data, it is called a table scan
  • usually this is not good for data retrieval performance
    a page does not know its previous and next pages, so SQL Server has to go back to the IAM
    e.g. watching a video series: after part 1 you ask the teacher what is next, after part 2 you ask again

1.6.2.2 clustered index scan

If SQL Server has to read through all the leaf pages of the index to retrieve the required data, it is called a CI Scan. Usually this is better than a table scan, but it is not the best way to bring data.
e.g. watching a video series: you ask the teacher once, and the teacher says watch part 1, then part 2, then part 3
With a B-tree, go to the root first to find the page, then go into the page.

e.g.
select * from batch

  1. go to the IAM
  2. go to the page
  3. repeat until all the relevant information is found

1.6.2.3 NCI scan

The columns in the SELECT list are part of the index and no search condition is defined.

select fname
from person_b30
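
Which operator a query gets can be checked without running it by turning on SET SHOWPLAN_TEXT. A minimal sketch, assuming dbo.Person_B30 from the class code exists with a clustered index on ID. SET SHOWPLAN_TEXT has to be the only statement in its batch, hence the GO separators.

```sql
SET SHOWPLAN_TEXT ON;
GO

-- Search condition on the clustering key: typically a Clustered Index Seek.
SELECT * FROM dbo.Person_B30 WHERE ID = 100;

-- No search condition: every leaf page is read, typically a Clustered Index Scan.
SELECT * FROM dbo.Person_B30;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

While the setting is on, the statements are not executed; SQL Server returns the plan text, in which the operator names (Table Scan, Clustered Index Scan, Index Seek and so on) can be read directly.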

1.7 Advanced Types of Indexes

1.7.1 Covering Index

  • Used to fix Bookmark Lookup
    https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms180920(v=sql.105)?redirectedfrom=MSDN

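The Bookmark Lookup operator is no longer used as of SQL Server 2008; Key Lookup and RID Lookup take its place.

  • INCLUDE adds more columns, up to 1023

bookmark lookup:
When do we use a covering index?
When index seek + key lookup is too costly.

A common pattern: clustered index on the PK, covering indexes for the other queries.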

1.7.2 Full Text Index (FTI)

  • Allows indexing and searching of string or character based data
  • creates word tokens (hash value: a unique value for the same word?)
  • stop words (i, is, when...) can be used
  • an FTI column can hold up to 2 GB
  • a table can have only one FTI
  • if you want 2+ columns indexed, list them all in that one FTI
  • up to 1023 columns can be included
CREATE FULLTEXT INDEX ON table_name (column_name1 [...], column_name2 [...]) ...

1.7.2.1 limitations of CI/NCI (why use a full text index)

  • keys larger than 900 bytes cannot be indexed
  • search happens character by character
  • cannot search for word forms (e.g. run, running, ran)
  • the index is created based on all characters
  • no user-defined synonyms

1.7.2.2 Steps taken in the creation of a full text index

  • create a CATALOG (a logical container which can hold multiple FTIs)
  • create the FTI
  • POPULATE the index
    (as the FTI creates internal structures for all words, this process can take a long time if there is a large volume of data, so we want to do it outside business hours)
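
The three steps can be sketched in T-SQL as follows. The table dbo.Docs_Demo, the catalog ftc_Demo and the key name PK_Docs_Demo are hypothetical names made up for this example, and the Full-Text Search feature has to be installed on the instance.

```sql
-- Hypothetical table. An FTI needs a unique, single-column,
-- non-nullable key index; here the primary key plays that role.
CREATE TABLE dbo.Docs_Demo (
    DocID INT           NOT NULL CONSTRAINT PK_Docs_Demo PRIMARY KEY,
    Body  NVARCHAR(MAX) NOT NULL
);
GO

-- Step 1: create the catalog (logical container).
CREATE FULLTEXT CATALOG ftc_Demo;
GO

-- Step 2: create the FTI, without populating it yet.
CREATE FULLTEXT INDEX ON dbo.Docs_Demo (Body)
    KEY INDEX PK_Docs_Demo
    ON ftc_Demo
    WITH (CHANGE_TRACKING = OFF, NO POPULATION);
GO

-- Step 3: populate, for example from a job scheduled outside business hours.
ALTER FULLTEXT INDEX ON dbo.Docs_Demo START FULL POPULATION;
GO
```

Creating the index with NO POPULATION separates step 2 from step 3, so the expensive population can be started whenever the load on the server is low.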

1.7.2.3 freetext()

  • Find all words with similar meanings
  • College (University, School, etc)

1.7.2.4 contains()

  • Find exact words or phrases
  • Can use logical operators
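
A few query shapes for the two predicates, as a sketch. It assumes a hypothetical table dbo.Docs_Demo (DocID, Body) that already has a populated full-text index on Body; how broadly FREETEXT matches depends on the language and thesaurus configuration of the instance.

```sql
-- CONTAINS: exact word or phrase, combined with logical operators.
SELECT DocID
FROM dbo.Docs_Demo
WHERE CONTAINS(Body, '"clustered index" AND NOT "heap"');

-- CONTAINS with word forms: run, running, ran, ...
SELECT DocID
FROM dbo.Docs_Demo
WHERE CONTAINS(Body, 'FORMSOF(INFLECTIONAL, run)');

-- FREETEXT: matches on meaning rather than on the exact wording.
SELECT DocID
FROM dbo.Docs_Demo
WHERE FREETEXT(Body, 'college');
```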

1.7.3 Filtered Index

  • Index only the values that fall into the condition
  • Uses Where condition

It is an option in an NCI where indexing happens only on a filtered set (subset) of the data.

https://www.cnblogs.com/woodytu/p/4509812.html

1.7.3.1 when is it used?

  1. large volume of data, when only a few years of it are needed
  2. an SSN column where every non-NULL value has to be unique, but NULLs are also allowed

1.7.3.2 advantage & limitations

  • Advantages
  1. As the index is created on a subset of the data, its size and depth are smaller, so maintenance cost and storage are lower.
  2. Searches within the scope of the index are much faster, as less data is indexed.
  3. It handles the scenario where a column needs UNIQUE values but must also allow multiple NULL values.
  • Limitations
  1. The WHERE condition defined for filtering data has to be deterministic (static)
  2. Certain functions and operators, such as BETWEEN, are not supported in the WHERE condition.

1.7.4 Indexed Views (materialized view)

A way to get two CIs on the data of a single table

  • Allows for two Clustered Indexes on the same data set
  • Data is kept in sync
  • The VIEW must be created WITH SCHEMABINDING.
  • There must be a column or set of columns that uniquely identifies each row (the first index on the view has to be a unique CI).
  • NCIs are possible only when there is a CI on the VIEW
  • While searching for data on the VIEW/table, SQL Server can use indexes from either the VIEW or the table.
    when the data in the view is smaller, it will go to the indexed view
  • cannot use a CTE or derived table
  • once the clustered index on the view exists, new values inserted into the table are automatically reflected in the indexed view

1.7.4.1 when we use?

  • When we want to have 2 CIs on the same data set
  • If an organization doesn't have cubes and wants to generate reports based on multiple tables with calculations and other functions. Reports developed using VIEWs and SPs might be slow; to solve this we can create a VIEW that retrieves the data from the tables with the calculations and index it, so that data is pulled directly from the B-Tree of the VIEW and not from the underlying tables.
  • In some organizations reports are pulled from cubes, but if for some reason the cubes cannot be accessed (maintenance, bugs, network issues), indexed views are used as a backup option for generating reports.
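
A minimal sketch of an indexed view that pre-aggregates a report. The table dbo.Sales_Demo and the view dbo.vSalesByPerson_Demo are hypothetical names made up for this example. It also assumes the session uses the SET options that indexed views require (ANSI_NULLS, QUOTED_IDENTIFIER and so on), which are the defaults in SSMS.

```sql
-- Hypothetical base table; the primary key gives it its own CI on SaleID.
CREATE TABLE dbo.Sales_Demo (
    SaleID        INT   NOT NULL PRIMARY KEY,
    SalesPersonID INT   NOT NULL,
    TotalDue      MONEY NOT NULL
);
GO

-- The view must be schema-bound and must use two-part object names.
CREATE VIEW dbo.vSalesByPerson_Demo
WITH SCHEMABINDING
AS
SELECT SalesPersonID,
       COUNT_BIG(*)  AS SaleCount,   -- required when an indexed view uses GROUP BY
       SUM(TotalDue) AS TotalDue     -- SUM over a NOT NULL column
FROM dbo.Sales_Demo
GROUP BY SalesPersonID;
GO

-- The first index on a view has to be UNIQUE CLUSTERED; this materializes it.
CREATE UNIQUE CLUSTERED INDEX CIX_vSalesByPerson_Demo
ON dbo.vSalesByPerson_Demo (SalesPersonID);
GO

-- NOEXPAND asks SQL Server to read the view's own B-Tree
-- instead of expanding the view to the base table.
SELECT SalesPersonID, SaleCount, TotalDue
FROM dbo.vSalesByPerson_Demo WITH (NOEXPAND);
```

The same data now has two clustered structures: one on SaleID in the table and one on SalesPersonID in the view, and SQL Server keeps the view in sync on every insert, update and delete.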

1.9 column store index

1.9.1 limitation of row store index

  • when the data is big, the number of pages that must be read is high
  • duplicate values are stored over and over
  • unneeded columns are brought into RAM/buffer

1.9.2 CSI

  • each column is stored as a segment
  • max of about 1M values per segment
  • row groups: the segments that belong to the same set of rows

A rowgroup is a group of rows that are compressed into columnstore format at the same time. A rowgroup usually contains the maximum number of rows per rowgroup, which is 1,048,576 rows.
For high performance and high compression rates, the columnstore index slices the table into rowgroups, and then compresses each rowgroup in a column-wise manner. The number of rows in the rowgroup must be large enough to improve compression rates, and small enough to benefit from in-memory operations.

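  • doesn't sort data
  • segments are stored in LOB pages (a LOB value can be up to 2 GB)
    this solves the first problem

- solving the second problem:
e.g. a title column is full of repeated titles; map each one to number + title (dictionary), since an INT takes less space than a VARCHAR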

1.9.2.1

  • local dictionary: used by only one segment
  • global dictionary: shared by multiple segments

1.9.2 two compression methods:

  • dictionary only
  • tuple number
  • both: dictionary first, then tuple number (higher level compression)
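
Segments and dictionaries can be looked at through the catalog views sys.column_store_segments and sys.column_store_dictionaries. A minimal sketch, assuming SQL Server 2014 or later; the table dbo.Titles_Demo is a hypothetical name, and on such a tiny table SQL Server may or may not decide to build a dictionary.

```sql
-- Hypothetical table with a repeated string column.
CREATE TABLE dbo.Titles_Demo (ID INT NOT NULL, JobTitle VARCHAR(50) NOT NULL);
GO
INSERT INTO dbo.Titles_Demo (ID, JobTitle)
VALUES (1, 'Engineer'), (2, 'Engineer'), (3, 'Manager'), (4, 'Engineer');
GO

-- Building the index compresses the existing rows into a rowgroup.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Titles_Demo ON dbo.Titles_Demo;
GO

-- One row per column segment.
SELECT s.column_id, s.segment_id, s.row_count, s.encoding_type, s.on_disk_size
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p ON p.hobt_id = s.hobt_id
WHERE p.object_id = OBJECT_ID('dbo.Titles_Demo');

-- Dictionaries, if any were created for the columns.
SELECT d.column_id, d.dictionary_id, d.entry_count, d.on_disk_size
FROM sys.column_store_dictionaries AS d
JOIN sys.partitions AS p ON p.hobt_id = d.hobt_id
WHERE p.object_id = OBJECT_ID('dbo.Titles_Demo');
```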

1.9.3 version comparison

(image: version comparison, not preserved)

1.9.4 ADVANTAGE of CSI

  • data is stored in the form of column segments, which helps in bringing only those columns which are selected.
  • as data is stored as columns, higher compression is possible. SS can use dictionaries for repeated values and use INT in place of CHARACTER or DATE data types.
  • SS needs to read fewer pages, as around a million values are stored in a segment, so a large volume of data is stored in fewer LOB pages.
  • SS can use batch processing (PS: other processing is row by row)

1.9.5 disadvantages

  • CSI cannot help much with compression if the data does not have a lot of duplicates.
  • As data is not sorted in a CSI, retrieval of a few records is not efficient. So it is only helpful in systems that retrieve large volumes of data, e.g. DWH, DM
  • DML operations are handled differently compared to row store indexes, which adds some overhead to SS.

1.9.5.1 DML

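DML INSERT

  • fewer than about 1M rows (1,048,576): rows are not inserted directly into the CSI; they go into a CI-like structure (the delta store), whose rowgroup state shows OPEN. After the data reaches 1M rows the rowgroup is closed, and the TUPLE MOVER compresses it into the CSI
  • a bulk load of 102,400 rows or more is compressed directly into the CSI

DML DELETE
Rows are not deleted physically; they are only marked as deleted (in a delete bitmap) until the index is rebuilt.

delta store:
it is a structure like a CI, which is used by SS for INSERT and UPDATE operations on a CSI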
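
Rowgroup states and logically deleted rows can be watched through sys.dm_db_column_store_row_group_physical_stats. A minimal sketch, assuming SQL Server 2016 or later; the table dbo.CsiDml_Demo is a hypothetical name.

```sql
CREATE TABLE dbo.CsiDml_Demo (ID INT NOT NULL, Title VARCHAR(50) NOT NULL);
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_CsiDml_Demo ON dbo.CsiDml_Demo;
GO

-- A small insert lands in the delta store.
INSERT INTO dbo.CsiDml_Demo (ID, Title) VALUES (1, 'a'), (2, 'b'), (3, 'c');
GO

-- Look at the rowgroup state (OPEN for a delta rowgroup).
SELECT row_group_id, state_desc, total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.CsiDml_Demo');
GO

-- Compress the delta rowgroup now instead of waiting for the tuple mover.
ALTER INDEX CCI_CsiDml_Demo ON dbo.CsiDml_Demo
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
GO

-- Delete from the compressed rowgroup: the row is only marked as deleted.
DELETE FROM dbo.CsiDml_Demo WHERE ID = 2;
GO

SELECT row_group_id, state_desc, total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.CsiDml_Demo');
GO
```

In the second result the compressed rowgroup should report the deleted row in deleted_rows while total_rows still counts it, which is the behaviour the DML DELETE note describes.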

1.10 note

  • Can we create non clustered index with PK?
    Yes

  • Can we create an index without UK and PK?
    Yes

  • Can we create a PK with non unique non clustered index?
    No

  • Can we create a CI on a VARCHAR column?
    Yes

  • Can we create CI on DATE column?
    Yes

  • Can we create CI on multiple columns?
    Yes, it is called as composite index.
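
Several of these answers can be checked with one small table. A minimal sketch using a hypothetical table dbo.PkNci_Demo: the primary key is enforced by a nonclustered index, and the clustered index is a composite one on a VARCHAR and a DATE column with no PK or UK involved.

```sql
-- PK backed by a nonclustered (unique) index.
CREATE TABLE dbo.PkNci_Demo (
    ID   INT         NOT NULL CONSTRAINT PK_PkNci_Demo PRIMARY KEY NONCLUSTERED,
    Code VARCHAR(10) NOT NULL,
    Born DATE        NOT NULL
);
GO

-- Composite clustered index on non-key columns (VARCHAR + DATE).
CREATE CLUSTERED INDEX CIX_PkNci_Demo
ON dbo.PkNci_Demo (Code, Born);
GO
```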

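1.11 what is the difference between clustered CSI and nonclustered CSI?

  • a clustered CSI cannot store an IMAGE column; trying to do so raises an error
  • such a table can only get a nonclustered CSI on the supported columns, e.g. (id, name)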

1.12 code in the class

IF object_id('Person_B30') IS NOT NULL
DROP TABLE Person_B30
GO

CREATE TABLE [dbo].Person_B30(ID INT IDENTITY,
    [FirstName] [varchar](50) NOT NULL,
    [LastName] [varchar](50) NOT NULL,
    [JobTitle] [varchar](50),
    [BirthDate] [date],
    [MaritalStatus] [char](1),
    [Gender] [char](1)
) 
GO

INSERT INTO Person_B30
SELECT P.FirstName, P.LastName, E.JobTitle, E.BirthDate, E.MaritalStatus, E.Gender
FROM 
AdventureWorks2017.Person.Person P
LEFT JOIN AdventureWorks2017.HumanResources.Employee E
on E.BusinessEntityID = p.BusinessEntityID
GO

--NO Index on the Table (HEAP)
--Table Scan
SELECT * 
FROM Person_B30

--Table Scan
SELECT * 
FROM Person_B30
WHERE ID = 100

--Create a Clustered Index on ID column
CREATE CLUSTERED INDEX IX_Person_B30_ID
ON Person_B30 (ID)
GO

--CI Scan
SELECT * 
FROM Person_B30

--CI Seek
SELECT * 
FROM Person_B30
WHERE ID = 100

--CI Seek, columns listed in SELECT do not have any impact on the data retrieval operation (scan, seek) in regular CIs (Row Store Indexes)
SELECT ID, FirstName 
FROM Person_B30
WHERE ID = 100

--CI Scan because the column used in WHERE is not used in Index
SELECT ID, FirstName 
FROM Person_B30
WHERE FirstName = 'Abe'

--CI Scan, the SELECT column list has no impact
SELECT ID
FROM Person_B30
WHERE FirstName = 'Abe'

--CI Seek
SELECT FirstName, LastName 
FROM Person_B30
WHERE ID = 100

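--How to create a CI on a column in descending order
--(an index with this name already exists, so it is rebuilt in place with DROP_EXISTING)
CREATE CLUSTERED INDEX IX_Person_B30_ID
ON Person_B30 (ID DESC)
WITH (DROP_EXISTING = ON)
GO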

--Can we create a CI with multiple columns
    --Yes it is called Composite CI
--What is the max number of columns that can be used in a CI?
    --32

IF object_id('Index_Limits') IS NOT NULL
DROP TABLE Index_Limits
GO

CREATE TABLE Index_Limits(ColLimit CHAR(1000),
Col1 INT, Col2 INT, Col3 INT, Col4 INT, Col5 INT, Col6 INT, Col7 INT, Col8 INT, Col9 INT, Col10 INT, Col11 INT, Col12 INT, Col13 INT, Col14 INT, Col15 INT, Col16 INT, Col17 INT, Col18 INT, Col19 INT, Col20 INT, Col21 INT, Col22 INT, Col23 INT, Col24 INT, Col25 INT, Col26 INT, Col27 INT, Col28 INT, Col29 INT, Col30 INT, Col31 INT, Col32 INT, Col33 INT, Col34 INT, Col35 INT, Col36 INT)
GO

--The following statement fails because a CI key must not be larger than 900 bytes, and here it is 1000
CREATE CLUSTERED INDEX IX_Index_Limits
ON Index_Limits (ColLimit)

--The following statement fails because a CI can be created with up to 32 columns, but not more
CREATE CLUSTERED INDEX IX_Index_Limits
ON Index_Limits (Col1, Col2, Col3, Col4,
Col5,
Col6,
Col7,
Col8,
Col9,
Col10,
Col11,
Col12,
Col13,
Col14,
Col15,
Col16,
Col17,
Col18,
Col19,
Col20,
Col21,
Col22,
Col23,
Col24,
Col25,
Col26,
Col27,
Col28,
Col29,
Col30,
Col31,
Col32, Col33, Col34, Col35, Col36)
GO

--DROP CI on Person_B30
DROP INDEX IX_Person_B30_ID
ON Person_B30

--Create NCI on FirstName
CREATE NONCLUSTERED INDEX IX_Person_B30_FirstName
ON Person_B30 (FirstName)
GO

--NCI Seek + RID Look Up
--There is only 1 record with matching data, so SQL Server uses the access path above to retrieve it
SELECT *
FROM Person_B30
WHERE FirstName = 'Abe'

--Table Scan, because there are over 50 records matching the search condition, so SS found a Table Scan cheaper than NCI Seek + RID Look Up
--The RID Look Up operation happens only when there is an NCI on a Heap
SELECT *
FROM Person_B30
WHERE FirstName = 'Aaron'

--NCI Seek + RID Look Up OR Table Scan
SELECT FirstName, LastName, Gender
FROM Person_B30
WHERE FirstName = 'Abe'

--NCI Seek, as NCI has impact on what columns are in SELECT list and columns used are in NCI and search condition is defined based on Index Column, it goes for NCI Seek
SELECT FirstName
FROM Person_B30
WHERE FirstName = 'Aaron'

--NCI Scan, because the columns in the list are part of Index and no search condition is defined
SELECT FirstName
FROM Person_B30

--DROP NCI on Person_B30
DROP INDEX IX_Person_B30_FirstName
ON Person_B30

--Create CI on ID
CREATE CLUSTERED INDEX IX_Person_B30_ID
ON Person_B30 (ID)
GO

--Create NCI on FirstName
CREATE NONCLUSTERED INDEX IX_Person_B30_FirstName
ON Person_B30 (FirstName)
GO

--CI Seek, as SQL Server knows which pages to pull data from
SELECT *
FROM Person_B30
WHERE ID BETWEEN 1 AND 250

SELECT *
FROM Person_B30
WHERE ID > 0

--NCI Seek + Key Look Up OR CI Scan (SQL Server picks the cheaper one)
SELECT *
FROM Person_B30
WHERE FirstName = 'Ana'

SELECT ID, FirstName, LastName, JobTitle
FROM Person_B30
WHERE FirstName = 'Ana'

SELECT ID, FirstName
FROM Person_B30
WHERE FirstName = 'Ana'

--In the following query columns selected are available in both NCI and CI, but CI has more number of pages than NCI (as data content is less) so SS goes for NCI Scan over CI Scan
SELECT ID, FirstName
FROM Person_B30

--DROP NCI on Person_B30
DROP INDEX IX_Person_B30_FirstName
ON Person_B30

--Create a covering index to solve Key Look Up and RID Look Up
CREATE NONCLUSTERED INDEX IX_Person_B30_FirstName
ON Person_B30 (FirstName)
INCLUDE (LastName, JobTitle)
GO


SELECT FirstName, LastName, JobTitle, ID
FROM Person_B30

--Creating table to compare Covering Index with Non Covering
IF object_id('Person_B30_NonCover') IS NOT NULL
DROP TABLE Person_B30_NonCover
GO

CREATE TABLE [dbo].Person_B30_NonCover(ID INT IDENTITY,
    [FirstName] [varchar](50) NOT NULL,
    [LastName] [varchar](50) NOT NULL,
    [JobTitle] [varchar](50),
    [BirthDate] [date],
    [MaritalStatus] [char](1),
    [Gender] [char](1)
) 
GO

INSERT INTO Person_B30_NonCover
SELECT P.FirstName, P.LastName, E.JobTitle, E.BirthDate, E.MaritalStatus, E.Gender
FROM 
AdventureWorks2017.Person.Person P
LEFT JOIN AdventureWorks2017.HumanResources.Employee E
on E.BusinessEntityID = p.BusinessEntityID
GO

--Create a CI on new table
CREATE CLUSTERED INDEX IX_Person_B30_NonCover
ON Person_B30_NonCover(ID)
GO

--Create a NCI on new table
CREATE NONCLUSTERED INDEX IX_Person_B30_NonCover_FirstName
ON Person_B30_NonCover(FirstName)
GO

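--DATETIME2 keeps the precision that SYSDATETIME() returns; DATETIME would round away the microseconds
DECLARE @ST DATETIME2, @ET DATETIME2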
SELECT @ST = SYSDATETIME()
SELECT ID, FirstName, LastName, JobTitle
FROM Person_B30
WHERE FirstName = 'Ana'
SELECT @ET = SYSDATETIME()
SELECT DATEDIFF(MICROSECOND, @ST, @ET) AS 'Cover'

SELECT @ST = SYSDATETIME()
SELECT ID, FirstName, LastName, JobTitle
FROM Person_B30_NonCover
WHERE FirstName = 'Ana'
SELECT @ET = SYSDATETIME()
SELECT DATEDIFF(MICROSECOND, @ST, @ET) AS 'Non Cover'


SELECT *
FROM sys.dm_db_index_physical_stats(db_id('Training_SQL'), object_id('Person_B30'), 1, null, 'detailed') 

--Inserting same data 10 times to get an intermediate level
INSERT INTO Person_B30
SELECT P.FirstName, P.LastName, E.JobTitle, E.BirthDate, E.MaritalStatus, E.Gender
FROM 
AdventureWorks2017.Person.Person P
LEFT JOIN AdventureWorks2017.HumanResources.Employee E
on E.BusinessEntityID = p.BusinessEntityID
GO 10


SELECT *
FROM sys.dm_db_index_physical_stats(db_id('Training_SQL'), object_id('Person_B30'), 1, null, 'detailed') 

--Filtered Index
--It is an option in NCI, where Indexing happens only on filtered set (sub set) of data

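--Check the date range available for the filter
SELECT max(H.OrderDate), MIN(H.OrderDate)
FROM AdventureWorks2017.Sales.SalesOrderHeader H
JOIN AdventureWorks2017.Sales.SalesOrderDetail D
ON H.SalesOrderID = D.SalesOrderID

--Table for Filter Index
SELECT H.SalesOrderID, H.SalesPersonID, H.OrderDate, H.TaxAmt, H.TotalDue, D.OrderQty, D.UnitPrice 
INTO AdvSales_Filter
FROM AdventureWorks2017.Sales.SalesOrderHeader H
JOIN AdventureWorks2017.Sales.SalesOrderDetail D
ON H.SalesOrderID = D.SalesOrderID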

--Table for Non Filter Index
SELECT H.SalesOrderID, H.SalesPersonID, H.OrderDate, H.TaxAmt, H.TotalDue, D.OrderQty, D.UnitPrice 
INTO AdvSales_NonFilter
FROM AdventureWorks2017.Sales.SalesOrderHeader H
JOIN AdventureWorks2017.Sales.SalesOrderDetail D
ON H.SalesOrderID = D.SalesOrderID

--Create a Covering Filtered Index for a subset of the data in the table
--(adjust the dates to a range that exists in your data; see the MIN/MAX query above)
CREATE NONCLUSTERED INDEX IX_AdvSales_Filter
ON AdvSales_Filter(SalesOrderID)
INCLUDE(OrderDate, TotalDue, OrderQty)
WHERE OrderDate>= '02/01/2008' AND  OrderDate <= '07/31/2008' 

--Create a Covering Index for all the data in the table
CREATE NONCLUSTERED INDEX IX_AdvSales_NonFilter
ON AdvSales_NonFilter(SalesOrderID)
INCLUDE(OrderDate, TotalDue, OrderQty)

SELECT OrderDate, TotalDue, OrderQty, SalesOrderID 
FROM AdvSales_Filter
WHERE OrderDate>= '05/01/2008' AND  OrderDate <= '05/15/2008' 

SELECT OrderDate, TotalDue, OrderQty, SalesOrderID
FROM AdvSales_NonFilter
WHERE OrderDate>= '05/01/2008' AND  OrderDate <= '05/15/2008' 

--Creating a Student table to hold multiple NULL values for SSN and all other Non NULL values should be UNIQUE
CREATE TABLE B30_Students(StuID INT, StuName VARCHAR(25), SSN CHAR(11))
GO

CREATE UNIQUE NONCLUSTERED INDEX IX_B30_Students_SSN
ON B30_Students (SSN)
WHERE SSN IS NOT NULL

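--Succeeds: multiple NULL values are allowed and the one non-NULL SSN is unique
INSERT INTO B30_Students VALUES
(1, 'Jiacheng', NULL),
(2, 'Luyao', '123-45-6789'),
(3, 'Zoe', NULL)

--Fails: this SSN already exists, so the filtered UNIQUE index rejects the duplicate
INSERT INTO B30_Students VALUES
(4, 'Ling', '123-45-6789')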

SELECT * FROM B30_Students

