SQL Server Integration Services _ Class Notes

Business Intelligence (BI)
Decision Support System (DSS): helps make business decisions accurately.
Needs a huge amount of data related to the business.
A database is a collection of related data.
Data can come in different formats.

Microsoft Business Intelligence (MSBI): BI tools
Self Service BI Tool: do it yourself
Business intelligence is the process of transforming business data into information/knowledge using computer-based techniques, thus enabling users to make effective, fact-based decisions.

BI Stack/Suite has three parts

  1. First process: collect the data: ETL Services (concept) ------ SSIS (service tool)
  2. Second process: analyze the data: Analysis Services (concept) ------ SSAS (service tool)
  3. Third process: report the data: Reporting/Data Visualization Services (concept) ------ SSRS (service tool)

BI Vendors: IBM, Microsoft, Oracle, SAP, SAS, MicroStrategy
New vendors: Tableau, Actuate, QlikTech

/Aside: Linux is a Unix-like operating system; Microsoft has a very successful marketing strategy/

Questions from a manager: Show me the products with high margins; show me the top-selling products for the past five years; show me the inventory trend for the past 5 years; increase sales by identifying customer buying patterns.
--------------------------------SSIS Introduction-----------------------------
SSIS is a Microsoft GUI-based ETL tool for data integration

Version                            | Year of Launch | BI Suite (Software) | Called
DTS (Data Transformation Services) | 2000           | SQL Server 2000     | -
MSBI (SSIS, SSAS, SSRS)            | 2005           | SQL Server 2005     | BIDS
MSBI (SSIS, SSAS, SSRS)            | 2008           | SQL Server 2008     | BIDS
MSBI (SSIS, SSAS, SSRS)            | 2010           | SQL Server 2008 R2  | BIDS
MSBI (SSIS, SSAS, SSRS)            | 2012           | SQL Server 2012     | SSDT

BIDS: Business Intelligence Development Studio
SSDT: SQL Server Data Tools
Therefore, BIDS and SSDT are essentially the same tool under different names.

The first three are very important.
A package is the basic unit of work (and deployment) in SSIS. Every package is a collection of tasks and transformations.

Solution and Project:
Solution is the collection of projects
Project is the collection of packages
One solution can have more than one project (optional):
File > Add > New Project
One project can have more than one package.
You cannot add a package directly to a solution; it has to be added to a project first.

Windows in SSIS:

  1. Solution Explorer: shows solutions/projects/packages; if you don't see it, View > Solution Explorer
  2. SSIS Toolbox: if you don't see it, SSIS tab > SSIS Toolbox
  3. Connection Managers (at the bottom): make connections from the source (through a source adapter) to the destination (through a destination adapter)
  4. Properties window: shows and sets properties of the package/task; package.dtsx (dtsx is the successor of dts (Data Transformation Services), and x stands for XML (Extensible Markup Language, a platform-independent universal text format; XML does NOT have a predefined definition; it uses XML Schema Definition (XSD) files to describe itself)
    (SQL Server also works on XML:
    Select BusinessEntityID, PersonType, FirstName from Person.Person for XML auto;
    This gives data in xml)
    SSIS is XML-based;
    XML and HTML (HyperText Markup Language; hypertext refers to the links that connect Web pages in the World Wide Web); FBML (Facebook Markup Language)
    A markup language is a data presentation/formatting language, not a programming language (like C++)
    ----------------------Data Flow Task--------------------------
  5. Source
  6. Transformation
  7. Destination

SSIS Toolbox>Tasks & Containers
Control flow contains tasks and containers
Drag “Data Flow Task” onto the “Control Flow” tab

Under the ‘Data Flow’ tab >
Data Pipeline: the route/pipeline from the data source to the data destination via transformations
‘Send Mail Task’ is under the Control Flow tab
Precedence Constraint: sets the sequence of task execution, one task after another
Green: Success
Red: failure
Black: completion
Error Redirect Row: error

‘Integration Services Catalog’ in SQL Server: don’t worry about it for now

Control flow: defines the business process
Example on control flow & data flow: see note

Adapter:
Connection Manager: gives the address of file

Code page:
https://en.wikipedia.org/wiki/Code_page

Comma separated files(comma separated values .csv)
Delimiter: to separate rows or columns
CRLF (CR = Carriage Return and LF = Line Feed): the Windows operating system uses it as the default row delimiter;
(Carriage Return: on a typewriter, after typing a line of text, the assembly holding the paper (the carriage) returns so that the machine is ready to type again at the left-hand side of the paper)
(Line Feed: on a typewriter, the "down" motion that advances the paper to create a new line on the page)
Column delimiters: tab (default), comma, colon, semicolon, vertical bar; a double quote is typically used as a text qualifier

Session 1: Introduction to SSIS
Part 1: SSIS Overview
• Introduction to SSIS
 SSIS is a Data Integration tool from Microsoft.

 SSIS is used to design, implement, and manage complex & high-performance Data Integration Applications, also called ETL Applications
-Extraction: pull data from different data sources (e.g. flat files, Excel files)
-Transformation: apply business rules to the data (aggregation, string manipulation)
-Loading: store the data in different destinations (e.g. flat files, Excel files)

 It is part of MSBI (Microsoft Business Intelligence) stack which includes SSIS(first step, collect data), SSAS(second step, analyze data), SSRS(third step, report data)

SSIS Architecture: Five Components/Building Blocks ----------------------

  1. Control Flow: the highest level of an SSIS package; contains Tasks and Containers, including Data Flow Tasks
  2. Data Flow: part of the control flow that moves data from source to destination
  3. Event Handlers
  4. Parameters
  5. Package Explorer

• Task
 is the smallest job unit in a control flow.
 You can implement anything you need to do as a task (e.g. sending email, executing SQL code, migrating data from one location to another).
• Precedence Constraints
 Relationships among tasks represent data integration logics. Such logics are called precedence constraints.
 Three types:
 Success
 Failure
 Completion
• Variables
dynamically update column values and property expressions, control execution of repeating control flows, and define conditions for precedence constraints.
• Containers
 provide structure in packages and services to tasks in the control flow.
 SSIS has the following container types:
 For Loop container: repeats its control flow until a specified expression evaluates to False.
 Foreach Loop container: enumerates a collection and repeats its control flow for each member of the collection.
 Sequence container: define a subset of the control flow within a container to manage tasks and containers as a unit.
• Connection managers
 connect different types of data sources to extract and load data.
• Parameters
 assign values to properties at the run time.
 Project parameters at the project level: supply external input to one or more packages in the project.
 Package parameters at the package level: modify package execution without editing the package.
• Commonly used SSIS Scenarios
• Compile Data from Heterogeneous Sources
• Move Data between Systems
• Load Data Warehouses/Data Marts
• Clean, format, or Standardize Data
• Identify, capture and process Data Changes
SSIS Designer: a graphical tool to design ETL applications, an integrated development environment provided by Visual Studio
SSIS Wizard: imports/exports data from one point to another (data source to destination) using managed .NET Framework or native OLE DB providers;
Import and Export Data (64-bit)
Uses native/managed providers for pulling/storing data
Command Line Utilities: execute SSIS packages from the command prompt, either from the file system (.dtsx) or from the Integration Services server
ETL is performed inside the Data Flow Task.
SSIS Object Model has 3 parts:
– Control Flow: the most fundamental/largest part
• Tasks
• Precedence Constraints
• Variables and Expressions
– System Variables
– User Defined Variables
• Containers
• Connection Managers
• Packages and Projects
• Parameters
– Package Parameter
– Project Parameter
• Log Providers
• Event Handlers
– Data Flow
• Source Adapters
• Transformation
• Destination Adapters
– SSIS Catalog
• Folders
• Environment
• References
– Relative
– Absolute
Precedence Constraints: define integration logic of the control flow
• Success(green): the child task will be executed if and only if the parent task was executed successfully;
• Failure(red): the child task will be executed if and only if the parent task was executed with errors;
• Completion(black): the child task will be executed if and only if the parent task has completed. It does NOT matter whether the parent task executed successfully or with errors.
How do tasks communicate with each other: Variables

Containers: subset of the control flow. Three types of containers:
• -For Loop Container
• -Foreach Loop Container
• -Sequence Container
• -Hidden/Abstract Container (Task Host Container): whenever you put a task on the design area, by default it sits inside a task host container, which means you can set up the properties of each task individually

Connection Manager: connection that helps the SSIS engine to connect to outside data sources
Packages: when you create a solution, by default it creates a package, which is basically a container that has a control flow, data flows, parameters, event handlers, logging, and a package explorer
Project: the entire solution is called a project, which can have different packages

2 types of parameters:
• -package scoped: package tab>parameter
• -project scoped (shared by all packages):Solution Explorer>SSIS>Project.params

Log providers: SSIS packages create a lot of logs, which can depend on different types of events: error, warning, pre-execution, or post-execution. These kinds of events can be captured with the help of SSIS log providers. If a package is executed with errors, I can review the logging info, which helps debug the package

Configuration:Control flow>SSIS>Logging>provider type

Event handler: helps execute an alternative workflow in a particular situation

Data Flow: one of the important tasks in SSIS, which is used to implement the ETL transformation
• -Source Adapters: pull out data from different data sources (Extraction)
• -Transformation: apply business rules on data
• -Destination Adapters: populate data destination (Loading)
• Data Pipeline is used to transfer data from one point to another.

SSIS Catalog: a new feature of SSIS 2012 that is used to deploy, manage, and execute SSIS packages. (In previous versions, SSIS packages are deployed to the MSDB database of the Integration Services server, found via SSMS.)
• -Folder
• -Environment: helps us configure the package at run time with different configuration values and execute the package in different contexts
• -Reference: once the environment is created, a reference binds the environment to the project. 2 types of references:
–relative reference: the project shares the environment with projects within the same folder;
–absolute reference: the project shares the environment with projects in a different folder

Difference between parameter and variable:
• -a parameter is used when you want to set a value before the execution of your package; the value has to be set by a developer or administrator and stays constant during execution; it can have package or project (shared by multiple packages) scope;
• -a variable is used when you want to change the value during the execution of your package; the value may or may not be constant;

Common Usage Scenarios for SSIS:
• Consolidation of Data from Heterogeneous Sources
– OLE DB Source and Destination
– Flat File Source and Destination
– Excel Source and Destination
– XML Source, etc.
(Provider = connection manager)
• Moving Data between Systems
• Loading Data Warehouses/Data Marts: SSIS typically covers 60-80% of the workload; hard to do with SQL Server alone
• Cleaning, Formatting, or Standardizing Data: Cleaning - fixing misspellings, converting uppercase to lowercase; Standardizing - replacing null values with standard values
• Identifying, Capturing and Processing Data Changes: initial & incremental load

Session 2: XML Framework, Data Flow Task, Connection Managers
Part 1: Source and Destinations in DFT
• XML Framework:
• SSIS package and other object definitions are stored in standard XML format.
When you create an SSIS application, it adds an SSIS package by default and generates XML code behind the package. Whatever item you add to your package (connection manager, parameter, variable, executable), corresponding XML code is generated and added to the XML definition file.

• Extension of a package is “.DTSX”, here ‘DTS’ stands for Data Transformation Services and ‘X’ stands for XML.
When we use any Microsoft application, it creates XML code, which is platform independent. Whenever you are migrating a solution from one machine to another and you don't have access to the designer or any graphical tool used to create or modify the project, you can simply open the XML code and make the modification as required. For example, for any application built in .NET the extension is .aspx, where asp stands for Active Server Pages and x stands for XML. For Microsoft Office products, the extension for documents is .docx, where doc stands for document and x stands for XML; an Excel file has extension .xlsx, where xl is short for Excel, s for spreadsheet, and x for XML. So every product has compatibility and relations with XML. It helps modify the project if the appropriate tool is not available for the object.
• We can edit the raw XML for an object or use a graphical designer, with related menus, toolbars, and dialog boxes.
Data Flow Task: one of the tasks available in SSIS; it is used to move data from one point to another and is the only task in SSIS that is also used to transform the data
• encapsulate data flow engine (pipeline) between sources and destinations.
• let user transform, clean, and modify data
• build an execution plan so data flow engine executes it at run time.
Whenever you put an item at the data flow level (source, transformation, destination), the data flow task creates an execution plan, and the added items are executed as per that plan
Part 2: OLE DB Source and Destination Adapters DFT
• OLE DB:
• OLE DB (Object Linking and Embedding Database), an API (Application Programming Interface) designed by Microsoft, allows accessing a variety of sources in a uniform manner.
A code page is a table of values that describes the character set used for a specific language
When you create a new table or view from the connection manager/destination editor, the designer generates a table definition that fits the incoming column structure
• OLE DB Source
• OLE DB Source extracts data from a variety of OLE DB-supported database table or view or SQL command (e.g. Microsoft Office Access or SQL Server databases)
• Access Modes in OLE DB Source adapter:
 Table or view: a table or view. You can specify an existing table or view, or you create a new table row by row.
 Table name or view name variable: a table or view specified in a variable.
 SQL Command: the results of an SQL statement
-OLE DB Source supports parameters; the SQL command can be a parameterized query (see the sketch after this list).
-Use only SELECT statements, which bring the metadata of the queried table into SSIS; no DML statements, as those belong with the OLE DB Command transformation.
 SQL Command from Variables: the results of an SQL statement stored in a variable.
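(A minimal sketch of such a parameterized SQL command; the dbo.Orders table and its columns are hypothetical. With an OLE DB connection manager, parameters appear as positional '?' markers mapped to SSIS variables on the Parameters page:)
    -- Parameterized query typed into the OLE DB Source SQL Command box.
    -- '?' is not executable T-SQL on its own; SSIS supplies the mapped variables at run time.
    SELECT OrderID, CustomerID, OrderDate, TotalDue
    FROM   dbo.Orders
    WHERE  OrderDate >= ?   -- mapped to e.g. User::StartDate
      AND  OrderDate <  ?;  -- mapped to e.g. User::EndDate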
• OLE DB Destination
• OLE DB Destination loads data into a variety of OLE DB-supported database table or view or SQL command (E.g. Microsoft Office Access and SQL Server databases).
• Access modes in OLE DB Destination adapter:
 Table or view
 Table or view - fast load (default): a table or view using fast-load options (batch by batch instead of row by row). You can specify an existing table or create a new table; loading uses batch mode.
-Options: Keep identity (keep identity column values), Keep nulls (keep null values), Table lock (checked by default; acquires a table-level lock on the destination while performing the bulk operation), Check constraints (checked by default; records that satisfy the check constraints are inserted, others are not), Rows per batch (-1 by default), Maximum insert commit size.
 Table name or view name variable
 Table name or view name variable-fast load: a table or view specified in a variable using fast-load options.
 SQL command
-OLE DB Destination does NOT support parameters.

Part 3: Flat File & Excel, Source and Destination Adapters in DFT
• Flat File Source:
• Flat File Source extracts data from a text file (Unicode or ANSI)
• Text file can have three formats.
 Delimited format: uses column and row delimiters to define columns and rows.
 Fixed width format: uses width to define columns and rows. This format also includes a character for padding fields to their maximum width.
 Ragged right format: uses width to define all columns, except for the last column, which is delimited by the row delimiter.
• Other properties:
 Text qualifier: the character (e.g. a double quote) that wraps string values; values inside the qualifier are treated as a single column value and the qualifier characters themselves are stripped
 Header row delimiter: differentiates header rows from data rows
(Options: {CR}{LF}, {CR}, {LF}, Semicolon{;}, Colon{:}, Comma{,}, Tab{t}, Vertical bar{|}) {CR}{LF} (Carriage Return/Line Feed): Windows operating systems use it as the row delimiter by default; Tab{t} is used as the column delimiter for flat files by default;
 Header rows to skip: if the header spans 2-3 lines, specify that here
 Column names in the first data row: check it if the very first record in the flat file contains the column names
 Advanced: allows us to manipulate the columns
 Retain null values from the source as null values in the data flow: check it if you want to load null values into the destination
• Flat File Destination:
• Flat File Destination loads data into a text file (Unicode or ANSI).
• Text file can be in delimited, fixed width, fixed width with row delimiter, or ragged right format.
(Overwrite data in the file: check it if you want to delete the previous records each time you re-execute the package)
• Configuration in two ways:
 Provide a block of text that is inserted in the file before any data is written. Text can provide information such as column headings.
 Specify whether to overwrite data in a destination file that has the same name.
• Excel Source:
• Excel Source extracts data from Microsoft Excel Workbooks.
• It uses an Excel connection manager.
• Access modes in Excel Source: (same as OLE DB)
 Table or view
 Table name or view name variable: Sheet name
 SQL Command
 SQL Command from Variables
• Excel Destination:
• load data into Microsoft Excel worksheet.
• It uses an Excel connection manager.
• Access modes in Excel Destination:
 Table or view
 Table name or view name variable: Sheet name
 SQL Command
 Errors I had when using Excel Destination: (Oftentimes interview question too)
–[Connection manager “Excel Connection Manager 1”] Error: The requested OLE DB provider Microsoft.ACE.OLEDB.12.0 is not registered. If the 64-bit driver is not installed, run the package in 32-bit mode. Error code: 0x00000000. An OLE DB record is available. Source: “Microsoft OLE DB Service Components” Hresult: 0x80040154 Description: “Class not registered”.
 Solution:
–Solution explorer>Project_name: SSIS in this case>right click: Properties (i.e. Project properties)>Configuration Properties: Debugging>Debug Options: set Run64BitRuntime to FALSE
Part 4: Other Source and Destinations in DFT
• ODBC (Open Database Connectivity): there is no advantage to ODBC compared to OLE DB, because Microsoft developed OLE DB based on ODBC; OLE DB is therefore simpler, faster, more compatible, and has more options
• ODBC Source: (New in 2012 SSIS)
• ODBC extracts data from ODBC-supported database table or a SQL statement.
• It uses an ODBC connection manager.
• Access Modes:
 Table name
 SQL command
• Operation modes: the FetchMethod property
 Batch
 Row-by-Row
• ODBC Source has one regular output and one error output.
• ODBC Destination:
• load data into ODBC-supported database.
• uses an ODBC connection manager.
• ODBC destination includes mappings between source columns and destination columns.
• ODBC destination has one regular output and one error output.
• Operation Modes in ODBC Destination: by the FetchMethod property
 Batch (default)
 Row-by-Row
• ADO.Net Source (ActiveX Data Object.Net):
• extract data from a .NET provider.
• You can connect to Microsoft Windows Azure SQL Database using ADO.Net Source.
• Connecting to SQL Database using OLE DB is not supported.
• uses an ADO.NET connection manager.
• Access Modes in ADO.NET Source:
 Table name: a table or view.
 SQL command: the results of an SQL statement.
• ADO NET source has one regular output and one error output.
• ADO.Net Destination:
• load data into a ADO.NET-supported database.
• You have the option to use an existing table or view, or create a new table and load the data into the new table.
• You can connect to Microsoft Windows Azure SQL Database using ADO.NET destination.
• Connecting to SQL Database using OLE DB is not supported.
• uses an ADO.NET connection manager.
• Access modes in ADO.NET Destination:
Existing table or view,
Create new table
• Raw File Source:
• read raw data from a file that was previously written by a Raw File destination (from a different data flow, because you cannot create a raw file yourself).
• Because raw file is native to the source, the data requires no translation and almost no parsing.
• Raw File source can read data more quickly than other sources such as Flat File and OLE DB sources.
• can point to an empty raw file that contains only the columns (metadata-only file).
• This source has one output. It does not support an error output.
• This source does not use a connection manager.
• Access Modes in Raw File Source:
Name of the file
File name from Variable
• Raw File Destination:
• write raw data (semi-processed data) to a file.
• data requires no translation and little parsing.
• Raw File destination can write data more quickly than other destinations.
• You can configure Raw File destination in the following ways:
 access mode
name of the file
 Indicate whether the Raw File destination appends data to an existing file that has the same name or creates a new file.
• Raw File destination supports null data but not Binary Large Object (BLOB) data.
• does not use a connection manager.
• Access Modes in Raw File Destination:
Name of the file
File name from Variable
• WriteOption property includes:
Create always: always create a new file.
Create once: create a new file; if the file exists, the component fails.
Truncate and append: truncate an existing file and then write data to the file; the metadata of the appended data must match the file format.
Append: append (insert) data to an existing file; the metadata of the appended data must match the file format.

• XML Source:
• read data from an XML file.
• Can use a XML Schema Definition (XSD) file or inline schemas to translate the XML data.
• Access Modes:
 XML File
 XML File from Variable
 XML data from variable
• SQL Server Destination:
• loads bulk data into local SQL Server.
• uses an OLE DB connection manager.
• For loading data into local SQL Server, you should consider using SQL Server destination instead of OLE DB destination.
• You cannot access a SQL Server database on a remote server. (Instead, use OLE DB destination to access a remote server.)
• uses a fast-load access mode.
• SQL Server destination has one input. It does not support an error output.
• SQL Server Compact Destination:
• write data to SQL Server Compact databases.
• uses an SQL Server Compact connection manager that specifies the OLE DB provider.
• SQL Server Compact destination has one input and does not support an error output.
–SQL Server Integration Services
• SQL Server Compact 4.0 does not support SQL Server Integration Services (SSIS). In order to use this feature you should use SQL Server Compact 3.5 SP2.
–SQL Server Management Studio
• Starting from SQL Server Compact 4.0, SQL Server Compact does not support SQL Server Management Studio. Transact-SQL Editor in the Visual Studio 2010 Service Pack 1 can be used to run T-SQL queries and to view the estimated and the actual query plans for a T-SQL query for the SQL Server Compact database.
 Part 5: Data Flow Transformation
• used to apply business rules on data (such as to aggregate, merge, distribute, and modify data).
• can also perform lookup operations and generate reference datasets (lookup transformation btw two datasets (input/reference) to find matching values)
• can also perform data cleansing and data standardization. (dataset with quality issues such as truncated or duplicated records)
Transformation Types:
– Row Transformations: take record from input, perform it, export to destination or next transformation; row by row basis
Character Map/Copy Column/Data Conversion/Derived Column/OLE DB Command
– Split and Join Transformations: divide or combine datasets
Conditional Split/Multicast/Union All/Merge/Merge Join/Lookup/Cache Transform
– Rowset Transformations: load all data and then perform operations on entire dataset
Aggregate/Sort
– Auditing Transformations: do certain kind of auditing
Audit/Row Count
– Business Intelligence Transformations: implement business intelligence strategies
Fuzzy Lookup/Fuzzy Grouping/Slowly Changing Dimension

asynchronous /eɪˈsɪŋkrənəs/: not occurring at the same time
Synchronous/Asynchronous Transformations: classified based on the way they deal with buffers
The data pipeline is essentially a buffer (physical memory)
– Synchronous: The output of the transformation uses the same buffer as the input.
• Number of records IN == Number of records OUT
• Non-Blocking Transformations: does NOT load the entire dataset. e.g. Derived column, data conversion
– Asynchronous: The output of the transformation uses a new buffer from the input.
• Number of records IN MAY/MAY NOT == Number of records OUT
• Semi-Blocking and Full-Blocking Transformations
Semi-Blocking: loads partial records (e.g. 50 out of 1000) and then performs the transformation, e.g. Merge, Merge Join, Union All
Full-Blocking: loads the entire dataset and then performs the transformation, e.g. Sort or Aggregate
• Differences between Non-Blocking, Semi-Blocking, Full-Blocking transformations are listed below:

Row Transformations
• Character Map: apply string functions to character data.
 operates only on columns of a string data type.
 has one input, one output, and one error output
 can convert column data in place or add converted data in the new column.
 Configuration:
 Specify the columns to convert.
 Specify the operations to apply to each column.
• Copy Column: copy input columns and add new columns to the output
 can create copies of a column, or multiple columns in one operation.
 has one input, one output. It does not support an error output.
 Configuration: specifying the input columns to copy
• Derived Column: create new columns based on input columns.
 An expression can have variables, functions, operators, and columns from the input.
 It can define multiple derived columns, and any variable or input columns can appear in expressions.
 The result can be added as a new column or inserted into an existing column as a replacement.
 Perform the following tasks:
 Concatenate data from different columns into a derived column.
E.g. FullName
 Extract characters from string data using functions, and then store the result in a derived column.
E.g. extract a person’s initial from the FirstName column, by using the expression SUBSTRING (FirstName, 1, 1).
 Apply mathematical functions to numeric data.
E.g. change the length and precision of a numeric column, SalesTax, to a number with two decimal places, by using the expression ROUND (SalesTax, 2)
 Create expressions that compare input columns and variables.
 Extract parts of a datetime value.
E.g. use the GETDATE and DATEPART functions to extract the current year, by using the expression DATEPART(“year”, GETDATE()).
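(The examples above use the SSIS expression language; a rough T-SQL equivalent of the same logic, assuming a hypothetical dbo.Person table with FirstName, LastName, and SalesTax columns:)
    SELECT FirstName + ' ' + LastName  AS FullName,        -- concatenation
           SUBSTRING(FirstName, 1, 1)  AS Initial,         -- string extraction
           ROUND(SalesTax, 2)          AS SalesTaxRounded, -- numeric rounding
           DATEPART(year, GETDATE())   AS CurrentYear      -- date part extraction
    FROM   dbo.Person;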
• Data Conversion: convert an input column to a different data type and then copy it to a new output column.
 can apply multiple conversions to a single input column.
 perform the following tasks:
 Change the data type
 Set the column length of string data and the precision and scale on numeric data.
 Specify a code page.
Rowset Transformations
• Sort: sort input data in ascending or descending order and copy sorted data to output.
 can apply multiple sorts to an input: each sort is identified by a numeral that determines the sort order.
 The column with the lowest number is sorted first, and so on.
 A positive number denotes that the sort is ascending, and a negative number denotes that the sort is descending.
 Columns that are not sorted have a sort order of 0.
 Columns that are not selected for sorting are automatically copied to the output together with the sorted columns.
 can remove duplicate rows as part of its sort.
 has one input and one output. It does not support error outputs.
• Aggregate: apply aggregate functions to the input and copy results to the output.
 Operations on input:
 Sum, Average, Count, Count Distinct, Minimum, Maximum
 Group by: specify groups to aggregate across
 The comparison options of the aggregation.
 handles null values in the same way as the SQL Server relational database engine: null represents a missing or unknown value; it does not mean the value is zero.
 IsBig property can be set on output for handling big or high-precision numbers: -If a column value may exceed 4 billion or a precision beyond a float data type is required, IsBig should be set to 1.
-For columns used in the GROUP BY, Maximum, or Minimum operations, can NOT set IsBig to 1.
 Configurations:
 Keys: specify the number of groups expected to result,
 KeysScale: specify an approximate number of keys when performing a Group By operation,
 CountDistinctKeys: specify the number of distinct values expected to result,
 CountDistinctScale: specify an approximate number of keys when performing a distinct count operation.
 has one input and multiple outputs. It does not support an error output.
 Asynchronous: it does not consume and publish data row by row. Instead it consumes the whole rowset, performs its groupings and aggregations, and then publishes the results.
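(What the Aggregate transformation computes can be expressed in T-SQL; a sketch assuming a hypothetical dbo.Sales table with Region, CustomerID, and Amount columns:)
    SELECT Region,
           SUM(Amount)                AS TotalAmount,
           AVG(Amount)                AS AvgAmount,
           COUNT(*)                   AS RowCnt,
           COUNT(DISTINCT CustomerID) AS DistinctCustomers,
           MIN(Amount)                AS MinAmount,
           MAX(Amount)                AS MaxAmount
    FROM   dbo.Sales
    GROUP BY Region;  -- as in the transformation, NULL keys form one group and
                      -- aggregates other than COUNT(*) ignore NULL values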
Split Transformations
• Union All (Semi-blocking): combine multiple inputs into one output.
 The inputs are added to the output one after the other; no reordering (sorting) of rows occurs.
 At least one column from the input must be mapped to each output column. The mapping requires that the metadata of the columns match.
 Input columns not mapped to output columns are set to nulls in the output.
 has multiple inputs and one output. It does not support an error output.
• Merge (Semi-blocking): combine two sorted datasets into one.
 The rows from both datasets are inserted into the output based on their key columns.
 Perform the following tasks:
 Merge data from two data sources, such as tables and files.
 Create complex datasets by nesting Merge transformations.
 Remerge rows after correcting errors in the data.
 Inputs must have matching metadata; columns with the same metadata are mapped automatically.
 has two inputs and one output. It does not support an error output.
• Merge Join (Semi-blocking): combine two sorted datasets into one by using a FULL, LEFT, or INNER join.
 Joined columns must have matching metadata.
 has two inputs and one output. It does not support an error output.
 Configuration:
 Specify the join is a FULL, LEFT, or INNER join.
 Specify the columns the join uses.
 Specify if it handles nulls as equal to other nulls. (Set from property)
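(Conceptually, Merge Join does what a T-SQL join does over two inputs already sorted on the join key; a sketch with hypothetical dbo.Customer and dbo.Orders tables:)
    SELECT c.CustomerID, c.CustomerName, o.OrderID, o.TotalDue
    FROM   dbo.Customer AS c
    INNER JOIN dbo.Orders AS o        -- LEFT or FULL OUTER JOIN map to the other two options
           ON c.CustomerID = o.CustomerID
    ORDER BY c.CustomerID;            -- both SSIS inputs must already be sorted on the join key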
• Conditional Split: split input to different outputs depending on the condition.
 similar to a CASE decision structure in a programming language.
 evaluates expressions, and based on the results, directs the data row to the specified output.
 Configuration:
 Provide an expression that evaluates each condition.
 Specify the evaluation order for the conditions. Order is significant, because a row is sent to the output corresponding to the first condition that evaluates to true.
 Specify the default output.
 has one input, multiple outputs, and one error output.
 also has a default output: if a row matches no expression it is directed to the default output.
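(The CASE analogy can be sketched in T-SQL; the table name and thresholds below are hypothetical, and each branch corresponds to one Conditional Split output:)
    SELECT OrderID,
           TotalDue,
           CASE
               WHEN TotalDue >= 10000 THEN 'HighValueOutput'
               WHEN TotalDue >= 1000  THEN 'MediumValueOutput'
               ELSE 'DefaultOutput'   -- rows matching no condition go to the default output
           END AS RoutedOutput
    FROM dbo.Orders;
    -- As in the transformation, the first condition that evaluates to true wins.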
• Multicast: split the input to one or more outputs. (similar to Conditional Split: one input to multiple outputs.)
 directs every row to every output, while Conditional Split directs a row to one single output.
 Configuration: adding outputs.
 has one input and multiple outputs. It does not support an error output.
Auditing Transformations
• Row Count: count the number of rows as they pass through a data flow and store the final count in a variable.
 The variable must already exist, and must be in the scope of the Data Flow task using this transformation.
 can be used to update the variables used in scripts, expressions, and property expressions.
 The row count value is stored in the variable only after the last row has passed through the transformation; therefore, the variable is not updated while the data flow is still running.
 This transformation has one input and one output. It does not support an error output.
• Audit: enables the data flow to include data about the environment in which the package runs. Such as the name of the package, computer, and operator.
 SSIS has system variables that provide this information. You can map a single system variable to multiple columns.
 configuration:
-provide the name of a new output column;
-map the system variable to the output column.
 has one input and one output. It does not support an error output.
 system variables that Audit transformation can use.
System variable       | Index | Description
ExecutionInstanceGUID | 0     | The GUID that identifies the execution instance of the package.
PackageID             | 1     | The unique identifier of the package.
PackageName           | 2     | The package name.
VersionID             | 3     | The version of the package.
ExecutionStartTime    | 4     | The time the package started to run.
MachineName           | 5     | The computer name.
UserName              | 6     | The login name of the person who started the package.
TaskName              | 7     | The name of the Data Flow task with which the Audit transformation is associated.

Lookup Transformations
• Lookup: perform lookups by joining columns in input and reference dataset.
 Input dataset can be anything
 Reference dataset can be an existing table or view (OLE DB supported), a new table, or the result of an SQL query or a cache file.
 uses either an OLE DB connection manager or a Cache connection manager to connect to the reference dataset.
 supports the following database providers for the OLE DB connection manager:
 SQL Server/Oracle/DB2
 perform an equi-join between the input and the reference dataset: each row in the input must match at least one row from the reference. If an equi-join is not possible, the transformation takes one of the following actions:
 If there are no matching entries in the reference dataset, no join occurs. By default, the transformation treats rows without matching entries as errors, but we have the option to configure the transformation to redirect them to a no-match output.
 If there are multiple matches in the reference dataset, the transformation returns only the first match returned by the lookup. It generates an error or warning about multiple matches only when it is configured to load the whole reference dataset into the cache. (The lookup cache and the data flow buffers both live in memory/RAM.)
 The transformation supports join columns with any data type, except for DT_R4, DT_R8, DT_TEXT, DT_NTEXT, or DT_IMAGE.
 Caching Modes:
 Full Cache (default): the reference dataset is queried once during the pre-execute phase of the data flow, and the entire reference dataset is pulled into cache memory.
This mode uses the most cache memory, and adds extra startup time for the data flow, as all of the caching takes place before any rows are read from the data flow source(s). The tradeoff is that the lookup operations will be very fast during execution. One thing to note is that the lookup will not swap memory out to disk, so if you run out of memory your data flow will fail. Not real time.
 Partial Cache: the cache memory starts off empty at the beginning of the data flow; when a new row comes in from the input, the transform queries the cache for matches:
-If there is no match, it queries the reference dataset.
-If a match is found, the values are stored in the cache memory.
Since no caching is done during the pre-execute phase, the startup time is less than full cache. However, execution is slower because the transformation queries the reference dataset more often.
 No Cache: the transform doesn't maintain a lookup cache (actually, not quite true: it keeps the last match around, since the cache memory has already been allocated; once the transformation is done, the record is flushed from the cache memory). Slowest in the execution phase, as you'll be hitting the reference dataset for every row.
 Lookup Output
 Match: there are matches in the reference dataset
 No Match: rows for which you have configured no-matching entries to go to the no-match output
 Error: if something goes wrong during execution, you can log the records that cause issues for error-handling purposes
 E.g. most commonly used in an ETL strategy for performing incremental loads: the DWH customer table contains 5 records with CustomerID and CustomerName; from time to time you update the DWH with new customers. The source Customer table is the input table with IDs 1-10; the DWH table is the reference table with IDs 1-5, so there are 5 matching IDs and 5 non-matching IDs. IDs 6-10 are no-match entries and can be inserted; IDs 1-5 can be ignored or updated. (See the T-SQL sketch below.)
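(The incremental-load pattern above, sketched in T-SQL under the assumption of hypothetical stg.Customer source and dwh.DimCustomer reference tables; the Lookup no-match output corresponds to the rows this query inserts:)
    INSERT INTO dwh.DimCustomer (CustomerID, CustomerName)
    SELECT s.CustomerID, s.CustomerName
    FROM   stg.Customer AS s
    LEFT JOIN dwh.DimCustomer AS d
           ON d.CustomerID = s.CustomerID
    WHERE  d.CustomerID IS NULL;   -- the "no match" rows (IDs 6-10 in the example)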
• Cache Transformation: create a reference dataset for lookup transformation (preparation for Lookup)
 use a Cache connection manager to configure the Lookup transformation in the full cache mode: the reference dataset is loaded into cache before the transformation.
 writes only unique rows to the Cache connection manager.
 In a single package, only one Cache Transform can write data to the same Cache connection manager. If the package contains multiple Cache Transforms, the first Cache Transform that is called when the package runs, writes the data to the connection manager. The write operations of subsequent Cache Transforms fail.
 Configuration:
 Specify the Cache connection manager.
 Map input columns to destination columns in the Cache connection manager, and column data types must match.
Business Intelligence Transformation
• Fuzzy Lookup: perform data cleaning tasks such as data standardizing, data correcting, and providing missing values.
(Fuzzy Lookup differs from Lookup in its use of fuzzy match. Fuzzy Lookup requires at least one column to be configured for fuzzy match.)
 The match can be an exact match or a fuzzy match btw input and reference.
 Exact matching can use any DTS data type except DT_TEXT, DT_NTEXT, and DT_IMAGE.
(DTS:DataType=“0”:DBNull;“3”:Int16/32 (Yes same datatype for both);“4”:Single;“5”:Double; “7”:Datetime; “8”:String; “11”:Boolean; “16”:Sbyte; “17”:Byte;“18”:Char;“19”:UInt32; “21”:UInt64)
 Fuzzy matching can only use input columns with DT_WSTR and DT_STR data types.
 Fuzzy matching can return close matches in reference table (which must be in a SQL Server database)
 Configurations: (a token is a unit of text produced by splitting on the delimiters)
-Maximum number of matches: how many records will be returned from the reference table,
-Token delimiters: characters used to split values for fuzzy/exact matching, such as carriage return, line feed, tab, space, etc.
-Similarity threshold: a value that describes how the fuzzy algorithm compares records from the input dataset with the reference dataset and decides whether they are a close match (ranges from 0 to 1)
 Each match output includes the following columns:
 _Similarity: a column that describes the similarity between values in the input and reference columns.
 _Confidence: a column that describes the quality of the match.
 The transformation provides a default set of delimiters to tokenize the data, but you can add token delimiters to suit the needs of your data. The Delimiters property contains the default delimiters. Tokenization is important because it defines the units within the data that are compared to each other.
 The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates.
 If the Exhaustive property is set to True, the transformation compares every row in the input to every row in the reference. This comparison algorithm may produce more accurate results, but it is likely to make the transformation perform more slowly unless the number of rows in the reference table is small.
 If the Exhaustive property is set to False, the transformation returns only matches that have at least one indexed token or substring.
 If the WarmCaches property is set to True, the index and reference table are loaded into memory. When the input has many rows, setting the WarmCaches property to True can improve the performance of the transformation. When the number of input rows is small, setting the WarmCaches property to False can make the reuse of a large index faster.
 This transformation has one input and one output.
• Fuzzy Grouping: perform data quality tasks by identifying rows that are likely to be duplicates and selecting a canonical row for data standardizing.
 Configuration
 Select the input columns to use when identifying duplicates
 Select the type of match (fuzzy/exact) for each column.
 Features to customize the grouping:
 Token delimiters: a default set of delimiters is used to tokenize the data, but new delimiters can be added to improve the tokenization of data.
 Similarity threshold: indicates how strictly the transformation identifies duplicates (by setting the MinSimilarity property at the component and column levels. The column-level similarity threshold is only available to a fuzzy match. The similarity range is 0 to 1. The closer to 1 the threshold is, the more similar the rows and columns must be to qualify as duplicates. The component level threshold similarity suggests all rows across all columns have a similarity greater than or equal to the similarity threshold specified.)
 Exact match guarantees that only rows that have identical values in the column will be grouped;
Fuzzy match guarantees that rows that have approximately the same values in the column will be grouped.
 Exact matching can use columns of any Integration Services data type except DT_TEXT, DT_NTEXT, and DT_IMAGE. (Integration Services Data Type=DTS Data Type)
 Fuzzy matching can use only columns of DT_WSTR and DT_STR data types.
 has one output row for each input row, with the following additional columns:
 _key_in: a column that uniquely identifies each row.
 _key_out: a column that identifies duplicate rows. The column has the value of the _key_in column in the canonical row. (Rows with the same value in _key_out belong to the same group.)
 _score: a value between 0 and 1 that indicates the similarity of the input row to the canonical row.
 The FuzzyComparisonFlags property is set to specify how the transformation compares string data.
 ExactFuzzy property is set to specify whether the transformation performs a fuzzy/exact match.
 If the Exhaustive property is set to true, the transformation compares every row in the input to every other row in the input. This may produce more accurate results, but it is likely to make the transformation run more slowly unless the number of rows in the input is small. (To avoid performance issues, it is advisable to set the Exhaustive property to true only during package development.)
• DQS Cleansing (Introduced in SQL 2012):
 Use a DQS (Data Quality Services) knowledge base to perform data quality tasks (data correcting, enrichment, standardizing, and de-duplicating).
 Unlike fuzzy lookup and grouping, DQS depends on domain specific knowledge bases (KB). KBs are built outside SSIS using DQS functionality.
(Knowledge Base (KB) is a repository of knowledge about your data that enables you to understand your data and maintain its integrity)
 Configuration:
 Provide DQS Server Name
 Specify Domains from KB to use for data cleansing
 Specify Columns to be cleansed in Mapping section
 has one input and one output and one error output.
 The transformation produces one output row for each input row. The output contains columns from the input plus the following columns:
 Cleansed Columns: Cleansed data
 Column-level Cleansing Status: Result of the correction operation for each column selected for cleansing.
-Correct: original data (from input) was correct and therefore not modified
-Corrected: original data was not correct, and a correction value with a confidence level higher than the auto-correction threshold value was identified
-Auto Suggest: original data was not correct, and a correction value with a confidence level higher than the auto-suggestion threshold value but lower than the auto-correction threshold value was identified
-New: original data are consistent with domain rules, and could be a valid value not available in knowledge base (or a correction value with a confidence level lower than the auto-suggestion threshold is identified)
-Invalid: original data is invalid for the domain, failed a domain rule, or the DQS server could not determine the validity of the original data
 Column-level Output Reason: Reason for the result of correction operation on input values in each column selected for cleansing is added to output.
-Domain Value: input data is a valid domain value. Corresponding column-level cleansing status is correct
-DQS Cleansing: the input data is corrected or a correction value was suggested by DQS. Corresponding column-level cleansing status is Corrected or Auto-suggest
-New Value: the input data is considered a new value for adding to the domain. Corresponding column-level cleansing status is new
 Column-level Confidence Score: The measure of confidence on the correction or suggestion from the DQS server (range from 0 to 1); is available only for the rows with status as Corrected or Auto Suggest.
 Row-Level Cleansing Status: Overall status aggregating all column-level cleansing status message. (correct, corrected, auto-suggest, new, invalid)
Data Flow Transformations:
• OLE DB Command: runs a DML statement (insert, update, or delete) for each row in a data flow.
 Configuration:
 Provide the SQL statement that the transformation runs for each row.
 Specify the number of seconds before the SQL statement times out
 Specify the default code page.
 Typically, SQL statement includes parameters. The parameter values are stored in external columns in the transformation input, and mapping an input column to an external column maps an input column to a parameter.
 This transformation has one input, one regular output, and one error output.
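(A typical statement placed in an OLE DB Command transformation; '?' markers are mapped to input columns and resolved per row by SSIS, so this is not standalone T-SQL. Table and column names are hypothetical:)
    UPDATE dbo.Customer
    SET    Email      = ?,   -- Param_0 -> input column Email
           ModifiedOn = ?    -- Param_1 -> input column LoadDate
    WHERE  CustomerID = ?;   -- Param_2 -> input column CustomerID
    -- The statement runs once for every row that flows through the transformation.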
• SCD (Slowly Changing Dimension) Transformation: non-blocking, but often not preferred because it is very slow
 performs the updating and inserting of records in dimension tables of a DWH. (We have historical records in the DWH or data mart; from time to time we update or insert into the DWH or data mart from OLE DB databases. To track these changes we use the SCD transformation.)
 It requires at least one business key column (in a DWH or data mart, fact and dimension tables are related via business keys; to establish the relationship from OLE DB to the data mart, and among data mart tables to fetch a dimension, we need a business key to track changes/maintain the history)
 Can be configured using either a Time Stamp (e.g. start date & end date) or a Flag (e.g. 0/1) column.
 SCD transformation supports:
-Changing attribute - Type 1: changing attributes overwrite the existing value in the dimension table; history is not preserved (e.g. marital status)
-Historical attribute - Type 2: changing attributes do NOT update historical records; full history is maintained in the dimension table (e.g. customer address, customer order history)
The only change that is permitted in an existing record is an update to the column that indicates whether the record is current or expired.
-Fixed attribute - Type 0: attributes do not change, and no history is maintained in the dimension table (e.g. SSN, CustomerID)
The SCD transformation does not support Type 3 changes because they require the dimension table structure to be modified.

Type 0 | Fixed; no history preservation
Type 1 | Overwrite old data with new data
Type 2 | Create a new record via tuple versioning
Type 3 | Maintain changed data in a separate column
Type 4 | Maintain changes in a separate history table
Type 6 | Concept derived by combining Types 1, 2, and 3

Type 0 | Data that never changes; no history preserved (e.g. DOB)
Type 1 | Data that changes; only the most recent value is kept (e.g. marital status)
Type 2 | Data that changes; all records are kept (e.g. credit history)
Type 3 | Data that changes; only a certain number of versions are kept
Type 4 | Use two tables: recent values and historical values
Type 6 | Combine features of Types 1, 2, and 3

 Performs the following operations for slowly changing dimension tables:
• Match incoming rows with rows in the lookup table to identify new and existing rows.
• Identify incoming rows that contain historical changes that require insertion of new records and updating of expired records.
• Detect incoming rows that contain changes that require the updating of existing records, including expired ones.
• Identify inferred member records that require updating.
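(A hedged T-SQL sketch of what the SCD outputs do conceptually, assuming a hypothetical dwh.DimCustomer design with a business key, StartDate/EndDate columns, and an IsCurrent flag:)
    DECLARE @BusinessKey int = 1001;
    DECLARE @NewMaritalStatus char(1) = 'M';
    DECLARE @NewAddress nvarchar(100) = N'12 Main Street';

    -- Type 1 (changing attribute): overwrite in place, no history kept.
    UPDATE dwh.DimCustomer
    SET    MaritalStatus = @NewMaritalStatus
    WHERE  CustomerBusinessKey = @BusinessKey;

    -- Type 2 (historical attribute): expire the current row, then insert a new version.
    UPDATE dwh.DimCustomer
    SET    IsCurrent = 0, EndDate = GETDATE()
    WHERE  CustomerBusinessKey = @BusinessKey AND IsCurrent = 1;

    INSERT INTO dwh.DimCustomer (CustomerBusinessKey, Address, StartDate, EndDate, IsCurrent)
    VALUES (@BusinessKey, @NewAddress, GETDATE(), NULL, 1);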
• Script Component Transformation: A SSIS Data Flow item that allows you to write VB or C# scripts on the Data Flow level.
------Control Flow----------
Control Flow is the highest level of object model and most fundamental part. As a work flow environment, it defines
 Job units,
 How job units cooperate with each other (with the help of expressions)
 How job units are scheduled to execute (with the help of precedence constraints)
 Three elements of control flow:
 Tasks are the smallest job unit in a control flow that provide functionality.
 Containers that provide structures in packages,
 Precedence Constraints that connect the executables, containers, and tasks into an ordered control flow.
Control flow task is an atomic job unit that is used to create control flow that meets business requirements.
• Different type of Control Flow Tasks:
– Data Flow Task
– Bulk Insert Task
– File System Task
– Execute Process Task
– Execute Package Task
– Execute SQL Task
– Script Task
– Data Profiling Task
– FTP Task
– Send Mail Task
– CDC Control Task, etc.
Container: provide structure in packages for grouping tasks and implementing repeating control flows.
• Different type of Containers:
– For Loop
– For Each Loop
– Sequence
Precedence constraint: link executables, containers, and tasks to define data integration logics.
• Three types:
– Success: requires the precedence (parent) executable to complete successfully for the constrained executable to run.
– Failure: requires the precedence executable to fail for the constrained executable to run.
– Completion: requires only that the precedence executable has completed, regardless of outcome, for the constrained executable to run.
• Configuration:
– Specify an evaluation operation.
– Specify an execution result.
– Specify Expression
– Specify if precedence constraint is evaluated singly or together with other constraints.

Bulk Insert Task: transfer data from a text file into a SQL Server table or view.
• The source must be a text file, on local or remote server.
• If the file is on a remote server, you must specify the file name using the Universal Naming Convention (UNC) name in the path.
• The destination must be an existing table or view in a SQL Server database.
• Transformations can NOT be performed using the Bulk Insert Task.
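(The Bulk Insert Task essentially wraps the T-SQL BULK INSERT statement; a rough equivalent, with a hypothetical file path, table name, and delimiters:)
    BULK INSERT dbo.StagingCustomer
    FROM 'C:\Data\customers.csv'
    WITH (
        FIELDTERMINATOR = ',',   -- column delimiter
        ROWTERMINATOR   = '\n',  -- newline; treated as CRLF for character data by default
        FIRSTROW        = 2      -- skip the header row
    );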

File System Task: perform operations on files and directories in the file system (Single file or a directory at a time).
• Can be used to set attributes on files and directories.
• Source and Destination file or directory can be specified by:
– A File connection manager
– A Variable
• Performs the following operations:
– Copy File, Directory
– Create/Delete Directory or Directory Contents
– Delete File
– Move Directory, File
– Rename File
– Set Attributes

Execute Process Task: runs an application or batch file (.bat) as part of a SSIS package workflow.
• can open any standard application (e.g. Notepad, MS Excel, MS Word, CMD).
• Configuration:
– RequireFullFileName
– Executable
– Arguments: output
– WorkingDirectory: input/source
– StandardInputVariable
– StandardOutputVariable
– StandardErrorVariable
– FailTaskIfReturnCodeIsNotSuccessValue
– SuccessValue
– TimeOut
– TerminateProcessAfterTimeOut
– WindowStyle

Execute Package Task: run other packages as part of a workflow.
• Perform the following operations:
– Break down complex package workflow
– Reuse parts of packages
– Group work units
– Control package security
• The package that runs other packages is called the parent package, and the packages that a parent workflow runs are called child packages. (One parent can have multiple child packages.)
• Child packages can be called as follows:
– Project Reference: use PackageNameFromProjectReference property (under Package) to specify child packages
– External Reference
• Values can be passed from Parent to Child as follows:
– By using Parent Package Configurations
– By using Parameters:
• We can NOT pass a value directly from a child package to a parent package, but we CAN go the other way: parent-to-child values are passed indirectly through parameters (not variables; parameters, at package or project scope, are designed for this).
• Parent package variable can be defined in the scope of the Execute Package task or in a parent container such as the package.
• ExecuteOutOfProcess property of the Execute Package task can be configured to optimize the performance

Execute SQL Task: runs SQL statements (if multiple, one by one in sequential order) or stored procedures from a package.
• Performs the following tasks:
-Create, alter, and drop database objects (DDL operations)
-Truncate a table or view before inserting data
-Re-create fact and dimension tables before loading data
-Run stored procedures
-Save the rowset returned by a query into a variable
• Supported Connection Managers are:
-Excel
-OLE DB
-ODBC
-ADO and ADO.NET
-SQL MOBILE
• Source of SQL statements:
-Direct: SQL statement specified directly inside the task (multiple statements can be executed, except SELECT statements)
-Variable: a variable containing the SQL statement (e.g. INSERT INTO using a parameter)
-File Connection: connection to a SQL file using a file connection manager
• Depending on the type of SQL statement, the result set property can be configured. (See the sketch below.)
• The GO command can be used to separate multiple statements into batches.
• Three parameter types: Input, Output, and ReturnValue for stored procedures and SQL statements.
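(A minimal sketch of statements as typed into an Execute SQL Task with Direct input and an OLE DB connection manager, where '?' markers map to SSIS variables on the Parameter Mapping page; table and variable names are hypothetical:)
    -- Parameterized statement ('?' is resolved by SSIS, not standalone T-SQL):
    DELETE FROM dbo.StagingCustomer
    WHERE  LoadDate < ?;             -- mapped to e.g. User::CutoffDate

    -- A query whose single-row result is captured into an SSIS variable
    -- via the Result Set page (ResultSet = Single row):
    SELECT COUNT(*) AS RowCnt
    FROM   dbo.StagingCustomer;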

Execute T-SQL Statement Task: runs Transact-SQL statements.
-supports only the Transact-SQL version of the SQL language, NOT other dialects.
-Configuration:
• Use ADO.Net connection manager only
• Specify Execution time out.
• Specify T-SQL Statement.
• edit the Connection Manager to specify the database to be used.
-can NOT be used for:
• Running parameterized queries
• Saving the query results to variables
• Using property expressions

Execute SQL Vs Execute T-SQL Statement Task:
• Execute T-SQL Statement task supports only ADO.NET.
• Execute T-SQL Statement task can NOT be used for:
– Running parameterized/pə’ræmɪtəˌraɪz/ queries
– Saving the query results to variables
– Using property expressions
• Execute T-SQL Statement task supports only Transact-SQL version of the SQL language.
• Execute T-SQL Statement task takes less memory, less parse time, and less CPU time than the Execute SQL task.

Script Task: perform functions that are not available in the built-in tasks and transformations.
• can access external .NET assemblies in scripts by adding references in the project.
• Configuration:
-Specify the script language.
-Specify the method in the VSTA project that the Integration Services runtime calls as the entry point into the Script task code.
-Optionally, provide read-only or read/write variables.
• Perform the following purposes:
-Access data by using other technologies NOT supported by built-in connection types.
-Create a package-specific performance counter.
-Count number of records in file to initiate data load process
Main Function
• use Microsoft Visual Studio Tools for Applications (VSTA) as the environment for writing and executing Script.
• Script can be written using VB.NET and C#.NET programming languages.
• To run a script, VSTA must be installed on the computer where the package runs.
• To work with each row of data in a set, you should use the Script component instead of the Script task.

Data Profiling Task: analyze data in a table or view to understand its quality.
• Helps identify data quality problems.
• Table or View used has to be in SQL Server.
• Use ADO.NET connection manager to connect a SQL Server Instance.
• Data Profile analyzed is written to File or SSIS Variable in XML format.
• Account must have read/write permissions on Tempdb database.
• Perform following operations:
-Column Based Profiling
Column Length Distribution
Column Null Ratio
Column Pattern
Column Statistics
Column Value Distribution
-Multiple Column Based/Datasets Profiling
Candidate Key
Functional Dependency
Value Inclusion
• For empty table or column, it has two cases:
-When the table or view is empty, it does not compute any profiles.
-When values in selected column are null, it computes only Column Null Ratio profile.
• Configuration options:
-Wildcard columns (*)
-Quick Profile
• Data Profile Viewer is used to review the profile output.
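For intuition, the Column Null Ratio profile is roughly what the hand-written query below computes (table and column names are hypothetical); the task itself emits the result as XML for the Data Profile Viewer:

SELECT
    COUNT(*) AS TotalRows,
    SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) AS NullRows,
    CAST(SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) AS float) / NULLIF(COUNT(*), 0) AS NullRatio
FROM dbo.Customer;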

Send Mail Task: send an e-mail message from SSIS Package.
• use SMTP Connection Manager to connect to mail server.
• Configuration:
-Specify SMTP connection manager (anonymous authentication or Windows Authentication).
-Specify sender and recipients on from, To, Cc, and Bcc lines.
-Specify a subject line.
-Provide the message text: direct Input, from File or Variable.
-Set the priority level of the message: Normal, Low or High
-Include attachments: multiple attachments can be specified by the pipe character(|).
• use a File connection manager to connect to a file.
SMTP (Simple Mail Transfer Protocol): the protocol used to send email.
POP (Post Office Protocol): the protocol used to receive email.

FTP Task (File Transfer Protocol): download data files and manage directories on servers.
Send or receive any file-based data.
(HTTP: Hypertext Transfer Protocol; a DNS server resolves the website name into the server's IP address)
• Use a FTP connection manager, which should be configured separately from the FTP task.
• Configuration:
-Specify Server Name and Port.
-Credentials for accessing the FTP server (anonymous and basic authentication).
-Specify time-out, number of retries, and the amount of data to copy.
-Indicate Passive or Active mode for FTP connection manager.
• Perform the following operations:
-Copy directories and data files from one directory to another
-Log in to a source FTP location and copy files/packages to a destination directory
-Download files from an FTP location
• Can include a set of predefined operations:
-Send and Receive Files
-Create Local or Remote Directory
-Remove Local or Remote Directory
-Delete Local or Remote Files
• When accessing a local file or a local directory, FTP task uses a File connection manager or path information stored in a variable.
• Wild Cards (‘*’ and ‘?’) can be used to access multiple files.

SFTP Task: send or receive files using secure connection.
• NOT provided in SSIS toolbox; In order to use:
-Download the SSIS SFTP task runtime binary component.
-Copy the DLL from the SSISSFTPTask110.zip archive to "C:\Program Files\Microsoft SQL Server\110\DTS\Tasks\"
-Add it to the GAC using a PowerShell script, for example:
  $publish = New-Object System.EnterpriseServices.Internal.Publish
  $publish.GacInstall("C:\Program Files\Microsoft SQL Server\110\DTS\Tasks\SSISSFTPTask110.dll")
• In SQL Server Data Tools (Visual Studio 2012), refresh the SSIS Toolbox to get the SFTP task.
----------------------------------------Containers-------------------------------
Containers in SSIS: objects in SSIS that provide structure to packages and services to tasks.
• Perform the following functions:
-Repeat tasks for each element in a collection.
-Repeat tasks until a specified expression evaluates to false.
-Group tasks and containers that must succeed or fail as a unit.
• Container Types:
(For/Foreach Loop: can execute the same task multiple times)
-For Loop: use when you know how many times to repeat execution
-Foreach Loop: use when you do not know in advance how many times to repeat; the enumerator decides how many iterations run
-Sequence: groups multiple tasks and executes them sequentially, one by one
-Task Host
For Loop Container: defines a repeating control flow in a package. In each iteration, expression is evaluated and repeats its workflow until the expression evaluates to False.
• Configuration:
-Specify initialization expression.
-Specify an evaluation expression.
-Iteration expression that increments or decrements counter.
• For Loop can include other For Loop containers to implement nested loops.
• Transaction property can be set to define a transaction for a subset of the package control flow.
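A minimal sketch of the three expressions, assuming a hypothetical package variable Counter of type Int32: InitExpression @Counter = 0, EvalExpression @Counter < 10, AssignExpression @Counter = @Counter + 1; the tasks inside the container then run ten times.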
Foreach Loop Container: implements repeating control flow in a package for each member of a specified enumerator.
Example: a folder contains many files; copy each file into a different folder using a Foreach Loop container with a Foreach File enumerator.

• Enumerator types:
-Foreach File enumerator
-Foreach Item enumerator: loops over a collection of items that you define inside the task
-Foreach from Variable enumerator
-Foreach ADO enumerator
-Foreach ADO.NET Schema Rowset enumerator
-Foreach Nodelist enumerator
-Foreach SMO enumerator
• Combination of variables and property expressions is used to update the property of the package object with the enumerator collection value.
• It can include multiple tasks and containers, but it can use only one type of enumerator.
• Transaction attribute can be set on the Foreach Loop container to define a transaction for a subset of the package control flow.
Sequence Container: defines a control flow that is a subset of the package control flow. Sequence container can include multiple tasks and other containers, connected by precedence constraints.
• Perform the following operations:
-Disable groups of tasks for package debugging.
-Manage properties on multiple tasks
-Provide scope for variables
-Group many tasks, for easy management
-Implement transactions
Task Host Container: encapsulates a single task. (e.g. Package is a task host container)
• In SSIS, task host container is not configured separately.
• Task Host Container extends the use of variables and event handlers to the task level.
---------Error Handling in SSIS--------------
At Control Flow Level:

  1. On Failure Precedence Constraint (limited functionality, weaker performance): it only reacts once a task has already failed, so nothing is handled at the individual error level; every task still has to be validated before execution starts, which makes it comparatively slow. This is the simplest, most basic way to handle errors at the Control Flow level.
  2. Event Handlers: Recommended.
    -Event Handlers can be set-up only on Executables (Package, Container and Tasks)
    -Each Executable can have up to 12 different event handlers
    -It is like a trigger: the workflow is automatically executed when the selected event fires during the package execution cycle.
    -Two events are primarily used to handle errors with event handlers: OnError and OnTaskFailed

At Data Flow level
Most Data Flow components provide a specific ERROR OUTPUT, also called the error data pipeline. These components can be configured with one of the following values:
- Fail Component (default): stop executing the package when an error occurs
- Ignore Failure: when an error occurs, the package keeps executing regardless of the error
- Redirect Row: redirect the failing rows to the error output so that the package keeps executing; the error output pipeline is shown as a red path from the component (for example, the Flat File Source).

Difference between Error and Failure
• SSIS engine maintains the hierarchy while implementing solution
Package --> Container --> Tasks
• Error and failure escalation is implemented by default in SSIS (errors and failures are always communicated from a child to its parent)
• Error – an individual component's error message; a package can have multiple errors
(OnError event: triggered once per error, so its workflow is executed separately for each error)
• Failure – one or more errors result in a failure
(OnTaskFailed event: triggered once, so its workflow is executed only once)

Error and Failure Escalation
• By default, errors and failures always escalate to a Control Flow component’s immediate parent
• This may be a problem as multiple event handlers at different levels can trigger for a single error
• Changing the value of the Propagate system variable (inside the event handler) from “True” to “False” stops the error or failure from escalating to its immediate parent
• To escalate failures but not errors: there are separate Propagate variables in the OnError handler (error level) and the OnTaskFailed handler (failure level); set Propagate to False at the error level and to True at the failure level
• It's important to always handle errors at the actual point of failure

Error Configuration in Control Flow
• MaximumErrorCount: Gets or sets an Integer value that indicates the maximum number of errors that can occur before the DtsContainer object stops running.
• FailPackageOnFailure
– Gets or sets a Boolean that indicates whether the package fails when a child container fails. This property is used on containers, not the package itself.
– True indicates that a failure in the container will set the package execution results to failure.
• FailParentOnFailure: Gets or sets a Boolean that defines whether the parent container fails when a child container fails.
----------------Transactions in SSIS-----------------
Transaction: any set of SQL statements or components that follows the ACID properties is called a “transaction”; every RDBMS has transactions.
ACID Properties?
Atomicity: either all components of the transaction are successfully executed or nothing is executed.
Consistency: after a transaction is executed it must leave the database in a consistent state.
Isolation: different transactions in the system are isolated from each other.
Durability: once the transaction is committed, it is committed forever (it can NOT be rolled back).
Transactions are very common in various scenarios where you want to execute multiple functional components as one logical unit.
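As a reminder of how the same atomic behavior looks in plain T-SQL (table names are hypothetical), the two inserts below succeed or fail as one unit:

BEGIN TRY
    BEGIN TRANSACTION;
        INSERT INTO dbo.OrderHeader (OrderID, OrderDate) VALUES (1001, GETDATE());
        INSERT INTO dbo.OrderDetail (OrderID, ProductID, Qty) VALUES (1001, 5, 2);
    COMMIT TRANSACTION;   -- both rows are saved together
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;   -- atomicity: neither row is saved
END CATCH;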

Transactions in SQL Server are conceptually very much the same as Transactions in SSIS; it just differs in the way they are implemented

Transactions in SSIS can be implemented only at Control Flow level meaning only executables can participate in Transaction.

Scope Level of Transaction
-Package Level
-Container Level
-Task Level
(NO data flow level)

MS DTC (Distributed Transaction Coordinator) is a Windows Service which is needed to be “ON” for implementing Transactions in SSIS (Search>services.msc>Distributed Transaction Coordinator>Start the service)

For transactions in SQL Server you do not need DTC to be on, because the database engine handles them itself; SSIS is a tool that relies on Windows, so it relies on DTC being on.

TransactionOption is the property that applies at Package, Containers & Tasks which is used to implement Transactions in SSIS. Three property values:
• Supported (default): the component joins a transaction started by its immediate parent (if the parent has already started one), but never starts one itself.
• Not Supported: the component does not take part in any transaction started by its parent and does not support its parent's transaction.
• Required: the component starts its own transaction, provided that its immediate parent has NOT already started one. If the immediate parent has already started a transaction, “Required” behaves the same as “Supported”. Only the Required value can start a transaction.

For implementing Transactions in SSIS, group all the Tasks together to be part of Transaction using “Sequence Container.”

CheckPoints in SSIS
CheckPoints are the functionality in SSIS that helps re-start package execution from the most recent point of failure.
• With a correct implementation of checkpoints you do not need to re-execute the components that already succeeded in the previous execution cycle, which avoids unnecessary duplication and saves time.
• CheckPoints can be implemented only at Control Flow level meaning only executables can participate in checkpoints implementation.
How to implement CheckPoints:
• At Package Level:
CheckPointFileName Property: specifies the name of the checkpoint file.
CheckPointUsage Property: specifies whether checkpoints are used.
-Never: specifies that the checkpoint file is not used and that the package runs from the start of the package workflow.
-Always: specifies that the checkpoint file is always used and that the package restarts from the point of the previous execution failure. If the checkpoint file is not found, the package fails.
-IfExists: specifies that the checkpoint file is used if it exists. If the checkpoint file exists, the package restarts from the point of the previous execution failure; otherwise, it runs from the start of the package workflow.
SaveCheckPoints Property (default: false, nothing will be saved): indicates whether the package saves checkpoints. This property must be set True to restart a package from a point of failure.

The FailPackageOnFailure property must be set to True for all containers/tasks in the package that you want to identify as restart points. (If set to True, whenever the task fails it explicitly fails the package, and the package registers this as the failure point in the checkpoint file; if set to False, the task triggers an implicit failure in its immediate parent, which eventually triggers an implicit failure in the package. An implicit failure in the package will NOT register the failure point in the checkpoint file.)

ForceExecutionResult property is to test the use of checkpoints in a package. By setting ForceExecutionResult Property of a task or container to Failure, you can imitate real-time failure. When you rerun the package, the failed task and containers will be rerun.

Logging in SSIS

Logging captures run-time information about a package.
Perform the following operations:
-optimize package performance.
-troubleshoot a package.
-audit specific operations performed by a package.
Logging Types:
-SSIS Logging: Built-in SSIS loggings.
-Custom Logging: Defining workflow at Event Handlers.

SSIS loggings are built in loggings used to log execution results of a package using Log providers.
Configuration:
-Choose containers: each container has a check box that represents 3 states:
Selected: Explicit Yes.
Cleared: Explicit No.
Unavailable: Inherit options from parent container.
-Select Event:
OnError,
OnInformation,
OnPreExecute,
OnPostExecute, etc.
-Add log providers:
Text files,
SQL Server,
SQL Server Profiler,
Windows Event Log, and
XML Files
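When the SQL Server log provider is chosen, the entries typically land in the dbo.sysssislog table of the database used by the selected connection manager (sysdtslog90 on SQL Server 2005), so they can be queried afterwards, for example:

SELECT event, source, starttime, message
FROM dbo.sysssislog
WHERE event IN ('OnError', 'OnTaskFailed')
ORDER BY starttime DESC;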

Custom loggings can be used to customize logging by defining the workflow with SSIS event handlers. They log the specified information about the package when an event occurs at run time.
Custom Loggings can be implemented by:
-Execute SQL Task,
-Script Task
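A minimal sketch of custom logging: an Execute SQL Task placed in the package's OnError event handler, with Direct Input like the statement below and the system variables System::PackageName and System::ErrorDescription mapped as input parameters (the log table name is hypothetical; the "?" markers assume an OLE DB connection):

INSERT INTO dbo.PackageErrorLog (PackageName, ErrorDescription, LoggedAt)
VALUES (?, ?, GETDATE());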

SSIS Catalog provides logging options for logging on the server:
-None
-Basic
-Performance
-Verbose
All the captured logs are stored in the catalog.event_messages view of the SSISDB database.
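On a 2012+ server these messages can be queried directly from the catalog, for example:

SELECT operation_id, message_time, message_type, message
FROM SSISDB.catalog.event_messages
ORDER BY message_time DESC;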

Configuration in Package Deployment Model
Two types of Deployment Models:
-Project Deployment Model
-Package Deployment Model

Configuration and deployment are both used for project packages.
Configuration: make a global modification (for example, a connection string) without changing the package itself.
Deployment: how packages are moved to the server, using the project or the package deployment model:

ConnectionString=“…”
-Package Deployment Model (supported by both BIDS and SSDT): each package is configured individually
-Project Deployment Model (SQL Server 2012 onwards, SSDT only): all packages in the project are configured at once

Package Deployment Model
• A package is the unit of deployment.
• Configurations are used to assign values to package properties.
• Packages (.dtsx extension) and configurations (.dtsConfig extension) are saved individually to the file system.
• Environment-specific configuration values are stored in configuration files.
• A package can be deployed to the msdb database or to the file system.
(push deployment: package stored in msdb;
pull deployment: package stored in the file system)

Configuration in SSIS:
• Package configurations can update the values of properties at run time.
• Configurations are available for the package deployment model.
• Package configurations provide the following benefits:
– Configuration makes Migration easier from a development to a production environment.
– Configurations are useful when you deploy packages to different servers.
– Configurations make packages more flexible.
Configuration Types: Five
– XML configuration: XML File contains the configurations;
– Environment variable: An environment variable contains configurations
– Registry entry: A Registry entry contains the configurations
– Parent package variable: (only in a parent-child environment) a variable in the parent package contains the configuration. It is typically used to update properties in child packages.
– SQL Server table: A table in SQL Server database contains the configuration.
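For the SQL Server table type, the Package Configuration Wizard creates (by default) a table like the sketch below and stores one row per configured property; the server, database and connection names in the sample row are placeholders:

CREATE TABLE dbo.[SSIS Configurations] (
    ConfigurationFilter  nvarchar(255) NOT NULL,
    ConfiguredValue      nvarchar(255) NULL,
    PackagePath          nvarchar(255) NOT NULL,
    ConfiguredValueType  nvarchar(20)  NOT NULL
);

INSERT INTO dbo.[SSIS Configurations]
VALUES (N'ProdConfig',
        N'Data Source=PRODSRV;Initial Catalog=DW;Integrated Security=SSPI;',
        N'\Package.Connections[DW].Properties[ConnectionString]',
        N'String');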

Integration Services provides direct and indirect configurations:
– Direct: Specify configurations directly, SSIS creates a direct link between the configuration item and the package object property.
– Indirect: instead of specifying the configuration setting directly, the package reads an environment variable that contains the location of the configuration (for example, the path to the XML configuration file).

All five configuration types can be used directly (the configuration location is stored in the package) or indirectly (an environment variable stores the location). Indirect is more convenient when moving between environments, but direct is faster.
------------Environment-----------
What is an environment?
Production / Development / Testing environments
ETL Strategy:
ETL stands for Extraction, Transformation and Loading.
Perform the following operations:
-Extract data from operational sources or archive systems.
-Transform the data:
-Cleaning: Mapping Null to ‘0’, Male to ‘M’, misspellings, etc.
-Validation: Validate Address fields, Zip codes, etc.
-Filtering: selecting certain columns
-Joining data from multiple sources: Lookup, merge
-Applying business rules: deriving new calculated values, sorting, generating aggregations, applying advance validation rules, etc.
-Load data into a data warehouse, data mart or data repository
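A minimal sketch of the cleaning/validation step expressed in T-SQL (inside SSIS the same logic would typically live in a Derived Column transformation or a staging query; table and column names are hypothetical):

SELECT
    ISNULL(Quantity, 0) AS Quantity,                         -- map NULL to 0
    CASE WHEN Gender = 'Male'   THEN 'M'
         WHEN Gender = 'Female' THEN 'F'
         ELSE 'U' END AS GenderCode,                         -- standardize codes
    LTRIM(RTRIM(City)) AS City                               -- trim stray spaces
FROM stg.CustomerRaw
WHERE Zip LIKE '[0-9][0-9][0-9][0-9][0-9]';                  -- keep only valid 5-digit zip codes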

Components in ETL Process:
-OLTP Sources
-Structured Data (e.g. Oracle, Access, DB2, SQL Server, etc.)
-Unstructured Data (e.g. Flat File, Excel)
-Pre-Staging DB
-Data Profiling
-Data Mapping
-Staging DB
Data transformation (cleaning, aggregation, validation, etc.)
-Data Warehouse/DataMart
Data population(Initial and Incremental Load)

Initial and Incremental Load

Initial (Full) Load is to load all the records for the first time. Here we give the last extract date as empty so that all the data gets loaded.

Incremental Load is to load only the records that have changed since the last load. Here we supply the last extract date so that only records after this date are loaded.
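A minimal sketch of the idea, assuming the source rows carry a ModifiedDate column and the last extract date is stored in a control table (all names hypothetical):

DECLARE @LastExtract datetime =
    (SELECT LastExtractDate FROM etl.LoadControl WHERE TableName = 'Customer');

SELECT *
FROM src.Customer
WHERE ModifiedDate > ISNULL(@LastExtract, '19000101');   -- NULL means initial (full) load

UPDATE etl.LoadControl
SET LastExtractDate = GETDATE()
WHERE TableName = 'Customer';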

Incremental Load Techniques:
• Checksum
• CDC
• Timestamp
• Merge Joins
• Triggers
• Hash functions such as MD2, MD4, MD5 (e.g. via HASHBYTES), our own algorithms, etc.

ETL Using Checksum
Checksum():
• CHECKSUM computes the checksum value (a hash value or a number generated from a string of text for accessing data) over a list of arguments (either column of tbl or expression).
• The arguments are either * (all columns of the table) or a comma-separated list of columns/expressions used to compute the checksum value for a specific row.
Syntax: CHECKSUM ( * | expression [ ,...n ] )
• For every delta operation on the specified columns, the checksum value gets updated.
• Supports all data types except text, ntext, image, XML and other non-comparable types.
• The CHECKSUM value depends on the collation.
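For example, a per-row checksum over the tracked columns (hypothetical table):

SELECT CustomerID,
       CHECKSUM(FirstName, LastName, Email) AS RowChecksum
FROM dbo.Customer;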

Implementing ETL using Checksum:
• Add a computed column as Checksum with columns to be tracked for delta operation in Source table.
• Create a tracking table with two columns, Business Key and Checksum.
• Perform Initial Load:
-Populate Destination Table for all records.
-Populate tracking table for all Business Key and corresponding checksum value.
• Perform Incremental Load:
-Compare Business Key of Tracking table with Source table to find new Business Keys that should be updated in destination table.
-For matching Business Keys, compare checksum values to find the changes that should be updated in destination table.
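A minimal sketch of the structures described above (all names hypothetical):

-- Source table with a computed checksum over the tracked columns
ALTER TABLE src.Customer
    ADD RowChecksum AS CHECKSUM(FirstName, LastName, Email);

-- Tracking table: business key + last known checksum
CREATE TABLE etl.CustomerTracking (
    CustomerID  int NOT NULL PRIMARY KEY,
    RowChecksum int NOT NULL
);

-- Incremental load: new keys, or existing keys whose checksum changed
SELECT s.*
FROM src.Customer AS s
LEFT JOIN etl.CustomerTracking AS t ON t.CustomerID = s.CustomerID
WHERE t.CustomerID IS NULL             -- new row
   OR t.RowChecksum <> s.RowChecksum;  -- changed row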
Pros and Cons of Checksum:
Pros:
• Fast and Easy to implement.
• Good for small/mid size databases.
Cons:
• Probability of Collision is high.
• The checksum differs if the order of the columns or expressions differs between the two checksums being compared.
• Some data types are not supported: text, ntext, image

ETL Using Change Data Capture (CDC)
• Change Data Capture (CDC) captures DML operation done on SQL table and record them into a Change Table.
• CDC was introduced in SQL Server 2008.
• Change tables contain columns of source table, along with the metadata that define type of DML operation on a row by row basis.
Operation Column in Change table define the type of DML Operation:
1: represents Delete
2: represents Insert
3: represents value before Update
4: represents value after Update

• Configuration:
-Turn on SQL Server Agent.
-First, enable CDC on the Source Database by using system SP ‘Sys.sp_cdc_enable_db’.
-Next, enable CDC on the Source table by ‘Sys.sp_cdc_enable_table’ system SP.

• There are 2 jobs created:
-Capture
-Cleanup
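A minimal sketch of enabling CDC (database, schema and table names are hypothetical; SQL Server Agent must be running):

USE SourceDB;
EXEC sys.sp_cdc_enable_db;            -- enable CDC on the database

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Customer',
     @role_name     = NULL;           -- no gating role; this creates the capture and cleanup jobs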

Implementing ETL using CDC: Enable CDC on Source Database and Table.
• For Initial Load:
-Configure CDC Control Task to Mark Initial Load Start.
-Specify variable and table to store CDC state.
-Configure Data Flow Task to load all the records.
-Re-configure CDC Control Task to mark Initial Load End, CDC state and table storing CDC state.
• For Incremental Load:
-Configure CDC Control Task to Get Processing Range.
-Specify variable and table to store CDC state.
-Configure CDC Source in Data Flow Task for Source Table, Capture Instance, Processing Mode and CDC State Variable.
-Configure CDC Splitter to identify DML operation.
-Re-configure CDC Control Task to mark processed range, CDC state and table storing CDC state.

Pros and Cons of CDC:
• Pros:
-can be configured to only track certain tables or columns
-It is able to handle model changes to a certain degree
-It does not affect performance as heavily as triggers because it works with the transaction logs
-It is easily enabled/disabled and does not require additional columns on the table that should be tracked
• Cons:
-Consume time and resources
-The amount of history data can become huge fast
-The history data takes some time to catch up because it is based on the transaction logs
-It depends on the SQL Server Agent
ETL Using Timestamp
Timestamp: Is a data type that exposes automatically generated, unique binary numbers within a database.
The storage size is 8 bytes.
• Timestamp or rowversion data type is just an incrementing number and does not preserve a date or a time.
• Every time that a row with a Timestamp or rowversion column is modified or inserted, the incremented database rowversion value is inserted in the Timestamp or rowversion column.

Implementing ETL using TimeStamp:
• Add a column with Timestamp data type in Source table.
• Create a tracking table to store latest Timestamp for Initial or Incremental Load.
• Add a column with Varbinary data type in the destination table to set the expiry of a deleted record.
• Create a Stored procedure to perform Initial or Incremental Load.
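A minimal sketch of the structures used above (all names hypothetical):

-- Source table tracks changes with a rowversion (timestamp) column
ALTER TABLE src.Customer ADD RowVer rowversion;

-- Tracking table remembers the highest rowversion already loaded
CREATE TABLE etl.CustomerWatermark (LastRowVer binary(8) NULL);

-- Incremental extract: rows changed since the stored watermark
SELECT c.*
FROM src.Customer AS c
CROSS JOIN etl.CustomerWatermark AS w
WHERE w.LastRowVer IS NULL            -- initial load
   OR c.RowVer > w.LastRowVer;        -- changed or inserted since last load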
Pros and Cons of TimeStamp:
• Pros:
-Good Performance.
-Guarantees unique binary number for every insertion or modification.
-No collision, good for large databases as well.
• Cons:
-The procedure may become complex as the number of tables increases.
