Data warehouse:
It is a collection of data marts and represents historical data. A data warehouse is a relational database which is specially designed for analysis rather than for transactional processing.
Data mart:
It is a subset of the data warehouse. It provides the data for query, reporting,
and analysis. A data mart is a subject-oriented database which holds the data of an individual department in an organization.
This is the conformed-dimension methodology: if a dimension table is connected to more than one fact table, it is called a conformed dimension.
Fact tables are connected through conformed dimensions; fact tables cannot be joined to each other directly, so we connect them by means of shared dimensions.
Yes, we should use a surrogate key. Here we receive data from different source locations, and each source has its own primary key. While loading the data into the target, those natural keys can overlap or duplicate, so a warehouse-generated surrogate key is used to uniquely identify each row in the dimension table.
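For illustration, a minimal sketch of such a dimension table (all table and column names are hypothetical): the warehouse-generated surrogate key is the primary key, while the source's natural key is kept as an ordinary attribute.

CREATE TABLE dim_customer (
    customer_sk   INTEGER PRIMARY KEY,  -- surrogate key generated inside the warehouse
    customer_id   VARCHAR(20),          -- natural key as delivered by the source system
    source_system VARCHAR(10),          -- disambiguates identical IDs from different sources
    customer_name VARCHAR(100)
);

Because the surrogate key is generated inside the warehouse, two sources that both ship customer_id = '100' can coexist in the dimension without colliding.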
A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables or aggregated fact tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition; a common example of this is sales. Non-additive facts cannot be added at all; an example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others; an example of this is inventory levels, which cannot be meaningfully summed across time.
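As a sketch (hypothetical names), a fact table with foreign keys to three dimensions, one additive measure, and one semi-additive measure might look like this:

CREATE TABLE fact_daily_inventory (
    date_sk         INTEGER NOT NULL,  -- foreign key to a date dimension
    product_sk      INTEGER NOT NULL,  -- foreign key to a product dimension
    store_sk        INTEGER NOT NULL,  -- foreign key to a store dimension
    sales_amount    DECIMAL(12,2),     -- additive: can be summed along any dimension
    inventory_level INTEGER,           -- semi-additive: summable across stores, not across dates
    PRIMARY KEY (date_sk, product_sk, store_sk)
);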
Partition tables, bitmap indexes, sequences, table functions, SQL*Loader, and GROUP BY extensions like CUBE and ROLLUP, etc.
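For instance, ROLLUP and CUBE extend GROUP BY so that a single query returns subtotal and grand-total rows (the fact_sales table here is hypothetical):

-- ROLLUP: totals per (region, product), per region, and a grand total
SELECT region, product, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY ROLLUP (region, product);

-- CUBE: additionally produces subtotals per product alone
SELECT region, product, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY CUBE (region, product);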
The snowflake and star schemas are methods of storing data that is multidimensional in nature (i.e., that can be analyzed by any or all of a number of independent factors) in a relational database. The snowflake schema (sometimes called the snowflake join schema) is more complex than the star schema because the tables which describe the dimensions are normalized. In a snowflake schema, one dimension table is connected to another dimension table, and so on.
------------
Snowflake
------------
- If a dimension is very sparse (i.e., most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate.
- A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will hence be easier to implement.
- A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organization of the database.
- Some users may wish to submit queries to the database which, using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snowflaking would typically be required to permit simple query tools such as Cognos PowerPlay to form such a query, especially if provision for these forms of query wasn't anticipated when the data warehouse was first designed.
---------
Star
----------
The star schema (sometimes referred to as the star join schema) is the simplest data warehouse schema, consisting of a single "fact table" with a compound primary key, with one segment for each "dimension", and with additional columns of additive, numeric facts. The star schema makes multi-dimensional database (MDDB) functionality possible using a traditional relational database. Because relational databases are the most common data management systems in organizations today, implementing multi-dimensional views of data using a relational database is very appealing. Even if you are using a specific MDDB solution, its sources are likely relational databases. Another reason for using a star schema is its ease of understanding. Fact tables in a star schema are mostly in third normal form (3NF), but dimension tables are in denormalized second normal form (2NF). If you normalize the dimension tables, they look like snowflakes (see the snowflake schema) and the same problems of relational databases arise: you need complex queries, and business users cannot easily understand the meaning of the data. Although query performance may be improved by advanced DBMS technology and hardware, highly normalized tables make reporting difficult and applications complex.
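As a sketch of the structural difference (hypothetical names): in a star schema the product dimension is a single denormalized table, while a snowflake schema normalizes the category hierarchy out into its own table.

-- Star: one denormalized dimension table
CREATE TABLE dim_product (
    product_sk     INTEGER PRIMARY KEY,
    product_name   VARCHAR(100),
    category_name  VARCHAR(50),   -- category attributes repeated on every product row
    category_group VARCHAR(50)
);

-- Snowflake: the category attributes move to a separate, normalized table
CREATE TABLE dim_category (
    category_sk    INTEGER PRIMARY KEY,
    category_name  VARCHAR(50),
    category_group VARCHAR(50)
);

CREATE TABLE dim_product_sf (
    product_sk   INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_sk  INTEGER REFERENCES dim_category (category_sk)
);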
1. More than one dimension can be shared by other departments.
2. The physical load will be less.
3. Less complexity in the fact table.
An ODS is an environment that pulls together, validates, cleanses and integrates data from disparate source application systems. This becomes the foundation for providing the end-user community with an integrated view of enterprise data to enable users anywhere in the organization to access information for strategic and/or tactical decision support, day-to-day operations and management reporting.
The definition of a data warehouse is as follows:
- Subject-oriented, meaning that the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together;
- Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
- Non-volatile, meaning that data in the database is never over-written or deleted, but retained for future reporting;
- Integrated, meaning that the database contains data from most or all of an organization's operational applications, and that this data is made consistent.
Difference
----------
ODS: Transactions similar to those of an online transaction processing (OLTP) system.
DW:  Queries process larger volumes of data.

ODS: Contains current and near-current data.
DW:  Contains historical data.

ODS: Typically detailed data only, often resulting in very large data volumes.
DW:  Contains summarized and detailed data; generally smaller in size than an ODS.

ODS: Real-time and near-real-time data loads.
DW:  Typically batch data loads.

ODS: Generally modeled to support rapid data updates.
DW:  Generally dimensionally modeled and tuned to optimize query performance.

ODS: Updated at the data field level.
DW:  Data is appended, not updated.

ODS: Used for detailed decision making and operational reporting.
DW:  Used for long-term decision making and management reporting.

ODS: Knowledge workers (customer service representatives, line managers).
DW:  Strategic audience (executives, business unit management).
A warehouse is used for high-level data analysis. It is used for predictions, time-series analysis, financial analysis, what-if simulations, etc. Basically it is used for better decision making.
OLTP is NOT used for analysis; it is used for transaction and data processing. It is basically used for storing the day-to-day transactions that take place in an organization.
The main focus of OLTP is easy and fast input of data, while the main focus of a data warehouse is easy retrieval of data.
OLTP does not store historical data (this is the reason why it cannot be used for analysis); a DW stores historical data.
ROLAP
------
ROLAP stands for Relational Online Analytical Processing; it provides multidimensional analysis of data stored in a relational database (RDBMS).
MOLAP
------
MOLAP (Multidimensional OLAP) provides the analysis
of data stored in a multidimensional data cube.
HOLAP
------
HOLAP (Hybrid OLAP), a combination of both ROLAP and
MOLAP, can provide multidimensional analysis of data
stored simultaneously in a multidimensional database
and in a relational database (RDBMS).
DOLAP
-----
DOLAP (Desktop OLAP or Database OLAP) provides
multidimensional analysis locally on the client machine,
on data collected from relational or multidimensional
database servers.
1. Conformed dimensions
2. Junk dimensions
3. Degenerate dimensions
4. Slowly changing dimensions
A data cube is the logical representation of multidimensional data. The edges of the cube contain the dimensions, and the body of the cube contains the data.
Data warehousing is used to store historical data; by using a DWH, business users can analyze their business. Data mining is used to predict the future. The DWH acts as the source for data mining.
In Informatica, bulk insert or bulk load does two things:
1) It ignores the commit interval specified at the session level.
2) It does not write database log entries.
So the advantage is that it is very fast, as no entry goes into the log.
The disadvantage is that the session cannot be rolled back, as no entry exists in the log.
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data. There appear to be two definitions of a surrogate in the literature. We shall call these surrogate (1) and surrogate (2):
Surrogate (1): This definition is based on that given by Hall, Owlett and Todd (1976). Here a surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application.
Surrogate (2): This definition is based on that given by Wieringa and de Jung (1991). Here a surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application.
The Joiner transformation joins two different data sources based on a
join condition, passes only the rows which satisfy that
condition, and discards the remaining rows.
The Joiner transformation supports 4 types of joins at the
Informatica level:
Normal
Master Outer
Detail Outer
Full Outer
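Roughly, the four join types correspond to the following SQL, treating one pipeline as the master and the other as the detail (a sketch only; the Joiner itself is configured in the Designer, and master/detail are hypothetical table names; note that a master outer join keeps all detail rows):

-- Normal: only rows matching the join condition (inner join)
SELECT * FROM detail d INNER JOIN master m ON d.k = m.k;

-- Master outer: all detail rows plus matching master rows
SELECT * FROM detail d LEFT OUTER JOIN master m ON d.k = m.k;

-- Detail outer: all master rows plus matching detail rows
SELECT * FROM detail d RIGHT OUTER JOIN master m ON d.k = m.k;

-- Full outer: all rows from both sources
SELECT * FROM detail d FULL OUTER JOIN master m ON d.k = m.k;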
Lookup Transformation
The Lookup transformation is basically for reference, based on a
lookup condition. When you want some data based on target
data, you take a lookup on that particular table and
retrieve the corresponding fields from it.
We can override the lookup query using a SQL override.
A view has a logical existence, but a materialized view has
a physical existence. Moreover, a materialized view can be
indexed, analyzed, and so on; that is, everything that
we can do with a table can also be done with a materialized
view.
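A minimal Oracle-style sketch (hypothetical names) of why a materialized view behaves like a table: its result set is physically stored, so it can be indexed.

CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT product_sk,
       TRUNC(sale_date, 'MM') AS sale_month,
       SUM(sales_amount)      AS total_sales
FROM fact_sales
GROUP BY product_sk, TRUNC(sale_date, 'MM');

-- A plain view could not be indexed; a materialized view can
CREATE INDEX ix_mv_monthly_sales ON mv_monthly_sales (product_sk, sale_month);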
Do a lookup on the target table with a lookup SQL override such as:
SELECT MAX(seq_field) AS max_seq, field1, field2
FROM target_table
GROUP BY field1, field2
In an Expression transformation, increment the MAX value which you just got from the lookup by 1.
Here MAX(seq_field) is the maximum value of the field whose sequence you want to generate.
An unconnected lookup is used whenever you want to call the same
transformation several times, and it has one return port.
We use an unconnected lookup to reference multiple tables
or views without physically bringing the entity into the
mapping. This kind of transformation is also helpful when
only a single return port is required.
Use a dynamic cache if you want the cache to be updated while
the target table itself is being updated; with a static cache,
the cache is left untouched.
$ - These are session-level parameters and variables, like $BadFile, $InputFile, $OutputFile, $DBConnection.
$$ - These are user-defined mapping parameters and variables, declared in the mapping and usually assigned values through a parameter file.
$$$ - These are built-in system parameters, e.g. $$$SessStartTime.
$$$SessStartTime returns the initial system date value on
the machine hosting the PowerCenter Server when the server
initializes a session. $$$SessStartTime returns the session
start time as a string value. The format of the string
depends on the database you are using.
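A hedged example of typical usage: because the server substitutes $$$SessStartTime as a string, it is often embedded in a source filter or SQL override. The table name and the date format below are assumptions; the actual format depends on your database.

-- Pull only rows created before the session started (Oracle-style format assumed)
SELECT *
FROM src_orders
WHERE created_at < TO_DATE('$$$SessStartTime', 'MM/DD/YYYY HH24:MI:SS');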
A fact table without measures (numeric data) in its columns is
called a factless fact table.
Factless fact tables are used to capture events, such as date or
transaction events.
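A sketch of such a table (hypothetical names): it records only that an event happened, here a student attending a class on a date, and analysis is done by counting rows.

CREATE TABLE fact_attendance (
    date_sk    INTEGER NOT NULL,  -- foreign key to a date dimension
    student_sk INTEGER NOT NULL,  -- foreign key to a student dimension
    class_sk   INTEGER NOT NULL,  -- foreign key to a class dimension
    PRIMARY KEY (date_sk, student_sk, class_sk)  -- no measure columns at all
);

-- The "measure" is simply the count of event rows
SELECT class_sk, COUNT(*) AS attendance
FROM fact_attendance
GROUP BY class_sk;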
Normal: in this case the server manager allocates
resources (buffers) as per the parameter settings, and log
entries are created in the database.
Bulk: in this case the server manager allocates the maximum
resources (buffers) available, irrespective of the parameter
settings, and no log entries are created in the database.
In the first case the data-loading process takes more time,
but other applications are not affected; with bulk, data
loading is much faster, but other applications are affected.
Normal load: loads the records one by one and writes a log entry for each; it takes more time to complete.
Bulk load: loads a number of records at a time; it does not follow any log files or trace levels, and it takes less time.
We use file type Direct when we are loading a single file into the target. We use Indirect when we want to load multiple files through a single session in the mapping.
This can be handled by using a file list in Informatica.
If we have 5 files in different locations on the server and
we need to load them into a single target table, in the session
properties we change the file type to Indirect.
I take a notepad, put the following paths and file
names in it, and save this notepad as
emp_source.txt in the directory /ftp_data/webrep/:
/ftp_data/webrep/SrcFiles/abc.txt
/ftp_data/webrep/bcd.txt
/ftp_data/webrep/srcfilesforsessions/xyz.txt
/ftp_data/webrep/SrcFiles/uvw.txt
/ftp_data/webrep/pqr.txt
In the session properties I give /ftp_data/webrep/ as the
directory path, emp_source.txt as the file name, and Indirect
as the file type.
The best way to do this is to use the Slowly Changing Dimension wizard:
Mappings -> Wizards -> Slowly Changing Dimension -> Type 1;
here you need to select the source and target tables.
Alternatively, take a Lookup transformation and an Update Strategy
transformation. Based on the lookup: if the record exists in the
target unchanged, reject it; if it does not exist, insert it; and if
the record exists but has changed, update it.
Before using mapping parameters and mapping variables, we
should declare them on the mapping tab of the Mapping
Designer.
A mapping parameter cannot change until the session has
completed, whereas a mapping variable can change during
the session.
Example:
If we declare a mapping parameter, we use that one value
for the whole session, but if we declare a mapping
variable, its value can change within the session. Mapping
variables can be used, for example, in a Transaction Control transformation.
Connect the ports of the Filter transformation to the
second target table and enable 'Forward Rejected Rows'
in the properties of the transformation; the
rejected rows will be forwarded to this table.
Alternatively, you can use a Router if you need the rejected rows along
with the satisfied rows; otherwise just give the condition
for the Filter transformation as you want it for your target table.
With the help of the ISNULL() function of Informatica;
or, in the column properties sheet, write N/A in the Default
Value text box for the particular column.
The fact table is the primary table in dimensional
modeling; the numeric performance measures of the
business are stored in the fact table.
The most useful facts are numeric and additive.
Not every numeric column is a fact; a numeric that is a
key performance indicator is called a fact.
Update override is an option available on the TARGET instance. By default the target table is updated based on primary-key values. To update the target table on non-primary-key values, you can generate the default query and override it according to the requirement. For example, if you want to update a record in the target table when a column value = 'AAA', you can include this condition in the WHERE clause of the default query.
Coming to SQL override, it is an option available in the Source
Qualifier and Lookup transformations, where you can include
joins, filters, GROUP BY, and ORDER BY.
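A sketch of what an overridden update query might look like. In a PowerCenter target update override the :TU. prefix refers to the ports of the target instance; the table and column names here are hypothetical.

UPDATE t_customer
SET cust_name  = :TU.cust_name,
    cust_phone = :TU.cust_phone
WHERE cust_code = :TU.cust_code   -- updating on a non-primary-key column
  AND :TU.status = 'AAA'          -- extra condition required by the business rule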
Use Sequence Generator and Expression transformations. First
generate the surrogate key with the Sequence Generator, then send
the values to the Expression transformation, and connect the
Expression transformation's output ports to the 3 dimensions.
The Sequence Generator generates surrogate keys like 1, 2, 3, 4, 5;
we pass this column to the next transformation (the Expression), and
from there we connect the output port to the dimensions, so '1'
goes to all dimensions, then '2', then '3', and so on.
1. Normalizer transformations
2. COBOL sources
3. XML Source Qualifier transformations
4. XML sources
5. Target definitions
6. Other mapplets
7. Pre- and post-session stored procedures
8. Sequence Generator transformations
Data driven is the instruction fed to the Informatica server
telling it whether to insert, update, or delete a row; it applies
whenever an Update Strategy transformation is used.
The Expression transformation is used in data cleansing.
If your target table has NOT NULL columns and the source
contains NULL values, assign a substitute value in the Expression
transformation (for example, IIF(ISNULL(col), 'N/A', col)) and then pass the data on to the target.
This can be done by passing all ports to an Expression
transformation, creating an output port, say ID = 1, in
the Expression transformation of each file, and then
joining them with a Joiner on ID. Hope this helps.
There are three ways to get the latest record in an SCD:
1) In SCD Type 2 with timestamps, the end-date field of the
latest record is blank; that means it is the new row.
2) In SCD Type 2 with a flag, the flag value of the new row
is one.
3) In SCD Type 2 with versioning, the latest record has
the maximum version number.
F10 and F5 are used in the debugging process.
By pressing F10, the process moves to the next
transformation from the current one, and the
current data can be seen in the bottom panel of the window,
whereas F5 processes the full data at a stretch; in the case
of F5 you can see the data in the targets at the end of the
process, but you cannot see the intermediate transformation values.
Command task, Session task, and Email task.
Based on the commit interval, the session commits that many records
into the target. Suppose the commit interval is 1000: if the session
fails after 100 records, it will not have inserted a single record
into the target.
Repository privileges
Folder permissions (owners, groups, users)
Locks (read, write, execute, fetch, save)
The Informatica server has 3 methods for recovering
sessions:
1) Run the session again if the Informatica server has not
issued a commit.
2) Truncate the target tables and run the session again if
the session is not recoverable.
3) Consider performing recovery if the Informatica server
has issued at least one commit.
Use "perform recovery" to load the records from the point
where the session failed.
There are 3 types of data passed between the Informatica
server and a stored procedure:
Input/output parameters: the stored procedure receives the
inputs and provides the outputs.
Return value: every database provides a return value after
the stored procedure finishes processing.
Status code: it is used for error handling.
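A minimal Oracle PL/SQL sketch (hypothetical logic and names) showing all three at once: an input parameter, an output parameter, and a return value that can serve as a status:

CREATE OR REPLACE FUNCTION f_validate_amount (
    p_amount  IN  NUMBER,    -- input parameter fed from the mapping
    p_message OUT VARCHAR2   -- output parameter passed back to the mapping
) RETURN NUMBER              -- return value, usable as a status
IS
BEGIN
    IF p_amount < 0 THEN
        p_message := 'negative amount';
        RETURN 1;  -- non-zero status signals an error
    END IF;
    p_message := 'ok';
    RETURN 0;
END;
/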
A shortcut is a reusability concept. If there is a mapping
that can be reused across several folders, create it in one
folder and use shortcuts to it in the other folders. Thus, if
you have to make a change, you make it in the main mapping,
and it is reflected automatically wherever the shortcut is used.
Type 1: no historical data is kept; when changes are made,
the old data is deleted and the new data is inserted.
Type 2: history is kept, marked in one of three ways.
Flag: the old data is flagged as false and the new data as true.
Version: the changes are numbered 0, 1, 2, ... and so on.
Date: each change is recorded along with the date on which it
was made.
Type 3: only the latest change is available.
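A sketch of a Type 2 dimension (hypothetical names) carrying all three history markers at once, flag, version, and dates:

CREATE TABLE dim_customer_scd2 (
    customer_sk    INTEGER PRIMARY KEY,  -- a new surrogate key for every version
    customer_id    VARCHAR(20),          -- natural key, repeated across versions
    customer_addr  VARCHAR(200),
    current_flag   CHAR(1),              -- 'Y' on the latest row, 'N' on older ones
    version_no     INTEGER,              -- 0, 1, 2, ... per change
    eff_start_date DATE,
    eff_end_date   DATE                  -- NULL (or a high date) on the current row
);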
In Informatica 7, under $PMRootDir there is a utility
(script) called pmconfig; through it we can configure Informatica.
A conformed dimension is a dimension which is connected to,
or shared by, more than one fact table.
E.g., for a business which handles both sales and orders of
products, the product dimension becomes a conformed
dimension for both the sales fact and the order fact.
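A sketch of that example (hypothetical tables dim_product, fact_sales, fact_orders): both fact tables join to the same product dimension, which is what makes it conformed, so results from the two facts line up on the same product attributes.

-- The same dimension joins to both facts, so results conform across them
SELECT p.product_name, SUM(s.sales_amount) AS sales
FROM fact_sales s JOIN dim_product p ON s.product_sk = p.product_sk
GROUP BY p.product_name;

SELECT p.product_name, SUM(o.order_qty) AS orders
FROM fact_orders o JOIN dim_product p ON o.product_sk = p.product_sk
GROUP BY p.product_name;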
A surrogate key is a type of key which is used to maintain
history; it is used in slowly changing dimensions (SCD).
If you have multiple versions of a record, then to maintain
those records we need to generate a surrogate key in Informatica.
If you increase your commit interval to ~25000 rows,
the session will run faster, but if your session fails at
the 24000th record you will not have any data in your target.
When you decrease your commit interval to ~10000 rows, your
session will be slower compared to the previous case, but if the
session fails at the 24000th record you will lose only 4000
records.
In short: if the commit interval is set to a high value, performance
will be high; if a commit is issued every 1000 rows, say,
it will hurt performance badly.
MDDB stands for Multi-Dimensional Database.
MDDB: an MDDB views data multidimensionally (in perspective),
i.e., through various dimensions at a time, with the help of
cubes built from the dimensions, and it stores the data
multidimensionally, i.e., in power cubes. In a power cube,
each axis is a dimension and each member of a dimension is a
column. In an MDDB, at a glance we can see the dimensions and
the data present in them.
RDBMS: a relational database views data two-dimensionally and
stores it two-dimensionally, i.e., in rows and columns in a
table. In a relational database we can just see the rows and
columns; only after issuing a SELECT over them can you see
the data.