Friday, October 9, 2009
Normalization:
Database normalization is a design technique for structuring relational database tables. Highly normalized data contains little duplication: redundant data is removed by the normalization process, at the cost of more tables and more joins.
Denormalization:
Denormalization is the process of attempting to optimize the performance of a database by adding redundant data. Instead of joining several tables (which hurts performance) to retrieve common information, denormalization places some repeated values directly in one table.
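The trade-off can be sketched with a small SQLite example (the table and column names here are made up for illustration): the normalized layout needs a join to answer a question that the denormalized layout answers from a single table, at the cost of repeating the city on every order row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: customer details live in one place; orders reference them by key.
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customer VALUES (1, 'Asha', 'Chennai')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 250.0), (11, 1, 400.0)])

# Reading the city for each order requires a join.
normalized = cur.execute("""
    SELECT o.order_id, c.city
    FROM orders o JOIN customer c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Denormalized: the city is repeated on every order row, so no join is needed,
# at the cost of duplicated data.
cur.execute("""CREATE TABLE orders_denorm
               (order_id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT, amount REAL)""")
cur.executemany("INSERT INTO orders_denorm VALUES (?, ?, ?, ?)",
                [(10, 'Asha', 'Chennai', 250.0), (11, 'Asha', 'Chennai', 400.0)])
denormalized = cur.execute("SELECT order_id, city FROM orders_denorm ORDER BY order_id").fetchall()

# Same answer either way; only the storage trade-off differs.
assert normalized == denormalized
```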
Surrogate Key
A surrogate key is a unique, system-generated identifier used to identify each row of a table. It also helps to track slowly changing dimensions. It stands in for the natural keys and is not derived from any data source.
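A minimal SQLite sketch of the idea (table and column names are illustrative): the surrogate key is generated by the database itself, while the natural key from the source system is kept only as an ordinary attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# customer_key is the system-generated surrogate key; the natural key
# (national_id) coming from the source system is stored only as an attribute.
cur.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate, system generated
        national_id  TEXT,                               -- natural key from the source
        name         TEXT
    )
""")
cur.execute("INSERT INTO dim_customer (national_id, name) VALUES ('AB-123', 'Asha')")
cur.execute("INSERT INTO dim_customer (national_id, name) VALUES ('CD-456', 'Ravi')")

keys = [row[0] for row in
        cur.execute("SELECT customer_key FROM dim_customer ORDER BY customer_key")]
# The database assigned the keys; no source data was involved.
```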
Sunday, March 1, 2009
Wednesday, February 25, 2009
Slowly Changing Dimension
Type 1 slowly changing dimension: Overwrite the old value with the new value and keep no history. This is very useful when dealing with issues such as typos in the client’s name.
Type 2 slowly changing dimension: Create a new record in the dimension with a new primary key.
Type 3 slowly changing dimension: Overwrite the old value with the new value, and add columns to the same row to preserve limited history, such as the previous value and the effective date of the change. This type of slowly changing dimension resolution is beneficial when a change can happen once and only once.
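Types 1 and 2 can be sketched against a tiny SQLite dimension table (names here are illustrative, and `is_current` is one common way to flag the active Type 2 row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  TEXT,                               -- natural key
        city         TEXT,
        is_current   INTEGER                             -- 1 = active version
    )
""")
cur.execute("INSERT INTO dim_customer (customer_id, city, is_current) "
            "VALUES ('C1', 'Chennai', 1)")

# Type 1: overwrite in place -- the old value is lost.
cur.execute("UPDATE dim_customer SET city = 'Madras' WHERE customer_id = 'C1'")

# Type 2: expire the current row, then insert a new row with a new surrogate key.
cur.execute("UPDATE dim_customer SET is_current = 0 "
            "WHERE customer_id = 'C1' AND is_current = 1")
cur.execute("INSERT INTO dim_customer (customer_id, city, is_current) "
            "VALUES ('C1', 'Bangalore', 1)")

rows = cur.execute("SELECT customer_key, city, is_current FROM dim_customer "
                   "ORDER BY customer_key").fetchall()
# Both versions of the customer are preserved, each under its own surrogate key.
```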
Friday, February 20, 2009
Data Modelling
Data Modelling - Data is simply a record of all the business activities, resources and results of the organization.
A data model is a well-organized abstraction of that data.
Data modeling helps to organize the structure and contents of the data in the data warehouse.
Data Mart:
Subset of a data warehouse
A Data Mart is a specific, subject-oriented repository of data designed to answer specific questions for a specific set of users. So an organization could have multiple data marts serving the needs of marketing, sales, operations, collections, etc. A data mart is usually organized as a dimensional model, i.e. a star schema (an OLAP cube) made of a fact table and multiple dimension tables.
1. Data warehousing is a process that builds an intelligent system to provide enterprise-wide data and business solutions. A data warehouse is a denormalized database that stores historical data at a summary level and is specifically meant for heavy-duty querying and analysis.
2. Advanced technology used purely to support the executives of the organization in taking business decisions.
A data mart is a subset of a data warehouse, supporting a particular region, business unit or business function.
Data warehouses and data marts are built on dimensional data modeling, where fact tables are connected with dimension tables. This makes data easy for users to access, since the database can be visualized as a cube of several dimensions, and a data warehouse provides the opportunity to slice and dice that cube along each of its dimensions.
Data Mart: A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data is derived from an enterprise-wide data warehouse; in an independent data mart, data is collected directly from the sources.
In contrast, a Data Warehouse (DW) is a single organizational repository of enterprise-wide data across many or all subject areas. The data warehouse is the authoritative repository of all the fact and dimension data (also available in the data marts) at an atomic level; it can be seen as a combination of data marts.
Approach to build a Data warehouse
From a business perspective, it is very important to first get clarity on the end-user requirements and do a system study before commencing any data warehousing project. From a technical perspective, it is important to first understand the dimensions and measures, determine the quality and structure of the source data from the OLTP systems, and then decide which dimensional model to apply, i.e. a star, a snowflake, or a combination of both. From a conceptual perspective, we can either follow the Ralph Kimball method (build data marts and then consolidate them at the end to form an enterprise data warehouse) or the Bill Inmon method (build a large data warehouse and derive data marts from it). Deciding on a method requires a strong understanding of the business requirements and data structure, as well as consensus with the customer.
Staging Area
The staging area (sometimes loosely called an Operational Data Store, or ODS) is a holding place where the data extracted from all the data sources is stored. From the staging area, data is loaded into the data warehouse. Data cleansing takes place at this stage.
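A minimal sketch of the staging-and-cleansing step in plain Python (the source rows, field names, and cleansing rules are all made up for illustration): raw extracts land in a staging structure, get standardized and validated, and only clean rows move on to the warehouse load.

```python
# Hypothetical raw extracts landing in the staging area before the warehouse load.
raw_rows = [
    {"cust": " asha ", "amount": "250"},
    {"cust": "RAVI",   "amount": "400"},
    {"cust": "",       "amount": "oops"},   # bad record, caught during cleansing
]

def cleanse(row):
    """Trim and standardize text, validate numerics; return None to reject the row."""
    name = row["cust"].strip().title()
    if not name:
        return None                          # reject rows with no customer name
    try:
        amount = float(row["amount"])
    except ValueError:
        return None                          # reject rows with a non-numeric amount
    return {"cust": name, "amount": amount}

staged = [cleanse(r) for r in raw_rows]
loaded = [r for r in staged if r is not None]   # only clean rows reach the warehouse
```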
A data model helps the functional and technical teams in designing the database
A data model is a visual representation of the data structures (Tables…)
Target tables are created using data modeling techniques by the data modelers in the organization
Requirement gathering, schema design and optimization are the job of the data modelers
In a data warehouse we should accept denormalization and redundancy for performance
In a data warehouse we also have to plan for data partitioning in the architecture, because the data volume keeps increasing
Modeling types and Data Modeling development life cycle
Conceptual data model – created by gathering requirements from various sources, such as business documents, discussions with functional teams, business analysts, and the end users who report on the database.
Logical data model – as soon as the conceptual data model is accepted by the functional team, development of the logical data model starts. In the logical data model we define the tables (entities), columns (attributes), relationships, primary keys, foreign keys, etc.
Physical data model – once the logical data model is approved by the functional team, development of the physical data model starts. Data types, database rules and referential integrity are defined here. Database performance, optimization, indexing strategies and denormalization are important parameters of a physical model.
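The physical-model concerns can be sketched in SQLite DDL (table names, keys, and the sample date are illustrative): concrete data types, a referential-integrity constraint, and an index on the foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")      # enforce referential integrity in SQLite

cur.executescript("""
    -- Physical model: concrete data types, constraints, and an index.
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,
        calendar_date TEXT NOT NULL
    );
    CREATE TABLE fact_order (
        order_id INTEGER PRIMARY KEY,
        date_key INTEGER NOT NULL REFERENCES dim_date(date_key),
        amount   REAL CHECK (amount >= 0)
    );
    CREATE INDEX idx_fact_order_date ON fact_order(date_key);

    INSERT INTO dim_date VALUES (20090220, '2009-02-20');
    INSERT INTO fact_order VALUES (1, 20090220, 99.5);
""")

# Referential integrity rejects a fact pointing at a missing dimension row.
try:
    cur.execute("INSERT INTO fact_order VALUES (2, 19990101, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```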
Dimensional Analysis or Dimensional Data modeling:
It describes the dimensional model
Dimensional data modeling is used to calculate summarized data, rearrange it, and present views of the data to support analysis
The dimensional model is a new name for an old technique that keeps the database simple and understandable
Visualize the database as a “cube” and slice and dice the cube along each of its dimensions
Dimensional data modeling comprises one or more dimension tables and fact tables. It is used in OLAP systems, which are analysis oriented.
Relational Data Model: Used in OLTP systems which are transaction oriented
Operational System: Used to run the business (OLTP)
Decision Support: used to watch how the business runs (OLAP)
Dimension:
Dimensions define the boundaries of the business; they take part in business functions and are also called business objects.
Technically, a combination of hierarchies and levels is called a dimension (Time dimension, Year dimension, Product dimension, Region dimension, etc.)
Facts:
Facts are also called measures or events (quantity, sales amount, discount, profit, etc.)
Thumb Rule:
Dimensions contain text-like values, while facts / measures / events contain numeric values
Star Schema:
A star schema is a relational database schema for representing multidimensional data
It is the simplest form of data warehouse schema, containing one or more dimension and fact tables. A typical star schema consists of a fact table with one or more dimension tables around it
Steps in designing Star Schema
Identify a business process for analysis
Identify measures or facts
Identify dimensions for facts (Product dimension, location dimension, time dimension etc.,)
List the columns that describe each dimension (Product name, location name etc.,)
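The steps above can be sketched end to end in SQLite (the process, tables, and sample figures are invented for illustration): a sales fact table joined to product and location dimensions, then a "slice" of the cube along the location dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- Dimensions identified for the sales business process.
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: measures plus foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        location_key INTEGER REFERENCES dim_location(location_key),
        quantity     INTEGER,
        sales_amount REAL
    );

    INSERT INTO dim_product  VALUES (1, 'Pen'), (2, 'Book');
    INSERT INTO dim_location VALUES (1, 'Chennai'), (2, 'Mumbai');
    INSERT INTO fact_sales VALUES (1, 1, 10, 50.0), (2, 1, 5, 100.0), (1, 2, 3, 15.0);
""")

# Slice the cube along the location dimension: total sales by city.
by_city = cur.execute("""
    SELECT l.city, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_location l ON f.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```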
Important aspects of star schema and snowflake schema:
In a star schema every dimension has a primary key
In a star schema a dimension table has no parent table
In a snowflake schema a dimension table has one or more parent tables
In a star schema the hierarchies for the dimensions are stored in the dimension table itself
In a snowflake schema the hierarchies for the dimensions are broken out into separate tables
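A small SQLite contrast of the two layouts (tables and data are illustrative): the star version flattens the product→category hierarchy into one dimension table, while the snowflake version breaks the category level out into a parent table reached by a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- Star: the hierarchy is flattened into the dimension table itself.
    CREATE TABLE dim_product_star (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_name TEXT        -- category repeated on every product row
    );

    -- Snowflake: the category level is broken out into a parent table.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product_snow (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );

    INSERT INTO dim_product_star VALUES (1, 'Pen', 'Stationery'), (2, 'Pencil', 'Stationery');
    INSERT INTO dim_category VALUES (1, 'Stationery');
    INSERT INTO dim_product_snow VALUES (1, 'Pen', 1), (2, 'Pencil', 1);
""")

star = cur.execute("SELECT product_name, category_name FROM dim_product_star "
                   "ORDER BY product_key").fetchall()
snow = cur.execute("""
    SELECT p.product_name, c.category_name
    FROM dim_product_snow p JOIN dim_category c USING (category_key)
    ORDER BY p.product_key
""").fetchall()

# Same answer; the star repeats 'Stationery', the snowflake stores it once and joins.
assert star == snow
```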
Advantages of Star Schema:
A star schema is very easy to understand
It provides better performance and shorter query times
Difference between star schema and snow flake schema
Star Schema --------------- Snowflake Schema
Denormalized --------------- Normalized
Data access latency is lower --------------- Data access latency is higher
Size of the data warehouse is larger --------------- Size of the data warehouse is smaller
Performance is higher --------------- Performance is lower
Memory usage is higher --------------- Memory usage is lower
The main difference between the star schema and the snowflake schema is that the star schema is highly denormalized while the snowflake schema is normalized. So data access latency is lower in a star schema than in a snowflake schema. As the star schema is denormalized, the size of the data warehouse will be larger than with a snowflake schema. The schema is selected as per the client requirements: performance-wise, the star schema is good, but if memory utilization is a major concern, then a snowflake schema is better than a star schema.
Hierarchy
A logical structure that uses ordered levels as a means of organizing the data.
Level
A position in the Hierarchy
Fact table
A table in a star schema that contains facts and is connected to the dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to the dimension tables
Dimension table
A dimension table describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations and products
Summary Table
A fact table that contains aggregated facts is called a summary table
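Building a summary table can be sketched in SQLite (table names and figures are illustrative): the detailed fact rows are pre-aggregated to the product grain with a GROUP BY, so queries at that grain no longer scan the detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- Detail-level fact table.
    CREATE TABLE fact_sales (product_key INTEGER, day TEXT, sales_amount REAL);
    INSERT INTO fact_sales VALUES
        (1, '2009-02-01', 50.0),
        (1, '2009-02-02', 70.0),
        (2, '2009-02-01', 30.0);

    -- Summary table: the same facts pre-aggregated to the product grain.
    CREATE TABLE summary_sales AS
        SELECT product_key, SUM(sales_amount) AS total_sales
        FROM fact_sales
        GROUP BY product_key;
""")

totals = cur.execute("SELECT product_key, total_sales FROM summary_sales "
                     "ORDER BY product_key").fetchall()
```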
Building Data Mart
Inmon approach – first build the data warehouse, then the data marts
Kimball approach – first build the data marts, then combine them into the data warehouse (common, conformed dimensions should be shared across the data marts)
Relational Data modeling -
Data is stored in RDBMS
Tables are units of storage
Data is normalized and used for OLTP. Optimized for OLTP processing
Several tables and chains of relationships among them
Volatile (frequent updates) and time variant
SQL is used to manipulate data
Detailed level of transactional data
Dimensional Data Modeling
Data is stored in RDBMS or Multidimensional databases
Cubes are units of storage
Data is denormalized and used in the data warehouse and data marts. Optimized for OLAP
Few tables and fact tables are connected to dimensional tables
Non volatile and time invariant
MDX is used to manipulate data
Summary of bulky transactional data (aggregates and measures) used in business decisions