Wednesday, August 5, 2009

DataWarehouse Basic Interview Questions -1

1) Dimensional Modeling

Dimensional Modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - Facts table and Dimension table. Fact table contains the facts/measurements of the business and the Dimension table contains the context of measurements i.e., the dimensions on which the facts are calculated.

2) Non-additive facts

Non additive facts are the facts that do not participate in arithmetic calculations. For example in stock fact table there will be opening and closing balances along with qty sold and amt etc. but opening and closing balances were never used in arithmetic calculations.

3) Semi-additive facts

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. For example:
Current Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information

4) Aggregate tables

These are the tables which contains aggregated / summarized data. E.g.: Yearly, monthly sales information. These tables will be used to reduce the query execution time.
Aggregate table contains the summary of existing warehouse data which is grouped to certain levels of dimensions. Retrieving the required data from the actual table, which have millions of records will take more time and also affects the server performance. To avoid this we can aggregate the table to certain required level and can use it. This table reduces the load in the database server and increases the performance of the query and can retrieve the result very fastly.

5) Lookup table

A lookup table is nothing but a 'lookup' it gives values to referenced table (it is a reference), it is used at the run time, it saves joins and space in terms of transformations. Example, a lookup table called states, provide actual state name ('Texas') in place of TX to the output.

6) ODS

ODS stands for Operational Data Store. It contains near real time data. In typical data warehouse architecture, sometimes ODS is used for analytical reporting as well as source for Data Warehouse.


7) What is level of Granularity in a fact table?

Level of granularity means level of detail that you put into the fact table in a data warehouse. For example: Based on design you can decide to put the sales data in each transaction. Now, level of granularity would mean what detail you are willing to put for each transactional fact. Product sales with respect to each minute or you want to aggregate it up to minute and put that data.

8) What are slowly changing dimensions?

Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product price changes over time; State names may change over time. These are a few examples of Slowly Changing Dimensions since some changes are happening to them over a period of time.

9) Why OLTP database designs are not generally a good idea for DWH?

OLTP cannot store historical information about the organization. It is used for storing the details of daily transactions while a data warehouse is a huge storage of historical information obtained from different data marts for making intelligent decisions about the organization.
OLTP systems will be having transactional data that’s keeping on changing, so it can’t be taken for warehousing.
Because OLTP databases are transactional data related databases. The meaning of this is these databases are used in real time to insert, update and delete data. To accomplish these tasks in real time, the model used in OLTP databases is highly normalized. The problem of using this model in Data Warehousing is, we have to join multiple tables to get a single piece of data. With the amount of historical data we deal with in DW, it is highly desirable not to have a highly normalized data model like OLTP.

10) Why should you put your DWH on a different system than your OLTP system?

Data Warehouse is a part of OLAP (On-Line Analytical Processing). It is the source from which any BI tools fetch data for Analytical, reporting or data mining purposes. It generally contains the data through the whole life cycle of the company/product. DWH contains historical, integrated, demoralized, subject oriented data.
However, on the other hand the OLTP system contains data that is generally limited to last couple of months or a year at most. The nature of data in OLTP is: current, volatile and highly normalized. Since, both systems are different in nature and functionality we should always keep them in different systems.

11) Degenerated dimension

In a data ware house, a degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.
Degenerate dimensions are often used when a fact table's grain represents transactional level data and one wishes to maintain system specific identifiers such as order numbers, invoice numbers.
The decision to use degenerate dimensions is often based on the desire to provide a direct reference back to a transactional system without the overhead of maintaining a separate dimension table.
In simple terms, the column in a fact table that does not map to any dimensions, neither it’s a measure column.
For e.g. Invoice no, Invoice line no in fact table will be a degenerate dimension (columns), provided if you don’t have a dimension called invoice.


The data items that are not facts and data items that do not fit into the existing dimensions are termed as Degenerate Dimensions.
Degenerate Dimensions are the fastest way to group similar transactions. Degenerate Dimensions are used when fact tables represent transactional data.


12) Confirmed fact

Confirmed facts are allowed to have the same name in separate tables and can be combined and compared mathematically.

13) Why fact table is in Normal form?


The Fact table is central table in schema; fact table is kept Normalized because it is very large table and so we should avoid redundant data in it.

14) What is the definition of Normalization and Denormalization?

Normalization is the process of removing redundancies.
Demoralization is the process of allowing redundancies.

OLTP uses the Normalization process and the OLAP/DW uses the demoralized process to capture greater level of detailed data (each and every transaction).

15) Junk Dimension

A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.

16) Difference between View and Materialized view

View - store the SQL statement in the database and let you use it as a table. Every time you access the view, the SQL statement executes.

Materialized view - stores the results of the SQL in table form in the database. SQL statement only executes once and after that every time you run the query, the stored result set is used. Pros include quick query results.

17 Give an example of Degenerated Dimension

Degenerated Dimension is a dimension key without corresponding dimension. Example:

In the PointOfSale Transaction Fact table, we have:

Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FP), and POS Transaction Number

Date Dimension corresponds to Date Key; Production Dimension corresponds to Production Key. In a traditional parent-child database, POS Transactional Number would be the key to the transaction header record that contains all the info valid for the transaction as a whole, such as the transaction date and store identifier. But in this dimensional model, we have already extracted this info into other dimension. Therefore, POS Transaction Number looks like a dimension key in the fact table but does not have the corresponding dimension table.

Therefore, POS Transaction Number is a degenerated dimension.

18 Rapidly Changing dimension

There is no Dimension called Rapidly changing dimension in DWH. But if you consider overall ODS tables; a rapidly changing dimension is one that holds the transactional data rather than staging data.

19) Hybrid Slowly Changing dimension

Hybrid SCD’s are combination of both SCD 1 and SCD 2.

It may happen that in a table, some columns are important and we need to track changes for them i.e. capture the historical data for them whereas in some columns even if the data changes, we don't care.

For such tables we implement Hybrid SCD’s, where in some columns are Type 1 and some are Type 2.

20) Can a dimension table contain numeric values?

Dimension table will have only textual values only. Some dimension table will have numeric values for example Customer number in a Customer dimension. Only thing here to remember is the object type of the Customer number should always a Char type.

21) How do you connect two fact tables? Is it possible?

The only way to connect two fact tables is by using conformed dimension.

22) What is meant by aggregate fact table?

An aggregate fact table stores information that has been aggregated, or summarized from a detail fact table. Aggregate fact table Ares useful in improving query performance.

Often an aggregate fact table can be maintained through the use of materialized views.

23) Different levels of Data models

Conceptual Data Model:

At this level, the data modeler attempts to identify the important entities and the relationships among them.
No attribute is specified.
No primary key is specified.

Logical Data Model:

At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database.
All attributes for each entity are specified.
The primary key for each entity specified.
Foreign keys (keys identifying the relationship between different entities) are specified.

Physical Data Model:

At this level, the data modeler will specify how the logical data model will be realized in the database schema.
The steps for physical data model design are as follows:
Convert entities into tables.
Convert relationships into foreign keys.
Convert attributes into columns.

24) Data Warehouse

A Data warehouse is a Subject-oriented, Time-varying, Integrated, Non-volatile collection of data that is used primarily in organizational decision making.

25) Need for Data Warehousing

Better business intelligence for end-users.
Reduction in time to locate access and analyze information.
Consolidation of disparate information sources.

26) Example of OLTP, ODS and DWH

OLTP : Is this medicine available in Stock?
ODS : What are the tests this patient has completed so far?
DWH : Has the incidence of some disease increased in last 5 years in Southern Region?

27) Limitations of DWH

High cost of building and on-going maintenance ($ 3-5 million)
Since it has to integrate all the data & transaction system database and hence requires more time to design & build.

28) OLAP

This is a common use of Data warehouse that involves real time access and analysis of multi-dimensional data such as sales information.

OLAP EXAMPLE:

An example OLAP database may be comprised of sales data which has been aggregated by region, product type, and sales channel. A typical OLAP query might access a multi-year sales database in order to find all product sales in each region for each product type.
After reviewing the results, an analyst might further refine the query to find sales volume for each sales channel within region/product classifications.
As a last step the analyst might want to perform year-to-year or quarter-to-quarter comparison for each sales channel.

29) Slicing and Dicing

Slicing and dicing refers to the ability to look at the database from different view points. One slice of the sales database might show all sales of product type within a region. Another slice might show all sales by sales channel within each product type. Slicing and dicing is often performed along a time axis in order to analyze trends and find patterns.

4 comments:

  1. thanks and keep continuing, how to reach u if i have any doubt regarding informatica, plz give reply, my id is saimurali222@gmail.com

    ReplyDelete
  2. Can you update on factless fact table with example. Its urgetnt

    ReplyDelete
  3. Super collection thanks for providing this blog very useful ti everyoneInformatica Online Training

    ReplyDelete