Thursday, March 13, 2014

BI Reporting - a few "must know" Concepts and Terminologies

In the BI report development world, a few key concepts and terminologies come up repeatedly. So if one is looking to make a career in this area, it is wise to know some of these concepts and terminologies.

Let us start by classifying BI reports into categories, using their functional purpose as the classifier. The classification given below is not necessarily observed or followed in every organization, but the chances that a given report fits into one of these categories are pretty high for any given organization.

Management Reports: Historically, this is the most prevalent nomenclature for data and reporting that shows things like project status, profit and loss, budget vs. actual, etc. These reports are generally reviewed by upper management and contain summary data with minimal detail.

Daily Operational Reports: This reflects the type of reporting used by bookkeepers on a daily or routine basis, mostly to scan for errors and data entry problems, to satisfy regulatory purposes, or even to back up a management report. These reports may contain data at the lowest granularity/level of the fact tables.

Custom Reports: Custom denotes something that is not off the shelf, or that needs additional manipulation in the reporting environment to suit the user's needs. Both management reports and daily operational reports can fall into this category.

Ad-Hoc Reports: An ad-hoc report is created by sending a request or query for specific information, and the query results show the most current information in the format you specify. Business users generally fire off ad-hoc queries to get instant answers to their questions.

It is also important to know at least a few concepts and techniques used in a common BI reporting environment. The concepts listed below should be well understood by any BI report developer.

Drill Down: A component of OLAP analysis. The term drill down, in the context of data analysis, refers to navigating from less detailed, aggregated information to more granular data.

Drill Through: Drill through is an action in which one moves horizontally between two items via a related link. An example of drill through is two reports in a master/detail relationship with each other: clicking a master item on the master report takes you to the details of that item on the detail report.

Drill Up (Roll Up): A specific analytical technique whereby the user navigates among levels of data, from the most detailed (down) to the most summarized (up), along a concept hierarchy. For example, when viewing data for the city of Toronto, a roll-up operation in the Location dimension would display Ontario. A further roll-up on Ontario would display data for Canada.
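To make drill down and roll up concrete, here is a minimal SQL sketch of the kind of queries a BI tool might generate behind the scenes; the F_Sales and D_Location tables and their columns are hypothetical.

    -- Drilled-down view: sales at the City level of the Location hierarchy.
    SELECT l.Country, l.Province, l.City, SUM(f.Amount) AS TotalSales
    FROM F_Sales f
    JOIN D_Location l ON l.LocationID = f.LocationID
    GROUP BY l.Country, l.Province, l.City;

    -- Rolled-up view: dropping City from the SELECT and GROUP BY aggregates
    -- the same facts one level higher (Toronto rolls up into Ontario,
    -- Ontario into Canada).
    SELECT l.Country, l.Province, SUM(f.Amount) AS TotalSales
    FROM F_Sales f
    JOIN D_Location l ON l.LocationID = f.LocationID
    GROUP BY l.Country, l.Province;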

Report Prompts: A prompt is an object that enables users to input values for filtering the data, for example by letting users select from a pick list to show data for a particular product category or region. The parameters are passed on by the report server to the database server and used in the WHERE or HAVING clause of the SQL query that selects the data for the report.
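As a rough sketch of how a prompt value typically reaches the database, consider the parameterized query below; the :p_region bind variable and all table names are hypothetical, and the exact mechanism varies by reporting tool.

    -- The report server substitutes the user's prompt selection for the
    -- bind variable before the query is executed.
    SELECT s.Region, p.CategoryName, SUM(f.Amount) AS TotalSales
    FROM F_Sales f
    JOIN D_Store s ON s.StoreKey = f.StoreKey
    JOIN D_Product p ON p.ProductKey = f.ProductKey
    WHERE s.Region = :p_region   -- value supplied by the report prompt
    GROUP BY s.Region, p.CategoryName;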

There are many more concepts, techniques and terminologies involved in BI report and dashboard development. These concepts sink in and get ingrained in a BI developer's lingo as he/she gains experience. I hope the posts so far in this blog have helped a novice get started in any one of the BI tools out there in the market. I intend to come up with more posts related to BI and Data Analytics, so please stay tuned!

References:

http://www.elegantjbi.com/resources/glossary_a.htm
http://www.sleeter.com/blog/understanding-business-intelligence-and-related-terminology/

Thursday, February 27, 2014

Business Intelligence - tool based metadata modeling

A data warehouse is designed, built and loaded with data. Fast forwarding to the next phase of the project, it would most likely be implementing a sophisticated Business Intelligence (BI) reporting and analytics platform. Enterprises big, medium and small all like to make use of sophisticated BI tools/software or services to aid decision making.

In this post, let us explore how a "BI Metadata Model" is framed to enable capabilities for authoring, viewing and modifying BI reports and interactive visualizations: online or offline, in Microsoft Office applications or in-process applications, in the office or on the go. In the majority of organizations, the tool used for BI modeling is most likely provided by one of the leading BI software/services vendors shown below in the Gartner BI Magic Quadrant image:

Gartner BI Magic Quadrant - 2014 
A BI tool supports sourcing data from heterogeneous data sources (OLAP cubes, relational databases, spreadsheets, flat files, etc.) and is not restricted to just a data warehouse as its data source. An ideal BI metadata model can be defined in terms of 3 layers: 1) Physical layer 2) Business layer 3) Presentation layer. Each layer can be equated to a folder containing multiple database objects, either in their original or a modified form. The first image shown below is a diagrammatic representation of an ideal BI metadata model. The second image shows an actual real-world Oracle BI metadata model with the 3 layers. Let us go through each of the 3 layers to gain a basic understanding of why a metadata model is built this way.

3 layer BI Metadata Model

Real-world Oracle BI Metadata Model
Physical Layer: The BI modeler imports all the required database objects from the source data warehouse. Generally, tables, views, materialized views and stored procedures are imported into this layer with very few modifications or changes. In this layer, these objects should be related to each other to form the star-schema model. If there are multiple star schemas, joins between fact tables via conformed dimensions are also made, to enable drill-through capabilities.

Business Layer: This layer pulls metadata (objects) from the physical layer and supports any logical modifications required to handle various business and user requirements. The database objects are named the way a data modeler understood them, not the way a business user would understand them, so this is where they get business-friendly names. Static filters can also be built into the tables or views to implement data-level security. Further modifications arising from business requirements are all handled in this layer.
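Conceptually, a business-layer object behaves much like a relational view that renames cryptic physical columns and bakes in a static filter. The sketch below uses hypothetical table and column names; in a real BI tool this mapping lives in the tool's metadata rather than as an actual database view.

    -- Business layer as a view: business-friendly names plus a static
    -- filter implementing a simple data-level security rule.
    CREATE VIEW biz_customer_sales AS
    SELECT c.CUST_NM AS "Customer Name",
           g.RGN_NM  AS "Region",
           f.SLS_AMT AS "Sales Amount"
    FROM w_sales_f f
    JOIN w_cust_d c ON c.CUST_WID = f.CUST_WID
    JOIN w_geo_d  g ON g.GEO_WID  = f.GEO_WID
    WHERE f.DELETE_FLG = 'N';   -- static filter: hide soft-deleted rows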

Presentation Layer: As the name suggests, the purpose of this layer is to present a very consumable form of the metadata model to business users. Grouping objects by user group or subject area and adding object-level security based on user groups and roles are some of the main things done at this level.

Following this 3-layer approach, a BI metadata model gains the following advantages:
  • Separates the logical behavior of the application from the physical model
  • Provides the ability to federate multiple physical sources to the same logical object, enabling aggregate navigation and partitioning
  • Provides dimension conformance and isolation from changes in the physical sources
  • Enables the creation of portable BI applications
  • Hides the complexity of the source data models from the user
There are many other uses and advantages to each of the 3 layers in a BI metadata model; it would take a separate blog post per layer to discuss them in depth. In its core essence, however, the 3-layer model supports ease of use and simplicity of maintenance.

If your curiosity has now gone up another level, wondering what comes after BI metadata modeling, stay tuned here for my next post on BI report development.

References:
http://www.informationweek.com/big-data/big-data-analytics/gartner-bi-magic-quadrant-winners-and-losers/d/d-id/1114013
http://docs.oracle.com/cd/E17904_01/bi.1111/e10540/intro.htm
http://www.rittmanmead.com/2011/12/agile-exadata-obiee-puzzle-pieces/

Thursday, February 13, 2014

Star and Snowflake schemas


Selecting a particular dimensional modeling approach is a difficult art. Most of us will run into this common dilemma: star schema or snowflake? Which one to choose, when, and why? In this post I will try to lay out the basic differences between star and snowflake schemas. Then we will see the scenarios in which a snowflake schema is preferred over a star schema.

To understand the concepts well, let me introduce the following OLTP tables and then create both star and snowflake schemas from them.

OLTP tables:

Product (ProductID, ProductName, PricePerUnit_$, CategoryCode, CategoryName, CategoryDesc)

Note that this Product table has redundancy, because a category of products can have multiple products in it, and hence this table is not normalized to 3NF.


Store (StoreID, StoreName, LocationID, LocationName, StateCode, State) 

Again, notice that even this table is not normalized to 3NF and has the following transitive functional dependencies: StoreID -> LocationID, LocationID -> StateCode, StateCode -> State.
  
Sale (TransactionID, ProductID, StoreID, Quantity, Amount, SaleDate)

This is a transactional table that records the point-of-sale data generated in each store.


Star schema model

So if we decide to build a star schema, we would choose not to normalize the Product and Store tables, and hence we end up joining the above tables as shown below to achieve a dimensional model. For the sake of simplicity, no other dimension tables, including a Date or Time dimension, have been introduced in this case.

Star schema tables:

In this case the OLTP tables have not changed much in structure at all, apart from getting surrogate primary keys, over which we are in full control.

D_Product (ProductID, ProductName, PricePerUnit_$, CategoryCode, CategoryName, CategoryDesc)


D_Store (StoreID, StoreName, LocationID, LocationName, StateCode, State)


F_Transaction (TransactionID, ProductID, StoreID, Quantity, Amount, SaleDate)
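To see the simplicity of this model in action, here is a sketch of a typical report query against it. Note that every dimension is exactly one join away from the fact table.

    -- Total sales by product category and state: one join per dimension.
    SELECT p.CategoryName, s.State, SUM(f.Amount) AS TotalAmount
    FROM F_Transaction f
    JOIN D_Product p ON p.ProductID = f.ProductID
    JOIN D_Store   s ON s.StoreID   = f.StoreID
    GROUP BY p.CategoryName, s.State;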



Snowflake model

To build a snowflake dimensional model, we have to normalize the Product and Store tables into multiple 3NF dimension tables, while the dimension table holding the most granular data retains foreign keys to the tables split off from it. For example, the foreign key to D_Location remains in the D_Store table, since D_Store is at the lowest level of granularity. The snowflake model shown below clearly shows the increase in dimension tables and relationships in the model.


Snowflake schema tables:

D_Product (ProductID, ProductName, PricePerUnit_$, CategoryCode)


D_Category (CategoryCode, CategoryName, CategoryDesc)


D_Store (StoreID, StoreName, LocationID)


D_Location (LocationID, LocationName, StateCode)


D_State (StateCode, State)


F_Transaction (TransactionID, ProductID, StoreID, Quantity, Amount, SaleDate)
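For comparison, here is a sketch of the same report query against the snowflake schema; it must traverse the normalized chains, so five joins are needed instead of two.

    -- Same report as in the star schema, but CategoryName and State now
    -- live in their own tables, extra joins away from the fact table.
    SELECT c.CategoryName, st.State, SUM(f.Amount) AS TotalAmount
    FROM F_Transaction f
    JOIN D_Product  p  ON p.ProductID    = f.ProductID
    JOIN D_Category c  ON c.CategoryCode = p.CategoryCode
    JOIN D_Store    s  ON s.StoreID      = f.StoreID
    JOIN D_Location l  ON l.LocationID   = s.LocationID
    JOIN D_State    st ON st.StateCode   = l.StateCode
    GROUP BY c.CategoryName, st.State;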
  

When do you use a Snowflake Schema Implementation?

Ralph Kimball, the data warehousing guru, proposes three cases where snowflake implementation is not only acceptable but is also the key to a successful design:

  • Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions

  • Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products

  • Multi-enterprise calendar dimensions because each organization has idiosyncratic fiscal periods, seasons, and holidays

Ralph Kimball recommends that in most other cases, star schemas are the better solution. Although redundancy is reduced in a normalized snowflake, more joins are required. Kimball usually advises that it is not a good idea to expose end users to a physical snowflake design, because it almost always compromises understandability and performance. So the star schema should be your preferred schema when designing your data warehouse, unless your requirements fall into one of the three cases mentioned above. Stay posted here; I will come up with more on Business Intelligence reporting next time.

Reference: http://www.oracle.com/webfolder/technetwork/tutorials/obe/db/10g/r2/owb/owb10gr2_gs/owb/lesson3/starandsnowflake.htm

Tuesday, February 4, 2014

Implementation of a DW/BI project - by a green-horn designer!

Things are exciting in the world of DW and BI. Many of us are comfortable with terminologies and concepts like dimensions, facts, star schema, snowflake schema, types and categories of facts and dimensions, etc. While still learning DW/BI in greater detail, some of us might be tempted to get real and start designing and modeling a data warehouse.

This kind of curiosity leads us to think about things like the DW/BI project phases and implementation that happen in a real-world scenario. To understand this at a very high level, we will consider the case of an accounting firm and its need for a business analytics capability. As DW/BI solution experts, our job is to gather requirements, analyze, design and implement a technical solution.

1. Requirement gathering and understanding the data: Business processes and data are central to our project and solution. Hence the first and foremost task is to know the business processes and understand the data. The processes and sources that create the data, and the grain and flow of the data, are all important things to understand and document first. Diagram 1, shown below, represents the current data flow of the accounting company for which this DW/BI solution is being built. There are multiple sources, a path and a grain, which together define the data. Once this picture is captured, the next job is to plan the project implementation.

Diagram 1

2. Project plan and phases: After gathering the requirements and gaining a basic understanding of the data, one is in good stead to decide how the project can be implemented and to propose the plan to the client. This accounting company has no database or data warehouse in place. Hence the project will start from understanding the data, 'ETL'ing it and warehousing it. So creating the ETL logic, the staging database and the data warehouse are all part of the project. After creating the data warehouse, data can be fed to an OLAP cube and ultimately be used for BI purposes in the form of reports, charts, scorecards, ad-hoc queries, etc. Apart from the milestone events shown in Diagram 2, generally more than one sub-event will precede or follow each main event.

Diagram 2
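To give a flavor of what the 'ETL'ing milestone in Diagram 2 involves, here is a minimal, hypothetical load step from a staging table into a warehouse dimension. All schema, table and sequence names are invented for illustration, and real ETL jobs add much more: change capture, error handling, auditing, etc.

    -- Load new clients from staging into the dimension, generating
    -- surrogate keys and applying simple cleansing rules.
    INSERT INTO dw.D_Client (ClientKey, ClientID, ClientName, Industry)
    SELECT dw.client_key_seq.NEXTVAL,       -- surrogate key (Oracle-style sequence)
           s.ClientID,                      -- natural key from the source system
           TRIM(s.ClientName),              -- cleansing: strip stray whitespace
           COALESCE(s.Industry, 'Unknown')  -- default for missing values
    FROM stg.Client s
    WHERE NOT EXISTS (SELECT 1 FROM dw.D_Client d
                      WHERE d.ClientID = s.ClientID);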

3. The outcome - DW/BI solution: Diagram 3 represents the high-level design of the DW/BI solution that will provide business analytics capabilities to the accounting firm. The firm will now be able to fire ad-hoc and "what-if" queries at the BI system and obtain answers. The company's decisions and its recommendations to clients will be based on the answers the BI solution provides to the accounting firm's business users and decision makers!

Diagram 3


The accounting firm is counting on us while making its decisions! As DW/BI solution experts, it is our job to provide correct and accurate BI answers. To succeed in such a job we need more technical skills and expertise, along with domain knowledge. Count on Kimball and Ross to provide us just that!

Inputs:
1) Typical data flow in an accounting firm - Professor Boulger, MIS Dept., Eller College of Management

Monday, January 27, 2014

Why "Kimball model" is the preferred way to build a Data Warehouse?

Okay! We have to agree with the philosophical thought that this "world is full of contradictions". How else would one explain this sudden change in approach to building a database (a data warehouse is also a database)?

It took us almost a full semester to understand the normalized database model taught in the EDM class. Come winter break, things have changed. Ralph Kimball, the father of dimensional modeling, says he wants nothing to do with 'normalization'. He wants it all the 'de-normalized' way; strange, isn't it!?

So why is Kimball advocating a completely different approach to database (data warehouse) model design? Other than the obvious reason stated above (the world is full of contradictions), what are the main reasons behind Kimball advocating such a polar-opposite design approach?

Studying various resources like blogs, whitepapers, Wikipedia and textbooks allowed me to find many factors that support Kimball and his de-normalized, dimensional, star-schema approach to data warehousing. So, to kill all our confusion and dilemmas, let us now look into some of these factors in favor of the "Kimball model", as we will call it here.


Complex ER Model

Simple Kimball (Star-Schema) Model
Ease of use for the business users: BI and DW solutions are designed to be consumed by business users. Try telling them that, to find the count of employees in a particular department of their organization, they have to find their way through 6 normalized tables, 5 join conditions and a couple of WHERE-clause filters. That will drive them crazy, and as the BI resource responsible, your job will be on the line! Now think of a Kimball model in that scenario. Isn't it easy for users when they can find all the information needed about an employee in one dimension table, plus whatever measures concern them in a fact table, with just 1 join between these 2 tables? That is the first and foremost advantage that the Kimball model brings to the table.
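To make the contrast concrete, here is a sketch of the two queries side by side; every table and column name below is invented for illustration.

    -- Normalized (ER) model: six tables and five joins to count active
    -- employees per department.
    SELECT d.dept_name, COUNT(*) AS employee_count
    FROM employee e
    JOIN job_assignment ja ON ja.emp_id  = e.emp_id
    JOIN position p        ON p.pos_id   = ja.pos_id
    JOIN org_unit ou       ON ou.unit_id = p.unit_id
    JOIN department d      ON d.dept_id  = ou.dept_id
    JOIN company co        ON co.comp_id = d.comp_id
    WHERE co.comp_name = 'Acme Corp' AND e.status = 'ACTIVE'
    GROUP BY d.dept_name;

    -- Kimball model: one dimension, one fact, one join (assuming one
    -- fact row per active employee).
    SELECT d.Department, COUNT(*) AS employee_count
    FROM F_Headcount f
    JOIN D_Employee d ON d.EmployeeKey = f.EmployeeKey
    GROUP BY d.Department;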

Performance: Think of an ER-model query and data set: 10 discrete tables, 9 joins, sorting and some more column-level calculations over a couple of million rows of data. The complexity of that query means the database software would be really slow in churning out the result set to the user. This is where the Kimball model comes up trumps. Dimension tables surrounding a centralized fact table mean the optimizer can predict its query execution plan with near certainty. This means the database performs faster, and therefore BI report users get answers to their ad-hoc queries almost instantly.

Enhancement, maintenance and extensibility: Let us now think from the perspective of a database modeler or designer. The model is developed and in use. Now comes a change request that demands another attribute be shown in a report. In an ER model, each such modification would have to be considered carefully because of the complex dependencies between database tables, and could have ripple effects. But a Kimball model supports such changes easily: all the modeler has to do is find the correct dimension and add a column, altering that table without affecting any other tables.
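For instance, a hypothetical change request to show employees' hire dates in reports reduces to a single, isolated change to one dimension table:

    -- Add the new attribute to the dimension; no fact table, join or
    -- other dimension is affected.
    ALTER TABLE D_Employee ADD HireDate DATE;
    -- The column is populated from the source system during the next
    -- ETL run and then exposed in the BI metadata model.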

No wonder, then, with all these great advantages over the traditional transaction-supporting database model, the Kimball model is seen as the best model for building a data warehouse. And knowing these advantages makes it much easier to understand and accept the contradictions.

Now that we have agreed to follow Kimball and his model, we will look into more of the interesting things this type of database modeling throws at us. But I will reserve that for another day and time. By then, I hope, we will have acquainted ourselves well with terminologies like data warehouse, business intelligence, star schema, fact and dimension tables, and types of facts and dimensions.

References: 
http://en.wikipedia.org/wiki/Dimensional_modeling
http://www.kimballgroup.com/1997/08/02/a-dimensional-modeling-manifesto/