Tech2Tech

Applied Solution 1

Plan—then execute

A data modeling blueprint is a critical differentiator for an enterprise data warehouse.

The value of a data warehouse comes from the exploitation of relationships in data gathered and integrated from across multiple source systems within and beyond an enterprise. A key distinction is the difference between integration and consolidation.

Many organizations claim to have a data warehouse by virtue of collecting large volumes of data onto a single platform in a single database technology. Often, the data is organized into groups of tables that look suspiciously like the online transaction processing system files from which they were sourced. These implementations qualify more as data dumping grounds than data warehouses: the data may be consolidated onto a single platform, but it has likely not been integrated for effective decision support.

A data warehouse is distinguished by the integration of data into relational technologies that facilitate navigation and analysis across multiple subject areas, without requiring heroics from business users to traverse business relationships in the data. (See figure.)

Figure: Decision support requires data integration
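
The difference between consolidation and integration can be sketched in a few lines of Python using the standard-library sqlite3 module. This is a minimal illustration, not a reference design: every table name, column name and data value below is hypothetical. The first two tables mimic source-system files that have merely been copied onto one platform and share no common key. The remaining three are integrated around a shared customer entity, so a question that spans two subject areas becomes a plain join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Consolidated only: copies of two source systems, each with its own
    -- way of identifying a customer (hypothetical tables, left empty here).
    CREATE TABLE billing_accounts (acct_no TEXT, cust_name TEXT, balance REAL);
    CREATE TABLE support_tickets  (ticket_id INTEGER, caller_email TEXT, topic TEXT);

    -- Integrated: one shared customer entity that both subject areas reference.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE account  (account_id TEXT PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           balance REAL);
    CREATE TABLE ticket   (ticket_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(customer_id),
                           topic TEXT);

    INSERT INTO customer VALUES (1, 'Ana Diaz', 'ana@example.com'),
                                (2, 'Bo Chen',  'bo@example.com');
    INSERT INTO account  VALUES ('A-100', 1, 250.0), ('A-200', 2, 0.0);
    INSERT INTO ticket   VALUES (10, 1, 'billing dispute'), (11, 1, 'late fee');
""")

# A cross-subject question: which customers with an outstanding balance
# have opened support tickets, and how many?
rows = conn.execute("""
    SELECT c.name, a.balance, COUNT(t.ticket_id) AS tickets
    FROM customer c
    JOIN account a ON a.customer_id = c.customer_id
    JOIN ticket  t ON t.customer_id = c.customer_id
    WHERE a.balance > 0
    GROUP BY c.customer_id, c.name, a.balance
""").fetchall()
print(rows)  # [('Ana Diaz', 250.0, 2)]
```

Answering the same question from the two consolidated tables would require matching customer names against e-mail addresses, which is exactly the kind of heroics an integrated model is meant to remove.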

The goal state

A best-practice data warehouse is an enterprise information asset. Its design is not specific to a particular analytic application; rather, it supports multiple departments and functions as a centralized, shared repository of information for analytics. It contains multiple subject areas of data with historical, detailed content to support decision-making processes. A fundamental principle of cost-effective data warehousing is to extract, transform and load (ETL) data once but to reuse that data many times across different knowledge worker communities within an enterprise.

To support the principle of reuse, the underlying data model for organizing content in the warehouse must avoid application-specific denormalizations that cause loss of relationships or detail inherent to the business data. Extensibility and flexibility of the underlying data model are critical for supporting data content and analytic applications that have not yet been conceived.
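
A small Python sketch shows why application-specific summaries limit reuse. The sample sales rows and the summarize helper are hypothetical, invented purely for illustration. A store-level summary built for one application cannot be turned back into product or customer totals, whereas the detailed rows can answer all three questions.

```python
from collections import defaultdict

# Detailed, atomic sales facts (hypothetical sample data).
sales = [
    {"store": "S1", "product": "widget", "customer": "C1", "amount": 40.0},
    {"store": "S1", "product": "gadget", "customer": "C2", "amount": 60.0},
    {"store": "S2", "product": "widget", "customer": "C1", "amount": 25.0},
]

def summarize(rows, key):
    """Total sales amount grouped by one attribute of the detail rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

# A summary built for one application keeps only store totals...
by_store = summarize(sales, "store")

# ...so product and customer questions can only be answered from the detail.
by_product = summarize(sales, "product")
by_customer = summarize(sales, "customer")

print(by_store)     # {'S1': 100.0, 'S2': 25.0}
print(by_product)   # {'widget': 65.0, 'gadget': 60.0}
print(by_customer)  # {'C1': 65.0, 'C2': 60.0}
```

If only by_store had been loaded, the product and customer relationships would be gone and a second extract from the source system would be needed.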

A healthy data warehouse continues to extend its content to sustain the long-term value proposition for analytics within the enterprise. A key principle for successful ongoing extension of the data warehouse is that new analytic applications should be able to leverage new content together with existing content, without major redesigns, in order to minimize delivery times and maximize return on investment (ROI) for the information repository.

The importance of a data model blueprint

Without a data model blueprint, there is no governing framework for organizing content in the data warehouse. The typical result is a collection of independent data marts, each aligned to the needs of a specific analytic application or department. Content is difficult to reuse when data marts are deployed in this way because the data models are usually specific to a particular purpose; data is often summarized or otherwise denormalized without regard to the needs of knowledge workers other than those sponsoring a particular project.

Data marts deployed outside an enterprise data model framework inherently lack extensibility. Each new data mart replicates significant data that has already been provisioned to previous data marts, along with all of the ETL work required to acquire it. These data marts may or may not exist on the same platform or in the same database technology. Moreover, each data mart will inevitably source its content in a slightly different manner. This is likely to cause confusion within an organization when multiple knowledge worker communities produce analytic results from inconsistent data.
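
The duplication of ETL work can be made concrete with a simple count. The sketch below assumes three hypothetical analytic applications and the source feeds each one needs; the names are illustrative only. It counts feeds, not cost, and makes no attempt to model real cost differences.

```python
# Hypothetical source feeds needed by three analytic applications.
needs = {
    "churn_analysis": {"customer", "billing", "usage"},
    "revenue_reporting": {"customer", "billing", "orders"},
    "campaign_targeting": {"customer", "usage", "orders"},
}

# Independent data marts: every mart builds its own feed for every source it needs.
mart_feeds = sum(len(sources) for sources in needs.values())

# Shared warehouse: each source is loaded once and reused by every application.
edw_feeds = len(set().union(*needs.values()))

print(mart_feeds, edw_feeds)  # 9 4
```

In this toy case the customer source alone is extracted three times under the data mart approach, and each of those extracts is a chance for the sourcing logic to diverge.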

Each data mart also requires care and feeding to support ongoing updates, data quality management, capacity planning, performance tuning, backups and so on. Studies have shown that the cost of multiple data mart deployments is approximately 70% higher than when data can be consistently reused within an enterprise data warehouse (EDW) framework.

An alternative to starting with a data model blueprint is to build out the data model incrementally as new business requirements demand expansion into existing or new subject areas of data. The advantage of the incremental approach is that it does not require up-front investment in the acquisition or construction of an enterprise logical data model (ELDM). Instead, the funding for data modeling activities is built into each analytic application project as part of its deployment, avoiding the up-front “infrastructure tax” typically associated with an enterprise logical data modeling undertaking as a prerequisite to data warehouse construction.

The problem with this incremental approach is that it inevitably results in a non-trivial amount of rework with each new project delivery. Significant re-engineering is necessary as new data requirements are integrated into project-driven data models. Even with the best intentions, it is difficult to consider enterprise requirements in the context of a project-driven design approach. It is like asking a carpenter to build a house one room at a time without a blueprint: even a highly skilled carpenter would have to rework substantial parts of the job. Studies have shown that when using the incremental data modeling approach, the amount of rework associated with each phase is between 20% and 30%. This rework costs both money and delivery time, both of which are precious resources.
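
To see what a 20% to 30% rework rate means over a program, consider a rough worked example in Python. It rests on assumptions that are not in the article: five phases of 100 person-days each, with the rework percentage applied as extra effort on every phase. The function name and the figures are illustrative only.

```python
def effort_with_rework(base_days, phases, rework_pct):
    """Total person-days if every phase carries rework_pct percent extra effort."""
    return base_days * phases * (100 + rework_pct) / 100

planned = effort_with_rework(100, 5, 0)   # no rework
low = effort_with_rework(100, 5, 20)      # 20% rework per phase
high = effort_with_rework(100, 5, 30)     # 30% rework per phase

print(planned, low, high)  # 500.0 600.0 650.0
```

Under these assumptions the program absorbs between 100 and 150 extra person-days, the equivalent of one or more additional phases that deliver no new business value.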

While it may seem painful to embark upon an enterprise data modeling endeavor as a prerequisite to data warehouse deployment, experience shows that this approach yields the best results when managed effectively. However, a number of pitfalls must be avoided during the construction of a data model blueprint for the enterprise.

Prevailing risks

One of the most common pitfalls in pursuing an enterprise data model is the “not invented here” syndrome. Starting with a blank sheet of paper when designing an ELDM is almost always a mistake. While it is certainly true that every enterprise is unique, experience shows that, from a data modeling perspective, organizations within an industry are 10% unique, not 100%. Little justification can be made for building yet another industry ELDM for telecommunications, banking or any other major industry. The data models already exist and can be acquired and modified with 10% customization rather than built from scratch.

Leveraging third-party logical data modeling assets, some of which may be open source or available through industry consortiums depending on the industry, saves time and money.

Another common trap is attempting to create a detailed design rather than limiting the scope to that of a blueprint. Enterprise data modeling efforts that go down the path of exhaustively enumerating all entities and attributes that an organization might ever need are destined to fail. Such ventures end up becoming theoretical exercises that hypothesize requirements that may or may not come to fruition. An enterprise data model should focus on the fundamental entities and the relationships between those entities. Worrying about detailed attribution and multiple levels of sub-typing only becomes a distraction.

The intent should be to create a blueprint for organizing data, not a detailed design. Once a proper blueprint is in place, details can be filled in based on project requirements focused on specific business analytics. In this way, the scope and cost of developing the blueprint are kept in check, with the details filled in incrementally as part of delivery for specific business projects. If creation of the ELDM blueprint takes more than 100 days, there is probably a problem with the level of detail in the design or with insufficient leverage of third-party data model assets.

Of course, having a data model blueprint will not help if it does not get used. Use of the blueprint must be built into the processes for deploying new data into the analytic environment. Far too many data models end up being “shelfware” or nothing more than a large poster on a data administrator’s wall (with no resemblance to the reality of data warehouse content).

A clear design path is needed to translate from the ELDM to a physical data model for deployment in the warehouse. A process must also be in place for keeping the data model current with content in the data warehouse as it evolves. The data model must be a living entity that matures along with content in the warehouse rather than remaining static from a one-time design effort.
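
One way to keep a model from drifting into shelfware is to compare it routinely against what is actually deployed. The sketch below is a minimal illustration using Python's standard-library sqlite3 module as a stand-in for the warehouse catalog. The blueprint dictionary, the table names and the drift function are all hypothetical; a real warehouse would expose its catalog through its own system views.

```python
import sqlite3

# Hypothetical blueprint: entity name -> documented attributes.
blueprint = {
    "customer": {"customer_id", "name"},
    "account": {"account_id", "customer_id", "balance"},
}

# Stand-in for the deployed warehouse (hypothetical tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT, segment TEXT);
    CREATE TABLE ticket (ticket_id INTEGER, customer_id INTEGER);
""")

def physical_schema(conn):
    """Read table name -> set of column names from the SQLite catalog."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return {t: {col[1] for col in conn.execute(f"PRAGMA table_info({t})")}
            for t in tables}

def drift(blueprint, physical):
    """List the places where the model and the deployed tables disagree."""
    report = []
    for entity in sorted(blueprint.keys() - physical.keys()):
        report.append(f"in model, not deployed: {entity}")
    for table in sorted(physical.keys() - blueprint.keys()):
        report.append(f"deployed, not in model: {table}")
    for name in sorted(blueprint.keys() & physical.keys()):
        for col in sorted(physical[name] - blueprint[name]):
            report.append(f"undocumented column: {name}.{col}")
    return report

report = drift(blueprint, physical_schema(conn))
for line in report:
    print(line)
```

An entity that is in the model but not yet deployed is expected, since the blueprint runs ahead of delivery. Tables and columns that are deployed but absent from the model are the signal that the model has stopped being a living document.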

The most important pitfall to avoid is attacking the data modeling effort as an IT project. It is absolutely critical to have a business-driven approach for designing and validating the data model blueprint. Business subject matter experts must be part of the team for the successful realization of an ELDM. It is critical to understand that data modeling is a business exercise, not a technical undertaking. Validation of the data model through use-case testing of the data model structure is a critical step in the design process for the blueprint.
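
Use-case testing can start as a very simple structural check: for each business question, can the entities it involves be related through the blueprint at all? The following Python sketch assumes a hypothetical set of entities, relationships and business questions. It treats the blueprint as an undirected graph and uses a breadth-first search to test for a relationship path. A path is necessary but not sufficient; business experts still need to confirm that the path means what the question intends.

```python
from collections import deque

# Hypothetical blueprint: pairs of entities that are directly related.
relationships = [
    ("customer", "account"),
    ("account", "transaction"),
    ("customer", "campaign_response"),
    ("product", "transaction"),
]

def build_graph(edges):
    """Turn relationship pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def can_relate(graph, start, goal):
    """Breadth-first search: is there a relationship path from start to goal?"""
    if start not in graph or goal not in graph:
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

graph = build_graph(relationships)

# Hypothetical business questions and the entities each one must connect.
use_cases = {
    "Which products do campaign responders buy?": ("campaign_response", "product"),
    "Which suppliers serve our top customers?": ("customer", "supplier"),
}
results = {q: can_relate(graph, a, b) for q, (a, b) in use_cases.items()}
for question, ok in results.items():
    print("supported" if ok else "GAP", "-", question)
```

Here the first question is supported through customer, account and transaction, while the second exposes a gap: the blueprint has no supplier entity.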

Think big, start small

Once the data model blueprint is in place, it is essential to use a phased, business-driven approach in populating its content. The best practice is to think big in terms of the vision but to deploy in small increments, so as to manage risk and demonstrate business value at frequent intervals rather than through long, drawn-out projects.

The goal should be to provide business deliverables every 90 to 180 days. Less than 90 days tends not to be enough time to deliver projects with high value and high quality. Projects that take longer than 180 days to deliver run the risk of failing and will struggle to keep the attention of the business. Projects that are too big should be broken down into multiple phases, each of which has clear and measurable business value.
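
The 90-to-180-day guideline can be applied as a quick sanity check on project estimates. The Python sketch below assumes an estimate expressed in calendar days and splits it into the fewest equal-length phases that fit under the 180-day ceiling. Equal-length phases are a simplifying assumption; in practice phase boundaries should follow business deliverables, not arithmetic. The function names are hypothetical.

```python
import math

MIN_DAYS, MAX_DAYS = 90, 180  # delivery window from the guideline above

def plan_phases(total_days):
    """Split an estimate into the fewest equal phases of at most MAX_DAYS each."""
    phases = max(1, math.ceil(total_days / MAX_DAYS))
    per_phase = math.ceil(total_days / phases)
    return phases, per_phase

def in_window(days):
    """True if a phase length falls inside the recommended delivery window."""
    return MIN_DAYS <= days <= MAX_DAYS

for estimate in (60, 150, 400):
    phases, per_phase = plan_phases(estimate)
    print(estimate, phases, per_phase, in_window(per_phase))
# 60 1 60 False
# 150 1 150 True
# 400 3 134 True
```

The 60-day case is flagged because it falls below the window, which suggests the scope may be too thin to deliver high value; the 400-day case becomes three phases of about 134 days each.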

Managing scope to stay within these guidelines is crucial. There is often a tendency, especially at the beginning of a data warehouse initiative, to do too much in each phased deliverable. Understandably, the knowledge workers would like as much data placed into the warehouse as quickly as possible. However, it is vital to establish the pace of quick deliverables from the outset of the data warehouse initiative because this will set the tone for the whole program.

Initial projects should not be penalized in their timeframes or budgets to bring in data that might be required for later initiatives. The data model blueprint should be designed to allow future data content to be provisioned incrementally, in alignment with the funded projects that require it, rather than trying to gather all possible data with the first few projects. Big-bang initiatives rarely succeed.

Once a data model blueprint is in place, it should be no more work to incrementally provision data into the EDW than it would be to populate a data mart designed to service the same analytic requirements. In fact, the enterprise approach should prove to be less effort because the data model framework is already in place and the reuse of data reduces delivery time once critical mass is attained, usually by the third project.

Avoid chaos

The enterprise data model for a data warehouse is like the city plan for designing a metropolis. It is not an effective strategy to build out a complex system without a plan. Without a plan, chaos will ensue. Chaos in the data warehouse context translates to unacceptable total cost of ownership (TCO) associated with proliferation of multiple copies of data, inconsistent results for decision makers and difficulty in supporting advanced analytics.

The corporate landscape is littered with failed data modeling initiatives. Discipline and adherence to best practices are necessary for getting full value from an enterprise logical data model.

