Tech2Tech
Applied Solution 1
Plan—then execute
A data modeling blueprint is a critical differentiator for an enterprise data warehouse.
by Stephen Brobst
The value of a data warehouse comes from the exploitation of relationships in data gathered and integrated from across multiple source systems within and beyond an enterprise. The key distinction here is between integration and consolidation.
Many organizations claim to have a data warehouse by virtue of collecting large volumes of data onto a single platform into a single database technology. Oftentimes, the data is organized into groups of tables that look suspiciously like the online transaction processing system files from which they were sourced. These implementations qualify more as data dumping grounds than data warehouses; while the data may be consolidated onto a single platform in such deployments, it is likely that it has not been integrated for effective decision support.
A data warehouse is distinguished by the integration of data into relational technologies that facilitate navigation and analysis across multiple subject areas of data without the need for heroics on the part of business users for traversing business relationships in the data. (See figure.)

The goal state
A best-practice data warehouse is an enterprise information asset. This means that its design is not specific to a particular analytic application but rather supports multiple departments and functions as a centralized, shared repository of information for analytics. It will contain multiple subject areas of data with historical, detailed content to support decision-making processes. A fundamental principle of cost-effective data warehousing is to extract, transform and load (ETL) data once—but to reuse that data many times across different knowledge worker communities within an enterprise.
To support the principle of reuse, the underlying data model for organizing content in the warehouse must avoid application-specific denormalizations that cause loss of relationships or detail inherent to the business data. Extensibility and flexibility of the underlying data model are critical for supporting data content and analytic applications that have not yet been conceived.
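To make the cost of an application-specific denormalization concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The table names (customer, sale, mart_monthly_sales), the columns and the sample rows are hypothetical, invented purely for illustration. The sketch suggests how a monthly summary built for one application can answer its original question yet cannot answer a later question that depends on the customer relationship, while the detailed model supports both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detailed, normalized content: each sale keeps its customer relationship.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        sale_month  TEXT,
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'retail'), (2, 'corporate');
    INSERT INTO sale VALUES
        (10, 1, '2024-01', 100.0),
        (11, 2, '2024-01', 250.0),
        (12, 1, '2024-02', 75.0);

    -- Application-specific summary: the customer relationship is discarded.
    CREATE TABLE mart_monthly_sales AS
        SELECT sale_month, SUM(amount) AS total_amount
        FROM sale
        GROUP BY sale_month;
""")

# The original application's question can be served by the summary table.
monthly_from_mart = conn.execute(
    "SELECT sale_month, total_amount FROM mart_monthly_sales ORDER BY sale_month"
).fetchall()

# A later question (sales by customer segment) needs the detail rows and
# the sale-to-customer relationship, which only the normalized tables keep.
sales_by_segment = conn.execute("""
    SELECT c.segment, SUM(s.amount)
    FROM sale AS s
    JOIN customer AS c ON c.customer_id = s.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
""").fetchall()

# The summary table has no customer column left to join on.
mart_columns = [row[1] for row in conn.execute("PRAGMA table_info(mart_monthly_sales)")]
```

In this toy case the summary table retains only the month and the total, so the segment question would force a fresh extract from the source systems, which is exactly the repeated ETL work that reuse is meant to avoid.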
A healthy data warehouse continues to extend its content to sustain the long-term value proposition for analytics within the enterprise. A key principle of success for ongoing extension to the data warehouse is that new analytic applications should be capable of leveraging new content together with existing content, without major redesigns, in order to minimize delivery times and maximize return on investment (ROI) for the information repository.
The importance of a data model blueprint
Without a data model blueprint, there will be no governing framework for organizing content in the data warehouse. In fact, the typical result is a collection of independent data marts. Each data mart aligns to the needs of a specific analytic application or department. The problem is that content is difficult to reuse when data marts are deployed in this way because the data models are usually specific to a particular purpose; data is often summarized or otherwise denormalized without regard to the needs of knowledge workers other than those sponsoring a particular project.
The inherent lack of extensibility in most data mart deployments that are not within an enterprise data model framework causes each new data mart to replicate significant data—and all of the ETL work that goes along with the acquisition of such data—that has already been provisioned to previous data marts. These data marts may or may not exist on the same platform or in the same database technology. Moreover, each data mart will inevitably source its content in a slightly different manner. This will likely cause confusion within an organization when multiple knowledge worker communities produce analytic results with inconsistent data.
Moreover, each data mart will require care and feeding to support ongoing updates, data quality management, capacity planning, performance tuning, backups and so on. Studies have shown that the cost of multiple data mart deployments is approximately 70% higher than when data can be consistently reused within an enterprise data warehouse (EDW) framework.
An alternate approach, rather than starting out with a data model blueprint, is to build out the data model incrementally as new business requirements demand expansion into existing or new subject areas of data. The advantage of the incremental approach is that it does not require up-front investment in the acquisition or construction of an enterprise logical data model (ELDM). Instead, the funding for data modeling activities is built into each analytic application project as part of its deployment rather than requiring an up-front “infrastructure tax” that would typically be associated with an enterprise logical data modeling undertaking as a prerequisite to data warehouse construction.
The problem with this alternate approach is that it will inevitably result in a non-trivial amount of rework with each new project delivery. The re-engineering that is necessary as new data requirements are integrated into project-driven data models is significant. Even with the best intentions, it is difficult to consider enterprise requirements in the context of a project-driven design approach. It is like asking a carpenter to build a house one room at a time without a blueprint. Even a highly skilled carpenter would rework substantial parts of his labor if this were the case. Studies have shown that when using the incremental data modeling approach, the amount of rework associated with each phase is between 20% and 30%. This represents both cost and time for solution delivery—both of which are precious resources.
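As a rough illustration of what that rework range implies over a multi-phase program, the sketch below totals effort across several phases. It assumes, hypothetically, that every phase has the same base effort and that rework adds a fixed fraction on top of each phase. The function name, the five-phase program and the 100-day figure are invented for the example, and real programs will not be this uniform.

```python
def total_effort_with_rework(base_effort_per_phase: float,
                             phases: int,
                             rework_fraction: float) -> float:
    """Total effort when each phase carries a fixed fraction of rework.

    Simplifying assumption: rework is proportional to the base effort of
    the phase and does not compound from one phase to the next.
    """
    if base_effort_per_phase < 0 or phases < 0:
        raise ValueError("effort and phase count must be non-negative")
    if not 0 <= rework_fraction <= 1:
        raise ValueError("rework_fraction must be between 0 and 1")
    return phases * base_effort_per_phase * (1 + rework_fraction)


# Hypothetical program: five phases of 100 person-days each.
baseline = total_effort_with_rework(100, 5, 0.00)      # blueprint-first, no rework assumed
low_estimate = total_effort_with_rework(100, 5, 0.20)  # lower bound of the cited range
high_estimate = total_effort_with_rework(100, 5, 0.30) # upper bound of the cited range

extra_low = low_estimate - baseline
extra_high = high_estimate - baseline
```

Under these assumptions, the five-phase program carries roughly 100 to 150 additional person-days of rework, the equivalent of one or more extra phases that deliver no new business value.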
While it may seem painful to embark upon an enterprise data modeling endeavor as a prerequisite to data warehouse deployment, experience shows that this approach yields the best results when managed effectively. However, a number of pitfalls must be avoided during the construction of a data model blueprint for the enterprise.
Prevailing risks
One of the most common pitfalls in pursuing an enterprise data model is to fall into the “not invented here” syndrome. Starting with a blank sheet of paper when designing an ELDM is almost always a mistake. While it is certainly true that every enterprise is unique, experience shows that, from a data modeling perspective, organizations within an industry are 10% unique, not 100%. Little justification can be made for building yet another industry ELDM for telecommunications, banking, or any other major industry. The data models already exist and can be acquired and modified with 10% customization rather than building from scratch.
Leveraging third-party assets—some of which may be open source or available through industry consortiums, depending on the industry—for logical data modeling saves time and money.
Another common trap is to attempt to create a detailed design rather than managing the scope to that of a blueprint. Destined to fail are those enterprise data modeling efforts that go down the path of exhaustive enumeration of all entities and attributes that might ever be needed by an organization. Such ventures end up becoming theoretical exercises that hypothesize possible requirements that may or may not come to fruition. An enterprise data model should focus on the fundamental entities and the relationships between those entities. Worrying about detailed attribution and multiple levels of sub-typing only becomes a distraction.
The intent should be to create a blueprint for organizing data, not a detailed design. Once a proper blueprint is in place, details can be filled in based on project requirements focused on specific business analytics. In this way, the effort for developing the blueprint is managed in its scope and cost with the details being filled in using an incremental approach as part of delivery for specific business projects. If creation of the ELDM blueprint takes more than 100 days, there is probably a problem in the level of detail in the design or in the lack of leverage from third-party data model assets.
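To show what blueprint-level scope might look like in practice, here is a minimal sketch in Python (3.9 or later). It records only fundamental entities and the relationships between them, leaving attributes and sub-types to later projects. The class names, the entity names (Customer, Account, Transaction) and the relationship labels are hypothetical and are not drawn from any particular industry model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A named business relationship between two fundamental entities."""
    parent: str
    child: str
    name: str


@dataclass
class Blueprint:
    """Entities and relationships only; attribute detail is deferred to projects."""
    entities: set[str] = field(default_factory=set)
    relationships: list[Relationship] = field(default_factory=list)

    def add_entity(self, name: str) -> None:
        self.entities.add(name)

    def relate(self, parent: str, child: str, name: str) -> None:
        # Reject relationships that point at entities missing from the blueprint.
        missing = {parent, child} - self.entities
        if missing:
            raise ValueError(f"unknown entities: {sorted(missing)}")
        self.relationships.append(Relationship(parent, child, name))


# Hypothetical fragment of a banking-style blueprint.
blueprint = Blueprint()
for entity in ("Customer", "Account", "Transaction"):
    blueprint.add_entity(entity)
blueprint.relate("Customer", "Account", "holds")
blueprint.relate("Account", "Transaction", "records")
```

The structure deliberately has no place to store attributes. That omission is the point of the sketch: detail is added when a funded project needs it, not enumerated up front.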
Of course, having a data model blueprint will not help if it does not get used. Leverage of the blueprint must be built into the processes for deploying new data into the analytic environment. Far too many data models end up being “shelfware” or nothing more than a large poster on a data administrator’s wall (with no resemblance to the reality of data warehouse content).
A clear design path is needed to translate from the ELDM to a physical data model for deployment in the warehouse. A process must also be in place for keeping the data model current with content in the data warehouse as it evolves. The data model must be a living entity that matures along with content in the warehouse rather than remaining static from a one-time design effort.
The most important pitfall to avoid is attacking the data modeling effort as an IT project. It is absolutely critical to have a business-driven approach for designing and validating the data model blueprint. Business subject matter experts must be part of the team for the successful realization of an ELDM, because data modeling is a business exercise, not a technical undertaking. Validating the data model structure through use-case testing is an essential step in the design process for the blueprint.
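One lightweight way to picture use-case testing of a blueprint is to check whether the entities a business question touches can be reached from one another through the modeled relationships. The sketch below is an assumption about how such a check might look, not a method prescribed in this article. The entity names, the relationships and the can_answer function are all hypothetical.

```python
from collections import deque

# Hypothetical blueprint: relationships between fundamental entities,
# treated as undirected for the purpose of navigation.
RELATIONSHIPS = [
    ("Customer", "Account"),
    ("Account", "Transaction"),
    ("Product", "Account"),
    ("Campaign", "Channel"),  # not yet related to customers in this blueprint
]


def build_graph(relationships):
    """Build an adjacency map from pairs of related entities."""
    graph = {}
    for left, right in relationships:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def can_answer(graph, required_entities):
    """True if every required entity is modeled and all are connected."""
    required = set(required_entities)
    if not required or not required.issubset(graph):
        return False
    # Breadth-first search from any one required entity.
    start = next(iter(required))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbor in graph[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return required.issubset(seen)


graph = build_graph(RELATIONSHIPS)

# Use case 1: spending by product for each customer.
spend_by_product = can_answer(graph, ["Customer", "Product", "Transaction"])

# Use case 2: customer response to campaigns.
campaign_response = can_answer(graph, ["Customer", "Campaign"])
```

In this toy blueprint the first use case passes and the second fails, which would prompt a conversation with business subject matter experts about how campaigns relate to customers before any physical design begins.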
Think big, start small
Once the data model blueprint is in place, it is essential to use a phased, business-driven approach in populating its content. The best practice is to think big in terms of the vision but to deploy in small increments, so as to manage risk and demonstrate business value in frequent intervals rather than with long, drawn-out projects.
The goal should be to provide business deliverables every 90 to 180 days. Fewer than 90 days tends not to be enough time to deliver projects with high value and high quality. Projects that take longer than 180 days to deliver run a greater risk of failure and will struggle to hold the attention of the business. Projects that are too big should be broken down into multiple phases, each of which has clear and measurable business value.
Managing scope to stay within these guidelines is crucial. There is often a tendency, especially at the beginning of a data warehouse initiative, to do too much in each phased deliverable. Understandably, the knowledge workers would like as much data placed into the warehouse as quickly as possible. However, it is vital to establish the pace of quick deliverables from the outset of the data warehouse initiative because this will set the tone for the whole program.
Initial projects should not be penalized in their timeframes or budgets to bring in data that might be required for later initiatives. The data model blueprint should be designed to allow future data content to be provisioned incrementally, in alignment with the funded projects that require it, rather than trying to gather all possible data with the first few projects. Big-bang initiatives rarely succeed.
Once a data model blueprint is in place, it should be no more work to incrementally provision data into the EDW than it would be to populate a data mart designed to service the same analytic requirements. In fact, the enterprise approach should prove to be less effort because the data model framework is already in place and the reuse of data reduces delivery time as critical mass is attained, usually by the third project.
Avoid chaos
The enterprise data model for a data warehouse is like the city plan for designing a metropolis. It is not an effective strategy to build out a complex system without a plan. Without a plan, chaos will ensue. Chaos in the data warehouse context translates to unacceptable total cost of ownership (TCO) associated with proliferation of multiple copies of data, inconsistent results for decision makers and difficulty in supporting advanced analytics.
The corporate landscape is littered with failed data modeling initiatives. Discipline and adherence to best practices are necessary for getting full value from an enterprise logical data model.
Stephen Brobst is chief technology officer of Teradata.