Bridge The Gap
Data services can provide a critical link in implementing an Active Enterprise Intelligence environment.
by Mark Madsen
For 20 years, organizations have focused on getting data into a data warehouse and reporting on it. The challenge now is how to bring data to the enterprise in forms other than through business intelligence (BI) tools. Organizations must focus on data use outside the familiar environment if they want to move to the next level of sophistication. The data warehouse architecture is no longer a vertical stack from extract, transform and load (ETL) processes through a database to BI tools. The new model for an active data warehouse is as a platform for data use and distribution.
A matter of access
Now that many companies have data warehouses, people are looking for new ways to take advantage of them. This means increased focus on active data warehousing, operational or pervasive BI and external access via Web applications. In most cases, the applications are not directly accessing the data warehouse. Instead, developers copy data to their local systems and build solutions there. The resulting proliferation of data and additional integration complexity make this a poor long-term solution. The primary reasons given for duplicating data rather than accessing the data warehouse are:
- Control over the data that applications are consuming
- Perceived complexity of the database
- Worries about performance when accessing the data warehouse
To maintain the data warehouse as a central repository for meaningful data, you need to understand the developer's technologies and approach - not all of the details, but enough to get the necessary data to developers for easy use. Failure to grasp this will result in poorly designed systems that create integration problems and relegate the data warehouse to narrow query and reporting uses.
When approached about a data services solution, most data warehouse managers ask, “Why can’t we use the BI tools we have?” The answer: Operational BI isn’t done directly in such tools because most are general-purpose products designed to work in stand-alone fashion. They are not meant for highly constrained uses, embedding or access over a network by applications.
Application developers need the ability to retrieve and display data in their environment, which means the traditional user interface isn't acceptable. Very few BI tools can be easily embedded into an application.
To access a data warehouse using an application, you generally program directly with a software development kit (SDK) or primitive set of application programming interfaces (APIs), which are often complex and hard to use. They must be prepared in the languages the vendor requires on the server where the software is installed - none of which is conducive to prevalent development techniques. Consequently, developers create their own display libraries and request copies of data in their local application database.
The Requestor
To understand how to make developers' jobs easier, look at how they access data in their applications. Web applications can access data in two ways: through the Web client running within a browser, or through the application logic running on a server within a Web framework. Browser-based clients are constrained to a narrow set of access techniques. The desired model for client-side access is generally Web services.
Server-side access by Web applications is often done utilizing a data access framework such as Hibernate, though this is not required. Here you will find that some people want Web services to access data, while others favor a direct or framework access model.
Traditional enterprise applications rarely use Web services, though this is changing as applications are modernized. Almost all of these systems are written in Java, C++, C# and .NET. Web applications tend to be more diverse, using a broader range of scripting languages and frameworks. This means vendor-supplied SDKs and APIs are not very useful because they offer access using a small number of languages.
Compounding this challenge, developers are usually told to write structured query language (SQL) if they want to access the data warehouse. But most developers don’t know the language well. Poor SQL combined with large data volumes and high numbers of concurrent application users can ruin performance.
Given the many different types of applications and technologies that could be used, what is needed is a single, common mechanism for applications to access the data warehouse. It must simplify the application architecture as well as control the data and performance.
Understand data services
A data service is the common technology for delivering data across the various applications. To determine what must be built, a clear definition is required. Read-only services that provide controlled and scalable access to data are needed for the data warehouse. There’s a difference between providing data to consumers and creating or modifying data. The former is a data service; the latter is a transaction service.
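The distinction can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not a recommended implementation: an in-memory SQLite database stands in for the data warehouse, and the `product` table, the `get_products` call and the `MAX_ROWS` cap are all hypothetical names invented for the example. The service exposes one fixed, parameterized query with a bounded result size, and the connection refuses writes.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical cap so no single call can pull an unbounded result


def open_warehouse() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the warehouse, then make it read-only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO product (product_id, name) VALUES (?, ?)",
        [(1, "Widget"), (2, "Gadget")],
    )
    conn.commit()
    # SQLite-specific switch: statements that would change data now fail.
    conn.execute("PRAGMA query_only = ON")
    return conn


def get_products(conn: sqlite3.Connection, name_prefix: str = "",
                 limit: int = MAX_ROWS) -> list:
    """Data service: callers pass parameters, never SQL text."""
    limit = max(1, min(limit, MAX_ROWS))  # clamp the requested size
    cursor = conn.execute(
        "SELECT product_id, name FROM product "
        "WHERE name LIKE ? ORDER BY product_id LIMIT ?",
        (name_prefix + "%", limit),
    )
    return [{"product_id": pid, "name": name} for pid, name in cursor]
```

A call such as `get_products(open_warehouse(), "Ga")` returns only the matching product rows. An attempt to run an `UPDATE` through the same connection raises an error, which is the behavior a transaction service, deployed separately, would exist to provide.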
Some consider a data service to be something one calls to get and render a report, find cross-sell offers for a call center application or score a customer. While these return data, they are actually invoking logic from an application. Accordingly, these are BI or decision services, not data services. It’s important to draw these distinctions because data services are meant to supply basic data access capabilities, not to provide higher-level business functions. Each type of service will have slightly different technical and design requirements and will influence how you implement the services.
Data services change some common assumptions about the data warehouse. Because they are used by applications, there is no rendering of reports - that would be a BI service. In fact, a BI tool could be broken down into two sets of services: one for data requests and one for report or graph display requests. Unfortunately, none on the market today is capable of more than basic BI service calls, and the technical implementations are both complex and unscalable.
Creating a service layer over the data warehouse provides a single means to access it, so all applications, regardless of technology and location, can leverage the data. Developers need to learn only one set of common, open standards-based calls for any of the data they need. Since the technology is familiar, there are no complaints about learning SQL or trying to put object-relational mappers on top of the data warehouse - something that often generates bad SQL for anything beyond simple queries.
Build data services
Implementation of a data services solution should start with understanding the scope of what developers are accessing and how they model the data in their application. The way customer data is modeled in the database and the way developers view a customer might not match exactly. Therefore, you must determine how to translate the data into their terms without making the service so specific that only one part of the application can use it.
You will be building services to different levels of abstraction, so it is important to keep these levels in mind during implementation. (See figure, page 3.) Three levels of granularity exist for data services:
- Data level. This is the basic level for data retrieval, hiding the data model or database behind a service API. The purpose is to give developers access to a row or set of rows for a single basic object - for example, a product. At this level, the service provides mostly join logic and minimal transformation.
- Business level. The business level offers a more abstracted interface; the step up from the data level is similar to the difference between a physical and a logical model in data modeling. For example, you might have a set of "party" tables modeled in the data warehouse. The problem for application developers is that "getParty()" isn't a useful call for them because of the implied semantics in the data. What they care about in the application is customers or suppliers, so the business-level API should have services like "getCustomer()" and "getSupplier()" instead. The bulk of data services is likely to fall into this area. One thing many of these types of objects have in common is that they often can't be returned easily from a single query in a row set. At this level you are dealing with objects, not result sets. For example, a business customer has a bill-to address, multiple ship-to addresses and multiple customer contacts. This isn't something easily crammed into a single row set. When the service returns data, it will be packaged in a form that can accommodate this disparate data, such as extensible markup language (XML).
- Application level. Services at this level are specific to the consuming application and can generally be built from lower-level services. This level exists to make a service that is specifically what the application requires. More mapping, specific ways of packaging the data, as well as changing or sorting columns might be required. An example is services to retrieve a list of preferred suppliers or high-value customers. This encapsulates the data rules and saves a programmer from writing code to filter the full set of suppliers or customers.
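The three levels can be sketched in a few lines of code. Everything here is a hypothetical illustration: the in-memory `PARTY_ROWS` and `ADDRESS_ROWS` lists stand in for warehouse tables, and the function names, the `Customer` class and the XML element names are assumptions made for the example rather than part of any product or standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical warehouse rows: a generic "party" table and its addresses.
PARTY_ROWS = [
    {"party_id": 1, "role": "customer", "name": "Acme Corp", "preferred": False},
    {"party_id": 2, "role": "supplier", "name": "Bolt Supply", "preferred": True},
    {"party_id": 3, "role": "supplier", "name": "Nut Works", "preferred": False},
]
ADDRESS_ROWS = [
    {"party_id": 1, "kind": "bill-to", "city": "Dayton"},
    {"party_id": 1, "kind": "ship-to", "city": "Austin"},
    {"party_id": 1, "kind": "ship-to", "city": "Reno"},
]


def get_party_rows(role: str) -> list:
    """Data level: raw rows for one basic object, with the tables hidden."""
    return [row for row in PARTY_ROWS if row["role"] == role]


def get_address_rows(party_id: int) -> list:
    """Data level: raw address rows for one party."""
    return [row for row in ADDRESS_ROWS if row["party_id"] == party_id]


@dataclass
class Customer:
    """Business level: an object with nested parts, not a single row set."""
    party_id: int
    name: str
    bill_to: Optional[str] = None
    ship_to: list = field(default_factory=list)


def get_customer(party_id: int) -> Optional[Customer]:
    """Business level: assemble a customer from several data-level calls."""
    for row in get_party_rows("customer"):
        if row["party_id"] == party_id:
            customer = Customer(party_id=party_id, name=row["name"])
            for address in get_address_rows(party_id):
                if address["kind"] == "bill-to":
                    customer.bill_to = address["city"]
                else:
                    customer.ship_to.append(address["city"])
            return customer
    return None


def customer_to_xml(customer: Customer) -> str:
    """Package the nested object in XML, which can hold repeating elements."""
    root = ET.Element("customer", {"id": str(customer.party_id)})
    ET.SubElement(root, "name").text = customer.name
    ET.SubElement(root, "billTo").text = customer.bill_to
    for city in customer.ship_to:
        ET.SubElement(root, "shipTo").text = city
    return ET.tostring(root, encoding="unicode")


def get_preferred_suppliers() -> list:
    """Application level: the filtering rule lives in the service, not the caller."""
    return [row["name"] for row in get_party_rows("supplier") if row["preferred"]]
```

In this sketch the application developer never sees the "party" abstraction. The business-level call returns one customer with its bill-to and ship-to addresses attached, and the application-level call spares the caller from filtering the full supplier list.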
Performance considerations can also drive application-level services. For example, an application could call services to get a list of preferred customers, then customer addresses, then metrics like customer lifetime value. This would require the application to make multiple calls in serial order, slowing response. A service that encapsulates these generally runs entirely at the server level and packages the data for a one-time response.
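The cost of serial calls can be illustrated with a small simulation. It is a sketch under stated assumptions: the `Client` class simply counts calls as a stand-in for network round trips, and the service names and sample data are hypothetical.

```python
# Hypothetical data behind three fine-grained services.
PREFERRED = [101, 102]
ADDRESSES = {101: "Dayton", 102: "Austin"}
LIFETIME_VALUE = {101: 25000.0, 102: 9800.0}


def get_preferred_customers() -> list:
    return list(PREFERRED)


def get_customer_address(customer_id: int) -> str:
    return ADDRESSES[customer_id]


def get_lifetime_value(customer_id: int) -> float:
    return LIFETIME_VALUE[customer_id]


def get_preferred_customer_profiles() -> list:
    """Application-level service: composes the lower-level calls on the server."""
    return [
        {
            "customer_id": cid,
            "address": get_customer_address(cid),
            "lifetime_value": get_lifetime_value(cid),
        }
        for cid in get_preferred_customers()
    ]


class Client:
    """Stand-in for a remote caller; each call() counts as one round trip."""

    def __init__(self) -> None:
        self.round_trips = 0

    def call(self, service, *args):
        self.round_trips += 1
        return service(*args)


def fetch_profiles_serially(client: Client) -> list:
    """The application stitches the data together: 1 + 2N round trips."""
    profiles = []
    for cid in client.call(get_preferred_customers):
        profiles.append({
            "customer_id": cid,
            "address": client.call(get_customer_address, cid),
            "lifetime_value": client.call(get_lifetime_value, cid),
        })
    return profiles


def fetch_profiles_once(client: Client) -> list:
    """The application asks once; the composition happens server-side."""
    return client.call(get_preferred_customer_profiles)
```

With two preferred customers, the serial approach makes five calls and the composite service makes one, while both return the same data. The gap widens linearly as the customer list grows.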
Tools for the job
Five categories of tools can be used to construct data services:
- Native coding tools like Java and .NET combined with Web service libraries or tools
- Application tools such as SAP NetWeaver or Oracle's Fusion Middleware
- Web service development tools provided by companies like Progress and IBM
- ETL tools that offer a real-time callable option via Web services
- Data federation tools from companies like Composite Software
Each category is limited in some way, so there is no best option. In many cases, the easiest option is data federation tools, because they offer the simplicity of a data-oriented tool with the easy publishing of service endpoints.
It will take skill and knowledge to evaluate the options and choose tools appropriate for your environment. The knowledge often does not reside entirely in one group within an organization. Developers understand the code and services but not the data; the data people understand the data and databases but not how best to provide developer access.
Learning process
Designing data services is still more art than science. The biggest challenge is creating generic data services that meet most needs. It's similar in many ways to the challenge of designing a general-purpose data warehouse schema, with the added complexity of having to create multiple levels of service for different purposes.
Regardless of how good a job you do with the first set of data services, they will evolve over time and become more nuanced. That change should be expected as new applications are built that need to consume data. Each time a new request comes in, you will learn more about the various consumers and what makes the most sense.
A consultant and industry analyst, Mark Madsen is founder and president of Third Nature, Inc.