Applied Solutions 1
Everything in Its Place
Teradata Virtual Storage helps maintain system balance.
For a typical Teradata DBA, the physical storage of the database has been “virtual” for a long time, in the sense that data placement, organization and management are handled automatically by the Teradata Database with minimal user input or administration. There are no tablespaces to define or maintain, no buffer storage or temporary workspaces to manage, no requirement to define how to spread data evenly across drives and controllers, and no need to perform periodic table or index reorganizations.
The new Teradata Virtual Storage option can enhance storage management. To maintain optimum system balance in a data warehouse from Teradata, this new capability allows the physical storage to include a mix of devices with different capacities and performance levels. Additionally, Teradata Virtual Storage manages the data that gets distributed across those resources.
The Teradata approach to data warehousing is founded on the concept of a shared-nothing architecture. That is, each row of data is owned by one of the parallel engines that make up a Teradata Database. The engine, known as an “AMP,” is the only entity in the database that can create, read, lock, unlock, update or delete that row of data. In a balanced Teradata system, each AMP owns about the same amount of data and manages all of the data’s transactions. As a result, AMPs never contend with one another for data access.
To balance the performance of all AMPs, Teradata systems have historically been configured using disk drives of equivalent performance and capacities within each clique. For more than 25 years, this simple architectural guideline has worked well for the implementation of enterprise data warehouses from Teradata. However, storage solutions have evolved to provide a range of performance and capabilities, including:
- High-capacity enterprise drives
- Improvements in desktop and consumer-oriented storage technologies
- Introduction of new storage devices based on flash memory
Prior to Teradata 13, all storage in the configuration was assumed to have equivalent capacity and performance, and the only parameter used to determine where to store a piece of data was the ID of the owning AMP.
Since each drive (or drive pair with RAID1) was reserved for the use of a single AMP, any data from that AMP could be written to any of the drives that were reserved for it. All drives had equal performance, so to help balance the system, all AMPs were given access to an equal number of drives. However, this made it difficult to alter the amount of storage in the system for two reasons: any new drives had to have the same capacity as the original drives, and enough drives had to be added to maintain an equal number for each AMP.
Teradata Database 13 removes the fixed link between each disk drive and an AMP, so Teradata Virtual Storage can relax those constraints and allow storage to be made up of mixed capacities. Even though there might not be enough high-capacity drives for each AMP, Teradata Virtual Storage will ensure the necessary space is allocated to each AMP to maintain balanced performance. Figure 1 (below) depicts a Teradata system that originally included a balanced number of 73GB drives and was later augmented with additional high-capacity drives.
In addition to changing how mixed storage is handled, Teradata Virtual Storage can also control how that storage is used by the database. Space in the database can be allocated based on the owning AMP and the performance requirement (or response time) for the data that will be stored. This is accomplished by measuring the expected performance of each drive within the storage arrays. These measurements are used to match the correct storage resources with the type of data, including user data (tables/rows), spool data, indexes, logs, journals, fallback data and so on.
For instance, consider an in-flight query that needs spool space for keeping temporary tables as it progresses through the execution of a given query plan. Teradata Virtual Storage can select and allocate space on a set of high-performance drives to store the data in those tables so the query can finish as quickly as possible.
When using mixed-drive configurations, Teradata Virtual Storage accordingly adjusts the expected drive performance to reflect any predicted contention that might arise when storing data from multiple AMPs on a single drive. This type of adjustment will help keep data that needs higher performance from being stored on drives that will be used by multiple AMPs.
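The contention adjustment described above can be illustrated with a minimal Python sketch. It is not Teradata's actual algorithm: the `Drive` fields, the 25 percent penalty per additional AMP and the drive names are all hypothetical values chosen only to show how a shared drive can lose its apparent speed advantage.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    response_ms: float  # measured response time with a single AMP using the drive
    sharing_amps: int   # number of AMPs expected to store data on the drive


# Assumption for illustration: each additional AMP sharing a drive
# adds 25% to its expected response time.
CONTENTION_PENALTY = 0.25


def effective_response_ms(drive: Drive) -> float:
    """Expected response time after adjusting for predicted contention."""
    extra_amps = max(drive.sharing_amps - 1, 0)
    return drive.response_ms * (1 + CONTENTION_PENALTY * extra_amps)


def pick_drive(drives, needs_fast: bool) -> Drive:
    """Choose the fastest effective drive for demanding data, the slowest otherwise."""
    ranked = sorted(drives, key=effective_response_ms)
    return ranked[0] if needs_fast else ranked[-1]


drives = [
    Drive("dedicated_73GB", response_ms=6.0, sharing_amps=1),
    Drive("shared_high_capacity", response_ms=5.0, sharing_amps=4),
]

spool_drive = pick_drive(drives, needs_fast=True)     # e.g., spool for an in-flight query
archive_drive = pick_drive(drives, needs_fast=False)  # e.g., rarely read history
```

In this example the shared drive is faster in isolation, but after the adjustment its expected response time is worse than the dedicated drive's, so the spool data is placed on the dedicated drive.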
In data warehousing terms, temperature represents the relative demand for a particular set of data (i.e., tables, partitions, rows). The data’s temperature is described by a few qualitative terms, as opposed to a large range of explicit numerical values. “Hot” refers to data that is accessed frequently by users, such as the last 30 days of data in the “promotions redemption” table. “Cold” refers to data that is infrequently accessed.
While temperature refers to the frequency of data access, terms like “fast” and “slow” are specific to the performance characteristics of storage devices. In other words, the relative performance of each storage device is measured by response time.
In addition to grading the performance of storage drives, Teradata Virtual Storage can determine the respective temperatures of the data on those drives by monitoring the I/O patterns for a particular piece of data over a period of time. Figure 2 (above) shows how temperature could vary over time for a given table/partition. These temperatures are not absolute but are relative to the ranges of the rest of the data stored in the warehouse.
Recency of access is important because, as time passes, the data’s temperature cools off. Even if a table or partition has a lot of historical access, for instance, the temperature of that data will appear lower than data that has the same amount of access during a more recent time period. This trait of data becoming cooler over time is commonly referred to as data aging. However, just because data is older doesn’t mean that it will only continue to get cooler. In fact, cooler data can become warm or hot again as access increases.
For example, sales data from this quarter may remain hot for several months as current and previous-quarter comparisons are run in different areas of the business. After six months, that sales data may cool off until nearly a year later when it becomes increasingly queried (i.e., becomes hotter) for comparison against the current quarter’s sales data.
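One way to picture recency-weighted, relative temperature is the following Python sketch. The exponential decay, the 30-day half-life, the three labels split by rank and the table names are assumptions made for illustration; the article does not describe the formula Teradata Virtual Storage actually uses.

```python
def temperature(access_log, now_day, half_life_days=30.0):
    """Sum I/O counts, halving the weight of an access every half_life_days.

    access_log is a list of (day, io_count) pairs. The decay model is an
    assumption for illustration, not Teradata's documented method.
    """
    return sum(
        io_count * 0.5 ** ((now_day - day) / half_life_days)
        for day, io_count in access_log
    )


def classify(temps):
    """Label data relative to the rest of the warehouse, by rank thirds."""
    ranked = sorted(temps, key=temps.get, reverse=True)
    labels = {}
    for rank, name in enumerate(ranked):
        if 3 * rank < len(ranked):
            labels[name] = "hot"
        elif 3 * rank < 2 * len(ranked):
            labels[name] = "warm"
        else:
            labels[name] = "cold"
    return labels


NOW = 100  # current day number (hypothetical)
temps = {
    "sales_this_quarter": temperature([(95, 1000)], NOW),
    "sales_last_year": temperature([(10, 1000)], NOW),
    "audit_log": temperature([(98, 10)], NOW),
}
labels = classify(temps)
```

Both sales tables have the same total I/O in this example, but the older accesses count for less, so the older table ranks cooler. If new accesses were appended to its log, its temperature would rise again.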
While Teradata Virtual Storage can monitor and adapt to data temperatures, it can’t change or manipulate the temperature of any data in the warehouse. Data temperatures are primarily dictated by the workloads in the warehouse—the more queries that are run against a particular table, the higher its temperature. The only way to change a table’s temperature is to alter the frequency and/or number of queries that are run against it. Thus, by adopting the appropriate workload management and data management technologies, administrators can start down the path of enabling a multi-temperature data warehouse.
Teradata Virtual Storage can automatically move data to different storage locations that best match its currently measured temperature. This is called migration. If there is a large difference between the data’s temperature and the response time of the location where it is currently stored, the data can be moved to a more appropriate location (either faster or slower). It will remain there until the mismatch between its measured temperature and the performance of its new location becomes substantial again. This serves two purposes: correcting initial placements that were based on a predicted data temperature, and adjusting the placement of data to match its usage as it progresses through different temperature cycles.
Of course, measuring temperatures and migrating data consumes resources, so the decision to move data to a different location is more complex than simply comparing temperatures and storage response times. In fact, migration candidates are determined by a cost/benefit analysis. This includes factors such as current system performance, the impact of the migration, availability of appropriate storage and the expected decrease in average I/O response time. If the benefits outweigh the cost, a migration is initiated.
This continual measurement and migration optimizes the data layout in the warehouse so that data currently in demand can be accessed more quickly than data that is less in demand. So, although the data stays in the warehouse, its physical location may change based on how intensively it is used. It doesn’t change the amount of I/O or the physical resources available to the system; rather, it makes an effort to use those resources more effectively and efficiently.
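The cost/benefit decision behind a migration can be sketched in Python as follows. This is a simplified model under stated assumptions, not the product's logic: the `MigrationCandidate` fields, the cost per megabyte, the 24-hour horizon and the way system load inflates cost are all hypothetical, and the sketch only covers moves to faster storage.

```python
from dataclasses import dataclass


@dataclass
class MigrationCandidate:
    name: str
    ios_per_hour: float         # measured access rate (a proxy for temperature)
    current_response_ms: float  # response time of the current location
    target_response_ms: float   # response time of the proposed location
    size_mb: float
    target_has_space: bool      # availability of appropriate storage


# Hypothetical tuning values, chosen only for illustration.
MOVE_COST_MS_PER_MB = 20.0
HORIZON_HOURS = 24.0


def expected_benefit_ms(c: MigrationCandidate) -> float:
    """Total I/O time saved over the horizon if the data is moved."""
    saved_per_io = c.current_response_ms - c.target_response_ms
    return c.ios_per_hour * HORIZON_HOURS * saved_per_io


def migration_cost_ms(c: MigrationCandidate, system_busy_fraction: float) -> float:
    """Cost of the move, inflated when the system is already busy."""
    return c.size_mb * MOVE_COST_MS_PER_MB * (1 + system_busy_fraction)


def should_migrate(c: MigrationCandidate, system_busy_fraction: float) -> bool:
    if not c.target_has_space:
        return False
    return expected_benefit_ms(c) > migration_cost_ms(c, system_busy_fraction)


hot = MigrationCandidate("promotions_last_30_days", 5000, 12.0, 4.0, 1000, True)
cold = MigrationCandidate("promotions_archive", 2, 12.0, 4.0, 1000, True)

move_hot = should_migrate(hot, system_busy_fraction=0.5)
move_cold = should_migrate(cold, system_busy_fraction=0.5)
```

With these numbers, the frequently accessed partition saves far more I/O time than the move costs, so it is migrated, while the rarely accessed one stays where it is.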
Few steps are required to take advantage of these new features. Queries don’t need to be modified. Workloads don’t need to be adapted—although some companies might change their workload parameters after gaining insight into their data usage patterns. Tables don’t need to be altered. Tablespaces, reorganizations and data files are still automatically managed by the system.
By more closely integrating the database and storage capabilities, Teradata Virtual Storage uses storage more effectively and efficiently for active and multi-temperature data warehouses. Many organizations will factor Teradata Virtual Storage into their planning efforts to expand their data warehouses with new subject areas and deeper histories, or additional applications and the associated processing and storage requirements. As updated storage solutions and components are released, Teradata Virtual Storage will let customers add them to the data warehouse.