Tech2Tech
Hands On
Traverse Hierarchies
Analyze data and prevent problems with recursive queries.
by Ralph Meira
We find hierarchies in our everyday lives: bills of materials, organizational structures, genealogy charts, etc. In theory, hierarchies are easy to understand. However, they can be quite a challenge when you attempt to represent and manipulate them using a database and SQL.
The efficient handling of hierarchies can lead to the detection of anomalies in complex systems, the uncovering of unknown relationships and much more. Unfortunately, in the real world, they are often plagued by data quality problems that can negatively affect their analysis. The Teradata Database can shield you from these problems through the use of recursive queries—as long as you take a few precautions.
Analyze Levels of Data
Hierarchies are best represented in the form of a treelike structure with several levels. Their in-database representation typically relies on parent-child relationships that, in the case of an organizational chart, identify the relationship between employer and employee. Due to the nature of hierachical structures, data quality issues such as missing links or cyclic child-parent-child relationships can often complicate the analysis of data.
You can easily visualize how data quality problems occur by comparing figure 1 to figure 2. Notice how the link between "22" and "222" is missing in figure 2, perhaps due to input error. Also notice that "1," "11" and "111" can lock you into a cyclic loop that never ends.

Click to enlarge

Click to enlarge
Recursive queries can be used to analyze hierarchical data while preventing data quality problems. As an example, the recursive SQL syntax shown below has "safety features" that help identify cyclic data loops and keep the recursion depth to a maximum of 10 levels. This syntax can be repurposed to unravel any hierarchy that follows the format in table 1 below and to let you go much deeper than just 10 levels.
WITH RECURSIVE TREE
( LEVEL, PARENT_ID, CHILD_ID, PATH) AS
( SELECT 0 AS LEVEL,
E.PARENT_ID,
E.CHILD_ID,
CAST (E.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TABLE_1 E
WHERE E.PARENT_ID = 0
UNION ALL
SELECT TREE.LEVEL +1 ,
S.PARENT_ID, S.CHILD_ID,
CAST (TREE.PATH || S.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TREE, TABLE_1 S
WHERE TREE.CHILD_ID=S.PARENT_ID
AND TREE.LEVEL < 10
AND POSITION (S.PARENT_ID IN TREE.PATH) < 1
)
SELECT PARENT_ID, CHILD_ID,
MIN(LEVEL)+1 AS DEPTH,
PATH||CHILD_ID AS PATH,
CASE
WHEN (POSITION(CHILD_ID IN PATH)>0)
THEN 'CYCLIC'
ELSE''END AS CYCLIC
FROM TREE GROUP BY 1,2,4,5
ORDER BY 4,3,1,2;

Click to enlarge
Some key elements in the syntax are:
- A materialized PATH is composed of concatenated values that show the route taken by each level of recursion.
- POSITION determines whether a node from the hierarchy is being repeated in the PATH, thus identifying cyclic data.
- The TREE.LEVEL < 10 effectively stops the query from going any deeper than 10 levels.
- The WHERE E.PARENT_ID = 0 clause in the SQL is responsible for defining node 0 as the starting point. Each step of all possible routes initiated at node 0 is shown following the PARENT-CHILD row order.
It’s not uncommon to be confused by the syntax of recursive queries, so it is useful to employ recursive VIEWS to simplify the analysis of hierarchies. Using the RECURSIVE VIEW format enables the use of SET functions such as MINUS and INTERSECT that allow comparative analysis between two similar hierarchies.
By using REPLACE RECURSIVE VIEW, you can start to analyze changes or anomalies introduced by data quality issues. ANSI SQL recursive queries are also available to Teradata Database users and have been for many years.
This is what syntax with recursive VIEWS can look like.
REPLACE RECURSIVE VIEW TREE_V
( LEVEL, PARENT_ID, CHILD_ID, PATH) AS
( SELECT 0 AS LEVEL,
E.PARENT_ID,
E.CHILD_ID,
CAST (E.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TABLE_1 E
WHERE E.PARENT_ID = 0
UNION ALL
SELECT TREE_V.LEVEL +1 ,
S.PARENT_ID, S.CHILD_ID,
CAST (TREE_V.PATH || S.PARENT_ID
AS VARCHAR(200)) AS PATH
FROM TREE_V, TABLE_1 S
WHERE TREE_V.CHILD_ID=S.PARENT_ID
AND TREE_V.LEVEL < 10
AND POSITION (S.PARENT_ID IN TREE_V.PATH) < 1
);

Click to enlarge
If the appropriate row changes are made to table 1, it’s possible to have the table represent the hierarchy shown in figure 2. Executing the recursive SQL syntax against a revised table 1 will produce the results shown in table 2.
You can compare the hierarchy of nodes in figures 1 and 2 using this SQL.
SELECT PARENT_ID, CHILD_ID,
MIN(LEVEL)+1 AS DEPTH,
PATH||CHILD_ID AS PATH,
CASE
WHEN (POSITION(CHILD_ID IN PATH) > 0)
THEN 'CYCLIC' ELSE''END AS CYCLIC
FROMTREE_FIG2_V GROUP BY 1,2,4,5
MINUS
SELECT PARENT_ID, CHILD_ID,
MIN(LEVEL)+1 AS DEPTH,
PATH||CHILD_ID ASPATH,
CASE
WHEN (POSITION(CHILD_ID IN PATH) > 0)
THEN 'CYCLIC' ELSE''END AS CYCLIC
FROM TREE_FIG1_V GROUP BY 1,2,4,5
ORDER BY 4,3,1,2;
Note that the SQL before and after the MINUS is identical to the syntax used earlier in the recursive SQL syntax to extract, group and order the results of the recursive query. A simpler analysis of changes to a hierarchy can often be carried out without the need for recursive queries at all. For example, to find out which nodes in a hierarchy are considered to be top nodes, that is, nodes without parents, you only need to SELECT all PARENT_IDs MINUS the SELECT of all CHILD_IDs for the same hierarchy.
In order to obtain the results shown in table 3, it is being assumed that TREE_FIG2_V and TREE_FIG1_V correspond to recursive VIEWs in the format shown in the recursive VIEWs syntax where the underlying TABLEs contain the parent-child relationships that correspond to figures 2 and 1, respectively. Table 3 shows the differences between figures 1 and 2, as originally intended.

Click to enlarge
Data Quality Protection
Hierarchies, though easy to traverse based on their parent-child rules, often come with data quality issues that can lead to infinite loops. Recursive queries can protect you from data problems such as missing links and cyclic loops.
The WITH RECURSIVE syntax can guide you through hierarchies to find their depth, detect infinite loops and more. You’ll be surprised how easy it is to use SQL to navigate the structures and identify data quality problems.
Ralph Meira is a Teradata senior solution architect focused primarily on manufacturing accounts.