Tuesday, February 10, 2015

TD Performance Tuning Guide

Performance Optimization :


  • Teradata makes itself the decision to use the index or not - if you are not careful you spend time in table updates to keep up an index which is no used at all (one cannot give the query optimizer hints to use some index - though collecting of statistics may affect the optimizer strategy
  • In the MP-RAS environment, look at the script "/etc/gsc/bin/perflook.sh".  This will provide a system-wide snapshot in a series of files.  The GSC uses this data for incident analysis.
  • When using an index one must keep sure that the index condition is met in the sub queries "using IN, nested queries, or derived tables"
  • Indication of the proper index use is found by explain log entry "a ROW HASH MATCH SCAN across ALL-AMPS"
  • If the index is not used the result of the analysis is the 'FULL TABLE SCAN' where the performance time grows when the size of the history table grows
  • Keeping up an index information is a time/space consuming issue. Sometimes Teradata is much better when you "manually" imitate the index just building it from scratch.
  • keeping up join index might help, but you cannot multiload to a table which is a part of the join index - loading with 'tpump' or pure 'SQL' is OK but does not perform as well. Dropping and re-creating a join index with a big table takes time and space.
  • when your Teradata "explain" gives '25' steps from your query (even without the update of the results) and the actual query is a join of six or more tables

Case e.g :

We had already given up updating the secondary indexes - because we have not had much use for them.
After some trials and errors we ended up to the strategy, where the actual "purchase frequency analysis" is never made "directly" against the history table.
Instead:

  1. There is a "one-shot" run to build the initial "customer's previous purchase" from the "purchase history" - it takes time, but that time is saved later
  2. The purchase frequency is calculated by joining the "latest purchase" with the "customer's previous purchase".
  3. When the "latest purchase" rows are inserted to the "purchase history" the "customer's previous purchase" table is dropped and recreated by merging the "customer's previous purchase" with the "latest purchase"
  4. By following these steps the performance is not too fast yet (about 25 minutes in our two node system) for a bunch of almost 1.000.000 latest receipts - but it is tolerable now.
  5. (We also tested by adding both the previous and latest purchase to the same table, but because its size was in average case much bigger than the pure "latest purchase", the self-join was slower in that case)

*********

MANAGING CONCURRENT WORKLOADS

Integrated e-commerce efforts present many warehouse challenges. Here's how Teradata can help.

The word e-commerce means many things to many people. Although for some it connotes only the Web, the real value of e-commerce can only be realized when all channels of a business are integrated and have full access to all customer information and transactions. In fact, to me, e-commerce means using the rich technology available today to bring added value to the customer and additional value to the business through all customer interaction channels. 
Under this definition of e-commerce, an active warehouse is at the epicenter, providing the storage and access for decision making in the e-commerce world. As more and more companies adopt active warehousing for this purpose, data warehouse workloads are expanding and changing. 
If your warehouse relies on a Teradata DBMS, you'll find that handling the challenge of high-volume, widely varying, disparate service-level workloads is one of its core competencies. One of the biggest concerns I hear from customers is how to deal with the quickly rising number of concurrent queries and concurrent users that can result from active warehousing and e-commerce initiatives. Expected service levels vary widely among different groups of users, as do query types. And, of course, the entire workload must scale upward linearly as the demand increases, ideally with a minimum of effort required from users and systems staff. Here's a look at some of the most frequent questions I receive on the subject of mixed workloads and concurrency requirements. 
  
How do I balance the work coming in across all nodes of my Teradata configuration? 
You don't. Teradata automatically balances sessions across all nodes to evenly distribute work across the entire parallel configuration. Users connect to the system as a whole rather than a specific node, and the system uses a balancing algorithm to assign their sessions to a node. Balancing requires no effort from users or system administrators. 

Does Teradata balance the work queries cause? 

The even distribution of data is the key to parallelism and scalability in Teradata. Each query request is sent to all units of parallelism, each of which has an even portion of the data to process, resulting in even work distribution across the entire system. 
For short queries and update flow typical of Web interactions, the optimizer recognizes that only a single unit of parallelism is needed. A query coordinator routes the work to the unit of parallelism needed to process the request. The hashing algorithm does not cluster related data, but spreads it out across the entire system. For example, this month's data and even today's data is evenly distributed across all units of parallelism, which means the work to update or look at that data is evenly distributed.

Will many concurrent requests cause bottlenecks in query coordination? 

Query coordination is carried out by a fully parallel parsing engine (PE) component. Usually, one or more PEs are present on each node. Each PE handles the requests for a set of sessions, and sessions are spread evenly across all configured PEs. Each PE is multithreaded, so it can handle many requests concurrently. And each PE is independent of the others with no required cross-coordination. The number of users logged on and requests in flight are limited only by the number of PEs in the configuration.

How do you avoid bottlenecks when the query coordinator must retrieve information from the data dictionary? 
In Teradata, the DBMS itself manages the data dictionary. Each dictionary table is simply a relational table, parallelized across all nodes. The same query engine that manages user workloads also manages the dictionary access, using all nodes for processing dictionary information to spread the load and avoid bottlenecks. The PE even caches recently used dictionary information in memory. Because each PE has its own cache, there is no coordination overhead. The cache for each PE learns the dictionary information most likely to be needed by the sessions assigned to it.

With a large volume of work, how can all requests execute at once?
 

As in any computer system, the total number of items that can execute at the same time is always limited to the number of CPUs available. Teradata uses the scheduling services Unix and NT provide to handle all the threads of execution running concurrently. Some requests might also exist on other queues inside the system, waiting for I/O from the disk or a message from the BYNET, for example. Each work item runs in a thread; each thread gets a turn at the CPU until it needs to wait for some external event or until it completes the current work. Teradata configures several units of parallelism in each SMP node. Each unit of parallelism contains many threads of execution that aren't restricted to a particular CPU; therefore, every thread gets to compete equally for the CPUs in the SMP node. 
There is a limit, of course, to the number of pieces of work that can actually have a thread allocated in a unit of parallelism. Once that limit is reached, Teradata queues work for the threads. Each thread is context free, which means that it is not assigned to any session, transaction, or request. Therefore, each thread is free to work on whatever is next on the queue. The unit of work on the queue is a processing step for a request. Combining the queuing of steps with context-free threads allows Teradata to share the processing service equally across all the concurrent requests in the system. From the users' point of view, all the requests in the system are running, receiving service, and sharing system resources.

How does Teradata avoid resource contention and the resulting performance and management problems? 

Teradata algorithms are very resource efficient. Other DBMSs optimize for single-query performance by giving all resources to the single query. But Teradata optimizes for throughput of many concurrent queries by allocating resources sparingly and using them efficiently. This kind of optimization helps avoid wide performance variations that can occur depending on the number of concurrent queries. 
When faced with a workload that requires more system resources than are available, Teradata tunes itself to that workload. Thrashing, a common performance failure mode in computer systems, occurs when the system has fewer resources than the current workload requires and begins using more processing time to manage resources than to do the work. With most databases, a DBA would tune the system to avoid thrashing. However, Teradata adjusts automatically to workload changes by adjusting the amount of running work and internally pushing back incoming work. Each unit of parallelism manages this flow control mechanism independently.

If all concurrent work shares resources evenly, how are different service levels provided to different users? 

The Priority Scheduler Facility (PSF) in Teradata manages service levels among different parts of the workload. PSF allows granular control of system resources. The system administrator can define up to five resource partitions; each partition contains four available priorities. Together, they provide 20 allocation groups (AGs) to which portions of the workload are assigned by an attribute of the logon ID for the user or application. The administrator assigns each AG a portion of the total system resources and a scheduling policy. 
For example, the administrator can assign short queries from the Web site a guaranteed 20 percent of system resources and a high priority. In contrast, the administrator might assign medium priority and 10 percent of system resources to more complex queries with lower response-time requirements. Similarly, the administrator might assign data mining queries a low priority and five percent of the total resources, effectively running them in the background. You can define policies so that the resources adjust to the work in the system. For example, you could allow data mining queries to take up all the resources in the system if nothing else is running. 
Unlike other scheduling utilities, PSF is fully integrated into the DBMS, not managed at the task or thread level, which makes it easier to use for parallel database workloads. Because PSF is an attribute of the session, it follows the work wherever it goes in the system. Whether that piece of work is executed by a single thread in a single unit of parallelism or in 2,000 threads in 500 units of parallelism, PSF manages it without system administrator involvement. 
CPU scheduling is a primary component of PSF, using all the normal techniques (such as quantum size, CPU queues by priority, and so on). However, PSF is endemic throughout the Teradata DBMS. There are many queues inside a DBMS handling a large volume mixed workload. All of those queues are prioritized based on the priority of the work. Thus, a high priority query entered after several lower priority requests that are awaiting their turn to run will go to the head of the queue and will be executed first. I/O is managed by priority. Data warehouse workloads are heavy I/O users, so a large query performing a lot of I/O could hold up a short, high-priority request. PSF puts the high-priority request I/Os to the head of the queue, helping to deliver response time goals.

Data warehouse databases often set the system environment to allow for fast scans. Does Teradata performance suffer when the short work is mixed in? 

Because Teradata was designed to handle a high volume of concurrent queries, it doesn't count on sequential scans to produce high performance for queries. Although other DBMS products see a large fall in request performance when they go from a single large query to multiple queries or when a mixed workload is applied, Teradata sees no such performance change. Teradata never plans on sequential access in the first place. In fact, Teradata doesn't even store the data for sequential accesses. Therefore, random accesses from many concurrent requests are just business as usual. 
Sync scan algorithms provide additional optimization. When multiple concurrent requests are scanning or joining the same table, their I/O is piggybacked so that only a single I/O is performed to the disk. Multiple concurrent queries can run without increasing the physical I/O load, leaving the I/O bandwidth available for other parts of the workload.

What if work demand exceeds Teradata's capabilities? 

There are limits to how much work the engine can handle. A successful data warehouse will almost certainly create a demand for service that is greater than the total processing power available on the system. Teradata always puts into execution any work presented to the DBMS. 
If the total demand is greater than the total resources, then controls must be in place before the work enters the DBMS. When your warehouse reaches this stage, you can use Database Query Manager (DBQM) to manage the flow of user requests into the warehouse. DBQM, inserted between the users' ODBC applications and the DBMS, evaluates each request and then applies a set of rules created by the system administrator. If the request violates any of the rules, DBQM notifies the user that the request is denied or deferred to a later time for execution. 
Rules can include, for example, system use levels, query cost parameters, time of day, objects accessed, and authorized users. 

How do administrators and DBAs stay on top of complex mixed workloads? 

The Teradata Manager utility provides a single operational system view for administrators and DBAs. The tool provides real-time performance, logged past performance, users and queries currently executing, management of the schema, and more.

STAYING ACTIVE
 

The active warehouse is a busy place. It must handle all decision making for the organization, including strategic, long-range data mining queries, tactical decisions for daily operations, and event-based decisions necessary for effective Web sites. Nevertheless, managing this diversity of work does not require a staff of hundreds running a complex architecture with multiple data marts, operational data stores, and a multitude of feeds. It simply requires a database management system that can manage multiple workloads at varying service levels, scale with the business, and provide 2437 availability year round with a minimum of operational staff.


q  Use COMPRESS in whichever attribute possible. This helps in reducing IO and hence Improves performance. Especially for attribute having lots of NULL values/Unique known values.

q  COLLECT STATISTICS on daily basis (after every load) inorder to improve performance.

q  Drop and recreate secondary indices before and after every load. This helps in improving load performance (if critical)

q  Regularly Check for EVEN data distribution across all AMPs using Teradata Manager or thru queryman

q  Check for the combination on CPU, AMP’s, PE, nodes for performance optimization.
Each AMP can handle 80 tasks and each PE can handle 120 sessions.

q  MLOAD – Customize the number sessions for each MLOAD jobs depending on the
  1. Number of concurrent MLOAD jobs &
  2. Number of PE’s in the system

e.g
SCENARIO 1
# of AMPS = 10
# of MAx load Jobs handled by Teradata=5 (Parameter which can be set values-5 to 15)
# of Sessions per load job= 1 (parameter that can be set at Global or at each MLOAD script level)
# of PE's=1

So 10*5*1= 50 + 10 (2 per job overhead) = 60 is the Max sessions on Teradata box
This is LESS then 120, which is max # of sessions a PE can handle

SCENARIO 2
#AMPS = 16
#Max load Jobs handles by Teradata=15
#Sessions per load job= 1
#of PE's=1

So 16*15*1= 240 + 30 (2 per job ovehead) = 270 (Max sessions on Teradata box).
This is MORE then 120, which is the max sessions a PE can handle.

Hence MLOAD fail, inspite of the usage of the SLEEP & TENACITY features.


q  Use the SLEEP and TENACITY features of MLOAD for scheduling MLOAD jobs.

q  Check the TABLEWAIT parameter. If omitted can cause immediate load job failure if you submit two MLOADS loads that are trying to update the same table.

q  JOIN INDEX - Check the limit on number of fields for a join Index (max 16 fields). It may vary by version

Join Index is like building the table physically. Hence it has the advantage like BETTER Performance since data is physically stored and not calculated ON THE FLY etc. Cons are of LOADING time(MLOAD needs Join Indices to be dropped before loading) and additional space since it is a physical table.

 


 SQL Tunning :



Contents

1. Introduction
  • Reduce the Workload
  • Balance the Workload
  • Parallelize the Workload
  • Upgrade

2. Improve SQL statement tuning
2.1 Reviewing the Execution Plan
2.2 Restructuring SQL statement
2.2.1 AND or ‘=’ clause
2.2.2 IN or BETWEEN clauses
2.2.3 LIKE clause
2.2.4 IN and EXISTS clauses
2.2.5 DISTINCT clause
2.2.6 UNION or UNION ALL clauses
2.2.7 CASE statement
2.2.8 SELECT DISTINCT clause
2.3 Eliminate the use of Temporary Tables
2.4 Avoid Mixed Type Expressions
3. Indexes
3.1 Secondary Indexes
3.2 Join Index
3.2.1 Single table Join Index
3.2.2 Aggregate Join Index
3.2.3 Sparse Index
3.2.4 Global (Join) Index
3.2.5 Join Index performance
3.3 Partitioned Primary Index
3.4. Index Usage


1. Introduction

Performance of Teradata SQL based application depends on several factors, including database design, network latency, and query optimization, hardware specifications. Poorly tuned queries often cause performance problems. The objective of tuning a system is to either reduce the response time for end users of the system, or to reduce the resources used to process the same work. Both these objectives can be achieved in several ways:

Reduce the Workload

This is what commonly constitutes SQL tuning: finding more efficient ways to process the same workload. It is possible to change the execution plan of the statement without altering the functionality to reduce the resource consumption.
Two examples of how resource usage can be reduced are:
  1. If a commonly executed query needs to access a small percentage of data in the table, then it can be executed more efficiently by using an index. Creation and utilization of such an index, reduces the amount of resources used.
  1. If a user is looking at the first twenty rows of the 10,000 rows returned in a specific sort order, and if the query (and sort order) can be satisfied by an index, then the user does not need to access and sort the 10,000 rows to see the first 20 rows.

Balance the Workload

Systems often tend to have peak usage in the daytime when real users are connected to the system and low usage in the nighttime. If non-critical reports and batch jobs can be scheduled to run in the nighttime and their concurrency during daytime reduced, then it frees up resources for the more critical programs in the day.

Parallelize the Workload

Queries that access large amounts of data (typical data warehouse queries) often can be parallelized. This is extremely useful for reducing the response time in low concurrency data warehouse. However, for OLTP environments, which tend to be high concurrency, this can adversely impact other users by increasing the overall resource usage of the program.

Upgrade

Software/Hardware upgrade can be another means for attaining the desired performance levels and should be used only after trying everything from SQL tuning/data model restructuring and Database tuning perspective.

To ensure performance after a major / minor software or hardware upgrade, perform the following tasks on the production system.
  • Recollect statistics where possible.
  • Save the EXPLAINs for the base queries.
  • Run the test bed again after the upgrade and get the new EXPLAINs.
  • Compare the two EXPLAINs.
  • Check for faster or slower response on any of the test queries.
  • If slower then check for the before and after performance EXPLAIN.

2. Improve SQL statement tuning

2.1 Reviewing the Execution Plan

When tuning (or writing) a SQL statement in an OLTP environment, the goal is to drive from the table that has the most selective filter. This means that there are fewer rows passed to the next step. If the next step is a join, then this means that fewer rows are joined. Check to see whether the access paths are optimal.

When examining the optimizer execution plan, look for the following:
  • The plan is such that the driving table has the best filter.
  • The join order in each step means that the fewest number of rows are being returned to the next step (that is, the join order should reflect, where possible, going to the best not-yet-used filters).
  • There are any unintentional Cartesian products (even with small tables).
  • Each table is being accessed efficiently:
Consider the predicates in the SQL statement and the number of rows in the table. Look for suspicious activity, such as a full table scans on tables with large number of rows, which have predicates in the where clause. Determine why an index is not used for such a selective predicate.
A full table scan does not mean inefficiency. It might be more efficient to perform a full table scan on a small table, or to perform a full table scan to leverage a better join method (for example, hash_join) for the number of rows returned.
If any of these conditions are not optimal, then consider restructuring the SQL statement or the indexes available on the tables.

2.2 Restructuring SQL statement

Rewriting an inefficient SQL statement is often easier than repairing it. Since SQL is a flexible language, more than one SQL statement may meet the needs of your requirements. Although two SQL statements may produce the same result, one may be processed faster than the other. The results of the EXPLAIN statement can be used to compare the execution plans of the two statements and determine which is more efficient.

2.2.1 AND or ‘=’ clause

To improve SQL efficiency, it is advantageous to use equijoins whenever possible. Statements that perform equijoins on untransformed column values are easier to tune.


2.2.2 IN or BETWEEN clauses

In case of a choice of using the IN or the BETWEEN clauses in the query, it is advantageous to use the BETWEEN clause, as it is much more efficient. For example:
SELECT customer_number, customer_name
FROM customer
WHERE customer_number in (1000, 1001, 1002, 1003, 1004);
is much less efficient than:
SELECT customer_number, customer_name
FROM customer
WHERE customer_number BETWEEN 1000 and 1004
Assuming there is a useful index on customer_number, the Query Optimizer can locate a range of numbers much faster (using BETWEEN) than it can find a series of numbers using the IN clause.

2.2.3 LIKE clause

If LIKE is used in a WHERE clause, it is better to try to use one or more leading character in the clause, if at all possible. For example, use:
LIKE 'm%'  not LIKE '%m'
If a leading character is used in the LIKE clause, then the Query Optimizer has the ability to potentially use an index to perform the query thereby, speeding performance.
But if the leading character in a LIKE clause is a wildcard, the Query Optimizer will not be able to use an index, and a table scan must be run, which reduces performance and takes more time.

2.2.4 IN and EXISTS clauses

In certain circumstances, it is better to use IN rather than EXISTS. In general, if the selective predicate is in the subquery, then use IN. If the selective predicate is in the parent query, then use EXISTS.
A subquery when used with an IN clause can take advantage of selectivity specified in the subquery. This is most beneficial when the most selective filter appears in the subquery and there are indexes on the join columns. Conversely, using EXISTS is beneficial when the most selective filter is in the parent query. This allows the selective predicates in the parent query to be applied before filtering the rows against the EXISTS criteria.


Below are two examples that demonstrate the benefits of IN and EXISTS. Both examples use the same schema with the following characteristics:
·         There is a unique index on the employees.employee_id field.
·         There is an index on the orders.customer_id field.
·         There is an index on the employees.department_id field.
·         The employees table has 27,000 rows.
·         The orders table has 10,000 rows.
Example 1:
This example demonstrates how rewriting a query to use IN can improve performance. This query identifies all employees who have placed orders on behalf of customer 144.
The following SQL statement uses EXISTS:
SELECT e.employee_id, e.first_name, e.last_name, e.salary
FROM employees e
WHERE EXISTS (SELECT 1 
                FROM orders o
                             WHERE e.employee_id = o.sales_rep_id
                                           AND o.customer_id = 144);
Rewriting the statement using IN, results in significantly fewer resources used. The SQL statement using IN:
SELECT e.employee_id, e.first_name, e.last_name, e.salary
FROM employees e
WHERE e.employee_id IN (SELECT o.sales_rep_id
                                            FROM orders o
                                            WHERE o.customer_id = 144);
 
Explanation:
 
In the query using EXISTS, an extra unnecessary step is being performed by the parent query. From the sub-query, we obtain a table that has equijoined the tables employees and orders on the basis of employee_id and sales_rep_id. Further, the table is filtered for customer_id = 144. The obtained table is the resultant table needed but, an extra step of comparing of employee_id is done once again when the parent query is performed.
 
However, in the query using IN, the subquery returns only a filtered table on the basis of the specified customer_id and the join function is performed by the parent query. Thus, the work performed by the query using IN clause is much less than that of the query using the EXISTS clause. 
 
            Example 2:
This example demonstrates how rewriting a query to use EXISTS can improve performance. This query identifies all employees from department 80 who are sales reps who have placed orders. The following SQL statement uses IN:


SELECT e.employee_id, e.first_name, e.last_name, e.department_id, e.salary
FROM employees   e
WHERE e.department_id = 80
              AND e.job_id = 'SA_REP'
              AND e.employee_id IN (SELECT o.sales_rep_id 
                                     FROM orders o);
The following SQL statement uses EXISTS:
SELECT e.employee_id, e.first_name, e.last_name, e.salary
FROM employees   e
WHERE e.department_id = 80
              AND e.job_id = 'SA_REP'
              AND EXISTS (SELECT 1
                                     FROM orders o
                                     WHERE e.employee_id = o.sales_rep_id);
           
Explanation:

In the query using IN, the subquery returns the entire orders table with the required column data. The rest of the filtering and the joining are done in the parent query.

However, in the query using EXISTS, the subquery returns a table equijoined between employees and orders table. The parent query then performs the filtration according to the specified conditions. Thus, the work performed by the query using EXISTS clause is much less than that of the query using the IN clause.

2.2.5 DISTINCT clause

At times this clause is added to every SELECT statement, even when it is not necessary. The DISTINCT clause should only be used in SELECT statements if it is known that duplicate returned rows are a possibility, and that having duplicate rows in the result set would cause problems with the requirements. The DISTINCT clause creates a lot of extra work, and reduces the physical resources that other SQL statements have at their disposal. Because of this, the DISTINCT clause is used only if it is necessary.

2.2.6 UNION or UNION ALL clauses

When using the UNION statement, keep in mind that, by default, it performs the equivalent of a SELECT DISTINCT on the final result set. In other words, UNION takes the results of two like recordsets, combines them, and then performs a SELECT DISTINCT in order to eliminate any duplicate rows. This process occurs even if there are no duplicate records in the final recordset. Selecting a distinct result requires building a temporary worktable, storing all rows in it and sorting before producing the output. If it is known that there are duplicate records, and this presents a problem for the application, then, the UNION statement should be used to eliminate the duplicate rows.

On the other hand, if it is known that there will never be any duplicate rows, or if there are, and this presents no problem to your application, then the UNION ALL statement should be used instead of the UNION statement. The advantage of the UNION ALL is that it does not perform the SELECT DISTINCT function, which saves a lot of unnecessary resources from being used. UNION ALL requires no worktable and no sorting. In most cases it’s much more efficient.

One more potential problem with UNION is the danger of flooding temporary database with a huge worktable. It may happen if a large result set is expected from a UNION query.

2.2.7 CASE statement

Often, it is necessary to calculate different aggregates on various sets of tables. Usually, this is done with multiple scans on the table, but it is easy to calculate all the aggregates with one single scan. Eliminating n-1 scans can greatly improve performance.
Combining multiple scans into one scan can be done by moving the WHERE condition of each scan into a CASE statement, which filters the data for the aggregation.

Example:
The following example asks for the count of all employees who earn less then 2000, between 2000 and 4000, and more than 4000 each month. This can be done with three separate queries:
SELECT COUNT (*)
FROM employees
WHERE salary < 2000;
 
SELECT COUNT (*)
FROM employees
WHERE salary BETWEEN 2000 AND 4000;
 
SELECT COUNT (*)
FROM employees
WHERE salary > 4000;
However, it is more efficient to run the entire query in a single statement. Each number is calculated as one column. The count uses a filter with the CASE statement to count only the rows where the condition is valid. For example:
SELECT COUNT (CASE WHEN salary < 2000 
                   THEN 1 ELSE null END) count1, 
       COUNT (CASE WHEN salary BETWEEN 2001 AND 4000 
                   THEN 1 ELSE null END) count2, 
       COUNT (CASE WHEN salary > 4000 
                   THEN 1 ELSE null END) count3 
  FROM employees;

2.2.8 SELECT DISTINCT clause

The DISTINCT option is used in a SELECT statement to filter out duplicate results from a query's output. In a simple SELECT from one table this is the easiest and quickest way of doing things. However, with a more complex query, the query can be recoded to gain a performance advantage. Consider the following example:

            


Example:
             The query returns authors that have a book already published.

SELECT DISTINCT au_fname, au_lname
FROM authors a
JOIN titleAuthor t ON t.au_id = a.au_id;

The same result can be obtained by writing the query in the following manner,

SELECT au_fname, au_lname
FROM authors a
WHERE EXISTS (SELECT *
   FROM   titleAuthor t
   WHERE t.au_id = a.au_id );
The second example gives a slightly better performance than the first one. The reason is that the EXISTS clause causes a name to be returned when the first book is found, and no further books for that author are considered (we already have the author's name, and only want to see it only once)
On the other hand, the DISTINCT query returns one copy of the author's name for each book the author has worked on, and the list of authors generated subsequently needs to be examined for duplicates to satisfy the DISTINCT clause.
The DISTINCT clause involves a worktable, which does not happen in the EXISTS clause.

AND has precedence over OR

2.3 Eliminate the use of Temporary Tables


As queries become more complex, temporary tables are used more and more. While temporary tables may sometimes be unavoidable, they can often be sidestepped by using derived tables instead. 

A derived table is the result of using another SELECT statement in the FROM clause of a SELECT statement. By using derived tables instead of temporary tables, the application's performance can be boosted. Temporary tables slow the performance dramatically. The problem with temporary tables is the amount of overhead that goes along with using them. In order to get the fastest queries possible, the goal must be to make them do as little work as possible. The biggest benefit of using derived tables over using temporary tables is that they require fewer steps, and everything happens in memory instead of a combination of memory and disk. Fewer the steps involved, along with less I/O, faster is the performance.

2.4 Avoid Mixed Type Expressions


It is always better to avoid mixed-mode expressions, and beware of implicit type conversions.
Consider the following example:

Example:

Assume a table with a column charcol defined as VARCHAR(2) and is also the primary index of the table. The following SQL query is run on this table,



SELECT * FROM <tablename>
WHERE charcol = <numexpr>;

where, numexpr is an expression of numeric type. In SQL, the default implicit type conversion is from Character to Numeric data type. Thus, in the above query charcoal is converted to numeric data type.

Since the primary index has been defined on a column of VARCHAR data type and the implicit conversion makes it of numeric type, a full table scan is performed. This affects the performance of the query.
This can be avoided by specifying an explicit conversion. The reformed query is,

SELECT * FROM <tablename>
WHERE charcol = CAST (<numexpr> AS VARCHAR (2));

In this case, since the column on which the primary index has been defined, remains of the same type, a single AMP data retrieval takes place. This improvises the performance of the query.

3. Indexes

3.1 Secondary Indexes

Secondary Indexes (SI) supply alternate access paths and the use of appropriate secondary indexes can increase retrieval performance. For best results, secondary indexes should be based on frequently used set selections and on equality search.
A table can have up to 32 Secondary Indexes that can be created and dropped dynamically. However, it is not a good idea to create a number of SIs for each table just to speed up set selection because SIs consume the following extra resources:

  • SIs require additional storage to hold their subtables. In the case of a Fallback table, the SI subtables are Fallback also. Twice the additional storage space is required.
  • SIs require additional I/O to maintain these subtables.

When deciding whether or not to define a NUSI, there are other considerations. The
Optimizer may choose to do a Full Table Scan rather than utilize the NUSI in two cases:

  • When the NUSI is not selective enough.
  • When no COLLECTed STATISTICS are available.

As a guideline, choose only those rows having frequent access as NUSI candidates. After the table has been loaded, create the NUSI indexes, COLLECT STATISTICS on the indexes, and then do an EXPLAIN referencing each NUSI. If the Parser chooses a Full Table Scan over using the NUSI, drop the index.

3.2 Join Index

Join indexes are defined in a way that allows join queries to be resolved without accessing or joining their underlying base tables and also, reduce the number of rows processed in generating result sets. A join index is useful for queries where the index structure contains all the columns referenced by one or more joins, thereby allowing the index to cover all or part of the query. For a join index the join result is persistently stored and automatically maintained.


3.2.1 Single table Join Index

A single table join index is very useful for resolving joins on large tables without having to   redistribute the joined rows across the AMPs. A single-table join index partitions all or a subset of a base table using a primary index based on the table’s foreign key (preferably the primary index of the table to which it is to be joined).

3.2.2 Aggregate Join Index

Aggregate join indexes offer an extremely efficient, cost-effective method of resolving queries that frequently specify the same aggregate operations on the same column or columns. When aggregate join indexes are available, the system does not have to repeat aggregate calculations for every query.

An aggregate join index can be defined on two or more tables, or on a single table. A single-table aggregate join index includes:

·         A columnar subset of a base table
·         Additional columns for the aggregate summaries of the base table columns

An aggregate join index can be created using:

·         SUM function
·         COUNT function
·         GROUP BY clause

The use of an aggregate join index entails:

·         Initial time consumed to calculate and create the index.
·         Whenever a value in a join-index column of the base table is updated, time to recalculate the aggregate and update the index.

3.2.3 Sparse Index

Sparse indexes index a portion of the table using WHERE clause predicates to limit the rows indexed. Allowing constant expressions in the WHERE clause of the CREATE JOIN INDEX statement limits the rows that are included in the join index to a subset of the rows in the table based on an SQL query result. This capability in effect allows creation of sparse indexes. When base tables are large, this feature can be used to reduce the content of the join index to only the portion of the table that is frequently used if the typical query only references a portion of the rows.

A sparse index can focus on the portions of the tables that are most frequently used. This capability:

·         Reduces the storage requirements for a join index.
·         Makes the costs for maintaining an index proportional to the percent of rows actually referenced in the index.


3.2.4 Global (Join) Index

A Global Index is a term used to define a join index that contains the Row IDs of the base table rows. Queries may use the join index to qualify a few rows, then refer to the base tables to obtain requested columns that aren't stored in the join index. Such queries are said to be partially-covered by the index. This is referred to as a partially-covered global index.

Because the RDBMS supports multi-table, partially-covering join indexes, all types of join indexes, except the aggregate join index, can be joined to their base tables to retrieve columns that are referenced by a query but are not stored in the join index.

A partial-covering join index takes less space than a covering join index. Not all columns that are involved in a query selection condition have to be stored in a partial-covering join index. The benefits are:

·         Disk storage space for the join index decreases when fewer columns are stored in the join index.
·         Performance increases when the number of selection conditions that can be evaluated on the join index increases.

3.2.5 Join Index performance

            3.2.5.1 Collecting Statistics

To provide the Optimizer with the information needed to generate the best plans, you need to have collected statistics on the primary index columns of each join index.
Consider collecting statistics to improve performance during:

• Creation of a join index
• Update maintenance of a join index
3.2.5.2 Cost considerations
Join indexes, like secondary indexes, incur both space and maintenance costs. For example, insert, update, and delete operations must be performed twice, once for the base table and once for the join index. However, if join indexes are suited to your applications, the improvements in query performance can far outweigh the costs.
3.2.5.3 Load utilities 
MultiLoad and FastLoad utilities cannot be used to load or unload data into base tables that have an associated join index defined on them because join indexes are not maintained during the execution of these utilities. If an error occurs, the join index must be dropped and recreated after that table has been loaded. The TPump utility, which perform standard SQL row inserts and updates, can be used because join indexes are properly maintained during the execution of such utilities.

3.3 Partitioned Primary Index

 The Partitioned Primary Index (PPI) feature allows a class of queries to access a portion of a large table, instead of the whole table. This is achieved by hashing rows to different virtual AMPs, as is done with a normal PI, but creating local partitions within each virtual AMP. Normal PI access remains unchanged, but in the case of a range query, for example, each virtual AMP is able immediately to focus its search on specific partitions within its workspace.
PPI’s are defined on a table in order to increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes.

A Partitioned Primary Index (PPI) allows the data rows of a table to be:

  • Hash partitioned to the AMPs by the hash of the primary index columns
  • Partitioned on some set of columns on each AMP
  • Ordered by the hash of the primary index columns within that partition

Two functions, RANGE_N and CASE_N, can be used to simplify the specification of a partitioning expression.
PPI improves performance as follows:

  • Automatic optimization occurs for queries that specify a restrictive condition on the partitioning column.
  • Uses partition elimination to improve the efficiency of range searches when, for example, the searches are range partitioned.
  • Only the rows of the qualified partitions in a query need to be accessed avoid full table scans.
  • Provides an access path to the rows in the base table while still providing efficient join strategies
  • If the same partition is consistently targeted, the part of the table updated may be able to fit largely in cache, significantly boosting performance

Benefits that are the result of using PPI vary based on:

  • The number of partitions defined
  • The number of partitions that can be eliminated given the query workloads
  • Whether or not the user follows an update strategy that takes advantage of partitioning.

However, the main disadvantages of using a PPI are:

  • The PI access disadvantage occurs only when the partitioning column is not part of the PI. In this situation, a query specifying a PI value, but no value for the partitioning column, must look in each partition for that value, instead of positioning directly to the first row for the PI value.
  • The direct join disadvantage occurs when another table with the same PI is joined with an equality condition on every PI column. For two non-PPI tables, the rows of the two tables will be ordered the same, and the join can be performed directly. If one of the tables is partitioned, the rows won't be ordered the same, and the task, in effect, becomes a set of sub-joins, one for each partition of the PPI table.

3.4. Index Usage


The EXPLAIN facility expresses the ‘confidence’ level for retrieval of data from a table. Some of the phrases used are:

…with high confidence
  • Restricting conditions exist on index(es) or column(s) that have collected statistics.

…with low confidence
  • Restricting conditions exist on index(es) having no statistics, but estimates can be based upon a sampling of the index(es).
  • Restricting conditions exist on index(es) or column(s) that have collected statistics but are “AND-ed” together with conditions on non-indexed columns.
  • Restricting conditions exist on index(es) or column(s) that have collected statistics but are “OR-ed” together with other conditions.

…with no confidence
  • Conditions outside the above.


The execution strategy provided by the EXPLAIN facility gives a direct feedback on what steps the optimizer chooses to do. Studying EXPLAIN outputs is an excellent way to know the inner workings of the Teradata DBS and how it handles SQL. Use of EXPLAIN facility regularly, saves a lot of time and computing resources by pointing out problems in SQL statements before they are actually run. 

No comments:

Post a Comment