Computer hardware is constantly evolving with faster CPUs and larger disks every year. Currently, the number of cores per CPU is expected to double every two years. Thus, database management systems (DBMSs) face new challenges in exploiting existing cores, available for parallel multithreaded processing [1, 2]. However, developing multithreaded algorithms is a complex problem due to the difficulty of managing and balancing the workload among a large number of threads, as well as synchronizing them.
Parallel multithreaded processing is especially valuable for data mining and statistical computations due to the fact that most data processing is translated into a large number of mathematical CPU operations that generally can be processed in parallel [3-6]. Moreover, data mining algorithms require efficient I/O mechanisms when processing large data sets, in which it is preferable to interleave mathematical processing with a full table scan [7, 8]. Unfortunately, most hardware improvements do not accelerate access to secondary storage (hard disk technology) and therefore I/O processing remains a performance bottleneck. This is particularly important when aggregate operations are performed on large input tables. As a result, for practical purposes, aggregations become a bottleneck for data mining algorithms [7-9]. Therefore, improving the I/O mechanisms for aggregates in DBMSs represent an important research issue.
DBMS extensibility allows taking advantage of programming mechanisms that can extend its analytic functionality. However, most of the time, such approach is generally ignored by researchers in database systems, data mining and statistics. Despite the fact that a DBMS can indeed be extended with data mining processing [4, 7, 8], the majority of data mining processing is done outside the DBMS by exporting data samples to small flat files that are analyzed by efficient programs in languages such as C++ or Java , statistical tools such as Mat-lab,SAS, or R [5, 8], or more recently MapReduce [4, 6]. User-defined functions (UDFs) are a powerful extensibility mechanism [5, 10, 11]. Furthermore, some UDF implementations include application programming interfaces (APIs)  that enable the control of multithreaded processing. Aggregate user-defined functions (also called user-defined aggregates) are developed by implementing a well-defined sequence of processing steps;such steps provide enough information to manage the working threads in charge of processing database aggregations . However, the user is often oblivious to parallel execution, and does not have control over optimizations that can be applied at the hardware level. With such issues in mind, our contributions are mostly focused on accelerating data mining processing in a DBMS exploiting UDFs. However, our ideas can be applied on any database algorithm that requires efficient processing on large data sets. In this paper, we work on extending the DBMS to compute sufficient statistics that are fundamental for many data mining models: principal component analysis (PCA), linear regression, clustering, variable selection, among others [8, 14]. In addition, we study caching, efficient memory management and multithreaded processing in order to exploit multiple cores and large RAM memory. It is important to emphasize that our research studies efficiently interleave data set processing with a full table scan on a modern DBMS. This research is applied,for instance, in one-pass data mining algorithms, such as Naive Bayes , or dimensionality reduction with PCA [5, 14], that can take advantage of sufficient statistics to compute the model, instead of reading the data set multiple times. These models can now be obtained more efficiently with a middleware layer (UDFs in our case) that maximizes the utilization of all cores in the CPU.
This paper is organized as follows. Section II compares our research with previous works. Section III introduces definitions,presents an overview of sufficient statistics and explains how they can be computed with UDFs. Section IV presents our main contributions. This section introduces algorithms and optimizations to compute aggregations exploiting multithreaded processing. Section V contains an experimental evaluation on a computer with a multicore CPU, comparing performance of different aggregation algorithms, integrated in a DBMS. Finally, Section VI presents the conclusions and directions of our future work.
There is a wide range of related works regarding efficiently using memory and parallel processing for database operations and data mining. Previous research has been done on the hardware level to exploit operations in main memory for fast database processing [1, 15]. Adibi et al.  evaluate the link discovery algorithms in a processing-in-memory (PIM) architecture. In this research, experiments on multithreading and in-memory processing are presented. Unlike our work, the latter algorithms were specifically coded and tested for their hardware architecture and cannot be extended to any configuration of hardware. Manegold et al.  propose data structure partitioning algorithms for query joins that optimize cache performance by memory access. As in our research, the authors seek to speed up the execution of a critical database operator by optimizing memory management. However, their modifications are in the database core modules that will require modifying the DBMS engine. In our work, we decided to extend the DBMS capabilities by exploiting the existing framework for multithreading and memory access, which can be incorporated to any current database system.
In a similar manner, aggregate operations have been previously approached with the use of multicore technology. Cieslewicz and Ross  analyze several factors of multithreading and caching. In this work, an adaptive aggregation algorithm is proposed to optimize access to L1 and L2 cache memory in order to minimize cache misses. The latter algorithm is successful, even with skewed data. Unfortunately, despite the fact that the latter algorithm can be extended to work on different CPU architectures, it requires a complete rewrite of low-level algorithms for managing multithreaded processing and aggregation operations. In contrast, our algorithm can be extended to perform more complex processing than just sufficient statistics. Cache performance of in-memory and block oriented aggregate operations is studied by Cieslewicz et al. . The main difference between our proposal and previous research is that joins and aggregations are evaluated together to avoid cache misses by modifying the size of the buffer. Notice that we avoid join operations in our aggregation process. This is an important assumption because any required join operations for obtaining sufficient statistics is assumed to be performed in a pre-processing step. We also evaluate the performance of our algorithm by modifying the block size. Using a separate context per thread for data mining algorithms is proposed by Ghoting et al. . Although there has been considerable work on exploiting the current hardware technology for optimizing database performance, our work goes further by optimizing the aggregation bottleneck.
Integrating data mining and statistical techniques into a DBMS has received little attention by the research community. However, aggregate functions have been used not only in probabilistic databases , but also to construct patterns in multirelational data mining and online analytical processing (OLAP) [18, 19]. Aggregate UDFs are shown to be useful when implementing database algorithms [8, 20]. The main functionality needed to define an aggregate UDF for multithreading is identified in . Furthermore, it has been pointed out that traditional cost-based optimizations cannot be applied when working with UDFs because they represent non-traditional database systems processing . As a consequence, memory and core usage have to be managed by the user. Recently, there has been research on optimizing the computation of sufficient statistics by exploiting caching in RAM and sampling . This work takes a step further by proposing specific changes to the DBMS aggregation algorithms and accelerating performance with asymmetric multithreaded processing.
Example of an input data set X with d = 3; n = 10
Assume we have an input data set
The data set
The computation of
Aggregate user-defined functions (aggregate UDFs) give users the capability to extend the functionality of DBMS. The set of steps (see Fig. 2) that must be implemented by the user to program the aggregation are :
1) Initialize: Data structures and variables for the aggregation are initialized.
2) Accumulate: This step is the most important. In this step, each row of the data set is processed, one at a time. An accumulation in a local variable is performed by every thread. Notice that while the table scan is being processed, the threads are fed the rows to accumulate.
3) Merge: This step merges the accumulated values of independent threads into the main result. This thread is responsible for merging both local variables and local data structures into a global aggregation result.
4) Terminate: In this final step, after all threads partial results have been merged, the function return value is computed. Once this last step is finalized, the final aggregation result is returned to the user.
It is important to point out that due to the fact that user-defined functions are compiled fragments of C code, the input arguments for aggregate functions must be fixed in order to allocate the memory space and allow argument value passing to each thread . Therefore, to allow a dynamic
developed: (1) a parsing function (STR-UDF); (2) a serializable function for reading and writing binary objects (BIN-UDF) (under the .NET programming environment ). The parsing function receives one string value, where all dimension values of the input vector are concatenated. Individual dimension values are parsed and assigned to local variables at run-time. On the other hand, the aggregate UDF for summarization receives all values for the d dimensions of a data point packed as a binary object, and returns all the elements in
Execution performance of the aggregate UDF can be improved by materializing a table with a single column storing the packed dimensions with the user-defined type. Even though the aggregation can be efficiently computed when data is already in binary format, creating such a table is a pre-step that can be time-consuming, especially for large input tables.
Table-valued functions (TVFs) are a type of user-defined function that, unlike aggregate UDFs, is able to return a table as the final result of the function. TVFs read an input data set as a single data stream and do not implicitly manage parallelism. Despite the lack of “out-of-the-box” parallelism, it is common that database systems allow the user to implement routines that support parallelism. Without loss of generality, in this work, a TVF will be used to compute multidimensional aggregations, with internal thread management algorithms returning a result table with just one row.
We now present our main contributions. We start with an overview on how we optimize the processing of UDFs. We then go into more technical detail, explaining how to manage memory and how to guarantee correct results under concurrent processing by multiple threads. We introduce three alternatives to manage workload among threads. Such workload involves disk I/O and CPU operations. We conclude with a brief time complexity and I/O cost analysis.
Our basic UDF-based algorithm exploits parallel processing to distribute the workload among all threads, while ensuring the hard disk access is accessed with full table scans, to achieve maximum performance. More importantly, processing with concurrent threads needs to guarantee correct results without race conditions, deadlocks or process starvation.
In developing aggregate functions for data summarization, we incorporated several changes that increase the speed of multicore CPU computers where computing the aggregations is faster than reading records from secondary storage. To obtain sequential reading, one thread is the only process in charge of reading records from the input table, caching blocks of records in main memory, and calling a monitor to dispatch the job to another thread that actually performs the calculations. All threads share memory to update the global aggregate computation. Moreover, we define techniques to control the number of threads executing simultaneously, and the amount of memory used by the aggregation process.
Even though algorithms for aggregation are explained in a general manner, we target the specific problem of computing sufficient statistics. Some key aspects have been modified from the UDF API. For example, the accumulate step receives a complete row from the input table without the need to pack its values as a user-defined type. Also, the initialization includes multithreaded execution parameterization, which would be specific for the hardware configuration. Since the computing of sufficient statistics requires a set of matrices and vectors, results are returned as tables using the common connectivity features of database programmability. Finally, our algorithms integrate into the modern DBMS without modifying any of the primitives for access data.
The purpose of caching part of the input table in main memory is to have quick access to its data records. We address the problem of obtaining a sequential reading by introducing a thread to cache the data blocks of the input data set in main memory. Since each worker thread is assigned the task of computing the aggregation of one block, portions of the data are cached throughout the execution time. Once the computation of a thread is concluded, the memory space occupied by the block is sent to the garbage collector. The reading process is oblivious to multithreaded execution since its only task is to allocate memory space to fit a fixed number of rows and fill the current block with records from the input table. As soon as a block is full or there are no more records to retrieve, a pointer to the block is sent to the monitor process, and a new block is started once the monitor is done. The characteristic difference between such processing approach and a standard parallel aggregation is the way threads access the input table. Instead of having the threads request data blocks, threads are assigned a block as workload by the “monitor” process. Both reading and monitoring are done by the main thread. Thus, in addition to the cost of sequentially reading the input and allocating blocks in main memory, we must consider the overhead of dispatching the worker threads. There is little overhead caused by the monitor calls, due to the difference in speed between reading the rows from disk and computing the flops by CPU.
Correct concurrent processing is solved by defining different types of memory access for working threads. Since each block will be accessed only by one thread after its creation, it will be immutable, and the memory space can then be disposed once the thread is done. Each working thread has a private memory space to compute the local aggregation of its data block and public access to update the global aggregation computed by all threads. Since the results of global aggregation must be the same regardless of how individual operations are interleaved, access to the memory space storing the global aggregation is granted only after the thread acquires a lock on the shared resource. In other words, only one thread is allowed to update the global aggregation at some point in time. In a multithreaded processing environment, there will be several threads working on the private memory space of their current task. Since each thread computes the aggregation on a private memory space and each data block is accessed only by one thread, working threads do not interfere with each other during the aggregation processing. We redefine the merging step of aggregate UDFs to update the global result every time a thread completes a task. All private memory space (cached data and local aggregation) is sent to the garbage collector, so that it can be released by the DBMS once a task is finished. It is important to notice that there are no deadlocks because there is only one lock granting access to all values in the global computation. In order to minimize concurrency control overhead, the data block size should be large enough to have a small number of threads simultaneously attempting to update the global aggregation data structure, thereby reducing contention.
In order to manage multiple worker threads, it is necessary to create a monitor process, similar to those processes used by the operating system (OS). Such monitor process executes as part of the main thread; it is in charge of dispatching the workload and terminating execution when all worker threads have finished their individual computations. We propose three monitor alternatives (Fig. 4):
1) Creating a new thread (i.e., there are multiple threads) for every upcoming task (MT-UDF).
2) Using a fixed number of threads (FT-TVF).
3) Exploiting a pool of threads; also called thread pool (TPTVF).
The first, simplest approach of the monitor is to create a new thread for every request to dispatch a workload (MT-UDF) and to then add the thread to a list. When the reading process is finished and the monitor requests to join the threads, the monitor uses the list to wait until all threads complete their execution. At this point, the aggregation is complete. In this configuration, the
reading process allocates blocks in RAM memory regardless of whether or not the processing power is sufficient for completing the tasks prior to causing stack or memory overflow. Moreover, the scheduling policy of the operating system assigns CPU time slices to the threads. A higher priority is not necessarily given to tasks closer to being finished and incomplete tasks will retain memory space until completed.
We now discuss the FT-TVF approach. The maximum amount of memory used for caching can be controlled by the monitor process. Even though this approach does have a circular list to keep a fixed number of working threads, the need for a queue is eliminated since there can be at most one task waiting to be executed. Whenever a new task is created by the monitor, it initially locates the first available slot on the list. A slot is considered available either when its thread is done or when there is no thread assigned to it. Although the circular list decreases the amount of RAM memory used for caching, the reading process has to be stopped every time the list becomes full. Finally, stopping a sequential read for a long period of time can severely impact the algorithm performance.
We now explain the third monitor approach (TP-TVF). To control the number of threads executed in the system and to have a first-in-first-out (FIFO) policy for the upcoming workload, we include a thread pool managed by the monitor process. With such configuration, all tasks created by the monitor are added to the thread pool. When the thread pool is initialized, it creates a fixed number of threads and a FIFO list. As such, whenever a thread finishes its current task, it is assigned the next task in the queue. Even though completed tasks free up memory space, each task in the thread pool queue has a data block associated with it. Moreover, if the waiting queue grows large enough, then it could cause memory overflow.
Time complexity and the number of I/O operations for each
Time complexity of n L and Q
of the data summarization matrices are given in Table 2. Since the expected input consists of large data sets (
We conducted our experiments on a DBMS installed on a server with an Intel Core 2 Quad CPU with four cores at 2.83 GHz each, and 3.2 GB of RAM. The hard disk had 320 GB of capacity, with a SATA interface running at 3 GB/s, and 7,200 RPM. The operating system was Windows XP and the DBMS was Microsoft SQL Server. UDFs were programmed in C#.
In the following sections, we evaluate different alternatives to compute sufficient statistics. The code was compiled and added to the DBMS server using aggregate UDFs and TVFs. The main experimental parameters were the chosen aggregate UDF agg, the blocking factor b (number of records per block) and the number of threads
In this section, we analyze the performance impact of all different alternatives to manage threads in multithreaded processing:
1) Creating a thread for every task or block to aggregate (MTUDF).
2) Using a maximum fixed number of threads with at most one task waiting to start; no queue (FT-TVF).
3) Exploiting a pool of threads with a fixed number of threads and a queue of waiting tasks (TP-TVF).
All control alternatives are compared against the execution time of performing a full table scan. No aggregation is performed during the full table scan: this query is used as a benchmark baseline. The full table scan is performed with a sequential data access given by the UDF API of the DBMS.
The results in Fig. 5 show time performance of the three alternatives to compute Q at different
[Fig. 5.] Concurrency control comparison (task = Q n = 1 M d = 64 t = 4). MT-UDF: multiple thread user-defined function FT-TVF: fixed threads table-valued function TP-TVF: thread pool table-valued function.
32, the impact of performing the aggregation while reading the table is minimal. On the other hand, the trend becomes evident when
Caching data blocks of the input table, regardless of the amount of memory used for this matter, could cause memory overflow. While being aware of this potential problem, we have proposed methods to limit memory usage, and implemented
UDF processing time varying the number of threads t (time in seconds and n = 10 M)
them in FT-TVF. Since our experimental study focuses on
Table 3 displays the performance of our monitor policies, when varying the number of working threads. For
Such behavior is caused because the time to read a record of size d from the input table is greater than the time it takes the process to calculate the
The blocking factor is directly related to two aspects of the execution: the amount of memory used for caching and contention for concurrently updating the global aggregation.
The performance of different policies computing
[Fig. 6.] Impact of the block size (agg = Q; n = 1 M; d = 64; t = 4). MT-UDF: multiple thread user-defined function FT-TVF: fixed threads table-valued functions TP: thread pool table-valued functions.
Comparison with aggregate UDFs with/without multithreaded processing (time in seconds and agg = Q; n = 1 M; t = 4; b = 1000)
For comparison purposes, we also tested plain SQL aggregations (Plain SQL) and aggregate user-defined functions: binary object input parameter (BIN-UDF) and string input parameter (STR-UDF). We include a comparison with a TVF that does not use multithreading to distribute workload (Regular TVF). Table 4 shows the execution performance when solving the Q aggregation for
The execution performance of Regular TVF is severely affected by
Nevertheless, the BIN-UDF step alone remains as a good comparison example for multithreading. The second implementation of aggregate UDFs (STR-UDF) moves the CAST function, for packing rows into a user-defined type object, into the SQL statement of the aggregation. Yet, STR-UDF still has a higher execution time than any of our algorithms because of the overhead of the parsing function in the user-defined.
It can be seen in Fig. 7 that tendencies hold when increasing the size of
[Fig. 7.] Aggregate comparison when varying d (agg = Q; n = 10 M; t = 4; b = 1000). BIN-UDF: binary user-defined function TP-TVF: thread pool table-valued functions REG-TVF: regular table-valued functions.
rows in a query prevent experimenting with
This paper studied the efficient computation of data set summaries on a large data set, exploiting parallel processing with multiple threads. We especially focused on UDFs as the programming mechanism to extend a DBMS with data mining capabilities. More specifically, we proposed algorithms that can be embedded into a DBMS as UDFs (Table UDFs) to interleave table scans and CPU processing. We presented three “monitor” algorithms to manage threads and to evenly distribute the workload. Such algorithms are particularly useful in cases of multiple core CPUs. The first algorithm (MT-UDF) has a master thread that reads the table and dynamically assigns data blocks to new threads. The second algorithm (FT-TVF) has a master thread that dynamically allocates a workload to new threads until a maximum number of threads has been reached; threads are destroyed once they finish processing the aggregation. The third algorithm (TP-TVF) has a pool of threads that reuses idle threads. Hardware optimization is achieved by controlling the CPU usage and managing RAM memory for data caching. We performed a careful experimental evaluation on a DBMS working on a multicore CPU computer. Our algorithms performance was compared against plain SQL queries and aggregate UDFs. We show our algorithms generally outperform the aggregate UDFs provided by the DBMS. Our algorithms showed better performance at high dimensionality (
Although our optimizations are specific to computing data summarization for linear Gaussian models, we believe that future research on UDFs should also consider thread management for fast parallel processing. Research issues include synchronization policies for threads with complex data structures and memory allocation to avoid table misses and data overflow. In addition, user-defined functions can be extended to allow a low-level of data access in order to speed up the execution of critical parallel processes (e.g., a thread retrieving a data page directly from disk). Finally, more research is needed on exploiting DBMS extensibility mechanisms like UDFs to perform data mining, instead of processing large data sets on flat files.