Schema- and Data-driven Discovery of SQL Keys
 Author: Tran Le Van Bao, Link Sebastian, Memari Mozhgan
 Publish: Journal of Computing Science and Engineering, Volume 6, Issue 3, pp. 193-206, 30 Sep 2012

ABSTRACT
Keys play a fundamental role in all data models. They allow database systems to uniquely identify data items, and therefore, promote efficient data processing in many applications. Due to this, support is required to discover keys. These include keys that are semantically meaningful for the application domain, or are satisfied by a given database. We study the discovery of keys from SQL tables. We investigate the structural and computational properties of Armstrong tables for sets of SQL keys. Inspections of Armstrong tables enable data engineers to consolidate their understanding of semantically meaningful keys, and to communicate this understanding to other stakeholders. The stakeholders may want to make changes to the tables or provide entirely different tables to communicate their views to the data engineers. For such a purpose, we propose data mining algorithms that discover keys from a given SQL table. We combine the key mining algorithms with Armstrong table computations to generate informative Armstrong tables, that is, key-preserving semantic samples of existing SQL tables. Finally, we define formal measures to assess the distance between sets of SQL keys. The measures can be applied to validate the usefulness of Armstrong tables, and to automate the marking and feedback of non-multiple-choice questions in database courses.

KEYWORD
Algorithm, Complexity, Armstrong database, Key, Soundness, Completeness, Mining, SQL

I. INTRODUCTION
In databases, keys play a fundamental role in understanding both the structure and semantics. Given an SQL table schema, a key is a collection of columns whose values uniquely identify rows. That is, no two different rows have matching total values in each of the key columns. The concept of a key is essential for many other data models, including semantic models [1-4], object models [5], description logics [6], XML [7-9], RDF [10], and OWL [11].
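As a minimal illustration (ours, not the article's), the SQL key semantics just described can be expressed over rows modelled as dictionaries, with None standing for the null marker:

```python
from typing import Optional

# A sketch of SQL key semantics: a set of columns X is a key for a table if
# no two distinct rows that are total (non-null) on X agree on all of X.
def satisfies_key(table: list[dict], key_columns: list[str]) -> bool:
    """Return True if no two distinct rows have matching total values on key_columns."""
    seen: set[tuple] = set()
    for row in table:
        if any(row[h] is None for h in key_columns):
            continue  # rows with a null on a key column are exempt (SQL UC semantics)
        projection = tuple(row[h] for h in key_columns)
        if projection in seen:
            return False
        seen.add(projection)
    return True

# Values echo the running example used later in the article.
schedule = [
    {"C_ID": "11301", "L_Name": "Ullman", "Time": "Mon, 10am", "Room": "Red"},
    {"C_ID": "78200", "L_Name": "Ullman", "Time": "Mon, 10am", "Room": None},
]
print(satisfies_key(schedule, ["C_ID", "Time"]))    # True: the C_ID values differ
print(satisfies_key(schedule, ["L_Name", "Time"]))  # False: matching total values
```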
The discovery of semantically meaningful SQL keys is a crucial task in many areas of modern data management, for example, in data modeling, database design, query optimization, indexing, and data integration [12]. This article is concerned with methods for semi-automated schema-driven and automated data-driven SQL key discovery.
There is great demand in industry for such methods, because they vastly simplify the job of the database administrator and thereby decrease the overall cost of database ownership. The discovery of composite keys is especially difficult, because the number of possible keys
increases exponentially with the number of columns [3,13]. Because of the industry demand, the goal is to provide practical algorithms that exhibit good typical-case behavior. Industry-leading data modeling and design tools, such as the ERwin data modeler, emphasize the need for good test data in order to validate the semantics of the models they produce [14]. In our context, test data is considered good if it assists data engineers with the discovery of semantically meaningful keys. This calls for algorithms that produce test data which satisfies the keys currently perceived as semantically meaningful, and violates the keys currently perceived as semantically meaningless.

> A. Running Example
Consider a simple database that collects basic information about the weekly schedule of courses. That is, we have a schema SCHEDULE with columns C_ID, L_Name, Time, and Room. The schema stores the time (including the weekday) and room in which a lecturer (identified by the lecturer name L_Name) gives a course (identified by its C_ID). An SQL definition may declare {C_ID, Time} as the primary key. The table schema then specifies additional assertions: the primary key forces rows over SCHEDULE to be NOT NULL in the C_ID and Time columns, and to be unique on their {C_ID, Time} projections. That is, no two different rows may have matching values in both the C_ID column and the Time column.

A team of data engineers may wonder whether the semantics of the application domain has been captured by this SQL definition. They decide to generate some good test data to discuss their current understanding with the domain experts, who do not understand database terminology. Therefore, the data engineers produce the data sample in Table 1, where ni denotes occurrences of the no-information null marker.

The domain experts express concern about rows 1 and 3: Ullman teaches both 11301 and 78200 in the red room on Monday at 10 am. After some discussion, the domain experts emphasize that different courses should not be taught by the same lecturer in the same room at the same time. As a consequence, the team of engineers decides to specify the uniqueness constraint (UC) u(Time, L_Name, Room). They produce the data sample in Table 2 to discuss their new understanding with the domain experts.

The domain experts direct the engineers' attention to rows 1 and 3, where Ullman teaches both 11301 and 78200 on Monday at 10 am. After exchanging some ideas, the experts agree that lecturers must not teach different courses at the same time. Moreover, from rows 1 and 4 of the data sample in Table 2, the experts notice that both 11301 and 78200 are taught on Monday at 10 am in the red room. It turns out that no two different courses must be taught at the same time in the same room. Therefore, the UC u(Time, L_Name, Room) is replaced by the two stronger UCs u(Time, L_Name) and u(Time, Room), and the team of data engineers specifies a revised SQL table schema, SCHEDULE', accordingly. The example shows the potential benefit of investigating good sample data for the discovery of semantically meaningful SQL keys.

While this approach appears to be very beneficial, the question remains what constitutes good test data, and how to create it (automatically). In some situations the domain experts may feel very confident and would like the opportunity to modify values in the test data to reflect their domain knowledge, or to provide the entire test data themselves. In such cases, and other situations, the data engineers require automated means to discover a representation of the keys that are satisfied by the test data. For example, when the domain experts inspect the data sample in Table 1, they may simply suggest using the data sample in Table 3 instead, with the updated values indicated by bold font.
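The CREATE TABLE statements for SCHEDULE and for the revised schema are not reproduced in this extract. The following sqlite3 sketch shows one plausible form of both definitions (the column types, and the name SCHEDULE2 for SCHEDULE', are our assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original schema: the primary key {C_ID, Time} forces both columns to be
# NOT NULL and enforces uniqueness on their projections.
cur.execute("""CREATE TABLE SCHEDULE (
    C_ID   TEXT NOT NULL,
    L_Name TEXT,
    Time   TEXT NOT NULL,
    Room   TEXT,
    PRIMARY KEY (C_ID, Time))""")

# Revised schema SCHEDULE' (named SCHEDULE2 here): the UC u(Time, L_Name, Room)
# is replaced by the two stronger UCs u(Time, L_Name) and u(Time, Room).
cur.execute("""CREATE TABLE SCHEDULE2 (
    C_ID   TEXT NOT NULL,
    L_Name TEXT,
    Time   TEXT NOT NULL,
    Room   TEXT,
    PRIMARY KEY (C_ID, Time),
    UNIQUE (Time, L_Name),
    UNIQUE (Time, Room))""")

cur.execute("INSERT INTO SCHEDULE2 VALUES ('11301', 'Ullman', 'Mon, 10am', 'Red')")
try:
    # Same lecturer at the same time in a different course: rejected by
    # u(Time, L_Name), mirroring the experts' final agreement.
    cur.execute("INSERT INTO SCHEDULE2 VALUES ('78200', 'Ullman', 'Mon, 10am', 'Blue')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# A row that is null on a UC column is exempt from that UC's check.
cur.execute("INSERT INTO SCHEDULE2 VALUES ('78200', NULL, 'Mon, 10am', 'Blue')")
```

Note that sqlite3, like the SQL standard, exempts rows with a null in a unique column from the UNIQUE check; this matches the X-total requirement in the UC semantics formalized in Section III.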
On input of Table 3, a constraint mining algorithm would return NOT NULL constraints on C_ID and Time, as well as the three UCs u(C_ID, Time), u(L_Name, Time), and u(Room, Time). This, again, leads to the definition of SCHEDULE'. The question remains how to discover the SQL keys that are satisfied by a given SQL table.

> B. Contributions
In this paper we will establish detailed answers to the two questions above. As our first main contribution we discuss the schema-driven discovery of SQL keys that are semantically meaningful for a given application domain. For this purpose we investigate the well-known concept of Armstrong databases for the class of SQL keys. Armstrong databases formalize the concept of good test data in the sense that they satisfy the set Σ of keys currently perceived as semantically meaningful, and violate all keys that are not implied by Σ. We characterize when a given SQL table is Armstrong with respect to a given set Σ of SQL keys. This characterization allows us to establish an algorithm that generates good test data for arbitrary sets of SQL keys. While we demonstrate that the problem of computing such Armstrong tables is precisely exponential in the number of column headers, our algorithm produces an Armstrong table whose size is at most quadratic. We also show that there are situations where the size of the Armstrong table our algorithm produces is exponentially smaller than the size of the keys given. Both extreme cases result from very specific situations, which are very unlikely to occur in practice. Indeed, in most situations that do occur in practice, our algorithm will produce Armstrong tables whose size is in the same order of magnitude as the size of the keys given.
As a second main contribution we discuss the data-driven discovery of SQL keys. For this purpose we establish two algorithms that compute the set of minimal SQL keys satisfied by a given SQL table. While the problem generally requires exponential time, our algorithms show good best-case behavior. As a third main contribution we combine the schema-driven and data-driven approaches to increase the effectiveness of SQL key discovery. Given a real-world data set, we apply the data-driven approach to identify the minimal SQL keys satisfied by this data set, and then apply the schema-driven approach to compute an Armstrong table for the set of these minimal keys. The Armstrong table is called informative, as it contains only rows of the original data set, is much smaller in size, and satisfies the same SQL keys.
As the final contribution, we define formal measures that can be applied to 1) empirically validate the usefulness of our Armstrong tables for the acquisition of semantically meaningful SQL keys, and 2) automate the feedback and marking of database exam questions.
> C. Organization
We summarize related work in Section II, and provide preliminary definitions in Section III. We investigate structural and computational properties of Armstrong tables for the class of SQL keys in Section IV. In Section V we study the SQL key mining problem. Section VI combines the datadriven and schemadriven approaches to the discovery of SQL keys. Our formal measures of usefulness and their applications are discussed in Section VII. Finally, in Section VIII, we conclude and briefly comment on future work.
II. RELATED WORK
Data dependencies have been studied thoroughly in the relational model of data, cf. [15-17]. Dependencies are essential to the design of the target database, the maintenance of the database during its lifetime, and all major data processing tasks [15,17]. These applications also provide strong motivation for developing data mining algorithms to discover the data dependencies that are satisfied by a given database. Armstrong databases are a useful design aid for data engineers, which can help with the consolidation of data dependencies [18,19] and schema mappings [20], the design of databases [21], and the creation of concise test data [22].
In the relational model of data, the class of keys is subsumed by the class of functional dependencies. The structural and computational properties of Armstrong relations have been investigated in the relational model of data for the class of keys [3,23], and the more general class of functional dependencies [21,24]. The mining of keys and functional dependencies in relations has also received considerable attention in the relational model [12,25,26]. The concept of informative Armstrong databases was introduced in [22] as semantic samples of existing databases.
One of the most important extensions of Codd's basic relational model [27] is incomplete information [28,29]. This is mainly due to the high demand for the correct handling of such information in real-world applications. Approaches to dealing with incomplete information comprise incomplete relations, or-relations, or fuzzy relations. In this paper we focus on incomplete bags, and the most popular interpretation of a null marker as "no information" [29,30]. This is the general case of SQL tables, where duplicate rows are permitted and null markers may occur in columns that are specified as nullable. Relations are idealized special SQL tables in which no duplicate rows can occur and all columns are specified NOT NULL. Recently, Armstrong tables have been investigated for the combined class of SQL keys and functional dependencies [31]. In this article, we establish optimizations that arise from the focus on the sole class of SQL keys. This provides insight into the trade-off between the expressiveness of data dependency classes and the structural and computational properties of Armstrong tables. The insight can be used by data engineers to make a more informed decision about the complexity of the design process for the target database. For example, it may appear that the most important requirements of an application domain can already be captured by SQL keys alone, without the need for functional dependencies. In this case, engineers can utilize the algorithms developed in the current article, which result in computations that are more resource-efficient than those developed for the combined class of keys and functional dependencies [31]. Note that Hartmann et al. [1] study Codd keys, which require tuples to have unique and total projections on key attributes. In contrast, the present article studies SQL keys, which require total tuples to have unique projections on key attributes. Hartmann et al. [1] also do not discuss any key mining algorithms, measures to assess the difference between key sets, or informative Armstrong databases. To the best of the authors' knowledge, no previous work has addressed the mining of keys from SQL tables. Our results reveal that the methods developed for the idealized special case of relations [26] can be adapted for use with SQL tables. We are also unaware of any studies, other than [22], related to informative Armstrong databases, of measures that capture the difference between sets of keys, or of their utilization for assessing the usefulness of Armstrong tables or database exam questions.
The present article is an extension of the conference paper [32]. It contains all the proofs of our results, as well as additional motivation and discussion. Moreover, the concept of informative Armstrong tables was not discussed in the conference paper.
III. THE SQL TABLE MODEL
In this section we introduce the basic definitions necessary and sufficient for the development of our results in subsequent sections. In particular, we introduce a model that is very similar to SQL’s data definition capabilities. The interpretation of null marker occurrences as “no information” follows the SQL interpretation, and the formal model introduced by Zaniolo [29]. The class of SQL keys under consideration is the exact class of uniqueness constraints defined in the SQL standard [33].
Let us fix a countably infinite set of symbols, called column headers, or headers for short. A table schema is a finite nonempty subset T of this set. Each header H of a table schema T is associated with a countably infinite domain dom(H) of the possible values that can occur in column H. To encompass partial information, every column can have a null marker, denoted by ni ∈ dom(H). The intention of ni is to signify "no information".

For header sets X and Y, we may write XY for X ∪ Y. If X = {H_{1}, ..., H_{m}}, then we may write H_{1}...H_{m} for X. In particular, we may write simply H to represent the singleton {H}. A row over T (T-row, or simply row, if T is understood) is a function r : T → ∪_{H∈T} dom(H) with r(H) ∈ dom(H) for all H ∈ T. The null marker occurrence r(H) = ni associated with a header H in a row r means that there is no information about r(H). That is, r(H) may not exist, or r(H) exists but is unknown. For X ⊆ T, let r(X) denote the restriction of the row r over T to X. A table t over T is a finite multiset of rows over T. For a row r over T and a set X ⊆ T, r is said to be X-total if for all H ∈ X, r(H) ≠ ni. Similarly, a table t over T is said to be X-total if every row r of t is X-total. A table t over T is said to be a total table if it is T-total.

A UC over a table schema T is an expression u(X) where X ⊆ T. A table t over T is said to satisfy the UC u(X) over T, denoted by t ⊨ u(X), if for all r_{1}, r_{2} ∈ t, whenever r_{1}(X) = r_{2}(X) and r_{1}, r_{2} are X-total, then r_{1} = r_{2}. The semantics is that of SQL's UC [33], and it reduces to that of a key for total tables [15].

Following Atzeni and Morfuni [30], a null-free subschema (NFS) over the table schema T is an expression nfs(T_{s}) where T_{s} ⊆ T. The NFS nfs(T_{s}) over T is satisfied by a table t over T, denoted by t ⊨ nfs(T_{s}), if and only if t is T_{s}-total. SQL allows the specification of column headers as NOT NULL. NFSs occur in everyday database practice: the set of headers declared NOT NULL forms an NFS over the underlying table schema.

In schema design and maintenance, data dependencies are normally specified as semantic constraints on the tables intended to be instances of the schema. During the design process, or the lifetime of a database, one usually needs to determine further dependencies, which are implied by the given ones. Let T be a table schema, nfs(T_{s}) an NFS, and Σ ∪ {φ} a set of UCs over T. We say that Σ implies φ in the presence of nfs(T_{s}), denoted by Σ ⊨_{Ts} φ, if every T_{s}-total table t over T that satisfies Σ also satisfies φ. If Σ does not imply φ in the presence of nfs(T_{s}), we may also write Σ ⊭_{Ts} φ. Let Σ*_{Ts} = {φ | Σ ⊨_{Ts} φ} denote the semantic closure of Σ. If we do not want to emphasize the presence of the NFS nfs(T_{s}), we may simply write Σ ⊨ φ, Σ ⊭ φ, or Σ*, respectively. The next result explains why minimal UCs are important. Indeed, for a set Σ ∪ {u(X)} of UCs and an NFS nfs(T_{s}) over T, we call u(X) minimal if and only if Σ ⊨_{Ts} u(X) and for all u(Y) over T where Σ ⊨_{Ts} u(Y) and Y ⊆ X, we have Y = X.

THEOREM 1.
Let T be a table schema, nfs(T_{s}) an NFS, and Σ ∪ {u(X)} a set of UCs over T. Then, Σ implies u(X) in the presence of nfs(T_{s}) if and only if there is some u(Y) ∈ Σ such that Y ⊆ X.

Proof. Suppose first that there is some u(Y) ∈ Σ such that Y ⊆ X. We need to show that Σ implies u(X) in the presence of nfs(T_{s}). Assume, to the contrary, that there is some T_{s}-total table t that satisfies Σ, but violates u(X). Consequently, there are two different X-total rows r_{1}, r_{2} ∈ t such that r_{1}(X) = r_{2}(X). Since Y ⊆ X, we conclude that t violates u(Y) ∈ Σ, a contradiction. Consequently, Σ implies u(X) in the presence of nfs(T_{s}).

For the reverse direction we present the contraposition. That is, we assume that for all u(Y) ∈ Σ we have Y ⊈ X, and show that under this assumption Σ does not imply u(X) in the presence of nfs(T_{s}). For the latter, we construct a T_{s}-total table t that satisfies Σ but violates u(X). Let t be the table over T consisting of two rows r_{1} and r_{2} over T, where r_{1} and r_{2} are XT_{s}-total with r_{1}(X) = r_{2}(X), where r_{1}(H) = ni = r_{2}(H) for all H ∈ T \ XT_{s}, and where r_{1}(H) ≠ r_{2}(H) for all H ∈ (T \ X) ∩ T_{s}. The construction ensures that t violates u(X). Let u(Y) ∈ Σ. We have assumed that Y ⊈ X, that is, there is some H ∈ Y \ X. If H ∈ T_{s}, then r_{1}(H) ≠ r_{2}(H); otherwise, r_{1}(H) = ni = r_{2}(H), so r_{1} and r_{2} are not Y-total. Either way, the construction ensures that t satisfies u(Y). This concludes the proof. □

In the relational model of data, keys enjoy a simple axiomatization [17] by two axioms. The set axiom says that for every relation schema R, the set R of all attributes forms a key. The superkey rule states that for every set K of attributes in R that forms a key over R, every superset K' of K with K' ⊆ R also forms a key over R. For SQL, we have just shown that this axiomatization becomes even simpler. The proof of Theorem 1 has shown that the superkey rule is sound and complete for the implication of SQL keys. In particular, the set axiom is no longer sound, since a multiset can contain two duplicate rows.

EXAMPLE 1. We can capture the SQL table schema of the running example as the table schema SCHEDULE = CTLR with SCHEDULE_{s} = CT. Let Σ consist of the two UCs u(CT) and u(LTR). It follows that Σ does not imply any of the following UCs: u(CLR), u(LT), or u(TR). For instance, the SCHEDULE_{s}-total table in Table 2 satisfies Σ and violates each of the three UCs. As an application of Theorem 1, we see that none of CLR, LT, or TR is a superset of CT or LTR. □

IV. SCHEMA-DRIVEN SQL KEY DISCOVERY
In this section we investigate the structural and computational properties of suitable data for testing the semantic meaningfulness of uniqueness constraints over SQL table schemata. For this purpose, we use Armstrong tables to formalize the notion of suitable test data. Having introduced the concepts of strong agree sets and antikeys, we characterize when an arbitrarily given SQL table is Armstrong for an arbitrarily given set of uniqueness constraints. The characterization is then used to compute Armstrong tables. Finally, we derive results on the time and space complexity associated with the computation of Armstrong tables.
> A. Key Concepts
The official concept of an Armstrong database was originally introduced by Fagin [16]. We require our tables to be Armstrong with respect to uniqueness constraints and the NFS. Intuitively, an Armstrong table satisfies the given constraints and violates the constraints in the given class that are not implied by the given constraints. This results in the following definition.
DEFINITION 1.
Let Σ be a set of UCs, and nfs(T_{s}) an NFS over table schema T. A table t over T is said to be Armstrong for Σ and nfs(T_{s}) if and only if i) t satisfies Σ, ii) t violates every UC φ ∉ Σ*_{Ts}, iii) t is T_{s}-total, and iv) for every H ∈ T \ T_{s}, t is not H-total. □

Through the use of Theorem 1, it is easy to see that a table t is Armstrong for Σ and nfs(T_{s}) if and only if 1) for all u(X) over T, t satisfies u(X) if and only if there is some u(Y) ∈ Σ such that Y ⊆ X, and 2) for all nfs(T'_{s}) over T, t satisfies nfs(T'_{s}) if and only if T'_{s} ⊆ T_{s}.

EXAMPLE 2. Let SCHEDULE = CTLR with SCHEDULE_{s} = CT. Let Σ consist of the two UCs u(CT) and u(LTR). The data sample in Table 2 is Armstrong for Σ and nfs(SCHEDULE_{s}). For example, it violates the UCs u(CLR), u(LT), and u(TR), as well as every NFS nfs(T'_{s}) where T'_{s} is not contained in SCHEDULE_{s}. □

A natural question to ask is how we can characterize the structure of tables that are Armstrong. With this in mind, we introduce the formal notion of strong agree sets, for pairs of distinct rows and for tables.
DEFINITION 2.
For two rows r_{1} and r_{2} over table schema T, where r_{1} ≠ r_{2}, we define the strong agree set of r_{1} and r_{2} as the set of all column headers over T on which r_{1} and r_{2} have a matching total value, i.e., ag^{s}(r_{1}, r_{2}) = {H ∈ T | r_{1}(H) = r_{2}(H) and r_{1}(H) ≠ ni ≠ r_{2}(H)}. Furthermore, the strong agree set of a table t over table schema T is defined as ag^{s}(t) = {ag^{s}(r_{1}, r_{2}) | r_{1}, r_{2} ∈ t ∧ r_{1} ≠ r_{2}}.

EXAMPLE 3. Let SCHEDULE = CTLR with SCHEDULE_{s} = CT. Let Σ consist of the two UCs u(CT) and u(LTR). Let t denote the data sample in Table 2. The strong agree set of t consists of CLR, LT, TR, L, R, and T. □

For a table t to be Armstrong for Σ and nfs(T_{s}), t must violate all UCs u(X) not implied by Σ in the presence of nfs(T_{s}). Instead of violating all such UCs, it suffices to violate those that are maximal with the property that they are not implied by Σ in the presence of nfs(T_{s}). This motivates the following definition.

DEFINITION 3.
Let Σ be a set of UCs, and nfs(T_{s}) an NFS over table schema T. The set Σ^{-1} of all antikeys with respect to Σ and nfs(T_{s}) is defined as Σ^{-1} = {a(X) | X ⊆ T ∧ Σ ⊭_{Ts} u(X) ∧ ∀H ∈ (T \ X)(Σ ⊨_{Ts} u(XH))}. □

Hence, an antikey for Σ is given by a maximal set of column headers that does not form a uniqueness constraint implied by Σ.

EXAMPLE 4. Let SCHEDULE = CTLR with SCHEDULE_{s} = CT. Let Σ consist of the two UCs u(CT) and u(LTR). The set Σ^{-1} of antikeys with respect to Σ and SCHEDULE_{s} consists of a(CLR), a(LT), and a(TR). □

> B. Structure of Armstrong Tables
The concepts from the last subsection are sufficient to establish a characterization of Armstrong tables for the class of UCs over SQL table schemata.
THEOREM 2.
Let Σ be a set of UCs, and nfs(T_{s}) an NFS over the table schema T. For all tables t over T it holds that t is an Armstrong table for Σ and nfs(T_{s}) if and only if the following three conditions are satisfied: i) for all a(X) ∈ Σ^{-1} we have X ∈ ag^{s}(t); ii) for all u(X) ∈ Σ and for all Y ∈ ag^{s}(t) we have X ⊈ Y; and iii) total(t) = {H ∈ T | ∀r ∈ t (r(H) ≠ ni)} = T_{s}.

Proof. First, we show that the three conditions are sufficient for t to be an Armstrong table for Σ and nfs(T_{s}). Suppose that t is such that the three conditions are satisfied. It follows immediately from the last condition that t satisfies nfs(T'_{s}) if and only if T'_{s} ⊆ T_{s}. Let u(X) ∈ Σ. If there were two rows r_{1}, r_{2} ∈ t such that r_{1} ≠ r_{2} and X ⊆ ag^{s}(r_{1}, r_{2}) = Y, then Y ∈ ag^{s}(t). This, however, would violate the second condition. Hence, t satisfies u(X). Since u(X) ∈ Σ was an arbitrary choice, we conclude that t satisfies Σ. Let u(Y) ∉ Σ*. Then there is some a(X) ∈ Σ^{-1} such that Y ⊆ X holds. From the first condition we conclude that Y ⊆ ag^{s}(r_{1}, r_{2}) for some r_{1}, r_{2} ∈ t with r_{1} ≠ r_{2}. Consequently, t violates every uniqueness constraint not implied by Σ.

It remains to show that the three conditions hold necessarily whenever t is an Armstrong table for Σ and nfs(T_{s}). Suppose that t is an Armstrong table for Σ and nfs(T_{s}). The last condition follows immediately from the fact that t satisfies nfs(T'_{s}) if and only if T'_{s} ⊆ T_{s}. Since t satisfies Σ, there cannot be any Y ∈ ag^{s}(t) and u(X) ∈ Σ such that X ⊆ Y holds. We conclude that the second condition is satisfied. It remains to show that the first condition is satisfied, too. Let a(X) ∈ Σ^{-1}. We need to show that X ∈ ag^{s}(t). From a(X) ∈ Σ^{-1} it follows that u(X) ∉ Σ*. As t violates u(X), it follows that there are r_{1}, r_{2} ∈ t such that r_{1} ≠ r_{2} and X ⊆ Y = ag^{s}(r_{1}, r_{2}). Also, from a(X) ∈ Σ^{-1} it follows that for all H ∈ T \ X we have u(XH) ∈ Σ*, and thus that t satisfies u(XH). Suppose that there is some H ∈ Y \ X. Then t satisfies u(XH) and, therefore, r_{1}(H) = ni = r_{2}(H) or r_{1}(H) ≠ r_{2}(H). This, however, is a contradiction, as H ∈ Y = ag^{s}(r_{1}, r_{2}). Consequently, X = Y ∈ ag^{s}(t). □

EXAMPLE 5. Let SCHEDULE = CTLR with SCHEDULE_{s} = CT. Let Σ consist of the two UCs u(CT) and u(LTR), and let t denote the data sample in Table 2. Recall from the previous examples that Σ^{-1} = {a(CLR), a(LT), a(TR)}, and ag^{s}(t) = {CLR, LT, TR, L, R, T}. Since t satisfies the three conditions of Theorem 2, it follows that t is an Armstrong table for Σ and SCHEDULE_{s}. □
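The three conditions of Theorem 2 can be checked mechanically. The following sketch (ours, not the article's) does so for the running example; rows are dictionaries with None standing for ni, implication is decided via Theorem 1, and the antikeys are enumerated naively from Definition 3. The sample table mirrors the structure of Table 2.

```python
from itertools import combinations

SCHEMA = {"C", "T", "L", "R"}          # SCHEDULE = CTLR
TS = {"C", "T"}                        # SCHEDULE_s = CT
SIGMA = [{"C", "T"}, {"L", "T", "R"}]  # u(CT) and u(LTR)

def implied(sigma, x):
    # Theorem 1: Sigma implies u(X) iff some u(Y) in Sigma satisfies Y ⊆ X.
    return any(y <= x for y in sigma)

def antikeys(sigma, schema):
    # Definition 3: maximal sets X such that u(X) is not implied by Sigma.
    subsets = [set(c) for k in range(len(schema) + 1)
               for c in combinations(sorted(schema), k)]
    return [x for x in subsets
            if not implied(sigma, x)
            and all(implied(sigma, x | {h}) for h in schema - x)]

def agree_sets(table):
    # Definition 2: strong agree sets of all pairs of distinct rows.
    return [frozenset(h for h in r1 if r1[h] == r2[h] and r1[h] is not None)
            for r1, r2 in combinations(table, 2)]

def is_armstrong(table, sigma, schema, ts):
    ags = set(agree_sets(table))
    cond1 = all(frozenset(x) in ags for x in antikeys(sigma, schema))
    cond2 = all(not (set(x) <= set(y)) for x in sigma for y in ags)
    total = {h for h in schema if all(r[h] is not None for r in table)}
    return cond1 and cond2 and total == ts

# A table with the same structure as the data sample in Table 2 (None = ni).
t = [
    {"C": "c0", "T": "t0", "L": "l0", "R": "r0"},
    {"C": "c0", "T": "t1", "L": "l0", "R": "r0"},  # agrees with row 1 on CLR
    {"C": "c2", "T": "t0", "L": "l0", "R": None},  # agrees with row 1 on LT
    {"C": "c3", "T": "t0", "L": None, "R": "r0"},  # agrees with row 1 on TR
]
print(is_armstrong(t, SIGMA, SCHEMA, TS))  # → True
```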
We will now use the characterization of Theorem 2 to compute Armstrong tables for an arbitrarily given set Σ of UCs and an arbitrarily given NFS nfs(T_{s}) over an arbitrarily given table schema T. A great part of the computation is devoted to determining the set Σ^{-1}. For this purpose, we borrow concepts from hypergraphs. Indeed, to compute Σ^{-1} we generate the simple hypergraph H = (V, E) with the vertex set V = T and the set E = {X | u(X) ∈ Σ} of hyperedges. In fact, based on Theorem 1, we assume without loss of generality that Σ consists of minimal UCs only. If not, then we remove all those UCs from Σ that are not minimal. From this we obtain Σ^{-1} = {a(T \ X) | X ∈ Tr(H)}, where Tr(H) denotes the set of minimal transversals of the hypergraph H, i.e., the set of minimal sets X ⊆ T that have a nonempty intersection with every hyperedge of H [34].

LEMMA 1.
Let Σ be a set of UCs over table schema T. Then, Σ^{-1} = {a(T \ X) | X ∈ Tr(H)}.

Proof. Recall that X ∈ Tr(H) if and only if X is minimal with the property that Z ∩ X ≠ ∅ for every u(Z) ∈ Σ.

We show first that if X ∈ Tr(H), then a(T \ X) ∈ Σ^{-1}. First, it follows that Σ ⊭_{Ts} u(T \ X), since otherwise there would be some u(Z) ∈ Σ such that Z ⊆ T \ X. This, however, would mean that Z ∩ X = ∅, which contradicts the hypothesis that X ∈ Tr(H). It remains to show that for all H ∈ X, Σ ⊨_{Ts} u((T \ X)H). Assume the opposite, i.e., there is some H ∈ X such that Σ ⊭_{Ts} u((T \ X)H). Then, there cannot be any u(Z) ∈ Σ such that Z ⊆ (T \ X)H = T \ (X \ H). Hence, for all u(Z) ∈ Σ we have Z ∩ (X \ H) ≠ ∅. This, however, contradicts the minimality of X ∈ Tr(H). We have shown that a(T \ X) ∈ Σ^{-1}.

Now, we show that if a(X) ∈ Σ^{-1}, then T \ X ∈ Tr(H). From a(X) ∈ Σ^{-1} we conclude that Σ ⊭_{Ts} u(X). Hence, for all u(Z) ∈ Σ we have Z ⊈ X, that is, Z ∩ (T \ X) ≠ ∅, so T \ X is a transversal of H. From a(X) ∈ Σ^{-1} we know that for all H ∈ T \ X, Σ ⊨_{Ts} u(XH). Hence, for all H ∈ T \ X there is some u(Z) ∈ Σ such that Z ⊆ XH. Thus, for all H ∈ T \ X there is some u(Z) ∈ Σ such that Z ∩ ((T \ X) \ H) = ∅. That is, no proper subset of T \ X is a transversal of H, and therefore T \ X ∈ Tr(H). □
H , letc_{H} _{,0},c_{H} _{,1}, ... denote mutually distinct domain values fromdom (H ). Lines 24 deal with the special case where Σ contains the empty keysaying that a table can have at most one row. Here, we just need to return a table consisting of one row with null marker occurrences in columns which are nullable. Otherwise, we start off with a row
r _{0} with total values in every column (line 6). In lines 8 and 9 we use Lemma 1 to compute the set of antikeys with respect to Σ andnfs (T_{s} ). In lines 1115 we introduce for each antikeya (Y ) a new rowr_{i} , which has a strong agree setY withr _{0}. In lines 1620 we may need to introduce an additional row that has null value occurrences in nullable columns that do not feature null value occurrences in previously generated rows.The correctness of Algorithm 1 follows from Lemma 1 and Theorem 2.
THEOREM 3.
On input (T, Σ, nfs(T_{s})), Algorithm 1 computes a table t over T that is Armstrong for Σ and nfs(T_{s}).

Proof. It suffices to verify that the output of Algorithm 1 satisfies the conditions of Theorem 2. Indeed, Lemma 1 reveals that Algorithm 1 computes the antikeys correctly. The main construction of Algorithm 1 in lines 11-15 guarantees that the output satisfies the first condition. The construction also ensures that every strong agree set of the output of Algorithm 1 is contained in some antikey. This means that no UC from Σ can be contained in some strong agree set. This shows that the output satisfies the second condition. Finally, lines 17-19 ensure that the output satisfies the third condition. □

EXAMPLE 6. Let SCHEDULE = CTLR with SCHEDULE_{s} = CT. Let Σ consist of the two UCs u(CT) and u(LTR). On input (SCHEDULE, Σ, SCHEDULE_{s}), Algorithm 1 computes an Armstrong table from which a suitable value substitution yields the data sample in Table 2. □
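For illustration, the construction behind Algorithm 1 can be condensed into the following sketch (ours; it enumerates antikeys naively instead of via the transversal computation of lines 8 and 9, and writes the values c_{H,i} as strings such as "C0"). On the running example it outputs a 4-row table with the structure of Table 2.

```python
from itertools import combinations

def implied(sigma, x):
    # Theorem 1: Sigma implies u(X) iff some u(Y) in Sigma satisfies Y ⊆ X.
    return any(y <= x for y in sigma)

def antikeys(schema, sigma):
    # Maximal sets not forming an implied UC (naive enumeration).
    subsets = [set(c) for k in range(len(schema) + 1)
               for c in combinations(sorted(schema), k)]
    return [x for x in subsets if not implied(sigma, x)
            and all(implied(sigma, x | {h}) for h in schema - x)]

def armstrong_table(schema, sigma, ts):
    # Sketch of Algorithm 1; None stands for the null marker ni.
    if implied(sigma, set()):  # special case: Sigma contains u(∅)
        return [{h: (h + "0" if h in ts else None) for h in schema}]
    cols = sorted(schema)
    t = [{h: h + "0" for h in cols}]  # r0: total row
    for i, y in enumerate(antikeys(schema, sigma), start=1):
        # r_i agrees with r0 exactly on the antikey Y; elsewhere it is
        # null if the column is nullable, and a fresh value otherwise.
        t.append({h: h + "0" if h in y
                  else (h + str(i) if h in ts else None)
                  for h in cols})
    if any(all(r[h] is not None for r in t) for h in schema - ts):
        n = len(t)  # extra row supplying nulls for nullable columns
        t.append({h: (h + str(n) if h in ts else None) for h in cols})
    return t

SCHEMA, TS = {"C", "T", "L", "R"}, {"C", "T"}
SIGMA = [{"C", "T"}, {"L", "T", "R"}]
for row in armstrong_table(SCHEMA, SIGMA, TS):
    print(row)
```

Substituting values such as Ullman and "Mon, 10 am" for the generated strings turns this output into a data sample of the kind shown in Table 2.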
> D. Complexity Considerations
In this subsection, we investigate properties regarding the space and time complexity of computing Armstrong tables. We will demonstrate that the user-friendly representation of a set of SQL keys in the form of an Armstrong table comes, in the worst case, at a price. In fact, the number of rows in a minimum-sized Armstrong table can be exponential in the total number of column headers that occur in Σ. Because of this result we cannot, in the worst case, design an algorithm for generating Armstrong tables in polynomial time.
1) Worst-case time complexity
The following result follows straight from Theorem 2 and the correctness of Algorithm 1.
PROPOSITION 1.
Let Σ be a set of UCs and nfs(T_{s}) be some NFS over table schema T. Let t be an Armstrong table for Σ and nfs(T_{s}). Then |Σ^{-1}| ≤ |ag^{s}(t)| ≤ |t| · (|t| − 1)/2.

Proof. The first condition of Theorem 2 states that for all a(X) ∈ Σ^{-1} we have X ∈ ag^{s}(t). Adding the second condition of Theorem 2, we derive that for every Y ∈ ag^{s}(t) there is at most one X ∈ Σ^{-1} such that X ⊆ Y. This shows that |Σ^{-1}| ≤ |ag^{s}(t)|. Finally, |ag^{s}(t)| ≤ |t| · (|t| − 1)/2, since every pair of distinct rows in t has precisely one strong agree set. □

Let the size of an Armstrong table be defined as the number of rows that it contains. It is a practical question to ask how many rows a minimum-sized Armstrong table requires. An Armstrong table t for Σ and nfs(T_{s}) is said to be minimum-sized if there is no Armstrong table t' for Σ and nfs(T_{s}) such that |t'| < |t|. That is, a minimum-sized Armstrong table for Σ and nfs(T_{s}) has the smallest number of rows among all Armstrong tables for Σ and nfs(T_{s}).

We recall what we mean by precisely exponential [24]. Firstly, it means that there is an algorithm which, given a set Σ of UCs and an NFS nfs(T_{s}), computes an Armstrong table in time exponential in the size of the UCs. Secondly, it means that there is a set Σ of UCs and an NFS nfs(T_{s}) for which the number of rows in each minimum-sized Armstrong table for Σ and nfs(T_{s}) is exponential in the size of the UCs.

PROPOSITION 2.
The complexity of finding an Armstrong table, given a set of UCs and an NFS, is precisely exponential in the size of the UCs.

Proof. The time complexity of Algorithm 1 is proportional to the time complexity of computing the set Σ^{-1} of antikeys with respect to Σ and nfs(T_{s}). The computation of the antikeys via the hypergraph transversals in lines 8 and 9 requires time exponential in the size of the given UCs [35]. Therefore, Algorithm 1 runs in time exponential in the size of the given UCs.

It remains to show that there is a set Σ of UCs for which the number of rows in each Armstrong table for Σ is exponential in the size of the given UCs. According to Theorem 1 it suffices to find a set Σ of UCs such that |Σ^{-1}| is exponential in the size of the UCs. Such a set is given by Σ = {u(H_{2i−1} H_{2i}) | i = 1, . . . , n}. The set Σ^{-1} consists of those antikeys that contain precisely one column header from each UC in Σ. Therefore, Σ^{-1} contains precisely 2^{n} elements, while the size of Σ is 2n. □

2) Minimum-sized Armstrong tables
Despite the general worst-case exponential complexity in the size of the keys, Algorithm 1 is a fairly simple algorithm that is, as we now demonstrate, quite conservative in its use of time.
PROPOSITION 3.
Let Σ be a set of UCs and nfs(T_{s}) an NFS over table schema T. Let t be a minimum-sized Armstrong table for Σ and nfs(T_{s}). Then √(2 · |Σ^{-1}|) ≤ |t| ≤ |Σ^{-1}| + 2.

Proof. The upper bound follows immediately from Theorem 3. The lower bound follows from Theorem 1. Indeed, it follows that |Σ^{-1}| ≤ |t| · (|t| − 1)/2 ≤ |t|^{2}/2. Consequently, √(2 · |Σ^{-1}|) ≤ |t|. □
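The exponential family from the proof of Proposition 2 is easy to check mechanically for small n. The sketch below enumerates all maximal non-keys by brute force rather than via hypergraph transversals, so it is only an illustration of the blow-up, not a practical algorithm:

```python
from itertools import combinations

def antikeys(headers, keys):
    """Maximal subsets of headers that contain no key (brute force)."""
    non_keys = []
    for r in range(len(headers) + 1):
        for xs in combinations(headers, r):
            x = set(xs)
            if not any(k <= x for k in keys):
                non_keys.append(x)
    # keep only the maximal key-free subsets: these are the antikeys
    return [x for x in non_keys if not any(x < y for y in non_keys)]

# Σ = { u(H_{2i-1} H_{2i}) : i = 1..n } over 2n headers
n = 4
headers = [f"H{i}" for i in range(1, 2 * n + 1)]
keys = [{f"H{2*i-1}", f"H{2*i}"} for i in range(1, n + 1)]
aks = antikeys(headers, keys)
```

For n = 4 the enumeration confirms the proof: exactly 2^{4} = 16 antikeys, each picking one header from every pair.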
Note the upper bound in Proposition 3. In general, the focus on UCs can yield Armstrong tables with a substantially smaller number of rows than Armstrong tables for more expressive classes such as functional dependencies. The reason is that we do not need to violate any functional dependencies that are not implied by the given set. In practice, this is desirable for the validation of schemata known to be in Boyce-Codd normal form, for example. Such schemata are often the result of entity-relationship modeling. Applying the algorithm from [31], designed for UCs and functional dependencies, to our running example would yield an Armstrong table with 12 rows. Instead, Algorithm 1, designed for UCs only, produces an Armstrong table with just 4 rows. In general, Algorithm 1 computes Armstrong tables of reasonably small size.
COROLLARY 1.
On input (T, Σ, nfs(T_{s})), Algorithm 1 computes an Armstrong table for Σ and nfs(T_{s}) whose size is at most quadratic in the size of a minimum-sized Armstrong table for Σ and nfs(T_{s}). □

3) Size of representations
We show that, in general, there is no superior way of representing the information inherent in a set of UCs and a null-free subschema.
THEOREM 4.
There is some set Σ of UCs and an NFS nfs(T_{s}) such that Σ has size O(n), and the size of a minimum-sized Armstrong table for Σ and nfs(T_{s}) is O(2^{n/2}). There is some table schema T, some NFS nfs(T_{s}) and some set Σ of UCs over T such that there is an Armstrong table for Σ and nfs(T_{s}) whose number of rows is in O(n), while the number of minimal UCs implied by Σ in the presence of nfs(T_{s}) is in O(2^{n}).

Proof. For the first claim let T = H_{1} · · · H_{2n}, T_{s} = T and Σ = {u(H_{2i−1} H_{2i}) | i = 1, . . . , n}. Then Σ^{-1} = {a(X_{1} · · · X_{n}) | ∀i = 1, . . . , n (X_{i} ∈ {H_{2i−1}, H_{2i}})}.

For the second claim let T = H_{1} · · · H_{2n}, T_{s} = T and Σ = {u(X_{1} · · · X_{n}) | ∀i = 1, . . . , n (X_{i} ∈ {H_{2i−1}, H_{2i}})}. Then the set of minimal UCs implied by Σ is Σ itself. Since Σ^{-1} = {a(T \ (H_{2i−1} H_{2i})) | i = 1, . . . , n}, there is an Armstrong table for Σ and nfs(T_{s}) whose number of rows is in O(n). □

We can see that the representation as an Armstrong table can offer tremendous space savings over the representation as a UC set, and vice versa. It seems intuitive to use the representation as Armstrong tables for the discovery of semantically meaningful constraints, and the representation as constraint sets for the discovery of semantically meaningless constraints. This intuition has been confirmed empirically for the class of functional dependencies over relations [18].
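The second family in the proof of Theorem 4 can likewise be checked mechanically for a small n: the table below has only n + 1 rows, yet is Armstrong for 2^{n} minimal UCs. All values are total here, so strong agreement is plain equality; the encoding is an illustrative sketch:

```python
from itertools import product

n = 3
headers = [f"H{i}" for i in range(1, 2 * n + 1)]
pairs = [{f"H{2*i-1}", f"H{2*i}"} for i in range(1, n + 1)]
# Σ = { u(X_1…X_n) : X_i ∈ {H_{2i-1}, H_{2i}} } — 2^n minimal keys
keys = [set(c) for c in product(*(sorted(p) for p in pairs))]

# base row plus, per pair i, one row agreeing with it exactly on T \ pair_i
rows = [{h: 0 for h in headers}]
fresh = 1
for p in pairs:
    row = {}
    for h in headers:
        if h in p:
            row[h] = fresh
            fresh += 1
        else:
            row[h] = 0
    rows.append(row)

def agree(r1, r2):
    """Agree set of two rows (all values are total here)."""
    return {h for h in headers if r1[h] == r2[h]}

# every key of Σ is satisfied by the n+1 rows ...
keys_hold = all(not k <= agree(r1, r2)
                for i, r1 in enumerate(rows)
                for r2 in rows[i + 1:]
                for k in keys)
# ... and every antikey T \ pair_i appears as a strong agree set
antikeys_hit = all(
    any(agree(rows[0], r) == set(headers) - p for r in rows[1:])
    for p in pairs)
```

Any two antikey rows agree exactly on T minus two pairs, which contains no key, so the two Armstrong conditions hold with only n + 1 rows.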
V. DATA-DRIVEN SQL KEY DISCOVERY
In this section we establish algorithms for the automated discovery of uniqueness constraints from given SQL tables. Such algorithms have direct applications in schema design, query optimization, and the semantic sampling of databases [22,26,31,36,37]. In requirements engineering, for example, these algorithms can be utilized to discover semantically meaningful uniqueness constraints from sample data provided by domain experts.
> A. Mining by Pairwise Comparison of Rows
Our first algorithm gradually inspects all pairs of rows of the given table, and adjusts the set of minimal UCs accordingly. More specifically, for every pair of distinct rows r_{1}, r_{2} (line 3), and for every UC u(X) satisfied by the rows inspected so far (line 4), we replace u(X) by the UCs u(XH) whenever r_{1}, r_{2} violate u(X), where H is a column in T \ X such that r_{1}, r_{2} satisfy u(XH) (lines 5-10). Lines 12-14 remove non-minimal UCs. Note that the output of Algorithm 2 is the empty set whenever t contains any duplicate rows.

THEOREM 5.
On input (T, t), Algorithm 2 computes the set of minimal UCs satisfied by table t over T in time O(|T|^{2} × |t|^{2} × m^{2}_{t}), where m_{t} denotes the maximum number of minimal UCs satisfied by any subset s ⊆ t.

Proof. Algorithm 2 works correctly. A formal proof can be established by induction over the number of rows in the input table t. When t contains no row or just one row, then Algorithm 2 returns uc(t) = {u(∅)}, which is the unique minimal UC satisfied by t. Suppose Algorithm 2 correctly computes the set of minimal UCs satisfied by a table t with n rows, and let t' := t ∪ {r'}. Then the new row r' is compared with every row r in t. Whenever a UC u(X) is violated, it is replaced by the set of uniqueness constraints u(XH) where r' and r do not strongly agree on H. This ensures that table t' satisfies all the uniqueness constraints in uc(t'). Finally, lines 12-14 ensure that uc(t') does not contain any uniqueness constraints that are not minimal.

For the complexity bound note that there are |t|^{2} pairs of rows to inspect. For each pair, uc(t) has at most m_{t} elements, and for each element there are at most m_{t} · |T|^{2} operations to perform. □

EXAMPLE 7. Let t, t' be the data samples over SCHEDULE from Tables 2 and 3, respectively. The following table shows the evolution of the minimal UCs as new pairs of rows are gradually added, until the entire table has been explored. The UCs are those explained in the introductory section of this article. □
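A compact reading of Algorithm 2 in Python can look as follows. Rows are dicts, with None standing in for the null marker ni; this is a sketch of the pseudocode, not a literal transcription:

```python
def strongly_agree(r1, r2, h):
    return r1[h] is not None and r1[h] == r2[h]

def mine_ucs(headers, rows):
    ucs = {frozenset()}                 # u(∅): holds while |t| ≤ 1
    for i, r1 in enumerate(rows):
        for r2 in rows[i + 1:]:
            nxt = set()
            for x in ucs:
                if all(strongly_agree(r1, r2, h) for h in x):
                    # r1, r2 violate u(X): extend X by disagreeing columns
                    nxt |= {x | {h} for h in headers
                            if not strongly_agree(r1, r2, h)}
                else:
                    nxt.add(x)
            ucs = nxt
    # keep only the minimal UCs (lines 12-14)
    return {x for x in ucs if not any(y < x for y in ucs)}

# a 4-row table over CTLR whose minimal UCs are u(CT) and u(LTR)
rows = [
    {"C": 0, "T": 0, "L": 0, "R": 0},
    {"C": 0, "T": 1, "L": 0, "R": 0},   # agrees with the first row on CLR
    {"C": 2, "T": 0, "L": 0, "R": 3},   # agrees with the first row on TL
    {"C": 4, "T": 0, "L": 5, "R": 0},   # agrees with the first row on TR
]
result = mine_ucs(["C", "T", "L", "R"], rows)
```

On a table with duplicate rows the violated UC has no disagreeing column to extend by, so the maintained set empties out and the function returns ∅, matching the remark above.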
> B. Mining by Exploration of Hypergraph Transversals
Our next algorithm again uses the concept of hypergraph transversals. Algorithm 3 computes the complements of the strong agree sets of the given input table (lines 2 and 3). These complements are called weak disagree sets. Lines 4 and 5 compute the necessary weak disagree sets, that is, those weak disagree sets that are minimal. For the hypergraph whose node set is T and whose edge set consists of all necessary weak disagree sets, Algorithm 3 uses any (preferably the most efficient) algorithm to compute the set of all minimal transversals of the hypergraph. Lemma 2 shows that this set consists precisely of the minimal UCs satisfied by the input table t. Algorithm 3 is compact and benefits from any progress on the popular problem of computing minimal hypergraph transversals [34]. The following lemma explains the soundness of Algorithm 3.
LEMMA 2.
Let t be a table over table schema T. Then, for all X ⊆ T, X ∈ Tr(T, necdisag^{w}(t)) if and only if t satisfies u(X) and, for all H ∈ X, t violates u(X \ H).

Proof. We begin by showing that the two conditions are necessary for every X ∈ Tr(T, necdisag^{w}(t)). First, if t violated u(X), then there would be two rows r_{1}, r_{2} ∈ t such that r_{1} ≠ r_{2} and X ⊆ ag^{s}(r_{1}, r_{2}). Hence, X ∩ disag^{w}(r_{1}, r_{2}) = ∅, and there would be some Y ∈ necdisag^{w}(t) such that Y ⊆ disag^{w}(r_{1}, r_{2}). Thus, X ∩ Y = ∅, which contradicts the hypothesis that X ∈ Tr(T, necdisag^{w}(t)). We conclude that t satisfies u(X). Suppose there were some H ∈ X such that t satisfies u(X \ H). Then it would follow that for all r_{1}, r_{2} ∈ t with r_{1} ≠ r_{2} we have (X \ H) ∩ disag^{w}(r_{1}, r_{2}) ≠ ∅. This, however, would violate the minimality of X ∈ Tr(T, necdisag^{w}(t)). Therefore, for all H ∈ X, t violates u(X \ H).

Now we demonstrate that the two conditions are sufficient for X to be in Tr(T, necdisag^{w}(t)). From the fact that t satisfies u(X) it follows that for all r_{1}, r_{2} ∈ t with r_{1} ≠ r_{2} we have X ∩ disag^{w}(r_{1}, r_{2}) ≠ ∅, that is, X is a transversal. From the fact that t violates u(X \ H) for every H ∈ X it follows that for every H ∈ X there are some r_{1}, r_{2} ∈ t such that r_{1} ≠ r_{2} and X \ H ⊆ ag^{s}(r_{1}, r_{2}). The last condition means that (X \ H) ∩ disag^{w}(r_{1}, r_{2}) = ∅ holds. Consequently, for every H ∈ X there is some Z ∈ necdisag^{w}(t) such that Z ⊆ disag^{w}(r_{1}, r_{2}), and thus Z ∩ X ⊆ {H}. Hence no proper subset of X is a transversal, and we conclude that X ∈ Tr(T, necdisag^{w}(t)). □

THEOREM 6.
On input (T, t), Algorithm 3 computes the set of minimal UCs satisfied by table t over T in time O(m + n), where m := |T|^{2} × |t|^{2} × |necdisag^{w}(t)| and n denotes the time required to compute the minimal transversals Tr(T, necdisag^{w}(t)) [34].

Proof. The correctness of Algorithm 3 follows directly from the description of the algorithm and Lemma 2. The collection necdisag^{w}(t) can be computed in time O(m): there are |t|^{2} pairs of rows to consider, and for each new set in disag^{w}(t) one can check with at most |T|^{2} × |necdisag^{w}(t)| operations whether the set also belongs to necdisag^{w}(t). The computation of the set Tr(T, necdisag^{w}(t)) of minimal transversals can be done in time O(n) [34]. Note that the hypergraph (T, necdisag^{w}(t)) is simple, that is, there are no distinct sets X, Y in necdisag^{w}(t) with X ⊆ Y. □

EXAMPLE 8. Let t, t' be the data samples over SCHEDULE from Tables 2 and 3, respectively. The next table shows the steps of applying Algorithm 3 to (SCHEDULE, t) and (SCHEDULE, t'), respectively. The UCs are those explained in the introductory section of this article. □
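Algorithm 3 admits an equally compact reading. The minimal-transversal step below is naive brute force (any off-the-shelf transversal algorithm could be substituted), and the input is an illustrative 4-row sample whose minimal UCs are u(CT) and u(LTR); None stands in for the null marker ni:

```python
from itertools import combinations

def mine_ucs_by_transversals(headers, rows):
    # weak disagree sets: complements of the strong agree sets (lines 2-3)
    disag = set()
    for i, r1 in enumerate(rows):
        for r2 in rows[i + 1:]:
            disag.add(frozenset(
                h for h in headers
                if r1[h] is None or r2[h] is None or r1[h] != r2[h]))
    # necessary weak disagree sets: the minimal ones (lines 4-5)
    nec = [d for d in disag if not any(e < d for e in disag)]
    # minimal transversals, by brute force in order of increasing size
    trs = []
    for r in range(len(headers) + 1):
        for xs in combinations(sorted(headers), r):
            x = frozenset(xs)
            if all(x & d for d in nec) and not any(t <= x for t in trs):
                trs.append(x)
    return set(trs)

rows = [
    {"C": 0, "T": 0, "L": 0, "R": 0},
    {"C": 0, "T": 1, "L": 0, "R": 0},   # agrees with the first row on CLR
    {"C": 2, "T": 0, "L": 0, "R": 3},   # agrees with the first row on TL
    {"C": 4, "T": 0, "L": 5, "R": 0},   # agrees with the first row on TR
]
result = mine_ucs_by_transversals(["C", "T", "L", "R"], rows)
```

Here the necessary weak disagree sets are {T}, {C, R} and {C, L}, whose minimal transversals CT and LTR are exactly the minimal UCs, as Lemma 2 predicts.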
VI. INFORMATIVE ARMSTRONG TABLES
Here, we combine the schema- and data-driven approaches to the discovery of SQL keys, as given in Sections IV and V, respectively. The disadvantage of the schema-driven approach is that the Armstrong tables, when generated by our algorithm, contain only artificial values. It is therefore doubtful that such tables illustrate the current perceptions of the data engineers to the domain experts who inspect the table. Of course, the data engineers may substitute real domain values for the artificial values before they present the table to the domain experts, or they may generate the Armstrong tables on the basis of some real domain values. However, the table may still not be a good representative of the real-world application domain, since some combinations of the values may never occur in practice. We now outline a solution that overcomes this limitation. The necessary assumption is that some real-world data set t is available, for example, in the form of legacy data. Under this assumption, we can apply the data-driven algorithms from Section V to mine the set uc(t) of minimal uniqueness constraints (and NOT NULL constraints) satisfied by t. Subsequently, we apply the schema-driven algorithm from Section IV to compute an Armstrong table t' for uc(t) and the null-free subschema satisfied by t. The Armstrong table t' can be populated with suitable rows from t, that is, t' is a subset of the table t. Thus, the number of rows in t' will typically be much smaller than the number of rows in t. Therefore, t' constitutes an invaluable semantic sample of the data set t: semantic in the sense that it satisfies the same uniqueness and NOT NULL constraints as t.

DEFINITION 4.
Let t be a table over table schema T, and let C denote a class of constraints. A C-informative Armstrong table for t is a table t' over T such that i) t' ⊆ t, and ii) for all constraints φ in C over T, t' satisfies φ if and only if t satisfies φ. □

In the context of this paper, we are interested in informative Armstrong tables with respect to the class C of uniqueness and NOT NULL constraints. For an illustration of the concept we now look at a small example. The data sample
t_{real} in Table 4 shows some aspects of the Semester 2 timetable for 200-level courses in the Department of Computer Science, the University of Auckland, one month before the beginning of Semester 2 in 2012. Suppose we would like to compute an informative Armstrong table for t_{real} with respect to the class of uniqueness and NOT NULL constraints. We may apply Algorithm 2 or Algorithm 3 to the input (SCHEDULE, t_{real}) and compute the set uc(t_{real}) as

{u(Time), u(C_ID, L_Name), u(Room, L_Name)}.

It is not difficult to compute the maximal null-free subschema nfs(T_{s}) satisfied by t_{real}. We simply start with T_{s} = T and inspect t_{real} row by row: for each row r we remove a column H from T_{s} whenever r(H) = ni. For the data sample t_{real} from Table 4 we obtain T_{s} = {C_ID, Time}. Using the algorithms from Section IV we compute the set uc(t_{real})^{-1} of antikeys as

{a(L_Name), a(C_ID, Room)}.
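The row scan just described is straightforward to code. The rows below are invented stand-ins for Table 4, not its actual content, chosen so that only C_ID and Time are total; None plays the role of the null marker ni:

```python
def maximal_nfs(headers, rows):
    """Maximal null-free subschema: drop a column once a null shows up."""
    ts = set(headers)
    for r in rows:
        ts -= {h for h in ts if r[h] is None}
    return ts

# hypothetical fragment in the spirit of Table 4 (values are illustrative)
headers = ["C_ID", "Time", "L_Name", "Room"]
rows = [
    {"C_ID": "215", "Time": "Mon 9",  "L_Name": "Fagin", "Room": None},
    {"C_ID": "230", "Time": "Tue 10", "L_Name": None,    "Room": "3.407"},
    {"C_ID": "215", "Time": "Thu 9",  "L_Name": "Fagin", "Room": "3.407"},
]
ts = maximal_nfs(headers, rows)
```

The scan visits each row once, so it runs in time linear in the size of the table.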
It still remains to find rows in t_{real} whose strong agree sets are these antikeys, and which violate every null-free subschema violated by t_{real}. One such table t_{inf} is shown in Table 5. The first two rows have the strong agree set {C_ID, Room}, the first and third rows have the strong agree set {L_Name}, and the fourth row ensures that t_{inf} is total exactly where t_{real} is. The example illustrates the potential savings in size that informative Armstrong tables offer over the original data samples. For large original data samples, the informative Armstrong tables will be considerably smaller. For example, in the case of relational databases, De Marchi and Petit [22] reported that their informative Armstrong database contained only 0.6% of the tuples of the original database. The inspection of a semantic sample of small size is much more effective than the inspection of the original data set.

VII. EMPIRICAL MEASURES OF USEFULNESS
In this section we describe how the usefulness of applying Armstrong tables for the discovery of semantically meaningful SQL keys can be measured. For this purpose, we first introduce different measures of usefulness, and then describe a detailed example illustrating how the marking of, and feedback for, non-multiple-choice questions in database courses can be automated.
> A. Soundness, Completeness, and Proximity
Measuring the usefulness of applying Armstrong tables for the discovery of semantically meaningful SQL keys appears to be non-trivial. However, one may conduct a two-phase experiment where database design teams are given an application domain and are asked to specify the set of UCs they perceive as semantically meaningful. In the first phase, they are not given any help, except natural language descriptions by domain experts. In the second phase, our algorithm is used to produce Armstrong tables for the set of UCs the teams currently perceive as semantically meaningful. The Armstrong tables may be inspected together with the domain experts, and when new UCs are identified, a corresponding Armstrong table may be repeatedly produced. For an experiment or assignment, one may specify a target set Σ_{t} and possibly a target NFS nfs(T^{t}_{s}). One may then measure the quality of the output sets (Σ_{1}, T_{s}^{1}) of the first phase against the target sets (Σ_{t}, T^{t}_{s}), and the output sets (Σ_{2}, T_{s}^{2}) of the second phase against the target sets (Σ_{t}, T^{t}_{s}). If there is an increase in quality, then Armstrong tables do indeed appear to be useful. The question remains how to measure the quality of the output sets against the target sets. For this purpose we propose three measures. Let min(Σ) denote the set of UCs in Σ that are minimal, and let Σ^{+} denote the set of UCs over T implied by Σ.

Soundness measures which of the (minimal) UCs and headers currently perceived as semantically meaningful are actually semantically meaningful. If min(Σ) = ∅ and T_{s} = ∅, we define sound((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) := 1; otherwise,

sound((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) := (|{u(X) ∈ min(Σ) : Σ_{t} implies u(X)}| + |T_{s} ∩ T^{t}_{s}|) / (|min(Σ)| + |T_{s}|).

Completeness measures which of the actually semantically meaningful (minimal) UCs and headers in NFSs are currently perceived as semantically meaningful. If min(Σ_{t}) = ∅ and T^{t}_{s} = ∅, we define comp((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) := 1; otherwise,

comp((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) := (|{u(X) ∈ min(Σ_{t}) : Σ implies u(X)}| + |T_{s} ∩ T^{t}_{s}|) / (|min(Σ_{t})| + |T^{t}_{s}|).

Finally, proximity prox((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) combines soundness and completeness. If Σ^{+} ∪ Σ_{t}^{+} = ∅ and T_{s} ∪ T^{t}_{s} = ∅, we define prox((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) := 1; otherwise,

prox((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) := (|Σ^{+} ∩ Σ_{t}^{+}| + |T_{s} ∩ T^{t}_{s}|) / (|Σ^{+} ∪ Σ_{t}^{+}| + |T_{s} ∪ T^{t}_{s}|).

In database courses one may use Armstrong tables as automated feedback on solutions. Our measures can then be applied to automatically mark non-multiple-choice questions. This can reduce errors and save time in assessing course work.
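The three measures are easy to compute mechanically. The sketch below is one concrete reading of them, reconstructed so that it reproduces the numbers of the worked example in the next subsection; the function names and the handling of the empty special cases are assumptions of this sketch, not a verbatim transcription of the paper's formulas:

```python
from itertools import combinations

def implied(x, keys):
    """A UC u(X) is implied iff X contains some specified key."""
    return any(k <= x for k in keys)

def minimal(keys):
    return {k for k in keys if not any(j < k for j in keys)}

def closure(headers, keys):
    """All X ⊆ headers for which u(X) is implied."""
    return {frozenset(c) for r in range(len(headers) + 1)
            for c in combinations(sorted(headers), r)
            if implied(frozenset(c), keys)}

def soundness(sigma, ts, target, ts_t):
    m = minimal(sigma)
    if not m and not ts:
        return 1.0
    hits = sum(1 for x in m if implied(x, target))
    return (hits + len(ts & ts_t)) / (len(m) + len(ts))

def completeness(sigma, ts, target, ts_t):
    m = minimal(target)
    if not m and not ts_t:
        return 1.0
    hits = sum(1 for x in m if implied(x, sigma))
    return (hits + len(ts & ts_t)) / (len(m) + len(ts_t))

def proximity(headers, sigma, ts, target, ts_t):
    c1, c2 = closure(headers, sigma), closure(headers, target)
    if not (c1 | c2) and not (ts | ts_t):
        return 1.0
    return (len(c1 & c2) + len(ts & ts_t)) / (len(c1 | c2) + len(ts | ts_t))

# the running example: target Σ_t = {u(CT), u(LT), u(RT)}, target NFS CT
H = {"C", "L", "R", "T"}
target = {frozenset("CT"), frozenset("LT"), frozenset("RT")}
ts_t = {"C", "T"}
sigma1 = {frozenset("CT"), frozenset("LRT"), frozenset("CLR")}   # phase 1
sigma2 = {frozenset("CT"), frozenset("RT"), frozenset("LT"),
          frozenset("CLR")}                                      # phase 2
ts = {"L", "R", "T"}
```

Under this reading the two submissions of the next subsection score 1/2 and 4/7 in soundness, 2/5 and 4/5 in completeness, and 1/2 and 2/3 in proximity, matching the stated improvements of 1/14, 2/5, and 1/6.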
> B. A Detailed Example
Again, we use our running example where SCHEDULE consists of the four attributes C(_ID), L(_Name), R(oom), and T(ime). Suppose we describe, in natural language, the application domain to the students of a database course. Then we ask them to identify the uniqueness and NOT NULL constraints that they perceive to be semantically meaningful for this application domain. The students may ask questions in natural language to clarify their perception of the application domain. The lecturers of the course act as domain experts who provide answers, in natural language, that are consistent with the target set Σ_{t} = {u(CT), u(LT), u(RT)} and the target NFS T^{t}_{s} = CT over the table schema SCHEDULE.
Suppose one group of students returns as their answer the set Σ = {u(CT), u(LRT), u(CLR)} and T_{s} = LRT. One can then automatically compute sound((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) = (2 + 1)/(3 + 3) = 1/2 and comp((Σ, T_{s}), (Σ_{t}, T^{t}_{s})) = (1 + 1)/(3 + 2) = 2/5.
Furthermore, one may also return an Armstrong table for Σ and nfs(T_{s}), for example, the one in Table 6. The students inspect the table and may make several observations. For example, Fagin teaches 55505 and Gottlob teaches 77707 at the same time in the same room. The students decide to ask the domain experts whether different courses can be taught in the same room at the same time. The domain experts indicate that this is impossible, and the students decide to replace the UC u(LRT) by the UC u(RT) in response. In addition, the students notice that Gottlob teaches in different rooms at the same time. As this is impossible, they decide to specify the UC u(LT). They now submit their revised constraint set Σ' = {u(CT), u(RT), u(LT), u(CLR)} and T_{s} = LRT. Again, one can automatically compute sound((Σ', T_{s}), (Σ_{t}, T^{t}_{s})) = (3 + 1)/(4 + 3) = 4/7 and comp((Σ', T_{s}), (Σ_{t}, T^{t}_{s})) = (3 + 1)/(3 + 2) = 4/5.
Hence, inspecting the Armstrong table results in an improvement of 1/14 in soundness, 2/5 in completeness, and 1/6 in proximity.
VIII. CONCLUSION AND FUTURE WORK
We investigated the data- and schema-driven discovery of SQL keys. We established insights into structural and computational properties of Armstrong tables. These can improve the discovery of semantically meaningful SQL keys in practice, leading to better schema designs and improved data processing. In addition, we established algorithms to automatically discover SQL keys in given SQL tables. These have applications in requirements acquisition, schema design, and query optimization. We combined the data- and schema-driven approaches to compute informative Armstrong tables, which are effective semantic samples of given data sets. Moreover, we defined formal measures to assess the difference between sets of SQL keys. These can be applied to validate the usefulness of Armstrong tables and to database education. Our findings also apply to Codd's interpretation of the null marker as "value exists but unknown", using Levene and Loizou's possible world semantics [38]. In the future we plan to implement our results in a design aid, to test our measures in applications, and to address the class of foreign keys.

[Table 1.] An Armstrong table for SCHEDULE

[Table 2.] Another Armstrong table

[Table 3.] Updated table

[Algorithm 1] Armstrong table computation

[Algorithm 2] Mining of uniqueness constraints by pairwise row comparisons

[Algorithm 3] Mining of uniqueness constraints by exploring hypergraph transversals

[Table 4.] Real-world table t_real

[Table 5.] Informative Armstrong table t_inf for t_real

[Table 6.] Armstrong table as feedback