Schema- and Data-driven Discovery of SQL Keys

  • cc icon
  • ABSTRACT

    Keys play a fundamental role in all data models. They allow database systems to uniquely identify data items, and therefore, promote efficient data processing in many applications. Due to this, support is required to discover keys. These include keys that are semantically meaningful for the application domain, or are satisfied by a given database. We study the discovery of keys from SQL tables. We investigate the structural and computational properties of Armstrong tables for sets of SQL keys. Inspections of Armstrong tables enable data engineers to consolidate their understanding of semantically meaningful keys, and to communicate this understanding to other stake-holders. The stake-holders may want to make changes to the tables or provide entirely different tables to communicate their views to the data engineers. For such a purpose, we propose data mining algorithms that discover keys from a given SQL table. We combine the key mining algorithms with Armstrong table computations to generate informative Armstrong tables, that is, key-preserving semantic samples of existing SQL tables. Finally, we define formal measures to assess the distance between sets of SQL keys. The measures can be applied to validate the usefulness of Armstrong tables, and to automate the marking and feedback of non-multiple choice questions in database courses.


  • KEYWORD

    Algorithm , Complexity , Armstrong database , Key , Soundness , Completeness , Mining , SQL

  • I. INTRODUCTION

    In databases, keys play a fundamental role in understanding both the structure and semantics. Given an SQL table schema, a key is a collection of columns whose values uniquely identify rows. That is, no two different rows have matching total values in each of the key columns. The concept of a key is essential for many other data models, including semantic models [1-4], object models [5], description logics [6], XML [7-9], RDF [10], and OWL [11].

    The discovery of semantically meaningful SQL keys is a crucial task in many areas of modern data management, for example, in data modeling, database design, query optimization, indexing, and data integration [12]. This article is concerned with methods for semi-automated schemadriven and automated data-driven SQL key discovery.

    There is great demand in industry for such methods, because they vastly simplify the job of the database administrator and thereby decrease the overall cost of database ownership. The discovery of composite keys is especially difficult, because the number of possible keys

    increases exponentially with the number of columns [3,13]. Because of the industry demand, the goal is to provide practical algorithms that exhibit good typical case behavior. Industry-leading data modeling and design tools, such as the ERwin data modeler, emphasize the need for good test data in order to validate the semantics of the models they produce [14]. In our context, test data is considered good if it assists data engineers with the discovery of semantically meaningful keys. This calls for algorithms that produce test data which satisfy the keys currently perceived as semantically meaningful and violate the keys currently perceived as semantically meaningless.

      >  A. Running Example

    Consider a simple database that collects basic information about the weekly schedule of courses. That is, we have a schema SCHEDULE with columns C_ID, L_Name, as well as Time and Room. The schema stores the time (including the weekday) and room in which a lecturer (identified by their lecturer name L_Name) gives a course (identified by its C_ID). An SQL definition may be given as follows:

    image

    The table schema specifies additional assertions. The primary key forces rows over SCHEDULE to be NOT NULL in the C_ID and Time columns, and to be unique on

    their {C_ID, Time} projections. That is, no two different rows must have matching values in both the C_ID column and the Time column. A team of data engineers may wonder if the semantics of the application domain has been captured by this SQL definition. They decide to generate some good test data to discuss their current understanding with the domain experts, who do not understand database terminology. Therefore, the data engineers produce the data sample in Table 1, where ni denotes occurrences of the no information null marker.

    The domain experts express concern about rows 1 and 3: Ullman teaches both 11301 and 78200 in the red room on Monday at 10 am. After some discussion, the domain experts emphasize that different courses should not be taught by the same lecturer in the same room at the same time. As a consequence, the team of engineers decides to specify the uniqueness constraint (UC) u(Time, L_Name, Room). They produce the data sample in Table 2 to discuss their new understanding with the domain experts.

    The domain experts direct the engineers’ attention to rows 1 and 3 where Ullman teaches both 11301 and 78200 on Monday at 10 am. After exchanging some ideas, the experts agree that lecturers must not teach different courses at the same time.

    Moreover, from rows 1 and 4 of the data sample in Table 2, experts notice that both 11301 and 78200 are taught on Monday at 10am in the red room. It turns out that no two different courses must be taught at the same time in the same room. Therefore, the UC u(Time, L_Name, Room) is replaced by the two stronger UCs u(Time, L_Name) and u(Time, Room). The example shows the potential benefit of investigating good sample data for the discovery of semantically meaningful SQL keys. As a revised SQL table schema, the team of data engineers specify

    image

    While this approach appears to be very beneficial, the question remains what constitutes good test data, and how to create it (automatically). In some situations the domain experts may feel very confident and would like to have the opportunity to modify values in the test data to reflect their domain knowledge, or provide the entire test data themselves. In such cases, and other situations, the data engineers require automated means to discover a representation of the keys that are satisfied by the test data. For example, when the domain experts inspect the data sample in Table 1, they may simply suggest using the data sample in Table 3 instead, with the updated values indicated by bold font.

    On input of this table, a constraint mining algorithm would return NOT NULL constraints on C_ID and Time, as well as the three UCs u(C_ID, Time), u(L_Name, Time), and u(Room, Time). This, again, leads to the definition of SCHEDULE'. The question remains how to discover the SQL keys that are satisfied by a given SQL table.

      >  B. Contributions

    In this paper we will establish detailed answers to the two questions above. As our first main contribution we discuss the schema-driven discovery of SQL keys that are semantically meaningful for a given application domain. For this purpose we investigate the well-known concept of Armstrong databases for the class of SQL keys. Armstrong databases formalize the concept of good test data in the sense that they satisfy the set Σ of keys currently perceived as semantically meaningful, and violate all keys that are not implied by Σ. We characterize when a given SQL table is Armstrong with respect to a given set Σ of SQL keys. This characterization allows us to establish an algorithm that generates good test data for arbitrary sets of SQL keys. While we demonstrate that the problem of computing such Armstrong tables is precisely exponential in the number of column headers, our algorithm produces an Armstrong table whose size is at most quadratic. We also show that there are situations where the size of the Armstrong table our algorithm produces is exponentially smaller than the size of the keys given. Both extreme cases result from very specific situations, which are very unlikely to occur in practice. Indeed, in most situations that do occur in practice, our algorithm will produce Armstrong tables whose size is in the same order of magnitude as the size of the keys given.

    As a second main contribution we discuss the datadriven discovery of SQL keys. For this purpose we establish two algorithms that compute the set of minimal SQL keys satisfied by a given SQL table. While the problem generally requires exponential time, our algorithms show good best case behavior. As a third main contribution we combine the schema-driven and data-driven approach to increase the effectiveness of SQL key discovery. Given a real world data set, we apply the data-driven approach to identify the minimal SQL keys satisfied by this data set, and then apply the schema-driven approach to compute an Armstrong table for the set of these minimal keys. The Armstrong table is called informative as it contains only rows of the original data set, is much smaller in size, and satisfies the same SQL keys.

    As the final contribution, we define formal measures that can be applied to 1) empirically validate the usefulness of our Armstrong tables for the acquisition of semantically meaningful SQL keys, and 2) automate the feedback and marking of database exam questions.

      >  C. Organization

    We summarize related work in Section II, and provide preliminary definitions in Section III. We investigate structural and computational properties of Armstrong tables for the class of SQL keys in Section IV. In Section V we study the SQL key mining problem. Section VI combines the data-driven and schema-driven approaches to the discovery of SQL keys. Our formal measures of usefulness and their applications are discussed in Section VII. Finally, in Section VIII, we conclude and briefly comment on future work.

    II. RELATED WORK

    Data dependencies have been studied thoroughly in the relational model of data, cf. [15-17]. Dependencies are essential to the design of the target database, the maintenance of the database during its lifetime, and all major data processing tasks [15,17]. These applications also provide strong motivation for developing data mining algorithms to discover the data dependencies that are satisfied by a given database. Armstrong databases are a useful design aid for data engineers, which can help with the consolidation of data dependencies [18,19] and schema mappings [20], the design of databases [21] and the creation of concise test data [22].

    In the relational model of data, the class of keys is subsumed by the class of functional dependencies. The structural and computational properties of Armstrong relations have been investigated in the relational model of data for the class of keys [3,23], and the more general class of functional dependencies [21,24]. The mining of keys and functional dependencies in relations has also received considerable attention in the relational model [12,25,26]. The concept of informative Armstrong databases was introduced in [22] as semantic samples of existing databases.

    One of the most important extensions of Codd’s basic relational model [27] is incomplete information [28,29]. This is mainly due to the high demand for the correct handling of such information in real-world applications. Approaches to deal with incomplete information comprise incomplete relations, or-relations or fuzzy relations. In this paper we focus on incomplete bags, and the most popular interpretation of a null marker as “no information” [29,30]. This is the general case of SQL tables where duplicate rows and null markers are permitted to occur in columns that are specified as null-able. Relations are idealized special SQL tables where no duplicate rows can occur and all columns are specified NOT NULL. Recently, Armstrong tables have been investigated for their combined class of SQL keys and functional dependencies [31]. In this article, we establish optimizations that arise from the focus on the sole class of SQL keys. This provides insight into the trade-off between the expressiveness of data dependency classes and the structural and computational properties of Armstrong tables. The insight can be used by data engineers to make a more informed decision about the complexity of the design process for the target database. For example, it may appear that the most important requirements of an application domain can already be captured by SQL keys alone, without the need for functional dependencies. In this case, engineers can utilize the algorithms developed in the current article, which result in computations that are more resource-efficient than those developed for the combined class of keys and functional dependencies [31]. Note that Hartmann et al. [1] studies Codd keys, which require tuples to have unique and total projections on key attributes. In contrast, the present article studies SQL keys, which require total tuples to have unique projections on key attributes. Hartmann et al. [1] does also not discuss any key mining algorithms or measures to assess the difference of key sets, nor does it discuss informative Armstrong databases. To the best of the authors’ knowledge, no previous work has addressed the mining of keys from SQL tables. Our results reveal that the methods developed for the idealized special case of relations [26] can be adapted to the use for SQL tables. We are also unaware of any studies related to informative Armstrong databases except [22], measures that capture the difference between sets of keys, nor their utilization for assessing the usefulness of Armstrong tables or database exam questions.

    The present article is an extension of the conference paper [32]. The present article contains all the proofs of our results as well as additional motivation and discussion. In addition, the concept of informative Armstrong tables was not discussed in the conference paper.

    III. THE SQL TABLE MODEL

    In this section we introduce the basic definitions necessary and sufficient for the development of our results in subsequent sections. In particular, we introduce a model that is very similar to SQL’s data definition capabilities. The interpretation of null marker occurrences as “no information” follows the SQL interpretation, and the formal model introduced by Zaniolo [29]. The class of SQL keys under consideration is the exact class of uniqueness constraints defined in the SQL standard [33].

    Let

    image

    be a countably infinite set of symbols, called column headers or headers for short. A table schema is a finite non-empty subset T of

    image

    Each header H of a table schema T is associated with a countably infinite domain dom(H) of the possible values that can occur in column H. To encompass partial information, every column can have a null marker, denoted by ni ∈ dom(H). The intention of ni is to signify “no information”.

    For header sets X and Y, we may write XY for XY. If X = {H1,. . . , Hm}, then we may write H1 ... Hm for X. In particular, we may write simply H to represent the singleton {H}. A row over T (T-row or simply row, if T is understood) is a function r : T → ∪ HT dom(H) with r(H) ∈ dom(H) for all HT. The null marker occurrence r(H) = ni associated with a header H in a row r means that there is no information about r(H). That is, r(H) may not exist or r(H) exists but is unknown. For XT let r(X) denote the restriction of the row r over T to X. A table t over T is a finite multi-set of rows over T. For a row r over T and a set XT, r is said to be X-total if for all HX, r(H) ≠ ni. Similarly, a table t over T is said to be Xtotal, if every row r of t is X-total. A table t over T is said to be a total table if it is T-total.

    A UC over a table schema T is an expression u(X) where XT. A table t over T is said to satisfy the UC u(X) over T, denoted by |=t u(X), if for all r1, r2t, if r1(X) = r2(X) and r1, r2 are X-total, then r1 = r2. The semantics is that of SQL’s UC [33], and it reduces to that of a key for total tables [15].

    Following Atzeni and Morfuni [30] a null-free subschema (NFS) over the table schema T is an expression nfs(Ts) where TsT. The NFS nfs(Ts) over T is satisfied by a table t over T, denoted by |=t nfs(Ts), if and only if t is Ts-total. SQL allows the specification of column headers as NOT NULL. NFSs occur in everyday database practice: the set of headers declared NOT NULL forms an NFS over the underlying table schema.

    In schema design and maintenance data dependencies are normally specified as semantic constraints on the tables intended to be instances of the schema. During the design process or the lifetime of a database one usually needs to determine further dependencies, which are implied by the given ones. Let T be a table schema, nfs(Ts) an NFS, and Σ ? {φ} be a set of UCs over T. We say that Σ implies φ in the presence of nfs(Ts), denoted by Σ |=Ts φ, if every Ts-total table t over T that satisfies Σ also satisfies φ. If Σ does not imply φ in the presence of nfs(Ts) we may also write Σ |≠Ts φ. Let Σ* Ts = {φ |Σ|=Ts φ} be the semantic closure of Σ. If we do not want to emphasize the presence of the NFS nfs(Ts) we may simply write Σ |= φ, Σ |≠ φ or Σ*, respectively. The next result explains why minimal UCs are important. Indeed, for a set Σ ∪ {u(X)} of UCs, and an NFS nfs(Ts) over T we call u(X) minimal if and only if Σ |=Ts u(X) and for all u(Y) over T where Σ |=Ts u(Y) and YX we have Y = X.

    THEOREM 1. Let T be a table schema, nfs(Ts) an NFS, and Σ ∪ {φ} be a set of UCs over T. Then, Σ implies φ in the presence of nfs(Ts) if and only if there is some u(Y) ∈ Σ such that Y ⊆ X.

    Proof. Suppose first that there is some u(Y) ∈ Σ such that YX. We need to show that Σ implies φ in the presence of nfs(Ts). Assume to the contrary that there is some Ts-total table t that satisfies Σ, but violates φ. Consequently, there are two different Ts-total rows r1, r2t such that r1(X) = r2(X). Since YX we conclude that t violates u(Y) ∈ Σ, a contradiction. Consequently, Σ implies φ in the presence of nfs(Ts).

    For the reverse direction we present the contraposition. That is, we assume that for all u(Y) ∈ Σ we have

    image

    and show that under this assumption Σ does not imply φ in the presence of nfs(Ts). For the latter, we construct a Ts- total table t that satisfies Σ but violates φ. Let t be the table over T consisting of two rows r1 and r2 over T where r1 and r2 are XTs-total and r1(X) = r2(X), and r1(H) = ni = r2(H) for all HTXTs, and r1(H) ≠ r2(H) for all H ∈ (TX) ? Ts. The construction ensures that t violates u(X). Let u(Y) ∈ Σ. We have assumed that

    image

    that is,

    image

    Again, the construction ensures that t satisfies u(Y). This concludes the proof. □

    In the relational model of data, keys enjoy a simple axiomatization [17] by two axioms. The set axiom says that for every relation schema R, the set R of all attributes forms a key. The superkey rule states that for every set K of attributes in R that forms a key over R, every superset K' ⊆ R of K also forms a key over R. For SQL we have just shown that this axiomatization becomes even simpler. The proof of Theorem 1 has just shown that the superkey rule is sound and complete for the implication of SQL keys. In particular, the set axiom is no longer sound since we can have two duplicate tuples in a multiset.

    EXAMPLE 1. We can capture the SQL table schema of the running example as the table schema SCHEDULE = CTLR with SCHEDULEs = CT. Let Σ consist of the two UCs u(CT) and u(LTR). It follows that Σ does not imply any of the following UCs: u(CLR), u(LT) nor u(TR). For instance, the SCHEDULEs-total table in Table 2 satisfies Σ and violates every of the three UCs. As an application of Theorem 1 we see that neither of CLR, LT nor TR is a superset of CT or LTR. □

    IV. SCHEMA-DRIVEN SQL KEY DISCOVERY

    In this section we investigate the structural and computational properties of suitable data to test the semantic meaningfulness of uniqueness constraints over the SQL table schemata. For this purpose, we use Armstrong tables to formalize the notion of suitable test data. Having introduced the concepts of strong agree sets and anti-keys, we characterize when an arbitrarily given SQL table is Armstrong for an arbitrarily given set of uniqueness constraints. The characterization is then used to compute Armstrong tables. Finally, we derive results on the time and space complexity associated with the computation of Armstrong tables.

      >  A. Key Concepts

    The official concept of an Armstrong database was originally introduced by Fagin [16]. We require our tables to be Armstrong with respect to uniqueness constraints and the NFS. Intuitively, an Armstrong table satisfies the given constraints and violates the constraints in the given class that are not implied by the given constraints. This results in the following definition.

    DEFINITION 1. Let Σ be a set of UCs, and nfs(Ts) an NFS over table schema T. A table t over T is said to be Armstrong for Σ and nfs(Ts) if and only if i) t satisfies Σ, ii) t violates every UC φ ?Σ*Ts, iii) t is Ts-total, and iv) for every H ∈ T ? Ts, t is not H-total. □

    Through the use of Theorem 1, it is easy to see that a table t is Armstrong for Σ and nfs(Ts) if and only if 1) for all u(X) over T, t satisfies u(X) if and only if there is some u(Y) ∈ Σ such that YX, and 2) for all nfs(T's) over T, t satisfies nfs(T's) if and only if T'sTs.

    EXAMPLE 2. Let SCHEDULE = CTLR with SCHEDULEs = CT. Let Σ consist of the two UCs u(CT) and u(LTR). The data sample in Table 2 is Armstrong for Σ and nfs(SCHEDULEs). For example, it violates the UCs: u(CLR), u(LT) and u(TR), as well as every NFS nfs(T's) where T's is not contained in SCHEDULEs. □

    A natural question to ask is how we can characterize the structure of tables that are Armstrong. With this in mind, we introduce the formal notion of strong agree sets for pairs of distinct rows, and tables.

    DEFINITION 2. For two rows r1 and r2 over table schema T where r1 ≠ r2 we define the strong agree set of r1 and r2 as the set of all column headers over T on which r1 and r2 have a matching total value, i.e., ags(r1,r2) = {H ∈ T | r1(H) = r2(H) and r1(H) ≠ ni ≠ r2(H)}. Furthermore, the strong agree set of a table t over table schema T is defined as ags(t) = {ags(r1, r2) | r1,r2 ∈ t ∧ r1 ≠ r2}.

    EXAMPLE 3. Let SCHEDULE = CTLR with SCHEDULEs = CT. Let Σ consist of the two UCs u(CT) and u(LTR). Let t denote the data sample in Table 2. The strong agree set of t consists of CLR, LT, TR, L, R, and T. □

    For a table t to be Armstrong for Σ and nfs(Ts), t must violate all UCs u(X) not implied by Σ in the presence of nfs(Ts). Instead of violating all UCs, it suffices to violate those ones that are maximal with the property that they are not implied by Σ in the presence of nfs(Ts). This motivates the following definition.

    DEFINITION 3. Let Σ be a set of UCs, and nfs(Ts) an NFS over table schema T. The set Σ-1 of all anti-keys with respect to Σ and nfs(Ts) is defined as Σ-1 = {a(X) | X ⊆ T ∧ Σ |≠Ts u(X) ∧ ∀H ∈ (T ? X)(Σ |=Ts u(XH))}. □

    Hence, an anti-key for Σ is given by a maximal set of column headers, which does not form a uniqueness constraint implied by Σ.

    EXAMPLE 4. Let SCHEDULE = CTLR with SCHEDULEs = CT. Let Σ consist of the two UCs u(CT) and u(LTR). The set Σ-1 of anti-keys with respect to Σ and SCHEDULEs consists of a(CLR), a(LT) and a(TR). □

      >  B. Structure of Armstrong Tables

    The concepts from the last sub-section are sufficient to establish a characterization of Armstrong tables for the class of UCs over SQL table schemata.

    THEOREM 2. Let Σ be a set of UCs, and nfs(Ts) an NFS over the table schema T. For all tables t over T it holds that t is an Armstrong table for Σ and nfs(Ts) if and only if the following three conditions are satisfied: for all a(X) ∈ Σ-1 we have X ∈ ags(t); for all u(X) ∈ Σ and for all Y ∈ ags(t) we have

    image

    total(t) = {H ∈ T | ∀r ∈ t(r(H) ≠ ni)} = Ts.

    Proof. First, we show that the three conditions are sufficient for t to be an Armstrong table for Σ and nfs(Ts). Suppose that t is such that the three conditions are satisfied. It follows immediately from the last condition that t satisfies nfs(T's) if and only if T'sTs. Let u(X) ∈ Σ. If there were two rows r1, r2t such that r1r2 and Xags(r1,r2) = Y, then Yags(t). This, however, would violate the second condition. Hence, t satisfies u(X). Since u(X) ∈ Σ was an arbitrary choice we conclude that t satisfies Σ. Let u(Y) ? Σ*. Then there is some a(X) ∈ Σ-1 such that YX holds. From the first condition we conclude that Yags(r1,r2) for some r1, r2t with r1r2. Consequently, t violates every uniqueness constraint not implied by Σ.

    Showing that the three conditions hold necessarily whenever t is an Armstrong table for Σ and nfs(Ts) still remains. Suppose that t is an Armstrong table for Σ and nfs(Ts). The last condition follows immediately from the fact that t satisfies nfs(T's) if and only if T'sTs. Since t satisfies Σ there cannot be any Yags(t) and u(X) ∈ Σ such that XY holds. We conclude that the second condition is satisfied. It remains to show that the first condition is satisfied, too. Let a(X) ∈ Σ-1. We need to show that Xags(t). From a(X) ∈ Σ-1 it follows that u(X) ? Σ*. As t violates u(X) it follows that there are r1, r2t such that r1r2 and XY = ags(r1,r2). Also, from a(X) ∈ Σ-1 it follows that for all HTX we have u(XH) ∈ Σ*, and thus that t satisfies u(XH). Suppose that there is some HYX. Then t satisfies u(XH) and, therefore, r1(H) = ni = r2(H) or r1(H) ≠ r2(H). This, however, is a contradiction as HY = ags(r1,r2). Consequently, X = Yags(t). □

    EXAMPLE 5. Let SCHEDULE = CTLR with SCHEDULEs = CT. Let Σ consist of the two UCs u(CT) and u(LTR), and let t denote the data sample in Table 2. Recall from the previous examples that Σ-1 = {a(CLR), a(LT), a(TR)}, and ags(t) = {CLR, LT, TR, L, R, T}. Since t satisfies the three conditions of Theorem 2, it follows that t is an Armstrong table for Σ and SCHEDULEs. □

      >  C. Computation of Armstrong Tables

    We will now use the characterization of Theorem 2 to compute Armstrong tables for an arbitrarily given set Σ of UCs and an arbitrarily given NFS nfs(Ts) over an arbitrarily given table schema T. A great part of the computation is devoted to determining the set Σ-1. For this purpose, we borrow concepts from hyper-graphs. Indeed, to compute Σ-1 we generate the simple hyper-graph H = (V, E) with the vertex set V = T and the set E = {X | u(X) ∈ Σ} of hyper-edges. In fact, based on Theorem 1 we assume without loss of generality that Σ consists of minimal UCs only. If not, then we remove all those UCs from Σ that are not minimal. From this we obtain Σ-1 = {a(TX) | XTr(H)} where Tr(H) denotes the minimal transversals of the hyper-graph H, i.e., the set of minimal sets XT that have a non-empty intersection with every hyper-edge of H [34].

    LEMMA 1. Let Σ be a set of UCs over table schema T. Then, Σ-1 = {a(TX) | XTr(H)}.

    Proof. Recall that

    image

    We show first that if XTr(H), then a(TX) ∈ Σ-1. First, it follows that Σ |≠Ts u(TX) since otherwise there would be some u(Z) ∈ Σ such that ZTX. This, however, would mean that ZX = ?, which contradicts the hypothesis that XTr(H). It remains to show that for all HX, Σ |=Ts u((TX)H). Assume the opposite, i.e., there is some HX such that Σ |≠Ts u((TX)H). Then, there cannot be any u(Z) ∈ Σ such that Z ⊆ (TX)H = T ? (XH). Hence, for all u(Z) ∈ Σ we have

    image

    This, however, contradicts the minimality of XTr(H). We have shown that a(TX) ∈ Σ-1.

    Now, we show that if a(X) ∈ Σ-1, then TXTr(H). From a(X) ∈ Σ-1 we conclude that Σ |≠Ts u(X). Hence, for all

    image

    From a(X) ∈ Σ-1 we know that for all HTX, Σ |=Ts u(XH). Hence, for all HTX there is some u(Z) ∈ Σ such that ZXH. Thus, for all HTX there is some u(Z) ∈ Σ such that

    image

    That is, TXTr(H).

    Now, we have a complete strategy for computing Armstrong tables, which we summarize in Algorithm 1. For each column header H, let cH,0, cH,1, ... denote mutually distinct domain values from dom(H). Lines 2-4 deal with the special case where Σ contains the empty key

    image

    saying that a table can have at most one row. Here, we just need to return a table consisting of one row with null marker occurrences in columns which are null-able. Otherwise, we start off with a row r0 with total values in every column (line 6). In lines 8 and 9 we use Lemma 1 to compute the set of anti-keys with respect to Σ and nfs(Ts). In lines 11-15 we introduce for each anti-key a(Y) a new row ri, which has a strong agree set Y with r0. In lines 16-20 we may need to introduce an additional row that has null value occurrences in null-able columns that do not feature null value occurrences in previously generated rows.

    The correctness of Algorithm 1 follows from Lemma 1 and Theorem 2.

    THEOREM 3. On input (T, Σ, nfs(Ts)), Algorithm 1 computes a table t over T that is Armstrong for Σ and nfs(Ts).

    Proof. It suffices to verify that the output of Algorithm 1 satisfies the conditions in Theorem 2. Indeed, Lemma 1 reveals that Algorithm 1 computes the anti-keys correctly. The main construction of Algorithm 1 in lines 11- 15 guarantees that the output satisfies the first condition. The construction also ensures that every strong agree set from the output of Algorithm 1 is contained in some antikey. This means that no UC from Σ can be contained in some strong agree set. This shows that the output satisfies the second condition. Finally, lines 17-19 ensure that the output satisfies the third condition. □

    EXAMPLE 6. Let SCHEDULE = CTLR with SCHEDULEs = CT. Let Σ consist of the two UCs u(CT) and u(LTR). On input (SCHEDULE, Σ, SCHEDULEs), Algorithm 1 would compute the following Armstrong table:

    image

    A suitable value substitution yields the data sample in Table 2. □

      >  D. Complexity Considerations

    In this subsection, we investigate properties regarding the space and time complexity for computing Armstrong tables. We will demonstrate that the user-friendly representation of a set of SQL keys in the form of an Armstrong table comes, in the worst case, at a price. In fact, the number of rows in a minimum-sized Armstrong table can be exponential in the total number of column headers that occur in Σ. Because of this result we cannot, in the worst case, design an algorithm for generating Armstrong tables in polynomial time.

    1) Worst-case time-complexity

    The following result follows straight from Theorem 2 and the correctness of Algorithm 1.

    PROPOSITION 1. Let Σ be a set of UCs and nfs(Ts) be some NFS over table schema T. Let t be an Armstrong table for Σ and nfs(Ts). Then

    image

    Proof. The first condition of Theorem 2 states that for all a(X) ∈ Σ-1 we have Xags(t). If we add the second condition of Theorem 2, then we derive that for all Yags(t) there is at most one X ∈ Σ-1 such that XY . This shows that

    image

    Finally,

    image

    since every distinct pair of distinct tuples in t has precisely one agree set. □

    Let the size of an Armstrong table be defined as the number of rows that it contains. It is a practical question to ask how many rows a minimum-sized Armstrong table requires. An Armstrong table t for Σ and nfs(Ts) is said to be minimum-sized if there is no Armstrong table t' for Σ and nfs(Ts) such that |t'| < |t|. That is, for a minimum-sized Armstrong table for Σ and nfs(Ts) there is no Armstrong table for Σ and nfs(Ts) with a smaller number of rows.

    We recall what we mean by precisely exponential [24]. Firstly, it means that there is an algorithm for computing an Armstrong table, given a set Σ of UCs and an NFS nfs(Ts), where the running time of the algorithm is exponential in the size of the keys. Secondly, it means that there is a set Σ of UCs and an NFS nfs(Ts) in which the number of rows in each minimum-sized Armstrong table for Σ and nfs(Ts) is exponential.

    PROPOSITION 2. The complexity of finding an Armstrong table, given a set of UCs and an NFS, is precisely exponential in the size of the UCs.

    Proof. The time complexity of Algorithm 1 is proportional to the time-complexity of computing the set Σ-1 of anti-keys with respect to Σ and nfs(Ts). The computation of the anti-keys via the hypergraph transversals in lines 8 and 9 requires time exponential in the size of the UCs given [35]. Therefore, Algorithm 1 runs in time exponential in the size of the UCs given.

    It remains to show that there is a set Σ of UCs for which the number of rows in each Armstrong table for Σ is exponential in the size of the UCs given. According to Theorem 1 it suffices to find a set Σ of UCs such that Σ-1 is exponential in the size of the UCs. Such a set is given by

    Σ = {u(H2i-1, H2i) | i = 1, . . . , n}.

    The set Σ-1 consists of those anti-keys that contain precisely one column header for each element of Σ. Therefore, Σ-1 contains precisely 2n elements with 2n being the size of Σ. □

    2) Minimum-sized Armstrong tables

    Despite the general worst-case exponential complexity in the size of the keys, Algorithm 1 is a fairly simple algorithm that is, as we now demonstrate, quite conservative in its use of time.

    PROPOSITION 3. Let Σ be a set of UCs, nfs(Ts) an NFS over table schema T. Let t be a minimum-sized Armstrong table for Σ and nfs(Ts). Then

    image

    Proof. The upper bound follows immediately from Theorem 3. The lower bound follows from Theorem 1. Indeed, it follows that

    image
    image

    Consequently,

    image

    Note the upper bound in Proposition 3. In general, the focus on UCs can yield Armstrong tables with a substantially smaller number of rows than Armstrong tables for more expressive classes such as functional dependencies. The reason is that we do not need to violate any functional dependencies that are not implied by the given set. In practice, this is desirable for the validation of schemata known to be in Boyce-Codd normal form, for example. Such schemata are often the result of entity-relationship modeling. Applying the algorithm from [31], designed for UCs and functional dependencies, to our running example would yield an Armstrong table with 12 rows. Instead, Algorithm 1, designed for UCs only, produces an Armstrong table with just 4 rows. In general, we can conclude that Algorithm 1 always computes an Armstrong table of reasonably small size.

    COROLLARY 1. On input (T, Σ, nfs(Ts)), Algorithm 1 computes an Armstrong table for Σ and nfs(Ts) whose size is at most quadratic in the size of a minimum-sized Armstrong table for Σ and nfs(Ts). □

    3) Size of representations

    We show that, in general, there is no superior way of representing the information inherent in a set of UCs and a null-free subschema.

    THEOREM 4. There is some set Σ of UCs and an NFS nfs(Ts) such that Σ has size O(n), and the size of a minimum- sized Armstrong table for Σ and nfs(Ts) is O(2n/2). There is some table schema T, some NFS nfs(Ts) and some set Σ of UCs over T such that there is an Armstrong table for Σ and nfs(Ts) where the number of rows is in O(n), and the size of the minimal UCs implied by Σ in the presence of nfs(Ts) is in O(2n).

    Proof. For the first claim let T = H1, . . . , H2n, Ts = T and Σ = {u(H2i-1H2i) | i = 1, . . . , n}. Then Σ-1 = {a(X1 ... Xn) | ∀ i = 1, ... ,n(Xi ∈ {H2i-1, H2i})}.

    For the second claim let T = H1 ... H2n, Ts = T and Σ = {u(X1 ... Xn) | ∀ i = 1, . . . , n(Xi ∈ {H2i-1, H2i})}. Then, the set of minimal UCs implied by Σ is Σ itself. Since Σ-1 = {a(T ? (H2i-1H2i)) | i = 1, . . . , n} there is an Armstrong table for Σ and nfs(Ts) where the number of rows is in O(n). □

    We can perceive that the representation in form of an Armstrong table can offer tremendous space savings over the representation as a UC set, and vice versa. It seems intuitive to use the representation as Armstrong tables for the discovery of semantically meaningful constraints, and the representation as constraint sets for the discovery of semantically meaningless constraints. This intuition has been confirmed empirically for the class of functional dependencies over relations [18].

    V. DATA-DRIVEN SQL KEY DISCOVERY

    In this section we will establish algorithms for the automated discovery of uniqueness constraints from given SQL tables. Such algorithms have direct applications in schema design, query optimization, and the semantic sampling of databases [22,26,31,36,37]. In requirements engineering, for example, these algorithms can be utilized to discover semantically meaningful uniqueness constraints from sample data, which domain experts provide.

      >  A. Mining by Pairwise Comparison of Rows

    Our first algorithm gradually inspects all pairs of rows for the given table, and adjusts the set of minimal UCs accordingly. More specifically, for every given pair of different rows r1, r2 (line 3), and for every UC u(X) satisfied by the rows inspected so far (line 4), we replace u(X) by u(XH) whenever r1, r2 violate u(X) and H is a column in TX such that r1, r2 satisfy u(XH) (lines 5-10). Lines 12-14 remove non-minimal UCs. Note that the output of Algorithm 2 is

    image

    whenever t contains any duplicate rows.

    THEOREM 5. On input (T, t), Algorithm 2 computes the set of minimal UCs satisfied by table t over T in time O(|T|2 × |t|2 × m2 t) where mt denotes the maximum number of minimal UCs satisfied by any subset st.

    Proof. Algorithm 2 works correctly. A formal proof can be established using induction over the number of rows in the input table t. When t contains no row or just one row, then Algorithm 2 returns uc(t) ={u(?)} - which is the unique minimal UC satisfied by t. Suppose we know that Algorithm 2 correctly computes the set of minimal UCs satisfied by table t with n rows, and that t' := t ∪ {r'}. Then, the new row r' is compared with every other row r in t. Whenever a UC u(X) is violated, it is replaced by the set of uniqueness constraints u(XH) where r' and r do not strongly agree on H. This ensures that table t' satisfies all the uniqueness constraints in uc(t'). Finally, lines 12-14 ensure that uc(t') does not contain any uniqueness constraints that are not minimal.

    For the complexity bound note that there are |t|2 pairs of rows to inspect. For each pair, uc(t) has at most mt elements, and for each element there are at most mt · |T|2 operations to perform. □

    EXAMPLE 7. Let t, t' be the data samples over SCHEDULE from Tables 2 and 3, respectively. The following table shows the evolution of the minimal UCs by gradually adding a new pair of rows until the entire table has been explored.

    image

    The UCs are those explained in the introductory section of this article. □

      >  B. Mining by Exploration of Hyper-Graph Transversals

    Again, our next algorithm uses the concept of hypergraph transversals. Indeed Algorithm 3 computes the complements of the strong agree sets for the given input table (lines 2 and 3). These complements are called weak disagree sets. Lines 4 and 5 compute the necessary weak disagree sets, that is, those weak agree sets that are minimal. For the hypergraph where the node set is T and the edge set consists of all necessary weak disagree sets, Algorithm 3 uses any (preferably the most efficient) algorithm to compute the set of all minimal transversals of the hypergraph. Lemma 2 shows that this set contains all minimal UCs satisfied by the input table t. Algorithm 3 is compact and benefits from any progress on the popular problem of computing minimal hypergraph transversals [34].

    The following lemma explains the soundness of Algorithm 3.

    LEMMA 2. Let t be a table over table schema T. Then, for all XT, XTr(T,nec-disagw(t)) if and only if |=t u(X) and for all HX, |≠t u(XH).

    Proof. We begin by showing that the two conditions are necessary for every XTr(T, nec-disagw(t)).

    First, if t violated u(X), then there were two rows r1, r2t such that r1r2 and X ⊆ ags(r1, r2). Hence,

    image

    and there would be some Ynec-disagw(t) such that Ydisagw(r1, r2). Thus,

    image

    which contradicts the hypothesis that XTr(T, nec-disagw(t)). We conclude that t satisfies u(X). Suppose there was some HX such that |=t u(XH). Then, it would follow that for all r1, r2t such that r1r2 it held that

    image

    This, however, would violate the minimality of XTr(T, nec-disagw(t)). Therefore, it holds that for all HX, |≠t u(XH).

    Now, we demonstrate that the two conditions are sufficient for X to be in Tr(T, nec-disagw(t)). From |=t u(X) follows that for all r1, r2t where r1r2 we have

    image

    From |≠t u(XH) for all HX it follows that for all HX there are some r1, r2t such that r1r2 and XHags(r1, r2). The last condition means that

    image

    holds. Consequently, for all HX there is some Znec-disagw(t) such that Zdisagw(r1, r2), and thus

    image

    We conclude that XTr(T, nec-disagw(t)). □

    THEOREM 6. On input (T, t), Algorithm 3 computes the set of minimal UCs satisfied by table t over T in time O(m + n) where m := |T|2 × |t|2 × |nec-disagw(t)| and n :=

    image

    Proof. The correctness of Algorithm 3 follows directly from the description of the algorithm and Lemma 2.

    The collection nec-disagw(t) can be computed in time O(m): there are |t|2 pairs of rows to consider, and for each new set in disagw(t) one can check with at most |T|2 × | necdisagw(t)| operations whether the set is also in nec-disagw(t). The computation of the set Tr(T, nec-disagw(t)) of minimal transversals can be done in time O(n) [34]. Note that the hypergraph (T, nec-disagw(t)) is simple, that is, there are no different sets X, Y in nec-disagw(t) where XY holds. □

    EXAMPLE 8. Let t, t' be the data samples over SCHEDULE from Tables 2 and 3, respectively.

    The next table shows the steps for applying Algorithm 3 to (SCHEDULE, t) and (SCHEDULE, t'), respectively.

    image

    The UCs are those explained in the introductory section of this article. □

    VI. INFORMATIVE ARMSTRONG TABLES

    Here, we combine the schema- and data-driven approach to the discovery of SQL keys, as given in Sections IV and V, respectively. The disadvantage of the schema-driven approach is that the Armstrong tables, when generated by our algorithm, contain only artificial values. It is therefore doubtful that such tables illustrate the current perceptions of the data engineers to the domain experts who inspect the table. Of course, the data engineers may substitute real domain values for the artificial values before they present the table to the domain experts, or they generate the Armstrong tables on the basis of some real domain values. However, the table may still not be a good representative of the real world application domain, since some of the combinations of the values may never occur in practice. We will now outline a solution to this problem, which overcomes this limitation. Here, the assumption that is necessary is that some real world data set t is available, for example, in the form of legacy data. Under this assumption, we can apply the data-driven algorithms from Section V to mine the set uc(t) of minimal uniqueness constraints (and NOT NULL constraints) satisfied by t. Subsequently, we apply the schema-driven algorithm from Section IV to compute an Armstrong table t' for uc(t) and the null-free subschema satisfied by t. The Armstrong table t' can be populated with suitable rows from t, that is, t' is a subset of the table t. Thus, the number of rows in t' will be much smaller than the number of rows in t. Therefore, t' constitutes an invaluable semantic sample of the data set t, semantic in the sense that it satisfies the same set of uniqueness and NOT NULL constraints as t.

    DEFINITION 4. Let t be a table over table schema T, and let C denote a class of constraints. A C-informative Armstrong table for t is a table t' over T such that i) t't and ii) for all constraints φ of C over T, t' satisfies φ if and only if t satisfies φ. □

    In the context of this paper, we are interested in informative Armstrong tables with respect to the class C of uniqueness and NOT NULL constraints. For an illustration of the concept we will now look at a small example

    The data sample treal in Table 4 shows some aspects of the Semester 2 timetable for 200 level courses in the Department of Computer Science, the University of Auckland, one month before the beginning of Semester 2 in 2012. Suppose we would like to compute an informative Armstrong table for treal with respect to the class of uniqueness and NOT NULL constraints. We may apply Algorithm 2 or Algorithm 3 to input (SCHEDULE, t), and compute the set uc(t) as

    {u(Time), u(C_ID,L_Name), u(Room,L_Name)}.

    It is not difficult to compute the maximal null-free subschema nfs(Ts) satisfied by treal. We would simply start with T and gradually inspect row by row in treal. For each row r we remove any column HTs from Ts whenever r(H) = ni. For the data sample treal from Table 4 we obtain Ts = {C_ID, Time}. Using the algorithms from Section IV we compute the set uc(t)-1 of anti-keys as

    {a(L_Name), a(C_ID,Room)}.

    Now, finding rows in treal whose strong agree sets are these anti-keys, and which violate any null-free subschema violated by treal still remains. One such table tinf is shown in Table 5. The first two rows have a strong agree set {C_ID,Room}, the first and third row have strong agree {L_Name}, and the fourth row ensures that t' is total exactly where treal is. The example illustrates the potential savings in terms of size that informative Armstrong tables have over the original data samples. For original data samples that are large, the informative Armstrong tables will be considerably smaller. For example, in the case of relational databases, De Marchi and Petit [22] have reported that their informative Armstrong database contained only 0.6% of the tuples in the original database. One can imagine that the inspection for semantic samples of smaller size is much more effective than the inspection of the original data set.

    VII. EMPIRICAL MEASURES OF USEFULNESS

    In this section we describe how the usefulness of applying Armstrong tables for the discovery of semantically meaningful SQL keys can be measured. For this purpose, we will first introduce different measures of usefulness, and then describe a detailed example illustrating how the marking and feedback for non-multiple choice questions in database courses can be automated.

      >  A. Soundness, Completeness, and Proximity

    Measuring the usefulness of applying Armstrong tables for the discovery of semantically meaningful SQL keys appears to be non-trivial. However, one may conduct a two-phase experiment where database design teams are given an application domain and are asked to specify the set of UCs they perceive as semantically meaningful. In the first phase, they are not given any help, except natural language descriptions by domain experts. In the second phase, our algorithm can be used to produce Armstrong tables for the set of UCs the teams perceive currently as semantically meaningful. The Armstrong tables may be inspected together with the domain experts, and when new UCs are identified a corresponding Armstrong table may be repeatedly produced. For an experiment or assignment, one may specify a target set Σt and possibly a target NFS nfs(Tts). One may then measure the quality of the output sets (Σ1, Ts1) of the first phase against the target sets (Σt, Tts), and the output sets (Σ2, Ts2) of the second phase against the target sets (Σt, Tts). If there is an increase in quality, then Armstrong tables, indeed, appear to be useful. The question remains how to measure the quality of the output sets against the target sets. For this purpose we propose three measures. Let min(Σ) denote the UCs in Σ that are minimal. Soundness measures, which of the (minimal) UCs and headers currently perceived as semantically meaningful, are actually semantically meaningful:

    image

    If

    image

    and

    image

    we define

    image

    Completeness measures, which of the actually semanti-

    cally meaningful (minimal) UCs and headers in NFSs, are currently perceived as semantically meaningful:

    image

    If

    image

    and

    image

    we define

    image

    Finally, proximity prox((Σ, Ts), (Σt, Tts)) combines soundness and completeness and is defined as:

    image

    If

    image

    and

    image

    we define prox((Σ,Ts),

    image

    In database courses one may use Armstrong tables as automated feedback to solutions. Our measures can be applied to automatically mark non-multiple choice questions. This can reduce errors and save time in assessing course work.

      >  B. A Detailed Example

    Again, we will use our running example where SCHEDULE consists of the four attributes C(_ID), L(_Name), R(oom), and T(ime). Suppose we describe, in natural language, the application domain to the students of a database course. Then, we ask them to identify the uniqueness and NOT NULL constraints that they perceive to be semantically meaningful for this application domain. The students may ask questions in natural language to clarify their perceptions about the application domain. The lecturers of the course act as domain experts who will provide answers to questions in natural language that are consistent with the target set Σt = {u(CT), u(LT), u(RT)},

    image

    over the table schema SCHEDULE.

    Suppose one group of students returns as an answer to the question the set Σ = {u(CT), u(LRT), u(CLR)} and Ts =LRT. One can then automatically compute

    image
    image

    and

    image

    Furthermore, one may also return an Armstrong table for Σ and nfs(Ts), for example, the one in Table 6.

    The students inspect the table and may make several observations. For example, Fagin teaches 55505 and Gottlob teaches 77707 at the same time in the same room. The students decide to ask the domain expert whether different courses can be taught in the same room at the same time. The domain experts indicate that this is impossible, and the students decide to replace the UC u(LRT) by the UC u(RT) in response. In addition, the students notice that Gottlob teaches in different rooms at the same time. As this is impossible, they decide to specify the UC u(LT). They now submit their revised constraint set Σ' = {u(CT), u(RT), u(LT), u(CLR)} and Ts = LRT. Again, one can then automatically compute

    image
    image

    and

    image

    Hence, inspecting the Armstrong table results in an improvement of 1/14 in soundness, 2/5 in completeness, and 1/6 in proximity.

    VIII. CONCLUSION AND FUTURE WORK

    We investigated the data- and schema-driven discovery of SQL keys. We established insights into structural and computational properties of Armstrong tables. These can increase the discovery of semantically meaningful SQL keys in practice, leading to better schema designs and improved data processing. In addition, we also established algorithms to automatically discover SQL keys in given SQL tables. These have applications in requirement acquisition, schema design and query optimization. We combined the data- and schema-driven approaches to compute informative Armstrong tables, which are effective semantic samples of given data sets. Moreover, we defined formal measures to assess the difference between sets of SQL keys. These can be applied to validate the usefulness of Armstrong tables and to database education. Our findings also apply to Codd’s null marker interpretation value exists but unknown, using Levene and Loizou’s possible world semantics [38]. In the future we plan to implement our results in a design aid, to test our measures in applications, and to address the class of foreign keys.

  • 1. Hartmann S., Leck U., Link S. 2011 ‘On Codd families of keys over incomplete relations” [The Computer Journal] Vol.54 P.1166-1180 google
  • 2. Thalheim B. 1989 “On semantic issues connected with keys in relational databases permitting null values” [Journal of Information Processing and Cybernetics] Vol.25 P.11-20 google
  • 3. Thalheim B. 1992 “The number of keys in relational and nested relational databases” [Discrete Applied Mathematics] Vol.40 P.265-282 google
  • 4. Weddell G. E. 1992 “Reasoning about functional dependencies generalized for semantic data models” [ACM Transactions on Database Systems] Vol.17 P.32-64 google
  • 5. Khizder V. L., Weddell G. E. 2003 “Reasoning about uniqueness constraints in object relational databases” [IEEE Transactions on Knowledge and Data Engineering] Vol.15 P.1295-1306 google
  • 6. Toman D., Weddell G. E. 2008 “On keys and functional dependencies as first-class citizens in description logics” [Journal of Automated Reasoning] Vol.40 P.117-132 google
  • 7. Buneman P., Davidson S., Fan W., Hara C., Tan W. C. 2002 “Keys for XML” [Computer Networks] Vol.39 P.473-487 google
  • 8. Hartmann S., Link S. 2009 “Efficient reasoning about a robust XML key fragment” [ACM Transactions on Database Systems] Vol.34 google
  • 9. S. Hartmann, S. Link 2010 “Numerical constraints on XML data” [Information and Computation] Vol.208 P.521-544 google
  • 10. Lausen G., Christophides V. 2008 “Relational databases in RDF: keys and foreign keys” Semantic Web, Ontologies and Databases, Lecture Notes in Computer Science vol. 5005 P.43-56 google
  • 11. Grau B. C., Horrocks I., Motik B., Parsia B., Patel-Schneider P., Sattler U. 2008 “OWL 2: the next step for OWL” [Web Semantics] Vol.6 P.309-322 google
  • 12. Sismanis Y., Brown P., Haas P. J., Reinwald B. 2006 “GORDIAN: efficient and scalable discovery of composite keys” [Proceedings of the 32nd International Conference on Very Large Data Bases] P.691-702 google
  • 13. Demetrovics J. 1978 On the number of candidate keys” [Information Processing Letters] Vol.7 P.266-269 google
  • 14. CA ERwin data modeler: methods guide r7.3 google
  • 15. Abiteboul S., Hull R., Vianu V. 1995 Foundation of Database, Reading google
  • 16. Fagin R. 1982 “Armstrong databases” google
  • 17. Thalheim B. 1991 Dependencies in Relational Databases google
  • 18. Langeveldt W. D., Link S. 2010 “Empirical evidence for the usefulness of Armstrong relations in the acquisition of meaningful functional dependencies” [Information Systems] Vol.35 P.352-374 google
  • 19. Link S. 2012 “Armstrong databases: validation, communication and consolidation of conceptual models with perfect test data” [Proceedings of the 9th Asia-Pacific Conference on Conceptual Modelling] P.3-19 google
  • 20. Alexe B., Cate B. T., Kolaitis P. G., Tan W. C. 2011 “Characterizing schema mappings via data examples” [ACM Transactions on Database Systems] Vol.36 P.4 google
  • 21. Mannil H., Raiha K. J. 1986 “Design by example: an application of Armstrong relations” [Journal of Computer and System Sciences] Vol.33 P.126-141 google
  • 22. de Marchi F., Petit J. M. 2007 “Semantic sampling of existing databases through informative Armstrong databases” [Information Systems] Vol.32 P.446-457 google
  • 23. Demetrovics J. 1979 “On the equivalence of candidate keys with Sperner systems” [Acta Cybernetica] Vol.4 P.247-252 google
  • 24. Beeri C., Dowd M., Fagin R., Statman R. 1984 “On the structure of Armstrong relations for functional dependencies” [Journal of the ACM] Vol.31 P.30-46 google
  • 25. Huhtala Y., Karkkainen J., Porkka P., Toivonen H. 1999 “TANE: an efficient algorithm for discovering functional and approximate dependencies” [The Computer Journal] Vol.42 P.100-111 google
  • 26. Mannila H., Raiha K. J. 1994 “Algorithms for inferring functional dependencies from relations” [Data and Knowledge Engineering] Vol.12 P.83-99 google
  • 27. Codd E. F. 1970 “A relational model of data for large shared data banks” [Communications of the ACM] Vol.13 P.377-387 google
  • 28. Imielinski T., Lipski Jr W. 1984 “Incomplete information in relational databases” [Journal of the ACM] Vol.31 P.761-791 google
  • 29. Zaniolo C. 1984 “Database relations with null values” [Journal of Computer and System Sciences] Vol.28 P.142-166 google
  • 30. Atzeni P., Morfuni N. M. 1986 “Functional dependencies and constraints on null values in database relations” [Information and Control] Vol.70 P.1-31 google
  • 31. Hartmann S., Kirchberg M., Link S. 2012 ‘Design by example for SQL table definitions with functional dependencies” [The VLDB Journal] Vol.21 P.121-144 google
  • 32. Le V. B. T., Link S., Memari M., Lee S. 2012 “Discovery of keys from SQL tables”, Database Systems for Advanced Applications, Lecture Notes in Computer Science vol. 7238 P.48-62 google
  • 33. Date C. J. 2012 Database Design and Relational Theory: Normal Forms and All That Jazz google
  • 34. Eiter T., Gottlob G. 1995 “Identifying the minimal transversals of a hypergraph and related problems” [SIAM Journal on Computing] Vol.24 P.1278-1304 google
  • 35. Demetrovics J., Thi V. D. 1987 “Keys, antikeys and prime attributes” [Annales Universitatis Scientiarum Budapestinensis de Rolando Eotvos Nominatae Sectio Computatorica] Vol.8 P.35-52 google
  • 36. Hartmann S., Link S. 2010 “When data dependencies over SQL tables meet the logics of paradox and S-3” [Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems] P.317-326 google
  • 37. Hartmann S., Link S. 2012 “The implication problem of data dependencies over SQL table definitions: axiomatic, algorithmic and logical characterizations,” [ACM Transactions on Database Systems] Vol.37 google
  • 38. Levene M., Loizou G. 1998 “Axiomatisation of functional dependencies in incomplete relations” [Theoretical Computer Science] Vol.206 P.283-300 google
  • [Table 1.] An Armstrong table for SCHEDULE
    An Armstrong table for SCHEDULE
  • [Table 2.] Another Armstrong table
    Another Armstrong table
  • [Table 3.] Updated table
    Updated table
  • [Algorithm1] Algorithm 1 Armstrong table computation
    Algorithm 1 Armstrong table computation
  • [Algorithm2] Algorithm 2 Mining of uniqueness constraints by pairwise row comparisons
    Algorithm 2 Mining of uniqueness constraints by pairwise row comparisons
  • [Algorithm3] Algorithm 3 Mining of uniqueness constraints by exploring hypergraph transversals
    Algorithm 3 Mining of uniqueness constraints by exploring hypergraph transversals
  • [Table 4.] Real-world table treal
    Real-world table treal
  • [Table 5.] Informative Armstrong table tinf for treal
    Informative Armstrong table tinf for treal
  • [Table 6.] Armstrong table as feedback
    Armstrong table as feedback