Data Modeling

From Systems Analysis and Design Methods by Whitten, Bentley, and Barlow

Data modeling
A technique for organizing and documenting a system's data, usually implemented as a database.
Entity-relationship diagram (E-R diagram)
A data modeling tool that graphically illustrates the associations among different categories of data within an information system. It is a logical representation in that it does not imply anything about an implementation.
Entity
A data entity is a person, place, thing, or event that describes something about the information system being modeled. A data entity will always have data associated with it that requires storing and processing.
Relationship
A data relationship is an association that exists between one or more entities. Relationships can be thought of as activities or events that link one or more entities.
Attributes
Attributes are characteristics that are common to all or most instances of a particular entity.
Element, Data item, Field
A name given to the attribute, within the computer system, that represents the attribute.
File
A set of entities composed of one or more fields for each entity.
Identifier or Key
An attribute or combination of attributes of an entity that uniquely identify a particular instance of the entity.
Candidate key
Any key that identifies an entity, such as student name or student number.
Primary key
The candidate key chosen to represent the entity.
Secondary key
An alternate candidate key that may or may not be unique, used to represent an entity.
Foreign key
A field that is an attribute of one entity set while at the same time being a candidate key of a different entity set.
Relation (or Table or File, see File above)
A set of all entities of tuples (records) arranged in a tabular form.
Tuple (or Record)
An ordered collection of one or more data elements (attributes) that make up a record. A row within a relation.
Cardinality
Defines the min and max number of occurrences of one entity for a single instance of the related entity (defining their relationship).

What is needed is to produce normalized relations and an E-R diagram to show the relationships between the relations. Two steps are necessary in order to produce normalized relations: first group the relations, then normalize them. We have spent a fair amount of time in class looking at how to identify entities (or objects). They are usually people, places, data stores, or events. We are actually pretty good at making a first stab at this. The rest of the entities usually emerge during the normalization process.

From Data Analysis: The Key to Data Base Design by Richard Perkinson

Normalization was introduced by E.F. Codd. It is a technique for decomposing data structures into smaller structures in which every attribute is completely dependent on the primary key of the entity in which it resides. Normalization depends on a knowledge and understanding of data in the functional business unit being examined and the way it relates together.

The benefits of reducing data to small stable structures through normalization can be quite substantial. Gane and Sarson refer to the process as "inspired common sense". Consider the following typical file maintenance problem.

Maintenance involves both minor tasks, such as expanding the length of a field and more complex tasks such as adding new fields with increased functional complexity. For the sake of argument, let us say a purchase order includes a product warehouse location for each item ordered so that the order serves as a packing slip. What happens if this location changes? All orders for that product would have to be found and updated. This could be a trivial problem, or it could prove to be a significant one. What is the change involved several warehouses? The data field for warehouse location would have to be expanded. This would necessitate locating and updating every program that used the order file, even for programs that do not concern product location.

Certainly the impact of changes like that can be minimized by putting filler in files to accommodate larger record formats. But the point is this: a product location change should affect only those programs that utilize the data in question - not every program that uses the file. Instead, if product numbers were the only product-related attributes retained on the order file, that key could be used to find product location from another product file or table.

In fact, you may have intuitively decided to do just that - separate product information onto a separate file. In doing so, you would have been intuitively normalizing the order entry. Through normalization we minimize the impact of future change on data. It minimizes both program maintenance and data maintenance.

1st Rule of Normalization

No repeating groups. Repeating groups are removed into a new entity, such as adding a new COURSE entity so that multiple courses will not be part of the STUDENT entity. An entity )or relation) with no repeating groups is said to be in 1st Normal Form.

2nd Rule of Normalization

A relation is in 2nd normal form if it is in the 1st normal form and every nonkey attribute is fully functionally dependent on the whole key (which might be a compound key). Attributes that are wholly dependent on only part of the primary key or primary compound key are removed into a separate entity.

3rd Rule of Normalization

A relation is in 3rd normal form if it is in 2nd normal form and each nonkey attribute is functionally dependent on the primary key and not dependent on any nonkey attribute. Attributes wholly dependent upon another key within an entity should be removed into another entity.