(PART 1)Use what you learned in the Template to create a final summary for your project sponsors.Revise your process and approach, as appropriate.Include the following in your summary:The consideratio
Introduction
There are certain discrepancies that may arise such as duplication of data. We will identify each table with a primary key which ought to be unique. A primary key is the candidate key that is selected to identify things uniquely within the relation. We will also employ the use of foreign keys. A foreign key is an attribute within one relation that matches the primary key of some relation. In some instances, it could be the same relation.
We will also use foreign key with cascade delete to enforce data quality. This means that, if a record in the parent table is deleted then the corresponding record in the child table should also be deleted automatically.
We will also use data types whereby a user cannot enter something that is contrary to the data type. For example, the attribute AmountOfLoans from table FFEL ought to be a currency, so we need to use a data type that supports currency such as DECIMAL(15,2).
DLGradPlus
PK DLGradPlusId
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
AmountOfDisbursement
FK DLId
DLParentPlus
PK DLParentPlusId
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
AmountOfDisbursement
FK DLId
FFEL
PK FFELId
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
FFELLoanType
FK LoandId
DirectLoanUnsubsidized
PK DLUnsubsidizedId
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
AmountOfDisbursement
FK DLId
DirectLoanSubsidized
PK DLSubsidizedId
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
AmountOfDisbursement
FK DLId
Grants
PK GrantId
GrantProgram
FK SchoolCode
FederalPellGrant
PK FederalPellfrantId
SumOfRecepients
SumOfDisbursement
FK GrantId
TeachGrant
PK TeachGrantId
SumOfRecipents
SumOfDisbursements
FK GrantId
IraqAfghanGrant
PK IraqAfghanGrantId
SumOfRecepients
SumOfDisbursements
FK GrantId
GainfulEmployment
PK GEId
CIPCode
CIPName
CredentialLevel
RepaymentRate
RepaymentNumerator
RepaymentDenominator
DirectLoans
PK DLId
LoanType
FK LoanId
Loans
PK LoanId
LoanType
FK SchoolCode
School
PK SchoolCode
SchoolName
Address
City
State
Zip
Province
Country
PostCode
SchoolType
Physical database
School (SchoolCode PK, SchoolName,Address, City, State, Zip, Province, Country, PostCode, SchoolType)
Loans (LoanId PK, LoanType, SchoolCode FK)
DirectLoans (DLId PK, LoanType, LoanId FK)
DirectLoanSubsidized (DLSubsidizedId PK, Recipients, NoOfLoans, AmountOfLoans, NoOfDisbursement, AmountOfDisbursement, DLId FK)
GainfulEmployement (GEId PK, CIPCode, CIPName, CredentialLevel, RepaymentRate, RepaymentNumerator, RepaymentDenominator, SchoolCode FK)
DirectLoanUnsubsidized (DLUnsubsidizedId PK, Recipients, NoOfLoans, AmountOfLoans, NoOfDisbursement, AmountOfDisbursement, DLId FK)
DLParentPlus (DLParentPlusId PK, Recipients, NoOfLoans, AmountOfLoans, NoOfDisbursement, DLId FK)
DLGradPlus (DLGradPlusId PK, Recipients, NoOfLoans, AmountOfLoans, NoOfDisbursement, AmountOfDisbursement, DLId FK)
Grants (GrantId PK, GrantProgram, SchoolCode FK)
FFEL (FFELId PK, Recipients, NoOfLoans, AmountOfLoans, NoOfDisbursement, FFELLoanType, LoanId FK)
FederalPellGrant (FederalPellGrantId PK, SumOfRecipients, SumOfDisbursement, GrantId FK)
TeachGrant (TeachGrantId PK, SumOfRecipients, SumOfDisbursements, GrantId FK)
IraqAfghanGrant (IraqAfghanGrantId PK, SumOfRecipients, SumOfDisbursements, GrantId FK)
The bold attributes with PK means it is the Primary Key and the attributes with FK means it is the Foreign Key attribute.
SQL Queries
1. Creating the School table
CREATE TABLE `School` (
`SchoolCode` VARCHAR (10) NOT NULL,
`SchoolName` VARCHAR (255) NOT NULL,
`Address` VARCHAR (255) NOT NULL,
`City` VARCHAR (255) NOT NULL,
`State` VARCHAR (5) NOT NULL,
`Zip` VARCHAR (10) NOT NULL,
`Province` VARCHAR (50) NOT NULL,
`Country` VARCHAR (100) NOT NULL,
`PostCode` VARCHAR (20) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
2. Showing School table data
SELECT * FROM School;
3. Which schools generate the highest amount of debt for the average student who attends?
SELECT s.SchoolName, MAX(l.NoOfDisbursements) AS Debt FROM school s, DirectLoanUnsubsidized l GROUP BY s.SchoolName;
4. Employment rate for the students who graduate from these schools?
SELECT s.SchoolName, AVG(g.RepaymentRate) AS EmploymentRate FROM School s, GainfulEmployment g WHERE s.SchoolCode = g.SchoolCode GROUP BY s.SchoolName;
Best Practices
I have used primary keys to ensure that there are no duplicate records. In this project, I have also used foreign keys so as to ensure relationship among the tables. The use of aliases in creating calculated fields has been performed. I have ensured that you cannot put an empty data by using the SQL NOT NULL key word in attributes.
Before we look at the integration plans, let us first define data warehouse and look at the process of transforming our data schema to a data warehouse schema.
Data warehouse
A data warehouse is a subject-oriented, integrate, time-variant, and nonvolatile collection of data in support of management’s decision-making process. Data warehouse is seen as the storage of huge volumes of data that supports the organization in decision making and complex analysis of the data.
Star schema
We will use the star schema to transform our data schema to a data warehouse schema. A star schema is a data model that has a fact table in the center, surrounded by denormalized tables. From our data warehouse schema we can visualize that we have one fact table known as “USAidFact_Table” and 13 dimension tables. The dimension tables include: “School”, “Loans”, “DirectLoans”, “DirectLoanSubsidized”, “FederalPellGrant”, “TeachGrant”, “IraqAfghanGrant”, “FFEL”, “Grants”, “DLGradPlus”, “DLParentPlus”, “DirectLoanUnsubsidized”, “GainfulEmployment”.
Integration plans
In order for us to integrate our data warehouse with the internet application, we need to incorporate certain aspects such as:
Data warehouse database management system
Data warehouse metadata
Business intelligence tools such as OLAP and Data mining
Network infrastructure
Data warehouse database management
We need to integrate a data warehouse DBMS in our data warehouse. The DBMS will support load performance, load processing, scalability when data volume increases and data quality management.
Data warehouse metadata
We will have to integrate a data warehouse metadata. The metadata shows the history of any item searched since it gives the pathway back to where the data began. We also need to integrate data mart.
OLAP and Data mining
We will integrate OLAP and Data Mining in our data warehouse. Online analytical processing (OLAP) is the dynamic analysis and consolidation of large volumes of data. Data mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make business decisions. We will use OLAP and data mining for analysis of the data and use it for decision making within the organization.
Network infrastructure
The data warehouse cannot run without the network infrastructure. Therefore, we need to have these devices such as data mart etc. interconnected to one another so that they can communicate to each other.
USAidFact_Table
SchoolId
LoanId
DLId
DLSubsidizedId
GEId
DLUnsubsidizedId
DLParentPlusId
GradPlusId
GrantId
FFELId
IraqAfghanGrantId
TeachGrantId
FederalPellGrantId
Grants
GrantId {PK}
GrantProgram
FFEL
FFELId {PK}
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
FFELLoanType
Loans
LoanId {PK}
LoanType
DirectLoans
DLId {PK}
LoanType
DirectLoanSubsidized
DLSubsidizedId {PK}
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
AmountOfDisbursement
IraqAfghanGrant
IraqAfghanGrantId {PK}
SumOfRecepients
SumOfDisbursement
TeachGrant
TeachGrantId {PK}
SumOfRecepients
SumOfDisbursement
FederalPellGrant
FederalPellGrantId {PK}
SumOfRecepients
SumOfDisbursement
DLGradPlus
DLGradPlusId {PK}
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
AmountOfDisbursement
DLParentPlus
DLParentPlusId {PK}
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
AmountOfDisbursement
DirectLoanUnsubsidized
DLUnsubsidizedId {PK}
Recepients
NoOfLoans
AmountOfLoans
NoOfDisbursement
GainfulEmployment
GEId {PK}
CIPCode
CIPName
CredentialLevel
RepaymentRate
RepaymentNumerator
RepaymentDenominator
School
SchoolId {PK}
SchoolCode
Address
City
State
Zip
Province
Country
PostCode
SchoolType
This paper looks at the process of evaluation of a data warehouse for data incompleteness, nulls and the ability to provide consistent query data. It will also look at the data evaluation strategy, data evaluation queries and results per query. This paper will finish by looking at the summary of findings.
Evaluation strategies
Before looking at the evaluation strategies of a data warehouse, let us first define data warehouse. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data. A data warehouse can be described in the following areas: capacity, loading and indexing performance, operational integrity, reliability and manageability, client/ server connectivity and query processing performance.
Capacity
Capacity is all about space. Since we will be saving huge volumes of data, we need to have a memory that can support that. We intend to have a memory capacity of half a petabyte.
Loading and indexing performance
We will evaluate our data warehouse to see the loading and indexing performance. A data warehouse deals with frequent loading of data within smaller than usual time windows. The loading of data needs to have a high performance this way it doesn’t disrupt the system. We will also evaluate the indexing performance of our data warehouse and this deals with how well the data is indexed and how quickly record is found.
Operational integrity, reliability, and manageability
This section deals with the availability of the data. The data needs to be available all the time (24 * 7). We will also evaluate whether the data is reliable. Reliability is the transaction atomicity and durability. A transaction should either commit or rollback in case of a failure. The data warehouse should also recover from failure. The reliability, operational integrity and recovering from failure is provided by the database management system. The database management system also ensures that you cannot put null values and provides format of data e.g. the date data type has the format YYYY-MM-DD therefore ensuring that you cannot put incomplete data.
Client / server connectivity
This section deals with the evaluation of the network connectivity and speed since our data will be accessed over the network.
Query processing performance
This section looks at how the data warehouse DBMS handles complex queries. We will evaluate our data warehouse and see how fast it returns results of complex queries.
SQL Queries
SELECT schoolid, institutionname, SUM (loans) FROM school, gainfulemployment WHERE school.schoolid = gainfulemployment.schoolid GROUP BY ROLLUP(schoolid, institutionname);
Summary of findings
These are the findings that we came up with after evaluating our data warehouse:
We need to increase the storage capacity of our data warehouse to half a petabyte to facilitate increase in volume of data.
We need to install a high-speed network that will facilitate fast client / server communication.
Sometimes the data is not available due to slow network.
The data warehouse database management system is responsible for other things such as query processing, reliability, recovery and security of the data warehouse.