Final product
Your final product will be a pandas DataFrame of college names, sorted in descending order by the percent difference between in-state and out-of-state tuition.
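As a rough illustration of the target shape, here is a minimal sketch on toy data. The column names (`college`, `in_state`, `out_of_state`) are hypothetical stand-ins rather than the real data file labels, and the percent difference is assumed to be measured relative to in-state tuition, which the assignment text does not spell out.

```python
import pandas as pd

# Toy data; names and values are invented for illustration only.
toy = pd.DataFrame({
    "college": ["A", "B", "C"],
    "in_state": [10000.0, 8000.0, 20000.0],
    "out_of_state": [15000.0, 20000.0, 20000.0],
})

# Assumed definition: (out-of-state - in-state) / in-state, as a percentage.
ranked = (
    toy.assign(pct_diff=lambda d: (d["out_of_state"] - d["in_state"]) / d["in_state"] * 100)
       .sort_values("pct_diff", ascending=False)  # largest difference first
       .reset_index(drop=True)
)
```

If a different baseline is intended (for example, the mean of the two tuitions), only the expression inside `assign` needs to change.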
Raw data sources
Your "raw" data sources are both from https://collegescorecard.ed.gov, with some modifications to optimize the data for this midterm:
- a copy of the "Most-Recent-Cohorts-All-Data-Elements.csv" downloaded 2019-03-05T21:30Z
- a slightly massaged version of the "CollegeScorecardDataDictionary.xlsx" that has been converted to a tab-delimited text file and reshaped
Most-Recent-Cohorts-All-Data-Elements.csv
This ~140MB file has the data you need to produce your final product.
Unfortunately, it has
- a lot MORE than you need,
- column labels that are not human-readable, and
- a lot of missing data in the form of either "NULL" entries or "PrivacySuppressed" entries
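One way to handle the extra columns and the two missing-data markers at load time is sketched below. It uses a tiny in-memory stand-in for the real file; the column labels `ID`, `NAME`, `COST`, and `EXTRA` are invented for the example and are not the actual data file labels.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; labels and values are invented.
raw = io.StringIO(
    "ID,NAME,COST,EXTRA\n"
    "1,Alpha,1000,x\n"
    "2,Beta,NULL,y\n"
    "3,Gamma,PrivacySuppressed,z\n"
)

small = pd.read_csv(
    raw,
    usecols=["ID", "NAME", "COST"],           # load only the columns needed
    na_values=["NULL", "PrivacySuppressed"],  # treat both markers as missing
)
```

Because both markers are parsed as NaN, the `COST` column comes back as a float column instead of a column of mixed strings.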
CollegeScorecardDataDictionary_wide.txt
To help you decipher the contents of the data file, the "Data Dictionary" is provided. This is NOT a python dictionary. Rather, it is a tab-delimited table that describes the column labels in the data file, as well as a "decoder ring" for the numeric codes used in some of the data columns.
Unfortunately, it has
- a lot MORE information than you need, and
- "decoder ring" information stored in a very inconvenient and inefficient "wide" format
import pandas as pd

def tab_import(fname):
    # Read a tab-delimited text file into a DataFrame.
    df = pd.read_csv(fname, sep='\t')
    return df
#TEST
dd_df = tab_import('../resource/lib/publicdata/m2p1/CollegeScorecardDataDictionary_wide.txt')
Write a function that
- Adds a column to the df called 'Coded' that contains Boolean values: False if the columns labeled '-2' through '107' are all NaN, and True otherwise
- Splits the df into two new dataframes, one containing the rows that are 'Coded' and one containing the rows that are not
- Removes the 'Coded' column from each new dataframe, since it is no longer required
def split_by_coded(full_df):
    ###
    ### YOUR CODE HERE
    ###
    return coded_df, not_coded_df
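The splitting steps listed above can be sketched as follows. This is one possible approach, not necessarily the intended solution: the name `split_by_coded_sketch` and the toy table are hypothetical, and the sketch assumes the code columns sit contiguously between the labels '-2' and '107' so that a label slice selects them.

```python
import numpy as np
import pandas as pd

def split_by_coded_sketch(full_df, first="-2", last="107"):
    # True where at least one code column in the first..last label range is not NaN.
    coded = full_df.loc[:, first:last].notna().any(axis=1)
    flagged = full_df.assign(Coded=coded)
    # Split on the flag, then drop it since it is no longer required.
    coded_df = flagged[flagged["Coded"]].drop(columns="Coded")
    not_coded_df = flagged[~flagged["Coded"]].drop(columns="Coded")
    return coded_df, not_coded_df

# Toy table with invented values; only row "B" carries a code description.
toy = pd.DataFrame({
    "VARIABLE NAME": ["A", "B", "C"],
    "-2": [np.nan, "none", np.nan],
    "107": [np.nan, np.nan, np.nan],
})
coded_toy, not_coded_toy = split_by_coded_sketch(toy)
```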
Write a function that
- Melts the coded dataframe columns '-2' to '107' into a column 'Code' containing the labels and a column 'Description' containing the code description
- Removes all the rows in which 'Description' is NaN from the melted coded dataframe, since these rows add no information
- Deletes the columns '-2' to '107' from the not_coded dataframe and adds two columns, labeled 'Code' and 'Description', populated with NaN (use np.nan), since this dataframe has no codes to describe. The pandas.DataFrame.assign() method is recommended here: it returns a new dataframe with the added columns, which suits method chaining. See https://stackoverflow.com/questions/48177914/why-use-pandas-assign-rather-than-simply-initialize-new-column
- Concatenates the two new dataframes into a single dataframe
- The columns should be ordered as NAME OF DATA ELEMENT, dev-category, developer-friendly name, API data type, VARIABLE NAME, Code, Description.
def melt_together(coded_df, not_coded_df):
    ###
    ### YOUR CODE HERE
    ###
    return melted_df
#TEST
c_df, n_c_df = split_by_coded(dd_df)
m_df = melt_together(c_df, n_c_df)
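The melt-and-concatenate steps described for melt_together can be sketched on toy data as shown here. This is a hedged illustration rather than the reference solution: `melt_together_sketch`, `ID_COLS`, and all toy values are hypothetical, the code columns are again assumed to be contiguous from '-2' to '107', and resetting the index on concatenation is a choice the assignment does not specify.

```python
import numpy as np
import pandas as pd

ID_COLS = ["NAME OF DATA ELEMENT", "dev-category", "developer-friendly name",
           "API data type", "VARIABLE NAME"]

def melt_together_sketch(coded_df, not_coded_df, first="-2", last="107"):
    code_cols = list(coded_df.loc[:, first:last].columns)
    # Wide to long: one row per (variable, code), keeping only described codes.
    melted = (
        coded_df.melt(id_vars=ID_COLS, value_vars=code_cols,
                      var_name="Code", value_name="Description")
                .dropna(subset=["Description"])
    )
    # Uncoded rows: drop the code columns and add NaN placeholders via assign().
    plain = (
        not_coded_df.drop(columns=code_cols)
                    .assign(Code=np.nan, Description=np.nan)
    )
    combined = pd.concat([melted, plain], ignore_index=True)
    return combined[ID_COLS + ["Code", "Description"]]

# Toy dictionary with invented values: one coded row, one uncoded row.
toy = pd.DataFrame({
    "NAME OF DATA ELEMENT": ["Flag", "Name"],
    "dev-category": ["school", "school"],
    "developer-friendly name": ["flag", "name"],
    "API data type": ["integer", "string"],
    "VARIABLE NAME": ["FLAG", "NAME1"],
    "-2": ["Not applicable", np.nan],
    "0": ["No", np.nan],
    "107": [np.nan, np.nan],
})
result = melt_together_sketch(toy.iloc[[0]], toy.iloc[[1]])
```

The coded row expands into one row per described code, while the uncoded row passes through with NaN in both new columns.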