Final product

Your final product will be a pandas DataFrame of college names, sorted in descending order by the percent difference between in-state and out-of-state tuition.
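
To make the target concrete, here is a minimal sketch of the sorting step on a toy table. The college names, the tuition figures, and the column labels are invented for illustration, and the percent-difference formula (difference relative to in-state tuition) is an assumption, since the exact definition is not stated here.

```python
import pandas as pd

# Toy data; names, figures, and column labels are invented for illustration.
toy = pd.DataFrame({
    "college": ["B University", "A College", "C Institute"],
    "in_state": [8000.0, 10000.0, 20000.0],
    "out_of_state": [10000.0, 25000.0, 20000.0],
})

# Assumed definition: (out-of-state - in-state) / in-state, as a percentage.
toy["pct_diff"] = (toy["out_of_state"] - toy["in_state"]) / toy["in_state"] * 100

# Largest percent difference first.
ranked = toy.sort_values("pct_diff", ascending=False).reset_index(drop=True)
print(ranked[["college", "pct_diff"]])
```

If your definition of percent difference uses a different denominator, only the line that computes pct_diff needs to change.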

Raw data sources

Your "raw" data sources are both from https://collegescorecard.ed.gov, with some modifications to optimize the data for this midterm:

a copy of the "Most-Recent-Cohorts-All-Data-Elements.csv" downloaded 2019-03-05T21:30Z

a slightly massaged version of the "CollegeScorecardDataDictionary.xlsx" that has been converted to a tab-delimited text file and reshaped

The ~140MB data file (the copy of "Most-Recent-Cohorts-All-Data-Elements.csv") has the data you need to produce your final product.

Unfortunately, it has:

a lot MORE than you need,
column labels that are not human-readable, and
a lot of missing data in the form of either "NULL" entries or "PrivacySuppressed" entries.
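
One possible way to cope with the extra columns and the placeholder strings is to let pandas.read_csv select columns and convert the placeholders to NaN while loading. The sketch below runs on a tiny in-memory stand-in for the real file; the column labels INSTNM, TUITIONFEE_IN, and TUITIONFEE_OUT are assumptions about which fields you will want, and the rows are invented.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; the rows are invented for illustration.
sample_csv = io.StringIO(
    "INSTNM,TUITIONFEE_IN,TUITIONFEE_OUT,EXTRA\n"
    "A College,10000,25000,x\n"
    "B University,NULL,PrivacySuppressed,y\n"
)

# Assumed column labels; check the data dictionary for the ones you need.
wanted = ["INSTNM", "TUITIONFEE_IN", "TUITIONFEE_OUT"]

raw_df = pd.read_csv(
    sample_csv,
    usecols=wanted,                           # skip columns you do not need
    na_values=["NULL", "PrivacySuppressed"],  # treat both markers as missing
)
print(raw_df)
```

Because the placeholders become NaN at load time, the tuition columns are parsed as numbers rather than strings.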

CollegeScorecardDataDictionary_wide.txt

To help you decipher the contents of the data file, the "Data Dictionary" is provided. This is NOT a Python dictionary. Rather, it is a tab-delimited table that describes the column labels in the data file and serves as a "decoder ring" for the numeric codes used in some of the data columns.

Unfortunately, it has:

a lot MORE information than you need, and
"decoder ring" information stored in a very inconvenient and inefficient "wide" format.

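import pandas as pd

def tab_import(fname):
    # Read a tab-delimited text file into a DataFrame ('\t' is the tab character)
    df = pd.read_csv(fname, sep='\t')
    return df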

#TEST

dd_df = tab_import('../resource/lib/publicdata/m2p1/CollegeScorecardDataDictionary_wide.txt')

Write a function, split_by_coded, that:

Adds a column to the df called 'Coded' that contains Boolean values: False if the columns labeled '-2' through '107' are all NaN, and True otherwise
Splits the df into two new dataframes, one containing the rows that are 'Coded' and one containing the rows that are not
Removes the 'Coded' column from each new dataframe, since it is no longer required

def split_by_coded(full_df):
    ###
    ### YOUR CODE HERE
    ###
    return coded_df, not_coded_df
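
The three steps above can be sketched on a toy table as follows. This is one possible approach, not the required solution: the function name split_by_coded_sketch and the toy rows are invented, and the label-based column slice assumes the code columns sit next to each other in the table.

```python
import numpy as np
import pandas as pd

def split_by_coded_sketch(full_df, first_code="-2", last_code="107"):
    # Label-based slice; assumes the code columns are contiguous.
    code_cols = full_df.loc[:, first_code:last_code]
    # True where at least one code column holds a value.
    flagged = full_df.assign(Coded=code_cols.notna().any(axis=1))
    # Split on the flag, then drop the helper column from both halves.
    coded_df = flagged[flagged["Coded"]].drop(columns="Coded")
    not_coded_df = flagged[~flagged["Coded"]].drop(columns="Coded")
    return coded_df, not_coded_df

# Toy stand-in for the data dictionary; rows are invented for illustration.
toy = pd.DataFrame({
    "VARIABLE NAME": ["CONTROL", "INSTNM", "MAIN"],
    "-2": [np.nan, np.nan, np.nan],
    "0": [np.nan, np.nan, "Not main campus"],
    "1": ["Public", np.nan, "Main campus"],
    "107": [np.nan, np.nan, np.nan],
})

coded, not_coded = split_by_coded_sketch(toy)
print(coded["VARIABLE NAME"].tolist())
print(not_coded["VARIABLE NAME"].tolist())
```

Using notna().any(axis=1) is equivalent to "not all NaN" across the code columns, which matches the definition of 'Coded' given above.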

Write a function, melt_together, that:

Melts the coded dataframe columns '-2' to '107' into a column 'Code' containing the column labels and a column 'Description' containing the code descriptions
Removes all rows in which 'Description' is NaN from the melted coded dataframe, since these rows add no information
Deletes the columns '-2' to '107' from the not_coded dataframe and adds two columns, labeled 'Code' and 'Description', populated with NaNs (use np.nan), since this dataframe has no codes to describe. The pandas.DataFrame.assign() method is recommended: it returns a new dataframe with the added columns, which makes it convenient for "chaining" methods. See https://stackoverflow.com/questions/48177914/why-use-pandas-assign-rather-than-simply-initialize-new-column
Concatenates the two new dataframes into a single dataframe
Orders the columns as NAME OF DATA ELEMENT, dev-category, developer-friendly name, API data type, VARIABLE NAME, Code, Description

def melt_together(coded_df, not_coded_df):
    ###
    ### YOUR CODE HERE
    ###
    return melted_df

#TEST

c_df, n_c_df = split_by_coded(dd_df)
m_df = melt_together(c_df, n_c_df)
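
For reference, the melt, drop, assign, and concatenate steps might look like the sketch below on a toy table. It is one possible approach under stated assumptions: the function name melt_together_sketch and all row values are invented for illustration, and the code columns are assumed to be contiguous and identical in both input dataframes.

```python
import numpy as np
import pandas as pd

ID_COLS = ["NAME OF DATA ELEMENT", "dev-category", "developer-friendly name",
           "API data type", "VARIABLE NAME"]

def melt_together_sketch(coded_df, not_coded_df, first_code="-2", last_code="107"):
    # Labels of the code columns (assumed contiguous and shared by both inputs).
    code_cols = list(coded_df.loc[:, first_code:last_code].columns)
    # Wide -> long: one row per (variable, code) pair, then drop empty pairs.
    melted = (coded_df
              .melt(id_vars=ID_COLS, value_vars=code_cols,
                    var_name="Code", value_name="Description")
              .dropna(subset=["Description"]))
    # Uncoded rows: remove the code columns and add NaN placeholders.
    plain = (not_coded_df
             .drop(columns=code_cols)
             .assign(Code=np.nan, Description=np.nan))
    out = pd.concat([melted, plain], ignore_index=True)
    return out[ID_COLS + ["Code", "Description"]]

# Toy inputs; all values are invented for illustration.
toy_coded = pd.DataFrame({
    "NAME OF DATA ELEMENT": ["Control of institution"],
    "dev-category": ["school"],
    "developer-friendly name": ["ownership"],
    "API data type": ["integer"],
    "VARIABLE NAME": ["CONTROL"],
    "-2": [np.nan], "1": ["Public"], "2": ["Private nonprofit"], "107": [np.nan],
})
toy_not_coded = pd.DataFrame({
    "NAME OF DATA ELEMENT": ["Institution name"],
    "dev-category": ["school"],
    "developer-friendly name": ["name"],
    "API data type": ["string"],
    "VARIABLE NAME": ["INSTNM"],
    "-2": [np.nan], "1": [np.nan], "2": [np.nan], "107": [np.nan],
})

toy_melted = melt_together_sketch(toy_coded, toy_not_coded)
print(toy_melted[["VARIABLE NAME", "Code", "Description"]])
```

The coded variable contributes one row per non-empty code, while the uncoded variable contributes a single row with NaN in both 'Code' and 'Description'.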