#------------------------------------------
#------------------------------------------
############## Homework #6 ##############
#------------------------------------------
#------------------------------------------
# Directions: cluster a sample of Amazon product
# reviews. Your sample will include 250 automotive products
# from a population of over 20,000 Amazon product reviews, with
# corresponding product information.
#------------------------------------------
######### Preliminary Code #########
#------------------------------------------
#------------------------------------------
## Get/Set Your Working Directory
#------------------------------------------
#------------------------------------------
## Load Packages (libraries)
#------------------------------------------
library(tm)
library(cluster)
#------------------------------------------
## Load Functions & Data (.RData File)
#------------------------------------------
# load the data used in this HW
load("HW6-1.RData")
# load the cluster functions
load("clusterFunctions.RData")
#------------------------------------------
######### Solutions #########
#------------------------------------------
## 1. First, learn about the objects that you loaded into your
# workspace. Next, set your birthday seed, before running the code in the
# answer section. In words, describe what this code is doing.
#------------------------------------------
## ANSWER 1##
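# Set your seed here: call set.seed() with your birthday as the integer
# argument, so that the random sample below is reproducible.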
products <- sample(unique(autorevs$asin), 250, replace=FALSE)
docs <- autorevs$doc_id[autorevs$asin %in% products]
#
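# Illustrative sketch on toy data (not the homework objects): the same
# sample() / %in% pattern as above, small enough to inspect by eye. The
# toy_* names and the seed value 1234 are assumptions made up for this example.
toy_revs <- data.frame(
  doc_id = 1:8,
  asin   = c("A", "A", "B", "C", "C", "C", "D", "E"),
  stringsAsFactors = FALSE
)
set.seed(1234)  # fixing the seed makes the random draw repeatable
# draw 3 distinct product IDs without replacement
toy_products <- sample(unique(toy_revs$asin), 3, replace = FALSE)
# keep the doc_id of every review that belongs to a sampled product
toy_docs <- toy_revs$doc_id[toy_revs$asin %in% toy_products]
length(toy_products)  # number of sampled products
length(toy_docs)      # number of reviews kept (can exceed the product count)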
#------------------------------------------
## 2. Next, create a TDM and dataframe subsets based on
# the docs and products vectors created in step 1.
# How many documents are in your subsets?
#------------------------------------------
## ANSWER 2##
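# Illustrative sketch on toy data: subsetting a TDM by document and a data
# frame by row. toy_text, toy_df and the asin values are made up; the object
# names stored in HW6-1.RData are not shown here, so substitute your own.
library(tm)  # already loaded above; repeated so the sketch runs on its own
toy_text <- c("brake pads fit well", "wiper blades streak badly",
              "brake light bulb failed", "floor mats fit great")
toy_df <- data.frame(doc_id = as.character(1:4),
                     asin   = c("A", "B", "A", "C"),
                     stringsAsFactors = FALSE)
# VectorSource() labels documents "1", "2", ..., which matches toy_df$doc_id
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(toy_text)))
toy_docs <- toy_df$doc_id[toy_df$asin %in% c("A", "C")]
# TDM columns are documents, so subset on the column dimension
toy_tdm_sub <- toy_tdm[, Docs(toy_tdm) %in% toy_docs]
# data frame rows are documents, so subset on the row dimension
toy_df_sub <- toy_df[toy_df$doc_id %in% toy_docs, ]
nDocs(toy_tdm_sub)  # documents left in the TDM subset
nrow(toy_df_sub)    # documents left in the data frame subset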
#------------------------------------------
## 3. We will start by clustering review text to find clusters of terms.
# First, create the distance matrix. Use the dist() function to create
# a distance matrix for the automotive review terms named rev_tdist.
# Then, perform hierarchical clustering using Ward's Method.
#------------------------------------------
## ANSWER 3##
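# Illustrative sketch on a toy term-by-document count matrix, standing in
# for as.matrix() of your TDM subset. All toy_* names and counts are made up.
# hclust() offers two Ward options ("ward.D" and "ward.D2"); using "ward.D2"
# here is an assumption, so check which one the course expects.
toy_tdm_mat <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 1,
    0, 0, 4, 3,
    0, 1, 3, 4,
    1, 0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(Terms = c("brake", "pad", "wiper", "blade", "great"),
                  Docs  = paste0("d", 1:4))
)
# rows are terms, so dist() returns term-to-term Euclidean distances
toy_tdist <- dist(toy_tdm_mat, method = "euclidean")
toy_hc <- hclust(toy_tdist, method = "ward.D2")
plot(toy_hc, main = "Toy term dendrogram")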
#------------------------------------------
## 4. Evaluate the best number of clusters, k, using plots of the average
# silhouette width and within-cluster SSE across k values to guide your choice.
# Consider k values up to 15. Based on your plots, how many clusters would you choose?
#------------------------------------------
## ANSWER 4##
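# Illustrative sketch on simulated data: average silhouette width and
# within-cluster SSE for k = 2..8. The course helpers stored in
# clusterFunctions.RData are not shown here, so this uses only
# cluster::silhouette() and base R. All toy_* names are made up.
library(cluster)  # already loaded above; repeated so the sketch runs on its own
set.seed(1)
toy_mat <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
                 matrix(rnorm(20, mean = 5),  ncol = 2),
                 matrix(rnorm(20, mean = 10), ncol = 2))
toy_d  <- dist(toy_mat)
toy_hc <- hclust(toy_d, method = "ward.D2")
toy_ks <- 2:8
# average silhouette width per k: higher means better-separated clusters
toy_sil <- sapply(toy_ks, function(k) {
  cl <- cutree(toy_hc, k = k)
  mean(silhouette(cl, toy_d)[, "sil_width"])
})
# within-cluster SSE per k: squared deviations from each cluster's own mean
toy_sse <- sapply(toy_ks, function(k) {
  cl <- cutree(toy_hc, k = k)
  sum(sapply(split(as.data.frame(toy_mat), cl),
             function(g) sum(scale(g, scale = FALSE)^2)))
})
plot(toy_ks, toy_sil, type = "b", xlab = "k", ylab = "Average silhouette width")
plot(toy_ks, toy_sse, type = "b", xlab = "k", ylab = "Within-cluster SSE")
# A common reading: prefer a k with a high silhouette width that also sits
# near the "elbow" of the SSE curve.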
#------------------------------------------
## 5. Based on your chosen k in answer 4, cut your dendrogram. Plot the
# distribution of terms. Are the terms evenly distributed across clusters?
#------------------------------------------
## ANSWER 5##
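# Illustrative sketch on simulated counts: cutting a Ward dendrogram with
# cutree() and plotting how many terms land in each cluster. k = 3 is an
# arbitrary choice for the toy data, and all toy_* names are made up.
set.seed(2)
toy_mat <- matrix(rpois(60, lambda = 2), nrow = 15,
                  dimnames = list(paste0("term", 1:15), paste0("d", 1:4)))
toy_hc <- hclust(dist(toy_mat), method = "ward.D2")
toy_cl <- cutree(toy_hc, k = 3)   # named integer vector: term -> cluster
toy_sizes <- table(toy_cl)        # number of terms per cluster
toy_sizes
barplot(toy_sizes, xlab = "Cluster", ylab = "Number of terms")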
#------------------------------------------
## 6. Choose one of the clusters and view the terms in that cluster. Do
# they appear to be related? Explain.
#------------------------------------------
## ANSWER 6##
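# Illustrative sketch on a toy term matrix: listing the terms assigned to
# one cluster. cutree() returns a vector named by term, so the names can be
# filtered by cluster number. All toy_* names and counts are made up.
toy_mat <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 1,
    0, 0, 4, 3,
    0, 1, 3, 4,
    1, 0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("brake", "pad", "wiper", "blade", "great"),
                  paste0("d", 1:4))
)
toy_cl <- cutree(hclust(dist(toy_mat), method = "ward.D2"), k = 2)
toy_terms_1 <- names(toy_cl)[toy_cl == 1]  # terms in cluster 1
toy_terms_1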
#------------------------------------------
## 7. Next, we will apply kmeans clustering to the documents. First, use the
# plot of the average silhouette width across k values up to 25 to choose
# the optimal k.
#------------------------------------------
## ANSWER 7##
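# Illustrative sketch on simulated data: average silhouette width for kmeans
# solutions across k = 2..10. toy_dtm_mat stands in for as.matrix() of a DTM
# (documents in rows); its values are simulated, not real term counts. The
# seed, nstart = 10 and all toy_* names are assumptions for this example.
library(cluster)  # already loaded above; repeated so the sketch runs on its own
set.seed(3)
toy_dtm_mat <- rbind(matrix(rnorm(30, mean = 0),  ncol = 3),
                     matrix(rnorm(30, mean = 6),  ncol = 3),
                     matrix(rnorm(30, mean = 12), ncol = 3))
toy_d  <- dist(toy_dtm_mat)
toy_ks <- 2:10
toy_sil <- sapply(toy_ks, function(k) {
  km <- kmeans(toy_dtm_mat, centers = k, nstart = 10)  # 10 random restarts
  mean(silhouette(km$cluster, toy_d)[, "sil_width"])
})
plot(toy_ks, toy_sil, type = "b", xlab = "k", ylab = "Average silhouette width")
toy_ks[which.max(toy_sil)]  # k with the largest average silhouette width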
#------------------------------------------
## 8. Use your choice of k from answer 7 and perform kmeans clustering. Plot
# the distribution of documents. Then, use the doc_clus_overview() function
# to view the cluster size and the most important terms in each cluster.
# Hint: don't forget to apply the function to the DTM, not TDM!
# Hint 2: don't forget to set your seed!
#------------------------------------------
## ANSWER 8##
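# Illustrative sketch on a toy document-by-term matrix: kmeans at a fixed k,
# the distribution of documents, and the highest-weighted terms per cluster
# read off the cluster centers. The arguments of doc_clus_overview() are not
# shown here, so it is not called; the center-based summary is a stand-in.
# In tm, t() applied to a TermDocumentMatrix returns the DocumentTermMatrix.
# k = 2, the seed and all toy_* names are assumptions for this example.
toy_dtm_mat <- matrix(
  c(4, 3, 0, 0,
    3, 4, 0, 1,
    5, 2, 1, 0,
    0, 0, 4, 3,
    0, 1, 3, 4,
    1, 0, 5, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(Docs  = paste0("d", 1:6),
                  Terms = c("brake", "pad", "wiper", "blade"))
)
set.seed(4)  # kmeans starts from random centers, so set the seed first
toy_km <- kmeans(toy_dtm_mat, centers = 2, nstart = 10)
barplot(table(toy_km$cluster), xlab = "Cluster", ylab = "Number of documents")
# top 2 terms per cluster, ranked by mean count in the cluster center
toy_top <- apply(toy_km$centers, 1,
                 function(ctr) names(sort(ctr, decreasing = TRUE))[1:2])
toy_top  # one column per cluster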
#------------------------------------------
## 9. Now that we know a little more about the naturally existing clusters
# of terms and documents, explore your dataframe subset further. Use
# summary(), table(), etc. to learn more about your metadata. Are there
# any variables that may help you to understand the clustering solution?
# Which ones? Explain.
#------------------------------------------
## ANSWER 9##
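# Illustrative sketch on a toy data frame: comparing cluster membership with
# metadata. The columns rating and price are hypothetical; the metadata
# columns of the homework data frame are not shown here, so substitute the
# ones you find with names() and str().
toy_meta <- data.frame(
  doc_id  = 1:8,
  cluster = c(1, 1, 1, 2, 2, 2, 2, 1),
  rating  = c(5, 4, 5, 1, 2, 1, 2, 4),
  price   = c(12.5, 30, 18, 9.99, 45, 22, 15, 60)
)
summary(toy_meta[, c("rating", "price")])
# cross-tabulate cluster against a categorical or ordinal variable
toy_tab <- table(cluster = toy_meta$cluster, rating = toy_meta$rating)
toy_tab
prop.table(toy_tab, margin = 1)  # row proportions: rating mix within each cluster
# compare a numeric variable across clusters
aggregate(price ~ cluster, data = toy_meta, FUN = mean)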