This is a group project and my part is Step 4, so I only need Step 4 completed. Dataset: https://www.kaggle.com/datasets/themrityunjaypathak/most-subscribed-1000-youtube-channels Here are the other steps:

Step 1: Data Collection - https://www.kaggle.com/datasets/themrityunjaypathak/most-subscribed-1000-youtube-channels

The dataset chosen is “Most Subscribed 1000 Youtube Channels”, found on Kaggle. It gives a clear picture of the types of hypotheses that can be generated from it. For a study like this one, we want to collect the data through exploratory research and then interpret it so that the findings we report are backed by sound data. I began with Kaggle as the source, and it proved very useful. The site is well organized and properly managed. It provides profile information on its experts and shows when an expert last logged in or when to expect an update to their dataset. The experts are ranked, and their contributions are listed under their profiles. Users can also “upvote” a dataset, which I did for this one to show my appreciation to the expert. Several key factors went into choosing this dataset:

I looked at the usability score Kaggle assigned, which was a 10. This means that the data is well documented, an overview is provided, and the data is in a machine-readable format. I was able to convert it directly from a CSV file to Excel format and then format it as a table.
The Kaggle user who uploaded the dataset is ranked number 41 out of 85,348, making him a Datasets Expert. He is also a fellow undergraduate student and studies at Banaras Hindu University.
There are three related notebooks on Kaggle that support the research, covering exploratory data analysis, analysis of the data, and data analysis with descriptive analysis.
Last but not least, the dataset was highly ranked and had been downloaded 6,053 times, which made me feel very comfortable using it. There is enough data to sort, draw samples from, and create null and alternative hypotheses.

For a dataset to be useful, it should have both categorical and continuous variables and a sufficient number of rows and columns. This dataset fits that description, with 1,000 rows and 8 columns. What I found interesting were the column headings the expert created. The 1,000 YouTube channels are ranked from 1 to 1,000 by number of subscribers, and each row also gives the channel's video count and video views (continuous variables). Additional information includes the category the channel belongs to (a categorical variable) and the year the channel started. From this initial information, we can already see hypotheses that could be tested on a sample of the data: ranking either correlates with video views or is unrelated to them, and a higher number of subscribers either means a higher ranking or does not. Next, we will calculate descriptive summary statistics to better understand this dataset.

Step 2: Summary Statistics

Summary statistics are numerical measurements that summarize the distribution of a dataset. The most common summary statistics include measures of central tendency (e.g., mean, median, mode) and measures of variability (e.g., standard deviation, range, interquartile range).

The mean is the average value of a dataset, calculated by summing all the values and dividing by the number of observations. The median is the middle value when the values are arranged in order. The mode is the most frequently occurring value. The standard deviation measures how far the data is spread out from the mean: a small standard deviation indicates that the data is tightly clustered around the mean, while a large one indicates that the data is more spread out. Percentiles mark positions within the ordered data. For example, the 25th percentile is the value below which 25% of the observations fall. Together, these descriptive statistics provide insight into the characteristics of a dataset. The mean gives an idea of the typical value, the standard deviation indicates the level of variability, and percentiles show how the data is spread and can point to outliers. Overall, descriptive statistics are useful for summarizing the key features of a dataset and for comparing different datasets.
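The definitions above can be sketched with Python's standard `statistics` module. The subscriber counts below are hypothetical values in millions, made up for illustration; they are not taken from the Kaggle dataset.

```python
import statistics

# Hypothetical subscriber counts in millions (illustrative only, NOT from the dataset).
subscribers = [13, 13, 15, 17, 21, 35, 111]

mean = statistics.mean(subscribers)      # sum of values / number of values
median = statistics.median(subscribers)  # middle value of the sorted data
mode = statistics.mode(subscribers)      # most frequently occurring value
stdev = statistics.stdev(subscribers)    # sample standard deviation (divides by n - 1)

# Quartiles: the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(subscribers, n=4)

print(f"mean={mean:.2f} median={median} mode={mode} stdev={stdev:.2f}")
print(f"25th={q1} 50th={q2} 75th={q3}")
```

In this made-up sample the single very large value (111) pulls the mean well above the median, which is one reason to report both.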

Next, we are going to show the mean, median, mode, standard deviation, and percentiles for the number of subscribers and number of video views from the data selected above.

Number of Subscribers:

Mean: 21,581,400

Median: 16,600,000

Mode: 12,800,000

Standard Deviation: 16,625,563.55

Video Views:

Mean: 9,791,803,942

Median: 6,723,360,159

Mode: no mode

Standard Deviation: 13,005,457,457
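One way to read the figures above is to compare each mean with its median and each standard deviation with its mean. The sketch below does that arithmetic with the values reported in this step. The variable names and the choice of these two ratios are illustrative assumptions, not part of the original analysis.

```python
# Summary statistics exactly as reported above in Step 2.
reported = {
    "subscribers": {"mean": 21_581_400, "median": 16_600_000, "stdev": 16_625_563.55},
    "video_views": {"mean": 9_791_803_942, "median": 6_723_360_159, "stdev": 13_005_457_457},
}

ratios = {}
for name, stats in reported.items():
    ratios[name] = {
        # A value above 1 means the mean sits above the median.
        "mean_to_median": stats["mean"] / stats["median"],
        # Coefficient of variation: spread relative to the mean.
        "coefficient_of_variation": stats["stdev"] / stats["mean"],
    }

for name, r in ratios.items():
    print(f"{name}: mean/median = {r['mean_to_median']:.2f}, "
          f"CV = {r['coefficient_of_variation']:.2f}")
```

Both means sit above their medians, which is consistent with a small number of very large channels pulling the averages up.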

Step 3: Meaningful Charts

The chart below breaks down the average number of views per category in the “Most Subscribed 1000 YouTube Channels” dataset. It shows which category leads in views and hints at which category may have the most subscribers overall.


The chart above breaks down the average number of subscribers per category and shows how the categories compare with one another, which helps indicate which categories are most likely to have more video views.
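The per-category averages behind charts like these could be computed as in the sketch below. The rows, category names, and the helper `average_by_category` are hypothetical and chosen only for illustration; the real values would come from the dataset's category and video views (or subscribers) columns.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, video_views) rows (illustrative only, NOT from the dataset).
rows = [
    ("Music", 900),
    ("Music", 300),
    ("Gaming", 200),
    ("Gaming", 400),
    ("Education", 500),
]

def average_by_category(rows):
    """Group values by category and return the mean of each group."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {category: mean(values) for category, values in groups.items()}

averages = average_by_category(rows)
top_category = max(averages, key=averages.get)  # category with the highest average

for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {avg}")
```

The same helper works for subscribers by passing (category, subscribers) pairs instead.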

This final graph gives a rough view of the correlation between a channel's number of subscribers and the number of views it receives. It suggests that the more subscribers a YouTube channel has, the more views it is likely to receive and the more likely it is to keep growing.
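The strength of a relationship like this can be summarized with the Pearson correlation coefficient. The sketch below is a minimal implementation under stated assumptions: the function name `pearson_r` and the sample numbers are hypothetical and are not values from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical subscribers and views per channel (illustrative only).
subscribers = [10, 20, 30, 40, 50]
views = [12, 18, 35, 33, 60]

r = pearson_r(subscribers, views)
print(f"r = {r:.2f}")
```

A value of r close to 1 indicates a strong positive linear relationship, a value near 0 indicates little linear relationship, and a value close to -1 indicates a strong negative one. A high r on its own does not show that subscribers cause views.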