
QUESTION

I am willing to pay $100 for this to be done by 8am on Friday, October 6th; however, I can't seem to figure out how to put this amount into holding pending an answer...

In this assignment you will download some of the mailing list data from http://mbox.dr-chuck.net/, run the data cleaning / modeling process, and take some screen shots. You will then run two visualizations of the email data you have retrieved and processed. The first is a word cloud that visualizes the frequency distribution of words in the subject lines. While a word cloud might seem a little silly and over-used, it is actually a very engaging way to visualize a frequency distribution or histogram, and it is a nice continuation of the frequency/counting assignments we have been doing in this class. The second visualization is a timeline that shows how the data is changing over time. You are provided the base code for the two visualizations but will need to edit it to improve the data output. Finally, you will need to create your own visualization using the spidered data.

Here is a copy of the Sakai Developer Mailing list from 2006-2014.

http://mbox.dr-chuck.net/

The base program includes gmane.py, gmodel.py, gword.py and gline.py, along with the sample generated gword.js (with gword.htm) and gline.js (with gline.htm). It is found in the shared folder here: https://drive.google.com/file/d/0B97phCsyLEpvRFJfNGNZX1oyTlE/view?usp=sharing.

You can install the SQLite browser from http://sqlitebrowser.org/ if you would like to view and modify the databases used for this assignment.

Project Structure

gmane.py

The gmane.py file is provided for you. It operates as a spider in that it runs slowly, retrieving one mail message per second so as to avoid getting throttled. It stores all of its data in a database and can be interrupted and re-started as often as needed. It may take many hours to pull all the data down, so you may need to restart several times. You should download and process at least 1000 messages for the data visualizations to work - but more data is always better.
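
To make the polite retrieval concrete, here is a minimal sketch of that kind of loop. The URL pattern mimics the sample run output below, and is an assumption about how gmane.py actually forms its requests:

    import time
    import urllib.request

    # A sketch of a polite one-message-per-second retrieval loop
    # (not the actual gmane.py code).
    start = 5                       # first message id not yet in content.sqlite
    for msg in range(start, start + 3):
        url = 'http://mbox.dr-chuck.net/sakai.devel/%d/%d' % (msg, msg + 1)
        text = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        print(url, len(text), 'characters')
        time.sleep(1)               # one message per second to avoid throttling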

The base URL (http://mbox.dr-chuck.net/) is hard-coded in gmane.py. Make sure to delete the content.sqlite file if you switch the base URL.

Navigate to the folder where you extracted gmane.zip.

Here is a run of gmane.py retrieving a few messages from the sakai developer list:

python gmane.py

How many messages:10

http://mbox.dr-chuck.net/sakai.devel/5/6 .ac.uk 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments

http://mbox.dr-chuck.net/sakai.devel/6/7 2005-12-09T13:32:31-06:00 re: sakaiportallogin and presense

http://mbox.dr-chuck.net/sakai.devel/7/8 .ac.uk 2005-12-09T13:42:24+00:00 re: lms/vle rants/comments

The program scans content.sqlite from 1 up to the first message number not already spidered and starts spidering at that message. It continues spidering until it has spidered the desired number of messages or it reaches a page that does not appear to be a properly formatted message.
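
Here is a minimal sketch of that resume scan, assuming content.sqlite stores messages in a table named Messages keyed by an integer id column (both names are assumptions - check the real schema in gmane.py):

    import sqlite3

    # Scan upward from 1 until we find the first message id that has
    # not yet been spidered; that is where spidering resumes.
    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()

    start = 1
    while True:
        cur.execute('SELECT id FROM Messages WHERE id = ?', (start,))
        if cur.fetchone() is None:
            break                   # first message number not yet spidered
        start = start + 1

    print('Resume spidering at message', start)
    conn.close()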

Sometimes a message is missing. Perhaps administrators can delete messages, or perhaps they get lost - I don't know. If your spider stops and it seems to have hit a missing message, go into the SQLite Manager, add a row with the missing id - leave all the other fields blank - and then restart gmane.py. This will unstick the spidering process and allow it to continue. These empty messages will be ignored in the next phase of the process.
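
If you prefer to do that fix from Python rather than the SQLite Manager, a sketch like the following would add the blank row. The Messages table name is an assumption, 4027 is only a stand-in for the id your spider actually stopped on, and this assumes the other columns accept empty values:

    import sqlite3

    # Insert a placeholder row for the missing message id so the
    # spider can skip past it on restart.
    conn = sqlite3.connect('content.sqlite')
    conn.execute('INSERT INTO Messages (id) VALUES (?)', (4027,))
    conn.commit()
    conn.close()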

One nice thing is that once you have spidered all of the messages and have them in content.sqlite, you can run gmane.py again to get new messages as they get sent to the list. gmane.py will quickly scan to the end of the already-spidered pages and check if there are new messages and then quickly retrieve those messages and add them to content.sqlite.

The content.sqlite data is pretty raw, with an inefficient data model, and is not compressed. This is intentional, as it allows you to look at content.sqlite to debug the process. It would be a bad idea to run any queries against this database, as they would be slow.

gmodel.py

The second process is running the program gmodel.py. gmodel.py reads the rough/raw data from content.sqlite and produces a cleaned-up and well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because it also compresses the header and body text.
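
As an illustration of why that compression shrinks the file so much, here is a small sketch using Python's zlib module - an assumption; gmodel.py's actual scheme may differ:

    import zlib

    # Compress a repetitive message body and restore it losslessly.
    body = 're: lms/vle rants/comments\n' * 100   # illustrative repetitive text
    raw = body.encode('utf-8')
    packed = zlib.compress(raw)
    print(len(raw), '->', len(packed), 'bytes')   # compressed is much smaller

    # Decompression restores the original text exactly.
    assert zlib.decompress(packed).decode('utf-8') == body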

Each time gmodel.py runs, it completely wipes out and re-builds index.sqlite, allowing you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the data cleaning process.

Running gmodel.py works as follows:

python gmodel.py

Loaded allsenders 1588 and mapping 28 dns mapping 1

1 2005-12-08T23:34:30-06:00

251 2005-12-22T10:03:20-08:00

501 2006-01-12T11:17:34-05:00

751 2006-01-24T11:13:28-08:00

...

The gmodel.py program does a number of data cleaning steps:

Domain names are truncated to two levels for .com, .org, .edu, and .net; other domain names are truncated to three levels. So si.umich.edu becomes umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Mail addresses are also forced to lower case, and some of the @gmane.org placeholder addresses are converted to the real address whenever there is a matching real email address elsewhere in the message corpus.
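
As a concrete illustration of the truncation rule (a sketch, not the actual gmodel.py code):

    def truncate_domain(addr):
        """Truncate an address's domain: two levels for
        .com/.org/.edu/.net, three levels otherwise."""
        addr = addr.lower()
        name, _, domain = addr.partition('@')
        pieces = domain.split('.')
        if pieces[-1] in ('com', 'org', 'edu', 'net'):
            keep = 2
        else:
            keep = 3
        return name + '@' + '.'.join(pieces[-keep:])

    print(truncate_domain('someone@si.umich.edu'))      # someone@umich.edu
    print(truncate_domain('someone@caret.cam.ac.uk'))   # someone@cam.ac.uk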

When you are done, you will have a nicely indexed version of the email in index.sqlite. This is the file to use to do data analysis. With this file, data analysis will be really quick.

gbasic.py

The first, simplest data analysis is to ask "who does the most?" and "which organization does the most?". This is done using gbasic.py:

python gbasic.py

How many to dump? 5

Loaded messages= 51330 subjects= 25033 senders= 1584

Top 5 Email list participants

...

Top 5 Email list organizations

gmail.com 7339

umich.edu 6243

uct.ac.za 2451

indiana.edu 2258

unicon.net 2055
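
Under the hood, counts like these are just aggregations over index.sqlite. A hedged sketch of that kind of query follows; the table and column names (Messages.sender_id, Senders.id, Senders.sender) are assumptions, so inspect the schema in the SQLite browser before relying on them:

    import sqlite3

    # Count messages per sender, most active first.
    conn = sqlite3.connect('index.sqlite')
    cur = conn.cursor()
    cur.execute('''SELECT Senders.sender, COUNT(*) AS cnt
                   FROM Messages JOIN Senders
                     ON Messages.sender_id = Senders.id
                   GROUP BY Senders.sender
                   ORDER BY cnt DESC LIMIT 5''')
    for sender, cnt in cur:
        print(sender, cnt)
    conn.close()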

gword.py

There is a simple visualization of the word frequencies in the subject lines in the file gword.py:

python gword.py

Range of counts: 33229 129

Output written to gword.js

This produces the file gword.js, which has the top 100 words found in the emails. You can view them in a word cloud using the file gword.htm. Once you get gword.py to work, you will need to enhance the program to filter the output as follows:

  • The output should only contain words made up of letters that are 4 characters or longer (no numbers)
  • The output should remove common words (stop words)
  • The output should remove "sakai" and "email" - common words in this output that are not meaningful for the word cloud.
  • The output should use content from the subjects of the emails.

The filters should be added in place of the line words = text.split(" ") in the sample program; a sketch of one possible filter follows.
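
Here is a hedged sketch of what that replacement could look like. The stop-word list is illustrative only (extend it as needed), and in gword.py the variable text already holds the subject line - it is defined here just so the sketch runs on its own:

    import string

    text = 're: Sakai 2.9 email notification problems'   # illustrative subject

    STOPWORDS = {'about', 'from', 'have', 'that', 'this', 'with',
                 'sakai', 'email'}   # illustrative list; extend as needed

    words = list()
    for word in text.split():
        word = word.lower().strip(string.punctuation)
        if len(word) < 4 or not word.isalpha():
            continue                 # only 4+ character, all-letter words
        if word in STOPWORDS:
            continue                 # drop stop words plus sakai/email
        words.append(word)

    print(words)                     # ['notification', 'problems']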

gline.py

A second visualization is in gline.py. It visualizes email participation by organizations over time.

python gline.py

Loaded messages= 51330 subjects= 25033 senders= 1584

Top 10 Organizations

['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']

Output written to gline.js

Its output is written to gline.js which is visualized using gline.htm.

Change the gline.py program to show the message count by month instead of by year. You can switch from a by-year to a by-month visualization by changing only a few lines in gline.py. The puzzle is to figure out the smallest change that accomplishes it; a sketch follows.
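
One hedged sketch of the idea: if the sample gline.py takes the year from the sent-date string with a slice along the lines of key = sent_at[:4] (the variable names here are assumptions), grouping by month can be as small as widening that slice:

    # Illustrative date string in the same format as gmodel.py's output.
    sent_at = '2006-01-12T11:17:34-05:00'

    key = sent_at[:7]    # '2006-01' instead of '2006' from sent_at[:4]
    print(key)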

Your Own Visualization:

Once you have gotten the visualization to work for gword and gline you should create one other visualization to display the data in a different way. When creating your own visualization:

  • it must output data that is different (at least slightly) from that used in gword and gline.
  • it must use a different chart type than used in gword and gline.

You can create a Bubble chart. This chart can be used as an alternative to the word cloud. Instead of JSON data, this uses csv data. A sample bubble chart is shown in sampleBubble.htm using the csv data in flare.csv (in the zip file for the assignment). If you were to choose the bubble chart, you would need to do the following (a sketch of the output step follows the list):

  • Create a new python file (gbubble.py) that is like your final gword.py except that the output is different
  • Change the output to the csv format seen in flare.csv
  • Output data for all words with a count of more than 10 (or 50 if you downloaded lots of the data)
  • Output actual count data instead of the scaled font size for the word cloud.
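
Here is a hedged sketch of that output step, writing word counts in a two-column id,value layout like flare.csv's (check the copy in the assignment zip for the exact layout). The counts dictionary is illustrative; in gbubble.py it would come from the subject-line word counts built as in gword.py:

    import csv

    counts = {'portal': 412, 'login': 97, 'upgrade': 55, 'typo': 3}

    with open('gbubble.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['id', 'value'])
        for word, count in sorted(counts.items(), key=lambda kv: -kv[1]):
            if count > 10:          # keep words seen more than 10 times
                writer.writerow([word, count])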

You could also choose another visualization to use with your data. d3 supports a wide variety of visualizations that you can use: https://github.com/d3/d3/wiki/Gallery

Some other URLs for other visualization ideas:

https://developers.google.com/chart/

https://developers.google.com/chart/interactive/docs/gallery/motionchart

https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart_time_formats

https://developers.google.com/chart/interactive/docs/gallery/annotatedtimeline

http://bost.ocks.org/mike/uberdata/

http://nltk.org/install.html

Submitting Your Work

Please Upload Your Submission:

  • A screen shot of you running the gmane.py application to produce the content.sqlite database.
  • A screen shot of you running the gmodel.py application to produce the index.sqlite database.
  • A screen shot of you running the gbasic.py program to compute basic histogram data on the messages you have retrieved.
  • A screen shot of the word cloud visualization for the messages you have retrieved, before you applied the filters.
  • A screen shot of the word cloud visualization for the messages you have retrieved, after you have applied the appropriate filters.
  • A screen shot of the timeline visualization for the messages you have retrieved, by year.
  • A screen shot of the by-month visualization for the messages.
  • A screen shot of the new visualization.
  • A zip file containing all of the py, js, csv, htm and sqlite files you used as a part of the assignment.

Rubric

  • A screen shot of you running the gmane.py application to produce the content.sqlite database - 5.0 pts
  • A screen shot of you running the gmodel.py application to produce the index.sqlite database - 5.0 pts
  • A screen shot of you running the gbasic.py program to compute basic histogram data on the messages you have retrieved - 5.0 pts
  • A screen shot of the word cloud visualization for the messages you have retrieved, before you applied the filters - 5.0 pts
  • gword.py edited to apply the appropriate filters and to get data from the subjects of the emails - 20.0 pts
  • A screen shot of the word cloud visualization for the messages you have retrieved, after you have applied the appropriate filters - 10.0 pts
  • A screen shot of the timeline visualization for the messages you have retrieved, by year - 5.0 pts
  • gline.py edited to output data by month (or month and year) - 15.0 pts
  • A screen shot of the by-month visualization for the messages - 10.0 pts
  • A new .py file (e.g. gbubble.py) containing code to output the necessary data for another visualization - 10.0 pts
  • A new data file (e.g. gbubble.csv) containing data for the new visualization - 5.0 pts
  • A screen shot of the new visualization - 5.0 pts

Total Points: 100.0
