lab09 : Working With Data, (by Miles and Phill)
num | ready? | description | assigned | due |
---|---|---|---|---|
lab09 | true | Working With Data, (by Miles and Phill) | Wed 08/17 01:15PM | Mon 08/22 04:45PM |
Working With Data
In this lab you will work with some data similar to the beer data that we saw in the Facets presentation by UCSD Professor Julian McAuley. The data he presented was about reviews of beers. Since SPIS participants are typically under age 21, we’ll focus only on the reviews of non-alcoholic beers.
References
As you work through the steps below, you may find it helpful to refer to:
- the notes from the lecture Miles gave on Wednesday 08/17, which can be found here: Transcript of Python Shell Session
- The pages on the SPIS 2016 website about:
One more hint: the dictionary methods .keys()
and .values()
may come in handy. This example illustrates how they work:
>>> foo = {'a':3, 'b':3, 'c':5 }
>>> foo.keys()
['a', 'c', 'b']
>>> foo.values()
[3, 5, 3]
>>>
A consequence of this is that if you want to iterate over every key in a dictionary called foo
,
you can just do a for loop on foo.keys()
, e.g.
for k in foo.keys():
# do something with key k and value foo[k]
Step 1: Create a lab09 repo
Go to github.com and create a new repo called spis16-lab09-part1-Name-Name using Method 1. When creating the repo import the starter code from this git repo: https://github.com/ucsd-cse-spis-2016/lab09start
Then use git clone
to clone this into your ~/github
directory.
Step 2: Work with the file non_alcoholic_data.py
file
In the starter code, you’ll find the file non_alcoholic_data.py
. Open this file in IDLE, and run it.
This file loads a variable into memory called data
. That data
variable is list of dictionaries.
Now, you might think that the first thing you’d want to do is type data
at the Python prompt, and hit enter, but PLEASE DON’T DO THAT. It isn’t that anything awful will
happen; it is just that it will temporarily lock up your computer session, and you might have to close it and start over.
Let’s explain two things right away:
- What you should do instead
- Why typing
data
and hitting enter (which you should NOT do) is a bad idea.
What you should do instead
Here are several things you can do after running the non_alcoholic_data.py
file with the data
variable that are
perfectly fine:
- Type
type(data)
. - Type
len(data)
- Type
data[0]
- Type
data[42]
- Type
type(data[0])
- Type
data[-1]
Try each of those, and see what you get. Your purpose right now is simply to understand what the data represents by looking through it.
For the result of type(data[0])
you should have gotten dict
, which stands for dictionary.
If you need a review about Python dictionaries, you can consult the lecture notes from either Phill or Miles, or you can review this page: Python: Dictionaries
Why typing data
and hitting return is a bad idea
The reason is simple: data is a huge list. It takes a very long time to convert that list into its string representation and then print it out in your IDLE Python Shell. While it is doing that string conversion, your entire session just locks up, and it appears that nothing is happening. Eventually, you will get your prompt back, along with a zillion pages of output. But that probably isn’t what you wanted to see.
Some other things you could do
Each element in the list data
represents a single review by some reviewer of some beer.
If we want to see all the pieces of data that we can access for a review, we can just look at the string representation of a single Python dictionary for a single review, like this:
>>> data[0]
{'beer/style': 'Low Alcohol Beer', 'beer/ABV': 0.1, 'beer/beerId': '59239', 'review/timeStruct': {'wday': 5, 'isdst': 0, 'mday': 2, 'hour': 21, 'min': 31, 'sec': 21, 'year': 2011, 'yday': 183, 'mon': 7}, 'review/aroma': 3.0, 'review/appearance': 3.5, 'review/timeUnix': 1309642281, 'review/palate': 3.0, 'review/taste': 3.5, 'beer/name': 'Amstel Malt', 'beer/brewerId': '163', 'review/overall': 3.0, 'review/text': "I had this 'beer' at a beachbar in IJmuiden today. Had to drive back home so i needed something without alcohol. This was a good drink for that purpose.\t\tPoured in the original Amstel glass this beer looks quit ok to be honest. Smell in not spectacular but for a malt beer ok. Watery as expected.\t\tSome sweetness in this beer. Drinkable when you have to drive....", 'user/profileName': 'Brabander'}
>>>
Or, if I want to see just the keys in the dictionary, I can use data[0].keys()
>>> data[0].keys()
['beer/style', 'beer/ABV', 'beer/beerId', 'review/timeStruct', 'review/aroma', 'review/appearance', 'review/timeUnix', 'review/palate', 'review/taste', 'beer/name', 'beer/brewerId', 'review/overall', 'review/text', 'user/profileName']
>>>
If I wanted to see a list of the names, overall scores, and the review name for the first 10 beers in the list, I could use this:
>>> for i in range(10):
print data[i]['beer/name'],data[i]['review/overall'],data[i]['user/profileName']
Amstel Malt 3.0 Brabander
Amstel Malt 3.5 kingcrowing
Harboe Den Glada Danskens Lättöl 2.0 gnoff
Harboe Den Glada Danskens Lättöl 2.0 bark
Bernard S čistou Hlavou 2.0 philipquarles
Juleøl 4.5 jreitman
Mørkt Hvidtøl 3.5 bark
Buckler 3.0 drabmuh
Buckler 4.5 McStagger
Buckler 3.0 jifigz
>>>
However, very quickly, doing this kind of thing at the command line is going to get very tedious.
You are going to get very tired of typing these long strings over and over. You’ll want to define some convenient functions to work with this data.
So, using IDLE, create a new file called `my_analysis.py’.
In that file, put this code:
from non_alcoholic_data import parseData
def reviewSummary(review):
return "beer name: " + review['beer/name'] + \
" overall: " + str(review['review/overall']) + \
" reviewer: " + review['user/profileName']
if __name__ == "__main__":
print "Reading Data..."
data = list(parseData("http://jmcauley.ucsd.edu/cse255/data/beer/non-alcoholic-beer.json"))
for i in range(10):
print reviewSummary(data[i])
print "done"
The function definition reviewSummary
provides a way to print out three particular fields, in this case:
beer/name
review/overall
user/profileName
And, it customizes the label that appears next to each of those so that we can make more sense of the data we are looking at.
Reading Data...
review beer name reviewer
beer name: Amstel Malt overall: 3.0 reviewer: Brabander
beer name: Amstel Malt overall: 3.5 reviewer: kingcrowing
beer name: Harboe Den Glada Danskens Lättöl overall: 2.0 reviewer: gnoff
beer name: Harboe Den Glada Danskens Lättöl overall: 2.0 reviewer: bark
beer name: Bernard S čistou Hlavou overall: 2.0 reviewer: philipquarles
beer name: Juleøl overall: 4.5 reviewer: jreitman
beer name: Mørkt Hvidtøl overall: 3.5 reviewer: bark
beer name: Buckler overall: 3.0 reviewer: drabmuh
beer name: Buckler overall: 4.5 reviewer: McStagger
beer name: Buckler overall: 3.0 reviewer: jifigz
done
However, it is still a bit messy. To make it neater, we can decide that we want the data to show up in some nicely aligned columns. Here is some code that prints a few fields from the first ten review in some nice columns:
def niceReviewSummary(review):
return str(review['review/overall']).rjust(5) + \
review['beer/name'].rjust(50) + \
review['user/profileName'].rjust(20)
if __name__ == "__main__":
print "Reading Data..."
data = list(parseData("http://jmcauley.ucsd.edu/cse255/data/beer/non-alcoholic-beer.json"))
print "review".rjust(5), "beer name".rjust(50), "reviewer".rjust(20)
for i in range(10):
print niceReviewSummary(data[i])
print "done"
Now the output looks like this:
Reading Data...
review beer name reviewer
3.0 Amstel Malt Brabander
3.5 Amstel Malt kingcrowing
2.0 Harboe Den Glada Danskens Lättöl gnoff
2.0 Harboe Den Glada Danskens Lättöl bark
2.0 Bernard S čistou Hlavou philipquarles
4.5 Juleøl jreitman
3.5 Mørkt Hvidtøl bark
3.0 Buckler drabmuh
4.5 Buckler McStagger
3.0 Buckler jifigz
done
Now that we see how to look at the data, we might want to actually calculate some things from this data.
For example:
- We might want to be able to filter out reviews based on whether or not we know the age of the reviewer.
- We might want to make a list of just the ages of the reviewers.
- Once we have just that list, we might want to do some things with it.
- We might want to know what highest rated and lowest rated beers are.
- etc.
In order to be able to answer these types of questions, we need to be able to work with the data in various ways. One of our most basis tools is to reduce the data down to a simpler form: for example, instead of a list of dictionaries, just a list of numbers.
Your task is to write functions that will take as their parameter, data
, the result that is returned by the parseData
function, and then return various things.
Here is a list of the functions you should write. Add each of these to your my_analysis.py
file.
You are also encouraged to write a file test_my_analysis.py
, containing tests for your functions.
To test these functions, you’d make your own values for data
, that contain probably far fewer reviews than the
real value of data
, so that you can predict, by hand, what the results of the functions would be.
You can then write test cases for the functions, following the examples of test cases that you saw in lab06.
-
def beerUserRating(data):
returns a list of Python tuples, where each tuple consists of three values: thebeer/name
,user/profileName
, and thereview/overall
value for each review in the listdata
. -
def beersRatedFive(data):
returns a list of strings, which are the names of the beers that are rated 5.0 (exactly). Ideally, your function should NOT include any duplicates. That is, if a particular beer has been rated 5.0 by more than one reviewer, the beer should still appear in the list only once. -
def reviewerToAgeInSeconds(data):
returns a dictionary (the regular kind, not adefaultdict
where the keys are theuser/profileName
values fromdata
, and the values are theuser/ageInSeconds
values for each of those users. Note that a user may appear more than once if they wrote more than one review. We are going to assume that the ages are all the same, even if that isn’t the case. And, note that a dictionary already gets rid of duplicates, since if you update a value for a key that already exists, it just overwrites the old value, rather than creating a duplicate entry. Note, though, that not every review has a value for age. So, if you encounter a review where the user/ageInSeconds is not present, you’ll need to set the value for that user to the special Python valueNone
. -
def ageInSecondsToAgeInYears(seconds)
should convert age in seconds to age in years. Note that with age, we don’t round up. If you are 20 years and 355 days old, you are still “age 20”, not “age 21”. -
def reviewerToAgeInYears(data):
should provide a dictionary that maps reviewer name (those are the keys) to reviewer age in years. In your solution, please make use of theageInSecondsToAgeInYears(seconds)
function that you wrote in a previous step. -
def averageReviewerAge(reviewerAgeDict)
should take, as its parameter, the result of callingreviewerToAgeInYears
, that is, a dictionary that maps reviewer names to reviewer ages. It should return the average age, in years, of a reviewer, as a floating point number (that is, the average age could be27.6
years, for example.) As a hint: if you have a dictionary foo, you can get a list of all of its value by -
def beerToNumReviews(data)
should return adefaultdict(int)
where the keys arebeer/name
, and the values are a count of how many reviews appear for that beer. -
def beerToListOfReviews(data)
should return adefaultdict(list)
where the keys arebeer/name
, and the value for each key is a list of all of the individualreview/overall
values for that beer. These lists may indeed contain duplicates if multiple reviewers rated a beer the same. -
def beerToAvgReview(data)
should return a dictionary (the regular kind) where they keys arebeer/name
, and each value is average rating for that beer. You should make use of thebeerToNumReviews(data)
andbeerToListOfReviews(data)
When you have written definitions for each of these functions, you can test them in either, or preferably both, of the following ways:
- running the test cases you came up with in your
test_my_analysis.py
file - call the functions on the value
data
, inside yourif __name__ == '__main__'
block, and print out the results. - call the functions interactively on the value
data
using the Python shell prompt.
For each function, when you are satisfied that that function is working, do the steps to commit your changes to github:
git add my_analysis.py
git commit -m "AB/CD describe which function you worked on here"
git push origin master
When they all work, you are done with lab09!