Exam - Mon 10, Jun 2019
Scientific Programming - Data Science @ University of Trento
Download exercises and solution
Introduction
Taking part to this exam erases any vote you had before
What to do
Download
sciprog-ds-2019-06-10-exam.zip
and extract it on your desktop. Folder content should be like this:
sciprog-ds-2019-06-10-FIRSTNAME-LASTNAME-ID
exam-2019-06-10.ipynb
stack.py
stack_test.py
tree.py
tree_test.py
jupman.py
sciprog.py
Rename
sciprog-ds-2019-06-10-FIRSTNAME-LASTNAME-ID
folder: put your name, lastname an id number, likesciprog-ds-2019-06-10-john-doe-432432
From now on, you will be editing the files in that folder. At the end of the exam, that is what will be evaluated.
Edit the files following the instructions in this worksheet for each exercise. Every exercise should take max 25 mins. If it takes longer, leave it and try another exercise.
When done:
if you have unitn login: zip and send to examina.icts.unitn.it/studente
If you don’t have unitn login: tell instructors and we will download your work manually
Part A
Open Jupyter and start editing this notebook exam-2019-06-10.ipynb
A1 ITEA real estate
NOTICE: this part of the exam was ported to softpython website
There you can find a more curated version (notice it may be longer than here)
You will now analyze public real estates in Trentino, which are managed by ITEA agency. Every real estate has a type, and we will find the type distribution.
Data provider: ITEA - dati.trentino.it
A function load_itea
is given to load the dataset (you don’t need to implement it):
[2]:
def load_itea():
"""Loads file data and RETURN a list of dictionaries with the stop times
"""
import csv
with open('data/itea.csv', newline='', encoding='latin-1',) as csvfile:
reader = csv.DictReader(csvfile, delimiter=';')
lst = []
for d in reader:
lst.append(d)
return lst
itea = load_itea()
IMPORTANT: look at the dataset by yourself !
Here we show only first 5 rows, but to get a clear picture of the dataset you need to study it a bit by yourself
[3]:
itea[:5]
[3]:
[OrderedDict([('Tipologia', 'ALTRO'),
('Proprietà', 'ITEA'),
('Indirizzo', "Codice unita': 30100049"),
('Frazione', ''),
('Comune', "BASELGA DI PINE'")]),
OrderedDict([('Tipologia', 'ALLOGGIO'),
('Proprietà', 'ITEA'),
('Indirizzo', "Codice unita': 43100011"),
('Frazione', ''),
('Comune', 'TRENTO')]),
OrderedDict([('Tipologia', 'ALLOGGIO'),
('Proprietà', 'ITEA'),
('Indirizzo', "Codice unita': 43100002"),
('Frazione', ''),
('Comune', 'TRENTO')]),
OrderedDict([('Tipologia', 'ALLOGGIO'),
('Proprietà', 'ITEA'),
('Indirizzo', 'VIALE DELLE ROBINIE 26'),
('Frazione', ''),
('Comune', 'TRENTO')]),
OrderedDict([('Tipologia', 'ALLOGGIO'),
('Proprietà', 'ITEA'),
('Indirizzo', 'VIALE DELLE ROBINIE 26'),
('Frazione', ''),
('Comune', 'TRENTO')])]
A1.1 calc_types_hist
Implement function calc_types_hist
to extract the types ('Tipologia'
) of ITEA real estate and RETURN a histogram which associates to each type its frequency.
You will discover there are three types of apartments: ‘ALLOGGIO’, ‘ALLOGGIO DUPLEX’ and ‘ALLOGGIO MONOLOCALE’. In the resulting histogram you must place only the key ‘ALLOGGIO’ which will be the sum of all of them.
Same goes for ‘POSTO MACCHINA’ (parking lot): there are many of them ( ‘POSTO MACCHINA COMUNE ESTERNO’, ‘POSTO MACCHINA COMUNE INTERNO’, ‘POSTO MACCHINA ESTERNO’, ‘POSTO MACCHINA INTERNO’, ‘POSTO MACCHINA SOTTO TETTOIA’) but we only want to see ‘POSTO MACCHINA’ as key with the sum of all of them. NOTE: Please don’t use 5 ifs, try to come up with some generic code to catch all these cases ..)
[4]:
def calc_types_hist(db):
raise Exception('TODO IMPLEMENT ME !')
calc_types_hist(itea)
[4]:
{'ALTRO': 64,
'ALLOGGIO': 10778,
'POSTO MACCHINA': 3147,
'MAGAZZINO': 143,
'CABINA ELETTRICA': 41,
'LOCALE COMUNE': 28,
'NEGOZIO': 139,
'CANTINA': 40,
'GARAGE': 2221,
'CENTRALE TERMICA': 4,
'UFFICIO': 29,
'TETTOIA': 2,
'ARCHIVIO ITEA': 10,
'SALA / ATTIVITA SOCIALI': 45,
'AREA URBANA': 6,
'ASILO': 1,
'CASERMA': 2,
'LABORATORIO PER ARTI E MESTIERI': 3,
'MUSEO': 1,
'SOFFITTA': 3,
'AMBULATORIO': 1,
'LEGNAIA': 3,
'RUDERE': 1}
A1.2 calc_types_series
Takes a dictionary histogram and RETURN a list of tuples containing key/value pairs, sorted from most frequent iyems to least frequent.
HINT: if you don’t remember how to sort by an element of a tuple, look at this example and also in python documentation about sorting.
Show solution[5]:
def calc_types_series(hist):
raise Exception('TODO IMPLEMENT ME !')
tipologie = calc_types_series(calc_types_hist(itea))
tipologie
[5]:
[('ALLOGGIO', 10778),
('POSTO MACCHINA', 3147),
('GARAGE', 2221),
('MAGAZZINO', 143),
('NEGOZIO', 139),
('ALTRO', 64),
('SALA / ATTIVITA SOCIALI', 45),
('CABINA ELETTRICA', 41),
('CANTINA', 40),
('UFFICIO', 29)]
A1.3 Real estates plot
Once you obtained the series as above, plot the first 10 most frequent items, in decreasing order.
please pay attention to plot title, width and height, axis labels. Everything MUST display in a readable way.
try also to print nice the labels, if they are too long / overlap like for ‘SALA / ATTIVITA SOCIALI’ put carriage returns in a generic way.
[6]:
# write here
[7]:
A2 Air quality
You will now analyze air_quality in Trentino. You are given a dataset which records various pollutants (‘Inquinante’) at various stations ('Stazione'
) in Trentino. Pollutants values can be 'PM10'
, 'Biossido Zolfo'
, and a few others. Each station records some set of pollutants. For each pollutant values are recorded ('Valore'
) 24 times per day.
Data provider: PAT Ag. Provinciale per la protezione dell’Ambiente - dati.trentino.it
A function load_air_quality
is given to load the dataset (you don’t need to implement it):
[8]:
def load_air_quality():
"""Loads file data and RETURN a list of dictionaries with the stop times
"""
import csv
with open('data/air-quality.csv', newline='', encoding='latin-1') as csvfile:
reader = csv.DictReader(csvfile)
lst = []
for d in reader:
lst.append(d)
return lst
air_quality = load_air_quality()
IMPORTANT 1: look at the dataset by yourself !
Here we show only first 5 rows, but to get a clear picture of the dataset you need to study it a bit by yourself
IMPORTANT 2: EVERY field is a STRING, including ‘Valore’ !
[9]:
air_quality[:5]
[9]:
[OrderedDict([('Stazione', 'Parco S. Chiara'),
('Inquinante', 'PM10'),
('Data', '2019-05-04'),
('Ora', '1'),
('Valore', '17'),
('Unità di misura', 'µg/mc')]),
OrderedDict([('Stazione', 'Parco S. Chiara'),
('Inquinante', 'PM10'),
('Data', '2019-05-04'),
('Ora', '2'),
('Valore', '19'),
('Unità di misura', 'µg/mc')]),
OrderedDict([('Stazione', 'Parco S. Chiara'),
('Inquinante', 'PM10'),
('Data', '2019-05-04'),
('Ora', '3'),
('Valore', '17'),
('Unità di misura', 'µg/mc')]),
OrderedDict([('Stazione', 'Parco S. Chiara'),
('Inquinante', 'PM10'),
('Data', '2019-05-04'),
('Ora', '4'),
('Valore', '15'),
('Unità di misura', 'µg/mc')]),
OrderedDict([('Stazione', 'Parco S. Chiara'),
('Inquinante', 'PM10'),
('Data', '2019-05-04'),
('Ora', '5'),
('Valore', '13'),
('Unità di misura', 'µg/mc')])]
Now implement the following function:
Show solution[10]:
def calc_avg_pollution(db):
""" RETURN a dictionary containing two elements tuples as keys:
- first tuple element is the station ('Stazione'),
- second tuple element is the name of a pollutant ('Inquinante')
To each tuple key, you must associate as value the average for that station
_and_ pollutant over all days.
"""
raise Exception('TODO IMPLEMENT ME !')
calc_avg_pollution(air_quality)
[10]:
{('Parco S. Chiara', 'PM10'): 11.385752688172044,
('Parco S. Chiara', 'PM2.5'): 7.9471544715447155,
('Parco S. Chiara', 'Biossido di Azoto'): 20.828146143437078,
('Parco S. Chiara', 'Ozono'): 66.69541778975741,
('Parco S. Chiara', 'Biossido Zolfo'): 1.2918918918918918,
('Via Bolzano', 'PM10'): 12.526881720430108,
('Via Bolzano', 'Biossido di Azoto'): 29.28493894165536,
('Via Bolzano', 'Ossido di Carbonio'): 0.5964769647696474,
('Piana Rotaliana', 'PM10'): 9.728744939271255,
('Piana Rotaliana', 'Biossido di Azoto'): 15.170068027210885,
('Piana Rotaliana', 'Ozono'): 67.03633916554509,
('Rovereto', 'PM10'): 9.475806451612904,
('Rovereto', 'PM2.5'): 7.764784946236559,
('Rovereto', 'Biossido di Azoto'): 16.284167794316645,
('Rovereto', 'Ozono'): 70.54655870445345,
('Borgo Valsugana', 'PM10'): 11.819407008086253,
('Borgo Valsugana', 'PM2.5'): 7.413746630727763,
('Borgo Valsugana', 'Biossido di Azoto'): 15.73806275579809,
('Borgo Valsugana', 'Ozono'): 58.599730458221025,
('Riva del Garda', 'PM10'): 9.912398921832883,
('Riva del Garda', 'Biossido di Azoto'): 17.125845737483086,
('Riva del Garda', 'Ozono'): 68.38159675236807,
('A22 (Avio)', 'PM10'): 9.651821862348179,
('A22 (Avio)', 'Biossido di Azoto'): 33.0650406504065,
('A22 (Avio)', 'Ossido di Carbonio'): 0.4228848821081822,
('Monte Gaza', 'PM10'): 7.794520547945205,
('Monte Gaza', 'Biossido di Azoto'): 4.34412955465587,
('Monte Gaza', 'Ozono'): 99.0858310626703}
Part B
B1 Theory
Let L
be a list containing n
lists, each of them of size m
. Return the computational complexity of function fun()
with respect to n
and m
.
Write the solution in separate ``theory.txt`` file
def fun(L):
for r1 in L:
for r2 in L:
if r1 != r2 and sum(r1) == sum(r2):
print("Similar:")
print(r1)
print(r2)
B2 WStack
Using a text editor, open file stack.py
. You will find a WStack
class skeleton which represents a simple stack that can only contain integers.
B2.1 implement class WStack
Fill in missing methods in class WStack
in the order they are presented so to have a .weight()
method that returns the total sum of integers in the stack in O(1)
time.
Example:
[11]:
from stack_sol import *
[12]:
s = WStack()
[13]:
print(s)
WStack: weight=0 elements=[]
[14]:
s.push(7)
[15]:
print(s)
WStack: weight=7 elements=[7]
[16]:
s.push(4)
[17]:
print(s)
WStack: weight=11 elements=[7, 4]
[18]:
s.push(2)
[19]:
s.pop()
[19]:
2
[20]:
print(s)
WStack: weight=11 elements=[7, 4]
B2.2 accumulate
Implement function accumulate
:
def accumulate(stack1, stack2, min_amount):
""" Pushes on stack2 elements taken from stack1 until the weight of
stack2 is equal or exceeds the given min_amount
- if the given min_amount cannot possibly be reached because
stack1 has not enough weight, raises early ValueError without
changing stack1.
- DO NOT access internal fields of stacks, only use class methods.
- MUST perform in O(n) where n is the size of stack1
- NOTE: this function is defined *outside* the class !
"""
Testing: python -m unittest stacks_test.AccumulateTest
Example:
[21]:
s1 = WStack()
print(s1)
WStack: weight=0 elements=[]
[22]:
s1.push(2)
s1.push(9)
s1.push(5)
s1.push(3)
[23]:
print(s1)
WStack: weight=19 elements=[2, 9, 5, 3]
[24]:
s2 = WStack()
print(s2)
WStack: weight=0 elements=[]
[25]:
s2.push(1)
s2.push(7)
s2.push(4)
[26]:
print(s2)
WStack: weight=12 elements=[1, 7, 4]
[27]:
# attempts to reach in s2 a weight of at least 17
[28]:
accumulate(s1,s2,17)
[29]:
print(s1)
WStack: weight=11 elements=[2, 9]
Two top elements were taken from s1 and now s2 has a weight of 20, which is >= 17
[30]:
print(s2)
WStack: weight=20 elements=[1, 7, 4, 3, 5]
B3 GenericTree
Open file tree.py
in a text editor and read following instructions.
B3.1 is_triangle
A triangle is a node which has exactly two children.
Let’s see some example:
a
/ \
/ \
b ----- c
/|\ /
d-e-f g
/ \
h---i
/
l
The tree above can also be represented like this:
a
├b
|├d
|├e
|└f
└c
└g
├h
└i
└l
node
a
is a triangle because has exactly two childrenb
andc
, note it doesn’t matter ifb
orc
have children)b
is not a triangle (has 3 children)c
andi
are not triangles (have only 1 child)g
is a triangle as it has exactly two childrenh
andi
d
,e
,f
,h
andl
are not triangles, because they have zero children
Now implement this method:
def is_triangle(self, elems):
""" RETURN True if this node is a triangle matching the data
given by list elems.
In order to match:
- first list item must be equal to this node data
- second list item must be equal to this node first child data
- third list item must be equal to this node second child data
- if elems has less than three elements, raises ValueError
"""
Testing: python -m unittest tree_test.IsTriangleTest
Examples:
[31]:
from tree_test import gt
[32]:
# this is the tree from the example above
tb = gt('b', gt('d', gt('e'), gt('f')))
tg = gt('g', gt('h'), gt('i', gt('l')))
ta = gt('a', tb, gt('c', tg))
ta.is_triangle(['a','b','c'])
[32]:
True
[33]:
ta.is_triangle(['b','c','a'])
[33]:
False
[34]:
tb.is_triangle(['b','d','e'])
[34]:
False
[35]:
tg.is_triangle(['g','h','i'])
[35]:
True
[36]:
tg.is_triangle(['g','i','h'])
[36]:
False
B3.2 has_triangle
Implement this method:
def has_triangle(self, elems):
""" RETURN True if this node *or one of its descendants* is a triangle
matching given elems. Otherwise, return False.
- a recursive solution is acceptable
"""
Testing: python -m unittest tree_test.HasTriangleTest
Examples:
[37]:
# example tree seen at the beginning
tb = gt('b', gt('d', gt('e'), gt('f')))
tg = gt('g', gt('h'), gt('i', gt('l')))
tc = gt('c', tg)
ta = gt('a', tb, tc)
ta.has_triangle(['a','b','c'])
[37]:
True
[38]:
ta.has_triangle(['a','c','b'])
[38]:
False
[39]:
ta.has_triangle(['b','c','a'])
[39]:
False
[40]:
tb.is_triangle(['b','d','e'])
[40]:
False
[41]:
tg.has_triangle(['g','h','i'])
[41]:
True
[42]:
tc.has_triangle(['g','h','i']) # check recursion
[42]:
True
[43]:
ta.has_triangle(['g','h','i']) # check recursion
[43]:
True