Exam - Mon 10, Jun 2019

Scientific Programming - Data Science @ University of Trento

Download exercises and solution

Introduction

  • Taking part to this exam erases any vote you had before

What to do

  1. Download sciprog-ds-2019-06-10-exam.zip and extract it on your desktop. Folder content should be like this:

sciprog-ds-2019-06-10-FIRSTNAME-LASTNAME-ID
    exam-2019-06-10.ipynb
    stack.py
    stack_test.py
    tree.py
    tree_test.py
    jupman.py
    sciprog.py
  1. Rename sciprog-ds-2019-06-10-FIRSTNAME-LASTNAME-ID folder: put your name, lastname an id number, like sciprog-ds-2019-06-10-john-doe-432432

From now on, you will be editing the files in that folder. At the end of the exam, that is what will be evaluated.

  1. Edit the files following the instructions in this worksheet for each exercise. Every exercise should take max 25 mins. If it takes longer, leave it and try another exercise.

  2. When done:

  • if you have unitn login: zip and send to examina.icts.unitn.it/studente

  • If you don’t have unitn login: tell instructors and we will download your work manually

Part A

Open Jupyter and start editing this notebook exam-2019-06-10.ipynb

A1 ITEA real estate

NOTICE: this part of the exam was ported to softpython website

There you can find a more curated version (notice it may be longer than here)

You will now analyze public real estates in Trentino, which are managed by ITEA agency. Every real estate has a type, and we will find the type distribution.

Data provider: ITEA - dati.trentino.it

A function load_itea is given to load the dataset (you don’t need to implement it):

[2]:
def load_itea():
    """Loads file data and RETURN a list of dictionaries with the stop times
    """

    import csv
    with open('data/itea.csv', newline='',  encoding='latin-1',) as csvfile:
        reader = csv.DictReader(csvfile,  delimiter=';')
        lst = []
        for d in reader:
            lst.append(d)
    return lst


itea = load_itea()

IMPORTANT: look at the dataset by yourself !

Here we show only first 5 rows, but to get a clear picture of the dataset you need to study it a bit by yourself

[3]:
itea[:5]
[3]:
[OrderedDict([('Tipologia', 'ALTRO'),
              ('Proprietà', 'ITEA'),
              ('Indirizzo', "Codice unita': 30100049"),
              ('Frazione', ''),
              ('Comune', "BASELGA DI PINE'")]),
 OrderedDict([('Tipologia', 'ALLOGGIO'),
              ('Proprietà', 'ITEA'),
              ('Indirizzo', "Codice unita': 43100011"),
              ('Frazione', ''),
              ('Comune', 'TRENTO')]),
 OrderedDict([('Tipologia', 'ALLOGGIO'),
              ('Proprietà', 'ITEA'),
              ('Indirizzo', "Codice unita': 43100002"),
              ('Frazione', ''),
              ('Comune', 'TRENTO')]),
 OrderedDict([('Tipologia', 'ALLOGGIO'),
              ('Proprietà', 'ITEA'),
              ('Indirizzo', 'VIALE DELLE ROBINIE 26'),
              ('Frazione', ''),
              ('Comune', 'TRENTO')]),
 OrderedDict([('Tipologia', 'ALLOGGIO'),
              ('Proprietà', 'ITEA'),
              ('Indirizzo', 'VIALE DELLE ROBINIE 26'),
              ('Frazione', ''),
              ('Comune', 'TRENTO')])]

A1.1 calc_types_hist

Implement function calc_types_hist to extract the types ('Tipologia') of ITEA real estate and RETURN a histogram which associates to each type its frequency.

  • You will discover there are three types of apartments: ‘ALLOGGIO’, ‘ALLOGGIO DUPLEX’ and ‘ALLOGGIO MONOLOCALE’. In the resulting histogram you must place only the key ‘ALLOGGIO’ which will be the sum of all of them.

  • Same goes for ‘POSTO MACCHINA’ (parking lot): there are many of them ( ‘POSTO MACCHINA COMUNE ESTERNO’, ‘POSTO MACCHINA COMUNE INTERNO’, ‘POSTO MACCHINA ESTERNO’, ‘POSTO MACCHINA INTERNO’, ‘POSTO MACCHINA SOTTO TETTOIA’) but we only want to see ‘POSTO MACCHINA’ as key with the sum of all of them. NOTE: Please don’t use 5 ifs, try to come up with some generic code to catch all these cases ..)

Show solution
[4]:
def calc_types_hist(db):
    raise Exception('TODO IMPLEMENT ME !')

calc_types_hist(itea)
[4]:
{'ALTRO': 64,
 'ALLOGGIO': 10778,
 'POSTO MACCHINA': 3147,
 'MAGAZZINO': 143,
 'CABINA ELETTRICA': 41,
 'LOCALE COMUNE': 28,
 'NEGOZIO': 139,
 'CANTINA': 40,
 'GARAGE': 2221,
 'CENTRALE TERMICA': 4,
 'UFFICIO': 29,
 'TETTOIA': 2,
 'ARCHIVIO ITEA': 10,
 'SALA / ATTIVITA SOCIALI': 45,
 'AREA URBANA': 6,
 'ASILO': 1,
 'CASERMA': 2,
 'LABORATORIO PER ARTI E MESTIERI': 3,
 'MUSEO': 1,
 'SOFFITTA': 3,
 'AMBULATORIO': 1,
 'LEGNAIA': 3,
 'RUDERE': 1}

A1.2 calc_types_series

Takes a dictionary histogram and RETURN a list of tuples containing key/value pairs, sorted from most frequent iyems to least frequent.

HINT: if you don’t remember how to sort by an element of a tuple, look at this example and also in python documentation about sorting.

Show solution
[5]:
def calc_types_series(hist):
    raise Exception('TODO IMPLEMENT ME !')

tipologie = calc_types_series(calc_types_hist(itea))

tipologie
[5]:
[('ALLOGGIO', 10778),
 ('POSTO MACCHINA', 3147),
 ('GARAGE', 2221),
 ('MAGAZZINO', 143),
 ('NEGOZIO', 139),
 ('ALTRO', 64),
 ('SALA / ATTIVITA SOCIALI', 45),
 ('CABINA ELETTRICA', 41),
 ('CANTINA', 40),
 ('UFFICIO', 29)]

A1.3 Real estates plot

Once you obtained the series as above, plot the first 10 most frequent items, in decreasing order.

  • please pay attention to plot title, width and height, axis labels. Everything MUST display in a readable way.

  • try also to print nice the labels, if they are too long / overlap like for ‘SALA / ATTIVITA SOCIALI’ put carriage returns in a generic way.

[6]:
# write here

Show solution
[7]:

../../../_images/exams_2019-06-10_solutions_exam-2019-06-10-sol_24_0.png

A2 Air quality

You will now analyze air_quality in Trentino. You are given a dataset which records various pollutants (‘Inquinante’) at various stations ('Stazione') in Trentino. Pollutants values can be 'PM10', 'Biossido Zolfo', and a few others. Each station records some set of pollutants. For each pollutant values are recorded ('Valore') 24 times per day.

Data provider: PAT Ag. Provinciale per la protezione dell’Ambiente - dati.trentino.it

A function load_air_quality is given to load the dataset (you don’t need to implement it):

[8]:
def load_air_quality():
    """Loads file data and RETURN a list of dictionaries with the stop times
    """

    import csv
    with open('data/air-quality.csv', newline='', encoding='latin-1') as csvfile:
        reader = csv.DictReader(csvfile)
        lst = []
        for d in reader:
            lst.append(d)
    return lst


air_quality = load_air_quality()

IMPORTANT 1: look at the dataset by yourself !

Here we show only first 5 rows, but to get a clear picture of the dataset you need to study it a bit by yourself

IMPORTANT 2: EVERY field is a STRING, including ‘Valore’ !

[9]:
air_quality[:5]
[9]:
[OrderedDict([('Stazione', 'Parco S. Chiara'),
              ('Inquinante', 'PM10'),
              ('Data', '2019-05-04'),
              ('Ora', '1'),
              ('Valore', '17'),
              ('Unità di misura', 'µg/mc')]),
 OrderedDict([('Stazione', 'Parco S. Chiara'),
              ('Inquinante', 'PM10'),
              ('Data', '2019-05-04'),
              ('Ora', '2'),
              ('Valore', '19'),
              ('Unità di misura', 'µg/mc')]),
 OrderedDict([('Stazione', 'Parco S. Chiara'),
              ('Inquinante', 'PM10'),
              ('Data', '2019-05-04'),
              ('Ora', '3'),
              ('Valore', '17'),
              ('Unità di misura', 'µg/mc')]),
 OrderedDict([('Stazione', 'Parco S. Chiara'),
              ('Inquinante', 'PM10'),
              ('Data', '2019-05-04'),
              ('Ora', '4'),
              ('Valore', '15'),
              ('Unità di misura', 'µg/mc')]),
 OrderedDict([('Stazione', 'Parco S. Chiara'),
              ('Inquinante', 'PM10'),
              ('Data', '2019-05-04'),
              ('Ora', '5'),
              ('Valore', '13'),
              ('Unità di misura', 'µg/mc')])]

Now implement the following function:

Show solution
[10]:
def calc_avg_pollution(db):
    """ RETURN a dictionary containing two elements tuples as keys:
        -  first tuple element is the station ('Stazione'),
        - second tuple element  is the name of a pollutant ('Inquinante')

        To each tuple key, you must associate as value the average for that station
        _and_ pollutant over all days.

    """
    raise Exception('TODO IMPLEMENT ME !')

calc_avg_pollution(air_quality)
[10]:
{('Parco S. Chiara', 'PM10'): 11.385752688172044,
 ('Parco S. Chiara', 'PM2.5'): 7.9471544715447155,
 ('Parco S. Chiara', 'Biossido di Azoto'): 20.828146143437078,
 ('Parco S. Chiara', 'Ozono'): 66.69541778975741,
 ('Parco S. Chiara', 'Biossido Zolfo'): 1.2918918918918918,
 ('Via Bolzano', 'PM10'): 12.526881720430108,
 ('Via Bolzano', 'Biossido di Azoto'): 29.28493894165536,
 ('Via Bolzano', 'Ossido di Carbonio'): 0.5964769647696474,
 ('Piana Rotaliana', 'PM10'): 9.728744939271255,
 ('Piana Rotaliana', 'Biossido di Azoto'): 15.170068027210885,
 ('Piana Rotaliana', 'Ozono'): 67.03633916554509,
 ('Rovereto', 'PM10'): 9.475806451612904,
 ('Rovereto', 'PM2.5'): 7.764784946236559,
 ('Rovereto', 'Biossido di Azoto'): 16.284167794316645,
 ('Rovereto', 'Ozono'): 70.54655870445345,
 ('Borgo Valsugana', 'PM10'): 11.819407008086253,
 ('Borgo Valsugana', 'PM2.5'): 7.413746630727763,
 ('Borgo Valsugana', 'Biossido di Azoto'): 15.73806275579809,
 ('Borgo Valsugana', 'Ozono'): 58.599730458221025,
 ('Riva del Garda', 'PM10'): 9.912398921832883,
 ('Riva del Garda', 'Biossido di Azoto'): 17.125845737483086,
 ('Riva del Garda', 'Ozono'): 68.38159675236807,
 ('A22 (Avio)', 'PM10'): 9.651821862348179,
 ('A22 (Avio)', 'Biossido di Azoto'): 33.0650406504065,
 ('A22 (Avio)', 'Ossido di Carbonio'): 0.4228848821081822,
 ('Monte Gaza', 'PM10'): 7.794520547945205,
 ('Monte Gaza', 'Biossido di Azoto'): 4.34412955465587,
 ('Monte Gaza', 'Ozono'): 99.0858310626703}

Part B

B1 Theory

Let L be a list containing n lists, each of them of size m. Return the computational complexity of function fun() with respect to n and m.

Write the solution in separate ``theory.txt`` file

def fun(L):
    for r1 in L:
        for r2 in L:
            if r1 != r2 and sum(r1) == sum(r2):
                print("Similar:")
                print(r1)
                print(r2)
Show answer

B2 WStack

Using a text editor, open file stack.py. You will find a WStack class skeleton which represents a simple stack that can only contain integers.

B2.1 implement class WStack

Fill in missing methods in class WStack in the order they are presented so to have a .weight() method that returns the total sum of integers in the stack in O(1) time.

Example:

[11]:
from stack_sol import *
[12]:
s = WStack()
[13]:
print(s)
WStack: weight=0 elements=[]
[14]:
s.push(7)
[15]:
print(s)
WStack: weight=7 elements=[7]
[16]:
s.push(4)
[17]:
print(s)
WStack: weight=11 elements=[7, 4]
[18]:
s.push(2)
[19]:
s.pop()
[19]:
2
[20]:
print(s)
WStack: weight=11 elements=[7, 4]

B2.2 accumulate

Implement function accumulate:

def accumulate(stack1, stack2, min_amount):
    """ Pushes on stack2 elements taken from stack1 until the weight of
        stack2 is equal or exceeds the given min_amount

        - if the given min_amount cannot possibly be reached because
          stack1 has not enough weight, raises early ValueError without
          changing stack1.
        - DO NOT access internal fields of stacks, only use class methods.
        - MUST perform in O(n) where n is the size of stack1
        - NOTE: this function is defined *outside* the class !
    """

Testing: python -m unittest stacks_test.AccumulateTest

Example:

[21]:


s1 = WStack() print(s1)
WStack: weight=0 elements=[]
[22]:
s1.push(2)
s1.push(9)
s1.push(5)
s1.push(3)

[23]:
print(s1)
WStack: weight=19 elements=[2, 9, 5, 3]
[24]:
s2 = WStack()
print(s2)
WStack: weight=0 elements=[]
[25]:
s2.push(1)
s2.push(7)
s2.push(4)

[26]:
print(s2)

WStack: weight=12 elements=[1, 7, 4]
[27]:
# attempts to reach in s2 a weight of at least 17
[28]:
accumulate(s1,s2,17)
[29]:
print(s1)
WStack: weight=11 elements=[2, 9]

Two top elements were taken from s1 and now s2 has a weight of 20, which is >= 17

[30]:
print(s2)
WStack: weight=20 elements=[1, 7, 4, 3, 5]

B3 GenericTree

Open file tree.py in a text editor and read following instructions.

B3.1 is_triangle

A triangle is a node which has exactly two children.

Let’s see some example:

      a
    /   \
   /     \
  b ----- c
 /|\     /
d-e-f   g
       / \
      h---i
         /
        l

The tree above can also be represented like this:

a
├b
|├d
|├e
|└f
└c
 └g
  ├h
  └i
   └l
  • node a is a triangle because has exactly two children b and c, note it doesn’t matter if b or c have children)

  • b is not a triangle (has 3 children)

  • c and i are not triangles (have only 1 child)

  • g is a triangle as it has exactly two children h and i

  • d, e, f, h and l are not triangles, because they have zero children

Now implement this method:

def is_triangle(self, elems):
    """ RETURN True if this node is a triangle matching the data
        given by list elems.

        In order to match:
        - first list item must be equal to this node data
        - second list item must be equal to this node first child data
        - third list item must be equal to this node second child data

        - if elems has less than three elements, raises ValueError
    """

Testing: python -m unittest tree_test.IsTriangleTest

Examples:

[31]:
from tree_test import gt
[32]:

# this is the tree from the example above tb = gt('b', gt('d', gt('e'), gt('f'))) tg = gt('g', gt('h'), gt('i', gt('l'))) ta = gt('a', tb, gt('c', tg)) ta.is_triangle(['a','b','c'])
[32]:
True
[33]:
ta.is_triangle(['b','c','a'])
[33]:
False
[34]:
tb.is_triangle(['b','d','e'])
[34]:
False
[35]:
tg.is_triangle(['g','h','i'])
[35]:
True
[36]:
tg.is_triangle(['g','i','h'])
[36]:
False

B3.2 has_triangle

Implement this method:

def has_triangle(self, elems):
    """ RETURN True if this node *or one of its descendants* is a triangle
        matching given elems. Otherwise, return False.

        - a recursive solution is acceptable
    """

Testing: python -m unittest tree_test.HasTriangleTest

Examples:

[37]:

# example tree seen at the beginning tb = gt('b', gt('d', gt('e'), gt('f'))) tg = gt('g', gt('h'), gt('i', gt('l'))) tc = gt('c', tg) ta = gt('a', tb, tc) ta.has_triangle(['a','b','c'])
[37]:
True
[38]:
ta.has_triangle(['a','c','b'])

[38]:
False
[39]:
ta.has_triangle(['b','c','a'])

[39]:
False
[40]:
tb.is_triangle(['b','d','e'])

[40]:
False
[41]:
tg.has_triangle(['g','h','i'])

[41]:
True
[42]:
tc.has_triangle(['g','h','i'])  # check recursion

[42]:
True
[43]:
ta.has_triangle(['g','h','i'])  # check recursion
[43]:
True