Indexing

Download exercises zip

Browse files online

In this worksheet we will try to apply some indexing technique to improve our algorithms performance. Typically, if you want to avoid scanning an entire sequence repeatedly, a good idea might be to first preprocess it by indexing its contents in a dictionary which maps subsequence of elements to their positions.

THESE EXERCISES ARE ABOUT CUSTOM INDEXING, NOT STRINGS!

  • DO NOT convert everything to a string, use regexes, nor use string methods like .index(), .replace(), etc

1. Exercise - chains

You will be doing exercises about chainable lists, using plain old Python lists: we just want to detect duplicates and chain sequences fast.

Start editing the file chains.py and read the following.

1.1 has_duplicates

Implement this function:

def has_duplicates(external_list):
    """ Returns True if internal lists inside external_list contain duplicates,
        False otherwise. See tests for more examples.

        INPUT: a list of list of strings, possibily containing repetitions, like:

            [
                ['ab', 'c', 'de'],
                ['v', 'a'],
                ['c', 'de', 'b']
            ]

        OUTPUT: Boolean  (in the example above it would be True)
    """
  • MUST RUN IN \(O(m*n)\), where \(m\) is the number of internal lists and \(n\) is the length of the longest internal list (just to calculate complexity think about the scenario where all lists have equal size)

  • HINT: Given the above constraint, whenever you find an item, you cannot start another for loop to check if the item exists elsewhere - that would cost around \(O(m^2*n)\). Instead, you need to keep track of found items with some other data structure of your choice, which must allow fast read and writes.

Testing: python3 -m unittest chains_test.TestHasDuplicates

1.2 chain

Implement this function:

def chain(external_list):
    """ Takes a list of list of strings and return a list containing all the strings
        from external_list in sequence, joined by the ending and starting strings
        of the internal lists. See tests for more examples.

        INPUT: a list of list of strings , like:

                [
                    ['ab', 'c', 'de'],
                    ['gh', 'i'],
                    ['de', 'f', 'gh']
                ]

        OUTPUT: a list of strings, like   ['ab', 'c', 'de', 'f', 'gh', 'i']
    """

It is assumed that

  • external_list always contains at least one internal list

  • internal lists always contain at least two strings

  • no string is duplicated among all internal lists

Output sequence is constructed as follows:

  • it starts will all the items from the first internal list

  • successive items are taken from an internal list which starts with a string equal to the previous taken internal list last string

  • sequence must not contain repetitions (so joint strings are taken only once).

  • all internal lists must be used. If this is not possible (because there are no joint strings), raise ValueError

Be careful that:

  • MUST BE WRITTEN WITH STANDARD PYTHON FUNCTIONS

  • MUST RUN IN \(O(m * n)\), where \(m\) is the number of internal lists and \(n\) is the length of the longest internal list (just to calculate complexity think about the scenario where all lists have equal size)

  • HINT: Given the above constraint, whenever you find a string, you cannot start another for loop to check if the string exists elsewhere (that would likely introduce a quadratic \(m^2\) factor) Instead, you need to first keep track of both starting strings and the list they are contained within using some other data structure of your choice, which must allow fast read and writes.

  • if possible avoid slicing (which doubles memory usage) and use itertools.islice instead

Testing: python3 -m unittest chains_test.TestChain

2. Exercise - Bank

A bank stores all its transactions in a log: possible types of transactions are opening an account, transfering some amount of money, withdrawal, etc. For simplicity, we represent each transaction type with a character, and in the log which is a Python list we put only such characters (we don’t care about their meaning).

The bank knows some sequences of events are extremely rare, and probably denote suspicious activity. For simplicitly, we assume such sequences only contain three transactions and represent them in tuples called triseqs.

Note this was given at an exam, we changed biseqs to triseqs to make it more realistic.

A triseq can occur multiple times, but you can consider the number of repetitions negligible with respect to bank log size.

Start editing bank.py and read the following

2.1 bank constructor, log and pos

Implement following methods, adding some appropriate indexing data structure.

BEFORE IMPLEMENTING, DOUBLE CHECK COMPLEXITY REQUIREMENTS!

The idea is that given an entire triseq (not just the first character !), we want to find all of its (rare) occurrences very fast in \(O(1)\)

Since performance is important here, passing tests does not imply your work is correct!

class Bank:

    def __init__(self):
        """ Initializes an empty bank

            Add a data structure of your choice for indexing the log
        """
        self._trans = []
        raise Exception("TODO IMPLEMENT ME!")

    def log(self, transaction):
        """ Appends transaction to the log

            - REMEMBER to also update the index
            - MUST EXECUTE IN O(1)
        """
def pos(self, triseq):
    """ RETURN a NEW list with all the indeces where the sequence of
        three transactions can be found

        - MUST execute in O(1)
    """

Testing: python3 -m unittest bank_test.LogTest

Example:

[4]:
from bank_sol import *

bank = Bank()
print(bank)
Bank:
[5]:
bank.log('a')
[6]:
print(bank)
Bank: a
[7]:
print(bank.pos( ('a','b','a') ))  # <-- a triseq is a tuple
[]
[8]:
bank.log('b')
[9]:
print(bank.pos( ('a','b','a') ))
[]
[10]:
bank.log('a')
[11]:
print(bank.pos( ('a','b','a') ))
[0]
[12]:
bank.log('a')
bank.log('a')
bank.log('b')
bank.log('a')
bank.log('a')
bank.log('a')
bank.log('b')
bank.log('a')
print(bank)
Bank: a,b,a,a,a,b,a,a,a,b,a
[13]:
print(bank.pos(('a','b','a')))
[0, 4, 8]
[14]:
print(bank.pos(('a','a','a')))
[2, 6]
[15]:
print(bank.pos(('b','a','a')))
[1, 5]
[16]:
print(bank.pos(('b','a','c')))
[]

2.2 bank revert

def revert(self):
    """ Completely eliminates last transaction and RETURN it

        - if bank is empty, raises IndexError

        - REMEMBER to update any index referring to it
        - *MUST* EXECUTE IN O(1)
    """

Testing: python3 -m unittest RevertTest

Example:

[18]:
bank = Bank()
bank.log('a')
bank.log('b')
bank.log('a')
bank.log('a')
bank.log('a')
bank.log('b')
bank.log('a')
bank.log('a')
print(bank)
Bank: a,b,a,a,a,b,a,a
[19]:
print(bank.pos(('b','a','a')))
[1, 5]
[20]:
bank.revert()
[20]:
'a'
[21]:
print(bank)
Bank: a,b,a,a,a,b,a
[22]:
print(bank.pos(('b','a','a')))
[1]
[23]:
bank.revert()
[23]:
'a'
[24]:
print(bank)
Bank: a,b,a,a,a,b
[25]:
print(bank.pos(('b','a','a')))
[1]
[26]:
bank.revert()
[26]:
'b'
[27]:
print(bank)
Bank: a,b,a,a,a
[28]:
print(bank.pos(('b','a','a')))
[1]
[29]:
bank.revert()
[29]:
'a'
[30]:
print(bank)
Bank: a,b,a,a
[31]:
print(bank.pos(('b','a','a')))
[1]

2.3 bank max_interval

Typically, there is a triseq which triggers a period of suspicious activity, and there is another triseq which ends it. So given a start and an end triseq, the bank wants a report of all the transactions that happened in between:

def max_interval(self, tri_start, tri_end):
    """ RETURN a list with all the transactions occurred between
        the *largest* interval among tri-sequences tri_start and tri_end

        - tri_start and tri_end are EXCLUDED
        - if tri_start / tri_end are not found, or if tri_end is before/includes tri_start,
          raise LookupError


        - DO *NOT* MODIFY the data structure
        - MUST EXECUTE IN O(k) where k is the length of the *largest* interval you can return
        - consider number of repetitions a negligible size
    """

Testing: python -m unittest bank_test.MaxIntervalTest

Example:

[33]:
bank = Bank()
bank.log('c')
bank.log('d')
bank.log('c')
bank.log('a') #      a
bank.log('b') # <--- b
bank.log('e') #      e
bank.log('f') #      --- f     |
bank.log('a') #          a     |
bank.log('f') #          f     | k
bank.log('b') #          b     |
bank.log('c') #          c     |
bank.log('b') #      --- b     |
bank.log('a') # <--- a
bank.log('f') #      f
bank.log('b') #      b
bank.log('e')
bank.log('a')
bank.log('b')
bank.log('e')
bank.log('l')
bank.max_interval( ('a','b','e'), ('a','f','b') )
[33]:
['f', 'a', 'f', 'b', 'c', 'b']