A new taxi company hired an advertising agency to advertise their services on screens at Times Square in New York, NY. The marketing company was tasked with identifying the five best screens for their client. To reach the maximum number of potential clients for the new taxi company, they decided to use the following criterion: the average number of taxi pickups in close proximity to an advertising screen.
The marketing company found two public datasets that they are going to use:
1. The list of screens at Times Square
The illustration above (which was created by importing a given dataset to Google Maps) visualizes the locations of the screens.
Large datasets like this one usually consist of (a) a data dictionary, a table that lists all the fields in the dataset, and (b) the actual dataset in a variety of formats (Excel-compatible comma-separated values (.csv), XML, or JSON).
Familiarize yourself with both datasets. Note that the files in the second dataset are very large (up to 1 GB).
The ridership data is also given in separate files grouped by taxi company (e.g., Yellow). Pick a dataset for any company that services the Times Square area.
With the dataset structures (field names, or dictionaries) in mind, use Word to design a flowchart of an algorithm that identifies the top five screens that taxi riders would see most often.
Note that you don’t need to provide code, and you don’t need to calculate the top screens; just provide pseudocode for an algorithm that would perform that task.
Pseudocode is a somewhat structured description of the steps of an algorithm written in plain English. You may also use variable names to refer to the same data multiple times if needed.
An example of pseudocode
Assume we need to identify the list of courses that the top student in the current class still has to take to graduate. We have three datasets: (1) the students' grades in the current class, as (student_id, total_grade) pairs; (2) the list of past enrollments, as (student_id, course_id) pairs; (3) the list of course IDs required by the program.
1. Sort dataset (1) by the total_grade values (descending).
2. Select the first element of the sorted list and save its student_id as top_student_id.
3. Select all pairs from dataset (2) where student_id equals top_student_id, and store each course_id in a new list (4).
4. Add each course in (3) that is not present in (4) to a new list (5).
Then the new list (5) is the list of the courses that the top student has to take.
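For readers who want to check the logic of the example, the four steps can be traced in a short Python sketch. This is only an illustration of the pseudocode above, not part of the assignment: the function name remaining_courses and the sample rows are made up, and the datasets are assumed to fit in memory as plain lists of tuples.

```python
def remaining_courses(grades, enrollments, required):
    """Return the required courses the top student has not yet taken.

    grades:      list of (student_id, total_grade) pairs, dataset (1)
    enrollments: list of (student_id, course_id) pairs, dataset (2)
    required:    list of course_id values, dataset (3)
    """
    # Steps 1-2: sort by total_grade (descending) and keep the first student_id.
    ranked = sorted(grades, key=lambda row: row[1], reverse=True)
    top_student_id = ranked[0][0]

    # Step 3: collect the courses the top student has already taken, list (4).
    taken = {course_id
             for student_id, course_id in enrollments
             if student_id == top_student_id}

    # Step 4: required courses missing from (4) form list (5).
    return [course_id for course_id in required if course_id not in taken]


# Made-up sample data, for illustration only.
grades = [("s1", 88.0), ("s2", 95.5), ("s3", 72.0)]
enrollments = [("s2", "CS101"), ("s1", "CS101"), ("s2", "MATH201")]
required = ["CS101", "CS102", "MATH201", "STAT301"]

print(remaining_courses(grades, enrollments, required))  # ['CS102', 'STAT301']
```

The sketch assumes dataset (1) is not empty and, if two students tie for the highest grade, keeps whichever appears first in the input. Pseudocode does not have to settle such details, but it helps to state them when they affect the result.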