CODE | CIS5113 | ||||||||||||
TITLE | Large Scale Databases | ||||||||||||
UM LEVEL | 05 - Postgraduate Modular Diploma or Degree Course | ||||||||||||
MQF LEVEL | 7 | ||||||||||||
ECTS CREDITS | 5 | ||||||||||||
DEPARTMENT | Computer Information Systems | ||||||||||||
DESCRIPTION | This study-unit focuses on current research topics in databases and on data modelling for the consolidation and presentation of an organisation’s data infrastructure. Following from this, part of the content covers the scaling of data processing operations under varying data consistency requirements and, conversely, starting from operational targets and determining an acceptable level of operational support. Given that a number of databases are accessible to an organisation, it is in a position to consolidate these sources so as to provide "a subject oriented, nonvolatile, integrated, time variant collection of data in support of management's decisions" (B. Inmon). Additionally, combining a company’s data assets with data streaming from various sources and structures creates the possibility for it to investigate and follow up opportunities that arise from day to day. This unit presents knowledge and know-how on building repositories over which data warehousing and data mining exercises can be executed. It covers the handling of large data sets, which can originate from transactional systems or pattern extraction programs (e.g., data extraction from large repositories). To better accommodate the velocity aspect of Big Data, an overview of data stream processing systems and their respective scenarios is presented. Design and implementation techniques in SQL and procedural extensions to SQL are presented. Furthermore, specialised tools and techniques are studied to consolidate and validate the quality of data. It is now accepted that a portion of this processing is farmed out to less general-purpose DBMSs in a direct effort to reach performance targets. A substantial part is devoted to query design and optimisation for these massive data repositories.
Study-unit Aims: The aims of this study-unit are to:
- instil techniques for identifying and understanding the underlying databases (and the processes executed over them);
- introduce methodologies for moving data from a source to a destination and then integrating it into a centralised repository. This centralised database needs to adhere to its own set of integrity constraints and to provide the capability of tracing data back to its source;
- further knowledge of the physical design of a database by including hardware and design techniques (e.g. what, how, when and where to index) that are very different from those of online systems;
- apply data warehousing and data mining techniques that impose an extensive computational load when executed over massive datasets. Such queries/algorithms require careful study to design and optimise for execution. It has become customary for a number of specific techniques to be applied to known problems;
- allow students to consider data-intensive distributed computing infrastructures. Another crop of tools to be introduced is the current NoSQL DBMSs, which offer added performance if their set-up is acceptable (e.g. it does not affect business processes);
- understand what the Velocity aspect of Big Data is;
- appreciate the challenges posed by Velocity;
- apply a data processing framework such as Spark or Storm to process data in this scenario.
Learning Outcomes:
1. Knowledge & Understanding
By the end of the study-unit the student will be able to:
- recognise the need for, and know how to build, a cross-organisation data infrastructure for a data analytics exercise;
- evaluate data sources and how to extract and move data into a staging area;
- build an organisation-wide data repository for data warehousing and data mining (at logical and physical level);
- write complex queries in SQL and SQL procedural extensions;
- write complex queries in NoSQL and application code;
- understand the concept of distributed data and functions across networks of computers;
- build a framework that enables distributed computation across databases and massive datasets;
- explain the difference between building the infrastructure and querying it in terms of computational load;
- explain query processing and optimisation in massive datasets;
- gain awareness of Velocity by analysing typical scenarios;
- appreciate the different architectures used to deal with the Velocity aspect of Big Data.
2. Skills
By the end of the study-unit the student will be able to:
- create large-scale distributed and interoperable systems;
- write and implement complex database designs for an enterprise infrastructure with a high-level database language;
- write and implement problematic extract, load and transform methods to consolidate the source databases into the infrastructure;
- write and implement extract, load and transform methods to read the output of pattern extraction programs;
- write SQL commands for roll-up (and cube), top-n, group by, partitions and CTEs;
- write procedures with embedded queries for basic algorithms that extract patterns;
- write code for specific data-intensive problems: e.g., association rules, rules, clustering;
- write code for dimension reduction for data-intensive problems and datasets;
- write code to implement data mining in time series datasets;
- select, use, and deploy specialised tools for data warehousing and data mining;
- identify cases of Velocity data;
- apply the right architecture to Velocity data depending on the scenario.
Main Text/s and any supplementary readings:
• Fundamentals of Database Systems, Ramez Elmasri, Shamkant B. Navathe, 7th Edition, 2015, Pearson, ISBN-13: 978-0133970777
• Data Mining: Concepts and Techniques, Jiawei Han, Jian Pei, Hanghang Tong, 4th Edition, 2022, Morgan Kaufmann, ISBN-13: 978-0128117606
• Data Warehouse Design: Modern Principles and Methodologies, Matteo Golfarelli, Stefano Rizzi, 2011, McGraw-Hill Osborne, ISBN-13: 978-1441988331
• Principles of Distributed Database Systems, M. Tamer Özsu, Patrick Valduriez, 4th Edition, 2020, Springer, ISBN-13: 978-3-030-26252-5
• A number of research papers are made available
• System manuals as needed (available in the department's labs).
Note: The Inmon and Kimball books are still a good read for data warehousing. |
||||||||||||
RULES/CONDITIONS | Before taking this unit you are advised to take CIS2090 or CIS3107 | ||||||||||||
STUDY-UNIT TYPE | Lecture, Independent Study and Practical | ||||||||||||
METHOD OF ASSESSMENT |
|
||||||||||||
LECTURER/S | Joseph Vella |
||||||||||||
The University makes every effort to ensure that the published Courses Plans, Programmes of Study and Study-Unit information are complete and up-to-date at the time of publication. The University reserves the right to make changes in case errors are detected after publication.
The availability of optional units may be subject to timetabling constraints. Units not attracting a sufficient number of registrations may be withdrawn without notice. It should be noted that all the information in the description above applies to study-units available during the academic year 2024/5. It may be subject to change in subsequent years. |