Please use this identifier to cite or link to this item: https://www.um.edu.mt/library/oar/handle/123456789/93373
Title: Web page extractor
Authors: Cini, Shirley (2012)
Keywords: Computer crimes
Natural language processing (Computer science)
HTML (Document markup language)
Algorithms
Issue Date: 2012
Citation: Cini, S. (2012). Web page extractor (Bachelor’s dissertation).
Abstract: Nowadays, a web page extractor has become a necessary tool due to the huge amount of information offered on the web. There are several techniques for extracting information from web pages. The most important feature in an extraction system is the recognition of named entities. The goal of the 'Web Page Extractor' system is to automate a substantial amount of the work involved in investigating cyber-crime web pages. Currently, the case reports generated by the cyber-crime unit contain limited information. The artifact is developed in Java and is platform and browser independent. Given a URL, the system is able to identify personal details, addresses, locations and organizations. A case report showing this information and the WHOIS data of the domain is generated. The report is initially displayed in a web page, but the user has the option to convert it into a PDF. In addition, an option is available for generating a graph of relations concerning the persons recognized in the content of the web page being investigated. The graph gives users insight into the degree of participation these persons have with respect to the web page content; hence, they can easily detect which persons play an important role and which can be neglected. The graph feature uses a search engine to find additional information in online sources about the persons selected by the users. These results are examined by the information extractor, and any entities matching those extracted from the original URL are retained. Moreover, if the user selects more than one person, the graph feature captures any data shared across the whole group of persons to demonstrate any relationships that might exist. The system is also capable of parsing the HTML source code of the web page specified by the user with the aim of downloading any video links, images, external links and cookies to a folder on the hard drive.
Description: B.Sc. IT (Hons)(Melit.)
URI: https://www.um.edu.mt/library/oar/handle/123456789/93373
Appears in Collections: Dissertations - FacICT - 2012
Dissertations - FacICTAI - 2002-2014

Files in This Item:
File: B.SC.(HONS)ICT_Cini_Shirley_2012.pdf
Description: Restricted Access
Size: 16.62 MB
Format: Adobe PDF


Items in OAR@UM are protected by copyright, with all rights reserved, unless otherwise indicated.