Research Article Open Access

A Layout Based Detachment Approach for Extracting Content from Webpages

Deepa Chandran1 and Anna Saro Vijendran2
  • 1 Department of Information Technology, SNR Sons College, Coimbatore, India
  • 2 MCA, SNR Sons College, Coimbatore, India

Abstract

An enormous amount of useful information on the Internet is formatted for human readers, which makes extracting the relevant data from diverse web sources a complex task. Various approaches for extracting data from webpages have recently been proposed. This study presents a simple but effective approach named the Layout Based Detachment Approach (LBDA). The approach extracts the main content from a webpage by removing irrelevant information such as header and footer contents, navigation bars, advertisements and other noisy images. The methodology uses three techniques: tag tree parsing to obtain the analysis structure, a block acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary contents. The approach eliminates noise, extracts the main content blocks from the webpage effectively and displays the essential content to users. Its performance is evaluated using accuracy, precision, recall, execution time and memory usage. The implementation results show that the proposed LBDA performs better than the existing heuristic approach.
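
The three stages named in the abstract (tag tree parsing, removal of unwanted blocks, and extraction of the remaining content) can be illustrated with a minimal Python sketch. This is not the authors' LBDA implementation: the abstract does not specify the segmentation rules, so a simple tag-name filter stands in for the block acquiring page segmentation step. The names NOISE_TAGS, VOID_TAGS, Node, TreeBuilder, prune, collect_text and extract_main_content are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: the paper does not list specific tags; this set is illustrative.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}
# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Stage 1: parse the markup into a simple tag tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open ancestor with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def prune(node):
    """Stage 2 (simplified stand-in): drop subtrees rooted at a noise tag."""
    node.children = [
        c for c in node.children
        if isinstance(c, str) or c.tag not in NOISE_TAGS
    ]
    for child in node.children:
        if isinstance(child, Node):
            prune(child)


def collect_text(node):
    """Stage 3: gather the remaining text in document order."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.extend(collect_text(child))
    return parts


def extract_main_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return " ".join(collect_text(builder.root))
```

Calling extract_main_content on a page string returns the text that remains once the header, footer and navigation subtrees have been detached. A realistic segmentation step would also need layout cues such as block size, link density or position, which this sketch deliberately leaves out.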

American Journal of Applied Sciences
Volume 12 No. 6, 2015, 411-420

DOI: https://doi.org/10.3844/ajassp.2015.411.420

Submitted On: 17 August 2013
Published On: 25 July 2015

How to Cite: Chandran, D. & Vijendran, A. S. (2015). A Layout Based Detachment Approach for Extracting Content from Webpages. American Journal of Applied Sciences, 12(6), 411-420. https://doi.org/10.3844/ajassp.2015.411.420

Keywords

  • Webpage Content Extraction
  • Web Mining
  • DOM Tree Analysis
  • Web Structure Mining