RE: E-MINE: A novel web mining approach
INTRODUCTION AND DEFINITION
Nowadays the web is widely used as a medium of publication. Hence, a large collection of documents, images, text files and other data in structured, semi-structured and unstructured forms is available on the web.
Several attempts have been made to extract regularly structured data from web pages.
Existing automatic techniques are not satisfactory because of their poor accuracy.
E-Mine is an effective method to mine the data region from a web page automatically.
MDR (Mining Data Records) is an existing technique for extracting data records from web pages.
It exploits the regularities in HTML tag structure directly.
The MDR algorithm uses the entire HTML tag tree of the web page to extract data records from the page.
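To make the idea of a tag tree concrete, here is a minimal Python sketch. It is not the MDR algorithm itself: it only builds a tag tree with the standard library and compares the tag sequences of sibling subtrees. The names Node, TagTreeBuilder, build_tag_tree and sibling_similarity are hypothetical, and using difflib as the similarity measure is an assumption made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element of the tag tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def tag_string(self):
        # Pre-order list of the tags in this subtree.
        out = [self.tag]
        for child in self.children:
            out.extend(child.tag_string())
        return out


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def sibling_similarity(a, b):
    # Assumed measure: ratio of matching tags between two subtrees (0.0 to 1.0).
    return SequenceMatcher(None, a.tag_string(), b.tag_string()).ratio()
```

Sibling subtrees with a high similarity are candidates for being data records of the same kind; where to set the threshold is left open here.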
THE PROPOSED TECHNIQUE
Visual information helps in three ways.
It enables the system to identify gaps that separate records, which helps to segment data records correctly.
The visual information also contains information about the hierarchical structure of the tags.
Observation of web pages suggests that the relevant data region occupies the major central part of the page.
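The first point, using gaps to segment records, could be sketched in Python as follows. This is an illustration only: the function name split_on_gaps, the (top, bottom) pixel spans and the min_gap threshold are assumptions and do not come from the presentation.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (top, bottom) of a rendered element, in pixels


def split_on_gaps(spans: List[Span], min_gap: float) -> List[List[Span]]:
    """Group spans from top to bottom; a vertical gap of at least
    min_gap between consecutive spans starts a new group."""
    groups: List[List[Span]] = []
    prev_bottom = None
    for top, bottom in sorted(spans):
        if prev_bottom is None or top - prev_bottom >= min_gap:
            groups.append([])
        groups[-1].append((top, bottom))
        prev_bottom = bottom if prev_bottom is None else max(prev_bottom, bottom)
    return groups


# Two elements, a 20-pixel gap, then two more elements.
print(split_on_gaps([(0, 40), (42, 90), (110, 150), (152, 200)], min_gap=15))
# [[(0, 40), (42, 90)], [(110, 150), (152, 200)]]
```

Each resulting group would then be treated as one candidate data record.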
The E-Mine technique is based on three observations:
A group of data records is typically presented in a neighboring (contiguous) region of the web page.
The area covered by the rectangle that bounds the data region is larger than the area covered by the rectangles bounding other regions, e.g., advertisements and links.
The height of an irrelevant data record within a collection of data records is less than the average height of the relevant data records within that data region.
HOW THE ALGORITHM WORKS
Determining the height and width of all bounding rectangles.
Identification of the largest rectangle.
Identification of the container within the largest rectangle.
Identification of data region containing data records within the container.
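The four steps above can be sketched in Python under some stated assumptions. The Box class and all function names are hypothetical. Widths and heights (step 1) are assumed to come from a rendering engine and are simply given as data here. The rule that the container must cover more than half of the largest rectangle, and the use of the average child height as the filter, are assumptions chosen to match the three observations; the presentation does not state the exact thresholds.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Box:
    """Bounding rectangle of a rendered element (hypothetical structure)."""
    tag: str
    width: float   # step 1: assumed to be supplied by a rendering engine
    height: float
    children: List["Box"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def largest_rectangle(body: Box) -> Box:
    # Step 2: largest rectangle among the direct children of the page body.
    return max(body.children, key=lambda b: b.area)


def descendants(box: Box) -> Iterator[Box]:
    # Yield the box itself and every box nested inside it.
    yield box
    for child in box.children:
        yield from descendants(child)


def find_container(largest: Box, ratio: float = 0.5) -> Box:
    # Step 3 (assumed rule): the smallest nested box that still covers
    # more than `ratio` of the largest rectangle's area.
    candidates = [b for b in descendants(largest) if b.area > ratio * largest.area]
    return min(candidates, key=lambda b: b.area)


def data_records(container: Box) -> List[Box]:
    # Step 4: drop children shorter than the average child height.
    kids = container.children
    if not kids:
        return []
    avg = sum(k.height for k in kids) / len(kids)
    return [k for k in kids if k.height >= avg]


# Made-up page: a header, a main area with a sidebar and results, a footer.
results = Box("results", 800, 1600, [
    Box("r1", 800, 300), Box("r2", 800, 300), Box("r3", 800, 300),
    Box("sep", 800, 20), Box("r4", 800, 300),
])
page = Box("body", 1000, 2000, [
    Box("header", 1000, 100),
    Box("main", 1000, 1700, [Box("sidebar", 200, 1700), results]),
    Box("footer", 1000, 200),
])

container = find_container(largest_rectangle(page))
records = data_records(container)
print(container.tag, [r.tag for r in records])
# results ['r1', 'r2', 'r3', 'r4']
```

A plain average also drops a genuine record that happens to be shorter than the others, so a real implementation would probably need a more tolerant filter.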
E-Mine is a new approach to extracting structured data from web pages.
E-Mine is a purely visual-structure-oriented method that can correctly identify the data regions.
E-Mine overcomes the drawbacks of existing methods and performs significantly better than they do.