Monday, 29 December 2014

Parse HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

Hi folks. In this article I explain how to parse and fetch HTML DOM elements and their content from a website URL using the Html Agility Pack (HAP).

Follow these steps:

1.  Download HtmlAgilityPack from CodePlex.
2.  Add a reference to HtmlAgilityPack.dll to the project.
3.  Import the namespace: Imports HtmlAgilityPack


Each HtmlNode exposes basic properties for traversing the DOM, including:

ParentNode,
ChildNodes,
NextSibling, and
PreviousSibling
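
A minimal sketch of these traversal properties, assuming a small hard-coded HTML snippet (the markup and the names navDoc, firstItem, box, and secondItem are only for illustration):

 Dim navDoc As New HtmlAgilityPack.HtmlDocument()
 navDoc.LoadHtml("<div id='box'><span>One</span><span>Two</span><span>Three</span></div>")

 ' SelectSingleNode returns the first node that matches the XPath, here the first <span>
 Dim firstItem As HtmlAgilityPack.HtmlNode = navDoc.DocumentNode.SelectSingleNode("//span")

 ' ParentNode moves up to the enclosing <div>
 Dim box As HtmlAgilityPack.HtmlNode = firstItem.ParentNode

 ' ChildNodes holds every direct child of the <div>
 For Each child As HtmlAgilityPack.HtmlNode In box.ChildNodes
     Console.WriteLine(child.Name & ": " & child.InnerText)
 Next

 ' NextSibling and PreviousSibling move sideways; each returns Nothing when there is no such sibling
 Dim secondItem As HtmlAgilityPack.HtmlNode = firstItem.NextSibling
 If secondItem IsNot Nothing Then
     Console.WriteLine(secondItem.PreviousSibling.InnerText)
 End If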

There are properties for determining information about the node itself, such as:

Name - gets or sets the node's name. For HTML elements this property returns (or assigns) the name of the tag - "body" for the <body> tag, "p" for a <p> tag, and so on.
Attributes - returns the collection of attributes for this element, if any.
InnerHtml - gets or sets the HTML content within the node.
InnerText - returns the text within the node.
NodeType - indicates the type of the node. Can be Document, Element, Comment, or Text.
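
As a rough illustration of these properties, the sketch below reads them from one element. The sample markup and the names infoDoc and note are assumptions made for the example:

 Dim infoDoc As New HtmlAgilityPack.HtmlDocument()
 infoDoc.LoadHtml("<div id='note' class='hint'>Hello <b>world</b></div>")
 Dim note As HtmlAgilityPack.HtmlNode = infoDoc.GetElementbyId("note")

 Console.WriteLine(note.Name)                  ' tag name of the element
 Console.WriteLine(note.InnerHtml)             ' markup inside the <div>
 Console.WriteLine(note.InnerText)             ' text only, without the tags
 Console.WriteLine(note.NodeType.ToString())   ' one of the HtmlNodeType values

 ' Attributes is a collection of HtmlAttribute objects, each with a Name and a Value
 For Each attr As HtmlAgilityPack.HtmlAttribute In note.Attributes
     Console.WriteLine(attr.Name & " = " & attr.Value)
 Next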


Basic idea of parsing:

The HtmlDocument object provides a GetElementbyId method that fetches a specific node by its id.

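' Create the HtmlAgilityPack document object.
' LoadHtml expects HTML markup, not a URL, so use HtmlWeb.Load to fetch a page from a URL.
 Dim web As New HtmlAgilityPack.HtmlWeb()
 Dim htmlDoc As HtmlAgilityPack.HtmlDocument = web.Load("Pass the URL here")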

' The code below parses the content of the element with a given id in the HTML page.
Dim _pagenumber As String = htmlDoc.GetElementbyId("firstLineCriteriaContainer").InnerText

' The code below extracts all links under a specific node whose href begins with "http://".
' SelectNodes returns Nothing when no node matches.
Dim allLinks As HtmlAgilityPack.HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//*[@id='mynode']//a[starts-with(@href,'http://')]")
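
' A minimal sketch of looping over matched links, assuming htmlDoc has been loaded as above
' (the names linkNodes, link, and href are only for illustration).
' SelectNodes returns Nothing when nothing matches, so check before looping.
' GetAttributeValue returns the supplied default when the attribute is missing.
 Dim linkNodes As HtmlAgilityPack.HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//a[@href]")
 If linkNodes IsNot Nothing Then
     For Each link As HtmlAgilityPack.HtmlNode In linkNodes
         Dim href As String = link.GetAttributeValue("href", String.Empty)
         Console.WriteLine(link.InnerText & " -> " & href)
     Next
 End If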

' The code below extracts a single node based on the class of a div, e.g. Header_Company_Name.
 Dim _companyname As String = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='Header_Company_Name']").InnerText

' The code below extracts the text of the title tag in the head section.
 Dim documenttitle As String = (From _link In htmlDoc.DocumentNode.Descendants("title") Select _link.InnerText).FirstOrDefault

 ' Remove unwanted tags and scripts from the HTML page by calling the RemoveTag helper.
  RemoveTag(htmlDoc, "script")
  RemoveTag(htmlDoc, "link")
  RemoveTag(htmlDoc, "style")
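  RemoveTag(htmlDoc, "comment()")   ' comment() is the XPath node test for HTML comments; "comment" would only match <comment> elements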
  RemoveTag(htmlDoc, "meta")
 
' Helper that removes every node matching the given tag name.
' SelectNodes returns Nothing when nothing matches, so fall back to an empty collection.
  Private Sub RemoveTag(doc As HtmlAgilityPack.HtmlDocument, tag As String)
        For Each n In If(doc.DocumentNode.SelectNodes("//" & tag), New HtmlAgilityPack.HtmlNodeCollection(doc.DocumentNode))
            n.Remove()
        Next
    End Sub

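' To remove only a single node, use this code. SelectSingleNode returns Nothing when no node matches, so check before calling Remove.
 Dim _pageBody As HtmlAgilityPack.HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='page-body']")
 If _pageBody IsNot Nothing Then
     Dim _script As HtmlAgilityPack.HtmlNode = _pageBody.SelectSingleNode("script")
     If _script IsNot Nothing Then _script.Remove()
 End If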


© TBGsharepointforum All rights reserved | Designed by Blogger Templates