Hi folks. In this article I am going to explain how to parse a web page and fetch its HTML DOM elements and their content from a website URL using the Html Agility Pack.
Follow these steps:
1. Download HtmlAgilityPack from CodePlex.
2. Add a reference to HtmlAgilityPack.dll to the solution.
3. Import the namespace: Imports HtmlAgilityPack
Each HtmlNode has basic properties for traversing the DOM, including:
ParentNode,
ChildNodes,
NextSibling, and
PreviousSibling
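The traversal properties above can be tried out with a small console sketch. It assumes the HtmlAgilityPack reference is already added; the sample markup, the "menu" id, and the module name are made up for illustration, and the HTML is parsed from a string so no network access is needed.

```vbnet
Imports System
Imports HtmlAgilityPack

Module TraversalSketch
    Sub Main()
        ' Hypothetical sample markup (not from any real page).
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<ul id='menu'><li>One</li><li>Two</li><li>Three</li></ul>")

        Dim menu As HtmlNode = doc.GetElementbyId("menu")

        ' ChildNodes: every direct child of the <ul>.
        For Each child As HtmlNode In menu.ChildNodes
            Console.WriteLine(child.Name & ": " & child.InnerText)
        Next

        ' NextSibling / PreviousSibling: step sideways between the <li> nodes.
        Dim first As HtmlNode = menu.FirstChild
        Dim second As HtmlNode = first.NextSibling
        Console.WriteLine(second.InnerText)
        Console.WriteLine(second.PreviousSibling.InnerText)

        ' ParentNode: step back up to the <ul>.
        Console.WriteLine(second.ParentNode.Name)
    End Sub
End Module
```

Note that on real pages the whitespace between tags becomes text nodes, so NextSibling often returns a "#text" node rather than the next element.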
There are properties for determining information about the node itself, such as:
Name - gets or sets the node's name. For HTML elements this property returns (or assigns) the name of the tag - "body" for the <body> tag, "p" for a <p> tag, and so on.
Attributes - returns the collection of attributes for this element, if any.
InnerHtml - gets or sets the HTML content within the node.
InnerText - returns the text within the node.
NodeType - indicates the type of the node. Can be Document, Element, Comment, or Text.
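As a rough illustration of these properties, the sketch below prints them for one paragraph. The markup, the "intro" id, and the module name are assumptions made up for the example; only the property names come from the list above.

```vbnet
Imports System
Imports HtmlAgilityPack

Module NodeInfoSketch
    Sub Main()
        ' Hypothetical sample markup, parsed from a string.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<p id='intro' class='lead'>Hello <b>world</b><!-- note --></p>")

        Dim p As HtmlNode = doc.GetElementbyId("intro")

        Console.WriteLine(p.Name)       ' the tag name
        Console.WriteLine(p.InnerHtml)  ' the markup inside the <p>
        Console.WriteLine(p.InnerText)  ' the text inside the <p>

        ' Attributes: each entry has a Name and a Value.
        For Each attr As HtmlAttribute In p.Attributes
            Console.WriteLine(attr.Name & " = " & attr.Value)
        Next

        ' NodeType: tells text, element and comment children apart.
        For Each child As HtmlNode In p.ChildNodes
            Console.WriteLine(child.NodeType.ToString() & ": " & child.OuterHtml)
        Next
    End Sub
End Module
```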
Basic idea of parsing:
The HtmlDocument object provides a GetElementbyId method that fetches a specific node by its id.
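' Create an HtmlWeb object and load the page from its URL.
' (HtmlDocument.LoadHtml expects HTML markup, not a URL, so HtmlWeb.Load is used here.)
Dim web As New HtmlAgilityPack.HtmlWeb()
Dim htmlDoc As HtmlAgilityPack.HtmlDocument = web.Load("Pass the URL here")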
' The code below parses content based on an element id in the HTML page.
' GetElementbyId returns Nothing if no element has that id.
Dim _pagenumber As String = htmlDoc.GetElementbyId("firstLineCriteriaContainer").InnerText
' The code below extracts all links under a specific node whose href begins with "http://".
' SelectNodes returns Nothing when there is no match.
Dim allLinks As HtmlAgilityPack.HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//*[@id='mynode']//a[starts-with(@href,'http://')]")
' The code below extracts a single node based on its div class, e.g. Header_Company_Name.
Dim _companyname As String = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='Header_Company_Name']").InnerText
' The code below extracts the text of the title tag in the header part.
Dim documenttitle As String = (From _link In htmlDoc.DocumentNode.Descendants("title") Select _link.InnerText).FirstOrDefault()
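' Remove unwanted tags and scripts from the HTML page by calling the function below.
RemoveTag(htmlDoc, "script")
RemoveTag(htmlDoc, "link")
RemoveTag(htmlDoc, "style")
' comment() is the XPath node test that matches HTML comments; plain "comment" would only match <comment> elements.
RemoveTag(htmlDoc, "comment()")
RemoveTag(htmlDoc, "meta")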
' Function that removes every node matching the given tag.
Private Sub RemoveTag(doc As HtmlAgilityPack.HtmlDocument, tag As String)
    ' SelectNodes returns Nothing when nothing matches, so fall back to an empty collection.
    For Each n In If(doc.DocumentNode.SelectNodes("//" & tag), New HtmlAgilityPack.HtmlNodeCollection(doc.DocumentNode))
        n.Remove()
    Next
End Sub
' If you want to remove only a single node, use this code.
Dim _pageBody As HtmlAgilityPack.HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='page-body']")
_pageBody.SelectSingleNode("script").Remove()
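The snippets above call InnerText and Remove() directly on the lookup result, which throws a NullReferenceException when the page has no matching node. One possible way to guard against that is sketched below. TextOrDefault is a hypothetical helper written for this example, not part of HtmlAgilityPack, and the sample markup is made up.

```vbnet
Imports System
Imports HtmlAgilityPack

Module SafeLookupSketch
    ' Hypothetical helper: text of the first XPath match, or a fallback value.
    Function TextOrDefault(root As HtmlNode, xpath As String, fallback As String) As String
        Dim node As HtmlNode = root.SelectSingleNode(xpath)
        If node Is Nothing Then Return fallback
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' Hypothetical sample markup with a page-body div but no company-name div.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div class='page-body'><p>Body text</p></div>")

        Console.WriteLine(TextOrDefault(doc.DocumentNode, "//div[@class='page-body']/p", "(missing)"))
        Console.WriteLine(TextOrDefault(doc.DocumentNode, "//div[@class='Header_Company_Name']", "(missing)"))

        ' Remove the child <script> only if both lookups actually found something.
        Dim body As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='page-body']")
        If body IsNot Nothing Then
            Dim script As HtmlNode = body.SelectSingleNode("script")
            If script IsNot Nothing Then script.Remove()
        End If
    End Sub
End Module
```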