
I have a new project in an area I am not familiar with. One task requires me to navigate some websites and collect data from them. A sample website is: https://www.hudhomestore.com/Home/Index.aspx


I have read and watched tutorials on "collecting" data from a web page, such as:

But my question is: how do we usually set the search criteria so that the site searches based on our preferences, and then use the approach from the above links to load the results in my code?

EDIT

This is correct for setting the search criteria based on my selection. However, the total count of the search (if I do it manually for the state MI) is 223, but if I execute the code below, tdNodeCollection contains only 121 items. Can you show me where I am going wrong?

    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    string zipCode = "", city = "", county = "", street = "", sState = "MI",
           fromPrice = "0", toPrice = "0", fcaseNumber = "", bed = "0", bath = "0",
           buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "",
           housingType = "", stories = "", parking = "", propertyAge = "",
           sLanguage = "ENGLISH";

    var doc = await (Task.Factory.StartNew(() => web.Load(
        "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
        "zipCode=" + zipCode + "&city=" + city + "&country=" + county +
        "&street=" + street + "&sState=" + sState +
        "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
        "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
        "&buyerType=" + buyerType + "&Status=" + Status +
        "&indoorAmenities=" + indoorAmenities +
        "&outdoorAmenities=" + outdoorAmenities +
        "&housingType=" + housingType + "&stories=" + stories +
        "&parking=" + parking + "&propertyAge=" + propertyAge +
        "&sLanguage=" + sLanguage)));

    HtmlNodeCollection tdNodeCollection = doc.DocumentNode
        .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
  • Can you explain a bit about "set preferences to search"? Commented Feb 7, 2017 at 3:57
  • In theory, each search criterion represents a key/value in the database. In this particular example the form is submitted using the GET method, where the search criteria are passed as query strings in the URL and then used by a template that displays the results retrieved from the DB. Commented Feb 7, 2017 at 3:57
  • Hi @MAdeelKhalid, yes of course. For example, in my application I would like to ask the user which state they would like to view, and then display the results to them. So how can I "query" the website with a specific "filter", then go to the result page and parse that page in my code? Commented Feb 7, 2017 at 3:58
  • @SergioAlen Is it doable in my case? To query their DB from my application and retrieve results? Commented Feb 7, 2017 at 4:01
  • @SergioAlen is talking about a web service or the MVC pattern, and your question is about something else, I think. Right? Commented Feb 7, 2017 at 4:03

1 Answer


You can use HtmlAgilityPack for this purpose. I've written a small piece of test code and tested it against the second page, the one you want to scrape, with search criteria that you can set.

    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    HtmlWeb web = new HtmlWeb();
    //string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";

    // Set these variables to whatever the user inputs;
    // after setting them, append them to the search URL.
    string zipCode = "", city = "", county = "", street = "", sState = "AK",
           fromPrice = "0", toPrice = "0", fcaseNumber = "", bed = "0", bath = "0",
           buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "",
           housingType = "", stories = "", parking = "", propertyAge = "",
           sLanguage = "ENGLISH";

    HtmlAgilityPack.HtmlDocument document = web.Load(
        "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
        "zipCode=" + zipCode + "&city=" + city + "&country=" + county +
        "&street=" + street + "&sState=" + sState +
        "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
        "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
        "&buyerType=" + buyerType + "&Status=" + Status +
        "&indoorAmenities=" + indoorAmenities +
        "&outdoorAmenities=" + outdoorAmenities +
        "&housingType=" + housingType + "&stories=" + stories +
        "&parking=" + parking + "&propertyAge=" + propertyAge +
        "&sLanguage=" + sLanguage);

    HtmlNodeCollection tdNodeCollection = document.DocumentNode
        .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
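Plain string concatenation leaves the values unescaped, so a city such as "Ann Arbor" would put a raw space into the URL. Below is a minimal sketch of one way to build the query string with escaping, using only the .NET base library. `SearchUrl.Build` is a hypothetical helper written for this illustration (it is not part of HtmlAgilityPack), and the parameter names are simply the ones used in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SearchUrl
{
    // Hypothetical helper: joins key/value pairs into an escaped query
    // string and appends it to baseUrl. Null values are sent as empty.
    public static string Build(
        string baseUrl,
        IEnumerable<KeyValuePair<string, string>> criteria)
    {
        string query = string.Join("&", criteria.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" +
            Uri.EscapeDataString(kv.Value ?? "")));
        return baseUrl + "?" + query;
    }
}
```

Called with the pairs `sState=MI` and `city=Ann Arbor`, this yields a URL ending in `?sState=MI&city=Ann%20Arbor`, which can then be passed to `web.Load(...)`.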

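Count them again and look at your XPath expression: there are exactly 121 td elements inside the rows of the element with id="dgPropertyList". The expression selects table cells, not properties, so its count does not have to match the number of search results. Next, inspect the td elements manually, work out which of them holds the data you need, and fetch it from there.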
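To see the difference between counting cells and counting rows, here is a small sketch that runs both XPath expressions against an inline table. The markup in `SampleHtml` is made up and only mimics the general shape of a results table; it is not the real page, and `RowCounter` is a hypothetical helper. Whether the real result page is also split across several pages is not something this sketch checks.

```csharp
using HtmlAgilityPack;

public static class RowCounter
{
    // Made-up markup for illustration only: two rows with three cells each.
    public const string SampleHtml =
        "<table id=\"dgPropertyList\">" +
        "<tr><td>A-1</td><td>Detroit</td><td>10000</td></tr>" +
        "<tr><td>A-2</td><td>Flint</td><td>12000</td></tr>" +
        "</table>";

    // Returns how many nodes the XPath matches; SelectNodes returns null
    // (not an empty collection) when nothing matches.
    public static int Count(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

On the sample, `//*[@id='dgPropertyList']//tr//td` matches 6 nodes while `//*[@id='dgPropertyList']//tr` matches 2, so selecting rows rather than cells is closer to "one item per property".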

    foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
    {
        // Suppose you want to access an <h2> or <p> inside this cell.

        // Gets the first <h2> that is a direct child of the node:
        HtmlNode h2Node = node.SelectSingleNode("./h2");

        // Searches the whole subtree of the node as well:
        HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2");

        // You can also look at the children without XPath (like in a tree):
        HtmlNode h2Node_ = node.ChildNodes["h2"];
    }

I've tested the code; it works and parses the whole document to reach the required table. It gets you all the rows of that table inside the div, so you can dig further into these rows, find your td and get what you need.
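As an illustration of digging into the rows, the sketch below collects the `href` of every link found in each row. It assumes the item links are ordinary `<a href="...">` elements inside the table, which has not been verified against the real page; `RowLinks` is a hypothetical helper.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RowLinks
{
    // Returns the href of every <a> found in the rows of the element
    // with id="dgPropertyList" (assumed structure, see the note above).
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//*[@id='dgPropertyList']//tr");
        if (rows == null) return links; // no matching table or rows

        foreach (HtmlNode row in rows)
        {
            // ".//" keeps the search inside the current row.
            HtmlNodeCollection anchors = row.SelectNodes(".//a[@href]");
            if (anchors == null) continue; // this row has no links

            foreach (HtmlNode a in anchors)
                links.Add(a.GetAttributeValue("href", ""));
        }
        return links;
    }
}
```

The null checks matter because `SelectNodes` returns null when nothing matches, which is also why a lookup for a tag that is not present in a cell comes back as null.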

Another option is Selenium WebDriver: Get your hands on Selenium

If you don't want the browser to be visible but still want Selenium-like functionality, you can use PhantomJS.

Hope it helps.


5 Comments

Awesome, thank you! Can you see my edit? Why is it only 121? Could you please debug it with me? Also, how can I dig into each node to retrieve the link of each item? Can I just search the InnerHtml string of each node?
Look at the modified answer.
I think you answered my question, so I accepted your answer. However, I am unfamiliar with all of this h2 and tr stuff, so I will try another approach, and I have another question if you can help: stackoverflow.com/questions/42084130/… And thanks again, Adeel!
And by the way, when I execute your new code, h2Node, allH2Nodes and h2Node_ are always null.
That was just an example of fetching an h2 tag within that node, nothing else. There might be no h2 within that td node. It's a pleasure for me if I can help. :)
