0

I would like to get the data from this website and put them into a dictionary.

Basically these are prices and quantities for some financial instruments.

I have this source code for the page (here is just an extract of the whole text):

    <tr>
      <td class="quotesMaxTime1414148558" id="notation115602071"><span>4,000.00</span></td>
      <td><span>0</span></td>
      <td class="icon red"><span id="domhandler:8.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PERFORMANCE_PCT.wtkm:options_options_snapshot_1">-3.87%</span></td>
      <td><span id="domhandler:9.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PRICE.wtkm:options_options_snapshot_1">960.40</span></td>
    </tr>

Now I would like to extract the following information:

  1. The value "4000" from the second line;
  2. The value "-3.87%" from the fourth line;
  3. The value "960.40" from the fifth line.

I have tried to use the following to extract the first information (the value 4000):

    string url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
    var webGet = new HtmlWeb();
    var document = webGet.Load(url);
    var firstData = from x in document.DocumentNode.Descendants()
                    where x.Name == "td" && x.Attributes.Contains("class")
                    select x.InnerText;

but firstData doesn't contain the value I want (4000). Instead it shows this:

System.Linq.Enumerable+WhereSelectEnumerableIterator`2[HtmlAgilityPack.HtmlNode,System.String] 

How can I get this data? I also need to repeat this task several times, because the page has more than one row containing similar information. Is HTML Agility Pack useful in this context? Thanks.

2
  • This approach is highly brittle: if they change the site, it will break. I'd urge you to consider using a data feed from the company. Commented Oct 24, 2014 at 13:28
  • I know, but this is what I have at the moment and I would like to get this information. Commented Oct 24, 2014 at 13:30

5 Answers

1

This is somewhat ugly, since it was thrown together quickly and could probably be cleaned up a lot, but it returns all of the values you are looking for from the Prices/Quotes table on that page. Hope it helps.

    var url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
    var webGet = new HtmlWeb();
    var document = webGet.Load(url);

    var pricesAndQuotesDataTable =
        (from elem in document.DocumentNode.Descendants()
             .Where(d => d.Attributes["class"] != null
                      && d.Attributes["class"].Value == "toggleTitle"
                      && d.ChildNodes.Any(h => h.InnerText != null && h.InnerText == "Prices/Quotes"))
         select elem.Descendants()
             .FirstOrDefault(d => d.Attributes["class"] != null
                               && d.Attributes["class"].Value == "dataTable")).FirstOrDefault();

    if (pricesAndQuotesDataTable != null)
    {
        var dataRows = from elem in pricesAndQuotesDataTable.Descendants()
                       where elem.Name == "tr" && elem.ParentNode.Name == "tbody"
                       select elem;

        var dataPoints = new List<object>();

        foreach (var row in dataRows)
        {
            var dataColumns = (from col in row.ChildNodes.Where(n => n.Name == "td")
                               select col).ToList();

            dataPoints.Add(
                new
                {
                    StrikePrice = dataColumns[0].InnerText,
                    DifferenceToPreviousDay = dataColumns[9].InnerText,
                    LastPrice = dataColumns[10].InnerText
                });
        }
    }


1 Comment

This works nicely and at this point it is easy to store into a dictionary. Thanks a lot for your help!
1

That's because your LINQ query hasn't executed yet. If you expand the Results View in the debugger, the query runs and you'll see all the items, the first being the value you are looking for.

So this will get you 4,000.00:

    var firstData = (from x in document.DocumentNode.Descendants()
                     where x.Name == "td" && x.Attributes.Contains("class")
                     select x.InnerText).First();

If you want them all, call ToList() instead of First().
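To illustrate the point about deferred execution, here is a minimal sketch that uses a plain list instead of the parsed page. The cell values are copied from the question's HTML; DeferredExecutionDemo and CellTexts are made-up names, and the length filter is only a stand-in for the real where clause.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Hypothetical stand-in for the InnerText values of the <td> cells.
    public static IEnumerable<string> CellTexts()
    {
        var cells = new List<string> { "4,000.00", "0", "-3.87%", "960.40" };

        // Building the query only describes the work; nothing is filtered yet.
        return from c in cells
               where c.Length > 1
               select c;
    }

    public static void Main()
    {
        var query = CellTexts();

        // Printing the query itself shows the iterator's type name, not the data.
        Console.WriteLine(query);

        // First() runs the query far enough to produce one element: 4,000.00
        Console.WriteLine(query.First());

        // ToList() runs it to completion and materializes the 3 matching items.
        List<string> all = query.ToList();
        Console.WriteLine(string.Join(" | ", all));
    }
}
```

The same applies to the query in the question: it has to be enumerated (First(), ToList(), or a foreach loop) before any cell text becomes available.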

3 Comments

Thanks a lot, it works fine for the first field, but how can I access the fields in the fourth and fifth lines?
Use the same approach: you'll have to find something that uniquely identifies the things you want. As a commenter said, this is a very untrustworthy solution.
In line 4 I can use the string "PERFORMANCE_PCT" while in line 5 the string "PRICE"... but if I use x.Attributes.Contains("PRICE") it doesn't find anything. Maybe that is not an attribute. But what is the field in which those strings appear?
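On the last comment above: Attributes.Contains("PRICE") checks for an attribute named PRICE, whereas the strings PERFORMANCE_PCT and PRICE appear inside the value of each span's id attribute. A minimal sketch of matching on that value with Html Agility Pack's XPath support follows. It assumes every row has the same shape as the extract in the question; the ids in the sample are shortened, and QuoteRowParser and Parse are hypothetical names. Because the full ids also contain ID_TYPE_PRICE, the sketch matches on '.attr:PRICE.' rather than on 'PRICE' alone.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the base class library

public static class QuoteRowParser
{
    // Maps strike price -> { change vs. previous day, last price }.
    public static Dictionary<string, string[]> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string[]>();

        // Rows of interest: any <tr> with a cell whose id starts with "notation".
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr[td[starts-with(@id, 'notation')]]");
        if (rows == null) return result;

        foreach (var row in rows)
        {
            // The leading "." keeps each search inside the current row.
            var strike = row.SelectSingleNode("./td[starts-with(@id, 'notation')]/span");
            var change = row.SelectSingleNode(".//span[contains(@id, '.attr:PERFORMANCE_PCT.')]");
            var last = row.SelectSingleNode(".//span[contains(@id, '.attr:PRICE.')]");
            if (strike == null || change == null || last == null) continue;

            result[strike.InnerText.Trim()] =
                new[] { change.InnerText.Trim(), last.InnerText.Trim() };
        }
        return result;
    }

    public static void Main()
    {
        // Sample row modelled on the question's extract; ids shortened for readability.
        const string html =
            "<table><tr>" +
            "<td class=\"quotesMaxTime1414148558\" id=\"notation115602071\"><span>4,000.00</span></td>" +
            "<td><span>0</span></td>" +
            "<td class=\"icon red\"><span id=\"domhandler:8.mdgObj:ID_TYPE_PRICE-3D1.attr:PERFORMANCE_PCT.wtkm:x\">-3.87%</span></td>" +
            "<td><span id=\"domhandler:9.mdgObj:ID_TYPE_PRICE-3D1.attr:PRICE.wtkm:x\">960.40</span></td>" +
            "</tr></table>";

        foreach (var pair in Parse(html))
            Console.WriteLine($"{pair.Key}: change {pair.Value[0]}, last {pair.Value[1]}");
    }
}
```

To try this on the real page, the document returned by HtmlWeb.Load(url) could be queried with the same XPath expressions. Whether the live markup still matches the extract is an assumption that would need checking.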
1

If you are open to using CsQuery, try this:

    static void Main()
    {
        CsQuery.CQ cq = CsQuery.CQ.CreateFromUrl("http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411");
        string str = cq["#notation115602071 span"].Text();
    }

1

You could use the Html Agility Pack. Unlike XmlDocument or XDocument, the Html Agility Pack is tolerant of malformed HTML (which exists all over the internet and probably on the site you are trying to parse).

Not all HTML pages can be assumed to be valid XML.

With the Html Agility Pack you can load your page and parse it with XPath or an object model similar to System.Xml.

Html Agility Pack

Alternatively, you could use a PDF-to-text converter and parse the resulting text file with much better accuracy, since the website you linked offers a PDF export of that same data:

PDF Export Link

Convert PDF to Text

2 Comments

If you stick with the route you are on, switch to XDocument from the System.Xml.Linq namespace instead of XmlDocument; it's a much better and more LINQ-friendly XML reader. Secondly, don't look for unique data at all: just rely on the index of each TR row. E.g. get all of the TR rows; the first value is in TR index 0, the next one is in TR index 1, and so on.
Thanks for your comment. Can you provide an example of how to use the Html Agility Pack in this context to get the 3 values I'm looking for and then repeat this for the remaining rows of the website? Thanks.
1

We did a similar project a few years back to spider all the major online betting websites and create a comparison tool to get the best prices for each type of event, e.g. display all the major bookmakers with betting odds for a particular football game in order of best return.

It turned out to be a complete nightmare: the rendered HTML output of the websites kept changing almost daily and was quite often poorly formed, which could sometimes crash the spider daemon, so we had to constantly maintain the system to keep it working properly.

With these sorts of things it's often more economical to subscribe to a data feed, which requires much less maintenance and is easier to integrate.
