
I am trying to use C# and the Chrome Web Inspector to log in to http://www.morningstar.com and retrieve some information from the page http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US.

I do not quite understand the mental process one must use to interpret the information from the Web Inspector in order to simulate a login, keep the session alive, and navigate to the next page to collect information.

Can someone explain this, or point me to a resource?

For now, I have only some code to get the content of the home page and the login page:

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Net.Http;
    using System.Threading.Tasks;

    public class Morningstar
    {
        // async Task (rather than async void) so callers can await it and observe exceptions
        public static async Task Run()
        {
            var url = "http://www.morningstar.com/";
            var httpClient = new HttpClient();
            httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
            httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
            httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
            httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");

            var response = await httpClient.GetAsync(new Uri(url));
            response.EnsureSuccessStatusCode();

            // The home page comes back gzip-compressed (requested above), so decompress manually.
            using (var responseStream = await response.Content.ReadAsStreamAsync())
            using (var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress))
            using (var streamReader = new StreamReader(decompressedStream))
            {
                //Console.WriteLine(streamReader.ReadToEnd());
            }

            var loginURL = "https://members.morningstar.com/memberservice/login.aspx";
            response = await httpClient.GetAsync(new Uri(loginURL));
            response.EnsureSuccessStatusCode();

            using (var responseStream = await response.Content.ReadAsStreamAsync())
            using (var streamReader = new StreamReader(responseStream))
            {
                Console.WriteLine(streamReader.ReadToEnd());
            }
        }
    }

EDIT: In the end, on the advice of Muhammed, I used the following piece of code:

    ScrapingBrowser browser = new ScrapingBrowser();
    // Set UseDefaultCookiesParser to false if a website returns cookies in an invalid format.
    //browser.UseDefaultCookiesParser = false;
    WebPage homePage = browser.NavigateToPage(new Uri("https://members.morningstar.com/memberservice/login.aspx"));
    PageWebForm form = homePage.FindFormById("memberLoginForm");
    form["email_textbox"] = "[email protected]";
    form["pwd_textbox"] = "password";
    form["go_button.x"] = "57";
    form["go_button.y"] = "22";
    form.Method = HttpVerb.Post;
    WebPage resultsPage = form.Submit();

2 Answers


You should simulate the login process of the website. The easiest way to do this is to inspect the site's traffic with a debugging proxy (for example, Fiddler).

Here is the login request of the web site:

    POST https://members.morningstar.com/memberservice/login.aspx?CustId=&CType=&CName=&RememberMe=true&CookieTime= HTTP/1.1
    Accept: text/html, application/xhtml+xml, */*
    Referer: https://members.morningstar.com/memberservice/login.aspx
    ** omitted **
    Cookie: cookies=true; TestCookieExist=Exist; fp=001140581745182496; __utma=172984700.91600904.1405817457.1405817457.1405817457.1; __utmb=172984700.8.10.1405817457; __utmz=172984700.1405817457.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=172984700; ASP.NET_SessionId=b5bpepm3pftgoz55to3ql4me

    [email protected]&pwd_textbox=password&remember=on&email_textbox2=&go_button.x=36&go_button.y=16&__LASTFOCUS=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=omitted&__EVENTVALIDATION=omitted

When you inspect this, you'll see some cookies and form fields like "__VIEWSTATE". You'll need the actual values of these fields to log in. You can use the following steps:

  1. Make a GET request to the login page and scrape fields like "__LASTFOCUS", "__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE" and "__EVENTVALIDATION", along with the cookies.
  2. Create a new POST request to the same page; reuse the CookieContainer from the previous request, build a post string from the scraped fields plus the username and password, and post it with the MIME type application/x-www-form-urlencoded.
  3. If the login succeeds, use the cookies for further requests to stay logged in. (A sketch of these steps follows the list.)
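
Below is a minimal sketch of those steps using HttpClient and HtmlAgilityPack. The form field names are taken from the captured request above and should be treated as assumptions to verify against the live login page:

    // Sketch of steps 1-3: scrape the hidden ASP.NET fields, then POST them back
    // together with the credentials, keeping cookies in a shared CookieContainer.
    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    public static class LoginExample
    {
        public static async Task<HttpClient> LogInAsync(string email, string password)
        {
            var cookies = new CookieContainer();
            var handler = new HttpClientHandler
            {
                CookieContainer = cookies, // step 1: response cookies accumulate here
                AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
            };
            var client = new HttpClient(handler);

            // Step 1: GET the login page and scrape the hidden form fields.
            var loginUrl = "https://members.morningstar.com/memberservice/login.aspx";
            var html = await client.GetStringAsync(loginUrl);
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            string Hidden(string id) =>
                doc.GetElementbyId(id)?.GetAttributeValue("value", "") ?? "";

            // Step 2: POST the scraped values plus the credentials back to the same page.
            // FormUrlEncodedContent sets the application/x-www-form-urlencoded MIME type.
            var form = new Dictionary<string, string>
            {
                ["__LASTFOCUS"] = "",
                ["__EVENTTARGET"] = "",
                ["__EVENTARGUMENT"] = "",
                ["__VIEWSTATE"] = Hidden("__VIEWSTATE"),
                ["__EVENTVALIDATION"] = Hidden("__EVENTVALIDATION"),
                ["email_textbox"] = email,
                ["pwd_textbox"] = password,
                ["remember"] = "on",
                ["go_button.x"] = "36",
                ["go_button.y"] = "16",
            };
            var response = await client.PostAsync(loginUrl, new FormUrlEncodedContent(form));
            response.EnsureSuccessStatusCode();

            // Step 3: the CookieContainer now holds the session cookies, so this
            // same client can be used for further authenticated requests.
            return client;
        }
    }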

Note: You can use HtmlAgilityPack or ScrapySharp to scrape the HTML. ScrapySharp provides easy-to-use tools for posting forms and browsing websites.


The mental process is to simulate a person logging in to the website. Some logins are made with AJAX, others with a traditional POST request, so the first thing you need to do is make that request the way the browser does. From the server's response you will get cookies, headers and other information, and you use that information to build the next request; these are the scraping requests.

Steps are:

  1. Build a request, like the browser does, to authenticate yourself to the app.
  2. Inspect the response, and save the headers, cookies, or other useful info needed to persist your session with the server.
  3. Make another request to the server, using the info you gathered in the previous step.
  4. Inspect the response, and use a data-analysis algorithm or something else to extract the data. (See the extraction sketch below.)
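
As a hypothetical illustration of step 4: once you have an authenticated HttpClient (for example, the one from the answer above), you can fetch a page and pull data out with HtmlAgilityPack. The URL is the one from the question; the XPath is a placeholder to replace after inspecting the real markup in the Web Inspector:

    // Sketch of step 4: fetch an authenticated page and extract data by XPath.
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    public static class ExtractExample
    {
        public static async Task PrintRowsAsync(HttpClient authenticatedClient)
        {
            var url = "http://financials.morningstar.com/income-statement/is.html" +
                      "?t=BTDPF&region=usa&culture=en-US";
            var html = await authenticatedClient.GetStringAsync(url);

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Placeholder XPath: replace with the selector that matches the
            // elements you identified in the Web Inspector.
            var rows = doc.DocumentNode.SelectNodes("//table//tr");
            if (rows == null) return; // SelectNodes returns null when nothing matches
            foreach (var row in rows)
                Console.WriteLine(row.InnerText.Trim());
        }
    }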

Tips:

You are not using a JavaScript engine here; some websites use JavaScript to render charts or perform some interaction with the DOM. In those cases you may need a WebKit library wrapper, or a headless browser, as sketched below.
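
For example, a headless browser driven by Selenium WebDriver (used here as a stand-in for the WebKit wrapper the answer mentions; this assumes the Selenium.WebDriver and ChromeDriver NuGet packages are installed) lets the page's JavaScript run before you read the DOM:

    // Sketch: render a JavaScript-heavy page in headless Chrome, then read the DOM.
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    class HeadlessExample
    {
        static void Main()
        {
            var options = new ChromeOptions();
            options.AddArgument("--headless"); // run Chrome without a visible window

            using (IWebDriver driver = new ChromeDriver(options))
            {
                driver.Navigate().GoToUrl("http://www.morningstar.com/");
                // PageSource now contains the DOM after the page's JavaScript has run.
                System.Console.WriteLine(driver.PageSource.Length);
            }
        }
    }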

