Update
In my effort to figure this out, I did some more thinking and digging. I hope sharing more will help start a dialog with someone. This is what I discovered:
To get the lay of the land, I visited http://ca.megabus.com/BusStops.aspx and viewed all the GET requests in the network view. I then used the drop-down menus to choose a random origin and destination, which generated a POST request. I did not click search, though. From there, I opened the BusStops.aspx entry generated by the POST in the left sidebar.
Inside it, I focused on the event target in the header, which is:
    __EVENTTARGET:confirm1$ddlTravellingTo
and the view state, which is a really long string of seemingly random letters and numbers. I assume this is because these are hidden fields. I also noticed this value in the header:
    X-MicrosoftAjax:Delta=true
which I had also seen on GitHub. lawnjam has a gist that scrapes the megabus UK site using Python:
https://github.com/lawnjam/megabus-scraper/blob/master/megabus-routes.py
megaSoup in that code appears to be a BeautifulSoup object, and BeautifulSoup seems to be a Python counterpart to Nokogiri (I think). At any rate, there is that, and urllib2 seems to be Python's library for making HTTP requests, I think (http://docs.python.org/3/library/urllib.html). I am 90ish percent sure Mechanize gives me all of that, especially since that gist is 3 years old.
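To check my reading of the urllib part, here is a minimal sketch of what I believe those calls do. The gist is Python 2 (urllib and urllib2); in Python 3 the same pieces live in urllib.parse and urllib.request, so that is what this uses. The build_post helper is my own name, the two form fields are only a small illustrative subset, the view state value is a placeholder, and nothing is actually sent to the site.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, values, headers):
    """Encode a dict of form fields and wrap it in a request object.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    # urlencode turns {'a': 'b c'} into 'a=b+c' and escapes characters like '$'
    data = urlencode(values).encode("ascii")
    # Supplying data makes urllib treat the request as a POST
    return Request(url, data=data, headers=headers)


# Illustrative subset of the form fields; the real form needs many more
values = {
    "__EVENTTARGET": "confirm1$ddlTravellingTo",
    "__VIEWSTATE": "PLACEHOLDER",  # placeholder, not a real view state
}
headers = {"X-MicrosoftAjax": "Delta=true"}

req = build_post("http://ca.megabus.com/BusStops.aspx", values, headers)
print(req.get_method())
print(req.data)
```

If this is right, the Ruby equivalent would just be whatever builds a form-encoded POST body with custom headers, which I assume Mechanize can do.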
OK, back to the matter at hand. From what I can decipher in that code, it looks like lawnjam pulls all the data fields manually and sets them as new local variables. Take the headers and values, for example:
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Language': 'en-gb,en;q=0.8,en-us;q=0.5,gd;q=0.3',
               'Accept-Encoding': 'gzip,deflate',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'}

    # set other form values
    values = {
        'Welcome1_ScriptManager1_HiddenField': '',
        'Welcome1$ScriptManager1': 'SearchAndBuy1$upSearchAndBuy|SearchAndBuy1$ddlLeavingFrom',
        '__EVENTTARGET': 'SearchAndBuy1$ddlLeavingFrom',
        '__EVENTARGUMENT': '',
        'Welcome1$hdnBasketItemCount': '0',
        'Language1$ddlLanguage': 'en',
        'SearchAndBuy1$txtPassengers': '1',
        'SearchAndBuy1$txtConcessions': '0',
        'SearchAndBuy1$txtNUSExtra': '0',
        'SearchAndBuy1$txtOutboundDate': '',
        'SearchAndBuy1$txtReturnDate': '',
        'SearchAndBuy1$txtPromotionalCode': '',
        '__ASYNCPOST': 'true'
    }
    headers['X-MicrosoftAjax'] = 'Delta=true'

From there, though, it gets hazy. To elaborate, in this next section of code, it looks like he is taking the values and assigning them to local variables again, but I am not sure how to approach writing a loop like that in Ruby, or whether I can even do that. The urllib2 usage is throwing me off:
    for a in startLocations:
        values['SearchAndBuy1$ddlLeavingFrom'] = a
        values['__EVENTVALIDATION'] = eventvalidation
        values['__VIEWSTATE'] = viewstate
        data = urllib.urlencode(values)
        req = urllib2.Request('http://uk.megabus.com/default.aspx', data, headers)

Next, I think he is referring to
    UserStatus$ScriptManager1:confirm1$UpdatePanel1|confirm1$ddlTravellingTo
from the form data section of the POST file's header in the inspector's network tab when he coded this part:
    # store the received (pipe-separated) data in a list
    L = urllib2.urlopen(req).read().split('|')

Now this is where I fall further down the rabbit hole. I can figure out that this next loop is just iterating through each location one at a time, but I do not know what position is or where it is defined. The Python style might be throwing me off here:
    for position, item in enumerate(L):
        if item == 'SearchAndBuy1_upSearchAndBuy':
            html = L[position + 1]
        if item == '__VIEWSTATE':
            viewstate = L[position + 1]  # save __VIEWSTATE for the next iteration
        if item == '__EVENTVALIDATION':
            eventvalidation = L[position + 1]  # save __EVENTVALIDATION for the next iteration

This next part seems to be where the list of stops gets populated, but BeautifulSoup is throwing me off. Is it analogous to something like this in Ruby?
    agent = Mechanize.new
    options = agent.find(name=…..
    megaSoup = BeautifulSoup(html)
    options = megaSoup.find(name='select', attrs={'name': 'SearchAndBuy1$ddlTravellingTo'}).findAll('option')
    endLocations = {}
    for o in options:
        if int(o['value']) > 0:
            print '"' + startLocations[a] + '","' + o.find(text=True) + '"'
            #endLocations[int(o['value'])] = o.find(text=True)

I would be appreciative of any feedback.
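To make the question concrete, here is a small self-contained sketch of what I think the last two parts of the gist are doing, so someone can tell me where my reading goes wrong. If I understand enumerate correctly, it hands back an index along with each item, so position would simply be 0, 1, 2 and so on. Assumptions: the SAMPLE string is made up by me and only imitates the pipe-separated layout (it is not a real megabus response, and the option values and cities are invented), pick_after and OptionCollector are my own names, and I swapped BeautifulSoup for the standard library's html.parser so the sketch has no dependencies.

```python
from html.parser import HTMLParser


def pick_after(parts, wanted):
    """Return {label: the item right after it} for each wanted label."""
    found = {}
    # enumerate yields (index, item) pairs, so position is just the index
    for position, item in enumerate(parts):
        if item in wanted and position + 1 < len(parts):
            found[item] = parts[position + 1]
    return found


class OptionCollector(HTMLParser):
    """Collect (value, text) pairs from <option> tags in one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False
        self.current_value = None
        self.options = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("name") == self.select_name:
            self.in_select = True
        elif tag == "option" and self.in_select:
            self.current_value = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False
        elif tag == "option":
            self.current_value = None

    def handle_data(self, data):
        # Text that appears while an <option> is open is that option's label
        if self.current_value is not None and data.strip():
            self.options.append((self.current_value, data.strip()))


# Made-up sample that imitates the pipe-separated layout
SAMPLE = (
    "1|updatePanel|SearchAndBuy1_upSearchAndBuy|"
    '<select name="SearchAndBuy1$ddlTravellingTo">'
    '<option value="0">Select</option>'
    '<option value="13">Toronto</option>'
    '<option value="21">Montreal</option>'
    "</select>"
    "|0|hiddenField|__VIEWSTATE|abc123|0|hiddenField|__EVENTVALIDATION|xyz789|"
)

parts = SAMPLE.split("|")
found = pick_after(
    parts, {"SearchAndBuy1_upSearchAndBuy", "__VIEWSTATE", "__EVENTVALIDATION"}
)

collector = OptionCollector("SearchAndBuy1$ddlTravellingTo")
collector.feed(found["SearchAndBuy1_upSearchAndBuy"])

# Like the gist, skip the placeholder option whose value is 0
end_locations = {int(v): t for v, t in collector.options if int(v) > 0}
print(found["__VIEWSTATE"], end_locations)
```

One thing I noticed while writing this: splitting on the pipe character would break if the HTML or the view state ever contained a pipe itself, so this approach seems to rely on that not happening.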