I'm extracting HTML with mechanize, but the output is not what I expect: mechanize appears to replace the content of certain elements with '(n/a)'.

Example (structure shown in Firebug)

<tr>
  <td>
    <img class="bullet" src="images/bulletorange.gif" alt="">
    <span class="detailCaption">Video Format Mode:</span>
    <span class="settingValue" id="vidSdSdiAnlgFormatSelectionMode.1.1">Auto</span>
  </td>
</tr>

Example (structure output by Mechanize)

<tr>
  <td>
    <img class='bullet' src='images/bulletorange.gif' alt='' />
    <span class='detailCaption'>Video Format Mode:</span>
    <span class='settingValue' id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>
  </td>
</tr>

The problem is that "Auto" is being replaced by "(n/a)". I'm not really sure why!

Please help. Why is mechanize doing this? And how can I fix it?

My code is below:

import ssl

import mechanize


def login_and_return_html(self, url_login, url_after_login, form_username,
                          form_password, username, password):
    """
    Description:
        Returns HTML code from a website that requires login.

    Input Arguments:
        url_login (str) - The URL where you enter the login username and password
        url_after_login (str) - The URL where you want to go after you log in
        form_username (str) - The name of the form for the username input field
        form_password (str) - The name of the form for the password input field
        username (str) - The actual username
        password (str) - The actual password

    Return or Output:
        Returns HTML code of the 'url_after_login' page

    Modules and Classes:
        mechanize
        ssl
    """
    try:
        # Disable SSL certificate validation
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        # Legacy Python that doesn't verify HTTPS certificates by default
        pass
    else:
        # Handle target environment that doesn't support HTTPS verification
        ssl._create_default_https_context = _create_unverified_https_context

    br = mechanize.Browser()  # Browser
    br.set_handle_equiv(True)  # Browser options
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    cj = mechanize.CookieJar()  # Cookie jar
    br.set_cookiejar(cj)

    # Follows refresh 0 but does not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    br.open(url_login)  # Login
    br.select_form(nr=0)
    try:
        br.form[form_username] = username  # Fill in the blank username field
        br.form[form_password] = password  # Fill in the blank password field
        br.submit()
    except Exception:
        # The username field is a dropdown menu: select the matching item
        control = br.form.find_control(form_username)
        for item in control.items:
            if item.name == username:
                item.selected = True
        br.form[form_password] = password  # Fill in the blank password field
        br.submit()

    html = br.open(url_after_login).read()
    return html

2 Answers

Why is mechanize doing this?

Mechanize probably isn't doing this; the browser is. My guess is that the site uses JavaScript, which mechanize does not support, so you get the HTML in its original form, i.e. the content as it was before any JavaScript was executed.
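To illustrate the guess above, here is a minimal sketch using only Python 3's standard library. The page source in `RAW_HTML` is hypothetical: it assumes the server sends "(n/a)" as a placeholder and that a script overwrites it with "Auto" once it runs in a browser. The real site may fill the value in some other way (for example via an AJAX request). A parser that never executes the script sees only the placeholder:

```python
from html.parser import HTMLParser

# Hypothetical page source modelled on the question's markup (an assumption,
# not the real site): the placeholder is in the HTML, the value comes from JS.
RAW_HTML = """
<tr><td>
<span class="detailCaption">Video Format Mode:</span>
<span class="settingValue" id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>
</td></tr>
<script>
document.getElementById("vidSdSdiAnlgFormatSelectionMode.1.1").textContent = "Auto";
</script>
"""


class SpanTextById(HTMLParser):
    """Collect the text of the first <span> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        # Script bodies also arrive here, but only as inert text.
        if self.inside and self.text is None:
            self.text = data.strip()


def static_span_text(html, target_id):
    """Return the span's text as served, without running any JavaScript."""
    parser = SpanTextById(target_id)
    parser.feed(html)
    return parser.text


print(static_span_text(RAW_HTML, "vidSdSdiAnlgFormatSelectionMode.1.1"))  # (n/a)
```

Firebug shows the DOM after scripts have run, which would explain why it displays "Auto" while a non-JavaScript client reports "(n/a)".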

And how can I fix it?

You can't fix it with mechanize; you need a solution that supports JavaScript. See "Mechanize and Javascript" for more information and possible solutions.


1 Comment

Let me give it a try and see what I come up with. Thanks!

Here is how I was able to obtain the HTML, including the content generated by JavaScript.

I used the selenium library.

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time

# Using Firefox 48.0.2 and the new WebDriver
caps = DesiredCapabilities.FIREFOX
caps["marionette"] = True
br = webdriver.Firefox(capabilities=caps)
br.get('http://XXX.XXX.XXX.XXX/')

# Input username and password
username = br.find_element_by_name('SOME_NAME')
username.send_keys('USERNAME')
password = br.find_element_by_name('SOME_NAME')
password.send_keys('PASSWORD')
form = br.find_element_by_name('submitButton')
form.submit()

# Wait for the page to finish loading
time.sleep(20)

# THIS IS WHAT IS DIFFERENT...
td_element = br.find_element_by_xpath('/html')
html = br.execute_script("return arguments[0].innerHTML;", td_element)
print(html)
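The fixed `time.sleep(20)` always costs 20 seconds and can still be too short on a slow page. One possible alternative is to poll until the placeholder disappears. The sketch below is plain Python 3 and not part of Selenium; `wait_for_value` and its `read_value` argument are hypothetical names, and the "page" here is simulated with an iterator so the example runs without a browser:

```python
import time


def wait_for_value(read_value, placeholder="(n/a)", timeout=20.0, interval=0.5):
    """Poll read_value() until it returns something other than placeholder.

    Returns the new value, or raises RuntimeError once timeout seconds have
    passed and the placeholder is still being reported.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value != placeholder:
            return value
        if time.monotonic() >= deadline:
            raise RuntimeError(
                "value was still %r after %.1f s" % (placeholder, timeout)
            )
        time.sleep(interval)


# Stand-in for a real page: reports the placeholder twice, then the real value.
_responses = iter(["(n/a)", "(n/a)", "Auto"])
result = wait_for_value(lambda: next(_responses), interval=0.01)
print(result)  # Auto
```

With the Selenium session from the answer, `read_value` could be something like `lambda: br.find_element_by_id("vidSdSdiAnlgFormatSelectionMode.1.1").text`. Selenium also ships its own explicit-wait helper, `WebDriverWait` in `selenium.webdriver.support.ui`, which serves the same purpose.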
