Python Mechanize HTML code different from Firebug HTML code

Question

I'm extracting some HTML using mechanize, but the HTML it returns is not what I expect. It looks as if mechanize is replacing the content of certain elements with '(n/a)'.

Example (structure shown in Firebug)

    <tr>
      <td>
        <img class="bullet" src="images/bulletorange.gif" alt="">
        <span class="detailCaption">Video Format Mode:</span>
        <span class="settingValue" id="vidSdSdiAnlgFormatSelectionMode.1.1">Auto</span>
      </td>
    </tr>

Example (structure output by Mechanize)

    <tr>
      <td>
        <img class='bullet' src='images/bulletorange.gif' alt='' />
        <span class='detailCaption'>Video Format Mode:</span>
        <span class='settingValue' id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>
      </td>
    </tr>

The problem is that "Auto" is being replaced by "(n/a)". I'm not really sure why!

Please help. Why is mechanize doing this? And how can I fix it?

My code is below:

    import ssl

    import mechanize

    def login_and_return_html(self, url_login, url_after_login, form_username,
                              form_password, username, password):
        """
        Description:
            Returns HTML code from a website that requires login.
        Input Arguments:
            url_login (str) - The URL where you enter the login username and password
            url_after_login (str) - The URL where you want to go after you log in
            form_username (str) - The name of the form field for the username
            form_password (str) - The name of the form field for the password
            username (str) - The actual username
            password (str) - The actual password
        Return or Output:
            Returns HTML code of the 'url_after_login' page
        Modules and Classes:
            mechanize
            ssl
        """
        try:
            # Disable SSL certificate validation
            _create_unverified_https_context = ssl._create_unverified_context
        except AttributeError:
            # Legacy Python that doesn't verify HTTPS certificates by default
            pass
        else:
            # Handle target environment that doesn't support HTTPS verification
            ssl._create_default_https_context = _create_unverified_https_context

        br = mechanize.Browser()       # Browser
        br.set_handle_equiv(True)      # Browser options
        br.set_handle_redirect(True)
        br.set_handle_referer(True)
        br.set_handle_robots(False)

        cj = mechanize.CookieJar()     # Cookie jar
        br.set_cookiejar(cj)

        # Follow refresh 0, but don't hang on refresh > 0
        br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

        br.open(url_login)             # Login
        br.select_form(nr=0)
        try:
            br.form[form_username] = username   # Fill in the username text field
            br.form[form_password] = password   # Fill in the password text field
            br.submit()
        except Exception:
            # The username field is a drop-down menu rather than a text field
            control = br.form.find_control(form_username)
            for item in control.items:
                if item.name == username:
                    item.selected = True
            br.form[form_password] = password   # Fill in the password text field
            br.submit()

        html = br.open(url_after_login).read()
        return html

Community · Accepted Answer · 2017-05-23 10:29:14Z

Why is mechanize doing this?

Mechanize probably isn't; the browser is. My guess is that the site uses JavaScript, which mechanize does not support, so you get the HTML in its original form, i.e. the content before any JavaScript has run. Firebug, by contrast, shows the live DOM after the page's scripts have filled in the values.
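One way to check this guess is to scan the HTML that mechanize returns for values that are still placeholders. The sketch below is only an illustration: it uses the standard-library `html.parser` (Python 3) and assumes that the relevant values live in `<span class="settingValue">` elements and that `(n/a)` is the placeholder, as in the question's example. `PlaceholderFinder` and `find_unfilled_settings` are hypothetical names, not part of mechanize.

```python
from html.parser import HTMLParser


class PlaceholderFinder(HTMLParser):
    """Collect the ids of settingValue spans whose text is still the placeholder."""

    def __init__(self, placeholder="(n/a)"):
        super().__init__()
        self.placeholder = placeholder
        self.unfilled = []       # ids of spans that still hold the placeholder
        self._inside = False     # True while reading a settingValue span
        self._span_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "settingValue":
            self._inside = True
            self._span_id = attrs.get("id")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._inside:
            if "".join(self._text).strip() == self.placeholder:
                self.unfilled.append(self._span_id)
            self._inside = False


def find_unfilled_settings(html):
    """Return the ids of settingValue spans that were never filled in."""
    finder = PlaceholderFinder()
    finder.feed(html)
    return finder.unfilled


# The markup from the question, as returned without JavaScript execution.
static_html = """
<tr> <td> <img class='bullet' src='images/bulletorange.gif' alt='' />
<span class='detailCaption'>Video Format Mode:</span>
<span class='settingValue' id="vidSdSdiAnlgFormatSelectionMode.1.1">(n/a)</span>
</td> </tr>
"""

print(find_unfilled_settings(static_html))
```

If this returns a non-empty list for the mechanize output but the same elements show real values in the browser, the values are most likely being written by a script after the page loads.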

And how can I fix it?

You can't with mechanize; you need a tool that supports JavaScript. See Mechanize and Javascript for more information and possible solutions.

Carlos Rodriguez · Accepted Answer · 2016-08-30 19:03:51Z

Here is how I was able to obtain the HTML together with the JavaScript-generated content.

I used the selenium library.

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import time

    # Using Firefox 48.0.2 and the new WebDriver
    caps = DesiredCapabilities.FIREFOX
    caps["marionette"] = True
    br = webdriver.Firefox(capabilities=caps)
    br.get('http://XXX.XXX.XXX.XXX/')

    # Input Username and Password
    username = br.find_element_by_name('SOME_NAME')
    username.send_keys('USERNAME')
    password = br.find_element_by_name('SOME_NAME')
    password.send_keys('PASSWORD')
    form = br.find_element_by_name('submitButton')
    form.submit()
    time.sleep(20)

    # THIS IS WHAT IS DIFFERENT...
    td_element = br.find_element_by_xpath('/html')
    html = br.execute_script("return arguments[0].innerHTML;", td_element)
    print(html)
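A fixed `time.sleep(20)` either waits longer than necessary or not long enough. A possible alternative is to poll until the value has actually changed from the placeholder. The following is a minimal sketch in Python 3 using only the standard library; `wait_until` and `FakeElement` are hypothetical names introduced for illustration, and `FakeElement` merely stands in for a page element whose text is filled in by a script after a delay.

```python
import time


def wait_until(condition, timeout=20.0, interval=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakeElement:
    """Stand-in for an element whose text changes once a script has run."""

    def __init__(self, ready_after):
        self._ready_at = time.monotonic() + ready_after

    @property
    def text(self):
        return "Auto" if time.monotonic() >= self._ready_at else "(n/a)"


element = FakeElement(ready_after=0.2)

# Poll until the placeholder is gone, then return the real text.
value = wait_until(
    lambda: element.text != "(n/a)" and element.text,
    timeout=2.0,
    interval=0.05,
)
print(value)
```

With the Selenium session above, the condition could instead read the text of the real element, for example the span with id `vidSdSdiAnlgFormatSelectionMode.1.1` from the question. Selenium also provides its own `WebDriverWait` helper in `selenium.webdriver.support.ui` for this purpose.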
