I am trying to read a table from a web page. My company has strict authentication policies that restrict how we can scrape data, but the following code is what I am using to fetch the page:

    import os
    from urllib.request import urlopen

    import lxml.html as LH
    import pandas as pd
    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL

    # Point requests at the company CA bundle (raw string, so single backslashes)
    cert = r"C:\Users\name\Desktop\cacert.pem"
    os.environ["REQUESTS_CA_BUNDLE"] = cert

    kerberos = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
    session = requests.Session()

    link = 'weblink'
    # verify=False skips certificate verification, so the CA bundle above is not used for this request
    data = session.get(link, auth=kerberos, verify=False).content.decode("latin-1")

And that leaves me with the entire HTML of the webpage in "data". How do I convert this into a dataframe?

Note: I can't provide the web link due to privacy concerns. I was just wondering whether there is a general way to tackle this situation.

  • How could we help without knowing anything about the data? Commented Oct 21, 2019 at 3:57
  • I was just wondering if there was a procedure to convert the HTML into a dataframe. That's what the question is about. Commented Oct 21, 2019 at 4:06
  • Use pandas.read_html: if the page contains tables, they can be read directly into pandas. Commented Oct 21, 2019 at 5:06

1 Answer

It looks like you're looking for something like this, using BeautifulSoup?

From there, you still have to build the data frame yourself, but you will have handled the 'procedure to convert the HTML into' a data structure step: read the HTML table into a list or dictionary, and then turn that into a dataframe.
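As a rough sketch of that two-step approach (parse the table into a list of rows, then build the dataframe), something like the following might work. The sample HTML stands in for the `data` string from the question, `table_to_dataframe` is a hypothetical helper name, and the code assumes the first row of the table holds the column names; real pages often need more care (nested tables, `colspan`, missing headers).

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for the `data` string returned by session.get(...) in the question.
data = """
<html><body>
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
</body></html>
"""


def table_to_dataframe(html, table_index=0):
    """Parse one <table> from an HTML string into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    if not tables:
        raise ValueError("no <table> elements found in the HTML")

    # Step 1: collect the text of every cell, row by row, into a list of lists.
    rows = []
    for tr in tables[table_index].find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

    # Step 2: build the DataFrame. Assumption: the first row is the header.
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


df = table_to_dataframe(data)
print(df)
```

Every cell comes back as a string with this approach, so numeric columns would still need a conversion such as `pd.to_numeric`.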

Edit 1

Actually, you can use pandas' read_html. You might still need BeautifulSoup to get exactly what you want, but depending on what the source HTML looks like, read_html alone might be enough.
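A minimal sketch of the `read_html` route, again using a made-up HTML string in place of the `data` variable from the question. It assumes a parser backend is installed (`lxml`, or `bs4` together with `html5lib`), and it wraps the string in `StringIO` because recent pandas versions expect a file-like object or URL rather than a literal HTML string.

```python
from io import StringIO

import pandas as pd

# Stand-in for the `data` string returned by session.get(...) in the question.
data = """
<html><body>
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
</body></html>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(data))

# Pick the table you need; here there is only one.
df = tables[0]
print(df)
```

If the page contains several tables, the `match` or `attrs` arguments of `read_html` can narrow down which ones are returned.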
