Web scraping a hidden table using Python

Question

I am trying to scrape the "Traits" table from this website: https://www.ebi.ac.uk/gwas/genes/SAMD12 (the URL can change depending on which gene I need, but the page structure will be the same).

The problem is that my web scraping knowledge is quite limited, and I can't get this table using the basic BeautifulSoup workflow I've seen so far.

Here's my code:

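    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
    page = requests.get(url)
    # Parse the returned HTML so that `soup` can be queried below
    soup = BeautifulSoup(page.content, 'html.parser')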

I'm looking for the "efotrait-table":

    efotrait = soup.find('div', id='efotrait-table-loading')
    print(efotrait.prettify())

    <div class="row" id="efotrait-table-loading" style="margin-top:20px">
      <div class="panel panel-default" id="efotrait_panel">
        <div class="panel-heading background-color-primary-accent">
          <h3 class="panel-title">
            <span class="efotrait_label"> Traits </span>
            <span class="efotrait_count badge available-data-btn-badge"> </span>
          </h3>
          <span class="pull-right">
            <span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
              <span class="glyphicon glyphicon-chevron-up"> </span>
            </span>
          </span>
        </div>
        <div class="panel-body">
          <table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
          </table>
        </div>
      </div>
    </div>

Specifically, this one:

    soup.select('table#efotrait-table')[0]

    <table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
    </table>

As you can see, the table's content doesn't show up. On the website, there's an option to save the table as CSV. It would be awesome if I could get that download link somehow, but when I click the link to copy it, I get "javascript:void(0)" instead. I haven't studied JavaScript; should I?

The table is hidden, and even if it weren't, I would need to interactively select more rows per page to get the whole table (and the URL doesn't change when I do, so I can't get the full table that way either).

I would like to know a way to access this table programmatically, even as unstructured data; tidying it up afterwards will be no problem. Any clues on how to do that (or what I should study) would be greatly appreciated.

Thanks in advance

@jaibalaji, I'll definitely try this! Do you think that, without an API, this is the only option? Or at least the go-to one?
– Cainã Max Couto da Silva, Commented May 17, 2020 at 0:32

αԋɱҽԃ αмєяιcαη · Accepted Answer · 2020-05-16 12:29:25Z

The desired data is available from an API call.

    import requests

    # Form fields sent to the search endpoint; the gene symbol appears in "q"
    data = {
        "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
        "max": "99999",
        "group.limit": "99999",
        "group.field": "resourcename",
        "facet.field": "resourcename",
        "hl.fl": "shortForm,efoLink",
        "hl.snippets": "100",
        "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
        "raw": "fq:resourcename:association or resourcename:study"
    }


    def main(url):
        # POST the form data and decode the JSON response into a dict
        r = requests.post(url, data=data).json()
        print(r)


    main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

You can inspect r.keys() and pull out the data you need by indexing into the resulting dict.

But here's a quick way to load it (lazy code), pulling the fields out of the raw response text with a regex:
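
As a rough sketch of what exploring the keys can look like: the helper below walks any nested JSON-like structure and reports every list of dicts it finds, together with the key path that leads to it. The name `find_record_lists` is hypothetical, and the `sample` dict is made up for illustration; it is not the real response layout of the GWAS endpoint.

```python
def find_record_lists(obj, path=()):
    """Yield (path, records) for every non-empty list of dicts found in obj."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_record_lists(value, path + (key,))
    elif isinstance(obj, list):
        if obj and all(isinstance(item, dict) for item in obj):
            yield path, obj
        # Keep descending: records may themselves contain nested lists.
        for index, item in enumerate(obj):
            yield from find_record_lists(item, path + (index,))


# Made-up sample, NOT the real response layout of the GWAS endpoint.
sample = {
    "meta": {"count": 2},
    "results": {
        "docs": [
            {"traitName_s": "Hair color", "mappedLabel": ["hair colour measurement"]},
            {"traitName_s": "Pancreatitis", "mappedLabel": ["pancreatitis"]},
        ]
    },
}

found = list(find_record_lists(sample))
for path, records in found:
    print(path, len(records))
```

Once a path looks like the table you want, passing that list of dicts to `pd.DataFrame(records)` should give one row per record.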

    import requests
    import re
    import pandas as pd

    # Form fields sent to the search endpoint; the gene symbol appears in "q"
    data = {
        "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
        "max": "99999",
        "group.limit": "99999",
        "group.field": "resourcename",
        "facet.field": "resourcename",
        "hl.fl": "shortForm,efoLink",
        "hl.snippets": "100",
        "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
        "raw": "fq:resourcename:association or resourcename:study"
    }


    def main(url):
        r = requests.post(url, data=data)
        # Collect unique (mappedLabel, traitName_s) pairs from the raw response text
        match = {item.group(2, 1) for item in re.finditer(
            r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
        df = pd.DataFrame.from_dict(match)
        print(df)


    main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

Output:

    0              heel bone mineral density  Heel bone mineral density
    1              interleukin-8 measurement  Chronic obstructive pulmonary disease-related ...
    2   self reported educational attainment  Educational attainment (years of education)
    3                        waist-hip ratio  Waist-hip ratio
    4             eye morphology measurement  Eye morphology
    5                       CC16 measurement  Chronic obstructive pulmonary disease-related ...
    6         age-related hearing impairment  Age-related hearing impairment (SNP x SNP inte...
    7    eosinophil percentage of leukocytes  Eosinophil percentage of white cells
    8          coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
    9                     multiple sclerosis  Multiple sclerosis
    10                  mathematical ability  Highest math class taken (MTAG)
    11                 risk-taking behaviour  General risk tolerance (MTAG)
    12         coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
    13  self reported educational attainment  Educational attainment (MTAG)
    14                          pancreatitis  Pancreatitis
    15               hair colour measurement  Hair color
    16                      breast carcinoma  Breast cancer specific mortality in breast cancer
    17                      eosinophil count  Eosinophil counts
    18                     self rated health  Self-rated health
    19                          bone density  Bone mineral density
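
Since the question mentions that the gene in the URL can change, here is a minimal sketch of parameterizing the query. It assumes the endpoint accepts other gene symbols in the same `q` pattern used above, which has only been shown here for SAMD12. `build_query` and `build_payload` are hypothetical helper names, `BRCA1` is just an example symbol, and `base` is a trimmed-down stand-in for the full `data` dict.

```python
def build_query(gene):
    """Return the `q` search string for one gene symbol."""
    # Same pattern as the SAMD12 query, with the gene symbol swapped in.
    return (
        f'ensemblMappedGenes: "{gene}" OR '
        f'association_ensemblMappedGenes: "{gene}"'
    )


def build_payload(base, gene):
    """Copy a base form payload and replace only its `q` entry."""
    payload = dict(base)  # shallow copy so the base dict is left untouched
    payload["q"] = build_query(gene)
    return payload


# Trimmed-down stand-in for the full form payload (assumption for illustration).
base = {"q": build_query("SAMD12"), "max": "99999"}
payload = build_payload(base, "BRCA1")
print(payload["q"])
```

The resulting dict could then be passed as `data=payload` to `requests.post`, leaving the other form fields unchanged.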

Hi, thanks very much! Really! I'm diving into that dictionary now... In fact, I had seen the GWAS API (ebi.ac.uk/gwas/docs/api), but I couldn't find a way to use the gene name (or ID) as an input. So I wonder, how did you find this out?
@CainãMaxCouto-Silva check that and you will understand stackoverflow.com/a/61515665/7658985
