In general, to get Open Graph protocol (OGP) data for a given web page, one would need to retrieve the actual HTML, and then extract the meta tags from it.
However, this has two problems:
- Instead of getting only the meta tags, the whole page needs to be downloaded.
- If the website denies bots, it becomes impossible to get OGP data.
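For reference, the naive approach looks roughly like this (a minimal sketch in Python, assuming the requests and beautifulsoup4 packages; the User-Agent string and URL are just placeholders):

```python
# Minimal sketch of the naive approach: download the whole page, then pull
# out the og:* meta tags. Assumes the requests and beautifulsoup4 packages.
import requests
from bs4 import BeautifulSoup

def fetch_ogp(url: str) -> dict[str, str]:
    # Problem 1: the entire page is downloaded just to read a few meta tags.
    resp = requests.get(
        url,
        headers={"User-Agent": "example-ogp-fetcher/0.1"},  # placeholder UA
        timeout=10,
    )
    # Problem 2: sites that deny bots typically answer with a 403 or a block page here.
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    ogp = {}
    for tag in soup.find_all("meta"):
        prop = tag.get("property") or ""
        if prop.startswith("og:"):
            ogp[prop] = tag.get("content", "")
    return ogp

print(fetch_ogp("https://example.com/"))
```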
Are there APIs that would retrieve just the OGP data?
If not, how exactly am I expected to get OGP data in a responsible way, without risking being treated as a web scraper and without needlessly wasting the server's bandwidth?
Update (2025/09/11):
Dynamic sites don't deny access outright, but they return fallback content meant for non-JS users, which includes messages such as "Please enable JS and disable any ad blocker".
Actual cases that I encounter are:
I'm running a trend-analysis bot on Bluesky, which shows trending words and news article links with thumbnails if they exist: https://bsky.app/profile/did:plc:wwqlk2n45es2ywkwrf4dwsr2/lists/3kob6kalezl2a
However, some sites have a bot-blocking system, such as Cloudflare's anti-AI-scraper protection. Example: https://www.wsj.com/us-news/law/epstein-birthday-book-congress-9d79ab34 (if you've already visited the site before, the confirmation screen will not show).
Since OGP meta tags live inside the `<head></head>` tag, you can always just download the page (and parse it) chunk by chunk, until you see the `</head>` tag. Choose an XML parser that works chunk by chunk; there are plenty of them. Doable, but a pain in the a**. Plus, closing an incomplete connection might look suspicious.
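A rough sketch of that chunk-by-chunk idea (assuming Python, requests for streaming, and the standard-library HTMLParser rather than an XML parser; class and function names are just illustrative):

```python
# Sketch: stream the response and stop as soon as </head> is seen, so only
# the first few KB are downloaded. Uses requests + the stdlib HTMLParser.
from html.parser import HTMLParser
import requests

class HeadOGPParser(HTMLParser):
    """Collects og:* meta tags and flags when </head> has been reached."""
    def __init__(self):
        super().__init__()
        self.ogp = {}
        self.head_closed = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            prop = d.get("property") or ""
            if prop.startswith("og:"):
                self.ogp[prop] = d.get("content", "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.head_closed = True

def fetch_ogp_streaming(url: str) -> dict:
    parser = HeadOGPParser()
    with requests.get(
        url,
        stream=True,
        headers={"User-Agent": "example-ogp-fetcher/0.1"},  # placeholder UA
        timeout=10,
    ) as resp:
        resp.raise_for_status()
        resp.encoding = resp.encoding or "utf-8"  # fall back if not declared
        for chunk in resp.iter_content(chunk_size=4096, decode_unicode=True):
            parser.feed(chunk)
            if parser.head_closed:
                break  # stop reading; leaving the `with` block closes the connection early
    return parser.ogp

print(fetch_ogp_streaming("https://example.com/"))
```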