
In general, to get Open Graph protocol (OGP) data for a given web page, one would need to retrieve the actual HTML, and then extract the meta tags from it.
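
For reference, the straightforward approach looks roughly like this (a minimal sketch using only Python's standard library; the URL and the User-Agent string are placeholders):

    from html.parser import HTMLParser
    from urllib.request import Request, urlopen  # standard library only

    class OGParser(HTMLParser):
        """Collect <meta property="og:..."> tags into a dict."""
        def __init__(self):
            super().__init__()
            self.og = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                a = dict(attrs)
                prop = a.get("property") or ""
                if prop.startswith("og:"):
                    self.og[prop] = a.get("content", "")

    # Placeholder URL; a real client should send an honest User-Agent.
    req = Request("https://example.com/article",
                  headers={"User-Agent": "my-ogp-fetcher/0.1"})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")

    parser = OGParser()
    parser.feed(html)
    print(parser.og)  # e.g. {'og:title': '...', 'og:image': '...'}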

However, this has two problems:

  1. Instead of getting only meta tags, the whole page needs to be downloaded.

  2. If the website denies bots, it becomes impossible to get OGP data.

Are there APIs that would retrieve just the OGP data?

If not, how exactly am I expected to get OGP data in a responsible way, without risking being treated as a web scraper and without needlessly wasting the server's bandwidth?

Update (2025/09/11):

Dynamic sites don't deny access outright, but they return fallback content for non-JS clients, including messages such as "Please enable JS and disable any ad blocker".

An actual case that I encounter:

I'm running a trend-analysis bot on Bluesky that posts trending words and links to news articles, with thumbnails when they exist: https://bsky.app/profile/did:plc:wwqlk2n45es2ywkwrf4dwsr2/lists/3kob6kalezl2a

However, some sites have a bot-blocking system, such as Cloudflare's anti-AI-scraper protection. Example: https://www.wsj.com/us-news/law/epstein-birthday-book-congress-9d79ab34 (if you've already visited the site before, the confirmation screen will not appear).

  • Good question. I took the liberty of making a few changes to make the question clearer and reduce the risk of it being downvoted and closed. Check whether your intention was preserved. You may also want to add the example of your particular case, i.e. why exactly you want to extract OGP in the first place; answers may vary depending on that. Commented Sep 8 at 22:17
  • "If website denies bots, it becomes impossible to get OGP data." I don't understand this statement. You literally just make a request to the web server and parse the result. There's no way for the server to prevent that (well, unless you do like millions of requests in short time). They cannot deny you. Just like they cannot deny a human user. There is no difference, as long as you behave. As for the first question: why downloading entire page is a problem? HTML doesn't weight that much compared to say images or videos. Commented Sep 9 at 8:25
  • If you truly want to download only the meta tags, which typically reside inside the <head></head> tag, then you can always just download (and parse) the page chunk by chunk, until you see the </head> tag. Choose an XML parser that works chunk by chunk; there are plenty of them. Doable, but a pain in the a**. Plus, closing an incomplete connection might look suspicious. Commented Sep 9 at 8:30
  • Correct me if I'm wrong, but AFAIK the purpose of robots.txt is usually to stop search crawlers from scanning an entire web site frequently, not to stop anybody from seeing the content of a site (or its headlines) at all. If someone adds OGP data to their site, they want the headlines to be presented on social media / newsfeeds, and the content of robots.txt should usually be in line with that goal (otherwise it is misdesigned, which should not be your concern). Commented Sep 9 at 10:54
  • I know this doesn't help you, but I can't see why they didn't use headers for this. Maybe I'm missing something, but it looks like another example of Meta creating a new standard for something that's already solved by HTTP natively. Commented Sep 11 at 17:43

1 Answer


A "trend analysis bot" sounds to me like the kind of bot which certain web site providers want to block - it is IMHO a kind of scraper. Instead of trying not getting caught, better respect the wishes of the providers and stop analysing sites which tell you by their robots.txt they don't want to be analysed automatically.

how exactly am I expected to get OGP data in a responsible way

Well, use it for what it is intended for: when someone manually posts a link to a web page with OGP data included in some Messenger or Facebook channel, the social media site can use it for intelligent "preview" functionality, or whatever else the extra meta data allows.

To avoid downloading a full HTML page just to get the OGP meta data, freakish described in a comment how to accomplish this: read the page sequentially and stop once you have seen the closing </head> tag. We don't answer coding questions here, but if you ask one of the popular LLMs, I am sure you will get an example of how to accomplish this in your favorite programming language.
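
For illustration, here is a rough sketch of that sequential read in Python (assuming the third-party requests library; the function name, URL and User-Agent are placeholders, and the regex is a deliberate simplification of real meta-tag parsing):

    import re
    import requests  # assumed third-party dependency; any streaming HTTP client works

    # Simplified pattern: assumes property="..." appears before content="...".
    OG_META = re.compile(
        r'<meta[^>]+property=["\']og:([^"\']+)["\'][^>]+content=["\']([^"\']*)["\']',
        re.IGNORECASE)

    def fetch_og_tags(url, chunk_size=2048, max_bytes=256 * 1024):
        """Stream the page and stop once </head> is seen or a size cap is hit."""
        buf = ""
        headers = {"User-Agent": "ogp-preview-fetcher/0.1"}  # placeholder UA
        with requests.get(url, stream=True, timeout=10, headers=headers) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size=chunk_size):
                buf += chunk.decode(resp.encoding or "utf-8", errors="replace")
                if "</head>" in buf.lower() or len(buf) >= max_bytes:
                    break  # stop downloading; the rest of the page is not needed
        return {"og:" + prop: content for prop, content in OG_META.findall(buf)}

    print(fetch_og_tags("https://example.com/some-article"))  # placeholder URL

Note that this still issues an ordinary HTTP request, so sites that only serve their meta tags after JavaScript runs (as described in the question's update) will not return usable OGP data this way.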

  • This bot accesses a news site only when posting; the rest of the time it analyzes Bluesky posts and doesn't access news sites. Commented Sep 9 at 21:10
  • And, "read the page sequentially" is possible? Not file but website? Commented Sep 9 at 21:13
  • @Lamron of course. When you download a page, you always download it chunk by chunk. The data arrives at your computer in network packets, which you can analyze and parse on the fly. In fact, browsers do this all the time: they won't wait until the entire page is downloaded, they will try to parse it, run scripts (and maybe even render it) on the fly. You can do something similar. But how to do that exactly depends on the language and framework you are using. Commented Sep 10 at 9:46
  • I found that this doesn't work, for example: gist.github.com/lamrongol/c7079bd8b5057aecf1ba916f4c9150a0 . The site returns "Please enable JS and disable any ad blocker" even if only a small amount of data is requested. Commented Sep 10 at 19:02
  • @Lamron: that's a new question about your specific code. Try asking it on Stack Overflow (but make the code part of your question, don't just link to an external site). Commented Sep 10 at 19:10
