I'm working on a web crawler and I'm trying to understand how the IP substitution works.
From what I have read, the hostname should be resolved via DNS to one of its IP addresses, and that IP should be used instead of the hostname in requests. Supposedly this improves performance, because the crawler can resolve names up front and cache the results, so the user agent no longer needs to do DNS resolution itself.
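To make the idea concrete, here is a minimal sketch of the resolve-and-cache part. The `makeCachedResolve` helper and its TTL handling are my own illustration (not from any library); the resolver is injectable so the caching logic stands on its own:

```typescript
// Sketch of resolve-and-cache: a pluggable resolver function, so the cache
// logic is independent of the actual DNS call.
type Resolver = (hostname: string) => Promise<string[]>;

const makeCachedResolve = (resolve: Resolver, ttlMs = 60_000) => {
  const cache = new Map<string, { ips: string[]; expires: number }>();
  return async (hostname: string): Promise<string[]> => {
    const hit = cache.get(hostname);
    if (hit && hit.expires > Date.now()) return hit.ips; // cached, no lookup
    const ips = await resolve(hostname); // fresh DNS query
    cache.set(hostname, { ips, expires: Date.now() + ttlMs });
    return ips;
  };
};
```

In a real crawler you would pass `resolve4` from `dns/promises` as the resolver.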
It doesn't seem to work with HTTPS. I tried the following approaches:
With Node.js and Playwright:

```typescript
import { chromium } from "playwright";
import { resolve4 } from "dns/promises";

export const crawlPage = async (pageUrl: string) => {
  const url = new URL(pageUrl);
  const dns = await resolve4(url.hostname);
  console.log(dns);
  const ip = dns[0]!;

  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto(pageUrl);
  console.log(`Page title: ${await page.title()}`);

  url.hostname = ip;
  await page.goto(url.toString());
  console.log(`Page title: ${await page.title()}`);

  await browser.close();
};
```

And invoked like this:

```typescript
await crawlPage("https://example.com");
```
The output looks like this:
```
[ '23.192.228.80', '23.192.228.84', ... ]
Page title: Example Domain
node:internal/process/promises:391
          triggerUncaughtException(err, true /* fromPromise */);
          ^

page.goto: net::ERR_CERT_COMMON_NAME_INVALID at https://23.192.228.80/
Call log:
  - navigating to "https://23.192.228.80/", waiting until "load"
    ... internal call stack

Node.js v20.18.0
```

With curl it looks similar:
```
$ curl -H "Host: example.com" https://23.192.228.80
curl: (60) schannel: SNI or certificate check failed: SEC_E_WRONG_PRINCIPAL (0x80090322) - The target principal name is incorrect.
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the webpage mentioned above.
```

What should this look like for it to work?
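One direction I'm considering: in plain Node, `https.request`/`https.get` accept a socket-level `lookup` option, so the connection can be pinned to a pre-resolved IP while the URL keeps the real hostname, which means SNI and the certificate check should still see `example.com`. A sketch, untested against a real site; `pinnedLookup` and `fetchPinned` are my own helper names:

```typescript
// Pin the TCP connection to a pre-resolved IP, but keep the real hostname in
// the URL so TLS SNI and certificate verification still match the cert.
import https from "node:https";
import type { LookupFunction } from "node:net";

// A lookup function that skips DNS and always yields the given IPv4 address.
const pinnedLookup = (ip: string): LookupFunction =>
  (_hostname, _options, callback) => callback(null, ip, 4);

const fetchPinned = (pageUrl: string, ip: string) =>
  new Promise<number>((resolve, reject) => {
    https
      .get(pageUrl, { lookup: pinnedLookup(ip) }, (res) => {
        res.resume(); // drain the response body
        resolve(res.statusCode ?? 0);
      })
      .on("error", reject);
  });
```

curl has the same idea built in: `curl --resolve example.com:443:23.192.228.80 https://example.com` forces the connection to that IP without changing the URL. For Playwright, Chromium's `--host-resolver-rules=MAP example.com 23.192.228.80` launch argument (passed via `args` to `chromium.launch`) is reported to achieve the same, though I haven't verified it myself.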
P.S. Am I overthinking this? Should I just drop the idea and use the hostname?