
I am writing my own web crawler. Currently I store URLs directly as uri.AbsoluteUri, so to check whether a URL has already been added I query the database directly with select pageid from mytable where url='absoluteurl'. I suspect this is putting extra stress on the database, because my Core i7 @ 4.5 GHz CPU is at nearly 100% almost all the time.

So it occurred to me that also storing MD5 hashes of the URLs in the database, and looking a URL up by its hash to check whether it already exists, might increase lookup speed.

So I'm waiting for your ideas on this: what would be the best approach for checking whether a URL already exists in the database?

C# 4.0, MS SQL Server 2008
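The idea being asked about, sketched minimally in Python for illustration (the actual crawler is C# against SQL Server; the in-memory dict here just stands in for the database table, and all names are invented): hash each URL to a fixed-width digest and check membership on that instead of comparing long strings.

```python
import hashlib

def url_key(url):
    """Fixed-width 16-byte MD5 digest used as the lookup key instead of the raw URL."""
    return hashlib.md5(url.encode("utf-8")).digest()

# In-memory stand-in for the database table: digest -> page id
seen = {}

def add_if_new(url):
    """Return True if the URL was new and was added, False if already present."""
    key = url_key(url)
    if key in seen:
        return False
    seen[key] = len(seen) + 1
    return True
```

Every key is exactly 16 bytes regardless of URL length, which is what makes a hash column attractive to index.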

EXAMPLE: http://img62.imageshack.us/img62/589/exampleimage.png

  • I suspect there is an index on the url column? Commented Jan 11, 2012 at 12:24
  • Eugen Rieck, please look at this image: img62.imageshack.us/img62/589/exampleimage.png Commented Jan 11, 2012 at 12:29
  • This does look strange: the query plan shows negligible CPU and some I/O cost (which is what I'd expect). Is the CPU use userland or kernel? Commented Jan 11, 2012 at 12:36
  • The machine is running Windows 7 Ultimate 64-bit under an administrator account: a Core i7 2600K, 8 cores @ 4.5 GHz. I can clearly say this query costs too much, so I wonder whether some kind of hashing could speed things up. Commented Jan 11, 2012 at 12:43
  • Again: userland or kernel? (In Task Manager choose "View" -> "Show kernel times" or similar.) Commented Jan 11, 2012 at 12:44

1 Answer


Since you already have an index on the Url column, my guess is that the SELECT (get PageId), then if it doesn't exist INSERT (new URL), is what's causing the CPU to peak. If your crawler has multiple threads going, you might be taxing the concurrency/locking mechanisms in SQL on tblPages.
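One way to relieve that SELECT-then-INSERT pattern is to collapse it into a single statement so the database performs the existence check atomically (SQL Server 2008 has MERGE for this). A minimal sketch of the same idea using Python's sqlite3 purely for illustration — the table and column names are invented, and SQLite's INSERT OR IGNORE stands in for the SQL Server equivalent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblPages (PageId INTEGER PRIMARY KEY, PageUrl TEXT UNIQUE)")

def add_url(url):
    """Single round trip: the UNIQUE constraint rejects duplicates atomically,
    so no separate SELECT is needed before the INSERT."""
    cur = conn.execute("INSERT OR IGNORE INTO tblPages (PageUrl) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted -> new URL, 0 rows -> already present
```

With one statement per URL there is also no window between the check and the insert for another crawler thread to slip into.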

With regard to your specific question, I would use CHECKSUM (CRC) instead of HASHBYTES (MD5). CHECKSUM is faster, and it returns an INT instead of a VARBINARY, so it is easier/faster to index.

However, precisely because CHECKSUM returns an INT it is prone to collisions, so you should also match the URL itself in an AND clause.

SELECT PageId FROM tblPages WHERE HashedUrl=CHECKSUM(@url) AND PageUrl=@url 
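The same hash-then-verify pattern, sketched in Python for illustration: zlib.crc32 stands in for T-SQL's CHECKSUM, and a dict of buckets stands in for a non-unique index on HashedUrl. The hash narrows the search to a handful of rows; the exact URL comparison then guards against collisions.

```python
import zlib
from collections import defaultdict

# Non-unique "index": one 32-bit checksum can map to several stored URLs.
index = defaultdict(list)  # hash -> list of (page_id, url)
next_id = 1

def lookup(url):
    """Mirrors: SELECT PageId ... WHERE HashedUrl = CHECKSUM(@url) AND PageUrl = @url"""
    h = zlib.crc32(url.encode("utf-8"))
    for page_id, stored in index[h]:  # narrow by hash first...
        if stored == url:             # ...then confirm the exact URL (collision guard)
            return page_id
    return None

def insert(url):
    global next_id
    page_id = next_id
    next_id += 1
    index[zlib.crc32(url.encode("utf-8"))].append((page_id, url))
    return page_id
```

The full-URL comparison only ever runs against the few rows sharing a checksum, which is why it stays cheap even though PageUrl itself is unindexed.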

Now put a column index ONLY on HashedUrl (not PageUrl). The index will have to be NON-UNIQUE because of the possibility of collisions. This will give you the fastest INSERTs and SELECTs until the table's row count grows well over 4 billion, at which point the number of INT CHECKSUM collisions will cause a lot of partial table scans on the un-indexed PageUrl column.
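To put the collision risk in numbers: by the standard birthday approximation, the expected number of colliding pairs among n uniformly random b-bit hashes is roughly n(n-1)/2 divided by 2^b. A quick arithmetic check at the 10-million-URL scale mentioned in the comments (this is an illustration, not part of the original answer):

```python
def expected_collision_pairs(n, bits=32):
    """Birthday approximation: expected number of colliding pairs
    among n uniformly random `bits`-bit hash values."""
    return n * (n - 1) / 2 / 2 ** bits

# At 10 million URLs, roughly 11,600 pairs are expected to share a
# 32-bit checksum -- a tiny fraction of rows, so the AND PageUrl=@url
# guard stays cheap.
print(expected_collision_pairs(10_000_000))
```

Only well past the billions of rows does the collision count grow large enough to make the fallback comparisons on the un-indexed column hurt.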

UPDATE

Here's the simple benchmark code I used:

GO
/* NORMAL METHOD */
BEGIN
    SET STATISTICS TIME ON
    --
    IF EXISTS(SELECT * FROM tempdb.dbo.sysobjects WHERE ID = OBJECT_ID(N'tempdb..#Store1'))
    BEGIN
        DROP TABLE #Store1
    END
    -- Normal
    CREATE TABLE #Store1 (Id INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED, Data VARCHAR(4000))
    CREATE UNIQUE CLUSTERED INDEX CIX_STORE1_DATA ON #Store1(Data)
    -- Help Create Data
    DECLARE @Data TABLE(Data VARCHAR(4000))
    INSERT INTO @Data(Data)
    VALUES ('red.'), ('YELLOW/'), ('green'), ('.BLUE'), ('/violet'), ('PURPLE-'), ('-orange')
    -- The data set we'll use for testing
    INSERT INTO @Data
    SELECT a.Data + b.Data + c.Data + d.Data + e.Data + f.Data + g.Data
    FROM @Data a, @Data b, @Data c, @Data d, @Data e, @Data f, @Data g
    -- INSERTION TESTS
    PRINT('INSERT INTO NORMAL')
    INSERT INTO #Store1(Data) SELECT Data FROM @Data
    -- SELECTION TESTS
    PRINT('SELECT FROM NORMAL')
    SELECT TOP 5000 d.Data, (SELECT s.Id FROM #Store1 s WHERE s.Data = d.Data) FROM @Data d
    --
    SET STATISTICS TIME OFF
END

GO
/* USING YOUR OWN CHECKSUM/HASH */
BEGIN
    SET STATISTICS TIME ON
    --
    IF EXISTS(SELECT * FROM tempdb.dbo.sysobjects WHERE ID = OBJECT_ID(N'tempdb..#Store2'))
    BEGIN
        DROP TABLE #Store2
    END
    -- With Hash
    CREATE TABLE #Store2 (Id INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED, Hsh INT, Data VARCHAR(4000))
    CREATE CLUSTERED INDEX CIX_STORE2_CRC ON #Store2(Hsh)
    -- Help Create Data
    DECLARE @Data TABLE(Data VARCHAR(4000))
    INSERT INTO @Data(Data)
    VALUES ('red.'), ('YELLOW/'), ('green'), ('.BLUE'), ('/violet'), ('PURPLE-'), ('-orange')
    -- The data set we'll use for testing
    INSERT INTO @Data
    SELECT a.Data + b.Data + c.Data + d.Data + e.Data + f.Data + g.Data
    FROM @Data a, @Data b, @Data c, @Data d, @Data e, @Data f, @Data g
    -- INSERTION TESTS
    PRINT('INSERT INTO CHECKSUM/HASH')
    INSERT INTO #Store2(Hsh, Data) SELECT CHECKSUM(Data), Data FROM @Data
    -- SELECTION TESTS
    PRINT('SELECT FROM CHECKSUM/HASH')
    SELECT TOP 5000 d.Data, (SELECT s.Id FROM #Store2 s WHERE Hsh = CHECKSUM(d.Data) AND Data = d.Data) FROM @Data d
    --
    SET STATISTICS TIME OFF
END

Results, in brief: my method achieves faster (+30%) INSERTs ("elapsed time = 7339 ms" vs. "elapsed time = 10318 ms") but slower (-30%) SELECTs ("elapsed time = 37 ms" vs. "elapsed time = 28 ms").

Another interesting note: you can't "correctly" index a URL VARCHAR column, because its length (per the HTTP spec, ~4 KB) can exceed 900 bytes (SQL Server 2008's maximum allowable index key size). SQL Server only issues a warning for this, but the warning does note that some INSERTs/UPDATEs could potentially fail.

Warning! The maximum key length is 900 bytes. The index 'CIX_STORE1_DATA' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail. 

I'm not a SQL guru per se, and perhaps my testing method isn't the most accurate/useful, but the topic is very interesting with regard to sensible user-end optimization versus the 'black box'.


9 Comments

I had something similar to this recently. Isn't the index on the url likely to do clever things like using hashes/checksums internally for a faster indexing? I asked this question: stackoverflow.com/questions/7954602/… to which most replies were "don't insert the hashcode/checksum, just let the database worry about it".
There's a lot of "just let {x} worry about it" on Stack Overflow. I fear for innovation and the problem-solving skills of the future when we always rely on others to solve problems we can easily resolve ourselves. Anyway, to add my little two cents of value to this comment (without adding any actual information): do a test. Time it both ways with a decent sample set. If it turns out that SQL is faster without inserting the hash, then GJ Microsoft; if not, then GJ MonsterMMORPG, you got your app faster and learned another trick for your toolbelt.
Thank you very much for the answer. I am only crawling certain websites, so I can clearly say the total URL count will never exceed 10 million. I suppose the collision chance is very low out of 10M, and it is not a very critical issue, so I can accept several collisions in exchange for much faster lookups.
@AdamG.Carstensen: It's not always a good answer, but a database's entire design is optimised to do lookups as fast as possible, so for most people, once the indexes are where you want them, it is very unlikely that hand-rolled optimisations will do better. I might be wrong, which is why I asked the question rather than just downvoting and declaring outright wrongness. You are right that we should test these things if we feel they are a concern, but after the question I asked, I concluded that it would be a waste of my time to try to improve on MS's work.
@Chris: I'm glad that you got your problem sorted out, and I apologize if my comment seemed hostile; it certainly wasn't intended that way. The thing about SQL and all databases is that while they're packed full of feature sets which could choke a wide-necked animal, they still release new versions and updates (an indication that they're not perfect). Remember that solving everyone's problems means making some sacrifices. Luckily, sometimes you can, through your own sweat and persistence, help to make up for some of those.
