
I wrote this for a blog I used to have way back when. It's no longer on the web, so here it is:

How to write a search engine

Darren Rowse over at problogger.net is holding a Group Writing Project on anything "How to". It's one of the few blogs I read regularly, so I figured why not write something worth reading for a change, rather than my standard violent rant where I end up threatening to stab Hugo Chavez in the throat.

I decided to write "How to write a search engine". I chose this topic for two reasons:

  1. There is not much good info on this on the web.
  2. I am currently writing one for one of my clients.

My client is an online retailer of significant size, so it's not searching the entire web, just their site, and more specifically just the products for sale on their site. Nonetheless, the same techniques can be used to write a more complex engine for searching the internet. I know this is not a tech blog, so I won't go too deep into the technicalities, nor will I discuss hardware and processing power requirements or web crawling.

I'm using a fairly simple technique. I have a table (tblKeywords) with three fields:

  1. ItemId (if you are doing a web search, this would be the URL)
  2. KeyWord (the indexed keyword)
  3. Weight (a numeric value from 1 to 100; the higher the number, the more significance the keyword carries)

The primary key is ItemId + KeyWord.
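As a rough sketch, the table described above could be set up like this. It uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the column types and the `create_keyword_table` helper are my assumptions, not part of the original setup.

```python
import sqlite3


def create_keyword_table(conn: sqlite3.Connection) -> None:
    """Create tblKeywords with a composite primary key (hypothetical helper)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tblKeywords (
            ItemId  TEXT    NOT NULL,  -- product id, or a URL for web search
            KeyWord TEXT    NOT NULL,  -- one indexed word
            Weight  INTEGER NOT NULL,  -- significance of the word for the item
            PRIMARY KEY (ItemId, KeyWord)
        )
        """
    )


conn = sqlite3.connect(":memory:")
create_keyword_table(conn)

# One row per (item, keyword) pair; the primary key rejects duplicates.
conn.execute("INSERT INTO tblKeywords VALUES ('12345', 'shirt', 35)")
```

Because the key is the pair (ItemId, KeyWord), each word appears at most once per item, so the weights for repeated occurrences have to be added together before the row is stored.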

The first thing I do is collect individual words from anyplace that is relevant. For my client, I pull words from the products table, specifically from the fields ItemId, ItemName, ItemShortDescription, ItemLongDescription, Manufacturer, ManufacturerSKU, Category1, Category2, Category3, etc. If you are indexing web pages, you can pull words from the page text, the page title, the URL, or links on other pages that point back to the page being indexed.

The weight value is determined by where the keyword came from. For example, in my case the item's manufacturer SKU would get a weight of 100, while a word from the item name may get a weight of 25 and a word from ItemLongDescription may get a weight of 5. If you are indexing web pages, a word from the page title may get a weight of 75, while a word in bold in the page text may get a weight of 10. If a word occurs more than once and/or in more than one place, you add up the weight for each occurrence. For example, if the word "shirt" comes from two places for ItemId=12345, once in ItemName (weight of 25) and twice in ItemLongDescription (weight of 5 x 2 = 10), the word "shirt" would have a total weight of 35 for ItemId=12345.
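The weighting rule above can be sketched in a few lines of Python. The field weights come from the examples in the text; the `keyword_weights` function, the sample item, and the simple "letters and digits" word splitting are assumptions made for illustration only.

```python
import re
from collections import Counter

# Per-field weights taken from the examples in the text. Fields that are
# not listed here are simply ignored in this sketch.
FIELD_WEIGHTS = {
    "ManufacturerSKU": 100,
    "ItemName": 25,
    "ItemLongDescription": 5,
}


def keyword_weights(item: dict) -> Counter:
    """Sum the field weight once for every occurrence of every word."""
    totals = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        text = item.get(field, "").lower()
        for word in re.findall(r"[a-z0-9]+", text):
            totals[word] += weight
    return totals


# Hypothetical product matching the "shirt" example in the text.
item = {
    "ItemName": "Pink Shirt",
    "ItemLongDescription": "A soft cotton shirt. This shirt is machine washable.",
}
weights = keyword_weights(item)
print(weights["shirt"])  # 25 (name) + 5 + 5 (description, twice) = 35
```

Each entry in the resulting counter corresponds to one row of tblKeywords for that item.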

If someone searches for "pink shirt", I search my table for all instances of the words "pink" or "shirt" and total the weights per item, showing the items with the highest total weight on top.

SQL:

SELECT ItemId, SUM(Weight) AS totWeight FROM tblKeywords WHERE KeyWord IN ('pink', 'shirt') GROUP BY ItemId ORDER BY totWeight DESC
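Here is one way the lookup might be run from application code, again using Python's sqlite3 module as a stand-in for whatever database you have. This version filters on the keyword before grouping and sorts by total weight so the best matches come first. The `search` function and the sample rows are hypothetical, and it assumes keywords were stored in lower case.

```python
import sqlite3


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return (ItemId, total weight) pairs, highest total weight first."""
    words = query.lower().split()
    if not words:
        return []
    # One "?" placeholder per search word keeps user input out of the SQL text.
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT ItemId, SUM(Weight) AS totWeight "
        "FROM tblKeywords "
        f"WHERE KeyWord IN ({placeholders}) "
        "GROUP BY ItemId "
        "ORDER BY totWeight DESC"
    )
    return conn.execute(sql, words).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblKeywords ("
    "ItemId TEXT, KeyWord TEXT, Weight INTEGER, "
    "PRIMARY KEY (ItemId, KeyWord))"
)
# Made-up index rows for three items.
conn.executemany(
    "INSERT INTO tblKeywords VALUES (?, ?, ?)",
    [
        ("12345", "shirt", 35),
        ("12345", "pink", 25),
        ("67890", "shirt", 25),
        ("67890", "blue", 25),
        ("55555", "pink", 5),
    ],
)

results = search(conn, "Pink Shirt")
print(results)  # [('12345', 60), ('67890', 25), ('55555', 5)]
```

Item 12345 matches both words, so its weights add up and it ranks first; items that match only one word still appear, just lower down.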

So there you have it: a basic (and fast) search engine. Of course there is more to do, such as stripping out punctuation, HTML code, and worthless keywords such as "and", "if", and "or". This doesn't address searching for key phrases, but you can use a similar system for phrases if you can figure out where they start and end.
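The cleanup step could look something like the sketch below. The regular-expression tag stripping is deliberately crude, and the stop-word list contains only the three words named above; `extract_keywords` is a hypothetical helper, not something from the original system.

```python
import re
from html import unescape

# Only the words named in the text; a real list would be much longer.
STOP_WORDS = {"and", "if", "or"}


def extract_keywords(text: str) -> list:
    """Strip HTML tags and punctuation, lower-case, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)  # remove anything that looks like a tag
    text = unescape(text)                 # turn entities such as &amp; into characters
    words = re.findall(r"[a-z0-9]+", text.lower())  # punctuation is dropped here
    return [word for word in words if word not in STOP_WORDS]


print(extract_keywords("<b>Pink</b> shirt, and matching tie &amp; socks!"))
# ['pink', 'shirt', 'matching', 'tie', 'socks']
```

The same function can be applied to search queries, so that what the user types is normalized the same way as what was indexed.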


