Skip to main content
added 202 characters in body
Source Link

I need to perform quick searches against a combination of tags while including date ranges:

Example:

  • Users
    • who have requested notifications
    • who did not respond to a notification sent at least 3 days ago
    • and who have not been sent any other notification in the past 3 days

Data Structure

The event data structure is pretty simple:

  • Event
    • EntityId
    • EventType
    • EventName
    • Date

Normalized Database Structure Performance Concerns

  • With billions of events, doing a table scan will not work
  • The only column that would index well is Date and it will not always be included in every filter
  • EventName will not be distributed well for an index (some event names could include 1/4 of the records)
  • Doing a simple WHERE query against this normalized table would most likely require a full table scan which won't be fast enough

LIKE or Full-Text Search

Another approach is to convert these tags into a single text column one per entity.

  • EntryType_EntryName_Date_Time1, EntryType_EntryName_Date_Time2, EntryType_EntryName_Date_Time3

Then, I can do a SQL full-text search.

This would reduce the number of rows by at least 10 times, but I cannot figure out how to search by date range:

  • User
    • CONTAINS (RequestedNotifications*)
    • NOT CONTAINS (OpenNotification_ID4*) (Before 3 days ago ???)
    • NOT CONTAINS (SentNotification*) (Since 3 days ago ???)

At best, I could reduce the table and then scan a smaller partition, but I don't think it will help much.

In-Memory Solution

I have thought about creating a dedicated VM with an in-memory data structure of the entire complex data set.

Basically, I would create a dictionary for each tag type with buckets for date-time ranges and keep everything in hashtables for rapid intersections:

// Some structures like these Dictionary<EventType, Dictionary<EventName, HashSet<int>>> nameIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, HashSet<int>>>> dayIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, Dictionary<Hour, HashSet<int>>>>> hourIndex; // To search like this (kind of) var entityIds = filters.Select(f => hourIndex[f.tagType][f.tagName][f.day][f.hour]) .IntersectMultiple() .ToList();   // Note: In order to perform an operation like before 3 days var filterResults = nameIndex[eventType][eventName].except(dayIndex[...][...][today].union(...[today-1]).union(...[today-2])); 

On a dedicated server, it could handle around a billion events in-memory.

  • Each Event would require ~12 bytes (~4 bytes for each maintained index)
  • 1 Billion Events ~ 12 GB memory ~ $250/month A5 VM

This could be scaled by partioning the entities and persistence wouldn't be too difficult.

But before going that very custom route, I would like to know if there is a simpler way to do this.

Question

  • How can I structure an index so that fast searches can be performed against multiple text and date range filters as per my example?
  • What Azure solution would be the best way to solve this problem?

I need to perform quick searches against a combination of tags while including date ranges:

Example:

  • Users
    • who have requested notifications
    • who did not respond to a notification sent at least 3 days ago
    • and who have not been sent any other notification in the past 3 days

Data Structure

The event data structure is pretty simple:

  • Event
    • EntityId
    • EventType
    • EventName
    • Date

Normalized Database Structure Performance Concerns

  • With billions of events, doing a table scan will not work
  • The only column that would index well is Date and it will not always be included in every filter
  • EventName will not be distributed well for an index (some event names could include 1/4 of the records)
  • Doing a simple WHERE query against this normalized table would most likely require a full table scan which won't be fast enough

LIKE or Full-Text Search

Another approach is to convert these tags into a single text column one per entity.

  • EntryType_EntryName_Date_Time1, EntryType_EntryName_Date_Time2, EntryType_EntryName_Date_Time3

Then, I can do a SQL full-text search.

This would reduce the number of rows by at least 10 times, but I cannot figure out how to search by date range:

  • User
    • CONTAINS (RequestedNotifications*)
    • NOT CONTAINS (OpenNotification_ID4*) (Before 3 days ago ???)
    • NOT CONTAINS (SentNotification*) (Since 3 days ago ???)

At best, I could reduce the table and then scan a smaller partition, but I don't think it will help much.

In-Memory Solution

I have thought about creating a dedicated VM with an in-memory data structure of the entire complex data set.

Basically, I would create a dictionary for each tag type with buckets for date-time ranges and keep everything in hashtables for rapid intersections:

// Some structures like these Dictionary<EventType, Dictionary<EventName, HashSet<int>>> nameIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, HashSet<int>>>> dayIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, Dictionary<Hour, HashSet<int>>>>> hourIndex; // To search like this (kind of) var entityIds = filters.Select(f => hourIndex[f.tagType][f.tagName][f.day][f.hour]) .IntersectMultiple() .ToList(); 

On a dedicated server, it could handle around a billion events in-memory.

  • Each Event would require ~12 bytes (~4 bytes for each maintained index)
  • 1 Billion Events ~ 12 GB memory ~ $250/month A5 VM

This could be scaled by partioning the entities and persistence wouldn't be too difficult.

But before going that very custom route, I would like to know if there is a simpler way to do this.

Question

  • How can I structure an index so that fast searches can be performed against multiple text and date range filters as per my example?
  • What Azure solution would be the best way to solve this problem?

I need to perform quick searches against a combination of tags while including date ranges:

Example:

  • Users
    • who have requested notifications
    • who did not respond to a notification sent at least 3 days ago
    • and who have not been sent any other notification in the past 3 days

Data Structure

The event data structure is pretty simple:

  • Event
    • EntityId
    • EventType
    • EventName
    • Date

Normalized Database Structure Performance Concerns

  • With billions of events, doing a table scan will not work
  • The only column that would index well is Date and it will not always be included in every filter
  • EventName will not be distributed well for an index (some event names could include 1/4 of the records)
  • Doing a simple WHERE query against this normalized table would most likely require a full table scan which won't be fast enough

LIKE or Full-Text Search

Another approach is to convert these tags into a single text column one per entity.

  • EntryType_EntryName_Date_Time1, EntryType_EntryName_Date_Time2, EntryType_EntryName_Date_Time3

Then, I can do a SQL full-text search.

This would reduce the number of rows by at least 10 times, but I cannot figure out how to search by date range:

  • User
    • CONTAINS (RequestedNotifications*)
    • NOT CONTAINS (OpenNotification_ID4*) (Before 3 days ago ???)
    • NOT CONTAINS (SentNotification*) (Since 3 days ago ???)

At best, I could reduce the table and then scan a smaller partition, but I don't think it will help much.

In-Memory Solution

I have thought about creating a dedicated VM with an in-memory data structure of the entire complex data set.

Basically, I would create a dictionary for each tag type with buckets for date-time ranges and keep everything in hashtables for rapid intersections:

// Some structures like these Dictionary<EventType, Dictionary<EventName, HashSet<int>>> nameIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, HashSet<int>>>> dayIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, Dictionary<Hour, HashSet<int>>>>> hourIndex; // To search like this (kind of) var entityIds = filters.Select(f => hourIndex[f.tagType][f.tagName][f.day][f.hour]) .IntersectMultiple() .ToList();   // Note: In order to perform an operation like before 3 days var filterResults = nameIndex[eventType][eventName].except(dayIndex[...][...][today].union(...[today-1]).union(...[today-2])); 

On a dedicated server, it could handle around a billion events in-memory.

  • Each Event would require ~12 bytes (~4 bytes for each maintained index)
  • 1 Billion Events ~ 12 GB memory ~ $250/month A5 VM

This could be scaled by partioning the entities and persistence wouldn't be too difficult.

But before going that very custom route, I would like to know if there is a simpler way to do this.

Question

  • How can I structure an index so that fast searches can be performed against multiple text and date range filters as per my example?
  • What Azure solution would be the best way to solve this problem?
Source Link

Best Azure Solution for Complex Search Index

I need to perform quick searches against a combination of tags while including date ranges:

Example:

  • Users
    • who have requested notifications
    • who did not respond to a notification sent at least 3 days ago
    • and who have not been sent any other notification in the past 3 days

Data Structure

The event data structure is pretty simple:

  • Event
    • EntityId
    • EventType
    • EventName
    • Date

Normalized Database Structure Performance Concerns

  • With billions of events, doing a table scan will not work
  • The only column that would index well is Date and it will not always be included in every filter
  • EventName will not be distributed well for an index (some event names could include 1/4 of the records)
  • Doing a simple WHERE query against this normalized table would most likely require a full table scan which won't be fast enough

LIKE or Full-Text Search

Another approach is to convert these tags into a single text column one per entity.

  • EntryType_EntryName_Date_Time1, EntryType_EntryName_Date_Time2, EntryType_EntryName_Date_Time3

Then, I can do a SQL full-text search.

This would reduce the number of rows by at least 10 times, but I cannot figure out how to search by date range:

  • User
    • CONTAINS (RequestedNotifications*)
    • NOT CONTAINS (OpenNotification_ID4*) (Before 3 days ago ???)
    • NOT CONTAINS (SentNotification*) (Since 3 days ago ???)

At best, I could reduce the table and then scan a smaller partition, but I don't think it will help much.

In-Memory Solution

I have thought about creating a dedicated VM with an in-memory data structure of the entire complex data set.

Basically, I would create a dictionary for each tag type with buckets for date-time ranges and keep everything in hashtables for rapid intersections:

// Some structures like these Dictionary<EventType, Dictionary<EventName, HashSet<int>>> nameIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, HashSet<int>>>> dayIndex; Dictionary<EventType, Dictionary<EventName, Dictionary<Day, Dictionary<Hour, HashSet<int>>>>> hourIndex; // To search like this (kind of) var entityIds = filters.Select(f => hourIndex[f.tagType][f.tagName][f.day][f.hour]) .IntersectMultiple() .ToList(); 

On a dedicated server, it could handle around a billion events in-memory.

  • Each Event would require ~12 bytes (~4 bytes for each maintained index)
  • 1 Billion Events ~ 12 GB memory ~ $250/month A5 VM

This could be scaled by partioning the entities and persistence wouldn't be too difficult.

But before going that very custom route, I would like to know if there is a simpler way to do this.

Question

  • How can I structure an index so that fast searches can be performed against multiple text and date range filters as per my example?
  • What Azure solution would be the best way to solve this problem?