
I'm storing documents with fields that are set in a DataSource table. The document is written to the database as follows:

Document table (ID, DataSourceID)
Fields table (DocumentID, DataSourceField, FieldValue)

For example:

    Document
    100 | "Invoice"
    101 | "Receipt"

    Fields
    100 | "Date"      | 2011/12/01
    100 | "Amount"    | 1200,00
    101 | "Date"      | 2011/12/02
    101 | "Warehouse" | "Central"

The documents have many more fields. Now, the question is how to write the fastest query to find documents by filter values on fields (e.g. Date between '2011/12/01' and '2011/12/31' and Amount > 1000,00)?

This is what I came up with:

    select *
    from Document, Fields
    where DocumentID = FieldDocumentID
      and DataSource = "Invoice"
      and DocumentID in (select FieldDocumentID
                         from Fields
                         where Field = "Date"
                           and FieldValue between '2011/12/01' and '2011/12/31')
      and DocumentID in (select FieldDocumentID
                         from Fields
                         where Field = "Amount"
                           and FieldValue >= 1000)

I'm not sure how well this is going to work with millions of rows. How could I do this better?

  • The best idea is not to use this structure, since there's no way to enforce data types and you will have all sorts of implicit conversions. EAV is an antipattern for a reason. Commented Dec 2, 2011 at 13:25
  • How about a more conventional database structure, i.e. a table with fields DocumentID, Date, Amount and so on? Or is this out of the question? Commented Dec 2, 2011 at 13:26
  • It is all well and good to allow the user to define their own fields, but you will never get good performance out of this sort of system. Your best bet is a compromise: have a parent table contain all of the foreseeable fields, including your DocumentID, Date, etc. You can then create proper indexes and find all the relevant DocumentIDs. From there you can do the joins to get all the miscellaneous attributes. Commented Dec 2, 2011 at 13:55
  • Note that you do a text comparison on the dates, but as long as it's in a yyyy/mm/dd format it should work out ok. Commented Dec 2, 2011 at 14:36
  • @HansKesting is right. The checks will work for dates (and strings) but probably not for numeric data. I would expect the SQL to evaluate '1200.00' < '800.00' as true! Commented Dec 2, 2011 at 14:48
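The last comment's point about lexicographic comparison is easy to check. Here is a quick Python demonstration (Python stands in for the database here, but string collation in SQL behaves the same way on these values):

```python
# Numbers stored as text compare lexicographically, character by character.
# '1' sorts before '8', so the comparison is decided by the first character
# and 1200.00 appears to be "less than" 800.00.
print('1200.00' < '800.00')                # True  (wrong for numbers)

# Casting to a numeric type before comparing gives the right answer.
print(float('1200.00') < float('800.00'))  # False (correct)
```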

3 Answers


EAV based systems (e.g. Agresso) have query performance problems inherent to the EAV structure. The problems are really based on three underlying issues:

  • The structure is fiddly to query, so you get added complexity in your queries.

  • You can't index the base table on any of the attributes, which drives some inherent inefficiency in the query plans.

  • You have to do multiple joins against a large table.

Some possible strategies for mitigating EAV performance problems are:

  • Using a clustered index to keep all records relating to a given parent in close physical proximity can help somewhat. This minimises the I/O.

  • Consider using the metadata to create a view that flattens out the EAV structure. You may have to redefine the view every time it changes, and it won't necessarily help performance much, but it will be much easier to query. Try to create the view as a roll-up rather than a multiple way join.

  • Any attribute that is not strictly user defined (eg. transaction dates, amounts) should go on the base table. These can also be indexed.

  • Resolve the most selective predicates in subqueries first and join against the sub-query.

  • If you have the option, flatten out the physical structure, even if it has to be into a series of fields like 'StringField1', 'StringField2', 'DateField1', 'MoneyField1'. Your application will have to hold the configuration data anyway, so you can just map to a field on the database.

    This sounds clumsy, but it's much more efficient to query, and it doesn't have to be in the users' faces. You can also use the metadata to generate a view over the table that names the fields, so other customers of the data can see a meaningful view of the data and query it without necessarily having access to the application metadata.

    If you run out of columns in the table, just add more in the database. The field mapping system can always query the system data dictionary to see what's available.
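To make the "resolve the most selective predicates in subqueries first" bullet concrete, here is a minimal sketch using Python's sqlite3. Table and column names follow the question; sqlite is just a stand-in engine, and the CAST is an assumption needed because the EAV value column is text:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE Fields (FieldDocumentID INT, Field TEXT, FieldValue TEXT);
    INSERT INTO Fields VALUES
        (100, 'Date',   '2011/12/01'),
        (100, 'Amount', '1200'),
        (101, 'Date',   '2011/12/02'),
        (101, 'Amount', '900');
""")

# Resolve the most selective predicate (here assumed to be Amount >= 1000)
# in a derived table first, then join the remaining predicates against that
# much smaller row set instead of the full Fields table.
rows = con.execute("""
    SELECT sel.FieldDocumentID
    FROM (SELECT FieldDocumentID
          FROM Fields
          WHERE Field = 'Amount'
            AND CAST(FieldValue AS NUMERIC) >= 1000) AS sel
    JOIN Fields d
      ON d.FieldDocumentID = sel.FieldDocumentID
     AND d.Field = 'Date'
     AND d.FieldValue BETWEEN '2011/12/01' AND '2011/12/31'
""").fetchall()
print(rows)  # [(100,)]
```

Which predicate is most selective depends on your data; the shape of the query (derived table first, joins after) is the point, not the specific engine.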

At some point in these debates, somebody often chips in with a comment about XML columns ("I know, I'll just use XML"). To paraphrase JWZ, now you have two problems. Around 1998 they installed another circle in hell just for people who defile their systems with XML blob fields.

Although it doesn't look slick and high-tech, just having a set of user attribute columns on the table is by far the best solution. It is the most efficient, and there are many ways to make this approach more user friendly at the database level.

EDIT: Here is a snippet of T-SQL showing how to create a view that flattens out an EAV structure. Note that you would have to write a generator for the view based on the application's attribute metadata and re-create the view every time that metadata changes.

    if object_id ('dbo.Foo') is not null
        drop table dbo.Foo
    go
    if object_id ('dbo.FooView') is not null
        drop view dbo.FooView
    go
    create table dbo.Foo
        (FooID  int not null
        ,FooKey varchar (20)
        ,FooVal varchar (100)
        )

    insert Foo (FooID, FooKey, FooVal) values (1, 'Date', '2011-11-11')
    insert Foo (FooID, FooKey, FooVal) values (1, 'Amount', '100')
    insert Foo (FooID, FooKey, FooVal) values (2, 'Date', '2012-12-12')
    insert Foo (FooID, FooKey, FooVal) values (2, 'Amount', '200')
    go
    create view dbo.FooView
    as
    select *
      from (select FooID
                  ,FooKey
                  ,FooVal
              from dbo.Foo) keys
     pivot (max (FooVal)
       for FooKey in ([Date],[Amount])) pvt
    go
    select * from dbo.FooView

This is much more efficient than solutions involving multiple self-joins on the table, and will be somewhat efficient with a clustered index on the base entity and attribute type keys. You will probably also want to put type conversions into the view so you can sort the results correctly.

  • This is an OR query, not an AND. Why do you think a PIVOT would be more efficient than a self-join? Commented Dec 2, 2011 at 16:39
  • It's an example of flattening out an EAV structure for reporting purposes. A view of this sort would be more efficient on aggregate queries where it would be just one table scan instead of a large multi-way self-join - i.e. better for reporting. Commented Dec 2, 2011 at 17:04
  • Why do you think that one table scan over all values is better than two independent scans on appropriate portions of the index? If we have 1M documents with 20 attributes in each, your solution would need to scan 20M records while even op's one would only need to scan 2M. Commented Dec 2, 2011 at 17:17
  • Depends how selective the query is. If you're just selecting a few rows then what you're talking about in your post will probably be more efficient. However, I think it would end up with some sort of intermediate bookmarking operation. More than a few percent of rows overall and a single table scan would be cheaper. When I tested it the plan was just a table scan, then aggregation and filtering the aggregate. Commented Dec 2, 2011 at 21:48
  • A properly designed EAV table (well, let's say a least-improperly-designed table, since we're talking EAV) should be clustered on fieldName leading, so there is no row lookup overhead here. With a clustered seek, even 99% of rows would be cheaper than 100%. Commented Dec 4, 2011 at 0:47

Create computed numeric and date fields in your EAV table and indexes on them:

    ALTER TABLE fields ADD FieldValueNumeric AS
        CASE ISNUMERIC(FieldValue) WHEN 1 THEN CAST(FieldValue AS NUMERIC) END

    ALTER TABLE fields ADD FieldValueDate AS
        CASE ISDATE(FieldValue) WHEN 1 THEN CAST(FieldValue AS DATE) END

    CREATE UNIQUE INDEX ix_fields_field_num_doc
        ON fields (dataSourceField, fieldValueNumeric, documentId)

    CREATE UNIQUE INDEX ix_fields_field_date_doc
        ON fields (dataSourceField, fieldValueDate, documentId)

and use them in your query:

    SELECT *
    FROM document
    WHERE documentID IN
          (
          SELECT fieldDocumentID
          FROM fields
          WHERE dataSourceField = 'Date'
            AND fieldValueDate BETWEEN '2011/12/01' AND '2011/12/31'
          )
      AND documentID IN
          (
          SELECT fieldDocumentID
          FROM fields
          WHERE dataSourceField = 'Amount'
            AND fieldValueNumeric >= 1000
          )

EAV has its drawbacks, but for this very query (with ranged conditions on two different attributes) it's no worse than if the attributes were in one table, since a single index on two fields would be useless for ranged scans on both (in SQL Server).

  • Thanks. I was thinking of creating several columns in my table (one for int, nvarchar, char, varbinary, bool and datetime). Still not sure if I'm going to create indexes on them. Commented Dec 5, 2011 at 10:51
  • @Nezreli: you can just keep the datatype in a separate column and use the computed columns like this: valueDate AS CASE DATATYPE WHEN 'D' THEN CAST(value AS DATE) END etc. There is no point in persisting the columns. Of course you should index them if you want faster ranged searches. Commented Dec 5, 2011 at 10:57
  • I already have a datatype column. I'm not sure if all databases (e.g. Informix and PostgreSQL) support creating such computed fields, hence my decision for a more conservative approach. But I'll look into it, since it's a more elegant solution. Commented Dec 5, 2011 at 11:04
  • @Nezreli: Not sure about Informix, but Oracle and PostgreSQL allow indexing expressions: CREATE INDEX ON (name, CASE datatype WHEN 'D' THEN CAST(value AS DATE) END, id). You should use the expression verbatim in your query for this index to work. MySQL does not support this (nor creating computed columns, persisted or not, at all). Commented Dec 5, 2011 at 11:06
  • OK, Wikipedia says that Informix supports indexing expressions. Thanks, I'll go with computed columns. Commented Dec 5, 2011 at 11:28

For starters, I would go with something like this:

    SELECT *
    FROM Document d
    INNER JOIN Fields f
        ON f.FieldDocumentID = d.DocumentID
    WHERE d.DataSource = 'Invoice'
      AND (
            (f.Field = 'Date'   AND FieldValue BETWEEN '2011/12/01' AND '2011/12/31')
         OR (f.Field = 'Amount' AND FieldValue >= 1000)
         OR ...
          );

It's always a good idea to spell out your JOINs instead of specifying multiple tables in the FROM clause. This helps prevent unexpected results, and protects you from an unintended Cartesian JOIN (if you make a mistake in your WHERE clause).

  • In what way would this help performance? Commented Dec 2, 2011 at 13:34
  • This will not even work. No row can have f.field equal to both 'Date' and 'Amount' Commented Dec 2, 2011 at 13:39
  • I don't think this is a correct SQL query. This one will return just the Date and Amount rows. I need all document fields for those documents which satisfy my filter criteria. Commented Dec 2, 2011 at 13:42
  • @JNK- I've eliminated two subqueries...the performance benefit should be obvious. With the mish-mash of datatypes in the Fields table (and assuming a database redesign isn't going to happen) I think this should do a fair job of reaching the goal here. Commented Dec 2, 2011 at 13:45
  • @ypercube- If you look, the f.Field checks in the WHERE clause are in different parts of an "OR". Commented Dec 2, 2011 at 13:47