Full Text Search Throwdown Bill Karwin, Percona Inc.
In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. http://www.flickr.com/photos/tryingyouth/
StackOverflow Test Data • Data dump, exported December 2011 • 7.4 million Posts = 8.18 GB www.percona.com
StackOverflow ER diagram searchable text www.percona.com
The Baseline: Naive Search Predicates
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinsky www.percona.com
Accuracy issue • Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.: SELECT * FROM Posts WHERE Body LIKE '%one%' • Regular expressions in MySQL support escapes for word boundaries: SELECT * FROM Posts WHERE Body RLIKE '[[:<:]]one[[:>:]]' www.percona.com
Performance issue • LIKE with wildcards: ! SELECT * FROM Posts 49 sec WHERE title LIKE '%performance%' ! OR body LIKE '%performance%' ! OR tags LIKE '%performance%'; • POSIX regular expressions: ! SELECT * FROM Posts 7 min 57 sec WHERE title RLIKE '[[:<:]]performance[[:>:]]' ! OR body RLIKE '[[:<:]]performance[[:>:]]' ! OR tags RLIKE '[[:<:]]performance[[:>:]]'; www.percona.com
Why so slow? CREATE TABLE TelephoneBook ( ! FullName VARCHAR(50)); CREATE INDEX name_idx ON TelephoneBook ! (FullName); INSERT INTO TelephoneBook VALUES ! ('Riddle, Thomas'), ! ('Thomas, Dean'); www.percona.com
Why so slow? • Search for all with last name “Thomas” uses index SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%' • Search for all with first name “Thomas” SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas' can’t use index www.percona.com
Because: B-Tree indexes can’t ☞ search for substrings www.percona.com
• FULLTEXT in MyISAM • FULLTEXT in InnoDB • Apache Solr • Sphinx Search • Trigraphs
FULLTEXT in MyISAM
FULLTEXT Index with MyISAM • Special index type for MyISAM • Integrated with SQL queries • Indexes always in sync with data • Balances features vs. speed vs. space www.percona.com
Insert Data into Index (MyISAM) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 33 min, 34 sec www.percona.com
Build Index on Data (MyISAM) mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 31 min, 18 sec www.percona.com
Querying SELECT * FROM Posts WHERE MATCH(column(s)) AGAINST('query pattern'); must include all columns of your index, in the order you defined www.percona.com
Natural Language Mode (MyISAM) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags ) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 200 milliseconds www.percona.com
Query Profile: Natural Language Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000068 | | checking permissions | 0.000006 | | Opening tables | 0.000017 | | init | 0.000032 | | System lock | 0.000007 | | optimizing | 0.000007 | | statistics | 0.000018 | | preparing | 0.000006 | | FULLTEXT initialization | 0.198358 | | executing | 0.000012 | | Sending data | 0.001921 | | end | 0.000005 | | query end | 0.000003 | | closing tables | 0.000018 | | freeing items | 0.000341 | | cleaning up | 0.000012 | +-------------------------+----------+ www.percona.com
Boolean Mode (MyISAM) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 16 milliseconds www.percona.com
Query Profile: Boolean Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000031 | | checking permissions | 0.000003 | | Opening tables | 0.000008 | | init | 0.000017 | | System lock | 0.000004 | | optimizing | 0.000004 | | statistics | 0.000008 | | preparing | 0.000003 | | FULLTEXT initialization | 0.000008 | | executing | 0.000001 | | Sending data | 0.015703 | | end | 0.000004 | | query end | 0.000002 | | closing tables | 0.000007 | | freeing items | 0.000381 | | cleaning up | 0.000007 | +-------------------------+----------+ www.percona.com
FULLTEXT in InnoDB
FULLTEXT Index with InnoDB • Under development in MySQL 5.6 • I’m testing 5.6.6 m1 • Usage very similar to FULLTEXT in MyISAM • Integrated with SQL queries • Indexes always* in sync with data • Read the blogs for more details: • http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/ • http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/ • http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/ • http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/ www.percona.com
Insert Data into Index (InnoDB) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 55 min 46 sec www.percona.com
Build Index on Data (InnoDB) • Still under development; you might see problems: mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); ERROR 2013 (HY000): Lost connection to MySQL server during query www.percona.com
Build Index on Data (InnoDB) • Solution: make sure you define a primary key column `FTS_DOC_ID` explicitly: mysql> ALTER TABLE Posts CHANGE COLUMN PostId `FTS_DOC_ID` BIGINT UNSIGNED; mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 25 min 27 sec www.percona.com
Natural Language Mode (InnoDB) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 740 milliseconds www.percona.com
Query Profile: Natural Language Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000074 | | checking permissions | 0.000007 | | Opening tables | 0.000020 | | init | 0.000034 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000020 | | preparing | 0.000008 | | FULLTEXT initialization | 0.577257 | | executing | 0.000013 | | Sending data | 0.106279 | | end | 0.000018 | | query end | 0.000012 | | closing tables | 0.000018 | | freeing items | 0.055584 | | cleaning up | 0.000039 | +-------------------------+----------+ www.percona.com
Boolean Mode (InnoDB) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 350 milliseconds www.percona.com
Query Profile: Boolean Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000064 | | checking permissions | 0.000005 | | Opening tables | 0.000017 | | init | 0.000047 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000019 | | preparing | 0.000008 | | FULLTEXT initialization | 0.347172 | | executing | 0.000014 | | Sending data | 0.008089 | | end | 0.000011 | | query end | 0.000012 | | closing tables | 0.000015 | | freeing items | 0.001570 | | cleaning up | 0.000023 | +-------------------------+----------+ www.percona.com
Apache Solr
Apache Solr • http://lucene.apache.org/solr/ • Formerly known as Lucene, started 2001 • Apache License • Java implementation • Web service architecture • Many sophisticated search features www.percona.com
DataImportHandler • conf/solrconfig.xml: . . . <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-config.xml</str> </lst> </requestHandler> . . . www.percona.com
DataImportHandler • conf/data-config.xml: <dataConfig> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/testpattern?useUnicode=true" batchSize="-1" user="xxxx" password="xxxx"/> <document> <entity name="id" query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts"> </entity> </document> </dataConfig> extremely important to avoid buffering the whole query result! www.percona.com
DataImportHandler •conf/schema.xml: . . . <fields> <field name="PostId" type="string" indexed="true" stored="true" required="true" /> <field name="ParentId" type="string" indexed="true" stored="true" required="false" /> <field name="Title" type="text_general" indexed="false" stored="false" required="false" /> <field name="Body" type="text_general" indexed="false" stored="false" required="false" / > <field name="Tags" type="text_general" indexed="false" stored="false" required="false" / > <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/ > <fields> <uniqueKey>PostId</uniqueKey> <defaultSearchField>text</defaultSearchField> <copyField source="Title" dest="text"/> <copyField source="Body" dest="text"/> <copyField source="Tags" dest="text"/> . . . www.percona.com
Insert Data into Index (Solr) • http://localhost:8983/solr/dataimport? command=full-import time: 14 min 28 sec www.percona.com
Searching Solr • http://localhost:8983/solr/select/?q=mysql+AND +performance time: 79ms Query results are cached (like MySQL Query Cache), so they return much faster on subsequent execution www.percona.com
Sphinx Search
Sphinx Search • http://sphinxsearch.com/ • Started in 2001 • GPLv2 license • C++ implementation • SphinxSE storage engine for MySQL • Supports MySQL protocol, SQL-like queries • Many sophisticated search features www.percona.com
sphinx.conf source src1 { ! type = mysql ! sql_host = localhost ! sql_user = xxxx ! sql_pass = xxxx ! sql_db = testpattern ! sql_query = SELECT PostId, ParentId, Title, ! ! Body, Tags FROM Posts ! sql_query_info = SELECT * FROM Posts ! ! WHERE PostId=$id } www.percona.com
sphinx.conf ! index test1 { ! source = src1 ! path = C:Sphinxdata } www.percona.com
Insert Data into Index (Sphinx) ! C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1 Sphinx 2.0.5-release (r3309) using config file 'sphinx.conf'... indexing index 'test1'... collected 7397507 docs, 5731.8 MB time: 8 min 20 sec sorted 920.3 Mhits, 100.0% done total 7397507 docs, 5731776959 bytes total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3 msec/call avg total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/ call avg Execution time: 500.196 s www.percona.com
Querying index $ mysql --port 9306 Server version: 2.0.5-release (r3309) mysql> SELECT * FROM test1 WHERE MATCH('mysql performance'); +---------+--------+ | id | weight | +---------+--------+ | 6016856 | 6600 | | 4207641 | 6595 | | 2656325 | 6593 | | 7192928 | 5605 | | 8118235 | 5598 | . . . 20 rows in set (0.02 sec) www.percona.com
Querying index mysql> SHOW META; +---------------+-------------+ | Variable_name | Value | +---------------+-------------+ | total | 1000 | | total_found | 7672 | | time | 0.013 | | keyword[0] | mysql | | docs[0] | 162287 | time: 13ms | hits[0] | 363694 | | keyword[1] | performance | | docs[1] | 147249 | | hits[1] | 210895 | +---------------+-------------+ www.percona.com
Trigraphs
Trigraphs Overview • Not very fast, but still better than LIKE / RLIKE • Generic, portable SQL solution • No dependency on version, storage engine, third- party technology www.percona.com
Three-Letter Sequences ! CREATE TABLE AtoZ ( ! c! ! CHAR(1), ! PRIMARY KEY (c)); ! INSERT INTO AtoZ (c) VALUES ('a'), ('b'), ('c'), ... ! CREATE TABLE Trigraphs ( ! Tri! ! CHAR(3), ! PRIMARY KEY (Tri)); ! INSERT INTO Trigraphs (Tri) SELECT CONCAT(t1.c, t2.c, t3.c) FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3; www.percona.com
Insert Data Into Index my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr; $sth->execute() or die $dbh1->errstr; $dbh2->begin_work; my $i = 0; while (my $row = $sth->fetchrow_hashref ) { my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags}))); my %tri; map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g )); next unless %tri; my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri)); my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list"; $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr; if (++$i % 1000 == 0) { print "."; $dbh2->commit; $dbh2->begin_work; time: 116 min 50 sec } } space: 16.2GiB print ".n"; $dbh2->commit; rows: 519 million www.percona.com
Indexed Lookups ! SELECT p.* time: 46 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' ! www.percona.com
Search Among Fewer Matches ! SELECT p.* time: 19 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' www.percona.com
Search Among Fewer Matches ! SELECT p.* time: 22 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' www.percona.com
Search Among Fewer Matches ! SELECT p.* time: 13.6 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' www.percona.com
Narrow Down Further ! SELECT p.* time: 13.8 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%' ! AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%'; www.percona.com
And the winner is... Jarrett Campbell http://www.flickr.com/people/77744839@N00
Time to Insert Data into Index LIKE expression n/a FULLTEXT MyISAM 33 min, 34 sec FULLTEXT InnoDB 55 min, 46 sec Apache Solr 14 min, 28 sec Sphinx Search 8 min, 20 sec Trigraphs 116 min, 50 sec www.percona.com
Insert Data into Index (sec) 8000 6000 4000 2000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
Time to Build Index on Data LIKE expression n/a FULLTEXT MyISAM 31 min, 18 sec FULLTEXT InnoDB 25 min, 27 sec Apache Solr n/a Sphinx Search n/a Trigraphs n/a www.percona.com
Build Index on Data (sec) 4000 3000 2000 1000 n/a n/a n/a 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
Index Storage LIKE expression n/a FULLTEXT MyISAM 2382 MiB FULLTEXT InnoDB ? MiB Apache Solr 2766 MiB Sphinx Search 3355 MiB Trigraphs 16589 MiB www.percona.com
Index Storage (MiB) 20000 15000 10000 5000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
Query Speed LIKE expression 49,000ms - 399,000ms FULLTEXT MyISAM 16-200ms FULLTEXT InnoDB 350-740ms Apache Solr 79ms Sphinx Search 13ms Trigraphs 13800ms www.percona.com
Query Speed (ms) 400000 350000 300000 250000 200000 150000 100000 50000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
1000 Query Speed (ms) 750 500 250 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
Bottom Line build insert storage query solution 49k-399k LIKE expression 0 0 0 ms SQL FULLTEXT MyISAM 31:18 33:28 2382MiB 16-200ms MySQL FULLTEXT InnoDB 25:27 55:46 ? 350-740ms MySQL 5.6 Apache Solr n/a 14:28 2766MiB 79ms Java Sphinx Search n/a 8:20 3487MiB 13ms C++ Trigraphs n/a 116:50 16.2 GiB 13,800ms SQL www.percona.com
Final Thoughts • Third-party search engines are complex to keep in sync with data, and adding another type of server adds more operations work for you. • Built-in FULLTEXT indexes are therefore useful even if they are not absolutely the fastest. • Different search implementations may return different results, so you should evaluate what works best for your project. • Any indexed search solution is orders of magnitude better than LIKE! www.percona.com
New York, October 1-2, 2012 London, December 3-4, 2012 Santa Clara, April 22-25, 2013 www.percona.com/live
Expert instructors In-person training Custom onsite training Live virtual training http://www.percona.com/training www.percona.com
http://www.pragprog.com/titles/bksqla/ www.percona.com
Copyright 2012 Bill Karwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions: Attribution. Noncommercial. No Derivative Works. You must attribute this You may not use this work You may not alter, work to Bill Karwin. for commercial purposes. transform, or build upon this work. www.percona.com

Full Text Search Throwdown

  • 1.
    Full Text Search Throwdown Bill Karwin, Percona Inc.
  • 2.
    In a fulltext search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. http://www.flickr.com/photos/tryingyouth/
  • 3.
    StackOverflow Test Data •Data dump, exported December 2011 • 7.4 million Posts = 8.18 GB www.percona.com
  • 4.
  • 5.
  • 6.
    Some people, whenconfronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinsky www.percona.com
  • 7.
    Accuracy issue • Irrelevantor false matching words ‘one’, ‘money’, ‘prone’, etc.: SELECT * FROM Posts WHERE Body LIKE '%one%' • Regular expressions in MySQL support escapes for word boundaries: SELECT * FROM Posts WHERE Body RLIKE '[[:<:]]one[[:>:]]' www.percona.com
  • 8.
    Performance issue • LIKEwith wildcards: ! SELECT * FROM Posts 49 sec WHERE title LIKE '%performance%' ! OR body LIKE '%performance%' ! OR tags LIKE '%performance%'; • POSIX regular expressions: ! SELECT * FROM Posts 7 min 57 sec WHERE title RLIKE '[[:<:]]performance[[:>:]]' ! OR body RLIKE '[[:<:]]performance[[:>:]]' ! OR tags RLIKE '[[:<:]]performance[[:>:]]'; www.percona.com
  • 9.
    Why so slow? CREATETABLE TelephoneBook ( ! FullName VARCHAR(50)); CREATE INDEX name_idx ON TelephoneBook ! (FullName); INSERT INTO TelephoneBook VALUES ! ('Riddle, Thomas'), ! ('Thomas, Dean'); www.percona.com
  • 10.
    Why so slow? •Search for all with last name “Thomas” uses index SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%' • Search for all with first name “Thomas” SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas' can’t use index www.percona.com
  • 11.
    Because: B-Tree indexes can’t ☞ search for substrings www.percona.com
  • 12.
    • FULLTEXT inMyISAM • FULLTEXT in InnoDB • Apache Solr • Sphinx Search • Trigraphs
  • 13.
  • 14.
    FULLTEXT Index withMyISAM • Special index type for MyISAM • Integrated with SQL queries • Indexes always in sync with data • Balances features vs. speed vs. space www.percona.com
  • 15.
    Insert Data intoIndex (MyISAM) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 33 min, 34 sec www.percona.com
  • 16.
    Build Index onData (MyISAM) mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 31 min, 18 sec www.percona.com
  • 17.
    Querying SELECT * FROMPosts WHERE MATCH(column(s)) AGAINST('query pattern'); must include all columns of your index, in the order you defined www.percona.com
  • 18.
    Natural Language Mode(MyISAM) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags ) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 200 milliseconds www.percona.com
  • 19.
    Query Profile: Natural Language Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000068 | | checking permissions | 0.000006 | | Opening tables | 0.000017 | | init | 0.000032 | | System lock | 0.000007 | | optimizing | 0.000007 | | statistics | 0.000018 | | preparing | 0.000006 | | FULLTEXT initialization | 0.198358 | | executing | 0.000012 | | Sending data | 0.001921 | | end | 0.000005 | | query end | 0.000003 | | closing tables | 0.000018 | | freeing items | 0.000341 | | cleaning up | 0.000012 | +-------------------------+----------+ www.percona.com
  • 20.
    Boolean Mode (MyISAM) •Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 16 milliseconds www.percona.com
  • 21.
    Query Profile: Boolean Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000031 | | checking permissions | 0.000003 | | Opening tables | 0.000008 | | init | 0.000017 | | System lock | 0.000004 | | optimizing | 0.000004 | | statistics | 0.000008 | | preparing | 0.000003 | | FULLTEXT initialization | 0.000008 | | executing | 0.000001 | | Sending data | 0.015703 | | end | 0.000004 | | query end | 0.000002 | | closing tables | 0.000007 | | freeing items | 0.000381 | | cleaning up | 0.000007 | +-------------------------+----------+ www.percona.com
  • 22.
  • 23.
    FULLTEXT Index withInnoDB • Under development in MySQL 5.6 • I’m testing 5.6.6 m1 • Usage very similar to FULLTEXT in MyISAM • Integrated with SQL queries • Indexes always* in sync with data • Read the blogs for more details: • http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/ • http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/ • http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/ • http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/ www.percona.com
  • 24.
    Insert Data intoIndex (InnoDB) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 55 min 46 sec www.percona.com
  • 25.
    Build Index onData (InnoDB) • Still under development; you might see problems: mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); ERROR 2013 (HY000): Lost connection to MySQL server during query www.percona.com
  • 26.
    Build Index onData (InnoDB) • Solution: make sure you define a primary key column `FTS_DOC_ID` explicitly: mysql> ALTER TABLE Posts CHANGE COLUMN PostId `FTS_DOC_ID` BIGINT UNSIGNED; mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 25 min 27 sec www.percona.com
  • 27.
    Natural Language Mode(InnoDB) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 740 milliseconds www.percona.com
  • 28.
    Query Profile: Natural Language Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000074 | | checking permissions | 0.000007 | | Opening tables | 0.000020 | | init | 0.000034 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000020 | | preparing | 0.000008 | | FULLTEXT initialization | 0.577257 | | executing | 0.000013 | | Sending data | 0.106279 | | end | 0.000018 | | query end | 0.000012 | | closing tables | 0.000018 | | freeing items | 0.055584 | | cleaning up | 0.000039 | +-------------------------+----------+ www.percona.com
  • 29.
    Boolean Mode (InnoDB) •Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 350 milliseconds www.percona.com
  • 30.
    Query Profile: Boolean Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000064 | | checking permissions | 0.000005 | | Opening tables | 0.000017 | | init | 0.000047 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000019 | | preparing | 0.000008 | | FULLTEXT initialization | 0.347172 | | executing | 0.000014 | | Sending data | 0.008089 | | end | 0.000011 | | query end | 0.000012 | | closing tables | 0.000015 | | freeing items | 0.001570 | | cleaning up | 0.000023 | +-------------------------+----------+ www.percona.com
  • 31.
  • 32.
    Apache Solr • http://lucene.apache.org/solr/ •Formerly known as Lucene, started 2001 • Apache License • Java implementation • Web service architecture • Many sophisticated search features www.percona.com
  • 33.
    DataImportHandler • conf/solrconfig.xml: . .. <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-config.xml</str> </lst> </requestHandler> . . . www.percona.com
  • 34.
    DataImportHandler • conf/data-config.xml: <dataConfig> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/testpattern?useUnicode=true" batchSize="-1" user="xxxx" password="xxxx"/> <document> <entity name="id" query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts"> </entity> </document> </dataConfig> extremely important to avoid buffering the whole query result! www.percona.com
  • 35.
    DataImportHandler •conf/schema.xml: . . . <fields> <field name="PostId" type="string" indexed="true" stored="true" required="true" /> <field name="ParentId" type="string" indexed="true" stored="true" required="false" /> <field name="Title" type="text_general" indexed="false" stored="false" required="false" /> <field name="Body" type="text_general" indexed="false" stored="false" required="false" / > <field name="Tags" type="text_general" indexed="false" stored="false" required="false" / > <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/ > <fields> <uniqueKey>PostId</uniqueKey> <defaultSearchField>text</defaultSearchField> <copyField source="Title" dest="text"/> <copyField source="Body" dest="text"/> <copyField source="Tags" dest="text"/> . . . www.percona.com
  • 36.
    Insert Data intoIndex (Solr) • http://localhost:8983/solr/dataimport? command=full-import time: 14 min 28 sec www.percona.com
  • 37.
    Searching Solr • http://localhost:8983/solr/select/?q=mysql+AND +performance time: 79ms Query results are cached (like MySQL Query Cache), so they return much faster on subsequent execution www.percona.com
  • 38.
  • 39.
    Sphinx Search • http://sphinxsearch.com/ •Started in 2001 • GPLv2 license • C++ implementation • SphinxSE storage engine for MySQL • Supports MySQL protocol, SQL-like queries • Many sophisticated search features www.percona.com
  • 40.
    sphinx.conf source src1 { ! type= mysql ! sql_host = localhost ! sql_user = xxxx ! sql_pass = xxxx ! sql_db = testpattern ! sql_query = SELECT PostId, ParentId, Title, ! ! Body, Tags FROM Posts ! sql_query_info = SELECT * FROM Posts ! ! WHERE PostId=$id } www.percona.com
  • 41.
    sphinx.conf ! index test1 { ! source = src1 ! path = C:Sphinxdata } www.percona.com
  • 42.
    Insert Data intoIndex (Sphinx) ! C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1 Sphinx 2.0.5-release (r3309) using config file 'sphinx.conf'... indexing index 'test1'... collected 7397507 docs, 5731.8 MB time: 8 min 20 sec sorted 920.3 Mhits, 100.0% done total 7397507 docs, 5731776959 bytes total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3 msec/call avg total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/ call avg Execution time: 500.196 s www.percona.com
  • 43.
    Querying index $ mysql--port 9306 Server version: 2.0.5-release (r3309) mysql> SELECT * FROM test1 WHERE MATCH('mysql performance'); +---------+--------+ | id | weight | +---------+--------+ | 6016856 | 6600 | | 4207641 | 6595 | | 2656325 | 6593 | | 7192928 | 5605 | | 8118235 | 5598 | . . . 20 rows in set (0.02 sec) www.percona.com
  • 44.
    Querying index mysql> SHOWMETA; +---------------+-------------+ | Variable_name | Value | +---------------+-------------+ | total | 1000 | | total_found | 7672 | | time | 0.013 | | keyword[0] | mysql | | docs[0] | 162287 | time: 13ms | hits[0] | 363694 | | keyword[1] | performance | | docs[1] | 147249 | | hits[1] | 210895 | +---------------+-------------+ www.percona.com
  • 45.
  • 46.
    Trigraphs Overview • Notvery fast, but still better than LIKE / RLIKE • Generic, portable SQL solution • No dependency on version, storage engine, third- party technology www.percona.com
  • 47.
    Three-Letter Sequences ! CREATETABLE AtoZ ( ! c! ! CHAR(1), ! PRIMARY KEY (c)); ! INSERT INTO AtoZ (c) VALUES ('a'), ('b'), ('c'), ... ! CREATE TABLE Trigraphs ( ! Tri! ! CHAR(3), ! PRIMARY KEY (Tri)); ! INSERT INTO Trigraphs (Tri) SELECT CONCAT(t1.c, t2.c, t3.c) FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3; www.percona.com
  • 48.
    Insert Data IntoIndex my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr; $sth->execute() or die $dbh1->errstr; $dbh2->begin_work; my $i = 0; while (my $row = $sth->fetchrow_hashref ) { my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags}))); my %tri; map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g )); next unless %tri; my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri)); my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list"; $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr; if (++$i % 1000 == 0) { print "."; $dbh2->commit; $dbh2->begin_work; time: 116 min 50 sec } } space: 16.2GiB print ".n"; $dbh2->commit; rows: 519 million www.percona.com
  • 49.
    Indexed Lookups ! SELECTp.* time: 46 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' ! www.percona.com
  • 50.
    Search Among FewerMatches ! SELECT p.* time: 19 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' www.percona.com
  • 51.
    Search Among FewerMatches ! SELECT p.* time: 22 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' www.percona.com
  • 52.
    Search Among FewerMatches ! SELECT p.* time: 13.6 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' www.percona.com
  • 53.
    Narrow Down Further !SELECT p.* time: 13.8 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%' ! AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%'; www.percona.com
  • 54.
    And the winneris... Jarrett Campbell http://www.flickr.com/people/77744839@N00
  • 55.
    Time to InsertData into Index LIKE expression n/a FULLTEXT MyISAM 33 min, 34 sec FULLTEXT InnoDB 55 min, 46 sec Apache Solr 14 min, 28 sec Sphinx Search 8 min, 20 sec Trigraphs 116 min, 50 sec www.percona.com
  • 56.
    Insert Data intoIndex (sec) 8000 6000 4000 2000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 57.
    Time to BuildIndex on Data LIKE expression n/a FULLTEXT MyISAM 31 min, 18 sec FULLTEXT InnoDB 25 min, 27 sec Apache Solr n/a Sphinx Search n/a Trigraphs n/a www.percona.com
  • 58.
    Build Index onData (sec) 4000 3000 2000 1000 n/a n/a n/a 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 59.
    Index Storage LIKE expression n/a FULLTEXT MyISAM 2382 MiB FULLTEXT InnoDB ? MiB Apache Solr 2766 MiB Sphinx Search 3355 MiB Trigraphs 16589 MiB www.percona.com
  • 60.
    Index Storage (MiB) 20000 15000 10000 5000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 61.
    Query Speed LIKE expression 49,000ms - 399,000ms FULLTEXT MyISAM 16-200ms FULLTEXT InnoDB 350-740ms Apache Solr 79ms Sphinx Search 13ms Trigraphs 13800ms www.percona.com
  • 62.
    Query Speed (ms) 400000 350000 300000 250000 200000 150000 100000 50000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 63.
    1000 Query Speed (ms) 750 500 250 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 64.
    Bottom Line build insert storage query solution 49k-399k LIKE expression 0 0 0 ms SQL FULLTEXT MyISAM 31:18 33:28 2382MiB 16-200ms MySQL FULLTEXT InnoDB 25:27 55:46 ? 350-740ms MySQL 5.6 Apache Solr n/a 14:28 2766MiB 79ms Java Sphinx Search n/a 8:20 3487MiB 13ms C++ Trigraphs n/a 116:50 16.2 GiB 13,800ms SQL www.percona.com
  • 65.
    Final Thoughts • Third-partysearch engines are complex to keep in sync with data, and adding another type of server adds more operations work for you. • Built-in FULLTEXT indexes are therefore useful even if they are not absolutely the fastest. • Different search implementations may return different results, so you should evaluate what works best for your project. • Any indexed search solution is orders of magnitude better than LIKE! www.percona.com
  • 66.
    New York, October1-2, 2012 London, December 3-4, 2012 Santa Clara, April 22-25, 2013 www.percona.com/live
  • 67.
    Expert instructors In-person training Custom onsite training Live virtual training http://www.percona.com/training www.percona.com
  • 68.
  • 69.
    Copyright 2012 BillKarwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions: Attribution. Noncommercial. No Derivative Works. You must attribute this You may not use this work You may not alter, work to Bill Karwin. for commercial purposes. transform, or build upon this work. www.percona.com