28
\$\begingroup\$

It was suggested on Meta for July 2015 Community Challenge to make a bot that can find questions with bad titles. I wrote this query to help others find examples to help them make write their code.

I made it a point to document as I wrote it so that someone who is a beginner with SQL could still make changes to it easily. Is there room for improvements?

Here is the query on SEDE

DECLARE @Question INT = (SELECT Id FROM PostTypes WHERE Name = 'Question'); DECLARE @Answer INT = (SELECT Id FROM PostTypes WHERE Name = 'Answer'); SELECT /* Uncomment below if you want to limit the results */ /* TOP 100 */ Posts.Id AS [Post Link] /* only for sanity check in case the filter is removed from WHERE clause */ , CASE WHEN Posts.PostTypeId = @Question THEN 'Question' WHEN Posts.PostTypeId = @Answer THEN 'Answer' ELSE 'Other' END AS [Post Type] , Posts.Score , Posts.CreationDate , Users.Id AS [User Link] , Users.Reputation FROM Posts INNER JOIN Users ON Posts.OwnerUserId = Users.Id WHERE /* Filter out non-question posts */ Posts.PostTypeId = @Question /* Generic title */ AND ( Posts.Title LIKE '%please%' OR Posts.Title LIKE '%review%' OR Posts.Title LIKE '%improve%' OR Posts.Title LIKE '%better%' OR Posts.Title LIKE '%how to%'OR Posts.Title LIKE '%how can i%' OR Posts.Title LIKE '%how do i%' ) /* Post not closed yet */ AND Posts.ClosedDate IS NULL /* Question doesn't have positive score */ /* AND Posts.Score <= 0 */ /* User has low reputation */ AND Users.Reputation <= 500 ORDER BY /* Lowest scores first - Change to DESC if you want Highest first */ Posts.Score ASC /* Newest first - Change to ASC if you want oldest first */ , Posts.CreationDate DESC 
\$\endgroup\$
8
  • 13
    \$\begingroup\$ You do realize that your own post will be identified as having a bad title... right? \$\endgroup\$ Commented Jun 11, 2015 at 18:58
  • 1
    \$\begingroup\$ not as of right now @rolfl since the rep-requirement isn't met... \$\endgroup\$ Commented Jun 11, 2015 at 18:58
  • 3
    \$\begingroup\$ @rolfl You think? ;-) \$\endgroup\$ Commented Jun 11, 2015 at 19:01
  • 5
    \$\begingroup\$ This has gotta be the single worst title on a highly-upvoted question I've ever seen. Good job, SQL-guy \$\endgroup\$ Commented Jun 11, 2015 at 22:04
  • 1
    \$\begingroup\$ Oh, I know why you chose a bad title. It's still bad, though. (Well, besides the little parenthetical to the right) \$\endgroup\$ Commented Jun 11, 2015 at 23:14

3 Answers 3

13
\$\begingroup\$

Your query is well structured, and consistent, but there is one issue you have failed to incorporate.... not all bad questions have a user. When questions are migrated, or there's a user deleted, the question may have no link back to the Users table.

This requires an Outer Join to Users.

Additionally, I have found that CTE expressions are fantastic for aliasing column names. For example, both Posts and Users tables have an ID column, and you need to have Posts.ID and Users.ID in different places. Both also have a CreatedDate. When these aliases become long, and complicated, it bogs down the readability of the query.

Oh, about the readability, the case statement is dead code. Don't use code that serves no purpose, or, alternatively, "debug code" should be removed before deployment. If you cannot trust yourself to have the right conditions on the query, then you cannot trust yourself to have the right debug code either.

So, if you alias these values in a CTE, and use the CTE to make the Outer join neat, you can reduce your query to:

SELECT PostID as [Post Link], Score, PostDate, UserID as [User Link], Reputation FROM BadTitles LEFT OUTER JOIN Users on UserId = Id ORDER BY Score ASC, PostDate DESC 

OK, that's a cheat, because BadTitles is a CTE, but, the full query, using your expanded comments, is:

DECLARE @Question INT = (SELECT Id FROM PostTypes WHERE Name = 'Question'); DECLARE @Answer INT = (SELECT Id FROM PostTypes WHERE Name = 'Answer'); with BadTitles as ( SELECT Id as PostId, OwnerUserId as UserId, Score, CreationDate as PostDate FROM Posts WHERE /* Filter out non-question posts */ Posts.PostTypeId = @Question /* Post not closed yet */ AND Posts.ClosedDate IS NULL /* Generic title */ AND ( Posts.Title LIKE '%please%' OR Posts.Title LIKE '%review%' OR Posts.Title LIKE '%improve%' OR Posts.Title LIKE '%better%' OR Posts.Title LIKE '%how to%' OR Posts.Title LIKE '%how can i%' OR Posts.Title LIKE '%how do i%' ) ) SELECT PostID as [Post Link], Score, PostDate, UserID as [User Link], Reputation FROM BadTitles LEFT OUTER JOIN Users on UserId = Id WHERE (Reputation IS NULL OR Reputation <= 500) ORDER BY /* Lowest scores first - Change to DESC if you want Highest first */ Score ASC, /* Newest first - Change to ASC if you want oldest first */ PostDate DESC 

Note that this produces an addition hundred or so questions from migrations, etc.

\$\endgroup\$
2
  • \$\begingroup\$ Hmmm... just realized this does not accomodate the reputation criteria... I misplaced that. \$\endgroup\$ Commented Jun 11, 2015 at 21:00
  • 2
    \$\begingroup\$ I can always count on you to find obscure bugs in my SEDE queries, that's great thanks! \$\endgroup\$ Commented Jun 11, 2015 at 21:01
7
\$\begingroup\$

I noticed that you put Comma first in your select statements, something that I also do is to put the conjunctions first in my where conditions, it has similar advantages

 AND ( Posts.Title LIKE '%please%' OR Posts.Title LIKE '%review%' OR Posts.Title LIKE '%improve%' OR Posts.Title LIKE '%better%' OR Posts.Title LIKE '%how to%'OR Posts.Title LIKE '%how can i%' OR Posts.Title LIKE '%how do i%' ) 

I would have written it like this

 AND ( Posts.Title LIKE '%please%' OR Posts.Title LIKE '%review%' OR Posts.Title LIKE '%improve%' OR Posts.Title LIKE '%better%' OR Posts.Title LIKE '%how to%' OR Posts.Title LIKE '%how can i%' OR Posts.Title LIKE '%how do i%' ) 
\$\endgroup\$
2
  • 4
    \$\begingroup\$ While I like the or, and, ,'s, etc at the start of the line as well... the OCD in me wants to scream that your Posts don't line up :) \$\endgroup\$ Commented Jun 11, 2015 at 23:47
  • 1
    \$\begingroup\$ lol, I know what you mean. \$\endgroup\$ Commented Jun 12, 2015 at 1:01
4
\$\begingroup\$

Let me address a possible DRY issue in the query in exchange for possible poorer performance.

I'm in particular talking about this bit:

 /* Generic title */ AND ( Posts.Title LIKE '%please%' OR Posts.Title LIKE '%review%' OR Posts.Title LIKE '%improve%' OR Posts.Title LIKE '%better%' OR Posts.Title LIKE '%how to%'OR Posts.Title LIKE '%how can i%' OR Posts.Title LIKE '%how do i%' ) 

At compatibility level 130 SQL Server offers a String_Split function. It allows you to feed it a string and split it on a single character into a table. For example:

STRING_SPLIT('please|review|improve|better|how to|how can i|how do i','|') 

would give us a Table with a column Value containing rows with each word that is separated by a |.

We can use that table then to do a CROSS APPLY with the Posts table because that allows us to reduce the LIKE sizzle to:

 /* Generic title */ AND CHARINDEX( Phrases.Value , Posts.Title COLLATE Latin1_General_CI_AI ) > 0 

Notice that I replaced the LIKE with a CHARINDEX. That by itself shouldn't matter much as none of those constructs would benefit from an index.

I also added a COLLATE instruction with a case and accent insensitive collation so it won't miss cases where capitals are used.

Further more I suggest to remove the CASE/WHEN/END statement for the PostType. You can join with the PostTypes table and then use the Namecolumn in the projection.

Applying all that your query makes this as the end result:

DECLARE @Question INT = (SELECT Id FROM PostTypes WHERE Name = 'Question'); DECLARE @Answer INT = (SELECT Id FROM PostTypes WHERE Name = 'Answer'); DECLARE @Phrases NVARCHAR(200) ='please|review|improve|better|how to|how can i|how do i'; SELECT DISTINCT -- the distinct is a bit bogus here if you keep the Phrase column /* Uncomment below if you want to limit the results */ TOP 1000 Posts.Id AS [Post Link] , PostTypes.Name AS [Post Type] , Posts.Score , Posts.CreationDate , Users.Id AS [User Link] , Users.Reputation , Phrases.Value FROM Posts INNER JOIN PostTypes ON PostTypes.Id = Posts.PostTypeId CROSS APPLY STRING_SPLIT(@Phrases, '|') Phrases LEFT OUTER JOIN Users ON Posts.OwnerUserId = Users.Id WHERE /* Filter out non-question posts */ Posts.PostTypeId = @Question /* Generic title */ AND CHARINDEX( Phrases.Value , Posts.Title COLLATE Latin1_General_CI_AI ) > 0 /* Post not closed yet */ AND Posts.ClosedDate IS NULL /* Question doesn't have positive score */ /* AND Posts.Score <= 0 */ /* User has low reputation - Un-comment to limit query further */ AND (Users.Id is NULL OR Users.Reputation <= 500) ORDER BY /* Lowest scores first - Change to DESC if you want Highest first */ Posts.Score ASC /* Newest first - Change to ASC if you want oldest first */ , Posts.CreationDate DESC 

Notice how it does show the extra column Phrase to indicate what phrase qualified for the title to be included.

results with phrase columns at the end

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.