Query Stack Overflow for Top 5 Tags and Children Posts
For this you can use the Stack Exchange Data Explorer online at no charge. Click the ‘Compose Query’ button in the top-right. You should see the stackoverflow image in the top right corner:
You should give your query a title but it is not necessary. It does help keep track of your edits as you move along.
I want to find the top five tags on Stack Overflow along with the top five posts per tag. The Stack Exchange Data Explorer gives us free access to this data. The data itself is slightly out-of-date, but this is fine.
I will use three subqueries:
Find the top 5 tags
LEFT JOINthose tags to the posts table
Filter the results per tag.
The tables I need to focus on are:
The dataset is denormalized so some of the calculations we need to make have already been done. For example, in
Tags there is a
Count column that already tells us how many posts exist for each tag. In
Tags column contains a list of tags. Without this, I would need to write a more complicated query to split every tag for every row then run a tally. A challenge, for sure.
PostTags is the bridge between
Posts. Once I know what
Tags I am searching, I use
PostTags to find all relevant posts per tag, joining
Posts to also get all of the tags per post and the score of the post.
Finally, I will close off the query with a simple filter statement using
The overall query is very simple to build and keeps the result set small. Additionally, it is very easy to edit. So, if we want the top 10 tags and 10 posts per tag, we just make two small edits. Or, even better, use variables.
Building the Query
Avoiding Aggregate Errors when Grouping
Earlier when I numbered the posts within each tag, I took what may have seemed unnecessary steps. In other words, why did I need create a new variable,
n, to begin with? Why not just
LEFT JOIN Posts,
GROUP BY PostTags.TagId, and
ORDER BY Posts.Score DESC?
DECLARE @n_tags int; SET @n_tags = 5; SELECT TOP 5 PostTags.PostId, PostTags.TagId, Posts.Score FROM PostTags LEFT JOIN Posts ON PostTags.PostId = Posts.Id WHERE PostTags.TagId IN ( SELECT TOP (@n_tags) Tags.Id FROM Tags ORDER BY Tags.Count DESC) GROUP BY PostTags.TagId ORDER BY Posts.Score DESC;
Surely, the query above would have accomplished the same results, right?
The problem is that when running this type of query the server does not like it. Particularly, Data Exchange will halt execution with the following error:
Column 'PostTags.PostId' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I have asked the server to return the top five post id’s per tag id. The server, however, recognizes I’m also asking for tag id and score. The server doesn’t know which values of tag id and score to return and, therefore, throws an error.
Many database servers prior would have allowed this query to run without throwing an error. Now, most will halt execution though you can modify a setting to allow it to continue with a warning.
The better alternative is to simply group the result set by the tag id then create a new variable that effectively ranks the posts within each tag based on the posts score. This is the purpose of
Then we can filter our result set on the top five (or any
n) without issue.
You now have a query that you can easily edit to expand tag groups or include more posts per group. Data Exchange returns 50,000 rows in the result pane so you have plenty of room to work with.
Some things you may want to try:
Instead of sorting by
Posts.Score, can you edit the code to sort by
Examine the execution plan of my query? With that information, modify the code to include
Posts.Tagson the second subquery (starting at line #19). This removes the need for the outer
LEFT JOINstatement (line #44) but how does this impact the execution time of the query?
If you are feeling really brave, instead of relying on
PostTags.Count, can you split
Posts.Tagsand aggregate posts based on each tag? How does this query compare to the simpler
SELECTquery used originally?
The raw query and dataset can be found in my Datasets repo. If you have noticed any errors or simply want to fork the code, please feel free.
Lastly, please feel free to reach out to me on twitter for any questions or comments.
Thank you for reading!