AuthorizedProjectsWorker, AuthorizedProjectUpdate::UserRefreshWithLowUrgencyWorker is issuing many hotspot recursive CTE namespace queries

This morning GitLab.com alerting fired with an SLO alert for the rails_sql SLI:

image

https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1

There were also SLO alerts for Sidekiq at the same time:

image

https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=1616483958715&to=1616503979310

Looking at the postgres slowlog, and pg_stat_statements metrics, we can see that the server is dominated by requests for queryid -7232084447659837857.

This query:

WITH RECURSIVE "namespaces_cte" AS ((
 SELECT
 "namespaces"."id",
 "members"."access_level"
 FROM
 "namespaces"
 INNER JOIN "members" ON "namespaces"."id" = "members"."source_id"
 WHERE
 "members"."type" = $1
 AND "members"."source_type" = $2
 AND "namespaces"."type" = $3
 AND "members"."user_id" = $4
 AND "members"."requested_at" IS NULL
 AND (access_level >= $5))
 UNION (
 SELECT
 "namespaces"."id",
 LEAST ("members"."access_level", "group_group_links"."group_access") AS access_level
 FROM
 "namespaces"
 INNER JOIN "group_group_links" ON "group_group_links"."shared_group_id" = "namespaces"."id"
 INNER JOIN "members" ON "group_group_links"."shared_with_group_id" = "members"."source_id"
 AND "members"."source_type" = $6
 AND "members"."requested_at" IS NULL
 AND "members"."user_id" = $7
 AND "members"."access_level" > $8
 WHERE
 "namespaces"."type" = $9)
UNION (
 SELECT
 "namespaces"."id",
 GREATEST ("members"."access_level", "namespaces_cte"."access_level") AS access_level
FROM
 "namespaces"
 INNER JOIN "namespaces_cte" ON "namespaces_cte"."id" = "namespaces"."parent_id"
 LEFT OUTER JOIN "members" ON "members"."source_id" = "namespaces"."id"
 AND "members"."source_type" = $10
 AND "members"."requested_at" IS NULL
 AND "members"."user_id" = $11
 AND "members"."access_level" > $12
 WHERE
 "namespaces"."type" = $13))
SELECT
 "project_authorizations"."project_id",
 MAX(access_level) AS access_level
FROM ((
 SELECT
 projects.id AS project_id,
 members.access_level
 FROM
 "projects"
 INNER JOIN "members" ON "projects"."id" = "members"."source_id"
 WHERE
 "members"."type" = $14
 AND "members"."source_type" = $15
 AND "members"."user_id" = $16
 AND "members"."requested_at" IS NULL)
 UNION (
 SELECT
 projects.id AS project_id,
 $17 AS access_level
 FROM
 "projects"
 INNER JOIN "namespaces" ON "projects"."namespace_id" = "namespaces"."id"
 WHERE
 "namespaces"."owner_id" = $18
 AND "namespaces"."type" IS NULL)
 UNION (
 SELECT
 "projects"."id" AS project_id,
 "namespaces"."access_level"
 FROM
 "namespaces_cte" "namespaces"
 INNER JOIN "projects" ON "projects"."namespace_id" = "namespaces"."id")
UNION (
 SELECT
 "project_group_links"."project_id",
 LEAST ("namespaces"."access_level", "project_group_links"."group_access") AS access_level
 FROM
 "namespaces_cte" "namespaces"
 INNER JOIN project_group_links ON project_group_links.group_id = namespaces.id
 INNER JOIN projects ON projects.id = project_group_links.project_id
 INNER JOIN namespaces p_ns ON p_ns.id = projects.namespace_id
 WHERE (p_ns.share_with_group_lock IS FALSE))) project_authorizations
GROUP BY
 "project_authorizations"."project_id"

Charting at the total time spent in statements, we can see this query dominating during the slowdowns:

image

https://thanos.gitlab.net/graph?g0.range_input=6h&g0.max_source_resolution=0s&g0.expr=topk(10%2C%20rate(pg_stat_statements_seconds_total%7Benv%3D%22gprd%22%2Cfqdn%3D%22patroni-03-db-gprd.c.gitlab-production.internal%22%7D%5B5m%5D))&g0.tab=0

From the slowlog we can deduce that the problem is AuthorizedProjectsWorker.

cc @lmcandrew @dsatcher @mushakov

Edited by Andrew Newdigate