Skip to main content

SQL 中的断层和扇形陷阱

YeongHer

设计关联是数据库设计中必经的一个环节。在实体建立之后,我们开始考虑实体与实体之间的相互关系,从而建立关联。然而在实际操作中,关联的设计常常会进入误区。 下面介绍两种比较著名的扇形陷阱(Fan Traps)和断层陷阱(Chasm Traps)。

断层陷阱(Chasm Trap)

定义:分析命中的实体间不存在直接连接,而是通过关系表进行连接,且实体间的关系存在多对一的场景

如下图所示,employees 表和 customers 表之间并没有直接的联系,且两张表与 orders 表都为一对多的关系,当同时连接三张表时,即会出现错误的统计数据

分析场景:统计每个员工接待的客户数量

-- 错误的查询
SELECT employees.employee_name, COUNT(customers.company_name) FROM orders
INNER JOIN employees ON employees.employee_id = orders.employee_id
INNER JOIN customers ON customers.customer_id = orders.customer_id
GROUP BY employees.employee_name

执行结果:


-- 正确的查询,由Agile Query 生成
WITH "qu_0" AS (
SELECT "ta_0"."employee_name" AS "co_2",
"ta_1"."customer_id" AS "co_6"
FROM "public"."orders" AS "ta_2"
INNER JOIN "public"."customers" AS "ta_1" ON "ta_2"."customer_id" = "ta_1"."customer_id"
INNER JOIN "public"."employees" AS "ta_0" ON "ta_2"."employee_id" = "ta_0"."employee_id"
GROUP BY "co_2",
"co_6"
),
"qu_1" AS (
SELECT "ta_1"."customer_id" AS "co_6",
COUNT("ta_1"."company_name") AS "co_0"
FROM "public"."customers" AS "ta_1"
GROUP BY "co_6"
),
"qu_2" AS (
SELECT "ta_3"."co_2" AS "co_3",
SUM("ta_4"."co_0") AS "co_1"
FROM "qu_1" AS "ta_4"
RIGHT JOIN "qu_0" AS "ta_3" ON "ta_4"."co_6" = "ta_3"."co_6"
GROUP BY "ta_3"."co_2"
)
SELECT "ta_5"."co_3" AS "co_4",
"ta_5"."co_1" AS "co_5"
FROM "qu_2" AS "ta_5"

执行结果

扇形陷阱(Fan Trap)

定义:分析命中的实体存在多个一对多连接,且多个连接间的数据不对称,从而导致数据少的表存在重复数据

分析场景:统计每个客户的订单数量和销售量

-- 错误的查询
SELECT customers.customer_name, COUNT(orders.order_id), SUM(order_details.quantity) FROM orders
INNER JOIN customers ON customers.customer_id = orders.customer_id
INNER JOIN order_details ON order_details.order_id = orders.order_id
GROUP BY customers.customer_name

执行结果

-- 正确的查询,由Agile Query 生成
WITH "qu_0" AS (
SELECT "ta_0"."company_name" AS "co_2",
SUM("ta_1"."quantity") AS "co_0"
FROM "public"."order_details" AS "ta_1"
INNER JOIN "public"."orders" AS "ta_2" ON "ta_1"."order_id" = "ta_2"."order_id"
INNER JOIN "public"."customers" AS "ta_0" ON "ta_2"."customer_id" = "ta_0"."customer_id"
GROUP BY "co_2"
),
"qu_1" AS (
SELECT "ta_0"."company_name" AS "co_2",
COUNT("ta_2"."order_id") AS "co_1"
FROM "public"."orders" AS "ta_2"
INNER JOIN "public"."customers" AS "ta_0" ON "ta_2"."customer_id" = "ta_0"."customer_id"
GROUP BY "co_2"
)
SELECT "ta_3"."co_2" AS "co_3",
"ta_4"."co_1" AS "co_4",
"ta_3"."co_0" AS "co_5"
FROM "qu_0" AS "ta_3"
INNER JOIN "qu_1" AS "ta_4" ON "ta_3"."co_2" = "ta_4"."co_2"

执行结果

总结

上述两个示例说明,查询命中的表不同,编写SQL 的逻辑就会不同,如果直接将多张表连接,有时会输出错误的统计结果。这也是业务开发过程中最为常见的错误。 当数据量越大,这类错误就会隐藏的越深,输出的数据报告的质量也就越难被控制。统计规则越复杂,这类错误就更难被发现。

上述的统计场景只是非常简单的场景,实际业务中的统计规则会复杂的多,参与计算的表也会很多,对数据工程师的个人能力和数据产品质量管控的要求会越来越高。

参考:Chasm and Fan Traps