问题描述
我有一个这样的在线会话表(空行只是为了更好的可见性):
I have a table with online sessions like this (empty rows are just for better visibility):
ip_address | start_time | stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:12
10.10.10.10 | 2016-04-02 08:11 | 2016-04-02 08:20
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:10
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:08
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:11
10.10.10.10 | 2016-04-02 09:02 | 2016-04-02 09:15
10.10.10.10 | 2016-04-02 09:10 | 2016-04-02 09:12
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
而且我需要信封"在线时间跨度:
And I need the "envelop" online time spans:
ip_address | full_start_time | full_stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:20
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:15
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
我有这个返回所需结果的查询:
I have this query which returns desired result:
WITH t AS
-- Determine full time-range of each IP
(SELECT ip_address, MIN(start_time) AS min_start_time, MAX(stop_time) AS max_stop_time FROM IP_SESSIONS GROUP BY ip_address),
t2 AS
-- compose ticks
(SELECT DISTINCT ip_address, min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE AS ts
FROM t
CONNECT BY min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE <= max_stop_time),
t3 AS
-- get all "online" ticks
(SELECT DISTINCT ip_address, ts
FROM t2
JOIN IP_SESSIONS USING (ip_address)
WHERE ts BETWEEN start_time AND stop_time),
t4 AS
(SELECT ip_address, ts,
LAG(ts) OVER (PARTITION BY ip_address ORDER BY ts) AS previous_ts
FROM t3),
t5 AS
(SELECT ip_address, ts,
SUM(DECODE(previous_ts,NULL,1,0 + (CASE WHEN previous_ts + INTERVAL '1' MINUTE <> ts THEN 1 ELSE 0 END)))
OVER (PARTITION BY ip_address ORDER BY ts ROWS UNBOUNDED PRECEDING) session_no
FROM t4)
SELECT ip_address, MIN(ts) AS full_start_time, MAX(ts) AS full_stop_time
FROM t5
GROUP BY ip_address, session_no
ORDER BY 1,2;
但是,我担心性能.该表有数亿行,时间分辨率为毫秒(不是示例中给出的一分钟).因此 CTE t3
将是巨大的.有没有人有避免自加入和CONNECT BY"的解决方案?
However, I am concerned about the performance. The table has hundreds of million rows and the time resolution is millisecond (not one Minute as given in example). Thus CTE t3
is gonna be huge. Does anybody have a solution which avoids the Self-Join and "CONNECT BY"?
单个智能分析函数很棒.
推荐答案
也试试这个.我尽我所能对其进行了测试,我相信它涵盖了所有可能性,包括合并相邻的间隔(10:15 到 10:30 和 10:30 到 10:40 合并为一个间隔,10:15 到 10:40)).它也应该很快,它用的不多.
Try this one, too. I tested it the best I could, I believe it covers all the possibilities, including coalescing adjacent intervals (10:15 to 10:30 and 10:30 to 10:40 are combined into a single interval, 10:15 to 10:40). It should also be quite fast, it doesn't use much.
with m as
(
select ip_address, start_time,
max(stop_time) over (partition by ip_address order by start_time
rows between unbounded preceding and 1 preceding) as m_time
from ip_sessions
union all
select ip_address, NULL, max(stop_time) from ip_sessions group by ip_address
),
n as
(
select ip_address, start_time, m_time
from m
where start_time > m_time or start_time is null or m_time is null
),
f as
(
select ip_address, start_time,
lead(m_time) over (partition by ip_address order by start_time) as stop_time
from n
)
select * from f where start_time is not null
/
这篇关于获取信封.即重叠的时间跨度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持跟版网!