How to find duplicates based on multiple columns in a rolling window in pandas?


Problem description

Sample data

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
.
.

I have some code like this:

import json
import sys

import pandas as pd

df = pd.DataFrame()
for line in sys.stdin:
    data = json.loads(line)
    # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
    df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
    df1['time'] = pd.to_datetime(df1['time'])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df1])
    # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()

print(df)
# Number of rows matching the merchant and amount of the last record read
print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))

Current output

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z  merchantE      90 2019-02-13 11:02:30

2

Expected output

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30

The data is streaming in. I want to check whether a duplicate record (one whose merchant and amount are the same as an earlier record) arrives within two minutes. If it does, I want to print it as a duplicate, discard it, and do no further processing on it.

Do I have to do something with index zipping or groupby? If so, how do I compare on multiple columns? Or is there some rolling condition on two columns? I can't find any way to do it.

What am I missing here?

Thanks

Edit

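    # dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
    # Rows already seen with the same merchant and amount (copy to avoid SettingWithCopyWarning)
    res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
    # res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
    # Parse the incoming time string before subtracting the datetime column
    res['timediff'] = pd.to_datetime(data['transaction']['time']) - res['time']
    if len(res.index) > 1:
        print(res)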

So I'm trying something like this: if the result is less than 120 seconds, I can act on it. But the resulting df is currently in this form:

                      merchant  amount                time       concat          timediff
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00  merchantF10 -1 days +23:59:20
2019-02-13 11:02:20  merchantF      10 2019-02-13 11:02:20  merchantF10          00:00:00

2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30  merchantE10 00:01:00
2019-02-13 11:02:00  merchantE      10 2019-02-13 11:02:00  merchantE10 00:00:30
2019-02-13 11:02:30  merchantE      10 2019-02-13 11:02:30  merchantE10 00:00:00

I think the -1 days +23:59:20 format can be dealt with by taking the absolute value?

How can I convert the time difference into a format that I can compare with 120 seconds? pd.to_timedelta() didn't work for me, or maybe I'm using it wrong.
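As a minimal sketch of that conversion: subtracting datetimes in pandas already yields a timedelta, and `-1 days +23:59:20` is simply how a timedelta of -40 seconds is displayed. The values below are copied from the merchantF rows above, and the variable names are illustrative only. Either compare against a `pd.Timedelta` directly or convert to seconds with `.dt.total_seconds()`.

```python
import pandas as pd

# Illustrative values taken from the merchantF rows shown above
new_time = pd.Timestamp("2019-02-13 11:02:20")
res = pd.DataFrame(
    {"time": pd.to_datetime(["2019-02-13 11:03:00", "2019-02-13 11:02:20"])}
)

# Datetime subtraction gives a timedelta64 Series; .abs() drops the sign
timediff = (new_time - res["time"]).abs()

# Option 1: compare with a Timedelta, no unit conversion needed
within_window = timediff <= pd.Timedelta(seconds=120)

# Option 2: convert to float seconds and compare with a plain number
seconds = timediff.dt.total_seconds()

print(seconds.tolist())
print(within_window.tolist())
```

`pd.to_timedelta` is meant for converting numbers or strings into timedeltas, so it is not needed when the input is already the result of a datetime subtraction.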

Recommended answer

So I made it work, but not with rolling windows, because rolling does not support string columns. This feature has also been reported and requested on the pandas repo.

My solution snippet for the problem:

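    # Runs inside the `for line in sys.stdin:` loop, after df1 has been built
    if len(df.index) > 0:
        # Rows already kept with the same merchant and amount as the incoming record
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
        # True where a kept row lies within 120 seconds of the incoming record
        res['timediff'] = (pd.to_datetime(data['transaction']['time']) - res['time']).dt.total_seconds().abs() <= 120
        if res.timediff.any():
            continue  # duplicate within two minutes: skip it
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df1])
print(df)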
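The snippet above is only the body of the reading loop. A self-contained sketch of the same check might look like the following. The function name `drop_recent_duplicates` and the `window_seconds` parameter are hypothetical, and instead of appending to a DataFrame on every record it keeps the timestamps of accepted records in a dictionary keyed by (merchant, amount), which is a swapped-in design choice rather than the code from the answer.

```python
import json
from collections import defaultdict

import pandas as pd


def drop_recent_duplicates(lines, window_seconds=120):
    """Keep a record unless an already kept record with the same merchant
    and amount lies within `window_seconds` of it (before or after)."""
    kept_times = defaultdict(list)  # (merchant, amount) -> times of kept records
    kept_rows = []
    for line in lines:
        tx = json.loads(line)["transaction"]
        when = pd.to_datetime(tx["time"])
        key = (tx["merchant"], tx["amount"])
        if any(abs((when - t).total_seconds()) <= window_seconds
               for t in kept_times[key]):
            print("duplicate:", tx)  # report and skip the record
            continue
        kept_times[key].append(when)
        kept_rows.append(
            {"merchant": tx["merchant"], "amount": tx["amount"], "time": when}
        )
    return pd.DataFrame(kept_rows)


# A few records in the same JSON-lines format as the input
sample = [
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}',
    '{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}',
    '{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}',
]
result = drop_recent_duplicates(sample)
print(result)
```

For the streaming case, `drop_recent_duplicates(sys.stdin)` should work the same way, since the function only needs an iterable of JSON lines.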

Sample data:

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}

Output:

                      merchant  amount                time
2019-02-13 10:00:00  merchantA      20 2019-02-13 10:00:00
2019-02-13 11:00:01  merchantB      90 2019-02-13 11:00:01
2019-02-13 11:00:10  merchantC      10 2019-02-13 11:00:10
2019-02-13 11:00:20  merchantD      10 2019-02-13 11:00:20
2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00
2019-02-13 11:05:20  merchantF      10 2019-02-13 11:05:20
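If the records are already collected in a DataFrame rather than arriving one at a time, a vectorized variant is possible with `groupby` and `diff`. This is a sketch of a different technique from the loop above, and its semantics differ: after sorting by time, a row is flagged when the previous row of the same (merchant, amount) group is at most two minutes earlier, whether or not that previous row was itself flagged. Out-of-order arrivals and chains of close records can therefore give different results. The column name `is_duplicate` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "merchant": ["merchantE", "merchantF", "merchantE", "merchantF"],
    "amount": [10, 10, 10, 10],
    "time": pd.to_datetime([
        "2019-02-13 11:01:30", "2019-02-13 11:03:00",
        "2019-02-13 11:02:00", "2019-02-13 11:05:20",
    ]),
})

df = df.sort_values("time")
# Gap to the previous row with the same merchant and amount
# (NaT for the first row of each group)
gap = df.groupby(["merchant", "amount"])["time"].diff()
# NaT compares as False, so the first row of a group is never flagged
df["is_duplicate"] = gap <= pd.Timedelta(minutes=2)

deduped = df[~df["is_duplicate"]]
print(deduped)
```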
