Problem description
I have a DataFrame that contains index and text columns.
For example:
index | text
1 | "I have a pen, but I lost it today."
2 | "I have pineapple and pen, but I lost it today."
Now I have a long list, and I want to match each word in text against that list.
Suppose:
long_list = ['pen', 'pineapple']
I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, where there are matches, returns the count.
index | text | count
1 | "I have a pen, but I lost it today." | 1
2 | "I have pineapple and pen, but I lost it today." | 2
This is what I tried:
def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1
    df['count'] = count
    return df

count_word = FunctionTransformer(count_words, validate=False)
Here is an example of how I write my other FunctionTransformers:
def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df

convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)
Recommended answer
Inspired by @Quang Hoang's answer:
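import pandas as pd
# Import the class explicitly: a bare `import sklearn` does not reliably load sklearn.preprocessing
from sklearn.preprocessing import FunctionTransformer

y = ['pen', 'pineapple']

def count_strings(X, y):
    # Build a regex alternation such as 'pen|pineapple' and count its matches in each row
    pattern = '|'.join(y)
    return X['text'].str.count(pattern)

string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
# df is the DataFrame from the question
df['count'] = string_transformer.fit_transform(X=df)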
Result
text count
1 "I have a pen, but I lost it today." 1
2 "I have pineapple and pen, but I lost it today." 2
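Note that the pattern used above counts regex matches anywhere in the string, so "pen" would also be counted inside longer words such as "pencil" or "open". If only whole words should count, one option is to wrap the alternation in \b word boundaries. The following is a minimal sketch under that assumption; df_demo, words and the third sample row are hypothetical and only serve to illustrate the difference.

```python
import re

import pandas as pd

# Hypothetical sample data: the two rows from the question plus one row
# where "pen" only occurs inside longer words.
df_demo = pd.DataFrame(
    {
        "text": [
            "I have a pen, but I lost it today.",
            "I have pineapple and pen, but I lost it today.",
            "The pencil is in the open drawer.",
        ]
    },
    index=[1, 2, 3],
)
words = ["pen", "pineapple"]

# Plain alternation, as in the answer: substring hits are counted too.
# re.escape guards against list entries that contain regex metacharacters.
substring_pattern = "|".join(re.escape(w) for w in words)

# The same alternation wrapped in word boundaries: whole words only.
whole_word_pattern = r"\b(?:" + substring_pattern + r")\b"

substring_counts = df_demo["text"].str.count(substring_pattern)
whole_word_counts = df_demo["text"].str.count(whole_word_pattern)

print(substring_counts.tolist())
print(whole_word_counts.tolist())
```

Either pattern can be dropped into count_strings unchanged, since only the regex passed to str.count differs.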
For the following df2:
#df2
text
1 "I have a pen, but I lost it today. pen pen"
2 "I have pineapple and pen, but I lost it today."
we get:
string_transformer.transform(X=df2)
#result
1 3
2 2
Name: text, dtype: int64
This shows that we have converted the function into an sklearn-style object. To abstract this even further, we can pass the column name as a keyword argument to count_strings.
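Passing the column name as a keyword argument could look roughly like this. It is only a sketch: the column parameter and the df_demo frame are hypothetical additions, not part of the original answer, and the data mirrors df2 from above.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings(X, y, column):
    """Count regex matches of any entry of `y` in the column X[column]."""
    pattern = "|".join(y)
    return X[column].str.count(pattern)


# `column` is a hypothetical extra keyword argument; both values are
# forwarded to count_strings through kw_args.
string_transformer = FunctionTransformer(
    count_strings,
    kw_args={"y": ["pen", "pineapple"], "column": "text"},
)

# Hypothetical frame with the same rows as df2.
df_demo = pd.DataFrame(
    {
        "text": [
            "I have a pen, but I lost it today. pen pen",
            "I have pineapple and pen, but I lost it today.",
        ]
    },
    index=[1, 2],
)

counts = string_transformer.fit_transform(df_demo)
print(counts.tolist())
```

With this layout the same function can be reused for another text column by changing only the kw_args dictionary.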