Python: Split unicode string on word boundaries


Problem description

I need to take a string and shorten it to 140 characters.

Currently I'm doing:

import re

# utils and post come from the surrounding application
if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)  # normalize whitespace
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)  # characters left for the text itself
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break  # the next word would not fit
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

So this works great for English and English-like strings, but fails for a Chinese string because tweet.split() returns a list with a single element:

>>> s = u"简讯：新華社報道，美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域，預計約30分鐘後抵達浦東國際機場，開展他上任後首次訪華之旅。"
>>> s
u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> s.split()
[u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002']
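
To make the failure concrete, here is a minimal, self-contained sketch of the same word loop. It is written for Python 3 rather than 2.5.4, and the `shorten` helper, the placeholder footer, and the sample strings are all assumptions made for illustration; the footer stands in for whatever `utils.shorten_urls` returns. When the text contains no whitespace, the only "word" is the whole string, it never fits, and nothing but the footer survives.

```python
import re

def shorten(tweet, footer, limit=140):
    # Hypothetical helper wrapping the same word-by-word loop.
    if len(tweet) <= limit:
        return tweet
    tweet = re.sub(r"\s+", " ", tweet)  # normalize whitespace
    avail = limit - len(footer)
    result = ""
    for word in tweet.split():
        word += " "
        if len(word) > avail:
            break
        result += word
        avail -= len(word)
    return (result + footer).strip()

footer = "\u2026 http://example.com/x"  # placeholder footer, not a real short URL
english = "word " * 50                  # 250 characters, many spaces
chinese = "新華社報道" * 40              # 200 characters, no whitespace

short_en = shorten(english, footer)
short_zh = shorten(chinese, footer)
print(len(short_en) <= 140)   # the English text is trimmed to whole words
print(short_zh == footer)     # the Chinese text is dropped entirely
```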

How should I do this so it handles I18N? Does this make sense in all languages?

I'm on Python 2.5.4, if that matters.
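
One possible direction, sketched here under stated assumptions rather than offered as a definitive answer: cut at the character limit, back up to the previous space when the cut lands inside a whitespace-delimited word, and fall back to a plain character cut when the text has no spaces at all. The function name `shorten_any` and the placeholder footer are hypothetical, the code targets Python 3, and lengths are counted in code points. It does not handle combining sequences or text that mixes a few spaces with long unspaced runs; proper word segmentation for such text follows the rules in Unicode Standard Annex #29, which this sketch does not implement.

```python
import re

def shorten_any(text, footer, limit=140):
    # Hypothetical helper: trim text so that text + footer fits in limit.
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) <= limit:
        return text
    avail = limit - len(footer)
    if avail <= 0:
        raise ValueError("footer leaves no room for any text")
    head = text[:avail]
    # If the cut lands inside a space-delimited word, back up to the
    # previous space. With no space available (for example Chinese),
    # keep the plain character cut.
    if text[avail] != " " and " " in head:
        head = head[:head.rindex(" ")]
    return head.rstrip() + footer

footer = "\u2026 http://example.com/x"  # placeholder footer, not a real short URL
out_en = shorten_any("word " * 50, footer)
out_zh = shorten_any("新華社報道" * 40, footer)
print(len(out_en) <= 140, len(out_zh) <= 140)
```

Whether a character-level cut is acceptable depends on the language; for Chinese it can still split a multi-character word, so this is a compromise rather than a true word-boundary split.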
