String comparison in Python, but not Levenshtein distance (I think)

Problem description

I found a crude string comparison in a paper I am reading. It is done as follows:

The equation they use is given below (extracted from the paper, with small wording changes to make it more general and readable). Since the authors' description is not very clear, I have tried to explain a bit more in my own words, using an example from the authors.

For example, for the two sequences ABCDE and BCEFA, there are two possible graphs:

Graph 1) connects B with B, C with C, and E with E.

Graph 2) connects A with A.

I cannot connect A with A while also connecting the other three (graph 1), because that would produce crossing lines: if you draw lines for B-B, C-C and E-E, the line linking A-A crosses all of them. So these two sequences result in two possible graphs: one has three connections (B-B, C-C and E-E) and the other has only one (A-A). I then calculate the score d for each graph using the equation below.
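The no-crossing rule described above can be made concrete with a small sketch. It assumes that a link (i, j) joins position i of the first string to position j of the second, and that two links cross unless both positions increase together. The function names are hypothetical and the brute-force enumeration is only meant for short strings such as these five-character ones.

```python
from itertools import combinations


def identity_links(s1, s2):
    """Return every (i, j) pair where s1[i] and s2[j] are the same character."""
    return [(i, j) for i, a in enumerate(s1) for j, b in enumerate(s2) if a == b]


def is_non_crossing(config):
    """True if the links, ordered by position in s1, also increase strictly in s2."""
    ordered = sorted(config)
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(ordered, ordered[1:]))


def non_crossing_configurations(s1, s2):
    """Brute force: every non-empty subset of links that has no crossing."""
    links = identity_links(s1, s2)
    return [
        config
        for size in range(1, len(links) + 1)
        for config in combinations(links, size)
        if is_non_crossing(config)
    ]


def maximal_configurations(s1, s2):
    """Keep only configurations that are not contained in a larger one."""
    configs = [set(c) for c in non_crossing_configurations(s1, s2)]
    return [c for c in configs if not any(c < other for other in configs)]


for config in maximal_configurations("ABCDE", "BCEFA"):
    # Prints "A", then "BCE": the two graphs described in the question.
    print("".join("ABCDE"[i] for i, _ in sorted(config)))
```

Under this reading, ABCDE and BCEFA give exactly the two graphs from the example: the single link A-A, and the three links B-B, C-C and E-E.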

Consequently, to define the degree of similarity between two penta-strings we calculate the distance d between them. Aligning the two penta-strings, we look for all the identities between their characters, wherever these may be located. If each identity is represented by a link between both penta-strings, we define a graph for this pair. We call any part of this graph a configuration.

Next, we retain all of those configurations in which there is no character cross pairing (the meaning is explained in my example above: links between identical characters must not cross, and only graphs without crossings are retained). Each of these is then evaluated as a function of the number p of characters related to the graph, the shifting Δi for the corresponding pairs and the gap δij between connected characters of each penta-string. The minimum value is chosen as characteristic and is called distance d:

d = Min(50 - 10p + ΣΔi + Σδij)

Although very rough, this measure is generally in good agreement with the qualitative eye guided estimation. For instance, the distance between abcde and abcfg is 20, whereas that between abcde and abfcg is 23 = (50 - 30 + 1 + 2).
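The quoted passage does not define Δi and δij precisely, so the following is only a sketch under stated assumptions, not the authors' method. It assumes the shift of a link (i, j) is |i - j|, and that each character skipped between two consecutive linked characters (in either string) costs a hypothetical weight of 2. That weight was chosen only because it reproduces the two distances quoted above (20 and 23); other definitions might fit equally well. The constant 50 is taken from the formula and suits five-character strings.

```python
from itertools import combinations

GAP_WEIGHT = 2  # assumption: cost per skipped character, fitted to the two quoted values


def non_crossing_configurations(s1, s2):
    """Yield every set of identity links in which no two links cross."""
    # Links are generated in (i, j) order, and combinations() keeps that order.
    links = [(i, j) for i, a in enumerate(s1) for j, b in enumerate(s2) if a == b]
    yield ()  # the empty configuration: nothing is linked
    for size in range(1, len(links) + 1):
        for config in combinations(links, size):
            # Non-crossing means both positions increase strictly along the links.
            if all(a[0] < b[0] and a[1] < b[1] for a, b in zip(config, config[1:])):
                yield config


def score(config, gap_weight=GAP_WEIGHT):
    """Evaluate 50 - 10p + sum(shift) + sum(gap) for one configuration."""
    p = len(config)
    # Assumed shift: how far apart the two linked positions are.
    shift = sum(abs(i - j) for i, j in config)
    # Assumed gap: characters skipped between consecutive links, in both strings.
    skipped = sum(
        (b[0] - a[0] - 1) + (b[1] - a[1] - 1) for a, b in zip(config, config[1:])
    )
    return 50 - 10 * p + shift + gap_weight * skipped


def distance(s1, s2, gap_weight=GAP_WEIGHT):
    """Minimum score over all non-crossing configurations."""
    return min(score(c, gap_weight) for c in non_crossing_configurations(s1, s2))


print(distance("abcde", "abcfg"))  # 20, matching the value quoted from the paper
print(distance("abcde", "abfcg"))  # 23, matching the value quoted from the paper
```

Because the search is exhaustive, it is only practical for very short strings. Any other value of `gap_weight`, or a different definition of the shift, should be checked against more examples from the paper before being trusted.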

I am confused about how to go about implementing this. Any suggestions would be much appreciated.

I tried Levenshtein distance and also simple sequence alignment as used in protein sequence comparison. The link to the paper is: http://peds.oxfordjournals.org/content/16/2/103.long

I could not find any information on the first author, Alain Figureau, and my emails to MA Soto have not been answered (as of today).

Thanks
