Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

Problem description

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).

What I would like to do is the following: when adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer, if I am not mistaken.

What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it; otherwise I discard it).
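In other words, the rule boils down to a simple membership test against that dictionary. A minimal sketch in plain Python (the dictionary contents and the helper name below are only illustrative):

    # Purely illustrative: the "dictionary" is just a set of terms to keep.
    my_dictionary = set(["lucene", "token", "index", "analyzer"])

    def keep_term(term):
        # Keep the term if it is present in the dictionary, discard it otherwise.
        return term in my_dictionary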

How should I proceed?

Here is (in Python) my custom implementation of the Analyzer:

class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):

        # Build the usual tokenizer/filter chain.
        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        # 'tokenStream' is assumed to be the TokenStreamComponents
        # mentioned in EDIT 1; pull the attributes we need from it.
        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)

        ts.reset()

        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term

        ts.end()
        ts.close()

        # How to store the terms in the index now?

        return ????

Thanks in advance for your guidance!

EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It returns a TokenStream with which you can iterate through the token list of the field you are indexing.

Now there is something to do with the Attributes that I do not understand. Or, more precisely, I can read the tokens but have no idea how I should proceed afterwards.

EDIT 2: I found this post where they mention the use of CharTermAttribute. However (in Python, at least) I cannot access or get a CharTermAttribute. Any thoughts?

EDIT 3: I can now access each term; see the updated code snippet. What is left to be done now is actually storing the desired terms...

Recommended answer

The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.

By defining a filter extending PythonFilteringTokenFilter, I can make use of the accept() function (the same one used in the StopFilter, for instance).

Here is the corresponding code snippet:

class MyFilter(PythonFilteringTokenFilter):

    def __init__(self, version, tokenStream):
        super(MyFilter, self).__init__(version, tokenStream)
        self.termAtt = self.addAttribute(CharTermAttribute.class_)

    def accept(self):
        term = self.termAtt.toString()
        accepted = False
        # Do whatever is needed with the term
        # accepted = ... (True/False)
        return accepted

Then just append the filter to the other filters (as in the code snippet of the question):

                  filter = MyFilter(Version.LUCENE_4_10_1, filter)
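Putting the pieces together, here is a minimal sketch of the resulting createComponents(), assuming PyLucene 4.10.x and that the nested Analyzer.TokenStreamComponents class is importable from org.apache.lucene.analysis (this wiring is my reading of the Lucene 4.10 API, not code from the original answer). Returning the components lets Lucene drive the stream itself at indexing time, so no manual incrementToken() loop is needed:

    # Minimal sketch, assuming PyLucene 4.10.x and the same imports as the
    # question's snippet; only the Analyzer import is added here.
    from org.apache.lucene.analysis import Analyzer

    class CustomAnalyzer(PythonAnalyzer):

        def createComponents(self, fieldName, reader):
            source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
            filter = StandardFilter(Version.LUCENE_4_10_1, source)
            filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
            filter = StopFilter(Version.LUCENE_4_10_1, filter,
                                StopAnalyzer.ENGLISH_STOP_WORDS_SET)
            # Append the custom accept/reject filter at the end of the chain.
            filter = MyFilter(Version.LUCENE_4_10_1, filter)

            # Hand the tokenizer and the end of the filter chain back to Lucene.
            return Analyzer.TokenStreamComponents(source, filter)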
                  

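And a rough usage sketch for the indexing side, again assuming PyLucene 4.10.x (the field name, sample text, and RAMDirectory choice are illustrative): only terms for which accept() returns True end up in the index.

    import lucene
    lucene.initVM()  # the JVM must be running before Lucene classes are used

    from org.apache.lucene.document import Document, Field, TextField
    from org.apache.lucene.index import IndexWriter, IndexWriterConfig
    from org.apache.lucene.store import RAMDirectory
    from org.apache.lucene.util import Version

    directory = RAMDirectory()
    config = IndexWriterConfig(Version.LUCENE_4_10_1, CustomAnalyzer())
    writer = IndexWriter(directory, config)

    doc = Document()
    doc.add(TextField("content", "some text to tokenize and filter",
                      Field.Store.YES))
    writer.addDocument(doc)  # the analyzer (and MyFilter.accept) runs here
    writer.close()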


