
        Extract tf-idf vectors with Lucene



                  Question

                  I have indexed a set of documents with Lucene, and I stored the DocumentTermVector for each document's content. I wrote a program that retrieves the term frequency vector of each document, but how can I get the tf-idf vector of each document?

                  Here is my code that outputs the term frequencies in each document:

                      Directory dir = FSDirectory.open(new File(indexDir));
                      IndexReader ir = IndexReader.open(dir);
                      for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
                          System.out.println(ir.document(docNum).getField("filename").stringValue());
                          TermFreqVector tfv = ir.getTermFreqVector(docNum, "contents");
                          if (tfv == null) {
                              // ignore empty fields
                              continue;
                          }
                          String[] terms = tfv.getTerms();
                          int termCount = terms.length;
                          int[] freqs = tfv.getTermFrequencies();

                          for (int t = 0; t < termCount; t++) {
                              System.out.println(terms[t] + " " + freqs[t]);
                          }
                      }
                  

                  Is there any built-in function in Lucene that does this for me?

                  Nobody helped, so I did it myself:

                      Directory dir = FSDirectory.open(new File(indexDir));
                      IndexReader ir = IndexReader.open(dir);

                      for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
                          TermFreqVector tfv = ir.getTermFreqVector(docNum, "title");
                          if (tfv == null) {
                              // ignore empty fields
                              continue;
                          }
                          String[] tterms = tfv.getTerms();
                          int termCount = tterms.length;
                          int[] freqs = tfv.getTermFrequencies();

                          for (int t = 0; t < termCount; t++) {
                              // Cast to double first: dividing two ints would truncate the ratio.
                              double idf = (double) ir.numDocs() / ir.docFreq(new Term("title", tterms[t]));
                              System.out.println(tterms[t] + " " + freqs[t] * Math.log(idf));
                          }
                      }
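The weighting in the loop above is term frequency times ln(N / df). As a minimal sketch of that arithmetic without any Lucene dependency, the following plain-Java class works on in-memory counts. The class name `TfIdfSketch`, its method names, and the sample counts are hypothetical and chosen only for illustration. Note the cast to `double` before the division: dividing two `int` values would truncate the ratio before the logarithm is taken.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java sketch of tf * ln(N / df); every name here is hypothetical. */
final class TfIdfSketch {

    /** Inverse document frequency as ln(numDocs / docFreq), with floating-point division. */
    static double idf(int numDocs, int docFreq) {
        if (docFreq <= 0) {
            throw new IllegalArgumentException("docFreq must be positive");
        }
        return Math.log((double) numDocs / docFreq);
    }

    /**
     * Weights each term of one document by termFreq * idf.
     * Assumes docFreqs has an entry for every term in termFreqs.
     */
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.get(e.getKey());
            weights.put(e.getKey(), e.getValue() * idf(numDocs, df));
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up counts for a single document in a 10-document collection.
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("lucene", 3);
        tf.put("index", 1);
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("lucene", 4);
        df.put("index", 10);
        System.out.println(weigh(tf, df, 10));
    }
}
```

A term that occurs in every document gets an idf of ln(1) = 0, so its weight vanishes regardless of how often it appears in the document.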
                  

                  Is there any way to find the ID number of each term?

                  Nobody helped, so I did it myself again:

                      // Collect every term of the "title" field. The enumeration is sorted,
                      // so a term's position in the list can serve as its ID.
                      List<String> list = new ArrayList<String>();
                      TermEnum terms = null;
                      try
                      {
                          terms = ir.terms(new Term("title", ""));
                          while (terms.term() != null && "title".equals(terms.term().field()))
                          {
                              list.add(terms.term().text());
                              if (!terms.next())
                                  break;
                          }
                      }
                      finally
                      {
                          if (terms != null)
                              terms.close();
                      }
                      for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
                          TermFreqVector tfv = ir.getTermFreqVector(docNum, "title");
                          if (tfv == null) {
                              // ignore empty fields
                              continue;
                          }
                          String[] tterms = tfv.getTerms();
                          int termCount = tterms.length;
                          int[] freqs = tfv.getTermFrequencies();

                          for (int t = 0; t < termCount; t++) {
                              // Cast to double first: dividing two ints would truncate the ratio.
                              double idf = (double) ir.numDocs() / ir.docFreq(new Term("title", tterms[t]));
                              // The term's index in the sorted list is used as its ID.
                              System.out.println(Collections.binarySearch(list, tterms[t]) + " " + tterms[t] + " " + freqs[t] * Math.log(idf));
                          }
                      }
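The ID scheme above boils down to "position of the term in a sorted list of distinct terms". A minimal stand-alone sketch of that idea, with no Lucene calls, might look like the following. The class `TermIdSketch` and its methods are hypothetical names, and the sample terms are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Sketch: use a term's position in a sorted dictionary as its ID (hypothetical names). */
final class TermIdSketch {

    /** Returns the distinct terms in sorted order, backed by a random-access list. */
    static List<String> buildDictionary(Iterable<String> terms) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String term : terms) {
            sorted.add(term);
        }
        return new ArrayList<>(sorted);
    }

    /** ID of a term, or a negative value if the term is not in the dictionary. */
    static int termId(List<String> dictionary, String term) {
        return Collections.binarySearch(dictionary, term);
    }

    public static void main(String[] args) {
        List<String> dictionary =
                buildDictionary(Arrays.asList("lucene", "index", "vector", "index"));
        for (String term : dictionary) {
            System.out.println(termId(dictionary, term) + " " + term);
        }
    }
}
```

An `ArrayList` is used here because `Collections.binarySearch` needs random access to be efficient; on a `LinkedList` it has to walk the links to reach each probed position. These IDs are only stable while the set of terms stays the same, since adding a term shifts the positions of all terms that sort after it.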
                  

                  Recommended answer

                  You probably won't find a ready-made tf-idf vector. But as you have already done, you can calculate the IDF by hand. It is probably better to let DefaultSimilarity (or whatever Similarity implementation you are using) calculate it for you.
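To illustrate why the answer points at DefaultSimilarity, here is a plain-Java sketch of the formulas that Lucene 3.x documents for that class: tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)). This is a re-implementation for illustration, not a call into Lucene; the class `ClassicWeightSketch` and the `weight` helper are hypothetical, and if you use a different Similarity its formulas may differ.

```java
/** Sketch of DefaultSimilarity-style weighting in plain Java (hypothetical class, no Lucene calls). */
final class ClassicWeightSketch {

    /** Dampened term frequency: sqrt(freq). */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Smoothed inverse document frequency: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Hypothetical helper combining the two factors into one weight. */
    static float weight(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // Made-up counts: a term seen 4 times in a document and in 9 of 10 documents.
        System.out.println(weight(4, 9, 10));
    }
}
```

Compared with the hand-written ln(N / df), the square root dampens very frequent terms and the "+ 1" in the denominator avoids division by zero for terms whose document frequency is 0.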

                  Regarding term IDs, I think you currently can't get them, at least not until Lucene 4.0; see this.
