Problem description
My aim is to create a HashMap with a String as the key and a HashSet of Strings as each entry's value.
Output
The output currently looks like this:
Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]]
As I see it, it should look like this:
[Hudson+(surname)=[Q2720681,Q141445,Q5928530,Q2272323,Q2672022]]
The purpose is to store a particular name in Wikidata and then all of the Q values associated with its disambiguation, so for example:
This is the page for "Bush".
I want Bush to be the key, and then for all of the different points of departure, all of the different ways that Bush could be associated with a terminal page of Wikidata, I want to store the corresponding "Q value", or unique alphanumeric identifier.
What I'm actually doing is scraping the different names from the Wikipedia disambiguation page and then looking up the unique alphanumeric identifier associated with each name in Wikidata.
For example, with Bush we have:
George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)
The corresponding Q values are:
George H. W. Bush (Q23505)
George W. Bush (Q207)
Jeb Bush (Q221997)
Bush family (Q2743830)
Bush (Q1484464)
My idea is that the data structure should be laid out as follows:
Key: Bush
Entry set: Q23505, Q207, Q221997, Q2743830, Q1484464
But the code I have now doesn't do that.
It creates a separate entry for each name and Q value, i.e.
Key: Jeb Bush
Entry set: Q221997
Key: George W. Bush
Entry set: Q207
and so on.
The full code in all its glory can be seen on my GitHub page, but I'll also summarize it below.
This is what I'm using to add values to my data structure:
// add Q values to their arrayList in the hash map at the index of the appropriate entity
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
This is how I fetch the content:
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
    // if we can determine it's a disambig page we need to send it off to get all
    // the possible senses in which it can be used.
    Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
    Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
    if (disambig_indicator.matches())
    {
        //off to get the different usages
        Wikipedia_Disambig_Fetcher.all_possibilities( variable_entity );
    }
    else
    {
        //get the Q value off the page by matching
        Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
        Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
        if ( match_Q_component.matches() )
        {
            String Q = match_Q_component.group(1);
            // 'Q' should be appended to an array, since each entity can hold multiple
            // Q values on that basis of disambig
            put_to_hash( variable_entity, Q );
        }
    }
}
And this is how I deal with a disambiguation page:
public static void all_possibilities( String variable_entity ) throws Exception
{
    System.out.println("this is a disambig page");
    //if it's a disambig page we know we can go right to the wikipedia
    //get its normal wiki disambig page
    Document docx = Jsoup.connect( "https://en.wikipedia.org/wiki/" + variable_entity ).get();
    //this can handle the less structured ones.
    Elements linx = docx.select( "p:contains(" + variable_entity + ") ~ ul a:eq(0)" );
    for (Element linq : linx)
    {
        System.out.println(linq.text());
        String linq_nospace = linq.text().replace(' ', '+');
        Wikidata_Q_Reader.getQ( linq_nospace );
    }
}
I was thinking maybe I could pass the key value around, but I really don't know. I'm kind of stuck. Maybe someone can see how I can implement this functionality.
Recommended answer
I'm not clear from your question what isn't working, or whether you're seeing actual errors. But while your basic data structure idea (a HashMap of String to Set<String>) is sound, there's a bug in the "add" function.
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
In the case where a key is seen for the first time (if (!q_valMap.containsKey(key))), it vivifies a new HashSet for that key, but it doesn't add value to it before returning. (And the returned value is the previous value for that key, so it'll be null.) So you're going to lose the first Q value seen for every key.
For multi-layered data structures like this, I usually special-case just the vivification of the intermediate structure, and then do the adding and returning in a single code path. I think this would fix it. (I'm also going to call it valSet because it's a set and not a list. And there's no need to re-add the set to the map each time; it's a reference type and gets added the first time you encounter that key.)
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key)) {
        q_valMap.put(key, new HashSet<String>());
    }
    HashSet<String> valSet = q_valMap.get(key);
    valSet.add(value);
    return valSet;
}
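As a quick check, here is a minimal self-contained sketch of that method in use. The declaration of q_valMap and the class name QValueIndexDemo are my assumptions, since the question never shows how the map is declared. The sketch also assumes the caller passes the original search term ("Bush") as the key on every call. In the question's code, getQ receives each link's own name, so every page would still end up under its own key even with the method fixed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

class QValueIndexDemo {
    // Assumed declaration: the question does not show how q_valMap is defined.
    static final Map<String, HashSet<String>> q_valMap = new HashMap<>();

    // Same logic as the corrected method: create the set once, then always add.
    static HashSet<String> put_to_hash(String key, String value) {
        if (!q_valMap.containsKey(key)) {
            q_valMap.put(key, new HashSet<String>());
        }
        HashSet<String> valSet = q_valMap.get(key);
        valSet.add(value);
        return valSet;
    }

    public static void main(String[] args) {
        // Q values taken from the question's Bush example.
        String[] qValues = {"Q23505", "Q207", "Q221997", "Q2743830", "Q1484464"};
        for (String q : qValues) {
            // The key is the original term, not the name of the individual page.
            put_to_hash("Bush", q);
        }
        System.out.println(q_valMap.size());             // prints 1 (a single key)
        System.out.println(q_valMap.get("Bush").size()); // prints 5 (all five Q values)
    }
}
```

If that matches what you want, the remaining work is in the callers: the original term has to travel along with each lookup so that it, rather than the link text, is used as the key.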
Also be aware that the Set you return is a reference to the live Set for that key, so you need to be careful about modifying it in callers, and if you're doing multithreading you're going to have concurrent access issues.
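One way to sidestep both concerns, sketched under my own assumptions, is to keep the map private, synchronize access to it, and hand callers a copy instead of the live set. The class name SynchronizedQValueIndex and its method names are hypothetical, not from your code, and coarse synchronization is only one of several reasonable choices.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper; names are illustrative only.
class SynchronizedQValueIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // synchronized so that two threads cannot interleave the lookup and the add.
    synchronized void put(String key, String value) {
        Set<String> valSet = index.get(key);
        if (valSet == null) {
            valSet = new HashSet<String>();
            index.put(key, valSet);
        }
        valSet.add(value);
    }

    // Returns a snapshot copy, so callers can modify it without touching the index.
    synchronized Set<String> get(String key) {
        Set<String> valSet = index.get(key);
        return valSet == null ? new HashSet<String>() : new HashSet<String>(valSet);
    }
}
```

Copying on every read costs some allocation, which is probably acceptable for a scraper of this size. A caller that needs to see later additions has to call get again.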
Or just use a Guava Multimap so you don't have to worry about writing the implementation yourself.
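If adding a dependency isn't an option, the standard library gets close on Java 8 or later. This sketch swaps in Map.computeIfAbsent from java.util rather than using Guava, and the class name MultimapLikeIndex is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a multimap, using only java.util.
class MultimapLikeIndex {
    static final Map<String, Set<String>> index = new HashMap<>();

    // Creates the set on first use and adds the value in one expression.
    // Returns true if the value was not already stored under that key.
    static boolean put(String key, String value) {
        return index.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }
}
```

Because the set is created and the value added in the same expression, there is no separate "first time" branch in which the value could be forgotten.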