Problem description
I need to parse HTML code. More specifically, I need to parse each cell of every row in all tables. Each row represents a single object, and each cell represents a different property. I want to parse these so that I can write an XML file containing all the data (without the useless HTML code). I have been able to parse each column from the HTML file, but I don't know what my options are for writing the result to an XML file. I am baffled.
HTML:
<tr><tr>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF">
1
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="left">
<a href="/ice/player.htm?id=8471675">Sidney Crosby</a>
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="center">
PIT
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="center">
C
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
39
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
32
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
33
</td>
<td class="statBox sorted" style="border-width:0px 1px 1px 0px; background-color: #E0E0E0" align="right">
<font color="#000000">
65
</font>
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
20
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
29
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
10
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
1
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
3
</td>
<td class="statBox" style="border-width:0px 0px 1px 0px; background-color: #FFFFFF" align="right">
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
0
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
154
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
20.8
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
21:54
</td>
<td class="statBox" style="border-width:0px 1px 1px 0px; background-color: #FFFFFF" align="right">
22.6
</td>
<td class="statBox" style="border-width:0px 0px 1px 0px; background-color: #FFFFFF" align="right">
55.7
</td>
</tr></tr>
C#:
using System;
using HtmlAgilityPack;
namespace Stats
{
    class StatsParser
    {
        private string htmlCode;
        private static string fileName = "[" + DateTime.Now.ToShortDateString() + " NHL Stats].xml";
        public StatsParser(string htmlCode)
        {
            this.htmlCode = htmlCode;
            this.ParseHtml();
        }
        public void ParseHtml()
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(htmlCode);
            try
            {
                // Get all tables in the document
                HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
                // Iterate all rows in the first table
                HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
                for (int i = 0; i < rows.Count; ++i)
                {
                    // Iterate all columns in this row
                    HtmlNodeCollection cols = rows[i].SelectNodes(".//td[@class='statBox']");
                    for (int j = 0; j < cols.Count; ++j)
                    {
                        // Get the value of the column and print it
                        string value = cols[j].InnerText;
                        if (value != "")
                            System.Windows.Forms.MessageBox.Show(value);
                    }
                }
            }
            catch (NullReferenceException)
            {
                // SelectNodes returns null when nothing matches
                System.Windows.Forms.MessageBox.Show("Exception!!");
            }
        }
    }
}
XML:
<?xml version="1.0" encoding="utf-8" ?>
<Stats Date="2011-01-01">
  <Player Rank="1">
    <Name>Sidney Crosby</Name>
    <Team>PIT</Team>
    <Position>C</Position>
    <GamesPlayed>39</GamesPlayed>
    <Goals>32</Goals>
    <Assists>33</Assists>
  </Player>
</Stats>
Recommended answer
After looking around MSDN, I finally found a solution to my problem:
using System;
using System.Xml;
using HtmlAgilityPack;
namespace HockeyStats
{
    class StatsParser
    {
        private string htmlCode;
        private static string fileName = "[" + DateTime.Now.ToShortDateString() + " NHL Stats].xml";
        public StatsParser(string htmlCode)
        {
            this.htmlCode = htmlCode;
            this.ParseHtml();
        }
        public void ParseHtml()
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(htmlCode);
            XmlWriter writer = null;
            try
            {
                // Create an XmlWriterSettings object with the correct options.
                XmlWriterSettings settings = new XmlWriterSettings();
                settings.Indent = true;
                settings.IndentChars = " ";
                settings.OmitXmlDeclaration = false;
                // Create the XmlWriter object and write the root element.
                writer = XmlWriter.Create(@"...." + fileName, settings);
                writer.WriteStartElement("Stats");
                writer.WriteAttributeString("Date", DateTime.Now.ToShortDateString());
                // Select every row nested directly inside another row
                HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(".//tr/tr");
                for (int i = 0; i < rows.Count; ++i)
                {
                    // Select the stat cells in this row; SelectNodes returns null when nothing matches
                    HtmlNodeCollection cols = rows[i].SelectNodes(".//td[@class='statBox']");
                    if (cols == null)
                        continue;
                    for (int j = 0; j < cols.Count; ++j)
                    {
                        // Cell text is surrounded by whitespace in the source HTML
                        string text = cols[j].InnerText.Trim();
                        switch (j)
                        {
                            case 0:
                                // The first cell opens the <Player> element and supplies its Rank
                                writer.WriteStartElement("Player");
                                writer.WriteAttributeString("Rank", text);
                                break;
                            case 1: writer.WriteElementString("Name", text); break;
                            case 2: writer.WriteElementString("Team", text); break;
                            case 3: writer.WriteElementString("Pos", text); break;
                            case 4: writer.WriteElementString("GP", text); break;
                            case 5: writer.WriteElementString("G", text); break;
                            case 6: writer.WriteElementString("A", text); break;
                            case 7: writer.WriteElementString("PlusMinus", text); break;
                            case 8: writer.WriteElementString("PIM", text); break;
                            case 9: writer.WriteElementString("PP", text); break;
                            case 10: writer.WriteElementString("SH", text); break;
                            case 11: writer.WriteElementString("GW", text); break;
                            case 12: writer.WriteElementString("OT", text); break;
                            case 13: writer.WriteElementString("Shots", text); break;
                            case 14: writer.WriteElementString("ShotPctg", text); break;
                            case 15: writer.WriteElementString("TOIPerGame", text); break;
                            case 16: writer.WriteElementString("ShiftsPerGame", text); break;
                            case 17: writer.WriteElementString("FOWinPctg", text); break;
                        }
                    }
                    // Close the <Player> element opened for cell 0
                    writer.WriteEndElement();
                }
                // Close the <Stats> root element
                writer.WriteEndElement();
                writer.Flush();
            }
            finally
            {
                if (writer != null)
                    writer.Close();
            }
        }
    }
}
This produces the following XML file as output:
<?xml version="1.0" encoding="utf-8" ?>
<Stats Date="2011-01-01">
 <Player Rank="1">
  <Name>Sidney Crosby</Name>
  <Team>PIT</Team>
  <Pos>C</Pos>
  <GP>39</GP>
  <G>32</G>
  <A>33</A>
  <PlusMinus>20</PlusMinus>
  <PIM>29</PIM>
  <PP>10</PP>
  <SH>1</SH>
  <GW>3</GW>
  <Shots>0</Shots>
  <ShotPctg>154</ShotPctg>
  <TOIPerGame>20.8</TOIPerGame>
  <ShiftsPerGame>21:54</ShiftsPerGame>
  <FOWinPctg>22.6</FOWinPctg>
 </Player>
</Stats>
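As a possible variation on the approach above, the per-column switch could be replaced by a table of element names. The sketch below is only an illustration under a few assumptions: the cell text has already been extracted into string arrays (so it does not depend on HtmlAgilityPack), the XML is written to a string rather than a file, and only the first few columns are mapped. The names `StatsXml`, `ElementNames`, and `Write` are hypothetical and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class StatsXml
{
    // Element names for cells 1..n, in column order (hypothetical subset).
    // Cell 0 holds the rank and is written as an attribute instead.
    static readonly string[] ElementNames = { "Name", "Team", "Pos", "GP", "G", "A" };

    // Serializes rows of already-extracted cell text into a <Stats> document.
    public static string Write(IEnumerable<string[]> rows, string date)
    {
        // The declaration is omitted because a StringWriter would report utf-16.
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("Stats");
            writer.WriteAttributeString("Date", date);
            foreach (string[] cells in rows)
            {
                if (cells.Length == 0)
                    continue; // nothing to write for an empty row

                writer.WriteStartElement("Player");
                writer.WriteAttributeString("Rank", cells[0].Trim());

                // Write as many named cells as both the row and the table provide.
                int count = Math.Min(cells.Length - 1, ElementNames.Length);
                for (int j = 0; j < count; ++j)
                    writer.WriteElementString(ElementNames[j], cells[j + 1].Trim());

                writer.WriteEndElement(); // </Player>
            }
            writer.WriteEndElement(); // </Stats>
        }
        // Disposing the writer flushes everything into the StringWriter.
        return output.ToString();
    }
}
```

Calling `StatsXml.Write` with a single row such as `{ "1", "Sidney Crosby", "PIT", "C", "39", "32", "33" }` and a date string yields one `Player` element inside `Stats`. Adding a column then means adding one entry to `ElementNames` rather than another `case`.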