Extracting data from a string

Status
Not open for further replies.

pankaj

I've extracted this from a web page and stored it in a string.

----------------------------------------------------------------------------
<tr height="25">
<td nowrap class="odd" align="center"><img
src="/forums/images/icon_topic_new.gif" width=14 height=14 alt='New Topic'
border=0></td>

<td nowrap class="odd" align="center">&nbsp;</td>

<td nowrap class="odd" align="center">&nbsp;</td>
<td width="85%" class="even" align="left"><font class="new-row"><a
href="topic.asp?tid=106110">
Quality ebay auction</a>&nbsp;</font>
<font class="sub-row">in General&nbsp;/&nbsp;The Lounge</font><font
class="sub-row"><br>Started 7/15/2005 - pages <a
href="topic.asp?tid=106110">1</a> - last posted by <a
href="profile.asp?action=view&id=Shandy" onmouseover="window.status='Show
the authors profile'; return true;" onmouseout="window.status=''; return
true;">Shandy</a></font></td>
<td width="15%" class="even" valign="middle" align="left"><font
class="new-row"><a href="profile.asp?action=view&id=DiscoInferno"
onmouseover="window.status='Show the authors profile'; return true;"
onmouseout="window.status=''; return true;">DiscoInf<BR>erno</a></font></td>
<td nowrap class="odd" valign="middle" align="center"><font
class="new-row">9</font></td>
<td nowrap class="odd" valign="middle" align="left">
<font class="new-row">7/15/2005<br>
<font class="sub-row">5:02:16 PM</font></font></td>
</tr>
----------------------------------------------------------------------------

It's a table which shows the latest posts of a forum. I'd like to pull out
the following information:
Topic: Quality ebay auction
Original poster: DiscoInferno
Started: 7/15/2005
Last Post By: Shandy
Last Post Date: 7/15/2005 5:02:16 PM

This *type* of information is repeated down the web page, although the data will change, and I want to do this for the whole page.


Any suggestions or sample code?

I need this in C#.
 
Download Simple HTML DOM from SourceForge.

Include the file in your application.

Then use this code:

PHP:
// Include the Simple HTML DOM file (adjust the path to wherever you saved it)
include 'simple_html_dom.php';

// Source: a nowdoc is used because the HTML contains both single and double quotes
$source = <<<'HTML'
<tr height="25">
<td nowrap class="odd" align="center"><img
src="/forums/images/icon_topic_new.gif" width=14 height=14 alt='New Topic'
border=0></td>
<td nowrap class="odd" align="center">&nbsp;</td>
<td nowrap class="odd" align="center">&nbsp;</td>
<td width="85%" class="even" align="left"><font class="new-row"><a
href="topic.asp?tid=106110">
Quality ebay auction</a>&nbsp;</font>
<font class="sub-row">in General&nbsp;/&nbsp;The Lounge</font><font
class="sub-row"><br>Started 7/15/2005 - pages <a
href="topic.asp?tid=106110">1</a> - last posted by <a
href="profile.asp?action=view&id=Shandy" onmouseover="window.status='Show
the authors profile'; return true;" onmouseout="window.status=''; return
true;">Shandy</a></font></td>
<td width="15%" class="even" valign="middle" align="left"><font
class="new-row"><a href="profile.asp?action=view&id=DiscoInferno"
onmouseover="window.status='Show the authors profile'; return true;"
onmouseout="window.status=''; return true;">DiscoInf<BR>erno</a></font></td>
<td nowrap class="odd" valign="middle" align="center"><font
class="new-row">9</font></td>
<td nowrap class="odd" valign="middle" align="left">
<font class="new-row">7/15/2005<br>
<font class="sub-row">5:02:16 PM</font></font></td>
</tr>
HTML;

$html = str_get_html($source);

// find() with an index returns one element; plaintext is its text content
$topic = trim($html->find('tr td font.new-row a', 0)->plaintext);

//You get the idea.....

Peace
 
I have already stored the extracted web page in a string.

I didn't understand what you did with $topic to find the topic.
Where did you specify the starting string and the ending string, and extract what lies between them?
How do I use a loop to extract "topic" and then "topic2" from the same page, which contains another string of the same type?

PHP:
$topic = trim($html->find('tr td font.new-row a', 0)->plaintext);

By the way, I downloaded it and it contains PHP files. How do I include it?
 
You would do something like

PHP:
$html = file_get_html('http://site.com/topic.asp?id=2'); // loads and parses a URL or file

foreach ($html->find('table tr') as $tableRow)
{
    // Each row can now be queried like the DOM
    $topicLink = $tableRow->find('td font.new-row a', 0); // first matching link, or null
    if ($topicLink) {
        $topic = trim($topicLink->plaintext);
    }
}

This is not regex but DOM traversal, like jQuery inside PHP.
 
Okay, if someone knows C#, that is what I need.

Some code for extracting xyz.abc from <a tag><c tag>xyz.abc</c></a>
See the first post for what I need to search for and how.
 
Try using some logic: where an angle bracket closes and the text starts without another opening angle bracket, store each character in a character array using a pointer. I can do it in C; I haven't started C# yet :P
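
The character-by-character idea above could look roughly like this in C#. It is only a sketch: the class and method names (`TagStripper`, `TextOutsideTags`) are made up for illustration, and a `StringBuilder` stands in for the character array and pointer.

```csharp
using System.Text;

public static class TagStripper
{
    // Returns the characters of html that lie outside <...> tags.
    // Hypothetical helper for illustration; not part of any library.
    public static string TextOutsideTags(string html)
    {
        var text = new StringBuilder();
        bool insideTag = false;
        foreach (char c in html)
        {
            if (c == '<') insideTag = true;        // a tag opens: stop copying
            else if (c == '>') insideTag = false;  // the tag closes: resume copying
            else if (!insideTag) text.Append(c);   // plain text between tags
        }
        return text.ToString();
    }
}
```

Calling `TagStripper.TextOutsideTags("<a tag><c tag>xyz.abc</c></a>")` gives `xyz.abc`, and the `DiscoInf<BR>erno` cell from the first post comes out as `DiscoInferno`. The scan does not decode entities such as `&nbsp;`, and a `>` inside an attribute value would confuse it.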
 
It's relatively simple actually. You can use RegEx as mentioned already; here's an example for the "<a tag><c tag>xyz.abc</c></a>" string.

Code:
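// requires: using System.Text.RegularExpressions;
string StringToSearch = "<a tag><c tag>xyz.abc</c></a>";
// Groups[1] is the first capture group; "/" needs no escaping in a C# string
string StringFound = Regex.Match(StringToSearch, "<a tag><c tag>(.*?)</c></a>").Groups[1].Value;
MessageBox.Show(StringFound);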


You can adapt the code yourself for whatever you need :)
If you need all matches, use the Matches function instead of Match, then loop over the result with foreach and read the Groups of each match.
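
A minimal sketch of that Matches-plus-foreach variant, using the same toy `<a tag><c tag>` markup. The wrapper class and method names are invented for the example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AllMatches
{
    // Collects the text of every <a tag><c tag>...</c></a> occurrence in input.
    public static List<string> ExtractAll(string input)
    {
        var found = new List<string>();
        // The lazy (.*?) stops at the first closing pair, so that
        // neighbouring occurrences are not swallowed into one match.
        foreach (Match m in Regex.Matches(input, "<a tag><c tag>(.*?)</c></a>"))
        {
            found.Add(m.Groups[1].Value); // Groups[0] is the whole match
        }
        return found;
    }
}
```

For an input holding two occurrences, `xyz.abc` and `foo.bar`, the list contains those two strings in order. With a greedy `(.*)` the same input would produce a single match running from the first opening tag to the last closing tag.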
 
Never use regex for parsing markup. Download SharpLeech and add a reference to the Engine dll in your project. Then add these using directives to your code:

Code:
using System.Web; // for HttpUtility
using Hyperz.SharpLeech.Engine.Html;
using Hyperz.SharpLeech.Engine.Net;
Now you can use it like:
Code:
var html = new HtmlDocument();

// load the html
html.LoadHtml("<div class=\"example\">foo</div>");

// use XPath to select the div
var node = html.DocumentNode.SelectSingleNode("//div[@class='example']");
var divContent = HttpUtility.HtmlDecode(node.InnerText);
XPath info: http://www.w3schools.com/xpath/default.asp
 
Where do I put this file, Hyperz.SharpLeech.Engine.dll?


I guess your code extracts the word "example" from between the <div> tags.
But how do I extract links that start with http and end with a given .extension?
 
Can't find what files? You only need Hyperz.SharpLeech.Engine.dll. And nope, the example extracts the word foo. Take a look at XPath via the link I posted.

Regarding the other question:
Code:
var html = new HtmlDocument();

// load the html
html.LoadHtml(yourHtmlHere);

// use XPath to select all "A" elements from the html
var anchors = html.DocumentNode.SelectNodes("//a");

// keep only those whose href starts with http (requires: using System.Linq;)
var filter = from a in anchors
             where a.GetAttributeValue("href", "").StartsWith("http")
             select a;
Just experiment with it.
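
For the "starts with http and ends with .extension" part of the question, one possible check is sketched below. It uses only `System.Uri` from the .NET base library; the class and method names are hypothetical, and it tests the parsed URL's scheme and path instead of doing a plain `StartsWith("http")`.

```csharp
using System;

public static class LinkFilter
{
    // True when href is an absolute http/https URL whose path ends with
    // the given extension (for example ".zip"), ignoring case.
    // Hypothetical helper for illustration.
    public static bool IsHttpLinkWithExtension(string href, string extension)
    {
        // Relative links such as "topic.asp?tid=1" fail to parse as absolute.
        if (!Uri.TryCreate(href, UriKind.Absolute, out Uri uri)) return false;

        bool isHttp = uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;

        // AbsolutePath excludes the query string, so "file.zip?x=1" still counts.
        return isHttp && uri.AbsolutePath.EndsWith(extension, StringComparison.OrdinalIgnoreCase);
    }
}
```

It could replace the condition in the `where` clause above, e.g. `where LinkFilter.IsHttpLinkWithExtension(a.GetAttributeValue("href", ""), ".zip")`.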
 
"Never use regex for parsing markup."
I beg to differ - regex is much cleaner lol
 
Cleaner? That sounds like something a VB6 coder would say <_<. You being a coder should know that you can't use regex for parsing markup. For one, it is much too slow for that. Secondly, your expressions are static: they can't handle changes in the DOM structure without being rewritten. Then there is the issue of inner HTML, etc.

The only case in which you can use regex is when you need only one simple string from a small HTML document whose contents you know won't change. For anything else it'll turn into a slow, unmanageable mess. I'd be more than happy to put this to the test ;).
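
To make the "static expressions" point concrete, here is a small sketch (names invented for the example) of a pattern written against one exact form of the topic link from the first post. It says nothing about speed, only about how little the markup has to change before the pattern stops matching.

```csharp
using System.Text.RegularExpressions;

public static class BrittleRegexDemo
{
    // Written against one exact serialization of the link:
    // <a href="topic.asp?tid=NNN">
    private const string Pattern = "<a href=\"topic\\.asp\\?tid=(\\d+)\">";

    // Reports whether the pattern still finds a topic link in html.
    public static bool FindsTopicId(string html)
    {
        return Regex.IsMatch(html, Pattern);
    }
}
```

`FindsTopicId("<a href=\"topic.asp?tid=106110\">")` is true, but adding a `class` attribute before `href`, or switching the attribute to single quotes, makes it false even though a browser or DOM parser would treat all three as the same link.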
 
Using a DOM parser is faster than regex? I thought DOM parsers used regex :| Anyway those parsers use up too much memory. For his case regex is simple - and using a parser is overkill
 