A doubt about php's preg_match result.

PCManiac · May 26, 2011

So, I have this html file that basically says:

PHP:

echo '<table>
<tr>
     <td>Welcome to:</td>
     <td><strong>Hell</strong></td>
</tr>
</table>';

(Whether this code is valid or not does not matter; it is just for the sake of having an example.)

Now, I have another PHP file that should get (through cURL and preg_match) the "Hell" part:

Curl Result ends in the var: $Result
Pattern I want to match against the $Result: $pattern = '/ <td>Welcome to\:<\/td>
<td>(.*)<\/td>/m';
Preg Match result var: $pregResult

PHP:

preg_match($pattern, $Result, $pregResult);

doing a var_dump($pregResult); returns an empty array

PHP:

array(0) { }

yet, testing my example and reg expr in the site http://www.spaweditor.com/scripts/regex/index.php it returns the expected value (Hell)

So, can anyone point me to what is going wrong? I am out of ideas.

My setup is windows + XAMPP with phpversion 5.3.5.

PC

Here are the actual php files:
-------------- test.php ------------

PHP:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>

<body>
<?php

echo '<table width="50%" border="0" cellspacing="0" cellpadding="0">
  <tr>
    <td>Welcome to:</td>
  </tr>
  <tr>
    <td><strong>Hell</strong></td>
  </tr>
</table>';

?>
</body>
</html>

-------------- curl_teste.php ------------

PHP:

<?php
error_reporting(-1);  

$curlTarget = "http://127.0.0.1/multiline/test.php";


$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $curlTarget);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

$result = curl_exec ($ch);
curl_close($ch);

$patternCoiso = '/    <td>Welcome to:<\/td>
    <td>(.*)<\/td>/mi';

preg_match($patternCoiso, $result, $match);


var_dump($match);

?>

Zeokat · May 26, 2011

I think that ":" does not need to be escaped... try with:

Code:

$pattern = '/    <td>Welcome to:<\/td>
    <td>(.*)<\/td>/m';

Not enough time to test it myself right now....

Loget · May 26, 2011

The Regex is fine, your preg_match order is messed up though. It should be:

PHP:

preg_match($pattern, $Result, $pregResult);

You first specify the pattern, then the subject and then the name of the array that the results will be put in.
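As a small illustration of that argument order, here is a minimal sketch. The $subject string is made up for the example, and checking the return value is an extra suggestion, not something from the original code: preg_match returns 1 on a match, 0 on no match, and false if the pattern itself failed.

```php
<?php
// Hypothetical sample input, just for illustration.
$subject = '<td>Welcome to:</td>';
$matches = array();

// Order: pattern first, then the subject, then the array that receives the results.
$status = preg_match('/<td>(.*)<\/td>/', $subject, $matches);

if ($status === 1) {
    echo $matches[1], "\n";          // text captured by the first group
} elseif ($status === 0) {
    echo "no match\n";               // valid pattern, nothing found
} else {
    echo "regex error code: ", preg_last_error(), "\n";
}
```

Telling 0 apart from false makes it easier to see whether the pattern simply did not match or could not be run at all.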

PCManiac · May 26, 2011

Bleh.. sorry Loget, that's how I have it coded.. I just messed it up while retyping it into the post. >_<

*fixes*

Meaning that is not the problem :\

Zeokat: you are right, it's not needed, but escaping it won't (or shouldn't) change the result in any way. I escaped it just to make sure that wasn't what was messing it up. Thanks for the reply though.

NewEraCracker · May 26, 2011

any uninitialized variable in code?
Add this to the beginning of the code for debugging after the <?php tag

PHP:

error_reporting(-1);

PCManiac · May 26, 2011

No error output NEC. :\

I just copied/pasted the actual files to the main post.
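With the actual files posted, two things may be worth checking. This is a guess, not a confirmed diagnosis. First, the pattern contains a literal line break and fixed indentation, so it only matches if the fetched markup has exactly the same bytes there. The `m` modifier only changes how `^` and `$` behave; it does not make the pattern tolerant of different line endings. Second, in test.php the two cells sit in separate `<tr>` rows, while the pattern expects them on adjacent lines. The sketch below assumes the page is served with Windows (CRLF) line endings; the $tolerant pattern is only one possible way to write it.

```php
<?php
// Markup as echoed by test.php, ASSUMED here to use CRLF line endings.
$result = "  <tr>\r\n    <td>Welcome to:</td>\r\n  </tr>\r\n  <tr>\r\n"
        . "    <td><strong>Hell</strong></td>\r\n  </tr>";

// Pattern with a literal LF and fixed indentation, in the style of curl_teste.php.
$literal = "/    <td>Welcome to:<\\/td>\n    <td>(.*)<\\/td>/mi";
$literalHit = preg_match($literal, $result, $m1);

// Whitespace-tolerant variant that also allows for the row boundary.
$tolerant = '/<td>Welcome to:<\/td>\s*<\/tr>\s*<tr>\s*<td>(.*)<\/td>/i';
$tolerantHit = preg_match($tolerant, $result, $m2);

var_dump($literalHit);   // the literal pattern does not match this markup
var_dump($tolerantHit);  // the tolerant one does
var_dump($m2[1]);        // content of the second cell, tags included
```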

Lock Down · May 27, 2011

Here is the code to do exactly what you want:

PHP:

$Result = '<table>
<tr>
     <td>Welcome to:</td>
     <td><strong>Hell</strong></td>
</tr>
</table>';
$w = "[\s\S]*?"; 
$pattern = "/\<td>Welcome to:<\/td\>$w<td\>(.*)<\/td\>/";
preg_match($pattern, $Result, $pregResult);
var_dump($pregResult);

$pregResult is :

Code:

array(2) {
  [0]=>
  string(57) "<td>Welcome to:</td>
     <td><strong>Hell</strong></td>"
  [1]=>
  string(21) "<strong>Hell</strong>"
}

Mr Happy · May 27, 2011

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror.
Regex-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a child ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of corrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he comes, his unholy radiance destroying all enlightenment, HTML tags leaking from your eyes like liquid pain, the song of regular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see it it is beautiful the final snuffing of the lies of Man ALL IS LOST ALL IS LOST the pony he comes he comes he comes the ichor permeates all MY FACE MY FACE oh god no NO NOOOO NO stop the angles are not real ZALGO IS TONY THE PONY HE COMES

Have you tried using an XML parser instead?

____________________________________________________________________
Doesn't look great here, but the original answer to this type of question is the most popular reply ever on the world's busiest coding site. It's the best answer I've ever seen to a coding question.
http://stackoverflow.com/questions/...ept-xhtml-self-contained-tags/1732454#1732454
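
For anyone wanting to try the parser route, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath) on the table from the first post. The XPath expression and the variable names are illustrative choices, and the sketch assumes the DOM extension is enabled, which is usually the default.

```php
<?php
// Sample markup from the first post.
$html = '<table>
<tr>
     <td>Welcome to:</td>
     <td><strong>Hell</strong></td>
</tr>
</table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find the cell whose text is "Welcome to:" and take the next <td> in document order.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[normalize-space(.) = "Welcome to:"]/following::td[1]');

// textContent drops the <strong> tags and keeps only the text.
$value = ($nodes->length > 0) ? trim($nodes->item(0)->textContent) : null;
var_dump($value);
```

Because `following::td[1]` works in document order, the same query should also cope with the two cells being in separate rows, as in the full test.php.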

Cory · May 27, 2011

PHP:

preg_match ("/(<h1)(.*?)(>)(.*?)(<\/h1>)/i", $data, $match);
$found = $match[4];

That's how I do it...Works good, I wrote a script that scans a download file and parses a specific html tag, in this case <h1

-- Based off some code I use to parse a title of a webpage by getting <title>

So it's possible.

Might have to play with the $match indexes (no pun intended) to find the right one; with the pattern above it is $match[4].
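
To see which index holds what, here is a quick sketch. The $data string is a made-up sample; the pattern is the one from this post. Every pair of parentheses gets its own slot, numbered left to right, and index 0 holds the whole match.

```php
<?php
// Hypothetical sample input.
$data = '<h1 class="title">Hello</h1>';

preg_match("/(<h1)(.*?)(>)(.*?)(<\/h1>)/i", $data, $match);

// $match[0] = whole match
// $match[1] = '<h1'
// $match[2] = attributes, if any
// $match[3] = '>'
// $match[4] = text between the tags
// $match[5] = '</h1>'
print_r($match);

// Same idea with a single capturing group, so the text lands in index 1.
preg_match('/<h1[^>]*>(.*?)<\/h1>/i', $data, $simple);
echo $simple[1], "\n";
```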

PCManiac · May 27, 2011

Thanks for your replies guys.

Lock Down, it's weird but that does not work on my test environment... I am starting to think that the PHP version I have (or one of its libs) is buggy...

Mr Happy, I am not a programmer.. I am just an enthusiast xD Would you point me in the right direction about parsing HTML with an XML parser?

Cory · May 27, 2011

Created: test.php

PHP:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>

<body>
<?php

echo '<table width="50%" border="0" cellspacing="0" cellpadding="0">
  <tr>
    <td>Welcome to:</td>
  </tr>
  <tr>
    <td><strong>Hell</strong></td>
  </tr>
</table>';

?>
</body>
</html>

Created: curl.php

PHP:

<?php
error_reporting(-1);  

$curlTarget = "http://127.0.0.1/test.php";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$curlTarget);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);

$result = curl_exec ($ch);
curl_close($ch);

preg_match ("/(<strong)(.*?)(>)(.*?)(<\/strong>)/i", $result, $match);
echo $match[4];

?>

Test: http://djdog2006.com/curl/test.php
Curl: http://djdog2006.com/curl/curl.php

JmZ · May 27, 2011

Seriously this is like watching regex being mutilated into some evil devil child lol.

It *SHOULD* be (for the html above, to match just the contents of strong):

Code:

#<strong>([^<]+)</strong>#i

I don't know why anyone is matching two wildcards, strong doesn't have attributes (unless you're some insane guy who gives strong tags style attributes).

As for your original post it would've been:

Code:

#<td>Welcome to:</td>\s*<td>([^<]+)</td>#i

Why you decided to make the whitespace its own variable, Lock Down, I do not know. It isn't even whitespace.

This:

Code:

[\s\S]*

is the same as this:

Code:

.*

Any amount of non-whitespace or whitespace characters == any amount of any characters.
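
One caveat on that equivalence, as far as PCRE (the engine behind preg_match) goes: without the `s` modifier, `.` does not match a newline, while `[\s\S]` does. So the two only behave the same when the `s` flag is set or the subject has no line breaks. A small sketch with a made-up two-line $subject:

```php
<?php
// Two cells separated by a line break (hypothetical sample).
$subject = "<td>Welcome to:</td>\n<td>Hell</td>";

// Plain dot: stops at the newline, so the second <td> is never reached.
$dot = preg_match('#Welcome to:</td>.*<td>#', $subject);

// Dot with the "s" (dotall) modifier: the dot now matches newlines too.
$dotAll = preg_match('#Welcome to:</td>.*<td>#s', $subject);

// [\s\S] matches any character, newline included, with no modifier needed.
$anyChar = preg_match('#Welcome to:</td>[\s\S]*<td>#', $subject);

var_dump($dot, $dotAll, $anyChar);
```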

Lock Down · May 27, 2011

Regex is some evil devil STEP child !!

:)

PCManiac · May 27, 2011

Thank you for your replies guys.
The replies, plus fiddling with my sources on my work computer, made me see that something in my system is messing with my XAMPP PHP install, so I will be reinstalling it properly, from scratch, on a virtual box.

Yet, I have a question JmZ: what do ([^>]+) and ([^<]+) mean? I have seen them over and over on other sites but I can't find a proper description of them.

Cory · May 27, 2011

That's what I did in my example;

If I was coding this myself, I would use;

<strong id="identify" and display that in there so I know what it's actually getting.

JmZ · May 28, 2011

PCManiac said:
Yet, I have a question JmZ: what do ([^>]+) and ([^<]+) mean? I have seen them over and over on other sites but I can't find a proper description of them.

Code:

[^<]+

Means one or more characters which aren't '<'.
Quick way of matching things inside tags, but it will break with nested tags.
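
A short sketch of both halves of that statement, using sample strings based on the markup earlier in the thread (the variable names are just for illustration):

```php
<?php
$cell = '<td><strong>Hell</strong></td>';

// [^<]+ grabs everything up to the next '<', so it works for a tag holding plain text.
$plainHit = preg_match('#<strong>([^<]+)</strong>#i', $cell, $plain);

// With a nested tag, the very first character after <td> is '<',
// so [^<]+ cannot match even one character and the whole pattern fails.
$nestedHit = preg_match('#<td>([^<]+)</td>#i', $cell, $nested);

var_dump($plainHit, $plain[1], $nestedHit);
```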
