Re: strip_tags function [message #178771 is a reply to message #178752] Tue, 31 July 2012 19:17
Tim Fardell
On Sat, 28 Jul 2012 12:42:59 +0200, Thomas 'PointedEars' Lahn
<PointedEars(at)web(dot)de> wrote:

> Tim Fardell wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>> Tim Fardell wrote:
>>>> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>>>> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>>> > However, am I right in thinking that the strip_tags() function simply
>>>> > assumes that any less-than character (<) occurring within a string is
>>>> > the beginning of a tag?
>>>> >
>>>> > I hope I'm wrong, because that would be completely crap and useless :-)
>>>>
>>>> [...]
>>>> I think I am correct that strip_tags assumes any '<' character to be the
>>>> beginning of a tag -
>>>
>>> No, you are not. The function, at least as of PHP 5.3.10, is context-
>>> sensitive:
>>>
>>> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
>>> foo
>>
>> Possibly true,
>
> What do you mean – "possibly"? I tested it there.

By "possibly" I mean that what you say may well be true, and I do not dispute
that it is. All I am saying is that your example does not demonstrate it, and
thus does not prove it.

>> but that example does not demonstrate it.
>
> But it does.

No, it does not. A function that assumes every occurrence of the '<' character
is the beginning of a tag, and removes it and all text from it up to and
including the next '>' character in the string would yield exactly the same
output. You have not demonstrated context sensitivity.
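
To make that concrete, here is a minimal sketch of such a naive function. The name naive_strip() is hypothetical (it is not part of PHP), and the regular expression is only one way to write "delete from each '<' up to and including the next '>'":

```php
<?php
// Hypothetical helper, not a PHP built-in: treats every '<' as the start
// of a tag and deletes everything up to and including the next '>'.
// A '<' with no later '>' is left alone by this particular regex.
function naive_strip($s)
{
    return preg_replace('/<[^>]*>/', '', $s);
}

// Same input as the quoted example: the naive version also prints "foo",
// so that input cannot tell the two behaviours apart.
echo naive_strip("<a title='<'>foo</a>"), "\n";   // prints "foo"
```

Because the naive version gives the same result on that input, the quoted test says nothing either way about context sensitivity.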

>> Try:
>>
>> $ php -r "echo strip_tags('<a title=\'<\'>f<o<o</a>');"
> ^
>> Output should be
>>
>> f<o<o
>
> No, it should not.

If the function is intended to be used to process plain unformatted ASCII text
and remove any unwanted HTML tags from it, then yes, it should. If the function
is intended to remove all HTML tags from an HTML document, then I agree it
should not. Which is what I said to start with in the followup to my original
post.

> The markup is syntactically invalid in the first place

What markup? If, as you suggest later on, and also implied in your previous
post, the input text is supposed to be plain, unformatted text, which may
unintentionally contain rogue HTML tags that should not be there, then the
input text is unlikely to be syntactically correct markup anyway.

>> Definitely doesn't work in PHP 5.3.3, and according to php.net the
>> function hasn't changed since 5.0.0.
>
> It would be prudent if you learned about SGML-based markup languages before
> you attempted to pass on judgement on the correctness of their parsers.

Actually I think what I have been saying all along is that the function behaves
correctly if the input text is HTML. I believe the function is supposed to take
correct HTML as input, in which case it works exactly as I would expect.

>>>> But this doesn't actually matter, since the input string should be HTML
>>>> encoded anyway, so all '<' characters should be escaped as '&lt;' - so
>>>> all actual '<' characters will indeed be tags :-)
>>>
>>> You are not making sense. The *input* data should *never* be "HTML
>>> encoded".

Wait, didn't you just say the opposite? If the input is not HTML-encoded then it
is *critical* that any '<' character which does not form part of a tag is
ignored. It is not, as my example above proves.
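
A minimal sketch of the distinction, assuming htmlspecialchars() is the encoding step (an assumption for illustration; nobody in this thread named a particular function):

```php
<?php
$text = 'f<o<o';                      // plain text containing literal '<'

// Encode first: every literal '<' becomes '&lt;'.
$encoded = htmlspecialchars($text);
echo $encoded, "\n";                  // prints "f&lt;o&lt;o"

// With no literal '<' left, strip_tags() has nothing to treat as a tag.
echo strip_tags($encoded), "\n";      // prints "f&lt;o&lt;o"
```

Encoded input passes through untouched, which is why the question of how a bare '<' is handled only arises for input that has not been encoded.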

>> Then I must have completely misunderstood something here. I thought the
>> whole point of strip_tags() was to remove all HTML tags from the input
>> string.
>
> Yes, HTML *tags*.

Good, we agree on something.

>> Therefore the input has to be HTML or the function is pointless.
>
> You have defined "HTML-encoded" above to mean that "<" would be "&lt;".
> (Which is a common, and semantically correct definition of the term.)
>
> So HTML-*encoded* strings are _not_ HTML *markup*; by definition, there are
> no tags in them.

Agreed again.

>> Unless the idea is to remove rogue HTML tags from a plain text string,
>
> ISTM that is the general idea.

Oh dear, we're going to start disagreeing again here.

> Some HTML elements (which consist of start
> tag, optional content and optional end tag, depending on the element type),
> are potentially detrimental to the expected display and functionality of a
> Web document. For example, consider that people were allowed to use text-
> formatting elements in a blog comment, but were not allowed to insert
> `script' elements (to avoid XSS) or `img' and `object' elements (to avoid
> reduction of loading speed and interference with other multimedia on the
> site); you would only list the text-formatting elements in the function's
> second parameter then.
>
> Other people might want to remove HTML tags altogether, leaving only the
> text content of the document (fragment).

OK, so given no parameters other than the input string, the behaviour of
strip_tags() is to remove all HTML tags from the input string. Correct?
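
For reference, a short sketch of the two call forms under discussion. The sample comment string is made up for illustration; the second argument is the allowed-tags list described in the quoted text:

```php
<?php
// Made-up sample input: one formatting element and one script element.
$comment = '<b>Hi</b> <script>alert(1)</script>';

// No second argument: every tag is removed, the text between tags stays.
$plain = strip_tags($comment);
echo $plain, "\n";                    // prints "Hi alert(1)"

// Second argument lists the tags to keep; everything else is stripped.
$formatted = strip_tags($comment, '<b>');
echo $formatted, "\n";                // prints "<b>Hi</b> alert(1)"
```

Note that only the tags go; the text that was inside the script element is still in the output.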

You have just said that you believe the input string should be plain text, i.e.
it *should* not contain any tags or formatting information of any kind, but may
contain rogue HTML tags which need to be removed. You also said that the input
string should *never* be HTML-encoded. Therefore, the input string could quite
easily contain '<' characters which are not part of HTML tags, as '<' is a
perfectly valid ASCII character. Therefore, by your reasoning, it is important
that any '<' character which does not form part of a tag is ignored.

This is not the case.

>> in which case my original point about it assuming all < symbols are tags
>
> The `<' character is an STAGO (STart Tag Open) delimiter in SGML-based
> markup languages. It delimits a start tag on the left-hand side (a `>'
> character delimits it on the right-hand side). Obviously, therefore it
> cannot be a tag itself.

Sorry, but that is extremely pedantic. I think it's obvious what I meant. Just
to clarify, I meant to say "...in which case my original point about it assuming
all < symbols *are the beginning of* tags..."

>> remains valid - it's crap and useless for this.
>
> Evidently, you do not know what you are talking about.

No need for rudeness, is there?

I think basically what I'm saying is that the strip_tags() function is great and
is really useful if its input text is correct HTML.

I therefore believe this function is intended to take correct HTML as input. It
is therefore crap and useless if you pass it anything other than correct HTML,
which is what you seem to be disputing.

--
Please remove all-your-clothes before replying.