comp.lang.php » strip_tags function
Re: strip_tags function [message #178775 is a reply to message #178771] Wed, 01 August 2012 04:49
Thomas 'PointedEars'
Tim Fardell wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Tim Fardell wrote:
>>> Thomas 'PointedEars' Lahn wrote:
>>>> Tim Fardell wrote:
>>>> > On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>>>> > <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>>> >> However, am I right in thinking that the strip_tags() function simply
>>>> >> assumes that any less-than character (<) occurring within a string is
>>>> >> the beginning of a tag?
>>>> >>
>>>> >> I hope I'm wrong, because that would be completely crap and useless
>>>> >> :-)
>>>> >
>>>> > […]
>>>> > I think I am correct that strip_tags assumes any '<' character to be
>>>> > the beginning of a tag -
>>>>
>>>> No, you are not. The function, at least as of PHP 5.3.10, is context-
>>>> sensitive:
>>>>
>>>> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
>>>> foo
>>>
>>> Possibly true,
>>
>> What do you mean – "possibly"? I tested it there.
>
> By "possibly" I mean that what you say may well be true, and I do not
> dispute that it is. All I am saying is that your example does not
> demonstrate it, and thus does not prove it.

A matter of interpretation.

>>> but that example does not demonstrate it.
>> But it does.
>
> No, it does not. A function that assumes every occurrence of the '<'
> character is the beginning of a tag, and removes it and all text from it
> up to and including the next '>' character in the string would yield
> exactly the same output. You have not demonstrated context sensitivity.

An example that proves my point and applies to your now-clarified assumption
can be easily created:

$ php -r "echo strip_tags('<a title=\'>\'>foo</a>');"
foo

If it were as you assumed, then the output would have to be something
similar to

'>foo

It is not.
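
The difference can be made explicit by comparing strip_tags() with a
deliberately naive stripper that removes everything from each `<' up to and
including the next `>'. This is only a sketch: naive_strip() is a
hypothetical helper made up for this comparison, not part of PHP, and the
strip_tags() result is the one shown in the test above.

```php
<?php
// Hypothetical helper (not a PHP built-in): removes everything from each
// '<' up to and including the next '>', ignoring quoted attribute values.
function naive_strip($s)
{
    return preg_replace('/<[^>]*>/', '', $s);
}

$input = "<a title='>'>foo</a>";

echo naive_strip($input), "\n";  // '>foo
echo strip_tags($input), "\n";   // foo
```

The naive version ends the tag at the `>' inside the attribute value;
strip_tags() does not.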

>> The markup is syntactically invalid in the first place
>
> What markup? If, as you suggest later on, and also implied in your
> previous post, the input text is supposed to be plain, unformatted text,
> which may unintentionally contain rogue HTML tags that should not be
> there, then the input text is unlikely to be syntactically correct markup
> anyway.

You still misunderstand. The input is obviously supposed to be markup,
containing tags; it needs to be parsed.

However, you have (well-)defined "HTML-encoded" as content where the "<" of
tags would be replaced with "&lt;". That would not be markup anymore as
markup requires at least one tag. (`&lt;' is _not_ a tag.)

>>> Definitely doesn't work in PHP 5.3.3, and according to php.net the
>>> function hasn't changed since 5.0.0.
>>
>> It would be prudent if you learned about SGML-based markup languages
>> before you attempted to pass judgement on the correctness of their
>> parsers.
>
> Actually I think what I have been saying all along is that the function
> behaves correctly if the input text is HTML.

No, you assumed the function would consider *all* `<'s to be a start of a
tag, which it obviously does not.

> I believe the function is supposed to take correct HTML as input, in which
> case it works exactly as I would expect.

If "correct" means *syntactically* correct, then you are right. However,
that does not mean that the function could not process syntactically invalid
markup correctly; the result just might not be what you expect, as there is
no specified definition of correctness with regard to parsing syntactically
invalid markup.

>>>> > But this doesn't actually matter, since the input string should be HTML
>>>> > encoded anyway, so all '<' characters should be escaped as '&lt;' - so
>>>> > all actual '<' characters will indeed be tags :-)
>>>>
>>>> You are not making sense. The *input* data should *never* be "HTML
>>>> encoded".
>
> Wait, didn't you just say the opposite?

No, I did not.

>>> Then I must have completely misunderstood something here. I thought the
>>> whole point of strip_tags() was to remove all HTML tags from the input
>>> string.
>>
>> Yes, HTML *tags*.
>
> Good, we agree on something.

But do you *really* know what an HTML tag is? It certainly does not look
like it.

>>> Therefore the input has to be HTML or the function is pointless.
>>
>> You have defined "HTML-encoded" above to mean that "<" would be "&lt;".
>> (Which is a common, and semantically correct definition of the term.)
>>
>> So HTML-*encoded* strings are _not_ HTML *markup*; by definition, there
>> are no tags in them.
>
> Agreed again.

So, AISB, *HTML-encoded strings* (which are a different animal from HTML
markup) are not supposed to be the input to this function or any form of
server-side PHP processing. You should not get HTML-encoded strings from
the client and you should not store HTML-encoded strings in your database.

Instead, you should store the plain markup, and HTML-encode it for output if
necessary. The best approach is, of course, not to store HTML markup at all
in a database (data storage should be independent of output) but that is not
always possible.
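
A minimal sketch of that approach, assuming a made-up input string and UTF-8
output (the variable names are illustrative only): the data is kept as
received and is HTML-encoded only when it is written into a document. The
last lines show that an HTML-encoded string gives strip_tags() nothing to
remove.

```php
<?php
// Assumed example data, kept exactly as it was received.
$stored = 'if (a < b && b > c) { ... }';

// Encode only at output time, when the text is written into HTML.
echo htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'), "\n";
// if (a &lt; b &amp;&amp; b &gt; c) { ... }

// An HTML-encoded string contains no tags, so it passes through unchanged.
$encoded = '&lt;b&gt;bold&lt;/b&gt;';
var_dump(strip_tags($encoded) === $encoded);  // bool(true)
```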

>>> Unless the idea is to remove rogue HTML tags from a plain text string,
>> ISTM that is the general idea.
>
> Oh dear, we're going to start disagreeing again here.

Define: rogue HTML tags.

>> Some HTML elements (which consist of start tag, optional content and
>> optional end tag, depending on the element type), are potentially
>> detrimental to the expected display and functionality of a
>> Web document. For example, consider that people were allowed to use
>> text-formatting elements in a blog comment, but were not allowed to
>> insert `script' elements (to avoid XSS) or `img' and `object' elements
>> (to avoid reduction of loading speed and interference with other
>> multimedia on the site); you would only list the text-formatting elements
>> in the function's second parameter then.
>>
>> Other people might want to remove HTML tags altogether, leaving only the
>> text content of the document (fragment).
>
> OK, so given no parameters other than the input string, the behaviour of
> strip_tags() is to remove all HTML tags from the input string. Correct?

Correct.
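
Both uses can be sketched as follows, assuming a made-up blog comment and an
arbitrary choice of allowed text-formatting elements:

```php
<?php
// Assumed example input: a comment with formatting, a script and an image.
$comment = '<b>Nice</b> post! <script>alert(1)</script><img src="x.png">';

// No second argument: all tags are removed, the text content is kept.
echo strip_tags($comment), "\n";
// Nice post! alert(1)

// Second argument lists the tags to keep; all other tags are removed.
echo strip_tags($comment, '<b><i><em><strong>'), "\n";
// <b>Nice</b> post! alert(1)
```

Note that only the tags are removed; the text content of a removed element
(here, that of the `script' element) stays in place.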

> You have just said that you believe the input string should be plain text,

No, I did not. You misunderstood, probably because you cannot tell what is
an HTML tag or what you mean by "HTML-encoded string".

>>> in which case my original point about it assuming all < symbols are tags
>>
>> The `<' character is an STAGO (STart Tag Open) delimiter in SGML-based
>> markup languages. It delimits a start tag on the left-hand side (a `>'
>> character delimits it on the right-hand side). Obviously, therefore it
>> cannot be a tag itself.
>
> Sorry but that is extremely pedantic.

That is a luser's attitude. Correct and unambiguous terminology is
paramount to understanding and learning. `<' is _not_ a tag; it is a tag
*delimiter*.

> I think it's obvious what I meant -

I do not think you know what you are talking about, so you are modifying
your terminology as you go (like, "a tag is `<…>'", "a tag is &lt;", "a tag
is `<'"). Of course, this is where misconceptions are created, which is very
common among beginners.

> Just to clarify, I meant to say "...in which case my original point about
> it assuming all < symbols *are the beginning of* tags..."

ACK.

>>> remains valid - it's crap and useless for this.
>> Evidently, you do not know what you are talking about.
>
> No need for rudeness is there?

Given your statements, it is a matter of fact. (A fact that can be changed
by you, and eventually *only* you.) If you cannot deal with that, please
stop wasting my time.

On a side note, you can tell a troll from a regular when you see the former
contributing nothing but ad hominem attacks and misinformation. Forged
address headers are also a strong indication of a troll. Any form of
continued anti-social behavior, really.

--
PointedEars