Geeklog Forums

Geeklog Handling of Two-Byte Characters in RSS feeds

05/08/09 09:37am

alank

Anonymous

The Geeklog installation in question is the latest security release of 1.5.2, but the issue has also been present in earlier versions.

What Geeklog calls Content Syndication has a character limit; the default limit is 150 characters. Geeklog seems to treat one character as one byte. However, a character set can have characters that are two bytes long, so truncating at a byte count can split such a character in half. A broken character tends to break feed readers.

Perhaps there's a way to change a feed reader so that it can manage the issue on the client side. However, I am wondering if there's something that could be done on the server side to prevent Geeklog from serving feeds with broken characters.

Any thoughts? Thanks.

05/08/09 04:53pm

alank

Anonymous

Right, so I didn't count on a response, tho, I was hoping for one. I came up with a quick and dirty fix as follows, editing the lib-syndication.php file in the backend systems folder:

Text Formatted Code

function SYND_truncateSummary( $text, $length )
{
    if( $length == 0 )
    {
        return '';
    }
    else
    {
        $text = stripslashes( $text );
        $text = trim( $text );
        $text = preg_replace( "/(\015)/", "", $text );
        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = substr( $text, 0, $length - 3 ) . '...';
        }

        // Check if we broke html tag and storytext is now something
        // like "blah blah <a href= ...". Delete "<*" if so.
        if( strrpos( $text, '<' ) > strrpos( $text, '>' ))
        {
            $text = substr( $text, 0, strrpos( $text, '<' ) - 1 )
                  . ' ...';
        }

        $text = substr($text, 0, strrpos($text, ' '));//added code

        return $text;
    }
}

The "added code" near the bottom simply finds the last space in the text and lops off everything to the right of it.

The assumption is that the first damaged character is always going to be to the right of the last space in the string.
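
To make the effect of the last-space cut concrete, here is a minimal sketch, assuming UTF-8 text. The sample string and the helper names is_valid_utf8 and cut_at_last_space are made up for illustration and are not part of Geeklog.

```php
<?php
// Returns true when $s is well-formed UTF-8 (the /u modifier rejects bad input).
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper mirroring the "added code": cut at the last space.
function cut_at_last_space(string $s): string
{
    $pos = strrpos($s, ' ');
    // Without any space there is no safe place to cut, so nothing is kept.
    return $pos === false ? '' : substr($s, 0, $pos);
}

// Sample text (an assumption): each Japanese character is 3 bytes in UTF-8.
$text = 'RSS フィード の テスト';

// Byte-based truncation keeps "RSS " plus 2 of the 3 bytes of the next character.
$cut = substr($text, 0, 6);
var_dump(is_valid_utf8($cut));            // bool(false)

// Cutting at the last space drops the damaged tail.
$fixed = cut_at_last_space($cut);
var_dump($fixed);                         // string(3) "RSS"
var_dump(is_valid_utf8($fixed));          // bool(true)
```

The sketch also shows the weak spot of the assumption: text written without spaces, as is common in Japanese or Chinese, has no last space to cut at, so the helper keeps nothing.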

05/08/09 05:01pm

alank

Anonymous

Oh, I forgot to add back the dots:

Text Formatted Code
$text = substr($text, 0, strrpos($text, ' ')) . ' ...'; //added code

05/08/09 05:45pm

Dirk

Site Admin

Registered: 01/12/02

Posts: 13073

Location:Stuttgart, Germany

Quote by: alank

Right, so I didn't count on a response, tho, I was hoping for one.

Whoa. At least give me a chance to understand the problem first. It wasn't clear (to me at least) what you were talking about. There is no 150 character limit in the Content Syndication, for example ...

I guess the proper solution would be to use our MBYTE_substr string function there.

Btw, you may want to report bugs in our bugtracker. Things can easily get lost in the forum traffic.

bye, Dirk
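
For readers following along, the idea of a multibyte-aware substring can be sketched as below. This is not Geeklog's MBYTE_substr; utf8_length, utf8_head and truncate_summary are hypothetical names, and the sketch assumes UTF-8 text. It uses PHP's mbstring functions when they are loaded and falls back to a UTF-8 aware regular expression otherwise.

```php
<?php
// Number of characters (code points) in a UTF-8 string.
function utf8_length(string $text): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }
    // Under /u each "." match is one code point; /s lets "." match newlines too.
    return (int) preg_match_all('/./us', $text);
}

// First $count characters of a UTF-8 string, never splitting a character.
function utf8_head(string $text, int $count): string
{
    if (function_exists('mb_substr')) {
        return mb_substr($text, 0, $count, 'UTF-8');
    }
    preg_match('/^.{0,' . $count . '}/us', $text, $m);
    return $m[0] ?? '';
}

// Hypothetical character-based version of the truncation step.
function truncate_summary(string $text, int $length): string
{
    if ($length > 3 && utf8_length($text) > $length) {
        return utf8_head($text, $length - 3) . '...';
    }
    return $text;
}

// Ten characters, limit of eight: five characters are kept, then the dots.
echo truncate_summary('日本語のテキストです', 8), "\n";   // 日本語のテ...
```

Because both the length check and the cut count characters rather than bytes, the result stays valid UTF-8 wherever the limit falls.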

05/08/09 06:14pm

alank

Anonymous

Thanks for getting back. I thought there was a default limit (more like a suggested size). In any event, the limit is arbitrary and can split multibyte characters apart, which makes the feed invalid. I spent most of my time trying to work out a solution on the client side, not wanting to mess around with the Geeklog code, but that turned out to be much more difficult: it needs some kind of preprocessing, which seemed more trouble than it was worth. I would be interested in seeing how the mbyte thingy might be used to resolve this in a more elegant way.

Sorry for not submitting a bug report. I wasn't sure it was a bug.

05/08/09 07:58pm

alank

Anonymous

Yes, so as suggested, this seems to be quite the ticket,

Text Formatted Code
        $text = stripslashes( $text );
        $text = trim( $text );
        $text = strip_tags( $text );
        $text = preg_replace( "/(\015)/", "", $text );
        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = MBYTE_substr( $text, 0, $length - 3 ) . '...';
        }

I added the strip_tags bit to get better results in the face of image tag content which eats up characters, and the rest.

05/09/09 10:51am

Dirk

Quote by: alank

I added the strip_tags bit to get better results in the face of image tag content which eats up characters, and the rest.

Actually, the result should contain HTML tags, so that's on purpose.

Looks like a proper fix is a bit more complex, as it also requires a working MBYTE_strrpos function. See if you can make sense of this changeset or wait for 1.6.0b2 ...

bye, Dirk
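
One way to see why the broken-tag check needs care: strrpos returns a byte offset, while a multibyte substring function counts characters, so the two must not be mixed. The sketch below is not the changeset Dirk refers to. It assumes UTF-8, where "<" is a single byte that never occurs inside a multibyte character, so a purely byte-based pair of strrpos and substr is also safe. The name strip_broken_tag is hypothetical.

```php
<?php
// Hypothetical helper: remove a trailing, unclosed HTML tag left by truncation.
function strip_broken_tag(string $text): string
{
    $lt = strrpos($text, '<');   // byte offset of the last "<", or false
    $gt = strrpos($text, '>');   // byte offset of the last ">", or false

    // A tag is unclosed when the last "<" comes after the last ">" (or no ">" exists).
    if ($lt !== false && ($gt === false || $lt > $gt)) {
        // Byte offset used with byte-based substr: the units match.
        return rtrim(substr($text, 0, $lt)) . ' ...';
    }
    return $text;
}

$text = '日本語 <a href="http://exam';

// Byte offset versus character offset of the same "<".
var_dump(strrpos($text, '<'));                                            // int(10)
var_dump(preg_match_all('/./us', substr($text, 0, strrpos($text, '<')))); // int(4)

echo strip_broken_tag($text), "\n";   // 日本語 ...
```

Passing the byte offset 10 to a character-based substring would keep ten characters instead of four and leave the broken tag in place, which is the kind of mismatch a multibyte strrpos avoids.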

05/10/09 12:10pm

alank

Anonymous

Quote by: Dirk

Quote by: alank

I added the strip_tags bit to get better results in the face of image tag content which eats up characters, and the rest.

I think I'll return the strip_tags to the client side.

The revised code seems alright, but I wonder what happens if the $mb_enabled value returned by MBYTE_checkEnabled() is false, even though the character encoding is actually utf-8 (or some other multibyte encoding). Would it be useful to use mb_check_encoding for this purpose? Or maybe mb_detect_encoding?
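
As a rough illustration of the difference: mb_check_encoding validates a string against one named encoding, whereas mb_detect_encoding guesses from a list of candidates. Both belong to the mbstring extension, so neither is available when that extension is not loaded. The sketch below uses a hypothetical helper name, looks_like_utf8, and falls back to PCRE's /u modifier in that case.

```php
<?php
// Hypothetical helper: is $text well-formed UTF-8?
function looks_like_utf8(string $text): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($text, 'UTF-8');
    }
    // Fallback without mbstring: a /u pattern fails to match invalid UTF-8.
    return preg_match('//u', $text) === 1;
}

var_dump(looks_like_utf8("caf\xC3\xA9"));   // bool(true):  "é" encoded as UTF-8
var_dump(looks_like_utf8("caf\xE9"));       // bool(false): "é" encoded as Latin-1
var_dump(looks_like_utf8('plain ascii'));   // bool(true):  ASCII is valid UTF-8
```

A check like this only says that the bytes are consistent with UTF-8. It cannot prove which encoding the site is configured to use, and it does not make byte-based string functions safe.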

05/10/09 02:08pm

alank

Anonymous

I realized that it was important to strip the tags before the length was applied, so that the visible text would fit into the feed description. That can be achieved only within Geeklog. So, without altering the $text value, I was able to get what I needed this way:

Text Formatted Code

        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = MBYTE_substr( strip_tags( $text ), 0, $length - 3 ) . '...';
        }

That way the characters inside html tags would not be counted.

05/10/09 02:28pm

alank

Anonymous

Quote by: alank

Text Formatted Code

        if(( $length > 3 ) && ( MBYTE_strlen( $text ) > $length ))
        {
            $text = MBYTE_substr( strip_tags( $text ), 0, $length - 3 ) . '...';
        }

That way the characters inside html tags would not be counted.

I take that back. It did alter the $text variable. Crumbs.
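
One possible way to count only visible characters while keeping the tags is to walk through the text piece by piece. The sketch below is an illustration under assumptions, not Geeklog code: truncate_visible is a hypothetical name, the input is assumed to be valid UTF-8, and tags still open at the cut are not closed.

```php
<?php
// Hypothetical helper: keep at most $limit visible characters, leaving tags intact.
function truncate_visible(string $html, int $limit): string
{
    // Split into tags and text runs; the captured tags are kept as pieces.
    $pieces = preg_split('/(<[^>]*>)/u', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    if ($pieces === false) {
        return '';   // invalid UTF-8 input: nothing safe to return
    }

    $out  = '';
    $used = 0;       // visible characters emitted so far
    foreach ($pieces as $piece) {
        if (preg_match('/^<[^>]*>$/', $piece) === 1) {
            $out .= $piece;              // tags do not count toward the limit
            continue;
        }
        // Split the text run into single characters (code points).
        $chars = preg_split('//u', $piece, -1, PREG_SPLIT_NO_EMPTY);
        if ($used + count($chars) > $limit) {
            // Keep only what still fits, add the dots, and stop.
            return $out . implode('', array_slice($chars, 0, $limit - $used)) . '...';
        }
        $out  .= $piece;
        $used += count($chars);
    }
    return $out;
}

echo truncate_visible('<img src="a.png">日本語のテキスト', 3), "\n";
// <img src="a.png">日本語...
```

Here the image tag survives in full and only the three visible characters count against the limit, so the original $text is never run through strip_tags.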
