Problems with MS Word HTML formatting

  • Problems with MS Word HTML formatting

    Hi, I'm importing from dotNetVB 2.4.2. My current version of VB is 3.8.2. All in all, the import worked great.

    However, I have a handful of posts where the user obviously copied/pasted from MS Word into dotNetVB, and the full post didn't import. In dotNetVB, the post looks like a formatted Word doc. In VB, it appears some of the garbage MS formatting was converted to smilies (see attachment). The attachment shows a small piece of one post. The smilies appear throughout the post.

    The actual source of the text in the attachment is this:
    Code:
    </o<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />><br />
    <br />
    <br />
    <br />
    <br />
    <v:shapetype id=_x0000_t75 stroked="f" filled="f" path="[email protected]@[email protected]@[email protected]@[email protected]@5xe" o<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />referrelative="t" o:spt="75" coordsize="21600,21600"><v:stroke joinstyle="miter"></v:stroke><v:formulas><v:f eqn="if lineDrawn pixelLineWidth 0"></v:f><v:f eqn="sum @0 1 0"></v:f><v:f eqn="sum 0 0 @1"></v:f><v:f eqn="prod @2 1 2"></v:f><v:f eqn="prod @3 21600 pixelWidth"></v:f><v:f eqn="prod @3 21600 pixelHeight"></v:f><v:f eqn="sum @0 0 1"></v:f><v:f eqn="prod @6 1 2"></v:f><v:f eqn="prod @7 21600 pixelWidth"></v:f><v:f eqn="sum @8 21600 0"></v:f><v:f eqn="prod @7 21600 pixelHeight"></v:f><v:f eqn="sum @10 21600 0"></v:f></v:formulas><v<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />ath o:connecttype="rect" gradientshapeok="t" o:extrusionok="f"></v<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />ath><o:lock aspectratio="t" v:ext="edit"></o:lock></v:shapetype></o<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />><br />
    09-03</o<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />><br />
    January 19, 2009</o<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />><br />
    </o<img src="images/smilies/tongue.gif" border="0" alt="" title="Stick Out Tongue" class="inlineimg" />><br />
    Any suggestions on how to handle this? I also have a bigger problem, which (from what I've found so far) typically occurs in these same MS copy/paste posts: the posts are cut off. Is there a character limit on posts? Could the MS formatting be counting toward this limit? The odd thing is where the cutoffs happen. I might expect the parser to die in the middle of the garbage Word formatting, but the posts are typically cut off in the middle of sentences.

    For example, one post dies on this sentence, which has no special formatting:

    The winning proposals will each receive a $10,000 budget for pursuing their concepts under the guidance of an a[CUTOFF POINT HERE]ssigned technical mentor.

    Based on sorting through the source, the post has 13,691 chars in dotNetVB but only 3,326 in VB. I could live with manually cleaning up formatting if I had to but these cutoff posts are a big problem.
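    To find every affected post rather than spotting them by eye, a length comparison like the one above could be scripted. This is only a rough sketch: it assumes the post bodies from both boards have already been exported into two dicts keyed by a shared post ID, and the names `find_truncated` and `min_ratio` are made up for illustration, not part of ImpEx or either forum package.

```python
def find_truncated(source_posts, imported_posts, min_ratio=0.5):
    """Flag posts whose imported text is much shorter than the source text.

    source_posts / imported_posts: dicts mapping post ID -> post body (str).
    min_ratio: hypothetical threshold; a post is flagged when the imported
    length is below this fraction of the source length.
    """
    flagged = []
    for post_id, src in source_posts.items():
        dst = imported_posts.get(post_id, "")
        if src and len(dst) < len(src) * min_ratio:
            flagged.append((post_id, len(src), len(dst)))
    return flagged


if __name__ == "__main__":
    # Toy data using the character counts quoted in the post.
    source = {1: "a" * 13691, 2: "hello"}
    imported = {1: "a" * 3326, 2: "hello"}
    for post_id, src_len, dst_len in find_truncated(source, imported):
        print(f"post {post_id}: {src_len} chars in source, {dst_len} after import")
```

    The imported text will normally differ a little in length because of markup conversion, so the threshold is deliberately loose and would need tuning.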

  • #2
    The regex to clean up all of the Microsoft garble would be mind-blowing.

    Unless you can take it out in the source, it would likely be quicker to do them all by hand.
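    If the source can be edited before the import, a much narrower pattern may already cover the smilie problem. In the sample above, the smilie images sit exactly where `:p` would appear in Word's namespaced tags (`<o:p>`, `<v:path>`, `o:preferrelative`), so removing those tags from the source text should leave nothing for the smilie parser to match. The sketch below is an assumption-laden illustration, not part of ImpEx: `strip_office_tags` is a hypothetical helper, it only handles tags in the `o:`, `v:` and `w:` namespaces, and it does not touch the other styling Word adds.

```python
import re

# Opening, closing, and self-closing tags in the Office namespaces (o:, v:, w:).
# Assumes attribute values never contain a literal ">".
OFFICE_TAG = re.compile(r"</?(?:o|v|w):[^>]*>", re.IGNORECASE)


def strip_office_tags(html):
    """Remove Word's namespaced tags while keeping the text between them."""
    return OFFICE_TAG.sub("", html)


if __name__ == "__main__":
    sample = (
        '<o:p></o:p><br />'
        '<v:stroke joinstyle="miter"></v:stroke>'
        'January 19, 2009<o:p></o:p><br />'
    )
    print(strip_office_tags(sample))
```

    This only addresses the smilies. It does nothing about the truncated posts, whose cause is not established in this thread.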
    I wrote ImpEx.
