TheIvIaxx wrote:
> Hello, I am trying to do a simple Docx to HTML conversion. I'm using
> the XSLT provided by Sharepoint Server. This works for most things,
> but it doesn't handle lists at all, it puts them into <p> tags for
> some reason. and after looking at the xml in document.xml, the
> bullets are in their own w:p tags. This seems to be a problem as
> there is no identifying attribute saying where a list begins and ends
> or whether its a bullet list or a number list.
>
> Am i missing something here? This seems to be a basic feature to pull
> from a word doc.
This is a basic design feature in Word: it has no concept of
containment, so that it can allow authors to put anything anywhere. It
places "looking pretty" above "being right". All paragraph-level
elements are just paragraphs, identified by a style name or other
subelements for indentation, bulleting, etc., because that is all
that is required to format the document, and that's all that Word is
intended to do.
Interpreting and imposing structural containment on a flat data model is
one of the hardest tricks to do. One of the few big advantages of ODF
over OOXML and WordML is that it does put list items inside a list
container (ODF has a lot of other serious flaws, but that one they got
right).
The technique in XSLT goes something like this:
1. Write a template recognising the first item of a list as a w:p with a
list style, whose immediately preceding w:p does *not* have that style.
Call this element the list-starter, and create a variable set to its
unique ID.
2. Use that template to create the OL list container and output the
current w:p as its first LI.
3. Still in this template, follow that first LI by processing
out-of-line all immediately contiguous following siblings that have the
same style. You identify them by testing the unique ID of their closest
preceding list-start element (one which is not preceded by another list
element) against the variable containing the ID of the current element.
This enables you to omit list elements from lower down the document,
because *their* closest preceding list-start element will be a different
one from the one you are currently processing.
4. Write a simple LI template to output these "included" elements.
5. Write a template to match them when they are encountered in normal
document-order processing, and do nothing (otherwise they'll get output
a second time when the list has finished). That is, match list item
elements which *are* immediately preceded by another list item element.
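  <!-- Steps 1-3: the list-starter. The test is written as
       not(... = 'ListNumber') rather than != so that a preceding
       paragraph with no w:pStyle at all (or no preceding paragraph)
       also counts as "not a list item". -->
  <xsl:template
      match="w:p[w:pPr[w:pStyle[@w:val='ListNumber']]]
                [not(preceding-sibling::w:p[1]/w:pPr/w:pStyle/@w:val='ListNumber')]">
    <xsl:variable name="list-start" select="generate-id()"/>
    <ol>
      <li>
        <xsl:apply-templates/>
      </li>
      <!-- Following list items whose closest preceding list-starter
           is the current element -->
      <xsl:apply-templates mode="included"
          select="following-sibling::w:p
                    [w:pPr[w:pStyle[@w:val='ListNumber']]]
                    [generate-id(
                       preceding-sibling::w:p
                         [w:pPr[w:pStyle[@w:val='ListNumber']]]
                         [not(preceding-sibling::w:p[1]/w:pPr/w:pStyle/@w:val='ListNumber')][1])
                     =$list-start]"/>
    </ol>
  </xsl:template>

  <!-- Step 4: output an "included" list item -->
  <xsl:template mode="included"
      match="w:p[w:pPr[w:pStyle[@w:val='ListNumber']]]">
    <li>
      <xsl:apply-templates/>
    </li>
  </xsl:template>

  <!-- Step 5: suppress non-initial list items in normal processing -->
  <xsl:template
      match="w:p[w:pPr[w:pStyle[@w:val='ListNumber']]]
                [preceding-sibling::w:p[1]/w:pPr/w:pStyle/@w:val='ListNumber']"/>
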
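If XSLT is not a hard requirement, the same grouping is easier to see in a
procedural language. What follows is a minimal sketch in Python (standard
library only), not part of the stylesheet above: it walks the w:p children
of a body in document order and opens a new OL whenever a ListNumber
paragraph follows a non-list paragraph. The function names (style_of,
text_of, to_html) and the sample fragment are made up for illustration;
only the namespace URI is the standard WordprocessingML one.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def style_of(p):
    """Return the paragraph's style name, or None if it has no w:pStyle."""
    s = p.find("w:pPr/w:pStyle", NS)
    return s.get(f"{{{W}}}val") if s is not None else None


def text_of(p):
    """Concatenate the text of all w:t descendants of a paragraph."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


def to_html(body_xml, list_style="ListNumber"):
    """Turn a flat run of w:p elements into <p> and <ol><li> markup."""
    body = ET.fromstring(body_xml)
    out = []
    current = None  # the <ol> being filled, or None when outside a list
    for p in body.findall("w:p", NS):
        if style_of(p) == list_style:
            if current is None:  # this paragraph is a list-starter
                current = ET.Element("ol")
                out.append(current)
            ET.SubElement(current, "li").text = text_of(p)
        else:
            current = None  # any non-list paragraph closes the list
            para = ET.Element("p")
            para.text = text_of(p)
            out.append(para)
    return "".join(ET.tostring(el, encoding="unicode") for el in out)


# Hypothetical sample: two separate lists with a plain paragraph between.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="ListNumber"/></w:pPr><w:r><w:t>One</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="ListNumber"/></w:pPr><w:r><w:t>Two</w:t></w:r></w:p>
  <w:p><w:r><w:t>Between</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="ListNumber"/></w:pPr><w:r><w:t>Again</w:t></w:r></w:p>
</w:body>"""

print(to_html(SAMPLE))
```

The "current" variable plays the role of the list-start ID in the XSLT:
a later list further down the document gets its own OL because a
non-list paragraph has reset it in between.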
In reality it's much more complex:
a. This will fail on nested lists where the author has used the same
(outer) style name and manually indented the list margin. In theory you
could check for margin and/or indentation values...good luck with that :-)
b. If these are formal or important documents intended for long life or
reuse or reference, they should be created using a strict set of named
styles from a .dot template, checked over by a document engineer before
release. Better, get them out of Word into a dialect of XML that uses a
real-life Schema or DTD where they can be maintained and accessed reliably.
c. If the author has been allowed to fiddle with decorative bullets or
change the list-numbering from the default (eg arabic, roman, alpha,
etc) then you need to check for that and make the appropriate changes to
the HTML markup and/or CSS. In a real-world document there would be a
corporate style that says (eg) outer lists are numbered 1,2,3...,
second-level lists are lettered a,b,c..., and so on for specific types
of lists, so that your XSLT can ignore the author's decorative amendments.
d. Lists which include multiple paragraphs to an item are harder to
detect: look at the indentation level if it is not obvious from the
style name. The same applies to figures, tables, and other block-level
objects included inside a list.
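For point (a), here is a rough sketch of what "checking the indentation"
could look like, again in standard-library Python rather than XSLT. It
assumes (and this is a big assumption) that the author's manual indent is
recorded directly on each paragraph as w:ind/@w:left, in twips; in real
documents the indent may instead come from the style or the numbering
definition, which this ignores. It also assumes every w:p passed in is
already known to be a list item. The names left_indent and nested_list
are invented for the example.

```python
import html
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def left_indent(p):
    """Direct left indent in twips (w:ind/@w:left), or 0 if none is set."""
    ind = p.find("w:pPr/w:ind", NS)
    return int(ind.get(f"{{{W}}}left", "0")) if ind is not None else 0


def nested_list(body_xml):
    """Nest list items by ranking their distinct left-indent values."""
    body = ET.fromstring(body_xml)
    paras = body.findall("w:p", NS)
    indents = [left_indent(p) for p in paras]
    # Smallest indent becomes level 0, the next distinct value level 1, ...
    rank = {v: i for i, v in enumerate(sorted(set(indents)))}
    out, depth = [], -1
    for p, ind in zip(paras, indents):
        text = html.escape("".join(t.text or "" for t in p.iterfind(".//w:t", NS)))
        # Never descend more than one level at a time.
        level = min(rank[ind], depth + 1)
        if level > depth:
            out.append("<ol>")
            depth += 1
        else:
            while depth > level:  # close any deeper lists first
                out.append("</li></ol>")
                depth -= 1
            out.append("</li>")  # close the previous item at this level
        out.append(f"<li>{text}")
    while depth >= 0:
        out.append("</li></ol>")
        depth -= 1
    return "".join(out)


def _item(text, left):
    """Build one hypothetical list paragraph with a direct left indent."""
    return (f'<w:p><w:pPr><w:pStyle w:val="ListNumber"/><w:ind w:left="{left}"/>'
            f"</w:pPr><w:r><w:t>{text}</w:t></w:r></w:p>")


SAMPLE = (f'<w:body xmlns:w="{W}">' + _item("A", 720) + _item("A1", 1440)
          + _item("A2", 1440) + _item("B", 720) + "</w:body>")

print(nested_list(SAMPLE))
```

Ranking the distinct indent values, rather than comparing raw twips, is
one way to cope with authors who indent by arbitrary amounts; it will
still be fooled by an author who nudges a single item by a hair.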
How much restyling is needed depends on how much conformity to some
document standard is required, if any. In the la-la-land of office
wordprocessing, every author picks their own style and spends 50% of
their time fiddling with the font and colour menus.
* Such documents are likely to be transient or at a low level of
importance, otherwise they would be done to a set of rules. It is
therefore not worth wasting any effort trying to cope with the
infinite variation that they have spent the company's time on.
* Documents which *are* of importance are usually done more
rigorously, so there should be a set of rules you can apply. But
such documents are rarely done with a wordprocessor except at a very
early draft stage, unless the company is being very sloppy.
* Documents where the author is a designer, and where you are expected
to make as faithful a copy as possible, will require extreme
attention to every subelement in the w:p, so that your indentation,
bulleting or numbering, alignment, font selection, justification,
etc are sedulously reflected in the HTML or CSS. Fortunately no
professional designer in her right mind would use Word as a page
design tool.
You can use a similar technique to the one above to detect the levels of
heading in a document (Heading1, Heading2, etc) and all their following
w:p's up to but not including the next heading, and put them into a DIV,
nesting them according to implied level. This can reveal otherwise
hidden structural flaws in the author's construction, for example,
following a Heading1 with a Heading3 because it looked prettier :-)
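
As an illustration of the heading case, here is a small stack-based
sketch in Python (standard library only). It is not the XSLT version,
just the same idea in a form that is quick to test. It assumes the
paragraphs have already been reduced to (style name, text) pairs and
that heading styles are literally named Heading1, Heading2, and so on;
nest_sections and the class names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"Heading(\d+)$")


def nest_sections(paras):
    """Wrap each heading and its following paragraphs in a nested <div>.

    paras: iterable of (style_name_or_None, text) in document order.
    Returns (root_div, warnings), where warnings lists skipped levels.
    """
    root = ET.Element("div")
    stack = [(0, root)]  # (heading level, div element), outermost first
    warnings = []
    for style, text in paras:
        m = HEADING.match(style or "")
        if m:
            level = int(m.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            parent_level, parent = stack[-1]
            if level > parent_level + 1:
                warnings.append(
                    f"Heading{level} '{text}' follows level {parent_level}")
            div = ET.SubElement(parent, "div", {"class": f"level{level}"})
            ET.SubElement(div, f"h{min(level, 6)}").text = text
            stack.append((level, div))
        else:
            # Ordinary paragraph: goes into the innermost open section.
            ET.SubElement(stack[-1][1], "p").text = text
    return root, warnings


# Hypothetical document with a Heading1 followed directly by a Heading3.
DOC = [
    ("Heading1", "Intro"),
    (None, "Hello"),
    ("Heading3", "Detail"),
    (None, "x"),
    ("Heading1", "Next"),
]

root, warnings = nest_sections(DOC)
print(ET.tostring(root, encoding="unicode"))
print(warnings)
```

The warnings list is where the structural flaws show up: the Heading3
is still nested inside the Heading1 section, but the skipped level is
reported rather than silently papered over.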
///Peter