Subject: Re: UKNM: Repurposing text - thanks!
From: Carol Dukes
Date: Tue, 10 Aug 1999 14:58:13 +0100

Many thanks for all the contacts and help on repurposing text from Quark to
anything-but-Quark. I'll let the list know what we go with and how it works.


From: Jamie Unwin <jamie [dot] unwinatguardian [dot] co [dot] uk>
To: <uk-netmarketingatchinwag [dot] com>
Sent: Friday, August 06, 1999 11:04 AM
> The biggest problem with extracting copy from Quark (or any DTP package) is
that a newspaper or magazine page isn't "marked up" in a formal way (unless
you use an old SGML-style page layout system).
> I.e. a designer will draw a box for a headline, then another box for the
body and another for a picture. Even worse, they may decide to use two boxes
for the body (over pages 2 & 3). The designer is more concerned with layout
than markup, and who can really blame them, as their job is to lay the page
out, not to tag it?
> Therefore when you export your page the elements will all be separate, i.e.
with no relationships between them. This is fine for a book or a page of copy
where the document has a style sheet and the headline and body use different
styles: this allows you to separate the different page elements on export
(and then add them to your database or wrap them in XML or HTML tags).
However, with a complex layout you will have many elements with the same
style, e.g. 3 headlines, 3 bodies and 3 subheads, and you are left with the
problem of associating the correct headlines with the correct bodies. You
could use proximity, but this isn't always reliable. The best method is to
write a Quark plugin that lets a manual operator tie together the various
elements.
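
To illustrate why proximity alone can go wrong, here is a minimal Python
sketch. The Box class, the coordinates and the sample stories are all made up
for the example (nothing here comes from Quark itself); it simply pairs each
body box with the nearest headline box by distance between top-left corners.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    kind: str   # "headline" or "body" (hypothetical labels)
    text: str
    x: float    # left edge of the box on the page
    y: float    # top edge of the box on the page


def nearest_headline(body, headlines):
    """Return the headline whose top-left corner is closest to the body's."""
    return min(headlines, key=lambda h: hypot(h.x - body.x, h.y - body.y))


def pair_by_proximity(boxes):
    """Pair every body box with its nearest headline box."""
    headlines = [b for b in boxes if b.kind == "headline"]
    bodies = [b for b in boxes if b.kind == "body"]
    return [(nearest_headline(b, headlines).text, b.text) for b in bodies]


page = [
    Box("headline", "Council approves budget", 0, 0),
    Box("body", "The council voted last night...", 0, 20),
    Box("headline", "Rovers win cup", 200, 0),
    Box("body", "A late goal settled the final...", 200, 20),
    # Continuation of the budget story, placed under the other column.
    Box("body", "...the budget takes effect in April.", 200, 120),
]

pairs = pair_by_proximity(page)
for headline, body in pairs:
    print(headline, "<-", body)
```

In this made-up layout the continuation of the budget story sits under the
football column, so the nearest-headline rule attaches it to the wrong story.
That is the kind of case where an operator still has to step in.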
> We do exactly this with a "color" tagging tool, i.e. the page is brought up
on screen and someone assigns one color to all elements of a story and
another color to another story. We try to make this as automatic as possible
to reduce the manual intervention. Unfortunately I can't give you the name of
a company, as we wrote our tool in house. We then export this copy as XML and
finally import it into our content database.
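
As a rough idea of what the step after color tagging might look like, here is
a small Python sketch. It is not the in-house tool described above: the
tuple layout, the color names and the XML element names (page, story,
headline, body) are all assumptions for illustration. It groups boxes that
share a color into one story and writes the result as XML using the standard
library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical operator output: (color, element role, text) per box.
tagged_boxes = [
    ("red", "headline", "Council approves budget"),
    ("blue", "headline", "Rovers win cup"),
    ("red", "body", "The council voted last night..."),
    ("blue", "body", "A late goal settled the final..."),
    ("red", "body", "...the budget takes effect in April."),
]


def stories_to_xml(boxes):
    """Group boxes by color and build one <story> element per color."""
    by_color = defaultdict(list)
    for color, role, text in boxes:
        by_color[color].append((role, text))

    page = ET.Element("page")
    for color, elements in by_color.items():
        story = ET.SubElement(page, "story", {"color": color})
        for role, text in elements:
            # The role ("headline" or "body") becomes the element name.
            ET.SubElement(story, role).text = text
    return page


root = stories_to_xml(tagged_boxes)
print(ET.tostring(root, encoding="unicode"))
```

Because the grouping key is the color rather than position on the page, the
two body boxes of the budget story end up in the same story element even
though they were separate boxes in the layout.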
> If your copy (i.e. stories) is stored as marked up text, or at least uses
style sheets, before it is laid out, then this is the stage to do the
extraction (and life will be simple!). Unfortunately tweaks are often made to
the copy at the last stage (i.e. the layout stage); if these are legal
changes then you will have to export the copy at the layout stage instead.
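
When the copy does carry style sheet names, the conversion can be close to a
lookup table. The Python sketch below assumes the story has already been
exported as (style name, text) pairs; the style names and the mapping to HTML
tags are invented for the example, not taken from any particular system.

```python
from html import escape

# Hypothetical mapping from style-sheet names to HTML tags.
STYLE_TO_TAG = {"Headline": "h1", "Subhead": "h2", "Body": "p"}


def story_to_html(paragraphs):
    """Convert (style_name, text) pairs, in reading order, to HTML."""
    parts = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, "p")  # unknown styles fall back to <p>
        parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(parts)


story = [
    ("Headline", "Rovers win cup"),
    ("Subhead", "Late goal settles final"),
    ("Body", "Fans & players celebrated."),
]

html_text = story_to_html(story)
print(html_text)
```

The same table could just as easily point at XML element names instead of
HTML tags, which keeps the stored copy independent of the output format.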
> If all you want is an HTML representation of your page (and your page is
simple) you could just reproduce the page using HTML tables (which is what
Beyond Press tries to do). However, in my experience this produces very
unsatisfactory results and is certainly not future proof (i.e. you are stuck
with HTML).
> The bottom line (and I have been faced with this problem on at least two
large Quark to web projects) is that it's not simple and there is no 'real'
off the shelf solution. By constructing rules in your extraction process you
can make the process require less and less manual intervention, but at the
end of the day you are trying to convert layout to markup, which isn't a
very easy task to automate.
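
One way to picture "rules plus manual fallback" is sketched below in Python.
The page dictionaries, the single example rule and the function names are all
hypothetical. Each rule either returns a headline-to-bodies mapping or gives
up, and any page that no rule can handle is queued for an operator.

```python
def single_headline_rule(page):
    """If a page has exactly one headline, all bodies belong to it."""
    if len(page["headlines"]) == 1:
        return {page["headlines"][0]: list(page["bodies"])}
    return None  # rule does not apply


# Add more rules here as recurring layout patterns are identified.
RULES = [single_headline_rule]


def extract(pages):
    """Split pages into those resolved by a rule and those needing an operator."""
    automatic, manual = [], []
    for page in pages:
        for rule in RULES:
            result = rule(page)
            if result is not None:
                automatic.append(result)
                break
        else:
            manual.append(page)  # no rule applied: send to an operator
    return automatic, manual


pages = [
    {"headlines": ["Rovers win cup"],
     "bodies": ["A late goal...", "Fans celebrated..."]},
    {"headlines": ["Budget approved", "Road closed"],
     "bodies": ["Text A", "Text B"]},
]

automatic, manual = extract(pages)
print(len(automatic), "page(s) handled by rules,", len(manual), "for manual tagging")
```

Each rule added to the list should shrink the manual queue a little, which is
the sense in which the process needs less intervention over time.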
> These companies all claim to make tools which extract Quark to HTML (or
XML/SGML or CSV); however, I would evaluate them VERY thoroughly before
parting with any money to see if they really do suit your needs.
> Atomic XT http://www.atomik-xt.com/
> Beyond Press http://www.astrobyte.com/
> Inso http://www.inso.com/
> Jamie
> >Does anyone know of either any applications, or of any agencies, for
> >repurposing fairly large quantities of QuarkXPress (and possibly
> >PageMaker) files into something more like HTML?
> Jamie Unwin : jamie [dot] unwinatguardian [dot] co [dot] uk : http://www.guardian.co.uk
> The Guardian : 3-7 Ray Street, London, EC1R 3DJ : Tel. 0171 713 4469

