
Subject: Re: UKNM: Repurposing text
From: Jamie Unwin
Date: Fri, 6 Aug 1999 13:05:47 +0100


The biggest problem with extracting copy from Quark (or any DTP package) is that a newspaper or magazine page isn't "marked up" in a formal way (unless you use an old SGML-style page layout system).

ie. a designer will draw a box for a headline, then make another box for the body and another for a picture. Even worse, they may decide to use 2 boxes for the body (over pages 2 & 3). The designer is more concerned with layout than markup, and who can really blame them, as their job is to lay the page out, not to tag it?

Therefore when you export your page the elements will all be separate, ie. there are no relationships between them. This is fine for a book or a page of copy, where the document will have a style sheet and the headline and body will use different styles. That lets you separate the different page elements on export (and then add them to your database or wrap them in XML or HTML tags). However, with a complex layout you will have many elements with the same style, ie. 3 headlines, 3 bodies and 3 subheads, and you are left with the problem of associating the correct headlines with the correct bodies. You could use proximity, but this isn't always reliable. The best method is to write a Quark plugin that allows a manual operator to tie together the various elements.
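To show why proximity on its own can go wrong, here is a minimal Python sketch. It is only an illustration: the `Box` record, its coordinates and the `pair_by_proximity` function are made up for the example and do not correspond to any real Quark export format or API. Each body box is paired with the nearest headline box, and a continuation column that has been laid out under a different story ends up attached to the wrong headline.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """Hypothetical exported page element: a style name, a position and its text."""
    style: str
    x: float
    y: float
    text: str


def pair_by_proximity(boxes):
    """Map each body's text to the text of the nearest headline (straight-line distance)."""
    headlines = [b for b in boxes if b.style == "headline"]
    bodies = [b for b in boxes if b.style == "body"]
    pairs = {}
    for body in bodies:
        nearest = min(headlines, key=lambda h: hypot(h.x - body.x, h.y - body.y))
        pairs[body.text] = nearest.text
    return pairs


# Invented layout: story B's second column wraps underneath story A.
boxes = [
    Box("headline", 0, 0, "Headline A"),
    Box("body", 0, 20, "Body A"),
    Box("headline", 100, 0, "Headline B"),
    Box("body", 100, 20, "Body B, first column"),
    Box("body", 40, 60, "Body B, continued"),
]

result = pair_by_proximity(boxes)
for body_text, headline_text in result.items():
    print(f"{body_text!r} -> {headline_text!r}")
```

In this made-up layout the continuation of story B sits closer to Headline A than to Headline B, so the heuristic mis-assigns it. This is the kind of case where a manual operator has to step in.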

We do exactly this with a "color" tagging tool, ie. the page is brought up on screen and someone assigns one color to all the elements of a story and another color to the next story. We try to make this as automatic as possible to reduce the manual intervention. Unfortunately I can't give you the name of a company, as we wrote our tool in house. We then export this copy as XML and finally import it into our content database.
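As a rough illustration of the export step once the elements have been color tagged, here is a short Python sketch. It is not our in-house tool: the `(color, style, text)` tuples, the `stories_to_xml` function and the `page`/`story` element names are all assumptions made up for the example. It groups elements by their assigned color and wraps each group in one story element, using the style name as the tag for each part.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def stories_to_xml(elements):
    """Group (color, style, text) tuples by color and emit one <story> per color.

    The style name is used directly as the child element name, so it is
    assumed to be a valid XML name such as "headline" or "body".
    """
    by_color = defaultdict(list)
    for color, style, text in elements:
        by_color[color].append((style, text))

    page = ET.Element("page")
    for color, parts in by_color.items():
        story = ET.SubElement(page, "story", {"tag": color})
        for style, text in parts:
            ET.SubElement(story, style).text = text
    return ET.tostring(page, encoding="unicode")


# Invented page: two stories whose boxes are interleaved in export order.
elements = [
    ("red", "headline", "Headline A"),
    ("blue", "headline", "Headline B"),
    ("red", "body", "Body A"),
    ("blue", "body", "Body B, first column"),
    ("blue", "body", "Body B, continued"),
]

xml_text = stories_to_xml(elements)
print(xml_text)
```

Because the grouping key is the color chosen by the operator rather than position or style, two body boxes for the same story stay together however far apart they sit on the page.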

If your copy (ie. stories) is stored marked up (or at least using style sheets) before it is laid out, then this is the stage to do the extraction (and life will be simple!). Unfortunately, tweaks are often made to the copy at the last stage (ie. the layout stage). If these are changes made for legal reasons, then you will have to export the copy at that stage instead.

If all you want is an HTML representation of your page (and your page is simple), you could just reproduce the page using HTML tables (which is what Beyond Press tries to do). However, in my experience this produces very unsatisfactory results and is certainly not future proof (ie. you are stuck with HTML).

The bottom line (and I have been faced with this problem on at least two large Quark to web projects) is that it's not simple and there is no 'real' off the shelf solution. By constructing rules in your extraction process you can make it require less and less manual intervention, but at the end of the day you are trying to convert layout to markup, which isn't an easy task to automate.

The following vendors all claim to make tools which extract Quark to HTML (or XML/SGML or CSV). However, I would evaluate them VERY thoroughly before parting with any money, to see if they really do suit your needs.

Atomic XT http://www.atomik-xt.com/
Beyond Press http://www.astrobyte.com/
Inso http://www.inso.com/


>Does anyone know of either any applications, or of any agencies, for
>repurposing fairly large quantities of Quark Express (and possibly
>Pagemaker) files into something more like HTML?
>Any thoughts greatly appreciated.
>Carol Dukes
>carolatbtinternet [dot] com

Jamie Unwin : jamie [dot] unwinatguardian [dot] co [dot] uk : http://www.guardian.co.uk
The Guardian : 3-7 Ray Street, London, EC1R 3DJ : Tel. 0171 713 4469
