Bug 104623 - Add support for Publisher 97-98 file format
Status: NEW
Alias: None
Product: Document Liberation Project
Classification: Unclassified
Component: libmspub
Version (earliest affected): unspecified
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-12-13 04:27 UTC by Christoph Schäfer
Modified: 2018-07-29 11:07 UTC
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
Publisher 98 test files + PDFs (347.27 KB, application/x-7z-compressed)
2016-12-13 04:27 UTC, Christoph Schäfer
the current results (104.28 KB, application/zip)
2018-07-29 10:40 UTC, osnola

Description Christoph Schäfer 2016-12-13 04:27:41 UTC
Created attachment 129559
Publisher 98 test files + PDFs

I have attached two PUB files, each with a PDF version for comparison. The filter currently doesn't recognise any graphical elements, only text, and the text is badly placed and formatted because no font substitution is applied on import. Scribus does slightly better but doesn't find any graphical elements either.
Comment 1 Christoph Schäfer 2016-12-13 04:29:45 UTC
I forgot: Both LO and Scribus add an empty page at the beginning of the imported doc, where in the original and the PDF there is none.
Comment 2 Aron Budea 2016-12-13 23:23:42 UTC
Let this bug report be about "pub98t1.pub" in the archive. Please open a new one for the other file.
It's probably not worth further separating all the different bugs into different bug reports at this point, but at least let's have separate ones for the separate files.

I can confirm there are many different issues with the file, tested with LibreOffice 5.3beta2 / Windows 7.

I leave it to the person who makes the fixes to decide when it's worth closing the report and tracking the remaining issues separately.
Comment 3 Christoph Schäfer 2016-12-14 05:00:49 UTC
Both files suffer from the same major problems: missing graphics elements and an added extra page at the beginning. The text layout is suffering from a lack of font substitution and probably missing support for the concept of text frames. Scribus 1.5.3svn gets the latter part right, but LibreOffice doesn't.
Comment 4 Telesto 2016-12-14 11:39:35 UTC
Confirming with:
Version: 5.4.0.0.alpha0+
Build ID: d538d3d84172a74dfe97d59a6d3daf9a45459cab
CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; 
TinderBox: Win-x86@39, Branch:master, Time: 2016-12-14_00:28:59
Locale: nl-NL (nl_NL); Calc: CL

and with
Version: 4.4.6.3
Build ID: e8938fd3328e95dcf59dd64e7facd2c7d67c704d
Locale: nl_NL

and with
Version 4.0.0.3 (Build ID: 7545bee9c2a0782548772a21bc84a9dcc583b89)
Comment 5 David Tardon 2016-12-14 18:54:24 UTC
To elaborate: it only appears that Publisher 98 (and older) documents are supported. This is because the physical structure (OLE2 container, name of the "main" file, format of records) of a Publisher file hasn't changed since v.2 (probably since v.1), but the logical structure (i.e., where things like text, shapes etc. are) has changed several times. That means that the parser can read the top level structure of a document, but some (most) parts of it are lost, because they are not where the parser expects them.
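To illustrate why a container-level check cannot tell versions apart, here is a minimal Python sketch that only tests for the standard OLE2 compound-file signature. The function name `looks_like_ole2` is hypothetical and is not part of libmspub; the sketch says nothing about where text or shapes live inside the file.

```python
# The 8-byte signature that starts every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(data: bytes) -> bool:
    """Return True if the buffer starts with the OLE2 container signature.

    This checks the physical container only. It passes for any OLE2-based
    file, so it cannot show whether the logical layout inside is one
    that a given parser understands.
    """
    return data[:8] == OLE2_MAGIC
```

A file that passes this check can still lose most of its content on import if the records are not where the parser expects them.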
Comment 6 osnola 2018-07-26 09:26:42 UTC
I have already pushed some patches that improve the reading of mspub 97 files to https://github.com/fosnola/libmspub, and I am currently trying to improve the conversion of mspub 98 files...

For the record:
- the mspub v1 files are not stored in an OLE container,
- the "size" of the root/document's block seems to be different in each version (v2: 5e, v3: 78, 97: 9e, 98: d2, 2000?: de); I must simplify the code, but I will use this information to differentiate the versions (each version stores the styles differently, except v3 and 97, which seem to share the same code).
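The size-based version detection described above could be sketched roughly as follows in Python. This is not the libmspub code: the names `ROOT_BLOCK_SIZE_TO_VERSION`, `guess_version` and `style_family` are hypothetical, the sizes are assumed to be hexadecimal byte counts, and how the size is read from the file is left out because the comment does not say.

```python
from typing import Optional

# Root/document block sizes quoted in the comment, assumed to be hexadecimal.
# The "2000?" entry is marked as uncertain in the comment itself.
ROOT_BLOCK_SIZE_TO_VERSION = {
    0x5E: "v2",
    0x78: "v3",
    0x9E: "97",
    0xD2: "98",
    0xDE: "2000?",
}


def guess_version(root_block_size: int) -> Optional[str]:
    """Map a root block size to a version label, or None if unknown."""
    return ROOT_BLOCK_SIZE_TO_VERSION.get(root_block_size)


def style_family(version: str) -> str:
    """Group versions by style storage; v3 and 97 seem to share one layout."""
    return "v3/97" if version in ("v3", "97") else version
```

For example, `guess_version(0xD2)` gives `"98"`, and an unlisted size gives `None`, so a caller can fall back to some other heuristic rather than guess.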
Comment 7 osnola 2018-07-29 10:40:05 UTC
Created attachment 143812
the current results

I have managed to retrieve most graphic elements, but many problems remain to be solved:
- no picture wrapping; in fact, this is very difficult to reproduce in Draw,
- linked text boxes are not retrieved in 98, but should be retrieved in 97,
- many text style problems, and only very basic table retrieval in 98; the conversion should be better in 97,
- ...

Note:
- I have not tried to improve the import of the Quill stream (which stores the text, the table content and their styles in 98 files), so some things can still be improved...