Bug 154229 - Text files without extension treated as WordPerfect files
Summary: Text files without extension treated as WordPerfect files
Status: NEW
Alias: None
Product: Document Liberation Project
Classification: Unclassified
Component: General
Version (earliest affected): unspecified
Hardware: All All
Importance: medium minor
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-16 15:15 UTC by joseph.wong
Modified: 2023-09-06 22:30 UTC
3 users

See Also:
Crash report or crash signature:


Attachments
sample test document (93 bytes, text/plain)
2023-03-16 15:15 UTC, joseph.wong
converted file (11.80 KB, application/pdf)
2023-03-16 15:19 UTC, joseph.wong
sample txt file without an extension (82 bytes, application/octet-stream)
2023-03-17 16:55 UTC, joseph.wong

Description joseph.wong 2023-03-16 15:15:42 UTC
Description:
TXT files containing left or right double quotation marks (e.g. \u201d) lose the text between the quotes when converted to PDF.

Steps to Reproduce:
1. Create a txt document containing the double quotes.
2. Use the LibreOffice convert-to-PDF feature to convert the document.
3. Look at the output PDF and note the missing text.

Actual Results:
the document should be converted to pdf

Expected Results:
words between the quotes are missing


Reproducible: Always


User Profile Reset: No

Additional Info:
n/a
Comment 1 joseph.wong 2023-03-16 15:15:56 UTC
Created attachment 186003
sample test document
Comment 2 joseph.wong 2023-03-16 15:19:42 UTC
Created attachment 186004
converted file
Comment 3 Robert Großkopf 2023-03-17 14:36:43 UTC
Have tried this: 
Downloaded the *.txt-file.
Opened the *.txt-file with LO.
Exported to *.pdf.
No content is lost.

Then I started in console
soffice --convert-to pdf:writer_pdf_Export --outdir /home/user *.txt
No content is lost.

So I couldn't reproduce the buggy behavior with
Version: 7.4.6.2 / LibreOffice Community
Build ID: 5b1f5509c2decdade7fda905e3e1429a67acd63d
CPU threads: 6; OS: Linux 5.3; UI render: default; VCL: kf5 (cairo+xcb)
Locale: de-DE (de_DE.UTF-8); UI: de-DE
Calc: threaded

(OpenSUSE 15.3 64bit rpm Linux)
Comment 4 joseph.wong 2023-03-17 16:55:25 UTC
I figured out the problem. When we remove the file extension from the text file, the bug appears.
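Since the problem only shows up when the extension is missing, one possible workaround is to give the file a .txt name before handing it to LibreOffice. The following is only a sketch of that idea; the helper name `copy_with_txt_extension` and its `workdir` parameter are made up for illustration and are not part of LibreOffice.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_txt_extension(src, workdir=None):
    """Return a path to the same content that ends in .txt.

    Hypothetical helper: files that already have an extension are
    returned unchanged; extensionless files are copied to
    <workdir>/<name>.txt so the original is left untouched.
    """
    src = Path(src)
    if src.suffix:
        return src
    # Fall back to a fresh temporary directory if none was given.
    workdir = Path(workdir) if workdir else Path(tempfile.mkdtemp())
    dst = workdir / (src.name + ".txt")
    shutil.copyfile(src, dst)
    return dst
```

The returned path can then be passed to the converter instead of the original file.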
Comment 5 joseph.wong 2023-03-17 16:55:44 UTC
Created attachment 186036
sample txt file without an extension
Comment 6 Robert Großkopf 2023-03-17 18:13:56 UTC
Seems it has nothing to do with converting to pdf. Please try to open the attached file without extension with Writer. It will be imported without the content between the double quotes.

Open a new bug for this.
Comment 7 Buovjaga 2023-03-31 13:57:27 UTC
If I add

--infilter="Text (encoded):UTF8,LF,Liberation Mono,en-US"

as mentioned in https://help.libreoffice.org/latest/en-US/text/shared/guide/convertfilters.html there is no problem. So I'm not sure it's a bug.
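For scripted conversions, the explicit filter can be combined with the --convert-to invocation from comment 3. A minimal sketch, assuming `soffice` is on the PATH; the function name `build_convert_command` is hypothetical, and only flags already quoted in this report are used.

```python
def build_convert_command(input_path, outdir, soffice="soffice"):
    """Build an argv list that forces the plain-text import filter.

    Hypothetical helper. The filter string is the one quoted above;
    passing a list (not a shell string) avoids quoting problems with
    the spaces and commas inside the filter options.
    """
    text_filter = "Text (encoded):UTF8,LF,Liberation Mono,en-US"
    return [
        soffice,
        f"--infilter={text_filter}",
        "--convert-to", "pdf:writer_pdf_Export",
        "--outdir", str(outdir),
        str(input_path),
    ]
```

The list can be run with `subprocess.run(build_convert_command("sample", "/home/user"), check=True)`. Forcing the filter skips format detection, which is why the WordPerfect misdetection does not occur in this case.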

Arch Linux 64-bit, X11
Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: 2ca71b5c6e0374254e7c75c75e54fa6a8caebfde
CPU threads: 8; OS: Linux 6.2; UI render: default; VCL: kf5 (cairo+xcb)
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
Calc: threaded
Built on 30 March 2023
Comment 8 Buovjaga 2023-03-31 14:45:13 UTC
I got help from Mike K. in the dev chat.

In the console output, these were seen:
WPSDocument::isFileFormatSupported()

Opening the file without extension in Writer UI and creating and running the macro:

Sub Main
On Error Resume Next
Dim vArgs 	'Media descriptor as an array of com.sun.star.beans.PropertyValue'
Dim s$		'Display string'
Dim i%		'Index variable'
REM Obtain the media descriptor. It turns out that this
REM can be represented as an array of PropertyValue services.
vArgs = ThisComponent.getArgs()
For i = 0 To UBound(vArgs)			'For each property'
	s = s & vArgs(i).Name & " = "		'Add the property name and an equals sign'
	s = s & vArgs(i).Value				'Obtaining the value may fail!'
	s = s & CHR$(10)					'Add a new-line character'
Next
MsgBox s,0, "Args"
End Sub

I could see:

FilterName = WordPerfect
Comment 9 Mike Kaganski 2023-03-31 17:17:47 UTC
(In reply to Buovjaga from comment #8)

So this is a bug of libwpd::WPDocument::isFileFormatSupported, which returns some confidence level for this input. Some false positives are indeed inevitable, especially with short input data, but likely libwpd could use this sample to improve detection a bit.
Comment 10 Mike Kaganski 2023-03-31 20:04:23 UTC
The false detection happens in WP42Heuristics::isWP42FileFormat

https://sourceforge.net/p/libwpd/code/ci/master/tree/src/lib/WP42Heuristics.cpp#l61

and it seems to do a reasonable job; the data just happens to be suspiciously similar to the proper format. It is a UTF-8-encoded plain text file that has ASCII characters (0x20 to 0x7F) and exactly one pair of non-ASCII characters (the same “), whose UTF-8 encoding begins with 0xE2 0x80. This pair forms one "variable-length functional group" (starting at 0xE2 and ending at the same 0xE2), followed immediately by one "single-character functional group" consisting of 0x80. Such an unlikely coincidence: the properties of 0xE2 in WordPerfect require the pair, and the properties of 0x80 make it valid alone. If it were almost anything different; if the quotes were different, like “...”; or if there was at least one other non-ASCII character; or...

I don't know if the detection can be improved. But the constellation is funny ;)
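To see which bytes the heuristic is looking at, the non-ASCII bytes of such a file can be listed directly. This is only an inspection sketch: the sample string is made up (it is not the attachment), `high_bytes` is a hypothetical helper, and nothing here reproduces the logic of WP42Heuristics. Note that the full UTF-8 encoding of U+201C is the three bytes 0xE2 0x80 0x9C.

```python
def high_bytes(data):
    """Return (offset, byte) for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]


# Hypothetical content: plain ASCII plus the same left double quote twice.
sample = "say \u201chello\u201c now"
data = sample.encode("utf-8")

for offset, byte in high_bytes(data):
    print(f"{offset:3d}  0x{byte:02X}")
```

Every other byte in the sample is plain ASCII, so the two 0xE2 lead bytes and the bytes that follow them are the only input that can be read as WordPerfect function codes.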