Description:
TXT files containing left or right double quotation marks (e.g. U+201D) lose the text between the quotes when converted to PDF.

Steps to Reproduce:
1. Create a txt document containing the double quotes.
2. Use the LibreOffice convert-to-PDF feature to convert the document.
3. Look at the output PDF and note the missing text.

Actual Results:
Words between the quotes are missing.

Expected Results:
The document should be converted to PDF with all of its text.

Reproducible: Always

User Profile Reset: No

Additional Info:
n/a
Created attachment 186003 [details] sample test document
Created attachment 186004 [details] converted file
I tried this:
1. Downloaded the *.txt file.
2. Opened the *.txt file with LO.
3. Exported to *.pdf. No content is lost.

Then I ran this in the console:

    soffice --convert-to pdf:writer_pdf_Export --outdir /home/user *.txt

No content is lost. So I couldn't reproduce the buggy behavior with

Version: 7.4.6.2 / LibreOffice Community
Build ID: 5b1f5509c2decdade7fda905e3e1429a67acd63d
CPU threads: 6; OS: Linux 5.3; UI render: default; VCL: kf5 (cairo+xcb)
Locale: de-DE (de_DE.UTF-8); UI: de-DE
Calc: threaded
(OpenSUSE 15.3 64bit rpm Linux)
I figured out the problem. When we remove the file extension from the text file, the bug appears.
Created attachment 186036 [details] sample txt file without an extension
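For anyone who wants to build a similar test file locally, here is a minimal Python sketch. The file name `sample` and the sentence written into it are hypothetical; only the two properties reported in this thread are taken from it (UTF-8 text with a pair of U+201D quotes, and no file extension). It has not been verified that this exact content triggers the problem, so the attachment remains the reference test case.

```python
import tempfile
from pathlib import Path


def write_sample(directory: Path, name: str = "sample") -> Path:
    """Write a UTF-8 text file without an extension that contains U+201D quotes."""
    # Hypothetical content: plain ASCII plus one pair of right double quotes.
    text = "Before \u201dquoted words\u201d after.\n"
    path = directory / name  # no ".txt" suffix on purpose
    path.write_bytes(text.encode("utf-8"))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = write_sample(Path(tmp))
        print(sample.name, repr(sample.suffix), sample.read_bytes())
```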
Seems it has nothing to do with converting to pdf. Please try to open the attached file without extension with Writer. It will be imported without the content between the double quotes. Open a new bug for this.
If I add --infilter="Text (encoded):UTF8,LF,Liberation Mono,en-US" as mentioned in https://help.libreoffice.org/latest/en-US/text/shared/guide/convertfilters.html there is no problem. So I'm not sure it's a bug. Arch Linux 64-bit, X11 Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community Build ID: 2ca71b5c6e0374254e7c75c75e54fa6a8caebfde CPU threads: 8; OS: Linux 6.2; UI render: default; VCL: kf5 (cairo+xcb) Locale: fi-FI (fi_FI.UTF-8); UI: en-US Calc: threaded Built on 30 March 2023
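A small Python sketch of how the command line with and without the explicit import filter could be assembled for scripted testing. The helper name `build_convert_command` and the input name `sample` are hypothetical; the options themselves (`--headless`, `--infilter`, `--convert-to`, `--outdir`) and the filter string are the ones used in this thread and on the linked help page. The sketch only builds the argument list and does not launch LibreOffice.

```python
from typing import List, Optional

# Filter string quoted in this comment: forces the plain-text import filter.
TEXT_FILTER = "Text (encoded):UTF8,LF,Liberation Mono,en-US"


def build_convert_command(
    input_path: str, outdir: str, infilter: Optional[str] = None
) -> List[str]:
    """Return the soffice argument list for a PDF conversion."""
    cmd = ["soffice", "--headless"]
    if infilter is not None:
        # Skips format autodetection by naming the import filter explicitly.
        cmd.append(f"--infilter={infilter}")
    cmd += ["--convert-to", "pdf:writer_pdf_Export", "--outdir", outdir, input_path]
    return cmd


if __name__ == "__main__":
    print(build_convert_command("sample", "/home/user"))
    print(build_convert_command("sample", "/home/user", TEXT_FILTER))
    # To actually convert, pass either list to subprocess.run(cmd, check=True).
```

Passing the list to `subprocess.run` rather than a shell string avoids having to quote the spaces and parentheses in the filter name.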
I got help from Mike K. in the dev chat. In the console output, this was seen:

WPSDocument::isFileFormatSupported()

I opened the file without extension in the Writer UI, then created and ran this macro:

Sub Main
  On Error Resume Next
  Dim vArgs 'Media descriptor as an array of com.sun.star.beans.PropertyValue
  Dim s$    'Display string
  Dim i%    'Index variable

  REM Obtain the media descriptor. It turns out that this
  REM can be represented as an array of PropertyValue services.
  vArgs = ThisComponent.getArgs()
  For i = 0 To UBound(vArgs)       'For each property
    s = s & vArgs(i).Name & " = "  'Add the property name and an equals sign
    s = s & vArgs(i).Value         'Obtaining the value may fail!
    s = s & CHR$(10)               'Add a new-line character
  Next
  MsgBox s, 0, "Args"
End Sub

In the output I could see:

FilterName = WordPerfect
(In reply to Buovjaga from comment #8)

So this is a bug in libwpd::WPDocument::isFileFormatSupported, which returns some confidence level for this input. Some false positives are indeed inevitable, especially with short input data, but libwpd could likely use this sample to improve detection a bit.
The false detection happens in WP42Heuristics::isWP42FileFormat
https://sourceforge.net/p/libwpd/code/ci/master/tree/src/lib/WP42Heuristics.cpp#l61
and it seems to do a reasonable job; the data just happens to be suspiciously similar to the proper format.

It is a UTF-8-encoded plain text file, which has ASCII characters (0x20 to 0x7F), and exactly one pair of non-ASCII characters (the same “), whose UTF-8 encoding starts with 0xE2 0x80. This pair forms one "variable-length functional group" (starting from 0xE2, and ending at the same 0xE2), and immediately after, one "single-character functional group", consisting of 0x80.

Such an unlikely coincidence: the properties of 0xE2 in WordPerfect require the pair, and the properties of 0x80 make it valid on its own. If it were almost anything different, if the quotes were different, like “...”, or if there was at least one other non-ASCII character, or...

I don't know if the detection can be improved. But the constellation is funny ;)
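To make the byte-level picture concrete, here is a minimal Python sketch. It does not reimplement the libwpd heuristic; it only prints the UTF-8 bytes of the quotes and extracts the span from the first 0xE2 byte through the next one, which is the span the comment above describes as being read as one group. The sample sentence and the helper names `utf8_hex` and `span_between` are hypothetical.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def span_between(data: bytes, marker: int) -> bytes:
    """Return the bytes from the first `marker` byte through the next one.

    Returns b"" when `marker` occurs fewer than two times.
    """
    start = data.find(bytes([marker]))
    if start == -1:
        return b""
    end = data.find(bytes([marker]), start + 1)
    if end == -1:
        return b""
    return data[start:end + 1]


if __name__ == "__main__":
    # Hypothetical sample: ASCII text plus one pair of U+201D quotes.
    data = "He said \u201dhello\u201d today.\n".encode("utf-8")
    print(utf8_hex("\u201c"))        # left double quote
    print(utf8_hex("\u201d"))        # right double quote
    print(span_between(data, 0xE2))  # contains the quoted text
```

Both quote characters encode to three bytes that begin with 0xE2 0x80; only the third byte differs. The extracted span covers the quoted words, which matches the symptom of text between the quotes going missing.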