Bug 158329

Summary: Can't find text with Niqqud in exported PDF
Product: LibreOffice
Reporter: Saburo <yosi3260+libre>
Component: Printing and PDF export
Assignee: Not Assigned <libreoffice-bugs>
Status: NEW ---
Severity: normal
CC: khaled, raal
Priority: medium
Keywords: bibisected, bisected, regression
Version: 7.5.0.3 release   
Hardware: All   
OS: All   
Whiteboard:
Crash report or crash signature:
Regression By: Khaled Hosny
Bug Depends on:    
Bug Blocks: 43808, 103378    
Attachments: sample file
exported747
exported242

Description Saburo 2023-11-23 00:06:53 UTC
Description:
In PDFs exported with LibO 7.4, Hebrew text can be found by searching in a PDF reader, but in PDFs exported with LibO 7.5 and later, searching for the same Hebrew text finds nothing.
Originally posted on Ask LibreOffice:
https://ask.libreoffice.org/t/writer-pdf/98051

It seems that the characters are stored separately and cannot be recognized as words.

Steps to Reproduce:
1. Export the sample file to PDF
2. Open that PDF in a reader
3. Search for וַיְהִ֥י

Actual Results:
Not found.

Expected Results:
The search term is found.


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 24.2.0.0.alpha0+ (X86_64) / LibreOffice Community
Build ID: ff3fb42b48c70ba5788507a6177bf0a9f3b50fdb
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Raster; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL threaded

Version: 7.4.7.2 (x64) / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 12; OS: Windows 10.0 Build 22621; UI render: Skia/Vulkan; VCL: win
Locale: ja-JP (ja_JP); UI: ja-JP
Calc: CL
Comment 1 Saburo 2023-11-23 00:07:32 UTC
Created attachment 190979 [details]
sample file
Comment 2 Saburo 2023-11-23 00:08:01 UTC
Created attachment 190980 [details]
exported747
Comment 3 Saburo 2023-11-23 00:26:36 UTC
Created attachment 190981 [details]
exported242

Sample file exported to PDF using LibO 24.2

The same thing happens with the [attached file](https://bugs.documentfoundation.org/attachment.cgi?id=134028) in [Bug 91764](https://bugs.documentfoundation.org/show_bug.cgi?id=91764).
Comment 4 Eyal Rozenberg 2023-11-27 10:33:34 UTC
The most important thing to note about this bug is that the search term of interest contains Niqqud marks - marks indicating vowels, emphasis or intonation; and even one cantillation mark. See:

https://en.wikipedia.org/wiki/Niqqud
https://en.wikipedia.org/wiki/Hebrew_cantillation

without marks: ויהי
with marks:    וַיְהִ֥י

If we search for the no-Niqqud term, we find it on the second line, in both attached PDFs. If we search for the with-Niqqud term, we find it in the older-version export but not in the newer-version one.

I can also confirm the newer-behavior part of this bug with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Note that when searching within LO itself, LO ignores the Niqqud and cantillation marks and matches only the letter sequence, so both terms match each other and themselves in the original document.
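
To illustrate the difference between the two search behaviours, here is a minimal Python sketch of mark-insensitive matching. It is not LibreOffice's actual search code; the helper names `strip_marks` and `matches_ignoring_marks` are hypothetical, and the code points are assumed to correspond to the two terms quoted above (letters only, and letters with Niqqud plus one cantillation mark). It drops every character in Unicode category `Mn` (nonspacing mark) before comparing.

```python
import unicodedata

# Assumed code points for the with-marks term: vav, patah, yod, sheva,
# he, hiriq, merkha (a cantillation mark), yod.
WITH_MARKS = "\u05D5\u05B7\u05D9\u05B0\u05D4\u05B4\u05A5\u05D9"
# The same word with letters only: vav, yod, he, yod.
WITHOUT_MARKS = "\u05D5\u05D9\u05D4\u05D9"


def strip_marks(text: str) -> str:
    """Remove nonspacing marks (Niqqud, cantillation) and keep base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def matches_ignoring_marks(needle: str, haystack: str) -> bool:
    """Substring search that compares only the base letter sequence."""
    return strip_marks(needle) in strip_marks(haystack)
```

An exact substring search for the letters-only term inside the marked term fails, because the marks sit between the letters, whereas `matches_ignoring_marks` succeeds in both directions. A PDF reader that searches the extracted text exactly has no such fallback, so it depends entirely on what text the PDF exposes.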
Comment 5 Eyal Rozenberg 2023-11-27 10:39:08 UTC
Oh, and: The problem is there even if we drop the cantillation mark. So Niqqud is enough for it to manifest.
Comment 6 raal 2023-11-30 19:59:30 UTC
This seems to have begun at the commit below in bibisect repository/OS linux-64-7.5.
Adding Cc: to Khaled Hosny; could you possibly take a look at this one?
Thanks
 ba8787d89bb90aced203271dee7231163446d7e9 is the first bad commit
commit ba8787d89bb90aced203271dee7231163446d7e9
Author: Jenkins Build User <tdf@pollux.tdf>
Date:   Wed Oct 5 22:14:28 2022 +0200

    source 09c076c3f29c28497f162d3a5b7baab040725d56

140994: tdf#151350: Fix extraneous gaps before marks | https://gerrit.libreoffice.org/c/core/+/140994
Comment 7 ⁨خالد حسني⁩ 2023-11-30 20:17:17 UTC
Text extraction from PDF is a lost cause.

We are now generating /ActualText spans where we didn’t previously, and PDF readers are now confused by this. I blame Adobe for creating such a backwards file format and never fixing it.
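
For readers unfamiliar with the mechanism, the following Python sketch shows the general shape of an /ActualText span as PDF defines it: a marked-content sequence whose property dictionary carries the replacement text as a UTF-16BE hex string with a byte order mark. This is only an illustration, not what LibreOffice emits; the function name `actual_text_span` is hypothetical, and the glyph-showing operators passed in are placeholders.

```python
def actual_text_span(text: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a /Span carrying /ActualText.

    `text` is the Unicode text that extraction should yield for the span;
    `show_ops` is a placeholder for the content-stream operators that
    actually paint the glyphs.
    """
    # PDF text strings in UTF-16BE start with the byte order mark FE FF.
    hex_text = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{show_ops}\nEMC"


# Example: a span for a single vav (U+05D5) with a made-up glyph ID.
example = actual_text_span("\u05D5", "<0012> Tj")
```

If each letter-plus-mark cluster ends up in its own span, a reader that does not stitch adjacent spans back together would see the word as separate pieces, which would be consistent with the symptom described in this report.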

This probably can be fixed, but I don’t have the capacity to work on it right now.