Summary: | Can't find text with Niqqud in exported PDF | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | Saburo <yosi3260+libre> |
Component: | Printing and PDF export | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | NEW --- | ||
Severity: | normal | CC: | khaled, raal |
Priority: | medium | Keywords: | bibisected, bisected, regression |
Version: | 7.5.0.3 release | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Crash report or crash signature: | Regression By: | Khaled Hosny | |
Bug Depends on: | |||
Bug Blocks: | 43808, 103378 | ||
Attachments: | sample file, exported747, exported242 | ||
Description
Saburo
2023-11-23 00:06:53 UTC
Created attachment 190979 [details]
sample file
Created attachment 190980 [details]
exported747
Created attachment 190981 [details]
exported242

Sample file exported to PDF using LibO24.2.

The same thing happens with [attachments file](https://bugs.documentfoundation.org/attachment.cgi?id=134028) in [Bug 91764](https://bugs.documentfoundation.org/show_bug.cgi?id=91764).

The most important thing to note about this bug is that the search term of interest contains Niqqud marks - marks indicating vowels, emphasis or intonation; and even one cantillation mark. See:
https://en.wikipedia.org/wiki/Niqqud
https://en.wikipedia.org/wiki/Hebrew_cantillation

without marks: ויהי
with marks: וַיְהִ֥י

If we search for the no-Niqqud term, we find it on the second line, in both attached PDFs. If we search for the with-Niqqud term, we find it in the older-version export but not the newer-version.

I can also confirm the newer-behavior part of this bug with:

Version: 24.2.0.0.alpha1+ (X86_64) / LibreOffice Community
Build ID: 516f800f84b533db0082b1f39c19d1af40ab29c8
CPU threads: 4; OS: Linux 6.5; UI render: default; VCL: gtk3
Locale: he-IL (en_IL); UI: en-US

Note that, in LO itself, and when searching - LO ignores the Niqqud and cantillation and just searches for the letter sequence, so both terms will match each other and themselves in the original document.

Oh, and: The problem is there even if we drop the cantillation mark. So Niqqud is enough for it to manifest.

This seems to have begun at the below commit in bibisect repository/OS linux-64-7.5.
Adding Cc: to Khaled Hosny; could you possibly take a look at this one? Thanks

ba8787d89bb90aced203271dee7231163446d7e9 is the first bad commit
commit ba8787d89bb90aced203271dee7231163446d7e9
Author: Jenkins Build User <tdf@pollux.tdf>
Date: Wed Oct 5 22:14:28 2022 +0200

source 09c076c3f29c28497f162d3a5b7baab040725d56

140994: tdf#151350: Fix extraneous gaps before marks | https://gerrit.libreoffice.org/c/core/+/140994

Text extraction from PDF is a lost cause. We are now generating /ActualText spans where we didn’t previously, and PDF readers are now confused by this. I blame Adobe for creating such a backwards file format and never fixing it.

This probably can be fixed, but I don’t have the capacity to work on it right now.
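
For anyone trying to reproduce the comparison outside a PDF reader: the mark-insensitive matching described above (LO ignoring Niqqud and cantillation when searching) can be approximated with a small Python sketch. This is only an illustration, not LibreOffice's actual search code. It assumes that dropping every code point in Unicode category Mn is an acceptable stand-in for "ignore the marks", and the helper names `strip_marks` and `mark_insensitive_find` are made up for this example. The escaped code points are my reading of the sample term from this report; in particular, the cantillation mark is assumed to be U+05A5.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove nonspacing combining marks (category Mn) from text.

    Hebrew points and cantillation accents fall in this category,
    so this reduces pointed text to its bare letter sequence.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def mark_insensitive_find(haystack: str, needle: str) -> bool:
    """Hypothetical helper: substring test that ignores combining marks on both sides."""
    return strip_marks(needle) in strip_marks(haystack)


# The search term from the report, written as escapes to avoid copy/paste damage.
# PLAIN: vav, yod, he, yod. MARKED: the same letters with three vowel points and
# one accent (assumed to be U+05A5).
PLAIN = "\u05D5\u05D9\u05D4\u05D9"
MARKED = "\u05D5\u05B7\u05D9\u05B0\u05D4\u05B4\u05A5\u05D9"

if __name__ == "__main__":
    # An exact substring test fails across the two spellings...
    print(MARKED in PLAIN)                        # False
    # ...while the mark-insensitive comparison matches in both directions.
    print(mark_insensitive_find(PLAIN, MARKED))   # True
    print(mark_insensitive_find(MARKED, PLAIN))   # True
```

A PDF reader that does an exact code point comparison against the extracted text behaves like the first `print`, which is consistent with the with-Niqqud term no longer being found if the extracted text differs from what was typed.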