Contextual translation of PDFs

jonasrosland · October 7, 2024, 4:11pm

I’m working on a workflow and perhaps someone can point me in the right direction.
I have tons of PDFs of video game magazines in Japanese and need to translate the text.

I’m planning to use OCRmyPDF for OCR, to be able to extract the text.
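A minimal sketch of that OCR step, assuming OCRmyPDF is installed along with the Tesseract Japanese language pack (`jpn`). The file names are hypothetical placeholders. The `--sidecar` option asks OCRmyPDF to also write the recognized text to a plain text file, which is handy for feeding later steps.

```python
import shlex


def build_ocr_command(input_pdf, output_pdf, sidecar_txt, language="jpn"):
    """Build an OCRmyPDF command line that adds a text layer to the PDF
    and writes the recognized text to a sidecar .txt file."""
    return [
        "ocrmypdf",
        "-l", language,            # Tesseract language code; "jpn" assumed here
        "--sidecar", sidecar_txt,  # plain-text copy of the OCR output
        input_pdf,
        output_pdf,
    ]


# Hypothetical file names, for illustration only.
cmd = build_ocr_command(
    "famitsu-1336.pdf", "famitsu-1336-ocr.pdf", "famitsu-1336.txt"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps it easy to loop over a folder of PDFs and pass each one to `subprocess.run` without shell quoting problems.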

Then I would like to translate that OCR’d text, but of course it needs context. There will be articles, ads, short snippets in the side margins, etc., like a regular magazine you would buy off the rack in the olden days.

If I just send the raw text to be translated, it loses that context. I could manually copy and paste snippets, provide context, and feed them through Ollama, but I would like to automate the workflow as much as possible. What would you recommend as the next step? Image recognition AI + translation? I can use both web services and self-hosted ones, and would love to hear your thoughts.
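One way the manual "snippet plus context" step could be automated is to wrap each snippet in a prompt that states where it came from. The sketch below only builds the JSON payload for Ollama's `/api/generate` endpoint. The model name, the `kind` labels, and the prompt wording are assumptions for illustration, and it does not solve the harder problem of detecting which region of the page a snippet belongs to.

```python
import json


def build_translation_request(snippet, kind, magazine, model="llama3"):
    """Build a payload for Ollama's /api/generate endpoint that carries
    the snippet together with a short description of its context."""
    prompt = (
        f"You are translating Japanese text from the video game magazine {magazine}.\n"
        f"This snippet is a {kind}.\n"
        "Translate it into English and keep names of games and people intact.\n\n"
        f"{snippet}"
    )
    # stream=False asks for a single JSON response instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}


# Hypothetical snippet and context label.
payload = build_translation_request("新作ゲーム特集", "sidebar note", "Famitsu")
body = json.dumps(payload, ensure_ascii=False)
print(body)
# With a local Ollama server, this body would be POSTed to
# http://localhost:11434/api/generate (the default address).
```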

jonasrosland · October 8, 2024, 8:52pm

After trying this out with DeepL and Claude, I think that instead of running full translations, a better way to use this would be to have the model provide contextual information about the magazine and its contents.

Something along the lines of:

Tell me about MAGAZINE, in the format of:
* Date released
* Major themes and focused content
* Specific interviews, with who, and who do they work for
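To run the same question against many issues, the prompt above could be generated from a small template. This is only a sketch; the function name is hypothetical and the field list simply mirrors the bullets above.

```python
def build_summary_prompt(magazine):
    """Return the summary prompt for one magazine issue."""
    fields = [
        "Date released",
        "Major themes and focused content",
        "Specific interviews, with who, and who do they work for",
    ]
    lines = [f"Tell me about {magazine}, in the format of:"]
    lines += [f"* {field}" for field in fields]
    return "\n".join(lines)


prompt = build_summary_prompt("Famitsu Issue 1336")
print(prompt)
```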

I just tested this with the Llama 3 8B Instruct model, using GPT4All and its LocalDocs functionality and was pleased to see the output:

Here’s the information about Famitsu Issue 1336:

Date released: July 24, 2014

Major themes and focused content:

The cover story features a special focus on “Yo-kai Watch” with illustrations by Matsushita Shinji.

Inside the magazine, there are articles about various games, including:

A strategy guide for the game “Rail Wars!” (newly released at the time).

An interview with developers from KADOKAWA CORPORATION.

News and updates on Xbox One and Alienware Alpha gaming consoles.

Specific interviews:

Interview with 中川翔子 (Nakagawa Shiko), a Japanese voice actress who works for Warner Bros. Entertainment Inc. She is known for her roles in anime series such as “Attack on Titan” and “Puella Magi Madoka Magica”.

Interview with 福原遥 (Fukuhara Haruka) and 真山りか (Mayama Rika), Japanese voice actresses who work for Warner Bros. Entertainment Inc. They are known for their roles in anime series such as “Sword Art Online” and “Magical Girl Lyrical Nanoha”.

Interview with 宝田明 (Hotta Akihiro) from KADOKAWA CORPORATION, a Japanese video game developer.

Please note that Famitsu is a Japanese video game magazine, so some content may be specific to Japan or have cultural references.

Some of the above is incorrect. For example, KADOKAWA CORPORATION is the publisher of the magazine and not an interview subject. Overall this is a good starting point! More fine-tuning and testing needed though.

Eugenio · October 9, 2024, 5:57pm

I ran a similar exercise with customer calls concerning product evolution. I had the transcriptions, some of the calls were in foreign languages, and I had questions I wanted answered.
I used typingmind.com to run my experiments. Typingmind allowed me to access higher tiers of the same LLM using APIs. You might already be ahead of the curve, but it worked well for my purpose.

jonasrosland · October 9, 2024, 7:37pm

Oh that’s a great idea, I’ll play around with that. So far I’ve only used local models, but I will definitely test with public models as well. This seems to make it super simple.

jonasrosland · October 10, 2024, 6:29pm

I’m making more progress on this, and I think some of the errors I’ve seen in the summaries can be attributed to faulty OCR. Should I not OCR these PDF files perhaps? Will try without, and see what happens.
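A rough way to spot pages where OCR went wrong is to measure how much of the extracted text is actually Japanese script. The sketch below is a heuristic only: it counts hiragana, katakana, and common kanji, and any cutoff for "bad OCR" would need to be tuned by hand, since magazine pages also contain legitimate Latin text and numbers.

```python
def japanese_ratio(text):
    """Fraction of non-whitespace characters that are kana or common kanji."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c):
        code = ord(c)
        return (
            0x3040 <= code <= 0x309F      # hiragana
            or 0x30A0 <= code <= 0x30FF   # katakana
            or 0x4E00 <= code <= 0x9FFF   # CJK unified ideographs
        )

    return sum(is_japanese(c) for c in chars) / len(chars)


# A page that is mostly Latin noise would score low and could be flagged
# for a second OCR pass or manual review.
print(japanese_ratio("ゲーム特集 2014"))
```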

jonasrosland · October 10, 2024, 6:35pm

So that didn’t work, ha

I apologize, but I don’t see any information about Famitsu Issue 1336 in the provided document “myfile-no-ocr.pdf”. The file appears to be a PDF containing text that has not been recognized by OCR (Optical Character Recognition) software.
As a result, the contents of the file are not readable or searchable. If you would like me to summarize any other information from this file, please let me know and I’ll do my best to assist you!

jonasrosland · October 18, 2024, 1:30pm

I’ve been playing around with https://ragflow.io/ this week, to get better context around the information I’m feeding it. I’ll also be exploring https://big-agi.com/, which I learned about from @andypern yesterday.
RAGFlow has some really interesting features I’m looking to dive into, and I hope to have it fully set up later today. I’ll be live on stream on Discord if you want to follow along.
