Digitizing Memories: Using GPT-4o to Transcribe My Handwritten Journal Notes

Jul 08, 2024
llm
ocr
journaling

I've been wanting to get back to my writing habit, and I'm glad I have consistently wrote daily journal entries for the past six months. Using my journal entries, I aim to develop an AI life coach who can help me navigate through life's crests and troughs.

In this blog post, I'll share my experience in using OpenAI's GPT-4o to transcribe a representative sample of my handwritten journal notes. For context, I write my journal entries on a clear A6 notebook, as shown in my first journal entry below:

My first journal entry last January 1, 2024.

I took photos of my journal entires using my phone and fed them to GPT-4o with the system prompt:

You are the world's greatest transcriber of handwritten notes. Transcribe the text from this image accurately. Do not add any other words nor section separators in your response.

While GPT-4o was relatively good off the shelf in transcribing my handwritten notes, I have identified four major drawbacks of using OpenAI's vision-language model (VLM) in the task of Handwritten Text Recognition (HTR).

1. Insertion Problem#

GPT-4o has problems with insertions. It either misplaces the inserted text or removes it completely. Here are two examples showing the insertion problem.

Example 1.1

Ground Truth (GT): I was cooking Beef Monggo for the first time while on a video call with my mom.
GPT-4o Output: For the first time, I was cooking Beef Mongo while on a video call with my mom.

Example 1.2

GT: I got a positive reply from her stating that my appointment is actually scheduled tomorrow.
GPT-4o: I got a reply from her stating that my appointment was actually scheduled tomorrow.

2. Erasure Problem#

GPT-4o also has problems with erasures. It attempts to transcribe the strikethrough text and autocorrects it.

Example 2.1

GT: After taking a bath, I checked the quotation sent by the Insurance Renewal Specialist, and true enough, the premium you pay for car insurance goes down over time since the value of your car depreciates.
GPT-4o: After taking a bath, I checked the quotation sent by the Insurance Renewal Specialist and true enough, the premium you pay for car insurance depreciates/goes down over time since the value of your car depreciates.

The first two drawbacks are expected from GPT-4o which was trained on a large corpus of digital text that do not contain insertions and erasures. These are two features unique to handwritten text, and there's a large room for improvement for a VLM on the HTR task.

3. Punctuation Mark Removal#

GPT-4o sometimes removes punctuation marks such as quotation marks and large parentheses.

Example 3.1

GT: Hindi "ako muna bago sa lahat" kundi "ako muna para sa lahat".
GPT-4o: Hindi ako bago sa lahat kundi ako muna para sa lahat.

Example 3.2

GT: My past self would have just left and just returned to the Center some other day (with an explanation of my action, of course).
GPT-4o: My past self would have just left and just returned to the Center some other day with an explanation of my action, of course.

4. Hallucination#

GPT-4o, like any other language model, also hallucinates on handwritten text. Aside from proper nouns, it's not good with transcribing numbers.

Example 4.1

GT: At Uncle John’s, I bought Kopiko Coffee at ₱28, and to my delight, the cashier said it was a B1T1 promo.
GPT-4o: At Uncle John’s, I bought Kopiko Coffee at ₱82, and to my delight, the cashier and I had a B1T1 promo.

Example 4.2

GT: With 4.5 hours of sleep, I still felt energized to start the day at 6:00 AM.
GPT-4o: With 4.5 hours of sleep, I still felt energized to start the day at 6:15 AM.

It also generates false information presented as believable statements:

Example 4.3

GT: It’s a matter of perspective: you cannot change reality, but you can change how you look at it. You can brighten its colors by wearing rose-tinted glasses, or darken it using shades.
GPT-4o: It’s a matter of perspective: you cannot change reality, but you can change how you look at it. You can adapt to a car by wearing anti-reflective glasses, or darken it with shades.

Example 4.4

GT: With mom’s blessing, I bought the appliance and ate Goto with White Gulaman drink.
GPT-4o: With mom’s blessing, I bought the item and also grabbed Gods with the Gutenberg Bible.

An alarming behavior I observed is how GPT-4o produces transcriptions that blatantly contradict the original meaning of the handwritten text. Here are a few examples:

Example 4.5

GT: The half-life of a model is much shorter than the half-life of a dataset. Data is still the moat.
GPT-4o: The half-life of a model outlasts more than the half-life of a dataset. Data is only the lens.

Example 4.6

GT: I fulfilled my promise to myself that I'll clean the dirty floor in my bedroom.
GPT-4o: I also cleaned my bedroom and fulfilled my promise of wiping it out flat. I’ll clean the dirty floor in my bedroom later.

A Silver Lining#

Even with these limitations of GPT-4o, there are silver linings of using a VLM for transcribing handwritten notes. As seen in Example 3.1, GPT-4o can transcribe Tagalog words off the shelf without including a language-specific context in the prompt nor giving it an external language corpus to train on. It also autocorrects grammar, as shown below:

Example 5.1

GT: After surrendering to the reality that it takes time to cook good food, I was surprised on how tasty and comforting my Beef Monggo was.
GPT-4o: After surrendering to the reality that it takes time to cook good food, I was surprised at how tasty and comforting my Beef Mongo was.

Overall, while GPT-4o produced a relatively good transcription of my handwritten journal entries, the drawbacks I have presented here posits a more thorough investigation of how VLMs convert handwritten text from mobile pictures to digital format. There's a lot to unpack and discover!