Digitizing Memories: Using GPT-4o to Transcribe My Handwritten Journal Notes
I've been wanting to get back to my writing habit, and I'm glad I have consistently wrote daily journal entries for the past six months. Using my journal entries, I aim to develop an AI life coach who can help me navigate through life's crests and troughs.
In this blog post, I'll share my experience in using OpenAI's GPT-4o to transcribe a representative sample of my handwritten journal notes. For context, I write my journal entries on a clear A6 notebook, as shown in my first journal entry below:
I took photos of my journal entires using my phone and fed them to GPT-4o with the system prompt:
You are the world's greatest transcriber of handwritten notes. Transcribe the text from this image accurately. Do not add any other words nor section separators in your response.
While GPT-4o was relatively good off the shelf in transcribing my handwritten notes, I have identified four major drawbacks of using OpenAI's vision-language model (VLM) in the task of Handwritten Text Recognition (HTR).
1. Insertion Problem #
GPT-4o has problems with insertions. It either misplaces the inserted text or removes it completely. Here are two examples showing the insertion problem.
Example 1.1 |
---|
Ground Truth (GT): I was cooking Beef Monggo for the first time while on a video call with my mom. |
GPT-4o Output: For the first time, I was cooking Beef Mongo while on a video call with my mom. |
Example 1.2 |
---|
GT: I got a positive reply from her stating that my appointment is actually scheduled tomorrow. |
GPT-4o: I got a reply from her stating that my appointment was actually scheduled tomorrow. |
2. Erasure Problem #
GPT-4o also has problems with erasures. It attempts to transcribe the strikethrough text and autocorrects it.
Example 2.1 |
---|
GT: After taking a bath, I checked the quotation sent by the Insurance Renewal Specialist, and true enough, the premium you pay for car insurance goes down over time since the value of your car depreciates. |
GPT-4o: After taking a bath, I checked the quotation sent by the Insurance Renewal Specialist and true enough, the premium you pay for car insurance depreciates/goes down over time since the value of your car depreciates. |
The first two drawbacks are expected from GPT-4o which was trained on a large corpus of digital text that do not contain insertions and erasures. These are two features unique to handwritten text, and there's a large room for improvement for a VLM on the HTR task.
3. Punctuation Mark Removal #
GPT-4o sometimes removes punctuation marks such as quotation marks and large parentheses.
Example 3.1 |
---|
GT: Hindi "ako muna bago sa lahat" kundi "ako muna para sa lahat". |
GPT-4o: Hindi ako bago sa lahat kundi ako muna para sa lahat. |
Example 3.2 |
---|
GT: My past self would have just left and just returned to the Center some other day (with an explanation of my action, of course). |
GPT-4o: My past self would have just left and just returned to the Center some other day with an explanation of my action, of course. |
4. Hallucination #
GPT-4o, like any other language model, also hallucinates on handwritten text. Aside from proper nouns, it's not good with transcribing numbers.
Example 4.1 |
---|
GT: At Uncle John’s, I bought Kopiko Coffee at ₱28, and to my delight, the cashier said it was a B1T1 promo. |
GPT-4o: At Uncle John’s, I bought Kopiko Coffee at ₱82, and to my delight, the cashier and I had a B1T1 promo. |
Example 4.2 |
---|
GT: With 4.5 hours of sleep, I still felt energized to start the day at 6:00 AM. |
GPT-4o: With 4.5 hours of sleep, I still felt energized to start the day at 6:15 AM. |
It also generates false information presented as believable statements:
Example 4.3 |
---|
GT: It’s a matter of perspective: you cannot change reality, but you can change how you look at it. You can brighten its colors by wearing rose-tinted glasses, or darken it using shades. |
GPT-4o: It’s a matter of perspective: you cannot change reality, but you can change how you look at it. You can adapt to a car by wearing anti-reflective glasses, or darken it with shades. |
Example 4.4 |
---|
GT: With mom’s blessing, I bought the appliance and ate Goto with White Gulaman drink. |
GPT-4o: With mom’s blessing, I bought the item and also grabbed Gods with the Gutenberg Bible. |
An alarming behavior I observed is how GPT-4o produces transcriptions that blatantly contradict the original meaning of the handwritten text. Here are a few examples:
Example 4.5 |
---|
GT: The half-life of a model is much shorter than the half-life of a dataset. Data is still the moat. |
GPT-4o: The half-life of a model outlasts more than the half-life of a dataset. Data is only the lens. |
Example 4.6 |
---|
GT: I fulfilled my promise to myself that I'll clean the dirty floor in my bedroom. |
GPT-4o: I also cleaned my bedroom and fulfilled my promise of wiping it out flat. I’ll clean the dirty floor in my bedroom later. |
A Silver Lining #
Even with these limitations of GPT-4o, there are silver linings of using a VLM for transcribing handwritten notes. As seen in Example 3.1, GPT-4o can transcribe Tagalog words off the shelf without including a language-specific context in the prompt nor giving it an external language corpus to train on. It also autocorrects grammar, as shown below:
Example 5.1 |
---|
GT: After surrendering to the reality that it takes time to cook good food, I was surprised on how tasty and comforting my Beef Monggo was. |
GPT-4o: After surrendering to the reality that it takes time to cook good food, I was surprised at how tasty and comforting my Beef Mongo was. |
Overall, while GPT-4o produced a relatively good transcription of my handwritten journal entries, the drawbacks I have presented here posits a more thorough investigation of how VLMs convert handwritten text from mobile pictures to digital format. There's a lot to unpack and discover!