PDF Scraper/Compiler
In a doomed attempt to better myself, I decided to learn Vietnamese years ago. To date, I have yet to become conversationally fluent, but I have managed to stick with it, which is all that can be asked on projects such as this. The strides you make in one day are but a drop in the bucket for the journey. Consistency is key here.
But man, Vietnamese is hard.
In order to keep it fresh, I decided to work on reading books in Vietnamese and almost immediately realized that this presented two huge stumbling blocks. One, I did not understand enough of the words to even get a sense of meaning. And two, the few words I did understand made no sense in the context because I also didn’t know the grammar well enough.
Which led me to divide and conquer the two.
Having no interest in looking up each word as it came and slowly, manually building a list of words that would then need to be turned into flashcards, I decided to write some code.
In an effort to save you some time (and avoid admitting that I didn’t comment the code too well) I’ll summarize what I set out to do. I wanted to take a page of text in Vietnamese, split it into separate strings (one per word), then check whether each string was already in the list of words to learn. If it was, we would bump the count in the column next to it; if not, we would add it to the list. After a quick sort, we would end up with a list of words ranked by how often they appeared in the text. Then any word that appeared more than three times could be converted into flashcards and I would be able to read the text and just focus on the grammar.
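For the curious, here is a minimal sketch of that counting step in Python. It is not the original code: the function names, the punctuation handling, and the lowercasing are all assumptions made for illustration.

```python
from collections import Counter

# Punctuation to trim from the edges of each token (an assumed, minimal set).
PUNCTUATION = ".,;:!?\"'()"


def count_words(text):
    """Split text on whitespace and count how often each word appears."""
    # Lowercase and trim punctuation so "Tôi" and "tôi," land in the same bucket.
    words = [token.strip(PUNCTUATION).lower() for token in text.split()]
    return Counter(word for word in words if word)


def frequent_words(text, more_than=3):
    """Return (word, count) pairs seen more than `more_than` times, most common first."""
    counts = count_words(text)
    return [(word, n) for word, n in counts.most_common() if n > more_than]
```

Calling `frequent_words(page_text)` on a page of text would give the shortlist of flashcard candidates, already sorted by frequency. A `Counter` stands in for the word list with a count column next to it.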
Then project creep kicked in.
Wouldn’t it be nice if the words were output into a text document that could be imported into Anki (my flashcard software) to automatically create the cards?
Wouldn’t it be nice if the program automatically translated the words so that I wouldn’t have to spend time looking through a dictionary to translate?
Wouldn’t it be nice if there was a master dictionary with all the words I had already learned so that it wouldn’t give me duplicate flashcards?
The answer to all of these questions was yes, of course, just one more feature. I have a great deal more empathy for the big software companies that are constantly updating their look and adding features that no one cares about. Because maybe some people actually do.
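Two of those features, the Anki export and the master dictionary, could be sketched roughly like this. Anki can import tab-separated text files, so the sketch writes one word and one translation per line. Everything else here is an assumption: the function names are hypothetical, the master dictionary is just a set of known words, and `translate` is a placeholder for whatever lookup you plug in (a dictionary file, a translation service), not a real API.

```python
import csv


def build_anki_rows(words, known_words, translate):
    """Return (word, translation) rows for words not already in known_words.

    `known_words` is a set acting as the master dictionary; it is updated in
    place so the same word is never turned into a card twice.
    `translate` is any callable mapping a word to its translation (hypothetical).
    """
    rows = []
    for word in words:
        if word in known_words:
            continue  # already learned or already queued, so skip it
        rows.append((word, translate(word)))
        known_words.add(word)
    return rows


def write_anki_file(rows, handle):
    """Write rows as tab-separated lines, one card per line, to an open file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
```

In use, you would open a file with `open("cards.txt", "w", encoding="utf-8", newline="")`, pass it to `write_anki_file`, and import the result into Anki as a text file. The file name is only an example.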
I also ran into several problems right off the bat. Unlike languages such as English, which write compound words without spaces (e.g. prerequisite), Vietnamese writes them with a space in between, so my word separator function could be dividing perfectly good words into their component parts. Now this isn’t inherently bad, since these component words are also good to learn, but I needed to have the compound words as well.
My solution (well, my brother’s solution, thanks man) was to pair each word with its neighbor, treat the pair as a possible compound word, and add it to the list as well. If the pair appeared more than a few times, it was likely a compound word and would appear on the final list. If it appeared only once, it was either an uncommon word or not a compound word at all. Either way, it was of no interest to me.
This solution worked for the most part. You can imagine that there were plenty of false positives. Just think of all the times we say “I have” or “I am”. Both were caught by my program, but easily discarded even if it pains me to say it had to be done manually.
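The neighbor-pairing trick is small enough to sketch in a few lines of Python. Again, this is an illustration and not the original code: the function name is made up, and the cutoff of three is an assumed stand-in for "more than a few times".

```python
from collections import Counter


def candidate_compounds(words, min_count=3):
    """Count adjacent word pairs and keep the ones that repeat.

    `words` is the list of single words in reading order. Each word is paired
    with the word that follows it, and pairs seen at least `min_count` times
    are returned as (pair, count), most common first.
    """
    pair_counts = Counter(zip(words, words[1:]))
    return [
        (" ".join(pair), n)
        for pair, n in pair_counts.most_common()
        if n >= min_count
    ]
```

The counts cannot tell a real compound from a common phrase, so pairs like “I have” still come through and have to be weeded out by hand.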


Looking back, all this was likely just an excuse to do something fun. It did help me in the long run, but now I am confronted by the fact that I have no excuse not to be doing flashcards every day. So in many ways, this project was a failure. I would have learned a lot more Vietnamese if I had spent those twenty hours actually studying. Instead, I spent the time trying to make something to get out of actually doing the task.
But if that doesn’t make me a programmer, I don't know what does.