Testing obsidian-import's MarkItDown conversion against real files
In a previous post I wrote about youtube-to-obsidian. Since then I’ve wanted to turn PDFs and slide decks into notes the same way, so I wired in Microsoft’s MarkItDown to handle them. It’s no longer YouTube-only, so I renamed the repo to obsidian-import.
Now I can pull long academic PDFs, 100-plus-slide decks, lengthy blog posts, and TED talks into Obsidian as summary notes without reading or watching the whole thing. Instead of sitting through a one-hour video, I can skim a well-organized note in a few minutes. It is a far faster way to take in material.
That said, when I first built the feature I only wrote mock tests; I’d never actually thrown a real PDF or PPTX at it. This post is about closing that gap.
An immediate error on PDF
I fed it a PDF I had lying around, and got this:
Conversion error: PdfConverter recognized the input as a potential .pdf file,
but the dependencies needed to read .pdf files have not been installed.
MarkItDown uses pdfminer for PDF handling, but that’s an optional dependency: pip install markitdown alone doesn’t pull it in. You need pip install markitdown[pdf] or markitdown[all].
My install.sh was only installing bare markitdown, so I changed it to install markitdown[all]. DOCX, PPTX, XLSX, and the rest have the same kind of optional dependencies, so if you want to support every format, it’s simplest to install markitdown[all] from the start.
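A quick preflight check can surface this kind of gap before the first conversion. The sketch below is a minimal example and not part of obsidian-import: the function name `missing_modules` is hypothetical, and the format-to-module mapping is an assumption about what each MarkItDown extra pulls in, so adjust it to the version you have installed.

```python
from importlib.util import find_spec

# Assumed mapping from file format to the top-level modules its converter
# needs. These names are illustrative guesses, not taken from obsidian-import.
ASSUMED_FORMAT_MODULES = {
    "pdf": ["pdfminer"],
    "docx": ["mammoth"],
    "pptx": ["pptx"],
    "xlsx": ["pandas", "openpyxl"],
}


def missing_modules(format_modules):
    """Return {format: [modules that cannot be found]} for incomplete formats."""
    missing = {}
    for fmt, modules in format_modules.items():
        # find_spec returns None when a top-level module is not installed.
        absent = [name for name in modules if find_spec(name) is None]
        if absent:
            missing[fmt] = absent
    return missing


if __name__ == "__main__":
    for fmt, absent in missing_modules(ASSUMED_FORMAT_MODULES).items():
        print(f"{fmt}: missing {', '.join(absent)} (try: pip install 'markitdown[all]')")
```

Because it only looks modules up and never imports them, the check is cheap enough to run at the start of every conversion.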
Test results by format
After fixing the dependencies, I ran it against real files and dummy data for each format.
| Format | Input | Result |
|---|---|---|
| PDF | Resume / work history | Japanese text extracted correctly. Lots of FontBBox warnings, but no effect on the output |
| DOCX | Dummy document | Headings and paragraphs converted to Markdown correctly |
| PPTX | Dummy slides | Converted with slide numbers preserved |
| XLSX | Dummy table | Converted to a Markdown table |
| URL | example.com | HTML to Markdown conversion worked |
| Image | JPG/PNG | Rejected by the minimum-character-count check since there’s no text (working as intended) |
The FontBBox warnings on PDF are a known pdfminer quirk that shows up with PDFs that have incomplete font info. It floods stderr and looks alarming, but it doesn’t affect the conversion result.
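If the noise gets in the way, one option is to raise the log level for pdfminer. This is a sketch under the assumption that the FontBBox messages are emitted through Python's logging module by loggers whose names start with `pdfminer`; the helper name `quiet_pdfminer` is hypothetical and does not exist in obsidian-import.

```python
import logging


def quiet_pdfminer(level=logging.ERROR):
    """Hide pdfminer log records below `level` (WARNING and lower by default)."""
    # Assumption: the warnings come from loggers under the "pdfminer" namespace,
    # so child loggers such as "pdfminer.pdffont" inherit this level.
    logger = logging.getLogger("pdfminer")
    logger.setLevel(level)
    return logger


# Call once before converting any PDFs.
quiet_pdfminer()
```

Real errors at ERROR level and above still get through, so a genuinely broken PDF is not silenced along with the font warnings.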
Confirmed duplicate skipping too
I converted the same URL twice, and the second run was skipped as expected.
[1/1] https://example.com
Skipped (already processed)
convert.py names files by the MD5 hash of the source, and skips conversion if a file with that name already exists in .transcripts/. This was already covered by mock tests, but it was good to confirm it holds up in practice too.
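The hash-and-skip idea can be sketched in a few lines. This is an illustration, not the actual convert.py: the function names, the `.md` extension, and the choice to hash the source string (rather than the file's bytes) are all assumptions.

```python
import hashlib
from pathlib import Path


def output_path(source, out_dir):
    """Map a source (URL or file path string) to a hash-named Markdown file."""
    # Assumption: the hash is taken over the source string, not the file bytes.
    digest = hashlib.md5(source.encode("utf-8")).hexdigest()
    return Path(out_dir) / f"{digest}.md"


def convert_once(source, out_dir, convert):
    """Run `convert(source)` unless its output file already exists.

    Returns True if a conversion ran, False if it was skipped.
    """
    target = output_path(source, out_dir)
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(convert(source), encoding="utf-8")
    return True
```

Passing the converter in as a callable keeps the skip logic testable without MarkItDown installed. One consequence of hashing the source string is that an updated file at the same path or URL would still be skipped.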
What I fixed
The actual change was just three spots specifying the dependency package. README.md and CLAUDE.md also mention MarkItDown, but only as a tool name reference, so those didn’t need updating.
| File | Change |
|---|---|
| install.sh | markitdown → markitdown[all] |
| SKILL.md | Same |
| ~/.claude/commands/obsidian-import.md | Same (a copy of SKILL.md) |
You only find this by using it yourself
This particular problem was the kind mock tests can’t catch. As long as you’re mocking the return value of MarkItDown().convert(), the real converter never runs, so there’s no way for the test to notice that a missing dependency blows up the moment a real file is converted.
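To make that concrete, here is a minimal sketch of a mock-based test. The wrapper `to_markdown` and the test name are hypothetical stand-ins rather than code from the repo; the only MarkItDown detail it relies on is that the result of convert() exposes the Markdown as `text_content`.

```python
from unittest.mock import MagicMock


def to_markdown(converter, source):
    """Hypothetical wrapper: convert `source` and return the Markdown text."""
    return converter.convert(source).text_content


def test_pdf_conversion_with_mock():
    # The stand-in converter answers every call with a canned result, so no
    # PDF parser is ever imported or executed.
    fake_converter = MagicMock()
    fake_converter.convert.return_value.text_content = "# Extracted title"

    assert to_markdown(fake_converter, "paper.pdf") == "# Extracted title"
    fake_converter.convert.assert_called_once_with("paper.pdf")


test_pdf_conversion_with_mock()
```

This test passes whether or not pdfminer is installed, and even though paper.pdf does not exist, which is exactly why it could not have caught the missing dependency.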
Installing the full markitdown[all] dependency set in CI would make it heavier, and bundling real PDF and PPTX files in the repo feels off too. So I left the mock tests as they are, and made real-file integration checks something I run manually on my own machine.
In the end, you don’t really know your own tool works until you use it yourself. Tests can all pass and it still falls over the moment you throw a real PDF at it. Same story as my previous post — the switch to caption-first only became obvious once I actually tried it on a real lecture video. Mocks and CI are good for a baseline, but nothing beats using the thing yourself as an actual user.