Testing obsidian-import's MarkItDown conversion against real files

In a previous post I wrote about youtube-to-obsidian. Since then I’ve wanted to turn PDFs and slide decks into notes the same way, so I wired in Microsoft’s MarkItDown to handle them. It’s no longer YouTube-only, so I renamed the repo to obsidian-import.

Now I can pull long academic PDFs, 100-plus-slide decks, lengthy blog posts, and TED talks into Obsidian as summary notes without reading or watching the whole thing. Instead of sitting through a one-hour video, I can skim a well-organized note in a few minutes. I get through far more material in the same amount of time.

That said, when I first built the feature I only wrote mock tests. I'd never actually thrown a real PDF or PPTX at it. This post is about closing that gap.

An immediate error on PDF

I fed it a PDF I had lying around, and got this (the Japanese prefix, 変換エラー, means "conversion error"):

変換エラー: PdfConverter recognized the input as a potential .pdf file,
but the dependencies needed to read .pdf files have not been installed.

MarkItDown uses pdfminer for PDF handling, but that’s an optional dependency — pip install markitdown alone doesn’t pull it in. You need pip install markitdown[pdf] or markitdown[all].

My install.sh was only installing bare markitdown, so I changed it to install markitdown[all]. DOCX, PPTX, XLSX, and the rest have the same kind of optional dependencies, so if you want to support every format, it's simplest to install the [all] extra from the start.
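A quick way to see which optional dependencies are actually present is to ask Python whether the modules can be found. The sketch below is only an illustration: the module names in ASSUMED_MODULES are my assumption about what each extra pulls in, not something confirmed here, and the helper names are hypothetical.

```python
import importlib.util

# Assumed mapping from input format to the top-level modules its extra provides.
# These names are an assumption; adjust them to the MarkItDown version you use.
ASSUMED_MODULES = {
    "pdf": ["pdfminer"],
    "docx": ["mammoth"],
    "pptx": ["pptx"],
    "xlsx": ["pandas", "openpyxl"],
}


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report_missing(mapping=ASSUMED_MODULES):
    """Map each format to the modules it lacks (empty dict if nothing is missing)."""
    report = {}
    for fmt, names in mapping.items():
        missing = missing_modules(names)
        if missing:
            report[fmt] = missing
    return report


if __name__ == "__main__":
    for fmt, missing in report_missing().items():
        print(f"{fmt}: missing {', '.join(missing)}")
```

One practical note: in shells such as zsh the square brackets are treated as a glob pattern, so quoting the requirement, as in pip install 'markitdown[all]', avoids a "no matches found" error.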

Test results by format

After fixing the dependencies, I ran it against real files and dummy data for each format.

| Format | Input | Result |
| --- | --- | --- |
| PDF | Resume / work history | Japanese text extracted correctly. Lots of FontBBox warnings, but no effect on the output |
| DOCX | Dummy document | Headings and paragraphs converted to Markdown correctly |
| PPTX | Dummy slides | Converted with slide numbers preserved |
| XLSX | Dummy table | Converted to a Markdown table |
| URL | example.com | HTML to Markdown conversion worked |
| Image | JPG/PNG | Rejected by the minimum-character-count check since there's no text (working as intended) |

The FontBBox warnings on PDF are a known pdfminer quirk that shows up with PDFs that have incomplete font info. They flood stderr and look alarming, but they don't affect the conversion result.
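If the noise gets in the way, one option is to raise the log threshold for pdfminer before converting. This is a minimal sketch, assuming the warnings are emitted through Python's logging module under loggers whose names start with "pdfminer"; quiet_pdfminer is a hypothetical helper, not part of obsidian-import.

```python
import logging


def quiet_pdfminer(level=logging.ERROR):
    """Raise the threshold of the "pdfminer" logger so warnings are dropped.

    Child loggers such as "pdfminer.pdffont" inherit this level unless they
    set their own, so one call covers the whole package.
    """
    logging.getLogger("pdfminer").setLevel(level)


# Call once at startup, before any PDF is converted.
quiet_pdfminer()
```

Errors still get through at this level, so a genuinely broken PDF would remain visible.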

Confirmed duplicate skipping too

I converted the same URL twice, and the second run was skipped as expected (the output line reads "Skipped (already processed)" in Japanese).

[1/1] https://example.com
  スキップ(処理済み)

convert.py names files by the MD5 hash of the source, and skips conversion if a file with that name already exists in .transcripts/. This was already covered by mock tests, but it was good to confirm it holds up in practice too.
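The idea can be sketched in a few lines. This is not the actual code from convert.py: it assumes the hash is taken over the source string (the URL or file path) and that notes are saved with a .md extension, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def transcript_path(source: str, out_dir: Path) -> Path:
    """Build the output path from the MD5 hex digest of the source string."""
    digest = hashlib.md5(source.encode("utf-8")).hexdigest()
    return out_dir / f"{digest}.md"  # ".md" extension is an assumption


def already_processed(source: str, out_dir: Path) -> bool:
    """Return True if a file for this source already exists in out_dir."""
    return transcript_path(source, out_dir).exists()
```

A caller would check already_processed(url, Path(".transcripts")) before converting and skip the item when it returns True. MD5 is fine here because the digest only serves as a stable file name, not as a security measure.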

What I fixed

The actual change came down to three places that specify the dependency package. README.md and CLAUDE.md also mention MarkItDown, but only as a tool name reference, so those didn't need updating.

| File | Change |
| --- | --- |
| install.sh | markitdown → markitdown[all] |
| SKILL.md | Same |
| ~/.claude/commands/obsidian-import.md | Same (a copy of SKILL.md) |

You only find this by using it yourself

This particular problem was the kind mock tests can't catch. As long as you're mocking the return value of MarkItDown().convert(), the real conversion code never runs, so there's no way to test whether a missing dependency makes the real call fail.
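To make that concrete, here is a toy illustration using only the standard library. StandInConverter is a hypothetical stand-in that always fails the way a converter with a missing dependency would; it is not MarkItDown itself, and summarize is an invented caller.

```python
from unittest import mock


class StandInConverter:
    """Hypothetical converter whose optional dependency is missing."""

    def convert(self, path):
        raise RuntimeError("dependencies needed to read .pdf files are not installed")


def summarize(path, converter):
    """Toy caller: convert a file and return the first 20 characters."""
    return converter.convert(path)[:20]


def test_summarize_with_mock():
    # Patching convert() replaces the failing code path entirely,
    # so this test passes no matter what is installed.
    with mock.patch.object(StandInConverter, "convert", return_value="hello from a mock"):
        assert summarize("paper.pdf", StandInConverter()) == "hello from a mock"


def real_call_fails():
    """Return True when the unpatched converter raises, as it would on a real file."""
    try:
        summarize("paper.pdf", StandInConverter())
    except RuntimeError:
        return True
    return False
```

The mocked test is green while the unpatched call still raises, which is the same gap described above.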

Installing the full markitdown[all] dependency set in CI would make it heavier, and bundling real PDF and PPTX files in the repo feels off too. So I left the mock tests as they are, and made real-file integration checks something I run manually on my own machine.
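A manual check like this can be a small script kept outside the CI path. The following is a sketch under stated assumptions rather than the script I actually run: SAMPLE_DIR is a hypothetical environment variable, MIN_CHARS is a made-up threshold standing in for the minimum-character-count check, and check_files is a hypothetical helper. The only library calls it relies on are MarkItDown().convert() and the result's text_content attribute.

```python
import os
from pathlib import Path

MIN_CHARS = 50  # hypothetical threshold; the real value is not stated here


def check_files(paths, convert, min_chars=MIN_CHARS):
    """Run `convert` on each path and return (path, status) pairs."""
    results = []
    for path in paths:
        try:
            text = convert(str(path))
        except Exception as exc:  # report the failure and keep going
            results.append((path, f"error: {exc}"))
            continue
        status = "ok" if len(text.strip()) >= min_chars else "too short"
        results.append((path, status))
    return results


def markitdown_convert(path):
    """Convert with the real library; imported lazily so mock-only CI never needs it."""
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content


if __name__ == "__main__":
    # SAMPLE_DIR is a hypothetical variable pointing at local, uncommitted test files.
    sample_dir = Path(os.environ.get("SAMPLE_DIR", "samples"))
    files = sorted(p for p in sample_dir.glob("*") if p.is_file())
    for path, status in check_files(files, markitdown_convert):
        print(f"{path.name}: {status}")
```

Passing the converter in as a function keeps the loop testable with a fake, while the real files and the heavy dependencies stay on the local machine.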

In the end, you don’t really know your own tool works until you use it yourself. Tests can all pass and it still falls over the moment you throw a real PDF at it. Same story as my previous post — the switch to caption-first only became obvious once I actually tried it on a real lecture video. Mocks and CI are good for a baseline, but nothing beats using the thing yourself as an actual user.