I've used AI for similar things. It's very good at transcription, translation, and summaries of texts you give it. It does, however, have its limitations for some historical tasks. For example, I've been looking for the source of a half-remembered quote from something I'd read 20 years ago, about Japanese-American relations. I remember the gist of the quote and plugged it into both ChatGPT and Claude, and both confidently gave me specific sources that were incorrect. So, it's not quite ready for prime time as a research assistant. (In AI's favor, I'm beginning to think the quote doesn't exist, that it was something I thought up myself and attributed to my reading. But, still, the answers were hallucinations.)
You can't really use it yet for deep drill-downs in research, as a substitute for doing the reading yourself. It'll give you a better-than-Wikipedia summary of a topic, but it lacks the nuance and depth of really good historical research.
Oh yeah, I forgot to say that this post was great. I really enjoyed it and it was very useful to me.
Thank you, glad you liked it, and agreed about this sort of thing being an issue.
I am beginning to realize that one big reason why my experience with these tools differs from that of many other people is that I typically work with pretty large sets of public domain documents - think ~1k letters sent in the 18th or 19th centuries, or the complete works of an author from the 17th century, etc. So in a lot of cases, I'm not so much asking the LLM to "find" an answer in its training data as asking it to dig through a bunch of *additional* context I am supplying. Your example of hallucinated quotes is a good case where that makes a big difference. If you provide no new context and ask "did Mark Twain say something about x topic," you'll get hallucinations. But if you drop in the complete works of Mark Twain, it will go very differently.
I think of it as sort of like a word search, but for fuzzier things like general concepts or tones. I.e. you could mention a half-remembered Mark Twain quote about traditions being more durable than laws, and it will direct you to "Laws are sand, customs are rock." But *only* if you give it the full context!
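To make that concrete, here is a rough sketch of the pattern, assuming the OpenAI Python SDK; the file name, model name, and prompt wording are placeholders rather than anything from the post, and a very large corpus may need to be chunked or attached as a file rather than pasted into a single prompt.

```python
# Sketch of the "fuzzy word search" pattern: supply the corpus in the prompt
# and ask the model to answer only from it. File name and model are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Load the full corpus you want searched (e.g. a plain-text collection of letters).
corpus = Path("twain_collected_works.txt").read_text(encoding="utf-8")

question = (
    "Somewhere in the corpus below there may be a line about traditions or customs "
    "being more durable than laws. Quote the exact passage and say where it appears."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer only from the supplied corpus. "
                                      "If the passage is not present, say so."},
        {"role": "user", "content": f"{question}\n\n--- CORPUS ---\n{corpus}"},
    ],
)
print(response.choices[0].message.content)
```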
I loved this article.
I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).
Most of the library is untranslated Latin. I have a book that was recently professionally translated but it has not yet been published. I’d like to benchmark LLMs against this work by having experts rate preference for human translation vs LLM, at a paragraph level.
I’m also interested in a workflow that enables much more rapid LLM transcriptions and translations — whereby experts only need to evaluate randomized pages to establish a known error rate that can be improved over time. This could be contrasted with a perfect critical edition.
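As a very rough illustration of that sampling idea (with made-up numbers, not real review data): review a random subset of pages, count how many contain errors, and report the estimated error rate with a confidence interval that tightens as more pages are checked.

```python
# Sketch of a "known error rate" workflow via random page sampling.
# Hypothetical numbers; errors_per_page would come from expert review.
import math
import random

random.seed(0)

total_pages = 400            # pages transcribed/translated by the LLM
sample_size = 40             # randomized pages an expert actually reviews
sampled = random.sample(range(total_pages), sample_size)

# Faked for illustration: in practice, an expert records errors per sampled page.
errors_per_page = {page: random.choice([0, 0, 0, 1, 2]) for page in sampled}

pages_with_errors = sum(1 for n in errors_per_page.values() if n > 0)
p = pages_with_errors / sample_size                # observed page-level error rate
se = math.sqrt(p * (1 - p) / sample_size)          # standard error (normal approx.)
print(f"Estimated page-level error rate: {p:.1%} ± {1.96 * se:.1%} (95% CI)")
```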
And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn’t obscure—he invented the median and created the field of “empirical aesthetics.” A quick translation of some of his work with Claude immediately revealed the concept I was looking for. Luckily, I had a German speaker around to validate the translation…
LLMs will have a huge impact on humanities scholarship; we need methods and evals.
Would love to join any groups with similar interests to share notes!
Really interesting, thanks for this comment. What you propose about benchmarking the Latin translations against experts is exactly the kind of thing that needs to be happening right now to make these tools trustworthy enough for general purpose research use. Currently I don't trust them in languages which I am not able to read on my own (even though, when it comes to Latin or Italian, it is already far better than me - I just need to be proficient enough to double check it).
Could you email me at bebreen [at] ucsc [dot] edu? I heard from someone working on an ML transcription tool for historical documents who I want to put you in touch with, but I'm not sure if they want me to broadcast it publicly yet since it's in beta. Also keep in touch about the benchmarking and evals work; perhaps we can collaborate in some way.
This is a really fascinating post--thank you for sharing. I do wonder, though, just how good the analysis we get from o1 is. Applying Foucault to Galton is interesting, if not groundbreaking or even particularly original, but I am skeptical of its interpretation of James' motivations. Perhaps I simply don't know my James well enough (correct me if I'm wrong!), but I don't think he "insists that genuine knowledge of the mind should be irreducibly private and personal". Indeed, putting it this way seems to do violence to the complexity of his thinking about the self and his worries about the kind of work Galton was doing.
Still, I can see value in something that puts pieces together in novel and creative ways--this analysis could easily serve as a jumping-off point for further thought.
Agreed about your interpretation of William James. I think there's a general *direction* here which is accurate, but it's overstated. He certainly didn't think that knowledge of the mind was exclusively private or personal - in fact I could imagine someone arguing just the opposite. Another key factor: like anyone, WJ's views changed over time, which is a central perspective that historical analysis should capture and which this analysis fails to do.
That quote is part of what I had in mind when I described it as similar to a first year grad student. It is making real connections and drawing on extensive knowledge, but not entirely landing its argument. When I say LLMs are "good historians," that's what I mean -- not at all that they've replaced human historical analysis, but more that they have recently become good to talk to *about* history. The frontier models of 2025 are more like having a conversation about a topic with a bright PhD student, as opposed to a confused undergrad who had memorized Wikipedia (which was the experience in 2023).
Loved getting a glimpse into the book, to say nothing of the future of doing and teaching history.
Since the early days of ChatGPT, I've imagined what it would have been like to have an LLM transcribe and search all the diaries of school teachers I was trying to absorb. Since I never came close to reading them all myself (it would have taken far more time than I had to write my dissertation), having a tool to search and find references to specific events or books would have rescued what seemed an interesting project.
Tim Lee wrote a piece and did a podcast with Nathan Lambert about how post-training models for narrow purposes are where the action will be, at least in the near future. Your tool is nice evidence for that thesis.
This is impressive stuff... hard to shake the "depressingly good," though. I'm not sure why. It's not a conscious expectation of automation. But seeing historical analysis this good (heck, even the transcription and translation) from an AI... dread, it makes me feel dread.
Benjamin, couldn't a lot of this, especially the analysis portion, be the LLMs absorbing the work you have put into the models, like your uploaded research notes, and just regurgitating it back to you?
It definitely will do a lot of riffing and repeating of your own notes if you supply them *before* requesting analysis, but in this case it was a new instance of the model with no prior context or access to my notes. The Historian's Friend (custom GPT I mentioned for transcription) does have a very lengthy system prompt I wrote explaining how I want it to transcribe things - though interestingly, I'm still not sure if it actually makes any difference at all relative to a new instance of "vanilla" GPT-4o!
If you've uploaded your notes to ChatGPT, it has access to them even if you start a new window/chat.
You mentioned that the median-ness is unlikely to change. And I suspect there's an element of truth to that. But these models are actually excellent role-players. Do you get meaningfully different results if you prompt o1 to produce "boundary-pushing" results from the perspective of different historians or different "styles" of historians?
Good question. I have experimented a fair amount with prompts like "You are a highly experienced, mildly annoyed senior editor at a leading university press. Your task is to read this manuscript and give me a bullet point list, ranked by order of importance, of all the flaws you see in my argument and use of evidence. Be blunt and don't hold back on constructive criticism." That approach noticeably alters the tone and style of a response and, anecdotally, seems to make it a bit higher quality. Asking for different styles of historical analysis definitely works - it will gladly play along if you say something like "Approaching this corpus of public domain texts [which I've dropped into the context] from a social history perspective, identify the key themes here."
Not entirely clear to me if something like "You ARE an expert in social history [etc]" does better, but it sort of does seem to work in that way.
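One low-effort way to test that is to run the same task under two different system prompts and compare the outputs side by side. A minimal sketch, assuming the OpenAI Python SDK; the model name, corpus path, and prompt wording are placeholders:

```python
# Compare a plain system prompt against a "You ARE an expert..." persona framing
# on the same task, to eyeball whether the persona changes anything substantive.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

corpus = Path("public_domain_corpus.txt").read_text(encoding="utf-8")
task = ("Approaching this corpus from a social history perspective, identify "
        "the key themes here.\n\n--- CORPUS ---\n" + corpus)

system_prompts = {
    "plain": "You assist with historical analysis of supplied primary sources.",
    "persona": "You ARE an expert in social history with decades of archival experience.",
}

for label, system_prompt in system_prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```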
A lot of my work on this has been more about roleplaying with LLMs as a teaching tool, as opposed to research, but I do think there's a place for it in both domains: https://resobscura.substack.com/p/roleplaying-with-ai-will-be-powerful-tool
That's helpful context, thanks!
FWIW I've sometimes found qualitative differences when I tell it to be a specific person, as opposed to merely describing the qualities of the person I'm looking for. This is all anecdotal and a jagged frontier, so please take everything I've said with a grain of salt. But here are the general "clusters" of prompts that I've found "work":
1. If you want a specific persona, give it an explicit name and use lots of adjectives to describe the persona. This seems to mostly only really influence the surface-level presentation of the text.
2. If you want specific changes in observable behaviour, then it's better to give it bulleted lists with examples. For example, in your prompt you used "importance" but didn't really describe what constitutes a high- or low-importance flaw. And I actually sometimes wonder if it's better to separate the two in a prompt, sort of like the separation of concerns in HTML/CSS/JS (see the sketch after this list).
3. Thinking about prompts as communication in a low-context culture seems to help me craft better prompts - https://en.wikipedia.org/wiki/High-context_and_low-context_cultures.
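Here is a minimal sketch of what point 2 might look like in practice, keeping the persona and the behavioural spec separate and spelling out the importance rubric; all wording here is illustrative rather than a tested prompt.

```python
# Separate the persona from the behavioural spec, and make "importance" explicit
# with a small rubric and an example. Illustrative wording only.
PERSONA = "You are a mildly annoyed senior editor at a leading university press."

BEHAVIOUR_SPEC = """\
Return a bullet-point list of flaws in the manuscript, ranked by importance.
- High importance: flaws that undermine the central argument or misuse evidence
  (e.g. a claim contradicted by a quoted source).
- Low importance: stylistic issues, awkward transitions, missing signposting.
For each flaw, give one sentence of explanation and cite the relevant passage.
"""

def build_messages(manuscript: str) -> list[dict]:
    """Assemble the prompt, keeping persona and behavioural spec separate."""
    return [
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": BEHAVIOUR_SPEC + "\n--- MANUSCRIPT ---\n" + manuscript},
    ]

# Usage: pass build_messages(manuscript_text) as the `messages` argument of a
# chat-completion call, as in the earlier sketches.
```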
As a shameless plug (but I do think it's relevant here), I've found this "cinematic universe" model of prompting to be sometimes helpful, although there are cases when it isn't - https://github.com/varungodbole/prompt-tuning-playbook?tab=readme-ov-file#background-pre-training-vs-post-training. If you do end up reading it, I'd be curious whether anything in that playbook jibes with you.
I've generally found the effect of capitalization to be mostly noise. That is, it's not worth trying unless one has clear evals that can be used to assess impact.
I can also strongly recommend https://www.transkribus.org/ for working with historical documents. It has a complete workflow, and it is also possible to take one of the base models and train it to get better on the document set you are working with. We are using it to digitize a city archive at the moment, and the people behind it are city archives across Europe, the Swedish National Archives among others.
I know it would take us back to the 1970s, but.. handwritten exams!
A really interesting piece, too. Thank you.
If we just focus on the benefits, I think there is a lot AI models can help with (as with the examples you brought up in the article). It's definitely helping me make better sense of sources in languages I don't speak, even if I have to check plenty of details afterwards.
Brilliant article.
How fascinating, Benjamin. My historical work is modern, from the 1940s onwards. My own research has involved obtaining a heap of primary, secondary, and tertiary texts, some of them almost nonexistent on the Internet, plus a few concentrated, voluminous extractions of original manuscript data from the British and American archives. I’ve experimented with Perplexity, ChatGPT, and Claude, mostly the free versions but some short-term paid versions. I’ve occasionally had one of them track something down that I don’t already have, but mostly they just say I need to go to the archives. Claude in particular is not as good in this area of search as Google. I’ve only recently begun to get them to work on my documents; after reading your newsletter, I’ll push along with this. Again, reading your newsletter suggests to me that I should explore material in other languages (Swedish, French, etc.). I’ve made half-hearted attempts to use them for idea-prompting, with a bit of success. Overall, to date LLMs have been of some use but I haven’t spent enough time on them. Having read this newsletter, I’m spurred to try harder. Thanks.
The idea of flattening is interesting and troubling.
> "I favor the possibility that there is an inherent upper limit in what these models can do once they approach the “PhD level,” if we want to call it that."
Do you feel something like "Humanity's Last Exam" is the right kind of evidence to answer this question, or would we need something substantially different? Forgive me if you've already talked about that project before, I'm a new reader.
That was a fascinating read. The best AI models are right now in a sweet spot where they can be highly useful for all kinds of knowledge tasks when used by an expert in their field. What will it be like ten* years from now?
It's like chess, really. For a long time the machines were just much weaker than humans. Then they were almost as good. Then they were significantly stronger, but humans assisted by machines were still stronger than machines alone.
And then machines were simply stronger than humans across all metrics.
*or rather 2.
Excellent post and great to see a prof from my alma mater pushing the AI frontiers in a more enlightening direction.
What we're all wondering, at least after what DeepSeek demonstrated with R1-Zero, is whether models can be trained from scratch without any human feedback. That could lead to really unique and novel insights, much like how AlphaGo Zero introduced new ways to play the game.