I've used AI for similar things. It's very good at transcription, translation, and summaries of texts you give it. It does, however, have its limitations for some historical tasks. For example, I've been looking for the source of a half-remembered quote from something I'd read 20 years ago, about Japanese-American relations. I remember the gist of the quote and plugged it into both ChatGPT and Claude, and both confidently gave me specific sources that were incorrect. So, it's not quite ready for prime time as a research assistant. (In AI's favor, I'm beginning to think the quote doesn't exist, that it was something I thought up myself and attributed to my reading. But, still, the answers were hallucinations.)
You can't really use it yet for deep drill-downs in research, as a substitute for doing the reading yourself. It'll give you a better-than-Wikipedia summary of a topic, but it lacks the nuance and depth of really good historical research.
Oh yeah, I forgot to say that this post was great. I really enjoyed it and it was very useful to me.
Thank you, glad you liked it, and agreed about this sort of thing being an issue.
I am beginning to realize that one big reason why my experience with these tools differs from that of many other people is that I typically work with pretty large sets of public domain documents - think ~1k letters sent in the 18th or 19th centuries, or the complete works of an author from the 17th century, etc. So in a lot of cases, I'm not so much asking the LLM to "find" an answer in its training data as asking it to dig through a bunch of *additional* context I am supplying. Your example of hallucinated quotes is a good case where that makes a big difference. If you provide no new context and ask "did Mark Twain say something about x topic," you'll get hallucinations. But if you drop in the complete works of Mark Twain, it will go very differently.
I think of it as sort of like a word search, but for fuzzier things like general concepts or tones. I.e. you could mention a half-remembered Mark Twain quote about traditions being more durable than laws, and it will direct you to "Laws are sand, customs are rock." But *only* if you give it the full context!
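To make that concrete, here is a rough sketch of the pattern, assuming the OpenAI Python SDK; the file name, model name, and prompt wording are placeholders rather than anything from the post, and a very large corpus may need to be chunked or attached as a file rather than pasted into a single prompt.

```python
# Sketch of the "fuzzy word search" pattern: supply the corpus in the prompt
# and ask the model to answer only from it. File name and model are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Load the full corpus you want searched (e.g. a plain-text collection of letters).
corpus = Path("twain_collected_works.txt").read_text(encoding="utf-8")

question = (
    "Somewhere in the corpus below there may be a line about traditions or customs "
    "being more durable than laws. Quote the exact passage and say where it appears."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer only from the supplied corpus. "
                                      "If the passage is not present, say so."},
        {"role": "user", "content": f"{question}\n\n--- CORPUS ---\n{corpus}"},
    ],
)
print(response.choices[0].message.content)
```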
I loved this article.
I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).
Most of the library is untranslated Latin. I have a book that was recently professionally translated but it has not yet been published. I’d like to benchmark LLMs against this work by having experts rate preference for human translation vs LLM, at a paragraph level.
I’m also interested in a workflow that enables much more rapid LLM transcriptions and translations — whereby experts only need to evaluate randomized pages to establish a known error rate that can be improved over time. This could be contrasted with a perfect critical edition.
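As a very rough illustration of that sampling idea (with made-up numbers, not real review data): review a random subset of pages, count how many contain errors, and report the estimated error rate with a confidence interval that tightens as more pages are checked.

```python
# Sketch of a "known error rate" workflow via random page sampling.
# Hypothetical numbers; errors_per_page would come from expert review.
import math
import random

random.seed(0)

total_pages = 400            # pages transcribed/translated by the LLM
sample_size = 40             # randomized pages an expert actually reviews
sampled = random.sample(range(total_pages), sample_size)

# Faked for illustration: in practice, an expert records errors per sampled page.
errors_per_page = {page: random.choice([0, 0, 0, 1, 2]) for page in sampled}

pages_with_errors = sum(1 for n in errors_per_page.values() if n > 0)
p = pages_with_errors / sample_size                # observed page-level error rate
se = math.sqrt(p * (1 - p) / sample_size)          # standard error (normal approx.)
print(f"Estimated page-level error rate: {p:.1%} ± {1.96 * se:.1%} (95% CI)")
```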
And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn’t obscure—he invented the median and created the field of “empirical aesthetics.” A quick translation of some of his work with Claude immediately revealed the concept I was looking for. Luckily, I had a German speaker around to validate the translation…
LLMs will have a huge impact on humanities scholarship; we need methods and evals.
Would love to join any groups with similar interests to share notes!
Really interesting, thanks for this comment. What you propose about benchmarking the Latin translations against experts is exactly the kind of thing that needs to be happening right now to make these tools trustworthy enough for general purpose research use. Currently I don't trust them in languages which I am not able to read on my own (even though, when it comes to Latin or Italian, it is already far better than me - I just need to be proficient enough to double check it).
Could you email me at bebreen [at] ucsc [dot] edu? I heard from someone working on an ML transcription tool for historical documents who I want to put you in touch with, but I'm not sure if they want me to broadcast it publicly yet since it's in beta. Also keep in touch about the benchmarking and evals work; perhaps we can collaborate in some way.
This is a really fascinating post--thank you for sharing. I do wonder, though, just how good the analysis we get from o1 is. Applying Foucault to Galton is interesting, if not groundbreaking or even particularly original, but I am skeptical of its interpretation of James' motivations. Perhaps I simply don't know my James well enough (correct me if I'm wrong!), but I don't think he "insists that genuine knowledge of the mind should be irreducibly private and personal". Indeed, putting it this way seems to do violence to the complexity of his thinking about the self and his worries about the kind of work Galton was doing.
Still, I can see value in something that puts pieces together in novel and creative ways--this analysis could easily serve as a jumping-off point for further thought.
Agreed about your interpretation of William James. I think there's a general *direction* here which is accurate, but it's overstated. He certainly didn't think that knowledge of the mind was exclusively private or personal - in fact I could imagine someone arguing just the opposite. Another key factor: like anyone, WJ's views changed over time, which is a central perspective that historical analysis should capture and which this analysis fails to do.
That quote is part of what I had in mind when I described it as similar to a first year grad student. It is making real connections and drawing on extensive knowledge, but not entirely landing its argument. When I say LLMs are "good historians," that's what I mean -- not at all that they've replaced human historical analysis, but more that they have recently become good to talk to *about* history. The frontier models of 2025 are more like having a conversation about a topic with a bright PhD student, as opposed to a confused undergrad who had memorized Wikipedia (which was the experience in 2023).
Loved getting a glimpse into the book, to say nothing of the future of doing and teaching history.
Since the early days of ChatGPT, I've imagined what it would have been like to have an LLM transcribe and search all the diaries of school teachers I was trying to absorb. Since I never came close to reading them all myself (it would have taken far more time than I had to write my dissertation), having a tool to search and find references to specific events or books would have rescued what seemed an interesting project.
Tim Lee wrote a piece and did a podcast with Nathan Lambert about how post-training models for narrow purposes are where the action will be, at least in the near future. Your tool is nice evidence for that thesis.
This is impressive stuff... hard to shake the "depressingly good," though. I'm not sure why. It's not a conscious expectation of automation. But seeing historical analysis this good (heck, even the transcription and translation) from an AI... dread, it makes me feel dread.
Benjamin, couldn't a lot of this, especially the analysis portion, be the LLMs absorbing the work you have put into the models, like your uploaded research notes, and just regurgitating it back to you?
It definitely will do a lot of riffing and repeating of your own notes if you supply them *before* requesting analysis, but in this case it was a new instance of the model with no prior context or access to my notes. The Historian's Friend (custom GPT I mentioned for transcription) does have a very lengthy system prompt I wrote explaining how I want it to transcribe things - though interestingly, I'm still not sure if it actually makes any difference at all relative to a new instance of "vanilla" GPT-4o!
If you've uploaded your notes to ChatGPT, it has access to them even if you start a new window/chat.
You mentioned that the median-ness is unlikely to change. And I suspect there's an element of truth to that. But these models are actually excellent role-players. Do you get meaningfully different results if you prompt o1 to produce "boundary-pushing" results from the perspective of different historians or different "styles" of historians?
Good question. I have experimented a fair amount with prompts like "You are a highly experienced, mildly annoyed senior editor at a leading university press. Your task is to read this manuscript and give me a bullet point list, ranked by order of importance, of all the flaws you see in my argument and use of evidence. Be blunt and don't hold back on constructive criticism." That approach noticeably alters the tone and style of a response and, anecdotally, seems to make it a bit higher quality. Asking for different styles of historical analysis definitely works - it will gladly play along if you say something like "Approaching this corpus of public domain texts [which I've dropped into the context] from a social history perspective, identify the key themes here."
Not entirely clear to me if something like "You ARE an expert in social history [etc]" does better, but it sort of does seem to work in that way.
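One low-effort way to test that is to run the same task under two different system prompts and compare the outputs side by side. A minimal sketch, assuming the OpenAI Python SDK; the model name, corpus path, and prompt wording are placeholders:

```python
# Compare a plain system prompt against a "You ARE an expert..." persona framing
# on the same task, to eyeball whether the persona changes anything substantive.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

corpus = Path("public_domain_corpus.txt").read_text(encoding="utf-8")
task = ("Approaching this corpus from a social history perspective, identify "
        "the key themes here.\n\n--- CORPUS ---\n" + corpus)

system_prompts = {
    "plain": "You assist with historical analysis of supplied primary sources.",
    "persona": "You ARE an expert in social history with decades of archival experience.",
}

for label, system_prompt in system_prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```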
A lot of my work on this has been more about roleplaying with LLMs as a teaching tool, as opposed to research, but I do think there's a place for it in both domains: https://resobscura.substack.com/p/roleplaying-with-ai-will-be-powerful-tool
That's helpful context, thanks!
FWIW I've sometimes found qualitative differences when I tell it to be a specific person, as opposed to merely describing the qualities of the person I'm looking for. This is all anecdotal and a jagged frontier, so please take everything I've said with a grain of salt. But here are the general "clusters" of prompts that I've found "work":
1. If you want a specific persona, give it an explicit name and use lots of adjectives to describe the persona. This seems to mostly only really influence the surface-level presentation of the text.
2. If you want specific changes in observable behaviour, then it's better to give it bulleted lists with examples. For example, in your prompt you used "importance" but didn't really describe what constitutes a high- or low-importance flaw. And I actually sometimes wonder if it's better to separate the two in a prompt, sort of like the separation of concerns in HTML/CSS/JS (see the sketch after this list).
3. Thinking about prompts as communication in a low-context culture seems to help me craft better prompts - https://en.wikipedia.org/wiki/High-context_and_low-context_cultures.
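Here is a minimal sketch of what point 2 might look like in practice, keeping the persona and the behavioural spec separate and spelling out the importance rubric; all wording here is illustrative rather than a tested prompt.

```python
# Separate the persona from the behavioural spec, and make "importance" explicit
# with a small rubric and an example. Illustrative wording only.
PERSONA = "You are a mildly annoyed senior editor at a leading university press."

BEHAVIOUR_SPEC = """\
Return a bullet-point list of flaws in the manuscript, ranked by importance.
- High importance: flaws that undermine the central argument or misuse evidence
  (e.g. a claim contradicted by a quoted source).
- Low importance: stylistic issues, awkward transitions, missing signposting.
For each flaw, give one sentence of explanation and cite the relevant passage.
"""

def build_messages(manuscript: str) -> list[dict]:
    """Assemble the prompt, keeping persona and behavioural spec separate."""
    return [
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": BEHAVIOUR_SPEC + "\n--- MANUSCRIPT ---\n" + manuscript},
    ]

# Usage: pass build_messages(manuscript_text) as the `messages` argument of a
# chat-completion call, as in the earlier sketches.
```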
As a shameless plug (but I do think it's relevant here), I've found this "cinematic universe" model of prompting to be sometimes helpful, although there are cases when it isn't - https://github.com/varungodbole/prompt-tuning-playbook?tab=readme-ov-file#background-pre-training-vs-post-training. If you do end up reading it, I'd be curious whether anything in that playbook jibes with you.
I've generally found the effect of capitalization to be mostly noise. That is, it's not worth trying unless one has clear evals that can be used to assess impact.
I can also strongly recommend https://www.transkribus.org/ for working with historical documents. It has a complete workflow, and it is also possible to take one of the base models and train it to get better on the document set you are working with. We are using it to digitize a city archive at the moment, and the people behind it are city archives across Europe, the Swedish National Archives among others.
I know it would take us back to the 1970s, but.. handwritten exams!
A really interesting piece, too. Thank you.
If we just focus on the benefits, I think there is a lot AI models can help with (as with the examples you brought up in the article). It's definitely helping me make better sense of sources in languages I don't speak, even if I have to check plenty of details afterwards.
Brilliant article.
How fascinating, Benjamin. My historical work is modern, from the 1940s onwards. My own research has involved obtaining a heap of primary, secondary, and tertiary texts, some of them almost nonexistent on the Internet, plus a few concentrated, voluminous extractions of original manuscript data from the British and American archives. I’ve experimented with Perplexity, ChatGPT, and Claude, mostly the free versions but some short-term paid versions. I’ve occasionally had one of them track something down that I don’t already have, but mostly they just say I need to go to the archives. Claude in particular is not as good in this area of search as Google. I’ve only recently begun to get them to work on my documents; after reading your newsletter, I’ll push along with this. Again, reading your newsletter suggests to me that I should explore material in other languages (Swedish, French, etc.). I’ve made half-hearted attempts to use them for idea-prompting, with a bit of success. Overall, to date LLMs have been of some use but I haven’t spent enough time on them. Having read this newsletter, I’m spurred to try harder. Thanks.
The idea of flattening is interesting and troubling.
> "I favor the possibility that there is an inherent upper limit in what these models can do once they approach the “PhD level,” if we want to call it that."
Do you feel something like "Humanity's Last Exam" is the right kind of evidence to answer this question, or would we need something substantially different? Forgive me if you've already talked about that project before, I'm a new reader.
That was a fascinating read. The best AI models are right now in a sweet spot where they can be highly useful for all kinds of knowledge tasks when used by an expert in their field. What will it be like ten* years from now?
It's like chess, really. For a long time the machines were just much weaker than humans. Then they were almost as good. Then they were significantly stronger, but humans assisted by machines were still stronger than machines alone.
And then machines were simply stronger than humans across all metrics.
*or rather 2.
Excellent post and great to see a prof from my alma mater pushing the AI frontiers in a more enlightening direction.
What we're all wondering, at least after what DeepSeek demonstrated with R1-Zero, is whether models can be trained from scratch without any human feedback. That could lead to really unique and novel insights, much like how AlphaGo Zero introduced new ways to play the game.