A case study in PDF forensics: The Epstein PDFs

pdfa.org

341 points by DuffJohnson a day ago


Grisu_FTP - 3 minutes ago

PDF Files about PDF Files

anigbrowl - 20 hours ago

I found this part interesting:

There are also other documents that appear to simulate a scanned document but completely lack the “real-world noise” expected with physical paper-based workflows. The much crisper images appear almost perfect without random artifacts or background noise, and with the exact same amount of image skew across multiple pages. Thanks to the borders around each page of text, page skew can easily be measured, such as with VOL00007\IMAGES\0001\EFTA00009229.pdf. It is highly likely these PDFs were created by rendering original content (from a digital document) to an image (e.g., via print to image or save to image functionality) and then applying image processing such as skew, downscaling, and color reduction.
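As a rough illustration of the skew measurement the quote describes: assuming two points along a page border have already been located (the pixel coordinates below are made up for the example, not taken from any of the released files), the skew is just the angle of that line relative to horizontal. Comparing that angle across pages shows whether they share exactly the same skew. The function names and the tolerance are hypothetical.

```python
import math

def skew_degrees(p1, p2):
    """Angle (degrees) of the line p1 -> p2 relative to the horizontal axis."""
    (x1, y1), (x2, y2) = p1, p2
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

def same_skew(angles, tol=0.01):
    """True if all angles agree within `tol` degrees (tolerance is an assumption)."""
    return max(angles) - min(angles) <= tol

# Hypothetical top-border endpoints (x, y) in pixels for three pages.
pages = [
    ((100, 50), (1100, 60)),
    ((100, 52), (1100, 62)),
    ((98, 50), (1098, 60)),
]
angles = [skew_degrees(a, b) for a, b in pages]
print(angles, same_skew(angles))
```

A real scanner feed would normally give slightly different angles per page, so identical values across many pages would be consistent with a skew applied in software.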

ted_bunny - 21 hours ago

Has anyone analysed JE's writing style and looked for matches in archived 4chan posts or content from similar platforms? Same with Ghislaine, there should be enough data to identify them atp right? I don't buy the MaxwellHill claims for various reasons but it doesn't mean there's nothing to find.

yonatan8070 - 20 hours ago

A bit off-topic, but I find it kinda funny that the "Decline" button on the cookie popup on this page is labeled "Continue without consent".

nullbyte808 - 2 hours ago

The DOJ is technically breaking the law by releasing a heavily modified "reproduction" of the original files, not the "actual" files. The software they used, "OmniPage CSDK 21.1", removes all useful metadata, along with any encrypted files, if any were stored.

waynenilsen - a day ago

> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.

hopefully someone is independently archiving all documents

my understanding is that some are being removed

embedding-shape - a day ago

Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing the results with the OCR the DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm sitting on ~500K images to OCR, so this is taking quite a while to run through.
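For anyone wanting to do a similar comparison, here is a minimal sketch of how two OCR outputs for the same page could be scored against each other. It uses only the standard library's `difflib`; the function names, the page IDs, and the 0.9 threshold are assumptions made for the example, not a description of the pipeline above.

```python
import difflib

def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two OCR outputs for the same page."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()

def flag_mismatches(pairs, threshold=0.9):
    """Return the IDs of pages whose two OCR outputs differ noticeably.

    `pairs` is an iterable of (page_id, ocr_a, ocr_b); the threshold is arbitrary.
    """
    return [page_id for page_id, a, b in pairs if similarity(a, b) < threshold]

# Made-up sample data.
sample = [
    ("page-1", "Meeting at  10am", "meeting at 10am"),
    ("page-2", "Meeting at 10am", "Mcctinq ot lOom"),
]
print(flag_mismatches(sample))
```

`SequenceMatcher` is quadratic in the worst case, so for hundreds of thousands of pages a cheaper first pass (for example, comparing word sets) would probably be needed.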

originalvichy - a day ago

Any guesses why some of the newest files seem to have random ”=” characters in the text? My first thought was OCR, but it seemed to not be linked to characters like ”E” that could be mistakenly interpreted by an OCR tool. My second guess is just making it more difficult to produce reliable text searches, but probably 90% of HN readers could find a way to make a search tool that does not fall apart in case a ”=” character is found (although making this work for long search queries would make the search slower).
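To illustrate the kind of search tool I mean, here is a minimal sketch that builds a regular expression allowing an optional stray "=" between every pair of characters in the query. The function names are made up for the example, and it only handles a single "=" between characters.

```python
import re

def tolerant_pattern(query):
    """Compile a case-insensitive regex matching `query` with optional '=' between characters."""
    return re.compile("=?".join(re.escape(ch) for ch in query), re.IGNORECASE)

def find_all(query, text):
    """Return (start, end) spans of every tolerant match of `query` in `text`."""
    return [m.span() for m in tolerant_pattern(query).finditer(text)]

print(find_all("flight", "the fl=ight log"))
```

The alternative is to strip every "=" from the text before indexing, which is simpler but shifts character offsets and also removes legitimate "=" characters.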

JKCalhoun - 9 hours ago

Interesting, there are a handful of PDFs in the drop that appear to be an email with a Base64 encoded attachment—inline.

OCR is so bad of course that decoding the Base64 seems futile without a lot of effort.

Example: https://www.justice.gov/epstein/files/DataSet%2011/EFTA02609...

(More mentioned here: https://old.reddit.com/r/Epstein/comments/1qu9az2/theres_unr...)
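A minimal sketch of a first triage step, assuming the OCR text of the attachment has already been pulled out into a string: flag the lines containing characters that cannot occur in Base64, then attempt a strict decode. The function names are hypothetical. Note that this cannot catch OCR confusions that stay inside the alphabet (l/I/1, O/0), which is where most of the effort would go.

```python
import base64
import string

# Standard Base64 alphabet plus the padding character.
B64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")

def suspicious_lines(ocr_text):
    """Return (line_number, bad_chars) for lines with characters outside Base64."""
    report = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        bad = sorted(set(line.strip()) - B64_ALPHABET)
        if bad:
            report.append((number, bad))
    return report

def try_decode(ocr_text):
    """Join the lines and decode strictly; return None if the data is not valid Base64."""
    joined = "".join(ocr_text.split())
    try:
        return base64.b64decode(joined, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None

clean = "aGVsbG8g\nd29ybGQ="    # "hello world", wrapped over two lines
damaged = "aGVsbG8g\nd29y.GQ="  # made-up OCR damage: '.' in place of a letter
print(try_decode(clean), suspicious_lines(damaged), try_decode(damaged))
```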

Beijinger - 17 hours ago

What would be more interesting: His Bank accounts.

Who paid him?

Who did get paid?

_def - 21 hours ago

I can't even download the archive; the transmission always terminates just before it's finished. Spooky.

direwolf20 - 2 hours ago

Blocked by Cloudflare

shevy-java - 14 hours ago

So I have been wondering about this ...

Some of the gathered data is shown here, right? Probably not all.

Now ... that's static information though. That's not really an analysis, most definitely not an independent (open ended) analysis. And it will only show a very incomplete part of the full picture.

This is why I think the "release the files" movement, as good as it is, seems incomplete. I'd rather know a lot more about how they operate their networks and get away with it when underage women are involved. What about the secret services of other countries? Should that not also be highly important? So why is there no larger investigation, and no independent analysis? Those .pdf files alone cannot tell the whole picture. They may be just the tip of the iceberg, and it evidently involves other countries too, with Prince Andrew being the most famous example here (i.e., the UK; but we have already seen that other countries have similar issues, where people suddenly had to step away from politics once it was found out that they had visited Mr. Epstein's party locations).

nkozyra - a day ago

> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata

Maybe I'm underestimating the full scope of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower-DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of this and doing it for decades at this point, right?

bugeats - a day ago

Somebody ought to train an LLM exclusively on this text, just for funsies.

corygarms - a day ago

These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.

RT_max - 4 hours ago

Love the forensic craft here. Worth noting that the 'recoverable redactions' story that went viral was based on older, unrelated DOJ documents — not the EFTA files, which were properly redacted. The misinformation spread faster than anyone could debunk it. Which is kind of its own forensics problem.

tibbon - a day ago

That's a lot of PeDoFiles!

(But seriously, great work here!)

Ms-J - 2 hours ago

Stylometry works. I've seen it used in cases where the individual was identified from a group.

One thing that is telling about the Epstein case study is how long it has stayed in public view. Pizzagate, which involved more powerful people, was shut down faster than I've ever seen for anything else. I still remember it and have archived the more extreme content; it's sick.

mmooss - 20 hours ago

What is the legal basis for releasing someone's private files and communications? If they can do it to Epstein, they can do it to you, to the Washington Post journalist, to former President Clinton, etc.

Is the scope at least limited somehow? Generally I favor transparency, but of course probably the most important parts are withheld.

meidan_y - a day ago

(2025) just follow hn guideline, impressive voter ring though

NoToP - 20 hours ago

This is so incredibly useful to me right now for incidental reasons I am commenting to make sure I can get back to it.