Notes on email message threading
I sent an email to Jamie Zawinski with feedback on his venerable email threading algorithm. Perhaps my commentary will be a useful reference to others implementing email threading.
You can see my implementation of his algorithm at https://git.lukeshu.com/www/tree/cmd/generate/mailstuff/thread_alg.go (and a use of it at https://git.lukeshu.com/www/tree/cmd/generate/mailstuff/thread.go).
To: Jamie Zawinski <jwz@jwz.org>
Subject: message threading
Date: Sat, 08 Jun 2024 22:34:41 -0600
Message-ID: <87tti2ybry.wl-lukeshu@lukeshu.com>
Hi,
I'm implementing message threading, and have been referencing both your document <https://www.jwz.org/doc/threading.html> and RFC 5256. I'm not sure whether you're interested in updating a document that's more than 25 years old, but if you are: I hope you find the following feedback valuable.
You write that the algorithm in RFC 5256 is merely a restating of your algorithm, but I noticed 3 (minor) differences:
In your step 1.C, the RFC says to check whether this would create a loop and, if it would, to skip creating the link; your version only says to perform this check in step 1.B.
The RFC says to sort the messages by date between your steps 4 and 5; that is: when grouping by subject, containers in the root set should be processed in date-order (you do not specify an order), and if a container in the root set is empty then the subject should be taken from the earliest-date child (you say to use an arbitrary child).
The RFC precisely states how to trim a subject down to a "base subject," rather than simply saying:
"Strip ``Re:'', ``RE:'', ``RE[5]:'', ``Re: Re[4]: Re:'' and so on."
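To make the first difference concrete, here is a minimal Go sketch of the loop check applied when setting a parent, as in step 1.C. The names Container, hasAncestor, and linkUnlessLoop are hypothetical and come from neither document; the sketch tracks only the parent link and omits the child lists (and the removal from an old parent) that a full implementation would also maintain.

```go
package main

// Container is a hypothetical stand-in for the algorithm's container
// node; only the parent link matters for this check.
type Container struct {
	ID     string
	Parent *Container
}

// hasAncestor reports whether anc is c itself or is reachable from c
// by following Parent links upward.
func hasAncestor(c, anc *Container) bool {
	for cur := c; cur != nil; cur = cur.Parent {
		if cur == anc {
			return true
		}
	}
	return false
}

// linkUnlessLoop makes parent the parent of child, unless child is
// already parent or one of parent's ancestors, in which case the link
// would create a loop and is skipped. It reports whether the link was
// made.
func linkUnlessLoop(parent, child *Container) bool {
	if hasAncestor(parent, child) {
		return false
	}
	child.Parent = parent
	return true
}
```

With this shape, the same helper could serve both step 1.B and step 1.C, so the check is not accidentally left out of one of them.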
Additionally, there are two minor points on which I found their version to be clearer:
The RFC specifies how to handle messages without a Message-Id or with a duplicate Message-Id (on page 9), as well as how to normalize a Message-Id (by referring to RFC 2822). This is perhaps out-of-scope of your algorithm document, but I feel that it would be worth mentioning in your background or definitions section.
In your step 1.B, I did not understand what
"If they are already linked, don't change the existing links"
meant until I read the RFC, which words it as "If a message already has a parent, don't change the existing link." It was unclear to me what "they" was referring to in your version.
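As an illustration of the first point above, the following Go sketch shows one way to give every message a usable ID before threading: a message with a missing Message-Id, or with one already seen on an earlier message, receives a generated placeholder. The function name assignThreadIDs and the "synthetic-N@invalid" scheme are my own assumptions, not taken from the RFC or from your document. The input is assumed to be already extracted and normalized, with an empty string standing for a missing or invalid Message-Id.

```go
package main

import "fmt"

// assignThreadIDs returns one ID per input message, in the same order.
// The first message to use a given ID keeps it; a message with an
// empty ID, or with an ID already used earlier in the slice, gets a
// generated placeholder instead.
func assignThreadIDs(messageIDs []string) []string {
	seen := make(map[string]bool, len(messageIDs))
	out := make([]string, len(messageIDs))

	next := 0
	// fresh returns a placeholder ID that has not been handed out yet.
	// The "synthetic-N@invalid" format is an arbitrary choice.
	fresh := func() string {
		for {
			id := fmt.Sprintf("synthetic-%d@invalid", next)
			next++
			if !seen[id] {
				return id
			}
		}
	}

	for i, id := range messageIDs {
		if id == "" || seen[id] {
			id = fresh()
		}
		seen[id] = true
		out[i] = id
	}
	return out
}
```

Processing the messages in a stable order (for example, by sequence number) keeps it predictable which of several duplicates retains the original ID.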
--
Happy hacking,
~ Luke T. Shumaker