I was reading through a bioinformatics book the other day and was reminded of a useful shortcut for comparing a text against various corpora. A number of researchers have simply fed DNA sequence data into the popular Ziv-Lempel compression algorithm to see how much redundancy it contains.

Loosely speaking, the LZ algorithm (the basis of ZIP) and the related gzip compression look for repeated strings within a text and replace each repeat with a short back-reference to an earlier occurrence. The compression ratio achieved therefore measures how many repeated fragments, words, or phrases occur in the text.
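To see this in action, here is a minimal Python sketch; the two sample strings are invented purely for illustration:

    import gzip
    import random
    import string

    def compression_ratio(data: bytes) -> float:
        # Compressed size divided by original size; lower means
        # more internal repetition for gzip to exploit.
        return len(gzip.compress(data)) / len(data)

    repetitive = ("the cat sat on the mat. " * 100).encode("utf-8")
    scrambled = "".join(
        random.choices(string.ascii_lowercase, k=2500)
    ).encode("utf-8")

    print(compression_ratio(repetitive))  # tiny: each sentence becomes a back-reference
    print(compression_ratio(scrambled))   # much higher: little for gzip to reuse

The exact numbers will vary from run to run, but the gap between the two is precisely the redundancy measurement the researchers are after.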

A related technique allows us to measure how much a given "test" text has in common with a corpus of possibly similar documents. If we concatenate the corpus and the test text and gzip them together, the test text compresses better if it shares many fragments, words, or phrases with the corpus, and worse if it is dissimilar. Because the LZ algorithm looks back through a window of previously seen input for repetitions (a 32 KB window, in gzip's case), it tends to map pieces of the test text onto earlier occurrences in the corpus, yielding a high "appended compression ratio" when the test text resembles what it is appended to.
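Here is how that measurement might look in Python. The function name, the sign convention (extra compressed bytes per test byte, so lower means more similar), and the toy corpora are all my own choices for the sketch:

    import gzip

    def appended_cost(corpus: bytes, test: bytes) -> float:
        # Extra compressed bytes needed for `test` once `corpus` has
        # already been seen, per byte of test text. Lower means the
        # test shares more fragments with the corpus.
        extra = len(gzip.compress(corpus + test)) - len(gzip.compress(corpus))
        return extra / len(test)

    # Invented toy corpora, both small enough to fit in gzip's window.
    english = b"the quick brown fox jumps over the lazy dog. " * 200
    dna = b"ACGTGGTCCAGTTACGGATTACA" * 400
    test = b"the lazy dog jumps over the quick brown fox."

    print(appended_cost(english, test))  # low: many shared phrases
    print(appended_cost(dna, test))      # higher: almost nothing shared

One practical caveat: gzip only remembers the last 32 KB of input, so with a large corpus the test text can only be matched against the corpus's tail. For serious use you would either keep the most representative material near the end or switch to a compressor with a longer memory.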