Kompression      zurück   Orginaltexte   Links    Fragen


Links zu den Begriffen Information , Entropie, Zufall , Struktur,  

Was bedeutet Kompression?

Letztendlich ist Kompression (Komprimieren = verdichten, zusammendrücken) auf verschiedene Bereiche des täglichen Lebens anwendbar. Speicherkapazität ist auch im Zeitalter von Gigabyte-Festplatte noch immer beschränkt und der Datentransfer im Internet fordert den Versand von Daten nicht in der Originalform, sondern im komprimierten Zustand (aus Zeit- und somit Kostengründen).

Weiterhin ist Kompression für Backups sinnvoll, da ein Realtime-Zugriff nicht notwendig ist. Kompression sollte natürlich nur begrenzt in Bereichen eingesetzt werden, in denen ein besonders schneller Zugriff auf Daten (z. B. in Realtime) notwendig ist, denn die Dekomprimierung fordert immer Rechenkapazität (mittlerweile sind die Rechner aber so schnell, daß in einigen Segmenten bereits in nahezu Realtime komprimiert und dekomprimiert wird; man denke an DoubleDisk von Microsoft).

Stichworte

ZIP, Zippen, Komprimieren, Komprimierung, Datenkompression, Dateikompression, Winzip, komprimierte Dateien, Videokompression, Kompression, mp3 , zip, png , jpg

Das grundsätzliche Prinzip

Die grundsätzliche Funktionsweise der Kompression kann in wenigen Worten zusammengefaßt werden. Es gibt zwei fundamentale Methoden:

Redundanzen (Wiederholungen) von Daten(-Strängen) werden erkannt und zusammengefaßt (verkürzte Datenmodellierung)

Unnötige und teilweise unwichtige Daten werden gelöscht.

Die Löschung von Daten kann natürlich nur vorgenommen werden, wenn diese nicht mehr benötigt werden und auch später zur Verarbeitung unrelevant sind. Diese Art der Kompression wird z. B. für Video- oder Audiodaten verwendet.

Achtung: natürlich können beide Prinzipien kombiniert werden.

Übersicht der gängigen Verfahren

LZW-Codierung (Lempel, Ziv, Welch)

Es wird ein Wörterbuch aufgebaut (ein Wort wird durch ein Byte definiert) und im Endeffekt wird für ein Folgewort nur noch ein Zeiger auf das erste Wort/das Wort im Wörterbuch gesetzt. Diese Kompression eignet sich natürlich sehr gut für Textdokumente, allerdings auch für Binarys, die viele gleiche Zeichenstränge beinhalten. LZW existiert mittlerweile in vielen Abwandelungen und in einigen sehr stark optimierten Formen. Die Kompression erfolgt verlustfrei, der Ursprungszustand des Dokumentes kann also wiederhergestellt werden.

RLE - Run Length Encoding

Einzelne gleiche Bytefolgen (z. B. 3, 3, 3, 3) werden zahlenmäßig zusammengefaßt (z. B. 43, also 4 x 3) und nacheinander gespeichert. Diese Methode eignet sich hervorragend für Schwarz/Weiß-Bilder, sofern es sich nicht um ein gleichmäßiges Störbild handelt (also 1,0,1,0), da dann eine Verfielfältigung der Dateigröße vorgenommen wird. Die Kompression ist auch verlustfrei.

Huffmann-Codierung

Alle Zeichen werden in einem Table gespeichert. Der Table wird so sortiert, daß häufig vorkommende Zeichen einen kürzeren Identifiaktionscode bekommen (z. B. ein oder zwei Bit), als ein nicht so häufig vorkommendes Zeichen (z. B. vier Bit). In Abwandelung kann auch ein Zeichenfolgentable gebildet werden. Letztendlich wird die Datei dann so gespeichert, daß ein Abbild der Originaldatei gebildet wird, in der dann nur die über den Table definierten Zeichen gespeichert werden (also anstelle eines E, dann ein Bit mit 0). Der Table muß bei dieser Komprimierungsmethode mit in die Datei eingefügt werden, da sonst eine Dekomprimierung unmöglich ist. Dieses Vorgehen kann den Nachteil bergen, daß eine kurze Datei somit länger werden kann. Auch diese Komprimierungsform ist verlustfrei.

Wavelet-Kompression

Das Objekte/die Datei wird komplett betrachtet und mittels mathematischer Berechnung (Koeffizienten, Hoch- und Tiefpaßfilterung, etc.) gesplittet (non-blocking Strategie). Letztendlich ist dieses Verfahren allerdings für dieses Tutorial zu komplex. Insgesamt ist die Methode nicht verlustfrei (also nicht für Backups geeignet) und in der Folge der Berechnung wird meistens noch einmal per RLE oder Huffmann zusätzlich komprimiert. Diese Methode ist bestens für Bilder geeignet.

Fraktale-Kompression

Ein Bild wird in verschiedene Fragmente (geometrische Strukturen) geteilt und diese Fragmente werden dann durch pushing / turning / zooming / sizing verglichen. Es werden sogenannte Domain- und Rangeblocks erzeugt (dynamisch). Zueinander ähnliche Blöcke werden gesucht; der Faktor der Ähnlichkeit kann eingestellt werden und hierüber bestimmt sich dann der Datenverlust. Jeder Domainblock wird mit mit jedem Rangeblock verglichen und außerdem wird die jeweilige Distanz zwischen den Domain- und dem transformierten Rangeblock ermittelt. Das Bild wird letztendlich durch eine Viezahl mathematischer Gleichungen präsentiert. Eine Dekodierung erfolgt durch Abarbeitung der Gleichungen und den Aufbau der Fragmente. Diese Komprimierung ist eigentlich nur für Bilder geeignet und auch nicht verlustfrei.

Übersicht der gängigen Datenformate

JPEG

Erkennt und entfernt Farbunterschiede / Helligkeitsunterschiede (für das menschliche Auge unrelevante Daten). Hierdurch ist eine gute Kompression erreichbar (einstellbar; die Qualität nimmt mit höhere Komprimierung ab). Das Verfahren ist nicht verlustfrei.

GIF

Im Standardformat hat GIF eine 8-Bit-Schlüsselung und verwendet den LZW-Algorithmus. Die Komprimierung ist verlustfrei und es gibt verschiedene Arten der Speicherung und der späteren Darstellung (z. B. Rasterung während des Aufbaus).

PNG

Dienst als Nachfolgeformat der GIF-Definition. Die Farbtiefe ist bis 64-Bit (Photorealismus) möglich und es werden weitere Filter eingesetzt.

ZIP

Das ZIP-Format kombiniert diverse Komprimierungsverfahren (Huffmann, LZW, RLE und andere). Das ZIP-Format und der Einsatz der Kompressionsalgorithmen ist standardmäßig definiert.


http://www.educeth.ethz.ch/informatik/interaktiv/kompression/huffman.html

Allgemeine Beschreibung von Kompression in der EDV

Ein Verfahren,

Die einfachste Methode - nämlich die "Lauflängen-Kodierung" bzw. "run length compression" - beruht darauf, beispielsweise die 5 RGB-Werte aufeinanderfolgender Bildpunkte nicht auszuschreiben sondern gezählt abzuspeichern:

statt "12/56/177,12/56/177,12/56/177,12/56/17712/56/177"

also "5 mal 12/56/177"

Bei den Komprimierungs-Methoden werden prinzipiell u.a. unterscheiden:

Eine Komprimierungsmethode, bei der die Originaldaten erhalten werden - was für Programme, Texte oder Tabellen unumgänglich ist. Packer wie Winzip oder ARJ arbeiten mit dieser Methode (genauer: mit der von Lempel und Ziv entwickelten LZ77-Komprimierung in Kombination mit einer Huffman-Codierung).

Im allgemeinen ist eine verlustfreie Komprimierung bei digitalisierten Videos und gescannten Fotografien nicht sehr effektiv.

Bei verlustbehafteten Komprimierungs-Methoden gehen Informationen unwiederbringlich verloren. Diese Verfahren komprimieren sehr stark, können aber nur für Datentypen eingesetzt werden, bei denen Verluste wenig auffallen oder die sich verlustfrei nur schlecht komprimieren lassen - dazu zählen Audio-, Video- und Bilddaten

Die Datenmenge, die verlorengeht, hängt vom Grad der in der Regel einstellbaren Komprimierung ab. Wichtig zu wissen ist, dass viele verlustreiche Komprimierungsmethoden bei erneuter Komprimierung weitere Inhalte verlieren. Dieser zusätzliche Verlust ist allerdings abhängig von der Komprimierungsmethode; der QuickTime Video Codec wurde z.B. so entwickelt, daß der zusätzliche Verlust bei erneuter Komprimierung gering gehalten wird (siehe auch Videokompression).

Bei der räumlichen (Video-)Komprimierung werden die einzelnen Frames einer Video-Sequenz für sich komprimiert - z.B. nach der JPEG-Methode. Häufig auftretende Nebenwirkung der räumlichen Komprimierung sind Unschärfen, Blockbildungen (Kompressions-Artefakte), Streifen (Linien gleicher Farbe) und Konturbildungen (Bereiche gleicher Farbe).

Bei der zeitlichen (Video-)Komprimierung werden immer mehrere Frames einer Sequenz berücksichtigt, indem nämlich nur die Änderungen gegenüber dem Vorgänger-Frame gespeichert werden. Als Referenz werden allerdings in regelmäßigen Abständen Key-Frames berücksichtigt, die das vollständige Bild enthalten.

Die bekanntesten Komprimier- / Pack-Programme für "normale" Dateien sind:

Die bekanntesten Komprimier- / Pack-Programme für Bild-Dateien sind:

Die bekanntesten Komprimier- / Pack-Programme für Sound-Dateien sind:

MP3

Die bekanntesten Komprimier- / Pack-Programme für Film-Dateien sind:


Among the Many Ways to Compress Data

The next story involves a small-but-rapidly-growing software company, which we'll call C. They implemented a file-compression algorithm that consisted of three things:

The Lempel-Ziv algorithm, which you should recall involves storing strings that have appeared previously in the file and replacing subsequent occurrences of that string by pointers to the original occurrence. For instance, if the appears many times, we can save by replacing all but the first to a short pointer to the byte of the file where the first the occurs.

An implementation of the dictionary data abstraction. Recall this abstraction, which I think is one of Don Knuth's neatest ideas, is a set with the methods ``insert,'' ``delete,'' and ``membership-test.'' The dictionary would hold the unique strings seen so far and their original locations in the file.

Because this story takes place in the 1980's, the amount of main memory available was really tiny. Often a PC had 64K bytes, or less. Thus, they had to be careful not to store too many strings in the dictionary, or they would run out of main memory, and performance of the data-compression would drop.

At this point, you should stop and think how you would solve this problem. That is, you need to pick some dictionary implementation, and modify it so that it will store only a limited number of strings, and yet give the correct answers to queries that ask whether a given string is in the set stored. Technically, you should think with the mind of someone from 1985, but I suspect that the method you come up with is likely to have been known in 1985, and moreover, taught to undergraduates of that year and well before.

Anyway, back to our story. C was sued by a company we'll call D, which had a patent on the three steps that I outlined above. Well not exactly. They couldn't patent Lempel-Ziv, a nice idea that was in the public domain. They tried to patent the dictionary, but that wasn't going to hold up, since the abstraction is in Knuth's 1973 book. However, they also patented a particular method for solving the third problem: limiting the size of the dictionary.

I have to be careful not to reveal too much detail, but let us say that one of C and D used a hash table with a limited number of elements per bucket, and the other used a binary search tree with a limited depth. D sued C for patent infringement, and I enlisted to defend C. At the trial, the jury learned more about the background of the case. Previously, C had licensed the software of D to include in its popular software product. Later, C decided to drop D's software and implement its own version. That decision was a disaster for D, and apparently the jury thought more about this aspect of the case than about the issue of whether or not the solution of D in the patent was or was not ``obvious.''

In retrospect I should have been more careful about the implications of a decision in favor of C. Even though I believe, and will try to argue shortly, that the case reveals a logical flaw in the way patents are treated by the courts, I also feel that it is the responsibility of all in the scientific community to support the ethical handling of intellectual property.

The following is the text of Stac Electronics' patent infringement complaint against Microsoft Corp.:

UNITED STATES DISTRICT COURT

CENTRAL DISTRICT OF CALIFORNIA

STAC ELECTRONICS, a California corporation, Plaintiff,v.

MICROSOFT CORPORATION, a Delaware corporation, Defendant.

IRELL & MANELLA Morgan Chu Wayne M. Barsky Mark A. Flagel Jeffrey L. Arrington Steven S. Weiner 1800 Avenue of the Stars Los Angeles, California 90067-4276 Telephone: (310) 277-1010

SHEA & GOULD John Kidd Nicholas L. Coch 1251 Avenue of the Americas New York, New York 10020 Telephone: (212) 827-3000

Attorneys for Plaintiff Stac Electronics

COMPLAINT FOR PATENT INFRINGEMENT

DEMAND FOR JURY TRIAL

Plaintiff Stac Electronics (``Stac'') demands a jury trial on all issues and alleges as follows:

JURISDICTION AND VENUE

1. This is an action for patent infringement arising under the Patent Act of the United States, 35 U.S.C. 271 and 281. This Court has subject matter jurisdiction over the matters complained of under 28 U.S.C. 1338(a) and 1331.

2. Venue is proper in this judicial district pursuant to 28 U.S.C. 1400(b) and 1391(c), as defendant Microsoft Corporation (``Microsoft'') resides and has committed acts of patent infringement in this judicial district.

THE PARTIES

3. Stac is a corporation organized and existing under the laws of the State of California and has its principal place of business at 5993 Avenida Encinas, Carlsbad, California 92008. Stac designs, develops, markets and supports data compression/decompression products that increase the effective capacity of computer-related storage devices and the transmission rates of data communication systems. Stac is the leading supplier of high performance data compression products for personal computers.

4. Microsoft is a corporation organized and existing under the laws of the State of Delaware. Microsoft's principal place of business is located at One Microsoft Way, Redmond, Washington 98052. Microsoft develops, produces and markets, among other things, a broad range of software for business and professional use, including operating systems, languages and applications programs. Microsoft is the world's largest software company, with reported revenues in excess of $2.7 billion in fiscal 1992, and some 12,000 employees in 27 countries. Microsoft produces and markets, among other products, the MS-DOS operating system software for IBM and IBM-compatible personal computers.

FACTUAL BACKGROUND

5. An operating system is a group of programs that, among other tasks, translates user commands to the computer, schedules and implements the execution of applications programs, allocates computer memory, and manages the flow of information and communication among various components of the computer system. Application programs ``attach'' to the underlying operating system and, when called upon to do so by the operating system, perform discrete functions such as word processing, graphics and spreadsheet operations.

6. MS-DOS is currently installed on in excess of 100 million IBM and IBM-compatible personal computers, and Microsoft ships more than 20 million units of MS-DOS every year. Microsoft's MS-DOS is the predominant operating system in the IBM and IBM-compatible personal computer market. MS-DOS is marketed principally to original equipment manufacturers (OEMs) under agreements that allow the OEMs to distribute the MS-DOS operating system software with their personal computers.

7. Stac is the manufacturer and publisher of a data compression utility program known as STACKER, which attaches to DOS operating systems such as MS-DOS, as well as other operating systems. Utility programs are a large and diverse family of application programs that are designed to enhance personal computer performance. The principal function of STACKER is to compress data stored on the hard disk of IBM and IBM-compatible personal computers when the data is not being used, and later decompress such data when it is to be used, thereby increasing the effective storage capacity of the computer.

8. Although there are a number of software companies which offer data compression programs, Stac is the acknowledged industry leader in developing and marketing data compression technology, and STACKER is currently the best-selling data compression program for use on the DOS operating system. Stac's proprietary data compression technology, developed over the course of five years and at substantial cost, is protected by a number of patents, including U.S. Patent Nos. 5,016,009 (the "009" patent) and 4,701,745 (the "745" patent) (collectively, the "patents in suit").

9. Stac's proprietary data compression technology is relied upon daily by more than four million computer users worldwide. STACKER is the winner of PC Magazine's Technical Excellence Award, Windows Magazine's WIN 100 Award, PC Magazine's Editor's Choice Award, PC Computing's Most Valuable Product Award, Byte Magazine's Best of Comdex-Finalist Award, and National Software Testing Lab's Recommendation (Five Stars), and is the recipient of numerous additional industry accolades.

10. Largely as a result of the tremendous market acceptance of STACKER, which utilizes Stac's proprietary data compression technology, Stac quickly grew from a company with 25 employees and revenues of less than $1 million in 1989 to a company with more than 200 employees and a market capitalization in excess of $150 million today.

11. The personal computer software industry is characterized by rapid technological change which requires software developers continually to enhance existing products and develop new products. A critical factor in the success of a new or enhanced product is getting the product to market quickly in response to new user needs or technological advances, while at the same time maintaining the integrity and quality of the product.

12. It was a well-known fact in the personal computer industry as early as 1991 that Microsoft's MS-DOS 5.0 retail upgrade sales were rapidly declining with each passing quarter. As Microsoft's flagship product -- with approximately $700 million in revenue per year attributable to MS-DOS sales alone -- Microsoft was under intense market pressure to stimulate MS-DOS sales with an improved version of the MS-DOS operating system.

13. Due in part to the overwhelming market success of STACKER, the personal computer industry quickly recognized that an operating system capable of incorporating a high-quality data compression utility such as STACKER would be highly competitive. Indeed, the principal competitor of Microsoft's MS-DOS -- Novell's DR-DOS operating system -- had already incorporated a data compression utility in its operating system software.

14. Microsoft's Chairman and Chief Executive Officer, William H. Gates, became personally interested in Stac's proprietary data compression technology, and the possibility of using such technology in MS-DOS, in 1991, at approximately the same time that STACKER was receiving a number of coveted industry awards for technical excellence and overall product quality.

15. Mr. Gates met with Stac's President, Gary W. Clow, at the Fall Comdex-91 ceremony in Las Vegas. During a discussion which preceded the award ceremony, Mr. Gates said that Microsoft was considering including a data compression capability in the next release of MS-DOS. Mr. Gates further stated that Microsoft would not be developing this capability internally, but rather would seek to obtain another company's data compression technology for inclusion in MS-DOS. The Editor-in-Chief of PC Magazine, Michael J. Miller -- whose magazine would later that evening present its Technical Excellence Award to Stac -- told Mr. Gates before the ceremony began that STACKER was a first-rate product. Mr. Gates asked Mr. Clow to contact Microsoft after Mr. Clow returned to California, and Mr. Clow agreed.

16. In late 1991, as a result of Mr. Gates' interest, Mr. Brad Chase -- who was then Microsoft's Group Product Manager and who today is Microsoft's General Manager for MS-DOS -- and Mr. Clow began discussing the possibility of Microsoft licensing Stac's proprietary data compression technology for inclusion in future versions of the MS-DOS operating system.

17. During the ensuing months of negotiations, Microsoft proposed that Stac grant to Microsoft a world-wide license to incorporate STACKER data compression technology and know-how into future versions of its MS-DOS operating system software. Microsoft steadfastly refused, however, to offer to pay Stac any royalty for Stac's patented data compression technology.

18. Mr. Chase made it clear during the negotiations that Microsoft was considering including data compression capability in future versions of the MS-DOS operating system, and that if it were unable to reach an agreement with Stac, it would obtain this capability elsewhere, even though Microsoft believed -- as it told Stac on numerous occasions -- that STACKER was the best data compression product for the DOS market. When the subject of incorporating data compression technology other than Stac's arose, Mr. Clow reminded Mr. Chase and others that Stac owned patent rights to its data compression technology and would enforce its patents against any infringers. At least one draft agreement was provided to Microsoft that included a specific reference to Stac's '009 patent.

19. Microsoft attempted to persuade Stac that its proposal to incorporate Stac's proprietary data compression technology -- or, for that matter, any reliable data compression technology -- into the MS-DOS operating system would, if implemented, have an immediate and adverse effect on the viability of STACKER as an independently marketed product for the DOS market. Indeed, at one point during the negotiations, Microsoft presented Stac with a spreadsheet analysis purporting to detail the adverse impact on sales of STACKER -- Stac's flagship product -- in the event Microsoft and Stac failed to reach an agreement and Microsoft incorporated a different data compression utility in future versions of the MS-DOS operating system.

20. In approximately April of 1992, Stac broke off further discussions with Microsoft in light of Microsoft's failure to present a proposal that offered reasonable compensation to Stac for Microsoft's use of Stac's proprietary data compression technology.

21. In approximately June of 1992, Mr. Chase advised Mr. Clow that Microsoft was obtaining data compression technology for use in MS-DOS, but that Microsoft wanted to offer Stac one last chance to reach an agreement. In the ensuing discussions, it again became clear that Microsoft had no intention of paying any compensation to Stac in exchange for Stac's proprietary data compression technology. Discussions between Stac and Microsoft thereupon terminated for the second time.

22. Shortly thereafter, it became well known to the industry that a new version of its DOS operating system, MS-DOS version 6.0 (``MS-DOS 6.0"), would be released in the first six months of 1993 and that MS-DOS 6.0 would include a data compression utility, which Microsoft was to later call ``DoubleSpace.''

23. Before a new program (or new version of an existing program) is made available for retail distribution, the software developer will often distribute preliminary copies of the new software (the ``beta software'') to a large group of intended users (the ``beta sites"). The developer then seeks comments from the beta sites on the beta software's performance, thereby allowing it to identify and fix problems or ``bugs'' in the beta software that might have slipped through the developer's quality control procedures. Consistent with this practice, Microsoft commenced its MS-DOS 6.0 Beta Test Program in the second half of 1992.

24. On or about November 23, 1992, a telephone conference was held with Mr. Clow and Mr. Whiting of Stac, and Microsoft's Mr. Chase. During that conversation, Mr. Chase admitted that, during Microsoft's ``normal due diligence process,'' Microsoft had concluded that the DoubleSpace data compression utility of the MS-DOS 6.0 operating system software infringed Stac's '009 patent, one of the two patents in suit. Mr. Chase requested that Stac grant a license to Microsoft under Stac's '009 patent. After a brief discussion, Mr. Clow requested that Microsoft make a specific licensing proposal to Stac, and Mr. Chase agreed to do so. During this same telephone conference, Mr. Chase promised, in response to Stac's request, to make available to Stac a copy of the beta version of the MS-DOS 6.0 software.

25. Several weeks later, after not hearing back from Mr. Chase on a licensing proposal for Stac's patent, and after not receiving the promised beta version of MS-DOS 6.0, Mr. Clow wrote to Mr. Chase and again requested a copy of the MS-DOS 6.0 software. Mr. Clow wrote to Mr. Chase and again requested a copy of the MS-DOS 6.0 software. Mr. Clow explained that other software developers had access to the beta versions of MS-DOS 6.0, which was putting Stac at a competitive disadvantage. Mr. Clow also noted that Stac was still awaiting a specific licensing proposal from Microsoft for Stac's '009 patent.

26. Microsoft finally made the beta version of the MS-DOS 6.0 software available to Stac in January of 1993. At or about the same time, with respect to Microsoft's earlier admission regarding the infringement of Stac's '009 patent, Mr. Chase advised Stac in writing: ``Don't worry about the patent stuff. We are just going to keep with our changed code which does not infringe.''

27. After receiving the beta version of MS-DOS 6.0, Stac engineers determined that, whether or not the ``code'' had in fact been changed from earlier versions, as represented by Mr. Chase, the DoubleSpace data compression utility contained in the beta version of MS-DOS 6.0 infringes upon Stac's '009 and '745 patents.

28. On or about January 15, 1993, Mr. Chase of Microsoft provided Mr. Whiting and Mr. Clow of Stac with a preliminary press release for the Microsoft Real-time Compression Interface (``MRCI''). MRCI defines a compression standard for allowing vendors to design software and hardware products that utilize or ``build'' upon the Double Space data compression utility in the MS-DOS 6.0 operating system.

29. Microsoft's preliminary press release confirms that ``DoubleSpace, the integrated data compression technology ... will be available with the next major version of MS-DOS, MS-DOS 6.'' The preliminary press release also reveals that, in an effort to have the compression technology in MS-DOS 6.0 quickly adopted as an industry standard, Microsoft is now offering to license - for free - the infringing DoubleSpace technology to independent hardware and software vendors.

30. Mr. Chase sent Microsoft's preliminary press release to Mr. Whiting and Mr. Clow for the stated purpose of having Stac approve the following proposed quote for Microsoft's ultimate press release, drafted for Stac by Microsoft:

"We're excited about MS-DOS 6 and DoubleSpace because they create a large opportunity for boards, chips and add-on software to enhance the compression services that MS-DOS 6 and DoubleSpace offer ..."

31. On information and belief, Microsoft is taking the calculated risk of incorporating what it knows to be Stac's patented data compression technology in its MS-DOS 6.0 operating system in order to stimulate sales of its flagship product and respond to the intense market and financial community pressure to remain competitive and demonstrate continued growth.

FIRST CAUSE OF ACTION FOR PATENT INFRINGEMENT

32. Stac repeats and realleges, as if set forth in full, paragraphs 1 through 31 of this Complaint.#}

33. On May 14, 1991, the '009 patent, entitled "Data Compression Apparatus And Method," was granted to Stac. Since its issuance, Stac has been, and continues to be, the owner of all right, title and interest in and to the '009 patent. A copy of the '009 patent is attached as Exhibit A, and incorporated herein by reference.#}

34. Defendant Microsoft is infringing the '009 patent, in this judicial district and elsewhere, in connection with its activities pertaining to the beta version of its MS-DOS 6.0 operating system software for IBM and IBM-compatible personal computers, which embodies the inventions disclosed and claimed in the '009 patent.#}

35. Unless enjoined by the Court, Microsoft will continue to infringe Stac's '009 patent.#}

36. As a direct and proximate result of Microsoft's conduct, Stac has suffered and will continue to suffer irreparable injury, for which it has no adequate remedy at law. Stac has also been damaged, and, until an injunction issues, will continue to be damaged in its business and reputation in an amount yet to be determined. Moreover, the wilful and deliberate nature of Microsoft's infringement renders this an exceptional case, and thus Stac is further entitled to treble damages, as well as its actual attorneys' fees and litigation costs.#}

SECOND CAUSE OF ACTION FOR PATENT INFRINGEMENT

37. Stac repeats and realleges, as if set forth in full, paragraphs 1 through 31 of this Complaint.#}

38. On October 20, 1987, the '745 patent, entitled "Data Compression System," was granted to Ferranti, plc ("Ferranti"). Ferranti subsequently assigned all right, title and interest in and to the '745 patent to Stac, which is now the owner of all right, title and interest in and to the '745 patent. A copy of the '745 patent is attached as Exhibit B, and incorporated herein by reference.#}

39. Defendant Microsoft is infringing the '745 patent, within this judicial district and elsewhere, in connection with its activities pertaining to the beta version of the MS-DOS 6.0 operating system software, which embodies the inventions disclosed and claimed in the '745 patent.#}

40. Unless enjoined by the Court, Microsoft will continue to infringe Stac's '745 patent.#}

41. As a direct and proximate result of Microsoft's conduct, Stac has suffered, and will continue to suffer, irreparable injury, for which it has no adequate remedy at law. Stac has also been damaged and, until an injunction issues, will continue to be damaged in its business and reputation in an amount yet to be determined.#}

PRAYER FOR RELIEF

WHEREFORE, Stac prays for judgment against defendant Microsoft as follows:#}

1. For a judicial determination and declaration that the '009 patent is valid and enforceable;#}

2. For a judicial determination and declaration that the '009 patent is infringed by the beta version of the MS-DOS 6.0 operating system software, and such other Microsoft products as may infringe;#}

3. For a judicial determination and declaration that Microsoft's infringement of either or both of the '009 and '745 patents is wilful;#}

4. For a judicial determination and declaration that the '745 patent is valid and enforceable;#}

5. For a judicial determination and declaration that the '745 patent is infringed by the beta version of the MS-DOS 6.0 operating system software, and such other products as may infringe;#}

6. For an order preliminary and permanently enjoining Microsoft, its officers, directors, shareholders, agents, servants, employees and attorneys, and all entities and individuals acting in concert with them or on their behalf, from infringing the '009 and '745 patents;#}

7. For damages according to proof, trebled;

8. For Stac's attorneys' fees and litigation costs; and

9. For such other and further relief as the Court may deem just and proper.#}

Dated: January 25, 1993

Respectfully submitted,

IRELL & MANELLA

Morgan Chu

Wayne M. Barsky

Mark A. Flagel

Jeffrey L. Arrington

Steven S. Weiner

SHEA & GOULD

John Kidd

Nicholas L. Coch

By: Morgan Chu

Attorneys for Plaintiff

Stac Electronics

DEMAND FOR JURY TRIAL Stac demands a jury trial on all issues. Dated: January 25, 1993#}

Respectfully submitted,

IRELL & MANELLA

Morgan Chu

Wayne M. Barsky

Mark A. Flagel

Jeffrey L. Arrington

Steven S. Weiner

SHEA & GOULD

John Kidd

Nicholas L. Coch

By: Morgan Chu

Attorneys for Plaintiff

Stac Electronics UNITED STATES PATENT [19]

[11] Patent Number: 5,016,009 Whiting et al. [45] Date of Patent: May 14, 1991 [54] DATA COMPRESSION APPARATUS AND METHOD [75] Inventors: Douglas L. Whiting, South Pasadena;#}

Glen A. George; Glen E. Ivey, both of Pasadena, all of Calif. [73] Assignee: Stac, Inc., Pasadena, Calif. [21] Appl. No.: 297,152 [22] Filed: Jan. 13, 1989 [51] Int. Cl. (5) H03M 7/40; H03L 7/00 [52] U.S. Cl. 341/67; 341/95;#}

341/106; 375/112; 370/102 [58] Field of Search 375/27, 112; 358/261.1;#}

364/715.02; 341/51, 67, 106, 95; 370/102 [56] References Cited U.S. PATENT DOCUMENTS 3,976,844 8/1976 Betz 4,021,782 5/1977 Hoerning 4,054,951 10/1977 Jackson et al. 4,412,306 10/1983 Moll 4,464,650 8/1984 Eastman et al. 341/51 4,491,934 1/1985 Heinz 4,558,302 12/1985 Welch 341/51 4,612,532 9/1986 Bacon et al. 341/90 X 4,701,745 10/1987 Waterworth 4,814,746 3/1989 Miller et al. 4,876,541 10/1989 Storer 341/67 X#}

OTHER PUBLICATIONS

J. Cleary et al. "Data Compression Using Adaptive Coding and Partial String Matching." IEEE Transactions on Communications, 32:396-403 (1984).

M Wells. "File Compression Using Variable Length Encodings," The Computer Journal, 15:308-313 (1972).

Primary Examiner - A.D. Pellinan Assistant Examiner - Sharon D. Logan Attorney, Agent, or Firm - Irell & Manella [57]

ABSTRACT An apparatus and method for converting an input data character stream into a variable length encoded data stream in a data compression system. The data compression system includes a history array means. The history array means has a plurality of entries and each entry of the history array means is for storing a portion of the input data stream. The method for converting the input data character stream includes the following steps. Performing a search in a history array means for the longest data string which matches the input data string. If the matching data string is found within the history buffer means, the next step includes encoding the longest matching data string found by appending to the encoded data stream a tag indicating the longest matching data string was found and a string substitution code. If the matching data string is not found within the history array means, the next step includes encoding the first character of the input data string by appending to the encoded data stream a raw data tag indicating that no matching data string was found and the first character of the input data string.

UNITED STATES PATENT [19] [11] Patent Number: 4,701,745

Waterworth [45] Date of Patent: Oct. 20, 1987 [54]

DATA COMPRESSION SYSTEM [75] Inventor: John R. Waterworth, Cheadle,

England [73] Assignee: Ferranti, plc, Cheshire, England [21] Appl. No.:

835,793 [22] Filed: Mar. 3, 1986 [30] Foreign Application Priority

Data Mar. 6, 1985 [GB] United Kingdom 8505790 [51] Int. Cl. (4)

H03M 7/30 [52] U.S. Cl. 340/347 DD; 364/900 [58] Field of

Search 340/347 DD; 235/310;#}

358/260, 261; 364/900 [56] References Cited

U.S. PATENT DOCUMENTS 4,054,951 10/1977 Jackson 364/900 Primary Examiner - Charles D. Miller Attorney, Agent or Firm - Kerkam, Stowell, Kondracki & Clarke [57]

ABSTRACT A data compression system includes an input store (1) for receiving and storing a plurality of bytes of data from an outside source. Data processing means for processing successive bytes of data from the input store includes circuit means (21-25) operable to check whether a sequence of bytes is identical with a sequence of bytes already processed, output means (27) operable to apply to a transfer medium (12) each byte of data not forming part of such an identical sequence, and an encoder (26) responsive to the identification of such a sequence to apply to the transfer means (12) an identification signal which identifies both the location in the input store of the previous occurrence of the sequence of bytes and the number of bytes in the sequence.#}

CONTACT: IRELL & MANELLA | Morgan Chu | Wayne M. Barsky | Mark A. Flagel| Jeffrey L. Arrington | Steven S. Weiner | 1800 Avenue of the Stars | Los Angeles, California 90067-4276 | Telephone: (310) 277-1010 | or | SHEA & GOULD | John Kidd | Nicholas L. Coch | 1251 Avenue of the Americas | New York, New York 10020 | Telephone: (212) 827-3000


Beispielprojekt an Hand des LZWs

Option Explicit

Private Type BIdx

pLinks As Long

pRechts As Long

Woerterbuchindex As Long

End Type

Dim Woerterbuch(4096) As String

Dim NaechsterWoerterbuchindex As Long

Dim Heap(4096) As BIdx

Dim NaechsterHeapIndex As Long

Dim pStr As Long

--------------------------------------------------------------------------------

Sub InitWoerterbuch()

'In dieser Sub wird das Woerterbuch

'initialisiert und mit Standardwerten belegt

Dim i As Integer

For i = 0 To 255

Woerterbuch(i) = Chr(i)

Next i

NaechsterWoerterbuchindex = 256

NaechsterHeapIndex = 0

End Sub

--------------------------------------------------------------------------------

Function AddToWoerterbuch(s As String) As Long

'In dieser Sub wird dem Woerterbuch ein

'Begriff hinzugefügt

If Len(s) = 1 Then

AddToWoerterbuch = Asc(s)

Else

AddToWoerterbuch = AddToBTree(0, s)

End If

End Function

--------------------------------------------------------------------------------

Function AddToBTree(ByRef Node As Long, _

ByRef s As String) As Long

Dim i As Integer

If Node = -1 Or NaechsterHeapIndex = 0 Then

Woerterbuch(NaechsterWoerterbuchindex) = s

Heap(NaechsterHeapIndex).Woerterbuchindex = _

NaechsterWoerterbuchindex

NaechsterWoerterbuchindex = NaechsterWoerterbuchindex _

+ 1

Heap(NaechsterHeapIndex).pLinks = -1

Heap(NaechsterHeapIndex).pRechts = -1

Node = NaechsterHeapIndex

NaechsterHeapIndex = NaechsterHeapIndex + 1

AddToBTree = -1

Else

i = StrComp(s, Woerterbuch(Heap(Node).Woerterbuchindex))

If i < 0 Then

AddToBTree = AddToBTree(Heap(Node).pLinks, s)

ElseIf i > 0 Then

AddToBTree = AddToBTree(Heap(Node).pRechts, s)

Else

AddToBTree = Heap(Node).Woerterbuchindex

End If

End If

End Function

--------------------------------------------------------------------------------

Private Sub SchreibeStringBuffer(s As String, s2 As String)

Do While pStr + Len(s2) - 1 > Len(s)

s = s & Space(100000)

Loop

Mid$(s, pStr) = s2

pStr = pStr + Len(s2)

End Sub

--------------------------------------------------------------------------------

Function Komprimieren(IPStr As String) As String

Dim TmpStr As String

Dim Ch As String

Dim Woerterbuchindex As Integer

Dim LetzterWoerterbuchindex As Integer

Dim ErsterBegriffinFolge As Boolean

Dim HalfCh As Integer

Dim i As Long

Dim ostr As String

Call InitWoerterbuch

ErsterBegriffinFolge = True

pStr = 1

For i = 1 To Len(IPStr)

Ch = Mid$(IPStr, i, 1)

Woerterbuchindex = AddToWoerterbuch(TmpStr & Ch)

If Woerterbuchindex = -1 Then

If ErsterBegriffinFolge Then

HalfCh = (LetzterWoerterbuchindex And 15) * 16

Else

SchreibeStringBuffer ostr, Chr(HalfCh Or _

(LetzterWoerterbuchindex And 15))

End If

SchreibeStringBuffer ostr, _

Chr(LetzterWoerterbuchindex \ 16)

ErsterBegriffinFolge = Not ErsterBegriffinFolge

TmpStr = Ch

LetzterWoerterbuchindex = Asc(Ch)

Else

TmpStr = TmpStr & Ch

LetzterWoerterbuchindex = Woerterbuchindex

End If

Next i

SchreibeStringBuffer ostr, _

IIf(ErsterBegriffinFolge, Chr(LetzterWoerterbuchindex _

\ 16) & Chr((LetzterWoerterbuchindex And 15) * _

16), Chr(HalfCh Or (LetzterWoerterbuchindex And _

15)) & Chr(LetzterWoerterbuchindex \ 16))

Komprimieren = Left(ostr, pStr - 1)

End Function

--------------------------------------------------------------------------------

Function GC(str As String, position As Long) As Integer

GC = Asc(Mid$(str, position, 1))

End Function

--------------------------------------------------------------------------------

Function DeKomprimieren(IPStr As String) As String

Dim Woerterbuchindex As Integer

Dim ErsterBegriffinFolge As Boolean

Dim i As Long

Dim s As String

Dim s2 As String

Call InitWoerterbuch

pStr = 1

i = 1

ErsterBegriffinFolge = True

Do While i < Len(IPStr)

If ErsterBegriffinFolge Then

Woerterbuchindex = (GC(IPStr, i) * 16) Or _

(GC(IPStr, i + 1) \ 16)

i = i + 1

Else

Woerterbuchindex = (GC(IPStr, i + 1) * 16) Or _

(GC(IPStr, i) And 15)

i = i + 2

End If

ErsterBegriffinFolge = Not ErsterBegriffinFolge

If i > 2 Then

If Woerterbuchindex = NaechsterWoerterbuchindex _

Or (Woerterbuchindex = 256 _

And NaechsterWoerterbuchindex _

= 4096) Then

AddToWoerterbuch s2 & Left$(s2, 1)

Else

AddToWoerterbuch s2 & Left$(Woerterbuch( _

Woerterbuchindex), 1)

End If

End If

s2 = Woerterbuch(Woerterbuchindex)

SchreibeStringBuffer s, s2

Loop

DeKomprimieren = Left(s, pStr - 1)

End Function

--------------------------------------------------------------------------------

Sub Start()

Dim KomprS As String

Screen.MousePointer = vbHourglass

'Kompression aufrufen

KomprS = Komprimieren(Form1.Text1)

'Übergabe des komprimierten Textes

Form1.Text6 = KomprS

'DeKompression des komprimierten Textes

Form1.Text2 = DeKomprimieren(KomprS)

'Länge des Originaltextes ermitteln

Form1.Text3 = Len(Form1.Text1)

'Länge des komprimierten Textes ermitteln

Form1.Text4 = Len(KomprS)

'Status einfügen

If Form1.Text1 <> Form1.Text2 Then

Form1.Text5 = "Fehler"

Else

Form1.Text5 = "fertig"

End If

Screen.MousePointer = vbNormal

End Sub


Hallo!Guter Tip, hast du auch was zu Hamming-Code?Grüsse,Hirf: : wer sich für den Huffmann-Kompressions-Algorithmus interessiert, findet ein Beispiel unter http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=1&txtCodeId=11000: Danke für den Tipp.: Dabei haben wir ein solch schones Kompressionstutorial - mit LZW-Beispiel :-)


Data Compression

Information theory provides a method for determining exactly how many bits are required to specify a given message to a given precision. This method is called the theory of data compression or, more technically, rate distortion theory. Often the message selected at the source need not (or cannot) be transmitted perfectly to the destination. For example, in a telephone conversation, extremely high sound quality is not necessary. It is usually sufficient that the two parties recognize and understand each other. Similarly, if the message is a scientific measurement—for example, a measurement of the number = 3.14159265…—it is not possible to transmit all of the digits in a finite amount of time. However, a useful approximation of can be transmitted with a relatively small number of bits.

The general idea of data compression theory is illustrated in the graph below. The horizontal axis measures the distortion, or imprecision, that is tolerable in a given message. The vertical axis gives the minimum possible number of bits required to specify the message with this distortion. The graph shows that as the acceptable distortion becomes smaller and smaller, the required number of bits becomes larger and larger. Conversely, as the allowed distortion becomes larger, the required number of bits decreases. Ultimately, the number of required bits becomes zero. The number becomes equal to zero when the allowed distortion can be achieved by merely guessing at the message.

Consider the following simple example. If the message is the outcome of the toss of an ordinary coin and it is acceptable to be wrong 50 percent of the time, then no bits are required. In this case, one can simply guess heads and be sure of being right 50 percent of the time, on average.

The relationship shown by the rate-distortion curve represents the absolute minimum number of bits, on the average, required to represent a message with a given distortion. The exact shape of the curve depends on the details of the particular situation. Shannon's Fundamental Theorem of Data Compression states that, in principle, it is possible to compress a message to a given level, but no more. Exactly how this compression should be done he did not say, but later researchers developed practical techniques for data compression based on Shannon's theorems. These techniques use mathematical algorithms, or repeated computations, to compress data. Many of these methods come very close to achieving the ultimate limits set forth by information theory.


Fragen


Web-Links

Unicode  http://www.unicode.org/unicode/standard/translations/german.html

LZ77-Animation   http://www.data-compression.com/lempelziv.html

Lempel-Ziv- Kodierung (LZ77)  http://www.uni-karlsruhe.de/~un55/lz.html

GNU zip (kostenlos)  http://www.gzip.org/

Entropie   http://www.tu-bs.de/institute/pci/aggericke/PC1/Kap_III/Entropie_und_Information.htm

http://www.educeth.ethz.ch/informatik/interaktiv/kompression/huffman.html


Ideen + Stoffsammlung

Komprimierbarkeit


Orginaltexte

Informatik, Mathematik, Physik, Sprachwissenschaft

Gemeinsamkeit im Konzentrat [28.01.2002]

Die meisten Computernutzer werden das hilfreiche ZIP-Programm kennen, mit dem sich umfangreiche Datenbestände zu kleinen Dateien komprimieren lassen. Nun fanden Wissenschaftler heraus, dass die Software wesentlich mehr kann, als nur Datenberge zu schrumpfen:

Sie eignet sich auch dazu, den Informationsgehalt von Texten zu bestimmen und so deren Autorenschaft, die Sprache und sogar einen Sprachstammbaum abzuleiten.

Das Nibelungenlied

So beginnt die bedeutendste mittelhochdeutsche Heldensage - das Nibelungenlied. Die 39 Abschnitte der Dichtung, die "Aventiuren", erzählen uns die Geschichte von Siegfried und Kriemhild und ihrer späterer Rache an den Nibelungen. Der Schöpfer dieses berühmten Werks ist jedoch unbekannt. Zwar gibt es durchaus einige Vermutungen, und so manche Ortschaft preist ihre einstigen Schreiberlinge als mutmaßliche Urheber - doch Sicherheit gibt es nicht. In Zukunft ließe sich ein Verdacht unter Umständen jedoch sehr schnell überprüfen, dazu bedürfte es einzig und allein eines Computers, einer herkömmlichen Komprimierungssoftware, die mit ZIP-Archiven umgehen kann, und einiger Textpassagen des zu überprüfenden Dokuments sowie einige der in Frage kommenden Autoren.

Denn Dariao Benedetto, Emanuele Caglioti und Vittorio Loreto von der Università degli Studi di Roma "La Sapienza" fanden heraus, dass sich das hilfreiche Programm bestens dazu eignet, den Informationsgehalt von Texten zu vergleichen. Dabei entspricht eine Zunahme an Information einer Abnahme der Entropie des Systems - hier eines Textes. Und genau diese Entropie liefert der Lempel-Ziv-Algorithmus (LZ77), welcher der ZIP-Software zugrunde liegt, auf denkbar einfach Weise: Denn die Länge der Zeichenfolge nach dem Komprimieren geteilt durch die ursprüngliche Länge strebt im Grenzwert für unendlich lange Textpassagen gegen den Wert ihrer Entropie.

Doch wie lässt sich das ausnutzen, um Dokumente zu

vergleichen? Dazu komprimierten die Forscher zunächst eine

bekannte Zeichenfolge und notierten sich deren Länge.

Danach fügten sie der ursprünglichen Abfolge ein kurzes

Fragment einer zu prüfenden Sequenz an und ließen auch

dieses Datenpaket mit dem Algorithmus schrumpfen. Die

Längendifferenz zwischen den beiden komprimierten

Zeichenabfolgen liefert nun ein Maß dafür, wie nahe sich die

beiden ursprünglichen Zeichenreihen stehen. Dabei lässt sich

das Verfahren nicht nur auf Zeichen und Texte, sondern auf

nahezu beliebige Datensätze anwenden.

Die Forscher prüften ihr Konzept jedoch an Texten. In einem

ersten Test sollte damit die Sprache von Schriftstücken

bestimmt werden. Dazu komprimierten sie jeweils eine lange

Passage verschiedensprachiger Texte mit einem kurzen zu

testenden Abschnitt. Ihre Annahme war dabei, dass die

komprimierte Datei gerade dann am kleinsten ausfällt, wenn

das kürzere Textstück in derselben Sprache verfasst ist wie

der größere Teil. Benedetto, Caglioti und Loreto verwendeten

für ihren Test zehn offizielle Sprachen der EU in allen möglichen

Kombinationen. Das Ergebnis war erstaunlich: Tatsächlich

lieferten die Textkombinationen aus gleichen Sprachen jeweils

die kleinsten Dateien.

Anschaulich lässt sich das verstehen, wenn man sich das

Prinzip des LZ77-Algorithmus vergegenwärtigt: Dieser legt

nämlich eine Art Wörterbuch für bereits verwendete

Sequenzen an. Wenn dann im weiteren Verlauf diese Sequenz

wiederholt auftritt, so wird nur noch auf die entsprechende

"Hausnummer" im Wörterbuch verwiesen. Ein Text, welcher

Sprache auch immer, beinhaltet stets wiederkehrende

Zeichenfolgen - man denke nur an sich wiederholende Wörter,

aber auch bestimmte kürzere Abfolgen sind typisch, wie im

Deutschen beispielsweise "sch", "tz" oder ähnliches. Ändert

sich nun aber plötzlich beim Komprimieren die Sprache eines

Dokuments, dann wird der Algorithmus zunächst nicht mehr so

viele Treffer im Wörterbuch finden und produziert

entsprechend größere Dateien. Das fällt jedoch nur auf, wenn

der angehängte fremdsprachliche Abschnitt kurz genug ist.

Denn bei längeren Anhängseln hat der Algorithmus wieder

genug Zeit zu lernen, und Abweichungen in der Dateigröße

fallen nicht mehr so stark ins Gewicht. Typischerweise belegten

die umfangreichen Dateien zwischen 32 und 64 Kilobyte und

die kleinen 1 bis 15 Kilobyte.

Die Wissenschaftler prüften aber nicht nur, inwieweit sich die

Sprache wiederkennen lässt, sondern auch, ob sich so der

Autor eines Stücks offenbart. Das Prinzip war dasselbe wie

beim Sprachentest: Große Passagen mit bekanntem Autor

wurden mit Testfragmenten komprimiert. Und auch hier war die

Treffsicherheit verblüffend: In 93 Prozent der Fälle wies die

minimale Dateigröße auf denselben Schöpfer hin.

In einem letzten Experiment versuchten die Forscher das

Verfahren zu nutzen, um Sprachen zu klassifizieren. Dazu

verwendeten sie die "Allgemeine Erklärung der

Menschenrechte", da deren Formulierung in sehr vielen

Sprachen verfügbar ist. Dabei lagen alle Texte im so

genannten Unicode vor, um auch jegliche Schriftzeichen

erfassen zu können. Schließlich bedienten sie sich einer

Methode, die normalerweise zur Stammbaumanalyse

biologischer Sequenzen dient, im Prinzip wurden dazu jedoch

wieder Textpassagen in gewohnter Manier anhand ihrer

ZIP-Archive verglichen. Heraus kam ein Stammbaum für 50

Sprachen, der erstaunlich genau dem von Linguisten

entwickelten System ähnelt: So bildeten sich beispielsweise

auch hier die großen zusammenhängenden Sprachklassen

heraus.

Wenngleich Benedetto, Caglioti und Loreto ihr Verfahren in

erster Linie an Texten verschiedener Sprachen testeten, so

taugt es doch auch für unzählige andere Bereiche, bei denen

Zeichenabfolgen untersucht werden: DNA- und

Protein-Squenzierungen, Aktienmarktanalysen und

medizinische Kontrollen sind dabei nur einige wenige Beispiele.

Und wer weiß, vielleicht lässt sich mit dieser Methode auch

irgendwann der Dichter des Nibelungenlieds finden.

Thorsten Krome   © wissenschaft-online

Quellen

Physical Review Letters 88: 048702 (2002), Abstract ScienceNow


LZ77-Animation   http://www.data-compression.com/lempelziv.html

Theory of Data Compression

Contents

I. Introduction and Background

In his 1948 paper, ``A Mathematical Theory of Communication,'' Claude E. Shannon formulated the theory of data compression. Shannon established that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. The exact value of H depends on the information source --- more specifically, the statistical nature of the source. It is possible to compress the source, in a lossless manner, with compression rate close to H.

It is mathematically impossible to do better than H.

Shannon also developed the theory of lossy data compression. This is better known as rate-distortion theory. In lossy data compression, the decompressed data does not have to be exactly the same as the original data. Instead, some amount of distortion, D, is tolerated. Shannon showed that, for a given source (with all its statistical properties known) and a given distortion measure, there is a function, R(D), called the rate-distortion function. The theory says that if D is the tolerable amount of distortion, then R(D) is the best possible compression rate.

When the compression is lossless (i.e., no distortion or D=0), the best possible compression rate is R(0)=H (for a finite alphabet source). In other words, the best possible lossless compression rate is the entropy rate. In this sense, rate-distortion theory is a generalization of lossless data compression theory, where we went from no distortion (D=0) to some distortion (D>0).

Lossless data compression theory and rate-distortion theory are known collectively as source coding theory. Source coding theory sets fundamental limits on the performance of all data compression algorithms. The theory, in itself, does not specify exactly how to design and implement these algorithms. It does, however, provide some hints and guidelines on how to achieve optimal performance. In the following sections, we will described how Shannon modeled an information source in terms of a random process, we will present Shannon lossless source coding theorem, and we will discuss Shannon rate-distortion theory. A background in probability theory is recommended.

II. Source Modeling

Imagine that you go to the library. This library has a large selection of books --- say there are 100 million books in this library. Each book in this library is very thick --- say, for example, that each book has 100 million characters (or letters) in them. When you get to this library, you will, in some random manner, select a book to check out. This chosen book is the information source to be compressed. The compressed book is then stored on your zip disk to take home, or transmitted directly over the internet into your home, or whatever the case may be.

Mathematically, the book you select is denoted by where represents the whole book, represents the first character in the book, represents the second character, and so on. Even though in reality the length of the book is finite, mathematically we assume that it has infinite length. The reasoning is that the book is so long we can just imagine that it goes on forever. Furthermore, the mathematics turn out to be surprisingly simpler if we assume an infinite length book. To simplify things a little, let us assume that all the characters in all the books are either a lower-case letter (`a' through `z') or a SPACE (e. e. cummings style of writing shall we say). The source alphabet, , is defined to be the set of all 27 possible values of the characters:

Now put yourself in the shoes of the engineer who designs the compression algorithm. She does not know in advance which book you will select. All she knows is that you will be selecting a book from this library. From her perspective, the characters in the book are random variables which take values on the alphabet . The whole book, is just an infinite sequence of random variables -- that is is a random process. There are several ways in which this engineer can model the statistical properties of the book.

A.Zero-Order Model: Each character is statistically independent of all other characters and the 27 possible values in the alphabet are equally likely to occur. If this model is accurate, then a typical opening of a book would look like this (all of these examples came directly from Shannon's 1948 paper):

xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd

This does not look like the writing of an intelligent being. In fact, it resembles the writing of a "monkey sitting at a typewriter.''

B.First-Order Model: We know that in the English language some letters occur more frequently than others. For example, the letters `a' and `e' are more common than `q' and `z'. Thus, in this model, the character are still independent of one another, but the probability distribution of the characters are according to the first-order statistical distribution of English text. A typical text for this model looks like this:

ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva nah brl

C.Second-Order Model: The previous two models assumed statistical independence from one character to the next. This does not accurately reflect the nature of the English language. For exa#ple, some letters in thi# sent#nce are missi#g. However, we are still able to figure out what those letters should have been by looking at the context. This implies that there are some dependency between the characters. Naturally, characters which are in close proximity are more dependent than those that are far from each other. In this model, the present character depends on the previous character but it is conditionally independent of all previous characters . According to this model, the probability distribution of the character varies according to what the previous character is.

For example, the letter `u' rarely occurs (probability=0.022). However, given that the previous character is `q', the probability of a `u' in the present character is much higher (probability=0.995). For a complete description, see the second-order statistical distribution of English text. A typical text for this model would look like this:

on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at teasonare fuso tizin andy tobe seace ctisbe

D.Third-Order Model: This is an extension of the previous model. Here, the present character  depends on the previous two characters but it is conditionally independent of all previous characters before those: . In this model, the distribution of varies according to what are. See the third-order statistical distribution of English text. A typical text for this model would look like this:

in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre steps.

E.General Model: In this model, the book is an arbitrary stationary random process. The statistical properties of this model are too complex to be deemed practical.

This model is interesting only from a theoretical point of view.

Model A above is a special case of Model B. Model B is a special case of Model C. Model C is a special case of Model D. Model D is a special case of Model E.

III. Entropy Rate of a Source

The entropy rate of a source is a number which depends only on the statistical nature of the source. If the source has a simple model, then this number can be easily calculated. Here, we consider an arbitrary source:

while paying special attention to the case where is English text.

A.Zero-Order Model: The characters are statistically independent of each other and every letter of the alphabet,, are equally likely to occur. Let be the size of the alphabet. In this case, the entropy rate is given by

For English text, the alphabet size is m=27. Thus, if this had been an accurate model for

English text, then the entropy rate would have been H=log2 27=4.75

bits/character.

B.First-Order Model: The characters are statistically independent. Let be the

size of the

alphabet and let be the probability of the i-th letter in the alphabet. The

entropy rate is

Using the first-order distribution, the entropy rate of English text would have

been 4.07

bits/character had this been the correct model.

C.Second-Order Model: Let be the conditional probability that the present

character is the

j-th letter in the alphabet given that the previous character is the i-th letter.

The entropy rate is

Using the second-order distribution, the entropy rate of English text would

have been 3.36

bits/character had this been the correct model.

D.Third-Order Model: Let be the conditional probability that the present

character is the

k-th letter in the alphabet given that the previous character is the j-th letter

and the one before

that is the i-th letter. The entropy rate is

Using the third-order distribution, the entropy rate of English text would have

been 2.77

bits/character had this been the correct model.

E.General Model: Let represents the first characters. The entropy rate in the

general

case is given by

where the sum is over all possible values of . It is virtually impossible to

calculate the

entropy rate according to the above equation. Using a prediction method,

Shannon has been

able to estimate that the entropy rate of the 27-letter English text is 2.3

bits/character (see

Shannon:Collected Papers).

We emphasize that there is only one entropy rate for a given source. All of the

above definitions for

the entropy rate are consistent with one another.

IV. Shannon Lossless Source Coding Theorem

Shannon lossless source coding theorem is based on the concept of block

coding. To illustrate this concept, we introduce a special information source in

which the alphabet consists of only two letters:

Here, the letters `a' and `b' are equally likely to occur. However, given that `a'

occurred in the previous character, the probability that `a' occurs again in the

present character is 0.9. Similarly, given that `b' occurred in the previous

character, the probability that `b' occurs again in the present character is 0.9.

This is known as a binary symmetric Markov source.

An n-th order block code is just a mapping which assigns to each block of n

consecutive characters

a sequence of bits of varying length. The following examples illustrate this

concept.

1.First-Order Block Code: Each character is mapped to a single bit.

Codeword

a

0.5

0

b

0.5

1

R=1 bit/character

An example:

Note that 24 bits are used to represent 24 characters --- an average of 1

bit/character.

2.Second-Order Block Code: Pairs of characters are mapped to either one,

two, or three bits.

Codeword

aa

0.45

0

bb

0.45

10

ab

0.05

110

ba

0.05

111

R=0.825 bits/character

An example:

Note that 20 bits are used to represent 24 characters --- an average of 0.83

bits/character.

3.Third-Order Block Code: Triplets of characters are mapped to bit sequence

of lengths one

through six.

Codeword

aaa

0.405

0

bbb

0.405

10

aab

0.045

1100

abb

0.045

1101

bba

0.045

1110

baa

0.045

11110

aba

0.005

111110

bab

0.005

111111

R=0.68 bits/character

An example:

Note that 17 bits are used to represent 24 characters --- an average of 0.71

bits/character.

Note that:

The rates shown in the tables are calculated from

where is the length of the codeword for block .

The higher the order, the lower the rate (better compression).

The codes used in the above examples are Huffman codes.

We are only interested in lossless data compression code. That is, given the

code table and

given the compressed data, we should be able to rederive the original data.

All of the

examples given above are lossless.

Theorem: Let be the rate of an optimal n-th order lossless data compression

code (in bits/character). Then

Since both upper and lower bounds of approach the entropy rate, H, as n goes

to infinity, we

have

Thus, the theorem established that the entropy rate is the rate of an optimal

lossless data

compression code. The limit exists as long as the source is stationary.

V. Rate-Distortion Theory

In lossy data compression, the decompressed data need not be

exactly the

same as the original data. Often, it suffices to have a reasonably close

approximation. A distortion measure is a mathematical entity which

specifies

exactly how close the approximation is. Generally, it is a function

which assigns

to any two letters and in the alphabet a non-negative number

denoted

as

Here, is the original data, is the approximation, and is the amount

of distortion between and . The most common distortion measures

are the

Hamming distortion measure:

and the squared-error distortion measure (which is only applicable when is a set

of numbers):

Rate-distortion theory says that for a given source and a given distortion

measure, there exists a

function, R(D), called the rate-distortion function. The typical shape of R(D)

looks like this:

If the source samples are independent of one another, the rate-distortion

function can be obtained

by solving the constrained minimization problem:

subject to the constraints that and

where is the distortion between the i-th and j-th letter in the alphabet. Using

Blahut's

algorithm, it may be possible to numerically calculate the rate-distortion

function.

Rate-distortion theory is based on the concept of block coding (similar to

above). A lossy block

code is known as a vector quantizer (VQ). The block length n of the code is

known as the VQ

dimension.

Theorem: Assuming that the source is stationary and the source samples are

independent. For every and for every , there exists a VQ of

(sufficiently large) dimension n with distortion no greater than

and rate no more than

Further more, there does not exist a VQ with distortion and rate less than

.

In essence, the theorem states that R(D) is the rate-distortion performance of

an optimal VQ. The

above theorem can be generalized to the case where the source is stationary

and ergodic.

VI. The Gap Between Theory and Practice

The theory holds when the block length n approaches infinity. In real-time

compression, the

compression algorithm must wait for n consecutive source samples before it

can begin the

compression. When n is large, this wait (or delay) may be too long. For

example, in real-time

speech compression, the speech signal is sampled at 8000 samples/second.

If n is say 4000,

then the compression delay is half a second. In a two-way conversation, this

long delay may

not be desirable.

The theory does not take into consideration the complexities associated with

the compression

and decompression operations. Typically, as the block length n increases,

the complexities

also increase. Often, the rate of increase is exponential in n.

The theory assumes that the statistical properties of the source is known. In

practice, this

information may not be available.

The theory assumes that there is no error in the compressed bit stream. In

practice, due to

noise in the communication channel or imperfection in the storage medium,

there are errors in

the compressed bit stream.

All of these problems have been successfully solved by researchers in the field.

VII. FAQs (Frequently Asked Questions)

A.What is the difference between lossless and lossy compression?

B.What is the difference between compression rate and compression ratio?

C.What is the difference between ``data compression theory'' and ``source

coding theory''?

D.What does ``stationary'' mean?

E.What does ``ergodic'' mean?

A.What is the difference between lossless and lossy compression?

In lossless data compression, the compressed-then-decompressed data is an

exact

replication of the original data. On the other hand, in lossy data compression,

the

decompressed data may be different from the original data. Typically, there

is some distortion

between the original and reproduced signal.

The popular WinZip program is an example of lossless compression. JPEG

is an example of

lossy compression.

B.What is the difference between compression rate and compression ratio?

Historically, there are two main types of applications of data compression:

transmission and

storage. An example of the former is speech compression for real-time

transmission over

digital cellular networks. An example of the latter is file compression (e.g.

Drivespace).

The term ``compression rate'' comes from the transmission camp, while

``compression ratio''

comes from the storage camp.

Compression rate is the rate of the compressed data (which we imagined to

be transmitted in

``real-time''). Typically, it is in units of bits/sample, bits/character, bits/pixels,

or bits/second.

Compression ratio is the ratio of the size or rate of the original data to the

size or rate of the

compressed data. For example, if a gray-scale image is originally

represented by 8 bits/pixel

(bpp) and it is compressed to 2 bpp, we say that the compression ratio is 4-

to-1. Sometimes,

it is said that the compression ratio is 75%.

Compression rate is an absolute term, while compression ratio is a relative

term.

We note that there are current applications which can be considered as both

transmission and

storage. For example, the above photograph of Shannon is stored in JPEG

format. This not

only saves storage space on the local disk, it also speeds up the delivery of

the image over

the internet.

C.What is the difference between ``data compression theory'' and ``source

coding

theory''?

There is no difference. They both mean the same thing. The term ``coding''

is a general term

which could mean either ``data compression'' or ``error control coding''.

D.What does ``stationary'' mean?

Mathematically, a random process is stationary (or strict-sense stationary) if

for every

positive integers and the vectors:

and

have the same probability distribution.

We can think of ``stationarity'' in terms of our library example. Suppose that

we look at the

first letter of all 100 million books and see how often the first letter is `a', how

often it is `b',

and so on. We will then get a probability distribution of the first letter of the

(random) book.

If we repeat this experiment for the fifth and 105th letters, we will get two

other distributions.

If all three distributions are the same, we are inclined to believe that the book

process is

stationary. To be sure, we should look at the joint distribution of the first and

second letters

and compare it with the joint distribution of the 101st and 102nd letters. If

these two joint

distributions are roughly the same, we will be more convinced that the book

process is

stationary.

In reality, the book process can not be stationary because the first character

can not be a

SPACE.

E.What does ``ergodic'' mean?

The strict mathematical definition of ergodicity is too complex to explain.

However, Shannon

offered the following intuitive explanation: ``In an ergodic process every

sequence produced

by the process is the same in statistical properties. Thus the letter

frequencies, (the pairwise)

frequencies, etc., obtained from particular sequences, will, as the lengths of

the sequences

increase, approach definite limits independent of the particular sequence.

Actually this is not

true of every sequence but the set for which it is false has probability zero.

Roughly the

ergodic property means statistical homogeneity.''

VIII. References

1.C. E. Shannon, A Mathematical Theory of Communication (free pdf version)

2.C. E. Shannon and W. Weaver, The Mathematical Theory of

Communication. ($13 paper

version)

3.C. E. Shannon, N. J. Sloane, and A. D. Wyner, Claude Elwood Shannon:

Collected Papers.

4.C. E. Shannon, ``Prediction and Entropy of Printed English," available in

Shannon: Collected

Papers.

5.C. E. Shannon, ``Coding Theorems for a Discrete Source with a Fidelity

Criterion," available

in Shannon: Collected Papers.

6.T. M. Cover and J. A. Thomas, Elements of Information Theory.

7.R. Gallager, Information Theory and Reliable Communication.

8.A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.

9.K. Sayood, Introduction to Data Compression.

10.M. Nelson and J.-L. Gailly, The Data Compression Book.

11.A. Leon-Garcia, Probability and Random Processes for Electrical

Engineering. (Student

Solution Manual).

12.A. Papoulis, Probability, Random Variables, and Stochastic Processes.

13.M. R. Schroeder, Computer Speech: Recognition, Compression, Synthesis.

14.G. Held and T. R. Marshall, Data and Image Compression: Tools and

Techniques.

15.D. Hankerson, P. D. Johnson, and G. A. Harris, Introduction to Information

Theory and

Data Compression .

Note: To anyone interested in information theory, I can highly recommend two

books: Shannon's

Collected Papers and Cover and Thomas' Elements of Information Theory.

Collected Papers

includes a 1987 interview of Shannon by OMNI Magazine, almost all of his

papers on information

theory and cryptography, his celebrated Master's Theses on A Symbolic

Analysis of Relay and

Switching Circuits, his Ph.D. dissertation on An Algebra for Theoretical

Genetics, his paper on

the Scientific Aspects of Juggling, his paper on A Mind Reading Machine, and

much more.

Elements of Information Theory is now the standard textbook on information

theory. I have been

using this textbook in my graduate-level information theory course for the past

seven years. I have

had nothing but positive feedback from students who have studied this book.

Nam Phamdo

Department of Electrical and Computer Engineering

State University of New York

Stony Brook, NY 11794-2350

phamdo@ieee.org

Home

Theory

Lossless

VQ