Kompression zurück Orginaltexte Links Fragen
Was bedeutet Kompression?
Letztendlich ist Kompression (Komprimieren = verdichten, zusammendrücken) auf verschiedene Bereiche des täglichen Lebens anwendbar. Speicherkapazität ist auch im Zeitalter von Gigabyte-Festplatte noch immer beschränkt und der Datentransfer im Internet fordert den Versand von Daten nicht in der Originalform, sondern im komprimierten Zustand (aus Zeit- und somit Kostengründen).
Weiterhin ist Kompression für Backups sinnvoll, da ein Realtime-Zugriff nicht notwendig ist. Kompression sollte natürlich nur begrenzt in Bereichen eingesetzt werden, in denen ein besonders schneller Zugriff auf Daten (z. B. in Realtime) notwendig ist, denn die Dekomprimierung fordert immer Rechenkapazität (mittlerweile sind die Rechner aber so schnell, daß in einigen Segmenten bereits in nahezu Realtime komprimiert und dekomprimiert wird; man denke an DoubleDisk von Microsoft).
Stichworte
ZIP, Zippen, Komprimieren, Komprimierung, Datenkompression, Dateikompression, Winzip, komprimierte Dateien, Videokompression, Kompression, mp3 , zip, png , jpg
Das grundsätzliche Prinzip
Die grundsätzliche Funktionsweise der Kompression kann in wenigen Worten zusammengefaßt werden. Es gibt zwei fundamentale Methoden:
Redundanzen (Wiederholungen) von Daten(-Strängen) werden erkannt und zusammengefaßt (verkürzte Datenmodellierung)
Unnötige und teilweise unwichtige Daten werden gelöscht.
Die Löschung von Daten kann natürlich nur vorgenommen werden, wenn diese nicht mehr benötigt werden und auch später zur Verarbeitung unrelevant sind. Diese Art der Kompression wird z. B. für Video- oder Audiodaten verwendet.
Achtung: natürlich können beide Prinzipien kombiniert werden.
Übersicht der gängigen Verfahren
LZW-Codierung (Lempel, Ziv, Welch)
Es wird ein Wörterbuch aufgebaut (ein Wort wird durch ein Byte definiert) und im Endeffekt wird für ein Folgewort nur noch ein Zeiger auf das erste Wort/das Wort im Wörterbuch gesetzt. Diese Kompression eignet sich natürlich sehr gut für Textdokumente, allerdings auch für Binarys, die viele gleiche Zeichenstränge beinhalten. LZW existiert mittlerweile in vielen Abwandelungen und in einigen sehr stark optimierten Formen. Die Kompression erfolgt verlustfrei, der Ursprungszustand des Dokumentes kann also wiederhergestellt werden.
RLE - Run Length Encoding
Einzelne gleiche Bytefolgen (z. B. 3, 3, 3, 3) werden zahlenmäßig zusammengefaßt (z. B. 43, also 4 x 3) und nacheinander gespeichert. Diese Methode eignet sich hervorragend für Schwarz/Weiß-Bilder, sofern es sich nicht um ein gleichmäßiges Störbild handelt (also 1,0,1,0), da dann eine Verfielfältigung der Dateigröße vorgenommen wird. Die Kompression ist auch verlustfrei.
Huffmann-Codierung
Alle Zeichen werden in einem Table gespeichert. Der Table wird so sortiert, daß häufig vorkommende Zeichen einen kürzeren Identifiaktionscode bekommen (z. B. ein oder zwei Bit), als ein nicht so häufig vorkommendes Zeichen (z. B. vier Bit). In Abwandelung kann auch ein Zeichenfolgentable gebildet werden. Letztendlich wird die Datei dann so gespeichert, daß ein Abbild der Originaldatei gebildet wird, in der dann nur die über den Table definierten Zeichen gespeichert werden (also anstelle eines E, dann ein Bit mit 0). Der Table muß bei dieser Komprimierungsmethode mit in die Datei eingefügt werden, da sonst eine Dekomprimierung unmöglich ist. Dieses Vorgehen kann den Nachteil bergen, daß eine kurze Datei somit länger werden kann. Auch diese Komprimierungsform ist verlustfrei.
Wavelet-Kompression
Das Objekte/die Datei wird komplett betrachtet und mittels mathematischer Berechnung (Koeffizienten, Hoch- und Tiefpaßfilterung, etc.) gesplittet (non-blocking Strategie). Letztendlich ist dieses Verfahren allerdings für dieses Tutorial zu komplex. Insgesamt ist die Methode nicht verlustfrei (also nicht für Backups geeignet) und in der Folge der Berechnung wird meistens noch einmal per RLE oder Huffmann zusätzlich komprimiert. Diese Methode ist bestens für Bilder geeignet.
Fraktale-Kompression
Ein Bild wird in verschiedene Fragmente (geometrische Strukturen) geteilt und diese Fragmente werden dann durch pushing / turning / zooming / sizing verglichen. Es werden sogenannte Domain- und Rangeblocks erzeugt (dynamisch). Zueinander ähnliche Blöcke werden gesucht; der Faktor der Ähnlichkeit kann eingestellt werden und hierüber bestimmt sich dann der Datenverlust. Jeder Domainblock wird mit mit jedem Rangeblock verglichen und außerdem wird die jeweilige Distanz zwischen den Domain- und dem transformierten Rangeblock ermittelt. Das Bild wird letztendlich durch eine Viezahl mathematischer Gleichungen präsentiert. Eine Dekodierung erfolgt durch Abarbeitung der Gleichungen und den Aufbau der Fragmente. Diese Komprimierung ist eigentlich nur für Bilder geeignet und auch nicht verlustfrei.
Übersicht der gängigen Datenformate
JPEG
Erkennt und entfernt Farbunterschiede / Helligkeitsunterschiede (für das menschliche Auge unrelevante Daten). Hierdurch ist eine gute Kompression erreichbar (einstellbar; die Qualität nimmt mit höhere Komprimierung ab). Das Verfahren ist nicht verlustfrei.
GIF
Im Standardformat hat GIF eine 8-Bit-Schlüsselung und verwendet den LZW-Algorithmus. Die Komprimierung ist verlustfrei und es gibt verschiedene Arten der Speicherung und der späteren Darstellung (z. B. Rasterung während des Aufbaus).
PNG
Dienst als Nachfolgeformat der GIF-Definition. Die Farbtiefe ist bis 64-Bit (Photorealismus) möglich und es werden weitere Filter eingesetzt.
ZIP
Das ZIP-Format kombiniert diverse Komprimierungsverfahren (Huffmann, LZW, RLE und andere). Das ZIP-Format und der Einsatz der Kompressionsalgorithmen ist standardmäßig definiert.
http://www.educeth.ethz.ch/informatik/interaktiv/kompression/huffman.html
Allgemeine Beschreibung von Kompression in der EDV
Ein Verfahren,
Die einfachste Methode - nämlich die "Lauflängen-Kodierung" bzw. "run length compression" - beruht darauf, beispielsweise die 5 RGB-Werte aufeinanderfolgender Bildpunkte nicht auszuschreiben sondern gezählt abzuspeichern:
statt "12/56/177,12/56/177,12/56/177,12/56/17712/56/177"
also "5 mal 12/56/177"
Bei den Komprimierungs-Methoden werden prinzipiell u.a. unterscheiden:
Eine Komprimierungsmethode, bei der die Originaldaten erhalten werden - was für Programme, Texte oder Tabellen unumgänglich ist. Packer wie Winzip oder ARJ arbeiten mit dieser Methode (genauer: mit der von Lempel und Ziv entwickelten LZ77-Komprimierung in Kombination mit einer Huffman-Codierung).
Im allgemeinen ist eine verlustfreie Komprimierung bei digitalisierten Videos und gescannten Fotografien nicht sehr effektiv.
Bei verlustbehafteten Komprimierungs-Methoden gehen Informationen unwiederbringlich verloren. Diese Verfahren komprimieren sehr stark, können aber nur für Datentypen eingesetzt werden, bei denen Verluste wenig auffallen oder die sich verlustfrei nur schlecht komprimieren lassen - dazu zählen Audio-, Video- und Bilddaten
Die Datenmenge, die verlorengeht, hängt vom Grad der in der Regel einstellbaren Komprimierung ab. Wichtig zu wissen ist, dass viele verlustreiche Komprimierungsmethoden bei erneuter Komprimierung weitere Inhalte verlieren. Dieser zusätzliche Verlust ist allerdings abhängig von der Komprimierungsmethode; der QuickTime Video Codec wurde z.B. so entwickelt, daß der zusätzliche Verlust bei erneuter Komprimierung gering gehalten wird (siehe auch Videokompression).
Bei der räumlichen (Video-)Komprimierung werden die einzelnen Frames einer Video-Sequenz für sich komprimiert - z.B. nach der JPEG-Methode. Häufig auftretende Nebenwirkung der räumlichen Komprimierung sind Unschärfen, Blockbildungen (Kompressions-Artefakte), Streifen (Linien gleicher Farbe) und Konturbildungen (Bereiche gleicher Farbe).
Bei der zeitlichen (Video-)Komprimierung werden immer mehrere Frames einer Sequenz berücksichtigt, indem nämlich nur die Änderungen gegenüber dem Vorgänger-Frame gespeichert werden. Als Referenz werden allerdings in regelmäßigen Abständen Key-Frames berücksichtigt, die das vollständige Bild enthalten.
Die bekanntesten Komprimier- / Pack-Programme für "normale" Dateien sind:
Die bekanntesten Komprimier- / Pack-Programme für Bild-Dateien sind:
Die bekanntesten Komprimier- / Pack-Programme für Sound-Dateien sind:
MP3
Die bekanntesten Komprimier- / Pack-Programme für Film-Dateien sind:
The next story involves a small-but-rapidly-growing software company, which we'll call C. They implemented a file-compression algorithm that consisted of three things:
The Lempel-Ziv algorithm, which you should recall involves storing strings that have appeared previously in the file and replacing subsequent occurrences of that string by pointers to the original occurrence. For instance, if the appears many times, we can save by replacing all but the first to a short pointer to the byte of the file where the first the occurs.
An implementation of the dictionary data abstraction. Recall this abstraction, which I think is one of Don Knuth's neatest ideas, is a set with the methods ``insert,'' ``delete,'' and ``membership-test.'' The dictionary would hold the unique strings seen so far and their original locations in the file.
Because this story takes place in the 1980's, the amount of main memory available was really tiny. Often a PC had 64K bytes, or less. Thus, they had to be careful not to store too many strings in the dictionary, or they would run out of main memory, and performance of the data-compression would drop.
At this point, you should stop and think how you would solve this problem. That is, you need to pick some dictionary implementation, and modify it so that it will store only a limited number of strings, and yet give the correct answers to queries that ask whether a given string is in the set stored. Technically, you should think with the mind of someone from 1985, but I suspect that the method you come up with is likely to have been known in 1985, and moreover, taught to undergraduates of that year and well before.
Anyway, back to our story. C was sued by a company we'll call D, which had a patent on the three steps that I outlined above. Well not exactly. They couldn't patent Lempel-Ziv, a nice idea that was in the public domain. They tried to patent the dictionary, but that wasn't going to hold up, since the abstraction is in Knuth's 1973 book. However, they also patented a particular method for solving the third problem: limiting the size of the dictionary.
I have to be careful not to reveal too much detail, but let us say that one of C and D used a hash table with a limited number of elements per bucket, and the other used a binary search tree with a limited depth. D sued C for patent infringement, and I enlisted to defend C. At the trial, the jury learned more about the background of the case. Previously, C had licensed the software of D to include in its popular software product. Later, C decided to drop D's software and implement its own version. That decision was a disaster for D, and apparently the jury thought more about this aspect of the case than about the issue of whether or not the solution of D in the patent was or was not ``obvious.''
In retrospect I should have been more careful about the implications of a decision in favor of C. Even though I believe, and will try to argue shortly, that the case reveals a logical flaw in the way patents are treated by the courts, I also feel that it is the responsibility of all in the scientific community to support the ethical handling of intellectual property.
The following is the text of Stac Electronics' patent infringement complaint against Microsoft Corp.:
UNITED STATES DISTRICT COURT
CENTRAL DISTRICT OF CALIFORNIA
STAC ELECTRONICS, a California corporation, Plaintiff,v.
MICROSOFT CORPORATION, a Delaware corporation, Defendant.
IRELL & MANELLA Morgan Chu Wayne M. Barsky Mark A. Flagel Jeffrey L. Arrington Steven S. Weiner 1800 Avenue of the Stars Los Angeles, California 90067-4276 Telephone: (310) 277-1010
SHEA & GOULD John Kidd Nicholas L. Coch 1251 Avenue of the Americas New York, New York 10020 Telephone: (212) 827-3000
Attorneys for Plaintiff Stac Electronics
COMPLAINT FOR PATENT INFRINGEMENT
DEMAND FOR JURY TRIAL
Plaintiff Stac Electronics (``Stac'') demands a jury trial on all issues and alleges as follows:
JURISDICTION AND VENUE
1. This is an action for patent infringement arising under the Patent Act of the United States, 35 U.S.C. 271 and 281. This Court has subject matter jurisdiction over the matters complained of under 28 U.S.C. 1338(a) and 1331.
2. Venue is proper in this judicial district pursuant to 28 U.S.C. 1400(b) and 1391(c), as defendant Microsoft Corporation (``Microsoft'') resides and has committed acts of patent infringement in this judicial district.
THE PARTIES
3. Stac is a corporation organized and existing under the laws of the State of California and has its principal place of business at 5993 Avenida Encinas, Carlsbad, California 92008. Stac designs, develops, markets and supports data compression/decompression products that increase the effective capacity of computer-related storage devices and the transmission rates of data communication systems. Stac is the leading supplier of high performance data compression products for personal computers.
4. Microsoft is a corporation organized and existing under the laws of the State of Delaware. Microsoft's principal place of business is located at One Microsoft Way, Redmond, Washington 98052. Microsoft develops, produces and markets, among other things, a broad range of software for business and professional use, including operating systems, languages and applications programs. Microsoft is the world's largest software company, with reported revenues in excess of $2.7 billion in fiscal 1992, and some 12,000 employees in 27 countries. Microsoft produces and markets, among other products, the MS-DOS operating system software for IBM and IBM-compatible personal computers.
FACTUAL BACKGROUND
5. An operating system is a group of programs that, among other tasks, translates user commands to the computer, schedules and implements the execution of applications programs, allocates computer memory, and manages the flow of information and communication among various components of the computer system. Application programs ``attach'' to the underlying operating system and, when called upon to do so by the operating system, perform discrete functions such as word processing, graphics and spreadsheet operations.
6. MS-DOS is currently installed on in excess of 100 million IBM and IBM-compatible personal computers, and Microsoft ships more than 20 million units of MS-DOS every year. Microsoft's MS-DOS is the predominant operating system in the IBM and IBM-compatible personal computer market. MS-DOS is marketed principally to original equipment manufacturers (OEMs) under agreements that allow the OEMs to distribute the MS-DOS operating system software with their personal computers.
7. Stac is the manufacturer and publisher of a data compression utility program known as STACKER, which attaches to DOS operating systems such as MS-DOS, as well as other operating systems. Utility programs are a large and diverse family of application programs that are designed to enhance personal computer performance. The principal function of STACKER is to compress data stored on the hard disk of IBM and IBM-compatible personal computers when the data is not being used, and later decompress such data when it is to be used, thereby increasing the effective storage capacity of the computer.
8. Although there are a number of software companies which offer data compression programs, Stac is the acknowledged industry leader in developing and marketing data compression technology, and STACKER is currently the best-selling data compression program for use on the DOS operating system. Stac's proprietary data compression technology, developed over the course of five years and at substantial cost, is protected by a number of patents, including U.S. Patent Nos. 5,016,009 (the "009" patent) and 4,701,745 (the "745" patent) (collectively, the "patents in suit").
9. Stac's proprietary data compression technology is relied upon daily by more than four million computer users worldwide. STACKER is the winner of PC Magazine's Technical Excellence Award, Windows Magazine's WIN 100 Award, PC Magazine's Editor's Choice Award, PC Computing's Most Valuable Product Award, Byte Magazine's Best of Comdex-Finalist Award, and National Software Testing Lab's Recommendation (Five Stars), and is the recipient of numerous additional industry accolades.
10. Largely as a result of the tremendous market acceptance of STACKER, which utilizes Stac's proprietary data compression technology, Stac quickly grew from a company with 25 employees and revenues of less than $1 million in 1989 to a company with more than 200 employees and a market capitalization in excess of $150 million today.
11. The personal computer software industry is characterized by rapid technological change which requires software developers continually to enhance existing products and develop new products. A critical factor in the success of a new or enhanced product is getting the product to market quickly in response to new user needs or technological advances, while at the same time maintaining the integrity and quality of the product.
12. It was a well-known fact in the personal computer industry as early as 1991 that Microsoft's MS-DOS 5.0 retail upgrade sales were rapidly declining with each passing quarter. As Microsoft's flagship product -- with approximately $700 million in revenue per year attributable to MS-DOS sales alone -- Microsoft was under intense market pressure to stimulate MS-DOS sales with an improved version of the MS-DOS operating system.
13. Due in part to the overwhelming market success of STACKER, the personal computer industry quickly recognized that an operating system capable of incorporating a high-quality data compression utility such as STACKER would be highly competitive. Indeed, the principal competitor of Microsoft's MS-DOS -- Novell's DR-DOS operating system -- had already incorporated a data compression utility in its operating system software.
14. Microsoft's Chairman and Chief Executive Officer, William H. Gates, became personally interested in Stac's proprietary data compression technology, and the possibility of using such technology in MS-DOS, in 1991, at approximately the same time that STACKER was receiving a number of coveted industry awards for technical excellence and overall product quality.
15. Mr. Gates met with Stac's President, Gary W. Clow, at the Fall Comdex-91 ceremony in Las Vegas. During a discussion which preceded the award ceremony, Mr. Gates said that Microsoft was considering including a data compression capability in the next release of MS-DOS. Mr. Gates further stated that Microsoft would not be developing this capability internally, but rather would seek to obtain another company's data compression technology for inclusion in MS-DOS. The Editor-in-Chief of PC Magazine, Michael J. Miller -- whose magazine would later that evening present its Technical Excellence Award to Stac -- told Mr. Gates before the ceremony began that STACKER was a first-rate product. Mr. Gates asked Mr. Clow to contact Microsoft after Mr. Clow returned to California, and Mr. Clow agreed.
16. In late 1991, as a result of Mr. Gates' interest, Mr. Brad Chase -- who was then Microsoft's Group Product Manager and who today is Microsoft's General Manager for MS-DOS -- and Mr. Clow began discussing the possibility of Microsoft licensing Stac's proprietary data compression technology for inclusion in future versions of the MS-DOS operating system.
17. During the ensuing months of negotiations, Microsoft proposed that Stac grant to Microsoft a world-wide license to incorporate STACKER data compression technology and know-how into future versions of its MS-DOS operating system software. Microsoft steadfastly refused, however, to offer to pay Stac any royalty for Stac's patented data compression technology.
18. Mr. Chase made it clear during the negotiations that Microsoft was considering including data compression capability in future versions of the MS-DOS operating system, and that if it were unable to reach an agreement with Stac, it would obtain this capability elsewhere, even though Microsoft believed -- as it told Stac on numerous occasions -- that STACKER was the best data compression product for the DOS market. When the subject of incorporating data compression technology other than Stac's arose, Mr. Clow reminded Mr. Chase and others that Stac owned patent rights to its data compression technology and would enforce its patents against any infringers. At least one draft agreement was provided to Microsoft that included a specific reference to Stac's '009 patent.
19. Microsoft attempted to persuade Stac that its proposal to incorporate Stac's proprietary data compression technology -- or, for that matter, any reliable data compression technology -- into the MS-DOS operating system would, if implemented, have an immediate and adverse effect on the viability of STACKER as an independently marketed product for the DOS market. Indeed, at one point during the negotiations, Microsoft presented Stac with a spreadsheet analysis purporting to detail the adverse impact on sales of STACKER -- Stac's flagship product -- in the event Microsoft and Stac failed to reach an agreement and Microsoft incorporated a different data compression utility in future versions of the MS-DOS operating system.
20. In approximately April of 1992, Stac broke off further discussions with Microsoft in light of Microsoft's failure to present a proposal that offered reasonable compensation to Stac for Microsoft's use of Stac's proprietary data compression technology.
21. In approximately June of 1992, Mr. Chase advised Mr. Clow that Microsoft was obtaining data compression technology for use in MS-DOS, but that Microsoft wanted to offer Stac one last chance to reach an agreement. In the ensuing discussions, it again became clear that Microsoft had no intention of paying any compensation to Stac in exchange for Stac's proprietary data compression technology. Discussions between Stac and Microsoft thereupon terminated for the second time.
22. Shortly thereafter, it became well known to the industry that a new version of its DOS operating system, MS-DOS version 6.0 (``MS-DOS 6.0"), would be released in the first six months of 1993 and that MS-DOS 6.0 would include a data compression utility, which Microsoft was to later call ``DoubleSpace.''
23. Before a new program (or new version of an existing program) is made available for retail distribution, the software developer will often distribute preliminary copies of the new software (the ``beta software'') to a large group of intended users (the ``beta sites"). The developer then seeks comments from the beta sites on the beta software's performance, thereby allowing it to identify and fix problems or ``bugs'' in the beta software that might have slipped through the developer's quality control procedures. Consistent with this practice, Microsoft commenced its MS-DOS 6.0 Beta Test Program in the second half of 1992.
24. On or about November 23, 1992, a telephone conference was held with Mr. Clow and Mr. Whiting of Stac, and Microsoft's Mr. Chase. During that conversation, Mr. Chase admitted that, during Microsoft's ``normal due diligence process,'' Microsoft had concluded that the DoubleSpace data compression utility of the MS-DOS 6.0 operating system software infringed Stac's '009 patent, one of the two patents in suit. Mr. Chase requested that Stac grant a license to Microsoft under Stac's '009 patent. After a brief discussion, Mr. Clow requested that Microsoft make a specific licensing proposal to Stac, and Mr. Chase agreed to do so. During this same telephone conference, Mr. Chase promised, in response to Stac's request, to make available to Stac a copy of the beta version of the MS-DOS 6.0 software.
25. Several weeks later, after not hearing back from Mr. Chase on a licensing proposal for Stac's patent, and after not receiving the promised beta version of MS-DOS 6.0, Mr. Clow wrote to Mr. Chase and again requested a copy of the MS-DOS 6.0 software. Mr. Clow wrote to Mr. Chase and again requested a copy of the MS-DOS 6.0 software. Mr. Clow explained that other software developers had access to the beta versions of MS-DOS 6.0, which was putting Stac at a competitive disadvantage. Mr. Clow also noted that Stac was still awaiting a specific licensing proposal from Microsoft for Stac's '009 patent.
26. Microsoft finally made the beta version of the MS-DOS 6.0 software available to Stac in January of 1993. At or about the same time, with respect to Microsoft's earlier admission regarding the infringement of Stac's '009 patent, Mr. Chase advised Stac in writing: ``Don't worry about the patent stuff. We are just going to keep with our changed code which does not infringe.''
27. After receiving the beta version of MS-DOS 6.0, Stac engineers determined that, whether or not the ``code'' had in fact been changed from earlier versions, as represented by Mr. Chase, the DoubleSpace data compression utility contained in the beta version of MS-DOS 6.0 infringes upon Stac's '009 and '745 patents.
28. On or about January 15, 1993, Mr. Chase of Microsoft provided Mr. Whiting and Mr. Clow of Stac with a preliminary press release for the Microsoft Real-time Compression Interface (``MRCI''). MRCI defines a compression standard for allowing vendors to design software and hardware products that utilize or ``build'' upon the Double Space data compression utility in the MS-DOS 6.0 operating system.
29. Microsoft's preliminary press release confirms that ``DoubleSpace, the integrated data compression technology ... will be available with the next major version of MS-DOS, MS-DOS 6.'' The preliminary press release also reveals that, in an effort to have the compression technology in MS-DOS 6.0 quickly adopted as an industry standard, Microsoft is now offering to license - for free - the infringing DoubleSpace technology to independent hardware and software vendors.
30. Mr. Chase sent Microsoft's preliminary press release to Mr. Whiting and Mr. Clow for the stated purpose of having Stac approve the following proposed quote for Microsoft's ultimate press release, drafted for Stac by Microsoft:
"We're excited about MS-DOS 6 and DoubleSpace because they create a large opportunity for boards, chips and add-on software to enhance the compression services that MS-DOS 6 and DoubleSpace offer ..."
31. On information and belief, Microsoft is taking the calculated risk of incorporating what it knows to be Stac's patented data compression technology in its MS-DOS 6.0 operating system in order to stimulate sales of its flagship product and respond to the intense market and financial community pressure to remain competitive and demonstrate continued growth.
FIRST CAUSE OF ACTION FOR PATENT INFRINGEMENT
32. Stac repeats and realleges, as if set forth in full, paragraphs 1 through 31 of this Complaint.#}
33. On May 14, 1991, the '009 patent, entitled "Data Compression Apparatus And Method," was granted to Stac. Since its issuance, Stac has been, and continues to be, the owner of all right, title and interest in and to the '009 patent. A copy of the '009 patent is attached as Exhibit A, and incorporated herein by reference.#}
34. Defendant Microsoft is infringing the '009 patent, in this judicial district and elsewhere, in connection with its activities pertaining to the beta version of its MS-DOS 6.0 operating system software for IBM and IBM-compatible personal computers, which embodies the inventions disclosed and claimed in the '009 patent.#}
35. Unless enjoined by the Court, Microsoft will continue to infringe Stac's '009 patent.#}
36. As a direct and proximate result of Microsoft's conduct, Stac has suffered and will continue to suffer irreparable injury, for which it has no adequate remedy at law. Stac has also been damaged, and, until an injunction issues, will continue to be damaged in its business and reputation in an amount yet to be determined. Moreover, the wilful and deliberate nature of Microsoft's infringement renders this an exceptional case, and thus Stac is further entitled to treble damages, as well as its actual attorneys' fees and litigation costs.#}
SECOND CAUSE OF ACTION FOR PATENT INFRINGEMENT
37. Stac repeats and realleges, as if set forth in full, paragraphs 1 through 31 of this Complaint.#}
38. On October 20, 1987, the '745 patent, entitled "Data Compression System," was granted to Ferranti, plc ("Ferranti"). Ferranti subsequently assigned all right, title and interest in and to the '745 patent to Stac, which is now the owner of all right, title and interest in and to the '745 patent. A copy of the '745 patent is attached as Exhibit B, and incorporated herein by reference.#}
39. Defendant Microsoft is infringing the '745 patent, within this judicial district and elsewhere, in connection with its activities pertaining to the beta version of the MS-DOS 6.0 operating system software, which embodies the inventions disclosed and claimed in the '745 patent.#}
40. Unless enjoined by the Court, Microsoft will continue to infringe Stac's '745 patent.#}
41. As a direct and proximate result of Microsoft's conduct, Stac has suffered, and will continue to suffer, irreparable injury, for which it has no adequate remedy at law. Stac has also been damaged and, until an injunction issues, will continue to be damaged in its business and reputation in an amount yet to be determined.#}
PRAYER FOR RELIEF
WHEREFORE, Stac prays for judgment against defendant Microsoft as follows:#}
1. For a judicial determination and declaration that the '009 patent is valid and enforceable;#}
2. For a judicial determination and declaration that the '009 patent is infringed by the beta version of the MS-DOS 6.0 operating system software, and such other Microsoft products as may infringe;#}
3. For a judicial determination and declaration that Microsoft's infringement of either or both of the '009 and '745 patents is wilful;#}
4. For a judicial determination and declaration that the '745 patent is valid and enforceable;#}
5. For a judicial determination and declaration that the '745 patent is infringed by the beta version of the MS-DOS 6.0 operating system software, and such other products as may infringe;#}
6. For an order preliminary and permanently enjoining Microsoft, its officers, directors, shareholders, agents, servants, employees and attorneys, and all entities and individuals acting in concert with them or on their behalf, from infringing the '009 and '745 patents;#}
7. For damages according to proof, trebled;
8. For Stac's attorneys' fees and litigation costs; and
9. For such other and further relief as the Court may deem just and proper.#}
Dated: January 25, 1993
Respectfully submitted,
IRELL & MANELLA
Morgan Chu
Wayne M. Barsky
Mark A. Flagel
Jeffrey L. Arrington
Steven S. Weiner
SHEA & GOULD
John Kidd
Nicholas L. Coch
By: Morgan Chu
Attorneys for Plaintiff
Stac Electronics
DEMAND FOR JURY TRIAL Stac demands a jury trial on all issues. Dated: January 25, 1993#}
Respectfully submitted,
IRELL & MANELLA
Morgan Chu
Wayne M. Barsky
Mark A. Flagel
Jeffrey L. Arrington
Steven S. Weiner
SHEA & GOULD
John Kidd
Nicholas L. Coch
By: Morgan Chu
Attorneys for Plaintiff
Stac Electronics UNITED STATES PATENT [19]
[11] Patent Number: 5,016,009 Whiting et al. [45] Date of Patent: May 14, 1991 [54] DATA COMPRESSION APPARATUS AND METHOD [75] Inventors: Douglas L. Whiting, South Pasadena;#}
Glen A. George; Glen E. Ivey, both of Pasadena, all of Calif. [73] Assignee: Stac, Inc., Pasadena, Calif. [21] Appl. No.: 297,152 [22] Filed: Jan. 13, 1989 [51] Int. Cl. (5) H03M 7/40; H03L 7/00 [52] U.S. Cl. 341/67; 341/95;#}
341/106; 375/112; 370/102 [58] Field of Search 375/27, 112; 358/261.1;#}
364/715.02; 341/51, 67, 106, 95; 370/102 [56] References Cited U.S. PATENT DOCUMENTS 3,976,844 8/1976 Betz 4,021,782 5/1977 Hoerning 4,054,951 10/1977 Jackson et al. 4,412,306 10/1983 Moll 4,464,650 8/1984 Eastman et al. 341/51 4,491,934 1/1985 Heinz 4,558,302 12/1985 Welch 341/51 4,612,532 9/1986 Bacon et al. 341/90 X 4,701,745 10/1987 Waterworth 4,814,746 3/1989 Miller et al. 4,876,541 10/1989 Storer 341/67 X#}
OTHER PUBLICATIONS
J. Cleary et al. "Data Compression Using Adaptive Coding and Partial String Matching." IEEE Transactions on Communications, 32:396-403 (1984).
M Wells. "File Compression Using Variable Length Encodings," The Computer Journal, 15:308-313 (1972).
Primary Examiner - A.D. Pellinan Assistant Examiner - Sharon D. Logan Attorney, Agent, or Firm - Irell & Manella [57]
ABSTRACT An apparatus and method for converting an input data character stream into a variable length encoded data stream in a data compression system. The data compression system includes a history array means. The history array means has a plurality of entries and each entry of the history array means is for storing a portion of the input data stream. The method for converting the input data character stream includes the following steps. Performing a search in a history array means for the longest data string which matches the input data string. If the matching data string is found within the history buffer means, the next step includes encoding the longest matching data string found by appending to the encoded data stream a tag indicating the longest matching data string was found and a string substitution code. If the matching data string is not found within the history array means, the next step includes encoding the first character of the input data string by appending to the encoded data stream a raw data tag indicating that no matching data string was found and the first character of the input data string.
UNITED STATES PATENT [19] [11] Patent Number: 4,701,745
Waterworth [45] Date of Patent: Oct. 20, 1987 [54]
DATA COMPRESSION SYSTEM [75] Inventor: John R. Waterworth, Cheadle,
England [73] Assignee: Ferranti, plc, Cheshire, England [21] Appl. No.:
835,793 [22] Filed: Mar. 3, 1986 [30] Foreign Application Priority
Data Mar. 6, 1985 [GB] United Kingdom 8505790 [51] Int. Cl. (4)
H03M 7/30 [52] U.S. Cl. 340/347 DD; 364/900 [58] Field of
Search 340/347 DD; 235/310;#}
358/260, 261; 364/900 [56] References Cited
U.S. PATENT DOCUMENTS 4,054,951 10/1977 Jackson 364/900 Primary Examiner - Charles D. Miller Attorney, Agent or Firm - Kerkam, Stowell, Kondracki & Clarke [57]
ABSTRACT A data compression system includes an input store (1) for receiving and storing a plurality of bytes of data from an outside source. Data processing means for processing successive bytes of data from the input store includes circuit means (21-25) operable to check whether a sequence of bytes is identical with a sequence of bytes already processed, output means (27) operable to apply to a transfer medium (12) each byte of data not forming part of such an identical sequence, and an encoder (26) responsive to the identification of such a sequence to apply to the transfer means (12) an identification signal which identifies both the location in the input store of the previous occurrence of the sequence of bytes and the number of bytes in the sequence.#}
CONTACT: IRELL & MANELLA | Morgan Chu | Wayne M. Barsky | Mark A. Flagel| Jeffrey L. Arrington | Steven S. Weiner | 1800 Avenue of the Stars | Los Angeles, California 90067-4276 | Telephone: (310) 277-1010 | or | SHEA & GOULD | John Kidd | Nicholas L. Coch | 1251 Avenue of the Americas | New York, New York 10020 | Telephone: (212) 827-3000
Beispielprojekt an Hand des LZWs
Option Explicit
Private Type BIdx
pLinks As Long
pRechts As Long
Woerterbuchindex As Long
End Type
Dim Woerterbuch(4096) As String
Dim NaechsterWoerterbuchindex As Long
Dim Heap(4096) As BIdx
Dim NaechsterHeapIndex As Long
Dim pStr As Long
--------------------------------------------------------------------------------
Sub InitWoerterbuch()
'In dieser Sub wird das Woerterbuch
'initialisiert und mit Standardwerten belegt
Dim i As Integer
For i = 0 To 255
Woerterbuch(i) = Chr(i)
Next i
NaechsterWoerterbuchindex = 256
NaechsterHeapIndex = 0
End Sub
--------------------------------------------------------------------------------
Function AddToWoerterbuch(s As String) As Long
'In dieser Sub wird dem Woerterbuch ein
'Begriff hinzugefügt
If Len(s) = 1 Then
AddToWoerterbuch = Asc(s)
Else
AddToWoerterbuch = AddToBTree(0, s)
End If
End Function
--------------------------------------------------------------------------------
Function AddToBTree(ByRef Node As Long, _
ByRef s As String) As Long
Dim i As Integer
If Node = -1 Or NaechsterHeapIndex = 0 Then
Woerterbuch(NaechsterWoerterbuchindex) = s
Heap(NaechsterHeapIndex).Woerterbuchindex = _
NaechsterWoerterbuchindex
NaechsterWoerterbuchindex = NaechsterWoerterbuchindex _
+ 1
Heap(NaechsterHeapIndex).pLinks = -1
Heap(NaechsterHeapIndex).pRechts = -1
Node = NaechsterHeapIndex
NaechsterHeapIndex = NaechsterHeapIndex + 1
AddToBTree = -1
Else
i = StrComp(s, Woerterbuch(Heap(Node).Woerterbuchindex))
If i < 0 Then
AddToBTree = AddToBTree(Heap(Node).pLinks, s)
ElseIf i > 0 Then
AddToBTree = AddToBTree(Heap(Node).pRechts, s)
Else
AddToBTree = Heap(Node).Woerterbuchindex
End If
End If
End Function
--------------------------------------------------------------------------------
Private Sub SchreibeStringBuffer(s As String, s2 As String)
Do While pStr + Len(s2) - 1 > Len(s)
s = s & Space(100000)
Loop
Mid$(s, pStr) = s2
pStr = pStr + Len(s2)
End Sub
--------------------------------------------------------------------------------
Function Komprimieren(IPStr As String) As String
Dim TmpStr As String
Dim Ch As String
Dim Woerterbuchindex As Integer
Dim LetzterWoerterbuchindex As Integer
Dim ErsterBegriffinFolge As Boolean
Dim HalfCh As Integer
Dim i As Long
Dim ostr As String
Call InitWoerterbuch
ErsterBegriffinFolge = True
pStr = 1
For i = 1 To Len(IPStr)
Ch = Mid$(IPStr, i, 1)
Woerterbuchindex = AddToWoerterbuch(TmpStr & Ch)
If Woerterbuchindex = -1 Then
If ErsterBegriffinFolge Then
HalfCh = (LetzterWoerterbuchindex And 15) * 16
Else
SchreibeStringBuffer ostr, Chr(HalfCh Or _
(LetzterWoerterbuchindex And 15))
End If
SchreibeStringBuffer ostr, _
Chr(LetzterWoerterbuchindex \ 16)
ErsterBegriffinFolge = Not ErsterBegriffinFolge
TmpStr = Ch
LetzterWoerterbuchindex = Asc(Ch)
Else
TmpStr = TmpStr & Ch
LetzterWoerterbuchindex = Woerterbuchindex
End If
Next i
SchreibeStringBuffer ostr, _
IIf(ErsterBegriffinFolge, Chr(LetzterWoerterbuchindex _
\ 16) & Chr((LetzterWoerterbuchindex And 15) * _
16), Chr(HalfCh Or (LetzterWoerterbuchindex And _
15)) & Chr(LetzterWoerterbuchindex \ 16))
Komprimieren = Left(ostr, pStr - 1)
End Function
--------------------------------------------------------------------------------
Function GC(str As String, position As Long) As Integer
GC = Asc(Mid$(str, position, 1))
End Function
--------------------------------------------------------------------------------
Function DeKomprimieren(IPStr As String) As String
Dim Woerterbuchindex As Integer
Dim ErsterBegriffinFolge As Boolean
Dim i As Long
Dim s As String
Dim s2 As String
Call InitWoerterbuch
pStr = 1
i = 1
ErsterBegriffinFolge = True
Do While i < Len(IPStr)
If ErsterBegriffinFolge Then
Woerterbuchindex = (GC(IPStr, i) * 16) Or _
(GC(IPStr, i + 1) \ 16)
i = i + 1
Else
Woerterbuchindex = (GC(IPStr, i + 1) * 16) Or _
(GC(IPStr, i) And 15)
i = i + 2
End If
ErsterBegriffinFolge = Not ErsterBegriffinFolge
If i > 2 Then
If Woerterbuchindex = NaechsterWoerterbuchindex _
Or (Woerterbuchindex = 256 _
And NaechsterWoerterbuchindex _
= 4096) Then
AddToWoerterbuch s2 & Left$(s2, 1)
Else
AddToWoerterbuch s2 & Left$(Woerterbuch( _
Woerterbuchindex), 1)
End If
End If
s2 = Woerterbuch(Woerterbuchindex)
SchreibeStringBuffer s, s2
Loop
DeKomprimieren = Left(s, pStr - 1)
End Function
--------------------------------------------------------------------------------
Sub Start()
Dim KomprS As String
Screen.MousePointer = vbHourglass
'Kompression aufrufen
KomprS = Komprimieren(Form1.Text1)
'Übergabe des komprimierten Textes
Form1.Text6 = KomprS
'DeKompression des komprimierten Textes
Form1.Text2 = DeKomprimieren(KomprS)
'Länge des Originaltextes ermitteln
Form1.Text3 = Len(Form1.Text1)
'Länge des komprimierten Textes ermitteln
Form1.Text4 = Len(KomprS)
'Status einfügen
If Form1.Text1 <> Form1.Text2 Then
Form1.Text5 = "Fehler"
Else
Form1.Text5 = "fertig"
End If
Screen.MousePointer = vbNormal
End Sub
Hallo!Guter Tip, hast du auch was zu Hamming-Code?Grüsse,Hirf: : wer sich für den Huffmann-Kompressions-Algorithmus interessiert, findet ein Beispiel unter http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=1&txtCodeId=11000: Danke für den Tipp.: Dabei haben wir ein solch schones Kompressionstutorial - mit LZW-Beispiel :-)
Data Compression
Information theory provides a method for determining exactly how many bits are required to specify a given message to a given precision. This method is called the theory of data compression or, more technically, rate distortion theory. Often the message selected at the source need not (or cannot) be transmitted perfectly to the destination. For example, in a telephone conversation, extremely high sound quality is not necessary. It is usually sufficient that the two parties recognize and understand each other. Similarly, if the message is a scientific measurementfor example, a measurement of the number = 3.14159265…it is not possible to transmit all of the digits in a finite amount of time. However, a useful approximation of can be transmitted with a relatively small number of bits.
The general idea of data compression theory is illustrated in the graph below. The horizontal axis measures the distortion, or imprecision, that is tolerable in a given message. The vertical axis gives the minimum possible number of bits required to specify the message with this distortion. The graph shows that as the acceptable distortion becomes smaller and smaller, the required number of bits becomes larger and larger. Conversely, as the allowed distortion becomes larger, the required number of bits decreases. Ultimately, the number of required bits becomes zero. The number becomes equal to zero when the allowed distortion can be achieved by merely guessing at the message.
Consider the following simple example. If the message is the outcome of the toss of an ordinary coin and it is acceptable to be wrong 50 percent of the time, then no bits are required. In this case, one can simply guess heads and be sure of being right 50 percent of the time, on average.
The relationship shown by the rate-distortion curve represents the absolute minimum number of bits, on the average, required to represent a message with a given distortion. The exact shape of the curve depends on the details of the particular situation. Shannon's Fundamental Theorem of Data Compression states that, in principle, it is possible to compress a message to a given level, but no more. Exactly how this compression should be done he did not say, but later researchers developed practical techniques for data compression based on Shannon's theorems. These techniques use mathematical algorithms, or repeated computations, to compress data. Many of these methods come very close to achieving the ultimate limits set forth by information theory.
Unicode http://www.unicode.org/unicode/standard/translations/german.html
LZ77-Animation http://www.data-compression.com/lempelziv.html
Lempel-Ziv- Kodierung (LZ77) http://www.uni-karlsruhe.de/~un55/lz.html
GNU zip (kostenlos) http://www.gzip.org/
Entropie http://www.tu-bs.de/institute/pci/aggericke/PC1/Kap_III/Entropie_und_Information.htm
http://www.educeth.ethz.ch/informatik/interaktiv/kompression/huffman.html
Komprimierbarkeit
Informatik, Mathematik, Physik, Sprachwissenschaft
Gemeinsamkeit im Konzentrat [28.01.2002]
Die meisten Computernutzer werden das hilfreiche ZIP-Programm kennen, mit dem sich umfangreiche Datenbestände zu kleinen Dateien komprimieren lassen. Nun fanden Wissenschaftler heraus, dass die Software wesentlich mehr kann, als nur Datenberge zu schrumpfen:
Sie eignet sich auch dazu, den Informationsgehalt von Texten zu bestimmen und so deren Autorenschaft, die Sprache und sogar einen Sprachstammbaum abzuleiten.
Das Nibelungenlied
So beginnt die bedeutendste mittelhochdeutsche Heldensage - das Nibelungenlied. Die 39 Abschnitte der Dichtung, die "Aventiuren", erzählen uns die Geschichte von Siegfried und Kriemhild und ihrer späterer Rache an den Nibelungen. Der Schöpfer dieses berühmten Werks ist jedoch unbekannt. Zwar gibt es durchaus einige Vermutungen, und so manche Ortschaft preist ihre einstigen Schreiberlinge als mutmaßliche Urheber - doch Sicherheit gibt es nicht. In Zukunft ließe sich ein Verdacht unter Umständen jedoch sehr schnell überprüfen, dazu bedürfte es einzig und allein eines Computers, einer herkömmlichen Komprimierungssoftware, die mit ZIP-Archiven umgehen kann, und einiger Textpassagen des zu überprüfenden Dokuments sowie einige der in Frage kommenden Autoren.
Denn Dariao Benedetto, Emanuele Caglioti und Vittorio Loreto von der Università degli Studi di Roma "La Sapienza" fanden heraus, dass sich das hilfreiche Programm bestens dazu eignet, den Informationsgehalt von Texten zu vergleichen. Dabei entspricht eine Zunahme an Information einer Abnahme der Entropie des Systems - hier eines Textes. Und genau diese Entropie liefert der Lempel-Ziv-Algorithmus (LZ77), welcher der ZIP-Software zugrunde liegt, auf denkbar einfach Weise: Denn die Länge der Zeichenfolge nach dem Komprimieren geteilt durch die ursprüngliche Länge strebt im Grenzwert für unendlich lange Textpassagen gegen den Wert ihrer Entropie.
Doch wie lässt sich das ausnutzen, um Dokumente zu
vergleichen? Dazu komprimierten die Forscher zunächst eine
bekannte Zeichenfolge und notierten sich deren Länge.
Danach fügten sie der ursprünglichen Abfolge ein kurzes
Fragment einer zu prüfenden Sequenz an und ließen auch
dieses Datenpaket mit dem Algorithmus schrumpfen. Die
Längendifferenz zwischen den beiden komprimierten
Zeichenabfolgen liefert nun ein Maß dafür, wie nahe sich die
beiden ursprünglichen Zeichenreihen stehen. Dabei lässt sich
das Verfahren nicht nur auf Zeichen und Texte, sondern auf
nahezu beliebige Datensätze anwenden.
Die Forscher prüften ihr Konzept jedoch an Texten. In einem
ersten Test sollte damit die Sprache von Schriftstücken
bestimmt werden. Dazu komprimierten sie jeweils eine lange
Passage verschiedensprachiger Texte mit einem kurzen zu
testenden Abschnitt. Ihre Annahme war dabei, dass die
komprimierte Datei gerade dann am kleinsten ausfällt, wenn
das kürzere Textstück in derselben Sprache verfasst ist wie
der größere Teil. Benedetto, Caglioti und Loreto verwendeten
für ihren Test zehn offizielle Sprachen der EU in allen möglichen
Kombinationen. Das Ergebnis war erstaunlich: Tatsächlich
lieferten die Textkombinationen aus gleichen Sprachen jeweils
die kleinsten Dateien.
Anschaulich lässt sich das verstehen, wenn man sich das
Prinzip des LZ77-Algorithmus vergegenwärtigt: Dieser legt
nämlich eine Art Wörterbuch für bereits verwendete
Sequenzen an. Wenn dann im weiteren Verlauf diese Sequenz
wiederholt auftritt, so wird nur noch auf die entsprechende
"Hausnummer" im Wörterbuch verwiesen. Ein Text, welcher
Sprache auch immer, beinhaltet stets wiederkehrende
Zeichenfolgen - man denke nur an sich wiederholende Wörter,
aber auch bestimmte kürzere Abfolgen sind typisch, wie im
Deutschen beispielsweise "sch", "tz" oder ähnliches. Ändert
sich nun aber plötzlich beim Komprimieren die Sprache eines
Dokuments, dann wird der Algorithmus zunächst nicht mehr so
viele Treffer im Wörterbuch finden und produziert
entsprechend größere Dateien. Das fällt jedoch nur auf, wenn
der angehängte fremdsprachliche Abschnitt kurz genug ist.
Denn bei längeren Anhängseln hat der Algorithmus wieder
genug Zeit zu lernen, und Abweichungen in der Dateigröße
fallen nicht mehr so stark ins Gewicht. Typischerweise belegten
die umfangreichen Dateien zwischen 32 und 64 Kilobyte und
die kleinen 1 bis 15 Kilobyte.
Die Wissenschaftler prüften aber nicht nur, inwieweit sich die
Sprache wiederkennen lässt, sondern auch, ob sich so der
Autor eines Stücks offenbart. Das Prinzip war dasselbe wie
beim Sprachentest: Große Passagen mit bekanntem Autor
wurden mit Testfragmenten komprimiert. Und auch hier war die
Treffsicherheit verblüffend: In 93 Prozent der Fälle wies die
minimale Dateigröße auf denselben Schöpfer hin.
In einem letzten Experiment versuchten die Forscher das
Verfahren zu nutzen, um Sprachen zu klassifizieren. Dazu
verwendeten sie die "Allgemeine Erklärung der
Menschenrechte", da deren Formulierung in sehr vielen
Sprachen verfügbar ist. Dabei lagen alle Texte im so
genannten Unicode vor, um auch jegliche Schriftzeichen
erfassen zu können. Schließlich bedienten sie sich einer
Methode, die normalerweise zur Stammbaumanalyse
biologischer Sequenzen dient, im Prinzip wurden dazu jedoch
wieder Textpassagen in gewohnter Manier anhand ihrer
ZIP-Archive verglichen. Heraus kam ein Stammbaum für 50
Sprachen, der erstaunlich genau dem von Linguisten
entwickelten System ähnelt: So bildeten sich beispielsweise
auch hier die großen zusammenhängenden Sprachklassen
heraus.
Wenngleich Benedetto, Caglioti und Loreto ihr Verfahren in
erster Linie an Texten verschiedener Sprachen testeten, so
taugt es doch auch für unzählige andere Bereiche, bei denen
Zeichenabfolgen untersucht werden: DNA- und
Protein-Squenzierungen, Aktienmarktanalysen und
medizinische Kontrollen sind dabei nur einige wenige Beispiele.
Und wer weiß, vielleicht lässt sich mit dieser Methode auch
irgendwann der Dichter des Nibelungenlieds finden.
Thorsten Krome © wissenschaft-online
Quellen
Physical Review Letters 88: 048702 (2002), Abstract ScienceNow
LZ77-Animation http://www.data-compression.com/lempelziv.html
Theory of Data Compression
Contents
I. Introduction and Background
In his 1948 paper, ``A Mathematical Theory of Communication,'' Claude E. Shannon formulated the theory of data compression. Shannon established that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. The exact value of H depends on the information source --- more specifically, the statistical nature of the source. It is possible to compress the source, in a lossless manner, with compression rate close to H.
It is mathematically impossible to do better than H.
Shannon also developed the theory of lossy data compression. This is better known as rate-distortion theory. In lossy data compression, the decompressed data does not have to be exactly the same as the original data. Instead, some amount of distortion, D, is tolerated. Shannon showed that, for a given source (with all its statistical properties known) and a given distortion measure, there is a function, R(D), called the rate-distortion function. The theory says that if D is the tolerable amount of distortion, then R(D) is the best possible compression rate.
When the compression is lossless (i.e., no distortion or D=0), the best possible compression rate is R(0)=H (for a finite alphabet source). In other words, the best possible lossless compression rate is the entropy rate. In this sense, rate-distortion theory is a generalization of lossless data compression theory, where we went from no distortion (D=0) to some distortion (D>0).
Lossless data compression theory and rate-distortion theory are known collectively as source coding theory. Source coding theory sets fundamental limits on the performance of all data compression algorithms. The theory, in itself, does not specify exactly how to design and implement these algorithms. It does, however, provide some hints and guidelines on how to achieve optimal performance. In the following sections, we will described how Shannon modeled an information source in terms of a random process, we will present Shannon lossless source coding theorem, and we will discuss Shannon rate-distortion theory. A background in probability theory is recommended.
II. Source Modeling
Imagine that you go to the library. This library has a large selection of books --- say there are 100 million books in this library. Each book in this library is very thick --- say, for example, that each book has 100 million characters (or letters) in them. When you get to this library, you will, in some random manner, select a book to check out. This chosen book is the information source to be compressed. The compressed book is then stored on your zip disk to take home, or transmitted directly over the internet into your home, or whatever the case may be.
Mathematically, the book you select is denoted by where represents the whole book, represents the first character in the book, represents the second character, and so on. Even though in reality the length of the book is finite, mathematically we assume that it has infinite length. The reasoning is that the book is so long we can just imagine that it goes on forever. Furthermore, the mathematics turn out to be surprisingly simpler if we assume an infinite length book. To simplify things a little, let us assume that all the characters in all the books are either a lower-case letter (`a' through `z') or a SPACE (e. e. cummings style of writing shall we say). The source alphabet, , is defined to be the set of all 27 possible values of the characters:
Now put yourself in the shoes of the engineer who designs the compression algorithm. She does not know in advance which book you will select. All she knows is that you will be selecting a book from this library. From her perspective, the characters in the book are random variables which take values on the alphabet . The whole book, is just an infinite sequence of random variables -- that is is a random process. There are several ways in which this engineer can model the statistical properties of the book.
A.Zero-Order Model: Each character is statistically independent of all other characters and the 27 possible values in the alphabet are equally likely to occur. If this model is accurate, then a typical opening of a book would look like this (all of these examples came directly from Shannon's 1948 paper):
xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd
This does not look like the writing of an intelligent being. In fact, it resembles the writing of a "monkey sitting at a typewriter.''
B.First-Order Model: We know that in the English language some letters occur more frequently than others. For example, the letters `a' and `e' are more common than `q' and `z'. Thus, in this model, the character are still independent of one another, but the probability distribution of the characters are according to the first-order statistical distribution of English text. A typical text for this model looks like this:
ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva nah brl
C.Second-Order Model: The previous two models assumed statistical independence from one character to the next. This does not accurately reflect the nature of the English language. For exa#ple, some letters in thi# sent#nce are missi#g. However, we are still able to figure out what those letters should have been by looking at the context. This implies that there are some dependency between the characters. Naturally, characters which are in close proximity are more dependent than those that are far from each other. In this model, the present character depends on the previous character but it is conditionally independent of all previous characters . According to this model, the probability distribution of the character varies according to what the previous character is.
For example, the letter `u' rarely occurs (probability=0.022). However, given that the previous character is `q', the probability of a `u' in the present character is much higher (probability=0.995). For a complete description, see the second-order statistical distribution of English text. A typical text for this model would look like this:
on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at teasonare fuso tizin andy tobe seace ctisbe
D.Third-Order Model: This is an extension of the previous model. Here, the present character depends on the previous two characters but it is conditionally independent of all previous characters before those: . In this model, the distribution of varies according to what are. See the third-order statistical distribution of English text. A typical text for this model would look like this:
in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre steps.
E.General Model: In this model, the book is an arbitrary stationary random process. The statistical properties of this model are too complex to be deemed practical.
This model is interesting only from a theoretical point of view.
Model A above is a special case of Model B. Model B is a special case of Model C. Model C is a special case of Model D. Model D is a special case of Model E.
III. Entropy Rate of a Source
The entropy rate of a source is a number which depends only on the statistical nature of the source. If the source has a simple model, then this number can be easily calculated. Here, we consider an arbitrary source:
while paying special attention to the case where is English text.
A.Zero-Order Model: The characters are statistically independent of each other and every letter of the alphabet,, are equally likely to occur. Let be the size of the alphabet. In this case, the entropy rate is given by
For English text, the alphabet size is m=27. Thus, if this had been an accurate model for
English text, then the entropy rate would have been H=log2 27=4.75
bits/character.
B.First-Order Model: The characters are statistically independent. Let be the
size of the
alphabet and let be the probability of the i-th letter in the alphabet. The
entropy rate is
Using the first-order distribution, the entropy rate of English text would have
been 4.07
bits/character had this been the correct model.
C.Second-Order Model: Let be the conditional probability that the present
character is the
j-th letter in the alphabet given that the previous character is the i-th letter.
The entropy rate is
Using the second-order distribution, the entropy rate of English text would
have been 3.36
bits/character had this been the correct model.
D.Third-Order Model: Let be the conditional probability that the present
character is the
k-th letter in the alphabet given that the previous character is the j-th letter
and the one before
that is the i-th letter. The entropy rate is
Using the third-order distribution, the entropy rate of English text would have
been 2.77
bits/character had this been the correct model.
E.General Model: Let represents the first characters. The entropy rate in the
general
case is given by
where the sum is over all possible values of . It is virtually impossible to
calculate the
entropy rate according to the above equation. Using a prediction method,
Shannon has been
able to estimate that the entropy rate of the 27-letter English text is 2.3
bits/character (see
Shannon:Collected Papers).
We emphasize that there is only one entropy rate for a given source. All of the
above definitions for
the entropy rate are consistent with one another.
IV. Shannon Lossless Source Coding Theorem
Shannon lossless source coding theorem is based on the concept of block
coding. To illustrate this concept, we introduce a special information source in
which the alphabet consists of only two letters:
Here, the letters `a' and `b' are equally likely to occur. However, given that `a'
occurred in the previous character, the probability that `a' occurs again in the
present character is 0.9. Similarly, given that `b' occurred in the previous
character, the probability that `b' occurs again in the present character is 0.9.
This is known as a binary symmetric Markov source.
An n-th order block code is just a mapping which assigns to each block of n
consecutive characters
a sequence of bits of varying length. The following examples illustrate this
concept.
1.First-Order Block Code: Each character is mapped to a single bit.
Codeword
a
0.5
0
b
0.5
1
R=1 bit/character
An example:
Note that 24 bits are used to represent 24 characters --- an average of 1
bit/character.
2.Second-Order Block Code: Pairs of characters are mapped to either one,
two, or three bits.
Codeword
aa
0.45
0
bb
0.45
10
ab
0.05
110
ba
0.05
111
R=0.825 bits/character
An example:
Note that 20 bits are used to represent 24 characters --- an average of 0.83
bits/character.
3.Third-Order Block Code: Triplets of characters are mapped to bit sequence
of lengths one
through six.
Codeword
aaa
0.405
0
bbb
0.405
10
aab
0.045
1100
abb
0.045
1101
bba
0.045
1110
baa
0.045
11110
aba
0.005
111110
bab
0.005
111111
R=0.68 bits/character
An example:
Note that 17 bits are used to represent 24 characters --- an average of 0.71
bits/character.
Note that:
The rates shown in the tables are calculated from
where is the length of the codeword for block .
The higher the order, the lower the rate (better compression).
The codes used in the above examples are Huffman codes.
We are only interested in lossless data compression code. That is, given the
code table and
given the compressed data, we should be able to rederive the original data.
All of the
examples given above are lossless.
Theorem: Let be the rate of an optimal n-th order lossless data compression
code (in bits/character). Then
Since both upper and lower bounds of approach the entropy rate, H, as n goes
to infinity, we
have
Thus, the theorem established that the entropy rate is the rate of an optimal
lossless data
compression code. The limit exists as long as the source is stationary.
V. Rate-Distortion Theory
In lossy data compression, the decompressed data need not be
exactly the
same as the original data. Often, it suffices to have a reasonably close
approximation. A distortion measure is a mathematical entity which
specifies
exactly how close the approximation is. Generally, it is a function
which assigns
to any two letters and in the alphabet a non-negative number
denoted
as
Here, is the original data, is the approximation, and is the amount
of distortion between and . The most common distortion measures
are the
Hamming distortion measure:
and the squared-error distortion measure (which is only applicable when is a set
of numbers):
Rate-distortion theory says that for a given source and a given distortion
measure, there exists a
function, R(D), called the rate-distortion function. The typical shape of R(D)
looks like this:
If the source samples are independent of one another, the rate-distortion
function can be obtained
by solving the constrained minimization problem:
subject to the constraints that and
where is the distortion between the i-th and j-th letter in the alphabet. Using
Blahut's
algorithm, it may be possible to numerically calculate the rate-distortion
function.
Rate-distortion theory is based on the concept of block coding (similar to
above). A lossy block
code is known as a vector quantizer (VQ). The block length n of the code is
known as the VQ
dimension.
Theorem: Assuming that the source is stationary and the source samples are
independent. For every and for every , there exists a VQ of
(sufficiently large) dimension n with distortion no greater than
and rate no more than
Further more, there does not exist a VQ with distortion and rate less than
.
In essence, the theorem states that R(D) is the rate-distortion performance of
an optimal VQ. The
above theorem can be generalized to the case where the source is stationary
and ergodic.
VI. The Gap Between Theory and Practice
The theory holds when the block length n approaches infinity. In real-time
compression, the
compression algorithm must wait for n consecutive source samples before it
can begin the
compression. When n is large, this wait (or delay) may be too long. For
example, in real-time
speech compression, the speech signal is sampled at 8000 samples/second.
If n is say 4000,
then the compression delay is half a second. In a two-way conversation, this
long delay may
not be desirable.
The theory does not take into consideration the complexities associated with
the compression
and decompression operations. Typically, as the block length n increases,
the complexities
also increase. Often, the rate of increase is exponential in n.
The theory assumes that the statistical properties of the source is known. In
practice, this
information may not be available.
The theory assumes that there is no error in the compressed bit stream. In
practice, due to
noise in the communication channel or imperfection in the storage medium,
there are errors in
the compressed bit stream.
All of these problems have been successfully solved by researchers in the field.
VII. FAQs (Frequently Asked Questions)
A.What is the difference between lossless and lossy compression?
B.What is the difference between compression rate and compression ratio?
C.What is the difference between ``data compression theory'' and ``source
coding theory''?
D.What does ``stationary'' mean?
E.What does ``ergodic'' mean?
A.What is the difference between lossless and lossy compression?
In lossless data compression, the compressed-then-decompressed data is an
exact
replication of the original data. On the other hand, in lossy data compression,
the
decompressed data may be different from the original data. Typically, there
is some distortion
between the original and reproduced signal.
The popular WinZip program is an example of lossless compression. JPEG
is an example of
lossy compression.
B.What is the difference between compression rate and compression ratio?
Historically, there are two main types of applications of data compression:
transmission and
storage. An example of the former is speech compression for real-time
transmission over
digital cellular networks. An example of the latter is file compression (e.g.
Drivespace).
The term ``compression rate'' comes from the transmission camp, while
``compression ratio''
comes from the storage camp.
Compression rate is the rate of the compressed data (which we imagined to
be transmitted in
``real-time''). Typically, it is in units of bits/sample, bits/character, bits/pixels,
or bits/second.
Compression ratio is the ratio of the size or rate of the original data to the
size or rate of the
compressed data. For example, if a gray-scale image is originally
represented by 8 bits/pixel
(bpp) and it is compressed to 2 bpp, we say that the compression ratio is 4-
to-1. Sometimes,
it is said that the compression ratio is 75%.
Compression rate is an absolute term, while compression ratio is a relative
term.
We note that there are current applications which can be considered as both
transmission and
storage. For example, the above photograph of Shannon is stored in JPEG
format. This not
only saves storage space on the local disk, it also speeds up the delivery of
the image over
the internet.
C.What is the difference between ``data compression theory'' and ``source
coding
theory''?
There is no difference. They both mean the same thing. The term ``coding''
is a general term
which could mean either ``data compression'' or ``error control coding''.
D.What does ``stationary'' mean?
Mathematically, a random process is stationary (or strict-sense stationary) if
for every
positive integers and the vectors:
and
have the same probability distribution.
We can think of ``stationarity'' in terms of our library example. Suppose that
we look at the
first letter of all 100 million books and see how often the first letter is `a', how
often it is `b',
and so on. We will then get a probability distribution of the first letter of the
(random) book.
If we repeat this experiment for the fifth and 105th letters, we will get two
other distributions.
If all three distributions are the same, we are inclined to believe that the book
process is
stationary. To be sure, we should look at the joint distribution of the first and
second letters
and compare it with the joint distribution of the 101st and 102nd letters. If
these two joint
distributions are roughly the same, we will be more convinced that the book
process is
stationary.
In reality, the book process can not be stationary because the first character
can not be a
SPACE.
E.What does ``ergodic'' mean?
The strict mathematical definition of ergodicity is too complex to explain.
However, Shannon
offered the following intuitive explanation: ``In an ergodic process every
sequence produced
by the process is the same in statistical properties. Thus the letter
frequencies, (the pairwise)
frequencies, etc., obtained from particular sequences, will, as the lengths of
the sequences
increase, approach definite limits independent of the particular sequence.
Actually this is not
true of every sequence but the set for which it is false has probability zero.
Roughly the
ergodic property means statistical homogeneity.''
VIII. References
1.C. E. Shannon, A Mathematical Theory of Communication (free pdf version)
2.C. E. Shannon and W. Weaver, The Mathematical Theory of
Communication. ($13 paper
version)
3.C. E. Shannon, N. J. Sloane, and A. D. Wyner, Claude Elwood Shannon:
Collected Papers.
4.C. E. Shannon, ``Prediction and Entropy of Printed English," available in
Shannon: Collected
Papers.
5.C. E. Shannon, ``Coding Theorems for a Discrete Source with a Fidelity
Criterion," available
in Shannon: Collected Papers.
6.T. M. Cover and J. A. Thomas, Elements of Information Theory.
7.R. Gallager, Information Theory and Reliable Communication.
8.A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
9.K. Sayood, Introduction to Data Compression.
10.M. Nelson and J.-L. Gailly, The Data Compression Book.
11.A. Leon-Garcia, Probability and Random Processes for Electrical
Engineering. (Student
Solution Manual).
12.A. Papoulis, Probability, Random Variables, and Stochastic Processes.
13.M. R. Schroeder, Computer Speech: Recognition, Compression, Synthesis.
14.G. Held and T. R. Marshall, Data and Image Compression: Tools and
Techniques.
15.D. Hankerson, P. D. Johnson, and G. A. Harris, Introduction to Information
Theory and
Data Compression .
Note: To anyone interested in information theory, I can highly recommend two
books: Shannon's
Collected Papers and Cover and Thomas' Elements of Information Theory.
Collected Papers
includes a 1987 interview of Shannon by OMNI Magazine, almost all of his
papers on information
theory and cryptography, his celebrated Master's Theses on A Symbolic
Analysis of Relay and
Switching Circuits, his Ph.D. dissertation on An Algebra for Theoretical
Genetics, his paper on
the Scientific Aspects of Juggling, his paper on A Mind Reading Machine, and
much more.
Elements of Information Theory is now the standard textbook on information
theory. I have been
using this textbook in my graduate-level information theory course for the past
seven years. I have
had nothing but positive feedback from students who have studied this book.
Nam Phamdo
Department of Electrical and Computer Engineering
State University of New York
Stony Brook, NY 11794-2350
phamdo@ieee.org
Home
Theory
Lossless
VQ