GZIP is not enough!

Hey everybody. Welcome to another
fantastic episode of Google Developers Live,
where we here at Google get a chance to talk to you
about all the amazing things going on on the web, in the technology and the platforms, and the fantastic things
you can do with it. My name is Colt McAnlis. I’m a developer advocate
here on the Chrome team, and I mostly work with web
performance issues as well as web gaming. Now, what I’d like
to talk to you today about is something
that’s very near and dear and close to my heart,
and that’s compression. Today, we’re going to be talking
about GZIP on the web platform and how you can modify it
and address it and munge it to try to minimize your
website memory footprint and actually get
faster loading data. Now, before we get into
this talk too much, I’d actually like to point
out the #perfmatters hashtag. This is a fantastic
hashtag that’s being used all
around the intertubes on various social networks
for people in web performance who are trying to
find the same problems and address difficult issues
and have conversations about performance in
interesting and insightful ways. So use this hashtag. If anything during this talk
is interesting or inspiring, feel free to go to your favorite
social media network of choice. It’s a fantastic outlet, and
you need to be following it. So with that, let’s
get into things. So let’s start at the beginning. I can’t really talk about
compression on the web without actually taking a look
at the state of the web today. Now, besides being
filled with cats, there’s actually some
interesting data out there for us to look at. There’s a great site
called httparchive.org, and what this site
will do is actually run large numbers of websites
through processing filters during the day. So what will happen
is they’ll take 300,000, 400,000
very common websites, run them through a
processing system, and then scrape
information about them to present on their page. Now, what you can see
here in this graph is actually one of the
latest studies that shows the average bytes of
a page per content type. So what this graph
shows us is that images, for an average site,
actually take up the lion’s share of the content
that’s being streamed to users. In fact, it’s more than 50%. It’s actually a massive amount
of the content being delivered, while scripts actually remain
the second largest, ending up at around 260k or so. Now, the interesting
thing about this is when you look at the
individual response sizes, you see a similar
trend, of course, Flash being one of the largest
sizes per number of responses. But of course, that
comes down to the fact that you generally only have one
or two flashes per page where they tend to be pretty large,
whereas JavaScript or PNGs or JPEGs tend to be lots
of requests for a page. That actually drives
that number down. So what we’re looking
at here is basically a vision of the web that shows
that we’re dominated by images and we’re dominated
by image data. But I’m not
necessarily sure that means that we can give up on
trying to minimize and reduce the size of the text data
that actually drives our web. So let’s take a look at this. This is another great set
of charts from HTTP Archive. And effectively, what
you’re looking at here is three graphs showing
the transfer size against the number of requests. And so what you’re
seeing is over time, our number of requests have
remained pretty constant over the past couple
of years, while you see a general increasing
trend in the overall size per request. Now, this is
specifically three graphs for HTML, JavaScript, and CSS. What this is telling us is
that while images may make up the lion’s share of
the internet right now, text information, the backbone
of these three formats, isn’t going away
anytime soon, nor is it slated to get any smaller. So for most people,
they would say, well, if the web is mostly images,
why should we care about text? Well, here’s why. You see, images may
be fantastic and they may be the dominant
form of content for us, but it’s apparent and been
shown that they’re not as small as they could be. So for example, let’s
take this fantastic image, which was captured on a camera. This is about one
megapixel of data. Now, if we say
this is a PNG file, it actually comes out at
about 2.3 megabytes of data. Now, for those of you
playing the home game, you’ll note that
HTTP Archive averages that the average
website out there is about 1.1 to 1.2 megs of data. So this single PNG image,
if hosted on your website, is already larger
than the average size of a website on the internet. Now, PNG’s fantastic. It allows you to
get transparency, and it does support GZIP
compression internally as a format, but it really only
supports a lossless encoding, which means you’re
not really getting rid of any visual data
that may be redundant. JPEG, on the other hand, is a format built with both a lossy and a lossless encoder. This allows the JPEG encoding
system to actually remove visually redundant
information from the image so that you don’t
really notice it’s gone. The human eye is pretty
complex, but still, at 32 bits per pixel,
there’s a lot of information that we can’t discern. So if you JPEG this
image, you lose the ability to get
transparency, however, you do cut the size of the
image down to 297k. Now, these have been the
two mostly dominant image formats, besides animated
cat GIFs on the internet, but earlier this year,
internet technology evangelists and other enthusiasts
decided that maybe we weren’t done here yet. Maybe the compression
system of JPEG wasn’t good enough for
the rest of the web, and thus we had the
WebP format being born. Now, the WebP format is new. It’s not necessarily
supported by all browsers as of this date. But it’s actually got
some exciting properties that are really making
people on the web stand up and take notice. So first off, if you compress
this image with WebP, it actually drops to about 198k. That’s 100 kilobytes
of data difference for this single image. In addition to that,
WebP will actually allow you to get
transparency and even support some forms of animation,
which means this single image format gets you the
compression sizes of JPEG, the transparency
properties of PNG, as well as the animation
properties of GIFs. It’s an exciting,
one stop solution for a lot of your image needs. Now, this is the important
thing to note about this image size, and about
this image format, is that most mobile phones
right now are actually about five megapixels
for their cameras, which means people are taking
snapshots of the food they eat or signs on the
street or their kids, uploading them to their
favorite social media network, and that’s actually coming in
at five megapixels uncompressed. Now, if we look at PNG and
JPEG and the compression ratios they got with a
single megabyte image, you can see that over time,
as the number of images that fill the internet
increases, we’re going to quickly run into
compression and data size issues. That’s why WebP shows
up, and it actually solves a huge problem
for us and allows us to address a lot of issues. Now, there’s a whole separate
series of GDLs done on WebP. I encourage you to go
to the GDL website, take a look at
some of these talks for more information on how
to get started with that. Now, that’s all I’m
going to say about image compression for now. But in the meantime,
let’s talk about text. So I said before that text
data– CSS, HTML, JavaScript– actually drive the rendering
performance and initial page load of your page
more than images do. This is for a very
specific reason. So if you look at
this graph here, we actually can’t start
rendering anything on your page until the HTML has been parsed
and subsequent JavaScript and CSS may be loaded to create
both the DOM and the CSS DOM so that we can actually
parse the render tree and actually start getting
pixels on the screen. So if we look at this average
flow of events from HTML to CSS to actually
getting pixels there, we can actually see an
interesting example. So we have a standard HTML,
which actually links to a CSS. That CSS is updating
some of the information on the page, the text data. We also have an image
that’s referenced and a script that’s
at the bottom of it. Now you can see on the
mobile image on the side there that nothing
has been shown yet, even though this
HTML has been parsed. So we’ve got our little HTML
box that’s mostly orange. You can see that the DOM
has mostly been parsed. It’s partially orange. And that the CSS
file– example.css– has been discovered, but it
hasn’t really been loaded yet. So this means that we
actually cannot display text on the screen because we don’t
have the styling properties that define that
text on the screen. It’s not until the
CSS data actually gets loaded that we can build
the CSS version of the DOM and actually begin
constructing the render tree. Now notice that the DOM still
hasn’t finished loading yet. That’s OK. If we can actually partially
load and partially display the top part of the DOM, or
what people typically say is above the fold, that allows
us to actually start rendering pixels on the screen while
the bottom part of the page is still loading. So now that the
CSS has loaded, we can actually get some
text on the screen. Now notice that layout
and paint are halfway through their process because
they don’t have all the data. Loading up next should be
WebP, and then of course, last.js is defined
there as well. Now, because of the
file sizes, last.js is smaller than the image,
and so it can actually get loaded and parsed
before the image data actually gets
to the screen. Now, if last.js kicks
off some style chains or some other information
that needs to be loaded, it can go back
and change the DOM and change the CSS properties,
forcing page reflows, forcing page repaints, and other
sorts of chaos in your loading time. But notice the image still
hasn’t been loaded yet and hasn’t been
displayed on screen. It’s not until the final
complete bits come off the wire, the CSS
DOM can be finished, the render tree can
actually be completed, and finally, we can
actually get the image on the screen showing
you what we’re trying to show you
in the first place. Now, what you should
gather from this is that while images make up the
bulk of the content on the web, it’s really the
textual base data that drives how and when
pixels get to your screen. And when you’re
trying to optimize for fast path and critical path
rendering on mobile devices, it’s really the
text data you need to get off the wire
as fast as possible. Now, a lot of you
kids out there are talking about some
really cool internet technology called Emscripten. Now, this is an open source library that allows developers to take
existing C and C++ code and trans-compile that
to JavaScript data. Probably one of the
most impressive examples of this technology was debuted
a little bit earlier this year, sometime in March at the game
developer conference, where Nvidia and Mozilla and a
bunch of the open source community at Emscripten actually
debuted the Unreal 3 graphics engine running on
top of Emscripten. Now, this means that you
get this rich, full graphics engine running
inside of a browser. It runs inside of
JavaScript, so it’s spec compliant and all
this other fun stuff. Now, I’m from the
games industry, and I’ve been working with
Unreal 3 for a number of years now. And one of the things
Unreal has never been popular for is
being small in size. You can actually see that
when we move the source code and the compiled
data to the web, this problem still consists. So if you look at the loading
time of this application, you can actually see
that the core JavaScript file for this Emscripten-based
port of Unreal is about 50 megabytes of data. 50 megabytes to bring
down a JavaScript file to actually load this game. Now to be fair, it
is served compressed, which means that
you actually only have to pull down
five megabytes of data across the wire for this
single JavaScript file. But still, five megabytes. That’s five times larger than
any of our other websites out there. And by the way, this isn’t
counting the 18 megabytes worth of data that has to
be pulled down as well. So when you look at these
trends– and the more developers that are trying to
move towards high performance JavaScript execution using
things like Emscripten and asm.js– what you
start seeing is a trend. The more web applications
that produce source code in JavaScript that come
from other languages, we’ll start seeing bloated
and larger and larger JavaScript files that need to
be downloaded by their clients before pixels can get
on the screen and a game can be played. This trend is not going
to diminish over time. This means that our
text data is going to continue growing larger. As such, as a
developer, it’s your job to figure out how to
minimize, compress, and reduce the number of bits on the wire. And so with that,
let’s take a look at the good old
boy known as GZIP, sort of the backbone of
compression on the internet today. Now, I want to make a quick side note: compression is not the same as minimization
or minification, depending on how purist
you are on the term. So let’s take a look at
the two really quick. Minification in the
web platform is the art of removing redundant
information from textual data such that we can still
symbolically parse and process it when we pass it off
to the underlying system. So if you look at
this example here, we’ve got a function
that adds two numbers, but there seems like
there’s a lot of information that you don’t see. There’s line returns, there’s
extra spacing information, perhaps the variable names
are actually too long. The process of minification
actually reduces all of this to give you the smallest number of valid, processable bytes for this file representation. Again here, we can
process this ahead of time and actually pass this right
to the underlying systems to be processed and actually
get stuff on your screen. Now, compression,
on the other hand, offers something
completely different. Compression is the act of
modifying your data so that it has to be reconstructed before
being passed to the underlying processing system. So again, if we take
the minimized form of that sum
function, compression will actually turn it into
a bit stream of information that then has to be deflated,
or actually decompressed into the original form
before it can actually be passed off to the
underlying systems. These two technologies
work as a one-two punch. So you actually have to add
minification to your technology before you allow GZIP-ing
to occur in order to get the fewest
bytes on the wire.
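To make that one-two punch concrete, here’s a rough Python sketch that strips a toy script down and then runs both forms through zlib, the DEFLATE engine that GZIP wraps. The sample function and the crude regex minifier are just for illustration; a real build would use a proper minifier.

```python
# Toy illustration only: use a real minifier (Closure, etc.) in practice.
import re
import zlib

source = """
function addTwoNumbers(firstNumber, secondNumber) {
    // add the two values together
    return firstNumber + secondNumber;
}
"""

# Crude "minification": drop comments and collapse whitespace.
minified = re.sub(r"//[^\n]*", "", source)
minified = re.sub(r"\s+", " ", minified).strip()

for label, text in [("raw", source), ("minified", minified)]:
    data = text.encode()
    print(label, len(data), "->", len(zlib.compress(data, 9)))
```

The minified-then-compressed form comes out smallest, which is the whole point of doing both.

Now for the rest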
of this talk, we’re actually going to be addressing
some of the issues and pros and cons of GZIP,
but before we do that, I need to make sure
that we understand what’s going on under the
hood of this algorithm. You see, GZIP is a
compression format that actually uses two
interplaying technologies together. The first one is a
technology known as LZ77. Now, this is a
dictionary-based transform that will actually take
a given data stream and convert it into
a sequence of tuples. Now, in each tuple, we actually
have a position and a length value. So I know that’s confusing. Let’s take a look at
a little example here. So we’ve got this
string up top which is comprised of
various A, B’s, and C’s in some sort of random order. What happens is if we
start parsing this string, we’re going to look at
each symbol individually and we’re going to do a
scan backwards to find out when previously we’ve
seen this symbol before, because then that allows
us to actually encode this symbol rather as a
single piece of information, but actually as a relative
piece of information. The goal here in
LZ77 is actually to create a lot of
redundant types of tuples that we can then compress
a little bit better. So let’s walk through
this just a little bit so you can see what’s going on. So you see if we start
with the first A here and we actually
parse it, well, it’s the beginning of the
string, so we haven’t really seen anything before yet. Therefore, we actually
have to output a position and length of 0, 0. It represents that
we’re not actually seeing any other characters. We’re just looking
at the A itself. Now, when we get to the second
A, we start our backwards scan and we find that the
first A we encounter was actually one symbol ago
at a length of one symbol. So this means we can actually
output the tuple, 1, 1. Now B, we haven’t
seen any B’s before, so we have to output 0, 0. And C, same thing, 0, 0. But now we’ve reached
another B. So it’s going to start
scanning backwards and actually finds that the last
B we encountered in the stream was two symbols ago. And again, we only
want length of one. Now, the important and probably
the more interesting part of this particular
example is when we get to the end of the string
where we see A, B, C. Now, when we scan backwards
from A, B, C, we can actually find that we
found this exact three symbol value previously in
our stream, and exactly about five character
positions back. So this means we’re going
to output the tuple of 5, 3. So we output a tuple rather
than the three characters themselves.
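To make that walkthrough concrete, here’s a toy Python sketch of the backwards scan, not the real DEFLATE implementation. The input string is just an example, so the exact tuples differ from the ones on the slide, but the shape is the same: literals where nothing has been seen yet, and position-and-length pairs where something has.

```python
# A toy LZ77-style tokenizer. Real DEFLATE is far more clever about how
# it searches, but the output has the same flavor.
def lz77_tokenize(data, window=32 * 1024, max_len=258):
    tokens = []
    i = 0
    while i < len(data):
        best_dist, best_len = 0, 0
        # Scan backwards through the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        if best_len > 0:
            tokens.append((best_dist, best_len))    # seen before: back-reference
            i += best_len
        else:
            tokens.append((0, 0, data[i]))          # nothing seen before: literal
            i += 1
    return tokens

print(lz77_tokenize("aabcbabc"))
# [(0, 0, 'a'), (1, 1), (0, 0, 'b'), (0, 0, 'c'), (2, 1), (4, 3)]
```

Now again, the point of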
LZ77 is to actually create duplicate and high
redundancy tuples. These tuples and the
redundancy that we create with them are actually
very important to pass off to the next step of
the GZIP algorithm known as Huffman compression. Now, for those of you who don’t
remember Huffman compression or probably have blocked
it out of your mind back from the Computer Science
101 days at your university, Huffman works by assigning
variable length bit codes to symbols in your stream
based on probability. The idea here is that the more
probable and more frequent a symbol is in your
stream, the fewer bits you should use to represent it. A perfect real-world example of
Huffman encoding is Morse code. In the English language, the letter E is the most common letter. Therefore, it’s
assigned a single beep to represent its value. Huffman compression
works in a similar way. Now, we’re going to spare
you the knowledge of building a Huffman tree and
everything else. This is in tons of
data structures books. Instead, we’ll just
show you that we can see after we parse our
newly tokenized tuple string, you can see that 0, 0 is
the most dominant tuple set. And of course, we
can assign that one bit, being a single zero. Therefore, the next
set is actually 1, 1. We have two symbols of that. And because of the way
our tree is constructed, we actually have to provide
that with two bit symbols. We continue on and continue
on and effectively keep assigning variable bit
code words to symbols, creating a compressed
version of our data. Now, these two characters
or these two algorithms operating back and
forth with each other actually have a very
beautiful ballet in the way that
content is created and how it’s compressed with
the statistical encoding. This is the cornerstone
of everything we’re going to talk about for
the rest of this conversation.
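Before we leave the theory, here’s a minimal Python sketch of that second stage: building variable length codes from the tuple frequencies. The tuple stream is made up for illustration rather than taken from a real DEFLATE pass.

```python
# A minimal Huffman code builder: repeatedly merge the two least
# frequent subtrees, prefixing one side with 0 and the other with 1.
import heapq
from collections import Counter

def huffman_codes(symbols):
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol stream
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

tuples = [(0, 0), (1, 1), (0, 0), (0, 0), (2, 1), (0, 0), (5, 3)]
print(huffman_codes(tuples))   # the most frequent tuple gets the shortest code
```

The exact bit patterns depend on how ties get broken, but the most frequent tuple always ends up with the shortest code.

Now, it’s worth pointing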
out that GZIP is not the only kid on the block. In fact, GZIP is about 20
something years old now. If it were a real
human being, it would be able to
drink and drive and do all other sorts of fun
stuff on the internet. But other compression
algorithms which may be a little bit newer
and a little bit different can actually give you some
interesting trade-offs. So here are four of the more
popular encoders out there. The first is one known as LZMA. Some of you may know
this more popularly as 7-Zip, a very
popular compression archive tool out there. You can see that LZMA
actually gives you smaller compression than GZIP. This is actually due to
a lot of higher order searching and finding algorithms
built around that LZ77 algorithm. In fact, LZMA can actually be
considered a distant cousin to GZIP because it’s actually
very similar in the way that it compresses its data. However, because it uses
more modern heuristics and algorithms, it can
actually get better results. Now, below LZMA, you see LPAQ. This is a context
mixing based encoder. It’s effectively
mostly a neural net in practice of how
it matches symbols. Now notice LPAQ
actually gives us the smallest file size
compared to everything else at 0.35 megabytes. I’ll get into that a little
bit more in a second. Now, the final compressor we
look at is actually BZIP2. Now, BZIP is a modern variant
of the Burrows-Wheeler transform assigned also with a
move to front transform and either a Huffman
or an arithmetic encoder on the back end. Now, BZIP2 fundamentally changes
the way that compression works. It’s very different. Doesn’t use LZ77 at all. Instead, it uses
a semi-block based sorting transform
to actually increase redundancy or adjacent
redundancy in data. This redundancy allows it
to get better compression. Again, you can see here
where GZIP actually gives us 0.48 in terms of megabytes,
BZIP actually beats this. Now, what you’re actually
seeing here in terms of data sizes is actually a scrape I
did of the amazon.com home page. So if I take all
of the text data from that– HTML, CSS,
JSON, and JS data– it actually comes out
to about 1.64 megs. So you can see how
these compressors relate to that specific
memory footprint. Now, when you’re
comparing compressors, the size of the data is
not the only heuristic. You have two other
heuristics that you typically throw into the mix. The first one is encoding
time and the second one is decoding time, which often is
the more important of the two. So if we look at the
encoding time column, you can see that GZIP does
probably the fastest in terms of encoding speed
at 0.79 seconds. LZMA, being a distant cousin
with more search heuristics, actually comes in
pretty close at 1.26. You can see that the wins
it gets in compression size actually come from
more preprocessing at the front of
the encoding step. Now, LPAQ gives you
amazing compression, but notice it actually takes 11
seconds to compress your data. Now again, LPAQ is
a modern wonder. It actually uses
various forms of context mixing in different
sort of combinations, again, resulting in pretty
much an artificial intelligence neural net algorithm to
find compression wins. So it makes sense
that this actually oscillates at about 11 seconds
worth of processing time because there’s a lot of
things going on under the hood. Meanwhile, the Burrows-Wheeler
transform powered BZIP2 doesn’t stray too
far from the mark, but gives us about 2.1
seconds in encoding time. And there’s a
whole set of papers that talk about how that
clustering transform actually affects memory and
processing and whatnot. Now, decoding time is probably
the most important part here. You can actually see
that GZIP and LZMA are, for all practical
purposes, identical. So while LZMA takes a
little bit longer to encode, it seems to regularly produce
smaller files than GZIP, and the decoding time tends
to be almost identical. Now, LPAQ, again, because
it’s running a neural net, the decode time actually
equals almost what the encode time was, so you
look at about 11 seconds there, while BZIP
actually is a faster decode than it is an encode. The whole point of this
slide– the only thing I want you to take
away from here– is that GZIP is really
good at a couple things, specifically encoding time. It gets beat in a couple
places, specifically with the size of the compression
that it achieves, and then ties in a couple other
places in decoding time. So it’s not really the
only algorithm on the block there in terms of
compression ratios. So you can see here
that we’ve highlighted a couple of these
specific instances. LPAQ wins in size. Encoding time is, of
course, dominated by GZIP, and decoding time is, of
course, dominated by LZMA. Now, this can lead a lot of
people on the web tech platform to actually come at me
with burning pitchforks and actually say, hey, GZIP
is actually pretty good. That data just showed
us that we really don’t need any other compression
algorithms because it gives us decent size compression,
decent time for encoding, and decent decoding time. Well, I’m here to tell you
that’s a good argument, but let’s dig a
little bit deeper into whether or not GZIP should
be the only thing we allow on the web. So first off, you should
understand that there’s really no silver bullet
for compression. Depending on what your
data is, how frequent it is, the relationships it
has with local information, all relate to how well it
can be compressed at the end, and different compressors
will handle this data in different ways. For example, if you actually
applied JPEG style compression to text data, you’re
not going to be able to decompress your
text in the right way because it’s lossy in terms
of removing information bits from the stream itself. So let’s take a perfect
example that actually shows to beat GZIP without
really any modification. So let’s say we have a
string of integers here. Now, this string of
integers was actually created as an index system into
some higher order database. So there’s basically 10
things that are being indexed, and we list them here. Now, if we just GZIP
this, we actually don’t get a lot of savings. This is because,
if you remember, the LZ77 algorithm
isn’t going to find any duplicate indexes
in this array. Nine is only existing once,
and so are the other numbers. Meanwhile, that means that all
of the statistical probability for these symbols is
pretty much equal. There’s no difference. Nothing is more frequent
or less frequent, which means the Huffman
algorithm is going to pretty much assign
equal bit lengths to all the symbols themselves. Because of this, we really don’t
get any compression savings at all from the GZIP algorithm
for this particular data stream. This means that you have
to look at your data and go, what is
GZIP doing to it? If we, however, take a
little bit different swing at the problem itself,
we can actually beat what GZIP is doing with a
little knowledge of our data. So let’s say we take
the original array, and instead of just
leaving it by itself, we actually notice the
property that we can actually sort this information because
we don’t need the order to come out on the backside. So if we sort this,
we actually end up with a pretty incrementing
run of numbers– zero to nine without any gaps in the middle. We can take this string
and apply a technique known as delta encoding to actually
create a different symbol set. Now, delta encoding
works by taking a symbol and finding the
difference in the symbol between the previous symbol,
and it encodes the difference rather than the actual value. So for our specific
example here, we start with the number
zero, and the number one is one more, so we
add one to that. Two is greater than
one by a single value, so we had one, one,
one, one, one, one. And you can see
with delta encoding, all we have to do is encode
one and then eight zeros to actually represent the
original string we have. Now, the delta encoded
version, notice that it does have a lot
of duplicate symbols, and there is a
particular symbol which is more dominant than
the other symbols. Now, once we’ve encoded
our data into this form, GZIP actually has a field day. The LZ77 algorithm finds lots
of matches in its window, and the Huffman coder can actually come through and assign smaller bit codes
smaller bit codes to more frequent streams. What we’re showing you
here is that you can’t just throw arbitrary data
at GZIP and expect to get the most perfect,
amazing form of compression. Instead, a little
bit of preprocessing can actually change the
way your compression works, most of the time for the better. Now, let’s look at
another perfect example of where GZIP can actually
cause some problems. So if you look at my website, I
did a little bit of an analysis a while ago on various forms of
CSS minimization technologies. There was a
fantastic set of code out there by someone who
was using genetic algorithms to figure out optimal
ways to minify CSS data. So let’s take a look
at this table here. So if we take two CSS files from
StackOverflow and Bootstrap, and we actually look at
their minified forms– so this has already been run
through Closure or YUI or Clean CSS, one of these things– we
actually get about 90 and 97k, respectively. Now, using the genetic
algorithm version of the minifier actually
produced about 3% savings in both of these
files, which meant that we can use a different
variant of minification to get wins over what we’re
typically already doing. So this is a good idea. If we find a better algorithm
that minifies our data smaller, we should be embracing that. This is where GZIP actually
causes some trouble. When we actually run these
genetically algorithmically minimized data through
the GZIP compressor, you can see the results
are actually negative. That means it’s actually
growing the data. So look at the minified-then-gzipped columns. If we just zip the data
that comes from Clean CSS or Closure, we get
about 18 and 15k. However, when we
zip the data that’s been generated from
our genetic algorithm, you can see that we’re
actually increasing the size of the file. Basically, what you’re
seeing is that the data is so minimized coming from
the genetic algorithm that GZIP is actually
inflating the data. It’s making your
file larger than it should be because it’s only
using the LZ77 and Huffman encoding steps. Now, this should
scare a lot of people, hopefully, into thinking
how often this is actually occurring on the internet
for various size of files. Now, I can tell you
it’s not that frequent, but it is something to keep an
eye on because it definitely shows that GZIP is
not a silver bullet and that it can actually
do harmful things in very certain circumstances. Now of course, a
lot of people here would then say, well,
this means we shouldn’t be using this genetic
algorithm to minimize our CSS. Obviously, that’s a flop. And I still think that
that’s the wrong argument, but let’s move on. Now, let’s talk
about PNG for minute. Now, PNG is a format
that you can actually export in a compressed form. Now internal to PNG, the
compression algorithm it uses is called deflate, pretty much
the backbone of what GZIP uses. It’s LZ77 coupled with a
Huffman compressor step. Now, what you’re looking at on
the screen here is two images. Effectively, I took a 90
pixel by 90 pixel sprite, and I tiled it vertically,
creating a 90 by 270 image. Now, I resized that
image just a little bit and added two columns of
pixels, so we now have a 92 by 270 image instead
of a 90 by 270. So we’ve barely altered
the size of the data. However, you can
see at the bottom there that the sizes post-GZIP
are drastically different. The 90 by 270
image is about 20k, while the 92 by 270 image is 41k. That’s twice as large for
only two columns of pixels being added to the image. And again, the
compression algorithm that’s being used under the
hood here is effectively GZIP. So let’s take a look at
why this is occurring and what’s going on. So let’s say we’ve got a
bunch of pixel colors here. Well remember,
the LZ77 algorithm is going to come
through, and it’s going to try to find matches. So we’ve got a particular
green pixel here and another green
pixel that’s identical to this somewhere previous
in our data stream. Now, how this works is that
LZ77 won’t scan infinitely to try to find matches. Instead, it actually
operates in a window of 32k worth of information. Now, that means if I
encounter a symbol that’s outside of that window, then
I’m not going to find a match. It only matches things
inside of a 32k window. So if we look at
our 90 by 90 images that we created for
our tile sheet there, 90 times 90 times
4 bytes per pixel is roughly about 32k in size. Now, if we actually
look at 32 times 1,024, the real definition
of 32k, it’s, again, about 300 bytes
larger than our 90 by 90. However, when we added
those two columns of pixels, we actually changed the size
and the stride of our images. 90 by 92 by 4 is
actually 33k, not 32k. What this means
is that when we’re trying to find duplicate
pixels in this image, we’re just far enough away that
we can’t find the exact pixel that we had seen before. Let’s take a look
at this visually. So again, we have our two
images, the 20k and the 41k, and we can actually
create a heat map for this image representation. Now, this heat map shows
us in dark blue pixels being the ones that
are highly compressed. This is where matches
have been found. They’re very small, encoded
pixels at this point. Meanwhile, reds and yellows mean
that we didn’t find a match, and to encode this particular
pixel took a lot more bits. So you can see on the left hand
side, where we’re 90 by 270, we actually get a
lot more matches. The dominant form of
the image is dark blue. You can see once we tiled that
image in the lower two regions, that we find an exact pixel
matching 32k pixels away. However, when we increase
the size by two columns, we’ve actually bumped out the
window of pixels to look at. Therefore, we’re not actually
finding exact matches. In fact, we have to
look for near matches, and that’s why the second
image has a lot more misses in the LZ77 cache,
resulting in worse encoding. Now, this basically shows
us that with small changes to our image data, the
GZIP algorithm actually falls completely on its face. Again, it’s far from
being a silver bullet. Now, the truth is that at this
time you should be saying, well hey, it’d be
great if we could use one of the newer algorithms. But in reality,
we’re stuck with GZIP for some indefinite future. Here’s why. If you actually go to
the Chromium source code and look at the bug system
that’s provided there, you’ll find an interesting
tale of how Chrome actually adopted the BZIP2
compression format. Effectively, Chrome
added support for it, and a lot of servers–
Apache and whatnot– added support to send
out BZIP2 content. So effectively, a server creates
a BZIP2 file instead of GZIP and sends it off to clients. However, what started
happening in the wild was a little bit
interesting to us, and the results made
us actually have to remove the support
for BZIP2 from Chrome. You see, what was happening
was a lot of these middle boxes that are out there in
the wild didn’t ever expect that we would do
anything besides GZIP, and they actually had hard coded
paths in there to look and say, if this is not GZIP
header compressed, compress it with GZIP. So effectively, what occurred
was that when a server sent out a BZIP2 archive with its BZIP2 header, these middle boxes
would look at it and say, hey, this isn’t GZIP. Strip the header, try
to GZIP the information, and then send it out. This means Chrome would
actually receive a BZIP2 data package whose header
had been stripped and then it had been re-gzipped. So we would have no idea what
the actual format of the data is that we’re receiving. It could be binary data,
or it could actually be valid information. The problem was this was
so systemic out there in the wild for various types
of firmware and middle boxes that it was actually
too difficult for us to fix on the fly, which
means it was a smarter idea to actually just remove
this from Chrome. So when you start
talking about the ability to add other compression
algorithms to the base browser, you’re going to run
into the same problem. There’s a lot of
systems out there who just don’t
understand and haven’t been updated to expand to
different sorts of compression technologies. So this gets us to an
interesting idea of well, if we’re stuck with GZIP, but
size of text data is a problem, how do we actually create
smaller GZIP files? Well lucky for
you, I’ve actually spent the past three or four
months focusing explicitly on this problem,
and you can actually see a lot of the results on my
blog at mainroach.blogspot.com. But first, let’s dive into
a couple of these ideas, and then you can
go there and read about all the awesome things. So first off, we can actually
generate better GZIP files. A lot of people don’t
understand this, but when your server creates
your GZIP file, a lot of them are only tuned to compress
it at a factor of six. The command line parameter
that you pass to GZIP allows you to pass in
between zero and nine, where most of the servers
are default set to six. This seems to be an
interesting trade-off between compression speed
and compression size. Now, this is mostly
because web developers tend to just upload raw
assets to the server, not wanting to deal
with compression. The server actually is
responsible for compressing the content and then
sending it off to requests. Now, what’s
interesting about this, though, is that you
can modify your files and actually GZIP them offline. So effectively,
part of your build step would be to take
your text information, GZIP it with some better
compressor to produce a smaller GZIP file, and then
actually tell the server to just pass that data through.
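As a sketch of what that build step might look like, here’s Python’s own gzip module doing the offline compression at its maximum setting. The asset names are hypothetical, and in practice you’d swap in a stronger offline compressor, which is exactly where the next two tools come in.

```python
# Precompress text assets at build time so the server can serve the
# .gz files as-is instead of re-gzipping on every request.
import gzip
import shutil
from pathlib import Path

def precompress(path, level=9):
    src = Path(path)
    dst = src.parent / (src.name + ".gz")   # app.min.js -> app.min.js.gz
    with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=level) as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

for asset in ["app.min.js", "styles.min.css"]:   # hypothetical build outputs
    precompress(asset)
```

There are two applications out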
there that could actually produce this type of
smaller GZIP file. The first one is 7-Zip. This is actually a
very nice command line tool which can produce BZIP2
archives as well as GZIP archives, but it does
so taking advantage of the more modern,
more powerful searching and dictionary based
compression algorithms that it’s going to be using. So 7-Zip can actually produce
smaller versions of GZIP files than the standard command
line GZIP archiver that ships with most
firmware in servers. Another fantastic tool out
there is one called Zopfli. Now, this was actually
created by the engineers here at Google to solve
this same problem. Now, Zopfli uses a lot more
memory and a lot more advanced algorithms than even 7-Zip to
find better matches in smaller spaces to produce smaller
GZIP files as well. It’s an open source project. You can go check it out. I highly recommend it
if you’re interested. So the cool thing is we can
use either 7-Zip or Zopfli as part of your build process to
actually generate smaller text files and have them sent
along on behalf of the user. And again, these are
GZIPs, so they’re going to be accepted by
the middle boxes as well as the browsers. Now, if you’re wondering
how these preprocessing systems actually fare
against the standard GZIP, here’s a great graph to look at. So you can actually
see the blue column across the bunch of files is
actually the standard GZIP algorithm. The red one is actually
Zopfli, and the green one is actually 7-Zip. So you can see
across these 42 files that on average, in fact,
most of the time, both Zopfli and 7-Zip regularly beat GZIP
by somewhere between 2% to 10%. And if you actually
increase your parameters and allow Zopfli and 7-Zip to
spend more time doing matching and doing compression,
you can actually get it into the 15%
ratio, which is fantastic. Now, the cost of this,
though, is enormous. A lot of these algorithms
took probably 20 minutes to run– on the light
side– to find 1% to 5% worth of compression wins. Basically, what we
found is that we’re at a local minima
for compression. The more time we spend trying
to compress the content yields less and less
and less actual savings. You could spend
six hours of cloud compute time to get 2% savings
in your compression algorithm. So at this point, you
have to look and say, well, if we’re stuck
with GZIP, yet it takes additional hours of
time to compute smaller GZIP files, what the heck is
this entire talk about? Aren’t we just stuck in this? Fret not, my good friends. You’re actually not
stuck, and it actually comes down to you
owning your data. You see, you can create
smaller GZIP files by actually preprocessing
your information before handing it off to GZIP. Much like the example
I gave earlier where we took a set of numbers
and then delta compressed them to create something that was
highly repetitive and highly compressible, you can
apply these techniques to a lot of other portions
of your code base. So let’s dive into some
interesting algorithms you can use in your
projects today. The first one happens to be
close and near to my heart. Now, many applications
and many websites out there use JSON as
a format to transfer data between client and server. Social media
information is typically sent around in this format. It’s a very nice, widely
adopted file format for sending information around,
particularly because it’s built off the
JavaScript standard. Now, when you’re sending
this data around, though, most of the time,
users and developers don’t think about how to
modify the data being sent such that it compresses better. So a user asks for
some search query. The server goes and
computes the information and then returns
the JSON blob back, normally optimizing for
return round trip time. It’s trying to get the
data back to the user as fast as possible. But what they don’t
understand is if they actually have that GZIP flag
turned on, GZIP is going to stop the
operation and zip the content before sending it down
to the client anyway. Now, if you’re the
type of developer that’s dealing in third world
countries or other types of connectivity that may
be sparse or intermittent, this is actually a huge problem
because your client device– usually a mobile device
on some 1G or 2G network that may or may not
stay consistent– is sending off
requests, and then what’s hurting them
is that they’re getting a larger
payload coming down. So basically, what we’re
talking about now is, how can you process
your JSON blobs that are being returned to these
individuals in such a way that GZIP can
compress it further? And it all starts
with the ability to transpose your structured
data in your JSON file. So let’s take a look
at this example. Now, I’ve scraped
a lot of websites and a lot of JSON
responses, and I see this pattern occur
on a lot of websites. Effectively, what
you’re looking at is a list of dictionary
structures of name value pairs. So if someone requests a
search for a particular item on a shopping website,
it’s very common to actually return to them a
dictionary item that actually contains what the product
name is, what the price is, what thumbnail image to use,
et cetera, and then just list these linearly
inside of an array. So what this creates, though,
as an interesting problem, is that similar based
data is actually strided away from each other. It’s actually interleaved. So you may have a
name and then a price, which could be a floating point
value, and then a description, which may be a long
form block of text, and then a URL, which has its
own specific characteristics. What we’re proposing
in transposing our data is actually turning
this name value pair and actually de-interleaving it,
and instead actually grouping similar values for
property names together. So example here is we have these
two dictionary objects that both have name and
position values, and instead, we can
transpose that such that we have an
array of name values and an array of pos values. Now, I know this looks a little weird as JSON, so let’s take a
graphical look at things. So let’s say on the
top here, we have an array where we have
a list of objects, and each one of
the colored blocks represents a property
on that object. So we’ve got green is a name,
and blue may be a price point, and again, red
might be some URL. What we can do, then,
is we can actually transpose this and align all of
the green blocks, blue blocks, and red blocks together,
allowing homogeneous data to reside in an array with
other homogeneous data.
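Here’s a minimal Python sketch of that transpose step: turning an array of objects into one object of parallel arrays and comparing the gzipped sizes. The sample records are invented, and as you’ll see in a moment, the win is not guaranteed for every payload.

```python
# De-interleave a list of records into parallel arrays before gzipping.
import gzip
import json

records = [
    {"name": "Widget %d" % i, "price": 9.99 + i, "url": "/item/%d" % i}
    for i in range(200)
]

def transpose(rows):
    # One object of parallel arrays instead of an array of objects.
    return {key: [row[key] for row in rows] for key in rows[0]}

original = json.dumps(records).encode()
flipped = json.dumps(transpose(records)).encode()

# The transposed form often (not always) gzips smaller.
print(len(gzip.compress(original)), len(gzip.compress(flipped)))
```

Now, I’m going to explain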
why this is actually an improvement in a
second, but let’s see if we can take a look at whether
or not this saves us any data. So if we take a
bunch of JSON files that are returned from
very common websites– so I took some responses
from Amazon and Pinterest and whatnot– you can see that
the first column represents the source size in
bytes, the second column represents the source
data actually gzipped. The third is actually
the size of the data after it’s been transposed. You can see that there’s
some variability in numbers there because we’re
actually removing symbols from the string at this point. And you can see the
final column is actually the gzipped version of
the transposed data. Now, we’re actually in an
interesting situation here, is that for some JSON files,
the transposed gzipped data is actually smaller than the
source gzipped data, which means we actually get a win
by applying this process. Thankfully, we’re not
falling into that area where genetic algorithm
minimized CSS is actually causing GZIP to
inflate the data. We don’t want to be there. So the transpose operation
actually gives us smaller files,
which is fantastic. Now, the reason that
this actually works has to do with the
32k window that LZ77 uses for its matching. So if we have our
values here that are all interleaved– red,
green, blue, red, green, blue, red, green, blue– again, we
have to go farther to search for a piece of content
that may be an exact match. However, when we transpose
our data, we de-interleave it and group similar
types of data together. So let’s say we’re
actually trying to find a match for one
of these green values. In the top array, you can see
that an entire listing of data, the green may not fit
inside the 32k window. Meanwhile, once we transpose
it and group homogeneous data together, the entirety
of the green data actually fits in
a single window. This is going to allow you
to find better matches, which is going to result in
smaller compression values. Now this actually comes to
another step here, which is OK, well, if we can
transpose our data, how else might we
modify our content so that we can get wins? Now, this is a really,
really cool algorithm that I found digging
through the archives of IEEE known as compression boosting. Now, it operates on
the same parameters. How do we preprocess things
for better compression? So the first one we’re
going to take a look at here is actually something
called Dense Codes. Now, this is some great
research out of some academics in Argentina, and
effectively, it allows us to take a text
based file and preprocess it and hand it off to GZIP. Now, the preprocessing
is actually the important part here. We’re not actually
transposing it, but instead, were using a
modified dictionary index lookup scheme. So let’s say we parse
our text and, as each word is seen, we create a
dictionary index. So once a word is seen, we
provide it to the dictionary, and then every
reference to that word is replaced with an
index value to the array. So let’s say we have an array
here of, “How much wood could a woodchuck chuck,” and we
have 400 symbols before that. So we see that “how” is at
location 400, 401 is “much,” 402 is “wood,” 403 is “could.” So when we’re creating
a stream, we’re going to get values that point
into these element arrays. So the problem with this is that
if we only have 256 symbols, we only need eight
bits per pointer. However, if we go
above 256, we have to start using 16 bits per
pointer, which is actually a problem because if the
symbols are weighted such that the most probable
and most visible symbols are actually closer to
the front of the dictionary, you’re actually going to
be wasting a lot of space. You’re going to have a lot of
symbols and a lot of indexes where the first upper
eight bits are actually going to be zeros for the
entire dominant side of it. So you’re actually
inflating the size of your stream at this
point because there’s a lot of bits that
aren’t being used. Now, the way that
Dense Codes work is it actually allows you to modify
the way the token is used to create the ability
to actually do variable length coding
in string, which means that for the first
three numbers, we actually use 16 bits
to represent the indexes, but the second three numbers,
because their indexes are lower than 256,
we can actually use eight bits instead of 16. This is actually going to
create a smaller stream for us to actually compress
a little bit later.
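Here’s a hedged Python sketch in the spirit of that scheme: rank the words by frequency, emit each word as a one-byte code if it’s common and a two-byte code otherwise, and gzip the result. Real ETDC uses a slightly different end-tagged byte layout, and the input file name here is hypothetical.

```python
# A simplified stand-in for dense codes, not the published ETDC format.
import gzip
from collections import Counter

def dense_encode(text):
    words = text.split()
    # The most frequent words get the smallest dictionary indexes.
    ranked = [w for w, _ in Counter(words).most_common()]
    index = {w: i for i, w in enumerate(ranked)}
    out = bytearray()
    for w in words:
        i = index[w]
        if i < 128:
            out.append(i)                              # common word: one byte
        else:
            assert i < 32768, "toy scheme caps out at 32,768 words"
            out += bytes([0x80 | (i >> 8), i & 0xFF])  # rare word: two bytes
    return bytes(out), ranked

text = open("corpus.txt").read()                       # hypothetical input file
encoded, dictionary = dense_encode(text)
print(len(gzip.compress(text.encode())), len(gzip.compress(encoded)))
```

A real deployment would also have to ship the dictionary, or rebuild it deterministically on the client, so the transform can be reversed.

Now, the wins from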
this are pretty interesting to look
at because there’s a couple things in
flight here, and I’ve got an entire 10-page writeup
on my blog about this algorithm and the interesting things that
go back and forth between it, some caveats, and some
things you need to know, but it all boils
down to this table. Now basically, what
this table says is that we’ve tried the
source data being gzipped, and then we’ve
tried the data being run through our
dense codes– that’s the ETDC column– and then the
gzipping of that information. Now, next to it, I actually
compare the other compressors that we’ve mentioned today–
Zopfli, 7-Zip, and BZIP2. So what I’m trying
to see here is once I do this preprocessing,
what compression algorithm is actually going to give
us the best results? You can see that hands
down, GZIP really doesn’t produce savings with
this preprocessing method for the majority of
the data that I’ve shown here– JSON files, CSSs. And I’ve actually
run this probably against 25,000,
30,000 files, and you see this similar pattern,
that just doing standard GZIP against this information
doesn’t really give you wins. However, you see when
you start using Zopfli, you start getting some wins. The advanced pattern matching
and more memory usage characteristics that it
has inside of an encoder allows you to get better
matches with this, producing smaller files. The clear winner here is 7-Zip. Something inside of the way
it’s using its algorithms consistently produces smaller
dense code compressed files as opposed to the
source data, which means this is interesting,
that if you have data that’s above
a certain size. Let’s say you’re returning a 20k
blob of data for some reason. Preprocessing that
text with dense codes and then using 7-Zip as
your compressor of choice can actually produce smaller
GZIP files consistently. Now, the downside
of this, of course, is that you have to reconstruct
your dense code transform at the client, but that’s
not necessarily a big deal depending on what
type of trade-off you’re willing to make. Now, these are all
preprocessing schemes. There’s another form of
processing your data that can actually get wins
that these can’t touch. I mean, the types
of wins that we’re going to talk about
for Delta.js blow these preprocessing schemes out
of the water, but be warned. Thar be dragons here, mostly
in the form of madness. So let’s dive into what
I’m talking about here. So in 2012, the
Gmail team actually did a fantastic
presentation for the W3C, and actually put
up some slides that are publicly available to other
information, that was proposing a solution to a common problem. You see, Gmail users
collectively see about 61 years of loading bar inside of Gmail every day. This is a lot of time that
users are sitting around, waiting for JavaScript
to be streamed out. What they proposed was
a new form of ability to, instead of transferring the
large files every single time, they can actually start
transferring the difference between the file
that the user has and the new file
that the user needs. Now, this isn’t a new concept. We’ve seen these sorts
of patching-based systems everywhere in computing
since the late 1970s, but it’s never been
able to be applied to the web due to some of
the architecture involved. How the algorithm works is this. So let’s say we have a
file, file.CSS version zero, and we’ve made an update to it. This happens quite a
bit in large projects. Now, when we make that update,
the majority of the file is the same as it used to be. There may be a
function that’s added, or some comments were placed
in, or some things were removed, but for the majority
of the file, it’s almost
identical, which means we can represent the
second file, the new file, as a difference from the
file that we’ve already seen. This is the concept,
again, of delta encoding. Now, once we’ve delta
encoded this file, the patch– basically the
difference between the two– is generally much, much smaller
than the updated version of the file, which
means we can represent version one as a patch
operation of version zero. This means that if
the user already has version zero on their machine, all we have to do is send them the patch, and it allows them to reconstruct and create
the new version of the data.
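Here’s a hedged Python sketch of that exchange, with difflib standing in for a real binary diff format: the patch keeps only the opcodes plus the bytes the client can’t recover from the version it already has. The file names are hypothetical.

```python
# Build a patch from v0 to v1, then prove the client can rebuild v1.
import difflib
import json

def make_patch(old, new):
    ops = difflib.SequenceMatcher(None, old, new).get_opcodes()
    # Keep only the bytes the client cannot reconstruct from `old`.
    return [(tag, i1, i2, new[j1:j2] if tag in ("replace", "insert") else "")
            for tag, i1, i2, j1, j2 in ops]

def apply_patch(old, patch):
    out = []
    for tag, i1, i2, payload in patch:
        out.append(old[i1:i2] if tag == "equal" else payload)
    return "".join(out)

v0 = open("file.v0.css").read()         # what the client already has cached
v1 = open("file.v1.css").read()         # what the server wants it to have

patch = make_patch(v0, v1)
assert apply_patch(v0, patch) == v1     # the client rebuilds version one exactly
print(len(v1), len(json.dumps(patch)))  # the patch is usually far smaller
```

Now, this is actually a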
pretty interesting concept because this means
we don’t have to send full files to the
clients all the time. Instead, we can produce and send
down highly, highly minimized content to these guys. Now of course, this comes
with a bit of overhead in terms of communication. You see, in order
to get this working, we have to have a
communication process between the client
and the server. So let’s say we have
our mobile device here and the user is going
to load a website. Well, the client
actually needs to notify the server on what version
of the file it has cached. The server can then
take this information, look up in its
array, and figure out what patch file it needs
to send to the client in order to get them up to date. Once it’s figured this out, it
passes it off to the client, and then the client is
going to use this patch to construct the new version
of the file and cache that and use that appropriately. Effectively, this technique
is trading network requests for smaller file sizes. Sending a couple byte
request to determine what version of
the file you have is going to be night and day
difference than actually doing the entire file being sent
to the user multiple times. Now, if this sounds like
craziness to you, just hold on. You have to realize
this is a common problem for many large, very
industrial websites that serve millions
of users a day. You see, when the
Gmail people actually looked at the type
of savings they can get from this
type of algorithm, they saw a massive potential
for improvement here. You see, when they
compared the number of revs they do to a single file
over a month against the size of the deltas if they
were using this scheme, they actually saw that a
whole month’s worth of changes was about 9%– lower than
9%– of the size of the assets altogether, which
means they would only have to send 9% of the content
as opposed to the new full file each time. This is a huge win. We’re not talking
about 10% improvement. We’re not talking
about 50% improvement. This is 90% size decrease by
using this delta algorithm. So obviously, this was a
proposal to the W3C spec. I’ve talked with some
of the Gmail guys. It doesn’t look like this is
actually live in the servers right now. If that’s changed, I hope
to be wrong because this is a fantastic
piece of technology that I hope to see
rolling out in some form to a lot of other
distributors on the internet. Now, there’s another
form of this compression that I’ve actually been
playing around with. It actually comes in the form
of horizontal compression. Typically, when we think of
delta encoding for files, we think in terms of patches. I’ve got version
A and version B, and I want to generate
the patch between the two. This is especially common
in game development. However, there exists a
form called horizontal. How this works is
let’s say we have a cascade of files that may
be similar on a website. So this particular website
uses three CSS files. Let’s say it’s Bootstrap
or something like that. Well, the interesting thing is
that these CSS files generally aren’t that different
from each other. There’s actually a lot of shared
syntax on a website for a given CSS file, which
means that instead of doing delta compression
between versions of the file, we can actually do delta
compression between the files that are going to
be sent to the user. So we can actually do a
difference between file zero and file one, and then
file one and file two. This actually allows
us to create patches for each one of
these, and when we combine that with the
source file to your server, the size of the assets that need
to be sent down to the client is drastically reduced. We can actually send them
the entire application and all the content required
as deltas from base files. This, again, is just
an extrapolation of various forms
of delta algorithms that are already
used out in the wild. Now, how this works
on the client side is that the server
will, of course, provide to the client the base file
and the set of patches, and then the client is
responsible for reconstructing each one of those files and
then caching them locally.
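And here’s a rough sketch of the horizontal flavor, again leaning on difflib as a stand-in patch format: the first file travels whole and each sibling travels only as a diff against the previous one. The file names are made up, and whether the deltas win depends entirely on how self-similar the files really are.

```python
# Ship base.css whole, then ship its siblings as diffs against each other.
import difflib
import gzip

names = ["base.css", "theme.css", "print.css"]   # hypothetical sibling files
files = [open(n).read() for n in names]

payload = [files[0]]                     # the first file goes down whole
for prev, cur in zip(files, files[1:]):
    diff = difflib.unified_diff(prev.splitlines(True), cur.splitlines(True))
    payload.append("".join(diff))        # every other file goes down as a delta

full = gzip.compress("".join(files).encode())
delta = gzip.compress("".join(payload).encode())
print(len(full), len(delta))             # deltas win only when the files overlap a lot
```

This is a really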
neat idea if you’re trying to optimize first
time load for users. The Gmail proposed
specification requires that the user has to download
version zero of the file, which in some cases actually
could be 300k of CSS data. Meanwhile, horizontal compression suggests that it may not have to do that. It may only have to send down bytes, or chunks of maybe 50k of data, when you actually represent that CSS as a delta. Effectively, what you're doing in this horizontal scheme is trading client-side processing for smaller transfer sizes. This is because it actually
takes a lot of CPU cycles on the client to reconstruct
these files before passing them off to the processing
system to create your DOM and everything else. So there’s this interesting
trade-off between Delta.js or vertical delta encoding for
files and the horizontal delta encoding for files as well. Now, we can see when I’ve
applied this technique to various sites on the internet, we actually get some
interesting numbers out of it. So I took all of the CSS
from a Gmail session, all of the JavaScript
from a Gmail session, and all of the CSS
from an Amazon session, and I ran it through
this technique. So you can see we’ve
got the source, the gzipped source, and then
the size of the data once it’s been delta
encoded, and then of course, the gzipping
of the delta encoded data because we can’t just stop
at the delta encoding. So you can see for
Gmail, we actually get some pretty amazing
savings, 31% and 12%, by using this horizontal
delta compression encoding. Meanwhile, something weird’s
going on with the Amazon data, where we actually see the size increase by 13%. Now, I'd like to point out that this data is highly minified and highly redundant. All of these files that generate multiple network requests tend to be self-similar, and that's actually due to the
minification processes that are being applied to
these files on the web. To give you an example of
how important minification is to horizontal
delta compression, let’s take a look at the game
library known as Impact.js. I love this library. It’s fantastic. If you’re a game
developer and you’re looking to make HTML5
games, definitely give Impact.js a look. Effectively, I took the source files and left them as loose files. I did not combine them
into a single large file. I actually created the delta
between all these files and then did the
gzipped version of that. Now, you can see
that the size actually goes down from about 70k to 21k. However, when I minified all of the files before doing the delta compression, I actually got even better savings: down to about 14k. This is because the minification techniques we're using on the web today produce a lot of duplicate symbols that can be matched in various places. Again, think back to those earlier examples where we took a function and renamed the parameters being passed to it to a, b, and c. We tend to see that pattern occur across all the JavaScript, for every function that's been defined, meaning we're going to see more matches. More matches result in better statistical probability, which results in smaller file sizes. So by minifying our data before we do our delta compression, we get very, very important wins.
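As a toy illustration of that point, the snippet below compares two invented functions before and after the kind of parameter renaming a minifier does. Real minifiers do far more than this, but even here the repeated a, b pattern gives the compressor longer, more frequent matches to work with.

# Toy illustration only: made-up functions before and after minifier-style
# parameter renaming. The renamed bodies repeat the same byte patterns, so
# the matcher finds longer runs and the output compresses better.
import zlib

verbose = (
    "function addUser(userName, userAge){ return userName + userAge; }"
    "function addItem(itemName, itemPrice){ return itemName + itemPrice; }"
)
minified = (
    "function addUser(a,b){return a+b;}"
    "function addItem(a,b){return a+b;}"
)

print("gzip(verbose) :", len(zlib.compress(verbose.encode())))
print("gzip(minified):", len(zlib.compress(minified.encode())))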
So what we've talked about today is where GZIP sits on the web platform. We've got a lot of text data that's about to blow up. We've got mobile devices. We've got fragmented hardware. We've got different
connectivity all over the world, but some things here allow you
to take advantage of them and actually address these issues today, rather than waiting for network speeds to improve. So for example, we
looked at the fact that GZIP is not
a silver bullet. By preprocessing your
data, you can actually get some pretty big wins. We’ve seen where GZIP
actually falls on its face. If you’ve done a lot
of preprocessing, it may not compress your
data, and in fact, it might inflate it. Or in the case of PNGs,
slight modifications to the data can actually upset how well GZIP's window matching attaches to your data. We also looked at how to preprocess your content using other command-line compressors, like Zopfli or 7-Zip, and then how to transpose your JSON data, which is fantastic for a lot of your shopping sites.
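As a reminder of what that JSON transposition looks like, here's a minimal sketch. The record fields are made up and the savings will vary with your payload, but the idea is simply to turn an array of objects into one object of parallel arrays so the keys aren't repeated for every record.

# Minimal sketch of transposing JSON: array-of-objects -> object-of-arrays.
# The records below are invented; measure your own payloads before adopting this.
import json
import zlib

rows = [{"id": i, "name": f"item-{i}", "price": i * 10, "inStock": i % 2 == 0}
        for i in range(200)]

transposed = {key: [row[key] for row in rows] for key in rows[0]}

for label, payload in (("array of objects", rows), ("transposed", transposed)):
    raw = json.dumps(payload, separators=(",", ":")).encode()
    print(f"{label:16s} raw={len(raw):6d}  gzip={len(zlib.compress(raw)):6d}")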
Also, when you combine that with the dense code boosting, which is more cutting-edge text preprocessing, you start getting a sense
that the web is not done yet. You’re not locked into
the format that you have. Then you can start looking at the delta compression methods and at ways to combine your data differently to reduce duplication and complexity in the
content that you have. And when you apply
all this together, you start getting
a vision of the web where we can actually take control of more of our data
get around a lot of the hitches and problems that
GZIP presents to us. So with that, thank you
for your time today. I really appreciate
you listening to this ramble on some very
hard compression stuff. If you’re interested
in more, I highly recommend you check
out html5rocks.com. I recently put up
two articles there on text compression
for web developers as well as image compression
for web developers, and these are generally
meant as introductory tutorials to walk you through different
terminology and different algorithms and how
they’re being used. Once again, definitely check
out the #perfmatters hashtag. A lot of smart people are there. And join the Google+ Web
Performance community. You can actually see the
short link there on the side as goo.gl/webperf. Again, a great place to talk
about performance problems and find issues. That’s it for me. My name is Colt McAnlis. Here’s how you get a
hold of me for email and other various
social media channels. Thanks, once again, for tuning
in to this episode of Google Developers Live. I hope to see you again soon. Thanks.

Comments

    fashnek

    The section around 27:00 is incorrectly stated and misleading — "GZIP is inflating the smaller file" is /wrong/. Those red numbers are not an indictment of GZIP at all and are not indicating that GZIP is harmful or "scary". They're an indictment of the genetic algorithm-based minifier tools, which make the data inherently /less compressible/. In other words, they make GZIP a little bit less helpful, NOT harmful. GZIP is no less of a "silver bullet" with this argument. GA minification is.

    ThunderousGlare

    If he was my boss I'd beat him for sure… I don't know, I just wanna beat him, and as I'd be beating him he'd ask why… I'd look in his eyes and say: because… Google.

    Marcus Zanona

    I believe this is because people like the idea of being an expert in one thing, but only a few have the eagerness and interest to discover more about what is said to be important?

    fashnek

    There is a mistake, in that the sorted list does not have the same cardinality as the source set. The correct result would be
    [0,1,1,0,1,0,1,1,1,1,1,1]. He wanted to make an ideal set [0,1,1,1,1,1,1,1,1,1] to demonstrate the compression. In a real world example, it would be fine to have multiple entries that were the same. Those would lead to a value of zero in the delta output like I showed.

    mertuarez

    Oh dear, oh dear. So this is why tablets and phones need quad-core CPUs? How many extra libs do I need for this? I wonder why Google apps need so many libs if Android is such a "wonderful" framework? Also, Google moves forward so fast that right after a release everything is already deprecated (documentation, samples, …). I'm tired of constantly rewriting apps for newer and newer APIs. And finally, why do Google Developers talk so much and say nothing?

    Andrea Colombo

    I totally agree. I don't watch most of the videos Google Developers uploads because there are too many; I watch only the ones I'm interested in.

    Bruno Racineux

    The encode time(s) are a much larger part of latency than I would have thought.

    The 50ms of encode time per 100kb of data figure is non-negligible for files that rarely change, like CSS or JS. Serving pre-compressed files makes a lot of sense. For CMS frameworks we just need the tools to do it for us.

    Is the given Amazon example based on the default gzip DeflateCompressionLevel (6)? And is this based on an average server-side setup? Or a local machine?

    Рыцарь Тьмы

    Well, GZIP or no GZIP, people don't care; they're still going to watch cat videos and post photos of their day anyway. And you poor developers will spend sleepless nights inventing new ways to shrink data volumes. People are incredibly stupid, as one old geezer used to say.

    RonJohn63

    "Minification" reminds me of code obfuscation which companies used back in the day to distribute source code (back when the world was much more than Windows, OSX and Linux) while making it difficult for the user to read it, and — even before that — the tokenezation performed by the MS BASIC interpreter.

    ThunderousGlare

    The idea of making your API results complete hell to deal with in order to save 100 bytes, yeah, that seems like a bad route.

    Daniel Friesen

    So I guess this history lesson is the reason Chrome is being cautious and hiding brotli behind a flag in Canary even though brotli was also invented at Google; while Firefox is going ahead and releasing it into the wild in the very release I'm downloading the update for now.

    That said, assuming the proxy issues haven't disappeared (though they may not be as big an issue in the years since), it does tell me that those bugs won't be as big an issue for bzip2 or brotli in HTTPS traffic, which is a growing category of traffic.

    snetsjs

    Doesn't LZMA offer the best web solution, because it has the smallest storage and network footprint and the fastest decoding? I ask because it seems encoding time isn't anywhere near as important, since it is done once before deployment.

    David Tan

    Could you please explain why the kiwis in the PNG with the extra two columns of pixels were not compressed? I would've expected the second kiwi to have fit in the 32k window, and perhaps part of the third kiwi as well.

    Dustin Rodriguez

    There's a problem with the delta approach. Latency. Sure, you might save a few kilobytes… but those additional requests each incur latency. So whatever raw transfer speed improvements you would get (which are not simply linear due to packet sizes, saving 1 byte inside a packet isn't nearly as important as saving 1 byte that would overflow to a new packet) would almost certainly be eaten up by the latency of those requests, especially given the way consumer connections are throttled on upstream. (Not that requests for these things would be large and run into throttling themselves, but considering the ways in which the throttling is done, especially in circumstances where the network connection is being used, such as in a family home)

    Michael Bradley

    Sounds like it's the middle boxes' problem, not yours. If the browser and server determine what they can use, then all is good; if there is a middle box in the way, then let it get the complaints.

    Also, if I may add, a lot of the page loading times I see are not due to a 2-5% saving in content size, but rather to poor developers using all these frameworks, causing so much overhead, and including third-party stuff (JS, CSS, etc.) from third-party sites. It's horrible.

    dipi

    About this newfangled WebP format: yeah, not too great, sounded too good to be true; Wikipedia cites researchers describing much more blurry results than with JPEG, and no significant reduction of size in memory and storage.

    Michael Moore

    Wow, how about a 90% compression system that is math-based and not Huffman-based? What effect would that have on the net… video streaming, etc.

    Adolf1Extra

    I hope WebP will die in favour of FLIF. FLIF is waaaay better than WebP. I hate how Google aggressively pushes their own formats with no regard for the competition, drowning it out.

    dumbcreaknuller

    What about converting the whole data file into a gigantic expandable numerical array, then summing all the values into a gigantic floating-point number and storing that number in a new file? Or would that not work?
