How to Create HTML/ZIP/PNG Polyglot Files

Aljwadh, December 28, 2024

This article is a summary of the presentation available HERE. The resulting demo file can be downloaded at the end of the article. The repository can be found at https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG.

Introduction

SingleFile, a tool for web archiving, typically stores web page resources as data URIs. However, this approach can be inefficient for large resources. A more elegant solution combines the ZIP format’s flexible structure with HTML. We’ll then take this one step further by encapsulating the entire structure inside a PNG file.

The Power of ZIP

The ZIP format provides an organized structure for storing multiple files. It consists of file entries followed by a central directory. The central directory acts as a table of contents, with headers containing metadata about each file entry. These headers include important information such as file names, sizes, checksums, and file entry offsets. What makes ZIP so versatile is its flexibility in data placement. The format allows data to be prepended before the ZIP content by setting an offset greater than 0 for the first file entry, and allows up to 64 KB of data to be appended afterwards (i.e., the ZIP comment). This feature makes it well suited for creating polyglot files.
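
The placement rules above can be illustrated with a minimal sketch. It assumes the simplest possible archive, an empty ZIP made of nothing but the End of Central Directory record, and shows that a reader scanning backwards from the end of the file still finds that record when arbitrary bytes are prepended and a comment is appended. The helper names `buildEmptyArchive` and `findEndOfCentralDirectory` are illustrative and do not come from any library.

```javascript
const EOCD_SIGNATURE = 0x06054b50; // "PK\x05\x06", stored little-endian
const EOCD_MIN_SIZE = 22;          // size of the record without its comment
const MAX_COMMENT_SIZE = 0xffff;   // the comment length is a 16-bit field

// Builds an empty archive surrounded by foreign data (illustrative helper).
function buildEmptyArchive(prefix, comment) {
  const bytes = new Uint8Array(prefix.length + EOCD_MIN_SIZE + comment.length);
  const view = new DataView(bytes.buffer);
  bytes.set(prefix, 0);
  view.setUint32(prefix.length, EOCD_SIGNATURE, true);
  // The central directory is empty, so its offset is where the record starts.
  view.setUint32(prefix.length + 16, prefix.length, true);
  view.setUint16(prefix.length + 20, comment.length, true);
  bytes.set(comment, prefix.length + EOCD_MIN_SIZE);
  return bytes;
}

// Scans backwards for the End of Central Directory record.
function findEndOfCentralDirectory(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const lowest = Math.max(0, bytes.length - EOCD_MIN_SIZE - MAX_COMMENT_SIZE);
  for (let pos = bytes.length - EOCD_MIN_SIZE; pos >= lowest; pos--) {
    if (view.getUint32(pos, true) !== EOCD_SIGNATURE) continue;
    const commentLength = view.getUint16(pos + 20, true);
    // Sanity check: the comment must run exactly to the end of the file.
    if (pos + EOCD_MIN_SIZE + commentLength !== bytes.length) continue;
    return {
      offset: pos,
      entryCount: view.getUint16(pos + 10, true),
      centralDirectoryOffset: view.getUint32(pos + 16, true),
      commentLength
    };
  }
  return null;
}

const encoder = new TextEncoder();
const archive = buildEmptyArchive(encoder.encode("<!-- "), encoder.encode(" -->"));
console.log(findEndOfCentralDirectory(archive));
```

Because every offset is read from the record at the end of the file, the reader never needs the archive to start at byte 0, which is what leaves room for HTML before the ZIP data.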

Create HTML/ZIP Polyglot Files

Based on this knowledge, we can create a self-extracting archive that can be used by web browsers. The page to be displayed and its resources are stored in a ZIP file. By storing the ZIP data in an HTML comment, we can create a self-loading page that extracts and displays the contents of the ZIP file.

Here is the basic structure of the self-extracting page:

<div class="language-html highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>&lt;!doctype html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;meta charset=utf-8&gt;
    &lt;title&gt;Please wait...&lt;/title&gt;
    &lt;script src=lib/zip.min.js&gt;&lt;/script&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;p&gt;Please wait...&lt;/p&gt;
    &lt;!-- (ZIP data) --&gt;
    &lt;script&gt;&lt;!-- Content of assets/main.js --&gt;&lt;/script&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
</div>
</div>
The <code class="language-plaintext highlighter-rouge">assets/main.js</code> script on this “bootstrap page” reads the ZIP data by calling <code class="language-plaintext highlighter-rouge">fetch("")</code> and uses the <code class="language-plaintext highlighter-rouge">lib/zip.min.js</code> JavaScript library to extract it. This bootstrap page is then replaced by the extracted page with its resources. However, there’s a problem: due to the same-origin policy, retrieving ZIP data directly with <code class="language-plaintext highlighter-rouge">fetch("")</code> fails when the page is opened from the filesystem (except in Firefox).
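
The fetch-based approach described above might look roughly like the following sketch. It is not the actual content of `assets/main.js`: the function name `extractIndexPage`, the entry name `index.html`, and the choice to pass `fetch` and the zip.js namespace as parameters (so the function can be exercised outside a browser) are all assumptions made for illustration. The `ZipReader`, `BlobReader`, and `TextWriter` classes are the ones exposed by zip.js.

```javascript
// Reads the current document as a ZIP archive and returns one entry as text.
// `fetchSelf` is expected to behave like window.fetch, and `zipLib` like the
// global `zip` object provided by lib/zip.min.js.
async function extractIndexPage(fetchSelf, zipLib, indexName = "index.html") {
  // An empty URL resolves to the URL of the current document.
  const response = await fetchSelf("");
  const blob = await response.blob();
  const reader = new zipLib.ZipReader(new zipLib.BlobReader(blob));
  try {
    const entries = await reader.getEntries();
    const indexEntry = entries.find((entry) => entry.filename === indexName);
    if (!indexEntry) {
      throw new Error(`${indexName} not found in the archive`);
    }
    return await indexEntry.getData(new zipLib.TextWriter());
  } finally {
    await reader.close();
  }
}
```

In a browser, this would be called as `extractIndexPage(fetch, zip)`, and the returned markup would then replace the bootstrap page.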
<h2 id="reading-zip-data-from-the-dom">Reading ZIP Data from the DOM</h2>
To overcome the filesystem limitation, we can read ZIP data directly from the DOM. This approach requires careful handling of character encoding. The bootstrap page is now encoded in <code class="language-plaintext highlighter-rouge">windows-1252</code>, which allows data to be read from the DOM with minimal degradation. Several encoding challenges emerge:
<ol>
<li>DOM text content gets decoded to <code class="language-plaintext highlighter-rouge">UTF-16</code> instead of <code class="language-plaintext highlighter-rouge">windows-1252</code></li>
<li>The <code class="language-plaintext highlighter-rouge">NULL</code> character (<code class="language-plaintext highlighter-rouge">U+0000</code>) gets decoded to the replacement character (<code class="language-plaintext highlighter-rouge">U+FFFD</code>)</li>
<li>Carriage returns (<code class="language-plaintext highlighter-rouge">\r</code>) and carriage return + line feeds (<code class="language-plaintext highlighter-rouge">\r\n</code>) get decoded to line feeds (<code class="language-plaintext highlighter-rouge">\n</code>)</li>
</ol>
The first two points can be fixed by using a lookup table to convert characters back to <code class="language-plaintext highlighter-rouge">windows-1252</code> bytes. For the last point, “consolidation data” in a JSON script tag is added to the bootstrap page. This data tracks the offsets of carriage returns and carriage return + line feeds, and enables accurate reconstruction of the original content when extracting the ZIP data.
Here is the resulting structure:
<div class="language-html highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>&lt;!doctype html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;meta charset=windows-1252&gt;
    &lt;title&gt;Please wait...&lt;/title&gt;
    &lt;script src=lib/zip.min.js&gt;&lt;/script&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;p&gt;Please wait...&lt;/p&gt;
    &lt;!-- (ZIP data) --&gt;
    &lt;script type=application/json&gt;
    (consolidation data)
    &lt;/script&gt;
    &lt;script&gt;&lt;!-- Content of assets/main.js --&gt;&lt;/script&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
</div>
</div>
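
The three decoding fixes listed above can be sketched as follows. The layout of the consolidation data is not described here, so the `{ cr, crlf }` object, holding offsets into the original byte stream, is a hypothetical format chosen for illustration, as are the function names. The table of code points for bytes 0x80 to 0x9F follows the windows-1252 mapping used by browsers.

```javascript
// Code points that windows-1252 assigns to bytes 0x80..0x9F.
const HIGH_RANGE = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178
];

// Lookup table from DOM code units back to the original bytes.
const CHAR_TO_BYTE = new Map(HIGH_RANGE.map((codePoint, i) => [codePoint, 0x80 + i]));
CHAR_TO_BYTE.set(0xfffd, 0x00); // NULL bytes surface as U+FFFD in the DOM

// Fixes points 1 and 2: converts DOM text to windows-1252 bytes.
function domTextToBytes(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const byte = CHAR_TO_BYTE.has(code) ? CHAR_TO_BYTE.get(code) : code;
    if (byte > 0xff) {
      throw new RangeError(`Unexpected character U+${code.toString(16)} at index ${i}`);
    }
    bytes[i] = byte;
  }
  return bytes;
}

// Fixes point 3: puts back the carriage returns that the HTML parser
// normalized to line feeds. `cr` and `crlf` list offsets in the ORIGINAL data
// (hypothetical consolidation format).
function restoreCarriageReturns(bytes, { cr = [], crlf = [] }) {
  const lone = new Set(cr);
  const pair = new Set(crlf);
  const output = [];
  for (const byte of bytes) {
    const offset = output.length; // position in the reconstructed data
    if (byte === 0x0a && lone.has(offset)) output.push(0x0d);
    else if (byte === 0x0a && pair.has(offset)) output.push(0x0d, 0x0a);
    else output.push(byte);
  }
  return Uint8Array.from(output);
}

// Example: text as it would be read from the comment node in the DOM.
const decoded = domTextToBytes("PK\ufffd\n\u20ac\n\n");
console.log(restoreCarriageReturns(decoded, { cr: [3], crlf: [5] }));
```

Offsets refer to the original data rather than the normalized text so that each restored pair can shift later positions without any extra bookkeeping.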
<h2 id="adding-png-to-the-mix">Adding PNG to the Mix</h2>
The PNG format consists of a signature followed by chunks. Each chunk contains these fields:
<ul>
<li>Length (4 bytes)</li>
<li>Type identifier (4 bytes) e.g., <code class="language-plaintext highlighter-rouge">IHDR</code> (header), <code class="language-plaintext highlighter-rouge">IDAT</code> (data), <code class="language-plaintext highlighter-rouge">IEND</code> (end of file), <code class="language-plaintext highlighter-rouge">tEXt</code> (custom data)…</li>
<li>Data content (n bytes)</li>
<li>CRC32 checksum (4 bytes)</li>
</ul>
Here is the minimum structure of a PNG file:
<ol>
<li>PNG signature (8 bytes)</li>
<li><code class="language-plaintext highlighter-rouge">IHDR</code> chunk (25 bytes: 13 bytes of data plus 12 bytes for length, type, and checksum)</li>
<li>One or more <code class="language-plaintext highlighter-rouge">IDAT</code> chunks</li>
<li><code class="language-plaintext highlighter-rouge">IEND</code> chunk (12 bytes)</li>
</ol>
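
The chunk layout described above can be made concrete with a short sketch that builds chunks and walks through them. The function names are illustrative, and the CRC32 routine is written out by hand so that the example does not depend on any library. The sketch only demonstrates the container structure and does not produce a displayable image.

```javascript
const PNG_SIGNATURE = Uint8Array.of(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a);

// Standard CRC32 lookup table (polynomial 0xEDB88320).
const CRC_TABLE = Uint32Array.from({ length: 256 }, (_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c >>> 0;
});

function crc32(bytes) {
  let crc = 0xffffffff;
  for (const byte of bytes) crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  return (crc ^ 0xffffffff) >>> 0;
}

// Builds one chunk: length (4) + type (4) + data (n) + CRC32 (4).
function buildChunk(type, data = new Uint8Array(0)) {
  const chunk = new Uint8Array(12 + data.length);
  const view = new DataView(chunk.buffer);
  view.setUint32(0, data.length); // PNG integers are big-endian
  for (let i = 0; i < 4; i++) chunk[4 + i] = type.charCodeAt(i);
  chunk.set(data, 8);
  // The checksum covers the type and the data, but not the length.
  view.setUint32(8 + data.length, crc32(chunk.subarray(4, 8 + data.length)));
  return chunk;
}

// Lists the chunks that follow the 8-byte signature.
function listChunks(png) {
  const view = new DataView(png.buffer, png.byteOffset, png.byteLength);
  const chunks = [];
  let pos = PNG_SIGNATURE.length;
  while (pos + 12 <= png.length) {
    const length = view.getUint32(pos);
    const type = String.fromCharCode(...png.subarray(pos + 4, pos + 8));
    chunks.push({ type, length, offset: pos });
    pos += 12 + length;
  }
  return chunks;
}

// An empty IEND chunk is always the same 12 bytes.
console.log(Array.from(buildChunk("IEND"), (b) => b.toString(16).padStart(2, "0")).join(" "));
```

A `tEXt` chunk is built the same way, which is why arbitrary data such as an HTML page can be carried inside a PNG file as long as its length and checksum are correct.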
<h2 id="the-final-form-htmlzippng-polyglot-files">The Final Form: HTML/ZIP/PNG Polyglot Files</h2>
The ultimate implementation combines all three formats into a single file. The HTML format’s fault tolerance allows for this complex structure. However, this approach introduces new challenges:
<ol>
<li>The signature, the <code class="language-plaintext highlighter-rouge">IHDR</code> and the <code class="language-plaintext highlighter-rouge">IEND</code> chunks become visible as text nodes briefly and should be removed as soon as the page is parsed</li>
<li>The displayed page is rendered in quirks mode, requiring specific handling through <code class="language-plaintext highlighter-rouge">document.write()</code> and related methods to parse the displayed page</li>
</ol>
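
Both challenges above can be handled with a few DOM calls, sketched below. This is not the project's actual code: the function names are illustrative, and the cleanup assumes that the bootstrap page has no legitimate bare text directly inside <code class="language-plaintext highlighter-rouge">body</code>, so any text node found there comes from the PNG bytes. The document and body objects are passed as parameters so that the functions can be exercised outside a browser.

```javascript
const TEXT_NODE = 3; // value of Node.TEXT_NODE

// Removes the text nodes created by the PNG signature and chunk headers.
// Assumption: every text node that is a direct child of `body` is stray.
function removeStrayTextNodes(body) {
  // Copy the live list first, since removing nodes mutates it.
  for (const node of Array.from(body.childNodes)) {
    if (node.nodeType === TEXT_NODE) node.remove();
  }
}

// Replaces the bootstrap page with the extracted one. Opening the document
// starts a fresh parse, so a page beginning with a doctype is no longer
// rendered in quirks mode.
function replaceDocument(doc, html) {
  doc.open();
  doc.write(html);
  doc.close();
}
```

In a browser, these would be called as `removeStrayTextNodes(document.body)` once the page is parsed, and `replaceDocument(document, extractedHtml)` once the ZIP data has been extracted.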
Here is the resulting structure viewed as PNG chunks:
<div class="language-plaintext highlighter-rouge">
<div class="highlight">
<pre class="highlight"><code>(PNG signature)
(IHDR chunk)
(tEXt chunk
  &lt;!doctype html&gt;
  &lt;html&gt;
    &lt;head&gt;
      &lt;meta charset=windows-1252&gt;
      &lt;title&gt;Please wait...&lt;/title&gt;
      ...
)
(IEND chunk)
</code></pre>
</div>
</div>

Optimization Through Image Reuse

The final optimization removes the main image from the ZIP file and reuses the polyglot file itself, interpreted as a PNG file, in place of that image on the displayed page.

Result File

Download demo.png.zip.html.

2024-12-27 23:10:00
