18 Montessori Soldering School

image

Pastor Manul Laphroaig’s

Montessori Soldering School and Stack Smashing Academy

for Youngsters Gifted and Not

image

Рукописи не горят. (Manuscripts don’t burn.) pocorgtfo18.pdf. Compiled on June 23, 2018.
Application Fee: €0, $0 USD, $0 AUD, 0 RSD, 0 SEK, $50 CAD, 6 × 10²⁹ Pengő (3 × 10⁸ Adópengő), 100 JPC.

18:02 An 8 Kilobyte Mode 7 Demo for the Apple II

by Vincent M. Weaver

While making an inside-joke filled game for my favorite machine, the Apple ][, I needed to create a Final-Fantasy-esque flying-over-the-planet sequence. I was originally going to fake this, but why fake graphics when you can laboriously spend weeks implementing the effect for real. It turns out the Apple ][ is just barely capable of generating such an effect in real time.

Once I got the code working I realized it would be great as part of a graphical demo, so off on that tangent I went. This turned out well, despite the fact that all I knew about the demoscene I had learned from a few viewings of the Future Crew Second Reality demo combined with dimly remembered Commodore 64 and Amiga usenet flamewars.

While I hope you enjoy the description of the demo and the work that went into it, I suspect this whole enterprise is primarily of note due to the dearth of demos for the Apple ][ platform. For those of you who would like to see a truly impressive Apple ][ demo, I would like to make a shout out to FrenchTouch whose works put this one to shame.

image

The Hardware

CPU, RAM and Storage:

The Apple ][ was introduced in 1977 with a 6502 processor running at roughly 1.023 MHz. Early models only shipped with 4k of RAM, but in later years, 48k, 64k and 128k systems became common. While the demo itself fits in 8k, it decompresses to a larger size and uses a full 48k of RAM; this would have been very expensive in the seventies.

In 1977 you would probably be loading this from cassette tape, as it would be another year before Woz’s single-sided Disk II came around. With the release of Apple DOS 3.3 in 1980, each side of a floppy offered 140k of storage.

Sound:

The only sound available in a stock Apple ][ is a bit-banged speaker. There is no timer interrupt; if you want music, you have to cycle-count via the CPU to get the waveforms you need.

The demo uses a Mockingboard soundcard, first introduced in 1981. This board contains dual AY-3-8910 sound generation chips connected via 6522 I/O chips. Each chip provides three channels of square waves as well as noise and envelope effects.

Graphics:

It is hard to imagine now, but the Apple ][ had nice graphics for its time. Compared to later competitors, however, it had some limitations: No hardware sprites, user-defined character sets, blanking interrupts, palette selection, hardware scrolling, or even a linear framebuffer! It did have hardware page flipping, at least.

The hi-res graphics mode is a complex mess of NTSC hacks by Woz. You get approximately 280 × 192 resolution, with 6 colors available. The colors are NTSC artifacts with limitations on which colors can be next to each other, in blocks of 3.5 pixels. There is plenty of fringing on edges, and colors change depending on whether they are drawn at odd or even locations. To add to the madness, the framebuffer is interleaved in a complex way, and pixels are drawn least-significant-bit first. (All of this to make DRAM refresh better and to shave a few 7400 series logic chips from the design.) You do get two pages of graphics: Page 1 is at $2000 and Page 2 at $4000. Optionally four lines of text can be shown at the bottom of the screen instead of graphics.
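
To make the interleaving concrete, here is a small C sketch (an illustration assuming the standard hi-res layout, not code from the demo) that computes where a scanline starts:

#include <stdint.h>

/* Illustration only: the usual hi-res row interleave, not the demo's code. */
uint16_t hires_row_addr(uint8_t page, uint8_t y)   /* page 1 or 2, y = 0..191 */
{
    uint16_t base = (page == 2) ? 0x4000 : 0x2000;
    return base + 0x400 * (y % 8)         /* scanline within a character row */
                + 0x080 * ((y / 8) % 8)   /* character row within a third    */
                + 0x028 * (y / 64);       /* top, middle, or bottom third    */
}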

The lo-res mode is a bit easier to use. It provides 40 × 48 blocks, reusing the same memory as the 40 × 24 text mode. (As with hi-res you can switch to a 40 × 40 mode with four lines of text displayed at the bottom.) Fifteen unique colors are available, plus a second shade of grey. Again the addresses are interleaved in a non-linear fashion. Lo-res Page 1 is at $400 and Page 2 is at $800.

Some amazing effects can be achieved by cycle counting, reading the floating bus, and racing the beam while toggling graphics modes on the fly.

Development Toolchain

I do all of my coding under Linux, using the ca65 assembler from the cc65 project. I cross-compile the code, constructing Apple-DOS 3.3 disk images using custom tools I have written. I test first in emulation, where AppleWin under Wine is the easiest to use, but until recently MESS/MAME had cleaner sound.

Once the code appears to work, I put it on a USB stick and transfer to actual hardware using a CFFA3000 disk emulator installed in an Apple IIe platinum edition.

image

Colorful View of Executable Code

image

The Title Screen

 ------------- $ffff
| ROM/IO      |
 -------------  $c000
|             |
| Uncompressed|
| Code/Data   |
|             |
-------------   $4000
| Compressed  |
|   Code      |
-------------   $2000
|   free      |
-------------   $1c00
|  Scroll     |
|   Data      |
-------------   $1800
|  Multiply   |
|   Tables    |
-------------   $1000
| LORES pg 3  |
-------------   $0c00
| LORES pg 2  |
-------------   $0800
| LORES pg 1  |
-------------   $0400
|free/vectors |
-------------   $0200
|   stack     |
-------------   $0100
|   zero pg   |
-------------   $0000

Memory Map

Bootloader

An Applesoft BASIC “HELLO” program loads the binary automatically at bootup. This does not count towards the executable size, as you could manually BRUN the 8k machine-language program if you wanted.

To make the loading time slightly more interesting, the HELLO program enables graphics mode and loads the program to address $2000 (hi-res page 1). This causes the display to be filled with the colorful pattern corresponding to the compressed image. (Page 631.) This conveniently fills all 8k of the display RAM, or would have if we had poked the right soft-switch to turn off the bottom four lines of text. After loading, execution starts at address $2000.

Decompression

The binary is encoded with the LZ4 algorithm. We flip to hi-res Page 2 and decompress to this region so the display now shows the executable code.

The 6502 size-optimized LZ4 decompression code was written by qkumba (Peter Ferrie).1 The program and data decompress to around 22k starting at $4000. This overwrites parts of DOS3.3, but since we are done with the disk this is no problem.

If you look carefully at the upper left corner of the screen during decompression you will see my triangular logo, which is supposed to evoke my VMW initials. To do this I had to put the proper bit pattern inside the code at the interleaved addresses of $4000, $4400, $4800, and $4C00. The image data at $4000 maps to (mostly) harmless code so it is left in place and executed.

Optimizing the code inside of a compressed image (to fit in 8k) is much more complicated than regular size optimization. Removing instructions sometimes makes the binary larger as it no longer compresses as well. Long runs of a single value, such as zero padding, are essentially free. This became an exercise of repeatedly guessing and checking, until everything fit.

Title Screen

Once decompression is done, execution continues at address $4000. We switch to low-res mode for the rest of the demo.

FADE EFFECT: The title screen fades in from black, which is a software hack as the Apple ][ does not have palette support. This is done by loading the image to an off-screen buffer and then using a lookup table to copy faded versions into the image buffer on the fly.

TITLE GRAPHICS: The title screen is shown on page 631. The image is run-length encoded (RLE) which is probably unnecessary in light of it being further LZ4 encoded. (LZ4 compression was a late addition to this endeavor.)

Why not save some space and just load our demo at $400, negating the need to copy the image in place? Remember the graphics are 40 × 48 (shared with the text display region). It might be easier to think of it as 40 × 24 characters, with the top and bottom nybbles of each ASCII character being interpreted as colors for a half-height block. If you do the math you will find this takes 960 bytes of space, but the memory map reserves 1k for this mode. There are “holes” in the address range that are not displayed, and various pieces of hardware can use these as scratchpad memory. This means just overwriting the whole 1k with data might not work out well unless you know what you are doing. Our RLE decompression code skips the holes just to be safe.
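
For illustration, plotting a lo-res block looks roughly like the following C sketch (my own illustration, assuming the usual text-page interleave and that the low nybble is the upper block; not the demo's code):

#include <stdint.h>

/* Sketch only: lo-res plot on Page 1 at $400, assuming the standard layout. */
void lores_plot(uint8_t *page, uint8_t x, uint8_t y, uint8_t color)  /* 40x48 */
{
    uint8_t row = y / 2;                    /* two blocks per character cell   */
    /* three 40-byte rows per 128-byte group, which leaves the 8-byte "holes"  */
    uint8_t *p = page + 0x80 * (row % 8) + 0x28 * (row / 8) + x;
    if (y & 1)
        *p = (*p & 0x0f) | (color << 4);    /* bottom block in the high nybble */
    else
        *p = (*p & 0xf0) | (color & 0x0f);  /* top block in the low nybble     */
}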

SCROLL TEXT: The title screen has scrolling text at the bottom. This is nothing fancy: the text is in an off-screen buffer, and a 40 × 4 chunk of RAM is copied in every so many cycles.

You might notice that there is tearing/jitter in the scrolling even though we are double-buffering the graphics. Sadly there is no reliable cross-platform way to get the VBLANK info on Apple ][ machines, especially the older models.

Mockingbird Music

No demo is complete without some exciting background music. I like chiptune music, especially the kind written for AY-3-8910 based systems. During the long wait for my Mockingboard hardware to arrive, I designed and built a Raspberry Pi chiptune player that uses essentially the same hardware. This allowed me to build up some expertise with the software/hardware interface in advance.

The song being played is a stripped down and re-arranged version of “Electric Wave” from CC’00 by EA (Ilya Abrosimov).

Most of my sound infrastructure involves YM5 files, a format commonly used by ZX Spectrum and Atari ST users. The YM file format is just AY-3-8910 register dumps taken at 50Hz. To play these back one sets up the sound card to interrupt 50 times a second and then writes out the fourteen register values from each frame in an interrupt handler.

Writing out the registers quickly enough is a challenge on the Apple ][, as for each register you have to do a handshake and then set both the register number and the value. It is hard to do this in less than forty 1MHz cycles for each register. With complex chiptune files (especially those written on an ST with much faster hardware), sometimes it is not possible to get exact playback due to the delay. Further slowdown happens as you want to write both AY chips (the output is stereo, with one AY on the left and one on the right). To help with latency on playback, we keep track of the last frame written and only write to the registers that have changed.
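
The last optimization, sketched in C for clarity (the real player is 6502 assembly, and ay_write is a hypothetical helper that performs the 6522 handshake and register/data writes):

#include <stdint.h>

#define AY_REGS 14

extern void ay_write(int chip, uint8_t reg, uint8_t val);  /* hypothetical: latch register, then write data */

static uint8_t last_frame[AY_REGS];

void play_frame(const uint8_t *frame)          /* fourteen register values for this tick */
{
    for (uint8_t reg = 0; reg < AY_REGS; reg++) {
        if (frame[reg] == last_frame[reg])
            continue;                          /* skip the ~40-cycle write if unchanged  */
        ay_write(0, reg, frame[reg]);          /* repeat for the second AY as needed     */
        last_frame[reg] = frame[reg];
    }
}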

The demo detects the Mockingboard in Slot 4 at startup. First the board is initialized, then one of the 6522 timers is set to interrupt at 25Hz. Why 25Hz and not 50Hz? At 50Hz with fourteen registers you use 700 bytes/s. So a two minute song would take 84k of RAM, which is much more than is available! To allow the song to fit in memory, without a fancy circular buffer decompression routine, we have to reduce the size.2

First the music is changed so it only needs to be updated at 25Hz, and then the register data is compressed from fourteen bytes to eleven bytes per frame by stripping off the envelope effects and packing together fields that have unused bits. At 25Hz that is only 275 bytes per second, so a two minute song now fits in roughly 33k. In the end the sound quality suffered a bit, but we were able to fit an acceptably catchy chiptune inside of our 8k payload.

Drawing the Mode7 Background

Mode 7 is a Super Nintendo (SNES) graphics mode that takes a tiled background and transforms it by rotating and scaling. The most common effect squashes the background out to the horizon, giving a three-dimensional look. The SNES did these transforms in hardware, but our demo must do them in software.

Our algorithm is based on code by Martijn van Iersel which iterates through each horizontal line on the screen and calculates the color to output based on the camera height (spacez) and angle, as well as the current coordinates, x and y.

First, the distance d is calculated based on fixed scale and distance-to-horizon factors. Instead of a costly division operation, we use a pre-generated lookup table for this.

image

Next we calculate the horizontal scale (distance between points on this line):

image

Then we calculate delta x and delta y values between each block on the line. We use a pre-computed sine/cosine lookup table.

Δx = –sin(angle) × h
Δy = cos(angle) × h

The leftmost position in the tile lookup is calculated:

image

Then an inner loop happens that adds Δx and Δy as we lookup the color from the tilemap (just a wrap-around array lookup) for each block on the line.

color = tilelookup(tilex, tiley)
plot(x, y)
tilex += Δx, tiley += Δy
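
Putting the pieces together, the per-line loop looks roughly like the following floating-point C sketch patterned after van Iersel's tutorial; the scale constants and map size are placeholders, and the demo replaces every division, sine, cosine, and multiply with 8.8 fixed-point lookup tables and the fast multiply described below.

#include <math.h>
#include <stdint.h>

#define W 40
#define H 40
#define MAP 64                  /* placeholder: tilemap is MAP x MAP and wraps */
#define SCALE_X 1.0             /* placeholder scale factors                   */
#define SCALE_Y 16.0

extern uint8_t tilemap[MAP][MAP];
extern void plot(int x, int y, uint8_t color);

void mode7_frame(double cam_x, double cam_y, double z, double angle)
{
    for (int y = 1; y < H; y++) {
        double d  = z * SCALE_Y / y;   /* distance of this scanline (a lookup table in the demo) */
        double h  = d / SCALE_X;       /* horizontal step between blocks on the line             */
        double dx = -sin(angle) * h;
        double dy =  cos(angle) * h;
        /* leftmost point of the line in tile coordinates */
        double tx = cam_x + d * cos(angle) - (W / 2) * dx;
        double ty = cam_y + d * sin(angle) - (W / 2) * dy;
        for (int x = 0; x < W; x++) {  /* inner loop: lookup, plot, step */
            int ix = ((int)tx % MAP + MAP) % MAP;   /* wrap-around tile lookup */
            int iy = ((int)ty % MAP + MAP) % MAP;
            plot(x, y, tilemap[iy][ix]);
            tx += dx;
            ty += dy;
        }
    }
}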

Optimizations: The 6502 processor cannot do floating point, so all of our routines use 8.8 fixed point math. We eliminate all use of division, and convert as much as possible to table lookups, which involves limiting the heights and angles a bit.

Some cycles are also saved by using self-modifying code, most notably hard-coding the height (z) value and modifying the code whenever this is changed. The code started out only capable of roughly 4.9fps in 40 × 20 resolution and in the end we improved this to 5.7fps in 40 × 40 resolution. Care was taken to optimize the innermost loop, as every cycle saved there results in 1280 cycles saved overall.

Fast Multiply: One of the biggest bottlenecks in the mode7 code was the multiply. Even our optimized algorithm calls for at least seven 16-bit by 16-bit to 32-bit multiplies, something that is really slow on the 6502. A typical implementation takes around 700 cycles for an 8.8 × 8.8 fixed point multiply.

We improved this by using the ancient quarter-square multiply algorithm, first described for 6502 use by Stephen Judd.

This works by noting these factorizations:

(a + b)² = a² + 2ab + b²

(a − b)² = a² − 2ab + b²

If you subtract these you can simplify to

(a + b)² − (a − b)² = 4ab,  and so  a × b = (a + b)²/4 − (a − b)²/4

For 8-bit values if you create a table of squares from 0 to 511, then you can convert a multiply into two table lookups and a subtraction.3 This does have the downside of requiring two kilobytes of lookup tables, but it reduces the multiply cost to the order of 250 cycles or so and these tables can be generated at startup.
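
In C, the trick looks roughly like this (an illustration of the general method, not the demo's 6502 code); the table here stores ⌊x²/4⌋ so the division by four from the identity above is folded into the lookups:

#include <stdint.h>

static uint16_t quarter_sq[512];            /* |a+b| and |a-b| fit in 0..510 */

void quarter_sq_init(void)                  /* build the table once at startup */
{
    for (int x = 0; x < 512; x++)
        quarter_sq[x] = (uint16_t)((x * x) / 4);
}

uint16_t fast_mul8(uint8_t a, uint8_t b)    /* a*b via two lookups and a subtract */
{
    int sum  = a + b;
    int diff = (a > b) ? a - b : b - a;
    return quarter_sq[sum] - quarter_sq[diff];
}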

image

Bouncing ball on Infinite Checkerboard

image

Spaceship Flying Over an Island

BALL ON CHECKERBOARD

The first Mode7 scene transpires on an infinite checkerboard. A demo would be incomplete without some sort of bouncing geometric solid, in this case we have a pink sphere. The sphere is represented by sixteen sprites that were captured from a twenty year old OpenGL example. Screenshots were reduced to the proper size and color limitations. The shadows are also sprites, and as the Apple ][ has no dedicated sprite hardware, these are drawn completely in software.

The clicking noise on bounce is generated by accessing the speaker port at address $C030. This gives some sound for those viewing the demo without the benefit of a Mockingboard.

TFV SPACESHIP FLYING

This next scene has a spaceship flying over an island. The Mode7 graphics code is generic enough that only one copy of the code is needed to generate both the checkerboard and island scenes. The spaceship, water splash, and shadows are all sprites. The path the ship takes is pre-recorded; this is adapted from the Talbot Fantasy 7 game engine with the keyboard code replaced by a hard-coded script of actions to take.

image
image

Spaceship with Starfield

image

Rasterbars, Stars, and Credits

STARFIELD

The spaceship now takes to the stars. This is typical starfield code, where on each iteration the x and y values are changed by

image

In order to get a good frame rate and not clutter the lo-res screen, only sixteen stars are modeled. To avoid having to divide, the reciprocals of all possible z values are stored in a table, and the fast-multiply routine described previously is used.
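
The projection step then looks roughly like this C sketch (my own illustration with made-up constants, not the demo's code): each star's screen position is its x or y times the reciprocal of its z, done with the table and the fast multiply.

#include <stdint.h>

#define CENTER_X 20                        /* lo-res screen is 40x40          */
#define CENTER_Y 20

struct star { int16_t x, y; uint8_t z; };

extern uint16_t recip_z[256];              /* 8.8 fixed-point table of 1/z, z = 1..255 */
extern void plot(int x, int y);

void project_star(const struct star *s)
{
    /* x/z and y/z without dividing: multiply by the reciprocal instead */
    int sx = CENTER_X + ((s->x * recip_z[s->z]) >> 8);
    int sy = CENTER_Y + ((s->y * recip_z[s->z]) >> 8);
    plot(sx, sy);
}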

The star positions require random number generation, but there is no easy way to quickly get random data on the Apple ][. Originally we had a 256-byte blob of pre-generated “random” values included in the code. This wasted space, so instead we use our own machine code at address $5000 as if it were a block of random numbers!

A simple state machine controls star speed, ship movement, hyperspace, background color (for the blue flash) and the eventual sequence of sprites as the ship vanishes into the distance.

RASTERBARS/CREDITS

Once the ship has departed, it is time to run the credits as the stars continue to fly by.

The text is written to the bottom four lines of the screen, seemingly surrounded by graphics blocks. Mixed graphics/text is generally not possible on the Apple ][, although with careful cycle counting and mode switching, groups such as FrenchTouch have achieved this effect. What we see in this demo is the use of inverse-mode (inverted color) space characters which appear the same as white graphics blocks.

The rasterbar effect is not really rasterbars, just a colorful assortment of horizontal lines drawn at a location determined with a sine lookup table. Horizontal lines can take a surprising amount of time to draw, but these were optimized using inlining and a few other tricks.

The spinning text is done by just rapidly rotating the output string through the ASCII table, with the clicking effect again generated by hitting the speaker at address $C030. The list of people to thank ended up being the primary limitation to fitting in 8kB, as unique text strings do not compress well. I apologize to everyone whose moniker got compressed beyond recognition, and I am still not totally happy with the centering of the text.

A Parting Gift

Further details, a prebuilt disk image, and full source code are available both online and attached to the electronic version of this document.4 5

image

18:03 Fun Memory Corruption Exploits for Kids with Scratch!

by Kev Sheldrake

image

When my son graduated from Scratch Junior on the iPad to full-blown Scratch on a desktop computer, I opted to protect the Internet from him by not giving him a network interface. Instead I installed the offline version of Scratch on his computer that works completely stand-alone. One of the interesting differences between the online and offline versions of Scratch is the way in which it can be extended; the offline version will happily provide an option to install an ‘Experimental HTTP Extension’ if you use the super-secret ‘shift click’ on the File menu instead of the regular, common-or-garden ‘click’.

These extensions allow Scratch to communicate with another process outside the sandbox through a web service; there is an abandoned Python module that provides a suitable framework for building them. While words like ‘experimental’ and ‘abandoned’ don’t appear to offer much hope, this is all just a facade and the technology actually works pretty well. Indeed, we have interfaced Scratch to Midi, Arduino projects and, as this essay will explain, TCP/IP network sockets because, well, if a language exists to teach kids how to code then I think it can and should also be used to teach them how to hack.

Scratch Basics

If you’re not already aware, Scratch is an IDE and a language, all wrapped up in a sandbox built out of Squeak/Smalltalk (v1.0 to v1.4), Flash/Adobe Air (v2.0) and HTML5/Javascript (v3.0). Within it, sprite-based programs can be written using primitives that resemble jigsaw pieces that constrain where or how they can be placed. For example, an IF/THEN primitive requires a predicate operator, such as X=Y or X>Y; in Scratch, predicates have angled edges and only fit in places where predicates are accepted. This makes it easier for children to learn how to combine primitives to make statements and eventually programs.

All code lives behind sprites or the stage (background); it can sense key presses, mouse clicks, sprites touching, etc, and can move sprites and change their size, colour, etc. If you ever wanted to recreate that crappy flash game you played in the late 90s at university or in your first job then Scratch is perfect for that. You could probably get something that looks suitably professional within an afternoon or less.

Don’t be fooled by the fact it was made for kids. Scratch can make some pretty cool things, but also be aware that it has its limitations and that a lack of networking is among them.

The offline version of Scratch relies on Adobe Air which has been abandoned on Linux. An older 32-bit version can be installed, but you’ll have much better results if you just try this on Windows or MacOS.

Scratch Extensions

Extensions were introduced in Scratch v2.0 and differ between the online and offline versions. For the online version extensions are coded in JS, stored on github.io and accessed via the ScratchX version of Scratch. As I had limited my son to the offline version, we were treated to web service extensions built in Python.

image

On the face of it a web service seems like an obvious choice because they are easy to build, are asynchronous by nature and each method can take multiple arguments. In reality, this extension model was actually designed for controlling things like robot arms rather than anything generic. There are commands and reporters, each represented in Scratch as appropriate blocks; commands would move robot motors and reporters would indicate when motor limits are hit. To put these concepts into more standard terms, commands are essentially procedures. They take arguments but provide no responses, and reporters are essentially global variables that can be affected by the procedures. If you think this is a weird model to program in then you’d be correct.

In order to quickly and easily build a suitable web service, we can use the off-the-shelf abandonware, Blockext.0 This is a python module that provides the full web service functionality to an object that we supply. It’s relatively trivial to build methods that create sockets, write to sockets, and close sockets, as we can get away without return values. To implement methods that read from sockets we need to build a command (procedure) that does the actual read, but puts the data into a global variable that can be read via a reporter.

At this point it is worth discussing how these reporters / global variables actually function. They are exposed via the web service by simply reporting their values thirty times a second. That’s right, thirty times a second. This makes them great for motor limit switches where data is minimal but latency is critical, but less great at returning data from sockets. Still, as my hacky extension shows, if their use is limited they can still work. The blockext console doesn’t log reporter accesses but a web proxy can show them happening if you’re interested in seeing them.

Scratch Limitations
image

While Scratch can handle binary data, it doesn’t really have a way to input it, and certainly no C-style or pythonesque formatting. It also has no complex data types; variables can be numbers or strings, but the language is probably Turing-complete so this shouldn’t really stop us. There is also no random access into strings or any form of string slicing; we can however retrieve a single letter from a string by position.

Strings can be constructed from a series of joins, and we can write a python handler to convert from an ASCIIfied format (such as ‘\xNN’) to regular binary. Stripping off newlines on returned strings requires us to build a new (native) Scratch block. Just like the python blocks accessible through the web service, these blocks are also procedures with no return values. We are therefore constrained to returning values via (sprite) global variables, which means we have to be careful about concurrency.

Talking of concurrency, Scratch has a handy message system that can be used to create parallel processing. As highlighted, however, the lack of functions and local variables means we can easily run into problems if we’re not careful.

Blockext

The Python blockext module can be cloned from its repository and installed with a simple sudo python setup.py install.

My socket extension is quite straightforward. The definition of the object is mostly standard socket code. While it has worked in my limited testing, feel free to make it more robust for any production use. This is just a PoC after all.

image
image
image
image

The remaining code is simply the description of the blocks that the extension makes available over the web service to Scratch. Each block line takes four arguments: the Python function to call, the type of block (command, predicate or reporter), the text description that the Scratch block will present (how it will look in Scratch), and the default values. For reference, predicates are simply reporter blocks that only return a boolean value.

The text description includes placeholders for the arguments to the Python function: %s for a string, %n for a number, and %m for a drop-down menu. All %m arguments are post-fixed with the name of the menu from which the available values are taken. The actual menus are described as a dictionary of named lists.

Finally, the object is linked to the description and the web service is then started. This Python script is launched from the command line and will start the web service on the given port.

image
Linking into Scratch

The web service provides the required web service description file from its index page. Simply browse to http://localhost:5000 and download the Scratch 2 extension file (Scratch Scratch Sockets English.s2e). To load this into Scratch we need to use the super-secret ‘shift click’ on the File menu to reveal the ‘Import experimental HTTP extension’ option. Navigate to the s2e file and the new blocks will appear under ‘More Blocks’.

image
Fuzzing, crashing, controlling EIP, and exploiting

In order to demonstrate the use of the extension, I obtained and booted the TinySploit VM from Saumil Shah’s ExploitLab, and then used the given stack-based overflow to gain remote code execution. The details are straightforward; the shell code by Julien Ahrens came from ExploitDB and was modified to execute Busybox correctly.1 Scratch projects are available as an attachment to this PDF.2

Scratch is a great language/IDE to teach coding to children. Once they’ve successfully built a racing game and a PacMan clone, it can also be used to teach them to interact with the world outside of Scratch. As I mentioned in the introduction, we’ve interfaced Scratch to Midi and Arduino projects from where a whole world opens up. The above screen shots show how it can also be interfaced to a simple TCP/IP socket extension to allow interaction with anything on the network.

From here it is possible to cause buffer overflows that lead to crashes and, through standard stack-smashing techniques, to remote code execution. When I was a child, Z-80 assembly was the second language I learned after BASIC on a ZX Spectrum. (The third was 8086 funnily enough!) I hunted for infinite lives and eventually became a reasonable C programmer. Perhaps with a (slightly better) socket extension, Scratch could become a gateway to x86 shell code. I wonder whether IT teachers would agree?

—Kev Sheldrake

image

18:04 Concealing ZIP Files in NES Cartridges

by Vi Grey

Hello, neighbors.

This story begins with the fantastic work described in PoC‖GTFO 14:12, which presented an NES ROM that was also a PDF. That file, pocorgtfo14.pdf, was by coincidence also a ZIP file. That issue inspired me to learn 6502 Assembly, develop an NES game from scratch, and burn it onto a physical cartridge for the #tymkrs.

During development, I noticed that the unused game space was just being used as padding and that any data could be placed in that padding. Although I ended up using that space for something else in the game, I realized that I could use padding space to make an NES ROM that is also a ZIP file. This polyglot file wouldn’t make the NES ROM any bigger than it originally was. I quickly got to work on this idea.

The method described in this article to create an NES + ZIP polyglot file is different from that which was used in PoC‖GTFO 14:12. In that method, none of the ZIP file data is saved inside the NES ROM itself. My method is able to retain the ZIP file data, even when it is burned onto a cartridge. If you rip the data off of a cartridge, the resulting NES ROM file will still be an NES + ZIP polyglot file.

Numbers and ranges included in figures in this article will be in hexadecimal. Range values are big-endian, and ranges work the same as Python slices, where [x:y] is the range from x up to, but not including, y.

image
image
image

Figure 18.43: iNES File Format

iNES File Format

This article focuses on the iNES file format. This is because, as was described in PoC‖GTFO 14:12, iNES is essentially the de facto standard for NES ROM files. Figure 18.43 shows the structure of an NES ROM in the iNES file format that fits on an NROM-128 cartridge.0

The first sixteen bytes of the file MUST be the iNES Header, which provides information for NES Emulators to figure out how to play the ROM.
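
As a point of reference, the sixteen header bytes break down roughly like this (a sketch based on the commonly documented iNES layout, not taken from this article's figures):

#include <stdint.h>

struct ines_header {                /* 16 bytes at the start of the file       */
    uint8_t magic[4];               /* 'N', 'E', 'S', 0x1A                     */
    uint8_t prg_banks;              /* PRG ROM size in 16 KiB units            */
    uint8_t chr_banks;              /* CHR ROM size in 8 KiB units             */
    uint8_t flags6;                 /* mirroring, battery, trainer, mapper low */
    uint8_t flags7;                 /* console type, mapper high               */
    uint8_t rest[8];                /* rarely used; zero in most ROMs          */
};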

Following the iNES Header is the 16 KiB PRG ROM. If the PRG ROM data doesn’t fill up that entire 16 KiB, then the PRG ROM will be padded. As long as the PRG padding isn’t actually being used, it can be any byte value, as that data is completely ignored. The final six bytes of the PRG ROM data are the interrupt vectors, which are required.

Eight kilobytes of CHR ROM data follows the PRG ROM.

image

Figure 18.44: End of Central Directory Record Format

ZIP File Format

There are two things in the ZIP file format that we need to focus on to create this polyglot file, the End of Central Directory Record and the Central Directory File Headers.

End of Central Directory Record

To find the data of a ZIP file, a ZIP file extractor should start searching from the back of the file towards the front until it finds the End of Central Directory Record. The parts we care about are shown in Figure 18.44.

The End of Central Directory Record begins with the four-byte big-endian signature 504B0506.

Twelve bytes after the end of the signature is the four-byte Central Directory Offset, which states how far from the beginning of the file the start of the Central Directory will be found.

The following two bytes state the ZIP file comment length, which is how many bytes after the ZIP file data the ZIP file comment will be found. Two bytes for the comment length means we have a maximum length value of 65,535 bytes, more than enough space to make our polyglot file.
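
A minimal C sketch of that backwards search (an illustration, not the author's tooling), pulling out the two fields we care about:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Find the End of Central Directory Record by scanning backwards for PK\x05\x06. */
int find_eocd(const uint8_t *buf, size_t len, uint32_t *cd_offset, uint16_t *comment_len)
{
    if (len < 22) return -1;                          /* 22 = minimum EOCD size */
    for (size_t i = len - 22; ; i--) {
        if (memcmp(buf + i, "PK\x05\x06", 4) == 0) {
            /* both fields are stored little-endian in the file */
            *cd_offset   = buf[i+16] | buf[i+17] << 8 |
                           (uint32_t)buf[i+18] << 16 | (uint32_t)buf[i+19] << 24;
            *comment_len = (uint16_t)(buf[i+20] | buf[i+21] << 8);
            return 0;
        }
        if (i == 0) break;
    }
    return -1;
}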

image

Figure 18.45: Central Directory File Header Format

Central Directory File Headers

For every file or directory that is zipped in the ZIP file, a Central Directory File Header exists. The parts we care about are shown in Figure 18.45.

Each Central Directory File Header starts with the four-byte big-endian signature 504B0102.

38 bytes after the signature is a four-byte Local Header Offset, which specifies how far from the beginning of the file the corresponding local header is.

Miscellaneous ZIP File Fun

Five bytes into each Central Directory File Header is a byte that determines which Host OS the file attributes are compatible for.

The document, “APPNOTE.TXT - .ZIP File Format Specification” by PKWARE, Inc., specifies what Host OS goes with which decimal byte value.1 I included a list of hex byte values for each Host OS below.

image

00 - MS-DOS and OS/2
01 - Amiga
02 - OpenVMS
03 - UNIX
04 - VM/CMS
05 - Atari ST
06 - OS/2 H.P.F.S.
07 - Macintosh
08 - Z-System
09 - CP/M
0A - Windows NTFS
0B - MVS (OS/390 - Z/OS)
0C - VSE
0D - Acorn Risc
0E - VFAT
0F - Alternate MVS
10 - BeOS
11 - Tandem
12 - OS/400
13 - OS/X (Darwin)
14-FF - Unused

Although 0A is specified for Windows NTFS and 0B is specified for MVS (OS/390 - Z/OS), I kept getting the Host OS value of TOPS-20 when I used 0A and NTFS when I used 0B.

I ended up deciding to set the Host OS for all of the Central Directory File Headers to Atari ST. With that said, I have tested every Host OS value from 00 to FF on this file and it extracted properly for every value. Different Host OS values may produce different read, write, and execute values for the extracted files and directories.

iNES + ZIP File Format

With this information about iNES files and ZIP files, we can now create an iNES + ZIP polyglot file, as shown in Figure 18.46.

Here, the first sixteen bytes of the file continue to be the same iNES header as before.

image

Figure 18.46: iNES + ZIP Polyglot File Format

The PRG ROM still starts in the same location. Somewhere in the PRG Padding an amount of bytes equal to the length of the ZIP file data is replaced with the ZIP file data. The ZIP file data starts at hex offset YYyy and ends right before the PRG Interrupt Vectors. This ZIP file data MUST be smaller than or equal to the size of the PRG Padding to make this polyglot file.

Local Header Offsets and the Central Directory Offset of the ZIP file data are updated by adding the ZIP data’s start offset (the little-endian value yyYY) to them, and the ZIP file comment length is set to 8,198 (hex 2006, stored little-endian), which is the length of the PRG Interrupt Vectors plus the CHR ROM (8 KiB).
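
A rough C sketch of those fixups (an illustration, not the author's build tool; rom is the full iNES image, zip_off is where the ZIP data was dropped into the PRG padding, eocd is the position of the End of Central Directory Record within rom, and tail_len is the 8,198 bytes of interrupt vectors plus CHR ROM):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | p[1] << 8); }
static uint32_t rd32(const uint8_t *p) { return p[0] | p[1] << 8 | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24; }
static void wr16(uint8_t *p, uint16_t v) { p[0] = v; p[1] = v >> 8; }
static void wr32(uint8_t *p, uint32_t v) { p[0] = v; p[1] = v >> 8; p[2] = v >> 16; p[3] = v >> 24; }

void fixup_zip(uint8_t *rom, size_t eocd, uint32_t zip_off, uint16_t tail_len)
{
    uint32_t cd = rd32(rom + eocd + 16);            /* old Central Directory offset    */
    wr32(rom + eocd + 16, cd + zip_off);            /* now relative to the whole ROM   */
    wr16(rom + eocd + 20, tail_len);                /* vectors + CHR ROM as "comment"  */

    uint8_t *hdr = rom + zip_off + cd;              /* walk the Central Directory      */
    while (memcmp(hdr, "PK\x01\x02", 4) == 0) {
        wr32(hdr + 42, rd32(hdr + 42) + zip_off);   /* fix each Local Header Offset    */
        hdr += 46 + rd16(hdr + 28) + rd16(hdr + 30) + rd16(hdr + 32);
    }
}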

PRG Interrupt Vectors and CHR ROM data remain unmodified, so they are still the same as before.

Because the iNES header is the same, the PRG and CHR ROM are still the correct size, and none of the required PRG ROM data or any of the CHR ROM data were modified, this file is still a completely standard NES ROM. The NES ROM file does not change in size, so there is no extra “garbage data” outside of the NES ROM file as far as NES emulators are concerned.

With the ZIP file offsets being updated and all data after the ZIP file data being declared as a ZIP file comment, this file is a standard ZIP file that your ZIP file extractor will be able to properly extract.2

NES Cartridge

The PRG and CHR ROMs of this polyglot file can be burned onto EPROMs and put on an NROM-128 board to make a completely functioning NES cartridge.

Ripping the NES ROM from the cartridge and turning it back into an iNES file will result in the file being a NES + ZIP polyglot file again. It is therefore possible to sneak a secret ZIP file to someone via a working NES cartridge.

Don’t be surprised if that crappy bootleg copy of Tetris I give you is also a ZIP file containing secret documents!

Source Code

This NES + ZIP polyglot file is a quine.3 Unzip it and the extracted files will be its source code.4 Compile that source code and you’ll create another NES + ZIP polyglot file quine that can then be unzipped to get its source code.

I was able to make this file contain its own source code because the source code itself was quite small and highly compressible in a ZIP file.

image

18:05 House of Fun; or, Heap Exploitation against GlibC in 2018

by Yannay Livneh

GlibC’s malloc implementation is a gift that keeps on giving. Every now and then someone finds a way to turn it on its head and execute arbitrary code. Today is one of those days. Today, dear neighbor, you will see yet another path to code execution. Today you will see how you can overwrite arbitrary memory addresses—yes, more than one!—with a pointer to your data. Today you will see the perfect gadget that will make the code of your choosing execute. Welcome to the House of Fun.

The History We Were Taught

The very first heap exploitation techniques were publicly introduced in 2001. Two papers in Phrack 57—Vudo Malloc Tricks0 and Once Upon a Free1—explained how corrupted heap chunks can lead to full compromise. They presented methods that abused the linked list structure of the heap in order to gain some write primitives. The best known technique introduced in these papers is the unlink technique, attributed to Solar Designer. It is quite well known today, but let’s explain how it works anyway. In a nutshell, deletion of a controlled node from a linked list leads to a write-what-where primitive.

Consider this simple implementation of list deletion:

image

This is roughly equivalent to:

image
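
The two missing listings can be sketched together as follows (a paraphrase in the spirit of the classic dlmalloc unlink macro, not the original listing); the comments show the “roughly equivalent” raw writes:

#include <stddef.h>

struct chunk { size_t prev_size, size; struct chunk *fd, *bk; };

void unlink_chunk(struct chunk *p)      /* remove p from its doubly linked list */
{
    struct chunk *fd = p->fd;
    struct chunk *bk = p->bk;
    fd->bk = bk;    /* i.e. *(p->fd + offsetof(chunk, bk)) = p->bk */
    bk->fd = fd;    /* i.e. *(p->bk + offsetof(chunk, fd)) = p->fd */
}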

So, an attacker in control of fd and bk can write the value of bk to (somewhat after) fd and vice versa.

This is why, in late 2004, a series of patches to GNU libc malloc implemented over a dozen mandatory integrity assertions, effectively rendering the existing techniques obsolete. If the previous sentence sounds familiar, this is not a coincidence; it is a quote from the famous Malloc Maleficarum.2

This paper was published in 2005 and was immediately regarded as a classic. It described five new heap exploitation techniques. Some, like previous techniques, exploited the structure of the heap, but others introduced a new capability: allocating arbitrary memory. These newer techniques exploited the fact that malloc is a memory allocator, returning memory for the caller to use. By corrupting various fields used by the allocator to decide which memory to allocate (the chunk’s size and pointers to subsequent chunks), exploiters tricked the allocator to return addresses in the stack, .got, or other places.

Over time, many more integrity checks were added to glibc. These checks try to make sure the size of a chunk makes sense before allocating it to the user, and that it’s in a reasonable memory region. It is not perfect, but it helped to some degree.

Then, hackers came up with a new idea. While allocating memory anywhere in the process’s virtual space is a very strong primitive, many times it’s sufficient to just corrupt other data on the heap, in neighboring chunks. By corrupting the size field or even just the flags in the size field, it’s possible to corrupt the chunk in such a way that makes the heap allocate a chunk which overlaps another chunk with data the exploiter wants to control. A couple of techniques which demonstrate it were published in recent years, most notably Chris Evans’ The poisoned NUL byte, 2014 edition.3

To mitigate against these kinds of attacks, another check was added. The size of a freed chunk is written twice, once in the beginning of the chunk and again at its end. When the allocator makes a decision based on the chunk’s size, it verifies that both sizes agree. This isn’t bulletproof, but it helps.

The most up-to-date repository of currently usable techniques is maintained by the Shellphish CTF team in their how2heap GitHub repository.4

A Brave New Primitive

Sometimes, in order to take two steps forward we must first take one step back. Let’s travel back in time and examine the structure of the heap like they did in 2001. The heap internally stores chunks in doubly linked lists. We already discussed list deletion, how it can be used for exploitation, and the fact it’s been mitigated for many years. But list deletion (unlinking) is not the only list operation! There is another operation: insertion.

Consider the following code:

image

The line before the last roughly translates to:

image
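
A sketch of the generic insertion the text describes (my own paraphrase, not glibc's code); the comment on the second-to-last assignment shows the translation mentioned above:

struct node { struct node *fd, *bk; };

void list_insert(struct node *prev, struct node *fresh)   /* insert fresh after prev */
{
    struct node *next = prev->fd;
    fresh->fd = next;
    fresh->bk = prev;
    next->bk  = fresh;   /* i.e. *(prev->fd + offsetof(node, bk)) = fresh */
    prev->fd  = fresh;
}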

An attacker in control of prev->fd can write the inserted node address wherever she desires!

Having this control is quite common in the case of heap-based corruptions. Using a Use-After-Free or a Heap-Based-Buffer-Overflow, the attacker commonly controls the chunk’s fd (forward pointer). Note also that the data written is not arbitrary. It’s an address of the inserted node, a chunk on the heap which may be allocated back to the user, or might still be in the user’s control! So this is not only a write-where primitive, it’s more of a write-pointer-to-what-where.

Looking at malloc’s code, this primitive can be quite easily employed. Insertion into lists happens when a freed chunk is inserted into a large bin. But more about this later. Before diving into the details of how to use it, there are some issues we need to clear first.

When I started writing this paper, after understanding the categorization of techniques I described earlier, an annoying doubt popped into my mind. The primitive I found in malloc’s code is very much connected to the old unlink primitive; they are literally counterparts. How come no one had found and published it in the early years of heap exploitation? And if someone had, how come neither I nor any of my colleagues I discussed it with had ever heard of it?

So I sat down and read the early papers, the ones from 2001 that everyone says contain only obsolete and mitigated techniques. And then I learned, lo and behold, it had been found many years ago!

History of the Forgotten Frontlink

The list insertion primitive described in the previous section is in fact none other than the frontlink technique. This technique is the second one described in Vudo Malloc Tricks, the very first paper about heap exploitation from 2001. (Part 3.6.2.)

In the paper, the author says it is “less flexible and more difficult to implement” in comparison to the unlink technique. It is far inferior in a world with no NX bit (DEP), as it writes a value the attacker does not fully control, whereas the unlink technique enables the attacker to control the written data (as long as it’s a writable address). I believe that for this reason the frontlink method was less popular. And so, it has almost been completely forgotten.

In 2002, malloc was re-written as an adaptation of Doug Lea’s malloc-2.7.0.c. This re-write refactored the code and removed the frontlink macro, but basically does the same thing upon list insertion. From this year onward, there is no way to attribute the name frontlink with the code the technique is exploiting.

In 2003, William Robertson, et al., announced a new system that “detects and prevents all heap overflow exploits” by using some kind of cookie-based detection. They also announced it in the security focus mailing list.5 One of the more interesting responses to this announcement was from Stefan Esser, who described his private mitigation for the same problem. This solution is what we now know as “safe unlinking.”

Robertson says that it only prevents unlink attacks, to which Esser responds:

I know that modifying unlink does not protect against frontlink attacks. But most heap exploiters do not even know that there is anything else than unlink.

Following this correspondence, in late 2004, the safe unlinking mitigation was added to malloc’s code.

In 2005, the Malloc Maleficarum is published. Here is the first paragraph from the paper:

In late 2001, “Vudo Malloc Tricks” and “Once Upon A free()” defined the exploitation of overflowed dynamic memory chunks on Linux. In late 2004, a series of patches to GNU libc malloc implemented over a dozen mandatory integrity assertions, effectively rendering the existing techniques obsolete.

Every paper that followed it and accounted for the history of heap exploits has the same narrative. In Malloc Des-Maleficarum,6 Blackeng states:

The skills published in the first one of the articles, showed:

— unlink () method.

— frontlink () method.

. . . these methods were applicable until the year 2004, when the GLIBC library was patched so those methods did not work.

And in Yet Another Free Exploitation Technique,7 Huku states:

The idea was then adopted by glibc-2.3.5 along with other sanity checks thus rendering the unlink() and frontlink() techniques useless.

I couldn’t find any evidence that supports these assertions. On the contrary, I managed to successfully employ the frontlink technique on various platforms from different years, including Fedora Core 4 from early 2005 with glibc 2.3.5 installed. The code is presented later in this paper.

In conclusion, the frontlink technique never gained popularity. There is no way to link the name frontlink to any existing code, and all relevant papers claim it’s useless and a waste of time.

However, it works in practice today on every machine I checked.

Back To Completing Exploitation

At this point you might think this write-pointer-to-what-where primitive is nice, but there is still a lot of work to do to get control over a program’s flow. We need to find a suitable pointer to overwrite, one which points to a struct that contains function pointers. Then we can trigger this indirect function call. Surprisingly, this turns out to be rather easy. Glibc itself has some pointers which fit perfectly for this primitive. Among some other pointers, the most suitable for our needs is the _dl_open_hook. This hook is used when loading a new library. In this process, if this hook is not NULL, _dl_open_hook->dlopen_mode() is invoked which can very much be in the attacker’s control!

As for the requirement of loading a library, fear not! The allocator itself does it for us when an integrity check fails. So all an attacker needs to do is to fail an integrity check after overwriting _dl_open_hook and enjoy her shell.8

That’s it for theory. Let’s see how we can make it happen in the actual implementation!

The Gory Internals of Malloc

First, a short recollection of the allocator’s internals.

GlibC malloc handles its freed chunks in bins. A bin is a linked list of chunks which share some attributes. There are four types of bins: fast, unsorted, small, and large. The large bins contain freed chunks of a specific size-range, sorted by size. Putting a chunk in a large bin happens only after sorting it, extracting it from the unsorted bin and putting it in the appropriate small or large bin. The sorting process happens when a user requests an allocation which can’t be satisfied by the fast or small bins. When such a request is made, the allocator iterates over the chunks in the unsorted bin and puts each chunk where it belongs. After sorting the unsorted bin, the allocator applies a best-fit algorithm and tries to find the smallest freed chunk that can satisfy the user’s request. As a large bin contains chunks of multiple sizes, every chunk in the bin not only points to the previous and next chunk (bk and fd) in the bin but also points to the next and previous chunks which are smaller and bigger than itself (bk_nextsize and fd_nextsize). Chunks in a large bin are sorted by size, and these pointers speed up the search for the best fit chunk.

Figure 18.47 illustrates a large bin with seven chunks of three sizes. Page 675 contains the relevant code from _int_malloc.9

Here, the size variable is the size of the victim chunk which is removed from the unsorted bin. The logic in lines 3566–3620 determines the bck and fwd chunks between which the victim should be inserted. Then, in lines 3622–3626, the block is inserted into the list.

In the case that the victim chunk belongs in a small bin, bck and fwd are trivial. As all chunks in a small bin have the same size, it does not matter where in the bin it is inserted, so bck is the head of the bin and fwd is the first chunk in the bin (lines 3568–3573). However, if the chunk belongs in a large bin, as there are chunks of various sizes in the bin, it must be inserted in the right place to keep the bin sorted.

If the large bin is not empty (line 3581) the code iterates over the chunks in the bin with a decreasing size until it finds the first chunk that is not smaller than the victim chunk (lines 3599–3603). Now, if this chunk is of a size that already exists in the bin, there is no need to insert it into the nextsize list, so just put it after the current chunk (lines 3605–3607). If, on the other hand, it is of a new size, it needs to be inserted into the nextsize list (lines 3608–3614). Either way, eventually set the bck accordingly (line 3615) and continue to the insertion of the victim chunk into the linked list (lines 3622–3626).

// Extract of _int_malloc.c
3504   while ((victim = unsorted_chunks (av)->bk) != unsorted_chunks (av))
3505     {
3506       bck = victim->bk;
. . .
3511       size = chunksize (victim);
. . .
3549       /* remove from unsorted list */
3550       unsorted_chunks (av)->bk = bck;
3551       bck->fd = unsorted_chunks (av);
3552
3553       /* Take now instead of binning if exact fit */
3554
3555       if (size == nb)
3556         {
. . .
3561           void *p = chunk2mem (victim);
3562           alloc_perturb (p, bytes);
3563           return p;
3564         }
3565
3566       /* place chunk in bin */
3567
3568       if (in_smallbin_range (size))
3569         {
3570           victim_index = smallbin_index (size);
3571           bck = bin_at (av, victim_index);
3572           fwd = bck->fd;
3573         }
3574       else
3575         {
3576           victim_index = largebin_index (size);
3577           bck = bin_at (av, victim_index);
3578           fwd = bck->fd;
3579
3580           /* maintain large bins in sorted order */
3581           if (fwd != bck)
3582             {
3583               /* Or with inuse bit to speed comparisons */
3584               size |= PREV_INUSE;
3585               /* if smaller than smallest, bypass loop below */
3586               assert ((bck->bk->size & NON_MAIN_ARENA) == 0);
3587               if ((unsigned long) (size) < (unsigned long) (bck->bk->size))
3588                 {
3589                   fwd = bck;
3590                   bck = bck->bk;
3591
3592                   victim->fd_nextsize = fwd->fd;
3593                   victim->bk_nextsize = fwd->fd->bk_nextsize;
3594                   fwd->fd->bk_nextsize =
                         victim->bk_nextsize->fd_nextsize = victim;
3596                 } else {
3598                   assert ((fwd->size & NON_MAIN_ARENA) == 0);
3599                   while ((unsigned long) size < fwd->size)
3600                     {
3601                       fwd = fwd->fd_nextsize;
3602                       assert ((fwd->size & NON_MAIN_ARENA) == 0);
3603                     }
3604
3605                   if ((unsigned long) size == (unsigned long) fwd->size)
3606                     /* Always insert in the second position. */
3607                     fwd = fwd->fd;
3608                   else
3609                     {
3610                       victim->fd_nextsize = fwd;
3611                       victim->bk_nextsize = fwd->bk_nextsize;
3612                       fwd->bk_nextsize = victim;
3613                       victim->bk_nextsize->fd_nextsize = victim;
3614                     }
3615                   bck = fwd->bk;
3616                 }
3617             }
3618           else
3619             victim->fd_nextsize = victim->bk_nextsize = victim;
3620         }
3621
3622       mark_bin (av, victim_index);
3623       victim->bk = bck;
3624       victim->fd = fwd;
3625       fwd->bk = victim;
3626       bck->fd = victim;
. . .
3631     }

The Frontlink Technique in 2018

So, remembering our nice theories, we need to consider how can we manipulate the list insertion to our needs. How can we control the fwd and bck pointers?

When the victim chunk belongs in a small bin, these values are hard to control. The bck is the address of the bin, an address in the globals section of glibc, and fwd is bck->fd, which means it’s also a value stored in glibc’s globals section. A simple heap vulnerability such as a Use-After-Free or Buffer Overflow does not let us corrupt this value in any immediate way, as these vulnerabilities usually corrupt data on the heap. (A different mapping entirely from glibc.) The fast bins and unsorted bin are equally unhelpful, as insertion to these bins is always done at the head of the list.

So our last option to consider is using the large bins. Here we see that some data from the chunks is used. The loop which iterates over the chunks in a large bin uses the fd_nextsize pointer to set the value of fwd and the value of bck is derived from this pointer as well. As the chunk pointed by fwd must meet our size requirement and the bck pointer is derived from it, we better let it point to a real chunk in our control and only corrupt the bk of this chunk. Corrupting the bk means that line 3626 writes the address of the victim chunk to a location in our control. Even better, if the victim chunk is of a new size that does not previously exist in the bin, lines 3611–3612 insert this chunk to the nextsize list and write its address to fwd->bk_nextsize->fd_nextsize. This means we can write the address of the victim chunk to another location. Two writes for one corruption!

In summary, if we corrupt a bk and bk_nextsize of a chunk in the large bin and then cause malloc to insert another chunk with a bigger size, this will overwrite the addresses we put in bk and bk_nextsize with the address of the freed chunk.

The Frontlink Technique in 2001

For the sake of historical justice, the following is the explanation of the frontlink technique concept from Vudo Malloc Tricks.10

This is the code of list insertion in the old implementation:

#define frontlink( A, P, S, IDX, BK, FD ) {                      \
        if ( S < MAX_SMALLBIN_SIZE ) {                           \
            IDX = smallbin_index( S );                           \
            mark_binblock( A, IDX );                             \
            BK = bin_at( A, IDX );                               \
            FD = BK->fd;                                         \
            P->bk = BK;                                          \
            P->fd = FD;                                          \
            FD->bk = BK->fd = P;                                 \
[1]     } else {                                                 \
            IDX = bin_index( S );                                \
            BK = bin_at( A, IDX );                               \
            FD = BK->fd;                                         \
            if ( FD == BK ) {                                    \
                mark_binblock( A, IDX );                         \
            } else {                                             \
[2]             while ( FD != BK && S < chunksize( FD ) ) {      \
[3]                 FD = FD->fd;                                 \
                }                                                \
[4]             BK = FD->bk;                                     \
            }                                                    \
            P->bk = BK;                                          \
            P->fd = FD;                                          \
[5]         FD->bk = BK->fd = P;                                 \
        }                                                        \
}

And this is the description:

If the free chunk P processed by frontlink() is not a small chunk, the code at line 1 is executed, and the proper doubly-linked list of free chunks is traversed (at line 2) until the place where P should be inserted is found. If the attacker managed to overwrite the forward pointer of one of the traversed chunks (read at line 3) with the address of a carefully crafted fake chunk, they could trick frontlink() into leaving the loop (2) while FD points to this fake chunk. Next the back pointer BK of that fake chunk would be read (at line 4) and the integer located at BK plus 8 bytes (8 is the offset of the fd field within a boundary tag) would be overwritten with the address of the chunk P (at line 5).

image

Figure 18.47: A Large Bin with Seven Chunks of Three Sizes

Bear in mind the implementation was somewhat different. The P referred to is the equivalent of our victim pointer, and there was no secondary nextsize list.

The Universal Frontlink PoC

In theory, both editions are the very same technique, and it seems that what worked in 2001 still works in 2018. This means we can write one PoC for all versions of glibc that were ever released!

Please, dear neighbor, compile the code from page 682 and execute it on any machine with any version of glibc and see if it works. I have tried it on Fedora Core 4 32-bit with glibc-2.3.5, Fedora 10 32-bit live, Fedora 11 32-bit and Ubuntu 16.04 and 17.10 64-bit. It worked on all of them.

We already covered the background of how the overwrite happens, now we have just a few small details to cover in order to understand this PoC in full.

Chunks within malloc are managed in a struct called malloc_chunk which I copied to the PoC. When allocating a chunk to the user, malloc uses only the size field and therefore the first byte the user can use coincides with the fd field. To get the pointer to the malloc_chunk, we use mem2chunk which subtracts the offset of the fd field in the malloc_chunk struct from the allocated pointer (also copied from glibc).

The prev_size of a chunk resides in the last sizeof(size_t) bytes of the previous chunk. It may only be accessed if the previous chunk is not allocated. But if it is allocated, the user may write whatever she wants there. The PoC writes the string “YES” to this exact place.

Another small detail is the allocation of ALLOCATION_BIG sizes. These allocations have two roles: first, they make sure that the chunks are not coalesced (merged) and thus keep their sizes even when freed, and second, they force the allocator to sort the unsorted bin when there is no free chunk ready to serve the request in a normal bin.

Now, the crux of the exploit is exactly as in theory. Allocate two large chunks, p1 and p2. Free and corrupt p2, which is in the large bin. Then free and insert p1 into the bin. This insertion overwrites the victim pointer with mem2chunk(p1), which points to the last sizeof(size_t) bytes of p0.11

image
image

Control PC or GTFO

Now that we have frontlink covered, and we know how to overwrite a pointer to data in our control, it’s time to control the flow. The best victim to overwrite is _dl_open_hook. This pointer in glibc, when not NULL, is used to alter the behavior of dlopen, dlsym, and dlclose. If set, an invocation of any of these functions will use a callback in the struct dl_open_hook pointed to by _dl_open_hook. It’s a very simple structure.

image
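For reference, the structure looks roughly like this (a sketch based on glibc’s internal headers; treat the exact prototypes as an approximation):

struct dl_open_hook {
    void *(*dlopen_mode)(const char *name, int mode);
    void *(*dlsym)(void *map, const char *name);
    int   (*dlclose)(void *map);
};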

When dlopen is invoked, it actually calls dlopen_mode, which has the following implementation:

image

Thus, controlling the data pointed to by _dl_open_hook and being able to trigger a call to dlopen is sufficient for hijacking a program’s flow.

Now, it’s time for some magic. dlopen is not a very common function to use. Most binaries know at compile time which libraries they are going to use, or at least resolve them during program initialization, and don’t use dlopen during the program’s normal operation. So causing a dlopen invocation may seem far-fetched in many circumstances. Fortunately, we are in a very specific scenario here: a heap corruption. By default, when the heap code fails an integrity check, it calls malloc_printerr, which prints the error to the user using __libc_message. After printing the error and before calling abort, it prints a backtrace and the memory maps. The function generating the backtrace and memory maps is backtrace_and_maps, which calls the architecture-specific function __backtrace. On x86_64, this function calls a static init function which tries to dlopen libgcc_s.so.1.
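You can watch this abort path fire without any exploit at all; a minimal corruption trigger might look like the following (a sketch, assuming one of the glibc versions discussed here, where the abort path still prints a backtrace; the exact error string varies):

#include <stdlib.h>
#include <string.h>

int main(void) {
    char *p = malloc(0x500);      /* large enough to skip fastbins/tcache */
    memset(p + 0x500, 'A', 0x10); /* clobber the next chunk's header */
    free(p);                      /* integrity check fails: malloc_printerr
                                     -> __libc_message -> backtrace
                                     -> dlopen of libgcc_s.so.1 on x86_64 */
    return 0;
}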

So if we manage to fail an integrity check, we can trigger dlopen, which in turn will use the data pointed to by _dl_open_hook to change the program’s flow. Win!

Madness? Exploit 300!

Now that we know everything there is to know, it’s time to use this technique in the real world. For PoC purposes, we solve the 300 CTF challenge from the last Chaos Communication Congress, 34c3.

Here is the source code of the challenge, courtesy of its author, Stephen Röttger, a.k.a. Tsuro:

#include <unistd.h>
#include <string.h>
#include <err.h>
#include <stdlib.h>

#define ALLOC_CNT 10

char *allocs[ALLOC_CNT] = {0};

void myputs(const char *s) {
  write(1, s, strlen(s));
  write(1, "\n", 1);
}

int read_int() {
  char buf[16] = "";
  ssize_t cnt = read(0, buf, sizeof(buf) - 1);

  if (cnt <= 0) {
    err(1, "read");
  }
  buf[cnt] = 0;
  return atoi(buf);
}

void menu() {
  myputs("1) alloc");
  myputs("2) write");
  myputs("3) print");
  myputs("4) free");
}

void alloc_it(int slot) {
  allocs[slot] = malloc(0x300);
}

void write_it(int slot) {
  read(0, allocs[slot], 0x300);
}

void print_it(int slot) {
  myputs(allocs[slot]);
}

void free_it(int slot) {
  free(allocs[slot]);
}

int main(int argc, char **argv) {
  while (1) {
    menu();
    int choice = read_int();
    myputs("slot? (0-9)");
    int slot = read_int();
    if (slot < 0 || slot > 9) {
      exit(0);
    }
    switch (choice) {
      case 1:
        alloc_it(slot);
        break;
      case 2:
        write_it(slot);
        break;
      case 3:
        print_it(slot);
        break;
      case 4:
        free_it(slot);
        break;
      default:
        exit(0);
    }
  }
  return 0;
}

The purpose of the challenge is to execute arbitrary shellcode on a remote service executing the given code. We see that in the globals section there is an array of ten pointers. As clients, we have the following options:

  1. Allocate a chunk of size 0x300 and assign its address to any of the pointers in the array.
  2. Write 0x300 bytes to a chunk pointed by a pointer in the array.
  3. Print the contents of any chunk pointed in the array.
  4. Free any pointer in the array.
  5. Exit.

The vulnerability here is straightforward: Use-After-Free. As no code ever zeroes the pointers in the array, the chunks pointed to by them remain accessible after free. It is also possible to double-free a pointer.

A solution to a challenge always starts with some boilerplate: functions that invoke specific functions in the remote target, plus some convenience helpers. We use the brilliant Pwn library for communication with the vulnerable process, conversion of values, parsing of ELF files, and probably some other things.12

This code is quite self-explanatory. alloc_it, print_it, write_it, and free_it invoke their corresponding functions in the remote target. The chunk function receives an offset and a dictionary of malloc_chunk fields and their values, and returns a dictionary of the offsets to which the values should be written. For example, chunk(offset=0x20, bk=0xdeadbeef) returns {56: 3735928559}, as the offset of the bk field is 0x18 and 0x18 + 0x20 is 56 (and 0xdeadbeef is 3735928559). The chunk function is used in combination with pwn’s fit function, which writes specific values at specific offsets.13

Now, the first thing we need in order to solve this challenge is the base address of libc, so we can derive the locations of various data in libc. We must also know the address of the heap, so we can craft pointers to our controlled data.

image
image

Figure 18.48

As we can print chunks after freeing them, leaking these addresses is quite easy. By freeing two non-consecutive chunks and reading their fd pointers (the field which coincides with the pointer returned to the caller when a chunk is allocated), we can read the address of the unsorted bin, because the fd of the first freed chunk points to the bin itself inside main_arena. And we can read the address of that chunk from the fd pointer of the second freed chunk, because it points back to the first chunk in the bin. See Figure 18.48. We can quickly test this arrangement in Python.

image
image
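If you want to see the same arrangement without the Pwn setup, a few lines of local C reproduce it (a sketch; the sizes are above the tcache range so it behaves the same on newer glibc, and the guard allocations prevent coalescing):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *a  = malloc(0x500);
    void *g1 = malloc(0x20);   /* guard: keep a and b from coalescing */
    void *b  = malloc(0x500);
    void *g2 = malloc(0x20);   /* guard: keep b away from the top chunk */
    (void)g1; (void)g2;

    free(a);
    free(b);

    /* a's fd now points at the unsorted bin inside main_arena (a libc
       address); b's fd points at chunk a (a heap address). */
    printf("libc leak: %p\n", *(void **)a);
    printf("heap leak: %p\n", *(void **)b);
    return 0;
}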

Now that we know the addresses of libc and the heap, it’s time to craft our frontlink attack. First, we need a chunk we control in the large bin. Unfortunately, the challenge’s constraints do not let us free a chunk with a controlled size. However, we can control a freed chunk in the unsorted bin, and as chunks inserted into the large bin are first removed from the unsorted bin, this provides us with a primitive sufficient for our needs.

We overwrite the bk of a chunk in the unsorted bin.

image
image

Here we allocate two chunks and free the first, which inserts it into the unsorted bin. Then we overwrite the bk pointer of the chunk which starts 0x10 before the allocation of slot 0 (offset=-0x10), i.e., the chunk in the unsorted bin. When we make another allocation, the chunk in the unsorted bin is removed and returned to the caller, and the bk pointer of the unsorted bin is updated to point to the bk of the removed chunk.

Now that the bk of the unsorted bin points into the controlled region in slot 1, we forge a list there with a fake chunk of size 0x400, a size that belongs in the large bin, and another chunk of size 0x310. When we request another allocation of size 0x300, the first chunk is sorted and inserted into the large bin, and the second chunk is immediately returned to the caller.

image
image
image

Perfect! We have a chunk under our control in the large bin. It’s time to corrupt this chunk! We point the bk and bk_nextsize of this chunk just before _dl_open_hook and put some more forged chunks in the unsorted bin. The first chunk is the one whose address will be written to _dl_open_hook, so it must have a size bigger than 0x400 yet still belong in the same bin. The next chunk is of size 0x310, so it is returned to the caller after a request for an allocation of 0x300, once the 0x410 chunk has been sorted into the large bin and the attack performed.

image
image

This allocation overwrites _dl_open_hook with the address of controlled+0x60, the address of the 0x410 chunk.

Now it’s time to hijack the flow. We overwrite offset 0x60 of the controlled chunk with a one_gadget, an address which, when jumped to, executes exec("/bin/bash"). We also write an easily detectable bad size to the next chunk in the unsorted bin, then make an allocation. The allocator detects the bad size and tries to abort. The abort process invokes _dl_open_hook->dlopen_mode, which we set to the one_gadget, and we get a shell!

image

Figure 18.49: This dumps the flag!

See Figure 18.49 for the code and 18.50 for the runlog.

Closing Words

Glibc malloc’s insecurity is a never-ending story. The inline-metadata approach keeps presenting new opportunities for exploiters. (Take a look at the new tcache thing in version 2.26.) And even the old ones, as we learned today, are not mitigated. They are just there, floating around, waiting for any UAF or overflow. Maybe it’s time to change the design of libc altogether.

Another important lesson we learned is to always check the details. Reading the source or disassembly yourself takes courage and persistence, but fortune prefers the brave. Double check the mitigations. Re-read the old materials. Some things that at the time were considered useless and forgotten may prove valuable in different situations. The past, like the future, holds many surprises.

image

Figure 18.50: Capturing the Flag

image

18:06 RelroS: Read Only Relocations for Static ELF Executables

by Ryan “ElfMaster” O’Neill

This paper is going to shed some light on the more obscure security weaknesses of statically linked executables: the glibc initialization process, what the attack surface looks like, and why the security mitigation known as RELRO is just as important for static executables as it is for dynamic ones. We will discuss some solutions and explore the experimental software that I have presented as a solution for enabling RELRO on binaries that are statically linked, usually to avoid complex dependency issues. We will also take a look at ASLR and innovate a solution for making it work on statically linked executables.

Standard ELF Security Mitigations

Over the years there have been some innovative and progressive overhauls incorporated into glibc, the linker, and the dynamic linker in order to make certain security mitigations possible. First there was Pipacs, who decided that ELF programs that would otherwise be ET_EXEC (executables) could benefit from becoming ET_DYN objects, which are shared libraries. If a PT_INTERP segment is added to an ET_DYN object to specify an interpreter, then ET_DYN objects can be linked as executable programs that are position-independent executables (“-fPIC -pie”) and linked with an address space that begins at 0x0. This type of executable has no real absolute address space until it has been relocated into a randomized address space by the kernel. A PIE executable uses IP-relative addressing mode so that it can avoid using absolute addresses; consequently, a program that is an ELF ET_DYN can make full use of ASLR.

ASLR can work with ET_EXEC binaries under PaX using a technique called VMA mirroring, but I can’t say for sure whether it is still supported, and it was never the preferred method.0

When an executable runs privileged, such as sshd, it would ideally be compiled and linked into a PIE executable which allows for runtime relocation to a random address space, thus hardening the attack surface into far more hostile playing grounds.

Try running readelf -e /usr/sbin/sshd | grep DYN and you will see that it is (most likely) built this way.

Somewhere along the way came RELRO (read-only relocations), a security mitigation technique that has two modes: partial and full. By default only partial RELRO is enforced, because full RELRO requires strict linking, which gives less efficient program loading time due to the dynamic linker binding and relocating immediately (strict) rather than lazily. But full RELRO can be very powerful for hardening the attack surface by marking specific areas in the data segment as read-only: specifically the .init_array, .fini_array, .jcr, .got, and .got.plt sections. The .got.plt section and .fini_array are the most frequent targets for attackers, since these contain function pointers into shared library routines and destructor routines, respectively.

What about static linking?

Developers like statically linked executables because they are easier to manage, debug, and ship; everything is self-contained. The chances of a user running into issues with a statically linked executable are far smaller than with a dynamically linked executable, which requires dependencies, sometimes hundreds of them. I’ve been aware of this for some time, but I was remiss in thinking that statically linked executables don’t suffer from the same ELF security problems as dynamically linked executables! To my surprise, a statically linked executable is vulnerable to many of the same attacks as a dynamically linked executable, including shared library injection, .dtors (.fini_array) poisoning, and PLT/GOT poisoning.

image

This might surprise you. Shouldn’t a static executable be immune to relocation table tricks? Let’s start with shared library injection. A shared library can be injected into the process address space using ptrace-injected shellcode for malware purposes. However, if full RELRO is enabled and coupled with PaX mprotect restrictions, this becomes impossible, since the PaX feature prevents the default behavior of allowing ptrace to write to read-only segments, and full RELRO ensures read-only protections on the relevant data segment areas. Now, from an exploitation standpoint, this becomes more interesting when you realize that the PLT/GOT is still a thing in statically linked executables. We will discuss it shortly; in the meantime, just know that the PLT/GOT contains function pointers to libc routines. The .init_array and .fini_array function pointers point to initialization and destructor routines, respectively. Specifically, .dtors has been used to achieve code execution in many types of exploits, although I doubt its abuse is as ubiquitous as that of the .got.plt section itself. Let’s take a tour of a statically linked executable and analyze the finer points of the security mitigations, both present and absent, that should be considered before choosing to statically link a program that is sensitive or runs privileged.

Demystifying the Ambiguous

The static binary in Figure 18.51 was built with full RELRO flags, gcc -static -Wl,-z,relro,-z,now, and even a savvy reverser might be fooled into thinking that RELRO is in fact enabled. Partial RELRO and full RELRO are both incompatible with statically compiled binaries at this point in time, because the dynamic linker is responsible for re-mapping and mprotecting the common attack points within the data segment, such as the PLT/GOT; as shown in Figure 18.51, there is no PT_INTERP to specify an interpreter, nor would we expect to see one in a statically linked binary. The default linker script is what directs the linker to create the GNU_RELRO segment, even though it serves no current purpose.

Notice that the GNU_RELRO segment points to the beginning of the data segment, which is usually where you would want the dynamic linker to mprotect n bytes as read-only. However, we really don’t want .tdata marked as read-only, as that would prevent multi-threaded applications from working.

So this is just another indication that the statically built binary does not actually have any plans to enable RELRO on itself. Alas, it really should, as the PLT/GOT and other areas such as .fini_array are as vulnerable as ever. A common tool named checksec.sh uses the GNU_RELRO segment as one of the markers to denote whether or not RELRO is enabled on a binary,1 and in the case of statically compiled binaries it will report that partial-relro is enabled, because it cannot find a DT_BIND_NOW dynamic segment flag since there are no dynamic segments in statically linked executables. Let’s take a quick tour through the init code of a statically compiled executable.

From the output in Figure 18.51, you will notice that there is a .got and .got.plt section within the data segment, and to enable full RELRO these are normally merged into one section but for our purposes that is not necessary since my tool marks both of them as read-only.

# http://www.trapkit.de/tools/checksec.html

image

Figure 18.51: RELRO is Broken for Static Executables

image

Figure 18.52: FTracing a Static ELF

Overview of Statically Linked ELF

A high level overview can be seen with the ftrace tool, shown in Figure 18.52.2

Most of the heavy lifting that would normally take place in the dynamic linker is performed by the generic_start_main() function, which, in addition to other tasks, also performs various relocations and fixups to the many sections in the data segment, including the .got.plt section. You can set up a few watchpoints to observe that early on there is a function that inquires about CPU information, such as the CPU cache size, which allows glibc to intelligently determine which version of a given function, such as strcpy(), should be used.

In Figure 18.53, we set watchpoints on the GOT entries for some shared library routines and notice that the generic_start_main() function serves, in some sense, much as a dynamic linker does. Its job is largely to perform relocations and fixups.

So in both cases the GOT entry for a given libc function had its PLT stub address replaced with the most efficient version of the function, given the CPU cache size looked up by certain glibc init code (i.e., __cache_sysconf()). Since this is a somewhat high-level overview, I will not go into every function, but the important thing is to see that the PLT/GOT is updated with a libc function and can be poisoned, especially since RELRO is not compatible with statically linked executables. This leads us into the solution, or possible solutions, including our very own experimental prototype named relros, which uses some ELF trickery to inject code that is called by a trampoline placed in a very specific spot. It is necessary to wait until generic_start_main() has finished all of its writes to the memory areas that we intend to mark as read-only before we invoke our enable_relro() routine.

image

Figure 18.53: Exploring a Static ELF with GDB

A Second Implementation

My first prototype had to be written quickly due to time constraints. This quick implementation uses an injection technique that marks the PT_NOTE program header as PT_LOAD, thereby effectively creating a second text segment.

In the generic_start_main() function (Figure 18.54) there is a very specific place that we must patch, and it requires exactly a five-byte patch (call <imm>). As immediate calls do not work when transferring execution to a different segment, an lcall (far call) is needed, which is considerably longer than five bytes. The solution is to switch to a reverse text infection, which will keep the enable_relro() code within the one and only code segment. For now, though, we are being crude and patching the code that calls main().

Currently we are overwriting six bytes at 0x405b54 with a push $enable_relro; ret pair of instructions, shown in Figure 18.55. Our enable_relro() function mprotects the part of the data segment denoted by PT_RELRO as read-only, then calls main(), then sys_exits. This is flawed, since none of the deinitialization routines get called. So what is the solution?
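For the curious, the injected routine boils down to something like this (a minimal sketch, assuming relro_start, relro_len, and main_addr were baked in from the PT_GNU_RELRO header and symbol table at instrumentation time; the real thing is injected position-independent code, not ordinary C):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

void enable_relro(uintptr_t relro_start, size_t relro_len,
                  int (*main_addr)(int, char **, char **),
                  int argc, char **argv, char **envp) {
    uintptr_t page = relro_start & ~0xfffUL;   /* mprotect needs page granularity */
    mprotect((void *)page, relro_start + relro_len - page, PROT_READ);
    int rc = main_addr(argc, argv, envp);
    syscall(SYS_exit, rc);   /* flawed, as noted: no deinit routines run */
}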

image

Figure 18.54: Unpatched generic_start_main().

Like I mentioned earlier, we keep the enable_relro() code within the main program’s text segment using a reverse text extension, or a text padding infection. We could then simply overwrite the five bytes at 0x405b46 with a call <offset> to enable_relro(), and that function would make sure we return the address of main(), which would obviously be stored in %rax. This is perfect, since the next instruction is callq *%rax, which would call main() right after RELRO has been enabled, and no instructions are thrown out of alignment. So that is the ideal solution, although it doesn’t yet handle the problem of .tdata being at the beginning of the data segment, which is a problem for us since we can only use mprotect on memory areas that are multiples of PAGE_SIZE.

A more sophisticated set of steps must be taken in order to get multi-threaded applications working with RELRO using binary instrumentation. Other solutions might use linker scripts to put the thread data and bss into their own data segment.

Notice how we patch the instruction bytes starting at 0x405b4f with a push/ret sequence, corrupting subsequent instructions. Nonetheless this is the prototype we are stuck with until I have time to make some changes.

image

Figure 18.55: Patched generic_start_main().

So let’s take a look at this RelroS application.3 4 First we see that this is not a dynamically linked executable.

$ readelf -d test
There is no dynamic section in this file.

We observe that there is only a r+x text segment, and a r+w data segment, with a lack of read-only memory protections on the first part of the data segment.

$ ./test &
[1] 27891
$ cat /proc/`pidof test`/maps
00400000-004cc000 r-xp 00000000 fd:01 4856460    test
006cc000-006cf000 rw-p 000cc000 fd:01 4856460    test
...

We apply RelroS to the executable with a single command.

$ ./relros ./test
injection size: 464
main(): 0x400b23

We observe that read-only relocations have been enforced by our patch that we instrumented into the binary called test.

$ ./test &
[1] 28052
$ cat /proc/`pidof test`/maps
00400000-004cc000 r-xp 00000000 fd:01 10486089 test
006cc000-006cd000 r--p 000cc000 fd:01 10486089 test
006cd000-006cf000 rw-p 000cd000 fd:01 10486089 test
...

Notice that after we applied relros to ./test, it now has a 4096-byte area in the data segment that has been marked as read-only. This is what the dynamic linker accomplishes for dynamically linked executables.

————

So what are some other potential solutions for enabling RELRO on statically linked executables? Aside from my binary instrumentation project that will improve in the future, this might be fixed either by tricky linker scripts or by the glibc developers.

Write a linker script that places .tbss, .tdata, and .data in their own segment, and the sections that you want read-only (.init_array, .fini_array, .jcr, .dynamic, .got, and .got.plt) in another segment. Both of these PT_LOAD segments will be marked as PF_R|PF_W (r+w) and serve as two separate data segments. A program can then have a custom function, but not a constructor, that is called by main() before it even checks argc and argv. The reason we don’t want a constructor function is that it would attempt to mprotect read-only permissions onto the second data segment before the glibc init code has finished performing its fixups, which require write access. This is because the constructor routines stored in the .init section are called before the write instructions to the .got, .got.plt, and similar sections.
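A minimal sketch of the idea, assuming hypothetical __relro2_start/__relro2_end symbols exported by such a linker script:

#include <stdint.h>
#include <sys/mman.h>

extern char __relro2_start[], __relro2_end[];   /* provided by the custom linker script */

static void protect_relro_segment(void) {
    uintptr_t start = (uintptr_t)__relro2_start & ~0xfffUL;
    mprotect((void *)start, (uintptr_t)__relro2_end - start, PROT_READ);
}

int main(int argc, char **argv) {
    protect_relro_segment();   /* before argc/argv are even looked at */
    /* ... rest of the program ... */
    return 0;
}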

The glibc developers should probably add a function that is invoked by generic_start_main() right before main() is called. You will notice there is a _dl_protect_relro() function in statically linked executables that is never called.

ASLR Issues

ASLR requires that an executable is ET_DYN unless VMA mirroring is used for ET_EXEC ASLR. A statically linked executable can only be linked as an ET_EXEC type executable.

$ gcc -static -fPIC -pie test2.c -o test2
ld: x86_64-linux-gnu/5/crtbeginT.o: relocation R_X86_64_32
    against `__TMC_END__' can not be used when making a shared
    object; recompile with -fPIC
x86_64-linux-gnu/5/crtbeginT.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status

This means that you can remove the -pie flag and end up with an executable that uses position independent code. But it does not have an address space layout that begins with base address zero, which is what we need. So what to do?

ASLR Solutions

I haven’t personally spent enough time with the linker to see if it can be tweaked to link a static executable that comes out as an ET_DYN object, which should also not have a PT_INTERP segment since it is not dynamically linked. A quick peek at src/linux/fs/binfmt_elf.c, shown in Figure 18.56, will show that the executable type must be ET_DYN.

image

Figure 18.56: src/linux/fs/binfmt_elf.c

A Hybrid Solution

The linker may not be able to perform this task yet, but I believe we can. A potential solution exists in the idea that we can at least compile a statically linked executable so that it uses position independent code (IP relative), although it will still maintain an absolute address space. So here is the algorithm from a binary instrumentation standpoint.

First we’ll compile the executable with -static -fPIC, then use static_to_dyn.c to adjust the executable. It changes ehdr->e_type from ET_EXEC to ET_DYN, then modifies the phdrs for each PT_LOAD segment, setting both phdr[TEXT].p_vaddr and .p_offset to zero. It sets phdr[DATA].p_vaddr to 0x200000 + phdr[DATA].p_offset. It sets ehdr->e_entry to ehdr->e_entry - old_base. Finally, it updates each section header to reflect the new address range, so that GDB and objdump can work with the binary.
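The core of such a tool fits on a page; here is a rough sketch of the idea (not the actual static_to_dyn.c, with error handling and the section-header pass omitted):

#include <elf.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int patch_static_to_dyn(const char *path) {
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    uint8_t *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    Elf64_Ehdr *ehdr = (Elf64_Ehdr *)map;
    Elf64_Phdr *phdr = (Elf64_Phdr *)(map + ehdr->e_phoff);
    Elf64_Addr old_base = 0;

    for (int i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type != PT_LOAD)
            continue;
        if (phdr[i].p_offset == 0) {               /* text segment */
            old_base = phdr[i].p_vaddr;
            phdr[i].p_vaddr = 0;
        } else {                                   /* data segment */
            phdr[i].p_vaddr = 0x200000 + phdr[i].p_offset;
        }
    }
    ehdr->e_type = ET_DYN;
    ehdr->e_entry -= old_base;
    /* the real tool also rewrites each section header here */

    munmap(map, st.st_size);
    close(fd);
    return 0;
}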

$ gcc -static -fPIC test2.c -o test2
$ ./static_to_dyn ./test2
Setting e_entry to 8b0
$ ./test2
Segmentation fault (core dumped)

Alas, a quick look at the binary with objdump will prove that most of the code is not using IP-relative addressing and is not truly PIC. The PIC version of the glibc init routines like _start lives in /usr/lib/x86_64-linux-gnu/Scrt1.o, so we may have to start thinking outside the box a bit about what a statically linked executable really is. That is, we might take the -static flag out of the equation and begin working from scratch!

Perhaps test2.c should have both a _start() and a main(), as shown in Figure 18.57. _start() should have no code in it and use __attribute__((weak)) so that the _start() routine in Scrt1.o can override it. Or we can compile Diet Libc5 with IP relative addressing, using it instead of glibc for simplicity. There are multiple possibilities, but the primary idea is to start thinking outside of the box. So for the sake of a PoC here is a program that simply does nothing but check if argc is larger than one and then increments a variable in a loop every other iteration. We will demonstrate how ASLR works on it. It uses _start() as its main().

$ gcc -nostdlib -fPIC test2.c -o test2
$ ./test2 arg1

$ pmap `pidof test2`
17370:   ./test2 arg1
0000000000400000      4K r-x-- test2
0000000000601000      4K rw--- test2
00007ffcefcca000    132K rw---   [ stack ]
00007ffcefd20000      8K r----   [ anon ]
00007ffcefd22000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total              160K

ASLR is not present, and the address space is just as expected on a 64 bit ELF binary in Linux. So let’s run static_to_dyn.c on it, and then try again.

$ ./static_to_dyn test2
$ ./test2 arg1

$ pmap `pidof test2`
17622:   ./test2 arg1
0000565271e41000      4K r-x-- test2
0000565272042000      4K rw--- test2
00007ffc28fda000    132K rw---   [ stack ]
00007ffc28ffc000      8K r----   [ anon ]
00007ffc28ffe000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total              160K

Notice that the text and data segments for test2 are now mapped at randomized addresses. Now we are talking! The rest of the homework should be fairly straightforward.

Improving Static Linking Techniques

Since we are compiling statically by simply cutting glibc out of the equation with the -nostdlib compiler flag, we must consider that things we take for granted, such as TLS and system call wrappers, must be manually coded and linked. One potential solution I mentioned earlier is to compile dietlibc with IP relative addressing mode, and simply link your code to it with -nostdlib. Figure 18.58 is an updated version of test2.c which prints the command line arguments.

image

Figure 18.57: First Draft of test2.c

image

Figure 18.58: Updated test2.c.

Now we are actually building a statically linked binary that can get command line args, and call statically linked in functions from Diet Libc.6

$ gcc -nostdlib -c -fPIC test2.c -o test2.o
$ gcc -nostdlib test2.o /usr/lib/diet/lib-x86_64/libc.a -o test2
$ ./test2 arg1 arg2
./test2
arg1
arg2

Now we can run static_to_dyn from page 715 to enforce ASLR.7 The first two sections are happily randomized!

$ ./static_to_dyn test2
$ ./test2 foo bar
$ pmap `pidof test2`
24411:   ./test2 foo bar
0000564cf542f000      8K r-x-- test2
0000564cf5631000      4K rw--- test2
00007ffe98c8e000    132K rw---   [ stack ]
00007ffe98d55000      8K r----   [ anon ]
00007ffe98d57000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total              164K

Summary

In this paper we have cleared some misconceptions surrounding the attack surface of a statically linked executable, and which security mitigations are lacking by default. PLT/GOT attacks do exist against statically linked ELF executables, but RELRO and ASLR defenses do not.

We presented a prototype tool for enabling full RELRO on statically linked executables. We also engaged in some work to create a hybridized approach, combining linking techniques with instrumentation, and together these allowed us to propose a solution for making static binaries that work with ASLR. Our solution for ASLR is to first build the binary statically, without glibc.

image

18:07 A Trivial Exploit for Tetrinet; or, Update Player TranslateMessage to Level Shellcode.

by John Laky and Kyle Hanslovan

Lo, the year was 1997 and humanity completed its greatest feat yet. Nearly thirty years after NASA delivered the lunar landings, St0rmCat released TetriNET, a gritty multiplayer reboot of the gaming monolith Tetris, bringing capitalists and communists together in competitive, adrenaline-pumping, line-annihilating, block-crushing action, all set to a period-appropriate synthetic soundtrack that would make Gorbachev blush. TetriNET holds the dubious distinction of hosting one of the most hilarious bugs ever discovered, where sending an offset and overwritable address in a stringified game state update will jump to any address of our choosing.

The TetriNET protocol is largely a trusted two-way ASCII-based message system with a special binascii-encoded handshake for login.0 Although there is an official binary (v1.13), this protocol enjoyed several implementations that aid in its reverse engineering, including a Python server/client implementation.1 Authentication to a TetriNET server uses a custom encoding scheme, a rotating XOR derived from the IP address of the server. One could spend ages reversing the C++ binary for this algorithm, but The Great Segfault punishes wasted time and effort, and our brethren at Pytrinet already have a Python implementation.

image
image

One of the many updates a TetriNET client can send to the server is the level update, an 0xFF terminated string of the form:

lvl <player number> <level number> \xff

The documentation states that acceptable values for the player number range from 1 to 6, a caveat that should pique the interest of even nascent bit-twiddlers. Predictably, sending a player number of 0x20 and a level of 0x00AABBCC crashes the binary through a write-anywhere bug. The only question now is which is easier: overwriting a return address on a stack or stomping on a function pointer in a v-table or something. A brief search for the landing zone yields the answer:

image

Praise the Stack! We landed inside the import table.

image

Now we have a plan to overwrite an often-called function pointer with a useful address, but which one? There are a few good candidates, and a look at the imports reveals a few of particular interest: PeekMessageA, DispatchMessageA, and TranslateMessage, indicating that TetriNET relies on Windows message queues for processing. Because these are usually handled asynchronously and applications receive a deluge of messages during normal operation, they are perfect candidates for corruption. Indeed, TetriNET implements a PeekMessageA / TranslateMessage / DispatchMessageA subroutine.

image
image
image

We adjust our firing solution to overwrite the address of TranslateMessage (remember, the vulnerable instruction multiplies the player number by the size of a pointer; scale the payload accordingly) and voila! EIP jumps to our provided level number.
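In C terms, the write primitive we have been abusing boils down to something like the following (an illustration of the bug class, not decompiled source; player_level stands in for the dword table at 0x453F28 seen in the disassembly below):

/* the handler trusts the player number and indexes a fixed global table,
   so an out-of-range player turns a level update into an arbitrary
   4-byte-aligned write of an attacker-chosen value */
static int player_level[6];                /* dword_453F28 in the binary */

void handle_level_update(unsigned player, int level) {
    player_level[player] = level;          /* documented range 1-6, never enforced */
}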

Now, all we have to do is jump to some shellcode. This may be a little trickier than it seems at first glance.

The first option: with a stable write-anywhere bug, we could write shellcode into an rwx section and jump to it. Unfortunately, the level number that eventually becomes ebx in the vulnerable instruction is a signed double word, and only positive integers can be written without raising an error. We could hand-craft some clever shellcode that only uses bytes smaller than 0x80 in key locations, but there must be a better way.

The second option: we could attempt to write our shellcode three bytes at a time instead of four, working backward from the end of an RWX section, always writing double words with one positive-integer-compliant byte followed by three bytes of shellcode, always overwriting the useless byte of the last write. Alas, the vulnerable instruction enforces 4-byte aligned writes:

0044B963    mov ds:dword_453F28[eax*4], ebx

The third option: we could patch either the positive-integer-compliant check or the vulnerable instruction to allow us to perform either of the first two options. Alas, the page containing this code is not writable.

image

Suddenly, the Stack grants us a brief moment of clarity in the midst of our desperation: because the login encoding accepts an arbitrary binary string as the nickname, all manner of shellcode can be passed as the nickname; all we have to do is find a way to jump to it. Surely there must be a pointer somewhere in the data section to the nickname that we can use to reach it. After a brief search, we discover there is indeed a static value pointing to the login nickname in the heap. Now we can write a small trampoline to load that pointer into a register and jump to it:

image

Voila! Log in as shellcode, update your level to the trampoline, smash the pointer to TranslateMessage, pull the trigger on the Windows message pump, and rejoice in the shiny goodness of a running exploit. The Stack would be proud! While a host of vulnerabilities surely lie in wait betwixt the subroutines of tetrinet.exe, this vulnerability’s shameless affair with the player is truly one for the ages.

Scripts and a reference Tetrinet executable are attached to this PDF,2 and the editors of this fine journal have resurrected the abandoned website at http://tetrinet.us/.

18:08 A Guide to KLEE LLVM Execution Engine Internals

by Julien Vanegue

Greetings fellow neighbors!

It is my great pleasure to finally write my first article in this journal, after so many of you have contributed excellent content in the past dozens of issues that Pastor Laphroaig put together for our enjoyment. I have been waiting for this moment for some time, and been harassed a few times, to finally come up with something worthwhile. Given the high standards set upon all of us, I did not feel like rushing it. Instead, I bring to you today what I think will be a useful piece of text for many fellow hackers to use in the future. Apologies for any errors that may have slipped from my understanding; I am getting older after all, and my memory is not what it used to be. Not that it has ever been infallible, but at least I used to remember where the cool kids hung out. This is my attempt at renewing the tradition of sharing knowledge through some more informal channels.

Today, I would like to talk to you about KLEE, an open source symbolic execution engine originally developed at Stanford University and now maintained at Imperial College London. Symbolic execution (SYMEX) stands somewhere between static analysis of programs and [dynamic] fuzz testing. While its theoretical foundations date back to the late seventies,0 practical applications waited until the late 2000s (such as SAGE1 at Microsoft Research) to finally become mainstream with KLEE in 2008. These tools have been used in practice to find thousands of security issues in software, ranging from simple NULL pointer dereferences, to out-of-bounds reads or writes on both the heap and the stack, to use-after-free vulnerabilities and other type-state issues that can easily be defined using “asserts.”

On the one hand, symbolic execution is able to perform concrete execution of the analyzed program and maintains a concrete store for variable values as the execution progresses; on the other, it can also track path conditions using constraints. This can be used to verify the feasibility of a specific path. At the same time, a process tree (PTree) of nodes (PTreeNode) represents the state space as an ImmutableTree structure. The ImmutableTree implements a copy-on-write mechanism so that parts of the state (mostly variable values) that are shared across nodes don’t have to be copied from state to state unless they are written to. This allows KLEE to scale better under memory pressure. Such a state contains both a list of symbolic constraints that are known to be true in this state, and a concrete store for program variables on which constraints may or may not be applied (but which are nonetheless necessary so the program can execute in KLEE).

My goal in this article is not so much to show you how to use KLEE, which is well understood, but to bring you a tutorial on hacking KLEE internals. This will be useful if you want to add features or add support for specific analysis scenarios that you care about. I’ve spent hundreds of hours in KLEE internals, and having such notes would have helped me in the beginning. I hope it helps you too.

Now let’s get started.

Working with Constraints

Let’s start with a simple C program.

int fct(int a, int b) {
  int c = 0;
  if (a < b)
    c++;
  else
    c--;
  return c;
}

int main(int argc, char **argv) {
  if (argc != 3) return (-1);
  int a = atoi(argv[1]);
  int b = atoi(argv[2]);
  if (a < b)
    return (0);
  return fct(a, b);
}

It is clear that the path starting in main and continuing into the if (a < b) branch inside fct is infeasible. This is because any such path would already have finished with a return (0) in the main function.

This is how it works: first KLEE executes some bootstrapping code before main takes control, then starts executing the first LLVM instruction of the main function. Upon reaching the first if statement, KLEE forks the state space (via function Executor::fork). The left node has one more constraint ( argc != 3) while the right node has constraint ( argc == 3). KLEE eventually comes back to its main routine ( Executor::run), adds the newly-generated states into the set of active states, and picks up a new state to continue analysis with.

Executor Class

The main class in KLEE is called the Executor class. It has many methods, such as Executor::run(), the main method of the class. This is where the sets of states (the added-states and removed-states sets) are manipulated to decide which state to visit next. Bear in mind that nothing guarantees that the next state in the Executor class will be the next state in the current path.

Figure 18.59 shows all of the LLVM instructions currently supported by KLEE.

  • Call/Br/Ret: Control flow instructions. These are cases where the program counter (part of the state) may be modified by more than just the size of the current instruction. In the case of Call and Ret, a new object StackFrame is created where local variables are bound to the called function and destroyed on return. Defining new variables may be achieved through the KLEE API bindObjectInState().
  • Add/Sub/Mul/*S*/U*/*Or*: The Signed and Unsigned arithmetic instructions. The usual suspects including bit shifting operations as well.
  • Cast operations ( UItoFP, FPtoUI, IntToPtr, PtrToInt, Bit-Cast, etc.): used to convert variables from one type to a variable of a different type.
  • *Ext* instructions: these extend a variable to use a larger number of bits, for example 8b to 32b, sometimes carrying the sign bit or the zero bit.
  • F* instructions: the floating point arithmetic instructions in KLEE. I don’t do much floating point analysis myself and I tend not to modify these cases; however, this is where to look if you’re interested in that.
  • Alloca: used to allocate memory of a desired size
  • Load/Store: Memory access operations at a given address
  • GetElementPtr: perform array or structure read/write at certain index
    image

    Figure 18.59: LLVM Instructions supported by KLEE

  • PHI: This corresponds to the PHI function in the Static Single Assignment form (SSA) as defined in the literature.2

There are other instructions I am glossing over but you can refer to the LLVM reference manual for an exhaustive list.

So far the execution in KLEE has gone through Executor::run()-> Executor::executeInstruction() -> case ... but we have not looked at what these cases actually do in KLEE. This is handled by a class called the ExecutionState that is used to represent the state space.

ExecutionState Class

This class is declared in include/klee/ExecutionState.h and contains mostly two objects:

  • AddressSpace: contains the list of all meta-data for the process objects in this state, including global, local, and heap objects. The address space is basically made of an array of objects and routines to resolve concrete addresses to objects (via the method AddressSpace::resolveOne to resolve one object by picking the first match, or the method AddressSpace::resolve for resolving to a list of objects that may match). The AddressSpace object also contains a concrete store for objects where concrete values can be read and written. This is useful when you’re tracking a symbolic variable but suddenly need to concretize it to make an external concrete function call in libc or some other library that you haven’t linked into your LLVM module.
  • ConstraintManager: contains the list of all symbolic constraints available in this state. By default, KLEE stores all path conditions in the constraint manager for that state, but it can also be used to add more constraints of your choice. Not all objects in the AddressSpace may be subject to constraints, which is left to the discretion of the KLEE programmer. Verifying that these constraints are satisfiable can be done by calling the solver->mustBeTrue() or solver->mayBeTrue() methods, which are APIs provided in KLEE to call either STP or Z3 independently of the low-level solver API. This comes in handy when you want to check the feasibility of certain variable values during analysis.

Every time the ::fork() method is called, one execution state is split into two where possibly more constraints or different values have been inserted in these objects. One may call the Executor::branch() method directly to create a new state from the existing state without creating a state pair as fork would do. This is useful when you only want to add a subcase without following the exact fork expectations.

Executor::executeMemoryOperation(), MemoryObject and ObjectState

Two important classes in KLEE are MemoryObject and ObjectState, both defined in lib/Core/Memory.h.

The MemoryObject class is used to represent an object such as a buffer that has a base address and a size. When accessing such an object, typically via the Executor::executeMemoryOperation() method, KLEE automatically ensures that accesses are in bound based on known base address, desired offset, and object size information. The MemoryObject class provides a few handy methods.

(...)
ref<ConstantExpr> getBaseExpr()
ref<ConstantExpr> getSizeExpr()
ref<Expr> getOffsetExpr(ref<Expr> pointer)
ref<Expr> getBoundsCheckPointer(ref<Expr> pointer)
ref<Expr> getBoundsCheckPointer(ref<Expr> pointer,
                                unsigned bytes)
ref<Expr> getBoundsCheckOffset(ref<Expr> offset)
ref<Expr> getBoundsCheckOffset(ref<Expr> offset,
                               unsigned bytes)

Using these methods, checking for boundary conditions is child’s play. It becomes more interesting when symbolics are used, as the conditions that must be checked involve more than constants, depending on whether the base address, the offset, or the index are symbolic values (or possibly depending on the source data for certain analyses, for example taint analysis).

While the MemoryObject somehow takes care of the spatial integrity of the object, the ObjectState class is used to access the memory value itself in the state. Its most useful methods are:

// return bytes read.
ref<Expr> read(ref<Expr> offset, Expr::Width width);
ref<Expr> read(unsigned offset, Expr::Width width);
ref<Expr> read8(unsigned offset);

// return bytes written.
void write(unsigned offset, ref<Expr> value);
void write(ref<Expr> offset, ref<Expr> value);
void write8(unsigned offset, uint8_t value);
void write16(unsigned offset, uint16_t value);
void write32(unsigned offset, uint32_t value);
void write64(unsigned offset, uint64_t value);

Objects can be either concrete or symbolic, and these methods implement actions to read or write the object depending on this state. One can switch from concrete to symbolic state by using methods:

void makeConcrete();
void makeSymbolic();

These methods will just flush symbolics if we become concrete, or mark all concrete variables as symbolic from then on if we switch to symbolic mode. It’s good to play around with these methods to see what happens when you write the value of a variable, or make a new variable symbolic, and so on.

When Instruction::Load and ::Store are encountered, the Executor::executeMemoryOperation() method is called where symbolic array bounds checking is implemented. This implementation uses a healthy mix of MemoryObject, ObjectState, AddressSpace::resolveOne() and MemoryObject::getBoundsCheckOffset() to figure out whether any overflow condition can happen.

If so, it calls KLEE’s internal API Executor::terminateStateOnError() to signal the memory safety issue and terminate the current state. Symbolic execution will then resume on other states, so that KLEE does not stop after the first bug it finds. As it finds more errors, KLEE saves the error locations so it won’t report the same bugs over and over.

Special Function Handlers

A bunch of special functions are defined in KLEE that have special handlers and are not treated as normal functions. See lib/Core/SpecialFunctionHandler.cpp.

Some of these special functions are called from the Executor::executeInstruction() method in the case of the Instruction::Call instruction.

All the klee_* functions are internal KLEE functions which may have been produced by annotations given by the KLEE analyst. (For example, you can add a klee_assume(p) somewhere in the analyzed program’s code to say that p is assumed to be true, so that the corresponding constraints will be pushed into the ConstraintManager of the current state without checking them.) Other functions such as malloc, free, etc. are not treated as normal functions in KLEE. Because the malloc size could be symbolic, KLEE needs to concretize the size according to a few simplistic criteria (like size = 0, size = 2^8, size = 2^16, etc.) to continue making progress. Suffice to say this is quite approximate.
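As a quick illustration of the annotation style (a toy harness, not from the KLEE distribution):

#include <klee/klee.h>

int check(int x) {
    if (x > 100 && x * 2 == 214)   /* KLEE will discover x == 107 */
        return 1;
    return 0;
}

int main(void) {
    int x;
    klee_make_symbolic(&x, sizeof(x), "x");
    klee_assume(x > 0);            /* pushes a constraint: no fork, no check */
    return check(x);
}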

image

Figure 18.60: KLEE Special Function Handlers

This logic is implemented in the Executor::executeAlloc() and ::executeFree() methods. I have hacked in some modifications to track the heap more precisely in KLEE; however, bear in mind that KLEE’s heap as well as the target program’s heap are both maintained within the same address space, which is extremely intrusive. This makes KLEE a bad framework for layout-sensitive analysis, which many exploit generation problems require nowadays. Other special functions include stubs for Address Sanitizer (ASan), which is now included in LLVM and can be enabled while creating LLVM code with clang. ASan is mostly useful for fuzzing, so normally invisible corruptions turn into visible assertions. KLEE does not make much use of these stubs and mostly generates a warning if you reach one of the ASan-defined stubs.

Other recent additions were klee_open_merge() and klee_close_merge(), which are an annotation mechanism for performing selective merging in KLEE. Merging happens when you come back from a conditional construct (e.g., a switch, or when you must decide whether to continue or break from a loop), as you must select which constraints and values will hold in the state immediately following the merge. KLEE has some interesting merging logic implemented in lib/Core/MergeHandler.cpp that is worth taking a look at.

Experiment with KLEE for yourself!

I did not go into much detail on how to install KLEE, as good instructions are available online.3 Try it for yourself!

My setup is an amd64 machine on Ubuntu 16.04 that has most of what you will need in packages. I recommend building LLVM and KLEE from sources as well as all dependencies (e.g., Z34 and/or STP5) that will help you avoid weird symbol errors in your experiments.

A good first target to try KLEE on is coreutils, which is what pretty much everybody uses in their research paper evaluations nowadays. Coreutils is well tested, so new bugs in it are scarce, but it’s good for confirming that everything works okay for you. A tutorial on how to run KLEE on coreutils is available as part of the project website.6

I personally used KLEE on various targets: coreutils, busybox, as well as other standard network tools that take input from untrusted data. These will require a standalone research paper explaining how KLEE can be used to tackle these targets.

Symbolic Heap Execution in KLEE

For heap analysis, it appears that KLEE has a strong limitation in that heap chunks for KLEE as well as for the target program are maintained in the same address space. One would need to introduce an allocator proxy7 if we wanted to track any kind of heap layout fidelity for heap prediction purposes. There are spatial issues to consider there, as a symbolic heap size may lead to heap state space explosion, so more refined heap management may be required. It may be that other tools relying on selective symbolic execution (S2E)8 are more suitable for some of these problems.

image
Analyzing Distributed Applications.

These are more complex use-cases where KLEE must be modified to track state across distributed components.9 Several industrially sized programs use databases and key-value stores, and it is interesting to see what symbolic execution model can be defined for those. This approach has been applied to distributed sensor networks and could also be experimented with on distributed software in the cloud.

You can either obtain LLVM bytecode by compiling with the clang compiler or by use of a decompiler like McSema and its ReMill library.10

Beware of restricting yourself to artificial test suites: however much they resemble real-world code, they do not take into account all the environmental dependencies that a real project might have. A typical example is that KLEE does not support inline assembly. Another is the heap intrusiveness previously mentioned. These limitations might turn a golden technique like symbolic execution into a vacuous technology if applied to a bad target.

I leave you to that. Have fun and enjoy!

—Julien

18:09 Reversing the Sandy Bridge DDR3 Scrambler with Coreboot

by Nico Heijningen

Humble greetings neighbors,

I reverse engineered part of the memory scrambling included in Intel’s Sandy/Ivy Bridge processors. I have distilled my research into a PoC that can reproduce all 2^18 possible 1,024-byte scrambler sequences from a 1,026-bit starting state.0

For a while now, Intel’s memory controllers have included memory scrambling functionality. Intel’s documentation explains the benefits of scrambling the data before it is written to memory as reducing power spikes and parasitic coupling.1 Prior research on the topic2 3 quotes different Intel patents.4

Furthermore, some details can be deduced by cross-referencing datasheets of other architectures.5 For example, the scrambler is initialized with a random 18-bit seed on every boot, the SCRMSEED. Other than this, nothing is publicly known or documented by Intel. The prior work shows that scrambled memory can be descrambled, yet newer versions of the scrambler seem to raise the bar, together with prospects of full memory encryption.6 While the scrambler has never been claimed to provide any cryptographic security, it is still nice to know how the scrambling mechanism works.

Not much is known about the internals of the memory scrambler. Intel’s patents discuss the use of LFSRs, and the work of Bauer et al. has modeled the scrambler as a stream cipher with a short period. Hence the possibility of a known-plaintext attack to recover scrambled data: if you know part of the memory content, you can obtain the cipher stream by XORing the scrambled memory with the plaintext. Once you know the cipher stream, you can repeatedly XOR it with the scrambled data to obtain the original unscrambled data.
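In code, the attack is nothing more than two XOR loops (a trivial sketch):

#include <stddef.h>
#include <stdint.h>

/* recover the cipher stream from a region whose plaintext we know */
void recover_stream(const uint8_t *scrambled, const uint8_t *known,
                    uint8_t *stream, size_t n) {
    for (size_t i = 0; i < n; i++)
        stream[i] = scrambled[i] ^ known[i];
}

/* reuse that stream to descramble other data scrambled with it */
void descramble(uint8_t *data, const uint8_t *stream, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] ^= stream[i];
}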

image

An analysis of the properties of the cipher stream has to our knowledge never been performed. Here I will describe my journey in obtaining the cipher stream and analyzing it.

First we set out to reproduce the work of Bauer et al.: by performing a cold-boot attack we were able to obtain a copy of memory. However, because this is quite a tedious procedure, it is troublesome to profile different scrambler settings. Bauer’s work is built on ‘differential’ scrambler images: scrambled with one SCRMSEED and descrambled with another. The data obtained by using the procedure of Bauer et al. contains some artifacts because of this.

image

Figure 18.61: Coreboot’s Scrambling Seed for Sandy Bridge

We found that it is possible to disable the memory scrambler using an undocumented Intel register and used coreboot to set this bit early in the boot process. We patched coreboot to try and automate the process of profiling the scrambler. We chose the Sandy Bridge platform as both Bauer et al.’s work was based on it and because coreboot’s memory initialization code has been reverse engineered for the platform.7 Although coreboot builds out-of-the-box for the Gigabyte GA-B75M-D3V motherboard we used, coreboot’s makefile ecosystem is quite something to wrap your head around. The code contains some lines dedicated to the memory scrambler, setting the scrambling seed or SCRMSEED. I patched the code in Figure 18.61 to disable the memory scrambler, write all zeroes to memory, reset the machine, enable the memory scrambler with a specific SCRMSEED, and print a specific memory region to the debug console. (COM port.) This way we are able to obtain the cipher stream for different SCRMSEEDs. For example when writing eight bytes of zeroes to the memory address starting at 0x10000070 with the scrambler disabled, we read 3A E0 9D 70 4E B8 27 5C back from the same address once the PC is reset and the scrambler is enabled. We know that that’s the cipher stream for that memory region. A reset is required as the SCRMSEED can no longer be changed nor the scrambler disabled after memory initialization has finished. (Registers need to be locked before the memory can be initialized.)

image

Figure 18.62: Keyblock

Now some leads by Bauer et al. based on the Intel patents quickly led us in the direction of analyzing the cipher stream as if it were the output of an LFSR. However, taking a look at any one of the cipher streams reveals a rather distinctive usage of an LFSR. It seems as if the complete internal state of the LFSR is used as the cipher stream for three shifts, after which the internal state is reset to a fresh starting state and shifted three times again. (See Figure 18.62.)

00111010 11100000
10011101 01110000
01001110 10111000
00100111 01011100

It is interesting to note that a feedback bit is being shifted in on every clock tick. Typically only the bit being shifted out of the LFSR would be used as part of the ‘random’ cipher stream being generated, instead of the LFSR’s complete internal state. The latter no longer produces a random stream of data; the consequences of this are not known, but it is probably done as a performance optimization.

These properties could suggest multiple constructions, for example layered LFSRs in which one LFSR generates the next LFSR’s starting state and part of the latter’s internal state is used as output. However, the actual construction is unknown: the number of combined LFSRs is not known, nor their polynomials (the positions of the feedback taps), nor their lengths, nor the manner in which they are combined.

Normally it would be possible to deduce such information by assuming a typical length, e.g. a 16-bit LFSR, and applying the Berlekamp-Massey algorithm: fed an initial stretch of the cipher stream, the algorithm deduces the shortest LFSR, and hence the feedback polynomial, that could have produced the bits that follow. However, because of the unknowns described above, this leads us to a dead end. Back to the drawing board!
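
For readers who want to try this on their own captures, a compact Berlekamp-Massey over GF(2) looks roughly like the following sketch (not the exact tool we used). Keep in mind that pinning down an n-bit LFSR generally needs about 2n bits of output.

def berlekamp_massey(bits):
    # Return (L, poly): the length of the shortest binary LFSR that
    # generates `bits`, and its feedback polynomial coefficients
    # (poly[0] = 1; poly[j] = 1 means state bit j is tapped).
    n = len(bits)
    c, b = [0] * (n + 1), [0] * (n + 1)
    c[0] = b[0] = 1
    L, m = 0, -1
    for i in range(n):
        # Discrepancy between the current LFSR's prediction and bit i.
        d = bits[i]
        for j in range(1, L + 1):
            d ^= c[j] & bits[i - j]
        if d:
            t, s = c[:], i - m
            for j in range(n + 1 - s):
                c[j + s] ^= b[j]
            if 2 * L <= i:
                L, m, b = i + 1 - L, i, t
    return L, c[:L + 1]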

By also patching coreboot to parse input from the serial console, we automated the cipher stream acquisition: we could dynamically set the SCRMSEED and then obtain the resulting cipher stream. A Python script controlling the PC over a serial cable let us iterate through all 2¹⁸ possible SCRMSEEDs and save their accompanying 1024-byte cipher streams. Acquiring all the cipher streams took almost a full week. This data allowed us to look for relations between the SCRMSEED and the produced cipher stream. Stated differently: is it possible to reproduce the scrambler’s working using less than 2¹⁸ × 1024 bytes?
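
The acquisition loop itself is nothing fancy. The sketch below uses pyserial; the command format and framing are hypothetical stand-ins for whatever the patched coreboot actually expects on its console:

import serial

def set_seed(port, seed):
    port.write(b"SCRMSEED %05x\n" % seed)      # hypothetical command format

def read_stream(port, length=1024):
    return port.read(length)                   # hypothetical framing: raw bytes

with serial.Serial("/dev/ttyS0", 115200, timeout=60) as port, \
     open("streams.bin", "wb") as out:
    for seed in range(1 << 18):                # all 2^18 possible SCRMSEEDs
        set_seed(port, seed)                   # target resets and re-scrambles
        out.write(read_stream(port))           # 1024-byte stream per seed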

image

Figure 18.63: TeraDIMM Scrambling

This analysis was eased once we stumbled upon a patent describing the use of the memory bus as a high-speed interconnect under the name TeraDIMM.8 Using the memory bus in this way, one would only receive scrambled data on the other end, hence the data needs to be descrambled. The authors give away some of their knowledge on the subject: the cipher stream can be built by XORing specific regions of the stream together. This insight paved the way for our research into the memory scrambling.

The main distinction the TeraDIMM patent makes is between scrambling based on four bits of the memory address and scrambling based on the (18-bit) SCRMSEED. Both the memory-address-based and the SCRMSEED-based scrambling are used to generate the cipher stream 64 bytes at a time.9 Each 64-byte cipher-stream block is a (linear) combination of different blocks of data, selected according to the bits of the memory address. See Figure 18.63.

Because the address-based scrambling does not depend on the SCRMSEED, it cancels out in the differential images obtained by Bauer. This is as far as the TeraDIMM patent takes us; however, with this and our data in mind, it was easy to see that the SCRMSEED-based scrambling is also built up by XORing blocks together. Again, depending on which bits of the SCRMSEED are set, different blocks are XORed together.

Hence, to reproduce any possible cipher stream we only need four such blocks for the address scrambling and eighteen blocks for the SCRMSEED scrambling. We have named the eighteen SCRMSEEDs that produce the latter blocks the (SCRMSEED) toggleseeds. We’ll leave the four address-scrambling blocks for now and focus on the toggleseeds.
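
Put differently, once the four address blocks and the eighteen toggleseed blocks have been captured, any 64-byte cipher-stream block is just the XOR of the blocks selected by the bits that are set. A sketch of that reconstruction (the table names and the exact address-bit indexing are assumptions on my part):

def xor64(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cipher_block(addr_bits: int, scrmseed: int, ADDR_BLOCKS, SEED_BLOCKS) -> bytes:
    # ADDR_BLOCKS: four 64-byte blocks; SEED_BLOCKS: eighteen 64-byte blocks,
    # both captured beforehand with the coreboot setup described above.
    block = bytes(64)
    for i in range(4):                         # address-based part
        if (addr_bits >> i) & 1:
            block = xor64(block, ADDR_BLOCKS[i])
    for i in range(18):                        # SCRMSEED-based part
        if (scrmseed >> i) & 1:
            block = xor64(block, SEED_BLOCKS[i])
    return block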

The next step in distilling the redundancy in the cipher stream is to exploit the observation that, for specific toggleseeds, parts of the 64-byte blocks overlap in a sequential manner. (See Figures 18.65 and 18.66.) The 18 toggleseeds can be placed in four groups, and any block of data associated with a toggleseed can be reproduced by picking a different offset in the non-redundant stream of one of the four groups. For example, going back from the overlapping stream to the cipher stream of SCRMSEED 0x100, we start at an offset of 16 bytes and take 64 bytes, obtaining 00 30 80 ... 87 b7 c3.

Finally, the overlapping streams of two of the four groups can be used to define the other two, by combining specific eight-byte stretches, i.e., multiplying the stream with a static matrix. For example, to obtain the first stretch of the overlapping stream of SCRMSEEDs 0x4, 0x10, 0x100, 0x1000, and 0x10000, we combine the fifth and the sixth stretch of the overlapping stream of SCRMSEEDs 0x1, 0x40, 0x400, and 0x4000. That is, 20 00 10 00 08 00 04 00 = 00 01 00 00 00 00 00 00 ^ 20 01 10 00 08 00 04 00. The matrix is the same between the two groups and is provided in Figure 18.64. One is invited to verify the correctness of that figure using Figures 18.65 and 18.66.
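
The arithmetic in that example is easy to check; this little Python snippet just verifies the stretch given above:

fifth = bytes.fromhex("0001000000000000")      # fifth stretch, 0x1/0x40/0x400/0x4000 group
sixth = bytes.fromhex("2001100008000400")      # sixth stretch, same group
first = bytes(a ^ b for a, b in zip(fifth, sixth))
assert first == bytes.fromhex("2000100008000400")   # first stretch, 0x4/0x10/.../0x10000 group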

image

Figure 18.64: Scrambler Matrix

Some work remains to be done. We postulate that a mathematical basis for these observations exists, but a nice relationship underpinning them is yet to be found. Any additional details can be found in my TUE thesis.10

image

Figure 18.65: Overlapping Streams 1

image

Figure 18.66: Overlapping Streams 2

18:10 Easy SHA-1 Collisions with PDFLaTeX

by Ange Albertini

In the summer of 2015, I worked with Marc Stevens on the reusability of a SHA-1 collision: determining a prefix that would enable us to craft an infinite number of valid PDF pairs, each with arbitrary content and a SHA-1 collision.

image

The first SHA-1 colliding pair of PDF files was released in February 2017.0 I documented the process and the result in my “Exploiting hash collisions” presentation.

The resulting prefix declares a PDF whose object 1 declares an image, with references to further objects 2–8 in the file for the image’s properties.

image

The PDF is otherwise entirely normal. It’s just a PDF with its first eight objects used, and with an image of fixed dimensions and colorspace whose content differs between the two colliding files.

The image can be displayed once or many times, with optional clipping. Under specific readers (not browsers), the raw data of the image can also be used as page content if it is stored losslessly, repeating lines of code eight times.

The rest of the file is totally standard. It could actually be a standard academic paper like this one.

We just need to tell PDFLATEX that object 1 is an image and that the next seven objects are taken, then do some postprocessing magic: since we can’t actually build the whole PDF file with the perfect precision required for the hash collision, we’ll just use placeholders for each of these objects. We also need to tell PDFLATEX to disable compression in this group of objects.

Here’s how to do it in PDFLATEX. You may have to put that even before the documentclass declaration to make sure the first PDF objects are not yet reserved.

image

Then we just need to get the reference to the last PDF image object, and we can now display our image wherever we want.

image

We then just need to actually overwrite the first eight objects of a colliding PDF, and everything falls into place.1 You can optionally adjust the XREF table for a perfectly standard, SHA-1 colliding, and automatically generated PDF pair.
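
One way to do that final step, as a hedged Python sketch (the file names, the "\n8 0 obj" landmark, and the equal-length assumption are mine, not the original tooling): splice the head of a colliding PDF, up to the end of its eighth object, onto the tail of the PDFLATEX output. If the two heads have the same length, the later XREF offsets stay valid; otherwise the XREF needs the adjustment mentioned above.

def head_end(pdf: bytes) -> int:
    # Offset just past the 'endobj' that closes object 8.
    start = pdf.index(b"\n8 0 obj")
    return pdf.index(b"endobj", start) + len(b"endobj")

with open("collision-prefix.pdf", "rb") as f:
    prefix = f.read()                     # colliding head carrying objects 1-8
with open("paper.pdf", "rb") as f:
    paper = f.read()                      # pdflatex output with placeholder objects

with open("paper-colliding.pdf", "wb") as f:
    f.write(prefix[:head_end(prefix)] + paper[head_end(paper):])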
