Malware Analyst: The Portable Executable

Introduction

MICROSOFT INTRODUCED A NEW executable file format with Windows NT. This format is called the Portable Executable (PE) format because it is supposed to be portable across all 32-bit operating systems by Microsoft. The same PE format executable can be executed on any version of Windows NT, Windows 95, and Win32s. Also, the same format is used for executables for Windows NT running on processors other than Intel x86, such as MIPS, Alpha, and Power PC. The 32-bit DLLs and Windows NT device drivers also follow the same PE format.

It is helpful to understand the PE file format because PE files are almost identical on disk and in RAM. Learning about the PE format is also helpful for understanding many operating system concepts. For example, how operating system loader works to support dynamic linking of DLL functions, the data structures involved in dynamic linking such as import table, export table, and so on.

The PE format is not really undocumented. The WINNT.H file has several structure definitions representing the PE format. The Microsoft Developer's Network (MSDN) CD-ROMs contain several descriptions of the PE format. However, these descriptions are in bits and pieces, and are by no means complete.

Microsoft also provides a DLL with the SDK that has utility functions for interpreting PE files. We also discuss these functions and correlate them with other information about the PE format.

OVERVIEW OF A PE FILE

In this section, we discuss the overall structure of a PE file. In the sections that follow, we go into detail about the PE format. A PE file comprises various sections. Because Microsoft’s 32-bit operating systems follow the flat memory model, an executable no longer contains segments. Still, different parts of an executable, such as code and data, have different characteristics. These different parts of an executable are stored as different sections. Thus, a PE file is a concatenation of data stored in sections.

A few sections are always present in a PE file generated by the Microsoft linker. Other linkers may generate similar sections with different names. A PE file generated with the Microsoft linker has a .text section that contains the code bytes concatenated from all the object files. As for the data, it can be classified into different categories. The .data section contains all the initialized global and static data, while the .bss section contains the uninitialized data. The read-only data, such as string literals and constants, is stored in the .rdata section. This section also contains some other read-only structures, such as the debug directory, the Thread Local Storage (TLS) directory, and so on. The .edata section contains information about the functions exported from a DLL, while the .idata section stores information about the functions imported by an executable or a DLL. The .rsrc section contains various resources, such as menus and dialog boxes. The .reloc section stores the information required for relocating the image while loading.

The names of the sections do not have any significance. As mentioned earlier, different linkers may use different names for the sections. Programmers can also create new sections of their own. The #pragma code_seg and #pragma data_seg macros can be used to create new sections while working with Microsoft compiler. The operating system loader locates the required piece of information from the data directories present in the file headers. Shortly, we will present an overview of file headers and then look at them in more detail.

STRUCTURE OF A PE FILE

Apart from the sections consisting of the actual data, a PE file contains various headers that describe the sections and the important information present in the sections.

If you look at the hex dump of a PE file, the first 2 bytes might look familiar. Aren’t they M and Z? Yes, a PE file starts with the DOS executable header. It is followed by a small program that prints an error message saying that the program cannot be run in DOS mode. It’s the same idea that was used in 16-bit Windows executables. This program code is executed, if the PE image is run under DOS.

After the DOS header and the DOS executable stub comes the PE header. A field in the DOS header points to this new header. The PE header starts with the 4-byte signature “PE” followed by two nulls. The PE format is based on the Common Object File Format (COFF) used by Unix. The PE signature is followed by the object file header borrowed from COFF. This header is present also for the object files produced by Microsoft’s 32-bit compilers. This header contains some general information about the file, such as the target machine ID, the number of sections in the file, and so forth. The COFF style header is followed by the optional header. This header is optional in the sense that it is not required for the object files. As far as executables and DLLs are concerned, this header is mandatory. The optional header has two parts. The first part is inherited from COFF and can be found in all COFF files. The second part is an NT-specific extension of COFF. Apart from other NT-specific information, such as the subsystem type, this part also contains the data directory. The data directory is an array in which each entry points to some important piece of information. One of the entries in the data directory points to the import table of the executable or DLL, another entry points to the export table of the DLL, and so on.

The data directory is followed by the section table. The section table is an array of section headers. A section header summarizes the important information about the respective section. Finally, the section table is followed by the sections themselves.

We hope that this gives you an overview of the organization of a PE file. Before diving into the details of the PE format, let’s discuss a concept that is vital in interpreting a PE file.

RELATIVE VIRTUAL ADDRESS

All the offsets within a PE file are denoted as Relative Virtual Addresses (RVAs). An RVA is an offset from the base address at which an executable is loaded in memory. This is not the same as the offset within the file because of the section alignment requirements. The PE header specifies the section alignment requirements for an executable image. A section has to be loaded at a memory address that is a multiple of the section alignment. The section alignment has to be a multiple of the page size. This is because different sections have different page attribute requirements; for example, the .data section needs read-write permissions, while the .text section needs read-execute permissions. Hence, a page cannot span section boundaries.

Because the PE format always talks in terms of RVAs, it’s difficult to find the location of the required information within a file. A common practice while accessing a PE file is to map the file in memory using the Win32 memory mapping API. It’s a bit complicated to calculate the address for the given RVA in this memory-mapped file. You first need to find out the section in which the given RVA lies. You can accomplish this by iterating through the section table. Each section header stores the starting RVA for the section and the size of the section. A section is guaranteed to be contiguously loaded in memory. Hence, the offset from the start of the section for a particular piece of data is bound to be the same whether the file is memory mapped or loaded by the operating system loader for execution. Hence, to find out the address in a memory-mapped file, you simply need to add this offset to the base address of the section in the memory-mapped file. Now, this base address can be calculated from within the file offset of the section, which is also stored in the respective section header. Quite an easy procedure, isn’t it?

Malware Analyst

Sunday, July 5, 2009

The Portable Executable

OVERVIEW OF A PE FILE

STRUCTURE OF A PE FILE

RELATIVE VIRTUAL ADDRESS

No comments:

About Me