Content Disarm and Reconstruction: Eh, What’s Up Docx?

Previously I wrote about content disarm and reconstruction (CDR) with respect to steganography. Stego is really cool stuff, but demonstrating what CDR does with respect to documents makes things less abstract. You can actually see what I am talking about.
Microsoft Office documents use the Office Open XML (OOXML) format, while LibreOffice, OpenOffice, and a couple dozen other productivity applications use the OpenDocument format. The two formats are different, but both are built the same way: you have a ZIP file with XML and potentially other embedded objects. The images to the right are the headers of a Microsoft Word document and a LibreOffice document. The “PK” marker at the beginning indicates a zipped file. Sure enough, these documents can be opened with an archiver, which will display the contents of the zip file. Shall we extract the files? That was a rhetorical question; you know we will.
This is where things start to get interesting. Unless this is already interesting, in which case things start to get more interesting. The extension .docm means it is a Word document with macros, better known as “you’ve been pwned.” Seriously, weaponized Word documents delivered through email are one of the favorite attack methods of APT groups and others. Ever since Visual Basic for Applications (VBA) was added to Word, the word “macro” has meant “Make A Criminal Rich Online”. But I digress.
As you can see, I have extracted “Word Example.docm” to the “Word Example” folder, with the directory structure there to see in all of its splendid glory. Let’s focus on the “Word” folder, which resides in the Word Example folder. For the time being, which means it won’t be covered in this blog, we’re going to ignore the _rels folder. Circled in red (technically oblonged in red) there are two files: vbaData.xml and vbaProject.bin. The file vbaProject.bin contains the macros. The .bin extension tells you the file is binary rather than XML; here it is an OLE container holding the VBA project, which means executable content.
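If you would rather poke at this with a script than an archiver, here is a minimal Python sketch of the same inspection: check for the “PK” signature, then list the macro-related parts. The function names, the `MACRO_PARTS` list, and the in-memory sample document are all made up for illustration; a real CDR product looks at far more than two file names.

```python
import io
import zipfile

# Signature of a ZIP local file header; this is the "PK" seen in the hex dump.
ZIP_MAGIC = b"PK\x03\x04"

# Part names treated as macro-related in this sketch (an assumption, not a complete list).
MACRO_PARTS = ("vbaproject.bin", "vbadata.xml")


def looks_like_zip(data: bytes) -> bool:
    """True if the bytes start with the ZIP local file header signature."""
    return data.startswith(ZIP_MAGIC)


def find_macro_parts(data: bytes) -> list[str]:
    """Return the archive members whose file name matches a known macro part."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().rsplit("/", 1)[-1] in MACRO_PARTS
        ]


def build_sample_docm() -> bytes:
    """Build a tiny stand-in for a .docm in memory (placeholder contents only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/vbaProject.bin", b"\x00placeholder")
        archive.writestr("word/vbaData.xml", "<vbaData/>")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = build_sample_docm()
    print(looks_like_zip(sample))
    print(find_macro_parts(sample))
```

To try it on a real file, read the document with `open(path, "rb").read()` and pass the bytes to the two functions. Finding the parts is the easy half; removing them cleanly is another matter.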
If you’re the CISO or CTO who hasn’t deployed CDR, it means that you should have paid attention to the fortune cookie that said “It’s time to update your resume.” CDR is going to see the vbaProject.bin file and rip it out of the document… No questions asked. Take no prisoners. May as well take out the vbaData.xml file while we’re at it. After analyzing the document, if no further potentially harmful objects reside in it, CDR will reconstruct the document to its functional state, except the document is now safe to open.
If for some reason the functionality of the document is broken due to the content that was removed, perhaps the problem is the process. It is extremely rare that a document with macros arriving in your inbox is safe. Still, virtually any good CDR implementation will provide a means of obtaining and delivering the unprocessed document. An API-driven solution can allow for custom actions such as redirecting a file to a virtual environment, a quarantine location for manual inspection, or, if you’re feeling lucky, an administrative override.
Moving on, you see that there is an ActiveX directory. What that means to a CISO or CTO is that your fortune cookie told you, “You might want to start sending out resumes” (CVs for my European friends). The contents of ActiveX.xml are irrelevant to CDR. There is the potential for harm, so out it goes.
So, CDR has removed the macros and the ActiveX object. Are we good to go? Not yet. We haven’t looked at the media folder. If you remember my blog about CDR with respect to steganography, I’m flattered! Each of those image files has the potential to contain badness. Whether it is used for data exfiltration, covert communications, or as a malware carrier, the pictures are potentially harmful. Fortunately for you, unless the images are NSFW, CDR will not remove the images; rather, it will make slight alterations that disrupt steganography and remove any appended malware.
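To make “remove any appended malware” a little more concrete, here is a small Python sketch of one narrow piece of the job: cutting off anything that follows the IEND chunk of a PNG. It is an illustration under stated assumptions, not how any particular CDR product works. It handles PNG only, the function names are my own, and it does nothing about data hidden in the pixels themselves, which takes re-encoding the image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, body: bytes = b"") -> bytes:
    """Build one PNG chunk: 4-byte length, 4-byte type, body, 4-byte CRC."""
    crc = zlib.crc32(chunk_type + body)
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def strip_png_trailer(data: bytes) -> bytes:
    """Return the PNG up to and including its IEND chunk, dropping any trailing bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk occupies 12 bytes of framing plus its declared body length.
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        pos += 12 + length
        if pos > len(data):
            raise ValueError("truncated chunk")
        if chunk_type == b"IEND":
            return data[:pos]
    raise ValueError("no IEND chunk found")


if __name__ == "__main__":
    # A structurally minimal PNG (header chunk plus end chunk, no pixel data).
    clean = PNG_SIGNATURE + make_chunk(b"IHDR", b"\x00" * 13) + make_chunk(b"IEND")
    tainted = clean + b"MZ...something that should not be here"
    print(len(tainted) - len(strip_png_trailer(tainted)), "bytes removed")
```

The walker trusts the declared chunk lengths and does not verify CRCs, which is fine for a demonstration and not fine for production.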
So now, we process the images, remove the ActiveX object, remove the macros, reconstruct the document, and give the CISO or CTO a fortune cookie with a more enjoyable message. Go ahead and try this manually. Remove the macros, remove the ActiveX object, process the images, and zip it all up. It doesn’t work that way; there’s more to reconstructing the document than meets the eye. And this was easy stuff. Wait until we get to recursion.
The wait is over! As you can see, we have a file named sexy.zip. Betcha can’t wait to see what’s inside. Nobody else could, so they opened it and infected the whole office. CDR has to deal with multiple types of archives and nested files. Here’s what may seem like an extreme example, until you remember ZIP bombs! Let’s get to work on this.
Inside of sexy.zip, we see recursive.docx. I’ve taken the liberty of extracting recursive.docx, and you can see that this document contains a folder called “embeddings.” Shall we… never mind, rhetorical question. If you are not familiar with the nursery song “There Was an Old Lady Who Swallowed a Fly” then google it, as that was the inspiration for all current forms of recursion.
Inside of the “embeddings” folder we see an Excel worksheet and two binary OLE objects. It’s not what you think… probably. Don’t worry, it’s worse than you think. I’m going to extract these in order to show you what is inside of the recursed objects. Before we go on to the worksheet, let’s inspect the OLE objects. Yep, I extracted the contents of object1.bin, and one of the files, “contents,” has a surprise in store. As you can see from the start of the hex dump, the file is actually an embedded PDF. PDFs are a file type that can contain potentially harmful content, and as such need to be processed with CDR.
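Spotting that PDF comes down to reading the first few bytes of the file instead of trusting its name. A minimal Python sketch of that kind of signature check follows. The signature table is a short, hand-picked list and `sniff_type` is a hypothetical helper name; it also assumes the signature sits at the very start of the data, as it does in the hex dump here.

```python
# (signature, label) pairs checked at offset 0; a small illustrative subset.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also covers .docx, .docm, .xlsx, .odt, ...
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Office / OLE container
    (b"BZh", "bzip2"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def sniff_type(data: bytes) -> str:
    """Guess a file type from its leading bytes, ignoring the file name entirely."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # tar has no signature at offset 0; the "ustar" marker sits at offset 257.
    if data[257:262] == b"ustar":
        return "tar"
    return "unknown"


if __name__ == "__main__":
    # A file called "contents" with no extension still gives itself away.
    print(sniff_type(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
```

Anything that comes back as “unknown” deserves suspicion rather than a free pass.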
And so, you see, CDR also has to deal with filetype validation or it will not discover the embedded PDF. Oleobject2 is fun. Looking at the hex dump you can see “Word Example/tar.bz2”. Oh great, we have something in a tar file that is inside of a bzip2 file. I’ve taken the liberty of extracting Oleobject2.bin and providing a hex dump of the contents file. It’s our friend “Word Example,” which is contained in a tar file that is inside of a bzip2 file. And remember, that document file is actually a zip file too.
Object1, the PDF, cannot simply be removed. The PDF can be sanitized, but it needs to remain for the document to be considered functional. So, CDR processes the PDF and reconstructs it prior to putting it back into its container. Object2 has multiple objects that also must be processed. All of the Office files will contain images that need to be processed. And so we end up with file systems inside of file systems, which CDR systems must deconstruct, process, recurse into, deconstruct again, and so on until the original zip container is sanitized and returned to the email, which also had components to be sanitized before being delivered as a safe email and attachment. Every single piece of potentially dangerous content has been removed, with one caveat… nothing is 100%, but CDR virtually eliminates the threat of email-borne attachments.
One more thing to consider. Let’s say a zero-day comes in through email, but the payload is in a file that has been sanitized by CDR. Although the exploit may have detonated, the payload was neutralized. While this example was manufactured for demonstration purposes, the degrees of recursion can make it difficult for traditional technologies to deal with, and even when they do, that picture of the purple dinosaur in the final embedded image is quite easily programmatically accessible for nefarious intentions.
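The “deconstruct, process, recurse” loop is easier to picture in code. Below is a rough Python sketch that walks nested zip and tar (including tar.bz2) containers and yields every leaf file it finds, with a depth limit and a size budget as a nod to ZIP bombs. The names `iter_leaves`, `MAX_DEPTH`, and `MAX_TOTAL_BYTES` and their values are assumptions for illustration. It does not unwrap OLE objects and it sanitizes nothing; it only shows the recursion.

```python
import io
import tarfile
import zipfile

MAX_DEPTH = 8                       # hypothetical nesting limit
MAX_TOTAL_BYTES = 50 * 1024 * 1024  # hypothetical budget for all extracted data


def _open_container(data: bytes):
    """Return a list of (name, declared_size, read_fn), or None if not an archive."""
    buf = io.BytesIO(data)
    if zipfile.is_zipfile(buf):
        zf = zipfile.ZipFile(buf)
        members = [(i.filename, i.file_size, lambda i=i: zf.read(i))
                   for i in zf.infolist() if not i.is_dir()]
        return members or None
    try:
        # "r:*" lets tarfile detect plain, gzip, bzip2, or xz compressed tar data.
        tf = tarfile.open(fileobj=io.BytesIO(data), mode="r:*")
    except tarfile.TarError:
        return None
    members = [(m.name, m.size, lambda m=m: tf.extractfile(m).read())
               for m in tf.getmembers() if m.isfile()]
    return members or None


def iter_leaves(data: bytes, path: str = "", depth: int = 0, budget=None):
    """Yield (path, bytes) for every non-archive file found, however deeply nested."""
    if budget is None:
        budget = [MAX_TOTAL_BYTES]  # one shared, mutable counter for the whole walk
    if depth > MAX_DEPTH:
        raise ValueError(f"nesting deeper than {MAX_DEPTH} levels at {path!r}")
    members = _open_container(data)
    if members is None:
        yield path, data
        return
    for name, size, read in members:
        # Check the declared size before extracting the member.
        budget[0] -= size
        if budget[0] < 0:
            raise ValueError("extracted data exceeds budget (possible ZIP bomb)")
        child_path = f"{path}/{name}" if path else name
        yield from iter_leaves(read(), child_path, depth + 1, budget)
```

Usage is a one-liner: `for path, blob in iter_leaves(open("sexy.zip", "rb").read()): print(path, len(blob))`. A real implementation would run each leaf through the appropriate sanitizer and then rebuild every container on the way back up, which is the hard part this sketch leaves out.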
Recursion is cool and necessary, but right now it’s the simple flat documents, PDFs, and scripts that are pummeling enterprises left and right, and a quality CDR product would dramatically reduce risk. Security is, after all, risk management. In addition to being deployed in email, CDR is an essential component of kiosks that sanitize data before it is transferred to resources on air-gapped networks, of secure vaults, and of file uploads from both employees and third-party vendors. If you were not aware of CDR technology before, you will become increasingly aware of it as it is finally gaining traction. While initially there were very few security vendors providing the technology, it is increasingly becoming a part of the portfolio of security solutions offered by vendors.
Randy Abrams
Senior Scapegoat
SecureIQLab