Thursday, May 17, 2018

Parsing and reporting on hyperlinks in email using EWS and REST (eg looking for baseStriker) in Exchange and Office365

It's been quite a busy week in email security, with two new vulnerabilities released in the past 7 days: first BaseStriker and now EFail. While it's still too early to gauge the implications of both of these flaws, what they have in common is that they use the HTML body of a message and the underlying HTML markup tags to make the exploits work. With BaseStriker it's the use of the base href tag in an HTML document, and with EFail it's the use of an img src tag to send decrypted email contents to an external server (this is an oversimplification).

In this post I'm going to look at how you can parse the HTML links and image src tags from messages that are sitting in a Mailbox (so after any Transport pipeline filtering) and provide a level of reporting on these. Basically, because we are going to be using the Mailbox APIs for this, we are looking directly at what's available to any email client in terms of links and images.

The Challenge

The challenge with this type of problem is that, by its very nature, the payload you're looking for varies so much that any form of formal search for a static URL will fail, as phishers and spammers have developed ways of getting around scanning methods that just look statically for values (basically ruling out Search-Mailbox). So one way to attack this is to get the body content of the messages one by one (which is an expensive thing to do in terms of time and resources) and do the scanning at the client end.

Different types of Message Bodies

The format of Message bodies can vary depending on the Mail Agent (eg the email client) that is sending the Message. For example, in Exchange you could have a native body type of RTF, HTML or Text (or it could be multipart). If, for example, you are using Outlook and you have chosen RTF as the body type when sending a Message to another user on the same Exchange server, then only the native RTF body will be stored for the Message, and the Exchange Store will do an on-the-fly conversion of the RTF body to HTML when the first client requests the HTML body. The Best body algorithm describes this problem in more detail. With my scripts I've chosen to use the PidTagBodyHtml extended property for the HTML body because I found this gave me the rawest version of the HTML body, which was important for getting the most accurate link report.


You would think that parsing HTML would be a pretty basic and easy thing to do in any API, and it is up to a point. Eg a lot of people point towards using this method in PowerShell to parse HTML

$HTMDoc = New-Object -com "HTMLFILE"

While this works okay and produces a nice result with all the links and images in a collection, it is also essentially rendering the HTML. That means it will execute any JavaScript in the HTML (which shouldn't be there for email) and it also downloads all the images in the src links. On suspect content this isn't what you really want to be doing, and the same goes for marketing-type emails, because images in emails are often used to perform beaconing. So if you're looking to do something similar to this yourself, be very careful about using any objects that parse the HTML into a DOM (especially those that reuse browser objects like the above example), as there might be unintended consequences if you don't fully understand how the object you are using parses the content.

With my script I'm relying firstly on a very simple RegEx to get all the HTML tags, then some other filtering code to pull out the attributes for href links, base and src links, and then some further code to expand any base URL links. While this isn't perfect and does fail in some instances, it's at least safe as it won't activate any content, and generally you can just tweak the code to work around any failures.
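To give a rough idea of what the regex-only approach looks like, here is a minimal sketch. It is not the code from the EWS script or the Exch-REST module; the function names (Get-AttributeValueSketch, Resolve-LinkSketch, Get-HtmlLinksSketch) and the example.com URLs are made up for illustration. It treats the HTML purely as text, so nothing is rendered, executed or downloaded.

```powershell
# Hypothetical helper: read one attribute value out of the text of a single tag.
function Get-AttributeValueSketch {
    param([string]$Tag, [string]$Name)
    # Handles name="value", name='value' and unquoted name=value
    $pattern = '\s' + [regex]::Escape($Name) + '\s*=\s*(?:"([^"]*)"|''([^'']*)''|([^\s>]+))'
    $m = [regex]::Match($Tag, $pattern, 'IgnoreCase')
    if (-not $m.Success) { return $null }
    foreach ($i in 1..3) {
        if ($m.Groups[$i].Success) { return $m.Groups[$i].Value }
    }
}

# Hypothetical helper: turn an attribute value into a Uri, using the base URL if there is one.
function Resolve-LinkSketch {
    param([string]$Value, [Uri]$BaseUri)
    if ([string]::IsNullOrWhiteSpace($Value)) { return $null }
    $parsed = $null
    if ($BaseUri) {
        # An absolute value still wins over the base; a relative one is expanded against it
        if ([Uri]::TryCreate($BaseUri, $Value, [ref]$parsed)) { return $parsed }
    }
    elseif ([Uri]::TryCreate($Value, [UriKind]::Absolute, [ref]$parsed)) { return $parsed }
    return $null
}

function Get-HtmlLinksSketch {
    param([string]$Html)
    # Very simple RegEx that grabs every tag as plain text
    $tags = [regex]::Matches($Html, '<[^>]+>') | ForEach-Object { $_.Value }

    # First pass: look for a <base href="..."> tag
    $baseUri = $null
    foreach ($tag in $tags) {
        if ($tag -match '^<base[\s/>]') {
            $href = Get-AttributeValueSketch -Tag $tag -Name 'href'
            $parsed = $null
            if ($href -and [Uri]::TryCreate($href, [UriKind]::Absolute, [ref]$parsed)) {
                $baseUri = $parsed
                break
            }
        }
    }

    # Second pass: collect href links and img src links
    $links = @()
    $images = @()
    foreach ($tag in $tags) {
        if ($tag -match '^<a[\s>]') {
            $uri = Resolve-LinkSketch -Value (Get-AttributeValueSketch -Tag $tag -Name 'href') -BaseUri $baseUri
            if ($uri) { $links += $uri }
        }
        elseif ($tag -match '^<img[\s/>]') {
            $uri = Resolve-LinkSketch -Value (Get-AttributeValueSketch -Tag $tag -Name 'src') -BaseUri $baseUri
            if ($uri) { $images += $uri }
        }
    }

    [PSCustomObject]@{
        HasBaseURL = [bool]$baseUri
        BaseURL    = $baseUri
        Links      = $links
        Images     = $images
    }
}

# Invented sample body: one relative link that depends on the base href, one absolute image
$sample = '<html><head><base href="https://example.com/a/"></head><body><a href="login.html">x</a><img src="https://img.example.net/p.gif"></body></html>'
$result = Get-HtmlLinksSketch -Html $sample
$result.Links | Select-Object AbsoluteUri
```

Like any regex approach this will miss or mangle some malformed markup, so treat it as a starting point to tweak rather than a complete parser.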


I've created an EWS version and a Graph/REST version of this code, which should be usable against both on-premises Exchange and Office365. The EWS version can be found in GitHub here; the Graph version is in my Exch-REST module, which is available from the PowerShell Gallery and GitHub (version 3.8).

The Code

The code I've written is separated into two functions. The first function (in its EWS and REST versions) is

 Get-EWSBodyLinks -MailboxName -FolderPath \Inbox -MessageCount 500

 Get-EXREmailBodyLinks -MailboxName -FolderPath \Inbox -MessageCount 500
The inputs are relatively simple: it takes the FolderPath and a MessageCount for the number of messages you want scanned. The function then parses the Message body and builds 3 dictionary objects with the Links, Images and base href details of the underlying HTML body of each message that is scanned. These are then added back as a property on the EWS Managed API or custom REST object so that it's available for further pipeline or script processing in PowerShell.

These properties are collections of URI objects, so you can do further things like

$Messages[0].ParsedLinks.Links | select absoluteuri

to just show the absolute URI of each link on a message, or if you were just interested in links from a particular URL shortener you could use

$Messages[0].ParsedLinks.Links | where-object dnsSafehost -eq ""
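Because the entries are System.Uri objects, any Uri property can be used for filtering or grouping. Here is a small self-contained sketch that uses invented short.example and example.com URLs as a stand-in for the parsed links of a real message:

```powershell
# Stand-in for $Messages[0].ParsedLinks.Links; these URLs are invented for illustration
$links = [Uri[]]@(
    'https://short.example/abc123',
    'https://www.example.com/newsletter?id=42',
    'https://short.example/xyz789'
)

# Only the links whose host matches a given (hypothetical) shortener domain
$shortened = @($links | Where-Object DnsSafeHost -eq 'short.example')
$shortened | Select-Object AbsoluteUri

# Number of links per host on the message
$links | Group-Object DnsSafeHost | Select-Object Name, Count
```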

And a whole number of other things

BaseStriker Reporting

In the instance where you want to see which emails are using base href tags (which may or may not be related to BaseStriker), you can use the following

$BaseHrefMessages = Get-EWSBodyLinks -MailboxName -FolderPath \Inbox -MessageCount 10000 | where-object {$_.ParsedLinks.HasBaseURL -eq $true} 

$BaseHrefMessages =  Get-EXREmailBodyLinks -MailboxName -FolderPath \Inbox -MessageCount 500 | where-object {$_.ParsedLinks.HasBaseURL -eq $true}  
These examples will return a collection of Messages that are using the BaseURL which you can then have a look at further. For example if you had a Mail that was matching Avanan's sample for BaseStriker the ParsedLinks property on a returned message would look like

In the parsing code I expand out the relative URLs that are used when there is a base URL in the document.

In most of the scanning that I did on my own email there were a few companies that used the base URL legitimately. For instance, it seems to be used in OneDrive in the invitation message that gets sent out when you share an item.


The second set of cmdlets I've written takes the data from the above functions and then performs a consolidation report on the domains in the href links, the domains in the img src links, and the href and img src links themselves. For each of these reporting areas it counts the number of times the link or domain appears and the number of messages that it appears in. To run the reports

$Report = Get-LinkReport -MailboxName -FolderPath \Inbox -MessageCount 100

$Report = Get-EXREmailLinkReport -MailboxName -FolderPath \Inbox -MessageCount 100
In these examples you will end up with a $Report variable that contains collections that you could export to CSV or do some further manipulation eg

$Report = Get-EXREmailLinkReport -MailboxName -FolderPath \Inbox -MessageCount 100
$report.Domains | Sort-Object MessageCount -Descending
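To illustrate the kind of consolidation involved, here is a minimal sketch of counting, per domain, how many links point at it and how many distinct messages contain it. This is not the code behind the report cmdlets; the sample messages, the Id values and the LinkCount property name are assumptions made up for the example.

```powershell
# Invented sample data, shaped like messages that carry a ParsedLinks.Links collection
$messages = @(
    [PSCustomObject]@{
        Id          = 'msg1'
        ParsedLinks = [PSCustomObject]@{ Links = [Uri[]]@('https://a.example/1', 'https://a.example/2', 'https://b.example/x') }
    },
    [PSCustomObject]@{
        Id          = 'msg2'
        ParsedLinks = [PSCustomObject]@{ Links = [Uri[]]@('https://a.example/3') }
    }
)

# Flatten to one row per (message, link) pair
$rows = foreach ($message in $messages) {
    foreach ($link in $message.ParsedLinks.Links) {
        [PSCustomObject]@{ MessageId = $message.Id; Domain = $link.DnsSafeHost }
    }
}

# Consolidate per domain: total link occurrences and number of distinct messages
$domainReport = $rows | Group-Object Domain | ForEach-Object {
    [PSCustomObject]@{
        Domain       = $_.Name
        LinkCount    = $_.Count
        MessageCount = @($_.Group | Select-Object -ExpandProperty MessageId -Unique).Count
    }
}

$domainReport | Sort-Object MessageCount -Descending
```

The same flatten-then-group pattern works for the img src links or for full URLs instead of domains.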


There are a lot of links and images used within email, so this type of parsing will produce a lot of data that you need to filter or process further. Eg if you start to find links that you think might be suspect, you may want to look at using a service like VirusTotal, which has the ability to scan suspect links and return the results using an API. They also provide a paid-for private API if you're going to do this at high volume. The other thing is that downloading the body of each email is a pretty costly process, so watch out for throttling if you're doing this on a large scale.
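One generic way to cope with throttling is to retry a failed call after a growing delay. The following is only a sketch under assumptions: Invoke-WithBackoffSketch is a made-up helper, it only helps when the wrapped call surfaces throttling as a terminating error, and it does not inspect any EWS or Graph specific throttling responses.

```powershell
# Hypothetical helper: retry a script block, doubling the wait after each failure.
function Invoke-WithBackoffSketch {
    param(
        [scriptblock]$Action,
        [int]$MaxAttempts = 5,
        [int]$InitialDelayMs = 1000
    )
    $delay = $InitialDelayMs
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            # Give up and rethrow once the last attempt has failed
            if ($attempt -eq $MaxAttempts) { throw }
            Start-Sleep -Milliseconds $delay
            $delay = $delay * 2
        }
    }
}

# Demonstration with a fake action that fails twice before succeeding
$script:calls = 0
$outcome = Invoke-WithBackoffSketch -InitialDelayMs 10 -Action {
    $script:calls++
    if ($script:calls -lt 3) { throw 'simulated throttling error' }
    'done'
}
$outcome
```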