Using Hadoop to process Active Directory user lists

Powershell offers some really nice exporting and reporting capabilities. Most sysadmins will have a script like the following in their toolkit:

Import-Module activedirectory
Get-ADUser -filter *  -properties distinguishedname,lastlogontimestamp | export-csv -path .\Downloads\userlist.csv

It's a nice little script that will produce a csv file containing a user list, and a few keye elements. Usually you would import this into Excel. The file looks something like this: "DistinguishedName","Enabled","GivenName","lastlogontimestamp","Name","ObjectClass","ObjectGUID","SamAccountName","SID","Surname","UserPrincipalName" "CN=user,CN=Users,DC=mydomain,DC=local","True","OMGAUser","131402913451305734","Administrator","user","33df0732-43b8-4682-a214-b44d280bdb01","Administrator","S-1-5-21-2423464028-1209010664-1514014521-500","Account","user@users"

Unfortunately I hit a snag yesterday when I ran this against a domain that had a whopping 120 users.

At this point, it was vastly in excess of what Excel could manage, so I did what everyone does with an uncontrollable data set and called my Oracle account manager.

Within 20 minutes he showed up in his new Bentley and proceeded to help me calculate the licensing costs. Oracle's licensing formula is:

licences = (rows * cols) ^ tables

Fortunately we only had one table to deal with, but with 120 rows and 11 columns, we're staring down the barrel of 1320 Oracle licences. At $2,000 a pop + 20% account manager tip, standard processes for dealing with data were starting to become infeasible at this scale.

Building a processing script

Our goals are:

  • Filter out disabled users
  • Filter out users that should be disabled based on last logon
  • Group them by OU

With a classic map reduce put together, we've got one sticking point. Active Directory's timestamp is annoyingly formatted. Here's a quick but of Ruby to help show the conversion:

=> #<DateTime: 2017-05-26T16:55:45+00:00 ((2457900j,60945s,0n),+0s,2299161j)>

We're going to use Netflix Pigpen to process this data. Clojure really is a great way to understand data wrangling. Let's start with a function that walks through what we're doing with it:

(defn processusers
[input output]
    (pig-data input)
    (pig/map (partial zipmap [:cn :enabled :first :logon]))
    (pig/filter isenabled)
    (pig/filter filterold)
    (pig/map #(zipmap [:user :group]
        (rest (re-find #"CN=(.+?),OU=(.+?)," (get % :cn)))))
    (pig/group-by :group)
    (pig/map (fn [[word occurrences]] [word (count occurrences)]))
    (pig/store-json output)

Walking through this function, we: - Read the csv data - Map columns to a hash of four items, throw out the rest - Filter disabled users - Filter old users - Using a regex, extract the name and OU - Group by OU - Perform the classic wordcount function on OU members - Store JSON

All that's left is to fill in the functions.

(defn isenabled
(if-not (= (get user :enabled) "True") false true))

(defn pig-data
(pig/load-csv input))

(defn getage
(if (nil? lastlogin) -1000
; Conversion to seconds (age/10000000)-11644473600
(quot (-(- (quot lastlogin 10000000) 11644473600) (quot (System/currentTimeMillis) 1000)) 86400))))

(defn filterold
(let [daysold (getage (parse-int (get user :logon)))]
(if (< daysold -30) false true)))

(defn parse-int [s]
    (when-let [d (re-find #"-?\d+" s)] (BigInteger. d)))

The "getage" function is unfortunately horrible. I'm not sure how to improve this. Now you've just got to build the Pig script:

(require '[pigpen.pig])
(pigpen.pig/write-script "my-script.pig" (mypigrun "$INPUT" "$OUTPUT"))

And the .jar file, and a quick hack on the script. Substitute bucket name as appropriate:

$ lein uberjar
$ cp target/uberjar/myapp-0.1.0-SNAPSHOT-standalone.jar pigpen.jar
$ cat /tmp/big/my-script.pig  | sed 's_pigpen\.jar_s3://examplebucket/pigpen.jar_' > my-s3script.pig

Now strap ourself in, because with just a few commands we're going to launch a six server cluster, at two millionth of the cost of running Oracle.

$ aws s3 mb s3://examplebucket
$ aws s3 cp pigpen.jar s3://examplebucket/pigpen.jar
$ aws s3 cp input.csv s3://examplebucket/input.csv
$ aws s3 cp myscript.pig s3://examplebucket/my-s3script.pig
$ aws emr create-cluster --name "Pig Cluster" --release-label emr-5.5.0 --applications Name=Pig \
--use-default-roles --instance-type m3.xlarge --instance-count 6 --log-uri s3://examplebucket --auto-terminate \
--steps Type=PIG,Name="Pig Program",ActionOnFailure=CONTINUE,Args=[-f,s3://examplebucket/my-s3script.pig,-p,INPUT=s3://examplebucket

Come back in around two hours and for just a few dollars you'll have a great Active Directory report waiting for you:

$ aws s3 ls s3://examplebucket/output/
2017-05-23 06:44:09          0 _SUCCESS
2017-05-23 06:44:09        358 part-v001-o000-r-00000
$ aws s3 cp s3://examplebucket/output/part-v001-o000-r-00000 .
download: s3://examplebucket/output/part-v001-o000-r-00000 to ./part-v001-o000-r-00000


Obviously at 120 rows, the scale comments aren't meant to be taken literally. This guide does however, provide a very workable framework for processing very large amounts of data. Whilst I'm proclaiming Clojure a really good way of wrangling this sort of data - I've been using it for a week. This code could probably be a lot better.