Dentsu Inc. is a marketing and PR firm based in Japan, though this project was commissioned by their UK office. Through this contract position, I also had the chance to work with GM and Jaguar, as well as help implement the reporting engine described below for Microsoft.
Marketing is a difficult process, and it's even more difficult if you don't know who you can market to or how to reach that population. The data to answer those questions is usually easy enough to retrieve, but in its raw state it's hard to make sense of. This is where this reporting project comes in.
The client wanted a cloud-agnostic system so that they weren't locked into any one provider, but they also needed something that could produce meaningful reports from the data by the end of the pipeline. So the first two requirements were "no lock-in" and "as simple as possible".
The next requirement was that the data comes from a range of sources, from APIs to FTP servers. That data needed to be retrieved, cleaned and then stored in a way that reports could be built from it.
Finally, different companies could want different reports built from different data coming from different data sources. So I would need to design a schema that a company could create that told us "where the data is", "when it needs to be gotten", "how that data fits into the report" and "what graphs are needed for the report to be meaningful".
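To make that concrete, here is a rough sketch of the kind of schema I'm describing; the field names and values are illustrative, not the actual schema the client used.

```python
# Hypothetical sketch of one company's report schema — names and values are
# illustrative only, but it covers the same four questions described above.
report_schema = {
    "company": "example-client",
    "sources": [  # "where the data is"
        {"name": "twitter_api", "type": "api",
         "endpoint": "https://api.twitter.com/2/tweets/search/recent"},
        {"name": "campaign_ftp", "type": "ftp",
         "host": "ftp.example.com", "path": "/exports/daily"},
    ],
    "schedule": "0 2 * * *",  # "when it needs to be gotten" (daily at 02:00)
    "mappings": [  # "how that data fits into the report"
        {"source": "twitter_api",
         "fields": {"public_metrics.like_count": "likes", "created_at": "date"}},
    ],
    "graphs": [  # "what graphs are needed for the report to be meaningful"
        {"type": "line", "x": "date", "y": "likes", "title": "Engagement over time"},
    ],
}
```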
The Architecture
Cloud agnostic is a difficult concept to build from because it uses the same basic ideas as all of the managed services, but with twists you don't expect. The client for this project wanted to be able to stand up environments with as little manual effort as possible, on any cloud provider or even on-prem if the need arose. This meant we needed to use open-source products such as the Hadoop ecosystem to make this requirement a reality. Using Hadoop gave us access to well-utilised, scalable Big Data tools with which we could build a pipeline that matched our requirements.
Extracting data from a source, Transforming it to fit a pre-determined structure and then Loading it into a data lake is a well-known pipeline pattern (ETL). But it still has its challenges, especially when more than one of the 3 Vs (volume, velocity and variety) is involved. On this occasion the data was "Big" not only because of the size of the individual files (a minimum of 4 GB each), but also because of the sheer number of them.
Starting with the glue that tied the whole thing together: Airflow and Kafka. Airflow was used as the orchestrator partly because it kept us cloud agnostic, but also because it has a proven track record of orchestrating many DAGs at once, which is exactly what we needed for this project. Kafka came in for scalability: we were dealing with so many files from so many different places that it was impossible not to use some kind of queuing engine. We chose Kafka because it lets us buffer messages arriving at high speed so that we can batch and process them (and because it isn't tied to any major cloud…so bonus there), and for how readily it integrates with the Hadoop environment.
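As a rough illustration of the orchestration side, a DAG along these lines would tie the stages together; the task names and helper functions here are hypothetical sketches rather than the project's actual code.

```python
# Minimal Airflow DAG sketch (Airflow 2.x style) — extract, transform and
# report tasks chained in order. The callables below are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    ...  # pull the raw files and push them onto the Kafka topic

def transform(**context):
    ...  # kick off the PySpark job that cleans the batch

def report(**context):
    ...  # build the Plotly graphs and render the PDF

with DAG(
    dag_id="example_client_daily_report",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 2 * * *",  # would come from the client's schema
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    report_task = PythonOperator(task_id="report", python_callable=report)

    extract_task >> transform_task >> report_task
```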
At the front of the pipe were the Connectors, which started out as custom-built consumers that pulled data from a variety of sources (the Twitter API, Google Campaign Manager or even Facebook) and pushed those files into the pipeline. We later moved over to a SaaS product called Adverity, which had all of the connection options we had built and more. This meant we no longer needed to maintain the connectors ourselves, so we could focus on more troublesome pieces of the pipe.
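Before the switch to Adverity, a custom connector was roughly this shape: pull a batch from a source API, then hand each record to Kafka. The endpoint, topic name and credential handling below are illustrative only.

```python
# Sketch of a custom connector: fetch from an API, queue records for the pipeline.
# Uses kafka-python; the endpoint and topic are hypothetical.
import json
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def pull_and_publish(endpoint: str, token: str, topic: str = "raw-marketing-data"):
    """Pull one page of results from a source API and queue each record."""
    response = requests.get(
        endpoint, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
    response.raise_for_status()
    for record in response.json().get("data", []):
        producer.send(topic, value=record)
    producer.flush()
```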
Transforming that much data in a timely fashion is its own kettle of fish, and this is where Apache Spark (via PySpark) came in: its distributed processing meant that no matter how big the files got, we could scale the work over multiple nodes to keep processing time down. One of the hardest parts of the transformation process was knowing which file needed to be transformed in which way, so we kept a schema as the source of truth for what the files looked like when they entered the PySpark job and what they had to become by the end. It also meant we could "version" the schemas as APIs changed and add new ones as the client required.
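In practice the schema-driven transform looked something along these lines; this is a simplified sketch, and the mapping dictionary mirrors the hypothetical schema above rather than the real one.

```python
# Simplified schema-driven PySpark transform: rename raw columns to the
# report's canonical names and drop everything else. Paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema_driven_transform").getOrCreate()

def transform_source(path: str, mapping: dict):
    """Apply one source's field mapping from the schema to its raw files."""
    df = spark.read.json(path)
    selected = [F.col(raw).alias(canonical)
                for raw, canonical in mapping["fields"].items()]
    return df.select(*selected).dropna()

# e.g. mapping = {"fields": {"public_metrics.like_count": "likes",
#                            "created_at": "date"}}
# transform_source("s3a://raw-bucket/twitter/2020-01-01/*.json", mapping) \
#     .write.mode("overwrite").parquet("s3a://clean-bucket/twitter/2020-01-01/")
```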
The final part of the pipeline is the Reporting, and this is where the schema was most useful. At this point I would have all the data transformed and ready to go, but without a source of truth there would be no way to tell where the data fitted into the graphs. After reading the schema into a PySpark job, I would first look at which graphs were required. Then I would find the transformed data that went with those graphs; the data was clean from the original transformation, but it didn't necessarily fit any particular graph, and sometimes the same data was used for multiple graphs. So each graph needed its own "A" to "B" transformation specific to that graph. Once the data was in the right format, it was passed to Plotly to build and render the graphs. The graphs were rendered into HTML, which was dumped onto a page, later turned into a PDF and dropped in a bucket for users to download on the front end.
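The rendering step can be sketched roughly like this; the function name and graph configuration are illustrative, and the HTML-to-PDF tool named in the comment is an assumption rather than necessarily what the project used.

```python
# Sketch of the final rendering step — graph config comes from the schema's
# "graphs" list; function and config names are hypothetical.
import plotly.express as px

def render_graph_html(df, graph_cfg: dict) -> str:
    """Turn one cleaned pandas DataFrame into an embeddable HTML fragment."""
    fig = px.line(df, x=graph_cfg["x"], y=graph_cfg["y"], title=graph_cfg["title"])
    return fig.to_html(full_html=False, include_plotlyjs="cdn")

# The fragments were concatenated into one page, converted to PDF
# (e.g. with a tool like wkhtmltopdf — an assumption) and dropped in a bucket.
```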
As projects go, this was one of the more difficult ones. I was simply told what had to be built, and it was up to me to gather requirements, plan how things would fit together and then implement it.
Python
PostgreSQL
Kubernetes
Hadoop
(Not all of Hadoop but enough that I can't put it all here)
Apache Airflow
PySpark
Django Rest Framework