The whole idea behind the Contract Action Extraction System (CAES) was to upload a contract, find the parts of it that were “something someone had to do”, and pull them out so that they could be monitored. We simply referred to these as “actions”. An action could be indefinite, such as having to meet certain standards for the duration of the contract, or time-specific, such as needing to finish building a house by X date. So let’s break down what needs to happen for an action to be extracted.
Firstly, we needed to define what an “action” is so that we could figure out how to extract one. The best definition we reached was that an “action” is any instance of one party being told that something has to be done, regardless of whether a date is attached to it. That is quite an open definition, though, so it didn’t give us much room to breathe from an extraction point of view.
Next, we needed a way to take a bunch of text and get the actions out of it. (We wanted to prove this was possible before tackling the difficulty of reading PDFs, so as not to waste time.) Using regular expressions was out of the question because there simply weren’t specific enough patterns for us to hook on to. This left us with machine learning. We could either build an ML model from scratch and spend the time and data needed to build its accuracy up, or use an out-of-the-box implementation and try to fit it to our needs. This is where Google Dialogflow came in.
Using Dialogflow meant that we didn’t need to create our own model and try to make it work with the limited amount of data that we had. We could build the intents out as much as possible and, if and when something didn’t fit, alter them as needed. So now the intents just needed to be built. That was difficult, mostly because data like this is hard to come by and, beyond that, the “legalese” in contracts is hard enough for humans to read, never mind a model. To build the intents, we pulled out actions and split them into their different parts. This meant that we could use the data to find the “do something” and, if it had a time frame, the “do this by this”.
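To make the idea of splitting an action into parts more concrete, here is a minimal sketch of how such a record might look. The `Action` class, its field names and the two example sentences are all made up for illustration; they are not taken from the real project or its training data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Action:
    """One extracted obligation, split into its parts (hypothetical shape)."""

    raw_text: str                    # the original contract line
    obligation: str                  # the "do something" part
    deadline: Optional[date] = None  # the "by this" part; None if indefinite

    @property
    def is_time_specific(self) -> bool:
        # An action with no deadline applies for the duration of the contract.
        return self.deadline is not None


# Invented example sentences, one indefinite and one time-specific.
indefinite = Action(
    raw_text="The Contractor shall comply with all applicable safety standards.",
    obligation="comply with all applicable safety standards",
)
dated = Action(
    raw_text="The Contractor shall complete construction by 1 March 2021.",
    obligation="complete construction",
    deadline=date(2021, 3, 1),
)
```

Keeping the deadline optional mirrors the two kinds of action described earlier: the same record covers both the indefinite and the time-specific case.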
Once the intents were built, we would pass the text to Dialogflow's API and read off which intents it found. We then extracted the matching lines from the text and saved both the raw line and its corresponding “action” to the database so that it could be monitored later.
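The flow described here can be sketched roughly as follows. This is not the project's code: the `detect` argument stands in for whatever function wraps the Dialogflow call, and `fake_detect`, the intent name `action.obligation` and the `min_confidence` threshold are all assumptions made so the example runs on its own.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional


class Detection(NamedTuple):
    intent: str        # name of the matched intent
    confidence: float  # score between 0.0 and 1.0


class ExtractedAction(NamedTuple):
    raw_line: str  # the line as it appeared in the contract
    intent: str    # the intent it was matched to


def extract_actions(
    lines: Iterable[str],
    detect: Callable[[str], Optional[Detection]],
    min_confidence: float = 0.6,  # assumed threshold, not from the project
) -> List[ExtractedAction]:
    """Keep each non-empty line for which the detector reports a confident intent."""
    found = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        result = detect(text)
        if result is not None and result.confidence >= min_confidence:
            found.append(ExtractedAction(raw_line=text, intent=result.intent))
    return found


def fake_detect(text: str) -> Optional[Detection]:
    """Toy stand-in for the remote call: flags any line containing 'shall'."""
    if "shall" in text.lower():
        return Detection(intent="action.obligation", confidence=0.9)
    return None


contract_lines = [
    "This Agreement is made between the Client and the Contractor.",
    "The Contractor shall complete construction by 1 March 2021.",
    "",
]
actions = extract_actions(contract_lines, fake_detect)
```

Passing the detector in as a function keeps the extraction loop testable without network access; the returned list is what would then be written to the database.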
Architecturally this was a challenge as well, because reading a PDF, cleaning possibly hundreds of blocks of text and then sending those blocks to the Dialogflow API could take a lot of time. Beyond that, the application still needed to be able to show contracts, perform basic CRUD operations and show how far a job had progressed.
The backend was relatively simple by design: a Django Rest Framework project connected to a Postgresql database, which gave us good query times and well-rounded database management. The front end was built in VueJs so that the project could leverage the async power of a Single Page Application, using a theme that was adapted to fit the client’s styling.
For the extraction jobs themselves, the simplest solution seemed to be the best. Because a job could take up to an hour to run (and we weren’t using long contracts, only about 50 pages), we simply couldn’t let it run synchronously. Instead, we decided to split it off and use Celery workers to run the different jobs and just ping back points of interest about each job’s progress. This also meant that we could utilise the power of Kubernetes: if there were more jobs than normal, we could simply increase the number of worker pods.
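As a rough illustration of a job that pings back its progress, here is a small sketch in plain Python. The function and parameter names (`run_extraction_job`, `process_block`, `report_progress`, `report_every`) are hypothetical. In a Celery task declared with `bind=True`, the callback could be a thin wrapper around `self.update_state(state="PROGRESS", meta=...)`, but that wiring is an assumption and not necessarily how the project did it.

```python
from typing import Callable, Dict, List, Sequence


def run_extraction_job(
    blocks: Sequence[str],
    process_block: Callable[[str], List[str]],
    report_progress: Callable[[Dict[str, int]], None],
    report_every: int = 10,
) -> List[str]:
    """Process blocks one at a time, reporting progress at intervals."""
    total = len(blocks)
    results: List[str] = []
    for done, block in enumerate(blocks, start=1):
        results.extend(process_block(block))
        # Report every `report_every` blocks, and always on the final block.
        if done % report_every == 0 or done == total:
            report_progress({"done": done, "total": total})
    return results


# Stand-ins: collect progress updates in a list and "process" by upper-casing.
updates: List[Dict[str, int]] = []
blocks = [f"block {i}" for i in range(25)]
results = run_extraction_job(
    blocks,
    process_block=lambda block: [block.upper()],
    report_progress=updates.append,
)
```

Reporting only at intervals keeps the number of status writes small even when a contract is split into hundreds of blocks, while the front end still gets enough points to draw a progress bar.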
This was one of my first major projects, so it was important for me to learn from the leads and take in as much information as I could. By the end of the project, I was one of the main engineers on it and was helping architect the backlog for version 2 of the Extraction System.
Python
Celery
VueJs
Dialogflow
Postgresql
Django Rest Framework
Kubernetes
Google Cloud Platform
Redis