TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images [Part-2]
This is the second post in a two-part series: an end-to-end case study based on the TableNet research paper.
In the previous post we discussed the PyTorch implementation of TableNet; refer to Part 1.
Table of Contents
- Post EDA of the solution
- Fixing Image Problems and Re-Training
- Improving model predictions using OpenCV
- OCR predictions
- Deployment
- Future Work
- End Notes
- References
Post EDA of the solution
Looking at evaluation metrics alone is not enough to judge model behavior. We should be able to answer questions like:
→ Are we able to explain the model's outputs with respect to its inputs? (Explainability is highly effective in classification and regression models.)
→ Can the model's performance be improved?
→ Can we improve the data to get a better-performing model?
→ Can we somehow know what kind of data the model scores higher on?
To answer these questions, we need to look at our training data and categorize the input images into Bad, Good, and Best data. In the real world we won't get test data with a distribution similar to the training data, so we do post-training EDA on the train data. To do this, we predict table and column masks from the model and rank/categorize the images based on their F1 scores.
For this purpose, we only use the Table F1 score as the benchmark.
Let's pick thresholds of 0.5 and 0.85 for categorizing the images (a minimal scoring sketch follows). After plotting the scores, we see that there are images with an F1 score of 0.0.
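As a rough sketch of this scoring-and-bucketing step, assuming the predicted and ground-truth masks are binary NumPy arrays (the helper names and the epsilon are my additions; the 0.5 / 0.85 thresholds follow the post):

```python
import numpy as np

def mask_f1_score(pred_mask, true_mask, eps=1e-7):
    """F1 score between a predicted and a ground-truth binary mask."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    tp = np.logical_and(pred, true).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (true.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

def categorize(f1, low=0.5, high=0.85):
    """Bucket an image by its Table F1 score."""
    if f1 <= low:
        return "Bad"
    return "Good" if f1 <= high else "Best"
```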
Bad Images [F1 Score: 0.0–0.5]
We see that there are 3 images in the Bad predictions category.
Good Images [F1 Score: 0.5–0.85]
Many images fall in this category; below are 2 of them.
Best Images [F1 Score: >0.85]
Many images fall in this category; below are 2 of them.
Observations
- From the images above, we can see that the Bad (worst) predictions come from images with colored tables. The model didn't predict anything, and the F1 score is close to 0.0. There are very few images in the dataset that have colored tables.
- Good predictions come from images with a well-predicted Table mask, but where the model also predicted columns in places where there actually were none.
- Best predictions come from images that helped the model learn table and column boundaries even without line demarcations.
Fixing Image Problems and Re-Training
We have 2 options which might improve model performance:
- Remove the colored images. [Problem: data reduction is an issue here, as we already have little data.]
- Make the data uniform by first converting all images to grayscale and then expanding them back to 3 channels in preprocessing, then train the model again (a minimal sketch follows this list).
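A minimal sketch of that preprocessing, assuming images are read with OpenCV (the function name is mine):

```python
import cv2

def to_uniform_grayscale(image_path):
    """Convert an image to grayscale, then stack it back to
    3 channels so a 3-channel encoder still accepts it."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # H x W
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2RGB)        # H x W x 3
```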
After following the second approach and fixing the dataset, the model was trained again.
Unfortunately, model performance didn't increase: we see the same performance metrics as before fixing the dataset.
But let's take a look at the Bad Images from the previous section and see if anything has improved.
Re-Evaluating Bad Images from previous section
A significant increase in F1 scores can be seen here. From no predictions at all for tables and columns on colored-table images, we now get an F1 score of 0.92 on both the table and column masks. These images can now be categorized under Best images.
Let's look at the Bad images according to our new model.
We see only 1 Bad image below the lower threshold and 2 Good predictions; the rest fall into the Best predictions category.
Bad Predictions / Images
The lowest F1 score is around 0.37, and there is only 1 Bad image.
It is not a good idea to draw conclusions about the pattern of Bad predictions when we have only 1 image. But it seems the input image has no proper line demarcations that would indicate a table structure; that is why, wherever the model sees line demarcations, it assumes there is a table in that area.
We can now say that even though the new model didn't improve in terms of performance metrics, it did improve the learning and the predictions.
Improving model predictions using OpenCV
We can still see uneven boundaries in the predicted table and column masks. In some cases the predicted table mask is not even filled inside. If we directly crop the mask regions out of the image to get the table, we might lose some information. Not to mention, there are other activated areas in the predicted table mask that are not tables.
To solve these issues, we will use contours, a classical image-processing technique.
Basic Idea:
- Get contours around the activations in the predicted table mask.
- Discard contours that cannot form a rectangle or are just small patches of activation.
- Get the bounding coordinates of each remaining contour.
- Repeat the same process with the column mask.
The code below applies this process to both the table and column masks and returns the table and column coordinates.
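Here is a minimal sketch of such a function; the fixMasks name follows its usage below, while the get_bounding_boxes helper and the min_area thresholds are my assumptions:

```python
import cv2
import numpy as np

def get_bounding_boxes(mask, min_area=3000):
    """Find contours in a binary mask, drop small activation
    patches, and return bounding boxes of the remaining regions."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    boxes = []
    for cnt in contours:
        if cv2.contourArea(cnt) < min_area:  # skip small / non-table blobs
            continue
        x, y, w, h = cv2.boundingRect(cnt)
        boxes.append((x, y, x + w, y + h))
    return boxes

def fixMasks(table_mask, column_mask):
    """Apply the contour cleanup to both masks and return
    table and column coordinates."""
    table_boxes = get_bounding_boxes(table_mask)
    column_boxes = get_bounding_boxes(column_mask, min_area=1000)
    return table_boxes, column_boxes
```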
Let's look at the outputs →
Step 1: First, we get predictions from the model.
Step 2: Then we pass the predicted masks to the fixMasks() function (a short usage sketch follows).
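Putting both steps together, as a rough sketch (assuming the model returns table and column logits, and that model and image_tensor are placeholders for the trained network and the preprocessed input):

```python
import torch

model.eval()
with torch.no_grad():
    table_out, column_out = model(image_tensor)  # assumed forward signature

# threshold the sigmoid outputs into binary masks
table_mask = (torch.sigmoid(table_out) > 0.5).squeeze().cpu().numpy().astype("uint8")
column_mask = (torch.sigmoid(column_out) > 0.5).squeeze().cpu().numpy().astype("uint8")

table_boxes, column_boxes = fixMasks(table_mask, column_mask)
```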
OCR predictions
After getting the table bounding boxes, Pytesseract OCR is applied to each table, and the output is saved to a DataFrame.
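A minimal sketch of that step, assuming image is the original page as a NumPy array and table_boxes comes from fixMasks() (the ocr_tables helper name is mine):

```python
import pandas as pd
import pytesseract

def ocr_tables(image, table_boxes):
    """Run Tesseract OCR on each detected table crop and
    collect the extracted text into a DataFrame."""
    rows = []
    for i, (x1, y1, x2, y2) in enumerate(table_boxes):
        crop = image[y1:y2, x1:x2]
        text = pytesseract.image_to_string(crop)
        rows.append({"table": i, "text": text})
    return pd.DataFrame(rows)
```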
Here are the final outputs for each table detected in the previous section.
Deployment
We will deploy this new model locally using Streamlit. It is an open-source Python library for creating custom web apps for machine learning and deep learning projects.
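A minimal sketch of such an app (the predict_and_extract helper is a placeholder for the model forward pass, the fixMasks() cleanup, and the OCR step from the earlier sections):

```python
import streamlit as st
from PIL import Image

st.title("TableNet: Table Detection and Data Extraction")

uploaded = st.file_uploader("Upload a scanned document image",
                            type=["png", "jpg", "jpeg"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Input document")
    # predict_and_extract() is assumed to return one DataFrame per table
    tables = predict_and_extract(image)
    for i, df in enumerate(tables):
        st.subheader(f"Table {i + 1}")
        st.dataframe(df)
```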
Future Work
- Deploy this application on a remote server using AWS / Streamlit sharing / Heroku.
- Model Quantization for faster inference time.
- Train for more epochs and compare the performances.
- Increase data size by adding data from ICDAR 2013 Table recognition dataset.
End Notes
This marks the end of this case study. I tried to pack in as much information as possible, with crisp code snippets for every stage of the case study.
Feel free to reach out to discuss more on this. I'd be happy to receive feedback.
If you want to check out the whole code, please refer to my GitHub repo below.
You can connect with me on LinkedIn.