Building an online PDF editor from scratch
PyWaw #20, 21.01.2013
Zbigniew Siciarz @zsiciarz http://siciarz.net
Why?
Disclaimer
• still not a full-blown editor
• proof of concept
• simple way to add rich media content to digital magazines
Current status
Links
Multimedia
Go to page
Everything is a link
• website URLs (d’oh!)
• multimedia content (audio/video/galleries)
• internal links (“go to page”)
• custom HTML5 widgets
Workflow
1. upload a PDF file
2. preprocessing on the server
3. add widgets, links etc. in web editor
4. save and create package
5. publish to mobile devices
6. download package and display content
Publisher
Preprocessing
• run asynchronously as a queued task
• extract metadata from uploaded file
• create page thumbnails (with ImageMagick)
• find any existing links
• mark as unpublished
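The thumbnail step could look roughly like the sketch below. It assumes only that ImageMagick's `convert` binary is on the PATH; the function names, the 72 dpi density and the default width are illustrative choices, not taken from the talk.

```python
import subprocess


def thumbnail_command(pdf_path, page_index, out_path, width=200):
    """Build an ImageMagick command that renders one PDF page as a thumbnail."""
    return [
        "convert",
        "-density", "72",             # rasterize at 72 dpi, i.e. 1 px per PDF point
        f"{pdf_path}[{page_index}]",  # ImageMagick page selector, zero-based
        "-thumbnail", f"{width}x",    # scale to this width, keep the aspect ratio
        out_path,
    ]


def make_thumbnail(pdf_path, page_index, out_path, width=200):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(
        thumbnail_command(pdf_path, page_index, out_path, width), check=True
    )
```

A queued preprocessing task would call `make_thumbnail` once per page; keeping the command builder separate makes it easy to check without ImageMagick installed.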
Keep existing links!
• extract links with PyPDF2
• store in database as PdfLink objects
• display in web editor
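A minimal sketch of the extraction step, under stated assumptions: it works on dict-like page and annotation objects of the kind PyPDF2 exposes, and it also accepts plain dicts. The `PdfLink` dataclass is only a stand-in for the database model named on the slide, and its fields, like the helper names `resolve` and `extract_links`, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfLink:
    """Stand-in for the database model named on the slide (fields are assumed)."""
    page: int
    box: list            # [x1, y1, x2, y2] in PDF points
    url: Optional[str]   # None when the link is not a URI link


def resolve(obj):
    """Follow an indirect reference if the object offers a getter for it."""
    for name in ("get_object", "getObject"):  # newer and older PyPDF2 spellings
        getter = getattr(obj, name, None)
        if callable(getter):
            return getter()
    return obj


def extract_links(page, page_number):
    """Collect /Link annotations from a dict-like page object."""
    links = []
    for ref in resolve(page.get("/Annots", [])):
        annot = resolve(ref)
        if annot.get("/Subtype") != "/Link":
            continue  # skip comments, highlights and other annotation types
        box = [float(v) for v in resolve(annot["/Rect"])]
        action = resolve(annot.get("/A", {}))
        url = action.get("/URI") if action.get("/S") == "/URI" else None
        links.append(PdfLink(page=page_number, box=box, url=url))
    return links
```

Only URI actions are turned into a URL here; internal links would need their own handling.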
Dimensions and boxes
• Cartesian coordinate system
• box is a list of 4 floats: [x1, y1, x2, y2]
• PDF unit = 1/72 inch = 1 pt
[Diagram: x and y axes with origin (0, 0); a box with corners (x1, y1) and (x2, y2)]
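To show such a box in a web editor it has to be converted from PDF points to pixels. The sketch below assumes the default PDF user space (origin at the bottom-left corner, y growing upwards), a page whose box starts at (0, 0) and no page rotation; the function name and the returned keys are hypothetical.

```python
def box_to_pixels(box, page_height_pt, dpi=72):
    """Convert a PDF box [x1, y1, x2, y2] in points to a top-left pixel rectangle."""
    x1, y1, x2, y2 = box
    scale = dpi / 72.0  # 72 points per inch
    return {
        "left": min(x1, x2) * scale,
        # Flip the y axis: screen coordinates grow downwards from the top edge.
        "top": (page_height_pt - max(y1, y2)) * scale,
        "width": abs(x2 - x1) * scale,
        "height": abs(y2 - y1) * scale,
    }
```

For example, a one-inch square placed one inch from the bottom-left corner of a US Letter page (792 pt high) ends up 9 inches below the top edge once the axis is flipped.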
Dimensions and boxes
• artBox
• bleedBox
• cropBox
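Not every page defines all of these boxes. According to the PDF specification, a missing CropBox defaults to the MediaBox, and a missing BleedBox, TrimBox or ArtBox defaults to the CropBox. A sketch of that fallback on a raw page dictionary follows; the function name is hypothetical, and inheritance of boxes from parent page-tree nodes is deliberately left out.

```python
def effective_box(page, name):
    """Return the named page box, applying the PDF spec's default fallbacks."""
    fallbacks = {
        "/CropBox": "/MediaBox",
        "/BleedBox": "/CropBox",
        "/TrimBox": "/CropBox",
        "/ArtBox": "/CropBox",
    }
    while name not in page:
        if name not in fallbacks:
            raise KeyError(f"page has no {name}")
        name = fallbacks[name]
    return [float(v) for v in page[name]]
```

This matters for the watermark step, where the overlay page has to match the dimensions of the original page.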
Links
• PDF annotations are messy
• 4 (or more?) different representations
• indirect objects all the way down
• reversed coordinates
• peculiar edge cases still not covered
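The slides do not list the representations they ran into, so the sketch below only illustrates some forms the PDF format allows: a link target stored as an action (`/A`) or as a destination (`/Dest`), and a rectangle whose corners may be stored in either order. It assumes indirect references have already been resolved; both function names and the returned labels are hypothetical.

```python
def normalize_rect(rect):
    """Return [left, bottom, right, top] regardless of the stored corner order."""
    x1, y1, x2, y2 = (float(v) for v in rect)
    return [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]


def link_kind(annot):
    """Classify a link annotation dict by how its target is expressed."""
    action = annot.get("/A")
    if action is not None:
        subtype = action.get("/S")
        if subtype == "/URI":
            return "uri"
        if subtype == "/GoTo":
            return "goto-action"
        return "other-action"
    dest = annot.get("/Dest")
    if dest is None:
        return "unknown"
    # A destination is either an explicit array or a name to look up elsewhere.
    return "explicit-dest" if isinstance(dest, (list, tuple)) else "named-dest"
```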
Watermarking
• create blank PDF (watch out for page dimensions!)
• draw links with ReportLab
• cross your fingers
• merge with original file
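The first two steps might be sketched as follows. This is not the talk's code: it assumes the third-party `reportlab` package is installed, handles URI links only, and uses a hypothetical input shape of one `(width_pt, height_pt, links)` tuple per page.

```python
import io


def build_overlay(pages):
    """Return PDF bytes with one otherwise blank page per entry in `pages`.

    `pages` is a list of (width_pt, height_pt, links) tuples, where links is
    a list of (url, [x1, y1, x2, y2]) pairs in PDF points.
    """
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    buf = io.BytesIO()
    c = canvas.Canvas(buf)
    for width, height, links in pages:
        c.setPageSize((width, height))  # must match the original page size
        for url, rect in links:
            # relative=0: rect is in absolute page coordinates
            c.linkURL(url, tuple(rect), relative=0)
        c.showPage()  # finish this page and start the next one
    c.save()
    return buf.getvalue()
```

Setting the page size per page is the "watch out for page dimensions" part: an overlay page of the wrong size puts every link in the wrong place after merging.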
Watermarking
Merging
[Figure: original page + watermark page = merged page]
Merging
• PyPDF2 can’t properly merge PDFs with links :(
• ReportLab can’t extract links from PDFs*
• several hours wasted on hacking PyPDF2
• pdftk…?
• pdftk!
*Open Source version
Merging
• apply watermark page by page to original PDF
• does not work :(
• works!
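The slides do not say which pdftk operation was used. One candidate is `multistamp`, which stamps page N of one PDF onto page N of another; the sketch below builds that command line. The function names are hypothetical, and whether link annotations survive should be verified against the pdftk version in use.

```python
import subprocess


def multistamp_command(original, overlay, output):
    """Build a pdftk call that stamps each overlay page onto the matching page."""
    return ["pdftk", original, "multistamp", overlay, "output", output]


def merge_overlay(original, overlay, output):
    """Run pdftk; raises CalledProcessError if the merge fails."""
    subprocess.run(multistamp_command(original, overlay, output), check=True)
```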
Final package
• encrypted PDF + media assets
• digitally signed archive
• publication = push notification to devices
• mobile application downloads the package and displays content
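The slides give no details about the package format. Purely as an illustration, the sketch below assumes a ZIP archive holding the files plus a `manifest.json` of SHA-256 digests; the archive type, the manifest name and the function name are all assumptions. PDF encryption and the digital signature itself are not shown: signing would mean signing the manifest with an asymmetric key, which needs a library outside the standard library.

```python
import hashlib
import io
import json
import zipfile


def build_package(files):
    """Zip the given {name: bytes} files together with a SHA-256 manifest."""
    manifest = {
        name: hashlib.sha256(data).hexdigest() for name, data in files.items()
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
        # The manifest lets the client detect corrupted or swapped files.
        zf.writestr("manifest.json", json.dumps(manifest, sort_keys=True))
    return buf.getvalue()
```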
Conclusion
• sadly, 3 different toolkits are necessary to get the job done

                           PyPDF2   ReportLab   pdftk
Extract links              Yes      No*         No
Draw links                 No       Yes         No
Merge and preserve links   No       No          Yes

*Open Source version
ReportLab PLUS?
• “Reuse your existing pdfs in new and exciting ways”
• might just work
• pricey :(
Appendix
Appendix
Credits
• Businessperson designed by Devochkina Oxana from The Noun Project
• Servers designed by Daniel Campos from The Noun Project
• Maru - http://sisinmaru.blog17.fc2.com/