Slide 1

Slide 1 text

Practical & Impractical File Operations Evadne Wu github.com/evadne [email protected] // @evadne last updated 30 September 2017

Slide 2

Slide 2 text

Structure File Lifecycle — what really needs to happen when a file is being uploaded ➤ Phoenix-specific discussions ➤ General discussions Various Solutions — COTS, OSS, etc ➤ NB: we do not use all of them Q&A

Slide 3

Slide 3 text

File Lifecycle Selection: User selects one or many files Ingestion: Transfer of files from User to Service Inspection: Verification of files’ integrity and congruence Conversion: Creation of new Representations (e.g. thumbnails) as needed Storage: Storing of the file in a block or object storage service for later retrieval Presentation: Making the File and/or its Representations available to the User

Slide 4

Slide 4 text

File Lifecycle: Selection Selection: User selects one or many files ➤ With drag & drop operation or a traditional form input ➤ Preliminary validation in JavaScript takes place here ➤ Reports MIME type + file size ➤ NB: Browsers can lie about MIME type and any enterprising person can fake size / MIME type information ➤ NB: Some browsers report incorrect file sizes

Slide 5

Slide 5 text

File Lifecycle: Selection File.name, File.type, File.size, etc. available via vanilla JavaScript APIs ➤ MDN: File
 https://developer.mozilla.org/en-US/docs/Web/API/File ➤ MDN: FileList
 https://developer.mozilla.org/en-US/docs/Web/API/FileList

Slide 6

Slide 6 text

File Lifecycle: Selection However you should not trust MIME types returned by the browser. ➤ WebKit infers content type from path extension
 https://github.com/WebKit/webkit/blob/master/Source/WebCore/fileapi/File.cpp#L124 If you are serious about the files you receive, you will have to validate them yourself and implement rejection logic for files that were merely renamed. ➤ For example… “Converting DOCX to PDF by renaming”

Slide 7

Slide 7 text

File Lifecycle: Selection Drag & Drop Solution A: Classic Solution via Big Input ➤ Make a big and translucent element, listen for change event, then query the FileList to get actual files.
 e.g. https://www.sitepoint.com/html5-file-drag-and-drop

Slide 8

Slide 8 text

File Lifecycle: Selection Drag & Drop Solution B: Fancy Solution via proper HTML5 Drag & Drop ➤ MDN: File drag and drop
 https://developer.mozilla.org/en-US/docs/Web/API/HTML_Drag_and_Drop_API/File_drag_and_drop

Slide 9

Slide 9 text

File Lifecycle: Ingestion Ingestion: Transfer of files from User to Service ➤ In most cases, a plain HTTP POST will do ➤ However, you may want a more scalable solution ➤ I’ll explain why

Slide 10

Slide 10 text

File Lifecycle: Ingestion (Basic) ➤ Accept HTTP POST with file in payload ➤ e.g. with file_input, from phoenix_html
 https://hexdocs.pm/phoenix_html/Phoenix.HTML.Form.html#file_input/3 ➤ Ingest file in Phoenix, then store it elsewhere ➤ Plug.Upload
 https://hexdocs.pm/plug/Plug.Upload.html

Slide 11

Slide 11 text

File Lifecycle: Ingestion (Basic) Problem 1: Memory pressure (due to API implementation) ➤ Plug.Parsers.MULTIPART requests a temporary file ➤ Plug.Upload provides a temporary file and monitors requesting process ➤ Request continues to be read (approximately 1MB at a time) ➤ Temporary file is written to, incrementally ➤ Path to the temporary file is exposed downstream
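
The chunked flow described above can be illustrated with a generic sketch. This is not Plug's implementation; it only shows the same pattern (read roughly 1MB at a time, append to a temporary file, expose the path downstream). The function name `spool_to_tempfile` and the exact chunk size are assumptions made for illustration.

```python
import io
import tempfile

CHUNK_SIZE = 1024 * 1024  # assumption: roughly 1MB per read, as on the slide


def spool_to_tempfile(stream, chunk_size=CHUNK_SIZE):
    """Copy a request body into a temporary file, one chunk at a time.

    Returns (path, bytes_written). Only one chunk is held in memory at once.
    """
    total = 0
    # delete=False keeps the file around so its path can be handed downstream;
    # the caller is responsible for removing it afterwards.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            handle.write(chunk)
            total += len(chunk)
        return handle.name, total


if __name__ == "__main__":
    body = io.BytesIO(b"x" * (2 * CHUNK_SIZE + 10))  # stand-in for a request body
    print(spool_to_tempfile(body))
```

Memory use stays bounded by the chunk size, but every byte still passes through the application host and its disk, which is what the next two problems are about.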

Slide 12

Slide 12 text

File Lifecycle: Ingestion (Basic) Problem 2: Disk I/O pressure (due to API design) ➤ Entire file needs to be written to disk first ➤ This is because the file needs to be presented as a path ➤ This may or may not be a problem ➤ The more concurrent uploads you have, the more disk I/O you need to do

Slide 13

Slide 13 text

File Lifecycle: Ingestion (Basic) Problem 3: Network pressure (due to application implementation) ➤ Related to choice of file storage mechanisms ➤ Local directory / DRBD / other sorts of block storage devices ➤ AWS S3, Google Cloud Storage, Azure Storage, etc ➤ Either way there will be traffic ➤ Internet to Phoenix to Block Storage / Object Storage ➤ QoS enforcement = £££

Slide 14

Slide 14 text

File Lifecycle: Ingestion (S3) The front-end can send the entire file to S3 or somewhere else ➤ S3 Uploads: up to 5GB per request ➤ S3 Multipart Uploads: 5MB – 5GB per part (the last part may be smaller), 10,000 parts total ➤ AWS: “In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.”
 http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
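
As a rough sketch of how the part limits above interact, the following picks a part size for a given file so that the part count stays within 10,000. The name `plan_parts` is hypothetical, and the code assumes the limits are binary units (5 MiB and 5 GiB).

```python
MIN_PART = 5 * 1024 ** 2   # 5MB minimum per part (assumed to mean MiB)
MAX_PART = 5 * 1024 ** 3   # 5GB maximum per part (assumed to mean GiB)
MAX_PARTS = 10_000


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding on large sizes."""
    return -(-a // b)


def plan_parts(total_size, preferred_part=MIN_PART):
    """Return (part_size, part_count) for a multipart upload of total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Grow the part size when the preferred size would need too many parts.
    part_size = max(preferred_part, MIN_PART, ceil_div(total_size, MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("file is too large for 10,000 parts of at most 5GB")
    return part_size, ceil_div(total_size, part_size)
```

For a 100 MiB file this yields 20 parts of 5 MiB each; for very large files the part size grows instead of the part count.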

Slide 15

Slide 15 text

File Lifecycle: Ingestion (S3) Proposed Strategy ➤ Make 2 buckets (one permanent, one temporary) ➤ Attach a Lifecycle Policy on the Temporary bucket ➤ Clean up all lingering multi-part uploads after a certain amount of time ➤ Clean up all remaining temporary uploads after a certain amount of time ➤ Customers upload to Temporary bucket only
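
A lifecycle configuration for the temporary bucket might look roughly like the one built below. The rule IDs and the day counts are placeholders rather than recommendations; the dictionary follows the shape of S3's bucket lifecycle configuration and would still need to be applied with your AWS tooling of choice.

```python
import json


def temporary_bucket_lifecycle(abort_multipart_days=1, expire_days=7):
    """Build a lifecycle configuration for a temporary upload bucket.

    Both day counts are illustrative assumptions.
    """
    return {
        "Rules": [
            {
                # Clean up multipart uploads that were started but never completed.
                "ID": "abort-lingering-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_multipart_days
                },
            },
            {
                # Clean up completed uploads that were never moved elsewhere.
                "ID": "expire-temporary-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": expire_days},
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(temporary_bucket_lifecycle(), indent=2))
```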

Slide 16

Slide 16 text

File Lifecycle: Ingestion (S3) Proposed Strategy ➤ You can use AWS STS (Security Token Service) to create credentials that have additional restrictions. ➤ For example, create a UUID, then create a token with s3:PutObject rights, but only against a specific ARN (which uses said UUID). ➤ NB: s3:PutObject still allows overwriting an existing object ➤ There is, however, no way to enforce upload size using this solution
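
One way to picture the scoped credential is as a policy document that allows s3:PutObject on a single key derived from a fresh UUID. The sketch below only builds that document; the bucket name and the `uploads/` key layout are assumptions. Such a document could, for instance, be supplied as the session policy when requesting temporary credentials from STS.

```python
import json
import uuid


def scoped_upload_policy(bucket, upload_id):
    """Build a policy allowing s3:PutObject on exactly one object key.

    The "uploads/<uuid>" key layout is a hypothetical convention.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/uploads/{upload_id}",
            }
        ],
    }


if __name__ == "__main__":
    upload_id = str(uuid.uuid4())
    # "example-temporary-bucket" is a placeholder bucket name.
    print(json.dumps(scoped_upload_policy("example-temporary-bucket", upload_id)))
```

Note that nothing in this document constrains how many bytes are written to that key, which matches the limitation stated above.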

Slide 17

Slide 17 text

File Lifecycle: Ingestion (S3) Proposed Strategy ➤ You should decide between vending signatures for each part or vending the credential directly. ➤ If vending signatures for each part, you can put additional limitations in place. ➤ If vending credentials, you are unable to put size limitations in place.

Slide 18

Slide 18 text

File Lifecycle: Ingestion (S3) Proposed Strategy ➤ Each Upload Part request has Content-Length, Content-MD5 and Expect headers. ➤ AWS: Upload Part
 http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html ➤ You could use a signed structure to hold an accumulator for bytes already uploaded, and enforce total upload size this way by a) vending signatures derived from a fixed Content-Length, and b) refusing to sign extraneous parts.
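
The signed accumulator idea could be sketched as follows. The server keeps no state: the running byte count travels in an HMAC-signed token, and each request for a part signature either returns an updated token or is refused. All names (`start_upload`, `authorize_part`, the token layout, the secret) are hypothetical, and producing the actual S3 signature for the part is out of scope here.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"replace-with-a-server-side-secret"  # assumption: held only by the server


def _sign(payload):
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(state):
    """Serialise the state and attach an HMAC so the client cannot alter it."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii") + "." + _sign(payload)


def read_token(token):
    """Verify the HMAC and return the state; raises ValueError if tampered with."""
    encoded, _, signature = token.partition(".")
    payload = base64.urlsafe_b64decode(encoded.encode("ascii"))
    if not hmac.compare_digest(signature, _sign(payload)):
        raise ValueError("token signature mismatch")
    return json.loads(payload)


def start_upload(upload_id, max_total):
    return issue_token({"upload": upload_id, "bytes": 0, "parts": 0, "max": max_total})


def authorize_part(token, content_length):
    """Return (part_number, new_token), or refuse if the total would exceed the cap.

    The caller would sign the Upload Part request for exactly this part number
    and this Content-Length.
    """
    state = read_token(token)
    new_total = state["bytes"] + content_length
    if content_length <= 0 or new_total > state["max"]:
        raise PermissionError("refusing to sign: upload size limit exceeded")
    part_number = state["parts"] + 1
    new_state = dict(state, bytes=new_total, parts=part_number)
    return part_number, issue_token(new_state)
```

Because each token fixes the next part number, replaying an older token only yields a signature for a part number that was already issued, so the total stays bounded.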

Slide 19

Slide 19 text

File Lifecycle: Ingestion (S3) Proposed Strategy ➤ Alternatively, if the expected payload is small (<100MB) you can vend a signature with additional bits for the client to use with a single request. ➤ Create the POST policy with starts-with, content-length-range, etc. ➤ AWS: Creating a POST Policy
 http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
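
A minimal sketch of such a POST policy follows. It shows only the starts-with and content-length-range conditions named above, plus the Base64 encoding step. The bucket name, key prefix, and 15-minute lifetime are assumptions, and a real Signature Version 4 policy also needs the credential, algorithm, and date conditions and must then be signed, all of which is omitted here.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def build_post_policy(bucket, key_prefix, max_bytes, now, ttl_minutes=15):
    """Build a (partial) S3 POST policy that caps the upload size."""
    expires = now + timedelta(minutes=ttl_minutes)
    return {
        "expiration": expires.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        "conditions": [
            {"bucket": bucket},
            # The object key must begin with the prefix we assigned.
            ["starts-with", "$key", key_prefix],
            # Reject empty bodies and anything above the cap.
            ["content-length-range", 1, max_bytes],
        ],
    }


def encode_policy(policy):
    """Base64-encode the policy document, as required before signing."""
    return base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    policy = build_post_policy(
        "example-temporary-bucket",      # placeholder bucket name
        "uploads/",                      # placeholder key prefix
        100 * 1024 * 1024,
        datetime.now(timezone.utc),
    )
    print(encode_policy(policy))
```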

Slide 20

Slide 20 text

File Lifecycle: Ingestion (S3) Discussion ➤ One other benefit of using multipart uploads is the ability to resume after an interruption: a network failure, somebody yanking the power cable, or a USB storage device being removed. ➤ Multipart uploads can therefore be beneficial even for smaller files (say, 25MB – 50MB) that are below the AWS-recommended threshold. ➤ You should analyse this based on your own use cases.

Slide 21

Slide 21 text

File Lifecycle: Inspection Inspection: Verification of files’ integrity and congruence There are actually two goals of such a step: ➤ Goal 1: Identifying files that fall outside of the acceptance criteria ➤ Goal 2: Preventing bad files from reaching conversion processes I will shortly explain why both goals belong in this step

Slide 22

Slide 22 text

File Lifecycle: Inspection Goal 1: Actual file content / integrity verification ➤ Verification of artefacts’ fitness for purpose ➤ Verification of absence of unwanted content ➤ Generic: VBA macros, PUAs, embedded JavaScript in PDF, XFAs, etc. ➤ Specific: business-specific content e.g. empty forms or templates ➤ Verification of file name / type / content congruence

Slide 23

Slide 23 text

File Lifecycle: Inspection Goal 2: Rejection of internally incongruent files ➤ Either malicious or plain broken; valid attack vector either way ➤ Most converters are designed to crash when a bad file is sent ➤ Re-establishing processes takes time and can cause dips in throughput ➤ Some converters tolerate ambiguity and do the wrong thing

Slide 24

Slide 24 text

File Lifecycle: Inspection Example: ImageTragick ➤ Make ImageMagick issue HTTP GET / FTP Requests ➤ Basically anything your host / container can do ➤ Potent mix if used to retrieve EC2 Instance Metadata or ECS Task Role ➤ Could be worked around if ImageMagick is not used, or if only congruent images were sent (i.e. actual JPEGs, PNGs, etc)

Slide 25

Slide 25 text

File Lifecycle: Inspection Example: Infinite loops in PDF Catalogue ➤ Basically DoS attack by holding up conversion processes ➤ Certified programming with dependent types
 (Because the future of defense is liberal application of math)
 https://media.ccc.de/v/cccamp11-4426-certified_programming_with_dependent_types-en ➤ “Six year old PDF loop bug affects most major implementations”
 https://blog.fuzzing-project.org/59-Six-year-old-PDF-loop-bug-affects-most-major-implementations.html

Slide 26

Slide 26 text

File Lifecycle: Inspection Discussion Any proper conversion process should be held to resource usage limits and be subject to limits on how much time it can spend doing the work. However, re-establishment of processes (killing and re-spawning) can sometimes take quite a while (especially for virus scanners, which require a lot of definition data to be loaded).
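
A minimal sketch of holding a conversion process to both a wall-clock limit and a CPU-time limit, assuming a UNIX-like host. The name `run_converter` and the default limits are hypothetical, and `preexec_fn` is used for brevity even though it is not safe in multi-threaded parents.

```python
import resource
import subprocess
import sys


def _limit_cpu(cpu_seconds):
    """Return a function that caps CPU time in the child before exec."""
    def apply():
        _soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
        # Never ask for more than the existing hard limit allows.
        limit = cpu_seconds if hard == resource.RLIM_INFINITY else min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (limit, hard))
    return apply


def run_converter(command, wall_seconds=30.0, cpu_seconds=20):
    """Run a converter command; return "ok", "failed" or "timeout"."""
    try:
        completed = subprocess.run(
            command,
            timeout=wall_seconds,               # wall-clock limit; child is killed on expiry
            preexec_fn=_limit_cpu(cpu_seconds), # CPU-time limit enforced by the kernel
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for a real converter invocation.
    print(run_converter([sys.executable, "-c", "pass"]))
```

This spawns a fresh process per job, which is exactly the re-establishment cost noted above; long-lived workers such as virus scanners would need a different arrangement.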

Slide 27

Slide 27 text

File Lifecycle: Inspection Possible Solution Binary inspection of all incoming files. In UNIX-like systems this can be done with the file(1) command. ➤ file(1) - Linux man page
 https://linux.die.net/man/1/file

Slide 28

Slide 28 text

File Lifecycle: Inspection Possible Solution Usually the file(1) command is backed by libmagic. It is a pattern matcher which checks the binary file against pre-defined patterns, and returns the most likely match. ➤ You can write your own magic if desired. ➤ libmagic(3) - Linux man page
 https://linux.die.net/man/3/libmagic ➤ Guide to using filemagic
 https://filemagic.readthedocs.io/en/latest/guide.html
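
To illustrate the idea without depending on libmagic, here is a hand-rolled sketch that compares a file's leading bytes against a tiny signature table and checks the result against the type implied by the file name. The tables are deliberately small and are assumptions for illustration; libmagic's database is far more thorough.

```python
import os

# Leading-byte signatures for a handful of formats (illustrative, not exhaustive).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # ZIP container, also used by DOCX
]

# Type each extension is expected to sniff as (hypothetical acceptance list).
EXPECTED_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".zip": "application/zip",
    ".docx": "application/zip",
}


def sniff(header):
    """Return the type whose signature matches the header bytes, or None."""
    for magic, mime in SIGNATURES:
        if header.startswith(magic):
            return mime
    return None


def is_congruent(filename, header):
    """True only if the extension is accepted and the content matches it."""
    extension = os.path.splitext(filename)[1].lower()
    expected = EXPECTED_BY_EXTENSION.get(extension)
    return expected is not None and sniff(header) == expected
```

A DOCX renamed to `report.pdf` starts with the ZIP signature rather than `%PDF-`, so `is_congruent` rejects it before it reaches a converter.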

Slide 29

Slide 29 text

File Lifecycle: Conversion Conversion: Creation of new Representations (e.g. thumbnails) as needed ➤ Usually done for images ➤ Sometimes for video files (still frames) and documents (page images) ➤ Lesser known: Album art from ID3 tags (MP3), etc.

Slide 30

Slide 30 text

File Lifecycle: Conversion Discussion ➤ Images are quite easy to deal with, but be careful with colour space conversions (for example CMYK to RGB) and with resampling. ➤ VIPS — HOWTO — Image shrinking
 https://github.com/jcupitt/libvips/wiki/HOWTO----Image-shrinking ➤ Tip: larger JPEGs with lower Quality setting still look quite good on higher resolution displays, and can be smaller in size too.

Slide 31

Slide 31 text

File Lifecycle: Storage Storage: Storing of the file in a storage service for later retrieval If you’ve taken the S3 ingestion route then Storage is largely taken care of. Otherwise you will have to ensure that the underlying block storage device is large enough and is looked after by operators. It may also be a good idea to have a staged retention policy to get rid of old files, so that the amount of data stored does not grow indefinitely. ➤ This could be done via Lifecycle Policy, or Object Expiration, with notifications sent via S3 Event Notifications.

Slide 32

Slide 32 text

File Lifecycle: Presentation Presentation: Making the File and/or its Representations available Usually people will vend a signed link directly, and this may be adequate for your uses. In any case, consider tracking file names separately (during Ingestion) and vending the link with the correct name in response-content-disposition: ➤ attachment; filename*=UTF-8''${encoded_filename} Percent-encode anything outside of A-Za-z0-9
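
The encoding rule above can be sketched as follows: the file name is converted to UTF-8 bytes, and every byte outside A-Za-z0-9 is percent-encoded. The function name `content_disposition` is hypothetical.

```python
import string

SAFE = set(string.ascii_letters + string.digits)  # A-Za-z0-9 only, per the slide


def content_disposition(filename):
    """Build an attachment header value with a percent-encoded UTF-8 file name."""
    encoded = "".join(
        chr(byte) if chr(byte) in SAFE else "%{:02X}".format(byte)
        for byte in filename.encode("utf-8")
    )
    return "attachment; filename*=UTF-8''" + encoded


if __name__ == "__main__":
    print(content_disposition("r\u00e9sum\u00e9 1.pdf"))
```

The resulting value would be passed as the response-content-disposition parameter when the signed link is created.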

Slide 33

Slide 33 text

Various Components ➤ Arc (ExAWS + Ecto) https://github.com/stavro/arc ➤ Embed in your Phoenix application ➤ EvaporateJS https://github.com/TTLabs/EvaporateJS ➤ JS package to help you with multipart uploads

Slide 34

Slide 34 text

Various Solutions ➤ AWS Services ➤ Elastic Transcoder (i.e. hosted ZenCoder) ➤ Lambda Functions calling libVIPS with Sharp (NPM Package) ➤ CloudConvert https://cloudconvert.com ➤ Quite good results from our internal testing ➤ FEG’s own sausage factory ➤ Higher-level solution; REST API + Webhooks; 3.34m+ transactions

Slide 35

Slide 35 text

Takeaway ➤ Try to set appropriate boundaries: total size, number of parts, etc. ➤ Try to process everything in isolation ➤ Use a dedicated service to deal with files if needed ➤ Try not to have ImageMagick / FFmpeg / LibAV installed everywhere ➤ Minimises patching workload; single point of audit