Practical & Impractical File Operations (in Phoenix)

Evadne Wu // github.com/evadne // ev@radi.ws // @evadne
Last updated 30 September 2017

Structure
➤ File Lifecycle: what really needs to happen when a file is being uploaded
  ➤ Phoenix-specific discussions
  ➤ General discussions
➤ Various Solutions: COTS, OSS, etc.
  ➤ NB: we do not use all of them
➤ Q&A

File Lifecycle
➤ Selection: User selects one or many files
➤ Ingestion: Transfer of files from User to Service
➤ Inspection: Verification of files’ integrity and congruence
➤ Conversion: Creation of new Representations (e.g. thumbnails) as needed
➤ Storage: Storing of the file in a block or object storage service for later retrieval
➤ Presentation: Making the File and/or its Representations available to the User

File Lifecycle: Selection
Selection: User selects one or many files
➤ With a drag & drop operation or a traditional form input
➤ Preliminary validation in JavaScript takes place here
➤ The browser reports MIME type + file size
➤ NB: Browsers can lie about MIME type, and any enterprising person can fake size / MIME type information
➤ NB: Some browsers report incorrect file sizes

File Lifecycle: Selection
File.name, File.type, File.size, etc. are available via vanilla JavaScript APIs
➤ MDN: File  https://developer.mozilla.org/en-US/docs/Web/API/File
➤ MDN: FileList  https://developer.mozilla.org/en-US/docs/Web/API/FileList

File Lifecycle: Selection
However, you should not trust MIME types returned by the browser.
➤ WebKit infers content type from path extension  https://github.com/WebKit/webkit/blob/master/Source/WebCore/fileapi/File.cpp#L124
If you are serious about the files you get, you will have to validate them yourself and implement rejection logic for files that were merely renamed.
➤ For example… “Converting DOCX to PDF by renaming”

File Lifecycle: Selection
Drag & Drop Solution A: Classic Solution via Big Input
➤ Make a big and translucent <input> element, listen for the change event, then query the FileList to get the actual files.  e.g. https://www.sitepoint.com/html5-file-drag-and-drop

File Lifecycle: Selection
Drag & Drop Solution B: Fancy Solution via proper HTML5 Drag & Drop
➤ MDN: File drag and drop  https://developer.mozilla.org/en-US/docs/Web/API/HTML_Drag_and_Drop_API/File_drag_and_drop

File Lifecycle: Ingestion
Ingestion: Transfer of files from User to Service
➤ In most cases, a plain HTTP POST will do
➤ However, you may want a more scalable solution
➤ I’ll explain why

File Lifecycle: Ingestion (Basic)
➤ Accept HTTP POST with file in payload
  ➤ e.g. with file_input, from phoenix_html  https://hexdocs.pm/phoenix_html/Phoenix.HTML.Form.html#file_input/3
➤ Ingest file in Phoenix, then store it elsewhere
  ➤ Plug.Upload  https://hexdocs.pm/plug/Plug.Upload.html

File Lifecycle: Ingestion (Basic)
Problem 1: Memory pressure (due to API implementation)
➤ Plug.Parsers.MULTIPART requests a temporary file
➤ Plug.Upload provides a temporary file and monitors the requesting process
➤ Request continues to be read (approximately 1MB at a time)
➤ Temporary file is written to, incrementally
➤ Path to the temporary file is exposed downstream
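The read-and-spool flow above can be sketched outside of Phoenix. This is a minimal Python illustration of the pattern only, not Plug's actual code: the function name `spool_to_tempfile` and the chunk size constant are assumptions made for the example.

```python
import os
import tempfile

# Assumed chunk size, roughly the "1MB at a time" mentioned on the slide.
CHUNK_SIZE = 1_000_000


def spool_to_tempfile(body, chunk_size=CHUNK_SIZE):
    """Copy a readable byte stream to a temporary file, one chunk at a time.

    Returns (path, bytes_written). Only one chunk is held in memory at once,
    but every byte still hits the disk before anything downstream sees it.
    """
    fd, path = tempfile.mkstemp(prefix="upload-")
    written = 0
    with os.fdopen(fd, "wb") as out:
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return path, written
```

The caller owns the returned path and has to delete the file when done; in the flow described above, that clean-up is tied to the lifetime of the requesting process.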

File Lifecycle: Ingestion (Basic)
Problem 2: Disk I/O pressure (due to API design)
➤ Entire file needs to be written to disk first
➤ This is because the file needs to be presented as a path
➤ This may or may not be a problem
➤ The more concurrent uploads you have, the more disk I/O you need to do

File Lifecycle: Ingestion (Basic)
Problem 3: Network pressure (due to application implementation)
➤ Related to choice of file storage mechanisms
  ➤ Local directory / DRBD / other sorts of block storage devices
  ➤ AWS S3, Google Cloud Storage, Azure Storage, etc.
➤ Either way there will be traffic
  ➤ Internet to Phoenix to Block Storage / Object Storage
  ➤ QoS enforcement = £££

File Lifecycle: Ingestion (S3)
The front-end can send the entire file to S3 or somewhere else
➤ S3 Uploads: up to 5GB per request
➤ S3 Multipart Uploads: 5MB – 5GB per part (the last part may be smaller), 10,000 parts total
➤ AWS: “In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.”  http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
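To see how these limits interact, here is a small Python sketch that picks a part size for a given object size. The function name `plan_parts` and the 8MiB preferred part size are assumptions for illustration, and the limits are interpreted as binary units (MiB / GiB).

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB          # smallest allowed part (except the last one)
MAX_PART = 5 * 1024 * MIB   # largest allowed part
MAX_PARTS = 10_000          # most parts allowed in one multipart upload


def plan_parts(total_size, preferred_part_size=8 * MIB):
    """Return (part_size, part_count) that fits within the multipart limits."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Grow the part size when the preferred size would need too many parts.
    part_size = max(preferred_part_size, MIN_PART,
                    math.ceil(total_size / MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object is too large for a single multipart upload")
    return part_size, math.ceil(total_size / part_size)
```

For a 100MiB file this keeps the preferred 8MiB parts; for very large objects the part size grows so that the count stays at or below 10,000.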

File Lifecycle: Ingestion (S3)
Proposed Strategy
➤ Make 2 buckets (one permanent, one temporary)
➤ Attach a Lifecycle Policy on the Temporary bucket
  ➤ Clean up all lingering multipart uploads after a certain amount of time
  ➤ Clean up all remaining temporary uploads after a certain amount of time
➤ Customers upload to Temporary bucket only
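The two clean-up rules could look roughly like the following. This Python sketch only builds the configuration document, in the shape I understand the S3 bucket lifecycle configuration API to accept; the rule IDs and the day counts are placeholders chosen for the example, not values from the talk.

```python
def temporary_bucket_lifecycle(abort_multipart_days=1, expire_days=7):
    """Build a lifecycle configuration for the temporary bucket.

    Both rules apply to every key (empty prefix filter).
    """
    return {
        "Rules": [
            {
                # Abort multipart uploads that were started but never completed.
                "ID": "abort-lingering-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_multipart_days
                },
            },
            {
                # Expire whatever is still sitting in the temporary bucket.
                "ID": "expire-temporary-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": expire_days},
            },
        ]
    }
```

Anything worth keeping has to be copied to the permanent bucket before the expiration rule fires.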

File Lifecycle: Ingestion (S3)
Proposed Strategy
➤ You can use AWS STS (Security Token Service) to create credentials that have additional restrictions.
➤ For example, create a UUID, then create a token with s3:PutObject rights, but only against a specific ARN (which uses said UUID).
➤ NB: s3:PutObject still allows overwriting an existing object
➤ There is, however, no way to enforce upload size using this solution
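A sketch of the restricting policy, in Python. It only generates the UUID-based key and the policy JSON; the bucket name and the `uploads/` prefix are hypothetical, and handing the document to STS (for example as the policy of a federation token or assumed-role session) is left out.

```python
import json
import uuid


def scoped_upload_policy(bucket, key_prefix="uploads/"):
    """Return (key, policy_json) allowing s3:PutObject on one object only."""
    key = f"{key_prefix}{uuid.uuid4()}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # The ARN names exactly one object, so the credential is
                # useless for any other key in the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/{key}",
            }
        ],
    }
    return key, json.dumps(policy)
```

The key is returned alongside the policy so the service can record which object the customer was allowed to write.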

File Lifecycle: Ingestion (S3)
Proposed Strategy
➤ You should decide between vending signatures for each part or vending the credential directly.
➤ If vending signatures for each part, you can put additional limitations in place.
➤ If vending credentials, you are unable to put size limitations in place.

File Lifecycle: Ingestion (S3)
Proposed Strategy
➤ Each Upload Part request has Content-Length, Content-MD5 and Expect headers.
➤ AWS: Upload Part  http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html
➤ You could use a signed structure to hold an accumulator for bytes already uploaded, and enforce total upload size this way by a) vending signatures derived from a fixed Content-Length, and b) refusing to sign extraneous parts.
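One way to picture the signed accumulator, as a Python sketch. Everything here is an assumption made for illustration: the token layout, the field names (`upload`, `bytes`, `part`), the `SECRET` value and the function names. Producing the actual Upload Part signature for the approved Content-Length is not shown.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # hypothetical; never leaves the service


def _sign(payload):
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(state):
    """Serialise the accumulator and append an HMAC so clients cannot edit it."""
    payload = json.dumps(state, sort_keys=True).encode()
    return base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload)


def read_token(token):
    encoded, signature = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(encoded)
    if not hmac.compare_digest(signature, _sign(payload)):
        raise ValueError("token signature mismatch")
    return json.loads(payload)


def authorise_part(token, content_length, max_total):
    """Approve the next part, or refuse if the total would exceed the cap.

    Returns (part_number, next_token); the service would sign the Upload Part
    request for exactly this part number and Content-Length.
    """
    state = read_token(token)
    new_total = state["bytes"] + content_length
    if content_length <= 0 or new_total > max_total:
        raise PermissionError("upload would exceed the allowed total size")
    next_state = {"upload": state["upload"], "bytes": new_total,
                  "part": state["part"] + 1}
    return next_state["part"], issue_token(next_state)
```

Because the sketch is stateless, an old token can be replayed; a real service would likely also remember the latest token per upload, or verify the final object size when the upload is completed.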

File Lifecycle: Ingestion (S3)
Proposed Strategy
➤ Alternatively, if the expected payload is small (<100MB) you can vend a signature with additional bits for the client to use with a single request.
➤ Create the POST policy with starts-with, content-length-range, etc.
➤ AWS: Creating a POST Policy  http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
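A rough Python sketch of building and signing such a POST policy with Signature Version 4, as I understand the scheme from the AWS page linked above. The function name, the parameter names and the 15-minute expiry are assumptions; check the returned form field names against the AWS documentation before relying on them.

```python
import base64
import datetime
import hashlib
import hmac
import json


def _hmac(key, message):
    return hmac.new(key, message.encode(), hashlib.sha256).digest()


def build_post_policy(bucket, key_prefix, max_bytes, access_key, secret_key,
                      region, now, ttl_seconds=900):
    """Return the form fields a browser needs for one size-limited POST upload."""
    date = now.strftime("%Y%m%d")
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    credential = f"{access_key}/{date}/{region}/s3/aws4_request"
    expires = now + datetime.timedelta(seconds=ttl_seconds)
    policy = {
        "expiration": expires.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],     # restrict the object key
            ["content-length-range", 1, max_bytes],  # restrict the upload size
            {"x-amz-algorithm": "AWS4-HMAC-SHA256"},
            {"x-amz-credential": credential},
            {"x-amz-date": amz_date},
        ],
    }
    encoded = base64.b64encode(json.dumps(policy).encode()).decode()
    # Derive the signing key, then sign the Base64-encoded policy.
    signing_key = _hmac(_hmac(_hmac(_hmac(("AWS4" + secret_key).encode(), date),
                                    region), "s3"), "aws4_request")
    signature = hmac.new(signing_key, encoded.encode(), hashlib.sha256).hexdigest()
    return {
        "Policy": encoded,
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": credential,
        "X-Amz-Date": amz_date,
        "X-Amz-Signature": signature,
    }
```

Unlike the vended-credential approach, the size cap here is enforced by S3 itself through content-length-range.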

File Lifecycle: Ingestion (S3)
Discussion
➤ One other benefit of using multipart uploads is the ability to resume after a network interruption, or when somebody has yanked the power cable or removed a USB storage device.
➤ Therefore this is beneficial even for smaller files (say, 25MB – 50MB) that are below AWS recommended thresholds.
➤ You should analyse this based on your own use cases.
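Resuming boils down to working out which parts are still missing. A minimal Python sketch, assuming the client knows the total size, the part size it used, and the part numbers the storage service reports as already uploaded; the function name is made up for the example.

```python
def remaining_parts(total_size, part_size, uploaded_part_numbers):
    """List (part_number, start, end) byte ranges that still need uploading.

    Part numbers start at 1; `end` is exclusive.
    """
    part_count = -(-total_size // part_size)  # ceiling division
    done = set(uploaded_part_numbers)
    todo = []
    for number in range(1, part_count + 1):
        if number in done:
            continue
        start = (number - 1) * part_size
        todo.append((number, start, min(start + part_size, total_size)))
    return todo
```

For example, a 25-byte object split into 10-byte parts with part 1 already uploaded leaves parts 2 and 3, covering bytes 10 to 20 and 20 to 25.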

File Lifecycle: Inspection
Inspection: Verification of files’ integrity and congruence
There are actually two goals of such a step:
➤ Goal 1: Identifying files that fall outside of the acceptance criteria
➤ Goal 2: Preventing bad files from reaching conversion processes
I will shortly explain why both goals belong in this step.

File Lifecycle: Inspection
Goal 1: Actual file content / integrity verification
➤ Verification of artefacts’ fitness for purpose
➤ Verification of absence of unwanted content
  ➤ Generic: VBA macros, PUAs, embedded JavaScript in PDF, XFAs, etc.
  ➤ Specific: business-specific content e.g. empty forms or templates
➤ Verification of file name / type / content congruence

File Lifecycle: Inspection
Goal 2: Rejection of internally incongruent files
➤ Either malicious or plain broken; valid attack vector either way
➤ Most converters are designed to crash when a bad file is sent
➤ Re-establishing processes takes time and can cause dips in throughput
➤ Some converters tolerate ambiguity and do the wrong thing

File Lifecycle: Inspection
Example: ImageTragick
➤ Make ImageMagick issue HTTP GET / FTP Requests
➤ Basically anything your host / container can do
➤ Potent mix if used to retrieve EC2 Instance Metadata or ECS Task Role
➤ Could be worked around if ImageMagick is not used, or if only congruent images were sent (i.e. actual JPEGs, PNGs, etc.)

File Lifecycle: Inspection
Example: Infinite loops in PDF Catalogue
➤ Basically a DoS attack by holding up conversion processes
➤ Certified programming with dependent types (Because the future of defense is liberal application of math)  https://media.ccc.de/v/cccamp11-4426-certified_programming_with_dependent_types-en
➤ “Six year old PDF loop bug affects most major implementations”  https://blog.fuzzing-project.org/59-Six-year-old-PDF-loop-bug-affects-most-major-implementations.html

File Lifecycle: Inspection
Discussion
Any proper conversion process should be held to resource usage limits and to a limit on how much time it can spend doing the work. However, re-establishment of processes (killing and re-spawning) can sometimes take quite a while (especially for virus scanners, which require a lot of definition data to be loaded).
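A minimal Python sketch of holding a converter to such limits on a UNIX-like host, using a wall-clock timeout plus per-process resource limits. The function name and the parameters are assumptions; a production setup would more likely rely on containers or cgroups.

```python
import resource
import subprocess


def run_limited(command, timeout_seconds, cpu_seconds, memory_bytes=None):
    """Run `command` with a wall-clock timeout and per-process resource limits.

    Returns the CompletedProcess, or None when the timeout was hit (the child
    is killed in that case). UNIX-only, since it relies on setrlimit.
    """
    def apply_limits():
        # Runs in the child just before exec: cap CPU time and, optionally,
        # the size of the address space.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        if memory_bytes is not None:
            resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    try:
        return subprocess.run(command, capture_output=True,
                              timeout=timeout_seconds, preexec_fn=apply_limits)
    except subprocess.TimeoutExpired:
        return None
```

The timeout catches converters that hang without burning CPU (waiting on the network, say), while the CPU limit catches tight loops such as the PDF example.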

File Lifecycle: Inspection
Possible Solution
Binary inspection of all incoming files. In UNIX-like systems this can be done with the file(1) command.
➤ file(1) - Linux man page  https://linux.die.net/man/1/file

File Lifecycle: Inspection
Possible Solution
Usually the file(1) command is backed by libmagic. It is a pattern matcher which scans the binary file against pre-defined patterns, and returns the most likely match.
➤ You can write your own magic if desired.
➤ libmagic(3) - Linux man page  https://linux.die.net/man/3/libmagic
➤ Guide to using filemagic  https://filemagic.readthedocs.io/en/latest/guide.html
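To show the idea without depending on libmagic, here is a deliberately tiny Python sketch that matches a handful of well-known leading byte signatures and checks them against the file extension. The signature table, the extension map and the function names are assumptions for illustration; libmagic's database is far more thorough.

```python
import os

# (content type, leading bytes) for a few common formats.
SIGNATURES = [
    ("image/jpeg", b"\xff\xd8\xff"),
    ("image/png", b"\x89PNG\r\n\x1a\n"),
    ("image/gif", b"GIF87a"),
    ("image/gif", b"GIF89a"),
    ("application/pdf", b"%PDF-"),
    ("application/zip", b"PK\x03\x04"),  # also the container used by DOCX
]

# What each extension is expected to contain.
EXPECTED = {
    ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png",
    ".gif": "image/gif", ".pdf": "application/pdf", ".docx": "application/zip",
}


def sniff(head):
    """Return the content type whose signature matches the first bytes, if any."""
    for content_type, magic in SIGNATURES:
        if head.startswith(magic):
            return content_type
    return None


def is_congruent(filename, head):
    """True when the extension is known and agrees with the sniffed content."""
    expected = EXPECTED.get(os.path.splitext(filename)[1].lower())
    return expected is not None and sniff(head) == expected
```

A DOCX that was merely renamed to .pdf starts with the ZIP signature, so `is_congruent("report.pdf", head)` comes back False and the file can be rejected before it reaches a converter.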

File Lifecycle: Conversion
Conversion: Creation of new Representations (e.g. thumbnails) as needed
➤ Usually done for images
➤ Sometimes for video files (still frames) and documents (page images)
➤ Lesser known: Album art from ID3 tags (MP3), etc.

File Lifecycle: Conversion
Discussion
➤ Images are quite easy to deal with, but be careful with colour space conversions (for example CMYK to RGB) and with resampling.
➤ VIPS HOWTO: Image shrinking  https://github.com/jcupitt/libvips/wiki/HOWTO----Image-shrinking
➤ Tip: larger JPEGs with a lower Quality setting still look quite good on higher resolution displays, and can be smaller in size too.

File Lifecycle: Storage
Storage: Storing of the file in a storage service for later retrieval
If you’ve done the S3 ingestion route then Storage is largely taken care of. Otherwise you will have to ensure that the underlying block storage device is large enough and is taken care of by operators. It may also be a good idea to have a staged retention policy to get rid of old files, so as not to keep growing the amount of data stored.
➤ This could be done via Lifecycle Policy, or Object Expiration, and notifications sent via S3 Event Notifications.

File Lifecycle: Presentation
Presentation: Making the File and/or its Representations available
Usually people will vend a signed link directly, and this may be adequate for your uses. In any case, consider tracking file names separately (during Ingestion) and vending the link with the correct name in response-content-disposition:
➤ attachment; filename*=UTF-8''${encoded_filename}
Percent-encode anything outside of A-Za-z0-9
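The encoding rule can be sketched in a few lines of Python. The function names are made up for the example; the rule implemented is exactly the conservative one stated above (every UTF-8 byte outside A-Za-z0-9 is percent-encoded).

```python
import string

# ASCII letters and digits, as byte values; everything else gets encoded.
_SAFE = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def encode_filename(name):
    """Percent-encode every UTF-8 byte of `name` outside A-Za-z0-9."""
    return "".join(chr(byte) if byte in _SAFE else f"%{byte:02X}"
                   for byte in name.encode("utf-8"))


def content_disposition(name):
    """Build the value to pass as response-content-disposition."""
    return f"attachment; filename*=UTF-8''{encode_filename(name)}"
```

The resulting value still has to be URL-encoded once more when it is placed in the query string of the signed link.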

Various Components
➤ Arc (ExAWS + Ecto)  https://github.com/stavro/arc
  ➤ Embed in your Phoenix application
➤ EvaporateJS  https://github.com/TTLabs/EvaporateJS
  ➤ JS package to help you with multipart uploads

Various Solutions
➤ AWS Services
  ➤ Elastic Transcoder (i.e. hosted ZenCoder)
  ➤ Lambda Functions calling libVIPS with Sharp (NPM Package)
➤ CloudConvert  https://cloudconvert.com
  ➤ Quite good results from our internal testing
➤ FEG’s own sausage factory
  ➤ Higher-level solution; REST API + Webhooks; 3.34m+ transactions

Takeaway
➤ Try to set appropriate boundaries: total size, number of parts, etc.
➤ Try to process everything in isolation
➤ Use a dedicated service to deal with files if needed
  ➤ Try not having ImageMagick / FFmpeg / LibAV everywhere
  ➤ Minimises patching workload: single point of audit
