Practical & Impractical File Operations (in Phoenix)

Evadne Wu
September 26, 2017

A high level view of various file handling techniques that deliver high performance and smooth user experience while retaining security.

Transcript

  1. Practical & Impractical File Operations

    Evadne Wu
    github.com/evadne ev@radi.ws // @evadne
    last updated 30 September 2017
  2. Structure

    File Lifecycle: what really needs to happen when a file is being uploaded
    ➤ Phoenix-specific discussions
    ➤ General discussions
    Various Solutions: COTS, OSS, etc
    ➤ NB: we do not use all of them
    Q&A
  3. File Lifecycle

    Selection: User selects one or many files
    Ingestion: Transfer of files from User to Service
    Inspection: Verification of files’ integrity and congruence
    Conversion: Creation of new Representations (e.g. thumbnails) as needed
    Storage: Storing of the file in a block or object storage service for later retrieval
    Presentation: Making the File and/or its Representations available to the User
  4. File Lifecycle: Selection

    Selection: User selects one or many files
    ➤ With drag & drop operation or a traditional form input
    ➤ Preliminary validation in JavaScript takes place here
    ➤ Reports MIME type + file size
    ➤ NB: Browsers can lie about MIME type and any enterprising person can fake size / MIME type information
    ➤ NB: Some browsers report incorrect file sizes
  5. File Lifecycle: Selection

    File.name, File.type, File.size, etc. available via vanilla JavaScript APIs
    ➤ MDN: File
      https://developer.mozilla.org/en-US/docs/Web/API/File
    ➤ MDN: FileList
      https://developer.mozilla.org/en-US/docs/Web/API/FileList
  6. File Lifecycle: Selection

    However, you should not trust MIME types returned by the browser.
    ➤ WebKit infers content type from path extension
      https://github.com/WebKit/webkit/blob/master/Source/WebCore/fileapi/File.cpp#L124
    If you are serious about the files you get, you will have to validate them yourself and implement rejection logic for files that were merely renamed.
    ➤ For example… “Converting DOCX to PDF by renaming”
  7. File Lifecycle: Selection

    Drag & Drop Solution A: Classic Solution via Big Input
    ➤ Make a big and translucent <input> element, listen for the change event, then query the FileList to get the actual files.
      e.g. https://www.sitepoint.com/html5-file-drag-and-drop
  8. File Lifecycle: Selection

    Drag & Drop Solution B: Fancy Solution via proper HTML5 Drag & Drop
    ➤ MDN: File drag and drop
      https://developer.mozilla.org/en-US/docs/Web/API/HTML_Drag_and_Drop_API/File_drag_and_drop
  9. File Lifecycle: Ingestion

    Ingestion: Transfer of files from User to Service
    ➤ In most cases, a plain HTTP POST will do
    ➤ However, you may want a more scalable solution
    ➤ I’ll explain why
  10. File Lifecycle: Ingestion (Basic)

    ➤ Accept HTTP POST with file in payload
    ➤ e.g. with file_input, from phoenix_html
      https://hexdocs.pm/phoenix_html/Phoenix.HTML.Form.html#file_input/3
    ➤ Ingest file in Phoenix, then store it elsewhere
    ➤ Plug.Upload
      https://hexdocs.pm/plug/Plug.Upload.html
  11. File Lifecycle: Ingestion (Basic)

    Problem 1: Memory pressure (due to API implementation)
    ➤ Plug.Parsers.MULTIPART requests a temporary file
    ➤ Plug.Upload provides a temporary file and monitors the requesting process
    ➤ Request continues to be read (approximately 1MB at a time)
    ➤ Temporary file is written to, incrementally
    ➤ Path to the temporary file is exposed downstream
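The read-a-chunk, write-a-chunk loop described above can be sketched in a few lines. This is a language-neutral illustration in Python, not Plug's actual implementation; the function name `spool_to_tempfile` and the chunk size constant are assumptions made for the example.

```python
import io
import os
import tempfile

CHUNK_SIZE = 1_000_000  # roughly 1MB per read, mirroring the description above


def spool_to_tempfile(body, chunk_size=CHUNK_SIZE):
    """Copy a request body to a temporary file, one chunk at a time.

    `body` is any object with a read(n) method (hypothetical stand-in for
    the incoming request). Returns (path, bytes_written); the caller owns
    the file and is responsible for deleting it.
    """
    fd, path = tempfile.mkstemp(prefix="upload-")
    written = 0
    with os.fdopen(fd, "wb") as out:
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break  # end of request body
            out.write(chunk)
            written += len(chunk)
    return path, written
```

Only one chunk is held in memory at any moment, but the whole payload still ends up on disk before anything downstream can see it.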
  12. File Lifecycle: Ingestion (Basic)

    Problem 2: Disk I/O pressure (due to API design)
    ➤ Entire file needs to be written to disk first
    ➤ This is because the file needs to be presented as a path
    ➤ This may or may not be a problem
    ➤ The more concurrent uploads you have, the more disk I/O you need to do
  13. File Lifecycle: Ingestion (Basic)

    Problem 3: Network pressure (due to application implementation)
    ➤ Related to choice of file storage mechanisms
    ➤ Local directory / DRBD / other sorts of block storage devices
    ➤ AWS S3, Google Cloud Storage, Azure Storage, etc
    ➤ Either way there will be traffic
    ➤ Internet to Phoenix to Block Storage / Object Storage
    ➤ QoS enforcement = £££
  14. File Lifecycle: Ingestion (S3)

    The front-end can send the entire file to S3 or somewhere else
    ➤ S3 Uploads: up to 5GB per request
    ➤ S3 Multipart Uploads: 5MB – 5GB per part (the last part may be smaller), 10,000 parts total
    ➤ AWS: “In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.”
      http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
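The part limits above imply a small piece of arithmetic when choosing a part size. A minimal sketch, assuming the limits are the binary units S3 uses (5 MiB, 5 GiB); `plan_parts` is a hypothetical helper, not part of any SDK.

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB          # minimum part size (every part except the last)
MAX_PART = 5 * 1024 * MIB   # maximum part size
MAX_PARTS = 10_000          # maximum number of parts per upload


def plan_parts(total_size, preferred_part=MIN_PART):
    """Return (part_size, part_count) for an upload of total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Grow the part size until the whole object fits within 10,000 parts.
    part_size = max(preferred_part, MIN_PART, math.ceil(total_size / MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    return part_size, math.ceil(total_size / part_size)
```

For a 100 MiB file this yields twenty 5 MiB parts; part sizes only start growing once the file exceeds 10,000 × 5 MiB.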
  15. File Lifecycle: Ingestion (S3)

    Proposed Strategy
    ➤ Make 2 buckets (one permanent, one temporary)
    ➤ Attach a Lifecycle Policy on the Temporary bucket
    ➤ Clean up all lingering multi-part uploads after a certain amount of time
    ➤ Clean up all remaining temporary uploads after a certain amount of time
    ➤ Customers upload to Temporary bucket only
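As an illustration, the two clean-up rules could look roughly like the following lifecycle configuration, built here as a plain Python dictionary. The rule IDs and the day counts are made-up values; the overall shape follows the S3 lifecycle configuration format, and nothing is sent to AWS in this sketch.

```python
import json


def temporary_bucket_lifecycle(abort_after_days=1, expire_after_days=7):
    """Build a lifecycle configuration for the temporary bucket.

    The default day counts are placeholders; pick values that suit your
    own upload and processing times.
    """
    return {
        "Rules": [
            {
                # Rule 1: clean up lingering multipart uploads.
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_after_days
                },
            },
            {
                # Rule 2: clean up whatever is still in the bucket.
                "ID": "expire-temporary-objects",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": expire_after_days},
            },
        ]
    }


print(json.dumps(temporary_bucket_lifecycle(), indent=2))
```

An empty prefix makes both rules apply to every object in the bucket, which is the reason for keeping temporary uploads in a bucket of their own.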
  16. File Lifecycle: Ingestion (S3)

    Proposed Strategy
    ➤ You can use AWS STS (Security Token Service) to create credentials that have additional restrictions.
    ➤ For example, create a UUID, then create a token with s3:PutObject rights, but only against a specific ARN (which uses said UUID).
    ➤ NB: s3:PutObject still allows overwriting an existing object
    ➤ There is, however, no way to enforce upload size using this solution
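A sketch of the kind of restricted policy described above, built as a plain dictionary. The bucket name and the `uploads/` key layout are assumptions for the example; the resulting JSON is the sort of document that can be supplied as the session policy when requesting temporary credentials from STS.

```python
import json
import uuid


def scoped_upload_policy(bucket, upload_id):
    """Allow s3:PutObject against exactly one object key.

    `bucket` and the "uploads/" prefix are hypothetical; adapt them to
    your own naming scheme.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/uploads/{upload_id}",
            }
        ],
    }


upload_id = str(uuid.uuid4())
print(json.dumps(scoped_upload_policy("example-temporary-bucket", upload_id)))
```

Nothing in this policy limits how many bytes are written to the key, which is the size-enforcement gap noted above.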
  17. File Lifecycle: Ingestion (S3)

    Proposed Strategy
    ➤ You should decide between vending signatures for each part or vending the credential directly.
    ➤ If vending signatures for each part, you can put additional limitations in place.
    ➤ If vending credentials, you are unable to put size limitations in place.
  18. File Lifecycle: Ingestion (S3)

    Proposed Strategy
    ➤ Each Upload Part request has Content-Length, Content-MD5 and Expect headers.
    ➤ AWS: Upload Part
      http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html
    ➤ You could use a signed structure to hold an accumulator for bytes already uploaded, and enforce total upload size this way by a) vending signatures derived from a fixed Content-Length, and b) refusing to sign extraneous parts.
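One possible shape for the signed accumulator is an HMAC-protected token that the client hands back with every request for a part signature. This is a sketch under stated assumptions: `SECRET`, `issue_token`, `read_token` and `authorise_part` are hypothetical names, and the step that actually signs the Upload Part request is left out.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"replace-with-a-server-side-secret"  # hypothetical key, never sent to clients


def _sign(payload):
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(upload_id, uploaded=0):
    """Serialise the accumulator and append an HMAC so clients cannot edit it."""
    payload = json.dumps(
        {"upload_id": upload_id, "uploaded": uploaded}, sort_keys=True
    ).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii") + "." + _sign(payload)


def read_token(token):
    """Verify the HMAC and return the accumulator; raise ValueError if tampered."""
    encoded, signature = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(encoded)
    if not hmac.compare_digest(signature, _sign(payload)):
        raise ValueError("token signature mismatch")
    return json.loads(payload)


def authorise_part(token, content_length, max_total):
    """Return an updated token if the part fits under max_total, else None."""
    state = read_token(token)
    new_total = state["uploaded"] + content_length
    if content_length <= 0 or new_total > max_total:
        return None  # refuse to sign extraneous parts
    # A real service would also return a part signature bound to content_length.
    return issue_token(state["upload_id"], new_total)
```

This sketch does not stop a client from replaying an older token; binding tokens to part numbers or tracking them server-side would be needed for that.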
  19. File Lifecycle: Ingestion (S3)

    Proposed Strategy
    ➤ Alternatively, if the expected payload is small (<100MB) you can vend a signature with additional bits for the client to use with a single request.
    ➤ Create the POST policy with starts-with, content-length-range, etc.
    ➤ AWS: Creating a POST Policy
      http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
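A POST policy with these conditions might be assembled as follows. This sketch builds and Base64-encodes the policy document only; the credential, algorithm and date conditions that Signature Version 4 requires, and the signing step itself, are deliberately omitted. The function names and the 15-minute validity window are assumptions.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def build_post_policy(bucket, key_prefix, max_bytes,
                      valid_for=timedelta(minutes=15), now=None):
    """Build a POST policy restricting the key prefix and the upload size."""
    now = now or datetime.now(timezone.utc)
    return {
        "expiration": (now + valid_for).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],
            # Inclusive byte range; uploads outside it are rejected.
            ["content-length-range", 1, max_bytes],
        ],
    }


def encode_policy(policy):
    """Base64-encode the policy JSON, the form in which it is signed and posted."""
    return base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")
```

Unlike vended credentials, the content-length-range condition lets the storage service itself enforce the size limit.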
  20. File Lifecycle: Ingestion (S3)

    Discussion
    ➤ One other benefit of using multipart uploads is the ability to resume in case of network interruption, or perhaps when somebody has yanked the power cable, or when a USB storage device is removed.
    ➤ Therefore this is beneficial even for smaller files (say, 25MB – 50MB) that are below AWS recommended thresholds.
    ➤ You should analyse this based on your own use cases.
  21. File Lifecycle: Inspection

    Inspection: Verification of files’ integrity and congruence
    There are actually two goals of such a step:
    ➤ Goal 1: Identifying files that fall outside of the acceptance criteria
    ➤ Goal 2: Preventing bad files from reaching conversion processes
    I will explain shortly why both goals should be put in this step.
  22. File Lifecycle: Inspection

    Goal 1: Actual file content / integrity verification
    ➤ Verification of artefacts’ fitness for purpose
    ➤ Verification of absence of unwanted content
    ➤ Generic: VBA macros, PUAs, embedded JavaScript in PDF, XFAs, etc.
    ➤ Specific: business-specific content e.g. empty forms or templates
    ➤ Verification of file name / type / content congruence
  23. File Lifecycle: Inspection

    Goal 2: Rejection of internally incongruent files
    ➤ Either malicious or plain broken; a valid attack vector either way
    ➤ Most converters are designed to crash when a bad file is sent
    ➤ Re-establishing processes takes time and can cause dips in throughput
    ➤ Some converters tolerate ambiguity and do the wrong thing
  24. File Lifecycle: Inspection

    Example: ImageTragick
    ➤ Make ImageMagick issue HTTP GET / FTP Requests
    ➤ Basically anything your host / container can do
    ➤ Potent mix if used to retrieve EC2 Instance Metadata or ECS Task Role
    ➤ Could be worked around if ImageMagick is not used, or if only congruent images were sent (i.e. actual JPEGs, PNGs, etc)
  25. File Lifecycle: Inspection

    Example: Infinite loops in PDF Catalogue
    ➤ Basically DoS attack by holding up conversion processes
    ➤ Certified programming with dependent types
      (Because the future of defense is liberal application of math)
      https://media.ccc.de/v/cccamp11-4426-certified_programming_with_dependent_types-en
    ➤ “Six year old PDF loop bug affects most major implementations”
      https://blog.fuzzing-project.org/59-Six-year-old-PDF-loop-bug-affects-most-major-implementations.html
  26. File Lifecycle: Inspection

    Discussion
    Any proper conversion process should be held to resource usage limits and be subject to limits on how much time it can spend doing the work. However, re-establishment of processes (killing and re-spawning) can sometimes take quite a while (especially for virus scanners, which require a lot of definition data to be loaded).
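Both kinds of limit can be applied when spawning a converter as a child process. A minimal sketch for UNIX-like systems, using a wall-clock timeout plus a CPU-time limit; `run_limited` is a hypothetical helper, and the command you pass in would be whichever converter you actually run.

```python
import resource
import subprocess
import sys


def run_limited(argv, wall_seconds, cpu_seconds):
    """Run argv with a wall-clock timeout and a CPU-time limit.

    Returns the exit code, or None if the wall-clock limit was hit
    (in which case the child is killed).
    """
    def apply_limits():
        # Runs in the child just before exec; the kernel terminates the
        # process once it has used more than cpu_seconds of CPU time.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    try:
        completed = subprocess.run(
            argv,
            preexec_fn=apply_limits,
            timeout=wall_seconds,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

The two limits cover different failure modes: the CPU limit catches busy loops, while the wall-clock timeout also catches a converter that is merely stuck waiting.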
  27. File Lifecycle: Inspection

    Possible Solution
    Binary inspection of all incoming files. In UNIX-like systems this can be done with the file(1) command.
    ➤ file(1) - Linux man page
      https://linux.die.net/man/1/file
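Calling file(1) from application code can be as simple as the following sketch. It assumes the `file` binary is installed and on the PATH, and returns None when it is not; `detect_mime_type` is a hypothetical helper name.

```python
import shutil
import subprocess


def detect_mime_type(path):
    """Ask file(1) for the MIME type of the file at path.

    Returns a string such as "text/plain", or None if the `file`
    command is not available on this system.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        # "--" stops option parsing, in case the path starts with a dash.
        ["file", "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The answer is derived from the file's content rather than its name, so a renamed file is reported as what it really is.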
  28. File Lifecycle: Inspection

    Possible Solution
    Usually the file(1) command is backed by libmagic. It is a pattern matcher which scans the binary file against pre-defined patterns, and returns the most likely match.
    ➤ You can write your own magic if desired.
    ➤ libmagic(3) - Linux man page
      https://linux.die.net/man/3/libmagic
    ➤ Guide to using filemagic
      https://filemagic.readthedocs.io/en/latest/guide.html
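To show the idea behind pattern matching on file content, here is a tiny hand-rolled stand-in that checks a handful of well-known leading byte signatures and compares the result with the file extension. It is not libmagic and covers only a few formats; the tables and function names are assumptions made for this example.

```python
import os

# (leading bytes, MIME type) pairs for a few common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # DOCX and XLSX are ZIP containers too
]

# What each extension claims the content to be.
EXTENSIONS = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".pdf": "application/pdf",
    ".zip": "application/zip",
}


def sniff(data):
    """Return the MIME type suggested by the leading bytes, or None."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


def is_congruent(filename, data):
    """True only if the extension is known and agrees with the content."""
    expected = EXTENSIONS.get(os.path.splitext(filename)[1].lower())
    return expected is not None and sniff(data) == expected
```

A DOCX file renamed to `report.pdf` starts with the ZIP signature, so `is_congruent` rejects it.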
  29. File Lifecycle: Conversion

    Conversion: Creation of new Representations (e.g. thumbnails) as needed
    ➤ Usually done for images
    ➤ Sometimes for video files (still frames) and documents (page images)
    ➤ Lesser known: Album art from ID3 tags (MP3), etc.
  30. File Lifecycle: Conversion

    Discussion
    ➤ Images are quite easy to deal with, but be careful with colour space conversions (for example CMYK to RGB) and with resampling.
    ➤ VIPS - HOWTO - Image shrinking
      https://github.com/jcupitt/libvips/wiki/HOWTO----Image-shrinking
    ➤ Tip: larger JPEGs with lower Quality setting still look quite good on higher resolution displays, and can be smaller in size too.
  31. File Lifecycle: Storage

    Storage: Storing of the file in a storage service for later retrieval
    If you’ve done the S3 ingestion route then Storage is largely taken care of. Otherwise you will have to ensure that the underlying block storage device is large enough and is taken care of by operators. It may also be a good idea to have a staged retention policy to get rid of old files, so as not to keep growing the amount of data stored.
    ➤ This could be done via Lifecycle Policy, or Object Expiration, and notifications sent via S3 Event Notifications.
  32. File Lifecycle: Presentation

    Presentation: making the File and/or its Representations available
    Usually people will vend a signed link directly, and this may be adequate for your uses. In any case, consider tracking file names separately (during Ingestion) and vending the link with the correct name in response-content-disposition:
    ➤ attachment; filename*=UTF-8''${encoded_filename}
    Percent-encode anything outside of A-Za-z0-9
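The encoding rule above is stricter than most library helpers, which leave characters such as `.` and `-` untouched, so this sketch encodes byte by byte. The function names are hypothetical.

```python
import string

# Only ASCII letters and digits pass through unencoded.
SAFE_BYTES = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def percent_encode(name):
    """Percent-encode every UTF-8 byte outside A-Za-z0-9."""
    return "".join(
        chr(byte) if byte in SAFE_BYTES else f"%{byte:02X}"
        for byte in name.encode("utf-8")
    )


def content_disposition(filename):
    """Build the value to pass as response-content-disposition."""
    return f"attachment; filename*=UTF-8''{percent_encode(filename)}"


print(content_disposition("résumé 2017.pdf"))
# attachment; filename*=UTF-8''r%C3%A9sum%C3%A9%202017%2Epdf
```

Because the original name is tracked separately, the object key itself can stay an opaque identifier while the user still downloads a file with a meaningful name.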
  33. Various Components

    ➤ Arc (ExAWS + Ecto) https://github.com/stavro/arc
    ➤ Embed in your Phoenix application
    ➤ EvaporateJS https://github.com/TTLabs/EvaporateJS
    ➤ JS package to help you with multipart uploads
  34. Various Solutions

    ➤ AWS Services
    ➤ Elastic Transcoder (i.e. hosted ZenCoder)
    ➤ Lambda Functions calling libVIPS with Sharp (NPM Package)
    ➤ CloudConvert https://cloudconvert.com
    ➤ Quite good results from our internal testing
    ➤ FEG’s own sausage factory
    ➤ Higher-level solution; REST API + Webhooks; 3.34m+ transactions
  35. Takeaway

    ➤ Try to set appropriate boundaries: total size, number of parts, etc.
    ➤ Try to process everything in isolation
    ➤ Use a dedicated service to deal with files if needed
    ➤ Try not having ImageMagick / FFmpeg / LibAV everywhere
    ➤ Minimises patching workload; single point of audit