➤ Ingestion: transfer of files from User to Service
➤ Inspection: verification of files’ integrity and congruence
➤ Conversion: creation of new Representations (e.g. thumbnails) as needed
➤ Storage: storing the file in a block or object storage service for later retrieval
➤ Presentation: making the File and/or its Representations available to the User
➤ With drag & drop operation or a traditional form input ➤ Preliminary validation in JavaScript takes place here ➤ Reports MIME type + file size ➤ NB: Browsers can lie about MIME type and any enterprising person can fake size / MIME type information ➤ NB: Some browsers report incorrect file sizes
➤ Do not trust the MIME type returned by the browser. ➤ WebKit infers the content type from the path extension: https://github.com/WebKit/webkit/blob/master/Source/WebCore/fileapi/File.cpp#L124 ➤ If you are serious about the files you receive, you will have to validate them yourself and implement rejection logic for files that were merely renamed. ➤ For example… “Converting DOCX to PDF by renaming”
Drag & drop via a big <input> ➤ Make a big, translucent <input> element, listen for its change event, then query the FileList to get the actual files. ➤ e.g. https://www.sitepoint.com/html5-file-drag-and-drop
File in the request payload ➤ e.g. with file_input/3 from phoenix_html: https://hexdocs.pm/phoenix_html/Phoenix.HTML.Form.html#file_input/3 ➤ Ingest the file in Phoenix, then store it elsewhere ➤ Plug.Upload: https://hexdocs.pm/plug/Plug.Upload.html
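For a sense of what arrives server-side, here is a minimal controller sketch, assuming a standard Phoenix project; the module name, the "document" field and the destination path are illustrative:

    defmodule MyAppWeb.UploadController do
      use MyAppWeb, :controller

      # A `file_input f, :document` field in a multipart form shows up here as
      # a %Plug.Upload{} struct carrying a temporary path plus the
      # client-reported filename and content type (both of which can lie).
      def create(conn, %{"upload" => %{"document" => %Plug.Upload{} = upload}}) do
        dest = Path.join("/var/uploads", Path.basename(upload.filename))

        # Copy the file out before the request finishes: Plug.Upload deletes
        # the temporary file once the owning process goes away.
        File.cp!(upload.path, dest)

        text(conn, "stored #{Path.basename(dest)}")
      end
    end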
How the upload is handled (Plug API implementation) ➤ Plug.Parsers.MULTIPART requests a temporary file ➤ Plug.Upload provides the temporary file and monitors the requesting process ➤ The request body continues to be read (approximately 1 MB at a time) ➤ The temporary file is written to incrementally ➤ The path to the temporary file is exposed downstream
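For reference, a typical endpoint configuration where these knobs live; the parser list and the limits below are illustrative, not prescriptive:

    # In the Phoenix endpoint. :length caps the total request body size;
    # :read_length controls how much is read from the socket per call,
    # which is where the "approximately 1 MB at a time" behaviour comes from.
    plug Plug.Parsers,
      parsers: [:urlencoded, :multipart, :json],
      pass: ["*/*"],
      json_decoder: Jason,
      length: 100_000_000,
      read_length: 1_000_000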
Limitations (due to API design) ➤ The entire file needs to be written to disk first ➤ This is because the file needs to be presented as a path ➤ This may or may not be a problem ➤ The more concurrent uploads you have, the more disk I/O you need to do
Limitations (due to application implementation) ➤ Related to the choice of file storage mechanism ➤ Local directory / DRBD / other sorts of block storage devices ➤ AWS S3, Google Cloud Storage, Azure Storage, etc. ➤ Either way there will be traffic: Internet to Phoenix to block storage / object storage ➤ QoS enforcement = £££
Direct upload: have the client send the file to S3 or somewhere else ➤ S3 single-request uploads: up to 5 GB per request ➤ S3 multipart uploads: 5 MB – 5 GB per part, 10,000 parts in total ➤ AWS: “In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.” http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
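One way to vend that direct-to-S3 upload from Elixir, sketched with the ex_aws / ex_aws_s3 packages (bucket name, key prefix and expiry are placeholders): the server hands out a short-lived presigned PUT URL and never touches the file body.

    # Generate a random key and presign a PUT against the temporary bucket;
    # the client then uploads straight to S3 using this URL.
    key = "incoming/" <> Base.encode16(:crypto.strong_rand_bytes(16), case: :lower)

    {:ok, url} =
      ExAws.Config.new(:s3)
      |> ExAws.S3.presigned_url(:put, "my-temporary-bucket", key, expires_in: 900)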
Use two buckets (one permanent, one temporary) ➤ Attach a Lifecycle Policy to the Temporary bucket ➤ Clean up all lingering multipart uploads after a certain amount of time ➤ Clean up all remaining temporary uploads after a certain amount of time ➤ Customers upload to the Temporary bucket only
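The two rules that Lifecycle Policy needs, sketched as data; the field names mirror the S3 PutBucketLifecycleConfiguration API, while the prefix, rule IDs and day counts are illustrative:

    # Rules for the Temporary bucket: abort multipart uploads nobody
    # completed, and expire leftover temporary objects.
    lifecycle_rules = [
      %{
        "ID" => "abort-stale-multipart-uploads",
        "Status" => "Enabled",
        "Filter" => %{"Prefix" => "incoming/"},
        "AbortIncompleteMultipartUpload" => %{"DaysAfterInitiation" => 2}
      },
      %{
        "ID" => "expire-unclaimed-temporary-uploads",
        "Status" => "Enabled",
        "Filter" => %{"Prefix" => "incoming/"},
        "Expiration" => %{"Days" => 7}
      }
    ]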
Use AWS STS (Security Token Service) to create credentials that have additional restrictions. ➤ For example, create a UUID, then create a token with s3:PutObject rights, but only against a specific ARN (which uses said UUID). ➤ NB: s3:PutObject still allows overwriting an existing object ➤ There is, however, no way to enforce an upload size limit with this approach
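The shape of that scoped-down policy, as a sketch; the bucket name and key prefix are placeholders, and the actual STS call (GetFederationToken / AssumeRole) is left to whichever AWS client you use:

    # Pin the temporary credential to a single, freshly generated object key.
    upload_id = Base.encode16(:crypto.strong_rand_bytes(16), case: :lower)

    policy = %{
      "Version" => "2012-10-17",
      "Statement" => [
        %{
          "Effect" => "Allow",
          "Action" => ["s3:PutObject"],
          "Resource" => ["arn:aws:s3:::my-temporary-bucket/incoming/#{upload_id}"]
        }
      ]
    }

    # JSON-encode this map and pass it as the Policy parameter of the STS
    # call; the resulting credentials can only put that one object.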
For multipart uploads, you can choose between vending signatures for each part or vending the credentials directly. ➤ If vending signatures for each part, you can put additional limitations in place. ➤ If vending credentials, you are unable to put size limitations in place.
Each Upload Part request has Content-Length, Content-MD5 and Expect headers. ➤ AWS: Upload Part http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html ➤ You could use a signed structure to hold an accumulator for bytes already uploaded, and enforce the total upload size this way by a) vending signatures derived from a fixed Content-Length, and b) refusing to sign extraneous parts.
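A sketch of that signed accumulator using Phoenix.Token; the salt, the 5 GB budget and the function names are illustrative, and the actual SigV4 part-signing is only indicated by a comment:

    defmodule UploadBudget do
      @salt "upload-budget"
      @max_total 5_000_000_000

      # The client sends back the signed state from the previous part
      # together with the Content-Length of the part it wants to upload next.
      def authorise_part(endpoint, signed_state, part_length) do
        {:ok, %{uploaded: uploaded} = state} =
          Phoenix.Token.verify(endpoint, @salt, signed_state, max_age: 3600)

        if uploaded + part_length <= @max_total do
          # Sign the UploadPart request over this exact Content-Length here
          # (SigV4), then hand the updated accumulator back to the client.
          new_state = %{state | uploaded: uploaded + part_length}
          {:ok, Phoenix.Token.sign(endpoint, @salt, new_state)}
        else
          {:error, :budget_exceeded}
        end
      end
    end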
If the expected payload is small (< 100 MB), you can vend a signature with additional constraints for the client to use in a single request. ➤ Create the POST policy with starts-with, content-length-range, etc. ➤ AWS: Creating a POST Policy http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
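What such a POST policy document looks like before it is JSON-encoded, base64-encoded and signed (bucket, key prefix, expiry and size range are illustrative):

    # Conditions the browser-based POST upload must satisfy; S3 itself
    # rejects anything outside the declared content-length-range.
    policy = %{
      "expiration" => "2025-01-01T00:00:00Z",
      "conditions" => [
        %{"bucket" => "my-temporary-bucket"},
        ["starts-with", "$key", "incoming/"],
        ["content-length-range", 1, 100_000_000]
      ]
    }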
A further benefit of using multipart uploads is the ability to resume after a network interruption, or when somebody yanks the power cable, or when a USB storage device is removed. ➤ This makes them beneficial even for smaller files (say, 25 MB – 50 MB) that are below the AWS-recommended threshold. ➤ You should analyse this based on your own use cases.
There are actually two goals to such a step: ➤ Goal 1: identifying files that fall outside of the acceptance criteria ➤ Goal 2: preventing bad files from reaching conversion processes I will explain shortly why both goals belong in this step.
Inspection involves several kinds of verification ➤ Verification of artefacts’ fitness for purpose ➤ Verification of the absence of unwanted content ➤ Generic: VBA macros, PUAs, embedded JavaScript in PDF, XFAs, etc. ➤ Specific: business-specific content, e.g. empty forms or templates ➤ Verification of file name / type / content congruence
➤ Either malicious or plain broken; a valid attack vector either way ➤ Most converters are designed to crash when a bad file is sent ➤ Re-establishing processes takes time and can cause dips in throughput ➤ Some converters tolerate ambiguity and do the wrong thing
Maliciously crafted files can make ImageMagick issue HTTP GET / FTP requests ➤ Basically anything your host / container can do ➤ A potent mix if used to retrieve EC2 Instance Metadata or an ECS Task Role ➤ Can be worked around if ImageMagick is not used, or if only congruent images were sent (i.e. actual JPEGs, PNGs, etc.)
Basically a DoS attack that ties up conversion processes ➤ Certified programming with dependent types (Because the future of defense is liberal application of math) https://media.ccc.de/v/cccamp11-4426-certified_programming_with_dependent_types-en ➤ “Six year old PDF loop bug affects most major implementations” https://blog.fuzzing-project.org/59-Six-year-old-PDF-loop-bug-affects-most-major-implementations.html
Each conversion process should be held to resource usage limits and be subject to limits on how much time it can spend doing the work. However, re-establishing processes (killing and re-spawning) can sometimes take quite a while, especially for virus scanners, which need to load a lot of definition data.
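A low-tech way to enforce the time limit, assuming a Unix-like host with coreutils timeout(1) available and ImageMagick's convert as the example converter; the limits are illustrative:

    defmodule Converter do
      # Run the converter under timeout(1): SIGTERM after 30 s, SIGKILL 5 s
      # later if it does not exit. timeout(1) exits with 124 on a timeout.
      def convert_with_timeout(input_path, output_path) do
        args = ["--kill-after=5", "30", "convert", input_path, output_path]

        case System.cmd("timeout", args, stderr_to_stdout: true) do
          {_output, 0} -> {:ok, output_path}
          {_output, 124} -> {:error, :timed_out}
          {output, status} -> {:error, {:converter_failed, status, output}}
        end
      end
    end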
Checking what a file actually contains is supported by libmagic. It is a pattern matcher which scans the binary file against pre-defined patterns and returns the most likely match. ➤ You can write your own magic if desired. ➤ libmagic(3) - Linux man page https://linux.die.net/man/3/libmagic ➤ Guide to using filemagic https://filemagic.readthedocs.io/en/latest/guide.html
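From Elixir, a simple route is to shell out to the file(1) command, which is backed by libmagic; a sketch (it assumes the CLI is installed on the host):

    defmodule FileSniffer do
      # Ask file(1) for the bare MIME type of the uploaded temporary file,
      # then compare it against whatever the client claimed.
      def sniff_mime_type(path) do
        case System.cmd("file", ["--brief", "--mime-type", path]) do
          {mime, 0} -> {:ok, String.trim(mime)}
          {output, status} -> {:error, {:file_failed, status, output}}
        end
      end
    end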
Creation of new Representations as needed ➤ Usually done for images ➤ Sometimes for video files (still frames) and documents (page images) ➤ Lesser known: album art from ID3 tags (MP3), etc.
Image thumbnails are relatively easy to deal with, but be careful with colour space conversions (for example CMYK to RGB) and with resampling. ➤ VIPS — HOWTO — Image shrinking https://github.com/jcupitt/libvips/wiki/HOWTO----Image-shrinking ➤ Tip: larger JPEGs with a lower Quality setting still look quite good on higher-resolution displays, and can be smaller in size too.
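A sketch using the vipsthumbnail CLI that ships with libvips (the tool must be installed; size, quality and output pattern are illustrative), which takes care of the shrinking and resampling details:

    defmodule Thumbnailer do
      # Produce a thumbnail bounded to 512x512; the [Q=78] suffix passes the
      # JPEG quality to the saver, in the spirit of the larger-but-lower-Q tip.
      # The output name is derived from the input basename via %s.
      def thumbnail(input_path) do
        args = [input_path, "--size", "512x512", "-o", "thumb_%s.jpg[Q=78]"]

        case System.cmd("vipsthumbnail", args, stderr_to_stdout: true) do
          {_output, 0} -> :ok
          {output, status} -> {:error, {:vipsthumbnail_failed, status, output}}
        end
      end
    end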
Storage: storing the file in a block or object storage service for later retrieval. If you’ve taken the S3 ingestion route, then Storage is largely taken care of. Otherwise you will have to ensure that the underlying block storage device is large enough and is looked after by operators. It may also be a good idea to have a staged retention policy to get rid of old files, so the amount of data stored does not keep growing. ➤ This could be done via a Lifecycle Policy or Object Expiration, with notifications sent via S3 Event Notifications.
Presentation: making the File and/or its Representations available. Usually people will vend a signed link directly, and this may be adequate for your uses. In any case, consider tracking file names separately (during Ingestion) and vending the link with the correct name via response-content-disposition: ➤ attachment; filename*=UTF-8''${encoded_filename} ➤ Percent-encode anything outside of A-Za-z0-9
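A sketch of vending such a link with ex_aws_s3 (bucket and expiry are placeholders, and the key and filename come from whatever you tracked at Ingestion time); the :query_params option is how the response-content-disposition override reaches S3:

    defmodule Download do
      # Encode the stored filename per the rule above, then presign a GET that
      # tells S3 to serve the object with the right Content-Disposition.
      def presigned_link(key, filename) do
        encoded =
          URI.encode(filename, fn ch ->
            ch in ?0..?9 or ch in ?A..?Z or ch in ?a..?z
          end)

        disposition = "attachment; filename*=UTF-8''#{encoded}"

        ExAws.Config.new(:s3)
        |> ExAws.S3.presigned_url(:get, "my-permanent-bucket", key,
          expires_in: 300,
          query_params: [{"response-content-disposition", disposition}]
        )
      end
    end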
of parts, etc. ➤ Try to process everything in isolation ➤ Use a dedicated service to deal with files if needed ➤ Try not to have ImageMagick / FFmpeg / LibAV everywhere ➤ Minimises the patching workload: a single point of audit