
Solution for downloading large amounts of videos from multiple SNS on AWS

Phong Phạm

December 05, 2024

Transcript

  1. Agenda
    1. Introduction
    2. Requirements Gathering / Context
    3. Solution Overview / High-level Design
    4. Implementation and Struggle
    5. Conclusion
  2. Introduction
    Phong Pham (フォン・ファム)
    - 2024, Nov ~ Present: Platform Engineer at ExaWizards Inc.
    - 2020, Oct ~ 2024, Nov: Site Reliability Engineer at Hakuhodo Technologies
  3. Requirements Gathering / Context
    In-house video analysis application
    - Thousands of designers and creators use this app
    - Need to process thousands of advertising videos for business insights
    - The videos come from multiple sources such as YouTube, Twitter, Vimeo, TikTok, Facebook, …
    => Need a solution that lets users collect and download all the videos they find on SNS and save them for later ad-hoc analysis
  4. Requirements Gathering / Context
    Internal VPN / company IP addresses => requests for downloading videos => IP banned?
  5. Requirements Gathering / Context
    Non-Functional Requirements:
    - Bypass IP bans from SNS
    - Process a large number of videos per day
    - Can be easily integrated into the current in-house analysis application
    Functional Requirements:
    - Fully serverless
    - Utilize AWS IPs
    - The download process can take a long time
    => Need a solution for long-running jobs
  6. Solution Overview / High-level Design
    Requirement => Solution
    - Fully serverless => Fully managed services on AWS
    - Utilize AWS IPs => Lambda, EC2, Fargate
    - Long-running jobs => Decoupled systems, asynchronous job processing
    - Process a large number of videos per day => YoutubeDL: https://github.com/ytdl-org/youtube-dl
    - Bypass IP bans from SNS => Each video download should happen from a unique IPv4 address
    - Can be easily integrated => Released as APIs
  7. Implementation and Struggle
    DynamoDB table (field: meaning)
    - media_video_id: ID on the SNS platform
    - sqs_message_id: SQS message ID
    - raw_url: raw URL, may be shortened
    - s3_path: saved path on S3
    - status: QUEUED | RUNNING | SUCCESS | FAILED | CANCELED
    - hash: video hash for duplicate checking
    - metadata: video JSON metadata
    - created_at: timestamp
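A record with the fields above could be modeled roughly as follows. This is a minimal sketch, not the author's code: the class names `Status` and `VideoRecord`, the `to_item` helper, the Python types, and the choice of epoch seconds for `created_at` are all assumptions, since the slide gives only field names and meanings.

```python
import time
from dataclasses import asdict, dataclass, field
from enum import Enum


class Status(str, Enum):
    """Job states listed on the slide."""
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


@dataclass
class VideoRecord:
    """Hypothetical in-memory shape of one table item."""
    media_video_id: str           # ID on the SNS platform
    sqs_message_id: str           # SQS message ID
    raw_url: str                  # URL as typed by the user, may be shortened
    s3_path: str = ""             # filled in once the upload finishes
    status: Status = Status.QUEUED
    hash: str = ""                # content hash used for duplicate checking
    metadata: dict = field(default_factory=dict)
    # Assumption: epoch seconds; the slide only says "timestamp".
    created_at: int = field(default_factory=lambda: int(time.time()))

    def to_item(self) -> dict:
        """Return a plain dict with the status stored as its string value."""
        item = asdict(self)
        item["status"] = self.status.value
        return item
```

A new job would start as `VideoRecord("abc", "msg-1", "https://youtu.be/abc")` with status `QUEUED`, and the consumer would update `status`, `s3_path`, and `hash` as the download progresses.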
  8. Implementation and Struggle
    - The user types in the URL of a video from an SNS platform
    - The Lambda function should be able to extract the exact video ID and the platform the video belongs to
    - URLs come in many different forms
    => Need to implement a regex-based VideoUrlParser method
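A regex-based parser of this kind might look like the sketch below. The function name `parse_video_url`, the `ParsedVideo` type, and every pattern are illustrative assumptions, not the patterns used in the talk; they cover only a few common URL shapes and do not resolve shortened links.

```python
import re
from typing import NamedTuple, Optional


class ParsedVideo(NamedTuple):
    platform: str
    video_id: str


# Assumed URL shapes; real platforms accept more variants than these.
_PATTERNS = [
    ("youtube", re.compile(
        r"^https?://(?:www\.|m\.)?youtube\.com/watch\?(?:.*&)?v=([\w-]{11})")),
    ("youtube", re.compile(r"^https?://youtu\.be/([\w-]{11})")),
    ("vimeo", re.compile(r"^https?://(?:www\.)?vimeo\.com/(\d+)")),
    ("twitter", re.compile(
        r"^https?://(?:www\.|mobile\.)?(?:twitter|x)\.com/\w+/status/(\d+)")),
    ("tiktok", re.compile(
        r"^https?://(?:www\.)?tiktok\.com/@[\w.-]+/video/(\d+)")),
]


def parse_video_url(url: str) -> Optional[ParsedVideo]:
    """Return (platform, video_id) for a recognized URL, else None."""
    cleaned = url.strip()
    for platform, pattern in _PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return ParsedVideo(platform, match.group(1))
    return None
```

Keeping the patterns in one ordered table makes it cheap to add a new platform or URL variant without touching the matching logic; unrecognized URLs return `None` so the caller can reject the request early.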
  9. Implementation and Struggle
    - Use a Lambda function as the consumer for SQS
    - When the number of messages is large and the Lambda is invoked sequentially, all invocations end up with the same IP address => banned again
    => Need to invoke Lambda concurrently so that each invocation can get a unique IP address
    - The /tmp storage of a Lambda function only has 3 GB
    => Need to upload the video as a stream so that it is not stored in /tmp inside the function
    - Handle duplicate videos
    => Implement a hash function to compute the hash of each video and store it in DynamoDB
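The streaming and hashing points can be combined in one pass over the data. The sketch below is an assumption about how that could be done, not the implementation from the talk: `HashingReader` and `copy_stream` are hypothetical names, SHA-256 is an arbitrary choice of hash, and the destination is a generic file-like object so the example runs without AWS.

```python
import hashlib
import io
from typing import BinaryIO


class HashingReader:
    """Wrap a readable binary stream and hash every chunk that passes through."""

    def __init__(self, source: BinaryIO, algorithm: str = "sha256") -> None:
        self._source = source
        self._digest = hashlib.new(algorithm)

    def read(self, size: int = -1) -> bytes:
        chunk = self._source.read(size)
        self._digest.update(chunk)  # updating with b"" at EOF is a no-op
        return chunk

    def hexdigest(self) -> str:
        return self._digest.hexdigest()


def copy_stream(reader, sink: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Copy reader to sink chunk by chunk; only one chunk is held in memory."""
    total = 0
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory streams standing in for the download and the upload.
video_bytes = b"fake video payload" * 1000
reader = HashingReader(io.BytesIO(video_bytes))
destination = io.BytesIO()
copied = copy_stream(reader, destination)
video_hash = reader.hexdigest()
```

In a Lambda function the source would be the HTTP response body of the download, and the wrapped reader could in principle be handed to boto3's `upload_fileobj`, which accepts a readable binary file-like object; that integration is not exercised here. Note that the hash is only complete once the whole stream has been read, so a content-based duplicate check happens after the transfer rather than before it.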
  10. Conclusion
    What I've learned:
    - Regex for URL parsing
    - Concurrent programming
    - Streaming upload to S3
    - Hash function for video data
    - Serverless Application Model
    Follow-up:
    - Handle failed messages
    - Bypass proxy and cookies
    - etc. …