PySpark - Productivity and Processing Power
Felipe cruz
November 10, 2015
Transcript
PySpark: productivity and processing power
Who? github.com/felipecruz @felipejcruz
Agenda: Map-Reduce, PySpark
What was the highest offer placed in one week on BOVESPA?
Motivation: highest_offer = max(offers)
?
?
highest_offers_1 = max(offers_partition1)
highest_offers_2 = max(offers_partition2)
max(highest_offers_1, highest_offers_2)
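The two-partition idea above generalizes to any number of partitions. The sketch below uses plain Python rather than Spark; the `partition` helper and the offer values are hypothetical and only illustrate the pattern of computing one maximum per partition and then combining the partial results.

```python
from functools import reduce

def partition(items, n_parts):
    """Split items into n_parts interleaved slices (hypothetical helper)."""
    return [items[i::n_parts] for i in range(n_parts)]

# Hypothetical offer values; the real data would come from the BOVESPA files.
offers = [8300.0, 7100.0, 12500.0, 950.0, 12499.0, 40.0]

partitions = partition(offers, 2)
partial_maxima = list(map(max, partitions))   # "map": one maximum per partition
highest_offer = reduce(max, partial_maxima)   # "reduce": combine the partial maxima

print(highest_offer)
```

The result matches `max(offers)` because taking a maximum is associative and commutative, so the grouping of the data does not matter.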
Hold on...
Map
Map Reduce
Map-Reduce: Map-Reduce is not divide and conquer (although divide and conquer can be implemented with map-reduce)
Applications: filtering, distinct values, Top K, by value, summarization, inverted index, word count, structuring, sorting, partitioning, shuffling, join, inner join, Cartesian product. Our example: K = 1
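As a rough illustration of the Top K application, and of why the running example corresponds to K = 1, here is a plain-Python sketch. The function name `top_k` and the sample partitions are assumptions made for this example; they are not taken from the slides.

```python
import heapq

def top_k(partitions, k):
    """Return the k largest values across all partitions."""
    # Map step: each partition keeps only its own k largest values.
    partial = [heapq.nlargest(k, part) for part in partitions]
    # Reduce step: merge the partial lists and keep the k largest overall.
    merged = [value for part in partial for value in part]
    return heapq.nlargest(k, merged)

# Hypothetical offer values split into two partitions.
partitions = [[8300, 12500, 12499], [7100, 950, 40]]

print(top_k(partitions, 1))  # K = 1 is the plain maximum
print(top_k(partitions, 3))
```

Each partition sends at most K values to the reduce step, which keeps the amount of data to be merged small.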
PySpark
Core features: Map-Reduce; RDD, DataFrames & SQL; MLlib; Streaming; GraphX
Map & Reduce
>>> prices = sc.textFile('s3n://prognoos-pyspark/*.gz') \
...     .filter(lambda x: x.count(';') > 14) \
...     .map(lambda x: [s.strip() for s in x.split(';')]) \
...     .map(lambda x: (x[1], x[8], x[15]))
...
>>> prices.take(2)
[(u'ABEVA70', u'000000000000.350000', u'000000000000008300'),
 (u'ABEVA70', u'000000000000.350000', u'000000000000007100')]
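To see what each step of this pipeline does to a single record, the same logic can be tried without a cluster. This is a minimal sketch in plain Python; `parse_line` and the sample record are hypothetical, and only positions 1, 8 and 15 of the record mirror the values shown in the slide's output.

```python
def parse_line(line):
    """Return fields 1, 8 and 15 of a ';'-separated record, or None."""
    # Same guard as the slide: more than 14 separators means at least 16 fields.
    if line.count(';') <= 14:
        return None
    fields = [s.strip() for s in line.split(';')]
    return (fields[1], fields[8], fields[15])

# Hypothetical record with 16 fields; the other positions are placeholders.
fields = ['f%d' % i for i in range(16)]
fields[1] = 'ABEVA70'
fields[8] = ' 000000000000.350000 '   # padding is removed by strip()
fields[15] = '000000000000008300'
sample = ';'.join(fields)

print(parse_line(sample))
print(parse_line('header line without enough fields'))
```

The separator count works as a cheap filter for header or malformed lines, so the later index `x[15]` cannot fail.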
Map & Reduce
>>> prices = sc.textFile('ftp://*.gz') \
...     .filter(lambda x: x.count(';') > 14) \
...     .map(lambda x: [s.strip() for s in x.split(';')]) \
...     .map(lambda x: (x[1], float(x[8]), float(x[15])))  # numeric fields as floats
...
>>> sum_all = prices.map(lambda x: x[2]) \
...     .reduce(lambda x, y: x + y)
...
>>> sum_all
1532623750.0
from datetime import datetime

# Helpers: parse a time of day, and a short alias for float
strpt = lambda x: datetime.strptime(x, '%H:%M:%S.%f')
f = float

# Trades (NEG files)
negs = sc.textFile('s3n://prognoos-pyspark/NEG/*.gz') \
    .filter(lambda x: x.count(';') > 14) \
    .map(lambda x: [s.strip() for s in x.split(';')]) \
    .map(lambda x: (strpt(x[5]), 'NEG', x[1], f(x[3]), f(x[16]), x[17]))

# Buy offers (CPA files)
buys = sc.textFile('s3n://prognoos-pyspark/CPA/*.gz') \
    .filter(lambda x: x.count(';') > 14) \
    .map(lambda x: [s.strip() for s in x.split(';')]) \
    .map(lambda x: (strpt(x[6]), 'CPA', x[1], f(x[8]), x[15], None))

# Sell offers (VDA files)
sell = sc.textFile('s3n://prognoos-pyspark/VDA/*.gz') \
    .filter(lambda x: x.count(';') > 14) \
    .map(lambda x: [s.strip() for s in x.split(';')]) \
    .map(lambda x: (strpt(x[6]), 'VDA', x[1], f(x[8]), None, x[15]))

# Combine the three record types into a single RDD
all_operations = negs.union(buys).union(sell)
total = all_operations.count()
# total = 52980676
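One detail of the `strpt` helper is worth checking in isolation: the format string contains only a time of day, so `datetime.strptime` fills in the default date 1900-01-01. The timestamp below is a hypothetical value in the layout the format string expects.

```python
from datetime import datetime

strpt = lambda x: datetime.strptime(x, '%H:%M:%S.%f')

# Hypothetical timestamp in HH:MM:SS.ffffff layout.
parsed = strpt('10:15:30.250000')

print(parsed)          # full datetime, including the default date
print(parsed.time())   # only the time-of-day part
```

Because every record gets the same default date, events from different trading days are only distinguishable if the date is carried in a separate field.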
... it's not all roses
data = sc.parallelize(['aa', 'bb', 'ab', 'bc'])

def _filter(data):
    sts = ['a', 'b']
    rets = []
    for st in sts:
        # The lambda captures the variable st, not its current value, and
        # filter() is lazy: by the time collect() runs, st is already 'b'.
        rets.append((st, data.filter(lambda x: x.startswith(st))))
    return rets

rdds = _filter(data)
for st, rdd in rdds:
    print((st, rdd.collect()))

# ('a', ['bb', 'bc'])
# ('b', ['bb', 'bc'])

Python - Anti-pattern - don't do this!!
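The surprising output comes from Python's late-binding closures, not from Spark itself, so it can be reproduced and fixed in plain Python. The sketch below is an illustration with hypothetical names; binding the loop variable as a default argument is one common workaround, and the same idea should carry over to the lambda passed to `filter`.

```python
prefixes = ['a', 'b']
words = ['aa', 'bb', 'ab', 'bc']

# Late binding: every lambda looks up `st` when it is called, after the loop ended.
late = [lambda x: x.startswith(st) for st in prefixes]
# Early binding: the default argument freezes the current value of `st`.
bound = [lambda x, st=st: x.startswith(st) for st in prefixes]

late_result = [[w for w in words if pred(w)] for pred in late]
bound_result = [[w for w in words if pred(w)] for pred in bound]

print(late_result)
print(bound_result)
```

A small factory function that takes the prefix as a parameter and returns the predicate has the same effect and may read more clearly than the default-argument trick.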
DataFrames & SQL
DataFrame: "A distributed collection of data grouped into named columns"
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
events = negs.union(buys).union(sell).toDF()

# DataFrame API
total = events.count()

# Save for later use
events.write.save('s3n://prognoos/events/', format='parquet', mode='overwrite')
SparkSQL
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
>>> path = 's3n://prognoos-test/events'
>>> table_name = 'bovespa_events'
>>> events = sqlContext.read.parquet(path)
>>> events.registerTempTable(table_name)
>>> total_events = sqlContext.sql('''
...     select count(*) from bovespa_events
... ''')
Spark in production: Standalone, Hadoop/YARN, Mesos
Questions? @felipejcruz github.com/felipecruz