PySpark - Productivity and Processing Power
Felipe cruz
November 10, 2015
Transcript
PySpark
Productivity and processing power
Who?
github.com/felipecruz
@felipejcruz
Agenda
Map-Reduce
PySpark
What was the largest offer made in one week on BOVESPA?
Motivation
    highest_offer = max(offers)
?
    highest_offers_1 = max(offers_partition1)
    highest_offers_2 = max(offers_partition2)
    max(highest_offers_1, highest_offers_2)
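The two-partition idea generalizes to any number of partitions. Below is a minimal sketch in plain Python. The offer values and the `split_into_partitions` helper are made up for illustration and do not come from the talk.

```python
# Hypothetical offer values; the real BOVESPA data is not shown in the slides.
offers = [8300, 7100, 12500, 950, 43000, 610]

def split_into_partitions(values, n):
    """Split a list into n roughly equal chunks (illustrative helper)."""
    return [values[i::n] for i in range(n)]

partitions = split_into_partitions(offers, 3)

# Each partition can be processed independently, e.g. on a different machine.
partial_maxima = [max(p) for p in partitions]

# Combining the partial results gives the same answer as max(offers).
highest_offer = max(partial_maxima)
print(highest_offer)  # 43000
```

This works because `max` is associative: it does not matter how the values are grouped before the partial results are combined.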
Hold on...
Map
Map Reduce
Map-Reduce
Map-Reduce is not divide and conquer (although divide and conquer can be implemented with map-reduce)
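A minimal sketch of the two phases in plain Python, using the builtin `map` and `functools.reduce` as stand-ins for their distributed counterparts. The records and their `(symbol, value)` layout are assumptions made for illustration.

```python
from functools import reduce

# Hypothetical (symbol, offer value) records.
records = [('ABEV3', 350.0), ('PETR4', 1210.5), ('VALE5', 980.0)]

# Map: transform each record independently of the others.
values = map(lambda rec: rec[1], records)

# Reduce: fold the mapped values pairwise into a single result.
# The combining function should be associative so that partial
# results can be merged in any grouping.
highest = reduce(lambda a, b: a if a >= b else b, values)
print(highest)  # 1210.5
```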
Applications
Filtering: distinct, top K (our example: K = 1), by value
Summarization: inverted index, word count
Structuring: sorting, partitioning, shuffling
Join: inner join, Cartesian product
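Top K follows the same shape as the highest-offer example: each partition keeps its own K best candidates, and a final step merges them. This is a sketch only, with hypothetical partition contents and `heapq.nlargest` standing in for the per-partition work.

```python
import heapq

K = 2  # the talk's running example is the special case K = 1

# Hypothetical partitions of offer values.
partitions = [[8300, 950], [7100, 43000], [12500, 610]]

# Map side: each partition keeps only its own K largest values.
local_top = [heapq.nlargest(K, p) for p in partitions]

# Reduce side: merge the small candidate lists and take the K largest again.
candidates = [v for top in local_top for v in top]
top_k = heapq.nlargest(K, candidates)
print(top_k)  # [43000, 12500]
```

Only K values per partition ever reach the merge step, so the data moved between workers stays small no matter how large each partition is.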
PySpark
Core features
Map-Reduce
RDD, DataFrames & SQL
MLlib
Streaming
GraphX
Map & Reduce
    >>> prices = sc.textFile('s3n://prognoos-pyspark/*.gz') \
    ...     .filter(lambda x: x.count(';') > 14) \
    ...     .map(lambda x: [s.strip() for s in x.split(';')]) \
    ...     .map(lambda x: (x[1], x[8], x[15]))
    ...
    >>> prices.take(2)
    [(u'ABEVA70', u'000000000000.350000', u'000000000000008300'),
     (u'ABEVA70', u'000000000000.350000', u'000000000000007100')]
Map & Reduce
    >>> # x[15] is converted to float so that the reduce below adds numbers
    >>> # instead of concatenating strings
    >>> prices = sc.textFile('ftp://*.gz') \
    ...     .filter(lambda x: x.count(';') > 14) \
    ...     .map(lambda x: [s.strip() for s in x.split(';')]) \
    ...     .map(lambda x: (x[1], float(x[8]), float(x[15])))
    ...
    >>> sum_all = prices.map(lambda x: x[2]) \
    ...     .reduce(lambda x, y: x + y)
    ...
    >>> sum_all
    1532623750.0
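To try the filter/map/reduce chain without a cluster, the same steps can be run over an in-memory list. This sketch uses a hypothetical `make_line` helper that fabricates semicolon-delimited records with the symbol at index 1, the price at index 8 and the quantity at index 15. The slides do not describe the real file layout beyond those indices.

```python
from functools import reduce

def make_line(symbol, price, quantity, n_fields=17):
    """Build a fake ';'-delimited record (illustrative only)."""
    fields = [''] * n_fields
    fields[1], fields[8], fields[15] = symbol, price, quantity
    return ';'.join(fields)

lines = [
    make_line('ABEVA70', '000000000000.350000', '000000000000008300'),
    make_line('ABEVA70', '000000000000.350000', '000000000000007100'),
    'header line without enough fields',
]

# Same steps as the Spark chain, with lazy builtins instead of RDD methods.
rows = filter(lambda x: x.count(';') > 14, lines)
rows = map(lambda x: [s.strip() for s in x.split(';')], rows)
rows = map(lambda x: (x[1], float(x[8]), float(x[15])), rows)

sum_all = reduce(lambda x, y: x + y, map(lambda x: x[2], rows))
print(sum_all)  # 15400.0
```

Converting the quantity with `float` matters here: reducing with `+` over the raw strings would concatenate them rather than add them.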
    from datetime import datetime

    strpt = lambda x: datetime.strptime(x, '%H:%M:%S.%f')
    f = float

    # Each source is mapped to the same 6-tuple shape so the RDDs can be unioned.
    negs = sc.textFile('s3n://prognoos-pyspark/NEG/*.gz') \
        .filter(lambda x: x.count(';') > 14) \
        .map(lambda x: [s.strip() for s in x.split(';')]) \
        .map(lambda x: (strpt(x[5]), 'NEG', x[1], f(x[3]), f(x[16]), x[17]))

    buys = sc.textFile('s3n://prognoos-pyspark/CPA/*.gz') \
        .filter(lambda x: x.count(';') > 14) \
        .map(lambda x: [s.strip() for s in x.split(';')]) \
        .map(lambda x: (strpt(x[6]), 'CPA', x[1], f(x[8]), x[15], None))

    sell = sc.textFile('s3n://prognoos-pyspark/VDA/*.gz') \
        .filter(lambda x: x.count(';') > 14) \
        .map(lambda x: [s.strip() for s in x.split(';')]) \
        .map(lambda x: (strpt(x[6]), 'VDA', x[1], f(x[8]), None, x[15]))

    all_operations = negs.union(buys).union(sell)
    total = all_operations.count()
    # total = 52980676
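The `strpt` helper parses time-of-day strings only. Here is a short sketch of what it returns, assuming input such as `'10:15:30.123000'` (the slides do not show the actual field contents). Because the format has no date part, `strptime` fills in 1900-01-01. That is harmless for ordering events within a single day, but the trading date would have to come from somewhere else.

```python
from datetime import datetime

strpt = lambda x: datetime.strptime(x, '%H:%M:%S.%f')

# Hypothetical time strings in the assumed HH:MM:SS.ffffff layout.
ts = strpt('10:15:30.123000')
print(ts.time())  # 10:15:30.123000
print(ts.date())  # 1900-01-01

# Parsed values compare chronologically, unlike arbitrary strings.
earlier = strpt('10:15:29.999000')
print(earlier < ts)  # True
```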
... it's not all roses
Python - Anti-pattern - don't do this!!
    data = sc.parallelize(['aa', 'bb', 'ab', 'bc'])

    def _filter(data):
        sts = ['a', 'b']
        rets = []
        for st in sts:
            # The lambda reads st lazily, when collect() finally runs,
            # so every filter ends up using the last value of st ('b').
            rets.append((st, data.filter(lambda x: x.startswith(st))))
        return rets

    rdds = _filter(data)
    for st, rdd in rdds:
        print((st, rdd.collect()))
    # ('a', ['bb', 'bc'])
    # ('b', ['bb', 'bc'])
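The surprising output comes from Python's late-binding closures combined with lazy evaluation, and it can be reproduced without Spark. In the sketch below the builtin `filter`, which is also lazy, stands in for `RDD.filter`. The function names `filter_late` and `filter_bound` are illustrative. One common fix is to bind the loop variable as a default argument, so its value is captured when the lambda is created.

```python
data = ['aa', 'bb', 'ab', 'bc']

def filter_late(data):
    rets = []
    for st in ['a', 'b']:
        # The lambda looks up `st` when it finally runs, not when it is created.
        rets.append((st, filter(lambda x: x.startswith(st), data)))
    return rets

def filter_bound(data):
    rets = []
    for st in ['a', 'b']:
        # A default argument is evaluated immediately, freezing the current value.
        rets.append((st, filter(lambda x, st=st: x.startswith(st), data)))
    return rets

late = [(st, list(it)) for st, it in filter_late(data)]
bound = [(st, list(it)) for st, it in filter_bound(data)]
print(late)   # [('a', ['bb', 'bc']), ('b', ['bb', 'bc'])]
print(bound)  # [('a', ['aa', 'ab']), ('b', ['bb', 'bc'])]
```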
DataFrames & SQL
DataFrame
A distributed collection of data grouped into named columns
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
    events = negs.union(buys).union(sell).toDF()

    # DataFrame API
    total = events.count()

    # Save for later use
    events.write.save('s3n://prognoos/events/', format='parquet', mode='overwrite')
SparkSQL
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
    >>> path = 's3n://prognoos-test/events'
    >>> table_name = 'bovespa_events'
    >>> events = sqlContext.read.parquet(path)
    >>> events.registerTempTable(table_name)
    >>> total_events = sqlContext.sql('''
    ...     select count(*) from bovespa_events
    ... ''')
Spark in production
Standalone
Hadoop/YARN
Mesos
Questions?
@felipejcruz
github.com/felipecruz