Importing Wikipedia in Plone
Éric Bréhault – Plone Conference 2013
Slide 2
ZODB is good at storing objects
● Plone contents are objects,
● we store them in the ZODB,
● everything is fine, end of story.
Slide 3
But what if ...
... we want to store non-contentish records?
Like polls, statistics, mailing-list subscribers, etc., or any business-specific structured data.
Slide 4
Store them as contents anyway
That is a powerful solution, but there are two major problems...
Slide 5
Problem 1: You need to manage a secondary system
● you need to deploy it,
● you need to back it up,
● you need to secure it,
● etc.
Slide 6
Problem 2: I hate SQL
No explanation here.
Slide 7
I think I just cannot digest it...
Slide 8
How to store many records in the ZODB?
● Is the ZODB strong enough?
● Is the ZCatalog strong enough?
Slide 9
My grandmother often told me
"If you want to become stronger, you have to eat your soup."
Slide 10
Where do we find a good soup for Plone?
In a super souper!!!
Slide 11
souper.plone and souper
● It provides both storage and indexing.
● Records can store any persistent, pickleable data.
● Created by BlueDynamics.
● Based on ZODB BTrees, node.ext.zodb, and repoze.catalog.
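Queries (slide 15) only match indexed attributes, so before adding records the soup needs a catalog factory registered under the soup name. A minimal sketch following the souper documentation, indexing the 'user', 'text', and 'keywords' attributes used on the next slides; treat the exact wiring as an assumption:

from zope.interface import implementer
from zope.component import provideUtility
from repoze.catalog.catalog import Catalog
from repoze.catalog.indexes.field import CatalogFieldIndex
from repoze.catalog.indexes.keyword import CatalogKeywordIndex
from repoze.catalog.indexes.text import CatalogTextIndex
from souper.interfaces import ICatalogFactory
from souper.soup import NodeAttributeIndexer

@implementer(ICatalogFactory)
class MySoupCatalogFactory(object):
    # builds the catalog that indexes the records of 'mysoup'
    def __call__(self, context=None):
        catalog = Catalog()
        catalog[u'user'] = CatalogFieldIndex(NodeAttributeIndexer('user'))
        catalog[u'text'] = CatalogTextIndex(NodeAttributeIndexer('text'))
        catalog[u'keywords'] = CatalogKeywordIndex(
            NodeAttributeIndexer('keywords'))
        return catalog

# the utility name must match the name passed to get_soup
provideUtility(MySoupCatalogFactory(), name='mysoup')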
Slide 12
Add a record
>>> from souper.soup import get_soup
>>> from souper.soup import Record
>>> soup = get_soup('mysoup', context)
>>> record = Record()
>>> record.attrs['user'] = 'user1'
>>> record.attrs['text'] = u'foo bar baz'
>>> record.attrs['keywords'] = [u'1', u'2', u'ü']
>>> record_id = soup.add(record)
Slide 13
Record in record
>>> record.attrs['homeaddress'] = Record()
>>> record.attrs['homeaddress'].attrs['zip'] = '6020'
>>> record.attrs['homeaddress'].attrs['town'] = 'Innsbruck'
>>> record.attrs['homeaddress'].attrs['country'] = 'Austria'
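After modifying a record that is already stored, it must be reindexed for queries to see the change; souper's reindex supports this (the records argument restricts reindexing to the given records instead of the whole soup):

>>> soup.reindex(records=[record])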
Slide 14
Access record
>>> from souper.soup import get_soup
>>> soup = get_soup('mysoup', context)
>>> record = soup.get(record_id)
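A record can also be removed; per the souper documentation, deletion uses plain del on the soup:

>>> del soup[record]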
Slide 15
Query
>>> from repoze.catalog.query import Eq, Contains
>>> [r for r in soup.query(Eq('user', 'user1')
...                        & Contains('text', 'foo'))]
[<Record object 'None' at ...>]
or using the CQE format:
>>> [r for r in soup.query("user == 'user1' and 'foo' in text")]
[<Record object 'None' at ...>]
Slide 16
souper
● A soup container can be moved to a specific ZODB mount point (see the zope.conf sketch below),
● it can be shared across multiple independent Plone instances,
● souper works on Plone and Pyramid.
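The mount point itself is plain ZODB configuration in zope.conf; a sketch where the database name, path, and mount path are illustrative:

<zodb_db soups>
    <filestorage>
        path /path/to/var/filestorage/soups.fs
    </filestorage>
    mount-point /plone/soups
</zodb_db>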
Slide 17
Plomino & souper
● We use Plomino to build non-content-oriented apps easily,
● we use souper to store huge amounts of application data.
Slide 18
Plomino data storage
Originally, documents (i.e. records) were ATFolders.
Capacity: about 30,000.
Slide 19
Plomino data storage
Since 1.14, documents are pure CMF.
Capacity: about 100,000.
Usually the Plomino ZCatalog contains a lot of indexes.
Slide 20
Plomino & souper
With souper, documents are just soup records.
Capacity: several million.
Slide 21
Typical use case
● Store 500,000 addresses,
● be able to query them in full text and display the results on a map (see the sketch below).
Demo
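A rough sketch of what the demo does, assuming an 'addresses' soup whose catalog has a text index on 'address' (the soup name, attribute names, and sample values are all illustrative):

from repoze.catalog.query import Contains
from souper.soup import get_soup, Record

soup = get_soup('addresses', context)

# store one address record, with coordinates for the map display
record = Record()
record.attrs['address'] = u'12 rue des Plantes, Toulouse'
record.attrs['lat'] = 43.6045
record.attrs['lon'] = 1.4440
soup.add(record)

# full-text query, then collect the coordinates to plot
points = [(r.attrs['lat'], r.attrs['lon'])
          for r in soup.query(Contains('address', 'Toulouse'))]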
Slide 22
What is the limit?
Can we import Wikipedia into souper?
Demo with 400,000 records
Demo with 5.5 million records
Slide 23
Conclusion
● Usage performance is good,
● Plone performance is not impacted.
Use it!
Slide 24
Thoughts
● What about a REST API on top of it?
● Massive imports are slow and difficult; could they be improved? (One option is sketched below.)
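One way to tame a massive import is to commit in batches, so no single huge transaction builds up in memory; a sketch where the helper name and batch size are illustrative:

import transaction
from souper.soup import get_soup, Record

def bulk_import(context, rows, batch_size=10000):
    # rows: an iterable of dicts mapping attribute names to values
    soup = get_soup('mysoup', context)
    for i, row in enumerate(rows, 1):
        record = Record()
        for key, value in row.items():
            record.attrs[key] = value
        soup.add(record)
        if i % batch_size == 0:
            transaction.commit()  # flush the current batch to disk
    transaction.commit()  # commit the final partial batch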
Slide 25
Makina Corpus
For all questions related to this talk,
please contact Éric Bréhault
[email protected]
Tel: +33 534 566 958
www.makina-corpus.com