Creating JATS XML from Japanese language articles and automatic typesetting using XSLT / JATS-Con-Asia-20151019-03-Hidehiko-Nakanishi

JATS-Con Asia
Monday, October 19, 2015
http://jats-con-asia.strikingly.com/

General Session:
Speaker 1 "Creating JATS XML from Japanese language articles and automatic typesetting using XSLT"
-Hidehiko Nakanishi, Nakanishi Printing Co., Ltd.
Abstract: http://jats-con-asia.strikingly.com/#speakers
Materials: https://speakerdeck.com/jatsconasiasc/jats-con-asia-20151019-03-hidehiko-nakanishi
Video: https://vimeo.com/150206898

Transcript

  1. Creating JATS XML from
    Japanese language articles and
    automatic typesetting using XSLT.
    Hidehiko Nakanishi
    Nakanishi Printing Co., Ltd.
    Kyoto, Japan

  2. Contents
    1. Introduction
    2. Creating Japanese XML articles in JATS
    3. Creating PDF using AH Formatter
    4. Challenges of Applying JATS to Japanese
    language texts
    5. Future
    6. Conclusion

  3. Introduction

  4. Many countries use non-Latin scripts

  5. Not all research articles are written
    in English.
    • Many articles do not even use the Latin
    alphabet

  6. What languages are used in articles written in Japan?
    [Figure: two charts comparing the shares of Japanese and English articles]
    – STM: articles published in J-STAGE, the e-journal platform operated by the
    Japan Science and Technology Agency (JST).
    – All areas: university journal articles indexed in NDL-OPAC.

  7. We wanted a schema applicable to
    Japanese
    • Even for Japanese-language articles, e-articles
    are essential.
    • We were looking for a schema for
    Japanese-language articles.
    • Such a schema had to accept English as well.

  8. JATS multi-language support
    • In 2011, JATS 0.4 made it possible to express
    Japanese-language articles in XML
    • J-STAGE supported JATS 0.4 immediately
    • We started creating JATS XML for
    Japanese-language articles
    • Before that

  9. I am from Kyoto, Japan
    Bethesda
    Kyoto
    East Asia Kanji cultural zone

  10. Kyoto is a former capital of Japan
    Where my company, Nakanishi Printing, is located.

  11. Our Tradition
    Founded in 1865 by our ancestor.
    A 150-year-old family business.
    One of the oldest printers.
    Former building of Nakanishi Printing in the Taisho era (1912-1926)
    Current building of Nakanishi Printing

  12. Our history
    A brazier made from a woodcut printing plate (19th century)
    Type picker, 1960s
    Today

  13. This is a Japanese e-journal
    The Japanese Journal of Gastroenterological Surgery

  14. Same page expressed in English

  16. Expressing Multiple Languages
    • Alternate expressions for a single object are
    necessary
    • Simple repetition of a tag can be confusing
    – Two name expressions of the same person?
    – Or two different persons?
    • JATS introduced “alternatives” tags for such
    cases

  17. • Two name expressions of a single person


    中西
    秀彦


    Nakanishi
    Hidehiko


    “Alternatives” Tags
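
As a rough sketch of what two name expressions of a single person can look like in JATS, the Python below builds a `<name-alternatives>` wrapper holding one `<name>` per script. The element and attribute names come from the JATS tag set; the helper `build_name_alternatives` and the language codes `ja` and `en` are assumptions made for this illustration, not taken from the slide.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_name_alternatives(variants):
    """Wrap several renderings of ONE person's name in <name-alternatives>.

    variants: iterable of (lang, name_style, surname, given_names).
    Hypothetical helper written for this sketch.
    """
    wrapper = ET.Element("name-alternatives")
    for lang, style, surname, given in variants:
        name = ET.SubElement(wrapper, "name", {"name-style": style, XML_LANG: lang})
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
    return wrapper


contrib_names = build_name_alternatives([
    ("ja", "eastern", "中西", "秀彦"),
    ("en", "western", "Nakanishi", "Hidehiko"),
])
xml_text = ET.tostring(contrib_names, encoding="unicode")
print(xml_text)
```

Because both `<name>` elements sit inside one wrapper, a reader of the XML can tell that they describe the same person rather than two different authors.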

  18. “Alternatives” tags

  19. How multiple languages can be expressed in JATS
    element name | multi-language tag | Note
    article title | |
    article subtitle | |
    names | |
    affiliations | |
    collaborators | |
    abstract | | is repeatable with different "xml:lang".
    | | is for articles later translated.
    keyword group | | is repeatable with different "xml:lang".
    generic | | any component that needs multi-language data
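
A small sketch of the "repeatable with different xml:lang" idea, reading a metadata fragment and grouping keywords by language. The element names (`trans-title-group`, `trans-title`, `kwd-group`, `kwd`) are taken from the JATS tag set rather than from the slide, and the title and keyword strings are made-up placeholders.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical metadata fragment; titles and keywords are placeholders.
ARTICLE_META = """
<article-meta>
  <title-group>
    <article-title>日本語論文の題名</article-title>
    <trans-title-group xml:lang="en">
      <trans-title>Title of a Japanese article</trans-title>
    </trans-title-group>
  </title-group>
  <kwd-group xml:lang="ja"><kwd>組版</kwd></kwd-group>
  <kwd-group xml:lang="en"><kwd>typesetting</kwd></kwd-group>
</article-meta>
"""


def keywords_by_language(article_meta):
    """Collect <kwd> text from every <kwd-group>, keyed by its xml:lang."""
    result = {}
    for group in article_meta.iter("kwd-group"):
        lang = group.get(XML_LANG, "")
        result.setdefault(lang, []).extend(k.text for k in group.findall("kwd"))
    return result


meta = ET.fromstring(ARTICLE_META)
keywords = keywords_by_language(meta)
translated_title = meta.findtext("title-group/trans-title-group/trans-title")
print(keywords, translated_title)
```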

  20. Creating Japanese XML
    articles in JATS

  21. Creating XML articles in JATS
    • We don’t have tools readily available for
    creating Japanese XML files.
    • Our method
    1. Convert Microsoft Word to Microsoft Office
    Open XML
    2. Convert Microsoft Office Open XML to JATS
    XML
    3. Validate XML
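
Step 1 can be pictured with a short sketch, assuming the Word file is saved as a `.docx` package. Such a package is a ZIP archive whose main text conventionally lives in `word/document.xml`. The function names below are hypothetical, and the demo package is a stand-in built in memory, not a complete Word file.

```python
import io
import zipfile

MAIN_PART = "word/document.xml"  # conventional location of the main document part


def read_main_document_part(docx_bytes):
    """Return the WordprocessingML main part of a .docx package as text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(MAIN_PART).decode("utf-8")


def make_demo_package(body_xml):
    """Build a minimal stand-in package in memory (NOT a complete .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(MAIN_PART, body_xml)
    return buffer.getvalue()


DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>中西秀彦</w:t></w:r></w:p></w:body></w:document>"
)
demo_docx = make_demo_package(DEMO_XML)
extracted = read_main_document_part(demo_docx)
print(extracted)
```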

  22. (1) Converting Microsoft Word to
    Microsoft Office Open XML
    MS Open XML tags

  23. (2) Converting Microsoft Office Open
    XML to JATS XML
    • Through XSLT, removing unnecessary tags.
    • Perl program processing.
    • We faced the difficulty of agglutinative
    languages
    – Words follow one another without spaces.
    – A computer cannot tell where one word ends and the next begins.
    – This applies even to the boundary between surname and given name.
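
The "removing unnecessary tags" step might look roughly like the sketch below. It is written in Python with the standard library as a stand-in for the XSLT and Perl processing mentioned on the slide, and it only handles plain paragraphs. The function name and the sample input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace in Clark notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_to_jats_body(document_xml):
    """Keep only paragraph text from WordprocessingML and emit <body><p>...</p></body>."""
    source = ET.fromstring(document_xml)
    body = ET.Element("body")
    for paragraph in source.iter(W + "p"):
        # Concatenate all text runs; run-level formatting (w:rPr) is dropped.
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        if text:
            ET.SubElement(body, "p").text = text
    return body


SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>日本語</w:t></w:r><w:r><w:t>の論文</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
jats_body = ET.tostring(paragraphs_to_jats_body(SAMPLE), encoding="unicode")
print(jats_body)
```

Note that the two runs of the first paragraph are joined without any space, which is exactly why word and name boundaries cannot be recovered automatically at this stage.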

  24. Agglutinative languages
    • Typical of East Asian languages
    • No separating spaces between words

  25. One sentence, one character string
    日本語
    (Japanese)
    表意文字を用いた膠着語
    (An agglutinative language using ideographs)

  26. Agglutinative languages
    • In the old days, not even punctuation was used,
    i.e. multiple sentences in one character string!

  27. Inserting word separators.
    • We insert separators manually.
    – The surname "中西" and the given name "秀彦" are joined
    as "中西秀彦" in an article
    – We separate it as "中西@秀彦"
    • Other possible splits are "中@西秀彦" and
    "中西秀@彦", but only a human can eliminate
    them
    • There is no algorithm that determines the split correctly.
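
To make the separator idea concrete, here is a minimal sketch. It lists every position where "@" could be placed in an unspaced name, then turns the manually chosen form into a JATS `<name>` element. The function names are hypothetical, and the choice itself is still assumed to come from a human editor.

```python
import xml.etree.ElementTree as ET

SEPARATOR = "@"


def candidate_splits(full_name):
    """Every way of placing one separator inside an unspaced name."""
    return [full_name[:i] + SEPARATOR + full_name[i:] for i in range(1, len(full_name))]


def name_element(marked_name):
    """Turn 'surname@given' into a JATS <name> element.

    Raises ValueError unless the string contains exactly one separator.
    """
    surname, given = marked_name.split(SEPARATOR)
    name = ET.Element("name", {"name-style": "eastern"})
    ET.SubElement(name, "surname").text = surname
    ET.SubElement(name, "given-names").text = given
    return name


candidates = candidate_splits("中西秀彦")
chosen = "中西@秀彦"  # picked by a human editor, not by the program
name_xml = ET.tostring(name_element(chosen), encoding="unicode")
print(candidates)
print(name_xml)
```

The program can enumerate the three candidates, but nothing in the string itself says which one is right.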

  28. (3) Validating XML
    • Use the Oxygen XML Editor
    • The final JATS XML is then ready to be uploaded to
    J-STAGE
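
Before a file reaches a validating editor, a quick well-formedness check can catch broken markup. The sketch below does only that: it does not validate against the JATS DTD or schema, which is the job the slide assigns to the XML editor. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses, otherwise a short error description.

    This checks well-formedness only; it is NOT JATS validation.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return f"line {line}, column {column}: {error}"
    return None


good = "<article><front/><body><p>本文</p></body></article>"
bad = "<article><front></article>"  # <front> is never closed
print(well_formedness_error(good))
print(well_formedness_error(bad))
```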

  29. PDF is still necessary
    • For paper publishing.
    • For readability.

  31. Creating PDF using AH
    Formatter

  32. Antenna House AH Formatter
    • Materials from Antenna House

  33. XSLT
    • The XSLT converts a JATS file into XSL-FO,
    which describes the page model and formatting for the PDF.
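
The shape of such a transformation can be sketched as follows. This is Python standing in for XSLT, producing a bare XSL-FO skeleton (one page master, one page sequence, one `fo:block` per JATS `<p>`). The A4 page size, the master name, and the function names are assumptions for the example, and a production stylesheet would cover far more of JATS.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Qualified XSL-FO element name in Clark notation."""
    return f"{{{FO_NS}}}{tag}"


def jats_body_to_fo(jats_xml):
    """Map each JATS <p> to an fo:block inside a single-page-master document."""
    article = ET.fromstring(jats_xml)
    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "A4", "page-height": "297mm", "page-width": "210mm"})
    ET.SubElement(page, fo("region-body"))
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for paragraph in article.iter("p"):
        block = ET.SubElement(flow, fo("block"), {"space-after": "1em"})
        # Inline markup such as <italic> is flattened to plain text here.
        block.text = "".join(paragraph.itertext())
    return root


JATS = "<article><body><p>第一段落。</p><p>Second <italic>paragraph</italic>.</p></body></article>"
fo_text = ET.tostring(jats_body_to_fo(JATS), encoding="unicode")
print(fo_text)
```

The resulting XSL-FO is what a formatter then lays out as pages.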

  34. For Japanese rendering
    AH Formatter extension

  35. Using Formatter for STM articles
    • There are no major problems
    • The basic style of writing STM papers does not
    differ greatly between Western countries and
    Japan.
    • Word separators should be inserted in the XML in
    advance

  36. Challenges of Applying JATS to
    Japanese language texts
    • But in Japan, exquisite typesetting is
    expected.
    • Automatic typesetting by AH Formatter may
    not be sufficient.

  37. Avoiding Line-Top Punctuation
    • Punctuation marks must not come at the start of
    a line ⇒ Also in English
    • 「っ」 or 「ッ」 (which mark a geminate consonant)
    must not come at the head of a line ⇒ Japanese
    rule
    • AH Formatter can handle these rules
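
A toy version of the line-start rule, to show what the formatter has to decide. It breaks text at a fixed number of characters and, when the next line would begin with a prohibited character, pulls that character onto the current line. The character set and the strategy are assumptions for illustration only, and this is not a description of how AH Formatter works internally.

```python
# Characters that must not begin a line in this sketch (an illustrative subset).
NOT_AT_LINE_START = set("、。，．）」』っッ")


def break_lines(text, width):
    """Break text every `width` characters without starting a line on a prohibited character."""
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # Pull prohibited characters onto the current line; the line may exceed width.
        while end < len(text) and text[end] in NOT_AT_LINE_START:
            end += 1
        lines.append(text[start:end])
        start = end
    return lines


sentence = "今日は学校に行った。明日も行く。"
lines = break_lines(sentence, 7)
for line in lines:
    print(line)
```

With a width of 7 the naive break would put 「っ」 at the head of the second line, so the first line is extended by one character instead.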

  38. Avoiding Word Breakup
    • Some words, such as personal names, must not be
    broken across lines
    • We use the "Zero Width Joiner" code (U+200D)
    e.g. 中‍西
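
A minimal sketch of inserting the joiner between the characters of a name. U+200D is the Unicode code point of ZERO WIDTH JOINER. The helper names are hypothetical, and whether a given formatter treats the joiner as a break prohibition is something to confirm for the tool in use.

```python
ZERO_WIDTH_JOINER = "\u200d"  # U+200D ZERO WIDTH JOINER


def protect_from_break(word):
    """Insert a zero width joiner between every pair of adjacent characters."""
    return ZERO_WIDTH_JOINER.join(word)


def with_visible_references(text):
    """Show non-ASCII characters as numeric character references.

    Only for inspecting invisible characters; this is not a full XML escaper
    (it leaves '&' and '<' untouched).
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


protected = protect_from_break("中西")
print(len(protected))
print(with_visible_references(protected))
```

Since the joiner is invisible in most editors, printing it as a character reference is a convenient way to check that it was actually inserted.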

  39. Positioning Figures/Tables
    • Figures and tables should be positioned on the
    SAME page as the text that refers to
    them.
    • This requires customized XSLT, sometimes
    for each individual figure and table.
    • This increases cost.

  40. Positioning Figures/Tables
    Every article needs these XSLTs

  41. Future
    What is to be done next
    –Vertical writing
    –Emphasis or “Kenten”
    –Warichu

  42. Vertical writing
    • Traditionally, Japanese
    (and Chinese and
    Korean) is written from top
    to bottom

  43. Vertical Writing
    • Vertical writing
    causes some
    interesting problems,
    such as the orientation of Arabic
    numerals and Latin
    letters
    A new element for direction is necessary.
    as

  44. Emphasis
    • Emphasis or “Kenten”
    • It is like boldface and
    italics in English
    • We use
    and AH Formatter extension
    to express this today.
    • We need a generic tag,

  45. Warichu
    • Vertically written
    texts sometimes
    contain notes called
    “Warichu”.
    • Warichu uses 2 lines
    within a parent line.

  46. Warichu
    Historical document example

  47. Suggestion
    Additional tags for
    –Vertical writing
    –Emphasis or “Kenten”
    –Warichu

  48. Conclusion
    • JATS opened a new horizon in processing
    Japanese-language articles
    – No major difficulties
    – UTF-8, an encoding for XML, also makes it possible to express
    most Japanese characters correctly

  49. Conclusion
    • There are still remaining issues in processing
    non-Latin, agglutinative languages such as
    Japanese.
    • Challenges
    – Word separators have to be inserted manually
    – Line break issues
    – Positioning figures and tables correctly

  50. Heaven/Earth/Man
    http://artnews.blog.so-net.ne.jp/2011-04-22

  51. Structure vs. Expression
    • In pictographic/ideographic writing systems,
    authors and publishers care more about
    appearance and layout than those in the
    Western world.
    – Calligraphy
    • We sometimes need to describe such
    looks/layouts in XML.
    – May or may not be solved by extending JATS

  52. Is JATS applicable?
    “Kaitai shinsho”, the first translation of a Western medical book (1774).

  53. Is JATS applicable?
    • “Amma tebiki”, an Eastern medical textbook (1835)

  54. Thank you
