テキストデータの連結

7b687d32752421ed7bdc2428e5e4b8e6?s=47 papiron
April 22, 2017

 テキストデータの連結

joinコマンドを中心とした、キーを元に2つのテキストデータを連結する処理。実際に業務で必要に迫られた事例も。イベントレポートブログはこちら。
http://papiro.hatenablog.jp/entry/2017/04/24/002533

7b687d32752421ed7bdc2428e5e4b8e6?s=128

papiron

April 22, 2017
Tweet

Transcript

  1. ςΩετσʔλͷ࿈݁ USP༑ͷձ @papiron

  2. RDBͷςʔϒϧ࿈݁ • ෳ਺ͷςʔϒϧΛ࿈݁ • SQLͷJOINͰ • PKͱFK • ಺෦݁߹ɺ֎෦݁߹ɺ׬શ݁߹

  3. ςΩετϑΝΠϧͷ࿈݁ • ෳ਺ͷςΩετϑΝΠϧΛ࿈݁ • joinίϚϯυͳͲͰ • ಛఆͷΧϥϜΛΩʔ • ಺෦݁߹ɺ֎෦݁߹ɺ׬શ݁߹

  4. ࿈݁ʹؔ͢ΔίϚϯυ • joinίϚϯυ • pasteίϚϯυ • TukubaiͷjoinܥɺloopjͳͲ • ͦͷଞิॿతͳίϚϯυ •

    commͳͲ
  5. ͓୊ • ͓୊1 • ϚελͱτϥϯβΫγϣϯͷ࿈݁ • ಺෦݁߹ • ͓୊2 •

    ิ׬Λ൐͏σʔλͷ࿈݁ • ֎෦݁߹ ͓୊2͸࣮ࡍͷۀ຿Ͱ ඞཁʹഭΒΕͨࣄྫ
  6. ͓୊1 01 suzuki 02 saitou 03 yamada 04 isono 05

    toudou 0001 05 20170101 600 0002 04 20170101 800 0003 01 20170101 500 0004 02 20170102 900 0005 03 20170102 800 0006 02 20170102 100 0007 02 20170102 900 0008 05 20170102 600 0009 03 20170103 400 0010 01 20170103 200 0011 03 20170103 900 0012 03 20170103 900 0013 01 20170103 500 0014 03 20170103 600 0015 04 20170103 900 0016 05 20170104 700 0017 04 20170104 300 0018 02 20170104 700 0019 05 20170104 400 0020 01 20170104 700 MASTER id,୲౰ऀ TRAN id,୲౰ऀid,೔෇,ൢചֹۚ
  7. ͓୊1.1 $ sort -s -k2,2 TRAN | join -1 1

    -2 2 MASTER - 01 suzuki 0003 20170101 500 01 suzuki 0010 20170103 200 01 suzuki 0013 20170103 500 01 suzuki 0020 20170104 700 02 saitou 0004 20170102 900 02 saitou 0006 20170102 100 02 saitou 0007 20170102 900 02 saitou 0018 20170104 700 03 yamada 0005 20170102 800 03 yamada 0009 20170103 400 03 yamada 0011 20170103 900 03 yamada 0012 20170103 900 03 yamada 0014 20170103 600 04 isono 0002 20170101 800 04 isono 0015 20170103 900 04 isono 0017 20170104 300 05 toudou 0001 20170101 600 05 toudou 0008 20170102 600 05 toudou 0016 20170104 700 05 toudou 0019 20170104 400 جຊతͳ݁߹ʢΩʔɺ1൪໨ͷϑΝΠϧɺ2൪໨ͷϑΝΠϧͷॱ൪ͰฒͿʣ ࿈݁͢Δσʔλ͸ ΩʔͰฒ΂ସ͑Λ ͢Δඞཁ͋Γ
  8. ͓୊1.1a $ sort -s -k2,2 TRAN | join1 key=2 MASTER

    - 0003 01 suzuki 20170101 500 0010 01 suzuki 20170103 200 0013 01 suzuki 20170103 500 0020 01 suzuki 20170104 700 0004 02 saitou 20170102 900 0006 02 saitou 20170102 100 0007 02 saitou 20170102 900 0018 02 saitou 20170104 700 0005 03 yamada 20170102 800 0009 03 yamada 20170103 400 0011 03 yamada 20170103 900 0012 03 yamada 20170103 900 0014 03 yamada 20170103 600 0002 04 isono 20170101 800 0015 04 isono 20170103 900 0017 04 isono 20170104 300 0001 05 toudou 20170101 600 0008 05 toudou 20170102 600 0016 05 toudou 20170104 700 0019 05 toudou 20170104 400 Tukubaiͷjoin1Λ࢖͏ʢϚελ͕ࢦఆͨ͠Ωʔͷ࣍ͷΧϥϜ΁݁߹ʣ ࿈݁͢Δσʔλ͸ ΩʔͰฒ΂ସ͑Λ ͢Δඞཁ͋Γ
  9. ͓୊1.1b $ cjoin1 key=2 MASTER TRAN 0001 05 toudou 20170101

    600 0002 04 isono 20170101 800 0003 01 suzuki 20170101 500 0004 02 saitou 20170102 900 0005 03 yamada 20170102 800 0006 02 saitou 20170102 100 0007 02 saitou 20170102 900 0008 05 toudou 20170102 600 0009 03 yamada 20170103 400 0010 01 suzuki 20170103 200 0011 03 yamada 20170103 900 0012 03 yamada 20170103 900 0013 01 suzuki 20170103 500 0014 03 yamada 20170103 600 0015 04 isono 20170103 900 0016 05 toudou 20170104 700 0017 04 isono 20170104 300 0018 02 saitou 20170104 700 0019 05 toudou 20170104 400 0020 01 suzuki 20170104 700 Tukubaiͷcjoin1Λ࢖͏ʢϚελ͕ࢦఆͨ͠Ωʔͷ࣍ͷΧϥϜ΁݁߹ʣ ࿈݁͢Δσʔλ͸ ΩʔͰฒ΂ସ͑Λ ͢Δඞཁ͕ͳ͍
  10. ͓୊1.2 $ sort -s -k2,2 TRAN | join -1 1

    -2 2 -o 2.1,0,1.2,2.3,2.4 MASTER - 0003 01 suzuki 20170101 500 0010 01 suzuki 20170103 200 0013 01 suzuki 20170103 500 0020 01 suzuki 20170104 700 0004 02 saitou 20170102 900 0006 02 saitou 20170102 100 0007 02 saitou 20170102 900 0018 02 saitou 20170104 700 0005 03 yamada 20170102 800 0009 03 yamada 20170103 400 0011 03 yamada 20170103 900 0012 03 yamada 20170103 900 0014 03 yamada 20170103 600 0002 04 isono 20170101 800 0015 04 isono 20170103 900 0017 04 isono 20170104 300 0001 05 toudou 20170101 600 0008 05 toudou 20170102 600 0016 05 toudou 20170104 700 0019 05 toudou 20170104 400 ΧϥϜͷฒͼॱΛࢦఆͯ݁͠߹ʢTRANͷ୲౰ऀidͷՕॴʹ୲౰ऀΛೖΕΔʣ -oΦϓγϣϯͰ ΩʔͷΧϥϜ͸0ͱࢦఆ Tukubaiͷjoin1΍cjoin1ͱ ಉ͡ฒͼΛࢦఆ
  11. ͓୊1.3 $ awk '{a[$2" "$3]+=$4}END{for(v in a)print v,a[v] }' TRAN

    | sort -s -k1,1 | join -1 1 -2 1 MASTER - 01 suzuki 20170101 500 01 suzuki 20170103 700 01 suzuki 20170104 700 02 saitou 20170102 1900 02 saitou 20170104 700 03 yamada 20170102 800 03 yamada 20170103 2800 04 isono 20170101 800 04 isono 20170103 900 04 isono 20170104 300 05 toudou 20170101 600 05 toudou 20170102 600 05 toudou 20170104 1100 ୲౰ऀͱച্೔ຖʹूܭʢawkͷ࿈૝഑ྻΛར༻ʣ awkͷ࿈૝഑ྻ͸ forϧʔϓͰग़ྗ͢Δࡍ ฒ΂ସ͕͑ඞཁ
  12. ͓୊1.3a $ sort -s -k2,2 TRAN | sm2 2 3

    4 4 | join1 key=1 MASTER 01 suzuki 20170101 500 01 suzuki 20170103 700 01 suzuki 20170104 700 02 saitou 20170102 1900 02 saitou 20170104 700 03 yamada 20170102 800 03 yamada 20170103 2800 04 isono 20170101 800 04 isono 20170103 900 04 isono 20170104 300 05 toudou 20170101 600 05 toudou 20170102 600 05 toudou 20170104 1100 ୲౰ऀͱച্೔ຖʹूܭʢTukubaiͷsm2ͱjoin1Λར༻ʣ ूܭॲཧ͕ sm2ίϚϯυͰ εοΩϦॻ͚Δ
  13. ͓୊2 01 xxxx.0c4d.0c44 02 xxxx.0d46.f3c3 03 xxxx.0d17.7313 04 xxxx.0d81.33a8 05

    xxxx.0d17.9658 06 xxxx.0c4d.09dc 07 xxxx.0a69.b7b1 08 xxxx.0d81.1dac 09 xxxx.0fff.d828 10 xxxx.0d17.7477 11 xxxx.0d46.33c3 12 xxxx.0148.3c58 13 xxxx.0d46.f3c6 14 xxxx.0d81.1c9c 15 xxxx.04f8.c8e5 16 xxxx.0d46.7eb6 17 xxxx.0d42.5cc4 18 xxxx.108b.cc1c 19 xxxx.0fff.ce94 20 xxxx.02b6.4c24 xxxx.17d0.2c07 1914 xxxx.0d81.33a8 2353 xxxx.0d81.1c9c 2275 xxxx.0d46.33c3 2500 xxxx.0fff.d828 3766 xxxx.13a2.8858 3235 xxxx.13aa.091e 3119 xxxx.0d81.1dac 3382 xxxx.0c4d.0c44 3618 xxxx.0d46.7eb6 3443 xxxx.0d46.f3c3 3431 xxxx.1301.bac4 1607 xxxx.0d46.f3c6 3063 xxxx.04f8.c8e5 3120 xxxx.0d55.9b9f 3119 DEVICELIST id,MACΞυϨε MEASUREMENT MACΞυϨε,ܭଌ஋ MACΞυϨε͸ޓ͍ʹ଍Γͳ͍΋ͷ͕͋Δ
  14. ͓୊2.1 $ comm <(awk '{print $2}' DEVICELIST | sort) <(awk

    '{print $1}' MEASUREMENT | sort) xxxx.0148.3c58 xxxx.02b6.4c24 xxxx.04f8.c8e5 xxxx.0a69.b7b1 xxxx.0c4d.09dc xxxx.0c4d.0c44 xxxx.0d17.7313 xxxx.0d17.7477 xxxx.0d17.9658 xxxx.0d42.5cc4 xxxx.0d46.33c3 xxxx.0d46.7eb6 xxxx.0d46.f3c3 xxxx.0d46.f3c6 xxxx.0d55.9b9f xxxx.0d81.1c9c xxxx.0d81.1dac xxxx.0d81.33a8 xxxx.0fff.ce94 xxxx.0fff.d828 xxxx.108b.cc1c xxxx.1301.bac4 xxxx.13a2.8858 xxxx.13aa.091e xxxx.17d0.2c07 commίϚϯυͰ֬ೝ͢Δ࣌͸ฒ΂ସ͕͑ඞཁ ୈҰΧϥϜ ɹ1ͭ໨ͷϑΝΠϧͷΈʹଘࡏ͢Δߦ ୈೋΧϥϜ ɹ2ͭ໨ͷϑΝΠϧͷΈʹଘࡏ͢Δߦ ୈࡾΧϥϜ ɹ2ͭͷϑΝΠϧʹڞ௨ͯ͠ଘࡏ͢Δߦ ޓ͍ʹڞ௨͢ΔΩʔͱ଍Γͳ͍Ωʔͷ֬ೝ
  15. ͓୊2.2 $ join -a 1 -1 2 <(sort -s -k2,2

    DEVICELIST) <(sort -k1,1 MEASUREMENT) | sort -k2,2 | awk '{print $1,$3}' xxxx.0c4d.0c44 3618 xxxx.0d46.f3c3 3431 xxxx.0d17.7313 xxxx.0d81.33a8 2353 xxxx.0d17.9658 xxxx.0c4d.09dc xxxx.0a69.b7b1 xxxx.0d81.1dac 3382 xxxx.0fff.d828 3766 xxxx.0d17.7477 xxxx.0d46.33c3 2500 xxxx.0148.3c58 xxxx.0d46.f3c6 3063 xxxx.0d81.1c9c 2275 xxxx.04f8.c8e5 3120 xxxx.0d46.7eb6 3443 xxxx.0d42.5cc4 xxxx.108b.cc1c xxxx.0fff.ce94 xxxx.02b6.4c24 LEFT JOIN૬౰ͷ֎෦݁߹ʢjoinίϚϯυͷ-aΦϓγϣϯΛར༻ʣ Mac΍FreeBSDͷํ͸gjoinίϚϯυͰ DEVICELISTͷid൪߸Λ࢖ͬͯ ݩͷॱ൪ʹฒ΂ସ͍͑ͯΔ ࠷ޙʹid൪߸Λ࡟আ
  16. ͓୊2.2a $ join2 key=1 <(sort -k1,1 MEASUREMENT) <(sort -s -k2,2

    DEVICELIST | self 2 1) | sort -k3,3 | delf 3 xxxx.0c4d.0c44 3618 xxxx.0d46.f3c3 3431 xxxx.0d17.7313 _ xxxx.0d81.33a8 2353 xxxx.0d17.9658 _ xxxx.0c4d.09dc _ xxxx.0a69.b7b1 _ xxxx.0d81.1dac 3382 xxxx.0fff.d828 3766 xxxx.0d17.7477 _ xxxx.0d46.33c3 2500 xxxx.0148.3c58 _ xxxx.0d46.f3c6 3063 xxxx.0d81.1c9c 2275 xxxx.04f8.c8e5 3120 xxxx.0d46.7eb6 3443 xxxx.0d42.5cc4 _ xxxx.108b.cc1c _ xxxx.0fff.ce94 _ xxxx.02b6.4c24 _ LEFT JOIN૬౰ͷ֎෦݁߹ʢTukubaiͷjoin2ίϚϯυΛར༻ʣ TukubaiͰ͸LEFT JOIN૬౰ͷίϚϯυ͸ແ͍ɻ ͔͠͠σʔλΛೖΕସ͑ͨΓ͢Ε͹࣮ݱՄೳɻ id൪߸ͷ࡟আʹ͸delfίϚϯυΛར༻ɻ
  17. ͓୊2.3 $ join -a 1 -a 2 -1 2 -2

    1 -o 0,1.1,2.2 -e @ <(sort -k2,2 DEVICELIST) <(sort -k1,1 MEASUREMENT) xxxx.0148.3c58 12 @ xxxx.02b6.4c24 20 @ xxxx.04f8.c8e5 15 3120 xxxx.0a69.b7b1 07 @ xxxx.0c4d.09dc 06 @ xxxx.0c4d.0c44 01 3618 xxxx.0d17.7313 03 @ xxxx.0d17.7477 10 @ xxxx.0d17.9658 05 @ xxxx.0d42.5cc4 17 @ xxxx.0d46.33c3 11 2500 xxxx.0d46.7eb6 16 3443 xxxx.0d46.f3c3 02 3431 xxxx.0d46.f3c6 13 3063 xxxx.0d55.9b9f @ 3119 xxxx.0d81.1c9c 14 2275 xxxx.0d81.1dac 08 3382 xxxx.0d81.33a8 04 2353 xxxx.0fff.ce94 19 @ xxxx.0fff.d828 09 3766 xxxx.108b.cc1c 18 @ xxxx.1301.bac4 @ 1607 xxxx.13a2.8858 @ 3235 xxxx.13aa.091e @ 3119 xxxx.17d0.2c07 @ 1914 ׬શ݁߹ʢjoinίϚϯυͷ-aΦϓγϣϯΛ2ճར༻ʣ -eΦϓγϣϯͰ ิ׬จࣈྻΛࢦఆՄೳ
  18. ͓୊2.3a $ loopj -d@ num=1 <(sort -s -k2,2 DEVICELIST |

    self 2 1) <(sort -k1,1 MEASUREMENT) xxxx.0148.3c58 12 @ xxxx.02b6.4c24 20 @ xxxx.04f8.c8e5 15 3120 xxxx.0a69.b7b1 07 @ xxxx.0c4d.09dc 06 @ xxxx.0c4d.0c44 01 3618 xxxx.0d17.7313 03 @ xxxx.0d17.7477 10 @ xxxx.0d17.9658 05 @ xxxx.0d42.5cc4 17 @ xxxx.0d46.33c3 11 2500 xxxx.0d46.7eb6 16 3443 xxxx.0d46.f3c3 02 3431 xxxx.0d46.f3c6 13 3063 xxxx.0d55.9b9f @ 3119 xxxx.0d81.1c9c 14 2275 xxxx.0d81.1dac 08 3382 xxxx.0d81.33a8 04 2353 xxxx.0fff.ce94 19 @ xxxx.0fff.d828 09 3766 xxxx.108b.cc1c 18 @ xxxx.1301.bac4 @ 1607 xxxx.13a2.8858 @ 3235 xxxx.13aa.091e @ 3119 xxxx.17d0.2c07 @ 1914 ׬શ݁߹ʢTukubaiͷloopjίϚϯυΛར༻ʣ -dΦϓγϣϯͰ ิ׬จࣈྻΛࢦఆՄೳ