lovit/customized_konlpy

customized KoNLPy

A customized version of KoNLPy, a Python package for Korean natural language processing.

For words it knows with certainty, customized_KoNLPy tokenizes and part-of-speech tags a given eojeol (a space-delimited unit) directly from those known words, without going through the underlying library. It does this with template-based tokenization.

Dictionary: {'아이오아이': 'Noun', '는': 'Josa'}
Template: Noun + Josa

Given the word list and template above, the eojeol '아이오아이는' is split into [('아이오아이', 'Noun'), ('는', 'Josa')].
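
The matching step described above can be sketched in plain Python. This is a minimal illustration, not ckonlpy's implementation: the function name `template_tokenize` is hypothetical, and the dictionary is assumed to map each word to exactly one tag.

```python
def template_tokenize(eojeol, dictionary, templates):
    """Return every (word, tag) split of `eojeol` that matches a template."""
    results = []

    def extend(position, remaining_tags, partial):
        # All tags consumed: accept only if the whole eojeol is covered.
        if not remaining_tags:
            if position == len(eojeol):
                results.append(list(partial))
            return
        tag = remaining_tags[0]
        # Try every substring starting at `position` that carries this tag.
        for end in range(position + 1, len(eojeol) + 1):
            word = eojeol[position:end]
            if dictionary.get(word) == tag:
                partial.append((word, tag))
                extend(end, remaining_tags[1:], partial)
                partial.pop()

    for template in templates:
        extend(0, template, [])
    return results


dictionary = {'아이오아이': 'Noun', '는': 'Josa'}
templates = [('Noun', 'Josa')]
print(template_tokenize('아이오아이는', dictionary, templates))
# → [[('아이오아이', 'Noun'), ('는', 'Josa')]]
```

An eojeol that no template covers completely yields an empty list, which is the case where a fallback analyzer would be needed.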

Install

$ git clone https://github.com/lovit/customized_konlpy.git
$ pip install customized_konlpy

Requires

  • JPype >= 0.6.1
  • KoNLPy >= 0.4.4

Usage

Part of speech tagging

As with KoNLPy, call Twitter.pos(phrase). For each eojeol, if a word known to the user dictionary is recognized, the eojeol is split by customized_tagger; eojeols made up of words that are not in the user dictionary are handled by the Twitter morphological analyzer.

twitter.pos('우리아이오아이는 이뻐요')
[('우리', 'Noun'), ('아이오', 'Noun'), ('아이', 'Noun'), ('는', 'Josa'), ('이뻐', 'Adjective'), ('요', 'Eomi')]

Because '아이오아이' was not a known word, the Twitter analyzer fails to recognize it properly. After adding the words to the user dictionary as described in the next section, the same call gives the following results.

twitter.pos('우리아이오아이는 이뻐요')
[('우리', 'Modifier'), ('아이오아이', 'Noun'), ('는', 'Josa'), ('이뻐', 'Adjective'), ('요', 'Eomi')]
twitter.pos('트와이스tt는 좋아요')
[('트와이스', 'Noun'), ('tt', 'Noun'), ('는', 'Josa'), ('좋', 'Adjective'), ('아요', 'Eomi')]

Add words to dictionary

Twitter in ckonlpy.tag accepts user dictionary entries through add_dictionary, either as a str or as a list of str.

from ckonlpy.tag import Twitter

twitter = Twitter()
twitter.add_dictionary('아이오아이', 'Noun')
twitter.add_dictionary(['트와이스', 'tt'], 'Noun')

To add a part-of-speech tag (word class) that the Twitter Korean analyzer does not use, you must set force=True.

twitter.add_dictionary('lovit', 'Name', force=True)

Add template to customized tagger

Templates can be added to the template-based tokenizer while the code is running. The list of templates currently in use can be checked as follows.

twitter.template_tagger.templates
[('Noun', 'Josa'), ('Modifier', 'Noun'), ('Modifier', 'Noun', 'Josa')]

A template is given as a tuple of str.

twitter.template_tagger.add_a_template(('Noun', 'Noun', 'Josa'))

Set templates tagger selector

Even with templates, more than one candidate can come out. You can design your own function to select the best of several candidates. Defining a few scoring criteria and assigning a weight to each, as in the example below, is the approach used by the Twitter analyzer; it is intuitive and tunable, so I consider it a very good approach.

my_weights = [
    ('num_nouns', -0.1),
    ('num_words', -0.2),
    ('no_noun', -1),
    ('len_sum_of_nouns', 0.2)
]

def my_evaluate_function(candidate):
    # Each candidate item is unpacked as (word, tag, begin, end).
    num_nouns = len([word for word, pos, begin, e in candidate if pos == 'Noun'])
    num_words = len(candidate)
    has_no_nouns = (num_nouns == 0)
    len_sum_of_nouns = 0 if has_no_nouns else sum(
        (len(word) for word, pos, _, _ in candidate if pos == 'Noun'))
    # Criterion values, in the same order as my_weights.
    scores = (num_nouns, num_words, has_no_nouns, len_sum_of_nouns)
    score = sum((score * weight for score, (_, weight) in zip(scores, my_weights)))
    return score

If you define my_weights and my_evaluate_function as in the example above and pass them to twitter.set_evaluator(), the best candidate is selected according to that function.

twitter.set_evaluator(my_weights, my_evaluate_function)
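
To see how such weights rank candidates, the sketch below scores two hypothetical candidates for '아이오아이는' with the same weights and scoring logic. It assumes each candidate is a list of (word, tag, begin, end) tuples, as the unpacking in my_evaluate_function suggests, and that the highest score wins. The candidate values and the use of max are illustrative assumptions, not taken from the library.

```python
my_weights = [
    ('num_nouns', -0.1),
    ('num_words', -0.2),
    ('no_noun', -1),
    ('len_sum_of_nouns', 0.2),
]


def my_evaluate_function(candidate):
    # Same scoring logic as the README example, restated so this block runs alone.
    num_nouns = len([word for word, pos, _, _ in candidate if pos == 'Noun'])
    num_words = len(candidate)
    has_no_nouns = (num_nouns == 0)
    len_sum_of_nouns = sum(len(word) for word, pos, _, _ in candidate if pos == 'Noun')
    values = (num_nouns, num_words, has_no_nouns, len_sum_of_nouns)
    return sum(value * weight for value, (_, weight) in zip(values, my_weights))


# Hypothetical candidates: one long noun versus two short nouns.
one_noun = [('아이오아이', 'Noun', 0, 5), ('는', 'Josa', 5, 6)]
two_nouns = [('아이오', 'Noun', 0, 3), ('아이', 'Noun', 3, 5), ('는', 'Josa', 5, 6)]

# Assumption: the candidate with the highest score is treated as best.
best = max([one_noun, two_nouns], key=my_evaluate_function)
print(best)
```

With these weights the single-noun split scores about 0.5 and the two-noun split about 0.2, so the penalties on num_nouns and num_words favor the split with fewer, longer nouns.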

Postprocessor

Post-processing is available for passwords, stopwords, passtags, and word replacement.

Only words or (word, tag) pairs registered in passwords are output.

from ckonlpy.tag import Postprocessor

passwords = {'아이오아이', ('정말', 'Noun')}
postprocessor = Postprocessor(twitter, passwords = passwords)
postprocessor.pos('우리아이오아이는 정말 이뻐요')
# [('아이오아이', 'Noun'), ('정말', 'Noun')]

Words or (word, tag) pairs registered in stopwords are not output.

stopwords = {'는'}
postprocessor = Postprocessor(twitter, stopwords = stopwords)
postprocessor.pos('우리아이오아이는 정말 이뻐요')
# [('우리', 'Modifier'), ('아이오아이', 'Noun'), ('정말', 'Noun'), ('이뻐', 'Adjective'), ('요', 'Eomi')]

If you specify particular tags, only words with those tags are output.

passtags = {'Noun'}
postprocessor = Postprocessor(twitter, passtags = passtags)
postprocessor.pos('우리아이오아이는 정말 이뻐요')
# [('아이오아이', 'Noun'), ('정말', 'Noun')]
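
The three filters above can be pictured as a single pass over the tagged output. The following is a minimal sketch under stated assumptions, not the Postprocessor source: `filter_tagged` is a hypothetical helper, and an entry is treated as listed when either the bare word or the (word, tag) pair is in the set.

```python
def filter_tagged(tagged, passwords=None, stopwords=None, passtags=None):
    """Apply passwords / stopwords / passtags filters to (word, tag) pairs."""

    def listed(entry, wordset):
        # An entry matches on the bare word or on the full (word, tag) pair.
        return entry[0] in wordset or entry in wordset

    kept = []
    for entry in tagged:
        if passwords and not listed(entry, passwords):
            continue  # not on the allow-list
        if stopwords and listed(entry, stopwords):
            continue  # on the block-list
        if passtags and entry[1] not in passtags:
            continue  # tag not requested
        kept.append(entry)
    return kept


tagged = [('우리', 'Modifier'), ('아이오아이', 'Noun'), ('는', 'Josa'),
          ('정말', 'Noun'), ('이뻐', 'Adjective'), ('요', 'Eomi')]

print(filter_tagged(tagged, passwords={'아이오아이', ('정말', 'Noun')}))
# → [('아이오아이', 'Noun'), ('정말', 'Noun')]
print(filter_tagged(tagged, passtags={'Noun'}))
# → [('아이오아이', 'Noun'), ('정말', 'Noun')]
```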

If you define the words or (word, tag) pairs to replace as a dict, the words are replaced in the tagged output.

replace = {'아이오아이': '아이돌', ('이뻐', 'Adjective'): '예쁘다'}
postprocessor = Postprocessor(twitter, replace = replace)
postprocessor.pos('우리아이오아이는 정말 이뻐요')
# [('우리', 'Modifier'), ('아이돌', 'Noun'), ('는', 'Josa'), ('정말', 'Noun'), ('예쁘다', 'Adjective'), ('요', 'Eomi')]
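
As a rough picture of the replacement step, the sketch below looks up each tagged word in the dict and swaps only the word, keeping the tag. `replace_words` is a hypothetical helper, and checking the (word, tag) key before the bare word is an assumption made for this example.

```python
def replace_words(tagged, replace):
    """Swap words found in `replace`; tags are left unchanged."""
    result = []
    for word, tag in tagged:
        if (word, tag) in replace:      # assumed: the tagged key takes priority
            word = replace[(word, tag)]
        elif word in replace:
            word = replace[word]
        result.append((word, tag))
    return result


replace = {'아이오아이': '아이돌', ('이뻐', 'Adjective'): '예쁘다'}
tagged = [('우리', 'Modifier'), ('아이오아이', 'Noun'), ('는', 'Josa'),
          ('정말', 'Noun'), ('이뻐', 'Adjective'), ('요', 'Eomi')]
print(replace_words(tagged, replace))
```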

To merge consecutive words into a single word, you can pass ngrams either as nested tuples or as tuples of str. An ngram given as a tuple of str is recognized as Noun.

ngrams = [(('미스', '함무라비'), 'Noun'), ('바람', '의', '나라')]
postprocessor = Postprocessor(twitter, ngrams = ngrams)
postprocessor.pos('미스 함무라비는 재밌는 드라마입니다')
# [('미스 - 함무라비', 'Noun'), ('는', 'Josa'), ('재밌는', 'Adjective'), ('드라마', 'Noun'), ('입니', 'Adjective'), ('다', 'Eomi')]

Loading wordset

utils provides functions for easily loading stopwords, passwords, and replace word pairs that have been saved to files.

load_wordset returns a set whose items are str or (word, tag) tuples. The contents of the example passwords.txt are shown below. A word and its tag are separated by a single space. stopwords.txt uses the same format.

아이오아이
아이오아이 Noun
공연

Example code using load_wordset:

from ckonlpy.utils import load_wordset

passwords = load_wordset('./passwords.txt')
print(passwords)
# {('아이오아이', 'Noun'), '아이오아이', '공연'}

stopwords = load_wordset('./stopwords.txt')
print(stopwords)
# {'은', '는', ('이', 'Josa')}
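
The file format can be read with a few lines of Python. The sketch below parses already-loaded lines rather than a file, and `parse_wordset` is a hypothetical name, not the library's load_wordset; skipping blank or malformed lines is an assumption.

```python
def parse_wordset(lines):
    """Turn 'word' and 'word tag' lines into a set of str / (word, tag)."""
    wordset = set()
    for line in lines:
        fields = line.split()
        if len(fields) == 1:
            wordset.add(fields[0])                 # bare word
        elif len(fields) == 2:
            wordset.add((fields[0], fields[1]))    # word with its tag
        # Assumed: blank lines and lines with more fields are ignored.
    return wordset


lines = ['아이오아이', '아이오아이 Noun', '공연', '']
print(parse_wordset(lines))
```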

Replacement word pairs are tab-separated. If the word to be replaced has a part-of-speech tag, separate the word and the tag with a single space.

str\tstr
str str\tstr

Below is an example replacewords.txt.

아빠	아버지
엄마 Noun	어머니

Example code using load_replace_wordpair:

from ckonlpy.utils import load_replace_wordpair

replace = load_replace_wordpair('./replacewords.txt')
print(replace)
# {'아빠': '아버지', ('엄마', 'Noun'): '어머니'}
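
A minimal sketch of parsing this format, again working on in-memory lines. `parse_replace_pairs` is a hypothetical name, and ignoring lines without a tab is an assumption rather than documented behavior.

```python
def parse_replace_pairs(lines):
    """Parse 'source<TAB>target' lines; source may be 'word' or 'word tag'."""
    pairs = {}
    for line in lines:
        line = line.rstrip('\n')
        if '\t' not in line:
            continue                      # assumed: skip lines without a tab
        source, target = line.split('\t', 1)
        fields = source.split(' ')
        # 'word tag' becomes a (word, tag) key; a bare word stays a str key.
        key = tuple(fields) if len(fields) == 2 else source
        pairs[key] = target
    return pairs


lines = ['아빠\t아버지', '엄마 Noun\t어머니']
print(parse_replace_pairs(lines))
# → {'아빠': '아버지', ('엄마', 'Noun'): '어머니'}
```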

Within an ngram, the words are separated by a single space; the ngram's part-of-speech tag, if given, is separated from the words by a tab.

str str
str str\tstr

Below is an example ngrams.txt.

바람 의 나라
미스 함무라비	Noun

Example code using load_ngram:

from ckonlpy.utils import load_ngram

ngrams = load_ngram('./ngrams.txt')
print(ngrams)
# [('바람', '의', '나라'), (('미스', '함무라비'), 'Noun')]
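
The ngram file format maps onto the two tuple shapes in a straightforward way. The sketch below is an illustration with a hypothetical `parse_ngrams` function operating on in-memory lines; it is not the library's load_ngram.

```python
def parse_ngrams(lines):
    """Parse 'w1 w2 ...' and 'w1 w2 ...<TAB>tag' lines into ngram tuples."""
    ngrams = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                                  # assumed: skip blank lines
        if '\t' in line:
            words, tag = line.split('\t', 1)
            ngrams.append((tuple(words.split()), tag))  # nested form with tag
        else:
            ngrams.append(tuple(line.split()))          # tuple of str, no tag
    return ngrams


lines = ['바람 의 나라', '미스 함무라비\tNoun']
print(parse_ngrams(lines))
# → [('바람', '의', '나라'), (('미스', '함무라비'), 'Noun')]
```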

0.0.6+ vs 0.0.5x

Some variable names, function names, and variable types were changed from 0.0.5x.

| Before | After |
| --- | --- |
| ckonlpy.tag.Twitter._loaded_twitter_default_dictionary | ckonlpy.tag.Twitter.use_twitter_dictionary |
| ckonlpy.tag.Twitter._dictionary | ckonlpy.tag.Twitter.dictionary |
| ckonlpy.tag.Twitter._customized_tagger | ckonlpy.tag.Twitter.template_tagger |
| ckonlpy.tag.Postprocessor.tag | ckonlpy.tag.Postprocessor.pos |
| ckonlpy.custom_tag.SimpleSelector | ckonlpy.custom_tag.SimpleEvaluator |
| ckonlpy.custom_tag.SimpleSelector.score | ckonlpy.custom_tag.SimpleEvaluator.evaluate |
| ckonlpy.tag.Twitter.set_selector | ckonlpy.tag.AbstractTagger.set_evaluator |
| ckonlpy.custom_tag.SimpleSelector.weight | ckonlpy.custom_tag.SimpleEvaluator.weight |

| After | Reason for change |
| --- | --- |
| ckonlpy.tag.Twitter.use_twitter_dictionary | Indicates whether the konlpy.tag.Twitter dictionary is used |
| ckonlpy.tag.Twitter.dictionary | Made public |
| ckonlpy.tag.Twitter.template_tagger | Makes explicit that the tagger is template-based; made public |
| ckonlpy.tag.Postprocessor.pos | Post-processes the output of the base tagger, so it now shares the same function name |
| ckonlpy.custom_tag.SimpleEvaluator | Class renamed from Selector to Evaluator |
| ckonlpy.custom_tag.SimpleEvaluator.evaluate | Function that scores tag-sequence candidates renamed from score to evaluate |
| ckonlpy.tag.AbstractTagger.set_evaluator | Function that sets the scoring function for tag-sequence candidates renamed; it moved from ckonlpy.tag.Twitter to ckonlpy.tag.AbstractTagger |
| ckonlpy.custom_tag.SimpleEvaluator.weight | weight format changed from {str: float} to [(str, float)] |
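
For code written against 0.0.5x, the weight format change amounts to turning a dict into a list of pairs. The snippet below is a small sketch of that conversion; the weight names and values are illustrative, borrowed from the evaluator example, and are not the library's defaults.

```python
# Old style (0.0.5x): weights as {str: float}. Values here are illustrative.
old_weight = {
    'num_nouns': -0.1,
    'num_words': -0.2,
    'no_noun': -1,
    'len_sum_of_nouns': 0.2,
}

# New style (0.0.6+): weights as [(str, float)].
# dict preserves insertion order in Python 3.7+, so the pairs keep their order.
new_weight = list(old_weight.items())
print(new_weight)
```

Order matters when an evaluation function zips the weights against a tuple of criterion values, so it is worth checking that the converted list lists the criteria in the order the function computes them.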
