python - How to make POS n-grams more effective?


I am doing text classification with a SOM, using POS n-grams as features. But it took me 2 hours just to complete the POS unigrams. I have 5000 documents, each about 300 words long. Here's my code:

  def posNgrams(s, n):
      '''Calculate POS n-grams and return a dictionary'''
      text = nltk.word_tokenize(s)
      text_tags = nltk.pos_tag(text)
      taglist = []
      output = {}
      for item in text_tags:
          taglist.append(item[1])
      for i in range(len(taglist) - n + 1):
          g = ' '.join(taglist[i:i + n])
          output.setdefault(g, 0)
          output[g] += 1
      return output
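The tagging and the counting can be separated, and the n-gram counting itself is cheap. A minimal, self-contained sketch of just the counting step, run on a hand-written (hypothetical) tag sequence:

```python
def tag_ngrams(taglist, n):
    """Count n-grams over a list of POS tags, mirroring the loop above."""
    output = {}
    for i in range(len(taglist) - n + 1):
        g = ' '.join(taglist[i:i + n])
        output[g] = output.get(g, 0) + 1
    return output

# Hand-written tag sequence, just for illustration:
print(tag_ngrams(['DT', 'JJ', 'NN', 'VBZ'], 2))
# {'DT JJ': 1, 'JJ NN': 1, 'NN VBZ': 1}
```

Timing this counting step on its own should confirm that virtually all of the 2 hours is spent inside nltk.pos_tag, not in the n-gram loop.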

I used the same method for word n-grams and it only took a few minutes. What can I do to speed up my POS n-grams? Can you give me some advice?

I am using a server with these specs (from inxi -C):

  CPU(s): 2 hexa-core Intel Xeon E5-2430 v2s (-HT-MCP-SMP-)
          cache: 30720 KB
          flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx)
  Clock Speeds: 1: 2500.036 MHz

Generally, the canonical answer would be to use batch tagging with pos_tag_sents, but it seems that isn't any faster.

Before attempting to speed up the POS tagging, try profiling the individual steps (using only 1 core):

  import time
  from nltk.corpus import brown
  from nltk import sent_tokenize, word_tokenize, pos_tag
  from nltk import pos_tag_sents

  # Load brown corpus
  start = time.time()
  brown_corpus = brown.raw()
  loading_time = time.time() - start
  print "Loading brown corpus took", loading_time

  # Sentence tokenizing corpus
  start = time.time()
  brown_sents = sent_tokenize(brown_corpus)
  sent_time = time.time() - start
  print "Sentence tokenizing corpus took", sent_time

  # Word tokenizing corpus
  start = time.time()
  brown_words = [word_tokenize(i) for i in brown_sents]
  word_time = time.time() - start
  print "Word tokenizing corpus took", word_time

  # Loading, sent_tokenize, word_tokenize all together
  start = time.time()
  brown_words = [word_tokenize(s) for s in sent_tokenize(brown.raw())]
  tokenize_time = time.time() - start
  print "Loading and tokenizing corpus took", tokenize_time

  # POS tagging one sentence at a time
  start = time.time()
  brown_tagged = [pos_tag(word_tokenize(s)) for s in sent_tokenize(brown.raw())]
  tagging_time = time.time() - start
  print "Tagging sentence by sentence took", tagging_time

  # Tagging sentences by batch using pos_tag_sents
  start = time.time()
  brown_tagged = pos_tag_sents([word_tokenize(s) for s in sent_tokenize(brown.raw())])
  tagging_time = time.time() - start
  print "Tagging sentences by batch took", tagging_time

[out]:

  Loading brown corpus took 0.154870033264
  Sentence tokenizing corpus took 3.77206301689
  Word tokenizing corpus took 13.982845068
  Loading and tokenizing corpus took 17.8847839323
  Tagging sentence by sentence took 1114.65085101
  Tagging sentences by batch took 1104.63432097

Note that pos_tag_sents was known as batch_pos_tag before NLTK 3.0.

Finally, I think you will need to consider other POS taggers to preprocess your data, or you will have to use threading to handle the POS tagging.

