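To use the Stanford Chinese word segmenter, first go to the following URL:
http://nlp.stanford.edu/software/segmenter.html#Download
and download the package outlined in the figure below.
After downloading, unzip it and you will see the following directory structure:
For sample code, open SegDemo.txt,
or refer directly to the code I wrote below.
Before entering the code, add slf4j-api.jar, slf4j-simple.jar, stanford-segmenter-3.6.0.jar, and stanford-segmenter-3.6.0-sources.jar to your project's build path,
and copy the entire data folder into your Java project: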
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class stanfordNLPSegmenter {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Folder that holds the segmenter's data files
        props.setProperty("sighanCorporaDict", "data");
        // Dictionary used to decide word boundaries
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        // Load the model trained on the Chinese Treebank (ctb)
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);

        String sample = "我住在美国。";
        List<String> segmented = segmenter.segmentString(sample);
        System.out.println(segmented);
    }
}
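The program loads data/dict-chris6.ser.gz and data/ctb.gz by relative path, so it only works if the data folder was copied to the directory the program runs from. As a rough sketch (the class name DataFolderCheck and the method missingFiles are my own, not part of the Stanford package), you could check for the two files before loading the model, assuming the working directory is the project root:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Stanford segmenter distribution.
class DataFolderCheck {
    // The two files the segmenter program loads, relative to the project root.
    static final String[] REQUIRED = {"data/dict-chris6.ser.gz", "data/ctb.gz"};

    // Returns the relative paths from REQUIRED that are not regular files under projectRoot.
    static List<String> missingFiles(Path projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String relative : REQUIRED) {
            if (!Files.isRegularFile(projectRoot.resolve(relative))) {
                missing.add(relative);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the program is started from the project root.
        Path root = Paths.get("").toAbsolutePath();
        List<String> missing = missingFiles(root);
        if (missing.isEmpty()) {
            System.out.println("All model files found under " + root);
        } else {
            System.out.println("Missing under " + root + ": " + missing);
        }
    }
}
```

If anything is reported missing, copy the data folder again or adjust the paths passed to props.setProperty and loadClassifierNoExceptions.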
Then run stanfordNLPSegmenter and the segmentation result is printed:
[我, 住在, 美國]
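The bracketed form above is just how Java prints a List. If you would rather have the words on one line separated by spaces, a minimal sketch is to join the list yourself. The class and method names here are hypothetical, and the hard-coded list only stands in for whatever segmentString returns:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it does not call the Stanford library.
class SegmentedOutputFormat {
    // Joins segmented words with a single space between them.
    static String joinTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Stand-in for the List<String> returned by segmenter.segmentString(sample).
        List<String> segmented = Arrays.asList("我", "住在", "美國");
        System.out.println(joinTokens(segmented));
    }
}
```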