[Natural Language Processing][Segmenter][Stanford NLP] How to use the Stanford Segmenter (word segmentation) - 葛瑞斯肯樂活筆記

To use the Stanford Chinese word segmenter, first go to the following URL:

Then download the item outlined in the image below.

Once the download finishes, unzip the archive and you will see the following directory structure:

For sample code, open SegDemo.txt.

Alternatively, refer to the code I wrote below.

Before typing in the program, add slf4j-api.jar, slf4j-simple.jar, stanford-segmenter-3.6.0.jar, and stanford-segmenter-3.6.0-sources.jar to your project's build path.

Also copy the entire data folder into the root of your Java project:

	import java.io.IOException;
	import java.util.List;
	import java.util.Properties;

	import edu.stanford.nlp.ie.crf.CRFClassifier;
	import edu.stanford.nlp.ling.CoreLabel;

	public class stanfordNLPSegmenter {
	    public static void main(String[] args) throws IOException {
	        Properties props = new Properties();
	        // Folder that holds the segmenter's corpora and dictionaries
	        props.setProperty("sighanCorporaDict", "data");
	        // Dictionary used to decide word boundaries
	        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
	        props.setProperty("inputEncoding", "UTF-8");
	        props.setProperty("sighanPostProcessing", "true");

	        CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
	        // Load the model trained on the Chinese Treebank (CTB)
	        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);

	        String sample = "我住在美国。";
	        // segmentString returns the words of the sentence as a list of strings
	        List<String> segmented = segmenter.segmentString(sample);
	        System.out.println(segmented);
	    }
	}
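
If the program fails while loading the model, a likely cause is that the data folder was not copied to the place the relative paths expect. The following is a minimal sketch of a pre-flight check, not part of the Stanford distribution: the class name `SegmenterDataCheck` and the method `missingFiles` are hypothetical, and the two file names are simply the ones used in the code above. It relies only on the Java standard library.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks that the model files referenced in the example exist.
final class SegmenterDataCheck {
    // File names taken from the example above; adjust them if your download differs.
    static final String[] REQUIRED = {"ctb.gz", "dict-chris6.ser.gz"};

    // Returns the names of required files that are not present in dataDir.
    static List<String> missingFiles(Path dataDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dataDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults to "data" relative to the working directory, as in the example.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "data");
        List<String> missing = missingFiles(dataDir);
        if (missing.isEmpty()) {
            System.out.println("All model files found in " + dataDir.toAbsolutePath());
        } else {
            System.out.println("Missing from " + dataDir.toAbsolutePath() + ": " + missing);
        }
    }
}
```

Run it from the same working directory as the segmenter program, since both resolve `data` relative to that directory.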