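I recommend 《数学之美》 (The Beauty of Mathematics). It is written as accessible popular science with vivid examples, so I doubt you will find it dry. I recommend it strongly, because I believe the real reason to do research is interest, not utilitarian motives.
Next is 《统计自然语言处理》 (Statistical Natural Language Processing). This book is quite dated, but it is also a classic; read it or not as you like.
NLP today relies on statistics, so I very strongly recommend 《统计学习方法》 (Statistical Learning Methods) by 李航 (Li Hang). He wrote it in his spare time over seven years, and PhD students reviewed it. NLP differs from machine learning: machine learning leans on rigorous mathematics and derivations to create one learning algorithm after another, whereas NLP takes what the machine learning experts have created and uses it as a tool. So to get started you only need a broad survey: read the principle behind each model, without necessarily working through every derivation.
Then come the Stanford open courses, which require a reasonable level of English. As for Coursera, I find the teaching better than that of many teachers in China.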
For example:
http://www.ark.cs.cmu.edu/LS2/in…
or
http://www.stanford.edu/class/cs…
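For engineering work, first search for an existing tool instead of building everything from scratch. For academic work, do a thorough survey first!
Now for the toolkit recommendations:
For Chinese, the obvious choice is the open-source toolkit from HIT: LTP (Language Technology Platform), developed by HIT-SCIR (哈尔滨工业大学社会计算与信息检索研究中心, the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology).
For English (Python):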
- pattern – simpler to get started than NLTK
- chardet – character encoding detection
- pyenchant – easy access to dictionaries
- scikit-learn – has support for text classification
- unidecode – because ascii is much easier to deal with
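To make the last item concrete, here is a rough standard-library sketch of what "ASCII is easier to deal with" means in practice. This is not unidecode itself: `fold_to_ascii` is a hypothetical helper of mine that only strips diacritics through Unicode decomposition, whereas unidecode also transliterates non-Latin scripts, which this sketch simply drops.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip diacritics: NFKD-decompose, then drop every non-ASCII code point.

    Hypothetical helper for illustration; it is a stand-in for, not a copy of,
    what the unidecode package does.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    print(fold_to_ascii("Señor Müller's café"))
```

For real projects the advice above still holds: reach for the existing package first and keep a helper like this only when you cannot add a dependency.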
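Must-read papers (taken from Quora; I will translate the explanations in parentheses later):
Parsing (syntactic structure analysis; it involves a lot of linguistics, so it can be rather dry)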
- Klein & Manning: “Accurate Unlexicalized Parsing” ( )
- Klein & Manning: “Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency” (a groundbreaking use of unsupervised learning to build a parser)
- Nivre “Deterministic Dependency Parsing of English Text” (shows that deterministic parsing actually works quite well)
- McDonald et al. “Non-Projective Dependency Parsing using Spanning-Tree Algorithms” (the other main method of dependency parsing, MST parsing)
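To give a feel for what "deterministic" dependency parsing means, here is a minimal sketch of a transition system. Note the assumptions: it implements the arc-standard system, a simpler relative of the transition-based approach and not necessarily the exact system in the Nivre paper, and the action sequence is supplied by hand, whereas a real parser would pick each action with a trained classifier. The function name `arc_standard` and the action labels are my own.

```python
from typing import List, Tuple


def arc_standard(words: List[str], actions: List[str]) -> List[Tuple[str, str]]:
    """Apply SHIFT / LEFT / RIGHT actions and return (head, dependent) arcs."""
    stack: List[str] = ["ROOT"]
    buffer = list(words)
    arcs: List[Tuple[str, str]] = []
    for action in actions:
        if action == "SHIFT":
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            # The top of the stack becomes the head of the item just below it.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT":
            # The item below the top becomes the head of the top item.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


if __name__ == "__main__":
    sentence = ["She", "eats", "fish"]
    gold_actions = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT", "RIGHT"]
    for head, dependent in arc_standard(sentence, gold_actions):
        print(f"{head} -> {dependent}")
```

Each word is shifted once and attached once, so the parse takes a number of steps linear in the sentence length, which is the main appeal of this family of parsers.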
Machine Translation
- Knight “A statistical MT tutorial workbook” (easy to understand, use instead of the original Brown paper)
- Och “The Alignment-Template Approach to Statistical Machine Translation” (foundations of phrase based systems)
- Wu “Inversion Transduction Grammars and the Bilingual Parsing of Parallel Corpora” (arguably the first realistic method for biparsing, which is used in many systems)
- Chiang “Hierarchical Phrase-Based Translation” (significantly improves accuracy by allowing for gappy phrases)
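The word-alignment idea that statistical MT tutorials usually start from can be sketched in a few lines. What follows is a minimal EM trainer in the style of IBM Model 1, under my own simplifying assumptions: no NULL word, a toy three-sentence corpus, and a hypothetical function name `train_ibm_model1`. It is meant as a reading aid, not as the formulation of any specific paper in the list.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    """
    target_vocab = {word for _, target in pairs for word in target}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for source, target in pairs:
            for e in target:
                # E-step: spread one unit of mass for e over the source words.
                norm = sum(t[(e, f)] for f in source)
                for f in source:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # M-step: renormalise the expected counts per source word.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t


if __name__ == "__main__":
    corpus = [
        (["das", "haus"], ["the", "house"]),
        (["das", "buch"], ["the", "book"]),
        (["ein", "buch"], ["a", "book"]),
    ]
    table = train_ibm_model1(corpus)
    for f in ["das", "haus", "buch", "ein"]:
        best = max(["the", "house", "book", "a"], key=lambda e: table[(e, f)])
        print(f, "->", best)
```

Even on this tiny corpus, the co-occurrence of "das" with "the" in two sentences is enough for EM to pull the remaining probability mass toward the correct pairs.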
Language Modeling
- Goodman “A bit of progress in language modeling” (describes just about everything related to n-gram language models; it is a survey that covers nearly every n-gram topic, including smoothing and clustering)
- Teh “A Bayesian interpretation of Interpolated Kneser-Ney” (shows how to get state-of-the art accuracy in a Bayesian framework, opening the path for other applications)
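If n-gram models and smoothing are new to you, a tiny example may help before reading the papers. The sketch below is a bigram model with add-k smoothing, which is the simplest smoothing scheme and not the Kneser-Ney family that these papers focus on. The class name `BigramLM`, the sentence markers `<s>` and `</s>`, and the toy corpus are my own assumptions.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-k smoothing (k=1 is Laplace smoothing)."""

    def __init__(self, sentences, k=1.0):
        self.k = k
        self.context_counts = Counter()  # how often each word appears as a context
        self.bigram_counts = Counter()   # how often each (previous, word) pair appears
        self.vocab = {"</s>"}            # every word the model may predict
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(sentence)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(self, prev, word):
        """Smoothed P(word | prev); unseen pairs get a small non-zero share."""
        numerator = self.bigram_counts[(prev, word)] + self.k
        denominator = self.context_counts[prev] + self.k * len(self.vocab)
        return numerator / denominator

    def log_prob(self, sentence):
        """Log-probability of a whole sentence, including the end marker."""
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
    lm = BigramLM(corpus)
    print(lm.prob("the", "cat"), lm.prob("the", "ran"))
    print(lm.log_prob(["the", "cat", "sat"]) > lm.log_prob(["cat", "the", "sat"]))
```

The point of smoothing is visible in `prob`: a pair that never occurred in training still receives some probability, so a single unseen bigram does not force the whole sentence to probability zero.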
Machine Learning for NLP
- Sutton & McCallum “An introduction to conditional random fields for relational learning” (everyone should know CRFs, and this paper is the easiest to understand)
- Knight “Bayesian Inference with Tears” (explains the general idea of bayesian techniques quite well)
- Berg-Kirkpatrick et al. “Painless Unsupervised Learning with Features” (this is from this year and thus a bit of a gamble, but this has the potential to bring the power of discriminative methods to unsupervised learning)
Information Extraction
- Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. COLING 1992. (The very first paper for all the bootstrapping methods in NLP. It is a hypothetical work in the sense that it doesn’t give experimental results, but it influenced its followers a lot.)
- Collins and Singer. Unsupervised Models for Named Entity Classification. EMNLP 1999. (It applies several variants of co-training-like IE methods to the NER task and explains the motivation for doing so. Students can learn from this work the logic of writing a good NLP research paper.)
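The Hearst paper is the origin of lexical patterns such as "X such as A, B, and C" for finding hyponyms. Here is a minimal regex sketch of that one pattern. It assumes single-word noun phrases and plain whitespace tokenisation, so it will misfire on real text where noun phrases span several words. The name `hearst_pairs` is hypothetical.

```python
import re

# "<hypernym> such as <item>(, <item>)*((,)? and|or <item>)?" with one-word items.
# The ", and " / ", or " alternatives come first so that "and" / "or" are not
# mistaken for list items.
SUCH_AS = re.compile(
    r"(\w+),? such as (\w+(?:(?:, and |, or |, | and | or )\w+)*)"
)
# Separators between list items, used to split the captured list.
SEPARATOR = re.compile(r",? (?:and|or) |, ")


def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for hyponym in SEPARATOR.split(match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs


if __name__ == "__main__":
    sentence = "He studies languages such as Python, Haskell, and Rust."
    for hyponym, hypernym in hearst_pairs(sentence):
        print(f"{hyponym} is a kind of {hypernym}")
```

Bootstrapping methods build on this by using the extracted pairs to discover new patterns, then using those patterns to extract more pairs.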
Computational Semantics
- Gildea and Jurafsky. Automatic Labeling of Semantic Roles. Computational Linguistics 2002. (It opened up the trend of semantic role labeling in NLP, followed by several CoNLL shared tasks dedicated to SRL. It shows how linguistics and engineering can collaborate with each other. A shorter version appeared at ACL 2000.)
- Pantel and Lin. Discovering Word Senses from Text. KDD 2002. (Supervised WSD was explored a lot in the early 2000s thanks to the Senseval workshops, but few systems actually benefit from WSD because manually crafted sense mappings are hard to obtain. These days we see a lot of evidence that unsupervised clustering improves NLP tasks such as NER, parsing, SRL, etc.)
Author: 吴俣
Source: 知乎日报 (Zhihu Daily)