Simple examples of cleaning text data.
See the source code, output, and comments in the Scala files:
- Word frequency
- Special chars, identify and clean
- Word stemmer
- Natural Language Processing
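
As a rough illustration of the first two items (word frequency and special-character cleaning), here is a minimal sketch. It is not the code in the repository's Scala files; the object name `WordFreqSketch`, its helper names, and the sample sentence are all made up for this example, and it uses only the JDK and the Scala standard library.

```scala
import java.text.Normalizer

// Hypothetical sketch, not the repository's implementation.
object WordFreqSketch {
  // Strip accents, lowercase, and turn every character that is not
  // an ASCII letter, digit, or whitespace into a space.
  def clean(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
      .replaceAll("\\p{M}", "")          // drop combining marks (accents)
      .toLowerCase
      .replaceAll("[^a-z0-9\\s]", " ")   // special characters become spaces
      .trim

  // Count how often each whitespace-separated token occurs after cleaning.
  def wordFreq(text: String): Map[String, Int] =
    clean(text).split("\\s+").filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }

  def main(args: Array[String]): Unit = {
    val sample = "Méthode & method: a METHOD, for cleaning-text!"
    // Print tokens by descending count, then alphabetically.
    wordFreq(sample).toSeq
      .sortBy { case (word, count) => (-count, word) }
      .foreach { case (word, count) => println(s"$word\t$count") }
  }
}
```

Replacing special characters with spaces rather than deleting them keeps hyphenated words such as `cleaning-text` as two tokens; whether that is desirable depends on the data.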
## notebook/regexs.ipynb, regexs.pdf
- Regular expressions
- Stop words
- Find patterns in tokens
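
The notebook topics above can be sketched in Scala as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, the "two letters followed by 5 to 8 digits" pattern is a hypothetical stand-in for whatever patterns the notebook actually looks for, and the names (`StopWordsSketch`, `numberLikeTokens`) are invented here.

```scala
// Hypothetical sketch, not the notebook's code.
object StopWordsSketch {
  // Tiny illustrative stop-word list (an assumption, not the notebook's list).
  val stopWords: Set[String] = Set("a", "an", "and", "for", "of", "the")

  // Lowercase the text and split it on runs of non-alphanumeric characters.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toList

  // Keep only the tokens that are not stop words.
  def removeStopWords(tokens: List[String]): List[String] =
    tokens.filterNot(stopWords.contains)

  // Hypothetical pattern: two letters followed by 5 to 8 digits, e.g. "ep1234567".
  private val NumberLike = "[a-z]{2}\\d{5,8}".r

  // Keep only the tokens that match the pattern in full.
  def numberLikeTokens(tokens: List[String]): List[String] =
    tokens.filter(t => NumberLike.pattern.matcher(t).matches())

  def main(args: Array[String]): Unit = {
    val tokens = tokenize("The method of EP1234567 and a cat")
    println(removeStopWords(tokens).mkString(" "))
    println(numberLikeTokens(tokens).mkString(" "))
  }
}
```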
- Querying PATSTAT outside the SQL relational model
- Comparing text: text distance, alignment, disambiguation, Google Refine…
- Regex vs. CFG, web scraping, table statistics, validation, data curation workflow

Requirements:
- Java JDK 8
- sbt >= 0.13.12 (build tool for Scala)
- IntelliJ IDEA with the Scala plugin (IDE, optional)
$ sbt "runMain application.TextCleanExample"
$ sbt "runMain application.StanfordNLPExample"
$ export dbUrl="jdbc:mysql://example.com/patstat_2015a?user=__USER__&password=__PASSWORD__&useSSL=false"
$ sbt "runMain application.RemoveStopWordsExample $dbUrl"
$ sbt "runMain application.PatentNumbersPatterns $dbUrl"
$ sbt "runMain application.EPFLPatentsProject $dbUrl"
$ docker run -it --rm -p 8888:8888 -v "$PWD/notebook":/home/jovyan/work jupyter/all-spark-notebook start-notebook.sh