  • Croatian POS Tagging

    POS tagging for less common languages usually takes some effort to set up. Luckily, for Croatian, Željko Agić has created a very good POS tagger licensed under CC-BY-SA-3.0. It is based on the hunpos package, which was originally created for Hungarian and is licensed under the New BSD License.

  • Analyzing the Common Crawl using Map-Reduce

    Let’s analyze some real data using Map-Reduce. Common Crawl is a large-scale crawl of the web run by a nonprofit organization (though they seem to have sponsors who pay for resources, and they are even hiring employees). Their datasets are available for free download from a public S3 bucket. We will analyze the data using Hadoop (in my case on Amazon’s EMR). At first I tried to use Disco, but it took a lot of effort, and eventually I got stuck on a problem that was too hard to justify investing more time.

  • Training your custom classifier in Tensorflow Inception image recognition

    Just a few months ago, Google released code for classifying images using neural networks. Some time later, they also released code for training your own custom models, either from scratch or by improving a baseline model. The baseline model in that case is usually a model trained on the ImageNet dataset.

  • Using Map-Reduce on Graphs

    Map-Reduce seems to be the standard technology for working with large amounts of data these days. It is best known in combination with simple, flat, table-like structures, maybe because most beginner tutorials focus on these.

  • Using CodeCommit with the Ubuntu AMI

    Sometimes you need to fetch your own git repository from an instance running an AMI. To do this, you need a role that allows your EC2 instance to access the repository. So, in IAM, create a new role with the AWSCodeCommitReadOnly policy attached and a trust relationship for EC2.
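
As a rough illustration of the role described above, here is a minimal sketch that builds the trust relationship document for EC2 and prints it as JSON. The role name is a hypothetical placeholder, not something from the post. The policy ARN assumes the AWS-managed AWSCodeCommitReadOnly policy named in the text.

```python
import json

# Hypothetical role name, chosen only for illustration; pick your own.
ROLE_NAME = "ec2-codecommit-readonly"

# ARN of the AWS-managed policy mentioned in the text.
POLICY_ARN = "arn:aws:iam::aws:policy/AWSCodeCommitReadOnly"


def ec2_trust_policy() -> dict:
    """Return a trust policy that lets the EC2 service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print the document so it can be saved to a file or pasted into the console.
    print(json.dumps(ec2_trust_policy(), indent=2))
```

The printed JSON could be saved to a file and passed to `aws iam create-role` via `--assume-role-policy-document`. The managed policy would then be attached with `aws iam attach-role-policy --policy-arn`. Whether this matches the exact setup used in the post is an assumption.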
