• A simplistic backup strategy for Git repositories to AWS S3

    Since I started hosting some of my Git repositories on my own server instead of GitHub, I wanted a backup strategy for them. I see several possibilities there:

  • Securing an nginx hosted website with SSL

    As soon as you provide any kind of password-protected login on your website, you have to implement SSL/TLS to secure the password transmission. In addition, there is a general tendency nowadays to encrypt even traffic that does not carry sensitive data, simply to protect the privacy of your visitors. With encryption, eavesdroppers can only learn which domain (hostname) and server (IP address) you connect to, but not which page you are reading or what information it contains.

  • Using different hosters for domain and subdomain

    Recently, for an active project, I wanted to point the main domain sprakit.com to one hoster and the subdomain beta.sprakit.com to another. Technically, this is not a big deal, but it’s simpler with a basic understanding of how DNS records work.

  • Croatian POS Tagging

    Setting up POS tagging for less common languages usually takes a bit of effort. Luckily, for Croatian, Željko Agić has created a very good POS tagger licensed under CC-BY-SA-3.0. It is based on the hunpos package, which was originally created for Hungarian and is licensed under the New BSD License.

  • Analyzing the Common Crawl using Map-Reduce

    Let’s analyze some real data using Map-Reduce. Common Crawl is a crawl of the web run by a nonprofit organization (though they seem to have sponsors to pay for resources, and they’re even hiring employees). Their datasets are available for free download from a public S3 bucket. We will analyze the data using Hadoop (in my case on Amazon’s EMR). At first I tried to use Disco, but it took a lot of effort, and eventually I got stuck on a problem too hard to justify investing more time.
