03 April 2020

Posted by PSY

at 1:46 PM

Recommender Systems

Introduction to Recommender Systems

Recommender systems are an important application of machine learning. They are critical drivers for online retail businesses such as Amazon and Netflix as they introduce customers to products that are purchased by other users with similar tastes. From the perspective of machine learning, it has an important property of learning features.

When to use Recommender System?

The representative examples are Amazon and Netflix. There are many users, products, ratings and overlap in the items reviewed by different users. So this is not for a one-of-a kind product or a highly customized product for each individual.

The major technique of recommender system is collaborative filtering algorithm. Let's see how collaborative filtering works by applying it to an example of movie ratings.

Dataset

We use the following matrices to represent a dataset.

nm is the number of movies, nuis the number of users, and n is the number of features.

The matrix Y( a nm x nu matrix) stores the ratings y(i,j) (from 1 to 5, for instance).

The matrix R is an binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and R(i,j)=0 otherwise.

The matrix ( a nm x n matrix) stores the features of movies. The i-th row of X corresponds to the feature vector x(i) for the i-th movie.

The matrix ( a nu x n matrix) stores user features. The j-th row of Theta corresponds to one parameter vector (j). Both x(i) and (j) are n-dimensional vectors where n is the number of features.

Objective

We want to predict movie ratings for the movies that users have not yet rated, that is, the entries with R(i,j)=1. This will allow us to recommend the movies with the highest predicted ratings to the user.

Approach

Consider a set of n-dimensional parameter vectors x(1), ..., xnm and (1), ..., (nu), where the model predicts the rating for movie i by user j as y(i,j) =( (j))Tx(i) (Note: MT means transpose of matrix M). Given a dataset that consists of a set of ratings produced by some users on some movies, the model learns the parameter vectors x(1), ..., xnm and (1), ..., (nu)that produce the best fit (minimizes the squared error).

Cost Function

By fitting available user ratings, we can estimate parameters such as user features and movie features. The fitting is done by evaluating the cost function given by ,

J(x(1), ..., x(nm), (1), ..., (nu)) = 12(i,j): r(i,j)=1(( (j))Tx(i)-y(i,j))2 + (2j=1nuk=1n(k(j))2 ) + ( 2i=1nmk=1n(xk(j))2 )

The first term on the right measures the squared differences between predictions and available ratings. The second and third terms are for regularization of user parameters and product parameters respectively. Regularization is for minimizing parameters for the sake of generalizing models to fit new data where lambda indicates its strength. So the cost function is determined by a balance between fitting the given data and fitting new data.

Gradient

The minimization of cost function is done by updating parameters using gradient descent.

where is a learning rate which determines the updating speed. The smaller it is, the more precise and slower the updating proceeds.

The gradients of the cost function are calculated by following,

Collaborative filtering learning algorithm

Since we can learn features whether user parameters or product parameters, we can use random numbers for initial parameters and keep updating them until we get the small enough value of cost function. Then we use the learned features to predict a rating. So we can follow these steps for collaborative filtering.

1. Initialize x(1), ..., xnm and (1), ..., (nu)to small random values

2. Minimize J(x(1), ..., x(nm), (1), ..., (nu))using gradient descent for every j=1, ..., nu, i=1, ..., nm

3. For a user with parameters and a movie with learned features x, predict a rating Tx

Recommendation

For each product i, we learn a feature vector x(i)in Rn.

A feature vector captures the aspects of the product which are important for ratings. For example, the features for movies can be x1 = romance, x2= action, x3= comedy, and so on. If the distance between features of two movies, x(i)-x(j), is small, we can say the two movies are similar. So We can make recommendations to a user by finding similar items to the one with a high rating by the user.

For a new user who doesn't have any ratings yet, we can predict ratings as the means of available ratings.

Conclusion

In recommender system using collaborative filtering learning algorithm, each user makes ratings for a subset of products, thus making contributions to learn better features, which are then used to make better predictions of rating for everyone.

01 July 2014

Posted by Anonymous

at 6:38 PM

2 comments

Labels: golang

Go Concurrency: From the Trenches

Recently I finished some work using the Go programming language and will share here some things learned while implementing web crawlers in this cool language.

Our Current Implementation and its Weak Points

Our goal was to improve the efficiency and performance of some web crawler scripts that have been running in production for over two years with increasing performance problems, primarily due to CPU usage. Currently we are managing these crawlers to grab data from various web sites. The crawlers were written in Ruby, and the method used is straightforward: first get a page URL or account ID, parse the page content or invoke an API call, persist data to the db, and repeat. The benefits of this approach are that it is simple and straightforward; anyone can easily understand the structure. However, when the size of a task queue grows to be very large, weaknesses emerge. The queued jobs can get backed up waiting for responses, which in turn extends the execution time of a batch significantly.

Built-in Concurrency Support

There are two possible paths to improve our code, one using multiple threads, the other using non-blocking IO and callbacks. After reading some code, I chose to use threads since I find such code to be more readable and maintainable.

goroutines

Golang has a built-in concurrency construct called a goroutine, based on the CSP model (It is said the model is Newsqueak-Alef-Limbo, which has some differences between original CSP, please refer to this slide for some extra information). Here is a brief instruction to goroutines, from the official slides.

An independently executing function, launched by a go statement.
Has its own call stack, which grows and shrinks as required.
It's very cheap. It's practical to have thousands, even hundreds of thousands of goroutines.
Not a thread.
There might be only one thread in a program with thousands of goroutines.
Instead, goroutines are multiplexed dynamically onto threads as needed to keep all the goroutines running.
If you think of it as a very cheap thread, you won't be far off.

Here is a simple sample program with goroutines:

package main

import "fmt"
import "time"

func main() {
     fmt.Println("START 1")
     for i := 0; i < 3; i++ {
          foo(i)
     }
     fmt.Println("END 1")

     fmt.Println("START 2")
     for i := 0; i < 3; i++ {
          go foo(i)
     }
     fmt.Println("END 2")

     time.Sleep(1 * time.Second)
}

func foo(i int) {
     time.Sleep(1)
     fmt.Printf("Call index: %d\n", i)
}

And here is the output:

START 1
Call index: 0
Call index: 1
Call index: 2
END 1
START 2
END 2
Call index: 0
Call index: 2
Call index: 1

There are two loops in the code, and each invokes function foo() three times. The difference is the function is launched directly in first loop, while launched via a go statement in the other one. The execution order of the first loop is normal, but the second loops output is out of order. Actually the output from second loop should have an unpredictable order, and you can check this by issuing more goroutines. By the way, the sleep used in Line 19 is to guarantee all goroutines has finished execution, since the main process does not wait for others. There are, of course, some better synchronisation mechanisms available. The full source can be found here.

channels

The goroutine shown above doesn't return data, so it's not a full functional component. Now let's talk about Go's channel construct. Channels are something like a pipe: data is pumped in from one side, and fetched from the other. Let's modify the code above to show how this works. I have removed the normal loop to make the code shorter.

package main

import "fmt"

func main() {

    foo_channel := make(chan string)
    channel_cnt := 3

    fmt.Println("Start launching goroutines")
    for i := 0; i < channel_cnt; i++ {
        go foo(i, foo_channel)
    }
    fmt.Println("Finish launching goroutines")

    for i := 0; i < channel_cnt; i++ {
        fmt.Println(<-foo_channel)
    }

}

func foo(i int, foo_channel chan string) {
    s := fmt.Sprintf("Call index %d", i)
    foo_channel <- s
}

And here is the output：

Start launching goroutines
Finish launching goroutines
Call index 0
Call index 1
Call index 2

As we can see, there is no sleep() call at the end, but instead a variable foo_channel which pipes data from goroutines back to the main process. The put operation will be blocked if there is already some data in the channel, and the get operation will also be blocked if there is no data in the channel. This is the method that controls function flow. Full source can be found here.

The size of this channel is 1, which makes it like a semaphore. But we can specify another positive integer while creating channel, then the new one should contain buffer function. While pushing data to the channel, the operation will only be blocked if all buffers has been filled. In the other hand, the get operation will only be blocked if all buffers are cleared. This feature can be used to control tje concurrency size of our web crawlers. Here is another example:

package main

import "fmt"
import "time"

func main() {

    channel_cnt := 10
    concurrency_chan := make(chan bool, 2)
    msg_chan := make(chan string)

    fmt.Println("Start launching goroutines")
    for i := 0; i < channel_cnt; i++ {
        go foo(i, concurrency_chan, msg_chan)
    }
    fmt.Println("Finish launching goroutines")

    for i := 0; i < channel_cnt; i++ {
        fmt.Println(<-msg_chan)
    }

}

func foo(i int, concurrency_chan chan bool, msg_chan chan string) {
    concurrency_chan <- true
    s := fmt.Sprintf("%s: Call index %d", time.Now(), i)
    msg_chan <- s
    time.Sleep(1 * time.Second)
    <-concurrency_chan
}

This program create two channels, and the concurrency_chan is a buffered channel of size 2. At the beginning of each goroutine, we put a true value into the channel, and receive back at the other end. With these two lines, we can guarantee that only 2 goroutines are ever running at same time. Adding a sleep() can show this more clearly. Again, source code can be found here.

Web Crawlers in Golang

With goroutines, we can implement concurrent web crawlers more easily than with traditional threads. Like with our original architecture, the new approach also has three basic steps: 1) fetch a task from queue server, 2) parse the URL or account ID, then 3) save to the database. The difference is in the second and third steps, now goroutines are used to inexpensively multiplex our calls to external services. Since a goroutine is implemented as a regular function, the changes to our original logic are minor, even trivial. However, the improvement is great. During the testing, we found CPU utilisation for hosts running our crawlers has been reduced by up to 90%. But this is an upper bound and multiple hosts are still needed for backup and to segregate API calls to multiple services.

17 January 2014

Posted by shiawuen

at 6:21 PM

0 comments

Labels: angularjs, polymer, web components

Polymer: First Impressions

After a few days of trying out Polymer, here are some of my first impressions.

What is Polymer?

According to its official website, Polymer is a new type of library for the web, built on top of Web Components, and designed to leverage the evolving web platform.

Solving the new <table /> issue

Although we have come a long way from <table /> layouts to the Semantic Web, it's still obvious that we are merely using <div /> or other elements to replace tables. With Polymer, which utilizes Web Components, we can finally realise the real Semantic Web to encapsulate structural details of components in separate files so that our documents are more readable.

Here's an example of creating a tab in Foundation.

index.html

<!doctype html>
<html>
<head>
<script src="polymer/platform.js"></script>

<link rel="import" href="name-list.html">
</head>
<body>
<dl class="tabs" data-tab>
<dd class="active"><a href="#panel2-1">Tab 1</a></dd>
<dd><a href="#panel2-2">Tab 2</a></dd>
</dl>
<div class="tabs-content">
<div class="content active" id="panel2-1">
<p>Abc</p>
</div>
<div class="content" id="panel2-2">
<p>Abc</p>
<p>Abc</p>
</div>
</div>
</body>
</html>

In Polymer, we can create the same tab this way

components.html

<link rel="import" href="bower_components/polymer/polymer.html">

<polymer-element name="x-tabs">
<template>
<dl id="tabs" class="tabs" data-tab>
<template repeat="{{ items }}">
<dd class="{{ active ? 'active' : '' }}">
<a href="#">{{ title }}</a>
</dd>
</template>
</dl>
<div class="tabs-content">
<content></content>
</div>
</template>
<script>
Polymer('x-tabs', {
ready: function(){
this.items = this.children.array().map(function(el){
return {
title: el.title,
active: el.getAttribute('active') !== null
};
});
}
})
</script>
</polymer-element>

<polymer-element name="x-tab" noscript>
<template>
<div class="content">
<content></content>
</div>
</template>
</polymer-element>

index.html

<!doctype html>
<html>
<head>
<script src="polymer/platform.js"></script>

<link rel="import" href="name-list.html">
</head>
<body>
<x-tabs>
<x-tab title="test">
<p>Abc</p>
</x-tab>
<x-tab title="Hello" active>
<p>Abc</p>
<p>Abc</p>
</x-tab>
</x-tabs>
</body>
</html>

Although this example might be missing some wiring, it still shows the clarity of the component structure as opposed to a pile of difficult to maintain code.

Data Binding

While we have so many MV frameworks out there, 2-way data binding is always a tough issue facing us developers. In Polymer, we get it by default with patterns similar to data binding in AngularJS.

In AngularJS

index.html

<!doctype html>
<html>
<head>
<script src="angular.js"></script>
<script src="app.js"></script>
</head>
<body ng-controller="MainCtrl">
<div ng-repeat="i in items">
<p>{{ i }}</p>
</div>
</body>
</html>

index.js

function MainCtrl($scope){
$scope.items = [1,2,3,4,5];
}

MainCtrl.inject = ['$scope'];

In Polymer

index.html

<!doctype html>
<html>
<head>
<script src="polymer/platform.js"></script>
</head>
<body>
<polymer-element name="x-items">
<template repeat="i in items">
<div><p>{{ i }}</p></div>
</template>
</polymer-element>
<x-items></x-items>
<script>
document.querySelector('x-items').model = { items: [1,2,3,4,5] };
</script>
</body>
</html>

You can see in the example above that there isn't much difference between the data binding strategies of AngularJS and Polymer.

Conclusion and first impressions

At the moment Polymer is still quite new for me and I'm still trying to grasp how it works. If fact, the examples I used above might not even be how the Google Developers Group intended it to be used, but I am sure it lays a good foundation for other upcoming web frameworks.

01 July 2013

Posted by Anonymous

at 12:28 AM

1 comments

Attempting to use Chrome's PNaCl

Google released its PNaCl project at Google I/O 2013 which allows users to write portable native client application code. Portable native client applications (PNaCl) can be translated to true native binary programs and executed on supported architectures (ARM and x86).

Currently, the official toolchain contains the Clang front end which can only be used to compile C/C++ source to PNaCl applications. But the PNaCl application is a subset of LLVM so I tried compiling some other languages with LLVM front ends to see if they would work. So far there hasn't been much progress but I'm still working on a solution and this post is just to document my attempts in hopes that others might learn from my experiences.

Toolchain Installation

The PNaCl toolchain can be downloaded via Google's Native Client SDK. Follow the instruction for downloading and installing. Here's a recap of the commands I used:

./naclsdk install pepper_canary
export NACL_SDK_ROOT=/where/you/put/the/pepper_canary
cd pepper_canary/examples/hello_world; TOOLCHAIN=pnacl CONFIG=Release make

Chrome also needed some additional setup which is also described in the SDK page. Notice that the #enable-nacl-debug flag must not be enabled, or else the PNaCl application will not be able to load (ref). Samples are also avaliable in the Chrome store, and the link can also be found on this page.

Golang and llgo

The first language I tried was Golang. There is a Golang frontend for LLVM called llgo, and I have tried it to generate LLVM bytecode from my Golang program.

go get is recommended for installation of the llgo package. The CGO_CFLAGS and CGO_LDFLAGS should be set before executing go get. I followed steps from the author's blog to install:

Install Go 1.1
Get the PNaCl SDK (contains Clang and LLVM)
Run go get github.com/axw/llgo/cmd/llgo-dist
Run llgo-dist to build and install the runtime

I tried this on OSX 10.7.5 but got a "No matching .ll file for asm_darwin_amd64.s" error message. I filed an issue on Github and learned from the response that I needed to set target flag '-triple' to x86_64-apple-macosx10.7.0 or i386-apple-macosx10.7.0. I then got another error while compiling the runtime with the llgo-dist command, which seems to be due to an upgrade of golang to 1.1. So I left llgo there and anxiously await new updates from the project.

Ruby

Another attempt is ruby. Since Ruby 2.0.0, NaCl support has been integrated which is concisely documented here. During compilation of the Ruby source I got an rdoc error, so I added --disable-install-rdoc to the configue command.

$ ./configure --prefix=/tmp/nacl-ruby --host=x86_64-nacl --with-baseruby=`which ruby` --disable-install-rdoc

Also I made a symbolic link of all directories in the /include SDK directory to Ruby for solving a 'header file not found' error.

$ ln -s /home/vagrant/source/nacl_sdk/pepper_canary/include/* ./

If running this command on any 64-bit OS, please make sure the 32-bit glibc library is installed, or some binaries may cause errors. I successfully compiled the native executable for Ruby, but for some reason which I'm still investigating, it can't be executed by the PNaCl binary.

Next steps

I will still try to write portable native client code with other languages. I just saw the author of llgo has made some changes in Github so I plan to pull his changes. Stay tuned for another post on my progress. Not giving up yet!

Archive

Tropical Software Observations