<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5627086854645065783</id><updated>2012-01-13T04:53:00.782-08:00</updated><category term='cassandra'/><category term='text similarity'/><category term='Lavenshtein distance'/><title type='text'>funny stuff</title><subtitle type='html'>let me see.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>15</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-3088579717515853457</id><published>2012-01-13T04:47:00.000-08:00</published><updated>2012-01-13T04:53:00.807-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cassandra'/><title type='text'>Cassandra distributed deletes</title><content type='html'>Cassandra~&lt;br /&gt;&lt;br /&gt;1. Cassandra replaces the value to be deleted with a special value called a tombstone in case of other nodes are down so that they can't receive the deletion request.&lt;br /&gt;&lt;br /&gt;2. delete the tombstone after GCGraceSeconds. So, a node is down for a longer than GCGraceSeconds can't get tombstone updates propagation. You need to remove the node.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;http://wiki.apache.org/cassandra/DistributedDeletes&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-3088579717515853457?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/3088579717515853457/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2012/01/cassandra-distributed-deletes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/3088579717515853457'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/3088579717515853457'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2012/01/cassandra-distributed-deletes.html' title='Cassandra distributed deletes'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-9116383311582145099</id><published>2010-12-16T02:45:00.001-08:00</published><updated>2010-12-16T03:11:15.301-08:00</updated><title type='text'>c++ exceptions and memory leaks</title><content type='html'>I had to come up with a way to deal with erratic and exceptional condition by C++ exception mechanism.&lt;br /&gt;First I refereed to Google C++ coding convention. They say it is not good to use C++ exception with legacy codes which don't use it.&lt;br /&gt;One of reasons is that it causes resource leaks. see the following code.&lt;br /&gt;Here, h() calls g(), and g() calls f(). The problems is when f() throws an exception, if g() doesn't prepare this situation by RAII or others, a memory is leaked in g().&lt;br /&gt;g() has to be very careful that it cleans up every resources it created in a stack unwinding.&lt;br /&gt;&lt;br /&gt;void f() throw(myexception)&lt;br /&gt;{&lt;br /&gt;cout &lt;&lt; "f() called" &lt;&lt;&gt;&lt;br /&gt;throw myex;&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;void g()&lt;br /&gt;{&lt;br /&gt;int* leak = new int[10];&lt;br /&gt;cout &lt;&lt; "g() called" &lt;&lt;&gt;&lt;br /&gt;f();&lt;br /&gt;delete[] leak;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;void h()&lt;br /&gt;{&lt;br /&gt;cout &lt;&lt; "h() called" &lt;&lt;&gt;&lt;br /&gt;try {&lt;br /&gt;g();&lt;br /&gt;}&lt;br /&gt;catch (exception&amp; e)&lt;br /&gt;{&lt;br /&gt;cout &lt;&lt;&gt;&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;valgrind --tool=memcheck --leak-check=yes ./a.out&lt;br /&gt;==11647==&lt;br /&gt;==11647== HEAP SUMMARY:&lt;br /&gt;==11647== in use at exit: 40 bytes in 1 blocks&lt;br /&gt;==11647== total heap usage: 2 allocs, 1 frees, 176 bytes allocated&lt;br /&gt;==11647==&lt;br /&gt;==11647== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1&lt;br /&gt;==11647== at 0x4C27939: operator new[](unsigned long) (vg_replace_malloc.c:305)&lt;br /&gt;==11647== by 0x400DAF: g() (in /home/nhn/workspace/study/c++/exception/a.out)&lt;br /&gt;==11647== by 0x400E15: h() (in /home/nhn/workspace/study/c++/exception/a.out)&lt;br /&gt;==11647== by 0x400E93: main (in /home/nhn/workspace/study/c++/exception/a.out)&lt;br /&gt;==11647==&lt;br /&gt;==11647== LEAK SUMMARY:&lt;br /&gt;==11647== definitely lost: 40 bytes in 1 blocks&lt;br /&gt;==11647== indirectly lost: 0 bytes in 0 blocks&lt;br /&gt;==11647== possibly lost: 0 bytes in 0 blocks&lt;br /&gt;==11647== still reachable: 0 bytes in 0 blocks&lt;br /&gt;==11647== suppressed: 0 bytes in 0 blocks&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;When you use RAII patterns, there will be no memory leak.&lt;br /&gt;&lt;br /&gt;class MyRAII&lt;br /&gt;{&lt;br /&gt;public:&lt;br /&gt;MyRAII()&lt;br /&gt;{&lt;br /&gt;cout &lt;&lt; "MyRAII() called" &lt;&lt;&gt; m_data = new int[10];&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;~MyRAII()&lt;br /&gt;{&lt;br /&gt;cout &lt;&lt; "~MyRAII() called" &lt;&lt;&gt;&lt;br /&gt;delete[] m_data;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;private:&lt;br /&gt;int* m_data;&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;void g()&lt;br /&gt;{&lt;br /&gt;int* leak = new int[10];&lt;br /&gt;MyRAII myRaii; // no leak&lt;br /&gt;&lt;br /&gt;cout &lt;&lt; "g() called" &lt;&lt; endl&lt;br /&gt;f();&lt;br /&gt;delete[] leak;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Exceptions#Exceptions"&gt;http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Exceptions#Exceptions&lt;/a&gt;&lt;br /&gt;&lt;a href="http://publib.boulder.ibm.com/infocenter/comphelp/v8v101/index.jsp?topic=%2Fcom.ibm.xlcpp8a.doc%2Flanguage%2Fref%2Fcplr155.htm"&gt;http://publib.boulder.ibm.com/infocenter/comphelp/v8v101/index.jsp?topic=%2Fcom.ibm.xlcpp8a.doc%2Flanguage%2Fref%2Fcplr155.htm&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-9116383311582145099?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/9116383311582145099/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2010/12/c-exceptions-and-memory-leaks.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/9116383311582145099'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/9116383311582145099'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2010/12/c-exceptions-and-memory-leaks.html' title='c++ exceptions and memory leaks'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-5336047551256516983</id><published>2010-09-01T21:55:00.000-07:00</published><updated>2010-09-01T22:03:22.183-07:00</updated><title type='text'>Hadoop record boundary</title><content type='html'>I am working on build a infrastructure for analysis of a vast amount of emails.&lt;br /&gt;I made a chunk of email archives, size of which is 2G each.&lt;br /&gt;&lt;br /&gt;The problem was that when I concatenated each email into a chunk, there had to be a sync mark to distinct the boundaries between each email.&lt;br /&gt;So I put sync marks between emails.&lt;br /&gt;&lt;br /&gt;However, when I put this files on the DFS, and run a mapred job, I am just worried about the record boundaries of overlapped record between input splits and have some curiosity about how HADOOP deals with it.&lt;br /&gt;&lt;br /&gt;I was able to find the correct answer from this link.&lt;br /&gt;&lt;a href="http://wiki.apache.org/hadoop/FAQ#A23"&gt;http://wiki.apache.org/hadoop/FAQ#A23&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-5336047551256516983?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/5336047551256516983/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2010/09/hadoop-record-boundary.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/5336047551256516983'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/5336047551256516983'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2010/09/hadoop-record-boundary.html' title='Hadoop record boundary'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-1200244790214528847</id><published>2010-04-09T03:26:00.000-07:00</published><updated>2010-04-09T03:46:17.844-07:00</updated><title type='text'>nginx php-fpm TIME_WAIT</title><content type='html'>When you use nginx as a fastcgi web server and php-fpm for workers, you might come arocss TCP TIME_WAIT problem.&lt;br /&gt;TIME_WAIT is a state of socket waiting for a certain amount of time to close.&lt;br /&gt;&lt;br /&gt;Because of the very frequent communications between nginx and php-fpm, you will find sockets of TIME_WAIT accumulating a lot over 30,000. It looks uncomfortable.&lt;br /&gt;&lt;br /&gt;The solution is to use a unix domain socket.&lt;br /&gt;&lt;br /&gt;1. php-fpm.con&lt;br /&gt;&lt;value name="listen_address"&gt;/tmp/fastcgi.sock&lt;/value&gt;&lt;br /&gt;&lt;br /&gt;2. nginx.conf&lt;br /&gt;fastcgi_pass  unix:/tmp/fastcgi.sock;&lt;br /&gt;&lt;br /&gt;you will find a great reduction in the number of sockets of TIME_WAIT.&lt;br /&gt;However, from my test the performance is worse than TCP socket.&lt;br /&gt;I don't know why...&lt;br /&gt;&lt;br /&gt;Anyway, the result from my benchmark test, nginx was able to process 4~5 times more requests(10,000 units per sec) as Apache did. it's awesome.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-1200244790214528847?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/1200244790214528847/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2010/04/nginx-php-fpm-timewait.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/1200244790214528847'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/1200244790214528847'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2010/04/nginx-php-fpm-timewait.html' title='nginx php-fpm TIME_WAIT'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-8922173356577558712</id><published>2010-04-01T02:06:00.000-07:00</published><updated>2010-04-01T02:29:22.186-07:00</updated><title type='text'>apache mod_rewrite with percent signs</title><content type='html'>This article shows a little bit how to use mod_rewrite.&lt;br /&gt;&lt;br /&gt;The scenario is&lt;br /&gt;1. A client connects to the web server with a '%'-signed URL.&lt;br /&gt;For example, when you use MBSC(e.g. Korean), the URL turns into other form.&lt;br /&gt;http://test.com/테스트 =&gt; http://test.com/%C5%D7%BD%BA%C6%AE&lt;br /&gt;&lt;br /&gt;The Solution is the followings. But I don't know the internals.&lt;br /&gt;&lt;br /&gt;httpd.conf&lt;br /&gt;RewriteEngine on&lt;br /&gt;RewriteRule /테스트   http://test.com [R]&lt;br /&gt;RewriteLog "rewrite.log"&lt;br /&gt;RewriteLogLevel 10&lt;br /&gt;&lt;br /&gt;rewrite.log&lt;br /&gt;localhost - - [01/Apr/2010:14:38:07 +0900] [test.com/sid#94f2490][rid#95481a0/initial] (2) init rewrite engine with requested uri /테스트&lt;br /&gt;localhost - - [01/Apr/2010:14:38:07 +0900] [test.com/sid#94f2490][rid#95481a0/initial] (3) applying pattern '/%C5%D7%BD%BA%C6%AE' to uri '/테스트'&lt;br /&gt;localhost - - [01/Apr/2010:14:38:07 +0900] [test.com/sid#94f2490][rid#95481a0/initial] (1) pass through /테스트&lt;br /&gt;localhost - - [01/Apr/2010:14:39:46 +0900] [test.com/sid#8785490][rid#87db118/initial] (2) init rewrite engine with requested uri /테스트&lt;br /&gt;localhost - - [01/Apr/2010:14:39:46 +0900] [test.com/sid#8785490][rid#87db118/initial] (3) applying pattern '/테스트' to uri '/테스트'&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-8922173356577558712?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/8922173356577558712/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2010/04/apache-modrewrite-with-percent-signs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/8922173356577558712'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/8922173356577558712'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2010/04/apache-modrewrite-with-percent-signs.html' title='apache mod_rewrite with percent signs'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-8758696688267644992</id><published>2010-01-12T18:22:00.000-08:00</published><updated>2010-01-12T18:37:39.698-08:00</updated><title type='text'>C++ new operator and object cloning.</title><content type='html'>STL provides a few operator overloadings for the global new operator, which is used by new keyword to allocate a memory block to make a instance. New operators only provide a service for allocation of memory.&lt;br /&gt;&lt;br /&gt;These come with &lt;new&gt; header file &lt;span style="font-style: italic;"&gt;new&lt;/span&gt;&lt;new&gt;.&lt;br /&gt;When you declare it, it will apply the global namespace.&lt;br /&gt;&lt;br /&gt;http://www.cplusplus.com/reference/std/new/&lt;br /&gt;&lt;br /&gt;You can allocate a memory block for a uninitialized object(, which means its constructor has yet been called) and initialize with its constructor later.&lt;br /&gt;&lt;br /&gt;For instance, using this feature, you can do a cloning for a object.&lt;br /&gt;&lt;br /&gt;--------------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;#include iostream&lt;iostream&gt;&lt;br /&gt;#include stdlib&lt;stdlib.h&gt;&lt;br /&gt;#include new&lt;new&gt;&lt;br /&gt;using namespace std;&lt;br /&gt;&lt;br /&gt;struct myclass {&lt;br /&gt;    myclass(int i):m_i(i) {cout &lt;&lt; "myclass constructed\n";}&lt;br /&gt;    myclass(myclass &amp;amp;src) {&lt;br /&gt;        this-&gt;m_i = src.m_i;&lt;br /&gt;    }  &lt;br /&gt;&lt;br /&gt;    void print() { cout &lt;&lt; "my member is " &lt;&lt;&gt;m_i &lt;&lt; endl; }&lt;br /&gt;&lt;br /&gt;    int m_i;&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;int main() {&lt;br /&gt;    myclass tmp(1);&lt;br /&gt;&lt;br /&gt;    myclass *p3 = (myclass *)malloc(sizeof(myclass));&lt;br /&gt;&lt;br /&gt;    new(p3) myclass(tmp); // call constructor&lt;br /&gt;    //operator new (sizeof(myclass), p3);&lt;br /&gt;&lt;br /&gt;    p3-&gt;print();&lt;br /&gt;&lt;br /&gt;    return 0;&lt;br /&gt;}&lt;br /&gt;&lt;/new&gt;&lt;/stdlib.h&gt;&lt;/iostream&gt;&lt;/new&gt;&lt;/new&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-8758696688267644992?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/8758696688267644992/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2010/01/c-new-operator-and-object-cloning.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/8758696688267644992'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/8758696688267644992'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2010/01/c-new-operator-and-object-cloning.html' title='C++ new operator and object cloning.'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-551284080871454857</id><published>2010-01-07T02:45:00.000-08:00</published><updated>2010-01-07T03:01:37.533-08:00</updated><title type='text'>RETE algorithm.</title><content type='html'>I'm woking on developing a rule-based inference engine nowdays.&lt;br /&gt;I came across an algorithm which can do really fast pattern matching of facts against rules, called RETE.&lt;br /&gt;&lt;br /&gt;When you have facts which are augmented into the existing ones, or facts are continuosly chaning, RETE is the solution.&lt;br /&gt;&lt;br /&gt;The key idea of RETE is cache memories of nodes. When a node has caclulated a input once, it saves the result of it into its memeory. And when additional facts are known, only compute on newly added facts and join with the existing results in the memories.&lt;br /&gt;Also separation between alpha network and beta network is an amazing idea.&lt;br /&gt;&lt;br /&gt;There are a few implemention of this algorithm. Jess in java, Clips in C. Both are open source.&lt;br /&gt;&lt;br /&gt;Check the following link out.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Rete_algorithm"&gt;http://en.wikipedia.org/wiki/Rete_algorithm&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-551284080871454857?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/551284080871454857/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2010/01/rete-algorithm.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/551284080871454857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/551284080871454857'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2010/01/rete-algorithm.html' title='RETE algorithm.'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-6099047959677323227</id><published>2009-12-16T18:41:00.000-08:00</published><updated>2009-12-16T19:05:38.617-08:00</updated><title type='text'>Pig versus Hive</title><content type='html'>I'm recently working on detecting traffic trends to achieve my goal, which is classified.&lt;br /&gt;&lt;br /&gt;After a few days of contemplation, I've decided to adopt Hive like Twitter did.&lt;br /&gt;&lt;br /&gt;The reasons are..&lt;br /&gt;1. use Python to calculate complex statistics operation by Hadoop streaming.&lt;br /&gt;2. There are such many simple operations to run. use Hive QL&lt;br /&gt;3. Build dataware house.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-6099047959677323227?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/6099047959677323227/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/12/pig-versus-hive.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/6099047959677323227'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/6099047959677323227'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/12/pig-versus-hive.html' title='Pig versus Hive'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-4267592641098944216</id><published>2009-11-26T00:22:00.001-08:00</published><updated>2009-11-26T00:28:04.694-08:00</updated><title type='text'>hex to char</title><content type='html'>The blow code snipet converts hex value into charater in ISO 8859-1.&lt;br /&gt;&lt;br /&gt;For instance, we have letter 'B', the ascii value of which is 0x42.&lt;br /&gt;&lt;br /&gt;what[0] = '4'&lt;br /&gt;what[1] = '2'&lt;br /&gt;&lt;br /&gt;Since what[0] is less than 'A', substract '0' from it, and we get 4. Next, multiply 4 by 16 resulting 64. Finally we add 2 to 64. we get 66. 66 is the ascii value of 'B'.&lt;br /&gt;&lt;br /&gt;If what[0] is bigger than 'A', we can find the result charater in ISO 8859-1 charset.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;char _x2c(char *what)&lt;br /&gt;{&lt;br /&gt;    char digit;&lt;br /&gt;    digit = ((what[0] &gt;= 'A')&lt;br /&gt;            ? ((what[0] &amp;amp; 0xdf) - 'A') + 10&lt;br /&gt;            : (what [0] - '0'));&lt;br /&gt;    digit *= 16;&lt;br /&gt;    digit += ((what[1] &gt;= 'A')&lt;br /&gt;            ? ((what[1] &amp;amp; 0xdf) - 'A') + 10&lt;br /&gt;            : (what [1] - '0'));&lt;br /&gt;    return(digit);&lt;br /&gt;}&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-4267592641098944216?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/4267592641098944216/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/11/hex-to-char.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/4267592641098944216'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/4267592641098944216'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/11/hex-to-char.html' title='hex to char'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-996418713954862027</id><published>2009-09-17T02:43:00.000-07:00</published><updated>2009-09-17T05:23:10.404-07:00</updated><title type='text'>python generator.</title><content type='html'>Since I am quite a newbie to Python, I don't know much about sophisticate techniques of it. However, as I dig out Python, I find out something interesting topics.&lt;br /&gt;&lt;br /&gt;Python generator came out in the early of 2000. It's somewhat old concept, but for me this one is quite amazing.&lt;br /&gt;&lt;br /&gt;OOP pursues the idea that puts states and behaviors into An isolated and separated object. Closure in functional programming follow the same idea. It is a function which has free variables bound to lexical scope of the function containing them. In other words, closure is another case of encapsulation like OOP.&lt;br /&gt;&lt;br /&gt;Generator in Python goes even further than closure. This doesn't only encapsulates states and behaviors, also control flows. Because of this, there are so many application using this one to solve problems which people find difficult to come up with a neat solution in other languages.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://docs.python.org/tutorial/classes.html#generators"&gt;http://docs.python.org/tutorial/classes.html#generators&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.ibm.com/developerworks/linux/library/l-pycon.html"&gt;http://www.ibm.com/developerworks/linux/library/l-pycon.html&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-996418713954862027?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/996418713954862027/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/09/python-generator.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/996418713954862027'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/996418713954862027'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/09/python-generator.html' title='python generator.'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-2112154326323116601</id><published>2009-09-01T09:57:00.001-07:00</published><updated>2009-09-01T10:10:18.955-07:00</updated><title type='text'>Optimization proposal for Hadoop I/O pipelines</title><content type='html'>There are a movement to improve Hadoop's I/O performance. The main approach is reducing the number of byte buffer copy operations among the kernel, native and jvm. According to this article, it is possible to reduce 7-time buffer copys to 3-time in a input operation and 6-time to 4-time buffer copys in a output operation. There is sure to be an additional save by reducing switching cost between kernel mode and user modes.&lt;br /&gt;&lt;a href="http://developer.yahoo.net/blogs/hadoop/2009/08/the_anatomy_of_hadoop_io_pipel.html"&gt;http://developer.yahoo.net/blogs/hadoop/2009/08/the_anatomy_of_hadoop_io_pipel.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;When I saw a presentation "&lt;a href="http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime"&gt;HBase goes realtime&lt;/a&gt;", the main point of which was the great achievement in performance , I was curious of what "Unjavafy everything" meant. I guess this sentence meant what the above proposal says. I am going to figure this out soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-2112154326323116601?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/2112154326323116601/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/09/optimization-proposal-for-hadoop-io.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/2112154326323116601'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/2112154326323116601'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/09/optimization-proposal-for-hadoop-io.html' title='Optimization proposal for Hadoop I/O pipelines'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-5284820029902943036</id><published>2009-09-01T06:27:00.000-07:00</published><updated>2009-09-01T09:56:59.661-07:00</updated><title type='text'>Sector/Sphere</title><content type='html'>Sector/Sphere is another materialization of google's DFS and MapReduce framework. Sector and Sphere correspond DFS and MapReduce framework, respectively.&lt;br /&gt;&lt;br /&gt;One thing interesting is that Sector/Sphere use UDT which is alternative transportation protocol based on UDP. UDT uses UDP to transfer bulk data with its own   reliability control and congestion control mechanisms. It's called faster than TCP. I heard that Sector/Sphere is twice faster than Hadoop. I expect Hadoop will take UDT some day.&lt;br /&gt;&lt;br /&gt;There are other different things compared to Hadoop. Even though Sphere looks after MapReduce, you can use your own function, not only map or reduce. In practice, when we analyse log data or do some calculation, the map and reduce chain is convenient, but sometimes become limitation to some extent. Sphere can alleviate this limitation.&lt;br /&gt;&lt;br /&gt;check the following links out.&lt;br /&gt;&lt;a href="http://sector.sourceforge.net/index.html"&gt;http://sector.sourceforge.net/index.htm&lt;/a&gt;&lt;br /&gt;&lt;a href="http://developer.yahoo.net/blogs/hadoop/2009/08/the_anatomy_of_hadoop_io_pipel.html"&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-5284820029902943036?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/5284820029902943036/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/09/sectorsphere.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/5284820029902943036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/5284820029902943036'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/09/sectorsphere.html' title='Sector/Sphere'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-5318720705119064958</id><published>2009-08-31T22:11:00.000-07:00</published><updated>2009-08-31T23:47:43.901-07:00</updated><title type='text'>linux utility. `dislocate`</title><content type='html'>It's like screen utility when you start a long-term batch job and leave your office, but you still want to continue the job even after the terminal is closed. Next morning you can check the result of the job.&lt;br /&gt;&lt;br /&gt;`dislocate` has the ability you want to do. The program never die unless the system halts.&lt;br /&gt;&lt;br /&gt;The only thing you need to do is just&lt;br /&gt;$ dislocate `your command`&lt;br /&gt;That's it.&lt;br /&gt;&lt;br /&gt;check this out. You'll find funny to read this article.&lt;br /&gt;&lt;a href="http://wiki.tcl.tk/9263"&gt;http://wiki.tcl.tk/9263&lt;/a&gt;&lt;br /&gt;&lt;a href="http://expect.nist.gov/example/dislocate.man.html"&gt;http://expect.nist.gov/example/dislocate.man.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;But, I can't use ^D and ^Z command. I'm trying to figure out why.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-5318720705119064958?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/5318720705119064958/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/08/linux-utility-dislocate.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/5318720705119064958'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/5318720705119064958'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/08/linux-utility-dislocate.html' title='linux utility. `dislocate`'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-8431315917816663820</id><published>2009-07-30T18:33:00.000-07:00</published><updated>2009-08-03T08:23:44.101-07:00</updated><title type='text'>Hadoop and Stress testing</title><content type='html'>&lt;span style="color: rgb(51, 255, 51);"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;albeit hadoop is originally intended for data processing, I think there is another example of using it.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 255, 51);"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;br /&gt;MapReduce is a sort of algorithm strategy to solve a problem with a vast amount of input. It adopts divide and conquer which slice a large problem example into many small examples so that we can alleviate high time complexity.&lt;br /&gt;&lt;br /&gt;When we do a stress test, for example web server, we need to prepare a lot of computing power to make a huge stress. This means we want to have many concurrent tasks that make requests at the same time to the server.  To do this, we need t&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 255, 51);"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;o split large input into slices for each task and run them parallel. This is exactly what MapReduce framework does.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_JYWLK46vDAg/SnJS9mLjPwI/AAAAAAAAAAU/OvqNaaUFYyE/s1600-h/test2.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 183px;" src="http://2.bp.blogspot.com/_JYWLK46vDAg/SnJS9mLjPwI/AAAAAAAAAAU/OvqNaaUFYyE/s320/test2.gif" alt="" id="BLOGGER_PHOTO_ID_5364441324326174466" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 255, 51);"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;br /&gt;The recipe is..&lt;br /&gt;1. prepare input data in key, value format. If the source data is raw, we can run a mapreduce task to get normalized input data.&lt;br /&gt;2. write codes for mapper and reducer. Reducers are going to make request to servers because we can specify the number of reducers in order that we control the degree of stress. we can't do this for mappers. Reducers should emit test results as key, value format. If test failes due to exception occurs, increase the customized error count. We can use various counts to keep track of the testing process. In addition, we can see what's going on with them by web pages of the job tracker.&lt;br /&gt;3. submit the job. You can specify the number of reducers depending on how much stress you want to generate. However, due to physical limit of slaves' hardware there is certain point where no more reducers proceed concurrently. This is the maximum number of reducers making stress together. To make more, you need additional computing power.&lt;br /&gt;4. run a mapreduce job to get statistical summary of the testing from result files.&lt;br /&gt;&lt;br /&gt;I apply this method to test my server with 270 reducers on 9 nodes. I was able to use hundreds tasks since testing attempts are not CPU bound job, but IO(network) bound job. This works well even though there are some problems to figure out.&lt;br /&gt;&lt;br /&gt;Now, I start to think I can make a stress testing framework on Hadoop. Hadoop has almost everything to do a stress test.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-8431315917816663820?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/8431315917816663820/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/07/hadoop-to-do-stress-test.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/8431315917816663820'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/8431315917816663820'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/07/hadoop-to-do-stress-test.html' title='Hadoop and Stress testing'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_JYWLK46vDAg/SnJS9mLjPwI/AAAAAAAAAAU/OvqNaaUFYyE/s72-c/test2.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5627086854645065783.post-4577886294856229449</id><published>2009-05-14T04:00:00.000-07:00</published><updated>2009-07-30T19:37:44.193-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text similarity'/><category scheme='http://www.blogger.com/atom/ns#' term='Lavenshtein distance'/><title type='text'>Levenshtein distance</title><content type='html'>Lavenshtein distance shows a degree of how different two texts are. For example there are two texts, text1, text2.&lt;br /&gt;&lt;br /&gt;text1 = kitten&lt;br /&gt;text2 = sitting&lt;br /&gt;&lt;br /&gt;Lavenshtein distance is 3 becuase it take 3 actions to turn text1 into text2.&lt;br /&gt;1. substitute k with s resulting sitten&lt;br /&gt;2. substitute e with i resulting sittin&lt;br /&gt;3. add g at the end resulting sitting.&lt;br /&gt;&lt;br /&gt;To compute Lavenshtein distance, we use dynamic programming skill.&lt;br /&gt;&lt;br /&gt;This psuedo code is from wikipedia. Caculating the matrix to decide an action is the most important part.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;If the "upper cell + 1" is the smallest one, it means it needs a insertion to mimimize the cost.&lt;br /&gt;If the "left cell + 1"is the smallest one, it means it needs a deletion to minimize the cost.&lt;br /&gt;If the "upper-left + cost" is the smallest one, the action depends on cost. If cost is zero, there is no action to take becuase the letters of i, j are match. If cost is one, a substitution takes place.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;refer to the followings.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If we can transform &lt;code&gt;s[1..i]&lt;/code&gt; to &lt;code&gt;t[1..j-1]&lt;/code&gt; in &lt;code&gt;k&lt;/code&gt; operations, then we can simply add &lt;code&gt;t[j]&lt;/code&gt; afterwards to get &lt;code&gt;t[1..j]&lt;/code&gt; in &lt;code&gt;k+1&lt;/code&gt; operations.&lt;/li&gt;&lt;li&gt;If we can transform &lt;code&gt;s[1..i-1]&lt;/code&gt; to &lt;code&gt;t[1..j]&lt;/code&gt; in &lt;code&gt;k&lt;/code&gt; operations, then we can do the same operations on &lt;code&gt;s[1..i]&lt;/code&gt; and then remove the original &lt;code&gt;s[i]&lt;/code&gt; at the end in &lt;code&gt;k+1&lt;/code&gt; operations.&lt;/li&gt;&lt;li&gt;If we can transform &lt;code&gt;s[1..i-1]&lt;/code&gt; to &lt;code&gt;t[1..j-1]&lt;/code&gt; in &lt;code&gt;k&lt;/code&gt; operations, we can do the same to &lt;code&gt;s[1..i]&lt;/code&gt; and then do a substitution of &lt;code&gt;t[j]&lt;/code&gt; for the original &lt;code&gt;s[i]&lt;/code&gt; at the end if necessary, requiring &lt;code&gt;k+cost&lt;/code&gt; operations.&lt;/li&gt;&lt;li&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;pre&gt;&lt;b&gt;int&lt;/b&gt; LevenshteinDistance(&lt;b&gt;char&lt;/b&gt; s[1..m], &lt;b&gt;char&lt;/b&gt; t[1..n])&lt;br /&gt;&lt;i&gt;// d is a table with m+1 rows and n+1 columns&lt;/i&gt;&lt;br /&gt;&lt;b&gt;declare&lt;/b&gt; &lt;b&gt;int&lt;/b&gt; d[0..m, 0..n]&lt;br /&gt;&lt;br /&gt;&lt;b&gt;for&lt;/b&gt; i &lt;b&gt;from&lt;/b&gt; 0 &lt;b&gt;to&lt;/b&gt; m&lt;br /&gt;d[i, 0] := i&lt;br /&gt;&lt;b&gt;for&lt;/b&gt; j &lt;b&gt;from&lt;/b&gt; 0 &lt;b&gt;to&lt;/b&gt; n&lt;br /&gt;d[0, j] := j&lt;br /&gt;&lt;br /&gt;&lt;b&gt;for&lt;/b&gt; i &lt;b&gt;from&lt;/b&gt; 1 &lt;b&gt;to&lt;/b&gt; n&lt;br /&gt;&lt;b&gt;for&lt;/b&gt; j &lt;b&gt;from&lt;/b&gt; 1 &lt;b&gt;to&lt;/b&gt; m &lt;b&gt;{&lt;/b&gt;&lt;br /&gt;&lt;b&gt;if&lt;/b&gt; s[i] = t[j] &lt;b&gt;then&lt;/b&gt;&lt;br /&gt;cost := 0&lt;br /&gt;&lt;b&gt;else&lt;/b&gt;&lt;br /&gt;cost := 1&lt;br /&gt;d[i, j] := minimum(&lt;br /&gt;       d[i-1, j] + 1,     &lt;i&gt;// insertion&lt;/i&gt;&lt;br /&gt;       d[i, j-1] + 1,     &lt;i&gt;// deletion&lt;/i&gt;&lt;br /&gt;       d[i-1, j-1] + cost   &lt;i&gt;// substitution&lt;/i&gt;&lt;br /&gt;    )&lt;br /&gt;&lt;b&gt;}&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;return&lt;/b&gt; d[m, n]&lt;/pre&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Levenshtein_distance"&gt;http://en.wikipedia.org/wiki/Levenshtein_distance&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5627086854645065783-4577886294856229449?l=tuna4u.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://tuna4u.blogspot.com/feeds/4577886294856229449/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://tuna4u.blogspot.com/2009/05/levenshtein-distance.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/4577886294856229449'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5627086854645065783/posts/default/4577886294856229449'/><link rel='alternate' type='text/html' href='http://tuna4u.blogspot.com/2009/05/levenshtein-distance.html' title='Levenshtein distance'/><author><name>jaehong choi</name><uri>http://www.blogger.com/profile/10015382873553351073</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
