Content tagged python

Intro

Writing this blog became increasingly tedious over time. The reason was the slowness of the rendering tool I use, coleslaw. It seemed to work well for other people, though, so I decided to investigate what I was doing wrong. The problem came from the fact that the code coloring implementation (which I co-wrote) spawned a Python process every time it received a code block to handle. The coloring itself was fast; starting and stopping Python every time was what caused the slowdown. The solution is fairly simple: keep the Python process running at all times and communicate with it via standard IO.

Surprisingly enough, I could not find an easy and convenient way to do it. The dominant paradigm of uiop:run-program seems to be spawn-process-close, and it does not allow for easy access to the actual streams. sb-ext:run-program does hand me the stream objects that I need, but it's not portable. While reading the code of uiop trying to figure out how to extract the stream objects from run-program, I accidentally discovered uiop:launch-program which does exactly what I need in a portable manner. It was implemented in asdf-3.1.7.39 released on Dec 1st, 2016 (a month and a half ago!). This post is meant as a piece of documentation that can be indexed by search engines to help spread my happy discovery. :)

Python

The Python code reads commands from standard input and writes the responses to standard output. Both commands and response headers are terminated by a newline and may be followed by a payload.

The commands are:

  • exit - what it does is self-evident
  • pygmentize|len|lang[|opts]:
    • len is the length of the code snippet
    • lang is the language to colorize
    • optional parameter opts is the configuration of the HTML formatter
    • after the newline, len utf-8 characters of the code block need to follow

There's only one response: colorized|len, followed by a newline and len utf-8 characters of the colorized code as an HTML snippet.
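
To make the protocol concrete, here is a minimal sketch of the command loop. It is not the actual script from the pull request: the names parse_command, serve and fake_colorize are made up for illustration, and fake_colorize stands in for the real Pygments call so that the example needs nothing beyond the standard library.

```python
def parse_command(header):
    """Split a header line such as 'pygmentize|12|python|linenos=1'."""
    parts = header.rstrip("\n").split("|", 3)
    name = parts[0]
    if name == "exit":
        return ("exit", 0, None, None)
    if name != "pygmentize" or len(parts) < 3:
        raise ValueError("malformed command: %r" % header)
    length = int(parts[1])
    lang = parts[2]
    opts = parts[3] if len(parts) > 3 else None
    return (name, length, lang, opts)


def serve(inp, out, colorize):
    """Answer commands read from `inp` until 'exit' or end of input."""
    while True:
        header = inp.readline()
        if not header:
            break  # the other side closed the pipe
        name, length, lang, opts = parse_command(header)
        if name == "exit":
            break
        code = inp.read(length)  # text streams count characters, not bytes
        html = colorize(code, lang, opts)
        out.write("colorized|%d\n" % len(html))
        out.write(html)
        out.flush()  # otherwise the reply may sit in the buffer


def fake_colorize(code, lang, opts):
    # Hypothetical stand-in for the real Pygments call.
    return '<pre class="%s">%s</pre>' % (lang, code)
```

In a real script the two streams would be the UTF-8 wrappers around standard input and output rather than in-memory buffers, and colorize would call into Pygments.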

Python's automatic inference of standard IO's encoding is still pretty messed up, even in Python 3. It's a good idea to create wrapper objects and interact only with them:

import io
import sys
input  = io.TextIOWrapper(sys.stdin.buffer,  encoding='utf-8')
output = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Printing diagnostic messages to standard error output is useful for debugging:

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

Lisp

OK, I have a Python script that does the coloring. Before I can use it, I need to tell ASDF about it and find out where it lives in the filesystem. The former is done by using the :static-file qualifier in the :components list. The latter is a bit more complicated, but since the file's location is known relative to the Lisp file it will be used with, it's doable.

(defvar *pygmentize-path*
  (merge-pathnames "pygmentize.py"
                   #.(or *compile-file-truename* *load-truename*))
  "Path to the pygmentize script")

The trick here is to use #. to execute the statement at read-time. You can see the full explanation here.

With that out of the way, I can start the renderer with:

(defmethod start-concrete-renderer ((renderer (eql :pygments)))
  (setf *pygmentize-process* (uiop:launch-program
                              (list *python-command*
                                    (namestring *pygmentize-path*))
                              :input :stream
                              :output :stream)))

For debugging purposes, it's useful to add :error-output "/tmp/debug", so that the diagnostics do not get eaten up by /dev/null.

To stop the process, we send it the exit command, flush the stream, and wait until the process dies:

(defmethod stop-concrete-renderer ((renderer (eql :pygments)))
  (write-line "exit" (process-info-input *pygmentize-process*))
  (force-output (process-info-input *pygmentize-process*))
  (wait-process *pygmentize-process*))

The Lisp part of the colorizer sends the pygmentize command together with the code snippet to Python and receives the colorized HTML:

(defun pygmentize-code (lang params code)
  (let ((proc-input (process-info-input *pygmentize-process*))
        (proc-output (process-info-output *pygmentize-process*)))
    (write-line (format nil "pygmentize|~a|~a~@[|~a~]"
                        (length code) lang params)
                proc-input)
    (write-string code proc-input)
    (force-output proc-input)
    (let ((nchars (parse-integer
                   (nth 1
                        (split-sequence #\| (read-line proc-output))))))
      (coerce (loop repeat nchars
                 for x = (read-char proc-output)
                 collect x)
              'string))))
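
For readers more at home in Python than in Lisp, the same client-side exchange can be sketched as follows. This is only an illustration of the protocol, not code from the pull request: start_renderer, pygmentize and stop_renderer are hypothetical names, and the command passed to start_renderer is whatever launches the helper on your machine.

```python
import subprocess


def start_renderer(command):
    """Launch the long-lived helper with both pipes in UTF-8 text mode."""
    return subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, encoding="utf-8")


def pygmentize(to_proc, from_proc, lang, code, opts=None):
    """Send one snippet and return the HTML that comes back."""
    header = "pygmentize|%d|%s" % (len(code), lang)
    if opts:
        header += "|" + opts
    to_proc.write(header + "\n")
    to_proc.write(code)
    to_proc.flush()  # the helper may see nothing until the buffer is flushed
    reply = from_proc.readline().rstrip("\n").split("|")
    if reply[0] != "colorized":
        raise RuntimeError("unexpected reply: %r" % reply)
    return from_proc.read(int(reply[1]))


def stop_renderer(proc):
    """Ask the helper to exit and wait for it, as the Lisp code does."""
    proc.stdin.write("exit\n")
    proc.stdin.flush()
    return proc.wait()
```

With a process object returned by start_renderer, the streams to pass to pygmentize would be proc.stdin and proc.stdout. As in the Lisp version, the flush after writing the snippet is the step that is easiest to forget.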

See the entire pull request here.

Stats

I was able to bring the time it takes to generate this blog down from well over a minute to less than three seconds.

]==> time ./coleslaw-old.x /path/to/blog/
./coleslaw-old.x /path/to/blog/  66.40s user 6.19s system 98% cpu 1:13.55 total
]==> time ./coleslaw-new-no-renderer.x /path/to/blog/
./coleslaw-new-no-renderer.x /path/to/blog/  65.50s user 6.03s system 98% cpu 1:12.53 total
]==> time ./coleslaw-new-renderer.x /path/to/blog/
./coleslaw-new-renderer.x /path/to/blog/  2.78s user 0.27s system 106% cpu 2.849 total
  • coleslaw-old.x is the original code
  • coleslaw-new-no-renderer.x starts and stops the renderer with every code snippet
  • coleslaw-new-renderer.x starts the renderer beforehand and stops it after all the work is done

Video Link

Pretty interesting talk on how to prevent squirrels from stealing bird food using Python and computer vision.

Steps to recognize a squirrel on a picture:

  • Subtract background.
    • Compute average value of each pixel over time to build a background profile.
  • Detect blobs.
  • Discriminate blobs, i.e. distinguish squirrels from other things. The author used support vector machines, based on:
    • Blob size
    • Color histograms
    • Entropy detection (squirrel tail)
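
The first two steps can be sketched in a few lines of plain Python. This is a toy illustration, not the code from the talk: frames are assumed to be grayscale images stored as lists of rows, the threshold of 30 is an arbitrary choice, and the function names are made up. The classification step (the support vector machine) is left out; the size of each blob is the only feature this sketch would give you.

```python
def background_profile(frames):
    """Average each pixel over all frames (grayscale, lists of rows)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / len(frames)
             for x in range(width)] for y in range(height)]


def foreground_mask(frame, background, threshold=30):
    """Mark pixels that differ from the background by more than threshold."""
    return [[abs(p - b) > threshold for p, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]


def find_blobs(mask):
    """Group 4-connected foreground pixels; return a list of pixel lists."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            stack, blob = [(y, x)], []
            seen[y][x] = True
            while stack:  # iterative flood fill
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            blobs.append(blob)
    return blobs
```

A real implementation would more likely rely on an image processing library for all of this; the point here is only to show how little is conceptually involved in the background subtraction and blob detection steps.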

Other interesting stuff mentioned:

Video Link

Way too long for the amount of useful content presented, but still quite OK.

Highlights:

Why, oh why?

GMail is pretty great, but I don't think I need to tell you that. I started to use it because of its killer spam filter and search capabilities, then, as new features were introduced, I liked it more and more. The fact that it peeked into the emails I send and receive didn't really bother me that much until recently. My view on all this changed after the new privacy policy was introduced, saying that they will share this sort of private info between all of Google's services. Many people may be fine with it, but I am not. After reading a couple of blog posts (like this one, and this one), and seeing some lame Microsoft commercials (youtube), Google reading my emails started to creep me out. And it didn't really matter how much they could possibly improve my world by doing this (fixing the search).

How?

Using IMAP, of course. There are a couple of articles on the web describing how to do that, but they all have one problem: they tell you to copy your emails from the "All Mail" folder in the gmail account, which means that you will lose all your labels. That's probably not acceptable for someone with thousands of conversation threads, all placed in the boxes they belong to. I have written a couple of Python scripts that should help solve this problem. You can get them from github. Remember: you use them at your own risk. Read on if you want to know what they do.

Problem with labels

There are some issues with labels when accessing gmail through IMAP. Since the labels are mapped to IMAP folders, you might at first expect that, when you open a folder and browse it, you will see all the emails in all the conversations bearing the label. This is not true. GMail operates and labels things at the level of conversations; IMAP, however, operates at the level of messages, so you will only see the messages that were part of the conversation when you assigned the label, but not the newer ones. This makes some sense if you use gmail with a normal IMAP client, i.e. you don't see the same new messages arriving in many folders, and you avoid having the archived messages back in the IMAP inbox when somebody responds to a message in an archived conversation.

This is a real problem when you'd like to move your nicely sorted conversations out of gmail to some other service. There is a suggestion at the Google Product Forums on how to deal with it: you should go to the web browser, open the conversation, remove the label, and then re-apply it. That's not really a viable solution if you have more than a couple of conversations. Fortunately, Google provides an API that enables us to do that auto-magically.

This is what gmail_label_remap.py does: it finds every conversation and re-applies the labels. You should see something like this when you run it from your terminal:

]==> ./gmail_label_remap.py
username (won't be echoed):
password (won't be echoed):
[i] Identify the IMAP folder name for "All Mail"...
[i] Done: [Gmail]/Tous les messages
[i] Opening [Gmail]/Tous les messages
[i] [Gmail]/Tous les messages contains 15299 messages
[i] Identify all the threads (may take some time)
[i] Found 4182 threads
[i] Processing thread 4182 of 4182
[i] Orphaned threads:   414
[i] Sent only threads:  255
[i] ALL DONE

It will create two extra labels:

  • orphaned - for the conversations without labels
  • sent_only - for the emails that you have sent but never got an answer to, so they have no label assigned
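
The core idea, stripped of all the IMAP plumbing, can be sketched like this. It is a guess at the logic rather than an excerpt from gmail_label_remap.py: the message dictionaries with their 'thread', 'labels' and 'sent' keys are hypothetical, and so is the exact rule that separates orphaned from sent_only threads.

```python
from collections import defaultdict


def remap_labels(messages):
    """Return {thread_id: labels} so every message in a thread can be
    given the labels carried by any of its siblings.

    Each message is a dict with hypothetical keys 'thread' (an id),
    'labels' (a set of label names) and 'sent' (True if sent by the owner).
    """
    threads = defaultdict(list)
    for msg in messages:
        threads[msg["thread"]].append(msg)

    result = {}
    for thread_id, msgs in threads.items():
        labels = set()
        for msg in msgs:
            labels |= msg["labels"]
        if not labels:
            # Assumed rule: an unlabeled thread is 'sent_only' when the
            # owner wrote every message in it, and 'orphaned' otherwise.
            if all(msg["sent"] for msg in msgs):
                labels = {"sent_only"}
            else:
                labels = {"orphaned"}
        result[thread_id] = labels
    return result
```

Once every thread has its full set of labels, applying them to each message of the thread is what makes the newer messages show up in the corresponding IMAP folders.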

Moving emails to a different IMAP account

Now that all the label issues are sorted out, you need to copy the emails. You may use the IMAP client of your choice, or the imap_copy.py script. With the script you can list all your IMAP folders:

]==> ./imap_copy.py --list --source=imap.gmail.com:993
imap.gmail.com's username (won't be echoed):
imap.gmail.com's password (won't be echoed):
Fun (1641)
INBOX (114)
Private (484)
Work (1010)
[Gmail]/Brouillons (0)
[Gmail]/Corbeille (563)
[Gmail]/Important (5204)
[Gmail]/Messages envoy&AOk-s (4440)
[Gmail]/Spam (16)
[Gmail]/Suivis (59)
[Gmail]/Tous les messages (14736)

It will also let you copy the messages from one IMAP account to another:

]==> ./imap_copy.py --copy=INBOX.gmail-inbox,Fun \
                    --source=imap.gmail.com:993  \
                    --destination=imap.somewhere.else:993
imap.gmail.com's username (won't be echoed):
imap.gmail.com's password (won't be echoed):
imap.somewhere.else's username (won't be echoed):
imap.somewhere.else's password (won't be echoed): 
Copying INBOX => gmail-inbox
Copying 114 of 114
Copying Fun => Fun
Copying 1641 of 1641
ALL DONE

What you specify after the equals sign of the --copy parameter is a comma-separated list of folders. An item may just be a folder name, like Fun in the example above, or something like INBOX.gmail-inbox, which will take the INBOX folder at the source and copy its contents to the gmail-inbox folder at the destination.
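
The format is simple enough to parse in a few lines. The function below is a hypothetical sketch, not the parser used by imap_copy.py; in particular, it assumes that the first dot in an item separates the source folder from the destination folder, which would not work for folder names that themselves contain dots.

```python
def parse_copy_spec(spec):
    """Turn 'INBOX.gmail-inbox,Fun' into [(source, destination), ...]."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        # Assumption: the first dot separates source from destination.
        source, _, destination = item.partition(".")
        # A bare folder name is copied to a folder of the same name.
        pairs.append((source, destination or source))
    return pairs
```

For the command shown above, this would yield one pair mapping INBOX to gmail-inbox and one mapping Fun to Fun.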

What's next?

For the time being I will use KMail for e-mail and Psi for jabber. They are pretty decent, but I liked having everything (conversation history, settings, contacts) in one place on the web, so I will probably look for, or help develop, a solution that will give me all that and more. RoundCube looks promising.

Edit 28-04-2012: Continuation post has been added.