Breaking through Roadblocks

Hello in 2013! It has been ages since I’ve blogged anything, mainly because I enjoy Google’s social site, Google+, way too much, despite, or perhaps due to, it being filled mostly with my geek friends.

I decided to post this on WordPress, but it made me think about ways to break out of the walled garden of g+ and somehow syndicate certain posts on this site. But that is perhaps material for another post.

What I wanted to share are two stumbling blocks, trivial for most of you, but very frustrating until you know the solution. A total must for your typical batch processing is the xargs utility. Used typically with find, it allows you to perform commands on a list of arguments. By default it lists all arguments on one line:

find . | grep svg$ | xargs echo
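
To make that default concrete: xargs keeps appending input items to one command line until a size limit is reached, then starts a new command line. Below is a rough Python sketch of that packing, not the actual xargs implementation. The function name `batch_arguments` and the tiny `max_chars` value are invented for illustration; the real limit is system-defined and far larger.

```python
def batch_arguments(items, max_chars):
    """Greedily pack items into batches whose total length stays within max_chars."""
    batches, current, used = [], [], 0
    for item in items:
        cost = len(item) + 1  # count one separator per argument
        if current and used + cost > max_chars:
            # The current command line is full, so start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches


batches = batch_arguments(["a.svg", "b.svg", "c.svg", "d.svg"], max_chars=12)
for batch in batches:
    print("echo", *batch)
```

With a limit this small the four names end up on two command lines; with a realistic limit they would all land on one.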

Now find itself has a million switches to perform filtering, but I prefer not diving into the manpage if given the option :) The default behavior of `xargs` leaves a lot to be desired, because usually there is a big list you are working on, and bash and other shells have a limit on the number of arguments. Additionally, it is very likely you will need another argument to follow the one you got passed. The magical parameter you’re looking for is -i, which takes the input list apart and calls the provided command separately for each item. You can place the item anywhere on the command line using the {} placeholder:

find . | grep mp4$ | xargs -i ffmpeg -i {} -sameq {}.webm
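
If the placeholder mechanics feel opaque, here is a minimal Python sketch of the idea: one command per input item, with every {} in the command template replaced by that item. The helper name `expand_placeholder` and the sample filenames are made up for the example, and the commands are only built, never run.

```python
def expand_placeholder(template, items, placeholder="{}"):
    """Return one argument list per item, substituting the placeholder in every argument."""
    return [[arg.replace(placeholder, item) for arg in template] for item in items]


commands = expand_placeholder(
    ["ffmpeg", "-i", "{}", "-sameq", "{}.webm"],
    ["./a.mp4", "./clips/b c.mp4"],  # hypothetical input lines
)
for cmd in commands:
    print(cmd)
```

Keeping each command as a list of arguments rather than one string means a filename with a space in it stays a single argument.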

So while the manpage surely includes this info, I bet someone will find this through a google query and will appreciate it :)

The other big stumbling block, one that I also hit with Ruby, is about XPath queries in Python. Big thanks to Patryk Zawadzki for the solution. Inkscape SVG documents actually put their tags in a namespace, so simple queries like //rect will fail. You need to prefix every element with the SVG namespace (such as //{http://www.w3.org/2000/svg}rect). Full example here:


#!/usr/bin/env python3

import csv
import os
import subprocess
from xml.etree import ElementTree

TEMPLATE = 'template.svg'
SVG = '{http://www.w3.org/2000/svg}'  # ElementTree's {namespace}tag notation

def tspan(element_id):
  # Path to the <tspan> inside the <text> element with the given id
  return ".//%stext[@id='%s']/%stspan" % (SVG, element_id, SVG)

with open('members.csv', newline='') as members:
  for data in csv.reader(members):
    print(data[0])
    svg = ElementTree.parse(TEMPLATE)
    svg.find(tspan('memno')).text = data[0]
    svg.find(tspan('name')).text = data[1]
    svg.find(tspan('validto')).text = data[2]
    out_svg = './out/%s.svg' % data[0]
    svg.write(out_svg)
    # Passing a list avoids shell quoting problems with odd member numbers
    subprocess.run(['inkscape', '-A', './out/%s.pdf' % data[0], out_svg])
    os.unlink(out_svg)
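
ElementTree's find() also accepts a prefix-to-namespace mapping as a second argument, which keeps the queries shorter than spelling out the {…} form each time. A small self-contained sketch of that variant; the inline SAMPLE document, the "svg" prefix and the name "Jane Doe" are made up for the example:

```python
from xml.etree import ElementTree

# Map a prefix of our choosing to the SVG namespace URI.
NS = {"svg": "http://www.w3.org/2000/svg"}

SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <text id="name"><tspan>placeholder</tspan></text>
</svg>"""

root = ElementTree.fromstring(SAMPLE)

# A query without any namespace finds nothing in a namespaced document.
plain = root.find(".//text[@id='name']/tspan")
print(plain)

# The same query with prefixed tags and the mapping finds the element.
tspan = root.find(".//svg:text[@id='name']/svg:tspan", NS)
tspan.text = "Jane Doe"
print(tspan.text)
```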

Update: Turns out the `` (backquote) evaluation in the xargs example was flawed. Thanks for spotting. Additionally, find itself seems to have an iterator of its own:

find . -name '*.avi' -exec echo ffmpeg -i '{}' -sameq '{}'.webm  ';'
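
Note that appending .webm to {} names the output x.avi.webm rather than x.webm. If you want the extension swapped instead, one option is to build the output name outside the shell. A minimal Python sketch, assuming the same ffmpeg arguments as above; the helper name `webm_command` is hypothetical, and the command is only printed, not executed:

```python
from pathlib import Path


def webm_command(source):
    """Build an ffmpeg argument list whose output swaps the extension for .webm."""
    src = Path(source)
    return ["ffmpeg", "-i", str(src), "-sameq", str(src.with_suffix(".webm"))]


# Collect matching files below the current directory and show what would run.
for path in sorted(Path(".").rglob("*.avi")):
    print(webm_command(path))
```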

9 Responses to “Breaking through Roadblocks”

  1. Janne Says:

    One added tip for find and xargs: use -print0 for find, -zZ for grep and -0 (that’s a zero) for xargs:

    find . -print0 | grep -zZ "whatever" | xargs -0 …

    Basically, it makes find and xargs terminate each item with a null byte instead of a newline. That way you won’t get into trouble with filenames that happen to include a linefeed.

  2. jimmac Says:

    Thanks for that. Indeed the space in filenames is quite a roadblock too.

  3. Bernhard M. Wiedemann Says:

    also noteworthy is that xargs knows about the length limit of parameters as described by man xargs:
    The command line for command is built up until it reaches a
    system-defined limit (unless the -n and -L options are used). The
    specified command will be invoked as many times as necessary to use
    up the list of input items.

    btw: the find equivalent of your grep above is -name \*.svg

  4. daniels Says:

    And the obligatory smug zsh-user answer:
    % for i in **/*.mp4; do ffmpeg -i $i ${i:r}.webm; done

  5. Dominic Lachowicz Says:

    Also consider using ‘find -exec’ to do this. Eg:

    find . -name '*.svg' -exec MYPROGRAM {} \;

  6. Icarus Sparry Says:

    I am afraid your ffmpeg example is badly flawed.

    find . | grep mp4$ | xargs -i ffmpeg -i {} -sameq `basename {} avi`.webm

    The stuff in backquotes is evaluated once, long before xargs is run. As the two characters {} do not end in the four characters .avi, basename returns them unchanged. This means your example is

    find . | grep mp4$ | xargs -i ffmpeg -i {} -sameq {}.webm

    In other words if you have a file x.avi, the output will be x.avi.webm rather than x.webm.

    The ‘-i’ argument to xargs is deprecated.

  7. Aristotle Pagaltzis Says:

    > The default behavior of `xargs` leaves a lot to be desired, because usually there is a big list you are working on, and bash and other shells have a limit on the number of arguments.

    Errr. That default behaviour is *the entire point* of xargs! I.e. to *not* run the given command over and over for every single file. Some commands, e.g. `rm`, do so little work that just starting the command 3,000 times consumes much more CPU and I/O than the 3,000 copies of the command itself perform. Then you use `xargs` to start that command just once with as much of the list given to it as possible – and then everything goes much, much faster.

    But you are using ffmpeg to recode media files. The resources it takes to start ffmpeg 3,000 times are insignificant compared to the resources it takes to convert 3,000 MP4 files to WebM. So using xargs makes hardly any difference, it just makes life harder.

    When you don’t need xargs, if you are using `find`, you can always just use the `-exec` switch of `find`, and not pipe its output anywhere.

    And if you are generating the list of filenames with something other than `find`, you can pipe them into a `while` loop, e.g.

    git ls-files | while read f ; do
      # do something with "$f" here
    done

    Or whatever.

  9. Mary Biggs Says:

    Hi.

    Coincidentally I’m currently working on XML using Python. It turns out that the library xml.etree isn’t available anywhere for Fedora 18. I’m happy using libxml2 in its place.

    Mary