I’ve had a bit of free time lately, since I decided not to keep one of the gigs I’ve been working on over the past year or so… this doesn’t mean I’m more available than usual for now, though, mostly because I’m so tired that I need to recharge before I can even start answering the email messages I’ve received in the past ten days or so.
While tinderboxing lately has been troublesome – the
dev-lang/tendra package build causes Yamato to run out of memory, hosing the whole system; it actually looks like a fork bomb – I’ve also had some ideas on how to make it easier for me to report bugs, and in general to get other people to help out with analysing the results.
My original hope for the log analysis was to make it possible to push out the raw log, find the issues, and report them one by one… I see now that this is pretty difficult, bordering on infeasible, so I’d rather try a different approach. The final result I’d like to have now is a web-based equivalent of my current
grep combination: a list of log names, highlighting those that hit any trouble at all, and then, within those logs, highlights on the rows showing trouble.
To get to this point, I’d like to start small and proceed in a series of little steps… and since I honestly don’t have the time to work on it, I’m asking for the help of anybody who’s interested in helping Gentoo. The first step here would be to find a way to process a build log file and translate it into HTML. While I know this means increasing its size tremendously, it is the simplest way to read the log and to layer data over it. What I’d be looking for is a line-numbered page akin to the various pastebins you can find around. This does mean having one span per line to make sure they are aligned, which will be important in the following steps.
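To make the one-span-per-line idea concrete, here is a minimal sketch in Python of what such a converter could look like. The function name, the `id="l<N>"` convention and the surrounding page skeleton are all my own assumptions, not part of any existing tool:

```python
import html

def log_to_html(log_text, title="build log"):
    """Convert a plain-text build log into an HTML page with one
    numbered <span> per line, pastebin-style."""
    spans = []
    for i, line in enumerate(log_text.splitlines(), start=1):
        # One span per line keeps line numbers aligned with content,
        # which the later highlighting/counting steps depend on.
        spans.append('<span id="l%d" class="line">%s</span>'
                     % (i, html.escape(line)))
    return ('<!DOCTYPE html><html><head><meta charset="utf-8">'
            '<title>%s</title></head><body><pre>%s</pre></body></html>'
            % (html.escape(title), "\n".join(spans)))
```

Escaping every line with `html.escape` is what makes the size blow-up acceptable: the output stays valid markup no matter what the build system printed.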
The main issue with this is that there are build logs that include escape sequences, even though I disable both Portage’s colours and the build systems’, which means that whatever converts the logs should also take care of stripping those sequences away. There are also logs that include output such as wget’s or curl’s, which use the carriage-return character to overwrite the output line; that works in a terminal but creates a mess when viewing the log anywhere else. I’m not sure why the heck they don’t check whether they’re outputting to a tty. There are also some (usually Java) packages whose log appears to
grep as a binary file, and that’s also something the conversion code will have to deal with.
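All three problems mentioned above can be handled in one cleanup pass before the HTML conversion. This is only a sketch under my own assumptions (the regexes cover common CSI colour codes and progress-bar rewrites, not every escape sequence a build system might emit):

```python
import re

# Matches CSI escape sequences such as "\033[1;32m" (colours, cursor moves).
ANSI_RE = re.compile(r'\033\[[0-9;?]*[a-zA-Z]')
# wget/curl progress bars rewrite the line with carriage returns;
# drop everything up to the last "\r" on each line, keeping the final state.
CR_RE = re.compile(r'[^\r\n]*\r(?=[^\n])')

def clean_log(raw_bytes):
    # Logs that grep sees as "binary" usually just contain stray
    # non-UTF-8 bytes; errors="replace" keeps the rest readable.
    text = raw_bytes.decode('utf-8', errors='replace')
    text = text.replace('\r\n', '\n')  # normalise CRLF line endings first
    text = ANSI_RE.sub('', text)
    text = CR_RE.sub('', text)
    return text
```

Feeding the result of `clean_log()` into the HTML converter keeps the escaped output tidy without losing any of the final per-line content.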
As a forecast of what’s to come in the next few steps: I’ll need a way to match error messages on each line and highlight them. Once they are highlighted, using XPath expressions to count the matched lines should make it much easier to produce an index of the relevant logs… then it’s a matter of where to publish these. I think it might be possible to just upload everything to Amazon’s S3 and serve it from there, but I might be optimistic for no good reason, so it has to be discussed and designed.
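The XPath-counting step could be as small as this sketch, which assumes (my convention, not an established one) that the highlighting pass marks bad lines with `class="error"` on their span:

```python
import xml.etree.ElementTree as ET

def count_problem_lines(html_doc):
    """Count the spans flagged as problems in a converted log,
    using an XPath-style query over the per-line markup."""
    root = ET.fromstring(html_doc)
    # ElementTree supports this limited XPath subset natively.
    return len(root.findall('.//span[@class="error"]'))
```

A log with a non-zero count is exactly one that should be highlighted in the index page; the count itself could even be shown next to the log name.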
Are you up for helping me with this task? If so, join in the comments!
HTML is overkill (try to open a 1MB+ HTML page); maybe an XML+XSLT solution would be better? You would still have XPath available, keep the ability to view the data, and also have the possibility to automate processing.
It might make sense to have a frontend which uses something like ExtJS’ paginated scrolling capabilities to only show the current lines; as “mrblur” already said, loading a 1MiB+ file as HTML might screw up most browsers. So the currently displayed lines (plus some scrolling buffer) would be loaded on demand through AJAX, while those outside the scrolling scope would be deleted from the DOM. The fact that these lines don’t need to be edited makes the whole system way easier than the complex implementation of ExtJS, which has to take care of storing edited lines, dealing with new records, etc. @flameeyes: could you possibly provide some of the weird cases in the log output which could be used for further experiments towards dealing with them?
I’d second eliasp… use AJAX to browse through the logfile. That way HTML is doable, by adding it on the fly with the JS printing the build log content. I’d also suggest building an index of potentially interesting strings: that way you could have a list of strings you’d like to match, and you could instantly ‘grep’ over lots of data. The system would have a daemon continually indexing incoming logs and purging old ones (from storage and from the index). The index would match selected keywords to filename:linenumber pairs. You could even have a full-text search engine like Sphinx do the same work, but since most of the output is not that interesting, a simpler solution like the above might yield better results.
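The keyword-to-filename:linenumber index described in this comment could start as something very small; this sketch uses two illustrative patterns of my own choosing, not any agreed-upon list:

```python
import re

# Hypothetical starter patterns; the real list would grow over time.
PATTERNS = {
    'compile_error': re.compile(r'\berror:'),
    'undefined_ref': re.compile(r'undefined reference'),
}

def index_log(name, log_text, index):
    """Append (filename, linenumber) pairs to a keyword index.
    `index` maps keyword -> list of (file, line) tuples."""
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for keyword, pattern in PATTERNS.items():
            if pattern.search(line):
                index.setdefault(keyword, []).append((name, lineno))
    return index
```

A daemon would call `index_log()` for each incoming log and drop the entries for a file when it purges the log itself, keeping the index and the storage in sync.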
I also get tendra errors in my depchecking tinderbox. In my case the error was because I assumed all file names were in UTF-8, but tendra has non-UTF-8 characters in its file names.
Sometimes I hate my smartphone. I typed a very lengthy design proposal for the system last night and the comment ended up in digital nirvana. What I wanted to say: it would probably make sense to actually rely on ExtJS for the frontend, as they already did all the nasty work of implementing the “Infinite Grid Scrolling”. See this blog post: http://www.sencha.com/blog/… The look of the native ExtJS grid could be styled a little to look more like a regular text file, but besides that it already provides the numbered lines etc. The easiest way to connect a datasource to an ExtJS model is using REST, which IMHO points to using Rails as a server backend, as Rails allows building such an application quite fast and Ruby has some great tools for text processing. So the process of getting the tinderbox logs onto the application server could look like this:

1. Register the new package atom and a hash of the build-time attributes (CFLAGS, GCC version, arch, etc.) in the database, and set an attribute like “build_succeeded”.
2. Upload the corresponding raw build log data.
3. Upload the corresponding raw elog data.
4. Process the build log:
4.1. Remove possible colour escape codes, wget output, etc.
4.2. Save possible findings together with the corresponding line numbers.
4.3. Write the processed file.
4.4. Remove the raw file to save storage space.
5. Process the elog data:
5.1. Save findings classified (e.g. ‘qa_issue’) into a table.

So far a rough proposal… further (and more detailed) ideas to come.
Tendra uses pmake and it is written poorly; for example it provides its own vsnprintf(void), which calls vsprintf(s, fmt, args) and then returns strlen(s). That is totally wrong when null bytes are written. Tendra also continuously writes: /bin/sh: line 0: cd: /var/tmp/portage/dev-lang/tendra-5.0_pre20070510-r2/work/trunk/src/lib/machines//80×86/tokens: No such file or directory.
Admittedly I don’t know ExtJS, but I’d very much like to start small, without even having a webapp to handle these things, if there is any chance at all of that. The main reason is to allow me to publish the logs without going crazy with authentication and authorization; I guess that might be a utopia… I’ll fetch a sample of troublesome logs and post them soonish.
import re

r = re.compile(r'\033\[[0-9;?]*[a-zA-Z]')
s = r.sub('', s)
r = re.compile(r'(\n|^)[^\n]*\r')
s = r.sub('\n', s)
Here is an alternative approach: I once published the build logs of an automatic build system to a trac instance. This was relatively easy, as trac has an XML-RPC interface. I could imagine a simple script running on your tinderbox that greps the logfiles for specific patterns (“error:” comes to mind as an example) and then creates a ticket on the trac server, with the grep result as the bug description and the log attached for reference. That way you could easily make use of trac’s collaboration functionality for analysing logs. I’m not clear at what stage you are concerned about escape sequences? Trac would do the heavy lifting for displaying things. For filtering the log output, I would search really hard for pre-existing tools; I really doubt that this problem has never been solved before.
Looks like an interesting project. Do you have any idea of the storage volume required?
Are you still looking for a solution for this? It looks like an interesting project. Perhaps I can expand some code I’ve been using for my work: parsing and aggregating search queries using just shell script (and Postgres for storing the aggregates). The parsing process can even be run in parallel using GNU parallel. If you have a sample of the logs you want parsed, and some of the errors you’d like displayed, I can try to make something.