[00:00:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [00:00:55] Logged the message, Master [00:01:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57014 [00:02:13] Scap really is overkill.. [00:02:29] looks better now though :) [00:02:30] thx [00:02:40] probably wants a different logo? [00:02:42] scap rules everything around me: s.c.r.e.a.m. [00:02:59] Sue's face? [00:03:03] hah [00:03:04] lol Reedy [00:03:16] mutante, the wmf one should do [00:03:17] Reedy: uh, what? [00:03:27] notpeter: ? [00:03:38] Reedy: nevermind, I misread that [00:04:07] oh, where's the incubator wiki it refers too? hah [00:04:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:04:37] lesliecarr: are we leaving search1024 like this for awhile? [00:04:37] mutante that's where new languages grow before they move to new wikis [00:04:51] where is the text of https://transitionteam.wikimedia.org/w/index.php?title=Main_Page&action=edit - can it be edited? [00:05:02] making those sister project links protocol relative would be nice [00:05:20] extensions/WikimediaMaintenance [00:05:28] mutante: Did you update both interwiki caches? [00:05:47] https://gerrit.wikimedia.org/r/#/c/42133/ [00:05:54] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 16033 MB (1% inode=99%): [00:06:24] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [00:08:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:07:54 UTC 2013 [00:08:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:08:53] Thehelpfulone: i know there wasn't a real incubator for this one:) [00:09:00] Reedy: both? [00:09:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:09:00 UTC 2013 [00:09:06] still syncing [00:09:08] oh i see [00:09:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:09:23] Reedy: the one that creates a new .cdb [00:09:24] mutante: Cancel it [00:09:53] New review: Reedy; "This needs updating to how things are now..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/42133 [00:10:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:09:58 UTC 2013 [00:10:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:10:34] !log reedy synchronized php-1.21wmf12/cache/interwiki.cdb 'Updating 1.21wmf12 interwiki cache' [00:10:39] Logged the message, Master [00:10:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:10:51 UTC 2013 [00:11:12] Reedy: eh, ok, canceled. stopping in the middle of a sync ..though.. [00:11:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:11:44] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:11:38 UTC 2013 [00:12:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:12:24] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:12:18 UTC 2013 [00:12:39] A new wiki was created by apache at Mon, 01 Apr 2013 23:27:12 GMT for a Wikimedia in English (en). [00:12:39] -> on newprojects mailing list, this used to say who ran it, what changed? [00:13:13] "used to"? [00:13:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:13:14] When? 
[00:14:20] !log reedy synchronized php-1.22wmf1/cache/interwiki.cdb 'Updating 1.22wmf1 interwiki cache' [00:14:26] Logged the message, Master [00:14:44] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:14:37 UTC 2013 [00:15:19] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:15:49] A new wiki was created by reedy at Wed, 06 Feb 2013 23:45:09 GMT for a Wikipedia in Baso Minangkabau (min). [00:15:57] then the next one was A new wiki was created by apache at Tue, 05 Mar 2013 22:01:49 GMT for a Wikimedia in English (en). [00:16:00] What about the ones I did last week? [00:16:02] Lol [00:16:03] New patchset: Reedy; "Add script to update the interwiki cache on all currently deployed MW versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:16:12] You can "blame" Tim for that [00:16:28] New patchset: Dzahn; "change logo for transitionteam wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57017 [00:16:35] Thehelpfulone: ^ [00:17:10] New patchset: Reedy; "Add script to update the interwiki cache on all currently deployed MW versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:18:43] does that change the favicon too? [00:19:03] No [00:19:38] might as well do that too then mutante? [00:20:20] New patchset: Dzahn; "change logo and favicon for transitionteam wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57017 [00:20:57] or do you want black-globe.ico :p [00:21:40] New review: Reedy; "* The user must have write access to the directory, for temporary file creation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:22:01] heh, nah that will do :P [00:23:12] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57017 [00:23:29] how do you create the first account on a private wiki mutante, using a script? [00:24:04] RECOVERY - RAID on db1001 is OK: OK: State is Optimal, checked 2 logical device(s) [00:29:06] createAndPromote [00:30:21] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [00:30:26] Logged the message, Master [00:30:45] Thehelpfulone: logo/favicon done [00:31:48] New review: Asher; "I agree with Faidon, the existing behavior of redirecting m.wikipedia.org/$uri to en.m.wikipedia.org..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55302 [00:33:14] mutante, was it deployed? 
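For context on the two interwiki-cache syncs logged above: a minimal sketch of the deployment step behind those "!log ... synchronized ... interwiki.cdb" lines, assuming the standard sync-file wrapper used for MediaWiki deploys (which auto-generates the !log entries). The cdb files themselves are first rebuilt with a WikimediaMaintenance script, which is what change 42133 discussed here aims to automate; that rebuild step is not shown.

    # Hedged reconstruction of the syncs logged above -- one per deployed branch.
    sync-file php-1.21wmf12/cache/interwiki.cdb 'Updating 1.21wmf12 interwiki cache'
    sync-file php-1.22wmf1/cache/interwiki.cdb 'Updating 1.22wmf1 interwiki cache'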
seems to be the same for me [00:33:55] Thehelpfulone: yes, pretty sure it's caching, i see the new ones [00:34:27] there we go [00:35:53] !log creating search index for transitionteamwiki [00:36:00] Logged the message, Master [00:36:03] PROBLEM - RAID on db1054 is CRITICAL: NRPE: Command check_raid not defined [00:36:30] New review: Reedy; "Still a problem depending on the owner of the cache dir and permissions on interwiki.cdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:36:33] PROBLEM - DPKG on db1054 is CRITICAL: NRPE: Command check_dpkg not defined [00:36:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:36:43] PROBLEM - Disk space on db1054 is CRITICAL: NRPE: Command check_disk_space not defined [00:40:56] !log restarting lucene on all pool4 servers (one by one) [00:41:02] Logged the message, Master [00:41:13] New patchset: Reedy; "Update dblists and wikiversions for transitionteamwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57021 [00:41:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57021 [00:42:36] Reedy: ooh, that is also in gerrit, sorry [00:43:06] !log now running the image img_media_mime migration on commons (the big one) [00:43:12] Logged the message, Master [00:44:25] !log reedy synchronized wmf-config/ [00:44:31] Logged the message, Master [00:44:51] commons image table migration currently estimated to take 8 hours.. wee! [00:46:08] now 9 hours [00:48:08] New patchset: Reedy; "Move interwiki.cdb and trusted-xff.cdb into wmf-config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57023 [00:48:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57023 [00:49:55] Reedy: where is createAndPromote [00:50:03] maintenance/createAndPromote.php [00:50:08] thx [00:52:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [00:54:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:54:30] New patchset: Reedy; "Add script to update the interwiki cache on all currently deployed MW versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:55:53] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:57:26] New patchset: Reedy; "Move target of noc cdb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57025 [00:58:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:59:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57025 [01:00:59] New patchset: Reedy; "Add script to update the interwiki cache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [01:02:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [01:03:23] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [01:04:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [01:05:04] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:14] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 
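The createAndPromote.php pointer above is MediaWiki's standard way to bootstrap the first account on a new private wiki such as transitionteamwiki. A minimal sketch follows; the wiki name comes from this log, while the username, password, and exact flag names are assumptions about the then-current maintenance script, not a verified command.

    # Create the first user on the new private wiki and grant sysop + bureaucrat
    # (username and password below are placeholders).
    mwscript createAndPromote.php --wiki=transitionteamwiki --sysop --bureaucrat 'ExampleAdmin' 'example-password'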
[01:06:44] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 16312 MB (1% inode=99%): [01:06:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [01:07:54] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [01:10:44] PROBLEM - search indices - check lucene status page on search14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:56] daaaaamn it [01:11:25] icinga-wm: * Starting Lucene Search daemon [ OK ] [01:11:53] New patchset: Reedy; "Make extensions/WikimediaMaintenance/filebackend/setZoneAccess.php wikiless" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57026 [01:12:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57026 [01:15:06] !log removing labstore1 and labstore2 entries from projectstorage.wmnet rr dns entry in preparation for shrinking volumes [01:15:11] Logged the message, Master [01:18:14] PROBLEM - search indices - check lucene status page on search17 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 55856 bytes in 0.112 second response time [01:18:14] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:18:36] New patchset: Reedy; "Remove readonly.dblist. Essentially a dupe of closed.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57027 [01:21:54] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.017 second response time [01:22:35] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [01:25:02] New patchset: Reedy; "Reduce the amount of times the database lists are read in" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57028 [01:33:33] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [01:33:33] RECOVERY - search indices - check lucene status page on search13 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.112 second response time [01:34:04] yay [01:37:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:40:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:53:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [01:56:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:03:23] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [02:04:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [02:06:04] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [02:06:34] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15868 MB (1% inode=99%): [02:10:24] !log LocalisationUpdate completed (1.21wmf12) at Tue Apr 2 02:10:24 UTC 2013 [02:10:30] Logged the message, Master [02:17:14] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:17:34] !log LocalisationUpdate completed (1.22wmf1) at Tue Apr 2 02:17:33 UTC 2013 [02:17:40] Logged the message, Master [02:19:04] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1600 bytes in 2.192 second response time [02:19:24] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:24] PROBLEM - Apache HTTP on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:24] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:24] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:35] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1600 bytes in 2.172 second response time [02:19:44] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:04] PROBLEM - MySQL Slave Running on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:20:04] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:04] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61290 bytes in 0.308 second response time [02:20:07] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:07] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:07] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:07] PROBLEM - Apache HTTP on mw1220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:14] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:14] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:14] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.075 second response time [02:20:34] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:35] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [02:20:35] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:20:35] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:20:35] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61290 bytes in 0.215 second response time [02:20:44] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:44] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:54] RECOVERY - MySQL Slave Running on db1017 is OK: OK replication [02:20:54] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [02:20:54] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [02:20:54] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [02:20:54] 
RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [02:20:55] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:21:04] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:21:04] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [02:21:18] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:21:18] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [02:21:18] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [02:21:24] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:21:34] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:21:35] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [02:21:35] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [02:21:35] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [02:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:35] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [02:26:35] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:33:24] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [02:38:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:53:13] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [03:02:23] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [03:04:19] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [03:05:59] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15474 MB (1% inode=99%): [03:05:59] RECOVERY - RAID on db1054 is OK: OK: State is Optimal, checked 2 logical device(s) [03:05:59] RECOVERY - Disk space on db1054 is OK: DISK OK [03:05:59] RECOVERY - DPKG on db1054 is OK: All packages OK [03:06:29] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [03:14:19] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:34:19] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [04:04:45] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:06:22] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15126 MB (1% inode=99%): [04:06:52] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: 
Defunct disk drive count: 1 [04:08:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:07:57 UTC 2013 [04:08:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:09:12] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:09:04 UTC 2013 [04:09:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:10:13] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:10:02 UTC 2013 [04:10:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:11:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:10:55 UTC 2013 [04:11:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:11:52] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:11:42 UTC 2013 [04:12:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:13:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:12:56 UTC 2013 [04:13:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:14:32] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:14:28 UTC 2013 [04:14:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:16:23] New patchset: Tim Starling; "Reduce non-video job queue size from 320 to 112" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [04:18:05] New patchset: Tim Starling; "Reduce non-video job queue size from 320 to 112" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [04:27:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [04:47:23] New patchset: Ryan Lane; "Use https for public puppet repo remote" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57033 [05:04:28] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [05:06:08] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15729 MB (1% inode=99%): [05:06:38] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [05:21:19] New patchset: MZMcBride; "Reduce non-video job queue size from 320 to 112" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [05:27:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [05:28:47] New patchset: Tim Starling; "Reduce non-video job queue size from 320 to 144" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:30:23] New review: Tim Starling; "PS4: increase dprioprocs from 5 to 7 at Aaron's suggestion, and fix the wikiadmin process limit in t..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:35:13] PROBLEM - DPKG on vanadium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [05:36:13] RECOVERY - DPKG on vanadium is OK: All packages OK [05:37:13] New patchset: Aaron Schulz; "Reduce non-video job queue size from 320 to 144" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:42:39] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:42:43] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [05:42:43] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [05:42:43] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [05:57:23] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:34] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:53] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:03] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:13] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:43] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.020 second response time [05:59:43] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.416 second response time [05:59:53] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.607 second response time [05:59:53] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [06:00:03] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.093 second response time [06:00:03] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.050 second response time [06:00:03] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 747 bytes in 0.060 second response time [06:00:03] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [06:00:03] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [06:00:03] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [06:00:03] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [06:00:23] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time [06:00:23] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [06:00:23] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [06:00:23] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [06:00:23] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [06:00:23] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [06:00:23] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [06:00:24] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [06:00:24] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [06:06:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:08:39] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [06:09:09] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15342 MB (1% inode=99%): [06:26:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:27:09] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [06:27:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [06:29:59] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 06:29:55 UTC 2013 [06:30:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:30:49] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 06:30:39 UTC 2013 [06:31:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:31:49] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 06:31:46 UTC 2013 [06:32:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [07:04:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [07:06:04] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15011 MB (1% inode=99%): [07:06:34] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [07:32:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second 
response time [07:47:50] !g I2a9fbe5f7522ba9fed64415b5f7b230ee50cfc23 [07:47:50] https://gerrit.wikimedia.org/r/#q,I2a9fbe5f7522ba9fed64415b5f7b230ee50cfc23,n,z [08:05:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:07:16] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14598 MB (1% inode=99%): [08:07:46] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [08:07:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 08:07:47 UTC 2013 [08:08:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:08:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 08:08:28 UTC 2013 [08:09:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:14:46] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 08:14:43 UTC 2013 [08:15:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [09:04:17] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [09:06:27] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [09:06:57] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14211 MB (1% inode=99%): [09:34:24] is the mediawiki::cgroup group already enabled on any of the servers? [09:34:37] dont see it included explicitly in puppet [10:04:15] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [10:06:25] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [10:06:55] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14572 MB (1% inode=99%): [10:53:51] j^: there's no such class [10:54:25] oh wait [10:54:27] paravoid: modules/mediawiki/manifests/cgroup.pp:class mediawiki::cgroup { [10:54:30] yeah [10:54:50] it's included by class mediawiki [10:54:51] init.pp [10:55:20] ah ok so should be used. 
[10:55:36] yes [10:55:47] we use cgroups for imagescaling nowadays [10:55:52] not sure about videoscaling though [10:55:56] now that the index is in place and i can see http://commons.wikimedia.org/wiki/Special:TimedMediaHandler i noticed that the videoscalers still have hanging processes from before that transition [10:56:29] whats the best way to kill those encodes that are running for months [10:56:44] I'll do that [10:57:19] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [10:58:05] paravoid: thanks [10:58:51] some things are also in the queue for way to long, not sure whats happening there, might be stuck for some reason during job queue updates or so [11:00:47] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [11:04:08] j^: I see no stale processes on tmh* boxes [11:04:37] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [11:06:06] paravoid: can you send me a full ps ax from tmm1001/2.eqiad [11:06:17] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14951 MB (1% inode=99%): [11:06:35] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [11:06:39] *tmh1001/2.eqiad [11:06:47] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [11:07:37] there's nothing relevant in tmh1001/tmh1002/tmh1/tmh2 [11:07:47] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [11:07:48] just jobs-loop.sh [11:08:05] what is it that Special:TMH polls? [11:08:26] I'm guessing something from the database? [11:09:04] yes thats from the database [11:09:20] its also cached if not admin so might be off [11:09:42] was never able to see it on commons until now [11:10:10] New review: Nemo bis; "Leslie, done (sorry for the spam): however, I don't know where the usual check on spence was suppose..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [11:10:40] any way to find out how many webVideoTranscode jobs are in the job queue? [11:14:29] 414 [11:14:38] commons that is [11:18:10] and how many are running on the tmh servers? [11:18:39] ps ax | grep avconv [11:18:42] 0 [11:19:00] ah wait [11:19:01] there is one now [11:19:21] 0 to 1 :) [11:20:53] so clearly jobs-loop.sh no longer does what it was doing [11:24:55] I wouldn't know :) [11:27:33] let me know if there's anything I can do to help [11:27:52] although for the more mediawiki internal parts, someone from the platform team would be more of a help [11:34:21] thanks, will try to analize and let you know if i need some more data from the running servers [12:08:19] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:10:29] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [12:10:59] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14251 MB (1% inode=99%): [12:58:22] mark, hi, do you have a moment to look at https://gerrit.wikimedia.org/r/#/c/55302/ [12:59:35] telcos want to start testing, and we have been pushing it back for a bit [13:02:27] can that be split into separate patchsets for the conceptually different changes? 
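On the question above of how many webVideoTranscode jobs are queued: a hedged sketch of one way to get that number per wiki, assuming MediaWiki's maintenance/showJobs.php and its --group option, which prints a per-job-type breakdown rather than a single total.

    # Per-type job counts for Commons; the 414 quoted above would appear on the
    # webVideoTranscode line. Output format varies by MediaWiki version.
    mwscript showJobs.php --wiki=commonswiki --group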
[13:03:04] i don't like these large-all-in-one-patchset changes [13:03:45] mark, most of it is one change - consolidation of the defaults [13:04:01] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [13:04:10] adam added a few ACLs yesterday thinking it would not be a problem [13:05:19] mark, if you want i could split it up, but do you think we could merge it today? [13:06:11] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [13:06:41] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 13587 MB (1% inode=99%): [13:07:32] hmm mark I think you put this in the wrong rt ticket: https://rt.wikimedia.org/Ticket/Display.html?id=4685 [13:09:58] indeed [13:11:05] New review: Mark Bergsma; "As per previous comments per Faidon/Asher: the redirection logic can and should be done in MobileFro..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/55302 [13:11:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [13:13:24] New patchset: Mark Bergsma; "Revert "Unified default lang redirect from m. & zero. Adding three carriers for testing, too."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57061 [13:13:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57061 [13:14:08] yurik: [13:14:10] Message from VCC-compiler: [13:14:10] Expected ')' got 'carrier_vimpelcom_mobilink_pakistan' [13:14:10] (program line 73), at [13:14:10] ('mobile-frontend.inc.vcl' Line 488 Pos 36) [13:14:10] } else if (client.ip ~ acl carrier_vimpelcom_mobilink_pakistan) { [13:14:11] -----------------------------------###################################--- [13:14:20] please correct and submit a new patchset [13:21:22] Vimpelcom Pakistan? WTF, globalisation goes way too far:P [13:21:53] no, VCL bloat goes way too far :P [13:22:22] mark, are you satisfied with the caching wikitech-l thread? [13:22:52] once I finish catching up on my email, maybe ;) [13:23:30] MaxSem: I made a comment about bits that was relayed by asher [13:23:33] has this been addressed? [13:23:57] paravoid, yes [13:24:17] https://gerrit.wikimedia.org/r/#/c/56774/ [13:24:42] great, thanks [13:26:15] uhm, I don't think the proposal was that [13:27:03] anyway, let's hear mark first, no point in doing ping pongs now [13:28:02] """ [13:28:06] use something like: [13:28:06] http://bits.wikimedia.org/m/en.wikipedia.org/load.php?.. [13:28:06] Then we can if (req.url ~ "^/m/") { tag_carrier + strip the /m/ }, so the overhead only effects mobile requests. [13:28:08] New patchset: Yurik; "Unified default lang redirect from m. & zero. Adding three carriers for testing, too." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57063 [13:28:08] """ [13:28:37] s/tag_carrier/device_detection/ [13:29:02] i really don't like doing the device detection on bits too [13:30:36] mark, thanks, fixed. https://gerrit.wikimedia.org/r/#/c/57063/ [13:30:40] then we can always switch em back to .m. domains as originally intended [13:30:50] yeah I think I prefer that [13:31:10] however, what's so bad about doing device detection for select paths? 
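On the VCC compile error quoted above: the parser is rejecting the "acl" keyword inside the match expression. In VCL an ACL is declared once with "acl name { ... }" and then referenced by its bare name in "~" matches. A minimal sketch, using a placeholder netblock and header rather than the real carrier configuration:

    acl carrier_vimpelcom_mobilink_pakistan {
        "203.0.113.0"/24;        # placeholder range, illustrative only
    }

    sub vcl_recv {
        # Reference the ACL by name -- no "acl" keyword here.
        if (client.ip ~ carrier_vimpelcom_mobilink_pakistan) {
            set req.http.X-Carrier = "vimpelcom_mobilink_pakistan";   # illustrative tagging
        }
    }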
[13:31:34] i like bits currently being mean, lean and efficient [13:31:47] and since mobile tends to like to do stuff in VCL, I like to keep you off bits [13:32:09] and I don't really see disadvantages to keeping that on the mobile servers either [13:32:24] right, that's why I said this isn't exactly the proposal [13:32:36] mark, i think mobile would much rather do most of the work in php ;) [13:32:52] aren't there arguments to load.php per device? [13:33:10] bits.wm.org/(m/)load.php?device=android&foo or something? [13:33:11] paravoid, so how did your proposal sounded before it was interpreted by Asher? [13:33:27] paravoid: that's not possible with unified HTML is it [13:33:31] oh wait, this is the whole not doing ESI [13:33:33] okay, nevermind [13:33:34] yes [13:34:03] as for sharding with pipelining, we can always setup a special service IP for that of course [13:34:04] ignore me, I'm getting confused [13:34:30] New patchset: Demon; "Show notice to users who are using legacy skins" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [13:34:31] yeah, asher thought of mbits too [13:34:44] ugh, so do browser pipeline per IP or per domain? [13:34:44] or bits.m :) [13:34:45] exactly [13:34:48] per domain [13:34:56] so then we can point that at whatever varnish cluster we like [13:34:59] although from a quick googling mobile browsers seem to be better than desktop ones [13:35:00] which probably will be mobile for now [13:35:09] i.e. they have more than 2 max connections per domain [13:35:17] that's hilarious [13:35:31] http://www.guypo.com/mobile/http-pipelining-big-in-mobile/ [13:35:38] lol, in opera it's configurable with default being 8 connections per domain/64 total [13:35:59] desktop seem better than I remembered [13:36:06] newer versions I guess [13:36:19] just looked, 16 "per server" [13:38:03] sooo [13:38:05] anyway [13:38:09] other than that, good job, guys [13:38:16] i might even have to deploy those esams servers soon ;) [13:38:23] mark, btw, the patch i just submitted is rebased from master, so there are a few new bits there for device detection [13:38:23] thanks:) [13:38:34] mark: do we have IPs now? [13:38:39] not yet [13:38:46] mark, we hope to deploy this stuff next week [13:38:46] waiting for ts? [13:38:48] yes, soon [13:38:53] ok! [13:41:45] and speaking of deployment, we will need some ops attention during it [13:42:10] is there an easy way to apply current varnish config to a labs instance? I'm in a process of setting up varnish test rig so that any VCL changes are easy(er) to test, and given that there are 12 varnish files in puppets repo, I'm not sure of the best course of action [13:43:00] current varnish/caching puppet manifests don't fully work yet inside labs [13:43:27] mark, but how do you test VCL changes? [13:43:27] hashar is working on improving that in the context of the beta cluster [13:43:31] we don't [13:43:36] we test in production [13:43:36] live tests? 
:) [13:43:40] lovelly [13:43:46] best testbed ever [13:43:51] they mostly work :-] [13:43:55] could you pass me that root please [13:44:01] no [13:44:02] still have to polish up the role::cache::mobile class though [13:44:03] you'd break it [13:44:04] ;) [13:44:20] hehe [13:44:33] in theory, we're gonna use beta for that [13:44:40] in practice, it's not there yet [13:45:14] mark, beta doesn't work for us - beta is to test the stuff that has been merged into master (from what we were told) [13:45:26] beta is to test everything [13:45:48] I think different people call different things beta [13:45:56] mark, is it possible to change varnish config on beta without pulling it from git? [13:46:15] i don't think so, [13:46:21] i mean - will it be possible to edit varnish files on it [13:46:31] but once beta is in use there's no reason you couldn't setup your specific labs project for testing varnish changes [13:46:34] because without it, its a staging server for ops, not a test dev server [13:46:35] using those same manifests [13:46:43] hmm [13:46:52] talk to hashar, I don't really know exactly [13:46:59] beta is definitely also a test dev server [13:47:16] more or less :-D [13:47:27] i haven't used beta at all [13:47:32] I think we need different instances (or maybe even projects) for each component to be tested [13:47:38] exactly - if i can't ssh into the server and edit the vcl file, its not that useful :) [13:47:41] the varnish caches in beta are running the manifests that comes from puppet master. So you can't really develop anything [13:47:44] I'm not sure how this can be all called "beta", maybe that's too confusing [13:48:02] it wouldn't make sense testing a varnish change on the same cluster someone else is trying to test mediawiki [13:48:07] I can see more specific labs projects used for developing [13:48:11] and then beta as a final integrated test [13:48:14] then production [13:48:16] right [13:48:29] yeah that is the idea between beta. To test out your changes before they land in production. [13:48:34] dev should be done somewhere else [13:48:55] sure, that will work, but that leaves the question that hashar is working on - how to best set up a test rig [13:49:20] i guess i will poke hashar in a bit to see if i can get it set up [13:49:24] what do you want to do exactly? [13:49:45] of course we're not gonna support many different labs projects inside our production manifests [13:49:47] change vcl, see that it compiles, see that it sets the headers correctly, etc [13:49:48] so that will be a problem [13:50:25] i already got mobile-varnish instance up, but haven't finished configuring varnish yet [13:50:34] you can always make your own labs instance, deploy varnish much like is done in production, and hack your local manifests until you have a working solution [13:50:48] it's certainly not fully puppet automated at the moment [13:50:59] but given that you want the ability to locally edit things anyway, perhaps that shouldn't be such a big issue [13:51:12] mark, sure, but could you give me a few pointers on how/where to hack that manifest? 
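Pulling together the pointers in this exchange -- role/cache.pp, varnish.pp, the VCL templates, and the self-hosted puppetmaster setup linked just below -- a rough edit-and-apply loop on a labs instance might look like the following. Every path and file name here is an assumption based on this log and the wikitech help page, not a verified procedure.

    # On a labs instance running puppetmaster::self, the puppet tree is checked out
    # locally, so manifests and VCL templates can be edited in place and re-applied.
    cd /var/lib/git/operations/puppet
    sudoedit manifests/role/cache.pp manifests/varnish.pp \
             templates/varnish/mobile-frontend.inc.vcl.erb
    sudo puppet agent --test --verbose    # apply the locally edited manifests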
[13:51:15] for testing syntax compliance and limited functionality testing, that could work just fine [13:51:31] yurik: basically, a mobile varnish server includes the "role::cache::mobile" manifest [13:51:32] i am new to pupetireeing :) [13:51:37] everything else is pulled in from there [13:51:43] so you could try that on a labs instance [13:51:48] it will fail horribly [13:51:55] but just fix up the manifests locally until you have it working [13:52:10] right right, but i need a starting point - which file to hack, and how to run it [13:52:13] it'll fail on things like lookups in a hash file to find the backend servers needed, or to create a file system on a partition that doesn't exist in labs [13:52:30] you'll want to hack role/cache.pp and varnish.pp, as well as the VCL templates [13:52:34] as i said - very new to puppets [13:53:14] and use puppet apply varnish.pp ? [13:53:26] or some other command? [13:53:43] read up on puppetmaster::self in labs [13:54:44] mark https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster ? [13:54:57] yes [13:55:12] thanks, should be good to get started [13:55:19] hopefully we can get you guys some more ops support to help with this soon [13:55:29] would be awesome [13:57:55] mark, and yes, i don't want to keep using varnish as in https://gerrit.wikimedia.org/r/#/c/57063/ [13:58:33] my target is to adapt/rewrite geoIp-style lookup for IP->carrier code string, and do everything else in php [13:58:56] cool [13:58:57] and introduce a proper zero portal [13:59:10] so that we don't redirect left and right [14:03:11] ottomata: here? [14:04:01] yurik: check line 372 [14:04:09] uncommited squid changes [14:04:16] uh [14:04:18] nevermind [14:04:19] wth [14:04:25] not sure if they're deployed somewhere or not, so I'm reluctant to deploy [14:04:29] what's wrong with my font [14:05:03] yup, hiya! [14:05:10] -cache_access_log udp://208.80.154.73:8420 wikimedia [14:05:14] that's gadolinium [14:05:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57063 [14:06:09] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [14:06:53] eh? [14:07:06] uncommited squid change on fenari [14:07:10] yurik: live on one box now [14:07:16] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [14:07:17] yei!!! [14:07:22] thanks mark! [14:07:24] gadolinium is using the oxygen multicast stream... [14:07:46] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 13919 MB (1% inode=99%): [14:07:48] so, is this deployed? [14:08:20] wait, not sure what you are saying, uhhh, frontend caches should not send logs directly to gadolinium [14:08:35] no, I'm saying that someone has modified squid on fenari and hasn't commited [14:08:39] oh! [14:08:42] agh [14:08:49] that was me then [14:08:50] sorry [14:08:50] and I want to deploy something else now [14:08:53] yes, that is committed [14:09:06] sorry [14:09:07] that is deployed [14:09:09] ok [14:09:32] sorry about that, sigh, that's a tough one to remember, will do better next time [14:21:31] mark, are any VCL changes needed to serve load.php from m domains? [14:21:44] New review: Nemo bis; "Double checked that it's what they want." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56420 [14:21:53] i'm still pondering between the two options [14:21:58] i'll reply on wikitech-l later [14:22:17] and yes, VCL changes would be needed for that [14:24:11] !log deploying squid config, diverting all of upload to swift [14:24:18] Logged the message, Master [14:27:27] KILL SOLARIS [14:28:12] we really need a better index page [14:29:44] i'm seeing roughly 7500 RL requests from mobile atm [14:29:46] per second [14:37:30] New patchset: Faidon; "upload varnish: switch everything to Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57069 [14:37:35] mark: wanna do a quick 10-line review? [14:37:55] sure [14:38:08] that pmtpa varnish stanza within varnish_be_directors is redundant, right? [14:38:16] it confused me for a moment there [14:38:45] since it's self-referrential, if it wasn't enclosed in an "if eqiad" :) [14:40:02] +1 [14:40:22] thanks [14:41:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57069 [14:47:22] hmm [14:47:30] New review: OliverKeyes; "Can't comment on the code (ooh, alliteration) but the project is sound and the patch is needed :)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56408 [14:48:36] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [14:49:18] ah, now I understand why LU [14:49:47] !log demon synchronized wmf-config/CommonSettings.php 'Notice for users of disabled skins' [14:50:00] Logged the message, Master [14:50:49] !log demon synchronized wmf-config/CommonSettings.php 'I hate l10nupdate' [14:50:55] meh [14:50:58] Logged the message, Master [14:51:01] perhaps we should do device detection on bits [14:51:13] problem with varnish is that it doesn't have clean separation of storage backends [14:51:25] I can't reliably tell it to put resource loader content in a separate malloc backend [14:51:36] and with the high churn on the mobile frontends that could be a problem [14:54:00] !log demon synchronized php-1.21wmf12/extensions/WikimediaMessages/WikimediaTemporaryMessages.i18n.php '8th time is the charm' [14:54:06] Logged the message, Master [14:55:17] so is it a big problem if mobile RL came from persistent storage? [14:55:34] frontends don't have persistent storage [14:55:40] frontends only have a small malloc backend [14:57:01] I know [14:57:09] but I don't understand what you're saying [14:57:19] why a separate malloc backend? [14:58:07] New patchset: Jeremyb; "[it planet] fix doppiequadre per Elitre" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57074 [14:58:26] New review: Jeremyb; "http://www.w3.org/Provider/Style/URI.html" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57074 [15:00:11] i don't want the resource loader assets to LRU expire so much [15:02:11] jeremyb_: thanks for your patches :) [15:02:30] aww she was disappointed with me "ma nemo!!!" 
[15:04:50] New review: Nemo bis; "aye *faceplam*" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57074 [15:05:13] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [15:07:23] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [15:07:53] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 13292 MB (1% inode=99%): [15:08:37] !log LocalisationUpdate completed (1.21wmf12) at Tue Apr 2 15:08:37 UTC 2013 [15:08:44] Logged the message, Master [15:11:14] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:12:46] !log LocalisationUpdate completed (1.22wmf1) at Tue Apr 2 15:12:45 UTC 2013 [15:12:53] Logged the message, Master [15:25:59] greg-g: what's the latest on the HTTP auth saga? [15:27:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [15:31:28] jeremyb_: as in oauth/etc? [15:31:37] err? no [15:31:52] as in ishmael/graphite/icinga-admin [15:32:06] hiyaaa paravoid, whatcha think of this? [15:32:06] https://gerrit.wikimedia.org/r/#/c/56537/ [15:33:42] !log anomie synchronized php-1.21wmf12/extensions/WikimediaMessages/WikimediaTemporaryMessages.i18n.php [15:33:48] Logged the message, Master [15:37:40] greg-g: ? [15:38:19] <^demon> anomie: So yeah, I'd already sync'd that, after updating. [15:38:35] jeremyb_: oh, right... sorry, haven't finished drinking my coffee yet. no movement in the RT ticket last I checked (yesterday afternoon) [15:38:35] <^demon> Plus, if it wasn't up to date, how would the other languages have shown up? [15:38:37] ^demon- Just trying it. Didn't work :( [15:38:51] greg-g: no, i meant it still doesn't work? [15:39:05] greg-g: (i already checked the ticket myself :) ) [15:39:05] I was about to say "if it's how it's configured, oh well, why not" [15:39:06] then saw /opt/kraken [15:39:24] yeah yeah yeah [15:39:25] i know [15:39:47] and the hashing by awk [15:39:54] jeremyb_: still fails on ishmael, at least [15:39:57] Hmm. [15:39:59] ok [15:40:12] I think I'm leaning towards no [15:40:21] jeremyb_: and graphite [15:40:23] that is how its currently configured, and that won't fly for initail base cluster [15:40:34] this is just so we can get monitoring working on those instances [15:40:43] greg-g: k [15:40:52] <^demon> anomie: I guess we could try a full scap? But that seems overkill. [15:40:58] !log anomie synchronized php-1.21wmf12/cache/l10n/l10n_cache-en.cdb [15:41:04] Logged the message, Master [15:41:09] <^demon> Weirdddd [15:41:12] \o/ [15:41:21] andrewbogott: want to look at rt 4853 when you have a min? (it's your week :) ) [15:41:24] <^demon> Wonder why l10nupdate missed that file. [15:41:27] <^demon> When the rest went out. [15:41:31] jeremyb: yep [15:41:35] No idea why, but apparently the en cache file didn't get copied. [15:42:21] <^demon> Bizarre. 
[15:42:25] !log demon synchronized wmf-config/CommonSettings.php 'Message for ancient skin users' [15:42:32] Logged the message, Master [15:42:59] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [15:42:59] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [15:42:59] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [15:43:27] New patchset: Ottomata; "Puppetizing udp2log instances on analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537 [15:43:43] paravoid, I understand your concern about /opt/kraken, that is not correct and is not intended to be the correct solution. Its just what is there right now [15:43:53] hashing by awk is what is there right now too, and it is working fine [15:43:57] it is also not the final solution [15:45:02] New patchset: Demon; "Switch nostalgiawiki to use Nostalgia from extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56402 [15:45:02] I think we agreed to not merge what exists now but start fresh [15:47:03] riiiiight, but in the meantime people are worried about potential packet loss on udp2log instances that are used to import the mobile data into kraken [15:47:45] the fact that it isn't monitored properly means that we are less confident about the accuracy of data we generate [15:48:15] i agree to start fresh, 100%, that's why I added the comments about how this is temporary [15:48:57] (btw, I'm waiting on puppet-merge review and kafka debian review as the first steps in starting fresh, and I expect to have more time allocated to this after mid may) [15:49:27] jeremyb: Can you explain that patch to me slightly? I don't know the context at all. What is that file used for? [15:49:57] andrewbogott: which patch? [15:50:24] planet/it_config.erb [15:50:35] it's used for it.planet.wikimedia.org [15:51:05] Oh, the 'it' is for italian, not information technology :) [15:51:11] yes :) [15:51:20] the request was made by an itwiki sysop and +1'd by Nemo_bis [15:51:28] That is obvious in retrospect [15:51:31] OK, will merge. 
[15:51:42] danke [15:51:54] * jeremyb_ wonders about greg though :) [15:55:01] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57074 [15:56:34] jeremyb_: thanks for caring :) [16:02:12] andrewbogott: i think you're not supposed to touch verified fwiw [16:04:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:07:06] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [16:07:36] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 12568 MB (1% inode=99%): [16:07:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:07:53 UTC 2013 [16:08:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:09:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:09:55 UTC 2013 [16:10:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:12:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:11:50 UTC 2013 [16:12:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:13:46] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:13:40 UTC 2013 [16:13:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:14:43] Nemo_bis: hrmmm, there's 2 doppiequadres? [16:15:26] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:15:24 UTC 2013 [16:15:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:17:06] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:16:58 UTC 2013 [16:17:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:18:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:18:30 UTC 2013 [16:18:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:19:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:19:52 UTC 2013 [16:20:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:21:16] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:21:11 UTC 2013 [16:21:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:22:26] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:22:21 UTC 2013 [16:22:59] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:23:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:23:26 UTC 2013 [16:23:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:24:26] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:24:23 UTC 2013 [16:24:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:25:17] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:25:15 UTC 2013 [16:25:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:26:06] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:25:58 UTC 2013 [16:26:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:27:16] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:27:11 UTC 2013 [16:27:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [16:27:46] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [16:27:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:28:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [16:28:50] New patchset: Ottomata; "Abstracting out udp2log monitoring into its own define" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537 [16:34:37] jeremyb_: wasn't the other removed [16:34:47] no, the tumblr [16:34:48] New patchset: MaxSem; "Add a mobile log group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57078 [16:34:56] look at the current site [16:35:25] yes [16:35:30] the umblr was a new one [16:35:36] damn lag [16:35:38] why are there 2? [16:35:51] anyway, tell her to remove the www. from tumblr :) [16:35:57] and maybe make it a link too [16:36:22] > Casa base su http://www.doppiequadre.wordpress.com/ [16:44:55] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537 [16:50:43] New patchset: Ottomata; "Removing inherit on analytics1003-1006" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57080 [16:52:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57080 [16:53:12] New patchset: Ottomata; "Only analytics1003 and 1011 are ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57081 [16:54:39] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57081 [16:56:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:58:00] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [16:58:30] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 12057 MB (1% inode=99%): [17:03:45] New patchset: coren; "Add service group support in-instance with nslcd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57082 [17:03:58] !log setting weight to 100 on db1001 [17:04:03] Logged the message, Master [17:04:13] New patchset: Ottomata; "Need role::analytics on analytics1003-1006" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57083 [17:06:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57083 [17:09:49] New review: coren; "Works on tools-puppet-test as advertized. :-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57082 [17:10:18] New patchset: Cmjohnson; "Addind db1001 back into production" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57084 [17:12:56] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57082 [17:13:05] New patchset: Cmjohnson; "Adding db1001 back into production removing db1028 from production for h/w fix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57084 [17:13:56] Change merged: Cmjohnson; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57084 [17:13:57] New patchset: Ottomata; "Sometimes order makes a difference with $ganglia_aggregator." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57085 [17:14:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57085 [17:14:35] notpeter, ping [17:14:53] andrewbogott: did you merge my changes to? [17:15:20] andrewbogott: sup [17:15:37] cmjohnson1: you have to deploy that patchset via the mediawiki config method [17:15:55] notpeter, I'm told that this ticket requires a db person: https://rt.wikimedia.org/Ticket/Display.html?id=4862 is that something you can take on? [17:16:35] andrewbogott: this will probably be hella annoying, but sure,i can look into it [17:16:46] thanks [17:17:02] notpeter: oh! well shit...i don't see that in wikitech [17:17:18] hhhmmm, let's see if you have deploy access :) [17:17:28] ssh to fenari, and make sure to forward agent for this [17:18:27] i am on fenari [17:18:30] cool [17:18:34] cd /home/w/common [17:18:46] k [17:18:50] git pull [17:18:52] cmjohnson1: https://wikitech.wikimedia.org/wiki/How_to_do_a_configuration_change#Change_wiki_configuration [17:19:12] notpeter cool [17:19:16] worked [17:19:49] woo! now [17:20:00] Who is the king and/or queen of bugzilla? [17:20:27] sync-file wmf-config/db-eqiad.php "" [17:20:39] andrewbogott: I like the and/or :) [17:21:00] andrewbogott: how do you mean? but probably andre__ [17:21:07] cmjohnson1: if you don't see a whole screen full of errors, it worked :) [17:21:11] andrewbogott: andree [17:21:21] mutante's so slow :) [17:21:27] likely me [17:21:28] !log cmjohnson synchronized wmf-config/db-eqiad.php 'adding db1001 back to production removing db1028' [17:21:29] jeremyb_, mutante, I'm trying to address a request from andre_ to deploy a change. [17:21:33] Logged the message, Master [17:21:38] haha [17:21:39] notpeter: worked [17:21:43] Oh oh, I see. Now this way. [17:21:46] andrewbogott: andre: rephrasing: bugzillaadmin@wm :) [17:22:07] (which is just andre__) [17:22:30] yea, right now it is. but it might have more on it in the future [17:22:35] i guess [17:22:53] andrewbogott: which one? [17:23:01] https://rt.wikimedia.org/Ticket/Display.html?id=4867 [17:23:31] Is bugzilla running out of a git repo? If so I guess I can deploy myself... [17:24:00] cmjohnson1: woo! [17:24:17] cmjohnson1: so, the only thing left is in a bit, to increase the weight on db1001 [17:24:22] back up to 400 [17:24:37] cool [17:26:56] andrewbogott: ok,, so ssh to kaulen, then cd /root/bzmod/modifications , git pull [17:27:05] New patchset: Jgreen; "move fundraising banner log collection pipeline from locke to gadolinium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57088 [17:27:16] mutante: So, not packaged, not puppetized [17:27:21] I guess you've been saying that for months :) [17:27:39] !log disable fundraising banner log rotation on locke [17:27:45] Logged the message, Master [17:27:52] andrewbogott: then the last step you'd have to do manually, copy the file from there to: /srv/org/wikimedia/bugzilla/extensions/Wikimedia [17:27:56] andrewbogott: no, there is no package [17:28:02] Ubuntu dropped it [17:28:33] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57088 [17:28:42] also, the structure of that git repo with the different bz versions needs changing [17:30:29] andre__: Can you verify that the patch is now deployed? 
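Condensed, the config-deploy walkthrough notpeter gives above amounts to roughly the following; treat it as a sketch and defer to the wikitech page linked in the log:

    # ssh to the deploy host with agent forwarding so the sync can reach the apaches
    ssh -A fenari
    cd /home/w/common                 # live checkout of operations/mediawiki-config
    git pull                          # pull the change already merged in Gerrit
    # sync the single changed file; the quoted string becomes the !log entry
    sync-file wmf-config/db-eqiad.php 'adding db1001 back to production, removing db1028'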
[17:30:48] andrewbogott: hey, think positive, the whole change is in gerrit and you can clone it, until not so long ago you would have gotten .diff files and use patch to apply it:) [17:31:14] It looks to me like it would be an improvement to run bugzilla straight out of git -- could we do that? [17:31:17] That's easy enough to puppetize. [17:31:54] i don't know, i think we should have a .deb for the normal Bugzilla [17:32:03] and then puppetize that it fetches our modification stuff from git [17:32:15] ewww [17:32:16] this repo is just called "modifications" [17:33:07] * jeremyb_ says no mixing. either all .deb all the way or not .deb at all [17:33:30] i don't see why it should be much different than mediawiki deploymenjt [17:33:36] deployment* [17:33:49] !log removing disk 0 from db1048 to replace [17:33:54] i think it's the clean way to separate "vanilla" bugzilla and our mods [17:33:54] Logged the message, Master [17:33:58] !log correction db1028 [17:34:03] Logged the message, Master [17:34:19] jeremyb_: is mediawiki packaged ?:) [17:34:19] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:34:19] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:34:22] andrewbogott, verified, all perfect, big thanks! [17:34:27] mutante: yes [17:34:35] andre__: Cool, mind closing out the RT ticket as well? [17:34:42] andrewbogott, I will do [17:34:50] mutante: Wouldn't git be a reasonable way of distinguishing, anyway? [17:35:25] jeremyb_: sadly the official recommendation is to git clone though and the .debs are very outdated [17:35:50] mutante: huh? [17:35:57] mutante: i think you're outdated :) [17:36:19] http://lists.wikimedia.org/pipermail/mediawiki-distributors/ [17:36:39] http://lists.alioth.debian.org/pipermail/pkg-mediawiki-devel/ [17:36:47] heh, i created that list [17:37:09] http://packages.ubuntu.com/search?keywords=mediawiki [17:37:30] andrewbogott: I'd love to get rid of some substeps when getting Bugzilla code changes deployed (and tested) for sure, whatever makes sense. [17:37:41] andrewbogott: you mean 2 separate repos? [17:37:42] mutante: do we care about ubuntu? just look at wheezy :) [17:37:46] i mean it's always git, of course [17:37:53] even if it's a package,it's in git [17:38:00] jeremyb_: i don't [17:38:15] we can switch to Debian ,heh [17:38:27] mutante, well, presumably the 'normal' bugzilla is already in a repo, someplace else that we don't have to maintain? [17:38:39] andrewbogott: is it? i don't think so, we download tarballs [17:38:45] And we can just configure the repo to use that as an upstream so we can diff... [17:38:49] Oh, if it isn't then… [17:38:51] mozilla does not have packages either [17:38:52] afaik [17:38:57] well, yeah, then we'ld have to keep track of it somehow [17:39:41] and the general puppet files vs. .deb package, i don't have a strong opinion, let's ask architects [17:39:47] but i expect .deb [17:40:08] andrewbogott: mutante: upstream is bzr [17:40:20] so it would make sense to keep the local hacks in bzr too [17:40:32] I've asked Mozilla if there are .deb packages but they said it's up to distros... 
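For reference, the manual Bugzilla deploy flow mutante describes above comes down to something like this (paths as quoted in the conversation; a sketch only, since none of it is packaged or puppetized yet):

    ssh kaulen
    cd /root/bzmod/modifications       # the "modifications" git repo
    git pull                           # fetch the change merged in Gerrit
    # the last step is still manual: copy the changed file into the live tree
    cp <changed-file> /srv/org/wikimedia/bugzilla/extensions/Wikimedia/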
[17:40:37] but then most people won't know bzr :( [17:41:12] we could automate converting bzr to git [17:41:15] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:20] not sure how it works over time though [17:41:24] !log restarted varnishncsa-multicast_relay on cp1028 [17:41:25] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:41:29] e.g. on the resulting commit ids consistent [17:41:30] Logged the message, Mistress of the network gear. [17:41:57] !log restarted varnishncsa-locke on cp1023 [17:42:03] Logged the message, Mistress of the network gear. [17:42:07] New review: Asher; "This doesn't address variance, and I don't think we want resourceloader to include an X-Device Vary ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56774 [17:42:15] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:43:25] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:44:20] * andrewbogott is sorry he asked [17:44:23] New patchset: Jgreen; "fundraising banner rotation tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57089 [17:44:25] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:44:34] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57089 [17:44:47] andrewbogott: :D [17:45:17] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:45:25] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [17:48:15] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:49:15] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:51:55] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:52:55] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [17:53:15] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:54:59] New review: MaxSem; "Not every RL request needs to be varied by X-Device - only the ones that contain the autodetect modu..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56774 [17:55:25] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:58:55] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:02:26] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [18:04:13] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [18:04:22] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [18:04:52] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [18:05:13] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:06:02] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [18:06:32] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 11396 MB (1% inode=99%): [18:06:50] New review: Asher; "If that's already taken into account by the resourceloader module, nevermind hashing in vcl." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56774 [18:07:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:09:20] sbernardin: I need you to put a network ticket in for rdb1 and 2 please [18:10:22] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:11:13] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [18:14:07] New review: Dr0ptp4kt; "I believe this is already covered in https://gerrit.wikimedia.org/r/55302 and https://gerrit.wikimed..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56333 [18:14:13] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:14:32] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [18:16:38] ugh, mark / terry :( [18:16:46] should 4785/4685 be merged now? [18:17:01] New patchset: Pyoungmeister; "pre-labsdb dbs: more node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57095 [18:17:14] New patchset: Jgreen; "fix fundraising log rotation path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57096 [18:17:36] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57096 [18:18:48] New patchset: Pyoungmeister; "pre-labsdb dbs: more node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57095 [18:19:32] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:19:49] notpeter, you poked me some time ago about Solr-related cronspam. Does it continue now? [18:21:13] MaxSem: it doesn't look like it [18:21:17] but I'll keep an eye out [18:21:21] whee [18:21:30] I haven't been vigilant about cronspam of late [18:21:31] notpeter: so, what's up with 4844? [18:21:33] Change abandoned: Yurik; "heh, this has been pending for the past 5 days :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56333 [18:22:09] jeremyb_: I don't know. I haven't had time to look into it [18:22:22] New review: Dr0ptp4kt; "That explains it! Thanks." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56333 [18:22:49] notpeter: ok. well i'm confident in my testing so if you need me to retest or find someone to test let me know [18:22:52] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [18:22:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57095 [18:22:56] ori-l: are you on comcom? i guess not [18:23:13] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:22] jeremyb_: cool. I just wanted confirmation from a speaker of the languages. [18:23:29] as it was working previously [18:23:43] notpeter: well i originally investigated only because a local complained [18:23:57] cool [18:24:10] k [18:25:13] RECOVERY - RAID on db1028 is OK: OK: State is Optimal, checked 2 logical device(s) [18:26:07] New patchset: Matmarex; "(bug 46330) Set $wgCategoryCollation to 'uca-fi' on all Finnish wikis except Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57098 [18:27:48] lol [18:27:59] New patchset: Reedy; "Remove readonly.dblist. Essentially a dupe of closed.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57027 [18:28:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57027 [18:28:35] New patchset: Reedy; "Reduce the amount of times the database lists are read in" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57028 [18:28:42] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57028 [18:29:05] New patchset: Jgreen; "grr. forgot ensure=>directory..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57099 [18:30:23] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [18:30:28] New patchset: Jgreen; "grr. forgot ensure=>directory..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57099 [18:31:22] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [18:31:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57099 [18:33:57] New patchset: Pyoungmeister; "can't repeat a key in a hash and expect it to work right" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57101 [18:35:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57101 [18:35:18] !log reedy synchronized wmf-config/CommonSettings.php [18:35:24] Logged the message, Master [18:36:18] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [18:39:27] !log completed image img_media_mime migration on all projects [18:39:33] Logged the message, Master [18:39:34] AaronSchulz: ^^ [18:39:50] cmjohnson1: did you re-raise the weight on db1001 [18:39:51] ? [18:39:59] not yet [18:40:01] binasher: send in the mims [18:40:03] *mimes [18:40:12] cmjohnson1: cool. 
go for it when you're ready [18:40:23] ok...will get it a few [18:40:27] in a few [18:40:38] cool [18:41:26] binasher: Yay [18:42:24] cmjohnson1: notpeter: oh, actually please just leave db1001 where it is [18:42:42] i'm probably going to repurpose it [18:43:11] okay...also took db1028 out of production to replace the disk..once the rebuild is done...will put it back [18:43:28] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [18:44:13] binasher: does https://gerrit.wikimedia.org/r/#/c/57087/1/includes/StatCounter.php look OK? [18:44:42] cmjohnson1: put back in at decreased weight, yeah? :) [18:44:49] db1028, imean [18:44:53] right...gotcha [18:44:57] cmjohnson1: cool! [18:46:35] binasher: on the db's for labstore...do you want raid 10 there as well or raid 5? [18:47:37] cmjohnson1: that actual servers are ciscos, right? [18:47:54] not on labstore r510's [18:48:00] labsdb are ciscos [18:48:06] !log starting innopack from db31 to db1054 for pre-labsdb db (sanitarium) [18:48:11] Logged the message, notpeter [18:49:21] binasher ^ [18:49:41] oh [18:49:47] doh [18:49:50] that's not going to work [18:49:50] wait [18:49:58] i don't know what the labstore servers are [18:51:10] binasher: what do you mean? [18:51:58] cmjohnson1: i don't know what you're asking me about? [18:52:20] Ryan_Lane: ^^ labstore? are those for gluster? [18:52:27] yes [18:52:41] so labstore is unrelated to the dbs [18:53:19] cmjohnson1: did you have a chance to check the shelves for the labstore systems in eqiad? [18:53:36] ryan_lane...yep...not turned on [18:53:40] :D [18:53:44] that'll do it [18:53:44] had toggle that on/off switch [18:53:47] yep [18:54:21] cmjohnson1: here is the ticket for rdb1 & rdb2... https://rt.wikimedia.org/Ticket/Display.html?id=4870 [18:54:24] ryan_lane so they're all yours whenever you are ready [18:54:31] cmjohnson1: thanks :) [18:54:32] !log actually starting innopack from db65 to db1054 for pre-labsdb db (sanitarium) [18:54:38] Logged the message, notpeter [18:56:23] !log cmjohnson synchronized wmf-config/db-eqiad.php 'adding db1028 back' [18:56:28] Logged the message, Master [18:56:39] thanks sbernardin [18:57:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:15] AaronSchulz: what if $statsline is > than 1472 bytes [18:58:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:01:45] binasher: maybe it could be partitioned into <=512 byte calls [19:03:16] AaronSchulz: the udplog patch goes for 1450 [19:03:34] well, are we the only ones using this? [19:04:08] i'm pretty sure we are [19:04:15] which is still too large for ipv6 [19:04:18] * paravoid hides [19:04:50] paravoid: how many bytes is an ipv6 header? [19:04:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:05:05] hah [19:05:11] binasher: Very much variable. [19:05:14] (normally) :) [19:05:36] yeah, there's a extension headers [19:05:58] but I don't think they'd made sense here [19:06:04] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [19:06:22] [14:53:36] ryan_lane...yep...not turned on Well, /there's/ your problem. 
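For context on the 1472 and 1450 figures being discussed: with a standard 1500-byte Ethernet MTU, an IPv4 header (20 bytes minimum) plus a UDP header (8 bytes) leaves 1500 - 20 - 8 = 1472 bytes of payload before fragmentation; the fixed IPv6 header is 40 bytes, so the same sum gives 1500 - 40 - 8 = 1452, which is why capping the stats line at 1450 bytes is comfortably safe either way.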
[19:06:34] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 12660 MB (1% inode=99%): [19:06:35] :) [19:06:47] so maybe it'd fit [19:06:58] 40 bytes is the absolutely necessary ipv6 header [19:07:08] plus 8 for udp iirc [19:08:18] maybe that's why it's 1450? [19:08:59] anyway, this was a joke, I don't think we'll switch logging to ipv6 before we switch away from udp2log :) [19:10:56] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53884 [19:14:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:27] New patchset: Reedy; "(bug 46330) Set $wgCategoryCollation to 'uca-fi' on all Finnish wikis except Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57098 [19:16:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [19:21:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57098 [19:21:58] !log reedy synchronized wmf-config/InitialiseSettings.php [19:22:05] Logged the message, Master [19:22:06] paravoid: udplog4life! [19:24:21] i have that tattooed on my left buttock [19:24:25] true story. [19:24:53] Pic or didn't happen? [19:25:02] odder: pervert [19:25:41] Nemo_bis: just a Wikipedian; my immediate thought was {{citation needed}} # [19:26:05] no citations, only sequence ids [19:27:15] odder: that's a form of perversion I guess [19:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [19:28:51] who is our contact at MaxMind? [19:28:54] anyone knows? [19:29:22] yurik: Max? [19:29:34] Reedy, max who? [19:29:40] nvm [19:29:53] sorry, still jetlagged ;) [19:30:07] Pfft [19:30:11] A whole 3 hours? [19:30:18] redeye [19:30:25] A whole 3 hours? [19:30:37] 6 hours sleeping in a plane :-P [19:31:02] aided by chemical substances ;) [19:31:02] Pfft [19:31:06] If you slept... [19:31:21] nvm [19:31:40] aaanyway, i guess i will send an email to their general hotline :) [19:31:48] yurik: You might want to try poking our fundraising people [19:32:02] #wikimedia-fundraising [19:32:12] i spoke with Matthew, but he didn't know [19:32:20] might be a good idea, thx :) [19:36:38] paravoid: ori-l: until i ripped it out and nuked from orbit, flickr was doing their request logging via a token ring topology reliable multicast udp protocol that also tried to guarantee in-order delivery. the apache module for this system (http://www.backhand.org/mod_log_spread/) would block accepting new requests until it got the token and could send its logged. [19:36:47] udplog… it could be so much worse. [19:37:15] ohmygod [19:37:19] seriously? [19:37:47] not joking :( [19:37:50] so the problems you encountered were impossible because the system was provably correct [19:37:57] must be you, etc. [19:38:00] right? [19:38:08] exactly [19:38:22] figures [19:40:00] we could do the same thing from squid!! 
http://www.squid-cache.org/mail-archive/squid-users/200508/0178.html [19:40:11] yurik: i am the contact person for MaxMind [19:41:03] drdee, sweet :) I was hoping to find out if maxmind has a binary database *encoding* tool, so that we can create a custom DB of our own [19:41:10] guys am I a dummy here or what?: [19:41:23] the goal is to map carrier IP ranges to their ID [19:41:28] yurik: I already replied to that [19:41:30] yurik: paravoid: doesn't debian have a cvs -> maxmind db tool? [19:41:42] yes, I said that in a mail days ago [19:41:56] I also said that it sounds hacky to me [19:42:24] and that writing a program in C that imports a whatever file format into a radix tree and then doing lookups over that tree is trivial [19:42:28] paravoid, sorry, must have missed the tool ref, will go doubl check. Why is it hacky if its exactly the same goal -- ip -> string? [19:42:39] probably a day or two effort [19:42:58] binasher: s/cvs/csv/, brrr that sounded horrible [19:42:58] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:43:11] cvs -> maxmind [19:43:27] hahah [19:44:16] does lvm not report disk usage properly sometimes? [19:44:23] https://gist.github.com/ottomata/5295534 [19:44:57] Change abandoned: Mattflaschen; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57004 [19:46:28] paravoid: do you know if cpu mhz in /proc/cpuinfo on a 3.2 kernel should reflect the actual currently scaled speed? [19:47:33] New patchset: Aklapper; "Use my real name on Planet Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57166 [19:48:17] New patchset: Reedy; "Revert "Reduce the amount of times the database lists are read in"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57167 [19:48:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57167 [19:48:31] drdee: syslog during failed install an1007 http://p.defau.lt/?mqGr04BAa4TtR_zzj95xmQ [19:48:48] ty cmjohnson1! [19:49:12] !log reedy synchronized wmf-config/CommonSettings.php [19:49:19] Logged the message, Master [19:49:25] is that a permalink or will it expire sono? [19:51:19] notpeter, can I ping you about some disk free weirdness? I must be doing something real dumb [19:51:24] https://gist.github.com/ottomata/5295534 [19:53:32] paravoid: nm, verified that it does [19:54:28] cpu frequency scaling seems broken on the new r720+E5-2620 dbs [19:57:02] drdee...no idea..i can email you text file [19:57:33] New patchset: Asher; "pulling db1001" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57169 [19:57:56] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57169 [20:00:23] sorry, I was having dinner [20:00:28] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:01:21] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1001' [20:01:28] Logged the message, Master [20:03:47] fyi, i'm testing something, expecting a disk alert in a sec... 
[20:03:53] on analytics1026 [20:04:10] PROBLEM - Disk space on analytics1026 is CRITICAL: DISK CRITICAL - free space: /mnt/tmp_test_otto 0 MB (0% inode=99%): [20:04:20] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:05:00] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:05:10] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [20:05:28] ottomata: stat / /mnt/tmp_test_otto{,/file2} [20:05:28] New patchset: Reedy; "Cache loaded dblists when tagged. Reuse for SiteMatrix, CentralAuth and Incubator" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57173 [20:06:46] jeremyb_, i think it had something to do with the way I created the file, not sure, will do that in a sec [20:06:50] New review: JanZerebecki; "The single [OR] would be correct as no [OR] means an implicit AND which I assume matches the intent ..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/49069 [20:06:54] if I copied a real file in place it showed disk usage properly [20:07:08] oh, yeah [20:07:12] your dd is screwy [20:07:19] you: dd bs=1 count=0 seek=21M [20:07:27] should be: dd bs=1 count=21M [20:07:29] err [20:07:48] should be: dd bs=1M count=21 [20:07:52] ottomata: ^ [20:08:08] hm, i grabbed that from the dd wikipedia page, although that was for empty files of arbitrary size (from /dev/zero) [20:08:39] mk cool, thanks jeremb_ i know it was something real dumb [20:08:43] next time throw in some du :) [20:08:49] yeah I did too! [20:08:53] it showed it as free [20:09:14] !log added marc to the ops ldap group [20:09:21] Logged the message, Master [20:09:46] hm, did any of you guys get that recent analytics1026 disk alert in your emails? [20:10:01] New patchset: Asher; "moving s1 watchlist to db1052, putting db1043 to full weight" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57174 [20:10:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57174 [20:11:14] !log asher synchronized wmf-config/db-eqiad.php 'moving s1 watchlist to db1052, db1043 to full weight' [20:11:20] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:11:21] Logged the message, Master [20:11:30] New patchset: Ryan Lane; "Adding marc (Coren) as root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57175 [20:11:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57033 [20:12:08] ottomata: whatsup? [20:12:34] ha, hey, figured out the disk thing (i was being dumb, i knew it!), but now i'm confused about something else. [20:12:43] i just triggered a critical disk alert in icinga [20:12:55] but I get no email…looking around to find out why [20:12:58] any ideas? [20:13:11] icinga sucks... 
web interface way too slow [20:13:14] and not reliable [20:13:27] ](not sure if spence was better though) [20:13:39] jeremyb_: I don't understand what you mean by slow [20:13:40] its cool in the web interface [20:13:42] it's fast for me [20:13:56] Ryan_Lane: https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?contact=all [20:14:06] ottomata: they aren't "critical" [20:14:12] that is indeed slow [20:14:15] by default, things don't email [20:14:17] I never use that page [20:14:22] hm [20:14:32] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1026&service=Disk+space [20:14:37] lemme see the def in puppet [20:14:49] Ryan_Lane: i can't recall having this problems with non-wmf nagios. can't remember about spence [20:15:09] Ryan_Lane: i can definitely say this is a regular problem and not just today/right now [20:15:32] all of the pages I normally use work fine [20:15:56] notpeter: base.pp line 433 [20:16:16] nrpe::monitor_service { "disk_space" [20:16:25] yes [20:16:37] look at nagios.pp:59 [20:16:44] that is how checks are defined [20:17:01] the $critical var [20:17:05] critical="false ? [20:17:05] which is by default false [20:17:05] ahhh [20:17:06] hm [20:17:11] has to be "true" to get page [20:17:12] Ryan_Lane: the output ends with service-by-irc'>notify-service-by-irc and contact_grou [20:17:19] hmmmm [20:17:20] ok [20:17:27] hmmm, cool! ok [20:17:27] so [20:17:34] and takes 20 secs to even get that far [20:17:34] so I can override base::monitoring::host for analytics nodes [20:17:35] you could define another check_disck thing [20:17:37] for you [20:17:39] and change it and change contact_groups [20:17:39] ? [20:17:47] ok [20:17:53] why do you want this to email you so badly? [20:18:19] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57078 [20:18:24] I would say [20:18:32] that if you're expecting disk space problems [20:18:36] to make sometihng to clean them up [20:18:37] in case disk fills on up on analytics nodes, this happened on a couple of non-critical not production nodes the other week due to cRaZy udp2log process [20:18:42] and if you're not [20:18:42] not expecting [20:18:50] then just check it from time to time [20:19:03] well, dschoon et. all got all upset while I was at a seder dinner, heheh [20:19:14] not my fault they're anti-semites [20:19:16] ;) [20:19:18] heheh [20:19:22] anyway [20:19:28] I think that this is overkill [20:19:29] and messy [20:19:31] ottomata: when in brooklyn... [20:19:33] really? [20:19:42] and that fixing the root problem is the correct approach [20:19:44] you don't want paged if your disk on prod nodes fills up? [20:19:46] disk can vary quickly. [20:19:48] there's no root problem, atm [20:19:50] this was a fluke [20:20:02] as data volume is high, and a small misconfiguration can create a large volume of garbage [20:20:09] i would like to know about the garbage earlier rather than later. [20:20:48] ottomata: you can write any icinga checks you want :) [20:20:50] go for it [20:20:57] please don't edit the stuff in base.pp, though [20:20:59] I mean, you can [20:21:07] but, like, iunno, it's included on all [20:21:11] no no, certainly not [20:21:16] cool [20:21:17] i would inherit the class, but i thikn your way is cleaner [20:21:25] just make a new alert? [20:21:27] is better? 
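On the disk-free weirdness a little earlier: dd with count=0 and a large seek creates a sparse file, which has the requested apparent size but allocates almost no blocks, so df and the Icinga disk check don't budge. A quick way to see the difference (same mount point as the test in the log):

    # sparse: ~21 MB apparent size, essentially no blocks allocated
    dd if=/dev/zero of=/mnt/tmp_test_otto/sparse bs=1 count=0 seek=21M
    # real data: actually writes 21 MiB of zeros to disk
    dd if=/dev/zero of=/mnt/tmp_test_otto/full bs=1M count=21
    du -h --apparent-size /mnt/tmp_test_otto/*   # both report ~21M
    du -h /mnt/tmp_test_otto/*                   # only "full" shows real usage
    df -h /mnt/tmp_test_otto                     # only the real file moves this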
[20:21:28] yeah [20:21:33] k [20:21:36] can customize it however you want :) [20:22:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57175 [20:22:52] you can also just leave it as it is, but just not make it page [20:23:00] by setting critical => false [20:23:24] mutante, we want page :0 [20:23:35] notpeter, any reason why this is nrpe vs just monitor_service? [20:24:24] I would certainly -2 any change that adds pages for analytics disk checks [20:24:24] because you need to execute it on the remote server [20:24:24] ottomata because you have to run a local command with nrpe to check disk space [20:24:34] there is no way to know that from the outside of the box [20:24:41] hm, aye ok right [20:24:42] New patchset: Ori.livneh; "Enable PostEdit on bn, br, ca, cs, et, ka and zh wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57178 [20:24:45] paravoid: I think this is just for the analytics alerting group [20:24:49] ok, need to ad dnew param to nrpe::monitor_service [20:24:51] $citircal [20:24:57] ok, that's better :) [20:24:58] but still [20:25:02] our job is to be proactive [20:25:04] yeah, just for analytics contact group [20:25:05] not reactive [20:25:18] hence logrotate [20:25:22] paravoid: this is what i was trying to say about "configure your stuff right and address root cause" [20:25:29] hah, mutante, yes, this is not a logrotate problem [20:25:51] yes, totally agreed [20:25:52] i dunno, but i saw full disks and huge logfiles [20:26:21] ottomata: for nrpe::monitor_service you would actually want to add the $critical var and the $contact_group var [20:26:26] aye [20:26:37] and pass those to the monitor service define [20:26:39] ok, guys, if all 3 of you tell me not to turn on these alerts though [20:26:40] i'm fine with that [20:26:51] for the most part i'm not worried about this either [20:26:54] who knows about nimsoft? [20:27:03] what it checks and how to configure it? [20:27:04] dschoon ^ [20:27:19] paravoid: web based AFAIK [20:27:29] jeremyb_: I meant more than that :) [20:27:35] paravoid: yeah [20:27:35] (not git) :( [20:27:38] paravoid: it's watchmouse [20:27:56] ottomata: I actually think the ops team should be the single point of contact for pages and such alerts [20:28:01] paravoid: cat /h/w/doc/watchmouse [20:28:06] paravoid: http://cloudmonitor.nimsoft.com/en/ [20:28:13] I've configured some [20:28:20] same here [20:28:26] and that pages going to the analytics group is a wrong premise [20:28:33] hmmm, yeah i think I agree [20:28:35] of the analytics group doing operations [20:29:24] hmm, what about the IRC notices? [20:29:25] notpeter, mutante: I'm in, thanks [20:29:38] woudl be nice if analytics contact group stuff would send alerts to #analytics irc room too, no? [20:29:46] at least then someone there would be more likely to notice and then come in here [20:30:12] * jeremyb_ wonders what's up with rt 4340. mutante ? [20:31:25] paravoid: are these pages actually a problem, or are they not and you're correcting the check def right now? [20:31:42] ottomata: [20:31:42] they're not, I just fixed the check and got an okay page [20:31:43] so [20:31:49] paravoid: thanks! [20:31:53] paravoid: while in it, you might want to check your alert settings, you don't have to have 24/7, but there is just one global timezone. 
which should be UTC [20:32:00] ottomata: so, i think that the root of this situations [20:32:10] is that we had no disk monitoring for a while [20:32:14] and once I got it going again [20:32:14] the check was trying /pybaltestfile.txt a while back I changed to /monitoring/backend [20:32:16] on all boxes [20:32:19] sometimes there was confusion because people thought the timezone setting is per user, but its not [20:32:25] then it was an omg emergency on th analytics boxes [20:32:26] that is actually being served by swift, instead of ms7 [20:32:32] and today I removed ms7 altogether [20:32:35] usually, like, now, this wouldn't sneak up on you/us like that [20:32:42] so I think that more emails won't be needed [20:32:42] ahhhhhHHHhh [20:32:47] I had updated our checks and all that but hadn't thought of watchmouse [20:32:47] that makes a lot of sense [20:32:57] because disks dont accidentally become 100% in a matter of hours [20:32:57] i was trying to figure out how this happened all of the sudden all at once [20:33:01] yeah [20:33:11] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [20:33:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [20:33:18] so, i think that making these things page for you/analytics [20:33:22] would be fixing the last situation [20:33:25] ok cool, thanks, i'll leave it as is then and update…MINGLE [20:33:25] which was a problem for sure [20:33:31] MINGLE THE SOURCE OF ALL KNOWLEDGE [20:33:34] ASK THE MINGLE [20:33:36] but I think that we're in a much better position at this point [20:33:48] !log maxsem synchronized wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/#/c/57078/' [20:33:55] Logged the message, Master [20:34:04] ottomata: if disk space explotions continue, then I think that's a bigger problem [20:34:09] jgonera: are you in? [20:34:09] totally. [20:34:10] and should be solved with capacity planning :) [20:34:25] if you need more boxes, you need more boxes :) [20:34:34] jeremyb_, ? [20:34:40] jgonera: stat1? [20:34:46] ottomata: fwiw, "jmxtrans" logs were also quite a bit [20:34:46] oh, yes, thanks [20:34:48] ottomata: wtf is mingle? [20:34:50] :) [20:35:07] and kafka.log [20:35:09] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:35:11] mingle: because we don't have enough task trackers [20:35:12] Ryan_Lane: it's like windows, but for process ;) [20:36:35] oh mutante, interesting, ok [20:36:55] its really fun guys! you should try it! [20:36:57] there are cards! [20:37:02] and inboxes! [20:37:10] the cards all have numbers [20:37:13] logs notpeter into mingleotrswiki [20:37:18] you don't have to refer to work taks with words anymore [20:37:35] we speak in numbers only over in #analytics [20:37:49] ottomata: can you play 3-card montey with them? [20:37:57] I'm only interested if swindling is possible [20:37:58] probably! actually! you should try! [20:38:06] log into analytics mingle and move all the cards around [20:38:07] see what happens [20:38:08] is there mao? [20:38:25] itllbefunipromise [20:38:38] ottomata!!! [20:38:43] hahah [20:38:47] uh oh! he's in this room :!!!! [20:38:49] heheheh [20:38:51] ottomata: I don't use closed source stuff. that's why I use debian on this laptop [20:38:54] uh [20:38:56] I'm lurking. 
:) [20:38:58] 111 116 116 111 109 097 116 097 058 032 114 101 100 109 105 110 101 032 102 116 119 [20:38:58] that uses an open source bios [20:39:00] I swear [20:39:06] hahah [20:39:09] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:39:17] coreboot? [20:39:53] mutante, re jmxtrans and kafka logs, those will be rotated when uhhhhhh, i can make changes to the cluster again (one day soon, let's keep our toes crossed!) [20:40:54] ottomata: oh, i may have missed something there, ok [20:41:10] RECOVERY - Disk space on analytics1026 is OK: DISK OK [20:41:24] !log maxsem synchronized php-1.21wmf12/includes/OutputPage.php 'https://gerrit.wikimedia.org/r/#/c/49071/' [20:41:24] :) [20:41:31] Logged the message, Master [20:44:38] New patchset: Reedy; "Cache closed, fishbowl and private dblists and reuse" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57184 [20:45:09] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [20:46:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57184 [20:49:11] !log reedy synchronized wmf-config/CommonSettings.php [20:49:17] Logged the message, Master [20:49:39] New patchset: Reedy; "More closing brackets" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57188 [20:49:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57188 [20:50:09] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:50:30] New patchset: Reedy; "Add transitionteam docroot" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57189 [20:50:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57189 [20:55:09] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [20:55:36] New patchset: Asher; "returning db1028 to service" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57191 [20:55:58] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57191 [20:56:09] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.030 second response time [20:56:22] binasher: I tweaked StatCounter thing [20:57:45] !log asher synchronized wmf-config/db-eqiad.php 'returning db1028' [20:57:51] Logged the message, Master [20:59:09] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [21:01:09] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.139 second response time [21:02:09] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [21:03:09] scapping... [21:06:26] AaronSchulz: hmm, we only send -total if wfIncrStats is called.. 
i thought we did for every request, though i suppose nearly all will call wfIncrStats [21:08:08] binasher: yeah, it's sent only if something gets incremented [21:08:18] there used to be one for session-setup that hit every requests [21:08:27] it spammed the collector and tim disabled it [21:09:51] ehm Allowed memory size of 183500800 bytes exhausted (tried to allocate 242394 bytes) in /usr/local/apache/common-local/php-1.21wmf12/includes/libs/jsminplus.php on line 1772 [21:14:25] binasher: the collector seems OK enough that I don't see a need to backport that [21:16:36] AaronSchulz: yeah, the graphs are pretty again and the collector usually isn't pegging a core [21:16:46] it'll be good to get more headroom though [21:25:01] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:25:07] Logged the message, Master [21:25:21] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [21:26:29] LeslieCarr: on https://gerrit.wikimedia.org/r/#/c/37441/ , was the en.wiki job queue check on spence removed on purpose or by mistake? is it ok to reintroduce it? [21:26:51] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [21:26:51] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [21:27:41] well spence is decommissioned minus ishmael, so it is gone on purpose [21:28:05] so running on hume is okay [21:28:11] but this needs to be put in the icinga.pp file [21:28:15] instead of nagios.pp [21:29:49] all right, shit hit fan [21:30:22] ExtensionMessages-1.22wmf1.php gets generated with PHP warnings [21:30:27] oh shit [21:30:33] Reedy: ^^ [21:30:39] i think i may know why varnishes are suddenly having errors [21:30:39] bacause several extensions are present in extensions-list [21:30:42] maxing out interfaces! [21:30:46] but are absent from 1.22 [21:30:48] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57166 [21:30:54] Reedy, ^^ [21:31:16] Eh? [21:31:23] Oh.. [21:31:35] I had to abort scap [21:31:39] Ignore it [21:31:42] It's fine to carry on [21:31:51] Reedy, no [21:31:52] 110 PHP Warning: Cannot modify header information - headers already sent by (output started at /home/wikipedia/common/wmf-config/ExtensionMessages [21:31:52] -1.22wmf1.php:9) in /home/wikipedia/common/php-1.22wmf1/includes/WebResponse.php on line 38 [21:32:22] o_0 [21:32:31] I never thought that it gets included by apache scripts [21:32:41] Scap was fine for me yesterday at least twice [21:32:44] I thought it's for shell only [21:32:51] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:28] Uhh [21:33:33] Where's it included? :/ [21:34:01] Oh [21:34:01] require( "$wmfConfigDir/ExtensionMessages-$wmfExtendedVersionNumber.php" ); [21:34:44] I'm confused why that's apparently causing header problems [21:35:23] And only just started now.. [21:36:20] Only srv193? [21:36:21] Do we care? [21:36:38] Reedy, once we sync it will be on mw.o and test2 [21:36:47] Do it? [21:36:49] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [21:36:53] and once you switch moar wikis to it.. 
[21:37:11] We have random header warnings like that appear from time to time [21:37:15] they go away by themselves too [21:37:49] Reedy, 110 warnings in a very short time [21:37:55] fun [21:37:56] !log reedy synchronized wmf-config/ [21:38:02] Logged the message, Master [21:38:03] All testwiki? [21:38:15] yes, while the scap was running [21:38:30] No one cares about testwiki. Maybe. Apparently. Or something [21:38:32] and I aborted really quickly and removed the errors manually [21:38:58] sooooo [21:39:13] I could run the remaining scap commands manually [21:40:19] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:41:18] binasher: did https://gerrit.wikimedia.org/r/#/c/52606/ get pushed out to production? [21:41:29] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [21:42:03] !log maxsem Started syncing Wikimedia installation... : [21:42:10] Logged the message, Master [21:42:25] so yeah I'm doing it manually [21:42:29] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.012 second response time [21:42:43] awjr: if it's merged and > 30mins have passed, yes :) [21:42:50] New patchset: RobH; "brandon black added to roots as new ops team member" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57199 [21:42:52] thanks paravoid [21:43:19] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [21:44:36] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57199 [21:45:22] Reedy: is there actually a problem? [21:45:29] I've no idea [21:52:19] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [21:54:49] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [21:54:59] awjr: yeah [21:55:07] thanks binasher [21:56:20] lesliecarr: new ex4200 on a3? [21:56:32] yep, don't attach it to the braid yet and give me the serial number please [21:57:24] if [ "`uname -s`" != Linux ]; then [21:57:30] echo "ERROR: This script requires the Linux operating system to function correctly" [21:57:33] so..i have a problem with that locaation...no available power [21:57:43] gah! [21:57:45] noes!!! [21:57:59] * AaronSchulz finds that amusing, maybe someone else is going to use mw-update-l10n and needs to be aware ;) [21:58:09] what about relocating several? but that means downtime [21:58:09] hrm… so i think all the machines on that rack are in use … double checking racktables [21:58:28] the text squids are underutilized, so that could work [21:58:55] we could move to row c? [21:59:16] maybe cp1019 and cp1020 ? [21:59:25] would those from a physical point of view be ok to move ? [21:59:49] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:00:19] they would...also cp1037/9/40 are turned off atm [22:00:26] they've been off since I've been here [22:01:16] AaronSchulz: hashar did actually try to run it on Mac OS X [22:01:19] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [22:01:24] and then he complained in gerrit about all the things that broke [22:01:29] so I added that error message in response [22:01:30] lol [22:03:00] oh realy ? 
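Since the generated ExtensionMessages file is require()'d by the web config, any warning text that gets written into it during generation becomes stray output and produces exactly the "headers already sent" errors quoted above. A cheap sanity check before syncing it out might be (file path from the log; just a sketch):

    f=/home/wikipedia/common/wmf-config/ExtensionMessages-1.22wmf1.php
    head -n 1 "$f"            # should be exactly "<?php"
    grep -n 'Warning' "$f"    # should print nothing; warning text here means stray output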
[22:03:23] well turned off is the easiest to do :) [22:03:36] so yeah, move those, and let me know the new location, I'll give them dns [22:04:41] really just taking out cp1039 and cp1040 should give us enough power for the new switch ? yeah ? [22:04:50] lesliecarr: yes [22:05:13] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:05:30] cool :) let me know the new ports/locations :) [22:05:35] woot [22:08:05] !log maxsem Finished syncing Wikimedia installation... : [22:08:11] Logged the message, Master [22:08:13] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [22:08:30] lesliecarr: they're going to asw-c7 0/25 amd 0/26 [22:08:40] New patchset: Pyoungmeister; "need to include --defaults-file for init script to actually work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57204 [22:08:53] cool :) [22:10:57] !log maxsem synchronized php-1.21wmf12/extensions/MobileFrontend [22:11:04] Logged the message, Master [22:12:13] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [22:14:15] !log maxsem synchronized php-1.21wmf12/extensions/MobileFrontend 'touch' [22:14:22] Logged the message, Master [22:16:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57204 [22:17:13] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:17:35] news everyone! the first of the pre-labsdb dbs is slaving [22:17:38] the data is flowing! [22:17:48] wooot [22:17:49] the spice must flow [22:18:04] the bits must flow... MUHAHAHAHAHAHA [22:18:29] also, I really like writing the phrase "pre-labsdb dbs" [22:19:56] say that 5 times fast [22:19:59] New patchset: Reedy; "Remove wgUseMemCached, died in 1.17" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57206 [22:20:19] !log authdns-update for cp1039 and cp1040 move [22:20:25] Logged the message, Mistress of the network gear. [22:23:53] !log maxsem synchronized php-1.21wmf12/includes/resourceloader/ResourceLoaderStartUpModule.php [22:24:00] Logged the message, Master [22:25:43] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [22:27:13] PROBLEM - Host db1053 is DOWN: PING CRITICAL - Packet loss = 100% [22:28:03] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:28:41] hi, who grants access to stats? https://rt.wikimedia.org/Ticket/Display.html?id=4835 [22:31:57] whoever is on rt duty [22:32:13] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [22:32:15] New patchset: Ori.livneh; "$wgNavigationTimingSamplingFactor: 10000 => 5000." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57207 [22:32:23] RECOVERY - Host db1053 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:33:17] PROBLEM - SSH on db1053 is CRITICAL: Connection refused [22:33:17] PROBLEM - NTP on db1053 is CRITICAL: NTP CRITICAL: No response from NTP server [22:34:06] ^^ binasher: see commit message. this will up the rate of navtiming events from ~2.75/s (current) to ~5.5/s. cool by you? [22:34:14] yurik: /topic says andrewbogott_afk [22:34:20] who is... afk :) [22:34:27] yurik: are your 3 days up? [22:34:48] jeremyb_, i think so [22:34:54] ori-l: yep! 
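A quick sanity check on that sampling change (assuming the factor means one event per N page views): roughly 2.75 events/s at 1-in-10000 implies about 2.75 x 10000 = 27,500 sampled page views per second, so halving the factor to 1-in-5000 at the same traffic level doubles the event rate to about 5.5/s, matching the commit message.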
[22:35:17] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:35:18] yurik: you could make a gerrit patchset (or i could if you like) and then you need to find someone to merge it. :) [22:35:53] jeremyb_, i could, which file though? [22:35:57] ori-l: mark has volunteered to write us something that will do ip address -> asa routing information, based on the current bgp tables on our routers [22:36:12] binasher: ooohhhh, very cool [22:36:22] yurik: you'll need to edit both manifests/site.pp and manifests/admins.pp [22:36:27] yurik: in operations/puppet [22:36:35] yurik: (this is stat1, right?) [22:36:39] yep [22:36:42] ori-l: waiting til we're capturing that to worry about visualizing or graphing, since just be country isn't necessarily going to be very useful [22:36:43] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56696 [22:36:46] i'm sure other servers will follow ;) [22:36:52] though might still be useful for trending [22:37:05] yurik: you'll probably be going into admins::restricted [22:37:53] yurik: https://gerrit.wikimedia.org/r/56958 can be your guide [22:38:09] thanks jeremyb_ ! [22:38:36] jeremyb_, are you sure, that looks like a removal/refactoring patch [22:38:47] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:38:49] binasher: makes sense [22:39:23] yurik: right. but you'll just do the reverse :) [22:39:31] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/56696/' [22:39:38] Logged the message, Master [22:39:46] yurik: or do you want me to do it for you? [22:39:56] its ok, need to learn :) [22:40:10] ok [22:40:28] just make sure you add to stat1 not vanadium [22:40:37] cmjohnson1: how's the moving going ? [22:40:40] ori-l: did you clean up vanadium manually? [22:41:34] the new switch is in the rack...(we are utilizing all possible u's) [22:41:48] i did not move the other servers since they were already off...figured i do that last (lesliecarr) [22:41:48] hehe awesome [22:41:53] ok cool :) [22:41:59] what's hte serial number of that switch ? [22:42:17] before we break open the braid and add that switch [22:42:17] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [22:42:58] RECOVERY - SSH on db1053 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:43:43] i just realized your patch won't remove all those people. (https://gerrit.wikimedia.org/r/56958) [22:43:47] ori-l: ^ [22:44:19] BP0211500170 (lesliecarr) [22:44:37] woot [22:46:17] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [22:47:04] jeremyb_, what number should i use for $uid in admins.pp? [22:47:39] yurik: just pick a new one that's low but also higher than everything else in the file [22:47:59] and keep in mind that they're not sorted [22:48:18] cmjohnson1: so attach one of the braid cables to the new switch ? [22:48:22] which is "member 9" [22:48:50] one of the cables going to asw-a3? [22:49:20] !log maxsem synchronized php-1.21wmf12/resources/Resources.php 'https://gerrit.wikimedia.org/r/57208' [22:49:28] Logged the message, Master [22:50:05] jeremyb_: what do you mean? home directories? i left them alone, since they weren't taking up much space anyhow [22:50:23] ori-l: look at the current live def of stat1 [22:50:25] yeah, do we have any extra cables ? [22:50:32] yurik: i'm sorry. 
not *everything else* in the file. everything else below 1000 [22:50:46] jeremyb_: live def? [22:50:49] jeremyb_, no worries, figured :) [22:51:15] (eep i hope, because i'd prefer to not have it single-braided) [22:51:17] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:51:17] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:51:41] ori-l: removing those lines doesn't disable their accounts [22:51:53] jeremyb_: oh, i removed them from /etc/passwd [22:52:03] i thought that that was cleaner than a bunch of ensure => absents [22:52:31] ori-l: ewww [22:52:38] jeremyb_, in site.pp, should i add myself to "include accounts::yurik, or to sudo_user { [ ... ? [22:52:47] that patch you showed me used sudo [22:52:51] yurik: depends what your ticket says :) [22:52:56] but i don't think i need that [22:53:01] i just need read access [22:53:09] ori-l: leave them in /etc/passwd but remove their sudo and ~/.ssh/authorized_keys [22:53:13] yurik: right [22:53:18] jeremyb_: did that too [22:53:21] jeremyb_, ticket doesn't say anything about it :) [22:53:31] jeremyb_, right what? :) [22:53:32] yurik: so, submit and I'll review [22:53:39] include? [22:53:48] yurik: "i just need read access" means no sudo [22:54:42] lesliecarr: idk...i think we do but I need to check [22:55:07] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [22:55:25] ok [22:57:19] New patchset: Yurik; "(rt 4835) Added yurik account for stat1 non-sudo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [22:57:30] jeremyb_, ^ [22:58:07] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [23:00:57] !log maxsem synchronized php-1.21wmf12/resources/Resources.php 'https://gerrit.wikimedia.org/r/#/c/57212/' [23:00:59] lesliecarr: what are we calling the switch asw2-a3? [23:01:03] Logged the message, Master [23:01:12] sounds good to me [23:01:13] :) [23:01:16] it's the tradition [23:02:20] yurik: ugh, a 1024 bit key? [23:02:23] let's keep up with tradition than [23:02:33] idk if we have a policy on key length... [23:03:14] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [23:03:27] lesliecarr: you have console access now [23:04:08] yay [23:04:32] jeremyb_, i think so, i used puttygen to generate SSH-2 RSA 1024 key. Should i use something else? [23:05:14] PROBLEM - DPKG on db1053 is CRITICAL: NRPE: Command check_dpkg not defined [23:05:24] PROBLEM - Disk space on db1053 is CRITICAL: NRPE: Command check_disk_space not defined [23:05:34] PROBLEM - RAID on db1053 is CRITICAL: NRPE: Command check_raid not defined [23:05:35] yurik: i'd say 2048 or 4096 [23:05:44] yurik: and ewwwww, windows??? [23:05:45] SSH2 RSA? [23:05:48] yes [23:05:51] hehe [23:05:55] looks ready to attach to the stack now :) [23:06:11] * MaxSem throws CP/M at jeremyb_ [23:06:14] * anomie prepares to make use of the lightning deploy window [23:06:41] anomie, RC continue? [23:06:47] yurik- Yes [23:07:34] RECOVERY - RAID on db1053 is OK: OK: State is Optimal, checked 2 logical device(s) [23:08:14] RECOVERY - DPKG on db1053 is OK: All packages OK [23:08:24] RECOVERY - Disk space on db1053 is OK: DISK OK [23:08:35] lesliecarr: okay..i have cables...how do you want to do this? 
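On the key-length point jeremyb_ raises: a 1024-bit RSA key can simply be regenerated at 2048 or 4096 bits with OpenSSH and the new public half put into the puppet patch. A minimal sketch; the comment string and file path here are just examples, not what was actually used:

```sh
# Generate a 4096-bit RSA key pair instead of a 1024-bit PuTTY-generated one.
ssh-keygen -t rsa -b 4096 -C "yurik@stat1" -f ~/.ssh/id_rsa_wmf
# The public key (~/.ssh/id_rsa_wmf.pub) is what goes into manifests/admins.pp.
```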
[23:10:12] so i think in the end it should go asw-a2 <-> asw-a3 <-> asw2-a3 <-> asw-a4 [23:10:14] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [23:10:26] !log anomie synchronized php-1.21wmf12/includes/api/ApiQueryRecentChanges.php 'Fix for API list=recentchanges rccontinue' [23:10:33] Logged the message, Master [23:10:35] but to start unplug the asw-a3 to asw-a4 cable [23:10:53] !log anomie synchronized php-1.22wmf1/includes/api/ApiQueryRecentChanges.php 'Fix for API list=recentchanges rccontinue' [23:11:00] Logged the message, Master [23:11:04] RECOVERY - NTP on db1053 is OK: NTP OK: Offset -0.009376049042 secs [23:11:07] every time i look at this file i find problems [23:11:18] siebrand: daniel kinzler has a key named siebrand? [23:12:10] hah [23:12:25] * anomie is done. LIGHTNING DEPLOY!!!11one [23:12:25] give me a few mins missing some labels [23:12:35] anomie: :) [23:13:42] New patchset: Pyoungmeister; "pointing search traffic back at eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57215 [23:13:44] PROBLEM - Varnish HTTP upload-backend on cp1028 is CRITICAL: Connection refused [23:14:48] lesliecarr: are you monitoring..i am taking about what I believe to be the asw-a3-asw-a4 [23:15:00] cmjohnson1: are you about? [23:15:14] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [23:15:15] oh, oyu just said something, so probably :) [23:15:16] yes but in the middle of something [23:15:18] ok [23:15:33] no biggy. will just make a ticket :) [23:15:44] RECOVERY - Varnish HTTP upload-backend on cp1028 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.020 second response time [23:15:49] ok..cool [23:15:50] thx [23:16:06] cmjohnson1: yes monitoring [23:17:29] yay see the unplugged cable [23:17:32] yurik: so want to make a new key? [23:17:37] ok...right one? [23:17:43] yep [23:17:47] cool [23:17:58] jeremyb_, commiting, one sec [23:18:06] wanna make that asw2-asw4 [23:18:19] sorry asw3 -4 [23:19:04] why don't you hook up asw-a3's interface vcp1 to asw2-a3 vcp0 ? [23:19:11] yurik: hold on [23:19:29] yurik: also, add yourself to admins::restricted. in admins.pp [23:19:56] ok [23:20:41] anomie|away-ish: thanks for noting it on the deploy wiki [23:20:48] greg-g- You're welcome [23:21:31] New patchset: Yurik; "(rt 4835) Added yurik account for stat1 non-sudo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [23:21:33] jeremyb_, ^ [23:23:14] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [23:23:56] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57215 [23:24:41] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqiad' [23:24:42] why do some people have , at the end of the key and some ; ? [23:24:47] Logged the message, Master [23:25:04] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [23:25:25] i hate this file. i want to switch us to hiera... [23:26:16] jeremyb_: {{sofixit}} ;) [23:26:25] Reedy: ikr [23:26:34] and it's not even a wiki [23:26:36] omg [23:27:01] Reedy: hey at least there's a reasonable chance i could fix it. stuff like DNS i can't fix [23:27:30] You could propose patches based on the unknown... 
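The patch workflow yurik is being walked through above is the usual clone-edit-submit loop against operations/puppet. A rough sketch, assuming the git-review tool is installed; the clone URL, branch name, and commit message are illustrative:

```sh
# Sketch of the Gerrit workflow described above (details illustrative).
git clone https://gerrit.wikimedia.org/r/operations/puppet
cd puppet
git checkout -b rt4835-yurik-stat1
# edit manifests/admins.pp (new admins::restricted account, unique uid below 1000)
# and manifests/site.pp (include the new account class on the stat1 node definition)
git add manifests/admins.pp manifests/site.pp
git commit -m "(RT 4835) Add yurik account for stat1, non-sudo"
git review          # pushes the change to Gerrit for someone with +2 to merge
```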
[23:27:31] :D [23:28:00] New patchset: Pyoungmeister; "using db1057 for prelabsdb db instead of db1055" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57218 [23:29:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57218 [23:30:04] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:32:14] PROBLEM - RAID on analytics1022 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on wtp1002 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on stafford is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on analytics1015 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on wtp1003 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - Disk space on db1033 is CRITICAL: Timeout while attempting connection [23:33:24] PROBLEM - Host cp1023 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1018 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1012 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1014 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1024 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:25] PROBLEM - Host cp1006 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:25] PROBLEM - Host cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:26] PROBLEM - Host cp1016 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:26] PROBLEM - Host cp1015 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:35] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57207 [23:33:49] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57178 [23:34:02] PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:02] PROBLEM - Host cp1044 is DOWN: CRITICAL - Host Unreachable (208.80.154.54) [23:34:02] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [23:34:02] PROBLEM - Host cp1043 is DOWN: CRITICAL - Host Unreachable (208.80.154.53) [23:34:02] PROBLEM - Host cp1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:03] PROBLEM - Host cp1003 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:03] PROBLEM - Host cp1009 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:04] PROBLEM - Host cp1008 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:04] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.027 second response time [23:34:07] PROBLEM - DPKG on cp1021 is CRITICAL: Timeout while attempting connection [23:34:07] PROBLEM - Host cp1042 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:07] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:07] PROBLEM - Host cp1036 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:07] PROBLEM - Host cp1013 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:08] PROBLEM - Host cp1034 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:08] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:09] PROBLEM - Host cp1007 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:11] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [23:34:14] RECOVERY - Host cp1002 is UP: PING WARNING - Packet loss = 50%, 
RTA = 0.41 ms [23:34:15] RECOVERY - Host cp1001 is UP: PING WARNING - Packet loss = 50%, RTA = 0.32 ms [23:34:15] RECOVERY - Host cp1003 is UP: PING WARNING - Packet loss = 50%, RTA = 0.37 ms [23:34:15] RECOVERY - Host cp1007 is UP: PING WARNING - Packet loss = 44%, RTA = 0.31 ms [23:34:15] RECOVERY - Host cp1006 is UP: PING WARNING - Packet loss = 44%, RTA = 0.28 ms [23:34:15] RECOVERY - Host cp1008 is UP: PING WARNING - Packet loss = 44%, RTA = 0.29 ms [23:34:20] egads [23:34:21] RECOVERY - Host cp1012 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [23:34:21] RECOVERY - Host cp1009 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [23:34:21] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [23:34:21] RECOVERY - Host cp1015 is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [23:34:21] RECOVERY - Host cp1034 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [23:34:22] RECOVERY - Host cp1016 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [23:34:22] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [23:34:23] RECOVERY - Host cp1018 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [23:34:23] RECOVERY - Host cp1013 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [23:34:24] RECOVERY - Host cp1014 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [23:34:24] RECOVERY - Host cp1023 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [23:34:25] RECOVERY - Host cp1042 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [23:34:25] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:34:26] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 16%, RTA = 38.41 ms [23:34:26] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [23:34:27] RECOVERY - Host cp1024 is UP: PING OK - Packet loss = 16%, RTA = 61.46 ms [23:34:27] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [23:34:28] RECOVERY - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61448 bytes in 0.052 second response time [23:34:30] bad switch [23:34:31] whoa [23:34:31] RECOVERY - DPKG on cp1021 is OK: All packages OK [23:34:31] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61448 bytes in 0.025 second response time [23:34:32] leslicarr: wtf? [23:34:40] LeslieCarr: are those all on one switch? [23:34:45] the "totally not impacting" procedures …. 
impact [23:34:46] yep [23:34:47] yeah, asw-a-eqiad [23:34:48] bad switches get stitches [23:34:49] ah, ok [23:34:52] :D [23:34:54] binasher: true story [23:35:01] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19631 bytes in 0.004 second response time [23:35:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [23:35:14] ouch [23:35:30] so asw-a2 <<>> asw2-a3 is linked [23:36:41] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:37:31] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:38:11] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:42:31] PROBLEM - RAID on wtp1 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - RAID on analytics1014 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - RAID on analytics1021 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection timed out [23:42:42] PROBLEM - RAID on snapshot1 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - RAID on ms-be8 is CRITICAL: Timeout while attempting connection [23:42:43] PROBLEM - RAID on snapshot1004 is CRITICAL: Timeout while attempting connection [23:42:43] PROBLEM - RAID on solr1 is CRITICAL: Timeout while attempting connection [23:42:44] PROBLEM - RAID on analytics1011 is CRITICAL: Timeout while attempting connection [23:42:44] PROBLEM - RAID on wtp1004 is CRITICAL: Timeout while attempting connection [23:42:51] PROBLEM - RAID on snapshot3 is CRITICAL: Timeout while attempting connection [23:42:51] PROBLEM - RAID on solr1001 is CRITICAL: Timeout while attempting connection [23:42:51] PROBLEM - RAID on ms-be6 is CRITICAL: Timeout while attempting connection [23:43:31] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:51] PROBLEM - Host cp1044 is DOWN: CRITICAL - Host Unreachable (208.80.154.54) [23:43:51] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [23:43:56] LeslieCarr: ? [23:44:01] PROBLEM - Host cp1043 is DOWN: CRITICAL - Host Unreachable (208.80.154.53) [23:44:11] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:44:11] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 40.81 ms [23:44:15] my finely tuned spider sense says there might be a subtle problem somewhere [23:44:19] bad gateway for me on mw.org [23:44:21] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:44:21] PROBLEM - Host cp1041 is DOWN: PING CRITICAL - Packet loss = 100% [23:44:31] and for me on enwiki [23:44:31] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.162 second response time [23:44:31] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [23:44:31] RECOVERY - Host cp1041 is UP: PING OK - Packet loss = 0%, RTA = 42.54 ms [23:44:32] sorry [23:44:35] bblack: I think you're on to something... [23:44:35] is it ok now ? [23:44:41] that was my fault [23:44:46] bblack: is that puppetized? [23:44:51] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 863 bytes in 0.001 second response time [23:44:59] oh ? [23:45:02] what's up cmjohnson1 ? 
[23:45:07] lesliecarr: so asw-a3 is connected asw2-a3 [23:45:33] i pulled the cable from asw-a2 [23:45:38] ah ok [23:45:41] !log olivneh synchronized wmf-config/CommonSettings.php '(Ibc2633f1c) : 10000 => 5000.' [23:45:47] hrm this is interesting [23:45:48] Logged the message, Master [23:45:50] cp1041 is showing significant increase in memory usage and load over the last hour or so; we just finished a mobile deployment, about an hour ago… is this likely related to the switch issue, or something we did/ [23:45:51] ? [23:45:51] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19669 bytes in 0.002 second response time [23:45:54] !log olivneh synchronized wmf-config/InitialiseSettings.php '(Ia2665c4fe) Enable PostEdit on bn, br, ca, cs, et, ka and zh wikis' [23:46:01] Logged the message, Master [23:46:02] the rest of the mobile varnish cache cluster looks pretty normal tho [23:46:08] waiting for member 9 to join … but it's not very happy [23:46:16] awjr: nope, switch issue [23:46:22] ok phew [23:46:27] for me, at least :p [23:47:47] ok so it's "sort of connected" management wise but not attaching tot he forwarding plane [23:47:49] investigating [23:49:27] New patchset: Jeremyb; "(RT 4835) Added yurik account for stat1 non-sudo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [23:49:58] lesliecarr: do i need to connect asw2-a3 to asw-a4? [23:50:07] yurik: made a couple tweaks [23:50:24] yes but not yet since it's still not actively working - if you'd like to cnnect the second interfaces of the cp's now , that would be ok [23:50:26] oops, sorry about spacing - my space comp was off [23:51:06] lesliecarr: all cp's or just cp1021-1034? [23:51:18] yurik: spacing? that was someone else's mistake. i'm just taking the opportunnity to fix it [23:51:35] hmm, i thought i made that, nvm [23:52:13] just cp1021 to 1034 [23:52:14] :) [23:54:45] New review: Jeremyb; "Looks good to me (assuming the ticket is ok, haven't seen it)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57210 [23:55:01] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [23:57:30] yurik: so, poke andrewbogott now that he's back :) [23:57:55] yurik, what's up? [23:58:05] https://gerrit.wikimedia.org/r/57210 [23:58:10] andrewbogott, ^ :) [23:59:27] and jeremyb_, don't worry about the RT ticket, i faked it using tfinc's stollen laptop ;) [23:59:39] !log olivneh synchronized php-1.21wmf12/extensions/PostEdit [23:59:45] Logged the message, Master
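On the "connected management-wise but not attaching to the forwarding plane" symptom: on an EX4200 stack this would normally be diagnosed from the virtual-chassis master by checking member and VC-port state. A sketch of the sort of commands involved, run here over ssh to the switch; the hostname is illustrative and the exact diagnosis steps used that night are not in the log:

```sh
# From a host with CLI access to the virtual-chassis master (Junos EX4200);
# "asw-a3.mgmt" is a placeholder hostname.
ssh asw-a3.mgmt "show virtual-chassis status"    # is member 9 listed and present?
ssh asw-a3.mgmt "show virtual-chassis vc-port"   # are the vcp links between members Up?
```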