[00:00:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [00:00:55] Logged the message, Master [00:01:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57014 [00:02:13] Scap really is overkill.. [00:02:29] looks better now though :) [00:02:30] thx [00:02:40] probably wants a different logo? [00:02:42] scap rules everything around me: s.c.r.e.a.m. [00:02:59] Sue's face? [00:03:03] hah [00:03:04] lol Reedy [00:03:16] mutante, the wmf one should do [00:03:17] Reedy: uh, what? [00:03:27] notpeter: ? [00:03:38] Reedy: nevermind, I misread that [00:04:07] oh, where's the incubator wiki it refers too? hah [00:04:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:04:37] lesliecarr: are we leaving search1024 like this for awhile? [00:04:37] mutante that's where new languages grow before they move to new wikis [00:04:51] where is the text of https://transitionteam.wikimedia.org/w/index.php?title=Main_Page&action=edit - can it be edited? [00:05:02] making those sister project links protocol relative would be nice [00:05:20] extensions/WikimediaMaintenance [00:05:28] mutante: Did you update both interwiki caches? [00:05:47] https://gerrit.wikimedia.org/r/#/c/42133/ [00:05:54] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 16033 MB (1% inode=99%): [00:06:24] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [00:08:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:07:54 UTC 2013 [00:08:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:08:53] Thehelpfulone: i know there wasn't a real incubator for this one:) [00:09:00] Reedy: both? [00:09:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:09:00 UTC 2013 [00:09:06] still syncing [00:09:08] oh i see [00:09:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:09:23] Reedy: the one that creates a new .cdb [00:09:24] mutante: Cancel it [00:09:53] New review: Reedy; "This needs updating to how things are now..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/42133 [00:10:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:09:58 UTC 2013 [00:10:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:10:34] !log reedy synchronized php-1.21wmf12/cache/interwiki.cdb 'Updating 1.21wmf12 interwiki cache' [00:10:39] Logged the message, Master [00:10:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:10:51 UTC 2013 [00:11:12] Reedy: eh, ok, canceled. stopping in the middle of a sync ..though.. [00:11:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:11:44] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:11:38 UTC 2013 [00:12:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:12:24] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:12:18 UTC 2013 [00:12:39] A new wiki was created by apache at Mon, 01 Apr 2013 23:27:12 GMT for a Wikimedia in English (en). [00:12:39] -> on newprojects mailing list, this used to say who ran it, what changed? [00:13:13] "used to"? [00:13:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:13:14] When? 
[00:14:20] !log reedy synchronized php-1.22wmf1/cache/interwiki.cdb 'Updating 1.22wmf1 interwiki cache' [00:14:26] Logged the message, Master [00:14:44] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 00:14:37 UTC 2013 [00:15:19] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [00:15:49] A new wiki was created by reedy at Wed, 06 Feb 2013 23:45:09 GMT for a Wikipedia in Baso Minangkabau (min). [00:15:57] then the next one was A new wiki was created by apache at Tue, 05 Mar 2013 22:01:49 GMT for a Wikimedia in English (en). [00:16:00] What about the ones I did last week? [00:16:02] Lol [00:16:03] New patchset: Reedy; "Add script to update the interwiki cache on all currently deployed MW versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:16:12] You can "blame" Tim for that [00:16:28] New patchset: Dzahn; "change logo for transitionteam wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57017 [00:16:35] Thehelpfulone: ^ [00:17:10] New patchset: Reedy; "Add script to update the interwiki cache on all currently deployed MW versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:18:43] does that change the favicon too? [00:19:03] No [00:19:38] might as well do that too then mutante? [00:20:20] New patchset: Dzahn; "change logo and favicon for transitionteam wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57017 [00:20:57] or do you want black-globe.ico :p [00:21:40] New review: Reedy; "* The user must have write access to the directory, for temporary file creation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:22:01] heh, nah that will do :P [00:23:12] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57017 [00:23:29] how do you create the first account on a private wiki mutante, using a script? [00:24:04] RECOVERY - RAID on db1001 is OK: OK: State is Optimal, checked 2 logical device(s) [00:29:06] createAndPromote [00:30:21] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [00:30:26] Logged the message, Master [00:30:45] Thehelpfulone: logo/favicon done [00:31:48] New review: Asher; "I agree with Faidon, the existing behavior of redirecting m.wikipedia.org/$uri to en.m.wikipedia.org..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55302 [00:33:14] mutante, was it deployed? 
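For context on the two interwiki-cache syncs logged above: a minimal sketch of the deployment step behind those "!log ... synchronized ... interwiki.cdb" lines, assuming the standard sync-file wrapper used for MediaWiki deploys (which auto-generates the !log entries). The cdb files themselves are first rebuilt with a WikimediaMaintenance script, which is what change 42133 discussed here aims to automate; that rebuild step is not shown.

    # Hedged reconstruction of the syncs logged above -- one per deployed branch.
    sync-file php-1.21wmf12/cache/interwiki.cdb 'Updating 1.21wmf12 interwiki cache'
    sync-file php-1.22wmf1/cache/interwiki.cdb 'Updating 1.22wmf1 interwiki cache'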
seems to be the same for me [00:33:55] Thehelpfulone: yes, pretty sure it's caching, i see the new ones [00:34:27] there we go [00:35:53] !log creating search index for transitionteamwiki [00:36:00] Logged the message, Master [00:36:03] PROBLEM - RAID on db1054 is CRITICAL: NRPE: Command check_raid not defined [00:36:30] New review: Reedy; "Still a problem depending on the owner of the cache dir and permissions on interwiki.cdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:36:33] PROBLEM - DPKG on db1054 is CRITICAL: NRPE: Command check_dpkg not defined [00:36:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:36:43] PROBLEM - Disk space on db1054 is CRITICAL: NRPE: Command check_disk_space not defined [00:40:56] !log restarting lucene on all pool4 servers (one by one) [00:41:02] Logged the message, Master [00:41:13] New patchset: Reedy; "Update dblists and wikiversions for transitionteamwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57021 [00:41:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57021 [00:42:36] Reedy: ooh, that is also in gerrit, sorry [00:43:06] !log now running the image img_media_mime migration on commons (the big one) [00:43:12] Logged the message, Master [00:44:25] !log reedy synchronized wmf-config/ [00:44:31] Logged the message, Master [00:44:51] commons image table migration currently estimated to take 8 hours.. wee! [00:46:08] now 9 hours [00:48:08] New patchset: Reedy; "Move interwiki.cdb and trusted-xff.cdb into wmf-config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57023 [00:48:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57023 [00:49:55] Reedy: where is createAndPromote [00:50:03] maintenance/createAndPromote.php [00:50:08] thx [00:52:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [00:54:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:54:30] New patchset: Reedy; "Add script to update the interwiki cache on all currently deployed MW versions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [00:55:53] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:57:26] New patchset: Reedy; "Move target of noc cdb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57025 [00:58:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [00:59:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57025 [01:00:59] New patchset: Reedy; "Add script to update the interwiki cache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [01:02:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [01:03:23] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [01:04:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [01:05:04] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:14] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 
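The createAndPromote.php pointer above is MediaWiki's standard way to bootstrap the first account on a new private wiki such as transitionteamwiki. A minimal sketch follows; the wiki name comes from this log, while the username, password, and exact flag names are assumptions about the then-current maintenance script, not a verified command.

    # Create the first user on the new private wiki and grant sysop + bureaucrat
    # (username and password below are placeholders).
    mwscript createAndPromote.php --wiki=transitionteamwiki --sysop --bureaucrat 'ExampleAdmin' 'example-password'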
[01:06:44] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 16312 MB (1% inode=99%): [01:06:54] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [01:07:54] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [01:10:44] PROBLEM - search indices - check lucene status page on search14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:56] daaaaamn it [01:11:25] icinga-wm: * Starting Lucene Search daemon [ OK ] [01:11:53] New patchset: Reedy; "Make extensions/WikimediaMaintenance/filebackend/setZoneAccess.php wikiless" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57026 [01:12:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57026 [01:15:06] !log removing labstore1 and labstore2 entries from projectstorage.wmnet rr dns entry in preparation for shrinking volumes [01:15:11] Logged the message, Master [01:18:14] PROBLEM - search indices - check lucene status page on search17 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 55856 bytes in 0.112 second response time [01:18:14] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:18:36] New patchset: Reedy; "Remove readonly.dblist. Essentially a dupe of closed.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57027 [01:21:54] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.017 second response time [01:22:35] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [01:25:02] New patchset: Reedy; "Reduce the amount of times the database lists are read in" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57028 [01:33:33] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [01:33:33] RECOVERY - search indices - check lucene status page on search13 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.112 second response time [01:34:04] yay [01:37:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:40:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:53:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [01:56:34] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:03:23] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [02:04:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [02:06:04] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [02:06:34] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15868 MB (1% inode=99%): [02:10:24] !log LocalisationUpdate completed (1.21wmf12) at Tue Apr 2 02:10:24 UTC 2013 [02:10:30] Logged the message, Master [02:17:14] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:17:34] !log LocalisationUpdate completed (1.22wmf1) at Tue Apr 2 02:17:33 UTC 2013 [02:17:40] Logged the message, Master [02:19:04] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1600 bytes in 2.192 second response time [02:19:24] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:24] PROBLEM - Apache HTTP on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:24] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:24] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:35] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1600 bytes in 2.172 second response time [02:19:44] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:19:44] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:04] PROBLEM - MySQL Slave Running on db1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:20:04] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:04] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61290 bytes in 0.308 second response time [02:20:07] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:07] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:07] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:07] PROBLEM - Apache HTTP on mw1220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:14] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:14] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:14] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.075 second response time [02:20:34] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:35] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [02:20:35] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:20:35] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:20:35] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61290 bytes in 0.215 second response time [02:20:44] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:44] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:54] RECOVERY - MySQL Slave Running on db1017 is OK: OK replication [02:20:54] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [02:20:54] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [02:20:54] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [02:20:54] 
RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [02:20:55] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [02:21:04] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:21:04] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [02:21:18] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [02:21:18] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [02:21:18] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [02:21:24] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:21:34] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [02:21:35] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [02:21:35] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [02:21:35] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [02:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:35] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [02:26:35] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:33:24] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [02:38:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:53:13] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [03:02:23] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [03:04:19] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [03:05:59] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15474 MB (1% inode=99%): [03:05:59] RECOVERY - RAID on db1054 is OK: OK: State is Optimal, checked 2 logical device(s) [03:05:59] RECOVERY - Disk space on db1054 is OK: DISK OK [03:05:59] RECOVERY - DPKG on db1054 is OK: All packages OK [03:06:29] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [03:14:19] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:34:19] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [04:04:45] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:06:22] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15126 MB (1% inode=99%): [04:06:52] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: 
Defunct disk drive count: 1 [04:08:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:07:57 UTC 2013 [04:08:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:09:12] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:09:04 UTC 2013 [04:09:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:10:13] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:10:02 UTC 2013 [04:10:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:11:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:10:55 UTC 2013 [04:11:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:11:52] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:11:42 UTC 2013 [04:12:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:13:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:12:56 UTC 2013 [04:13:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:14:32] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 04:14:28 UTC 2013 [04:14:42] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [04:16:23] New patchset: Tim Starling; "Reduce non-video job queue size from 320 to 112" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [04:18:05] New patchset: Tim Starling; "Reduce non-video job queue size from 320 to 112" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [04:27:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [04:47:23] New patchset: Ryan Lane; "Use https for public puppet repo remote" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57033 [05:04:28] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [05:06:08] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15729 MB (1% inode=99%): [05:06:38] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [05:21:19] New patchset: MZMcBride; "Reduce non-video job queue size from 320 to 112" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [05:27:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:28:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [05:28:47] New patchset: Tim Starling; "Reduce non-video job queue size from 320 to 144" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:30:23] New review: Tim Starling; "PS4: increase dprioprocs from 5 to 7 at Aaron's suggestion, and fix the wikiadmin process limit in t..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:35:13] PROBLEM - DPKG on vanadium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [05:36:13] RECOVERY - DPKG on vanadium is OK: All packages OK [05:37:13] New patchset: Aaron Schulz; "Reduce non-video job queue size from 320 to 144" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:42:39] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57030 [05:42:43] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [05:42:43] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [05:42:43] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [05:57:23] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:33] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:34] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:57:53] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:03] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:13] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:58:14] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:23] PROBLEM - Apache HTTP on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:59:43] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.020 second response time [05:59:43] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.416 second response time [05:59:53] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.607 second response time [05:59:53] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time [06:00:03] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.093 second response time [06:00:03] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.050 second response time [06:00:03] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 747 bytes in 0.060 second response time [06:00:03] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [06:00:03] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [06:00:03] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [06:00:03] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [06:00:23] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time [06:00:23] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [06:00:23] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [06:00:23] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [06:00:23] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [06:00:23] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [06:00:23] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [06:00:24] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [06:00:24] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [06:06:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:08:39] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [06:09:09] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15342 MB (1% inode=99%): [06:26:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:27:09] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [06:27:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [06:29:59] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 06:29:55 UTC 2013 [06:30:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:30:49] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 06:30:39 UTC 2013 [06:31:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [06:31:49] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 06:31:46 UTC 2013 [06:32:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [07:04:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [07:06:04] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 15011 MB (1% inode=99%): [07:06:34] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [07:32:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second 
response time [07:47:50] !g I2a9fbe5f7522ba9fed64415b5f7b230ee50cfc23 [07:47:50] https://gerrit.wikimedia.org/r/#q,I2a9fbe5f7522ba9fed64415b5f7b230ee50cfc23,n,z [08:05:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:07:16] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14598 MB (1% inode=99%): [08:07:46] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [08:07:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 08:07:47 UTC 2013 [08:08:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:08:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 08:08:28 UTC 2013 [08:09:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [08:14:46] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 08:14:43 UTC 2013 [08:15:36] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [09:04:17] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [09:06:27] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [09:06:57] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14211 MB (1% inode=99%): [09:34:24] is the mediawiki::cgroup group already enabled on any of the servers? [09:34:37] dont see it included explicitly in puppet [10:04:15] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [10:06:25] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [10:06:55] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14572 MB (1% inode=99%): [10:53:51] j^: there's no such class [10:54:25] oh wait [10:54:27] paravoid: modules/mediawiki/manifests/cgroup.pp:class mediawiki::cgroup { [10:54:30] yeah [10:54:50] it's included by class mediawiki [10:54:51] init.pp [10:55:20] ah ok so should be used. 
[10:55:36] yes [10:55:47] we use cgroups for imagescaling nowadays [10:55:52] not sure about videoscaling though [10:55:56] now that the index is in place and i can see http://commons.wikimedia.org/wiki/Special:TimedMediaHandler i noticed that the videoscalers still have hanging processes from before that transition [10:56:29] whats the best way to kill those encodes that are running for months [10:56:44] I'll do that [10:57:19] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [10:58:05] paravoid: thanks [10:58:51] some things are also in the queue for way to long, not sure whats happening there, might be stuck for some reason during job queue updates or so [11:00:47] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [11:04:08] j^: I see no stale processes on tmh* boxes [11:04:37] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [11:06:06] paravoid: can you send me a full ps ax from tmm1001/2.eqiad [11:06:17] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14951 MB (1% inode=99%): [11:06:35] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [11:06:39] *tmh1001/2.eqiad [11:06:47] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [11:07:37] there's nothing relevant in tmh1001/tmh1002/tmh1/tmh2 [11:07:47] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [11:07:48] just jobs-loop.sh [11:08:05] what is it that Special:TMH polls? [11:08:26] I'm guessing something from the database? [11:09:04] yes thats from the database [11:09:20] its also cached if not admin so might be off [11:09:42] was never able to see it on commons until now [11:10:10] New review: Nemo bis; "Leslie, done (sorry for the spam): however, I don't know where the usual check on spence was suppose..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [11:10:40] any way to find out how many webVideoTranscode jobs are in the job queue? [11:14:29] 414 [11:14:38] commons that is [11:18:10] and how many are running on the tmh servers? [11:18:39] ps ax | grep avconv [11:18:42] 0 [11:19:00] ah wait [11:19:01] there is one now [11:19:21] 0 to 1 :) [11:20:53] so clearly jobs-loop.sh no longer does what it was doing [11:24:55] I wouldn't know :) [11:27:33] let me know if there's anything I can do to help [11:27:52] although for the more mediawiki internal parts, someone from the platform team would be more of a help [11:34:21] thanks, will try to analize and let you know if i need some more data from the running servers [12:08:19] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [12:10:29] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [12:10:59] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 14251 MB (1% inode=99%): [12:58:22] mark, hi, do you have a moment to look at https://gerrit.wikimedia.org/r/#/c/55302/ [12:59:35] telcos want to start testing, and we have been pushing it back for a bit [13:02:27] can that be split into separate patchsets for the conceptually different changes? 
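On the question above of how many webVideoTranscode jobs are queued: a hedged sketch of one way to get that number per wiki, assuming MediaWiki's maintenance/showJobs.php and its --group option, which prints a per-job-type breakdown rather than a single total.

    # Per-type job counts for Commons; the 414 quoted above would appear on the
    # webVideoTranscode line. Output format varies by MediaWiki version.
    mwscript showJobs.php --wiki=commonswiki --group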
[13:03:04] i don't like these large-all-in-one-patchset changes [13:03:45] mark, most of it is one change - consolidation of the defaults [13:04:01] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [13:04:10] adam added a few ACLs yesterday thinking it would not be a problem [13:05:19] mark, if you want i could split it up, but do you think we could merge it today? [13:06:11] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [13:06:41] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 13587 MB (1% inode=99%): [13:07:32] hmm mark I think you put this in the wrong rt ticket: https://rt.wikimedia.org/Ticket/Display.html?id=4685 [13:09:58] indeed [13:11:05] New review: Mark Bergsma; "As per previous comments per Faidon/Asher: the redirection logic can and should be done in MobileFro..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/55302 [13:11:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [13:13:24] New patchset: Mark Bergsma; "Revert "Unified default lang redirect from m. & zero. Adding three carriers for testing, too."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57061 [13:13:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57061 [13:14:08] yurik: [13:14:10] Message from VCC-compiler: [13:14:10] Expected ')' got 'carrier_vimpelcom_mobilink_pakistan' [13:14:10] (program line 73), at [13:14:10] ('mobile-frontend.inc.vcl' Line 488 Pos 36) [13:14:10] } else if (client.ip ~ acl carrier_vimpelcom_mobilink_pakistan) { [13:14:11] -----------------------------------###################################--- [13:14:20] please correct and submit a new patchset [13:21:22] Vimpelcom Pakistan? WTF, globalisation goes way too far:P [13:21:53] no, VCL bloat goes way too far :P [13:22:22] mark, are you satisfied with the caching wikitech-l thread? [13:22:52] once I finish catching up on my email, maybe ;) [13:23:30] MaxSem: I made a comment about bits that was relayed by asher [13:23:33] has this been addressed? [13:23:57] paravoid, yes [13:24:17] https://gerrit.wikimedia.org/r/#/c/56774/ [13:24:42] great, thanks [13:26:15] uhm, I don't think the proposal was that [13:27:03] anyway, let's hear mark first, no point in doing ping pongs now [13:28:02] """ [13:28:06] use something like: [13:28:06] http://bits.wikimedia.org/m/en.wikipedia.org/load.php?.. [13:28:06] Then we can if (req.url ~ "^/m/") { tag_carrier + strip the /m/ }, so the overhead only effects mobile requests. [13:28:08] New patchset: Yurik; "Unified default lang redirect from m. & zero. Adding three carriers for testing, too." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57063 [13:28:08] """ [13:28:37] s/tag_carrier/device_detection/ [13:29:02] i really don't like doing the device detection on bits too [13:30:36] mark, thanks, fixed. https://gerrit.wikimedia.org/r/#/c/57063/ [13:30:40] then we can always switch em back to .m. domains as originally intended [13:30:50] yeah I think I prefer that [13:31:10] however, what's so bad about doing device detection for select paths? 
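On the VCC compile error quoted above: the parser is rejecting the "acl" keyword inside the match expression. In VCL an ACL is declared once with "acl name { ... }" and then referenced by its bare name in "~" matches. A minimal sketch, using a placeholder netblock and header rather than the real carrier configuration:

    acl carrier_vimpelcom_mobilink_pakistan {
        "203.0.113.0"/24;        # placeholder range, illustrative only
    }

    sub vcl_recv {
        # Reference the ACL by name -- no "acl" keyword here.
        if (client.ip ~ carrier_vimpelcom_mobilink_pakistan) {
            set req.http.X-Carrier = "vimpelcom_mobilink_pakistan";   # illustrative tagging
        }
    }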
[13:31:34] i like bits currently being mean, lean and efficient [13:31:47] and since mobile tends to like to do stuff in VCL, I like to keep you off bits [13:32:09] and I don't really see disadvantages to keeping that on the mobile servers either [13:32:24] right, that's why I said this isn't exactly the proposal [13:32:36] mark, i think mobile would much rather do most of the work in php ;) [13:32:52] aren't there arguments to load.php per device? [13:33:10] bits.wm.org/(m/)load.php?device=android&foo or something? [13:33:11] paravoid, so how did your proposal sounded before it was interpreted by Asher? [13:33:27] paravoid: that's not possible with unified HTML is it [13:33:31] oh wait, this is the whole not doing ESI [13:33:33] okay, nevermind [13:33:34] yes [13:34:03] as for sharding with pipelining, we can always setup a special service IP for that of course [13:34:04] ignore me, I'm getting confused [13:34:30] New patchset: Demon; "Show notice to users who are using legacy skins" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [13:34:31] yeah, asher thought of mbits too [13:34:44] ugh, so do browser pipeline per IP or per domain? [13:34:44] or bits.m :) [13:34:45] exactly [13:34:48] per domain [13:34:56] so then we can point that at whatever varnish cluster we like [13:34:59] although from a quick googling mobile browsers seem to be better than desktop ones [13:35:00] which probably will be mobile for now [13:35:09] i.e. they have more than 2 max connections per domain [13:35:17] that's hilarious [13:35:31] http://www.guypo.com/mobile/http-pipelining-big-in-mobile/ [13:35:38] lol, in opera it's configurable with default being 8 connections per domain/64 total [13:35:59] desktop seem better than I remembered [13:36:06] newer versions I guess [13:36:19] just looked, 16 "per server" [13:38:03] sooo [13:38:05] anyway [13:38:09] other than that, good job, guys [13:38:16] i might even have to deploy those esams servers soon ;) [13:38:23] mark, btw, the patch i just submitted is rebased from master, so there are a few new bits there for device detection [13:38:23] thanks:) [13:38:34] mark: do we have IPs now? [13:38:39] not yet [13:38:46] mark, we hope to deploy this stuff next week [13:38:46] waiting for ts? [13:38:48] yes, soon [13:38:53] ok! [13:41:45] and speaking of deployment, we will need some ops attention during it [13:42:10] is there an easy way to apply current varnish config to a labs instance? I'm in a process of setting up varnish test rig so that any VCL changes are easy(er) to test, and given that there are 12 varnish files in puppets repo, I'm not sure of the best course of action [13:43:00] current varnish/caching puppet manifests don't fully work yet inside labs [13:43:27] mark, but how do you test VCL changes? [13:43:27] hashar is working on improving that in the context of the beta cluster [13:43:31] we don't [13:43:36] we test in production [13:43:36] live tests? 
:) [13:43:40] lovelly [13:43:46] best testbed ever [13:43:51] they mostly work :-] [13:43:55] could you pass me that root please [13:44:01] no [13:44:02] still have to polish up the role::cache::mobile class though [13:44:03] you'd break it [13:44:04] ;) [13:44:20] hehe [13:44:33] in theory, we're gonna use beta for that [13:44:40] in practice, it's not there yet [13:45:14] mark, beta doesn't work for us - beta is to test the stuff that has been merged into master (from what we were told) [13:45:26] beta is to test everything [13:45:48] I think different people call different things beta [13:45:56] mark, is it possible to change varnish config on beta without pulling it from git? [13:46:15] i don't think so, [13:46:21] i mean - will it be possible to edit varnish files on it [13:46:31] but once beta is in use there's no reason you couldn't setup your specific labs project for testing varnish changes [13:46:34] because without it, its a staging server for ops, not a test dev server [13:46:35] using those same manifests [13:46:43] hmm [13:46:52] talk to hashar, I don't really know exactly [13:46:59] beta is definitely also a test dev server [13:47:16] more or less :-D [13:47:27] i haven't used beta at all [13:47:32] I think we need different instances (or maybe even projects) for each component to be tested [13:47:38] exactly - if i can't ssh into the server and edit the vcl file, its not that useful :) [13:47:41] the varnish caches in beta are running the manifests that comes from puppet master. So you can't really develop anything [13:47:44] I'm not sure how this can be all called "beta", maybe that's too confusing [13:48:02] it wouldn't make sense testing a varnish change on the same cluster someone else is trying to test mediawiki [13:48:07] I can see more specific labs projects used for developing [13:48:11] and then beta as a final integrated test [13:48:14] then production [13:48:16] right [13:48:29] yeah that is the idea between beta. To test out your changes before they land in production. [13:48:34] dev should be done somewhere else [13:48:55] sure, that will work, but that leaves the question that hashar is working on - how to best set up a test rig [13:49:20] i guess i will poke hashar in a bit to see if i can get it set up [13:49:24] what do you want to do exactly? [13:49:45] of course we're not gonna support many different labs projects inside our production manifests [13:49:47] change vcl, see that it compiles, see that it sets the headers correctly, etc [13:49:48] so that will be a problem [13:50:25] i already got mobile-varnish instance up, but haven't finished configuring varnish yet [13:50:34] you can always make your own labs instance, deploy varnish much like is done in production, and hack your local manifests until you have a working solution [13:50:48] it's certainly not fully puppet automated at the moment [13:50:59] but given that you want the ability to locally edit things anyway, perhaps that shouldn't be such a big issue [13:51:12] mark, sure, but could you give me a few pointers on how/where to hack that manifest? 
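Pulling together the pointers in this exchange -- role/cache.pp, varnish.pp, the VCL templates, and the self-hosted puppetmaster setup linked just below -- a rough edit-and-apply loop on a labs instance might look like the following. Every path and file name here is an assumption based on this log and the wikitech help page, not a verified procedure.

    # On a labs instance running puppetmaster::self, the puppet tree is checked out
    # locally, so manifests and VCL templates can be edited in place and re-applied.
    cd /var/lib/git/operations/puppet
    sudoedit manifests/role/cache.pp manifests/varnish.pp \
             templates/varnish/mobile-frontend.inc.vcl.erb
    sudo puppet agent --test --verbose    # apply the locally edited manifests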
[13:51:15] for testing syntax compliance and limited functionality testing, that could work just fine [13:51:31] yurik: basically, a mobile varnish server includes the "role::cache::mobile" manifest [13:51:32] i am new to pupetireeing :) [13:51:37] everything else is pulled in from there [13:51:43] so you could try that on a labs instance [13:51:48] it will fail horribly [13:51:55] but just fix up the manifests locally until you have it working [13:52:10] right right, but i need a starting point - which file to hack, and how to run it [13:52:13] it'll fail on things like lookups in a hash file to find the backend servers needed, or to create a file system on a partition that doesn't exist in labs [13:52:30] you'll want to hack role/cache.pp and varnish.pp, as well as the VCL templates [13:52:34] as i said - very new to puppets [13:53:14] and use puppet apply varnish.pp ? [13:53:26] or some other command? [13:53:43] read up on puppetmaster::self in labs [13:54:44] mark https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster ? [13:54:57] yes [13:55:12] thanks, should be good to get started [13:55:19] hopefully we can get you guys some more ops support to help with this soon [13:55:29] would be awesome [13:57:55] mark, and yes, i don't want to keep using varnish as in https://gerrit.wikimedia.org/r/#/c/57063/ [13:58:33] my target is to adapt/rewrite geoIp-style lookup for IP->carrier code string, and do everything else in php [13:58:56] cool [13:58:57] and introduce a proper zero portal [13:59:10] so that we don't redirect left and right [14:03:11] ottomata: here? [14:04:01] yurik: check line 372 [14:04:09] uncommited squid changes [14:04:16] uh [14:04:18] nevermind [14:04:19] wth [14:04:25] not sure if they're deployed somewhere or not, so I'm reluctant to deploy [14:04:29] what's wrong with my font [14:05:03] yup, hiya! [14:05:10] -cache_access_log udp://208.80.154.73:8420 wikimedia [14:05:14] that's gadolinium [14:05:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57063 [14:06:09] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [14:06:53] eh? [14:07:06] uncommited squid change on fenari [14:07:10] yurik: live on one box now [14:07:16] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [14:07:17] yei!!! [14:07:22] thanks mark! [14:07:24] gadolinium is using the oxygen multicast stream... [14:07:46] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 13919 MB (1% inode=99%): [14:07:48] so, is this deployed? [14:08:20] wait, not sure what you are saying, uhhh, frontend caches should not send logs directly to gadolinium [14:08:35] no, I'm saying that someone has modified squid on fenari and hasn't commited [14:08:39] oh! [14:08:42] agh [14:08:49] that was me then [14:08:50] sorry [14:08:50] and I want to deploy something else now [14:08:53] yes, that is committed [14:09:06] sorry [14:09:07] that is deployed [14:09:09] ok [14:09:32] sorry about that, sigh, that's a tough one to remember, will do better next time [14:21:31] mark, are any VCL changes needed to serve load.php from m domains? [14:21:44] New review: Nemo bis; "Double checked that it's what they want." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56420 [14:21:53] i'm still pondering between the two options [14:21:58] i'll reply on wikitech-l later [14:22:17] and yes, VCL changes would be needed for that [14:24:11] !log deploying squid config, diverting all of upload to swift [14:24:18] Logged the message, Master [14:27:27] KILL SOLARIS [14:28:12] we really need a better index page [14:29:44] i'm seeing roughly 7500 RL requests from mobile atm [14:29:46] per second [14:37:30] New patchset: Faidon; "upload varnish: switch everything to Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57069 [14:37:35] mark: wanna do a quick 10-line review? [14:37:55] sure [14:38:08] that pmtpa varnish stanza within varnish_be_directors is redundant, right? [14:38:16] it confused me for a moment there [14:38:45] since it's self-referrential, if it wasn't enclosed in an "if eqiad" :) [14:40:02] +1 [14:40:22] thanks [14:41:13] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57069 [14:47:22] hmm [14:47:30] New review: OliverKeyes; "Can't comment on the code (ooh, alliteration) but the project is sound and the patch is needed :)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/56408 [14:48:36] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [14:49:18] ah, now I understand why LU [14:49:47] !log demon synchronized wmf-config/CommonSettings.php 'Notice for users of disabled skins' [14:50:00] Logged the message, Master [14:50:49] !log demon synchronized wmf-config/CommonSettings.php 'I hate l10nupdate' [14:50:55] meh [14:50:58] Logged the message, Master [14:51:01] perhaps we should do device detection on bits [14:51:13] problem with varnish is that it doesn't have clean separation of storage backends [14:51:25] I can't reliably tell it to put resource loader content in a separate malloc backend [14:51:36] and with the high churn on the mobile frontends that could be a problem [14:54:00] !log demon synchronized php-1.21wmf12/extensions/WikimediaMessages/WikimediaTemporaryMessages.i18n.php '8th time is the charm' [14:54:06] Logged the message, Master [14:55:17] so is it a big problem if mobile RL came from persistent storage? [14:55:34] frontends don't have persistent storage [14:55:40] frontends only have a small malloc backend [14:57:01] I know [14:57:09] but I don't understand what you're saying [14:57:19] why a separate malloc backend? [14:58:07] New patchset: Jeremyb; "[it planet] fix doppiequadre per Elitre" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57074 [14:58:26] New review: Jeremyb; "http://www.w3.org/Provider/Style/URI.html" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57074 [15:00:11] i don't want the resource loader assets to LRU expire so much [15:02:11] jeremyb_: thanks for your patches :) [15:02:30] aww she was disappointed with me "ma nemo!!!" 
[15:04:50] New review: Nemo bis; "aye *faceplam*" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57074 [15:05:13] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [15:07:23] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [15:07:53] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 13292 MB (1% inode=99%): [15:08:37] !log LocalisationUpdate completed (1.21wmf12) at Tue Apr 2 15:08:37 UTC 2013 [15:08:44] Logged the message, Master [15:11:14] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:12:46] !log LocalisationUpdate completed (1.22wmf1) at Tue Apr 2 15:12:45 UTC 2013 [15:12:53] Logged the message, Master [15:25:59] greg-g: what's the latest on the HTTP auth saga? [15:27:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [15:31:28] jeremyb_: as in oauth/etc? [15:31:37] err? no [15:31:52] as in ishmael/graphite/icinga-admin [15:32:06] hiyaaa paravoid, whatcha think of this? [15:32:06] https://gerrit.wikimedia.org/r/#/c/56537/ [15:33:42] !log anomie synchronized php-1.21wmf12/extensions/WikimediaMessages/WikimediaTemporaryMessages.i18n.php [15:33:48] Logged the message, Master [15:37:40] greg-g: ? [15:38:19] <^demon> anomie: So yeah, I'd already sync'd that, after updating. [15:38:35] jeremyb_: oh, right... sorry, haven't finished drinking my coffee yet. no movement in the RT ticket last I checked (yesterday afternoon) [15:38:35] <^demon> Plus, if it wasn't up to date, how would the other languages have shown up? [15:38:37] ^demon- Just trying it. Didn't work :( [15:38:51] greg-g: no, i meant it still doesn't work? [15:39:05] greg-g: (i already checked the ticket myself :) ) [15:39:05] I was about to say "if it's how it's configured, oh well, why not" [15:39:06] then saw /opt/kraken [15:39:24] yeah yeah yeah [15:39:25] i know [15:39:47] and the hashing by awk [15:39:54] jeremyb_: still fails on ishmael, at least [15:39:57] Hmm. [15:39:59] ok [15:40:12] I think I'm leaning towards no [15:40:21] jeremyb_: and graphite [15:40:23] that is how its currently configured, and that won't fly for initail base cluster [15:40:34] this is just so we can get monitoring working on those instances [15:40:43] greg-g: k [15:40:52] <^demon> anomie: I guess we could try a full scap? But that seems overkill. [15:40:58] !log anomie synchronized php-1.21wmf12/cache/l10n/l10n_cache-en.cdb [15:41:04] Logged the message, Master [15:41:09] <^demon> Weirdddd [15:41:12] \o/ [15:41:21] andrewbogott: want to look at rt 4853 when you have a min? (it's your week :) ) [15:41:24] <^demon> Wonder why l10nupdate missed that file. [15:41:27] <^demon> When the rest went out. [15:41:31] jeremyb: yep [15:41:35] No idea why, but apparently the en cache file didn't get copied. [15:42:21] <^demon> Bizarre. 
[15:42:25] !log demon synchronized wmf-config/CommonSettings.php 'Message for ancient skin users' [15:42:32] Logged the message, Master [15:42:59] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [15:42:59] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [15:42:59] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [15:43:27] New patchset: Ottomata; "Puppetizing udp2log instances on analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537 [15:43:43] paravoid, I understand your concern about /opt/kraken, that is not correct and is not intended to be the correct solution. Its just what is there right now [15:43:53] hashing by awk is what is there right now too, and it is working fine [15:43:57] it is also not the final solution [15:45:02] New patchset: Demon; "Switch nostalgiawiki to use Nostalgia from extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56402 [15:45:02] I think we agreed to not merge what exists now but start fresh [15:47:03] riiiiight, but in the meantime people are worried about potential packet loss on udp2log instances that are used to import the mobile data into kraken [15:47:45] the fact that it isn't monitored properly means that we are less confident about the accuracy of data we generate [15:48:15] i agree to start fresh, 100%, that's why I added the comments about how this is temporary [15:48:57] (btw, I'm waiting on puppet-merge review and kafka debian review as the first steps in starting fresh, and I expect to have more time allocated to this after mid may) [15:49:27] jeremyb: Can you explain that patch to me slightly? I don't know the context at all. What is that file used for? [15:49:57] andrewbogott: which patch? [15:50:24] planet/it_config.erb [15:50:35] it's used for it.planet.wikimedia.org [15:51:05] Oh, the 'it' is for italian, not information technology :) [15:51:11] yes :) [15:51:20] the request was made by an itwiki sysop and +1'd by Nemo_bis [15:51:28] That is obvious in retrospect [15:51:31] OK, will merge. 
[15:51:42] danke [15:51:54] * jeremyb_ wonders about greg though :) [15:55:01] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57074 [15:56:34] jeremyb_: thanks for caring :) [16:02:12] andrewbogott: i think you're not supposed to touch verified fwiw [16:04:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:07:06] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [16:07:36] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 12568 MB (1% inode=99%): [16:07:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:07:53 UTC 2013 [16:08:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:09:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:09:55 UTC 2013 [16:10:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:12:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:11:50 UTC 2013 [16:12:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:13:46] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:13:40 UTC 2013 [16:13:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:14:43] Nemo_bis: hrmmm, there's 2 doppiequadres? [16:15:26] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:15:24 UTC 2013 [16:15:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:17:06] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:16:58 UTC 2013 [16:17:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:18:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:18:30 UTC 2013 [16:18:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:19:56] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:19:52 UTC 2013 [16:20:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:21:16] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:21:11 UTC 2013 [16:21:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:22:26] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:22:21 UTC 2013 [16:22:59] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:23:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:23:26 UTC 2013 [16:23:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:24:26] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:24:23 UTC 2013 [16:24:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:25:17] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:25:15 UTC 2013 [16:25:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:26:06] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:25:58 UTC 2013 [16:26:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:27:16] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Apr 2 16:27:11 UTC 2013 [16:27:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [16:27:46] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [16:27:56] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:28:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [16:28:50] New patchset: Ottomata; "Abstracting out udp2log monitoring into its own define" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537 [16:34:37] jeremyb_: wasn't the other removed [16:34:47] no, the tumblr [16:34:48] New patchset: MaxSem; "Add a mobile log group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57078 [16:34:56] look at the current site [16:35:25] yes [16:35:30] the umblr was a new one [16:35:36] damn lag [16:35:38] why are there 2? [16:35:51] anyway, tell her to remove the www. from tumblr :) [16:35:57] and maybe make it a link too [16:36:22] > Casa base su http://www.doppiequadre.wordpress.com/ [16:44:55] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56537 [16:50:43] New patchset: Ottomata; "Removing inherit on analytics1003-1006" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57080 [16:52:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57080 [16:53:12] New patchset: Ottomata; "Only analytics1003 and 1011 are ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57081 [16:54:39] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57081 [16:56:50] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [16:58:00] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [16:58:30] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 12057 MB (1% inode=99%): [17:03:45] New patchset: coren; "Add service group support in-instance with nslcd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57082 [17:03:58] !log setting weight to 100 on db1001 [17:04:03] Logged the message, Master [17:04:13] New patchset: Ottomata; "Need role::analytics on analytics1003-1006" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57083 [17:06:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57083 [17:09:49] New review: coren; "Works on tools-puppet-test as advertized. :-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57082 [17:10:18] New patchset: Cmjohnson; "Addind db1001 back into production" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57084 [17:12:56] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57082 [17:13:05] New patchset: Cmjohnson; "Adding db1001 back into production removing db1028 from production for h/w fix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57084 [17:13:56] Change merged: Cmjohnson; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57084 [17:13:57] New patchset: Ottomata; "Sometimes order makes a difference with $ganglia_aggregator." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57085 [17:14:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57085 [17:14:35] notpeter, ping [17:14:53] andrewbogott: did you merge my changes to? [17:15:20] andrewbogott: sup [17:15:37] cmjohnson1: you have to deploy that patchset via the mediawiki config method [17:15:55] notpeter, I'm told that this ticket requires a db person: https://rt.wikimedia.org/Ticket/Display.html?id=4862 is that something you can take on? [17:16:35] andrewbogott: this will probably be hella annoying, but sure,i can look into it [17:16:46] thanks [17:17:02] notpeter: oh! well shit...i don't see that in wikitech [17:17:18] hhhmmm, let's see if you have deploy access :) [17:17:28] ssh to fenari, and make sure to forward agent for this [17:18:27] i am on fenari [17:18:30] cool [17:18:34] cd /home/w/common [17:18:46] k [17:18:50] git pull [17:18:52] cmjohnson1: https://wikitech.wikimedia.org/wiki/How_to_do_a_configuration_change#Change_wiki_configuration [17:19:12] notpeter cool [17:19:16] worked [17:19:49] woo! now [17:20:00] Who is the king and/or queen of bugzilla? [17:20:27] sync-file wmf-config/db-eqiad.php "" [17:20:39] andrewbogott: I like the and/or :) [17:21:00] andrewbogott: how do you mean? but probably andre__ [17:21:07] cmjohnson1: if you don't see a whole screen full of errors, it worked :) [17:21:11] andrewbogott: andree [17:21:21] mutante's so slow :) [17:21:27] likely me [17:21:28] !log cmjohnson synchronized wmf-config/db-eqiad.php 'adding db1001 back to production removing db1028' [17:21:29] jeremyb_, mutante, I'm trying to address a request from andre_ to deploy a change. [17:21:33] Logged the message, Master [17:21:38] haha [17:21:39] notpeter: worked [17:21:43] Oh oh, I see. Now this way. [17:21:46] andrewbogott: andre: rephrasing: bugzillaadmin@wm :) [17:22:07] (which is just andre__) [17:22:30] yea, right now it is. but it might have more on it in the future [17:22:35] i guess [17:22:53] andrewbogott: which one? [17:23:01] https://rt.wikimedia.org/Ticket/Display.html?id=4867 [17:23:31] Is bugzilla running out of a git repo? If so I guess I can deploy myself... [17:24:00] cmjohnson1: woo! [17:24:17] cmjohnson1: so, the only thing left is in a bit, to increase the weight on db1001 [17:24:22] back up to 400 [17:24:37] cool [17:26:56] andrewbogott: ok,, so ssh to kaulen, then cd /root/bzmod/modifications , git pull [17:27:05] New patchset: Jgreen; "move fundraising banner log collection pipeline from locke to gadolinium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57088 [17:27:16] mutante: So, not packaged, not puppetized [17:27:21] I guess you've been saying that for months :) [17:27:39] !log disable fundraising banner log rotation on locke [17:27:45] Logged the message, Master [17:27:52] andrewbogott: then the last step you'd have to do manually, copy the file from there to: /srv/org/wikimedia/bugzilla/extensions/Wikimedia [17:27:56] andrewbogott: no, there is no package [17:28:02] Ubuntu dropped it [17:28:33] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57088 [17:28:42] also, the structure of that git repo with the different bz versions needs changing [17:30:29] andre__: Can you verify that the patch is now deployed? 
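Condensed, the config-deploy walkthrough notpeter gives above amounts to roughly the following; treat it as a sketch and defer to the wikitech page linked in the log:

    # ssh to the deploy host with agent forwarding so the sync can reach the apaches
    ssh -A fenari
    cd /home/w/common                 # live checkout of operations/mediawiki-config
    git pull                          # pull the change already merged in Gerrit
    # sync the single changed file; the quoted string becomes the !log entry
    sync-file wmf-config/db-eqiad.php 'adding db1001 back to production, removing db1028'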
[17:30:48] andrewbogott: hey, think positive, the whole change is in gerrit and you can clone it, until not so long ago you would have gotten .diff files and use patch to apply it:) [17:31:14] It looks to me like it would be an improvement to run bugzilla straight out of git -- could we do that? [17:31:17] That's easy enough to puppetize. [17:31:54] i don't know, i think we should have a .deb for the normal Bugzilla [17:32:03] and then puppetize that it fetches our modification stuff from git [17:32:15] ewww [17:32:16] this repo is just called "modifications" [17:33:07] * jeremyb_ says no mixing. either all .deb all the way or not .deb at all [17:33:30] i don't see why it should be much different than mediawiki deploymenjt [17:33:36] deployment* [17:33:49] !log removing disk 0 from db1048 to replace [17:33:54] i think it's the clean way to separate "vanilla" bugzilla and our mods [17:33:54] Logged the message, Master [17:33:58] !log correction db1028 [17:34:03] Logged the message, Master [17:34:19] jeremyb_: is mediawiki packaged ?:) [17:34:19] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:34:19] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:34:22] andrewbogott, verified, all perfect, big thanks! [17:34:27] mutante: yes [17:34:35] andre__: Cool, mind closing out the RT ticket as well? [17:34:42] andrewbogott, I will do [17:34:50] mutante: Wouldn't git be a reasonable way of distinguishing, anyway? [17:35:25] jeremyb_: sadly the official recommendation is to git clone though and the .debs are very outdated [17:35:50] mutante: huh? [17:35:57] mutante: i think you're outdated :) [17:36:19] http://lists.wikimedia.org/pipermail/mediawiki-distributors/ [17:36:39] http://lists.alioth.debian.org/pipermail/pkg-mediawiki-devel/ [17:36:47] heh, i created that list [17:37:09] http://packages.ubuntu.com/search?keywords=mediawiki [17:37:30] andrewbogott: I'd love to get rid of some substeps when getting Bugzilla code changes deployed (and tested) for sure, whatever makes sense. [17:37:41] andrewbogott: you mean 2 separate repos? [17:37:42] mutante: do we care about ubuntu? just look at wheezy :) [17:37:46] i mean it's always git, of course [17:37:53] even if it's a package,it's in git [17:38:00] jeremyb_: i don't [17:38:15] we can switch to Debian ,heh [17:38:27] mutante, well, presumably the 'normal' bugzilla is already in a repo, someplace else that we don't have to maintain? [17:38:39] andrewbogott: is it? i don't think so, we download tarballs [17:38:45] And we can just configure the repo to use that as an upstream so we can diff... [17:38:49] Oh, if it isn't then… [17:38:51] mozilla does not have packages either [17:38:52] afaik [17:38:57] well, yeah, then we'ld have to keep track of it somehow [17:39:41] and the general puppet files vs. .deb package, i don't have a strong opinion, let's ask architects [17:39:47] but i expect .deb [17:40:08] andrewbogott: mutante: upstream is bzr [17:40:20] so it would make sense to keep the local hacks in bzr too [17:40:32] I've asked Mozilla if there are .deb packages but they said it's up to distros... 
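For reference, the manual Bugzilla deploy flow mutante describes above comes down to something like this (paths as quoted in the conversation; a sketch only, since none of it is packaged or puppetized yet):

    ssh kaulen
    cd /root/bzmod/modifications       # the "modifications" git repo
    git pull                           # fetch the change merged in Gerrit
    # the last step is still manual: copy the changed file into the live tree
    cp <changed-file> /srv/org/wikimedia/bugzilla/extensions/Wikimedia/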
[17:40:37] but then most people won't know bzr :( [17:41:12] we could automate converting bzr to git [17:41:15] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:41:20] not sure how it works over time though [17:41:24] !log restarted varnishncsa-multicast_relay on cp1028 [17:41:25] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:41:29] e.g. on the resulting commit ids consistent [17:41:30] Logged the message, Mistress of the network gear. [17:41:57] !log restarted varnishncsa-locke on cp1023 [17:42:03] Logged the message, Mistress of the network gear. [17:42:07] New review: Asher; "This doesn't address variance, and I don't think we want resourceloader to include an X-Device Vary ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56774 [17:42:15] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:43:25] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:44:20] * andrewbogott is sorry he asked [17:44:23] New patchset: Jgreen; "fundraising banner rotation tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57089 [17:44:25] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:44:34] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57089 [17:44:47] andrewbogott: :D [17:45:17] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:45:25] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [17:48:15] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:49:15] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [17:51:55] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:52:55] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [17:53:15] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:54:59] New review: MaxSem; "Not every RL request needs to be varied by X-Device - only the ones that contain the autodetect modu..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56774 [17:55:25] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:58:55] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:02:26] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [18:04:13] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [18:04:22] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [18:04:52] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [18:05:13] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:06:02] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [18:06:32] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 11396 MB (1% inode=99%): [18:06:50] New review: Asher; "If that's already taken into account by the resourceloader module, nevermind hashing in vcl." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56774 [18:07:23] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:09:20] sbernardin: I need you to put a network ticket in for rdb1 and 2 please [18:10:22] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:11:13] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [18:14:07] New review: Dr0ptp4kt; "I believe this is already covered in https://gerrit.wikimedia.org/r/55302 and https://gerrit.wikimed..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56333 [18:14:13] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:14:32] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [18:16:38] ugh, mark / terry :( [18:16:46] should 4785/4685 be merged now? [18:17:01] New patchset: Pyoungmeister; "pre-labsdb dbs: more node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57095 [18:17:14] New patchset: Jgreen; "fix fundraising log rotation path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57096 [18:17:36] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57096 [18:18:48] New patchset: Pyoungmeister; "pre-labsdb dbs: more node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57095 [18:19:32] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:19:49] notpeter, you poked me some time ago about Solr-related cronspam. Does it continue now? [18:21:13] MaxSem: it doesn't look like it [18:21:17] but I'll keep an eye out [18:21:21] whee [18:21:30] I haven't been vigilant about cronspam of late [18:21:31] notpeter: so, what's up with 4844? [18:21:33] Change abandoned: Yurik; "heh, this has been pending for the past 5 days :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56333 [18:22:09] jeremyb_: I don't know. I haven't had time to look into it [18:22:22] New review: Dr0ptp4kt; "That explains it! Thanks." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/56333 [18:22:49] notpeter: ok. well i'm confident in my testing so if you need me to retest or find someone to test let me know [18:22:52] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [18:22:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57095 [18:22:56] ori-l: are you on comcom? i guess not [18:23:13] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [18:23:22] jeremyb_: cool. I just wanted confirmation from a speaker of the languages. [18:23:29] as it was working previously [18:23:43] notpeter: well i originally investigated only because a local complained [18:23:57] cool [18:24:10] k [18:25:13] RECOVERY - RAID on db1028 is OK: OK: State is Optimal, checked 2 logical device(s) [18:26:07] New patchset: Matmarex; "(bug 46330) Set $wgCategoryCollation to 'uca-fi' on all Finnish wikis except Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57098 [18:27:48] lol [18:27:59] New patchset: Reedy; "Remove readonly.dblist. Essentially a dupe of closed.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57027 [18:28:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57027 [18:28:35] New patchset: Reedy; "Reduce the amount of times the database lists are read in" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57028 [18:28:42] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57028 [18:29:05] New patchset: Jgreen; "grr. forgot ensure=>directory..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57099 [18:30:23] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [18:30:28] New patchset: Jgreen; "grr. forgot ensure=>directory..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57099 [18:31:22] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [18:31:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57099 [18:33:57] New patchset: Pyoungmeister; "can't repeat a key in a hash and expect it to work right" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57101 [18:35:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57101 [18:35:18] !log reedy synchronized wmf-config/CommonSettings.php [18:35:24] Logged the message, Master [18:36:18] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [18:39:27] !log completed image img_media_mime migration on all projects [18:39:33] Logged the message, Master [18:39:34] AaronSchulz: ^^ [18:39:50] cmjohnson1: did you re-raise the weight on db1001 [18:39:51] ? [18:39:59] not yet [18:40:01] binasher: send in the mims [18:40:03] *mimes [18:40:12] cmjohnson1: cool. 
go for it when you're ready [18:40:23] ok...will get it a few [18:40:27] in a few [18:40:38] cool [18:41:26] binasher: Yay [18:42:24] cmjohnson1: notpeter: oh, actually please just leave db1001 where it is [18:42:42] i'm probably going to repurpose it [18:43:11] okay...also took db1028 out of production to replace the disk..once the rebuild is done...will put it back [18:43:28] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [18:44:13] binasher: does https://gerrit.wikimedia.org/r/#/c/57087/1/includes/StatCounter.php look OK? [18:44:42] cmjohnson1: put back in at decreased weight, yeah? :) [18:44:49] db1028, imean [18:44:53] right...gotcha [18:44:57] cmjohnson1: cool! [18:46:35] binasher: on the db's for labstore...do you want raid 10 there as well or raid 5? [18:47:37] cmjohnson1: that actual servers are ciscos, right? [18:47:54] not on labstore r510's [18:48:00] labsdb are ciscos [18:48:06] !log starting innopack from db31 to db1054 for pre-labsdb db (sanitarium) [18:48:11] Logged the message, notpeter [18:49:21] binasher ^ [18:49:41] oh [18:49:47] doh [18:49:50] that's not going to work [18:49:50] wait [18:49:58] i don't know what the labstore servers are [18:51:10] binasher: what do you mean? [18:51:58] cmjohnson1: i don't know what you're asking me about? [18:52:20] Ryan_Lane: ^^ labstore? are those for gluster? [18:52:27] yes [18:52:41] so labstore is unrelated to the dbs [18:53:19] cmjohnson1: did you have a chance to check the shelves for the labstore systems in eqiad? [18:53:36] ryan_lane...yep...not turned on [18:53:40] :D [18:53:44] that'll do it [18:53:44] had toggle that on/off switch [18:53:47] yep [18:54:21] cmjohnson1: here is the ticket for rdb1 & rdb2... https://rt.wikimedia.org/Ticket/Display.html?id=4870 [18:54:24] ryan_lane so they're all yours whenever you are ready [18:54:31] cmjohnson1: thanks :) [18:54:32] !log actually starting innopack from db65 to db1054 for pre-labsdb db (sanitarium) [18:54:38] Logged the message, notpeter [18:56:23] !log cmjohnson synchronized wmf-config/db-eqiad.php 'adding db1028 back' [18:56:28] Logged the message, Master [18:56:39] thanks sbernardin [18:57:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:15] AaronSchulz: what if $statsline is > than 1472 bytes [18:58:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:01:45] binasher: maybe it could be partitioned into <=512 byte calls [19:03:16] AaronSchulz: the udplog patch goes for 1450 [19:03:34] well, are we the only ones using this? [19:04:08] i'm pretty sure we are [19:04:15] which is still too large for ipv6 [19:04:18] * paravoid hides [19:04:50] paravoid: how many bytes is an ipv6 header? [19:04:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:05:05] hah [19:05:11] binasher: Very much variable. [19:05:14] (normally) :) [19:05:36] yeah, there's a extension headers [19:05:58] but I don't think they'd made sense here [19:06:04] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [19:06:22] [14:53:36] ryan_lane...yep...not turned on Well, /there's/ your problem. 
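For context on the 1472 and 1450 figures being discussed: with a standard 1500-byte Ethernet MTU, an IPv4 header (20 bytes minimum) plus a UDP header (8 bytes) leaves 1500 - 20 - 8 = 1472 bytes of payload before fragmentation; the fixed IPv6 header is 40 bytes, so the same sum gives 1500 - 40 - 8 = 1452, which is why capping the stats line at 1450 bytes is comfortably safe either way.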
[19:06:34] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 12660 MB (1% inode=99%): [19:06:35] :) [19:06:47] so maybe it'd fit [19:06:58] 40 bytes is the absolutely necessary ipv6 header [19:07:08] plus 8 for udp iirc [19:08:18] maybe that's why it's 1450? [19:08:59] anyway, this was a joke, I don't think we'll switch logging to ipv6 before we switch away from udp2log :) [19:10:56] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53884 [19:14:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:27] New patchset: Reedy; "(bug 46330) Set $wgCategoryCollation to 'uca-fi' on all Finnish wikis except Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57098 [19:16:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [19:21:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57098 [19:21:58] !log reedy synchronized wmf-config/InitialiseSettings.php [19:22:05] Logged the message, Master [19:22:06] paravoid: udplog4life! [19:24:21] i have that tattooed on my left buttock [19:24:25] true story. [19:24:53] Pic or didn't happen? [19:25:02] odder: pervert [19:25:41] Nemo_bis: just a Wikipedian; my immediate thought was {{citation needed}} # [19:26:05] no citations, only sequence ids [19:27:15] odder: that's a form of perversion I guess [19:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [19:28:51] who is our contact at MaxMind? [19:28:54] anyone knows? [19:29:22] yurik: Max? [19:29:34] Reedy, max who? [19:29:40] nvm [19:29:53] sorry, still jetlagged ;) [19:30:07] Pfft [19:30:11] A whole 3 hours? [19:30:18] redeye [19:30:25] A whole 3 hours? [19:30:37] 6 hours sleeping in a plane :-P [19:31:02] aided by chemical substances ;) [19:31:02] Pfft [19:31:06] If you slept... [19:31:21] nvm [19:31:40] aaanyway, i guess i will send an email to their general hotline :) [19:31:48] yurik: You might want to try poking our fundraising people [19:32:02] #wikimedia-fundraising [19:32:12] i spoke with Matthew, but he didn't know [19:32:20] might be a good idea, thx :) [19:36:38] paravoid: ori-l: until i ripped it out and nuked from orbit, flickr was doing their request logging via a token ring topology reliable multicast udp protocol that also tried to guarantee in-order delivery. the apache module for this system (http://www.backhand.org/mod_log_spread/) would block accepting new requests until it got the token and could send its logged. [19:36:47] udplog… it could be so much worse. [19:37:15] ohmygod [19:37:19] seriously? [19:37:47] not joking :( [19:37:50] so the problems you encountered were impossible because the system was provably correct [19:37:57] must be you, etc. [19:38:00] right? [19:38:08] exactly [19:38:22] figures [19:40:00] we could do the same thing from squid!! 
http://www.squid-cache.org/mail-archive/squid-users/200508/0178.html [19:40:11] yurik: i am the contact person for MaxMind [19:41:03] drdee, sweet :) I was hoping to find out if maxmind has a binary database *encoding* tool, so that we can create a custom DB of our own [19:41:10] guys am I a dummy here or what?: [19:41:23] the goal is to map carrier IP ranges to their ID [19:41:28] yurik: I already replied to that [19:41:30] yurik: paravoid: doesn't debian have a cvs -> maxmind db tool? [19:41:42] yes, I said that in a mail days ago [19:41:56] I also said that it sounds hacky to me [19:42:24] and that writing a program in C that imports a whatever file format into a radix tree and then doing lookups over that tree is trivial [19:42:28] paravoid, sorry, must have missed the tool ref, will go doubl check. Why is it hacky if its exactly the same goal -- ip -> string? [19:42:39] probably a day or two effort [19:42:58] binasher: s/cvs/csv/, brrr that sounded horrible [19:42:58] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:43:11] cvs -> maxmind [19:43:27] hahah [19:44:16] does lvm not report disk usage properly sometimes? [19:44:23] https://gist.github.com/ottomata/5295534 [19:44:57] Change abandoned: Mattflaschen; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57004 [19:46:28] paravoid: do you know if cpu mhz in /proc/cpuinfo on a 3.2 kernel should reflect the actual currently scaled speed? [19:47:33] New patchset: Aklapper; "Use my real name on Planet Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57166 [19:48:17] New patchset: Reedy; "Revert "Reduce the amount of times the database lists are read in"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57167 [19:48:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57167 [19:48:31] drdee: syslog during failed install an1007 http://p.defau.lt/?mqGr04BAa4TtR_zzj95xmQ [19:48:48] ty cmjohnson1! [19:49:12] !log reedy synchronized wmf-config/CommonSettings.php [19:49:19] Logged the message, Master [19:49:25] is that a permalink or will it expire sono? [19:51:19] notpeter, can I ping you about some disk free weirdness? I must be doing something real dumb [19:51:24] https://gist.github.com/ottomata/5295534 [19:53:32] paravoid: nm, verified that it does [19:54:28] cpu frequency scaling seems broken on the new r720+E5-2620 dbs [19:57:02] drdee...no idea..i can email you text file [19:57:33] New patchset: Asher; "pulling db1001" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57169 [19:57:56] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57169 [20:00:23] sorry, I was having dinner [20:00:28] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:01:21] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1001' [20:01:28] Logged the message, Master [20:03:47] fyi, i'm testing something, expecting a disk alert in a sec... 
[20:03:53] on analytics1026 [20:04:10] PROBLEM - Disk space on analytics1026 is CRITICAL: DISK CRITICAL - free space: /mnt/tmp_test_otto 0 MB (0% inode=99%): [20:04:20] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:05:00] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:05:10] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [20:05:28] ottomata: stat / /mnt/tmp_test_otto{,/file2} [20:05:28] New patchset: Reedy; "Cache loaded dblists when tagged. Reuse for SiteMatrix, CentralAuth and Incubator" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57173 [20:06:46] jeremyb_, i think it had something to do with the way I created the file, not sure, will do that in a sec [20:06:50] New review: JanZerebecki; "The single [OR] would be correct as no [OR] means an implicit AND which I assume matches the intent ..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/49069 [20:06:54] if I copied a real file in place it showed disk usage properly [20:07:08] oh, yeah [20:07:12] your dd is screwy [20:07:19] you: dd bs=1 count=0 seek=21M [20:07:27] should be: dd bs=1 count=21M [20:07:29] err [20:07:48] should be: dd bs=1M count=21 [20:07:52] ottomata: ^ [20:08:08] hm, i grabbed that from the dd wikipedia page, although that was for empty files of arbitrary size (from /dev/zero) [20:08:39] mk cool, thanks jeremb_ i know it was something real dumb [20:08:43] next time throw in some du :) [20:08:49] yeah I did too! [20:08:53] it showed it as free [20:09:14] !log added marc to the ops ldap group [20:09:21] Logged the message, Master [20:09:46] hm, did any of you guys get that recent analytics1026 disk alert in your emails? [20:10:01] New patchset: Asher; "moving s1 watchlist to db1052, putting db1043 to full weight" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57174 [20:10:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57174 [20:11:14] !log asher synchronized wmf-config/db-eqiad.php 'moving s1 watchlist to db1052, db1043 to full weight' [20:11:20] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:11:21] Logged the message, Master [20:11:30] New patchset: Ryan Lane; "Adding marc (Coren) as root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57175 [20:11:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57033 [20:12:08] ottomata: whatsup? [20:12:34] ha, hey, figured out the disk thing (i was being dumb, i knew it!), but now i'm confused about something else. [20:12:43] i just triggered a critical disk alert in icinga [20:12:55] but I get no email…looking around to find out why [20:12:58] any ideas? [20:13:11] icinga sucks... 
web interface way too slow [20:13:14] and not reliable [20:13:27] ](not sure if spence was better though) [20:13:39] jeremyb_: I don't understand what you mean by slow [20:13:40] its cool in the web interface [20:13:42] it's fast for me [20:13:56] Ryan_Lane: https://icinga.wikimedia.org/cgi-bin/icinga/notifications.cgi?contact=all [20:14:06] ottomata: they aren't "critical" [20:14:12] that is indeed slow [20:14:15] by default, things don't email [20:14:17] I never use that page [20:14:22] hm [20:14:32] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1026&service=Disk+space [20:14:37] lemme see the def in puppet [20:14:49] Ryan_Lane: i can't recall having this problems with non-wmf nagios. can't remember about spence [20:15:09] Ryan_Lane: i can definitely say this is a regular problem and not just today/right now [20:15:32] all of the pages I normally use work fine [20:15:56] notpeter: base.pp line 433 [20:16:16] nrpe::monitor_service { "disk_space" [20:16:25] yes [20:16:37] look at nagios.pp:59 [20:16:44] that is how checks are defined [20:17:01] the $critical var [20:17:05] critical="false ? [20:17:05] which is by default false [20:17:05] ahhh [20:17:06] hm [20:17:11] has to be "true" to get page [20:17:12] Ryan_Lane: the output ends with service-by-irc'>notify-service-by-irc and contact_grou [20:17:19] hmmmm [20:17:20] ok [20:17:27] hmmm, cool! ok [20:17:27] so [20:17:34] and takes 20 secs to even get that far [20:17:34] so I can override base::monitoring::host for analytics nodes [20:17:35] you could define another check_disck thing [20:17:37] for you [20:17:39] and change it and change contact_groups [20:17:39] ? [20:17:47] ok [20:17:53] why do you want this to email you so badly? [20:18:19] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57078 [20:18:24] I would say [20:18:32] that if you're expecting disk space problems [20:18:36] to make sometihng to clean them up [20:18:37] in case disk fills on up on analytics nodes, this happened on a couple of non-critical not production nodes the other week due to cRaZy udp2log process [20:18:42] and if you're not [20:18:42] not expecting [20:18:50] then just check it from time to time [20:19:03] well, dschoon et. all got all upset while I was at a seder dinner, heheh [20:19:14] not my fault they're anti-semites [20:19:16] ;) [20:19:18] heheh [20:19:22] anyway [20:19:28] I think that this is overkill [20:19:29] and messy [20:19:31] ottomata: when in brooklyn... [20:19:33] really? [20:19:42] and that fixing the root problem is the correct approach [20:19:44] you don't want paged if your disk on prod nodes fills up? [20:19:46] disk can vary quickly. [20:19:48] there's no root problem, atm [20:19:50] this was a fluke [20:20:02] as data volume is high, and a small misconfiguration can create a large volume of garbage [20:20:09] i would like to know about the garbage earlier rather than later. [20:20:48] ottomata: you can write any icinga checks you want :) [20:20:50] go for it [20:20:57] please don't edit the stuff in base.pp, though [20:20:59] I mean, you can [20:21:07] but, like, iunno, it's included on all [20:21:11] no no, certainly not [20:21:16] cool [20:21:17] i would inherit the class, but i thikn your way is cleaner [20:21:25] just make a new alert? [20:21:27] is better? 
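On the disk-free weirdness a little earlier: dd with count=0 and a large seek creates a sparse file, which has the requested apparent size but allocates almost no blocks, so df and the Icinga disk check don't budge. A quick way to see the difference (same mount point as the test in the log):

    # sparse: ~21 MB apparent size, essentially no blocks allocated
    dd if=/dev/zero of=/mnt/tmp_test_otto/sparse bs=1 count=0 seek=21M
    # real data: actually writes 21 MiB of zeros to disk
    dd if=/dev/zero of=/mnt/tmp_test_otto/full bs=1M count=21
    du -h --apparent-size /mnt/tmp_test_otto/*   # both report ~21M
    du -h /mnt/tmp_test_otto/*                   # only "full" shows real usage
    df -h /mnt/tmp_test_otto                     # only the real file moves this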
[20:21:28] yeah [20:21:33] k [20:21:36] can customize it however you want :) [20:22:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57175 [20:22:52] you can also just leave it as it is, but just not make it page [20:23:00] by setting critical => false [20:23:24] mutante, we want page :0 [20:23:35] notpeter, any reason why this is nrpe vs just monitor_service? [20:24:24] I would certainly -2 any change that adds pages for analytics disk checks [20:24:24] because you need to execute it on the remote server [20:24:24] ottomata because you have to run a local command with nrpe to check disk space [20:24:34] there is no way to know that from the outside of the box [20:24:41] hm, aye ok right [20:24:42] New patchset: Ori.livneh; "Enable PostEdit on bn, br, ca, cs, et, ka and zh wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57178 [20:24:45] paravoid: I think this is just for the analytics alerting group [20:24:49] ok, need to ad dnew param to nrpe::monitor_service [20:24:51] $citircal [20:24:57] ok, that's better :) [20:24:58] but still [20:25:02] our job is to be proactive [20:25:04] yeah, just for analytics contact group [20:25:05] not reactive [20:25:18] hence logrotate [20:25:22] paravoid: this is what i was trying to say about "configure your stuff right and address root cause" [20:25:29] hah, mutante, yes, this is not a logrotate problem [20:25:51] yes, totally agreed [20:25:52] i dunno, but i saw full disks and huge logfiles [20:26:21] ottomata: for nrpe::monitor_service you would actually want to add the $critical var and the $contact_group var [20:26:26] aye [20:26:37] and pass those to the monitor service define [20:26:39] ok, guys, if all 3 of you tell me not to turn on these alerts though [20:26:40] i'm fine with that [20:26:51] for the most part i'm not worried about this either [20:26:54] who knows about nimsoft? [20:27:03] what it checks and how to configure it? [20:27:04] dschoon ^ [20:27:19] paravoid: web based AFAIK [20:27:29] jeremyb_: I meant more than that :) [20:27:35] paravoid: yeah [20:27:35] (not git) :( [20:27:38] paravoid: it's watchmouse [20:27:56] ottomata: I actually think the ops team should be the single point of contact for pages and such alerts [20:28:01] paravoid: cat /h/w/doc/watchmouse [20:28:06] paravoid: http://cloudmonitor.nimsoft.com/en/ [20:28:13] I've configured some [20:28:20] same here [20:28:26] and that pages going to the analytics group is a wrong premise [20:28:33] hmmm, yeah i think I agree [20:28:35] of the analytics group doing operations [20:29:24] hmm, what about the IRC notices? [20:29:25] notpeter, mutante: I'm in, thanks [20:29:38] woudl be nice if analytics contact group stuff would send alerts to #analytics irc room too, no? [20:29:46] at least then someone there would be more likely to notice and then come in here [20:30:12] * jeremyb_ wonders what's up with rt 4340. mutante ? [20:31:25] paravoid: are these pages actually a problem, or are they not and you're correcting the check def right now? [20:31:42] ottomata: [20:31:42] they're not, I just fixed the check and got an okay page [20:31:43] so [20:31:49] paravoid: thanks! [20:31:53] paravoid: while in it, you might want to check your alert settings, you don't have to have 24/7, but there is just one global timezone. 
which should be UTC [20:32:00] ottomata: so, i think that the root of this situations [20:32:10] is that we had no disk monitoring for a while [20:32:14] and once I got it going again [20:32:14] the check was trying /pybaltestfile.txt a while back I changed to /monitoring/backend [20:32:16] on all boxes [20:32:19] sometimes there was confusion because people thought the timezone setting is per user, but its not [20:32:25] then it was an omg emergency on th analytics boxes [20:32:26] that is actually being served by swift, instead of ms7 [20:32:32] and today I removed ms7 altogether [20:32:35] usually, like, now, this wouldn't sneak up on you/us like that [20:32:42] so I think that more emails won't be needed [20:32:42] ahhhhhHHHhh [20:32:47] I had updated our checks and all that but hadn't thought of watchmouse [20:32:47] that makes a lot of sense [20:32:57] because disks dont accidentally become 100% in a matter of hours [20:32:57] i was trying to figure out how this happened all of the sudden all at once [20:33:01] yeah [20:33:11] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [20:33:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [20:33:18] so, i think that making these things page for you/analytics [20:33:22] would be fixing the last situation [20:33:25] ok cool, thanks, i'll leave it as is then and update…MINGLE [20:33:25] which was a problem for sure [20:33:31] MINGLE THE SOURCE OF ALL KNOWLEDGE [20:33:34] ASK THE MINGLE [20:33:36] but I think that we're in a much better position at this point [20:33:48] !log maxsem synchronized wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/#/c/57078/' [20:33:55] Logged the message, Master [20:34:04] ottomata: if disk space explotions continue, then I think that's a bigger problem [20:34:09] jgonera: are you in? [20:34:09] totally. [20:34:10] and should be solved with capacity planning :) [20:34:25] if you need more boxes, you need more boxes :) [20:34:34] jeremyb_, ? [20:34:40] jgonera: stat1? [20:34:46] ottomata: fwiw, "jmxtrans" logs were also quite a bit [20:34:46] oh, yes, thanks [20:34:48] ottomata: wtf is mingle? [20:34:50] :) [20:35:07] and kafka.log [20:35:09] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:35:11] mingle: because we don't have enough task trackers [20:35:12] Ryan_Lane: it's like windows, but for process ;) [20:36:35] oh mutante, interesting, ok [20:36:55] its really fun guys! you should try it! [20:36:57] there are cards! [20:37:02] and inboxes! [20:37:10] the cards all have numbers [20:37:13] logs notpeter into mingleotrswiki [20:37:18] you don't have to refer to work taks with words anymore [20:37:35] we speak in numbers only over in #analytics [20:37:49] ottomata: can you play 3-card montey with them? [20:37:57] I'm only interested if swindling is possible [20:37:58] probably! actually! you should try! [20:38:06] log into analytics mingle and move all the cards around [20:38:07] see what happens [20:38:08] is there mao? [20:38:25] itllbefunipromise [20:38:38] ottomata!!! [20:38:43] hahah [20:38:47] uh oh! he's in this room :!!!! [20:38:49] heheheh [20:38:51] ottomata: I don't use closed source stuff. that's why I use debian on this laptop [20:38:54] uh [20:38:56] I'm lurking. 
:) [20:38:58] 111 116 116 111 109 097 116 097 058 032 114 101 100 109 105 110 101 032 102 116 119 [20:38:58] that uses an open source bios [20:39:00] I swear [20:39:06] hahah [20:39:09] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:39:17] coreboot? [20:39:53] mutante, re jmxtrans and kafka logs, those will be rotated when uhhhhhh, i can make changes to the cluster again (one day soon, let's keep our toes crossed!) [20:40:54] ottomata: oh, i may have missed something there, ok [20:41:10] RECOVERY - Disk space on analytics1026 is OK: DISK OK [20:41:24] !log maxsem synchronized php-1.21wmf12/includes/OutputPage.php 'https://gerrit.wikimedia.org/r/#/c/49071/' [20:41:24] :) [20:41:31] Logged the message, Master [20:44:38] New patchset: Reedy; "Cache closed, fishbowl and private dblists and reuse" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57184 [20:45:09] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [20:46:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57184 [20:49:11] !log reedy synchronized wmf-config/CommonSettings.php [20:49:17] Logged the message, Master [20:49:39] New patchset: Reedy; "More closing brackets" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57188 [20:49:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57188 [20:50:09] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:50:30] New patchset: Reedy; "Add transitionteam docroot" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57189 [20:50:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57189 [20:55:09] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [20:55:36] New patchset: Asher; "returning db1028 to service" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57191 [20:55:58] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57191 [20:56:09] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.030 second response time [20:56:22] binasher: I tweaked StatCounter thing [20:57:45] !log asher synchronized wmf-config/db-eqiad.php 'returning db1028' [20:57:51] Logged the message, Master [20:59:09] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [21:01:09] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.139 second response time [21:02:09] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [21:03:09] scapping... [21:06:26] AaronSchulz: hmm, we only send -total if wfIncrStats is called.. 
i thought we did for every request, though i suppose nearly all will call wfIncrStats [21:08:08] binasher: yeah, it's sent only if something gets incremented [21:08:18] there used to be one for session-setup that hit every requests [21:08:27] it spammed the collector and tim disabled it [21:09:51] ehm Allowed memory size of 183500800 bytes exhausted (tried to allocate 242394 bytes) in /usr/local/apache/common-local/php-1.21wmf12/includes/libs/jsminplus.php on line 1772 [21:14:25] binasher: the collector seems OK enough that I don't see a need to backport that [21:16:36] AaronSchulz: yeah, the graphs are pretty again and the collector usually isn't pegging a core [21:16:46] it'll be good to get more headroom though [21:25:01] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:25:07] Logged the message, Master [21:25:21] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [21:26:29] LeslieCarr: on https://gerrit.wikimedia.org/r/#/c/37441/ , was the en.wiki job queue check on spence removed on purpose or by mistake? is it ok to reintroduce it? [21:26:51] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [21:26:51] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [21:27:41] well spence is decommissioned minus ishmael, so it is gone on purpose [21:28:05] so running on hume is okay [21:28:11] but this needs to be put in the icinga.pp file [21:28:15] instead of nagios.pp [21:29:49] all right, shit hit fan [21:30:22] ExtensionMessages-1.22wmf1.php gets generated with PHP warnings [21:30:27] oh shit [21:30:33] Reedy: ^^ [21:30:39] i think i may know why varnishes are suddenly having errors [21:30:39] bacause several extensions are present in extensions-list [21:30:42] maxing out interfaces! [21:30:46] but are absent from 1.22 [21:30:48] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57166 [21:30:54] Reedy, ^^ [21:31:16] Eh? [21:31:23] Oh.. [21:31:35] I had to abort scap [21:31:39] Ignore it [21:31:42] It's fine to carry on [21:31:51] Reedy, no [21:31:52] 110 PHP Warning: Cannot modify header information - headers already sent by (output started at /home/wikipedia/common/wmf-config/ExtensionMessages [21:31:52] -1.22wmf1.php:9) in /home/wikipedia/common/php-1.22wmf1/includes/WebResponse.php on line 38 [21:32:22] o_0 [21:32:31] I never thought that it gets included by apache scripts [21:32:41] Scap was fine for me yesterday at least twice [21:32:44] I thought it's for shell only [21:32:51] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [21:33:28] Uhh [21:33:33] Where's it included? :/ [21:34:01] Oh [21:34:01] require( "$wmfConfigDir/ExtensionMessages-$wmfExtendedVersionNumber.php" ); [21:34:44] I'm confused why that's apparently causing header problems [21:35:23] And only just started now.. [21:36:20] Only srv193? [21:36:21] Do we care? [21:36:38] Reedy, once we sync it will be on mw.o and test2 [21:36:47] Do it? [21:36:49] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [21:36:53] and once you switch moar wikis to it.. 
[21:37:11] We have random header warnings like that appear from time to time [21:37:15] they go away by themselves too [21:37:49] Reedy, 110 warnings in a very short time [21:37:55] fun [21:37:56] !log reedy synchronized wmf-config/ [21:38:02] Logged the message, Master [21:38:03] All testwiki? [21:38:15] yes, while the scap was running [21:38:30] No one cares about testwiki. Maybe. Apparently. Or something [21:38:32] and I aborted really quickly and removed the errors manually [21:38:58] sooooo [21:39:13] I could run the remaining scap commands manually [21:40:19] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:41:18] binasher: did https://gerrit.wikimedia.org/r/#/c/52606/ get pushed out to production? [21:41:29] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [21:42:03] !log maxsem Started syncing Wikimedia installation... : [21:42:10] Logged the message, Master [21:42:25] so yeah I'm doing it manually [21:42:29] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.012 second response time [21:42:43] awjr: if it's merged and > 30mins have passed, yes :) [21:42:50] New patchset: RobH; "brandon black added to roots as new ops team member" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57199 [21:42:52] thanks paravoid [21:43:19] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [21:44:36] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57199 [21:45:22] Reedy: is there actually a problem? [21:45:29] I've no idea [21:52:19] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [21:54:49] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [21:54:59] awjr: yeah [21:55:07] thanks binasher [21:56:20] lesliecarr: new ex4200 on a3? [21:56:32] yep, don't attach it to the braid yet and give me the serial number please [21:57:24] if [ "`uname -s`" != Linux ]; then [21:57:30] echo "ERROR: This script requires the Linux operating system to function correctly" [21:57:33] so..i have a problem with that locaation...no available power [21:57:43] gah! [21:57:45] noes!!! [21:57:59] * AaronSchulz finds that amusing, maybe someone else is going to use mw-update-l10n and needs to be aware ;) [21:58:09] what about relocating several? but that means downtime [21:58:09] hrm… so i think all the machines on that rack are in use … double checking racktables [21:58:28] the text squids are underutilized, so that could work [21:58:55] we could move to row c? [21:59:16] maybe cp1019 and cp1020 ? [21:59:25] would those from a physical point of view be ok to move ? [21:59:49] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:00:19] they would...also cp1037/9/40 are turned off atm [22:00:26] they've been off since I've been here [22:01:16] AaronSchulz: hashar did actually try to run it on Mac OS X [22:01:19] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [22:01:24] and then he complained in gerrit about all the things that broke [22:01:29] so I added that error message in response [22:01:30] lol [22:03:00] oh realy ? 
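Since the generated ExtensionMessages file is require()'d by the web config, any warning text that gets written into it during generation becomes stray output and produces exactly the "headers already sent" errors quoted above. A cheap sanity check before syncing it out might be (file path from the log; just a sketch):

    f=/home/wikipedia/common/wmf-config/ExtensionMessages-1.22wmf1.php
    head -n 1 "$f"            # should be exactly "<?php"
    grep -n 'Warning' "$f"    # should print nothing; warning text here means stray output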
[22:03:23] well turned off is the easiest to do :) [22:03:36] so yeah, move those, and let me know the new location, I'll give them dns [22:04:41] really just taking out cp1039 and cp1040 should give us enough power for the new switch ? yeah ? [22:04:50] lesliecarr: yes [22:05:13] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:05:30] cool :) let me know the new ports/locations :) [22:05:35] woot [22:08:05] !log maxsem Finished syncing Wikimedia installation... : [22:08:11] Logged the message, Master [22:08:13] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [22:08:30] lesliecarr: they're going to asw-c7 0/25 amd 0/26 [22:08:40] New patchset: Pyoungmeister; "need to include --defaults-file for init script to actually work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57204 [22:08:53] cool :) [22:10:57] !log maxsem synchronized php-1.21wmf12/extensions/MobileFrontend [22:11:04] Logged the message, Master [22:12:13] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [22:14:15] !log maxsem synchronized php-1.21wmf12/extensions/MobileFrontend 'touch' [22:14:22] Logged the message, Master [22:16:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57204 [22:17:13] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:17:35] news everyone! the first of the pre-labsdb dbs is slaving [22:17:38] the data is flowing! [22:17:48] wooot [22:17:49] the spice must flow [22:18:04] the bits must flow... MUHAHAHAHAHAHA [22:18:29] also, I really like writing the phrase "pre-labsdb dbs" [22:19:56] say that 5 times fast [22:19:59] New patchset: Reedy; "Remove wgUseMemCached, died in 1.17" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57206 [22:20:19] !log authdns-update for cp1039 and cp1040 move [22:20:25] Logged the message, Mistress of the network gear. [22:23:53] !log maxsem synchronized php-1.21wmf12/includes/resourceloader/ResourceLoaderStartUpModule.php [22:24:00] Logged the message, Master [22:25:43] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [22:27:13] PROBLEM - Host db1053 is DOWN: PING CRITICAL - Packet loss = 100% [22:28:03] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:28:41] hi, who grants access to stats? https://rt.wikimedia.org/Ticket/Display.html?id=4835 [22:31:57] whoever is on rt duty [22:32:13] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [22:32:15] New patchset: Ori.livneh; "$wgNavigationTimingSamplingFactor: 10000 => 5000." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57207 [22:32:23] RECOVERY - Host db1053 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:33:17] PROBLEM - SSH on db1053 is CRITICAL: Connection refused [22:33:17] PROBLEM - NTP on db1053 is CRITICAL: NTP CRITICAL: No response from NTP server [22:34:06] ^^ binasher: see commit message. this will up the rate of navtiming events from ~2.75/s (current) to ~5.5/s. cool by you? [22:34:14] yurik: /topic says andrewbogott_afk [22:34:20] who is... afk :) [22:34:27] yurik: are your 3 days up? [22:34:48] jeremyb_, i think so [22:34:54] ori-l: yep! 
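A quick sanity check on that sampling change (assuming the factor means one event per N page views): roughly 2.75 events/s at 1-in-10000 implies about 2.75 x 10000 = 27,500 sampled page views per second, so halving the factor to 1-in-5000 at the same traffic level doubles the event rate to about 5.5/s, matching the commit message.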
[22:35:17] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:35:18] yurik: you could make a gerrit patchset (or i could if you like) and then you need to find someone to merge it. :) [22:35:53] jeremyb_, i could, which file though? [22:35:57] ori-l: mark has volunteered to write us something that will do ip address -> asa routing information, based on the current bgp tables on our routers [22:36:12] binasher: ooohhhh, very cool [22:36:22] yurik: you'll need to edit both manifests/site.pp and manifests/admins.pp [22:36:27] yurik: in operations/puppet [22:36:35] yurik: (this is stat1, right?) [22:36:39] yep [22:36:42] ori-l: waiting til we're capturing that to worry about visualizing or graphing, since just be country isn't necessarily going to be very useful [22:36:43] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56696 [22:36:46] i'm sure other servers will follow ;) [22:36:52] though might still be useful for trending [22:37:05] yurik: you'll probably be going into admins::restricted [22:37:53] yurik: https://gerrit.wikimedia.org/r/56958 can be your guide [22:38:09] thanks jeremyb_ ! [22:38:36] jeremyb_, are you sure, that looks like a removal/refactoring patch [22:38:47] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:38:49] binasher: makes sense [22:39:23] yurik: right. but you'll just do the reverse :) [22:39:31] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/56696/' [22:39:38] Logged the message, Master [22:39:46] yurik: or do you want me to do it for you? [22:39:56] its ok, need to learn :) [22:40:10] ok [22:40:28] just make sure you add to stat1 not vanadium [22:40:37] cmjohnson1: how's the moving going ? [22:40:40] ori-l: did you clean up vanadium manually? [22:41:34] the new switch is in the rack...(we are utilizing all possible u's) [22:41:48] i did not move the other servers since they were already off...figured i do that last (lesliecarr) [22:41:48] hehe awesome [22:41:53] ok cool :) [22:41:59] what's hte serial number of that switch ? [22:42:17] before we break open the braid and add that switch [22:42:17] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [22:42:58] RECOVERY - SSH on db1053 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:43:43] i just realized your patch won't remove all those people. (https://gerrit.wikimedia.org/r/56958) [22:43:47] ori-l: ^ [22:44:19] BP0211500170 (lesliecarr) [22:44:37] woot [22:46:17] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [22:47:04] jeremyb_, what number should i use for $uid in admins.pp? [22:47:39] yurik: just pick a new one that's low but also higher than everything else in the file [22:47:59] and keep in mind that they're not sorted [22:48:18] cmjohnson1: so attach one of the braid cables to the new switch ? [22:48:22] which is "member 9" [22:48:50] one of the cables going to asw-a3? [22:49:20] !log maxsem synchronized php-1.21wmf12/resources/Resources.php 'https://gerrit.wikimedia.org/r/57208' [22:49:28] Logged the message, Master [22:50:05] jeremyb_: what do you mean? home directories? i left them alone, since they weren't taking up much space anyhow [22:50:23] ori-l: look at the current live def of stat1 [22:50:25] yeah, do we have any extra cables ? [22:50:32] yurik: i'm sorry. 
not *everything else* in the file. everything else below 1000 [22:50:46] jeremyb_: live def? [22:50:49] jeremyb_, no worries, figured :) [22:51:15] (eep i hope, because i'd prefer to not have it single-braided) [22:51:17] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [22:51:17] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [22:51:41] ori-l: removing those lines doesn't disable their accounts [22:51:53] jeremyb_: oh, i removed them from /etc/passwd [22:52:03] i thought that that was cleaner than a bunch of ensure => absents [22:52:31] ori-l: ewww [22:52:38] jeremyb_, in site.pp, should i add myself to "include accounts::yurik, or to sudo_user { [ ... ? [22:52:47] that patch you showed me used sudo [22:52:51] yurik: depends what your ticket says :) [22:52:56] but i don't think i need that [22:53:01] i just need read access [22:53:09] ori-l: leave them in /etc/passwd but remove their sudo and ~/.ssh/authorized_keys [22:53:13] yurik: right [22:53:18] jeremyb_: did that too [22:53:21] jeremyb_, ticket doesn't say anything about it :) [22:53:31] jeremyb_, right what? :) [22:53:32] yurik: so, submit and I'll review [22:53:39] include? [22:53:48] yurik: "i just need read access" means no sudo [22:54:42] lesliecarr: idk...i think we do but I need to check [22:55:07] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [22:55:25] ok [22:57:19] New patchset: Yurik; "(rt 4835) Added yurik account for stat1 non-sudo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [22:57:30] jeremyb_, ^ [22:58:07] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [23:00:57] !log maxsem synchronized php-1.21wmf12/resources/Resources.php 'https://gerrit.wikimedia.org/r/#/c/57212/' [23:00:59] lesliecarr: what are we calling the switch asw2-a3? [23:01:03] Logged the message, Master [23:01:12] sounds good to me [23:01:13] :) [23:01:16] it's the tradition [23:02:20] yurik: ugh, a 1024 bit key? [23:02:23] let's keep up with tradition than [23:02:33] idk if we have a policy on key length... [23:03:14] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [23:03:27] lesliecarr: you have console access now [23:04:08] yay [23:04:32] jeremyb_, i think so, i used puttygen to generate SSH-2 RSA 1024 key. Should i use something else? [23:05:14] PROBLEM - DPKG on db1053 is CRITICAL: NRPE: Command check_dpkg not defined [23:05:24] PROBLEM - Disk space on db1053 is CRITICAL: NRPE: Command check_disk_space not defined [23:05:34] PROBLEM - RAID on db1053 is CRITICAL: NRPE: Command check_raid not defined [23:05:35] yurik: i'd say 2048 or 4096 [23:05:44] yurik: and ewwwww, windows??? [23:05:45] SSH2 RSA? [23:05:48] yes [23:05:51] hehe [23:05:55] looks ready to attach to the stack now :) [23:06:11] * MaxSem throws CP/M at jeremyb_ [23:06:14] * anomie prepares to make use of the lightning deploy window [23:06:41] anomie, RC continue? [23:06:47] yurik- Yes [23:07:34] RECOVERY - RAID on db1053 is OK: OK: State is Optimal, checked 2 logical device(s) [23:08:14] RECOVERY - DPKG on db1053 is OK: All packages OK [23:08:24] RECOVERY - Disk space on db1053 is OK: DISK OK [23:08:35] lesliecarr: okay..i have cables...how do you want to do this? 
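On the key-length point jeremyb_ raises: a 1024-bit RSA key can simply be regenerated at 2048 or 4096 bits with OpenSSH and the new public half put into the puppet patch. A minimal sketch; the comment string and file path here are just examples, not what was actually used:

```sh
# Generate a 4096-bit RSA key pair instead of a 1024-bit PuTTY-generated one.
ssh-keygen -t rsa -b 4096 -C "yurik@stat1" -f ~/.ssh/id_rsa_wmf
# The public key (~/.ssh/id_rsa_wmf.pub) is what goes into manifests/admins.pp.
```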
[23:10:12] so i think in the end it should go asw-a2 <-> asw-a3 <-> asw2-a3 <-> asw-a4 [23:10:14] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [23:10:26] !log anomie synchronized php-1.21wmf12/includes/api/ApiQueryRecentChanges.php 'Fix for API list=recentchanges rccontinue' [23:10:33] Logged the message, Master [23:10:35] but to start unplug the asw-a3 to asw-a4 cable [23:10:53] !log anomie synchronized php-1.22wmf1/includes/api/ApiQueryRecentChanges.php 'Fix for API list=recentchanges rccontinue' [23:11:00] Logged the message, Master [23:11:04] RECOVERY - NTP on db1053 is OK: NTP OK: Offset -0.009376049042 secs [23:11:07] every time i look at this file i find problems [23:11:18] siebrand: daniel kinzler has a key named siebrand? [23:12:10] hah [23:12:25] * anomie is done. LIGHTNING DEPLOY!!!11one [23:12:25] give me a few mins missing some labels [23:12:35] anomie: :) [23:13:42] New patchset: Pyoungmeister; "pointing search traffic back at eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57215 [23:13:44] PROBLEM - Varnish HTTP upload-backend on cp1028 is CRITICAL: Connection refused [23:14:48] lesliecarr: are you monitoring..i am taking about what I believe to be the asw-a3-asw-a4 [23:15:00] cmjohnson1: are you about? [23:15:14] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [23:15:15] oh, oyu just said something, so probably :) [23:15:16] yes but in the middle of something [23:15:18] ok [23:15:33] no biggy. will just make a ticket :) [23:15:44] RECOVERY - Varnish HTTP upload-backend on cp1028 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.020 second response time [23:15:49] ok..cool [23:15:50] thx [23:16:06] cmjohnson1: yes monitoring [23:17:29] yay see the unplugged cable [23:17:32] yurik: so want to make a new key? [23:17:37] ok...right one? [23:17:43] yep [23:17:47] cool [23:17:58] jeremyb_, commiting, one sec [23:18:06] wanna make that asw2-asw4 [23:18:19] sorry asw3 -4 [23:19:04] why don't you hook up asw-a3's interface vcp1 to asw2-a3 vcp0 ? [23:19:11] yurik: hold on [23:19:29] yurik: also, add yourself to admins::restricted. in admins.pp [23:19:56] ok [23:20:41] anomie|away-ish: thanks for noting it on the deploy wiki [23:20:48] greg-g- You're welcome [23:21:31] New patchset: Yurik; "(rt 4835) Added yurik account for stat1 non-sudo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [23:21:33] jeremyb_, ^ [23:23:14] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [23:23:56] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57215 [23:24:41] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqiad' [23:24:42] why do some people have , at the end of the key and some ; ? [23:24:47] Logged the message, Master [23:25:04] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [23:25:25] i hate this file. i want to switch us to hiera... [23:26:16] jeremyb_: {{sofixit}} ;) [23:26:25] Reedy: ikr [23:26:34] and it's not even a wiki [23:26:36] omg [23:27:01] Reedy: hey at least there's a reasonable chance i could fix it. stuff like DNS i can't fix [23:27:30] You could propose patches based on the unknown... 
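The patch workflow yurik is being walked through above is the usual clone-edit-submit loop against operations/puppet. A rough sketch, assuming the git-review tool is installed; the clone URL, branch name, and commit message are illustrative:

```sh
# Sketch of the Gerrit workflow described above (details illustrative).
git clone https://gerrit.wikimedia.org/r/operations/puppet
cd puppet
git checkout -b rt4835-yurik-stat1
# edit manifests/admins.pp (new admins::restricted account, unique uid below 1000)
# and manifests/site.pp (include the new account class on the stat1 node definition)
git add manifests/admins.pp manifests/site.pp
git commit -m "(RT 4835) Add yurik account for stat1, non-sudo"
git review          # pushes the change to Gerrit for someone with +2 to merge
```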
[23:27:31] :D [23:28:00] New patchset: Pyoungmeister; "using db1057 for prelabsdb db instead of db1055" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57218 [23:29:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57218 [23:30:04] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:32:14] PROBLEM - RAID on analytics1022 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on wtp1002 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on stafford is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on analytics1015 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - RAID on wtp1003 is CRITICAL: Timeout while attempting connection [23:32:25] PROBLEM - Disk space on db1033 is CRITICAL: Timeout while attempting connection [23:33:24] PROBLEM - Host cp1023 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1018 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1012 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1014 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:24] PROBLEM - Host cp1024 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:25] PROBLEM - Host cp1006 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:25] PROBLEM - Host cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:26] PROBLEM - Host cp1016 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:26] PROBLEM - Host cp1015 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:35] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57207 [23:33:49] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57178 [23:34:02] PROBLEM - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:02] PROBLEM - Host cp1044 is DOWN: CRITICAL - Host Unreachable (208.80.154.54) [23:34:02] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [23:34:02] PROBLEM - Host cp1043 is DOWN: CRITICAL - Host Unreachable (208.80.154.53) [23:34:02] PROBLEM - Host cp1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:03] PROBLEM - Host cp1003 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:03] PROBLEM - Host cp1009 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:04] PROBLEM - Host cp1008 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:04] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.027 second response time [23:34:07] PROBLEM - DPKG on cp1021 is CRITICAL: Timeout while attempting connection [23:34:07] PROBLEM - Host cp1042 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:07] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:07] PROBLEM - Host cp1036 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:07] PROBLEM - Host cp1013 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:08] PROBLEM - Host cp1034 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:08] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:09] PROBLEM - Host cp1007 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:11] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [23:34:14] RECOVERY - Host cp1002 is UP: PING WARNING - Packet loss = 50%, 
RTA = 0.41 ms [23:34:15] RECOVERY - Host cp1001 is UP: PING WARNING - Packet loss = 50%, RTA = 0.32 ms [23:34:15] RECOVERY - Host cp1003 is UP: PING WARNING - Packet loss = 50%, RTA = 0.37 ms [23:34:15] RECOVERY - Host cp1007 is UP: PING WARNING - Packet loss = 44%, RTA = 0.31 ms [23:34:15] RECOVERY - Host cp1006 is UP: PING WARNING - Packet loss = 44%, RTA = 0.28 ms [23:34:15] RECOVERY - Host cp1008 is UP: PING WARNING - Packet loss = 44%, RTA = 0.29 ms [23:34:20] egads [23:34:21] RECOVERY - Host cp1012 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [23:34:21] RECOVERY - Host cp1009 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [23:34:21] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [23:34:21] RECOVERY - Host cp1015 is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [23:34:21] RECOVERY - Host cp1034 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [23:34:22] RECOVERY - Host cp1016 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [23:34:22] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [23:34:23] RECOVERY - Host cp1018 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [23:34:23] RECOVERY - Host cp1013 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [23:34:24] RECOVERY - Host cp1014 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [23:34:24] RECOVERY - Host cp1023 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [23:34:25] RECOVERY - Host cp1042 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [23:34:25] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:34:26] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 16%, RTA = 38.41 ms [23:34:26] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [23:34:27] RECOVERY - Host cp1024 is UP: PING OK - Packet loss = 16%, RTA = 61.46 ms [23:34:27] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [23:34:28] RECOVERY - LVS HTTPS IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61448 bytes in 0.052 second response time [23:34:30] bad switch [23:34:31] whoa [23:34:31] RECOVERY - DPKG on cp1021 is OK: All packages OK [23:34:31] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 61448 bytes in 0.025 second response time [23:34:32] leslicarr: wtf? [23:34:40] LeslieCarr: are those all on one switch? [23:34:45] the "totally not impacting" procedures …. 
impact [23:34:46] yep [23:34:47] yeah, asw-a-eqiad [23:34:48] bad switches get stitches [23:34:49] ah, ok [23:34:52] :D [23:34:54] binasher: true story [23:35:01] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19631 bytes in 0.004 second response time [23:35:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [23:35:14] ouch [23:35:30] so asw-a2 <<>> asw2-a3 is linked [23:36:41] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:37:31] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:38:11] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:42:31] PROBLEM - RAID on wtp1 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - RAID on analytics1014 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - RAID on analytics1021 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection timed out [23:42:42] PROBLEM - RAID on snapshot1 is CRITICAL: Timeout while attempting connection [23:42:42] PROBLEM - RAID on ms-be8 is CRITICAL: Timeout while attempting connection [23:42:43] PROBLEM - RAID on snapshot1004 is CRITICAL: Timeout while attempting connection [23:42:43] PROBLEM - RAID on solr1 is CRITICAL: Timeout while attempting connection [23:42:44] PROBLEM - RAID on analytics1011 is CRITICAL: Timeout while attempting connection [23:42:44] PROBLEM - RAID on wtp1004 is CRITICAL: Timeout while attempting connection [23:42:51] PROBLEM - RAID on snapshot3 is CRITICAL: Timeout while attempting connection [23:42:51] PROBLEM - RAID on solr1001 is CRITICAL: Timeout while attempting connection [23:42:51] PROBLEM - RAID on ms-be6 is CRITICAL: Timeout while attempting connection [23:43:31] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [23:43:51] PROBLEM - Host cp1044 is DOWN: CRITICAL - Host Unreachable (208.80.154.54) [23:43:51] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [23:43:56] LeslieCarr: ? [23:44:01] PROBLEM - Host cp1043 is DOWN: CRITICAL - Host Unreachable (208.80.154.53) [23:44:11] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:44:11] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 40.81 ms [23:44:15] my finely tuned spider sense says there might be a subtle problem somewhere [23:44:19] bad gateway for me on mw.org [23:44:21] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [23:44:21] PROBLEM - Host cp1041 is DOWN: PING CRITICAL - Packet loss = 100% [23:44:31] and for me on enwiki [23:44:31] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 632 bytes in 0.162 second response time [23:44:31] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [23:44:31] RECOVERY - Host cp1041 is UP: PING OK - Packet loss = 0%, RTA = 42.54 ms [23:44:32] sorry [23:44:35] bblack: I think you're on to something... [23:44:35] is it ok now ? [23:44:41] that was my fault [23:44:46] bblack: is that puppetized? [23:44:51] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 863 bytes in 0.001 second response time [23:44:59] oh ? [23:45:02] what's up cmjohnson1 ? 
[23:45:07] lesliecarr: so asw-a3 is connected asw2-a3 [23:45:33] i pulled the cable from asw-a2 [23:45:38] ah ok [23:45:41] !log olivneh synchronized wmf-config/CommonSettings.php '(Ibc2633f1c) : 10000 => 5000.' [23:45:47] hrm this is interesting [23:45:48] Logged the message, Master [23:45:50] cp1041 is showing significant increase in memory usage and load over the last hour or so; we just finished a mobile deployment, about an hour ago… is this likely related to the switch issue, or something we did/ [23:45:51] ? [23:45:51] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 19669 bytes in 0.002 second response time [23:45:54] !log olivneh synchronized wmf-config/InitialiseSettings.php '(Ia2665c4fe) Enable PostEdit on bn, br, ca, cs, et, ka and zh wikis' [23:46:01] Logged the message, Master [23:46:02] the rest of the mobile varnish cache cluster looks pretty normal tho [23:46:08] waiting for member 9 to join … but it's not very happy [23:46:16] awjr: nope, switch issue [23:46:22] ok phew [23:46:27] for me, at least :p [23:47:47] ok so it's "sort of connected" management wise but not attaching tot he forwarding plane [23:47:49] investigating [23:49:27] New patchset: Jeremyb; "(RT 4835) Added yurik account for stat1 non-sudo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57210 [23:49:58] lesliecarr: do i need to connect asw2-a3 to asw-a4? [23:50:07] yurik: made a couple tweaks [23:50:24] yes but not yet since it's still not actively working - if you'd like to cnnect the second interfaces of the cp's now , that would be ok [23:50:26] oops, sorry about spacing - my space comp was off [23:51:06] lesliecarr: all cp's or just cp1021-1034? [23:51:18] yurik: spacing? that was someone else's mistake. i'm just taking the opportunnity to fix it [23:51:35] hmm, i thought i made that, nvm [23:52:13] just cp1021 to 1034 [23:52:14] :) [23:54:45] New review: Jeremyb; "Looks good to me (assuming the ticket is ok, haven't seen it)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/57210 [23:55:01] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [23:57:30] yurik: so, poke andrewbogott now that he's back :) [23:57:55] yurik, what's up? [23:58:05] https://gerrit.wikimedia.org/r/57210 [23:58:10] andrewbogott, ^ :) [23:59:27] and jeremyb_, don't worry about the RT ticket, i faked it using tfinc's stollen laptop ;) [23:59:39] !log olivneh synchronized php-1.21wmf12/extensions/PostEdit [23:59:45] Logged the message, Master
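On the "connected management-wise but not attaching to the forwarding plane" symptom: on an EX4200 stack this would normally be diagnosed from the virtual-chassis master by checking member and VC-port state. A sketch of the sort of commands involved, run here over ssh to the switch; the hostname is illustrative and the exact diagnosis steps used that night are not in the log:

```sh
# From a host with CLI access to the virtual-chassis master (Junos EX4200);
# "asw-a3.mgmt" is a placeholder hostname.
ssh asw-a3.mgmt "show virtual-chassis status"    # is member 9 listed and present?
ssh asw-a3.mgmt "show virtual-chassis vc-port"   # are the vcp links between members Up?
```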