[00:02:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:24] but i'm basically like those people who flood police tip hotlines with useless tips (neighbor hasn't shaved!) so i'm going to stop now
[00:05:01] PROBLEM - Puppet freshness on mc1008 is CRITICAL: Puppet has not run in the last 10 hours
[00:12:10] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:14:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.276 second response time
[00:20:11] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[00:41:11] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:43:12] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:50:10] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[00:55:49] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds
[00:56:09] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds
[00:57:09] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:03:19] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:06:09] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[01:06:49] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[01:15:20] New review: Tim Starling; "Looks good, thanks for that. Can be merged once the whitespace issues are fixed." [operations/debs/lucene-search-2] (master) C: -1; - https://gerrit.wikimedia.org/r/53299
[01:22:19] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:23:21] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:31:21] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:31:22] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:34:19] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:44:20] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:45:05] New patchset: Ori.livneh; "Add 'eventlogging' puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54324
[01:46:10] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:47:19] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:51:14] ops: I'd appreciate if someone gave RT 4752 a fitting title. I left it out and am not allowed to edit. Something like 'Add database access credentials for EventLogging to puppet-private'.
[01:55:21] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:58:46] Red herring is my favorite type of herring.
[02:00:26] salted, smoked, or pickled?
[02:03:09] Deep-fried.
[02:05:20] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:07:09] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:24:53] New patchset: Ram; "Bug: 45266 Use sequence numbers instead of timestamps" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299
[02:30:04] !log LocalisationUpdate completed (1.21wmf11) at Mon Mar 18 02:30:03 UTC 2013
[02:30:11] Logged the message, Master
[02:32:06] New review: Ram; "Patch set 4 fixes whitespace issues." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299
[02:43:09] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:47:10] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:49:32] ori-l: how do you like putting eventlogging creds in a my.cnf formatted file?
[02:50:24] funny that ori-l can't edit. /me edits
[02:51:04] it could work; the format is more or less equivalent to the one used by python's ConfigParser module
[02:51:53] ori-l: refresh the ticket
[02:52:21] thanks, jeremyb_
[02:52:59] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:10] ori-l: you forgot to ask if it's wine sauce or cream sauce
[02:53:32] i knew what the answer would be
[02:53:41] really??
[02:53:51] no :)
[02:56:04] i hear http://www.vitafoodproducts.com/c-68-herring.aspx is a particularly good brand. of course they lose points for .NET
[03:01:10] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:06:09] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:17:10] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[03:24:27] New review: Tim Starling; "I don't think it's appropriate to put load balancing code in configuration files. You should set $wg..." [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/43029
[03:37:09] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:39:47] New review: Tim Starling; "Note that the vhost will stay there indefinitely if this change is merged, since there is no ensure=..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52742
[03:58:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:00:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.360 second response time
[04:04:59] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours
[04:09:59] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[04:16:25] New review: Tim Starling; "In general, you can't have different redirects for different protocols using mod_rewrite, see https:..." [operations/apache-config] (master) C: -2; - https://gerrit.wikimedia.org/r/47088
[04:19:21] ori-l: yt?
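The my.cnf/ConfigParser equivalence mentioned at 02:51 can be checked directly. A minimal sketch, using a hypothetical path and placeholder credentials; real my.cnf files can also carry directives (such as !include) that Python 2's ConfigParser does not accept, which is why the formats are only "more or less" equivalent:

    $ cat > /tmp/eventlogging.cnf <<'EOF'    # hypothetical file, placeholder values
    [client]
    user = eventlogging
    password = not-the-real-password
    EOF
    $ python -c '
    import ConfigParser                      # Python 2 module name
    p = ConfigParser.RawConfigParser()
    p.read("/tmp/eventlogging.cnf")
    print p.get("client", "user")'
    eventlogging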
[04:33:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds
[04:33:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 198 seconds
[04:34:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:38:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[04:39:59] PROBLEM - Puppet freshness on mw1061 is CRITICAL: Puppet has not run in the last 10 hours
[04:49:49] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[04:50:01] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[04:57:01] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[04:59:05] New review: Tim Starling; "Reviewed the complete diff." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/53125
[05:05:53] ori-l: well, i'm running away. you should send a puppet changeset that makes an eventlogging.conf similarly to how https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=commitdiff;h=6fb59e38e59c88c4938a6ca412598f6a7b3d5741 does
[05:13:00] jeremyb_: ah, that makes total sense.
[05:13:49] New review: Tim Starling; "(8 comments)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299
[05:16:18] New patchset: Tim Starling; "Bug: 45266 Use sequence numbers instead of timestamps" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299
[05:16:56] New review: Tim Starling; "PS5: fixed space indenting" [operations/debs/lucene-search-2] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/53299
[05:16:57] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299
[05:23:01] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:01] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:01] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:59:09] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[06:02:59] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours
[06:04:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[06:21:19] New patchset: ArielGlenn; "bug fix, handle partial buffers that don't start with open parenthesis" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/54330
[06:37:01] PROBLEM - Puppet freshness on db56 is CRITICAL: Puppet has not run in the last 10 hours
[06:38:59] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[07:22:19] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds
[07:22:49] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 195 seconds
[07:30:07] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/54330
[07:33:19] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[07:33:51] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[07:39:29] PROBLEM - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[07:40:41] PROBLEM - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:40:41] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:40:41] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:40:41] PROBLEM - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[07:40:41] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:40:44] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[07:40:44] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:40:44] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[07:41:14] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:41:14] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:41:14] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[07:41:14] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:41:14] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:41:59] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[07:42:19] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.012 second response time
[07:42:19] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15993 bytes in 0.041 second response time
[07:42:44] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.006 second response time
[07:42:44] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.021 second response time
[07:42:44] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.021 second response time
[07:42:44] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.016 second response time
[07:42:44] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3846 bytes in 0.001 second response time
[07:42:54] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.016 second response time
[07:42:55] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66823 bytes in 0.022 second response time
[07:42:55] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 602 bytes in 0.013 second response time
[07:43:22] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.021 second response time
[07:43:22] RECOVERY - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66821 bytes in 0.011 second response time
[07:43:22] RECOVERY - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3852 bytes in 0.009 second response time
[07:43:35] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 66822 bytes in 0.012 second response time
[07:43:35] RECOVERY - LVS HTTP IPv6 on foundation-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 66822 bytes in 0.007 second response time
[08:00:13] helo
[08:06:16] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53875
[08:13:22] New patchset: ArielGlenn; "new tool fifo_to_mysql.pl for feeding chunks to LOAD DATA INFILE" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/54337
[08:29:04] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[08:33:35] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 196 seconds
[08:33:45] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 199 seconds
[08:36:35] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[08:36:45] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[08:41:04] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[08:50:06] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/54337
[08:59:47] New patchset: Krinkle; "Add sudo user "krinkle" on gallium." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53861
[09:28:34] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds
[09:28:35] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds
[09:31:35] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 193 seconds
[09:31:35] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 194 seconds
[09:33:29] New patchset: Hashar; "apache confs for nagios.* are no more needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54341
[09:37:35] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 196 seconds
[09:37:55] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 204 seconds
[09:39:36] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[09:39:36] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[09:39:44] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[09:39:56] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[09:43:52] New patchset: Hashar; "(bug 45926) b/c for nagios URL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54343
[10:05:19] New patchset: Silke Meyer; "Adjusted load balancer settings on Wikidata test repos." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54344
[10:06:04] PROBLEM - Puppet freshness on mc1008 is CRITICAL: Puppet has not run in the last 10 hours
[10:34:44] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 189 seconds
[10:34:55] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 197 seconds
[10:37:55] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[10:38:46] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[11:02:54] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:02:54] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:03:45] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 1.466 second response time
[11:03:54] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[11:06:56] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:14:46] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time
[11:27:37] !log restarted both the varnishncsas on niobium, they were giant again
[11:27:43] Logged the message, Master
[11:34:04] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 182 seconds
[11:34:55] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 209 seconds
[11:36:54] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[11:37:05] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[11:47:11] varnishncsa was?
[11:47:28] not varnish?
[11:54:02] "again"?
[11:56:50] not varnish
[11:57:09] and yes, again (well recently I only restarted the vanadium one, iirc, I logged it at the time)
[12:13:59] New patchset: Matmarex; "(bug 45911) Set $wgCategoryCollation to 'uca-pt' for the Portuguese Wikipedia and Wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52903
[12:15:42] New patchset: Matmarex; "(bug 45968) Set $wgCategoryCollation to 'uca-pl' on Polish Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54352
[12:18:48] New patchset: Matmarex; "(bug 45596) Set $wgCategoryCollation to 'uca-hu' on Hungarian Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54353
[12:26:27] New review: Nikerabbit; "Why is this not done for all wikis in the same language?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54353
[12:26:43] New review: Peachey88; "Possibly causes Bug 46264?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46826
[12:33:54] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 187 seconds
[12:35:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 3 seconds
[12:39:49] New patchset: Mark Bergsma; "Update streaming range patch to M.B.Grydeland's updated version" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54358
[12:39:49] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm7) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54359
[12:39:49] New patchset: Mark Bergsma; "Disable internal jemalloc so the system jemalloc can be used" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54360
[12:39:50] New patchset: Mark Bergsma; "varnish (3.0.3plus-rc1-1~1.gbpae5519) UNRELEASED; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54361
[12:39:50] New patchset: Mark Bergsma; "Refresh the varnishncsa udplog patch against 3.0.3plus-rc1" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54362
[12:39:50] New patchset: Mark Bergsma; "Remove escaping of spaces in header lines" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54363
[12:39:51] New patchset: Matmarex; "(bug 46005) Set $wgCategoryCollation to 'uca-be-tarask' on be-x-old.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54364
[12:39:51] New patchset: Matmarex; "(bug 46004) Set $wgCategoryCollation to 'uca-be' on be.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54365
[12:41:11] Change merged: Mark Bergsma; [operations/debs/ganglia] (master) - https://gerrit.wikimedia.org/r/53374
[12:42:23] New patchset: Mark Bergsma; "Update streaming range patch to M.B.Grydeland's updated version" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54358
[12:42:39] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54358
[12:42:59] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm7) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54359
[12:43:08] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54359
[12:43:38] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54360
[12:43:49] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54361
[12:45:45] New patchset: Mark Bergsma; "Refresh the varnishncsa udplog patch against 3.0.3plus-rc1" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54362
[12:45:46] New patchset: Mark Bergsma; "Remove escaping of spaces in header lines" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54363
[12:48:16] New review: Matmarex; "Because that's how it was done before, because for now I'm just trying to deal with the wikis that s..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54353
[12:52:23] apergos: do you have any spare cycles?
[12:52:31] :-D
[12:52:36] no but ask anyways, what's up?
[12:52:36] :)
[12:53:32] New review: Wizardist; "Lacks bewikisource config. Community notice link provided in Bugzilla." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/54365
[12:53:37] I'm building a new Swift version
[12:53:50] to fix a bug that's been holding the deployment of large files
[12:53:52] (according to Aaron)
[12:54:05] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:54:05] (RT 4499)
[12:54:12] would you like to be involved in the upgrade?
[12:54:21] eh might as well :-D
[12:54:51] I see
[12:55:09] we still have boxes on the old old version
[12:55:20] as they get replaced they get 1.7.x
[12:55:47] but given how slow that process is you may want to do this round differently
[12:57:30] New patchset: Matmarex; "(bug 46004) Set $wgCategoryCollation to 'uca-be' on be.wikipedia and be.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54365
[12:57:30] New patchset: Matmarex; "(bug 46081) Set $wgCategoryCollation to 'uca-default' on Polish Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54367
[12:57:55] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds
[12:58:01] no, that's for proxies only
[12:58:03] so that should be okay
[12:58:05] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds
[12:58:18] ah good
[12:58:21] New review: Hashar; "I have confirmed that it works properly by crafting a tiny job that creates a single files." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53990
[12:58:27] how's the box replacement going?
[12:58:27] tediously slow
[12:58:44] where are we now?
[12:58:47] and when we finally get rid of all the old ones we still get to pull out the new ones that have the wrong controller and ssds >_<
[12:59:00] https://wikitech.wikimedia.org/wiki/Swift/Deploy_Plan_-_R720xds_in_tampa
[12:59:18] Mon Mar 18 - done remove weight from ms-be9 to 0 add weight to ms-be12 to 66
[12:59:43] after this next box comes out there's still three to go
[12:59:55] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[12:59:58] three c2100s?
[13:00:03] yep
[13:00:06] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[13:01:47] !log depooling ms-fe1 for new swift version (patched 1.7.4)
[13:01:54] Logged the message, Master
[13:02:17] but you're replacing those with H710 boxes now, right?
[13:02:20] yep
[13:02:29] the ones coming out now get replaced with the right stuff
[13:02:40] May?
[13:02:42] omg
[13:02:45] it's the 710s that got put in before you discovered the controller issue that are the problem
[13:02:46] yeah
[13:02:51] so this is like a full 6 months to replace boxes
[13:02:56] tell me about it, makes me want to shoot something (maybe me)
[13:03:03] yes, horrid is as horrid does
[13:12:46] apergos: I'm restarting the rest
[13:12:54] ok
[13:13:17] poor swift, 1.4k req/s
[13:13:46] how is ceph coming along?
[13:15:16] haven't had the chance to give it the love it needs yet
[13:16:00] would love to try it out sometime
[13:16:20] sure
[13:16:33] are you doing dns first?
[13:16:51] I'd like to, yes
[13:17:03] but ceph is much more important
[13:17:07] considering it's almost there...
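The ring-weight shuffle quoted from the deploy plan at 12:59 (drain ms-be9 to weight 0, raise ms-be12 to 66) is done per builder file with swift-ring-builder. A minimal sketch, with hypothetical builder and device search values; the real procedure also rebalances and pushes the rebuilt ring out to every node:

    # drain the outgoing node, ramp up its replacement, then recompute partitions
    $ swift-ring-builder object.builder set_weight d9 0     # 'd9' is a placeholder device search value
    $ swift-ring-builder object.builder set_weight d12 66
    $ swift-ring-builder object.builder rebalance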
[13:19:01] hm, swift object count stopped working an hour ago
[13:19:13] oh god, I upgraded ganglia-frontend on the boxes
[13:19:27] that means ben's ganglia plugin might have stopped working
[13:19:29] oh dear
[13:19:34] shouldn't
[13:23:27] !log depooled, upgraded, restarted and pooled again all ms-fe[1-4]
[13:23:33] Logged the message, Master
[13:24:14] RECOVERY - Puppet freshness on search20 is OK: puppet ran at Mon Mar 18 13:24:11 UTC 2013
[13:25:54] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Swift%20pmtpa&h=ms-fe1.pmtpa.wmnet&v=85.6663827975&m=swift_200_hits_%25&r=hour&z=default&jr=&js=&st=1363613021&vl=hps&z=large
[13:25:54] lol
[13:25:54] something's broken since... May last year
[13:25:54] and fixed now
[13:26:41] and yet that broke http://ganglia.wikimedia.org/latest/graph_all_periods.php?m=swift_object_count&z=small&h=Swift+pmtpa+prod&c=Swift+pmtpa&r=hour
[13:29:17] it's a bunch of cronjobs in root crontab
[13:30:35] the logtailer seems to work, somehow
[13:30:52] huh
[13:30:57] I'm trying to see where that swift_object_count is generated
[13:31:16] doesn't seem to be logtailer
[13:31:48] /usr/local/bin/swift-ganglia-report-global-stats does that
[13:32:07] yeah
[13:32:09] just found that :)
[13:32:39] it uses gmetric
[13:32:47] yuck
[13:33:16] logtailer too
[13:33:54] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 191 seconds
[13:34:06] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 191 seconds
[13:34:43] know what it might be?
[13:34:53] gmond no longer runs as nobody:root or ganglia:root now
[13:35:01] the new version actually does setgid()
[13:36:33] ha!
[13:36:35] that's exactly it
[13:36:43] there's a config file with the password
[13:36:58] root:root
[13:37:04] hehe
[13:37:04] hehe
[13:37:12] I found it before I switched back to IRC
[13:37:27] * mark giggles
[13:37:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 13 seconds
[13:37:56] RECOVERY - Puppet freshness on mw1061 is OK: puppet ran at Mon Mar 18 13:37:51 UTC 2013
[13:38:05] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 5 seconds
[13:39:18] hm, although
[13:39:28] hm, maybe not
[13:39:33] the script runs from root's crontab
[13:46:33] did you touch ganglia?
[13:46:37] it's throwing php errors now
[13:46:44] I didn't
[13:46:55] ok
[13:46:57] * mark looks at it
[13:49:33] these are some hella error outputs
[13:50:50] yeah i get ganglia probs too, the php files are being served directly rather than executing
[13:51:22] interesting
[13:51:28] looks like the php apache module isn't enabled anymore
[13:51:44] lol?
[13:51:52] there was just an apache USN released
[13:51:57] do we have ensure => latest?
[13:52:02] doubt it
[13:53:14] did you run an upgrade?
[13:53:21] running it now
[13:53:24] iU apache2-utils 2.2.14-5ubuntu8.11 utility programs for webservers
[13:53:27] ii apache2.2-bin 2.2.14-5ubuntu8.11 Apache HTTP Server common binary files
[13:53:31] ah
[13:53:33] okay, wasn't sure if it was puppet or you
[13:53:47] someone just set up https for ganglia, right? did that change something?
[13:54:05] apache2: Syntax error on line 204 of /etc/apache2/apache2.conf: Syntax error on line 1 of /etc/apache2/mods-enabled/php5.load: Cannot load /usr/lib/apache2/modules/libphp5.so into server: /usr/lib/apache2/modules/libphp5.so: cannot open shared object file: No such file or directory
[13:54:30] root@nickel:/var/log# apt-cache policy libapache2-mod-php5
[13:54:30] libapache2-mod-php5:
[13:54:30] Installed: (none)
[13:54:30] Candidate: 5.3.2-2wm1
[13:54:33] wha?!
[13:55:04] 2013-03-18 13:35:00 remove libapache2-mod-php5 5.3.2-2wm1 5.3.2-2wm1
[13:55:06] aha
[13:55:10] mpm-worker
[13:55:12] probably removed php
[13:56:45] Mar 18 13:35:14 nickel puppet-agent[25055]: (/Stage[main]/Webserver::Php5/Package[apache2]/ensure) ensure changed '2.2.14-5ubuntu8.10' to '2.2.14-5ubuntu8.11'
[13:57:02] package { [ "apache2", "libapache2-mod-php5" ]:
[13:57:03] ensure => latest;
[13:57:03] }
[13:57:06] oooof course.
[13:57:50] since you are looking that and here, paravoid, is ensure => latest a good thing to do?
[13:58:06] seems error prone, puppet will upgrade things when you aren't looking
[13:58:09] right?
[13:58:13] that's right
[13:58:18] so you use it only when that's not a problem
[13:58:22] like ganglia ;-p
[13:58:24] I hate ensure => latest
[13:58:30] ok cool, in general me too
[13:58:38] i've seen it in a lot of our manifests and was wondering
[13:59:04] PROBLEM - Puppet freshness on mw1104 is CRITICAL: Puppet has not run in the last 10 hours
[13:59:04] PROBLEM - Puppet freshness on mw1131 is CRITICAL: Puppet has not run in the last 10 hours
[13:59:58] ok ganglia back up
[14:00:06] PROBLEM - Puppet freshness on mw1124 is CRITICAL: Puppet has not run in the last 10 hours
[14:00:52] thank you!
[14:01:23] New patchset: Faidon; "Remove ensure => latest from webserver.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54371
[14:02:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54371
[14:02:44] why
[14:02:57] New patchset: Mark Bergsma; "Install apache2-mpm-prefork instead of letting APT install -worker" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54372
[14:02:57] thanks for conflicting
[14:03:53] New patchset: Mark Bergsma; "Revert "Remove ensure => latest from webserver.pp"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54373
[14:04:02] lol
[14:04:04] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours
[14:04:07] commit wars? :)
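The root cause just isolated: webserver::php5 declared apache2 with ensure => latest (the manifest excerpt quoted at 13:57), so puppet applied the USN security update unattended and apt's resolver pulled in apache2-mpm-worker, which removed libapache2-mod-php5. For contrast, a sketch of the two patterns under discussion — this is not the final state of webserver.pp, which ended up keeping latest (after the revert) plus an explicit apache2-mpm-prefork:

    # what bit here: 'latest' lets puppet apply upgrades (and their dependency changes) unattended
    package { [ "apache2", "libapache2-mod-php5" ]:
        ensure => latest;
    }
    # the conservative alternative: install once, pick the MPM explicitly, upgrade deliberately via apt
    package { [ "apache2-mpm-prefork", "libapache2-mod-php5" ]:
        ensure => present;
    }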
[14:04:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54373
[14:04:44] New patchset: Mark Bergsma; "Install apache2-mpm-prefork instead of letting APT install -worker" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54372
[14:05:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54372
[14:06:04] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours
[14:06:53] I still think that we shouldn't let puppet upgrade apaches
[14:06:56] and restart them
[14:07:28] until we have better security upgrade processes in place, i think it's a good idea
[14:10:22] New patchset: Mark Bergsma; "Fix dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54385
[14:10:50] New patchset: Mark Bergsma; "Fix dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54385
[14:12:11] back to swift ganglia
[14:12:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54385
[14:13:41] are you guys aware that blog.wm.o is serving up php source?
[14:13:52] * hexmode checks backlog to see
[14:13:56] haha
[14:14:02] yes i am
[14:14:33] good... just wanted to verify :)
[14:16:42] ekrem too
[14:16:48] are you handling all of them mark?
[14:17:20] yes
[14:17:25] k
[14:24:43] New review: Ottomata; "We are not going to use this change, I will eventually abandon it." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/48041
[14:28:21] good thing ganglia data is not stored by the php
[14:28:59] brb, errands
[14:29:09] good thing production apaches aren't webserver::php5
[14:29:17] hehe
[14:29:30] now that would have been fun
[14:30:17] ha
[14:30:22] found the problem with gmetric
[14:30:30] shot in the dark
[14:30:36] root@ms-fe1:~# /usr/bin/gmetric --name "swift object change" --type int32 --conf /etc/ganglia/gmond.conf --spoof "Swift pmtpa prod:Swift pmtpa prod" --value 5 --units "objects per second"
[14:30:40] root@ms-fe1:~# /usr/bin/gmetric --name "swift_object_change" --type int32 --conf /etc/ganglia/gmond.conf --spoof "Swift pmtpa prod:Swift pmtpa prod" --value 5 --units "objects per second"
[14:30:44] first one works, second one doesn't
[14:30:46] other way around
[14:30:49] er
[14:31:04] first one is what ben's script runs and it fails with the newer gmetric
[14:34:04] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[14:35:05] PROBLEM - Puppet freshness on db1011 is CRITICAL: Puppet has not run in the last 10 hours
[14:35:05] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:16] New patchset: Faidon; "Fix swift-ganglia-report-global-stats for 3.5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54464
[14:38:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54464
[14:42:36] New patchset: Hashar; "contint: update apache conf file headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54465
[14:42:37] New patchset: Hashar; "Apache conf for https://zuul.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54466
[14:43:37] nice
[14:50:04] and percentage queries by status also fixed
[14:50:08] hits__ -> hits_%25
[14:50:41] now if only I could merge https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Swift%20pmtpa&h=ms-fe1.pmtpa.wmnet&v=84.7094625632&m=swift_200_hits__&r=hour&z=default&jr=&js=&st=1363617641&vl=hps&z=large history into https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Swift%20pmtpa&h=ms-fe1.pmtpa.wmnet&v=83.6527293844&m=swift_200_hits_%25&r=hour&z=default&jr=&js=&st=1363617641&vl=hps&z=large
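The gmetric regression pinned down at 14:30: the newer gmetric rejects metric names containing spaces, so the spaced name Ben's script used stopped registering while the underscored one works. A sketch of the kind of one-line guard a reporting script can apply (bash parameter expansion; the actual fix is change 54464 above, whose contents are not quoted here):

    name="swift object change"
    /usr/bin/gmetric --name "${name// /_}" --type int32 \
        --conf /etc/ganglia/gmond.conf \
        --spoof "Swift pmtpa prod:Swift pmtpa prod" \
        --value 5 --units "objects per second"    # spaces replaced with underscores before sending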
[14:52:03] hashar: so you're renaming integration.wm.org to zuul.wm.org?
[14:52:53] paravoid: hop adding a new vhost :-]
[14:53:15] paravoid: I will eventually have to pull Zuul out of gallium to a new host. Maybe next year.
[14:53:47] paravoid: and will eventually mimic openstack by providing a nice status page such as http://zuul.openstack.org/ ( with nice performances stats from graphite)
[14:54:01] the aim is to replace https://integration.wikimedia.org/zuul/status entirely
[14:54:04] heh
[14:54:26] for graphite I will need python-statsd to be Debianized for Precise :]
[14:58:05] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[15:08:06] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:07] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:08:54] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.178 second response time
[15:08:55] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.135 second response time
[15:21:34] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.0391825 (gt 8.0)
[15:23:55] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds
[15:23:56] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds
[15:24:04] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:29:55] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 16 seconds
[15:30:44] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[15:33:45] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54344
[15:47:34] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.97212751773
[15:51:39] New review: Diederik; "The reason why we haven't installed Panda yet is because of the unbelievable long list of dependencies:" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54116
[15:54:50] New patchset: Ottomata; "Fixing email-blog-pageviews.erb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54476
[15:55:30] New review: Ottomata; "I think this is ok on stat1 (unless another opsen disagrees). stat1 is meant for number crunching ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116
[15:57:11] New review: Ottomata; "Also, if you own this as root:www-data 640, how are Evan and others going to write files into this ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116
[15:57:39] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54476
[16:00:08] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[16:00:09] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with args zuul-server
[16:01:37] !log Hard restarted Zuul which was dead locked again :-(
[16:01:44] Logged the message, Master
[16:03:16] so, uh, Andre can open https://wikimediafoundation.org/wiki/Staff?showall=1 but I'm getting a "cannot connect to server" error. He's in Czech Republic, I'm in SF.
[16:03:37] ah, nevermind, works now, but I promise, it wasn't working there for at least 3 minutes
[16:03:50] :)
[16:04:12] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours
[16:04:13] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds
[16:04:14] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with args zuul-server
[16:04:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds
[16:04:55] paravoid: going to swap out disk 4 on ms-be1004...any objection?
[16:06:00] yes, give me a sec first
[16:06:16] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[16:09:41] New patchset: Ottomata; "Fixing email blog pageviews job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54478
[16:10:09] paravoid: okay...lmk
[16:10:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54478
[16:15:34] New patchset: Ottomata; "Need to include passwords::mysql::research" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54481
[16:17:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54481
[16:19:12] cmjohnson1: ok, go ahead
[16:19:19] ok...thx
[16:19:33] !log removing/replacing disk 4 ms-be1004
[16:19:41] Logged the message, Master
[16:21:16] New review: Silke Meyer; "OK on a Wikidata client, but repo doesn't get its extensions. I'll have to investigate this further!" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/51797
[16:21:36] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.58248736842 (gt 8.0)
[16:22:24] New patchset: Ottomata; "Removing newline" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54483
[16:22:57] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54483
[16:28:24] cmjohnson1: plugged the new disk?
[16:29:37] so many bad disks in the new ceph servers?
[16:29:41] yeah
[16:29:46] osdmap e170605: 144 osds: 137 up, 137 in
[16:31:18] yes
[16:31:29] mark: there are several; nearly all of them have at least 1
[16:31:36] but the osds show up which is odd
[16:31:47] did you also do the megacli magic?
[16:32:09] the system can't see the disk, I'm guessing there's no array
[16:32:32] paravoid: no, i am not sure how to get that disk back in the right order
[16:36:21] paravoid: if i screw up the order then it will mess up the mapping on the OSDs...correct? any suggestions?
[16:36:49] it won't be a huge deal as ceph doesn't particularly care, but it'll confuse us in the future
[16:36:51] New review: Andrew Bogott; "I cut out those lines based on the assumption that they were there because of copy/paste and didn't ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51797
[16:38:05] PROBLEM - Puppet freshness on db56 is CRITICAL: Puppet has not run in the last 10 hours
[16:39:28] hey milimetric
[16:39:51] the mobile reportcards seem to have a reasonable number of caching issues. is that being tracked someplace?
[16:40:00] clearing cache after every update doesn't seem right
[16:40:05] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[16:40:33] milimetric: sorry, wrong channel
[16:50:06] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 28 seconds
[16:50:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[17:05:57] Etherpad Lite seems to be down. Is this known?
[17:08:17] etherpad lite is still a labs project
[17:08:19] ask in #wikimedia-labs
[17:11:42] paravoid: I asked there but received no response.
[17:14:05] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Mon Mar 18 17:14:02 UTC 2013
[17:16:26] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 187 seconds
[17:17:06] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds
[17:24:05] * jeremyb_ has advised valeriej to be a little more vocal in #-labs :)
[17:36:07] New patchset: Dzahn; "add empty password class for eventlogging to passwords, to reflect addition in private repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54490
[17:37:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54490
[17:38:22] mutante: :D thank you
[17:38:52] ori-l: np, you should be able to use it now, user/pass taken from vanadium as requested
[17:40:52] awesome, thanks
[17:42:01] ori-l: errr, Peter == milimetric ?
[17:42:33] ori-l: Do you happen to know how to properly debian/package ruby software? More specifically a ruby gem? (I'm referring to jsduck, and yes it also has a few other gems as dependencies)
[17:42:43] https://bugzilla.wikimedia.org/show_bug.cgi?id=46236
[17:43:06] I'm currently blocked on this to continue with the jenkins jobs, as they need that bin.
[17:43:11] mutante: ^
[17:43:19] notpeter: ^
[17:43:22] jeremyb_ hm? I'm Dan Andreescu, notpeter is Peter Youngmeister
[17:43:32] Krinkle: not sure. i'll take a look
[17:43:33] milimetric: tell ori-l :-)
[17:43:43] oh he knows :)
[17:43:52] idk... :)
[17:44:25] paravoid: so when might we have 4 ceph frontends running?
[17:45:02] Krinkle: never packaged a ruby gem before, but this might help http://stackoverflow.com/questions/7116377/create-a-debian-package-from-a-ruby-gem
[17:45:17] do we need 4?
[17:45:53] probably not since we didn't *need* 4 swift ones, but last I checked we had 1
[17:46:02] 2 would be nice
[17:46:06] we have two
[17:46:22] actually running and getting a portion of traffic?
[17:46:25] and we have two more needing a reformat
[17:46:37] is using this to package ruby gems a good idea? http://rubygems.org/gems/fpm "Convert directories, rpms, python eggs, rubygems, and more to rpms, debs, solaris packages and more."
[17:46:48] no
[17:47:02] just use gem2deb
[17:47:09] Krinkle: ^ :)
[17:47:51] mutante: I tried several of them, but they all seemed to have crappy details in the end. But then again, I don't know much about this. I recall something problematic about dependencies and the specific versions of the dependencies
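gem2deb, suggested at 17:47, does most of the work: it converts a .gem into a Debian source package (binaries conventionally named ruby-<gemname>) and builds it. A minimal sketch for jsduck; the painful part Krinkle alludes to is that every dependency gem without an existing Debian package needs the same treatment, at a compatible version:

    $ gem fetch jsduck        # download jsduck-<version>.gem from rubygems.org
    $ gem2deb jsduck-*.gem    # generate debian/ packaging and build a ruby-jsduck .deb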
[17:47:53] TCP ms-fe.svc.eqiad.wmnet:http wrr -> ms-fe1001.eqiad.wmnet:http Route 40 11 8 -> ms-fe1002.eqiad.wmnet:http Route 40 9 8
[17:48:07] whitespace damaged, but yes, both are active/load-balanced
[17:48:22] Krinkle: also tried gem2deb as paravoid suggests?
[17:48:30] mutante: Anyhow, I can't try anything because I couldn't verify it properly. I'd be working in the dark.
[17:48:56] I could spend hours/days on it, and it'd be a total waste of effort and time.
[17:49:20] New review: Ori.livneh; "> Also, if you own this as root:www-data 640, how are Evan and others going to write files into this..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116
[17:50:18] I'd like to believe this is a task for operations. We have separate teams so that people can do what they're good at. I'm always curious to learn more, but in this case I think it's fair to say this is out of my reach.
[17:50:48] New patchset: Ori.livneh; "Rsync public data for visualization to stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116
[17:50:49] I think that's fair
[17:51:17] that's what I tell people too
[17:51:25] paravoid: is the balancing weighted?
[17:51:34] if you want to do it I'll help/review in the spirit of levelup
[17:51:37] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.697558547 (gt 8.0)
[17:51:41] every script I run seems to just hit fe1002
[17:51:47] if not that's entirely fair and it's a thing for our team
[17:52:15] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.27187689922 (gt 8.0)
[17:52:19] AaronSchulz: both have equal weight
[17:52:23] New patchset: Dzahn; "turn planet into a puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54493
[17:52:24] New patchset: Dzahn; "rename planet class, per docs init.pp must exist and contain a class matching the module name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54494
[17:52:24] New patchset: Dzahn; "move defined resource types into separate files, one definition per file inside a module as recommended by puppet docs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54495
[17:52:24] New patchset: Dzahn; "move package install to own class and file, move generic::locales out of init.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54496
[17:52:24] New patchset: Dzahn; "move webserver setup for planet out of init.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54497
[17:52:24] New patchset: Dzahn; "move locales install and generation into own file, step 1 to making module self contained instead of declaring stuff from generic-definitions.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54498
[17:52:25] New patchset: Dzahn; "move needed planet directories to own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54499
[17:52:25] New patchset: Dzahn; "move user/theme/apache_sites out of venus.pp, move index_site setup to own file, move webserver setup out of init.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54500
[17:52:25] New patchset: Dzahn; "move language prefixes and translations to own class/file languages. use a qualified variable to access it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54501
[17:52:26] New patchset: Dzahn; "various fixes to make puppet-lint like it, like wrong quoting, unaligned arrows, variables without explicit scope, lines longer than 80 chars, and more" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54502
[17:52:32] AaronSchulz: where do you see that?
[17:52:34] fe1002?
[17:52:37] ganglia
[17:52:39] mutanteflood
[17:52:45] maybe ganglia is just wrong ;)
[17:52:48] paravoid: I got enough op-related levelup for the time being. Some other time :) I want to make sure operations people still have a job to do :P
[17:53:02] AaronSchulz: persistent connections?
[17:53:14] paravoid: But thanks for the offer, I'll probably take you up on it one day.
[17:53:58] I don't think so
[17:54:37] hrm
[17:55:37] paravoid: would love yr feedback on https://gerrit.wikimedia.org/r/#/c/54324/
[17:56:03] got a meeting in 5'
[17:56:08] so probably later
[17:56:18] np, thanks
[17:56:46] looks good in general, I have a couple of suggestions though
[17:58:43] paravoid: cool, note them when you get a chance and i'll make fixes
[18:00:35] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds
[18:01:26] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 198 seconds
[18:01:33] New patchset: Reedy; "Set wgCookieExpiration to 30 days" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54505
[18:02:09] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54505
[18:03:25] !log aaron synchronized wmf-config/PrivateSettings.php 'Removed old testing cruft'
[18:03:33] Logged the message, Master
[18:05:58] !log reedy synchronized wmf-config/CommonSettings.php
[18:06:05] Logged the message, Master
[18:11:35] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.66516654676 (gt 8.0)
[18:24:26] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[18:24:36] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[18:26:37] New patchset: Ottomata; "add. Some new global settings for metrics-api project." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53868
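The planet series above follows the module conventions its commit messages cite: a module's init.pp must define a class matching the module name, and every other class or define gets its own file, autoloaded by name. A sketch of the resulting layout, with file names inferred from the commit messages and therefore illustrative only:

    modules/planet/manifests/init.pp         # class planet - must match the module name
    modules/planet/manifests/packages.pp     # class planet::packages - one class per file
    modules/planet/manifests/languages.pp    # class planet::languages (prefixes/translations)
    modules/planet/templates/                # ERB templates the classes render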
[18:28:18] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.58356535714 (gt 8.0)
[18:30:04] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[18:30:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53868
[18:32:22] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 9.38441573643 (gt 8.0)
[18:39:53] !log reedy synchronized php-1.21wmf12 'Initial sync of php-1.21wmf12'
[18:40:01] Logged the message, Master
[18:48:17] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.64687382979 (gt 8.0)
[18:54:11] Reedy: pm
[18:56:26] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1043, db1009 for upgrade'
[18:56:34] Logged the message, Master
[18:58:09] PROBLEM - MySQL Slave Running on db78 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Query partially completed on the master (error on master: 1317) and w
[18:58:19] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.69825935714
[18:58:20] PROBLEM - MySQL Slave Running on db1025 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Query partially completed on the master (error on master: 1317) and w
[19:02:19] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.47387868217 (gt 8.0)
[19:02:46] !log asher synchronized wmf-config/db-eqiad.php 'db1050 for special s1'
[19:02:54] Logged the message, Master
[19:03:46] !log reedy synchronized docroot
[19:04:16] Logged the message, Master
[19:05:24] mark: paravoid soooo
[19:05:32] db78 and db1025 are having issues
[19:05:36] they are fundraising boxes
[19:05:42] neither asher nor I have access....
[19:07:38] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.0916172269 (gt 8.0)
[19:07:52] !log asher synchronized wmf-config/db-eqiad.php 'returning db1043 db1009'
[19:07:59] Logged the message, Master
[19:10:02] New patchset: Rfaulk; "add. Flag option to use flask.ext.login for metrics API." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54513
[19:10:26] RECOVERY - Puppet freshness on mw1124 is OK: puppet ran at Mon Mar 18 19:10:25 UTC 2013
[19:10:34] Copying to fenari from 10.0.5.8...rsync: send_files failed to open "/php-1.21wmf11/.git/modules/extensions/FormPreloadPostCache/index.lock" (in common): Permission denied (13)
[19:11:39] New patchset: Demon; "Make hooks-bugzilla less spammy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54514
[19:13:19] pastebin.com looks interesting today :p
[19:14:47] RECOVERY - Puppet freshness on mw1104 is OK: puppet ran at Mon Mar 18 19:14:38 UTC 2013
[19:15:14] eh?
[19:19:07] RECOVERY - Puppet freshness on search17 is OK: puppet ran at Mon Mar 18 19:19:01 UTC 2013
[19:19:15] New review: Ottomata; "Hm, cool!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54513
[19:19:47] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 194 seconds
[19:21:18] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Mon Mar 18 19:21:14 UTC 2013
[19:22:16] RECOVERY - Puppet freshness on mw1131 is OK: puppet ran at Mon Mar 18 19:22:13 UTC 2013
[19:24:57] !log reedy Started syncing Wikimedia installation... : Rebuild message cache for 1.21wmf12
[19:25:03] Logged the message, Master
[19:25:46] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 29 seconds
[19:27:36] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.61299543478 (gt 8.0)
[19:28:07] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 190 seconds
[19:28:07] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 189 seconds
[19:28:17] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.42276690647 (gt 8.0)
[19:32:22] !log upgraded all coredb mariadb replicas to 5.5.30
[19:32:37] Logged the message, Master
[19:36:16] PROBLEM - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[19:36:56] uh?
[19:37:01] RECOVERY - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 995 bytes in 0.065 second response time
[19:37:08] right....
[19:38:06] icinga's sloooow
[19:42:27] gerrit too
[19:42:40] well it also freezes up my browser
[19:42:44] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.91653930233 (gt 8.0)
[19:42:47] (icinga)
[19:43:52] ^demon, is gerrit borky?
[19:43:53] Connection to gerrit.wikimedia.org closed by remote host.
[19:44:01] The proxy server received an invalid response from an upstream server.
[19:44:28] <^demon> !log jetty freaked out again, forcing a gerrit restart
[19:44:36] Logged the message, Master
[19:45:28] danke!
[19:48:23] PROBLEM - swift-account-reaper on ms-be12 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[19:48:24] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused
[19:48:24] PROBLEM - MySQL Slave Running on rdb1002 is CRITICAL: NRPE: Command check_mysql_slave_running not defined
[19:48:24] PROBLEM - Backend Squid HTTP on knsq17 is CRITICAL: Connection refused
[19:48:24] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 3037 seconds
[19:50:14] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.80697151079 (gt 8.0)
[19:50:23] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[19:51:25] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 526.92883 (gt 400)
[19:52:09] hrmmm, no maxsem. (for solr)
[19:52:27] jeremyb_, and now?
[19:52:35] (I'm in a meeting, btw)
[19:52:46] MaxSem: there was an icinga alert above
[19:53:22] and mw27/knsq17 ? someone working on them?
[19:53:35] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds
[19:53:35] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds
[19:54:34] jeremyb_, vanadium is Nikerabbit. it's his crazy requests taking too long:P
[19:54:53] MaxSem: ok
[20:00:17] !log reedy Finished syncing Wikimedia installation... : Rebuild message cache for 1.21wmf12
: Rebuild message cache for 1.21wmf12 [20:00:24] Logged the message, Master [20:02:15] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.0 [20:03:25] chmod g+w /home/wikipedia/common/php-1.21wmf11/.git/modules/extensions/FormPreloadPostCache/index.lock [20:04:12] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [20:04:20] Logged the message, Master [20:05:02] New patchset: Reedy; "1.21wmf12 deployment stuffs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54520 [20:05:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54520 [20:06:02] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [20:06:09] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 1 seconds [20:06:09] PROBLEM - Puppet freshness on mc1008 is CRITICAL: Puppet has not run in the last 10 hours [20:06:14] New patchset: Reedy; "Update php symlink" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54521 [20:07:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54521 [20:12:01] New patchset: Ram; "Bug: 46295 Fix error parsing InitialiseSettings.php" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/54522 [20:12:07] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.70951 (gt 8.0) [20:12:43] hia, paravoid, quick .deb q for you, if you are still awake [20:12:47] I"m creating the flask-login deb [20:12:49] LeslieCarr, the noc email address, am I right in think that's an OTRS queue? [20:12:50] someone already created this repo [20:12:51] operations/debs/flask-login [20:12:59] but, the .deb will be called python-flask-login [20:13:04] hah, parsing php in java. that's a good one [20:13:04] should the repo be named the same? [20:13:09] Thehelpfulone: i think so [20:13:36] i noticed that git-buildpackage builds .dsc files named flask-login_0.1.2-1.dsc, and the .deb as python-flask-login_0.1.2-1_all.deb [20:13:39] which seems weird to me [20:13:44] Thehelpfulone: otrs-wiki agrees [20:14:36] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.104221223 (gt 8.0) [20:15:25] yeah, I was just looking for someone who could be regarded as the "owner" or contact for the queue jeremyb_, filling out https://otrs-wiki.wikimedia.org/wiki/User:Rjd0060/sandbox [20:16:03] paravoid: was this one already discussed as an option for the ruby gems? "13:21 < FriedBob> mutante: You can ensure on gems, and install gems via the package resource [20:18:12] Thehelpfulone: you could use some gridlines... [20:18:42] hey I didn't make it, blame RD for that ;) [20:18:57] Thehelpfulone: yeah, leslie's probably fine. you can ask RD who else actually works the queue [20:20:13] !log authdns-update [20:20:26] Logged the message, RobH [20:20:57] (most of ops in SF is at lunch, just fyi) [20:21:02] for folks asking questions to them ;] [20:21:06] Thehelpfulone: noc is just an alias for root [20:21:17] mutante, it's also a queue in OTRS [20:21:28] i dont think it gets any email then [20:21:36] or does it [20:21:43] it used to years ago, i dunno if it still odes [20:21:46] yeah that's why I was asking, things go through mchenry first don't they? [20:21:48] noc? 
I think it does [20:21:50] i know that no one in ops logs into otrs regularly ;] [20:21:56] I see mail to noc pretty often (but also a lot of spam) [20:21:57] (for ops work that is) [20:22:07] all i can say that from mchenry point of view it is nothing but an alias for root [20:22:13] I dunno about the queue, I don't think I have access to that one [20:22:22] man... dont make me login to otrs [20:22:27] i have not done in months! [20:22:36] I do just often enough to keep my account [20:22:47] well once in a while I get asked to lok at something so.. [20:22:58] huh, i dont even see noc. [20:23:17] prettu sure cuz its empty... [20:23:19] i dunno otrs at all. [20:23:28] redirects help@ to thehelpfulone [20:23:30] I recall that it used to go into OTRS years ago [20:23:33] not :) [20:23:44] guys [20:23:50] woops about to be wrong channel [20:24:05] https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketSearch&Subaction=Search&QueueIDs=5 [20:24:07] I know that Leslie does a few tickets from there because I've seen emails in OTRS [20:25:03] Is the tampa network ill? [20:27:04] jeremyb_: i see no poen tickets though [20:27:14] 6 eqix-dc2.wikimediafound.com (206.126.236.221) 29.226 ms 28.552 ms 28.531 ms [20:27:14] 7 ae0.cr1-eqiad.wikimedia.org (208.80.154.193) 24.947 ms 24.494 ms 24.463 ms [20:27:14] 8 xe-0-0-1.cr1-sdtpa.wikimedia.org (208.80.154.210) 59.379 ms 58.536 ms 58.519 ms [20:27:14] 9 * * * [20:27:14] 10 * * * [20:27:17] RobH: but how old are the closed ones? [20:27:27] RobH: and https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketQueue&QueueID=5 [20:27:42] im still not sure what the question is [20:27:46] noc@ doesnt go there [20:27:50] so i have no idea how things are going in there [20:27:59] RobH: the question is "is it really used in any way" [20:28:09] not that i know of. [20:28:56] * jeremyb_ is thinking it should go the way of the legal [20:29:13] if that means kill it [20:29:15] i agree. [20:29:44] notpeter, fyi, I took the flask-login RT ticket [20:29:46] i think I got it [20:30:04] ottomata: you are awesome [20:31:15] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 14 seconds [20:31:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 11 seconds [20:31:30] RobH: well there was a maybe 6-12 month period where legal@ was an alias instead of going to OTRS but the queue was still there. there were several announcements to forward to legal instead of using the queue but people kept moving stuff from other queues into legal. (which then got ignored). [20:32:01] RobH: idk if anyone moves stuff to noc or if we have anywhere telling people too but if it's unused then we should just disable the ability to move stuff there [20:32:11] (like was eventually done with legal) [20:32:45] telling people to* [20:34:59] ^demon, you around?, i'm about to do some gerrit stuff with a new repo and I want to be sure I don't mess it up and have to ask for help later :) [20:35:08] <^demon> Yes, I am. 
[20:35:15] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 186 seconds [20:35:15] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [20:35:40] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53624 [20:35:44] so, i have a new repo [20:35:55] operations/debs/python-flask-login (I know you created flask-login, but i thikn it is named incorrectly) [20:35:58] jeremyb_: legal is an Office IT thing nowadays [20:36:00] it does not have any commits on the remote [20:36:13] my local has 3 branches [20:36:25] mutante: huh? [20:36:33] I want to push everything except for the latest commit on master directly [20:36:35] !log authdns-update rt4694 [20:36:39] and then push the latest commit on master for review [20:36:42] mutante: oh, you mean legal is google apps? [20:36:46] jeremyb_: yes [20:36:50] Logged the message, RobH [20:37:10] jeremyb_: mchenry just has alias TO legal, but the target is google apps [20:37:11] mutante: very recently? it wasn't google apps when luis was added. iirc [20:37:16] ^demon, is that possible? [20:37:19] <^demon> ottomata: Yes. [20:37:33] <^demon> `git push origin [branchname]` for the branches you want to directly push all of. [20:37:55] <^demon> `git push origin HEAD~1:refs/heads/master` for the one you want to push all-but-latest-to-master [20:38:04] jeremyb_: December 2011 [20:38:06] ooooo [20:38:08] cool [20:38:15] <^demon> And then `git push origin HEAD:refs/for/master` for the one you want reviewed. [20:38:19] <^demon> Or git-review, if that suits you :) [20:39:03] mwalker: hi [20:39:11] jeremyb_: and legal != legalteam, which is a mailman list :) there is always confusion about it no matter how it's done [20:39:33] <^demon> ottomata: So generally speaking, a full git push is `git push [remote] [local ref]:[remote ref]` [20:39:48] mutante: yeah, i have 4642 open :) [20:40:15] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [20:40:15] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [20:40:16] <^demon> notpeter: I came across a couple more wikis with borked indices. I filed https://rt.wikimedia.org/Ticket/Display.html?id=4757 [20:40:56] ^demon: cool! will take a look today :) [20:40:57] thanks! [20:40:59] mutante: huh, 2166. i wonder what i was thinking [20:41:31] jeremyb_: maybe you thought of legalteam [20:41:44] or roaaary or roaaaary :) [20:42:17] New patchset: Ottomata; "Creating debian/ directory using stdeb. python setup.py --command-packages=stdeb.command debianize See: https://pypi.python.org/pypi/stdeb" [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [20:42:27] stdeb is bad [20:42:27] YEEHAW [20:42:28] so is fpm [20:42:31] thanks ^demon [20:42:36] :-D [20:42:38] <^demon> yw. [20:42:39] sigh [20:42:43] -1 [20:42:47] paravoid, whaaaaa, but I only used it to create the debian/ [20:42:51] that's bad? [20:42:58] i'm still using git-buildpackage for repo structure and building [20:42:58] yeah use mkdir && vim :-] [20:43:18] ottomata: you will have to train me on git-buildpackage :-] [20:43:24] mwalker: you about? [20:43:37] they usually produce nonsense [20:44:05] this one doesn't look too bad, although the description seems borked, maintainer is wrong [20:44:11] btw, for wikimedia-task-appserver, clone, debuild -us -uc , and i got a .deb [20:44:16] preinst is redundant [20:44:27] paravoid, is this bad? 
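Collecting ^demon's push walkthrough above into one runnable sequence. A sketch only: the branch names upstream and pristine-tar are assumed from the usual git-buildpackage repo layout, they are not named in the channel.

    # push the branches that need no review exactly as they are
    git push origin upstream pristine-tar
    # push master directly, but only up to and excluding the tip commit
    git push origin HEAD~1:refs/heads/master
    # then submit the remaining tip commit to Gerrit for review
    git push origin HEAD:refs/for/master   # or: git-review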
[20:44:27] https://gist.github.com/ottomata/5190642 [20:44:27] debian/source/format is better to be 3.0 (quilt) [20:44:38] New patchset: RobH; "adding redirection for wikiipedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54526 [20:44:48] build-deps & standards version are old [20:44:54] debhelper is 7, which is minor [20:45:17] other than that, okay [20:45:37] ok, ha, this is the fastest review ever!? [20:45:53] :) [20:46:54] hm, paravoid, when you say build-debs and standars are old [20:46:56] what should they be? [20:47:48] python-support is >= 0.8.4 [20:47:58] that's even pre-lucid [20:48:01] New review: Demon; "This needs upstream https://gerrit-review.googlesource.com/#/c/43280/ merged before we can deploy. J..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54514 [20:48:14] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [20:48:24] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [20:51:19] hmm, ok I can change them [20:51:21] q though, paravoid [20:51:38] those are minimum deps that stdeb picked out, if it thinks thats all it needs to build the package [20:51:42] why would we require a higher version? [20:52:15] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 10.9044645714 (gt 8.0) [20:52:21] hello [20:54:38] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 23 seconds [20:55:17] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:55:24] New review: RobH; "jenkins is now amazingly slow to update:" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/54526 [20:55:25] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54526 [20:57:27] ottomata: just remove the versions if they're <= lucid [20:57:36] hm ok [20:58:03] mwalker: hi, I'd really love to ask you some questions about what you're doing on db1008 [20:58:13] replication is currently broken to the backup slaves [20:58:18] because of some of your activity [20:58:21] can you please contact me [20:58:26] robh is doing a graceful restart of all apaches [20:58:45] !log robh gracefulled all apaches [20:58:52] Logged the message, Master [20:59:18] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.163 second response time [20:59:31] New patchset: Rfaulk; "mod. Check for existence of flask-login first." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54529 [21:00:16] robh is doing a graceful restart of all apaches [21:00:25] !log troubleshooting erros in apache restart script [21:00:32] Logged the message, RobH [21:00:35] !log robh gracefulled all apaches [21:00:42] Logged the message, Master [21:01:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54341 [21:02:35] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54343 [21:02:37] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.93969067797 (gt 8.0) [21:08:09] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection refused [21:08:18] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection refused [21:08:18] PROBLEM - Apache HTTP on mw1152 is CRITICAL: Connection refused [21:08:18] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection refused [21:08:18] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection refused [21:08:18] PROBLEM - Apache HTTP on mw1151 is CRITICAL: Connection refused [21:08:19] PROBLEM - Apache HTTP on mw1150 is CRITICAL: Connection refused [21:08:19] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection refused [21:08:19] PROBLEM - Apache HTTP on mw115 is CRITICAL: Connection refused [21:08:19] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection refused [21:08:21] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection refused [21:08:38] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.0 [21:08:50] New patchset: Rfaulk; "add. Flag option to use flask.ext.login for metrics API." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54513 [21:08:51] wth? [21:08:57] graceful on apaches [21:09:03] i pushed an apache change that passed tests [21:09:03] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection refused [21:09:06] needs more grace [21:09:06] and those didnt restart [21:09:11] ah I see it now [21:09:12] so that shouldnnnnt be me. [21:09:23] hm [21:09:26] as those are not restarted by apache-graceful-all [21:09:30] RobH: the jenkins slowness I am aware of it. Going to hack something tomorrow morning. [21:09:42] hashar: cool, for now i go and find the actual test, and link it in my comment [21:09:54] cuz it did pass tests, just the handoff back to gerrit to update comments is slow [21:10:12] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [21:10:29] RobH: yeah it sometime take up to 1hour to report back. But I think I got the patch that fix the issue :-] [21:10:47] Change abandoned: Rfaulk; "fixed in https://gerrit.wikimedia.org/r/#/c/54513/ .. error with the amend." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54529 [21:11:14] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.097 second response time [21:11:35] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [21:12:13] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.096 second response time [21:12:13] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.139 second response time [21:12:14] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 199 seconds [21:12:20] Ok, so the gracefull all gave me an error on pubkey for those servers [21:12:23] notpeter: want to work together to get the fundraising db slaving happy ? [21:12:25] and for some reason those servers also are now dead. [21:12:32] LeslieCarr: turns out I have access [21:12:36] oh cool [21:12:37] and I have found the problem [21:12:41] tanks, though! [21:12:42] what's the problem ? [21:12:43] okay [21:12:55] notpeter: problem with current rendering cluster? [21:13:08] its not running apache on anything, and i think its my fault from the graceful all [21:13:15] (well, my fault in that i ran it, not my fault it broke) [21:13:42] PROBLEM - Apache HTTP on mw116 is CRITICAL: Connection refused [21:13:52] New patchset: Ottomata; "Creating debian/ directory using stdeb. python setup.py --command-packages=stdeb.command debianize See: https://pypi.python.org/pypi/stdeb" [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [21:13:53] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:04] !log started apache on all imagescalers [21:14:10] Logged the message, Master [21:14:12] PROBLEM - Apache HTTP on mw1166 is CRITICAL: Connection refused [21:14:13] PROBLEM - Apache HTTP on mw1163 is CRITICAL: Connection refused [21:14:13] PROBLEM - Apache HTTP on mw1168 is CRITICAL: Connection refused [21:14:13] PROBLEM - Apache HTTP on mw1162 is CRITICAL: Connection refused [21:14:13] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.100 second response time [21:14:13] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.100 second response time [21:14:13] PROBLEM - Apache HTTP on mw1164 is CRITICAL: Connection refused [21:14:14] PROBLEM - Apache HTTP on mw1165 is CRITICAL: Connection refused [21:14:14] PROBLEM - Apache HTTP on mw1167 is CRITICAL: Connection refused [21:14:15] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 66437 bytes in 0.239 second response time [21:14:24] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time [21:14:24] I'm on mw1151 and it claims it took your key [21:14:32] lookng in auth.log [21:14:34] RECOVERY - MySQL Slave Running on db78 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:14:34] well, it did when i just logged in [21:14:40] but refused it awhile abck, lemme see [21:14:44] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.096 second response time [21:14:46] well I mean about 15 mins ago [21:15:15] PROBLEM - Apache HTTP on mw1169 is CRITICAL: Connection refused [21:15:15] PROBLEM - Apache HTTP on mw1161 is CRITICAL: Connection refused [21:15:16] RECOVERY - Apache 
HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.080 second response time [21:15:16] it has the graceful in the log right after that [21:15:23] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:33] RECOVERY - MySQL Slave Running on db1025 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:15:40] apergos: where ? [21:15:44] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.141 second response time [21:15:45] im looking at tail end of the file [21:16:03] New patchset: Asher; "upgrading one db per core shard in eqiad to mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54575 [21:16:15] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [21:16:16] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [21:16:16] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.050 second response time [21:16:16] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [21:16:16] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 5415 seconds [21:16:21] in auth.log on mw1151 [21:16:44] RECOVERY - Apache HTTP on mw116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.174 second response time [21:16:49] I did tail -200 but anyways it was like about 2 minutes to the hour [21:17:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54513 [21:17:14] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.084 second response time [21:17:34] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 6852 seconds [21:17:49] * apergos gets off [21:17:49] it [21:18:14] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.074 second response time [21:18:14] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [21:18:23] !log service apache2 start on the mw-eqiad group via dsh [21:18:29] Logged the message, Mistress of the network gear. [21:18:37] mw1151: Mar 18 21:05:46 10.64.16.131 apache2[28796]: [notice] caught SIGTERM, shutting down [21:18:55] LeslieCarr notpeter; don't know which one of you is still poking at fundraising's DBs; but I'm guessing I just broke replication again with a crappy update query -- [21:19:15] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [21:19:15] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [21:19:15] RECOVERY - Apache HTTP on mw1165 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [21:19:15] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.088 second response time [21:19:25] mwalker: we shall see soon... [21:19:47] they're still catching up to that point in the binlog [21:20:00] looks like they're back now :) [21:20:13] RECOVERY - Apache HTTP on mw115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.199 second response time [21:21:01] ... 
and today Matt learns that queries that work on his local will break replication when they take 2 hours on the cluster :p [21:21:21] !log robh synchronized docroot [21:21:25] did anyone manually stop puppet on lvs's ? [21:21:28] Logged the message, Master [21:21:35] New patchset: Asher; "pulling db per shard for upgrads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54576 [21:22:06] !log gracefuling all apaches resulted in rendering cluster overload, they didnt restart apache, odd [21:22:14] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 3.8211 [21:22:24] Logged the message, RobH [21:22:36] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.2982573729 (gt 8.0) [21:22:53] * Damianz thinks RobH needs moar gracefullness [21:23:10] mwalker: you're good. slaving totally caught up on db78 [21:23:18] New patchset: Jeremyb; "make ircecho config sane (not just very long strings)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8344 [21:23:18] New patchset: Jeremyb; "change all $ircecho_server to use the chat record" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22698 [21:23:18] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [21:23:28] Mar 18 21:12:49 mw1169 apache2[7723]: [notice] caught SIGTERM, shutting down [21:23:30] paravoid, how's it look now? [21:23:30] https://gerrit.wikimedia.org/r/#/c/54525/ [21:24:15] RECOVERY - Puppet freshness on lvs1003 is OK: puppet ran at Mon Mar 18 21:24:08 UTC 2013 [21:24:35] New review: Jeremyb; "unrotted" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8344 [21:24:41] ottomata: nope :) [21:24:42] !log authdns-update rt4683 [21:24:44] New review: Jeremyb; "unrotted" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22698 [21:24:48] Logged the message, RobH [21:24:50] !log ran "ddsh -g apaches -cM '/etc/init.d/apache2 start'" [21:24:55] dzahn is doing a graceful restart of all apaches [21:24:56] Logged the message, Master [21:25:06] ^^ started around 30% of apaches [21:25:29] LeslieCarr: look up :-) [21:25:45] ok, really have to go, bbl [21:25:52] paravoid: nope as in it looks no good or nope as in not right now? 
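On Matt's observation above about the two-hour query: with statement-based replication, a single long UPDATE stalls every slave for its full runtime. The standard mitigation, not something spelled out in the channel, is to chunk the write by primary key; the table and column names below are made up, and it assumes a non-empty table.

    # walk the primary key in fixed-size chunks so each statement is short
    # and deterministic (an unordered UPDATE ... LIMIT is unsafe to replicate)
    max=$(mysql -N -e "SELECT MAX(id) FROM donations")
    for ((i=0; i<=max; i+=5000)); do
      mysql -e "UPDATE donations SET flagged=1 WHERE id BETWEEN $i AND $((i+4999))"
      sleep 1   # give the slaves room to catch up between chunks
    done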
[21:25:54] thanks jeremyb_ [21:26:47] New review: Faidon; "See http://wiki.debian.org/Python/TransitionToDHPython2" [operations/debs/python-flask-login] (master) C: -1; - https://gerrit.wikimedia.org/r/54525 [21:27:36] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 191 seconds [21:28:01] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54576 [21:28:06] notpeter: ^ interesting that db78 is OK; but db1025 isn't [21:28:10] I told you, stdeb is bad [21:28:15] very old practices [21:28:31] we'll get there, it'd be just quicker to make it from scratch :-) [21:28:47] mwalker: I started slaving on db1025 later :) [21:28:47] it'll catch up [21:28:59] dzahn is doing a graceful restart of all apaches [21:29:09] RobH: testing [21:29:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [21:29:39] !log dzahn gracefulled all apaches [21:29:47] RobH: no key errors, just the expected [21:29:47] Logged the message, Master [21:29:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54575 [21:30:34] !log authdns-update: removing *.ts.wikimedia.org records [21:30:35] oh [21:30:40] Logged the message, Master [21:30:41] i thought you said "other than that, okay" [21:30:51] meaning stdeb left in some bad stuff, but we should change it and it would be ok [21:31:06] paravoid^ [21:31:21] New patchset: Asher; "enabling extended_keys since https://mariadb.atlassian.net/browse/MDEV-4220 has been fixed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54578 [21:31:44] yeah, I had a closer look :) [21:31:48] we'll get there! [21:31:55] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54578 [21:32:06] ah i see your review, thanks [21:36:13] Make the build-dep >= 8 (no minor version needed) and debian/compat to 8. <= that's for debhelper [21:36:35] !log asher synchronized wmf-config/db-eqiad.php 'pulling a db from s3-7 for upgrade' [21:36:36] New patchset: Ottomata; "Creating debian/ directory using stdeb. python setup.py --command-packages=stdeb.command debianize See: https://pypi.python.org/pypi/stdeb" [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [21:36:43] Logged the message, Master [21:37:01] ja, wasn't sure but assumed, how's that? ^ [21:37:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8344 [21:37:20] you ignored my first line :) [21:37:27] http://wiki.debian.org/Python/TransitionToDHPython2 [21:38:19] BAH [21:38:19] sorry [21:38:25] while you're at it [21:38:26] i opened the link but then did the others [21:38:27] reading [21:38:33] fix the commit author, it has your personal gmail [21:38:36] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.8922340678 (gt 8.0) [21:38:38] (git commit --amend --reset-author) [21:38:51] remove the stdeb from commit message and README to not mislead others into thinking this is a good idea [21:39:02] and changelog, just make it "Initial release." [21:39:12] censorship! [21:39:22] it has my personal email? [21:39:46] ah, author is right, commiter is personal [21:39:50] acotto :) [21:40:03] where? [21:40:05] i'm looking for that [21:40:10] https://gerrit.wikimedia.org/r/#/c/54525/3//COMMIT_MSG [21:41:08] PROBLEM - mysqld processes on db1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:41:19] hm. 
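Taken together, paravoid's review comments above amount to a few small edits. The Build-Depends and X-Python-Version lines follow the linked TransitionToDHPython2 page rather than anything quoted here, so treat them as an assumption.

    # compat level 8 to match the new debhelper build-dependency
    echo 8 > debian/compat
    # in debian/control, roughly:
    #   Build-Depends: debhelper (>= 8), python-all (>= 2.6.6-3~)
    #   X-Python-Version: >= 2.6
    # fix the author on the pending commit, then resubmit it for review
    git commit --amend --reset-author
    git push origin HEAD:refs/for/master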
[21:41:19] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22698 [21:41:52] New patchset: Jeremyb; "followup Ibb454f8883bfa8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54581 [21:42:03] LeslieCarr: ^ [21:42:15] okay thanks for uncrufting :) [21:42:21] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.51332992857 (gt 8.0) [21:42:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54581 [21:43:32] ottomata: btw, when you get a chance I've added you as a reviewer on https://gerrit.wikimedia.org/r/#/c/53714/ [21:43:38] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:43:51] oh awesome, thanks paravoid, would love to review that [21:44:18] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:44:31] it needs a rewrite basically :) [21:44:48] aye [21:45:33] hey, I'd really like to push this along: https://gerrit.wikimedia.org/r/#/c/51668 [21:45:42] anyone have opinions? [21:45:53] notpeter: why is the module named "coredb_mysql" and not coredb? [21:45:57] New patchset: RobH; "added wikimaps.net to act like the wikimaps.com/org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54582 [21:45:58] (just curious) [21:46:18] PROBLEM - mysqld processes on db1011 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:46:24] notpeter: yes, both me and mark have reject this in the past :) [21:46:44] paravoid: okie dokie [21:46:47] paravoid: namespace [21:46:56] if it was just coredb [21:46:57] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Mon Mar 18 21:46:49 UTC 2013 [21:47:06] it gets confused with role::coredb [21:47:07] PROBLEM - mysqld processes on db1026 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:47:08] RECOVERY - mysqld processes on db1010 is OK: PROCS OK: 1 process with command name mysqld [21:47:08] PROBLEM - mysqld processes on db1027 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:47:11] Why does http://www.wikimediafoundation.org/ redirect to https://www.wikipedia.org/ ? [21:47:20] notpeter: it shouldn't [21:47:22] did you use ::coredb? [21:47:28] Krinkle: thats bad. [21:47:30] checking. [21:47:34] paravoid: I believe so [21:47:51] paravoid: Ryan_Lane has also noted have similar namespace issues combining classes and modules [21:47:59] https://gerrit.wikimedia.org/r/#/c/16661/ [21:48:07] but yes, I agree that it shouldn't happen :) [21:48:08] http://www.wikimediafoundation.com seems fine. [21:48:31] paravoid: ok [21:48:59] mutante: https://gerrit.wikimedia.org/r/#/c/54526/ [21:48:59] you said you'll cleanup base, didn't you? :D [21:49:19] RECOVERY - mysqld processes on db1011 is OK: PROCS OK: 1 process with command name mysqld [21:49:37] PROBLEM - MySQL Slave Running on db1010 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table ./ruwiktionary/flaggedpage_pending is marked as crashe [21:49:54] paravoid: yep :) [21:49:58] Krinkle: Can you file a bug? 
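Krinkle's redirect report above is easy to reproduce from a shell; the expected target is an assumption, since the intended canonical host for www.wikimediafoundation.org isn't stated here.

    # -I sends a HEAD request; we only care about the redirect target
    curl -sI http://www.wikimediafoundation.org/ | grep -i '^Location:'
    # buggy behaviour: a wikipedia.org URL; expected: a wikimediafoundation.org one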
[21:50:04] RobH: that second [OR] [21:50:05] I'll add it to my list :) [21:50:34] New patchset: Faidon; "appserver: don't install a particular PHP version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54585 [21:50:37] notpeter: ^^^ [21:50:45] New patchset: RobH; "fixing or statement" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54586 [21:51:04] paravoid: sounds reasonable to me [21:51:06] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54586 [21:51:36] yes, classes will conflict with modules [21:51:37] Susan: https://bugzilla.wikimedia.org/show_bug.cgi?id=46297 [21:51:42] for silly reasons [21:52:05] New review: Krinkle; "Should fix bug 46297." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54586 [21:52:06] it's definitely a bug, but who else is really doing what we are doing? [21:52:09] RECOVERY - mysqld processes on db1026 is OK: PROCS OK: 1 process with command name mysqld [21:52:09] RECOVERY - mysqld processes on db1027 is OK: PROCS OK: 1 process with command name mysqld [21:52:09] PROBLEM - Host db1011 is DOWN: PING CRITICAL - Packet loss = 100% [21:52:25] Ryan_Lane: you mean foo::bar::baz will conflict with ::baz? [21:52:34] let me find an example [21:52:38] Krinkle: good catch, i had a bad [OR] in redirects [21:52:44] and our apache linting doesnt throw an error for that [21:52:57] RobH: Yeah [21:52:58] i fixed my procedure for pushing to fix so it doesnt happen again, but argh [21:53:06] syncing out the fix now [21:53:15] RobH: Looks like a common error though, happened at least two more times in the last few weeks [21:53:16] * RobH added all top level domains to his apache-fast-test input file [21:53:19] paravoid: role::salt::master will conflict with salt::master, if salt::master is in a module [21:53:26] and role is not [21:53:28] yea, my addition to my test file fixes it for me [21:53:28] causing weird cases to fall through and redirect a whole bunch of domains [21:53:28] RECOVERY - Host db1011 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:53:29] ok, thanks for the reviews paravoid, i'll try to get the dh_python stuff working tomorrow [21:53:41] ottomata: I can do it too btw [21:53:47] PROBLEM - Host db1026 is DOWN: PING CRITICAL - Packet loss = 100% [21:53:59] BAH [21:53:59] ottomata: I just prefer to do reviews so that you know how to do the next :) [21:54:06] ran sync-common by accident, means i get to wait on it [21:54:07] oh!, if you would that would be delightful, then i'd also have somethign to go on in the future [21:54:10] yeah totally [21:54:24] rfaulkner is hoping to have this up before they present their thing at some conference [21:54:27] paravoid: which is why I've changed role::salt::master to role::salt::masters [21:54:28] ottomata: the dh_python2 changes are like 3 lines btw [21:54:34] yeah i thought [21:54:35] ok ok ok [21:54:37] i'll see if I can do it now [21:54:40] i just wasn't sure [21:54:40] it really fucks up the whole namespace [21:54:41] ok [21:55:07] role::salt::master::production <— also doesn't work [21:55:14] robh is doing a graceful restart of all apaches [21:55:14] basically: remove crap from control, switch to dh_python2? 
[21:55:18] PROBLEM - Host db1027 is DOWN: PING CRITICAL - Packet loss = 100% [21:55:30] not sure about this though: [21:55:30] export DH_OPTIONS=--buildsystem=python_distutils [21:55:31] ditch it [21:55:56] !log robh gracefulled all apaches [21:55:57] jenkins is borked [21:56:05] Logged the message, Master [21:56:22] do I chagne dh to dh_python2? [21:56:26] dzahn is doing a graceful restart of all apaches [21:56:28] RECOVERY - Host db1026 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:56:41] no [21:56:42] New review: Faidon; "23:51 < notpeter> paravoid: sounds reasonable to me" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/54585 [21:56:44] dh --with python2 [21:56:50] RECOVERY - Host db1027 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [21:56:52] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54585 [21:57:12] !log dzahn gracefulled all apaches [21:57:19] Logged the message, Master [21:57:34] paravoid: I am learning that present is always the right answer to anything that we build ourselves :) [21:57:41] PROBLEM - MySQL Recent Restart on db1028 is CRITICAL: Connection refused by host [21:57:41] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: Connection refused by host [21:57:51] PROBLEM - NTP peers on linne is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:58] !log purging wikimediafoundation.org from squid [21:57:58] New review: Alex Monk; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15698 [21:58:00] we had that discussion earlier today when puppet decided to break ganglia, blog, apple-dictionary-bridge and whatever else [21:58:04] Logged the message, Master [21:58:26] New patchset: Rfaulk; "mod. e3_analysis_path for use in metrics api package." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54590 [21:58:38] RECOVERY - NTP peers on linne is OK: NTP OK: Offset -0.001398 secs [21:58:45] paravoid: gotcha [21:59:09] we didn't exactly agreed, although this case is a bit different :) [21:59:16] *agree [21:59:18] PROBLEM - Host db1028 is DOWN: PING CRITICAL - Packet loss = 100% [21:59:40] eh [21:59:44] it's generally safer [22:00:34] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54590 [22:01:22] New patchset: Ottomata; "Creating debian/ directory using stdeb. python setup.py --command-packages=stdeb.command debianize See: https://pypi.python.org/pypi/stdeb" [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [22:01:29] RECOVERY - Host db1028 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [22:01:44] RECOVERY - MySQL Recent Restart on db1028 is OK: OK seconds since restart [22:01:44] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay seconds [22:03:27] paravoid : https://gerrit.wikimedia.org/r/#/c/54525/ [22:04:27] oops [22:04:31] sorry commit message and readme [22:04:32] one sec [22:04:46] drop readme completely [22:04:54] and changelog :) [22:05:01] ja [22:05:04] I mean, change the line in changelog, not dropping it [22:05:11] right [22:05:15] and drop the commented lines in rules [22:05:19] right [22:05:21] New review: Alex Monk; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15698 [22:05:23] and the trailing whitespace [22:05:32] soryr, poked you before I did my own review :p [22:05:44] i built the package with dh_python2 and was too excited! [22:06:05] why drop README.debian completely? 
especially for noobs like me that stuff is helpful [22:06:25] it's very repetitive to have that in every package [22:06:50] hmmm, ok [22:06:57] also, it refers to stdeb [22:07:02] right, meant to change that [22:07:07] have a look at PS1 and PS5 and tell me how much of stdeb was left :) [22:07:13] heh [22:07:52] really, it's 5 files, 47 lines in total [22:08:00] why do people say it's hard, I'll never get ;) [22:08:59] i thikn because there are SO many ways to do it [22:10:35] robh is doing a graceful restart of all apaches [22:10:54] !log robh gracefulled all apaches [22:11:03] Logged the message, Master [22:11:32] New patchset: Ottomata; "Creating debian/ directory using stdeb. python setup.py --command-packages=stdeb.command debianize See: https://pypi.python.org/pypi/stdeb" [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [22:12:33] paravoid, ooook what else? [22:13:57] robh is doing a graceful restart of all apaches [22:14:02] ok, lets try this on wired. [22:14:22] !log robh gracefulled all apaches [22:14:38] Logged the message, Master [22:15:21] New review: Faidon; "(3 comments)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54324 [22:16:11] ottomata: commit message ;-) [22:16:16] you can do this from within gerrit too [22:16:36] bah doh [22:16:44] down to 35 lines [22:17:03] New patchset: Ottomata; "Initial debianization." [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [22:17:39] New review: Faidon; "That's great!" [operations/debs/python-flask-login] (master) C: 2; - https://gerrit.wikimedia.org/r/54525 [22:17:43] there! [22:17:53] New patchset: Lcarr; "removing second instance of mysql client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54594 [22:19:20] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/53490/ [22:19:45] yeehaw! thanks paravoid! [22:19:50] thank you [22:19:53] oh! [22:19:54] for putting up with my reviews :) [22:20:04] hm, wait i just thought about one more thing, because i just thought about adding this to apt [22:20:12] does this need a wikimedia distributiont tag? [22:20:28] nah that's fine, you can do --ignore=wrongdistribution [22:20:34] !log intradatacenter link is flapping , switching links, this may cause some higher latency [22:20:40] Logged the message, Mistress of the network gear. [22:20:44] know the worst part of the flapping ? that i have to call fpl [22:21:27] reprepro —ignore=-wrongdistribution -C main include …changes [22:21:29] ? [22:21:41] yes [22:21:55] that wasn't too hard, was it? [22:22:05] the debianization I mean [22:22:19] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.40279857143 (gt 8.0) [22:22:33] ok, rebooted, once more with feeling! [22:22:38] robh is doing a graceful restart of all apaches [22:22:46] naw, not too bad, thanks to dh_python, [22:22:57] most stuff are automated like this [22:22:57] i think the kafka one won't be so bad if we are both reviewing and fixing at the same time [22:22:57] !log robh gracefulled all apaches [22:23:03] Logged the message, Master [22:23:03] like gem2deb [22:23:10] or dh-make-perl [22:23:13] or dh-make-pear/pecl [22:23:28] scala / java? [22:23:31] :) [22:23:41] there are some maven helpers [22:23:45] hm, eah [22:23:53] one of them is being developed right about now [22:23:59] look at the debian-java mailing list for more [22:27:12] paravoid, what is the distribution name on this? [22:27:13] all? [22:27:26] hm? 
[22:27:28] precise-wikimedia [22:27:37] oh ok [22:27:38] i see [22:27:50] we also have lucid-wikimedia [22:27:55] but I surely hope this isn't lucid :) [22:27:59] no its precise [22:28:03] great [22:28:09] thought that since we were using ignore it should match in the package or something [22:28:11] but i get it [22:28:35] !log added python-flask-login-0.1.2-1 to apt [22:28:37] does it pass lintian? [22:28:39] the package? [22:28:42] Logged the message, Master [22:28:43] you know about lintian, right? [22:28:46] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.71119228571 (gt 8.0) [22:28:48] barely [22:29:08] hmm, i get some warnings and an error about no copyright file [22:29:15] what are the warnings? [22:29:50] I think it's okay for our packages to not have debian/copyright [22:30:08] ACKNOWLEDGEMENT - NTP on analytics1007 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn RT-3946, issue with RAID, has never been up [22:30:08] ACKNOWLEDGEMENT - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours daniel_zahn RT-3946, issue with RAID, has never been up [22:30:09] ACKNOWLEDGEMENT - SSH on analytics1007 is CRITICAL: Connection timed out daniel_zahn RT-3946, issue with RAID, has never been up [22:30:09] W: flask-login source: changelog-should-mention-nmu [22:30:10] W: flask-login source: source-nmu-has-incorrect-version-number 0.1.2-1 [22:30:10] W: flask-login source: no-debian-copyright [22:30:10] W: python-flask-login: new-package-should-close-itp-bug [22:30:10] E: python-flask-login: no-copyright-file [22:30:18] ok [22:30:28] the first two are because last line in changelog != maintainer line [22:30:36] ah (WMF) [22:30:36] ? [22:30:39] Mar 18 22:30:09 mw1179 sshd[1196]: Postponed publickey for root from 208.80.152.165 port 44778 ssh2 [preauth] [22:30:46] wtf is it postponing for. [22:30:49] probably [22:30:50] I use that to match my keyname [22:32:30] yeah, fix it in git, don't rebuild [22:32:50] just so that it's less noisy on the next upload [22:33:03] anyway [22:33:03] good job [22:33:05] ok [22:33:06] cool [22:33:16] New patchset: Ottomata; "Initial debianization." [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [22:33:21] tahnks! thanks for being around, its super late there I bet [22:33:35] not super late [22:33:39] just 30' after midnight [22:34:01] Change merged: Ottomata; [operations/debs/python-flask-login] (master) - https://gerrit.wikimedia.org/r/54525 [22:35:54] paravoid: why not '/var/eventlogging'? [22:36:23] ori-l: http://www.pathname.com/fhs/pub/fhs-2.3.html [22:36:49] Applications must generally not add directories to the top level of /var. Such directories should only be added if they have some system-wide implication, and in consultation with the FHS mailing list. [22:36:59] New patchset: Ottomata; "Installing python-flask-login package for metrics.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54599 [22:37:06] not that e.g. /a obeys the FHS [22:37:16] but for new things, better use /srv or /var/lib [22:38:04] right, for consistency with PDP-11 disk structure [22:38:29] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54599 [22:41:53] yayyyyy paravoid rfaulkner sends you many thanks! [22:41:53] notice: /Stage[main]/Misc::Statistics::Sites::Metrics/Package[python-flask-login]/ensure: ensure changed 'purged' to 'present' [22:41:58] and me too! 
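The tail end of the packaging thread, assembled into one sequence. The rules file is the dh plus dh_python2 style paravoid describes; the build command and the .changes filename are assumptions based on the version under review, and the recipe line under % must be a literal tab.

    # debian/rules in the minimal dh style, with the python2 addon
    cat > debian/rules <<'EOF'
    #!/usr/bin/make -f
    %:
    	dh $@ --with python2
    EOF
    chmod +x debian/rules
    # build it, lint the result, then import into the apt repo
    git-buildpackage -us -uc
    lintian ../python-flask-login_0.1.2-1_amd64.changes
    reprepro --ignore=wrongdistribution -C main include precise-wikimedia \
        ../python-flask-login_0.1.2-1_amd64.changes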
[22:42:16] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 2.77332957143 [22:42:17] paravoid: personal thanks :D [22:42:39] ottomata did all the work, you shouldn't be thanking me [22:42:42] and big thanks to ottomantatoo! [22:42:44] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.4027893617 [22:42:46] ottomata [22:44:41] RECOVERY - MySQL Slave Running on db1010 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:45:05] !log converted rogue s3 tables (moodbar_feedback, trackbacks, transcache) to innodb [22:45:14] Logged the message, Master [22:46:22] PROBLEM - MySQL Slave Delay on db1010 is CRITICAL: CRIT replication delay 3013 seconds [22:50:07] New patchset: Asher; "Revert "pulling db per shard for upgrads"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54601 [22:50:19] New patchset: Rfaulk; "add. handle VersionConflict and require flask-login ver 0.1.2." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54602 [22:51:04] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54602 [22:51:13] New patchset: Dzahn; "puppet-lint: fix all "ERROR: tab character found", :retab with 2-space soft tabs per puppet style guide" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54603 [22:51:13] New patchset: Dzahn; "puppet-lint: fix all "double quoted string containing no variables"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54604 [22:51:13] New patchset: Dzahn; "puppet-lint: fix most "=> on line isn't properly aligned for resource", and all "unquoted file mode" and "ensure found on line but it's not the first attribute"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54605 [22:53:22] RECOVERY - MySQL Slave Delay on db1010 is OK: OK replication delay 0 seconds [22:54:12] ottomata: I have a task for you [22:54:22] ottomata: https://gerrit.wikimedia.org/r/#/c/44408/ [22:54:25] yessuh (i'm about to sign out for the day) [22:54:31] haha ok [22:54:32] cool [22:54:34] ;-) [22:54:58] there's also python-jsonschema that ori-l wanted [22:55:01] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [22:55:08] * needed :) [22:55:19] but: https://bugzilla.wikimedia.org/show_bug.cgi?id=46233 [22:55:24] someone merged that commit but I think it was not ready to be pushed yet or something [22:55:30] vague recollection [22:55:35] thanks woosters! better get outta her quick! [22:55:51] yes, i know more about python debs now [22:55:58] python jsonschema should be more doable [22:56:25] ottomata ;-P [22:56:36] * paravoid is evil [22:56:41] gerrit in 3, 2, 1.. [22:56:45] New patchset: Ori.livneh; "Add 'eventlogging' puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54324 [22:57:02] ottomata: :-) [22:57:06] heheh [22:57:59] ori-l: what about the rest of my comments? [22:58:04] in return for that, could you push on the puppet-merge stuff? :D [22:58:24] wasn't mark reviewing that with you? 
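For the record, asher's table conversion above is a plain ALTER per table; the database name is a placeholder, since the log doesn't say which s3 wiki carried the rogue tables.

    for t in moodbar_feedback trackbacks transcache; do
      mysql some_s3_wiki -e "ALTER TABLE $t ENGINE=InnoDB"   # db name assumed
    done

Worth noting: ENGINE=InnoDB rewrites the whole table, so on large tables the conversion is itself a long, replication-visible statement.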
[22:58:41] New patchset: Dzahn; "puppet-lint: fix "unquoted file mode" and "unquoted resource title" warnings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54608 [22:58:42] New patchset: Dzahn; "puppet lint: fix "double quoted string containing no variables" and "quoted boolean value"s" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54609 [22:58:42] New patchset: Dzahn; "puppet-lint: fix "ensure found on line but it's not the first attribute"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54610 [22:58:42] New patchset: Dzahn; "puppet-lint: fix "tab character found on line" by using :retab in vim with 2-space soft tabs (tabstop=2,shiftwidth=2,expandtab)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54611 [22:58:42] New patchset: Dzahn; "puppet-lint: fix "two-space soft tabs not used on line" errors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54612 [22:58:44] paravoid: i replied. for retaining the role-class, i don't want to do that, simply because i'm deferring a more granular role structure to a later commit [22:58:57] regarding git-deploy: gerrit compromise == puppet/private compromise. I don't want to have to migrate to an entirely new deployment system to get this out. [22:59:24] replied where? [22:59:41] paravoid: he was, afaik know he needed to talk with you about it [22:59:49] paravoid: gah, stupid drafts. my bad. i did not submit the comments [22:59:54] :) [23:00:12] ottomata: okay, let's sync up tomorrow or so [23:00:17] cool, danke [23:00:23] submitted [23:00:34] i'll poke you in the morning here when you are usually both online, thanks! [23:01:01] ori-l: still can't see it [23:01:30] New review: Ori.livneh; "(3 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54324 [23:02:32] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54601 [23:03:06] ori-l: so, we can't have puppet run code from gerrit as root [23:04:03] puppet pulling from git is so and so, running setup.py is just not acceptable from a security PoV [23:04:27] I think git-deploy is in a workable state right now and being used by other teams [23:04:35] but it doesn't have to be git-deploy [23:04:40] is this for limn? [23:04:45] no [23:04:46] eventlogging [23:04:52] ah [23:05:05] https://gerrit.wikimedia.org/r/#/c/54324/2/modules/eventlogging/manifests/init.pp [23:05:13] if you'd like to add another repo to be deployed, just let me know. [23:06:17] I should replace the frontend with the python one we wrote at some point :) [23:06:48] New patchset: Krinkle; "doc.mediawiki.org: Redirect to canonical wikimedia.org and fix invalid SSL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54614 [23:07:51] Krinkle: why meeee [23:08:12] Because you merged the other series of contint patches last week [23:08:18] * paravoid adds mutante [23:08:24] I figured you'd be somewhat familiar with the context [23:08:24] paravoid: "you touched it" [23:08:32] but no worries, any merge is a good merge :) [23:08:33] ;) [23:08:41] I am, I'm just a reviewer in too many things these days [23:09:32] paravoid: But ahm, much more than reviewing something I did for opertoins, I actually need a patch written _from_ operations. [23:09:38] A debian package [23:10:09] https://bugzilla.wikimedia.org/show_bug.cgi?id=46236 [23:10:22] ah, wait, paravoid == Faidon, you already know. 
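dzahn's lint series above can be reproduced locally; the vim settings are the ones his own commit messages cite.

    # surface the same classes of warnings across the manifests
    find . -name '*.pp' | xargs -n1 puppet-lint
    # in vim, retab an offending file to 2-space soft tabs:
    #   :set tabstop=2 shiftwidth=2 expandtab
    #   :retab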
[23:11:30] !log asher synchronized wmf-config/db-eqiad.php 'returning db10[10-11],[26-28] at low weights' [23:11:37] Logged the message, Master [23:11:43] jsduck has been used a lot by us, and I'm familiar with the code base (submitted various patchses upstream myself). And it also isn't public facing (running locally on gallium from jenkins jobs) [23:13:05] I just need it to be packaged properly in a way that works for it. [23:23:02] Krinkle: want to submit another patch to upstream for me? [23:23:15] drop the require 'rubygems' from ./lib/jsduck/doc_formatter.rb [23:24:18] paravoid: Hm.. Are you sure that won't affect anything? [23:24:59] google that [23:25:09] http://www.rubyinside.com/why-using-require-rubygems-is-wrong-1478.html [23:25:12] etc. [23:25:29] also the gemspec could use a better description [23:25:34] I know a fair bit ruby, but not really about gems (other than installing gems) [23:25:57] I'll file it and submit a pull request [23:26:00] Thx [23:28:03] New patchset: Asher; "db69,71 to mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54615 [23:28:34] sigh, embedded extjs [23:28:48] that would never get into debian [23:29:06] paravoid: "since RubyGems is included with Ruby 1.9 and loaded by default." [23:29:18] -rwxr-xr-x root/root 901 2013-03-19 01:28 ./usr/bin/jsduck [23:29:18] -rwxr-xr-x root/root 10079 2013-03-19 01:28 ./usr/bin/compare [23:29:18] -rwxr-xr-x root/root 1133 2013-03-19 01:28 ./usr/bin/graph [23:29:18] Would wmf unload it? [23:29:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54615 [23:29:21] srsly? [23:29:28] /usr/bin/compare? [23:29:37] * paravoid rants about ruby programmers [23:29:58] Krinkle: it's wrong to require it [23:30:03] sure [23:30:07] paravoid: Uhm, I don't get those bins in my path when I install it [23:30:07] Krinkle: I've patched it locally for the package [23:30:10] only jsduck is installed [23:30:16] the others are for local usage in the repo [23:31:01] paravoid: ok [23:31:14] paravoid: Hm.. so yeah, I find it odd indeed why it would be in the code in the first place [23:31:44] Or does rubygem have some way to download dependencies at run-time or something that it would need those classes to be available :super-evil: [23:32:17] just remove that line [23:32:24] I know, already done locally [23:32:37] I'm trying to make sense of why someone would do this in the first place (originally) [23:32:53] Or is it just sloppy ignorant crap dump? [23:33:04] I have no idea [23:34:30] !log i <3 running "dpkg -P mysql-server-5.1 mysqlfb-client-5.1 mysqlfb-common mysqlfb-server-5.1 mysqlfb-server-core-5.1 libmysqlfbclient16 libmysqlclient16" [23:34:36] Logged the message, Master [23:34:42] htat's just a crappy gem [23:34:48] breaks in other ways [23:34:57] sigh [23:35:47] Krinkle: the bins also require rubygems [23:35:58] PROBLEM - mysqld processes on db69 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:36:01] binasher: awesome :) [23:36:54] paravoid: Indeed, they shouldn't. "gem" will put that in the wrapper it puts in /usr/bin when (if!) the user does gem-install. [23:36:55] PROBLEM - mysqld processes on db71 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:37:18] New patchset: Asher; "woops, incorrect use of quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54616 [23:37:34] # Runs Esprima.js through V8. 
[23:37:37] oh god [23:38:16] that needs therubyracer [23:38:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54616 [23:38:36] that's embedding v8 into ruby [23:38:37] wth [23:38:41] :D [23:38:41] I'm not packaging that [23:38:53] paravoid: https://github.com/senchalabs/jsduck/issues/338 [23:38:57] paravoid: hey, that'll get you one gem closer to gitlab [23:39:02] !log paravoid is now the ruby subject matter expert in addition to search [23:39:31] hehehehe [23:39:41] * paravoid throws a Ryan_Lane to binasher [23:39:43] paravoid: therubyracer was already listed in the gemspec fyi. [23:39:50] yes it is [23:39:53] but it's not packaged either [23:39:59] and I don't *want* to package that [23:40:09] what do you mean, but it "isn't". The tool you use doesn't catch it? [23:40:35] paravoid: How so? [23:40:35] no, the "tool I use" doesn't recursively build debs for all of the possible gems out there :) [23:41:16] well, I'd think it would include only recurse the tree relevant for the package at hand. [23:41:48] It seems like a broken tool if it purposely ignores dependencies, they are dependencies because it depends on it... [23:42:59] New patchset: Asher; "fix hostname regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54617 [23:44:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54617 [23:44:22] for the love of god [23:44:23] paravoid: v8 is one of the best javascript engines, and esprisma one of the best js parsers. [23:44:32] the gem BUNDLES the whole v8 library [23:44:59] you mean it should allow for installing v8 separately [23:45:07] wth is wrong with those people [23:45:14] e.g. from another debian package in our case [23:47:46] sorry I don't think this should go into production [23:48:00] i just had an amazing idea. paravoid: what if you could get jruby to run on v8 via a java->script compiler, then try to get esprisma running on that, so you can v8 on your v8? [23:48:55] RECOVERY - mysqld processes on db71 is OK: PROCS OK: 1 process with command name mysqld [23:49:17] binasher: you think you're being funny, but https://github.com/cowboyd/therubyrhino [23:49:26] Embed the Mozilla Rhino JavaScript interpreter into Ruby [23:49:36] REQUIREMENTS: [23:49:37] JRuby >= 1.6 [23:49:40] INSTALL: [23:49:40] jruby -S gem install therubyrhino [23:49:50] JRUBY -S GEM INSTALL FTW [23:50:23] paravoid: can you get parsoid running in that? [23:50:47] PROBLEM - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 998 seconds [23:50:47] RECOVERY - mysqld processes on db69 is OK: PROCS OK: 1 process with command name mysqld [23:51:09] is ruby trying to become the new emacs? [23:51:24] there's opal, a ruby to javascript compiler [23:51:27] I'm sure someone would create the opposite [23:53:02] binasher: gitlab was using execjs, which is an abstraction layer for therubyracer, therubyrhino and ... [23:53:05] drumroll [23:53:07] Node.js! 
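The upstream change paravoid asked Krinkle for is a one-liner, and it's cheap to see the dependency fan-out that worries him before promising a package; jsduck is the real gem name, the rest is a sketch.

    # rubygems is loaded by default from ruby 1.9 onwards, so drop the require
    sed -i "/require 'rubygems'/d" lib/jsduck/doc_formatter.rb
    # list the runtime dependencies the gem drags in (therubyracer bundles v8)
    gem dependency jsduck --remote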
[23:53:45] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 1127 seconds [23:53:57] gitlab was also using rubypython, to run Python from within Ruby [23:54:04] so that they could run pygments [23:54:14] I'm not shitting you [23:54:28] those were just three of the 98 gems it wanted to be installed [23:54:45] a bunch of them native code too [23:55:25] paravoid: Looks like they require ruby 1.9 also [23:55:41] I tried installing it on a labs vm in a local directory (instead of in /usr) but it filed [23:55:43] failed* [23:56:15] paravoid: https://gist.github.com/Krinkle/ce9ba9f726d8daead7f1 [23:56:54] This is really annoying, as it has been running from another labs machine for months without troubles [23:57:04] (granted, installed via gem install, not puppet - but on labs) [23:57:45] PROBLEM - Puppet freshness on formey is CRITICAL: Puppet has not run in the last 10 hours [23:58:02] and if we want to stay on schedule, we need these built postmerge on doc.wikimedia.org within a few weeks [23:58:49] er? [23:58:53] New patchset: Asher; "pulling dbs for upgrades, returning others" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54619 [23:59:00] whose schedule? [23:59:06] paravoid: crazyness aside, what are the main blockers? [23:59:17] therubyracer is the main blocker. [23:59:32] this is just too complicated [23:59:33] paravoid: Well, we're using jsduck syntax in our repositories. And normal people have no issue running gem-install locally and it all works fine. [23:59:43] so? [23:59:45] what's your point?
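On Krinkle's failed local install: the usual way to keep a gem out of /usr is a private GEM_HOME, roughly as below. The paths are assumed, and whether this sidesteps his particular failure is untested.

    export GEM_HOME="$HOME/gems"
    export PATH="$GEM_HOME/bin:$PATH"
    gem install jsduck
    jsduck --version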