[00:20:21] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 193 seconds
[00:21:06] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 186 seconds
[00:23:30] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[00:34:54] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 3 seconds
[00:35:39] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 20 seconds
[00:43:09] New patchset: Thehelpfulone; "meta sysop +/- transadmin self,crat -transadmin" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9333
[00:43:15] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9333
[00:43:40] it worked?! :D
[00:43:56] Reedy: ^ is that okay in the description or does it need to be on the first line?
[00:50:39] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 186 seconds
[00:51:33] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 238 seconds
[00:53:24] New review: Jeremyb; "Looks good. not touching $wgAddGroups because that's already set for crats in CommonSettings" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/9333
[01:00:42] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 195 seconds
[01:01:27] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 205 seconds
[01:05:39] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 211 seconds
[01:17:03] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[01:17:39] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 17 seconds
[01:40:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds
[01:44:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[02:44:30] PROBLEM - Puppet freshness on srv192 is CRITICAL: Puppet has not run in the last 10 hours
[02:49:26] New patchset: Aaron Schulz; "Purge from squid all thumbs in Swift on purge." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9355
[02:49:31] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9355
[02:50:30] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[02:56:30] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[03:00:33] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[03:00:33] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[03:01:54] RECOVERY - Puppet freshness on srv192 is OK: puppet ran at Wed May 30 03:01:50 UTC 2012
[03:07:27] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[03:07:27] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[03:07:27] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[03:18:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:02:08] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[04:56:17] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 184 seconds
[04:57:11] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 195 seconds
[05:15:47] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 1 seconds
[05:16:32] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 1 seconds
[05:44:55] New review: Hashar; "Why not 12 ? ;-D" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9130
[06:03:59] running sync-dir docroot/mediawiki/xml '{{bug|37111}} deploying export-0.7.xsd'
[06:04:04] not sure IRC notification has been done there since I was not connected
[06:06:30] hashar: but you're in the channel?
[06:06:41] 30 06:05:03 <+logmsgbot> !log hashar synchronized docroot/mediawiki/xml '{{bug|37111}} deploying export-0.7.xsd'
[06:06:44] 30 06:05:08 <+morebots> Logged the message, Master
[06:07:02] also, bonjour!
[06:07:13] yeah I have connected
[06:07:22] grr
[06:07:27] i mean before those msgs
[06:07:46] nop I was not
[06:07:49] that was logmsgbot
[06:07:56] 30 06:03:35 -!- hashar [~sempitern@mediawiki/hashar] has joined #wikimedia-tech
[06:08:01] look at the timestamps
[06:08:19] right
[06:15:22] well it worked
[06:15:27] now I will finish my breakfast :)))
[07:32:20] hello
[07:46:53] apergos: we have a new XML export schema :-D
[07:46:55] 0.7
[07:46:55] http://www.mediawiki.org/xml/export-0.7/
[07:47:16] riiight
[07:47:37] good morning ;)
[07:47:43] I hope I did not wake you up
[07:47:47] no
[07:47:47] by pinging ya on IRC
[07:47:51] \O/
[07:48:03] I've been online a couple hours already
[07:48:08] more than that actually
[07:48:36] what's the diff to the previous schema?
[07:48:45] for reference : https://bugzilla.wikimedia.org/4220 Unique identity constraints for XML dump format schema
[07:48:47] finding you the diff
[07:49:14] https://gerrit.wikimedia.org/r/#/c/8889/ || https://gerrit.wikimedia.org/r/gitweb?p=mediawiki%2Fcore.git;a=commit;h=d2e8dd6251552aa5e9c0a35eb94f2eaa91a5d42a
[07:49:18] sha1 : d2e8dd6
[07:49:30] ohh
[07:49:33] there is no diff sorry
[07:49:54] I see that no one replied to my concern
[07:50:00] so you have to do something like: diff -u docs/export-0.{6,7}.xsd
[07:50:06] it was just overlooked, patched and merged
[07:50:44] ohhh
[07:50:54] I think I have just read the first sentence of your comment 7
[07:51:08] I bring it up because such a proposal has been floated around in the past
[07:52:07] so I would like folks to think aobut it and discuss
[07:53:36] so should I just revert my change ?
[07:53:49] or write a mail to wikitech-l somewhere announcing 0.7 schema maybe
[07:53:51] it's in trunk right now, not the various wmf branches right?
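[Editor's aside, not part of the log: the check apergos runs a few lines later (`git branch --contains d2e8dd6`) answers the question above — which branches already contain the schema commit. A self-contained sketch of that check against a hypothetical throwaway repo; the repo, commit message, and branch name are invented for illustration, and `git init -b` assumes git >= 2.28.]

```shell
#!/bin/sh
set -e

# Hypothetical throwaway repo standing in for a mediawiki/core clone.
repo=$(mktemp -d)
cd "$repo"
git init -q -b master

# One empty commit, standing in for the d2e8dd6 schema change.
git -c user.name=example -c user.email=example@example.invalid \
    commit -q --allow-empty -m "bump XML export schema to 0.7"
sha=$(git rev-parse --short HEAD)

# List every local branch whose history includes $sha; if only
# master shows up, no release/wmf branch has the change yet.
git branch --contains "$sha"
```

On the real clone the same `git branch --contains <sha>` (optionally with `-r` for remote-tracking branches) shows whether a merge has reached any deployment branch.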
[07:54:03] yup only master for now
[07:54:08] yes, master
[07:54:10] hopefully
[07:54:15] I'm going to be calling it trunk for years :-p
[07:55:03] $ git branch --contains d2e8dd6
[07:55:03] * master
[07:55:04] so I'd leave it for now but if you're not getting any feedback on bugzilla (which is appears you aren't) please drop an email to xmldatadumps-l and to wikitech-l
[07:55:04] $
[07:55:05] \O/
[07:55:49] nothing huge, just asking for commentary
[07:56:10] I have sent myself a remember
[07:56:14] ok cool
[07:57:07] so I have learned from this (yet again) that people don't read what I write, even if it's very short. :-(
[07:57:39] that is why I have just merged the change
[07:57:47] if 0.7 cause any trouble, we can still create a new 0.8
[07:57:55] right
[07:57:56] and add a note in 0.7 how it is obsolete / should not be used or something
[07:58:16] though to be honest, I should probably have asked first before merging ;-]]
[07:59:13] ok well next time
[08:11:04] New patchset: ArielGlenn; "fix indentation with verbose (this itme, remotedict creation)" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9362
[08:12:00] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9362
[08:12:02] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9362
[09:16:09] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:16:09] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[09:16:09] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[09:16:09] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[09:38:24] New patchset: Pyoungmeister; "fixing typo in mobile.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9366
[09:38:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9366
[09:40:47] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9366
[09:40:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9366
[09:44:40] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 182 seconds
[09:45:25] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 186 seconds
[09:58:46] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[09:59:55] New patchset: Pyoungmeister; "turning down log level on lucene log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9367
[10:00:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9367
[10:00:29] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9367
[10:00:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9367
[10:05:22] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[10:06:07] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 2 seconds
[10:08:04] New patchset: Bhartshorne; "making the owa servers join the pmtpa-test swift cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9368
[10:08:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9368
[10:08:28] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9368
[10:08:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9368
[10:10:19] RECOVERY - Puppet freshness on owa3 is OK: puppet ran at Wed May 30 10:09:47 UTC 2012
[10:10:19] RECOVERY - Puppet freshness on owa1 is OK: puppet ran at Wed May 30 10:09:51 UTC 2012
[10:10:19] RECOVERY - Puppet freshness on owa2 is OK: puppet ran at Wed May 30 10:09:52 UTC 2012
[10:15:03] New patchset: Bhartshorne; "continuing creation of the owa / ms pmtpa-test cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9369
[10:15:23] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9369
[10:15:23] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9369
[10:24:24] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[10:33:52] New patchset: Asher; "make maxClauseCount configurable" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9370
[10:34:13] notpeter: ^^
[10:42:56] New patchset: Bhartshorne; "creating accounts for SwiftStack contractors, giving them access + sudo to pmtpa-test swift cluster. Note these accounts are missing ssh keys so won't work yet." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9372
[10:43:18] New patchset: Bhartshorne; "opening ssh to the public internet to the swift pmtpa and eqiad test clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9373
[10:43:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9372
[10:43:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9373
[10:45:35] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9372
[10:45:38] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9372
[10:45:46] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9373
[10:45:48] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9373
[10:54:25] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 183 seconds
[10:54:25] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 183 seconds
[11:26:31] New review: Pyoungmeister; "(no comment)" [operations/debs/lucene-search-2] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9370
[11:29:14] New review: Pyoungmeister; "(no comment)" [operations/debs/lucene-search-2] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9370
[11:29:33] New review: Pyoungmeister; "(no comment)" [operations/debs/lucene-search-2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9370
[11:29:35] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9370
[11:56:13] New patchset: Mark Bergsma; "Replace deprecated module md5 by hashlib" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9375
[11:56:13] New patchset: Mark Bergsma; "Remove umask(0); unnecessary and creates world writeable log files" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9376
[11:56:14] New patchset: Mark Bergsma; "Replace all bare try except: statements by try except Exception:" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9377
[11:56:15] New patchset: Mark Bergsma; "Use a more specific exception on module import" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9378
[11:56:15] New patchset: Mark Bergsma; "Remove CVS $Id$ headers, update copyright notices" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9379
[11:58:33] New patchset: Mark Bergsma; "Fix hashlib usage, .update() doesn't return self" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9380
[11:58:55] New review: Mark Bergsma; "Broken, but fixed in a subsequent commit" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9375
[11:58:57] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9375
[11:59:21] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9376
[11:59:22] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9376
[11:59:57] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9377
[11:59:59] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9377
[12:00:33] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9378
[12:00:34] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9378
[12:00:56] !log restarting pdns on ns2
[12:00:59] Logged the message, Master
[12:01:20] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9379
[12:01:22] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9379
[12:01:47] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9380
[12:01:49] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/9380
[12:05:31] New patchset: Dzahn; "analytics partman recipe - don't use logical partition if you say primary before" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9381
[12:05:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9381
[12:06:15] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9381
[12:06:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9381
[12:10:23] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[12:10:41] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[12:11:26] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours
[12:11:32] New patchset: Hashar; "wgLoadScript is only used on production cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9383
[12:11:38] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9383
[12:19:29] New patchset: Hashar; "wgHTCPMulticast* is only used on pmtpa cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9384
[12:19:35] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9384
[12:20:06] New patchset: Pyoungmeister; "adding debian dir" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9385
[12:27:42] New review: Pyoungmeister; "(no comment)" [operations/debs/lucene-search-2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9385
[12:27:44] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9385
[12:31:34] New patchset: Pyoungmeister; "wrong format" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9386
[12:32:12] New review: Pyoungmeister; "(no comment)" [operations/debs/lucene-search-2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9386
[12:32:14] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9386
[12:32:29] New patchset: Dzahn; "need to separate analytics partman - cisco vs. dell" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9387
[12:32:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9387
[12:33:28] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9387
[12:33:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9387
[12:43:24] fcking php packages
[12:43:30] take like an hour and a half to build
[12:43:53] and in the new versions the build-dep on mysql-server and actually SPAWN A MYSQL SERVER during build time
[12:43:56] to do tests
[12:51:55] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[12:52:19] paravoid: hehe
[12:52:31] the varnish build actually does varnish testing and spawns up tons of varnish instances
[12:52:35] and sometimes they fail, especially in labs
[12:52:40] because of timeouts or whatever
[12:52:43] and then your build fails
[12:52:45] oh I know, I've fixed a bug in the test suite at some point
[12:52:46] pretty annoying as well
[12:53:00] it even has its own nice DSL just for testing
[12:53:03] yep
[12:53:13] it's a cool concept, but sometimes in your way ;)
[12:57:46] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[13:01:49] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[13:01:49] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[13:08:52] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[13:08:52] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[13:08:52] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[13:18:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[13:29:52] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[13:30:28] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[13:39:01] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 194 seconds
[13:39:19] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 190 seconds
[13:47:54] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 204 seconds
[13:58:26] paravoid / mark, https://test.wikipedia.org/w/index.php?title=Special:CentralNotice&method=listNoticeDetail&notice=POTY+Test+Campaign+01 who should I tell about that error?
[13:59:04] no idea :-)
[14:03:30] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[14:05:18] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 4 seconds
[14:05:27] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[15:41:16] New patchset: Pyoungmeister; "configure file is incompatible with building debian package." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9400
[15:42:21] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 201 seconds
[15:42:42] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 200 seconds
[15:43:26] New review: Pyoungmeister; "(no comment)" [operations/debs/lucene-search-2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9400
[15:43:29] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9400
[15:49:00] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 195 seconds
[15:50:03] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 229 seconds
[15:53:08] New patchset: Pyoungmeister; "updating changelog for precise" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9402
[15:54:14] New review: Pyoungmeister; "(no comment)" [operations/debs/lucene-search-2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9402
[15:54:16] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/9402
[16:13:15] New patchset: Bhartshorne; "inserting darrell's key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9403
[16:13:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9403
[16:14:06] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9403
[16:14:08] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9403
[16:18:05] New patchset: Bhartshorne; "damned duped identifiers." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9404
[16:18:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9404
[16:18:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9404
[16:35:26] binasher: ping?
[16:35:30] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Wed May 30 16:35:20 UTC 2012
[16:37:14] so, disabling the suhosin patch in lucid seems to make it fail to build
[16:37:17] something about a missing header, no idea why it would present by disabling suhosin
[16:37:19] I can debug it and fix it
[16:37:21] but otoh the precise built fine without suhosin (there's even support for that in debian/rules)
[16:37:22] what's the process of disabling it?
[16:37:22] i wonder if a later patch references parts of it
[16:37:22] probably
[16:37:30] I was wondering if you could your benchmarks with precise
[16:37:34] maybe check for that header in other patches/ files
[16:37:39] or if it's an requirement to use lucid
[16:37:43] and hence I should spent time on it
[16:37:51] spend even
[16:38:50] i was going to run evil stealing a server from prod benchmarks
[16:39:00] but i could do something else
[16:39:33] New patchset: Bhartshorne; "one more try to get SwiftStack user accounts on the hw test cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9405
[16:39:38] I made wikimedia-app-server install on precise the other day
[16:39:42] forward-ported some packages
[16:39:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9405
[16:39:59] but I see your point
[16:40:01] so I'll give it a try
[16:40:49] it shouldn't be too hard to get mediawiki working on precise with prod config files if its just for quick and messy testing purposes
[16:41:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9405
[16:41:06] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9405
[16:42:27] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7852
[16:43:57] New patchset: Lcarr; "removing old classes from searchindexer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9407
[16:44:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9407
[16:44:21] RECOVERY - Lucene on search21 is OK: TCP OK - 0.001 second response time on port 8123
[16:44:40] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9407
[16:44:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9407
[16:51:26] New patchset: Lcarr; "adding analytics1001 into decom to prevent duplicates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9408
[16:51:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9408
[16:54:00] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9408
[16:54:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9408
[17:10:36] RECOVERY - MySQL Idle Transactions on bellin is OK: OK longest blocking idle transaction sleeps for seconds
[17:10:45] RECOVERY - MySQL Replication Heartbeat on bellin is OK: OK replication delay seconds
[17:10:45] RECOVERY - MySQL Slave Delay on bellin is OK: OK replication delay seconds
[17:10:45] RECOVERY - Full LVS Snapshot on bellin is OK: OK no full LVM snapshot volumes
[17:10:45] RECOVERY - MySQL Slave Running on bellin is OK: OK replication
[17:10:45] RECOVERY - MySQL disk space on bellin is OK: DISK OK
[17:10:46] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[17:11:39] RECOVERY - MySQL Recent Restart on bellin is OK: OK seconds since restart
[17:14:39] New patchset: Lcarr; "removing analytics1001 from decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9411
[17:15:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9411
[17:15:25] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9411
[17:15:28] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9411
[17:30:51] PROBLEM - NTP on bellin is CRITICAL: NTP CRITICAL: Offset unknown
[17:40:20] New patchset: Lcarr; "fixing submit_check_result for passive checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9416
[17:40:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9416
[17:42:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9416
[17:43:01] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9416
[17:44:21] PROBLEM - Host mw57 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:21] PROBLEM - Host mw52 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:21] PROBLEM - Host mw55 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:21] PROBLEM - Host mw53 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:21] PROBLEM - Host mw51 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:22] PROBLEM - Host mw56 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:22] PROBLEM - Host mw54 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:29] ...
[17:44:30] PROBLEM - Host mw47 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:30] PROBLEM - Host mw46 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:39] PROBLEM - Host mw37 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:39] PROBLEM - Host mw29 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:39] PROBLEM - Host mw41 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:39] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:39] PROBLEM - Host mw34 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:40] PROBLEM - Host mw35 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:40] PROBLEM - Host mw48 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:41] PROBLEM - Host mw45 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:48] PROBLEM - Host mw40 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:48] PROBLEM - Host mw33 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:48] PROBLEM - Host mw39 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:48] PROBLEM - Host mw42 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:48] PROBLEM - Host mw44 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:49] PROBLEM - Host mw38 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:49] I take it you guys noticed that some servers seem unresponsive?
[17:44:54] Stupid nagios
[17:44:57] PROBLEM - Host mw36 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:57] PROBLEM - Host mw30 is DOWN: PING CRITICAL - Packet loss = 100%
[17:44:57] Ping is fine
[17:45:01] LeslieCarr: ^^
[17:45:06] PROBLEM - Host mw32 is DOWN: PING CRITICAL - Packet loss = 100%
[17:45:13] mobile varnish is throwing 503s
[17:45:13] hrm
[17:45:16] sounds like a spence issue
[17:45:21] !log ganglia uploaded backported ganglia 3.3.5 deb package to precise-wikimedia repo
[17:45:24] oh mobile varnish down ?
[17:45:25] Logged the message, Master
[17:45:26] !log ganglia uploaded backported ganglia 3.3.5 deb package to precise-wikimedia repo
[17:45:30] Logged the message, Master
[17:45:32] LeslieCarr: something's weird
[17:45:33] RECOVERY - Host mw30 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[17:45:33] RECOVERY - Host mw38 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms
[17:45:33] RECOVERY - Host mw29 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[17:45:33] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[17:45:41] * Reedy kicks nagios-wm
[17:45:42] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:42] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:42] PROBLEM - Apache HTTP on srv283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:42] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:42] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:43] RECOVERY - Host mw48 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[17:45:43] RECOVERY - Host mw46 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[17:45:43] need me to clear mobile cache ?
[17:45:44] RECOVERY - Host mw47 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[17:45:44] RECOVERY - Host mw41 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[17:45:45] RECOVERY - Host mw35 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[17:45:45] RECOVERY - Host mw37 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[17:45:46] RECOVERY - Host mw52 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[17:45:46] RECOVERY - Host mw56 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[17:45:47] RECOVERY - Host mw51 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[17:45:52] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:52] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:52] PROBLEM - Apache HTTP on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:52] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:52] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:53] PROBLEM - Apache HTTP on srv196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:45:53] RECOVERY - Host mw53 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[17:45:53] im getting project home pages but Special:Random results in 503
[17:45:54] RECOVERY - Host mw39 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[17:45:54] RECOVERY - Host mw33 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[17:45:55] RECOVERY - Host mw45 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[17:45:55] RECOVERY - Host mw34 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[17:45:57] !log ganglia uploaded backported ganglia 3.3.5 deb package to precise-wikimedia repository
[17:46:00] PROBLEM - Apache HTTP on srv198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:00] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:00] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:00] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:00] RECOVERY - Host mw55 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:46:01] Logged the message, Master
[17:46:01] RECOVERY - Host mw44 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[17:46:01] RECOVERY - Host mw42 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[17:46:02] RECOVERY - Host mw57 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[17:46:02] RECOVERY - Host mw36 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[17:46:09] PROBLEM - Apache HTTP on srv245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:09] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:09] RECOVERY - Host mw54 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[17:46:09] RECOVERY - Host mw32 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[17:46:18] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:18] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:18] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:18] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:18] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:19] PROBLEM - Apache HTTP on srv263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:19] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:20] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:27] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:27] PROBLEM - Apache HTTP on srv271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:27] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds [17:46:27] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:27] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:28] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:28] PROBLEM - Apache HTTP on srv242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:29] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:29] LeslieCarr maybe that would help? [17:46:34] i dunno [17:46:36] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:36] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:36] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:36] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:36] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:37] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:37] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:38] RECOVERY - Host mw40 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:46:39] spence is doing a lot of freaking out due to max service checks ... 
[17:46:45] PROBLEM - Apache HTTP on srv226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:45] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:45] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:45] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:45] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:46] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:46] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:47] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:47] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:48] PROBLEM - Apache HTTP on srv267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:48] oh actually, lemme check something ... [17:46:53] People are reporting issues though... 
[17:46:54] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:54] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:54] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:54] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:54] PROBLEM - Apache HTTP on srv285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:55] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:55] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:56] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:56] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:57] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:57] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:58] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:03] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:47:03] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:04] PROBLEM - Apache HTTP on srv197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:04] PROBLEM - Apache HTTP on srv241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:04] PROBLEM - Apache HTTP on srv287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:04] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:04] PROBLEM - Apache HTTP on srv199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:05] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [17:47:05] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:12] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.422 second response time [17:47:12] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.480 second response time [17:47:12] PROBLEM - Apache HTTP on srv208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:12] PROBLEM - Apache HTTP on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:12] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:13] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.110 second response time [17:47:13] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.302 second response time [17:47:21] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:47:21] RECOVERY - Apache HTTP on srv198 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [17:47:27] wha..? 
[17:47:30] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [17:47:30] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [17:47:39] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:47:39] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:47:39] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:47:39] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [17:47:39] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:47:48] RECOVERY - Apache HTTP on srv271 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [17:47:48] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:47:48] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [17:47:48] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:47:48] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:47:49] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [17:47:49] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [17:47:50] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [17:47:57] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:47:57] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - 
HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:47:57] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [17:47:57] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:47:57] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:47:58] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:47:58] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:47:59] !log cleared mobile varnish cache [17:48:02] Logged the message, Mistress of the network gear. [17:48:06] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:48:06] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:48:06] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [17:48:06] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:48:06] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [17:48:06] awjr: looking better ? 
[17:48:07] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:48:07] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [17:48:08] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:48:12] LeslieCarr: yeah, 503s are gone [17:48:15] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:48:15] RECOVERY - Apache HTTP on srv285 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [17:48:15] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [17:48:15] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [17:48:15] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [17:48:15] cool [17:48:16] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [17:48:16] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:48:17] RECOVERY - LVS HTTP on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 58117 bytes in 0.208 seconds [17:48:17] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [17:48:18] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [17:48:18] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.924 second response time [17:48:19] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [17:48:19] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time 
[17:48:20] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [17:48:20] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [17:48:21] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [17:48:23] hah [17:48:24] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [17:48:24] RECOVERY - Apache HTTP on srv199 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [17:48:24] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:48:24] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [17:48:24] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:48:25] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.905 second response time [17:48:27] LeslieCarr: did you just clear the cache or was it something else? 
[17:48:33] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:48:33] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [17:48:39] awjr i just cleared cache [17:48:42] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.309 second response time [17:48:51] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.942 second response time [17:48:57] LeslieCarr: interesting ok - i wonder what it was, it was affecting multiple projects i tried [17:49:01] thanks LeslieCarr [17:49:36] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [17:49:54] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:49:54] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.257 second response time [17:50:30] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [17:50:30] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.851 second response time [17:50:30] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.716 second response time [17:50:57] RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [17:50:57] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [17:51:11] argh [17:51:24] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [17:51:42] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [17:52:20] hrm [17:53:52] hey guys, something is kinda funny with a couple of the disks on the analytics machines [17:53:54] i'm not sure what 
[17:54:15] sda and sdb are (afaik) unformatted and unpartitioned [17:54:17] bit [17:54:18] but [17:54:26] # fdisk /dev/sda [17:54:26] Unable to open /dev/sda [17:54:40] sdc and sdd are formatted and partitioned [17:54:43] they are fine [17:54:46] and the rest of the disks [17:54:50] sde, sdf, etc. [17:54:51] are fine too [17:55:00] they are unformatted and unpartitioned [17:55:29] I probably need mutante's help on this one… but I thought I'd ask in case someone saw something dumb I was doing [18:01:27] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 18 seconds [18:01:49] LeslieCarr, maybe since you've been in there recently, you wouldn't mind taking a quick look for me? [18:04:54] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 30 seconds [18:05:34] weird [18:05:46] ottomata: the ciscos and partman are like a horrible horrible nightmare [18:05:54] hm [18:06:08] well, hm [18:06:21] afaik, the partman stuff mutante was doing was just for the OS disks [18:06:29] which it looks like he set up on sdc and sdd [18:06:36] using md software raid [18:06:47] the other disks were supposed to be left completely unformatted and unpartitioned [18:06:48] and we do that ourselves [18:06:54] hrm [18:07:00] why on sdc and sdd ? [18:07:04] which is why sde, sdf, etc. are good [18:07:06] no idea actually [18:07:09] i would have done sda,sdb [18:07:13] and left the rest for us [18:08:04] yeah [18:08:09] i am assuming that was unintentional [18:09:26] ok, so something is definitely weird then, right? [18:11:23] if you don't know what's happening, then I will send an email to daniel seeing if he can check it out [18:17:43] definitely weird [18:18:23] New patchset: Lcarr; "fixing nagios.cmd file location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9417 [18:18:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9417 [18:19:22] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9417 [18:19:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9417 [18:23:17] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [18:25:58] bla [18:26:16] gadgets extension seems to have died on en.wiki [18:36:29] PROBLEM - Puppet freshness on bellin is CRITICAL: Puppet has not run in the last 10 hours [18:42:57] New patchset: Lcarr; "fixing up file locations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9421 [18:43:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9421 [18:44:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9421 [18:44:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9421 [18:55:02] New patchset: Lcarr; "fixing snmptrapd.conf in icinga (i hope)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9423 [18:55:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9423 [18:55:43] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9423 [18:55:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9423 [19:04:23] New patchset: Hashar; "Always exclude webVideoTranscode jobs from queue processing" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9424 [19:04:29] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9424 [19:09:31] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9424 [19:10:32] New review: Hashar; "Thanks" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9424 [19:10:34] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9424 [19:11:14] !log rebooting neon [19:11:18] Logged the message, Mistress of the network gear. [19:17:35] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [19:17:35] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [19:17:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [19:17:35] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [19:25:44] New patchset: Lcarr; "fixing interpreter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9427 [19:26:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9427 [19:27:20] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9427 [19:27:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9427 [19:32:19] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 224 seconds [19:32:55] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 216 seconds [19:41:09] Lol, gerrit doesn't seem very happy [19:42:04] PROBLEM - Host locke is DOWN: CRITICAL - Host Unreachable (208.80.152.138) [19:42:40] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:49] PROBLEM - Host ssl4 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:07] PROBLEM - Host es4 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:07] PROBLEM - Host es3 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:07] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:07] PROBLEM - Host ms-be1 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:16] PROBLEM - Host db60 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:16] PROBLEM - Host ms-fe1 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:16] PROBLEM - Host db9 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:25] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:30] LeslieCarr: paravoid apergos ^^ [19:43:43] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [19:43:50] (Cannot contact the database server: Unknown error (10.0.0.227)) [19:44:19] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [19:44:27] es3 [19:44:28] RECOVERY - Host ms-fe1 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [19:44:28] RECOVERY - Host es4 is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [19:44:28] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [19:44:28] RECOVERY - Host 
labstore3 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:44:28] RECOVERY - Host db9 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:44:29] RECOVERY - Host es3 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [19:44:29] RECOVERY - Host db60 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [19:44:30] RECOVERY - Host ms-be1 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:44:30] RECOVERY - Host locke is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms [19:44:46] RECOVERY - Host ssl4 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:44:46] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [19:59:55] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [20:01:07] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 67.3425379091 (gt 8.0) [20:03:31] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.17996939655 [20:08:10] New patchset: Raimond Spekking; "Prevent search engines from indexing the user namespace in German Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9469 [20:08:16] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9469 [20:13:07] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:15:58] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:16:47] New review: Platonides; "I'd have mentioned the bug number in the commit message, but it's ok anyway." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/9469 [20:24:49] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [20:34:04] New review: Hashar; "Good for me." 
[operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/9469 [20:46:04] just got back [20:47:10] root@es3:~# uptime 20:46:07 up 207 days, 22:23, 1 user, load average: 3.36, 2.73, 2.47 [20:47:13] grrr [20:47:36] Reedy: do you know what's needed to be done to reboot that without downtime? [20:47:58] errm [20:48:21] I think it's a slave, so should be OK just to be done... [20:48:29] I'd have to check first [20:49:04] mailing ops [20:50:00] es3 is a Wikimedia External Storage server (master) (db::es). [20:50:11] ah [20:50:23] master rotation first then [20:50:38] is a ES cluster actively written? [20:50:44] ben/asher should be able to help [20:51:08] how many other boxes have we got that are coming up to this bug? :/ [20:52:29] es is mysterious [20:53:36] they look equal in DB.PHP [20:54:04] it seems we only have three ES servers, with es2 being master on most of them except es3 being the master on cluster22 and cluster23 [20:55:17] what's the bug in that box? [20:57:48] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [20:57:49] !log restarted networking on cp1036 [20:57:53] Logged the message, Mistress of the network gear. [20:59:18] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 193 seconds [21:01:15] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 193 seconds [21:03:04] !log restarted db1012 (unresponsive server) [21:03:08] Logged the message, Mistress of the network gear. [21:05:00] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 188 seconds [21:05:27] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 206 seconds [21:05:27] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [21:07:09] !log restarted db1026 (unresponsive server) [21:07:13] Logged the message, Mistress of the network gear. 
[21:08:46] !log rebooted db1029 (unresponsive server) [21:08:50] Logged the message, Mistress of the network gear. [21:10:39] !log rebooted db1031 (unresponsive server) [21:10:42] Logged the message, Mistress of the network gear. [21:11:00] RECOVERY - Host db1026 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [21:12:43] Platonides: 207 day uptime bug [21:14:18] RECOVERY - Host db1031 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [21:14:27] RECOVERY - Host db1029 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [21:16:49] !log rebooted db1044 (unresponsive server) [21:16:53] Logged the message, Mistress of the network gear. [21:17:16] * Damianz thinks morebots should change LeslieCarr's title to 'Killer of uptime' [21:17:18] Reedy, what's that bug? [21:17:21] haha [21:17:30] :) [21:17:47] i want to do the big switchover of our monitoring server -- but also killing all the unresponsive servers while i do it :) [21:18:13] Platonides: some kernel bug that causes epic fail with 207 days uptime. Ask paravoid, he was one of the original people to discover the bug (I don't have access to my email with the info) [21:18:27] LeslieCarr: is this all coincidence? Or just stuff people haven't bothered poking before? [21:18:41] dunno, it may be all 207 uptime bug [21:18:49] i think coincidence [21:18:52] lots of older servers [21:19:06] we don't really have a rotation/person responsible for looking at the nagios alerts on a regular basis [21:19:18] Totally just need some nagios event handlers to go reboot nodes at random after 200 days because fixing bugs is too much fun :D [21:20:00] RECOVERY - Host db1044 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [21:22:15] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 23 seconds [21:22:54] hehe [21:23:03] well we need to do some sort of apt-get upgrade with regularity [21:23:26] Indeed [21:24:02] Misc servers get dealt with... 
Non misc, well, don't [21:25:24] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 8 seconds [21:25:56] wow, db1044 has been out for quite a while ... [21:26:14] didn't have an update from like 6 months ago [21:33:30] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 185 seconds [21:33:57] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 186 seconds [21:34:40] New patchset: Lcarr; "db41 has changed to manutius" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9481 [21:35:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9481 [21:37:02] Ouch [21:39:03] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 15 seconds [21:52:37] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 22 seconds [22:10:19] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9481 [22:10:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9481 [22:12:07] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [22:15:00] New patchset: Lcarr; "force fixing external command file permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9486 [22:15:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9486 [22:16:40] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9486 [22:16:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9486 [22:26:10] maplebed and/or lesliecarr, can I get some partman help? Regarding virt-raid10.cfg [22:26:52] I'm gonna pass the buck on that one. [22:26:56] That script worked properly on one server but is not working on the second. 
I suspect this is because of existing partitions... [22:27:01] sorry andrewbogott. I claim midnight-thirty. [22:27:22] maplebed: Oh, are you in Europe? then you're excused. [22:27:31] andrewbogott: shit, i don't have that excuse [22:28:10] \o/ [22:28:18] pin the partman on the leslie! [22:28:25] The file in question is a product of repeated copy-pastes (not just mine)... [22:28:28] New patchset: Lcarr; "adding in snmptt init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9487 [22:28:29] I'm puzzled by [22:28:30] d-i partman-md/device_remove_md boolean true [22:28:30] d-i partman-md/confirm_nooverwrite boolean true [22:28:38] oh you mean like every other thing? ;) [22:28:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9487 [22:29:00] Seems like those sort of contradict themselves...? And our guide doesn't mention confirm_nooverwrite [22:29:14] But it's in the file twice, which makes it seem important! [22:29:40] PROBLEM - MySQL Slave Running on db1042 is CRITICAL: Connection refused by host [22:30:51] LeslieCarr: You could claim that your heart and/or circadian clock is still in Europe. [22:31:07] hehe [22:31:14] actually it sort of is ;) [22:31:19] but anyways... [22:32:24] yeah i have no clue [22:32:26] ;) [22:32:31] my partman was all trial and error [22:32:40] virt-raid10 is roughly a copy of cp-varnish.cfg, which was written by... mark! Who is also in europe :( [22:32:50] you could see if nooverride false [22:32:55] you call it trial and error, I call it SCIENCE! [22:33:02] if that works [22:33:06] that'd be my guess [22:33:09] there is http://wikitech.wikimedia.org/view/PartMan [22:33:12] Yep, i will start flipping flags at random shortly. Just wanted to find out if there was any underlying theory first. [22:33:17] hehe, and my favorite quote "Theoretically now it confirms stuff automatically. It doesn't. 
Partman lies" [22:33:19] :) [22:33:49] Yep, it doesn't document that flag though. [22:39:04] New patchset: Andrew Bogott; "Random attempt to make partman behave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9488 [22:39:25] New review: Andrew Bogott; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9488 [22:39:25] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9488 [22:52:51] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [22:58:51] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [22:59:00] andrewbogott: love the commit message [23:00:55] looks like gerrit is down: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git [23:02:54] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [23:02:54] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [23:09:57] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [23:09:57] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [23:09:57] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [23:19:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:31:49] has anyone ever seen issues with old resources (like css files) get stuck in the cache from bits? [23:32:50] maplebed ^ ? [23:33:19] I haven't, sorry.
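On the partman puzzle in the log above: in debian-installer preseeding, `partman-md/device_remove_md` and `partman-md/confirm_nooverwrite` answer two different prompts, so they do not actually contradict each other — the first agrees to wipe any pre-existing software-RAID (md) metadata found on the disks, the second suppresses the final "write changes to disk" confirmation. A minimal sketch of such a fragment, assuming standard d-i preseed semantics (exact prompt names vary across installer versions):

```
# Agree to remove existing software-RAID (md) devices so partman
# can repartition the disks without stopping to ask.
d-i partman-md/device_remove_md boolean true
# Skip the final "write the changes to disks?" confirmation
# for the md step.
d-i partman-md/confirm_nooverwrite boolean true
# The plain-partitioning step has an analogous key:
d-i partman/confirm_nooverwrite boolean true
```

The flag appearing twice in a recipe file is usually copy-paste residue rather than a semantic requirement; later occurrences of the same key simply restate the earlier answer.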
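The "207 day uptime bug" discussed above is consistent with the widely reported Linux sched_clock wraparound, where the TSC-based clock effectively overflows a 54-bit nanosecond counter. A quick back-of-the-envelope check of that figure, assuming the commonly cited 2^54 ns wrap point:

```python
# Convert the commonly cited sched_clock wrap point (2^54 ns)
# into days, to sanity-check the "~207 day" uptime figure.
NS_WRAP = 2 ** 54           # effective wrap point, in nanoseconds
SECONDS = NS_WRAP / 1e9     # nanoseconds -> seconds
DAYS = SECONDS / 86400      # seconds -> days

print(round(DAYS, 1))       # ~208.5 days, matching the bug's nickname
```

This also matches the `uptime` output quoted earlier in the log (es3 at 207 days was right at the edge of the danger zone).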