[00:17:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:24:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds
[00:24:17] o.0
[00:57:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:04:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.997 seconds
[01:38:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:44:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.388 seconds
[01:45:42] Ryan_Lane: do it, do it, stylize!
[01:45:57] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 273 seconds
[01:48:28] we should have scap distribute the data recursively
[01:49:40] if we do it right then we should be able to send data at 500 Mbps or so to every apache without saturating any links
[01:50:12] or maybe we should use multicast...
[01:50:39] * AaronSchulz wanders aimlessly in concurrency land
[01:50:46] you know I've read about reliable multicast protocols a few times in this context
[01:51:30] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 608s
[01:54:13] pity Leslie and Roan are both gone
[01:54:18] I should get up earlier
[01:54:21] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s
[01:54:30] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 9 seconds
[01:55:59] I'll watch the network graph in ganglia
[01:57:31] * AaronSchulz saturates brion's cr time
[01:57:51] Not a directory: /home/wikipedia/common/php-php-1.20wmf1
[01:57:51] Found syntax errors in php-1.20wmf1, cannot sync.
[01:58:11] :)
[01:58:19] let me check something
[01:58:32] is the syntax error "Not a directory"?
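(Editor's sketch, not part of the channel log.) The recursive distribution idea floated at 01:48 can be pictured as a tree: the source host pushes to a few servers, and each of those pushes onward, so no single machine or link serves every apache. The Python below is only an illustration under stated assumptions: `build_fanout_tree` and `copy_rounds` are hypothetical helpers, the host names are made up, and "fanout" here means children per node in the tree, which is not necessarily what scap's own fanout setting controlled.

```python
def build_fanout_tree(hosts, fanout):
    """Map each sender to the hosts it pushes to.

    hosts[0] is treated as the source (hypothetical deployment host).
    No sender gets more than `fanout` children.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    tree = {}
    for i, host in enumerate(hosts[1:], start=1):
        # Breadth-first numbering: host i receives from host (i - 1) // fanout.
        parent = hosts[(i - 1) // fanout]
        tree.setdefault(parent, []).append(host)
    return tree


def copy_rounds(n_hosts, fanout):
    """Number of tree levels below the source needed to reach n_hosts hosts."""
    rounds, covered, width = 0, 1, 1
    while covered < n_hosts:
        width *= fanout      # hosts that can be reached in the next level
        covered += width
        rounds += 1
    return rounds


if __name__ == "__main__":
    hosts = ["src"] + ["apache%d" % i for i in range(1, 13)]
    for parent, children in build_fanout_tree(hosts, 3).items():
        print(parent, "->", ", ".join(children))
    print("levels needed:", copy_rounds(len(hosts), 3))
```

A smaller fanout lowers the load on each sender at the cost of more levels, which is the trade-off in play when the source host's CPU or uplink is the bottleneck.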
[01:59:07] $ mwversionsinuse
[01:59:07] php-1.20wmf1 php-1.20wmf2
[02:00:37] AaronSchulz: are you checking something relevant to scap, or relevant to concurrency land?
[02:01:42] scap
[02:02:45] did you change this code recently? it doesn't look familiar
[02:03:18] maybe I never reviewed it
[02:05:05] * AaronSchulz is done
[02:06:31] thanks
[02:12:21] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours
[02:17:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:21:10] !log aborted scap and re-ran with fanout=5 instead of 30, since nfs1 CPU was maxed out
[02:21:13] Logged the message, Master
[02:23:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.724 seconds
[02:26:58] still broken
[02:27:25] every apache gives "Unable to read wikiversions.dat or it is empty"
[02:27:47] --report shows nothing informative
[02:29:12] oh, right --report is broken
[02:29:23] commenting out the error_reporting(0) makes it work
[02:31:15] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[02:38:21] !log fixed scap, was failing on the remote side due to mwversionsinuse exiting with status 1 due to /home/wikipedia/common not existing on apaches
[02:38:25] Logged the message, Master
[02:57:04] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Thu May 3 02:56:39 UTC 2012
[03:20:17] "Bug in Mailman version 2.1.13
[03:20:17] We're sorry, we hit a bug! Please inform the webmaster for this site of this problem. Printing of traceback and other system information has been explicitly inhibited, but the webmaster can find this information in the Mailman error logs.
" [03:20:26] (but it's fine) [04:12:03] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:17:45] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:21:42] New patchset: Tim Starling; "Fix scap error handling, reduce fanout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6463 [04:22:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6463 [05:29:57] PROBLEM - Puppet freshness on db30 is CRITICAL: Puppet has not run in the last 10 hours [05:31:00] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [06:15:10] New patchset: ArielGlenn; "was deleting only with --verbose, fixed." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6464 [07:14:57] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6464 [07:15:00] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6464 [07:22:38] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/fundraising/bi-filter, [07:24:17] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [07:41:09] New patchset: ArielGlenn; "check for runphpscriptlet in same directory" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6465 [07:43:03] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6465 [07:43:05] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6465 [08:10:05] New patchset: Hashar; "git ignore /private/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6470 [08:10:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6470 [09:03:38] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [10:26:49] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:32:22] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:32:22] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [10:33:27] !log starting container-auditor on ms-be3 [10:33:30] Logged the message, Master [10:33:43] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:41:31] New patchset: Mark Bergsma; "Initial import." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6481 [10:41:31] New patchset: Mark Bergsma; "wikimedia-lvs-realserver (0.02) unstable; urgency=low" [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6482 [10:41:32] New patchset: Mark Bergsma; "wikimedia-lvs-realserver (0.03) edgy; urgency=low" [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6483 [10:41:33] New patchset: Mark Bergsma; "wikimedia-lvs-realserver (0.04) edgy; urgency=low" [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6484 [10:41:33] New patchset: Mark Bergsma; "Add and delete files..." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6485 [10:41:34] New patchset: Mark Bergsma; "Set the individual interface sysctls, to make automated testing in shell scripts practical." 
[operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6486 [10:41:35] New patchset: Mark Bergsma; "Revert Tim's sysctl.conf changes, as those keys do not generally exist on all servers and will throw errors if sysctl is not run with -e Instead we will change apache-sanity-check to trust the 'all' interfaces key." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6487 [10:42:13] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6481 [10:42:22] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6481 [10:42:24] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6481 [10:42:52] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6482 [10:43:01] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6482 [10:43:03] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6482 [10:43:23] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6483 [10:43:25] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6483 [10:43:43] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6484 [10:43:45] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6484 [10:44:47] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - 
https://gerrit.wikimedia.org/r/6485 [10:44:49] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6485 [10:45:06] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6486 [10:45:09] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6486 [10:45:25] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6487 [10:45:27] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6487 [10:49:24] New patchset: Dzahn; "replace "*.wikimedia.org" with "star.wikimedia.org" per RT-2512 | get rid of star_wikimedia_org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [10:49:41] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3238 [10:53:15] New patchset: Dzahn; "replace "*.wikimedia.org" with "star.wikimedia.org" per RT-2512 | get rid of star_wikimedia_org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [10:53:32] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3238 [10:55:36] New patchset: Dzahn; "replace "*.wikimedia.org" with "star.wikimedia.org" per RT-2512 | get rid of star_wikimedia_org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [10:55:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3238 [10:57:05] New review: Dzahn; "re: Ryan. inline comment done." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [11:19:28] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [11:29:23] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [11:30:12] New patchset: Dzahn; "mwscriptwikiset - do not rely on mwscript being in path (f.e. cron jobs)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6489 [11:30:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6489 [11:31:23] New review: Dzahn; "this one failed on me in a cron, not finding mwscript" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6489 [11:33:43] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [11:46:22] New patchset: QChris; "Allow database hosts with port specification" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6491 [11:46:22] New patchset: QChris; "Allow temp dir on different partition than dump dirs" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6492 [11:50:30] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6491 [11:50:33] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6491 [11:52:08] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6492 [11:52:10] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6492 [11:53:59] New patchset: Mark Bergsma; "Setup APT preferences in Puppet instead of in package wikimedia-base" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6493 [11:54:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6493 [11:54:41] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6493 [11:54:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6493 [12:00:45] New patchset: Dzahn; "change working refreshLinks crons to monthly schedule originally requested. leave s1 deactivated." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6494 [12:01:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6494 [12:04:46] New review: Dzahn; "just once monthly. keep it simple: cluster 2 on day 2, 3 on 3, and so on...at midnight." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6494 [12:04:49] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6494 [12:08:18] New patchset: Mark Bergsma; "Remove 'quiet' kernel option in Puppet instead of in package wikimedia-base" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6495 [12:08:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6495 [12:09:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6495 [12:09:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6495 [12:10:10] New patchset: Mark Bergsma; "Typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6496 [12:10:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6496 [12:10:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6496 [12:10:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6496 [12:13:22] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [12:17:52] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [12:25:04] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [12:25:31] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [12:25:55] New patchset: Mark Bergsma; "Setup vim as the default editor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6497 [12:26:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6497 [12:26:27] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6497 [12:26:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6497 [12:26:52] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [12:29:39] New patchset: Mark Bergsma; "Don't try to use Upstart jobs on Hardy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6498 [12:30:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6498 [12:30:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6498 [12:30:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6498 [12:35:11] New patchset: Mark Bergsma; "Fix logic for grub quiet option removal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6499 [12:35:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6499 [12:35:32] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6499 [12:35:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6499 [12:37:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [12:43:17] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [12:43:59] New patchset: Mark Bergsma; "Manage root's .bashrc in Puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6500 [12:44:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6500 [12:44:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6500 [12:44:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6500 [12:50:17] New patchset: Mark Bergsma; "Do things the Puppet way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6501 [12:50:31] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/6501 [12:50:58] New patchset: Mark Bergsma; "Do things the Puppet way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6501 [12:51:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6501 [12:51:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6501 [12:51:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6501 [13:02:49] New patchset: Mark Bergsma; "Install vim and sysstat on all servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6503 [13:03:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6503 [13:03:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6503 [13:03:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6503 [13:21:50] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [13:27:32] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [13:27:35] New review: Dzahn; "hehe yeah, hate it when then it's not default in crontab -e and stuff, joe has a user base in labs t..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/6497 [13:31:44] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613* [13:41:29] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [13:43:22] New review: Tim Starling; "IIRC, the values on the individual interfaces override the values for "all", but only when they are ..." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6486 [13:49:33] !log Built new wikimedia-base 1.00 package, stripped of most stuff now handled by Puppet, and inserted it into the lucid-wikimedia and precise-wikimedia APT repositories [13:49:37] Logged the message, Master [13:50:07] you're reviewing svn commits that are years old there Tim ;-) [13:51:48] New patchset: Mark Bergsma; "Remove code for unused /etc/wikimedia-cluster" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6505 [13:51:49] New patchset: Mark Bergsma; "Remove APT pinning setup; now handled by Puppet" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6506 [13:51:50] New patchset: Mark Bergsma; "Puppet manages sysctl on Lucid and higher" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6507 [13:51:50] New patchset: Mark Bergsma; "Set default editor in Puppet" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6508 [13:51:51] New patchset: Mark Bergsma; "Remove sysctl.conf" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6509 [13:51:52] New patchset: Mark Bergsma; "Manage bashrc in Puppet" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6510 [13:51:52] New patchset: Mark Bergsma; "Retab postinst" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6511 [13:51:53] New patchset: Mark Bergsma; "Remove 
outdated, unused example files" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6512 [13:51:54] New patchset: Mark Bergsma; "Remove TCP removal, retab prerm" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6513 [13:51:54] New patchset: Mark Bergsma; "No longer depend on sysstat and vim" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6514 [13:51:55] New patchset: Mark Bergsma; "Remove sysctl.conf installation in debian/rules" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6515 [13:51:56] New patchset: Mark Bergsma; "wikimedia-base (1.00) lucid-wikimedia; urgency=low" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6516 [13:52:39] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6505 [13:52:41] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6505 [13:52:53] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [13:53:11] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6506 [13:53:13] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6506 [13:53:34] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6507 [13:53:42] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6507 [13:53:44] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6507 [13:54:07] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6508 
[13:54:09] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6508 [13:54:31] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6509 [13:54:33] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6509 [13:54:53] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6510 [13:54:55] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6510 [13:55:16] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6511 [13:55:18] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6511 [13:55:36] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6512 [13:55:38] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6512 [13:56:11] PROBLEM - SSH on sq59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:51] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6513 [13:57:53] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6513 [13:58:19] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6514 [13:58:21] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6514 [13:58:41] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6515 [13:58:43] Change merged: Mark Bergsma; 
[operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6515
[13:58:53] RECOVERY - SSH on sq59 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[13:59:08] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6516
[13:59:10] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6516
[13:59:40] dearest mark
[13:59:46] what say you of this change: https://gerrit.wikimedia.org/r/#change,6392
[13:59:46] ?
[14:00:04] you have glanced at it I believe, asking if I have tested it
[14:00:06] I have responded
[14:00:19] i know that this change might need to be babysat
[14:00:23] and maybe you are not a babysitter
[14:01:53] PROBLEM - SSH on sq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:20] PROBLEM - SSH on sq60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:14] RECOVERY - SSH on sq61 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[14:03:14] PROBLEM - SSH on sq59 is CRITICAL: Server answer:
[14:03:23] PROBLEM - SSH on sq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:32] PROBLEM - SSH on sq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:41] PROBLEM - SSH on sq53 is CRITICAL: Server answer:
[14:04:35] PROBLEM - SSH on sq52 is CRITICAL: Server answer:
[14:04:53] RECOVERY - SSH on sq54 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[14:05:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625*
[14:06:33] ottomata: i'll get to it
[14:06:38] mmk!
[14:07:04] no worries mark, i know you guys are busy
[14:07:18] i just have to poke in here to make sure it happens, thank you!
[14:07:26] PROBLEM - SSH on sq61 is CRITICAL: Server answer: [14:10:35] PROBLEM - SSH on sq54 is CRITICAL: Server answer: [14:11:37] New patchset: Mark Bergsma; "Use the default Ubuntu squid packages on the install server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6519 [14:11:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6519 [14:12:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6519 [14:12:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6519 [14:13:17] PROBLEM - SSH on sq57 is CRITICAL: Server answer: [14:13:53] hello mark, can you have a look at the remote syslog server patch ? :) https://gerrit.wikimedia.org/r/5813 [14:15:50] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [14:17:29] PROBLEM - SSH on sq51 is CRITICAL: Server answer: [14:18:59] New patchset: Mark Bergsma; "Qualify global variables in base.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6521 [14:19:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6521 [14:20:10] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6521 [14:20:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6521 [14:20:47] PROBLEM - SSH on sq58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:50] PROBLEM - SSH on sq62 is CRITICAL: Server answer: [14:24:41] PROBLEM - SSH on sq56 is CRITICAL: Server answer: [14:25:35] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:28:44] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613* [14:29:57] New patchset: Mark Bergsma; "Restart varnishncsa processes on init scripts and default file changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6522 [14:30:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6522
[14:30:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6522
[14:30:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6522
[14:32:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6392
[14:32:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6392
[14:35:48] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:35:56] ottomata: puppet is applying that change across the cluster in the next half hour
[14:36:57] ohhhh boy
[14:38:08] it works totally cool on my local, and I don't think it will be a prob at all, but
[14:38:24] would appreciate it if you could run puppet manually on one of the varnish machines
[14:38:29] to make sure all is well
[14:38:29] I did, on cp1028
[14:38:33] varnishncsa is running
[14:38:36] and that's all I know :)
[14:38:36] great
[14:38:41] haha, ok...
[14:38:43] can you check
[14:38:46] oh
[14:38:49] check ps
[14:39:00] what opts is varnishncsa running with?
[14:39:15] with the -F argument
[14:39:25] cool, with Accept-Language in there?
[14:39:25] 110 2004 8.4 0.2 98384 83584 ? Ss 14:35 0:18 /usr/bin/varnishncsa -n frontend -w 208.80.154.15:8419 -m RxRequest:^(?!PURGE$) -D -P /var/run/varnishncsa/varnishncsa-multicast_relay.pid -F %l %n %t %{Varnish:time_firstbyte}x %h %{Varnish:handling}x/%s %b %m http://%{Host}i%U%q - %{Content-Type}o %{Referer}i %{X-Forwarded-For}i %{User-agent}i
[14:39:30] i'm sorry
[14:39:33] that's another ticket
[14:39:34] yeah
[14:39:36] that looks right
[14:39:38] Content-Type
[14:39:43] cool!
[14:39:49] thank you!
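(Editor's sketch, not part of the channel log.) The check done by eye at 14:39, whether a given header appears in the varnishncsa -F format, can also be done mechanically. The Python below is a minimal illustration that assumes the usual NCSA-style convention of %{Name}i for a request header and %{Name}o for a response header. `logged_headers` is a hypothetical helper, and FORMAT_FROM_LOG is copied from the ps output above.

```python
import re

# Format string as it appears in the ps output pasted into the channel.
FORMAT_FROM_LOG = (
    "%l %n %t %{Varnish:time_firstbyte}x %h %{Varnish:handling}x/%s %b %m "
    "http://%{Host}i%U%q - %{Content-Type}o %{Referer}i "
    "%{X-Forwarded-For}i %{User-agent}i"
)

# Matches %{Name}i and %{Name}o; the %{Varnish:...}x tokens end in "x"
# and are therefore skipped.
HEADER_TOKEN = re.compile(r"%\{([^}:]+)\}([io])")


def logged_headers(fmt):
    """Return (request_headers, response_headers) named in a format string."""
    request, response = [], []
    for name, kind in HEADER_TOKEN.findall(fmt):
        (request if kind == "i" else response).append(name)
    return request, response


if __name__ == "__main__":
    request, response = logged_headers(FORMAT_FROM_LOG)
    print("request headers: ", request)
    print("response headers:", response)
    print("Accept-Language logged:", "Accept-Language" in request)
```

Run against the format above, this lists Content-Type as the only response header and shows no Accept-Language among the request headers, matching the "that's another ticket" exchange.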
[14:39:59] now hashar [14:40:12] mark: danke [14:40:20] who will need to rebase his change, I can already see that ;) [14:40:26] I modified base.pp heavily today [14:40:43] ahhhhhhh [14:40:47] and that's still not valid puppet syntax, hashar [14:40:52] mark/hashar: "joe" :) [14:40:54] you can't pass arguments to classes that way [14:41:13] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [14:41:34] hashar: suggested to add those tool/editor packages for labs in one class when i saw Mark adding defaults for production [14:43:02] maybe there should be a "realm-check" in that generic class in base? if labs, then joe is ok [14:43:22] maybe, maybe not [14:43:28] that change was already handled by wikimedia-base before [14:43:31] which is also installed in labs [14:43:37] however, puppet enforces it now on every run ;) [14:44:27] so yeah, perhaps a $::realm check is reasonable [14:44:49] mark: do you have any doc about passing parameter to classes ? [14:44:56] hashar: it works like this: [14:44:59] the alternative seemed like base::labs-standard-packages [14:45:08] class { "classname": param1 => value1, param2 => value2 } [14:47:10] oh yeah, so $::realm == "labs" = good. but we still have $realm == "labs" and ( $realm == "labs" ) in other places [14:47:45] yes [14:47:49] notpeter: search 1-12....i want to confirm it's ok to shut them down [14:47:50] unfortunately the world is not a perfect place [14:48:15] search1-12 are going away?
:) [14:50:37] mark: yes...i have the boxes here and ready to swap out the old ones [14:50:45] yay [14:54:15] AZEHAZR stupid git-review did a rebase again [14:54:27] oh on a merged change [14:54:30] * hashar is luck [14:54:31] y [14:55:46] mark: class {} looks uglier than include : -D [14:55:47] https://gerrit.wikimedia.org/r/#patch,unified,5813,5,manifests/site.pp [14:57:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [14:58:08] this whole change is ugly [14:59:24] mark: I could make base::remote-syslog to use $::syslog_server as a default [14:59:30] thus we can stick to include syntax [15:00:00] hashar: i was going to change the mentioned base::syslogs class anyways [15:00:40] yeah but I would like to get rid of those global variables too [15:00:43] let me think about this a little bit [15:03:33] each time I touch a class it needs some more refactoring :-] [15:03:59] <^demon> Story of our lives. [15:04:00] <^demon> :) [15:05:17] !log powercycling srv266 [15:05:21] Logged the message, Master [15:07:42] hashar: it's only getting uglier I think [15:07:45] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [15:07:54] RECOVERY - Memcached on srv266 is OK: TCP OK - 0.004 second response time on port 11000 [15:07:56] just go with your original change (more or less), put a simple $::realm check in base::remote-syslog [15:08:02] we'll fix it in a nicer way later [15:08:13] * hashar digs in patchsets [15:09:03] no global variable yet, the rest of that class has realm specific ugly things anyway [15:10:45] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [15:12:05] cmjohnson1: yep! 
go for it [15:12:12] I'm going to log it [15:12:50] !log chris is taking down search1-12 to replace with new search nodes [15:12:52] Logged the message, notpeter [15:13:38] q for you guys [15:13:42] i'm about to commit something to gerrit [15:13:56] but part of the change is a patch to the squid frontend.conf.php script [15:13:59] which is not in puppet [15:14:04] how should I submit that bit? [15:14:12] notpeter: thx [15:14:14] it needs to go out as part of the git commit [15:14:32] put it in your comments, set your change to -1 or something [15:14:52] put the patch in the commetns? [15:14:55] no [15:14:59] provide that elsewhere [15:15:00] where should I put the patch? [15:15:07] on fenari or so [15:15:13] hm, ok [15:15:16] my home there is ok? [15:15:18] yeah [15:15:21] k [15:15:24] danke [15:17:30] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [15:19:06] mark: and here is patch set 6 : https://gerrit.wikimedia.org/r/#patch,unified,5813,6,manifests/base.pp [15:19:54] notpeter: is searchidx1 staying? [15:20:29] you need to rebase your change [15:20:36] you're not operating on a current repo :) [15:21:00] cmjohnson1: hhhmmm, searchidx1 probably needs to be decomissioned, tbh [15:21:05] but I'd check with RobH about that [15:21:10] yes [15:21:12] decommission it. 
[15:21:24] PROBLEM - Host search1 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:58] mark: it is based on latest version of `test` branch [15:22:13] oh damn this is test [15:22:16] yeah do whatever in test [15:22:21] :-] [15:22:26] test is so far apart from production that I've stopped caring ;) [15:22:42] we will have to have someone to merges it [15:22:45] PROBLEM - Host search2 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:46] it needs a merge from hell anyway [15:22:50] err to merge from prod to test [15:23:08] because we are setting up a test cluster on labs and will most probably requires changes from production [15:23:10] merged [15:23:22] labs is moving towards a system with one branch per project [15:23:39] PROBLEM - Host search3 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:42] that will be much easier [15:23:50] and then you can commit your own changes without going through gerrit [15:24:04] that will be great [15:24:15] PROBLEM - Host search4 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:25] then I guess it will be the responsibility of each project owner to merge from production ? [15:24:36] notpeter: did you put these search hosts in decommissioned_servers? [15:24:41] hashar: yes [15:25:19] mark: not yet [15:25:28] notpeter: ideally that's done while they still run :) [15:25:33] ah, yes [15:25:37] too late now [15:25:58] but i do need to turn off nagios... [15:26:39] no [15:26:48] that's done by putting them in decommissioned_servers [15:27:46] New patchset: Ottomata; "Log Format changes for Rt 2674 and RT 2745." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6526 [15:28:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6526 [15:28:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [15:31:09] PROBLEM - Puppet freshness on db30 is CRITICAL: Puppet has not run in the last 10 hours [15:31:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3511 [15:31:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3511 [15:32:12] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [15:33:42] New patchset: Pyoungmeister; "decommisioning search1-12" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6527 [15:33:42] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [15:34:00] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2283 [15:34:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6527 [15:34:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6527 [15:34:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6527 [15:34:29] hey mark, looking at https://gerrit.wikimedia.org/r/#change,6500 - why do the sed thing instead of pulling the entire .bashrc into puppet? [15:37:26] because I don't want to update it between ubuntu releases [15:39:01] arg. 
I can't run commands on neon :/ [15:39:07] we could look at deploying a custom .bashrc as well, but I didn't want to investigate & deal with that now, so I just moved to puppet the way it's always been done in wikimedia-base [15:39:34] I've been wanting to add HISTTIMEFORMAT='%F %T ' to root's .bashrc but haven't gotten around to it since it wasn't managed. [15:39:50] <^demon> mark: Deploying a standard .bashrc would be nice. I hate ending up on $randomserver and my bash is all monochrome and useless. [15:40:05] I want it monochrome :P [15:40:09] I hate the colored ones [15:40:18] notpeter: do you know what's up with the search flapping? [15:40:32] maplebed: yes. chris is decomissioning the old search nodes [15:40:47] and I odn't have perms to turn off checks in icinga [15:40:49] but feel free to deploy a .bashrc in puppet, as long as it's done in a careful way which works well for everyone and on all 3-4 ubuntu releases we run now [15:41:41] so all users need an likescolor true/false flag in their account?:) [15:41:55] well this is only root's .bashrc, which is shared [15:42:01] what people do in their own accounts I don't care about [15:42:12] then again, it's not on NFS so puppet deploys it from skel [15:42:16] isn' thtere a bashrc skeleton thing [15:42:18] that you can provide? [15:42:28] that will end up being in new user's .bashrc file? [15:42:32] yes [15:42:37] oh ok [15:42:39] sorry, just read that last bit [15:42:41] bout root [15:43:01] cmjohnson1: can you stop pulling servers until i figure out how to disable this fucking monitoring? 
[15:43:39] PROBLEM - Host search6 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:39] PROBLEM - Host search8 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:39] PROBLEM - Host search7 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:09] spence will remove monitoring for servers in decommissioned_servers list [15:44:11] neon doesn't afaik [15:44:22] because it's naggen [15:44:24] mark: it's the lvs checks, not the host checks that are paging :/ [15:44:31] and i don't want to remove those [15:44:37] but there are no hosts behind them now [15:44:41] heh [15:44:42] so they're failing [15:44:55] with nagios, I'd just disable through web interface [15:45:01] notpeter: if you want to turn of notifications via shell, this should still work on icinga afaik: http://wikitech.wikimedia.org/view/Nagios#Scheduling_downtimes_with_a_shell_command [15:45:03] but I don't have perms on icinga [15:45:13] ah, thank you! [15:45:24] New review: Ottomata; "I'm -1ing this for now so that it doesn't get accidentally approved (HA like that would ever happen)..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/6526 [15:45:48] notpeter: but the timestamp in there is UNIX timestamp, and need to convert it to something in the future [15:46:21] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2313 [15:46:21] gotcha [15:47:48] notpeter: okay [15:47:51] I didn't get any search pages [15:47:55] mark: forgot to avoid syslog looping :-D https://gerrit.wikimedia.org/r/6528 [15:48:52] mutante: what do all of those fields mean? 
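The shell-command approach linked above writes one line to the Nagios/Icinga external command file. A minimal sketch of building that line, assuming the standard Nagios field order (host, start, end, fixed, trigger_id, duration, author, comment); the `schedule_host_downtime` helper and the example host, author and comment values are illustrative, and the command file location is site-specific, so the line is only printed here:

```python
import time


def schedule_host_downtime(host, duration_s, author, comment, now=None):
    """Build a SCHEDULE_HOST_DOWNTIME external command line.

    Hypothetical helper. Timestamps are UNIX epoch seconds, as noted in
    the channel; the downtime starts now and ends duration_s later.
    """
    now = int(time.time()) if now is None else int(now)
    start, end = now, now + duration_s
    fixed = 1        # fixed window: downtime runs exactly from start to end
    trigger_id = 0   # not triggered by another downtime entry
    fields = [
        "SCHEDULE_HOST_DOWNTIME", host, start, end,
        fixed, trigger_id, duration_s, author, comment,
    ]
    # External commands are prefixed with the submission time in brackets.
    return "[{}] {}".format(now, ";".join(str(f) for f in fields))


if __name__ == "__main__":
    # One hour of downtime for an illustrative host, starting now.
    print(schedule_host_downtime("search1", 3600, "notpeter", "decommissioning"))
```

With a fixed `now=1336060000` the helper returns `[1336060000] SCHEDULE_HOST_DOWNTIME;search1;1336060000;1336063600;1;0;3600;notpeter;decommissioning`. Appending that line to the daemon's command file is what actually schedules the downtime.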
[15:49:03] PROBLEM - Host search5 is DOWN: PING CRITICAL - Packet loss = 100% [15:49:38] notpeter: SCHEDULE_HOST_DOWNTIME;<host_name>;<start_time>;<end_time>;<fixed>;<trigger_id>;<duration>;<author>;<comment> [15:50:06] gotcha [15:50:16] set fixed to 1 and trigger_id to 0 [15:50:42] i set start_time to current unix timestamp by using [15:50:51] $(date +%s) [15:50:55] yeah [15:54:45] ok, now to find the equiv of nagios.cmd for icinga :/ [15:54:45] see note below: If the "fixed" argument is set to one (1), downtime will start and end at the times specified by the "start" and "end" arguments. Otherwise, downtime will begin between the "start" and "end" times and last for "duration" seconds. [15:55:03] notpeter: replace string nagios with string icinga in paths and file names [15:55:56] there's nothing in /var/lib/icinga/rw [15:55:58] what [15:55:59] should be icinga.cmd [15:56:00] the fuck [15:56:04] there's not :/ [15:57:02] am i right you want "locate" right now? [15:57:11] command_file=/var/lib/nagios/rw/nagios.cmd [15:57:14] it should be off now [15:57:15] wooo [15:57:21] ah [15:57:33] hurray for nothing fucking being in any sensible places [15:57:43] and for web uis not working [15:57:47] and for no documentation [15:57:50] *sigh* [15:57:59] i have disabled ALL notifications from icinga [15:58:01] until now it seemed like they s/nagios/icinga everywhere else, nod [15:58:15] and: root@neon:~# chown www-data /var/lib/nagios/rw/nagios.cmd [15:58:25] binasher: you can ah [15:58:27] that's good [15:58:42] mark: I was also missing some passwords in the private repo :D That prevents labs from using misc::scripts https://gerrit.wikimedia.org/r/5792 [15:58:50] mark: I have set dummy passwords for now [16:00:27] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 182 seconds [16:00:54] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 189 seconds [16:02:29] cmjohnson1: ok! please continue pulling boxes! [16:03:40] daughter duty.
Will be back [16:03:54] notpeter: sure thing...you may also want to remove searchidx1 from nagios [16:06:43] cmjohnson1: already done. thanks! [16:22:30] PROBLEM - Host search9 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:51] PROBLEM - Host search10 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:00] PROBLEM - Host search11 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:00] PROBLEM - Host search12 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:03] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 29 seconds [16:28:12] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 28 seconds [16:39:55] New patchset: Jgreen; "revoking accounts on fundraising boxes for users who no longer do fundraising work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6530 [16:40:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6530 [16:43:49] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6530 [16:43:52] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6530 [16:56:37] PROBLEM - Apache HTTP on mw59 is CRITICAL: Connection refused [17:03:21] !log mwm59 out of apache pool. using it for some testing [17:03:24] Logged the message, notpeter [17:04:52] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:09:04] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:12:13] LeslieCarr: https://gerrit.wikimedia.org/r/6533 [17:13:14] New review: Lcarr; "lint checker borked. 
submitting for preilly" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6533 [17:13:16] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6533 [17:16:18] !log reloaded and purged varnish cache for mobile in eqiad [17:16:21] Logged the message, Mistress of the network gear. [17:17:19] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [17:18:34] !log removing mw58 from pool for more testin' [17:18:37] Logged the message, notpeter [17:20:37] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.064 second response time [17:21:31] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [17:22:55] RoanKattouw: quick question [17:23:09] do you know if our apaches currently use /a for *anything* [17:23:11] ? [17:24:25] I don't know [17:24:39] ok [17:24:43] could one of the ops have a look at this DNS entry request please, https://rt.wikimedia.org/Ticket/Display.html?id=2891 [17:25:17] It doesn't look like it's used [17:25:31] RoanKattouw: ok, I didn't think so, but I like to ask first, just in case [17:27:25] notpeter: /a is empty on every host I've checked (granted, that was only 3) [17:27:41] maplebed: yeah, has been the case for my checks as well [17:27:43] thanks! [17:27:50] and now, off to the metrics meeting! [17:29:02] hi :) can someone please approve a class placeholder in private repo ? https://gerrit.wikimedia.org/r/#change,5792 [17:30:13] Ryan_Lane: hmm, I typed "git pull" in my git bash but I don't see any output - does that mean it's downloading something? [17:30:34] on an instance? 
[17:30:35] it is blocking the beta labs project since that is required to install misc::scripts :/ [17:30:38] on my local computer [17:31:31] ah there we go - it worked [17:31:39] can you have a look at https://rt.wikimedia.org/Ticket/Display.html?id=2891 if you get a minute please Ryan_Lane? [17:32:55] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:34:04] !log shutting down db1029 for ssd card testing removal per rt 2766 [17:34:07] Logged the message, RobH [17:36:49] PROBLEM - Host db1029 is DOWN: PING CRITICAL - Packet loss = 100% [17:39:58] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [17:40:52] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused [17:42:13] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [17:43:54] !log returning mw58 to pool [17:43:57] Logged the message, notpeter [17:44:19] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [17:44:39] !log db1029 ssd test items removed, can go back to normal service via asher [17:44:42] Logged the message, RobH [17:47:37] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 191 seconds [17:48:23] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 194 seconds [17:52:41] robh: power load on ps1-b5 is no longer balanced...the z phase has increased it's consumption in the last 2 days by 20% [17:52:52] do you know if anything different is going on? 
[17:56:38] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [17:57:32] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 183 seconds [17:57:50] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 186 seconds [18:00:50] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [18:16:44] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/5810 [18:24:34] we're losing sq59, s60, sq61, sq62 at pmtpa [18:24:47] ssh is not working and I'm currently trying to login via mgmt to sq61 [18:25:05] paravoid: what do you mean losing? [18:25:08] I wonder why the bot hasn't notified yet [18:25:12] they are dead? [18:25:19] load is spiking and SSH immediately closes [18:25:28] root login has hanged, probably due to load [18:25:38] eep [18:25:39] (still waiting) [18:26:45] on sq59's console too, same [18:26:52] trying to login to sq60's console [18:27:07] okay, I'm on 59's and 61's [18:27:20] text text squids [18:28:51] does anyone know how to send "break" via the DRAC? 
[18:28:55] for some reason sq65 (working text squid) is trying to talk to some decommissioned text squids… probably unrelated [18:29:20] (serial break is equivalent to magic sysrq) [18:32:49] that's funny because those squids are barely used [18:32:51] only by esams [18:33:50] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [18:35:11] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2388 [18:37:17] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 182 seconds [18:38:02] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 186 seconds [18:38:25] mark: where's the slow query log for the production databases? [18:38:38] I dunno? [18:38:44] nonexistent? [18:38:44] me either [18:38:47] heh [18:38:54] I'm still waiting for a shell prompt, I don't think I'll get one anytime soon [18:39:01] ah [18:39:03] it's on the system [18:39:12] paravoid: which one? 
[18:39:24] none of the two that I've tried to [18:39:31] sq59 & sq61 [18:39:40] they are at /a/sqldata/-slow.log [18:39:53] it would be awesome to have those syslog'd to a central spot [18:41:22] argh, same with 8 upload squids [18:41:35] sq51-sq58 [18:43:09] !log powercycling sq59; inaccessible via either SSH or serial due to load [18:43:12] Logged the message, Master [18:44:47] PROBLEM - Host sq59 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:05] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [18:45:23] RECOVERY - SSH on sq59 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:45:33] RECOVERY - Host sq59 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:45:33] PROBLEM - Host wiktionary-lb.pmtpa.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [18:45:33] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [18:45:40] uh oh [18:46:26] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:46:27] yeah... [18:47:15] Is the world falling? [18:47:58] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=lvs6.wikimedia.org&m=network_report&r=hour&s=by%20name&hc=4&mc=2 [18:47:59] we've lost 8 upload squids (sq51-sq58) + 4 text squids (sq59-sq62) at pmtpa [18:48:08] but I can't see how this corellates with the other alerts [18:48:24] we should really split the lvs servers into their own ganglia group [18:48:34] lumping them in with misc servers makes it hard to see if something is up with them [18:48:59] PROBLEM - Frontend Squid HTTP on sq59 is CRITICAL: Connection refused [18:49:37] why is the load spiking on lvs6? 
[18:49:53] PROBLEM - Backend Squid HTTP on sq59 is CRITICAL: Connection refused [18:50:20] RECOVERY - Frontend Squid HTTP on sq59 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.009 seconds [18:51:05] RECOVERY - Host wiktionary-lb.pmtpa.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [18:51:06] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [18:51:14] RECOVERY - Backend Squid HTTP on sq59 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.007 seconds [18:52:16] and it's now going back to normal: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=lvs6.wikimedia.org&m=network_report&r=hour&s=by%20name&hc=4&mc=2 [18:52:32] so, I'm looking at sq59's atop [18:53:21] oom and swapping before it died [18:54:21] (sq60) Uptime 209 days, 21:31:29 [18:54:26] ARGH @#$$^@^$!@%^%!@%^!%$! [18:55:03] Heh... [18:55:09] same with sq61 [18:55:22] oh, i'm guessing sq60 is the same .... [18:55:37] it is, see above [18:55:58] lvs6 is up 8 days [18:56:05] and all of sq51-58 are [18:56:13] so, we have an answer for the squids [18:57:03] could the squids outage affect lvs6? [18:57:24] paravoid: are you on the other channel? [18:57:50] yes it could [18:57:59] which one? 
[18:59:42] PROBLEM - Host sq61 is DOWN: PING CRITICAL - Packet loss = 100% [19:00:15] !log powercycling all of sq51-sq62, hanged due to 209 days uptime [19:00:18] Logged the message, Master [19:01:03] RECOVERY - SSH on sq61 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:01:12] RECOVERY - Host sq61 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [19:03:36] PROBLEM - Host sq51 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:36] PROBLEM - Host sq52 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:36] PROBLEM - Host sq53 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:54] RECOVERY - SSH on sq51 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:04:03] RECOVERY - Host sq51 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [19:04:12] PROBLEM - Backend Squid HTTP on sq61 is CRITICAL: Connection refused [19:04:12] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [19:04:39] PROBLEM - Frontend Squid HTTP on sq61 is CRITICAL: Connection refused [19:05:24] RECOVERY - SSH on sq53 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:05:24] RECOVERY - SSH on sq52 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:05:24] PROBLEM - Host sq54 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:33] RECOVERY - Host sq53 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:05:33] RECOVERY - Host sq52 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:06:36] RECOVERY - SSH on sq55 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:06:36] RECOVERY - SSH on sq54 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:06:45] RECOVERY - Host sq54 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [19:07:12] PROBLEM - Backend Squid HTTP on sq51 is CRITICAL: Connection refused [19:07:30] PROBLEM - Frontend Squid HTTP on sq51 is CRITICAL: Connection refused [19:07:57] PROBLEM - Host sq57 is DOWN: PING CRITICAL - Packet loss = 100% [19:08:06] RECOVERY - SSH on sq56 is 
OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:08:24] RECOVERY - SSH on sq57 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:08:33] PROBLEM - Backend Squid HTTP on sq53 is CRITICAL: Connection refused [19:08:33] RECOVERY - Host sq57 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:08:42] PROBLEM - Backend Squid HTTP on sq52 is CRITICAL: Connection refused [19:08:51] PROBLEM - Frontend Squid HTTP on sq53 is CRITICAL: Connection refused [19:08:51] PROBLEM - Frontend Squid HTTP on sq52 is CRITICAL: Connection refused [19:09:18] PROBLEM - Host sq58 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:03] PROBLEM - Backend Squid HTTP on sq56 is CRITICAL: Connection refused [19:10:03] PROBLEM - Backend Squid HTTP on sq54 is CRITICAL: Connection refused [19:10:03] PROBLEM - Backend Squid HTTP on sq55 is CRITICAL: Connection refused [19:10:21] PROBLEM - Frontend Squid HTTP on sq54 is CRITICAL: Connection refused [19:10:21] PROBLEM - Frontend Squid HTTP on sq55 is CRITICAL: Connection refused [19:10:21] PROBLEM - Frontend Squid HTTP on sq56 is CRITICAL: Connection refused [19:10:21] RECOVERY - SSH on sq58 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:10:30] RECOVERY - Host sq58 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [19:11:33] PROBLEM - Frontend Squid HTTP on sq57 is CRITICAL: Connection refused [19:12:18] RECOVERY - SSH on sq62 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:12:18] RECOVERY - SSH on sq60 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:12:46] PROBLEM - Backend Squid HTTP on sq57 is CRITICAL: Connection refused [19:14:15] PROBLEM - Backend Squid HTTP on sq58 is CRITICAL: Connection refused [19:14:24] PROBLEM - Backend Squid HTTP on sq60 is CRITICAL: Connection refused [19:14:33] PROBLEM - Frontend Squid HTTP on sq60 is CRITICAL: Connection refused [19:14:33] PROBLEM - Frontend Squid HTTP on sq58 is CRITICAL: Connection refused [19:15:54] PROBLEM - Backend Squid 
HTTP on sq62 is CRITICAL: Connection refused [19:16:03] PROBLEM - Frontend Squid HTTP on sq62 is CRITICAL: Connection refused [19:18:45] RECOVERY - Frontend Squid HTTP on sq54 is OK: HTTP OK HTTP/1.0 200 OK - 607 bytes in 0.010 seconds [19:19:57] RECOVERY - Backend Squid HTTP on sq54 is OK: HTTP OK HTTP/1.0 200 OK - 467 bytes in 0.003 seconds [19:19:57] RECOVERY - Backend Squid HTTP on sq56 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.003 seconds [19:20:24] RECOVERY - Frontend Squid HTTP on sq56 is OK: HTTP OK HTTP/1.0 200 OK - 600 bytes in 0.008 seconds [19:21:18] RECOVERY - Backend Squid HTTP on sq60 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.004 seconds [19:21:45] RECOVERY - Frontend Squid HTTP on sq60 is OK: HTTP OK HTTP/1.0 200 OK - 27542 bytes in 0.015 seconds [19:24:18] RECOVERY - Backend Squid HTTP on sq62 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.011 seconds [19:24:36] RECOVERY - Frontend Squid HTTP on sq62 is OK: HTTP OK HTTP/1.0 200 OK - 27545 bytes in 0.013 seconds [19:24:36] RECOVERY - Frontend Squid HTTP on sq52 is OK: HTTP OK HTTP/1.0 200 OK - 601 bytes in 0.005 seconds [19:25:39] RECOVERY - Backend Squid HTTP on sq52 is OK: HTTP OK HTTP/1.0 200 OK - 467 bytes in 0.004 seconds [19:27:00] RECOVERY - Backend Squid HTTP on sq53 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.004 seconds [19:27:45] RECOVERY - Frontend Squid HTTP on sq53 is OK: HTTP OK HTTP/1.0 200 OK - 601 bytes in 0.003 seconds [19:28:39] RECOVERY - Frontend Squid HTTP on sq58 is OK: HTTP OK HTTP/1.0 200 OK - 601 bytes in 0.005 seconds [19:28:39] RECOVERY - Frontend Squid HTTP on sq55 is OK: HTTP OK HTTP/1.0 200 OK - 601 bytes in 0.008 seconds [19:29:06] RECOVERY - Backend Squid HTTP on sq55 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.002 seconds [19:29:42] RECOVERY - Backend Squid HTTP on sq58 is OK: HTTP OK HTTP/1.0 200 OK - 467 bytes in 0.005 seconds [19:29:42] RECOVERY - Backend Squid HTTP on sq61 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.004 seconds 
[19:30:09] RECOVERY - Backend Squid HTTP on sq51 is OK: HTTP OK HTTP/1.0 200 OK - 467 bytes in 0.001 seconds [19:30:09] RECOVERY - Frontend Squid HTTP on sq61 is OK: HTTP OK HTTP/1.0 200 OK - 27544 bytes in 0.016 seconds [19:30:18] RECOVERY - Frontend Squid HTTP on sq51 is OK: HTTP OK HTTP/1.0 200 OK - 601 bytes in 0.010 seconds [19:32:15] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [19:32:15] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [19:35:50] hi all [19:35:57] i'm about to work on some labs test puppet stuff [19:36:05] but I would like some of my recent commits to the production branch to be in the test branch [19:36:10] so that it is available in labs [19:36:15] what's the best way to go about that? [19:36:17] cherry pick? [19:36:54] RECOVERY - Frontend Squid HTTP on sq57 is OK: HTTP OK HTTP/1.0 200 OK - 602 bytes in 0.004 seconds [19:36:54] RECOVERY - Backend Squid HTTP on sq57 is OK: HTTP OK HTTP/1.0 200 OK - 467 bytes in 0.005 seconds [19:36:57] unfortunately yes [19:37:06] the production vs test branches are in a very sad state [19:37:17] and merges are impossible [19:37:21] yeah [19:37:26] ok [19:37:30] so cherry picking right now is the best way until we move to a new model with a branch per labs project [19:37:47] mmk, thanks [19:45:24] !log starting script to move /usr/local/apache to /a partition on all non-imagescaler, non-jobrunner apaches [19:45:26] Logged the message, notpeter [19:50:03] !log restarted profiling collector post parser.php livehack and stats.db removal [19:50:06] Logged the message, Master [19:53:39] New patchset: Demon; "Moving gitweb config to its own class, adding blame support (bug 36234)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5810 [19:53:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5810 [19:58:21] hm, another puppet test branch question [19:58:33] i want to set up some minimal configs on my labs instance [19:58:39] but these configs will never be used in production [19:58:55] i want to set up nginx, varnish + ncsa, squid, udp2log, etc. [19:59:01] but I want to do it just for testing changes [19:59:07] much like I've done on my local VM now [19:59:14] but more officially in labs [19:59:18] i could skip puppet altogether [19:59:25] but i thought it'd be nice to have the labs instance puppetized [19:59:39] i think I need to make new classes in order to install this stuff without all the production files [19:59:55] For example, I just need squid doing a simple reverse proxy to localhost [20:00:11] there is no way for me to just install squid and set up my custom conf file [20:00:18] there's no squid base class, as far as I can tell [20:01:54] so I guess i'll create a class for my machine? [20:01:56] which is kinda dumb [20:02:00] or a class for logging_tests [20:02:04] ergh [20:02:19] or shoudl I even bother puppetizing this? [20:10:21] ohhhh silent nighhhhhhhhhht [20:10:25] sillllilent ops room [20:10:42] how about: [20:10:44] Ryan_Lane :) [20:10:56] oh maybe I should ask in #labs [20:10:57] oops [20:43:59] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [20:44:08] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [20:47:32] !log removing srv219 from pybal pool for repartitioning [20:47:35] Logged the message, notpeter [20:53:44] PROBLEM - Apache HTTP on srv219 is CRITICAL: Connection refused [20:55:21] RobH: hey are you in dc today ? [20:59:08] preilly: https://gerrit.wikimedia.org/r/#change,6556 [20:59:23] New review: Lcarr; "lint checker dead. 
reverting for preilly" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6556 [20:59:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6556 [21:06:05] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5810 [21:06:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5810 [21:11:55] lesliecarr: working on rt2886 for you...quick observation....the uplink for asw-a4 is going to mrjp-a2 (23) which is on a line card, csw1-sdtpa [21:12:06] line card 3 [21:12:10] cool [21:12:10] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [21:12:52] there is no more room on that line card to add an additional cable [21:13:43] any other nearby emptyish panels ? [21:15:35] that's the rack with nagios and fenari and a nfs master … been causing some of the nasty page storms :( [21:16:09] i have a few open spots on the panel in a3 ...which is LC 2 [21:16:16] cool [21:16:26] port1 [21:16:34] oh LC 2 is almost completely empty :) [21:16:56] the 2nd port, right? i have in 2/1 msw1-sdtpa:0 [21:17:31] that will work [21:19:53] !log putting srv219 back into pybal pool [21:19:55] Logged the message, notpeter [21:23:05] lesliecarr: you want the cable to go to port 2 on mrjp-a3? [21:24:20] yes please [21:24:48] which port do you want to originate on asw-a4? 45 or 46 [21:24:59] 46 ? [21:28:37] Lesliecarr: done [21:28:58] thanks cmjohnson1 i see it up :) [21:29:52] !log switching asw-a4-sdtpa from single uplink to lag [21:29:55] Logged the message, Mistress of the network gear.
[21:30:08] !log removing srv220 from pybal pool for repartitioning [21:30:11] Logged the message, notpeter [21:31:13] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:31:58] there is a danger of some frozen fenari sessions and incorrect pages [21:32:35] thanks, nagios! [21:34:13] PROBLEM - Host owa2 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:28] PROBLEM - Host owa3 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:28] PROBLEM - Host ocg3 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:28] PROBLEM - Host ocg1 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:28] PROBLEM - Host payments.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:34:31] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:34:40] PROBLEM - Host wikimedia-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:34:49] PROBLEM - Apache HTTP on srv220 is CRITICAL: Connection refused [21:34:58] PROBLEM - Host sq74 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:58] PROBLEM - Host sq71 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:58] PROBLEM - Host sq77 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:58] PROBLEM - Host sq72 is DOWN: PING CRITICAL - Packet loss = 100% [21:34:58] PROBLEM - Host sq79 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:07] PROBLEM - Host sq83 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:07] PROBLEM - Host sq85 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:07] PROBLEM - Host lvs5 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:07] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:12] uh [21:35:15] Leslie-Carr is doing something to cause and fix ^^^^ [21:35:16] PROBLEM - Host wikinews-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:17] PROBLEM - Host lvs6 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:17] PROBLEM - Host 
mediawiki-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:17] PROBLEM - Host cr2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [21:35:25] PROBLEM - Host wikisource-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:25] PROBLEM - Host wikipedia-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:26] PROBLEM - Host bits.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:27] PROBLEM - Host ssl4 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:27] PROBLEM - Host wikiquote-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:27] PROBLEM - Host wikibooks-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:27] PROBLEM - Host cr1-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [21:35:39] ok [21:35:43] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [21:35:43] PROBLEM - Host yvon is DOWN: PING CRITICAL - Packet loss = 100% [21:35:44] PROBLEM - Host lvs2 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:44] PROBLEM - Host upload.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:50] in network admins we trust [21:35:52] PROBLEM - Host ssl1 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:52] PROBLEM - Host wiktionary-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:53] PROBLEM - Host lvs3 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:53] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:35:53] RECOVERY - Host sq79 is UP: PING WARNING - Packet loss = 93%, RTA = 0.42 ms [21:35:53] PROBLEM - BGP status on cr2-eqiad is CRITICAL: (Service Check Timed Out) [21:36:08] PROBLEM - Host transcode1 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:08] PROBLEM - Host payments3 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:08] PROBLEM - Host payments4 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:39] 
^demon: it looks like dns is affected... soooo... not much [21:44:41] development on your local laptop? [21:44:43] <^demon> Oh yeah, I can commit without pushing :) [21:44:43] <^demon> #gloriousgitfuture [21:44:44] :-D [21:44:45] I was waiting for the git plug [21:44:46] i'm just surprised the office network is up [21:44:46] it always comes...it always comes [21:44:46] shh don't jinx it [21:45:42] well, I think this calls for pokemon. [21:46:56] <^demon> notpeter: I just recently started re-playing the original. True story. [21:47:08] ^demon: I'm some number of hours into diamond right now [21:47:34] <^demon> I'm on the second gym in Blue. [21:47:36] <^demon> :p [21:47:40] nice [21:47:58] I think i've finished two gyms [21:48:16] yeah, because the second was lulztastically easy, as my starting pokemon was fire, and it was the all grass gym... [21:54:12] wow. that was so many pages. [21:54:17] literally all of them [21:54:35] heh [21:54:39] yeah [21:54:39] :( [21:54:40] sorry [21:54:43] http://ganglia.wikimedia.org/latest/?c=Bits%20caches%20esams&h=cp3001.esams.wikimedia.org&m=load_one&r=20min&s=by%20name&hc=4&mc=2 [21:54:46] eh, happens [21:54:58] what's up with that? [21:55:06] shit, is that the varnish bug ? [21:55:24] 3002 is ok though [21:55:26] when the backends drop off it forkbombs itself ?
[21:55:40] let me take a look (don't restart) [21:59:08] so many threads gdb is struggling, heh [21:59:30] LeslieCarr: That is BY FAR the best description I've heard [21:59:32] I'm gonna use that [22:00:53] bah, it was restarted before gdb became responsive [23:01:41] New patchset: Lcarr; "switched the folder as well as the argument reading" [operations/software] (master) - https://gerrit.wikimedia.org/r/6565 [23:01:55] New review: Lcarr; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6565 [23:01:57] Change merged: Lcarr; [operations/software] (master) - https://gerrit.wikimedia.org/r/6565 [23:16:05] !log restarting networking on sq51 [23:16:08] Logged the message, Mistress of the network gear. [23:20:36] * AaronSchulz checks if the site is still up :) [23:21:56] !log restarting networking on sq52 [23:21:58] Logged the message, Mistress of the network gear. [23:22:02] what's going on? [23:22:49] maplebed: What does A&O stand for? [23:23:04] achievements (from the last week) and objectives (for the next) [23:23:26] also, thanks! you're the first person to ask. [23:23:33] Aaah OK [23:23:44] :p [23:23:57] Aaah & OK? [23:24:15] lol [23:24:41] !log restarting networking on sq53 [23:24:44] Logged the message, Mistress of the network gear. [23:25:07] RoanKattouw_away: if you're so inclined, I heartily recommend the practice. [23:25:35] it's amazing how much additional sanity a mere 15 minutes once/week looking at what's happened and what's going to happen can give. [23:25:49] We have been doing this in the features team, more or less [23:25:50] plus when it comes time to look back over what's happened during the past year, hey! you have notes! [23:25:58] I don't actually use it as much as I should [23:26:29] so what is it you'd say you do around here ? .... [23:26:33] it's just a tool. when used consistently, it helps you the most, with the side benefit of also helping others around you.
[23:27:07] LeslieCarr: I take the bits from the swift and hand them to the squid. [23:27:24] :) [23:27:33] !log restarting networking on sq54 [23:27:35] Logged the message, Mistress of the network gear. [23:29:11] !log restarting networking on sq55 [23:29:14] Logged the message, Mistress of the network gear. [23:33:32] notpeter: Wait, how are you moving /u/l/apache onto /a exactly? Are you symlinking it from the old location? [23:33:38] * RoanKattouw_away hopes things will not break [23:36:26] dreamer [23:42:51] -ETOOMANYOUTAGES [23:47:43] it's all the video's fault [23:47:51] victor has convinced us it's too cool to have outages [23:49:35] notpeter: It looks like your switcharoo script only succeeded on mw60, and failed on all the others [23:52:00] was there a special port to get to the cameras on ? [23:52:04] (trying to switch their ip's) [23:58:18] New patchset: Demon; "(bug 35538) Actions from L10n-bot should not spam IRC." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6578 [23:58:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6578