[00:17:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:24:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [00:24:17] o.0 [00:57:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:04:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.997 seconds [01:38:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.388 seconds [01:45:42] Ryan_Lane: do it, do it, stylize! [01:45:57] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 273 seconds [01:48:28] we should have scap distribute the data recursively [01:49:40] if we do it right then we should be able to send data at 500 Mbps or so to every apache without saturating any links [01:50:12] or maybe we should use multicast... [01:50:39] * AaronSchulz wanders aimlessly in concurrency land [01:50:46] you know I've read about reliable multicast protocols a few times in this context [01:51:30] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 608s [01:54:13] pity Leslie and Roan are both gone [01:54:18] I should get up earlier [01:54:21] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [01:54:30] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 9 seconds [01:55:59] I'll watch the network graph in ganglia [01:57:31] * AaronSchulz saturates brion's cr time [01:57:51] Not a directory: /home/wikipedia/common/php-php-1.20wmf1 [01:57:51] Found syntax errors in php-1.20wmf1, cannot sync. [01:58:11] :) [01:58:19] let me check something [01:58:32] is the syntax error "Not a directory"? 
[01:59:07] $ mwversionsinuse [01:59:07] php-1.20wmf1 php-1.20wmf2 [02:00:37] AaronSchulz: are you checking something relevant to scap, or relevant to concurrency land? [02:01:42] scap [02:02:45] did you change this code recently? it doesn't look familiar [02:03:18] maybe I never reviewed it [02:05:05] * AaronSchulz is done [02:06:31] thanks [02:12:21] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [02:17:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:10] !log aborted scap and re-ran with fanout=5 instead of 30, since nfs1 CPU was maxed out [02:21:13] Logged the message, Master [02:23:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.724 seconds [02:26:58] still broken [02:27:25] every apache gives "Unable to read wikiversions.dat or it is empty" [02:27:47] --report shows nothing informative [02:29:12] oh, right --report is broken [02:29:23] commenting out the error_reporting(0) makes it work [02:31:15] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours [02:38:21] !log fixed scap, was failing on the remote side due to mwversionsinuse exiting with status 1 due to /home/wikipedia/common not existing on apaches [02:38:25] Logged the message, Master [02:57:04] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Thu May 3 02:56:39 UTC 2012 [03:20:17] "Bug in Mailman version 2.1.13 [03:20:17] We're sorry, we hit a bug!Please inform the webmaster for this site of this problem. Printing of traceback and other system information has been explicitly inhibited, but the webmaster can find this information in the Mailman error logs. 
" [03:20:26] (but it's fine) [04:12:03] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:17:45] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:21:42] New patchset: Tim Starling; "Fix scap error handling, reduce fanout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6463 [04:22:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6463 [05:29:57] PROBLEM - Puppet freshness on db30 is CRITICAL: Puppet has not run in the last 10 hours [05:31:00] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [06:15:10] New patchset: ArielGlenn; "was deleting only with --verbose, fixed." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6464 [07:14:57] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6464 [07:15:00] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6464 [07:22:38] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/fundraising/bi-filter, [07:24:17] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [07:41:09] New patchset: ArielGlenn; "check for runphpscriptlet in same directory" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6465 [07:43:03] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6465 [07:43:05] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6465 [08:10:05] New patchset: Hashar; "git ignore /private/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6470 [08:10:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6470 [09:03:38] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [10:26:49] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:32:22] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:32:22] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [10:33:27] !log starting container-auditor on ms-be3 [10:33:30] Logged the message, Master [10:33:43] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:41:31] New patchset: Mark Bergsma; "Initial import." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6481 [10:41:31] New patchset: Mark Bergsma; "wikimedia-lvs-realserver (0.02) unstable; urgency=low" [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6482 [10:41:32] New patchset: Mark Bergsma; "wikimedia-lvs-realserver (0.03) edgy; urgency=low" [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6483 [10:41:33] New patchset: Mark Bergsma; "wikimedia-lvs-realserver (0.04) edgy; urgency=low" [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6484 [10:41:33] New patchset: Mark Bergsma; "Add and delete files..." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6485 [10:41:34] New patchset: Mark Bergsma; "Set the individual interface sysctls, to make automated testing in shell scripts practical." 
[operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6486 [10:41:35] New patchset: Mark Bergsma; "Revert Tim's sysctl.conf changes, as those keys do not generally exist on all servers and will throw errors if sysctl is not run with -e Instead we will change apache-sanity-check to trust the 'all' interfaces key." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6487 [10:42:13] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6481 [10:42:22] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6481 [10:42:24] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6481 [10:42:52] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6482 [10:43:01] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6482 [10:43:03] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6482 [10:43:23] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6483 [10:43:25] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6483 [10:43:43] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6484 [10:43:45] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6484 [10:44:47] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - 
https://gerrit.wikimedia.org/r/6485 [10:44:49] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6485 [10:45:06] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6486 [10:45:09] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6486 [10:45:25] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-lvs-realserver] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6487 [10:45:27] Change merged: Mark Bergsma; [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6487 [10:49:24] New patchset: Dzahn; "replace "*.wikimedia.org" with "star.wikimedia.org" per RT-2512 | get rid of star_wikimedia_org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [10:49:41] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3238 [10:53:15] New patchset: Dzahn; "replace "*.wikimedia.org" with "star.wikimedia.org" per RT-2512 | get rid of star_wikimedia_org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [10:53:32] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3238 [10:55:36] New patchset: Dzahn; "replace "*.wikimedia.org" with "star.wikimedia.org" per RT-2512 | get rid of star_wikimedia_org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [10:55:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3238 [10:57:05] New review: Dzahn; "re: Ryan. inline comment done." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [11:19:28] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [11:29:23] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [11:30:12] New patchset: Dzahn; "mwscriptwikiset - do not rely on mwscript being in path (f.e. cron jobs)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6489 [11:30:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6489 [11:31:23] New review: Dzahn; "this one failed on me in a cron, not finding mwscript" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6489 [11:33:43] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [11:46:22] New patchset: QChris; "Allow database hosts with port specification" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6491 [11:46:22] New patchset: QChris; "Allow temp dir on different partition than dump dirs" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6492 [11:50:30] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6491 [11:50:33] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6491 [11:52:08] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6492 [11:52:10] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/6492 [11:53:59] New patchset: Mark Bergsma; "Setup APT preferences in Puppet instead of in package wikimedia-base" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6493 [11:54:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6493 [11:54:41] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6493 [11:54:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6493 [12:00:45] New patchset: Dzahn; "change working refreshLinks crons to monthly schedule originally requested. leave s1 deactivated." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6494 [12:01:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6494 [12:04:46] New review: Dzahn; "just once monthly. keep it simple: cluster 2 on day 2, 3 on 3, and so on...at midnight." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6494 [12:04:49] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6494 [12:08:18] New patchset: Mark Bergsma; "Remove 'quiet' kernel option in Puppet instead of in package wikimedia-base" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6495 [12:08:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6495 [12:09:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6495 [12:09:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6495 [12:10:10] New patchset: Mark Bergsma; "Typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6496 [12:10:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6496 [12:10:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6496 [12:10:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6496 [12:13:22] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [12:17:52] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [12:25:04] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [12:25:31] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [12:25:55] New patchset: Mark Bergsma; "Setup vim as the default editor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6497 [12:26:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6497 [12:26:27] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6497 [12:26:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6497 [12:26:52] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [12:29:39] New patchset: Mark Bergsma; "Don't try to use Upstart jobs on Hardy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6498 [12:30:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6498 [12:30:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6498 [12:30:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6498 [12:35:11] New patchset: Mark Bergsma; "Fix logic for grub quiet option removal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6499 [12:35:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6499 [12:35:32] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6499 [12:35:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6499 [12:37:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [12:43:17] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [12:43:59] New patchset: Mark Bergsma; "Manage root's .bashrc in Puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6500 [12:44:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6500 [12:44:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6500 [12:44:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6500 [12:50:17] New patchset: Mark Bergsma; "Do things the Puppet way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6501 [12:50:31] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/6501 [12:50:58] New patchset: Mark Bergsma; "Do things the Puppet way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6501 [12:51:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6501 [12:51:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6501 [12:51:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6501 [13:02:49] New patchset: Mark Bergsma; "Install vim and sysstat on all servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6503 [13:03:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6503 [13:03:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6503 [13:03:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6503 [13:21:50] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [13:27:32] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [13:27:35] New review: Dzahn; "hehe yeah, hate it when then it's not default in crontab -e and stuff, joe has a user base in labs t..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/6497 [13:31:44] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613* [13:41:29] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [13:43:22] New review: Tim Starling; "IIRC, the values on the individual interfaces override the values for "all", but only when they are ..." [operations/debs/wikimedia-lvs-realserver] (master) - https://gerrit.wikimedia.org/r/6486 [13:49:33] !log Built new wikimedia-base 1.00 package, stripped of most stuff now handled by Puppet, and inserted it into the lucid-wikimedia and precise-wikimedia APT repositories [13:49:37] Logged the message, Master [13:50:07] you're reviewing svn commits that are years old there Tim ;-) [13:51:48] New patchset: Mark Bergsma; "Remove code for unused /etc/wikimedia-cluster" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6505 [13:51:49] New patchset: Mark Bergsma; "Remove APT pinning setup; now handled by Puppet" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6506 [13:51:50] New patchset: Mark Bergsma; "Puppet manages sysctl on Lucid and higher" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6507 [13:51:50] New patchset: Mark Bergsma; "Set default editor in Puppet" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6508 [13:51:51] New patchset: Mark Bergsma; "Remove sysctl.conf" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6509 [13:51:52] New patchset: Mark Bergsma; "Manage bashrc in Puppet" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6510 [13:51:52] New patchset: Mark Bergsma; "Retab postinst" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6511 [13:51:53] New patchset: Mark Bergsma; "Remove 
outdated, unused example files" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6512 [13:51:54] New patchset: Mark Bergsma; "Remove TCP removal, retab prerm" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6513 [13:51:54] New patchset: Mark Bergsma; "No longer depend on sysstat and vim" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6514 [13:51:55] New patchset: Mark Bergsma; "Remove sysctl.conf installation in debian/rules" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6515 [13:51:56] New patchset: Mark Bergsma; "wikimedia-base (1.00) lucid-wikimedia; urgency=low" [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6516 [13:52:39] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6505 [13:52:41] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6505 [13:52:53] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [13:53:11] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6506 [13:53:13] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6506 [13:53:34] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6507 [13:53:42] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6507 [13:53:44] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6507 [13:54:07] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6508 
[13:54:09] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6508 [13:54:31] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6509 [13:54:33] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6509 [13:54:53] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6510 [13:54:55] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6510 [13:55:16] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6511 [13:55:18] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6511 [13:55:36] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6512 [13:55:38] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6512 [13:56:11] PROBLEM - SSH on sq59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:51] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6513 [13:57:53] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6513 [13:58:19] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6514 [13:58:21] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6514 [13:58:41] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6515 [13:58:43] Change merged: Mark Bergsma; 
[operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6515 [13:58:53] RECOVERY - SSH on sq59 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:59:08] New review: Mark Bergsma; "(no comment)" [operations/debs/wikimedia-base] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6516 [13:59:10] Change merged: Mark Bergsma; [operations/debs/wikimedia-base] (master) - https://gerrit.wikimedia.org/r/6516 [13:59:40] dearest mark [13:59:46] what say you of this change: https://gerrit.wikimedia.org/r/#change,6392 [13:59:46] ? [14:00:04] you have glanced at it I believe, asking if I have tested it [14:00:06] I have responded [14:00:19] i know that this change might need to be babysat [14:00:23] and maybe you are not a babysitter [14:01:53] PROBLEM - SSH on sq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:02:20] PROBLEM - SSH on sq60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:14] RECOVERY - SSH on sq61 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:03:14] PROBLEM - SSH on sq59 is CRITICAL: Server answer: [14:03:23] PROBLEM - SSH on sq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:32] PROBLEM - SSH on sq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:41] PROBLEM - SSH on sq53 is CRITICAL: Server answer: [14:04:35] PROBLEM - SSH on sq52 is CRITICAL: Server answer: [14:04:53] RECOVERY - SSH on sq54 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:05:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [14:06:33] ottomata: i'll get to it [14:06:38] mmk! [14:07:04] no worries mark, i know you guys are busy [14:07:18] i just have to poke in here to make sure it happens, thank you! 
[14:07:26] PROBLEM - SSH on sq61 is CRITICAL: Server answer: [14:10:35] PROBLEM - SSH on sq54 is CRITICAL: Server answer: [14:11:37] New patchset: Mark Bergsma; "Use the default Ubuntu squid packages on the install server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6519 [14:11:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6519 [14:12:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6519 [14:12:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6519 [14:13:17] PROBLEM - SSH on sq57 is CRITICAL: Server answer: [14:13:53] hello mark, can you have a look at the remote syslog server patch ? :) https://gerrit.wikimedia.org/r/5813 [14:15:50] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [14:17:29] PROBLEM - SSH on sq51 is CRITICAL: Server answer: [14:18:59] New patchset: Mark Bergsma; "Qualify global variables in base.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6521 [14:19:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6521 [14:20:10] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6521 [14:20:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6521 [14:20:47] PROBLEM - SSH on sq58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:50] PROBLEM - SSH on sq62 is CRITICAL: Server answer: [14:24:41] PROBLEM - SSH on sq56 is CRITICAL: Server answer: [14:25:35] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:28:44] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613* [14:29:57] New patchset: Mark Bergsma; "Restart varnishncsa processes on init scripts and default file changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6522 [14:30:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6522 [14:30:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6522 [14:30:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6522 [14:32:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6392 [14:32:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6392 [14:35:48] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:35:56] ottomata: puppet is applying that change across the cluster in the next half hour [14:36:57] ohhhh boy [14:38:08] it works totally cool on my local, and I don't think it will be a prob at all, but [14:38:24] would appreciate it if you could run puppet manually on one of the varnish machines [14:38:29] to make sure all is well [14:38:29] I did, on cp1028 [14:38:33] varnishncsa is running [14:38:36] and that's all I know :) [14:38:36] great [14:38:41] haha, ok... [14:38:43] can you check [14:38:46] oh [14:38:49] check ps [14:39:00] what opts is varnishncsa running with? [14:39:15] with the -F argument [14:39:25] cool, with Accept-Language in there? [14:39:25] 110 2004 8.4 0.2 98384 83584 ? Ss 14:35 0:18 /usr/bin/varnishncsa -n frontend -w 208.80.154.15:8419 -m RxRequest:^(?!PURGE$) -D -P /var/run/varnishncsa/varnishncsa-multicast_relay.pid -F %l %n %t %{Varnish:time_firstbyte}x %h %{Varnish:handling}x/%s %b %m http://%{Host}i%U%q - %{Content-Type}o %{Referer}i %{X-Forwarded-For}i %{User-agent}i [14:39:30] i'm sorry [14:39:33] that's another ticket [14:39:34] yeah [14:39:36] that looks right [14:39:38] Content-Type [14:39:43] cool! [14:39:49] thank you! 
[14:39:59] now hashar [14:40:12] mark: danke [14:40:20] who will need to rebase his change, I can already see that ;) [14:40:26] I modified base.pp heavily today [14:40:43] ahhhhhhh [14:40:47] and that's still not valid puppet syntax, hashar [14:40:52] mark/hashar: "joe" :) [14:40:54] you can't pass arguments to classes that way [14:41:13] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [14:41:34] hashar: suggested to add those tool/editor packages for labs in one class when i saw Mark adding defaults for production [14:43:02] maybe there should be a "realm-check" in that generic class in base? if labs, then joe is ok [14:43:22] maybe, maybe not [14:43:28] that change was already handled by wikimedia-base before [14:43:31] which is also installed in labs [14:43:37] however, puppet enforces it now on every run ;) [14:44:27] so yeah, perhaps a $::realm check is reasonable [14:44:49] mark: do you have any doc about passing parameters to classes? [14:44:56] hashar: it works like this: [14:44:59] the alternative seemed like base::labs-standard-packages [14:45:08] class { "classname": param1 => value1, param2 => value2 } [14:47:10] oh yeah, so $::realm == "labs" = good. but we still have $realm == "labs" and ( $realm == "labs" ) in other places [14:47:45] yes [14:47:49] notpeter: search1-12....i want to confirm it's ok to shut them down [14:47:50] unfortunately the world is not a perfect place [14:48:15] search1-12 are going away?
:) [14:50:37] mark: yes...i have the boxes here and ready to swap out the old ones [14:50:45] yay [14:54:15] AZEHAZR stupid git-review did a rebase again [14:54:27] oh on a merged change [14:54:30] * hashar is luck [14:54:31] y [14:55:46] mark: class {} looks uglier than include : -D [14:55:47] https://gerrit.wikimedia.org/r/#patch,unified,5813,5,manifests/site.pp [14:57:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [14:58:08] this whole change is ugly [14:59:24] mark: I could make base::remote-syslog to use $::syslog_server as a default [14:59:30] thus we can stick to include syntax [15:00:00] hashar: i was going to change the mentioned base::syslogs class anyways [15:00:40] yeah but I would like to get rid of those global variables too [15:00:43] let me think about this a little bit [15:03:33] each time I touch a class it needs some more refactoring :-] [15:03:59] <^demon> Story of our lives. [15:04:00] <^demon> :) [15:05:17] !log powercycling srv266 [15:05:21] Logged the message, Master [15:07:42] hashar: it's only getting uglier I think [15:07:45] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [15:07:54] RECOVERY - Memcached on srv266 is OK: TCP OK - 0.004 second response time on port 11000 [15:07:56] just go with your original change (more or less), put a simple $::realm check in base::remote-syslog [15:08:02] we'll fix it in a nicer way later [15:08:13] * hashar digs in patchsets [15:09:03] no global variable yet, the rest of that class has realm specific ugly things anyway [15:10:45] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [15:12:05] cmjohnson1: yep! 
go for it [15:12:12] I'm going to log it [15:12:50] !log chris is taking down search1-12 to replace with new search nodes [15:12:52] Logged the message, notpeter [15:13:38] q for you guys [15:13:42] i'm about to commit something to gerrit [15:13:56] but part of the change is a patch to the squid frontend.conf.php script [15:13:59] which is not in puppet [15:14:04] how should I submit that bit? [15:14:12] notpeter: thx [15:14:14] it needs to go out as part of the git commit [15:14:32] put it in your comments, set your change to -1 or something [15:14:52] put the patch in the comments? [15:14:55] no [15:14:59] provide that elsewhere [15:15:00] where should I put the patch? [15:15:07] on fenari or so [15:15:13] hm, ok [15:15:16] my home there is ok? [15:15:18] yeah [15:15:21] k [15:15:24] danke [15:17:30] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [15:19:06] mark: and here is patch set 6 : https://gerrit.wikimedia.org/r/#patch,unified,5813,6,manifests/base.pp [15:19:54] notpeter: is searchidx1 staying? [15:20:29] you need to rebase your change [15:20:36] you're not operating on a current repo :) [15:21:00] cmjohnson1: hhhmmm, searchidx1 probably needs to be decommissioned, tbh [15:21:05] but I'd check with RobH about that [15:21:10] yes [15:21:12] decommission it.
[15:21:24] PROBLEM - Host search1 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:58] mark: it is based on latest version of `test` branch [15:22:13] oh damn this is test [15:22:16] yeah do whatever in test [15:22:21] :-] [15:22:26] test is so far apart from production that I've stopped caring ;) [15:22:42] we will have to have someone to merges it [15:22:45] PROBLEM - Host search2 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:46] it needs a merge from hell anyway [15:22:50] err to merge from prod to test [15:23:08] because we are setting up a test cluster on labs and will most probably requires changes from production [15:23:10] merged [15:23:22] labs is moving towards a system with one branch per project [15:23:39] PROBLEM - Host search3 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:42] that will be much easier [15:23:50] and then you can commit your own changes without going through gerrit [15:24:04] that will be great [15:24:15] PROBLEM - Host search4 is DOWN: PING CRITICAL - Packet loss = 100% [15:24:25] then I guess it will be the responsibility of each project owner to merge from production ? [15:24:36] notpeter: did you put these search hosts in decommissioned_servers? [15:24:41] hashar: yes [15:25:19] mark: not yet [15:25:28] notpeter: ideally that's done while they still run :) [15:25:33] ah, yes [15:25:37] too late now [15:25:58] but i do need to turn off nagios... [15:26:39] no [15:26:48] that's done by putting them in decommissioned_servers [15:27:46] New patchset: Ottomata; "Log Format changes for Rt 2674 and RT 2745." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6526 [15:28:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6526 [15:28:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [15:31:09] PROBLEM - Puppet freshness on db30 is CRITICAL: Puppet has not run in the last 10 hours [15:31:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3511 [15:31:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3511 [15:32:12] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [15:33:42] New patchset: Pyoungmeister; "decommisioning search1-12" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6527 [15:33:42] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [15:34:00] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2283 [15:34:00]