[00:00:09] Nikerabbit: lots of 'MessageCache failed to load messages' in exception.log [00:00:59] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:02:02] Fatal error: wikiversions.cdb has no version entry for `DB connection error: No working slave server: Unknown error (10.0.6.46)`. [00:02:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39330 [00:02:08] haha [00:04:17] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.237 second response time [00:06:00] can someone please flush Varnish? [00:08:56] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:14] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:10:26] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.333 second response time [00:11:21] notpeter: "Since 1997, as the bureau chief of the 'Institute of Junior Assembly Members Who Think About the Outlook of Japan and History Education'" [00:11:33] http://en.wikipedia.org/wiki/Shinz%C5%8D_Abe#Unpopularity_and_sudden_resignation [00:12:01] maybe that just translated funny [00:12:24] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.358 second response time [00:14:09] MaxSem: I'm going to do it, but as I told jon, if you guys are going to need a flush on deploy you need to poke an ops person *before* you deploy [00:14:24] Ryan_Lane, thanks [00:14:43] and there's no excuse for not knowing. [00:14:54] flushed [00:15:32] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:50] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [00:16:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:02] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.117 second response time [00:17:29] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:14] PROBLEM - Varnish HTTP mobile-backend on cp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:41] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:19:10] New patchset: Cmjohnson; "Adding tellurium mac address" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39332 [00:19:44] RECOVERY - Varnish HTTP mobile-backend on cp1042 is OK: HTTP OK HTTP/1.1 200 OK - 698 bytes in 0.058 seconds [00:21:32] New patchset: Cmjohnson; "Adding tellurium mac address" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39332 [00:21:50] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [00:22:15] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39332 [00:23:15] anyone want to check out http://ganglia.wikimedia.org/3.5.4/ before we make it "latest" ? 
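[editor's note] The cache flush Ryan_Lane performs above ("flushed") is a full wipe of the mobile Varnish caches after the MobileFrontend deploy. A minimal sketch of what such a flush can look like, assuming varnishadm access on the cache hosts; the management port, secret path and ban expression are illustrative, and the command name differs between Varnish releases (ban in 3.x, purge in 2.x):

    # Varnish 3.x: invalidate every cached object on one cache host
    varnishadm -T localhost:6082 -S /etc/varnish/secret "ban req.url ~ ."
    # Varnish 2.x spelling of the same thing
    varnishadm -T localhost:6082 -S /etc/varnish/secret "purge req.url ~ ."

As Ryan_Lane points out, this should be arranged with an ops person before the deploy, not requested afterwards.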
[00:23:47] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [00:25:08] * Matthew_ can't even get in, so meh :) [00:25:26] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:32] !log modifying project groups [00:25:39] Logged the message, Master [00:26:12] !log make that modifying project groups in labs by running syncProjectGroups.php maintenance script in OpenStackManager [00:26:19] Logged the message, Master [00:29:00] AaronSchulz: maybe we need one of those for the US [00:29:02] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.577 second response time [00:31:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.383 seconds [00:33:09] !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/javascripts/common/main.js 'touch file' [00:33:17] Logged the message, Master [00:33:32] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.496 second response time [00:33:59] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:01] I really should remember to do batch installs when doing package upgrades using salt '*' [00:35:10] New patchset: Lcarr; "Oupgraded ganglia-web to latest version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39334 [00:35:23] notpeter: https://gerrit.wikimedia.org/r/#/c/39334/ [00:35:24] New patchset: Brion VIBBER; "offhost_backups should only copy gzipped db dumps" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39335 [00:40:14] LeslieCarr: LeslieCarr lgtm [00:40:17] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:45:06] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.109 second response time [00:45:32] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.275 second response time [00:46:08] !log taking down for another reinstall (this time with raid!) [00:46:17] Logged the message, notpeter [00:46:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39334 [00:50:29] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:30] New patchset: Ryan Lane; "Also don't specify the top dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39338 [00:53:48] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [00:55:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39338 [00:56:20] PROBLEM - SSH on hume is CRITICAL: No route to host [01:00:50] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [01:01:44] Is anyone deploying now? 
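[editor's note] The "batch installs" Ryan_Lane wishes he had used above refer to salt's batch mode, which rolls a command across minions a few at a time instead of hitting every host matched by '*' at once. A minimal sketch, assuming a salt version with batch support and the stock pkg module; the batch size and canary target are arbitrary examples:

    # upgrade packages on at most 10 minions at a time
    salt '*' --batch-size 10 pkg.upgrade
    # or canary a small subset first
    salt 'srv22*' pkg.upgrade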
[01:01:53] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.326 second response time [01:05:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:50] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:29] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:09:07] New patchset: Bsitu; "Enable Echo on test server" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39341 [01:09:59] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.370 second response time [01:11:02] RECOVERY - SSH on hume is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:11:04] Change merged: Bsitu; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39341 [01:11:09] New patchset: Dfoy; "comment change only revised comments in file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39342 [01:14:31] !log demon synchronized php-1.21wmf6/extensions/Wikibase/lib/resources/templates.js 'Deploying I641725a2' [01:14:42] Logged the message, Master [01:15:41] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 314 seconds [01:16:03] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable Echo on test server' [01:16:12] Logged the message, Master [01:16:53] RECOVERY - Puppetmaster HTTPS on sockpuppet is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [01:17:20] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:19:54] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.172 second response time [01:20:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [01:24:26] PROBLEM - NTP on hume is CRITICAL: NTP CRITICAL: Offset unknown [01:26:05] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:29:14] RECOVERY - NTP on hume is OK: NTP OK: Offset -0.01475405693 secs [01:30:08] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:31:38] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [01:32:41] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.657 second response time [01:37:38] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:19] !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/javascripts/modules/mf-cleanuptemplates.js [01:41:27] Logged the message, Master [01:44:01] !log finished upgrading salt on all production minions [01:44:09] Logged the message, Master [01:44:53] !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/stylesheets/modules/mf-cleanuptemplates.css 'touch file' [01:45:01] Logged the message, Master [01:46:29] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:24] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.920 second response time [01:48:00] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [01:51:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:29] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [02:01:11] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:22] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/javascripts/common/main.js [02:01:30] Logged the message, Master [02:02:08] New patchset: Ryan Lane; "Bringing salt master thread count down." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39345 [02:02:28] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/javascripts/modules/mf-toggle.js 'touch file [02:02:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39345 [02:02:36] Logged the message, Master [02:09:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [02:11:14] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.198 second response time [02:15:35] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.662 second response time [02:20:32] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:38] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.642 second response time [02:30:44] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:20] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.339 second response time [02:42:17] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:15] New patchset: Bsitu; "Enable Echo on test2wiki and mediawikiwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39349 [02:47:52] about to run scap [02:48:17] !log LocalisationUpdate completed (1.21wmf6) at Wed Dec 19 02:48:16 UTC 2012 [02:48:26] Logged the message, Master [03:03:53] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [03:10:29] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.215 second response time [03:15:26] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:21:26] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.009 second response time on port 11000 [03:24:24] !log kaldari Started syncing Wikimedia installation... : [03:24:33] Logged the message, Master [03:31:24] !log LocalisationUpdate completed (1.21wmf5) at Wed Dec 19 03:31:24 UTC 2012 [03:31:32] Logged the message, Master [03:36:44] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.189 second response time [03:41:41] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:49:55] Change merged: Bsitu; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39349 [03:59:09] !log kaldari Finished syncing Wikimedia installation... 
: [03:59:17] Logged the message, Master [04:01:20] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.429 second response time [04:03:19] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable Echo on test2 and mediawiki' [04:03:27] Logged the message, Master [04:06:17] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:27] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.092 second response time [04:19:29] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:08] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.228 second response time [04:21:57] bsitu / kaldari: cool! [04:24:09] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [04:25:56] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:26:49] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:43] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.390 second response time [04:28:19] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.232 second response time [04:31:46] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [04:31:46] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [04:31:47] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [04:31:47] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [04:31:47] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [04:32:40] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:35:58] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.576 second response time [04:40:55] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:45:52] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [04:50:49] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.654 second response time [04:55:46] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:22:46] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:25:47] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.769 second response time [05:30:53] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:37:20] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.272 second response time [05:42:26] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:47:14] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.508 second response time [05:50:14] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:50:15] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:52:20] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:32] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.163 second 
response time [05:58:29] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:01:47] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.541 second response time [06:06:17] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:06:44] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:45] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.077 second response time [06:52:44] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:38] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.799 second response time [06:55:53] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.166 second response time [06:58:35] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:00:05] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [07:06:19] !log shot a bunch of converts on the image scalers, looks like a couple started flapping about 7-8 hours ago [07:06:28] Logged the message, Master [07:40:29] archive.org down since hours ago for power outage... do we have a paging system from outside the datacentres too? [07:42:30] really it is? aww :-( [07:42:49] we have a notification system yes [07:42:58] that is not dependent on us or our dcs [08:08:02] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [08:35:09] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours [09:02:09] if you have a power outage, you don't need a paging system though [09:02:14] trust me, you'll know [09:31:21] !log Jenkins: enabled unit test run on mw/core for some whitelisted people {{gerrit|39310}} [09:31:31] Logged the message, Master [09:35:56] RECOVERY - Puppet freshness on ms1004 is OK: puppet ran at Wed Dec 19 09:35:46 UTC 2012 [09:43:25] New review: Hashar; "recheck" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39056 [09:43:37] New review: Hashar; "recheck" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39057 [09:43:48] New review: Hashar; "recheck" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39058 [09:43:59] New review: Hashar; "recheck" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39059 [09:44:14] New review: Hashar; "recheck" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39060 [09:59:28] hi is here anyone who can handle some labs issue? [09:59:42] everyone sleeps :/ [10:05:02] zZZ zZZ [10:12:10] petan: paravoid is probably awake and might be able to handle labs issues. [10:12:31] we're already talking in #-labs [10:12:32] I figured out [10:16:35] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [10:24:32] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [10:55:25] hello , no space left on build1.pmtpa.wmflabs [10:55:26] zero [10:55:42] can we please have some space there ? 
I need to build some packages [10:58:17] you may want to ask for that in the labs irc channel [11:01:59] apergos: hi, I just talked to hashar, he re-directed me here [11:02:09] oh :-D [11:02:17] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [11:06:38] ends up the analytics instances have not been rebooted :) [11:07:35] uh do they need to be? (I have no idea how any of this stuff works, I figure the people who do are in the labs channel) [11:09:30] apergos: yup that is needed. /home/ used to be mounted on some NFS file system which has been made readonly. It has been migrated to Gluster so one need to reboot to change the /home/ mount :) [11:13:05] New patchset: Ori.livneh; "Use logrotate to archive log files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39366 [11:18:04] apergos: got a sec for https://gerrit.wikimedia.org/r/#/c/39366/1? [11:18:37] lemme see [11:20:27] * apergos removes the ? from the end of that link (stupid irc client) [11:21:04] well, probably my fault -- it's a valid URI :) [11:21:47] heh [11:29:47] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39366 [11:30:48] ariel, thanks, whatever your nickname is :) [11:31:51] just a sec, running puppet on stat1 now [11:32:59] The following packages have unmet dependencies: [11:32:59] libmysqlclient-dev : Depends: libmysqlclient18 (= 5.5.28-0ubuntu0.12.04.3) but 5.5.28-mariadb-wmf201212041~precise is to be installed [11:33:11] err: /Stage[main]/Misc::Statistics::Gerrit_stats/Git::Clone[gerrit-stats]/Exec[git_pull_gerrit-stats]/returns: change from notrun to 0 failed: git pull --quiet returned 128 instead of one of [0] at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:679 [11:33:20] and finally [11:33:21] err: /Stage[main]/Misc::Statistics::Mediawiki/Git::Clone[statistics_mediawiki]/Exec[git_pull_statistics_mediawiki]/returns: change from notrun to 0 failed: git pull --quiet returned 1 instead of one of [0] at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:679 [11:33:32] if you aren't already aware of those [11:33:49] the /etc/logrotate.d/eventlogging change applied fine [11:34:49] ori-l: [11:35:07] apergos: none of those are related to my change or my config classes [11:35:14] but i'll let andrew otto know [11:35:28] no, they aren't related to your change, just issues that someone ought to either fix or know can be ignored [11:35:31] thanks [11:35:40] i'll e-mail him right now [11:35:46] thanks for flagging [11:35:49] yup [11:36:03] and for running puppet :) [11:36:09] sure [11:43:21] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [11:48:18] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.005 second response time on port 11000 [11:53:51] !log Updated solr, cleaning up killlist [11:53:59] Logged the message, Master [12:58:30] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [12:58:30] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [12:58:39] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [13:01:48] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: Connection refused by host [13:01:48] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: Connection refused by host [13:01:57] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: Connection refused by host [13:01:58] PROBLEM - swift-container-server on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
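[editor's note] The change merged above (Gerrit 39366, "Use logrotate to archive log files") is what later shows up on stat1 as /etc/logrotate.d/eventlogging. The stanza below is a hypothetical illustration of that sort of config, not the contents of the merged change; the log path and retention values are assumptions:

    # hypothetical /etc/logrotate.d/eventlogging, written from a shell purely for illustration
    cat <<'EOF' | sudo tee /etc/logrotate.d/eventlogging
    /var/log/eventlogging/*.log {
        daily
        rotate 30
        compress
        delaycompress
        missingok
        notifempty
    }
    EOF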
[13:02:06] PROBLEM - SSH on ms-be1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:07] PROBLEM - swift-account-reaper on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:07] PROBLEM - swift-account-server on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:07] PROBLEM - swift-object-auditor on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:07] PROBLEM - swift-container-updater on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:15] PROBLEM - swift-container-server on ms-be1005 is CRITICAL: Connection refused by host [13:02:15] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: Connection refused by host [13:02:16] PROBLEM - swift-container-server on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:16] PROBLEM - swift-object-replicator on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:16] PROBLEM - swift-object-server on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:16] PROBLEM - swift-container-replicator on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:25] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused [13:02:25] PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: Connection refused by host [13:02:25] PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: Connection refused by host [13:02:25] PROBLEM - swift-account-auditor on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:25] PROBLEM - swift-object-updater on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:33] PROBLEM - swift-object-server on ms-be1005 is CRITICAL: Connection refused by host [13:02:34] PROBLEM - swift-container-auditor on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:34] PROBLEM - swift-account-replicator on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:34] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: Connection refused by host [13:02:51] PROBLEM - swift-object-updater on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:51] PROBLEM - swift-object-replicator on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:51] PROBLEM - swift-container-replicator on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:52] PROBLEM - swift-account-auditor on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:52] PROBLEM - swift-container-auditor on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:52] PROBLEM - SSH on ms-be1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:52] PROBLEM - swift-container-updater on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:00] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: Connection refused by host [13:03:00] PROBLEM - swift-account-replicator on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:01] PROBLEM - swift-account-server on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:10] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: Connection refused by host [13:03:10] PROBLEM - swift-object-server on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:10] PROBLEM - swift-account-reaper on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:03:18] PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: Connection refused by host [13:03:21] *eyeroll* [13:03:23] apergos: swift seems to have some issue :/ [13:03:26] oh page :) [13:03:37] pages are probably more reliable than irc notification Oo [13:03:37] swift in eqiad. unused. etc. [13:03:46] PROBLEM - swift-object-auditor on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:09] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:11:30] it's me [13:11:33] I'm reformatting the boxes [13:11:51] RECOVERY - Host ms-be1005 is UP: PING WARNING - Packet loss = 37%, RTA = 26.64 ms [13:11:53] have fun [13:13:58] RECOVERY - SSH on ms-be1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:14:16] pep8_1.3.3-0ubuntu1_all.deb !!! [13:14:20] I BACKPORTED A PACKAGE!!!!!!!!!!!!!!!!!!! [13:14:22] oh yeah [13:14:24] \O/ [13:14:42] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:21:27] PROBLEM - NTP on ms-be1006 is CRITICAL: NTP CRITICAL: No response from NTP server [13:21:54] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:22:39] PROBLEM - NTP on ms-be1007 is CRITICAL: NTP CRITICAL: No response from NTP server [13:24:09] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host [13:28:20] apergos: paravoid : what does it take to get a package uploaded on apt.w.o ? Should I open a rt ticket giving the places where the .deb .gz .changes .dsc files are? [13:28:27] or is it something I can do myself ? [13:28:49] no it needs global root [13:29:03] you can open an RT ticket or you can give it to me now and I'll do it for you :-) [13:29:55] I choose the later if you are available now :-D [13:30:02] The result is available on fenari in /home/hashar/pep8-backport [13:30:11] oo wait [13:30:22] just figured out I could potentially grant myself root by giving a fake package [13:30:26] how would you validate it ? [13:30:37] I'll look at the source [13:30:52] from the .deb ? [13:31:05] well, ideally we'd rebuild them [13:31:07] from source [13:31:12] I don't think we realistically do that [13:31:34] I am too paranoid maybe [13:31:44] no you're right [13:32:26] so that's a straight backport? [13:32:29] have you rebuilt the package? [13:32:55] I used backportpackage from ubuntu-dev-tools package [13:33:03] basically followed the nice guide at https://wikitech.wikimedia.org/view/Backport_packages [13:33:15] it is not signed with a PGP key since I don't have one [13:33:50] So I did: backportpackage --dont-sign -s raring -d precise -w workdir pep8 [13:34:19] then pbuilder --basetgz=/path/to/precise.tgz build pep8.dsc [13:34:21] hmm [13:34:27] I guess that rebuild it from scratch. [13:34:31] why doesn't it have a ~precise1 suffix? [13:35:46] PROBLEM - NTP on ms-be1005 is CRITICAL: NTP CRITICAL: No response from NTP server [13:35:50] backportpackage says it does that [13:36:08] ahh I must have run pbuilder against pep8_1.3.3-0ubuntu1.dsc instead of pep8_1.3.3-0ubuntu1~precise1.dsc [13:36:23] ha [13:36:49] who knew, reviews help! [13:38:00] rebuilding [13:43:24] paravoid: rebuild. I have deleted all files from fenari:/home/hashar/pep8-backport and reuploaded the result. I got a deb named pep8_1.3.3-0ubuntu1~precise1_all.deb [13:45:17] done [13:45:33] is it for gallium? [13:45:40] or do you need me to install it somewhere? 
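[editor's note] Putting hashar's backport recipe from above in one place, including the fix paravoid's review caught: pbuilder has to be pointed at the ~precise1 .dsc that backportpackage generates, not the unmodified -0ubuntu1 one. A sketch with placeholder paths:

    # generate a precise backport of pep8 from the raring source package
    backportpackage --dont-sign -s raring -d precise -w workdir pep8
    # build it in a clean precise chroot -- note the ~precise1 suffix on the .dsc
    sudo pbuilder --basetgz=/path/to/precise.tgz build workdir/pep8_1.3.3-0ubuntu1~precise1.dsc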
[13:47:41] just for gallium, I will update it tthere [13:48:00] cool [13:48:34] !log gallium: apt-get install pep8 v1.3.3 (backported from raring) [13:48:42] Logged the message, Master [13:49:04] (apt-get upgrade would also do it, fwiw) [13:49:48] I am so happy to have been able to backport a package [13:50:01] * hashar strikes achievement "backport an Ubuntu package" [13:50:06] sometimes it's more difficult [13:50:17] but others it's just a 5' work [13:50:21] I have noticed that with the python modules I needed for Zuul :/ [13:50:35] luckily upgrading gallium to Precise fixed it [13:50:39] I can take a stab at them, although probably not this week [13:50:41] oh [13:50:45] even better :-) [13:51:12] Some packages were from Quantal and back porting them to Lucid would have required to backport a tooooon of dependencies [13:51:19] so yeah, fixed :-] [13:51:42] oh yeah, that path is almost always harder [13:51:50] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39039 [13:52:28] so hmm [13:52:34] we have a lot of python issues https://integration.mediawiki.org/ci/job/operations-puppet-pep8/36/violations/? :-] [13:53:29] I'm not surprised [13:55:10] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39049 [13:56:32] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39040 [14:03:07] !log gallium: updated PHPUnit to 3.7.10 thus solving {{bug|42724}} [14:03:16] Logged the message, Master [14:08:31] hashar: incidentally, it occurred to me that the lack of librsvg on gallium means I'll have to take the tests out or jenkins will block the merge. That's right isn't it? [14:08:54] Jarry1250: I thought we had that issue sorted out aren't we ? [14:09:16] Did we? [14:09:29] I can't remember the bug # either :-D [14:10:10] I don't think there ever was a bug [14:10:17] How does one search gerrit by keyword? [14:11:23] New patchset: Reedy; "RT #2295: Run cleanupUploadStash across all wikis daily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37968 [14:24:32] hashar? Any ideas? [14:31:03] Jarry1250: i use google :-] [14:31:12] git log :-] [14:31:29] there is some trace at https://gerrit.wikimedia.org/r/#/c/36583/ [14:32:14] Jarry1250: the jenkins box has rsvg-convert : $ rsvg-convert --version [14:32:15] rsvg-convert version 2.36.1 (Wikimedia) [14:32:19] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [14:32:19] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [14:32:19] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [14:32:19] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [14:32:20] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [14:32:20] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [14:32:20] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [14:32:21] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [14:32:21] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:38] hasshar: Oh, okay, great, I just misremembered then. [14:34:42] *hashar [14:34:44] Coolio. 
[14:41:56] New patchset: Demon; "Make github replication config forward compatible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39385 [14:43:09] New review: Demon; "Replication plugin will just ignore directives it doesn't understand, so this can be merged whenever..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/39385 [14:46:35] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [14:55:17] RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 109.72 ms [15:08:32] Change abandoned: Jgreen; "I thought I'd already abandoned this one--script has been fixed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39335 [15:27:47] New review: Hashar; "Has Tim said, this is going nowhere. I have logged Tim's idea under https://bugzilla.wikimedia.org/s..." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15720 [15:29:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.262 seconds [15:34:12] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39056 [15:36:38] !log hashar synchronized wmf-config/CommonSettings.php [15:36:47] Logged the message, Master [15:37:35] !log hashar synchronized wmf-config/InitialiseSettings.php [15:37:44] Logged the message, Master [15:37:50] !log hashar synchronized wmf-config/InitialiseSettings.php [15:37:58] Logged the message, Master [15:46:47] New patchset: Hashar; "update puppet-lint rake target" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34850 [15:48:53] may I get a change of ops/puppet rakefile changed ? It is about removing some very noisy puppet lint checks. Does not do any harm since that is not yet run from anywhere :) https://gerrit.wikimedia.org/r/#/c/34850/ [15:48:56] thanks!! [15:50:40] and could use https://gerrit.wikimedia.org/r/#/c/39049/ which configures the python linter to lint *.py.erb files in addition to the *.py file :-] [15:51:32] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:51:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:52:15] New patchset: Hashar; "find filenames based on realm/datacenter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39191 [15:52:35] New review: Hashar; "Seems good. Will get it merged tonight." 
[operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/39191 [15:56:02] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34850 [15:58:53] danke apergos :-] [16:04:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:40] same old whine about /srv/org/wikimediaq/doc/index.html but that's it so fr (waiting for sockpuppet to hurry up), I need to run now and wil be back in a few hours [16:04:52] see folks in a while [16:04:57] * hashar waves [16:06:54] daughter time [16:07:26] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:16:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.057 seconds [16:37:11] Yay, newer git on fenari [16:39:20] !log reedy synchronized php-1.21wmf6/extensions/CategoryTree [16:39:26] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [16:39:27] Offending key for IP in /etc/ssh/ssh_known_hosts:5732 [16:39:27] Matching host key in /etc/ssh/ssh_known_hosts:603 [16:39:29] Logged the message, Master [16:39:31] ^ Weren't those fixed already? :( [16:48:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:01] New patchset: Cmjohnson; "Addin solr servers to yttrium stanza" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39393 [16:51:33] reedy can you look at that and tell me if it looks okay ^^ [16:51:50] New review: RobH; "the rest of site.pp has the FQDN, so please include the (eqiad|wmnet) info in the regex" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/39393 [16:52:05] cmjohnson1: I know that they say you dont need to include it [16:52:10] but the rest of site.pp has FQDN [16:52:18] so please correct and include to match the rest of site.pp [16:52:45] better to be too narrow scoped when applying manifests [16:55:01] New patchset: Cmjohnson; "Adding solr servers to yttrium stanza" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39393 [16:55:13] robh take a look now [16:56:22] New review: RobH; "much better, thx!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/39393 [16:56:23] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39393 [16:56:33] cmjohnson1: yer merged on gerrit, up to you on sockpuppet [16:56:37] k [16:56:40] thx [16:56:42] welcome [16:57:23] !log reedy synchronized php-1.21wmf6/extensions/Wikibase [16:57:29] ... 
[16:57:31] Logged the message, Master [16:58:16] so puppet is in progress on yttrium [16:59:18] hrmm, so just running isnt conclusive [16:59:33] but watch the output and ensure its applying stanza specific listing in the output [16:59:42] well, mayne not applying, as its been applied, but lists in output [17:00:05] as any server that we have signed on puppet, but ISNT specifically listed in site.pp gets a standard puppet info [17:00:24] so just running may mean stanza is bad, but its getting stock info [17:00:31] (i may be wrong in this, but its my understanding) [17:00:44] hence why i say in here, someone will correct me im sure [17:06:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [17:38:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:50] PROBLEM - Host es1009 is DOWN: CRITICAL - Network Unreachable (10.64.32.19) [17:45:50] PROBLEM - Host es1008 is DOWN: CRITICAL - Network Unreachable (10.64.32.18) [17:47:11] PROBLEM - Host analytics1016 is DOWN: CRITICAL - Network Unreachable (10.64.36.116) [17:47:11] PROBLEM - Host analytics1015 is DOWN: CRITICAL - Network Unreachable (10.64.36.115) [17:47:11] PROBLEM - Host analytics1011 is DOWN: CRITICAL - Network Unreachable (10.64.36.111) [17:47:29] PROBLEM - Host analytics1024 is DOWN: CRITICAL - Network Unreachable (10.64.36.124) [17:47:29] PROBLEM - Host es1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.17) [17:47:30] PROBLEM - Host es1010 is DOWN: CRITICAL - Network Unreachable (10.64.32.20) [17:47:30] PROBLEM - Host analytics1025 is DOWN: CRITICAL - Network Unreachable (10.64.36.125) [17:47:30] PROBLEM - Host analytics1018 is DOWN: CRITICAL - Network Unreachable (10.64.36.118) [17:47:30] PROBLEM - Host analytics1022 is DOWN: CRITICAL - Network Unreachable (10.64.36.122) [17:47:30] PROBLEM - Host analytics1017 is DOWN: CRITICAL - Network Unreachable (10.64.36.117) [17:47:47] PROBLEM - Host analytics1013 is DOWN: CRITICAL - Network Unreachable (10.64.36.113) [17:47:48] PROBLEM - Host analytics1020 is DOWN: CRITICAL - Network Unreachable (10.64.36.120) [17:47:56] PROBLEM - Host analytics1012 is DOWN: CRITICAL - Network Unreachable (10.64.36.112) [17:47:57] PROBLEM - Host analytics1014 is DOWN: CRITICAL - Network Unreachable (10.64.36.114) [17:48:00] sorry [17:48:05] turns out we have some crossed wires [17:48:06] PROBLEM - Host analytics1027 is DOWN: CRITICAL - Network Unreachable (10.64.36.127) [17:48:06] PROBLEM - Host ms-be1005 is DOWN: CRITICAL - Network Unreachable (10.64.32.10) [17:48:06] PROBLEM - Host ms-be1006 is DOWN: CRITICAL - Network Unreachable (10.64.32.11) [17:48:06] PROBLEM - Host analytics1026 is DOWN: CRITICAL - Network Unreachable (10.64.36.126) [17:48:06] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Network Unreachable (10.64.36.119) [17:48:23] PROBLEM - Host ms-be1010 is DOWN: CRITICAL - Network Unreachable (10.64.32.15) [17:48:23] PROBLEM - Host ms-be1007 is DOWN: CRITICAL - Network Unreachable (10.64.32.12) [17:48:24] PROBLEM - Host analytics1023 is DOWN: CRITICAL - Network Unreachable (10.64.36.123) [17:48:41] PROBLEM - Host analytics1021 is DOWN: CRITICAL - Network Unreachable (10.64.36.121) [17:50:25] uh oh again! [17:50:43] sorry :( [17:50:49] analytics is getting the brunt of this [17:50:53] :( [17:50:57] i owe you whiskey [17:52:03] so'k! better now than later! 
[17:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.665 seconds [17:54:55] maxsem@fenari:/home/wikipedia/common/php-1.21wmf6/extensions/GeoData$ ping solr1001 [17:54:55] PING solr1001.eqiad.wmnet (10.64.0.98) 56(84) bytes of data. [17:54:55] 64 bytes from search1004.eqiad.wmnet (10.64.0.98): icmp_req=1 ttl=62 time=26.4 ms [17:54:58] eh? [18:00:32] New review: Hashar; "DO NOT MERGE TILL https://gerrit.wikimedia.org/r/#/c/39191/ is merged." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/39057 [18:09:19] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [18:23:36] New patchset: Cmjohnson; "dhcpd entry for solr1-3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39397 [18:24:15] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39397 [18:24:17] cmjohnson1: are you back in the dC ? [18:24:44] lesliecarr: i am not in dc ...do you need me there? will take me about 20mins [18:24:52] yes definitely need you there [18:24:55] if you can [18:25:07] yep..np..omw [18:25:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:34] LeslieCarr, my guitar is sitting behind me, and a string just snapped all by itself [18:28:40] maybe your ethernet cables are too tight [18:28:57] notpeter: Any chance you could clear up the hume related ssh annoyances on fenari? Thanks [18:29:27] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [18:29:27] Offending key for IP in /etc/ssh/ssh_known_hosts:5732 [18:29:27] Matching host key in /etc/ssh/ssh_known_hosts:603 [18:30:09] ottomata: hehe [18:30:13] sigh [18:33:17] Reedy: sure. 
not sure why hume is so annoying, but yes, I'll see what I can do [18:36:19] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours [18:39:16] ottomata: so unlike previous times, turning the lag back up is not working right now - i am and have been on the phone with juniper [18:39:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.860 seconds [18:39:21] we are trying to get this as soon as possible [18:48:10] RECOVERY - Host analytics1025 is UP: PING WARNING - Packet loss = 54%, RTA = 26.53 ms [18:48:10] RECOVERY - Host analytics1027 is UP: PING WARNING - Packet loss = 54%, RTA = 26.58 ms [18:48:10] RECOVERY - Host analytics1024 is UP: PING WARNING - Packet loss = 54%, RTA = 26.53 ms [18:48:10] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [18:48:10] RECOVERY - Host analytics1023 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [18:48:11] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:48:11] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:48:12] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [18:48:12] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [18:48:13] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.70 ms [18:48:13] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [18:48:14] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 27.03 ms [18:48:14] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:48:15] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:48:15] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:48:19] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [18:48:19] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [18:48:20] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:48:20] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:48:21] ottomata: yay up now - though fragile [18:48:28] RECOVERY - Host es1009 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:48:46] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [18:49:04] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:49:04] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:49:40] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:49:45] lesliecarr: hi [18:49:55] my hero! [18:50:16] do you have a spare ex4200 little module ? [18:50:24] the one with the two xfp uplinks [18:50:25] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:50:40] idk [18:50:52] lemme look....maybe robh would know [18:50:55] we think there might be a bad module [18:51:01] RobH ping ^^ [18:51:03] thanks [18:51:09] worst case, we will try reseating modules [18:52:19] we don't normally have spares ...give me 5 mins to check storage [18:53:34] cool [18:55:27] leslicarr: no spare modules [18:56:34] who's the best person to discuss job queue issues with? [18:57:01] AaronSchulz, probably [18:58:00] cmjohnson1, from what I've seen in this channel it looks like the Solr hosts are almost ready? [18:58:06] okay cmjohnson1 -- so can you reseat the module in asw-c1-eqiad ? 
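[editor's note] For the recurring hume host-key warning Reedy pastes further up, the usual cleanup is to drop the stale entry for the offending address from the system-wide known_hosts file on fenari and let it be re-added with the current key. A minimal sketch, assuming root on fenari:

    # remove the stale key recorded for hume's IPv6 address (line 5732 in the warning)
    sudo ssh-keygen -f /etc/ssh/ssh_known_hosts -R '2620:0:860:2:21d:9ff:fe33:f235'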
[18:58:08] New patchset: Demon; "Switching all 'pedias to 1.21wmf6" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39403 [18:58:18] <^demon> AaronSchulz: ^ [18:58:35] maxsem yep almost [18:58:47] will be done today [18:58:49] and thank you for driving out [18:58:52] yep [18:59:07] lesliecarr: yes [18:59:16] and okay to do it now? [18:59:35] np...i live down the street...takes me longer to park and get through the doors than to drive here [19:00:11] ok to do now [19:00:31] anything you do to this switch can't make it more broken ;) [19:00:36] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: All 'pedias to 1.21wmf6 [19:00:40] ok..doing it now [19:00:44] Logged the message, Master [19:06:47] kaldari: EchoNotificationJob does not set the job ID [19:06:59] the constructor should pass it to super [19:07:28] super? [19:07:28] anyway, the core code was dummy proofed against this, but some old justs never were ack'ed, thus they stayed in the queue [19:07:47] so the jobs were run, just never acked (they were run ~10 or so after being added) [19:07:54] kaldari, super is the java name of the parent [19:07:57] they look to be quite old, eventually they will be deleted [19:07:59] ah [19:08:17] LeslieCarr: sorry saw ping [19:08:18] the job count on those graphs should only count things in the "unclaimed queue" [19:08:18] the ones from an hour ago still haven't run though [19:08:22] was afk making lunch [19:08:34] so we dont have spare SFP addon modules for hte 4200s that i know of [19:08:40] though it would be nice to track those too to make sure they don't balloon [19:08:43] iirc we sent what little we had to go into use in tampa [19:08:50] usualy we order to use, not spare. [19:08:59] at the moment that count is just every row in the table [19:09:01] claimed or not [19:09:03] (though i suppose we need to keep a spare) [19:10:33] kaldari: "call super" [19:10:38] wikipedia it :) [19:10:47] ok, I'll try that [19:14:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:42] looks like the constructor is passing it to the parent. 
hmm [19:16:57] RobH: lemme open a ticket in procurement for that now :) [19:17:07] lesliecarr: okay reseated [19:17:20] LeslieCarr: include a spare for eqiad and sdtpa [19:17:34] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:43] PROBLEM - Host es1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:43] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:52] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:53] PROBLEM - Host es1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:58] cool [19:18:01] PROBLEM - Host es1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:02] PROBLEM - Host analytics1023 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:10] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:22] cmjohnson1: now, can you swap the fibers on switch asw-c-eqiad.mgmt (xe-8/1/0 and xe-8/1/2) [19:18:28] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:37] PROBLEM - Host analytics1025 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:37] PROBLEM - Host analytics1027 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:55] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:55] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100% [19:19:00] i mean asw-c8-eqiad (xe-8/1/0 and xe-8/1/2 ) [19:19:50] ok [19:19:58] PROBLEM - MySQL Idle Transactions on es1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:19:58] PROBLEM - SSH on analytics1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:07] PROBLEM - SSH on analytics1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:53] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:20:55] lesliecarr: complete [19:21:10] PROBLEM - MySQL Replication Heartbeat on es1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:21:29] RECOVERY - MySQL Idle Transactions on es1010 is OK: OK longest blocking idle transaction sleeps for seconds [19:21:38] thanks cmjohnson1 [19:21:41] looking good to me [19:21:52] asw-c1 okay? 
[19:22:40] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:23:16] RECOVERY - SSH on analytics1022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:24:19] PROBLEM - SSH on analytics1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:46] RECOVERY - SSH on analytics1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:25:11] cmjohnson1: it is better than before ;) [19:25:31] PROBLEM - SSH on analytics1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:39] lesliecarr: i need to remove the fibers again...i didn't close the cover [19:26:00] ok, go for it [19:26:01] from c1 [19:26:10] ok, do it now please [19:26:28] AaronSchulz: it finally ran all the new jobs, but it took an hour [19:26:43] RECOVERY - Host ms-be1006 is UP: PING WARNING - Packet loss = 93%, RTA = 26.61 ms [19:26:43] RECOVERY - Host ms-be1010 is UP: PING WARNING - Packet loss = 86%, RTA = 26.59 ms [19:26:52] RECOVERY - Host es1009 is UP: PING WARNING - Packet loss = 73%, RTA = 26.51 ms [19:27:01] RECOVERY - Host es1008 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [19:27:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.751 seconds [19:27:47] RECOVERY - Host analytics1023 is UP: PING WARNING - Packet loss = 86%, RTA = 26.60 ms [19:27:47] RECOVERY - Host analytics1013 is UP: PING WARNING - Packet loss = 80%, RTA = 26.59 ms [19:27:55] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.70 ms [19:27:55] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [19:27:56] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [19:27:56] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [19:28:04] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.79 ms [19:28:22] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [19:28:32] RECOVERY - Host analytics1025 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [19:28:40] RECOVERY - SSH on analytics1024 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:28:40] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [19:29:07] RECOVERY - SSH on analytics1014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:29:08] RECOVERY - MySQL Replication Heartbeat on es1010 is OK: OK replication delay 0 seconds [19:29:16] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [19:34:52] maxsem: regarding solr servers...all but solr1002 will be completed today...solr1002 was delivered with h/w issue that will be resolved tomorrow (hopefully) [19:37:36] cmjohnson1, great. thank you [19:42:21] huzzah!!! [19:43:33] so - it looks like the problem was threefold (RobH , mark, paravoid , cmjohnson1 , ottomata may all be interested) -- #1 was that 8/1/0 and 8/1/2 were swapped. #2 is that the module in asw-c1-eqiad is bad, #3 is that junos 11.4 default behavior changed with respect to lldp neighborships and ae bundles must be stated instead of their member interfaces [19:43:51] that seems like a cluster f#@ [19:43:52] cool! [19:44:04] LeslieCarr: so how awesome do you feel right now having figured it out? [19:44:09] hungry [19:44:40] :) [19:44:44] you rock! 
[19:47:21] so basically [19:47:35] so had human, software, and hardware failure [19:47:38] trifecta of awesome [19:48:06] (human error was not chris) [19:48:21] so had to be whoever was in charge of eqiad onsite then, too bad we have no records. [19:48:25] * RobH runs away [19:49:31] need to get some noms , bbiab [19:49:31] hehehe [19:49:32] :) [19:51:58] New patchset: Hashar; "pass pep8 on deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39407 [19:59:06] !log authdns update correcting solr1003 entry [19:59:14] Logged the message, Master [20:03:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:15] sbernardin: can you put a network ticket in for the solr servers w/ports and switches plz [20:17:20] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:17:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [20:21:43] cmjohnson1: will do [20:21:51] thx [20:25:17] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [20:28:39] !log demon synchronized php-1.21wmf6/extensions/ParserFunctions/ 'Rolling back pfuncs to d96a17c' [20:28:48] Logged the message, Master [20:36:10] back [20:39:01] front [20:41:23] left [20:42:27] bottom [20:44:13] !log authdns update solr1001/1002 correction [20:44:21] Logged the message, Master [20:48:59] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:49:08] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:18] cmjohnson1: hey... what's up with ns0? [20:53:07] hey..idk [20:53:46] fuck [20:53:56] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.017 seconds response time. www.wikipedia.org returns 208.80.154.225 [20:54:28] notpeter: i updated twice...probably too close together [20:54:33] ah, was curling when I meant to be digging... man, that was terrifying [20:54:42] cmjohnson1: dunno if oy'ure aware of it [20:54:48] but after pushing new zone file [20:54:50] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [20:54:58] it's best practice to dig @ ns0/1/2 [20:55:08] to make sure they're still working [20:55:14] not aware but i will do that for now on [20:55:33] every so often they fail and the pdns daemon needs to be rebooted [20:55:34] cooL! [20:55:38] and now you know :) [20:55:55] thx [20:56:08] yep! [20:58:08] PROBLEM - Lucene disk space on search1001 is CRITICAL: Connection refused by host [20:58:44] PROBLEM - SSH on search1001 is CRITICAL: Connection refused [20:59:21] search1001 error is my doing [20:59:43] ok :) [20:59:58] was about to check it, but I shall not worry [21:00:29] Does anyone who isn't Asher understand about the new mariadb packages? [21:00:57] andrewbogott: not really :) [21:01:18] I continue to not understand why previously-innocent puppet classes now pull down maria packages and crap out. [21:01:18] I didn't even really know it was happening until i got his email [21:01:33] andrewbogott: oh, i d ounderstand that [21:01:34] Or, more specifically: I understand /why/ that happens but I don't understand why that isn't considered a serious problem [21:02:25] notpeter, can you advise about the right path forward? 
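[editor's note] The post-push check notpeter describes above (dig against ns0/1/2 after every authdns update) is a one-liner; a minimal sketch, querying the same record the Nagios check uses:

    # confirm all three authoritative nameservers still answer after pushing a new zone
    for ns in ns0.wikimedia.org ns1.wikimedia.org ns2.wikimedia.org; do
        dig +short @"$ns" www.wikipedia.org
    done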
[21:02:58] andrewbogott: probably the best path forward is to just remove the packages from our apt-repo for now [21:03:14] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [21:03:14] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [21:03:19] we can also fuck around with apt pinning [21:03:32] notpeter: Ok, so you are supporting the argument that the status quo is simply wrong and broken [21:03:34] but, as it's only in testing, i have no problem pulling it [21:03:43] andrewbogott: that is correct [21:03:44] well [21:03:54] maybe not wrong, i mean puppet's doing exactly what we told it to do ;) [21:03:58] but broken, yes [21:03:58] andrewbogott: where were we surprise-switched to mariadb? [21:04:23] Jeff_Green: when you try to install a regular mysql server package [21:04:32] it pulls the mariadb mysql-common [21:04:37] Jeff_Green: I think what happened is that Asher was trying to just stage things for a future migration. But apt mistakes some of his hypothetical/future packages for newer substitutes for existing dependencies. [21:04:47] because it's the newest one of our packages that provided mysql-common [21:04:56] but it doesn't pull all mariadb shit [21:04:59] so it shits the bed [21:05:07] fugly [21:05:10] yeah [21:05:23] When I asked Asher about this before he said something like, "that's what you get for using ensure=>latest" Except today I'm trying to configure a brand new server and can't do that either. [21:05:42] bah. that's what you get for crapping up your repo with conflicting packages [21:05:45] It seems like the maria packages should be orthogonal to the mysql dependency path [21:05:47] so, the problem is when something requires mysql for a package [21:05:53] and so it auto-installs deps [21:06:00] andrewbogott: yes [21:06:03] seems to me that pinning is probably the way to go [21:06:10] long-term anyway [21:06:20] Jeff_Green: yes. [21:06:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [21:06:26] these should be able to coexist [21:06:38] but for now, as mariadb is still in testing, pulling them is also legit [21:06:54] would be great if someone would fix with pinning, though :) [21:06:59] ok. Probably I can get by with just pulling that single package; I will try. [21:07:10] I do not know what 'pinning' is so will ignore that portion of this conversation :) [21:07:27] ha [21:07:59] andrewbogott: please pull all mariadb packages, not just common [21:08:02] i think it must be a sewing analogy [21:08:26] I assumed it had something to do with going steady, and also with this being the 1950s. [21:08:45] could be!
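The pinning idea floated above would look roughly like the apt preferences entry below: keep the MariaDB builds out of normal dependency resolution while leaving them installable by hand. The file name is hypothetical, and the version glob assumes the MariaDB build of mysql-common carries "mariadb" in its version string, which would need checking against the actual packages in the repo.

    # /etc/apt/preferences.d/mariadb  (hypothetical path and version pattern)
    # A priority below 100 makes apt prefer the stock Ubuntu build when resolving
    # dependencies; the MariaDB build can still be installed explicitly by version.
    Package: mysql-common
    Pin: version *mariadb*
    Pin-Priority: 50

Afterwards, "apt-cache policy mysql-common" would show which build apt will actually pick as the candidate.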
[21:11:29] New patchset: Ori.livneh; "(Bug 43273) Enable Extension:PostEdit for itwikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39495 [21:14:31] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:19:01] PROBLEM - NTP on search1001 is CRITICAL: NTP CRITICAL: No response from NTP server [21:27:19] hi CT [21:28:14] hi [21:36:18] New patchset: Dzahn; "puppetize bugzilla community metrics "active users" stats script RT-3962" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39497 [21:38:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:06] Ryan_Lane: the pep8 changes I have made are in our Gerrit : for deployment module : https://gerrit.wikimedia.org/r/#/c/39407/ and then Gerrit hooks : https://gerrit.wikimedia.org/r/#/c/39040/ [21:52:16] yep, I saw the email :) [21:52:25] ;-D [21:52:33] will set up puppet lint in january [21:54:08] and pep8 can even be made to lint our erb templates :-D ( https://gerrit.wikimedia.org/r/#/c/39049/ ) [21:54:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [22:06:35] New patchset: Dereckson; "(bug 43274) Enable WikiLove on it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39498 [22:26:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [22:44:00] notpeter: have you run into an instance during install where the puppet stalls during the cfg? [22:44:08] New patchset: Dzahn; "puppetize bugzilla community metrics "active users" stats script RT-3962" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39497 [22:46:02] New patchset: Dzahn; "puppetize bugzilla community metrics "active users" stats script RT-3962" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39497 [22:51:23] cmjohnson1: I've run into lots of problems :) [22:51:33] can you describe more what you're seeing? [22:51:37] maybe pastebin it? [22:52:01] LeslieCarr: question about icinga: where is /usr/lib/nagios/plugins/check_nrpe pulled in from? [22:52:43] notpeter: http://p.defau.lt/?umG1JrWRnnVanuWJO1jAnw [22:53:38] like what packages ? [22:53:51] notpeter: nagios-plugins ? [22:54:04] oh, no wait [22:54:09] it's actually from the files [22:54:10] nagios-nrpe-server [22:54:19] in manifests/misc/icinga.pp [22:54:52] class icinga::monitor::files::nagios-plugins [22:54:53] so, it's compiled against an old version of libssl [22:54:56] starts at line 262 [22:55:01] and I think that's why nrpe is busted [22:55:07] ah :) [22:55:17] nagios-plugins wasn't installing most of the plugins we used [22:55:24] it changed [22:55:26] /usr/lib/nagios/plugins/check_nrpe: error while loading shared libraries: libssl.so.0.9.8: cannot open shared object file: No such file or directory [22:55:29] they split it up into multiple packages! [22:55:40] cmjohnson1: uuuhhhh lemme check sockpuppet [22:55:46] sneaky [22:55:59] nagios-plugins, nagios-plugins-basic, nagios-plugins-standard, nagios-plugins-extra ! [22:56:10] hehehe [22:56:12] mutante: and they're all installed on neon [22:56:13] seriously ?
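Both findings above can be confirmed on the host with stock commands: dpkg tells you whether the binary is owned by any package at all (the copy puppet drops in from files/ will not be), and ldd shows the unresolved library. Paths and package names follow the log; this is a sketch, not what was actually typed on neon.

    # does any installed package own the binary, or did puppet put it there?
    dpkg -S /usr/lib/nagios/plugins/check_nrpe

    # which shared libraries fail to resolve (expect: libssl.so.0.9.8 => not found)
    ldd /usr/lib/nagios/plugins/check_nrpe | grep 'not found'

    # which of the split-up plugin packages, if any, actually ships a check_nrpe
    dpkg -L nagios-plugins nagios-plugins-basic nagios-plugins-standard nagios-plugins-extra 2>/dev/null | grep check_nrpe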
[22:56:18] and I don't think any of them have check_nrpe [22:56:19] :) [22:56:43] nagios-nrpe-plugin - Nagios Remote Plugin Executor Plugin [22:57:00] ah, ok [22:57:04] let's grab that package [22:57:09] not installed on spence though ...hrmm [22:57:14] and then should be good to go [22:57:35] mutante: spence is in a stable state. don't look too closely, or you risk madness [22:57:57] cmjohnson1: try removing /var/lib/puppet/ssl *on the client* [22:58:04] not on sockpuppet [22:58:07] don't be like peter [22:58:13] and try running that command again [22:58:14] notpeter: how about apt-get install nagios-nrpe-plugin on neon and then dpkg -L nagios-nrpe-plugin [22:58:26] heh...ok [22:58:29] sure, I can do that [22:58:54] mutante: looks like you already are :) [22:59:10] actually, no, i was just on spence [22:59:30] but ok ..now;) [22:59:36] so, just make sure you take the relevant files away from puppet or else it will be overwritten [22:59:52] The following NEW packages will be installed: nagios-nrpe-plugin nagios3 nagios3-cgi nagios3-common nagios3-core [22:59:55] wth [23:00:05] dependency on nagios ..arg [23:00:14] hahaha [23:00:18] oh that will fuck shit up [23:01:02] that's some bullshit [23:01:09] well, we can just grab the binary if we have to [23:01:32] is there an icinga-nrpe-plugin ? [23:01:41] sadly, it doesn't look like it [23:02:03] cmjohnson1: working? [23:02:07] no.. but wow . .we have so much other stuff now [23:02:09] check-mk-config-icinga - general purpose nagios-plugin for retrieving data [23:02:26] root@neon:~# apt-cache search check-mk* [23:02:56] so stuff we can't use because we switched ? sigh sigh [23:02:57] i removed it..i have to remove the certificates and do again [23:03:09] cmjohnson1: ok, cool [23:03:15] just lemme know if you want any help [23:03:23] mk-livestatus being packaged is ..cool though [23:03:34] and don't delete /var/lib/puppet/ssl on sockpuppet. srsly. trust me on this one :) [23:03:36] i will thx [23:03:44] apt-cache show check-mk-livestatus [23:03:59] "obsoletes NRPE, check_by_ssh, NSClient and check_snmp. [23:04:03] i won't, i will just leave that to you ...:-] [23:04:14] hehehe [23:04:24] mutante: awesome. [23:07:02] !log aaron synchronized wmf-config/CommonSettings.php 'use swift for captchas for testwiki/mediawikiwiki' [23:07:11] Logged the message, Master [23:07:54] notpeter: i am getting this http://p.defau.lt/?iqJ0ZVOFGiYuT8jFb_i21Q [23:07:56] my Bash script really enjoyed the ; character as part of a password string [23:07:59] :p [23:08:09] i think i know what to do but would like your help [23:08:23] New patchset: Aaron Schulz; "Use swift for testwiki/mediawikiwiki captchas" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39516 [23:08:31] cmjohnson1: ok [23:09:02] http://wikitech.wikimedia.org/view/Build_a_new_server#Get_puppet_running [23:09:08] <-- has the puppet signing part [23:09:40] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39403 [23:09:48] mutante...you are talking about this: find /var/lib/puppet/ssl -type f -exec rm {} \; to clean out the client.
[23:10:13] cmjohnson1: ok, puppet is now running [23:10:15] what I did was [23:10:23] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39516 [23:10:32] puppetca --clean solr1001.eqiad.wmnet [23:10:34] on sockpuppet [23:10:38] then on the box [23:10:44] cmjohnson1: yes, well, both parts, one on sockpuppet , the other on the client [23:10:50] I did rm -rf /var/lib/puppet/ssl [23:11:00] then logged back into solr1001 [23:11:11] ran puppetd --test --ca_server sockpuppet.pmtpa.wmnet [23:11:19] back to sockpuppet and signed the request [23:11:21] back to solr1001 [23:11:24] now is happy [23:11:27] basically [23:11:34] lots of nuking from orbit and trying again [23:11:52] cool..i was concerned w/removing something from sockpuppet [23:11:57] ah [23:12:04] so, using puppetca --clean [23:12:07] is legit [23:12:18] doing that by hand can get really messy [23:12:52] thx for the help and explanation [23:13:00] yep! definitely! [23:15:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.100 seconds [23:33:44] New review: Dzahn; "removing jenkins bot as reviewer" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/39497 [23:33:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39497 [23:36:17] New patchset: Dzahn; "fix syntax error in bugzilla.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39522 [23:37:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39522 [23:38:04] mutante: LeslieCarr so what do we want to do about the nrpe check thing [23:38:08] it seems like our options are [23:38:23] grab the bin from the nagios package and throw it onto neon [23:38:36] or redo the checks in whatever the new recommended way is [23:38:47] which would be a huge pain, but would probably be "the right way" [23:38:49] as much as i hate the grabbing the bin, it is already done now and so won't be any worse .... [23:38:50] TimStarling: how big was your inbox? [23:39:00] do the bad way, open a ticket for the right way [23:39:16] notpeter: what is the new recommended way though if you ask Ubuntu [23:39:30] are they packaging both, Nagios and Icinga [23:39:43] or is it a mess because some things are there for Icinga and others are not [23:40:24] mutante: would check-mk-livestatus be the new icinga way?
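notpeter's recovery steps above, collected in one place. The hostname is the one from the log, the wikitech page linked above remains the canonical reference, and the exact sign invocation is the usual puppetca form rather than a quote from the log, so treat this as a sketch of the procedure.

    # on the puppetmaster (sockpuppet): revoke the client's old certificate
    puppetca --clean solr1001.eqiad.wmnet

    # on the client: throw away its local SSL state
    rm -rf /var/lib/puppet/ssl

    # on the client: generate a fresh certificate request against the CA
    puppetd --test --ca_server sockpuppet.pmtpa.wmnet

    # back on the puppetmaster: sign the pending request, then re-run the client
    puppetca --sign solr1001.eqiad.wmnet
    puppetd --test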
[23:41:09] AaronSchulz: depends on how you count it [23:41:11] LeslieCarr: I'd kinda rather just keep using spence and do it the right way eventually than start using neon and do it the right way eventually [23:41:14] personally [23:41:24] because eventually never comes when there's no need :) [23:41:44] hehe [23:41:50] notpeter: no, i think it's just the new Nagios way [23:41:54] there is "check-mk-config-icinga - general purpose nagios-plugin for retrieving data" [23:41:58] only a few non-spam messages directly addressed to me, and I read most of those while I was away in case they were important [23:42:02] but that is not check-mk-livestatus [23:42:15] but hundreds of mailing list posts to read [23:42:19] i'm not super stuck on either way - though i think i'm more in the camp of "dear god please let's get the migration over" [23:43:11] icli - command line interface for the icinga monitoring system [23:44:01] LeslieCarr: I think you get to make the call :) [23:44:13] let's put the binary in our own package [23:44:26] if there really is none [23:44:31] then it seems less ghetto [23:44:34] i like that compromise [23:45:17] actually, can we just take the Nagios package, unpack it, remove the Nagios package dependencies, rename it to Icinga [23:45:21] and build it again [23:45:27] yes [23:45:30] I like that idea [23:45:34] ftw! [23:45:38] I really like it if you're doing it ;) [23:46:14] heh, or put the binary in puppet /misc/files/others/dontlookhere :) [23:46:50] well we already have the binary in puppet [23:46:58] so, the package idea is better [23:47:39] eh..why does it fail on neon then [23:47:50] because it's the old one i guess [23:48:08] it does not even have Nagios it its name [23:48:12] in [23:48:30] and by i guess i mean i'm pretty sure it was the old one :) [23:48:58] wait.. so how does Icinga think we solve this [23:49:36] what's old about this one [23:49:44] i mean, my suggestion would have also been using the "old" one [23:49:48] the one from the Nagios package [23:49:59] this was from the lucid nagios package [23:50:08] maybe it's been updated, hence the libssl error ? [23:50:14] compiled against an old version of libssl [23:50:25] yeah [23:50:36] so, we definitely need a newer version :) [23:50:37] let's say we never used Nagios ever [23:50:46] and we just want to set up a fresh Icinga server and have NRPE [23:50:59] what would they tell us to do [23:51:31] https://wiki.icinga.org/display/howtos/Setting+up+NRPE+with+Icinga [23:51:49] they are installing nagios-nrpe-server as was the first guess [23:52:16] and then apt-get --no-install-recommends install nagios-nrpe-plugin [23:52:28] they just ignore the recommendations [23:52:33] to get other Nagios packages [23:52:55] Setting up nagios-nrpe-plugin (2.12-5ubuntu1.1) ... [23:52:57] done on neon [23:53:19] !log neon - install nagios-nrpe-plugin with --no-install-recommends [23:53:30] Logged the message, Master [23:53:31] /usr/lib/nagios/plugins/check_nrpe [23:53:38] fixed [23:54:11] woo. [23:54:28] hey, look at all those recovery emails [23:54:58] TimStarling: are you doing cr today? [23:55:55] maybe if there is something urgent to review [23:56:05] hotness [23:56:43] coldness [23:56:45] mutante: maybe leave a note about that in the icinga.pp ? [23:59:22] I mean, or we could define an exec... [23:59:25] but, iunno [23:59:35] that's a different flavor of janky...
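For the "unpack it, drop the Nagios dependency, rename it to Icinga, build it again" idea discussed above, the mechanics with dpkg-deb would look roughly like this. Package version and architecture in the file names are taken from the log and an assumption respectively, the renamed package name is made up, and the team ultimately went with the --no-install-recommends install instead, so this is only a sketch of the alternative.

    # fetch the existing .deb (apt-get download exists on precise; otherwise pull it from the mirror by hand)
    apt-get download nagios-nrpe-plugin

    # unpack the payload and the control information
    dpkg-deb -x nagios-nrpe-plugin_2.12-5ubuntu1.1_amd64.deb nrpe-plugin
    dpkg-deb -e nagios-nrpe-plugin_2.12-5ubuntu1.1_amd64.deb nrpe-plugin/DEBIAN

    # edit nrpe-plugin/DEBIAN/control by hand: rename the package and remove the
    # field that drags in nagios3, then rebuild
    dpkg-deb -b nrpe-plugin icinga-nrpe-plugin_2.12-5ubuntu1.1_amd64.deb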