[00:02:28] (PS1) Ottomata: Remove analytics1012 1013 1020 from list of kafka brokers in site.pp [puppet] - https://gerrit.wikimedia.org/r/229035 (https://phabricator.wikimedia.org/T106581)
[00:02:51] (CR) Ottomata: [C: 2 V: 2] Remove analytics1012 1013 1020 from list of kafka brokers in site.pp [puppet] - https://gerrit.wikimedia.org/r/229035 (https://phabricator.wikimedia.org/T106581) (owner: Ottomata)
[00:07:59] (PS1) BryanDavis: Make docroot/bits/static and docroot/bits/static-current symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/229036
[00:08:01] (PS1) BryanDavis: Cleanup stale docroot/bits/static-1.26wmf* content [mediawiki-config] - https://gerrit.wikimedia.org/r/229037
[00:08:03] (PS1) BryanDavis: Update multiversion/updateBranchPointers whitespace and docs [mediawiki-config] - https://gerrit.wikimedia.org/r/229038
[00:11:17] ori, Reedy, twentyafterfour, bblack: ^
[00:11:38] It's a bit late in my day to deploy and babysit those today
[00:11:53] bd808: I'm on it
[00:13:01] awesome
[00:16:18] looking...
[00:18:32] (CR) BBlack: [C: 1] Make docroot/bits/static and docroot/bits/static-current symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/229036 (owner: BryanDavis)
[00:18:37] (CR) BBlack: [C: 1] Cleanup stale docroot/bits/static-1.26wmf* content [mediawiki-config] - https://gerrit.wikimedia.org/r/229037 (owner: BryanDavis)
[00:19:00] (CR) BBlack: [C: 1] Update multiversion/updateBranchPointers whitespace and docs [mediawiki-config] - https://gerrit.wikimedia.org/r/229038 (owner: BryanDavis)
[00:19:27] ^ +1 for sounds like and looks like the right stuff there, but there's a lot of complexity in all of this that I really don't fully comprehend :)
[00:20:06] bblack: that's why I pinged Reedy. He may be the only human on the planet who really groks this part of the config ;)
[00:20:48] I'm 99% sure this is the right fix.
[00:21:06] it's almost certainly not more broken than what we have right now
[00:21:24] but I still wonder why bits has special treatment in the past
[00:21:32] s/has/had/
[00:26:13] (CR) Dzahn: [C: -2] "lots of gmond ports and snmpwalk make this unfeasible" [puppet] - https://gerrit.wikimedia.org/r/194802 (https://phabricator.wikimedia.org/T104939) (owner: Dzahn)
[00:29:22] (CR) Gergő Tisza: Add configuration for authmetrics logging (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/227630 (https://phabricator.wikimedia.org/T91701) (owner: Gergő Tisza)
[00:29:32] (CR) 20after4: [C: 1] Make docroot/bits/static and docroot/bits/static-current symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/229036 (owner: BryanDavis)
[00:31:01] (PS2) Gergő Tisza: Add configuration for authmetrics logging [mediawiki-config] - https://gerrit.wikimedia.org/r/227630 (https://phabricator.wikimedia.org/T91701)
[00:33:20] bd808: should I deploy this and see how it goes?
[00:33:28] (CR) BryanDavis: [C: 1] "You should put this up for swat tomorrow morning" [mediawiki-config] - https://gerrit.wikimedia.org/r/227630 (https://phabricator.wikimedia.org/T91701) (owner: Gergő Tisza)
[00:33:46] twentyafterfour: if you have the energy, that would be awesome
[00:34:55] but it can wait until you run the train tomorrow if you'd rather too. Up to you really
[00:35:23] operations, Patch-For-Review: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1505187 (Dzahn) summarizing what is done and what is not: heze (storage): done helium (director, storage): NOT done lithium (host, syslog): done terbium (host, maintenance): not done other hosts are adde...
[00:42:47] (PS1) Dzahn: bacula: enable firewall on helium [puppet] - https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996)
[00:44:56] (CR) Dzahn: "need to be really sure all rules are there not just for bacula but also for poolcounter, no other poolcounter host uses it yet and poolcou" [puppet] - https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[00:47:24] does a poolcounter host use Apache?
[00:47:46] there is one but just a default install
[00:48:01] aka "It works"
[00:50:20] operations, Wikimedia-Mailing-lists, Patch-For-Review: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1505203 (Dzahn) Do we have a VM for mailman yet? Is it going to be a VM?
[00:50:36] operations, Wikimedia-Mailing-lists: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1505204 (Dzahn)
[00:52:43] operations, Wikimedia-Mailing-lists: Ferm rules for mailman - https://phabricator.wikimedia.org/T104980#1505211 (Dzahn) there are existing iptables rules on sodium - where exactly do they come from? they need to be translated to ferm. they seem to be there to drop spammers
[01:09:16] (PS1) Aude: Add config for Wikisource badges on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014)
[01:27:32] (PS1) Dzahn: update group photo on people.wm.org [puppet] - https://gerrit.wikimedia.org/r/229063 (https://phabricator.wikimedia.org/T106598)
[01:29:29] (PS2) Dzahn: update group photo on people.wm.org [puppet] - https://gerrit.wikimedia.org/r/229063 (https://phabricator.wikimedia.org/T106598)
[01:32:37] (PS3) Dzahn: update group photo on people.wm.org [puppet] - https://gerrit.wikimedia.org/r/229063 (https://phabricator.wikimedia.org/T106598)
[01:33:37] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:34:47] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from Parsoid returned the unexpected status 500 (expecting: 200)
[01:35:46] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[01:37:48] (CR) Dzahn: [C: 2] update group photo on people.wm.org [puppet] - https://gerrit.wikimedia.org/r/229063 (https://phabricator.wikimedia.org/T106598) (owner: Dzahn)
[01:39:19] (PS4) Dzahn: update group photo on people.wm.org [puppet] - https://gerrit.wikimedia.org/r/229063 (https://phabricator.wikimedia.org/T106598)
[01:42:40] operations, Easy, Patch-For-Review: Update people.wikimedia.org with the 2015 Wikimania group photo - https://phabricator.wikimedia.org/T106598#1505280 (Dzahn) Open>Resolved https://people.wikimedia.org/
[01:48:17] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:49:27] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
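The cerium/praseodymium/xenon checks flapping above probe each RESTBase endpoint and give up after 10 seconds. A minimal sketch of that kind of probe, assuming hypothetical endpoint URLs (the production check is an NRPE-invoked script that is not shown in this log):

```
# Minimal RESTBase-style endpoint health probe (sketch; the URL is hypothetical,
# the 10-second timeout mirrors the checks above).
import urllib.request
import urllib.error

ENDPOINTS = [
    "http://cerium.example:7231/en.wikipedia.org/v1/page/html/Foobar",  # hypothetical
]

def check(url, timeout=10):
    """Return (ok, detail) for one endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status == 200:
                return True, "OK"
            return False, "unexpected status %d (expecting: 200)" % resp.status
    except urllib.error.HTTPError as e:
        return False, "unexpected status %d (expecting: 200)" % e.code
    except Exception as e:  # connection errors and socket timeouts
        return False, "CRITICAL: %s" % e

for url in ENDPOINTS:
    ok, detail = check(url)
    print(("OK" if ok else "CRITICAL"), url, detail)
```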
[01:49:27] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:20] (CR) 20after4: [C: 2] "I guess I'm gonna deploy this." [mediawiki-config] - https://gerrit.wikimedia.org/r/229036 (owner: BryanDavis)
[01:50:26] (Merged) jenkins-bot: Make docroot/bits/static and docroot/bits/static-current symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/229036 (owner: BryanDavis)
[01:52:22] !log twentyafterfour Started scap: sync https://gerrit.wikimedia.org/r/#/c/229036/1
[01:52:26] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[01:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:57:46] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[01:57:46] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[01:58:25] operations: request VM for grafana - https://phabricator.wikimedia.org/T107832#1505304 (Dzahn) NEW
[01:58:51] operations: request VM for grafana - https://phabricator.wikimedia.org/T107832#1505312 (Dzahn)
[01:58:52] operations, Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1505311 (Dzahn)
[02:00:01] operations: request VM for grafana - https://phabricator.wikimedia.org/T107832#1505304 (Dzahn) Labs Project Tested: n/a Site/Location: EQIAD Number of systems: 1 Service: grafana Networking Requirements: internal IP Processor Requirements: 1 Memory: ? Disks: ? Other Requirements: ?
[02:01:03] operations: request VM for grafana - https://phabricator.wikimedia.org/T107832#1505316 (Dzahn) @ori any recommendations on what grafana needs? disk / memory / other / ^^^
[02:02:29] Ops-Access-Requests, operations, Reading-Admin, Patch-For-Review: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1505321 (Dzahn)
[02:02:30] Ops-Access-Reviews, operations: Review: access to stat1002 for tbayer - https://phabricator.wikimedia.org/T106317#1505319 (Dzahn) Open>Resolved a:Dzahn
[02:03:39] Ops-Access-Requests, operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1505324 (Dzahn)
[02:03:45] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:04:34] operations, HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1505325 (Dzahn) so either misc-web or the cert for dumps should be renewed and get an additional SAN of download.wm
[02:05:37] operations, HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1505326 (Dzahn) a:ArielGlenn @ArielGlenn Wanna talk to Brandon about the concerns re: huge file sizes and varnish? Or rather to Robh about getting a new cert?
[02:06:06] (CR) 20after4: "this seems to have worked." [mediawiki-config] - https://gerrit.wikimedia.org/r/229036 (owner: BryanDavis)
[02:06:36] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:29] bd808, bblack, chasemp: the fix for static/current worked out as far as I can see.
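The change scap'd above replaces per-branch docroot/bits content with symlinks. A quick sketch for sanity-checking such symlinks, with hypothetical paths and targets (the real layout lives in mediawiki-config and is not reproduced here):

```
# Verify that expected docroot symlinks exist and point at the intended target
# (sketch; both the link paths and targets below are hypothetical).
import os

EXPECTED = {
    "docroot/bits/static": "../../static",                  # hypothetical target
    "docroot/bits/static-current": "../../static/current",  # hypothetical target
}

for link, want in EXPECTED.items():
    if not os.path.islink(link):
        print("MISSING/NOT A SYMLINK:", link)
    elif os.readlink(link) != want:
        print("WRONG TARGET:", link, "->", os.readlink(link), "(want %s)" % want)
    else:
        print("ok:", link, "->", want)
```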
[02:08:49] operations, Discovery, MediaWiki-Search, Monitoring: Search service monitoring should fail if search results only return exact matches and suggestions don't work - https://phabricator.wikimedia.org/T101914#1505329 (Dzahn) Which monitoring is it about? Icinga? Is it the "ElasticSearch health check...
[02:08:55] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[02:11:06] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:17:16] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:18:03] !log twentyafterfour Finished scap: sync https://gerrit.wikimedia.org/r/#/c/229036/1 (duration: 25m 41s)
[02:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:46] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[02:28:53] !log l10nupdate Synchronized php-1.26wmf16/cache/l10n: (no message) (duration: 09m 16s)
[02:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:15] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[02:32:18] !log @tin LocalisationUpdate completed (1.26wmf16) at 2015-08-04 02:32:18+00:00
[02:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:45:52] twentyafterfour: thanks!
[02:45:56] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:46] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:55] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[02:51:05] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[02:52:05] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[02:54:51] operations, Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive - https://phabricator.wikimedia.org/T107769#1505412 (Dzahn) The mailman settings for the list still say to archive it, the post logs still look normal, the .mbox file has last been touched June 2 (this is when new mails sto...
[02:57:16] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:00:36] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:03:26] operations, Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive - https://phabricator.wikimedia.org/T107769#1505442 (Dzahn) Fixed. The issue was wrong file system permissions on the directory containing the .mbox file which is used to create archives. fix: chmod +x /var/lib/mailmana/a...
[03:03:35] operations, Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive - https://phabricator.wikimedia.org/T107769#1505443 (Dzahn) a:Dzahn
[03:03:51] operations, Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive - https://phabricator.wikimedia.org/T107769#1505444 (Dzahn) Open>Resolved
[03:09:46] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
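The fix logged at [03:03:26] was a missing execute (search) bit on the directory holding the list's .mbox file. A sketch of a check for that failure mode; the path is a placeholder, since the real one is truncated in the log above:

```
# Check that a directory carries the execute (search) bits, the failure mode
# behind the mailman archive bug above. The path is a placeholder.
import os
import stat

def searchable_by_all(path):
    st = os.stat(path)
    want = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return (st.st_mode & want) == want

path = "/srv/example/list-archive"  # placeholder, not the real directory
if not searchable_by_all(path):
    print("missing execute bit on", path, "- the archiver cannot traverse it")
```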
[03:10:56] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[03:11:55] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[03:16:15] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[03:17:16] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:22:16] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:24:37] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:35] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[03:28:35] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[03:32:05] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:36:16] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[03:37:46] (PS19) BryanDavis: labs: new role::logstash::stashbot class [puppet] - https://gerrit.wikimedia.org/r/227175
[03:37:53] (CR) jenkins-bot: [V: -1] labs: new role::logstash::stashbot class [puppet] - https://gerrit.wikimedia.org/r/227175 (owner: BryanDavis)
[03:39:13] bd808: hah, the new hostname mentions by l10nupdate, when announced to twitter via SAL, link to https://twitter.com/TIN
[03:39:16] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[03:40:04] heh
[03:40:15] I have a patch in to fix that so they come out 'l10nupdate@tin'
[03:42:37] (CR) BryanDavis: "< greg-g> bd808: hah, the new hostname mentions by l10nupdate, when announced to twitter via SAL, link to https://twitter.com/TIN" [puppet] - https://gerrit.wikimedia.org/r/228299 (owner: BryanDavis)
[03:43:33] :)
[04:15:35] identi.ca/TIN
[04:17:42] "No such 'user' with id 'tin'" ;)
[04:18:44] Ops-Access-Requests, operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1505491 (MZMcBride) >>! In T107043#1505060, @mpopov wrote: > Huge thanks to @robh and @krenair for their help in getting this resolved. What was the answer?
[04:19:57] (PS1) BryanDavis: Add logstash-filter-prune 0.1.5 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/229073 (https://phabricator.wikimedia.org/T99735)
[04:20:28] brion: ogv.js is crazy cool.
[04:20:32] (PS2) BryanDavis: Add logstash-filter-prune 0.1.5 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/229073 (https://phabricator.wikimedia.org/T99735)
[04:20:45] Ops-Access-Requests, operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1505495 (Deskana) Judging by my logs of #wikimedia-operations, the wrong SSH key was being automatically loaded: 16:35:38 bearloga: robh btw, I figured out the mystery...
[04:21:15] brion: it is smooth enough on safari that I had to triple-check I wasn't accidentally loading some plugin.
[04:34:40] :D
[04:44:57] (PS12) BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735)
[04:48:06] operations, Wikimedia-Logstash: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121#1505496 (bd808) Open>declined I'm going to keep things simple and just ship a repo of unpacked plugins. Support for this was [[https://github.com/elastic/logstash/pull/3...
[04:48:38] (PS13) BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735)
[05:03:38] (PS20) BryanDavis: labs: new role::logstash::stashbot class [puppet] - https://gerrit.wikimedia.org/r/227175
[05:36:38] (CR) Giuseppe Lavagetto: [C: 1] "I think misc-web-lb is the right place too, but let's discuss it a bit with @bblack too." [dns] - https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602) (owner: JanZerebecki)
[06:06:04] (PS1) Legoktm: Disable Flow on English Wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/229076 (https://phabricator.wikimedia.org/T107846)
[06:06:27] (CR) Legoktm: [C: 2] Disable Flow on English Wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/229076 (https://phabricator.wikimedia.org/T107846) (owner: Legoktm)
[06:06:34] (Merged) jenkins-bot: Disable Flow on English Wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/229076 (https://phabricator.wikimedia.org/T107846) (owner: Legoktm)
[06:07:15] !log legoktm Synchronized wmf-config/InitialiseSettings.php: Disable Flow on English Wikiversity (duration: 00m 12s)
[06:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:07:44] !log sync to mw1061 failed
[06:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:07:52] rsync: failed to set times on "/srv/mediawiki/.": Read-only file system (30)
[06:07:52] rsync: failed to set times on "/srv/mediawiki/wmf-config": Read-only file system (30)
[06:07:52] rsync: mkstemp "/srv/mediawiki/wmf-config/.InitialiseSettings.php.VUVd7M" failed: Read-only file system (30)
[06:08:12] _joe_: are there known issues with mw1061? ^
[06:08:42] <_joe_> not that I remember off the top of my head
[06:09:01] hmm
[06:09:05] <_joe_> legoktm: I'll take a look shortly, for now I'll depool it
[06:09:16] !log legoktm Synchronized wmf-config/InitialiseSettings.php: Disable Flow on English Wikiversity (duration: 00m 12s)
[06:09:20] <_joe_> mw1090 had the same problem AFAIR
[06:09:25] yeah, still failing
[06:09:28] ok
[06:10:36] PROBLEM - puppet last run on cp3022 is CRITICAL: puppet fail
[06:13:01] (PS1) Legoktm: Disable Flow on jawikiversity too [mediawiki-config] - https://gerrit.wikimedia.org/r/229077 (https://phabricator.wikimedia.org/T107846)
[06:13:07] operations, ops-eqiad: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1505560 (Joe) NEW
[06:13:16] (CR) Legoktm: [C: 2] Disable Flow on jawikiversity too [mediawiki-config] - https://gerrit.wikimedia.org/r/229077 (https://phabricator.wikimedia.org/T107846) (owner: Legoktm)
[06:13:22] (Merged) jenkins-bot: Disable Flow on jawikiversity too [mediawiki-config] - https://gerrit.wikimedia.org/r/229077 (https://phabricator.wikimedia.org/T107846) (owner: Legoktm)
[06:14:02] !log legoktm Synchronized wmf-config/InitialiseSettings.php: Disable Flow on Japanese Wikiversity (duration: 00m 13s)
[06:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:14:46] <_joe_> !log depooled mw1061
[06:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:19:10] _joe_: thanks
[06:23:55] (CR) Muehlenhoff: "We need https://gerrit.wikimedia.org/r/#/c/224577/ first" [puppet] - https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[06:27:26] PROBLEM - puppet last run on db2050 is CRITICAL: puppet fail
[06:30:17] PROBLEM - puppet last run on mw2052 is CRITICAL: puppet fail
[06:31:25] PROBLEM - puppet last run on mc2007 is CRITICAL: Puppet has 1 failures
[06:31:46] PROBLEM - puppet last run on mw1086 is CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw2073 is CRITICAL: Puppet has 1 failures
[06:32:45] PROBLEM - puppet last run on mw2021 is CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw1135 is CRITICAL: Puppet has 1 failures
[06:33:15] PROBLEM - puppet last run on cp2013 is CRITICAL: Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on mw2129 is CRITICAL: Puppet has 1 failures
[06:36:53] operations, Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769#1505599 (Aklapper)
[06:37:46] RECOVERY - puppet last run on cp3022 is OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:39:41] operations, Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769#1505601 (Legoktm) So... is there a way to backfill the missing emails?
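The mw1061 failure above is rsync hitting a filesystem that went read-only after a disk fault (T107849). A minimal sketch for spotting read-only mounts from /proc/mounts before attempting a sync (Linux-specific):

```
# Detect read-only mounts from /proc/mounts (Linux), the condition that made
# the mw1061 sync above fail with "Read-only file system (30)".
def readonly_mounts():
    ro = []
    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if "ro" in options.split(","):  # first option is rw or ro
                ro.append(mountpoint)
    return ro

if __name__ == "__main__":
    for mp in readonly_mounts():
        print("read-only:", mp)
```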
[06:42:31] operations, Wikimedia-Mailing-lists, Pywikibot-General: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769#1505611 (Legoktm)
[06:52:35] RECOVERY - puppet last run on db2050 is OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:56:06] RECOVERY - puppet last run on cp2013 is OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:25] RECOVERY - puppet last run on mc2007 is OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on mw1086 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:36] RECOVERY - puppet last run on mw2021 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:55] RECOVERY - puppet last run on mw1135 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on mw2129 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:06] RECOVERY - puppet last run on mw2073 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:37] RECOVERY - puppet last run on mw2052 is OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:38:02] !log @tin ResourceLoader cache refresh completed at Tue Aug 4 07:38:01 UTC 2015 (duration 38m 0s)
[07:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:51:22] operations, Database: New s3 production cluster for mysql - https://phabricator.wikimedia.org/T106847#1505661 (jcrespo) @Springle We need to spec a hardware replacement to be used for production hosts and send it to @robh
[07:57:26] PROBLEM - puppet last run on mw1061 is CRITICAL: Puppet last ran 6 hours ago
[08:01:19] operations: request VM for grafana - https://phabricator.wikimedia.org/T107832#1505668 (faidon) It can just go to the graphite box, like Graphite itself/gdash/tessera etc., no?
[08:01:42] (PS1) Jcrespo: Increasing db1027 and db1015 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/229082
[08:02:33] (CR) Jcrespo: [C: 2] Increasing db1027 and db1015 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/229082 (owner: Jcrespo)
[08:07:17] !log jynus Synchronized wmf-config/db-eqiad.php: Increasing load for db1027 and db1015 (duration: 00m 12s)
[08:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:18:17] :S Timestamps and database names are gone in logs on fluorine
[08:20:37] hoo, all logs? mw logs? ...?
[08:20:49] exception logs
[08:22:18] jynus: db1065/66 (s1) look unhappy
[08:22:35] Logs are full of slow queries, presumably from them
[08:22:53] springle: ^
[08:23:02] (Draft1) Giuseppe Lavagetto: mediawiki: remove mw1061 from dsh group [puppet] - https://gerrit.wikimedia.org/r/229083 (https://phabricator.wikimedia.org/T107849)
[08:23:03] 65?
[08:23:32] yeah, both of them
[08:23:37] guess they're API or so
[08:23:41] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: remove mw1061 from dsh group [puppet] - https://gerrit.wikimedia.org/r/229083 (https://phabricator.wikimedia.org/T107849) (owner: Giuseppe Lavagetto)
[08:23:56] yeah, they are
[08:25:12] I see now
[08:25:35] ApiQueryRevisions::run
[08:26:26] T101502#1470443
[08:26:37] Yeah :/ Maybe setting the weight to 0 for them for now would make it less worse?
* bad
[08:27:31] no
[08:27:43] Oh, that bug :/
[08:28:00] that query with those specific parameters will fail always
[08:28:21] Ok, that's unfortunate
[08:28:31] I assumed it might be that those boxes are tight on ram
[08:28:52] in that case making them only run API queries would have helped... but meh
[08:31:42] but the query is not failing, only slow
[08:32:28] do you see it failing?
[08:33:29] No, only being very slow
[08:35:35] so yes, setting it to 0 may (only may) make it faster, but at the cost of non-api traffic
[08:37:02] the API queries I see only take 3-4 seconds but they are non-trivial: IN + hundreds of items
[08:37:59] jynus: Look at the hhvm.log on fluorine, it has slow queries
[08:38:09] Do we still have a dbperformance log?
[08:38:26] I am looking at tendril logging
[08:39:17] right now the main issue is some timeouts
[08:39:51] we have some hardware issues
[08:40:18] we will try to overcome that, too
[08:41:43] Hardware issues as in degraded performance or as in not enough suitable hardware?
[08:43:19] both
[08:43:49] not as in degraded performance, as in server down
[08:55:00] the main issue, which is SpecialWhatLinksHere::showIndirectLinks, will hopefully be solved today
[08:55:27] other queries will be solved as soon as we can test them properly
[09:02:13] and feel free to help me with T101502#1471866!
[09:04:44] (PS1) ArielGlenn: dumps: move wikidatawiki, commonswiki to big wikis list [puppet] - https://gerrit.wikimedia.org/r/229087
[09:06:46] operations, Tracking: staged dumps tracking task - https://phabricator.wikimedia.org/T107757#1505780 (Aklapper)
[09:07:28] (CR) ArielGlenn: [C: 2] dumps: move wikidatawiki, commonswiki to big wikis list [puppet] - https://gerrit.wikimedia.org/r/229087 (owner: ArielGlenn)
[09:17:17] operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1505790 (ArielGlenn)
[09:17:19] operations: move some wikis from small to big dumps config - https://phabricator.wikimedia.org/T107767#1505788 (ArielGlenn) Open>Resolved moved commonswiki and wikidatawiki for now, this should improve things a lot. we'll see on next run if there are a couple more that could benefit from being moved. h...
[09:27:26] operations, Continuous-Integration-Infrastructure, Multimedia, Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1505843 (fgiunchedi) @brion @hashar I've uploaded ffmpeg `7:2.7.2-1~wmf1` to `trusty-wikimedia`...
[09:30:08] !log rolling schema change on image table to all wikis
[09:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:31:40] jynus: Talking about schema changes... do you know what the state of updating wikidatawiki wb_items_per_site is?
[09:32:23] springle is taking care of it; I didn't see it applied, I don't know the details
[09:32:44] Ok, I'll poke him, then
[09:33:07] can you remind me the ticket?
[09:33:16] maybe I can check again
[09:33:39] https://phabricator.wikimedia.org/T99459 / https://gerrit.wikimedia.org/r/228756
[09:40:42] hoo, it is not finished. I do not see it ongoing on the master, but it could be being applied on the slaves first and it is a large table. Do not have all the details (too many methods to perform that)
[09:45:11] operations, Continuous-Integration-Infrastructure, Multimedia, Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1505869 (fgiunchedi) also `ffmpeg2theora` has been rebuilt with ffmpeg and uploaded to `trusty-w...
[09:49:11] (PS1) Faidon Liambotis: pybal: add Service, check_procs nrpe check [puppet] - https://gerrit.wikimedia.org/r/229092
[09:51:09] hoo, as you are always very helpful with dbs, feel free to hang out in #wikimedia-databases. It is how Sean and I communicate and the best place to ping us
[09:51:23] operations: generate command lists for dump scheduler - https://phabricator.wikimedia.org/T107860#1505873 (ArielGlenn) NEW a:ArielGlenn
[09:51:24] (PS1) Faidon Liambotis: admin: remove gage from ops, ensure => absent [puppet] - https://gerrit.wikimedia.org/r/229093
[09:51:40] Nice... I will :)
[09:51:42] operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1505881 (ArielGlenn)
[09:51:43] operations: generate command lists for dump scheduler - https://phabricator.wikimedia.org/T107860#1505882 (ArielGlenn)
[09:53:01] (CR) Faidon Liambotis: [C: 2] admin: remove gage from ops, ensure => absent [puppet] - https://gerrit.wikimedia.org/r/229093 (owner: Faidon Liambotis)
[09:58:32] (CR) Alexandros Kosiaris: [C: 1] pybal: add Service, check_procs nrpe check [puppet] - https://gerrit.wikimedia.org/r/229092 (owner: Faidon Liambotis)
[10:00:32] (PS1) Filippo Giunchedi: cassandra: add restbase1009 [puppet] - https://gerrit.wikimedia.org/r/229095 (https://phabricator.wikimedia.org/T102015)
[10:01:22] (PS3) Alexandros Kosiaris: Reorder bacula keypair key/certificate [puppet] - https://gerrit.wikimedia.org/r/219847
[10:01:58] akosiaris: it's broken
[10:02:24] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase1009 [puppet] - https://gerrit.wikimedia.org/r/229095 (https://phabricator.wikimedia.org/T102015) (owner: Filippo Giunchedi)
[10:02:36] (CR) Faidon Liambotis: [C: -1] Reorder bacula keypair key/certificate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/219847 (owner: Alexandros Kosiaris)
[10:03:51] (PS2) Faidon Liambotis: Disable connection tracking for pool counters [puppet] - https://gerrit.wikimedia.org/r/224577 (owner: Muehlenhoff)
[10:04:02] (PS3) Faidon Liambotis: Disable connection tracking for poolcounter [puppet] - https://gerrit.wikimedia.org/r/224577 (owner: Muehlenhoff)
[10:04:16] PROBLEM - puppet last run on rdb1001 is CRITICAL: Puppet has 1 failures
[10:04:16] (CR) Faidon Liambotis: [C: 2] Disable connection tracking for poolcounter [puppet] - https://gerrit.wikimedia.org/r/224577 (owner: Muehlenhoff)
[10:05:26] (PS4) Alexandros Kosiaris: Reorder bacula keypair key/certificate [puppet] - https://gerrit.wikimedia.org/r/219847
[10:06:12] (CR) Faidon Liambotis: [C: 1] "Weird issue, but if you say so :)" [puppet] - https://gerrit.wikimedia.org/r/219847 (owner: Alexandros Kosiaris)
[10:06:33] paravoid: yeah, I know
[10:06:41] I was surprised as well :-(
[10:07:50] (PS5) Alexandros Kosiaris: Reorder bacula keypair key/certificate [puppet] - https://gerrit.wikimedia.org/r/219847
[10:07:57] (CR) Alexandros Kosiaris: [C: 2 V: 2] Reorder bacula keypair key/certificate [puppet] - https://gerrit.wikimedia.org/r/219847 (owner: Alexandros Kosiaris)
[10:08:46] PROBLEM - puppet last run on strontium is CRITICAL: Puppet has 1 failures
[10:12:06] RECOVERY - Restbase root url on restbase1009 is OK: HTTP OK: HTTP/1.1 200 - 15145 bytes in 0.014 second response time
[10:13:30] I am seeing some slowdown on the commons database due to the schema change. Please report if it is bearable or it is causing too many issues
[10:13:42] commons is highly slow atm
[10:15:41] Yay, s4 alert :P
[10:15:50] yes, I am going to stop it
[10:16:49] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1505913 (fgiunchedi) @mczusatz looks like only NASA_-_Earth_from_Orbit_2013.webm is able to get fully transcoded, if you'd like to test it yourself as well ge...
[10:17:10] things should be back to normal in some minutes
[10:21:46] !log enabling puppet on tin
[10:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:23:05] PROBLEM - puppet last run on tin is CRITICAL: Puppet last ran 1 day ago
[10:25:16] RECOVERY - puppet last run on tin is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:25:55] RECOVERY - torrus.wikimedia.org HTTP on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2166 bytes in 0.363 second response time
[10:27:37] !log bootstrap cassandra on restbase1009
[10:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:28:36] RECOVERY - puppet last run on rdb1001 is OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[10:30:42] operations, RESTBase-Cassandra, Services, Patch-For-Review, RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1505949 (fgiunchedi) >>! In T102015#1503798, @mobrovac wrote: >>>! In T102015#1503428, @Eevans wrote: >>>>! In T102015#1503250, @GWick...
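jynus stops the image-table ALTER after the s4 slowdown above and, further down in the log, switches to applying it host by host. A sketch of that rolling pattern; every helper here is a stand-in (hypothetical), not the actual WMF DBA tooling, and the DDL itself is elided:

```
# Rolling schema change sketch: one replica at a time, depooled while altered.
# All helpers are stand-ins, not real tooling; the ALTER is illustrative only.
import time

def depool(host):  # stand-in: would remove the host from db config rotation
    print("depooling", host)

def repool(host):  # stand-in: would add the host back
    print("repooling", host)

def run_sql(host, ddl):  # stand-in: would run the DDL against the host
    print("on", host, "run:", ddl)

def replication_lag(host):  # stand-in: would read Seconds_Behind_Master
    return 0

def rolling_alter(replicas, ddl):
    for host in replicas:
        depool(host)                      # no production traffic during the ALTER
        run_sql(host, ddl)
        while replication_lag(host) > 1:  # let it catch up before repooling
            time.sleep(10)
        repool(host)

rolling_alter(["db1064", "db1068"], "ALTER TABLE image ...")  # hosts from the log; DDL elided
```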
[10:31:37] (PS10) Hoo man: Add DCAT-AP for Wikibase [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: Lokal Profil)
[10:31:58] PROBLEM - puppet last run on iridium is CRITICAL: Puppet last ran 12 hours ago
[10:34:15] RECOVERY - puppet last run on iridium is OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:34:54] (PS11) Hoo man: Add DCAT-AP for Wikibase [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: Lokal Profil)
[10:35:45] RECOVERY - puppet last run on strontium is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:36:28] (CR) ArielGlenn: [C: 2] Add DCAT-AP for Wikibase [puppet] - https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: Lokal Profil)
[10:36:56] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: 1.67% of data above the critical threshold [1000.0]
[10:45:45] (PS1) Hoo man: Run DCAT-AP after creating Wikidata entity dumps [puppet] - https://gerrit.wikimedia.org/r/229103
[10:49:05] PROBLEM - Host db1035 is DOWN: PING CRITICAL - Packet loss = 100%
[10:54:27] (PS1) Jcrespo: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/229105
[10:54:48] (CR) Jcrespo: [C: 2] Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/229105 (owner: Jcrespo)
[10:54:54] (Merged) jenkins-bot: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/229105 (owner: Jcrespo)
[10:56:57] (PS2) Hoo man: Run DCAT-AP after creating Wikidata entity dumps [puppet] - https://gerrit.wikimedia.org/r/229103
[10:57:42] !log jynus Synchronized wmf-config/db-eqiad.php: Depool db1064 (duration: 00m 13s)
[10:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:58:35] (CR) ArielGlenn: [C: 2] Run DCAT-AP after creating Wikidata entity dumps [puppet] - https://gerrit.wikimedia.org/r/229103 (owner: Hoo man)
[11:01:10] !log upgrading junos on asw-a-codfw
[11:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:04:32] (PS2) Giuseppe Lavagetto: imagescalers: convert the last two servers to HAT [puppet] - https://gerrit.wikimedia.org/r/227967 (https://phabricator.wikimedia.org/T84842)
[11:05:17] <_joe_> going to do this ^^ right after lunch, the puppet change is just cosmetic
[11:05:38] (CR) Giuseppe Lavagetto: [C: 2] imagescalers: convert the last two servers to HAT [puppet] - https://gerrit.wikimedia.org/r/227967 (https://phabricator.wikimedia.org/T84842) (owner: Giuseppe Lavagetto)
[11:09:58] operations: syslog-ng and rsyslog jousting on lithium - https://phabricator.wikimedia.org/T107611#1506005 (fgiunchedi) it's audit time! ``` lithium:/var/log$ zgrep -c 'install rsyslog:' dpkg.log* dpkg.log:77 dpkg.log.1:848 dpkg.log.10.gz:5 dpkg.log.2.gz:1060 dpkg.log.3.gz:686 dpkg.log.4.gz:0 dpkg.log.5.gz:0...
[11:24:39] the huge amount of selects on the image table on commons created metadata locking on db1064 and db1068; automatic failover worked ok, but there was a general slowdown for some minutes
[11:26:19] will try to apply the change in a rolling way to minimize issues, although that may take days to apply
[11:37:49] jynus: may I ask what you do on the commons db (just curious! :-))
[11:37:50] ok... is stuff going down
[11:38:18] Steinsplitter, I am being vague because it is a security bug
[11:38:24] Commons is having some serious problems... :/
[11:38:27] ah, ok ;)
[11:39:28] Are we having a DDoS attack?... I keep getting "too many connections"
[11:39:59] I'm on it, trailing from the previous issue
[11:41:06] Josve05a: it is a db issue, I guess Wikimedia has good DDoS protection. not like some small wikis ;)
[11:41:13] should be ok now
[11:41:47] <_joe_> !log reimaging mw1159 to HAT
[11:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:41:55] HotCat, plus all gadgets, plus MediaViewer, plus ... are down...
[11:42:09] * Josve05a does not like this one bit
[11:42:30] <_joe_> jynus: I can assume this is all a db issue?
[11:42:37] PROBLEM - Host baham is DOWN: CRITICAL - Network Unreachable (208.80.153.13)
[11:42:37] PROBLEM - Host bast2001 is DOWN: CRITICAL - Network Unreachable (208.80.153.5)
[11:42:39] yes, _joe_
[11:42:46] PROBLEM - Host mw2065 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:46] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:58] PROBLEM - Host mw2073 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:58] PROBLEM - Host mw2046 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:58] PROBLEM - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:58] PROBLEM - Host mw2036 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:58] PROBLEM - Host mw2037 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:05] PROBLEM - Host mw2013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:06] PROBLEM - Host mw2053 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:06] PROBLEM - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:06] PROBLEM - Host mw2072 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:06] PROBLEM - Host mw2032 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:06] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:06] PROBLEM - Host mw2021 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:07] yay, Commons is back
[11:43:08] PROBLEM - Host mw2019 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:12] the schema change failed on commons, I am fixing the pieces
[11:43:15] PROBLEM - Host db2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:15] PROBLEM - Host db2011 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:15] PROBLEM - Host db2005 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:15] PROBLEM - Host db2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:15] PROBLEM - Host db2007 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:15] PROBLEM - Host db2004 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:16] PROBLEM - Host db2010 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:16] PROBLEM - Host mw2024 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:17] PROBLEM - Host mw2047 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:17] PROBLEM - Host mw2064 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:18] PROBLEM - Host mw2048 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:18] PROBLEM - Host mw2045 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:19] PROBLEM - Host mw2062 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:19] PROBLEM - Host mw2008 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:25] ^ that is not me
[11:43:35] PROBLEM - Host install2001 is DOWN: CRITICAL - Network Unreachable (208.80.153.4)
[11:43:36] PROBLEM - Host mw2070 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:36] PROBLEM - Host mw2018 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:36] PROBLEM - Host mw2012 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:36] PROBLEM - Host cp2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:36] PROBLEM - Host mw2051 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:41] Anyone pulled the plug on codfw? :P
[11:43:46] PROBLEM - Host mw2078 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:46] PROBLEM - Host mw2076 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:46] PROBLEM - Host mw2075 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:46] PROBLEM - Host mw2011 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:46] PROBLEM - Host mw2061 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:47] PROBLEM - Host mw2052 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:47] PROBLEM - Host mw2077 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:48] PROBLEM - Host mw2029 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:48] PROBLEM - Host mw2028 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:49] PROBLEM - Host mw2030 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:49] PROBLEM - Host db2012 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:50] <_joe_> yeah I was about to ask the same
[11:43:50] PROBLEM - Host heze is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:50] PROBLEM - Host es2005 is DOWN: PING CRITICAL - Packet loss = 100%
[11:43:51] PROBLEM - Host mw2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:06] PROBLEM - Host mw2057 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:06] PROBLEM - Host mw2068 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:06] PROBLEM - Host mw2010 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:06] PROBLEM - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:06] PROBLEM - Host mw2040 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:07] PROBLEM - Host mw2009 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:07] PROBLEM - Host cp2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:08] PROBLEM - Host mw2042 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:08] PROBLEM - Host mw2054 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:15] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Network Unreachable (208.80.153.14)
[11:44:16] PROBLEM - HHVM rendering on mw2209 is CRITICAL - Socket timeout after 10 seconds
[11:44:17] PROBLEM - Host mw2039 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:26] PROBLEM - Host mw2067 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:26] PROBLEM - Host mw2041 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:26] PROBLEM - Host mw2066 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:26] PROBLEM - Host cp2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:26] PROBLEM - Host ms-be2013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:27] PROBLEM - Host ms-be2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:27] PROBLEM - Host ms-be2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:27] PROBLEM - Host ms-be2004 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:27] PROBLEM - Host mc2005 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:28] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e
[11:44:28] PROBLEM - Host acamar is DOWN: CRITICAL - Network Unreachable (208.80.153.12)
[11:44:35] PROBLEM - HHVM rendering on mw2128 is CRITICAL - Socket timeout after 10 seconds
[11:44:36] PROBLEM - Host mw2025 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:36] PROBLEM - Host mw2050 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:46] PROBLEM - Host cp2005 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:46] PROBLEM - Host db2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:46] PROBLEM - Host mw2035 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:46] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:46] PROBLEM - Host cp2004 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:52] !log schema update on Commons failed, expect some minor instabilities until everything is fixed
[11:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:44:58] PROBLEM - Host mc2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:58] PROBLEM - Host ms-be2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:58] PROBLEM - Host suhail is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:58] PROBLEM - Host mw2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:58] PROBLEM - Host mc2006 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:58] PROBLEM - Host mc2003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:58] PROBLEM - Host mw2006 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:59] PROBLEM - Host rdb2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:59] PROBLEM - Host ms-fe2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:00] PROBLEM - Host ms-fe2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:02] PROBLEM - Host db2009 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:06] PROBLEM - Host rdb2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:17] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.153.15)
[11:45:26] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:208:80:153:12
[11:45:26] PROBLEM - HHVM rendering on mw2147 is CRITICAL - Socket timeout after 10 seconds
[11:45:39] <_joe_> killing icinga-wm
[11:45:56] PROBLEM - Host 208.80.153.12 is DOWN: CRITICAL - Network Unreachable (208.80.153.12)
[11:46:16] PROBLEM - HHVM rendering on mw2133 is CRITICAL - Socket timeout after 10 seconds
[11:46:37] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:46:37] I would like to push an extension update... will this be fixed soonish or don't we care?
[11:46:47] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6
[11:46:48] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:46:48] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:46:48] PROBLEM - IPsec on cp3012 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:46:48] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:46:48] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:46:48] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:46:49] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6
[11:46:49] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:07] PROBLEM - IPsec on cp3020 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6
[11:47:08] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:25] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:25] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:47:26] PROBLEM - IPsec on cp3019 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6
[11:47:26] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:47:35] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:36] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:36] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6
[11:47:36] PROBLEM - configured eth on lvs2006 is CRITICAL: eth1 reporting no carrier.
[11:47:37] PROBLEM - HHVM rendering on mw2141 is CRITICAL - Socket timeout after 10 seconds
[11:47:37] PROBLEM - configured eth on lvs2005 is CRITICAL: eth1 reporting no carrier.
[11:47:37] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6
[11:47:38] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:38] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:47:38] PROBLEM - HHVM rendering on mw2127 is CRITICAL - Socket timeout after 10 seconds
[11:47:38] PROBLEM - HHVM rendering on mw2148 is CRITICAL - Socket timeout after 10 seconds
[11:47:45] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6
[11:47:45] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6
[11:47:45] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:47:45] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:45] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:47:46] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:47:46] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6
[11:47:47] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:47] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:48] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:47:48] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6
[11:47:49] PROBLEM - Host ripe-atlas-codfw is DOWN: CRITICAL - Network Unreachable (208.80.152.244)
[11:47:56] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: host 208.80.153.193, interfaces up: 102, down: 2, dormant: 0, excluded: 0, unused: 0; ae1: down - Core: asw-a-codfw:ae2; et-0/0/0: down - asw-a-codfw:et-7/0/52 {#10706} [40Gbps Cu]
[11:47:57] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6
[11:47:57] PROBLEM - IPsec on cp3021 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6
[11:48:05] PROBLEM - configured eth on lvs2004 is CRITICAL: eth1 reporting no carrier.
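Floods like the codfw one above are easier to triage when grouped by check name. A small sketch that tallies PROBLEM lines per check from pasted log text, assuming the message format seen in this transcript:

```
# Tally an icinga alert flood by check name ("Host", "IPsec", "puppet last run"...).
# Assumes the "PROBLEM - <check> on <host> is ..." shape seen in this log.
import re
import sys
from collections import Counter

PATTERN = re.compile(r"PROBLEM - (.+?) is (?:DOWN|CRITICAL)")

def check_name(subject):
    # "IPsec on cp3049" -> "IPsec"; "Host mw2013" -> "Host"
    return subject.split(" on ")[0] if " on " in subject else subject.split()[0]

def summarize(text):
    return Counter(check_name(s) for s in PATTERN.findall(text))

if __name__ == "__main__":
    for check, n in summarize(sys.stdin.read()).most_common():
        print("%5d  %s" % (n, check))
```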
[11:48:06] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:48:06] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:48:06] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:48:06] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:48:06] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6
[11:48:13] <_joe_> ok, killing it
[11:48:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: host 208.80.153.192, interfaces up: 106, down: 2, dormant: 0, excluded: 0, unused: 0; ae1: down - Core: asw-a-codfw:ae1; et-0/0/0: down - asw-a-codfw:et-2/0/52 {#10702} [40Gbps Cu]
[11:48:52] <_joe_> !log killed ircecho to prevent further icinga spam
[11:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:54:06] RECOVERY - Host ripe-atlas-codfw is UP: PING OK - Packet loss = 0%, RTA = 43.17 ms
[11:54:06] PROBLEM - puppet last run on db2007 is CRITICAL: puppet fail
[11:54:15] PROBLEM - puppet last run on mw2064 is CRITICAL: puppet fail
[11:54:36] PROBLEM - puppet last run on mw2075 is CRITICAL: puppet fail
[11:54:36] PROBLEM - puppet last run on ms-be2004 is CRITICAL: puppet fail
[11:54:57] PROBLEM - puppet last run on mw2003 is CRITICAL: puppet fail
[11:54:57] PROBLEM - puppet last run on mw2067 is CRITICAL: puppet fail
[11:54:57] PROBLEM - puppet last run on mw2030 is CRITICAL: puppet fail
[11:55:06] PROBLEM - puppet last run on mw2047 is CRITICAL: puppet fail
[11:55:07] PROBLEM - puppet last run on mw2010 is CRITICAL: puppet fail
[11:55:07] PROBLEM - puppet last run on mw2004 is CRITICAL: puppet fail
[11:55:07] PROBLEM - puppet last run on cp2003 is CRITICAL: puppet fail
[11:55:16] PROBLEM - puppet last run on mw2079 is CRITICAL: puppet fail
[11:55:37] PROBLEM - puppet last run on mw2070 is CRITICAL: puppet fail
[11:55:38] PROBLEM - puppet last run on mw2062 is CRITICAL: puppet fail
[11:55:38] PROBLEM - puppet last run on mw2019 is CRITICAL: puppet fail
[11:55:38] PROBLEM - puppet last run on mw2044 is CRITICAL: Puppet has 18 failures
[11:56:07] PROBLEM - puppet last run on mw2055 is CRITICAL: puppet fail
[11:56:07] PROBLEM - puppet last run on mw2015 is CRITICAL: puppet fail
[11:56:07] PROBLEM - puppet last run on mw2049 is CRITICAL: puppet fail
[11:56:07] PROBLEM - puppet last run on mw2039 is CRITICAL: puppet fail
[11:56:07] PROBLEM - puppet last run on db2009 is CRITICAL: puppet fail
[11:56:16] PROBLEM - puppet last run on install2001 is CRITICAL: puppet fail
[11:56:16] PROBLEM - puppet last run on mw2056 is CRITICAL: puppet fail
[11:58:47] * Josve05a high-fives icinga-wm
[12:03:01] !log added pcre3_8.31-2ubuntu2.1+wm1 to trusty-wikimedia (reroll of security update with our JIT enablement patch)
[12:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:04:15] PROBLEM - puppet last run on mw2031 is CRITICAL: puppet fail
[12:04:55] PROBLEM - puppet last run on ms-be2003 is CRITICAL: puppet fail
[12:06:04] !log updated canary appservers mw1017/mw1018 to updated pcre3 + hhvm restart
[12:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:07:46] (CR) Thiemo Mättig (WMDE): [C: -1] Add config for Wikisource badges on Wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014) (owner: Aude)
[12:13:33] RECOVERY - puppet last run on mw2009 is OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[12:13:33] RECOVERY - puppet last run on ms-be2003 is OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[12:13:34] RECOVERY - puppet last run on mw2044 is OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[12:14:45] Ok to deploy?
[12:14:54] RECOVERY - puppet last run on mw2031 is OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[12:15:23] RECOVERY - puppet last run on mw2064 is OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[12:15:25] RECOVERY - puppet last run on db2009 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:15:33] PROBLEM - puppet last run on ms-be2002 is CRITICAL: puppet fail
[12:16:00] (CR) BBlack: [C: 1] pybal: add Service, check_procs nrpe check [puppet] - https://gerrit.wikimedia.org/r/229092 (owner: Faidon Liambotis)
[12:16:04] PROBLEM - puppet last run on mw2006 is CRITICAL: Puppet has 1 failures
[12:16:04] RECOVERY - puppet last run on mw2010 is OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[12:16:34] PROBLEM - puppet last run on ms-be2001 is CRITICAL: Puppet has 1 failures
[12:16:35] RECOVERY - puppet last run on mw2075 is OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[12:16:35] RECOVERY - puppet last run on ms-be2004 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:16:36] RECOVERY - puppet last run on mw2019 is OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[12:17:04] RECOVERY - puppet last run on mw2015 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:17:54] RECOVERY - puppet last run on mw2067 is OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[12:18:14] RECOVERY - puppet last run on mw2004 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:18:14] RECOVERY - puppet last run on cp2003 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:18:15] RECOVERY - puppet last run on mw2070 is OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[12:18:33] RECOVERY - puppet last run on db2007 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:18:34] RECOVERY - puppet last run on mw2056 is OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[12:19:54] RECOVERY - puppet last run on mw2062 is OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[12:20:03] RECOVERY - puppet last run on install2001 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:20:05] RECOVERY - puppet last run on mw2049 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:20:05] RECOVERY - puppet last run on mw2030 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:20:15] RECOVERY - puppet last run on mw2047 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:20:22] operations, Wikimedia-Mailing-lists, Pywikibot-General: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769#1506076 (JohnLewis) Not really unless an op feels fine risking the addition of non-mailman emails to the mbox and then...
an op feels fine risking the addition of non-mailman emails to the mbox and then... [12:20:23] RECOVERY - puppet last run on mw2039 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:20:25] (03PS1) 10BBlack: add eventdonations.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/229113 [12:20:34] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [12:20:44] (03CR) 10BBlack: [C: 032 V: 032] add eventdonations.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/229113 (owner: 10BBlack) [12:20:57] (03PS2) 10BBlack: pybal: add Service, check_procs nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/229092 (owner: 10Faidon Liambotis) [12:21:25] RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:26:43] PROBLEM - puppet last run on mw2032 is CRITICAL puppet fail [12:27:59] (03PS1) 10BBlack: fix annoying but non-fatal excess whitespace in cert [puppet] - 10https://gerrit.wikimedia.org/r/229114 [12:28:04] PROBLEM - puppet last run on ms-be2013 is CRITICAL puppet fail [12:28:05] (03CR) 10BBlack: [C: 032] pybal: add Service, check_procs nrpe check [puppet] - 10https://gerrit.wikimedia.org/r/229092 (owner: 10Faidon Liambotis) [12:28:19] (03PS2) 10BBlack: fix annoying but non-fatal excess whitespace in cert [puppet] - 10https://gerrit.wikimedia.org/r/229114 [12:28:25] (03CR) 10BBlack: [C: 032 V: 032] fix annoying but non-fatal excess whitespace in cert [puppet] - 10https://gerrit.wikimedia.org/r/229114 (owner: 10BBlack) [12:29:55] bblack: want to do https://gerrit.wikimedia.org/r/#/c/228800/ + https://gerrit.wikimedia.org/r/#/c/228801/ while you're on a roll? :P [12:30:15] sure [12:30:36] (03PS2) 10BBlack: lvs: remove {bits,text,upload,mobile}svc lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/228800 (owner: 10Faidon Liambotis) [12:31:11] _joe_: You might know that... ok to deploy now or still codfw fall out? [12:31:30] <_joe_> hoo: it is :) [12:31:42] Nice [12:31:49] (for now) [12:32:51] (03CR) 10BBlack: [C: 032] lvs: remove {bits,text,upload,mobile}svc lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/228800 (owner: 10Faidon Liambotis) [12:33:07] (03PS2) 10BBlack: Remove {bits,text,upload,mobile}.svc.$site.wmnet [dns] - 10https://gerrit.wikimedia.org/r/228801 (owner: 10Faidon Liambotis) [12:33:59] (03CR) 10BBlack: [C: 032] Remove {bits,text,upload,mobile}.svc.$site.wmnet [dns] - 10https://gerrit.wikimedia.org/r/228801 (owner: 10Faidon Liambotis) [12:35:34] PROBLEM - nutcracker port on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:35:40] 6operations: need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate - https://phabricator.wikimedia.org/T107760#1506091 (10ArielGlenn) this scheduler needs to have the following properties: very lightweight. no redis...
[12:35:44] PROBLEM - HHVM rendering on mw2043 is CRITICAL - Socket timeout after 10 seconds [12:35:45] PROBLEM - HHVM rendering on mw2014 is CRITICAL - Socket timeout after 10 seconds [12:35:45] PROBLEM - HHVM rendering on mw2060 is CRITICAL - Socket timeout after 10 seconds [12:35:46] PROBLEM - HHVM rendering on mw2052 is CRITICAL - Socket timeout after 10 seconds [12:35:54] PROBLEM - HHVM rendering on mw2017 is CRITICAL - Socket timeout after 10 seconds [12:35:54] PROBLEM - HHVM rendering on mw2029 is CRITICAL - Socket timeout after 10 seconds [12:35:54] PROBLEM - HHVM rendering on mw2066 is CRITICAL - Socket timeout after 10 seconds [12:35:54] PROBLEM - HHVM rendering on mw2028 is CRITICAL - Socket timeout after 10 seconds [12:35:55] PROBLEM - HHVM rendering on mw2015 is CRITICAL - Socket timeout after 10 seconds [12:35:55] PROBLEM - HHVM rendering on mw2025 is CRITICAL - Socket timeout after 10 seconds [12:35:55] PROBLEM - HHVM rendering on mw2034 is CRITICAL - Socket timeout after 10 seconds [12:35:55] PROBLEM - HHVM rendering on mw2046 is CRITICAL - Socket timeout after 10 seconds [12:35:55] PROBLEM - HHVM rendering on mw2076 is CRITICAL - Socket timeout after 10 seconds [12:35:56] PROBLEM - nutcracker process on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:35:56] PROBLEM - HHVM rendering on mw2156 is CRITICAL - Socket timeout after 10 seconds [12:36:24] PROBLEM - puppet last run on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:24] PROBLEM - DPKG on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:44] PROBLEM - Disk space on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:45] PROBLEM - salt-minion processes on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:37:06] PROBLEM - HHVM processes on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:37:45] RECOVERY - HHVM rendering on mw2043 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.979 second response time [12:37:45] RECOVERY - HHVM rendering on mw2014 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.345 second response time [12:37:45] RECOVERY - HHVM rendering on mw2060 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.544 second response time [12:37:46] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures [12:37:46] RECOVERY - HHVM rendering on mw2052 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.348 second response time [12:37:55] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.500 second response time [12:37:55] RECOVERY - HHVM rendering on mw2028 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.547 second response time [12:37:55] RECOVERY - HHVM rendering on mw2066 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 1.071 second response time [12:37:55] PROBLEM - RAID on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
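[aside: the repeated "CHECK_NRPE: Error - Could not complete SSL handshake." entries above are the stock error the Nagios NRPE client prints when it cannot negotiate with the nrpe daemon on the target host; here mw1159 was simply being reimaged, as the channel confirms a bit further down. The daemon can be probed by hand from the monitoring host. A minimal sketch, assuming the standard plugin path and the host's internal name; "check_disk_space" is an illustrative command name, not taken from the log:

    # Without -c, check_nrpe only asks the daemon for its version, which is
    # enough to distinguish a handshake failure from a broken individual check:
    /usr/lib/nagios/plugins/check_nrpe -H mw1159.eqiad.wmnet
    # Run one configured command remotely:
    /usr/lib/nagios/plugins/check_nrpe -H mw1159.eqiad.wmnet -c check_disk_space

While the daemon is down, every NRPE-backed service on the host fails at once, which is why nutcracker, puppet, DPKG, disk, salt-minion and HHVM checks for mw1159 all go CRITICAL together.]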
[12:37:55] RECOVERY - HHVM rendering on mw2029 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 1.817 second response time [12:37:56] RECOVERY - HHVM rendering on mw2015 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.303 second response time [12:37:56] RECOVERY - HHVM rendering on mw2034 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.300 second response time [12:37:56] RECOVERY - HHVM rendering on mw2025 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.302 second response time [12:37:56] RECOVERY - HHVM rendering on mw2046 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.552 second response time [12:37:57] PROBLEM - puppet last run on ms-fe2002 is CRITICAL Puppet has 1 failures [12:37:57] RECOVERY - HHVM rendering on mw2076 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.306 second response time [12:37:58] RECOVERY - HHVM rendering on mw2156 is OK: HTTP OK: HTTP/1.1 200 OK - 66625 bytes in 0.297 second response time [12:38:25] PROBLEM - configured eth on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:38:44] PROBLEM - dhclient process on mw1159 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:39:14] PROBLEM - puppet last run on baham is CRITICAL puppet fail [12:39:35] PROBLEM - puppet last run on db2005 is CRITICAL Puppet has 1 failures [12:39:35] PROBLEM - puppet last run on db2012 is CRITICAL Puppet has 1 failures [12:39:44] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 1 failures [12:40:41] (03PS2) 10Aude: Add config for Wikisource badges on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014) [12:41:05] PROBLEM - puppet last run on es2006 is CRITICAL puppet fail [12:41:06] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:41:24] (03PS2) 10BBlack: decom bits service IPs [dns] - 10https://gerrit.wikimedia.org/r/228029 (https://phabricator.wikimedia.org/T95448) [12:41:26] PROBLEM - puppet last run on bast2001 is CRITICAL puppet fail [12:41:45] PROBLEM - puppet last run on mw2007 is CRITICAL Puppet has 1 failures [12:42:05] PROBLEM - puppet last run on mw2040 is CRITICAL Puppet has 2 failures [12:43:14] PROBLEM - puppet last run on mw2063 is CRITICAL Puppet has 1 failures [12:43:14] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 2 failures [12:43:15] PROBLEM - puppet last run on mw2078 is CRITICAL Puppet has 2 failures [12:43:25] PROBLEM - Disk space on ms-be2009 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdf1 is not accessible: Input/output error [12:43:26] RECOVERY - puppet last run on mw2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:43:30] (03PS2) 10BBlack: Remove cache::bits roles from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) [12:43:45] PROBLEM - RAID on ms-be2009 is CRITICAL 1 failed LD(s) (Offline) [12:43:55] RECOVERY - puppet last run on ms-be2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:45:07] * hoo rages [12:45:44] PROBLEM - puppet last run on mw2065 is CRITICAL puppet fail [12:46:27] People merging things and then not deploying them [12:46:34] PROBLEM - puppet last run on mw2041 is CRITICAL puppet fail [12:46:54] PROBLEM - puppet last run on mw2005 is CRITICAL Puppet has 1 failures [12:47:45] RECOVERY - HHVM processes on mw1159 is OK: PROCS OK: 6 processes with command name hhvm [12:48:15] RECOVERY - nutcracker port on
mw1159 is OK: TCP OK - 0.000 second response time on port 11212 [12:48:25] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:48:26] RECOVERY - RAID on mw1159 is OK no RAID installed [12:48:36] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:48:45] RECOVERY - nutcracker process on mw1159 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [12:49:05] RECOVERY - configured eth on mw1159 is OK - interfaces up [12:49:06] RECOVERY - DPKG on mw1159 is OK: All packages OK [12:49:25] RECOVERY - dhclient process on mw1159 is OK: PROCS OK: 0 processes with command name dhclient [12:49:35] RECOVERY - salt-minion processes on mw1159 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:49:35] RECOVERY - Disk space on mw1159 is OK: DISK OK [12:50:01] I'm extremely unhappy about people messing stuff up on tin [12:50:09] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1506117 (10mobrovac) [12:50:13] (03PS3) 10BBlack: Remove cache::bits role from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) [12:50:13] that's only ever acceptable for security reasons [12:50:13] (03PS2) 10BBlack: Decom bits cluster varnish/lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/228034 (https://phabricator.wikimedia.org/T95448) [12:50:24] !log hoo Synchronized php-1.26wmf16/extensions/Wikidata/: Update Wikibase: Fixes for JSON dump creation (duration: 00m 39s) [12:50:24] s/reasons/patches [12:50:27] 6operations, 10ops-eqiad: db1059 raid degraded - https://phabricator.wikimedia.org/T107024#1506118 (10Cmjohnson) New disk has been ordered. Should be here tomorrow Congratulations: Work Order SR914778608 was successfully submitted. [12:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:50:46] hoo, wasn't me, right? [12:51:02] No, I think yesterday evening's SWAT messed those [12:51:15] RECOVERY - puppet last run on mw1159 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:51:18] ok, ok, because I did a couple of deployments this morning [12:52:48] The problem is messed-up submodule states [12:53:05] RECOVERY - puppet last run on ms-be2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:53:10] Whoever needs to update one of these extensions next will have to figure that out [12:53:14] PROBLEM - puppet last run on ms-be2009 is CRITICAL Puppet has 1 failures [12:53:28] probably not going to happen, the branch will only be used two more days [12:55:17] !log Syncing to mw1160 failed (Host key verification failed.) [12:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:33] Probably reimaged... _joe_ ^ [12:55:47] <_joe_> hoo: yes, it is reimaging now [12:55:55] akosiaris, around? I will be online in half an hour, can we work on finishing up the kartotherian stuff?
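[aside: the "messed submodule states" hoo describes are extension submodules in the deployment checkout pointing at commits other than the ones the wmf branch records. A quick way to spot them, sketched under the assumption of a plain git checkout of the branch on the deployment host (the staging path is illustrative):

    cd /srv/mediawiki-staging/php-1.26wmf16
    git status --short                 # dirty submodules appear as " M extensions/<name>"
    git submodule status | grep '^+'   # '+' marks a submodule checked out at a commit
                                       # other than the one the superproject records

]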
[12:55:56] PROBLEM - puppet last run on mw2053 is CRITICAL Puppet has 1 failures [12:56:35] PROBLEM - puppet last run on mw2046 is CRITICAL puppet fail [12:56:53] CCing jynus [12:58:04] PROBLEM - puppet last run on es2002 is CRITICAL puppet fail [12:58:04] PROBLEM - puppet last run on mc2004 is CRITICAL puppet fail [12:58:55] PROBLEM - puppet last run on mw2058 is CRITICAL Puppet has 1 failures [12:59:05] PROBLEM - puppet last run on heze is CRITICAL puppet fail [12:59:15] PROBLEM - puppet last run on mw2069 is CRITICAL puppet fail [12:59:55] PROBLEM - HHVM rendering on mw2023 is CRITICAL - Socket timeout after 10 seconds [12:59:56] PROBLEM - HHVM rendering on mw2061 is CRITICAL - Socket timeout after 10 seconds [12:59:56] PROBLEM - HHVM rendering on mw2047 is CRITICAL - Socket timeout after 10 seconds [12:59:56] PROBLEM - HHVM rendering on mw2055 is CRITICAL - Socket timeout after 10 seconds [12:59:56] PROBLEM - HHVM rendering on mw2069 is CRITICAL - Socket timeout after 10 seconds [13:00:04] PROBLEM - puppet last run on mw2016 is CRITICAL puppet fail [13:00:15] PROBLEM - HHVM rendering on mw2070 is CRITICAL - Socket timeout after 10 seconds [13:00:15] PROBLEM - HHVM rendering on mw2071 is CRITICAL - Socket timeout after 10 seconds [13:00:15] PROBLEM - HHVM rendering on mw2077 is CRITICAL - Socket timeout after 10 seconds [13:00:26] PROBLEM - puppet last run on mw2033 is CRITICAL Puppet has 1 failures [13:00:59] wth are those? [13:01:12] RECOVERY - HHVM rendering on mw2071 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.598 second response time [13:01:12] RECOVERY - HHVM rendering on mw2077 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.638 second response time [13:01:23] RECOVERY - HHVM rendering on mw2055 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.346 second response time [13:01:48] <_joe_> paravoid: looking [13:01:53] RECOVERY - Disk space on ms-be2009 is OK: DISK OK [13:02:12] RECOVERY - HHVM rendering on mw2061 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.601 second response time [13:02:21] RECOVERY - HHVM rendering on mw2070 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.339 second response time [13:03:12] RECOVERY - HHVM rendering on mw2069 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.596 second response time [13:03:22] RECOVERY - puppet last run on mw2033 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:03:42] RECOVERY - HHVM rendering on mw2047 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.340 second response time [13:03:43] PROBLEM - puppet last run on mw2057 is CRITICAL Puppet has 1 failures [13:03:52] RECOVERY - puppet last run on db2012 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:03:52] RECOVERY - HHVM rendering on mw2023 is OK: HTTP OK: HTTP/1.1 200 OK - 66620 bytes in 0.334 second response time [13:04:03] PROBLEM - puppet last run on mw2052 is CRITICAL Puppet has 1 failures [13:04:21] PROBLEM - puppet last run on mw2037 is CRITICAL puppet fail [13:04:22] RECOVERY - puppet last run on db2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:04:50] !log labstore1001 rebooting (possibly a couple of times) during tests and reinstallation [13:04:52] PROBLEM - pybal on lvs1003 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:04:52] PROBLEM - pybal on lvs3004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:04:52] RECOVERY - puppet last run on ms-fe2002 is OK Puppet is currently enabled, last run 1 minute ago 
with 0 failures [13:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:05:02] RECOVERY - puppet last run on baham is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:05:10] <_joe_> paravoid: ok on the servers directly there is no sign of the kind of error that would trigger those problems - all request from nagios resulted in a 200 OK [13:05:31] PROBLEM - puppet last run on suhail is CRITICAL puppet fail [13:05:32] PROBLEM - pybal on lvs1005 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:05:33] PROBLEM - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:05:51] no there isn't (pybal procs) [13:05:52] PROBLEM - pybal on lvs3003 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:06:15] it's probably seeing itself [13:06:21] PROBLEM - pybal on lvs1006 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:06:22] PROBLEM - puppet last run on mw2051 is CRITICAL puppet fail [13:06:23] PROBLEM - pybal on lvs3002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:06:32] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:06:40] maybe missing ^ ? [13:06:43] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:06:52] PROBLEM - puppet last run on mw2072 is CRITICAL Puppet has 1 failures [13:07:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 106, down: 2, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BRet-0/0/0: down - asw-a-codfw:et-2/0/52 {#10702} [40Gbps Cu]BR [13:07:03] RECOVERY - puppet last run on mw2040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:07:04] jynus, yes, missing ^^ [13:07:12] RECOVERY - puppet last run on mw2007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:07:12] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:12] PROBLEM - pybal on lvs4004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:07:23] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:07:31] RECOVERY - puppet last run on mw2063 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:07:32] PROBLEM - puppet last run on mw2061 is CRITICAL Puppet has 1 failures [13:07:32] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:07:42] RECOVERY - puppet last run on mw2078 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:08:02] PROBLEM - pybal on lvs1001 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:08:02] PROBLEM - pybal on lvs1002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:08:02] PROBLEM - pybal on lvs4002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:08:02] * yurik_ now thinks jynus was not replying to his msg, and will come back when its quieter [13:08:13] RECOVERY - puppet last run on es2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:08:13] PROBLEM - pybal on lvs1004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:08:15] yurik_, not a good day today [13:08:17] ignore the pybal messages, it's just a bad check or something [13:08:21] PROBLEM - pybal 
on lvs4003 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/sbin/pybal [13:08:22] RECOVERY - puppet last run on bast2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:10:58] (03PS1) 10BBlack: filter pybal process check on root uid [puppet] - 10https://gerrit.wikimedia.org/r/229118 [13:11:21] (03CR) 10BBlack: [C: 032 V: 032] filter pybal process check on root uid [puppet] - 10https://gerrit.wikimedia.org/r/229118 (owner: 10BBlack) [13:11:45] bblack: thanks... [13:12:12] RECOVERY - puppet last run on mw2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:12:22] RECOVERY - puppet last run on mw2065 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [13:13:02] RECOVERY - puppet last run on mw2041 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:14:22] RECOVERY - pybal on lvs1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:14:32] RECOVERY - pybal on lvs3003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:15:02] RECOVERY - pybal on lvs3002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:15:22] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:15:32] RECOVERY - pybal on lvs1003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:15:32] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [13:15:51] RECOVERY - pybal on lvs4004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:16:21] RECOVERY - pybal on lvs1005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:16:52] RECOVERY - pybal on lvs1004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:17:02] RECOVERY - pybal on lvs1006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:17:43] RECOVERY - pybal on lvs3004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:18:22] RECOVERY - pybal on lvs4001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:18:42] RECOVERY - pybal on lvs1001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:18:42] RECOVERY - pybal on lvs4002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:19:02] RECOVERY - pybal on lvs4003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [13:21:15] !log rebooting asw-a-codfw, member 2 [13:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:21:32] RECOVERY - puppet last run on mw2046 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:22:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 106, down: 2, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BRet-0/0/0: down - asw-a-codfw:et-2/0/52 {#10702} [40Gbps Cu]BR [13:22:13] RECOVERY - puppet last run on mw2053 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [13:23:22] PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 1 failures [13:23:33] RECOVERY - puppet last run on mw2058 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:23:54] RECOVERY - puppet last run on es2002 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:23:54] RECOVERY - puppet last run on mc2004 is OK Puppet is 
currently enabled, last run 24 seconds ago with 0 failures [13:24:12] PROBLEM - Host cp2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:42] PROBLEM - Host cp2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:53] RECOVERY - puppet last run on mw2057 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:25:02] PROBLEM - Host ms-fe2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:02] PROBLEM - Host mc2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:12] PROBLEM - Host cp2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:21] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:22] PROBLEM - Host lvs2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:22] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:22] PROBLEM - Host mc2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:22] RECOVERY - puppet last run on heze is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:22] RECOVERY - puppet last run on mw2069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:51] RECOVERY - Host cp2003 is UP: PING WARNING - Packet loss = 58%, RTA = 51.72 ms [13:26:52] RECOVERY - Host lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 52.12 ms [13:26:52] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 52.27 ms [13:26:52] RECOVERY - Host mc2002 is UP: PING OK - Packet loss = 0%, RTA = 53.51 ms [13:26:52] RECOVERY - Host mc2003 is UP: PING OK - Packet loss = 0%, RTA = 51.89 ms [13:26:52] RECOVERY - Host cp2002 is UP: PING OK - Packet loss = 0%, RTA = 52.40 ms [13:27:02] RECOVERY - Host lvs2003 is UP: PING OK - Packet loss = 0%, RTA = 51.78 ms [13:27:02] RECOVERY - Host cp2001 is UP: PING OK - Packet loss = 0%, RTA = 51.93 ms [13:27:12] RECOVERY - Host ms-fe2001 is UP: PING OK - Packet loss = 0%, RTA = 52.03 ms [13:27:32] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:41] RECOVERY - puppet last run on mw2052 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:29:12] RECOVERY - puppet last run on suhail is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [13:31:22] RECOVERY - puppet last run on mw2061 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:31:51] PROBLEM - puppet last run on cp2002 is CRITICAL puppet fail [13:32:21] RECOVERY - puppet last run on mw2037 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:32:21] RECOVERY - puppet last run on mw2051 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:32:51] RECOVERY - puppet last run on mw2072 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:33:02] PROBLEM - puppet last run on lvs2002 is CRITICAL puppet fail [13:35:48] (03PS2) 10BBlack: network::constants::all_networks(_lo)?
via flatten() [puppet] - 10https://gerrit.wikimedia.org/r/228586 [13:35:50] (03PS2) 10BBlack: VCL: remove fqdn comment line [puppet] - 10https://gerrit.wikimedia.org/r/228584 [13:35:52] (03PS2) 10BBlack: varnish: get rid of some pre-systemd cruft [puppet] - 10https://gerrit.wikimedia.org/r/228591 [13:35:54] (03PS2) 10BBlack: vhtcpd: /etc/init/varnishhtcpd.conf is long-gone now [puppet] - 10https://gerrit.wikimedia.org/r/228590 [13:35:56] (03PS2) 10BBlack: VCL: define vcl_config "layer" for parsoidcache [puppet] - 10https://gerrit.wikimedia.org/r/228589 [13:35:58] (03PS2) 10BBlack: VCL: remove unused probes "swift", "options" [puppet] - 10https://gerrit.wikimedia.org/r/228588 [13:36:00] (03PS1) 10BBlack: restrict_access: move to common code for all backends [puppet] - 10https://gerrit.wikimedia.org/r/229121 [13:36:02] (03PS1) 10BBlack: VCL: use network::constants::all_networks_lo [puppet] - 10https://gerrit.wikimedia.org/r/229122 [13:36:44] (03Abandoned) 10BBlack: VCL: remove restrict_access from text/upload backends [puppet] - 10https://gerrit.wikimedia.org/r/228585 (owner: 10BBlack) [13:37:18] (03CR) 10jenkins-bot: [V: 04-1] restrict_access: move to common code for all backends [puppet] - 10https://gerrit.wikimedia.org/r/229121 (owner: 10BBlack) [13:37:25] (03Abandoned) 10BBlack: VCL: use network::constants::all_networks_lo for ssl_proxies [puppet] - 10https://gerrit.wikimedia.org/r/228587 (owner: 10BBlack) [13:37:27] (03CR) 10jenkins-bot: [V: 04-1] VCL: use network::constants::all_networks_lo [puppet] - 10https://gerrit.wikimedia.org/r/229122 (owner: 10BBlack) [13:41:22] (03PS2) 10BBlack: VCL: use network::constants::all_networks_lo [puppet] - 10https://gerrit.wikimedia.org/r/229122 [13:41:24] (03PS3) 10BBlack: network::constants::all_networks(_lo)? via flatten() [puppet] - 10https://gerrit.wikimedia.org/r/228586 [13:41:26] (03PS2) 10BBlack: restrict_access: move to common code for all backends [puppet] - 10https://gerrit.wikimedia.org/r/229121 [13:41:29] (03PS3) 10BBlack: varnish: get rid of some pre-systemd cruft [puppet] - 10https://gerrit.wikimedia.org/r/228591 [13:45:16] <_joe_> !log repooling mw1159,mw1160 [13:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:52] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [13:46:55] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 8 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1506211 (10Joe) 5Open>3Resolved [13:47:09] <_joe_> about damn time.
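[aside: the flapping pybal alerts earlier ("PROCS CRITICAL: 2 processes with args /usr/sbin/pybal") were the freshly added check_procs NRPE check counting its own invocation alongside the real daemon; the 229118 change above fixes that by filtering on root's uid, and the recoveries accordingly read "1 process with UID = 0 (root)". A hypothetical NRPE command definition in that shape, for illustration only, since the log shows the check output but not the actual config:

    # e.g. /etc/nagios/nrpe.d/check_pybal.cfg (file name assumed)
    # -c 1:1  -> CRITICAL unless exactly one matching process
    # -u root -> ignore non-root matches, i.e. the monitoring process itself
    command[check_pybal]=/usr/lib/nagios/plugins/check_procs -c 1:1 -u root --ereg-argument-array '^/usr/sbin/pybal'

]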
[13:49:13] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:51:21] PROBLEM - dhclient process on labstore1001 is CRITICAL: Connection refused by host [13:51:31] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host [13:52:01] (03PS1) 10Dzahn: elasticsearch: set fixed port numbers [puppet] - 10https://gerrit.wikimedia.org/r/229127 (https://phabricator.wikimedia.org/T107278) [13:52:11] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host [13:52:11] PROBLEM - puppet last run on labstore1001 is CRITICAL: Connection refused by host [13:52:31] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host [13:52:32] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: Connection refused by host [13:53:02] (03PS1) 10Lokal Profil: Look for i18n using absolute path [puppet] - 10https://gerrit.wikimedia.org/r/229128 (https://phabricator.wikimedia.org/T103087) [13:53:11] PROBLEM - configured eth on labstore1001 is CRITICAL: Connection refused by host [13:53:56] (03CR) 10DCausse: [C: 031] elasticsearch: set fixed port numbers [puppet] - 10https://gerrit.wikimedia.org/r/229127 (https://phabricator.wikimedia.org/T107278) (owner: 10Dzahn) [13:54:12] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [13:54:27] _joe_: I just came by to say: thank you! [13:54:49] (03CR) 10Lokal Profil: "I could include an i18n update here as well but it might be cleaner to have that in a separate patch" [puppet] - 10https://gerrit.wikimedia.org/r/229128 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [13:55:02] <_joe_> matanya: don't thank me :) It took too long, and we're having some 503s on thumbs around now [13:55:14] RECOVERY - puppet last run on cp2002 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:55:42] still, your hard and dedicated work is highly appreciated [13:57:15] hmm, thumbnailing issue ? [13:57:16] https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Parodie.svg/171px-Parodie.svg.png [13:57:40] give blank result, there are old thumbs that are still accessible it seems [13:57:45] https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Parodie.svg/170px-Parodie.svg.png [13:58:09] (03CR) 10Hoo man: "Please do that in a separate patch, yes. Also remember updating your github repo, we don't want to go splitbrain here." [puppet] - 10https://gerrit.wikimedia.org/r/229128 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [13:58:43] RECOVERY - puppet last run on lvs2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:58:44] <_joe_> thedj: yes looks like there is some issue with thumbnailing, but I was seeing another issue tbh [14:00:04] _joe_: this might be totally unrelated. just figured i'd point it out, since it's such a recent report on VP/T [14:00:15] (03CR) 10Lokal Profil: "Thanks. Github updated" [puppet] - 10https://gerrit.wikimedia.org/r/229128 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [14:00:16] "there is no such thing as coincidence" (until there is) [14:02:00] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/229127 (https://phabricator.wikimedia.org/T107278) (owner: 10Dzahn) [14:02:37] dcausse, mutante: let's merge https://gerrit.wikimedia.org/r/229127 before the upcoming ES cluster restart? 
[14:03:02] moritzm: yep, good idea [14:03:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 106, down: 2, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BRet-0/0/0: down - asw-a-codfw:et-2/0/52 {#10702} [40Gbps Cu]BR [14:03:17] (03CR) 10ArielGlenn: [C: 032] Look for i18n using absolute path [puppet] - 10https://gerrit.wikimedia.org/r/229128 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [14:03:25] <_joe_> thedj: I'm pretty sure it's unrelated to me re-pooling two imagescalers 15 minutes after I've seen some problems. Not sure about anything else though [14:03:35] (03CR) 10ArielGlenn: [V: 032] Look for i18n using absolute path [puppet] - 10https://gerrit.wikimedia.org/r/229128 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [14:03:51] <_joe_> hoo: did your deploy earlier touch thumb.php by any chance? [14:04:11] PROBLEM - puppet last run on mw2024 is CRITICAL Puppet has 1 failures [14:04:14] no, I only synced the Wikidata directory [14:04:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [14:04:51] ACKNOWLEDGEMENT - DPKG on labstore1001 is CRITICAL: Connection refused by host daniel_zahn in maintenance period [14:04:51] ACKNOWLEDGEMENT - Disk space on labstore1001 is CRITICAL: Connection refused by host daniel_zahn in maintenance period [14:04:51] ACKNOWLEDGEMENT - NTP on labstore1001 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn in maintenance period [14:04:51] ACKNOWLEDGEMENT - RAID on labstore1001 is CRITICAL: Connection refused by host daniel_zahn in maintenance period [14:04:51] ACKNOWLEDGEMENT - configured eth on labstore1001 is CRITICAL: Connection refused by host daniel_zahn in maintenance period [14:04:51] ACKNOWLEDGEMENT - dhclient process on labstore1001 is CRITICAL: Connection refused by host daniel_zahn in maintenance period [14:04:51] ACKNOWLEDGEMENT - puppet last run on labstore1001 is CRITICAL: Connection refused by host daniel_zahn in maintenance period [14:04:52] ACKNOWLEDGEMENT - salt-minion processes on labstore1001 is CRITICAL: Connection refused by host daniel_zahn in maintenance period [14:05:13] I did a pull in the core branch, but the only things dirty were extensions [14:07:22] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:08:32] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms [14:14:39] 6operations, 10ops-ulsfo: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1506282 (10Dzahn) 3NEW [14:15:39] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1506289 (10BBlack) @ewilfong_WMF - cert/key info sent via email. Please respond there if there are any technical issues with accessing the data... [14:16:49] 6operations, 10ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1506291 (10Dzahn) [14:18:22] 6operations: Track systems/roles for which intentionally no firewall rules are applied - https://phabricator.wikimedia.org/T104958#1506294 (10MoritzMuehlenhoff) No firewall rules are planned for the LVS (lvs*) and the caches (cp*). These are performance-critical systems with a limited set of open services, where...
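[aside: the point of "elasticsearch: set fixed port numbers" (Gerrit 229127) is that Elasticsearch by default binds the first free port in a range (9200-9299 for HTTP, 9300-9399 for transport), so the actual port can drift between restarts and firewall rules cannot target it reliably. A minimal elasticsearch.yml sketch of the idea; the concrete values in the patch may differ:

    http.port: 9200            # pin, instead of the default 9200-9299 range
    transport.tcp.port: 9300   # pin, instead of the default 9300-9399 range

With both pinned, the ferm rules tracked in T107278 can reference fixed ports.]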
[14:23:25] !log upgrading junos on asw-a-codfw again [14:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:36] expect alert storm in ~1h from now :) [14:24:43] (03PS1) 10Dzahn: move grafana to graphite host [puppet] - 10https://gerrit.wikimedia.org/r/229132 [14:28:33] RECOVERY - puppet last run on mw2024 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:28:53] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [14:29:11] RECOVERY - RAID on labstore1001 is OK optimal, 12 logical, 12 physical [14:29:43] RECOVERY - Disk space on labstore1001 is OK: DISK OK [14:30:12] RECOVERY - DPKG on labstore1001 is OK: All packages OK [14:31:01] RECOVERY - configured eth on labstore1001 is OK - interfaces up [14:32:06] (03PS2) 10Dzahn: move grafana to graphite host [puppet] - 10https://gerrit.wikimedia.org/r/229132 (https://phabricator.wikimedia.org/T107832) [14:33:11] ori: ^ how about that [14:36:52] !log cr2-codfw upgrading SCBs [14:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:44] 6operations: request VM for grafana - https://phabricator.wikimedia.org/T107832#1506344 (10fgiunchedi) >>! In T107832#1505668, @faidon wrote: > It can just go to the graphite box, like Graphite itself/gdash/tessera etc., no? grafana v2 ships a server-side component as noted in https://phabricator.wikimedia.org/... [14:39:27] mutante: I'm more for a VM ^ [14:41:22] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [14:43:26] (03PS1) 10Mattflaschen: Disable Flow on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229133 [14:45:24] godog: ok, thanks, either works for me, i just want it gone from zirconium :) [14:45:48] (03PS2) 10Mattflaschen: Disable Flow on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229133 (https://phabricator.wikimedia.org/T107879) [14:46:38] (03CR) 10Alexandros Kosiaris: [C: 032] "Well, boxes that sit idle doing nothing productive consume energy and provide a target for attacks. 
In this case the box has a public IP m" [puppet] - 10https://gerrit.wikimedia.org/r/227997 (owner: 10Dzahn) [14:46:48] (03CR) 10Jonas Kress (WMDE): [C: 031] Add config for Wikisource badges on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229062 (https://phabricator.wikimedia.org/T97014) (owner: 10Aude) [14:46:52] PROBLEM - puppet last run on mw2005 is CRITICAL Puppet has 1 failures [14:48:11] PROBLEM - Apache HTTP on mw2028 is CRITICAL - Socket timeout after 10 seconds [14:48:12] PROBLEM - HHVM rendering on mw2017 is CRITICAL - Socket timeout after 10 seconds [14:48:12] PROBLEM - HHVM rendering on mw2072 is CRITICAL - Socket timeout after 10 seconds [14:48:12] PROBLEM - HHVM rendering on mw2207 is CRITICAL - Socket timeout after 10 seconds [14:48:12] PROBLEM - HHVM rendering on mw2078 is CRITICAL - Socket timeout after 10 seconds [14:48:12] PROBLEM - HHVM rendering on mw2053 is CRITICAL - Socket timeout after 10 seconds [14:48:12] PROBLEM - HHVM rendering on mw2012 is CRITICAL - Socket timeout after 10 seconds [14:48:12] PROBLEM - HHVM rendering on mw2214 is CRITICAL - Socket timeout after 10 seconds [14:48:22] PROBLEM - HHVM rendering on mw2029 is CRITICAL - Socket timeout after 10 seconds [14:48:59] (03PS1) 10ArielGlenn: dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 [14:49:17] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1506375 (10GWicke) @robh, can we agree on a timeline for this work? We don't want to annoy you or ourselves with unacknowledged icinga... [14:49:43] (03CR) 10jenkins-bot: [V: 04-1] dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 (owner: 10ArielGlenn) [14:50:13] PROBLEM - puppet last run on mw2067 is CRITICAL Puppet has 1 failures [14:50:13] RECOVERY - Apache HTTP on mw2028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.805 second response time [14:50:22] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 1.011 second response time [14:50:22] RECOVERY - HHVM rendering on mw2214 is OK: HTTP OK: HTTP/1.1 200 OK - 66759 bytes in 1.452 second response time [14:50:22] PROBLEM - dhclient process on mw2079 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:22] PROBLEM - configured eth on rdb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:22] PROBLEM - HHVM rendering on mw2034 is CRITICAL - Socket timeout after 10 seconds [14:50:22] PROBLEM - HHVM rendering on mw2025 is CRITICAL - Socket timeout after 10 seconds [14:50:22] PROBLEM - puppet last run on mw2018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:31] PROBLEM - HHVM rendering on mw2011 is CRITICAL - Socket timeout after 10 seconds [14:50:32] PROBLEM - swift-object-replicator on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:32] PROBLEM - swift-account-server on ms-be2013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:32] PROBLEM - DPKG on mw2011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:50:33] PROBLEM - puppet last run on mw2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:50:41] PROBLEM - HHVM rendering on mw2187 is CRITICAL - Socket timeout after 10 seconds [14:50:41] PROBLEM - HHVM rendering on mw2065 is CRITICAL - Socket timeout after 10 seconds [14:52:02] PROBLEM - puppet last run on mw2004 is CRITICAL Puppet has 1 failures [14:52:02] PROBLEM - puppet last run on mw2019 is CRITICAL Puppet has 1 failures [14:52:32] RECOVERY - dhclient process on mw2079 is OK: PROCS OK: 0 processes with command name dhclient [14:52:32] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 25 minutes ago with 0 failures [14:52:32] RECOVERY - configured eth on rdb2002 is OK - interfaces up [14:52:32] RECOVERY - HHVM rendering on mw2025 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.339 second response time [14:52:32] RECOVERY - HHVM rendering on mw2034 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.591 second response time [14:52:33] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.921 second response time [14:52:33] RECOVERY - HHVM rendering on mw2012 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.329 second response time [14:52:33] RECOVERY - HHVM rendering on mw2011 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.337 second response time [14:52:33] RECOVERY - HHVM rendering on mw2072 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.339 second response time [14:52:34] RECOVERY - HHVM rendering on mw2053 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.367 second response time [14:52:34] RECOVERY - HHVM rendering on mw2078 is OK: HTTP OK: HTTP/1.1 200 OK - 66757 bytes in 0.376 second response time [14:52:36] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1506382 (10RobH) I was not the blocker on this, but the entire ops team. I think the outcome from the meeting was you guys should hav... [14:52:41] RECOVERY - swift-account-server on ms-be2013 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:52:42] RECOVERY - swift-object-replicator on ms-be2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:52:42] RECOVERY - DPKG on mw2011 is OK: All packages OK [14:52:42] RECOVERY - puppet last run on mw2009 is OK Puppet is currently enabled, last run 8 minutes ago with 0 failures [14:52:42] RECOVERY - HHVM rendering on mw2029 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.334 second response time [14:52:43] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [14:52:51] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 66757 bytes in 0.283 second response time [14:52:51] RECOVERY - HHVM rendering on mw2065 is OK: HTTP OK: HTTP/1.1 200 OK - 66758 bytes in 0.339 second response time [14:54:02] PROBLEM - puppet last run on mw2027 is CRITICAL puppet fail [14:54:02] PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 1 failures [14:55:54] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1506383 (10GWicke) @robh: The clinic duty person can't be around 24 hours a day, which means that alerts related to work in SF will go... 
[14:56:12] PROBLEM - puppet last run on db2004 is CRITICAL Puppet has 1 failures [14:56:52] PROBLEM - puppet last run on db2001 is CRITICAL puppet fail [14:57:07] (03PS2) 10ArielGlenn: dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 [14:57:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 106, down: 2, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BRet-0/0/0: down - asw-a-codfw:et-2/0/52 {#10702} [40Gbps Cu]BR [14:57:23] PROBLEM - puppet last run on mw2053 is CRITICAL Puppet has 1 failures [14:57:31] dcausse: oh, hey, when is that restart scheduled? [14:57:52] (03CR) 10jenkins-bot: [V: 04-1] dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 (owner: 10ArielGlenn) [14:57:53] mutante: 5PM UTC [14:58:34] dcausse: so, should we just merge now? wondering if we need to restart a single one first or how to do it [14:58:52] PROBLEM - puppet last run on mw2046 is CRITICAL Puppet has 1 failures [14:58:57] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1506395 (10RobH) Are you guys having to do this when there is no op online (and when was this?) It'll take some work to split icing... [14:59:18] andrewbogott: ^ this is the icinga split [14:59:33] the basic request was denied, but im not on clinic duty so im letting you know about it [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150804T1500). Please do the needful. [15:00:04] James_F bd808: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:22] PROBLEM - puppet last run on mw2057 is CRITICAL Puppet has 1 failures [15:00:28] mutante: I guess it's safe to merge 5 min before the restart [15:00:52] PROBLEM - puppet last run on mw2020 is CRITICAL Puppet has 1 failures [15:01:01] dcausse: ok, so worst case it doesnt restart or something but we are in the scheduled window and reverting is easy [15:01:13] I can SWAT this morning— James_F|Away bd808 ping for morning SWAT [15:01:23] o/ [15:01:25] thcipriani, I added one too. [15:01:33] mutante: yep [15:02:13] (03CR) 10Sbisson: [C: 031] "I don't have +2 here :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229133 (https://phabricator.wikimedia.org/T107879) (owner: 10Mattflaschen) [15:02:17] matt_flaschen: kk—I see that now :) [15:02:39] Not sure why jouncebot didn't recognize it. [15:02:55] Heya. [15:03:02] asw-a-codfw is rebooting [15:03:09] do not be alarmed by the alert spam [15:03:13] James_F: hiya! [15:03:36] * greg-g waves [15:03:39] System going down in 30 seconds [15:03:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227329 (owner: 10Jforrester) [15:03:52] thcipriani: for gergo's patch you need to be careful to sync InitialiseSettings.php before logging.php or we will get a small error storm for undeclared variables [15:03:58] Woo. 
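[aside: bd808's ordering warning is the general rule for a config change split across files: sync the file that defines a new setting before the file that consumes it, so no appserver ever evaluates the consumer while the definition is still missing. A sketch from the deployment host; only the ordering comes from the log, the invocations themselves are illustrative:

    # 1. definitions first, so the new settings exist on every appserver
    sync-file wmf-config/InitialiseSettings.php 'Define the new authmetrics settings'
    # 2. only then the file that reads them
    sync-file wmf-config/logging.php 'Start consuming the new settings'

Syncing in the opposite order produces exactly the "small error storm for undeclared variables" he describes.]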
[15:04:01] this will reboot mw20* hosts [15:04:10] (03Merged) 10jenkins-bot: Enable VisualEditor for 10% of new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227329 (owner: 10Jforrester) [15:04:11] they would be soft errors but still annoying [15:04:18] well, not reboot them, make them unreachable for a moment [15:04:21] bd808: yup, saw that, I have notes from the last time I made that happen and freaked out :) [15:04:35] SWATers, beware [15:04:36] :) live and learn [15:04:46] thcipriani: should we wait while the codfw mw's are unreachable? [15:04:49] Is mediawiki-config +2 only given to deployers, or are they separate? [15:04:52] PROBLEM - Host mw2010 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:53] PROBLEM - Host mw2047 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:55] ah, I was wondering how a network failure would reboot mw [15:04:57] 6operations, 6Services, 7Icinga, 7Monitoring: create service/user groups in icinga - https://phabricator.wikimedia.org/T107884#1506402 (10RobH) 3NEW [15:05:03] PROBLEM - Host mw2063 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:03] PROBLEM - Host mw2040 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:03] PROBLEM - Host mw2056 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:03] PROBLEM - Host mw2013 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:03] PROBLEM - Host mw2067 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:03] PROBLEM - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:03] PROBLEM - Host mw2050 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:04] PROBLEM - Host ms-be2013 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:04] PROBLEM - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:12] PROBLEM - Host mw2029 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:13] PROBLEM - Host mw2015 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:13] PROBLEM - Host cp2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:13] PROBLEM - Host mw2023 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:13] PROBLEM - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:13] PROBLEM - Host mw2041 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:13] PROBLEM - Host mw2073 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:14] PROBLEM - Host mw2077 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:14] PROBLEM - Host mw2068 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:15] PROBLEM - Host mw2042 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:15] PROBLEM - Host mw2032 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:21] Urm [15:05:21] PROBLEM - Host mw2037 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:21] PROBLEM - Host mw2051 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:21] PROBLEM - Host mw2019 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:21] PROBLEM - Host mw2058 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:21] PROBLEM - Host mw2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:21] PROBLEM - Host mw2035 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:21] PROBLEM - Host mw2046 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:22] PROBLEM - Host mw2021 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:25] JohnFLewis: it's ok [15:05:28] IT'S ALL FINE, IGNORE THOSE [15:05:35] :) [15:05:37] :) [15:05:37] Okay :) [15:05:42] PROBLEM - Host mw2030 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:42] PROBLEM - Host mw2066 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:42] PROBLEM - Host mw2057 is DOWN: PING CRITICAL - Packet loss = 
100% [15:05:43] I notice there codfw anyway [15:05:43] PROBLEM - Host mw2061 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:51] PROBLEM - Host acamar is DOWN: CRITICAL - Network Unreachable (208.80.153.12) [15:05:52] PROBLEM - Host mw2062 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:52] PROBLEM - Host mw2072 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:52] PROBLEM - Host mw2054 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:52] PROBLEM - Host mw2045 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:52] PROBLEM - Host mw2070 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:52] PROBLEM - Host mw2044 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:52] PROBLEM - Host mw2055 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:59] well, swat should pause during it, shouldn't it? [15:06:02] greg-g: mayhaps, these patches should all be fairly quick, so we can wait for the restart [15:06:03] PROBLEM - Host mw2078 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:03] PROBLEM - Host mw2034 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:03] PROBLEM - Host mw2060 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:03] PROBLEM - Host mw2075 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:03] PROBLEM - Host mw2053 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:03] PROBLEM - Host mw2012 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:03] PROBLEM - Host mw2039 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:04] PROBLEM - Host mw2048 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:04] PROBLEM - Host mw2017 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:05] PROBLEM - Host mc2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:05] PROBLEM - Host cp2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:06] PROBLEM - Host ms-be2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:06] PROBLEM - Host mw2016 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:11] PROBLEM - Host mw2025 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:11] PROBLEM - Host mw2065 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:12] PROBLEM - Host mw2069 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:12] PROBLEM - Host mw2074 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:12] PROBLEM - Host mw2026 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:21] 10Ops-Access-Requests, 6operations, 6Services, 7Icinga, 7Monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#1439066 (10RobH) Sorry if it seems like I was arguing against services ever getting this, that isnt the case. I'm just asking how big... [15:06:34] PROBLEM - Host mw2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:41] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:42] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:51] PROBLEM - Host install2001 is DOWN: CRITICAL - Network Unreachable (208.80.153.4) [15:06:52] PROBLEM - Host labcontrol2001 is DOWN: CRITICAL - Network Unreachable (208.80.153.14) [15:06:52] PROBLEM - Host mw2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:52] matt_flaschen: it looks like +2 is for the wmf-deploy group, ops and gerrit admins -- https://gerrit.wikimedia.org/r/#/admin/projects/operations/mediawiki-config,access [15:06:53] PROBLEM - Host cp2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:53] PROBLEM - Host mw2038 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:59] thcipriani: I kinda wish paravoid would have waited 45 minutes or done it earlier, but... 
;) as long as scap won't care and we just sync to them when they come back online later, we can go ahead (I'm not sure about scap) [15:07:02] PROBLEM - HHVM rendering on mw2209 is CRITICAL - Socket timeout after 10 seconds [15:07:02] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.153.15) [15:07:03] PROBLEM - Host suhail is DOWN: PING CRITICAL - Packet loss = 100% [15:07:03] PROBLEM - Host db2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:03] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:03] PROBLEM - Host lvs2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:03] PROBLEM - Host rdb2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:03] PROBLEM - Host mc2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:12] PROBLEM - Host mw2036 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:21] PROBLEM - Host mw2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:22] PROBLEM - Host db2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:22] PROBLEM - Host db2010 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:22] PROBLEM - Host mw2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:22] PROBLEM - Host db2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:33] greg-g: I did it earlier, the new version I installed was broken, so I had to re-upgrade it... [15:07:55] greg-g: it also takes 40-60min to complete, it's hard to schedule [15:08:03] (see SAL) [15:08:13] for what it's worth, scap will soft error and then the train will catch everything up later [15:08:14] * greg-g just woke up [15:08:28] thcipriani: per above, go ahead [15:08:36] PROBLEM - configured eth on lvs2004 is CRITICAL: eth1 reporting no carrier. [15:08:36] sorry, didn't mean to waste time there [15:08:41] PROBLEM - Host ripe-atlas-codfw is DOWN: CRITICAL - Network Unreachable (208.80.152.244) [15:08:42] PROBLEM - configured eth on lvs2006 is CRITICAL: eth1 reporting no carrier. 
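[aside: with the codfw hosts dark during the asw-a-codfw reboot, the IPsec tunnels that the cp3xxx (esams) and cp4xxx (ulsfo) cache hosts keep toward their cp20xx peers fall back to the connecting state, which is what the Strongswan "ok: N connecting: cp2003_v4, cp2003_v6" alerts just below report. Checking by hand on a cache host with strongSwan's standard CLI, assuming connection names match the ones in the alert text:

    sudo ipsec status | grep -c ESTABLISHED   # tunnels actually up
    sudo ipsec status | grep cp2003           # codfw peers sit in CONNECTING until the
                                              # rebooted switch stack member returns

]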
[15:08:49] * greg-g continues to drink first cup of coffee [15:08:52] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:08:52] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:08:52] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:08:52] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:08:56] no problem, these are all in the mediawiki-installation group so I wasn't sure [15:09:01] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:02] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:02] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:09:02] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:07] thcipriani: yah, that was my thinking [15:09:12] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 67885 bytes in 9.094 second response time [15:09:23] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 66784 bytes in 3.901 second response time [15:09:23] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:23] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:09:23] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:09:23] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:23] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:23] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:24] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:09:24] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:25] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:09:25] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:09:26] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:09:33] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:09:33] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:33] PROBLEM - IPsec on cp3012 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:35] You'll end up with an N of M failed soft error message [15:09:41] PROBLEM - Router interfaces on cr2-codfw is CRITICAL host 208.80.153.193, interfaces up: 102, down: 2, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae2BRet-0/0/0: down - asw-a-codfw:et-7/0/52 {#10706} [40Gbps Cu]BR [15:09:52] PROBLEM - 
configured eth on lvs2005 is CRITICAL: eth1 reporting no carrier. [15:09:53] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:09:53] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:09:53] PROBLEM - IPsec on cp3022 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:10:02] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:02] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:10:02] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:02] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:03] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:03] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:12] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:13] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:13] PROBLEM - IPsec on cp3020 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:10:21] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:21] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:22] PROBLEM - IPsec on cp3019 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:10:22] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:29] * paravoid cries [15:10:32] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:10:32] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:32] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:32] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:32] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:10:33] PROBLEM - IPsec on cp3021 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2006_v4, cp2006_v6 [15:10:33] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:34] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:34] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2002_v4, cp2002_v6, cp2005_v4, cp2005_v6 [15:10:35] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp2003_v4, cp2003_v6 [15:10:35] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:36] PROBLEM 
- IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 28 connecting: cp2001_v4, cp2001_v6, cp2004_v4, cp2004_v6 [15:10:37] Isn't this fun? [15:10:44] hey at least the ipsec alerting works! \o/ [15:10:51] (I killed the bot) [15:11:15] how do we do that on eqiad, BTW? [15:11:25] do what? [15:11:42] upgrade the network [15:12:08] we generally don't [15:12:19] I'm hoping we can put this off until we can do a switchover to codfw [15:12:22] "carefully" I suppose is the answer [15:12:28] for at least production, it's still going to suck for e.g. Labs... [15:13:00] paravoid: not if we can fail labs over to codfw too ;) [15:13:01] for production there are different redundant systems too, so losing a switch stack has impact but is not catastrophic [15:13:27] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for 10% of new accounts on enwiki [[gerrit:227329]] (duration: 03m 13s) [15:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:45] ^ James_F sync'd ish (minus codfw) check please :) [15:13:53] James_F: ooo shiny [15:14:05] bd808: Shiny? [15:14:13] "neat" [15:14:16] VE rollout [15:14:28] Oh, bumping to 10%? Yeah. [15:14:31] Slowly does it. [15:14:35] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1506433 (10mpopov) The issue was with the SSH config & key stuff. Here's the stanza that fixed it: ``` Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet User thcipriani: Looks sane. [15:14:43] James_F: kk, thanks [15:14:48] bd808: you're up [15:15:01] default connect timeout is a little brutal for scap, evidently [15:15:32] thcipriani: so what we are looking for is a lack of logging errors. other than that not much to check [15:15:38] every time I see high enwiki db errors I panic [15:15:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227630 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [15:15:56] jynus: so like all the time? [15:15:59] then for the 5th time I realize that they are only on codfw [15:16:13] (today) [15:16:19] (03Merged) 10jenkins-bot: Add configuration for authmetrics logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227630 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [15:16:30] (on the specific switch) [15:18:26] jynus: Hopefully 'soon' codfw will be something worth worrying about. [15:18:42] and better [15:19:00] we will have to worry about those two at the same time [15:19:01] 10Ops-Access-Requests, 6operations: Requesting access to operations/mediawiki-config for Sbisson - https://phabricator.wikimedia.org/T107886#1506448 (10SBisson) 3NEW [15:19:31] because active-active rules [15:19:37] ^ hmm, is there some rule what having +2 means you should also have shell though [15:19:45] s/what/that [15:20:02] mutante: probably.... [15:20:26] mutante: I think it's more of a rule of thumb than something we strictly enforce in gerrit [15:20:59] 10Ops-Access-Requests, 6operations: Requesting access to operations/mediawiki-config for Sbisson - https://phabricator.wikimedia.org/T107886#1506459 (10Dzahn) I _think_ there is some rule that having +2 on the config repo means one should also have shell to be able to deploy and babysit changes. 
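The reason +2 on mediawiki-config is tied to shell access: merging only updates git, and the change does nothing until a deployer syncs it out from the deployment host. A minimal sketch of that sync step, assuming the staging layout on tin at the time (the path and file are illustrative, and the log message format follows the !log lines in this channel):

```bash
# On tin, bring the staging checkout up to the merged change:
cd /srv/mediawiki-staging
git pull

# Push one config file to all app servers; the message is recorded
# in the Server Admin Log, like the "!log ... Synchronized" lines here.
sync-file wmf-config/InitialiseSettings.php 'SWAT: Add configuration for authmetrics logging [[gerrit:227630]]'
```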
[15:21:15] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Add configuration for authmetrics logging (part I) [[gerrit:227630]] (duration: 03m 11s) [15:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:45] mutante: ah I see what you are pointing at now, yeah without shell that should be declined [15:21:47] bd808: right, but for mediawiki-config i assume we dont want people to +2 without deploying? [15:21:51] it would only cause problems [15:22:17] *nod* [15:22:29] mw-config merge without sync makes RoanKattouw very very crabby :) [15:22:39] and for good reason [15:22:39] (03PS1) 10Lokal Profil: Localisation updates from translatewiki.net [puppet] - 10https://gerrit.wikimedia.org/r/229136 [15:23:46] yea:) also handing out +2 on gerrit has usually not been an access request, though maybe it should be [15:24:50] !log thcipriani Synchronized wmf-config: SWAT: Add configuration for authmetrics logging (part II) [[gerrit:227630]] (duration: 02m 41s) [15:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:03] ^ bd808 that ought to do it [15:25:14] no spikes in fatalmonitor as of yet [15:25:44] thcipriani: yeah I'm not seeing anything scary in beta cluster logstash either [15:26:42] okie doke. matt_flaschen you're up. [15:27:08] Alright [15:27:11] 10Ops-Access-Requests, 6operations: Requesting access to operations/mediawiki-config for Sbisson - https://phabricator.wikimedia.org/T107886#1506478 (10bd808) @SBisson +2 in mediawiki-config is really only needed for cluster deployers. We really don't like it when changes are merged there without being synced... [15:27:37] (03PS3) 10Muehlenhoff: Add ferm rules for Hive server/metastore [puppet] - 10https://gerrit.wikimedia.org/r/228791 (https://phabricator.wikimedia.org/T83597) [15:28:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229133 (https://phabricator.wikimedia.org/T107879) (owner: 10Mattflaschen) [15:28:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for Hive server/metastore [puppet] - 10https://gerrit.wikimedia.org/r/228791 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [15:28:12] !log restarting db1064 for regular maintenance and upgrade given that it was depooled in the first place for a schema change [15:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:26] (03Merged) 10jenkins-bot: Disable Flow on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229133 (https://phabricator.wikimedia.org/T107879) (owner: 10Mattflaschen) [15:31:21] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 67885 bytes in 9.069 second response time [15:32:04] PROBLEM - Host cr2-codfw is DOWN: CRITICAL - Network Unreachable (208.80.153.193) [15:32:06] 10Ops-Access-Requests, 6operations: Requesting access to operations/mediawiki-config for Sbisson - https://phabricator.wikimedia.org/T107886#1506505 (10SBisson) I understand. I want to learn how to participate in SWAT. Do I need shell access for that or do I just have to be around to test my changes? [15:32:22] RECOVERY - HHVM rendering on mw2136 is OK: HTTP OK: HTTP/1.1 200 OK - 66784 bytes in 3.430 second response time [15:33:57] 10Ops-Access-Requests, 6operations: Requesting access to operations/mediawiki-config for Sbisson - https://phabricator.wikimedia.org/T107886#1506506 (10bd808) >>! In T107886#1506505, @SBisson wrote: > I understand.
I want to learn how to participate in SWAT. Do I need shell access for that or do I just have to... [15:34:00] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable Flow on ptwikibooks [[gerrit:229133]] (duration: 03m 40s) [15:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:07] ^ matt_flaschen check please [15:35:11] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [15:35:11] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures [15:35:11] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [15:35:33] 10Ops-Access-Requests, 6operations: Requesting access to operations/mediawiki-config for Sbisson - https://phabricator.wikimedia.org/T107886#1506509 (10Krenair) Suggest resolved declined unless this is for actual production deployment access (it seems to not be). wmf-deployment access in gerrit should not be g... [15:35:51] PROBLEM - HHVM rendering on mw2209 is CRITICAL - Socket timeout after 10 seconds [15:36:12] PROBLEM - HHVM rendering on mw2200 is CRITICAL - Socket timeout after 10 seconds [15:36:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [15:37:13] 10Ops-Access-Requests, 6operations: Requesting access to operations/mediawiki-config for Sbisson - https://phabricator.wikimedia.org/T107886#1506511 (10Andrew) 5Open>3Invalid a:3Andrew @SBisson I'm going to close this bug for now. If you decide that you want deployer rights (or shell rights) then go ah... [15:38:27] apergos, did you see https://phabricator.wikimedia.org/T107510 ? [15:38:29] thcipriani, fix confirmed. [15:38:31] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 66784 bytes in 9.117 second response time [15:38:37] matt_flaschen: thanks! [15:40:11] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [15:40:11] PROBLEM - check_puppetrun on payments2001 is CRITICAL Puppet has 1 failures [15:40:11] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [15:41:51] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1506524 (10akosiaris) A couple of notes: * The SCA cluster at this point has really minimal usage in pretty much all aspects (CPU, Memory, Disk Sp... 
[15:42:22] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 67616 bytes in 9.212 second response time [15:43:21] RECOVERY - puppet last run on db2046 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:43:32] PROBLEM - HHVM rendering on mw2131 is CRITICAL - Socket timeout after 10 seconds [15:45:12] PROBLEM - check_puppetrun on payments2003 is CRITICAL Puppet has 1 failures [15:45:12] RECOVERY - check_puppetrun on payments2001 is OK Puppet is currently enabled, last run 115 seconds ago with 0 failures [15:45:12] PROBLEM - check_puppetrun on payments2002 is CRITICAL Puppet has 1 failures [15:45:42] RECOVERY - HHVM rendering on mw2131 is OK: HTTP OK: HTTP/1.1 200 OK - 66784 bytes in 5.293 second response time [15:46:22] RECOVERY - puppet last run on mw2099 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [15:46:43] RECOVERY - puppet last run on mw2170 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:46:51] PROBLEM - HHVM rendering on mw2128 is CRITICAL - Socket timeout after 10 seconds [15:46:52] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:47:22] RECOVERY - puppet last run on mw2183 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:48:03] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:48:22] PROBLEM - HHVM rendering on mw2120 is CRITICAL - Socket timeout after 10 seconds [15:49:01] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 67616 bytes in 9.155 second response time [15:49:11] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:42] RECOVERY - IPsec on cp3022 is OK: Strongswan OK - 16 ESP OK [15:49:42] RECOVERY - Router interfaces on cr2-codfw is OK host 208.80.153.193, interfaces up: 112, down: 0, dormant: 0, excluded: 0, unused: 0 [15:49:43] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 42 ESP OK [15:49:51] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 16 ESP OK [15:49:51] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 32 ESP OK [15:49:51] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 42 ESP OK [15:49:51] RECOVERY - Host mw2052 is UP: PING OK - Packet loss = 0%, RTA = 52.03 ms [15:49:51] RECOVERY - Host es2002 is UP: PING OK - Packet loss = 0%, RTA = 51.81 ms [15:49:52] RECOVERY - Host mw2014 is UP: PING OK - Packet loss = 0%, RTA = 51.93 ms [15:49:52] RECOVERY - Host ms-be2001 is UP: PING OK - Packet loss = 0%, RTA = 51.72 ms [15:51:06] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 42 ESP OK [15:51:07] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 32 ESP OK [15:51:07] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 51.98 ms [15:51:08] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 67876 bytes in 0.543 second response time [15:51:13] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:22] RECOVERY - Host 2620:0:860:1:208:80:153:12 is UP: PING OK - Packet loss = 0%, RTA = 54.27 ms [15:51:24] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 16 ESP OK [15:51:24] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 16 ESP OK [15:51:24] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 32 ESP OK [15:51:24] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 32 ESP OK [15:51:31] RECOVERY - IPsec on cp3007 is OK:
Strongswan OK - 32 ESP OK [15:51:31] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 32 ESP OK [15:51:31] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 42 ESP OK [15:51:31] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 32 ESP OK [15:51:31] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 16 ESP OK [15:51:32] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 16 ESP OK [15:51:32] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 42 ESP OK [15:51:33] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:33] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:41] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 42 ESP OK [15:51:41] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 32 ESP OK [15:51:41] RECOVERY - IPsec on cp3012 is OK: Strongswan OK - 32 ESP OK [15:51:42] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:42] RECOVERY - puppet last run on mw2182 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:42] RECOVERY - puppet last run on mw2084 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:42] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:43] RECOVERY - puppet last run on db2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:52] RECOVERY - puppet last run on mw2176 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:52] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 42 ESP OK [15:51:52] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 32 ESP OK [15:52:01] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:01] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:14] PROBLEM - puppet last run on mw2076 is CRITICAL puppet fail [15:54:22] PROBLEM - puppet last run on db2009 is CRITICAL puppet fail [15:54:22] PROBLEM - puppet last run on db2005 is CRITICAL puppet fail [15:54:22] PROBLEM - puppet last run on ms-fe2001 is CRITICAL puppet fail [15:54:22] PROBLEM - puppet last run on mw2036 is CRITICAL puppet fail [15:54:23] PROBLEM - puppet last run on dbstore2001 is CRITICAL puppet fail [15:54:23] PROBLEM - puppet last run on mw2045 is CRITICAL puppet fail [15:54:23] PROBLEM - puppet last run on mw2041 is CRITICAL puppet fail [15:54:31] PROBLEM - puppet last run on cp2006 is CRITICAL puppet fail [15:54:32] PROBLEM - puppet last run on db2010 is CRITICAL puppet fail [15:54:32] PROBLEM - puppet last run on mc2005 is CRITICAL puppet fail [15:54:32] PROBLEM - puppet last run on rdb2002 is CRITICAL puppet fail [15:54:33] PROBLEM - puppet last run on mw2031 is CRITICAL puppet fail [15:54:41] PROBLEM - puppet last run on es2002 is CRITICAL Puppet has 1 failures [15:54:41] PROBLEM - puppet last run on ms-be2003 is CRITICAL puppet fail [15:54:41] PROBLEM - puppet last run on ms-be2002 is CRITICAL puppet fail [15:54:42] PROBLEM - puppet last run on mw2002 is CRITICAL puppet fail [15:54:42] PROBLEM - puppet last run on mw2072 is CRITICAL puppet fail [15:54:42] PROBLEM - puppet last run on mw2043 is CRITICAL puppet fail [15:54:42] PROBLEM - puppet last run on mw2018 is CRITICAL puppet fail [15:54:43] PROBLEM - puppet last run on mw2001 is CRITICAL puppet fail [15:54:43] PROBLEM - 
puppet last run on mw2014 is CRITICAL puppet fail [15:54:44] PROBLEM - puppet last run on lvs2002 is CRITICAL puppet fail [15:54:51] PROBLEM - puppet last run on mw2009 is CRITICAL puppet fail [15:54:52] PROBLEM - puppet last run on ms-fe2002 is CRITICAL puppet fail [15:54:52] PROBLEM - puppet last run on install2001 is CRITICAL puppet fail [15:55:00] 6operations, 7network: Upgrade switch fabrics in cr2-codfw - https://phabricator.wikimedia.org/T84775#1506569 (10faidon) This is now done. We followed the process outlined by Juniper, [[ http://www.juniper.net/techpubs/en_US/release-independent/junos/topics/task/installation/scb-mxseries-mx480-upgrading-opera... [15:55:02] PROBLEM - puppet last run on mw2040 is CRITICAL puppet fail [15:55:02] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail [15:55:03] PROBLEM - puppet last run on mw2069 is CRITICAL puppet fail [15:55:05] 6operations, 7network: Upgrade switch fabrics in cr2-codfw - https://phabricator.wikimedia.org/T84775#1506572 (10faidon) 5Open>3Resolved a:3faidon [15:55:11] PROBLEM - puppet last run on mw2029 is CRITICAL puppet fail [15:55:12] PROBLEM - puppet last run on mw2007 is CRITICAL puppet fail [15:55:12] PROBLEM - puppet last run on es2001 is CRITICAL puppet fail [15:55:12] PROBLEM - puppet last run on mw2022 is CRITICAL puppet fail [15:55:12] PROBLEM - puppet last run on mc2002 is CRITICAL puppet fail [15:55:12] PROBLEM - puppet last run on ms-be2004 is CRITICAL puppet fail [15:55:12] PROBLEM - puppet last run on es2007 is CRITICAL puppet fail [15:55:13] PROBLEM - puppet last run on mw2047 is CRITICAL puppet fail [15:55:13] PROBLEM - puppet last run on mw2077 is CRITICAL puppet fail [15:55:14] PROBLEM - puppet last run on baham is CRITICAL puppet fail [15:55:14] PROBLEM - puppet last run on mw2071 is CRITICAL puppet fail [15:55:15] PROBLEM - puppet last run on mw2013 is CRITICAL puppet fail [15:55:15] PROBLEM - puppet last run on mw2063 is CRITICAL puppet fail [15:55:16] PROBLEM - puppet last run on mw2003 is CRITICAL puppet fail [15:55:31] PROBLEM - puppet last run on mw2066 is CRITICAL puppet fail [15:55:31] PROBLEM - puppet last run on mw2073 is CRITICAL puppet fail [15:55:31] PROBLEM - puppet last run on mw2042 is CRITICAL puppet fail [15:55:31] PROBLEM - puppet last run on mw2015 is CRITICAL puppet fail [15:55:31] PROBLEM - puppet last run on mw2025 is CRITICAL puppet fail [15:55:32] PROBLEM - puppet last run on mw2059 is CRITICAL puppet fail [15:55:32] PROBLEM - puppet last run on suhail is CRITICAL puppet fail [15:55:41] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1506574 (10akosiaris) a:3akosiaris [15:55:46] 6operations, 7network: Upgrade switch fabrics in cr2-codfw - https://phabricator.wikimedia.org/T84775#931162 (10faidon) [15:55:51] RECOVERY - puppet last run on mw2057 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:51] PROBLEM - puppet last run on mw2021 is CRITICAL puppet fail [15:55:51] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail [15:55:51] PROBLEM - puppet last run on mw2044 is CRITICAL puppet fail [15:55:51] PROBLEM - puppet last run on mw2049 is CRITICAL puppet fail [15:55:52] PROBLEM - puppet last run on mw2024 is CRITICAL puppet fail [15:55:52] PROBLEM - puppet last run on cp2001 is CRITICAL puppet fail [15:55:54] 6operations, 10ops-codfw, 7network: Upgrade switch fabrics in cr2-codfw - https://phabricator.wikimedia.org/T84775#931162 
(10faidon) [15:56:03] PROBLEM - puppet last run on db2012 is CRITICAL puppet fail [15:56:11] PROBLEM - puppet last run on ms-be2001 is CRITICAL puppet fail [15:56:12] PROBLEM - puppet last run on labcontrol2001 is CRITICAL puppet fail [15:56:12] PROBLEM - puppet last run on mw2060 is CRITICAL puppet fail [15:56:12] PROBLEM - puppet last run on mw2017 is CRITICAL puppet fail [15:56:12] PROBLEM - puppet last run on mc2003 is CRITICAL puppet fail [15:56:12] PROBLEM - puppet last run on mw2065 is CRITICAL puppet fail [15:56:12] PROBLEM - puppet last run on mw2070 is CRITICAL puppet fail [15:56:13] PROBLEM - puppet last run on cp2005 is CRITICAL puppet fail [15:56:13] PROBLEM - puppet last run on cp2004 is CRITICAL puppet fail [15:56:14] PROBLEM - puppet last run on mw2055 is CRITICAL puppet fail [15:56:14] PROBLEM - puppet last run on es2006 is CRITICAL puppet fail [15:56:21] PROBLEM - puppet last run on mw2074 is CRITICAL puppet fail [15:56:21] PROBLEM - puppet last run on cp2003 is CRITICAL puppet fail [15:56:21] PROBLEM - puppet last run on mw2038 is CRITICAL puppet fail [15:56:22] PROBLEM - puppet last run on mw2052 is CRITICAL puppet fail [15:56:22] PROBLEM - puppet last run on mw2034 is CRITICAL puppet fail [15:56:22] PROBLEM - puppet last run on mw2012 is CRITICAL puppet fail [15:56:22] PROBLEM - puppet last run on bast2001 is CRITICAL puppet fail [15:56:23] PROBLEM - puppet last run on mw2028 is CRITICAL puppet fail [15:56:23] PROBLEM - puppet last run on mw2006 is CRITICAL puppet fail [15:56:24] PROBLEM - puppet last run on rdb2001 is CRITICAL puppet fail [15:56:24] PROBLEM - puppet last run on mw2035 is CRITICAL puppet fail [15:56:25] PROBLEM - puppet last run on mw2010 is CRITICAL puppet fail [15:56:25] PROBLEM - puppet last run on mw2037 is CRITICAL puppet fail [15:56:43] RECOVERY - puppet last run on mc2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:52] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:02] RECOVERY - puppet last run on lvs2002 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:57:42] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:57:43] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:02] RECOVERY - puppet last run on mw2021 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:03] RECOVERY - puppet last run on mw2024 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:58:11] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:32] RECOVERY - puppet last run on mw2052 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:58:52] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:12] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:44] RECOVERY - puppet last run on mw2077 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:52] RECOVERY - puppet last run on mw2061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:02] RECOVERY - puppet last run on suhail is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:02] RECOVERY - puppet last run on mw2035 
is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:02] RECOVERY - puppet last run on mw2037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:02] RECOVERY - puppet last run on mc2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:12] RECOVERY - puppet last run on cp2006 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:01:31] RECOVERY - puppet last run on mw2072 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:01:55] (03PS1) 10Muehlenhoff: Add ferm rules for Hadoop jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/229145 (https://phabricator.wikimedia.org/T83597) [16:02:02] RECOVERY - puppet last run on es2007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:12] RECOVERY - puppet last run on mw2054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:15] 6operations: syslog-ng and rsyslog jousting on lithium - https://phabricator.wikimedia.org/T107611#1506595 (10ori) a:3yuvipanda >>! In T107611#1506005, @fgiunchedi wrote: > so possibly introduced with {rOPUPfb088d95300e81b9c4c03e9492116f193ba035c0} Yep. [16:02:22] YuviPanda: ^ [16:03:01] RECOVERY - puppet last run on cp2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:02] RECOVERY - puppet last run on mw2038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:03] RECOVERY - puppet last run on mw2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:03:10] 6operations: syslog-ng and rsyslog jousting on lithium - https://phabricator.wikimedia.org/T107611#1506599 (10ori) [16:03:12] RECOVERY - puppet last run on mw2051 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:03:26] (03Abandoned) 10Dzahn: remove multatuli from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/227997 (owner: 10Dzahn) [16:04:12] RECOVERY - puppet last run on mw2071 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:04:28] (03CR) 10BryanDavis: [C: 032 V: 032] Add logstash-filter-prune 0.1.5 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/229073 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [16:04:53] RECOVERY - puppet last run on db2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:02] RECOVERY - puppet last run on mc2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:02] RECOVERY - puppet last run on cp2005 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:05:12] RECOVERY - puppet last run on mw2074 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:12] RECOVERY - puppet last run on mw2034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:18] (03PS1) 10Muehlenhoff: All running services are now ferm-enabled, so turn enable base::firewall on analytics1027. 
[puppet] - 10https://gerrit.wikimedia.org/r/229147 (https://phabricator.wikimedia.org/T83597) [16:05:21] RECOVERY - puppet last run on mw2028 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:31] RECOVERY - puppet last run on db2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:32] RECOVERY - puppet last run on dbstore2001 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:05:52] RECOVERY - puppet last run on mw2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:53] RECOVERY - puppet last run on ms-fe2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:06:22] RECOVERY - puppet last run on mw2007 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:06:22] RECOVERY - puppet last run on es2001 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:06:22] RECOVERY - puppet last run on baham is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:06:23] RECOVERY - puppet last run on mw2063 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:07:12] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:07:12] RECOVERY - puppet last run on mw2060 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:07:13] RECOVERY - puppet last run on es2006 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:07:13] (03CR) 10Dzahn: "ack, added bblack" [dns] - 10https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602) (owner: 10JanZerebecki) [16:07:30] (03CR) 10Ottomata: Add ferm rules for Hadoop jmxtrans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229145 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [16:07:31] PROBLEM - HHVM rendering on mw2009 is CRITICAL: Connection timed out [16:07:31] PROBLEM - HHVM rendering on mw2079 is CRITICAL: Connection timed out [16:07:31] PROBLEM - HHVM rendering on mw2023 is CRITICAL: Connection timed out [16:07:31] PROBLEM - HHVM rendering on mw2068 is CRITICAL: Connection timed out [16:07:42] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:07:43] RECOVERY - puppet last run on db2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:08:02] RECOVERY - puppet last run on mw2001 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:08:22] RECOVERY - puppet last run on mw2040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:08:23] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:08:30] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1506627 (10EWilfong_WMF) Thanks, @bblack. The cert is now in place. 
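The flood of "puppet last run" criticals followed by recoveries is the expected tail of the codfw switch work: agents that attempted a run while the network was out failed, then clear on their next scheduled run. For a host that does not recover by itself, the usual move is a manual foreground run; a sketch, with the host name chosen only as an example:

```bash
ssh mw2076.codfw.wmnet
# One-off foreground puppet run; prints what it applies and exits
# non-zero if the catalog fails, which keeps the icinga check honest.
sudo puppet agent --test
```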
[16:08:31] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:08:52] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:08:53] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1506628 (10EWilfong_WMF) 5Open>3Resolved [16:09:18] 6operations: syslog-ng and rsyslog jousting on lithium - https://phabricator.wikimedia.org/T107611#1506630 (10ori) [16:09:22] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:09:31] RECOVERY - HHVM rendering on mw2009 is OK: HTTP OK: HTTP/1.1 200 OK - 66791 bytes in 0.303 second response time [16:09:31] RECOVERY - HHVM rendering on mw2023 is OK: HTTP OK: HTTP/1.1 200 OK - 66791 bytes in 0.323 second response time [16:09:31] RECOVERY - HHVM rendering on mw2079 is OK: HTTP OK: HTTP/1.1 200 OK - 66791 bytes in 0.542 second response time [16:09:32] RECOVERY - HHVM rendering on mw2068 is OK: HTTP OK: HTTP/1.1 200 OK - 66791 bytes in 0.301 second response time [16:09:32] RECOVERY - puppet last run on bast2001 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:09:32] RECOVERY - puppet last run on rdb2001 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:10:13] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [16:10:41] RECOVERY - puppet last run on mw2029 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:10:52] RECOVERY - puppet last run on mw2078 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:10:54] RECOVERY - puppet last run on mw2025 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:11:21] RECOVERY - puppet last run on mw2005 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:11:24] (03PS2) 10Muehlenhoff: Add ferm rules for Hadoop jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/229145 (https://phabricator.wikimedia.org/T83597) [16:11:32] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:42] RECOVERY - puppet last run on mw2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:45] 6operations: syslog-ng and rsyslog jousting on lithium - https://phabricator.wikimedia.org/T107611#1506640 (10fgiunchedi) a:5yuvipanda>3fgiunchedi after looking closer it seems we're using `syslog-ng` only for collecting logs, I've converted the existing config to `rsyslog`, we might as well remove `syslog-n... [16:12:03] RECOVERY - puppet last run on mw2041 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:12:04] RECOVERY - puppet last run on rdb2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:34] bblack: is misc-web the right place for query.wikidata.org ? [16:12:43] RECOVERY - puppet last run on mc2002 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:12:43] per https://gerrit.wikimedia.org/r/#/c/228411/ [16:12:58] mutante: I really don't know. I know almost nothing about that project. 
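Background on the misc-web question that follows: misc-web is the varnish cluster fronting assorted low-traffic services, so putting query.wikidata.org behind it means a DNS record pointing at the misc-web load balancer plus a backend entry in the varnish config. Once such a change is live, the DNS side can be checked like this (the CNAME target shown is an assumption, modeled on how other misc services were wired up):

```bash
# A cache-fronted misc service should resolve into the misc-web LB:
dig +short query.wikidata.org
# expected output, roughly: misc-web-lb.eqiad.wikimedia.org. (assumed)
```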
[16:13:02] RECOVERY - puppet last run on mw2042 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:13:42] RECOVERY - puppet last run on mw2065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:13:48] jzerebecki: could you give us the exec summary ?:) [16:14:09] https://www.mediawiki.org/wiki/Wikidata_query_service#Use_cases [16:14:22] RECOVERY - puppet last run on ms-be2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:22] RECOVERY - puppet last run on ms-be2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:23] (03CR) 10Ori.livneh: [C: 031] "The Precise hosts still use TCP to connect to nutcracker. They are actively being migrated, but it will be some weeks before they're compl" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [16:14:32] RECOVERY - puppet last run on mw2009 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:14:47] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1506647 (10Ottomata) Oof, had some problems yesterday :( Incident documentation here: https://wikitech.wikimedia.or... [16:14:53] (03CR) 10Ori.livneh: "Oh, and what about redis on 6380?" [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [16:15:27] "in order for these projects to scale to production, as well as to provide a benefit to other Wikidata-related projects, we need to build a service that allows for simple and more complex queries of Wikidata items/properties." [16:15:31] RECOVERY - puppet last run on mw2044 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:16:12] RECOVERY - puppet last run on db2009 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:16:31] RECOVERY - puppet last run on mw2031 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:02] RECOVERY - puppet last run on ms-be2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:13] RECOVERY - puppet last run on mw2075 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:23] RECOVERY - puppet last run on mw2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:37] (03PS21) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [16:18:03] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:18:12] RECOVERY - puppet last run on mw2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:18:12] RECOVERY - puppet last run on mw2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:18:41] RECOVERY - puppet last run on mw2067 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:18:43] (03CR) 10Ori.livneh: [C: 04-1] "Since this patch was initially created, we have started using nutcracker to proxy Redis connections, too. 
So nutcracker needs to dial out " [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [16:18:52] RECOVERY - puppet last run on install2001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:18:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 106, down: 2, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BRet-0/0/0: down - asw-a-codfw:et-2/0/52 {#10702} [40Gbps Cu]BR [16:19:01] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:19:12] RECOVERY - puppet last run on mw2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:19:13] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:19:22] RECOVERY - puppet last run on db2007 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:19:42] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:19:46] (03PS2) 10Rush: elasticsearch: set fixed port numbers [puppet] - 10https://gerrit.wikimedia.org/r/229127 (https://phabricator.wikimedia.org/T107278) (owner: 10Dzahn) [16:20:02] RECOVERY - puppet last run on mw2070 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:20:12] RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:20:14] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:20:42] RECOVERY - puppet last run on db2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:21:12] RECOVERY - puppet last run on mw2062 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:21:29] (03CR) 10Rush: [C: 032] elasticsearch: set fixed port numbers [puppet] - 10https://gerrit.wikimedia.org/r/229127 (https://phabricator.wikimedia.org/T107278) (owner: 10Dzahn) [16:21:32] RECOVERY - puppet last run on mw2039 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:21:32] (03PS1) 10Jcrespo: Repool db1064 with low traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229149 [16:21:52] RECOVERY - puppet last run on mw2049 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:22:11] (03CR) 10Jcrespo: [C: 032] Repool db1064 with low traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229149 (owner: 10Jcrespo) [16:22:17] (03CR) 10Muehlenhoff: "The Redis server only listens on localhost, so it doesn't need to be allowed in the incoming rules. If that is a high traffic service we m" [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [16:22:29] (03CR) 10Ottomata: [C: 031] Add ferm rules for Hadoop jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/229145 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [16:23:01] joal: moritzm is looking at doing https://gerrit.wikimedia.org/r/#/c/229147/ soon [16:23:06] just want to make sure you know when this is happening. 
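The ferm work in flight here (Hive, jmxtrans, tin, analytics1027) follows one pattern: add an explicit allow rule per listening service, and only then enable base::firewall so everything unlisted is dropped. Before flipping the switch, it is worth enumerating listeners and comparing them against the rules, as ottomata and moritzm are coordinating for analytics1027; a sketch, assuming a stock Debian toolchain on the host:

```bash
# Every TCP listener on the host; each externally reachable one
# needs a matching ferm rule before base::firewall goes on:
sudo ss -tlnp | awk 'NR > 1 { print $4 }' | sort -u

# After base::firewall is enabled, confirm ferm loaded a real ruleset
# (a near-empty ACCEPT-all table would mean it failed to apply):
sudo iptables -L -n | head -n 20
```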
[16:23:19] (03PS3) 10Muehlenhoff: Add ferm rules for Hadoop jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/229145 (https://phabricator.wikimedia.org/T83597) [16:23:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for Hadoop jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/229145 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [16:23:45] !log jynus Synchronized wmf-config/db-eqiad.php: Repool db1064 with low traffic after maintenance (duration: 00m 12s) [16:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:58] no apaches failed [16:24:14] (03PS1) 10Smalyshev: Fix rules.log error when starting Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/229150 [16:24:32] RECOVERY - puppet last run on mw2020 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:24:41] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1506700 (10mobrovac) >>! In T107287#1506524, @akosiaris wrote: > * The SCA cluster at this point has really minimal usage in pretty much all aspect... [16:25:01] (03CR) 10jenkins-bot: [V: 04-1] Fix rules.log error when starting Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/229150 (owner: 10Smalyshev) [16:25:01] RECOVERY - puppet last run on es2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:05] 7Blocked-on-Operations, 7Puppet, 6operations, 10Beta-Cluster, and 3 others: Setup a dedicated mediawiki host in Beta Cluster that we can use for security scanning - https://phabricator.wikimedia.org/T72181#1506701 (10dduvall) [16:25:32] RECOVERY - puppet last run on mw2069 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:25:52] RECOVERY - puppet last run on mw2033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:01] (03PS16) 10Dduvall: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) [16:27:02] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:27:56] (03PS2) 10Smalyshev: Fix rules.log error when starting Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/229150 [16:28:33] (03PS1) 10Dzahn: enable firewalling on tin [puppet] - 10https://gerrit.wikimedia.org/r/229151 [16:28:37] (03CR) 10jenkins-bot: [V: 04-1] Fix rules.log error when starting Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/229150 (owner: 10Smalyshev) [16:28:53] (03CR) 10Dzahn: [C: 04-2] enable firewalling on tin [puppet] - 10https://gerrit.wikimedia.org/r/229151 (owner: 10Dzahn) [16:28:59] (03PS2) 10Dzahn: enable firewalling on tin [puppet] - 10https://gerrit.wikimedia.org/r/229151 [16:29:29] (03PS3) 10Smalyshev: Fix rules.log error when starting Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/229150 [16:29:34] (03CR) 10Dzahn: "once merged, please do https://gerrit.wikimedia.org/r/#/c/223458/" [puppet] - 10https://gerrit.wikimedia.org/r/229151 (owner: 10Dzahn) [16:31:06] (03CR) 10Dzahn: "looks like we might need additional rules for memcached first" [puppet] - 10https://gerrit.wikimedia.org/r/227417 (owner: 10Muehlenhoff) [16:32:37] (03CR) 10DCausse: [C: 032 V: 032] "Time to deploy" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/226049 (https://phabricator.wikimedia.org/T106165) (owner: 10DCausse) 
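On the memcached and Redis questions in these reviews: nutcracker proxies both, the Precise app servers still reach nutcracker over TCP, and moritzm notes the Redis server itself only listens on localhost, so whether a ferm rule is needed comes down to which address each daemon binds. A quick way to check, with the ports taken from the discussion (11211 for memcached, 6379/6380 for Redis) and otherwise assumed:

```bash
# Listeners on the memcached/Redis ports; 127.0.0.1 means local-only,
# i.e. no incoming ferm rule is required for that service.
sudo ss -tlnp | grep -E ':(11211|6379|6380) '
```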
[16:36:42] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:38:51] RECOVERY - puppet last run on mw2011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:39:02] RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:40:23] RECOVERY - puppet last run on mw2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:45:32] RECOVERY - puppet last run on mw2064 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:46:23] RECOVERY - puppet last run on mw2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:51:03] (03CR) 10Chad: Phabricator: Fetch all gerrit references in Git (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [16:53:16] (03CR) 10Muehlenhoff: "The rule for memcached is already defined in modules/memcached/manifests/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/227417 (owner: 10Muehlenhoff) [16:53:52] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 6Services: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1506768 (10GWicke) Jessie has an apertium package: https://packages.debian.org/search?keywords=apertium Is this sufficiently up to date? [16:54:16] (03PS2) 10Andrew Bogott: Remove hashar and dan as roots on labnodepool: [puppet] - 10https://gerrit.wikimedia.org/r/228890 (https://phabricator.wikimedia.org/T95303) [16:55:54] (03PS3) 10Andrew Bogott: Remove hashar and dan as roots on labnodepool: [puppet] - 10https://gerrit.wikimedia.org/r/228890 (https://phabricator.wikimedia.org/T95303) [16:56:10] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 6Services: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1506774 (10KartikMistry) It is up-to-date now, but several language pairs and language packages need to upload and backport. I'm doing it right now... [16:56:53] anyone deploying? [16:57:01] no [16:57:54] I'll do a small maint. job fix for Wikidata [16:58:04] (03PS1) 10Jcrespo: Increase db1064 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229157 [16:58:58] jynus, can you post a summary of our meeting at https://phabricator.wikimedia.org/T107610 ? [16:59:31] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1506785 (10mobrovac) 3NEW [16:59:49] (03CR) 10Andrew Bogott: "Questions:" [puppet] - 10https://gerrit.wikimedia.org/r/228890 (https://phabricator.wikimedia.org/T95303) (owner: 10Andrew Bogott) [17:00:11] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [17:00:56] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. All services are covered." [puppet] - 10https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996) (owner: 10Dzahn) [17:01:02] matt_flaschen, will do [17:01:07] Thanks [17:04:05] ottomata: just saw your message [17:04:11] sorry for delay :S [17:04:16] I am monitoring the things [17:07:19] joal: let's rather enable this tomorrow when Andrew gets online, would that work for you?
[17:09:03] (03PS2) 10coren: nrpe: Merge check_systemd_unit_lastrun into _state [puppet] - 10https://gerrit.wikimedia.org/r/228329 [17:11:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL host 208.80.153.192, interfaces up: 106, down: 6, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae1BRae1.32767: down - BRet-0/0/0: down - asw-a-codfw:et-2/0/52 {#10702} [40Gbps Cu]BRae1.2001: down - Subnet public1-a-codfwBRae1.2201: down - Subnet sandbox1-a-codfwBRae1.2017: down - Subnet private1-a-codfwBR [17:11:10] !log freezing elasticsearch indexes for 1.7.1 [17:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:47] 6operations, 10vm-requests, 7Pybal: codfw: 3 VM %request for PyBal - https://phabricator.wikimedia.org/T107901#1506798 (10ori) 3NEW [17:12:33] 6operations, 10Wikimedia-Mailing-lists, 7Pywikibot-General: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769#1506806 (10Legoktm) And how risky is doing that? Can we make a backup beforehand in case things go wrong? Do we also kn... [17:14:34] moritzm: joal is around if you wanna go ahead [17:14:58] !log hoo Synchronized php-1.26wmf16/extensions/Wikidata/: Fix maintenance/dumpJson.php fatal (duration: 00m 21s) [17:15:01] (03PS1) 10Filippo Giunchedi: rsyslog: add rsyslog::receiver to deprecate syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) [17:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:11] moritzm, ottomata : ready now if you wish, or tomorrow as you prefer [17:15:51] (03CR) 10jenkins-bot: [V: 04-1] rsyslog: add rsyslog::receiver to deprecate syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) (owner: 10Filippo Giunchedi) [17:17:05] !log Started dumpwikidatajson.sh on snapshot1003 to create a correct Wikidata json dump [17:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:17:21] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [17:18:39] !log es1.7.1: upgrade elastic1001 [17:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:42] PROBLEM - puppet last run on ruthenium is CRITICAL puppet fail [17:20:26] joal: let's rather do it tomorrow, I'm out for dinner soon [17:20:35] moritzm: fine by me :) [17:20:53] moritzm: I'll be off sooner than today though [17:22:08] (03PS2) 10Filippo Giunchedi: rsyslog: add rsyslog::receiver to deprecate syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) [17:22:36] joal: ok! shouldn't take too long anyway and we can start once Andrew is around [17:22:37] (03PS3) 10Ori.livneh: Set up a listing page for /api/ in all projects [puppet] - 10https://gerrit.wikimedia.org/r/228426 (https://phabricator.wikimedia.org/T107086) (owner: 10GWicke) [17:22:48] (03CR) 10Ori.livneh: [C: 032 V: 032] Set up a listing page for /api/ in all projects [puppet] - 10https://gerrit.wikimedia.org/r/228426 (https://phabricator.wikimedia.org/T107086) (owner: 10GWicke) [17:22:52] moritzm: what time-zone are you ? 
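The elasticsearch log entries above are a rolling upgrade to 1.7.1: index writes are frozen, then nodes are upgraded one at a time so the cluster keeps serving. The standard companion step is to disable shard allocation while each node is down, so the cluster does not start re-replicating in the meantime; a sketch against the stock 1.x settings API (run from any node, localhost assumed):

```bash
# Before stopping a node: stop the cluster from moving shards around.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.enable": "none" } }'

# Upgrade and restart the node, let it rejoin, then allow allocation again.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{ "transient": { "cluster.routing.allocation.enable": "all" } }'
```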
[17:23:02] if you get dinner, I get you are close to mine :) [17:24:23] moritzm: --^ [17:26:08] 6operations, 6Collaboration-Team, 10Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, 10Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1506876 (10jcrespo) I finally understood what is the requirement- you need to separate away Flow data from regular pa... [17:26:19] joal: CEST :-) [17:26:34] moritzm: same ;) [17:26:39] Anytime tomorrow ;) [17:26:45] moritzm: --^ [17:26:50] ok, we'll ping you! [17:26:53] great [17:26:57] Thanks moritzm ! [17:27:03] talk to you tomorrow [17:27:11] ottomata: in case --^ [17:27:37] (03CR) 10Filippo Giunchedi: [C: 031] Send $LOGUSER with dologmsg messages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/228299 (owner: 10BryanDavis) [17:33:44] (03CR) 10Jcrespo: [C: 032] Increase db1064 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229157 (owner: 10Jcrespo) [17:35:50] !log jynus Synchronized wmf-config/db-eqiad.php: Increase db1064 traffic (duration: 00m 13s) [17:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:14] (03CR) 10BryanDavis: Send $LOGUSER with dologmsg messages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228299 (owner: 10BryanDavis) [17:39:10] (03PS2) 10BryanDavis: Send $LOGUSER with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228299 [17:45:32] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:46:29] (03PS14) 10BryanDavis: Update configuration for logstash 1.5.3 [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) [17:46:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ":thumbsup:" [puppet] - 10https://gerrit.wikimedia.org/r/228299 (owner: 10BryanDavis) [17:46:48] (03PS3) 10Filippo Giunchedi: Send $LOGUSER with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228299 (owner: 10BryanDavis) [17:46:55] (03CR) 10Filippo Giunchedi: [V: 032] Send $LOGUSER with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228299 (owner: 10BryanDavis) [17:48:52] 6operations, 10Wikimedia-Mailing-lists, 7Pywikibot-General: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769#1506976 (10Dzahn) No, if we touch the mbox file in any way it will break all the links to archives, they will get renumbe... [17:51:47] (03CR) 1020after4: [C: 032] Cleanup stale docroot/bits/static-1.26wmf* content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229037 (owner: 10BryanDavis) [17:52:19] (03Merged) 10jenkins-bot: Cleanup stale docroot/bits/static-1.26wmf* content [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229037 (owner: 10BryanDavis) [17:52:54] (03CR) 10Dduvall: [C: 031] "I'm 90% sure this should be sufficient for nodepool management."
[puppet] - 10https://gerrit.wikimedia.org/r/228890 (https://phabricator.wikimedia.org/T95303) (owner: 10Andrew Bogott) [17:53:58] (03PS4) 10Andrew Bogott: Remove hashar and dan as roots on labnodepool: [puppet] - 10https://gerrit.wikimedia.org/r/228890 (https://phabricator.wikimedia.org/T95303) [17:54:29] (03CR) 1020after4: [C: 032] Update multiversion/updateBranchPointers whitespace and docs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229038 (owner: 10BryanDavis) [17:54:36] (03Merged) 10jenkins-bot: Update multiversion/updateBranchPointers whitespace and docs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229038 (owner: 10BryanDavis) [17:55:03] (03CR) 10Andrew Bogott: [C: 032] Remove hashar and dan as roots on labnodepool: [puppet] - 10https://gerrit.wikimedia.org/r/228890 (https://phabricator.wikimedia.org/T95303) (owner: 10Andrew Bogott) [17:55:39] Are there any other patches in need of deployment? I'm about to deploy the new branch [17:55:49] I think there are a couple of small, but easy to fix, issues on the latest deployments [17:56:28] (03PS22) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [17:56:36] /rpc/RunJobs.php is trying to run NULL->getNamespace() [17:56:45] (03PS23) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [17:57:06] cirrusSearchLinksUpdatePrioritized in particular [17:59:22] PROBLEM - Host labnodepool1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150804T1800). Please do the needful. [18:00:25] !log re-imaging labnodepool1001 [18:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:17] reported as T107913, not sure which is the right project [18:03:32] RECOVERY - Host labnodepool1001 is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [18:03:35] (I do not think it is directly caused by Cirrus) [18:05:38] 6operations, 10Wikimedia-Mailing-lists, 7Pywikibot-General: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769#1507050 (10Dzahn) I don't think the Gmane issue is related, afaik Gmane is just a regular user who is subscribed to the... [18:06:57] (03CR) 1020after4: Phabricator: Setup git config for all repositories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad) [18:07:08] (03PS4) 1020after4: Phabricator: Setup git config for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad) [18:08:30] (03CR) 10JanZerebecki: "The short summary about what this does is:" [dns] - 10https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602) (owner: 10JanZerebecki) [18:09:06] jynus: discovery there should be the right one [18:11:11] (03CR) 10BryanDavis: "Deployed via cherry-pick on deployment-logstash2 (beta cluster)." [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [18:11:13] chasemp, thanks, still unsure it is the root cause, but if it is not they will bounce back! :-) [18:12:51] mutante, bblack: answered on https://gerrit.wikimedia.org/r/#/c/228411/ [18:13:46] (03CR) 10BryanDavis: "Deployed via cherry-pick on beta cluster and in stashbot Labs project."
[puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [18:14:21] I would like to do a couple of extra fixes more, but I do not want to do deployments so late for me if I am not going to be able to be around [18:14:46] see you guys [18:15:31] jzerebecki: great, thanks [18:16:18] (03Abandoned) 10EBernhardson: Prevent caching of search requests partitipating in AB test [puppet] - 10https://gerrit.wikimedia.org/r/228404 (https://phabricator.wikimedia.org/T106888) (owner: 10EBernhardson) [18:16:21] akosiaris: yt? [18:19:49] (03PS1) 1020after4: symlinks for 1.26wmf17, delete 1.26wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229179 [18:20:54] (03CR) 1020after4: [C: 032] symlinks for 1.26wmf17, delete 1.26wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229179 (owner: 1020after4) [18:21:19] (03Merged) 10jenkins-bot: symlinks for 1.26wmf17, delete 1.26wmf9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229179 (owner: 1020after4) [18:22:30] !log twentyafterfour Started scap: rebuild localization cache, sync 1.26wmf17 [18:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:27:20] 6operations, 10Wikimedia-Logstash: Import logstash 1.5.3 into apt.wm.o - https://phabricator.wikimedia.org/T107916#1507147 (10bd808) 3NEW [18:40:24] (03CR) 10Dzahn: "that explains, i just checked the role class where we usually put the ferm rules" [puppet] - 10https://gerrit.wikimedia.org/r/227417 (owner: 10Muehlenhoff) [18:42:32] !log es1.7.1: upgrade elastic1002 [18:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:57] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1507297 (10BBlack) Bringing this conversation back here from the comments in https://gerrit.wikimedia.org/r/#/c/228411/ > The short summary abou... [18:46:59] (03CR) 10BBlack: "hash_ignore_busy also doesn't prevent caching anyways..." [puppet] - 10https://gerrit.wikimedia.org/r/228404 (https://phabricator.wikimedia.org/T106888) (owner: 10EBernhardson) [18:49:43] ACKNOWLEDGEMENT - RAID on ms-be2009 is CRITICAL 1 failed LD(s) (Offline) daniel_zahn T107877 [18:51:09] !log twentyafterfour Finished scap: rebuild localization cache, sync 1.26wmf17 (duration: 28m 39s) [18:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:22] (03CR) 10Dzahn: [C: 031] "already active on mc2002 and lgtm over there" [puppet] - 10https://gerrit.wikimedia.org/r/227417 (owner: 10Muehlenhoff) [18:52:26] (03PS2) 10Dzahn: Enable ferm on mc1009 [puppet] - 10https://gerrit.wikimedia.org/r/227417 (owner: 10Muehlenhoff) [18:57:23] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 30.77% of data above the critical threshold [500.0] [19:01:36] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1507403 (10csteipp) @Stas, is wikidata.org required for some reason? Or was that just ok with them? Running on wikimedia.org would have a number... 
[19:02:39] 119 Warning: Failed connecting to redis server at 10.64.0.201: Connection timed out [19:03:14] jzerebecki: https://phabricator.wikimedia.org/T107602#1507403 [19:06:52] (03PS1) 1020after4: group0 wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229189 [19:07:06] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1507444 (10Smalyshev) @csteipp Well, it's //Wikidata// Query Service which serves wikidata content... So having domain at wikimedia and not wikid... [19:08:09] (03CR) 1020after4: [C: 032] group0 wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229189 (owner: 1020after4) [19:08:15] (03Merged) 10jenkins-bot: group0 wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229189 (owner: 1020after4) [19:08:37] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.26wmf17 [19:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:02] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:10:57] twentyafterfour: looks like 1.26wmf17 has wikidata on our wmf/1.26wmf9 branch again :( [19:11:57] is https://github.com/wikimedia/mediawiki-tools-release/blob/5f26da7d2ce533640d6271a515c599b46878b1a8/make-wmf-branch/config.json the wrong place? [19:12:13] or forgot to git pull? [19:12:18] maybe [19:12:25] !log Applied Icba6d7a87 on mw1017 for a couple of webpagetest runs [19:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:00] aude: hmm, weird [19:18:57] twentyafterfour: yeah :( [19:19:59] (03CR) 10Dzahn: [C: 032] Enable ferm on mc1009 [puppet] - 10https://gerrit.wikimedia.org/r/227417 (owner: 10Muehlenhoff) [19:21:35] enabling firewall on one redis server.. only one to be on the safe side [19:21:48] we have done it before in codfw but those are not really in prod [19:24:17] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1507554 (10Lydia_Pintscher) I'd very much prefer query.wikidata.org for the query service. wikidata-query.wikimedia.org is rather ugly and not me... [19:25:27] aude: I'm fixing it, I have no idea how that happened [19:25:43] I specifically checked that, too... [19:25:58] * twentyafterfour really doesn't trust make-wmf-branch [19:29:15] (03PS1) 10Ottomata: Updates and fixes for 0.8.2.1-2 release [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 [19:29:34] (03PS2) 10Ottomata: Updates and fixes for 0.8.2.1-2 release [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) [19:31:14] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1507585 (10JanZerebecki) The intent is for the service to allow CORS, but I'm not sure about the implications. Anyway that that means it is not a... [19:31:53] (03PS3) 10Ottomata: Updates and fixes for 0.8.2.1-2 release [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) [19:38:10] wtf gerrit: https://gerrit.wikimedia.org/r/#/c/229190/ [19:38:14] twentyafterfour: thanks [19:38:33] gerrit doesn't seem to want to merge this one... 
it's stuck in "merge pending" [19:38:46] twentyafterfour: depends on abandoned changeset [19:38:55] mutante: but it wasn't based on that one at all [19:39:03] that's the wtf part [19:39:05] has to be rebased? [19:39:14] didn't give me the option to rebase [19:39:18] tried the "submit" button now? [19:39:20] I guess I'll start over [19:39:29] mutante: I tried it several times [19:39:30] manual rebase? [19:39:46] interesting, the second time the jenkins result looks slightly different [19:39:47] aude: yeah, I guess that's what it's gonna take [19:39:50] interactive [19:39:52] it doesn't actually verify it [19:40:39] it says the tests succeeded, but it does not say "Verified +2" [19:40:42] unlike the first time [19:41:46] mutante: because nothing removed the first V+2, it cannot add it a second time [19:42:07] jzerebecki: makes sense, thx [19:42:24] (03PS1) 10Smalyshev: drop cookies when proxying request [puppet] - 10https://gerrit.wikimedia.org/r/229194 [19:46:08] (03PS1) 10Aude: Exclude Flow topic boards from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) [19:46:24] (03PS2) 10Aude: Exclude Flow topic boards from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) [19:46:27] ok manual rebase worked [19:46:38] !log es1.7.1: upgrade elastic1003 [19:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:11] (03CR) 10Legoktm: [C: 031] Exclude Flow topic boards from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [19:58:34] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1507676 (10JanZerebecki) >>! In T107602#1507297, @BBlack wrote: > The part about failover is orthogonal to the decision about misc-web. Our stan... [19:58:53] (03CR) 10Legoktm: "Is there a way for extensions to do this directly?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [20:05:51] (03CR) 10Aude: "there is no hook or mechanism (yet?) for our settings but could be useful for things like this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [20:07:32] (03PS3) 10Aude: Exclude Flow topic boards and Draft NS from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) [20:08:10] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1507688 (10coren) @scfc Right now, we have one clearly used package from it but we've used backports (in general, not necessarily from a repo) before in production and in Labs so I think the que... [20:15:01] (03PS1) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) [20:18:56] I'm gonna sync again to deploy the right wikidata branch [20:21:44] twentyafterfour: ok [20:21:56] twentyafterfour: by sync, do you mean scap?
[20:22:02] scap yes [20:22:05] k [20:22:09] PROBLEM - puppet last run on cp3030 is CRITICAL puppet fail [20:22:57] !log twentyafterfour Started scap: fixup wikidata submodule version [20:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:48] (03CR) 10Dzahn: "see this: Change-Id: I47ea2913e088ce7 , it adds the logstash hosts to hiera, then we can take the list from there" [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [20:31:26] (03PS2) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) [20:32:24] (03CR) 10BryanDavis: "hieradata/labs/deployment-prep/host/deployment-logstash2.yaml should prbably be updated too just so beta cluster doesn't break when you de" [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [20:34:41] (03CR) 10Dzahn: Add ferm rules for Logstash/Elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [20:35:08] (03PS3) 10Dzahn: logstash: add cluster hostnames to hiera [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) [20:36:55] (03CR) 10Dzahn: "why would it influence beta cluster when i add a completely new variable that didnt exist before?" [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [20:37:51] (03CR) 10Dzahn: "ah, you mean later when applying the actual firewall rules?" [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [20:38:51] (03CR) 10BryanDavis: "> ah, you mean later when applying the actual firewall rules?" [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [20:39:15] (03CR) 10Dzahn: "unless the labs instances include base::firewall they should not be influenced" [puppet] - 10https://gerrit.wikimedia.org/r/229203 (https://phabricator.wikimedia.org/T104964) (owner: 10Dzahn) [20:41:12] mutante: so base::firewall isn't going to be added to role::logstash? [20:41:20] just directly on the hosts? [20:41:36] bd808: correct [20:41:41] ok cool [20:41:51] we always follow this pattern: [20:42:06] ferm rules to open ports -> role class [20:42:19] base::firewall that actually enables everything -> nodes only [20:42:30] if a host just gets the role it will be noop [20:42:34] and no iptables rules at all [20:43:11] s/always/now/ -- I remember the first ferm stuff locking us out of lots of things in beta cluster but that was early last year [20:43:41] * bd808 cursed matanya for that for a while [20:43:54] then it was a mistake to put the base class into the role [20:44:11] but maybe one day beta should have it too [20:44:13] ? 
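For context on the pattern mutante lays out above (the ferm rules that open ports live in the role class; base::firewall, which actually turns the firewall on, is applied per node), here is a minimal Puppet sketch. The class, node, and port names are illustrative, not the actual role::logstash code:

    class role::logstash {
        # Opens the port this service needs. On its own this is a noop:
        # no iptables rules are installed until the node also gets
        # base::firewall, which enables the default-deny ruleset.
        ferm::service { 'logstash_syslog_udp':
            proto => 'udp',
            port  => 10514,
        }
    }

    node 'logstash1001.example.wmnet' {
        include base::firewall  # enables iptables with default deny
        include role::logstash  # contributes the allow rule above
    }

A host that only includes the role keeps running with no iptables rules at all, which is why merging the ferm rules ahead of enabling the firewall is safe.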
beta cluster should be prod-like in my opinion for all things possible [20:45:08] but that's just one opinion among many [20:45:53] i agree and always think that when i see classes that have "labs" or "beta" in their name [20:45:58] * YuviPanda agrees with bd808 before wandering off again [20:46:24] !log twentyafterfour Finished scap: fixup wikidata submodule version (duration: 23m 26s) [20:46:25] but then you get told "but mutante, it's different from prod anyways" [20:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:41] there are a few weird things in beta cluster that we would never want in prod but it should be a fairly small number of things [20:47:09] hiera has made it more reasonable to refactor things to be configurable than the old $::realm switching [20:47:37] but not everything has been cleaned up, and hiera comes with its own warts [20:47:44] i find it confusing that in labs hieradata can be in more than 1 place [20:47:49] either puppet OR that wiki page [20:48:28] well without the wiki page most projects wouldn't be able to customize much [20:48:49] remember there are a lot of Labs projects besides deployment-prep [20:49:16] getting a root to +2 stuff in ops/puppet is no fun [20:49:29] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:32] and in this case, we have: role/common/logstash.yaml vs. deployment-prep/hosts/deployment-logstash2.yaml [20:49:39] and running a local puppetmaster with cherry-picks is not a great long-term solution [20:49:41] maybe we can also make that more similar [20:50:15] please, like kill off the confusing role hierarchy [20:50:19] before when there was a $realm switch it was messy but you could see the data in one place [20:50:43] now it's in 2 places you have to find first [20:51:02] (03PS1) 10Madhuvishy: Enable CORS in ORES [puppet] - 10https://gerrit.wikimedia.org/r/229212 [20:51:06] labs projects just have project/common.yaml and project/hosts/*.yaml [20:51:16] which mostly sort of works [20:52:10] (03CR) 10Yuvipanda: [C: 04-1] "Commit message is usually of the format:" [puppet] - 10https://gerrit.wikimedia.org/r/229212 (owner: 10Madhuvishy) [20:52:15] madhuvishy: ^ nitpick [20:52:20] but otherwise woohoo [20:52:34] it doesn't really work to apply one thing to a whole cluster just by their hostnames [20:52:41] YuviPanda: aah yes i forgot that format thingy [20:52:44] FINE [20:52:55] we had some discussion about that in the suggestion by Ori to change the hierarchy [20:53:48] sure. it leads to crazy stuff like the regexes in site.pp [20:54:44] (03PS2) 10Madhuvishy: ores: Enable CORS for ORES [puppet] - 10https://gerrit.wikimedia.org/r/229212 [20:55:13] (03CR) 10Yuvipanda: [C: 032 V: 032] "woo" [puppet] - 10https://gerrit.wikimedia.org/r/229212 (owner: 10Madhuvishy) [20:55:19] what would work (but hiera doesn't support it) would be per-host yaml files that could include other common files. That may be annoying for really big clusters though like the app servers [20:55:55] madhuvishy: can you ssh to ores-lb-02 and run puppet and verify this works? [20:56:02] in this example, could we do role/common/logstash.yaml and role/common/labs/logstash.yaml ?
[20:56:12] it seems way more intuitive [20:56:17] and role/common/labs already exists [20:56:30] but just has "nfs" in it and that's it [20:56:31] then it would apply to the 2 other logstash farms I run in labs [20:56:47] labs !== beta cluster [20:57:03] hmm, i guess then role/common/beta/foo [20:57:55] twentyafterfour: still doesn't seem right :/ [20:59:09] mutante: role/labs/deployment-prep/logstash.yaml would follow the same conventions as we have for project/host level config [20:59:43] that's role/// basically [21:00:07] although someday labs may be in multiple DCs too [21:00:27] and.. have a beta labs [21:00:28] ;) [21:00:45] treating a Labs project the same as a DC might actually work better [21:00:45] i just want something where you can predict the location of it after looking at the prod version and vice versa [21:00:46] YuviPanda: Yup it ran fine. [21:01:23] host-based and changing instance names make it hard to find [21:01:55] madhuvishy: \o/ awesome [21:02:58] mutante: it gets even more confusing when you figure out that the Hiera: namespace on wikitech overrides whatever is in ops/puppet [21:03:24] even host level is clobbered by that [21:03:58] bd808: yea, i know, so what keeps people from using the puppet repo? just the "we'd have to wait for ops" part again? [21:03:59] this is the proposal, fwiw: https://phabricator.wikimedia.org/T106404 [21:04:22] i think the hierarchy can straddle two repositories [21:04:25] mutante: yeah pretty much that. [21:04:55] there is no "ops/puppet SWAT window" twice a day yeht ;) [21:04:59] *yet [21:05:17] hmm, that doesn't sound like such a bad idea actually [21:05:21] yea, see the comment from Faidon there at the bottom [21:05:41] bd808: what YuviPanda said, maybe we should [21:05:50] but then the wiki page has to go if it works :) [21:06:10] * YuviPanda disagrees about the wiki page having to go [21:06:24] well, the point is to have one central location [21:06:25] I wasn't sure if paravoid was skeptical about my proposal or about the hostname scheme [21:06:51] tbh I wasn't sure either [21:06:54] the point is to have an intuitive and clear decision-procedure for resolving variables [21:06:57] in my opinion [21:07:00] otherwise the $realm check method seems almost cleaner [21:07:09] at least you saw both things in one place [21:07:13] mutante: I think there are a bunch of issues around it, and I think a lot of the problems are about the role backend and how that interacts with others than wikitech itself [21:07:14] mutante: it can't go for everyone, maybe for beta? [21:07:21] it can be a union of multiple sources, as long as those are arranged in a logical and hierarchical way with respect to one another [21:07:32] +1 to ori [21:07:41] greg-g: what is "beta" [21:07:52] beta is the hindi word for 'son' [21:07:59] pretty sure it's a fish guys [21:08:06] so maybe greg-g is proposing that his son can't edit wikitech? [21:08:08] it's a band! :) [21:08:10] and the second letter of the greek alphabet [21:08:39] bd808: :) [21:08:49] ori: I think the answer is both tho "In general, this hostname-based model does not fit our reality and our role hiearchy very well, I think. (I wonder how it works for the people that use it…)" [21:08:54] that seems like a general no [21:09:16] that's a judgement, not an argument [21:09:35] greg-g https://wikitech.wikimedia.org/wiki/Beta_beta_beta [21:09:45] (assuming that edit saves....) [21:09:52] well sure but in this case I assume they are meant to be one and the same [21:09:56] which I guess it isn't going to.
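A rough Puppet sketch of what the hierarchy debate above amounts to; the key name and the logstash class parameter are hypothetical, and the point is only that the role class does one hiera() lookup while prod and beta differ in which data source answers it:

    class role::logstash {
        # In prod this might be answered by hieradata/role/common/logstash.yaml;
        # in beta by a deployment-prep project or host file, or by the wikitech
        # Hiera: namespace, which overrides what is in the ops/puppet repo.
        $cluster_hosts = hiera('role::logstash::cluster_hosts', [$::fqdn])

        class { '::logstash':
            cluster_hosts => $cluster_hosts,
        }
    }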
boo [21:10:02] or so presented to me [21:10:49] if beta was always changed first and production would be changed second, wouldn't beta be alpha and prod be beta ? [21:11:43] 6operations, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107940#1507947 (10CCogdill_WMF) [21:11:46] ... it'd be beta = beta and prod = prod [21:11:48] i mean, in theory don't we want each and every change to be applied on beta first [21:11:59] yes, which would make it beta [21:12:01] :) [21:12:09] (03PS1) 10GWicke: Move api listing rewrite rules to main project domains [puppet] - 10https://gerrit.wikimedia.org/r/229219 [21:12:48] it's a semantics argument I guess but in general test it before production [21:12:57] beta is such a horrible word for a deployment environment. [21:13:16] integration -> staging -> prod [21:13:18] yea, i just mean the part where the word for "second" is used to describe what should be first [21:13:39] alpha is dev VMs and labs projects [21:13:47] bd808: actually yeah agreed [21:13:49] YuviPanda: ayeyyyy, it isn't that it is hard to do, but kind of annoying [21:14:08] extra complexity for not much gain, tell me why i was convinced that cron is better than just letting users rsync whenever they want? [21:14:12] ottomata: ya but it simplifies things for researchers no - put files in here, they'll show up the extra day. [21:14:13] ottomata: ls! [21:14:16] so they can delete? [21:14:18] previous experiences with a staging though were closer to testwiki and were based on a cookie devs had [21:14:25] can't they ls in labs? [21:14:37] ottomata: they can't ssh to labstore1003 and the nfs mounts for the dumps server are readonly [21:14:42] only roots can ssh to labstore [21:14:55] but they need this stuff accessible in labs somewhere, right? [21:15:11] yes, so it'll show up in /public/dumps mount [21:15:13] but that'll be readonly [21:15:26] that sounds confusing [21:15:31] hmm? [21:15:32] dumps is for dumps.wm.org, no? [21:15:40] so.... [21:15:40] (03CR) 10Reedy: [C: 04-1] "Don't do it like that. You're undoing good work to remove duplication" [puppet] - 10https://gerrit.wikimedia.org/r/229219 (owner: 10GWicke) [21:15:48] because a lot of tools on labs [21:15:52] read the XML dumps and other dumps [21:16:03] we have an NFS server that contains XML dumps [21:16:08] so people can directly read and manipulate it [21:16:14] instead of having to download it themselves [21:16:18] so that's the 'dumps NFS server [21:16:18] ' [21:16:28] NFS mount on dataset1001 [21:16:28] which is made available readonly to any labs instance that wants it [21:16:29] right? [21:16:34] I think it's an rsync [21:16:37] not sure [21:16:41] dataset1001 does have an NFS mount [21:16:46] that's how people access that stuff on stat boxes [21:16:55] i think what you have going now is [21:17:03] but it needs to get to labstore1003 somehow.. [21:17:10] dataset1001 has both, NFS and rsyncd [21:17:18] rsync from dataset1001 -> labstore1003, NFS mount exported from labstore1003 to labs instances [21:17:19] yes? [21:17:30] rsyncd clients that are allowed to rsync from it are specifically listed [21:17:33] (now in hiera) [21:17:49] ottomata: yes, looks like it [21:18:04] ottomata: and so I just want stat* -> labstore1003 rsync. exact same process :) [21:18:16] but rsynced to the same /public/dumps dir?
[21:18:36] ottomata: that's just the common name of the mount in labs instances [21:18:42] let me find out what it's actually path'd in labstore1003 [21:18:59] won't it get confusing that there are extra things in /public/dumps that are not on dataset1001 where most of the things in /public/dumps are rsynced from? [21:19:26] ottomata: it already has: [21:19:27] incr lost+found pagecounts-raw public [21:19:36] and no I don't think it'll be confusing [21:19:39] pagecounts-raw? [21:19:40] just put it in /srv/dumps/research [21:19:52] eeeef [21:20:00] the rsync doesn't do rsync --delete? [21:20:04] are you sure? (it shoudl!) [21:20:14] no idea - I've never touched the labstore1003 box :D [21:20:19] it's mostly apergos and coren [21:20:24] I only vaguely know how it works :D [21:20:25] otherwise when people remove things from dataset1001 it won't be removed from labstore1003 [21:20:34] i think it is a bad idea to try to mix rsync archives [21:20:48] we can put it in /srv/research too if you want [21:20:48] multiple sources gets confusing, especially if you want to mirror [21:20:51] make it a different NFS mount [21:20:53] that's trivial [21:21:07] i'm ok with that, or you can make the NFS mount a higher level [21:21:10] YuviPanda: The NFS side of 1003 is ridiculously simple - it's just a flat single nfs export ro to labs that's the destination of the dumps' rsyncs. [21:21:24] and rsync dumps to a subdir in the nfs export on 1003 [21:21:42] maybe make /public the export [21:21:46] instead of /public/dumps? [21:21:54] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107940#1507976 (10Krenair) [21:22:06] ottomata: that'll be a lot of work on the instances and require a remount :D [21:22:20] is that better or worse than more NFS mounts? [21:22:24] ottomata: let's just call it /srv/research and have it be a different mount? and then projects can mount it if they want [21:22:26] ottomata: definitely worse [21:22:31] adding an extra mount is trivial [21:22:46] Also /public contains things not coming from 1003 too - it's really just /public/dumps that's imported [21:22:48] or, YuviPanda, people could just rsync stuff into dataset1001 :) [21:22:50] assuming the box is puppetized [21:22:53] and have it in dumps.wm.org :) [21:22:57] then it would get rsynced again [21:23:02] ottomata: so you want them to go stats -> datset -> labstore? :) [21:23:04] oh, hm [21:23:06] maybe [21:23:10] ee, YuviPanda, we might want to ask ellery and others [21:23:18] cause, i can imagine that waiting for the rsync to happen would be really annoying [21:23:21] often you want to do: [21:23:23] generate data set [21:23:25] cool! [21:23:25] It is [21:23:29] load it up to show someone [21:23:30] in labs [21:23:36] haha [21:23:36] hmm [21:23:37] I want to be able to kick rsync and make it go for datasets.wikimedia [21:23:56] ottomata: so maybe we should go your way and have the rsync module be there and have it just be kickable by researchers and their scripts? [21:24:08] halfak: we are mostly talking about a way to get data directly from stat1002 to labs [21:24:30] do you want to rsync directly from dataset1001? [21:24:50] mutante: i think rsync directly from stat1002(+) to labstore1003 [21:25:02] what happens between dataset1001 and stat1002 though? [21:25:08] ottomata: so if we 1. setup a new mountpoint on labstore1003 2. made it an rsync module so scripts can rsync into it 3. documented how researchers can push data into it, is that all? 
[21:25:10] nuthin, that was an idea [21:25:17] yup [21:25:21] that would be my preference [21:25:23] ottomata: so another thing I was worried about is that labstore1003 can become the only copy of something accidentally [21:25:28] ottomata: which would be bad [21:25:29] then we don't have to worry about mirroring or multiple sources [21:25:36] naw, its labs! [21:25:39] why wouldn't you just pull it from dataset? [21:25:40] but I guess that's ok [21:25:41] if they miss data then too bad [21:25:47] mutante: because it doesn't exist there [21:25:50] that would be an extra step [21:25:54] a trout, a trout for anyone who says 'just labs' :P [21:26:14] heheh [21:26:28] but yeah, fair enough [21:26:29] ottomata: that's what i was wondering, so stat1002 creates the data from raw data it gets from dataset? [21:26:34] no [21:26:39] from hadoop, or something else [21:26:41] ottomata: let's go with your idea! [21:26:46] ah, ok [21:27:42] hm, i guess can I make a class in role::labs::nfs::dumps? [21:27:46] .pp [21:27:46] * [21:27:49] oh [21:27:51] it is labsnfs.pp [21:27:52] perfect [21:27:53] yes [21:27:58] :D [21:27:58] ok [21:28:21] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107940#1508007 (10Krenair) So we're going to let this service send email for anything @wikimedia.org? [21:29:24] ergh. /etc/exports is rendered by ::nfs::dumps [21:29:38] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107940#1508016 (10Krenair) a:5CCogdill_WMF>3None [21:30:08] ottomata: labstore1003 is a mess :( [21:30:10] wait a minute [21:30:12] ? [21:30:41] what's the point of the NFS mount on dataset1001 YuviPanda? [21:30:51] couldn't dataset1001 just rsync over to labstore? [21:30:59] i was wrong before, it looks like it does [21:31:10] ottomata: no idea - I didn't even know dataset1001 had an NFS mount? [21:31:18] export /srv/dumps, mount that on dataset1001. cron on dataset1001 does LOCAL rsync [21:31:27] oh... [21:31:27] I see [21:31:28] to the mounted NFS /srv/dumps from labstore [21:31:31] coren ^ do you know why? [21:31:34] WHYYYYY? [21:31:35] hehe [21:31:35] apergos ^ [21:32:35] YuviPanda: Once upon a time, the syncing script did filesystem copies. I know it uses rsync as the primary mechanism now, but I wouldn't swear that it doesn't look at the fs at all. [21:32:46] I think only Alexandros can tell you for sure. [21:33:37] haha, chase the pointer :D [21:33:55] ottomata: I guess remote rsync seems like the thing to do? [21:34:36] yeah [21:34:41] (03PS2) 10GWicke: Move api listing rewrite rules to main project domains [puppet] - 10https://gerrit.wikimedia.org/r/229219 [21:34:55] but it means i have to refactor some puppet stuff because I can't stand putting this export in a template called dumps.erb [21:36:43] ottomata: hahaha :D [21:36:45] <3 [21:37:02] coren I also just realized that there's a bug about the toolserver puppetization you did not actually being applied anywhere? [21:37:10] and it's not a module but just stuff in root [21:37:19] (03PS3) 10GWicke: Move api listing rewrite rules to main project domains [puppet] - 10https://gerrit.wikimedia.org/r/229219 [21:37:39] YuviPanda: I was unable to parse that statement. [21:39:31] let me find bug [21:39:37] you should see your phab pings more often :) [21:40:07] coren https://phabricator.wikimedia.org/T104537 [21:47:22] (03PS1) 10Ottomata: Set up writeable rsync module and NFS export of /srv/statistics to allow sharing of public data from stat boxes to labs [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) [21:48:51] ottomata: what's analytics1027? [21:49:56] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1508121 (10BBlack) Do we actually need an internal service endpoint like `wdqs.svc.eqiad.wmnet` for this... [21:49:59] it's a host we use for regular data copies from hadoop to datasets.wikimedia.org (which is hosted on stat1001) [21:50:16] (03CR) 10Yuvipanda: [C: 04-1] Set up writeable rsync module and NFS export of /srv/statistics to allow sharing of public data from stat boxes to labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) (owner: 10Ottomata) [21:50:20] but more production-like, whereas stat boxes lots of people can log into [21:50:20] ottomata: one bug I think [21:50:23] right [21:50:33] OH [21:50:36] i didn't realize it was a param [21:50:40] dunno how i missed that, thanks [21:50:42] douhh [21:50:43] k [21:50:47] yeah i was wondering where that came from [21:51:20] (03PS2) 10Ottomata: Set up writeable rsync module and NFS export of /srv/statistics to allow sharing of public data from stat boxes to labs [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) [21:51:33] ottomata: mind if I edit the commit message? [21:52:43] Waitwaitwait. What network are those stats boxes on? [21:53:20] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1508128 (10JanZerebecki) Currently there is no functionality that needs an internal service endpoint. We...
[21:39:31] let me find bug [21:39:37] you should see your phab pings more often :) [21:40:07] coren https://phabricator.wikimedia.org/T104537 [21:47:22] (03PS1) 10Ottomata: Set up writeable rsync module and NFS export of /srv/statistics to allow sharing of public data from stat boxes to labs [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) [21:48:51] ottomata: what's analytics1027? [21:49:56] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1508121 (10BBlack) Do we actually need an internal service endpoint like `wdqs.svc.eqiad.wmnet` for this... [21:49:59] its a host we use for regular data copies from hadoop to datasets.wikmedia.org (which is hosted on stat1001) [21:50:16] (03CR) 10Yuvipanda: [C: 04-1] Set up writeable rsync module and NFS export of /srv/statistics to allow sharing of public data from stat boxes to labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) (owner: 10Ottomata) [21:50:20] but more produciotn like, whereas stat boxes lots of people can log into [21:50:20] ottomata: one bug I think [21:50:23] right [21:50:33] OH [21:50:36] i didn't realize it was a param [21:50:40] dunno how i missed that, thanks [21:50:42] douhh [21:50:43] k [21:50:47] yeah i was wondering where that came from [21:51:20] (03PS2) 10Ottomata: Set up writeable rsync module and NFS export of /srv/statistics to allow sharing of public data from stat boxes to labs [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) [21:51:33] ottomata: mind if I edit the commit message? [21:52:43] Waitwaitwait. What network are those stats boxes on? [21:53:20] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1508128 (10JanZerebecki) Currently there is no functionality that needs an internal service endpoint. We... [21:55:57] Coren: analytics network, but they will be pushing [21:56:04] oh, right duh [21:56:12] yeah gotta make a hole in the ACL, woo! [21:56:19] YuviPanda: please, amend away [21:57:19] (03PS3) 10Yuvipanda: labs: Setup /srv/statistics for rsync from stats hosts [puppet] - 10https://gerrit.wikimedia.org/r/229262 (https://phabricator.wikimedia.org/T107576) (owner: 10Ottomata) [21:57:23] ottomata: there we go [21:57:35] ottomata: needs a few follow up patches to make it available in labs instances, want me to do them now? [22:00:23] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 3Discovery-Wikidata-Query-Service-Sprint: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1508141 (10Smalyshev) Yes, we don't actually have any needs that having query.wikidata.org going to the... [22:01:40] YuviPanda: sure, we will ahve to get mark or someone to poke a hole in analytics VLAN ACL too [22:01:42] i ahve to run though [22:01:49] thanks! tty tomorrow [22:01:59] ottomata: \o/ cool [22:02:02] I'll have patches up [22:02:24] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107940#1508157 (10csteipp) If I'm reading that right, that adds about 4,500 ip's approved to send emails as us, which makes me nervous. Can we narrow that down? 
[22:05:52] syncing wmf17 one more time [22:07:07] !log twentyafterfour Synchronized php-1.26wmf17: forgot submodule update (duration: 01m 39s) [22:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:55] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107940#1508175 (10EWilfong_WMF) We cannot. An alternative is to just add the DKIM record and leave the SPF record as is. This would give email sent through the system a valid DKIM record and an SPF va... [22:10:46] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1508189 (10csteipp) >>! In T107602#1507585, @JanZerebecki wrote: > The intent is for the service to allow CORS, but I'm not sure about the implic... [22:19:59] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1508205 (10GWicke) Will the query service return raw HTML or SVG content? If it's only returning other content types like JSON, then CORS might n... [22:22:02] coren did you see the ticket about toolserver? [22:23:26] (03PS1) 10Yuvipanda: labs: Allow projects to opt into a 'statistics' NFS mount [puppet] - 10https://gerrit.wikimedia.org/r/229265 (https://phabricator.wikimedia.org/T107576) [22:25:16] (03PS3) 10ArielGlenn: dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 [22:25:58] (03CR) 10jenkins-bot: [V: 04-1] dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 (owner: 10ArielGlenn) [22:27:24] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1508217 (10Dzahn) [22:32:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [22:32:50] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1508269 (10Dzahn) 5Open>3Resolved a:3Dzahn [22:33:12] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1488561 (10Dzahn) a:5Dzahn>3BBlack [22:34:31] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1508279 (10Dzahn) Would it worth considering to not log locally but send it to the central syslog server instead? [22:34:53] 6operations, 10vm-requests: request VM for grafana - https://phabricator.wikimedia.org/T107832#1508280 (10Dzahn) [22:35:41] (03PS4) 10ArielGlenn: dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 [22:36:23] (03CR) 10jenkins-bot: [V: 04-1] dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 (owner: 10ArielGlenn) [22:40:04] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1508291 (10faidon) We absolutely cannot do this for neither DKIM nor SPF. A potential alternative would be to send emails from `@benefactore... 
[22:40:28] 6operations, 10Traffic: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#1508292 (10BBlack) 3NEW [22:42:15] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1508309 (10Smalyshev) > Will the query service return raw HTML or SVG content? Check out: https://wiki.blazegraph.com/wiki/index.php/REST_API#QU... [22:45:04] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1508326 (10Smalyshev) > But if our wikis accept CORS requests from the service's domain, then an xss in this service can lead to significant issu... [22:45:51] (03PS5) 10ArielGlenn: dumps: generate conf files for dump stage scheduler [puppet] - 10https://gerrit.wikimedia.org/r/229134 [22:57:45] 6operations, 6Discovery, 10Traffic, 10Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1508360 (10csteipp) >>! In T107602#1508326, @Smalyshev wrote: > Aren't our tokens HTTP only? Our session cookies are, but anti-csrf tokens are a... [22:58:59] 6operations, 10Traffic: Stop using LVS from varnishes - https://phabricator.wikimedia.org/T107956#1508370 (10GWicke) One potential issue worth testing for is whether Varnish can actually deal with frequent config reloads. For deploys, we'd probably need the ability to reload multiple times per second, or at le... [22:59:24] (03PS1) 10Mattflaschen: Disable Flow on betawikiversity due to lots of psuedo-namespace Topic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229272 [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150804T2300). Please do the needful. [23:00:19] (03PS2) 10Mattflaschen: Disable Flow on betawikiversity due to lots of psuedo-namespace Topic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229272 (https://phabricator.wikimedia.org/T107904) [23:00:26] SWAT is empty but maybe matt_flaschen wants to deploy his patch :) [23:02:04] RoanKattouw, I'll volunteer to do SWAT today. ;) [23:02:13] But seriously, if someone wants me to do a late-breaking one, just ping me. 
[23:05:50] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 504 (expecting: 200) [23:05:56] (03CR) 10Mattflaschen: [C: 032] Disable Flow on betawikiversity due to lots of psuedo-namespace Topic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229272 (https://phabricator.wikimedia.org/T107904) (owner: 10Mattflaschen) [23:06:22] (03Merged) 10jenkins-bot: Disable Flow on betawikiversity due to lots of psuedo-namespace Topic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229272 (https://phabricator.wikimedia.org/T107904) (owner: 10Mattflaschen) [23:08:09] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 504 (expecting: 200) [23:08:09] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev of by title from MW returned the unexpected status 504 (expecting: 200) [23:08:16] !log mattflaschen Synchronized wmf-config/InitialiseSettings.php: Disable Flow on betawikiversity (duration: 00m 13s) [23:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:09] disable flow? whyyyyyy? :P [23:10:06] due to lots of psuedo-namespace Topic [23:11:07] Yep [23:13:28] anyway, if holy Zuul permits, I might have a couple patches soon - I can deploy them myself [23:45:48] matt_flaschen, are you done? any objections if I deploy? [23:46:06] Yeah, go ahead. [23:48:18] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [23:55:20] (03CR) 10BryanDavis: [C: 04-1] "self -1 just so nobody gets excited and merges this for me before I have a window to apply it on the prod cluster." [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [23:57:15] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1508633 (10mobrovac) >>! In T107900#1508279, @Dzahn wrote: > Would it worth considering to not log locally but send it to the central syslog server instead? We already send them to logstash, with a different log level,... [23:57:58] !log maxsem Synchronized php-1.26wmf17/extensions/WikimediaEvents/: SWAT (duration: 00m 12s) [23:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:14] !log maxsem Synchronized php-1.26wmf16/extensions/WikimediaEvents/: SWAT (duration: 00m 12s) [23:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master