[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T0000). Please do the needful.
[00:02:04] Nothing to deploy.
[00:02:17] (CR) Jforrester: "I always saw these as placeholders for if the wikis beset by this code wanted to make their lives even worse by customising it further." [mediawiki-config] - https://gerrit.wikimedia.org/r/338129 (owner: Reedy)
[00:12:03] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:18:46] * Dereckson is going to deploy an interwiki map update
[00:18:53] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures
[00:22:01] (PS1) Dereckson: Update interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/340667
[00:23:11] !log mira - disarming keyholder, changed password of analytics deploy key - rearming to test changes for T154943
[00:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:17] T154943: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943
[00:23:58] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/340667 (owner: Dereckson)
[00:25:13] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[00:25:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:23] (Merged) jenkins-bot: Update interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/340667 (owner: Dereckson)
[00:26:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[00:26:49] (CR) jenkins-bot: Update interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/340667 (owner: Dereckson)
[00:27:01] 340667 on mwdebug1002
[00:27:13] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[00:33:39] Operations, Analytics, Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3066309 (Krinkle)
[00:37:20] !log dereckson@tin Synchronized wmf-config/interwiki.php: Update interwiki map (ref T159103) (duration: 00m 41s)
[00:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:27] T159103: Update DRAE interwiki link - https://phabricator.wikimedia.org/T159103
[00:38:23] jouncebot: next
[00:38:23] In 0 hour(s) and 21 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T0100)
[00:38:25] jouncebot: now
[00:38:25] For the next 0 hour(s) and 21 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T0000)
[00:41:14] !log mira - disarm/rearm keyholder after changing passphrases of all other deployment keys (T154943)
[00:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:23] T154943: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943
[00:44:41] !log tin - disarm/rearm keyholder after changing passphrases of all deployment keys to new passphrase (T154943)
[00:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:35] !log tin/mira - you will notice in the output of keyholder status you will not see the pathes in the "comment" column anymore. this is due to newer versions of openssh-client and caused our problem last time i attempted this. thanks to thcipriani's fix https://gerrit.wikimedia.org/r/#/c/312947/ we don't rely on this anymore and all is good, keyholder stays armed even after re-encrypting the
[00:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:41] keys (T154943)
[00:48:42] T154943: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943
[01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T0100). Please do the needful.
[01:13:07] Operations, Patch-For-Review: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943#3066450 (Dzahn) - dis-armed and re-armed keyholder on both mira and tin, confirming that you are asked **just once for the new passphrase** and all keys are loaded - you will notice...
[01:30:19] Operations, Patch-For-Review: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943#3066497 (Dzahn) - reverted the cumin key back to like it was before as requested by @volans who pointed out this should keep having a separate passphrase
[01:47:53] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1802.520017 Seconds
[01:48:54] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 29.218287 Seconds
[02:11:43] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[02:25:28] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 09m 30s)
[02:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:27] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 14m 52s)
[02:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 2 03:04:16 UTC 2017 (duration 5m 49s)
[03:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:22] Operations, Monitoring, Release-Engineering-Team, Tracking, Wikimedia-Incident: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481685 (Tbayer) The alert feature introduced in the recent Grafana update (T152473) could be of interest for this...
[03:43:53] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:06:43] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[04:09:43] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[04:11:53] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[04:27:13] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:56:14] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[05:01:42] Operations, MediaWiki-Maintenance-scripts, Performance-Team, Thumbor: ensure thumbor container access is preserved by mw filebackend setzoneaccess - https://phabricator.wikimedia.org/T144479#3066703 (Krinkle)
[05:03:33] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:07:03] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:22:53] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:32:33] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[05:33:13] PROBLEM - MegaRAID on db1056 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[05:33:14] ACKNOWLEDGEMENT - MegaRAID on db1056 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T159410
[05:33:17] Operations, ops-eqiad: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T159410#3066737 (ops-monitoring-bot)
[05:35:03] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[05:36:43] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[05:39:43] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:40:51] Operations, hardware-requests, Services (watching), User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3066744 (RobH) I'll get started on setting up these systems during my workday tomorrow (Thursday).
[05:50:53] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[05:56:00] Operations, Revision-Scoring-As-A-Service-Backlog, Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3066753 (Ladsgroup) Thanks for the explanation. If we want to go other ways the most feasible one to me is to duplicate precaching for ores. One in eqiad...
[06:13:03] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:17:14] Operations, Revision-Scoring-As-A-Service-Backlog, Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3066770 (Joe) Again regarding precaching (which is surely duplicable): do we *really* need it? The cache hit ratio is so small that I'm not sure precach...
[06:18:53] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:33:53] Operations, Revision-Scoring-As-A-Service-Backlog, Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3066771 (Ladsgroup) Precaching got broken several times (when we were in labs) the result was a huge increase in average response time. I think response...
[06:34:03] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[06:41:03] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:53] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:03:03] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:09:51] !log Resume pt-table-checksum on idwiki (s2) - T154485
[07:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:57] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[07:11:57] (CR) Giuseppe Lavagetto: [C: -1] "minor nits." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/340539 (https://phabricator.wikimedia.org/T156922) (owner: Krinkle)
[07:13:17] (CR) Giuseppe Lavagetto: [C: -2] "@Tim: what's wrong in having bot nginx and apache on a server?" [puppet] - https://gerrit.wikimedia.org/r/340462 (https://phabricator.wikimedia.org/T154105) (owner: Tim Landscheidt)
[07:16:53] (CR) Giuseppe Lavagetto: [C: 1] relforge: use experimental apt repo to have access to elasticsearch 5 [puppet] - https://gerrit.wikimedia.org/r/340500 (https://phabricator.wikimedia.org/T159168) (owner: Gehel)
[07:20:16] !log Deploy alter table enwiki.revision db2016 (codfw master) - T132416
[07:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:21] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:20:44] (CR) Giuseppe Lavagetto: tlsproxy::localssl: add ability to have an access.log (2 comments) [puppet] - https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797) (owner: Giuseppe Lavagetto)
[07:21:44] (PS2) Giuseppe Lavagetto: tlsproxy::localssl: add ability to have an access.log [puppet] - https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797)
[07:37:38] Operations, Puppet, Labs, Traffic, Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411#3066814 (Joe)
[07:45:02] Operations, Puppet, Labs, Traffic, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412#3066827 (Joe)
[07:45:38] (Abandoned) Giuseppe Lavagetto: hieradata: stop repeating data for clusters [puppet] - https://gerrit.wikimedia.org/r/312205 (owner: Giuseppe Lavagetto)
[07:52:13] (CR) Giuseppe Lavagetto: [C: 1] tlsproxy: add prometheus support [puppet] - https://gerrit.wikimedia.org/r/339465 (owner: Ema)
[07:54:11] (CR) Giuseppe Lavagetto: [C: 1] "Please see comment though." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[07:54:34] (PS2) Giuseppe Lavagetto: confd: add ability to define a global and per-template prefix [puppet] - https://gerrit.wikimedia.org/r/340537
[08:03:40] (PS3) Giuseppe Lavagetto: confd: add ability to define a global and per-template prefix [puppet] - https://gerrit.wikimedia.org/r/340537
[08:04:43] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:05:03] PROBLEM - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[08:05:13] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused
[08:05:43] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[08:06:57] Operations, fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#2852630 (Joe) @Jgreen any idea when it will happen? (all FR to jessie, I mean).
[08:07:13] PROBLEM - cassandra-b service on restbase2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[08:07:23] PROBLEM - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:08:33] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:09:49] (CR) Ema: [C: 1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797) (owner: Giuseppe Lavagetto)
[08:10:18] I'll take a look at what's up on restbase2003
[08:10:39] <_joe_> thanks
[08:11:13] RECOVERY - cassandra-b service on restbase2003 is OK: OK - cassandra-b is active
[08:11:23] RECOVERY - Check systemd state on restbase2003 is OK: OK - running: The system is fully operational
[08:13:03] RECOVERY - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-b valid until 2017-09-12 15:35:15 +0000 (expires in 194 days)
[08:13:13] RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.135 port 9042
[08:14:50] (PS4) Giuseppe Lavagetto: confd: add ability to define a global and per-template prefix [puppet] - https://gerrit.wikimedia.org/r/340537
[08:21:04] (CR) Giuseppe Lavagetto: [C: 2] confd: add ability to define a global and per-template prefix [puppet] - https://gerrit.wikimedia.org/r/340537 (owner: Giuseppe Lavagetto)
[08:24:32] <_joe_> this ^^ fails to apply due to one of the dreaded dependency cycles
[08:24:35] <_joe_> will solve it
[08:27:11] !log Start pt-table-checksum on itwiki (s2) - T154485
[08:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:17] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[08:28:03] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:13] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:36:33] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:38:19] <_joe_> that ^^ is expected
[08:38:24] (PS1) Giuseppe Lavagetto: confd::file: fix circular dependency [puppet] - https://gerrit.wikimedia.org/r/340693
[08:39:04] (CR) Giuseppe Lavagetto: [V: 2 C: 2] confd::file: fix circular dependency [puppet] - https://gerrit.wikimedia.org/r/340693 (owner: Giuseppe Lavagetto)
[08:41:50] FYI restbase2003 died because of JVM OOM
[08:44:03] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:44:44] Operations, ops-codfw: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3066883 (fgiunchedi)
[08:45:52] !log running alter table on db2035 T147747
[08:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:26] (PS1) Filippo Giunchedi: Decomission ms-fe200[1-4] [puppet] - https://gerrit.wikimedia.org/r/340694 (https://phabricator.wikimedia.org/T159413)
[08:49:35] (CR) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab (1 comment) [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[08:51:27] !log running alter table on db2039 T147747
[08:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:10] (PS1) Smalyshev: Add more metrics to Blazegraph monitoring [puppet] - https://gerrit.wikimedia.org/r/340695
[09:00:12] Operations, discovery-system: confctl SubjectAltNameWarning after python-urllib3 upgrade - https://phabricator.wikimedia.org/T156232#3066925 (ema) Disabling the warning seems like a viable workaround to me. @Joe?
[09:00:13] (CR) jerkins-bot: [V: -1] Add more metrics to Blazegraph monitoring [puppet] - https://gerrit.wikimedia.org/r/340695 (owner: Smalyshev)
[09:00:33] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:01:45] Operations, HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3066927 (MoritzMuehlenhoff) HHVM 3.18 now requires webscalesqlclient also for the standard mysql extension, so I've changed the tarball generation to no longer prune webscalesqlclient. DENABLE_ASYNC_MYSQL is now br...
[09:06:06] (PS2) Filippo Giunchedi: Decomission ms-fe200[1-4] [puppet] - https://gerrit.wikimedia.org/r/340694 (https://phabricator.wikimedia.org/T159413)
[09:08:39] (PS3) Giuseppe Lavagetto: tlsproxy::localssl: add ability to have an access.log [puppet] - https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797)
[09:08:58] !log installing apache2 security updates on mw1262-mw1265
[09:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:38] (CR) Filippo Giunchedi: [C: 2] Decomission ms-fe200[1-4] [puppet] - https://gerrit.wikimedia.org/r/340694 (https://phabricator.wikimedia.org/T159413) (owner: Filippo Giunchedi)
[09:09:51] <_joe_> you merge-sniped me!
[09:09:52] <_joe_> :P
[09:10:27] (PS4) Giuseppe Lavagetto: tlsproxy::localssl: add ability to have an access.log [puppet] - https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797)
[09:10:53] _joe_: tsk, yours didn't have the jenkins rubberstamp yet :P
[09:19:19] (CR) Giuseppe Lavagetto: [C: 2] tlsproxy::localssl: add ability to have an access.log [puppet] - https://gerrit.wikimedia.org/r/328495 (https://phabricator.wikimedia.org/T153797) (owner: Giuseppe Lavagetto)
[09:22:34] Operations, DBA, Labs, Labs-Infrastructure, Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3066980 (Marostegui)
[09:22:38] Operations, DBA, Labs, Labs-Infrastructure, and 2 others: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3066977 (Marostegui) Open>Resolved a:Marostegui I am closing this as nothing has been reported so far. If something arises, feel free to reo...
[09:23:08] Operations, Domains, Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3066982 (Beetlebeard) Yes, the comments were helpful. I still haven't had time to look for our CNAME entry, though.
[09:24:46] !log Upgrading composer to 1.1.0 on CI instances
[09:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:03] PROBLEM - Swift HTTPS on ms-fe2002 is CRITICAL: connect to address 10.192.0.24 and port 80: Connection refused
[09:25:03] PROBLEM - Swift HTTPS on ms-fe2003 is CRITICAL: connect to address 10.192.16.25 and port 80: Connection refused
[09:25:03] PROBLEM - Swift HTTPS on ms-fe2001 is CRITICAL: connect to address 10.192.0.23 and port 80: Connection refused
[09:25:03] PROBLEM - Swift HTTP frontend on ms-fe2001 is CRITICAL: connect to address 10.192.0.23 and port 80: Connection refused
[09:25:03] PROBLEM - Swift HTTP frontend on ms-fe2004 is CRITICAL: connect to address 10.192.16.26 and port 80: Connection refused
[09:25:04] PROBLEM - Swift HTTP frontend on ms-fe2002 is CRITICAL: connect to address 10.192.0.24 and port 80: Connection refused
[09:25:04] PROBLEM - Swift HTTPS on ms-fe2004 is CRITICAL: connect to address 10.192.16.26 and port 80: Connection refused
[09:25:05] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: connect to address 10.192.0.23 and port 80: Connection refused
[09:25:05] PROBLEM - Swift HTTP backend on ms-fe2004 is CRITICAL: connect to address 10.192.16.26 and port 80: Connection refused
[09:25:33] PROBLEM - Swift HTTP backend on ms-fe2002 is CRITICAL: connect to address 10.192.0.24 and port 80: Connection refused
[09:25:33] PROBLEM - Swift HTTP frontend on ms-fe2003 is CRITICAL: connect to address 10.192.16.25 and port 80: Connection refused
[09:25:34] PROBLEM - Swift HTTP backend on ms-fe2003 is CRITICAL: connect to address 10.192.16.25 and port 80: Connection refused
[09:25:34] that's me ^
[09:26:26] buuuuu
[09:26:31] :D
[09:26:45] !log installing glibc updates from jessie point release
[09:26:47] I know right?
[09:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:33] Operations, ops-codfw, Patch-For-Review: codfw: rack/setup ms-fe200[5-8] - https://phabricator.wikimedia.org/T152612#3066991 (fgiunchedi) Open>Resolved Machines in service, see T159413 for decom
[09:35:39] (PS5) Ema: tlsproxy: add prometheus support [puppet] - https://gerrit.wikimedia.org/r/339465
[09:35:58] Operations, ops-codfw, hardware-requests, Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3067019 (fgiunchedi)
[09:36:06] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:36:35] Operations, ops-codfw, hardware-requests, Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3066883 (fgiunchedi) a:fgiunchedi>Papaul All four machines ms-fe200[1-4] are running `spare::system` role and can be decomissioned.
[09:37:36] (CR) Ema: [C: 2] tlsproxy: add prometheus support [puppet] - https://gerrit.wikimedia.org/r/339465 (owner: Ema)
[09:40:32] (PS2) Giuseppe Lavagetto: profile::discovery::client: create confd-generate files for discovery [puppet] - https://gerrit.wikimedia.org/r/340538 (https://phabricator.wikimedia.org/T149617)
[09:42:06] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:43:27] (PS3) Giuseppe Lavagetto: profile::discovery::client: create confd-generate files for discovery [puppet] - https://gerrit.wikimedia.org/r/340538 (https://phabricator.wikimedia.org/T149617)
[09:46:14] (PS4) Giuseppe Lavagetto: profile::discovery::client: create confd-generate files for discovery [puppet] - https://gerrit.wikimedia.org/r/340538 (https://phabricator.wikimedia.org/T149617)
[09:51:17] (PS2) Phuedx: Hygiene: Remove Page Previews experiment config [mediawiki-config] - https://gerrit.wikimedia.org/r/339478
[09:54:21] (PS5) Giuseppe Lavagetto: profile::discovery::client: create confd-generate files for discovery [puppet] - https://gerrit.wikimedia.org/r/340538 (https://phabricator.wikimedia.org/T149617)
[09:55:38] (CR) Giuseppe Lavagetto: [C: 2] profile::discovery::client: create confd-generate files for discovery [puppet] - https://gerrit.wikimedia.org/r/340538 (https://phabricator.wikimedia.org/T149617) (owner: Giuseppe Lavagetto)
[09:55:40] !log increased PHP memory_limit on bohrium for Piwik (T154558)
[09:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:47] T154558: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558
[09:58:02] (PS1) Phuedx: Make Page Previews use RESTBase on "stage 0" wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/340697 (https://phabricator.wikimedia.org/T158221)
[09:59:13] (CR) jerkins-bot: [V: -1] Make Page Previews use RESTBase on "stage 0" wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/340697 (https://phabricator.wikimedia.org/T158221) (owner: Phuedx)
[10:01:41] (CR) Phuedx: [C: -2] "Per T158221#3067125." [mediawiki-config] - https://gerrit.wikimedia.org/r/340697 (https://phabricator.wikimedia.org/T158221) (owner: Phuedx)
[10:03:14] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[10:04:23] Operations, Traffic: Upgrade text and upload cache clusters to varnish 4.1.5 - https://phabricator.wikimedia.org/T159424#3067155 (ema)
[10:04:32] Operations, Traffic: Upgrade text and upload cache clusters to varnish 4.1.5 - https://phabricator.wikimedia.org/T159424#3067168 (ema) p:Triage>Normal
[10:04:41] !log installing tiff security updates on trusty hosts (jessie already fixed)
[10:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:14] RECOVERY - puppet last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[10:16:16] (PS2) Jcrespo: Update pager partitioning to the latest version [software] - https://gerrit.wikimedia.org/r/338943 (https://phabricator.wikimedia.org/T147747)
[10:17:15] (CR) Phuedx: [C: -2] "Recheck." [mediawiki-config] - https://gerrit.wikimedia.org/r/340697 (https://phabricator.wikimedia.org/T158221) (owner: Phuedx)
[10:18:37] (CR) jerkins-bot: [V: -1] Make Page Previews use RESTBase on "stage 0" wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/340697 (https://phabricator.wikimedia.org/T158221) (owner: Phuedx)
[10:21:31] (PS1) Jcrespo: mariadb: Depool db1051 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340702 (https://phabricator.wikimedia.org/T159319)
[10:22:53] (CR) jerkins-bot: [V: -1] mariadb: Depool db1051 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340702 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[10:22:58] (CR) Marostegui: [C: 1] "Look the same as db1026 current status which should be the one to pursue as per: https://phabricator.wikimedia.org/T132416#3057955" [software] - https://gerrit.wikimedia.org/r/338943 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo)
[10:23:00] !log installing bind updates (we're using client-side libs/tools)
[10:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:52] Operations, Traffic: Allow setting varnish connection timeouts in puppet - https://phabricator.wikimedia.org/T159429#3067249 (ema)
[10:27:14] Operations, Traffic: Allow setting varnish connection timeouts in puppet - https://phabricator.wikimedia.org/T159429#3067277 (ema) p:Triage>Normal
[10:34:21] (PS1) Phuedx: Re-enable Page Previews instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947)
[10:35:25] (CR) jerkins-bot: [V: -1] Re-enable Page Previews instrumentation [mediawiki-config] - https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947) (owner: Phuedx)
[10:37:46] anyone know what's up with the composer job?
[10:41:41] marostegui: jynus: if you have database related change in mediawiki-config do force merge them
[10:42:02] build broke due to an update of the php dependency manager "composer" :/
[10:42:14] hashar: I don't have anything myself now, but thanks for the heads up :)
[10:46:59] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [mediawiki-config] - https://gerrit.wikimedia.org/r/163815 (owner: Hashar)
[10:47:05] (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [mediawiki-config] - https://gerrit.wikimedia.org/r/163815
[10:48:22] (CR) jerkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [mediawiki-config] - https://gerrit.wikimedia.org/r/163815 (owner: Hashar)
[10:51:32] !log CI composer based builds are sometime broken since composer got upgraded to 1.1.0 . See https://phabricator.wikimedia.org/T159431
[10:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:31] hashar, sorry, what?
[10:53:45] Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3067340 (MoritzMuehlenhoff)
[10:53:51] ah, CI-related, you mean?
[10:53:52] jynus: jenkins fail on operations/mediawiki-config and I noticed a few database related changes
[10:53:55] so if they are needed, you would have to force merge
[10:53:56] ok
[10:54:03] I lacked context
[10:54:09] "something broke"
[10:54:10] yeah sorry :(
[10:54:28] I do the same
[10:54:58] when I talk server, I mean db, etc. I assume you pass "some time" with jenkins
[10:56:37] (CR) Jcrespo: [C: 2] Update pager partitioning to the latest version [software] - https://gerrit.wikimedia.org/r/338943 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo)
[10:57:23] (Merged) jenkins-bot: Update pager partitioning to the latest version [software] - https://gerrit.wikimedia.org/r/338943 (https://phabricator.wikimedia.org/T147747) (owner: Jcrespo)
[10:57:38] (CR) Jcrespo: [V: 2 C: 2] mariadb: Depool db1051 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340702 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[10:58:37] (CR) jerkins-bot: [V: -1] mariadb: Depool db1051 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/340702 (https://phabricator.wikimedia.org/T159319) (owner: Jcrespo)
[11:07:51] !log kartik@tin Started deploy [cxserver/deploy@5101090]: (no justification provided)
[11:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:12] Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3067393 (MoritzMuehlenhoff) It seems to only consider backports ATM, when I run "apt-cache show munin" (as an example of a package which is present in backports), I only get the version from backports, not the j...
[11:10:16] !log kartik@tin Finished deploy [cxserver/deploy@5101090]: (no justification provided) (duration: 02m 24s) [11:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:48] (03CR) 10jenkins-bot: mariadb: Depool db1051 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340702 (https://phabricator.wikimedia.org/T159319) (owner: 10Jcrespo) [11:18:04] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 341.85 seconds [11:19:10] ^checking [11:24:04] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 19.91 seconds [11:26:36] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163815 (owner: 10Hashar) [11:30:51] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163815 (owner: 10Hashar) [11:31:07] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340697 (https://phabricator.wikimedia.org/T158221) (owner: 10Phuedx) [11:31:12] (03CR) 10Hashar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947) (owner: 10Phuedx) [11:31:40] ta [11:33:12] phuedx: jynus: marostegui: for operations/mediawiki-config , the composer jenkins job is fixed [11:33:22] require('composer.json'); failed [11:33:28] have to use the absolute path for whatever reason [11:37:00] 06Operations, 15User-Elukey: swift-account-stats: Max retries exceeded with url: /auth/v1.0 - https://phabricator.wikimedia.org/T159437#3067471 (10ema) [11:37:45] 06Operations, 10media-storage: swift-account-stats: Max retries exceeded with url: /auth/v1.0 - https://phabricator.wikimedia.org/T159437#3067490 (10ema) [11:46:57] 06Operations, 10MediaWiki-General-or-Unknown: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3067511 (10ema) [11:49:58] (03PS1) 10Elukey: Add experimental JVM 
Heap usage alarm to Zookeeper prod instances [puppet] - 10https://gerrit.wikimedia.org/r/340719 (https://phabricator.wikimedia.org/T157968) [11:50:32] !sal [11:50:32] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [11:50:54] (03CR) 10jerkins-bot: [V: 04-1] Add experimental JVM Heap usage alarm to Zookeeper prod instances [puppet] - 10https://gerrit.wikimedia.org/r/340719 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [11:51:16] ah snap the comma [11:51:17] sigh [11:51:47] (03PS2) 10Elukey: Add experimental JVM Heap usage alarm to Zookeeper prod instances [puppet] - 10https://gerrit.wikimedia.org/r/340719 (https://phabricator.wikimedia.org/T157968) [11:58:40] !log CI composer based builds are now ok. Only operations/mediawiki-config was impacted as far as I can tell. [11:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:13] 06Operations, 13Patch-For-Review, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3067604 (10fgiunchedi) [12:00:15] 06Operations, 10media-storage: swift-account-stats: Max retries exceeded with url: /auth/v1.0 - https://phabricator.wikimedia.org/T159437#3067601 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I'm assuming this happened on ms-fe2001 today? It was during decomission and a legitimate error, I'll tentative... 
[12:07:45] (03PS1) 10Filippo Giunchedi: Provision ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/340721 (https://phabricator.wikimedia.org/T155095) [12:11:17] (03PS2) 10Filippo Giunchedi: Provision ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/340721 (https://phabricator.wikimedia.org/T155095) [12:12:51] 06Operations, 10Monitoring, 06Release-Engineering-Team, 07Tracking, 07Wikimedia-Incident: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#3067616 (10hashar) [12:15:40] (03CR) 10Filippo Giunchedi: [C: 032] Provision ms-fe100[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/340721 (https://phabricator.wikimedia.org/T155095) (owner: 10Filippo Giunchedi) [12:19:45] (03PS3) 10Elukey: Add experimental JVM Heap usage alarm to Zookeeper prod instances [puppet] - 10https://gerrit.wikimedia.org/r/340719 (https://phabricator.wikimedia.org/T157968) [12:21:25] (03CR) 10Elukey: [C: 032] Add experimental JVM Heap usage alarm to Zookeeper prod instances [puppet] - 10https://gerrit.wikimedia.org/r/340719 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [12:24:25] (03PS1) 10Giuseppe Lavagetto: profile::conftool::client: use conftool_prefix [puppet] - 10https://gerrit.wikimedia.org/r/340724 [12:24:27] (03PS1) 10Giuseppe Lavagetto: pybal: remove pool, unused anywhere [puppet] - 10https://gerrit.wikimedia.org/r/340725 [12:24:30] (03PS1) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/340726 [12:24:32] (03PS1) 10Giuseppe Lavagetto: pybal::configuration: explicitly set the conftool prefix [puppet] - 10https://gerrit.wikimedia.org/r/340727 [12:30:20] (03PS2) 10Giuseppe Lavagetto: profile::conftool::client: use conftool_prefix [puppet] - 10https://gerrit.wikimedia.org/r/340724 [12:36:57] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 for maintenance (duration: 00m 42s) [12:37:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:27] !log running ANALYZE table on revision at db1051 (depooled) T159319 [12:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:02] (03PS3) 10Giuseppe Lavagetto: profile::conftool::client: use conftool_prefix [puppet] - 10https://gerrit.wikimedia.org/r/340724 [12:48:04] 06Operations, 10ops-eqiad, 10DBA: Investigate db1047 replication lag - https://phabricator.wikimedia.org/T159266#3067703 (10jcrespo) We need to change the battery of the second controller (number 1) and disable auto-learning there (it was only disabled on number 0). For the first part we need @Cmjohnson, ne... [12:48:56] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3067709 (10jcrespo) [12:49:09] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3062225 (10jcrespo) a:05jcrespo>03None [12:49:20] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3062225 (10jcrespo) p:05Triage>03Normal [12:51:52] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::conftool::client: use conftool_prefix [puppet] - 10https://gerrit.wikimedia.org/r/340724 (owner: 10Giuseppe Lavagetto) [12:52:01] !log installing shadow security updates on jessie hosts [12:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:45] (03PS2) 10Giuseppe Lavagetto: pybal: remove pool, unused anywhere [puppet] - 10https://gerrit.wikimedia.org/r/340725 [12:55:52] (03Draft1) 10Paladox: Gerrit: Fix not so it reports the correct user merging the change [puppet] - 10https://gerrit.wikimedia.org/r/340735 [12:55:57] (03PS2) 10Paladox: Gerrit: Fix not so it reports the correct user merging the change [puppet] - 10https://gerrit.wikimedia.org/r/340735 
(https://phabricator.wikimedia.org/T159441) [12:56:34] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up ms-fe100[5-7] - https://phabricator.wikimedia.org/T155095#3067723 (10fgiunchedi) [12:56:36] 06Operations, 10ops-eqiad: rack/setup/install/track new ms-fe1005-1008 - https://phabricator.wikimedia.org/T154250#3067725 (10fgiunchedi) [12:57:17] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: remove pool, unused anywhere [puppet] - 10https://gerrit.wikimedia.org/r/340725 (owner: 10Giuseppe Lavagetto) [12:57:47] (03PS3) 10Paladox: Gerrit: Fix not so it reports the correct user merging the change [puppet] - 10https://gerrit.wikimedia.org/r/340735 (https://phabricator.wikimedia.org/T159441) [12:59:56] (03PS4) 10Paladox: Gerrit: Fix bot so it reports the correct user merging the change [puppet] - 10https://gerrit.wikimedia.org/r/340735 (https://phabricator.wikimedia.org/T159441) [13:00:13] (03PS2) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/340726 [13:03:24] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1005.eqiad.wmnet [13:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:04] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:06:44] PROBLEM - DPKG on cp4012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:06:54] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures [13:08:22] 06Operations, 10ops-eqiad, 10DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3067736 (10Marostegui) I have disabled the auto-learn mode for that controller - I have not set it to "2" (warn via an event) because we are not really using it: ```... 
[13:09:54] PROBLEM - DPKG on bast3001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:10:39] (03PS3) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/340726 [13:10:54] RECOVERY - DPKG on bast3001 is OK: All packages OK [13:45:38] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340742 (https://phabricator.wikimedia.org/T128546) [13:46:39] !log removed obsolete kernels on eventlog1001 [13:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:51] !log removed obsolete kernels on ocg1002 [13:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:44] RECOVERY - DPKG on cp4012 is OK: All packages OK [13:53:14] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:39] phuedx: want to deploy your changes during eu swat? [13:54:32] (03PS1) 10Elukey: Fix the Zookeeper JVM Heap usage alarm [puppet] - 10https://gerrit.wikimedia.org/r/340743 (https://phabricator.wikimedia.org/T157968) [13:54:41] zeljkof: i need to deploy the first one to the beta cluster to test it [13:54:59] is that taken care of by jenkins automatically? [13:55:15] not sure, hashar should know [13:55:43] phuedx: CR+2 [13:55:53] cool [13:55:54] phuedx: CI will merge it in Gerrit and that will update beta cluster [13:56:00] via a job that runs on post merge [13:56:01] (03CR) 10Ema: [C: 031] "LGTM, one minor comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340727 (owner: 10Giuseppe Lavagetto) [13:56:07] but that also mean the change will land on prod [13:56:17] the first one //should// be a nop [13:56:23] (03CR) 10Elukey: [C: 032] Fix the Zookeeper JVM Heap usage alarm [puppet] - 10https://gerrit.wikimedia.org/r/340743 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [13:56:25] jouncebot: next [13:56:25] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T1400) [13:56:52] phuedx: you can start :} [13:57:02] (03PS3) 10Phuedx: Hygiene: Remove Page Previews experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339478 [13:57:11] o/ [13:57:18] o/ [13:57:23] I'm around [13:57:32] phuedx: you are deploying your changes? [13:57:48] zeljkof: don't mind [13:57:57] phuedx: great [13:58:09] can you also deploy the third change? :) [13:58:14] there is one from jan_drewniak [13:59:00] * phuedx waits for the build [13:59:01] phuedx: afaik you will want to sync CommonSettings.php first and after that InitialiseSettings.php [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T1400). Please do the needful. [14:00:04] phuedx and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[14:00:12] hashar: makes sense [14:00:25] commonsettings refers to variables that are removed by initialisesettings [14:00:29] don't want warning spam [14:01:09] o/ [14:01:32] (03CR) 10Phuedx: [C: 032] Hygiene: Remove Page Previews experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339478 (owner: 10Phuedx) [14:01:40] (03CR) 10Phuedx: [C: 032] "SWAT!!1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339478 (owner: 10Phuedx) [14:03:02] (03Merged) 10jenkins-bot: Hygiene: Remove Page Previews experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339478 (owner: 10Phuedx) [14:03:20] (03CR) 10jenkins-bot: Hygiene: Remove Page Previews experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339478 (owner: 10Phuedx) [14:03:31] * phuedx twiddles thumbs while that gets synced to the beta cluster [14:04:16] jan_drewniak: there's a merge conflict with the second change in the queue (mine) want to do yours after ^ [14:04:18] ? [14:04:36] Sure [14:05:15] ok [14:05:28] the beta-mediawiki-config-update job has run [14:05:30] https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/6838/ [14:05:54] hashar, zeljkof: is there a beta-specific fatalmonitor? [14:06:11] http://logstash-beta.wmflabs.org [14:06:20] duh [14:06:21] 'course ;) [14:06:22] but you would need to exercise whatever code paths you are interested in [14:06:27] should've guessed that [14:06:33] hashar: basically: does it load? 
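Note on the sync order discussed at 14:00 above: CommonSettings.php still reads variables that the InitialiseSettings.php change removes, so the reader is synced first to avoid warning spam. A minimal sketch of that ordering rule follows, assuming a made-up description of each file's change. The `wmgExampleSetting` name, the `changes` structure, and the `sync_order` helper are hypothetical and are not part of scap or mediawiki-config.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def sync_order(changes):
    """Return files in an order where readers are synced before removers.

    `changes` maps a file name to two sets of variable names:
    "stops_reading" (variables the new version no longer uses) and
    "removes" (variables whose definition the new version deletes).
    """
    # graph[node] holds the files that must be synced before `node`.
    graph = {name: set() for name in changes}
    for remover, removed in changes.items():
        for reader, dropped in changes.items():
            if reader != remover and removed["removes"] & dropped["stops_reading"]:
                graph[remover].add(reader)
    return list(TopologicalSorter(graph).static_order())


# Hypothetical description of the two-file change from the log.
changes = {
    "wmf-config/InitialiseSettings.php": {
        "removes": {"wmgExampleSetting"},
        "stops_reading": set(),
    },
    "wmf-config/CommonSettings.php": {
        "removes": set(),
        "stops_reading": {"wmgExampleSetting"},
    },
}

order = sync_order(changes)
print(order)
```

With only two files this is more machinery than needed, but the same rule scales when a config cleanup touches several files at once.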
[14:06:34] ;) [14:06:40] ;- [14:06:42] ;} [14:08:03] okie poke, beta cluster hasn't fallen over and page previews are still functional/i can opt in/out of 'em [14:09:54] time for scap [14:09:55] :D [14:10:00] syncing commonsettings.php [14:10:56] !log phuedx@tin Synchronized wmf-config/CommonSettings.php: Remove Page Previews experiment config (duration: 01m 06s) [14:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:27] syncing the other file [14:12:43] !log phuedx@tin Synchronized wmf-config/InitialiseSettings.php: Remove Page Previews experiment config (duration: 00m 40s) [14:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:48] jan_drewniak: time for yours [14:14:00] neat [14:14:07] there is a huge spam of notices whenever we sync :] [14:14:55] (03CR) 10Phuedx: [C: 032] "SWAT!!1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340742 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:15:33] Notice: Array to string conversion in /srv/mediawiki/php-1.29.0-wmf.13/includes/TemplateParser.php(131)(xxxxxxxxxxxxx) : eval()'d code on line 40 [14:16:09] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340742 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:16:36] phuedx: So I've never actually deployed the portals myself... 
usually somebody from ops does it for me [14:16:56] hashar: i noticed that one [14:17:03] unrelated to your deploy imho [14:17:42] hashar: +1 [14:17:42] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340742 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:18:09] I blame PHP every day for notices/warnings etc [14:19:06] jan_drewniak: the change isn't on tin yet [14:19:23] there's a note about running the sync-portals script [14:19:28] except i don't see it ;) [14:19:39] hashar: ^ know anything about that [14:19:47] yeah [14:19:55] the doc is something like portals/README iirc [14:20:06] there is a shell script there that does sync and then purge some URL [14:20:07] phuedx: I should probably know how to deploy the portals myself too though, updating a repo & running a script doesn't seem that hard. The script should be at root of the repo [14:20:14] !log elukey@tin Started deploy [analytics/refinery@c3dd129]: (no justification provided) [14:20:17] ah gotcha [14:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:50] hashar: so no sync-dir for wmf-config [14:20:52] just run the script? 
[14:21:09] if you trust it yes :} [14:21:17] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:21:22] cat /srv/mediawiki-staging/portals/sync-portals [14:21:37] jouncebot now [14:21:37] For the next 0 hour(s) and 38 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T1400) [14:21:44] the script itself has several issues but I am too lazy to fix them :} [14:22:02] spoken like a true hashar [14:22:32] !log elukey@tin Finished deploy [analytics/refinery@c3dd129]: (no justification provided) (duration: 02m 18s) [14:22:34] jan_drewniak: the deploy itself is not too complicated given you have all the appropriate rights [14:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:40] it is really all about smashing a few commands [14:22:42] be careful [14:22:46] and monitor logs / metrics [14:23:31] ok [14:23:34] change on tin [14:23:45] hashar: i'm going to run the sync-portals script [14:23:47] you can scap pull it on mwdebug1001 [14:23:53] or yeah just sync-portals [14:26:05] ok pulling on mwdebug1001 [14:26:11] !log running alter table on db2040 T147747 [14:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:08] jan_drewniak: anything? [14:28:31] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:28:58] (03PS2) 10Phuedx: Re-enable Page Previews instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947) [14:29:41] hashar: derp, i forgot to submodule update the portals directory ;) [14:29:52] ;} [14:30:11] gj [14:30:27] jan_drewniak: the change should be on mwdebug1001 [14:30:37] 06Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3067834 (10Jgreen) We have six Precise boxes left to replace by the end of March, most are waiting on procurement tasks. Once these are done we'll have all metric-worthy services on Jessie. T... [14:30:42] phuedx: yup, mwdebug1001 looks good\ [14:30:43] Zppix: ta ;) [14:31:24] okie poke running the script [14:32:11] !log phuedx@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 40s) [14:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:53] !log phuedx@tin Synchronized portals: (no justification provided) (duration: 00m 41s) [14:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:22] jan_drewniak: ^ [14:35:04] phuedx: nuts... can you purge this url? https://www.wikipedia.org/portal/wikipedia.org/assets/js/gt-ie9-c84bf66d33.js [14:35:50] and this one: https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-d1cc91a7f4.js [14:36:34] arent you regenerating those ids ? 
[14:36:47] phuedx: to purge basically do: [14:36:56] echo URL HERE | mwscript purgeList.php [14:37:05] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1006.eqiad.wmnet [14:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:16] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1007.eqiad.wmnet [14:37:18] that sends cache invalidation requests to the various varnish caches [14:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:24] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1008.eqiad.wmnet [14:37:24] hashar: ta [14:37:24] all done the right way / magically [14:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:33] i grabbed the script from the sync-portals script [14:37:38] jan_drewniak: any more? [14:37:50] phuedx: phew, it looks all good now [14:37:59] 👍 [14:38:17] continuing on w/ 340706 [14:39:12] (03PS3) 10Phuedx: Re-enable Page Previews instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947) [14:39:26] ^ spotted a typo ;) [14:41:29] (03CR) 10Hashar: [C: 032] Introduce linters using rake [puppet/nginx] - 10https://gerrit.wikimedia.org/r/338386 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:41:37] (03CR) 10Phuedx: [C: 032] "SWAT!!1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947) (owner: 10Phuedx) [14:42:55] (03Merged) 10jenkins-bot: Re-enable Page Previews instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947) (owner: 10Phuedx) [14:43:09] (03CR) 10jenkins-bot: Re-enable Page Previews instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340706 (https://phabricator.wikimedia.org/T149947) (owner: 10Phuedx) [14:43:22] (03CR) 
10Hashar: [C: 032] Introduce linters using rake [puppet/nginx] - 10https://gerrit.wikimedia.org/r/338386 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:43:40] (03Merged) 10jenkins-bot: Introduce linters using rake [puppet/nginx] - 10https://gerrit.wikimedia.org/r/338386 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:44:04] (03CR) 10Hashar: [C: 032] Introduce linters using rake [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/338387 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:44:15] change is on mwdebug1001 [14:44:17] verifying [14:44:28] (03Merged) 10jenkins-bot: Introduce linters using rake [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/338387 (https://phabricator.wikimedia.org/T154894) (owner: 10Hashar) [14:44:42] cool [14:44:46] looks good [14:46:00] syncing [14:47:02] !log phuedx@tin Synchronized wmf-config/InitialiseSettings.php: T157700: Re-enable Page Previews instrumentation (duration: 00m 40s) [14:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:07] T157700: Allow page previews to be rolled out to 90% of logged-out users - https://phabricator.wikimedia.org/T157700 [14:48:02] phuedx: that is for HoverCards isn't it ? [14:48:10] yeah [14:48:16] hovercards/popups/page previews [14:48:17] thanks god [14:48:24] ? [14:48:28] that was one of the most annoying feature we had [14:48:31] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3067854 (10elukey) [14:48:37] forced me to login to the wiki to get that feature turned on :} [14:49:09] lol [14:50:24] phuedx: any reason we don't just turn it on for all users ? 
[14:50:33] hashar: we are doing [14:50:41] /slowly// [14:50:58] ;) [14:51:15] hashar: some wikis have asked us that same question ;) [14:51:19] hrrrrm [14:51:27] not seeing the change taking effect on enwiki [14:51:39] I can't remember for how long I had popups enabled probably more than a couple years [14:52:33] ok now i see it [14:52:35] cool [14:52:39] \O/ [14:53:01] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3067862 (10MoritzMuehlenhoff) mc2008-mc2016 are not properly removed from puppet, they still show up in servermon e.g. [14:53:07] jouncebot: now [14:53:07] For the next 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T1400) [14:53:14] 6 minutes to go [14:53:22] i'll be quicker next time ;) [14:56:31] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:02:47] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3067871 (10fgiunchedi) To get started with oresrdb2001 in codfw please see/review https://gerrit.wikimedia.org/r/#/c/34048 [15:03:15] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up ms-fe100[5-8] - https://phabricator.wikimedia.org/T155095#3067872 (10fgiunchedi) [15:04:00] alright -- out [15:04:12] going to pick up my kids from school [15:07:31] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:21] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:16:02] (03PS7) 10Hashar: contint: slave role for Saucelabs jobs [puppet] - 10https://gerrit.wikimedia.org/r/338770 [15:16:43] (03CR) 10Hashar: [C: 031] "Rebased. Already cherry picked on CI puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/338770 (owner: 10Hashar) [15:19:32] (03Abandoned) 10Hashar: logstash: parse runJobs messages [puppet] - 10https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: 10Hashar) [15:25:32] (03PS6) 10Muehlenhoff: Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) [15:28:38] 06Operations, 10Monitoring, 06Release-Engineering-Team, 07Tracking, 07Wikimedia-Incident: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#3067970 (10Peter) I've been trying out alerts for a while, let me write down a summary the coming days. [15:33:54] (03CR) 10Hashar: [C: 04-1] "Pending https://gerrit.wikimedia.org/r/#/c/332475/" [puppet] - 10https://gerrit.wikimedia.org/r/178810 (owner: 10Hashar) [15:36:31] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:37:23] (03CR) 10Volans: [C: 031] "LGTM." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) (owner: 10Muehlenhoff) [15:40:21] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:40:27] (03PS7) 10Muehlenhoff: Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) [15:41:01] PROBLEM - Host mc2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:14] !log installing libfcgi-perl security updates [15:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:41] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 616477 [15:47:09] cp1074 strikes again, looking ^ [15:49:54] !log uploaded 6.8.9.9-5+deb8u7+wmf1 to apt.wikimedia.org (CMYK sharpen bugfix rebased on latest Debian update) [15:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:21] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:54:11] PROBLEM - Host mc2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:23] <_joe_> elukey: any idea why mc2009 is still in icinga? [15:57:11] _joe_: apparently something went wrong in decom, see my earlier comment at https://phabricator.wikimedia.org/T157675#3067862 [16:00:07] _joe_ nope.. 
[16:00:56] !log restarting db1001 for kernel and mariadb upgrade [16:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:14] ^dbproxy1001 and dbproxy1006 will complain for a bit [16:01:21] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:02:41] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 420 [16:03:00] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#3068047 (10hashar) 05Open>03Resolved Bulk of the integration is done and there are spec being added to the repo. I wrote a guide on https://w... [16:03:10] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#3068049 (10hashar) 05Resolved>03Open [16:05:11] PROBLEM - haproxy failover on dbproxy1001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:05:21] PROBLEM - haproxy failover on dbproxy1006 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:05:23] here they are [16:06:12] (03PS5) 10ArielGlenn: Move default config into a file [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [16:06:14] (03PS1) 10ArielGlenn: use single config object for all conf setting lookups [dumps] - 10https://gerrit.wikimedia.org/r/340759 [16:08:59] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3068053 (10Papaul) @fgiunchedi are all the previous steps done? (removed from puppet. icinga...) ? I can just start from removing systems from DNS and begin the disk wi... 
[16:10:11] RECOVERY - haproxy failover on dbproxy1001 is OK: OK check_failover servers up 2 down 0 [16:10:21] RECOVERY - haproxy failover on dbproxy1006 is OK: OK check_failover servers up 2 down 0 [16:13:41] moritzm: I tried to execute puppet cert clean mc2009.etc.. and it successfully removed it, weird.. maybe puppet node clean didn't work? [16:16:01] PROBLEM - Host mc2010 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:04] <_joe_> elukey: puppet node clean [16:16:12] <_joe_> in a meeting sorry [16:16:33] 06Operations, 10DBA, 13Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#3068092 (10jcrespo) m1 slave db1001 has been restarted and TLS enabled. [16:20:07] yeah I tried that too but I don't see changes in einsteinium's puppet run [16:21:58] mutante: hello! Let me know when you are online, mc200* decom might need some more work [16:24:49] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3068098 (10RobH) [16:25:20] (03PS4) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/340726 [16:26:07] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3066883 (10RobH) I've updated the base task description with the checklist for decommissioning (that is listed off the [[ https://wikitech.wikimedia.org/wiki/Server_Lifecy... 
[16:27:24] elukey: check puppetdb data with curl ;) [16:27:38] !log puppet disabled on authdns production boxes, for hacky testing of discovery-related commits [16:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:51] PROBLEM - Host mc2011 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:06] (03PS16) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [16:29:01] elukey: it doesn't result deactivated [16:29:10] curl https://nitrogen.eqiad.wmnet/v3/nodes/mc2009.codfw.wmnet [16:29:41] PROBLEM - Host mc2012 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:46] volans: I am still ignorant about that part, thanks :) [16:37:59] (03PS1) 10Urbanecm: New throttle rule for WMUK [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340765 (https://phabricator.wikimedia.org/T159454) [16:40:36] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3068139 (10fgiunchedi) [16:41:19] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3066883 (10fgiunchedi) @papaul I've effectively decomissioned the systems from production and they are still in puppet but running the spare role in puppet, I've marked as... [16:50:05] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3068147 (10Dzahn) I have never intended to use backports on planet2001 nor any idea why it would be different from others. It is supposed to be like planet1001 and uses identical roles. [16:50:42] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068148 (10Dzahn) @Papaul @elukey can you confirm mc2008-mc2016 are physically shutdown? 
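Note on the PuppetDB check suggested at 16:27 above: the curl call returns a JSON document for the node, and a decommissioned host should carry a non-null `deactivated` timestamp. A small sketch of that check follows, assuming the v3 nodes endpoint returns an object with a `deactivated` field (null while the node is active). The sample payloads and the `example-host.example.org` name are invented for illustration and are not output from the real server.

```python
import json
import urllib.request


def fetch_node(base_url, hostname):
    """Fetch the node document as text (not called here; needs network access)."""
    with urllib.request.urlopen(f"{base_url}/v3/nodes/{hostname}") as resp:
        return resp.read().decode("utf-8")


def node_is_deactivated(node_json):
    """True if the node document has a non-null 'deactivated' timestamp."""
    node = json.loads(node_json)
    return node.get("deactivated") is not None


# Illustrative payloads only; field values are made up.
sample_active = json.dumps(
    {"name": "example-host.example.org", "deactivated": None}
)
sample_deactivated = json.dumps(
    {"name": "example-host.example.org", "deactivated": "2017-03-02T16:00:00.000Z"}
)

print(node_is_deactivated(sample_active))
print(node_is_deactivated(sample_deactivated))
```

If the node still shows as active after `puppet node clean`, that would match what was seen here: the host keeps appearing in monitoring until it is deactivated in PuppetDB.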
[16:50:45] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3068149 (10chasemp) [16:51:21] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:52:02] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3068162 (10Dzahn) I can just reinstall this machine easily, but that doesn't mean i know why it had this state. [16:52:30] !log puppet re-enabled on authdns production boxes [16:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:08] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3068171 (10Dzahn) rm /etc/apt/sources.list.d/debian-backports.list vi /etc/apt/sources/wikimedia.list apt-get upgrade.. upgraded apache [16:55:21] PROBLEM - Host mc2019 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:38] what?? [16:55:43] papaul: you therE? [16:55:47] mc2019 is a prod host [16:56:01] PROBLEM - Host es2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:01] PROBLEM - Host ms-be2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:10] 06Operations: Inconsistent package status on planet2001 - https://phabricator.wikimedia.org/T159432#3068176 (10Dzahn) a:03Dzahn [16:56:11] PROBLEM - Host es2014 is DOWN: PING CRITICAL - Packet loss = 100% [16:56:25] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068178 (10Papaul) @Dzahn mc2008- mc2012 are physically shutdown since disk wipe is in progress on those systems and not mc2013-mc2016 [16:56:25] this is weird [16:56:33] elukey: yes [16:56:41] es down? 
[16:56:51] PROBLEM - Host ripe-atlas-codfw is DOWN: PING CRITICAL - Packet loss = 100% [16:57:16] papaul: sorry I wanted to ask if you shut down mc2019 but there seems to be an issue in codfw [16:57:21] RECOVERY - Host ms-be2001 is UP: PING OK - Packet loss = 0%, RTA = 36.55 ms [16:57:29] marostegui, ^ [16:57:33] RECOVERY - Host es2001 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [16:57:33] RECOVERY - Host es2014 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [16:57:33] RECOVERY - Host mc2019 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [16:57:41] is it network only? [16:57:41] did a switch reboot? [16:57:53] maybe we could check the rack [16:58:28] yes they are on the same rack afaics [16:58:34] yep seems so [16:58:40] elukey: mc2019 is on on my side [16:59:12] it was network [16:59:20] I thought it was power [16:59:22] papaul: I think there was an issue with rack A1, probably a switch [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T1700). Please do the needful. [17:00:05] Pchelolo: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:10] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068182 (10Dzahn) Can we shutdown mc2013-mc2016 as well? [17:01:22] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068183 (10Dzahn) I ran the exact same commands from above again, it removed 2008,2009 and 2012 from servermon. 
[17:02:01] RECOVERY - Host ripe-atlas-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [17:03:28] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068200 (10Papaul) @Dzahn mc2013-mc2016 are down [17:03:57] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068201 (10Dzahn) Same for salt-keys. They were back and had to repeat the same command because servers were not shutdown yet. (2008 thru 2016) [17:04:01] PROBLEM - Host mc2016 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:01] PROBLEM - Host mc2014 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:21] PROBLEM - Host mc2015 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:31] PROBLEM - Host mc2013 is DOWN: PING CRITICAL - Packet loss = 100% [17:04:33] elukey: those are decomm? [17:04:40] yes [17:04:47] why are they alarming? [17:04:57] I am connected to es2014 [17:05:03] but it is alerting soft to me [17:06:02] the procedure says: "Disable ALL service level checks in icinga for host." [17:06:10] source: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission [17:06:20] volans: maybe during the decom process some steps were not done [17:07:19] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068208 (10Dzahn) >>! In T157675#3068200, @Papaul wrote: > @Dzahn mc2013-mc2016 are down Thanks. I ran the commands a third time. Running pupp... [17:09:26] to answer the earlier question, yes looks like one switch rebooted [17:09:31] System booted: 2017-03-02 16:55:13 UTC (00:13:06 ago) [17:09:44] is the root cause known yet? 
[17:10:04] (03PS1) 10Giuseppe Lavagetto: conftool: change of schema for discovery [puppet] - 10https://gerrit.wikimedia.org/r/340771 [17:10:04] godog: thanks for checking, not that I'm aware off [17:13:05] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068219 (10Papaul) @robh if you want to disable switch ports please see below for switch port informationn Rack A2 mc2001 xe-2/0/0 mc2002 xe-... [17:13:11] papaul: maybe some physical work around asw-a1 and power was disconnected ? [17:13:30] godog: no work there [17:14:03] godog: work on row B and C on mc servers [17:14:07] win 7 [17:14:26] godog: I think that Faidon is investigating it [17:15:03] paravoid: ^ [17:15:17] elukey: ack, thanks! [17:15:42] godog: out of curiosity, where did you find the "system booted etc.." ? [17:16:00] I mean, what command did you use on asw-a :) [17:16:10] elukey: 'show system uptime' will list all members [17:16:18] ahhhhhh [17:16:49] fpc1 [17:16:51] ok got it [17:17:14] will add it to https://wikitech.wikimedia.org/wiki/Network_cheat_sheet [17:17:17] thanks :) [17:17:31] ema --^ [17:18:16] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:35] elukey: np! [17:19:03] Pchelolo: around? [17:19:22] I'm here godog [17:19:22] I was looking at your restbase patch now [17:20:13] elukey: the mc servers that are supposed to be decom'ed are gone from Icinga now. not all of them were physically shut down yet. but now they are. [17:20:14] what do you think about it? [17:20:26] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:20:31] mutante: was it a matter of re-running the node clean? 
[17:20:34] and it's all unrelated to issues with other hosts [17:21:15] elukey: "node clean" && "node deactivate" on puppetmaster1001, then puppet run on einsteinium. and salt-key -d on neodymium [17:21:42] <_joe_> mutante: but that should be done by dcops now [17:21:47] if servers are still running and this is done, then it will first look like they are gone, but then they come back later [17:21:49] <_joe_> as part of their decom routine [17:21:59] <_joe_> or did I understand it wrong? [17:22:14] don't know, just helping [17:22:28] (03PS2) 10Giuseppe Lavagetto: conftool: change of schema for discovery [puppet] - 10https://gerrit.wikimedia.org/r/340771 [17:22:39] Pchelolo: LGTM so far, checking more [17:22:56] so in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_ANY_Opsen there is a specific mention of "Disable ALL service level checks in icinga for host." that I missed for mc2001->mc2016 [17:22:59] <_joe_> bblack: ok to go? ^^ [17:23:17] I thought this step was part of the puppet clean up etc.. [17:23:31] so my bad [17:23:34] for the alarms [17:23:51] papaul: missed one step before handing over to you, sorry! [17:24:27] godog kk. I know there were concerns about doing it before, but I've put a lot of work into resolving potential problems like if the disk fills up or something happens, so we're very sure now the service would be ok [17:24:33] elukey: no problem no need to be sorry [17:25:06] PROBLEM - puppet last run on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:25:26] PROBLEM - SSH on bast3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:56] also down, network issues again? [17:25:58] mmmh doesn't look good [17:26:02] I cannot ssh [17:26:06] PROBLEM - configured eth on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:06] PROBLEM - Check systemd state on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:26:06] PROBLEM - Check whether ferm is active by checking the default input chain on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:06] PROBLEM - DPKG on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:06] PROBLEM - Disk space on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:06] PROBLEM - Check size of conntrack table on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:07] PROBLEM - dhclient process on bast3001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:26:08] PROBLEM - salt-minion processes on bast3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:10] (03CR) 10BBlack: [C: 031] conftool: change of schema for discovery [puppet] - 10https://gerrit.wikimedia.org/r/340771 (owner: 10Giuseppe Lavagetto) [17:26:15] _joe_: I think so [17:26:24] bast3001 died? [17:26:30] mutante: ^ [17:26:33] <_joe_> bblack: of course the schema change will propagate slowly once I've merged [17:26:33] what a timing, lol [17:26:50] actually I can establish an ssh connection ,take forever to give me the prompt [17:26:55] but we don't come unprepared for that :) [17:27:09] <_joe_> volans: so it's overloaded [17:27:09] bblack: bast3002 to the rescue [17:27:26] _joe_: probably, I can confirm once I'll be in... [17:27:26] volans, is it the machine, then, not the network? [17:27:27] <_joe_> mutante: ahah that right? [17:27:30] i was gonna switch over the prometheus role about now [17:27:39] sorry, I did the same question [17:27:45] most other things are already there [17:27:55] (03PS2) 10Urbanecm: Add new rules for WMUK [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340765 (https://phabricator.wikimedia.org/T159454) [17:28:08] _joe_: yea, i installed bast3002 to replace it because it was about to die [17:28:36] what's sucking up all the cpu on 3001? 
[17:28:56] RECOVERY - Check size of conntrack table on bast3001 is OK: OK: nf_conntrack is 0 % full [17:28:56] RECOVERY - Disk space on bast3001 is OK: DISK OK [17:28:56] RECOVERY - Check systemd state on bast3001 is OK: OK - running: The system is fully operational [17:28:56] RECOVERY - salt-minion processes on bast3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:28:56] RECOVERY - dhclient process on bast3001 is OK: PROCS OK: 0 processes with command name dhclient [17:28:56] RECOVERY - configured eth on bast3001 is OK: OK - interfaces up [17:29:03] grof grafana also IO [17:29:06] RECOVERY - DPKG on bast3001 is OK: All packages OK [17:29:06] RECOVERY - Check whether ferm is active by checking the default input chain on bast3001 is OK: OK ferm input default policy is set [17:29:08] 1722 prometh+ 20 0 9699052 7.216g 0 S 33.6 92.7 41090:34 prometheus [17:29:11] 17:29:05 up 56 days, 6:14, 4 users, load average: 35.35, 27.35, 13.19 [17:29:15] i'll do the things i was going to do anyways to get rid of bast3001 asap [17:29:16] (03CR) 10jerkins-bot: [V: 04-1] Add new rules for WMUK [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340765 (https://phabricator.wikimedia.org/T159454) (owner: 10Urbanecm) [17:29:16] RECOVERY - SSH on bast3001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [17:29:19] prometheus is heh [17:29:24] yep [17:29:35] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: change of schema for discovery [puppet] - 10https://gerrit.wikimedia.org/r/340771 (owner: 10Giuseppe Lavagetto) [17:29:38] it (prom) also has 7.2GB resident mem on an 8GB box :) [17:29:45] (03PS3) 10Dzahn: switch prometheus.eqiad to bast3002 [dns] - 10https://gerrit.wikimedia.org/r/340272 (https://phabricator.wikimedia.org/T156506) [17:29:46] ^ [17:30:00] Mar 2 17:29:00 bast3001 prometheus@ops[1722]: time="2017-03-02T17:28:59Z" level=info msg="Done checkpointing in-memory metrics and chunks in 5m4.639712687s." 
[17:30:01] .eqiad to 3002? [17:30:04] i would have also merged that without 3001 breaking just now [17:30:06] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures [17:30:20] (03PS3) 10Urbanecm: Add new rules for WMUK [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340765 (https://phabricator.wikimedia.org/T159454) [17:30:22] oh just commitmsg [17:30:29] it is also ok to restart prometheus on bast3001 if it is misbehaving [17:30:30] it's actually prometheus.esams in the zonefile [17:30:31] bblack: no, oops, esams, yep [17:30:58] (03PS4) 10Dzahn: switch prometheus.esams to bast3002 [dns] - 10https://gerrit.wikimedia.org/r/340272 (https://phabricator.wikimedia.org/T156506) [17:31:22] godog: looks like it was the checkpointing [17:31:23] the timing was really fun because we had JUST finished a super long rsync to get the data across [17:32:08] (03CR) 10Dzahn: [C: 032] switch prometheus.esams to bast3002 [dns] - 10https://gerrit.wikimedia.org/r/340272 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [17:32:29] volans: could be yeah [17:32:37] 06Operations, 10netops: asw-a1-codfw spontaneous reboot - https://phabricator.wikimedia.org/T159464#3068307 (10faidon) [17:32:58] elukey, godog, jynus etc. ^ [17:33:04] brb [17:33:07] thanks! [17:33:12] paravoid: thanks! [17:34:15] Pchelolo: do you have a sample log file and/or a sense of how big it can get? [17:34:36] patch LGTM but I'd rather merge it on the next puppetswat on tues rather than thurs [17:35:10] godog: we can look at logstash - normally we get ~10 logs per sec over all the machines [17:35:50] ok with me merging on tue [17:35:56] PROBLEM - HP RAID on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[17:36:18] godog: I will edit the wikitech page to move it [17:36:29] (03PS10) 10MarcoAurelio: Rename 'technician' to 'interface-editor' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) [17:36:52] Pchelolo: awesome, TYVM [17:38:19] switched. i see https://grafana.wikimedia.org/dashboard/db/prometheus-global-overview?var-datasource=codfw%20prometheus%2Fglobal&var-site=esams and that [17:38:32] (03CR) 10MarcoAurelio: Rename 'technician' to 'interface-editor' on trwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) (owner: 10MarcoAurelio) [17:38:39] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad [17:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:13] !log bblack@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=appservers-rw,name=eqiad [17:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:23] !log bblack@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=appservers-ro,name=codfw [17:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:36] RECOVERY - HP RAID on dbstore2001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [17:46:16] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:49:58] (03PS8) 10Muehlenhoff: Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) [17:50:26] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:51:00] !log disabling puppet on authdns prod machines for hacky discovery testing [17:51:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:46] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:51:47] (03PS17) 10BBlack: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [17:54:30] (03PS1) 10RobH: archiva.w.o ssl check changing to LE interval [puppet] - 10https://gerrit.wikimedia.org/r/340778 [17:57:08] papaul: I've downtimed graphite2001, good to replace all 4x ssd [17:57:28] (03CR) 10Muehlenhoff: [C: 032] Script for offboarding a user from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/340346 (https://phabricator.wikimedia.org/T142825) (owner: 10Muehlenhoff) [17:57:59] (03CR) 10Dzahn: [C: 031] archiva.w.o ssl check changing to LE interval [puppet] - 10https://gerrit.wikimedia.org/r/340778 (owner: 10RobH) [17:58:00] godog: have a question on graphite2001 maybe i am missing something [17:58:06] (03CR) 10RobH: [C: 032] archiva.w.o ssl check changing to LE interval [puppet] - 10https://gerrit.wikimedia.org/r/340778 (owner: 10RobH) [17:58:12] godog: graphite2001 is it HP or Dell? [17:58:15] (03PS2) 10RobH: archiva.w.o ssl check changing to LE interval [puppet] - 10https://gerrit.wikimedia.org/r/340778 [17:58:33] godog: because on site it is a Dell server [17:58:42] godog: and i have HP SSD's [17:59:08] papaul: yeah it is dell indeed [17:59:16] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:59:21] robh: ^ [17:59:51] they are aftermarket ssds from dasher [17:59:53] (03PS18) 10BBlack: DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [17:59:55] so doesnt matter =] [18:00:04] they should still be intel s3610 SSDs [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T1800). Please do the needful. [18:00:13] if they detect as something else when used, let me know [18:00:26] robh: graphite2001 has 3.5" disk and not 2.5" [18:00:34] and the HP dists are 2.5" [18:00:50] graphite2001 has SSDS in 2.5 to 3.5" brackets no? [18:01:05] (03CR) 10jerkins-bot: [V: 04-1] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [18:01:07] my understanding is it has ssds, and ssds dont fit in 3.5" bays without adapters. [18:01:08] robh:i have to open the server to check [18:01:16] i think you'll find it does ;] [18:01:32] its the older style cabled bays for that reason [18:01:42] robh: let me check [18:01:54] godog: is the server off? [18:02:07] (03PS19) 10BBlack: DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) [18:03:03] godog: can i poweroff graphite2001 ? 
[18:03:11] papaul: yes you can [18:03:17] godog: ok thanks [18:04:01] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068446 (10RobH) [18:06:32] (03CR) 10BBlack: [C: 032] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [18:07:24] robh: your right there are SSD DC S3500 series [18:09:13] (03PS1) 10Chad: Group2 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340783 [18:09:25] (03CR) 10Chad: [C: 04-2] "l8r gator" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340783 (owner: 10Chad) [18:09:49] (03PS1) 10BBlack: dns discovery: add the data to authdns::testns too [puppet] - 10https://gerrit.wikimedia.org/r/340784 [18:10:13] (03CR) 10BBlack: [V: 032 C: 032] dns discovery: add the data to authdns::testns too [puppet] - 10https://gerrit.wikimedia.org/r/340784 (owner: 10BBlack) [18:12:10] RoanKattouw: Do you need me to do that deploy? I'm already on tin so thought I'd offer :) [18:12:17] RainbowSprinkles: Yes please [18:13:06] * RainbowSprinkles waits for the merge [18:13:07] (03PS1) 10BBlack: dns discovery: quote active_active keys in templates [puppet] - 10https://gerrit.wikimedia.org/r/340785 [18:13:34] (03CR) 10BBlack: [V: 032 C: 032] dns discovery: quote active_active keys in templates [puppet] - 10https://gerrit.wikimedia.org/r/340785 (owner: 10BBlack) [18:15:24] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068506 (10RobH) >>! In T157675#3068219, @Papaul wrote: > @robh if you want to disable switch ports please see below for switch port informatio... 
[18:15:34] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3068520 (10RobH) [18:17:23] (03PS3) 10RobH: archiva.w.o ssl check changing to LE interval [puppet] - 10https://gerrit.wikimedia.org/r/340778 [18:18:24] (03PS1) 10BBlack: dns discovery: watch_keys is an array [puppet] - 10https://gerrit.wikimedia.org/r/340786 [18:18:38] (03CR) 10BBlack: [V: 032 C: 032] dns discovery: watch_keys is an array [puppet] - 10https://gerrit.wikimedia.org/r/340786 (owner: 10BBlack) [18:18:40] (03CR) 10RobH: [V: 032 C: 032] archiva.w.o ssl check changing to LE interval [puppet] - 10https://gerrit.wikimedia.org/r/340778 (owner: 10RobH) [18:19:19] .... im in rebase race condition helll [18:19:29] with me? :) [18:19:45] yessss [18:19:46] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:19:57] now it wont let me rebase without changing the revision... [18:19:58] I'm in "it's so fun to test for syntax errors via production merges" hell :) [18:19:58] RainbowSprinkles: It just merged [18:20:01] wtf [18:20:46] (03PS4) 10RobH: archiva.w.o ssl check changing to LE interval [puppet] - 10https://gerrit.wikimedia.org/r/340778 [18:21:59] !log demon@tin Synchronized php-1.29.0-wmf.14/includes/changes/EnhancedChangesList.php: T159466 (duration: 00m 40s) [18:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:06] T159466: Change in Watchlist behaviour causing breakages in gadgets - https://phabricator.wikimedia.org/T159466 [18:22:36] RoanKattouw: ^ you're live [18:22:44] Thanks [18:22:46] yw [18:25:13] godog: SSD in place and reimage in progress [18:26:37] papaul: sweet, thanks! 
[18:26:55] godog: no problem [18:27:16] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:28:04] godog: can you please give me the task info so i can update my spare sheet with the old SSD [18:28:50] papaul: that'd be https://phabricator.wikimedia.org/T157153 [18:29:14] godog: thanks [18:30:18] papaul: the old ssds will have to likely be destroyed, but set them aside (for grpahte2001) [18:30:30] since they are near failure, we may be able to hook them up to another system [18:30:33] and wipe though. [18:30:39] (since htey have not actually failed) [18:31:56] papaul: but most definitely dont just put the old SSDs on the shelf for spare. They are going to either get wiped and sent back for manufacter's warranty repair, or we'll destory them. [18:31:59] wow, typos. [18:32:36] robh: so no need to put them in the spare sheet? [18:32:50] correct, but you should make a sub-task so we dont forget them [18:32:58] robh: ok [18:33:02] a sub-task to attach to a spare system and attempt SSD wipe (different than normal wipe) [18:33:08] or that we'll destroy or something. [18:33:42] since the disk degauser doesnt work on ssds, it has to be more destructive. ideally we can still attach to a spare pool system, wipe the SSDs, and then get destroy. [18:33:46] safer than just destruction. 
[18:37:41] (03PS1) 10BBlack: dns discovery: separate map [puppet] - 10https://gerrit.wikimedia.org/r/340788 [18:38:32] (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: adapt to new schema format [puppet] - 10https://gerrit.wikimedia.org/r/340789 [18:39:51] (03PS2) 10Giuseppe Lavagetto: profile::discovery::client: adapt to new schema format [puppet] - 10https://gerrit.wikimedia.org/r/340789 [18:39:53] (03CR) 10BBlack: [C: 032] dns discovery: separate map [puppet] - 10https://gerrit.wikimedia.org/r/340788 (owner: 10BBlack) [18:40:04] <_joe_> argh merge-sniped again! [18:40:24] it is like whack-a-mole [18:41:04] <- mole [18:41:11] (03PS3) 10Giuseppe Lavagetto: profile::discovery::client: adapt to new schema format [puppet] - 10https://gerrit.wikimedia.org/r/340789 [18:41:19] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::discovery::client: adapt to new schema format [puppet] - 10https://gerrit.wikimedia.org/r/340789 (owner: 10Giuseppe Lavagetto) [18:44:11] dammit I thought I had a gif [18:44:32] 06Operations: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401#3068626 (10MoritzMuehlenhoff) These are fully rolled out: libpng cairo python-crypto gnutls28 glibc [18:47:51] papaul: I'll restart the reimage, but first change it to jessie [18:48:39] (03PS2) 10Smalyshev: Add more metrics to Blazegraph monitoring [puppet] - 10https://gerrit.wikimedia.org/r/340695 [18:48:44] (03PS1) 10Filippo Giunchedi: install_server: graphite2001 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/340796 [18:49:02] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] install_server: graphite2001 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/340796 (owner: 10Filippo Giunchedi) [18:50:32] (03PS5) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/340726 [18:54:46] (03CR) 10Giuseppe Lavagetto: [C: 032] 
role::puppetmaster::frontend: include profile::configmaster [puppet] - 10https://gerrit.wikimedia.org/r/340726 (owner: 10Giuseppe Lavagetto) [18:55:01] (03PS2) 10Addshore: Add beta hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340559 (https://phabricator.wikimedia.org/T158628) [18:55:06] godog: ok [18:55:09] Hello Wikimedia Operations - I would like to start contributing. This task seems easy enough: https://phabricator.wikimedia.org/T159438. Mind if I assign it to myself? [18:56:00] I can do SWAT (as 2 of the changes are mine) [18:56:47] evanstucker: You may want to simply comment on the task that you're preparing a patchset and then paste the gerrit link into the task (or add Bug:task# in your commit message) [18:57:18] if you assign it to yourself, most will siply then ignore it while its assigned, which may not be ideal if its your first time trying to commit something to our repo =] [18:57:49] Also, if your patchset has Bug:T159438 as the bottom line, it'll comment about it automatically in the task [18:57:50] T159438: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438 [18:57:57] robh: sounds good. Yeah, I need to go pull your Puppet repo and figure out how to patch things. [18:58:47] most of the details for our repos are on https://wikitech.wikimedia.org/wiki/Help:Git [18:59:01] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:59:09] but feel free to ask any questions (usually folks in here are about and will answer) [18:59:38] Also this rooms topic has the ops clinic person [18:59:43] I'm not sure it's a puppet issue at all, foreachwiki is just a simple wrapper for running maintenance scripts [19:00:03] Ideally, we need to improve error reporting so we can see why T159438 causes failures to begin with [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T1900). Please do the needful. [19:00:05] Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:15] *waves* I can do swat [19:00:16] (it's probably an issue in MWScript or Echo) [19:01:17] Present [19:01:23] (03CR) 10Addshore: [C: 032] Add new rules for WMUK [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340765 (https://phabricator.wikimedia.org/T159454) (owner: 10Urbanecm) [19:02:02] o/ [19:03:01] (03Merged) 10jenkins-bot: Add new rules for WMUK [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340765 (https://phabricator.wikimedia.org/T159454) (owner: 10Urbanecm) [19:03:20] (03CR) 10jenkins-bot: Add new rules for WMUK [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340765 (https://phabricator.wikimedia.org/T159454) (owner: 10Urbanecm) [19:03:24] I'm guessing that foreachwikiindblist is just outputting that message to STDERR instead of STDOUT. I just need to figure out when that output was added and either move it to STDOUT or otherwise get rid of it. [19:03:40] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:05:08] syncing [19:05:47] !log addshore@tin Synchronized wmf-config/throttle.php: [[gerrit:340765|Add new rules for WMUK]] T159454 T159461 (duration: 00m 43s) [19:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:54] T159454: Lift account registration cap for event (5 March 2017) - https://phabricator.wikimedia.org/T159454 [19:05:54] T159461: Lift account registration cap for event (8 March 2017) - https://phabricator.wikimedia.org/T159461 [19:06:39] addshore, was deployed? [19:06:46] Urbanecm: yup [19:06:53] 06Operations, 10MediaWiki-General-or-Unknown: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3067511 (10Evanstucker) I'm working on a patchset for this... This is my first time contributing, so I need to figure out where this script is located and how to patch it. Bear with me... [19:07:10] Thank you [19:08:15] RainbowSprinkles: can I just get you to quickly check https://gerrit.wikimedia.org/r/#/c/340559/2 ? [19:08:50] * RainbowSprinkles looks [19:08:52] I ran addWiki on deployment-prep / beta yesterday, and afaik this is the next step [19:09:13] Looks fine if addWiki already ran [19:09:20] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:09:25] RainbowSprinkles: epic! 
[19:09:28] [[wikitech:Add a new wiki]] is your guide (though it's for production, so YMMV) [19:09:36] (03CR) 10Addshore: [C: 032] Add beta hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340559 (https://phabricator.wikimedia.org/T158628) (owner: 10Addshore) [19:10:39] (03Merged) 10jenkins-bot: Add beta hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340559 (https://phabricator.wikimedia.org/T158628) (owner: 10Addshore) [19:10:59] (03CR) 10jenkins-bot: Add beta hewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340559 (https://phabricator.wikimedia.org/T158628) (owner: 10Addshore) [19:11:33] 06Operations, 10MediaWiki-General-or-Unknown: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3067511 (10demon) The two things you'll be interested in are the [[ /source/mediawiki-config/browse/master/multiversion/MWScript.php | MWScript ]] wrapper and the [[ /diffusion/ECHO/br... [19:13:32] (03PS1) 10Giuseppe Lavagetto: pybal::web::service: fixup [puppet] - 10https://gerrit.wikimedia.org/r/340802 [19:13:56] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] pybal::web::service: fixup [puppet] - 10https://gerrit.wikimedia.org/r/340802 (owner: 10Giuseppe Lavagetto) [19:14:55] RainbowSprinkles: thanks! 
[19:15:08] yw [19:15:29] !log addshore@tin Synchronized wikiversions-labs.json: [[gerrit:340559|Add beta hewiktionary]] T158628 1/2 NOOP (duration: 00m 42s) [19:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:34] T158628: Create beta hewiktionary for testing InterwikiSorting & Cognate - https://phabricator.wikimedia.org/T158628 [19:16:47] (03Draft1) 10Paladox: Gerrit: Fix bot so that it checks against *-name and *-username [puppet] - 10https://gerrit.wikimedia.org/r/340801 [19:16:49] !log addshore@tin Synchronized dblists/all-labs.dblist: [[gerrit:340559|Add beta hewiktionary]] T158628 2/2 NOOP (duration: 00m 39s) [19:16:50] (03PS2) 10Paladox: Gerrit: Fix bot so that it checks against *-name and *-username [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T159075) [19:16:50] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[ldap-utils],Package[python-mwclient],Package[tzdata] [19:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:00] (03PS3) 10Paladox: Gerrit: Fix bot so that it checks against *-name and *-username [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T159075) [19:17:00] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:17:50] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [19:18:21] (03PS1) 10BBlack: dns discovery: use data.pooled in template [puppet] - 10https://gerrit.wikimedia.org/r/340804 [19:18:45] (03CR) 10BBlack: [V: 032 C: 032] dns discovery: use data.pooled in template [puppet] - 10https://gerrit.wikimedia.org/r/340804 (owner: 10BBlack) [19:18:50] PROBLEM - confd service on puppetmaster2001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [19:19:10] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [19:19:38] (03CR) 10Paladox: "Not really sure if this will fix it, but it will use usernames instead of names." [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T159075) (owner: 10Paladox) [19:20:00] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [19:20:11] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [19:21:08] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3068704 (10Luke081515) [19:21:40] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [19:22:00] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [19:23:01] <_joe_> what's up with tzdata? 
[19:23:24] (03PS1) 10Giuseppe Lavagetto: confd::file: fix template [puppet] - 10https://gerrit.wikimedia.org/r/340805 [19:23:50] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] confd::file: fix template [puppet] - 10https://gerrit.wikimedia.org/r/340805 (owner: 10Giuseppe Lavagetto) [19:29:51] RECOVERY - confd service on puppetmaster2001 is OK: OK - confd is active [19:30:33] 06Operations, 10MediaWiki-General-or-Unknown: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3068724 (10Evanstucker) Another clue lies here: https://gerrit.wikimedia.org/r/#/c/322777/1/includes/exception/MWExceptionRenderer.php I'll have to investigate this later this evening.... [19:31:10] (03PS1) 10Giuseppe Lavagetto: profile::configmaster: remove redundant prefix definition [puppet] - 10https://gerrit.wikimedia.org/r/340807 [19:31:11] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:33:40] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::configmaster: remove redundant prefix definition [puppet] - 10https://gerrit.wikimedia.org/r/340807 (owner: 10Giuseppe Lavagetto) [19:37:21] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:38:38] 06Operations, 10MediaWiki-General-or-Unknown: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3068748 (10demon) Yeah that's what's producing the "set $wgShowDBErrorBacktrace" message. We'll likely want to change that in production for maintenance scripts. Better debugging :) [19:38:41] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:39:26] (03PS1) 10Giuseppe Lavagetto: pybal::web: fix dc directories [puppet] - 10https://gerrit.wikimedia.org/r/340808 [19:40:02] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] pybal::web: fix dc directories [puppet] - 10https://gerrit.wikimedia.org/r/340808 (owner: 10Giuseppe Lavagetto) [19:40:04] robh: any idea what the timeline is for https://phabricator.wikimedia.org/T158795 yet? (trying to determine how to plan) [19:40:34] (03PS1) 10Chad: Show backtraces on DB errors in cli context (maint scripts) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340809 [19:40:48] (03PS2) 10Chad: Show backtraces on DB errors in cli context (maint scripts) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340809 (https://phabricator.wikimedia.org/T159438) [19:41:13] RoanKattouw: Thoughts ^? [19:42:02] urandom: i'm not sure what the holdup was with the vendor, but since you pinged i pulled my email to them back up and replied asking what's up [19:42:13] !log maxsem@tin Started deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ [19:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:21] i know why though: they forgot in the shuffle of the 20 or so open quotes they have to redo as of march 1st. 
[19:42:36] !log maxsem@tin Finished deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ (duration: 00m 23s) [19:42:39] pricing increases industry wide on memory and solid state drives =P [19:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:52] !log maxsem@tin Started deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ [19:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:56] !log maxsem@tin Finished deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ (duration: 00m 03s) [19:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:08] !log maxsem@tin Started deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ [19:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:12] !log maxsem@tin Finished deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ (duration: 00m 03s) [19:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:30] !log maxsem@tin Started deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ [19:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:36] !log maxsem@tin Finished deploy [tilerator/deploy@edb97c5]: https://gerrit.wikimedia.org/r/#/c/340607/ (duration: 00m 05s) [19:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:12] the above is me having troubles with scap :O [19:49:53] !log maxsem@tin Started deploy [tilerator/deploy@0fe5a1d]: Reverting to previous version [19:49:56] !log maxsem@tin Finished deploy [tilerator/deploy@0fe5a1d]: Reverting to previous version [19:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:39] 
RainbowSprinkles: Commit msg sounds great, looking at the patch now [19:51:50] (03CR) 10Catrope: [C: 031] Show backtraces on DB errors in cli context (maint scripts) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340809 (https://phabricator.wikimedia.org/T159438) (owner: 10Chad) [19:51:57] OK well that was simple :) [19:53:39] Hey is there a way to view recent changes by size? Like only changes over x size? [19:53:54] !log maxsem@tin Started deploy [tilerator/deploy@edb97c5]: Trying https://gerrit.wikimedia.org/r/#/c/340607/ once again [19:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:58] !log maxsem@tin Finished deploy [tilerator/deploy@edb97c5]: Trying https://gerrit.wikimedia.org/r/#/c/340607/ once again (duration: 00m 04s) [19:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:03] RoanKattouw: Awesome, thx. Trying to make it actually possible to debug maintenance scripts when they fail ;-) [19:55:11] That would be nice [19:55:19] (03PS3) 10Chad: Show backtraces on DB errors in cli context (maint scripts) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340809 (https://phabricator.wikimedia.org/T159438) [19:55:24] (03CR) 10Chad: [C: 032] Show backtraces on DB errors in cli context (maint scripts) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340809 (https://phabricator.wikimedia.org/T159438) (owner: 10Chad) [19:56:30] (03Merged) 10jenkins-bot: Show backtraces on DB errors in cli context (maint scripts) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340809 (https://phabricator.wikimedia.org/T159438) (owner: 10Chad) [19:56:39] (03CR) 10jenkins-bot: Show backtraces on DB errors in cli context (maint scripts) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340809 (https://phabricator.wikimedia.org/T159438) (owner: 10Chad) [19:57:01] !log bblack@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=appservers-rw [19:57:06] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:14] !log bblack@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=appservers-rw,named=eqiad [19:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:18] !log demon@tin Synchronized wmf-config/CommonSettings.php: Stacktraces are useful when cli scripts fail (duration: 00m 56s) [19:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:36] 06Operations, 10MediaWiki-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3068818 (10demon) So I bumped the error reporting here, next time it fails we should get better info :D [20:00:05] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T2000). Please do the needful. [20:00:13] choo choo [20:02:54] (03CR) 10Chad: [C: 032] Group2 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340783 (owner: 10Chad) [20:04:42] (03Merged) 10jenkins-bot: Group2 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340783 (owner: 10Chad) [20:05:00] (03CR) 10jenkins-bot: Group2 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340783 (owner: 10Chad) [20:05:22] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.14 [20:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:41] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:06:13] (03CR) 10Dzahn: [C: 04-2] install: remove bast3001 from puppet and smokeping [puppet] - 10https://gerrit.wikimedia.org/r/340169 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:06:37] 06Operations, 06Services, 10Traffic, 07Performance: Look into solutions for replaying traffic to testing environment(s) - 
https://phabricator.wikimedia.org/T129682#2112642 (10Eevans) [20:07:38] robh: thanks! [20:21:31] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [20:22:36] (03PS2) 10Dzahn: smokeping: replace bast3001 with bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340169 (https://phabricator.wikimedia.org/T156506) [20:23:12] (03CR) 10Dzahn: [C: 032] smokeping: replace bast3001 with bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340169 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:24:14] 06Operations, 06Commons, 10MediaWiki-extensions-GWToolset, 06Multimedia, 07Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#3068879 (10Multichill) 05Open>03Resolved @harej : Last activity over a year ago, no blockers so I'm closin... [20:30:33] (03PS1) 10Dzahn: prometheus: remove bast3001 as esams server, keep bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340811 (https://phabricator.wikimedia.org/T156506) [20:30:51] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:31:25] (03PS2) 10Dzahn: prometheus: remove bast3001 as esams node, keep bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340811 (https://phabricator.wikimedia.org/T156506) [20:32:34] (03CR) 10Dzahn: [C: 032] "host prometheus.svc.esams.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/340811 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:32:51] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:39:53] (03PS1) 10Dzahn: bast3001: remove puppet roles, add role::spare for decom [puppet] - 10https://gerrit.wikimedia.org/r/340812 (https://phabricator.wikimedia.org/T156506) [20:42:01] (03PS1) 10Dzahn: bast3001: remove from network/constants.pp [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) [20:44:35] 06Operations, 10hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068928 (10Dzahn) [20:44:55] 06Operations, 10hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068928 (10Dzahn) This checklist is able to be copied and pasted into phabricator hardware request tasks for reclaiming systems to spare or decom. [] - all system services confirmed offline from production use... [20:46:01] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:46:23] 06Operations, 10hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068940 (10Dzahn) Common name: bast3001 Object type: Server Visible label: bast3001 Asset tag is missing. Explicit tags: Amsterdam https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=... [20:47:01] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:47:41] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:48:01] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:48:11] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:48:22] 06Operations, 10hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068948 (10Dzahn) [20:48:34] 06Operations, 10hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068928 (10Dzahn) [20:48:51] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[python-mwclient],Package[tzdata] [20:48:54] 06Operations, 10hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068928 (10Dzahn) a:03Dzahn [20:49:01] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:49:01] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:49:11] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:31] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:50:07] (03PS2) 10Dzahn: bast3001: remove puppet roles, add role::spare for decom [puppet] - 10https://gerrit.wikimedia.org/r/340812 (https://phabricator.wikimedia.org/T156506) [20:54:01] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [20:54:11] (03PS3) 10Dzahn: bast3001: remove puppet roles, add role::spare for decom [puppet] - 10https://gerrit.wikimedia.org/r/340812 (https://phabricator.wikimedia.org/T156506) [20:54:36] (03CR) 10Dzahn: [C: 032] 
bast3001: remove puppet roles, add role::spare for decom [puppet] - 10https://gerrit.wikimedia.org/r/340812 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [20:55:01] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4063710 keys, up 122 days 12 hours - replication_delay is 0 [20:55:02] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [20:56:01] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4063509 keys, up 122 days 12 hours - replication_delay is 0 [20:58:51] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:01:51] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:02:47] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3068974 (10hashar) Nodepool emits a metric similar to the fullstack one. That is from the time an instance is created internally to nodepool until the time it has been added as a Jenkins slave.... [21:07:51] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:14] (03PS1) 10Dzahn: bast3001: fix spare class role name [puppet] - 10https://gerrit.wikimedia.org/r/340827 [21:09:31] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:10:30] (03CR) 10Dzahn: [C: 032] bast3001: fix spare class role name [puppet] - 10https://gerrit.wikimedia.org/r/340827 (owner: 10Dzahn) [21:12:21] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3069031 (10hashar) There are some labvirt* that shows a bump in CPU guest / Load average. labvirt1001 specially is concerning (load raising to 45+). Graphs over 30 days: [[ https://grafana.wiki... [21:13:49] (03PS2) 10Dzahn: bast3001: fix spare class role name [puppet] - 10https://gerrit.wikimedia.org/r/340827 [21:16:41] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [21:18:56] 06Operations, 10hardware-requests, 13Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3069075 (10Dzahn) [21:19:31] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:30:21] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:35:51] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:36:15] (03PS2) 10Dzahn: bast3001: remove from network/constants.pp [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) [21:36:18] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3069092 (10chasemp) Update: I believe we have worked out a plan to move forward with labstore1006/7 in FY 16/17 [21:36:29] (03CR) 10jerkins-bot: [V: 04-1] bast3001: remove from network/constants.pp [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:37:14] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3069095 (10Dzahn) [21:37:16] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#3069096 (10Dzahn) [21:37:18] 06Operations, 10ops-esams, 10hardware-requests, 13Patch-For-Review: reclaim hooft to spares - https://phabricator.wikimedia.org/T131560#3069097 (10Dzahn) [21:37:20] 06Operations, 10hardware-requests, 13Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3069094 (10Dzahn) [21:37:57] 06Operations, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3069098 (10RobH) [21:38:07] (03CR) 10Dzahn: "the non-ops user keys have already been removed from bast3001 but ops users can still use it .. 
until this is merged" [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:38:24] (03PS3) 10Dzahn: bast3001: remove from network/constants.pp [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) [21:39:35] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#3069116 (10Dzahn) p:05Normal>03High This has been replaced by bast3002 (T156506) and there is now a decom task at T159480. After decom is finished this can be closed too. [21:39:46] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#3069120 (10Dzahn) p:05High>03Low [21:41:18] 06Operations, 10hardware-requests, 13Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068928 (10Dzahn) [21:46:10] 06Operations, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3069131 (10RobH) [21:46:51] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:47:20] 06Operations, 10ops-codfw: apply hostname labels and update racktables for scb2005 (WMF6466) and scb2006 (WMF6468) - https://phabricator.wikimedia.org/T159487#3069132 (10RobH) [21:47:31] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [21:48:31] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 06Services (watching), 15User-mobrovac, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#3069147 (10RobH) [21:48:34] 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3069146 (10RobH) 05Open>03Resolved [21:49:16] (03CR) 10Krinkle: mediawiki-cache-warmup: Remove 
unused var, reduce concurrency, log slowest-5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340539 (https://phabricator.wikimedia.org/T156922) (owner: 10Krinkle) [21:50:14] (03PS1) 10Dzahn: bastion: rsync home dir data bast3001->bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340833 (https://phabricator.wikimedia.org/T156506) [21:52:19] (03CR) 10Dzahn: [C: 032] bastion: rsync home dir data bast3001->bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/340833 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:55:03] (03PS1) 10RobH: setting scb200[56] dns entries [dns] - 10https://gerrit.wikimedia.org/r/340835 [21:56:07] (03CR) 10RobH: [C: 032] setting scb200[56] dns entries [dns] - 10https://gerrit.wikimedia.org/r/340835 (owner: 10RobH) [21:57:16] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3069162 (10RobH) [21:57:49] (03PS1) 10Dzahn: bast3002: remove bastionhost::migration role [puppet] - 10https://gerrit.wikimedia.org/r/340842 [21:58:57] (03PS2) 10Dzahn: bast3002: remove bastionhost::migration role [puppet] - 10https://gerrit.wikimedia.org/r/340842 (https://phabricator.wikimedia.org/T156506) [21:59:06] (03CR) 10Dzahn: [C: 032] bast3002: remove bastionhost::migration role [puppet] - 10https://gerrit.wikimedia.org/r/340842 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [22:00:04] matt_flaschen and RoanKattouw: Respected human, time to deploy Flow enable and uninstall (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170302T2200). Please do the needful. 
[22:00:04] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:00:20] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3069166 (10Dzahn) replaced by bast3002 for all practical purposes (prometheus and ganglia roles moved too) copied home dir data, mailed ops list about it, edited wikitech pages, pasted new fin... [22:00:34] PROBLEM - Confd template for /var/lib/gdnsd/discovery-appservers-ro.state on baham is CRITICAL: File not found: /var/lib/gdnsd/discovery-appservers-ro.state [22:00:34] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:00:54] PROBLEM - Confd template for /var/lib/gdnsd/discovery-appservers-rw.state on baham is CRITICAL: File not found: /var/lib/gdnsd/discovery-appservers-rw.state [22:01:14] PROBLEM - confd service on baham is CRITICAL: CRITICAL - Expecting active but unit confd is activating [22:02:02] Here, about to start convertNamespaceFromWikitext.php [22:03:24] Fatal error: Class 'MaintenanceDebugLogger' not found in /srv/mediawiki/php-1.29.0-wmf.14/extensions/Flow/maintenance/convertNamespaceFromWikitext.php on line 69 [22:03:25] Hmm [22:04:47] (03PS1) 10RobH: scb2005 & scb2006 install params [puppet] - 10https://gerrit.wikimedia.org/r/340856 [22:05:12] (03CR) 10RobH: [C: 032] scb2005 & scb2006 install params [puppet] - 10https://gerrit.wikimedia.org/r/340856 (owner: 10RobH) [22:07:41] Regression from 9d087957aebaa28900033211936497df29f9e99a [22:07:49] The extension.json conversion [22:09:37] !log bast3002 - stop rsyncd, remove rsyncd config snippets (T156506) [22:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:45] T156506: Replace bast3001 - https://phabricator.wikimedia.org/T156506 [22:09:59] 06Operations, 10ops-esams: Degraded RAID on bast3001 - 
https://phabricator.wikimedia.org/T154603#3069204 (10Dzahn) [22:10:01] 06Operations, 10hardware-requests, 13Patch-For-Review: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3069203 (10Dzahn) 05Open>03Resolved [22:10:49] And then when I try to fix it, https://phabricator.wikimedia.org/P5012 [22:13:20] Ha [22:13:23] Did we remove FormatJson? [22:13:30] Or did we just composerify it? [22:14:43] Nope, includes/json/FormatJson.php [22:15:39] hah, so why does it not work then [22:16:01] matt_flaschen: I can do the metawiki stuff in the meantime, right? That shouldn't get in your way on cawiki or require any non-core maintenance scripts [22:16:38] RoanKattouw, gen-autoload.php is ironically not loading the autoloader. [22:16:44] Or wasn't, I fixed it. [22:16:48] lol nice one [22:16:52] Quis est autoload Autoloa? [22:17:13] RoanKattouw, let me do this deploy since it affects extension.json. [22:17:19] OK go ahead [22:17:26] You can merge then I'll cherry-pick and deploy. One sec. [22:17:43] (03PS4) 10Catrope: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [22:18:06] OK, cool, tell me what to do when [22:19:07] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3069238 (10RobH) [22:19:14] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [22:20:50] RoanKattouw, https://gerrit.wikimedia.org/r/340873 . I let it put in the annoying Symphony test autoloader stuff since I'm tired of fixing that every time I commit this (same with old autoload.php) [22:21:02] They're just unnecessary, shouldn't cause issues. [22:25:20] matt_flaschen: It fails Jenkins in autoload-static somehow? 
https://integration.wikimedia.org/ci/job/composer-php55-trusty/15021/console [22:25:38] ah [22:25:42] James_F: https://integration.wikimedia.org/ci/job/composer-php55-trusty/15021/console [22:25:52] RoanKattouw could you ignore that file in the phplint setting? [22:26:05] Aha OK [22:26:06] that will fail on php 5.5 or lower [22:26:10] RoanKattouw, yeah, I was about to ask you. Don't know why. [22:26:13] thanks [22:26:26] matt_flaschen it's because the static file is for php 5.6 [22:26:31] it will fail on php 5.5 [22:27:05] 06Operations, 13Patch-For-Review: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943#3069265 (10Dzahn) 05Open>03Resolved sent mail about this to ops list. i consider this resolved now. [22:27:15] 06Operations: make deployment SSH keys use the same passphrase - https://phabricator.wikimedia.org/T154943#3069267 (10Dzahn) [22:27:20] paladox, okay, thanks. [22:27:31] You're welcome :) [22:28:12] RoanKattouw, do you think that is because of my extension.json change, or we just need an exclude like paladox said? [22:28:22] I just amended with James_F's guidance, hoping that fixes it [22:28:23] It doesn't seem like it would depend on extension.json; it just does a directory search AFAICT. [22:28:28] We need to exclude it from the lint [22:28:34] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:28:50] (03PS7) 10Dzahn: Linting fixes (multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [22:28:50] i will create the patch now [22:28:52] :) [22:28:55] paladox, they already did it. [22:28:55] paladox: Already done [22:29:00] Technically we support 5.5.9, but oh well. 
[22:29:01] I'm just ignoring all of vendor/ now [22:29:01] Oh :) [22:29:06] Yeh :) [22:29:07] Previously it was only ignoring bits of vendor [22:29:28] In theory we shouldn't be using libraries that don't support 5.5.9 (if that is the issue). In theory. [22:29:32] (03CR) 10Dzahn: [C: 031] "confirmed with puppet-compiler.wmflabs.org/5619/ and puppet-compiler.wmflabs.org/5620/" [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [22:29:53] RoanKattouw, thanks for the fix. Awaiting Jenkins' judgement. [22:30:21] Thanks James_F [22:30:34] Glad I made a window for this. [22:30:47] Always happy to help. [22:31:10] matt_flaschen: We support 5.5.9 for MW-core but not linting of MW's /vendor upstream modules. [22:31:23] matt_flaschen: Which is a bit of a difference. :-) [22:31:38] James_F, how does that work Real World, though. If a library we use fails on 5.5.9, MW (or an extension) fails on 5.5.9. [22:32:12] We did try to get upstream composer to allow us to disable the autoload_static.php file but they said no [22:32:29] 06Operations, 10ops-esams, 10hardware-requests, 13Patch-For-Review: redeploy hooft as bast3002 - https://phabricator.wikimedia.org/T131560#3069273 (10RobH) [22:32:54] paladox, is it not allowed because "__DIR__ . '/..' . '/symfony/polyfill-mbstring/bootstrap.php'" is not a constant. [22:32:57] ? [22:33:06] Not sure. [22:33:06] PHP is weirdly strict about that. [22:33:16] It's because it is using php 5.6 syntax [22:33:31] that is not in php 5.5 [22:33:32] other users have reported it to composer. [22:33:59] paladox, yeah: http://php.net/manual/en/language.oop5.constants.php . Thanks for the heads-up. [22:34:04] I've ran into that before. 
[22:34:07] Your welcome [22:34:11] RoanKattouw probally want to do --exclude vendor/bin --exclude vendor/jakub-onderka --exclude vendor/composer/ [22:34:37] otherwise it will fail with [22:34:38] Fatal error: File not found: /home/jenkins/workspace/mediawiki-extensions-hhvm-jessie/src/extensions/Flow/vendor/pimple/pimple/src/Pimple/Container.php in /home/jenkins/workspace/mediawiki-extensions-hhvm-jessie/src/includes/AutoLoader.php on line 81 [22:34:38] paladox: James told me to just do --exclude vendor [22:34:46] huh, why? [22:34:49] Looks like they finally support combining multiple constants. [22:34:58] Not sure probaly flow requires that file [22:35:08] in it's AutoLoader.php file [22:35:33] hah you're right, it does fail [22:36:04] !log killed stuck updates on maps-test2001 [22:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:29] Yep [22:37:43] I don't think that's related to the --exclude thing there though [22:37:58] RoanKattouw, should we put it back to paladox's 3? [22:38:02] We can try [22:38:03] I don't think Jenkins ran on that. [22:38:08] Yep [22:38:09] If that passes CI, I will be mystified, but let's try it [22:38:15] I agree. [22:38:18] Will try, though. [22:39:25] matt_flaschen: Mind if I do the metawiki changes in the meantime? [22:39:48] RoanKattouw, no, go ahead. I'll wait to deploy until you're done. 
[22:40:12] (03CR) 10Catrope: [C: 032] Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [22:41:08] CI still fails on the Flow patch [22:41:21] (03Merged) 10jenkins-bot: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [22:41:30] (03CR) 10jenkins-bot: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [22:41:41] Strangely that file does exist for me locally [22:41:44] RoanKattouw, yeah, looking. [22:42:28] (03CR) 10Dzahn: [C: 032] Linting fixes (multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [22:42:57] paladox, you're right about the tabs. It should match convertExtensionToRegistration. I copied that line, will commit later. [22:43:16] Your welcome :) [22:44:12] RoanKattouw, someone recently updated our Pimple. I probably need to run that locally and re-do the autoloader. 
[22:45:11] Aha [22:45:17] Oh right, Reedy did that [22:46:23] !log catrope@tin Synchronized dblists/: T63729: disable Flow on metawiki (duration: 00m 58s) [22:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:29] T63729: Remove Flow from Meta-Wiki - https://phabricator.wikimedia.org/T63729 [22:49:12] Ugh, deleteBatch isn't working because the content model is no longer registered [22:49:24] I forgot how I handled this on enwiki last time [22:49:29] (03PS1) 10BBlack: confd: define srv_dns at all sites [puppet] - 10https://gerrit.wikimedia.org/r/340890 [22:49:39] I'll try reenabling Flow locally on tin and disabling the deletion prevention or something [22:50:28] Oh, the deletion succeeded, it's just post-deletion stuff that didn't [22:50:57] Title::purgeSquid fails because it indirectly calls Title::getContentLanguage() which calls ContentHandler [22:51:06] I'll just set the CH to wikitext in the DB :) [22:51:31] (03PS3) 10Krinkle: mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5 [puppet] - 10https://gerrit.wikimedia.org/r/340539 (https://phabricator.wikimedia.org/T156922) [22:52:07] OK that worked [22:52:30] And https://meta.wikimedia.org/wiki/Special:Contributions/Flow_talk_page_manager is clean now, yay [22:53:27] (03CR) 10BBlack: [C: 032] confd: define srv_dns at all sites [puppet] - 10https://gerrit.wikimedia.org/r/340890 (owner: 10BBlack) [22:54:23] RoanKattouw, except vendor/pimple/pimple/src/Pimple/Container.php is still there. [22:54:28] I even wiped vendor and regenerated it. [22:54:34] ha [22:55:10] !log Stopped statsd-mw-js-deprecate service on hafnium per https://gerrit.wikimedia.org/r/338929 [22:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:39] RoanKattouw, I guess this doesn't run composer. I'll just remove the junk after all. 
[22:56:15] RECOVERY - confd service on baham is OK: OK - confd is active [22:56:34] RECOVERY - Confd template for /var/lib/gdnsd/discovery-appservers-ro.state on baham is OK: No errors detected [22:59:49] RoanKattouw, https://gerrit.wikimedia.org/r/#/c/340873/ [22:59:53] Waiting for Jenkins now [23:02:38] greg-g, expanded our Flow window to end at 4 Pacific (when SWAT starts) due to unexpected problems. [23:02:56] (Autoloader regression, rabbit hole) [23:05:29] !log bblack@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=appservers-rw [23:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:37] !log bblack@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=appservers-rw,name=eqiad [23:05:37] matt_flaschen: kk [23:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:54] RECOVERY - Confd template for /var/lib/gdnsd/discovery-appservers-rw.state on baham is OK: No errors detected [23:07:15] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [23:07:48] !log all authdns servers puppet re-enabled [23:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:04] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 06Discovery-Search (Current work): collect usual GC metrics for Blazegraph JVMs - https://phabricator.wikimedia.org/T159248#3069367 (10debt) [23:10:14] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [23:11:57] chasemp ^^ [23:12:09] thanks [23:12:18] your welcome :) [23:12:54] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:15:44] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:15:52] 06Operations, 10ops-esams, 10hardware-requests, 13Patch-For-Review: redeploy hooft as bast3002 - https://phabricator.wikimedia.org/T131560#3069449 (10Dzahn) oh yea, this one has been done in T156506, it's kind of a duplicate or invalid now. [23:16:15] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3069452 (10chasemp) leaked 3 more now leaving to debug tomorrow > PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova... [23:16:18] 06Operations, 13Patch-For-Review: Reimage hooft with jessie and rename to bast3001 - https://phabricator.wikimedia.org/T123712#3069454 (10Dzahn) [23:16:20] 06Operations, 10ops-esams, 10hardware-requests, 13Patch-For-Review: redeploy hooft as bast3002 - https://phabricator.wikimedia.org/T131560#3069453 (10Dzahn) 05Open>03Invalid [23:17:47] (03CR) 10Chad: Scap clean: abort if a branch is still in use (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340250 (owner: 10Chad) [23:18:04] (03PS1) 10Papaul: DHCP: Add DHCP entries for ms-be2028-msbe2039 [puppet] - 10https://gerrit.wikimedia.org/r/340896 [23:19:33] (03PS2) 10Chad: Scap clean: abort if a branch is still in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340250 [23:20:20] (03CR) 10Chad: Scap clean: abort if a branch is still in use (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340250 (owner: 10Chad) [23:27:03] RoanKattouw, okay, Jenkins finally passes: https://gerrit.wikimedia.org/r/#/c/340873/ [23:27:25] After that's reviewed, I can cherry-pick, deploy, then do the cawiki. 
[23:29:07] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: collect usual GC metrics for Blazegraph JVMs - https://phabricator.wikimedia.org/T159248#3069495 (10Smalyshev) [23:29:28] (03PS4) 10Krinkle: mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5 [puppet] - 10https://gerrit.wikimedia.org/r/340539 (https://phabricator.wikimedia.org/T156922) [23:31:48] matt_flaschen: +2ed [23:32:21] Thanks RoanKattouw [23:32:24] (03PS5) 10Krinkle: mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5 [puppet] - 10https://gerrit.wikimedia.org/r/340539 (https://phabricator.wikimedia.org/T156922) [23:34:55] (03CR) 10Krinkle: "No, it overwrite most of my changes from PS13. https://gerrit.wikimedia.org/r/#/c/337158/13..16/modules/webperf/files/navtiming.py and htt" [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [23:37:17] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [23:40:47] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:43:47] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [23:44:51] (03Draft1) 10Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - 10https://gerrit.wikimedia.org/r/340900 [23:44:54] (03PS2) 10Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) [23:54:32] (03PS1) 10EBernhardson: Revert "Test disable super_detect_noop script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340902 [23:58:08] (03CR) 10Chad: "This can go whenever, just needs a plugin reload not full restart" [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T159075) 
(owner: 10Paladox) [23:58:20] (03CR) 10Chad: [C: 031] "This can go whenever just needs a plugin reload not a full restart" [puppet] - 10https://gerrit.wikimedia.org/r/340735 (https://phabricator.wikimedia.org/T159441) (owner: 10Paladox) [23:58:24] (03CR) 10Chad: [C: 031] Gerrit: Fix bot so that it checks against *-name and *-username [puppet] - 10https://gerrit.wikimedia.org/r/340801 (https://phabricator.wikimedia.org/T159075) (owner: 10Paladox) [23:59:27] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed