[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171115T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:10:41] (PS3) EddieGP: Added throttle rule for course on Wikipedia at a Medicine Faculty campus. [mediawiki-config] - https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) (owner: Zoranzoki21)
[00:10:43] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0]
[00:11:44] (CR) EddieGP: "The request was to have an exemption until 2017-11-21. Thus we should set the exemption to end at 2017-11-22T00:00 UTC, or it will expire " [mediawiki-config] - https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) (owner: Zoranzoki21)
[00:12:02] (PS4) EddieGP: Added throttle rule for course on Wikipedia at a Medicine Faculty campus. [mediawiki-config] - https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) (owner: Zoranzoki21)
[00:12:08] Can I get an ACK on that icinga-wm warning about zuul/gearman
[00:12:20] Known, large batch of changes pushed
[00:12:40] Hmm, I'd like to add https://gerrit.wikimedia.org/r/#/c/391220/ (even though it's not my patch) to SWAT if possible. It's a throttle override that will be needed tomorrow according to the ticket comments and that was incorrectly placed in the puppet swat window earlier this day (where it was ignored, as it isn't, well, puppet).
[00:13:20] T180441 is the ticket
[00:13:20] T180441: Temporary lift of IP cap from 2017-11-14 to 2017-11-21 - https://phabricator.wikimedia.org/T180441
[00:26:47] (PS5) MaxSem: Added throttle rule for course on Wikipedia at a Medicine Faculty campus. [mediawiki-config] - https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) (owner: Zoranzoki21)
[00:27:10] Seems nobody is around as there wasn't anything to swat in the first place, not sure whom to ping about ^
[00:27:17] (CR) MaxSem: [C: 2] Added throttle rule for course on Wikipedia at a Medicine Faculty campus. [mediawiki-config] - https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) (owner: Zoranzoki21)
[00:27:20] lol
[00:27:28] eddiegp: I'll deploy it, meanwhile please put it in the right section for record keeping, please
[00:27:44] MaxSem: Thanks, I'll do that
[00:30:12] (Merged) jenkins-bot: Added throttle rule for course on Wikipedia at a Medicine Faculty campus. [mediawiki-config] - https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) (owner: Zoranzoki21)
[00:32:32] Oh yeah, I should have seen Max already started with that patch if I wouldn't have silenced wikibugs, lol :D
[00:33:17] !log maxsem@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/391220/ (duration: 00m 49s)
[00:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:25] eddiegp: done
[00:34:31] MaxSem: Thanks!
[00:34:39] :)
[00:34:43] (CR) jenkins-bot: Added throttle rule for course on Wikipedia at a Medicine Faculty campus.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) (owner: 10Zoranzoki21) [00:53:38] !log going to restart zuul as it's got backed up [00:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:01] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [02:29:30] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.7) (duration: 05m 57s) [02:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:52] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:05:04] (03PS1) 10Chad: search.wikimedia.org: Add robots.txt, tell them to stay away [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391470 [03:05:06] (03CR) 10Chad: [C: 032] search.wikimedia.org: Add robots.txt, tell them to stay away [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391470 (owner: 10Chad) [03:09:36] (03Merged) 10jenkins-bot: search.wikimedia.org: Add robots.txt, tell them to stay away [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391470 (owner: 10Chad) [03:11:52] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [03:12:59] (03CR) 10jenkins-bot: search.wikimedia.org: Add robots.txt, tell them to stay away [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391470 (owner: 10Chad) [03:19:12] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:22:12] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 7.129 second response time on 10.192.16.162 port 9042 [03:25:01] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 851.90 seconds [03:50:29] (03PS1) 
10Tim Starling: Work around HHVM bug by using XMLWriter::writeAttribute() [dumps/dcat] - 10https://gerrit.wikimedia.org/r/391489 (https://phabricator.wikimedia.org/T117534) [03:53:51] (03CR) 10Tim Starling: "Tested locally by setting up a fake dump directory with a single (zero size) dump file in it. I downloaded the config from snapshot1007. I" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/391489 (https://phabricator.wikimedia.org/T117534) (owner: 10Tim Starling) [03:57:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 149.24 seconds [04:06:14] !log demon@tin Synchronized docroot/search.wikimedia.org/robots.txt: go away robots / kill some 404s (duration: 00m 50s) [04:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:01] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 1 down 0 [04:16:01] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [05:04:28] (03PS1) 10Phedenskog: Skip upper limits for values in navtiming2 [puppet] - 10https://gerrit.wikimedia.org/r/391496 (https://phabricator.wikimedia.org/T104902) [05:27:42] !log subbu@terbium running linter-reparse.py script to initialize baseline linter categories for all wikis (12 hours in so far .. expected to run for ~2 weeks). hits parsoid eqiad cluster. 
[05:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:08] !log subbu@terbium (followup to the linter-reparse.py script log entry) it is safe to kill -9 the script on terbium anytime if there are any problems because of it [05:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:42] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:38:42] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 7.340 second response time on 10.192.16.164 port 9042 [06:11:02] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:01] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 3.068 second response time on 10.192.16.162 port 9042 [06:25:01] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [06:25:52] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [06:38:32] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:22] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 1.068 second response time on 10.192.16.162 port 9042 [06:52:01] 10Operations, 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3761763 (10Joe) >>! In T180023#3744258, @Gehel wrote: > I would argue that including the source of the software as a submodule should be optional. The specific use case I have in m... [06:59:12] PROBLEM - configured eth on restbase2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[06:59:31] PROBLEM - Disk space on restbase2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [07:00:11] PROBLEM - MD RAID on restbase2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [07:01:01] PROBLEM - SSH on restbase2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:21] PROBLEM - configured eth on restbase2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [07:02:11] RECOVERY - configured eth on restbase2004 is OK: OK - interfaces up [07:02:31] RECOVERY - Disk space on restbase2004 is OK: DISK OK [07:02:55] RECOVERY - SSH on restbase2004 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [07:03:11] PROBLEM - MD RAID on restbase2004 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 4, Spare: 0 [07:03:12] ACKNOWLEDGEMENT - MD RAID on restbase2004 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 4, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T180562 [07:03:15] 10Operations, 10ops-codfw: Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3761766 (10ops-monitoring-bot) [07:05:25] 10Operations, 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3761771 (10Joe) >>! In T180023#3744420, @Volans wrote: >> Which deployment method to choose > > I would mention also cases in which the upstream package or dependencies release qu... 
[07:07:22] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [07:08:21] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:15:26] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106 and db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391506 [07:15:30] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1106 and db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391506 [07:16:24] 10Operations, 10Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3761787 (10Krinkle) >>! In T180407#3758143, @Nemo_bis wrote: > I assume the same would apply to the "UseDC" and "UseCDNCache" cookies? And maybe also the "cpPosTime"? `CP` cookie is a long-l... [07:19:42] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:20:31] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3761789 (10Krinkle) [07:20:32] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042 [07:20:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Well yeah, but we also have needs for transparency and we avoid having anything not critical private but rather everything public via defa" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391214 (owner: 10Ottomata) [07:21:11] 10Operations, 10Security-Team: can we get rid of rsvg security patch? 
- https://phabricator.wikimedia.org/T104147#3761791 (10Krinkle) [07:22:04] 10Operations, 10User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3761793 (10Gehel) >>! In T180023#3761763, @Joe wrote: > Well if we don't include a submodule with the sources, I would expect us to clone a repository at a certain revision at the... [07:28:24] 10Operations, 10Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3761794 (10Nemo_bis) Ok, makes sense. Explains why I didn't notice them before checking in the developer console. :-) I don't know whether creating those cookies causes issues (e.g. because o... [07:31:04] (03CR) 10Alexandros Kosiaris: [C: 031] Icinga: allow to set display_name [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [07:32:15] (03PS1) 10Marostegui: tools.my.cnf: Replication filter: s51290__dpl_p [puppet] - 10https://gerrit.wikimedia.org/r/391508 (https://phabricator.wikimedia.org/T180560) [07:33:17] (03CR) 10Marostegui: [C: 032] tools.my.cnf: Replication filter: s51290__dpl_p [puppet] - 10https://gerrit.wikimedia.org/r/391508 (https://phabricator.wikimedia.org/T180560) (owner: 10Marostegui) [07:34:15] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1106 and db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391506 (owner: 10Marostegui) [07:34:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1 for the dashboard file being changed in this change (should be in its own)." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [07:35:48] (03CR) 10Alexandros Kosiaris: [C: 031] Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [07:36:18] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106 and db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391506 (owner: 10Marostegui) [07:36:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106 and db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391506 (owner: 10Marostegui) [07:36:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "And I just saw https://gerrit.wikimedia.org/r/#/c/391238/4 which addresses my concern" [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [07:36:47] (03CR) 10Alexandros Kosiaris: [C: 031] Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [07:37:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1106 and db1100 - T174569 (duration: 00m 49s) [07:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:39] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [07:38:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1071 and db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391509 (https://phabricator.wikimedia.org/T174569) [07:39:54] (03PS2) 10Marostegui: db-eqiad.php: Depool db1071 and db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391509 (https://phabricator.wikimedia.org/T174569) [07:40:10] (03CR) 10Krinkle: [C: 031] Skip upper limits for values in navtiming2 [puppet] - 10https://gerrit.wikimedia.org/r/391496 (https://phabricator.wikimedia.org/T104902) (owner: 
10Phedenskog) [07:40:58] (03CR) 10Krinkle: [C: 031] Skip upper limits for values in navtiming2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391496 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [07:41:02] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:41:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1071 and db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391509 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [07:42:01] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 3.055 second response time on 10.192.16.162 port 9042 [07:43:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1071 and db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391509 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [07:44:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099 and db1071 - T174569 (duration: 00m 49s) [07:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:44] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [07:44:47] (03PS1) 10KartikMistry: Add apertium-crh-tur packages [puppet] - 10https://gerrit.wikimedia.org/r/391511 (https://phabricator.wikimedia.org/T178139) [07:45:22] (03CR) 10jerkins-bot: [V: 04-1] Add apertium-crh-tur packages [puppet] - 10https://gerrit.wikimedia.org/r/391511 (https://phabricator.wikimedia.org/T178139) (owner: 10KartikMistry) [07:46:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1071 and db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391509 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [07:47:19] (03PS1) 10Marostegui: tools.my.cnf: Replication filter for i_psub [puppet] - 10https://gerrit.wikimedia.org/r/391513 (https://phabricator.wikimedia.org/T180560) [07:48:24] (03CR) 10Marostegui: [C: 032] 
tools.my.cnf: Replication filter for i_psub [puppet] - 10https://gerrit.wikimedia.org/r/391513 (https://phabricator.wikimedia.org/T180560) (owner: 10Marostegui) [07:49:15] !log Deploy schema change on db1071 and db1099 (s5) - T174569 [07:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:52] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0 [07:50:42] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [07:53:13] (03PS2) 10KartikMistry: Add new Apertium pairs: crh-tur and cat-srd [puppet] - 10https://gerrit.wikimedia.org/r/391511 (https://phabricator.wikimedia.org/T178139) [07:53:55] (03CR) 10jerkins-bot: [V: 04-1] Add new Apertium pairs: crh-tur and cat-srd [puppet] - 10https://gerrit.wikimedia.org/r/391511 (https://phabricator.wikimedia.org/T178139) (owner: 10KartikMistry) [07:54:56] (03PS3) 10KartikMistry: Add new Apertium pairs: crh-tur and cat-srd [puppet] - 10https://gerrit.wikimedia.org/r/391511 (https://phabricator.wikimedia.org/T178139) [07:55:04] (03CR) 10Alexandros Kosiaris: [C: 031] standard: map an host to its production cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [07:55:36] (03CR) 10Alexandros Kosiaris: [C: 032] Add new Apertium pairs: crh-tur and cat-srd [puppet] - 10https://gerrit.wikimedia.org/r/391511 (https://phabricator.wikimedia.org/T178139) (owner: 10KartikMistry) [07:55:58] Thanks akosiaris [07:56:21] yw [07:56:42] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [07:58:11] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused 
[07:59:01] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:59:01] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[08:01:56] akosiaris: when should new package available?
[08:02:28] it is. thanks!
[08:02:38] Operations, Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3756812 (BBlack) Does RL make use of the CP cookie information to use different module-loading strategies for H/1 vs H/2? I remember that being the intent in creating it, but I'm not sure...
[08:02:45] (PS1) Marostegui: tools.my.cnf: Replication filter for i_redirects [puppet] - https://gerrit.wikimedia.org/r/391514 (https://phabricator.wikimedia.org/T180560)
[08:02:47] kart_: tops 30 mins across the fleet
[08:03:01] (PS2) Marostegui: tools.my.cnf: Replication filter for i_redirects [puppet] - https://gerrit.wikimedia.org/r/391514 (https://phabricator.wikimedia.org/T180560)
[08:04:01] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational
[08:04:01] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active
[08:04:07] (CR) Marostegui: [C: 2] tools.my.cnf: Replication filter for i_redirects [puppet] - https://gerrit.wikimedia.org/r/391514 (https://phabricator.wikimedia.org/T180560) (owner: Marostegui)
[08:06:25] akosiaris: also, will it restart apertium-apy service? we need it to restart service too.
I forgot the last time added new package :)
[08:07:51] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2018-08-17 16:11:39 +0000 (expires in 275 days)
[08:08:11] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042
[08:11:45] kart_: yes it will
[08:11:51] Okay. Cool.
[08:12:21] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[08:13:03] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:23:29] akosiaris, kart_: BTW, when I upgraded kernels on scb* I noted a few apertium packages with a pending update, maybe upgrade those as well while you're at it
[08:29:29] (PS1) Marostegui: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/391515 (https://phabricator.wikimedia.org/T178359)
[08:29:36] Operations, ops-codfw, Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3761858 (mobrovac)
[08:31:35] moritzm: yeah indeed, good idea
[08:31:40] (PS1) Marostegui: tools.my.cnf.erb: Ignore filter for u_pagelinks [puppet] - https://gerrit.wikimedia.org/r/391516 (https://phabricator.wikimedia.org/T180560)
[08:31:40] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/391515 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[08:31:58] (CR) Marostegui: [C: 2] tools.my.cnf.erb: Ignore filter for u_pagelinks [puppet] - https://gerrit.wikimedia.org/r/391516 (https://phabricator.wikimedia.org/T180560) (owner: Marostegui)
[08:32:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/391515 (https://phabricator.wikimedia.org/T178359) (owner:
Marostegui)
[08:32:51] (CR) jenkins-bot: db-eqiad.php: Depool db1101 [mediawiki-config] - https://gerrit.wikimedia.org/r/391515 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[08:33:06] Operations, ops-codfw, Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3761866 (fgiunchedi) A note: as per {T179422} restbase2004 isn't in service
[08:34:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101 - going to convert it to mult-instance T178359 (duration: 00m 48s)
[08:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:12] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:45:26] moritzm: ah. Sorry just noticed.
[08:45:39] moritzm: can you send me list?
[08:46:48] kart_: I think akosiaris already upgraded them, on scb1004 all apertium packages are now up-to-date
[08:46:56] OK. Nice.
[08:47:15] Yes. I don't see them in apt list --upgradable
[08:47:21] Operations, Release Pipeline, Release-Engineering-Team (Kanban): Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3761888 (Joe) @dduvall the reason this is happening right now is that stretch doesn't have a package for npm and I didn't g...
[08:48:38] (03PS1) 10Marostegui: tools.my.cnf.erb: Ignore memory tables for s51290 [puppet] - 10https://gerrit.wikimedia.org/r/391517 (https://phabricator.wikimedia.org/T180560) [08:49:55] (03CR) 10Marostegui: [C: 032] tools.my.cnf.erb: Ignore memory tables for s51290 [puppet] - 10https://gerrit.wikimedia.org/r/391517 (https://phabricator.wikimedia.org/T180560) (owner: 10Marostegui) [08:51:04] !log reboot thorium (hosting all analytics websites) for kernel updates [08:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:30] this --^ means that there will be a little downtime for analytics services in a bit [08:52:12] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:58:03] kart_: moritzm: Yeah already done. I am still restarting apertium-apy on a rolling restart pattern, should be done soon [09:00:16] (03PS1) 10Mobrovac: RESTBase: Remove seeds that are moving to Cass3 [puppet] - 10https://gerrit.wikimedia.org/r/391518 (https://phabricator.wikimedia.org/T179422) [09:02:17] 10Operations, 10Cassandra: cassandra unresponsive on restbase2001-c - https://phabricator.wikimedia.org/T180568#3761927 (10fgiunchedi) [09:02:51] !log rebooting labsdb1010 for kernel upgrade [09:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:23] 10Operations, 10Cassandra: cassandra unresponsive on restbase2001-c - https://phabricator.wikimedia.org/T180568#3761941 (10mobrovac) [09:03:46] 10Operations, 10ops-codfw, 10Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3761943 (10mobrovac) [09:08:24] !log stop eventlogging on eventlog1001, eventlogging replication on db1108/db1047/dbstore1002 as preparation steps to migrate the log db from db1046 to db1107 [09:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:02] RECOVERY - haproxy failover on dbproxy1011 is 
OK: OK check_failover servers up 2 down 0 [09:10:41] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [09:15:19] !log Stop mysql on db1046 to transfer its content to db1107 - T177405 [09:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:26] T177405: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405 [09:15:30] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3761976 (10MoritzMuehlenhoff) Current npm releases are not packaged in Debian since the list of dependencies exploded. Effort... [09:17:41] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [09:17:51] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [09:19:33] !log restart cassandra on restbase2001-c - T180568 [09:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:40] T180568: cassandra unresponsive on restbase2001-c - https://phabricator.wikimedia.org/T180568 [09:20:52] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [09:21:51] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.164 and port 9042: Connection refused [09:22:12] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:23:42] (03PS1) 10Elukey: site.pp: add role to db1107 (new eventlogging master db) [puppet] - 10https://gerrit.wikimedia.org/r/391519 (https://phabricator.wikimedia.org/T177405) [09:24:24] (03CR) 10jerkins-bot: [V: 04-1] site.pp: add role to db1107 (new eventlogging master db) [puppet] - 10https://gerrit.wikimedia.org/r/391519 
(https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:24:28] (03PS2) 10Elukey: site.pp: add role to db1107 (new eventlogging master db) [puppet] - 10https://gerrit.wikimedia.org/r/391519 (https://phabricator.wikimedia.org/T177405) [09:27:01] RECOVERY - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-c valid until 2018-08-17 16:11:42 +0000 (expires in 275 days) [09:27:52] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042 [09:30:15] !log updating openssl on database hosts [09:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:36] !log restart cassandra on restbase2002-b - T180568 [09:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:42] T180568: cassandra unresponsive on restbase2001-c - https://phabricator.wikimedia.org/T180568 [09:40:56] (03PS1) 10Muehlenhoff: Add role::mariadb::core_multiinstance to Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/391521 [09:42:57] (03CR) 10Filippo Giunchedi: [C: 032] RESTBase: Remove seeds that are moving to Cass3 [puppet] - 10https://gerrit.wikimedia.org/r/391518 (https://phabricator.wikimedia.org/T179422) (owner: 10Mobrovac) [09:43:38] (03CR) 10Marostegui: [C: 031] Add role::mariadb::core_multiinstance to Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/391521 (owner: 10Muehlenhoff) [09:44:35] (03PS2) 10Muehlenhoff: Add role::mariadb::core_multiinstance to Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/391521 [09:47:36] (03CR) 10Muehlenhoff: [C: 032] Add role::mariadb::core_multiinstance to Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/391521 (owner: 10Muehlenhoff) [09:48:08] (03PS1) 10Elukey: cumin: update videoscaler canary after the decom of mw1168 [puppet] - 10https://gerrit.wikimedia.org/r/391522 (https://phabricator.wikimedia.org/T177387) [09:48:34] (03CR) 10Muehlenhoff: 
[C: 031] cumin: update videoscaler canary after the decom of mw1168 [puppet] - 10https://gerrit.wikimedia.org/r/391522 (https://phabricator.wikimedia.org/T177387) (owner: 10Elukey) [09:51:59] !log ppchelko@tin Started deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x T179786 [09:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:06] T179786: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786 [09:52:28] !log mobrovac@tin Started restart [restbase/deploy@c76a665]: Pick up the new seeds definition - T179422 [09:52:32] !log ppchelko@tin Started deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x T179786 [09:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:36] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [09:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:15] !log reboot restbase2004 - T180562 [09:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:21] T180562: Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562 [09:55:14] PROBLEM - Host restbase2004 is DOWN: PING CRITICAL - Packet loss = 100% [09:55:16] !log ppchelko@tin Finished deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x T179786 (duration: 02m 44s) [09:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:44] !log ppchelko@tin Started deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x attempt 2 T179786 [09:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:21] RECOVERY - Host restbase2004 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms [09:57:31] RECOVERY - MD RAID on restbase2004 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0 [09:57:36] (03PS3) 10Volans: Icinga: allow to set display_name [puppet] - 
10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) [09:57:38] (03PS4) 10Volans: Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) [09:57:40] (03PS4) 10Volans: Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) [09:57:43] (03PS5) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [09:57:44] (03PS1) 10Volans: Grafana: add graph to Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391525 (https://phabricator.wikimedia.org/T170353) [09:58:05] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [09:58:28] !log ppchelko@tin (no justification provided) [09:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:16] !log ppchelko@tin Finished deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x attempt 2 T179786 (duration: 03m 32s) [09:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:24] T179786: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786 [09:59:36] !log ppchelko@tin Started deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x attempt 3, force T179786 [09:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:53] (03PS2) 10Elukey: cumin: update videoscaler canary after the decom of mw1168 [puppet] - 10https://gerrit.wikimedia.org/r/391522 (https://phabricator.wikimedia.org/T177387) [10:01:00] volans: --^ [10:01:41] elukey: ok, until we have a canary role :-P [10:01:53] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/391522 
(https://phabricator.wikimedia.org/T177387) (owner: 10Elukey) [10:02:18] (03CR) 10Elukey: [C: 032] site.pp: add role to db1107 (new eventlogging master db) [puppet] - 10https://gerrit.wikimedia.org/r/391519 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [10:02:27] arrrggg nope [10:02:34] (03CR) 10Elukey: site.pp: add role to db1107 (new eventlogging master db) [puppet] - 10https://gerrit.wikimedia.org/r/391519 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [10:02:44] too many cr opened [10:02:48] (03CR) 10Elukey: [C: 032] cumin: update videoscaler canary after the decom of mw1168 [puppet] - 10https://gerrit.wikimedia.org/r/391522 (https://phabricator.wikimedia.org/T177387) (owner: 10Elukey) [10:03:37] (03PS1) 10Marostegui: install_server: Reimage db1101 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/391527 (https://phabricator.wikimedia.org/T178359) [10:03:51] (03PS2) 10Marostegui: install_server: Reimage db1101 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/391527 (https://phabricator.wikimedia.org/T178359) [10:04:13] (03CR) 10Marostegui: [C: 04-2] "Wait till tomorrow to make sure s2 performs fine without this host" [puppet] - 10https://gerrit.wikimedia.org/r/391527 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:04:21] !log ppchelko@tin Finished deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x attempt 3, force T179786 (duration: 04m 44s) [10:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:27] T179786: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786 [10:04:36] !log ppchelko@tin Started deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x attempt 4 T179786 [10:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:42] (03PS1) 10Jcrespo: dbproxy: Add proxy 4 and 9 to reimage as stretch for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/391528 
(https://phabricator.wikimedia.org/T156844) [10:05:04] (03CR) 10Marostegui: [C: 031] dbproxy: Add proxy 4 and 9 to reimage as stretch for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/391528 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [10:05:37] (03PS2) 10Jcrespo: dbproxy: Add proxy 4 and 9 to reimage as stretch for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/391528 (https://phabricator.wikimedia.org/T156844) [10:07:49] (03PS1) 10Muehlenhoff: Update db cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/391530 [10:08:01] (03CR) 10Jcrespo: [C: 032] dbproxy: Add proxy 4 and 9 to reimage as stretch for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/391528 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [10:09:02] PROBLEM - Restbase root url on restbase2006 is CRITICAL: connect to address 10.192.48.38 and port 7231: Connection refused [10:09:27] !log ppchelko@tin Finished deploy [trending-edits/deploy@a0e1fe3]: Update node-rdkafka to v1.x attempt 4 T179786 (duration: 04m 51s) [10:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:34] T179786: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786 [10:09:52] ignore the rb problems ^ [10:11:10] 10Operations, 10Trending-Service, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban), and 4 others: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786#3762168 (10Pchelolo) 05Open>03Resolved a:03Pchelolo Deployed. Resolving. 
[10:13:40] (03PS3) 10Elukey: site.pp: add role to db1107 (new eventlogging master db) [puppet] - 10https://gerrit.wikimedia.org/r/391519 (https://phabricator.wikimedia.org/T177405) [10:13:49] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987633 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1009.eqiad... [10:14:43] (03PS2) 10Phedenskog: webperf: Skip upper limits for values in navtiming2 [puppet] - 10https://gerrit.wikimedia.org/r/391496 (https://phabricator.wikimedia.org/T104902) [10:21:34] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391533 [10:21:58] 10Operations, 10ops-codfw, 10Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3762232 (10fgiunchedi) restbase2004 wasn't accessible over ssh and I saw the usual "disks busted" messages on console before rebooting ``` [11567494.419280] sd 0:1:0:0: rejecting I/O t... 
[10:24:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391533 (owner: 10Marostegui) [10:27:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391533 (owner: 10Marostegui) [10:28:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391533 (owner: 10Marostegui) [10:29:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106, optimizing wikidatawiki.wb_terms (duration: 00m 49s) [10:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:20] (03CR) 10Alexandros Kosiaris: [C: 031] Grafana: add graph to Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391525 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:32:36] (03CR) 10Alexandros Kosiaris: [C: 031] Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:36:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Generally LGTM, a few smaller issues to fix though." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: 10Ayounsi) [10:39:12] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 0 down 0 [10:42:58] 10Operations, 10ops-codfw, 10Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3762292 (10fgiunchedi) I was able to recover the kernel logs on the remote syslog servers, {P6314} [10:45:10] (03CR) 10Filippo Giunchedi: [C: 031] Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:46:46] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3762299 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1009.eqiad.wmnet'] ``` and were **ALL** successful. [10:48:33] (03CR) 10Filippo Giunchedi: [C: 031] "I painted the naming bikeshed, LGTM tho" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:49:20] (03CR) 10Filippo Giunchedi: [C: 031] Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:51:13] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:52:18] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [10:56:39] (03PS1) 10Jcrespo: haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) [10:57:14] (03CR) 10jerkins-bot: [V: 04-1] haproxy: 
Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [10:57:51] (03PS2) 10Jcrespo: haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) [10:58:22] (03CR) 10jerkins-bot: [V: 04-1] haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [10:58:33] !log rebooting job runners in eqiad for update to 4.9.51 (and to pick up openssl update) [10:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:36] (03PS3) 10Jcrespo: haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) [11:01:32] (03PS4) 10Jcrespo: haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) [11:02:30] (03CR) 10Elukey: [C: 032] site.pp: add role to db1107 (new eventlogging master db) [puppet] - 10https://gerrit.wikimedia.org/r/391519 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:02:33] 10Operations, 10ops-codfw, 10Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3762340 (10fgiunchedi) Awfully similar to {T144826} [11:08:27] (03PS5) 10Jcrespo: haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) [11:10:15] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus consumer/client-side-events-log consumer/all-events-log processor/client-side-11 processor/client-side-10 
processor/client-side-09 processor/client-side-08 processor/client-side-07 processor/client-side-06 processor/client-side-05 processor/client-side-04 proce [11:10:16] processor/client-side-02 processor/client-side-01 processor/client-side-00 forwarder/legacy-zmq [11:10:36] PROBLEM - eventlogging_sync processes on dbstore1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [11:10:36] I guess downtime expired [11:10:43] uff [11:10:45] re-adding it sorry [11:10:46] PROBLEM - eventlogging_sync processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [11:14:07] still me, all good --^ [11:21:23] (03PS1) 10Elukey: profile::mariadb::misc::eventlogging::database: fix hearthbeat settings [puppet] - 10https://gerrit.wikimedia.org/r/391537 (https://phabricator.wikimedia.org/T177405) [11:23:02] (03CR) 10Jcrespo: [C: 031] profile::mariadb::misc::eventlogging::database: fix hearthbeat settings [puppet] - 10https://gerrit.wikimedia.org/r/391537 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:23:27] (03CR) 10Elukey: [C: 032] profile::mariadb::misc::eventlogging::database: fix hearthbeat settings [puppet] - 10https://gerrit.wikimedia.org/r/391537 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:26:49] (03PS1) 10Ema: 4.1.8-1wm2: fix VSV00002 [debs/varnish4] (debian-wmf-4.1) - 10https://gerrit.wikimedia.org/r/391538 [11:26:55] (03PS6) 10Jcrespo: haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) [11:27:29] (03CR) 10jerkins-bot: [V: 04-1] haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [11:30:30] (03PS1) 10Elukey: hieradata::regex: allow notifications for db1107, disable them for 
db1046 [puppet] - 10https://gerrit.wikimedia.org/r/391540 (https://phabricator.wikimedia.org/T156844) [11:32:35] !log Stop MySQL on db1046 [11:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:01] (03CR) 10Elukey: [C: 032] hieradata::regex: allow notifications for db1107, disable them for db1046 [puppet] - 10https://gerrit.wikimedia.org/r/391540 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [11:33:02] if everything works well, both dbproxy1004 and dbproxy1009 should complain [11:33:53] (03PS1) 10Ema: 5.1.3-1wm3: fix VSV00002 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/391541 [11:34:34] PROBLEM - mysqld processes on db1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [11:34:47] page [11:34:53] elukey: wasn't db1046 downtimed? [11:35:12] I guess you're decomming it [11:35:18] it works [11:35:21] it was, but at this point I might have not re-added downtime [11:35:38] i am reading the buffer, it was downtimed for 2h only [11:36:06] yep I re-added two more hours earlier on, probably not to db1046 [11:36:12] I am stupid :) [11:36:22] let's disable notifications then [11:36:26] I am going to deploy 391536 despite the violation [11:36:31] I already did :-) [11:36:38] marostegui: I just merged the disabled notifications [11:36:45] cool [11:36:47] not fast enough, apparently [11:37:04] elukey: let me run puppet on einsteinium [11:37:31] (03CR) 10Jcrespo: [V: 032 C: 032] haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [11:37:36] (03PS7) 10Jcrespo: haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) [11:38:02] we then point the proxy to the new hosts [11:38:07] and change the dns [11:38:11] (03CR) 10jerkins-bot: [V: 04-1] haproxy: Update configuration 
template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [11:38:36] (03CR) 10Jcrespo: [V: 032 C: 032] haproxy: Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391536 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [11:39:39] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [debs/varnish4] (debian-wmf-4.1) - 10https://gerrit.wikimedia.org/r/391538 (owner: 10Ema) [11:40:27] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/391541 (owner: 10Ema) [11:41:51] (03CR) 10Ema: [C: 032] 4.1.8-1wm2: fix VSV00002 [debs/varnish4] (debian-wmf-4.1) - 10https://gerrit.wikimedia.org/r/391538 (owner: 10Ema) [11:42:07] (03CR) 10Ema: [C: 032] 5.1.3-1wm3: fix VSV00002 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/391541 (owner: 10Ema) [11:43:52] (03PS1) 10Jcrespo: haproxy: Followup to Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391542 (https://phabricator.wikimedia.org/T156844) [11:44:16] (03CR) 10Jcrespo: [V: 032 C: 032] haproxy: Followup to Update configuration template to haproxy 1.7 syntax [puppet] - 10https://gerrit.wikimedia.org/r/391542 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [11:45:56] !log restart haproxy on dbproxy1009 [11:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:10] (03PS1) 10Jcrespo: eventlogging: Point m4-master to db1107 with db1108 as backup [puppet] - 10https://gerrit.wikimedia.org/r/391543 (https://phabricator.wikimedia.org/T156844) [11:50:36] !log varnish 5.1.3-1wm3 uploaded to apt.w.o (experimental) [11:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:40] (03PS4) 10Volans: Icinga: allow to set display_name [puppet] - 10https://gerrit.wikimedia.org/r/391235 
(https://phabricator.wikimedia.org/T170353) [11:55:42] (03PS2) 10Volans: Grafana: add graph to Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391525 (https://phabricator.wikimedia.org/T170353) [11:55:44] (03PS5) 10Volans: Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) [11:55:46] (03PS5) 10Volans: Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) [11:55:48] (03PS6) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [11:57:29] (03PS1) 10Jcrespo: eventlogging: Repoint m4-master CNAME to dbproxy1009 [dns] - 10https://gerrit.wikimedia.org/r/391545 (https://phabricator.wikimedia.org/T156844) [11:58:08] !log varnish 4.1.8-1wm2 uploaded to apt.w.o (main) [11:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:02] (03CR) 10Filippo Giunchedi: "I tried testing this with pcc but without much success, since cumin targets are only defined when running in production afaict: https://pu" [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [11:59:39] (03CR) 10Filippo Giunchedi: standard: map an host to its production cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [12:00:05] PROBLEM - Check size of conntrack table on mw1303 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:00:05] PROBLEM - Check size of conntrack table on mw1302 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:00:14] PROBLEM - Check size of conntrack table on mw1301 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [12:00:14] PROBLEM - Check size of conntrack table on mw1304 is CRITICAL: CRITICAL: 
nf_conntrack is 90 % full [12:01:01] !log upgrade varnish to 5.1.3-1wm3 on cp3007 (cache_misc) [12:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:25] PROBLEM - Check size of conntrack table on mw1299 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:03:15] (03CR) 10Volans: "LGTM, one nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [12:03:25] PROBLEM - Check size of conntrack table on mw1299 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:04:05] PROBLEM - Check size of conntrack table on mw1302 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [12:04:14] PROBLEM - Check size of conntrack table on mw1301 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:04:23] I'm checking one of those appservers to see what's up [12:05:05] PROBLEM - Check size of conntrack table on mw1300 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:05:39] lots of tw sockets looks like, on jobrunners [12:06:11] (03CR) 10Elukey: [C: 031] eventlogging: Repoint m4-master CNAME to dbproxy1009 [dns] - 10https://gerrit.wikimedia.org/r/391545 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [12:06:14] PROBLEM - Check size of conntrack table on mw1304 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:07:05] PROBLEM - Check size of conntrack table on mw1303 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:07:18] (03CR) 10Elukey: [C: 031] eventlogging: Point m4-master to db1107 with db1108 as backup [puppet] - 10https://gerrit.wikimedia.org/r/391543 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [12:07:38] godog: were they rebooted recently? [12:07:49] I think that it is a conntrack setting issue, it happened in the past [12:08:09] should we disable it temporarily? 
[12:08:19] ah yeah, good catch elukey, I missed moritzm update [12:08:20] or is it not-breaking [12:08:36] yeah net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120 [12:08:50] going to fix it [12:09:14] PROBLEM - Check size of conntrack table on mw1304 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [12:09:23] !log executed sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 on all jobrunners [12:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:13] ah now I remember, the sysctl/firewall race at boot [12:10:21] exactly [12:11:13] RECOVERY - Check size of conntrack table on mw1303 is OK: OK: nf_conntrack is 65 % full [12:11:13] RECOVERY - Check size of conntrack table on mw1300 is OK: OK: nf_conntrack is 64 % full [12:11:14] RECOVERY - Check size of conntrack table on mw1302 is OK: OK: nf_conntrack is 62 % full [12:11:14] RECOVERY - Check size of conntrack table on mw1301 is OK: OK: nf_conntrack is 65 % full [12:11:14] RECOVERY - Check size of conntrack table on mw1304 is OK: OK: nf_conntrack is 65 % full [12:11:22] \o/ [12:11:32] RECOVERY - Check size of conntrack table on mw1299 is OK: OK: nf_conntrack is 64 % full [12:11:34] nice [12:13:00] (03CR) 10Marostegui: [C: 031] eventlogging: Point m4-master to db1107 with db1108 as backup [puppet] - 10https://gerrit.wikimedia.org/r/391543 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [12:13:11] (03CR) 10Marostegui: [C: 031] eventlogging: Repoint m4-master CNAME to dbproxy1009 [dns] - 10https://gerrit.wikimedia.org/r/391545 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [12:13:49] elukey, godog: which race condition? perhaps this? 
https://bugs.debian.org/864341 [12:14:45] arturo: https://phabricator.wikimedia.org/T136094 iirc [12:15:13] arturo: yeah that's the one [12:15:29] linked from the task elukey posted [12:15:51] glad we're not the only ones :) [12:15:58] :-) [12:16:28] BTW that bug report triggered me joining the WMF, since I started talking to moritzm [12:17:05] indeed :-) [12:17:11] wonderful! [12:17:17] hahaha nice! [12:19:10] (03CR) 10Jcrespo: [C: 032] eventlogging: Repoint m4-master CNAME to dbproxy1009 [dns] - 10https://gerrit.wikimedia.org/r/391545 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [12:19:19] !log cache_misc: upgrade varnish to 5.1.3-1wm3 [12:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:31] (03CR) 10Jcrespo: [C: 032] eventlogging: Point m4-master to db1107 with db1108 as backup [puppet] - 10https://gerrit.wikimedia.org/r/391543 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [12:24:19] !log deploying dns change for m4-master [12:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:34] \o/ [12:28:11] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3762638 (10phuedx) > In T178189#3740029, @akosiaris wrote: >> In T178189#3740805, @phuedx wrote: >> One thing that we (Readers Web)... [12:30:52] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3762650 (10phuedx) [12:38:58] 10Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#3762670 (10fgiunchedi) [12:40:02] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. 
[12:41:45] !log re-enable eventlogging after maintenance [12:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:10] (03CR) 10Elukey: "We usually don't ensure the specific file (since it gets created if not there by the stdout/err redirection) and the /var/log/wikidata dir" [puppet] - 10https://gerrit.wikimedia.org/r/386662 (owner: 10Hoo man) [13:02:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Add Jinja2 expression statement [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/390174 (owner: 10Thcipriani) [13:05:25] (03PS3) 10Marostegui: install_server: Reimage db1101 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/391527 (https://phabricator.wikimedia.org/T178359) [13:07:24] !log upgrade varnish to 4.1.8-1wm2 on cp3030 (cache_text) [13:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:15] !log rebooting video scalers in eqiad for update to 4.9.51 (and to pick up openssl update) [13:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:32] RECOVERY - eventlogging_sync processes on db1047 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [13:30:52] RECOVERY - eventlogging_sync processes on dbstore1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [13:31:37] !log ppchelko@tin Started restart [changeprop/deploy@065a06e]: Restart to rebalance all rules T179684 [13:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:43] T179684: ChangeProp workers die if they can't connect to redis - https://phabricator.wikimedia.org/T179684 [13:34:56] !log cache_text: upgrade varnish to 4.1.8-1wm2 [13:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:46] (03PS1) 10Marostegui: mariadb: Enable notifications in a few hosts [puppet] - 10https://gerrit.wikimedia.org/r/391549 
(https://phabricator.wikimedia.org/T178359) [13:49:42] (03CR) 10Marostegui: [C: 032] mariadb: Enable notifications in a few hosts [puppet] - 10https://gerrit.wikimedia.org/r/391549 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:52:30] 10Operations, 10monitoring, 10Graphite, 10Performance-Team (Radar): Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3762974 (10Peter) The automatic annotations from WebPageTest we could purge them like every two weeks if it's an issue, we very very rarely need to go back longer in... [13:52:39] (03PS1) 10Marostegui: mariadb: Enable notifications on codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/391550 [13:53:27] (03CR) 10Marostegui: [C: 032] mariadb: Enable notifications on codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/391550 (owner: 10Marostegui) [13:55:58] 10Operations: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#3762982 (10Marostegui) I have reviewed and added the ones for the DBs that could already be enabled back. As soon as puppet starts running they should be picked up. Thanks for the report! [13:56:57] (03PS1) 10Jcrespo: labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance [puppet] - 10https://gerrit.wikimedia.org/r/391551 (https://phabricator.wikimedia.org/T179244) [13:58:39] (03PS2) 10Jcrespo: labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance [puppet] - 10https://gerrit.wikimedia.org/r/391551 (https://phabricator.wikimedia.org/T179244) [13:59:27] (03CR) 10Marostegui: [C: 031] labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance [puppet] - 10https://gerrit.wikimedia.org/r/391551 (https://phabricator.wikimedia.org/T179244) (owner: 10Jcrespo) [14:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. 
It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171115T1400). [14:00:05] kart_, Pchelolo, Amir1, and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:43] (03CR) 10Jcrespo: [C: 032] labsdb: point analytics replica to labsdb1010 for labsdb1009 maintenance [puppet] - 10https://gerrit.wikimedia.org/r/391551 (https://phabricator.wikimedia.org/T179244) (owner: 10Jcrespo) [14:00:45] I can SWAT today [14:00:48] !log installing openssl updates on kafka and hadoop clusters [14:00:53] o/ [14:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:09] I can deploy mine once you're done :) [14:01:46] o/ [14:01:49] kart_, Pchelolo, Amir1, dcausse: do you want to deploy your own change? if not, does any change need a long time to test? [14:01:59] here [14:02:00] o/ [14:02:07] Amir1: ok, will ping you when done, want to go first? [14:02:21] zeljkof: my changes are simple. go ahead. [14:02:22] I'm here but mine is not testable at all, just bumps a request timeout a little bit [14:02:28] zeljkof: I can deploy mine, and I can't test it properly on mwdebug (it affects some code that is cached) [14:02:35] nah, let's get yours out of the door [14:02:46] zeljkof: you can just sync mine everywhere at once, no way to test it [14:03:17] Pchelolo: sounds like a good thing to start with :) [14:04:17] !log reload haproxy on dbproxy1010 [14:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:33] ok, the plan: will deploy Pchelolo's commit, then kart_'s then ping Amir1 and dcausse to take over deploying their changes? sounds good? [14:04:49] zeljkof: fine by me [14:05:01] sounds good! 
[14:05:45] yes [14:06:16] (03PS1) 10Muehlenhoff: Fix kakfa cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/391552 [14:06:28] * zeljkof is deploying 391535 [14:06:44] (03PS2) 10Muehlenhoff: Fix kakfa cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/391552 [14:07:57] I see queries flowing towards labsdb1010, will wait a bit and then stop labsdb1009 [14:08:22] (03CR) 10Muehlenhoff: [C: 032] Fix kakfa cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/391552 (owner: 10Muehlenhoff) [14:09:13] jynus: you mean stop the db service? [14:09:49] (or, shutdown the machine) [14:10:02] arturo: yes: https://phabricator.wikimedia.org/T179244 [14:10:23] !log upgrade varnish to 4.1.8-1wm2 on cp4024 (cache_upload, depooled) [14:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:12] ok marostegui [14:11:55] !log upgrade hpsa firmware to 6.06 on restbase2004 - T180562 T141756 [14:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:03] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [14:12:03] T180562: Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562 [14:12:11] !log zfilipin@tin Synchronized php-1.31.0-wmf.8/extensions/EventBus/EventBus.php: SWAT: [[gerrit:391535|Increase request timeout to match kafka produce timeout (T180017)]] (duration: 00m 50s) [14:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:17] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [14:12:36] Pchelolo: deployed ^ please check and thanks for deploying with #releng ;) [14:13:25] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3120: Connection refused [14:13:26] all looks good zeljkof thank you [14:13:34] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp4024 is CRITICAL: connect to address 
10.128.0.124 and port 3124: Connection refused [14:13:38] kart_: your commits are next, anything special about any of them? long time to test? can not be tested at mwdebug? [14:13:45] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3128: Connection refused [14:13:48] looking ^ [14:14:05] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3127: Connection refused [14:14:14] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3122: Connection refused [14:14:14] zeljkof: no. easier. [14:14:15] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3125: Connection refused [14:14:15] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 3121: Connection refused [14:14:15] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4024 is CRITICAL: connect to address 10.128.0.124 and port 80: Connection refused [14:14:23] (sorry for the spam) [14:14:28] on wtp2017 - ferm can't start properly - reason is given as "DNS query for 'prometheus2003.codfw.wmnet' failed: query timed out" but the host/dig can lookup the name just fine.. ?! 
[14:14:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) (owner: 10KartikMistry) [14:15:15] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.157 second response time [14:15:15] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 461 bytes in 0.157 second response time [14:15:24] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 460 bytes in 0.157 second response time [14:15:24] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 460 bytes in 0.157 second response time [14:15:24] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 460 bytes in 0.157 second response time [14:15:34] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 460 bytes in 0.157 second response time [14:15:34] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 460 bytes in 0.157 second response time [14:15:45] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4024 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.157 second response time [14:16:54] RECOVERY - Check whether ferm is active by checking the default input chain on wtp2017 is OK: OK ferm input default policy is set [14:16:56] mutante: reproducibly? 
[14:17:02] (03Merged) 10jenkins-bot: Beta: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) (owner: 10KartikMistry) [14:17:03] godog: no :p [14:17:04] RECOVERY - Check systemd state on wtp2017 is OK: OK - running: The system is fully operational [14:17:08] godog: i could just start it ^ [14:17:12] (03CR) 10jenkins-bot: Beta: Explicitly set cookieDomain for ContentTranslationSiteTemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320200 (https://phabricator.wikimedia.org/T149879) (owner: 10KartikMistry) [14:17:13] heh, damn @resolve [14:17:47] !log wtp2017 - systemctl start ferm (ferm wasnt running due to failed DNS lookup for prometheus2003 sometime in the past) [14:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:17] 10Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#3763049 (10ArielGlenn) Just a FYI, the snapshot hosts 1005,6,7 have: Firmware Version: 3.56 They've been fine so far. [14:18:56] !log repool cp4024 T174891 [14:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:02] T174891: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891 [14:19:30] hm, scap pull is taking forever at mwdebug1002 cc thcipriani|afk [14:19:38] zeljkof: oops! [14:19:47] oh, ok done, but it tool a long time [14:19:49] kart_: 320200 is at mwdebug1002, let me know if it's ok to deploy [14:20:03] testing [14:20:04] kart_: oops? [14:20:07] 10Operations, 10ops-codfw, 10Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3763065 (10fgiunchedi) a:03Papaul Rebooting the machine after the firmware upgrade didn't seem to do the trick, I can't power it back up via ilo. @papaul could you take a look? thanks... [14:20:15] zeljkof: about scap pull. 
no worry. [14:20:21] not something I like to hear during swat ;) [14:20:28] ah, yes [14:20:44] godog: bad idea to mix DNS resolution with iptables rules. Better use external variables or something [14:21:16] git branch -a [14:21:28] heh [14:21:37] zeljkof: nothing broken. go ahead. [14:21:50] kart_: deploying [14:22:41] !log zfilipin@tin Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:320200|Beta: Explicitly set cookieDomain for ContentTranslationSiteTemplates (T149879)]] (duration: 00m 49s) [14:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:49] T149879: Fix ContentTranslation Labs instance (cx2.wmflabs.org) - https://phabricator.wikimedia.org/T149879 [14:23:05] kart_: deployed, please check, I am reviewing the next patch [14:23:18] arturo: indeed, one of the ideas iirc was to use puppet's ipresolve() and let puppet do the resolution either at ferm time or asynchrnously at some other time [14:23:19] (03PS4) 10Zfilipin: Remove wgContentTranslationEnableSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378833 (owner: 10KartikMistry) [14:24:26] kart_: trailing whitespace in the commit message?! ;) fixing https://gerrit.wikimedia.org/r/#/c/378833/4//COMMIT_MSG [14:24:50] (03PS5) 10Zfilipin: Remove wgContentTranslationEnableSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378833 (owner: 10KartikMistry) [14:25:10] ah [14:25:14] godog: I guess that would be an improvement, yes [14:25:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378833 (owner: 10KartikMistry) [14:25:29] godog: looking at for example modules/contint/manifests/firewall/labs.pp, why not simply using the IP address? the FQDN is fixed anyway, I mean, if the FQDN change resolution you will know from inside the own repository, isn't it? or a variable [14:25:48] zeljkof: earlier patch is good. 
[14:25:49] kart_: happens :) fixed [14:26:46] (03Merged) 10jenkins-bot: Remove wgContentTranslationEnableSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378833 (owner: 10KartikMistry) [14:26:56] (03CR) 10jenkins-bot: Remove wgContentTranslationEnableSuggestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378833 (owner: 10KartikMistry) [14:27:35] !log cache_upload: upgrade varnish to 4.1.8-1wm2 [14:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:42] arturo: not sure if I understood but dns is a separate repo [14:27:50] arturo: found the task! T148986 has a bit more context [14:27:51] T148986: Firewall sets not being loaded post-reboot due to a @resolve race on jessie - https://phabricator.wikimedia.org/T148986 [14:28:00] Amir1: ping me when you start! :) [14:28:06] sure [14:28:38] godog: out for lunch, we could talk about this in the future :-) [14:28:38] kart_: 378833 is at mwdebug1002, let me know if it's ok to deploy [14:28:57] arturo: for sure! enjoy [14:29:02] Amir1, dcausse, addshore: I will deploy my final commit in a few minutes, stand by :) [14:29:36] zeljkof: looks good. [14:29:46] kart_: deploying [14:30:51] !log zfilipin@tin Synchronized wmf-config/: SWAT: [[gerrit:378833|Remove wgContentTranslationEnableSuggestions]] (duration: 00m 51s) [14:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:15] kart_: deployed, please check and thanks for deploying with #releng ;) [14:31:24] Amir1, dcausse, addshore: I am done, take over :) [14:31:36] with pleasure [14:31:45] (03CR) 10Ladsgroup: [C: 032] Comply wikidata with new ores thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391197 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:32:02] zeljkof: thanks. looks good. 
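Editor's sketch of the idea raised above by arturo and godog: resolve firewall hostnames once, when the configuration is generated, rather than when the firewall starts. This is only an illustration under stated assumptions, not how ferm or puppet's ipresolve() actually work. The function names, the cache, the example address (from the 192.0.2.0/24 documentation range) and port 9100 are all hypothetical, and the output is a generic iptables-style line rather than ferm syntax.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def resolve_hosts(
    hostnames: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
    cache: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Resolve every hostname up front; fall back to a cached address on failure.

    The cache is a hypothetical store of the last known-good addresses.
    A name that cannot be resolved and has no cached address still raises.
    """
    cache = cache or {}
    resolved: Dict[str, str] = {}
    for name in hostnames:
        try:
            resolved[name] = resolver(name)
        except OSError:  # socket.gaierror and socket.timeout are OSError subclasses
            if name not in cache:
                raise
            resolved[name] = cache[name]
    return resolved


def render_rules(resolved: Dict[str, str], port: int) -> List[str]:
    """Render one iptables-style ACCEPT line per address, sorted by hostname."""
    return [
        f"-A INPUT -s {addr} -p tcp --dport {port} -j ACCEPT"
        for _name, addr in sorted(resolved.items())
    ]
```

Because the rules end up containing plain addresses, a DNS timeout while the firewall starts can no longer prevent the rule set from loading. The trade-off is that the rules have to be regenerated whenever an address changes.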
[14:33:10] (03PS6) 10Filippo Giunchedi: wmflib: switch away from Ganglia::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) [14:33:12] (03PS5) 10Filippo Giunchedi: standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) [14:35:43] (03PS1) 10Muehlenhoff: Update Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/391557 [14:36:14] (03PS2) 10Ladsgroup: Comply wikidata with new ores thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391197 (https://phabricator.wikimedia.org/T180450) [14:37:18] (03CR) 10Ladsgroup: [C: 032] Comply wikidata with new ores thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391197 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:37:20] (03CR) 10Muehlenhoff: [C: 032] Update Cumin alias for druid [puppet] - 10https://gerrit.wikimedia.org/r/391557 (owner: 10Muehlenhoff) [14:37:38] (03PS7) 10Filippo Giunchedi: wmflib: switch away from Ganglia::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) [14:37:56] !log restbase creating Cassandra 3 revision tables on restbase1009 - T179421 [14:38:01] (03PS1) 10Jcrespo: Link to grafana rather than to ganglia on tendril [software/tendril] - 10https://gerrit.wikimedia.org/r/391558 (https://phabricator.wikimedia.org/T177225) [14:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:02] T179421: Migrate revisions and restrictions from legacy to new storage - https://phabricator.wikimedia.org/T179421 [14:38:25] (03CR) 10Volans: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [14:39:31] (03Merged) 10jenkins-bot: Comply wikidata with new ores thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391197 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:39:41] (03CR) 10jenkins-bot: Comply wikidata with new ores thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391197 (https://phabricator.wikimedia.org/T180450) (owner: 10Ladsgroup) [14:39:50] (03CR) 10Filippo Giunchedi: [C: 032] wmflib: switch away from Ganglia::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [14:41:07] I forgot to put justification for ORES change [14:41:09] sorry [14:41:49] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 49s) [14:41:49] just !log it next to it :) [14:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:29] !log deployed ORES change for wikidata (gerrit:391197, phab:T180450) [14:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:35] T180450: ORES thresholds for Wikidata is too strict - https://phabricator.wikimedia.org/T180450 [14:43:16] addshore: now it's time to get your patch moved [14:43:19] yup! [14:43:29] should do a full scap for it [14:44:16] hmm, Amir1 im seeing a few exceptions, i think perhaps coming form your ores change? 
[14:44:42] [{exception_id}] {exception_url} RuntimeException from line 277 of /srv/mediawiki/php-1.31.0-wmf.7/extensions/ORES/includes/Stats.php: Unable to parse threshold: {"levelName":"likelybad","levelConfig":"maximum recall @ precision >= 0.08","bound":"min"," [14:45:21] let me check [14:45:43] They might have dropped off now [14:46:05] 14:41:30 to 14:44:00 [14:46:51] I will monitor it [14:47:01] it's not something I can get done easily [14:47:10] ack! [14:47:16] if it continues I get another patch going on [14:47:19] Amir1: For back compat, we’re using old-style threshold expressions everywhere. I think this might have been the first production use of the new syntax [14:47:32] It should have worked in theory, though. [14:47:39] awight: it's the old style, I just changed the numbers [14:47:59] OK weird so it’s getting munged to new style internally, as it should. [14:48:55] Right, I see your change, should be harmless. [14:49:13] looks like the exceptions dropped off and didnt come back [14:50:52] Amir1: FWIW, The code in extractBoundValue fails with that error if the received threshold data doesn’t contain the threshold we’re looking for… [14:50:59] glad we don’t have to debug further. [14:52:11] PROBLEM - Disk space on prometheus2003 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [14:52:11] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [14:52:30] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [14:53:00] PROBLEM - Disk space on prometheus2004 is CRITICAL: DISK CRITICAL - free space: /boot 0 MB (0% inode=98%) [14:54:11] Amir1: Looks like the branch bump is merged [14:55:42] addshore: yeah [14:55:57] so what's the command, scap sync-file? 
[14:56:37] Nope, it will have to be a full scap, so scap sync 'Log message here' [14:56:44] kk [14:56:50] once you have fetched, rebased, submodules updated etc [14:56:54] *submodule [14:57:12] !log ladsgroup@tin Started scap: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch (T180539) [14:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:18] T180539: wmf.8 still on old Wikidata branch - https://phabricator.wikimedia.org/T180539 [14:57:29] shoooot [14:57:32] hmm, have you fetched and rebased and updated the submodule? [14:57:34] I forgot to fetch and rebase [14:57:41] hehe, just cancel the scap :) [14:57:43] !log ladsgroup@tin scap aborted: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch (T180539) (duration: 00m 31s) [14:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:06] looks good now (files are there) [14:59:30] nope, the rebase failed [15:00:06] The wikidata extension is a mess in tin [15:01:08] What did the rebase fail with? [15:01:27] addshore@tin:/srv/mediawiki-staging/php-1.31.0-wmf.8$ git status [15:01:30] modified: extensions/Wikidata (modified content) [15:01:31] oooh [15:01:48] I aborted the rebase [15:01:54] https://www.irccloud.com/pastebin/YtfhLH54/ [15:02:15] (03PS1) 10Filippo Giunchedi: prometheus: sort nodes by certname in get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/391563 (https://phabricator.wikimedia.org/T179395) [15:02:21] volans: ^ [15:02:32] addshore: should we reset it and bump it to HEAD? 
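Editor's sketch of the order addshore describes above: fetch, rebase, update the submodule, and only then start the full scap. It is a minimal illustration under stated assumptions, not the real deployment tooling. Only the `scap sync 'Log message here'` form and the staging path come from the log. The bare git invocations (no extra flags), the helper names and the choice of working directory are assumptions. The point is that any failing step stops the sequence before scap is reached.

```python
import subprocess
from typing import Callable, List

# Staging checkout shown in the log; treat it as an example value.
STAGING_DIR = "/srv/mediawiki-staging/php-1.31.0-wmf.8"


def build_steps(message: str) -> List[List[str]]:
    """Return the commands in the order they should run."""
    return [
        ["git", "fetch"],
        ["git", "rebase"],
        ["git", "submodule", "update"],
        ["scap", "sync", message],  # full scap, as opposed to scap sync-file
    ]


def run_steps(
    steps: List[List[str]],
    cwd: str = STAGING_DIR,
    run: Callable[..., object] = subprocess.run,
) -> List[List[str]]:
    """Run each step in order; check=True makes a failing step raise,
    so later steps (including scap) are never started."""
    completed: List[List[str]] = []
    for argv in steps:
        run(argv, cwd=cwd, check=True)
        completed.append(argv)
    return completed
```

Calling `run_steps(build_steps("Update extensions/Wikidata (T180539)"))` on a deployment host would run the four commands in sequence. Passing a fake `run` callable makes the ordering checkable without touching git or scap.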
[15:02:42] untestable before merge, sadly [15:02:55] Amir1: give me a sec [15:03:02] k [15:03:19] Amir1: rebase worked fine for me [15:03:50] godog: according to modules/ssh/templates/known_hosts.erb we're using with '$field $sortdirection': 'certname asc' [15:03:56] aah, the submodule update failed though [15:03:58] not sure what happens if you leave it [15:04:03] https://www.irccloud.com/pastebin/jw0b01Hq/ [15:04:28] that's what I'm saying [15:05:00] yeh, lets just reset it to what is currently at the head of master on the branch [15:05:14] godog: ah no seems that asc is default [15:05:31] !log rebooting kubestagetcd* for update to 4.9.51 [15:05:36] volans: ah! I'll specify it anyways, I was reading the query_resources docs [15:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:55] oka [15:06:21] Amir1: I'll let you do it :) [15:06:27] (03PS2) 10Filippo Giunchedi: prometheus: sort nodes by certname in get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/391563 (https://phabricator.wikimedia.org/T179395) [15:06:28] to avoid any messy crossover [15:06:30] on it [15:06:46] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/391563 (https://phabricator.wikimedia.org/T179395) (owner: 10Filippo Giunchedi) [15:07:48] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: sort nodes by certname in get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/391563 (https://phabricator.wikimedia.org/T179395) (owner: 10Filippo Giunchedi) [15:07:52] is someone working on /boot prometheus? [15:08:10] addshore: the patch for elastic was pulled there in a nasty way [15:08:21] the patch for elastic? [15:08:46] ^maybe moritz if it is a due to a kernel update? 
[15:09:02] if not I can have a look [15:09:42] yeah I think he was upgrading the kernel there [15:10:36] I will wait as it should not be creating issues, but he may want to check if installation was successful, etc [15:11:05] !log ladsgroup@tin Started scap: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch, second try (T180539) [15:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:12] T180539: wmf.8 still on old Wikidata branch - https://phabricator.wikimedia.org/T180539 [15:11:36] jynus: I installed new kernel there, but the reboot is postponed until tomorrow due to godog running some tests [15:11:36] Amir1: great! lets see how this goes! [15:11:51] jynus: what's up, shortage in /boot now? [15:11:57] yeah, I mean the space shortage [15:12:10] I can try to delete older unused kernels [15:12:20] oh, I missed these, I'll clean up old kernels [15:12:22] but wanted a heads up in case it could have made install fail or something [15:12:31] (03CR) 10Mobrovac: [C: 031] "LGTM for RB and Kafka" [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [15:12:36] 10Operations, 10ops-codfw, 10Services (watching): Degraded RAID on restbase2004 - https://phabricator.wikimedia.org/T180562#3763161 (10fgiunchedi) a:05Papaul>03fgiunchedi Actually the server came back up after a while, @papaul was checking and power was back on. [15:12:39] and it should not create ongoing issues [15:13:04] or maybe it was ok until reboot tomorrow [15:13:15] !log removing unused kernels from prometheus* [15:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:31] RECOVERY - Disk space on prometheus1003 is OK: DISK OK [15:13:39] cool [15:13:50] the /boot is ridiculously small on those machines, just 88 M [15:13:57] o really? [15:14:14] hmm Amir1 i found an issue while testing it on mwdebug1002 [15:14:14] probably not as small as the 60 GB for our apt repo! 
[15:14:18] :-) [15:14:21] RECOVERY - Disk space on prometheus1004 is OK: DISK OK [15:14:34] fatal error: Argument 1 passed to Wikibase\HistoryEntityAction::__construct() must be an instance of Article, WikiPage given in /srv/mediawiki/php-1.31.0-wmf.8/extensions/Wikidata/extensions/Wikibase/repo/includes/Content/ItemHandler.php on line 118 << Amir1 [15:14:39] might want to cancel that scap again [15:14:49] addshore: it's halfway done [15:14:53] !log ladsgroup@tin scap aborted: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch, second try (T180539) (duration: 03m 48s) [15:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:01] Amir1: https://test.wikidata.org/w/index.php?title=Q99687&action=history [15:15:06] (on mwdebug1002) [15:15:10] RECOVERY - Disk space on prometheus2004 is OK: DISK OK [15:15:21] RECOVERY - Disk space on prometheus2003 is OK: DISK OK [15:15:58] addshore: it doesn't load up [15:16:06] we are way after the SWAT time [15:16:13] Amir1: indeed [15:16:17] I leave it to you [15:16:18] moritzm: sorry to be so annoying [15:16:26] Amir1: okay! [15:16:54] I appreciate all your hard work even if I do not say it enough [15:18:00] Amir1: had your scap started syncing to anywhere yet? [15:18:15] addshore: yes [15:18:42] okay, right, I'll revert the change and re scap i guess then! [15:19:13] Amir1: can you +2 https://gerrit.wikimedia.org/r/#/c/391565/? 
[15:19:45] done [15:19:51] going for lunch [15:20:10] 10Operations, 10Discovery, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3763177 (10Mholloway) [15:23:11] jynus: no, thanks for pointing out the alert, I missed in when it scrolled by [15:27:44] 10Operations, 10Cassandra: cassandra unresponsive on restbase2001-c - https://phabricator.wikimedia.org/T180568#3763213 (10Eevans) p:05Triage>03High [15:29:19] !log installing perl security updates on trusty (Debian hosts fixed two months ago) [15:29:23] 10Operations, 10Traffic, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3763215 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi We're back! And as a nice side effect no longer relyin... [15:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:50] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4027 is CRITICAL: connect to address 10.128.0.127 and port 3128: Connection refused [15:34:54] Amir1: hit the rebase problem with the revert too [15:35:50] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4027 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.157 second response time [15:38:00] 10Operations, 10Cassandra, 10User-Eevans: Abberant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3763260 (10Eevans) [15:40:05] !log addshore@tin Started scap: Revert partial scap: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch, second try (T180539) [15:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:13] T180539: wmf.8 still on old Wikidata branch - https://phabricator.wikimedia.org/T180539 [15:43:05] !log shutting down labsdb1009 for maintenance T179244 [15:43:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:12] T179244: labsdb1009 crashed - OOM - https://phabricator.wikimedia.org/T179244 [15:48:41] (03PS1) 10Giuseppe Lavagetto: Updated docker-pkg to the latest version in master [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/391568 [15:48:55] (03Abandoned) 10Filippo Giunchedi: standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [15:50:02] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Updated docker-pkg to the latest version in master [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/391568 (owner: 10Giuseppe Lavagetto) [15:50:12] 10Operations, 10Cassandra, 10User-Eevans: Abberant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3761927 (10Eevans) Elevated load and GC collection time starting when 2002-b began bootstrapping (after the bootstrap of 2002-a), but extending well beyond the complet... [15:50:50] (03PS2) 10Ema: Lower depool threshold for Apache to 0.8 (80%) [puppet] - 10https://gerrit.wikimedia.org/r/389964 (https://phabricator.wikimedia.org/T178799) (owner: 10Muehlenhoff) [15:50:56] (03CR) 10Ema: [V: 032 C: 032] Lower depool threshold for Apache to 0.8 (80%) [puppet] - 10https://gerrit.wikimedia.org/r/389964 (https://phabricator.wikimedia.org/T178799) (owner: 10Muehlenhoff) [15:52:00] 10Operations, 10Cassandra, 10User-Eevans: Abberant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3763302 (10Eevans) [15:53:11] 10Operations, 10Cassandra, 10User-Eevans: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3763304 (10ema) [15:53:44] godog: i think https://gerrit.wikimedia.org/r/#/c/391011/ is an example where we should ignore the jenkins-bot vote, until the whole mx role is converted to profiles .. 
and we should just do it [15:54:23] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3763309 (10mobrovac) [15:54:51] (03CR) 10Anomie: [C: 031] Work around HHVM bug by using XMLWriter::writeAttribute() [dumps/dcat] - 10https://gerrit.wikimedia.org/r/391489 (https://phabricator.wikimedia.org/T117534) (owner: 10Tim Starling) [15:55:26] !log restart pybal on lvs[12]00[36] for config change https://gerrit.wikimedia.org/r/#/c/389964/ [15:55:28] mutante: yeah I was thinking of sticking a lint ignore to make it explicit, didn't get to it yet today tho [15:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:36] (03PS1) 10Thcipriani: Fix missing postfix heading for Step 2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/391569 [15:56:13] !log oblivian@tin Started deploy [docker-pkg/deploy@9b319d2]: Adding do extension to jinja2, needed for contint images [15:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:31] !log oblivian@tin Finished deploy [docker-pkg/deploy@9b319d2]: Adding do extension to jinja2, needed for contint images (duration: 00m 18s) [15:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:37] mutante: actually I'll do it now [15:57:12] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix missing postfix heading for Step 2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/391569 (owner: 10Thcipriani) [15:57:59] (03PS5) 10Filippo Giunchedi: role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) [15:58:06] (03PS16) 10Elukey: First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [15:59:57] godog: cool :) [16:00:56] (03PS6) 10Filippo Giunchedi: role: add ferm rule for mtail on mx 
[puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) [16:03:54] !log addshore@tin Finished scap: Revert partial scap: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch, second try (T180539) (duration: 23m 48s) [16:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:03] T180539: wmf.8 still on old Wikidata branch - https://phabricator.wikimedia.org/T180539 [16:04:10] well, thats final complete... [16:05:09] *finally [16:10:59] (03CR) 10Filippo Giunchedi: "A few nits but overall LGTM!" (034 comments) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [16:11:45] (03CR) 10Filippo Giunchedi: [C: 032] role: add ferm rule for mtail on mx [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [16:12:39] :) nice! was just compiling it too [16:13:44] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/8788/ noop :)" [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [16:15:36] (03CR) 10Dzahn: "applied on mx2001" [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [16:15:56] godog: done on mx2001, works [16:16:08] mutante: sweet, thanks! 
I did mx1001 [16:16:58] thanks, one more step towards ganglia removal [16:19:45] eddiegp: they merged https://review.openstack.org/#/c/517831/ :) [16:23:29] (03Draft1) 10Addshore: Setup home dir for addshore [puppet] - 10https://gerrit.wikimedia.org/r/391559 [16:23:37] (03PS2) 10Addshore: Setup home dir for addshore [puppet] - 10https://gerrit.wikimedia.org/r/391559 [16:24:39] 10Operations, 10Discovery, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3763392 (10Mholloway) It looks like the new redirect behavior was introduced in January (https://gerrit.wikimedia... [16:26:06] (03PS17) 10Elukey: First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [16:26:21] (03PS3) 10Dzahn: admins: Setup home dir for addshore [puppet] - 10https://gerrit.wikimedia.org/r/391559 (owner: 10Addshore) [16:26:30] (03CR) 10Dzahn: [C: 032] admins: Setup home dir for addshore [puppet] - 10https://gerrit.wikimedia.org/r/391559 (owner: 10Addshore) [16:28:48] (03CR) 10Rush: "I am going to remove to get this off my cluttered dashboard, let me know if I can be helpful though :)" [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [16:29:14] (03Abandoned) 10Rush: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030 (owner: 10Tim Landscheidt) [16:30:20] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/home/addshore/.gitconfig],File[/home/addshore/bin/stat-web-grep] [16:31:17] mutante: --^ (not sure if a temp glitch or not) [16:31:46] probably puppet file race [16:31:51] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/home/addshore/.gitconfig],File[/home/addshore/bin/stat-web-grep],File[/home/addshore/bin/stat-sql-store] [16:31:52] elukey: i think yes, checking [16:31:55] (create new file + reference in same commit) [16:32:26] puppet's lack of transactionality on puppet tree updates is :P [16:32:57] :D [16:33:19] yes, so i double checked on mw2181 [16:33:26] it just creates the file on next run and is ok [16:36:50] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:38:56] woo! [16:42:00] (03PS3) 10Rush: Tools: Fix test for enabled PHP module mcrypt [puppet] - 10https://gerrit.wikimedia.org/r/340059 (https://phabricator.wikimedia.org/T159022) (owner: 10Tim Landscheidt) [16:43:09] (03CR) 10Ottomata: "> Also having certs in ops/puppet can be beneficial as well in debugging as well (simply put, how often do you find -name "foo"/git grep f" [puppet] - 10https://gerrit.wikimedia.org/r/391214 (owner: 10Ottomata) [16:45:48] (03CR) 10Rush: [C: 032] Tools: Fix test for enabled PHP module mcrypt [puppet] - 10https://gerrit.wikimedia.org/r/340059 (https://phabricator.wikimedia.org/T159022) (owner: 10Tim Landscheidt) [16:50:15] (03PS13) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [16:51:13] (03CR) 10Dzahn: "so...which ferm rules are we missing exactly?" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:55:36] (03CR) 10Dzahn: "did you mean this one?" 
[puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:56:26] (03PS7) 10Zoranzoki21: Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [16:56:35] (03CR) 10jerkins-bot: [V: 04-1] Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [16:57:40] Hi. Please rebase patch https://gerrit.wikimedia.org/r/#/c/390188/ because jenkins can not do tests [16:57:54] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3763485 (10mobrovac) [16:59:07] (03PS1) 10Arturo Borrero Gonzalez: maintain-views: implement connection timeouts for views creation [puppet] - 10https://gerrit.wikimedia.org/r/391586 (https://phabricator.wikimedia.org/T180564) [16:59:22] (03CR) 10Dzahn: "confirmed on logstash1001 there are iptables rules including:" [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [16:59:39] (03CR) 10jerkins-bot: [V: 04-1] maintain-views: implement connection timeouts for views creation [puppet] - 10https://gerrit.wikimedia.org/r/391586 (https://phabricator.wikimedia.org/T180564) (owner: 10Arturo Borrero Gonzalez) [17:00:20] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:00:39] (03PS2) 10Arturo Borrero Gonzalez: maintain-views: implement connection timeouts for views creation [puppet] - 10https://gerrit.wikimedia.org/r/391586 (https://phabricator.wikimedia.org/T180564) [17:04:44] Please rebase patch https://gerrit.wikimedia.org/r/#/c/390188/ because jenkins can not do tests [17:04:49] (03PS8) 10Rush: apt: unattended upgrades for wikimedia packages by 
default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [17:12:09] (03PS8) 10Dzahn: Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [17:12:24] Zoranzoki21: done [17:12:30] Thank you [17:13:17] (03CR) 10Volans: "I'm not very familiar with the Prometheus or Druid API, I've left some comment inline" (0310 comments) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [17:13:20] yw:) [17:15:05] (03CR) 10Dzahn: [C: 031] "rebased. that IP is Uni Leipzig. proxy.uni-leipzig.de." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [17:16:29] 10Operations, 10Wikimedia-General-or-Unknown, 10Tor, 10WorkType-NewFunctionality: Run our own Tor client for Tor block - https://phabricator.wikimedia.org/T32716#3763559 (10Aklapper) p:05Normal>03Low [17:16:46] godog: just got reminded that i cant compile puppet changes on certain bastions, due to "Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse template prometheus/cluster_config.erb: [17:17:00] do we know more about that one yet [17:17:57] or actually that might not be the reason since i also see it in the production catalog [17:19:10] (03CR) 10Zoranzoki21: [C: 031] Adjust throttle.php for dewiki workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: 10TerraCodes) [17:21:34] (03CR) 10Dzahn: "see how it compiles fine and without differences on bast1001/2001 (http://puppet-compiler.wmflabs.org/8791/) but also fails to compile on" [puppet] - 10https://gerrit.wikimedia.org/r/353599 (owner: 10Dzahn) [17:22:43] (03CR) 10Zoranzoki21: [C: 031] Remove www.*.org symlinks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/391355 (owner: 10Chad) [17:22:45] (03PS5) 10Rush: openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) [17:23:23] (03CR) 10jerkins-bot: [V: 04-1] openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:24:05] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391353 (owner: 10Chad) [17:27:53] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391251 (https://phabricator.wikimedia.org/T177060) (owner: 10Addshore) [17:32:29] (03CR) 10Jcrespo: [C: 031] "Ok for the query parts. Cannot say about the script in general- if it will fail gracefully, etc." [puppet] - 10https://gerrit.wikimedia.org/r/391586 (https://phabricator.wikimedia.org/T180564) (owner: 10Arturo Borrero Gonzalez) [17:39:11] (03PS1) 10Chad: group1 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391591 [17:39:13] (03CR) 10Chad: [C: 04-2] group1 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391591 (owner: 10Chad) [17:40:43] mutante: heh it has to do with puppetdb and pcc I think, but haven't had the time to investigate, I noticed the same [17:41:07] godog: ok, *nod*. 
at least not the only one, yep [17:41:26] i think my change to convert bastions to profile would probably work, just hard to proof on all 4 [17:41:36] it shows fine on 1001/2001 [17:42:37] (03PS1) 10Dzahn: planet: use profile::base::firewall in role [puppet] - 10https://gerrit.wikimedia.org/r/391595 [17:44:25] (03CR) 10Dzahn: [C: 032] "wmf-style: total violations delta -1" [puppet] - 10https://gerrit.wikimedia.org/r/391595 (owner: 10Dzahn) [17:44:40] jynus: thanks for profile::base::firewall :) [17:45:36] that is not supposed to be a proper solution [17:45:47] either things have to be properly migrated [17:45:54] or a different schema has to be used [17:46:24] https://gerrit.wikimedia.org/r/#/c/391595/1/modules/role/manifests/planet_server.pp is ok though [17:47:08] ok! well, it removes the style violations that keep me from converting more things to profile. that in itself is cool:) [17:47:10] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3763641 (10dduvall) Thanks for the update @joe and @MoritzMuehlenhoff! What do you think about maintaining separate tags bas... [17:48:09] (03CR) 10Jcrespo: "By the addition of profile::base::firewall, this could actually work, (maintaining both the profile and the module for a while) although i" [puppet] - 10https://gerrit.wikimedia.org/r/383519 (owner: 10Giuseppe Lavagetto) [17:52:17] (03PS3) 10Chad: search.wikimedia.org: Stop supporting non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390358 [17:52:45] (03CR) 10Chad: [C: 032] "Per IRC, dropping WIP and will merge this. We can't find any evidence of this script ever getting any non-WP traffic." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/390358 (owner: 10Chad) [17:56:09] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3763672 (10elukey) [17:56:12] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3763670 (10elukey) 05Open>03Resolved All the work has been completed, closing the task! [17:56:24] (03Merged) 10jenkins-bot: search.wikimedia.org: Stop supporting non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390358 (owner: 10Chad) [17:56:34] (03CR) 10jenkins-bot: search.wikimedia.org: Stop supporting non-Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390358 (owner: 10Chad) [17:57:55] !log demon@tin Synchronized docroot/search.wikimedia.org/index.php: removing support for non-wikipedias (duration: 00m 49s) [17:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:21] (03PS1) 10Dzahn: misc PHP apps: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391607 [17:59:07] (03PS2) 10Dzahn: misc PHP apps: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391607 [18:00:30] (03CR) 10Dzahn: [C: 032] "wmf-style: total violations delta -2" [puppet] - 10https://gerrit.wikimedia.org/r/391607 (owner: 10Dzahn) [18:02:04] (03CR) 10Dzahn: "no-op on krypton.eqiad.wmnet - also going to convert this to profile next" [puppet] - 10https://gerrit.wikimedia.org/r/391607 (owner: 10Dzahn) [18:13:42] (03CR) 10Phuedx: [C: 031] "This LGTM. This'll enable the button for all users on all wikis (knowing that there's browser-specific limiting in the target codebase)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/391041 (https://phabricator.wikimedia.org/T179914) (owner: 10Jdlrobson) [18:16:10] (03PS1) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [18:16:47] (03CR) 10jerkins-bot: [V: 04-1] misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [18:17:11] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: [subtask] How should we get Chromium for use in puppeteer? - https://phabricator.wikimedia.org/T178570#3763739 (10bmansurov) a:05bmansurov>03None [18:18:22] (03PS6) 10Rush: openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) [18:18:52] (03CR) 10jerkins-bot: [V: 04-1] openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:19:20] (03CR) 10Rush: [V: 032 C: 032] openstack: move main hiera deployment config to common [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:19:40] (03CR) 10Rush: [V: 032 C: 032] "taking style violations to not refactor db code inline w/ this param change" [puppet] - 10https://gerrit.wikimedia.org/r/387625 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:20:21] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: How should we get Chromium for use in puppeteer? 
- https://phabricator.wikimedia.org/T178570#3763759 (10bmansurov) [18:20:44] !log rebooting auth [18:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:51] !log rebooting auth* servers for update to 4.9.51 [18:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:07] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 2 others: [subtask] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3763761 (10bmansurov) a:05bmansurov>03None [18:21:21] PROBLEM - puppet last run on dbproxy1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:22:07] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 3 others: [spike] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3763762 (10phuedx) [18:22:17] 10Operations, 10Gerrit, 10Readers-Web-Backlog, 10Patch-For-Review, and 3 others: [spike] Temporarily allow pushing large objects - https://phabricator.wikimedia.org/T178189#3683850 (10phuedx) a:03phuedx [18:24:03] (03PS1) 10Rush: horizon: restrict to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/391611 [18:24:23] (03CR) 10Muehlenhoff: [C: 031] horizon: restrict to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/391611 (owner: 10Rush) [18:24:29] (03PS2) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [18:24:53] (03CR) 10Rush: [C: 032] horizon: restrict to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/391611 (owner: 10Rush) [18:26:21] RECOVERY - puppet last run on dbproxy1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:26:56] (03Abandoned) 10Rush: role::horizon: Restrict to $CACHE_MISC [puppet] - 10https://gerrit.wikimedia.org/r/366858 (owner: 10Muehlenhoff) [18:28:35] (03PS2) 10Rush: dynamicproxy: Add vhost to access.log [puppet] - 
10https://gerrit.wikimedia.org/r/389409 (https://phabricator.wikimedia.org/T178963) (owner: 10BryanDavis) [18:31:36] (03PS3) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [18:33:26] (03CR) 10Rush: [C: 032] dynamicproxy: Add vhost to access.log [puppet] - 10https://gerrit.wikimedia.org/r/389409 (https://phabricator.wikimedia.org/T178963) (owner: 10BryanDavis) [18:35:29] (03PS4) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [18:36:43] (03PS1) 10Jdlrobson: Disable EventLogging for popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391615 (https://phabricator.wikimedia.org/T178500) [18:38:22] !log removed 2FA from User:Einsbor after verification for votewiki, stewardwiki and SUL [18:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:19] (03PS1) 10Dzahn: misc static sites: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391616 [18:40:21] (03PS11) 10Rush: ssh-key-ldap-lookup: Deny user auth if /etc/block-ldap-key-lookup exists [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [18:41:57] (03PS4) 10Rush: Tools: Undo obsolete /var/mail customization [puppet] - 10https://gerrit.wikimedia.org/r/326306 (owner: 10Tim Landscheidt) [18:42:11] (03PS1) 10Madhuvishy: firstboot: Prevent non-root users from logging in during instance set up [puppet] - 10https://gerrit.wikimedia.org/r/391619 (https://phabricator.wikimedia.org/T171508) [18:42:35] (03CR) 10Rush: [C: 032] Tools: Undo obsolete /var/mail customization [puppet] - 10https://gerrit.wikimedia.org/r/326306 (owner: 10Tim Landscheidt) [18:43:10] (03PS2) 10Madhuvishy: firstboot: Prevent non-root users from logging in during instance set up [puppet] - 10https://gerrit.wikimedia.org/r/391619 (https://phabricator.wikimedia.org/T171508) [18:47:45] (03CR) 10Zoranzoki21: [C: 031] Disable 
EventLogging for popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391615 (https://phabricator.wikimedia.org/T178500) (owner: 10Jdlrobson) [18:48:03] (03PS9) 10Zoranzoki21: Enable the ArticlePlaceholder for Northern Sami (sewiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [18:48:33] (03PS10) 10Zoranzoki21: Enable the ArticlePlaceholder for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [18:49:25] (03CR) 10Rush: [V: 032 C: 032] "looks like security and legal both say this is good to go so I'm going to merge it. This won't be active until a run of maintain-views an" [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [18:49:33] (03PS11) 10Rush: Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [18:56:32] (03CR) 10Phuedx: [C: 031] "This'll disable the EventLogging instrumentation for all wikis but leave the rest of the experimental setup in tact." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391615 (https://phabricator.wikimedia.org/T178500) (owner: 10Jdlrobson) [18:58:41] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3763858 (10Joe) @MoritzMuehlenhoff did you ever took a look at the deb packages that are officially distributed by node? The... 
[18:58:55] (03PS1) 10Chad: Adding lfs plugin @ 2.13.9 [software/gerrit] - 10https://gerrit.wikimedia.org/r/391625 [18:59:48] (03CR) 10Chad: "Plugin already built and uploaded to Archiva: https://archiva.wikimedia.org/#artifact~releases/com.googlesource.gerrit.plugins/lfs/2.13.9" [software/gerrit] - 10https://gerrit.wikimedia.org/r/391625 (owner: 10Chad) [19:00:06] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171115T1900). [19:00:06] Jdlrobson and RoanKattouw: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:27] I'm here [19:01:22] And I can SWAT, too [19:01:31] \o [19:01:55] !log Restarting Cassandra instances on restbase2001.codfw.wmnet (T180568) [19:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:02] T180568: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568 [19:03:07] RoanKattouw: first patch needs no testing since it wont be enabled until the feature flag is turned on [19:03:12] (03PS3) 10Catrope: Enable structured change filters by default on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382333 (https://phabricator.wikimedia.org/T177444) [19:03:17] jdlrobson: The "limit to Chrome" one? [19:03:20] correct [19:04:09] (03PS5) 10Rush: Configure fixed lock manager ports for labstore NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/357562 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [19:06:11] Meanwhile my laptop's acting up, please stand by while I reboot [19:06:19] RoanKattouw: can you ping me when you have finished swat? 
:) [19:07:22] Will do [19:08:14] (03CR) 10Catrope: [C: 032] Enable structured change filters by default on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382333 (https://phabricator.wikimedia.org/T177444) (owner: 10Catrope) [19:10:43] (03Merged) 10jenkins-bot: Enable structured change filters by default on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382333 (https://phabricator.wikimedia.org/T177444) (owner: 10Catrope) [19:10:54] (03CR) 10jenkins-bot: Enable structured change filters by default on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/382333 (https://phabricator.wikimedia.org/T177444) (owner: 10Catrope) [19:11:30] (03PS3) 10Madhuvishy: firstboot: Prevent non-root users from logging in during instance set up [puppet] - 10https://gerrit.wikimedia.org/r/391619 (https://phabricator.wikimedia.org/T171508) [19:11:39] (03CR) 10Ayounsi: [WIP] Puppetize Netbox (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) (owner: 10Ayounsi) [19:12:38] (03PS16) 10Ayounsi: [WIP] Puppetize Netbox [puppet] - 10https://gerrit.wikimedia.org/r/387880 (https://phabricator.wikimedia.org/T170144) [19:13:19] (03CR) 10Thcipriani: [C: 031] "This should be ok to land now that scap 3.7.3-1 has been released (yesterday), all deployments should default to trying to use the key tha" [puppet] - 10https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [19:14:13] !log catrope@tin Synchronized wmf-config/flaggedrevs.php: Enable RCFilters on all remaining wikis (T177445) (duration: 00m 49s) [19:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:20] T177445: Graduate New Filters on Recent Changes out of beta on all wikis without "Hide reviewed edits" filter (shown on some FlaggedRevs wikis) - https://phabricator.wikimedia.org/T177445 [19:14:52] (03CR) 10Rush: [C: 
031] firstboot: Prevent non-root users from logging in during instance set up [puppet] - 10https://gerrit.wikimedia.org/r/391619 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [19:15:19] (03CR) 10Madhuvishy: [C: 032] firstboot: Prevent non-root users from logging in during instance set up [puppet] - 10https://gerrit.wikimedia.org/r/391619 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [19:16:13] !log catrope@tin Synchronized php-1.31.0-wmf.7/skins/MinervaNeue/resources/skins.minerva.scripts/init.js: Limit download button to Google Chrome (T179529, T179914) (duration: 00m 49s) [19:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:20] T179914: Deploy print to PDF button for Chrome on Android - https://phabricator.wikimedia.org/T179914 [19:16:20] T179529: [Spike] Can we detect browsers where the window.print function simply doesn't work? - https://phabricator.wikimedia.org/T179529 [19:17:42] (03CR) 10Rush: openstack2: no Icinga paging (SMS) if on labtest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [19:19:49] !log catrope@tin Synchronized php-1.31.0-wmf.8/resources/src/mediawiki.rcfilters/mw.rcfilters.UriProcessor.js: T180577 (duration: 00m 49s) [19:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:56] T180577: [Regression] Recent Changes on MediaWiki.org doesn't display more than 50 past edits - https://phabricator.wikimedia.org/T180577 [19:20:51] 10Operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor throughput towards some destinations - https://phabricator.wikimedia.org/T120425#1853689 (10ayounsi) @Nemo_bis I see that the last comment is from more than a year ago, is that issue still happening? 
[19:21:39] (03PS2) 10Catrope: Enable the download icon on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391041 (https://phabricator.wikimedia.org/T179914) (owner: 10Jdlrobson) [19:21:44] (03CR) 10Rush: [C: 032] Configure fixed lock manager ports for labstore NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/357562 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [19:21:49] (03PS6) 10Rush: Configure fixed lock manager ports for labstore NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/357562 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [19:24:21] (03CR) 10Catrope: [C: 032] Enable the download icon on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391041 (https://phabricator.wikimedia.org/T179914) (owner: 10Jdlrobson) [19:25:44] (03Merged) 10jenkins-bot: Enable the download icon on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391041 (https://phabricator.wikimedia.org/T179914) (owner: 10Jdlrobson) [19:26:25] (03CR) 10jenkins-bot: Enable the download icon on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391041 (https://phabricator.wikimedia.org/T179914) (owner: 10Jdlrobson) [19:30:32] 10Operations, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#3764041 (10demon) [19:30:43] !log labstore1004:~# service nfs-kernel-server restart [19:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:52] jdlrobson: Your download icon thing is on mwdebug1002, please test [19:32:17] (03CR) 10Paladox: [C: 031] Adding lfs plugin @ 2.13.9 [software/gerrit] - 10https://gerrit.wikimedia.org/r/391625 (owner: 10Chad) [19:33:49] Jayprakash12345 [19:33:51] [config] 390182 Enable the SandboxLink extension in the Mirandese Wikipedia (T180052) [19:33:51] [config] 390183 Add BP and WP as aliases to project namespace at mwlwiki (T180052) [19:33:51] T180052: Creation and update of 
namespaces in Mirandese Wikipedia (mwlwiki) - https://phabricator.wikimedia.org/T180052 [19:34:14] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Minerva download icon on all wikis (T179914) (duration: 00m 48s) [19:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:20] T179914: Deploy print to PDF button for Chrome on Android - https://phabricator.wikimedia.org/T179914 [19:34:26] (03PS2) 10Catrope: Disable EventLogging for popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391615 (https://phabricator.wikimedia.org/T178500) (owner: 10Jdlrobson) [19:34:32] (03CR) 10Catrope: [C: 032] Disable EventLogging for popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391615 (https://phabricator.wikimedia.org/T178500) (owner: 10Jdlrobson) [19:35:49] (03Merged) 10jenkins-bot: Disable EventLogging for popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391615 (https://phabricator.wikimedia.org/T178500) (owner: 10Jdlrobson) [19:36:24] (03CR) 10jenkins-bot: Disable EventLogging for popups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391615 (https://phabricator.wikimedia.org/T178500) (owner: 10Jdlrobson) [19:36:34] Please look at https://wikitech.wikimedia.org/wiki/Deployments#Week_of_November_13th again. 
[19:37:33] (03PS9) 10Rush: Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:37:52] !log upgrade of elasticsearch codfw completed, cluster still recovering - T178411 [19:37:56] jdlrobson: "Disable EventLogging for popups" is on mwdebug1002, please test [19:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:01] T178411: Upgrade cirrus elasticsearch clusters to 5.5.x - https://phabricator.wikimedia.org/T178411 [19:38:08] RoanKattouw: on it [19:38:13] Jayprakash12345: Aha thanks, I'll pick those up [19:38:15] (03CR) 10jerkins-bot: [V: 04-1] Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:38:20] (03PS10) 10Rush: Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:38:35] (03PS4) 10Catrope: Enable the SandboxLink extension in the Mirandese Wikipedia (Third Req) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390182 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:38:39] (03CR) 10Catrope: [C: 032] Enable the SandboxLink extension in the Mirandese Wikipedia (Third Req) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390182 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:38:51] 10Operations, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#3764068 (10demon) @thcipriani Noticed that it appears it's been [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=792075 | added ]] to [[ https://packages.debian.org/source/buster/gi... 
[19:38:55] (03CR) 10jerkins-bot: [V: 04-1] Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:39:10] LGTM RoanKattouw [19:39:59] (03Merged) 10jenkins-bot: Enable the SandboxLink extension in the Mirandese Wikipedia (Third Req) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390182 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:40:01] (03CR) 10jenkins-bot: Enable the SandboxLink extension in the Mirandese Wikipedia (Third Req) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390182 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:40:20] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Disable EventLogging for Popups (T178500) (duration: 00m 49s) [19:40:20] (03PS11) 10Rush: Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:25] T178500: Stop Page Previews enwiki and dewiki A/B test (again) - https://phabricator.wikimedia.org/T178500 [19:40:59] (03CR) 10Chad: [V: 032 C: 032] Adding lfs plugin @ 2.13.9 [software/gerrit] - 10https://gerrit.wikimedia.org/r/391625 (owner: 10Chad) [19:41:21] !log demon@tin Started deploy [gerrit/gerrit@bce982f]: adding lfs plugin @ 2.13.9 [19:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:30] !log demon@tin Finished deploy [gerrit/gerrit@bce982f]: adding lfs plugin @ 2.13.9 (duration: 00m 08s) [19:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:36] (03PS12) 10Rush: Add role::labs::libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:41:40] :) [19:42:11] 
(03CR) 10Jdrewniak: "I don't know much about the Apache config here, but I know assets for www.wikipedia.org are served from URLs like:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391355 (owner: 10Chad) [19:42:43] (03PS13) 10Rush: wmcs tenant: add libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:42:51] (03PS14) 10Rush: wmcs tenant: add libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:44:10] PROBLEM - Long running screen/tmux on iron is CRITICAL: CRIT: Long running SCREEN process. (PID: 24047, 1729221s 1728000s). [19:44:54] (03PS15) 10Legoktm: wmcs tenant: add libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) [19:45:01] Jayprakash12345: SandboxLink on mwlwiki is ready for testing on mwdebug1002, please test and confirm that it works [19:45:10] (03CR) 10Rush: [C: 031] "no issue on my end, we have no toolforge presence in labtest though so it's a moot point atm. But that doesn't mean it's not a good idea " [puppet] - 10https://gerrit.wikimedia.org/r/384893 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [19:45:11] ok [19:46:03] thanks for your help RoanKattouw :) [19:46:56] RoanKattouw : Everything is fine in X-debug. Please run stashbot.
[19:47:35] (03CR) 10Rush: [C: 032] wmcs tenant: add libraryupgrader puppet configuration [puppet] - 10https://gerrit.wikimedia.org/r/372213 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:48:05] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable SandboxLink on mwlwiki (T180052) (duration: 00m 48s) [19:48:08] (03PS1) 10Herron: puppet: point codfw mediawiki::appservers at puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391627 (https://phabricator.wikimedia.org/T177254) [19:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:12] T180052: Creation and update of namespaces in Mirandese Wikipedia (mwlwiki) - https://phabricator.wikimedia.org/T180052 [19:48:37] (03PS4) 10Catrope: Add BP and WP as aliases to project namespace at mwlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390183 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:48:46] mutante: left a comment on https://gerrit.wikimedia.org/r/#/c/384892/ [19:48:47] (03CR) 10Catrope: [C: 032] Add BP and WP as aliases to project namespace at mwlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390183 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:50:08] (03Merged) 10jenkins-bot: Add BP and WP as aliases to project namespace at mwlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390183 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:50:17] (03CR) 10jenkins-bot: Add BP and WP as aliases to project namespace at mwlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390183 (https://phabricator.wikimedia.org/T180052) (owner: 10Jayprakash12345) [19:50:38] Jayprakash12345: The namespace aliases patch is ready for testing now [19:51:13] RoanKattouw: I'm going to hit +2 on my 2 things now (if that's okay) I think you're on your last patch?
[19:51:19] Yes I am [19:51:20] Go ahead [19:51:39] Awesome, have done :) [19:51:56] RoanKattouw : Everything is fine in X-debug. Please run stashbot. [19:53:16] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add BP and WP aliases for project namespace on mwlwiki (T180052) (duration: 00m 48s) [19:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:23] T180052: Creation and update of namespaces in Mirandese Wikipedia (mwlwiki) - https://phabricator.wikimedia.org/T180052 [19:53:26] addshore: That's me done ---^^ [19:53:39] Awesome! thanks! [19:53:48] Now to do my bit before the train! [19:54:06] RoanKattouw: Thanks [19:56:36] (03PS8) 10Rush: Add initial profile for ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [19:57:37] (03PS1) 10Gehel: wdqs: move ldf traffic to wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/391630 (https://phabricator.wikimedia.org/T176593) [19:58:09] (03CR) 10Smalyshev: [C: 031] wdqs: move ldf traffic to wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/391630 (https://phabricator.wikimedia.org/T176593) (owner: 10Gehel) [19:58:26] (03CR) 10Gehel: [C: 032] wdqs: move ldf traffic to wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/391630 (https://phabricator.wikimedia.org/T176593) (owner: 10Gehel) [20:00:04] no_justification: Time to snap out of that daydream and deploy MediaWiki train. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171115T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS.
[20:00:34] =o [20:00:40] * addshore is still going fyi no_justification :) [20:00:50] You're good [20:03:41] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:03:51] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.164 and port 9042: Connection refused [20:04:00] * addshore waits for https://gerrit.wikimedia.org/r/#/c/391631 [20:09:29] (03CR) 10Rush: "Totally should be a profile but all of this storage code needs refactoring and for now I don't want to separate this bit from the rest of " [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [20:09:42] (03PS1) 10Ottomata: Try async kafka producer to fix http timeouts [puppet] - 10https://gerrit.wikimedia.org/r/391634 (https://phabricator.wikimedia.org/T180017) [20:09:50] RECOVERY - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-c valid until 2018-08-17 16:11:42 +0000 (expires in 274 days) [20:10:37] (03Abandoned) 10Rush: Add ferm service for rpc.statd on labstore [puppet] - 10https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [20:11:00] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042 [20:11:28] !log otto@tin Started deploy [eventlogging/eventbus@872cfb3]: deploying kafka-futures change to kafka1001 only, will apply async with https://gerrit.wikimedia.org/r/#/c/391634/ [20:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:49] !log otto@tin Finished deploy [eventlogging/eventbus@872cfb3]: deploying kafka-futures change to kafka1001 only, will apply async with https://gerrit.wikimedia.org/r/#/c/391634/ (duration: 00m 20s) [20:11:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:28] (03PS2) 10Ottomata: Try async kafka producer to fix http timeouts [puppet] - 10https://gerrit.wikimedia.org/r/391634 (https://phabricator.wikimedia.org/T180017) [20:12:54] (03PS12) 10Rush: labstore: initial ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [20:13:32] (03CR) 10jerkins-bot: [V: 04-1] labstore: initial ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [20:13:40] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Ferm rules for labstore NFS hosts - https://phabricator.wikimedia.org/T165136#3764481 (10chasemp) cc'd from the patch Totally should be a profile but all of this storage code needs refactoring and for now I don't want to separate this bit from the rest of the l... [20:14:53] Pulled it onto mwdebug1002 for a bit of testing [20:15:58] (03PS3) 10Ottomata: Try async kafka producer to fix http timeouts [puppet] - 10https://gerrit.wikimedia.org/r/391634 (https://phabricator.wikimedia.org/T180017) [20:16:01] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/8796/kafka1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/391634 (https://phabricator.wikimedia.org/T180017) (owner: 10Ottomata) [20:16:06] (03CR) 10Ottomata: [V: 032 C: 032] Try async kafka producer to fix http timeouts [puppet] - 10https://gerrit.wikimedia.org/r/391634 (https://phabricator.wikimedia.org/T180017) (owner: 10Ottomata) [20:16:17] Looking good [20:17:00] !log addshore@tin Started scap: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch (again) T180539 [20:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:06] T180539: wmf.8 still on old Wikidata branch - https://phabricator.wikimedia.org/T180539 [20:17:10] no_justification: all looking 
good, so syncing! :) [20:17:51] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] [20:19:07] (03PS1) 10Chad: Use lfs plugin for lfs support [puppet] - 10https://gerrit.wikimedia.org/r/391635 [20:20:51] RECOVERY - High lag on wdqs2002 is OK: OK: Less than 30.00% above the threshold [600.0] [20:21:19] (03Draft1) 10Paladox: gerrit: Set lfs.plugin in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/391636 [20:21:21] (03Draft2) 10Paladox: gerrit: Set lfs.plugin in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/391636 [20:21:33] lol abandoning as chad just did it :) [20:21:44] (03Abandoned) 10Paladox: gerrit: Set lfs.plugin in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/391636 (owner: 10Paladox) [20:21:46] (03CR) 10Paladox: [C: 031] Use lfs plugin for lfs support [puppet] - 10https://gerrit.wikimedia.org/r/391635 (owner: 10Chad) [20:26:54] (03PS3) 10Rush: maintain-views: implement connection timeouts for views creation [puppet] - 10https://gerrit.wikimedia.org/r/391586 (https://phabricator.wikimedia.org/T180564) (owner: 10Arturo Borrero Gonzalez) [20:28:00] (03CR) 10Rush: [V: 032 C: 032] "I think this is better than it was :)" [puppet] - 10https://gerrit.wikimedia.org/r/391586 (https://phabricator.wikimedia.org/T180564) (owner: 10Arturo Borrero Gonzalez) [20:32:00] (03PS2) 10Dzahn: gerrit: Use lfs plugin for lfs support [puppet] - 10https://gerrit.wikimedia.org/r/391635 (owner: 10Chad) [20:33:23] (03CR) 10Dzahn: [C: 032] gerrit: Use lfs plugin for lfs support [puppet] - 10https://gerrit.wikimedia.org/r/391635 (owner: 10Chad) [20:33:42] (03PS3) 10Dzahn: gerrit: Use lfs plugin for lfs support [puppet] - 10https://gerrit.wikimedia.org/r/391635 (owner: 10Chad) [20:34:01] no_justification that will require a gerrit restart i think ^^ [20:34:02] :) [20:34:11] Indeed [20:34:57] (03PS4) 10Dzahn: gerrit: Use lfs plugin for lfs support [puppet] - 10https://gerrit.wikimedia.org/r/391635 
(owner: 10Chad) [20:35:29] ok [20:37:01] applied on gerrit2001 [20:37:37] thanks :) [20:38:35] restarting didnt break it there, so confirmed. we can wait with cobalt i guess or do it now [20:38:49] !log addshore@tin Finished scap: Update extensions/Wikidata to new wmf/1.31.0-wmf.8 branch (again) T180539 (duration: 21m 48s) [20:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:55] T180539: wmf.8 still on old Wikidata branch - https://phabricator.wikimedia.org/T180539 [20:38:58] no_justification: nothing broken! :) [20:39:19] mutante: jfdi [20:39:20] :) [20:40:42] !log restarting gerrit to enable 'large file support'-plugin gerrit:391635 [20:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:29] back [20:42:02] :) [20:42:15] random warning of the day [20:42:16] WARNING: Multiple Servlet injectors detected. This is a warning indicating that you have more than one Guice... expected. [20:43:34] that can be ignored :) [20:43:36] 20:43:02.799842 trace git-lfs: HTTP: {"message":"User anonymous is not authorized to perform upload operation"} [20:43:36] 20:43:02.799893 trace git-lfs: api: http response indicates "basic" authentication. Resubmitting... [20:43:36] 20:43:02.841173 trace git-lfs: creds: git credential cache ("https", "gerrit.wikimedia.org", "r/test/gerrit-ping/info/lfs") [20:43:36] 20:43:02.841190 trace git-lfs: Filled credentials for https://paladox@gerrit.wikimedia.org/r/test/gerrit-ping/info/lfs [20:43:43] no_justification i get ^^ [20:43:46] jouncebot: ressurect [20:43:47] also brb [20:44:19] Aww. /me pokes [20:44:30] :) it worked. yay [20:44:32] thanks [20:45:13] Worked? but I don't see it come back yet.. [20:45:25] There you are, jouncebot. [20:45:55] Niharika: the command worked, it made it come back :) [20:46:13] Yeah. 
:) [20:47:47] 10Operations, 10Traffic, 10netops: Number of nlwiki (biography) articles getting consistently ~70 hits per day for the past months - https://phabricator.wikimedia.org/T180621#3764660 (10MarcoAurelio) [20:48:29] 10Operations, 10Traffic: Number of nlwiki (biography) articles getting consistently ~70 hits per day for the past months - https://phabricator.wikimedia.org/T180621#3763869 (10MarcoAurelio) [20:50:13] (03CR) 10Dzahn: openstack2: no Icinga paging (SMS) if on labtest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) (owner: 10Dzahn) [20:50:41] chasemp: thank you. yes, i added the default = true [20:50:49] well, uploading now [20:51:11] hieradata/regex.yaml: needs merge [20:51:28] what the hack, ..ok amending [20:52:35] (03PS3) 10Dzahn: openstack2: no Icinga paging (SMS) if on labtest [puppet] - 10https://gerrit.wikimedia.org/r/384892 (https://phabricator.wikimedia.org/T178008) [20:52:38] (03PS1) 10Herron: puppet: point codfw mediawiki canary appservers at puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/391646 (https://phabricator.wikimedia.org/T177254) [20:52:52] !log Restarting restbase2002-a.codfw.wmnet (T180568) [20:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:58] T180568: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568 [20:57:05] (03PS1) 10Ottomata: Update kafka-jumbo to kafka 0.11.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/391649 (https://phabricator.wikimedia.org/T152015) [20:57:22] (03PS2) 10Ottomata: Update kafka-jumbo to kafka 0.11.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/391649 (https://phabricator.wikimedia.org/T152015) [20:58:12] (03CR) 10Ottomata: [C: 032] Update kafka-jumbo to kafka 0.11.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/391649 (https://phabricator.wikimedia.org/T152015) (owner: 10Ottomata) [20:58:13] 10Operations, 10Wikimedia-Mailing-lists: 
cleanup mailman archives - introduce apache rewrites - https://phabricator.wikimedia.org/T109609#3764692 (10RobH) 05Open>03Resolved a:03RobH I no longer think doing this across the board is a good idea. Sometimes they are intentionally linked to the old archives,... [21:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171115T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:01:02] (03PS2) 10Dzahn: misc static sites: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391616 [21:01:12] Nothing for ORES. [21:01:14] !log restarting kafka-jumbo brokers to update to Kafka 0.11.0.1 [21:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:24] (03PS3) 10Dzahn: misc static sites: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391616 [21:01:57] no mobileapps deployment today. [21:02:15] apergos: hey! if you're around, labstore1003 mounts a /dumps share on dataset1001. Afaik, dataset1001 rsyncs dumps over to labstore1003, so we're wondering why the nfs mount - i can't find any references on puppet, and thought i'd check with you [21:02:51] madhuvishy: I'm here but can I look at this and get back to you tomorrow? 
[21:02:52] (03CR) 10Dzahn: [C: 032] misc static sites: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391616 (owner: 10Dzahn) [21:03:15] (11 pm and I'm in a different headspace now) [21:03:30] if there's a changeset or a ticket I can surely add what I know [21:03:41] ya np, whenever you can :) [21:04:31] (03CR) 10Dzahn: "no-op on bromine.eqiad.wmnet - the only affected node" [puppet] - 10https://gerrit.wikimedia.org/r/391616 (owner: 10Dzahn) [21:05:13] * addshore taps out [21:07:02] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3764750 (10Jdrewniak) [21:10:05] (03PS1) 10Dzahn: RT,releases: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391653 [21:11:28] (03CR) 10Dzahn: [C: 032] RT,releases: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391653 (owner: 10Dzahn) [21:12:58] (03CR) 10Dzahn: "wmf-style: total violations delta -2" [puppet] - 10https://gerrit.wikimedia.org/r/391653 (owner: 10Dzahn) [21:17:41] back [21:18:20] 10Operations: reinstall RT server with private IP and stretch - https://phabricator.wikimedia.org/T180641#3764808 (10Dzahn) [21:19:22] 10Operations: reinstall RT server with private IP and stretch - https://phabricator.wikimedia.org/T180641#3764821 (10Dzahn) p:05Triage>03Low [21:19:35] 10Operations: reinstall RT server with private IP and stretch - https://phabricator.wikimedia.org/T180641#3764808 (10Dzahn) a:03Dzahn [21:21:25] !log Disabling puppet across cloud VPS through cumin on labpuppetmaster1001 [21:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:50] mutante: I see that ununpentium needs to go away, since the element has a real name now? 
:D [21:22:31] (03PS12) 10Madhuvishy: ssh-key-ldap-lookup: Deny user auth if /etc/block-ldap-key-lookup exists [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) [21:23:32] Sagan: does it? hehe, the joke was that " Ununpentium has no practical uses yet. It’s so unstable that it doesn’t stay around long enough to make anything out of it. " :) [21:23:36] (03CR) 10Chad: [C: 032] group1 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391591 (owner: 10Chad) [21:23:41] https://www.newyorker.com/tech/elements/ununpentium-the-newest-element [21:24:30] mutante: ah, ok. the newest now is Oganesson, which is an IRC nick I use as well :D [21:24:31] moscovium? [21:24:40] I think most of them have no real use [21:24:45] https://en.wikipedia.org/wiki/Moscovium [21:24:55] (03Merged) 10jenkins-bot: group1 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391591 (owner: 10Chad) [21:25:03] In 1979 IUPAC recommended that the placeholder systematic element name ununpentium (with the corresponding symbol of Uup)[23] be used until the discovery of the element is confirmed and a permanent name is decided. [21:25:17] "placeholder element name" heh [21:25:30] the fnord of the periodic table [21:28:11] (03CR) 10Madhuvishy: [C: 032] ssh-key-ldap-lookup: Deny user auth if /etc/block-ldap-key-lookup exists [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [21:28:28] (03CR) 10jenkins-bot: group1 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391591 (owner: 10Chad) [21:33:53] !log demon@tin Synchronized php: symlink swap (duration: 00m 49s) [21:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:00] mutante: Any idea what's up with mw1911? It's up and serving (some?) traffic, as I see it in logstash. But it's not in the dsh group (so had to do a manual `scap pull`). Is it depooled? 
[21:37:25] (03PS5) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [21:38:41] Ahh, depooled since Nov 03 [21:38:45] thcipriani: cc ^ [21:39:02] heh, that'd do it :) [21:39:03] no_justification: i cant ssh to it and dont get info from confctl it seems [21:39:18] I'm currently ssh'd to it [21:39:31] Oh, mw1191 [21:39:32] Typo! [21:39:36] 1911 sounds unusual [21:39:38] 1911 isn't a server [21:39:39] Yes [21:40:02] ack, depooled [21:40:24] it's broken https://phabricator.wikimedia.org/T179640 [21:40:30] Ok. Curious why it was getting some minimal amount of traffic [21:40:39] Was showing up in logstash [21:41:11] minimal traffic from monitoring maybe? [21:41:49] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3764750 (10RobH) Ops audit notes: * user already has shell access, this is expanding that access * user has already sig... [21:42:12] depooled doesnt remove it from icinga.. though in an ideal world it would [21:42:17] i guess [21:42:18] Hmm, good point [21:42:20] Could be it [21:43:30] i dunno [21:43:36] depooled and unmonitored are different states [21:44:08] if its powered on, it should be monitored =] [21:44:21] currently this always gets you a CRIT for "Host mw1191 is not in mediawiki-installation dsh group" [21:44:28] which is then acked again [21:44:43] i agree they can be different states though [21:45:37] mutante: you left town in time to miss the rain and cold [21:45:38] well done. [21:45:55] last weekend it got pretty for 2 days, then back to cold and drizzles [21:46:15] (it got nice in time to sail!) [21:46:27] Well we have a yes/no/inactive for states in etcd. I wonder if we can/do/should differentiate between "no" and "inactive" better? 
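(Editor's aside on the exchange above.) The distinction being drawn is that "depooled" and "unmonitored" are separate questions, while the only per-host state mentioned is the yes/no/inactive value in etcd. A minimal sketch of one way to keep the two decisions apart is shown here. Everything in it is hypothetical: the `Pooled` enum and both helper functions are illustrative names, and the rule that only "inactive" hosts skip monitoring is an assumption taken from the proposal in this conversation, not a description of how conftool, etcd, or Icinga actually behave.

```python
from enum import Enum


class Pooled(Enum):
    """The three pooled values named in the chat (hypothetical model)."""
    YES = "yes"
    NO = "no"
    INACTIVE = "inactive"


def should_serve_traffic(state: Pooled) -> bool:
    # Only a fully pooled host receives user traffic.
    return state is Pooled.YES


def should_monitor(state: Pooled, powered_on: bool) -> bool:
    # Assumption from the discussion: a powered-on host is monitored
    # unless it has been marked "inactive" (out of service entirely).
    return powered_on and state is not Pooled.INACTIVE


if __name__ == "__main__":
    for state in Pooled:
        print(state.value,
              "serve:", should_serve_traffic(state),
              "monitor:", should_monitor(state, powered_on=True))
```

Under this reading, a host like the depooled one discussed above would be "no": it takes no traffic but still gets checked, which is also why monitoring probes could show up in its logs.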
[21:46:30] oh, it's always less bad than late November in central Europe [21:47:04] heh [21:47:15] One being "totally down, don't bother even monitoring" the other being "halfway down, monitor but don't serve traffic" [21:47:35] Basically the difference being "is it online / ssh-able"? [21:48:07] "CAN SOMEONE REACH ME Y/N PLZ CIRCLE 1" [21:48:10] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikimedia-Portals, 10Discovery-Portal-Sprint: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3764876 (10debt) p:05Triage>03Normal Hi, I'm @Jdrewniak's manager for the Portal work that he's been doing and @phue... [21:55:33] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/8801/krypton.eqiad.wmnet/change.krypton.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [21:59:05] (03PS6) 10Dzahn: misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 [22:01:47] (03CR) 10Dzahn: [C: 04-2] misc PHP apps: convert roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [22:04:50] (03PS1) 10Dzahn: installserver: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391704 [22:05:49] !log Kicking off re-enabling puppet and puppet runs across Cloud VPS instances [22:05:51] (03PS1) 10Ottomata: Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) [22:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:09] (03CR) 10Ottomata: "https://gerrit.wikimedia.org/r/#/c/391703/ should be deployed before this." 
[puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata) [22:06:13] (03CR) 10Krinkle: misc PHP apps: convert roles to profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391610 (owner: 10Dzahn) [22:06:22] (03CR) 10jerkins-bot: [V: 04-1] Move statsv varnishkafka and service to use main Kafka cluster(s) [puppet] - 10https://gerrit.wikimedia.org/r/391705 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata) [22:07:13] (03CR) 10Dzahn: [C: 032] "wmf-style: total violations delta -4" [puppet] - 10https://gerrit.wikimedia.org/r/391704 (owner: 10Dzahn) [22:08:54] (03PS1) 10Brian Wolff: Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 [22:09:25] sigh.. bast3002/4001 duplicate declaration ... [22:09:45] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568#3764935 (10Eevans) I found the following keyspaces (some of which are quite high traffic) to be erroneously configured to use leveled compaction.... [22:10:22] (03CR) 10Dzahn: "ok on install1002/2002 - but duplicate declaration issue on bast3002/bast4001 ...." [puppet] - 10https://gerrit.wikimedia.org/r/391704 (owner: 10Dzahn) [22:10:59] !log Setting keyspaces erroneously configured for leveled compaction, to use size-tiered (T180568) [22:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:05] T180568: Aberrant load on instances involved in recent bootstrap - https://phabricator.wikimedia.org/T180568 [22:12:22] (03CR) 10Zoranzoki21: [C: 031] Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff) [22:12:49] (03CR) 10Legoktm: "At this point we can safely remove the entire $wgTemplateSandboxEditNamespaces[] = NS_MODULE; line." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/363531 (owner: 10Legoktm) [22:13:40] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:13:50] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:14:07] that is me and i'm about to upload the next one to fix it [22:15:02] (03PS1) 10Dzahn: bastionhost: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391708 [22:16:48] (03PS14) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [22:19:26] (03PS11) 10Zoranzoki21: Enable the ArticlePlaceholder for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [22:20:10] (03CR) 10Dzahn: "there is the general issue with compiling stuff on bast3002/bast4001 again , but ok on others http://puppet-compiler.wmflabs.org/8803/" [puppet] - 10https://gerrit.wikimedia.org/r/391708 (owner: 10Dzahn) [22:20:17] (03CR) 10Dzahn: [C: 032] bastionhost: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391708 (owner: 10Dzahn) [22:21:26] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.8 [22:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:40] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [22:24:37] (03CR) 10Dzahn: "also works on bast3002/bast4001 after follow-up https://gerrit.wikimedia.org/r/391708" [puppet] - 10https://gerrit.wikimedia.org/r/391704 (owner: 10Dzahn) [22:28:51] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:33:44] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikimedia-Portals, 
10Discovery-Portal-Sprint: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3764982 (10greg) >>! In T180639#3764846, @RobH wrote: > We've also had Release Engineering sign off on all additions to... [22:48:52] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: wikidatawiki back to wmf.7 [22:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:42] (03CR) 10Platonides: [C: 031] Add my pgp key to https://www.mediawiki.org/keys/keys.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391706 (owner: 10Brian Wolff) [22:56:21] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3765029 (10BBlack) 05Open>03Resolved Closing for now, assuming no new problems surface. Thanks @RobH :) [22:57:28] (03PS1) 10Dzahn: gerrit,phabricator: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391715 [22:58:59] (03PS2) 10Dzahn: gerrit,phabricator: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391715 [23:03:11] (03CR) 10Dzahn: [C: 032] gerrit,phabricator: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391715 (owner: 10Dzahn) [23:22:06] (03PS1) 10Dzahn: rm role::security::tools [puppet] - 10https://gerrit.wikimedia.org/r/391719 [23:22:37] (03CR) 10Dzahn: [C: 032] rm role::security::tools [puppet] - 10https://gerrit.wikimedia.org/r/391719 (owner: 10Dzahn) [23:28:57] (03PS1) 10Dzahn: cumin: update aliases for "nonprod" testing roles [puppet] - 10https://gerrit.wikimedia.org/r/391721 [23:29:57] (03CR) 10Volans: [C: 04-1] "typo inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391721 (owner: 10Dzahn) [23:32:22] (03PS1) 10Dzahn: lists,otrs: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391722 [23:33:57] (03PS2) 10Dzahn: cumin: update aliases for "nonprod" testing roles [puppet] - 10https://gerrit.wikimedia.org/r/391721 
[23:34:02] (03CR) 10Dzahn: cumin: update aliases for "nonprod" testing roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391721 (owner: 10Dzahn) [23:34:59] (03CR) 10Volans: [C: 031] "Syntax LGTM (I didn't check the usage of those role)" [puppet] - 10https://gerrit.wikimedia.org/r/391721 (owner: 10Dzahn) [23:38:16] (03PS1) 10Dzahn: etherpad,ganglia,tor_relay: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391723 [23:43:28] (03PS1) 10Dzahn: mirrors,peopleweb,spare: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391727 [23:52:22] (03CR) 10Dzahn: [C: 032] mirrors,peopleweb,spare: use profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/391727 (owner: 10Dzahn) [23:58:48] (03PS1) 10Dzahn: mediawiki:appserver:api: move firewall to role, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391731