[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T0000). [00:00:04] MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:01:29] I'll deploy [00:02:40] MaxSem: ok I was just reviewing the patch, but you've got my +1 to go ahead [00:03:03] (03CR) 1020after4: [C: 03+1] Undeploy Listings from dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479911 (https://phabricator.wikimedia.org/T206102) (owner: 10MaxSem) [00:03:06] (03PS2) 10MaxSem: Undeploy Listings from dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479911 (https://phabricator.wikimedia.org/T206102) [00:04:00] (03CR) 10Volans: "Tested on boron and builds successfully. The CI job fails because it requires the SPICERACK=yes flag that is not (yet) supported." 
[software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/480885 (owner: 10Volans) [00:04:59] (03CR) 10MaxSem: [C: 03+2] Undeploy Listings from dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479911 (https://phabricator.wikimedia.org/T206102) (owner: 10MaxSem) [00:06:05] (03Merged) 10jenkins-bot: Undeploy Listings from dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479911 (https://phabricator.wikimedia.org/T206102) (owner: 10MaxSem) [00:06:20] (03CR) 10jenkins-bot: Undeploy Listings from dewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479911 (https://phabricator.wikimedia.org/T206102) (owner: 10MaxSem) [00:09:41] !log maxsem@deploy1001 Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/479911/ (duration: 00m 53s) [00:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:10] SWAT done [00:28:08] 10Operations, 10Jade, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) [00:31:04] 10Operations, 10Jade, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10Harej) [00:32:25] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10kaldari) @jijiki - Is there a way I can test it, other than locally? [00:49:53] (03CR) 10Dzahn: [C: 03+2] doc: clone integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480879 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [01:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T0100). 
[01:12:54] (03CR) 10Dzahn: [C: 04-1] "please don't change the fileset in bacula, add a new one, that wasn't added for doc1001-only, it already existed before for other hosts wi" [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [01:19:30] no phab deployment today, still working through some conflicts from upstream [01:32:19] PROBLEM - Backup of s4 in codfw on db1115 is CRITICAL: Backup for s4 at codfw taken more than 8 days ago: Most recent backup 2018-12-12 01:10:29 [01:34:58] is there some setting in hiera which tells which cluster is primary cluster now? [01:37:35] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@c7977a7]: Update mobileapps to 42c011e [01:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:38] mutante: thanks :) [01:40:57] legoktm: welcome! i actually read her blog , you can tell her [01:41:09] :D will do [01:41:44] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@c7977a7]: Update mobileapps to 42c011e (duration: 04m 08s) [01:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:51] (03PS1) 10Smalyshev: Add kafka reporting topic to Puppet config [puppet] - 10https://gerrit.wikimedia.org/r/480894 [01:42:49] (03CR) 10jerkins-bot: [V: 04-1] Add kafka reporting topic to Puppet config [puppet] - 10https://gerrit.wikimedia.org/r/480894 (owner: 10Smalyshev) [01:44:41] (03PS2) 10Smalyshev: Add kafka reporting topic to Puppet config [puppet] - 10https://gerrit.wikimedia.org/r/480894 [01:50:52] (03CR) 10Smalyshev: "Puppet compiler check: https://puppet-compiler.wmflabs.org/compiler1002/14022/" [puppet] - 10https://gerrit.wikimedia.org/r/480894 (owner: 10Smalyshev) [01:51:38] (03PS3) 10Smalyshev: Add kafka reporting topic to Puppet config [puppet] - 10https://gerrit.wikimedia.org/r/480894 [02:08:57] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[02:10:01] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [02:43:07] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:11] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [03:00:07] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:19] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.988 second response time [03:12:19] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:32:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 914.87 seconds [04:20:13] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.13 seconds [04:36:27] (03PS1) 10BryanDavis: Remove 'release' qsub label [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480900 (https://phabricator.wikimedia.org/T212390) [04:36:29] (03PS1) 10BryanDavis: Track platform of submit host in service.manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) [04:36:39] (03PS1) 10BryanDavis: Respect 'distribution' from service.manifest [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/480902 (https://phabricator.wikimedia.org/T212390) [05:29:09] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 7.970 second response time [05:32:51] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:56:03] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.748 second response time [05:59:47] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:00:51] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 1.306 second response time 
[06:04:39] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:29] (03CR) 10Marostegui: [C: 03+2] Revert "db2057: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/480712 (owner: 10Marostegui) [06:20:36] (03PS3) 10Marostegui: Revert "db2057: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/480712 [06:21:34] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480911 [06:21:36] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480911 [06:26:53] 10Operations, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10Legoktm) [06:29:55] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:31:03] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:32:39] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:32:44] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Krinkle) [06:32:49] ACKNOWLEDGEMENT - Backup of s4 in codfw on db1115 is CRITICAL: Backup for s4 at codfw taken more than 8 days ago: Most recent backup 2018-12-12 01:10:29 Marostegui re-doing it - The acknowledgement expires at: 2018-12-21 06:32:31. 
[06:32:55] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480911 (owner: 10Marostegui) [06:34:01] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480911 (owner: 10Marostegui) [06:34:16] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480911 (owner: 10Marostegui) [06:34:23] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2057 T212277 (duration: 00m 57s) [06:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:52] T212277: Upgrade db2057 firmware - https://phabricator.wikimedia.org/T212277 [06:36:43] 10Operations, 10ops-codfw, 10DBA: Upgrade db2057 firmware - https://phabricator.wikimedia.org/T212277 (10Marostegui) 05Open→03Resolved Host repooled Thanks @papaul! 
[06:36:49] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) [06:44:20] 10Operations: conftool is failing flake8 - https://phabricator.wikimedia.org/T212397 (10Legoktm) [06:44:47] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:46:19] 10Operations: conftool is failing flake8 - https://phabricator.wikimedia.org/T212397 (10Legoktm) p:05Triage→03High [06:47:36] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:48:00] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) [06:53:10] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:54:19] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Krinkle) This task proposes a significant change to software architecture and should follow the [RFC process](https://www.mediawiki.org/wi... 
[06:56:40] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:58:10] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:00:42] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:04:06] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) [07:08:22] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [07:08:30] this is me --^ [07:08:44] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1038 days) [07:09:20] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) [07:10:16] !log restart rsyslog on lithium - in:imtcp stuck in recvfrom ms-be2047.codfw.wmnet - T199406 [07:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:25] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 [07:11:35] !log restart pdfrender on scb1002 [07:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:46] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [07:16:11] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10elukey) Added downtime up to Jan 31st, icinga was complaining about parsoid not running. Don't have a lot of context but feel free to remove downtime and add... 
[07:17:18] !log Re-start codfw s4 backup as the previous one failed [07:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:02] <_joe_> elukey: let's add a cron restart to that pyle of tech debt today :P [07:20:30] :) [07:43:18] (03PS2) 10Elukey: druid: reserve middlemanager ports from 8200 onward [puppet] - 10https://gerrit.wikimedia.org/r/480733 (https://phabricator.wikimedia.org/T204979) [07:44:25] (03CR) 10Elukey: [C: 03+2] druid: reserve middlemanager ports from 8200 onward [puppet] - 10https://gerrit.wikimedia.org/r/480733 (https://phabricator.wikimedia.org/T204979) (owner: 10Elukey) [07:47:02] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10hashar) [07:47:04] 10Operations, 10ops-codfw: Broken power supply on elastic2026 - https://phabricator.wikimedia.org/T212402 (10MoritzMuehlenhoff) [07:47:21] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic2026 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Power Supply 1 = Critical, Power Supplies = Critical] Muehlenhoff T212402 [07:50:39] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) >>! In T212189#4833536, @daniel wrote: > "We should not introduce a service that is called by MediaWiki, and itself calls Media... 
[07:52:15] 10Operations, 10ops-codfw: Non-redundant power supply on ms-be2048 - https://phabricator.wikimedia.org/T212403 (10MoritzMuehlenhoff) [07:52:41] ACKNOWLEDGEMENT - IPMI Sensor Status on ms-be2048 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Muehlenhoff T212403 [07:53:59] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10hashar) That is due to: 'cryptography>=1.7.0,<2.0.0' I... [08:05:37] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10Gilles) If the files are already on Beta, purge them, make sure your browser cache is cleared for these images, and you'll get thumbnails generated with lib... [08:05:59] !log roll restart of druid middlemanagers on druid* to pick up new port settings [08:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:17] (03CR) 10Hashar: "Oops, sorry!" 
[puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [08:19:08] (03PS2) 10Hashar: doc: relocate from /srv to /srv/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) [08:21:01] (03CR) 10Hashar: "Puppet compilation https://puppet-compiler.wmflabs.org/compiler1002/14024/doc1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [08:29:50] o/ [08:37:26] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] ">I am not completely familiar with the review culture yet but since I am listed as the 'maintainer' I guess I should have felt comfortable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/480484 (owner: 10Alexandros Kosiaris) [08:40:47] (03CR) 10Elukey: Specify allowed ldap groups by site logins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [08:45:21] (03CR) 10Zhuyifei1999: [C: 03+1] "Shall I deploy this?" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480900 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [08:45:51] (03CR) 10Mathew.onipe: Add kafka reporting topic to Puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480894 (owner: 10Smalyshev) [08:47:53] (03CR) 10Zhuyifei1999: Track platform of submit host in service.manifest (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [08:52:32] (03PS3) 10Elukey: Add two new HDFS journalnodes to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/478623 (https://phabricator.wikimedia.org/T209929) [08:58:33] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/479567 (https://phabricator.wikimedia.org/T78705) (owner: 10Dduvall) [09:01:28] 10Operations, 10ops-codfw: Broken power supply on elastic2026 - https://phabricator.wikimedia.org/T212402 (10Mathew.onipe) p:05Triage→03High [09:12:01] (03PS1) 10Gilles: Increase ruwiki navtiming rate + frwiki survey rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480920 (https://phabricator.wikimedia.org/T187299) [09:13:15] Request from 103.46.201.125 via cp5009 cp5009, Varnish XID 1022722332 [09:13:16] Error: 503, Backend fetch failed at Thu, 20 Dec 2018 09:13:04 GMT [09:16:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster={cache_text,cache_upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [09:16:28] hmm eqsin having issues it seems [09:16:44] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [09:17:09] _joe_: ema ^ I 'll depool eqsin [09:17:46] For what is worth, I see no maintenance 
planed on the Ops calendar for eqsin [09:17:51] looks like an already-over blip fwiw [09:17:56] Like, vendor's maintenance [09:18:18] i.e. https://logstash.wikimedia.org/goto/16a71657ea3c0fe44a52895def4398c2 [09:18:27] ah indeed it's starting to recover [09:18:42] the peak was around the point of yannf reporting it [09:18:52] ehm the start that is [09:18:59] the peak was around icinga's reporting [09:19:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [09:20:20] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [09:20:36] 10Operations, 10Patch-For-Review, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10fgiunchedi) [09:22:42] jouncebot: next [09:22:42] In 2 hour(s) and 37 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1200) [09:22:51] (03CR) 10DCausse: cirrus: increase number of shards for enwiki_general (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [09:23:45] !log depooling db1097:3314 for schema change T85757 [09:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:49] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:24:12] (03PS2) 10Banyek: mariadb: depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479642 (https://phabricator.wikimedia.org/T85757) [09:25:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [09:26:19] (03CR) 10Gilles: [C: 03+2] Increase ruwiki navtiming rate + frwiki survey rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480920 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:26:22] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479642 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:27:26] (03Merged) 10jenkins-bot: Increase ruwiki navtiming rate + frwiki survey rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480920 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:27:29] (03Merged) 10jenkins-bot: mariadb: depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479642 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:27:41] (03CR) 10jenkins-bot: Increase ruwiki navtiming rate + frwiki survey rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480920 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:28:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [09:29:17] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1097:3314 for schema change - T85757 (duration: 00m 52s) [09:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:20] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:30:22] banyek: seems like you deployed my change along with yours, it's all good though [09:30:57] I synced the ` wmf-config/db-eqiad.php ` file only [09:31:01] ah ok [09:31:04] I need to sync the rest then [09:31:20] are you done with what you needed to do? 
I can wait [09:31:36] you can proceed [09:31:50] I am doing a schema change, and when it finishes I'll repool the host [09:31:51] but [09:32:03] the more important question is for me, how this happened? [09:32:38] I mean when I deploy puppet it tells me when I run the puppet-merge that there's an another patch, but now I didn't seen anything like this [09:33:16] when you do $ git fetch; git log HEAD...origin/master it should show all the commits that you're pulling in before you rebase [09:33:42] yes, and I saw 2 commits in git log just after merging my change, then running it again no commit [09:34:01] 👍 I'll check this next time [09:34:46] 10Operations, 10Patch-For-Review, 10User-Marostegui, 10User-fgiunchedi: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10fgiunchedi) I went through the list of current misc hosts and looked for "obvious" candidates to be in their own cluster. This was driven by (either of) two fact... [09:35:35] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T187299 Increase ruwiki navtiming rate + frwiki survey rate (duration: 00m 52s) [09:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:44] gilles: whenever you're done I'd also like to deploy something [09:35:53] legoktm: I'm done [09:38:58] ty [09:39:33] I'm gonna revert my depooling patch [09:39:47] banyek: I'm in the middle of syncing atm, it'll be another minute or two [09:39:52] !log repooling db1097:3314 after schema change T85757 [09:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:55] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:40:06] !log legoktm@deploy1001 Synchronized php-1.33.0-wmf.9/includes/: T199540 (duration: 01m 14s) [09:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:19] (03PS1) 10Banyek: Revert "mariadb: depool db1097:3314" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/480924 [09:41:19] !log legoktm@deploy1001 Synchronized php-1.33.0-wmf.8/includes/: T199540 (duration: 01m 06s) [09:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:31] banyek: ok, done now [09:41:45] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480924 (owner: 10Banyek) [09:41:54] I'll merge then [09:42:49] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480924 (owner: 10Banyek) [09:45:15] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1097:3314 after schema change - T85757 (duration: 00m 51s) [09:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:18] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:46:39] !log depooling db1103:3314 for schema change T85757 [09:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:47] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479644 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:47:51] (03Merged) 10jenkins-bot: mariadb: depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479644 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:49:41] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1103:3314 for schema change - T85757 (duration: 00m 51s) [09:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:16] !log repooling db1103:3314 after schema change T85757 [09:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:19] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:56:24] (03PS1) 10Banyek: Revert "mariadb: depool db1103:3314" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/480925 [09:57:40] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480925 (owner: 10Banyek) [09:58:44] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480925 (owner: 10Banyek) [10:00:16] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1103:3314 after schema change - T85757 (duration: 00m 52s) [10:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:05] (03CR) 10Marostegui: [C: 03+1] mariadb: depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479631 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:03:41] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10MoritzMuehlenhoff) This is a bug in python-cryptography which... 
[10:03:47] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:05:20] (03PS2) 10Banyek: mariadb: depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479631 (https://phabricator.wikimedia.org/T85757) [10:05:33] !log depooling db1081 for schema change T85757 [10:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:37] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:08:06] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) >>! In T203786#4815815, @aaron wrote: > I'm not sure why t... 
[10:08:14] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479631 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:09:22] (03Merged) 10jenkins-bot: mariadb: depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479631 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:10:54] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1081 for schema change - T85757 (duration: 00m 51s) [10:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:57] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:12:48] (03PS1) 10Elukey: Revert "mcrouter: temporary remove mc2033 to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/480929 [10:13:02] (03PS2) 10Elukey: Revert "mcrouter: temporary remove mc2033 to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/480929 [10:13:10] jijiki: --^ [10:15:31] (03PS1) 10Banyek: Revert "mariadb: depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480931 [10:15:35] (03CR) 10Effie Mouzeli: [C: 03+1] Revert "mcrouter: temporary remove mc2033 to ease network maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/480929 (owner: 10Elukey) [10:15:57] elukey: I'll merge it later :) [10:15:58] !log repooling db1081 after schema change T85757 [10:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:02] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:16:07] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash::collector: pull logs from both kafka-logging clusters [puppet] - 10https://gerrit.wikimedia.org/r/480787 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [10:16:52] (03PS1) 10Alexandros Kosiaris: releases: Don't cache /charts/ [puppet] - 10https://gerrit.wikimedia.org/r/480932 [10:17:07] (03CR) 10Banyek: [C: 
03+2] Revert "mariadb: depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480931 (owner: 10Banyek) [10:17:31] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [10:18:13] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480931 (owner: 10Banyek) [10:18:50] (03CR) 10Filippo Giunchedi: "Would be nice to have some context/motivation on this change in the commit message, what's the current limit, etc" [puppet] - 10https://gerrit.wikimedia.org/r/480793 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [10:19:09] !log draining restbase1012 for eventual reboot for kernel security update [10:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:17] (03PS15) 10Elukey: Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 [10:19:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] releases: Don't cache /charts/ [puppet] - 10https://gerrit.wikimedia.org/r/480932 (owner: 10Alexandros Kosiaris) [10:19:31] akosiaris: no depool happened, right? 
[10:19:32] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [10:19:51] ema: no it wasn't required after all [10:20:00] it did seem like transient and one off [10:20:05] cool, ty [10:20:11] I do keep an eye on varnish errors ofc [10:20:27] (03CR) 10Muehlenhoff: [C: 03+1] Add remaining kerberos wrapped commands [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480433 (owner: 10Elukey) [10:20:35] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1081 after schema change - T85757 (duration: 00m 51s) [10:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:37] (03PS1) 10Elukey: Update cdh submodule to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/480933 [10:22:20] !log executing schema change on db1102 - T85757 [10:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:25] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:23:12] 10Operations: mw1230 sdb "Raw_Read_Error_Rate" SMART - https://phabricator.wikimedia.org/T194036 (10fgiunchedi) [10:24:04] 10Operations, 10DNS, 10Traffic: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460 (10fgiunchedi) a:05fgiunchedi→03None [10:26:25] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14025/" [puppet] - 10https://gerrit.wikimedia.org/r/480933 (owner: 10Elukey) [10:29:06] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1072 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [10:30:17] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10hashar) The CI job does not use the Debian package python-cry... 
[10:31:32] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [10:33:07] !log executing schema change on dbstore1002 - T85757 [10:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:11] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:35:37] !log depooling db1121 for schema change T85757 [10:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:57] (03PS4) 10Banyek: mariadb: depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) [10:37:37] (03CR) 10Banyek: [C: 03+2] mariadb: depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:38:42] (03Merged) 10jenkins-bot: mariadb: depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:40:41] (03PS3) 10Mathew.onipe: cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [10:40:44] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1121 for schema change - T85757 (duration: 00m 52s) [10:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:47] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:41:27] (03CR) 10jerkins-bot: [V: 04-1] cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [10:43:59] (03CR) 10Mathew.onipe: cirrus: increase number of shards for enwiki_general (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 
(https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [10:44:47] !log stopping replication on db1121 - T85757 [10:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:03] !log draining restbase1013 for eventual reboot for kernel security update [10:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:13] (03CR) 10Vgutierrez: [C: 03+1] Ensure that depool threshold is being honored on new/updated configs [debs/pybal] - 10https://gerrit.wikimedia.org/r/443967 (https://phabricator.wikimedia.org/T184715) (owner: 10Vgutierrez) [10:45:22] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480938 (https://phabricator.wikimedia.org/T202497) [10:46:36] (03PS4) 10Mathew.onipe: cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [10:47:23] (03CR) 10jerkins-bot: [V: 04-1] cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [10:48:45] (03CR) 10Legoktm: [C: 04-1] Track platform of submit host in service.manifest (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [10:53:53] !log repooling db1121 after schema change T85757 [10:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:57] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:54:01] (03PS1) 10Banyek: Revert "mariadb: depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480940 [10:55:36] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480940 (owner: 10Banyek) [10:56:45] (03Merged) 10jenkins-bot: Revert 
"mariadb: depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480940 (owner: 10Banyek) [10:58:14] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1121 after schema change - T85757 (duration: 00m 52s) [10:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:58] (03CR) 10Hashar: "This change follow up discussions we had during the 2017 hackathon with Erik Bernhardson and David Causse. The aim is to build a proof of " [puppet] - 10https://gerrit.wikimedia.org/r/479567 (https://phabricator.wikimedia.org/T78705) (owner: 10Dduvall) [10:59:34] !log executing schema change on db1068 (s4 master) - T85757 [10:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:37] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:01:48] (03PS1) 10Elukey: spark: replace hdfs exec with cdh::exec [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480941 [11:03:18] (03PS2) 10Elukey: spark: replace hdfs exec with cdh::exec [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480941 [11:05:46] (03CR) 10Hashar: [C: 03+1] When running scripts from staging, use the CommonSettings.php from staging [puppet] - 10https://gerrit.wikimedia.org/r/480695 (owner: 10Tim Starling) [11:05:55] (03PS3) 10Elukey: spark: replace hdfs exec with cdh::exec [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480941 [11:09:17] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14027/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480941 (owner: 10Elukey) [11:12:15] (03PS1) 10Elukey: profile::hadoop::mysql_password: move to cdh::exec [puppet] - 10https://gerrit.wikimedia.org/r/480942 [11:21:02] !log draining restbase1014 for eventual reboot for kernel security update [11:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:33] (03PS2) 10Fsero: Adding registries VMs. 
[puppet] - 10https://gerrit.wikimedia.org/r/480800 (https://phabricator.wikimedia.org/T212212) [11:22:19] (03PS1) 10Filippo Giunchedi: logstash: output webrequest 5xx metrics [puppet] - 10https://gerrit.wikimedia.org/r/480943 (https://phabricator.wikimedia.org/T205870) [11:25:17] (03PS1) 10Arturo Borrero Gonzalez: openstack: introduce basic support for running Mitaka on Stretch on virt nodes [puppet] - 10https://gerrit.wikimedia.org/r/480944 (https://phabricator.wikimedia.org/T212302) [11:25:52] (03CR) 10jerkins-bot: [V: 04-1] openstack: introduce basic support for running Mitaka on Stretch on virt nodes [puppet] - 10https://gerrit.wikimedia.org/r/480944 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [11:26:18] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/480800 (https://phabricator.wikimedia.org/T212212) (owner: 10Fsero) [11:26:40] (03CR) 10Fsero: [C: 03+2] Adding registries VMs. [puppet] - 10https://gerrit.wikimedia.org/r/480800 (https://phabricator.wikimedia.org/T212212) (owner: 10Fsero) [11:29:30] (03PS2) 10Arturo Borrero Gonzalez: openstack: introduce basic support for running Mitaka on Stretch on virt nodes [puppet] - 10https://gerrit.wikimedia.org/r/480944 (https://phabricator.wikimedia.org/T212302) [11:33:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Not breaking compilation apparently :-)" [puppet] - 10https://gerrit.wikimedia.org/r/480944 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [11:34:47] !log powercycling restbase1014, similar EFI ASSEER error to T212305 [11:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:51] T212305: restbase1011 fails to boot, ASSERT error lines - https://phabricator.wikimedia.org/T212305 [11:34:55] !log powercycling restbase1014, similar EFI ASSERT error to T212305 [11:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:12] RECOVERY - Backup 
of s4 in codfw on db1115 is OK: Backup for s4 at codfw taken less than 8 days ago and larger than 10 GB: Last one 2018-12-20 06:30:40 from dbstore2002.codfw.wmnet:3314 (104 GB) [11:39:41] banyek: ^ :-) [11:39:54] Finished already [11:40:24] (03PS2) 10Arturo Borrero Gonzalez: Adding dhcpd/netboot.cfg entries cloudvirt1025-30 [puppet] - 10https://gerrit.wikimedia.org/r/480812 (https://phabricator.wikimedia.org/T209616) (owner: 10Cmjohnson) [11:44:43] (03CR) 10Vgutierrez: [C: 03+1] Call _updateServerMetrics from _serverInitDone [debs/pybal] - 10https://gerrit.wikimedia.org/r/477794 (owner: 10Mark Bergsma) [11:44:54] yay [11:45:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Adding dhcpd/netboot.cfg entries cloudvirt1025-30 [puppet] - 10https://gerrit.wikimedia.org/r/480812 (https://phabricator.wikimedia.org/T209616) (owner: 10Cmjohnson) [11:46:02] !log draining restbase1015 for eventual reboot for kernel security update [11:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:20] (03PS1) 10Arturo Borrero Gonzalez: site.pp: add role for cloudvirt1030 [puppet] - 10https://gerrit.wikimedia.org/r/480947 (https://phabricator.wikimedia.org/T209616) [11:54:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480941 (owner: 10Elukey) [11:57:19] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10mobrovac) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1200). [12:00:04] jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:51] o/ [12:01:07] Well looks like I’m the only one on the schedule [12:01:12] jan_drewniak: looks like your patch is the only one [12:01:15] :) [12:01:20] go ahead [12:01:22] swat is yours [12:01:40] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Core Platform Team Backlog (Watching / External), and 4 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10mobrovac) [12:02:09] !log draining restbase1016 for eventual reboot for kernel security update [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:16] 10Operations, 10DNS, 10Traffic, 10Core Platform Team Backlog (Watching / External), 10Services (watching): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818 (10mobrovac) [12:02:40] 10Operations, 10Collection, 10OfflineContentGenerator, 10Core Platform Team Backlog (Watching / External), and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872 (10mobrovac) [12:02:43] 10Operations, 10Traffic, 10Core Platform Team Backlog (Watching / External), 10Services (watching), and 2 others: Figure out an etcd deploy strategy that includes multi DC failure scenarios. - https://phabricator.wikimedia.org/T98165 (10mobrovac) [12:03:15] 10Operations, 10Mobile-Content-Service, 10Parsing-Team, 10Reading-Infrastructure-Team-Backlog, and 5 others: Create functional cluster checks for all services (and have them page!) 
- https://phabricator.wikimedia.org/T134551 (10mobrovac) [12:04:00] 10Operations, 10Mathoid, 10Core Platform Team Backlog (Watching / External), 10SCB, 10Services (watching): remove mathoid from scb - https://phabricator.wikimedia.org/T200832 (10mobrovac) [12:04:10] 10Operations, 10Core Platform Team Backlog (Watching / External), 10SCB, 10Services (watching): Page allocation stalls on scb1001, scb1002 - https://phabricator.wikimedia.org/T191199 (10mobrovac) [12:04:20] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2 - Objective 2: Set up a continuous integration and deployment pipeline - https://phabricator.wikimedia.org/T170481 (10mobrovac) [12:04:22] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2: Developers are able to develop and test their applications through a unified pipeline towards production ... 
- https://phabricator.wikimedia.org/T170480 [12:04:26] 10Operations, 10Release-Engineering-Team, 10Category, 10Core Platform Team Backlog (Watching / External), and 2 others: FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453 (10mobrovac) [12:04:36] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), 10Services (watching): eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524 (10mobrovac) [12:04:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmnet: introduce new cloudvirt10XX.eqiad.wmnet FQDNs (25-30) [dns] - 10https://gerrit.wikimedia.org/r/480949 (https://phabricator.wikimedia.org/T209616) (owner: 10Arturo Borrero Gonzalez) [12:04:47] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Release-Engineering-Team (Watching / External), 10Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782 (10mobrovac) [12:04:51] 10Operations, 10Collection, 10OfflineContentGenerator, 10Core Platform Team Backlog (Watching / External), and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917 (10mobrovac) [12:05:27] 10Operations, 10Cassandra, 10Core Platform Team Backlog (Watching / External), 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10mobrovac) [12:07:57] 10Operations, 10RESTBase, 10Core Platform Team Backlog (Later), 10Services (later): Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237 (10mobrovac) [12:08:12] (03PS6) 10Mathew.onipe: cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [12:08:14] !log jdrewniak@deploy1001 
Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:478659| Bumping portals to master (T128546, T202497)]] (duration: 00m 53s) [12:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:18] T202497: Add fundraising appeal on Wikipedia portal page - https://phabricator.wikimedia.org/T202497 [12:08:19] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [12:09:07] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:478659| Bumping portals to master (T128546, T202497)]] (duration: 00m 52s) [12:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` ['cloudvirt1030.eqiad... [12:10:27] alrighty, that's a wrap. 
[12:11:10] (03PS7) 10Mathew.onipe: cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) [12:16:32] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10mobrovac) [12:16:38] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Core Platform Team Backlog (Later), and 4 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659 (10mobrovac) [12:16:56] 10Operations, 10Cassandra, 10Core Platform Team Backlog (Later), 10Patch-For-Review, and 2 others: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10mobrovac) [12:17:26] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Later), 10Security, 10Services (next): Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813 (10mobrovac) [12:17:36] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10mobrovac) [12:17:54] 10Operations, 10HyperSwitch, 10RESTBase-API, 10Traffic, and 2 others: Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/ - https://phabricator.wikimedia.org/T167972 (10mobrovac) [12:18:26] 10Operations, 10ops-eqiad: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10MoritzMuehlenhoff) [12:20:56] ACKNOWLEDGEMENT - Host restbase1016 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T212418 [12:27:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1030.eqiad.wmnet'] ` and were **ALL** successful. [12:31:54] !log installing fuse updates from stretch point release [12:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:11] (03PS2) 10Arturo Borrero Gonzalez: site.pp: add role for cloudvirt1030 [puppet] - 10https://gerrit.wikimedia.org/r/480947 (https://phabricator.wikimedia.org/T209616) [12:38:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] site.pp: add role for cloudvirt1030 [puppet] - 10https://gerrit.wikimedia.org/r/480947 (https://phabricator.wikimedia.org/T209616) (owner: 10Arturo Borrero Gonzalez) [12:43:00] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) @kaldari there is a thumbor debug log on `deployment-imagescaler03` under `/var/log` which was generated as I was testing https://commons.wikimedia.... [12:43:07] 10Operations, 10Proton, 10Core Platform Team Kanban (Doing), 10Services (doing): Requests to MW 404 when on HTTPS - https://phabricator.wikimedia.org/T202982 (10mobrovac) [12:43:17] (03PS1) 10Arturo Borrero Gonzalez: hiera: introduce key instance_dev for cloudvirt1030 [puppet] - 10https://gerrit.wikimedia.org/r/480953 (https://phabricator.wikimedia.org/T209616) [12:43:33] !log kartik@deploy1001 Started deploy [cxserver/deploy@16f65cb]: Update cxserver to 803baa4 (T210581, T211889, T144467, T209473) [12:43:33] 10Operations, 10Core Platform Team Kanban (Doing), 10Services (doing), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10mobrovac) [12:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:41] T210581: Use rate-limiter for cxserver - https://phabricator.wikimedia.org/T210581 [12:43:42] T209473: CX2: Communicate template exists in the target wiki but mapping could not be completed - 
https://phabricator.wikimedia.org/T209473 [12:43:42] T144467: Security review for Google MT for Content Translation - https://phabricator.wikimedia.org/T144467 [12:43:42] T211889: cxserver: TypeError: Cannot read property '_options' of null - https://phabricator.wikimedia.org/T211889 [12:43:44] 10Operations, 10Cassandra, 10Core Platform Team Kanban (Doing), 10Patch-For-Review, and 2 others: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10mobrovac) [12:44:06] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Core Platform Team Kanban (Doing), and 4 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456 (10mobrovac) [12:44:17] 10Operations, 10Proton, 10Core Platform Team Kanban (Doing), 10Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862 (10mobrovac) [12:44:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: introduce key instance_dev for cloudvirt1030 [puppet] - 10https://gerrit.wikimedia.org/r/480953 (https://phabricator.wikimedia.org/T209616) (owner: 10Arturo Borrero Gonzalez) [12:45:29] 10Operations, 10Electron-PDFs, 10Core Platform Team Kanban (Blocked Externally), 10Readers-Web-Backlog (Tracking), 10Services (blocked): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 (10mobrovac) [12:45:49] 10Operations, 10HyperSwitch, 10RESTBase-API, 10Traffic, and 2 others: Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/ - https://phabricator.wikimedia.org/T167972 (10BBlack) I thought this was in a different ticket somewhere at one point, but in any case I just noticed it during someone... 
[12:46:50] 10Operations, 10CX-cxserver, 10Citoid, 10RESTBase, and 5 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (10mobrovac) [12:47:17] 10Operations, 10Cassandra, 10Core Platform Team Kanban (Blocked Externally), 10Patch-For-Review, and 2 others: Setup automated topk wide row reporting - https://phabricator.wikimedia.org/T147366 (10mobrovac) [12:47:29] 10Operations, 10Cassandra, 10Core Platform Team Kanban (Blocked Externally), 10Services (blocked), 10User-Eevans: puppetize turning off reserved space for cassandra /srv - https://phabricator.wikimedia.org/T132632 (10mobrovac) [12:47:36] 10Operations, 10Core Platform Team Kanban (Blocked Externally), 10Services (blocked): Set warning thresholds for average cluster utilization - https://phabricator.wikimedia.org/T76306 (10mobrovac) [12:47:44] !log installing libxcursor security updates [12:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: add more stretch/mitaka support [puppet] - 10https://gerrit.wikimedia.org/r/480954 (https://phabricator.wikimedia.org/T212302) [12:48:15] !log kartik@deploy1001 Finished deploy [cxserver/deploy@16f65cb]: Update cxserver to 803baa4 (T210581, T211889, T144467, T209473) (duration: 04m 42s) [12:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10aborrero) [12:49:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10aborrero) [12:50:03] (03PS2) 10Arturo Borrero Gonzalez: openstack: add more stretch/mitaka support [puppet] - 10https://gerrit.wikimedia.org/r/480954 
(https://phabricator.wikimedia.org/T212302) [12:50:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: add more stretch/mitaka support [puppet] - 10https://gerrit.wikimedia.org/r/480954 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [12:52:04] 10Operations, 10HyperSwitch, 10RESTBase-API, 10Traffic, and 2 others: Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/ - https://phabricator.wikimedia.org/T167972 (10mobrovac) >>! In T167972#4837630, @BBlack wrote: > I thought this was in a different ticket somewhere at one point, but i... [12:53:06] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329 (10mobrovac) [12:53:25] !log T209616 installing cloudvirt1030, icinga downtime for 1 day [12:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:28] T209616: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 [12:55:04] 10Operations, 10Analytics, 10ChangeProp, 10MediaWiki-JobQueue, and 3 others: Consider the possibility of separating ChangeProp and JobQueue on Kafka level - https://phabricator.wikimedia.org/T199431 (10mobrovac) [12:55:25] 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 3 others: RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10mobrovac) [12:55:48] 10Operations, 10TechCom-RFC, 10Traffic, 10Core Platform Team Backlog (Designing), 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906 (10mobrovac) [12:55:55] 10Operations, 10Multimedia, 10RESTBase-API, 10Reading-Admin, and 4 others: Thumb API: Varnish / CDN questions - https://phabricator.wikimedia.org/T150673 (10mobrovac) [12:56:30] 10Operations, 10TechCom-RFC, 10Traffic, 
10Wikipedia-Android-App-Backlog, and 3 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (10mobrovac) [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1300) [13:01:27] 10Operations, 10TechCom, 10Core Platform Team Backlog (Attic), 10Services (attic), 10User-mobrovac: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825 (10mobrovac) [13:01:40] 10Operations, 10Core Platform Team Backlog (Attic), 10Security, 10Services (attic): Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240 (10mobrovac) [13:01:43] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Backlog (Attic), 10Services (attic): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431 (10mobrovac) [13:04:48] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/479567 (https://phabricator.wikimedia.org/T78705) (owner: 10Dduvall) [13:07:28] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) @Milimetric wrote: > In the simplest case, this code would be almost identical client and server-side. No matter where it's runnin... 
[13:07:52] (03PS1) 10Hashar: profile: point to real modules for specs [puppet] - 10https://gerrit.wikimedia.org/r/480957 [13:09:52] !log installing xapian-core updates from stretch point release [13:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:33] (03CR) 10Giuseppe Lavagetto: role::beta: introduce docker_services (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [13:12:40] (03PS1) 10Muehlenhoff: Add library hint for xapian [puppet] - 10https://gerrit.wikimedia.org/r/480958 [13:18:38] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10mobrovac) >>! In T212189#4837768, @daniel wrote: > But for the case at hand, there might be a workaround: the PHP code that renders the (W... [13:25:37] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for xapian [puppet] - 10https://gerrit.wikimedia.org/r/480958 (owner: 10Muehlenhoff) [13:30:11] 10Operations, 10RESTBase-Cassandra, 10Services: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10fgiunchedi) [13:33:56] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) >>! In T212189#4837780, @mobrovac wrote: > When rendering the page, `index.php` knows the exact data that needs to be rendered alr... [13:39:13] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10mobrovac) >>! In T212189#4837824, @daniel wrote: > That works, but defies the purpose. The idea is to present a default rendering to clien... 
[13:52:59] 10Operations, 10ops-codfw: Non-redundant power supply on ms-be2048 - https://phabricator.wikimedia.org/T212403 (10Papaul) p:05Triage→03Normal [13:55:20] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) > When rendering the page, index.php knows the exact data that needs to be rendered already, correct? I just had a brief chat wi... [13:57:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10aborrero) Summary of what I did today: * added production FQDNs to all new servers * tried imaging `cloudvirt1030.eqiad.wmnet`... [13:58:09] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10daniel) > You first need to render the page on the server before you know whether the client supports JS/SW or not, so it will need to be... [13:59:45] (03CR) 10Elukey: [C: 03+2] spark: replace hdfs exec with cdh::exec [puppet/cdh] - 10https://gerrit.wikimedia.org/r/480941 (owner: 10Elukey) [14:00:04] zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1400). 
[14:02:06] (03PS2) 10Elukey: profile::hadoop::mysql_password: move to cdh::exec [puppet] - 10https://gerrit.wikimedia.org/r/480942 [14:02:08] (03PS1) 10Elukey: Update the cdh module to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/480960 [14:07:49] (03CR) 10Elukey: [C: 03+2] profile::hadoop::mysql_password: move to cdh::exec [puppet] - 10https://gerrit.wikimedia.org/r/480942 (owner: 10Elukey) [14:08:04] (03CR) 10Elukey: [C: 03+2] Update the cdh module to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/480960 (owner: 10Elukey) [14:11:32] thanks jouncebot, the train is leaving the station, all aboard! [14:13:18] * Hauskatze blows a whistle and raises a green flag - departure authorized zeljkof on platform group1 :P [14:13:30] :) [14:15:39] (03Restored) 10Rush: Revert "stat: add exfat for temporary narrow and approved workflow" [puppet] - 10https://gerrit.wikimedia.org/r/480756 (owner: 10Rush) [14:15:46] (03PS2) 10Rush: Revert "stat: add exfat for temporary narrow and approved workflow" [puppet] - 10https://gerrit.wikimedia.org/r/480756 [14:16:14] (03CR) 10Rush: "decided to pursue this course after all with a brief convo with mortiz and comments on https://gerrit.wikimedia.org/r/c/operations/puppet/" [puppet] - 10https://gerrit.wikimedia.org/r/480756 (owner: 10Rush) [14:16:32] (03Abandoned) 10Rush: stat: absent the temp exfat packages [puppet] - 10https://gerrit.wikimedia.org/r/480759 (https://phabricator.wikimedia.org/T211327) (owner: 10Rush) [14:18:09] (03CR) 10Rush: [C: 03+2] Revert "stat: add exfat for temporary narrow and approved workflow" [puppet] - 10https://gerrit.wikimedia.org/r/480756 (owner: 10Rush) [14:22:24] !log rebooting netmon1002 for kernel security update [14:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:02] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Some regressions in production with Zotero translation-server in production - 
https://phabricator.wikimedia.org/T211114 (10Mvolz) [14:26:46] !log rearmed keyholder on netmon1002 after reboot [14:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:26] (03PS1) 10Rush: labstore: remove temp exfat setup [puppet] - 10https://gerrit.wikimedia.org/r/480962 (https://phabricator.wikimedia.org/T211327) [14:28:47] (03CR) 10Elukey: [C: 03+2] Add two new HDFS journalnodes to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/478623 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [14:28:57] (03PS4) 10Elukey: Add two new HDFS journalnodes to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/478623 (https://phabricator.wikimedia.org/T209929) [14:29:15] (03CR) 10Rush: [C: 03+2] labstore: remove temp exfat setup [puppet] - 10https://gerrit.wikimedia.org/r/480962 (https://phabricator.wikimedia.org/T211327) (owner: 10Rush) [14:29:47] addshore, Amir1: I'm not saying you are to blame ;) but do you know anything about T212427? [14:29:48] T212427: No namespace configured for entity type `form` - https://phabricator.wikimedia.org/T212427 [14:30:07] hmmmmmm [14:30:14] !log installing libdap updates from stretch point release [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:19] it's blocking the train, just noticed it [14:30:27] I'll dig a bit in changelog [14:30:28] (03PS5) 10Elukey: Add two new HDFS journalnodes to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/478623 (https://phabricator.wikimedia.org/T209929) [14:30:37] zeljkof: ill have time to look in a little bit [14:30:39] wanted to ask first if you are already aware of it [14:30:42] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add two new HDFS journalnodes to the Analytics Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/478623 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [14:30:44] addshore: thanks! 
[14:39:57] !log add two journal nodes to the Analytics Hadoop cluster - T209929 [14:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:00] T209929: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 [14:41:30] !log restarted etherpad for nodejs security updates [14:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:44] (03PS3) 10Giuseppe Lavagetto: role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 [14:43:41] (03CR) 10jenkins-bot: [V: 04-1] role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 (owner: 10Giuseppe Lavagetto) [14:43:53] addshore James_F I have some information about T212427 (could not find other people mentioned in gerrit in IRC :/) [14:43:53] T212427: No namespace configured for entity type `form` - https://phabricator.wikimedia.org/T212427 [14:44:06] (info in the task) [14:44:33] thanks! [14:45:28] not much, but I think I've found a commit that causes it [14:45:43] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: cergen CI fails to run on Debian Stretch because cryptography dependency cannot be built against newer openssl version - https://phabricator.wikimedia.org/T212395 (10Ottomata) Could the CI job use the debian package? I guess not? [14:53:12] !log installing nodejs security updates on maps* (was tested via T211419) [14:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:15] T211419: Test maps stack with new nodejs security update - https://phabricator.wikimedia.org/T211419 [14:56:20] zeljkof: tarrow is our incident manager this week :) [14:56:31] he is just grabbing a coffe or something [14:58:54] addshore, tarrow: thanks for the quick reply! 
since time to holidays is short, I need an answer if I can move the train forward with this error in logs, or if something needs to be done [15:00:09] (03CR) 10Giuseppe Lavagetto: profile::mediawiki::php::monitoring: fine-grained opcache invalidation (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/480714 (https://phabricator.wikimedia.org/T211964) (owner: 10Giuseppe Lavagetto) [15:05:01] zeljkof: I think probably we probably want to revert the changes that made it show up but I'm just checking with people [15:05:06] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) Plain nginx config has the ability to remove the headers, but it can't do so conditionally... [15:05:13] (03PS1) 10Elukey: Remove two journal nodes from the Analytics Hadoop config [puppet] - 10https://gerrit.wikimedia.org/r/480965 [15:06:19] tarrow: thanks! please do let me know as soon as you know more [15:07:32] (03PS2) 10Elukey: Remove two journal nodes from the Analytics Hadoop config [puppet] - 10https://gerrit.wikimedia.org/r/480965 (https://phabricator.wikimedia.org/T209929) [15:12:32] 10Operations, 10ops-codfw: Non-redundant power supply on ms-be2048 - https://phabricator.wikimedia.org/T212403 (10Papaul) 05Open→03Resolved Loose power cable. System is back up. [15:13:00] zeljkof: the consensus is revert 479419 [15:15:24] tarrow: ok, is anybody doing the revert? [15:15:43] let me know when backport to wmf.9 is ready, I can merge and deploy it, and move the train forward [15:16:04] I'll do it/find someone to do it [15:16:10] then I'll ping you [15:16:19] tarrow: thanks! 
[15:18:50] RECOVERY - IPMI Sensor Status on ms-be2048 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:20:28] 10Operations, 10HyperSwitch, 10RESTBase-API, 10Traffic, and 2 others: Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/ - https://phabricator.wikimedia.org/T167972 (10BBlack) >>! In T167972#4837684, @mobrovac wrote: >>>! In T167972#4837630, @BBlack wrote: >> I thought this was in a diffe... [15:23:31] 10Operations, 10ops-codfw: Broken power supply on elastic2026 - https://phabricator.wikimedia.org/T212402 (10Papaul) 05Open→03Resolved Power cable got loose as well may be when working on asw-b8-codfw on Tuesday. System is back up. [15:24:37] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) @BBlack would you miss x-cache, x-cache-status and x-varnish if those were completely removed at the TLS termination level? Some o... [15:25:06] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 (10Papaul) [15:29:27] 10Operations: Allow the deployment of users to a host without their ssh key via the admin module - https://phabricator.wikimedia.org/T212429 (10elukey) p:05Triage→03Normal [15:30:33] 10Operations, 10ops-codfw: Broken power supply on elastic2026 - https://phabricator.wikimedia.org/T212402 (10Mathew.onipe) Thanks @Papaul! 
[15:32:22] (03PS1) 10Gilles: Hide debugging header [puppet] - 10https://gerrit.wikimedia.org/r/480977 (https://phabricator.wikimedia.org/T210484) [15:32:43] (03PS2) 10Gilles: Hide debugging headers [puppet] - 10https://gerrit.wikimedia.org/r/480977 (https://phabricator.wikimedia.org/T210484) [15:34:26] RECOVERY - IPMI Sensor Status on elastic2026 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [15:38:33] 10Operations, 10Analytics, 10Performance-Team, 10Traffic, 10Patch-For-Review: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10BBlack) I don't know off-hand if we can live without them all for manual debugging and such, or if nginx is the best... [15:39:37] 10Operations, 10Analytics, 10Performance-Team, 10Traffic, 10Patch-For-Review: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) Is there an nginx "site" or config specific to varnish termination? [15:42:05] (03CR) 10Vgutierrez: Expand Coordinator.resultUp behavior on first monitor check result (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/478203 (owner: 10Mark Bergsma) [15:43:12] 10Operations, 10Analytics, 10Performance-Team, 10Traffic, 10Patch-For-Review: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) Could be a puppet variable too, to make the filtering block conditional. 
[15:46:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:51:52] zeljkof: tarrow has made the reverts [15:51:55] im happy to =2 [15:51:56] +2 [15:52:02] jouncebot now [15:52:02] For the next 0 hour(s) and 7 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1400) [15:52:05] addshore: please do [15:52:12] 10Operations, 10Analytics, 10Performance-Team, 10Traffic, 10Patch-For-Review: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10BBlack) `localssl.erb` would probably be more appropriate and is the site file, but it's a generic TLS reverse proxy... [15:52:12] i guess the train will continue in the US slot? [15:52:34] should I +2 the branch one too zeljkof, or leave it for the drivier of the train later? [15:52:44] * addshore doesnt want to leave something undeployed on the branch [15:52:50] addshore: can you deploy the revert? [15:53:00] there is still an hour until puppet swat [15:53:12] and I think train has precedence over it, if we need more time [15:53:35] I can deploy the revert if you prefer [15:53:46] addshore: a good one for me to do? Call and talk me through it again? 
[15:54:02] tarrow: yes you could do it :) [15:54:09] I just +2ed, guess we have 30 mins to wait for CI [15:54:14] zeljkof: we can backport it once it is merged [15:54:30] tarrow: even better, the docs are up to date and pretty good (disclaimer: I did a lot of editing of the docs ;) [15:54:43] tarrow: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [15:54:44] :) awesome [15:55:20] tarrow: there are general instructions, when they are forking, follow "mediawiki/extensions and mediawiki/skins" [15:55:31] addshore: feck, lifeboat [15:55:39] I'm around, just ping me if you have _any_ questions [15:55:41] can you do it [15:56:16] tarrow: me or addshore? you can't deploy? [15:59:11] hahaa, zeljkof one of us :) as he has just run out of the house on a call :P [15:59:26] I can do it once merged :) [16:00:03] addshore: great, thanks! please let me know when it's deployed, so I can move the train forward [16:00:07] will do [16:00:12] thanks! [16:03:14] 10Operations, 10Analytics, 10Performance-Team, 10Traffic, 10Patch-For-Review: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) I wasn't aware that the latest plan was to use ATS for TLS termination. There might be a way to do this in... [16:06:09] (03CR) 10BryanDavis: Track platform of submit host in service.manifest (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [16:06:52] (03CR) 10BryanDavis: Track platform of submit host in service.manifest (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [16:08:52] (03CR) 10BryanDavis: "> Shall I deploy this?" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480900 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [16:14:36] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Addshore) [16:14:45] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Addshore) There sure has been a fair amount of discussion on this ticket! So I have created an updated interacting diagram showing off a... [16:14:50] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480900 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [16:20:00] zeljkof: once backported the task should remain open and become a blocker of the next branch / train, per https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/480966/ [16:21:09] addshore: I love running the last train of the year ;) [16:21:15] hehe [16:22:14] oh zeljkof i have a meeting in 8 mins [16:22:25] zeljkof: and it isn't merged yet, so you might have to backport it / actually do the deploy [16:22:54] addshore: ok, no problemo, I'll deploy it, thanks for letting me know [16:28:58] addshore, zeljkof sorry about that. I'm back again. What's occurring? [16:29:07] tarrow: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/480966/ [16:29:20] well, it's complicated :D [16:29:54] did the backport get merged? [16:30:02] yes, but not deployed... 
[16:30:04] (03CR) 10Elukey: [C: 03+2] Remove two journal nodes from the Analytics Hadoop config [puppet] - 10https://gerrit.wikimedia.org/r/480965 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [16:31:02] !log remove two journal nodes from the Analytics hadoop cluster - T209929 [16:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:06] T209929: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 [16:31:14] tarrow: well, the backport is ready, but the commit is master has -2 from Daniel, so I'm confused [16:31:40] I can deploy the wmf.9 backport but I've never done it when the commit was not already in master... [16:32:08] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Milimetric) >>! In T212189#4838090, @Addshore wrote: >>>! In T212189#4835359, @Milimetric wrote: >> Now, I started looking through the cod... [16:33:18] 10Operations, 10RESTBase-Cassandra, 10Services: restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10Eevans) Not long back, we were alarmed to see a very high rate of range-slice requests (a type of query our app does not perform). I wasn't... [16:35:37] zeljkof: let me know if you want me to do anything. I guess deploying a backport that isn't on master is probably a releng decision but I'm happy to help either way [16:36:29] tarrow: I've talked with #releng, I'll deploy the backport and make sure we don't forget about it [16:37:04] thank you! 
[16:38:29] (03PS1) 10Andrew Bogott: cloudvirts: install cloudvirt1025 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/480989 (https://phabricator.wikimedia.org/T209616) [16:39:21] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: install cloudvirt1025 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/480989 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [16:39:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloudvirts: install cloudvirt1025 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/480989 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [16:41:16] (03CR) 10DCausse: [C: 03+1] cirrus: increase number of shards for enwiki_general [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [16:46:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10RobH) a:05RobH→03Andrew [16:47:23] (03CR) 10DCausse: [C: 03+1] "fine by me" [puppet] - 10https://gerrit.wikimedia.org/r/479567 (https://phabricator.wikimedia.org/T78705) (owner: 10Dduvall) [16:51:55] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.9/extensions/Wikibase: SWAT: [[gerrit:480978|Revert "Fail hard if an entity namespace is not configured." 
(T212427)]] (duration: 01m 17s) [16:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:59] T212427: No namespace configured for entity type `form` - https://phabricator.wikimedia.org/T212427 [16:56:00] (03PS1) 10Zfilipin: all wikis to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480994 [16:56:03] (03CR) 10Zfilipin: [C: 03+2] all wikis to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480994 (owner: 10Zfilipin) [16:56:15] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Jakob_WMDE) >>! In T212189#4838123, @Milimetric wrote: > My question here was more, how **can** the client render everything it needs, whe... [16:57:12] (03CR) 10Ayounsi: [C: 03+2] Assign public /29 for cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/479337 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [16:57:18] (03PS4) 10Ayounsi: Assign public /29 for cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/479337 (https://phabricator.wikimedia.org/T207663) [16:57:37] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480994 (owner: 10Zfilipin) [16:58:38] PROBLEM - Hadoop JournalNode on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode [16:58:44] PROBLEM - Hadoop JournalNode on analytics1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode [16:59:41] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.9 [16:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:40] hadoop alerts are due to me, working on them [17:02:02] !log configure additional 208.80.155.88/29 IPs on cloud-instance-transport1-b-eqiad - T207663 [17:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:05] T207663: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 [17:05:52] !log add 208.80.155.88/29 to cloud-in4 term icmp - T207663 [17:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:33] (03PS1) 10Andrew Bogott: new cloudvirts: add initial hiera config [puppet] - 10https://gerrit.wikimedia.org/r/480996 (https://phabricator.wikimedia.org/T209616) [17:06:35] (03PS1) 10Andrew Bogott: Make cloudvirt1025 a nova compute node [puppet] - 10https://gerrit.wikimedia.org/r/480997 (https://phabricator.wikimedia.org/T209616) [17:06:38] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) `lang=shell-session root@cloudcontrol1004:~# neutron subnet-create --gateway 208.80.155.89 --name cloud-instances-tra... 
[17:07:32] (03CR) 10Andrew Bogott: [C: 03+2] new cloudvirts: add initial hiera config [puppet] - 10https://gerrit.wikimedia.org/r/480996 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [17:07:44] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudvirt1025 a nova compute node [puppet] - 10https://gerrit.wikimedia.org/r/480997 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [17:09:48] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) `lang=shell-session root@cloudcontrol1004:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transpo... [17:22:26] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission of restbase200[1-6] (lease return in December 2018) - https://phabricator.wikimedia.org/T211070 (10Papaul) [17:24:27] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @daniel or anyone else from TechCom want to help close this out? Code review is on hold pending DBA review, so I we sho... 
[17:25:23] (03PS2) 10Bstorm: Track platform of submit host in service.manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [17:26:55] (03PS1) 10Andrew Bogott: cloudvirt1025: use eth3 rather than (default) eth1 for VM communication [puppet] - 10https://gerrit.wikimedia.org/r/481001 (https://phabricator.wikimedia.org/T209616) [17:27:40] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1025: use eth3 rather than (default) eth1 for VM communication [puppet] - 10https://gerrit.wikimedia.org/r/481001 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [17:27:56] (03CR) 10Bstorm: Track platform of submit host in service.manifest (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [17:31:30] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Marostegui) I cannot really provide more feedback on the code itself apart from what I commented about the queries. I'm not expe... [17:37:01] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) Moving to the RFC inbox, so TechCom will look at it during the next meeting. Since DBA have approved the plan, TechCom w... 
[17:39:51] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: swith cloudvirt1030 to openstack newton [puppet] - 10https://gerrit.wikimedia.org/r/481005 (https://phabricator.wikimedia.org/T212302) [17:40:28] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) [17:40:40] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) 05Open→03Resolved All was fine. Thanks @ayounsi . Closing task. [17:41:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvps: swith cloudvirt1030 to openstack newton [puppet] - 10https://gerrit.wikimedia.org/r/481005 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [17:46:47] 10Operations, 10TechCom-RFC, 10Wikidata, 10Wikidata-Termbox-Hike, and 5 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Milimetric) Thanks @Jakob_WMDE, I think we're saying the same thing in slightly different terms, and it's because I'm not being precise.... [17:46:53] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4838242, @Marostegui wrote: > I cannot really provide more feedback on the code itself apart from what I... [17:47:16] (03CR) 10Dzahn: [C: 03+2] "i don't particularly love the non-standard paths and naming but i don't want to bikeshed either and i get the reason to do it.. 
so going a" [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [17:47:34] (03PS3) 10Dzahn: doc: relocate from /srv to /srv/docroot [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [17:49:04] (03PS1) 10Arturo Borrero Gonzalez: openstack: introduce nova templates for newton [puppet] - 10https://gerrit.wikimedia.org/r/481006 (https://phabricator.wikimedia.org/T212302) [17:49:21] !log updating puppet compiler facts: `PUPPET_COMPILER=compiler1001.puppet-diffs.eqiad.wmflabs modules/puppet_compiler/files/compiler-update-facts` [17:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:10] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10ayounsi) [17:50:14] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) 05Resolved→03Open a:05aborrero→03ayounsi Keeping it open for the cleanup part after the break. [17:53:00] !log updating puppet compiler facts: `PUPPET_COMPILER=compiler1002.puppet-diffs.eqiad.wmflabs modules/puppet_compiler/files/compiler-update-facts` [17:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:19] 10Operations, 10Jade, 10TechCom, 10Core Platform Team Backlog (Watching / External), and 4 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10awight) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1800). 
[18:04:26] (03PS2) 10BBlack: authdns-local-update: use check-deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/480872 [18:04:28] (03PS2) 10BBlack: authdns::scripts: no more python-jinja2 [puppet] - 10https://gerrit.wikimedia.org/r/480873 [18:04:30] (03PS13) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [18:04:32] (03PS3) 10BBlack: deploy-check.py replaces check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/480870 [18:04:34] (03PS3) 10BBlack: Remove authdns-gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/480871 [18:07:59] (03CR) 10Dzahn: [C: 03+2] "done, also moved the files around from the old dirs and then cleaned up afterwards" [puppet] - 10https://gerrit.wikimedia.org/r/480881 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [18:08:54] (03PS1) 10Andrew Bogott: specify eth3 for neutron for cloudvirt1025 [puppet] - 10https://gerrit.wikimedia.org/r/481009 (https://phabricator.wikimedia.org/T209616) [18:09:35] (03CR) 10Andrew Bogott: [C: 03+2] specify eth3 for neutron for cloudvirt1025 [puppet] - 10https://gerrit.wikimedia.org/r/481009 (https://phabricator.wikimedia.org/T209616) (owner: 10Andrew Bogott) [18:11:02] (03PS1) 10Elukey: role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G [puppet] - 10https://gerrit.wikimedia.org/r/481011 (https://phabricator.wikimedia.org/T209929) [18:12:01] elukey: seeing your name and hadoop... btw.. i hope you like this now https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474832/ i updated again after that long break [18:12:46] mutante: ah yes sorry I didn't have time to check/merge, busy day :( [18:13:06] no worries at all.. 
it was me who didnt reply until just the other day [18:13:17] thanks for the patience :) [18:14:01] (03PS14) 10BBlack: New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 [18:14:03] (03PS4) 10BBlack: deploy-check.py replaces check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/480870 [18:14:05] (03PS4) 10BBlack: Remove authdns-gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/480871 [18:17:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Andrew) [18:18:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Andrew) cloudvirt1025 is working properly. The others are stuck in limbo while Arturo and I figure out what to do about stretch... [18:20:40] (03PS5) 10BBlack: deploy-check.py replaces check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/480870 [18:20:42] (03PS5) 10BBlack: Remove authdns-gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/480871 [18:24:38] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10herron) [18:24:42] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi, 10User-herron: cronspam from elasticsearch-curator on stretch - https://phabricator.wikimedia.org/T211859 (10herron) 05Open→03Resolved [18:25:07] (03CR) 10Ottomata: [C: 03+1] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G [puppet] - 10https://gerrit.wikimedia.org/r/481011 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [18:25:28] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G [puppet] - 10https://gerrit.wikimedia.org/r/481011 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) 
[18:28:15] (03CR) 10BBlack: [C: 03+2] New zone generator gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/479892 (owner: 10BBlack) [18:28:37] (03CR) 10BBlack: [C: 03+2] deploy-check.py replaces check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/480870 (owner: 10BBlack) [18:29:16] 10Operations: swift-recon-cron - cffi library '_openssl' has no function, constant or global variable named 'sk_X509_NAME_ENTRY_value' - https://phabricator.wikimedia.org/T212439 (10GTirloni) [18:29:29] !log restart hdfs namenode on an-master1001 with new heap settings (currently standby, 8->12G) [18:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:02] !log remove hdfs journalnode config+packages from analytics10(28|35) - not used anymore - T209929 [18:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:04] T209929: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 [18:34:43] (03CR) 10Smalyshev: Add kafka reporting topic to Puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480894 (owner: 10Smalyshev) [18:38:40] (03PS3) 10ArielGlenn: use lbzip2 in wikidata rdf weeklies [puppet] - 10https://gerrit.wikimedia.org/r/480140 (https://phabricator.wikimedia.org/T206535) [18:39:06] (03CR) 10Mathew.onipe: [C: 03+1] Add kafka reporting topic to Puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480894 (owner: 10Smalyshev) [18:40:44] (03CR) 10ArielGlenn: [C: 03+2] use lbzip2 in wikidata rdf weeklies [puppet] - 10https://gerrit.wikimedia.org/r/480140 (https://phabricator.wikimedia.org/T206535) (owner: 10ArielGlenn) [18:42:50] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [18:44:21] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Dzahn) The host started sending cron spam about an hour ago. 
They were all from File "/usr/bin/swift-recon-cron", in which " AttributeError: cffi library '_openssl' has no function,... [18:45:23] (03CR) 10Elukey: [C: 04-1] hadoop::ui: migrate from apache to httpd module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474832 (owner: 10Dzahn) [18:45:32] mutante: ---^ :( [18:46:05] anybody working on ms-be2047? [18:46:30] elukey: yes, we were talking about it in other channels [18:46:36] ah yes sorry [18:46:39] just seen it :) [18:46:43] elukey: it's broken and shut down now [18:46:58] elukey: so i removed the ldap::passwords class by accident? ouch, good catch [18:48:11] i clicked "host and all services" i could swear.. i did all services except the host though [18:48:38] yea, my bad. hence that remaining alert for the host itself but no others [18:51:30] (03PS1) 10Papaul: DNS: Remove mgmt DNS entries for elastic2001 - elastic2024 [dns] - 10https://gerrit.wikimedia.org/r/481017 (https://phabricator.wikimedia.org/T211023) [18:52:15] mutante: nono no problem, I was just asking to be sure :) [18:53:02] (03CR) 10jenkins-bot: mariadb: depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479642 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [18:53:04] (03CR) 10jenkins-bot: Revert "mariadb: depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480924 (owner: 10Banyek) [18:53:06] (03CR) 10jenkins-bot: mariadb: depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479644 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [18:53:09] (03CR) 10jenkins-bot: Revert "mariadb: depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480925 (owner: 10Banyek) [18:53:11] (03CR) 10jenkins-bot: mariadb: depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479631 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [18:53:13] (03CR) 10jenkins-bot: Revert "mariadb: depool db1081" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/480931 (owner: 10Banyek) [18:53:15] (03CR) 10jenkins-bot: mariadb: depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [18:53:17] (03CR) 10jenkins-bot: Revert "mariadb: depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480940 (owner: 10Banyek) [18:53:21] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480938 (https://phabricator.wikimedia.org/T202497) (owner: 10Jdrewniak) [18:53:25] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480994 (owner: 10Zfilipin) [18:55:38] (03PS7) 10Dzahn: hadoop::ui: migrate from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/474832 [18:55:45] elukey: i did not mean to remove that or mix it with the httpd part, leaving it as it is now^ [18:56:13] (03CR) 10Dzahn: hadoop::ui: migrate from apache to httpd module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474832 (owner: 10Dzahn) [18:57:44] (03PS8) 10Elukey: hadoop::ui: migrate from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/474832 (owner: 10Dzahn) [18:58:09] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/474832 (owner: 10Dzahn) [18:58:19] * mutante triggers the puppet compiler that way [18:59:06] https://puppet-compiler.wmflabs.org/compiler1002/14038/analytics-tool1001.eqiad.wmnet/ looks good [18:59:16] (03CR) 10Elukey: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/14038/analytics-tool1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/474832 (owner: 10Dzahn) [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T1900). [19:00:04] addshore and kaldari: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:15] here [19:03:10] !log restart hdfs namenode on an-master1002 with new heap settings (currently standby, 8->12G) [19:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:59] if no one else is doing the SWAT, I guess I can [19:09:04] (03PS2) 10Kaldari: Adding NOINDEX template to $wgPageTriageNoIndexTemplates for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480884 (https://phabricator.wikimedia.org/T211043) [19:10:12] elukey: thank you:) doing [19:10:22] (03CR) 10Dzahn: [C: 03+2] hadoop::ui: migrate from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/474832 (owner: 10Dzahn) [19:10:58] (03CR) 10Kaldari: [C: 03+2] WikibaseClient: Enable Lua function usage tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479407 (https://phabricator.wikimedia.org/T191416) (owner: 10Hoo man) [19:11:33] (03PS1) 10Andrew Bogott: cloudvirt1025: enable alerting [puppet] - 10https://gerrit.wikimedia.org/r/481022 [19:11:35] (03CR) 10Kaldari: [C: 03+2] Adding NOINDEX template to $wgPageTriageNoIndexTemplates for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480884 (https://phabricator.wikimedia.org/T211043) (owner: 10Kaldari) [19:11:54] (03PS2) 10Andrew Bogott: cloudvirt1025: enable alerting [puppet] - 10https://gerrit.wikimedia.org/r/481022 [19:13:06] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1025: enable alerting [puppet] - 10https://gerrit.wikimedia.org/r/481022 (owner: 10Andrew Bogott) [19:13:18] (03Merged) 10jenkins-bot: Adding NOINDEX template to $wgPageTriageNoIndexTemplates for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480884 
(https://phabricator.wikimedia.org/T211043) (owner: 10Kaldari) [19:13:32] elukey: applied on analytics-tool1001. ferm rules adjusted and that was fine. just seeing one issue with mod_conf for "xml2end" [19:14:42] elukey: it's trying to configure xml2end but Module xml2end does not exist [19:15:07] seems like my change should not affect this thing, puppet run still finishes [19:15:14] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[ensure_present_mod_xml2end] [19:16:16] elukey: duh, typo .. "end" -> "enc" fixing [19:17:06] (03PS2) 10Kaldari: WikibaseClient: Enable Lua function usage tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479407 (https://phabricator.wikimedia.org/T191416) (owner: 10Hoo man) [19:17:50] (03PS1) 10Dzahn: hadoop::httpd: fix typo in name of xml2enc module [puppet] - 10https://gerrit.wikimedia.org/r/481024 [19:18:11] (03PS2) 10Dzahn: hadoop::httpd: fix typo in name of xml2enc module [puppet] - 10https://gerrit.wikimedia.org/r/481024 [19:18:33] (03CR) 10Dzahn: [C: 03+2] hadoop::httpd: fix typo in name of xml2enc module [puppet] - 10https://gerrit.wikimedia.org/r/481024 (owner: 10Dzahn) [19:19:28] (03CR) 10jenkins-bot: Adding NOINDEX template to $wgPageTriageNoIndexTemplates for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480884 (https://phabricator.wikimedia.org/T211043) (owner: 10Kaldari) [19:20:14] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:20:29] elukey: all done, puppet happy, no downtime [19:27:46] !log kaldari@deploy1001 Synchronized wmf-config/InitialiseSettings.php: syncing InitialiseSettings for SWAT deployment (duration: 00m 46s) [19:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:13] !log kaldari@deploy1001 Synchronized wmf-config/Wikibase.php: syncing 
Wikibase for SWAT deployment (duration: 00m 45s) [19:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:59] (03CR) 10jenkins-bot: WikibaseClient: Enable Lua function usage tracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479407 (https://phabricator.wikimedia.org/T191416) (owner: 10Hoo man) [19:33:14] 10Operations, 10Puppet: Allow the deployment of users to a host without their ssh key via the admin module - https://phabricator.wikimedia.org/T212429 (10Peachey88) [19:33:44] addshore: Your wikibase/Lua changes are live now if you'd like to test. [19:34:59] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10CDanis) I pointed this out on IRC but putting it on the ticket because I can't help myself: While there's no OpenSSL symbol `sk_H509_NAME]ENTRY_value`, there is a `sk_X509_NAME_ENTRY_va... [19:35:50] (03CR) 10Dzahn: "skimming the ticket i see a few check boxes unchecked. like adding to google sheets or switch ports. adding Robh to confirm if this can go" [dns] - 10https://gerrit.wikimedia.org/r/481017 (https://phabricator.wikimedia.org/T211023) (owner: 10Papaul) [19:37:11] SWAT is finished [19:40:13] arturo: i was wondering about the "kvm processes running" checks on cloudvirt1019/1030. are those unrelated to the RAID degraded issue / is that known maintenance / should the checks be fixed? [19:41:53] ah, i found the task that is about installing them and it's in progress. nevermind then i guess. just upcoming [19:43:42] 10Operations, 10Analytics, 10Performance-Team, 10Traffic, 10Patch-For-Review: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Krinkle) I'm unfamiliar with the complexity needed in VCL to make this work, but if at all feasible, I think we shou... 
[19:46:11] (03CR) 10RobH: "The mgmt DNS on systems should not be removed while the system in racked and has the drac interface connected (since it still has the ip i" [dns] - 10https://gerrit.wikimedia.org/r/481017 (https://phabricator.wikimedia.org/T211023) (owner: 10Papaul) [19:50:25] (03CR) 10Dzahn: [C: 03+2] DNS: Remove mgmt DNS entries for elastic2001 - elastic2024 [dns] - 10https://gerrit.wikimedia.org/r/481017 (https://phabricator.wikimedia.org/T211023) (owner: 10Papaul) [19:50:30] (03PS2) 10Dzahn: DNS: Remove mgmt DNS entries for elastic2001 - elastic2024 [dns] - 10https://gerrit.wikimedia.org/r/481017 (https://phabricator.wikimedia.org/T211023) (owner: 10Papaul) [19:51:26] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) the tech came today to swap pc1007 system board and the new (refurbed) board is bad again. This will require another call into Dell and will not be fixed until after the holiday break. [19:52:52] (03PS1) 10Andrew Bogott: nova: add cloudvirt1025 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/481032 [19:52:52] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Dell just called me. They will be shipping a new system and will arrive by the first week on January. 
[19:53:29] (03CR) 10Dzahn: [C: 03+2] "Papaul confirmed he did all the switch ports and per IRC" [dns] - 10https://gerrit.wikimedia.org/r/481017 (https://phabricator.wikimedia.org/T211023) (owner: 10Papaul) [19:54:13] (03CR) 10Andrew Bogott: [C: 03+2] nova: add cloudvirt1025 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/481032 (owner: 10Andrew Bogott) [20:00:04] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 133.1 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181220T2000) [20:02:55] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10kaldari) @jijiki - The first test seems to be hugely improved: On beta cluster: {F27689226} On Commons: {F27689222} It is frustrating though that the kernin... [20:03:46] (03PS4) 10Dzahn: puppetmaster: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/451821 [20:06:38] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work), 10Patch-For-Review: Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) [20:08:11] (03PS1) 10BBlack: Add nsid_ascii testing to mock config-options [dns] - 10https://gerrit.wikimedia.org/r/481033 [20:09:29] (03CR) 10BBlack: [C: 03+2] Add nsid_ascii testing to mock config-options [dns] - 10https://gerrit.wikimedia.org/r/481033 (owner: 10BBlack) [20:23:17] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) Terrible! Thanks for the heads up Chris! 
[20:27:23] (03CR) 10Nuria: "I have to say that having min and max be same looks unfamiliar but I JVM heap settings remain a mystery to me, seems really counter intui" [puppet] - 10https://gerrit.wikimedia.org/r/481011 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [20:32:26] (03PS3) 10BBlack: authdns-local-update: use check-deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/480872 [20:32:28] (03PS3) 10BBlack: authdns::scripts: no more python-jinja2 [puppet] - 10https://gerrit.wikimedia.org/r/480873 [20:37:00] (03PS1) 10BBlack: Deployment Testing: update config [dns] - 10https://gerrit.wikimedia.org/r/481035 [20:37:02] (03PS1) 10BBlack: Deployment Testing: update zone data [dns] - 10https://gerrit.wikimedia.org/r/481036 [20:37:04] (03PS1) 10BBlack: Deployment Testing: update admin_state [dns] - 10https://gerrit.wikimedia.org/r/481037 [20:37:06] (03PS1) 10BBlack: Deployment Testing: update utils/README [dns] - 10https://gerrit.wikimedia.org/r/481038 [20:37:52] (03CR) 10BBlack: [C: 03+2] authdns-local-update: use check-deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/480872 (owner: 10BBlack) [20:40:38] (03CR) 10BBlack: [C: 03+2] Deployment Testing: update config [dns] - 10https://gerrit.wikimedia.org/r/481035 (owner: 10BBlack) [20:41:59] (03CR) 10BBlack: [C: 03+2] Deployment Testing: update zone data [dns] - 10https://gerrit.wikimedia.org/r/481036 (owner: 10BBlack) [20:42:54] (03CR) 10BBlack: [C: 03+2] Deployment Testing: update admin_state [dns] - 10https://gerrit.wikimedia.org/r/481037 (owner: 10BBlack) [20:43:55] (03CR) 10BBlack: [C: 03+2] Deployment Testing: update utils/README [dns] - 10https://gerrit.wikimedia.org/r/481038 (owner: 10BBlack) [20:46:16] (03PS1) 10BBlack: Revert "Deployment Testing: update utils/README" [dns] - 10https://gerrit.wikimedia.org/r/481042 [20:46:18] (03PS1) 10BBlack: Revert "Deployment Testing: update admin_state" [dns] - 10https://gerrit.wikimedia.org/r/481043 [20:46:20] (03PS1) 10BBlack: Revert "Deployment 
Testing: update zone data" [dns] - 10https://gerrit.wikimedia.org/r/481044 [20:46:22] (03PS1) 10BBlack: Revert "Deployment Testing: update config" [dns] - 10https://gerrit.wikimedia.org/r/481045 [20:46:52] (03CR) 10BBlack: [C: 03+2] Revert "Deployment Testing: update utils/README" [dns] - 10https://gerrit.wikimedia.org/r/481042 (owner: 10BBlack) [20:46:59] (03CR) 10BBlack: [C: 03+2] Revert "Deployment Testing: update admin_state" [dns] - 10https://gerrit.wikimedia.org/r/481043 (owner: 10BBlack) [20:47:05] (03CR) 10BBlack: [C: 03+2] Revert "Deployment Testing: update zone data" [dns] - 10https://gerrit.wikimedia.org/r/481044 (owner: 10BBlack) [20:47:11] (03CR) 10BBlack: [C: 03+2] Revert "Deployment Testing: update config" [dns] - 10https://gerrit.wikimedia.org/r/481045 (owner: 10BBlack) [20:50:42] 10Operations, 10DNS, 10Operations-Software-Development, 10Traffic, 10Patch-For-Review: DNS repo: add CI checks for obvious configuration errors - https://phabricator.wikimedia.org/T182028 (10BBlack) 05Open→03Resolved We've done all this and gone way past it at this point. We might tag some future im... [20:51:42] Hi can someone run the mwscript? [20:52:56] 10Operations, 10DNS, 10Traffic: AuthDNS CM/CI refactor - https://phabricator.wikimedia.org/T161148 (10BBlack) 05Open→03Resolved a:03BBlack Resolving this, as recent work has fixed a lot of it (other than discovery issues specifically), and at this point all the text above is woefully outdated and point... [20:53:29] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051 (10BBlack) 05Open→03Resolved a:03BBlack This is fixed now, no workarounds should be needed. [20:55:53] There is need to run mwscript for https://phabricator.wikimedia.org/T212100 [20:56:02] (03CR) 10Elukey: [C: 03+2] "yes agreed, in theory there is no need. 
We adopted it as general guideline to avoid two things:" [puppet] - 10https://gerrit.wikimedia.org/r/481011 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [20:58:52] (03PS1) 10MR70: Fix some config settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) [21:00:18] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10BBlack) Update for the record: with recent changes to authdns CI and deployment scripts, this scenario should no longer be possible and workarounds shouldn't be n... [21:03:09] (03PS2) 10MR70: Fix some config settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) [21:04:10] (03CR) 10jerkins-bot: [V: 04-1] Fix some config settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) (owner: 10MR70) [21:06:20] (03PS3) 10MR70: Fix some config settings in cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481046 (https://phabricator.wikimedia.org/T212315) [21:47:58] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:48:50] (03CR) 10BryanDavis: Track platform of submit host in service.manifest (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [21:51:23] (03CR) 10BryanDavis: "I suppose at this point this patch is stuck in limbo until January?" 
[puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) (owner: 10BryanDavis) [21:52:47] (03CR) 10Legoktm: Track platform of submit host in service.manifest (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [21:54:35] (03PS3) 10Bstorm: Track platform of submit host in service.manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [22:00:12] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 109.7 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen [22:02:40] (03PS4) 10BryanDavis: Track platform of submit host in service.manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) [22:03:26] (03CR) 10BryanDavis: "> Shouldn't we update the changelog?" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480900 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [22:06:00] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.372 second response time [22:06:28] (03PS2) 10BryanDavis: Respect 'distribution' from service.manifest [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/480902 (https://phabricator.wikimedia.org/T212390) [22:08:44] (03PS1) 10Alexandros Kosiaris: docker::registry: Add Cache-Control header [puppet] - 10https://gerrit.wikimedia.org/r/481093 (https://phabricator.wikimedia.org/T211719) [22:09:48] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:11:16] (03PS2) 10BryanDavis: toolforge: Redirect GET & HEAD to https [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) [22:11:38] (03CR) 10BryanDavis: "> Uploaded patch set 2." 
[puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [22:13:04] (03CR) 10Bstorm: [C: 03+2] Respect 'distribution' from service.manifest [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/480902 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [22:13:26] (03CR) 10Bstorm: [C: 03+2] Track platform of submit host in service.manifest [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480901 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [22:13:44] (03CR) 10Bstorm: [C: 03+2] Remove 'release' qsub label [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/480900 (https://phabricator.wikimedia.org/T212390) (owner: 10BryanDavis) [22:15:00] (03Abandoned) 10BryanDavis: toolforge: Forward security@tools.wmflabs.org to security@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/456280 (https://phabricator.wikimedia.org/T182812) (owner: 10BryanDavis) [22:17:23] (03Abandoned) 10BryanDavis: GSoC task for Neha Jha [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/420238 (https://phabricator.wikimedia.org/T189974) (owner: 10Nehajha) [22:17:31] (03Abandoned) 10BryanDavis: My understanding of the webservice script [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/421798 (owner: 10Kevin py) [22:17:41] (03Abandoned) 10BryanDavis: my understanding of webservice with start and stop actions [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/421799 (owner: 10Kevin py) [22:19:28] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.039 second response time [22:20:09] (03Abandoned) 10BryanDavis: Read command line arguments from a config file [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/435691 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [22:23:10] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:49] 
10Operations, 10Mail, 10Toolforge, 10Patch-For-Review, 10Security: Forward security@tools.wmflabs.org to security@wikimedia.org - https://phabricator.wikimedia.org/T182812 (10Legoktm) We could have security@tools.wmflabs.org go to the Toolforge admins (I think by including tools.admin in the security too... [22:24:20] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.763 second response time [22:30:08] (03PS2) 10Alexandros Kosiaris: docker::registry: Add Cache-Control header to avoid caching [puppet] - 10https://gerrit.wikimedia.org/r/481093 (https://phabricator.wikimedia.org/T211719) [22:30:28] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:45] !log phab1001 / phabricator: installing php5 package upgrades [22:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:18] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [22:46:33] !log phab1001 / phabricator: upgraded nodejs package [22:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:17] (03CR) 10BBlack: [C: 03+1] docker::registry: Add Cache-Control header to avoid caching [puppet] - 10https://gerrit.wikimedia.org/r/481093 (https://phabricator.wikimedia.org/T211719) (owner: 10Alexandros Kosiaris) [22:54:32] (03PS1) 10BryanDavis: Merge remote-tracking branch 'origin/stretch' into master [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/481100 [22:54:57] (03CR) 10jerkins-bot: [V: 04-1] Merge remote-tracking branch 'origin/stretch' into master [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/481100 (owner: 10BryanDavis) [23:01:02] (03PS3) 10Alexandros Kosiaris: docker::registry: Add Cache-Control header to avoid caching [puppet] - 10https://gerrit.wikimedia.org/r/481093 (https://phabricator.wikimedia.org/T211719) [23:06:50] (03CR) 10Andrew Bogott: [C: 03+1] "Let's wait until we're all back 
from holiday to merge this" [puppet] - 10https://gerrit.wikimedia.org/r/432935 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [23:11:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] docker::registry: Add Cache-Control header to avoid caching [puppet] - 10https://gerrit.wikimedia.org/r/481093 (https://phabricator.wikimedia.org/T211719) (owner: 10Alexandros Kosiaris) [23:14:28] (03PS2) 10BryanDavis: Merge remote-tracking branch 'origin/stretch' into master [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/481100 [23:17:09] 10Operations, 10serviceops, 10Patch-For-Review: docker-registry.wikimedia.org caches images missing instead of revalidating - https://phabricator.wikimedia.org/T211719 (10akosiaris) 05Open→03Resolved a:03akosiaris With the merge of the above, this is probably resolved for now. Note that newly pushed im... [23:21:15] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) all these cool little updates before year end, nice! [23:27:59] (03CR) 10GTirloni: ircecho: Drop sysvinit support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480789 (owner: 10Paladox) [23:28:28] (03CR) 10Paladox: ircecho: Drop sysvinit support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480789 (owner: 10Paladox) [23:33:23] 10Operations, 10WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38 (10Aklapper) Cannot see all subtasks but I guess this task could be closed as resolved now? 
[23:33:44] 10Puppet: Suspicious Comments in Puppet Scripts - https://phabricator.wikimedia.org/T201576 (10Aklapper) a:05Akondrahman→03None [23:40:06] (03CR) 10BryanDavis: [V: 03+1 C: 03+2] "package built and tested on a toolsbeta Trusty host via cherry-pick" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/481100 (owner: 10BryanDavis) [23:40:43] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/stretch' into master [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/481100 (owner: 10BryanDavis) [23:57:46] (03PS1) 10Urbanecm: Give all users (including IPs) the pagequality right in plwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481107 (https://phabricator.wikimedia.org/T212478)