[00:41:01] PROBLEM - Host scb2006 is DOWN: PING CRITICAL - Packet loss = 100% [00:42:29] RECOVERY - Host scb2006 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [00:48:39] (03PS12) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [00:50:21] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [01:00:03] (03PS13) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [02:22:05] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32937024 and 1 seconds [02:24:39] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3496 and 30 seconds [03:13:25] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:16:45] (03PS14) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [03:44:15] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [03:52:07] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10Aklapper) >>! In T219589#5072592, @Yann wrote: > Hi, There should be a clear way... [04:14:27] PROBLEM - puppet last run on an-worker1094 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:19:30] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:26:52] (03PS15) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [04:27:44] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:32:07] (03CR) 10Andrew Bogott: "> What's the deal with those hosts under 'Hosts that fail to compile" [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [04:32:58] (03PS16) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [04:33:24] (03CR) 10Andrew Bogott: "This is reasonable but I'd prefer to merge things like this dead last, after the final Trusty VMs have been deleted." [puppet] - 10https://gerrit.wikimedia.org/r/499933 (owner: 10Muehlenhoff) [04:36:29] (03CR) 10Andrew Bogott: "I'd like to see longer explanations about what these scripts do, either in code comments or in the command-line usage output. Seems fine " [puppet] - 10https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: 10Alex Monk) [04:39:23] (03PS17) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [04:39:50] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:45:20] RECOVERY - puppet last run on an-worker1094 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [04:48:40] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:01:17] (03PS18) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [05:02:27] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [05:04:10] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500372 [05:04:12] (03PS19) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [05:06:08] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500372 (owner: 10Marostegui) [05:06:10] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:06:33] (03PS1) 10Marostegui: mariadb: Remove labsdb1004,labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/500373 (https://phabricator.wikimedia.org/T216749) [05:07:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500372 (owner: 10Marostegui) [05:07:32] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500372 (owner: 10Marostegui) [05:08:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 (duration: 00m 53s) [05:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:39] !log Deploy schema change on db1077, this will generate lag on s3 on labs 
[05:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:41] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [05:11:10] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) root@db2070:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337FADD0) Port Name: 1I Port Name: 2I Gen8 ServBP 12+2 at Port 1I,... [05:12:06] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2070 is CRITICAL: cluster=mysql device=cciss,5 instance=db2070:9100 job=node site=codfw Marostegui T208323 - The acknowledgement expires at: 2019-04-09 05:11:44. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2070&var-datasource=codfw+prometheus/ops [05:25:02] (03PS20) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [05:35:07] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500374 [05:36:03] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500374 (owner: 10Marostegui) [05:37:23] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500374 (owner: 10Marostegui) [05:38:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 (duration: 00m 50s) [05:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:16] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500374 (owner: 10Marostegui) [05:59:04] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 
(10Marostegui) This wiki is triggering some false positives on our labs private data checking methods, even if it is correctly sanitized (T212625#5062038) it... [06:00:36] (03PS21) 10BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) [06:27:17] <_joe_> !log installing new bootstrap-vz on boron T219580 [06:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:25] T219580: Remove backports from wikimedia-jessie - https://phabricator.wikimedia.org/T219580 [06:28:38] <_joe_> !log pushing wikimedia-jessie:{20190401,latest} to docker-registry.w.o T219580 [06:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:46] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:32:06] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/disable-puppet] [06:32:16] PROBLEM - puppet last run on cp1090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishospital] [06:33:20] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] [06:47:08] (03PS1) 10Giuseppe Lavagetto: Remove spurious depends completely [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/500379 [06:47:09] (03PS1) 10Giuseppe Lavagetto: Rebuild jessie images for removal of backports, updates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/500380 (https://phabricator.wikimedia.org/T219747) [06:47:20] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [06:54:21] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10Peachey88) I would prefer to see someone over prioritize a task so it shows up ea... [06:55:04] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:58:24] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:58:32] RECOVERY - puppet last run on cp1090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:34] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:11:57] (03PS3) 10Muehlenhoff: Pull in kibana/logstash 5.6.15 [puppet] - 10https://gerrit.wikimedia.org/r/500066 [07:13:37] 10Operations, 10Tools, 10cloud-services-team: Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (10Joe) [07:13:46] 10Operations, 10Tools, 10cloud-services-team: Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (10Joe) p:05Triage→03High [07:16:20] 10Operations, 10Tools, 10cloud-services-team: Rebuild toollabs docker images based on wikimedia-jessie - 
https://phabricator.wikimedia.org/T219751 (10Joe) a:05Joe→03None [07:16:56] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove spurious depends completely [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/500379 (owner: 10Giuseppe Lavagetto) [07:17:08] (03PS2) 10Giuseppe Lavagetto: Remove spurious depends completely [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/500379 [07:17:13] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove spurious depends completely [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/500379 (owner: 10Giuseppe Lavagetto) [07:17:41] (03CR) 10Muehlenhoff: [C: 03+2] Pull in kibana/logstash 5.6.15 [puppet] - 10https://gerrit.wikimedia.org/r/500066 (owner: 10Muehlenhoff) [07:19:14] (03PS2) 10Giuseppe Lavagetto: Rebuild jessie images for removal of backports, updates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/500380 (https://phabricator.wikimedia.org/T219747) [07:19:30] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Rebuild jessie images for removal of backports, updates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/500380 (https://phabricator.wikimedia.org/T219747) (owner: 10Giuseppe Lavagetto) [07:20:57] (03PS1) 10Marostegui: db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500383 [07:22:07] (03PS1) 10Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/500384 [07:23:50] (03PS1) 10Giuseppe Lavagetto: Edit Project Config [docker-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/500385 [07:24:53] (03PS1) 10Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/500386 [07:26:10] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500383 (owner: 10Marostegui) [07:27:18] (03Merged) 
10jenkins-bot: db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500383 (owner: 10Marostegui) [07:27:50] (03PS2) 10Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/500384 [07:28:06] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Edit Project Config [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/500384 (owner: 10Giuseppe Lavagetto) [07:29:28] RECOVERY - MariaDB disk space on dbstore1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:29:46] 10Operations, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712 (10MoritzMuehlenhoff) [07:29:49] 10Operations, 10Dumps-Generation: Switch dumps to component/php7.2 - https://phabricator.wikimedia.org/T218193 (10MoritzMuehlenhoff) 05Open→03Resolved This is done [07:30:13] (03CR) 10DCausse: [C: 03+1] Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: 10Gehel) [07:30:36] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2033 (duration: 00m 51s) [07:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:45] (03CR) 10jenkins-bot: db-codfw.php: Depool db2033 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500383 (owner: 10Marostegui) [07:39:19] (03CR) 10Mathew.onipe: [C: 03+1] Cookbook to reset frozen writes on elasticsearch / cirrus. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: 10Gehel) [07:42:11] (03PS1) 10Muehlenhoff: toolforge: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/500388 [07:44:26] (03PS17) 10Vgutierrez: Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [07:44:28] (03PS3) 10Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) [07:44:30] (03PS4) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [07:44:32] (03PS11) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [07:44:34] (03PS6) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [07:44:36] (03PS9) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [07:44:38] (03PS3) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [07:52:33] (03CR) 10ArielGlenn: [C: 03+2] use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) (owner: 10ArielGlenn) [07:54:04] !log ariel@deploy1001 Started deploy [dumps/dumps@7abb6c8]: get db user/passwd va mw maint script [07:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:07] !log 
ariel@deploy1001 Finished deploy [dumps/dumps@7abb6c8]: get db user/passwd va mw maint script (duration: 00m 03s) [07:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:38] (03PS4) 10Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) [07:58:40] (03PS5) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [07:58:42] (03PS12) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [07:58:45] (03PS7) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [07:58:47] (03PS10) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [07:58:49] (03PS4) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [08:07:00] (03PS18) 10Vgutierrez: Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [08:07:01] (03PS5) 10Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) [08:07:03] (03PS6) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate in cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [08:07:05] (03PS13) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp 
servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [08:07:07] (03PS8) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [08:07:09] (03PS11) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [08:07:11] (03PS5) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [08:09:05] !log Deploy testing schema change on enwiki.echo_event on db2033 and upgrade mysql - T143961 [08:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:09] T143961: Add index on event_page_id on echo_event table - https://phabricator.wikimedia.org/T143961 [08:17:08] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500392 [08:22:13] (03PS1) 10Rxy: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) [08:25:26] jouncebot: next [08:25:27] In 2 hour(s) and 4 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1030) [08:29:18] (03CR) 10Ema: [C: 03+1] "The smallest nit." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [08:30:47] (03CR) 10Gehel: "LGTM, waiting for https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/499951 to be deployed first" [puppet] - 10https://gerrit.wikimedia.org/r/500359 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev) [08:34:29] (03CR) 10Volans: [C: 03+2] check_icinga: fix retry logic [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/499368 (owner: 10Volans) [08:34:42] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:35:11] (03Merged) 10jenkins-bot: check_icinga: fix retry logic [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/499368 (owner: 10Volans) [08:35:52] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 79454 bytes in 3.246 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:36:59] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500392 (owner: 10Marostegui) [08:38:11] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500392 (owner: 10Marostegui) [08:40:00] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2033 (duration: 00m 51s) [08:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:20] (03PS19) 10Vgutierrez: Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [08:40:22] (03PS6) 10Vgutierrez: acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) [08:40:24] (03PS7) 10Vgutierrez: hieradata: Deploy acme-chief unified 
certificate in cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [08:40:26] (03PS14) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate in eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [08:40:28] (03PS9) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [08:40:30] (03PS12) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [08:40:32] (03PS6) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [08:40:34] (03CR) 10Vgutierrez: Allow acme-chief to provide unified cert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [08:42:08] (03CR) 10Ema: [C: 03+1] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [08:46:30] (03CR) 10Ema: [C: 03+1] acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/499974 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [08:46:38] (03CR) 10Vgutierrez: [C: 03+2] Allow acme-chief to provide unified cert [puppet] - 10https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [08:49:15] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500392 (owner: 10Marostegui) [08:53:49] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Allow cp1008 to fetch the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/499974 
(https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [08:55:47] (03CR) 10Ema: nagios_common: provide check_ssl_unified variants for LE certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [08:58:42] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:00:35] (03CR) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [09:01:09] (03PS8) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) [09:01:11] (03PS15) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [09:01:14] (03PS10) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [09:01:16] (03PS13) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [09:01:18] (03PS7) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [09:02:32] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[09:06:54] (CR) Ema: cache: serve wikiba.se traffic using cache::canary servers (3 comments) [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:07:15] (CR) Ema: [C: +1] hieradata: Deploy acme-chief unified certificate on cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:07:49] (CR) Vgutierrez: [C: +2] hieradata: Deploy acme-chief unified certificate on cp1008 [puppet] - https://gerrit.wikimedia.org/r/499975 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:09:27] !log installing Chromium security updates on proton* (tested the new release in deployment-prep)
[09:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:29] (PS5) Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203)
[09:14:38] PROBLEM - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:17:31] ^^ that's me
[09:21:15] vgutierrez, am looking at this on deployment-prep, is there an nginx vs. update-ocsp catch-22?
[09:21:23] yes
[09:21:40] the nginx service unit is preventing from writing in /etc
[09:21:56] so the update-ocsp-all script triggered by the nginx systemd service unit fails
[09:22:14] I'm considering the options to fix it
[09:22:19] sorry about the noise
[09:28:59] Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (Krenair)
[09:31:07] Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (Krenair) This was on deployment-sca02 but the list of deployment-prep instances failing puppet grew suddenly, so I expect a few of these were affected by this problem, and I imagine it's not...
[09:32:19] (PS16) Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705)
[09:32:21] (PS11) Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705)
[09:32:23] (PS14) Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705)
[09:32:25] (PS8) Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705)
[09:32:27] (PS1) Vgutierrez: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705)
[09:33:45] RECOVERY - Check systemd state on cp1008 is OK: OK - running: The system is fully operational
[09:34:19] Operations, Tools, cloud-services-team (Kanban): Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (aborrero)
[09:36:58] (CR) Gehel: Elasticsearch:
make unfreezing writes more robust. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [09:37:36] (03PS2) 10Vgutierrez: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - 10https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) [09:37:38] (03PS17) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [09:37:41] (03PS12) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [09:37:42] (03PS15) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [09:37:45] (03PS9) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [09:38:45] (03CR) 10Ema: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [09:39:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Please reference T219362 in the commit message. Other than that, LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/500388 (owner: 10Muehlenhoff) [09:40:13] (03PS3) 10Vgutierrez: tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - 10https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) [09:40:15] (03PS18) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [09:40:17] (03PS13) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [09:40:19] (03PS16) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [09:40:21] (03PS10) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [09:40:23] (03CR) 10jerkins-bot: [V: 04-1] cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [09:40:29] (03PS9) 10Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 [09:40:41] (03CR) 10Ema: [C: 03+1] tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - 10https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [09:40:46] I am having some trouble with the update-ocsp stuff vgutierrez [09:40:52] (03PS2) 10Muehlenhoff: toolforge: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362) [09:41:07] oh wait this may be my fault [09:41:17] (03CR) 10Vgutierrez: [C: 03+2] tlsproxy: Allow update-ocsp-all writing in /etc/acmecerts [puppet] - 
10https://gerrit.wikimedia.org/r/500397 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [09:41:33] sort of [09:42:09] merging that ^^ right now [09:42:18] that should do the trick [09:42:24] (tested manually in cp1008 and worked like a charm) [09:42:58] Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/Exec[unified-live-ec-prime256v1-create-ocsp]/returns: Exception: Command openssl ocsp -resp_text -respout /etc/acmecerts/unified/live/update-ocsp-fdhzvx.tmp/ec-prime256v1.client.ocsp -issuer /etc/ssl/certs/4f06f81d.0 -verify_other /etc/ssl/certs/4f06f81d.0 -url http://ocsp.int-x3.letsencrypt.org -header Host ocsp.int-x3.letsencrypt.org [09:42:59] -cert /etc/acmecerts/unified/live/ec-prime256v1.crt failed with exit code 1, stderr: [09:42:59] Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/Exec[unified-live-ec-prime256v1-create-ocsp]/returns: Missing = in header key=value [09:44:49] hmmm, $proxy rendering issues on the template? [09:45:08] vgutierrez, well first of all it tried to hardcode the prod proxy [09:45:22] then I tried removing the proxy line, not much happier [09:45:26] then I tried setting it to empty string [09:45:50] but actually, read the command it's complaining about closely [09:45:55] -header Host ocsp.int-x3.letsencrypt.org [09:46:00] Missing = in header key=value [09:46:15] isn't openssl ocsp complaining about update-ocsp passing it a -header without = ? [09:47:21] Krenair: which version of openssl is running that instance? [09:47:35] cause that's working as expected in cp1008 as we speak [09:47:56] 1.1.0j [09:47:58] what do you have? [09:48:03] 1.0.2r [09:48:14] specifically I've got 1.1.0j-1~deb9u1 out of stretch/main [09:48:46] cp1008 is still running jessie [09:48:50] ah [09:49:01] that might be something to watch out for in prod [09:49:09] don't those other cp hosts run stretch? 
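[Editor's note] The "Missing = in header key=value" error quoted above matches a documented difference between OpenSSL releases: the `openssl ocsp` manual for 1.0.2 describes `-header name value` as two separate arguments, while the manual for 1.1.0 and later describes a single `-header name=value` argument. Below is a minimal Python sketch of how a wrapper could build the right arguments for either release. The helper names `parse_openssl_version` and `ocsp_header_args` are hypothetical and are not taken from update-ocsp; this is an illustration under those assumptions, not the patch discussed in the channel.

```python
import re


def parse_openssl_version(version):
    """Return (major, minor, patch) from strings such as '1.1.0j-1~deb9u1'.

    Hypothetical helper for illustration; the letter suffix and any
    distribution suffix are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised OpenSSL version: %r" % version)
    return tuple(int(part) for part in match.groups())


def ocsp_header_args(version, name, value):
    """Build the -header arguments for `openssl ocsp`.

    Assumption: 1.1.0 and later expect one 'name=value' argument,
    older releases expect the name and value as separate arguments.
    """
    if parse_openssl_version(version) >= (1, 1, 0):
        return ["-header", "%s=%s" % (name, value)]
    return ["-header", name, value]


if __name__ == "__main__":
    host = "ocsp.int-x3.letsencrypt.org"
    print(ocsp_header_args("1.1.0j-1~deb9u1", "Host", host))
    print(ocsp_header_args("1.0.2r", "Host", host))
```

With the two versions mentioned in the channel, the first call yields the `Host=...` form and the second yields the separate name and value form.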
[09:49:21] you're right [09:49:24] fun [09:49:31] okay [09:49:34] they run stretch, and ocsp stapling is currently working as expected there [09:49:40] ... huh. [09:49:42] for the current unified cert [09:49:50] (not the acme-chief managed one) [09:49:53] ah [09:50:15] so that part should be exactly the same [09:50:44] ah it looks like header is only passed if you don't have a proxy [09:51:00] ok.. so it's an existing bug there [09:51:04] yeah [09:51:11] one sec [09:51:14] awesome, let's check that code :) [09:52:53] yeah that did the trick [09:53:43] (03PS1) 10Alex Monk: sslcert: update-ocsp: Fix passing Host header in absence of proxy [puppet] - 10https://gerrit.wikimedia.org/r/500398 [09:54:32] (03PS1) 10Muehlenhoff: base: Remove support for trusty/Ubuntu in multiple places [puppet] - 10https://gerrit.wikimedia.org/r/500400 [09:54:57] (03CR) 10Vgutierrez: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500398 (owner: 10Alex Monk) [10:04:06] (03CR) 10Elukey: admin: allow users to be removed preserving their home directories (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey) [10:04:35] (03PS1) 10Muehlenhoff: base/ntp: Remove trusty/Ubuntu support [puppet] - 10https://gerrit.wikimedia.org/r/500403 [10:04:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/500388 (https://phabricator.wikimedia.org/T219362) (owner: 10Muehlenhoff) [10:07:07] (03PS1) 10Muehlenhoff: kube-proxy: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500404 [10:09:27] (03PS1) 10ArielGlenn: get rid of references to deprecated MW maintenance script [dumps] - 10https://gerrit.wikimedia.org/r/500405 [10:11:28] (03Abandoned) 10Jbond: jessie-backports: remove updates from jessie bootstrap-vz config [puppet] - 10https://gerrit.wikimedia.org/r/500069 (https://phabricator.wikimedia.org/T219580) (owner: 10Jbond) [10:11:59] jouncebot: next [10:12:00] In 0 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1030) [10:13:17] (03PS4) 10Elukey: admin: allow users to be removed preserving their home directories [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) [10:17:49] (03PS1) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [10:18:54] (03PS1) 10Muehlenhoff: Remove labs_vmbuilder [puppet] - 10https://gerrit.wikimedia.org/r/500407 [10:19:06] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk) [10:22:34] (03PS2) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [10:24:01] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk) [10:25:53] (03CR) 10Jbond: admin: allow users to be removed preserving their home directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498399 (https://phabricator.wikimedia.org/T215171) (owner: 10Elukey) [10:27:07] (03PS1) 10Giuseppe Lavagetto: Upgrade to 1.1.2 
[docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500408 [10:27:21] !log T219626 reimaging cloudcontrol2001-dev [10:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:25] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [10:28:58] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Upgrade to 1.1.2 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500408 (owner: 10Giuseppe Lavagetto) [10:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1030). [10:30:28] (03PS1) 10Muehlenhoff: Remove tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/500409 [10:31:17] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@7ef5ca3]: Upgrade to 1.1.2 [10:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:23] (03PS3) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [10:31:43] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@7ef5ca3]: Upgrade to 1.1.2 (duration: 00m 26s) [10:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:51] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546) [10:32:59] <_joe_> !log pruning old images on boron [10:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:02] (03PS1) 10Muehlenhoff: Remove trusty-wikimedia from aptrepo config [puppet] - 10https://gerrit.wikimedia.org/r/500411 [10:34:00] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - 
https://phabricator.wikimedia.org/T218155 (10Ladsgroup) >>! In T218155#5071692, @MarcoAurelio wrote: > Per docs, ops must be notified. I've been doing so emailing the ops list every time.... [10:35:23] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:35:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:38:14] (03PS1) 10Muehlenhoff: dnsrecursor: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500413 [10:38:58] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:11] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:39:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:39:58] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:42:23] !log rolling security update of tshark [10:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:46] vgutierrez, fyi I found https://phabricator.wikimedia.org/T182927#5073598 to be necessary to start serving traffic using the new cert [10:43:53] was there an easier way I missed? 
[10:45:23] (PS1) Muehlenhoff: redis::instance: Remove support for Ubuntu/Upstart [puppet] - https://gerrit.wikimedia.org/r/500415 [10:45:45] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:46:19] Krenair: right, in prod we aren't there yet, that's why it isn't provided in the puppetization [10:46:19] (CR) Alex Monk: "These designate/gdnsd sync scripts are what acme_chief uses to set DNS TXT records up. In gdnsd's case, it will expire challenges for us - " [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922) (owner: Alex Monk) [10:46:31] Operations, Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads some weird behaviour - https://phabricator.wikimedia.org/T219775 (aborrero) [10:47:02] vgutierrez, I imagine when moving prod to use these certs we wouldn't want to roll it out everywhere at the same time... what mechanism should be added to permit prod rollout? 
[10:47:09] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:47:14] I imagine I could use the exact same mechanism to enable it in deployment-prep without this nasty hack [10:47:30] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:500410| Bumping portals to master (T128546)]] (duration: 00m 52s) [10:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:33] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:47:35] yes, basically moving that to hieradata [10:47:44] so like [10:48:14] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500410 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:48:17] serve_acme_chief_certs = false [10:48:21] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:500410| Bumping portals to master (T128546)]] (duration: 00m 50s) [10:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:41] which we can set to true in specific cases, but only set certs/certs_active if it's false vgutierrez ? [10:49:32] volans: just opened T219775 [10:49:32] T219775: wmf-auto-reimage-host: puppet first run error leads some weird behaviour - https://phabricator.wikimedia.org/T219775 [10:50:05] hmm not right now, cause we need to be able to deliver both "certs" and acme_chief managed certificates to the same server [10:50:33] will need some more complicated logic in localssl.erb then [10:51:07] maybe replacing @certs_nginx.empty? with a check for serve_acme_chief_certs [10:51:22] so not more complicated really [10:51:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/500407 (owner: 10Muehlenhoff) [10:54:29] 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads some weird behaviour - https://phabricator.wikimedia.org/T219775 (10Volans) p:05Triage→03Normal a:03Volans It's kinda expected, the line when it says: ` Scheduled delayed downtime on Icinga ` spawn a subp... [10:56:50] 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (10aborrero) [10:57:51] 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (10aborrero) Regarding the `puppet_first_run` error, it would be interesting if a bit more information or context is provided by the script. W... [10:59:01] 10Operations, 10Operations-Software-Development: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (10Volans) It's all in the logs mentioned at the start of the script: ` Could not retrieve catalog from remote server: Error 500 on SERVER: Se... [10:59:05] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [10:59:26] ^^ looking [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1100). [11:00:04] Daimona, odder, Lucas_WMDE, and rxy: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:13] o/ [11:00:21] puppet runs fine, suspect this is just because there was a lock on the dpkg db [11:00:47] I can SWAT today [11:00:52] !log halt rolling updates of tshark until after SWAT [11:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:05] o/ [11:01:15] also in relation to the error above puppet runs fine, suspect this is just because there was a lock on the dpkg db [11:01:52] jbond42: tshark is ensured via "package/present" in puppet, this kind of puppet spam sometimes happens if the package is being upgraded with debdeploy and then puppet tries to also install it [11:01:54] zeljkof: is it okay if I already +2 my backport? it’ll probably take a while to go through CI [11:02:05] looks like everything else is config changes, those can be deployed first [11:02:21] moritzm: thanks, that was my suspicion [11:02:43] Lucas_WMDE: sure [11:02:48] Daimona: congrats on Gerrit change #500000 btw :) [11:03:19] Lucas_WMDE: Thanks, an important milestone for all of us :) [11:03:30] 500k?* [11:04:17] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:04:21] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:04:53] btw Daimona you seem to have the same patch twice in the Deployments calendar [11:04:53] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [11:05:11] IIRC there were two config var removals, so I guess that should be two different patches (instead of just removing the duplicate)? [11:05:34] odder: around for swat? 
[11:05:41] Ah yes Lucas, thanks [11:05:43] Fixing right now [11:05:58] fortunately the hashtag makes it easy to find the other one :) [11:06:37] (03CR) 10Zfilipin: "This is scheduled for EU SWAT but will not be deployed unless the developer joins #wikimedia-operations." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder) [11:06:55] Yay, I like this new hashtag thingy [11:08:19] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498817 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [11:08:28] (03PS1) 10Giuseppe Lavagetto: Various fixes to 1.1.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417 [11:09:26] (03Merged) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498817 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [11:10:12] (03PS1) 10Arturo Borrero Gonzalez: openstack: bootstrap hiera keys for codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/500418 (https://phabricator.wikimedia.org/T219626) [11:10:21] (03CR) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498817 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [11:11:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Various fixes to 1.1.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417 (owner: 10Giuseppe Lavagetto) [11:11:15] Daimona: 498817 is at mwdebug1002, please test and let me know if I can deploy it [11:11:20] (03PS1) 10ArielGlenn: use yaml safe_load everywhere [dumps] - 10https://gerrit.wikimedia.org/r/500419 [11:11:21] Testing [11:11:36] (03CR) 10jenkins-bot: Various fixes to 1.1.2 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500417 (owner: 10Giuseppe Lavagetto) [11:12:13] Ahhhh today debug servers are slow I see [11:13:08] It'll take a 
while to check [11:13:11] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:13:15] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:13:40] (03PS1) 10Giuseppe Lavagetto: Upgrade to 1.1.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500420 [11:14:44] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Upgrade to 1.1.3 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500420 (owner: 10Giuseppe Lavagetto) [11:15:38] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@0c32dc1]: Upgrade to 1.1.2 [11:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:41] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 79653 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:16:46] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@0c32dc1]: Upgrade to 1.1.2 (duration: 01m 08s) [11:16:47] zeljkof looks good, please go ahead [11:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:53] Daimona: ok, deploying [11:16:56] Thanks [11:17:38] (03PS1) 10Volans: wmf-auto-reimage: fix Icinga delayed downtime [puppet] - 10https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) [11:17:54] !log zfilipin@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:498817|Revert "Revert "Remove $wgAbuseFilterProfile"" (T191039)]] (duration: 00m 52s) [11:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:57] T191039: Re-enable filter profiling on every wiki - https://phabricator.wikimedia.org/T191039 [11:18:01] Daimona: deployed [11:18:13] Nice [11:18:40] (03PS3) 10Zfilipin: Revert "Revert "Remove 
$wgAbuseFilterRuntimeProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [11:19:42] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [11:19:44] (03CR) 10Daniel Kinzler: [C: 03+1] Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE)) [11:20:13] (03CR) 10ArielGlenn: [C: 03+2] get rid of references to deprecated MW maintenance script [dumps] - 10https://gerrit.wikimedia.org/r/500405 (owner: 10ArielGlenn) [11:20:50] (03Merged) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterRuntimeProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [11:20:58] (03CR) 10ArielGlenn: [C: 03+2] use yaml safe_load everywhere [dumps] - 10https://gerrit.wikimedia.org/r/500419 (owner: 10ArielGlenn) [11:21:11] Daimona: is there a reason 498817 and 498818 are not in the same patch? 
[11:21:20] (03PS2) 10Rxy: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) [11:21:26] zeljkof: I think that's because they depended-on different patches in AF [11:21:33] (03CR) 10jenkins-bot: Revert "Revert "Remove $wgAbuseFilterRuntimeProfile"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498818 (https://phabricator.wikimedia.org/T191039) (owner: 10Daimona Eaytoy) [11:21:34] ah, ok [11:21:34] And AF had two different patches for review ease [11:21:43] Although they were merged together [11:21:50] Daimona: 498818 is at mwdebug [11:21:55] Testing [11:22:26] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] openstack: bootstrap hiera keys for codfw1dev [labs/private] - 10https://gerrit.wikimedia.org/r/500418 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:23:50] This one looks good, too [11:24:13] Daimona: ok, deploying [11:25:11] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:498818|Revert "Revert "Remove $wgAbuseFilterRuntimeProfile"" (T191039)]] (duration: 00m 51s) [11:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:15] T191039: Re-enable filter profiling on every wiki - https://phabricator.wikimedia.org/T191039 [11:25:19] Daimona: deployed [11:25:31] hi zeljkof [11:25:40] (03PS3) 10Zfilipin: Enable logging of private filters on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad) [11:25:43] hi aharoni! [11:25:53] zeljkof: are you OK with me testing the deployment of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/499210/ ? 
[11:26:00] Cool, keeping an eye on it [11:26:07] Great, so I'm around [11:26:14] aharoni: sure [11:26:26] aharoni: I'll ping you after I'm done with current deployment [11:26:30] great :) [11:27:55] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad) [11:29:13] (03CR) 10Amire80: [C: 03+1] "I'll do it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder) [11:29:26] (03Merged) 10jenkins-bot: Enable logging of private filters on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad) [11:30:24] Daimona: 497236 is at mwdebug [11:30:43] CI for my backport is still running btw o_O [11:30:47] zeljkof Thanks, I'm trying to figure out how to test it [11:30:59] I'm unsure if there's a direct way [11:31:11] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:31:27] I don't know where udp notifications are sent for commons [11:31:46] Daimona: should I deploy it? [11:31:51] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:31:54] Lucas_WMDE: I still have a few more patches [11:32:00] I'm checking logstash just to be sure no fatals etc. [11:32:22] zeljkof: go ahead with rxy and aharoni then :) [11:32:30] (btw there seems to be a parser error in fatalmonitor?) [11:32:33] Lucas_WMDE: ok [11:32:45] (03CR) 10jenkins-bot: Enable logging of private filters on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497236 (https://phabricator.wikimedia.org/T218527) (owner: 10Ammarpad) [11:32:49] Lucas_WMDE: which error? 
[11:32:55] [ I'm around ] [11:32:58] “start tag expected, '<' not found” [11:33:06] (03PS3) 10ArielGlenn: explicitly start wikidata entity dumps on the 1st and 20th of each month [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160) [11:33:07] zeljkof Everything seems quiet, cleared for deployment I guess [11:33:09] checking logstash rn [11:33:19] Daimona: ok, deploying [11:34:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:34:19] jouncebot: now [11:34:19] For the next 0 hour(s) and 25 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1100) [11:34:25] !log zfilipin@deploy1001 Synchronized wmf-config/abusefilter.php: SWAT: [[gerrit:497236|Enable logging of private filters on commonswiki (T218527)]] (duration: 00m 50s) [11:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:33] T218527: Enable wgAbuseFilterNotificationsPrivate on commons.wikimedia.org - https://phabricator.wikimedia.org/T218527 [11:34:44] (03PS1) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) [11:35:09] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:35:29] Daimona: deployed, thanks for deploying with #releng and please check production and logs :) [11:35:30] 10Operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632 (10ArielGlenn) I usually just grumble to myself and move on when I forget and try to push to refs/for/master but 
today I reached MAX_GRUMBLES. How much work would this be? [11:35:45] zeljkof: Noooice, thank you :-) [11:35:59] zeljkof: looks like the last such error in mwlog1001:/srv/mw-log/hhvm.log was at 9:51:20, so ignore it I guess [11:36:04] (03PS3) 10Gehel: Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) [11:36:18] it’s just still in the last 1000 lines of hhvm.log, that’s why it appears in the `fatalmonitor` command [11:36:18] Lucas_WMDE: ok, thanks [11:36:41] (03PS1) 10Arturo Borrero Gonzalez: labtestnet2003: rename to cloudnet2003-dev and reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/500423 (https://phabricator.wikimedia.org/T219776) [11:37:18] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder) [11:37:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2003: rename to cloudnet2003-dev and reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/500423 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez) [11:37:43] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:37:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:38:22] afk for a few minutes [11:38:23] (03Merged) 10jenkins-bot: Correct logos for the Gujarati Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder) [11:38:41] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:39:04] aharoni: 499210 is at mwdebug1002, please test and let me know if I can deploy it [11:40:15] zeljkof: tested on mwdebug1002, looks good, please go on deploying. [11:40:25] aharoni: ok, deploying [11:41:27] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:499210|Correct logos for the Gujarati Wikipedia (T219373)]] (duration: 00m 52s) [11:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:30] T219373: Correct logo for the Gujarati Wikipedia - https://phabricator.wikimedia.org/T219373 [11:41:41] aharoni: it's deployed, purging logos [11:42:11] aharoni: purged, please test, and thanks for deploying with #releng :) [11:42:35] zeljkof: sounds like a sales transaction :P [11:42:53] rxy: please stand by, you're next [11:42:59] k [11:43:01] Zppix: :D [11:43:06] zeljkof: works in production. Thanks! [11:43:21] Alright my jokes are over, good luck with SWAT :) [11:43:26] (03CR) 10jenkins-bot: Correct logos for the Gujarati Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373) (owner: 10Odder) [11:43:57] (03CR) 10Mathew.onipe: [C: 03+1] Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [11:44:11] aharoni: great! 
[11:44:18] Zppix: thanks :) [11:44:29] Np [11:44:39] (PS1) Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500424 (https://phabricator.wikimedia.org/T219776) [11:44:54] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) (owner: Rxy) [11:44:59] my backport finally went through CI, hallelujah [11:44:59] only took ¾ of the SWAT window :D [11:46:20] (CR) jerkins-bot: [V: -1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500424 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez) [11:46:21] Lucas_WMDE: that is pretty long :/ [11:46:21] (CR) Gehel: [C: -1] elasticsearch: add profile for icinga checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe) [11:46:21] I'm deploying my final patch, you're next :) [11:46:21] Lucas_WMDE: ^ [11:46:28] (Merged) jenkins-bot: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) (owner: Rxy) [11:46:28] I thought the SWAT pipeline was supposed to speed CI up for SWAT? 
[11:46:29] I’ll wait for the config change to be done [11:46:30] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] "Will work around the Jenkins -1 because it is totally expected:" [dns] - https://gerrit.wikimedia.org/r/500424 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez) [11:46:31] ack [11:46:52] Zppix: I don’t think there was any contention waiting for a container or something [11:46:56] Zppix: some tests take a long time to run [11:46:59] it just takes a really long time to run the tests [11:47:14] Hmm, I wonder if there's any way to speed that test up [11:47:15] (IIRC WikibaseLexeme pulls in a bunch of other extensions as well, to avoid breaking them?) [11:47:36] (I may take a looksie with the very very little knowledge I have of CI) [11:47:42] (CR) Gehel: [C: +2] Cookbook to reset frozen writes on elasticsearch / cirrus. [cookbooks] - https://gerrit.wikimedia.org/r/500064 (https://phabricator.wikimedia.org/T219638) (owner: Gehel) [11:47:58] Zppix: there are many things we could do... but there is little time... :) [11:48:20] zeljkof: true, that's why I said I'll see (although I know little about CI) [11:48:32] rxy: 500393 is at mwdebug1002, please test and let me know if I can deploy it [11:49:48] zeljkof: ok, it works as expected. Please deploy to prod [11:49:52] hm, https://gerrit.wikimedia.org/r/441178 was just abandoned and looks like it could have helped? 
[11:49:55] (with the test time) [11:50:02] rxy: ok, deploying [11:51:00] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:500393|Add unwatchedpages permission to rollbacker and patroller at zhwiki (T219285)]] (duration: 00m 52s) [11:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:09] T219285: Add "unwatchedpages" to patrollers and rollbackers on zhwiki - https://phabricator.wikimedia.org/T219285 [11:51:28] (03PS1) 10Arturo Borrero Gonzalez: Revert "wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs" [dns] - 10https://gerrit.wikimedia.org/r/500426 [11:51:32] rxy: it's deployed, please test in production and thanks for deploying with #releng :) [11:51:38] Lucas_WMDE: the swat is yours [11:51:40] (03PS1) 10Arturo Borrero Gonzalez: Revert "wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs" [dns] - 10https://gerrit.wikimedia.org/r/500427 [11:51:43] ok thanks, going ahead [11:52:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs" [dns] - 10https://gerrit.wikimedia.org/r/500427 (owner: 10Arturo Borrero Gonzalez) [11:52:29] (03PS1) 10Jbond: facter3: add augeas-tools [puppet] - 10https://gerrit.wikimedia.org/r/500428 [11:52:53] !log uploaded logstash/kibana/elasticsearch to component thirdparty/elastic56 [11:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:01] backport is on mwdebug1002, testing [11:53:07] zeljkof: ok, It works correctly at server: mw1268.eqiad.wmnet prod [11:53:11] thanks :) [11:53:29] !log uploaded logstash/kibana/elasticsearch 5.6.15 to component thirdparty/elastic56 [11:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:32] (03CR) 10jenkins-bot: Add 'unwatchedpages' permission to rollbacker and patroller at zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500393 (https://phabricator.wikimedia.org/T219285) (owner: 10Rxy) 
[11:54:48] (03CR) 10Gehel: [C: 03+1] "Good enough for me, since we're going to replace this with a cookbook." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) (owner: 10Volans) [11:55:20] hm, the debug server seems to time out when POSTing an edit [11:55:32] but the fix is really for the frontend UI, and that part seems to be working as expected [11:55:39] so I’m going to chalk that error up to the debug server being slow [11:55:45] and will go ahead with deploying the backport [11:57:54] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.33.0-wmf.23/extensions/WikibaseLexeme: SWAT: [[gerrit:500237|Fix GrammaticalFeatureListWidget (T219134, T219734)]] (duration: 01m 00s) [11:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:59] T219734: Not possible to edit forms of Wikidata lexemes - https://phabricator.wikimedia.org/T219734 [11:58:00] T219134: `text.trim is not a function` in node selenium tests blocking Wikibase CI; blocking test/merge in WB/WBL/WBMI/etc. 
- https://phabricator.wikimedia.org/T219134 [11:58:17] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:58:18] !log EU SWAT done [11:58:19] (03PS1) 10Hashar: Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778) [11:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:22] just in time, yay [12:02:10] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@UNKNOWN]: Rollback to 1.0.0, T219778 [12:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:13] T219778: docker-pkg is unhappy on contint1001 - https://phabricator.wikimedia.org/T219778 [12:02:44] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@UNKNOWN]: Rollback to 1.0.0, T219778 (duration: 00m 34s) [12:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:05] (03CR) 10Gehel: [C: 04-1] icinga: add mediawiki cirrus update lag check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [12:04:25] (03PS4) 10Alaa Sarhan: Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) [12:05:31] (03CR) 10jerkins-bot: [V: 04-1] Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [12:05:46] (03PS2) 10Muehlenhoff: Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE)) [12:08:30] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@0c32dc1]: Rollback to 1.0.0, T219778 [12:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:33] T219778: docker-pkg is 
unhappy on contint1001 - https://phabricator.wikimedia.org/T219778 [12:08:48] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@0c32dc1]: Rollback to 1.0.0, T219778 (duration: 00m 18s) [12:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:18] (03CR) 10Muehlenhoff: [C: 03+2] Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE)) [12:12:12] (03CR) 10Muehlenhoff: [C: 03+2] "I've also moved Daniel from the cn=nda group to the cn=wmf LDAP group." [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE)) [12:12:31] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:13:41] (03PS1) 10Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) [12:14:01] (03CR) 10jerkins-bot: [V: 04-1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez) [12:14:10] (03PS11) 10Gehel: Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [12:14:37] (03PS5) 10Alaa Sarhan: Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) [12:15:43] (03CR) 10jerkins-bot: [V: 04-1] Add wgScoreLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [12:16:32] (03CR) 10Gehel: [C: 03+2] Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 
(https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [12:18:45] (03PS1) 10Muehlenhoff: Update Birgit Mueller's email address [puppet] - 10https://gerrit.wikimedia.org/r/500432 [12:19:25] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:26] (03PS6) 10Alaa Sarhan: Add wgMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) [12:19:29] (03PS1) 10Giuseppe Lavagetto: Revert "Upgrade to 1.1.3", "Upgrade to 1.1.2" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500433 [12:20:22] (03PS2) 10Alaa Sarhan: Add wgScoreLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) [12:22:08] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "Upgrade to 1.1.3", "Upgrade to 1.1.2" [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500433 (owner: 10Giuseppe Lavagetto) [12:23:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: add DB configuration for stretch [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) [12:23:31] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@46ba982]: Rollback - third time is the charm [12:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500428 (owner: 10Jbond) [12:24:01] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:24:30] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@46ba982]: Rollback - third time is the charm (duration: 00m 43s) [12:24:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778) (owner: 10Hashar) [12:26:20] (03CR) 10Gehel: [C: 03+2] Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [12:26:40] (03Merged) 10jenkins-bot: Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778) (owner: 10Hashar) [12:27:08] (03CR) 10jenkins-bot: Fix build error logging $s -> %s [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500429 (https://phabricator.wikimedia.org/T219778) (owner: 10Hashar) [12:27:13] (03CR) 10jenkins-bot: Elasticsearch: make unfreezing writes more robust. [software/spicerack] - 10https://gerrit.wikimedia.org/r/500067 (https://phabricator.wikimedia.org/T219640) (owner: 10Gehel) [12:31:45] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:33:40] (03CR) 10Jbond: [C: 03+2] facter3: add augeas-tools [puppet] - 10https://gerrit.wikimedia.org/r/500428 (owner: 10Jbond) [12:33:53] (03PS1) 10Jbond: pbuilder: fix backports hook and add archive hook [puppet] - 10https://gerrit.wikimedia.org/r/500435 [12:34:00] (03PS2) 10Jbond: facter3: add augeas-tools [puppet] - 10https://gerrit.wikimedia.org/r/500428 [12:40:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/500435 (owner: 10Jbond) [12:41:53] RECOVERY - Disk space on dbprov2001 is OK: DISK OK [12:43:06] (03PS1) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 
[12:43:08] (03PS1) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 [12:45:47] RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:46:09] 10Operations, 10monitoring, 10Proposal: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158 (10jcrespo) 05Open→03Declined I was convinced, this is desirable, but I don't see a way to move forward that doesn't create many... [12:47:21] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel) [12:47:23] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel) [12:47:30] (03PS2) 10Jbond: pbuilder: fix backports hook and add archive hook [puppet] - 10https://gerrit.wikimedia.org/r/500435 [12:51:20] (03PS2) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 [12:51:22] (03PS2) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 [12:51:51] (03CR) 10Jbond: [C: 03+2] pbuilder: fix backports hook and add archive hook [puppet] - 10https://gerrit.wikimedia.org/r/500435 (owner: 10Jbond) [12:52:11] (03PS2) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) [12:52:52] (03CR) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [12:53:57] 
(03PS2) 10Arturo Borrero Gonzalez: openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) [12:55:12] (03CR) 10jerkins-bot: [V: 04-1] openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [13:00:04] Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) New wikis deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1300). [13:00:18] o/ [13:02:14] (03CR) 10Volans: [C: 03+1] "Trivial, LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel) [13:05:59] (03PS1) 10Volans: tests/docs: unify usage of example.com domain [software/spicerack] - 10https://gerrit.wikimedia.org/r/500440 [13:06:24] (03PS4) 10Ladsgroup: Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio) [13:08:03] (03CR) 10Ladsgroup: [C: 03+2] Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio) [13:09:10] (03Merged) 10jenkins-bot: Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio) [13:09:14] !log rolling security update of tshark [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:20] (03CR) 10Volans: [C: 03+1] "LGTM, couple of very minor nitpicks inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel) [13:09:23] (03CR) 10jenkins-bot: Reinstate "Initial configuration for hyw.wikipedia", take 3 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/500190 (https://phabricator.wikimedia.org/T212597) (owner: 10MarcoAurelio) [13:09:30] (03PS3) 10Arturo Borrero Gonzalez: openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) [13:09:43] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:11:03] !log Upgraded CI Jenkins jobs to Quibble 0.0.30 # T219647 [13:11:13] (03PS4) 10Arturo Borrero Gonzalez: openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) [13:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:19] T219647: Upgrade CI jobs to Quibble 0.0.30 - https://phabricator.wikimedia.org/T219647 [13:11:31] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [13:11:37] (03PS3) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 [13:11:39] (03PS3) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 [13:12:04] !log mvolz@deploy1001 scap-helm citoid upgrade staging -f citoid-staging-values.yaml stable/citoid [namespace: citoid, clusters: staging] [13:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:05] !log mvolz@deploy1001 scap-helm citoid cluster staging completed [13:12:05] !log mvolz@deploy1001 scap-helm citoid finished [13:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:39] PROBLEM - puppet last run on an-worker1094 is CRITICAL: 
CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:13:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: add some missing bits for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/500434 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [13:13:29] (03PS19) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [13:13:31] (03PS14) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [13:13:33] (03PS17) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [13:13:35] (03PS11) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [13:13:37] (03PS1) 10Vgutierrez: localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) [13:14:19] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:14:22] (03PS1) 10Arturo Borrero Gonzalez: openstack: admin_scripts: libvirt-bin doesn't exists in stretch [puppet] - 10https://gerrit.wikimedia.org/r/500444 (https://phabricator.wikimedia.org/T215407) [13:15:14] (03CR) 10Ema: [C: 03+1] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:15:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/500432 (owner: 10Muehlenhoff) [13:15:41] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:15:48] (03CR) 10Alex Monk: "This should probably be put behind the same hiera check we'll be using to determine whether the host is using the cert to serve traffic or" [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:15:55] (03PS1) 10Giuseppe Lavagetto: Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 [13:16:27] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark] [13:16:27] PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:16:33] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:16:43] (03CR) 10jerkins-bot: [V: 04-1] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:16:45] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:16:59] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark] [13:17:03] PROBLEM - puppet last run on mw1344 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark] [13:17:19] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:17:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: admin_scripts: libvirt-bin doesn't exists in stretch [puppet] - 10https://gerrit.wikimedia.org/r/500444 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [13:17:27] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:17:43] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark] [13:17:55] PROBLEM - puppet last run on mw1273 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark] [13:17:57] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark] [13:18:05] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[augeas-tools],Package[tshark] [13:18:17] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:18:47] PROBLEM - puppet last run on sessionstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:18:59] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:19:04] (03CR) 10Alex Monk: "Oh, yeah, restart instead of reload. Let's not do that." [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:20:13] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:20:17] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:20:17] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:20:23] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[augeas-tools] [13:20:37] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10MoritzMuehlenhoff) @Krinkle I've prepared a new build and uploaded it to https://... 
[13:20:45] (03PS2) 10Vgutierrez: localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) [13:20:47] (03PS20) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [13:20:49] (03PS15) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [13:20:51] (03PS18) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [13:20:53] (03PS12) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [13:22:00] (03CR) 10Alex Monk: [C: 03+1] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:23:06] !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-eqiad-values.yaml stable/citoid [namespace: citoid, clusters: eqiad] [13:23:07] !log mvolz@deploy1001 scap-helm citoid cluster eqiad completed [13:23:07] !log mvolz@deploy1001 scap-helm citoid finished [13:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:07] (03CR) 10Vgutierrez: [C: 03+2] localssl: Avoid acme-chief puppetization triggers nginx restart [puppet] - 10https://gerrit.wikimedia.org/r/500443 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:21] RECOVERY - puppet last run on mw1271 is OK: OK: 
Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:52] !log mvolz@deploy1001 scap-helm citoid upgrade production -f citoid-codfw-values.yaml stable/citoid [namespace: citoid, clusters: codfw] [13:26:53] !log mvolz@deploy1001 scap-helm citoid cluster codfw completed [13:26:53] !log mvolz@deploy1001 scap-helm citoid finished [13:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:15] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:27:32] (03PS2) 10Giuseppe Lavagetto: Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) [13:28:16] (03PS2) 10Muehlenhoff: Update Birgit Mueller's email address [puppet] - 10https://gerrit.wikimedia.org/r/500432 [13:29:42] (03CR) 10Muehlenhoff: [C: 03+2] Update Birgit Mueller's email address [puppet] - 10https://gerrit.wikimedia.org/r/500432 (owner: 10Muehlenhoff) [13:29:43] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:33:35] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:36:43] (03CR) 10Ema: [C: 03+1] hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:38:46] marostegui: I looked at lots of options, the only thing I can do for now is to drop hywwiki on s3 and make it again (make sure you don't drop hywiki, with one y, that's a huge wiki) [13:39:07] Is it okay if I drop it? [13:39:15] Amir1: that is not that straightforward, we have to drop it on all the es hosts and x1 [13:39:43] What will dropping actually solve? [13:40:31] (03CR) 10Vgutierrez: [C: 03+2] hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [13:40:56] marostegui: not es or x1, the maintenance script can skip re-adding those clusters now [13:41:19] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:41:22] can't you comment out the bit of the script that creates the s3 DB? [13:41:45] Reedy: data corruption on the main page [13:42:19] Has the cause of the corruption been fixed? [13:42:25] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:42:33] Krenair: Aaron has added "skipclusters" so we don't have to comment stuff out [13:42:43] RECOVERY - puppet last run on mw1321 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:42:53] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:43:04] It gives out this [13:43:05] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:43:08] https://www.irccloud.com/pastebin/FkL7Xptw/ [13:43:14] Reedy: Wow, living in the future. [13:43:17] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:43:18] ikr? [13:43:23] RECOVERY - puppet last run on mw1344 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:43:27] exactly :D [13:43:29] (03PS21) 10Vgutierrez: hieradata: Deploy acme-chief unified certificate on eqsin cp servers [puppet] - 10https://gerrit.wikimedia.org/r/499780 (https://phabricator.wikimedia.org/T213705) [13:43:30] Back in the day we didn't have these nice things [13:43:37] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:43:59] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:44:00] As a deployer you would have to customise addWiki against whichever kind of breakage was live on that particular day [13:44:11] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:44:11] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:44:11] RECOVERY - puppet last run on an-worker1094 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:44:16] Reedy: 
Maybe there is something wrong somewhere else. Do you want to double check the error? [13:45:09] RECOVERY - puppet last run on sessionstore1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:45:37] (03PS3) 10Herron: logstash: send varnish syslogs via kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899) [13:45:54] 10Operations, 10ops-eqiad, 10DBA: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (10Marostegui) Any update from HP? [13:46:09] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.653 second response time https://phabricator.wikimedia.org/T174916 [13:46:28] Amir1: I've no idea what tt:1 is supposed to be [13:46:31] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:46:41] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:46:43] But it kinda looks like it's having issues loading/saving the revisions from/to ES [13:47:28] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5001.eqsin.wmnet [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:01] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:49:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: 10Giuseppe Lavagetto) [13:50:02] !log Reverted CI Jenkins jobs to Quibble 0.0.28 # T219647 [13:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:06] (03PS37) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 
10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [13:50:07] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [13:50:07] T219647: Upgrade CI jobs to Quibble 0.0.30 - https://phabricator.wikimedia.org/T219647 [13:52:03] (03CR) 10Mathew.onipe: elasticsearch: add profile for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:56:17] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5001.eqsin.wmnet [13:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:05] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5007.eqsin.wmnet [13:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:38] (03PS1) 10Elukey: aptrepo: update cloudera-jessie to 5.16.1 [puppet] - 10https://gerrit.wikimedia.org/r/500453 (https://phabricator.wikimedia.org/T218343) [13:59:53] so the extstore is empty for both of them [14:00:36] (03CR) 10Herron: [C: 03+1] "Looks good! One small/optional thing inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) (owner: 10Cwhite) [14:02:00] (03CR) 10Herron: [C: 03+1] Add rsyslog kafka to service nodes. [puppet] - 10https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [14:03:21] (03PS1) 10Ladsgroup: Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597) [14:03:25] (03CR) 10Ppchelko: "Oh, I've collected a +4 on this one..." 
[puppet] - 10https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [14:03:47] (03CR) 10Ladsgroup: [C: 03+2] Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597) (owner: 10Ladsgroup) [14:04:10] (03CR) 10Herron: [C: 03+1] Expose rsyslog_udp_port to services configs. [puppet] - 10https://gerrit.wikimedia.org/r/498872 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [14:04:31] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5007.eqsin.wmnet [14:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:25] (03Merged) 10jenkins-bot: Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597) (owner: 10Ladsgroup) [14:05:33] (03CR) 10Herron: [C: 03+1] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:05:38] (03CR) 10jenkins-bot: Add hywwiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500454 (https://phabricator.wikimedia.org/T212597) (owner: 10Ladsgroup) [14:07:20] !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 00m 52s) [14:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:34] (03PS6) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [14:08:55] (03CR) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [14:09:18] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [14:09:43] !log uploaded debdeploy 0.0.99.10 to 
apt.wikimedia.org (jessie, stretch, buster) [14:09:52] (03CR) 10Mholloway: [C: 03+2] Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [14:11:08] (03CR) 10Ladsgroup: [C: 04-2] "DO NOT DEPLOY. I'm at a middle of deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [14:11:19] ladsgroup@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:47] (03CR) 10Mholloway: "Yikes, sorry about that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [14:14:03] (03CR) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel) [14:15:02] (03PS1) 10Jbond: pdebuild: ensure proxy config exists for apt-get install [puppet] - 10https://gerrit.wikimedia.org/r/500458 [14:16:45] forgot to rebase, need to do wikiversions again [14:17:10] (03PS19) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [14:17:14] (03PS13) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [14:17:16] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [14:17:36] (03CR) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:18:07] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: (no justification 
provided) [14:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] (03PS4) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 [14:21:31] (03PS4) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 [14:21:33] (03PS7) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [14:22:51] (03PS16) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [14:22:53] (03PS20) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [14:22:55] (03PS14) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [14:26:16] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [14:28:54] 10Operations, 10User-fgiunchedi: Session storage Cassandra metrics (Prometheus) not being collected - https://phabricator.wikimedia.org/T219523 (10Eevans) [14:28:59] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:29:01] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (10Eevans) [14:29:18] okay, it's only the main page is grumpy :D [14:29:19] 
https://hyw.wikipedia.org/wiki/Foo [14:29:42] 10Operations, 10Cassandra, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Eevans) [14:29:50] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (10Eevans) [14:30:59] (03CR) 10Ema: [C: 03+1] cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:31:33] (03PS5) 10Gehel: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 [14:31:36] (03PS5) 10Gehel: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 [14:31:38] (03PS8) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [14:31:41] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [14:32:05] !log wikiadmin@10.64.32.136(hywwiki)> update text set old_text = 'DB://cluster25/1'; [14:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:51] (03PS1) 10Jbond: pdebuild: add a new repo for build dependencies [puppet] - 10https://gerrit.wikimedia.org/r/500464 [14:33:52] !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: T212597 (duration: 00m 51s) [14:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:00] T212597: Create Wikipedia Western Armenian - 
https://phabricator.wikimedia.org/T212597 [14:34:02] (03CR) 10Ema: nagios_common: provide check_ssl_unified variants for LE certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:35:17] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 50s) [14:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:04] marostegui: There is no need to drop anything, I manually fixed it in db [14:36:39] Amir1: Well, doesn't look completely fixed [14:36:43] curprev 12:33, 27 March 2019‎ 127.0.0.1 talk‎ 1,380 bytes +1,380‎ Created page with "==This subdomain is reserved for the creation of a Wikipedia in '''արեւմտահայերէն''' language==..." [14:36:50] Content is ". [14:36:50] " [14:37:06] !log ladsgroup@deploy1001 Synchronized langlist: (no justification provided) (duration: 00m 50s) [14:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:57] yeah, I basically changed pointer to the content [14:38:02] I know, I read it [14:38:16] because the content doesn't exist [14:38:19] (03PS17) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) [14:38:21] (03PS21) 10Vgutierrez: cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) [14:38:23] (03PS15) 10Vgutierrez: cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [14:38:25] How can I make it better? 
[14:38:30] (03CR) 10Vgutierrez: nagios_common: provide check_ssl_unified variants for LE certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:38:41] I think you delete the main page (or someone does), just to clear up the history [14:38:48] Editing other pages seems fine [14:38:57] So it's some bug/context thing in addWiki that's broken [14:39:07] I've no idea where tt comes from [14:39:31] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel) [14:39:44] (03CR) 10Gehel: [C: 03+2] elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel) [14:39:53] (03CR) 10Gehel: [C: 03+2] elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel) [14:40:11] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [14:40:13] I also made a page called Foo, that needs to be deleted as well [14:40:16] I'll handle it [14:40:36] Amir1: yaaaay :) [14:40:40] Amir1: thanks [14:41:05] Amir1: let me know when I can also try to register my user so I can see if it gets correctly filtered before sending it for the views creation [14:41:54] !log ladsgroup@mwmaint1002:~$ mwscript maintenance/createAndPromote.php --wiki=hywwiki --force --sysop Ladsgroup [14:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:25] !log restarting superset on analytics-tool1004 to pick up latest Python [14:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:32] (03CR) 10jenkins-bot: elasticsearch: rename elasticsearchclusters to elasticsearch_clusters [software/spicerack] - 10https://gerrit.wikimedia.org/r/500438 (owner: 10Gehel) [14:43:10] (03PS1) 10Elukey: cumin: add 
aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) [14:43:22] (03CR) 10jenkins-bot: elasticsearch: cleanup test by introducing a method to mock API calls [software/spicerack] - 10https://gerrit.wikimedia.org/r/500437 (owner: 10Gehel) [14:43:22] How can I desysop myself now 🤔 [14:43:51] can’t you un-sysop yourself on Special:UserRights? [14:43:53] Removed [14:44:19] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [14:44:43] !log rolling out debdeploy 0.0.99.10 for jessie, buster, stretch systems [14:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:03] (03CR) 10Ema: [C: 03+1] nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:45:12] nope :P [14:45:20] Thanks Reedy [14:46:12] (03CR) 10Vgutierrez: [C: 03+2] nagios_common: provide check_ssl_unified variants for LE certs [puppet] - 10https://gerrit.wikimedia.org/r/499823 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:46:14] (03PS9) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [14:46:39] Amir1: Just checked my user, and it gets filtered correctly [14:47:04] marostegui: \o/ [14:47:42] Amir1: so from your side this is all done? [14:47:49] yessssss [14:48:00] I'm handing it out to community [14:48:01] (03CR) 10Volans: [C: 03+1] "LGTM, optional nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [14:48:13] (Connection to wikidata is needed) etc. 
[14:48:31] (03CR) 10Vgutierrez: [C: 03+2] cache: serve wikiba.se traffic using cache::canary servers [puppet] - 10https://gerrit.wikimedia.org/r/499825 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:48:36] Amir1: So can I send it to cloud for the views creation?= [14:48:45] marostegui: sure [14:48:48] (03PS2) 10Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) [14:49:14] (03PS3) 10Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) [14:50:11] (03CR) 10Ladsgroup: "Safe to go now, sorry for this. I was at the middle of very complex deploy that could blow up in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [14:50:30] Amir1: let me run some more checks on my side [14:50:34] just to be 100% clear [14:50:40] Sure [14:50:55] 10Operations, 10Acme-chief, 10Traffic, 10Goal, 10Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) [14:51:02] as we had some reports today about it (that I commented on the ticket) - I want to see if those clear out [14:51:26] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [14:52:11] marostegui: This https://phabricator.wikimedia.org/T212597#5065429 ? [14:52:33] Amir1: No, this: https://phabricator.wikimedia.org/T212625#5072845 [14:52:43] So far looks solved, I am waiting for the check to fully finish [14:52:48] To be 100% sure [14:52:57] oh okay, sure [14:53:17] (03CR) 10Mholloway: "Thanks, that's OK, and sorry for the scare! I should have double-checked the calendar." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [14:53:29] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [14:54:18] (03CR) 10Mholloway: [C: 03+2] Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [14:54:40] (03PS2) 10Volans: wmf-auto-reimage: fix Icinga delayed downtime [puppet] - 10https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) [14:55:28] (03Merged) 10jenkins-bot: Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [14:55:30] (03CR) 10Volans: "replies inline, agreed" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) (owner: 10Volans) [14:55:50] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: fix Icinga delayed downtime [puppet] - 10https://gerrit.wikimedia.org/r/500421 (https://phabricator.wikimedia.org/T219775) (owner: 10Volans) [14:55:59] (03CR) 10Muehlenhoff: cumin: add aliases for Hadoop HDFS journalnodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [14:57:14] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: wmf-auto-reimage-host: puppet first run error leads to some weird behaviour - https://phabricator.wikimedia.org/T219775 (10Volans) 05Open→03Resolved [14:57:27] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10Krinkle) a:05MoritzMuehlenhoff→03Krinkle [14:57:29] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500465 
(https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [14:59:12] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs (duration: 00m 50s) [14:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:42] (03CR) 10jenkins-bot: Cleanup: Remove obsolete WikimediaEditorTasks beta cluster prefs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500049 (owner: 10Mholloway) [15:01:01] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: codfw1dev: nova: add missing nova keys [puppet] - 10https://gerrit.wikimedia.org/r/500468 (https://phabricator.wikimedia.org/T219626) [15:01:38] Search is broken on hywwiki [15:01:40] yay [15:02:03] (03CR) 10Muehlenhoff: [C: 03+1] "Ah, that makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [15:03:17] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:05:07] (03PS10) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [15:06:05] (03Abandoned) 10Andrew Bogott: nova: add wmcs-rescue-console.sh to compute hosts [puppet] - 10https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [15:06:35] (03PS2) 10Arturo Borrero Gonzalez: hieradata: openstack: codfw1dev: nova: add missing nova keys [puppet] - 10https://gerrit.wikimedia.org/r/500468 (https://phabricator.wikimedia.org/T219626) [15:08:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc https://puppet-compiler.wmflabs.org/compiler1002/15467/" [puppet] - 10https://gerrit.wikimedia.org/r/500468 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [15:08:55] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:09:06] !log mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=hywwiki --baseName hywwiki --cluster (eqiad|codfw) [15:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:17] (03PS1) 10Marostegui: realm.pp: Add urlshortcodes to private table [puppet] - 10https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) [15:11:16] (03CR) 10Elukey: [C: 03+2] cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [15:11:22] (03PS4) 10Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) [15:12:27] (03PS2) 10Marostegui: mariadb: Remove labsdb1004,labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/500373 
(https://phabricator.wikimedia.org/T216749) [15:13:26] (03CR) 10Bstorm: [C: 03+1] mariadb: Remove labsdb1004,labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/500373 (https://phabricator.wikimedia.org/T216749) (owner: 10Marostegui) [15:13:49] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove labsdb1004,labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/500373 (https://phabricator.wikimedia.org/T216749) (owner: 10Marostegui) [15:14:03] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:14:03] (03PS7) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) [15:14:57] (03CR) 10Marostegui: [C: 03+1] labsdb: remove old and likely unused cname for labsdb1004 [dns] - 10https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749) (owner: 10Bstorm) [15:15:16] (03PS2) 10Bstorm: labsdb: remove old and likely unused cname for labsdb1004 [dns] - 10https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749) [15:15:55] (03CR) 10Bstorm: [C: 03+2] labsdb: remove old and likely unused cname for labsdb1004 [dns] - 10https://gerrit.wikimedia.org/r/500090 (https://phabricator.wikimedia.org/T216749) (owner: 10Bstorm) [15:16:58] (03PS5) 10Elukey: cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) [15:17:03] (03CR) 10Elukey: [V: 03+2 C: 03+2] cumin: add aliases for Hadoop HDFS journalnodes [puppet] - 10https://gerrit.wikimedia.org/r/500465 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [15:18:01] (03PS2) 10Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) [15:18:21] (03CR) 10jerkins-bot: 
[V: 04-1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez) [15:18:23] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Marostegui) >>! In T212625#5072845, @Marostegui wrote: > This wiki is triggering some false positives on our labs private data checking methods, even if it... [15:18:55] (03CR) 10Herron: "discussed on irc a bit and following up here -- switched away from hiera regex in favor of a new hiera key called profile::ores::logstash_" [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [15:19:15] (03PS1) 10BBlack: Add wikiba.se to HTTPS/HSTS regexes for canonicals [puppet] - 10https://gerrit.wikimedia.org/r/500472 (https://phabricator.wikimedia.org/T213705) [15:20:08] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.343 second response time https://phabricator.wikimedia.org/T174916 [15:20:16] 10Operations, 10Traffic, 10monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992 (10Volans) @ema given the speedup due to prometheus 2 do you think this still needs to be worked on or could be resolved? [15:20:36] (03PS38) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [15:20:47] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Bstorm) [15:22:01] (03PS5) 10Giuseppe Lavagetto: Add rsyslog kafka to service nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [15:22:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: 10Giuseppe Lavagetto) [15:22:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add rsyslog kafka to service nodes. [puppet] - 10https://gerrit.wikimedia.org/r/496813 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [15:23:16] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [15:23:20] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Bstorm) a:05Bstorm→03RobH [15:23:26] <_joe_> Pchelolo: ^^ now merging [15:23:34] awesome, thank you! [15:23:39] (03Merged) 10jenkins-bot: Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: 10Giuseppe Lavagetto) [15:23:45] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Bstorm) [15:24:03] (03PS11) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [15:24:05] (03CR) 10jenkins-bot: Log in to the registry if credentials are provided [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500445 (https://phabricator.wikimedia.org/T219778) (owner: 10Giuseppe Lavagetto) [15:24:12] (03Abandoned) 10Gehel: elasticsearch: add method to mock node info API [software/spicerack] - 10https://gerrit.wikimedia.org/r/492385 (owner: 10Gehel) [15:24:20] 
_joe_: there's also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/498872/ and I can finally switch things over [15:24:28] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Bstorm) @Marostegui it was supposed to be down. It needed a kill -9. [15:24:36] <_joe_> I know [15:24:36] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Marostegui) #cloud-services-team this is ready for the views creation. I have added the usual GRANT so the views can be created for this wiki: Please rem... [15:24:41] <_joe_> one at a time please :) [15:24:43] !log disable puppet in the cache text cluster - T213705 [15:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:47] T213705: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 [15:25:15] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet - https://phabricator.wikimedia.org/T216749 (10Marostegui) Thanks :-) So fully ready for @RobH to take over [15:25:31] (03CR) 10Mathew.onipe: "PCC output is OK: https://puppet-compiler.wmflabs.org/compiler1002/15471/" [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:25:50] (03CR) 10Vgutierrez: [C: 03+2] cache: serve wikiba.se traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:25:58] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10bd808) [15:26:02] (03PS16) 10Vgutierrez: cache: serve wikiba.se 
traffic using cache::text servers [puppet] - 10https://gerrit.wikimedia.org/r/499981 (https://phabricator.wikimedia.org/T213705) [15:26:06] <_joe_> Pchelolo: there is some issue with the patch, damn [15:26:27] (03PS2) 10BBlack: Add wikiba.se to HSTS regex [puppet] - 10https://gerrit.wikimedia.org/r/500472 (https://phabricator.wikimedia.org/T213705) [15:26:29] (03PS1) 10BBlack: Add wikiba.se to HTTPS redirect regex [puppet] - 10https://gerrit.wikimedia.org/r/500473 (https://phabricator.wikimedia.org/T213705) [15:26:44] how can you see that? [15:26:45] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Marostegui) a:05Marostegui→03None [15:27:36] <_joe_> Pchelolo: by running puppet on one host [15:27:48] PROBLEM - DPKG on restbase2010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:27:52] <_joe_> Pchelolo: this is also why I wanted people who work on this to merge your change [15:27:56] <_joe_> Pchelolo: this ^^ [15:27:57] (03PS3) 10Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) [15:28:02] <_joe_> and now I have a meeting [15:28:18] (03CR) 10BryanDavis: "I'm working on the big picture fix for this in Iefcc0a8ea51a3cddc0e79218809e14d97acfc186, but removing this check now is ok with me." [puppet] - 10https://gerrit.wikimedia.org/r/500409 (owner: 10Muehlenhoff) [15:28:28] (03CR) 10jerkins-bot: [V: 04-1] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - 10https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez) [15:28:29] oh. damn.. 
[15:28:43] <_joe_> Pchelolo: and for some reason the second attempt at installing the update to rsyslog works [15:28:50] (03PS1) 10Gehel: elasticsearch: add cookbook to reboot all nodes in a cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/500474 [15:28:51] <_joe_> which makes me think this is a puppet ordering issue [15:28:56] RECOVERY - DPKG on restbase2010 is OK: All packages OK [15:29:05] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [15:30:31] (03PS12) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [15:30:55] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp5007.eqsin.wmnet [15:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:15] 10Operations, 10Analytics, 10EventBus, 10vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Milimetric) p:05Triage→03High [15:40:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: 10Ladsgroup) [15:40:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/500458 (owner: 10Jbond) [15:42:07] 10Operations, 10Analytics, 10Discovery, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Milimetric) p:05Triage→03High [15:42:08] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5007.eqsin.wmnet [15:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500453 
(https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[15:42:26] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 17 seconds ago with 2 failures. Failed resources (up to 3 shown)
[15:43:05] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4032.ulsfo.wmnet
[15:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:08] Operations, Analytics, Wikimedia-Mailing-lists: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (Milimetric) p:Triage→Normal
[15:44:04] (CR) Jbond: [C: +2] pdebuild: ensure proxy config exists for apt-get install [puppet] - https://gerrit.wikimedia.org/r/500458 (owner: Jbond)
[15:44:14] (PS2) Jbond: pdebuild: ensure proxy config exists for apt-get install [puppet] - https://gerrit.wikimedia.org/r/500458
[15:44:34] PROBLEM - DPKG on scb2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:45:50] RECOVERY - DPKG on scb2002 is OK: All packages OK
[15:46:06] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[15:48:11] Operations, Discovery-Search (Current work), Wikimedia-Incident: Create cookbook to reset readonly indices on elasticsearch clusters - https://phabricator.wikimedia.org/T219799 (Gehel)
[15:48:51] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4032.ulsfo.wmnet
[15:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:36] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3042.esams.wmnet
[15:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:41] (CR) Muehlenhoff: pdebuild: add a new repo for build dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500464 (owner: Jbond)
[15:49:50] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 51 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:49:58] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 56 seconds ago with 2 failures.
Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:50:25] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (Milimetric) p:Normal→High
[15:51:06] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:51:08] (PS4) Arturo Borrero Gonzalez: wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776)
[15:51:57] (CR) Mathew.onipe: [C: +1] elasticsearch: add cookbook to reboot all nodes in a cluster [cookbooks] - https://gerrit.wikimedia.org/r/500474 (owner: Gehel)
[15:52:21] (CR) Arturo Borrero Gonzalez: [C: +2] wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs [dns] - https://gerrit.wikimedia.org/r/500430 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez)
[15:52:36] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:52:40] PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 28 seconds ago with 2 failures.
Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:54:52] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[15:55:00] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:55:18] Operations, Puppet, Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond) p:Triage→Normal
[15:56:25] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3042.esams.wmnet
[15:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:51] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (greg) Just drive-by checking in on UBN!s: is this task still UBN!?
[15:57:26] PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:57:39] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2023.codfw.wmnet
[15:57:40] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[15:57:40] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:44] RECOVERY - puppet last run on restbase2011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[15:59:48] Operations, Puppet, Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond) ## notes building facter3 for debian: facter3 has a dependency on debhelper 11, however for now i am testing with debhelper v10 by updating the package locally pbuilder-sa...
[16:00:04] PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:33] <_joe_> jbond42: if you're upgrading puppet to puppet 5, we might want to talk about fixing the hiera mess
[16:00:45] <_joe_> it will help when we need to rewrite the hiera backends having just one left
[16:00:47] <_joe_> :)
[16:01:06] PROBLEM - keystone admin endpoint port 35357 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 35357: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:01:06] PROBLEM - Check whether ferm is active by checking the default input chain on cloudcontrol2001-dev is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[16:01:09] Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (ema)
[16:01:20] RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
[16:01:25] Operations, Traffic, monitoring:
prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992 (ema) Open→Resolved a:ema >>! In T190992#5074344, @Volans wrote: > @ema given the speedup due to prometheus 2 do you think this still needs t...
[16:01:34] Operations, Traffic, monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (ema) Open→Resolved a:ema
[16:01:41] Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (ema)
[16:02:32] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:02:54] PROBLEM - keystone public endoint port 5000 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 5000: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:04:44] PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 23 minutes ago with 2 failures.
Failed resources (up to 3 shown): Exec[rabbit_nova_create],Package[keystone]
[16:05:27] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2023.codfw.wmnet
[16:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:07] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet
[16:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:36] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:11:10] (CR) Jbond: pdebuild: add a new repo for build dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500464 (owner: Jbond)
[16:12:54] PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:13:01] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet
[16:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:08] RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[16:15:45] !log T219776 reimaging + renaming labtestnet2003 into cloudnet2003-dev
[16:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:48] T219776: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776
[16:18:49] (PS1) Arturo Borrero Gonzalez: labtestnet2003: cleanup old FQDNs [dns] - https://gerrit.wikimedia.org/r/500483 (https://phabricator.wikimedia.org/T219776)
[16:20:36] (CR) Gehel: [C: +2] elasticsearch: add cookbook to reboot all nodes in a cluster [cookbooks] - https://gerrit.wikimedia.org/r/500474 (owner: Gehel)
[16:25:12] !log uploading gdnsd-3.1.0-1~wmf1 to stretch-wikimedia
[16:25:13] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:22] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.414 second response time https://phabricator.wikimedia.org/T174916
[16:26:50] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:28:33] !log upgrade gdnsd -> 3.1.0 on cp1099 (authdns test)
[16:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:47] (PS37) DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[16:28:48] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[16:30:08] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:30:47] (PS1) Jgreen: flip payments.wm.o back to eqiad cluster [dns] - https://gerrit.wikimedia.org/r/500487
[16:32:37] !log slowly reenabling puppet in cache text cluster - T213705
[16:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:40] T213705: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705
[16:38:07] (CR) Gehel: [C: +1] "LGTM, trivial enough" [software/spicerack] - https://gerrit.wikimedia.org/r/500440 (owner: Volans)
[16:43:30] Operations, ops-eqiad, DBA: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) Thanks for the update @Cmjohnson! Are the HP hosts that can have the BBU changed with no disruption or should we plan a failover for this host? Thanks!
[16:43:57] (CR) Jgreen: [C: +2] flip payments.wm.o back to eqiad cluster [dns] - https://gerrit.wikimedia.org/r/500487 (owner: Jgreen)
[16:48:12] (PS3) Santhosh: ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948)
[16:50:44] Operations, Acme-chief, Traffic, Goal, Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez)
[16:51:05] Operations, Core Platform Team, MediaWiki-General-or-Unknown, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe)
[16:51:08] (PS3) Mholloway: Add cron job to update WikimediaEditorTasks suggestions table [puppet] - https://gerrit.wikimedia.org/r/500104 (https://phabricator.wikimedia.org/T218136)
[16:52:39] Operations, ops-eqiad, Analytics, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (elukey) If you don't find anybody online to shutdown stat1005 please go ahead and do it, we are not running anything on it!
[16:57:36] (PS3) EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[16:58:34] (CR) jerkins-bot: [V: -1] Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[16:59:14] (PS4) EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[16:59:42] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:00:04] gehel and onimisionipe: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1700).
[17:00:13] (CR) jerkins-bot: [V: -1] Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[17:00:15] jouncebot: here here
[17:02:37] (CR) Volans: [C: -1] puppet compiler: collect facts from cloud VMs as well as prod hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott)
[17:03:13] (PS5) EBernhardson: Disable wbcs dispatching query builder on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[17:03:35] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@115a6bf]: Added more endpoint, GUI updates and new bot pattern
[17:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:00] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew) I've just rechecked, and the following hosts are either empty or only running canary instances: labvirt1008 cloudvirt1009 cloudvirt101...
[17:06:35] (CR) Giuseppe Lavagetto: [C: -1] "Making this a -1 but I indicated in the comments what needs to be done before this patch can be merged."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott)
[17:06:36] PROBLEM - Host cloudcontrol2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[17:07:22] RECOVERY - Host cloudcontrol2001-dev is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[17:07:59] !log restart dhcp server in install2002 to release old lease for labtestnet2003
[17:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:11] (PS13) ArielGlenn: dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[17:10:22] PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:10:34] PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:56] PROBLEM - Check whether ferm is active by checking the default input chain on cloudcontrol2001-dev is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[17:12:16] RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
[17:14:32] PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Exec[rabbit_nova_create],Package[keystone]
[17:15:31] (PS2) Volans: tests/docs: unify usage of example.com domain [software/spicerack] - https://gerrit.wikimedia.org/r/500440
[17:15:45] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@115a6bf]: Added more endpoint, GUI updates and new bot pattern (duration: 12m 10s)
[17:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:51] Operations, ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (Cmjohnson) Dell sent the correct size disk, thanks to @robh.
Raid is rebuildingcmjohnson@sodium:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Rebuild Firmwa...
[17:18:20] (CR) ArielGlenn: "not quite rebased but close, it will need some inspection and testing before merge, but since the shortUrl project seems to have new life," [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[17:21:15] !log uploading gdnsd-3.1.0-1~wmf2 to stretch-wikimedia
[17:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:14] !log upgrade gdnsd -> 3.1.0 (wmf2) on cp1099 (authdns test)
[17:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:49] (CR) Andrew Bogott: "new version with --wmcs switch coming up" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott)
[17:23:57] (CR) Arturo Borrero Gonzalez: [C: +2] labtestnet2003: cleanup old FQDNs [dns] - https://gerrit.wikimedia.org/r/500483 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez)
[17:24:05] (PS2) Arturo Borrero Gonzalez: labtestnet2003: cleanup old FQDNs [dns] - https://gerrit.wikimedia.org/r/500483 (https://phabricator.wikimedia.org/T219776)
[17:24:49] Reedy: can we regenerate the interwiki cache for `hyw:` to work now that the wiki is created?
[17:25:33] PROBLEM - Host stat1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:26:25] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[17:27:31] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:31:09] !log authdns1001 (ns0) upgrade gdnsd -> 3.1.0
[17:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:31] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (aborrero)
[17:32:31] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `...
[17:32:38] (CR) Gehel: [C: +2] elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[17:33:25] PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:33:29] (PS39) Gehel: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[17:35:31] RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[17:40:43] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:41:35] (PS1) Arturo Borrero Gonzalez: wmnet: fix typo in cloudnet2003-dev FQDN [dns] - https://gerrit.wikimedia.org/r/500500 (https://phabricator.wikimedia.org/T219776)
[17:41:51] RECOVERY - Host stat1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms
[17:42:18] (CR) Arturo Borrero
Gonzalez: [C: +2] wmnet: fix typo in cloudnet2003-dev FQDN [dns] - https://gerrit.wikimedia.org/r/500500 (https://phabricator.wikimedia.org/T219776) (owner: Arturo Borrero Gonzalez)
[17:42:58] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (RobH) >>! In T216195#5075091, @Andrew wrote: > I've just rechecked, and the following hosts are either empty or only running canary instances:...
[17:43:34] (PS18) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[17:43:50] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] ` Of which those...
[17:44:09] (Abandoned) Andrew Bogott: puppet compiler: add more puppet masters to the fact-collection stage [puppet] - https://gerrit.wikimedia.org/r/499584 (owner: Andrew Bogott)
[17:44:13] (CR) Volans: [C: +2] tests/docs: unify usage of example.com domain [software/spicerack] - https://gerrit.wikimedia.org/r/500440 (owner: Volans)
[17:44:35] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:37] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:39] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1002 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:43] (PS4) Andrew Bogott:
puppet-compiler: restore the ability to export facts without puppetdb [puppet] - https://gerrit.wikimedia.org/r/499007 (https://phabricator.wikimedia.org/T219430)
[17:44:43] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:44:45] (PS1) Andrew Bogott: compiler-update-facts: better support addition of arbitrary fact sets [puppet] - https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430)
[17:44:47] ^ new check failing
[17:44:47] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:01] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `...
[17:45:05] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:45:05] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:05] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2003 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:17] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:45:21] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:23] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:29] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:33] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:45:42] (Abandoned) Andrew Bogott: puppet compiler: collect facts from cloud VMs as well as prod hosts [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott)
[17:46:21] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2002 is CRITICAL: NRPE: Command check_elasticsearch not defined
https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:47:01] PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:48:05] RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[17:48:21] (Merged) jenkins-bot: tests/docs: unify usage of example.com domain [software/spicerack] - https://gerrit.wikimedia.org/r/500440 (owner: Volans)
[17:48:44] (PS2) Andrew Bogott: compiler-update-facts: better support addition of arbitrary fact sets [puppet] - https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430)
[17:49:22] (CR) Jcrespo: [C: +1] realm.pp: Add urlshortcodes to private table [puppet] - https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) (owner: Marostegui)
[17:50:42] (CR) jenkins-bot: tests/docs: unify usage of example.com domain [software/spicerack] - https://gerrit.wikimedia.org/r/500440 (owner: Volans)
[17:54:29] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:55:39] !log remove asw2-c-eqiad:et-3/1/2 from disabled interfaces - T218059
[17:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:43] T218059: asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1 flapping - https://phabricator.wikimedia.org/T218059
[17:57:15] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:58:09] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:58:11] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (ops-monitoring-bot) Completed
auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] ` Of which those...
[18:00:05] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T1800)
[18:00:05] dcausse, Lucas_WMDE, and kart_: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:25] o/
[18:00:26] o/
[18:00:31] Lucas_WMDE: please go ahead
[18:00:39] ok
[18:00:49] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:01:37] Operations, ops-eqiad, Analytics, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (Cmjohnson)
[18:01:39] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: Ladsgroup)
[18:01:43] my / Amir1’s config change should have no effect in production
[18:01:46] only on beta
[18:01:51] it’ll be deployed there automatically, right?
[18:02:02] (and I’ll still do the scap + mwdebug dance to make sure it doesn’t break anything, of course)
[18:02:27] Operations, Analytics, hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (Cmjohnson)
[18:02:39] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2001 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:02:39] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2004 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:02:45] I'm around.
[18:03:02] more silencing
[18:03:23] (PS3) Lucas Werkmeister (WMDE): Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: Ladsgroup)
[18:04:10] Lucas_WMDE: can you deploy my patch too? :)
[18:04:21] probably, yeah
[18:04:25] does it have a +1 already? :)
[18:04:32] (CR) Lucas Werkmeister (WMDE): "sorry, forgot the rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: Ladsgroup)
[18:04:37] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: Ladsgroup)
[18:04:40] Operations, ops-eqiad, Analytics, User-Elukey: install new GPU in stat1005 - https://phabricator.wikimedia.org/T219522 (Cmjohnson) Open→Resolved card has been swapped
[18:04:47] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.799 second response time https://phabricator.wikimedia.org/T174916
[18:05:18] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (RobH) Per Chris's request I've gone ahead and put the following servers into maint for the until Friday in icinga: labvirt1008 cloudvirt1009 c...
[18:05:50] (Merged) jenkins-bot: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: Ladsgroup)
[18:06:20] Lucas_WMDE: Let me do that. It had, lost in rebase.
[18:06:45] Lucas_WMDE: OK.
It has :)
[18:07:32] ok good :)
[18:07:36] still busy with my own change atm
[18:08:12] doesn’t seem to have broken anything on mwdebug1002, syncing
[18:08:43] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[18:09:35] (CR) Volans: "Looks mostly ok, couple of nitpicks inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/492375 (owner: Gehel)
[18:10:02] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:499999|Add tmpSerializeEmptyListsAsObjects Wikibase repo config (T138104)]] (duration: 00m 54s)
[18:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:05] T138104: Do not serialize empty containers (descriptions/aliases/sitelinks) as empty array [] - https://phabricator.wikimedia.org/T138104
[18:12:05] I’m done for the moment, dcausse / kart_ who wants to go next?
[18:12:16] * Lucas_WMDE looks at kart_’s Gerrit change
[18:12:40] sorry I'm busy atm so if anyone wants to go ahead please do so
[18:12:47] alright then I’ll continue
[18:14:56] (CR) Lucas Werkmeister (WMDE): [C: +1] ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: Santhosh)
[18:15:04] (PS4) Lucas Werkmeister (WMDE): ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: Santhosh)
[18:15:11] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: Santhosh)
[18:15:22] Operations, ops-eqiad: asw2-c-eqiad fpc3 Rear QSFP+ PIC Chan# 1 flapping - https://phabricator.wikimedia.org/T218059 (ayounsi) Open→Resolved There was a VC link
between FPC3 and FPC8 that was acting up/flapping long time ago. As we already had quite too many VC links, I disabled as well but left... [18:15:37] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:17:05] (Merged) jenkins-bot: ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: Santhosh) [18:17:45] kart_: your change is on mwdebug1002, can you test it? [18:18:10] Lucas_WMDE: Thanks. Nothing to test, but let me check if nothing is broken. [18:18:17] ok [18:18:27] !log multatuli (ns2) upgrade gdnsd -> 3.1.0 [18:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:01] (CR) jenkins-bot: Add tmpSerializeEmptyListsAsObjects Wikibase repo config [mediawiki-config] - https://gerrit.wikimedia.org/r/499999 (https://phabricator.wikimedia.org/T138104) (owner: Ladsgroup) [18:20:05] (CR) jenkins-bot: ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) (owner: Santhosh) [18:21:11] Lucas_WMDE: go ahead. 
[18:21:17] alright [18:22:56] apparently there was only one occurrence of this error in the last 24h [18:22:57] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:498913|ExternalGuidance: Allow google translate hosts as known services (T218948)]] (duration: 00m 53s) [18:23:05] might be a while until we can definitely say it’s fixed, I guess [18:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:09] T218948: InvalidArgumentException from SpecialExternalGuidance: Invalid service name - https://phabricator.wikimedia.org/T218948 [18:23:18] but anyways, done [18:24:57] PROBLEM - ElasticSearch unassigned shard check - 9200- on logstash1007 is CRITICAL: NRPE: Command check_elasticsearch not defined https://wikitech.wikimedia.org/wiki/Search%23Administration [18:25:08] Lucas_WMDE: can I go ahead? [18:25:30] Lucas_WMDE: Thanks a lot! [18:25:38] Lucas_WMDE: Yes. I'll keep watching. [18:25:54] dcausse: yeah, sure [18:25:57] sorry for the missing ping [18:26:00] thanks! 
[18:26:04] np [18:26:45] (CR) DCausse: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [18:26:48] (PS1) Gehel: Revert "elasticsearch: add profile for icinga checks" [puppet] - https://gerrit.wikimedia.org/r/500514 [18:28:07] (CR) Gehel: [C: +2] Revert "elasticsearch: add profile for icinga checks" [puppet] - https://gerrit.wikimedia.org/r/500514 (owner: Gehel) [18:28:23] (Merged) jenkins-bot: [cirrus] Cleanup transitional states [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [18:32:48] Operations, puppet-compiler: Puppet compiler returns errors - https://phabricator.wikimedia.org/T219742 (Smalyshev) Open→Resolved a:Smalyshev [18:33:41] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] Cleanup transitional states (duration: 00m 53s) [18:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:44] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [18:34:48] (PS4) DCausse: [cirrus] Use bm25 similarity for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268) [18:36:04] (CR) Ladsgroup: [C: +1] realm.pp: Add urlshortcodes to private table [puppet] - https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) (owner: Marostegui) [18:36:08] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: initializing_shards: 0, cluster_name: production-logstash-eqiad, status: green, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, active_shards: 202, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, number_of_data_nodes: 3, 
number_of_pending_tasks: 0, number_of_in_fligh [18:36:08] r_of_nodes: 6, unassigned_shards: 0, timed_out: False, active_primary_shards: 86 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:12] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: active_primary_shards: 83, status: green, initializing_shards: 0, active_shards: 104, number_of_nodes: 2, number_of_data_nodes: 2, number_of_pending_tasks: 0, cluster_name: relforge-eqiad, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, active_shards_percent_as_number: [18:36:12] _shards: 0, number_of_in_flight_fetch: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:12] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: number_of_pending_tasks: 0, active_shards: 156, number_of_nodes: 6, number_of_data_nodes: 3, relocating_shards: 0, timed_out: False, active_shards_percent_as_number: 100.0, cluster_name: production-logstash-codfw, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, initializing_shards [18:36:12] ry_shards: 63, delayed_unassigned_shards: 0, unassigned_shards: 0, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:16] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_nodes: 6, unassigned_shards: 0, active_primary_shards: 86, number_of_in_flight_fetch: 0, active_shards: 202, number_of_pending_tasks: 0, status: green, timed_out: False, active_shards_percent_as_number: 100.0, cluster_name: production-logstash-eqiad, task_max_waiting_in_queue_millis: 0, de [18:36:16] shards: 0, number_of_data_nodes: 3, relocating_shards: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:17] (CR) DCausse: [C: 
+2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268) (owner: DCausse) [18:36:20] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: number_of_data_nodes: 2, active_shards: 12, status: green, number_of_pending_tasks: 0, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, timed_out: False, unassigned_shards: 0, initializing_shards: 0, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_numb [18:36:20] r_name: relforge-eqiad-small-alpha, active_primary_shards: 6, number_of_nodes: 2 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:24] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: cluster_name: production-logstash-eqiad, status: green, active_shards: 202, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: 0, unassigned_shards: 0, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 100.0, relocating_shards: 0, active_primary_shards: 86, numb [18:36:24] ks: 0, task_max_waiting_in_queue_millis: 0, number_of_nodes: 6, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:24] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2005 is OK: OK - elasticsearch status production-logstash-codfw: active_primary_shards: 63, number_of_nodes: 6, number_of_pending_tasks: 0, timed_out: False, unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards: 156, delayed_unassigned_shards: 0, cluster_name: production-logstash-codfw, number_of_data_nodes: 3, initializing_shards: 0, relocati [18:36:24] ive_shards_percent_as_number: 100.0, status: green, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:30] RECOVERY - ElasticSearch health check for shards on 9200 on 
relforge1002 is OK: OK - elasticsearch status relforge-eqiad: unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, active_shards_percent_as_number: 100.0, active_shards: 104, active_primary_shards: 83, number_of_data_nodes: 2, relocating_shards: 0, initializing_shards: 0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number [18:36:30] ster_name: relforge-eqiad, timed_out: False, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:54] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, status: green, active_shards: 202, unassigned_shards: 0, initializing_shards: 0, cluster_name: production-logstash-eqiad, active_shards_percent_as_number: 100.0, number_of_nodes: 6, relocating_shards: 0, timed_out: False, active_primar [18:36:54] ber_of_data_nodes: 3, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:54] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2004 is OK: OK - elasticsearch status production-logstash-codfw: number_of_in_flight_fetch: 0, status: green, active_shards: 156, initializing_shards: 0, relocating_shards: 0, active_shards_percent_as_number: 100.0, number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 0, unassigned_shards: 0, active_primary_shards: 63, timed_out: False, cluster_name: produc [18:36:54] fw, delayed_unassigned_shards: 0, number_of_nodes: 6, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:36:54] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2001 is OK: OK - elasticsearch status production-logstash-codfw: relocating_shards: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, number_of_data_nodes: 3, number_of_in_flight_fetch: 0, active_shards: 
156, delayed_unassigned_shards: 0, status: green, number_of_pending_tasks: 0, cluster_name: production-logstash-codfw, initializing_shards: 0, tas [18:36:55] queue_millis: 0, timed_out: False, number_of_nodes: 6, active_primary_shards: 63 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:37:14] (Merged) jenkins-bot: [cirrus] Use bm25 similarity for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268) (owner: DCausse) [18:37:50] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: cluster_name: production-logstash-eqiad, status: green, timed_out: False, number_of_nodes: 6, active_shards: 202, initializing_shards: 0, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, active_shards_percent_as_number: 100.0, active_primary_shards: 86, unassi [18:37:50] umber_of_pending_tasks: 0, number_of_data_nodes: 3, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:37:55] (CR) jenkins-bot: [cirrus] Cleanup transitional states [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [18:37:57] (CR) jenkins-bot: [cirrus] Use bm25 similarity for all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/499795 (https://phabricator.wikimedia.org/T219268) (owner: DCausse) [18:38:42] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: unassigned_shards: 0, initializing_shards: 0, number_of_nodes: 6, timed_out: False, number_of_data_nodes: 3, number_of_in_flight_fetch: 0, relocating_shards: 0, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstas [18:38:42] f_pending_tasks: 0, active_primary_shards: 86, 
active_shards: 202, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration [18:38:42] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2003 is OK: OK - elasticsearch status production-logstash-codfw: active_shards: 156, initializing_shards: 0, number_of_nodes: 6, task_max_waiting_in_queue_millis: 0, timed_out: False, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, active_primary_shards: 63, status: green, relocating_shards: 0, number_of_data_nodes: 3, number_of_in_flight_fe [18:38:42] ame: production-logstash-codfw, unassigned_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:38:42] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2002 is OK: OK - elasticsearch status production-logstash-codfw: active_shards_percent_as_number: 100.0, number_of_pending_tasks: 0, relocating_shards: 0, delayed_unassigned_shards: 0, initializing_shards: 0, cluster_name: production-logstash-codfw, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, status: green, number_of_nodes: 6, active_shards [18:38:42] data_nodes: 3, timed_out: False, unassigned_shards: 0, active_primary_shards: 63 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:40:52] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.929 second response time https://phabricator.wikimedia.org/T174916 [18:42:35] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T219268: [cirrus] Use bm25 similarity for all wikis (duration: 00m 51s) [18:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:38] T219268: Elasticsearch 6: the classic similarity is deprecated - https://phabricator.wikimedia.org/T219268 [18:43:55] Operations, monitoring, Goal, User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 
(colewhite) [18:43:58] Operations, monitoring, Goal, Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (colewhite) Open→Resolved [18:44:06] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [18:44:21] Operations, monitoring, Goal, User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (colewhite) [18:44:52] !log Morning SWAT done [18:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:40] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.296 second response time https://phabricator.wikimedia.org/T174916 [18:52:09] (PS1) Bstorm: sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500521 (https://phabricator.wikimedia.org/T219817) [18:52:16] !log restart mjolnir-kafka-msearch on relforge1002 to adopt new logging config [18:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:10] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [18:54:22] (CR) Bstorm: [C: +2] sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500521 (https://phabricator.wikimedia.org/T219817) (owner: Bstorm) [18:57:51] (PS1) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921) [18:58:51] !log re-set ulsfo-codfw ospf cost to previous default - T219591 [18:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:00] T219591: ulsfo <-> codfw transit link flapping causing nginx availability alerts - 
https://phabricator.wikimedia.org/T219591 [19:01:15] (PS2) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921) [19:01:45] Operations, Traffic, netops, Patch-For-Review: ulsfo <-> codfw transit link flapping causing nginx availability alerts - https://phabricator.wikimedia.org/T219591 (ayounsi) Open→Resolved Link has been up for 1+ day. Got a notification saying the emergency maintenance was done. [19:07:30] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:09:39] (PS1) Bstorm: Revert "sonofgridengine: make tools-checker hosts submit hosts" [puppet] - https://gerrit.wikimedia.org/r/500531 [19:11:04] (CR) Bstorm: [C: +2] Revert "sonofgridengine: make tools-checker hosts submit hosts" [puppet] - https://gerrit.wikimedia.org/r/500531 (owner: Bstorm) [19:23:40] (PS1) Bstorm: sonofgridengine: make tools-checker hosts submit hosts [puppet] - https://gerrit.wikimedia.org/r/500535 (https://phabricator.wikimedia.org/T219817) [19:32:38] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:32:58] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:34:56] (PS1) Gilles: Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499) [19:36:35] (CR) Gilles: [C: +2] Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499) (owner: Gilles) [19:37:22] PROBLEM - kubelet operational latencies on 
kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:37:45] (Merged) jenkins-bot: Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499) (owner: Gilles) [19:38:04] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:41:10] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:43:29] (CR) jenkins-bot: Renew Priority Hints origin trial token [mediawiki-config] - https://gerrit.wikimedia.org/r/500537 (https://phabricator.wikimedia.org/T216499) (owner: Gilles) [19:48:02] !log authdns2001 (ns1) upgrade gdnsd -> 3.1.0 [19:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:56] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T216499 Renew Priority Hints origin trial token (duration: 00m 54s) [19:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:59] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499 [19:54:21] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T2000). 
[20:07:04] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:07:04] PROBLEM - Disk space on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:07:08] PROBLEM - Check systemd state on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:07:12] PROBLEM - swift-account-auditor on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:07:20] PROBLEM - swift-container-updater on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:07:22] PROBLEM - swift-container-auditor on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:07:26] PROBLEM - very high load average likely xfs on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:07:30] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:07:36] PROBLEM - swift-container-replicator on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:07:44] PROBLEM - DPKG on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:07:46] PROBLEM - MD RAID on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:07:56] PROBLEM - swift-object-auditor on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection 
reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:07:56] PROBLEM - swift-account-replicator on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:07:58] PROBLEM - dhclient process on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:08:16] PROBLEM - swift-object-replicator on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:08:50] PROBLEM - swift-object-updater on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:09:14] PROBLEM - swift-account-server on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:09:16] PROBLEM - swift-container-server on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:09:48] PROBLEM - configured eth on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:09:56] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:10:02] PROBLEM - very high load average likely xfs on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [20:10:22] PROBLEM - MD RAID on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer [20:10:32] PROBLEM - swift-account-reaper on ms-be2026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Swift [20:11:12] RECOVERY - swift-container-auditor on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [20:11:14] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2026 is OK: OK ferm input default policy is set [20:11:18] RECOVERY - swift-object-updater on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [20:11:22] RECOVERY - swift-container-replicator on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [20:11:30] RECOVERY - DPKG on ms-be2026 is OK: All packages OK [20:11:40] RECOVERY - swift-object-auditor on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [20:11:40] RECOVERY - swift-account-reaper on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [20:11:40] RECOVERY - swift-account-replicator on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [20:11:40] RECOVERY - swift-account-server on ms-be2026 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [20:11:40] RECOVERY - swift-container-server on ms-be2026 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [20:11:42] RECOVERY - dhclient process on ms-be2026 is OK: PROCS OK: 0 processes with command name dhclient [20:12:00] RECOVERY - swift-object-replicator on ms-be2026 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [20:12:14] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [20:12:16] RECOVERY - configured eth on ms-be2026 is OK: OK - interfaces up [20:12:16] RECOVERY - swift-account-auditor on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [20:13:10] PROBLEM - DNS labtestnet2003.mgmt on labtestnet2003.mgmt is CRITICAL: Domain labtestnet2003.mgmt.codfw.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:16:22] RECOVERY - very high load average likely xfs on ms-be2026 is OK: OK - load average: 20.88, 72.79, 71.45 https://wikitech.wikimedia.org/wiki/Swift [20:18:48] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:00:04] bawolff and Reedy: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T2100). Please do the needful. 
[21:00:41] \o/ [21:02:36] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:05:32] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:10:59] !log elasticsearch search cluster: reindex spaceless languages (T219533) [21:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:02] T219533: Reindex space less languages wikis to use BM25 - https://phabricator.wikimedia.org/T219533 [21:11:38] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:12:10] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:16:01] We're going to deploy a security thing [21:18:24] PROBLEM - dhclient process on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:18:28] PROBLEM - swift-container-updater on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:18:34] PROBLEM - swift-container-auditor on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:18:40] PROBLEM - Check size of conntrack table on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:18:44] PROBLEM - swift-account-server on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:18:44] PROBLEM - swift-account-auditor on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:19:06] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:19:12] PROBLEM - MD RAID on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:19:14] PROBLEM - swift-container-replicator on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:19:16] PROBLEM - swift-object-auditor on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:19:18] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not 
connect to 10.192.16.160: Connection reset by peer [21:19:24] PROBLEM - swift-account-reaper on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:19:26] PROBLEM - Disk space on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:19:28] PROBLEM - swift-object-updater on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:19:32] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:19:58] PROBLEM - swift-object-replicator on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:20:04] PROBLEM - swift-account-replicator on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:20:08] PROBLEM - DPKG on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:20:08] PROBLEM - swift-container-server on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:21:02] PROBLEM - dhclient process on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:21:04] PROBLEM - swift-object-server on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:21:06] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:21:14] RECOVERY - Check systemd 
state on ms-be2020 is OK: OK - running: The system is fully operational [21:22:04] PROBLEM - Disk space on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:22:06] PROBLEM - swift-object-updater on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [21:22:06] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:22:22] PROBLEM - configured eth on ms-be2018 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.160: Connection reset by peer [21:22:24] RECOVERY - swift-container-auditor on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [21:22:24] RECOVERY - Check size of conntrack table on ms-be2018 is OK: OK: nf_conntrack is 6 % full [21:22:26] RECOVERY - swift-object-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [21:22:30] RECOVERY - swift-account-auditor on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [21:22:30] RECOVERY - swift-account-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [21:22:30] RECOVERY - swift-account-server on ms-be2018 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [21:22:36] RECOVERY - DPKG on ms-be2018 is OK: All packages OK [21:22:36] RECOVERY - swift-container-server on ms-be2018 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server 
https://wikitech.wikimedia.org/wiki/Swift [21:22:52] RECOVERY - very high load average likely xfs on ms-be2018 is OK: OK - load average: 40.35, 36.11, 29.34 https://wikitech.wikimedia.org/wiki/Swift [21:22:58] RECOVERY - MD RAID on ms-be2018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:23:02] RECOVERY - swift-container-replicator on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [21:23:02] RECOVERY - swift-object-auditor on ms-be2018 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [21:23:06] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2018 is OK: OK ferm input default policy is set [21:23:12] RECOVERY - swift-account-reaper on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [21:23:14] RECOVERY - Disk space on ms-be2018 is OK: DISK OK [21:23:16] RECOVERY - swift-object-updater on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [21:23:20] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational [21:23:24] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational [21:23:30] RECOVERY - dhclient process on ms-be2018 is OK: PROCS OK: 0 processes with command name dhclient [21:23:33] RECOVERY - configured eth on ms-be2018 is OK: OK - interfaces up [21:23:33] RECOVERY - swift-object-server on ms-be2018 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [21:23:33] RECOVERY - swift-container-updater on ms-be2018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [21:27:32] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:28:06] Operations, netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) `lang=diff,name=cr1-eqsin [edit protocols bgp group Transit4] - import [ BGP_sanitize_in BGP_transit_in BGP_avoid_long_RTT_in BGP_community_actions ]; + import [ BGP_sanitize_in BGP_tran... [21:28:40] !log Push AS specific policy-statements to cr1/2-eqsin v4 peers - T211930 [21:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:44] T211930: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 [21:33:28] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:35:40] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.105 second response time https://phabricator.wikimedia.org/T174916 [21:37:20] (PS1) Alex Monk: service::node: Only try to define node10 repository if it is not already defined [puppet] - https://gerrit.wikimedia.org/r/500615 [21:39:40] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [21:44:18] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [21:44:26] !log sbassett@deploy1001 Synchronized private/PrivateSettings.php: Remove kowiki spam mitigations T212679 (duration: 00m 54s) [21:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:37] Operations, Analytics-Kanban, SRE-Access-Requests, 
Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (Tbayer) @elukey Sure, that totally makes sense! The end of January estimate from T178802#4647106 turned out a bit optimistic (see again our internal tim... [21:52:39] (CR) MSantos: "Problem found in the Beta Cluster at deployment-maps04 https://horizon.wikimedia.org/project/instances/e469cff8-0791-4a83-aa86-3ba9bf3780d" [puppet] - https://gerrit.wikimedia.org/r/500615 (owner: Alex Monk) [21:56:24] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.834 second response time https://phabricator.wikimedia.org/T174916 [22:00:26] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:01:18] (PS1) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622 [22:04:03] Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (Yann) >>! In T219589#5072801, @Aklapper wrote: >>>! In T219589#5072592, @Yann wro... 
[22:05:45] (PS2) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622 [22:14:20] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:15:52] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:16:16] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:17:09] (PS1) Andrew Bogott: Reconcile some passwords between eqiad1 and main regions [labs/private] - https://gerrit.wikimedia.org/r/500627 [22:20:03] (PS2) Andrew Bogott: Reconcile some passwords between eqiad1 and main regions [labs/private] - https://gerrit.wikimedia.org/r/500627 [22:20:20] (CR) Andrew Bogott: [V: +2 C: +2] Reconcile some passwords between eqiad1 and main regions [labs/private] - https://gerrit.wikimedia.org/r/500627 (owner: Andrew Bogott) [22:20:24] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:20:40] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:21:14] Operations, netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) `lang=diff,name=cr2-eqsin [edit protocols bgp group Transit4] - import [ BGP_sanitize_in BGP_transit_in BGP_avoid_long_RTT_in BGP_community_actions ]; + import [ BGP_sanitize_in BGP_tran... 
[22:21:20] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:21:38] (PS5) CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - https://gerrit.wikimedia.org/r/499032 [22:21:48] PROBLEM - Varnishkafka Eventlogging Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All [22:23:16] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:23:56] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:24:18] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:25:26] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:26:12] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 18 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [22:29:48] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:32:49] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (bd808) p:Unbreak!→High >>! In T217280#5074689, @greg wrote: > Just drive-by checking in on UBN!s: is this task s... [22:37:54] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.312 second response time https://phabricator.wikimedia.org/T174916 [22:41:46] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:43:00] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.560 second response time https://phabricator.wikimedia.org/T174916 [22:45:41] tons of varnishkafka errors when sending data to eventlogging [22:45:43] https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All [22:46:53] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:50:58] (PS3) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622 [22:52:45] !log restart pdfrender on scb1003 - T174916 [22:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:49] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [22:53:10] (PS1) Alex Monk: Add hiera option to serve user traffic using acme-chief certs [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) [22:53:31] Operations, Continuous-Integration-Config, Patch-For-Review, User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (Krinkle) 
>>! In T215562#5074013, @MoritzMuehlenhoff wrote: > @Krinkle I've prepar... [22:53:53] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time https://phabricator.wikimedia.org/T174916 [22:54:13] (CR) jerkins-bot: [V: -1] Add hiera option to serve user traffic using acme-chief certs [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk) [22:54:19] yay [22:56:15] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul) [22:57:33] (PS2) Alex Monk: Add hiera option to serve user traffic using acme-chief certs [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190401T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:01:17] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:01:27] Operations, netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (ayounsi) Open→Resolved Pushed progressively and confirmed with the looking glasses that only the proper communities were received on the other side. As well as the proper local_pref was applied... [23:06:35] (CR) Cwhite: profile: do not mutate level for mjolnir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) (owner: Cwhite) [23:07:27] Is there anywhere here that restart eventlogging kafka consumers? 
[23:07:33] (CR) Cwhite: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: Herron) [23:08:10] (CR) Cwhite: [C: +1] logstash: send varnish syslogs via kafka logging pipeline [puppet] - https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899) (owner: Herron) [23:10:05] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:16:22] !log jnt push to csw2-esams [23:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:46] bblack: do you think you could help us reboot all eventlogging kafka producers ? cc mobrovac [23:18:07] herron: too ^ [23:18:27] bblack, herron: kafka jumbo cluster is kaput: https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All [23:19:30] mobrovac: we need to bounce back cluster entirely right? 
seems it is totally off [23:19:35] PROBLEM - swift-object-auditor on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:19:35] PROBLEM - swift-container-server on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:19:37] I'm around if you need someone from SRE, but I don't know anything about kafka [23:19:37] PROBLEM - swift-container-auditor on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:19:37] PROBLEM - dhclient process on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer [23:19:41] think so nuria, yeah [23:19:45] PROBLEM - configured eth on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer [23:19:47] PROBLEM - very high load average likely xfs on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:19:55] PROBLEM - Disk space on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer [23:19:59] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer [23:20:01] PROBLEM - swift-account-replicator on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:20:03] PROBLEM - swift-container-updater on ms-be2017 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [23:20:05] PROBLEM - puppet last run on ms-be2017 is CRITICAL: 
CHECK_NRPE: Error - Could not connect to 10.192.0.137: Connection reset by peer [23:20:31] RECOVERY - swift-object-auditor on ms-be2017 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [23:20:31] RECOVERY - swift-container-server on ms-be2017 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [23:20:31] RECOVERY - swift-container-auditor on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [23:20:31] RECOVERY - dhclient process on ms-be2017 is OK: PROCS OK: 0 processes with command name dhclient [23:20:41] RECOVERY - configured eth on ms-be2017 is OK: OK - interfaces up [23:20:41] RECOVERY - very high load average likely xfs on ms-be2017 is OK: OK - load average: 27.17, 28.45, 27.39 https://wikitech.wikimedia.org/wiki/Swift [23:20:49] RECOVERY - Disk space on ms-be2017 is OK: DISK OK [23:20:55] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2017 is OK: OK ferm input default policy is set [23:20:57] RECOVERY - swift-account-replicator on ms-be2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [23:20:59] RECOVERY - swift-container-updater on ms-be2017 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [23:22:47] (CR) Andrew Bogott: "Compiler diffs:" [puppet] - https://gerrit.wikimedia.org/r/500622 (owner: Andrew Bogott) [23:22:47] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] 
https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [23:23:54] ping bblack again or herron [23:24:59] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures [23:27:11] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:27:29] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul) ` papaul@asw-b-codfw> show interfaces ge-8/0/11 descriptions Interface Admin Link Description ge-8/0/11... [23:27:49] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul) [23:28:17] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul) [23:28:29] !log restart kafka on kafka-jumbo1001 [23:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:03] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:36:21] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:36:53] !log restart kafka on kafka-jumbo1002 [23:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties [23:41:25] PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:42:21] (PS1) Papaul: DNS: Remove mgmt and production DNS for cloudnet2001-dev [dns] - https://gerrit.wikimedia.org/r/500634 [23:43:59] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties [23:43:59] RECOVERY - Check systemd state on kafka-jumbo1002 is OK: OK - running: The system is fully operational [23:45:55] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:46:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 56 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [23:46:35] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational [23:46:35] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (Papaul) [23:47:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 122 ge 10 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [23:47:27] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 149.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [23:47:37] !log restarting kafka on kafka-jumbo1003 [23:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:07] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:48:07] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties [23:49:11] Operations, ops-codfw, decommission, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (Papaul) [23:51:49] (PS1) Bstorm: cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere [puppet] - https://gerrit.wikimedia.org/r/500635 (https://phabricator.wikimedia.org/T209527) [23:51:59] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 80 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [23:53:15] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties [23:53:33] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:53:39] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:54:15] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [23:54:31] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:54:49] !log restarting kafka on kafka-jumbo1004 [23:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:58:55] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:59:15] PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:59:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties