[00:00:05] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T0000). [00:03:20] !log starting phabricator upgrade to 2019-08-14/1 refs T215697 [00:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:29] T215697: Please add Brazilian Portuguese (pt-br) language to Phabricator - https://phabricator.wikimedia.org/T215697 [00:15:08] !log scs-ulsfo offline due to networking issues, rob returning tomorrow with fix T230077 [00:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:16] T230077: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 [00:35:07] PROBLEM - ElasticSearch shard size check - 9643 on search.svc.codfw.wmnet is CRITICAL: usage: check_elasticsearch_shard_size.py [-h] [--url URL] [--timeout SECONDS] https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [00:37:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Cmjohnson) Submitted the ticket with Dell. We will see what happens You have successfully submitted request SR996138617. 
[00:47:34] (03PS1) 10Cmjohnson: Adding mgmt dns for cloudceph10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/530246 (https://phabricator.wikimedia.org/T224188) [00:48:28] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for cloudceph10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/530246 (https://phabricator.wikimedia.org/T224188) (owner: 10Cmjohnson) [00:48:33] (03PS2) 10Cmjohnson: Adding mgmt dns for cloudceph10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/530246 (https://phabricator.wikimedia.org/T224188) [00:48:35] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Adding mgmt dns for cloudceph10[1-3] [dns] - 10https://gerrit.wikimedia.org/r/530246 (https://phabricator.wikimedia.org/T224188) (owner: 10Cmjohnson) [00:51:20] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson) cloudcephosd1001 10.65.2.177 cloudcephosd1002 10.65.2.178 cloudcephosd1003 10.65.2.179 [00:51:45] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Cmjohnson) [01:59:00] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Redirect lzh.wikipedia to zh-classical.wikipedia - https://phabricator.wikimedia.org/T167513 (10Viztor) a:05Viztor→03Fomafix [02:24:47] (03PS6) 10Mathew.onipe: cloudelastic: fix monitored ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/529362 (https://phabricator.wikimedia.org/T229621) (owner: 10Jbond) [02:24:49] (03PS5) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) [02:24:52] (03PS1) 10Mathew.onipe: icinga: add the option separator for elastic shard size alerts [puppet] - 10https://gerrit.wikimedia.org/r/530256 (https://phabricator.wikimedia.org/T230366) 
[02:40:11] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: usage: check_elasticsearch_shard_size.py [-h] [--url URL] [--timeout SECONDS] Mathew.onipe Missed the arg separator(!) initially. Fix is here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/530256 - The acknowledgement expires at: 2019-08-15 12:39:05. https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [02:40:11] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9643 on search.svc.codfw.wmnet is CRITICAL: usage: check_elasticsearch_shard_size.py [-h] [--url URL] [--timeout SECONDS] Mathew.onipe Missed the arg separator(!) initially. Fix is here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/530256 - The acknowledgement expires at: 2019-08-15 12:39:05. https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [02:40:11] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9443 on search.svc.eqiad.wmnet is CRITICAL: usage: check_elasticsearch_shard_size.py [-h] [--url URL] [--timeout SECONDS] Mathew.onipe Missed the arg separator(!) initially. Fix is here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/530256 - The acknowledgement expires at: 2019-08-15 12:39:05. https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [02:40:11] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9643 on search.svc.eqiad.wmnet is CRITICAL: usage: check_elasticsearch_shard_size.py [-h] [--url URL] [--timeout SECONDS] Mathew.onipe Missed the arg separator(!) initially. Fix is here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/530256 - The acknowledgement expires at: 2019-08-15 12:39:05. 
https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [03:50:35] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_upload site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:53:31] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 131.6 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [03:53:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:36:50] (03PS4) 10Smalyshev: Add L and M to allowed statement starts [puppet] - 10https://gerrit.wikimedia.org/r/526755 [05:55:31] (03PS1) 10Smalyshev: Make wdqs1009 regular host again [puppet] - 10https://gerrit.wikimedia.org/r/530260 (https://phabricator.wikimedia.org/T230244) [05:59:21] (03PS1) 10Smalyshev: Restore autodeploy on wdq1009 [puppet] - 10https://gerrit.wikimedia.org/r/530261 (https://phabricator.wikimedia.org/T230244) [06:24:08] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: blazegraph journal on wdqs1005 has doubled in space - https://phabricator.wikimedia.org/T229876 (10Smalyshev) 05Open→03Resolved [06:29:53] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:03] PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [06:30:42] SMalyshev: ^ [06:31:29] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:13] RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [06:33:53] PROBLEM - High lag on wdqs1009 is CRITICAL: 9.386e+04 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [06:34:51] I see that wdqs-updater.service is marked as failed on wdqs1009 [06:36:17] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:07] https://phabricator.wikimedia.org/P8915 [06:37:51] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:00] !log wdqs1009: restart wdqs-updater.service [06:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:32] so yeah restarting wdqs-updater did the trick, there's still a "High lag" icinga critical though for wdqs1009 [06:41:11] ema: thanks! [06:41:37] I thought Stas was around. 
I'm currently commuting so can't do much [06:41:51] I'm looking into it [06:42:05] high lag is normal, it'll catch up [06:42:12] ema: ^^ [06:42:24] I'll ack it [06:42:38] SMalyshev: thanks [06:42:56] ACKNOWLEDGEMENT - High lag on wdqs1009 is CRITICAL: 9.404e+04 ge 3600 Stas Malychev After tests, will catch up https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [06:50:29] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:29:55] (03Abandoned) 10Ema: logstash: add TLS support via profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/524527 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [07:31:25] (03PS1) 10Ema: Revert "Revert "ATS: enable compress plugin on cp5002"" [puppet] - 10https://gerrit.wikimedia.org/r/530326 (https://phabricator.wikimedia.org/T227432) [07:31:59] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:32:04] (03CR) 10Ema: [C: 03+2] Revert "Revert "ATS: enable compress plugin on cp5002"" [puppet] - 10https://gerrit.wikimedia.org/r/530326 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [07:35:06] !log cp5002: ats-backend-restart to enable compress plugin [07:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:33] (03CR) 10Urbanecm: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530088 (https://phabricator.wikimedia.org/T230313) (owner: 10Ammarpad) [08:18:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM thanks for working on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/530175 (owner: 10Jhedden) [08:19:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. At some point we would like to delete the /kubeadm/ namespace as well." [puppet] - 10https://gerrit.wikimedia.org/r/530186 (https://phabricator.wikimedia.org/T229009) (owner: 10Bstorm) [08:25:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10aborrero) >>! In T229871#5414444, @Phamhi wrote: > I managed to bypass that issue by running > > ` > sudo wmf-auto-reimage-host... [08:42:03] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [08:43:02] (03PS2) 10Gehel: icinga: add the option separator for elastic shard size alerts [puppet] - 10https://gerrit.wikimedia.org/r/530256 (https://phabricator.wikimedia.org/T230366) (owner: 10Mathew.onipe) [08:43:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [08:43:56] (03CR) 10Gehel: [C: 03+2] icinga: add the option separator for elastic shard size alerts [puppet] - 10https://gerrit.wikimedia.org/r/530256 (https://phabricator.wikimedia.org/T230366) (owner: 10Mathew.onipe) [08:48:19] (03Abandoned) 10Ema: tlsproxy::instance: use lookup() instead of hiera() [puppet] - 10https://gerrit.wikimedia.org/r/524190 (owner: 10Ema) [08:48:38] (03Abandoned) 10Ema: cp1008: move hiera settings to cache::canary role [puppet] - 10https://gerrit.wikimedia.org/r/476005 (owner: 10Ema) [08:49:15] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Andrew) Tentative transition plan A: [] Move all VMs to the in-cloud puppetmasters (T171188) [] 
Create a new set... [08:49:19] (03Abandoned) 10Ema: site: make cp1099 the new pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/459989 (https://phabricator.wikimedia.org/T202966) (owner: 10Ema) [08:49:45] (03Abandoned) 10Ema: cache_canary LVS service [puppet] - 10https://gerrit.wikimedia.org/r/480728 (https://phabricator.wikimedia.org/T202966) (owner: 10Ema) [08:52:14] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [08:52:17] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [08:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:01] RECOVERY - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [08:54:44] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [08:54:48] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [08:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:52] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Andrew) [09:01:23] (03PS1) 10Arturo Borrero Gonzalez: openstack: cleanup nova-network version of nova [puppet] - 10https://gerrit.wikimedia.org/r/530332 (https://phabricator.wikimedia.org/T220051) [09:02:29] 10Operations, 10Traffic, 10media-storage, 10Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (10ema) See T188831 for the `application/x-www-form-urlencoded` variation of 
this. [09:03:39] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:03:51] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:04:05] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:04:07] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:19] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:23] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:06:19] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:06:31] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:07:07] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for 
deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10Jdforrester-WMF) Is https://gerrit.wikimedia.org/r/c/maps/kartotherian/package/+/510456 meant to be in mediawiki... [09:14:53] (03CR) 10Arturo Borrero Gonzalez: "CC @jhedden this probably conflicts with your patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/530175" [puppet] - 10https://gerrit.wikimedia.org/r/530332 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [09:38:29] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Andrew) For this plan to work at all, we'd have to ensure that there's nothing in the 'core' catalog that actively... [09:41:07] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Krenair) We need to be very careful about File purge => true resources appearing in the core catalog that can be h... 
[09:41:37] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:51:10] (03PS1) 10Jon Harald Søby: Add more import sources for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530337 (https://phabricator.wikimedia.org/T230533) [09:55:15] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 109.4 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [09:56:10] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [10:02:06] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [10:03:13] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [10:09:53] (03PS1) 10Ema: VCL: workaround for images delivered with CT:x-www-form-urlencoded [puppet] - 10https://gerrit.wikimedia.org/r/530338 (https://phabricator.wikimedia.org/T162035) [10:17:32] (03PS2) 10Ema: VCL: workaround for images delivered with CT:x-www-form-urlencoded [puppet] - 10https://gerrit.wikimedia.org/r/530338 (https://phabricator.wikimedia.org/T162035) [10:27:11] Jhs: hi. None of the imports you're doing on nap.wikisource display when clicked.
[10:27:36] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [10:27:56] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [10:30:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/530338 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [10:32:13] hauskatze, yeah, i noticed it myself :\ [10:32:32] Jhs: I'm not sure if it's some issue in the storage [10:32:39] I don't know [10:32:52] Will you file a task? (I don't wanna duplicate) [10:33:00] cc Reedy as well [10:33:01] (03CR) 10Ema: [C: 03+2] VCL: workaround for images delivered with CT:x-www-form-urlencoded [puppet] - 10https://gerrit.wikimedia.org/r/530338 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [10:34:05] Jhs: not even history: https://nap.wikisource.org/w/index.php?title=Paggena:%27E_Lluce-luce.djvu/8&action=history [10:34:07] weird [10:35:09] (03PS1) 10Alex Monk: cloud: Switch encapi calls to new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530340 (https://phabricator.wikimedia.org/T171188) [10:35:55] (03PS1) 10Andrew Bogott: cloud recursors: alias 'puppet' to the new in-labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530341 (https://phabricator.wikimedia.org/T171188) [10:36:21] hauskatze, and I know exactly why! I feel like a genius [10:36:37] Jhs: oh, share? :) [10:36:40] hauskatze, it's because the Page and Index namespaces are defined twice [10:36:40] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [10:36:49] oh, crap [10:36:51] both in InitialiseSettings.php and in the ProofreadPage extension [10:36:59] god...
[10:37:07] they should be removed from IS [10:37:08] it's the Welsh thing all over again [10:37:11] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [10:37:22] why people *insist* on adding them there? [10:37:50] hauskatze, do you remember how the Welsh situation was fixed? [10:38:05] Jhs: nope [10:38:08] do you? [10:38:17] (03CR) 10Jforrester: [C: 03+2] Add more import sources for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530337 (https://phabricator.wikimedia.org/T230533) (owner: 10Jon Harald Søby) [10:38:24] My guess would be remove the namespaces from IS, run namespacedupes [10:38:48] James_F might be able to do the IS.php removal now that he's deploying ;) [10:39:05] or have a better understanding of the situation [10:39:42] Oh, meh. [10:39:48] Config sucks. [10:40:36] James_F: how's in British English, localise or localice? [10:40:45] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [10:40:52] ie: Localize Scribunto strings for xx.wiki [10:40:55] hauskatze: Localise. [10:41:01] Using that. [10:41:03] thanks. [10:41:10] But you don't localise for a wiki, you localise for a language. [10:41:36] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [10:41:42] hauskatze: Can you file a task? I'm not sure I can just fix it right now, sorry. [10:42:06] James_F: If Jhs is not filing it already, sure. 
[10:42:24] I think it'd be removing the faulty namespaces and running namespacedupes [10:42:33] they already exist on ProofreadPage [10:42:48] (03PS1) 10Jon Harald Søby: Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) [10:43:03] But not localised? Why do we define these namespaces for all the other wikisources? [10:43:09] That's why they're being cargo-culted. [10:43:22] jouncebot, next [10:43:22] In 0 hour(s) and 16 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1100) [10:43:49] Urbanecm: I stole Jhs's patch and am deploying right now. ;-) [10:44:01] (03PS2) 10Alex Monk: cloud: Switch encapi calls to new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530340 (https://phabricator.wikimedia.org/T171188) [10:44:36] (03CR) 10Jforrester: [C: 03+2] Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) (owner: 10Jon Harald Søby) [10:44:44] Oh, I see. [10:44:47] James_F, yay :) [10:44:50] Ok James_F :) [10:45:01] Are the namespace localisations for nap installed in the ProofreadPage extension? [10:45:07] I wanted to do the same thing, but then I realized a window is in a few minutes, so... [10:45:14] I vaguely remember so, let me double-check [10:45:18] Urbanecm: SWAT schmat. [10:45:44] I mean, there are two "Paggena" namespaces in Special:Search, so the dupes have to come from there [10:45:48] I don't know what that means, but...
[10:47:21] (03Merged) 10jenkins-bot: Add more import sources for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530337 (https://phabricator.wikimedia.org/T230533) (owner: 10Jon Harald Søby) [10:48:00] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10ema) [10:48:05] 10Operations, 10Cloud-Services, 10DBA: Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (10MarcoAurelio) [10:49:03] Finally. [10:49:51] James_F, confirming that they are in the extension: https://github.com/wikimedia/mediawiki-extensions-ProofreadPage/blob/3e298ff5c002e1c628bb825d17e797fb39cc2971/ProofreadPage.namespaces.php [10:49:53] Jhs: I remember having done the ProofreadPage thing [10:50:02] Cool. [10:50:10] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10fireattack) It is fixed here for https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/... 
[10:50:25] 10Operations, 10Wikimedia-Mailing-lists: Create central notice admins mailing list - https://phabricator.wikimedia.org/T230544 (10Urbanecm) [10:50:33] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T230533: Add more import sources for napwikisource (duration: 00m 52s) [10:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:41] T230533: Enable more import sources for napwikisource - https://phabricator.wikimedia.org/T230533 [10:50:45] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10Ciencia_Al_Poder) Issue solved for the examples provided [10:50:55] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [10:51:28] (03PS5) 10Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) [10:57:21] James_F: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/530342/ is failing to merge [10:57:53] (03PS2) 10Jforrester: Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) (owner: 10Jon Harald Søby) [10:58:01] * James_F sighs expressively at gerrit. [10:59:10] James_F, looking at the config change for Welsh a while back it seems we need to add something to wgProofreadPageNamespaceIds as well [10:59:18] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/394189/4/wmf-config/InitialiseSettings.php [10:59:58] Ah, hmm. Yeah, want to mmend? [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. 
It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1100). [11:00:04] Jhs: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:08] Also, amend. [11:00:13] fsk [11:00:32] James_F, sure, but i'm not 100 % sure which numbers to use [11:00:35] MW needs to be told about the legacy namespaces yeah :| [11:00:53] James_F, Jhs: Is there anything else to do for the window? [11:00:59] (that's already officially scheduled) [11:01:08] (03PS1) 10Alex Monk: cloud: Change monitoring things to look at new pupeptmaster [puppet] - 10https://gerrit.wikimedia.org/r/530344 [11:01:16] 104 and 106. [11:01:33] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [11:01:44] Jhs: 106 for page and 104 for index [11:03:12] hauskatze, sure it's not the other way around? 
Page is usually the lower number [11:03:56] (03PS1) 10Ema: VCL: update 01-basic-caching.vtc to expect 421 [puppet] - 10https://gerrit.wikimedia.org/r/530345 (https://phabricator.wikimedia.org/T207340) [11:04:36] (03PS3) 10Jon Harald Søby: Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) [11:04:48] Jhs: the namespaces we removed were numbered 104 for index and 106 for page [11:04:59] so I guess we should respect that order [11:05:09] ok, yeah, makes sense [11:05:14] James_F, uploaded the new patch [11:05:20] mwdebug is our friend though [11:05:29] (03PS5) 10Urbanecm: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083) (owner: 10DannyS712) [11:05:37] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083) (owner: 10DannyS712) [11:05:42] I'm handing over to Urbanecm. :-) [11:05:49] (And getting some lunch.) [11:05:53] Leaving now for the cumbersome task of face-shaving. [11:06:23] (03PS6) 10Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) [11:06:54] (03CR) 10Jbond: [C: 03+1] "LGTM, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/530230 (owner: 10MarcoAurelio) [11:06:57] Jhs, are you sure it should be 106 for page and 104 for index? It's the other way around for mrwikisource for instance [11:07:53] Urbanecm: we have doubts about that, yes. The dupe namespaces removed were numbered the other way around. [11:08:03] ah yes, I see that now [11:08:06] Not sure if we should keep the old numbers or use the new ones. [11:08:15] Urbanecm, yeah, some other wikisources have them the opposite way as well.
seems there is no standard [11:08:25] okay then [11:08:26] "Config. sucks." (c) J. D. Forrester, 2019. [11:08:50] (03Abandoned) 10Jbond: cloudelastic: fix monitored ip addresses [puppet] - 10https://gerrit.wikimedia.org/r/529362 (https://phabricator.wikimedia.org/T229621) (owner: 10Jbond) [11:08:57] (03CR) 10Urbanecm: [C: 03+2] Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) (owner: 10Jon Harald Søby) [11:09:02] okay, makes sense [11:09:15] let's try that, we can always upload a new patch if needed [11:09:52] (Y) [11:10:01] let me know when I should check on mwdebug [11:10:21] will do! [11:10:21] (03Merged) 10jenkins-bot: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083) (owner: 10DannyS712) [11:11:57] (03PS4) 10Urbanecm: Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) (owner: 10Jon Harald Søby) [11:12:04] (03CR) 10Urbanecm: [C: 03+2] Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) (owner: 10Jon Harald Søby) [11:12:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0d8c516: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains (T230083) (duration: 00m 48s) [11:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:15] T230083: Add Hubblesite.org and Spacetelescope.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T230083 [11:12:37] (03CR) 10Urbanecm: [C: 03+2] Add new throttle rule for cawiki editathon [mediawiki-config] - 
https://gerrit.wikimedia.org/r/530088 (https://phabricator.wikimedia.org/T230313) (owner: Ammarpad) [11:13:30] (PS7) Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) [11:13:46] (CR) Urbanecm: [V: +2 C: +2] Remove duplicate namespace definitions for napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) (owner: Jon Harald Søby) [11:14:16] Jhs, okay, ready to test on mwdebug1002 [11:15:47] Urbanecm, now the duped namespaces are gone from Special:Search, but the imports I made are now titled "Special:BadTitle/NS250" [11:16:18] hmm [11:16:21] So should we try to change wgProofreadPageNamespace to 250? [11:16:51] let me have a look [11:16:58] (Merged) jenkins-bot: Add new throttle rule for cawiki editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/530088 (https://phabricator.wikimedia.org/T230313) (owner: Ammarpad) [11:18:18] Jhs, from where did you import? [11:18:44] Operations, Elasticsearch, Traffic, Discovery-Search (Current work), Patch-For-Review: Icinga check defined from LVS configuration for cloudelastic are borked - https://phabricator.wikimedia.org/T229621 (jbond) When i started I believe someone from OIT created an [[https://office.wikimedia.or... [11:19:09] Urbanecm, oldwikisource [11:19:19] where Page is 104 and index 106 [11:19:24] ok [11:19:39] and for some weird reason, the pages are in ns 250 in napwikisource [11:20:01] oh, that's why... 
[11:20:16] wgProofreadPageNamespaceIds has 250 as a default for page [11:20:26] (CR) jenkins-bot: Add more import sources for napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/530337 (https://phabricator.wikimedia.org/T230533) (owner: Jon Harald Søby) [11:20:35] (PS8) Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a hhvm-fatal-error.php.erb [puppet] - https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) [11:20:43] Jhs, sorry for not watching the previous conversation, but why did we add an override for wgProofreadPageNamespaceIds? [11:21:08] we have plenty of pages in NS 250 and 252, which are the defaults [11:21:34] it shouldn't be hard to move the pages from NS 250 to NS whatever, but... [11:21:37] ...do we need to do that? [11:21:39] Jhs, ^^ [11:21:44] Urbanecm, because that was what they did when the same problem was in Welsh wikisource. But we can just remove it if you think that's better [11:22:04] I guess it'll help [11:22:13] I'll change the patch again [11:22:14] let me try that locally on the deployment server, so we don't have jenkins waiting [11:22:18] (it's already merged) [11:22:21] unless you can just use the previous patch [11:22:22] oh yeah [11:22:30] I'll change the mwdebug state of things [11:22:35] thx [11:23:31] (CR) jenkins-bot: Fix addition of Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/530015 (https://phabricator.wikimedia.org/T230083) (owner: DannyS712) [11:23:38] Jhs, okay, removed napwikisource from wgProofreadPageNamespaceIds on mwdebug1002 [11:23:40] can you check again? [11:24:11] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:24:14] Urbanecm, yes, you're a genius! It's working properly now [11:24:19] wonderful! 
[11:24:24] going to commit&sync that then [11:24:32] awesome [11:25:08] (03PS1) 10Urbanecm: Remove napwikisource from wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530346 (https://phabricator.wikimedia.org/T230541) [11:25:20] (03CR) 10Urbanecm: [C: 03+2] Remove napwikisource from wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530346 (https://phabricator.wikimedia.org/T230541) (owner: 10Urbanecm) [11:26:30] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove napwikisource from wgProofreadPageNamespaceIds (T230541) (duration: 00m 47s) [11:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:38] T230541: Remove duplicate namespace definition for napwikisource - https://phabricator.wikimedia.org/T230541 [11:26:48] (03Merged) 10jenkins-bot: Remove napwikisource from wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530346 (https://phabricator.wikimedia.org/T230541) (owner: 10Urbanecm) [11:27:27] Urbanecm hey! Would you have spare time for a deployment in about 10 minutes? [11:27:34] certainly Daimona! [11:27:50] Hooray! I'll be back then, thanks :) [11:27:59] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:29:07] Urbanecm, everything looks ok now even without mwdebug. 
I'll go to lunch then unless there's anything else needed :) [11:29:09] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 377cc53: Add new throttle rule for cawiki editathon (T230313) (duration: 00m 47s) [11:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:17] T230313: Request: temporary lift of IP cap in cawiki on 2019-08-22 - https://phabricator.wikimedia.org/T230313 [11:29:31] except closing the task if it's working, I'd consider that done Jhs :) [11:29:33] thanks for the patch [11:30:18] Urbanecm, thank you too :) [11:30:23] happy to help! [11:33:42] (03CR) 10jenkins-bot: Remove duplicate namespace definitions for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530342 (https://phabricator.wikimedia.org/T230541) (owner: 10Jon Harald Søby) [11:33:44] (03CR) 10jenkins-bot: Add new throttle rule for cawiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530088 (https://phabricator.wikimedia.org/T230313) (owner: 10Ammarpad) [11:33:46] (03CR) 10jenkins-bot: Remove napwikisource from wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530346 (https://phabricator.wikimedia.org/T230541) (owner: 10Urbanecm) [11:33:49] (03PS3) 10Urbanecm: Add Portal namespace on zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529600 (https://phabricator.wikimedia.org/T230294) (owner: 10Zoranzoki21) [11:33:53] (03CR) 10BBlack: [C: 03+1] VCL: update 01-basic-caching.vtc to expect 421 [puppet] - 10https://gerrit.wikimedia.org/r/530345 (https://phabricator.wikimedia.org/T207340) (owner: 10Ema) [11:33:55] (03CR) 10Urbanecm: [C: 03+2] Add Portal namespace on zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529600 (https://phabricator.wikimedia.org/T230294) (owner: 10Zoranzoki21) [11:34:55] (03Merged) 10jenkins-bot: Add Portal namespace on zhwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529600 
(https://phabricator.wikimedia.org/T230294) (owner: Zoranzoki21) [11:36:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: fe9b6ed: Add Portal namespace on zhwikisource (T230294) (duration: 00m 47s) [11:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:47] T230294: Add Portal namespace on Chinese Wikisource - https://phabricator.wikimedia.org/T230294 [11:37:53] !log Run mwscript namespaceDupes.php --wiki=zhwikisource --add-prefix="FIXME" --fix (T230294) [11:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:13] (CR) jenkins-bot: Add Portal namespace on zhwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/529600 (https://phabricator.wikimedia.org/T230294) (owner: Zoranzoki21) [11:40:20] (PS1) Jbond: idp: enable mapped ipv6 address [puppet] - https://gerrit.wikimedia.org/r/530348 [11:40:34] Urbanecm: I'm ready [11:40:41] I'm ready too! [11:40:50] could you link me to the patch(es), please? [11:41:00] Daimona, ^ [11:41:32] Yep [11:41:57] First one is https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/475772/ [11:42:04] Can you please take a look in the meantime? [11:42:12] Certainly [11:42:34] (CR) Jbond: [C: +2] idp: enable mapped ipv6 address [puppet] - https://gerrit.wikimedia.org/r/530348 (owner: Jbond) [11:42:41] (Note my PC is a bit angry these days and I could disappear suddenly :D) [11:43:19] ok [11:43:34] the config patch looks good to me Daimona [11:43:53] should we first deploy the config patch, and then the AF one, or the other way around? 
[11:43:53] OK, so that's the first one [11:43:59] Yep, config first [11:44:01] ok [11:44:04] Then the AF backport is https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/530349/ [11:44:08] (PS31) Urbanecm: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [11:44:14] (CR) Urbanecm: [C: +2] Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [11:44:23] +2'ed the config then [11:44:29] Ty [11:44:55] TBH I'm not 100% sure that there won't be discontinuities, either this way or the other way around, but as long as they're deployed consecutively there shouldn't be any problem [11:45:14] and +2'ed the backport too [11:45:20] (Merged) jenkins-bot: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [11:46:32] (CR) jenkins-bot: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy) [11:46:47] Daimona, could you remove V-1 from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/530349/ please? [11:46:50] I can't do that for some reason [11:47:06] Done [11:47:21] thanks, hopefully, +2 will trigger the jobs then [11:47:33] It's probably complaining because the config patch wasn't rebased [11:47:51] "Patch Set 31: Patch Set 30 was rebased", it was? 
[11:48:04] I can't rebase the AF patch, since it's up to date [11:48:08] Hah, exactly [11:48:09] but hopefully, it will all merge now [11:48:18] It was rebased after creating the backport, it's kinda common [11:48:35] got that [11:48:42] When you have changes A and B, where B depends on A, if you rebase A while CI is running for B, B will fail [11:48:54] makes sense [11:49:40] Daimona, I guess I should wait with deploying the config patch until the AF patch gets merged, so they can be deployed more to each other? [11:49:59] Yes [11:50:04] That'd be super-good [11:50:10] ok [11:50:30] I just hope zuul is kidding with that "30 min" :) [11:50:57] the test job I started with the recheck command says "just" 19 mins [11:51:04] https://xkcd.com/612/ [11:51:07] only wmf-quibble-vendor-mysql-hhvm-docker is running [11:51:47] I was looking at gate-and-submit-swat [11:52:30] yes, there's also something running in test [11:52:42] and...there's nothing running for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/468696 [11:52:47] Daimona, ^ [11:53:12] Meh [11:53:19] I'll think about the master change later [11:53:30] To avoid clogging CI [11:53:55] gate-and-submit-swat **should** have higher priority than gate-and-submit, but makes sense [12:00:23] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:00:27] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [12:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:03] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [12:01:08] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [12:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:22] !log EU SWAT is going a few minutes out of its window [12:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:09] (03PS9) 10Catrope: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [12:07:45] (03PS1) 10Jbond: idp: move service to port 443 and open firewall [puppet] - 10https://gerrit.wikimedia.org/r/530353 [12:07:53] ok Daimona, it's merged! [12:08:02] Hah, finally [12:08:34] Daimona, can you check on mwdebug1002, please? [12:08:39] Yep [12:08:42] thanks [12:08:44] Ping me when both patches are there [12:08:49] that's right now :) [12:08:59] Daimona, [12:09:24] Testing [12:09:55] thanks [12:10:06] (03CR) 10Jbond: [C: 03+2] idp: move service to port 443 and open firewall [puppet] - 10https://gerrit.wikimedia.org/r/530353 (owner: 10Jbond) [12:12:01] Should be good AFAICS [12:12:13] Although I cannot obviously test on all wikis :) [12:12:31] (03PS1) 10Jbond: apereo_cas: use unix path for the prefix [puppet] - 10https://gerrit.wikimedia.org/r/530354 [12:13:06] a few wikis is good Daimona >] [12:13:09] syncing! 
[12:13:19] (03CR) 10Jbond: [C: 03+2] apereo_cas: use unix path for the prefix [puppet] - 10https://gerrit.wikimedia.org/r/530354 (owner: 10Jbond) [12:13:22] Cool, ty [12:14:47] !log urbanecm@deploy1001 Synchronized wmf-config/: SWAT: 7e95f6d: Update AbuseFilter config to keep the status quo (T191740, T200032, T226987) (duration: 00m 49s) [12:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:57] T200032: Some wikis have block enabled but don't assign the abusefilter-modify-restricted right to anyone - https://phabricator.wikimedia.org/T200032 [12:14:57] T226987: Missing abusefilter-log-private right in $wgGrantPermissions - https://phabricator.wikimedia.org/T226987 [12:14:58] T191740: Bundle AbuseFilter extension with MediaWiki - https://phabricator.wikimedia.org/T191740 [12:16:11] (03CR) 10Urbanecm: [C: 03+2] Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [12:16:18] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.17/extensions/AbuseFilter/extension.json: SWAT: e9422c5: Rearrange config to provide better experience (T191740, T200032, T226987) (duration: 00m 47s) [12:16:23] Daimona, should be done! 
[12:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:06] Nice, thanks again [12:17:09] (03Merged) 10jenkins-bot: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [12:17:23] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:17:24] (03CR) 10jenkins-bot: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [12:19:11] happy to help Daimona ! [12:19:17] (03PS2) 10Ema: VCL: update 01-basic-caching.vtc to expect 421 [puppet] - 10https://gerrit.wikimedia.org/r/530345 (https://phabricator.wikimedia.org/T207340) [12:19:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: d036388: Increase default thumb size to 260px on Dutch Wikipedia (T215106) (duration: 00m 48s) [12:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:57] T215106: Enlarging the default thumb size on Dutch Wikipedia - https://phabricator.wikimedia.org/T215106 [12:20:08] (03CR) 10Ema: [C: 03+2] VCL: update 01-basic-caching.vtc to expect 421 [puppet] - 10https://gerrit.wikimedia.org/r/530345 (https://phabricator.wikimedia.org/T207340) (owner: 10Ema) [12:21:17] !log EU SWAT done [12:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:57] (03PS1) 10Jbond: idp: add port paramater [puppet] - 10https://gerrit.wikimedia.org/r/530356 [12:27:08] (03CR) 10Reedy: [C: 04-1] "Although we don't enforce PHPCS on this repo (yet).. 
It'd be nice to at least have the style consistent :)" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [12:35:35] (03CR) 10Jbond: [C: 03+2] idp: add port paramater [puppet] - 10https://gerrit.wikimedia.org/r/530356 (owner: 10Jbond) [12:49:59] (03PS1) 10Ema: secret: dummy key for grafana [labs/private] - 10https://gerrit.wikimedia.org/r/530361 (https://phabricator.wikimedia.org/T210411) [12:51:40] (03PS1) 10Reedy: Remove some PHPCS exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [12:52:52] (03CR) 10Krinkle: [C: 03+1] Remove some PHPCS exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [12:53:34] (03CR) 10jerkins-bot: [V: 04-1] Remove some PHPCS exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [12:53:55] (03PS2) 10Reedy: Remove some PHPCS exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [12:55:00] (03PS1) 10Ema: grafana: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/530363 (https://phabricator.wikimedia.org/T210411) [12:55:04] (03CR) 10jerkins-bot: [V: 04-1] Remove some PHPCS exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [12:55:08] (03PS1) 10Urbanecm: Revert "Set account create throttle to 2 everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530364 (https://phabricator.wikimedia.org/T230521) [12:55:13] jouncebot, now [12:55:13] No deployments scheduled for the next 3 hour(s) and 4 minute(s) [12:55:15] jouncebot, next [12:55:15] In 3 hour(s) and 4 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1600) [12:56:07] (03CR) 10Urbanecm: [C: 03+2] Revert "Set account create throttle to 2 everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530364 (https://phabricator.wikimedia.org/T230521) (owner: 10Urbanecm) [12:56:26] 
!log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove account creation restrictions (T230304, T230521) (duration: 00m 48s) [12:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:36] T230521: Users are unable to create more than 2 accounts per day - https://phabricator.wikimedia.org/T230521 [12:57:11] (03Merged) 10jenkins-bot: Revert "Set account create throttle to 2 everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530364 (https://phabricator.wikimedia.org/T230521) (owner: 10Urbanecm) [12:57:17] (03PS1) 10Jbond: idp: add prometheous and htst parameters [puppet] - 10https://gerrit.wikimedia.org/r/530365 [12:58:15] (03PS1) 10Ema: Add TLS termination for grafana [puppet] - 10https://gerrit.wikimedia.org/r/530367 (https://phabricator.wikimedia.org/T210411) [12:58:47] (03CR) 10Jbond: [C: 03+2] idp: add prometheous and htst parameters [puppet] - 10https://gerrit.wikimedia.org/r/530365 (owner: 10Jbond) [13:00:33] (03CR) 10jenkins-bot: Revert "Set account create throttle to 2 everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530364 (https://phabricator.wikimedia.org/T230521) (owner: 10Urbanecm) [13:02:15] (03PS3) 10Reedy: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [13:02:29] (03PS2) 10Ema: grafana: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/530363 (https://phabricator.wikimedia.org/T210411) [13:02:32] (03PS2) 10Ema: Add TLS termination for grafana [puppet] - 10https://gerrit.wikimedia.org/r/530367 (https://phabricator.wikimedia.org/T210411) [13:03:31] (03PS1) 10Reedy: Remove duplicate exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530368 [13:03:33] (03CR) 10jerkins-bot: [V: 04-1] Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [13:03:37] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for grafana [labs/private] - 
10https://gerrit.wikimedia.org/r/530361 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:05:05] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/17910/" [puppet] - 10https://gerrit.wikimedia.org/r/530367 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:05:12] (03CR) 10Ema: [C: 03+2] grafana: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/530363 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:05:25] (03CR) 10Ema: [C: 03+2] Add TLS termination for grafana [puppet] - 10https://gerrit.wikimedia.org/r/530367 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [13:06:14] (03CR) 10CDanis: [C: 03+1] swift: stop monitoring individual daemons [puppet] - 10https://gerrit.wikimedia.org/r/530080 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [13:07:16] (03CR) 10Reedy: [C: 03+1] Remove duplicate exclusions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530368 (owner: 10Reedy) [13:07:45] (03CR) 10CDanis: [C: 03+1] "yes please!" [puppet] - 10https://gerrit.wikimedia.org/r/530098 (https://phabricator.wikimedia.org/T230413) (owner: 10Filippo Giunchedi) [13:11:05] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10Wang_Qiliang) @ema not fixed for https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa... 
[13:11:09] (03CR) 10CDanis: mediawiki: add cluster latency alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529923 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:13:24] (03PS1) 10Ema: ATS: use TLS for grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/530370 (https://phabricator.wikimedia.org/T210411) [13:14:13] (03PS4) 10Reedy: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [13:14:27] (03PS2) 10Reedy: Remove duplicate exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530368 [13:14:32] (03CR) 10Reedy: [C: 03+2] Remove duplicate exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530368 (owner: 10Reedy) [13:14:37] (03PS1) 10Alex Monk: cloud: Move instances to use new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530371 [13:15:10] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10ema) >>! In T188831#5416179, @Wang_Qiliang wrote: > @ema not fixed for https://upload.wiki... [13:15:39] (03CR) 10jerkins-bot: [V: 04-1] Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [13:16:00] (03Merged) 10jenkins-bot: Remove duplicate exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530368 (owner: 10Reedy) [13:16:27] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [13:16:35] (03CR) 10jenkins-bot: Remove duplicate exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530368 (owner: 10Reedy) [13:17:02] (03CR) 10Andrew Bogott: [C: 03+1] "Looks good; waiting to merge as part of a planned migration window." 
[puppet] - 10https://gerrit.wikimedia.org/r/530340 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [13:17:05] !log reedy@deploy1001 Synchronized phpcs.xml: remove excess lines (duration: 00m 46s) [13:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:49] (03PS5) 10Reedy: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [13:18:59] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10Vort) @Wang_Qiliang I don't see `application/x-www-form-urlencoded` there. The only noticab... [13:19:36] (03CR) 10Andrew Bogott: [C: 03+1] "Looks good; waiting to merge as part of a planned migration window." [puppet] - 10https://gerrit.wikimedia.org/r/530344 (owner: 10Alex Monk) [13:19:49] (03CR) 10jerkins-bot: [V: 04-1] Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [13:26:47] (03PS6) 10Reedy: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [13:28:34] (03CR) 10jerkins-bot: [V: 04-1] Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [13:30:40] (03CR) 10Alex Monk: [C: 04-1] "Andrew just pointed out that this will set the CA key even for clients of custom puppetmasters, which is sure to break things. We can prob" [puppet] - 10https://gerrit.wikimedia.org/r/530371 (owner: 10Alex Monk) [13:31:43] (03PS7) 10Reedy: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [13:43:39] (03PS8) 10Reedy: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 [13:46:49] (03CR) 10Jhedden: [C: 03+1] "looks great. 
Lets merge this first and I'll rebase" [puppet] - 10https://gerrit.wikimedia.org/r/530332 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [13:48:10] (03PS2) 10Arturo Borrero Gonzalez: openstack: cleanup nova-network version of nova [puppet] - 10https://gerrit.wikimedia.org/r/530332 (https://phabricator.wikimedia.org/T220051) [14:00:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/17911/" [puppet] - 10https://gerrit.wikimedia.org/r/530332 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [14:01:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "A better PCC run: https://puppet-compiler.wmflabs.org/compiler1001/17912/" [puppet] - 10https://gerrit.wikimedia.org/r/530332 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [14:05:30] (03PS1) 10Jbond: idp: add tls proxy [puppet] - 10https://gerrit.wikimedia.org/r/530376 [14:07:19] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:10:10] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10MSantos) >>! In T223275#5415510, @Jdforrester-WMF wrote: > Is https://gerrit.wikimedia.org/r/c/maps/kartotherian... [14:11:37] (03CR) 10Jforrester: "Yay." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/529924 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [14:16:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: add tls proxy [puppet] - 10https://gerrit.wikimedia.org/r/530376 (owner: 10Jbond) [14:16:15] (03PS2) 10Jbond: idp: add tls proxy [puppet] - 10https://gerrit.wikimedia.org/r/530376 [14:16:18] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: add tls proxy [puppet] - 10https://gerrit.wikimedia.org/r/530376 (owner: 10Jbond) [14:18:20] (03PS1) 10Jbond: idp: add cert_name [puppet] - 10https://gerrit.wikimedia.org/r/530379 [14:18:57] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Gehel) [14:22:20] (03PS6) 10Jhedden: openstack: add core filter to nova scheduler [puppet] - 10https://gerrit.wikimedia.org/r/530175 [14:26:59] 10Operations, 10Traffic, 10Patch-For-Review: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 (10BBlack) [14:27:02] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10BBlack) [14:27:20] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: add cert_name [puppet] - 10https://gerrit.wikimedia.org/r/530379 (owner: 10Jbond) [14:27:59] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [14:32:47] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen [14:33:02] 
!log shutting down db2063 for maintenance [14:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:13] (03CR) 10Reedy: [C: 03+2] Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [14:37:35] PROBLEM - Host db2063.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:38:54] (03PS1) 10Jforrester: [sqwikiquote] Enable WikiLove and SandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530381 (https://phabricator.wikimedia.org/T230390) [14:39:15] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:41:08] (03CR) 10Jhedden: [C: 03+2] openstack: add core filter to nova scheduler [puppet] - 10https://gerrit.wikimedia.org/r/530175 (owner: 10Jhedden) [14:41:17] (03PS7) 10Jhedden: openstack: add core filter to nova scheduler [puppet] - 10https://gerrit.wikimedia.org/r/530175 [14:41:33] (03PS2) 10Andrew Bogott: cloud recursors: alias 'puppet' to the new in-labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/530341 (https://phabricator.wikimedia.org/T171188) [14:41:35] (03PS1) 10Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [14:41:49] !log gehel@cumin1001 START - Cookbook sre.wdqs.reboot-wdqs [14:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:17] RECOVERY - Host db2063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [14:44:30] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot-wdqs (exit_code=0) [14:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:23] !log gehel@cumin1001 START - Cookbook 
sre.wdqs.reboot-wdqs [14:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:17] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10BBlack) General status updates and planning, for this very old ticket which is still on the radar! T186550 and T228190 cover anycasting our internal recdns,... [14:48:51] (03Merged) 10jenkins-bot: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [14:49:07] (03CR) 10jenkins-bot: Tidy up some unenforced phpcs violations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530362 (owner: 10Reedy) [14:49:16] (03PS1) 10Jbond: idp: use tlsproxy::localssl directly [puppet] - 10https://gerrit.wikimedia.org/r/530384 [14:49:55] !log reedy@deploy1001 Synchronized phpcs.xml: remove exclusions (duration: 00m 49s) [14:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:03] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot-wdqs (exit_code=0) [14:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:46] !log cp5002 depool due to compress.so crash [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:53] !log gehel@cumin1001 START - Cookbook sre.wdqs.reboot-wdqs [14:50:54] !log reedy@deploy1001 Synchronized multiversion/: phpcs cleanup (duration: 00m 47s) [14:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:03] (03CR) 10Jbond: [C: 03+2] idp: use tlsproxy::localssl directly [puppet] - 10https://gerrit.wikimedia.org/r/530384 (owner: 10Jbond) [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:13] (03PS2) 10Jbond: idp: use tlsproxy::localssl directly [puppet] - 10https://gerrit.wikimedia.org/r/530384 [14:51:16] !log gehel@cumin1001 END (ERROR) - Cookbook 
sre.wdqs.reboot-wdqs (exit_code=97) [14:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:43] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:51:50] (03PS2) 10Ema: ATS: use TLS for grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/530370 (https://phabricator.wikimedia.org/T210411) [14:52:00] 10Operations, 10Wikimedia-Mailing-lists: Create central notice admins mailing list - https://phabricator.wikimedia.org/T230544 (10Urbanecm) 05Open→03Invalid It seems there is already one. [14:52:20] !log reedy@deploy1001 Synchronized wmf-config/: phpcs cleanup (duration: 00m 47s) [14:52:24] (03PS1) 10Reedy: Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [14:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] (03PS3) 10Ema: ATS: use TLS for grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/530370 (https://phabricator.wikimedia.org/T210411) [14:54:01] (03CR) 10jerkins-bot: [V: 04-1] Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [14:54:35] (03CR) 10Ema: [C: 03+2] ATS: use TLS for grafana1001 [puppet] - 10https://gerrit.wikimedia.org/r/530370 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:55:34] (03PS2) 10Reedy: Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [14:56:37] !log gehel@cumin1001 START - Cookbook sre.wdqs.reboot-wdqs [14:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:01] PROBLEM - Host pc2010 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:03] (03CR) 10jerkins-bot: [V: 04-1] Attempt to remove some more rule exclusions... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [14:58:14] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.reboot-wdqs (exit_code=97) [14:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:32] !log gehel@cumin1001 START - Cookbook sre.wdqs.reboot-wdqs [14:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:20] (03PS1) 10Jbond: acme_chief: add idp certificate [puppet] - 10https://gerrit.wikimedia.org/r/530388 [15:02:43] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Cmjohnson) @Marostegui I see a potential issue with B3 as well. I will need to do a DIMM swap A -> B side and see if the errors stay with the DIMM or are the CPU. Le... [15:02:50] (03PS3) 10Reedy: Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:03:43] PROBLEM - Host wdqs2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:49] RECOVERY - Host wdqs2004 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [15:04:23] (03CR) 10jerkins-bot: [V: 04-1] Attempt to remove some more rule exclusions... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:05:00] (03PS7) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [15:05:15] (03CR) 10Jbond: [C: 03+2] acme_chief: add idp certificate [puppet] - 10https://gerrit.wikimedia.org/r/530388 (owner: 10Jbond) [15:05:17] (03CR) 10jerkins-bot: [V: 04-1] Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [15:05:31] (03CR) 10Alex Monk: [C: 03+1] labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [15:06:39] (03PS8) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [15:06:54] (03CR) 10jerkins-bot: [V: 04-1] Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [15:07:35] (03PS4) 10Reedy: Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:08:35] (03CR) 10jerkins-bot: [V: 04-1] Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:09:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1023.mgmt.eqiad.... [15:09:27] (03PS5) 10Reedy: Attempt to remove some more rule exclusions... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:10:21] PROBLEM - Host elastic2050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:53] (03CR) 10jerkins-bot: [V: 04-1] Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:13:17] !log Messing around with CommonSettings.php on mwdebug1002 to profile config loading [15:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:17] (03PS1) 10Jbond: idp: use correct cert name [puppet] - 10https://gerrit.wikimedia.org/r/530393 [15:15:25] (03CR) 10Jbond: [C: 03+2] idp: use correct cert name [puppet] - 10https://gerrit.wikimedia.org/r/530393 (owner: 10Jbond) [15:18:07] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10Cmjohnson) [15:18:24] (03PS6) 10Reedy: Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:18:47] (03PS7) 10Reedy: Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:19:06] (03PS9) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) [15:19:27] PROBLEM - Host db2063.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:19:53] (03CR) 10jerkins-bot: [V: 04-1] Attempt to remove some more rule exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:20:14] (03CR) 10Krinkle: Attempt to remove some more rule exclusions... (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:20:30] (03PS8) 10Krinkle: Attempt to remove some more rule exclusions... 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:20:37] (03CR) 10SBassett: Add rate limiter to Special:ConfirmEmail - config change (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: 10SBassett) [15:21:09] (03PS9) 10Reedy: Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:21:13] (03CR) 10Reedy: [C: 03+2] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:21:18] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (10fireattack) It returns content-type: image/png here (using both `curl` and browser). [15:21:29] RECOVERY - ElasticSearch shard size check - 9643 on search.svc.codfw.wmnet is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [15:22:37] DAMN IT [15:22:41] wikibase blocking up CI again [15:24:09] ++ [15:25:09] RECOVERY - Host db2063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [15:25:36] (03PS1) 10Jbond: acme_chief: use correct authorized_host for idp cert [puppet] - 10https://gerrit.wikimedia.org/r/530395 [15:28:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] acme_chief: use correct authorized_host for idp cert [puppet] - 10https://gerrit.wikimedia.org/r/530395 (owner: 10Jbond) [15:28:20] (03PS1) 10Ema: ATS: compress.so only cache compressed/decompressed variant [puppet] - 10https://gerrit.wikimedia.org/r/530396 (https://phabricator.wikimedia.org/T227432) [15:29:29] (03PS2) 10Reedy: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [15:30:39] (03PS1) 10Reedy: Lets try removing docroot from phpcs exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530397 [15:30:42] (03PS2) 10Ema: ATS: compress.so only cache compressed/decompressed variant [puppet] - 10https://gerrit.wikimedia.org/r/530396 (https://phabricator.wikimedia.org/T227432) [15:31:11] jouncebot, next [15:31:11] In 0 hour(s) and 28 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1600) [15:31:13] (03PS1) 10Urbanecm: Fix zhwikisource wgExtraNamespaces entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530398 (https://phabricator.wikimedia.org/T230294) [15:31:59] (03CR) 10Ema: [V: 03+2 C: 03+2] ATS: compress.so only cache compressed/decompressed variant [puppet] - 10https://gerrit.wikimedia.org/r/530396 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [15:33:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Phamhi) 
Blocked due to https://phabricator.wikimedia.org/T212855 [15:33:55] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:26] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot-wdqs (exit_code=0) [15:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:36] (03CR) 10CDanis: "Thanks, and sorry for my unfamiliarity with PHP style." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [15:34:57] !log performing rolling restarts of eqiad kafka-main brokers for security updates [15:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:05] !log cp5002: re-pool with compress.so cache:false [15:37:07] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:38] (03PS1) 10Jbond: idp: add dhparam and disable ssl_ecdhe_curve [puppet] - 10https://gerrit.wikimedia.org/r/530401 [15:38:10] (03PS2) 10Jbond: idp: add dhparam and disable ssl_ecdhe_curve [puppet] - 10https://gerrit.wikimedia.org/r/530401 [15:39:04] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: add dhparam and disable ssl_ecdhe_curve [puppet] - 10https://gerrit.wikimedia.org/r/530401 (owner: 10Jbond) [15:40:51] !log unfreeze writes to cloudelastic cluster [15:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:49] (03PS1) 10Jbond: idp: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/530402 [15:45:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: correct typo [puppet] - 10https://gerrit.wikimedia.org/r/530402 (owner: 10Jbond) [15:49:52] (03CR) 10jerkins-bot: [V: 04-1] Remove some more phpcs 
exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:49:59] (03CR) 10jerkins-bot: [V: 04-1] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:50:01] (03CR) 10jerkins-bot: [V: 04-1] Lets try removing docroot from phpcs exclusions... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530397 (owner: 10Reedy) [15:50:58] (03PS1) 10Jbond: apereo_cas: update cas to 6.1.0-rc4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530403 [15:51:13] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: update cas to 6.1.0-rc4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530403 (owner: 10Jbond) [15:51:23] (03PS10) 10Reedy: Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:51:31] (03CR) 10Reedy: [C: 03+2] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:54:07] (03CR) 10jerkins-bot: [V: 04-1] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:54:14] (03CR) 10jerkins-bot: [V: 04-1] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:54:54] (03PS2) 10Reedy: Remove docroot from phpcs exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530397 [15:55:00] (03CR) 10Reedy: [C: 03+2] Remove docroot from phpcs exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530397 (owner: 10Reedy) [15:56:03] (03PS11) 10Reedy: Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:56:05] (03PS1) 10Jbond: remove templates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530404 
[15:56:25] (03CR) 10Jbond: [V: 03+2 C: 03+2] remove templates [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530404 (owner: 10Jbond) [15:56:59] (03Merged) 10jenkins-bot: Remove docroot from phpcs exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530397 (owner: 10Reedy) [15:57:23] (03PS12) 10Reedy: Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:57:42] (03CR) 10jenkins-bot: Remove docroot from phpcs exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530397 (owner: 10Reedy) [15:57:44] (03CR) 10Reedy: [C: 03+2] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:57:53] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:32] (03CR) 10jerkins-bot: [V: 04-1] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:59:06] (03CR) 10jerkins-bot: [V: 04-1] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [15:59:15] (03PS1) 10BryanDavis: toolforge: Provision jsub and friends on grid exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/530405 (https://phabricator.wikimedia.org/T230562) [15:59:51] (03PS13) 10Reedy: Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 [15:59:54] (03CR) 10Reedy: [C: 03+2] Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:20] (03PS1) 10Reedy: Collapse nested if statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530406 [16:01:25] RECOVERY - Host pc2010 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [16:01:25] (03Merged) 10jenkins-bot: Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [16:01:57] (03PS2) 10Reedy: Collapse nested if statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530406 [16:02:06] (03CR) 10Reedy: [C: 03+2] Collapse nested if statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530406 (owner: 10Reedy) [16:02:24] (03CR) 10jenkins-bot: Remove some more phpcs exclusions, make some more narrow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530386 (owner: 10Reedy) [16:03:09] !log reedy@deploy1001 Synchronized phpcs.xml: more exclusions! (duration: 00m 47s) [16:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:22] (03Merged) 10jenkins-bot: Collapse nested if statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530406 (owner: 10Reedy) [16:04:05] !log reedy@deploy1001 Synchronized tests/: phpunit (duration: 00m 47s) [16:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:32] (03CR) 10jenkins-bot: Collapse nested if statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530406 (owner: 10Reedy) [16:05:07] !log reedy@deploy1001 Synchronized wmf-config/arclamp.php: phpcs (duration: 00m 47s) [16:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:13] (03CR) 10Bstorm: [C: 04-1] "There's something odd here. 
The submit host profile should add this package, but it doesn't seem applied to the exec hosts outside of web" [puppet] - 10https://gerrit.wikimedia.org/r/530405 (https://phabricator.wikimedia.org/T230562) (owner: 10BryanDavis) [16:06:21] !log reedy@deploy1001 Synchronized docroot/: phpcs fixes (duration: 00m 47s) [16:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:29] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:26] (03PS1) 10Jbond: idp: remove ldap[0].providerClass [puppet] - 10https://gerrit.wikimedia.org/r/530409 [16:09:23] !log Finished messing around with mwdebug1002 [16:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1023.mgmt.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudv... [16:13:20] (03CR) 10Bstorm: [C: 04-1] "Ok, yes, this patch will conflict with the web nodes because they include `profile::toolforge::grid::submit_host`. 
The better way to do t" [puppet] - 10https://gerrit.wikimedia.org/r/530405 (https://phabricator.wikimedia.org/T230562) (owner: 10BryanDavis) [16:15:40] (03PS3) 10Reedy: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [16:16:01] PROBLEM - MariaDB Slave Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2310.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:16:44] (03CR) 10Reedy: "No worries, as above, it's something that should've at least been enforced by phpcs, but the docroot was excluded for unknown reasons..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [16:17:12] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10Amire80) >>! In T230020#5415067, @CDanis wrote: > @abi_ Can you clarify if you need access to private (user webrequest logs) data? Statistics a... 
[16:18:02] (03CR) 10jerkins-bot: [V: 04-1] noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [16:19:08] (03PS4) 10Reedy: noc: read dbctl JSON from local disk mirror of etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [16:23:04] (03PS1) 10DannyS712: Add `WS` and `CAT` as aliases for zhwikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530413 (https://phabricator.wikimedia.org/T230548) [16:24:42] (03PS2) 10DannyS712: Add `WS` and `CAT` as aliases for zhwikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530413 (https://phabricator.wikimedia.org/T230548) [16:24:54] (03PS1) 10Reedy: Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 [16:25:52] (03CR) 10jerkins-bot: [V: 04-1] Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 (owner: 10Reedy) [16:26:30] (03PS2) 10Reedy: Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 [16:27:44] !log advertise core v4 range (208.80.152.0/22) from eqord - T167841 [16:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:52] T167841: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 [16:28:54] ty Reedy! 
[16:32:32] (03PS1) 10Reedy: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 [16:33:07] (03PS2) 10Krinkle: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 (owner: 10Reedy) [16:34:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10CDanis) >>! In T230020#5416604, @Amire80 wrote: >>>! In T230020#5415067, @CDanis wrote: >> @abi_ Can you clarify if you need access to private (... [16:35:11] RECOVERY - MariaDB Slave Lag: pc1 on pc2010 is OK: OK slave_sql_lag Replication lag: 3.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [16:35:59] (03PS3) 10Reedy: Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 [16:36:01] (03PS3) 10Reedy: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 [16:36:46] (03PS4) 10Reedy: Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 [16:39:01] (03PS5) 10Reedy: Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 [16:39:07] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (10Amire80) >>! In T230020#5416655, @CDanis wrote: >>>! In T230020#5416604, @Amire80 wrote: >>>>! In T230020#5415067, @CDanis wrote: >>> @abi_ Can... 
[16:39:16] (03PS4) 10Reedy: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 [16:40:17] (03PS5) 10Reedy: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 [16:44:38] (03CR) 10Ori.livneh: "This works, but it is not obvious. I think that any scheme that relies on modification time will run into subtle problems that will be dif" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: 10Krinkle) [16:46:24] Krinkle: ^. It's possible I misunderstood the requirements, though -- I'm rusty :) [16:46:49] (03CR) 10Krinkle: [C: 03+2] Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 (owner: 10Reedy) [16:47:50] (03Merged) 10jenkins-bot: Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 (owner: 10Reedy) [16:48:08] (03CR) 10jenkins-bot: Move WmfClusters class to own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530414 (owner: 10Reedy) [16:49:27] (03PS6) 10Reedy: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 [16:50:03] (03CR) 10Krinkle: [C: 03+2] Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 (owner: 10Reedy) [16:51:00] (03Merged) 10jenkins-bot: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 (owner: 10Reedy) [16:51:16] (03CR) 10jenkins-bot: Make WmfClusters->htmlFor() return a string rather than printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530416 (owner: 10Reedy) [16:51:20] Reedy: did my patch nerdsnipe you and Krinkle into cleaning up noc's code a bunch? 
:D [16:52:04] Something like that ;P [16:52:07] !log reedy@deploy1001 Synchronized src/WmfClusters.php: Move WmfClusters.php (duration: 00m 47s) [16:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:29] !log reedy@deploy1001 Synchronized docroot/noc/db.php: Use WmfClusters from separate file (duration: 00m 47s) [16:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] cscott, arlolra, subbu, halfak, and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1700). [17:22:06] (03CR) 10Ayounsi: Add script to import management DNS entries (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) (owner: 10CRusnov) [17:28:42] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10wiki_willy) [17:30:15] 10Operations, 10ops-eqiad, 10DC-Ops: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) - https://phabricator.wikimedia.org/T226782 (10wiki_willy) [17:31:02] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10wiki_willy) [17:32:01] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10wiki_willy) [17:32:53] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 9/19 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10wiki_willy) [17:32:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): elastic1017 lost network after reboot - https://phabricator.wikimedia.org/T230518 (10Cmjohnson) * I checked the network switch and the port shows up/up meaning that link from the server to the network switch is up ge-3/0/17 
up... [17:33:45] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (10wiki_willy) [17:34:40] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10wiki_willy) [17:35:50] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10wiki_willy) [17:36:46] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10wiki_willy) [17:37:36] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10wiki_willy) [17:38:42] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) - https://phabricator.wikimedia.org/T227542 (10wiki_willy) [17:39:17] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10wiki_willy) [17:40:33] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Anomie) [17:41:28] !log mbsantos@deploy1001 Started deploy [mobileapps/deploy@1bd2bea]: Update service-mobileapp-node to 5c1da03 (T230067 T229984) [17:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:40] T229984: [BUG] Handle footer menu icon style correctly - https://phabricator.wikimedia.org/T229984 [17:41:40] T230067: mobile-html: enhance media-list to include exact image URLs - https://phabricator.wikimedia.org/T230067 [17:42:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): elastic1017 lost network after reboot - https://phabricator.wikimedia.org/T230518 (10Cmjohnson) I will add that this server is out of warranty and would require a motherboard 
replacement if it is the nic. We typically do not do this a...
[17:43:30] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Anomie)
[17:46:55] Operations, ops-eqiad, DC-Ops, Discovery-Search (Current work): elastic1017 lost network after reboot - https://phabricator.wikimedia.org/T230518 (Gehel) Open→Resolved @Cmjohnson don't spend more time on it, it is scheduled for replacement and the replacement should arrive August 21. We c...
[17:47:21] !log mbsantos@deploy1001 Finished deploy [mobileapps/deploy@1bd2bea]: Update service-mobileapp-node to 5c1da03 (T230067 T229984) (duration: 05m 53s)
[17:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:30] T229984: [BUG] Handle footer menu icon style correctly - https://phabricator.wikimedia.org/T229984
[17:47:30] T230067: mobile-html: enhance media-list to include exact image URLs - https://phabricator.wikimedia.org/T230067
[17:58:11] !log mbsantos@deploy1001 Started deploy [proton/deploy@fb0b2a5]: Update chromium-renderer to 3f1cc72 (T218220)
[17:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:19] T218220: Make mobileapps & proton swagger spec compliant - https://phabricator.wikimedia.org/T218220
[17:58:52] (PS2) Gehel: Disable DCAT-AP updates - will be moved to separate endpoint [puppet] - https://gerrit.wikimedia.org/r/529443 (https://phabricator.wikimedia.org/T228297) (owner: Smalyshev)
[17:58:53] !log mbsantos@deploy1001 Finished deploy [proton/deploy@fb0b2a5]: Update chromium-renderer to 3f1cc72 (T218220) (duration: 00m 43s)
[17:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:31] (CR) Gehel: [C: +2] Disable DCAT-AP updates - will be moved to separate endpoint [puppet] - https://gerrit.wikimedia.org/r/529443 (https://phabricator.wikimedia.org/T228297) (owner: Smalyshev)
[18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1800).
[18:00:04] No GERRIT patches in the queue for this window AFAICS.
[18:14:24] jouncebot: now
[18:14:24] For the next 0 hour(s) and 45 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T1800)
[18:18:42] Hey all- deploying sec patch T230402 right now through gerrit (just checked with SWAT folks)
[18:24:25] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[18:29:11] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[18:51:14] Operations, netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (ayounsi)
[18:51:56] !log gehel@cumin1001 START - Cookbook sre.wdqs.reboot-wdqs
[18:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:55] !log restart elasticsearch on cloudelastic1001 with -XX:NewRatio=3
[19:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:05] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:10:25] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:11:29] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[19:14:23] (PS2) Gehel: Make wdqs1009 regular host again [puppet] - https://gerrit.wikimedia.org/r/530260 (https://phabricator.wikimedia.org/T230244) (owner: Smalyshev)
[19:15:01] (CR) Gehel: [C: +2] Make wdqs1009 regular host again [puppet] - https://gerrit.wikimedia.org/r/530260 (https://phabricator.wikimedia.org/T230244) (owner: Smalyshev)
[19:15:18] (PS2) Gehel: Restore autodeploy on wdq1009 [puppet] - https://gerrit.wikimedia.org/r/530261 (https://phabricator.wikimedia.org/T230244) (owner: Smalyshev)
[19:15:36] (PS3) Cwhite: logster: add ensure parameter [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357)
[19:15:52] (CR) Gehel: [C: +2] Restore autodeploy on wdq1009 [puppet] - https://gerrit.wikimedia.org/r/530261 (https://phabricator.wikimedia.org/T230244) (owner: Smalyshev)
[19:16:17] (PS4) Cwhite: logster: add ensure parameter [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357)
[19:17:00] (PS1) Herron: check_systemd_state: downgrade 'degraded' status to warning [puppet] - https://gerrit.wikimedia.org/r/530442 (https://phabricator.wikimedia.org/T230570)
[19:17:16] (PS5) Cwhite: logster: add ensure parameter [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357)
[19:17:52] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[19:17:53] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97)
[19:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:40] !log Deployed security patch for T229541 (1.34.0-wmf.17)
[19:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:59] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[19:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:28] !log Deployed security patch for T230402 (1.34.0-wmf.17)
[19:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:03] (CR) Cwhite: "> Patch Set 2:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[19:28:38] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot-wdqs (exit_code=0)
[19:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:08] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (Gehel)
[19:32:12] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[19:33:33] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[19:37:07] !log depool cp5002 during the EU night, running compress.so experiment
[19:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:20] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[19:38:23] PROBLEM - Check the Netbox report-s- management for fail status. on netmon1002 is CRITICAL: management.ManagementConsole CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:38:23] RECOVERY - Check the Netbox report-s- librenms for fail status. on netmon1002 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:41:06] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[19:58:59] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:00:29] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:06:17] RECOVERY - Check the Netbox report-s- management for fail status. on netmon1002 is OK: management.ManagementConsole OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[20:07:09] Operations, ops-eqiad, ops-ulsfo, DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 (RobH) Stalled→Resolved done tested and works on port 8 on scs-ulsfo (baud rate 19200 8n1, default on scs is 9600, so its the only one differing on the scs console right...
[20:07:11] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[20:07:30] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH)
[20:07:52] Operations, ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (RobH) Open→Resolved Ok, the new scs is now in place, with all connections documented and tested as working.
[20:10:36] Operations, ops-eqiad, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (RobH) the single ops-ulsfo item has been fixed (its not on the switch any longer) so removing that tag.
[20:10:37] PROBLEM - Check the Netbox report-s- cables for fail status. on netmon1002 is CRITICAL: cables.Cables CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[20:10:45] o no
[20:10:48] Operations, ops-eqiad, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (RobH)
[20:20:27] (CR) Ori.livneh: [C: +1] CommonSettings: Clean up wmf-config caching code [no-op] [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[20:22:35] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[20:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:25] PROBLEM - High lag on wdqs1010 is CRITICAL: 3865 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:25:05] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (ops-monitoring-bot)
[20:26:01] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (CDanis) Thanks @Amire80! @abi_ Since Analytics access is extremely sensitive, please familiarize yourself with https://wikitech.wikimedia.org/w...
[20:26:18] bstorm_ or jeh ^
[20:26:32] Yeah, saw that
[20:26:36] Looking
[20:26:40] ok :)
[20:26:43] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 3695 ge 3600 Gehel recovering after data transfer https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:32:18] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (Bstorm) Looks like a bad disk here: ` Enclosure Device ID: 32 Slot Number: 1 Enclosure position: 1 Device Id: 1 WWN: 55cd2e41505091c6 Sequence Number: 4 Media Error Count: 31 Other Error Count: 12965 Pred...
[20:37:49] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (Bstorm) Looks like the exact same thing as {T229156}. Same disk, same error and even same hot spare rebuilding. That's not great.
[20:40:11] (PS1) CDanis: admin: shell acct & analytics access for abi [puppet] - https://gerrit.wikimedia.org/r/530448 (https://phabricator.wikimedia.org/T230020)
[20:42:52] (CR) CDanis: [C: +2] admin: shell acct & analytics access for abi [puppet] - https://gerrit.wikimedia.org/r/530448 (https://phabricator.wikimedia.org/T230020) (owner: CDanis)
[20:43:48] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (CDanis) Open→Resolved
[20:44:49] Operations, SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (CDanis) Open→Resolved I've made you a member of the `wmf` LDAP group, which gives you access to Logstash and other things: https://wikitech.wikimedia.org/wiki/LDAP/Groups#...
[20:57:30] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 1181 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:58:10] Operations, ops-codfw, netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (ayounsi)
[20:58:42] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1078 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[21:07:08] !log restart cloudelastic1002 with -XX:NewRatio=3 to match cloudelastic1001
[21:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:25] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (wiki_willy) a:Cmjohnson
[21:11:24] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 404 threshold =0.15 breach: timed_out: False, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, initializing_shards: 5, active_shards: 1795, number_of_nodes: 4, active_shards_percent_as_number: 81.62801273306049, number_of_data_nodes: 4, active_primary_shards: 733, delayed_unassigned_sh
[21:11:24] f_in_flight_fetch: 0, unassigned_shards: 399, relocating_shards: 0, status: yellow, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:11:58] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 404 threshold =0.15 breach: initializing_shards: 5, delayed_unassigned_shards: 0, active_shards_percent_as_number: 81.62801273306049, number_of_in_flight_fetch: 0, timed_out: False, number_of_nodes: 4, number_of_pending_tasks: 0, unassigned_shards: 399, number_of_data_nodes: 4, task_max_waiting_in_queu
[21:11:58] us: yellow, relocating_shards: 0, cluster_name: cloudelastic-chi-eqiad, active_shards: 1795, active_primary_shards: 733 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:02] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 404 threshold =0.15 breach: number_of_in_flight_fetch: 0, status: yellow, number_of_data_nodes: 4, active_shards_percent_as_number: 81.62801273306049, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, active_shards: 1795, cluster_name: cloudelastic-chi-eqiad, active_primary_shards: 733,
[21:12:02] d_shards: 0, unassigned_shards: 399, timed_out: False, number_of_nodes: 4, initializing_shards: 5, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:14] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 404 threshold =0.15 breach: active_shards: 1795, active_primary_shards: 733, number_of_data_nodes: 4, cluster_name: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, unassigned_shards: 399, initializing_shards: 5, number_of_pending_tasks: 0, active_shards_percent_as_number: 81.62801273306049, timed
[21:12:14] er_of_nodes: 4, number_of_in_flight_fetch: 0, relocating_shards: 0, status: yellow, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:46] ebernhardson: ^
[21:12:52] ^ related to restart a cpl minutes ago
[21:18:24] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: delayed_unassigned_shards: 0, timed_out: False, active_primary_shards: 733, number_of_data_nodes: 4, task_max_waiting_in_queue_millis: 45, number_of_pending_tasks: 5, unassigned_shards: 163, initializing_shards: 9, status: yellow, number_of_nodes: 4, cluster_name: cloudelastic-chi-eqiad, active_sha
[21:18:24] mber: 92.17826284674852, active_shards: 2027, number_of_in_flight_fetch: 0, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:18:28] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1004 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: task_max_waiting_in_queue_millis: 2286, active_shards: 2031, delayed_unassigned_shards: 0, active_primary_shards: 733, timed_out: False, cluster_name: cloudelastic-chi-eqiad, unassigned_shards: 159, number_of_pending_tasks: 9, number_of_data_nodes: 4, number_of_nodes: 4, number_of_in_flight_fetch:
[21:18:28] , active_shards_percent_as_number: 92.36016371077763, relocating_shards: 0, initializing_shards: 9 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:18:40] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_primary_shards: 733, number_of_data_nodes: 4, task_max_waiting_in_queue_millis: 236, status: yellow, unassigned_shards: 109, number_of_nodes: 4, delayed_unassigned_shards: 0, initializing_shards: 9, active_shards: 2081, number_of_in_flight_fetch: 0, timed_out: False, number_of_pending_tasks:
[21:18:40] ards: 0, cluster_name: cloudelastic-chi-eqiad, active_shards_percent_as_number: 94.63392451114143 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:18:55] !log increase cloudelastic indices.recovery.max_bytes_per_sec from 40mbit to 512mbit as these have 10G networking
[21:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:26] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: unassigned_shards: 0, number_of_pending_tasks: 0, active_shards: 2195, cluster_name: cloudelastic-chi-eqiad, initializing_shards: 4, active_primary_shards: 733, relocating_shards: 0, number_of_nodes: 4, timed_out: False, active_shards_percent_as_number: 99.8180991359709, number_of_data_nodes: 4, st
[21:19:26] ber_of_in_flight_fetch: 0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:20:12] err, mbyte not mbit...whatever just an order of magnitude :P
[21:27:37] !log finish restarting cloudelastic-chi-eqiad with -XX:NewRatio=3
[21:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:48] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@fce8177]: Weekly deploy
[21:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:18] PROBLEM - Check the Netbox report-s- cables for fail status. on netmon1002 is CRITICAL: cables.Cables CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:54:17] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@fce8177]: Weekly deploy (duration: 25m 28s)
[21:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:54] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active, AS2914/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:22:42] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[22:27:30] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[22:41:20] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 83, down: 0, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:42:46] Operations, netops: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 (ayounsi) Resolved→Open Reopening as the cleanup above is only part of the solution. It was made with the idea that it would be ok for all sites to route to any other site, while as explai...
[22:42:53] Operations, netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (ayounsi)
[22:51:58] Operations, Analytics: Access to HUE for Mayakpwiki - https://phabricator.wikimedia.org/T229143 (Mayakp.wiki) Following up on this request : I have been able to use Jupyter notebooks for some of my work. However, I would still like to get access to HUE for running small, simple queries on hive tables. Th...
[23:00:04] MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190815T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:25:55] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@b4da6e4]: Rollback blazegraph due to T230588
[23:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:05] T230588: Wikidata Query Service is swapping items and properties - https://phabricator.wikimedia.org/T230588
[23:28:02] (PS2) BryanDavis: toolforge: treat all compute nodes as submit hosts [puppet] - https://gerrit.wikimedia.org/r/530405 (https://phabricator.wikimedia.org/T230562)
[23:35:44] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@b4da6e4]: Rollback blazegraph due to T230588 (duration: 09m 48s)
[23:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:53] T230588: Wikidata Query Service is swapping items and properties - https://phabricator.wikimedia.org/T230588