[00:50:55] (CR) Bstorm: "Looks good https://puppet-compiler.wmflabs.org/compiler1001/17918/" [puppet] - https://gerrit.wikimedia.org/r/530405 (https://phabricator.wikimedia.org/T230562) (owner: BryanDavis)
[00:51:11] (PS3) Bstorm: toolforge: treat all compute nodes as submit hosts [puppet] - https://gerrit.wikimedia.org/r/530405 (https://phabricator.wikimedia.org/T230562) (owner: BryanDavis)
[00:52:35] (CR) Bstorm: [C: +2] toolforge: treat all compute nodes as submit hosts [puppet] - https://gerrit.wikimedia.org/r/530405 (https://phabricator.wikimedia.org/T230562) (owner: BryanDavis)
[02:51:23] !log repooling cp5002, running compress.so experiment
[02:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:31] (PS1) Mholloway: MachineVision (Beta): Request labels targeting Beta Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/530460
[03:00:40] Operations, Commons, MediaWiki-File-management, Multimedia, and 3 others: Some thumbnail images delivered with wrong application/x-www-form-urlencoded mime-type - https://phabricator.wikimedia.org/T188831 (Wang_Qiliang) >>! In T188831#5416184, @ema wrote: >>>! In T188831#5416179, @Wang_Qiliang wr...
[04:01:33] PROBLEM - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2019-08-12 03:45:11 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[05:03:35] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:19] RECOVERY - snapshot of s7 in codfw on db1115 is OK: snapshot for s7 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-16 03:40:14 from db2100.codfw.wmnet:3317 (849 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[05:45:01] (PS1) Vgutierrez: ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] - https://gerrit.wikimedia.org/r/530464 (https://phabricator.wikimedia.org/T219765)
[05:45:03] (PS1) Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765)
[05:47:35] (CR) jerkins-bot: [V: -1] acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[06:09:01] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[06:42:58] (CR) Muehlenhoff: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/530230 (owner: MarcoAurelio)
[06:43:05] (PS4) Muehlenhoff: openldap::offboard-user.py: Adjust several renamed projects [puppet] - https://gerrit.wikimedia.org/r/530230 (owner: MarcoAurelio)
[06:45:41] (CR) Muehlenhoff: [C: +2] openldap::offboard-user.py: Adjust several renamed projects [puppet] - https://gerrit.wikimedia.org/r/530230 (owner: MarcoAurelio)
[06:50:07] <_joe_> !log upgrading envoyproxy across production (http2 CVEs)
[06:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:09] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:25:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:59] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[07:27:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:25] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 81, down: 0, shutdown: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:31:53] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:32:39] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:15] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:33:17] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:35:53] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:36:43] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:38:07] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:38:09] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:39:03] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:43:03] Operations, SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (abi_) Can confirm that I'm able to access this.
[07:50:12] Operations, serviceops, Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (MoritzMuehlenhoff)
[07:53:33] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 81, down: 0, shutdown: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:51] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:56:05] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:56:47] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:57:29] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:01:39] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:02:19] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:02:19] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:02:31] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:04:47] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:06:03] Operations, Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (Mathew.onipe)
[08:08:57] Operations, Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (Mathew.onipe)
[08:10:58] ACKNOWLEDGEMENT - Host elastic2050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Mathew.onipe see T230597 - The acknowledgement expires at: 2019-08-19 08:10:34.
[08:18:12] <_joe_> !log stopping php on phab1003, to restart it with systemd
[08:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:17] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 83, down: 0, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:43:31] (PS2) Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765)
[08:43:41] (PS4) Ema: ATS: do not autostart service upon package installation [puppet] - https://gerrit.wikimedia.org/r/529402
[08:44:45] (CR) Ema: [C: +2] ATS: do not autostart service upon package installation [puppet] - https://gerrit.wikimedia.org/r/529402 (owner: Ema)
[08:45:44] (CR) jerkins-bot: [V: -1] acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[08:46:12] (PS1) Giuseppe Lavagetto: phabricator::main: correct the php extension list [puppet] - https://gerrit.wikimedia.org/r/530538
[08:46:14] (CR) Muehlenhoff: ATS: do not autostart service upon package installation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529402 (owner: Ema)
[08:48:27] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17919/" [puppet] - https://gerrit.wikimedia.org/r/530538 (owner: Giuseppe Lavagetto)
[08:48:39] (PS2) Giuseppe Lavagetto: phabricator::main: correct the php extension list [puppet] - https://gerrit.wikimedia.org/r/530538
[08:48:44] (CR) Giuseppe Lavagetto: [V: +2 C: +2] phabricator::main: correct the php extension list [puppet] - https://gerrit.wikimedia.org/r/530538 (owner: Giuseppe Lavagetto)
[08:51:01] (PS2) Vgutierrez: ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] - https://gerrit.wikimedia.org/r/530464 (https://phabricator.wikimedia.org/T219765)
[08:51:03] (PS3) Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765)
[08:52:29] (CR) Vgutierrez: ATS: do not autostart service upon package installation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529402 (owner: Ema)
[09:04:50] Operations, netops: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (jbond)
[09:08:12] (PS2) Jakob: Whitelist jenkins for edit rate limits on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/530144 (https://phabricator.wikimedia.org/T230481)
[09:14:00] Operations, DBA, Data-Services: Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (bd808)
[09:17:08] Operations, Cloud-Services, Traffic: All sites served by cloudweb2001-dev return 503 - https://phabricator.wikimedia.org/T230105 (bd808)
[09:23:33] (PS1) BryanDavis: toolforge: provision zstd [puppet] - https://gerrit.wikimedia.org/r/530547 (https://phabricator.wikimedia.org/T225380)
[09:25:11] (CR) Muehlenhoff: "Is there still Toolforge on jessie? If so, this will need an os_version guard as it's only part of Debian starting with Stretch." [puppet] - https://gerrit.wikimedia.org/r/530547 (https://phabricator.wikimedia.org/T225380) (owner: BryanDavis)
[09:25:31] (PS4) Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765)
[09:25:33] (PS1) Vgutierrez: ocsp: Provide basic test coverage [software/acme-chief] - https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765)
[09:28:45] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1023.eqiad.wmn...
[09:29:27] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: provision zstd [puppet] - https://gerrit.wikimedia.org/r/530547 (https://phabricator.wikimedia.org/T225380) (owner: BryanDavis)
[09:31:04] (CR) Arturo Borrero Gonzalez: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/530547 (https://phabricator.wikimedia.org/T225380) (owner: BryanDavis)
[09:32:44] (CR) BryanDavis: "> Is there still Toolforge on jessie? If so, this will need an" [puppet] - https://gerrit.wikimedia.org/r/530547 (https://phabricator.wikimedia.org/T225380) (owner: BryanDavis)
[09:33:18] (CR) Arturo Borrero Gonzalez: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/530547 (https://phabricator.wikimedia.org/T225380) (owner: BryanDavis)
[09:34:34] (PS2) Giuseppe Lavagetto: envoyproxy: support debian jessie [puppet] - https://gerrit.wikimedia.org/r/529919
[09:36:05] (PS1) Arturo Borrero Gonzalez: toolforge: grid_environ: zstd is only available starting with startch [puppet] - https://gerrit.wikimedia.org/r/530551 (https://phabricator.wikimedia.org/T225380)
[09:37:05] (Abandoned) Arturo Borrero Gonzalez: toolforge: grid_environ: zstd is only available starting with startch [puppet] - https://gerrit.wikimedia.org/r/530551 (https://phabricator.wikimedia.org/T225380) (owner: Arturo Borrero Gonzalez)
[09:39:29] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:44:46] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1023.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt10...
[09:46:08] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1023.eqiad.wmn...
[09:56:08] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1023.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt10...
[10:04:12] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:10] Operations, DC-Ops, Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (Mathew.onipe)
[10:12:11] (PS1) Elukey: profile::analytics::refinery::job::data_purge: remove shell redirect [puppet] - https://gerrit.wikimedia.org/r/530555
[10:12:20] Operations, DC-Ops, Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (Mathew.onipe) p:Triage→High
[10:14:27] (PS2) Elukey: profile::analytics::refinery::job::data_purge: remove shell redirect [puppet] - https://gerrit.wikimedia.org/r/530555
[10:15:53] (CR) Elukey: [C: +2] profile::analytics::refinery::job::data_purge: remove shell redirect [puppet] - https://gerrit.wikimedia.org/r/530555 (owner: Elukey)
[10:33:51] (PS1) Elukey: profile::analytics::cluster::packages::common: temp remove python3-tk [puppet] - https://gerrit.wikimedia.org/r/530556 (https://phabricator.wikimedia.org/T229347)
[10:35:43] (CR) Elukey: [C: +2] profile::analytics::cluster::packages::common: temp remove python3-tk [puppet] - https://gerrit.wikimedia.org/r/530556 (https://phabricator.wikimedia.org/T229347) (owner: Elukey)
[10:51:03] (CR) Alex Monk: "What's the reason for the python-cryptography version bump?" [software/acme-chief] - https://gerrit.wikimedia.org/r/530464 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[10:59:59] (CR) Alex Monk: [C: +2] ocsp: Provide basic test coverage [software/acme-chief] - https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[11:12:45] (CR) Alex Monk: acme_chief: Provide OCSP responses (4 comments) [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[11:25:29] (PS1) Muehlenhoff: Remove apt-setup/multiarch from d-i config [puppet] - https://gerrit.wikimedia.org/r/530559
[11:29:25] (CR) Krinkle: "@Ori I like the direction and pre-building. I'm not sure how long it would take to run for 900+ wikis, but I think it is worth trying. Cou" [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[11:46:37] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/530559 (owner: Muehlenhoff)
[12:19:25] (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[12:19:48] (CR) jerkins-bot: [V: -1] logster: add ensure parameter [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[12:21:04] (PS6) Elukey: logster: add ensure parameter [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[12:22:00] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[12:23:32] (CR) Elukey: "There was a parent change that was causing some troubles, rebased and resent the code review :)" [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[12:25:50] (CR) Elukey: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/17922/ looks fine!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529399 (https://phabricator.wikimedia.org/T229357) (owner: Cwhite)
[12:26:18] (CR) Muehlenhoff: profile::kerberos::kdc: add debconf settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[12:29:26] (CR) Elukey: profile::kerberos::kdc: add debconf settings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529786 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[12:29:58] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&panelId=9&fullscreen
[12:36:47] (CR) Elukey: [V: +2 C: +2] Add more granularity to query/time|size buckets [software/druid_exporter] - https://gerrit.wikimedia.org/r/519365 (https://phabricator.wikimedia.org/T226035) (owner: Elukey)
[12:40:26] (CR) Muehlenhoff: "That looks fine approach-wise, I'll have a closer look/review on Monday." [puppet] - https://gerrit.wikimedia.org/r/529733 (https://phabricator.wikimedia.org/T226089) (owner: Elukey)
[12:55:14] Operations, netops: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (CDanis) p:Triage→Normal
[12:55:18] Operations, vm-requests: Site: 2 VMs for puppetdb - https://phabricator.wikimedia.org/T230609 (MoritzMuehlenhoff)
[12:55:32] Operations, DBA, Data-Services: Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (CDanis) p:Triage→Normal
[12:55:42] Operations, Puppet: offboard-user.py: do not hardcode Phabricator project names, use PHID instead - https://phabricator.wikimedia.org/T230516 (CDanis) p:Triage→Normal
[12:55:56] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Sundown aliases `minnan` and `zh-cfr` for `nan`/`zh-min-nan` - https://phabricator.wikimedia.org/T230382 (CDanis) p:Triage→Normal
[12:56:22] Operations, Puppet: clean up systemd::timer::job logging basedir mess - https://phabricator.wikimedia.org/T230127 (CDanis) Open→Resolved a:CDanis
[12:56:47] Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (CDanis) p:Triage→Normal
[12:57:10] Operations, Wikimedia-Mailing-lists: Set up mailing list for Santali Wikipedia - https://phabricator.wikimedia.org/T230435 (CDanis) a:CDanis
[12:57:29] Operations, vm-requests: Site: 2 VMs for puppetdb - https://phabricator.wikimedia.org/T230609 (CDanis) p:Triage→Normal
[12:58:04] Operations, Analytics, Discovery, Research-Backlog: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (CDanis) p:Triage→Normal
[12:58:41] Operations, Release-Engineering-Team, cloud-services-team (Kanban): Requesting access to Puppet for Viztor[S] - https://phabricator.wikimedia.org/T229894 (CDanis) p:Triage→Normal
[12:58:53] Operations, Cassandra: Create a cassandra.service which subsumes casandra-{a,b,c} services using PartsOf=cassandra.service - https://phabricator.wikimedia.org/T229916 (CDanis) p:Triage→Normal
[12:59:22] Operations, Wiki-Setup (Delete / Redirect): Merge or delete grantswiki - https://phabricator.wikimedia.org/T229950 (CDanis) p:Triage→Normal
[12:59:37] Operations: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail - https://phabricator.wikimedia.org/T229998 (CDanis) p:Triage→Normal
[13:00:22] Operations, MediaWiki-Maintenance-scripts, serviceops: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (CDanis)
[13:00:30] Operations, MediaWiki-Maintenance-scripts, serviceops: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (CDanis) p:Triage→Normal
[13:01:50] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (CDanis) p:Triage→Normal
[13:01:57] Operations, ops-eqiad, DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (CDanis) p:Triage→Normal
[13:02:03] Operations, ops-eqiad, DC-Ops: b1-eqiad pdu refresh (Thursday 10/10 @11am UTC) - https://phabricator.wikimedia.org/T227536 (CDanis) p:Triage→Normal
[13:02:14] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (CDanis) p:Triage→Normal
[13:02:19] Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (CDanis) p:Triage→Normal
[13:02:24] Operations, ops-eqiad, DC-Ops: a8-eqiad pdu refresh (Thursday 9/19 @11am UTC) - https://phabricator.wikimedia.org/T227133 (CDanis) p:Triage→Normal
[13:02:31] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) - https://phabricator.wikimedia.org/T226782 (CDanis) p:Triage→Normal
[13:03:07] Operations, PDF-Rendering, Proton, Reading-Infrastructure-Team-Backlog, and 2 others: PDF renderer needs better CJK font - https://phabricator.wikimedia.org/T226633 (CDanis) p:Triage→Normal
[13:11:03] Operations, Release-Engineering-Team: Requesting access to Puppet for Viztor[S] - https://phabricator.wikimedia.org/T229894 (bd808)
[13:17:52] Puppet, Beta-Cluster-Infrastructure: Puppet error on deployment-logtash03 - https://phabricator.wikimedia.org/T230611 (Krenair)
[13:18:11] Puppet, Beta-Cluster-Infrastructure: Puppet error on deployment-logtash03 - https://phabricator.wikimedia.org/T230611 (Krenair)
[13:30:14] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:31:42] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 79058 bytes in 2.275 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:33:14] (CR) Alex Monk: ocsp: Provide basic test coverage [software/acme-chief] - https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[13:33:34] Puppet, Beta-Cluster-Infrastructure: Puppet error on deployment-logtash03 - https://phabricator.wikimedia.org/T230611 (herron) Not sure what caused the system to be in this state, but after the following steps logstash is back up and running. ` root@deployment-logstash03:~# apt remove logstash Reading p...
[13:37:29] Operations, Wikimedia-Mailing-lists: Set up mailing list for Santali Wikipedia - https://phabricator.wikimedia.org/T230435 (CDanis) Hi Manik, Happy to create this for you, but first, we'll also need a second list administrator -- can you provide someone? Thanks!
[13:42:21] (CR) Alex Monk: acme_chief: Provide OCSP responses (1 comment) [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[13:45:27] Puppet, Beta-Cluster-Infrastructure: Puppet error on deployment-logtash03 - https://phabricator.wikimedia.org/T230611 (Krenair) Interesting, okay, so the file is `/etc/systemd/system/logstash.service` (init.pp left out the `system/` part), and that doesn't seem to come from a package: `dpkg-query: no pat...
[13:46:01] (PS1) Mholloway: Machine vision (beta): Configure Wikidata Beta item URL template [mediawiki-config] - https://gerrit.wikimedia.org/r/530575
[13:46:39] Puppet, Beta-Cluster-Infrastructure: Puppet error on deployment-logtash03 - https://phabricator.wikimedia.org/T230611 (Krenair) Open→Resolved a:herron
[13:49:17] (CR) Mholloway: [C: +2] MachineVision (Beta): Request labels targeting Beta Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/530460 (owner: Mholloway)
[13:50:23] (Merged) jenkins-bot: MachineVision (Beta): Request labels targeting Beta Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/530460 (owner: Mholloway)
[13:50:39] (CR) jenkins-bot: MachineVision (Beta): Request labels targeting Beta Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/530460 (owner: Mholloway)
[13:52:58] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: MachineVision (beta): Request labels targeting Beta Wikidata (duration: 00m 50s)
[13:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:54] (CR) Alex Monk: " while a given cert_id/key_type_id combination is in the process of being renewed" [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[14:02:27] (PS2) Mholloway: Machine vision (beta): Configure Wikidata Beta item URL template [mediawiki-config] - https://gerrit.wikimedia.org/r/530575
[14:08:27] (PS1) Ema: profile::tlsproxy::instance: do not autostart nginx [puppet] - https://gerrit.wikimedia.org/r/530578
[14:12:41] (CR) Muehlenhoff: [C: +1] "Looks good, one typo inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/530578 (owner: Ema)
[14:19:43] (PS1) Jhedden: openstack: change codfw nova api and metadata port [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907)
[14:20:48] (CR) jerkins-bot: [V: -1] openstack: change codfw nova api and metadata port [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden)
[14:24:32] !log rolling reboot of cloudelastic
[14:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:50] (PS2) Jhedden: openstack: change codfw nova api and metadata port [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907)
[14:26:52] (CR) jerkins-bot: [V: -1] openstack: change codfw nova api and metadata port [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden)
[14:30:14] (PS3) Jhedden: openstack: change codfw nova api and metadata port [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907)
[14:30:23] (CR) Herron: [C: +1] icinga: disable autocomplete.js in icinga search text input [puppet] - https://gerrit.wikimedia.org/r/528586 (owner: Cwhite)
[14:31:01] (PS1) Elukey: Add metrics related to number of queries to Broker and Historicals [software/druid_exporter] - https://gerrit.wikimedia.org/r/530583
[14:31:56] (CR) Elukey: [V: +2 C: +2] Add metrics related to number of queries to Broker and Historicals [software/druid_exporter] - https://gerrit.wikimedia.org/r/530583 (owner: Elukey)
[14:39:06] !log run `bmc-device --cold-reset; echo $?` in elastic2050 hoping it resets mgmt interface -T230597
[14:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:14] T230597: can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597
[14:46:25] (PS4) Jhedden: openstack: change codfw nova api and metadata port [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907)
[14:47:56] Operations, DC-Ops, Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (RobH) Please note this mgmt interface is still down: ` robh@cumin2001:~$ ping elastic2050.mgmt.codfw.wmnet PING elastic2050.mgmt.codfw.wmnet (10.193.3.56) 56(84) bytes of...
[14:47:59] (PS3) Alexandros Kosiaris: mediawiki: Introduce startupregistrystats.pp to record RL modules registry [puppet] - https://gerrit.wikimedia.org/r/528526 (https://phabricator.wikimedia.org/T229836) (owner: Ladsgroup)
[14:48:10] (PS4) Alexandros Kosiaris: mediawiki: Introduce startupregistrystats.pp to record RL modules registry [puppet] - https://gerrit.wikimedia.org/r/528526 (https://phabricator.wikimedia.org/T229836) (owner: Ladsgroup)
[14:49:47] (CR) jerkins-bot: [V: -1] mediawiki: Introduce startupregistrystats.pp to record RL modules registry [puppet] - https://gerrit.wikimedia.org/r/528526 (https://phabricator.wikimedia.org/T229836) (owner: Ladsgroup)
[14:51:09] (CR) Jhedden: "compiler results: https://puppet-compiler.wmflabs.org/compiler1001/17924/" [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden)
[14:51:27] (PS5) Alexandros Kosiaris: mediawiki: Introduce startupregistrystats.pp to record RL modules registry [puppet] - https://gerrit.wikimedia.org/r/528526 (https://phabricator.wikimedia.org/T229836) (owner: Ladsgroup)
[14:53:40] (CR) Jhedden: "Once I verify that this works as expected in codfw I'll run through the other services and submit a patch with the full haproxy configurat" [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden)
[14:55:06] (PS1) Elukey: Add QueryCountStatsMonitor to Druid broker/historicals [puppet] - https://gerrit.wikimedia.org/r/530588
[14:55:20] (CR) Alexandros Kosiaris: [C: +2] mediawiki: Introduce startupregistrystats.pp to record RL modules registry [puppet] - https://gerrit.wikimedia.org/r/528526 (https://phabricator.wikimedia.org/T229836) (owner: Ladsgroup)
[14:57:45] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Cmjohnson) Dell approved my ticket. I talked to the technician today and he will be out Monday morning to replace the mot...
[14:57:46] (PS2) Elukey: Add QueryCountStatsMonitor to Druid broker/historicals [puppet] - https://gerrit.wikimedia.org/r/530588
[14:59:17] (CR) Elukey: [C: +2] Add QueryCountStatsMonitor to Druid broker/historicals [puppet] - https://gerrit.wikimedia.org/r/530588 (owner: Elukey)
[15:04:03] Operations, vm-requests, cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (JHedden) Stalled→Resolved For this phase we're going to install haproxy directly on the openstack controllers. We will not be need...
[15:07:53] (PS2) Ema: profile::tlsproxy::instance: do not autostart nginx [puppet] - https://gerrit.wikimedia.org/r/530578
[15:13:15] (PS2) Bstorm: toolforge: rebranding k8s control plane to control [puppet] - https://gerrit.wikimedia.org/r/530186 (https://phabricator.wikimedia.org/T229009)
[15:15:27] (CR) Bstorm: [C: +2] toolforge: rebranding k8s control plane to control [puppet] - https://gerrit.wikimedia.org/r/530186 (https://phabricator.wikimedia.org/T229009) (owner: Bstorm)
[15:18:50] (CR) Ori.livneh: [C: +1] "I'm confident it could be done efficiently. IIRC, generating the configuration for a wiki with a cold cache took something like 40ms on an" [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[15:21:15] Operations, ops-codfw, DC-Ops, Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (RobH) a:Papaul IRC sync: Chatted with @Mathew.onipe, who let me know they had synced with @papaul to take this offline on Monday to reset the power/bmc.
[15:22:20] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (10RobH) [15:41:54] 10Operations, 10ops-codfw, 10DC-Ops: solve mtp panel issue for row uplinks - https://phabricator.wikimedia.org/T112774 (10Papaul) 05Open→03Resolved Resolving this task since we will not be using and we are not using the patch panels. This will not be setup anymore [15:42:33] !log roll restart of druid broker/historicals to pick up new logging/metrics settings [15:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:14] (03PS1) 10RobH: update to dell skus [software] - 10https://gerrit.wikimedia.org/r/530597 [15:43:45] 10Operations, 10MediaWiki-General, 10Multimedia: Segmentation fault creating thumbnail - https://phabricator.wikimedia.org/T159242 (10Ebe123) Note that even though no error is shown, the image is not of Richmond City but of Richmond //County//, so this bug has not been resolved. [15:46:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (10Papaul) @Mathew.onipe any reason why this is set to high priority ? [15:47:56] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (10Mathew.onipe) p:05High→03Normal [15:48:53] 10Operations, 10netops: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10ayounsi) Indeed, should replace bgpmon.net (going EoL soon). [15:49:01] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): can't SSH to elastic2050.mgmt - https://phabricator.wikimedia.org/T230597 (10Mathew.onipe) @Papaul On second thought, we have other servers and losing one elastic node is Ok. 
So this should be set to normal [15:52:31] (03CR) 10RobH: [C: 03+1] "I'm awaiting feedback from Dell confirming this SKU change is going to be used on all quotations going forward. Once I have that confirma" [software] - 10https://gerrit.wikimedia.org/r/530597 (owner: 10RobH) [15:53:24] (03PS1) 10Jbond: apereo_cas: bump to RC5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530600 [15:58:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: bump to RC5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530600 (owner: 10Jbond) [16:12:14] !log upload prometheus-druid-exporter 0.7-1 to stretch/buster-wikimedia [16:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:17] (03PS1) 10Mholloway: MachineVision (beta): Update handler services to support label lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530605 [16:32:20] (03PS1) 10Elukey: Fix README [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/530607 [16:32:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix README [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/530607 (owner: 10Elukey) [16:36:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) Thanks Chris, hopefully this will solve things. 
[16:38:37] !log add BGP sessions to Scaleway (AS12876) in esams [16:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:56] (03PS1) 10Jbond: apereo_cas: roll back version to 6.1.0-RC4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530608 [16:41:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: roll back version to 6.1.0-RC4 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/530608 (owner: 10Jbond) [16:48:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17920/" [puppet] - 10https://gerrit.wikimedia.org/r/529919 (owner: 10Giuseppe Lavagetto) [16:48:22] (03PS3) 10Giuseppe Lavagetto: envoyproxy: support debian jessie [puppet] - 10https://gerrit.wikimedia.org/r/529919 [16:50:02] 10Operations, 10Wikimedia-Mailing-lists: Set up mailing list for Santali Wikipedia - https://phabricator.wikimedia.org/T230435 (10Manik87) Dear @CDanis, Please see the below information: Name: R Ashwani Banjan Murmu E-mail: ashwani.murmu@gmail.com Thanks again for your support. Manik [17:01:21] 10Operations, 10Wikimedia-Mailing-lists: Set up mailing list for Santali Wikipedia - https://phabricator.wikimedia.org/T230435 (10CDanis) 05Open→03Resolved List created! @Manik87 you should have received an email with your administrator password for the mailing list. Please also add the mailing list to t... [17:02:06] 10Operations, 10Wikimedia-Mailing-lists: Set up mailing list for Santali Wikipedia - https://phabricator.wikimedia.org/T230435 (10CDanis) Oh, also, please note that list administrators are not automatically subscribed to the list -- subscribe yourselves if you want to receive posts. 
[17:20:16] (03PS1) 10Urbanecm: Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) [17:21:12] (03CR) 10jerkins-bot: [V: 04-1] Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [17:26:16] (03PS2) 10Urbanecm: Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) [17:27:42] Urbanecm: that suppress thing is a flow-specific "thing" I think. [17:27:50] totally useless [17:28:05] at least for WMF wikis - its permissions should be on the oversight group indeed [17:31:54] (03PS1) 10Giuseppe Lavagetto: aptrepo: add distributions-wikimedia as well, conform naming [puppet] - 10https://gerrit.wikimedia.org/r/530615 [17:33:16] (03PS2) 10Giuseppe Lavagetto: aptrepo: add distributions-wikimedia as well, conform naming [puppet] - 10https://gerrit.wikimedia.org/r/530615 [17:34:51] <_joe_> third time's the charm? [17:34:54] (03PS3) 10Giuseppe Lavagetto: aptrepo: add distributions-wikimedia as well, conform naming [puppet] - 10https://gerrit.wikimedia.org/r/530615 [17:35:15] <_joe_> apparently not! 
[17:36:10] (03PS4) 10Giuseppe Lavagetto: aptrepo: add distributions-wikimedia as well, conform naming [puppet] - 10https://gerrit.wikimedia.org/r/530615 [17:36:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] aptrepo: add distributions-wikimedia as well, conform naming [puppet] - 10https://gerrit.wikimedia.org/r/530615 (owner: 10Giuseppe Lavagetto) [17:40:31] (03PS1) 10Herron: prometheus: add prometheus ipsec exporter service & config [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [17:45:35] (03CR) 10JJMC89: [C: 03+1] Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [17:45:54] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:48:39] (03CR) 10Daimona Eaytoy: [C: 04-1] Assign all rights assigned to suppress group to oversight group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [17:49:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 131 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:50:48] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 50 probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:51:36] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [17:54:42] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:56:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 21 
probes of 449 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:59:22] XioNoX: I don't see any maintenances that could be responsible for that codfw blip [18:00:01] looking [18:00:22] interesting, even the oob, which is a totally different network [18:00:34] (03PS3) 10Urbanecm: Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) [18:01:07] (03CR) 10Urbanecm: Assign all rights assigned to suppress group to oversight group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:01:59] (03CR) 10Daimona Eaytoy: [C: 03+1] Assign all rights assigned to suppress group to oversight group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530612 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:02:52] could be a telia network issue, as we go through telia from icinga to mr1-codfw.oob (which is a Cyrus one IP) [18:08:49] hauskatze, well, so should we just change suppress to oversight in AbuseFilter extension? [18:08:51] (btw, Flow uses oversight rn) [18:11:12] is it AF this time? [18:11:15] *facepalm* [18:12:37] that doesn't answer my question... [18:12:38] ...but there's no one to ask now [18:14:57] Hmm [18:53:21] (03PS1) 10Eevans: Revert "Deploy 2019-08-14-210839-production Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/530627 (https://phabricator.wikimedia.org/T229697) [18:53:23] (03PS1) 10Eevans: Revert "sessionstore: (Temporarily )use HTTP for liveness" [deployment-charts] - 10https://gerrit.wikimedia.org/r/530628 (https://phabricator.wikimedia.org/T229697) [18:55:00] (03CR) 10Eevans: "Reverting as discussed." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/530627 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [18:55:19] (03CR) 10Eevans: "Reverting as discussed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/530628 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [18:55:44] (03CR) 10Eevans: [V: 03+2 C: 03+2] Revert "Deploy 2019-08-14-210839-production Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/530627 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [18:55:54] (03CR) 10Eevans: [V: 03+2 C: 03+2] Revert "sessionstore: (Temporarily )use HTTP for liveness" [deployment-charts] - 10https://gerrit.wikimedia.org/r/530628 (https://phabricator.wikimedia.org/T229697) (owner: 10Eevans) [18:57:09] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [18:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:40] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 38 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:09:40] RECOVERY - Check the Netbox report-s- cables for fail status. on netmon1002 is OK: cables.Cables OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:12:18] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 27 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:13:02] cdanis: digging a bit more, seems like Telia is having issues around the east coast [19:14:49] oh interesting [19:16:14] the probes failing seem to be in Europe [19:16:49] nothing super clear, but that's all using the measurement link above and doing some reverse mtr to the ones that failed [19:38:12] Hey all - I'd like to deploy sec patch for T230576 (ex:MobileFrontend) now.
[19:48:24] !log Deployed security patch for T230576 (ex:MobileFrontend) [19:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:04] PROBLEM - Elasticsearch HTTPS for production-search-eqiad on elastic1046 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [20:17:22] PROBLEM - Check size of conntrack table on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:17:22] PROBLEM - Check systemd state on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:26] PROBLEM - configured eth on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:17:32] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:17:44] PROBLEM - dhclient process on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [20:18:00] PROBLEM - SSH on elastic1046 is CRITICAL: connect to address 10.64.16.70 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:18:06] PROBLEM - DPKG on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:18:08] PROBLEM - Disk space on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1046&var-datasource=eqiad+prometheus/ops 
[20:18:14] PROBLEM - Elasticsearch HTTPS for production-search-psi-eqiad on elastic1046 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search [20:22:21] ^expired downtime [20:32:36] (03PS5) 10Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [20:33:36] (03CR) 10jerkins-bot: [V: 04-1] openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [20:37:59] (03PS1) 10Kosta Harlan: Echo: Enable poll for updates feature on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530639 (https://phabricator.wikimedia.org/T219222) [20:38:21] (03PS2) 10Kosta Harlan: Echo: Enable poll for updates feature on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530639 (https://phabricator.wikimedia.org/T219222) [20:39:49] (03PS1) 10Kosta Harlan: Echo: Enable poll for updates feature on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530640 (https://phabricator.wikimedia.org/T219222) [20:42:16] (03PS6) 10Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [20:55:18] (03PS7) 10Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [21:03:48] 10Operations, 10ops-ulsfo: refresh/replace scs-ulsfo - https://phabricator.wikimedia.org/T230077 (10RobH) [21:12:54] 10Operations, 10observability: automation: issue reminders for about-to-expire downtimes - https://phabricator.wikimedia.org/T230633 (10CDanis) [21:13:00] 10Operations, 10observability: automation: issue reminders for about-to-expire downtimes - https://phabricator.wikimedia.org/T230633 
(10CDanis) p:05Triage→03Normal [21:16:33] 10Operations, 10Jade, 10Scoring-platform-team, 10TechCom, and 4 others: Deploy Jade extension MVP to production - https://phabricator.wikimedia.org/T183381 (10Halfak) [21:30:00] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 131.1 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen