[00:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T0000). [00:58:51] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:01:59] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:20:47] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:27:05] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:36:29] PROBLEM - High average GET latency for mw requests on 
appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:37:39] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10observability, and 9 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Krinkle) 05Open→03Resolved The last point of the task description is wmerrors taking care of showing the error page... [01:37:47] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [01:38:01] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10observability, and 9 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Krinkle) [01:39:39] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:40:35] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:55:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:58:27] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:33:49] (03PS1) 10Tim Starling: Add extra key for tstarling [puppet] - 10https://gerrit.wikimedia.org/r/533125 [02:46:39] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19445624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:46:59] RECOVERY - Check the Netbox report-s- librenms for fail status. 
on netmon1002 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [02:49:09] (03PS1) 10CRusnov: librenms: Exclude problematic InventoryItem type as requested [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/533128 (https://phabricator.wikimedia.org/T231502) [02:49:47] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 456 and 62 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:52:05] RECOVERY - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [02:54:01] (03CR) 10Mathew.onipe: "Eqiad and codfw have recovered. shard sizes for commonswiki_content are back to normal. I'll still leave this patch here for while to see " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533023 (https://phabricator.wikimedia.org/T231446) (owner: 10Mathew.onipe) [02:54:14] (03CR) 10Mathew.onipe: [C: 04-1] "do not merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533023 (https://phabricator.wikimedia.org/T231446) (owner: 10Mathew.onipe) [02:54:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:01:15] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:03:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Mathew.onipe) 05Resolved→03Open elastic1029 is back on icinga showing memory errors. see https://icinga.wikimedia.org/cgi-bi... [03:04:23] (03PS3) 10CRusnov: Add script to import management DNS entries [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/529977 (https://phabricator.wikimedia.org/T228670) [03:07:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:12:11] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:20:27] PROBLEM - cassandra-b service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:20:29] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:20:59] PROBLEM - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [03:21:37] PROBLEM - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [03:23:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:26:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:38:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:40:55] RECOVERY - cassandra-b service on restbase2017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:40:57] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:42:03] RECOVERY - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-b valid until 2020-11-29 09:26:18 +0000 (expires in 458 days) https://phabricator.wikimedia.org/T120662 [03:42:55] RECOVERY - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is OK: TCP OK - 0.030 second response time on 10.192.48.122 port 9042 https://phabricator.wikimedia.org/T93886 [03:43:37] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:59:01] (03PS1) 10CRusnov: Add script to rotate backup dumps, and dump with timestamp [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) [04:02:51] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 33.46 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:05:59] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 101 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [04:15:03] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:19:51] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:22:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:28:49] 10Operations, 10Traffic: Perform HTTPS redirect without crossing domain boundaries for non canonical domains - https://phabricator.wikimedia.org/T231513 (10Vgutierrez) [04:29:31] 10Operations, 10Traffic: Perform HTTPS redirect without crossing domain boundaries for non canonical domains - https://phabricator.wikimedia.org/T231513 (10Vgutierrez) p:05Triage→03Normal [04:31:09] 10Operations, 10Traffic: Enable HSTS for non canonical domains using the ncredir service - https://phabricator.wikimedia.org/T231514 (10Vgutierrez) [04:31:38] 10Operations, 10Traffic: Enable HSTS for non canonical domains using the ncredir service - https://phabricator.wikimedia.org/T231514 (10Vgutierrez) p:05Triage→03Normal [04:42:26] (03PS1) 10Vgutierrez: ncredir: Perform HTTPS upgrade without crossing domain boundaries [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) [04:47:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:49:03] 10Operations, 10netops: Review switches ACL to connect from tools-bastion to dbproxy1018 - https://phabricator.wikimedia.org/T231418 (10Marostegui) Works! Thank you! [04:49:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:52:23] (03PS1) 10Marostegui: report_users: Whitelist dbproxy1018 [software] - 10https://gerrit.wikimedia.org/r/533136 [04:53:50] (03CR) 10Marostegui: [C: 03+2] report_users: Whitelist dbproxy1018 [software] - 10https://gerrit.wikimedia.org/r/533136 (owner: 10Marostegui) [04:54:03] (03PS5) 10Marostegui: mariadb: Promote db1133 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/529331 (https://phabricator.wikimedia.org/T229657) [04:54:11] (03PS4) 10Marostegui: wmnet: Promote db1133 to m5 master [dns] - 10https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657) [04:54:22] (03PS3) 10Marostegui: mariadb: Promote db1109 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) [04:54:31] (03PS2) 10Marostegui: wmnet: Update s8-master record [dns] - 10https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) [04:56:34] (03PS1) 10Marostegui: mariadb: Decommission db2053 [puppet] - 10https://gerrit.wikimedia.org/r/533139 (https://phabricator.wikimedia.org/T231407) [04:56:49] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Marostegui) 
[04:57:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [04:58:01] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2053 [puppet] - 10https://gerrit.wikimedia.org/r/533139 (https://phabricator.wikimedia.org/T231407) (owner: 10Marostegui) [04:58:47] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Marostegui) [04:59:06] !log Remove db2053 from tendril and zarcillo T231407 [04:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:13] T231407: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 [05:00:14] (03PS1) 10Vgutierrez: ncredir: Enable HSTS with max-age set to 1 week [puppet] - 10https://gerrit.wikimedia.org/r/533140 (https://phabricator.wikimedia.org/T231514) [05:00:36] !log Stop MySQL on db2053 for decommission T231407 [05:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:30] 10Operations, 10ops-codfw, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Marostegui) a:05Marostegui→03RobH [05:01:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2053.codfw.wmnet - https://phabricator.wikimedia.org/T231407 (10Marostegui) This host is ready for #dc-ops to decommission [05:02:07] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:03:26] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18100/" [puppet] - 10https://gerrit.wikimedia.org/r/533140 
(https://phabricator.wikimedia.org/T231514) (owner: 10Vgutierrez) [05:08:15] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:10:06] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, and 2 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10aaron) >>! In T231086#5434295, @CDanis wrote: > I suggest: > * when ATS gets a 404 response from one cluster, retry on the other activ... [05:12:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:19:55] (03PS1) 10Vgutierrez: redirects.dat: Get rid of non canonical domains rules [puppet] - 10https://gerrit.wikimedia.org/r/533141 (https://phabricator.wikimedia.org/T133548) [05:19:57] (03PS1) 10Vgutierrez: redirects.dat: Enforce HTTPS for canonnical domains [puppet] - 10https://gerrit.wikimedia.org/r/533142 (https://phabricator.wikimedia.org/T133548) [05:21:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:24:25] (03PS2) 10Vgutierrez: redirects.dat: Enforce HTTPS for canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/533142 (https://phabricator.wikimedia.org/T133548) [05:32:56] (03PS1) 10Marostegui: dbproxy1018: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/533144 (https://phabricator.wikimedia.org/T202367) [05:33:34] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/533144 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [05:37:53] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:40:51] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:45:31] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:47:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:50:56] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:52:12] hmm something got deployed yesterday that's giving hiccups to php-fpm? [05:54:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:57:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:00:20] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:07:08] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:08:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:11:40] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:28:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:35:12] (03PS8) 10Jeena Huneidi: Add restbase chart (port from local-charts) [deployment-charts] - 10https://gerrit.wikimedia.org/r/517557 (https://phabricator.wikimedia.org/T224935) [06:36:24] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:41:06] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:44:12] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:01:12] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:04:55] 68c9e46ab0c2171a74619cecc445d55c28b31eab these alerts are relatively new, perhaps we should up the threshold a bit [07:06:05] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/531664/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/531691/ [07:08:09] godog: any thoughts? ^^ [07:08:14] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:12:54] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:13:36] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:20:26] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 56 probes of 451 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [07:22:38] apergos: hmmm checking the last 7 days I'd say that php-fpm response time got substantially worse since last UTC night [07:22:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:24:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:24:33] I didn't see anything deployed near the time the alerts started though. 
[07:26:00] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 24 probes of 451 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [07:29:38] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) [07:29:49] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) p:05Triage→03Normal [07:30:02] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) [07:30:38] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:31:58] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) [07:32:12] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:35:16] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:43:11] (03PS1) 10Dzahn: ATS/varnish: switch iegreview to miscweb backend and use TLS [puppet] - 10https://gerrit.wikimedia.org/r/533154 (https://phabricator.wikimedia.org/T210411) [07:44:46] (03PS2) 10Dzahn: ATS/varnish: switch iegreview to miscweb backend and use TLS [puppet] - 10https://gerrit.wikimedia.org/r/533154 (https://phabricator.wikimedia.org/T210411) [07:48:46] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [07:50:52] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:51:12] (03PS1) 10Dzahn: racktables: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/533155 (https://phabricator.wikimedia.org/T224247) [07:53:42] (03PS1) 10Dzahn: iegreview: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/533157 (https://phabricator.wikimedia.org/T224247) [07:57:04] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:57:09] (03PS1) 10Dzahn: wikimania_scholarships: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533158 (https://phabricator.wikimedia.org/T224247) [08:00:07] (03Abandoned) 10Dzahn: ATS/varnish: replace krypton with miscweb1001, rename director [puppet] - 10https://gerrit.wikimedia.org/r/532695 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [08:00:15] (03PS1) 10KartikMistry: Update cxserver to 2019-08-29-074757-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/533160 (https://phabricator.wikimedia.org/T230200) [08:00:17] (03PS1) 10Dzahn: misc_apps::httpd: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533159 (https://phabricator.wikimedia.org/T224247) [08:01:44] PROBLEM - High average GET 
latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:02:57] (PS1) Dzahn: wikistats (cloud): remove php5 support [puppet] - https://gerrit.wikimedia.org/r/533161
[08:03:30] <_joe_> !log live tweak on mw1270: apc.ttl removed; apc size 4 GB; tideways disabled.
[08:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:08:45] (CR) KartikMistry: [V: +2 C: +2] Update cxserver to 2019-08-29-074757-production [deployment-charts] - https://gerrit.wikimedia.org/r/533160 (https://phabricator.wikimedia.org/T230200) (owner: KartikMistry)
[08:11:20] <_joe_> !log disabling zend GC on mw1347, testing a hypothesis for T231011
[08:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:26] T231011: Mysterious, coordinated slowdowns every ~ 25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011
[08:14:09] I'm planning to update cxserver. Is anything else going on that could cause issues with it, _joe_?
[08:14:41] <_joe_> kart_: I'm around for 10 more minutes, and no one else with k8s experience is
[08:14:58] <_joe_> so you have no support on our side
[08:15:07] <_joe_> other than that, do what you need
[08:15:12] OK. It is pretty quick. In case of trouble, I'll revert.
[08:15:19] Thanks!
[08:15:40] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
[08:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:10] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[08:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:16] (CR) Dzahn: [C: +2] ATS/varnish: switch iegreview to miscweb backend and use TLS [puppet] - https://gerrit.wikimedia.org/r/533154 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[08:21:06] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[08:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:21] (PS1) Dzahn: Revert "webserver_misc_apps: only include envoy if on stretch" [puppet] - https://gerrit.wikimedia.org/r/533163
[08:23:52] !log Updated cxserver to 2019-08-29-074757-production (T230200)
[08:23:54] !log switching iegreview app to stretch backend with TLS and discovery record
[08:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:57] T230200: Whole article as one section - https://phabricator.wikimedia.org/T230200
[08:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:15] !log running puppet on cp-text_eqiad
[08:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:28] !log Change min_replicas to 2 on s3 for eqiad and codfw T231019
[08:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:33] T231019: set min_replicas on database sections in dbctl - https://phabricator.wikimedia.org/T231019
[08:32:50] Operations, Discovery-Search (Current work): Alert when a jvm hits more than 100 old gc ops/hour - https://phabricator.wikimedia.org/T231516
(Mathew.onipe)
[08:36:00] !log Change min_replicas to 4 on s4 for eqiad and codfw T231019
[08:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:07] T231019: set min_replicas on database sections in dbctl - https://phabricator.wikimedia.org/T231019
[08:38:33] Operations: two user pages on meta can't be rendered - https://phabricator.wikimedia.org/T231522 (ArielGlenn) These two pages are aliases for the same contributor, and the problematic revisions were added in 2016 on each page, so this is some sort of regression (php? babel? combo?)
[08:38:51] Operations, MediaWiki-extensions-Babel: two user pages on meta can't be rendered - https://phabricator.wikimedia.org/T231522 (ArielGlenn)
[08:39:27] !log cp1085 - kill stuck puppet processes and run manually
[08:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:44] !log cp1085 - puppet run stuck after Loading facts, possibly related to ACKed IPMI sensor status issue in Icinga
[08:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:45] !log Change min_replicas to 4 on s2 for eqiad and codfw T231019
[08:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:51] T231019: set min_replicas on database sections in dbctl - https://phabricator.wikimedia.org/T231019
[09:03:23] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (Dzahn)
[09:08:07] !log Reboot db1133 to upgrade kernel - T229657
[09:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:12] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657
[09:15:29] (PS1) Marostegui: install_server: Do not reimage dbproxy1018,dbproxy1019 [puppet] - https://gerrit.wikimedia.org/r/533171 (https://phabricator.wikimedia.org/T202367)
[09:19:02]
10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [09:19:35] !log iegreview.wikimedia.org switched to new stretch backend and using TLS (T210411) [09:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:42] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [09:21:07] (03PS1) 10KartikMistry: WIP: Move ContentTranslation out of Beta in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) [09:21:22] (03CR) 10Dzahn: [C: 03+2] racktables: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/533155 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [09:21:31] (03PS2) 10Dzahn: racktables: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/533155 (https://phabricator.wikimedia.org/T224247) [09:21:50] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage dbproxy1018,dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/533171 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [09:23:19] (03PS3) 10Dzahn: racktables: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/533155 (https://phabricator.wikimedia.org/T224247) [09:24:54] (03PS2) 10Dzahn: iegreview: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/533157 (https://phabricator.wikimedia.org/T224247) [09:27:03] (03CR) 10Dzahn: [C: 03+2] iegreview: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/533157 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [09:43:41] 10Operations, 10Traffic: Improve ATS prometheus metrics - https://phabricator.wikimedia.org/T231533 (10Vgutierrez) [09:47:28] (03PS1) 10Dzahn: ATS/varnish: switch scholarschips to miscweb and use TLS [puppet] - 10https://gerrit.wikimedia.org/r/533175 (https://phabricator.wikimedia.org/T210411) [09:49:41] (03PS2) 10Dzahn: 
ATS/varnish: switch wikimania scholarships to miscweb, use TLS [puppet] - 10https://gerrit.wikimedia.org/r/533175 (https://phabricator.wikimedia.org/T210411) [09:50:42] (03PS3) 10Dzahn: ATS/varnish: switch wikimania scholarships to miscweb, use TLS [puppet] - 10https://gerrit.wikimedia.org/r/533175 (https://phabricator.wikimedia.org/T210411) [10:04:29] (03PS1) 10Vgutierrez: prometheus: Ship a custom metrics file for trafficserver_exporter [puppet] - 10https://gerrit.wikimedia.org/r/533178 (https://phabricator.wikimedia.org/T231533) [10:06:28] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Ship a custom metrics file for trafficserver_exporter [puppet] - 10https://gerrit.wikimedia.org/r/533178 (https://phabricator.wikimedia.org/T231533) (owner: 10Vgutierrez) [10:07:50] (03PS2) 10Vgutierrez: prometheus: Ship a custom metrics file for trafficserver_exporter [puppet] - 10https://gerrit.wikimedia.org/r/533178 (https://phabricator.wikimedia.org/T231533) [10:08:57] (03CR) 10Mobrovac: [C: 04-1] "Hm so, this will wrap the restbase ports in tls, which means all of the internal customers need to switch to tls as soon as this goes out." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:14:46] (03CR) 10Ema: "> Hm so, this will wrap the restbase ports in tls, which means all of" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:18:50] (03PS1) 10Dzahn: ATS: configure "never_cache" for miscweb1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/533181 (https://phabricator.wikimedia.org/T224247) [10:26:14] 10Operations, 10Traffic: Allow blocking requests from specific networks on the edge - https://phabricator.wikimedia.org/T231063 (10ema) 05Open→03Resolved This is now done. 
[10:28:43] 10Operations, 10Traffic, 10Patch-For-Review: cergen fails signing CSR - https://phabricator.wikimedia.org/T231423 (10ema) @Ottomata: I understand that this is now fixed, can you confirm and close the task if so? [10:30:06] (03CR) 10Ema: [C: 03+2] Revert "ATS: temporarily use plain HTTP to access docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/533041 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [10:30:08] (03PS2) 10Ema: Revert "ATS: temporarily use plain HTTP to access docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/533041 (https://phabricator.wikimedia.org/T227432) [10:31:01] (03PS1) 10Dzahn: mediawiki::webserver: send stderr of hhvm-restart script to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/533184 [10:38:40] (03CR) 10Mobrovac: [C: 03+1] "From IRC:" [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:38:58] (03PS2) 10Dzahn: mediawiki::webserver: send stderr of hhvm-restart script to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/533184 [10:47:11] 10Operations, 10DBA: Switchover m1 primary master: db1063 to db1135 - https://phabricator.wikimedia.org/T231403 (10akosiaris) Sep 10 16:00UTC sounds okish to me as far as `bacula` goes. We will probably have a couple of full backups (it's the start of the month, when full backups happen and they might not have... [10:48:22] (03PS3) 10Dzahn: mediawiki::webserver: send stderr of hhvm-restart script to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/533184 [10:48:48] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10ema) 05Open→03Resolved We have managed to generate a proper certificate for the docker-registry origin servers, and cp1075 is now back to using TLS to conne... 
[10:49:08] (03PS4) 10Dzahn: mediawiki::webserver: send stderr of hhvm-restart script to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/533184 [10:49:56] (03CR) 10Dzahn: [C: 03+2] "for now to avoid cron spam" [puppet] - 10https://gerrit.wikimedia.org/r/533184 (owner: 10Dzahn) [10:50:35] 10Operations, 10Traffic, 10Patch-For-Review: Improve ATS prometheus metrics - https://phabricator.wikimedia.org/T231533 (10ema) p:05Triage→03Normal [10:50:47] 10Operations, 10ops-eqiad, 10Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10ema) p:05Triage→03Normal [10:50:55] 10Operations, 10DBA: Switchover m1 primary master: db1063 to db1135 - https://phabricator.wikimedia.org/T231403 (10Marostegui) Thanks @akosiaris! As spoken on IRC, no need to re-schedule because of bacula. [10:54:38] 10Operations, 10DBA: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Marostegui) [10:54:46] (03CR) 10Dzahn: "we should follow-up with a change to the Python script to handle actual errors different from info there and then revert this?" [puppet] - 10https://gerrit.wikimedia.org/r/533184 (owner: 10Dzahn) [10:57:59] 10Operations, 10observability, 10Discovery-Search (Current work): Alert when a jvm hits more than 100 old gc ops/hour - https://phabricator.wikimedia.org/T231516 (10Peachey88) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T1100). [11:00:05] matthiasmullie: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:17] (PS1) Mathew.onipe: icinga: add old JVM GC check [puppet] - https://gerrit.wikimedia.org/r/533189 (https://phabricator.wikimedia.org/T231516)
[11:00:19] o/
[11:04:12] Operations, DBA: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (Marostegui)
[11:04:32] o/
[11:04:37] I can SWAT
[11:04:39] Operations, DBA: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (Marostegui) p:Triage→Normal
[11:05:10] matthiasmullie: you’re a deployer right?
[11:05:23] do you want to deploy your changes?
[11:05:31] s/changes/backports/
[11:08:52] Operations, Traffic: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 (ema) p:Triage→High
[11:10:07] Operations, observability, Discovery-Search (Current work), Patch-For-Review: Alert when a jvm hits more than 100 old gc ops/hour - https://phabricator.wikimedia.org/T231516 (Mathew.onipe) On another note, I think this check makes sense for other clusters as well
[11:10:49] Lucas_WMDE: yeah I can do them myself
[11:11:06] thanks for the offer, though!
[11:11:12] ok :)
[11:12:46] (CR) Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/18105/" [puppet] - https://gerrit.wikimedia.org/r/533189 (https://phabricator.wikimedia.org/T231516) (owner: Mathew.onipe)
[11:15:20] (CR) Volans: "Replies inline" (6 comments) [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[11:16:11] (PS2) Dzahn: ATS: configure "never-cache" for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/533181 (https://phabricator.wikimedia.org/T224247)
[11:19:04] Operations, Traffic: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 (ema) Thanks for filing this bug and for providing all request/response headers, very useful! I'm currently trying to reproduce the issue with [[ https://en.wikipedia.org/...
[11:22:23] (CR) Ema: [C: +1] ATS: configure "never-cache" for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/533181 (https://phabricator.wikimedia.org/T224247) (owner: Dzahn)
[11:24:08] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[11:24:11] (CR) Dzahn: [C: +2] ATS: configure "never-cache" for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/533181 (https://phabricator.wikimedia.org/T224247) (owner: Dzahn)
[11:24:15] (PS3) Dzahn: ATS: configure "never-cache" for webserver-misc-apps.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/533181 (https://phabricator.wikimedia.org/T224247)
[11:24:51] (CR) Ema: [C: +1] ATS/varnish: switch wikimania scholarships to miscweb, use TLS [puppet] - https://gerrit.wikimedia.org/r/533175 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[11:25:43] (PS4) Dzahn: ATS/varnish: switch wikimania
scholarships to miscweb, use TLS [puppet] - 10https://gerrit.wikimedia.org/r/533175 (https://phabricator.wikimedia.org/T210411) [11:26:09] (03CR) 10Ema: [C: 03+1] ATS: remove wikiba.se backend [puppet] - 10https://gerrit.wikimedia.org/r/532976 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [11:28:20] matthiasmullie: everything okay? I’m not seeing any log messages [11:28:41] oh, gate-and-submit-swat is being slow… [11:29:02] yeah, these take a little while :p [11:29:18] almost there ^^ [11:31:23] (03CR) 10Dzahn: [C: 03+2] ATS/varnish: switch wikimania scholarships to miscweb, use TLS [puppet] - 10https://gerrit.wikimedia.org/r/533175 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [11:35:09] (03PS1) 10Vgutierrez: prometheus: Add basic ATS network and ssl metrics [puppet] - 10https://gerrit.wikimedia.org/r/533193 (https://phabricator.wikimedia.org/T231533) [11:37:16] !log scholarships.wikimedia.org app moving to new backend and using TLS. backend upgraded from jessie to stretch and PHP7 (T210411) [11:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:24] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [11:38:17] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [11:38:39] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [11:38:49] (03CR) 10Dzahn: [C: 03+2] wikimania_scholarships: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533158 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [11:39:01] (03PS2) 10Dzahn: wikimania_scholarships: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533158 (https://phabricator.wikimedia.org/T224247) [11:44:31] !log mlitn@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/WikibaseMediaInfo: [SDC] 
Add "copy statements" functionality (MediaInfo part) (duration: 00m 54s) [11:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:26] (03PS2) 10Dzahn: Revert "webserver_misc_apps: only include envoy if on stretch" [puppet] - 10https://gerrit.wikimedia.org/r/533163 [11:45:34] !log mlitn@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/UploadWizard: [SDC] Add "copy statements" functionality (UploadWizard part) (duration: 00m 52s) [11:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:51] (03PS3) 10Dzahn: Revert "webserver_misc_apps: only include envoy if on stretch" [puppet] - 10https://gerrit.wikimedia.org/r/533163 [11:46:21] (03CR) 10Dzahn: [C: 03+2] Revert "webserver_misc_apps: only include envoy if on stretch" [puppet] - 10https://gerrit.wikimedia.org/r/533163 (owner: 10Dzahn) [11:49:35] (03PS1) 10Dzahn: planet: include envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/533197 (https://phabricator.wikimedia.org/T210411) [11:49:53] (03PS2) 10Dzahn: misc_apps::httpd: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533159 (https://phabricator.wikimedia.org/T224247) [11:50:37] (03CR) 10Dzahn: [C: 03+2] misc_apps::httpd: remove jessie/php5 support [puppet] - 10https://gerrit.wikimedia.org/r/533159 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [11:50:58] (03CR) 10Volans: [C: 04-1] "Needs clarification on why we need this." 
(1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/533131 (https://phabricator.wikimedia.org/T231512) (owner: CRusnov)
[11:51:14] Done
[11:53:51] jouncebot: now
[11:53:51] For the next 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T1100)
[11:54:06] nothing else to SWAT, and we don’t have much time left anyways
[11:54:09] !log EU SWAT done
[11:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:36] (CR) Volans: [C: +1] "LGTM" [software/netbox-reports] - https://gerrit.wikimedia.org/r/533128 (https://phabricator.wikimedia.org/T231502) (owner: CRusnov)
[11:57:05] Operations, decommission, serviceops: decom krypton.eqiad.wmnet - https://phabricator.wikimedia.org/T231546 (Dzahn)
[12:00:15] (PS3) Dzahn: site/install_server: decom krypton.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/532701 (https://phabricator.wikimedia.org/T224247)
[12:10:41] (PS4) Dzahn: site/install_server: decom krypton.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/532701 (https://phabricator.wikimedia.org/T224247)
[12:11:57] (CR) Dzahn: [C: +2] site/install_server: decom krypton.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/532701 (https://phabricator.wikimedia.org/T224247) (owner: Dzahn)
[12:12:06] (PS5) Dzahn: site/install_server: decom krypton.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/532701 (https://phabricator.wikimedia.org/T224247)
[12:19:49] Operations, decommission, serviceops: decom krypton.eqiad.wmnet - https://phabricator.wikimedia.org/T231546 (Dzahn)
[12:20:39] Operations, decommission, serviceops: decom krypton.eqiad.wmnet - https://phabricator.wikimedia.org/T231546 (Dzahn)
[12:22:27] Operations, serviceops, Patch-For-Review: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (Dzahn)
[12:22:35] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [12:23:32] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [12:24:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:25:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:26:14] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:26:44] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:30] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) The following services have been moved away from krypton miscweb1001. - http://racktables.wikimedia.org - http://iegreview.wikimedia.org - http://scholarships.wikimedia.or... 
[12:28:22] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:30:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:31:56] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) TODO: DB config needs to use codfw proxy if puppet role applied on codfw node [12:33:02] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:33:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:00:05] zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - European version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T1300). 
[13:00:29] thank you jouncebot but train looks blocked at the moment :/
[13:10:12] (CR) Ema: [C: +1] prometheus: Ship a custom metrics file for trafficserver_exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/533178 (https://phabricator.wikimedia.org/T231533) (owner: Vgutierrez)
[13:10:38] (CR) Ema: [C: +1] prometheus: Add basic ATS network and ssl metrics [puppet] - https://gerrit.wikimedia.org/r/533193 (https://phabricator.wikimedia.org/T231533) (owner: Vgutierrez)
[13:15:32] Operations, WMF-Legal, serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (Dzahn) One thing that you can already do is create https://transparency.wikimedia.org/historical/ since that is just inside the content repo that is...
[13:17:36] Operations, Analytics, Analytics-Kanban, Discovery, and 2 others: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (Nuria) Open→Resolved
[13:17:41] Operations, Analytics, Discovery, Research-Backlog, Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Nuria)
[13:22:41] Operations, Traffic: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 (ema) a:ema
[13:32:01] (PS2) Mathew.onipe: icinga: add old JVM GC check for elastic [puppet] - https://gerrit.wikimedia.org/r/533189 (https://phabricator.wikimedia.org/T231516)
[13:40:05] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:40:33] (CR) BBlack: ncredir: Perform HTTPS upgrade without crossing domain boundaries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/533133
(https://phabricator.wikimedia.org/T231513) (owner: Vgutierrez)
[13:42:54] Operations, Traffic, Patch-For-Review: cergen fails signing CSR - https://phabricator.wikimedia.org/T231423 (Ottomata) Open→Resolved
[13:43:49] (PS1) Andrew Bogott: prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828)
[13:44:15] (PS1) Urbanecm: Fix "Assign all rights assigned to suppress group to oversight group" [mediawiki-config] - https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601)
[13:45:06] Daimona: could you review ^^, please?
[13:45:48] Yep
[13:47:40] thx
[13:48:15] (CR) Daimona Eaytoy: [C: -1] Fix "Assign all rights assigned to suppress group to oversight group" (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm)
[13:49:20] (PS2) Urbanecm: Fix "Assign all rights assigned to suppress group to oversight group" [mediawiki-config] - https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601)
[13:49:23] Daimona: like that?
[13:52:36] (PS1) Vgutierrez: Point wikimania.org to the non canonical redirect service [dns] - https://gerrit.wikimedia.org/r/533213 (https://phabricator.wikimedia.org/T133548)
[13:52:43] (CR) DCausse: [C: +1] icinga: add old JVM GC check for elastic [puppet] - https://gerrit.wikimedia.org/r/533189 (https://phabricator.wikimedia.org/T231516) (owner: Mathew.onipe)
[13:52:53] (CR) Daimona Eaytoy: [C: +1] Fix "Assign all rights assigned to suppress group to oversight group" [mediawiki-config] - https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm)
[13:53:01] Yes. Haven't tested, but can help testing on mwdebug
[13:53:17] jouncebot: now
[13:53:17] For the next 1 hour(s) and 6 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T1300)
[13:53:37] hmm, we have train now, but it seems to be blocked
[13:54:07] zeljkof: do you think I can try the commit above (https://gerrit.wikimedia.org/r/533211) at mwdebug (and deploy if it works) now?
[13:55:39] Urbanecm: is it urgent? can it wait until next swat window? if it's urgent, go ahead, train is blocked, I don't plan to deploy anything soon
[13:55:57] Heh, train is very blocked this week...
[13:56:11] Ever heard of Trenitalia?
[13:56:40] Italian railroad company?
[13:56:41] Anyway, I believe we can wait
[13:56:51] (PS2) Vgutierrez: redirects.dat: Get rid of non canonical domains rules [puppet] - https://gerrit.wikimedia.org/r/533141 (https://phabricator.wikimedia.org/T133548)
[13:57:00] Yeah, probably
[13:57:01] And subject of tons of memes because of the constant delays, yep
[13:57:14] That's like Internet Explorer but on rails
[13:57:20] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 93 probes of 452 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:58:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 112 probes of 452 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:58:05] thank you Daimona for the review
[13:58:18] (PS2) Andrew Bogott: prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828)
[13:58:26] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 120 probes of 452 (alerts on 35) -
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [13:59:08] (03PS1) 10BBlack: Remove wikimedia.ee zonefile [dns] - 10https://gerrit.wikimedia.org/r/533216 [13:59:13] np ;) [14:00:25] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - 10https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) (owner: 10Andrew Bogott) [14:01:08] (03PS3) 10Vgutierrez: redirects.dat: Enforce HTTPS for canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/533142 (https://phabricator.wikimedia.org/T133548) [14:01:16] (03PS3) 10Andrew Bogott: prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - 10https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) [14:01:18] (03CR) 10BBlack: [C: 03+1] redirects.dat: Get rid of non canonical domains rules [puppet] - 10https://gerrit.wikimedia.org/r/533141 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [14:01:51] (03CR) 10BBlack: [C: 03+2] Remove wikimedia.ee zonefile [dns] - 10https://gerrit.wikimedia.org/r/533216 (owner: 10BBlack) [14:02:09] Urbanecm: would you have time to run a couple of scripts? [14:02:57] (03PS1) 10Vgutierrez: Point wikimedia.community to the non canonical redirect service [dns] - 10https://gerrit.wikimedia.org/r/533219 (https://phabricator.wikimedia.org/T133548) [14:04:01] Daimona: sure, which ones? 
[14:04:12] https://phabricator.wikimedia.org/T231542#5451131 [14:04:27] ty [14:04:33] (PS1) Jhedden: labstore: check nfs v4 cluster status with rpcinfo [puppet] - https://gerrit.wikimedia.org/r/533220 (https://phabricator.wikimedia.org/T229448) [14:04:55] (PS4) Andrew Bogott: prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) [14:06:35] Daimona: looking [14:06:43] Thanks [14:06:58] (CR) jerkins-bot: [V: -1] prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) (owner: Andrew Bogott) [14:07:41] (PS2) Jhedden: labstore: check nfs v4 cluster status with rpcinfo [puppet] - https://gerrit.wikimedia.org/r/533220 (https://phabricator.wikimedia.org/T229448) [14:12:25] (PS5) Andrew Bogott: prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) [14:13:34] Daimona: I'm not sure I understand what you want [14:13:41] (CR) BBlack: [C: +1] redirects.dat: Enforce HTTPS for canonical domains [puppet] - https://gerrit.wikimedia.org/r/533142 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [14:14:00] The last two code bits should be executed via shell.php/eval.php, I'm unsure which is best suited [14:14:13] First $id = SqlBlobStore::makeAddressFromTextId( 850575 ); to get the id [14:14:20] Then put it in MediaWikiServices::getInstance()->getBlobStore()->getBlob( $id ); [14:14:28] It should throw an exception which I'd like to see [14:15:02] (CR) BBlack: [C: +1] ncredir: Enable HSTS with max-age set to 1 week [puppet] - https://gerrit.wikimedia.org/r/533140 (https://phabricator.wikimedia.org/T231514) (owner: Vgutierrez) [14:16:10] looking [14:16:42] !log depool ats-be on cp1075 to
investigate T231504 [14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:49] T231504: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 [14:17:00] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [14:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:33] (PS1) CDanis: trafficserver: fix grafana link [puppet] - https://gerrit.wikimedia.org/r/533229 [14:19:00] Daimona: it doesn't seem to throw any exception [14:19:03] see https://phabricator.wikimedia.org/P9000 [14:19:04] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 24 probes of 452 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:19:55] Indeed, thanks, you and hashar did it at the same time :D That's even worse [14:20:05] oops sorry [14:20:10] Maybe there's a bug in fetchText [14:20:15] Ahah np [14:20:29] I love the moment when you investigate a bug and find 3 more bugs [14:21:37] Daimona: why should there be a bug in fetchText.php? [14:21:40] https://phabricator.wikimedia.org/P9001 [14:22:07] Per T231542#5450927 [14:22:08] T231542: AFPData.php: Refusing to cast DUNDEFINED to something else - https://phabricator.wikimedia.org/T231542 [14:22:24] But actually... I've just realized that maybe the ID should be inserted after running the command, and not together... [14:22:35] that's a bug in hashar's command :-) [14:23:07] Huh, well, nevermind...
Let me see what's wrong at this point [14:23:12] Sure :-) [14:25:22] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 22 probes of 452 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:25:23] Would you mind eval'ing unserialize($res), where $res is the result of fetchText, please? [14:25:50] Copy 'n pasting gives an error [14:25:59] not at all [14:27:01] (CR) CDanis: prometheus: add some cloud-dev dns metrics to the codfw prometheus host (3 comments) [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) (owner: Andrew Bogott) [14:28:10] (CR) BBlack: [C: +2] Point wikimania.org to the non canonical redirect service [dns] - https://gerrit.wikimedia.org/r/533213 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [14:28:14] (CR) BBlack: [C: +2] Point wikimedia.community to the non canonical redirect service [dns] - https://gerrit.wikimedia.org/r/533219 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [14:28:22] (PS2) BBlack: Point wikimania.org to the non canonical redirect service [dns] - https://gerrit.wikimedia.org/r/533213 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [14:28:28] Daimona: although I'm not sure how I am supposed to assign the output of fetchText.php to a variable in shell [14:28:39] I was also wondering about that. [14:28:58] (CR) Bstorm: [C: +1] labstore: check nfs v4 cluster status with rpcinfo [puppet] - https://gerrit.wikimedia.org/r/533220 (https://phabricator.wikimedia.org/T229448) (owner: Jhedden) [14:29:00] So, we can try to emulate the code via shell.php, tell me when you're ready and I'll paste the commands [14:29:05] Daimona: shouldn't we coordinate the process of fixing in T231542 in -dev or something?
[14:29:05] T231542: AFPData.php: Refusing to cast DUNDEFINED to something else - https://phabricator.wikimedia.org/T231542 [14:29:11] not sure how this is relevant to -operations scope [14:29:19] Yes, sure [14:29:40] But a private chat is also fine, I'll post relevant updates on phab [14:29:46] (PS6) Andrew Bogott: prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) [14:30:18] (PS2) BBlack: Point wikimedia.community to the non canonical redirect service [dns] - https://gerrit.wikimedia.org/r/533219 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [14:32:21] (CR) Andrew Bogott: "pcc run: https://puppet-compiler.wmflabs.org/compiler1002/18113/" [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) (owner: Andrew Bogott) [14:34:23] (CR) CDanis: [C: +1] "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) (owner: Andrew Bogott) [14:34:58] (CR) Andrew Bogott: [C: +2] prometheus: add some cloud-dev dns metrics to the codfw prometheus host [puppet] - https://gerrit.wikimedia.org/r/533210 (https://phabricator.wikimedia.org/T224828) (owner: Andrew Bogott) [14:36:41] (PS3) BBlack: redirects.dat: Get rid of non canonical domains rules [puppet] - https://gerrit.wikimedia.org/r/533141 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [14:36:50] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 452 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:38:41] (CR) BBlack: [C: +2] redirects.dat: Get rid of non canonical domains rules [puppet] - https://gerrit.wikimedia.org/r/533141 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [14:39:05] (PS4)
10BBlack: redirects.dat: Enforce HTTPS for canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/533142 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [14:41:21] (03CR) 10BBlack: [C: 03+2] redirects.dat: Enforce HTTPS for canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/533142 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [14:42:42] (03PS2) 10Ottomata: Release Spark 2.4.3 [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/532455 (https://phabricator.wikimedia.org/T222253) [14:47:02] (03CR) 10Thcipriani: "> outdated or still holding?" [puppet] - 10https://gerrit.wikimedia.org/r/528433 (owner: 10Paladox) [14:48:10] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10Papaul) ` papaul@asw-a-codfw# run show interfaces ge-5/0/32 descriptions Interface Admin Link Description ge-5/0/32 down down DISABLED [14:48:51] (03PS1) 10BBlack: redirects.dat: secure external redirects [puppet] - 10https://gerrit.wikimedia.org/r/533236 [14:51:52] (03CR) 10BBlack: [C: 03+2] redirects.dat: secure external redirects [puppet] - 10https://gerrit.wikimedia.org/r/533236 (owner: 10BBlack) [14:57:47] (03PS2) 10Vgutierrez: ncredir: Perform HTTPS upgrade without crossing domain boundaries [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) [15:04:09] (03CR) 10Jhedden: [C: 03+2] labstore: check nfs v4 cluster status with rpcinfo [puppet] - 10https://gerrit.wikimedia.org/r/533220 (https://phabricator.wikimedia.org/T229448) (owner: 10Jhedden) [15:04:17] (03PS3) 10Jhedden: labstore: check nfs v4 cluster status with rpcinfo [puppet] - 10https://gerrit.wikimedia.org/r/533220 (https://phabricator.wikimedia.org/T229448) [15:11:40] (03CR) 10BBlack: ncredir: Perform HTTPS upgrade without crossing domain boundaries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) (owner: 
10Vgutierrez) [15:11:51] (03CR) 10BBlack: [C: 03+1] ncredir: Perform HTTPS upgrade without crossing domain boundaries [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) (owner: 10Vgutierrez) [15:14:00] (03PS3) 10Vgutierrez: ncredir: Perform HTTPS upgrade without crossing domain boundaries [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) [15:15:16] (03CR) 10Vgutierrez: ncredir: Perform HTTPS upgrade without crossing domain boundaries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) (owner: 10Vgutierrez) [15:15:23] (03CR) 10BBlack: [C: 03+1] ncredir: Perform HTTPS upgrade without crossing domain boundaries [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) (owner: 10Vgutierrez) [15:15:37] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Perform HTTPS upgrade without crossing domain boundaries [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) (owner: 10Vgutierrez) [15:15:47] (03PS4) 10Vgutierrez: ncredir: Perform HTTPS upgrade without crossing domain boundaries [puppet] - 10https://gerrit.wikimedia.org/r/533133 (https://phabricator.wikimedia.org/T231513) [15:18:50] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:19:01] (03PS1) 10Ema: ATS: perform MW and RB mangling after cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/533242 (https://phabricator.wikimedia.org/T231504) [15:21:02] (03CR) 10jerkins-bot: [V: 04-1] ATS: perform MW and RB mangling after cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/533242 (https://phabricator.wikimedia.org/T231504) (owner: 10Ema) [15:24:19] (03PS2) 10Vgutierrez: ncredir: Enable HSTS with max-age set to 1 week [puppet] - 10https://gerrit.wikimedia.org/r/533140 (https://phabricator.wikimedia.org/T231514) [15:26:55] (03PS2) 10Ema: ATS: perform MW and RB mangling after cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/533242 (https://phabricator.wikimedia.org/T231504) [15:27:25] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Enable HSTS with max-age set to 1 week [puppet] - 10https://gerrit.wikimedia.org/r/533140 (https://phabricator.wikimedia.org/T231514) (owner: 10Vgutierrez) [15:36:19] (03PS1) 10Vgutierrez: ncredir: Get rid of wikimedia.ee [puppet] - 10https://gerrit.wikimedia.org/r/533245 [15:36:23] (03PS2) 10Ayounsi: Add bash script between FNM and notify script [puppet] - 10https://gerrit.wikimedia.org/r/533081 (https://phabricator.wikimedia.org/T226810) [15:37:00] (03CR) 10Ayounsi: "Answered, thanks!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/533081 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [15:37:52] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2034 [dns] - 10https://gerrit.wikimedia.org/r/533247 [15:38:54] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Get rid of wikimedia.ee [puppet] - 10https://gerrit.wikimedia.org/r/533245 (owner: 10Vgutierrez) [15:38:58] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:39:38] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18114/netflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/533081 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [15:39:48] (03PS3) 10Ayounsi: Add bash script between FNM and notify script [puppet] - 10https://gerrit.wikimedia.org/r/533081 (https://phabricator.wikimedia.org/T226810) [15:40:44] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2034 [dns] - 10https://gerrit.wikimedia.org/r/533247 (owner: 10Papaul) [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:05:20] (03PS3) 10Ema: ATS: perform MW and RB mangling after cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/533242 (https://phabricator.wikimedia.org/T231504) [16:07:04] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2045 [dns] - 10https://gerrit.wikimedia.org/r/533255 [16:07:52] (03CR) 10Ema: [C: 03+2] ATS: perform MW and RB mangling after cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/533242 (https://phabricator.wikimedia.org/T231504) (owner: 10Ema) [16:08:11] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2045 [dns] - 10https://gerrit.wikimedia.org/r/533255 (owner: 10Papaul) [16:13:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:23:22] (03PS7) 10CRusnov: Add Netbox instance addresses [dns] - 10https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291) [16:25:41] (03CR) 10CRusnov: [C: 03+2] Add Netbox instance addresses [dns] - 10https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [16:30:34] (03PS2) 10Dzahn: planet: include envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/533197 (https://phabricator.wikimedia.org/T210411) [16:33:07] (03CR) 10Jforrester: "Doing T112147 would be nicer, of course. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [16:33:10] (03PS2) 10Cwhite: add the option of passing a custom metrics context manager to EndpointRequest [software/service-checker] - 10https://gerrit.wikimedia.org/r/532807 [16:34:18] (03CR) 10Dzahn: [C: 03+2] planet: include envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/533197 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [16:36:40] (03PS1) 10Dzahn: Revert "planet: include envoy for TLS termination" [puppet] - 10https://gerrit.wikimedia.org/r/533263 [16:38:40] (03CR) 10Dzahn: [C: 03+2] Revert "planet: include envoy for TLS termination" [puppet] - 10https://gerrit.wikimedia.org/r/533263 (owner: 10Dzahn) [16:49:09] !log crusnov@cumin1001 START - Cookbook sre.ganeti.makevm [16:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:20] !log crusnov@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [16:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:30] !log crusnov@cumin1001 START - Cookbook sre.ganeti.makevm [16:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:28] (03PS1) 10Bstorm: pdns: set the recursor threads in line with best practices [puppet] - 10https://gerrit.wikimedia.org/r/533268 (https://phabricator.wikimedia.org/T224828) [16:54:02] (03CR) 10Urbanecm: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [16:58:05] (03CR) 10Thcipriani: [C: 03+2] Update go-import and wikimedia plugins [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525869 (owner: 10Paladox) [16:59:25] !log crusnov@cumin1001 START - Cookbook sre.ganeti.makevm [16:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] cscott, arlolra, subbu, halfak, 
and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T1700). [17:00:19] no parsoid deploy today [17:03:10] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 59.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:03:34] (03CR) 10Bstorm: [C: 04-1] "We don't want to merge this until we've done some tests and tried a bit more to reproduce the problem just in case this is related." [puppet] - 10https://gerrit.wikimedia.org/r/533268 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [17:03:48] (03CR) 10Jhedden: [C: 03+1] pdns: set the recursor threads in line with best practices [puppet] - 10https://gerrit.wikimedia.org/r/533268 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [17:04:44] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 78.38 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:05:12] (03Merged) 10jenkins-bot: Update go-import and wikimedia plugins [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525869 (owner: 10Paladox) [17:06:03] Will there be a train next week? https://wikitech.wikimedia.org/wiki/Deployments wasn't sure if there would be with the US Holiday or not [17:07:54] davidwbarratt, I believe there will be, handled by Europeans [17:08:30] ah, ok, great! 
[17:09:00] !log crusnov@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [17:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:40] !log crusnov@cumin1001 START - Cookbook sre.ganeti.makevm [17:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:12] !log crusnov@cumin1001 START - Cookbook sre.ganeti.makevm [17:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:20] !log restarted elasticsearch on cloudelastic1004 (T231517) [17:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:26] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [17:16:07] (03CR) 10Daimona Eaytoy: [C: 03+1] "> > Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [17:19:22] !log crusnov@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [17:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:01] (03CR) 10Volans: "Sorry, I couldn't do a full pass now, I did a very quick one. I also had some pre-existing un-pushed comments, ignore them if they don't a" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [17:22:08] !log crusnov@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [17:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:27] davidwbarratt: One of the reasons we have the train on Tuesday–Thursday is to avoid having to cancel it on weeks when there's a Monday holiday. 
[17:48:44] * Krinkle staging on mwdebug1002 soon with https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/533267/ and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralAuth/+/533218/ [18:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:21] * Krinkle takes the window for now, although still waiting for CI [18:00:44] Krinkle: ping me once you're done, please [18:00:57] Urbanecm: config deploy? [18:01:25] Krinkle: yup [18:01:29] (but I can wait) [18:01:34] as long as it isn't wait three hours [18:02:36] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:04:13] Urbanecm: go ahead then, will be quicker than my CI MW patches [18:04:26] just be careful not to git pull in any php.* dir just in case. 
[18:04:39] * Urbanecm takes the window then [18:04:40] thanks Krinkle [18:04:47] (03PS1) 10Herron: prometheus: aggregate systemd failed metrics [puppet] - 10https://gerrit.wikimedia.org/r/533282 (https://phabricator.wikimedia.org/T230570) [18:05:59] (03PS3) 10Urbanecm: Fix "Assign all rights assigned to suppress group to oversight group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) [18:06:15] (03CR) 10Urbanecm: [C: 03+2] Fix "Assign all rights assigned to suppress group to oversight group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:07:16] !log increase index.refresh_interval to 5m for all indices on cloudelastic-chi [18:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:03] * Urbanecm waits on CI as well [18:10:43] (03Merged) 10jenkins-bot: Fix "Assign all rights assigned to suppress group to oversight group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:10:54] (03CR) 10jenkins-bot: Fix "Assign all rights assigned to suppress group to oversight group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533211 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:11:10] (03PS1) 10Andrew Bogott: Update dns entries for codfw1dev, the cloud test region [dns] - 10https://gerrit.wikimedia.org/r/533286 [18:11:27] (03PS1) 10Andrew Bogott: designate: update pool config for a single server in codfw1-dev [puppet] - 10https://gerrit.wikimedia.org/r/533287 [18:12:19] (03CR) 10jerkins-bot: [V: 04-1] Update dns entries for codfw1dev, the cloud test region [dns] - 10https://gerrit.wikimedia.org/r/533286 (owner: 10Andrew Bogott) [18:17:07] (03PS2) 10Andrew Bogott: Update dns entries for codfw1dev, the cloud test region [dns] - 10https://gerrit.wikimedia.org/r/533286 [18:17:14] (03CR) 10jerkins-bot: [V: 
04-1] Update dns entries for codfw1dev, the cloud test region [dns] - 10https://gerrit.wikimedia.org/r/533286 (owner: 10Andrew Bogott) [18:18:33] (03PS1) 10Urbanecm: Follow up for e70da21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) [18:18:56] (03CR) 10Urbanecm: [C: 03+2] Follow up for e70da21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:20:39] (03CR) 10Jforrester: [C: 03+1] Follow up for e70da21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:20:40] Urbanecm: my patches have landed, fyi - let me know when done :) [18:21:23] (03PS2) 10Urbanecm: Follow up for e70da21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) [18:21:30] (03CR) 10Urbanecm: [C: 03+2] Follow up for e70da21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: 10Urbanecm) [18:21:43] Krinkle: syncing the patch(es) [18:21:43] (03PS3) 10Andrew Bogott: Update dns entries for codfw1dev, the cloud test region [dns] - 10https://gerrit.wikimedia.org/r/533286 [18:22:11] but the second one I just +2'ed is on deploy1001, but not in git yet [18:22:19] (was making sure it works before uploading) [18:22:48] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: Fix "Assign all rights assigned to suppress group to oversight group" (T230601) (duration: 00m 54s) [18:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:06] T230601: Groups 'oversight'/'suppress' should be reconciled - https://phabricator.wikimedia.org/T230601 [18:23:23] !log restart elasticsearch on cloudelastic1001 (T231517) [18:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:29] T231517: Investigate and fix GC issues 
on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [18:23:44] I'm done with the actual deployment, but the commit has a different commit message in git and deploy1001, so I'd need to do some git-fu to have it really done [18:23:54] Krinkle: if you want, you can deploy, but bear in mind what I wrote [18:23:58] (not safe to deploy config rn) [18:24:01] (CR) DannyS712: [C: -1] "The latter does work; it appears to be working currently" [mediawiki-config] - https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm) [18:24:52] (CR) Urbanecm: [C: +2] "> Patch Set 2: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm) [18:25:42] (CR) jerkins-bot: [V: -1] Update dns entries for codfw1dev, the cloud test region [dns] - https://gerrit.wikimedia.org/r/533286 (owner: Andrew Bogott) [18:25:44] (CR) DannyS712: [C: +1] "> > Patch Set 2: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm) [18:25:46] (Merged) jenkins-bot: Follow up for e70da21 [mediawiki-config] - https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm) [18:26:01] (CR) jenkins-bot: Follow up for e70da21 [mediawiki-config] - https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm) [18:26:26] Krinkle: done for real [18:27:06] ok :) [18:27:37] * Krinkle staging on mwdebug1002 [18:28:42] (PS4) Andrew Bogott: Update dns entries for codfw1dev, the cloud test region [dns] - https://gerrit.wikimedia.org/r/533286 [18:30:19] (CR) Daimona Eaytoy: "Hah, because use() passes in the current value of $wg..., not the one it'll have when running the update. I'm sorry, my fault."
[mediawiki-config] - https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm) [18:30:52] (CR) Urbanecm: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/533288 (https://phabricator.wikimedia.org/T230601) (owner: Urbanecm) [18:31:48] Krinkle: is the AF change on mwdebug? [18:32:09] Daimona: pulling as we speak [18:32:18] live on mwdebug1002 now [18:32:24] OK, I'm gonna check as well [18:32:57] Works! [18:33:20] So the suspicion about private + unserialize was right [18:33:39] [XWgaYgpAAC4AAGLnkcQAAAAS] /wiki/Especial:RegistroAbusos/4122 Error from line 639 of /srv/mediawiki/php-1.34.0-wmf.20/includes/Revision.php: Call to a member function getId() on null [18:33:45] That's unrelated [18:33:47] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/CentralAuth/modules/ext.centralauth.ForeignApi.js: e7cd3cd313a4642 (duration: 00m 55s) [18:33:47] Eurgh. [18:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:54] And PHP7-only [18:33:54] Caught BadMethodCallException - T187153 [18:33:55] T187153: Special:Abuselog throws when viewing details or examining (BadMethodCallException: Call get getId() on null) - https://phabricator.wikimedia.org/T187153 [18:34:05] T187153 [18:34:17] The fun never ends when you stuff serialized classes in the DB ;) [18:34:42] Alright, Scap, let's scatter the code around production. [18:34:50] Yeah, getting rid of that is epic, but will be so worth it.
[18:36:17] Beauty is in the AbuseFilter of the VariableHolder – .php [18:36:18] And it's even funnier when HHVM throws a BadMethodCallException, and PHP7 an Error [18:36:36] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/AbuseFilter/includes/AbuseFilterVariableHolder.php: T231542 f37f0bd50cf (duration: 00m 53s) [18:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:54] T231542: AFPData.php: Refusing to cast DUNDEFINED to something else - https://phabricator.wikimedia.org/T231542 [18:39:50] hallo [18:40:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:40:08] Krenair - thanks for all the recent translatewiki patches [18:40:15] I'm not sure about https://gerrit.wikimedia.org/r/#/c/translatewiki/+/532120/ , though [18:40:38] I might be wrong, but I suspect that Cargo is not actually used on Wikimedia servers. [18:40:51] It isn't. [18:41:08] But the API messages in any repo aren't great for most translators. [18:41:32] It somehow made it to the checklist at https://phabricator.wikimedia.org/T189982 . It's good to split the messages, but it shouldn't be configured under "Used by Wikimedia". [18:41:37] I'll submit a path. [18:41:40] patch [18:41:48] !log set index.merge.scheduler.max_thread_count to null to accept default values on cloudelastic-chi (T231517) [18:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:55] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [18:42:10] Good spot, aharoni. [18:48:28] aharoni: BTW, https://gerrit.wikimedia.org/r/c/translatewiki/+/475614 needs a rebase, then I'll merge. 
[18:50:15] !log restart elasticsearch on cloudelastic1002 (T231517) [18:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:21] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [18:50:53] James_F - thanks, done [18:56:52] (PS1) CRusnov: install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 [18:57:46] (CR) jerkins-bot: [V: -1] install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 (owner: CRusnov) [18:59:00] !log restart elasticsearch on cloudelastic1003 (T231517) [18:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:09] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [18:59:58] (PS2) CRusnov: install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 [19:00:59] (CR) jerkins-bot: [V: -1] install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 (owner: CRusnov) [19:01:20] (PS1) Jhedden: labstore: Open NFS between secondary servers [puppet] - https://gerrit.wikimedia.org/r/533302 (https://phabricator.wikimedia.org/T229448) [19:04:06] (CR) Bstorm: [C: +1] labstore: Open NFS between secondary servers [puppet] - https://gerrit.wikimedia.org/r/533302 (https://phabricator.wikimedia.org/T229448) (owner: Jhedden) [19:04:55] (PS2) Jhedden: labstore: Open NFS between secondary servers [puppet] - https://gerrit.wikimedia.org/r/533302 (https://phabricator.wikimedia.org/T229448) [19:06:56] (CR) Jhedden: [C: +2] labstore: Open NFS between secondary servers [puppet] - https://gerrit.wikimedia.org/r/533302 (https://phabricator.wikimedia.org/T229448) (owner: Jhedden) [19:17:08] (CR) Andrew Bogott: [C: +2] Update dns entries for 
codfw1dev, the cloud test region [dns] - https://gerrit.wikimedia.org/r/533286 (owner: Andrew Bogott) [19:19:52] (PS2) Andrew Bogott: designate: update pool config for a single server in codfw1-dev [puppet] - https://gerrit.wikimedia.org/r/533287 [19:20:43] (CR) Andrew Bogott: [C: +2] designate: update pool config for a single server in codfw1-dev [puppet] - https://gerrit.wikimedia.org/r/533287 (owner: Andrew Bogott) [19:31:47] (Abandoned) Jhedden: toolschecker: match nginx and wsgi timeouts [puppet] - https://gerrit.wikimedia.org/r/528892 (https://phabricator.wikimedia.org/T221301) (owner: Jhedden) [19:32:03] (Abandoned) Jhedden: toolschecker: check status for webservice tasks [puppet] - https://gerrit.wikimedia.org/r/528897 (https://phabricator.wikimedia.org/T221301) (owner: Jhedden) [19:39:02] (PS1) Bstorm: powerdns: correct some database variables in my.cnf [puppet] - https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) [19:44:07] (PS1) Jhedden: toolschecker: remove webservice grid and k8s check [puppet] - https://gerrit.wikimedia.org/r/533310 (https://phabricator.wikimedia.org/T221301) [19:45:24] (CR) Jhedden: [C: +1] powerdns: correct some database variables in my.cnf [puppet] - https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) (owner: Bstorm) [19:48:53] (CR) Jhedden: [C: +2] toolschecker: remove webservice grid and k8s check [puppet] - https://gerrit.wikimedia.org/r/533310 (https://phabricator.wikimedia.org/T221301) (owner: Jhedden) [19:56:50] !log cloudelastic-chi run frwiki_content/_forcemerge?only_expunge_deletes=true to try and fix 5gb segments with 96% deleted documents [19:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:27] /win 21 [20:03:07] (PS1) Andrew Bogott: profile::openstack::codfw1dev::db: allow designate host to access db server [puppet] - 
https://gerrit.wikimedia.org/r/533314 [20:07:21] (PS3) CRusnov: install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 [20:08:15] (CR) jerkins-bot: [V: -1] install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 (owner: CRusnov) [20:09:31] (CR) Andrew Bogott: [C: +2] profile::openstack::codfw1dev::db: allow designate host to access db server [puppet] - https://gerrit.wikimedia.org/r/533314 (owner: Andrew Bogott) [20:11:29] (PS1) Jhedden: toolschecker: remove showmount check [puppet] - https://gerrit.wikimedia.org/r/533318 (https://phabricator.wikimedia.org/T229448) [20:11:40] (PS4) CRusnov: install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 [20:14:13] !log removing two files for legal compliance [20:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:18] (CR) Jhedden: [C: +2] toolschecker: remove showmount check [puppet] - https://gerrit.wikimedia.org/r/533318 (https://phabricator.wikimedia.org/T229448) (owner: Jhedden) [20:15:35] (PS2) Jhedden: toolschecker: remove showmount check [puppet] - https://gerrit.wikimedia.org/r/533318 (https://phabricator.wikimedia.org/T229448) [20:16:17] (PS7) BryanDavis: docker: add support for "testing" tags [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558) (owner: Bstorm) [20:16:19] (PS1) BryanDavis: Add black to tox [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533320 [20:16:21] (PS1) BryanDavis: flake8: remove ignored tests [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533321 [20:16:23] (PS1) BryanDavis: Ignore a local .python-version file [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533322 [20:16:25] (PS1) BryanDavis: Add --single CLI 
argument [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533323 [20:16:27] (PS1) BryanDavis: Downgrade libbz2 for jessie images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533324 [20:30:47] (CR) CRusnov: [C: +2] librenms: Exclude problematic InventoryItem type as requested [software/netbox-reports] - https://gerrit.wikimedia.org/r/533128 (https://phabricator.wikimedia.org/T231502) (owner: CRusnov) [20:31:32] (CR) BryanDavis: [C: +2] docker: add support for "testing" tags [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558) (owner: Bstorm) [20:32:18] (Merged) jenkins-bot: docker: add support for "testing" tags [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/528178 (https://phabricator.wikimedia.org/T224558) (owner: Bstorm) [20:35:25] (CR) BryanDavis: [C: +2] Add black to tox [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533320 (owner: BryanDavis) [20:35:46] (CR) BryanDavis: [C: +2] flake8: remove ignored tests [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533321 (owner: BryanDavis) [20:35:53] (CR) BryanDavis: [C: +2] Ignore a local .python-version file [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533322 (owner: BryanDavis) [20:38:01] (CR) Ayounsi: [C: +1] "Assuming the MACs match what's given by Ganeti" [puppet] - https://gerrit.wikimedia.org/r/533301 (owner: CRusnov) [20:39:12] (Merged) jenkins-bot: Add black to tox [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533320 (owner: BryanDavis) [20:39:14] (CR) CRusnov: [C: +2] install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 (owner: CRusnov) [20:39:17] (Merged) jenkins-bot: flake8: remove ignored tests [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533321 
(owner: BryanDavis) [20:39:34] (Merged) jenkins-bot: Ignore a local .python-version file [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533322 (owner: BryanDavis) [20:39:45] (CR) BryanDavis: [C: +2] Add --single CLI argument [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533323 (owner: BryanDavis) [20:40:09] (CR) BryanDavis: [C: +2] Downgrade libbz2 for jessie images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533324 (owner: BryanDavis) [20:40:35] (Merged) jenkins-bot: Add --single CLI argument [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533323 (owner: BryanDavis) [20:40:58] (Merged) jenkins-bot: Downgrade libbz2 for jessie images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/533324 (owner: BryanDavis) [20:44:34] (PS5) CRusnov: install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 [20:45:25] (CR) CRusnov: [V: +2 C: +2] install_server: at netbox server types, and dhcp config [puppet] - https://gerrit.wikimedia.org/r/533301 (owner: CRusnov) [21:11:50] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:14:09] (PS1) Paladox: Merge tag 'v2.15.16' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533332 [21:14:44] (PS1) Andrew Bogott: codfw1dev: standardize on a single pdns server, codfw1dev-ns0 [puppet] - https://gerrit.wikimedia.org/r/533333 [21:15:55] (CR) Andrew Bogott: [C: +2] codfw1dev: standardize on a single pdns server, codfw1dev-ns0 [puppet] - https://gerrit.wikimedia.org/r/533333 (owner: Andrew Bogott) [21:17:00] (CR) jerkins-bot: [V: -1] Merge tag 'v2.15.16' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533332 (owner: Paladox) [21:32:33] (PS1) Paladox: Support newer bazel versions [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533336 [21:32:44] (PS2) Paladox: Support newer bazel versions [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533336 [21:32:55] (PS3) Paladox: Support newer bazel versions [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533336 [21:36:27] (PS2) Bstorm: toolforge: add CORS header to docker-registry [puppet] - https://gerrit.wikimedia.org/r/528617 (owner: BryanDavis) [21:36:50] (PS4) Paladox: Support newer bazel versions [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533336 [21:36:53] (PS2) Paladox: Merge tag 'v2.15.16' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533332 [21:45:11] (Abandoned) Paladox: Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/525865 (owner: Paladox) [21:50:09] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:00:10] (CR) Thcipriani: [C: +2] "Nice work!" [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533336 (owner: Paladox) [22:00:43] (PS1) Krinkle: Update interwiki-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/533345 [22:00:54] (PS2) Krinkle: Update interwiki-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/533345 (https://phabricator.wikimedia.org/T187716) [22:03:59] (CR) Krinkle: [C: +2] Update interwiki-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/533345 (https://phabricator.wikimedia.org/T187716) (owner: Krinkle) [22:08:27] (Merged) jenkins-bot: Support newer bazel versions [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/533336 (owner: Paladox) [22:08:58] (Merged) jenkins-bot: Update interwiki-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/533345 (https://phabricator.wikimedia.org/T187716) (owner: Krinkle) [22:09:19] (CR) jenkins-bot: Update interwiki-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/533345 (https://phabricator.wikimedia.org/T187716) (owner: Krinkle) [22:11:27] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:20:16] (PS1) Paladox: Support newer bazel versions [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/533349 [22:20:50] (PS1) Paladox: Merge tag 'v2.16.11' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/533350 [22:26:53] (PS2) Bstorm: pdns: set the recursor threads in line with best practices [puppet] - https://gerrit.wikimedia.org/r/533268 (https://phabricator.wikimedia.org/T224828) [22:29:05] (CR) Bstorm: [C: +2] pdns: set the recursor threads in line with best practices [puppet] - https://gerrit.wikimedia.org/r/533268 (https://phabricator.wikimedia.org/T224828) (owner: Bstorm) [22:31:18] (CR) Paladox: [C: +2] "Self merging as it's been merged into 2.15 and it passes the build." [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/533349 (owner: Paladox) [22:37:13] (PS44) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:39:42] (Merged) jenkins-bot: Support newer bazel versions [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/533349 (owner: Paladox) [22:49:23] (PS2) Krinkle: CommonSettings: Clean up wmf-config caching code [no-op] [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) [22:49:40] (PS45) CRusnov: profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:51:16] (CR) Krinkle: [C: +2] CommonSettings: Clean up wmf-config caching code [no-op] [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle) [22:54:22] (CR) Jforrester: [C: +1] CommonSettings: Clean up wmf-config caching code [no-op] [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle) [22:57:33] (Merged) jenkins-bot: CommonSettings: Clean up wmf-config caching code [no-op] [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle) [22:58:44] (CR) jenkins-bot: CommonSettings: Clean up wmf-config caching code [no-op] [mediawiki-config] - https://gerrit.wikimedia.org/r/528446 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle) [22:59:13] * Krinkle staging on mwdebug1002 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190829T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:04:07] (PS46) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:06:42] (CR) jerkins-bot: [V: -1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov) [23:07:45] filing an unrelated bug report first, then will proceed with the patch deploy [23:09:03] (PS47) CRusnov: profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:14:26] (PS4) Krinkle: CommonSettings: Store mtime inside wmf-config cache file [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) [23:15:04] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: 4cdfebe (duration: 00m 54s) [23:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:45] (CR) Jforrester: "> Patch Set 3: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle) [23:21:19] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 26499 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [23:25:24] (PS48) CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [23:31:32] (CR) BryanDavis: "AKosiaris: can you give this a sanity check from your perspective on the prod network's use of ::docker::registry? I tried to separate thi" [puppet] - https://gerrit.wikimedia.org/r/528617 (owner: BryanDavis) [23:33:53] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops [23:43:14] (PS4) Krinkle: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - https://gerrit.wikimedia.org/r/521004 (owner: Aaron Schulz)