[00:00:42] Ya, all good, it's cranking along.
[00:00:43] Thanks!
[00:01:06] Now we can actually modify that code without bugging y'all for a +2 :-)
[00:04:10] marlier: great! thanks for confirming
[00:05:31] !log aaron@tin Synchronized wmf-config/mc-labs.php: 8ad186728d: use mcrouter key prefixes (deployment-prep only) (duration: 01m 15s)
[00:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:30] Operations, Release-Engineering-Team, Scap, Scoring-platform-team: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842#4118429 (Dzahn) Which deployment server is this about? Production or deployment-prep or another?
[00:26:47] Operations, Stewards-and-global-tools, Wikimedia-IRC-RC-Server: Create RC feed for login.wikimedia - https://phabricator.wikimedia.org/T191625#4111859 (Dzahn) There is an RFC to stop running irc.wikimedia.org. It might make more sense to focus on getting this into the replacement for that.
[00:29:29] Operations, Stewards-and-global-tools, Wikimedia-IRC-RC-Server: Create RC feed for login.wikimedia - https://phabricator.wikimedia.org/T191625#4111859 (Platonides) Account creations there are already logged on #central, not sure about renames
[00:32:52] Operations, hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4138132 (faidon) WMF3565 is > 5 years old, so there's really no point in setting hardware that old right now. How urgent is this task? We have a task open for procuring new hardware...
[00:38:28] Operations, hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4138143 (Dzahn) a:Dzahn>None I think it kind of blocks the "never use PHP5" / "switch to PHP7" thing which also affects appservers and deployment servers. In the last Service...
[00:40:43] Operations, Stewards-and-global-tools, Wikimedia-IRC-RC-Server: Create RC feed for login.wikimedia - https://phabricator.wikimedia.org/T191625#4138146 (Dzahn) Afaict the way it works is that the bot has the right to create channels and it joins a channel when the first message comes in. (We know bec...
[00:41:12] Operations, Stewards-and-global-tools, Wikimedia-IRC-RC-Server: Create RC feed for login.wikimedia - https://phabricator.wikimedia.org/T191625#4138148 (Dzahn) p:Triage>Low
[00:42:14] Operations, Release-Engineering-Team, Scap, Patch-For-Review: Scap stalled at sync-masters, ok: 1, left: 1 - https://phabricator.wikimedia.org/T191029#4138152 (Dzahn) Open>Resolved a:Dzahn
[00:43:09] (PS1) Jforrester: Drop old wgEnableAPI and wgEnableWriteAPI, no longer used in MW [mediawiki-config] - https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414)
[00:44:03] Operations, WMDE-QWERTY-Team, wikidiff2, Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4138159 (Dzahn) p:Triage>Normal
[00:44:39] (CR) Jforrester: "These were already set to true by default in MW, and are being removed from 1.32.0-wmf.1." [mediawiki-config] - https://gerrit.wikimedia.org/r/427289 (https://phabricator.wikimedia.org/T115414) (owner: Jforrester)
[00:47:01] Operations, Release-Engineering-Team, Scap, Scoring-platform-team: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842#4138160 (awight) Sorry--this is about production.
[00:48:00] Operations, Ops-Access-Requests, Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4138162 (EBjune) >>! In T188453#4064287, @mark wrote: > Is it an option for your purposes to get access to only a few sub domains, so we...
[01:10:53] Operations, ops-codfw, DBA, hardware-requests, Patch-For-Review: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4138212 (Papaul)
[01:11:11] Operations, DBA, Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#4138214 (Papaul)
[01:11:17] Operations, ops-codfw, DBA, hardware-requests, Patch-For-Review: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3978461 (Papaul) Open>Resolved
[01:17:18] PROBLEM - ensure kvm processes are running on labvirt1015 is CRITICAL: PROCS CRITICAL: 97 processes with regex args /usr/bin/kvm
[01:18:00] (CR) Krinkle: [C: 1] "Confirmed." [puppet] - https://gerrit.wikimedia.org/r/425967 (https://phabricator.wikimedia.org/T192139) (owner: EddieGP)
[01:18:18] RECOVERY - ensure kvm processes are running on labvirt1015 is OK: PROCS OK: 94 processes with regex args /usr/bin/kvm
[01:24:24] (PS1) Andrew Bogott: Depool labvirt1015, it needs a break. [puppet] - https://gerrit.wikimedia.org/r/427300 (https://phabricator.wikimedia.org/T192422)
[01:26:06] (CR) Andrew Bogott: [C: 2] Depool labvirt1015, it needs a break. [puppet] - https://gerrit.wikimedia.org/r/427300 (https://phabricator.wikimedia.org/T192422) (owner: Andrew Bogott)
[01:28:35] Operations, Patch-For-Review, Wikimedia-maintenance-script-run: Remove monthly run of updateArticleCount.php - https://phabricator.wikimedia.org/T192139#4129100 (Krinkle)
[01:30:45] Operations, Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4138240 (Krinkle)
[02:35:24] Operations, MediaWiki-Platform-Team, HHVM, PHP 7.0 support, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4138301 (Krinkle)
[02:51:38] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 05m 55s)
[02:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:36] PROBLEM - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[03:49:16] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.114 and port 9042: Connection refused
[03:55:56] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[03:57:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:58:02] (CR) Jayprakash12345: "@Urbanecm Window for creating wiki has set. Please add all config like Project Namespace etc. And you will go "Hindi_Wikimedians_User_Grou" [mediawiki-config] - https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) (owner: Urbanecm)
[04:04:56] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[04:05:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:56:02] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138339 (Marostegui) @jcrespo what do you feel it is being missed?
[05:02:00] !log Deploy schema change on db1071 (s8 primary master) - T185128 T153182
[05:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:07] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:02:07] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:07:05] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138360 (Marostegui) Changing the network cable didn't have any effect. Errors are still there
[05:07:58] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/427313
[05:08:02] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/427313
[05:11:04] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138365 (Marostegui)
[05:11:11] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/427313 (owner: Marostegui)
[05:12:26] PROBLEM - HHVM rendering on mw2288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:12:26] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/427313 (owner: Marostegui)
[05:12:41] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/427313 (owner: Marostegui)
[05:13:16] RECOVERY - HHVM rendering on mw2288 is OK: HTTP OK: HTTP/1.1 200 OK - 77838 bytes in 0.310 second response time
[05:13:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 after alter table (duration: 01m 16s)
[05:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:09] (PS1) Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/427314 (https://phabricator.wikimedia.org/T191996)
[05:16:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/427314 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui)
[05:18:11] (Merged) jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/427314 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui)
[05:18:26] (CR) jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/427314 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui)
[05:20:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 01m 15s)
[05:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:29] !log Change RX buffers on db1114 - T191996
[05:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:35] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[05:22:09] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138369 (Marostegui) RX buffers changed: ``` root@db1114:~# ethtool -g eno1 Ring parameters for eno1: Pre-set maximums: RX: 2047 RX Mini: 0 RX Jumbo: 0 TX: 511 Current...
[05:22:10] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/427315
[05:23:57] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/427315 (owner: Marostegui)
[05:25:02] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138370 (Marostegui)
[05:25:22] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/427315 (owner: Marostegui)
[05:27:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1114 after changing RX buffers - T191996 (duration: 01m 09s)
[05:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:19] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[05:30:24] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/427315 (owner: Marostegui)
[06:10:46] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[06:10:46] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:11:28] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138424 (Marostegui) For the record, this is the amount of dropped packets per server, of all the servers that are on that switch: ``` ores1008 RX errors 0 dropp...
[06:32:04] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138427 (Marostegui)
[06:43:13] !log installing ruby security updates for trusty
[06:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:18] !log Deploy schema change on s5 codfw master (db2052) this will generate lag in codfw - T191519 T188299 T190148
[06:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:26] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[06:49:26] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[06:49:27] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[06:52:57] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2003.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2003.codfw.wmnet}[5m]))): 165382.43786982246 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[06:53:27] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))): 39515.09489704566 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:53:57] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2003.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2003.codfw.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:54:27] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes2001.codfw.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:56:01] that's me upgrading mathoid chart to 0.0.5 ^
[06:58:24] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138448 (jcrespo) Were the right interfaces disabled after the revert?
[07:00:01] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138459 (Marostegui) >>! In T191193#4138448, @jcrespo wrote: > Were the right interfaces disabled after the revert? Yeah: >>! In T191193#4136764, @ayounsi wrote: > asw-c6-...
[07:03:01] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138462 (jcrespo) Open>Resolved Okay, I feel we should check what went wrong (was it the clarity of the communication, was it a one-time mistake that will unlikely hap...
[07:19:50] let's see now what to do about that kubelet operational latencies alert
[07:22:37] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138484 (jcrespo) For example, as a procedure, could activity be checked on the port before being disabled to check the host is down/moved away?
[07:22:46] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 122364.04785894201 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=
[07:22:53] (PS1) Vgutierrez: install_server: Reimage lvs3003 as stretch [puppet] - https://gerrit.wikimedia.org/r/427321 (https://phabricator.wikimedia.org/T191897)
[07:23:16] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 89922.17099748531 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:23:46] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:24:16] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4138486 (Marostegui) >>! In T191193#4138484, @jcrespo wrote: > For example, as a procedure, could activity be checked on the port before being disabled to check the host is do...
[07:24:16] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:33:57] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect
[07:36:57] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 62, down: 0, shutdown: 2
[07:39:01] !log Depool and reimage lvs3003 as stretch - T191897
[07:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:07] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[07:40:55] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138493 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3003.esams.wmnet ``` The log can be found in `/var/lo...
[07:44:20] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs3003 as stretch [puppet] - https://gerrit.wikimedia.org/r/427321 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[07:45:47] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138499 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3003.esams.wmnet'] ``` Of which those **FAILED**: ``` ['lvs3003.esams.wmnet'] ```
[07:46:47] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138502 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3003.esams.wmnet ``` The log can be found in `/var/lo...
[07:46:50] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138503 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3003.esams.wmnet'] ``` Of which those **FAILED**: ``` ['lvs3003.esams.wmnet'] ```
[07:47:23] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138504 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs3003.esams.wmnet ``` The log can be found in `/var/lo...
[07:47:32] (PS1) Elukey: profile::zookeeper::monitoring: add jmx mbeans to the jmx exporter [puppet] - https://gerrit.wikimedia.org/r/427323
[07:51:21] (CR) Elukey: [C: 2] profile::zookeeper::monitoring: add jmx mbeans to the jmx exporter [puppet] - https://gerrit.wikimedia.org/r/427323 (owner: Elukey)
[07:51:46] Operations, Wikimedia-Etherpad, Security: Important critical Etherpad release – 1.6.4 - https://phabricator.wikimedia.org/T191767#4138505 (akosiaris) >>! In T191767#4133113, @akosiaris wrote: > https://github.com/ether/etherpad-lite/commit/9daade0b95bbc5443637977652d3cd0dbc44e112 fixes this but it's...
[07:51:53] Operations, Wikimedia-Etherpad, Security: Important critical Etherpad release – 1.6.4 - https://phabricator.wikimedia.org/T191767#4138506 (akosiaris)
[07:52:10] Operations, Wikimedia-Etherpad, Security: Important critical Etherpad release – 1.6.4 - https://phabricator.wikimedia.org/T191767#4115929 (akosiaris) Open>Resolved a:akosiaris
[08:24:30] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:29:41] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138553 (Marostegui) >>! In T191996#4129814, @ayounsi wrote: > > And switch is now seeing received MAC pause frames. Which confirms that the server is receiving busts of...
[08:32:03] Operations, ops-eqiad, DBA, netops, Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138555 (Marostegui)
[08:35:06] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138557 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs3003.esams.wmnet'] ``` and were **ALL** successful.
[08:35:45] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/425999 (owner: Volans)
[08:36:03] (PS2) Volans: puppetboard: notify service on settings change [puppet] - https://gerrit.wikimedia.org/r/425999
[08:36:53] (CR) Volans: [C: 2] puppetboard: notify service on settings change [puppet] - https://gerrit.wikimedia.org/r/425999 (owner: Volans)
[08:44:24] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:44:35] !log execute cumin 'analytics10[28-69]*' 'rm /etc/apt/preferences.d/r_* && apt-get update' to clear jessie backports apt config - T192348
[08:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:42] T192348: SparkR on Spark 2.3.0 - Testing on Large Data Sets - https://phabricator.wikimedia.org/T192348
[08:45:11] elukey: pro-tip -m async and 'cmd1' 'cmd2'... ;)
[08:46:19] volans: ah yes I know that syntax but it is not yet wired in my brain, need some time :)
[08:46:26] thanks :)
[08:46:53] RECOVERY - puppet last run on analytics1059 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:47:14] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[08:47:34] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:48:03] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect, AS6939/IPv4: Connect
[08:48:53] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[08:49:04] yw :)
[08:49:04] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:49:23] RECOVERY - puppet last run on analytics1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:49:23] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:49:24] RECOVERY - puppet last run on analytics1069 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:49:24] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:50:04] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[08:50:24] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:50:43] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:50:53] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:50:53] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:51:03] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 62, down: 0, shutdown: 2
[08:51:33] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:51:33] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:51:54] RECOVERY - puppet last run on analytics1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:52:14] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:53:01] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4138577 (Tim_WMDE) Hey @RStallman-legalteam, my full name is Tim Fabian Eulitz and my email address is tim.eulitz@wikimedia.de. Regards, Tim
[08:53:53] RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:53:53] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:54:23] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:54:34] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[08:54:34] RECOVERY - puppet last run on analytics1062 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:54:44] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:54:44] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:54:44] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:54:53] RECOVERY - puppet last run on analytics1060 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:55:03] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:55:24] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:55:43] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:55:43] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:55:53] RECOVERY - puppet last run on analytics1063 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:56:04] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Connect
[08:56:44] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:57:20] (PS1) Vgutierrez: pybal: Re-enable BGP in lvs3003 [puppet] - https://gerrit.wikimedia.org/r/427327 (https://phabricator.wikimedia.org/T191897)
[08:57:22] Operations, ops-codfw, DBA, hardware-requests, Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4138587 (Marostegui) After the two servers that were decommissioned yesterday. This is the last host to decommission in codfw as part of T176243 \o/
[08:57:23] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:57:44] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:58:29] (CR) Vgutierrez: [C: 2] pybal: Re-enable BGP in lvs3003 [puppet] - https://gerrit.wikimedia.org/r/427327 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[08:59:16] (PS5) Filippo Giunchedi: nagios: more understandable output for check_prometheus_metric [puppet] - https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343)
[08:59:23] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:59:34] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:59:46] !log Repool (Re-enable BGP) in lvs3003 - T191897
[08:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:52] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[08:59:53] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[09:00:13] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 62, down: 0, shutdown: 2
[09:00:43] (CR) Filippo Giunchedi: [C: 2] nagios: more understandable output for check_prometheus_metric [puppet] - https://gerrit.wikimedia.org/r/427115 (https://phabricator.wikimedia.org/T192343) (owner: Filippo Giunchedi)
[09:04:42] <_joe_> !log restart HHVM on mw1223,mw1224, also repool them after investigation in crashes
[09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:49] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Trim long output from check_prometheus_metric - https://phabricator.wikimedia.org/T192343#4138612 (fgiunchedi) Open>Resolved The query no longer appears in `check_prometheus_metric` output, resolving
[09:05:06] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138614 (Vgutierrez)
[09:06:05] FYI I'm going to briefly stop puppet fleetwide before merging https://gerrit.wikimedia.org/r/c/421860/ to avoid puppet fail spam
[09:07:44] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:08:34] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 77825 bytes in 0.114 second response time
[09:08:42] !log reimaging mw1281 to stretch
[09:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:35] !log stop puppet agent fleetwide before applying https://gerrit.wikimedia.org/r/c/421860/
[09:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:21] (CR) Filippo Giunchedi: [C: 2] puppetmaster: adjust passenger pool size [puppet] - https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) (owner: Filippo Giunchedi)
[09:11:27] (PS5) Filippo Giunchedi: puppetmaster: adjust passenger pool size [puppet] - https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561)
[09:13:01] (PS1) Volans: Icinga: do not start service managed by timer [puppet] - https://gerrit.wikimedia.org/r/427328
[09:16:02] (PS1) Vgutierrez: install_server: Reimage lvs2005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427329 (https://phabricator.wikimedia.org/T191897)
[09:16:53] !log imported lz4 0.0~r131-2~wmf1+trusty1 for trusty-wikimedia to apt.wikimedia.org (needed to build HHVM 3.18 for trusty)
[09:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:12] (CR) Volans: "Compiler result available here:" [puppet] - https://gerrit.wikimedia.org/r/427328 (owner: Volans)
[09:21:16] has something happened to the deployment train that the Wikisources (+?) are not getting 1.31.0-wmf.30?
[09:21:50] or am I jumping the gun?
[09:23:25] presuming the latter
[09:23:40] By about 10 hours :)
[09:25:18] (PS1) Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427333 (https://phabricator.wikimedia.org/T190148)
[09:25:21] k, clearly my brain is playing tricks on deployments days, or I have just simply lost track of changes in schedules
[09:26:03] we used to get deployments on Tuesdays
[09:27:04] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427333 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:27:15] !log reenable puppet fleetwide after https://gerrit.wikimedia.org/r/c/421860
[09:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427333 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:30:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 for alter table (duration: 01m 22s)
[09:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:43] !log Deploy schema change on db1096:3315 - T191519 T188299 T190148
[09:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:50] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[09:30:50] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[09:30:51] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[09:31:31] (CR) jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/427333 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:32:36] (PS1) Jcrespo: mariadb: Depool db2082 for reimage to stretch [mediawiki-config] - https://gerrit.wikimedia.org/r/427335
[09:32:54] !log Depool and reimage lvs2005 - T191897
[09:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:03] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[09:33:30] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs2005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427329 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[09:33:36] (PS2) Vgutierrez: install_server: Reimage lvs2005 as stretch [puppet] - https://gerrit.wikimedia.org/r/427329 (https://phabricator.wikimedia.org/T191897)
[09:36:22] (PS1) Giuseppe Lavagetto: scap-helm: print the cluster that is being referred to [puppet] - https://gerrit.wikimedia.org/r/427336
[09:36:24] (PS1) Giuseppe Lavagetto: scap-helm: add logging to SAL for install, upgrade [puppet] - https://gerrit.wikimedia.org/r/427337
[09:37:04] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138707 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs2005.codfw.wmnet ``` The log can be found in `/var/lo...
[09:37:10] (03CR) 10Alexandros Kosiaris: [C: 032] "Beat me to it! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/427336 (owner: 10Giuseppe Lavagetto) [09:37:41] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2082 for reimage to stretch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427335 (owner: 10Jcrespo) [09:37:48] !log strip apache/nginx/nutcracker/hhvm from former image scaler (now spares) [09:37:49] (03PS1) 10Jcrespo: mariadb: Allow reimage of all db208* servers [puppet] - 10https://gerrit.wikimedia.org/r/427338 [09:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:21] (03PS2) 10Jcrespo: mariadb: Allow reimage of all db208* servers [puppet] - 10https://gerrit.wikimedia.org/r/427338 [09:38:54] (03Merged) 10jenkins-bot: mariadb: Depool db2082 for reimage to stretch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427335 (owner: 10Jcrespo) [09:39:17] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of all db208* servers [puppet] - 10https://gerrit.wikimedia.org/r/427338 (owner: 10Jcrespo) [09:39:27] (03CR) 10jenkins-bot: mariadb: Depool db2082 for reimage to stretch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427335 (owner: 10Jcrespo) [09:41:24] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2082 (duration: 01m 15s) [09:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:54] (03PS1) 10Muehlenhoff: Also handle Prometheus exporters in app server decom script [puppet] - 10https://gerrit.wikimedia.org/r/427340 [09:46:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some minor comments. 
Overall +1" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/427337 (owner: 10Giuseppe Lavagetto) [09:46:21] !log start of deleting auto patrol actions in small wikis (T184485) [09:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:27] T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485 [09:48:10] (03PS1) 10Ema: VCL: cap 404 objects TTL, stop doing so for other 4xx [puppet] - 10https://gerrit.wikimedia.org/r/427341 (https://phabricator.wikimedia.org/T180712) [09:49:22] !log starting reimage of db2082 [09:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:34] !log Ran scap pull on mwdebug1001 after checking https://gerrit.wikimedia.org/r/427156 [09:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:58] (03CR) 10Hoo man: [C: 031] "Applied this on mwdebug1001 briefly and $wgRateLimits looks good :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427156 (https://phabricator.wikimedia.org/T184948) (owner: 10Ladsgroup) [09:53:10] (03PS2) 10Ema: VCL: cap 404 objects TTL, stop doing so for other 4xx [puppet] - 10https://gerrit.wikimedia.org/r/427341 (https://phabricator.wikimedia.org/T180712) [09:55:23] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic: Pybal support of configuration from the kubernetes API - https://phabricator.wikimedia.org/T192437#4138733 (10Joe) [09:57:32] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic: Pybal support of configuration from the kubernetes API - https://phabricator.wikimedia.org/T192437#4138744 (10Joe) Just to clarify, this is a bare support for kubernetes. In theory, it would be nice to gather all information about services we have to con... 
[10:00:45] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic: Pybal support of configuration from the kubernetes API - https://phabricator.wikimedia.org/T192437#4138747 (10Vgutierrez) This twisted-based kubernetes client could become handy: https://github.com/LeastAuthority/txkube [10:00:58] <_joe_> /win 25 [10:01:14] (03PS1) 10Jcrespo: install_server: Install all db208* servers with stretch instead of jessie [puppet] - 10https://gerrit.wikimedia.org/r/427342 [10:01:52] (03CR) 10Jcrespo: [C: 032] install_server: Install all db208* servers with stretch instead of jessie [puppet] - 10https://gerrit.wikimedia.org/r/427342 (owner: 10Jcrespo) [10:05:03] (03PS3) 10Ema: VCL: cap 404 objects TTL, stop doing so for other 4xx [puppet] - 10https://gerrit.wikimedia.org/r/427341 (https://phabricator.wikimedia.org/T180712) [10:05:57] (03CR) 10Ema: [C: 032] VCL: cap 404 objects TTL, stop doing so for other 4xx [puppet] - 10https://gerrit.wikimedia.org/r/427341 (https://phabricator.wikimedia.org/T180712) (owner: 10Ema) [10:06:12] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138752 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs2005.codfw.wmnet'] ``` and were **ALL** successful. 
[10:07:39] (03PS1) 10Elukey: role::configcluster: upgrade zookeeper main-eqiad to 3.4.9 [puppet] - 10https://gerrit.wikimedia.org/r/427343 (https://phabricator.wikimedia.org/T182924) [10:07:50] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2082 for reimage to stretch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427344 [10:09:39] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove more hiera entries related to the new deprecated image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427129 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [10:09:44] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4138757 (10Vgutierrez) [10:09:57] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138759 (10Vgutierrez) [10:10:00] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4127315 (10Vgutierrez) 05Resolved>03Open [10:12:17] (03PS2) 10Muehlenhoff: Remove more hiera entries related to the new deprecated image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427129 (https://phabricator.wikimedia.org/T188062) [10:13:29] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4138763 (10Vgutierrez) @ayounsi something got messed up when lvs2006 port descriptions were updated: ``` vgutierrez@lvs2006:~$ sudo lldpcli show neighbors |e... [10:13:58] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138764 (10Marostegui) I have captured iostat ouput during two bursts of errors. And there is some reads and cpu spike on both of them, but nothing too worrying or two mass... 
[10:15:36] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138768 (10Marostegui) [10:19:13] (03CR) 10Muehlenhoff: [C: 032] Remove more hiera entries related to the new deprecated image scalers [puppet] - 10https://gerrit.wikimedia.org/r/427129 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [10:19:36] (03PS1) 10Vgutierrez: pybal: Re-enable BGP in lvs2005 [puppet] - 10https://gerrit.wikimedia.org/r/427345 (https://phabricator.wikimedia.org/T191897) [10:20:39] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10961/" [puppet] - 10https://gerrit.wikimedia.org/r/427343 (https://phabricator.wikimedia.org/T182924) (owner: 10Elukey) [10:24:33] (03PS1) 10Jcrespo: mariadb: Repool es2013 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427346 [10:24:35] (03PS1) 10Jcrespo: mariadb: Depool db2081 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427347 [10:26:11] (03CR) 10Vgutierrez: [C: 032] pybal: Re-enable BGP in lvs2005 [puppet] - 10https://gerrit.wikimedia.org/r/427345 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [10:26:18] (03PS2) 10Vgutierrez: pybal: Re-enable BGP in lvs2005 [puppet] - 10https://gerrit.wikimedia.org/r/427345 (https://phabricator.wikimedia.org/T191897) [10:26:28] (03PS9) 10Ema: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) [10:27:44] !log Repool (Re-enable BGP) in lvs2005 - T191897 [10:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:50] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [10:32:36] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138794 (10Vgutierrez) [10:36:51] (03CR) 10Lucas Werkmeister (WMDE): mediawiki: Add 
clearTermSqlIndexSearchFields for wikidata (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779) (owner: 10Ladsgroup) [10:41:03] (03PS1) 10Vgutierrez: install_server: Reimage lvs2004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427349 (https://phabricator.wikimedia.org/T191897) [10:42:59] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2082 for reimage to stretch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427344 (owner: 10Jcrespo) [10:43:54] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage lvs2004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427349 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [10:44:10] (03PS2) 10Vgutierrez: install_server: Reimage lvs2004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/427349 (https://phabricator.wikimedia.org/T191897) [10:44:13] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2082 for reimage to stretch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427344 (owner: 10Jcrespo) [10:45:51] !log Depool and reimage lvs2004 - T191897 [10:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:57] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [10:47:07] (03PS2) 10Jcrespo: mariadb: Repool es2013 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427346 [10:47:13] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es2013 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427346 (owner: 10Jcrespo) [10:47:38] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138823 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs2004.codfw.wmnet ``` The log can be found in `/var/lo... 
[10:48:36] (03Merged) 10jenkins-bot: mariadb: Repool es2013 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427346 (owner: 10Jcrespo) [10:51:46] (03PS2) 10Jcrespo: mariadb: Depool db2081 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427347 [10:52:09] PROBLEM - Request latencies on acrux is CRITICAL: 1.742e+07 = 1e+05 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:52:09] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2081 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427347 (owner: 10Jcrespo) [10:53:09] RECOVERY - Request latencies on acrux is OK: (C)1e+05 = (W)5e+04 = 5112 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:53:21] (03Merged) 10jenkins-bot: mariadb: Depool db2081 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427347 (owner: 10Jcrespo) [10:54:59] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4138836 (10Marostegui) [10:55:46] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4126641 (10Marostegui) [10:56:58] PROBLEM - DPKG on snapshot1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:57:56] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2081, repool db2082, es2013 (duration: 01m 15s) [10:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:01] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2082 for reimage to stretch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427344 (owner: 10Jcrespo) [11:02:07] (03CR) 10jenkins-bot: mariadb: Repool es2013 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427346 (owner: 10Jcrespo) [11:02:11] (03CR) 10jenkins-bot: mariadb: Depool db2081 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427347 (owner: 10Jcrespo) [11:02:27] !log 
starting reimage of db2081 [11:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:39] (03CR) 10Giuseppe Lavagetto: [C: 031] Icinga: do not start service managed by timer [puppet] - 10https://gerrit.wikimedia.org/r/427328 (owner: 10Volans) [11:05:08] (03PS2) 10Volans: Icinga: do not start service managed by timer [puppet] - 10https://gerrit.wikimedia.org/r/427328 [11:06:03] (03CR) 10Volans: [C: 032] Icinga: do not start service managed by timer [puppet] - 10https://gerrit.wikimedia.org/r/427328 (owner: 10Volans) [11:07:03] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[hhvm-luasandbox],Package[hhvm-tidy],Package[hhvm-wikidiff2] [11:09:22] (03PS1) 10Jcrespo: mariadb: Repool db2081, depool db2080 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427354 [11:10:15] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [11:12:49] (03PS1) 10Jcrespo: Revert "mariadb: Allow reimage of all db208* servers" [puppet] - 10https://gerrit.wikimedia.org/r/427355 [11:13:37] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138858 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs2004.codfw.wmnet'] ``` and were **ALL** successful. 
[11:15:27] (03PS1) 10Muehlenhoff: Remove rendering from lvs::configuration::lvs_service_ips for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/427356 [11:16:05] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [11:17:05] PROBLEM - Request latencies on chlorine is CRITICAL: 2.131e+07 = 1e+05 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:17:27] (03PS1) 10Vgutierrez: pybal: Re-enable BGP in lvs2004 [puppet] - 10https://gerrit.wikimedia.org/r/427357 (https://phabricator.wikimedia.org/T191897) [11:18:06] the icinga check is bailing on lvs2004.mgmt [11:19:13] 10Operations, 10Puppet, 10Traffic: Puppet: tlsproxy localssl default_server make a Notify at each run - https://phabricator.wikimedia.org/T191393#4138867 (10Volans) @Joe no it would not be super easy to solve in a DRY way, I agree. But I've noticed that all the calls to `tlsproxy::localssl` in our puppet re... [11:19:30] (03CR) 10Vgutierrez: [C: 032] pybal: Re-enable BGP in lvs2004 [puppet] - 10https://gerrit.wikimedia.org/r/427357 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [11:19:52] godog: ^^^ 2.131e+07 = 1e+05 ??? 
[11:19:56] lol [11:20:40] !log Repool (Re-enable BGP) lvs2004 - T191897 [11:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:46] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [11:21:55] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [11:21:56] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [11:22:26] PROBLEM - Request latencies on acrux is CRITICAL: 1.754e+07 = 1e+05 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:22:46] moritzm: I'm looking at che icinga config [11:23:06] RECOVERY - Request latencies on chlorine is OK: (C)1e+05 = (W)5e+04 = 3653 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:23:26] RECOVERY - Request latencies on acrux is OK: (C)1e+05 = (W)5e+04 = 4153 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:23:44] volans: ack [11:23:46] PROBLEM - Request latencies on argon is CRITICAL: 1.778e+07 = 1e+05 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:24:46] RECOVERY - Request latencies on argon is OK: (C)1e+05 = (W)5e+04 = 5667 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:25:12] godog: apart the .g format that might be less readable than before, the comparison operator seems wrong too [11:28:18] moritzm: seems to have been a race condition, icinga config was failing because of a missing definition of lvs2004.mgmt but is all good now [11:28:33] on the icinga puppet side [11:29:31] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138939 (10Vgutierrez) [11:30:34] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4138940 (10Vgutierrez) [11:30:54] 
10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4127315 (10Vgutierrez) [11:31:37] volans: ack, good [11:34:30] (03Draft2) 10Reedy: New prod key for reedy [puppet] - 10https://gerrit.wikimedia.org/r/427359 [11:36:04] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [11:41:56] (03PS1) 10Muehlenhoff: Simplify threedtopng::deploy after image scaler removal [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) [11:42:24] (03CR) 10jerkins-bot: [V: 04-1] Simplify threedtopng::deploy after image scaler removal [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [11:51:36] (03PS2) 10Muehlenhoff: Simplify threedtopng::deploy after image scaler removal [puppet] - 10https://gerrit.wikimedia.org/r/427361 (https://phabricator.wikimedia.org/T188062) [11:54:46] hallo [11:55:49] For Python 2 on terbium you can use MySQLdb to connect to mysql. It doesn't work with Python 3. Is there any way to connect to the db from Python 3. [11:56:00] ? [11:56:33] (03PS1) 10Muehlenhoff: Remove role::mediawiki::imagescaler [puppet] - 10https://gerrit.wikimedia.org/r/427364 [11:57:52] aharoni: we can fix that, just needs some change to the puppet manifests, let me check [11:59:36] aharoni: can you file a Phab task and tag it "Operations"? [11:59:49] moritzm: thanks, I will. 
[12:04:35] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [12:05:35] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [12:10:44] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [12:11:44] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [12:18:14] PROBLEM - Request latencies on argon is CRITICAL: 1.808e+07 = 1e+05 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:19:14] RECOVERY - Request latencies on argon is OK: (C)1e+05 = (W)5e+04 = 8426 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:23:51] (03CR) 10Muehlenhoff: [C: 032] New prod key for reedy [puppet] - 10https://gerrit.wikimedia.org/r/427359 (owner: 10Reedy) [12:27:22] volans: heh I noticed the wrong output too, turns out icinga strips some characters [12:28:41] yeah illegal_macro_output_chars [12:33:57] either tweaking that or changing the symbol I suppose [12:34:43] yep, careful if changing it, might have a lot of unwanted side effects ;) [12:35:53] (03CR) 10Filippo Giunchedi: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) (owner: 10Filippo Giunchedi) [12:35:57] indeed [12:36:31] ≥ [12:36:37] #fixedit [12:37:11] or > [12:46:46] (03PS1) 10Ema: varnishxcache.mtail: add HTTP status info [puppet] - 10https://gerrit.wikimedia.org/r/427373 [12:47:01] (03PS1) 10Filippo Giunchedi: nagios: avoid using characters from illegal_macro_output_chars in check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427374 (https://phabricator.wikimedia.org/T192343) [12:47:05] volans: ^ [12:47:38] (03CR) 10jerkins-bot: [V: 04-1] nagios: avoid using 
characters from illegal_macro_output_chars in check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427374 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [12:47:54] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 6 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4139073 (10Mholloway) [12:48:30] (03PS2) 10Filippo Giunchedi: nagios: avoid using characters from illegal_macro_output_chars in check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427374 (https://phabricator.wikimedia.org/T192343) [12:48:59] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/427374 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [12:49:05] godog: \o/ [12:49:31] (03CR) 10jerkins-bot: [V: 04-1] nagios: avoid using characters from illegal_macro_output_chars in check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427374 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) [12:49:44] PROBLEM - Host maps-test2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:07] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2081, depool db2080 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427354 (owner: 10Jcrespo) [12:50:49] (03PS3) 10Filippo Giunchedi: nagios: avoid using icinga-illegal characters in check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427374 (https://phabricator.wikimedia.org/T192343) [12:51:36] see if that pleases jenkins [12:51:54] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [12:52:24] (03CR) 10Filippo Giunchedi: [C: 032] nagios: avoid using icinga-illegal characters in check_prometheus_metric [puppet] - 10https://gerrit.wikimedia.org/r/427374 (https://phabricator.wikimedia.org/T192343) (owner: 10Filippo Giunchedi) 
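The odd alert text discussed above ("1.742e+07 = 1e+05") is consistent with a comparison operator losing its ">" when Icinga cleans macro output. A minimal sketch of that effect follows, assuming the stock sample value of illegal_macro_output_chars; the value configured on the Icinga host, the exact operator printed by check_prometheus_metric, and the wording chosen in change 427374 are not shown in this log. `strip_illegal_chars` and `format_comparison` are hypothetical helper names used only for illustration.

```python
# Assumption: the stock Nagios/Icinga 1 sample value of
# illegal_macro_output_chars; the production value is not shown in the log.
ILLEGAL_MACRO_OUTPUT_CHARS = "`~$&|'\"<>"


def strip_illegal_chars(plugin_output: str,
                        illegal: str = ILLEGAL_MACRO_OUTPUT_CHARS) -> str:
    """Drop every character listed in `illegal`, mimicking macro cleaning."""
    return "".join(ch for ch in plugin_output if ch not in illegal)


def format_comparison(value: float, threshold: float) -> str:
    """Hypothetical output that spells the operator as a word, not '>='."""
    return f"{value:.4g} ge {threshold:.4g}"


if __name__ == "__main__":
    # '>' is stripped, so '>=' degrades to '=' in the notification text.
    print(strip_illegal_chars("1.742e+07 >= 1e+05"))
    # An operator made only of plain letters passes through unchanged.
    print(strip_illegal_chars(format_comparison(1.742e7, 1e5)))
```

Spelling the operator with plain letters is only one possible workaround; changing illegal_macro_output_chars itself would affect every check, which is the side effect cautioned against in the discussion.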
[12:52:26] (03Merged) 10jenkins-bot: mariadb: Repool db2081, depool db2080 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427354 (owner: 10Jcrespo) [12:52:51] (03CR) 10jenkins-bot: mariadb: Repool db2081, depool db2080 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427354 (owner: 10Jcrespo) [12:52:54] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.080 second response time [12:54:54] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [12:57:13] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2081, depool db2080 (duration: 01m 16s) [12:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:54] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [13:00:30] !log starting reimage of db2080 [13:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:54] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [13:01:36] hey, I have a question regarding our logs, I found two logs in Kibana and I have no idea whats that, is there anyone who could help me clarify whats going on? 
[13:01:54] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [13:03:10] jouncebot: next [13:03:11] In 0 hour(s) and 56 minute(s): Create new wikis (window to be confirmed) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1400) [13:03:28] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427375 [13:03:54] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [13:03:58] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427375 [13:04:00] jouncebot: refresh [13:04:01] I refreshed my knowledge about deployments. [13:04:04] jouncebot: next [13:04:04] In 0 hour(s) and 55 minute(s): Create new wikis (window to be confirmed) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1400) [13:04:41] wait, what happened to swat today, jouncebot did not announce it? [13:05:25] zeljkof: sorry to bother you, I know you're busy with SWAT, could you point me to the person who could answer couple questions regarding logs. I found two really strange logs and I have no idea whats that and I'm trying to find out whats that [13:05:52] and I wrote whats that twice - you see, my brain is already melting :) [13:06:15] raynor: uh, not sure, mutante is on clinic duty, maybe he would know [13:06:16] (03CR) 10Filippo Giunchedi: "LGTM overall, bear in mind though that this means effectively the current varnish_x_cache will stop updating and new ones will be created." 
[puppet] - 10https://gerrit.wikimedia.org/r/427373 (owner: 10Ema) [13:06:38] I'm here for the SWAT, not testable :D I mean I can write a javascript to edit wikidata in a crazy rate but rather not :D [13:07:54] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [13:08:37] Amir1: want to deploy your change yourself? [13:08:54] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.081 second response time [13:08:58] yeah sure [13:09:04] while I find out what needs to be deployed, jouncebot confused me by not announcing swat [13:09:19] (03PS2) 10Ladsgroup: Limit page creation and edit rate on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427156 (https://phabricator.wikimedia.org/T184948) [13:10:55] " 23485 exception: Could not open extension /usr/lib/x86_64-linux-gnu/hhvm/exten [13:10:55] sions/20150212/luasandbox.so: /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212 [13:10:55] /luasandbox.so: cannot open shared object file: No such file or directory" [13:11:00] Dat number [13:11:17] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427156 (https://phabricator.wikimedia.org/T184948) (owner: 10Ladsgroup) [13:11:30] Amir1: looks like you are the only one for swat, when done, just close the window :) [13:11:38] Nice [13:11:41] sure thing [13:11:55] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.080 second response time [13:12:29] (03Merged) 10jenkins-bot: Limit page creation and edit rate on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427156 (https://phabricator.wikimedia.org/T184948) (owner: 10Ladsgroup) [13:12:43] (03CR) 10jenkins-bot: Limit page creation and edit rate on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427156 (https://phabricator.wikimedia.org/T184948) (owner: 10Ladsgroup) [13:12:54] PROBLEM - tileratorui on maps-test2002 is 
CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [13:15:25] (03PS12) 10DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [13:15:42] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4139141 (10Niedzielski) Here's the latest report: ``` Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 254949111 Error: 503, Backe... [13:16:37] (03CR) 10jerkins-bot: [V: 04-1] Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [13:16:53] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:427156|Limit page creation and edit rate on Wikidata (T184948)]] (duration: 01m 17s) [13:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:59] T184948: limit page creation and edit rate on Wikidata - https://phabricator.wikimedia.org/T184948 [13:17:12] !log uploaded HHVM 3.18.5+dfsg-1+wmf7+deb9u1 to apt.wikimedia.org/stretch-wikimedia (includes a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854) [13:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:17] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [13:17:42] !log EU SWAT is done [13:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:51] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427375 (owner: 10Marostegui) [13:17:55] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [13:18:09] (03PS13) 10DCausse: 
Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [13:18:55] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.080 second response time [13:19:16] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427375 (owner: 10Marostegui) [13:19:36] (03PS1) 10Filippo Giunchedi: tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 [13:20:21] volans: ^ [13:20:40] (03CR) 10jerkins-bot: [V: 04-1] tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [13:20:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1096:3315 after alter table (duration: 01m 15s) [13:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:55] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [13:22:33] volans: nevermind not ready to merge yet [13:22:54] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [13:22:55] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [13:23:43] godog: lol, no prob ping me when you make it works :-P [13:24:16] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427375 (owner: 10Marostegui) [13:24:52] (03PS14) 10DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [13:25:13] * mdholloway will look into the mobileapps alert [13:27:16] (03PS2) 10Filippo Giunchedi: 
tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 [13:28:15] (03CR) 10jerkins-bot: [V: 04-1] tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [13:31:13] 10Operations, 10Prod-Kubernetes, 10Pybal, 10Traffic: Pybal support of configuration from the kubernetes API - https://phabricator.wikimedia.org/T192437#4139192 (10ema) p:05Triage>03Normal [13:32:01] (03PS3) 10Filippo Giunchedi: tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 [13:32:57] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139205 (10Marostegui) For the record, the irq for eno1 is balanced across CPUs, so I don't think it is the bottleneck here: ``` root@db1114:/srv/tmp# for i in `cat /proc/i... [13:33:08] (03CR) 10jerkins-bot: [V: 04-1] tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [13:33:34] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139207 (10Marostegui) [13:33:38] (03PS2) 10Jcrespo: Revert "mariadb: Allow reimage of all db208* servers" [puppet] - 10https://gerrit.wikimedia.org/r/427355 [13:36:58] (03PS4) 10Filippo Giunchedi: tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 [13:38:05] (03CR) 10jerkins-bot: [V: 04-1] tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [13:39:08] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.080 second response time [13:41:38] (03PS5) 10Filippo Giunchedi: tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 [13:41:57] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Allow reimage of all db208* servers" [puppet] - 10https://gerrit.wikimedia.org/r/427355 (owner: 
10Jcrespo) [13:42:08] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [13:42:17] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [13:42:43] (03CR) 10jerkins-bot: [V: 04-1] tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [13:43:17] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [13:44:51] !log uploaded HHVM 3.18.5+dfsg-1+wmf7+icu57 to apt.wikimedia.org/jessie-wikimedia (includes a patch for the MEMC_VAL_COMPRESSION_ZLIB flag in the memcached module (T184854)) [13:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:57] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854 [13:45:27] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427380 (https://phabricator.wikimedia.org/T190148) [13:47:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427380 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:47:54] (03CR) 10Filippo Giunchedi: "Failed tests are now pep8 for jupyterhub_old, which AFAICS is unused? Andrew, can it be removed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [13:48:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427380 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:49:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427380 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:49:54] !log Deploy schema change on db1100 - T191519 T188299 T190148 [13:50:00] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4139304 (10Lucas_Werkmeister_WMDE) Sorry, forgot to update – the meeting happened, I’ve created {T192452} for the outcome. Depending on how we proceed with that task, stats... [13:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:01] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [13:50:01] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [13:50:01] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [13:50:17] 10Operations, 10Traffic: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368#4139309 (10ema) [13:50:17] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [13:50:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 for alter table (duration: 01m 16s) [13:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:17] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [13:52:17] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.080 second response 
time [13:52:27] 10Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team, and 2 others: [Discuss] Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#4139329 (10Ladsgroup) 05Open>03Resolved a:03Ladsgroup I think it's clear that we should not do this thus closing it as resolved. [13:53:17] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [13:54:18] (03PS1) 10Jcrespo: install_server: Allow stretch reimage of db207* except db2079 [puppet] - 10https://gerrit.wikimedia.org/r/427381 [13:56:17] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4139341 (10Mholloway) I don't seem to be able to ACK the health check alerts (as I just attempted to do for the current unhandle... [13:57:13] (03PS1) 10Jcrespo: mariadb: Repool db2080, depool db2077 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427382 [13:59:17] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [14:00:02] !log restart kafka on kafka1001 and kafka2001 (jobqueues,eventbus) for opnejdk-7 upgrades [14:00:04] Dereckson: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Create new wikis (window to be confirmed). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1400). [14:00:04] No GERRIT patches in the queue for this window AFAICS. 
[14:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:18] PROBLEM - tilerator on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6534: Connection refused [14:00:54] 10Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team, and 2 others: [Discuss] Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#4139359 (10akosiaris) 05Resolved>03declined Declined actually. [14:01:17] RECOVERY - tilerator on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.080 second response time [14:02:10] (03CR) 10Jcrespo: [C: 032] install_server: Allow stretch reimage of db207* except db2079 [puppet] - 10https://gerrit.wikimedia.org/r/427381 (owner: 10Jcrespo) [14:02:17] PROBLEM - tileratorui on maps-test2002 is CRITICAL: connect to address 10.192.0.129 and port 6535: Connection refused [14:02:40] (03PS1) 10Ottomata: Remove unused jupyterhub_old module [puppet] - 10https://gerrit.wikimedia.org/r/427385 (https://phabricator.wikimedia.org/T183145) [14:03:19] 10Operations, 10Analytics, 10Patch-For-Review: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4139369 (10Ottomata) 05Open>03Resolved [14:04:53] !log powercycle unresponsive maps-test2001 [14:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:01] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2080, depool db2077 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427382 (owner: 10Jcrespo) [14:05:46] (03PS2) 10Ottomata: Remove unused jupyterhub_old module [puppet] - 10https://gerrit.wikimedia.org/r/427385 (https://phabricator.wikimedia.org/T183145) [14:05:54] (03CR) 10Ottomata: [V: 032 C: 032] Remove unused jupyterhub_old module [puppet] - 10https://gerrit.wikimedia.org/r/427385 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [14:06:13] 10Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team, and 2 others: [Discuss] Split ORES 
scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#4139385 (10akosiaris) 05declined>03Resolved Since this was a `[Discuss]` task, `resolved` was conceptually correct. [14:06:15] (03Merged) 10jenkins-bot: mariadb: Repool db2080, depool db2077 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427382 (owner: 10Jcrespo) [14:06:26] (03CR) 10Ottomata: "Done! https://gerrit.wikimedia.org/r/#/c/427385/" [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [14:08:17] RECOVERY - tileratorui on maps-test2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.079 second response time [14:08:47] RECOVERY - Host maps-test2001 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [14:13:57] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2080, depool db2077 (duration: 01m 16s) [14:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:02] (03PS1) 10Urbanecm: Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) [14:24:37] (03CR) 10Jayprakash12345: [C: 031] Enable edit patrol in hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427390 (https://phabricator.wikimedia.org/T192427) (owner: 10Urbanecm) [14:24:50] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139429 (10Marostegui) During the errors spike I have captured the CPU stats and it is interesting to see that some sys or usr CPU get totally overloaded some seconds befor... 
[14:30:51] (03CR) 10Gehel: [C: 031] admin: let maps-admins run any command as postgres,osmupdater,cassandra [puppet] - 10https://gerrit.wikimedia.org/r/427271 (https://phabricator.wikimedia.org/T192115) (owner: 10Dzahn) [14:32:41] (03CR) 10jenkins-bot: mariadb: Repool db2080, depool db2077 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427382 (owner: 10Jcrespo) [14:34:59] !log Disable puppet on db1114 - T191996 [14:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:06] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:37:00] !log restarting Cassandra, restbase1011-a -- T192456 [14:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:06] T192456: Prometheus metrics missing for some hosts - https://phabricator.wikimedia.org/T192456 [14:38:19] (03PS2) 10Muehlenhoff: Remove role::mediawiki::imagescaler [puppet] - 10https://gerrit.wikimedia.org/r/427364 [14:41:20] (03CR) 10Muehlenhoff: [C: 032] Remove role::mediawiki::imagescaler [puppet] - 10https://gerrit.wikimedia.org/r/427364 (owner: 10Muehlenhoff) [14:41:31] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139494 (10Marostegui) At the time of the errors (14:30:10), this is what I saw running for a couple of seconds before the errors: 14:30:06 ``` 9476 root 0 -20 29... [14:43:37] PROBLEM - Check systemd state on db1114 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:43:56] ^ that is me [14:45:22] 10Operations: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457#4139504 (10MoritzMuehlenhoff) [14:51:23] !log starting reimage of db2077 [14:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:52] Hi all! 
Can someone give me a quick summary of what a standard setup might look like (if we have one) for running a job somewhere on the cluster that arrrg maybe checks something in a browser using its own little sandboxed mediawiki install, and may send an alert somewhere under certain conditions? [14:53:37] Sorry to be so vague... If someone knows if/what infrastructure might be used for something sort of along those lines, I could walk them through the details [14:53:43] thx in advance! [14:54:57] I think labs is an option, but I'm not sure it's the right one... [14:55:19] !log restarting Cassandra, restbase1011-a to test v 0.8 of Prometheus JMX exporter -- T192456 [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:25] T192456: Prometheus metrics missing for some hosts - https://phabricator.wikimedia.org/T192456 [15:01:27] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 45.50, 34.69, 32.03 [15:03:44] (03PS6) 10Andrew Bogott: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez) [15:03:48] ^ actually, thinking more about the requirements for the question above, I guess it wouldn't need its own mediawiki install, just run a headless browser against production [15:04:14] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez) [15:04:49] I imagine a normal route might be to start by setting something up on labs, and once we've figured out all the details, and it's shown to be useful, then see if it makes sense to migrate to the prod cluster (for easier maintenance, security) [15:04:57] does any of that make sense? 
[15:05:09] (03CR) 10Bstorm: wiki replicas: Add new MCR tables to views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [15:07:56] (03PS1) 10Ottomata: Split role::kafka::analytics into different roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) [15:08:00] (03PS1) 10Andrew Bogott: Revert "Add Chicocvenancio's key for Cloud Services" [labs/private] - 10https://gerrit.wikimedia.org/r/427401 [15:08:08] (03PS2) 10Andrew Bogott: Revert "Add Chicocvenancio's key for Cloud Services" [labs/private] - 10https://gerrit.wikimedia.org/r/427401 [15:08:09] !log reindexing serbian wikis on elastic@eqiad (T189265) [15:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:15] T189265: Re-index Serbian Wikis - https://phabricator.wikimedia.org/T189265 [15:08:22] (03CR) 10jerkins-bot: [V: 04-1] Split role::kafka::analytics into different roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [15:08:27] (03CR) 10Bstorm: wiki replicas: Add new MCR tables to views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [15:09:25] (03PS2) 10Bstorm: wiki replicas: Add new MCR tables to views [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) [15:09:27] !log decommissioning Cassandra, restbase1010-b -- T189822 [15:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:34] T189822: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822 [15:09:45] (03PS2) 10Ottomata: Split role::kafka::analytics into 2 roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 
(https://phabricator.wikimedia.org/T192387) [15:10:12] (03CR) 10jerkins-bot: [V: 04-1] Split role::kafka::analytics into 2 roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [15:12:10] (03PS3) 10Bstorm: wiki replicas: Add new MCR tables to views [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) [15:13:37] (03PS3) 10Ottomata: Split role::kafka::analytics into 2 roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) [15:14:03] (03CR) 10jerkins-bot: [V: 04-1] Split role::kafka::analytics into 2 roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [15:14:10] (03CR) 10Andrew Bogott: [V: 032 C: 032] Revert "Add Chicocvenancio's key for Cloud Services" [labs/private] - 10https://gerrit.wikimedia.org/r/427401 (owner: 10Andrew Bogott) [15:16:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427402 [15:16:13] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427402 [15:16:42] (03PS1) 10Andrew Bogott: Remove chicocvenancio from contact groups [puppet] - 10https://gerrit.wikimedia.org/r/427403 [15:17:59] (03CR) 10Andrew Bogott: [C: 032] Remove chicocvenancio from contact groups [puppet] - 10https://gerrit.wikimedia.org/r/427403 (owner: 10Andrew Bogott) [15:18:18] Dereckson: is your deployment window happening? 
[15:18:38] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10964/" [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [15:18:44] (03PS4) 10Ottomata: Split role::kafka::analytics into 2 roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) [15:19:04] (03CR) 10Ottomata: [V: 032 C: 032] Split role::kafka::analytics into 2 roles so we can apply different hiera [puppet] - 10https://gerrit.wikimedia.org/r/427400 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [15:20:02] (03PS1) 10Muehlenhoff: Add task reference [puppet] - 10https://gerrit.wikimedia.org/r/427405 [15:20:04] ottomata: thanks for removing jupyterhub_old <3 [15:20:35] shurre thing [15:20:41] (03PS6) 10Filippo Giunchedi: tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 [15:20:58] (03PS2) 10Muehlenhoff: Add task reference [puppet] - 10https://gerrit.wikimedia.org/r/427405 [15:21:03] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4139616 (10Cmjohnson) Configured the raid, set to all 14 disks are raid 0 with the ssd in slot 12 first and the ssd in slot 13 second, the other 12 disks are set to raid 0 in order. Upda... [15:21:41] (03CR) 10jerkins-bot: [V: 04-1] tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [15:22:28] (03CR) 10Muehlenhoff: [C: 032] Add task reference [puppet] - 10https://gerrit.wikimedia.org/r/427405 (owner: 10Muehlenhoff) [15:22:38] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4139624 (10Cmjohnson) All the cross connects have been made and connected to the switch ports. I updated the descriptions in both switches on each row. The server still need... 
[15:23:08] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4139626 (10Cmjohnson) [15:25:22] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4139627 (10Vgutierrez) >>! In T184293#4137521, @ayounsi wrote: > @Cmjohnson: would that works for you? > > |lvs1016|eth0/eno1|asw2-d:xe-7/0/15|cable #4061| > |lvs1016|eth1/e... [15:26:42] (03PS7) 10Filippo Giunchedi: tox: run nagios_common tests [puppet] - 10https://gerrit.wikimedia.org/r/427378 [15:27:40] (03PS1) 10Ottomata: Use --new.consumer for main-eqiad -> analytics MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/427406 (https://phabricator.wikimedia.org/T192387) [15:27:42] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427402 (owner: 10Marostegui) [15:28:01] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4139635 (10Cmjohnson) @Vgutierrez I set the descriptions on the switches with both jic. xe-4/0/5 description "lvs1016 eth3/ens1f1 #3918" [15:28:31] Hello room [15:29:04] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4139637 (10Vgutierrez) >>! In T184293#4139635, @Cmjohnson wrote: > @Vgutierrez I set the descriptions on the switches with both jic. > > xe-4/0/5 description "lvs1016 eth3... 
[15:29:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427402 (owner: 10Marostegui) [15:30:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1100 after alter table (duration: 01m 15s) [15:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:10] <_joe_> !log depooling mw1227 for investigation in high load [15:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:21] Dereckson: Are you going to create wiki today? [15:36:24] Hi room this is Suyash from Hindi Wikimedians User Group [15:38:27] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:35] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4139664 (10bearND) @Mholloway I do see your ack in the Icinga UI: > feed/annoucements expected output discrepancy will be fixed... [15:38:38] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 41.77, 34.97, 32.63 [15:39:07] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:39:18] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.031 second response time [15:39:57] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.046 second response time [15:41:50] 10Operations, 10Beta-Cluster-Infrastructure: "Obama" page on Beta Cluster often responds with 503 - https://phabricator.wikimedia.org/T188913#4139674 (10Krinkle) [15:42:08] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[hhvm-luasandbox],Package[hhvm-tidy],Package[hhvm-wikidiff2] [15:42:33] (03PS1) 10Jcrespo: mariadb: Repool db2077 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427411 [15:43:15] (03PS4) 10Bstorm: wiki replicas: Add new MCR tables to views [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) [15:43:37] (03PS1) 10Cmjohnson: Removing mgmt dns from kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/427412 (https://phabricator.wikimedia.org/T182955) [15:44:16] (03PS2) 10Cmjohnson: Removing mgmt dns from kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/427412 (https://phabricator.wikimedia.org/T182955) [15:44:38] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns from kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/427412 (https://phabricator.wikimedia.org/T182955) (owner: 10Cmjohnson) [15:45:19] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests, 10Patch-For-Review: Decommission kafka1018 - https://phabricator.wikimedia.org/T182955#4139708 (10Cmjohnson) [15:45:26] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2077 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427411 (owner: 10Jcrespo) [15:45:53] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests, 10Patch-For-Review: Decommission kafka1018 - https://phabricator.wikimedia.org/T182955#3839641 (10Cmjohnson) 05Open>03Resolved removed from rack and network port updated (ge-8/0/0). 
updated racktables and tracking sheet [15:45:59] (03CR) 10Bstorm: wiki replicas: Add new MCR tables to views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [15:46:17] (03PS5) 10Bstorm: wiki replicas: Add new MCR tables to views [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) [15:46:48] (03Merged) 10jenkins-bot: mariadb: Repool db2077 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427411 (owner: 10Jcrespo) [15:47:27] (03CR) 10Filippo Giunchedi: "Ready for review!" [puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [15:48:16] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139734 (10ayounsi) That's some great investigation! >>! In T191996#4138553, @Marostegui wrote: > @ayounsi does that mean that the switch is the one not being able to cope... [15:49:07] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2077 (duration: 01m 16s) [15:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:57] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4139750 (10Marostegui) >>! In T191996#4139734, @ayounsi wrote: > That's some great investigation! > >>>! In T191996#4138553, @Marostegui wrote: >> @ayounsi does that mean... [15:54:13] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 6 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4139751 (10dr0ptp4kt) The lack of tagging did not appear to be related to any confi... 
[15:55:25] (03PS3) 10Vgutierrez: ntp: Cleanup jessie only code [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) [15:56:33] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10965/kafka1012.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/427406 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [15:59:17] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 6 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4139773 (10Nuria) @dr0ptp4kt trying to understand: is this the bug that makes the r... [16:01:20] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427402 (owner: 10Marostegui) [16:01:35] (03CR) 10jenkins-bot: mariadb: Repool db2077 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427411 (owner: 10Jcrespo) [16:07:44] (03PS7) 10Arturo Borrero Gonzalez: cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) [16:08:38] (03CR) 10Vgutierrez: "pcc looks happy and shows that it's actually a noop (as expected): https://puppet-compiler.wmflabs.org/compiler02/10966/" [puppet] - 10https://gerrit.wikimedia.org/r/427101 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [16:09:35] jouncebot: now [16:09:35] For the next 0 hour(s) and 50 minute(s): Create new wikis (window to be confirmed) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1400) [16:09:52] I'm guessing that didn't happen [16:10:18] !log ppchelko@tin Started deploy [cpjobqueue/deploy@749ae82]: Update dependencies and reduce dedupe logging rate [16:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:32] 10Operations, 10Analytics, 10Analytics-Data-Quality, 
10Analytics-Kanban, and 6 others: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country - https://phabricator.wikimedia.org/T187014#4139798 (10dr0ptp4kt) >>! In T187014#4139773, @Nuria wrote: > @dr0ptp4kt trying to... [16:10:48] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 39.35, 33.32, 32.58 [16:11:01] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@749ae82]: Update dependencies and reduce dedupe logging rate (duration: 00m 43s) [16:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:07] (03PS5) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) [16:13:13] (03PS1) 10Dzahn: icinga: add contactgroup for mobileapps to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/427417 (https://phabricator.wikimedia.org/T189524) [16:13:20] (03PS6) 10Bstorm: wiki replicas: Add new MCR tables to views [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) [16:13:28] (03PS6) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) [16:14:03] (03PS7) 10Reedy: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) [16:14:09] (03CR) 10Reedy: [C: 032] Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [16:15:34] (03Merged) 10jenkins-bot: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [16:18:19] Fatal error: Class undefined: MassMessage in /srv/mediawiki-staging/php-1.31.0-wmf.29/extensions/WikimediaMaintenance/addWiki.php on line 278 [16:18:20] ffs [16:19:30] (03PS1) 10Cmjohnson: Removing dns for elastic1021 [dns] - 
10https://gerrit.wikimedia.org/r/427420 (https://phabricator.wikimedia.org/T189727) [16:19:39] Luckily, last step we care about [16:20:03] (03PS2) 10Cmjohnson: Removing dns for elastic1021 [dns] - 10https://gerrit.wikimedia.org/r/427420 (https://phabricator.wikimedia.org/T189727) [16:20:45] (03CR) 10Cmjohnson: [C: 032] Removing dns for elastic1021 [dns] - 10https://gerrit.wikimedia.org/r/427420 (https://phabricator.wikimedia.org/T189727) (owner: 10Cmjohnson) [16:20:53] (03CR) 10Andrew Bogott: [C: 032] cloudvps: add apt pinning for mitaka on jessie [puppet] - 10https://gerrit.wikimedia.org/r/426891 (https://phabricator.wikimedia.org/T192162) (owner: 10Arturo Borrero Gonzalez) [16:21:21] !log reedy@tin Synchronized dblists/: advisorswiki (duration: 01m 16s) [16:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4139841 (10Cmjohnson) [16:24:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4139850 (10Cmjohnson) [16:24:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4051243 (10Cmjohnson) 05Open>03Resolved Updated rackables and tracking sheet. 
[16:24:57] !log reedy@tin rebuilt and synchronized wikiversions files: advisorswiki [16:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:25] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: advisorswikki (duration: 01m 15s) [16:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:39] (03PS1) 10Andrew Bogott: labtestpuppetmaster: switch to default openstack version [puppet] - 10https://gerrit.wikimedia.org/r/427429 (https://phabricator.wikimedia.org/T192162) [16:26:52] (03PS3) 10Reedy: Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181) [16:26:55] (03CR) 10Reedy: [C: 032] Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [16:27:43] (03CR) 10Andrew Bogott: [C: 032] labtestpuppetmaster: switch to default openstack version [puppet] - 10https://gerrit.wikimedia.org/r/427429 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [16:28:39] (03Merged) 10jenkins-bot: Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [16:28:55] Jayprakash12345: yes, right after Reedy has finisehd ton Tin* [16:29:05] Dereckson: You know you've missed most of the window? :P [16:29:25] Dereckson: Thanks for answer [16:29:32] jouncebot: now [16:29:32] For the next 0 hour(s) and 30 minute(s): Create new wikis (window to be confirmed) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1400) [16:29:34] let's see how swat is loaded [16:30:05] There are only two swat patches. [16:30:22] but in any case, we need to be at 18:00 UTC [16:30:35] AndyRussG: ejegg: ping? 
[16:30:40] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: translatew for advisorswiki (duration: 01m 16s) [16:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:55] Dereckson: hi :) [16:31:05] 10Operations, 10ops-eqiad, 10DBA: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4139871 (10Cmjohnson) [16:31:17] Dereckson: yeah I was getting ready to add another infact ;p [16:31:18] AndyRussG: you're available now for your patches scheduled for SWAT? [16:31:43] If so, we can do them now, and create the new wiki just afterwards. [16:31:48] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 855.44 seconds [16:32:00] Dereckson: yes, just need maybe 5 min to merge the additional one to the CentralNotice deploy branch, sound good? [16:32:17] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:32:28] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 895.04 seconds [16:32:35] !log reedy@tin Synchronized php-1.31.0-wmf.30/extensions/WikimediaMaintenance: fix addwiki.php (duration: 01m 18s) [16:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:52] AndyRussG: it is [16:33:17] (03CR) 10jenkins-bot: Add advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426762 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [16:33:17] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:21] dbstore1002 complaining, could it have crash? [16:33:21] (03CR) 10jenkins-bot: Enable Translate on advisorswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426764 (https://phabricator.wikimedia.org/T189181) (owner: 10Reedy) [16:33:27] Dereckson: cool thx! 
[16:33:38] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:48] I guess we'll see how Jenkins is running today also... [16:33:56] Dereckson: hi, thanks! [16:35:33] (03PS1) 10Andrew Bogott: labpuppetmaster: switch to default openstack version [puppet] - 10https://gerrit.wikimedia.org/r/427430 (https://phabricator.wikimedia.org/T192162) [16:36:14] (03CR) 10Anomie: [C: 031] "Looks sane to me. Haven't tested." [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [16:36:20] Dereckson: I'll +2 the config change now then, too [16:36:33] (03PS1) 10Cmjohnson: Removing mgmt dns for uranium [dns] - 10https://gerrit.wikimedia.org/r/427432 (https://phabricator.wikimedia.org/T183209) [16:36:48] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:36:48] (03PS1) 10Reedy: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427433 [16:36:50] (03CR) 10Reedy: [C: 032] Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427433 (owner: 10Reedy) [16:36:54] (03PS2) 10Cmjohnson: Removing mgmt dns for uranium [dns] - 10https://gerrit.wikimedia.org/r/427432 (https://phabricator.wikimedia.org/T183209) [16:37:13] (03CR) 10AndyRussG: [C: 032] CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427275 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [16:37:24] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for uranium [dns] - 10https://gerrit.wikimedia.org/r/427432 (https://phabricator.wikimedia.org/T183209) (owner: 10Cmjohnson) [16:37:51] AndyRussG: you've CR +2 the mediawiki-config [16:37:57] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:38:11] Dereckson: yes [16:38:18] I think Reedy is busy on Tin [16:38:23] Ah oops [16:38:25] I'm just about finished [16:38:32] Sorry [16:38:45] Thought it might speed things along [16:38:50] (03CR) 10Andrew Bogott: [C: 032] labpuppetmaster: switch to default openstack version [puppet] - 10https://gerrit.wikimedia.org/r/427430 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [16:38:57] (03Merged) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427433 (owner: 10Reedy) [16:38:59] (03Merged) 10jenkins-bot: CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427275 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [16:39:12] (03CR) 10jenkins-bot: Updating interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427433 (owner: 10Reedy) [16:40:46] yay, broken stuff [16:41:12] 36962 exception: Could not open extension /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212/luasandbox.so: /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212/luasandbox.so: cannot open shared object file: N [16:41:16] o such file or directory [16:41:18] this one [16:41:29] seems the extension isn't on a server [16:41:40] which server? [16:41:48] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:41:50] 20150212 is an old API version [16:41:55] didn't apergos had that same error a few hours ago? [16:42:01] but it's HHVM so ok [16:42:16] ask around, probably a known issue [16:42:17] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:42:25] !log reedy@tin Synchronized wmf-config/interwiki.php: sync! 
(duration: 01m 15s) [16:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:38] I did, it as due to the absence of those extensions on my os (trusty) where they are not yet removed from the php.ini [16:42:57] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.38 seconds [16:43:01] Dereckson: what host is this? [16:43:04] and what os? [16:44:28] on fatalmonitor dashboard, snapshot1001 [16:44:30] and then, depending on the answers to that, I may want to know if this is for a script from the command line or for an app server serving a web page [16:44:31] oh [16:44:34] yeah known [16:44:41] that would be the host I am working on [16:45:03] it will be better a little later, I'm in a meeting right now so really should not push review and deploy the puppet patch yet [16:46:49] Reedy: you're done? [16:47:24] !log deleted lots of log files (mostly nova-api logs) on labtestnet2001 [16:47:27] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [16:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:32] Dereckson: I think so [16:48:43] AndyRussG: live on mwdebug1002 < CentralNotice: emit CSP headers on banner previews [16:49:13] Dereckson: ok thanks! 
Sorry, I didn't mention, that's a no-op until the other change is deployed :) [16:49:14] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4139952 (10fgiunchedi) [16:49:54] Dereckson: https://gerrit.wikimedia.org/r/#/c/427235/ [16:50:26] The submodule pointers for core should be updated now that that has merged [16:50:30] ok [16:50:33] (Sorry, CentralNotice is still a snowflake) [16:50:55] Dereckson: I'm still hoping to get the other patch in, just had to wait for that one to merge to our deploy branch [16:51:48] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:52:01] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4139962 (10Reedy) [16:52:17] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:53:15] Dereckson: this is the other one: https://gerrit.wikimedia.org/r/#/c/427439/ [16:53:23] Jenkins should merge it shortly... [16:53:35] will be for wmf30 too? [16:55:14] Dereckson: yes, I think that's easiest for the upcoming train? [16:55:22] * Dereckson nods [16:55:32] (03PS2) 10Dzahn: admin: let maps-admins run any command as postgres,osmupdater,cassandra [puppet] - 10https://gerrit.wikimedia.org/r/427271 (https://phabricator.wikimedia.org/T192115) [16:55:37] Dereckson: thanks!!! 
:) [16:55:47] (03PS1) 10ArielGlenn: for hhvm on trusty for the snapshot hosts, disable extra hhvm extensions [puppet] - 10https://gerrit.wikimedia.org/r/427442 [16:55:55] I'll ping in a sec when it merges then [16:55:56] (03CR) 10Dzahn: [C: 032] "approved in ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/427271 (https://phabricator.wikimedia.org/T192115) (owner: 10Dzahn) [16:56:10] (03CR) 10jerkins-bot: [V: 04-1] for hhvm on trusty for the snapshot hosts, disable extra hhvm extensions [puppet] - 10https://gerrit.wikimedia.org/r/427442 (owner: 10ArielGlenn) [16:57:01] (03PS1) 10Andrew Bogott: Openstack: move remaining hosts to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/427443 (https://phabricator.wikimedia.org/T192162) [16:57:07] AndyRussG: https://gerrit.wikimedia.org/r/#/c/427235/ is on mwdebug1002 [16:57:40] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4139985 (10faidon) 05stalled>03Open Seems fine :) Welcome back Sean! [16:58:00] AndyRussG: to deploy, order will be CommonSettings first, extension then? [16:58:18] (03CR) 10Volans: [C: 031] "Code looks good, let's verify that tox environments are run only when those file change." 
[puppet] - 10https://gerrit.wikimedia.org/r/427378 (owner: 10Filippo Giunchedi) [16:58:47] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [16:58:47] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [16:59:08] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [16:59:46] ^ ongoing ticket about Icinga privs and notifications for people in mobileapps [17:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1700). [17:00:04] AndyRussG and ejegg: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:05] non-"admin" Icinga contact groups having privs [17:00:12] for "their" services [17:00:56] (03PS2) 10ArielGlenn: for hhvm on trusty for the snapshot hosts, disable extra hhvm extensions [puppet] - 10https://gerrit.wikimedia.org/r/427442 [17:01:17] (03PS2) 10Andrew Bogott: Openstack: move remaining hosts to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/427443 (https://phabricator.wikimedia.org/T192162) [17:01:41] Jayprakash12345: if AndyRussG isn't more reactive, I can't move forward for wikis. 
[17:02:02] ACKNOWLEDGEMENT - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues andrew bogott Andrew is working on this [17:02:02] ACKNOWLEDGEMENT - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues andrew bogott Andrew is working on this [17:02:02] ACKNOWLEDGEMENT - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues andrew bogott Andrew is working on this [17:02:09] (03CR) 10Andrew Bogott: [C: 032] Openstack: move remaining hosts to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/427443 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [17:02:25] Dereckson: I'm here [17:02:30] Sorry I didn't see the pings [17:02:35] Dereckson: will see [17:02:37] aaaarg [17:02:49] 10Operations, 10ops-eqiad, 10hardware-requests, 10monitoring, and 2 others: decom uranium - https://phabricator.wikimedia.org/T183209#4140009 (10Cmjohnson) [17:03:08] 10Operations, 10ops-eqiad, 10hardware-requests, 10monitoring, and 2 others: decom uranium - https://phabricator.wikimedia.org/T183209#3846949 (10Cmjohnson) 05Open>03Resolved Removed from rack, racktables updated, tracking sheet updated [17:03:19] AndyRussG: so the CSP code is live on mwdebug1002: CommonSettings + the first change for CN wmf_deploy [17:03:58] Dereckson: checking [17:06:48] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:06:52] Dereckson: I'm not seeing it? [17:07:05] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: sudoer access for pnorman on maps servers - https://phabricator.wikimedia.org/T192115#4140021 (10Dzahn) 05Open>03Resolved @pnorman @Gehel The request has been approved in ops meeting and the change above has been merged. I ran p... 
[17:07:19] mwdebug1002 [17:08:33] dcdeb2394c688924ef506be95b70f5e0 wmf-config/CommonSettings.php [17:08:47] dcdeb2394c688924ef506be95b70f5e0 wmf-config/CommonSettings.php [17:08:50] CS matches [17:09:12] let's check CN [17:10:02] [tin] 798ae24fcead19055f71a436a23375d1 php-1.31.0-wmf.30/extensions/CentralNotice/CentralNotice.hooks.php [17:10:18] Dereckson: the first patch should be 5ea9990438010c0eed858825f75d95551afe26fe [17:10:20] [mwdebug1002] 798ae24fcead19055f71a436a23375d1 php-1.31.0-wmf.30/extensions/CentralNotice/CentralNotice.hooks.php [17:10:23] yes [17:10:29] and this first patch is live on mwdebug1002 [17:10:35] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4140035 (10ArielGlenn) [17:10:58] (03PS1) 10Cmjohnson: removing dns entries for lawrencium [dns] - 10https://gerrit.wikimedia.org/r/427445 (https://phabricator.wikimedia.org/T191360) [17:11:04] dereckson@mwdebug1002:/srv/mediawiki$ ls php-1.31.0-wmf.30/extensions/CentralNotice/resources/subscribing/ext.centralNotice.cspViolationAlert.js [17:11:07] php-1.31.0-wmf.30/extensions/CentralNotice/resources/subscribing/ext.centralNotice.cspViolationAlert.js [17:11:14] on what wiki are you testing that? [17:11:29] meta [17:11:40] meta is in group1, so .30 [17:11:57] ah 31 ok [17:12:05] (03CR) 10Cmjohnson: [C: 032] removing dns entries for lawrencium [dns] - 10https://gerrit.wikimedia.org/r/427445 (https://phabricator.wikimedia.org/T191360) (owner: 10Cmjohnson) [17:12:18] AndyRussG: no this week is 30, and meta is already 30 [17:12:33] yes [17:13:02] but on mwdebug1002 you deployed for wikis only on 31, right? 
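The check above (running md5sum on tin and on mwdebug1002, then comparing the digests by eye) can be sketched in Python. This is only an illustration: `parse_md5sum` and `mismatches` are hypothetical helper names, not part of scap or any Wikimedia tooling, and the input is assumed to be plain `md5sum` output pasted from each host.

```python
def parse_md5sum(output):
    """Map path -> digest from md5sum output lines of the form '<digest>  <path>'."""
    digests = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, path = line.split(None, 1)
        # md5sum marks binary mode with a leading '*' on the path; drop it.
        digests[path.lstrip("*")] = digest.lower()
    return digests


def mismatches(staging_output, target_output):
    """Return sorted paths whose digests differ or that appear on one side only."""
    staging = parse_md5sum(staging_output)
    target = parse_md5sum(target_output)
    return sorted(
        path
        for path in staging.keys() | target.keys()
        if staging.get(path) != target.get(path)
    )
```

An empty list means every listed file is byte-identical on both hosts. As the discussion shows, that only confirms the files were synced, not that the wiki being tested runs the branch they were synced to.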
[17:13:02] https://meta.wikimedia.org/w/extensions/CentralNotice/resources/subscribing/ext.centralNotice.cspViolationAlert.js works targetting on mwdebug1002 by the way [17:13:33] (03PS1) 10Andrew Bogott: openstack::cloudrepo: Tentatively support Stretch + Ocata [puppet] - 10https://gerrit.wikimedia.org/r/427447 (https://phabricator.wikimedia.org/T192162) [17:13:43] mwdebug1002 is a regular application host, with all versions available, and I've currently put there two things: [17:13:52] (1) the config change https://gerrit.wikimedia.org/r/#/c/427275/ [17:14:04] (2) the first of the two CN change, for wmf30, https://gerrit.wikimedia.org/r/#/c/427235/ [17:15:14] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/427442 (owner: 10ArielGlenn) [17:15:16] (03CR) 10Andrew Bogott: [C: 032] openstack::cloudrepo: Tentatively support Stretch + Ocata [puppet] - 10https://gerrit.wikimedia.org/r/427447 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [17:15:31] Dereckson: I don't doubt it, somehow the new header is not being emitted though [17:15:49] Maybe there's something in the varnish config that blocks it [17:16:01] Was tested yesterday on the beta cluster [17:16:21] Shouldn't be hitting the cache though, since I'm testing logged-in [17:17:08] I do get this header: server [17:17:10] mwdebug1002.eqiad.wmnet [17:17:18] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:17:24] this one is expected [17:17:26] so the mwdebug extension seems to be working [17:17:47] ejegg: ^ can you take a peek? 
The csp header code is out on mwdebug1002 but I'm not seeing the header [17:17:56] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140061 (10Reedy) [17:18:30] You should be able to go to https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners and click to preview any banner, and see the csp header [17:18:38] also didn't see it on test2 [17:19:11] (For example: https://test2.wikipedia.org/wiki/Main_Page?banner=B1718_0416_nlNL_dsk_p1_lg_bdr_ong&uselang=en&force=1) [17:19:17] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140070 (10Reedy) You'll want approval from @K4-713 :) [17:19:26] (03PS3) 10ArielGlenn: for hhvm on trusty for the snapshot hosts, disable extra hhvm extensions [puppet] - 10https://gerrit.wikimedia.org/r/427442 [17:19:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4140076 (10Cmjohnson) [17:19:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4140074 (10Cmjohnson) 05Open>03Resolved removed from rack, tracking sheet updated. 
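Looking for the header by hand in the browser's network panel can also be scripted. A minimal sketch, assuming the response headers are available as (name, value) pairs; `find_csp_headers` is a hypothetical helper, and checking the report-only variant as well is an assumption, since the log does not say which form the banner preview emits.

```python
# Header names are case-insensitive, so compare in lowercase.
CSP_HEADER_NAMES = (
    "content-security-policy",
    "content-security-policy-report-only",  # assumption: also worth checking
)


def find_csp_headers(headers):
    """Return {lowercased name: value} for any CSP headers among (name, value) pairs."""
    found = {}
    for name, value in headers:
        key = name.strip().lower()
        if key in CSP_HEADER_NAMES:
            found[key] = value
    return found
```

With the standard library, the pairs could come from `response.headers.items()` on the object returned by `urllib.request.urlopen`. An empty result corresponds to the "not seeing the header" state described here.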
[17:19:58] PROBLEM - Host lawrencium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:20:17] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4140085 (10Reedy) [17:20:43] (03CR) 10jenkins-bot: CentralNotice: emit CSP headers on banner previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427275 (https://phabricator.wikimedia.org/T190100) (owner: 10Ejegg) [17:20:49] (03CR) 10ArielGlenn: [C: 032] for hhvm on trusty for the snapshot hosts, disable extra hhvm extensions [puppet] - 10https://gerrit.wikimedia.org/r/427442 (owner: 10ArielGlenn) [17:21:00] Dereckson: just looking with debug mode to see if the client-side code is there [17:21:15] I confirm at https://meta.wikimedia.org/w/index.php?title=Fundraising_2011/Banners_2/lv&banner=B1718_0417_ennlNL_m_p1_lg_txt_skip&uselang=en&force=1 I don't have a CSP either [17:22:20] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#4140106 (10Cmjohnson) [17:22:37] Dereckson: I'm not seeing the related RL module being added to the page, either [17:22:47] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3597404 (10Cmjohnson) 05Open>03Resolved Removed from rack, updated tracking sheet [17:22:51] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4140111 (10ayounsi) [17:23:30] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4140116 (10ayounsi) [17:23:33] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4127315 (10ayounsi) 05Open>03Resolved Indeed, thanks for double 
checking. All 3 hosts done and verified. [17:24:10] Dereckson: that's expected though if the config change isn't out [17:24:17] And Jenkins says it just merged? https://gerrit.wikimedia.org/r/#/c/427275/ [17:24:57] Dereckson: can you double-check that the config change is deployed perhaps? [17:25:00] and it's live too on mwdebug1002 [17:25:08] commit sha? [17:25:29] 392482abb305ff5be30135b8f064c82b9b69dec4 [17:25:38] and MD5 hashes match for CS [17:26:11] Dereckson: yes that's right [17:27:00] (03PS1) 10ArielGlenn: don't install the external hhvm extensions on trusty [puppet] - 10https://gerrit.wikimedia.org/r/427450 [17:28:17] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:29:05] Dereckson: still not seeing the header, and the RL module is still not there, either [17:29:24] Not sure where to go from here [17:29:48] Do you want to revert, try it elsewhere, or try the other patch? [17:30:09] AndyRussG: okay, so we've several options: 1. revert, your team tests on mwdebug1001 later / 2. keep the code as asserted not broken, submit further test fix change later / 3. revert config and code changes [17:30:11] Varnish config shouldn't interfere with the RL module being added [17:30:53] er 3. submit a fix in the next minutes if you've someone busy on it [17:31:08] I've no idea why it's not working [17:31:36] Which is the safest option that doesn't interfere with other deployment stuff? Maybe option 1? [17:32:01] The config change is a no-op, I don't think it's necessary to revert that [17:32:08] that is, no-op without the code change [17:32:41] I'd go with 1 [17:33:04] man, this is frustrating :( [17:33:12] ejegg: any idea what could be up? [17:33:18] So, we keep the config, and revert the code? [17:33:27] Dereckson: sounds safest for now, yes [17:33:29] no... 
the header shouldn't depend on resourceloader or anything [17:33:44] ack'ed [17:33:47] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:33:48] ejegg: correct, but the new RL module should also be going out [17:34:16] right, we should see both a new RL module and a new header [17:34:20] but we see neither [17:34:27] ejegg: so if the problem were something in prod that somehow overrides our attempt to set the new header, we should at least see the new RL module [17:34:39] which suggests it's somehow not fully deployed or activated [17:34:39] I also tried w/ debug=true [17:35:09] ejegg: yes... This state would occur if somehow the new code wasn't getting the config [17:35:19] Maybe spelling mistake in the config variable? [17:35:20] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Emit CSP headers on banner previews (T190100, no-op for now) (duration: 01m 16s) [17:35:21] yeah, me too, and it's def not loading the new RL [17:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:26] T190100: Option to enforce CSP on banner previews and flag errors - https://phabricator.wikimedia.org/T190100 [17:35:46] AndyRussG: pretty sure it's the same as in beta [17:36:31] Is banner being passed? [17:37:18] Reedy: yes the banner displays as expected for the preview [17:37:24] ejegg: not a spelling mistake [17:37:57] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:39:09] Dereckson: will the config change go live everywhere I guess, rather than just mwdebug1002? 
[17:39:19] indeed, it's now live everywhere [17:39:25] Ok thanks [17:39:55] ejegg: about your -1 to precent premature deploy, we only deploy config changes on request by the way [17:40:05] it's not part of a train or anything [17:40:23] ejegg: Reedy: I think we can go somewhere on prod to make sure the new config variable is visible to php, can't imagine why it wouldn't be tho [17:40:23] (so when you need a config change to be merged, you've to add to swat window like today) [17:40:24] ah cool, was just following some on-wiki directions [17:40:32] (probably out of date) [17:40:34] reedy@tin:~$ mwscript eval.php metawiki [17:40:34] > echo $wgCentralNoticeContentSecurityPolicy; [17:40:34] default-src *.wikimedia.org *.wikipedia.org *.wiktionary.org *.wikisource.org *.wikibooks.org *.wikiversity.org *.wikiquote.org *.wikinews.org www.mediawiki.org www.wikidata.org *.wikivoyage.org data: blob: 'self'; script-src *.wikimedia.org 'unsafe-inline' 'unsafe-eval' 'self'; style-src *.wikimedia.org data: 'unsafe-inline' 'self'; [17:40:34] > echo (bool)$wgCentralNoticeContentSecurityPolicy; [17:40:39] 1 [17:40:50] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4140193 (10ayounsi) Private interfaces ranges have been created on asw2-a/c-eqiad and interfaces added (they can't be created without interfaces). [17:41:26] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4140198 (10ayounsi) [17:41:47] Reedy: okok thanks :) [17:44:14] Dereckson: thanks so much for helping w/ this... Here's some of the doc for config changes, I guess we could update then: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_3:_configuration_and_other_prep_work [17:44:42] "submit them for gerrit review with a -1 comment to avoid early deployment" [17:45:05] AndyRussG: ohhhh [17:45:06] ! [17:45:10] ??? 
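The policy string echoed from eval.php above is easier to review once it is split into directives. A rough sketch, assuming the usual CSP serialization (directives separated by semicolons, tokens by whitespace); `parse_csp` is a hypothetical helper, not a MediaWiki or CentralNotice function.

```python
def parse_csp(policy):
    """Split a CSP policy string into {directive name: [source expressions]}."""
    directives = {}
    for part in policy.split(";"):
        tokens = part.split()
        if not tokens:
            continue  # skip the empty piece after a trailing ';'
        name = tokens[0].lower()
        # A repeated directive is ignored; the first occurrence wins.
        directives.setdefault(name, tokens[1:])
    return directives
```

For the value shown in the log, this yields three keys: `default-src`, `script-src` and `style-src`.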
[17:45:12] perhaps look the history, if it hasn't been added for a good reason [17:45:28] special:version shows the old version of CN [17:45:39] ohwait, sorry, did you guys roll back just now? [17:45:45] ejegg: it was only on mwdebug1002, yeah it was just rolled back [17:46:04] (I think?) [17:46:07] (it's still on mwdebug1002 and only there) [17:46:08] ah, sorry, got it [17:46:45] I know sometimes Chad likes to CR +2 older changes waiting, so yes perhaps it's not a bad idea to indicate if they are'nt ready for deploy [17:47:06] How would we do that? [17:47:10] (but probably more important for IS changes to test, than CS changes to provide a future config value for a future extension change) [17:47:39] 2.6.0 (e9f195f) 00:49, 14 March 2018 [17:48:19] Hmm, I'm using the ff extension to try to load mwdebug1002 but now I see it's not hitting that box [17:48:25] ,"wgHostname":"mw1249" [17:48:36] what are other folks using to get there? [17:48:40] ejegg: put the on/off button at on [17:48:49] I've (e9f195f) at mwdebug1002 too [17:49:08] 1.31.0-wmf.29 (1cb7198) [17:49:10] and it's normal [17:49:14] meta is still 1.31.0-wmf.29 [17:49:22] yep, got that set to 'on' and mwdebug1002 chosen in the dropdown [17:49:24] huh [17:49:27] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/10969/" [puppet] - 10https://gerrit.wikimedia.org/r/427450 (owner: 10ArielGlenn) [17:49:29] https://tools.wmflabs.org/versions/ indicates the version for the day [17:49:37] and in 70 minutes, meta will be wmf30 [17:49:59] sorry, but e9f... is not the newest [17:50:00] so you can test on mediawiki.org, test.wikipedia.org? [17:50:16] ejegg: it's because your change has been deplyed for 1.31.0-wmf.30 [17:50:17] did the submodule pointer not get updated? [17:50:22] and meta is still currently at 1.31.0-wmf.29 [17:50:30] ahh, ok [17:50:32] so yes it has been updated, but only for wiki at 1.31.0-wmf.30 like test. 
or mediawiki.org [17:50:42] sorry, I'll peek on mediawiki.org then [17:50:45] ejegg: should be good to test on mediawiki or test2 [17:50:49] good catch ejegg by the way [17:51:00] (03CR) 10ArielGlenn: [C: 032] don't install the external hhvm extensions on trusty [puppet] - 10https://gerrit.wikimedia.org/r/427450 (owner: 10ArielGlenn) [17:51:28] * Dereckson looks https://www.mediawiki.org/wiki/Special:Version [17:51:32] 1.31.0-wmf.30 (c947a6e) [17:51:36] MediaWiki is good [17:51:42] 2.6.0 (e9f195f) 00:49, 14 March 2018 [17:51:43] CN is not [17:51:55] https://test2.wikipedia.org/w/index.php?title=Special:Version on mwdebug1002 still also shows the old CN [17:52:12] yep, seeing the same thing [17:52:30] tin is at 5ea9990438010c0eed858825f75d95551afe26fe [17:52:53] let's rescap mwdebug1002 [17:53:01] 2.6.0 (5ea9990) 20:27, 17 April 2018 [17:53:03] (03CR) 10Dzahn: [C: 032] "we'll see about the empty resource reference but based on "it just affects trusty" and the worst that can happen it breaks on snapshot.. a" [puppet] - 10https://gerrit.wikimedia.org/r/427450 (owner: 10ArielGlenn) [17:53:14] okay, now https://www.mediawiki.org/wiki/Special:Version is at 5ea9990 [17:53:43] !log restart elasticsearch on elastic1017 [17:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:21] No idea why the previous scap pulls at 16:50:47, 16:56:50 and 16:57:24 didn't take it [17:54:33] Dereckson: ok seeing the same now on mediawiki [17:54:38] Lemme test the code then [17:54:53] nice! seeing the header [17:55:12] ...and the RL module! [17:55:21] it's weird by the way [17:55:25] because we had the JS file! [17:55:36] super weird [17:56:40] so this change works? 
[17:56:53] let's test the other one [17:56:57] https://www.mediawiki.org/wiki/Toolserver:User:Tommy_Kronkvist?banner=PT_B1718_0415_enIN_anon_variance_44&uselang=en&force=1 looks good [17:57:02] on mwdebug1002 [17:57:08] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:57:15] getting the header [17:57:41] afcf52a265273c91c32e1f03c408a679a3a67ea9 Add privacy warning and allow wikitext in banner content field summary is live on mwdebug1002 (or should be) [17:57:42] ejegg: Dereckson: the RL module is now loaded as expected, too [17:58:21] Dereckson: on which version ^ ? The UI to see if it works is only available on meta [17:58:29] or is it on mwdebug1002 on all versions? [17:58:32] yes, I ran scap pull on mwdebug1002, but the UI still shows 5ea9990 [17:58:50] cached version page? [17:59:33] ejegg: shouldn't be cached if ur logged-in, eh? [17:59:40] https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/B1718_0417_ennlNL_m_p1_lg_txt_skip doesn't show the new UI element [17:59:48] ah yeah [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:05] AndyRussG: didn't Dereckson just say meta isn't on the new version yet? [18:00:20] You should see the new message about privacy just under where it says "edit banner" [18:00:24] ejegg: yea [18:00:34] so, after the train deploy, it'll have it?
[18:00:49] dereckson@mwdebug1002:/srv/mediawiki$ md5sum php-1.31.0-wmf.30/extensions/CentralNotice/special/SpecialCentralNoticeBanners.php [18:00:52] 0d716fca070b544596995414c4adacad php-1.31.0-wmf.30/extensions/CentralNotice/special/SpecialCentralNoticeBanners.php [18:00:55] dereckson@tin:/srv/mediawiki-staging/php-1.31.0-wmf.30/extensions/CentralNotice$ md5sum special/SpecialCentralNoticeBanners.php [18:00:58] 0d716fca070b544596995414c4adacad special/SpecialCentralNoticeBanners.php [18:01:01] they match [18:01:06] so yes, it could be Special:Version uses some caching [18:01:15] (03CR) 10Smalyshev: "Generally lgtm except for one minor issue." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [18:01:24] (03CR) 10Smalyshev: [C: 031] Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [18:02:11] Dereckson: ok, sorry for my confusion... So, just to check I understand, even though that version page isn't good, meta served from mwdebug1002 should indeed use the new code now? [18:02:23] no, but mediawiki.org yes [18:02:28] Dereckson: ok gotcha [18:02:41] meta will serve it AFTER the train in 58 minutes [18:02:42] Dereckson: if you want to test the second patch, that has to go to meta [18:02:51] Dereckson: ok that's also totally fine [18:03:27] AndyRussG: what about revert your change, and resubmit it after train in evening swat? [18:03:36] that will then be possible to test it on meta [18:03:39] Dereckson: it's really simple enough that it doesn't need testing on mwdebugs [18:04:02] There is an l10n change, and for that, I agree [18:04:19] Dereckson: let's just push it all out... 
The first change works great on mediawiki [18:04:27] and the PHP code is: [18:04:28] - $this->msg( 'centralnotice-edit-template-summary' )->escaped() ), [18:04:31] + $this->msg( 'centralnotice-edit-template-summary' )->parse() ), [18:04:34] The second one, which only shows on meta, is very low-risk [18:04:35] and we can test that [18:04:36] yep [18:04:53] cool [18:05:37] k, looks like we're in good shape for just now... i'mma grab some food while we wait for the train [18:06:19] If I do a `wfGetMessage('centralnotice-edit-template-summary')->parse()` in the psysh shell [18:06:29] I should get the expected value, shouldn't I? [18:07:07] Dereckson: yes [18:09:01] > echo wfMessage('centralnotice-edit-template-summary')->parse() [18:09:01] To create a localisable message, enclose a hyphenated string in three curly brackets, e.g. {{{jimbo-quote}}}. [18:09:04] dereckson@mwdebug1002:/srv/mediawiki$ mwscript eval.php metawiki [18:09:07] looks good? [18:09:27] Dereckson: that's the old message [18:09:40] ah yes, i18n requires a full scap [18:09:42] scap won't have been run [18:10:27] so who's responsible for the train today? [18:10:37] thcipriani: ping? [18:11:47] Dereckson: just to check, then, where are we with the first patch? It worked fine on mediawiki.org [18:11:51] AndyRussG: best is to get thcipriani green light to deploy this, or at least notify them about what will happen [18:12:22] I can deploy the first patch, or both [18:12:32] Dereckson: let's [18:12:52] or if it's easier, notify thcipriani and ask them to go out on the train [18:13:12] but for testing purposes, I think both are fine to push out, in whatever way is the least trouble for everyone [18:13:30] I mean, as regards testing [18:13:46] as in, both are tested enough or simple enough to go out [18:13:48] what's up? 
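The md5sum comparison earlier in this exchange (same hash for SpecialCentralNoticeBanners.php on tin and on mwdebug1002) can be sketched in a few lines. This is only an illustrative sketch, not the tooling used in the channel: the helper names `md5_of` and `same_content` are hypothetical, and the two temporary files stand in for the staging and deployed copies.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same MD5 digest."""
    return md5_of(path_a) == md5_of(path_b)


# Stand-ins for the staging copy and the deployed copy (hypothetical paths).
workdir = tempfile.mkdtemp()
staging = os.path.join(workdir, "staging.php")
deployed = os.path.join(workdir, "deployed.php")
for target in (staging, deployed):
    with open(target, "w") as handle:
        handle.write("<?php // same content on both hosts\n")

files_match = same_content(staging, deployed)
```

A matching digest only shows that the file on disk is identical. As noted in the chat, it says nothing about what a cached Special:Version page or an unrebuilt l10n cache will display.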
[18:14:34] thcipriani: there is a CentralNotice change for wmf30 merged but not testable, a simple l10n update here: https://gerrit.wikimedia.org/r/#/c/427439/1/special/SpecialCentralNoticeBanners.php [18:15:28] ok, so I need to run a full scap pre-train? [18:16:07] I don't know if that's necessary? [18:16:15] it's a very low risk change, so perhaps just remember there is this change if something odd on CN after the train? [18:16:18] There's a post-train scap I assume? [18:16:42] I can just check it works after the train [18:17:39] !log imarlier@tin Started deploy [performance/coal@f1ca191]: Deploying coal version that includes a runner for service use [18:17:43] !log imarlier@tin Finished deploy [performance/coal@f1ca191]: Deploying coal version that includes a runner for service use (duration: 00m 04s) [18:17:43] there is a post-train scap, if there are l10n changes we should just do them with the train due to T191921 [18:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:49] T191921: mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921 [18:18:49] also, if it's already merged we can run a full scap ~now and ensure everything works. [18:19:55] Dereckson: unless you're still working on tin outside of SWAT? 
[18:21:48] it's merged, but if we do a full scap now [18:21:50] it won't be testable [18:21:57] as it's a meta-only change [18:22:04] and meta is still .29 [18:22:58] Dereckson: thcipriani: let's just test after the train [18:23:28] thcipriani: so the two options are: revert, train, revert revert at evening swat -or- deploy it and check after train it's okay [18:24:01] If the other change only goes out with the train, that's also fine [18:24:07] or now, whenever [18:24:19] that other change (the first one) doesn't need a full scap [18:25:48] !log imarlier@tin Started deploy [performance/coal@3c0ef36]: coal: typoed the run file [18:25:53] !log imarlier@tin Finished deploy [performance/coal@3c0ef36]: coal: typoed the run file (duration: 00m 04s) [18:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:08] imarlier> you know we're in the sanity break, no deploy, before MediaWiki train? [18:26:09] we could modify wikiversions.json on mwdebug1002 and then run scap wikiversions-compile there. 
Would be able to test on just that machine [18:26:51] thcipriani: that would be more careful, yes [18:27:43] dereckson@mwdebug1002:/srv/mediawiki$ scap wikiversions-compile [18:27:43] 18:27:40 wikiversions-compile must be run as user mwdeploy [18:27:43] 18:27:41 Compiled /srv/mediawiki/wikiversions.json to /srv/mediawiki/wikiversions.php [18:27:46] done [18:28:31] AndyRussG: okay, live on meta wmf30 on mwdebug1002 [18:29:14] AndyRussG: you'll probably get the old message but you can at least assert the special page isn't broken [18:29:26] Dereckson: hey, for the future, if you do not use your scheduled deployment window, please remove it in time so that others can use it [18:30:02] Dereckson: checking [18:32:07] Dereckson: on meta via mwdebug1002, I can confirm the page isn't broken [18:32:08] https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/B1718_0417_ennlNL_m_p1_lg_txt_skip [18:32:22] !log ppchelko@tin Started deploy [changeprop/deploy@d83fad3]: Support multi-topic rules, rename metrics, update dependencies [18:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:31] and indeed the old message is what shows up [18:33:36] !log ppchelko@tin Finished deploy [changeprop/deploy@d83fad3]: Support multi-topic rules, rename metrics, update dependencies (duration: 01m 14s) [18:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:45] mobrovac: yes, I agree. Last time I checked here was 15:50 UTC and dba was syncing a repool change, and afterwards I had a loss of connectivity.
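The single-host test above (point meta at wmf.30 in wikiversions.json on mwdebug1002, test, then restore) amounts to a temporary override of one entry in a wiki-to-version mapping. The sketch below assumes the file is a flat JSON object mapping a wiki database name to a "php-<version>" string; that layout, the `mediawikiwiki` key and the helper `override_wiki_version` are assumptions for illustration, and the scap compile step is deliberately not modelled.

```python
def override_wiki_version(versions, wiki, version):
    """Return (updated_mapping, previous_version) without mutating the input."""
    if wiki not in versions:
        raise KeyError("unknown wiki: %s" % wiki)
    updated = dict(versions)
    previous = updated[wiki]
    updated[wiki] = version
    return updated, previous


# Assumed layout: database name -> MediaWiki branch directory.
versions = {
    "metawiki": "php-1.31.0-wmf.29",
    "mediawikiwiki": "php-1.31.0-wmf.30",
}

# Point meta at the new branch for testing, remembering the old value.
testing, previous = override_wiki_version(versions, "metawiki", "php-1.31.0-wmf.30")

# Restore the original entry once testing is done.
restored, _ = override_wiki_version(testing, "metawiki", previous)
```

Keeping the previous value alongside the override makes the restore step explicit, which matters here because the change was only ever meant to live on one debug host.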
[18:34:07] AndyRussG: good [18:34:11] hehe kk [18:35:59] !log dereckson@tin Synchronized php-1.31.0-wmf.30/extensions/CentralNotice: Emit CSP headers on banner preview (duration: 01m 18s) [18:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:15] AndyRussG: ejegg|food: thcipriani: all is in order as far as CentralNotice is concerned [18:37:11] ok, so I still need to run a full scap before I roll the train forward to update l10n, correct? But we've now verified that everything works as intended except the message update? [18:37:16] thcipriani: I'm done on tin, and I restored wikiversions.json on meta [18:37:23] er on mwdebug1002 I mean [18:37:29] awesome, thank you! [18:37:37] yes, correct, for the l10n [18:37:44] ack [18:38:04] Jayprakash12345: sorry for the CentralNotice delay [18:39:13] Dereckson: fantastic, thanks!!!!! [18:39:19] :) [18:39:28] Dereckson: Now any tentative window or date for wiki's creations? [18:41:18] sorry, that's blocking on me responding to emails [18:41:42] MaxSem: maps-admins now have more sudo powers [18:41:49] (the one you +1ed) [18:41:57] wee [18:42:07] still can't rm / though :( [18:42:11] :p [18:42:22] hehe [18:42:42] Dereckson: I'll reply on email as well, but if you're still around.. 
post train (after tyler's done) looks like an OK time [18:42:45] cc Jayprakash12345 [18:42:48] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:42:54] greg-g: yes, it's fine to me [18:43:03] @seen pnorman [18:43:03] mutante: Last time I saw pnorman they were leaving the channel #wikimedia-analytics at 2/21/2018 8:55:37 PM (55d21h47m25s ago) [18:43:41] looking at the latency thing, it's somewhat expected [18:43:44] aww, he was the one who requested maps-admin change ^ [18:45:52] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4140393 (10ayounsi) Ports vlan configuration done. [18:45:57] (03PS4) 10Dzahn: admin: Create group with snapshot/dumps host access, add springle [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) (owner: 10ArielGlenn) [18:46:06] (03PS5) 10Dzahn: admin: Create group with snapshot/dumps host access, add springle [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) (owner: 10ArielGlenn) [18:46:20] (03CR) 10Dzahn: [C: 032] "approved in ops meeting, compiled and works: http://puppet-compiler.wmflabs.org/10970/" [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) (owner: 10ArielGlenn) [18:51:59] well, I guess I'll go ahead and get a jump on the train so I can get out of the way :) [18:53:14] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4140451 (10Dzahn) compiled: http://puppet-compiler.wmflabs.org/10970/ on bast1002: Notice: /Stage[main]/Admin/Admin::Hashuser[springle]/Admin::Us... 
[18:54:19] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4140454 (10Dzahn) For SSH config also see https://wikitech.wikimedia.org/wiki/Production_shell_access#Standard_config [18:54:57] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4140457 (10Dzahn) 05Open>03Resolved a:03Dzahn [18:55:18] thcipriani: don't fall off :) [18:55:43] (jumping on moving vehicles can be dangerouz....) [18:56:12] !log thcipriani@tin Started scap: rebuild l10n cache [18:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:51] AndyRussG: tell me about it :P [18:56:56] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4140460 (10Dzahn) @Lucas_Werkmeister_WMDE Thank you for the update! I will say the status "stalled" is still correct in that case until we now. But that was very helpful to... [18:57:59] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4140462 (10Dzahn) [19:00:04] thcipriani: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:03:43] (03PS7) 10Bstorm: wiki replicas: Add new MCR tables to views [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) [19:06:48] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:12:58] (03PS1) 10Chad: Gerrit: Disable auto-reindexing of changes [puppet] - 10https://gerrit.wikimedia.org/r/427471 [19:13:07] !log ppchelko@tin Started deploy [restbase/deploy@8d8f1df]: Test concurrent worker startups [19:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:45] (03CR) 10Chad: [C: 031] "Requires a service restart, but otherwise safe to land. Not urgent, but we definitely want it." [puppet] - 10https://gerrit.wikimedia.org/r/427471 (owner: 10Chad) [19:15:13] !log restart elasticsearch on elastic1018 with numa interleave [19:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:37] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:16:37] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:17:47] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:17:47] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve 
announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:17:48] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:18:47] notice how that's a WARN and not a CRIT but icinga-wm reports it , interesting [19:20:22] this is really weird ^ [19:20:29] bearND: mdholloway: ^ ? [19:20:37] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:20:37] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:21:02] mutante: and that's good, because it's not a critical :) [19:21:32] but why does Icinga say critical in this case? 
[19:21:51] mobrovac: it's good, i was trying to solve that people have these permissions [19:22:00] i just thought that icinga-wm was filtering out WARNS [19:22:25] that's also not bad, i just expected it to behave differently ..slightly [19:22:38] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:22:38] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:22:48] oh right, but mutante the host is critical, but the check is warn [19:22:50] haha [19:22:53] icinga fail [19:23:06] bearND: it's defined somewhere in the check_plugin script [19:23:17] or puppet [19:23:32] why did it just announce the same two hosts with same message? scb1003 + scb1004 [19:23:40] there might be a mismatch between the service-checker script and nagios/icinga [19:23:51] because they are separate issues from Icinga's point of view [19:24:03] it's not configured to know these should have a relationship [19:24:20] no, it's the same fail [19:24:29] service on host A != service on host B [19:25:14] if the entire host is down that would be CRIT more than just a service on it.. that seems to make sense? 
[19:25:29] mutante: it said it for the same hosts at least twice, see :20 and :23 [19:25:47] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) timed out before a response was received: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:25:57] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:26:25] ok, this one^ is different. [19:26:28] bearND: can you paste 2 identical lines? [19:26:35] not sure if we are talking about the same [19:26:45] mutante: "PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0" [19:27:07] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [19:27:47] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:27:47] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) is WARNING: Test Retrieve announcements responds with unexpected value at path /announce = Expected 1 array elements, gotten 0 [19:27:48] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [19:27:58] RECOVERY - restbase endpoints health on restbase1012 is OK: 
All endpoints are healthy [19:28:04] mutante: actually happened four times in last 12 minutes [19:28:24] (03CR) 10Paladox: [C: 031] Gerrit: Disable auto-reindexing of changes [puppet] - 10https://gerrit.wikimedia.org/r/427471 (owner: 10Chad) [19:28:29] !log ppchelko@tin Finished deploy [restbase/deploy@8d8f1df]: Test concurrent worker startups (duration: 15m 23s) [19:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:34] could you make a pastebin with those lines including the timestamps? [19:29:55] yes, it is normal that it would repeat ongoing issues at the check interval [19:30:07] but 4 times in 12 minutes is more than expected [19:31:23] thcipriani: I may have to be briefly AFK, if there are any issues with CN and I don't respond quickly, pls dive into #wikimedia-fundraising, is that okok? [19:31:40] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4140523 (10herron) Ok, now thinking about options for the rsync source/server side... I tend to agree that volatile isn't very well suited for archive... [19:31:43] AndyRussG: sure, thanks for the heads-up [19:32:58] likewise thanks thcipriani! :) [19:33:47] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4140528 (10Krinkle) [19:34:16] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#4140542 (10Krinkle) [19:37:05] mutante: https://phabricator.wikimedia.org/P7010 [19:40:47] (03CR) 10Bstorm: [C: 032] wiki replicas: Add new MCR tables to views [puppet] - 10https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446) (owner: 10Bstorm) [19:44:52] bearND: thanks! 
looking at logs in web ui [19:45:36] "View Alert History For This Service" [19:45:53] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=scb1001&service=mobileapps+endpoints+health [19:46:29] compared to scb1003 [19:46:30] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=scb1003&service=mobileapps+endpoints+health [19:48:56] it's repeating the (HARD) CRIT every 5 minutes per that. and that is more what i expected [19:49:04] the 5 min interval [19:49:31] thing is that we first get to WARN and then a little later to CRIT.. and the bot reports it all [19:50:17] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 50.61, 27.65, 21.48 [19:50:27] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 53.01, 27.42, 19.44 [19:50:58] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 49.02, 31.01, 24.83 [19:52:30] spike. is already going lower [19:55:09] !log thcipriani@tin Finished scap: rebuild l10n cache (duration: 58m 57s) [19:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:27] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 16.11, 24.67, 21.74 [19:57:17] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.96, 24.58, 23.10 [19:58:39] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4140587 (10thcipriani) p:05Triage>03High From scap today: `19:37:56 Finished l10n-update (duration: 39m 12s)` 30-40 minutes beta-scap-eqiad runs: h... [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T2000). 
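The question of how often the same alert was re-announced can be checked from the timestamps in this log. The sketch below takes the four scb1003 "mobileapps endpoints health" announcements seen above and computes the gaps between them; the helper name `gaps_in_seconds` is hypothetical, and the times are assumed to fall on the same day.

```python
from datetime import datetime


def gaps_in_seconds(stamps):
    """Return the gaps, in seconds, between consecutive HH:MM:SS stamps."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


# Bot announcements for scb1003, copied from the log above.
scb1003 = ["19:16:37", "19:20:37", "19:22:38", "19:27:47"]
gaps = gaps_in_seconds(scb1003)
total = sum(gaps)
```

The gaps come out uneven rather than a steady five minutes. That fits the observation that the bot reports both the earlier state changes and the later repeating hard CRITICAL, though this is only a reading of the channel output and not of Icinga's own history.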
[20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:07:17] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [20:07:52] !log mholloway-shell@tin Started deploy [mobileapps/deploy@9328a7d]: Update mobileapps to fb161d7 [20:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:07] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140624 (10mepps) @Reedy, @K4-713 is on sabbatical until early June. I'm currently serving as her delegate. [20:09:58] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 19.02, 22.23, 23.81 [20:10:05] !log restart elasticsearch on elastic1019 with numa interleave [20:10:06] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140625 (10Reedy) Victoria then? Unless there's someone else in between. 
Needs to go up the chain for approval [20:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:37] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:11:37] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [20:11:45] (03PS1) 10Andrew Bogott: labtestservices2003: mark as spare [puppet] - 10https://gerrit.wikimedia.org/r/427528 [20:12:07] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:12:08] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [20:12:37] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:12:52] (03CR) 10Andrew Bogott: [C: 032] labtestservices2003: mark as spare [puppet] - 10https://gerrit.wikimedia.org/r/427528 (owner: 10Andrew Bogott) [20:13:38] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [20:13:48] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@9328a7d]: Update mobileapps to fb161d7 (duration: 05m 56s) [20:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:56] (03PS2) 10Dzahn: icinga: add contactgroup for mobileapps to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/427417 (https://phabricator.wikimedia.org/T189524) [20:19:05] (03CR) 10Dzahn: [C: 032] icinga: add contactgroup for mobileapps to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/427417 (https://phabricator.wikimedia.org/T189524) (owner: 10Dzahn) [20:25:39] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#4140665 (10Etonkovidova) [20:26:48] (03CR) 10Dzahn: [C: 032] "unfortunately it seems like this wasn't enough to add it where we wanted it" [puppet] - 10https://gerrit.wikimedia.org/r/427417 (https://phabricator.wikimedia.org/T189524) (owner: 10Dzahn) [20:29:03] 10Operations, 
10Deployments, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4140674 (10mmodell) [20:30:02] 10Operations, 10Deployments, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128785 (10mmodell) [20:31:08] Krinkle: if you have a moment could you take a look at: https://gerrit.wikimedia.org/r/#/c/427478/ ? Would unblock train for me [20:31:43] 10Operations, 10Deployments, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4140683 (10mmodell) @fgiunchedi Can you upload the package? the tag is at rMSCA004f7635ff44 [20:32:56] (03CR) 10Dzahn: [C: 04-2] "server is too old" [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [20:33:03] 10Operations, 10Deployments, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4140686 (10mmodell) a:05mmodell>03None [20:33:18] 10Operations, 10Deployments, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128785 (10mmodell) p:05Triage>03Normal [20:33:42] thcipriani: Wil do in 1h after my mtng [20:33:53] ok, thank you! 
[20:34:28] 10Operations, 10Deployments, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128785 (10mmodell) [20:34:43] 10Operations, 10Deployments, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128785 (10mmodell) a:03fgiunchedi [20:35:11] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4140693 (10Dzahn) Tried to add missing contactgroup to mobileapps services with the change above (copying wdqs setup) but it see... [20:37:17] (03PS1) 1020after4: Bump scap version to 3.8.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/427535 (https://phabricator.wikimedia.org/T192124) [20:37:32] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140702 (10Dzahn) Hi Margaret, could you explain a little bit why you need that specific group and what you are planning to do. 
cc: @Nuria [20:37:56] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4140704 (10Dzahn) p:05Triage>03Normal [20:38:57] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4140706 (10mmodell) [20:39:08] (03PS1) 10Cmjohnson: Adding dns for db1116-1123 [dns] - 10https://gerrit.wikimedia.org/r/427536 (https://phabricator.wikimedia.org/T191792) [20:39:44] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4140710 (10Cmjohnson) [20:46:02] RECOVERY - Check systemd state on labtestweb2001 is OK: OK - running: The system is fully operational [20:52:08] !log restart elasticsearch on elastic1020 with numa interleave [20:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:12] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.89 seconds [21:00:04] Dereckson: I, the Bot under the Fountain, allow thee, The Deployer, to do Create new wikis deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:01:23] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4140768 (10herron) Indeed, after upgrading to `3.0.0~rc5-1~bpo9+1` mtail starts up happily. @fgiunchedi do you think it would be safe to pin the mtail package to stretch-backports for all stretch hosts, or shoul... 
[21:06:12] (03PS10) 10Dereckson: Initial configuration for lfnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400234 (https://phabricator.wikimedia.org/T183561) (owner: 10Urbanecm) [21:06:39] (03CR) 10Dereckson: [C: 032] Initial configuration for lfnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400234 (https://phabricator.wikimedia.org/T183561) (owner: 10Urbanecm) [21:08:05] (03Merged) 10jenkins-bot: Initial configuration for lfnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400234 (https://phabricator.wikimedia.org/T183561) (owner: 10Urbanecm) [21:08:54] (03PS11) 10Dereckson: Initial configuration for inhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: 10Urbanecm) [21:09:10] (03CR) 10Dereckson: [C: 032] Initial configuration for inhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: 10Urbanecm) [21:09:43] Dereckson: Hello, Feeling good to see you again [21:09:52] Hello :) [21:10:21] (03Merged) 10jenkins-bot: Initial configuration for inhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: 10Urbanecm) [21:10:35] (03PS6) 10Dereckson: Initial configuration for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416930 (https://phabricator.wikimedia.org/T189109) (owner: 10Urbanecm) [21:11:22] (03CR) 10Dereckson: [C: 032] Initial configuration for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416930 (https://phabricator.wikimedia.org/T189109) (owner: 10Urbanecm) [21:12:42] (03Merged) 10jenkins-bot: Initial configuration for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416930 (https://phabricator.wikimedia.org/T189109) (owner: 10Urbanecm) [21:13:15] (03PS6) 10Dereckson: Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm)
[21:13:29] (03CR) 10Dereckson: [C: 032] Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm) [21:14:49] (03Merged) 10jenkins-bot: Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm) [21:15:24] (03PS10) 10Dereckson: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [21:15:33] (03CR) 10Dereckson: [C: 032] Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [21:17:08] (03Merged) 10jenkins-bot: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [21:19:12] (03CR) 10Dereckson: "Merge conflict to solve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [21:19:51] Jayprakash12345: I see a comment on the last one [21:19:52] @Urbanecm Window for creating wiki has set. Please add all config like Project Namespace etc. And you will go "Hindi_Wikimedians_User_Group" as the site name. 
Thanks :) [21:20:51] Dereckson: Just create the Wiki, I will take care of all config later :) [21:21:15] Yes, that's a working solution: the basic configuration looks good, and we can add the missing parts just after [21:21:35] as I rebase, I can set the site name [21:22:06] 10Puppet, 10Beta-Cluster-Infrastructure: Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340 - https://phabricator.wikimedia.org/T192473#4140892 (10MarcoAurelio) [21:22:39] Dereckson: Then Set "Hindi_Wikimedians_User_Group" as site name [21:22:50] 10Puppet, 10Beta-Cluster-Infrastructure: Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340 - https://phabricator.wikimedia.org/T192473#4140086 (10MarcoAurelio) @hoo and @Rxy suggests this may be nutcracker refusing to stay up and running. Granted, w... [21:22:56] Other sites in Hindi use names in devanagari by the way [21:23:20] (well content projects) [21:25:23] mutante: around? problems with redis/nutcracker [21:25:38] 10Puppet, 10Beta-Cluster-Infrastructure: Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340 - https://phabricator.wikimedia.org/T192473#4140918 (10hoo) ``` hoo@deployment-jobrunner03:~$ sudo tail /var/log/nutcracker/nutcracker.log [2018-04-18 21:18:5... [21:25:46] (03PS5) 10Dereckson: Initial configuration for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [21:26:16] Hauskatze: i am here, but i don't know about nutcracker. is something broken? [21:26:18] (03CR) 10Dereckson: "PS5: rebased (MWMultiVersion). Added "Hindi Wikimedians User Group" as site name." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [21:26:19] Well There is many name in suggation in the line, This make lot of conflict. So For Now We all will go with Its offical name [21:26:30] * Dereckson nods [21:26:38] mutante: https://phabricator.wikimedia.org/T192473#4140918 [21:26:45] (03CR) 10Dereckson: [C: 032] Initial configuration for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [21:26:48] so the job queue of beta seems broken [21:26:54] seems -> is [21:27:06] We will change later :) [21:28:07] (03Merged) 10jenkins-bot: Initial configuration for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [21:29:25] (03PS1) 10Dereckson: Declare six new wikis to wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427547 (https://phabricator.wikimedia.org/T183561) [21:29:46] Hauskatze: the error you pasted is a Mediawiki error though, should that be tagged differently? [21:29:55] Hauskatze: i don't think i can help with that [21:30:07] Hauskatze: is it an UBN? 
[21:30:14] mutante: as far as I can see, it's beta cluster only [21:30:32] Dereckson: it's breaking the whole job queue, I think it's pretty important to have it fixed [21:30:32] (my current changes are no op to prod, but the next one will be op) [21:30:45] Dereckson: no prod, beta cluster [21:30:48] ok [21:30:52] I think you're fine [21:31:03] well "my" the Urbanecm's one [21:31:09] (03CR) 10jenkins-bot: Initial configuration for lfnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/400234 (https://phabricator.wikimedia.org/T183561) (owner: 10Urbanecm) [21:31:13] (03CR) 10jenkins-bot: Initial configuration for inhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) (owner: 10Urbanecm) [21:31:18] (03CR) 10jenkins-bot: Initial configuration for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416930 (https://phabricator.wikimedia.org/T189109) (owner: 10Urbanecm) [21:31:23] (03CR) 10jenkins-bot: Initial configuration for euwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419171 (owner: 10Urbanecm) [21:31:29] (03CR) 10jenkins-bot: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [21:31:34] (03CR) 10jenkins-bot: Initial configuration for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [21:31:38] * Hauskatze Dear Santa, I was a good kid this year and so I'd like to ask you to bring to me this year a dedicated team for Puppet maintenance@beta cluster. Thanks! 
~~~~ [21:32:02] (03CR) 10Dereckson: [C: 032] Declare six new wikis to wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427547 (https://phabricator.wikimedia.org/T183561) (owner: 10Dereckson) [21:33:15] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4140938 (10thcipriani) I managed to grab a better perf report by using `PHP='hhvm -vEval.PerfPidMap=true'`. Then I waited for beta-scap-eqiad to start... [21:33:30] (03Merged) 10jenkins-bot: Declare six new wikis to wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427547 (https://phabricator.wikimedia.org/T183561) (owner: 10Dereckson) [21:41:09] I wonder if https://phabricator.wikimedia.org/T187184 should be in English or Romanian [21:41:15] (ro-md user group wiki) [21:41:47] I'd say romanian [21:41:49] ah probably ro, yes [21:41:52] there is the sorting key [21:42:02] 'romdwikimedia' => 'uca-ro-u-kn', [21:43:09] fun fact: wgLanguageCode splits chapters and usergroups in two blocks [21:43:43] https://phabricator.wikimedia.org/rOMWC758b0a5f751649b781474c31d38d3b7fea887ce4 [21:43:46] my fault [21:44:13] 10Puppet, 10Beta-Cluster-Infrastructure: redis/nutcracker down on deployment-prep - https://phabricator.wikimedia.org/T192473#4140958 (10MarcoAurelio) [21:46:08] (03PS1) 10Dereckson: Set romd.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427550 (https://phabricator.wikimedia.org/T187184) [21:46:32] (03CR) 10Dereckson: [C: 032] Set romd.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427550 (https://phabricator.wikimedia.org/T187184) (owner: 10Dereckson) [21:48:12] (03Merged) 10jenkins-bot: Set romd.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427550 (https://phabricator.wikimedia.org/T187184) (owner: 10Dereckson) [21:50:13] Dereckson: It seems Urbanecm forget to set languge 'hi' for 
hiwikimedia [21:50:15] https://etherpad.wikimedia.org/p/addwiki <- the wiki creation commands [21:50:21] Jayprakash12345: ok, on it [21:51:20] 10Puppet, 10Beta-Cluster-Infrastructure: redis/nutcracker down on deployment-prep - https://phabricator.wikimedia.org/T192473#4140979 (10MarcoAurelio) [21:51:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.127 second response time [21:52:56] (03PS1) 10Dzahn: releases: add directory for parsoid archive [puppet] - 10https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [21:53:38] !log restart elasticsearch on elastic1022 with numa interleave [21:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:40] (03PS1) 10Dereckson: Set hi.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427553 (https://phabricator.wikimedia.org/T188366) [21:55:58] moritzm: yt? [21:56:17] (03CR) 10Jayprakash12345: [C: 031] Set hi.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427553 (https://phabricator.wikimedia.org/T188366) (owner: 10Dereckson) [21:56:34] okay, let's commit it too [21:56:38] (03CR) 10Dereckson: [C: 032] Set hi.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427553 (https://phabricator.wikimedia.org/T188366) (owner: 10Dereckson) [21:57:57] (03Merged) 10jenkins-bot: Set hi.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427553 (https://phabricator.wikimedia.org/T188366) (owner: 10Dereckson) [21:58:18] eddiegp: he's almost certainly asleep [21:59:16] * eddiegp hardly ever remembers in which timezones people live [21:59:29] I'll file a task then. [22:00:37] Reedy: you didn't fix MassMessage? 
[22:00:52] I thought I've seen a change sooner [22:00:58] I didn't test it [22:00:59] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/1a440989da4da7019ad11d566d52269be05848da [22:01:11] It's in .30 [22:01:14] Not .29 [22:01:25] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/8c79db8d56b84fb635877a7be01428b3df9f6e21 [22:01:29] ack'ed [22:01:51] The point of breakage is annoying, but not the end of the world [22:02:23] I would've thought you would've used .30 ;) [22:02:32] yes but the train is still at 29 [22:03:27] They're new wikis [22:03:36] Shouldn't matter if you put it on the newest [22:03:41] Not like it's only group0 now [22:04:05] by the way [22:04:10] dereckson@mwdebug1002:~$ sudo -u mwdeploy scap wikiversions-compile [22:04:13] 22:03:18 Compiled /srv/mediawiki/wikiversions.json to /srv/mediawiki/wikiversions.php [22:04:16] dereckson@mwdebug1002:~$ mwscript eval.php lfnwiki [22:04:19] > MediaWiki\MassMessage\DatabaseLookup::getDBName( '' ); [22:04:21] it seems ok [22:04:32] Yeah, it was syntactically correct [22:04:39] But i didn't actually test it beyond that [22:05:15] K.renair did a bit of humour about this message one day [22:05:35] apparently some former addwiki breakages were there too [22:08:01] There's been many [22:08:07] Including having to remove stuff from ES [22:08:12] Commenting out half the script [22:08:30] I've created a lot of wikis before now ;) [22:09:13] (03PS1) 10Dzahn: admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) [22:09:30] (03PS2) 10Dzahn: admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) [22:11:01] (03CR) 10jenkins-bot: Declare six new wikis to wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427547 (https://phabricator.wikimedia.org/T183561) (owner: 10Dereckson) [22:11:07] (03CR) 
10jenkins-bot: Set romd.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427550 (https://phabricator.wikimedia.org/T187184) (owner: 10Dereckson) [22:11:11] (03CR) 10jenkins-bot: Set hi.wikimedia.org language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427553 (https://phabricator.wikimedia.org/T188366) (owner: 10Dereckson) [22:11:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1950 bytes in 0.106 second response time [22:12:04] (03PS3) 10Dzahn: admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) [22:12:25] https://lfn.wikipedia.org/wiki/Paje_Xef looks good on mwdebug1002 [22:12:26] (03PS4) 10Dzahn: admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) [22:12:29] (03CR) 10jerkins-bot: [V: 04-1] admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [22:12:53] (03CR) 10jerkins-bot: [V: 04-1] admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [22:13:22] Dereckson: i am getting (Cannot access the database: Unknown database 'inhwiki' (10.64.16.39)) [22:13:39] at mwdebug1002 [22:14:13] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 856.97 seconds [22:14:22] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 866.26 seconds [22:14:41] Jayprakash12345: I created lfn [22:14:51] not yet inh [22:15:00] Dereckson: Now looks good sorry [22:16:59] !log Created database for lfn.wikipedia.org (T183561) [22:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:06] T183561: Create 
Wikipedia Lingua Franca Nova 2 - https://phabricator.wikimedia.org/T183561 [22:17:49] (03PS5) 10Dzahn: admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) [22:18:09] We're waiting https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/42779/ for the addwiki fix backported to wmf29. [22:18:13] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:18:33] (and indeed, create them as wmf30 wouldn't have been dramatic either) [22:20:22] 1000$ question: why I'm backporting this change? [22:20:51] it's not we're going to create other wikis in wmf29 in the future [22:21:11] a manual edit would have worked as fine [22:24:24] (03CR) 10Dzahn: [C: 032] admin: create new group releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427556 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [22:24:57] !log dereckson@tin Synchronized php-1.31.0-wmf.29/extensions/WikimediaMaintenance/addWiki.php: Fix MassMessage fatal error (T192468) (duration: 01m 17s) [22:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:03] T192468: MassMessage fatal in addWiki.php - https://phabricator.wikimedia.org/T192468 [22:27:22] !log Created database and set initial stuff for inh.wikipedia.org (T184374) [22:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:28] T184374: Create Wikipedia Ingush - https://phabricator.wikimedia.org/T184374 [22:28:09] !log Created database and set initial stuff for gor.wikipedia.org (T189109) [22:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:15] T189109: Create Wikipedia Gorontalo - https://phabricator.wikimedia.org/T189109 [22:28:50] https://gor.wikipedia.org/wiki/Special:Log this one is more populated [22:29:33] (03PS1) 10Dzahn: releases: add parsoid-releasers admin group, add subbu 
[puppet] - 10https://gerrit.wikimedia.org/r/427558 (https://phabricator.wikimedia.org/T150672) [22:31:01] (03PS1) 10Dzahn: webserver_misc_static: remove releasers admin roles [puppet] - 10https://gerrit.wikimedia.org/r/427559 [22:31:03] still not up Dereckson or just on mwdebug? [22:31:05] !log Created database and set initial stuff for eu.wikisource.org (T189465) [22:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:11] T189465: Create Wikisource Basque - https://phabricator.wikimedia.org/T189465 [22:31:21] Hauskatze: all four projects wiki are up on mwdebug1002 [22:31:31] there is an issue for eu.wikisource logos it seems [22:31:35] ah, 'cause I'm redirected at Incubator on production [22:31:45] makes sense if not scap-ed yet [22:32:07] add ?a=1 [22:32:13] or anything to the URL [22:32:16] your browser caches redirects [22:32:32] (and of course, use mwdebug1002) [22:32:52] so eu.wikisource logo fix [22:32:54] Dereckson Reedy fr-tech ejegg|afk hey.... sorry to be the bearer of bad news, and sorry to bother u Dereckson when I see you've been doing a huge amount of work today.... :) It turns out the CN patches didn't go out to with the train... I guess maybe no one remembered to create a patch for the core release branch.... [22:33:11] AndyRussG: wikis are still 29 [22:33:16] Oh [22:33:18] oops [22:33:28] so code is live (we synced CN before train, remember?) [22:33:31] but not yet on metza [22:33:33] meta. 
[22:33:37] (03CR) 10Dzahn: [C: 032] "ssastry is already deployer and parsoid releaser who can upload to releases1001, just that he is doing that via reprepro and now we are ad" [puppet] - 10https://gerrit.wikimedia.org/r/427558 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [22:33:51] Dereckson: meta usually goes to the next version today [22:34:14] it's group 1 [22:34:21] and group1 is still at 29 [22:34:55] apparently, train didn't happen (perhaps a blocker, I didn't check the blocker task) [22:34:55] Ah okok I see... so something happened with the train i guess [22:35:05] K gotcha :) thanks and apologies again! [22:35:09] you're welcome [22:35:14] ;) [22:35:32] config says /static/images/project-logos/euwikisource.png [22:35:56] but this file doesn't exist [22:36:16] no logo file in mediawiki-config/static/images/project-logos [22:36:36] I'm checking the task to see if there is one logo. If not, I'll put the default wikisource one. [22:37:01] Project logo: https://commons.wikimedia.org/wiki/File:Wikisource-logo-EU.svg [22:37:06] there is one [22:37:40] project:operations/mediawiki-config status:open logo gives nothing [22:37:56] (as a Gerrit search query) [22:38:08] So I guess I'm going to prepare the logos. [22:38:39] (03CR) 10Dzahn: [C: 032] webserver_misc_static: remove releasers admin roles [puppet] - 10https://gerrit.wikimedia.org/r/427559 (owner: 10Dzahn) [22:39:22] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.86 seconds [22:40:36] https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Wikisource-logo-EU.svg/123px-Wikisource-logo-EU.svg.png [22:40:48] yes I've them three [22:40:58] 123, 187, 241 [22:41:21] good night [22:41:25] 'night Hauskatze [22:41:28] g'night [22:41:43] merci Dereckson, a bientôt [22:41:55] new wiki is good. no spam bot. no vandalism. then no contents. 
[22:41:58] Hauskatze: Good Morning from India [22:41:59] :) [22:42:36] Jayprakash12345: good morning to you :) [22:42:57] rxy: yeah, well, until the bots notice :( [22:43:00] * Hauskatze out [22:45:58] rxy: Do you manage Countervandalism Network? [22:46:07] yes I am [22:47:23] rxy: Someone report me that I am looks like https://lh3.googleusercontent.com/-XAFz45soE-0/WrfFLn7Ar3I/AAAAAAAAC9U/P5oYaUna37saI_DAXkqDkTcwY7qzgTC8ACL0BGAYYCw/h50/das.JPG [22:47:51] Due to Edit War [22:48:04] So How I can remove that? [22:48:31] which wiki? [22:48:48] hiwiki [22:49:04] optipng done, commiting [22:49:35] (we still use optipng -o7 to optimize logos, right?) [22:49:49] yes -o7 [22:49:55] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4141140 (10Krinkle) [22:51:04] rxy: Oh, Sorry. I see now that it seem someone already remove. Now I am Look like 'Unlisted'. [22:51:08] (03PS1) 10Dereckson: Logos for Euskara Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427560 (https://phabricator.wikimedia.org/T189465) [22:51:23] haha, but I added to whitelist for you :) [22:51:52] rxy: Thanks [22:51:57] :))) [22:52:08] (03CR) 10Dereckson: [C: 032] Logos for Euskara Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427560 (https://phabricator.wikimedia.org/T189465) (owner: 10Dereckson) [22:53:39] (03Merged) 10jenkins-bot: Logos for Euskara Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427560 (https://phabricator.wikimedia.org/T189465) (owner: 10Dereckson) [22:53:50] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4141158 (10Krinkle) [22:57:18] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): nutcracker fails to start due to lack of 
/var/run/nutcracker (ex: deployment-videoscaler01 has memcached failures) - https://phabricator.wikimedia.org/T178457#3692957 (10EddieGP) Saw the same on deployment... [22:57:59] !log Created database and set initial stuff for romd.wikimedia.org [22:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:37] !log Created database and set initial stuff for hi.wikimedia.org (T188366) [22:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:43] T188366: Create Hindi Wikimedian User Group Site - https://phabricator.wikimedia.org/T188366 [22:58:53] (03CR) 10jenkins-bot: Logos for Euskara Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427560 (https://phabricator.wikimedia.org/T189465) (owner: 10Dereckson) [22:58:53] !log dereckson@tin Synchronized static/images/project-logos/: Logos for eu.wikisource (T189465) (duration: 01m 12s) [22:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:59] T189465: Create Wikisource Basque - https://phabricator.wikimedia.org/T189465 [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180418T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:21] 10Puppet, 10Beta-Cluster-Infrastructure: redis/nutcracker down on deployment-prep - https://phabricator.wikimedia.org/T192473#4140086 (10EddieGP) nutcracker now starts on deployment-jobrunner03, see my comment on T178457. The error in the task description still happens though. 
[23:00:44] !log Starting syncing to production sequence for six wiki creation [23:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:09] !log dereckson@tin Synchronized dblists: (no justification provided) (duration: 01m 15s) [23:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:04] !log dereckson@tin rebuilt and synchronized wikiversions files: (no justification provided) [23:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:24] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for six wikis (duration: 01m 16s) [23:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:43] !log dereckson@tin Synchronized langlist: New languages: gor, inh, lfn (duration: 01m 17s) [23:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:05] !log dereckson@tin Synchronized multiversion/MWMultiVersion.php: +hi.wikimedia.org +romd.wikimedia.org (duration: 01m 15s) [23:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:00] !log HTCP purge for eu.wikisource logos [23:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:17] Dereckson: You need to sync multiversion [23:13:24] Reedy: 23:09:50 <+logmsgbot> !log dereckson@tin Synchronized multiversion/MWMultiVersion.php: +hi.wikimedia.org +romd.wikimedia.org (duration: 01m 15s) [23:13:29] this one? [23:13:33] or there is also another? [23:13:33] Doesn't seem to be working [23:13:41] hiwikimedia is redirecting to hiwiki [23:13:58] Or, there's a conflicting apache redirect [23:13:59] location: https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0 [23:14:15] romd is ok [23:14:28] What's up with the train btw? 
[23:14:40] I got a successful hit on hi.wikimedia.org at one stage [23:14:52] it's the one I tested after syncing multiversion [23:14:53] hoo: ? [23:15:05] Original exception: [WtfRcgpAAC4AAFdW3EQAAAAL] 2018-04-18 23:14:58: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" [23:15:13] Reedy: Will group1 get wmf30 today? [23:15:19] Today being Thursday? :P [23:15:30] still 23:15 in UTC [23:15:47] Today being in the next hours, let's not bike shed about timezones :P [23:16:27] Hmm. I'm confused [23:16:32] thcipriani: You scapped... But no group1 swap? [23:16:52] it was a full scap before train deployment for a l10n change [23:16:56] ^ [23:17:00] there were train blockers [23:17:04] Ahh [23:17:10] https://hi.wikimedia.org/?aaaa <- internal error, https://hi.wikimedia.org/ <-- redirect to wikipedia [23:17:11] hiwikimedia database exists, and has tables [23:17:25] What's the db error? [23:17:46] [{exception_id}] {exception_url} Wikimedia\Rdbms\DBQueryError from line 1453 of /srv/mediawiki/php-1.31.0-wmf.29/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema upd [23:18:02] Error: 1146 Table 'hiwikimedia.revtag' doesn't exist (10.64.32.136) [23:18:08] better [23:18:22] so it's the translate extension [23:18:24] Yeah [23:18:32] Enable that in a separate patch :P [23:18:33] let's create this table [23:19:14] !log Create tables for Translate extension on hiwikimedia [23:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:25] There we go [23:19:27] and now it works [23:19:40] so when I see the logo it's not the one I've seen grmbl [23:19:47] https://hi.wikimedia.org still redirects to hiwiki [23:19:49] oh okay was romd. I checked [23:19:53] With stuff after it doesn't [23:20:06] Checking apache config [23:20:13] There's nothing in redirects.dat [23:20:18] Might just be cached crap [23:20:22] idid ? 
action=purge [23:20:35] https://hi.wikimedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0 [23:20:39] is ok [23:20:41] Yeah, it's ok now [23:20:47] Must've been something stuck in cache [23:20:49] :) [23:20:50] browsers cache redirect [23:21:13] but we can do a purge [23:21:30] is browser cache? i think it is varnish cache... [23:21:40] both cache stuff [23:21:48] the output from curl -I wasn't browser [23:21:48] universal conspiracy [23:21:57] so that one was varnish [23:21:57] After rxy said they purged... it fixed it for me [23:22:08] :) [23:22:24] Number of attached accounts: 864 [23:22:34] !log HTCP purge for https://hi.wikimedia.org and https://hi.wikimedia.org/ [23:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:19] good I've now a 301 to https://hi.wikimedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0 [23:23:24] So it works [23:23:55] Yup. Carry on :P [23:24:06] ah, it's a fishbowl [23:24:14] so let's create and promote [23:24:19] hoo: you updated the meta page? 
[23:24:37] Dereckson: yes [23:24:41] thanks I can do interwiki so [23:24:43] w/o typos, I hope [23:24:50] typos are the best [23:26:28] + '__global:wmhi' => '1 https://hi.wikimedia.org/wiki/$1', [23:26:30] + '__global:wmromd' => '1 https://romd.wikimedia.org/wiki/$1', [23:26:33] no typo apparently [23:27:51] (03PS1) 10Dereckson: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427565 [23:28:20] (03CR) 10Dereckson: [C: 032] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427565 (owner: 10Dereckson) [23:29:53] (03Merged) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427565 (owner: 10Dereckson) [23:30:10] (03CR) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427565 (owner: 10Dereckson) [23:32:52] new interwiki map @ mwdebug1002 [23:33:05] Just run scap update-interwiki [23:33:09] It's much quicker [23:33:17] Except it's broken somewhere at the end, so you need to sync it manually [23:34:16] * Dereckson notes for the next time [23:34:17] Well defined processes :) [23:34:59] I did file a task for it erroring out [23:35:02] But.. 
it does the update [23:35:08] asks for your https git username/password [23:35:13] pushes and CR+2's it [23:35:16] !log dereckson@tin Synchronized wmf-config/interwiki.php: New interwiki map for the six newest wikis (duration: 01m 17s) [23:35:19] asks you to tell when it's ready [23:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:24] then deploys it [23:35:34] with the commit I see @ https://gerrit.wikimedia.org/r/#/c/399117/3/scap/plugins/updateinterwikicache.py [23:36:03] Fun the autocommit part [23:36:44] so now create and promote for hiwikimedia [23:38:30] Jayprakash12345: mails aren't mandatory, we can use Special:Emailuser/ to send the temporary password [23:38:45] but that helps :) [23:41:23] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.02 seconds [23:44:13] !log Created bureaucrat account for Suyash.dwivedi at hi.wikimedia (T188366) [23:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:20] T188366: Create Hindi Wikimedian User Group Site - https://phabricator.wikimedia.org/T188366 [23:45:34] (03PS2) 10Dzahn: releases: add directory for parsoid archive [puppet] - 10https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [23:46:27] Password sent by mail [23:47:36] AndyRussG: is CSP header an emergency? [23:47:47] if so we can backport it to wmf29 [23:48:58] Jayprakash12345: you wanted some namespaces, they are documented somewhere? [23:49:33] Bureaucrats can able to Add or Remove translationadmin [23:49:48] * Dereckson checks this one too [23:50:19] Dereckson: It loooks logo problem at hi.wikimedia.org? or my browser cache? 
[23:51:07] translateadmin rights missing from config [23:51:23] On the Task, Only need to define Project Namespace [23:51:43] (03PS3) 10Dzahn: releases: add directory for parsoid archive [puppet] - 10https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) [23:51:45] What happens with the logo? [23:52:16] I've an orange "Hindi Wikimedians user group" [23:52:24] (03CR) 10jerkins-bot: [V: 04-1] releases: add directory for parsoid archive [puppet] - 10https://gerrit.wikimedia.org/r/427551 (https://phabricator.wikimedia.org/T150672) (owner: 10Dzahn) [23:52:30] 1.5x [23:52:41] ah yes at 1x it doesn't show [23:52:42] let's purge [23:52:46] Dereckson: you doing ok? [23:53:17] greg-g: the six wikis have been created, all is in working state except for a logo [23:55:29] awesome [23:55:31] >>> $wgLogo [23:55:31] => "/static/imgages/project-logos/hiwikimedia.png" [23:55:35] ah there is a typo [23:56:28] Jayprakash12345: so 1.5x and 2x logos work (I zoomed my left screen a little bit, so got the 1.5x), 1x is available but path to the file is wrong, fixing that [23:57:18] (03PS1) 10Dereckson: Fix path to hi.wikimedia.org 1x logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427567 [23:57:44] (03CR) 10Dereckson: [C: 032] Fix path to hi.wikimedia.org 1x logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427567 (owner: 10Dereckson) [23:57:47] Dereckson: Also add Project namespace [23:57:54] See task [23:58:24] Jayprakash12345: another thing, when are you going to use translatewiki, now or later? [23:58:41] later [23:59:01] okay, I'll note on the task a further configuration change is needed so [23:59:04] 10Puppet, 10Beta-Cluster-Infrastructure, 10Product-Analytics: redis/nutcracker down on deployment-prep - https://phabricator.wikimedia.org/T192473#4141453 (10EddieGP) - @aaron merged 8ad186728 yesterday at 23:51 UTC. - [[https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/10836/consol... 
[23:59:05] (03Merged) 10jenkins-bot: Fix path to hi.wikimedia.org 1x logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427567 (owner: 10Dereckson) [23:59:12] AaronSchulz: ^ [23:59:21] (03CR) 10jenkins-bot: Fix path to hi.wikimedia.org 1x logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427567 (owner: 10Dereckson) [23:59:38] Again, no clue what timezone he's in ...