[00:39:40] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused
[01:26:01] PROBLEM - very high load average likely xfs on ms-be1008 is CRITICAL: CRITICAL - load average: 100.99, 100.24, 99.72
[01:47:22] Operations, Labs, Labs-Infrastructure, Patch-For-Review, Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3338989 (Andrew) I (finally) wrote a script to hunt and kill leaned dns records: https://gerrit.wikimedia.org/r/#/c...
[02:01:27] Operations, ops-eqiad, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3338992 (Andrew) > not sure if h/w raid is needed Yes please! Most of the existing labvirts have two spinny drives which are paired in a raid 1...
[02:14:50] !log l10nupdate@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details)
[02:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:16] That's not fun
[03:36:54] sigh...
who broke l10n now
[03:42:57] [WT4NSwpAADgAABvhF2sAAAAT] /wiki/File:Agave_guadalajarana_1.jpg?uselang=%E2%A7%BCLang%E2%A7%BD Exception from line 156 of /srv/mediawiki/php-1.30.0-wmf.4/includes/libs/objectcache/MemcachedBagOStuff.php: Key contains invalid characters: commonswiki:pcache:idhash:1107922-0!userlang=⧼lang⧽
[03:43:44] Well, that's mostly co-incidence
[03:43:46] https://commons.wikimedia.org/wiki/File:Agave_guadalajarana_1.jpg?uselang=%E2%A7%BCLang%E2%A7%BD
[03:43:56] Passing in that for a language won't work
[04:11:50] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5989.70 Read Requests/Sec=5561.70 Write Requests/Sec=14.80 KBytes Read/Sec=23317.20 KBytes_Written/Sec=140.80
[04:17:50] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=106.70 Read Requests/Sec=78.60 Write Requests/Sec=18.20 KBytes Read/Sec=1272.40 KBytes_Written/Sec=5270.80
[04:20:07] Reedy: we had a long talk about that error in -core the other day. there is a bug, but the current errors all seem to be bing's crawler
[04:21:05] I think b.rion had an idea about the proper fix for the bug in the cache layer that makes it blow up
[04:23:19] block it? ;D
[04:31:03] people use bing?
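Editor's sketch for the exception logged at 03:42:57 above: the raw `uselang` value (`⧼Lang⧽`) ended up inside a cache key and the key was rejected. One hedged way to picture a guard is to validate the language code before it reaches the key builder. The function names, the fallback to `en`, and the language-code pattern below are all assumptions made for illustration; this is not MediaWiki's code and not the fix discussed in -core.

```python
import re

# Assumption: acceptable codes look like "en", "zh-classical" or "sr-ec".
# This pattern is illustrative only, not MediaWiki's real validation rule.
LANG_CODE_RE = re.compile(r"[a-z]{2,3}(-[a-z0-9]+)*")


def sanitize_lang(raw, default="en"):
    """Return a normalised language code, or the default if raw is not one."""
    code = raw.strip().lower()
    return code if LANG_CODE_RE.fullmatch(code) else default


def parser_cache_key(wiki, page_id, raw_lang):
    """Build a key shaped like the one in the log, using a validated code."""
    lang = sanitize_lang(raw_lang)
    return f"{wiki}:pcache:idhash:{page_id}-0!userlang={lang}"


# The value sent by the crawler falls back to the default language.
print(parser_cache_key("commonswiki", 1107922, "⧼Lang⧽"))
```

With a guard of this kind the odd value never reaches the key, so the cache backend has nothing to reject. Whether falling back silently is the right behaviour is a separate design question.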
[05:06:00] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:06:50] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.229 second response time
[05:13:59] Operations, ops-codfw, DBA: db2070: Disk about to fail - https://phabricator.wikimedia.org/T167623#3339052 (Marostegui)
[05:14:10] Operations, ops-codfw, DBA: db2070: Disk about to fail - https://phabricator.wikimedia.org/T167623#3339066 (Marostegui) p:Triage>Normal
[05:34:04] Operations, DBA, Wikimedia-Site-requests: Renaming Neoalpha: supervision needed - https://phabricator.wikimedia.org/T167597#3339069 (Marostegui) p:Triage>Normal Hi Please ping me or @jcrespo on IRC before doing this so we can monitor the DBs. Also please make sure this is not happening at t...
[05:38:18] !log Deploy alter table s4 - labsdb1003 - T166206
[05:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:29] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[05:44:13] I'll be doing rename at 8am UTC or later
[06:38:28] (PS2) Muehlenhoff: Tighten access to zookeeper [puppet] - https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815)
[06:46:00] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused
[06:46:20] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.0.16:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.0.16, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://10.64.0.
[06:53:14] !log upgrade remaining app servers running HHVM 3.18 to 3.18.2+wmf5
[06:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:51] (PS1) Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/358313 (https://phabricator.wikimedia.org/T166206)
[07:17:27] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/358313 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[07:18:48] (Merged) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/358313 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[07:18:57] (CR) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/358313 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[07:19:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 - T166206 (duration: 00m 41s)
[07:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:57] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[07:21:09] !log Deploy alter table s4 - db1064 - https://phabricator.wikimedia.org/T166206
[07:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:40] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.075 second response time
[07:22:57] !log ran restart-pdfrender on scb1002 (OOM errors in the dmesg from hours ago)
[07:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:00] !log ran restart-pdfrender on scb1001 (OOM errors in the dmesg from hours ago)
[07:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:25] (PS1) Marostegui: db-eqiad.php: Depool db1060 [mediawiki-config] -
https://gerrit.wikimedia.org/r/358315 (https://phabricator.wikimedia.org/T166205)
[07:27:00] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.076 second response time
[07:28:29] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/358315 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[07:29:42] (Merged) jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/358315 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[07:29:55] (CR) jenkins-bot: db-eqiad.php: Depool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/358315 (https://phabricator.wikimedia.org/T166205) (owner: Marostegui)
[07:31:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 - T166205 (duration: 00m 41s)
[07:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:18] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[07:31:57] !log Deploy alter table s2 - db1060 - T166205
[07:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:00] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:33:00] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:33:10] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:33:18] Operations, Electron-PDFs, Services, Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (elukey) Happened again today afaics on scb100[12], resolved restarting pdfrender on both.
[07:33:50] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:33:50] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[07:34:00] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[07:38:14] !log Reboot https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=ms-be1008 as xfs is failing
[07:38:22] awesome copy paste
[07:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:20] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[07:40:29] !log restarted citoid on scb1001 (kept failing health checks for Error: write EPIPE)
[07:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:55] marostegui: there might be a way to fix the SAL, don't know how though
[07:41:06] it looks amazing btw
[07:41:06] :D
[07:41:30] i am editing it :|
[07:41:46] howwwww
[07:42:13] you could make my day, I am responsible for too many typos in the sal
[07:44:17] for example editing the page Luca! Why you don't drink coffee in the morning before talking or restarting services
[07:44:20] ?
[07:44:44] (I thought that there was a more complicated process for some reason)
[07:54:01] RECOVERY - very high load average likely xfs on ms-be1008 is OK: OK - load average: 3.11, 1.19, 0.46
[07:54:34] (PS1) Elukey: hhvm: force rsyslog config to create log files with www-data perms [puppet] - https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464)
[07:56:42] Operations, ops-eqiad, Labs, Patch-For-Review: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#3339277 (Muehlenhoff)
[07:56:45] Operations, Labs, Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3339275 (Muehlenhoff) stalled>Resolved We can simply close the ticket.
[08:04:09] (PS2) Elukey: hhvm: force rsyslog config to create log files with www-data perms [puppet] - https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464)
[08:15:49] Operations, Traffic, Wikimedia-Apache-configuration, Mobile, Patch-For-Review: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3339304 (Marostegui) p:Triage>Normal
[08:16:56] Operations, ops-eqiad, DBA, Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3339308 (Marostegui) p:Triage>Normal
[08:22:08] !log powercycle scb2005 (console frozen, host unresponsive)
[08:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:42] !log reboot ms-be1002, load avg slowly creeping up
[08:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:01] godog: I restarted ms-be1008 earlier :(
[08:34:10] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures.
Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg]
[08:36:10] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:37:39] jan_drewniak: yt?
[08:37:57] wrong channel!
[08:38:17] marostegui: heh, the good news is that this is old hw that is being decom
[08:40:18] phuedx: hey hey, yeah I'm only on my second coffee (moving our meeting up is fine)
[08:41:19] Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), and 2 others: Limit maximum x-content-dimension size to avoid hitting nginx limits - https://phabricator.wikimedia.org/T167034#3339389 (fgiunchedi)
[08:43:35] Operations, ops-eqiad, DBA, Patch-For-Review: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935#3339407 (Marostegui) As per my chat with @Cmjohnson on Friday this will be done today (Monday)
[08:44:15] (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358319 (https://phabricator.wikimedia.org/T166935)
[08:45:01] Operations, Wikimedia-Logstash, Discovery-Search (Current work), Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3339411 (Gehel) a:Gehel>debt This is deployed and can be closed, assigning to @debt.
[08:46:34] (CR) Marostegui: [C: -2] "Wait until it is closer to the time" [mediawiki-config] - https://gerrit.wikimedia.org/r/358319 (https://phabricator.wikimedia.org/T166935) (owner: Marostegui)
[08:50:53] PROBLEM - DPKG on mw1220 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:51:53] RECOVERY - DPKG on mw1220 is OK: All packages OK
[08:57:11] Operations, Services: scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3339439 (elukey)
[08:58:09] volans: opened --^ for scb2005
[09:00:03] elukey: which OS?
[09:01:31] (CR) Alexandros Kosiaris: [C: -2] "This class is meant to be used internally by the icinga class that is only to be used on icinga hosts (the hosts that actually do the moni" [puppet] - https://gerrit.wikimedia.org/r/358240 (https://phabricator.wikimedia.org/T167602) (owner: Halfak)
[09:03:36] elukey: check udev persistent rules too
[09:12:55] !log Drop table updates on dewiki and wikidatawiki (s5) - T139342
[09:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:04] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342
[09:13:46] !log upgrading mw1236-mw1249 to HHVM 3.18
[09:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:20] (PS1) Marostegui: s5.hosts: Add new labs infra [software] - https://gerrit.wikimedia.org/r/358324 (https://phabricator.wikimedia.org/T153743)
[09:19:22] (PS2) Marostegui: s5.hosts: Add new labs infra hosts [software] - https://gerrit.wikimedia.org/r/358324 (https://phabricator.wikimedia.org/T153743)
[09:19:36] Operations, Wikimedia-IRC-RC-Server: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3339555 (akosiaris)
[09:19:45] Operations, Wikimedia-IRC-RC-Server: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3339568 (akosiaris) p:Triage>Normal
[09:20:54] (CR) Marostegui: [C: 2] s5.hosts: Add new labs infra hosts [software] - https://gerrit.wikimedia.org/r/358324 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[09:23:15] (Merged) jenkins-bot: s5.hosts: Add new labs infra hosts [software] - https://gerrit.wikimedia.org/r/358324 (https://phabricator.wikimedia.org/T153743) (owner: Marostegui)
[09:25:27] !log swift eqiad-prod finish decom ms-be1005/6/7 - T166489
[09:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:35] T166489: Decommission
ms-be1001 - ms-be1012 - https://phabricator.wikimedia.org/T166489
[09:36:15] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:37:05] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.400 second response time
[09:38:20] Operations, Goal, Kubernetes: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3339642 (akosiaris)
[09:38:46] Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3339644 (akosiaris)
[09:38:48] Operations, Goal, Kubernetes: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3150713 (akosiaris)
[09:39:06] Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3290332 (akosiaris)
[09:39:53] Operations, Goal, Kubernetes: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3150713 (akosiaris)
[09:39:55] Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3290332 (akosiaris) Open>Resolved Hosts are up and running, taking over service implementation in T162045
[09:40:07] Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3339653 (akosiaris)
[09:53:26] Operations, Services: scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3339736 (elukey) ``` root@scb2005:~# lspci -v | egrep 'Device\ Serial\ Number|Broadcom' 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe Capabilit...
[09:55:39] !log upgrading mw1221-mw1235 to HHVM 3.18
[09:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:11] (CR) Ema: hhvm: force rsyslog config to create log files with www-data perms (1 comment) [puppet] - https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) (owner: Elukey)
[10:03:55] (PS3) Elukey: hhvm: force rsyslog config to create log files with www-data perms [puppet] - https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464)
[10:04:00] (CR) Elukey: hhvm: force rsyslog config to create log files with www-data perms (1 comment) [puppet] - https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) (owner: Elukey)
[10:17:35] Operations, Services: scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3339768 (Marostegui) The net persistent rules were playing a role here (or looks so). Removing `/etc/udev/rules.d/70-persistent-net.rules` and rebooting was able to bring eth0 back (I did a backu...
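Editor's sketch for the scb2005 interface rename discussed above: `/etc/udev/rules.d/70-persistent-net.rules` pins interface names to MAC addresses, so an entry that no longer matches the installed NIC can push the card onto a different name such as eth2. The snippet parses rules of that shape and lists entries whose pinned MAC differs from what the host currently reports. The sample rules, the MAC addresses, and the helper names are hypothetical; nothing here is taken from scb2005 itself.

```python
import re

# Matches the address/name pair in rules of the usual generated shape, e.g.
# SUBSYSTEM=="net", ..., ATTR{address}=="aa:bb:cc:dd:ee:ff", ..., NAME="eth0"
RULE_RE = re.compile(
    r'ATTR\{address\}=="(?P<mac>[0-9a-fA-F:]{17})".*?NAME="(?P<name>[^"]+)"'
)


def parse_persistent_net_rules(text):
    """Return {interface_name: mac} for every non-comment rule line."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE_RE.search(line)
        if match:
            mapping[match.group("name")] = match.group("mac").lower()
    return mapping


def stale_entries(rules, live):
    """Names whose pinned MAC is missing from, or differs in, the live list."""
    return {name: mac for name, mac in rules.items() if live.get(name) != mac}


# Hypothetical sample content, for illustration only.
SAMPLE_RULES = '''
# PCI device (made-up example)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:11:22:33:44:55", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="66:77:88:99:AA:BB", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
'''

rules = parse_persistent_net_rules(SAMPLE_RULES)
# Assumed live state: eth2 now reports a different MAC than the pinned one.
live = {"eth0": "00:11:22:33:44:55", "eth2": "de:ad:be:ef:00:01"}
print(stale_entries(rules, live))
```

A report like this only shows which pinned entries look out of date. Removing the file and rebooting, as described in the task comment, remains the action that was actually taken.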
[10:19:17] (PS1) Alexandros Kosiaris: Introduce neon.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/358335 (https://phabricator.wikimedia.org/T162045)
[10:19:19] (PS1) Alexandros Kosiaris: Introduce kubestagetcd100{1,2,3}.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/358336 (https://phabricator.wikimedia.org/T162045)
[10:19:46] Operations, ops-codfw, Services: scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3339796 (elukey)
[10:21:36] (CR) Ema: [C: 1] hhvm: force rsyslog config to create log files with www-data perms [puppet] - https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) (owner: Elukey)
[10:24:02] (PS4) Elukey: hhvm: force rsyslog config to create log files with www-data perms [puppet] - https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464)
[10:25:24] Operations, Patch-For-Review, User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#3339844 (elukey) I was able to repro the issue on mw2251 simply doing the following: ``` rm /var/log/hhvm/error.log systemctl stop nutcracker (this causes...
[10:28:19] !log upgrading mw1250-mw1258 to HHVM 3.18
[10:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:12] Operations, DBA, Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#3339848 (Marostegui) p:High>Normal
[10:32:22] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2001 - https://phabricator.wikimedia.org/T167118#3339855 (Marostegui) Open>declined
[10:32:26] (PS1) Hashar: contint: install HHVM from main [puppet] - https://gerrit.wikimedia.org/r/358341 (https://phabricator.wikimedia.org/T167493)
[10:33:32] moritzm: Guten Tag.
Above patch would let us switch CI to use HHVM 3.18 :)
[10:35:30] Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3339865 (elukey) @Marostegui we decided not to proceed with the BBU replacement, the risk is too high with a little gain. We are ok for the moment to use WriteThrough...
[10:38:08] Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3339869 (Marostegui) >>! In T166141#3339865, @elukey wrote: > @Marostegui we decided not to proceed with the BBU replacement, the risk is too high with a little gain....
[10:38:13] (PS1) Lucas Werkmeister (WMDE): Add “Constraints” section for constraint statements [mediawiki-config] - https://gerrit.wikimedia.org/r/358343 (https://phabricator.wikimedia.org/T167126)
[10:38:18] hashar: ok, having a look in 10-15 mins
[10:45:58] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review, User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3339889 (elukey) Updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[10:46:13] Operations, ops-eqiad, Analytics-Kanban, Patch-For-Review, User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3339890 (elukey)
[10:47:14] Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3339893 (elukey) Open>Resolved
[10:48:04] Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (elukey) @Cmjohnson sorry for the extra pings, we don't need anymore the BBU replacement. Thanks a lot anyway!
[10:48:57] moritzm: I am going to have lunch with my wife then commute back to office.
So I guess sometime this afternoon :]
[10:49:16] moritzm: feel free to merge it though and I will refresh the CI images when I am back
[10:50:22] hashar: ok
[10:52:05] (PS1) Alexandros Kosiaris: Introduce kubestagetcd100{1,2,3} and neon.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/358344 (https://phabricator.wikimedia.org/T162045)
[10:56:14] (CR) Muehlenhoff: [C: 1] contint: install HHVM from main [puppet] - https://gerrit.wikimedia.org/r/358341 (https://phabricator.wikimedia.org/T167493) (owner: Hashar)
[10:59:56] !log Drop table updates on commonswiki (s4) - T139342
[11:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:06] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342
[11:03:36] !log joal@tin Started deploy [analytics/refinery@d9c3419]: Regular weekly deploy of refinery (mostly unique_devices patches)
[11:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:09] !log upgrading job runners mw1162-mw1164 to HHVM 3.18
[11:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:54] !log joal@tin Finished deploy [analytics/refinery@d9c3419]: Regular weekly deploy of refinery (mostly unique_devices patches) (duration: 06m 18s)
[11:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:54] Operations, Citoid, VisualEditor, Services (blocked), User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3340006 (Mvolz) a:Mvolz
[11:27:30] Operations, Citoid, VisualEditor, Services (blocked), User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3340042 (Mvolz) No response to the e-mail. They only seem to be blocking our native scraper, probably by...
[11:28:09] Operations, Citoid, VisualEditor, Services (blocked), User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3340043 (Mvolz)
[11:29:58] marostegui: I think I'm going to do the rename (neoalpha) now, do you want me to wait or can I do it now?
[11:30:16] revi: give me a sec to open up all the monitoring things I want to check
[11:30:21] sure
[11:31:14] revi: do you happen to know if most of the edits are on a concrete wiki?
[11:31:20] kowiki
[11:31:25] cool
[11:31:27] give me a sec
[11:32:38] revi: go ahead
[11:32:40] ok
[11:33:44] marostegui: started
[11:33:49] oki
[11:39:44] revi: how is it going?
[11:39:48] fine
[11:39:55] kowiki passed
[11:40:11] great
[11:40:24] Some expected lag on one of the non powerful slaves
[11:40:45] beyond there is just wikidata with 400 edits
[11:41:03] (not counting edits less than 10 edits)
[11:41:11] ok, let me know when done with wikidata
[11:43:20] wikidata in progress
[11:43:29] and done
[11:43:40] excellent!
[11:43:42] just few zhwikis left
[11:44:29] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/95016maphack done!
[11:44:39] \o/ great!
[11:44:41] thanks!
[11:44:46] thanks :D
[11:44:56] That was pending for around 2 months
[11:45:00] will you close the ticket?
[11:45:03] yup
[11:45:05] or you want me to?
[11:45:07] great!
[11:45:19] I'll assign to you
[11:45:53] I didn't do anything, you did it!
[11:45:57] :)
[11:46:14] what I was asking for was your eyes on it so it's yours heh
[11:46:25] hahaha
[11:46:28] Operations, DBA, Wikimedia-Site-requests: Renaming Neoalpha: supervision needed - https://phabricator.wikimedia.org/T167597#3340077 (revi) Open>Resolved `There are no renames in progress for 95016maphack. They may have already finished.` `(change visibility) 20:33, 12 June 2017 -revi (talk |...
[11:46:39] left assignee blank :D
[11:46:45] :)
[11:47:17] Operations, DBA, Wikimedia-Site-requests: Renaming Neoalpha: supervision needed - https://phabricator.wikimedia.org/T167597#3340093 (Marostegui) Thanks for the heads up before running it :-)
[11:52:55] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2056004
[11:59:20] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358319 (https://phabricator.wikimedia.org/T166935) (owner: Marostegui)
[12:00:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358319 (https://phabricator.wikimedia.org/T166935) (owner: Marostegui)
[12:00:27] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/358319 (https://phabricator.wikimedia.org/T166935) (owner: Marostegui)
[12:01:26] !log upgrading mw1266-mw1275 to HHVM 3.18
[12:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance - T166935 (duration: 00m 41s)
[12:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:45] T166935: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935
[12:05:35] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:08:17] high cpu usage on graphite1001 starting at 12ish https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=graphite1001&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now
[12:08:25] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 3.847 second response time
[12:18:15] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 110.02,
100.33, 84.62
[12:20:58] moritzm: lunch done. Ready to switch CI to hhvm 3.18 ( https://gerrit.wikimedia.org/r/358341 ) :-}
[12:22:15] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 46.43, 76.73, 79.35
[12:23:23] (PS1) Gehel: elasticsearch: use $facts['ipaddress'] as the published host [puppet] - https://gerrit.wikimedia.org/r/358353
[12:23:52] (CR) Gehel: [C: -1] "This needs to be tested on relforge before merging." [puppet] - https://gerrit.wikimedia.org/r/358353 (owner: Gehel)
[12:27:28] hashar: ok, merging
[12:28:07] (CR) Muehlenhoff: [C: 2] contint: install HHVM from main [puppet] - https://gerrit.wikimedia.org/r/358341 (https://phabricator.wikimedia.org/T167493) (owner: Hashar)
[12:28:12] \O/
[12:29:04] hashar: merged, let me know when the images are rebuilt, I'll remove the 3.12 packages at a later point
[12:35:03] moritzm: it is rebuilding. Will test it out this afternoon then confirm it is all fine tomorrow/wednesday
[12:36:31] (PS1) KartikMistry: Update apertium-cat package [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247)
[12:36:46] (CR) jerkins-bot: [V: -1] Update apertium-cat package [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) (owner: KartikMistry)
[12:45:06] Operations, HHVM, Patch-For-Review, Release-Engineering-Team (Kanban): Switch CI tests back to HHVM 3.18 - https://phabricator.wikimedia.org/T167493#3340275 (hashar) Jessie snapshot updated: `hhvm (3.12.14+dfsg-1+wmf1 => 3.18.2+dfsg-1+wmf5)`
[12:55:48] Operations, ops-codfw, DC-Ops, Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3340294 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'...
[12:56:27] ACKNOWLEDGEMENT - HP RAID on db2070 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167667
[12:56:28] jouncebot: refresh
[12:56:29] I refreshed my knowledge about deployments.
[12:56:36] Operations, ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3340311 (ops-monitoring-bot)
[12:56:38] And db2070…finally failed
[12:57:02] marostegui: I don't see the failure though...
[12:57:07] here on IRC I mean
[12:57:19] what do you mean?
[12:57:46] I don't see the critical alert from icinga-wm, just the ACK
[12:58:02] ah right, that is true
[12:58:15] was it with notification disabled?
[12:58:19] Operations, ops-codfw, DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3340322 (Marostegui) Please @Papaul change the disk when you can. Thanks!
[12:59:16] Operations, ops-codfw, DBA: db2070: Disk about to fail - https://phabricator.wikimedia.org/T167623#3340339 (Marostegui) Open>declined And the disk finally failed: T167667 So let's close this.
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170612T1300).
[13:00:04] aharoni, odder, Amir1, and hashar: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:12] o/
[13:00:24] volans: Ah yes, that is it.
It was showing predictive failure early in the morning and I downtimed it as I created the task
[13:00:39] marostegui: ok then it makes sense, thanks :)
[13:00:44] thank you
[13:01:01] Operations, ops-codfw, DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3340352 (Marostegui) p:Triage>Normal
[13:01:05] don't forget to re-enable it ;)
[13:01:06] (CR) Hashar: [C: -1] "That has been done on purpose with https://gerrit.wikimedia.org/r/#/c/330401/ for T150618" [mediawiki-config] - https://gerrit.wikimedia.org/r/358176 (owner: Odder)
[13:01:17] yeah, will do!
[13:01:21] going to merge the logo updates
[13:01:51] (CR) Hashar: [C: 2] Update pre-2010 high-density Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/358170 (owner: Odder)
[13:01:55] Сәләм
[13:02:04] (CR) Hashar: [C: 2] Update logo for the Norwegian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/358175 (https://phabricator.wikimedia.org/T167192) (owner: Odder)
[13:02:06] (That's "Hello" in Bashkir.)
[13:02:10] (CR) Hashar: [C: 2] Add high-density logos for the Basque Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/358303 (https://phabricator.wikimedia.org/T150618) (owner: Odder)
[13:02:23] tsk tsk tsk, odder is not here :/
[13:02:43] in the Wikipedia in odder's own language, Polish, the logo for high-resolution screens is also broken :)
[13:03:41] logo process is quite broken nowadays
[13:04:15] Nemo_bis: wasn't it always?
:) [13:04:17] (03Merged) 10jenkins-bot: Update pre-2010 high-density Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358170 (owner: 10Odder) [13:04:30] odder made it slightly better, with a lot of personal effort [13:04:42] aharoni: no [13:04:46] (03Merged) 10jenkins-bot: Update logo for the Norwegian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358175 (https://phabricator.wikimedia.org/T167192) (owner: 10Odder) [13:04:55] Odder worked in the traditional process [13:05:09] that is, odder made the _situation_ with logos slightly better, but the process was always broken. [13:05:29] (03PS3) 10Hashar: Add high-density logos for the Basque Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358303 (https://phabricator.wikimedia.org/T150618) (owner: 10Odder) [13:05:45] (03CR) 10Hashar: [C: 032] Add high-density logos for the Basque Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358303 (https://phabricator.wikimedia.org/T150618) (owner: 10Odder) [13:05:54] Dunno [13:06:03] I disagree, but opinions [13:07:22] (03Merged) 10jenkins-bot: Add high-density logos for the Basque Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358303 (https://phabricator.wikimedia.org/T150618) (owner: 10Odder) [13:07:59] I am syncing the logos [13:08:12] ( Process being https://meta.wikimedia.org/wiki/User:Cbrown1023/Logos , for those who missed it. 
) [13:08:23] !log hashar@tin Synchronized static/images/project-logos: (no justification provided) (duration: 00m 43s) [13:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:24] (03PS5) 10Andrew Bogott: Remove references to keystone admin_token [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) [13:10:17] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: update some logos 6974b9ab4..76939d15f (duration: 00m 41s) [13:10:19] Nemo_bis: most likely, I don't know all the details that you do, so we probably don't actually disagree. [13:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:28] aharoni: lets do your collation change [13:10:50] (03PS4) 10Hashar: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [13:10:51] I'd say that if a logo has to be created in such a manual process for hundreds of languages, that is by itself broken. [13:10:56] hashar: I'm ready and excited!@ [13:10:57] (03CR) 10Hashar: [C: 032] Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [13:11:38] and I guess I will next do Amir1 change of collation for the persian wikis [13:12:03] Thanks [13:12:06] hashar: and this includes running the maintenance script, right? any idea how long does it take? 
[13:12:16] I have no idea [13:12:28] that is probably fast enough ;} [13:12:42] (03Merged) 10jenkins-bot: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [13:13:26] (03CR) 10jenkins-bot: Update pre-2010 high-density Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358170 (owner: 10Odder) [13:13:28] aharoni: it is syncing. The update collation doc is at the bottom of https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#updateCollation [13:13:28] (03CR) 10jenkins-bot: Update logo for the Norwegian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358175 (https://phabricator.wikimedia.org/T167192) (owner: 10Odder) [13:13:30] (03CR) 10jenkins-bot: Add high-density logos for the Basque Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358303 (https://phabricator.wikimedia.org/T150618) (owner: 10Odder) [13:13:32] (03CR) 10jenkins-bot: Set collation for Bashkir wikis to uppercase-ba [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [13:13:50] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Set collation for Bashkir wikis to uppercase-ba - T162823 (duration: 00m 41s) [13:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:58] T162823: Changing the alphabetical sorting (collation) @ ba.wikipedia.org - https://phabricator.wikimedia.org/T162823 [13:14:15] (03PS2) 10Hashar: Change Persian Wikis from uca-fa to xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357951 (https://phabricator.wikimedia.org/T139110) (owner: 10Ladsgroup) [13:16:26] (03CR) 10Hashar: [C: 032] Change Persian Wikis from uca-fa to xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357951 (https://phabricator.wikimedia.org/T139110) (owner: 10Ladsgroup) [13:16:42] aharoni: make 
sure to run the script in a screen and !log it :-} [13:17:08] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3340387 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'... [13:17:23] hashar: oh, myself? do I even have a permission? [13:17:27] (03Merged) 10jenkins-bot: Change Persian Wikis from uca-fa to xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357951 (https://phabricator.wikimedia.org/T139110) (owner: 10Ladsgroup) [13:17:30] !log upgrading cp1008 to openssl 1.1.0f [13:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:05] (03CR) 10jenkins-bot: Change Persian Wikis from uca-fa to xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357951 (https://phabricator.wikimedia.org/T139110) (owner: 10Ladsgroup) [13:19:40] giving it a try on fawikis [13:19:51] (03PS1) 10KartikMistry: apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) [13:20:05] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [13:20:16] aharoni: for fawiki that takes a while "Fixing collation for 4022112 rows." [13:20:20] hashar: So what's wrong with 358176? [13:20:21] and it is doing it in batches of 100 :/ [13:20:25] You -1'd that change. [13:20:32] hashar: well, it does seem to work. [13:20:38] (Sorry for being late. Traffic.) 
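Editorial aside on the figures quoted above (4022112 rows for fawiki, handled in batches of 100): a minimal sketch of the arithmetic, not part of the original conversation. The variable names are illustrative only, and it assumes nothing about how the maintenance script itself iterates.

```shell
#!/bin/sh
# Figures quoted in the log: fawiki reports 4022112 rows, processed 100 at a time.
rows=4022112
batch_size=100
# Integer ceiling division: a partial final batch still counts as one batch.
batches=$(( (rows + batch_size - 1) / batch_size ))
echo "batches needed: $batches"
```

That is on the order of forty thousand round trips, which fits the "takes a while" remark for the largest wiki.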
[13:20:43] odder: had doubt about the opportunity to remove those files :D [13:21:16] !log terbium: for T139110 mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa [13:21:24] hashar: They are not used from $wgLogoHD anyway [13:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:26] T139110: uca-fa collation shows pages starting with ا incorrectly under ء - https://phabricator.wikimedia.org/T139110 [13:21:31] But https://no.wikisource.org/wiki/Wikikilden:Forside looks broken [13:21:41] Sweet. [13:22:41] hashar: for bawiki it's at 131500 out of 366541 [13:22:42] !log terbium: for T139110 mwscript updateCollation.php --wiki=fawikisource --previous-collation=uca-fa [13:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:56] aharoni: neat. Make sure to !log it :-) [13:23:08] mmm - hashar this may be an oops moment, but what do you mean by "!log"? I never did it [13:23:14] !log terbium: for T139110 mwscript updateCollation.php --wiki=fawiktionary --previous-collation=uca-fa [13:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:44] aharoni: add a meaningful message about what you are doing on the cluster in this IRC channel. 
Ie write here: !log running update collation for TXXXX [13:24:30] !log running mwscript updateCollation.php --wiki=bawiki [13:24:32] !log terbium: for T139110 mwscript updateCollation.php --wiki=fawikibooks --previous-collation=uca-fa [13:24:33] hashar: I think it needs --force [13:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:49] !log terbium: for T139110 mwscript updateCollation.php --wiki=fawikinews --previous-collation=uca-fa [13:24:49] at least fawiki is not fixed https://fa.wikipedia.org/wiki/%D8%B1%D8%AF%D9%87:%D8%A7%D8%B2%D8%AF%D9%88%D8%A7%D8%AC_%D9%87%D9%85%D8%AC%D9%86%D8%B3%E2%80%8C%DA%AF%D8%B1%D8%A7%DB%8C%D8%A7%D9%86_%D8%A8%D8%B1_%D9%BE%D8%A7%DB%8C%D9%87_%DA%A9%D8%B4%D9%88%D8%B1 [13:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:00] Amir1: fawiki is super large and still in progress though [13:25:29] !log terbium: for T139110 mwscript updateCollation.php --wiki=fawikiquote --previous-collation=uca-fa [13:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:58] odder: maybe we can add them to wgLogoHD instead ? [13:26:15] odder: iirc Urbanecm added the .png hd logos for most of the wiki. But maybe they have not all been switched [13:26:16] hashar: wgLogoHD already uses those logos from /project-logos [13:26:20] oh, okay. I saw !log I thought it's done [13:26:39] hashar: for bawiki it's at 300000 out of 366541 [13:26:45] hashar: Those logos just exist in duplicate in static/images/ [13:27:05] while wgLogoHD uses those same files from static/images/project-logos/ [13:27:10] hi odder. If I'm not mistaken the high-res logo for Polish Wikipedia also needs fixing. [13:27:17] Amir1: all switched. 
But fawiki is still ongoing [13:27:22] You are not mistaken, aharoni [13:27:35] I'm trying to see why I'm still being served the old logo [13:27:36] hashar: okay, let me test a smaller wiki [13:27:43] aharoni: ^ [13:28:05] odder: I talked about this at the Polish village pump a couple of months ago. [13:28:11] !log joal@tin Started deploy [analytics/refinery@0dda4a9]: Bug correction for egular weekly deploy of refinery [13:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:43] hashar: it's not fixed in fawikiquote https://fa.wikiquote.org/wiki/%D8%B1%D8%AF%D9%87:%D8%A7%D9%87%D8%A7%D9%84%DB%8C_%D8%A7%DB%8C%D8%A7%D9%84%D8%A7%D8%AA_%D9%85%D8%AA%D8%AD%D8%AF%D9%87_%D8%A2%D9%85%D8%B1%DB%8C%DA%A9%D8%A7 [13:28:49] hashar: the script for bawiki appears to be done! (should I !log the ending, too?) [13:28:56] do you mind if I retry in fawikiquote using --force [13:28:58] I have not log the end [13:28:58] ? [13:29:19] (03PS2) 10KartikMistry: apertium-cat: Update to latest upstrem snapshot [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) [13:29:28] Amir1: will do [13:29:33] (03CR) 10jerkins-bot: [V: 04-1] apertium-cat: Update to latest upstrem snapshot [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [13:30:01] Amir1: but really if we have to pass --force, there is something wrong in the script isn't there? [13:30:03] hashar: I just tested actual categories on bawiki, and the patch appears to work! 
\o/ [13:30:11] bawiki editors are going to be very happy :) [13:30:37] hashar: honestly, I have no idea how the maintenance script works, that's what bawolff said in the ticket [13:30:41] !log running mwscript updateCollation.php --wiki=bawikibooks [13:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:50] Amir1: well it seems to work for bawikis :-} [13:30:56] ... and for bawikibooks it was very fast of course, because that project is almost empty. [13:31:09] looks like I'm done! hashar thanks a lot for the assistance. [13:31:12] maybe icu collation for bawikis are different [13:31:16] \O/ [13:31:51] !log joal@tin Finished deploy [analytics/refinery@0dda4a9]: Bug correction for egular weekly deploy of refinery (duration: 03m 40s) [13:31:53] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3340420 (10Gehel) Only one disk is seen by debian installer, the raid probably needs to be re-created outside of the OS, I'm checking... [13:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:39] (03CR) 10Daniel Kinzler: [C: 031] "We want this, and it looks like this approach should work." [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [13:33:19] hashar: What do you see in https://pl.wikipedia.org/static/images/project-logos/plwiki-2x.png ? [13:33:32] (03PS4) 10Daniel Kinzler: Make /entity/ redirect internal [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [13:35:02] !log uploaded openssl 1.1.0f to apt.wikimedia.org [13:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:13] Amir1: bah fawikibooks rows are still set to `uca-fa` [13:35:44] amir1 - actually no. It's totally custom. Kindly made by the brilliant Bawolff. 
I think it's the first that such a big custom collation module is made. [13:35:48] the first time [13:36:10] It can become a good precedent. (Of course it would be even better if it did go to CLDR and wouldn't have to be custom.) [13:36:19] For Bashkir I already made a CLDR issue. [13:36:33] Amir1: I have screwed up the deployment apparently [13:38:02] Amir1: ah I forgot to scap sync the configuration change :-} [13:38:07] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Change Persian Wikis from uca-fa to xx-uca-fa - T139110 (duration: 00m 41s) [13:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:17] T139110: uca-fa collation shows pages starting with ا incorrectly under ء - https://phabricator.wikimedia.org/T139110 [13:38:22] hashar: as a deployer I find this quite funny [13:38:37] Amir1: fawikibooks completed now [13:38:41] and verified in the database [13:40:01] hashar: it's really hard to test in fawikibooks as categorization is a mess there [13:40:04] !log redoing all the fawiki* updateCollation.php since I ran them without deploying the IS.php change :( [13:40:08] can you do it on fawikiquote? [13:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:28] fawikinews fawikiquote completed [13:40:34] fawikibooks done [13:40:46] doing fawiktionary [13:43:16] done [13:44:53] (03PS1) 10Volans: TODO: re-organized TODO lists. [software/cumin] - 10https://gerrit.wikimedia.org/r/358371 [13:45:06] really? [13:45:07] hahahahahah [13:45:37] :-P [13:45:58] Amir1: all done and fawiki is in progress [13:46:13] (03CR) 10Volans: [C: 032] TODO: re-organized TODO lists. [software/cumin] - 10https://gerrit.wikimedia.org/r/358371 (owner: 10Volans) [13:46:30] Thanks! [13:46:44] (03Merged) 10jenkins-bot: TODO: re-organized TODO lists. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/358371 (owner: 10Volans) [13:48:34] !log hashar@tin Synchronized php-1.30.0-wmf.4/includes/specials/SpecialNewimages.php: SpecialNewimages: Do not add the module when the special page is included - T167601 (duration: 00m 41s) [13:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:43] T167601: Transclusion of Special:NewImages produces a JavaScript error - https://phabricator.wikimedia.org/T167601 [13:51:26] hashar: it's fixed https://fa.wikiquote.org/wiki/%D8%B1%D8%AF%D9%87:%D8%A7%D9%87%D8%A7%D9%84%DB%8C_%D8%A7%DB%8C%D8%A7%D9%84%D8%A7%D8%AA_%D9%85%D8%AA%D8%AD%D8%AF%D9%87_%D8%A2%D9%85%D8%B1%DB%8C%DA%A9%D8%A7 [13:51:28] YESSSSS [13:52:16] We have been waiting for this for a year now and it was quite annoying (it was like collating things starting with A or E under Æ) [13:53:00] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [13:53:07] !log Shutdown db1089 for maintenance - T166935 [13:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:19] T166935: db1089: update RAID controller firwmare - https://phabricator.wikimedia.org/T166935 [13:53:52] (03CR) 10Andrew Bogott: [C: 032] Remove references to keystone admin_token [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [13:55:06] hashar: Need to fix that nowikisource logo, looks like the file they provided wasn't ideal [13:55:17] hashar: Also I'm still seeing the pre-2010 Wikipedia logo for pl [13:55:24] While eu has got the new one alright [13:57:10] Amir1: we will want to change fawikivoyage as well I guess. 
The category collation is set to uppercase [13:57:21] (03PS2) 10Ema: VCL: use resp.reason for synthetic responses generation [puppet] - 10https://gerrit.wikimedia.org/r/358057 [13:57:45] I didn't know that [13:57:48] I'll check [13:59:09] Amir1: I have posted my findings on https://phabricator.wikimedia.org/T139110#3340566 [13:59:11] !log upgrading mw1296-mw1298 to HHVM 3.18 [13:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:28] odder: maybe that is a cache issue? :( [14:00:02] hashar let me talk to the community and come back to you in several days [14:00:05] okay? [14:00:09] sure thing [14:00:28] Amir1: once you get some feedback from them, just prepare a new patch and add it to swat as usual :} [14:00:43] Yes, that's what I will do [14:01:09] hashar: Can you confirm if you see the new logo on pl? [14:01:12] if so it's fine [14:01:28] https://pl.wikipedia.org/static/images/project-logos/plwiki-2x.png [14:01:37] https://pl.wikipedia.org/static/images/project-logos/plwiki-1.5x.png [14:02:01] (I also notice they hotlink files from Commons that are not protected. PENIS ON MAIN PAGE ANYONE?) [14:02:59] (03PS1) 10Odder: Update logo for the Norwegian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358374 (https://phabricator.wikimedia.org/T167192) [14:03:21] odder: both works for me [14:03:34] hashar: Great, so that's a cache issue on my side then. [14:03:45] odder: I can try purging the cache [14:03:59] Please. [14:04:09] So https://gerrit.wikimedia.org/r/#/c/358176/ should be fine to deploy. 
[14:04:15] !log updating tor in jessie-wikimedia to 0.2.9.11-1~d80.jessie+1 (via reprepro update from tor repository) [14:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:40] (03CR) 10Faidon Liambotis: [C: 031] hhvm: force rsyslog config to create log files with www-data perms [puppet] - 10https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) (owner: 10Elukey) [14:04:43] https://en.wikipedia.org/static/images/project-logos/pawiki-1.5x.png [14:04:44] odder: can you try again the pl logos ? [14:04:54] https://en.wikipedia.org/static/images/project-logos/pawiki-2x.png both exist [14:05:12] (03PS1) 10Filippo Giunchedi: hieradata: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/358376 (https://phabricator.wikimedia.org/T162609) [14:05:14] (03PS1) 10Filippo Giunchedi: hierata: swift active in codfw only [puppet] - 10https://gerrit.wikimedia.org/r/358377 (https://phabricator.wikimedia.org/T162609) [14:05:36] hashar: Fantastic, thanks. [14:06:12] (03CR) 10Hashar: [C: 032] Delete duplicate HD logos for the Punjabi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358176 (owner: 10Odder) [14:08:09] hashar: And https://gerrit.wikimedia.org/r/#/c/358374/ fixes the nowikisource logos [14:08:22] (Sorry for the mess.) 
[14:08:29] 10Operations, 10Labs, 10cloud-services-team (Kanban): Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3340630 (10Andrew) [14:08:32] 10Operations, 10Labs, 10Patch-For-Review: Disable keystone admin_token usage - https://phabricator.wikimedia.org/T165211#3340628 (10Andrew) 05Open>03Resolved a:03Andrew [14:09:38] (03CR) 10Hashar: [C: 032] Update logo for the Norwegian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358374 (https://phabricator.wikimedia.org/T167192) (owner: 10Odder) [14:09:40] odder: excellent :} [14:11:19] (03Merged) 10jenkins-bot: Delete duplicate HD logos for the Punjabi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358176 (owner: 10Odder) [14:11:22] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: upgrade kibana to v5.3.3 - https://phabricator.wikimedia.org/T167266#3340636 (10debt) 05Open>03Resolved Yay! Thanks @Gehel, it's been added to the weekly status report. 
[14:11:32] (03CR) 10jenkins-bot: Delete duplicate HD logos for the Punjabi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358176 (owner: 10Odder) [14:11:50] (03Merged) 10jenkins-bot: Update logo for the Norwegian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358374 (https://phabricator.wikimedia.org/T167192) (owner: 10Odder) [14:12:09] odder: syncing the first [14:12:45] !log hashar@tin Synchronized static/images/: Delete duplicate HD logos for the Punjabi Wikipedia (duration: 00m 41s) [14:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] odder: and syncing the fix for nowikisource [14:13:39] (03CR) 10jenkins-bot: Update logo for the Norwegian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358374 (https://phabricator.wikimedia.org/T167192) (owner: 10Odder) [14:13:52] !log hashar@tin Synchronized static/images/project-logos/: Update logo for the Norwegian Wikisource - T167192 (duration: 00m 41s) [14:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:02] T167192: Change the logo of nowikisource - https://phabricator.wikimedia.org/T167192 [14:14:20] odder: nowikisource should be good now [14:14:27] !log European SWAT completed [14:14:30] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nutcracker] [14:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:59] !log updating tor on radium to 0.2.9.11-1~d80.jessie+1 [14:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:25] hashar: Still seeing the old logo, can you do the cache magic again? 
[14:16:08] odder: I did [14:16:12] odder: but maybe I screwed it up :/ [14:16:33] echo "https://en.wikipedia.org/static/images/project-logos/nowikisource-1.5x.png"|mwscript purgeList.php --wiki=enwiki [14:16:38] echo "https://en.wikipedia.org/static/images/project-logos/nowikisource-2x.png"|mwscript purgeList.php --wiki=enwiki [14:16:55] and I have done just in case: [14:16:56] echo "https://no.wikisource.org/static/images/project-logos/nowikisource-2x.png"|mwscript purgeList.php --wiki=enwiki [14:17:20] odder: is it any better? [14:18:33] hashar: No, it's not, but then the cs.wikisource logo that I based the Norwegian one on isn't any better [14:19:07] So not your fault :-P [14:20:05] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:20:20] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:22:04] (03CR) 10Alexandros Kosiaris: [C: 031] "So, I am guessing this is the first step in a huge pile of refactoring that are coming up. In the interest of not blocking that path forwa" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354449 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [14:22:07] (03PS5) 10Elukey: hhvm: force rsyslog config to create log files with www-data perms [puppet] - 10https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) [14:22:15] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 129.21, 103.80, 80.49 [14:23:17] (03CR) 10Hashar: "Nodepool looks all good to me. 
I have managed to refresh a snapshot :-}" [puppet] - 10https://gerrit.wikimedia.org/r/357659 (https://phabricator.wikimedia.org/T165211) (owner: 10Andrew Bogott) [14:23:45] the ms-be1019 alert is likely a rebalance plus faulty bbu from T163777 [14:23:46] T163777: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777 [14:23:57] elukey: we had that syslog issue a while back [14:24:09] cmjohnson1: news on ms-be1019's BBU from hp? [14:24:25] (03CR) 10Elukey: [C: 032] hhvm: force rsyslog config to create log files with www-data perms [puppet] - 10https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) (owner: 10Elukey) [14:25:03] hashar: it keeps spamming us on a regular basis, really annoying [14:25:08] anything against the change? [14:25:11] I can imagine :-) [14:25:17] a year or so ago i did https://gerrit.wikimedia.org/r/#/c/285945/ [14:26:28] but that was on ubuntu [14:26:46] the create field changed to www-data:www-data in the meantime, I guess multiple changes went through [14:26:57] I guess [14:27:09] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:27:15] hashar: anyway, good if I merge? [14:27:25] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:27:59] (merging) [14:28:56] elukey: there is a logrotate conf file as well [14:28:58] that has some specific user/group settings [14:29:04] su <%= @user %> <%= @group %> [14:29:19] so yeah I guess that is inline [14:29:21] godog: they sent me something on saturday....they f'd up and now I have to start over..it's on my agenda to take care of today but will be busy on db1089 for awhile [14:29:31] hashar: yep exactly! 
[14:29:35] (03CR) 10Hashar: "Seems to be inline with modules/hhvm/templates/hhvm.logrotate.erb :)" [puppet] - 10https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) (owner: 10Elukey) [14:30:07] cmjohnson1: kk, ping me if you are free before today's ops meeting [14:30:14] (03PS1) 10Muehlenhoff: Use ffmpeg from jessie-backports on jessie-based video scalers [puppet] - 10https://gerrit.wikimedia.org/r/358381 (https://phabricator.wikimedia.org/T145742) [14:30:33] (03PS2) 10Alexandros Kosiaris: Introduce kubestagetcd100{1,2,3}.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/358336 (https://phabricator.wikimedia.org/T162045) [14:30:41] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce neon.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/358335 (https://phabricator.wikimedia.org/T162045) (owner: 10Alexandros Kosiaris) [14:31:15] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 54.18, 74.53, 79.05 [14:31:28] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce kubestagetcd100{1,2,3}.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/358336 (https://phabricator.wikimedia.org/T162045) (owner: 10Alexandros Kosiaris) [14:31:45] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/20-hhvm.conf] [14:32:54] !log restart elasticsearch on relforge1001 to validate GC configuration [14:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:16] (03PS4) 10Pmiazga: Setup the new wgPopupsGateway config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) [14:37:00] (03PS3) 10Ema: VCL: use resp.reason for synthetic responses generation [puppet] - 10https://gerrit.wikimedia.org/r/358057 [14:37:28] 10Operations, 10HHVM, 10Patch-For-Review, 10Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3340789 (10MoritzMuehlenhoff) Status update: With HHVM 3.18.2+wmf5 we now have a stable HHVM package. It has been rolled out on all appservers, all image scalers, most API servers... [14:38:03] here I am, checking mw2166 [14:39:10] ran puppet, worked fine, weird [14:39:35] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:41:35] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:43:05] (03CR) 10Hashar: apertium-fra: New upstream release (031 comment) [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:44:23] 10Operations, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Switch CI tests back to HHVM 3.18 - https://phabricator.wikimedia.org/T167493#3340807 (10hashar) Mostly done ™. Lets wait a few days to make sure nothing is misbehaving then I guess we can resolve this. 
[14:45:03] (03CR) 10KartikMistry: apertium-fra: New upstream release (031 comment) [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [14:46:21] (03PS2) 10KartikMistry: apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) [14:46:23] 10Operations, 10Operations-Software-Development: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#3335151 (10akosiaris) Note there's https://phabricator.wikimedia.org/T167269 that describes an approach that at least partly (if not fully) ov... [14:47:19] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: performance regression after upgrading elasticsearch to v5.3.2 - https://phabricator.wikimedia.org/T167685#3340827 (10Gehel) [14:48:15] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 133.18, 105.61, 87.30 [14:48:37] 10Operations, 10Kubernetes, 10Prod-Kubernetes (Experiment), 10User-Joe: Make security updates of docker images manageable - https://phabricator.wikimedia.org/T167269#3340844 (10akosiaris) T167504 has a proposal as well that at least partially (if not fully) overlaps. I think we should merge this one into T... [14:49:08] (03PS3) 10KartikMistry: Update apertium-cat package [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/358354 (https://phabricator.wikimedia.org/T167247) [14:51:35] PROBLEM - configured eth on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:52:25] RECOVERY - configured eth on ms-be1019 is OK: OK - interfaces up [14:53:15] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: performance regression after upgrading elasticsearch to v5.3.2 - https://phabricator.wikimedia.org/T167685#3340852 (10Gehel) [14:54:23] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch: performance regression after upgrading elasticsearch to v5.3.2 - https://phabricator.wikimedia.org/T167685#3340827 (10Gehel) 05Open>03Resolved Removing UseConcMarkSweepGC seems to work on relforge1001. young gen size is back to something m... [14:56:05] (03PS1) 10Gehel: elasticsearch: remove UseConcMarkSweepGC [puppet] - 10https://gerrit.wikimedia.org/r/358383 (https://phabricator.wikimedia.org/T167685) [14:56:55] (03PS2) 10Gehel: elasticsearch: remove UseConcMarkSweepGC [puppet] - 10https://gerrit.wikimedia.org/r/358383 (https://phabricator.wikimedia.org/T167636) [15:00:07] 10Operations, 10Operations-Software-Development: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#3340884 (10Volans) @akosiaris yes we were aware of it and I spoke with @Joe last week about the requirements for the Docker part, sorry to not... [15:08:25] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 54.60, 66.18, 78.85 [15:09:30] 10Operations, 10Labs, 10Patch-For-Review: Disable keystone admin_token usage - https://phabricator.wikimedia.org/T165211#3340918 (10bd808) [15:10:26] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3340928 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'... 
[15:14:46] (03PS3) 10Gehel: elasticsearch: remove UseConcMarkSweepGC [puppet] - 10https://gerrit.wikimedia.org/r/358383 (https://phabricator.wikimedia.org/T167636) [15:16:17] (03CR) 10DCausse: [C: 031] elasticsearch: remove UseConcMarkSweepGC [puppet] - 10https://gerrit.wikimedia.org/r/358383 (https://phabricator.wikimedia.org/T167636) (owner: 10Gehel) [15:17:28] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3340972 (10ayounsi) a:03RobH [15:20:19] (03PS2) 10Ema: varnish mobile redirects: allow for dashes in first label [puppet] - 10https://gerrit.wikimedia.org/r/358028 (https://phabricator.wikimedia.org/T167492) (owner: 10BBlack) [15:20:25] PROBLEM - HHVM rendering on mw2204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:15] RECOVERY - HHVM rendering on mw2204 is OK: HTTP OK: HTTP/1.1 200 OK - 78926 bytes in 0.121 second response time [15:22:18] (03CR) 10Ema: [C: 031] varnish mobile redirects: allow for dashes in first label [puppet] - 10https://gerrit.wikimedia.org/r/358028 (https://phabricator.wikimedia.org/T167492) (owner: 10BBlack) [15:23:04] 10Operations, 10ops-codfw, 10Services: scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3340988 (10Papaul) @Marostegui no link on eth0 or eth1 . I replaced the network cable same problem. When i plugged the cable on NIC 3 and NIC 4 i have link. In the ILO under the HW... [15:28:14] 10Operations, 10ops-codfw, 10Services: scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3341009 (10Marostegui) Which mac addresses do you see for NIC3 and NIC4? For me: eth0: 18:66:da:7d:ac:b4 eth1: 18:66:da:7d:ac:b5 Does any of those correlate with mac addresses you s... 
[15:29:23] 10Operations, 10ops-codfw, 10Services (watching): scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3341020 (10mobrovac) [15:31:09] 10Operations, 10Monitoring, 10netops: Setup flow monitoring of *internal* network traffic - https://phabricator.wikimedia.org/T79755#3341024 (10ayounsi) a:03ayounsi Prometheus (that didn't exist in 2011) with netstat provides better visibility on problematic frames/segments/datagrams/packets getting in/out... [15:32:35] PROBLEM - salt-minion processes on ms-be2018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:34:03] 10Puppet, 10Analytics, 10Analytics-EventLogging: Eventlogging file logging code split weirdly between role and base class - https://phabricator.wikimedia.org/T86745#975424 (10Nuria) Old task, not relevant. [15:34:10] 10Puppet, 10Analytics, 10Analytics-EventLogging: Eventlogging file logging code split weirdly between role and base class - https://phabricator.wikimedia.org/T86745#3341037 (10Nuria) 05Open>03Resolved [15:34:35] RECOVERY - salt-minion processes on ms-be2018 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:34:38] 10Operations, 10Recommendation-API, 10Services, 10Service-deployment-requests, 10User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3341038 (10mobrovac) [15:34:55] RECOVERY - MD RAID on elastic2020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:36:10] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3341053 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2020.codfw.wmnet'] ``` and were **ALL** successful. [15:46:45] RECOVERY - Host scb2005 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:47:38] marostegui: ---^ :D [15:47:49] :) [15:47:50] was that you? 
[15:48:55] PROBLEM - ores on scb2005 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.001 second response time [15:48:55] PROBLEM - Check whether ferm is active by checking the default input chain on scb2005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:48:55] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:49:12] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3341110 (10Reedy) [15:49:47] okay, that's horrifying [15:51:10] Amir1: if you are worried about scb2005 the host was down due to eth0 issues [15:51:14] nothing is exploding :) [15:51:21] phew [15:51:28] I was checking grafana [15:51:34] 10Operations, 10ops-codfw, 10Services (watching): scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3341128 (10Papaul) NIC.Embedded.3-1-1 Ethernet = 18:66:DA:7D:AC:B4 NIC.Embedded.4-1-1 Ethernet = 18:66:DA:7D:AC:B5 [15:51:39] Thanks for noting this [15:51:43] :) [16:01:33] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3341163 (10elukey) [16:01:35] 10Operations, 10Patch-For-Review, 10User-Elukey: hhvm root:adm owned log files cause failures for logrotate - https://phabricator.wikimedia.org/T146464#3341162 (10elukey) 05Open>03Resolved [16:04:01] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250#3341173 (10N... 
[16:06:12] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3341190 (10ayounsi) [16:10:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:10:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:11:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:11:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:12:25] PROBLEM - HHVM rendering on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:15] RECOVERY - HHVM rendering on mw2144 is OK: HTTP OK: HTTP/1.1 200 OK - 78896 bytes in 0.133 second response time [16:13:38] (03PS1) 10Joal: Rename unique devices daily endpoint [puppet] - 10https://gerrit.wikimedia.org/r/358386 (https://phabricator.wikimedia.org/T167043) [16:13:55] 10Operations, 10Recommendation-API, 10Service-deployment-requests, 10Services (doing), 10User-mobrovac: New Service Request: recommendation-api - https://phabricator.wikimedia.org/T167664#3341247 (10mobrovac) [16:14:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:14:54] elukey: if you have minute: https://gerrit.wikimedia.org/r/#/c/358386/ [16:15:20] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-3h&to=now [16:15:34] ^ some tall but narrow 503 spikes happening on text traffic... 
[16:15:35] we apparently had a spike of mediawiki errors in the log [16:16:16] from https://grafana.wikimedia.org/dashboard/db/production-logging?refresh=5m&panelId=4&fullscreen&orgId=1&from=now-1h&to=now [16:16:42] (03PS2) 10Elukey: pivot: rename unique devices daily endpoint [puppet] - 10https://gerrit.wikimedia.org/r/358386 (https://phabricator.wikimedia.org/T167043) (owner: 10Joal) [16:19:35] any API issues? uri_path in the 5xx logs is /w/api.php [16:21:24] (03CR) 10Elukey: [C: 032] pivot: rename unique devices daily endpoint [puppet] - 10https://gerrit.wikimedia.org/r/358386 (https://phabricator.wikimedia.org/T167043) (owner: 10Joal) [16:22:42] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3341278 (10Papaul) [16:23:38] ema: there was definitely some queueing on HHVM - https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All&from=now-1h&to=now [16:23:43] Thanks elukey [16:23:52] joal: restarting pivot now [16:23:59] man, you rock [16:24:03] done :) [16:24:10] faster than the fastest [16:24:52] ahahha it was super trivial, you did all the work [16:25:16] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3341291 (10Papaul) [16:25:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:25:26] elukey: I actually did it too fast, I forgot half the patch ... 
Please excuse, submitting a new one [16:25:37] sure [16:26:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:27:25] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:31:57] (03PS1) 10Joal: pivot: rename unique devices daily endpoint [puppet] - 10https://gerrit.wikimedia.org/r/358389 (https://phabricator.wikimedia.org/T167043) [16:31:58] elukey: --^ [16:32:01] sorry again elukey [16:38:04] (03PS1) 10Marostegui: db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358394 [16:38:27] (03PS2) 10Marostegui: db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358394 [16:38:52] (03CR) 10Elukey: [C: 032] pivot: rename unique devices daily endpoint [puppet] - 10https://gerrit.wikimedia.org/r/358389 (https://phabricator.wikimedia.org/T167043) (owner: 10Joal) [16:38:55] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 29 probes of 433 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:39:52] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10Release-Engineering-Team (Next), 10Scap (Scap3-Adoption-Phase1): figure out how to not restart jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3341368 (10greg) [16:40:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358394 (owner: 10Marostegui) [16:42:22] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358394 (owner: 10Marostegui) [16:42:36] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1089 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358394 (owner: 10Marostegui) [16:42:38] (03PS1) 10Reedy: Generate FancyCaptchas in 4 threads [puppet] - 
10https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736) [16:43:00] (03PS2) 10Alexandros Kosiaris: Introduce kubestagetcd100{1,2,3} and neon.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/358344 (https://phabricator.wikimedia.org/T162045) [16:43:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 with less weight (duration: 00m 41s) [16:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:51] (03PS1) 10Papaul: DHCP: Add MAC address for labtestpuppetmaster2001,labtestnet2002 and labtestneutron2002 Bug:T167157 Bug:T167159 Bug:T167160 [puppet] - 10https://gerrit.wikimedia.org/r/358397 [16:43:55] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 433 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:53:26] (03PS2) 10Alexandros Kosiaris: compiler: Split fact collection from shipping/collation [puppet] - 10https://gerrit.wikimedia.org/r/358010 [16:53:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] compiler: Split fact collection from shipping/collation [puppet] - 10https://gerrit.wikimedia.org/r/358010 (owner: 10Alexandros Kosiaris) [16:55:08] (03PS1) 10Andrew Bogott: wmfsink: Clean up proxy records for deleted instances. [puppet] - 10https://gerrit.wikimedia.org/r/358399 (https://phabricator.wikimedia.org/T163765) [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170612T1700). Please do the needful. [17:01:47] (03PS2) 10Andrew Bogott: wmfsink: Clean up proxy records for deleted instances. [puppet] - 10https://gerrit.wikimedia.org/r/358399 (https://phabricator.wikimedia.org/T163765) [17:02:24] (03PS6) 10Dzahn: fix all the "role-role" in system::roles [puppet] - 10https://gerrit.wikimedia.org/r/354172 [17:02:58] (03CR) 10jerkins-bot: [V: 04-1] wmfsink: Clean up proxy records for deleted instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/358399 (https://phabricator.wikimedia.org/T163765) (owner: 10Andrew Bogott) [17:04:59] (03PS3) 10Andrew Bogott: wmfsink: Clean up proxy records for deleted instances. [puppet] - 10https://gerrit.wikimedia.org/r/358399 (https://phabricator.wikimedia.org/T163765) [17:05:11] 10Operations, 10netops: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3341487 (10Papaul) port 5-6 reference on the patch panel, please see below {F8445373} [17:05:16] !log gehel@tin Started deploy [wdqs/wdqs@84557b8]: (no justification provided) [17:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:23] 10Operations, 10netops: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3341509 (10faidon) Found an old email from CyrusOne: {F8445376} [17:07:48] !log gehel@tin Finished deploy [wdqs/wdqs@84557b8]: (no justification provided) (duration: 02m 32s) [17:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:15] SMalyshev: ^ wdqs updated, feel free to test [17:09:35] 10Operations, 10ops-codfw, 10Services (watching): scb2005 eth0 interface gets renamed to eth2 - https://phabricator.wikimedia.org/T167638#3339439 (10akosiaris) That does sound like motherboard issues. a quick look in RAC's logs does not show anything though. 
[17:21:47] !log joal@tin Started deploy [analytics/refinery@08fe129]: Bug correction on regular weekly deploy of refinery (2) [17:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:46] !log running stress + bonnie on elastic2020 to check new hardware - T149006 [17:24:48] !log joal@tin Finished deploy [analytics/refinery@08fe129]: Bug correction on regular weekly deploy of refinery (2) (duration: 03m 00s) [17:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:55] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [17:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:01] (03PS5) 10Pmiazga: Setup the new wgPopupsGateway config variable. NOOP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) [17:35:20] 10Operations, 10DNS, 10Labs, 10Labs-Infrastructure, and 3 others: Set SPF (... -all) for toolserver.org - https://phabricator.wikimedia.org/T131930#3341658 (10Reedy) a:03herron [17:37:02] (03CR) 10Reedy: [C: 031] "Other subdomains to be addressed in different tasks/patches" [dns] - 10https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: 10Herron) [17:37:46] 10Operations, 10ops-codfw, 10procurement: rack/setup/ - https://phabricator.wikimedia.org/T167705#3341661 (10Papaul) [17:38:25] 10Operations, 10ops-codfw, 10procurement: rack/setup spare systems - https://phabricator.wikimedia.org/T167705#3341693 (10Papaul) [17:50:58] (03CR) 10Krinkle: Setup the new wgPopupsGateway config variable. 
NOOP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) (owner: 10Pmiazga) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170612T1800). Please do the needful. [18:00:04] raynor and RoanKattouw: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:45] I'm ready [18:01:52] I'm here [18:02:03] Would prefer if someone else could SWAT because I have meetings [18:02:20] I can SWAT [18:02:28] Awesome thanks [18:03:17] (03PS6) 10Thcipriani: Setup the new wgPopupsGateway config variable. NOOP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) (owner: 10Pmiazga) [18:03:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) (owner: 10Pmiazga) [18:05:01] thcipriani: my config change introduces a new variable which is not used yet [18:06:11] raynor: I gathered, I'll go ahead and sync it out everywhere once it merges, doesn't seem like there will be anything to check on debug hosts. [18:06:28] yeah [18:07:00] today/tomorrow we will merge a new change that is going to use new variable instead of the old one [18:07:19] sounds good. [18:07:20] then I'll create a config change to remove old var so we keep InitialiseSettings clean [18:07:27] akosiaris: pending merge on puppetmaster [18:10:16] thcipriani: I added some patches to the Deployments list [18:10:38] merged [18:12:28] framawiki: I see that, I'll get those out if/when jenkins merges the last batch :\ [18:13:17] (03Merged) 10jenkins-bot: Setup the new wgPopupsGateway config variable.
NOOP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) (owner: 10Pmiazga) [18:13:27] There is only one urgent, it does not matter if you do not have the time [18:13:28] (03CR) 10jenkins-bot: Setup the new wgPopupsGateway config variable. NOOP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358091 (https://phabricator.wikimedia.org/T165018) (owner: 10Pmiazga) [18:13:47] (03PS2) 10Dzahn: system::role: remove leading 'role::' to avoid role-role [puppet] - 10https://gerrit.wikimedia.org/r/357960 [18:14:32] framawiki: I think there is time in this window, these patches are good things to get merged [18:16:15] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:17:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:358091|Setup the new wgPopupsGateway config variable. NOOP]] T165018 (duration: 00m 42s) [18:17:37] runs puppet on acamar and sees why.. 
will fix it [18:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:42] T165018: Page previews can consume new summary-HTML endpoint - https://phabricator.wikimedia.org/T165018 [18:17:46] ^ raynor no-op var addition is live [18:17:47] there is a duplicate declaration of a resource name [18:18:24] (03PS2) 10Thcipriani: Add NS:100 to wgNamespacesToBeSearchedDefault for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358059 (https://phabricator.wikimedia.org/T167511) (owner: 10Framawiki) [18:18:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358059 (https://phabricator.wikimedia.org/T167511) (owner: 10Framawiki) [18:18:41] thcipriani: thx, production is live, there is nothing else I can test right now [18:18:51] thanks for deploying that [18:18:55] yw :) [18:19:40] (03PS1) 10Papaul: Add partman entries for labtestpuppetmaster2001,labtestneutron2002 and labtestnet2002 Bug:T167157 Bug:T167159 Bug:T167160 [puppet] - 10https://gerrit.wikimedia.org/r/358409 [18:20:05] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:20:28] (03Merged) 10jenkins-bot: Add NS:100 to wgNamespacesToBeSearchedDefault for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358059 (https://phabricator.wikimedia.org/T167511) (owner: 10Framawiki) [18:20:37] (03CR) 10jenkins-bot: Add NS:100 to wgNamespacesToBeSearchedDefault for enwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358059 (https://phabricator.wikimedia.org/T167511) (owner: 10Framawiki) [18:21:26] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3341896 (10Amire80) [18:21:54] RoanKattouw: RCFilters: Retain extra url params when comparing url equivalency is live on mwdebug1002, check please [18:23:37] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3341910 (10Amire80) [18:24:25] Reedy: Hi! I don't do this usually, and I'll try not to do this in the future, but just once: Could anybody perhaps do https://phabricator.wikimedia.org/T167714 earlier rather than later? :) [18:24:40] Do you have a timeframe? [18:24:55] It's a language wiki... So the pre-requisites should be less [18:24:58] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:25:02] thcipriani: Works great [18:25:16] RoanKattouw: ok, syncing [18:25:36] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:25:51] Reedy - The nice people behind it hoped to have it working by June 21: https://en.wikipedia.org/wiki/National_Aboriginal_Day [18:25:59] andrewbogott: do you remember why you did a system::role for DNS recursors but with "ensure => absent"? 
[18:27:28] !log thcipriani@tin Synchronized php-1.30.0-wmf.4/resources/src/mediawiki.rcfilters/mw.rcfilters.Controller.js: SWAT: [[gerrit:358407|RCFilters: Retain extra url params when comparing url equivalency]] T167551 (duration: 00m 41s) [18:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:38] T167551: Number of Results and Number of Days selectors no longer function - https://phabricator.wikimedia.org/T167551 [18:27:51] (03PS1) 10Reedy: Add new language "atj" (Atikamekw) [dns] - 10https://gerrit.wikimedia.org/r/358410 (https://phabricator.wikimedia.org/T167714) [18:27:52] ^ RoanKattouw live now [18:28:07] framawiki: Add NS:100 to wgNamespacesToBeSearchedDefault for enwikisource is live on mwdebug1002, check please [18:28:27] thcipriani: it's good [18:28:34] ok, syncing [18:28:53] (03CR) 10Amire80: [C: 031] Add new language "atj" (Atikamekw) [dns] - 10https://gerrit.wikimedia.org/r/358410 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [18:30:14] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:358059|Add NS:100 to wgNamespacesToBeSearchedDefault for enwikisource]] T167511 (duration: 00m 41s) [18:30:19] ^ framawiki live now [18:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:26] T167511: Addition of portal namespace [[ns:100]] to defaultsearch for English Wikisource - https://phabricator.wikimedia.org/T167511 [18:30:45] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:30:49] thcipriani: confirmed [18:30:49] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3341896 (10Reedy) [18:30:52] (03PS1) 10Dzahn: dnsrecursor: remove duplicate and absented system::role [puppet] - 10https://gerrit.wikimedia.org/r/358411 [18:31:13] 10Operations, 10netops: Faulty link between cr2-codfw and cr1-eqdfw - https://phabricator.wikimedia.org/T167261#3341949 (10ayounsi) Ticket 830782 opened with CyrusOne [18:31:18] aharoni: That's the ops TODO's... [18:31:24] So needs someone to do the mediawiki config changes :) [18:31:29] (03PS2) 10Thcipriani: Lift IP throttle for Wikipedia Editathon (June 16th 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357510 (https://phabricator.wikimedia.org/T167201) (owner: 10Framawiki) [18:31:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357510 (https://phabricator.wikimedia.org/T167201) (owner: 10Framawiki) [18:35:45] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:37:34] (03PS2) 10Dzahn: dnsrecursor: remove duplicate and absented system::role [puppet] - 10https://gerrit.wikimedia.org/r/358411 [18:37:36] framawiki: All these throttle rules look good to me. I will sync them out when they all merge (which is a little slow today :( ). 
[18:37:42] (03Merged) 10jenkins-bot: Lift IP throttle for Wikipedia Editathon (June 16th 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357510 (https://phabricator.wikimedia.org/T167201) (owner: 10Framawiki) [18:37:51] (03CR) 10jenkins-bot: Lift IP throttle for Wikipedia Editathon (June 16th 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357510 (https://phabricator.wikimedia.org/T167201) (owner: 10Framawiki) [18:38:34] (03PS2) 10Thcipriani: Lift IP throttle for Editathon (13 June 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358056 (https://phabricator.wikimedia.org/T167517) (owner: 10Framawiki) [18:38:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358056 (https://phabricator.wikimedia.org/T167517) (owner: 10Framawiki) [18:39:06] (03PS1) 10Reedy: labs dnsrecursor: add atjwiki [puppet] - 10https://gerrit.wikimedia.org/r/358412 (https://phabricator.wikimedia.org/T167714) [18:40:25] (03Merged) 10jenkins-bot: Lift IP throttle for Editathon (13 June 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358056 (https://phabricator.wikimedia.org/T167517) (owner: 10Framawiki) [18:40:34] (03CR) 10jenkins-bot: Lift IP throttle for Editathon (13 June 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358056 (https://phabricator.wikimedia.org/T167517) (owner: 10Framawiki) [18:40:50] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:357510|Lift IP throttle for Wikipedia Editathon (June 16th 2017)]] T167201 (duration: 00m 41s) [18:40:58] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6739/" [puppet] - 10https://gerrit.wikimedia.org/r/358411 (owner: 10Dzahn) [18:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:00] T167201: Lift IP rate limit - Editathon (WMCL) - 2017-06-16 - https://phabricator.wikimedia.org/T167201 [18:41:39] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest 
unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3342026 (10ayounsi) Data point, the cp* servers don't send any ICMP packets (the lvs* servers neither), they only receive them from very diverse locations. [18:42:16] (03PS1) 10Pmiazga: Remove unused wgPopupsAPIUseRESTBase config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358415 (https://phabricator.wikimedia.org/T165018) [18:42:24] (03CR) 10jerkins-bot: [V: 04-1] Remove unused wgPopupsAPIUseRESTBase config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358415 (https://phabricator.wikimedia.org/T165018) (owner: 10Pmiazga) [18:42:30] framawiki: looks like https://gerrit.wikimedia.org/r/#/c/357233/ has a merge conflict, could you manually rebase that patch? [18:42:35] (03CR) 10Krinkle: "@DCausse those cirrus failures have been around for at least 6 months." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [18:42:40] (03CR) 10Krinkle: [C: 032] test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [18:42:47] (03CR) 10jerkins-bot: [V: 04-1] test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [18:43:15] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:43:18] (03PS3) 10Krinkle: test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [18:43:25] (03CR) 10Krinkle: [C: 032] test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [18:43:48] (03PS3) 10Dzahn: system::role: remove leading 'role::' to avoid role-role [puppet] - 10https://gerrit.wikimedia.org/r/357960 [18:45:30] (03CR) 10Dzahn: [C: 032] "approved by langcom at https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Atikamekw" [dns] - 
10https://gerrit.wikimedia.org/r/358410 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [18:46:04] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:358056|Lift IP throttle for Editathon (13 June 2017)]] T167517 (duration: 00m 41s) [18:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:12] T167517: Lift account registration on en.wikipedia for 13th June 2017 - https://phabricator.wikimedia.org/T167517 [18:47:05] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:47:45] papaul: so you want "gpt" for labtestneutron and labtestnet but "no gpt" for labtestpuppetmaster. correct? [18:48:26] (03Merged) 10jenkins-bot: test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [18:48:33] (03CR) 10jenkins-bot: test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [18:48:52] (03PS2) 10Dzahn: Add partman entries for labtestpuppetmaster2001,labtestneutron2002 and labtestnet2002 Bug:T167157 Bug:T167159 Bug:T167160 [puppet] - 10https://gerrit.wikimedia.org/r/358409 (owner: 10Papaul) [18:49:43] thcipriani: on https://gerrit.wikimedia.org/r/#/c/358056/2/wmf-config/throttle.php from/to date ranges are bad, no ? [18:50:49] framawiki: looks like it. I didn't catch that :( can you fix please? 
[18:51:20] (03CR) 10Dzahn: [C: 032] Add partman entries for labtestpuppetmaster2001,labtestneutron2002 and labtestnet2002 Bug:T167157 Bug:T167159 Bug:T167160 [puppet] - 10https://gerrit.wikimedia.org/r/358409 (owner: 10Papaul) [18:52:05] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:52:19] (03PS2) 10Thcipriani: Lift IP throttle for Wikipedia workshop (14 June 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357233 (https://phabricator.wikimedia.org/T167011) (owner: 10Framawiki) [18:52:33] ^ framawiki could you check if that rebase is correct? [18:54:31] aharoni: Also, there's steps like https://wikitech.wikimedia.org/wiki/Add_a_wiki#WikimediaMessages that you can do to help out :) [18:54:35] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:55:11] thcipriani: I can't rebase, problem with git, can you do it ? thanks [18:55:34] framawiki: I rebased https://gerrit.wikimedia.org/r/#/c/357233/ manually [18:55:42] does it look correct now that I've rebased? [18:57:24] yes [18:57:36] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357233 (https://phabricator.wikimedia.org/T167011) (owner: 10Framawiki) [18:57:41] cool, thanks [18:57:45] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:58:13] i'll look for date ranges problem [18:58:28] thank you [18:58:34] 10Operations, 10Office-IT, 10netops: Some BGP sessions to the SF Office down - https://phabricator.wikimedia.org/T167281#3342226 (10ayounsi) a:05ayounsi>03bbogaert [18:59:31] (03Merged) 10jenkins-bot: Lift IP throttle for Wikipedia workshop (14 June 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357233 (https://phabricator.wikimedia.org/T167011) (owner: 10Framawiki) [19:02:36] thcipriani: can I just amend the patch even if it's merged ? 
or do you want me to create a new one ? [19:03:03] (03PS1) 10Thcipriani: Fix throttle rule for Scotland University editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358422 [19:03:11] framawiki: ^ I made a patch :) [19:03:45] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:04:34] (03PS2) 10Thcipriani: Fix throttle rule for Scotland University editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358422 [19:04:47] thanks ! your patch is good for me [19:04:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358422 (owner: 10Thcipriani) [19:05:20] framawiki: awesome, sorry for the confusion :( [19:06:02] once that is merged I'll sync that fix + the June 14th throttle rule [19:06:02] It's my fault [19:08:03] (03Merged) 10jenkins-bot: Fix throttle rule for Scotland University editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358422 (owner: 10Thcipriani) [19:10:47] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:357233|Lift IP throttle for Wikipedia workshop (14 June 2017)]] T167011 + [[gerrit:358422|Fix throttle rule for Scotland university editathon]] (duration: 00m 41s) [19:10:54] ^ framawiki all done! [19:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:57] T167011: Lift account registration cap for event (14 June 2017) - https://phabricator.wikimedia.org/T167011 [19:10:58] thank you for the patches! [19:11:25] thanks ! 
[19:12:52] (03CR) 10jenkins-bot: Lift IP throttle for Wikipedia workshop (14 June 2017) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357233 (https://phabricator.wikimedia.org/T167011) (owner: 10Framawiki) [19:12:54] (03CR) 10jenkins-bot: Fix throttle rule for Scotland University editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358422 (owner: 10Thcipriani) [19:32:28] (03CR) 10DCausse: "@Krinkle sorry about that, I only noticed them after these patches when running tests locally. I'll pay more attention to jenkins outputs " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [19:35:31] (03CR) 10Hashar: [C: 031] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/358366 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [19:36:04] (03PS2) 10Dzahn: DHCP: Add MAC address for labtestpuppetmaster2001,labtestnet2002 and labtestneutron2002 Bug:T167157 Bug:T167159 Bug:T167160 [puppet] - 10https://gerrit.wikimedia.org/r/358397 (owner: 10Papaul) [19:36:24] 10Operations, 10Ops-Access-Requests, 10Citoid, 10Services, and 3 others: Give mobrovac production access for citoid - https://phabricator.wikimedia.org/T92389#3342412 (10Jdforrester-WMF) [19:37:09] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3342417 (10RobH) 05Open>03Resolved Ok, fixed and live. 
[19:37:12] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3342419 (10RobH) [19:37:41] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address for labtestpuppetmaster2001,labtestnet2002 and labtestneutron2002 Bug:T167157 Bug:T167159 Bug:T167160 [puppet] - 10https://gerrit.wikimedia.org/r/358397 (owner: 10Papaul) [19:42:25] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:31] 10Operations, 10Traffic, 10netops: High amount of unexpected ICMP dest unreachable toward esams cache clusters - https://phabricator.wikimedia.org/T167691#3341190 (10BBlack) The cp* should at least occasionally be sending normal ICMP responses correlated with their TCP flows, e.g. "Time Exceeded" and such.... [19:42:43] Reedy: WikimediaMessages done (I think) [19:42:59] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestneutron2002 switch port configuration - https://phabricator.wikimedia.org/T167326#3342443 (10RobH) 05Open>03Resolved done and live [19:43:02] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3342445 (10RobH) [19:43:15] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 78883 bytes in 0.174 second response time [19:48:32] (03PS4) 10Dzahn: system::role: remove leading 'role::' to avoid role-role [puppet] - 10https://gerrit.wikimedia.org/r/357960 [19:48:32] mutante: Do you have a minute to maybe help me point in the right direction for a puppet issue? It's about puppet and ERB - https://gerrit.wikimedia.org/r/#/c/357310/. It is merged I see two problems. 1) A variable set explicitly to undef evaluates to true in ERB and prints empty string. 2) A variable not set, but defaults to undef, also evaluates to true in ERB, and prints "undef". Neither of these make sense. 
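Editorial sketch of the two ERB symptoms described above, in plain Ruby rather than Puppet. It assumes (this is not confirmed in the log) that the explicitly passed undef reaches the template as an empty string and the defaulted undef reaches it as the symbol :undef. The template text and the render helper are hypothetical and only stand in for the real errorpage template. In Ruby only nil and false are falsy, so a plain 'if' lets both values through.

```ruby
require 'erb'

# Hypothetical template fragment: a plain `if` guard around a footer value.
# This is an illustration, not the actual errorpage template.
TEMPLATE = ERB.new("<% if footer %>footer=[<%= footer %>]<% else %>no footer<% end %>")

# Render the template with the given footer value bound as a local variable.
def render(footer)
  TEMPLATE.result(binding)
end

puts render(nil)     # nil is falsy, so the else branch is taken
puts render("")      # an empty string is truthy in Ruby and prints as nothing
puts render(:undef)  # a symbol is truthy and prints as the text "undef"
```

If that assumption holds, a guard that also rejects the empty string and :undef (for example, comparing against both explicitly) would be needed in the template instead of a bare 'if'.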
[19:49:08] in dynamicproxy::init, errorpage is first invoked with footer => $error_details, which is undef. In the second call, it is not given, and mediawiki::errorpage footer defaults to undef. [19:49:25] It is then set in a hash and read from the erb template with a plain 'if'. [19:49:46] It seems one becomes empty string, and the other becomes the string "undef". [19:50:30] 10Operations, 10ops-codfw: Rack/setup codfw spare systems - https://phabricator.wikimedia.org/T167705#3342489 (10faidon) [19:52:11] 10Operations, 10hardware-requests: codfw/eqiad: 12x swift backend refresh - https://phabricator.wikimedia.org/T149336#3342509 (10RobH) [19:52:42] 10Operations, 10Discovery, 10Elasticsearch, 10hardware-requests, 10Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#3342514 (10RobH) [19:52:59] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#3342520 (10RobH) [19:53:15] 10Operations, 10ops-eqiad, 10Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3342523 (10RobH) [19:53:46] 10Operations, 10hardware-requests: Analytics AQS cluster expansion - https://phabricator.wikimedia.org/T149920#3342532 (10RobH) [19:54:11] 10Operations, 10hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#3342537 (10RobH) [19:54:46] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#3342542 (10RobH) [19:54:56] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3342545 (10RobH) [19:55:09] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), 10Wikimedia-Incident: Staging / Test environment(s) for RESTBase - 
https://phabricator.wikimedia.org/T136340#2331203 (10RobH) [19:55:24] 10Operations, 10Cassandra, 10Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#3342566 (10RobH) [19:55:31] 10Operations, 10ops-codfw: Broken disk in labstore2001 - https://phabricator.wikimedia.org/T149567#3342570 (10RobH) [19:57:56] (03PS1) 10Krinkle: [WIP] mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170612T2000). [20:00:13] No ORES today. [20:02:05] Krinkle: my first thought about it is 'scoping, whether defines have access to the variables of the calling class.. and the way variables in erb are treated are all things that change in "future parser" (which will be default in puppet 4) and how Alex mentioned that currently variables from all classes will be accessible in all templates but how that will change and that we enable [20:02:11] future-parser option in the puppet compiler. 
so maybe this is kind of related to https://tickets.puppetlabs.com/browse/PUP-7276 [20:02:40] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3342607 (10RobH) [20:02:43] i dont really know why one would be empty string and the other undef [20:02:47] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2745564 (10RobH) [20:04:07] i could imagine that there are different results whether it is run in compiler or "prod"-labs (dynamicproxy) due to that "future compiler" option [20:11:25] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 110.68, 102.48, 86.66 [20:17:25] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 43.28, 70.94, 78.71 [20:20:23] (03CR) 10Dzahn: [C: 032] "after all the existing names have been fixed, this is a safeguard to avoid getting new "role-role" names in the future" [puppet] - 10https://gerrit.wikimedia.org/r/357960 (owner: 10Dzahn) [20:21:50] (03PS2) 10Dzahn: planet: cleanup en_config.erb [puppet] - 10https://gerrit.wikimedia.org/r/358301 (owner: 10Framawiki) [20:24:24] (03CR) 10Dzahn: [C: 032] planet: cleanup en_config.erb [puppet] - 10https://gerrit.wikimedia.org/r/358301 (owner: 10Framawiki) [20:26:15] !log ns2 - authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones to add new Wikipedia language "atj" (needed when editing langlist but not touching templates) (T167714) [20:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:27] T167714: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714 [20:30:50] !log ns0, ns1 - same as before - gen zones, check zones, reload zones, to add "atj.wikipedia.org" (T167714) [20:30:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:47] (03PS2) 10Hashar: kibana: support elasticsearch.url setting [puppet] - 10https://gerrit.wikimedia.org/r/356900 [20:32:20] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3341896 (10Dzahn) added to DNS: ``` ;; ANSWER SECTION: atj.wikipedia.org. 600 IN A 198.35.26.96 ;; ADDITIONAL SECTION: atj.wikipedia.org. 600 IN... [20:32:27] (03CR) 10Hashar: "Changed it so that the ::kibana class accepts $elasticsearch_url which is adds to the configuration elasticsearch.url." [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [20:39:23] (03CR) 10EBernhardson: "when setting up a kibana instance to talk to relforge for testing, i also ended up needing to be able to set the SSL CA. Annoyingly kibana" [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [20:53:14] (03PS8) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [20:55:57] (03CR) 10Krinkle: "Passes: https://puppet-compiler.wmflabs.org/6741/cp4021.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170612T2100). [21:12:34] (03PS1) 10Dzahn: rancid: drop "server" suffix, apply on netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/358483 (https://phabricator.wikimedia.org/T159756) [21:16:14] (03PS2) 10Dzahn: rancid: drop "server" suffix, apply on netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/358483 (https://phabricator.wikimedia.org/T159756) [21:20:22] (03CR) 10Dzahn: [C: 032] "netmon1001: no change except motd. 
netmon1002: gets new rancid classes http://puppet-compiler.wmflabs.org/6742/" [puppet] - 10https://gerrit.wikimedia.org/r/358483 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [21:22:32] (03PS1) 10RobH: setting labtestpuppetmaster2001 production dns [dns] - 10https://gerrit.wikimedia.org/r/358485 [21:23:08] (03CR) 10RobH: [C: 032] setting labtestpuppetmaster2001 production dns [dns] - 10https://gerrit.wikimedia.org/r/358485 (owner: 10RobH) [21:24:15] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:24:45] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair],File[/etc/bacula/ssl] [21:35:42] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3342964 (10Dzahn) @akosiaris @robh Is our goal to shut down netmon1001 after this is done? [21:36:40] ACKNOWLEDGEMENT - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T159756 [21:36:40] ACKNOWLEDGEMENT - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 10 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair],Service[bacula-fd] daniel_zahn https://phabricator.wikimedia.org/T159756 [21:37:45] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:38:15] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [21:38:34] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3342984 (10RobH) [21:38:44] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3319424 (10RobH) [21:38:47] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3342986 (10RobH) 05Open>03Resolved Done! [21:40:13] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3343000 (10RobH) My understanding is netmon1001 will have a new task made for decommission once netmon1002 replaces it. netmon1001 is out of warranty. [21:54:30] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
[22:01:00] PROBLEM - DPKG on netmon1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:01:06] !log netmon1002 - apt-get -t jessie-backports install rancid (upgrade from 2.3.8 to 3.6.2 to match version on netmon1001) - rancid version is not specified in puppet so even though backports gets enabled the older version gets installed and this manual step is needed unless we start specifying the version in the manifest (T159756) [22:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:16] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [22:03:00] RECOVERY - DPKG on netmon1002 is OK: All packages OK [22:06:42] mutante: sorry for the slow response… your patch looks just fine to me. [22:06:54] andrewbogott: :) ok great! [22:12:11] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [22:13:10] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [22:38:23] Is there a process for getting my dotfiles approved and merged? -- https://gerrit.wikimedia.org/r/#/c/353937/ [22:40:00] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:46:16] bd808: not really, usually those are mostly self-merged AFAIK. My only concern is that you're adding *a lot* of them... [22:46:34] right now we have 150 files and 216 between files and directories for *all* users [22:46:42] and you're adding ~300 of them [22:46:55] volans: you should have seen it before I trimmed things down ;) [22:47:29] I could collapse the .bash stuff in to like 3 files but that would fork radically from my non-wmf setup [22:47:35] * volans starts to worry now ;) [22:48:16] i've had my personal dotfiles under some form of VCS for 12 years. 
things accumulate [22:49:02] me too, but I've never assumed a production environment will match my local one, doesn't work that way in any of the companies I've worked before ;) [22:49:41] it's always worked that way for me before, but this is the first place I wasn't a tech founder ;) [22:50:12] if there is a guideline to follow I'd love to comply [22:50:36] or I can just keep doing the ssh and unpack tarball thing I've done for 4 years now [22:51:07] bd808: btw bash-completion is already installed, not sure if you still need all that stuff [22:51:13] the files themselves LGTM [22:51:22] good question for the guideline... not sure we have one tbh [22:51:47] I wonder why personal home skel should be on operations/puppet, though [22:52:25] mostly because it was easy/possible I think Platonides. not because it was a really sane or scalable idea [22:53:00] I think o.ir has the largest collection of stuff currently [22:53:07] *o.ri [22:53:10] alex too [22:55:14] if you don't consider both of them we have ~65 files and dirs amongst 20 home dirs :D [22:57:34] bd808: I'll be going to bed very soon, what I can do is ping people tomorrow and try to have a definitive answer if that might help [22:57:47] volans: that would be swell. thanks [22:57:49] p.s. you could have asked in today's meeting :-P [22:58:01] I would be tempted to move it to a submodule [22:58:11] just to keep the repo clean [22:58:23] even though it wouldn't actually change anything :/ [22:58:30] exactly :) [22:58:50] the ship has sailed on keeping that repo clean [22:59:00] :( [22:59:30] it's a big pile of operational config. that's always going to be messy [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170612T2300). Please do the needful. 
[23:09:00] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [23:12:25] Nothing to deploy. [23:20:30] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. [23:22:09] !log netmon1002 - keyholder arm - loaded rancid deploy key (uses separate passphrase from deployment key) [23:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:13] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3343223 (10Krinkle) @Gilles What that it,... [23:50:21] Hi ops folks... I'm deploying an update to CentralNotice to the cluster in just a few. Haven't done a deploy in a while, would anyone like to screen share to help me check I'm doing it right? [23:50:45] Here is the core change: [23:50:48] https://gerrit.wikimedia.org/r/#/c/358495/ [23:51:22] We no longer use tin? https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host [23:51:38] I can help out with deployment stuff [23:52:06] XioNoX: you should also update the keyholder wikitech page with the reference to the new key ;) sorry forgot to explicitly mention it the other day [23:52:15] AndyRussG: deployment.[datacenter].wmnet is just a cname for whatever the current deployment host is (it is currently tin) it's for user convenience [23:53:03] thcipriani: cool! thx! [23:53:48] volans: i added "rancid" row to the table on wikitech but also referenced https://phabricator.wikimedia.org/T154943 which was to use the same passphrase for all [23:55:23] thcipriani: do you want to do a hangout/screen share to help me avoid doing something bad? Or would you like to actually deploy? 
[23:55:30] It needs a full scap [23:55:39] AndyRussG: we can jump in a hangout [23:55:44] * thcipriani makes one [23:55:55] thcipriani: cool thx! [23:58:13] AndyRussG: sent you a link (I think :))