[00:02:08] (03CR) 1020after4: [C: 031] "This works for https because the tls layer is already terminated before the vcl, correct?" [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [00:35:35] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [01:02:55] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:55] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:55] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499216574 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9045127 keys, up 2 minutes 52 seconds - replication_delay is 1499216574 [01:02:56] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [01:03:55] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9130676 keys, up 3 minutes 46 seconds - replication_delay is 0 [01:03:55] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9136106 keys, up 3 minutes 49 seconds - replication_delay is 0 [01:03:55] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9038850 keys, up 3 minutes 52 seconds - replication_delay is 0 [01:04:06] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9132858 keys, up 3 minutes 52 seconds - replication_delay is 0 [02:27:34] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) 
(duration: 10m 23s) [02:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.94 seconds [03:33:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.24 seconds [03:36:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.30 seconds [03:42:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 192.43 seconds [05:08:04] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3405812 (10Marostegui) This happened again ``` root@db1046:~# megacli -LDInfo -Lall -aALL | grep Policy Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write... [05:08:39] !log Force a relearn on db1046's BBU - T166141 [05:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:52] T166141: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141 [05:13:35] !log Deploy alter table on s7 master - db1062 - T168661 [05:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:47] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [05:15:35] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [05:16:40] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3405816 (10Marostegui) And it recovered for now: ``` ˜/icinga-wm 7:15> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy root@d... 
[05:18:51] !log Deploy alter table on s3 master - db1075 - T168661
[05:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:01] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[05:53:25] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=657.00 Read Requests/Sec=1109.50 Write Requests/Sec=2.80 KBytes Read/Sec=50777.60 KBytes_Written/Sec=105.20
[06:00:35] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=143.40 Read Requests/Sec=35.90 Write Requests/Sec=40.30 KBytes Read/Sec=265.60 KBytes_Written/Sec=6137.60
[06:02:32] (03CR) 10MZMcBride: "This was reverted in I791182190e4717e87f7b983a362d076405d03898 () in September 2015." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171998 (owner: 10MZMcBride)
[06:03:24] (03CR) 10MZMcBride: "This reverted Id53e72fe8a5cbb15daeec43250303dca9a4c903f ()." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238357 (https://phabricator.wikimedia.org/T86460) (owner: 10CSteipp)
[06:32:37] 10Operations, 10Commons, 10Thumbor, 10Traffic: PNG thumbnail gives a 429 error - https://phabricator.wikimedia.org/T169678#3405854 (10Gilles) The rate limits in Thumbor are as close to the old Mediawiki ones as they can be, but there are slight differences. Since this request returned nginx as the server,...
[06:45:35] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [06:47:30] marostegui: not so lucky this time :( [06:49:27] sigh [07:04:13] (03CR) 10Ayounsi: [C: 031] Restrict http access to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/363213 (owner: 10Muehlenhoff) [07:04:45] :( [07:04:55] Will try again [07:07:02] Ah [07:07:05] It is learning now [07:07:07] So I will not touch it [07:08:11] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3405971 (10Marostegui) It alerted again, but this time looks like the BBU is actually doing the learning: ``` BatteryType: BBU Voltage: 3754 mV Current: -674 mA Tempe... [07:12:41] 10Operations, 10netops: zombie rstp configuration - https://phabricator.wikimedia.org/T169637#3405974 (10ayounsi) 05Open>03Resolved Cleaned up. [07:13:48] !log rebooting notebook* hosts [07:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:44] (03CR) 10Marostegui: [C: 031] "Just some minor things but looks good to me." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363195 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [07:19:13] (03CR) 10Marostegui: [C: 031] "I have checked the role and looks good - feel free to test on db1102, s2 is running and replicating." 
[puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [07:24:11] 10Operations, 10ops-codfw, 10DBA: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3406026 (10Marostegui) [07:24:22] 10Operations, 10ops-codfw, 10DBA: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3406040 (10Marostegui) p:05Triage>03Normal [07:25:05] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 296908517 for key PRIMARY on query. Default database: frwiki. [Query snipped] [07:25:14] ^ checking [07:27:03] !log rebooting restbase-dev* for kernel update [07:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:31] (03CR) 10Giuseppe Lavagetto: [C: 032] Re-add support for defining threads from CI/cli [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363214 (owner: 10Giuseppe Lavagetto) [07:27:36] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3406046 (10Volans) You could try to force unmount the NFS partition from the hosts that mount it, it should abort the outstanding I/O they are waiting for... [07:28:11] (03Merged) 10jenkins-bot: Re-add support for defining threads from CI/cli [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363214 (owner: 10Giuseppe Lavagetto) [07:30:25] (03PS2) 10Giuseppe Lavagetto: Move HostWorker to a dedicated class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363215 [07:38:49] (03CR) 10Lydia Pintscher: "Decision: We'll remove the redirect." 
[puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [07:49:33] (03PS2) 10Alexandros Kosiaris: Bump rspec-puppet to 2.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/363194 [07:49:35] (03PS3) 10Alexandros Kosiaris: monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [07:50:46] (03CR) 10jerkins-bot: [V: 04-1] monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [07:51:55] PROBLEM - Check systemd state on ms1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:56:00] (03PS4) 10Alexandros Kosiaris: monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [07:57:01] (03CR) 10jerkins-bot: [V: 04-1] monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [08:02:02] (03PS5) 10Alexandros Kosiaris: monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [08:02:24] (03CR) 10Alexandros Kosiaris: [C: 032] "Per hashar's comment bumping to 2.5.0 instead. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/363194 (owner: 10Alexandros Kosiaris) [08:02:26] (03CR) 10Jcrespo: [C: 04-1] "This requires first all pt-heartbeat executions to be stopped and restarted by puppet again before deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [08:03:13] (03CR) 10jenkins-bot: Bump rspec-puppet to 2.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/363194 (owner: 10Alexandros Kosiaris) [08:03:55] RECOVERY - Check systemd state on ms1001 is OK: OK - running: The system is fully operational [08:04:07] (03CR) 10Giuseppe Lavagetto: [C: 032] Move HostWorker to a dedicated class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363215 (owner: 10Giuseppe Lavagetto) [08:04:59] 10Operations, 10ops-eqiad: restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3406100 (10MoritzMuehlenhoff) [08:05:22] ACKNOWLEDGEMENT - Host restbase-dev1003 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T169696 [08:13:26] (03PS1) 10Gehel: maps - tiles are not stored in postgresql yet [puppet] - 10https://gerrit.wikimedia.org/r/363293 [08:14:17] PROBLEM - Check systemd state on notebook1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:14:37] (03CR) 10Gehel: [C: 032] maps - tiles are not stored in postgresql yet [puppet] - 10https://gerrit.wikimedia.org/r/363293 (owner: 10Gehel) [08:18:34] (03PS2) 10Filippo Giunchedi: nutcracker: validate new config file [puppet] - 10https://gerrit.wikimedia.org/r/361039 (https://phabricator.wikimedia.org/T168705) [08:19:47] (03CR) 10Filippo Giunchedi: [C: 032] nutcracker: validate new config file [puppet] - 10https://gerrit.wikimedia.org/r/361039 (https://phabricator.wikimedia.org/T168705) (owner: 10Filippo Giunchedi) [08:25:06] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [08:26:01] checking --^ [08:26:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:26:35] I thought it was related to my nutcracker change but doesn't seem so, change is a noop [08:27:35] PROBLEM - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.117 and port 9042: Connection refused [08:27:36] PROBLEM - salt-minion processes on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:27:42] godog: we have been seeing a similar issue during the past days, big spike for ulsfo and codfw. 
From x-cache I can see a lot of codfw ints in varnish [08:27:45] PROBLEM - cassandra-b CQL 10.64.48.118:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.118 and port 9042: Connection refused [08:27:45] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:27:45] PROBLEM - Check size of conntrack table on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:27:55] PROBLEM - DPKG on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:27:55] PROBLEM - configured eth on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:28:05] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:28:05] PROBLEM - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused [08:28:05] PROBLEM - dhclient process on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:28:05] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:28:06] PROBLEM - cassandra-b service on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:28:15] PROBLEM - Disk space on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:28:25] PROBLEM - Check systemd state on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:28:25] PROBLEM - cassandra-b SSL 10.64.48.118:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:28:25] PROBLEM - cassandra-a SSL 10.64.48.117:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:28:25] PROBLEM - SSH on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 22: Connection refused [08:28:25] PROBLEM - cassandra-a service on restbase-dev1003 is CRITICAL: Return code of 255 
is out of bounds [08:28:25] PROBLEM - MD RAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:29:03] 10Operations, 10Jupyter-Hub: jupyterhub.service in failed state on notebook1001 due to removed user - https://phabricator.wikimedia.org/T169698#3406154 (10MoritzMuehlenhoff) [08:29:05] elukey: ah yeah indeed zooming in looks like int from ulsfo [08:29:19] 10Operations, 10Jupyter-Hub: jupyterhub.service in failed state on notebook1001 due to removed user - https://phabricator.wikimedia.org/T169698#3406167 (10MoritzMuehlenhoff) [08:29:59] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3406172 (10Volans) @fgiunchedi FYI Doing some cleanup with @ArielGlenn we discovered that most of the load was generated by the prometheus-node-exporter:... [08:30:05] codfw: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text [08:30:24] hola ema :) [08:30:27] o/ [08:30:51] seems Icinga lost an acknowledged state again? I had just acked restbase-dev1003 [08:31:19] (03PS1) 10Jcrespo: tendril: Make tendril active on a single datacenter for now [puppet] - 10https://gerrit.wikimedia.org/r/363294 (https://phabricator.wikimedia.org/T169540) [08:32:18] 10Operations, 10Operations-Software-Development: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T169619#3406179 (10Volans) p:05Triage>03Normal a:03Volans False positive, I'll update the list of patterns to skip. [08:33:05] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:33:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:33:58] jouncebot: refresh [08:34:01] I refreshed my knowledge about deployments. 
[08:34:02] jouncebot: next [08:34:02] In 4 hour(s) and 25 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1300) [08:34:15] PROBLEM - IPMI Temperature on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:35:05] PROBLEM - MegaRAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:35:55] (03PS6) 10Alexandros Kosiaris: monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [08:35:57] (03PS1) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 [08:37:03] (03CR) 10jerkins-bot: [V: 04-1] monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 (owner: 10Alexandros Kosiaris) [08:41:14] (03PS2) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 [08:41:23] (03CR) 10Ema: [C: 031] Build-Depend on libssl11-dev | libssl-dev for jessie and stretch compat [software/nginx] - 10https://gerrit.wikimedia.org/r/363199 (owner: 10Filippo Giunchedi) [08:44:45] !log puppet disabled and processes accessing dataset1001 exported filesystem shot, on: stat1002,3, snapshot1001,5,6,7, while investigation continues [08:44:47] 10Operations, 10Patch-For-Review: nutcracker test config in puppet doesn't work - https://phabricator.wikimedia.org/T168705#3406215 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Now fixed [08:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:15] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [08:47:43] ACKNOWLEDGEMENT - Check size of conntrack table on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:43] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1003 is 
CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:43] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:43] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:43] ACKNOWLEDGEMENT - DPKG on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:43] ACKNOWLEDGEMENT - Disk space on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:43] ACKNOWLEDGEMENT - IPMI Temperature on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:44] ACKNOWLEDGEMENT - MD RAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:44] ACKNOWLEDGEMENT - MegaRAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:47:45] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused Muehlenhoff T169696 [08:47:45] ACKNOWLEDGEMENT - SSH on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 22: Connection refused Muehlenhoff T169696 [08:47:46] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.117 and port 9042: Connection refused Muehlenhoff T169696 [08:47:47] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.117:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff T169696 [08:47:47] ACKNOWLEDGEMENT - cassandra-a service on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff T169696 [08:48:43] (03PS2) 10Filippo Giunchedi: Build-Depend on libssl11-dev | libssl-dev for jessie and stretch compat 
[software/nginx] - 10https://gerrit.wikimedia.org/r/363199 [08:49:17] (03CR) 10Filippo Giunchedi: [C: 032] "Updated changelog to jessie-wikimedia" [software/nginx] - 10https://gerrit.wikimedia.org/r/363199 (owner: 10Filippo Giunchedi) [08:50:15] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:51:48] (03PS1) 10Ayounsi: Add diffscan module. [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) [08:52:33] (03CR) 10jerkins-bot: [V: 04-1] Add diffscan module. [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [08:57:18] (03CR) 10Hashar: [C: 031] "I was not sure what "closed" means, it seems it is mostly about restricting edits to a subset of privileged users. For all the rest, the w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361686 (https://phabricator.wikimedia.org/T168764) (owner: 10Urbanecm) [09:02:16] (03CR) 10Alexandros Kosiaris: [C: 031] "This only looks fine to me this time around." [puppet] - 10https://gerrit.wikimedia.org/r/363295 (owner: 10Alexandros Kosiaris) [09:03:41] (03PS2) 10Ayounsi: Add diffscan module. [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) [09:04:33] (03CR) 10Muehlenhoff: Add diffscan module. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [09:05:35] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [09:06:27] (03CR) 10Zhuyifei1999: [C: 031] "Hopefully it works." 
[puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [09:08:53] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3406297 (10Marostegui) `˜/icinga-wm 11:05> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy` [09:10:21] (03PS2) 10Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363155 (https://phabricator.wikimedia.org/T153743) [09:12:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363155 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:14:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363155 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:15:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363155 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:17:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 - T153743 (duration: 02m 50s) [09:17:34] (03PS1) 10Volans: Raid handler: use regex for skip strings [puppet] - 10https://gerrit.wikimedia.org/r/363298 (https://phabricator.wikimedia.org/T169619) [09:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:41] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:18:39] (03PS1) 10Marostegui: db1085.yaml: Add ROW as binlog format [puppet] - 10https://gerrit.wikimedia.org/r/363299 (https://phabricator.wikimedia.org/T153743) [09:20:04] (03PS1) 10Alexandros Kosiaris: Introduce diadem, dysprosium [dns] - 10https://gerrit.wikimedia.org/r/363300 (https://phabricator.wikimedia.org/T169566) [09:21:27] !log upload 
nginx_1.11.10-1+wmf2 to jessie-wikimedia and nginx_1.11.10-1+wmf2~stretch1 to stretch-wikimedia [09:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:57] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce diadem, dysprosium [dns] - 10https://gerrit.wikimedia.org/r/363300 (https://phabricator.wikimedia.org/T169566) (owner: 10Alexandros Kosiaris) [09:25:07] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/6925/" [puppet] - 10https://gerrit.wikimedia.org/r/363299 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:27:01] !log Stop MySQL on db1085 for maintenance - T153743 [09:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:11] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:29:34] (03CR) 10Marostegui: [C: 032] db1085.yaml: Add ROW as binlog format [puppet] - 10https://gerrit.wikimedia.org/r/363299 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:29:51] (03PS1) 10Gehel: postgresql - reset user password also when NULL [puppet] - 10https://gerrit.wikimedia.org/r/363306 [09:30:54] (03CR) 10Alexandros Kosiaris: [C: 031] "PCC output at https://puppet-compiler.wmflabs.org/6926/" [puppet] - 10https://gerrit.wikimedia.org/r/363295 (owner: 10Alexandros Kosiaris) [09:33:53] (03PS4) 10WMDE-leszek: mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [09:34:47] (03CR) 10WMDE-leszek: [C: 031] "Restored the initial version of the patch removing the alias, per Lydia's decision." 
[puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [09:34:55] (03PS5) 10WMDE-leszek: mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [09:35:00] (03PS3) 10Ayounsi: Add diffscan module. [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) [09:35:15] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:33] backups are running, so that is probably ^ I will silence it [09:46:44] (03CR) 10D3r1ck01: "@Aude, any comments about this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [09:48:05] !log move 'instances' graphite hierarchy out of the way, do not delete yet - T143405 [09:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:15] T143405: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405 [09:49:05] (03PS1) 10Elukey: role::analytics_cluster::hadoop::master: add icinga check for HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/363307 (https://phabricator.wikimedia.org/T163909) [09:50:13] (03PS2) 10Elukey: role::analytics_cluster::hadoop::master: add icinga check for HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/363307 (https://phabricator.wikimedia.org/T163909) [09:52:02] 10Operations, 10Cloud-Services, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3406402 (10fgiunchedi) @chasemp @bd808 no problem! thanks for working on it :D In terms of rethinking, I don't know exactly what was... 
[09:53:57] (03PS1) 10Marostegui: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363308 (https://phabricator.wikimedia.org/T168661) [09:54:55] (03CR) 10Muehlenhoff: [C: 031] Add diffscan module. [puppet] - 10https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) (owner: 10Ayounsi) [09:56:14] (03PS2) 10Ema: Block WP Zero users from accessing Phabricator uploads [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [09:56:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363308 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [09:57:16] !log Deploy alter table on s1 eqiad hosts - T168661 [09:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:26] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [09:57:41] (03PS1) 10Ayounsi: Add runbook link and remove
from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 [09:57:52] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363308 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [09:59:46] 10Operations, 10Wikibase-DataModel, 10Wikidata, 10Patch-For-Review, 10Wikidata-Sprint: Remove left-over alias for wikidata.org/ontology (doesn't work) - https://phabricator.wikimedia.org/T169023#3406447 (10thiemowmde) Even if, I wonder how this would help because the canonical URI is http://wikiba.se/ont... [10:01:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 - T168661 (duration: 02m 50s) [10:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:09] go volans [10:06:13] lol, no [10:06:19] what? :D [10:06:33] that was supposed to be /go volans [10:06:42] rotfl [10:07:14] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363311 [10:09:52] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363311 (owner: 10Marostegui) [10:11:12] !log rebooting restbase1007 for kernel update [10:11:13] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363311 (owner: 10Marostegui) [10:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:27] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363308 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:12:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363311 (owner: 10Marostegui) [10:13:48] (03PS1) 10Alexandros Kosiaris: Introduce diadem, dysprosium [puppet] - 10https://gerrit.wikimedia.org/r/363313 
(https://phabricator.wikimedia.org/T169566) [10:14:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T168661 (duration: 02m 50s) [10:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:56] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [10:16:03] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Complete stretch reimage for ms-fe / ms-be fleet - https://phabricator.wikimedia.org/T169601#3406518 (10fgiunchedi) [10:16:04] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: tlsproxy fail on ms-fe2005 with stretch - https://phabricator.wikimedia.org/T169612#3406515 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Reimaged successfully ms-fe2006 just now [10:16:51] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce diadem, dysprosium [puppet] - 10https://gerrit.wikimedia.org/r/363313 (https://phabricator.wikimedia.org/T169566) (owner: 10Alexandros Kosiaris) [10:18:36] (03CR) 10Aleksey Bekh-Ivanov (WMDE): [C: 031] Set Wikibase readFullEntityIdColumn setting to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [10:19:05] (03CR) 10Alexandros Kosiaris: [C: 031] postgresql - reset user password also when NULL [puppet] - 10https://gerrit.wikimedia.org/r/363306 (owner: 10Gehel) [10:19:58] (03CR) 10Filippo Giunchedi: [C: 031] Raid handler: use regex for skip strings [puppet] - 10https://gerrit.wikimedia.org/r/363298 (https://phabricator.wikimedia.org/T169619) (owner: 10Volans) [10:23:37] (03PS1) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363315 (https://phabricator.wikimedia.org/T168661) [10:24:26] (03PS2) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363315 (https://phabricator.wikimedia.org/T168661) [10:24:28] !log rebooted dataset1001 to unstick nfsd and pick up new kernel, 
re-enabled puppet [10:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363315 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:26:51] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363315 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:27:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363315 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [10:29:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T168661 (duration: 02m 51s) [10:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:11] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [10:32:04] !log rebooting snapshot hosts to clean up hung nfs client processes [10:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:40] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/363306 (owner: 10Gehel) [10:33:28] (03PS2) 10Volans: Raid handler: use regex for skip strings [puppet] - 10https://gerrit.wikimedia.org/r/363298 (https://phabricator.wikimedia.org/T169619) [10:34:20] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Update my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/363180 (owner: 10Aude) [10:34:53] (03CR) 10Volans: [C: 032] Raid handler: use regex for skip strings [puppet] - 10https://gerrit.wikimedia.org/r/363298 (https://phabricator.wikimedia.org/T169619) (owner: 10Volans) [10:34:59] (03CR) 10Thiemo Mättig (WMDE): [C: 031] mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) 
[10:37:25] !log rebooting restbase1008 for kernel update [10:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:53] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363318 [10:38:49] 10Operations, 10Operations-Software-Development: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T169619#3403963 (10Volans) 05Open>03Resolved [10:39:21] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363318 (owner: 10Marostegui) [10:40:21] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363318 (owner: 10Marostegui) [10:40:33] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363318 (owner: 10Marostegui) [10:40:35] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:14] !log rebooting wtp1001 for kernel update [10:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:56] !log Run redact_sanitarium on s6 databases db1102 - T153743 [10:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:06] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [10:45:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T168661 (duration: 02m 59s) [10:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:06] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [10:47:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:48:15] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is 
CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:53:07] looks like a similar problem of failed fetches, ema ^ fyi [10:54:56] (03PS1) 10Ema: VCL (zero): X-Forwarded-By2 not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/363322 [10:56:53] !log restarting Jenkins for plugin upgrades [10:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:15] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:57:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:59:45] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [11:06:55] (03PS9) 10Filippo Giunchedi: base: export puppet agent stats to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354457 [11:08:52] (03CR) 10Filippo Giunchedi: [C: 032] base: export puppet agent stats to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354457 (owner: 10Filippo Giunchedi) [11:14:47] !log rebooting restbase1009 for kernel update [11:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:30] !log rebooting wtp2* servers for kernel update [11:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:15] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [11:24:15] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:27:25] (03CR) 10Hashar: monitoring::host provide basic Rspec (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [11:45:45] (03PS1) 10Jcrespo: Update to mariadb 10.1.25, support multi-instance, move unit path [software] - 10https://gerrit.wikimedia.org/r/363327 (https://phabricator.wikimedia.org/T169514) 
[11:46:40] (03PS8) 10Filippo Giunchedi: role: set external url for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/354975 [11:46:42] (03PS1) 10Filippo Giunchedi: role: adjust beta prometheus alerts [puppet] - 10https://gerrit.wikimedia.org/r/363328 [11:47:31] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:48:21] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [11:49:48] (03CR) 10Alexandros Kosiaris: monitoring::host provide basic Rspec (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [11:51:11] (03PS7) 10Alexandros Kosiaris: monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [11:51:15] (03PS3) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 [11:58:48] (03CR) 10Marostegui: [C: 031] Update to mariadb 10.1.25, support multi-instance, move unit path [software] - 10https://gerrit.wikimedia.org/r/363327 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [12:00:24] (03PS2) 10Gehel: postgresql - reset user password also when NULL [puppet] - 10https://gerrit.wikimedia.org/r/363306 [12:01:00] (03PS3) 10Gehel: postgresql - reset user password also when NULL [puppet] - 10https://gerrit.wikimedia.org/r/363306 [12:02:46] (03CR) 10Gehel: [C: 032] postgresql - reset user password also when NULL [puppet] - 10https://gerrit.wikimedia.org/r/363306 (owner: 10Gehel) [12:06:34] (03PS5) 10Alexandros Kosiaris: ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) [12:10:30] (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363332 (https://phabricator.wikimedia.org/T168661) [12:11:05] !log puppet is currently disabled again on snapshots 1,5,6,7 and on dataset1001; we saw the same 
nfs issue shortly after reboot, with no dump processes going, as snapshots 5,6,7 had not remounted the filesystem [12:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363332 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:13:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363332 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:13:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363332 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:15:21] (03PS1) 10Alexandros Kosiaris: Remove unused ores_password [labs/private] - 10https://gerrit.wikimedia.org/r/363333 [12:16:04] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove unused ores_password [labs/private] - 10https://gerrit.wikimedia.org/r/363333 (owner: 10Alexandros Kosiaris) [12:16:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T168661 (duration: 02m 51s) [12:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:38] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:21:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363334 [12:22:49] jouncebot: next [12:22:49] In 0 hour(s) and 37 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1300) [12:23:08] sigh, got some patches to sync and won't be able to make that window [12:25:10] (03PS1) 10Marostegui: db-eqiad.php: Repool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363335 (https://phabricator.wikimedia.org/T153743) [12:25:17] (03CR) 10Marostegui: [C: 032] 
Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363334 (owner: 10Marostegui) [12:26:20] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363334 (owner: 10Marostegui) [12:26:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363334 (owner: 10Marostegui) [12:26:47] (03PS2) 10Marostegui: db-eqiad.php: Repool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363335 (https://phabricator.wikimedia.org/T153743) [12:28:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363335 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:29:02] (03PS3) 10Ema: Block WP Zero users from accessing Phabricator uploads [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [12:29:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 - T168661 (duration: 02m 50s) [12:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:37] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:29:59] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363335 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:30:11] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363335 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:32:35] (03PS1) 10Alexandros Kosiaris: Add profile::ores::web::redis_password: [labs/private] - 10https://gerrit.wikimedia.org/r/363336 [12:32:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1085 - T153743 (duration: 02m 49s) [12:33:01] (03PS6) 10Alexandros Kosiaris: ores: Add twemproxy 
support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) [12:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:08] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [12:33:10] !log Stop all replication threads on db1095 for maintenance - T153743 [12:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add profile::ores::web::redis_password: [labs/private] - 10https://gerrit.wikimedia.org/r/363336 (owner: 10Alexandros Kosiaris) [12:36:45] !log Move labsdb1010 main general replication thread to a named replication thread called db1095 - T153743 [12:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:12] 10Operations: Jessie imaging installs nfs-common needlessly - https://phabricator.wikimedia.org/T107412#3407157 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff This got fixed in T106477 already, closing [12:39:12] (03PS7) 10Alexandros Kosiaris: ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) [12:39:28] (03PS1) 10Hashar: (WIP) packages for R [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) [12:42:12] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC finally fully happy at https://puppet-compiler.wmflabs.org/6934/scb1001.eqiad.wmnet/. 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) (owner: 10Alexandros Kosiaris) [12:42:19] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) (owner: 10Alexandros Kosiaris) [12:43:26] (03CR) 10Ema: [C: 032] Block WP Zero users from accessing Phabricator uploads [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [12:43:34] (03PS4) 10Ema: Block WP Zero users from accessing Phabricator uploads [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [12:43:38] (03CR) 10Ema: [V: 032 C: 032] Block WP Zero users from accessing Phabricator uploads [puppet] - 10https://gerrit.wikimedia.org/r/363264 (https://phabricator.wikimedia.org/T168142) (owner: 10MaxSem) [12:45:42] (03CR) 10Faidon Liambotis: [C: 04-1] Add runbook link and remove
from Nagios interfaces check messages. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363309 (owner: 10Ayounsi) [12:48:05] (03PS1) 10Alexandros Kosiaris: scb: Specify correct IP addresses for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/363338 [12:48:40] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [12:49:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] scb: Specify correct IP addresses for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/363338 (owner: 10Alexandros Kosiaris) [12:49:18] !log Force BBU relearn on db1016 - T166344 [12:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:28] T166344: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344 [12:50:14] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3407193 (10Marostegui) ``` ˜/icinga-wm 14:48> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough BBU status for... [12:53:30] (03PS1) 10Marostegui: s6.hosts: Add db1102 [software] - 10https://gerrit.wikimedia.org/r/363339 (https://phabricator.wikimedia.org/T153743) [12:54:53] (03PS1) 10Alexandros Kosiaris: ores: Use nutcracker for redis [puppet] - 10https://gerrit.wikimedia.org/r/363340 (https://phabricator.wikimedia.org/T122676) [12:56:14] (03CR) 10Marostegui: [C: 032] s6.hosts: Add db1102 [software] - 10https://gerrit.wikimedia.org/r/363339 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:56:31] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3407214 (10ArielGlenn) Time for a summary of what's been going on. With help from volans, akosiaris, moritzm, tried various things like unmounting the fi... 
[12:57:00] (03Merged) 10jenkins-bot: s6.hosts: Add db1102 [software] - 10https://gerrit.wikimedia.org/r/363339 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:57:08] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3407216 (10ArielGlenn) Current kernel on dataset1001 is now 4.9.25-1~bpo8+3 [12:58:40] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [12:59:42] 10Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3407221 (10Marostegui) ``` ˜/icinga-wm 14:58> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ``` [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1300). [13:00:05] Urbanecm and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:16] I'm here [13:00:18] I'm around [13:00:52] o/ [13:01:33] hashar: want to do the swat, or should I? [13:01:39] * zeljkof is looking at patches [13:02:53] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3407225 (10ArielGlenn) Forgot to say, looking at the atop logs for around the time of the second incident shows nothing exceptional either. [13:04:50] zeljkof: Waiting for something? [13:05:13] Urbanecm: no, looking at patches, looks like hashar is not around [13:06:08] zeljkof: Okay. [13:06:11] Urbanecm, Amir1: can you test your patches at mwdebug1002? (they are not yet there, just asking) [13:06:26] For sure. 
It is the same like permissions settings. [13:06:29] zeljkof: the wikidata one is not testable at all [13:06:35] the ckb wiki, sure [13:06:59] great [13:07:04] will ping you when ready [13:07:08] Urbanecm: I will start with 361686 [13:07:13] zeljkof: ack [13:08:33] (03PS2) 10Ema: VCL: zero cleanups [puppet] - 10https://gerrit.wikimedia.org/r/363322 [13:08:41] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361686 (https://phabricator.wikimedia.org/T168764) (owner: 10Urbanecm) [13:09:11] !log rebooting restbase1010 for kernel update [13:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:47] (03Merged) 10jenkins-bot: Reopen nlwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361686 (https://phabricator.wikimedia.org/T168764) (owner: 10Urbanecm) [13:09:56] (03CR) 10jenkins-bot: Reopen nlwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361686 (https://phabricator.wikimedia.org/T168764) (owner: 10Urbanecm) [13:13:10] Urbanecm: 361686 is at mwdebug1002 [13:13:57] Testing [13:14:37] zeljkof: working, please deploy [13:14:57] Urbanecm: deploying... [13:15:00] thx [13:15:20] PROBLEM - puppet last run on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:16:26] !log reboot conf2001 for kernel updates [13:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:50] PROBLEM - Host dataset1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:12] scap's sync-apaches step is at 99% for a while now... 
[13:17:28] sync-apaches: 99% (ok: 280; fail: 0; left: 1) [13:18:06] !log zfilipin@tin Synchronized dblists/closed.dblist: SWAT: [[gerrit:361686|Reopen nlwikinews (T168764)]] (duration: 02m 50s) [13:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:18] T168764: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764 [13:18:25] !log power cycled dataset1001, crashed, unresponsive on mgmt console [13:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:58] connect to host mw1196.eqiad.wmnet port 22: Connection timed out [13:19:41] Urbanecm: please test in production while I look at what's wrong with mw1196 [13:20:28] zeljkof: host is down with hw error: https://phabricator.wikimedia.org/T169360#3395989 [13:20:33] I'll mark it as inactive [13:20:39] moritzm: thanks! [13:20:50] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [13:21:08] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw1196.eqiad.wmnet [13:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] moritzm: is there anything I should do? can I just continue with swat? [13:24:45] !log zfilipin@tin Synchronized dblists/closed.dblist: SWAT: [[gerrit:361686|Reopen nlwikinews (T168764)]] (duration: 02m 50s) [13:24:51] :) [13:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:55] T168764: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764 [13:25:15] Amir1: I am reviewing 363043 [13:25:22] Thanks [13:25:59] hashar, moritzm: scap is still complaining about mw1196 [13:26:04] 10Operations, 10vm-requests, 10Patch-For-Review: Site: (2) VM request for DMARC - https://phabricator.wikimedia.org/T169566#3407240 (10akosiaris) 05Open>03Resolved a:03akosiaris VMs `diadem` and `dysprosium` are up and running. In site.pp are marked with only including `::standard` so they have all the... 
[13:26:30] 13:24:45 ... on mw1196.eqiad.wmnet returned [255]: ssh: connect to host mw1196.eqiad.wmnet port 22: Connection timed out [13:26:45] does anything need to be refreshed? [13:26:59] zeljkof: it probably needs the next puppet run, not sure [13:27:27] moritzm: can I continue with swat and just ignore that scap failure? [13:27:35] or can you force puppet to run? [13:28:05] zeljkof: it can be ignored, when the host is fixed (we probably decommission it anyway), scap pull will be run on it before bringing it back up [13:28:24] moritzm: ok, continuing with swat then, thanks [13:28:43] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363043 (https://phabricator.wikimedia.org/T169563) (owner: 10Ladsgroup) [13:30:02] (03Merged) 10jenkins-bot: Enable WikiLove for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363043 (https://phabricator.wikimedia.org/T169563) (owner: 10Ladsgroup) [13:30:14] (03CR) 10jenkins-bot: Enable WikiLove for ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363043 (https://phabricator.wikimedia.org/T169563) (owner: 10Ladsgroup) [13:31:55] Amir1: 363043 is at mwdebug1002, please test [13:32:03] okay [13:32:55] Amir1: let me know when I can run the full scap [13:33:18] zeljkof: https://ckb.wikipedia.org/wiki/%D9%84%DB%8E%D8%AF%D9%88%D8%A7%D9%86%DB%8C_%D8%A8%DB%95%DA%A9%D8%A7%D8%B1%DA%BE%DB%8E%D9%86%DB%95%D8%B1:Calak#.D8.A7.DB.8C.D9.86_.D8.A8.DA.86.D9.87_.DA.AF.D8.B1.D8.A8.D9.87_.D9.87.D8.AF.DB.8C.D9.87_.D8.A8.D9.87_.D8.B4.D9.85.D8.A7 [13:33:19] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [13:33:19] D9: Remap all submodules to tin - https://phabricator.wikimedia.org/D9 [13:33:21] works fine [13:33:53] Amir1: great, deploying... :) [13:33:57] (03PS2) 10Ayounsi: Add runbook link and remove
from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 [13:35:18] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:363043|Enable WikiLove for ckbwiki (T169563)]] (duration: 00m 43s) [13:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:27] T169563: Add WikiLove to CKB Wikipedia - https://phabricator.wikimedia.org/T169563 [13:35:42] Amir1: deployed, please test in production [13:36:23] zeljkof: it's okay [13:37:29] Amir1: I will review and deploy 362986 directly to cluster, right? there is nothing to test at mwdebug1002 or even production? [13:37:43] yes, it's not testable at all [13:38:50] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [13:40:13] (03Merged) 10jenkins-bot: Set Wikibase readFullEntityIdColumn setting to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [13:40:25] (03CR) 10jenkins-bot: Set Wikibase readFullEntityIdColumn setting to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362986 (owner: 10WMDE-leszek) [13:43:24] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:362986|Set Wikibase readFullEntityIdColumn setting to false]] (duration: 00m 42s) [13:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:19] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3407320 (10Urbanecm) 05Open>03Resolved Wiki is reopened and it can be edited by anyone as of now. As there is nothing we can do now at our side, I'... [13:44:27] Amir1: 362986 is deployed [13:44:53] Urbanecm and Amir1, thanks for deploying with #releng! :) [13:44:59] !log EU SWAT finished! 
[13:45:00] :)) [13:45:04] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2031805 [13:45:05] Thanks [13:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:34] !log rebooting restbase1011 for kernel update [13:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:44] PROBLEM - puppet last run on diadem is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 21 minutes ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [13:48:25] !log re-enabling puppet on snapshot1001, 1005 for testing [13:48:34] PROBLEM - Check systemd state on conf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:46] checking conf2001 [13:49:22] so the failed ones are the etcd ones, expected since etcdmirror is running only on conf2002 [13:49:33] so I guess I just reset them [13:49:44] RECOVERY - puppet last run on diadem is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:49:47] (03PS1) 10Alexandros Kosiaris: otrs: Enable apache exporter [puppet] - 10https://gerrit.wikimedia.org/r/363347 [13:52:34] RECOVERY - Check systemd state on conf2001 is OK: OK - running: The system is fully operational [13:56:11] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3407407 (10ArielGlenn) Faidon adjusted /proc/sys/vm/min_free_kbytes on dataset1001 from 262144 to 2097152 I've re-enabled puppet on snapshot1001 and s... 
[13:59:00] !log cache_misc: upgrade to varnish 4.1.7-1wm1 and reboot for kernel update [13:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:24] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:59:32] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3407422 (10ArielGlenn) Also re-enabled puppet on dataset1001. [14:00:41] !log rebooting logstash100[1-3] for kernel update [14:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:04] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 2789 [14:13:38] (03CR) 10Filippo Giunchedi: [C: 031] otrs: Enable apache exporter [puppet] - 10https://gerrit.wikimedia.org/r/363347 (owner: 10Alexandros Kosiaris) [14:14:10] (03PS9) 10Filippo Giunchedi: role: set external url for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/354975 [14:14:30] (03PS2) 10Giuseppe Lavagetto: Rationalize and centralize directory references [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363216 [14:14:32] (03PS2) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 [14:14:34] (03PS1) 10Giuseppe Lavagetto: Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350 [14:14:36] (03PS1) 10Giuseppe Lavagetto: Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351 [14:17:58] (03PS2) 10Jcrespo: Update to mariadb 10.1.25, support multi-instance, move unit path [software] - 10https://gerrit.wikimedia.org/r/363327 (https://phabricator.wikimedia.org/T169514) [14:18:23] (03PS10) 10Filippo Giunchedi: role: set external url for prometheus beta/tools [puppet] - 
10https://gerrit.wikimedia.org/r/354975 [14:18:24] (03PS2) 10Filippo Giunchedi: role: adjust beta prometheus alerts [puppet] - 10https://gerrit.wikimedia.org/r/363328 [14:19:04] !log rebooting logstash100[4-6] for kernel update [14:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:34] (03PS11) 10Filippo Giunchedi: role: set external url for prometheus beta/tools [puppet] - 10https://gerrit.wikimedia.org/r/354975 [14:30:25] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3407542 (10Papaul) p:05Triage>03Normal [14:34:06] (03CR) 10Filippo Giunchedi: [C: 032] role: set external url for prometheus beta/tools [puppet] - 10https://gerrit.wikimedia.org/r/354975 (owner: 10Filippo Giunchedi) [14:35:17] (03PS3) 10Filippo Giunchedi: role: adjust beta prometheus alerts [puppet] - 10https://gerrit.wikimedia.org/r/363328 [14:39:38] (03CR) 10Filippo Giunchedi: [C: 032] role: adjust beta prometheus alerts [puppet] - 10https://gerrit.wikimedia.org/r/363328 (owner: 10Filippo Giunchedi) [14:40:34] !log rebooting restbase1012 for kernel update [14:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:22] !log bounce pybal on lvs2006, not synced with etcd information [14:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:57] (03PS1) 10Aklapper: Revert "phabricator: Block IP ranges for recent uploaded offtopic files" [puppet] - 10https://gerrit.wikimedia.org/r/363356 [14:51:04] ema: was it due to my reboot of conf2001 ? [14:51:24] I was about to ask you if pybal was fine in codfw [14:51:33] I'd need to reboot conf200[23] too [14:57:55] elukey: so the reboot was at 13:16 wasn't it [14:58:40] okok [14:58:45] I'll reboot the other ones tomorrow [14:59:36] elukey: no, I mean, did you reboot conf2001 at 13:16? :) [15:02:30] ema: yes! 
[15:04:23] (sorry I parsed "wasn't it" in the wrong way) [15:07:03] yeah so lvs2003 doesn't have any established connection to conf2001 :( [15:07:18] while lvs2006 does (I've bounced pybal there a few minutes ago) [15:07:43] godog: this would explain the issue with ms-fe2007 not getting repooled [15:07:50] let me restart pybal on 2003 too [15:08:05] ema: indeed would explain it [15:08:50] sorry people, got caught up with other things and didn't check pybal's log :( [15:09:03] !log restart pybal on lvs2003 to make it reconnect to conf2001 [15:09:09] ema: I'll ping you before doing conf200[23] tomorrow ok? [15:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:15] !log re-enabled puppet on snapshot6,7, still watching dataset1001 performance [15:09:23] elukey: yeah well you'd expect pybal to re-connect I guess :) [15:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:33] elukey: please do! [15:10:39] godog: you should now be seeing traffic on ms-fe2007 [15:10:55] ema: indeed [15:11:24] moritzm: cache_misc rebooted into the new kernel FYI [15:11:36] ok, thanks [15:11:49] ema: I'm about to repool ms-fe2008 too [15:11:54] jouncebot: next [15:11:55] In 2 hour(s) and 48 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1800) [15:12:22] godog: I'll tail pybal.log with renewed confidence then [15:12:54] Jul 5 15:12:33 lvs2003 pybal[28951]: [swift_80] INFO: Merged enabled server ms-fe2008.codfw.wmnet, weight 40 [15:13:02] yeah looks like that did it [15:15:13] elukey: so in pybal.conf we've got stuff like [15:15:14] config = etcd://conf2001.codfw.wmnet/conftool/v1/pools/codfw/sca/zotero/ [15:15:22] pointing directly to conf2001 that is [15:15:48] I guess that rebooting conf200[23] shouldn't have much of an impact pybal-wise? [15:20:51] ema: ahhh okok, probably.. 
[15:21:37] (03PS4) 10Madhuvishy: tools: Fix maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) (owner: 10BryanDavis) [15:22:00] (03CR) 10Madhuvishy: [V: 032 C: 032] tools: Fix maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/360779 (https://phabricator.wikimedia.org/T165875) (owner: 10BryanDavis) [15:25:03] (03PS1) 10Filippo Giunchedi: prometheus: drop DESCRIPTION annotation from alerts [puppet] - 10https://gerrit.wikimedia.org/r/363359 [15:28:19] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:30:19] !log re-enabled puppet on stat1002, did a manual run, dataset filesystem available again there [15:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:05] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: drop DESCRIPTION annotation from alerts [puppet] - 10https://gerrit.wikimedia.org/r/363359 (owner: 10Filippo Giunchedi) [15:31:10] (03PS2) 10Filippo Giunchedi: prometheus: drop DESCRIPTION annotation from alerts [puppet] - 10https://gerrit.wikimedia.org/r/363359 [15:35:03] (03PS1) 10Jcrespo: mariadb: Correct systemd unit path [puppet] - 10https://gerrit.wikimedia.org/r/363360 (https://phabricator.wikimedia.org/T169514) [15:39:32] (03PS9) 10Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/354459 [15:41:15] (03CR) 10Filippo Giunchedi: [C: 032] "Not possible to run PCC due to puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/354459 (owner: 10Filippo Giunchedi) [15:51:45] 10Operations: Upload nodejs 6.x to stretch-wikimedia - https://phabricator.wikimedia.org/T169763#3407824 (10Paladox) [15:53:47] 10Operations, 10Gerrit: rename user gerrit2 to gerrit - https://phabricator.wikimedia.org/T169634#3407836 (10demon) 05Open>03declined I don't see any reason to bother. We could name the user "mrawesome" if we wanted. 
[15:54:18] !log restart mysql on db2072 [15:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:39] paladox: please stop filing tasks for "please upload package X to stretch" [15:54:44] these are not helpful [15:54:54] ok sorry. [15:57:05] 10Operations, 10Gerrit, 10Release-Engineering-Team: Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3407851 (10demon) I've got about 6 other priorities before we do this. Yes, systemd first, also finishing logstash, scap deploy.... [15:59:28] elukey: actually yeah, please ping me tomorrow before rebooting conf2003 as it's used by ulsfo's pybals [16:01:12] jouncebot: next [16:01:12] In 1 hour(s) and 58 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1800) [16:03:04] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2047693 [16:03:52] !log restart pybal on lvs200[45] to make them reconnect to conf2001 [16:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:54] !log restart pybal on lvs200[12] to make them reconnect to conf2001 [16:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:24] (03PS6) 10Chad: Create hourly backup schedule, modeled on weekly and use for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/341371 [16:12:56] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3407908 (10Gilles) p:05Triage>03High [16:13:09] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3407910 (10ema) [16:13:41] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3407940 (10ema) p:05Triage>03High [16:14:25] PROBLEM - nutcracker process on thumbor1002 is 
CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:05] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:11] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3407910 (10ema) [16:15:15] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [16:15:28] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3407945 (10Gilles) Verified that indeed, attempting to connect to poolcounter1002 just hangs: ``` gilles@deployment-imagescaler01:/srv/log/thumbor$ telnet pool... [16:15:55] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:18:11] (03PS2) 10Jcrespo: tendril: Make tendril active on a single datacenter for now [puppet] - 10https://gerrit.wikimedia.org/r/363294 (https://phabricator.wikimedia.org/T169540) [16:18:56] (03Draft1) 10Paladox: Gerrit: Allow users to change there full name in the ui [puppet] - 10https://gerrit.wikimedia.org/r/363364 [16:19:00] (03PS2) 10Paladox: Gerrit: Allow users to change there full name in the ui [puppet] - 10https://gerrit.wikimedia.org/r/363364 [16:19:38] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765#3407975 (10ema) As @Joe suggested some days ago, we might want to rewrite pybal's etcd code using https://treq.readthedocs.io/. [16:20:08] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3407979 (10Gilles) I commented out the poolcounter server config line in /etc/thumbor.d/60-thumbor-server.conf on deployment-imagescaler01. Restarted thumbor. n... 
[16:22:56] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3407999 (10Gilles) I bet it's something to do with that hiera value being for the swift::proxy class? I'll attempt re-creating the same list manually in horizon... [16:24:58] (03CR) 1020after4: [C: 031] Revert "phabricator: Block IP ranges for recent uploaded offtopic files" [puppet] - 10https://gerrit.wikimedia.org/r/363356 (owner: 10Aklapper) [16:26:55] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2041650 [16:30:35] (03CR) 10Chad: [C: 04-1] "No, this will just confuse people and create inconsistencies. We have a process for renaming users if they so desire: https://wikitech.wik" [puppet] - 10https://gerrit.wikimedia.org/r/363364 (owner: 10Paladox) [16:31:19] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3408065 (10Gilles) Puppet runs are still nuking the SWIFT_KEY config value... I'm not sure that anything I try to do now will last, as any future puppet run on... [16:31:44] (03Abandoned) 10Paladox: Gerrit: Allow users to change there full name in the ui [puppet] - 10https://gerrit.wikimedia.org/r/363364 (owner: 10Paladox) [16:32:15] 10Operations, 10User-fgiunchedi: Upgrade grafana to 4.4 - https://phabricator.wikimedia.org/T169773#3408066 (10fgiunchedi) [16:33:09] !log restart mysql on db2062 [16:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] you will see some noise on the mediawiki logs of errors for a few minutes- you can ignore those [16:33:55] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3408079 (10Gilles) Alright, that fixed it. 
I'll let the permanent fixes to @fgiunchedi as I don't know how to fix the hiera issues. I'll try putting my overrid... [16:33:59] (about codfw) [16:34:05] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3408080 (10Gilles) a:03fgiunchedi [16:35:14] (03PS1) 10MarcoAurelio: Alter ContentTranslation default namespace destination for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363370 (https://phabricator.wikimedia.org/T168727) [16:36:46] (03PS2) 10MarcoAurelio: Alter ContentTranslation default namespace destination for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363370 (https://phabricator.wikimedia.org/T168727) [16:37:57] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3408130 (10Gilles) The hotfixes are in /etc/thumbor.d/99-T169114.conf [16:39:17] (03PS3) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 [16:39:19] (03PS2) 10Giuseppe Lavagetto: Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350 [16:39:21] (03PS2) 10Giuseppe Lavagetto: Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351 [16:40:09] (03PS2) 10MarcoAurelio: Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362592 (https://phabricator.wikimedia.org/T168321) [16:40:38] (03PS1) 10RobH: conf100[456] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/363372 [16:41:44] (03CR) 1020after4: [C: 031] Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 (owner: 10Thcipriani) [16:41:49] (03CR) 10RobH: [C: 032] conf100[456] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/363372 (owner: 10RobH) 
[16:42:23] (03CR) 1020after4: [C: 031] phabricator: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362124 (owner: 10Dzahn) [16:43:38] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3408151 (10Gilles) Upgraded all packages to the latest, as I noticed python-thumbor-wikimedia was the previous version. [16:44:17] 10Operations, 10ops-codfw, 10DBA: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3408155 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.... [16:45:23] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3408161 (10MarcoAurelio) [16:47:35] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3408170 (10mmodell) [16:47:53] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3373930 (10mmodell) Need to merge https://gerrit.wikimedia.org/r/#/c/355869/ [16:49:10] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3408175 (10RobH) [16:49:44] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3366966 (10Esc3300) Wikidata item: https://www.wikidata.org/wiki/Q32012187 [16:57:28] (03PS1) 10RobH: set install params for conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/363374 [16:59:43] 
(03CR) 10RobH: [C: 032] set install params for conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/363374 (owner: 10RobH) [17:00:48] (03PS1) 10Jcrespo: mariadb: Revert parsercaches to pc100[456] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363375 (https://phabricator.wikimedia.org/T167784) [17:02:10] 10Operations, 10ops-eqiad: restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3408247 (10Eevans) Is this a hardware issue of some sort? [17:03:50] (03CR) 10Amire80: [C: 04-1] "Please don't merge now. We are changing this feature now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363370 (https://phabricator.wikimedia.org/T168727) (owner: 10MarcoAurelio) [17:05:31] (03CR) 10MarcoAurelio: "> Please don't merge now. We are changing this feature now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363370 (https://phabricator.wikimedia.org/T168727) (owner: 10MarcoAurelio) [17:06:05] PROBLEM - HTTPS-eventdonations on eventdonations.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Name or service not known [17:11:11] (03Abandoned) 10Jcrespo: Parsercache: Purge rows every day, and reduce TTL to 22 days [puppet] - 10https://gerrit.wikimedia.org/r/361656 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [17:11:24] (03Abandoned) 10Jcrespo: Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo) [17:11:42] 10Operations, 10Jupyter-Hub: jupyterhub.service in failed state on notebook1001 due to removed user - https://phabricator.wikimedia.org/T169698#3408287 (10madhuvishy) Thanks @moritzm, I'll look and fix [17:16:55] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 17215 [17:20:15] RECOVERY - Check systemd state on notebook1001 is OK: OK - running: The system is fully operational [17:20:30] 10Operations, 10Jupyter-Hub: 
jupyterhub.service in failed state on notebook1001 due to removed user - https://phabricator.wikimedia.org/T169698#3408298 (10madhuvishy) 05Open>03Resolved Fixed by removing zareen from the hub user database. [17:27:02] 10Operations, 10ops-eqiad, 10Services (watching): restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3408322 (10Eevans) [17:30:18] (03PS1) 10RobH: correct conf1006 entry [dns] - 10https://gerrit.wikimedia.org/r/363376 [17:30:38] (03CR) 10RobH: [C: 032] correct conf1006 entry [dns] - 10https://gerrit.wikimedia.org/r/363376 (owner: 10RobH) [17:32:05] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [17:41:44] !log re-enabled puppet on stat1003 (last dataset nfs client), manually mounted /mnt/data because puppet run has an unrelated error [17:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:17] (03PS1) 10Papaul: ADD partman entries for labtestmetal2001,labtestservices2002,labtestservices2003 and labtestcontrol2003 [puppet] - 10https://gerrit.wikimedia.org/r/363378 [17:46:17] (03PS2) 10Dzahn: Remove smtp port from ferm config [puppet] - 10https://gerrit.wikimedia.org/r/363164 (owner: 10Muehlenhoff) [17:46:57] !log cleaning /srv/wdqs/import on all wdqs servers [17:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:15] (03CR) 10Dzahn: [C: 032] Remove smtp port from ferm config [puppet] - 10https://gerrit.wikimedia.org/r/363164 (owner: 10Muehlenhoff) [17:48:44] (03PS3) 10Dzahn: request-tracker: Remove smtp port from ferm config [puppet] - 10https://gerrit.wikimedia.org/r/363164 (owner: 10Muehlenhoff) [17:51:22] 10Operations, 10ops-eqiad, 10Services (watching): restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3408393 (10MoritzMuehlenhoff) Yes, it's a hardware issue. 
That happens occasionally, usual fix is to disconnect from power for a bit, but that needs dc ops involvement and Chris i... [17:51:56] (03CR) 10Dzahn: "confirmed the iptables rules are gone now. no more SMTP smarthosts for RT" [puppet] - 10https://gerrit.wikimedia.org/r/363164 (owner: 10Muehlenhoff) [17:53:14] (03PS2) 10Dzahn: ADD partman entries for labtestmetal2001,labtestservices2002,labtestservices2003 and labtestcontrol2003 [puppet] - 10https://gerrit.wikimedia.org/r/363378 (owner: 10Papaul) [17:53:20] !log rebooting osmium for kernel update [17:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:02] (03PS3) 10Dzahn: install_server: partman entries for new labtest[metal|service|control]* [puppet] - 10https://gerrit.wikimedia.org/r/363378 (owner: 10Papaul) [17:55:35] (03CR) 10Dzahn: [C: 032] install_server: partman entries for new labtest[metal|service|control]* [puppet] - 10https://gerrit.wikimedia.org/r/363378 (owner: 10Papaul) [18:00:01] !log rebooting tungsten for kernel update [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T1800). Please do the needful. [18:00:05] James_F, dbrant, and TabbyCat: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:12] I'll SWAT! [18:00:14] here [18:00:20] present [18:00:20] Hey TabbyCat. [18:00:21] * James_F waves. 
[18:00:25] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:00:28] hi Niharika [18:01:49] (03PS2) 10Niharika29: Enable mobile non-JavaScript editing on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361455 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [18:02:11] (03Abandoned) 10Chad: Jenkins slave: Ensure group exists before trying to make the user [puppet] - 10https://gerrit.wikimedia.org/r/359012 (owner: 10Chad) [18:03:23] (03CR) 10Niharika29: [C: 032] Enable mobile non-JavaScript editing on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361455 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [18:04:32] (03Merged) 10jenkins-bot: Enable mobile non-JavaScript editing on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361455 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [18:05:58] (03CR) 10jenkins-bot: Enable mobile non-JavaScript editing on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361455 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [18:06:09] James_F: Your above patch is on 1002 if there's anything to test. [18:06:18] (03CR) 10Dzahn: "@papaul now merged and i also run puppet on install2002. you can go ahead with installs." [puppet] - 10https://gerrit.wikimedia.org/r/363378 (owner: 10Papaul) [18:06:27] Niharika: Looking now; looks good. 
[18:08:33] (03PS3) 10Dzahn: Restrict http access to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/363213 (owner: 10Muehlenhoff) [18:08:36] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable mobile non-JavaScript editing on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/361455 (duration: 00m 45s) [18:08:46] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3408470 (10ArielGlenn) And finally, re-enabled puppet on stat1003, though puppet is currently broken over there due to d537148c75f8ad2d2eb15f99374fd843ef9... [18:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:49] James_F: That one is now synced everywhere. [18:08:53] Onwards... [18:09:02] Is there anybody around who gets reports on cron failures? [18:09:05] nuria_: Thanks! [18:09:10] (03PS3) 10Niharika29: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360370 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [18:09:14] Bah. Niharika: Thanks. :-) [18:09:16] (03CR) 10Niharika29: [C: 032] Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360370 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [18:09:25] (03PS4) 10Dzahn: request-tracker:Restrict http access to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/363213 (owner: 10Muehlenhoff) [18:09:46] looks like something is broken in wikidata dump generation, no this week's dump for rdf [18:10:02] it's supposed to run on snapshot1007 so if anybody can look up logs there... 
[18:10:44] (03Merged) 10jenkins-bot: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360370 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [18:10:53] (03CR) 10jenkins-bot: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/360370 (https://phabricator.wikimedia.org/T162849) (owner: 10Jforrester) [18:11:46] James_F: ^That one is on 1002 as well. [18:11:46] (03CR) 10Dzahn: [C: 032] request-tracker:Restrict http access to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/363213 (owner: 10Muehlenhoff) [18:12:27] (03PS1) 10Niharika29: Add CodeMirror to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363379 [18:12:57] Niharika: Yeah, LGTM. [18:13:52] (03CR) 10Dzahn: "confirmed. rt.wm.org works like before. http://ununpentium.wikimedia.org/ not reachable from outside anymore but from inside" [puppet] - 10https://gerrit.wikimedia.org/r/363213 (owner: 10Muehlenhoff) [18:14:36] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable OOjs UI EditPage buttons on es/fr/it/ja/ru-wiki and meta [mediawiki-config] - https://gerrit.wikimedia.org/r/360370 (duration: 00m 45s) [18:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:53] (03PS3) 10Dzahn: Gerrit: Add tag url to gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362979 (owner: 10Paladox) [18:14:53] James_F: Synced everywhere. [18:15:01] Niharika: Thank you. [18:15:30] (03PS2) 10Niharika29: Add CodeMirror to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363379 [18:16:27] (03CR) 10Dzahn: [C: 032] Gerrit: Add tag url to gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362979 (owner: 10Paladox) [18:16:51] oooh, pretty full swat, I won't add anything then, I should really start using eu morning again.... 
[18:17:08] my fault addshore [18:17:14] I added 4 today [18:17:21] robh: looks like something is broken in wikidata dump generation, no this week's dump for rdf, it's supposed to run on snapshot1007 so if anybody can look up logs there... [18:17:36] is that anything you could check? [18:18:38] (03PS1) 10ArielGlenn: use 'require_package' for stats packages including python-yaml [puppet] - 10https://gerrit.wikimedia.org/r/363382 [18:18:43] thanks mutante :) [18:18:43] 10Operations, 10ops-eqiad, 10Services (watching): restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3408519 (10Eevans) >>! In T169696#3408393, @MoritzMuehlenhoff wrote: > Yes, it's a hardware issue. That happens occasionally, usual fix is to disconnect from power for a bit, but... [18:18:54] (03PS3) 10Dzahn: Gerrit: Remove linkDrafts from gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362987 (owner: 10Paladox) [18:19:14] SMalyshev: see my mail to xmldatadumps-l list [18:19:28] nfs issues yesterday and today [18:19:37] apergos: thanks, looking [18:19:48] all running processes hung, ticket referenced in the email to follow along [18:20:26] (03PS3) 10Niharika29: Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362592 (https://phabricator.wikimedia.org/T168321) (owner: 10MarcoAurelio) [18:20:31] (03CR) 10Niharika29: [C: 032] Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362592 (https://phabricator.wikimedia.org/T168321) (owner: 10MarcoAurelio) [18:20:42] apergos: probably deserves wider announcement? I don't read that list because my stuff has nothing to do with xml dumps, but looks like it affects all dumps, not just xml ones [18:20:43] and maybe other crons? 
[18:21:24] (03CR) 10Dzahn: [C: 032] Gerrit: Remove linkDrafts from gitweb [puppet] - 10https://gerrit.wikimedia.org/r/362987 (owner: 10Paladox) [18:21:48] (03CR) 10ArielGlenn: "I don't know if i'm introducing more conflicts here, or if we want to use require_package here instead of ensure_packages there. And that" [puppet] - 10https://gerrit.wikimedia.org/r/363382 (owner: 10ArielGlenn) [18:22:00] (03Merged) 10jenkins-bot: Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362592 (https://phabricator.wikimedia.org/T168321) (owner: 10MarcoAurelio) [18:22:09] (03CR) 10jenkins-bot: Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362592 (https://phabricator.wikimedia.org/T168321) (owner: 10MarcoAurelio) [18:22:50] SMalyshev: all dump related announcements go out on that list, the 'xml' in the name is a holdover from waaay back when that's all there was [18:22:51] (03CR) 10Dzahn: "i think the answer is yes, we want to use require_package" [puppet] - 10https://gerrit.wikimedia.org/r/363382 (owner: 10ArielGlenn) [18:22:56] TabbyCat: Your change https://gerrit.wikimedia.org/r/#/c/362592/ is on mwdebug1002. [18:23:06] apergos: ah, ok then, then I'll subscribe to it [18:23:13] ah yes, you should! [18:23:38] Niharika: nothing looks broken, can't really test it as I cannot read the hebrew script [18:23:49] TabbyCat: Okay. Syncing it then. [18:24:55] Niharika: don't forget the maintenance script once in sync. [18:26:09] (03PS2) 10Rush: Redact ep_courses.course_token [puppet] - 10https://gerrit.wikimedia.org/r/363230 (https://phabricator.wikimedia.org/T169661) (owner: 10Brian Wolff) [18:26:15] TabbyCat: Yup, I'll run that after I finish with the rest of the patches. 
[18:26:23] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Set wgCategoryCollation to 'numeric' at he.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/362592 (duration: 00m 43s) [18:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:57] (03PS2) 10Niharika29: Add "H" as wgNamespaceAlias to NS_HELP for en.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362508 (https://phabricator.wikimedia.org/T167563) (owner: 10MarcoAurelio) [18:28:08] (03CR) 10Rush: [C: 032] Redact ep_courses.course_token [puppet] - 10https://gerrit.wikimedia.org/r/363230 (https://phabricator.wikimedia.org/T169661) (owner: 10Brian Wolff) [18:28:10] (03CR) 10Niharika29: [C: 032] Add "H" as wgNamespaceAlias to NS_HELP for en.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362508 (https://phabricator.wikimedia.org/T167563) (owner: 10MarcoAurelio) [18:28:41] (03CR) 10Dzahn: "all confirmed. this only changes dbmonitor2001, no change on dbmonitor1001 (http://puppet-compiler.wmflabs.org/6936/) and tendril.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/363294 (https://phabricator.wikimedia.org/T169540) (owner: 10Jcrespo) [18:28:51] (03PS3) 10Dzahn: tendril: Make tendril active on a single datacenter for now [puppet] - 10https://gerrit.wikimedia.org/r/363294 (https://phabricator.wikimedia.org/T169540) (owner: 10Jcrespo) [18:29:33] (03Merged) 10jenkins-bot: Add "H" as wgNamespaceAlias to NS_HELP for en.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362508 (https://phabricator.wikimedia.org/T167563) (owner: 10MarcoAurelio) [18:29:44] (03CR) 10jenkins-bot: Add "H" as wgNamespaceAlias to NS_HELP for en.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362508 (https://phabricator.wikimedia.org/T167563) (owner: 10MarcoAurelio) [18:30:19] (03CR) 10Dzahn: [C: 032] tendril: Make tendril active on a single datacenter for now [puppet] - 
10https://gerrit.wikimedia.org/r/363294 (https://phabricator.wikimedia.org/T169540) (owner: 10Jcrespo) [18:31:37] dbrant: Your changes are on mwdebug1002 if you can test it there. [18:33:02] (03CR) 10Dzahn: "no change on dbmonitor1001 - tendril works like before - no more http://dbmonitor2001.wikimedia.org/ - iptables rules removed there" [puppet] - 10https://gerrit.wikimedia.org/r/363294 (https://phabricator.wikimedia.org/T169540) (owner: 10Jcrespo) [18:33:13] Niharika: lgtm! [18:33:21] Okay then. [18:34:19] 10Operations, 10Patch-For-Review, 10Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3408582 (10GWicke) 05Open>03Resolved a:03GWicke I just verified that this is working by temporarily lowering the alert threshold in one of the dashboards. The servic... [18:34:33] (03PS1) 10ArielGlenn: clean up perms and temp files during kiwix rsync [puppet] - 10https://gerrit.wikimedia.org/r/363383 [18:36:35] !log niharika29@tin Synchronized php-1.30.0-wmf.7/extensions/MobileApp/: Enable description editing for all wikis except enwiki. (T146705) (duration: 00m 43s) [18:36:43] dbrant: All done. [18:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] T146705: Support rollout of editing wikidata descriptions on Android - https://phabricator.wikimedia.org/T146705 [18:37:14] (03CR) 10Dzahn: phabricator: add support for stretch and PHP7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362124 (owner: 10Dzahn) [18:37:42] TabbyCat: Add "H" as wgNamespaceAlias to NS_HELP for en.wikisource [mediawiki-config] is on mwdebug1002. 
[18:37:51] checking [18:38:04] (03PS2) 10ArielGlenn: clean up perms and temp files during kiwix rsync [puppet] - 10https://gerrit.wikimedia.org/r/363383 [18:38:23] Niharika: lgtm - don't forget the maintenance script too in this one :) [18:38:27] (03PS4) 10Niharika29: Add 'WP' namespace alias to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362267 (https://phabricator.wikimedia.org/T168164) (owner: 10MarcoAurelio) [18:38:48] test strategy: checked H:A - on mw1002 it redirects to Help:A as expected [18:38:48] (03CR) 10MaxSem: [C: 032] Add CodeMirror to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363379 (owner: 10Niharika29) [18:38:53] TabbyCat: Got it. [18:39:05] (03CR) 10Niharika29: [C: 032] Add 'WP' namespace alias to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362267 (https://phabricator.wikimedia.org/T168164) (owner: 10MarcoAurelio) [18:40:06] (03Merged) 10jenkins-bot: Add CodeMirror to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363379 (owner: 10Niharika29) [18:40:18] (03CR) 10jenkins-bot: Add CodeMirror to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363379 (owner: 10Niharika29) [18:40:28] (03Merged) 10jenkins-bot: Add 'WP' namespace alias to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362267 (https://phabricator.wikimedia.org/T168164) (owner: 10MarcoAurelio) [18:40:59] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Add H as wgNamespaceAlias to NS_HELP for en.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/362508 (duration: 00m 42s) [18:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:42] (03CR) 10Dzahn: phabricator: add support for stretch and PHP7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362124 (owner: 10Dzahn) [18:43:05] TabbyCat: Add 'WP' namespace alias to ruwiki is on mwdebug1002 too. 
[18:43:25] Niharika: lgtm [18:43:45] namespaceDupes.php required as well just in case [18:46:30] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Add 'WP' namespace alias to ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/362267 (duration: 00m 42s) [18:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:40] TabbyCat: Synced. Okay. [18:46:42] Niharika: thanks! I'm not yet seeing the change live; does that take a little while, or does something else need to be done? [18:48:12] (CR) ArielGlenn: [C: 2] clean up perms and temp files during kiwix rsync [puppet] - https://gerrit.wikimedia.org/r/363383 (owner: ArielGlenn) [18:50:29] dbrant: It seems to have synced. Via which url is it getting loaded? [18:51:02] Niharika: https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json [18:52:16] dbrant: According to your patch, the file updated was https://meta.wikimedia.org/static/current/extensions/MobileApp/config/config.json [18:52:23] Which seems updated as it should be. [18:52:39] (PS4) Niharika29: Fix nowikisource template namespace subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/362272 (https://phabricator.wikimedia.org/T166035) (owner: MarcoAurelio) [18:52:44] (CR) Niharika29: [C: 2] Fix nowikisource template namespace subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/362272 (https://phabricator.wikimedia.org/T166035) (owner: MarcoAurelio) [18:53:59] (Merged) jenkins-bot: Fix nowikisource template namespace subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/362272 (https://phabricator.wikimedia.org/T166035) (owner: MarcoAurelio) [18:54:12] Niharika: android.json is actually a symlink to config.json, and should emit the same content. [18:54:43] scap sync-dir instead of sync-file to force?
[18:54:58] (idiotic suggest maybe, forget about it) [18:55:19] (03CR) 10Dzahn: phabricator: add support for stretch and PHP7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362124 (owner: 10Dzahn) [18:56:57] TabbyCat: Your last change is on mwdebug1002 as well. Fix nowikisource template namespace subpages [18:57:05] TabbyCat, sync-dir is an alias to sync-file [18:57:29] Niharika: checking [18:57:41] MaxSem: noted, one new lesson learnt [18:57:58] wasn't always the case [18:58:16] Niharika: lgtm [18:59:33] (03CR) 10jenkins-bot: Add 'WP' namespace alias to ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362267 (https://phabricator.wikimedia.org/T168164) (owner: 10MarcoAurelio) [18:59:35] (03CR) 10jenkins-bot: Fix nowikisource template namespace subpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362272 (https://phabricator.wikimedia.org/T166035) (owner: 10MarcoAurelio) [18:59:38] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Fix nowikisource template namespace subpages [mediawiki-config] - https://gerrit.wikimedia.org/r/362272 (duration: 00m 42s) [18:59:44] MaxSem: but is there a command to sync the full dir where a file is located? [18:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:49] now I mean [19:00:06] (lgtm now in live as well) [19:00:33] you just tell it to sync-file the dir [19:01:05] dbrant, purging that file doesn't seem to work [19:01:08] TabbyCat: Will run the maintenance scripts now. 
[19:01:19] (03PS6) 10Dzahn: phabricator: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362124 [19:04:17] Niharika: perfect [19:05:52] (03CR) 10Paladox: [C: 031] phabricator: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362124 (owner: 10Dzahn) [19:07:24] (03CR) 10Dzahn: [C: 032] "no change on iridium or phab2001: http://puppet-compiler.wmflabs.org/6937/ - added the warning comment - amended per paladox' comment" [puppet] - 10https://gerrit.wikimedia.org/r/362124 (owner: 10Dzahn) [19:08:20] 10Operations, 10Cloud-Services, 10DC-Ops: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3408795 (10chasemp) @Cmjohnson I ping'd the wrong chris before :) As of this moment labstore1005 is the standby, if you have time to look at this it would be great ch... [19:09:08] (03PS1) 10EBernhardson: Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 [19:09:54] (03PS7) 10Dzahn: phabricator: add support for stretch and PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/362124 [19:10:34] (03CR) 10jerkins-bot: [V: 04-1] Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [19:10:48] Niharika: can I leave now or you still need me for more tests? [19:11:05] I'll still be around though, but I have other things [19:12:31] (03PS2) 10EBernhardson: Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 [19:12:47] TabbyCat: Couple of small things while running the maintenance scripts. Will comment on tickets. You don't have to be around anymore. 
[19:12:55] ---SWAT over--- [19:12:59] 10Operations, 10Cloud-Services, 10Cloud-VPS, 10cloud-services-team (Kanban): Puppet CA: virt1000.wikimedia.org' will expire on 2017-08-15 - https://phabricator.wikimedia.org/T168110#3408839 (10chasemp) a:03Andrew [19:13:24] Niharika: I'm still around, which script are you running? [19:13:32] Niharika: MaxSem: so, is there something to be done about syncing the symlink? [19:13:35] (03CR) 10jerkins-bot: [V: 04-1] Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [19:13:44] scap-sync full? [19:13:52] but it'll take a whole lot of time [19:14:14] dbrant, ask ops to purge it on their level? apparently nothing we deployers can do about it [19:14:58] (03PS3) 10EBernhardson: Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 [19:16:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:16:04] TabbyCat: I commented on the tickets. [19:16:17] k [19:16:40] i bet maybe touching the android.json (assuming this is still the issue) may be the fix Niharika cc TabbyCat [19:17:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:17:28] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3408854 (10RobH) a:05RobH>03elukey [19:18:42] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:19:32] !log niharika29@tin Synchronized php-1.30.0-wmf.7/extensions/MobileApp/config/android.json: Syncing in hopes of invalidating cache (duration: 00m 42s) [19:19:40] Zppix: Done that^ [19:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:45] any changes? [19:20:03] Nope.
[19:20:06] 10Operations, 10ops-eqiad, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3408876 (10RobH) Assigned to @elukey for service implementation. (If this isn't done by you, but someone else, please assign this task to them.) This task can be resolved... [19:20:14] dang... welp Reedy got a solution? [19:20:31] Where's it being cached? [19:20:42] http [19:20:50] yes we tried purgeList [19:20:51] purgeList? [19:20:52] lol [19:21:25] Niharika: maybe the whole MobileApp, go up a few levels? :) [19:21:32] Don't we have this problem every time? [19:21:37] Because of the symlink mess? [19:21:38] idk [19:21:46] TabbyCat, that's not how HTTP caching works [19:21:49] (03CR) 10EBernhardson: "Tyler could you give me a quick sanity check on the scap3 config and paths it will use?" [puppet] - 10https://gerrit.wikimedia.org/r/363384 (owner: 10EBernhardson) [19:22:04] I was happily retired from these things after I wanted to come back [19:22:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:23:42] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:24:07] * dbrant knows very little about this stuff :( [19:24:41] * Zppix has a motto of try everything til something works or breaks [19:24:44] (03PS2) 10Dzahn: decom subra and suhail [puppet] - 10https://gerrit.wikimedia.org/r/363110 (https://phabricator.wikimedia.org/T169506) [19:24:46] (03PS1) 10Herron: Remove exim4-heavy and exim4::ganglia from role requesttracker_server [puppet] - 10https://gerrit.wikimedia.org/r/363390 (https://phabricator.wikimedia.org/T169794) [19:25:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:25:11] dbrant: : dbrant, ask ops to purge it on their level? apparently nothing we deployers can do about it [19:25:15] Maybe try that.
[19:25:48] isn't this the ops channel? whom should I ask? [19:26:19] bblack or ema? [19:31:19] !log phab2001 - rebooting for kernel upgrade [19:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:27] (03CR) 10Herron: "link to puppet compiler result https://puppet-compiler.wmflabs.org/6938/" [puppet] - 10https://gerrit.wikimedia.org/r/363390 (https://phabricator.wikimedia.org/T169794) (owner: 10Herron) [19:32:44] Reedy: you mentioned there have been symlink issues in the past? What was done in those cases? [19:32:52] I'm not sure [19:32:58] I wasn't involved in the deploys [19:33:04] But I'm fairly sure it's caused various issues like this [19:33:56] I've got a feeling ops had to purge something [19:34:45] 10Operations, 10hardware-requests, 10Patch-For-Review: Reclaim/Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3408920 (10RobH) [19:37:35] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3408936 (10RobH) [19:42:52] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:43:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:44:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:46:35] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3409004 (10Papaul) [19:47:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [19:48:13] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10Papaul) a:05Papaul>03RobH @Robh can you please 
setup the network port for me? Once done you can assign the task back to me for me to pr... [19:48:23] !log commonswiki: running updateArticleCount.php (against the vslow slave) [19:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:37] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3409009 (10Papaul) [19:48:39] jdlrobson: ^ [19:48:45] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3379794 (10Papaul) @Robh can you please setup the network port for me? Once done you can assign the task back to me for me to proceed with the instal... [19:49:00] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3409011 (10Papaul) a:05Papaul>03RobH [19:49:44] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3409013 (10Papaul) [19:49:57] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3379811 (10Papaul) a:05Papaul>03RobH @Robh can you please setup the network port for me? Once done you can assign the task back to me for me to p... [19:50:40] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3409016 (10Papaul) [19:50:59] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3379777 (10Papaul) a:05Papaul>03RobH @Robh can you please setup the network port for me? 
Once done you can assign the task back to me for me to procee... [19:54:42] PROBLEM - Host gerrit2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:55:02] RECOVERY - Host gerrit2001 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T2000). Please do the needful. [20:00:16] no ores [20:00:19] no parsoid deploy today [20:02:48] bblack: ema: around? [20:05:08] dbrant: brandon is out this week AFAIK [20:07:01] !log phab2001 - deleted /etc/systemd/system/phd.service (base::service_unit uses /lib/systemd/system/phd.service both have DIFFERENT content and conflicted, causing systemd degradation after reboot) [20:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:34] mutante: holy crap mutante, how did that happen? [20:07:57] chasemp it was probably from before when we did not use base::service_unit [20:09:05] chasemp: first it used a setup where puppet puts the unit file in /etc/systemd/system... then it switched to the "base::service_unit"-abstraction, that uses /lib/systemd/system (i forgot why, but i didn't like it because imho /etc/ is still for config and not /lib), nothing removed the old file... then a fix was applied to the template [20:09:13] volans: can you help out, then? :) [20:09:48] * volans reading backlog [20:09:52] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:09:53] chasemp: then later, we had to reboot for kernel upgrade..
and then it breaks, Icinga notices that systemd state is degraded, heh [20:10:02] volans: So, earlier we deployed a change to a static configuration file: https://meta.wikimedia.org/static/current/extensions/MobileApp/config/config.json [20:10:12] But there's also a symlink to that file (which needs to return the same content): https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json [20:10:21] How do we get the symlink to return the updated content? [20:10:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:10:53] mutante: gotcha thanks [20:12:23] dbrant: looking, give me a minute [20:14:30] !log commonswiki: nevermind that article count thing [20:14:39] jdlrobson: Ok that was taking too long, canceled [20:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:47] (counting is hard......) [20:14:52] it seems there is an issue with Oauth [20:15:37] I can't get log in with https://tools.wmflabs.org/url2commons/index.html [20:15:47] and someone else too [20:17:06] is this a known issue? [20:17:45] https://tools.wmflabs.org/oauth-hello-world/ -> seems to work fine [20:18:24] dbrant: only for hostname=meta.wikimedia.org ? [20:18:39] also no train today, and none of the SWAT patches seem relevant [20:18:51] probably the tool is broken? [20:18:58] volans: technically yes. [20:20:47] I was able to grant OAuth access to it [20:21:06] (I'm not going to use the tool, don't wanna upload junk data to commons) [20:22:44] RainbowSprinkles: I suppose the problem is that it still recommends you to authorize after a seemingly successful authorization [20:22:51] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409167 (10Dcljr) >>! In T168764#3407320, @Urbanecm wrote: > Wiki is reopened and it can be edited by anyone as of now. 
Technically, yes, but the conv... [20:23:57] Hmm [20:37:30] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409233 (10MF-Warburg) But that is irrelevant for the bug: >>! In T168764#3407320, @Urbanecm wrote: > As there is nothing we can do now at our side [20:39:52] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:42:39] dbrant: so, something puzzling me, according to the docs I should have banned the object, but I'm still getting the old one [20:43:08] and I can see the new one with the mwdebug extension, so I will not ask you to double-verify that the content changed for real ;) [20:43:35] curious... [20:43:40] OK, I reported the issue on url2commons talk page [20:43:55] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3376111 (10Koavf) For those watching this, the site is live. [20:44:54] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409284 (10Dcljr) I know, but just in case some users interested in editing the wiki are following this task, they should know that editing should wait... [20:45:52] volans: so then, who else might know how to solve this? 
[20:45:59] (03PS2) 10Niharika29: Config changes for deploying CodeMirror on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362327 (https://phabricator.wikimedia.org/T169284) [20:46:48] dbrant: already poking who's around ;) [20:47:41] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3409286 (10GWicke) This RFC is scheduled for today's IRC meeting, at 2pm SF ti... [20:50:13] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409307 (10Koavf) Yes, that is what I was trying to say--import seems to be done and there aren't redlinks everywhere anymore. [20:51:23] (03PS4) 10EBernhardson: Deploy mjolnir kafka daemon to relforge [puppet] - 10https://gerrit.wikimedia.org/r/363384 [20:51:58] 10Operations, 10DBA, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409316 (10Dcljr) Sigh… My previous comment was in response to MF-Warburg. [20:57:52] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [20:59:22] (03CR) 10MaxSem: Config changes for deploying CodeMirror on testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362327 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [21:00:04] MaxSem and Niharika: Respected human, time to deploy Deploy CodeMirror to testwiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T2100). Please do the needful. 
[21:01:43] (03PS3) 10Niharika29: Config changes for deploying CodeMirror on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362327 (https://phabricator.wikimedia.org/T169284) [21:01:45] 10Operations, 10Datasets-General-or-Unknown: NFS on dataset1001 overloaded, high load on the hosts that mount it - https://phabricator.wikimedia.org/T169680#3409360 (10ArielGlenn) So far dataset1001 still behaving, dump run beginning to pick back up. It's now late enough that I need to get some sleep. Hopefu... [21:03:27] !log add madhuvishy to wmf-nda phab group [21:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:40] (03PS16) 10Paladox: Upgrade gerrit to 2.14.2-pre (DO NOT MERGE) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [21:07:22] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89950.81 seconds [21:08:05] dbrant: it should be solved now [21:08:15] apparently I was a bit too conservative with the first ban [21:08:42] dbrant: do you have a task for the upgrade I can refer to? [21:09:31] volans: that's done it! thanks! The original commit was https://gerrit.wikimedia.org/r/363349 [21:11:19] volans: so, for the future, do we have to go through this process every time a symlinked file is updated? or can there be a way to simplify it? [21:12:50] dbrant: that's why I was asking for a task, I want to follow up for the generic case to avoid this, if possible ;) [21:13:03] gotcha :) [21:16:48] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3409414 (10RobH) a:05RobH>03Papaul In the future please list out the full network port info, so I don't have to go hunting =] Example: ge-1/0/17... 
[21:16:58] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3409416 (10RobH) [21:17:09] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3409417 (10RobH) [21:17:37] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3379828 (10RobH) Network port setup, in the future, please list the full network port info. Example: ge-1/0/16 is actually asw-c-codfw:ge-1/0/16 [21:17:48] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3409420 (10RobH) a:05RobH>03Papaul [21:27:42] (03PS4) 10Niharika29: Config changes for deploying CodeMirror on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362327 (https://phabricator.wikimedia.org/T169284) [21:28:17] (03CR) 10Niharika29: [C: 032] Config changes for deploying CodeMirror on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362327 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [21:28:17] Niharika: Don't forget tools/release this time :) [21:28:55] Reedy: Yeah! 
[21:29:58] (03CR) 10jenkins-bot: Config changes for deploying CodeMirror on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362327 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [21:38:00] !log niharika29@tin Synchronized php-1.30.0-wmf.7/extensions/CodeMirror/: Deploying CodeMirror to testwiki (T169284) (duration: 00m 44s) [21:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:10] T169284: Deploy CodeMirror to testwiki - https://phabricator.wikimedia.org/T169284 [21:42:21] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3409504 (10Gehel) [21:47:53] !log niharika29@tin Started scap: Deploying Codemirror on testwiki- full scap (T169284) [21:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:03] T169284: Deploy CodeMirror to testwiki - https://phabricator.wikimedia.org/T169284 [21:51:03] !log niharika29@tin scap aborted: Deploying Codemirror on testwiki- full scap (T169284) (duration: 03m 10s) [21:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:11] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Syncing InitialiseSettings for CodeMirror deployment (duration: 00m 42s) [21:53:15] !log niharika29@tin Started scap: Deploying Codemirror on testwiki- full scap (T169284) [21:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:30] T169284: Deploy CodeMirror to testwiki - https://phabricator.wikimedia.org/T169284 [21:56:06] (03PS1) 10Bearloga: shiny_server: Fix restart instructions [puppet] - 10https://gerrit.wikimedia.org/r/363486 [21:59:52] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:05:09] (03CR) 10Chelsyx: [C: 031] 
shiny_server: Fix restart instructions [puppet] - 10https://gerrit.wikimedia.org/r/363486 (owner: 10Bearloga) [22:11:46] (03PS1) 10Reedy: Wrap apache_request_headers() in function_exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363491 [22:12:13] (03CR) 10Reedy: "$ mwscript eval.php fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363491 (owner: 10Reedy) [22:13:01] (03CR) 10Chad: "I don't think these hooks typically get called from the CLI :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363491 (owner: 10Reedy) [22:13:40] (03CR) 10Reedy: "I've made one that does some auth stuff.. And it does and breaks :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363491 (owner: 10Reedy) [22:13:59] !log niharika29@tin Finished scap: Deploying Codemirror on testwiki- full scap (T169284) (duration: 20m 43s) [22:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:10] T169284: Deploy CodeMirror to testwiki - https://phabricator.wikimedia.org/T169284 [22:14:12] (03PS1) 10BryanDavis: labsdb: Add babel table to public views [puppet] - 10https://gerrit.wikimedia.org/r/363492 (https://phabricator.wikimedia.org/T160713) [22:16:32] !log subra/suhail: disabling puppet, stopping poolcounterd, stopping other services, first step of decom, replaced by poolcounter200[12] (T169506) [22:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:42] T169506: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506 [22:17:26] (03CR) 10Chad: [C: 032] Wrap apache_request_headers() in function_exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363491 (owner: 10Reedy) [22:18:02] PROBLEM - poolcounter on suhail is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd [22:18:22] PROBLEM - Poolcounter connection on suhail is CRITICAL: connect to address 10.192.0.121 and port 7531: Connection refused [22:18:32] PROBLEM - poolcounter on subra is CRITICAL: PROCS 
CRITICAL: 0 processes with command name poolcounterd [22:18:32] PROBLEM - Poolcounter connection on subra is CRITICAL: connect to address 10.192.16.124 and port 7531: Connection refused [22:18:35] ACKNOWLEDGEMENT - Poolcounter connection on subra is CRITICAL: connect to address 10.192.16.124 and port 7531: Connection refused daniel_zahn decom [22:18:35] ACKNOWLEDGEMENT - poolcounter on subra is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd daniel_zahn decom [22:18:36] ACKNOWLEDGEMENT - Poolcounter connection on suhail is CRITICAL: connect to address 10.192.0.121 and port 7531: Connection refused daniel_zahn decom [22:18:36] ACKNOWLEDGEMENT - poolcounter on suhail is CRITICAL: PROCS CRITICAL: 0 processes with command name poolcounterd daniel_zahn decom [22:18:46] (03Merged) 10jenkins-bot: Wrap apache_request_headers() in function_exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363491 (owner: 10Reedy) [22:18:54] (03CR) 10jenkins-bot: Wrap apache_request_headers() in function_exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363491 (owner: 10Reedy) [22:20:24] !log demon@tin Synchronized wmf-config/CommonSettings.php: apache_request_headers protection (duration: 00m 42s) [22:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:07] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3409640 (10Dzahn) [22:21:12] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Syncing InitialiseSettings for CodeMirror deployment (take 2) (duration: 00m 42s) [22:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:03] (03CR) 10Dzahn: [C: 032] decom subra and suhail [puppet] - 10https://gerrit.wikimedia.org/r/363110 (https://phabricator.wikimedia.org/T169506) (owner: 10Dzahn) [22:23:27] Niharika: All good? [22:24:05] Reedy: Not really. 
Getting "Notice: Undefined variable: wmgUseCodeMirror in /srv/mediawiki/wmf-config/CommonSettings.php on line 1944" in fatalmonitor. [22:24:11] For no apparent reason. [22:24:53] Niharika: touch wmf-config/InitialiseSettings.php and sync it [22:24:55] And the extension isn't being enabled, as it should. Funnily I made pretty much the same patch for beta cluster and it works perfectly there. [22:25:03] Okay. [22:25:26] it's probably not being enabled due to the notice [22:25:36] if it's being seen as undefined, it's never gonna be true [22:25:43] Right. [22:26:01] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [22:26:02] Basically... It's a race condition that seems to have unfixed itself recently [22:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:14] CommonSettings.php uses a cached version of IS, based on a timestamp [22:26:44] Why does this always happen when I do deploys! :P [22:27:34] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Syncing InitialiseSettings for CodeMirror deployment (take 4) (duration: 00m 42s) [22:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:15] I should note.. I did see CodeMirror on https://test.wikipedia.org/wiki/Special:Version earlier when I checked after your scap [22:28:51] Reedy: Right, it's on Special:Version. It's not showing up as a beta feature though. 
[22:29:02] !log subra/suhail: re-enabled puppet, now with role::spare, no more poolcounter, scheduled icinga downtimes for decom (T169506) [22:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:13] T169506: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506 [22:29:34] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3409653 (10Dzahn) [22:31:00] Okay, at least that error is gone. [22:31:04] :) [22:31:10] Now.. This betafeatures stuff not appearing [22:31:15] I swear I've lost 5kgs since morning. [22:31:28] According to MaxSem it's getting added fine. [22:32:20] I'm sure, every time someone deploys a new beta feature, it doesn't work on production [22:32:21] But not showing up. [22:32:48] Reedy: Wait, how come? [22:33:03] I'm trying to remember [22:33:23] I wonder in the world of HHVM if the cached version of InitialiseSettings even makes *sense* anymore [22:33:27] James_F: marktraceur do you guys remember why betafeatures doesn't work? [22:33:35] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3409667 (10Dzahn) @Rob thanks for adding the template! I did all the check boxes up to "non-interruptible". I can't continue that part myself due to lack of switch access. Let me know... [22:33:45] RainbowSprinkles: Should be trivial to test/benchmark [22:33:56] Indeed [22:34:06] File a todo and nerd snipe ori into doing it? 
[22:38:15] https://phabricator.wikimedia.org/T169821 [22:44:34] * bd808 bets that parsing all of initializesettings on each request is still a bad idea [22:44:48] Mebbe :) [22:45:10] 10Operations, 10hardware-requests, 10Patch-For-Review: Decommission subra/suhail - https://phabricator.wikimedia.org/T169506#3409696 (10Dzahn) a:05Dzahn>03RobH [22:45:29] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Site-requests: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822#3409698 (10Jdlrobson) [22:46:02] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Site-requests: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822#3409698 (10Reedy) Just run it in a screen on terbium and wait? :) [22:46:48] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Site-requests: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822#3409717 (10Jdlrobson) @Reedy I don't have the ability to do that.. so I guess I need somebody to help me! 
:) [22:46:58] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Site-requests: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822#3409718 (10Jdlrobson) [22:47:20] !log running `mwscript updateArticleCount.php --wiki=commonswiki --update` on screen on terbium T169822 [22:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:31] T169822: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822 [22:47:58] (03PS9) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [22:50:05] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [22:52:05] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.014 second response time [22:52:10] Reedy: Do you mean, someone's failed to go through the process, get clearance, and add it to the whitelist, and now they're surprised? [22:52:29] James_F: Yes. That was me. [22:52:39] Ha. What beta feature? [22:52:49] Syntax highlighting. [22:52:53] CodeMirror [22:52:59] James_F: <3 [22:53:02] Oh, right. [22:53:14] ALSO [22:53:20] Can we document this somewhere more prominently? [22:53:24] How to deploy code? :P [22:53:29] Write a patch and I'll +1. [22:53:40] James_F: The file says "Next to each entry, please note the date **6 months after the last major change**" What if my feature is still in testing phase and I'm putting it out on test wiki and it might still undergo changes? [22:53:49] 10Operations, 10Release-Engineering-Team, 10Wikimedia-Site-requests: Run updateArticleCount.php on Wikimedia Commons - https://phabricator.wikimedia.org/T169822#3409751 (10Reedy) 05Open>03Resolved a:03Reedy ``` reedy@terbium:~$ mwscript updateArticleCount.php --wiki=commonswiki --update Counting articl... [22:53:51] Reedy: it's linked prominently from Beta Features. 
[22:53:54] heh [22:54:12] Niharika: I bump the dates every now and then, just use today+6. [22:54:20] Alright. [22:54:29] I.e. 2018-01-05. [22:55:20] (03PS1) 10Niharika29: Add CodeMirror as a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) [22:56:44] I believe someone has to update the department names now that we've done the reorg dance. [22:56:56] Hi! Is the following a good way to see if some sort of Cdn cache is available and purgable in a Mediawiki setup, from PHP? [22:56:59] if ( $wgSquidServers || $wgHTCPRouting ) { ...do some purges... } [22:57:33] Niharika: Yeah, I'll do that when I remember. [22:57:44] Just not wanting to add unneeded stuff for non-WMF users, while ensuring it works for WMF setup... [22:57:55] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3409771 (10Dzahn) We talked about this and Keith added his own Icinga contact in private repo, then added himself to "sms" group. https://gerrit.wikimedia.org/r/#/c/363044/ Should be all done now. Th... [22:57:56] *not add [22:58:14] Niharika: Hmm. I have some product comments that mean I'll withhold C+1. Want me to file a task or just comment on the gerrit patch? [22:58:37] bblack ema ^ ? [22:58:48] Thx in advance! [22:59:06] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3409772 (10Dzahn) p:05Triage>03Low [22:59:22] AndyRussG: You might want to ask AaronSchulz [22:59:45] 10Operations, 10Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3301040 (10Dzahn) a:05Dzahn>03herron set priority to low. maybe you can close it once you actually received one? [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170705T2300). [23:01:26] James_F: Comments are fine. [23:01:48] Reedy: ah cool thanks! :) [23:05:50] (03CR) 10Jforrester: [C: 04-1] "I think the feature title (codemirror-beta-title) fails to communicate to users what it is doing, and is confusing as we already have synt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363497 (https://phabricator.wikimedia.org/T169284) (owner: 10Niharika29) [23:25:06] (03PS10) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:33:14] (03PS11) 10Paladox: planet: Update css and templates to be modern look [puppet] - 10https://gerrit.wikimedia.org/r/361190 [23:56:43] 10Operations, 10MW-1.30-release-notes, 10Traffic, 10HTTPS, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3410075 (10aaron) >>! In T160616#3404157, @fgiunchedi wrote: > @aaron yeah it will need some love, in the meantime I've patched in https support so swiftrepl w... [23:56:49] 10Operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#3410078 (10aaron) [23:56:51] 10Operations, 10MW-1.30-release-notes, 10Traffic, 10HTTPS, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3410076 (10aaron) 05Open>03Resolved a:03aaron