[00:12:19] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:21:29] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:30:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[00:31:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4794502 keys, up 34 days 16 hours - replication_delay is 0
[00:40:19] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[00:50:29] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[01:13:29] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:27:29] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:41:30] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[01:44:32] PROBLEM - MariaDB disk space on db1047 is CRITICAL: DISK CRITICAL - free space: / 419 MB (5% inode=60%)
[01:45:03] taking a look
[01:54:25] eventlogging logs spam with tls errors
[01:56:29] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[01:57:57] !log move /var/log/eventlogging_sync.err to a symlink on /srv on db1047
[01:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:32] RECOVERY - MariaDB disk space on db1047 is OK: DISK OK
[02:06:37] Operations, DBA: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2845942 (fgiunchedi)
[02:12:38] !log add --skip-ssl to mysql commands on eventlogging_sync on db1047 - T152364
[02:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:56] T152364: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364
[02:14:20] !log add --skip-ssl to mysql commands on eventlogging_sync on dbstore1002 - T152364
[02:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:30] marostegui I've hacked --skip-ssl locally for now, should be straightforward to fix in puppet ^
[02:23:11] * godog off
[02:46:09] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:14:09] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[03:37:35] (PS1) Tim Landscheidt: Tools: Install php5-readline [puppet] - https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519)
[03:53:23] (PS2) Tim Landscheidt: Tools: Install php5-readline [puppet] - https://gerrit.wikimedia.org/r/325251 (https://phabricator.wikimedia.org/T136519)
[04:15:19] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=600.30 Read Requests/Sec=512.60 Write Requests/Sec=14.70 KBytes Read/Sec=42061.60 KBytes_Written/Sec=242.80
[04:21:19] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.70 Read Requests/Sec=0.00 Write Requests/Sec=1.00 KBytes Read/Sec=0.00 KBytes_Written/Sec=6.80
[04:50:49] PROBLEM - MegaRAID on ms1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded)
[04:51:00] ACKNOWLEDGEMENT - MegaRAID on ms1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T152367
[04:51:03] Operations, ops-eqiad: Degraded RAID on ms1001 - https://phabricator.wikimedia.org/T152367#2846052 (ops-monitoring-bot)
[05:04:10] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:32:10] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[05:53:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[05:54:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:58:30] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 78 probes of 411 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:59:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[06:05:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:13:30] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 411 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:15:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[06:21:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:31:50] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 2 minutes ago with 9 failures. Failed resources (up to 3 shown): Service[ssh],Service[nagios-nrpe-server],Package[tzdata],Service[zotero]
[06:42:20] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:53:50] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apt-transport-https]
[06:59:50] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:06:53] (PS1) Marostegui: labsdb-replica: Disable parallel replication [puppet] - https://gerrit.wikimedia.org/r/325255 (https://phabricator.wikimedia.org/T152194)
[07:09:54] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4787/" [puppet] - https://gerrit.wikimedia.org/r/325255 (https://phabricator.wikimedia.org/T152194) (owner: Marostegui)
[07:10:20] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[07:13:10] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[07:16:45] (PS1) Marostegui: eventlogging_sync: By pass ssl check on localhost [puppet] - https://gerrit.wikimedia.org/r/325257 (https://phabricator.wikimedia.org/T152364)
[07:17:25] Operations, DBA, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846176 (Marostegui) Thanks for taking care of this. I have submitted a patch to skip this check: https://gerrit.wikimedia.org/r/325257 I am not completely aware of the who...
[07:19:34] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/4788/" [puppet] - https://gerrit.wikimedia.org/r/325257 (https://phabricator.wikimedia.org/T152364) (owner: Marostegui)
[07:40:44] (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/325260 (https://phabricator.wikimedia.org/T148967)
[07:41:37] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/325260 (https://phabricator.wikimedia.org/T148967) (owner: Marostegui)
[07:42:14] (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/325260 (https://phabricator.wikimedia.org/T148967) (owner: Marostegui)
[07:42:30] PROBLEM - puppet last run on elastic1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:45:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 - T148967 (duration: 02m 12s)
[07:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:31] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[07:48:12] !log Deploy alter table db1082 - dewiki.revision - T148967
[07:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:13] !log Stop MySQL labsdb1010 - maintenance T152194
[08:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:27] T152194: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194
[08:06:10] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[08:06:20] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 0 down 3
[08:06:28] ^ that is me
[08:10:30] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:10:31] RECOVERY - puppet last run on elastic1037 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[08:12:19] (CR) Giuseppe Lavagetto: [C: 1] kubelet: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324210 (owner: Alexandros Kosiaris)
[08:14:01] (CR) Giuseppe Lavagetto: [C: 1] Kube-proxy: Amend to support more than labs [puppet] - https://gerrit.wikimedia.org/r/324211 (owner: Alexandros Kosiaris)
[08:21:10] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:30] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[08:49:10] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:53:32] (PS3) Giuseppe Lavagetto: docker: add package provider [puppet] - https://gerrit.wikimedia.org/r/323815
[09:00:11] (CR) Giuseppe Lavagetto: [C: 2] docker: add package provider [puppet] - https://gerrit.wikimedia.org/r/323815 (owner: Giuseppe Lavagetto)
[09:05:09] (PS1) Yuvipanda: labs: Add db structure for keeping info about labsdb accounts [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933)
[09:07:42] (PS2) Yuvipanda: labs: Add db structure for keeping info about labsdb accounts [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933)
[09:11:59] am I here?
[09:12:00] .
[09:12:13] you are
[09:13:18] (CR) Alexandros Kosiaris: [C: 2] admin: Update my (=legoktm) dotfiles [puppet] - https://gerrit.wikimedia.org/r/325134 (owner: Legoktm)
[09:13:21] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:13:22] (PS2) Alexandros Kosiaris: admin: Update my (=legoktm) dotfiles [puppet] - https://gerrit.wikimedia.org/r/325134 (owner: Legoktm)
[09:13:25] (CR) Alexandros Kosiaris: [V: 2] admin: Update my (=legoktm) dotfiles [puppet] - https://gerrit.wikimedia.org/r/325134 (owner: Legoktm)
[09:14:20] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[09:16:46] (PS2) Jcrespo: eventlogging_sync: By pass ssl check on localhost [puppet] - https://gerrit.wikimedia.org/r/325257 (https://phabricator.wikimedia.org/T152364) (owner: Marostegui)
[09:16:47] .
[09:17:24] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T152339#2846313 (Volans) p:Triage>High Although it might look a false positive, it's actually true, there is a failed disk in the MD array: ``` bast3001 0 ~$ cat /proc/mdstat Personalities : [raid1] md2 : acti...
[09:17:33] YuviPanda: ack
[09:17:39] (CR) Jcrespo: [C: 2] eventlogging_sync: By pass ssl check on localhost [puppet] - https://gerrit.wikimedia.org/r/325257 (https://phabricator.wikimedia.org/T152364) (owner: Marostegui)
[09:17:40] (I am acking your .)
[09:19:43] ty elukey
[09:20:31] (CR) Jcrespo: labs: Add db structure for keeping info about labsdb accounts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933) (owner: Yuvipanda)
[09:25:42] akosiaris: thank you :)
[09:30:56] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T152339#2846331 (Volans) From dmesg: ``` [ +8.008965] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 [ +0.006609] ata2.00: BMDMA stat 0x25 [ +0.003745] ata2.00: failed command: READ DMA [ +0.004532] ata2....
[09:35:56] (PS3) Yuvipanda: labs: Add db structure for keeping info about labsdb accounts [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933)
[09:41:56] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T152339#2845080 (Marostegui) mdadm marked `sda` as broken but both, `sda` and `sdb` had I/O errors
[09:43:30] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T152339#2846365 (Marostegui) Normally before rebooting you'd just remove the failed disk from the array (`mdadm --manage /dev/md0 --remove /dev/sda1`) and let it boot with the healthy disk, but in this case both disks ar...
[09:46:42] (CR) Elukey: [C: 1] "https://puppet-compiler.wmflabs.org/4790/mc1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/321725 (https://phabricator.wikimedia.org/T147326) (owner: Filippo Giunchedi)
[09:56:10] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[09:58:50] PROBLEM - Host bast3001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:02:00] (CR) Elukey: [C: 1] "LGTM, I added the patch to deployment-prep's puppet master and ran puppet on deployment-mediawiki05.deployment-prep.eqiad.wmflabs. As far " [puppet] - https://gerrit.wikimedia.org/r/324642 (https://phabricator.wikimedia.org/T111934) (owner: Filippo Giunchedi)
[10:03:02] (CR) Elukey: [C: 1] role::mediawiki::jobrunner: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/320549 (owner: Muehlenhoff)
[10:11:00] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
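The bast3001 comments above read the array state out of /proc/mdstat, but the pasted output is truncated in this log. As a rough sketch of how a degraded MD array can be spotted from that file, the snippet below scans for a member-status string such as `[_U]`. The function name `degraded_arrays` and the `SAMPLE` text are hypothetical illustrations, not the actual output from bast3001.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array_name: status_string} for arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array block starts with a header such as "md0 : active raid1 ...".
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The following line carries e.g. "[2/2] [UU]" (healthy) or "[2/1] [_U]".
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):
                degraded[current] = status.group(3)
            current = None
    return degraded

# Made-up sample in the usual /proc/mdstat layout (assumption, not real data).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      487104 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[1] sda3[0]
      975296 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

An underscore in the status string marks a slot with no working member, which is the condition the RAID check reports. Note that in this incident both disks reportedly had I/O errors, so a clean-looking mdstat would not by itself have proven the disks healthy.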
[10:11:50] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[10:14:59] (PS1) Jcrespo: Add user and passwords for labspuppet and labsdbaccounts [puppet] - https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377)
[10:15:44] (CR) jenkins-bot: [V: -1] Add user and passwords for labspuppet and labsdbaccounts [puppet] - https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377) (owner: Jcrespo)
[10:16:49] (CR) Marostegui: [C: 1] "To test it we can use: adywiki.user on db1095:" [puppet] - https://gerrit.wikimedia.org/r/325176 (https://phabricator.wikimedia.org/T152194) (owner: Jcrespo)
[10:17:40] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[10:18:26] (PS2) Jcrespo: Add user and passwords for labspuppet and labsdbaccounts [puppet] - https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377)
[10:18:30] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4785145 keys, up 35 days 1 hours - replication_delay is 32
[10:19:28] (CR) Marostegui: Add user and passwords for labspuppet and labsdbaccounts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377) (owner: Jcrespo)
[10:22:00] RECOVERY - Host bast3001 is UP: PING OK - Packet loss = 0%, RTA = 83.84 ms
[10:22:10] RECOVERY - MD RAID on bast3001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[10:25:10] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[10:25:34] (CR) Jcrespo: [C: 1] labsdb-replica: Disable parallel replication [puppet] - https://gerrit.wikimedia.org/r/325255 (https://phabricator.wikimedia.org/T152194) (owner: Marostegui)
[10:25:50] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[10:27:41] !log enabling trace logging on indices recovery on elasticsearch codfw - T145065
[10:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:55] T145065: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065
[10:28:35] (PS2) Marostegui: labsdb-replica: Disable parallel replication [puppet] - https://gerrit.wikimedia.org/r/325255 (https://phabricator.wikimedia.org/T152194)
[10:29:35] Operations, DBA, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846422 (jcrespo) This was caused by the cert expiration on all analytics hosts, making all mysql connections from other databases to fail. This was part of the mitigation o...
[10:30:01] Operations, DBA, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2841012 (jcrespo)
[10:30:03] Operations, DBA, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846424 (jcrespo)
[10:30:20] (CR) Marostegui: [C: 2] labsdb-replica: Disable parallel replication [puppet] - https://gerrit.wikimedia.org/r/325255 (https://phabricator.wikimedia.org/T152194) (owner: Marostegui)
[10:36:26] !log joal@tin Starting deploy [analytics/refinery@2c3b78c]: (no message)
[10:36:35] (PS1) Jcrespo: Renew expired TLS certificate for eventlogging hosts [puppet] - https://gerrit.wikimedia.org/r/325273 (https://phabricator.wikimedia.org/T152364)
[10:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:45] !log joal@tin Finished deploy [analytics/refinery@2c3b78c]: (no message) (duration: 02m 19s)
[10:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:53] (PS1) Jcrespo: Add fake passwords for labspuppet and labsdbaccounts databases [labs/private] - https://gerrit.wikimedia.org/r/325274 (https://phabricator.wikimedia.org/T152377)
[10:41:55] (CR) Marostegui: [C: 1] Renew expired TLS certificate for eventlogging hosts [puppet] - https://gerrit.wikimedia.org/r/325273 (https://phabricator.wikimedia.org/T152364) (owner: Jcrespo)
[10:42:46] (CR) Jcrespo: [C: 2 V: 2] Add fake passwords for labspuppet and labsdbaccounts databases [labs/private] - https://gerrit.wikimedia.org/r/325274 (https://phabricator.wikimedia.org/T152377) (owner: Jcrespo)
[10:45:22] Operations, LDAP-Access-Requests, TCB-Team, Wikidata, WMDE-QWERTY-Team-Board: Add Andrew and Aleksey to ldap/wmde group - https://phabricator.wikimedia.org/T152088#2846456 (Abraham) Hereby I confirm: Andrew and Aleksey are working for WMDE.
[10:45:37] (PS1) Urbanecm: Create import sources list for hsbwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/325275 (https://phabricator.wikimedia.org/T152382)
[10:45:49] (PS2) Jcrespo: mariadb: Update check private data script to handle BINARY fields [puppet] - https://gerrit.wikimedia.org/r/325176 (https://phabricator.wikimedia.org/T152194)
[10:47:42] (PS1) Gehel: elasticsearch - upgrade codfw cluster to Java 8 [puppet] - https://gerrit.wikimedia.org/r/325276 (https://phabricator.wikimedia.org/T151325)
[10:51:04] (CR) Jcrespo: [C: 1] "Yuvi, looking at this I think you are one of the better mysql db schema designers on the WMF." [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933) (owner: Yuvipanda)
[10:52:13] (CR) Yuvipanda: ":D Step 1: Pick a simple problem to solve...." [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933) (owner: Yuvipanda)
[10:52:50] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[10:53:41] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/325277
[10:54:04] (PS2) Jcrespo: Renew expired TLS certificate for eventlogging hosts [puppet] - https://gerrit.wikimedia.org/r/325273 (https://phabricator.wikimedia.org/T152364)
[10:54:50] (CR) Jcrespo: [C: 2] mariadb: Update check private data script to handle BINARY fields [puppet] - https://gerrit.wikimedia.org/r/325176 (https://phabricator.wikimedia.org/T152194) (owner: Jcrespo)
[10:56:35] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/325277 (owner: Marostegui)
[10:57:10] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/325277 (owner: Marostegui)
[10:57:38] (PS1) Alex Monk: Send secondary DNS recursor IP from labs DHCP [puppet] - https://gerrit.wikimedia.org/r/325278 (https://phabricator.wikimedia.org/T137460)
[10:58:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 - T148967 (duration: 00m 57s)
[10:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:47] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[10:59:37] (PS3) Jcrespo: Add user and passwords for labspuppet and labsdbaccounts [puppet] - https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377)
[11:01:25] (PS1) Jcrespo: Revert "eventlogging_sync: By pass ssl check on localhost" [puppet] - https://gerrit.wikimedia.org/r/325279
[11:02:05] (CR) Jcrespo: [C: -2] "This is blocked by the restart of db1046, db1047 and dbstore1002" [puppet] - https://gerrit.wikimedia.org/r/325279 (owner: Jcrespo)
[11:04:01] Operations, DBA, Patch-For-Review: db1047 out of disk space, eventlogging_sync spam - https://phabricator.wikimedia.org/T152364#2846486 (Marostegui) I have re-enabled puppet and ran it to pick up the commit.
[11:24:22] (CR) Reedy: [C: 2] Avoid using CONTENT_MODEL_FLOW_BOARD [mediawiki-config] - https://gerrit.wikimedia.org/r/325152 (owner: Catrope)
[11:25:00] (Merged) jenkins-bot: Avoid using CONTENT_MODEL_FLOW_BOARD [mediawiki-config] - https://gerrit.wikimedia.org/r/325152 (owner: Catrope)
[11:25:18] (PS2) Urbanecm: Create import sources list for hsbwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/325275 (https://phabricator.wikimedia.org/T152382)
[11:26:28] !log reedy@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 46s)
[11:26:31] !log that was "Avoid using CONTENT_MODEL_FLOW_BOARD" for T152379
[11:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:53] T152379: Beta update.php fails:The content model 'CONTENT_MODEL_FLOW_BOARD' is not registered on this wiki. - https://phabricator.wikimedia.org/T152379
[11:28:20] RECOVERY - Juniper alarms on asw2-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[11:30:18] Operations, ops-eqiad, netops: asw2-d-eqiad.mgmt.eqiad - JNX_ALARMS CRITICAL - 2 red alarms, - https://phabricator.wikimedia.org/T152182#2846529 (faidon) Thanks @Cmjohnson! I think the other alarm is for a second, SFP management port that the QFX have (at least from what I can see at http://www.junip...
[11:30:20] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:31:20] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
[11:58:23] !next
[11:58:26] @next
[12:00:04] addshore: Respected human, time to deploy ElectronPdfService extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T1200). Please do the needful.
[12:00:04] Addshore: A patch you scheduled for ElectronPdfService extension is about to be deployed. Please be available during the process.
[12:17:59] (PS2) Addshore: Enable ElectronPdfService extension on test wikis & mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324487 (https://phabricator.wikimedia.org/T150944)
[12:18:32] !log addshore@tin Synchronized php-1.29.0-wmf.4/extensions/ElectronPdfService/specials/SpecialElectronPdf.php: {{gerrit|324791}} Use prefixedDbKey when redirecting to Electron (duration: 00m 45s)
[12:18:41] (CR) Addshore: [C: 2] Enable ElectronPdfService extension on test wikis & mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324487 (https://phabricator.wikimedia.org/T150944) (owner: Addshore)
[12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:12] (Merged) jenkins-bot: Enable ElectronPdfService extension on test wikis & mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324487 (https://phabricator.wikimedia.org/T150944) (owner: Addshore)
[12:22:26] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: {{gerrit|324487}} T150944 Enable ElectronPdfService extension on test wikis & mediawikiwiki PT1 (duration: 00m 45s)
[12:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:38] T150944: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944
[12:23:43] !log addshore@tin Synchronized wmf-config/CommonSettings.php: {{gerrit|324487}} T150944 Enable ElectronPdfService extension on test wikis & mediawikiwiki PT2 (duration: 00m 44s)
[12:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:39] bah, hashar around?
[12:25:12] or Reedy :)
[12:25:36] yes
[12:26:06] I believe I have missed something, (no messages) https://test.wikipedia.org/w/index.php?title=Special:ElectronPdf&page=Main+Page&action=show-selection-screen&coll-download-url=%2Fw%2Findex.php%3Ftitle%3DSpecial%3ABook%26bookcmd%3Drender_article%26arttitle%3DMain%2BPage%26returnto%3DMain%2BPage%26oldid%3D277777%26writer%3Drdf2latex
[12:26:21] guess your scap hasn't rebuilt cdb files
[12:27:10] ahh, it needs to be in wmf-config/extension-list for that to work right?
[12:27:12] addshore: must say I barely know how the message cache is regenerated / invalidated
[12:27:36] probably yes
[12:28:51] {{doing}}
[12:29:46] (PS1) Addshore: Add ElectronPdfService to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/325288
[12:29:55] (CR) Addshore: [C: 2] Add ElectronPdfService to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/325288 (owner: Addshore)
[12:30:09] and I think scap has a subcommand to refresh the l10n messages cache
[12:30:31] it looks like sync-l10n
[12:30:54] (Merged) jenkins-bot: Add ElectronPdfService to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/325288 (owner: Addshore)
[12:33:06] but "scap sync-l10n '1.29.0-wmf.4'" seems to not work
[12:36:07] (PS1) Ema: Release 4.1.4-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/325289
[12:36:11] mhhm, sync-l10n failed: 'Namespace' object has no attribute 'message'
[12:36:59] hashar: I could do a full sync..?
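The `'Namespace' object has no attribute 'message'` failure reported at 12:36:11 is the error Python raises when code reads an option from an `argparse.Namespace` that the parser never defined. The sketch below reproduces that class of failure with a hypothetical parser that only accepts a positional `version`, and shows one defensive way to read an optional attribute. It is not scap's code; `build_parser`, `lock_reason`, and the `"(no message)"` fallback are assumptions made for illustration.

```python
import argparse

def build_parser():
    """Hypothetical parser that defines only a positional 'version' argument."""
    parser = argparse.ArgumentParser(prog="sync-l10n-sketch")
    parser.add_argument("version", help="MediaWiki version (eg 1.27.0-wmf.7)")
    return parser

def lock_reason(arguments, default="(no message)"):
    """Read an optional 'message' attribute without assuming the parser set it."""
    return getattr(arguments, "message", default)

args = build_parser().parse_args(["1.29.0-wmf.4"])

try:
    # Direct attribute access fails because no 'message' argument was declared.
    args.message
except AttributeError as exc:
    print("direct access failed:", exc)

# The guarded lookup falls back to the default instead of raising.
print("lock reason:", lock_reason(args))
```

Declaring the missing argument on the parser (with a default) would be the other obvious remedy; which one fits depends on whether the subcommand is meant to accept a message at all.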
[12:39:06] :(
[12:39:36] at least it managed to update a bunch of json files
[12:40:14] the Namespace object would be representing the arguments passed to scap sync-l10n
[12:41:39] there is only 1 arg, "version MediaWiki version (eg 1.27.0-wmf.7)"
[12:41:44] aeo
[12:41:47] tried
[12:41:54] File "/usr/lib/python2.7/dist-packages/scap/main.py", line 38, in main
[12:41:54] with utils.lock(self.config['lock_file'], self.arguments.message):
[12:41:54] AttributeError: 'Namespace' object has no attribute 'message'
[12:41:55] so yeah
[12:41:58] that is a bug in scap
[12:42:09] so I guess I'll do a full scap now and file a bug?
[12:42:14] scap sync-l10n eventually tries to set a lock
[12:42:18] but without any message
[12:42:23] and the lock system fails
[12:42:51] worth reporting as a bug
[12:43:16] I guess that relates to https://lists.wikimedia.org/pipermail/wikitech-l/2016-November/087107.html
[12:44:16] yup
[12:44:36] !log addshore@tin Started scap: Add ElectronPdfService to extensions-list, sync-l10n seems to have a bug
[12:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:47] right, *goes to do the bug*
[12:44:59] maybe l10n-update could do it
[12:45:24] !log addshore@tin scap aborted: Add ElectronPdfService to extensions-list, sync-l10n seems to have a bug (duration: 00m 47s)
[12:45:33] hashar: you think?
[12:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:24] wait, I think the combination of l10n-update and then sync-l10n is needed
[12:47:46] !log addshore@tin Started scap: Add ElectronPdfService to extensions-list, sync-l10n seems to have a bug. (Take 2)
[12:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:35] (PS4) Yuvipanda: labs: Add db structure for keeping info about labsdb accounts [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933)
[12:54:32] (CR) Yuvipanda: [C: 2] labs: Add db structure for keeping info about labsdb accounts [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933) (owner: Yuvipanda)
[12:58:53] (CR) Yuvipanda: [C: 2] labs: Add db structure for keeping info about labsdb accounts [puppet] - https://gerrit.wikimedia.org/r/325268 (https://phabricator.wikimedia.org/T149933) (owner: Yuvipanda)
[12:59:33] Operations, ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T152339#2846678 (faidon) Open>Resolved a:faidon I force-rebooted it. After coming back up, the filesystem was dirty so it dropped me to an initramfs shell (in the iDRAC console). I ran fsck manually and reboo...
[13:04:00] !log Stopping mysql labsdb1010 and labsdb1009 for maintenance - T152194
[13:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:13] T152194: Provision sanitized data on labsdb1009, labsdb1010, labsdb1011 with from db1095 - https://phabricator.wikimedia.org/T152194
[13:08:30] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:10:10] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 0 down 3
[13:10:48] ^ me
[13:14:20] PROBLEM - HHVM rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:14:40] PROBLEM - Apache HTTP on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:14:54] addshore: Yeah, Chad is trying to kill the extension-list files. Not gone yet
[13:16:40] PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:17:00] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:19:16] (PS1) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/325294 (https://phabricator.wikimedia.org/T148967)
[13:20:56] (PS2) Marostegui: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/325294 (https://phabricator.wikimedia.org/T148967)
[13:22:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/325294 (https://phabricator.wikimedia.org/T148967) (owner: Marostegui)
[13:23:40] (Merged) jenkins-bot: db-eqiad.php: Depool db1087 [mediawiki-config] - https://gerrit.wikimedia.org/r/325294 (https://phabricator.wikimedia.org/T148967) (owner: Marostegui)
[13:24:50] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:25:38] addshore: are you deploying mediawiki now?
[13:25:47] marostegui: yes
[13:26:16] Oh! 12UTC
[13:26:25] My bad
[13:26:36] 86% of the way through sync-apaches
[13:26:47] will probably run over the 1.5 window slightly!
[13:26:55] addshore: ok, I just need to deploy db-eqiad.php not in a rush
[13:27:01] okay!
[13:27:12] marostegui: I'll ping you when the sync is done!
[13:27:37] addshore: thanks, and sorry for being in the middle, I checked the deployment page but missed that 14:00 was Monday 28th not 5th Dec [13:27:57] haha, no worries :) [13:31:57] (03CR) 10Yuvipanda: [C: 031] Add user and passwords for labspuppet and labsdbaccounts [puppet] - 10https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377) (owner: 10Jcrespo) [13:35:31] (03PS8) 10Thiemo Mättig (WMDE): Move interwiki sorting orders to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) (owner: 10Aude) [13:37:14] (03PS4) 10Jcrespo: Add user and passwords for labspuppet and labsdbaccounts [puppet] - 10https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377) [13:37:30] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [13:38:21] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "I can confirm that the code in this patch is 100% identical to the same configuration in Wikibase as it is right now (see Ic9d56e8 for the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) (owner: 10Aude) [13:39:18] it's not swat yet? [13:39:24] nope [13:39:31] * aude adds stuff [13:40:14] (03PS3) 10Hashar: Move EasyTimeline config to its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321493 (https://phabricator.wikimedia.org/T22825) [13:41:00] !log addshore@tin Finished scap: Add ElectronPdfService to extensions-list, sync-l10n seems to have a bug. 
(Take 2) (duration: 53m 13s) [13:41:06] marostegui: ^^ its all yours [13:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:21] (03PS4) 10Hashar: Drop '.ttf' from $wgTimelineFontFile settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) [13:41:23] (03PS3) 10Hashar: Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) [13:41:30] addshore: thanks! [13:42:18] (03CR) 10Jcrespo: [C: 032] Add user and passwords for labspuppet and labsdbaccounts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/325271 (https://phabricator.wikimedia.org/T152377) (owner: 10Jcrespo) [13:42:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1087 - T148967 (duration: 00m 44s) [13:42:24] (03CR) 10jenkins-bot: [V: 04-1] Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [13:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:30] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [13:42:36] hashar: the Reedy hashar it looks like the full scap sync didn't actually fix my message issues [13:42:50] (03CR) 10Aude: "@thiemo we can deploy this now, so that additions to sorting order don't have to wait for wikidata code deploys." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) (owner: 10Aude) [13:43:50] (03PS5) 10Hashar: Drop '.ttf' from $wgTimelineFontFile settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) [13:43:52] (03PS4) 10Hashar: Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) [13:45:11] 06Operations, 10Phabricator: Phabricator leaving old files in /tmp - https://phabricator.wikimedia.org/T150396#2846753 (10mmodell) I see a few places in the phabricator codebase that create temp files but so far I haven't been able to correlate any of it with the files currently seen in /tmp What I find espec... [13:45:20] (03PS1) 10Addshore: Disable ElectronPdfService on mw.org until messages are fixed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325298 (https://phabricator.wikimedia.org/T150944) [13:46:13] marostegui: and once you are done I'll sync 1 more change to complete my window [13:46:22] addshore: I am done :) [13:46:31] Great! [13:46:39] (03CR) 10Addshore: [C: 032] Disable ElectronPdfService on mw.org until messages are fixed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325298 (https://phabricator.wikimedia.org/T150944) (owner: 10Addshore) [13:47:10] addshore: :( [13:47:12] (03Merged) 10jenkins-bot: Disable ElectronPdfService on mw.org until messages are fixed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325298 (https://phabricator.wikimedia.org/T150944) (owner: 10Addshore) [13:47:16] 06Operations, 10Phabricator: Phabricator leaving old files in /tmp - https://phabricator.wikimedia.org/T150396#2846755 (10mmodell) Also the 8k files almost all contain a single uppercase A or B. The 4k files are just empty directories. 
[13:48:22] !log Deploy alter table db1087 - dewiki.revision - T148967 [13:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:32] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [13:48:57] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: {{gerrit|325298}} T150944 Disable ElectronPdfService extension on mediawikiwiki until messages are fixed (duration: 00m 45s) [13:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:08] T150944: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944 [13:49:14] yeh, hashar, not entirly sure why.... [13:50:09] !log ElectronPdfService extension deploy window finished [13:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:29] (03PS6) 10Hashar: Drop '.ttf' from $wgTimelineFontFile + bump epoch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) [13:51:31] (03PS5) 10Hashar: Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) [13:52:49] (03PS3) 10Jcrespo: Renew expired TLS certificate for eventlogging hosts [puppet] - 10https://gerrit.wikimedia.org/r/325273 (https://phabricator.wikimedia.org/T152364) [13:52:50] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:52:51] (03PS1) 10Jcrespo: labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) [13:53:03] jouncebot: next [13:53:03] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T1400) [13:54:18] oh manene [13:54:20] oojs [13:54:45] * mafk is around for eu swat, poke me when it starts 
[13:55:31] hashar: if you want a hand let me know! [13:55:34] [= [13:55:46] (03CR) 10Daniel Kinzler: [C: 031] "Fine with me, if we want to have the new additions in now. We'll need to touch this again however when we start using the new InterwikiSor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) (owner: 10Aude) [13:55:49] addshore: I would rather NOT touch anything related to l10n / message cache etc [13:55:54] it is just too scary [13:56:21] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2846782 (10elukey) @Cmjohnson I can still see errors in the dmesg :( ``` [Mon Dec 5 10:13:57 2016] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offse... [13:58:26] (03PS2) 10Jcrespo: labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) [13:58:50] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2846783 (10Cmjohnson) @elukey please depool. I will need to reseat the DIMM. I also see an error in the h/w log ------------------------------------------------------------------------------- Record: 2 D... [13:58:58] (03CR) 10Marostegui: [C: 031] labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) (owner: 10Jcrespo) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T1400). [14:00:04] MatmaRex, mafk, aude, and hashar: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[14:00:27] hello [14:00:36] hi [14:01:29] MatmaRex: hello [14:01:44] I am not going to deploy the to
change in oojs [14:02:03] D: [14:02:06] for chrome 55 breakage/regression. Would rather have a review by someone knowing about oojs [14:02:20] bad luck for chrome. I can't really assert the impact changing to a div will have [14:02:31] such as breaking skins / extensions / gadgets or whatever else :( [14:02:44] hashar: it was a div until a couple weeks ago. [14:03:12] yup [14:03:44] but lets hold on oojs folks to review it such as VolkerE (master change https://gerrit.wikimedia.org/r/#/c/325243/ ) [14:03:53] (03PS3) 10Jcrespo: labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) [14:04:07] fair enough. meh [14:04:21] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:04:25] MatmaRex: does Chrome ever does emergency releases ? :D [14:04:36] or are they always on their fixed train of N weeks? [14:04:52] hashar: they don't seem to have any sort of sensible schedule, as far as i know [14:05:17] MatmaRex: lets circle it back with the oojs folks so [14:05:36] MatmaRex: and I guess it can be done at anytime outside of a SWAT slot [14:05:40] (03PS2) 10Hashar: Update interwiki map for fiwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324918 (https://phabricator.wikimedia.org/T152201) (owner: 10Aude) [14:06:13] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324918 (https://phabricator.wikimedia.org/T152201) (owner: 10Aude) [14:06:15] aude: interwikis :/ [14:06:23] scary scary [14:06:40] (03CR) 10Marostegui: [C: 031] labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) (owner: 10Jcrespo) [14:06:41] heh [14:06:47] MatmaRex: what annoys me the most is that you have noticed the issue two weeks ago and they failed to fix it / ack it :/ [14:07:11] 
(03Merged) 10jenkins-bot: Update interwiki map for fiwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324918 (https://phabricator.wikimedia.org/T152201) (owner: 10Aude) [14:07:27] we're supposed to use mwdebug1001? [14:07:33] or mwdebug1002? [14:07:37] no idea [14:07:43] lets pull on both ? :} [14:07:43] hashar: it was reported a month ago, and yes [14:07:45] ok :/ [14:07:55] i no longer bother reporting chrome bugs i keep running into [14:07:56] fiwikivoyage should be trivial [14:07:57] (03PS4) 10Jcrespo: labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) [14:08:05] not worth the effort [14:08:08] and the other is just duplicating default config in wikibase [14:08:11] !log depooling mw1239 for maintenance (T148421) [14:08:18] aude: pulled on both [14:08:21] hashar: mwdebug1002 [14:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:22] T148421: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421 [14:08:23] checking [14:08:51] oh https://en.wikivoyage.org/wiki/Special:Interwiki !! [14:08:52] The 1001 replace mw1017, so it's for general purpose tests, 1002 replaces 1099, so it's for deployments. [14:09:06] ah nice thx Dereckson [14:09:34] looks good [14:09:39] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2846797 (10elukey) ``` elukey@puppetmaster1001:~$ sudo -i confctl --quiet select 'name=mw1239.eqiad.wmnet' get {"mw1239.eqiad.wmnet": {"pooled": "no", "weight": 20}, "tags": "dc=eqiad,cluster=appserver,service... [14:09:56] e.g. 
https://fi.wikivoyage.org/wiki/Sydney now has interwiki links to other wikivoyage [14:11:23] (03PS2) 10Hashar: Enable $wgAbuseFilterProfile for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324701 (https://phabricator.wikimedia.org/T152087) (owner: 10MarcoAurelio) [14:11:41] !log hashar@tin Synchronized wmf-config/interwiki.php: Update interwiki map for fiwikivoyage - T152201 (duration: 00m 46s) [14:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:51] aude: done [14:11:51] T152201: Update interwikimap of Finnish Wikivoyage - https://phabricator.wikimedia.org/T152201 [14:12:05] ok [14:12:06] mafk: doing Enable $wgAbuseFilterProfile for eswiki [14:12:07] * aude checks again [14:12:16] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324701 (https://phabricator.wikimedia.org/T152087) (owner: 10MarcoAurelio) [14:12:21] it's good [14:12:22] hashar: I'm here for test [14:12:34] let me know when it's on mw1002 [14:12:57] (03Merged) 10jenkins-bot: Enable $wgAbuseFilterProfile for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324701 (https://phabricator.wikimedia.org/T152087) (owner: 10MarcoAurelio) [14:13:19] mafk: it is on the test server [14:13:27] hashar: checking [14:13:30] and I have no idea how to check it hehe [14:13:58] i'm using the wikimedia debug browser plugin [14:14:11] I too [14:14:16] but abusefilter is a mystery to me :D [14:14:26] there's https://wikitech-static.wikimedia.org/wiki/X-Wikimedia-Debug [14:14:30] oh, that [14:14:33] no idea how to check [14:14:50] hashar: looks good to me [14:14:56] k syncing [14:15:10] thx [14:15:31] (03PS7) 10Hashar: Drop '.ttf' from $wgTimelineFontFile + bump epoch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) [14:15:33] (03PS6) 10Hashar: Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558
(https://phabricator.wikimedia.org/T22825) [14:15:35] (03PS1) 10Marostegui: mariadb: Added gtid_domain_id variable [puppet] - 10https://gerrit.wikimedia.org/r/325303 (https://phabricator.wikimedia.org/T149418) [14:15:37] (03PS4) 10Hashar: Move EasyTimeline config to its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321493 (https://phabricator.wikimedia.org/T22825) [14:15:38] !log hashar@tin Synchronized wmf-config/abusefilter.php: Enable $wgAbuseFilterProfile for eswiki - T152087 (duration: 00m 44s) [14:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:51] T152087: Enable $wgAbuseFilterProfile for eswiki - https://phabricator.wikimedia.org/T152087 [14:17:03] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321493 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [14:18:26] (03Merged) 10jenkins-bot: Move EasyTimeline config to its own file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321493 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [14:19:26] (03CR) 10Marostegui: [C: 031] labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) (owner: 10Jcrespo) [14:20:16] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/4794/" [puppet] - 10https://gerrit.wikimedia.org/r/325303 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [14:21:54] !log hashar@tin Synchronized wmf-config/timeline.php: Move EasyTimeline config to its own file - T22825 (duration: 00m 44s) [14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:07] T22825: Change default font for EasyTimeline on zh projects to something that actually has glyphs for Chinese characters - https://phabricator.wikimedia.org/T22825 [14:23:24] !log hashar@tin Synchronized wmf-config/CommonSettings.php: Move EasyTimeline config to its own file - T22825 
(duration: 00m 44s) [14:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:13] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [14:24:18] hashar: do you want me to deploy the interwiki sorting config when you are done? [14:24:33] sure [14:24:36] ok [14:24:43] if you're sure enough about it [14:24:46] (03Merged) 10jenkins-bot: Drop '.ttf' from $wgTimelineFontFile + bump epoch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321561 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [14:25:58] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2846869 (10Tobi_WMDE_SW) [14:26:14] !log hashar@tin Synchronized fonts: For T22825 (duration: 00m 47s) [14:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:28] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2846887 (10Tobi_WMDE_SW) [14:26:37] testing my change [14:28:07] nieat [14:29:33] (03CR) 10Ottomata: [C: 031] ":D" [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [14:30:06] !log hashar@tin Synchronized wmf-config/timeline.php: Drop ttf from $wgTimelineFontFile and bump epoch - T22825 (duration: 00m 47s) [14:30:16] hmm [14:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:19] T22825: Change default font for EasyTimeline on zh projects to something that actually has glyphs for Chinese characters - https://phabricator.wikimedia.org/T22825 [14:30:50] (03CR) 10Hashar: [C: 032] "Other changes have been deployed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [14:30:58] I was expecting some burst of errors [14:31:34] (03Merged) 10jenkins-bot: Test for $wgTimelineFontFile values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321558 (https://phabricator.wikimedia.org/T22825) (owner: 10Hashar) [14:32:33] test case works https://zh.wikipedia.org/wiki/User:Hashar/T22285 [14:33:20] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:33:59] aude: so interwiki sort ? [14:34:06] yes [14:34:17] i can do if you prefer or you can and i test [14:34:33] be bold and do it ! ;} [14:34:38] ok :) [14:34:52] just to be sure, we're still deploying from tin? [14:35:01] and not something in codfw [14:35:06] * hashar runs away loudly screaming "Interwiiiikiiiiiiiiiii" [14:35:27] I have been doing mine from tin.eqiad.wmnet [14:35:38] the other codfw machine would be mira.codfw.wmnet which would have a huge MOTD [14:35:53] ok [14:36:26] (03CR) 10Aude: [C: 032] Move interwiki sorting orders to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) (owner: 10Aude) [14:36:54] (03Merged) 10jenkins-bot: Move interwiki sorting orders to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) (owner: 10Aude) [14:37:36] (03CR) 10DCausse: [C: 031] elasticsearch - upgrade codfw cluster to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/325276 (https://phabricator.wikimedia.org/T151325) (owner: 10Gehel) [14:38:39] * aude tests on mwdebug [14:39:38] looks good [14:39:39] (03CR) 10Jcrespo: [C: 032] labsdb: Add minor fixes for maintain-dbusers schema [puppet] - 10https://gerrit.wikimedia.org/r/325301 (https://phabricator.wikimedia.org/T149933) (owner: 10Jcrespo) [14:39:44] aude: that is really an ugly change :D [14:40:46] then with multiple
scripts/langs I dont see how we could sort them [14:41:06] !log aude@tin Synchronized wmf-config/Wikibase.php: Add interwiki sorting config from Wikibase (duration: 00m 47s) [14:41:09] (03CR) 10Jcrespo: [C: 031] mariadb: Added gtid_domain_id variable [puppet] - 10https://gerrit.wikimedia.org/r/325303 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [14:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:36] hashar: the orders originally come from meta wiki [14:41:40] (03PS10) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [14:41:59] ideally we take them from there again, such as with a script, like updating interwiki map [14:42:05] !log Restart MySQL labsdb1011 to disable parallel replication [14:42:13] and the community more easily maintains them [14:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:28] looks fine [14:42:36] aude: \O/ [14:42:48] e.g. https://en.wikipedia.org/wiki/Zander has olo (Livvinkarjala) not at the bottom [14:43:01] anyway, this is progress [14:43:06] !log European SWAT done [14:43:08] yeah yeah baby steps :] [14:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:22] I wonder whether MediaWiki has/had a way to sort interwikis [14:43:55] though I remember a discussion in which people preferred to be able to manually maintain the sort order [14:44:16] so maybe pywikibot had a sort system of some sort. 
Anyway good to see that being done centrally [14:44:45] water time [14:44:52] pywikibot uses meta wiki [14:44:59] for namespaces that don't have wikibase support [14:45:12] (03CR) 10Elukey: [C: 032] Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [14:45:27] we're moving the sorting code out of wikibase, so maybe it can be applied everywhere w/o bots if the community wants that [14:49:28] (03CR) 10Marostegui: "I have manually tested this in one of the new labs servers and it is all fine, so I will deploy" [puppet] - 10https://gerrit.wikimedia.org/r/325303 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [14:52:30] PROBLEM - MariaDB Slave Lag: m3 on db2012 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.47 seconds [14:52:44] m3 is phabricator [14:53:10] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.26 seconds [14:53:17] if is db1048 again [14:54:11] either the master is doing mass imports or a hw issue [14:54:20] let's see.. [14:54:32] I will try to mitigate by changing the topology of its codfw slave [14:56:35] there are two disks with media errors on db1048 [14:56:38] !log running CHANGE MASTER ON db2012 to base it on db1043 [14:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:25] 2012 is recovering [14:57:30] db1048 is not [14:57:38] which would fit the disk issues [14:58:30] RECOVERY - MariaDB Slave Lag: m3 on db2012 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:59:38] yeah, replication lag is going even worse on 1048 [14:59:45] yes [14:59:56] an no issue on a cross-db host :-) [15:00:00] *dc [15:00:19] so it is not replication [15:00:34] there was a spike in inserts [15:01:00] are you thinking schema drift? 
[15:01:07] or just commenting it [15:01:23] (03CR) 10Anomie: Set $wgSoftBlockRanges (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie) [15:01:33] (03PS2) 10Anomie: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 [15:01:34] jynus: yes, that is what I was thinking :) [15:01:41] I am going to check if db2012 had the same [15:02:01] it does [15:02:16] so might be the disk [15:02:41] well, offline the 2 disks (check first they are not members of the same span) [15:02:52] and we will find out [15:03:15] :-) [15:03:47] It looks like 32:2, the other one has failure events, but not media errors [15:04:01] go with that one [15:04:19] worst case scenario, it does nothing and we rebuild it [15:04:52] also, it is a passive slave [15:05:58] Now, one had media errors too [15:06:01] 32:0 and 32:2 [15:06:05] They are not in the same spans [15:06:22] do one first [15:06:24] we wait [15:06:27] then we do the other [15:06:28] going to do 32:0 [15:06:31] need help? [15:06:35] no [15:06:43] I will post the command here before running it [15:06:55] for a last review :) [15:07:00] remember you are the hw expert of the 2 [15:07:03] :-) [15:07:06] Am I? [15:07:26] megacli -PDOffline -PhysDrv \[32:0\] -aALL [15:07:28] sounds good?
[15:07:55] yep [15:07:56] 32:0 has: Media Error Count: 17 [15:08:03] so let's go for that one first [15:08:18] done [15:08:50] !log db1048 - set disk 32:0 offline [15:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:28] let's wait a bit [15:12:05] the issues are not new, I think it only went to a critical point: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&from=1480915548143&to=1480950690922&var-dc=eqiad%20prometheus%2Fops&var-server=db1048 [15:13:02] wow [15:13:07] that is a big one [15:15:22] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2847119 (10Ottomata) Hm, you know...stat1002 and stat1003 are out of warranty. We could add this card to stat1004, that would be totally fine. If we... [15:16:22] (03PS2) 10Elukey: Add openjdk-8-jdk to the list of statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/324679 (https://phabricator.wikimedia.org/T151896) [15:17:03] (03CR) 10Ottomata: [C: 031] "Lemme know if I should merge now" [puppet] - 10https://gerrit.wikimedia.org/r/325122 (https://phabricator.wikimedia.org/T141785) (owner: 10Alex Monk) [15:17:20] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.08 seconds [15:17:41] (03CR) 10Elukey: [C: 032] Add openjdk-8-jdk to the list of statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/324679 (https://phabricator.wikimedia.org/T151896) (owner: 10Elukey) [15:18:26] 06Operations, 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2847127 (10Ottomata) YES! Glad I could help! 
:D [15:21:10] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [15:22:10] marostegui, it is not getting better, is it? [15:22:16] nop [15:22:20] and dbstore1002 is lagging too [15:22:37] oh, it just recovered [15:22:40] yeah, it is because it is the slave of db1048 [15:22:45] I can try to mark 32:2 as failed too [15:22:55] I can change that, but I need to stop replication on db1048 [15:23:04] and do not want to do it now [15:23:10] PROBLEM - MegaRAID on db1048 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:23:12] ACKNOWLEDGEMENT - MegaRAID on db1048 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T152411 [15:23:15] 06Operations, 10ops-eqiad: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847147 (10ops-monitoring-bot) [15:23:36] marostegui, do it, if it doesn't help [15:23:39] 06Operations, 10ops-eqiad: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847147 (10Marostegui) We marked 32:0 as failed manually as the server is lagging. [15:23:51] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847153 (10Marostegui) [15:23:55] ok, I will do it now [15:24:02] we will rebuild the disks, stop replication, debug [15:24:07] Let me double check again that it is in a different span [15:24:58] it is [15:25:03] so going to set it offline [15:25:37] done [15:25:55] !log Set disk 32:2 as failed db1048 - T152411 [15:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:06] T152411: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411 [15:26:14] it is going down [15:26:29] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847171 (10Marostegui) We have also marked 32:2 as failed. Both disks had media error, can we get them replaced? 
[15:27:06] and now up again? [15:27:11] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.39% of data above the critical threshold [1000.0] [15:27:11] it is moving between 620 and 630 yes [15:27:20] going up and down :( [15:27:30] 06Operations, 10Analytics, 13Patch-For-Review: Install java 8 to stat1002 - https://phabricator.wikimedia.org/T151896#2847172 (10elukey) 05Open>03Resolved a:03elukey @EBernhardson: ``` elukey@stat1002:~$ update-alternatives --display java java - manual mode link currently points to /usr/lib/jvm/java-... [15:29:10] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.39% of data above the critical threshold [1000.0] [15:34:54] jynus: now it is going really fast down \o/ [15:35:05] I guess it finished rescanning the disks [15:35:36] not 100% sure it is hw [15:36:17] it could be, or it could be the higher throughput events stopped [15:36:20] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.76 seconds [15:37:10] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 44.20 seconds [15:37:29] https://grafana.wikimedia.org/dashboard/db/mysql?from=1480930642726&to=1480952242727&panelId=3&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1048 [15:37:39] or it could be both [15:38:27] 06Operations, 10Analytics, 13Patch-For-Review: Install java 8 to stat1002 - https://phabricator.wikimedia.org/T151896#2847201 (10EBernhardson) Thanks! 
[15:38:29] but see: https://grafana.wikimedia.org/dashboard/db/mysql?from=now-6h&to=now&panelId=3&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-server=db1043 [15:38:47] so probably something querying the slave is the cause of this [15:39:00] and the disks may have contributed to make it slow [15:39:38] (03PS3) 10Giuseppe Lavagetto: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 (owner: 10Alexandros Kosiaris) [15:39:50] actually, the queries are high on the master too [15:40:02] it is just maked by high reads [15:40:05] *masked [15:46:58] (03CR) 10Yuvipanda: "Minor nitpicks only, but it looks like this will cause / require all tools k8s nodes to restart kubelet / kube-proxy. If so let's schedule" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324210 (owner: 10Alexandros Kosiaris) [15:49:17] (03CR) 10Yuvipanda: "Same thing as earlier - mostly LGTM, but let's co-ordinate a restart on tools" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324211 (owner: 10Alexandros Kosiaris) [15:49:44] (03CR) 10Hashar: Set $wgSoftBlockRanges (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie) [15:49:53] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Ganglia varnishkafka python module crashing repeatedly - https://phabricator.wikimedia.org/T152093#2847242 (10elukey) 05Open>03Resolved Merged my changes for varnishkafka statsd monitoring, and Ema cleaned up via salt all the varnishkafka... 
[15:51:33] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Ganglia varnishkafka python module crashing repeatedly - https://phabricator.wikimedia.org/T152093#2847250 (10elukey) 05Resolved>03Open (needs to be moved to the right column of analytics kanban) [16:00:34] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2847282 (10bd808) With volume that high we might need to setup a specific logstash or other process just to handle the filtering. It looks like kafkatee has some filtering support; maybe it could b... [16:01:50] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:02:41] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325313 [16:03:53] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325313 [16:04:00] (03PS1) 10Ottomata: Install zip for statistics::web role (on thorium) [puppet] - 10https://gerrit.wikimedia.org/r/325314 (https://phabricator.wikimedia.org/T149438) [16:05:12] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325313 (owner: 10Marostegui) [16:05:45] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325313 (owner: 10Marostegui) [16:07:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 - T148967 (duration: 00m 59s) [16:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:20] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [16:07:44] (03PS2) 10Ottomata: Install zip for statistics::web role (on thorium) [puppet] - 10https://gerrit.wikimedia.org/r/325314 (https://phabricator.wikimedia.org/T149438) [16:07:56] (03CR) 10Ottomata: [C: 
032 V: 032] Install zip for statistics::web role (on thorium) [puppet] - 10https://gerrit.wikimedia.org/r/325314 (https://phabricator.wikimedia.org/T149438) (owner: 10Ottomata) [16:08:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1087 - T148967 (duration: 00m 49s) [16:08:37] !log Stop mysql db2048 for maintenance - T149553 [16:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:59] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [16:11:06] (03CR) 10Anomie: Set $wgSoftBlockRanges (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie) [16:12:22] !log reloading haproxy on dbproxy1011 to catch the master going back up [16:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:20] all hosts up [16:13:20] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [16:13:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Release 0.0.3 [software/service-checker] - 10https://gerrit.wikimedia.org/r/324139 (owner: 10Volans) [16:14:18] (03PS1) 10Elukey: Add Nagios process alarms for statsv and EL varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/325317 [16:14:53] (03Merged) 10jenkins-bot: Release 0.0.3 [software/service-checker] - 10https://gerrit.wikimedia.org/r/324139 (owner: 10Volans) [16:15:20] (03CR) 10jenkins-bot: [V: 04-1] Add Nagios process alarms for statsv and EL varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/325317 (owner: 10Elukey) [16:15:46] (03PS3) 10Anomie: Set $wgSoftBlockRanges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 [16:16:51] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC is happy at https://puppet-compiler.wmflabs.org/4793/ and I 've done a review myself, merging" [puppet] - 10https://gerrit.wikimedia.org/r/325146 (owner: 
10Tim Landscheidt) [16:16:56] (03PS2) 10Elukey: Add Nagios process alarms for statsv and EL varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/325317 [16:16:58] (03PS2) 10Alexandros Kosiaris: Quote "owner" and "group" attributes for file and git::clone resources [puppet] - 10https://gerrit.wikimedia.org/r/325146 (owner: 10Tim Landscheidt) [16:17:00] (03CR) 10Alexandros Kosiaris: [V: 032] Quote "owner" and "group" attributes for file and git::clone resources [puppet] - 10https://gerrit.wikimedia.org/r/325146 (owner: 10Tim Landscheidt) [16:21:12] 06Operations, 10ops-codfw, 10DBA: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2847394 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete. [16:22:12] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2847396 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [16:22:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] "hmm we got https://phabricator.wikimedia.org/T120159 for actually deprecating the entire puppet module. This creates/moves files in the mo" [puppet] - 10https://gerrit.wikimedia.org/r/325046 (owner: 10Merlijn van Deen) [16:22:33] 06Operations, 10ops-codfw, 10DBA: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2847399 (10Marostegui) Thanks ``` physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Rebuilding) ``` [16:23:07] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2847400 (10Marostegui) Thanks! 
``` physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, Rebuilding) ``` [16:25:25] (03PS3) 10Elukey: Add Nagios process alarms for statsv and EL varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/325317 [16:26:52] (03CR) 10Ottomata: [C: 031] Add Nagios process alarms for statsv and EL varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/325317 (owner: 10Elukey) [16:29:50] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:30:10] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [16:30:24] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2847414 (10Nuria) Let's please not add the card to our current 1002 or 1003 as we plan to replace those machines shortly. [16:31:35] (03CR) 10Ema: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/325317 (owner: 10Elukey) [16:33:17] (03PS4) 10Elukey: Add Nagios process alarms for statsv and EL varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/325317 [16:36:10] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [16:40:10] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 5.08% of data above the critical threshold [1000.0] [16:40:42] (03CR) 10EBernhardson: [C: 031] [cirrus] enable BM25 on all but wikis with spaceless languages [step 1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324738 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [16:40:55] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/325317 (owner: 10Elukey) [16:41:04] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies 
considered harmful) - https://phabricator.wikimedia.org/T144431#2847443 (10faidon) >>! In T144431#2841953, @GWicke wrote: > Eric's wording here is a bit misleading. Ever sinc... [16:42:10] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [16:43:56] (03PS1) 10Volans: Add missing Build-Depends [software/service-checker] - 10https://gerrit.wikimedia.org/r/325319 [16:44:35] volans: let me know if jenkins +2, I am still waiting :/ [16:45:13] elukey: ok, have you checked the queue? [16:46:17] nope in a meeting :( [16:48:52] (03CR) 10Elukey: [C: 032] Add Nagios process alarms for statsv and EL varnishkafka instance [puppet] - 10https://gerrit.wikimedia.org/r/325317 (owner: 10Elukey) [16:49:00] got it [16:49:07] elukey: I got it too [16:49:12] 4 minutes though [16:50:43] !log added nagios process check alarms for varnishkafka-statsv and varnishkafka-eventlogging on cache::text hosts [16:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:56] (03CR) 10EBernhardson: [C: 031] [cirrus] enable BM25 on all but wikis with spaceless languages [step 2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324752 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [16:57:49] (03CR) 10EBernhardson: [C: 031] [cirrus] enable BM25 on all but wikis with spaceless languages [step 3/3] (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324753 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [17:01:04] thcipriani: got a moment to chat? 
:) [17:01:28] addshore: just jumped in a meeting, but I'm around [17:01:43] okay, I'll send you a bunch of pms (no rush) reply when ready ;) [17:01:53] sounds good :D [17:12:17] (03CR) 10Andrew Bogott: "this looks promising :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/325278 (https://phabricator.wikimedia.org/T137460) (owner: 10Alex Monk) [17:16:03] !log restarting hhvm on mw1285 (hhvm-debug in /tmp/hhvm.140129.bt.) [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:47] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.020 second response time [17:16:47] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 77214 bytes in 0.093 second response time [17:17:18] (03PS1) 10Urbanecm: Disable wgAbuseFilterProfile at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325323 [17:17:39] (03PS2) 10Urbanecm: Disable wgAbuseFilterProfile at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325323 (https://phabricator.wikimedia.org/T149899) [17:19:31] !log restarting hhvm on mw1268 (hhvm-debug in /tmp/hhvm.16827.bt.) 
[17:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:47] RECOVERY - Apache HTTP on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.066 second response time [17:21:07] RECOVERY - HHVM rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 77215 bytes in 0.103 second response time [17:21:24] (03PS3) 10Urbanecm: Disable wgAbuseFilterProfile at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325323 (https://phabricator.wikimedia.org/T149899) [17:24:37] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.83 seconds [17:24:39] (03PS2) 10BBlack: varnish: make PURGE more efficient [puppet] - 10https://gerrit.wikimedia.org/r/324270 [17:27:51] marostegui^db1048 is back [17:28:06] yep [17:28:08] I see [17:28:19] we can probably ack it for a couple of days [17:28:29] and debug on Wednesday [17:29:02] interesting, db2012 is not lagging this time [17:29:23] no, as I said, not a rep issue [17:29:31] it is a db1048 issue [17:29:59] config, schema or hw [17:30:23] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 412.02 seconds Marostegui This host is suffering from lag from time to time [17:30:36] ack'ed - let me create a ticket for it [17:31:33] wasn't there already one? [17:31:50] there was one for the degraded raid [17:31:51] https://phabricator.wikimedia.org/T151039 [17:32:03] we should rename it [17:33:26] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1048 - https://phabricator.wikimedia.org/T152411#2847730 (10Marostegui) [17:42:23] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.24 seconds [17:51:33] PROBLEM - puppet last run on mw1274 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:57:23] (03PS1) 10EBernhardson: Turn off CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325329 (https://phabricator.wikimedia.org/T149740) [18:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T1800). [18:00:53] RECOVERY - HP RAID on db2068 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [18:02:56] (03PS3) 10Dzahn: Follow-up I863367b8, Ic9db0829: These two commits conflicted [puppet] - 10https://gerrit.wikimedia.org/r/325122 (https://phabricator.wikimedia.org/T141785) (owner: 10Alex Monk) [18:03:04] (03PS2) 10Gehel: wdqs - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324763 [18:03:16] !log upgrading Wikidata query service to Java 8 [18:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:35] (03CR) 10Dzahn: "uhm... how did the conflict happen?" [puppet] - 10https://gerrit.wikimedia.org/r/325122 (https://phabricator.wikimedia.org/T141785) (owner: 10Alex Monk) [18:04:56] (03CR) 10Gehel: [C: 032] wdqs - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324763 (owner: 10Gehel) [18:05:15] (03CR) 10Alex Monk: "They didn't modify the same things. 
One commit changed a command definition and one added a command caller, but they were made independent" [puppet] - 10https://gerrit.wikimedia.org/r/325122 (https://phabricator.wikimedia.org/T141785) (owner: 10Alex Monk) [18:07:27] (03CR) 10Dzahn: [C: 032] Follow-up I863367b8, Ic9db0829: These two commits conflicted [puppet] - 10https://gerrit.wikimedia.org/r/325122 (https://phabricator.wikimedia.org/T141785) (owner: 10Alex Monk) [18:07:40] (03PS4) 10Dzahn: Follow-up I863367b8, Ic9db0829: These two commits conflicted [puppet] - 10https://gerrit.wikimedia.org/r/325122 (https://phabricator.wikimedia.org/T141785) (owner: 10Alex Monk) [18:09:03] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:10:35] (03CR) 10Dzahn: "gotcha, thanks for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/325122 (https://phabricator.wikimedia.org/T141785) (owner: 10Alex Monk) [18:16:15] (03CR) 10Alex Monk: Send secondary DNS recursor IP from labs DHCP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/325278 (https://phabricator.wikimedia.org/T137460) (owner: 10Alex Monk) [18:19:33] RECOVERY - puppet last run on mw1274 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:22:25] !log gehel@tin Starting deploy [wdqs/wdqs@2b1e1fd]: (no message) [18:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:52] !log gehel@tin Finished deploy [wdqs/wdqs@2b1e1fd]: (no message) (duration: 01m 27s) [18:23:55] SMalyshev: wdqs gui deployment completed, tests are good, feel free to check... [18:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:16] gehel: thank you! [18:24:26] SMalyshev: at your service! 
[18:31:03] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [18:32:32] (03CR) 10Dzahn: [C: 04-1] "@Paladox no, that is the change that 20after4 made and it enabled clustering support but _because_ we have IPv6 enabled and upstream suppo" [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928) (owner: 1020after4) [18:35:05] (03CR) 10Dzahn: "since enabling the clustering support didn't work with IPv6 we have meanwhile used rsync to copy all the repos over to phab2001. It seems " [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [18:35:26] 06Operations, 10ops-eqiad: return/replace bad JNP-QSFP- DAC-5M - https://phabricator.wikimedia.org/T152032#2848016 (10RobH) a:05RobH>03Cmjohnson These arrived last Friday, assigning to Chris. > I just wanted to follow up with you regarding this case; based on the tracking number (1Z7AF3880131298534), I’... [18:36:49] (03CR) 1020after4: [C: 04-1] "what Dzahn said ;)" [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928) (owner: 1020after4) [18:37:03] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:38:10] (03CR) 10Dzahn: "if you want to be fancy you could randomize the time it runs." 
[puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [18:41:34] !log stopping for a few minutes replication on db1048 to change dbstore1002's master [18:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:31] (03PS1) 1020after4: Add search.elastic.host to settings for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 [18:48:16] (03CR) 10Paladox: [C: 031] Add search.elastic.host to settings for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 (owner: 1020after4) [18:50:23] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.16 seconds [18:50:31] ^as expected [18:51:27] (03PS2) 1020after4: Add search.elastic.host to settings for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) [18:59:35] 06Operations, 10LDAP-Access-Requests, 06TCB-Team, 10Wikidata, 03WMDE-QWERTY-Team-Board: Add Andrew and Aleksey to ldap/wmde group - https://phabricator.wikimedia.org/T152088#2848157 (10demon) 05Open>03Resolved a:03demon Done. [19:00:01] jouncebot: next [19:00:01] In 1 hour(s) and 59 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T2100) [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T1900). Please do the needful. [19:00:04] RoanKattouw, MatmaRex, and ebernhardson: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [19:00:24] i'll be around in 10 minutes. 
gotta run do a thing [19:00:31] o/ [19:00:38] I can SWAT today [19:02:09] (03PS2) 10Thcipriani: Re-enable the Flow beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324764 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [19:02:24] RoanKattouw: assuming your -2 on ^ is no longer valid [19:02:46] Yes, sorry [19:02:55] That was there to prevent people from messing with it before today [19:03:03] (03CR) 10Catrope: [C: 031] Re-enable the Flow beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324764 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [19:03:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324764 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [19:03:42] 06Operations, 06Labs, 10Labs-Infrastructure: cronspam from labtestservices2001 /etc/dns-floating-ip-updater.py > /dev/null - https://phabricator.wikimedia.org/T152439#2848181 (10RobH) [19:04:08] (03Merged) 10jenkins-bot: Re-enable the Flow beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324764 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [19:05:19] \o [19:05:55] RoanKattouw: your first change is live on mwdebug1002, check please [19:06:40] thcipriani: BTW, the OOUI fix is a pair of single-line changes, one each in /vendor and /core, which may be a pain to deploy. :-( [19:07:02] (i'm around now. sorry) [19:07:11] I was wondering about those, look like there's a bit of a circular dependency there? [19:07:16] *looks [19:07:34] .// Temporarily disabled for T138310 [19:07:34] T138310: Flow as a Beta feature: enable, disable and reenable doesn't seem to work - https://phabricator.wikimedia.org/T138310 [19:07:40] no longer true [19:08:43] thcipriani: Flow beta feature working on mwdebug1002 [19:08:48] thcipriani: these should not depend on each other. 
the CI setup is problematic when upgrading things, but this is a backport and we're not changing the version number [19:08:58] RoanKattouw: ok, going live [19:09:00] thcipriani: My other two are no-ops [19:09:12] thcipriani: the tests are failing, but it seems unrelated? the tests might be broken on wmf.4? :/ [19:10:37] RoanKattouw: okie doke, I'll push them out when they merge. Looks like 1 of them may already be live [19:10:41] MatmaRex: oh good :( [19:10:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:324764|Re-enable the Flow beta feature]] T138310 (duration: 00m 45s) [19:11:02] ^ RoanKattouw first one is live [19:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:21] (03PS2) 10Thcipriani: Add b/c for the $wgEchoConfig -> $wgEchoEventLoggingSchema rename in I2f9d5d111f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324792 (owner: 10Catrope) [19:11:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324792 (owner: 10Catrope) [19:12:21] (03Merged) 10jenkins-bot: Add b/c for the $wgEchoConfig -> $wgEchoEventLoggingSchema rename in I2f9d5d111f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324792 (owner: 10Catrope) [19:14:19] 06Operations, 10media-storage: cronspam cleanup: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) - https://phabricator.wikimedia.org/T152440#2848229 (10RobH) [19:17:48] (03PS2) 10Andrew Bogott: Send secondary DNS recursor IP from labs DHCP [puppet] - 10https://gerrit.wikimedia.org/r/325278 (https://phabricator.wikimedia.org/T137460) (owner: 10Alex Monk) [19:19:43] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:324792|Add b/c for the $wgEchoConfig -> $wgEchoEventLoggingSchema rename in]] I2f9d5d111f (duration: 00m 47s) [19:19:50] (03CR) 10Filippo Giunchedi: [C: 031] Renew expired TLS certificate for eventlogging hosts [puppet] - 
10https://gerrit.wikimedia.org/r/325273 (https://phabricator.wikimedia.org/T152364) (owner: 10Jcrespo) [19:19:54] ^ RoanKattouw 2nd change is live [19:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:01] 3rd one already appears to be merged [19:21:09] (03CR) 10Andrew Bogott: [C: 032] Send secondary DNS recursor IP from labs DHCP [puppet] - 10https://gerrit.wikimedia.org/r/325278 (https://phabricator.wikimedia.org/T137460) (owner: 10Alex Monk) [19:22:11] MatmaRex: hrm. tests might have a problem with wikibase/wikidata or this could be a test environ problem :\ [19:23:09] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 5 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2848311 (10GWicke) There are pros & cons for dividing the API cluster in multiple sub-clusters. The big advanta... [19:23:32] 06Operations, 10ops-eqiad: return/replace bad JNP-QSFP- DAC-5M - https://phabricator.wikimedia.org/T152032#2848318 (10Cmjohnson) 05Open>03Resolved Received the cable RMA return tracking 1Z 7AF 388 90 3129 8546 [19:25:00] thcipriani: grumble [19:29:06] MatmaRex: grumble indeed. I'm trying to figure out if there's an update needed to a submodule that's not in wmf.4 for some reason :\ [19:30:09] thcipriani hopefully https://gerrit.wikimedia.org/r/#/c/325350/ will fix it [19:31:45] MatmaRex: let me get out ebernhardson 's change, and circle back to this. 
[19:32:18] (03PS2) 10Thcipriani: Turn off CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325329 (https://phabricator.wikimedia.org/T149740) (owner: 10EBernhardson) [19:33:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325329 (https://phabricator.wikimedia.org/T149740) (owner: 10EBernhardson) [19:34:05] (03Merged) 10jenkins-bot: Turn off CirrusSearch interwiki load test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325329 (https://phabricator.wikimedia.org/T149740) (owner: 10EBernhardson) [19:34:32] should be pretty safe to ship [19:34:46] ebernhardson: your change is live on mwdebug1002 if there's anything to test [19:35:17] ebernhardson: ah, ok. looks like wmf-config/CirrusSearch-production.php and then sync-dir wmf-config? [19:35:24] thcipriani: nothing appears obviously broken, should be good to go. Yes sync -production first [19:35:31] ok, going live [19:35:35] thcipriani the backport i did seems to work https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/400/console [19:35:41] fixes wmf.4 [19:36:20] I'm concerned: why is something using files that need to be backported into wmf.4 in the first place? [19:37:03] it seems like that ought to be reverted, rather than the missing files backported...I could be wrong, but I've only been looking at it for 10 minutes or so. [19:37:42] makes me nervous about deploying anything to do with wmf.4 since it seems like something could have been erroneously merged or backported there. [19:38:12] (03CR) 10Dzahn: [C: 04-1] "let's use https" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [19:39:36] thcipriani how do we revert wikibase / wikidata [19:39:37] greg-g: are incident-related tasks still meant to be tagged in some particular way? 
asking because I see many that are not tagged [19:39:48] not sure which one to revert + it seems to have no wmf 4 branch [19:39:53] https://github.com/wikimedia/mediawiki-extensions-Wikibase [19:40:59] I guess i should ping the wikibase team, aude_, addshore ^^ [19:41:07] ? [19:41:08] This https://github.com/wikimedia/mediawiki-extensions-Wikibase/commit/ceb7502741b9b93cdf5d84f3c7b9ea3a31d8af45 should be reverted [19:41:19] addshore: wmf.4 master is failing tests because of Wikidata stuff [19:41:20] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:325329|Turn off CirrusSearch interwiki load test]] T149740 PART I (duration: 00m 44s) [19:41:23] addshore the wmf 4 tests are failing because of wikidata [19:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:31] addshore see https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/396/console [19:41:31] T149740: Run load tests of cross-project searching to verify its stability - https://phabricator.wikimedia.org/T149740 [19:41:33] s/master/head/ [19:42:01] godog: incident follow-ups should be tagged with #wikimedia-incident and put in the "follow-up" column [19:42:09] wikidata would be on wmf.3 fwiw https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json [19:42:14] godog: I haven't caught up on the API one joe was investigating [19:42:20] interesting, paladox that isn't an error I have seen in any of the tests to date [19:42:32] addshore it won't fail on master [19:42:45] since the change is in master, but it isn't included on wmf 4. 
[19:43:21] thcipriani i think if there is no wmf 4 branch in zuul for wikidata then it falls back to master [19:43:34] ahhh, that could be the issue [19:43:51] (03PS1) 10Chad: scap clean plugin: fix minor typo in progress tracking/reporting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325353 [19:43:52] 19:18:51 DEBUG:zuul.Cloner:Project mediawiki/extensions/Wikidata in Zuul does not have ref refs/zuul/wmf/1.29.0-wmf.4/Zeda21dda2b7246a396649eea6b661cac [19:43:53] 19:18:51 DEBUG:zuul.Cloner:Project mediawiki/extensions/Wikidata in Zuul does not have ref refs/zuul/master/Zeda21dda2b7246a396649eea6b661cac [19:43:53] 19:18:51 INFO:zuul.Cloner:Falling back to branch master [19:43:58] I mean, you could always just CP the core patch introducing the classes into wmf4, which would fix the tests [19:44:02] greg-g: ok thanks! who's in charge of tagging those? I don't see it mentioned on wikitech from a quick search [19:44:18] Yeh, we should create a wmf 4 branch based on wmf 3 [19:44:44] paladox: or yeh, make a wmf4 wikidata branch that is the same as wmf3? [19:44:48] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:325329|Turn off CirrusSearch interwiki load test]] T149740 PART II (duration: 00m 47s) [19:44:51] godog: the people who file them [19:44:51] Yep [19:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:11] actually it's like part V, but close enough :) [19:45:16] godog: I might need to edit the example incident report template [19:45:53] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:325329|Turn off CirrusSearch interwiki load test]] T149740 PART III (duration: 00m 46s) [19:45:53] ebernhardson: meh, part II of what I'm doing with it :) [19:45:59] ^ ebernhardson live everywhere [19:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:06] addshore could you create the branch please? 
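The fallback described in the exchange above (no wmf.4 ref for Wikidata, so the job ends up on master) can be modelled in a few lines. This is a simplified sketch inferred only from the three zuul.Cloner log lines quoted in the channel, not Zuul's actual implementation; `pick_checkout` and its parameters are hypothetical names:

```python
def pick_checkout(zuul_refs, branches, branch, zuul_id, default="master"):
    """Pick what to check out for one project, under the simplified model.

    zuul_refs: set of refs Zuul has prepared for the project
    branches:  set of branch names that exist in the project's repository
    branch:    branch of the change under test
    zuul_id:   the per-change Zuul ref id (the "Z..." suffix in the log)
    """
    # If the project lacks the branch under test, the default branch stands in.
    fallback = branch if branch in branches else default

    # Prefer a Zuul-prepared ref, first for the tested branch, then for the fallback.
    for candidate in (f"refs/zuul/{branch}/{zuul_id}", f"refs/zuul/{fallback}/{zuul_id}"):
        if candidate in zuul_refs:
            return candidate

    # Nothing prepared by Zuul: plain checkout of the fallback branch.
    return fallback


if __name__ == "__main__":
    zuul_id = "Zeda21dda2b7246a396649eea6b661cac"
    wikidata_branches = {"master", "wmf/1.29.0-wmf.3"}

    # Before the branch exists: the wmf.4 job tests against Wikidata master.
    print(pick_checkout(set(), wikidata_branches, "wmf/1.29.0-wmf.4", zuul_id))

    # After a wmf/1.29.0-wmf.4 branch is created, that branch is used instead.
    wikidata_branches.add("wmf/1.29.0-wmf.4")
    print(pick_checkout(set(), wikidata_branches, "wmf/1.29.0-wmf.4", zuul_id))
```

Under this model, creating the missing release branch is enough to stop the wmf.4 job from pulling in master-only code, which matches the fix requested in the channel.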
[19:46:31] (03CR) 10Chad: [C: 032] scap clean plugin: fix minor typo in progress tracking/reporting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325353 (owner: 10Chad) [19:46:44] greg-g: yeah I asked because I only remember seeing the tag in tasks but not mentioned anywhere on wikitech [19:46:52] paladox: I assume it is the Wikidata repo that is actually causing the issues? [19:47:13] (03Merged) 10jenkins-bot: scap clean plugin: fix minor typo in progress tracking/reporting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325353 (owner: 10Chad) [19:47:40] thcipriani: :) thanks, everything seems sane [19:47:44] addshore yeh [19:48:03] paladox: created! [19:48:13] Thanks [19:48:30] aude_: ^^ FYI I just made a wmf/1.29.0-wmf.4 branch on the Wikidata repo with the same tree as wmf/1.29.0-wmf.3 [19:48:55] TIL: searching something that starts with # in the quick search box redirects you to main_page [19:49:01] !log demon@tin Synchronized scap/plugins/clean.py: Clean all the things (duration: 00m 43s) [19:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:35] cool, will rebase the wmf.4 patches. Thanks addshore and paladox [19:49:44] 06Operations, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: labservices1001 down - https://phabricator.wikimedia.org/T152340#2848411 (10fgiunchedi) [19:49:49] You're welcome :) [19:52:07] (03PS1) 10Andrew Bogott: Page if a labs dns server stops responding [puppet] - 10https://gerrit.wikimedia.org/r/325358 (https://phabricator.wikimedia.org/T152368) [19:52:09] (03PS1) 10Andrew Bogott: Designate: page if services go down. 
[puppet] - 10https://gerrit.wikimedia.org/r/325359 (https://phabricator.wikimedia.org/T152368) [19:53:19] 06Operations, 05Prometheus-metrics-monitoring: Move prometheus entry point off port 80 - https://phabricator.wikimedia.org/T152445#2848430 (10fgiunchedi) [19:53:31] 06Operations, 05Prometheus-metrics-monitoring: Move prometheus entry point off port 80 - https://phabricator.wikimedia.org/T152445#2848443 (10fgiunchedi) p:05Triage>03Normal [19:55:16] (03CR) 10jenkins-bot: [V: 04-1] Page if a labs dns server stops responding [puppet] - 10https://gerrit.wikimedia.org/r/325358 (https://phabricator.wikimedia.org/T152368) (owner: 10Andrew Bogott) [19:56:20] MatmaRex: sorry about the delay, I can deploy if you've got some time for a little outside the swat window [19:56:36] thcipriani: yeah, i'm still here [19:57:26] (03PS2) 10Andrew Bogott: Page if a labs dns server stops responding [puppet] - 10https://gerrit.wikimedia.org/r/325358 (https://phabricator.wikimedia.org/T152368) [19:57:28] (03PS2) 10Andrew Bogott: Designate: page if services go down. [puppet] - 10https://gerrit.wikimedia.org/r/325359 (https://phabricator.wikimedia.org/T152368) [19:59:26] MatmaRex: cool. Is there a concern about the order in which these patches go out? 
[20:00:01] thcipriani: nope [20:00:08] okie doke [20:00:09] they're independent [20:00:13] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [20:04:44] (03PS3) 10Dzahn: Add search.elastic.host to settings for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [20:05:03] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [20:05:38] (03CR) 10Dzahn: [C: 031] "PS3, changed to https and port 9243" [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [20:06:32] !log run swift-thumb-stats to gather thumbnail stats on ms-fe1001 [20:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:45] (03PS4) 10Dzahn: phabricator: Add search.elastic.host for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [20:08:14] (03CR) 10Dzahn: "20after4, +1 for PS3?" [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [20:10:03] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 137 seconds ago with 0 failures [20:11:45] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2848514 (10Eevans) [20:16:32] finally merged! [20:17:58] MatmaRex: both changes live on mwdebug1002, if there's anything you want to check there [20:18:55] thcipriani: thanks. i'll just verify quickly [20:19:32] thcipriani: all looks fine at a glance [20:20:09] MatmaRex: ok, going live with both, vendor first [20:20:43] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:21:16] (03PS2) 10Andrew Bogott: Clarify a comment re: labtest private hiera lookups. [puppet] - 10https://gerrit.wikimedia.org/r/324778 [20:22:50] (03CR) 10Andrew Bogott: [C: 032] Clarify a comment re: labtest private hiera lookups. [puppet] - 10https://gerrit.wikimedia.org/r/324778 (owner: 10Andrew Bogott) [20:23:16] !log thcipriani@tin Synchronized php-1.29.0-wmf.4/vendor/oojs/oojs-ui/php/layouts/FieldsetLayout.php: SWAT: [[gerrit:325247|OOjs UI: Backport I73f95965694ec7fb0fa9a474742286e1105e5c85]] T151061 (duration: 00m 46s) [20:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:28] T151061: labels no longer shown in Chrome (55.0.2883.35) and friends - https://phabricator.wikimedia.org/T151061 [20:25:40] !log thcipriani@tin Synchronized php-1.29.0-wmf.4/resources/lib/oojs-ui/oojs-ui-core.js: SWAT: [[gerrit:325246|OOjs UI: Backport I73f95965694ec7fb0fa9a474742286e1105e5c85]] T151061 (duration: 00m 46s) [20:25:43] ^ MatmaRex both live everywhere [20:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:36] thanks thcipriani [20:30:45] Reedy: re: cronjob for generating captchas? so you think "day 1" might be bad because all maintenance runs on day 1? [20:32:04] Reedy: we could randomize it. either once (i declare the day is 23), or permanently: $day = fqdn_rand(28) [20:38:33] (03CR) 10Dzahn: "careful, there is a file /home/mwdeploy/.etcdrc on appservers, it has username/password in it, username: conftool" [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [20:38:37] (03PS1) 10Bmansurov: Enable ReadMore on mobile jawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325365 (https://phabricator.wikimedia.org/T151346) [20:40:22] (03CR) 10Chad: "Bah, ok...." 
[puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [20:47:24] (03CR) 10Jdlrobson: [C: 04-1] Enable ReadMore on mobile jawiki and eswiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325365 (https://phabricator.wikimedia.org/T151346) (owner: 10Bmansurov) [20:48:43] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:50:13] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:51:03] (03PS1) 10Jdlrobson: Roll out wikidata description taglines to French and German Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325366 (https://phabricator.wikimedia.org/T151345) [20:52:47] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2848675 (10GWicke) > Original RESTBase RFC: I don't see anything relevant in the RfC nor do I remember discuss... [20:52:50] (03PS2) 10Bmansurov: Enable ReadMore on mobile jawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325365 (https://phabricator.wikimedia.org/T151346) [20:58:40] (03PS1) 10Jdlrobson: Enable banners on Finnish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325369 (https://phabricator.wikimedia.org/T152344) [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T2100). 
[21:01:22] (03PS1) 10Jdlrobson: Disable Wikipedia beta banner experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325370 (https://phabricator.wikimedia.org/T148634) [21:02:48] nope [21:04:40] !log mholloway-shell@tin Starting deploy [mobileapps/deploy@ccc69fb]: Update mobileapps to 2fcd49d [21:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:55] (03CR) 10Jdlrobson: [C: 031] Enable ReadMore on mobile jawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325365 (https://phabricator.wikimedia.org/T151346) (owner: 10Bmansurov) [21:05:24] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:05:53] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@ccc69fb]: Update mobileapps to 2fcd49d (duration: 01m 13s) [21:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:53] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:23] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [21:12:23] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:13:47] (03PS1) 10Andrew Bogott: Labs ldap: Hide the novaobserver account from everyone but keystone [puppet] - 10https://gerrit.wikimedia.org/r/325371 (https://phabricator.wikimedia.org/T150092) [21:17:11] (03CR) 10Bmansurov: Disable Wikipedia beta banner experiment (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325370 (https://phabricator.wikimedia.org/T148634) (owner: 10Jdlrobson) [21:17:23] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [21:20:13] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:20:23] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:21:23] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [21:24:32] (03CR) 10Bmansurov: [C: 031] Enable banners on Finnish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325369 (https://phabricator.wikimedia.org/T152344) (owner: 10Jdlrobson) [21:26:43] (03CR) 10Bmansurov: [C: 031] Roll out wikidata description taglines to French and German Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325366 (https://phabricator.wikimedia.org/T151345) (owner: 10Jdlrobson) [21:27:25] (03PS2) 10Jdlrobson: Disable Wikipedia beta banner experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325370 (https://phabricator.wikimedia.org/T148634) [21:27:27] (03CR) 10Jdlrobson: Disable Wikipedia beta banner experiment (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325370 (https://phabricator.wikimedia.org/T148634) (owner: 10Jdlrobson) [21:30:53] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:33:06] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2848865 (10MarkTraceur) Multimedia team bowing out of working on this because it seems like it might be a pywikibot problem (namely not using as... [21:35:19] (03CR) 10Bmansurov: [C: 031] Disable Wikipedia beta banner experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325370 (https://phabricator.wikimedia.org/T148634) (owner: 10Jdlrobson) [21:37:20] (03PS2) 10Andrew Bogott: Labs ldap: Hide the novaobserver account from everyone but keystone [puppet] - 10https://gerrit.wikimedia.org/r/325371 (https://phabricator.wikimedia.org/T150092) [21:37:47] (03Abandoned) 10Andrew Bogott: Keystone: add 'observer' domain [puppet] - 10https://gerrit.wikimedia.org/r/324963 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [21:37:53] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [21:38:50] (03PS3) 10Andrew Bogott: Labs ldap: Hide the novaobserver account from everyone but keystone [puppet] - 10https://gerrit.wikimedia.org/r/325371 (https://phabricator.wikimedia.org/T150092) [21:40:38] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Regression: image magick stripping colour profile of PNG files [probably regression] - https://phabricator.wikimedia.org/T113123#2848901 (10MarkTraceur) [21:41:22] (03CR) 1020after4: [C: 031] phabricator: Add search.elastic.host for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [21:42:20] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2848910 (10Pchelolo) [21:42:22] 
06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to beta cluster - https://phabricator.wikimedia.org/T150945#2848909 (10Pchelolo) [21:42:24] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2848911 (10Pchelolo) [21:42:28] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2848912 (10Pchelolo) [21:42:30] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2848914 (10Pchelolo) [21:42:32] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2848907 (10Pchelolo) 05Open>03Resolved Resolving. Deployed and life in production: https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_p... [21:42:34] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), and 2 others: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2848915 (10Pchelolo) [21:48:28] (03CR) 10Filippo Giunchedi: [C: 031] Designate: page if services go down. [puppet] - 10https://gerrit.wikimedia.org/r/325359 (https://phabricator.wikimedia.org/T152368) (owner: 10Andrew Bogott) [21:49:06] (03CR) 10Andrew Bogott: [C: 032] Labs ldap: Hide the novaobserver account from everyone but keystone [puppet] - 10https://gerrit.wikimedia.org/r/325371 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [21:49:59] (03CR) 10Filippo Giunchedi: [C: 04-1] "I'm not convinced dns itself should page, it is meant to be a redundant service. 
Other higher-level paging checks might be more helpful (e" [puppet] - 10https://gerrit.wikimedia.org/r/325358 (https://phabricator.wikimedia.org/T152368) (owner: 10Andrew Bogott) [21:53:08] (03PS1) 10Andrew Bogott: Explicitly set labs_keystone_host for labtestservices [puppet] - 10https://gerrit.wikimedia.org/r/325422 [21:54:10] (03CR) 10Andrew Bogott: [C: 032] Explicitly set labs_keystone_host for labtestservices [puppet] - 10https://gerrit.wikimedia.org/r/325422 (owner: 10Andrew Bogott) [21:56:09] (03CR) 10Filippo Giunchedi: [C: 031] contint: Add dependencies needed for PoolCounter tests [puppet] - 10https://gerrit.wikimedia.org/r/325145 (https://phabricator.wikimedia.org/T152338) (owner: 10Legoktm) [21:58:13] (03CR) 10Filippo Giunchedi: [C: 032] gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - 10https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350) (owner: 10Chad) [21:58:53] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T2200). Please do the needful. [22:00:43] (03PS5) 10Dzahn: phabricator: Add search.elastic.host for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [22:02:31] (03CR) 10Dzahn: [C: 032] phabricator: Add search.elastic.host for limited elasticsearch testing [puppet] - 10https://gerrit.wikimedia.org/r/325333 (https://phabricator.wikimedia.org/T146843) (owner: 1020after4) [22:04:13] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:12:13] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:14:29] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 5 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2849092 (10greg) @Joe Should I make this an explicit follow-up from the incident? https://wikitech.wikimedia.or... [22:24:45] (03PS3) 10Dzahn: logstash: Move files from root to role module [puppet] - 10https://gerrit.wikimedia.org/r/323332 (owner: 10BryanDavis) [22:24:53] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/4800/" [puppet] - 10https://gerrit.wikimedia.org/r/323332 (owner: 10BryanDavis) [22:26:22] (03CR) 10Dzahn: [C: 032] logstash: Move files from root to role module [puppet] - 10https://gerrit.wikimedia.org/r/323332 (owner: 10BryanDavis) [22:28:48] (03CR) 10Dzahn: "applied, confirmed no-op on logstash1001-1003 (and 1004-1006 were already shown in compiler)" [puppet] - 10https://gerrit.wikimedia.org/r/323332 (owner: 10BryanDavis) [22:28:50] (03PS1) 10Filippo Giunchedi: prometheus: tune ops retention in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/325432 [22:33:13] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:33:13] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py],Service[slapd] [22:33:29] (03PS1) 10Andrew Bogott: Labs ldap: Further attempt to get the keystone IP in an acl [puppet] - 10https://gerrit.wikimedia.org/r/325433 [22:35:21] (03PS2) 10Andrew Bogott: Labs ldap: Further attempt to get the keystone IP in an acl [puppet] - 10https://gerrit.wikimedia.org/r/325433 [22:37:05] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: tune ops retention in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/325432 (owner: 10Filippo Giunchedi) [22:37:10] (03PS2) 10Filippo Giunchedi: prometheus: tune ops retention in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/325432 [22:39:52] (03CR) 10Filippo Giunchedi: [V: 032] prometheus: tune ops retention in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/325432 (owner: 10Filippo Giunchedi) [22:40:03] (03CR) 10Filippo Giunchedi: "Not testable with PCC because of https://phabricator.wikimedia.org/T150456" [puppet] - 10https://gerrit.wikimedia.org/r/325432 (owner: 10Filippo Giunchedi) [22:44:41] surprise! 
that didn't do anything [22:44:48] (03CR) 10Andrew Bogott: [C: 032] Labs ldap: Further attempt to get the keystone IP in an acl [puppet] - 10https://gerrit.wikimedia.org/r/325433 (owner: 10Andrew Bogott) [22:44:53] (03PS3) 10Andrew Bogott: Labs ldap: Further attempt to get the keystone IP in an acl [puppet] - 10https://gerrit.wikimedia.org/r/325433 [22:47:38] (03PS1) 10Filippo Giunchedi: prometheus: fetch storage_retention for ops from hiera [puppet] - 10https://gerrit.wikimedia.org/r/325434 [22:51:57] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fetch storage_retention for ops from hiera [puppet] - 10https://gerrit.wikimedia.org/r/325434 (owner: 10Filippo Giunchedi) [22:52:01] (03PS2) 10Filippo Giunchedi: prometheus: fetch storage_retention for ops from hiera [puppet] - 10https://gerrit.wikimedia.org/r/325434 [22:52:13] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:59:50] (03PS1) 10BryanDavis: webservice: Fix #!... [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/325436 (https://phabricator.wikimedia.org/T147350) [23:01:59] (03CR) 10BryanDavis: [C: 032] webservice: Fix #!... [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/325436 (https://phabricator.wikimedia.org/T147350) (owner: 10BryanDavis) [23:04:47] (03PS9) 10Dzahn: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [23:04:49] (03Abandoned) 10Andrew Bogott: Page if a labs dns server stops responding [puppet] - 10https://gerrit.wikimedia.org/r/325358 (https://phabricator.wikimedia.org/T152368) (owner: 10Andrew Bogott) [23:05:50] (03PS3) 10Andrew Bogott: Designate: page if services go down. [puppet] - 10https://gerrit.wikimedia.org/r/325359 (https://phabricator.wikimedia.org/T152368) [23:06:42] (03Merged) 10jenkins-bot: webservice: Fix #!... 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/325436 (https://phabricator.wikimedia.org/T147350) (owner: 10BryanDavis) [23:07:43] (03CR) 10Andrew Bogott: [C: 032] Designate: page if services go down. [puppet] - 10https://gerrit.wikimedia.org/r/325359 (https://phabricator.wikimedia.org/T152368) (owner: 10Andrew Bogott) [23:08:05] (03CR) 10Dzahn: "seems good, the only comment is that sometimes we have those alert when they check for _exactly_ 1 process and then there are 2 processes " [puppet] - 10https://gerrit.wikimedia.org/r/325359 (https://phabricator.wikimedia.org/T152368) (owner: 10Andrew Bogott) [23:12:34] (03CR) 10Dzahn: [C: 04-1] "heh, now it's reverting a jdk 7 -> 8 change? http://puppet-compiler.wmflabs.org/4804/" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [23:15:27] (03PS1) 10RobH: reclaim nobelium to spares [puppet] - 10https://gerrit.wikimedia.org/r/325439 [23:22:06] (03CR) 10Dzahn: "that was https://gerrit.wikimedia.org/r/#/c/323154/ or related" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [23:22:46] (03PS10) 10Dzahn: logstash: Break logstash.pp up into individual classes [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [23:23:15] 06Operations, 10ops-eqiad: Reclaim nobelium - https://phabricator.wikimedia.org/T142581#2849298 (10RobH) a:05RobH>03Cmjohnson So in checking against the server lifecycle, this has been removed from all of puppet, as well as monitoring and has been shut down. I've checked and disabled the network port. @C... 
[23:23:31] 06Operations, 10ops-eqiad, 10hardware-requests: Reclaim nobelium - https://phabricator.wikimedia.org/T142581#2849302 (10RobH) [23:24:00] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/324704/" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [23:33:39] (03CR) 10Dzahn: [C: 032] "amended, double checked, http://puppet-compiler.wmflabs.org/4805/" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [23:36:05] (03CR) 10Dzahn: "confirmed puppet runs are no-op on all logstash servers. thanks" [puppet] - 10https://gerrit.wikimedia.org/r/323333 (https://phabricator.wikimedia.org/T93645) (owner: 10BryanDavis) [23:42:27] (03PS1) 10Dzahn: xhgui: move role class in proper location [puppet] - 10https://gerrit.wikimedia.org/r/325442 [23:45:20] (03PS2) 10Dzahn: xhgui: move role class in proper location [puppet] - 10https://gerrit.wikimedia.org/r/325442 (https://phabricator.wikimedia.org/T93645) [23:45:45] (03CR) 10Dzahn: [C: 032] xhgui: move role class in proper location [puppet] - 10https://gerrit.wikimedia.org/r/325442 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [23:47:56] (03CR) 10Dzahn: "confirmed no-op on tungsten" [puppet] - 10https://gerrit.wikimedia.org/r/325442 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [23:49:19] (03PS1) 10Andrew Bogott: Remove an unneeded line in observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/325443 [23:59:28] 06Operations, 06Performance-Team: Define SLAs for media - https://phabricator.wikimedia.org/T112692#2849412 (10Krinkle) 05Open>03Invalid Redundant - closing per sub task (T112691). Now part of the Thumbor project. [23:59:46] (03PS2) 10Andrew Bogott: Remove unneeded lines from observerenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/325443