[00:00:51] RECOVERY - dhclient process on mw1116 is OK: PROCS OK: 0 processes with command name dhclient
[00:00:51] RECOVERY - nutcracker process on mw1116 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:01:10] RECOVERY - HHVM processes on mw1116 is OK: PROCS OK: 6 processes with command name hhvm
[00:01:11] RECOVERY - salt-minion processes on mw1116 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:01:11] RECOVERY - configured eth on mw1116 is OK: OK - interfaces up
[00:01:58] RECOVERY - RAID on mw1116 is OK: OK: no RAID installed
[00:02:38] RECOVERY - Check size of conntrack table on mw1116 is OK: OK: nf_conntrack is 0 % full
[00:02:57] RECOVERY - DPKG on mw1116 is OK: All packages OK
[00:03:47] RECOVERY - SSH on mw1116 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[00:04:19] RECOVERY - nutcracker port on mw1116 is OK: TCP OK - 0.000 second response time on port 11212
[00:08:58] PROBLEM - puppet last run on db1057 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:33:18] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:19:40] What's happening...
[03:20:09] Request from 80.217.80.83 via cp3033 cp3033, Varnish XID 1095061767 Error: 503, Service Unavailable at Mon, 25 Apr 2016 03:19:57 GMT
[03:20:23] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:23] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:23] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:24] PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:24] PROBLEM - Host 91.198.174.106 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:24] PROBLEM - Host 91.198.174.122 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:25] OTRS and en.wp down...
[03:20:33] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:44] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:53] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:54] PROBLEM - Host asw-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:54] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:54] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100%
[03:20:54] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:21:03] RECOVERY - Host cp3049 is UP: PING OK - Packet loss = 0%, RTA = 83.39 ms
[03:21:03] RECOVERY - Host lvs3002 is UP: PING OK - Packet loss = 0%, RTA = 83.61 ms
[03:21:03] RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 83.36 ms
[03:21:03] RECOVERY - Host cp3040 is UP: PING OK - Packet loss = 0%, RTA = 83.12 ms
[03:21:03] RECOVERY - Host cp3045 is UP: PING OK - Packet loss = 0%, RTA = 83.12 ms
[03:21:03] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 83.54 ms
[03:21:03] RECOVERY - Host cp3032 is UP: PING OK - Packet loss = 0%, RTA = 83.32 ms
[03:21:04] RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 83.40 ms
[03:21:04] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 83.53 ms
[03:21:10] ok, nvm :)
[03:21:12] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 82.77 ms
[03:21:42] RECOVERY - Host 91.198.174.122 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms
[03:21:43] RECOVERY - Host 91.198.174.106 is UP: PING OK - Packet loss = 0%, RTA = 83.12 ms
[03:22:13] RECOVERY - Host asw-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 84.40 ms
[03:26:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[03:26:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[03:26:23] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail
[03:26:44] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[03:26:52] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 6 failures
[03:27:44] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[03:32:03] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:32:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:32:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:45:11] Josve05a, full outage for a moment?
[03:45:23] of esams
[03:45:28] yeah, was .... 'fun' ...
[03:50:43] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[03:51:03] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[03:51:12] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:59:03] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CRITICAL: puppet fail
[04:15:46] (PS4) Mattflaschen: Use notify-type-availability due to Echo change [mediawiki-config] - https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820)
[04:25:20] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[04:30:25] (PS12) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[04:31:30] (CR) jenkins-bot: [V: -1] Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 20after4)
[04:37:22] Operations, Commons, MediaWiki-Page-deletion, media-storage, Patch-For-Review: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2234164 (Pokefan95) p:Triage>High Matching priority with T131769. Also, a patch has been uploaded...
[04:46:49] (PS13) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[04:47:52] (CR) jenkins-bot: [V: -1] Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 20after4)
[05:09:32] (PS14) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[05:22:55] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[05:24:56] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5151505 keys - replication_delay is 0
[05:27:07] (PS15) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[05:32:39] (PS16) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[05:36:33] (PS17) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[05:37:48] (PS18) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[05:51:41] (PS1) Yuvipanda: Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142
[06:12:28] Operations, Commons, MediaWiki-Page-deletion, media-storage, Patch-For-Review: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2234212 (Steinsplitter) @Pokefan95 As you [[ https://www.mediawiki.org/wiki/Phabricator/Project_managemen...
[06:14:31] (PS2) Yuvipanda: Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142
[06:15:07] (CR) jenkins-bot: [V: -1] Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142 (owner: Yuvipanda)
[06:29:38] (PS3) Yuvipanda: Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142
[06:30:41] (CR) jenkins-bot: [V: -1] Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142 (owner: Yuvipanda)
[06:30:48] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:58] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:08] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail
[06:31:19] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:43] (PS4) Yuvipanda: Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142
[06:35:08] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:20] (CR) jenkins-bot: [V: -1] Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142 (owner: Yuvipanda)
[06:35:38] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:36:18] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail
[06:36:37] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:52] (PS5) Yuvipanda: Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142
[06:37:30] (CR) jenkins-bot: [V: -1] Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142 (owner: Yuvipanda)
[06:40:58] (PS6) Yuvipanda: Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142 (https://phabricator.wikimedia.org/T98442)
[06:51:53] (CR) Yuvipanda: [C: 2] Add legacy support for tomcat webservices [software/tools-webservice] - https://gerrit.wikimedia.org/r/285142 (https://phabricator.wikimedia.org/T98442) (owner: Yuvipanda)
[06:53:28] (PS2) Rillke: Set up UploadsLink on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/281651 (https://phabricator.wikimedia.org/T131844)
[06:55:18] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:17] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:37] (CR) Muehlenhoff: [C: 2 V: 2] Allow display of a job by job id or relatively (e.g. -2 to show the second-last) [debs/debdeploy] - https://gerrit.wikimedia.org/r/284889 (owner: Muehlenhoff)
[06:56:39] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:57:08] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:09] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:28] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[07:31:56] akosiaris: hey, do you have some time to check this? https://gerrit.wikimedia.org/r/#/c/285010/
[07:32:01] it would be great
[07:34:33] (PS2) Alexandros Kosiaris: ores: fix staging configs [puppet] - https://gerrit.wikimedia.org/r/285010 (owner: Ladsgroup)
[07:34:34] yup
[07:34:41] (CR) Alexandros Kosiaris: [C: 2 V: 2] ores: fix staging configs [puppet] - https://gerrit.wikimedia.org/r/285010 (owner: Ladsgroup)
[07:44:32] thanks :) akosiaris :)
[07:53:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[07:53:54] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[07:57:53] ran puppet-merge
[07:58:03] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[07:59:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:01:12] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.054 second response time
[08:02:18] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 73592 bytes in 1.704 second response time
[08:03:19] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:19:49] !log dropping old imported logs from pc1006
[08:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:21:31] I will wait a bit and do the same for pc1005 and pc1004
[08:23:25] !log restarted HHVM on mw1116 (output of hhvm-dump-debug available)
[08:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:48:59] Operations, ops-codfw: ms-be2007.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T133517#2234312 (fgiunchedi) NEW
[08:50:54] RECOVERY - RAID on ms-be2007 is OK: OK: optimal, 13 logical, 13 physical
[09:00:38] ACKNOWLEDGEMENT - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi broken disk T133517
[09:08:33] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:12:35] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[09:26:20] Operations, RESTBase, Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2234413 (fgiunchedi) the bootstrap didn't complete as expected with cassandra reporting corrupted sstables in logs ```lines=6 root@restbase1015:/var/log/cassandra# grep ERROR system-...
[09:38:31] !log deleting imported logs on pc1004 and pc1005
[09:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:41:01] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:07:55] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:09:55] (PS2) Filippo Giunchedi: Add atgomez to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/284439 (https://phabricator.wikimedia.org/T133102) (owner: Alexandros Kosiaris)
[10:10:03] (CR) Filippo Giunchedi: [C: 2 V: 2] Add atgomez to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/284439 (https://phabricator.wikimedia.org/T133102) (owner: Alexandros Kosiaris)
[10:11:05] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to hive for AGomez (WMF) - https://phabricator.wikimedia.org/T133102#2234539 (fgiunchedi) Open>Resolved a:fgiunchedi waiting period is over, access should be active shortly
[10:11:25] Operations, DBA, Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#2234542 (jcrespo) Before all 5.5 old servers are decommissioned, this is the list of old checks: ``` DPKG [common] Disk space [common] Ful...
[10:11:48] !log stopping db1052 for backup to dbstore1002
[10:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:21:56] (PS2) Filippo Giunchedi: Add addshore to restricted group [puppet] - https://gerrit.wikimedia.org/r/284441 (https://phabricator.wikimedia.org/T133066) (owner: Alexandros Kosiaris)
[10:22:05] (CR) Filippo Giunchedi: [C: 2 V: 2] Add addshore to restricted group [puppet] - https://gerrit.wikimedia.org/r/284441 (https://phabricator.wikimedia.org/T133066) (owner: Alexandros Kosiaris)
[10:22:35] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1052.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1052.eqiad.wmnet (111 Connection refused)
[10:22:49] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2234549 (fgiunchedi) Open>Resolved a:fgiunchedi waiting period is over, access should be granted shortly
[10:27:12] ^that is expected, let me ack it
[10:27:37] Operations, Editing-Department, Parsing-Team, Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2234555 (mobrovac)
[10:46:26] Operations, Traffic, Wiki-Loves-Monuments-General, HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234606 (SindyM3) De oplossing is: Lemonbit maakt een zogenaamd CSR aan, een certificate request voor wikilovesmonuments.org. Dit CSR kan u opstur...
[10:48:28] (PS1) Muehlenhoff: Add cron job to synchronise AEAD files from primary auth server [puppet] - https://gerrit.wikimedia.org/r/285160
[10:48:32] Operations, DBA, Patch-For-Review: Reimage db2047 - check for hardware errors - https://phabricator.wikimedia.org/T132011#2234613 (jcrespo) a:Volans>jcrespo
[10:48:33] It seems that HHVM was upgraded correctly to latest version on Terbium (T132751) but has not been restarted (T133522). Not really sure what's running on Terbium. Can I just restart HHVM?
[10:48:33] T132751: upgrade HHVM to 3.12.1 on terbium - https://phabricator.wikimedia.org/T132751
[10:48:34] T133522: Elastica connection pool cannot init: missing curl_init_pooled method. Are you using hhvm >= 3.9.0? - https://phabricator.wikimedia.org/T133522
[10:49:42] (CR) jenkins-bot: [V: -1] Add cron job to synchronise AEAD files from primary auth server [puppet] - https://gerrit.wikimedia.org/r/285160 (owner: Muehlenhoff)
[10:50:05] Operations, DBA, Patch-For-Review: Reimage db2047 - check for hardware errors - https://phabricator.wikimedia.org/T132011#2185858 (jcrespo) p:Triage>Normal
[10:50:44] (PS1) Jcrespo: Depool db2068 for cloning to db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/285161 (https://phabricator.wikimedia.org/T132011)
[10:51:31] (CR) Jcrespo: [C: 2] Depool db2068 for cloning to db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/285161 (https://phabricator.wikimedia.org/T132011) (owner: Jcrespo)
[10:51:32] terbium is running mediawiki maintenance jobs, which can take a long time in some cases, maybe just check whether there are currently running processes and ping their owners directly. for reboots of terbium I usually got in touch with Greg and added it to the Deployments page
[10:53:35] gehel: so maybe check with Krinkle, the rest seems fine
[10:53:53] I also have some things running there, but can be stopped at any time
[10:54:35] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2068 for cloning to db2047 (duration: 01m 51s)
[10:54:36] moritzm, jynus: individual scripts should be fine, as far as I can see, only the HHVM service needs to be restarted
[10:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:54:59] gehel, a, ok
[10:55:11] the issue with mediawiki ones is that they are long-running
[10:55:20] (up to 24 hours)
[10:57:02] as far as I understand, all long running tasks should be running as standalone scripts, the HHVM service is serving HTTP traffic, and has a max running time of 60 seconds. So restarting it should probably be fine...
[10:59:45] (PS2) Muehlenhoff: Add cron job to synchronise AEAD files from primary auth server [puppet] - https://gerrit.wikimedia.org/r/285160
[11:01:03] Operations, Traffic, Wiki-Loves-Monuments-General, HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234629 (Akoopal) Quick translation: Lemonbit will create a CSR, a certificate request, for wikilovesmonuments.org. This CSR can be sent to the mai...
[11:05:38] (PS1) Jcrespo: Set default db2047 installer to jessie [puppet] - https://gerrit.wikimedia.org/r/285164 (https://phabricator.wikimedia.org/T132011)
[11:06:33] (CR) Jcrespo: [C: 2] Set default db2047 installer to jessie [puppet] - https://gerrit.wikimedia.org/r/285164 (https://phabricator.wikimedia.org/T132011) (owner: Jcrespo)
[11:08:29] RECOVERY - cassandra-a service on restbase1015 is OK: OK - cassandra-a is active
[11:09:06] 5xx high
[11:16:13] ^not an issue
[11:19:49] I have to reimage a couple of db servers, but I will wait until the afternoon
[11:23:45] (PS1) Jcrespo: Reimage db1052 as jessie [puppet] - https://gerrit.wikimedia.org/r/285168 (https://phabricator.wikimedia.org/T125028)
[11:24:28] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#1972522 (jcrespo) p:Triage>Normal
[11:24:55] (CR) Jcrespo: [C: 2] Reimage db1052 as jessie [puppet] - https://gerrit.wikimedia.org/r/285168 (https://phabricator.wikimedia.org/T125028) (owner: Jcrespo)
[11:27:35] (CR) Jcrespo: [C: -1] "Wait until db1052 backups is complete (WIP)." [puppet] - https://gerrit.wikimedia.org/r/285168 (https://phabricator.wikimedia.org/T125028) (owner: Jcrespo)
[11:29:48] PROBLEM - Restbase root url on restbase1015 is CRITICAL: Connection refused
[11:32:50] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.134, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[11:40:10] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[11:45:11] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 206 seconds ago with 0 failures
[12:02:52] godog: i'll deploy rb to restbase1015 to get rid of ^^
[12:08:05] !log restbase re-deployed 7f69f86ee9 to restbase1015 after reimaging
[12:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:08:29] RECOVERY - Restbase root url on restbase1015 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.019 second response time
[12:09:09] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[12:09:48] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[12:10:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[12:11:09] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5142673 keys - replication_delay is 0
[12:14:00] Operations, Traffic, Wiki-Loves-Monuments-General, HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234742 (SindyM3) @Akoopal Thank you!
[12:15:08] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 196 seconds ago with 0 failures
[12:15:29] andre__: how can i get rid of that conpherence sidebar in phab?
[12:15:35] (are we even using that???)
[12:17:55] mobrovac: press \
[12:18:28] nice!
[12:18:33] thnx p858snake!
[12:18:51] the lack of real user interface for that has been filed upstream
[12:19:33] * mobrovac wonders why it is even enabled by default
[12:20:21] mobrovac: what p858snake said (thanks!). Documented at https://www.mediawiki.org/wiki/Phabricator/Help#Using_Conpherence
[12:20:31] kk
[12:21:19] Operations, Commons, MediaWiki-Page-deletion, media-storage, Patch-For-Review: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2234762 (matmarex) a:aaron
[12:40:12] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures
[12:42:25] looking at heka ^^^
[12:43:36] mobrovac: thanks!
[12:45:12] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 154 seconds ago with 0 failures
[13:06:22] (PS4) Gehel: Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[13:07:28] !log restarting and reimaging db1052 (old enwiki master)
[13:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:08:43] (CR) Gehel: [C: 2] Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[13:09:45] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 725 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5148033 keys - replication_delay is 725
[13:12:02] Operations: snapshot1006 has hyperthreading disabled - https://phabricator.wikimedia.org/T133534#2234843 (hoo)
[13:12:55] (PS1) Muehlenhoff: Upgrade to 3.19.8-ckt19 [debs/linux] - https://gerrit.wikimedia.org/r/285175
[13:20:45] (PS2) Jcrespo: Reimage db1052 as jessie [puppet] - https://gerrit.wikimedia.org/r/285168 (https://phabricator.wikimedia.org/T125028)
[13:22:14] (PS1) Hashar: nodepool: pool in 2 TRUSTY instances [puppet] - https://gerrit.wikimedia.org/r/285178 (https://phabricator.wikimedia.org/T133203)
[13:22:55] (CR) Jcrespo: [C: 2] Reimage db1052 as jessie [puppet] - https://gerrit.wikimedia.org/r/285168 (https://phabricator.wikimedia.org/T125028) (owner: Jcrespo)
[13:28:28] (PS2) Muehlenhoff: Upgrade to 3.19.8-ckt19 [debs/linux] - https://gerrit.wikimedia.org/r/285175
[13:32:34] Operations, Discovery, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2158516 (mark) @EBernhardson Is the Labs team aware of this project at all? :)
[13:33:00] Operations, Discovery, Labs, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2234930 (mark)
[13:33:11] Operations, Traffic, Wiki-Loves-Monuments-General, HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2234931 (JanZerebecki) That would work, but there is an easier solution: Use https://letsencrypt.org/ . It should be self-service for the server admin.
[13:33:44] Operations: Set jessie as the default os installer on network boot and manually mark other versions (precise, trusty) - https://phabricator.wikimedia.org/T133539#2234934 (jcrespo)
[13:34:42] Operations: Set jessie as the default os installer on network boot and manually mark other versions (precise, trusty) - https://phabricator.wikimedia.org/T133539#2234948 (jcrespo) Directly related to: T123525
[13:37:35] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5140698 keys - replication_delay is 0
[13:37:57] Blocked-on-Operations, Operations, Increasing-content-coverage, Research-and-Data-Backlog: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2229360 (fgiunchedi) I took a quick look at this, the source packages we'd need are: scikit-learn, jobli...
[13:39:36] (PS1) Rush: labstore throughput thresholds adjustments [puppet] - https://gerrit.wikimedia.org/r/285179 (https://phabricator.wikimedia.org/T126237)
[13:40:50] (CR) jenkins-bot: [V: -1] labstore throughput thresholds adjustments [puppet] - https://gerrit.wikimedia.org/r/285179 (https://phabricator.wikimedia.org/T126237) (owner: Rush)
[13:42:26] Operations: snapshot1006 has hyperthreading disabled - https://phabricator.wikimedia.org/T133534#2234971 (ArielGlenn) It looks like snapshot1007 also has it disabled.
[13:43:06] Operations, RESTBase, Services, Traffic, Mobile-Content-Service: Varnish not purging RESTBase URIs - https://phabricator.wikimedia.org/T127370#2234977 (fgiunchedi)
[13:46:09] (CR) Gehel: "Puppet compiler output: https://puppet-compiler.wmflabs.org/2544/elastic1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: Gehel)
[13:49:37] Operations: snapshot1006 has hyperthreading disabled - https://phabricator.wikimedia.org/T133534#2234843 (MoritzMuehlenhoff) /proc/cpuinfo on snapshot1006 has "ht" for all 32 cores, though?
[13:51:16] (PS2) Rush: labstore throughput thresholds adjustments [puppet] - https://gerrit.wikimedia.org/r/285179 (https://phabricator.wikimedia.org/T126237)
[13:51:41] Operations: snapshot1006 has hyperthreading disabled - https://phabricator.wikimedia.org/T133534#2235025 (ArielGlenn) Yeah but that just means it's HT capable. If you look at the number of reported cores in /proc/cpuinfo it's 32 instead of the 64 that snapshot1005 shows.
[13:52:27] Operations, Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2235030 (fgiunchedi) found this bug report from gerrit which _might_ be related and fixed in newer git versions: https://bugs.chromium.org/p/chromium/issues/detail?id=165330#c8
[13:55:48] !log lowering disk high watermark on elasticsearch eqiad to rebalance the cluster
[13:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:59:26] (PS3) Muehlenhoff: Upgrade to 3.19.8-ckt19 [debs/linux] - https://gerrit.wikimedia.org/r/285175
[14:01:14] (PS19) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[14:05:16] (PS20) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[14:07:49] Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#2183368 (fgiunchedi) relevant salt query: `salt --out=txt -C 'G@oscodename:jessie' grains.item kernelrelease | grep 3\\.19 | sort` ```lines=5 aqs1001.eqiad.wmnet: {'kernelrelease': '3.19.0-2-amd64'} aqs100...
[14:08:26] Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#2235130 (fgiunchedi) p:Triage>Normal
[14:08:34] (CR) Muehlenhoff: [C: 2 V: 2] Upgrade to 3.19.8-ckt19 [debs/linux] - https://gerrit.wikimedia.org/r/285175 (owner: Muehlenhoff)
[14:21:16] (PS1) Jcrespo: Upgrade db1052 to new puppet core class + MariaDB10 + jessie [puppet] - https://gerrit.wikimedia.org/r/285183 (https://phabricator.wikimedia.org/T125028)
[14:22:41] gehel: https://gerrit.wikimedia.org/r/#/c/283466/ ?
[14:23:07] (03CR) 10DCausse: [C: 031] Use unicast instead of multicast for Elasticsearch node communication [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [14:23:45] 06Operations: snapshot1006 has hyperthreading disabled - https://phabricator.wikimedia.org/T133534#2235146 (10ArielGlenn) 05Open>03Resolved a:03ArielGlenn Both hosts now are HT enabled. [14:24:28] 06Operations, 10Dumps-Generation: snapshot1006 has hyperthreading disabled - https://phabricator.wikimedia.org/T133534#2235149 (10ArielGlenn) [14:25:50] 06Operations, 10Monitoring, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#2235153 (10fgiunchedi) removing patch-for-review since it was since merged, ticket still open [14:25:56] paravoid: we'll use the same strategy as previous upgrade: update reprepro only after having deployed new version on all nodes [14:26:00] (03CR) 10Faidon Liambotis: [C: 031] "That's pretty awesome." [puppet] - 10https://gerrit.wikimedia.org/r/284518 (owner: 10BBlack) [14:26:23] why? [14:26:41] (03CR) 10Gehel: [C: 04-1] "We'll wait until new version is deployed everywhere to update reprepro (see commit message)" [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [14:27:58] paravoid: mainly because that has been used before so not much issue there and it seems to be the easiest way to keep 2 versions at the same time for some duration. [14:28:29] paravoid: I'm open to suggestion if you have something significantly better! [14:28:40] so updating reprepro only matters for new installs/reinstalls basically [14:28:49] it won't upgrade existing nodes [14:28:53] paravoid: yep, that's the idea [14:29:06] so why not do it first and adjust your node-by-node upgrade process to just do "apt-get upgrade"? [14:29:23] i.e. why is it not the first instead of the last step? 
[14:29:27] Krenair: do you still need that file?: /srv/mediawiki-staging/php-1.27.0-wmf.21 (wmf/1.27.0-wmf.21 *%>)$ ls -la extensions/1 [14:30:19] paravoid: because we need to have a fairly long transition window. If we have an urgent reinstall during that window, we will not be able to do it from reprepro. [14:31:10] 06Operations, 10Analytics-EventLogging, 06Performance-Team, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2235182 (10fgiunchedi) looks like the alert itself now is no longer unknown, @Krinkle anything else to do within the task? [14:31:27] jzerebecki, nope, deleted [14:31:40] paravoid: To be honest, we do not exactly know how the transition process will be. [14:33:41] !log that file is the same diff as the HEAD commit @tin:/srv/mediawiki-staging/php-1.27.0-wmf.21/extensions/CentralAuth ((398f6d4...) %)$ rm test.patch [14:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:53] (03PS16) 10BBlack: letsencrypt module guts + acme-setup script [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [14:38:26] (03PS5) 10Gehel: Use unicast instead of multicast for Elasticsearch node communication [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) [14:38:52] (03PS21) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [14:40:27] gehel: if you had an urgent reinstall, would you install it with 1.7 or 2.x? 
[14:41:04] 06Operations, 10Monitoring, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#2235193 (10fgiunchedi) a:03faidon moving to @faidon since he mentioned he was working on it [14:41:13] anomie, ostriches, thcipriani|afk, MarkTraceur, Dereckson, matt_flaschen, Urbanecm, *: I'll be starting the swat a bit early unless there are objections [14:41:37] paravoid: most probably 1.7 as we will probably kill traffic to the cluster being upgraded. But again, we need to have more visibility on this transition process... [14:41:49] Ok. I'm ready for SWAT. [14:42:01] jzerebecki: k [14:43:18] (03CR) 10Gehel: [C: 032] Use unicast instead of multicast for Elasticsearch node communication [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [14:43:23] jzerebecki: I see you added one last minute addition, and before Urbanecm a 9th patch [14:43:29] (03PS1) 10ArielGlenn: add nd.edu host as mirror for various datasets incl pageviews [puppet] - 10https://gerrit.wikimedia.org/r/285185 [14:43:59] I can move my patches to the evening SWAT to decrease the load from 10 to 8 if you wish. [14:44:09] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:44:22] (03PS2) 10ArielGlenn: add nd.edu host as mirror for various datasets incl pageviews [puppet] - 10https://gerrit.wikimedia.org/r/285185 [14:44:26] Dereckson: yes I did. lets see how far we get... [14:45:10] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:45:19] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:45:54] (03CR) 10ArielGlenn: [C: 032] add nd.edu host as mirror for various datasets incl pageviews [puppet] - 10https://gerrit.wikimedia.org/r/285185 (owner: 10ArielGlenn) [14:46:17] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282381 (https://phabricator.wikimedia.org/T131605) (owner: 10Greg Grossmeier) [14:46:47] !log restarting elasticsearch server elastic1001.codfw.wmnet - activating unicast (T110236) [14:46:48] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:34] !log restarting elasticsearch server elastic2001.codfw.wmnet - activating unicast (T110236) [14:47:35] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:43] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283873 (https://phabricator.wikimedia.org/T132868) (owner: 10Dereckson) [14:48:00] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:48:10] RECOVERY - Host es2019 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [14:48:46] (03CR) 10BBlack: "PS16: just some minor format/comment stuff, removed the leftover husk of an unused function in acme-setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [14:48:50] jzerebecki: 282381 will require once sync'ed to purge the logo file with echo 'https://en.wikipedia.org/static/images/project-logos/cswiki.png' | mwscript purgeList.php [14:51:25] (03PS2) 10Dereckson: Enable RC patrol on ta.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283873 (https://phabricator.wikimedia.org/T132868) [14:51:42] 06Operations, 10Fundraising Tech Backlog, 10Fundraising-Backlog, 10Traffic, 13Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1287413 (10fgiunchedi) looks like this was waited for until the fundraising was over, can it be resumed now? also w... [14:52:05] (03PS2) 10Dereckson: Revert "350K articles celebration logo on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282381 (https://phabricator.wikimedia.org/T131605) (owner: 10Greg Grossmeier) [14:52:29] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:52:36] Dereckson: Ah, preparation for SWAT? [14:52:40] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:53:23] jzerebecki: changes rebased, Zuul rejected their PS1 as it wasn't able to rebase them in fast forward mode [14:53:40] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:54:11] jzerebecki: by the way, speaking about Zuul, your 283555 is merged :) [14:54:38] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: authoritative copy of 'root' files for upload.wikimedia.org is only in swift - https://phabricator.wikimedia.org/T130709#2235231 (10faidon) I'm not sure I see this as a problem, although having them in VCS would definitely be nice. Regardless, t... [14:55:33] Dereckson: I never added a new namespace, did you? re https://gerrit.wikimedia.org/r/#/c/283643/ can you review that? 
do you know what https://phabricator.wikimedia.org/T132746#2209125 means and if we need to do anything else? [14:56:06] * Dereckson looks 283643. [14:56:09] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:57:04] jzerebecki: there is a script to run, namespacesDupe, to convert any "Resolution:foo" page in NS_MAIN to "foo" page in the new NS [14:57:21] k [14:57:35] *sigh* rebase wmf21 failed [14:58:15] (03CR) 10Dereckson: [C: 031] Add Resolution: namespace to foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [14:59:23] jzerebecki: looks good, the only trap for namespaces is "_" is mandatory instead of " ", or it doesn't appear in some places. [15:00:19] * Dereckson looks T132746#2209125 [15:00:53] Present [15:01:51] 06Operations, 07Puppet, 06Commons, 10Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2235251 (10MoritzMuehlenhoff) [15:01:53] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2235252 (10MoritzMuehlenhoff) [15:01:56] 06Operations, 13Patch-For-Review: Mediawiki font packages: switch to Jessie - https://phabricator.wikimedia.org/T102623#2235249 (10MoritzMuehlenhoff) 05Open>03Resolved This is all fixed now. [15:02:14] (03PS2) 10Mattflaschen: Enable Flow opt-in beta feature on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284020 (https://phabricator.wikimedia.org/T132914) (owner: 10Catrope) [15:02:14] k: that means to run `mwscript namespaceDupes.php foundationwiki --move-talk` as a dry run, and if output looks good, the same with --fix at the end to it. 
[15:02:19] (03PS1) 10ArielGlenn: mark closed wikis as such on the main index.html page for downloads [dumps] - 10https://gerrit.wikimedia.org/r/285187 (https://phabricator.wikimedia.org/T133112) [15:02:26] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:02:27] (03PS5) 10Mattflaschen: Use notify-type-availability due to Echo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) [15:02:48] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:03:47] ok fixed the rebase [15:04:50] 06Operations, 10Fundraising Tech Backlog, 10Fundraising-Backlog, 10Traffic, 13Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1287413 (10BBlack) It's mostly been blocked on lack of anyone having time to work on it, too. At this point, it's... [15:04:57] PROBLEM - Host es2019 is DOWN: PING CRITICAL - Packet loss = 100% [15:05:23] (03PS1) 10Gehel: `put_settings` is one of the few operations that does not support timeouts in the version we use. [puppet] - 10https://gerrit.wikimedia.org/r/285189 [15:05:25] 06Operations, 10Traffic, 07Varnish: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503#2235279 (10BBlack) [15:05:27] 06Operations, 10Fundraising Tech Backlog, 10Fundraising-Backlog, 10Traffic, 13Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#2235278 (10BBlack) [15:06:06] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:06:17] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:07:06] !log jzerebecki@tin Synchronized php-1.27.0-wmf.21/includes/specials/SpecialRunJobs.php: wmf21 e411ad62516bfae31f8d482a22d378912fe5979f T89169 : SpecialRunJobs: delegate error handling to MWExceptionHandler (duration: 00m 37s) [15:07:07] T89169: Log php fatals with full backtraces again (fatal.log on fluorine) - https://phabricator.wikimedia.org/T89169 [15:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:47] (03CR) 10Gehel: [C: 032] `put_settings` is one of the few operations that does not support timeouts in the version we use. [puppet] - 10https://gerrit.wikimedia.org/r/285189 (owner: 10Gehel) [15:08:33] (03PS2) 10Jcrespo: Upgrade db1052 to new puppet core class + MariaDB10 + jessie [puppet] - 10https://gerrit.wikimedia.org/r/285183 (https://phabricator.wikimedia.org/T125028) [15:09:08] (03CR) 10Jcrespo: [C: 032] Upgrade db1052 to new puppet core class + MariaDB10 + jessie [puppet] - 10https://gerrit.wikimedia.org/r/285183 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [15:09:33] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282381 (https://phabricator.wikimedia.org/T131605) (owner: 10Greg Grossmeier) [15:09:59] (03Merged) 10jenkins-bot: Revert "350K articles celebration logo on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282381 (https://phabricator.wikimedia.org/T131605) (owner: 10Greg Grossmeier) [15:10:16] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:11:13] 06Operations, 06Discovery, 06Labs, 10hardware-requests, 03Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2158516 (10chasemp) I think we were aware of it in a cursory manner. What is the Labs general outcome for this? Is this a service provided... 
[15:11:47] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:12:21] !log jzerebecki@tin Synchronized static/images/project-logos/cswiki.png: config 6c5214dcb3f148dab12fe4aa6ae0921cff05038a T131605 1 of 2 : Revert "350K articles celebration logo on cs.wikipedia" (duration: 00m 26s) [15:12:22] T131605: Set celebration logo on Czech Wikipedia - https://phabricator.wikimedia.org/T131605 [15:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:27] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#1177707 (10fgiunchedi) looks like all blocking subtasks are fixed now, @hashar how can we try this again? I tried accessing `... [15:12:33] (03PS3) 10JanZerebecki: Enable RC patrol on ta.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283873 (https://phabricator.wikimedia.org/T132868) (owner: 10Dereckson) [15:12:37] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:12:45] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283873 (https://phabricator.wikimedia.org/T132868) (owner: 10Dereckson) [15:13:30] (03Merged) 10jenkins-bot: Enable RC patrol on ta.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283873 (https://phabricator.wikimedia.org/T132868) (owner: 10Dereckson) [15:14:04] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: config 6c5214dcb3f148dab12fe4aa6ae0921cff05038a T131605 2 of 2 : Revert "350K articles celebration logo on cs.wikipedia" (duration: 00m 30s) [15:14:05] T131605: Set celebration logo on Czech Wikipedia - https://phabricator.wikimedia.org/T131605 [15:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:26] RECOVERY - Host es2019 is UP: PING OK - Packet loss = 0%, RTA = 36.92 ms [15:14:47] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:15:23] damn citoid... guess which gov db is not answering again [15:15:27] mobrovac: ^ [15:16:02] yeah [15:16:11] 06Operations, 06Discovery, 06Labs, 10hardware-requests, 03Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2235335 (10Gehel) a/b/c) The goal is to have a cluster to test improvements in search configuration, allowing to run tests against different op... [15:16:13] wth is wrong with them lately? [15:16:37] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:16:45] nothing is wrong with them. WE should not be relying on them for reliable monitoring [15:18:16] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:31] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2235352 (10Papaul) I did a hardware diagnostic, everything looks good. The system is back up. 
[15:19:55] 06Operations, 06Discovery, 06Labs, 10hardware-requests, 03Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2235357 (10chasemp) If someone wanted to launch a bunch of queries against this cluster from Tools to analyze our search results would this be... [15:19:58] 06Operations, 10Traffic, 13Patch-For-Review: Update prod custom varnish package for upstream 3.0.7 + deploy - https://phabricator.wikimedia.org/T96846#2235358 (10BBlack) 05stalled>03declined At this point we're not investing further in varnish3 except for security fixes, so there's no point pursuing a re... [15:20:05] 06Operations, 06Discovery, 06Labs, 10hardware-requests, 03Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2235360 (10EBernhardson) The project will need to be accessible (read/write) from the search project in labs. These will have indices (not upda... [15:20:11] !log T131605 @tin:/srv/mediawiki-staging (master=)$ echo 'https://cs.wikipedia.org/static/images/project-logos/cswiki.png' | mwscript purgeList.php [15:20:12] T131605: Set celebration logo on Czech Wikipedia - https://phabricator.wikimedia.org/T131605 [15:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:48] Dereckson: please test [15:20:59] Testing. 
[15:21:50] jzerebecki: varnish cache uses en.wikipedia.org to serve /static folder [15:22:07] So the file to purge is https://en.wikipedia.org/static/images/project-logos/cswiki.png [15:22:17] ok will redo [15:22:24] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: config 5c153e934a88f7b5e16a2d4b244ed76a9dfc1cdc T132868 : Enable RC patrol on ta.wikiquote (duration: 00m 28s) [15:22:25] T132868: enable patrol in tawikiquote - https://phabricator.wikimedia.org/T132868 [15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:35] 06Operations, 06Discovery, 06Labs, 10hardware-requests, 03Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2235367 (10EBernhardson) @chasmp yes i think that would be reasonable. Since we are getting spinning disks they will only see good performance... [15:22:54] !log T131605 @tin:/srv/mediawiki-staging (master=)$ echo 'https://en.wikipedia.org/static/images/project-logos/cswiki.png' | mwscript purgeList.php [15:22:55] T131605: Set celebration logo on Czech Wikipedia - https://phabricator.wikimedia.org/T131605 [15:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:01] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack) [15:23:08] Works. 
[15:23:12] thx [15:23:17] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235389 (10BBlack) [15:23:19] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2235390 (10BBlack) [15:23:26] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack) [15:23:28] 06Operations, 10Traffic, 06WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#2235392 (10BBlack) [15:23:44] Dereckson: please test 'config 5c153e934a88f7b5e16a2d4b244ed76a9dfc1cdc T132868 : Enable RC patrol on ta.wikiquote' [15:23:45] T132868: enable patrol in tawikiquote - https://phabricator.wikimedia.org/T132868 [15:23:52] Testing. [15:24:15] (03PS3) 10JanZerebecki: Add Resolution: namespace to foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [15:24:32] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [15:24:56] (03Merged) 10jenkins-bot: Add Resolution: namespace to foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [15:25:47] 06Operations, 10Traffic, 06WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#2235431 (10BBlack) >>! In T101048#1826732, @Platonides wrote: > Is there any technical reason not to have the servers using 700 certificates, using SNI fo... 
[15:26:39] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: config 70ee0fd97c26da02ed8ea511b2e21b47f374101b T132746 : Add Resolution: namespace to foundationwiki (duration: 00m 31s) [15:26:40] T132746: Add Resolution: namespace to wikimediafoundation.org - https://phabricator.wikimedia.org/T132746 [15:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:41] !log T132746 @tin:/srv/mediawiki-staging (master=)$ mwscript namespaceDupes.php foundationwiki --move-talk [15:28:42] T132746: Add Resolution: namespace to wikimediafoundation.org - https://phabricator.wikimedia.org/T132746 [15:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:52] jzerebecki: 'config 5c153e934a88f7b5e16a2d4b244ed76a9dfc1cdc T132868 : Enable RC patrol on ta.wikiquote' should be tested further on wiki, but looks good to me so far [15:28:52] T132868: enable patrol in tawikiquote - https://phabricator.wikimedia.org/T132868 [15:28:56] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2235476 (10BBlack) [15:29:03] I'll leave a note on the task for that. 
[15:29:58] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack) [15:30:00] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [15:30:04] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:30:27] 06Operations, 10Traffic, 06WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#2235483 (10BBlack) [15:30:29] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423896 (10BBlack) [15:31:00] !log T132868 3508 links to fix, 3508 were resolvable. Looks good! [15:31:02] T132868: enable patrol in tawikiquote - https://phabricator.wikimedia.org/T132868 [15:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:09] jouncebot: next [15:31:09] In 1 hour(s) and 28 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160425T1700) [15:31:55] !log T132868 @tin:/srv/mediawiki-staging (master %=)$ mwscript namespaceDupes.php foundationwiki --move-talk --fix >out-fix.txt [15:31:56] T132868: enable patrol in tawikiquote - https://phabricator.wikimedia.org/T132868 [15:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:14] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:33:35] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:33:44] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:35:03] !log T132868 db error: Query: UPDATE `page` SET page_namespace = '100',page_title = '2008-09_Budget' WHERE page_id = '21741' Function: NamespaceConflictChecker::movePage Error: 
1062 Duplicate entry '100-2008-09_Budget' for key 'name_title' (10.64.0.205) [15:35:04] T132868: enable patrol in tawikiquote - https://phabricator.wikimedia.org/T132868 [15:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:47] !log err T132746 instead of T132868 [15:35:49] T132746: Add Resolution: namespace to wikimediafoundation.org - https://phabricator.wikimedia.org/T132746 [15:35:49] T132868: enable patrol in tawikiquote - https://phabricator.wikimedia.org/T132868 [15:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:45] jzerebecki: There are pages already existing, but `mwscript namespaceDupes.php foundationwiki --move-talk --merge` would work, merging histories. [15:37:33] I've checked a dry run, 51 pages to fix, 51 were resolvable 3508 links to fix, 3508 were resolvable. [15:38:02] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Provide Jessie package to fullfil Mediawiki::Packages requirement - https://phabricator.wikimedia.org/T95002#2235498 (10hashar) `integration-slave-jessie-1001.integration.eqiad.wmflabs` but you are denied somehow: ``` pam_access(sshd:... [15:38:55] PROBLEM - Auth DNS for labs pdns on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:39:31] 06Operations, 10ops-codfw, 06DC-Ops: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#2235499 (10Papaul) @Joe All the mw servers with old IDRAC firmware are up to date only mw2123 and mw2134 that i can not access for some reason. The systems need to be taken down for me to troubl... 
[15:40:17] !log T132746 @tin:/srv/mediawiki-staging (master %=)$ mwscript namespaceDupes.php foundationwiki --move-talk --merge --fix >try2-out-fix.txt [15:40:18] T132746: Add Resolution: namespace to wikimediafoundation.org - https://phabricator.wikimedia.org/T132746 [15:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:38] (03CR) 1020after4: "This is now working in beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [15:40:55] (03PS22) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [15:41:56] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2235514 (10jcrespo) @Papaul Did you see anything strange regarding loose power connectors or any other thing that could lead to sudden server stop (silly things such as reset buttons in a bad position or potential... [15:42:11] !log T132746 another run needed @tin:/srv/mediawiki-staging (master %=)$ mwscript namespaceDupes.php foundationwiki --move-talk --merge --fix >try2-out-fix2.txt [15:42:12] T132746: Add Resolution: namespace to wikimediafoundation.org - https://phabricator.wikimedia.org/T132746 [15:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:58] !log recovering backed-up db1052 data from dbstore1002 [15:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:05] RECOVERY - Auth DNS for labs pdns on labs-ns1.wikimedia.org is OK: DNS OK: 0.074 seconds response time. nagiostest.eqiad.wmflabs returns [15:45:05] Dereckson: any idea: https://phabricator.wikimedia.org/T132746#2235531 [15:45:06] jzerebecki: Resolution: namespace looks good (and Resolution talk: looks good too). Could anybody tell me what problem are we solving? Thanks. 
[15:45:25] see my comment [15:46:14] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:46:26] jzerebecki: tested on Terbium, I confirm they don't go away. Page link table could be updated when these pages will be edited in the future I guess. [15:46:45] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:17] We could task TTO, who has access on foundation, to replace the links/template names in the text. [15:47:20] (03PS3) 10JanZerebecki: Disable MoodBar on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) (owner: 10Urbanecm) [15:47:21] Where? In the task? [15:47:26] y [15:47:35] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:36] see the last link i posted [15:47:54] Urbanecm: will move on to the next patch, per Dereckson [15:48:07] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) (owner: 10Urbanecm) [15:48:31] (03Merged) 10jenkins-bot: Disable MoodBar on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) (owner: 10Urbanecm) [15:48:51] Ok. 
[15:49:41] (03PS1) 10BBlack: apt|mirrors|ubuntu: puppetized LE certs [puppet] - 10https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) [15:50:09] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: config 2497480f39ea2dad27bab554b7d6e3a996560f46 T131685 : Disable MoodBar on testwiki (duration: 00m 27s) [15:50:11] T131685: Disable MoodBar on the Test Wikipedia - https://phabricator.wikimedia.org/T131685 [15:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:40] !log restart pdns to see if that helps a labs issue w/ tools proxy [15:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:01] (03CR) 10jenkins-bot: [V: 04-1] apt|mirrors|ubuntu: puppetized LE certs [puppet] - 10https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [15:51:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:51:13] Urbanecm: please test T131685 [15:51:14] T131685: Disable MoodBar on the Test Wikipedia - https://phabricator.wikimedia.org/T131685 [15:51:37] (03PS3) 10JanZerebecki: Add domain *.natmus.dk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283653 (https://phabricator.wikimedia.org/T132748) (owner: 10Urbanecm) [15:51:45] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:52:11] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283653 (https://phabricator.wikimedia.org/T132748) (owner: 10Urbanecm) [15:52:13] I can't see Moodbar in Special:Version on test2wiki so I think that this's fine. Thanks for deploy. 
[15:52:35] (03Merged) 10jenkins-bot: Add domain *.natmus.dk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283653 (https://phabricator.wikimedia.org/T132748) (owner: 10Urbanecm) [15:53:04] (03PS5) 10JanZerebecki: Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [15:53:12] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [15:53:23] !log test reboot for mw2212 - T129196 [15:53:23] T129196: mw2212 had several downtimes recently - test before repool - https://phabricator.wikimedia.org/T129196 [15:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:43] (03Merged) 10jenkins-bot: Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [15:55:21] !next [15:55:26] PROBLEM - Auth DNS for labs pdns on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:55:26] !log jzerebecki@tin Synchronized static/images/project-logos/dewiki-1.5x.png: config 0d0d7acf79891ad36d48cd30887bb2299d43f264 T132792 1 of 3 : Add HD versions of logo for dewiki (duration: 00m 25s) [15:55:27] T132792: HD-versions of the dewiki-logos - https://phabricator.wikimedia.org/T132792 [15:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:27] !log jzerebecki@tin Synchronized static/images/project-logos/dewiki-2x.png: config 0d0d7acf79891ad36d48cd30887bb2299d43f264 T132792 2 of 3 : Add HD versions of logo for dewiki (duration: 00m 25s) [15:56:27] T132792: HD-versions of the dewiki-logos - https://phabricator.wikimedia.org/T132792 [15:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[15:56:55] !log restarting elasticsearch server elastic2002.codfw.wmnet - activating unicast (T110236) [15:56:55] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [15:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:28] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2235613 (10Papaul) @jcrespo I didn't notice anything. I also make sure after the diagnostic i unplugged the power, the mgmt and data link for 5 minutes before putting everything back. [15:57:45] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:57:45] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:57:59] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: config 0d0d7acf79891ad36d48cd30887bb2299d43f264 T132792 3 of 3 ; and config 7144b0139c85b565af76a993b38209b4c0989dec T132748 (duration: 00m 25s) [15:58:01] T132748: Add domain to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T132748 [15:58:01] T132792: HD-versions of the dewiki-logos - https://phabricator.wikimedia.org/T132792 [15:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:11] Urbanecm: please test both [15:59:13] Urbanecm: jzerebecki: tested 7144b0139c85b565af76a993b38209b4c0989dec, works [15:59:25] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:59:38] anyone up for a manual rebase of https://gerrit.wikimedia.org/r/#/c/284712/ ? [15:59:50] !log restart pdns recursor & pdns on holmium [15:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:56] RECOVERY - Auth DNS for labs pdns on labs-ns0.wikimedia.org is OK: DNS OK: 5.099 seconds response time. nagiostest.eqiad.wmflabs returns [16:01:03] (03CR) 10Thcipriani: [C: 031] "Still run into this problem occasionally." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282441 (https://phabricator.wikimedia.org/T110407) (owner: 10Mattflaschen) [16:01:29] jzerebecki, I can do it. [16:01:33] jzerebecki: It works. Thanks. [16:01:41] matt_flaschen: do your patches need any other work than deployment (like run any maintance scripts)? [16:03:09] (03CR) 10JanZerebecki: [C: 04-1] "This reads as if it needs I09f39f5fc5f13f3253af9f7819bca81f1601da93 to be deployed, it is not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: 10Mattflaschen) [16:04:18] jzerebecki, can you run FlowUpdateBetaFeaturePreference.php? I'm not sure if anyone on frwikisource has Flow user talk, but if not, it won't hurt. [16:04:37] It's in Flow/maintenance [16:05:04] PROBLEM - Auth DNS for labs pdns on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:05:12] PROBLEM - check_mysql on payments2003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:05:12] PROBLEM - check_mysql on payments2002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:06:02] PROBLEM - Auth DNS for labs pdns on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:06:27] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2235634 (10jcrespo) a:03jcrespo @Papaul thank you very much for the time! I will restart now the service, but not put it back into production. I will keep meanwhile the ticket open for 2 months unless somethin... [16:06:31] (03PS2) 10Mattflaschen: Add *.asc-test.nl to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284712 (https://phabricator.wikimedia.org/T133286) (owner: 10Urbanecm) [16:07:03] RECOVERY - Auth DNS for labs pdns on labs-ns1.wikimedia.org is OK: DNS OK: 0.060 seconds response time. 
nagiostest.eqiad.wmflabs returns [16:07:34] (03PS6) 10Mattflaschen: Use notify-type-availability due to Echo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) [16:07:50] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284712 (https://phabricator.wikimedia.org/T133286) (owner: 10Urbanecm) [16:08:16] (03Merged) 10jenkins-bot: Add *.asc-test.nl to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284712 (https://phabricator.wikimedia.org/T133286) (owner: 10Urbanecm) [16:08:27] (03CR) 10Mattflaschen: "It doesn't require simultaneous deployment with the Echo change any more." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: 10Mattflaschen) [16:08:39] (03PS7) 10Mattflaschen: Use notify-type-availability due to Echo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) [16:09:11] hmmmm http://quarry.wmflabs.org/ is down [16:09:15] jzerebecki, RoanKattouw changed the config patch so it works with both old and new Echo, so it no longer requires everything be simultaneously deployed. [16:09:36] ^^^ fixing payments200[*] mysql replication [16:09:53] jzerebecki, also, I rebased https://gerrit.wikimedia.org/r/#/c/284712/ [16:10:12] PROBLEM - check_mysql on payments2002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:10:12] PROBLEM - check_mysql on payments2003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:10:13] RECOVERY - Auth DNS for labs pdns on labs-ns0.wikimedia.org is OK: DNS OK: 0.078 seconds response time. 
nagiostest.eqiad.wmflabs returns [16:10:21] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: config 1b156b46d42f2482c1941427dc4ec8aeb427f0b3 T133286 : Add *.asc-test.nl to wgCopyUploadsDomains (duration: 00m 27s) [16:10:22] T133286: Please put domain *.asc-test.nl on the list of whitelisted domains for upload with GWToolset - https://phabricator.wikimedia.org/T133286 [16:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:33] Urbanecm: please test [16:12:55] (03PS3) 10JanZerebecki: Enable Flow opt-in beta feature on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284020 (https://phabricator.wikimedia.org/T132914) (owner: 10Catrope) [16:13:07] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284020 (https://phabricator.wikimedia.org/T132914) (owner: 10Catrope) [16:13:31] (03Merged) 10jenkins-bot: Enable Flow opt-in beta feature on frwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284020 (https://phabricator.wikimedia.org/T132914) (owner: 10Catrope) [16:13:49] I have no access to the toolset so I have no idea how to test if the domain is really whitelisted. 
[16:15:12] RECOVERY - check_mysql on payments2003 is OK: Uptime: 1210 Threads: 1 Questions: 237 Slow queries: 11 Opens: 231 Flush tables: 1 Open tables: 64 Queries per second avg: 0.195 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [16:15:12] RECOVERY - check_mysql on payments2002 is OK: Uptime: 1246 Threads: 1 Questions: 900 Slow queries: 11 Opens: 231 Flush tables: 1 Open tables: 64 Queries per second avg: 0.722 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [16:15:24] Urbanecm: ok [16:15:49] !log jzerebecki@tin Synchronized wmf-config/InitialiseSettings.php: config 1f8fea8e53b2e95caf0f559e0ad6e2cb1d51c967 T132914 : Enable Flow opt-in beta feature on frwikisource (duration: 00m 28s) [16:15:50] T132914: Enable Flow as a Beta feature on French Wikisource - https://phabricator.wikimedia.org/T132914 [16:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:03] "PROCS CRITICAL: 2 processes with UID = 0 (root), args '/bin/bash /usr/local/bin/eventlogging_sync.sh' " is tripping up often [16:20:13] and it's silenced as well, dunno why [16:20:16] ottomata: ^ :) [16:20:47] I'm going to enable notifications for it now [16:20:48] oof, ok, jynus ^^ :) [16:21:07] jynus: probably silenced it [16:21:53] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: puppet fail [16:24:03] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:56] !log T132914: @tin:/srv/mediawiki-staging (master %=)$ mwscript extensions/Flow/maintenance/FlowUpdateBetaFeaturePreference.php frwikisource [16:24:57] T132914: Enable Flow as a Beta feature on French Wikisource - https://phabricator.wikimedia.org/T132914 [16:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:27] matt_flaschen: please test^^ [16:26:09] (03PS8) 10JanZerebecki: Use notify-type-availability due to Echo change [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: 10Mattflaschen) [16:28:59] jzerebecki: Urbanecm: follow-up about Resolution: issue, after two edits on foundationwiki by guillom, script output is now "0 pages to fix, 0 were resolvable. 0 links to fix, 0 were resolvable. Looks good!". [16:29:28] \o/ [16:29:39] Dereckson: jzerebecki: So it's done? Thanks :) [16:29:50] jzerebecki, Beta Feature works. [16:30:09] Yes, it's done. You're welcome and thanks for submitting config change. [16:30:17] Dereckson: thx [16:30:24] yw [16:31:45] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: 10Mattflaschen) [16:32:12] (03Merged) 10jenkins-bot: Use notify-type-availability due to Echo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) (owner: 10Mattflaschen) [16:34:06] !log jzerebecki@tin Synchronized wmf-config/CommonSettings.php: config 3bd39b8b9ac9fb252ff8a3e9a93b14c1defd45fc T132820 : Use notify-type-availability due to Echo change (duration: 00m 26s) [16:34:07] T132820: Rationalize wgEchoDefaultNotificationTypes - https://phabricator.wikimedia.org/T132820 [16:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:24] matt_flaschen: please test [16:35:39] (03CR) 10Dereckson: "There are only two files: jobqueue-labs.php and jobqueue.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324) (owner: 10Dereckson) [16:38:31] (03CR) 10Dereckson: "The last comment was for the other patch, so for this one, there is now only PoolCounterSettings.php and not -eqiad/-codfw files anymore." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324) (owner: 10Dereckson) [16:40:12] jzerebecki, looks good. [16:40:26] SWAT done. 
[16:40:29] jzerebecki, thanks. [16:40:32] yw [16:40:49] thanks [16:41:18] (03PS1) 10Yuvipanda: labs: Add diamond collector for PDNS authoritative server [puppet] - 10https://gerrit.wikimedia.org/r/285206 [16:41:27] chasemp: ^ [16:41:42] chasemp: it's only authoritative tho, no recursor. [16:41:45] YuviPanda: I would like to run a bit of testing first I have had really mixed luck w/ their collectors out of the box [16:41:52] I'll look into it [16:42:07] but in general agreed [16:42:21] chasemp: kk. I looked at the code it's just calling pdns_control and parsing out the values from there. [16:42:29] ok probably ok then [16:42:56] yeah. there's rec_control for recursor stuff, just no collector [16:44:02] lol, that won't work because it needs... sudo. [16:46:07] (03CR) 10Yuvipanda: [C: 031] labstore throughput thresholds adjustments [puppet] - 10https://gerrit.wikimedia.org/r/285179 (https://phabricator.wikimedia.org/T126237) (owner: 10Rush) [16:47:21] (03PS3) 10Rush: Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [16:47:38] (03PS3) 10Rush: labstore throughput thresholds adjustments [puppet] - 10https://gerrit.wikimedia.org/r/285179 (https://phabricator.wikimedia.org/T126237) [16:50:24] PROBLEM - eventlogging_sync processes on dbstore1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [16:50:32] (03PS1) 10Faidon Liambotis: network: kill iron's SLAAC from ::constants [puppet] - 10https://gerrit.wikimedia.org/r/285207 [16:50:34] (03PS1) 10Faidon Liambotis: mariadb: don't spawn el_sync.sh via sudo -u root [puppet] - 10https://gerrit.wikimedia.org/r/285208 [16:50:36] (03CR) 10Rush: [C: 032] labstore throughput thresholds adjustments [puppet] - 10https://gerrit.wikimedia.org/r/285179 (https://phabricator.wikimedia.org/T126237) (owner: 10Rush) [16:50:48] (03PS2) 10Faidon Liambotis: network: kill iron's SLAAC from ::constants [puppet] - 
10https://gerrit.wikimedia.org/r/285207 [16:51:02] (03CR) 10Faidon Liambotis: [C: 032] network: kill iron's SLAAC from ::constants [puppet] - 10https://gerrit.wikimedia.org/r/285207 (owner: 10Faidon Liambotis) [16:52:13] ottomata, jynus: https://gerrit.wikimedia.org/r/#/c/285208/1/files/mariadb/eventlogging_sync.init [16:53:41] ok [16:53:43] (03PS2) 10Yuvipanda: labs: Add diamond collector for PDNS authoritative server [puppet] - 10https://gerrit.wikimedia.org/r/285206 [16:54:48] I think it already had its own user [16:54:52] RECOVERY - eventlogging_sync processes on dbstore1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [16:55:03] (03CR) 10jenkins-bot: [V: 04-1] labs: Add diamond collector for PDNS authoritative server [puppet] - 10https://gerrit.wikimedia.org/r/285206 (owner: 10Yuvipanda) [16:55:22] fucking arrow alignment [16:55:30] (I presume) [16:55:42] cross-post from #-tech: [16:55:45] linking this here for more visibility, came into -ops channel + phab over the weekend: [16:55:48] https://phabricator.wikimedia.org/T133441 - "Seeing desktop text cache while browsing mobile sites" [16:55:52] the report was only seen for the foundation wiki, but it's still troubling that there's no known cause, and we don't know why isn't affecting other wikis yet (or when it will?) [16:56:51] (03CR) 10Jcrespo: [C: 031] "eventlogging_sync is horribly coded, I only was reposponsible to puppetize it. It needs to be rewritten anyway. 
T124307" [puppet] - 10https://gerrit.wikimedia.org/r/285208 (owner: 10Faidon Liambotis) [16:58:06] (03PS2) 10Faidon Liambotis: mariadb: don't spawn el_sync.sh via sudo -u root [puppet] - 10https://gerrit.wikimedia.org/r/285208 [16:58:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mariadb: don't spawn el_sync.sh via sudo -u root [puppet] - 10https://gerrit.wikimedia.org/r/285208 (owner: 10Faidon Liambotis) [16:58:46] (03PS3) 10Yuvipanda: labs: Add diamond collector for PDNS authoritative server [puppet] - 10https://gerrit.wikimedia.org/r/285206 [16:59:32] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:05] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160425T1700). [17:01:04] SMalyshev: ^ ping... do we have anything to deploy? [17:01:15] !log restarting elasticsearch server elastic2003.codfw.wmnet - activating unicast (T110236) [17:01:16] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [17:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:43] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2235958 (10mmodell) >>! In T133211#2228392, @hashar wrote: > Hypothetically, can we rewind to d... 
[17:10:11] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:12:02] (PS1) Dereckson: Reconfigure interface editor group on ur.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/285209 (https://phabricator.wikimedia.org/T133564) [17:15:11] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:15:11] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:15:11] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:15:12] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:15:12] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:15:53] ? [17:16:58] (CR) Dereckson: "Follow-up: I45b53d7e758a31428a2568262aeb1df5524adbc7" [mediawiki-config] - https://gerrit.wikimedia.org/r/258453 (https://phabricator.wikimedia.org/T120348) (owner: Luke081515) [17:17:44] haha paravoid. jynus uh, that seems to make sense to me! 
[17:18:19] ok [17:20:11] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:20:11] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:20:12] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:20:12] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:20:12] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:20:15] !log 😊 [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:22] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms [17:20:47] Operations, Discovery, Wikidata, Wikidata-Query-Service: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2236004 (Gehel) [17:22:30] !log Ⓐ [17:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:25] I'm gonna have to work more emojis into my !logs [17:25:05] ^^^ please ignore all frack-related pages for the next 30 min or so, I'm working on package updates and there's risk of cross-host activity alerts [17:25:11] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:25:11] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:25:12] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:25:12] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:25:12] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [17:26:04] !log update RESTBase to c1d5193, staging [17:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[17:30:06] (03PS1) 10Dzahn: ganglia: remove service notify for aggregators [puppet] - 10https://gerrit.wikimedia.org/r/285212 [17:30:11] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [17:30:11] RECOVERY - check_mysql on payments1004 is OK: Uptime: 3614475 Threads: 1 Questions: 1374542 Slow queries: 4565 Opens: 4756 Flush tables: 167 Open tables: 61 Queries per second avg: 0.380 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:30:11] RECOVERY - check_mysql on payments2001 is OK: Uptime: 5892 Threads: 3 Questions: 10574 Slow queries: 11 Opens: 276 Flush tables: 1 Open tables: 45 Queries per second avg: 1.794 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:30:31] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [17:30:36] ostriches: https://phabricator.wikimedia.org/T119806 [17:30:39] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2236031 (10mmodell) Todo: figure out how to get ssh-add to accept the password once so that we... 
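The check_mysql alerts above all carry the same three replication fields (Slave IO, Slave SQL, Seconds Behind Master). A minimal sketch of pulling those fields out of such a line, assuming only the text layout visible in this log; `parse_check_mysql` and `replication_healthy` are hypothetical helper names, not part of any monitoring tool:

```python
import re

# Pattern modelled only on the alert text seen in this log (an assumption,
# not the plugin's documented output format).
_STATUS_RE = re.compile(
    r"Slave IO:\s+(?P<io>\w+)\s+"
    r"Slave SQL:\s+(?P<sql>\w+)\s+"
    r"Seconds Behind Master:\s+(?P<lag>\(null\)|\d+)"
)


def parse_check_mysql(line):
    """Return the replication fields of one alert line, or None if absent."""
    match = _STATUS_RE.search(line)
    if match is None:
        return None
    raw_lag = match.group("lag")
    return {
        "io_running": match.group("io") == "Yes",
        "sql_running": match.group("sql") == "Yes",
        # "(null)" appears when the SQL thread is stopped, so lag is unknown.
        "lag_seconds": None if raw_lag == "(null)" else int(raw_lag),
    }


def replication_healthy(status):
    """Both threads running and a known lag value."""
    return (
        status is not None
        and status["io_running"]
        and status["sql_running"]
        and status["lag_seconds"] is not None
    )
```

Fed the 17:15:11 PROBLEM line for payments1004, this reports the SQL thread as stopped with unknown lag; fed the 17:30:11 RECOVERY line, it reports both threads running with a lag of 0.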
[17:30:51] ostriches: had to search phab for "emoji" :p [17:32:12] (03CR) 10Dzahn: [C: 031] ganglia: don't run ganglia-monitor in labs [puppet] - 10https://gerrit.wikimedia.org/r/284912 (https://phabricator.wikimedia.org/T115330) (owner: 10Filippo Giunchedi) [17:33:02] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.50 ms [17:35:11] RECOVERY - check_mysql on payments1002 is OK: Uptime: 199 Threads: 1 Questions: 1171 Slow queries: 19 Opens: 154 Flush tables: 1 Open tables: 64 Queries per second avg: 5.884 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:37:21] PROBLEM - puppet last run on mc2003 is CRITICAL: CRITICAL: puppet fail [17:37:57] (03CR) 10Ottomata: Read values inbound in X-Analytics header (pageview and preview) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [17:39:11] RECOVERY - check_mysql on payments1003 is OK: Uptime: 117 Threads: 1 Questions: 755 Slow queries: 13 Opens: 155 Flush tables: 1 Open tables: 64 Queries per second avg: 6.452 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:41:01] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:43:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 617 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5145418 keys - replication_delay is 617 [17:47:55] !log update RESTBase to c1d5193, canary on restbase1005 [17:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:13] !log db1052 now in jessie [17:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:13] !log wiping db1052 tendril's monitoring data due to 5.5 -> 5.6 incompatibility [17:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:25] where 5.6 means 10, but anyway [17:55:44] !log update RESTBase to c1d5193, 
canary on restbase1007 [17:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:51] (03CR) 10Dzahn: [C: 04-2] "nah, still looking for the real fix and only attempt to start the ones we also generate configs for for this site" [puppet] - 10https://gerrit.wikimedia.org/r/285212 (owner: 10Dzahn) [18:01:00] (03PS1) 10Ottomata: Mark analytics1015 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/285216 [18:02:53] (03CR) 10Ottomata: [C: 032] Mark analytics1015 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/285216 (owner: 10Ottomata) [18:03:11] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review: mw2212 had several downtimes recently - test before repool - https://phabricator.wikimedia.org/T129196#2097723 (10fgiunchedi) I did a test reboot of mw2212 and it came up fine, @papaul are you happy with it hardware-wise to put it back in service? [18:03:52] RECOVERY - puppet last run on mc2003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:05:44] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2236185 (10fgiunchedi) a:03Cmjohnson moving to @cmjohnson since he's decommissioning those [18:07:51] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: fr.planet doesn't update as expected - https://phabricator.wikimedia.org/T133573#2236190 (10Dereckson) a:03Dzahn [18:08:08] (03PS3) 10Aaron Schulz: Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (https://phabricator.wikimedia.org/T95501) [18:09:03] !log update RESTBase to c1d5193 [18:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:51] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:12:30] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: fr.planet doesn't update as expected - 
https://phabricator.wikimedia.org/T133573#2236225 (10Dzahn) ERROR:planet.runner:Error 500 while updating feed https://referencenecessaire.wordpress.com/feed/atom/ INFO:planet.runner:Feed http://www.anthere.org/dotcle... [18:14:01] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 75.22 ms [18:14:32] (03CR) 10Hashar: "I got a Trusty image build and uploaded on wmf labs. I confirmed it is working and would now need Nodepool to spawn a couple instance out " [puppet] - 10https://gerrit.wikimedia.org/r/285178 (https://phabricator.wikimedia.org/T133203) (owner: 10Hashar) [18:14:49] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980389 (10fgiunchedi) anything to be done on this task? related patch has been abandoned [18:17:12] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2236250 (10jcrespo) corruption? ``` 160323 13:11:49 [ERROR] ombudsmenwiki.blobs_cluster25: 1 client is using or hasn't closed the table properly ``` After start: ``` 160425 18:12:15 mysqld_safe Starting mysqld... [18:17:27] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: fr.planet doesn't update as expected - https://phabricator.wikimedia.org/T133573#2236251 (10Dzahn) ERROR:planet.runner:IOError: [Errno 13] Permission denied: '/var/cache/planet/fr/blogger.com ^ some permission errors, caused by running the update manuall... 
[18:20:42] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:34] !log finished update RESTBase to c1d5193 [18:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:51] (03PS1) 10Dereckson: Maintenance for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285221 [18:26:09] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2236260 (10jcrespo) As expected, 0 logs of shutdown, it was a hard reset/power loss. > I will keep meanwhile the ticket open for 2 months unless something happens. [18:26:58] 06Operations, 10ops-codfw: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2236262 (10jcrespo) 05Open>03stalled a:05jcrespo>03None [18:28:12] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5137437 keys - replication_delay is 0 [18:30:18] hello! I would need a merge for Nodepool settings, that is to bring up Trusty based instances so we can test Zend5.5/ HHVM on disposable instances. A couple copy paste and s/jessie/trusty/ https://gerrit.wikimedia.org/r/#/c/285178/ I already tested it in prod but puppet keeps overriding my hack :D [18:30:44] (03CR) 10Hashar: [C: 031] "Tested in production and image boots just fine." [puppet] - 10https://gerrit.wikimedia.org/r/285178 (https://phabricator.wikimedia.org/T133203) (owner: 10Hashar) [18:31:13] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2236267 (10Pchelolo) 05Open>03Resolved a:03Pchelolo The new redirect handling code has been deployed in production. 
[18:32:42] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#2236275 (10BBlack) p:05Normal>03Low tbh I'm not sure, maybe leave it open as low-prio. It is technically a bug (in the responses of our pdns-b... [18:32:50] (03PS3) 10Legoktm: Set up UploadsLink on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281651 (https://phabricator.wikimedia.org/T131844) (owner: 10Rillke) [18:33:05] 07Blocked-on-Operations, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality, 03releng-201516-q4: Attempt to provide a Trusty image for Nodepool - https://phabricator.wikimedia.org/T133203#2236277 (10hashar) **Summary** Provisionnent a Trusty image was more or less straight... [18:34:04] jouncebot: next [18:34:04] In 1 hour(s) and 25 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160425T2000) [18:34:24] (03PS2) 10BBlack: Switch mobile hostnames to text IP [dns] - 10https://gerrit.wikimedia.org/r/283364 (https://phabricator.wikimedia.org/T124482) [18:34:47] there's never going to be a perfectly-ideal time in terms of isolating graph effects and all that... [18:34:54] but now seems as good as any! [18:35:31] heheh good luck bblack ! [18:35:51] bblack: or you could move them in batches like we do for mw train ? :-D [18:36:01] good luck [18:36:27] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: fr.planet doesn't update as expected - https://phabricator.wikimedia.org/T133573#2236285 (10Dereckson) Could that be a blacklist from wordpress.com side? The feed is a 200 OK from my side. [18:37:09] hashar: it's pretty low risk. from the internal side of things the two IPs are already identical in virtually every respect. 
[18:38:03] hashar: but there's an effect for SPDY/H2 clients, in that now if they make serial requests to mobile and desktop hostnames (e.g. mobile logins use desktop login URLs, desktop->mobile redirects for mobile browsers, etc...), they can coalesce them on the same connection instead of making a second separate one [18:38:16] (03CR) 10Legoktm: [C: 032] Set up UploadsLink on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281651 (https://phabricator.wikimedia.org/T131844) (owner: 10Rillke) [18:38:24] mutante: if we look the left column of https://fr.planet.wikimedia.org/ all *.wordpress.com feeds are dead [18:38:53] (03Merged) 10jenkins-bot: Set up UploadsLink on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281651 (https://phabricator.wikimedia.org/T131844) (owner: 10Rillke) [18:38:56] !log switching mobile hostnames to text IPs in DNS (10 min TTL) [18:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:13] Dereckson: getting Error 500 from all wordpress.com feeds indeed [18:39:20] (03CR) 10BBlack: [C: 032] Switch mobile hostnames to text IP [dns] - 10https://gerrit.wikimedia.org/r/283364 (https://phabricator.wikimedia.org/T124482) (owner: 10BBlack) [18:39:49] bblack: neat ;-}  should make redirection slightly faster assuming the mobile browser supports the fancy protocols. Great! 
[18:40:59] !log legoktm@tin Synchronized wmf-config: ⓛⓐⓑⓢ-ⓞⓝⓛⓨ, ⓝⓞ-ⓞⓟ (duration: 00m 40s) [18:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:24] mutante: Furthermore, we are back in time, there are more recent posts on http://feeds2.feedburner.com/WikimediaFrance [18:41:25] nice log message [18:41:47] hehe, I had to join the party :P [18:41:49] * hashar wonders whether ElasticSearch SAL would match it when looking for "noop" [18:42:04] probably not :) [18:42:14] no, there is a bug for that :) [18:42:18] unicode chars not searchable [18:42:27] just saw it when searching phab for emoji [18:46:08] legoktm: bblack: FixedSys Excelsior doesn't render these characters very well, trying too hard to be fixed width: https://ysul.nasqueron.org/~dereckson/tmp/LabsNoOpInFixedSysExcelsior.png [18:46:48] Dereckson: that's pretty similar to how it looked in my terminal [18:47:38] Chrome search matches ⁵ and 5. [18:48:16] Dereckson: so 2 issues, one affected only "fr" but that should be fixed. and wordpress.com affects all. also the "en" feed [18:48:38] i would say they blocked us.. but Error 500 ? [18:49:01] they should at least use a 4xx [18:49:49] blogspot.com f.e. is just fine [18:50:17] define "us", I've checked on terbium we can get the feed with a 200 through curl using webproxy.eqiad.wmnet:8080 [18:50:35] in this case "us" = labs .. i think [18:50:37] perhaps they blocked the user agent/ip combo? [18:50:49] the labs IP (range) ? [18:51:01] but either way, serving a 500 is bad? [18:51:09] Oh, planet runs on labs, not prod, okay. [18:51:34] Yeah, 403 would be better. [18:51:40] no, no, sorry. it runs in prod. i'm testing on planet1001 and i [18:51:46] i get the 500 there [18:54:05] Dereckson: how are you testing on terbium? [18:54:11] curl?
[18:54:11] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.72 ms [18:54:14] yes [18:54:26] `curl -x webproxy.eqiad.wmnet:8080 "https://darkoneko.wordpress.com/category/wikipedia/feed/atom/"` works. [18:54:59] (03PS1) 10Hashar: contint: drop integration/{phpcs,phpunit} [puppet] - 10https://gerrit.wikimedia.org/r/285226 [18:55:36] Dereckson: it works on planet1001 too.. hmm [18:55:53] (03PS1) 10BBlack: mobile IP DNS decom [dns] - 10https://gerrit.wikimedia.org/r/285227 (https://phabricator.wikimedia.org/T124482) [18:56:03] (03CR) 10jenkins-bot: [V: 04-1] contint: drop integration/{phpcs,phpunit} [puppet] - 10https://gerrit.wikimedia.org/r/285226 (owner: 10Hashar) [18:56:18] (03CR) 10BBlack: [C: 04-1] "Staged on hold for later, see commitmsg." [dns] - 10https://gerrit.wikimedia.org/r/285227 (https://phabricator.wikimedia.org/T124482) (owner: 10BBlack) [18:56:19] also when replacing webproxy with url-downloader, which planet uses [18:56:48] UserAgent becomes more likely [18:57:11] (03PS2) 10Hashar: contint: drop integration/{phpcs,phpunit} [puppet] - 10https://gerrit.wikimedia.org/r/285226 [18:57:35] (03CR) 10Hashar: [C: 031] "I can't find any jobs still using them beside the integration-phpunit jobs. Cherry picked on puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/285226 (owner: 10Hashar) [19:00:54] (03PS1) 10BBlack: decom mobile IPs from LVS/caches [puppet] - 10https://gerrit.wikimedia.org/r/285229 (https://phabricator.wikimedia.org/T124482) [19:01:09] (03CR) 10BBlack: [C: 04-1] decom mobile IPs from LVS/caches [puppet] - 10https://gerrit.wikimedia.org/r/285229 (https://phabricator.wikimedia.org/T124482) (owner: 10BBlack) [19:01:14] Dereckson: https://github.com/rubys/venus/issues/29 ? 
but why the recent change then [19:01:23] bbiaw [19:01:46] code hasn't been updated for 5 years [19:01:57] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566#2236346 (10Smalyshev) Planned sequence: # (day before) Send email to the wikidata list # Take wdq1001 out of varnish config # Shut down and reimage wdq1001... [19:02:20] but on the Automattic side, they could have recently hardened their wordpress.com service. [19:02:23] Perhaps reach them for assistance? Contact information is on https://automattic.com/contact/. [19:02:33] Dereckson: so wordpress.com decided to throw out the Python-httplib clients maybe [19:02:41] even though that bug claims they already did in 2015 [19:02:42] 2014 [19:02:51] yes [19:03:16] (03CR) 10jenkins-bot: [V: 04-1] decom mobile IPs from LVS/caches [puppet] - 10https://gerrit.wikimedia.org/r/285229 (https://phabricator.wikimedia.org/T124482) (owner: 10BBlack) [19:06:32] en-planet.log:ERROR:planet.runner:Error 404 while updating feed http://blog.wikimedia.org.au/category/wikimedia/feed/ [19:06:48] lol, our own blog URL is wrong/changed [19:07:00] no. .org.au ! damnit [19:08:37] !log restarting elasticsearch server elastic2004.codfw.wmnet - activating unicast (T110236) [19:08:38] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:08] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: fr.planet doesn't update as expected - https://phabricator.wikimedia.org/T133573#2236411 (10Dzahn) It really feels like a blacklist on UserAgent from the wordpress.com side, yes. 
Also see unrelated errors on T133577 [19:10:36] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: fr.planet doesn't update as expected - https://phabricator.wikimedia.org/T133573#2236420 (10Dzahn) https://github.com/rubys/venus/issues/29 [19:18:35] (03PS1) 10Dereckson: Maintenance for zh.planet [puppet] - 10https://gerrit.wikimedia.org/r/285234 (https://phabricator.wikimedia.org/T133577) [19:18:41] (03PS6) 10BBlack: cache_maps: re-role old mobile servers [puppet] - 10https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162) [19:23:31] [19:23:49] This is still the link offered by their Wordpress instance. [19:24:12] 06Operations, 13Patch-For-Review: make network saturation alert on labstore1003 sane - https://phabricator.wikimedia.org/T126237#2236476 (10chasemp) 05Open>03Resolved a:03chasemp [19:24:14] 06Operations, 06Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2236478 (10chasemp) [19:24:18] (03PS1) 10Hashar: contint: decouple slave_scripts and composer [puppet] - 10https://gerrit.wikimedia.org/r/285235 [19:26:22] And http://blog.wikimedia.org.au/index.php/?feed=atom redirects to blog.wikimedia.org.au/feed/atom/. We can't do anything for them from a config point of view. 
[19:26:25] (03PS2) 10Hashar: contint: decouple slave_scripts and composer [puppet] - 10https://gerrit.wikimedia.org/r/285235 (https://phabricator.wikimedia.org/T128092) [19:27:28] !log start configuration of new maps caching servers (T109162) [19:27:29] T109162: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162 [19:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:52] !log disabling puppet on cp1043/cp1044 while reconfiguring Maps caching servers [19:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:39] (03CR) 10Gehel: [C: 032] cache_maps: re-role old mobile servers [puppet] - 10https://gerrit.wikimedia.org/r/268236 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [19:31:44] 07Blocked-on-Operations, 10OOjs-UI, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Provide composer on the nodepool servers so OOjs UI can use it in the npm job - https://phabricator.wikimedia.org/T128092#2236550 (10hashar) Pending merge in puppet.git of the following... 
[19:32:56] (03CR) 10Ori.livneh: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [19:34:10] (03PS2) 10Ori.livneh: nodepool: pool in 2 TRUSTY instances [puppet] - 10https://gerrit.wikimedia.org/r/285178 (https://phabricator.wikimedia.org/T133203) (owner: 10Hashar) [19:34:19] (03CR) 10Ori.livneh: [C: 032 V: 032] nodepool: pool in 2 TRUSTY instances [puppet] - 10https://gerrit.wikimedia.org/r/285178 (https://phabricator.wikimedia.org/T133203) (owner: 10Hashar) [19:36:12] (03PS2) 10Dzahn: Maintenance for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285221 (owner: 10Dereckson) [19:36:18] (03CR) 10Dzahn: [C: 032] Maintenance for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285221 (owner: 10Dereckson) [19:38:12] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:38:12] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:38:42] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures [19:39:12] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 2 failures [19:39:13] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 2 failures [19:39:50] !log Nodepool now booting 2 trusty instances for CI needs. T133203 [19:39:51] T133203: Attempt to provide a Trusty image for Nodepool - https://phabricator.wikimedia.org/T133203 [19:39:52] (03PS1) 10Dereckson: Maintenance for en.planet [puppet] - 10https://gerrit.wikimedia.org/r/285239 [19:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:03] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:41:02] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:41:52] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures [19:42:03] is anyone looking at citoid? [19:42:25] ignore cpXXXX puppet alerts, that's ongoing work with gehel+I, sorry for the spam [19:42:38] (03PS1) 10Dereckson: Maintenance for es.planet [puppet] - 10https://gerrit.wikimedia.org/r/285240 (https://phabricator.wikimedia.org/T133577) [19:42:52] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 2 failures [19:43:32] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [19:44:32] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [19:44:33] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [19:45:03] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 2 failures [19:45:12] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 2 failures [19:45:36] (03PS1) 10BBlack: fixup maps cache routing: T109162 [puppet] - 10https://gerrit.wikimedia.org/r/285241 [19:45:42] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 2 failures [19:45:43] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 2 failures [19:45:43] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures [19:45:48] 07Blocked-on-Operations, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality, 03releng-201516-q4: Attempt to provide a Trusty image for Nodepool - https://phabricator.wikimedia.org/T133203#2236625 (10hashar) 05Open>03Resolved Nodepool processed: ``` 2016-04-25 19:41 INFO... 
[19:46:02] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 2 failures [19:46:11] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 2 failures [19:46:12] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 140 connecting: cp3015_v4, cp3015_v6, cp3016_v4, cp3016_v6, cp3017_v4, cp3017_v6, cp3018_v4, cp3018_v6 [19:46:21] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures [19:46:22] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [19:46:23] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [19:46:33] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures [19:46:51] (03CR) 10Gehel: [C: 032] fixup maps cache routing: T109162 [puppet] - 10https://gerrit.wikimedia.org/r/285241 (owner: 10BBlack) [19:47:21] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [19:48:11] (03CR) 10Gehel: [V: 032] fixup maps cache routing: T109162 [puppet] - 10https://gerrit.wikimedia.org/r/285241 (owner: 10BBlack) [19:48:12] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 140 connecting: cp3015_v4, cp3015_v6, cp3016_v4, cp3016_v6, cp3017_v4, cp3017_v6, cp3018_v4, cp3018_v6 [19:48:26] ipsec is gehel and I too, sorry :) [19:49:30] !log restarting elasticsearch server elastic2005.codfw.wmnet - activating unicast (T110236) [19:49:31] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:55] (03PS1) 10Dereckson: Add Walaa Abdel Manaem blog to ar.planet [puppet] - 10https://gerrit.wikimedia.org/r/285243 [19:50:27] (03PS3) 10Dzahn: Maintenance for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285221 (owner: 10Dereckson) [19:51:01] (03PS1) 10ArielGlenn: pass through verbose option from caller to XmlDump jobs [dumps] - 
10https://gerrit.wikimedia.org/r/285244 [19:52:22] I don't know if I should see pages for these criticals but I am not. just saying. [19:52:28] (03PS1) 10BBlack: esams cache_maps puppetization fixup [puppet] - 10https://gerrit.wikimedia.org/r/285245 (https://phabricator.wikimedia.org/T109162) [19:53:20] (03PS3) 10Hashar: contint: decouple slave_scripts and composer [puppet] - 10https://gerrit.wikimedia.org/r/285235 (https://phabricator.wikimedia.org/T128092) [19:53:22] (03PS3) 10Hashar: contint: drop integration/{phpcs,phpunit} [puppet] - 10https://gerrit.wikimedia.org/r/285226 [19:53:26] Luke081515: to answer your 14:52 UTC question, j.zerebecki started the SWAT 20 minutes earlier as 10 patches were scheduled. [19:53:31] apergos: the cpNNNN + ipsec ones, no [19:53:44] bblack, gehel, there was some conv about spdy vs http/2. Will maps use http/2 from the start? [19:53:44] yeah those are the ones I was curious about [19:53:46] ok thx [19:53:50] yurik: no [19:54:14] apergos: for the cache clusters, we have a lot of CRITICALs defined because they really shouldn't happen by surprise and something needs fixing [19:54:27] but usually the configuration is resilient in the face of them from a user-facing POV, so they don't page [19:54:34] gotcha [19:54:40] (03PS1) 10Hashar: nodepool: trusty label was overriding jessie one [puppet] - 10https://gerrit.wikimedia.org/r/285247 [19:54:51] Dereckson: Ah, ok :) [19:54:58] at this point I'm not really watching the channel, I just happened to peek in and was a bit surprised [19:54:58] (03CR) 10Gehel: [C: 032 V: 032] esams cache_maps puppetization fixup [puppet] - 10https://gerrit.wikimedia.org/r/285245 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [19:55:15] (03CR) 10Hashar: nodepool: trusty label was overriding jessie one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285247 (owner: 10Hashar) [19:55:21] (03PS2) 10Hashar: nodepool: trusty label was overriding jessie one [puppet] - 
10https://gerrit.wikimedia.org/r/285247 [19:57:31] (03CR) 10Nuria: Read values inbound in X-Analytics header (pageview and preview) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [19:57:35] (03CR) 10Ori.livneh: [C: 032 V: 032] nodepool: trusty label was overriding jessie one [puppet] - 10https://gerrit.wikimedia.org/r/285247 (owner: 10Hashar) [19:59:21] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [20:00:02] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:05] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160425T2000). Please do the needful. [20:00:56] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2236678 (10Papaul) [20:01:06] !log deploying parsoid version d5363193 [20:01:06] 06Operations, 10DBA: Gerrit 285208 broke eventlogging_sync.sh - https://phabricator.wikimedia.org/T133588#2236680 (10Volans) [20:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:21] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [20:02:58] (03PS4) 10Dzahn: Maintenance for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285221 (owner: 10Dereckson) [20:03:16] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 140 connecting: cp3015_v4, cp3015_v6, cp3016_v4, cp3016_v6, cp3017_v4, cp3017_v6, cp3018_v4, cp3018_v6 [20:03:37] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 140 connecting: cp3015_v4, cp3015_v6, cp3016_v4, cp3016_v6, cp3017_v4, cp3017_v6, cp3018_v4, cp3018_v6 [20:04:10] !log synced new code and restarted parsoid on wtp1001 as a canary [20:04:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:57] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 140 connecting: cp3015_v4, cp3015_v6, cp3016_v4, cp3016_v6, cp3017_v4, cp3017_v6, cp3018_v4, cp3018_v6 [20:05:16] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [20:06:17] PROBLEM - traffic-pool service on cp1047 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:06:17] PROBLEM - Varnishkafka log producer on cp1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:06:17] PROBLEM - traffic-pool service on cp1060 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:06:17] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp2003 is CRITICAL: Connection refused [20:06:17] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp2009 is CRITICAL: Connection refused [20:06:17] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3004 is CRITICAL: Connection refused [20:06:17] PROBLEM - traffic-pool service on cp2021 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:06:47] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp2003 is CRITICAL: Connection refused [20:06:55] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp3004 is CRITICAL: Connection refused [20:06:55] PROBLEM - Varnishkafka log producer on cp3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:06:55] PROBLEM - Varnishkafka log producer on cp4011 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:07:15] PROBLEM - Freshness of OCSP Stapling files on cp1046 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! 
[20:07:15] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp2021 is CRITICAL: Connection refused [20:07:16] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp4012 is CRITICAL: Connection refused [20:07:16] PROBLEM - Freshness of OCSP Stapling files on cp4011 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! [20:07:16] PROBLEM - traffic-pool service on cp4020 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:07:36] PROBLEM - Varnishkafka log producer on cp1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:07:36] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp2021 is CRITICAL: Connection refused [20:07:36] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp4012 is CRITICAL: Connection refused [20:07:36] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3006 is CRITICAL: Connection refused [20:07:45] (03PS1) 10Volans: Revert "mariadb: don't spawn el_sync.sh via sudo -u root" [puppet] - 10https://gerrit.wikimedia.org/r/285249 (https://phabricator.wikimedia.org/T133588) [20:07:55] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp1047 is CRITICAL: Connection refused [20:07:55] PROBLEM - Freshness of OCSP Stapling files on cp1059 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! 
[20:07:56] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp3006 is CRITICAL: Connection refused [20:07:56] !log finished deploying parsoid version d5363193 [20:07:56] PROBLEM - Varnishkafka log producer on cp2015 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:07:56] PROBLEM - Varnishkafka log producer on cp4019 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:05] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp1060 is CRITICAL: Connection refused [20:08:05] PROBLEM - Varnishkafka log producer on cp2009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:08:06] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3003 is CRITICAL: Connection refused [20:08:06] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp4020 is CRITICAL: Connection refused [20:08:06] PROBLEM - traffic-pool service on cp4011 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:08:26] PROBLEM - traffic-pool service on cp1046 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:08:27] PROBLEM - traffic-pool service on cp1059 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:08:27] PROBLEM - Varnishkafka log producer on cp2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:08:27] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp3003 is CRITICAL: Connection refused [20:08:35] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp4020 is CRITICAL: Connection refused [20:08:35] PROBLEM - Varnishkafka log producer on cp3004 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:08:46] PROBLEM - Freshness of OCSP Stapling files on cp2003 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! 
[20:08:47] PROBLEM - Freshness of OCSP Stapling files on cp3004 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! [20:08:47] PROBLEM - traffic-pool service on cp4019 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:09:06] PROBLEM - Varnishkafka log producer on cp2021 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:09:07] PROBLEM - Varnishkafka log producer on cp4012 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:09:26] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp1046 is CRITICAL: Connection refused [20:09:26] PROBLEM - Varnishkafka log producer on cp1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:09:26] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp4011 is CRITICAL: Connection refused [20:09:26] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp3005 is CRITICAL: Connection refused [20:09:26] PROBLEM - Varnishkafka log producer on cp3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:09:43] 06Operations, 10DBA, 13Patch-For-Review: Gerrit 285208 broke eventlogging_sync.sh - https://phabricator.wikimedia.org/T133588#2236757 (10Volans) p:05Triage>03Unbreak! a:03Volans I'm reverting the fix right now to make it work again, then we can sync to do a proper fix. [20:09:46] PROBLEM - Freshness of OCSP Stapling files on cp1047 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! 
[20:09:46] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp3005 is CRITICAL: Connection refused [20:09:46] PROBLEM - Varnishkafka log producer on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:09:46] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp4011 is CRITICAL: Connection refused [20:09:46] PROBLEM - Freshness of OCSP Stapling files on cp2021 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! [20:09:46] PROBLEM - traffic-pool service on cp2015 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:09:46] PROBLEM - Freshness of OCSP Stapling files on cp4012 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! [20:10:05] PROBLEM - Freshness of OCSP Stapling files on cp1060 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old! [20:10:05] PROBLEM - traffic-pool service on cp2003 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:10:05] PROBLEM - traffic-pool service on cp2009 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:10:06] PROBLEM - Varnishkafka log producer on cp3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:10:06] PROBLEM - Varnishkafka log producer on cp4020 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [20:10:16] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp1059 is CRITICAL: Connection refused [20:10:16] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp2015 is CRITICAL: Connection refused [20:10:17] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp4019 is CRITICAL: Connection refused [20:10:36] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp2009 is CRITICAL: Connection refused [20:10:36] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp2015 is CRITICAL: Connection refused [20:10:36] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp4019 is 
CRITICAL: Connection refused [20:10:36] PROBLEM - traffic-pool service on cp4012 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [20:10:57] (03CR) 10Volans: [C: 032] "Reverting for now to unbreak it." [puppet] - 10https://gerrit.wikimedia.org/r/285249 (https://phabricator.wikimedia.org/T133588) (owner: 10Volans) [20:11:05] RECOVERY - Freshness of OCSP Stapling files on cp3004 is OK: OK [20:11:27] (03PS1) 10BBlack: cache_maps route table: use codfw only for direct [puppet] - 10https://gerrit.wikimedia.org/r/285250 [20:12:42] (03CR) 10Gehel: [C: 032 V: 032] cache_maps route table: use codfw only for direct [puppet] - 10https://gerrit.wikimedia.org/r/285250 (owner: 10BBlack) [20:14:05] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:15:27] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: kafka1013_v4,kafka1013_v6,kafka1014_v4,kafka1014_v6,kafka1018_v4,kafka1018_v6 [20:15:27] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:16:15] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:16:15] RECOVERY - Freshness of OCSP Stapling files on cp2021 is OK: OK [20:16:36] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: kafka1013_v4,kafka1013_v6,kafka1014_v4,kafka1014_v6,kafka1018_v4,kafka1018_v6 [20:17:22] (03PS5) 10Dzahn: Maintenance for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285221 (owner: 10Dereckson) [20:17:29] (03CR) 10Dzahn: [V: 032] Maintenance for fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285221 (owner: 10Dereckson) [20:18:07] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: kafka1013_v4,kafka1013_v6,kafka1014_v4,kafka1014_v6,kafka1018_v4,kafka1018_v6 [20:18:22] (03PS2) 10Dzahn: Maintenance for zh.planet [puppet] - 10https://gerrit.wikimedia.org/r/285234 (https://phabricator.wikimedia.org/T133577) (owner: 10Dereckson) [20:18:23] 06Operations, 10DBA, 13Patch-For-Review: Gerrit 285208 broke eventlogging_sync.sh - https://phabricator.wikimedia.org/T133588#2236794 (10Volans) p:05Unbreak!>03Low I've merged the revert and killed the process in the infinite loop on `db1047` and `dbstore1002` that I think are the only 2 server where thi... 
[20:18:25] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [20:18:25] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [20:18:31] (03CR) 10Dzahn: [C: 032] Maintenance for zh.planet [puppet] - 10https://gerrit.wikimedia.org/r/285234 (https://phabricator.wikimedia.org/T133577) (owner: 10Dereckson) [20:19:16] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 24 not-conn: kafka1013_v4,kafka1013_v6,kafka1018_v4,kafka1018_v6 [20:19:46] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [20:19:56] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [20:20:27] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [20:20:45] (03PS3) 10Hashar: hhvm: allow passing service parameters [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) [20:20:51] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2236837 (10hashar) Pending review and merge of puppet patches by #operations https://gerrit.wikimedia.org/r/#/c/269946/ https://gerrit.wikim... [20:20:56] (03PS3) 10Hashar: contint: disable HHVM background service [puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) [20:20:57] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [20:20:57] 06Operations, 10DBA, 13Patch-For-Review: Gerrit 285208 broke eventlogging_sync.sh - https://phabricator.wikimedia.org/T133588#2236839 (10Volans) a:05Volans>03None [20:21:23] (03CR) 10Hashar: "Rebased. Still applied on integration puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [20:21:26] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [20:21:28] (03CR) 10Hashar: "Rebased. Still applied on integration puppet master." 
[puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [20:22:05] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [20:22:05] PROBLEM - eventlogging_sync processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [20:22:08] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2236841 (10hashar) [20:22:16] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [20:22:27] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [20:22:35] PROBLEM - eventlogging_sync processes on dbstore1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [20:22:57] RECOVERY - Freshness of OCSP Stapling files on cp1060 is OK: OK [20:23:17] (03PS18) 10Hashar: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [20:23:34] (03CR) 10Hashar: "Amended to reference T126699" [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [20:23:41] (03CR) 10jenkins-bot: [V: 04-1] contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [20:24:45] RECOVERY - eventlogging_sync processes on dbstore1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [20:24:56] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp4011 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.155 second response time [20:24:56] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp3005 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.187 second response time [20:24:57] 
RECOVERY - Freshness of OCSP Stapling files on cp4012 is OK: OK [20:24:57] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp2021 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.078 second response time [20:24:58] (03PS2) 10Dzahn: Maintenance for en.planet [puppet] - 10https://gerrit.wikimedia.org/r/285239 (owner: 10Dereckson) [20:25:03] (03CR) 10Dzahn: [C: 032] Maintenance for en.planet [puppet] - 10https://gerrit.wikimedia.org/r/285239 (owner: 10Dereckson) [20:25:05] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp4012 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.151 second response time [20:25:16] RECOVERY - Freshness of OCSP Stapling files on cp1059 is OK: OK [20:25:25] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.198 second response time [20:25:31] (03PS19) 10Hashar: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [20:25:55] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.083 second response time [20:25:55] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp4019 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.154 second response time [20:25:56] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.076 second response time [20:25:56] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.175 second response time [20:25:57] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp4020 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.151 second response time [20:25:57] (03CR) 10Hashar: [C: 031] "Fixed trivial conflict and cherry picked again on integration puppet master." 
[puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [20:26:14] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2236845 (10Papaul) [20:26:26] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.077 second response time [20:26:26] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp3004 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.169 second response time [20:26:56] RECOVERY - Freshness of OCSP Stapling files on cp4011 is OK: OK [20:27:57] (03PS2) 10Dzahn: Maintenance for es.planet [puppet] - 10https://gerrit.wikimedia.org/r/285240 (https://phabricator.wikimedia.org/T133577) (owner: 10Dereckson) [20:28:08] (03CR) 10Dzahn: [C: 032 V: 032] Maintenance for es.planet [puppet] - 10https://gerrit.wikimedia.org/r/285240 (https://phabricator.wikimedia.org/T133577) (owner: 10Dereckson) [20:28:45] RECOVERY - eventlogging_sync processes on db1047 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [20:29:20] 07Blocked-on-Operations, 07Puppet, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699#2236856 (10hashar) *summary* This task has a long history, the puppet patc... [20:30:16] (03CR) 10Hashar: "This has a somehow complicated history across T96230 and T126699. I wrote a quick summary on the later." [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [20:32:54] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:33:00] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Make /usr/bin/php a wrapper that picks the right PHP version on CI slaves - https://phabricator.wikimedia.org/T126211#2236861 (10hashar) 05Resolved>03Open We still need the puppet patch https://gerrit.wikimedia.org/r/#/... [20:33:38] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Make /usr/bin/php a wrapper that picks the right PHP version on CI slaves - https://phabricator.wikimedia.org/T126211#2007557 (10hashar) a:05Legoktm>03None [20:33:54] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Disable HHVM fcgi server on CI slaves - https://phabricator.wikimedia.org/T126594#2236867 (10hashar) a:05hashar>03None [20:34:04] PROBLEM - confd service on cp1043 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [20:34:05] RECOVERY - Freshness of OCSP Stapling files on cp1047 is OK: OK [20:34:14] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:34:45] RECOVERY - Varnishkafka log producer on cp4012 is OK: PROCS OK: 1 process with command name varnishkafka [20:34:46] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:15] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:35:18] (03CR) 10Hashar: "Patch was still on the integration puppet master. I have removed it." 
[puppet] - 10https://gerrit.wikimedia.org/r/269095 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [20:35:25] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [20:35:34] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp4012 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.151 second response time [20:35:35] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.013 second response time [20:35:37] !log restarting elasticsearch server elastic2006.codfw.wmnet - activating unicast (T110236) [20:35:38] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:45] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:46] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:35:54] RECOVERY - Varnishkafka log producer on cp2009 is OK: PROCS OK: 1 process with command name varnishkafka [20:35:55] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp4011 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.153 second response time [20:36:04] (03PS8) 10Hashar: contint: rsync server to hold jobs caches [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) [20:36:04] RECOVERY - Varnishkafka log producer on cp1046 is OK: PROCS OK: 1 process with command name varnishkafka [20:36:04] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3004 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.189 second response time [20:36:04] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:36:04] RECOVERY - Varnishkafka log producer on cp3006 is OK: PROCS OK: 1 process with command 
name varnishkafka [20:36:05] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:36:05] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.073 second response time [20:36:05] RECOVERY - Varnishkafka log producer on cp1060 is OK: PROCS OK: 1 process with command name varnishkafka [20:36:06] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp2015 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.079 second response time [20:36:14] RECOVERY - Varnishkafka log producer on cp2021 is OK: PROCS OK: 1 process with command name varnishkafka [20:36:15] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp4020 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.160 second response time [20:36:24] RECOVERY - Varnishkafka log producer on cp1059 is OK: PROCS OK: 1 process with command name varnishkafka [20:36:25] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.184 second response time [20:36:25] RECOVERY - Freshness of OCSP Stapling files on cp1046 is OK: OK [20:36:25] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:36:45] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.073 second response time [20:36:45] RECOVERY - Freshness of OCSP Stapling files on cp2003 is OK: OK [20:36:46] RECOVERY - Varnishkafka log producer on cp1047 is OK: PROCS OK: 1 process with command name varnishkafka [20:36:46] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp2021 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.076 second response time [20:36:46] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:36:46] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [20:36:46] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:36:47] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:36:54] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp1047 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.002 second response time [20:36:55] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp4019 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.149 second response time [20:36:55] RECOVERY - Varnishkafka log producer on cp4011 is OK: PROCS OK: 1 process with command name varnishkafka [20:36:55] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:36:56] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:36:56] RECOVERY - Varnishkafka log producer on cp3003 is OK: PROCS OK: 1 process with command name varnishkafka [20:36:56] RECOVERY - Varnishkafka log producer on cp4020 is OK: PROCS OK: 1 process with command name varnishkafka [20:37:04] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.201 second response time [20:37:04] RECOVERY - Varnishkafka log producer on cp3005 is OK: PROCS OK: 1 process with command name varnishkafka [20:37:15] RECOVERY - Varnishkafka log producer on cp2015 is OK: PROCS OK: 1 process with command name varnishkafka [20:37:15] RECOVERY - Varnishkafka log producer on cp2003 is OK: PROCS OK: 1 process with command name varnishkafka [20:37:16] RECOVERY - Varnishkafka log producer on cp4019 is OK: PROCS OK: 1 process with command name varnishkafka [20:37:16] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:37:16] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 
0 failures [20:37:24] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:37:25] PROBLEM - confd service on cp1044 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [20:37:25] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:37:25] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.012 second response time [20:37:34] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp1059 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.002 second response time [20:37:35] RECOVERY - Varnishkafka log producer on cp3004 is OK: PROCS OK: 1 process with command name varnishkafka [20:37:45] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp3005 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.171 second response time [20:38:52] (03PS2) 10Yuvipanda: dnsmasq: drop beta cluster mobile public IP [puppet] - 10https://gerrit.wikimedia.org/r/283987 (https://phabricator.wikimedia.org/T130473) (owner: 10Hashar) [20:39:41] YuviPanda: maybe that whole dos mass aliasing can be dropped now. No idea how it is handled now [20:46:54] !log deployed updated patch for T98313 to include very headers [20:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:51] hashar: yes, the whole thing can be dropped now, it's handled in a totally different place now, and is automatic [20:50:22] YuviPanda: feel free to drop my patch and clean up the dnsmasq leftover stuff so ;-) [20:50:42] sleeping time *wave* [20:51:02] (03PS2) 10Dzahn: Add Walaa Abdel Manaem blog to ar.planet [puppet] - 10https://gerrit.wikimedia.org/r/285243 (owner: 10Dereckson) [20:51:17] hashar: bye! 
[20:52:10] (03CR) 10Dzahn: [C: 032] Add Walaa Abdel Manaem blog to ar.planet [puppet] - 10https://gerrit.wikimedia.org/r/285243 (owner: 10Dereckson) [20:52:20] (03CR) 10Dzahn: [V: 032] Add Walaa Abdel Manaem blog to ar.planet [puppet] - 10https://gerrit.wikimedia.org/r/285243 (owner: 10Dereckson) [20:52:35] (03CR) 10Dzahn: "yea, content is relevant" [puppet] - 10https://gerrit.wikimedia.org/r/285243 (owner: 10Dereckson) [20:54:08] (03PS1) 10Papaul: DNS: Adding mgmt DNS entries for restbase200[7-9] Bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285280 (https://phabricator.wikimedia.org/T132976) [20:58:44] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [20:58:44] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [20:59:25] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [21:00:44] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [21:02:40] !log updated OCG to version 58a720508deb368abfb7652e6a8c7225f95402d2 [21:02:43] !log adding cp10(46|47|59|60)\.eqiad\.wmnet to maps caching cluster (T109162) [21:02:44] T109162: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162 [21:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:51] (03PS3) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [21:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:50] (03CR) 10Eevans: [C: 04-1] "-1'ing because this is not yet ready" [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [21:07:35] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:08:04] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:08:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:10:14] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [21:10:14] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [21:10:14] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [21:10:59] (03PS1) 10BBlack: cache_maps: whole cluster to varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/285282 (https://phabricator.wikimedia.org/T122880) [21:12:05] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [21:15:14] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 144 seconds ago with 0 failures [21:16:22] (03PS2) 10Ottomata: [WIP] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [21:18:02] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: whole cluster to varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/285282 (https://phabricator.wikimedia.org/T122880) (owner: 10BBlack) [21:19:24] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:19:24] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:21:25] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [21:22:04] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 4 failures [21:23:35] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [21:24:46] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp1060 is CRITICAL: Connection refused [21:25:34] (03CR) 1020after4: ores: Scap3 deployment configurations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [21:25:44] PROBLEM - Varnishkafka log producer on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [21:26:35] PROBLEM - Varnish HTTP maps-backend - port 3128 on cp1060 is CRITICAL: Connection refused [21:26:45] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:27:55] RECOVERY - Varnishkafka log producer on cp1060 is OK: PROCS OK: 1 process with command name varnishkafka [21:28:45] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:28:55] RECOVERY - Varnish HTTP maps-backend - port 3128 on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 152 bytes in 0.034 second response time [21:29:25] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 151 bytes in 0.005 second response time [21:35:11] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:36:05] (03PS2) 10Dzahn: DNS: Adding mgmt DNS entries for restbase200[7-9] Bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285280 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [21:36:11] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [21:36:39] (03CR) 10Dzahn: [C: 032] DNS: Adding mgmt DNS entries for restbase200[7-9] Bug: T132976 [dns] - 10https://gerrit.wikimedia.org/r/285280 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [21:37:11] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:39:37] (03PS46) 10Ladsgroup: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 [21:40:56] (03CR) 10jenkins-bot: [V: 04-1] ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [21:41:31] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [21:41:31] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [21:52:11] (03PS1) 10Alex Monk: Point dyanmicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) [21:53:42] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:54:10] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 12 failures [21:54:11] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:55:01] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:55:01] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:57:10] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [21:57:10] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [21:58:06] (03Abandoned) 1020after4: Phabricator: support systemd as well as upstart. [puppet] - 10https://gerrit.wikimedia.org/r/274488 (owner: 1020after4) [21:58:10] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [21:58:31] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [22:01:52] (03PS23) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [22:10:49] (03PS1) 10Dzahn: ganglia: on systemd only start service if site is monitored site [puppet] - 10https://gerrit.wikimedia.org/r/285295 [22:15:26] (03Abandoned) 10Dzahn: ganglia: remove service notify for aggregators [puppet] - 10https://gerrit.wikimedia.org/r/285212 (owner: 10Dzahn) [22:17:47] (03PS3) 10Dzahn: Add cron job to synchronise AEAD files from primary auth server [puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [22:19:28] (03Abandoned) 10Bartosz Dziewoński: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [22:20:00] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:21:46] (03CR) 10Dzahn: "wouldn't you need an rsyncd on the other side and speak rsync:// for this to work? or does it have a key to do ssh auth?" 
[puppet] - 10https://gerrit.wikimedia.org/r/285160 (owner: 10Muehlenhoff) [22:22:00] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 73019 bytes in 0.167 second response time [22:24:05] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2237213 (10BBlack) Noticed doing v3 -> v4 upgrades on the new cache_misc hosts, I had to do the following to effect the upgrades: 1. Disable puppet on affected node... [22:24:52] (03PS2) 10Dzahn: ganglia: on systemd only start service if site is monitored site [puppet] - 10https://gerrit.wikimedia.org/r/285295 [22:25:58] (03CR) 10Dzahn: [C: 032] ganglia: on systemd only start service if site is monitored site [puppet] - 10https://gerrit.wikimedia.org/r/285295 (owner: 10Dzahn) [22:26:31] RECOVERY - traffic-pool service on cp4012 is OK: OK - traffic-pool is active [22:26:51] RECOVERY - traffic-pool service on cp1059 is OK: OK - traffic-pool is active [22:26:51] RECOVERY - traffic-pool service on cp1047 is OK: OK - traffic-pool is active [22:27:00] RECOVERY - traffic-pool service on cp2015 is OK: OK - traffic-pool is active [22:27:01] RECOVERY - traffic-pool service on cp4011 is OK: OK - traffic-pool is active [22:27:21] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [22:27:22] RECOVERY - traffic-pool service on cp1060 is OK: OK - traffic-pool is active [22:27:38] awww, nice, compiler is a liar about "no changes" [22:27:40] RECOVERY - traffic-pool service on cp2003 is OK: OK - traffic-pool is active [22:27:41] RECOVERY - traffic-pool service on cp2009 is OK: OK - traffic-pool is active [22:27:50] RECOVERY - traffic-pool service on cp4020 is OK: OK - traffic-pool is active [22:28:10] RECOVERY - traffic-pool service on cp1046 is OK: OK - traffic-pool is active [22:28:20] RECOVERY - traffic-pool service on cp2021 is OK: OK - traffic-pool is active [22:28:21] RECOVERY - traffic-pool service on 
cp4019 is OK: OK - traffic-pool is active [22:29:41] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:29:51] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 2 failures [22:30:50] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures [22:30:50] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 2 failures [22:31:55] ^ all me, sorry, spammy day of non-user-impacting CRITs [22:32:00] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [22:32:20] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:33:01] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:33:10] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:35:21] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 2 failures [22:36:50] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 2 failures [22:37:00] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures [22:37:04] (03PS1) 10Dzahn: ganglia: fix aggregator config/service dependency [puppet] - 10https://gerrit.wikimedia.org/r/285299 [22:37:32] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:37:32] (03PS2) 10Dzahn: ganglia: fix aggregator config/service dependency [puppet] - 10https://gerrit.wikimedia.org/r/285299 [22:38:13] (03CR) 10Dzahn: [C: 032] ganglia: fix aggregator config/service dependency [puppet] - 10https://gerrit.wikimedia.org/r/285299 (owner: 10Dzahn) [22:39:02] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:39:11] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is 
currently enabled, last run 27 seconds ago with 0 failures [22:40:51] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:43:34] (03PS10) 10Dzahn: create letsencrypt module, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) [22:45:28] bblack: do you mind if i go ahead? PS8/9 ? [22:47:10] mutante: I'm lost, you mean merge the patch above, but no other ones? [22:47:30] (03PS24) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [22:48:44] bblack: yes, basically just merge it to then keep working on top of it [22:49:25] and that i noticed you amended to it [22:50:06] yeah that's fine. my dependent patch is the next thing after, so that I can really test some things, but I don't know if I'll get to it tonight [22:50:25] (03CR) 10BBlack: [C: 031] create letsencrypt module, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [22:50:40] :) thanks [22:51:19] (03CR) 10Dzahn: [C: 032] create letsencrypt module, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [22:54:42] (03PS5) 10BBlack: cache_maps: remove cp104[34] test caches [puppet] - 10https://gerrit.wikimedia.org/r/268237 (https://phabricator.wikimedia.org/T109162) [22:55:09] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: remove cp104[34] test caches [puppet] - 10https://gerrit.wikimedia.org/r/268237 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [23:00:03] (03PS25) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [23:00:05] RoanKattouw ostriches Krenair MaxSem 
Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160425T2300). [23:00:05] matt_flaschen Dereckson jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:22] present [23:00:52] bblack, gehel, are you keeping the existing maps infrastructure intact for now? it would be useful to have two setups for a bit until we are all set with the migration [23:01:29] no, and why would it be useful? [23:01:55] (03PS1) 10Dzahn: planet: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285307 [23:03:44] ok [23:03:59] Present [23:04:19] jdlrobson, do you know the security patch process? do you want me to do it? [23:04:45] Krenair: i dont im afraid. I can't remember the protocol on whether im allowed to post the phabricator task on the wiki [23:04:48] (03PS1) 10Dzahn: racktables: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285308 [23:05:22] jdlrobson, I think the task names are private. [23:05:31] so i can paste here matt_flaschen ? [23:05:34] no [23:05:40] paste the ticket number [23:05:40] also matt_flaschen do you happen to be an oversighter? [23:05:52] Krenair > https://phabricator.wikimedia.org/T132653 [23:05:55] (03PS3) 10Alex Monk: Disable Echo survey on French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284831 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:06:03] (03CR) 10Alex Monk: [C: 032] Disable Echo survey on French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284831 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:06:09] jdlrobson, no. Someone could make you one on testwiki if you need to test, though. 
[23:06:11] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2237325 (10BBlack) @Gehel and I made partial progress on this today, to resume tomorrow. Current situation: 1. all 16x new cache_maps... [23:06:28] (03CR) 10jenkins-bot: [V: 04-1] Disable Echo survey on French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284831 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:06:46] (03CR) 10jenkins-bot: [V: 04-1] Disable Echo survey on French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284831 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:07:02] (03PS1) 10Dzahn: piwik: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285309 [23:07:25] what [23:07:33] jzerebecki, what's going on with jenkins? [23:07:39] legoktm, ^ [23:08:07] bblack, the backends are useful because this way we can make sure all the new services are up and running ok before doing the switchover, and we could switch to the old backends if we mess up [23:08:18] thanks for getting so far with varnishes today! 
[23:08:19] PROBLEM - eventlogging_sync processes on dbstore1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [23:08:28] (03CR) 10Alex Monk: [V: 032] Disable Echo survey on French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284831 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [23:08:31] yurik: we're not working on the backends, just the traffic infrastructure that sits in front of them [23:08:36] Krenair: i can take care of getting the security fix in MobileFrontend master after it's deployed and confirmed fixed [23:08:45] ok [23:08:55] yurik: I don't even know what the status of the backends really is in terms of hw purchase -> install, etc :) [23:09:18] Krenair: what's wrong? [23:09:19] (03PS26) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [23:09:22] bblack, cool, as long as we keep the backends for a while, its all good :) As far as i heard from robh, they have been purchased [23:09:41] legoktm, jenkins -1'd that patch [23:10:04] so should be around soon! bblack, do you have any objections to enable maps on some tiny wikis in the meantime to start getting wp-side feedback? 
[23:10:17] now that it seems we are going ahead with it :) [23:10:20] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/284831/ - Disable Echo survey on French wikis [23:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:30] matt_flaschen, ^ please check [23:10:33] 23:06:45 /tmp/hudson5646543977288841764.sh: line 2: /srv/deployment/integration/phpunit/vendor/bin/phpunit: No such file or directory [23:10:43] hmm [23:10:48] I think hashar killed that [23:11:09] (03PS1) 10Dzahn: RT: move role classes to autoloader layout, split labs [puppet] - 10https://gerrit.wikimedia.org/r/285310 [23:11:29] yurik: I have no answer on a question like that yet. let's actually finish the traffic-layer work at least first, and then see what the status/timeline is on the rest? [23:11:43] sounds good :) [23:12:33] Krenair: it's just going to fail, I don't have time to fix it right now, sorry [23:12:42] ok [23:13:12] (03PS1) 10Dzahn: noc site: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285312 [23:13:16] Krenair, works. Thanks. [23:15:58] (03PS1) 10Dzahn: performance site: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285313 [23:16:40] (doing jdlrobson's patch) [23:17:09] (03PS27) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [23:17:13] (03PS2) 10Dzahn: piwik: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285309 [23:17:43] Krenair: awesome. 
To test on testwiki (if you are unable to) i'll need delete revision rights on Jdlrobson [23:17:49] (temporarily) [23:18:25] (PS2) Dzahn: RT: move role classes to autoloader layout, split labs [puppet] - https://gerrit.wikimedia.org/r/285310 [23:19:45] (PS2) Dzahn: noc site: move role class to autoloader layout [puppet] - https://gerrit.wikimedia.org/r/285312 [23:19:55] jdlrobson, shouldn't you be using your WMF account for this? [23:21:04] ohh good point :) [23:21:12] i forgot about that [23:21:31] that already has staff too [23:21:38] !log Deployed patch for T132653 [23:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:10] * jdlrobson really hates how logging out for testing logs him out evvvverryyywhere on all his devices :/ [23:23:18] RECOVERY - eventlogging_sync processes on dbstore1002 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [23:23:34] jdlrobson: ctrl + shift + n > private browsing > you can open a new session under your secondary account [23:23:55] Dereckson: not when i need to use two accounts to test :-) [23:24:32] Krenair: Jon (WMF) will need some admin privileges on test.wikipedia.org [23:24:41] Yes, that's the goal: your private browsing session is independent of your main browsing session, so you can be logged under two accounts simultaneously [23:24:59] Dereckson: no i dont want to use my personal account in my main browsing session [23:25:13] k [23:25:14] and you can only have one in incognito mode (unless you use multiple browsers which id rather not) [23:25:25] indeed [23:25:29] I'm not able to grant that [23:25:34] anyway i was mostly moaning at the state of affairs for log out behaviour :) [23:25:39] there's a longstanding bug somewhere for that [23:25:51] https://phabricator.wikimedia.org/T51890 < [23:25:54] Operations, ops-eqiad, hardware-requests: reclaim or decom: cp1043 + cp1044 - 
https://phabricator.wikimedia.org/T133614#2237371 (BBlack) [23:27:59] (PS1) BBlack: remove cp104[34] storage cfg T133614 [puppet] - https://gerrit.wikimedia.org/r/285315 [23:28:12] legoktm, are you able to give Jon (WMF)@testwiki admin rights? [23:28:16] (CR) BBlack: [C: 2 V: 2] remove cp104[34] storage cfg T133614 [puppet] - https://gerrit.wikimedia.org/r/285315 (owner: BBlack) [23:28:21] sure [23:28:44] thanks legoktm. Krenair: alternatively if someone can point me at an already deleted revision on enwiki I can verify there [23:28:50] (change visibility) 23:28, 25 April 2016 Legoktm (talk | contribs | block) changed group membership for Jon (WMF) from (none) to administrator (per request) [23:28:54] i'm not sure if there is a special page for listing deleted revisions [23:29:01] sweet thanks legoktm [23:32:19] Operations, Icinga, Patch-For-Review: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2237401 (Dzahn) a:Dzahn>None technically this bug is invalid. we will not do this. we will just setup einsteinium and tegmen with shinken and one day shut this down [23:33:05] Operations, Icinga, Patch-For-Review: shutdown neon (icinga) after it has been replaced with shinken - https://phabricator.wikimedia.org/T125023#2237403 (Dzahn) [23:33:20] Operations, Icinga, Shinken: shutdown neon (icinga) after it has been replaced with shinken - https://phabricator.wikimedia.org/T125023#1972468 (Dzahn) [23:34:45] Dereckson, around? [23:35:00] Yes. [23:35:30] want to do your patches or shall I? [23:36:45] As you want. [23:37:06] (PS3) Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [23:37:10] can you do it please? [23:37:12] Sure. 
[23:39:14] (PS3) Dereckson: Flow dblist on noc.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 [23:39:25] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 (owner: Dereckson) [23:40:11] (CR) jenkins-bot: [V: -1] Flow dblist on noc.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 (owner: Dereckson) [23:40:24] oh, right [23:40:25] (CR) jenkins-bot: [V: -1] Flow dblist on noc.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 (owner: Dereckson) [23:40:26] jenkins is broken [23:40:41] have to remove jenkins' V-1, add V+2, and submit yourself [23:41:20] k [23:41:50] i.e. the old way of merging [23:41:52] (CR) Dereckson: [V: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/281977 (owner: Dereckson) [23:42:33] (PS2) BBlack: apt|mirrors|ubuntu: puppetized LE certs [puppet] - https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) [23:42:35] (PS17) BBlack: letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [23:42:56] (PS28) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [23:44:16] (PS2) Dereckson: noc: jobqueue-eqiad.php.txt → jobqueue.php [mediawiki-config] - https://gerrit.wikimedia.org/r/285061 [23:44:27] (CR) Dereckson: [C: 2 V: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/285061 (owner: Dereckson) [23:44:53] Operations, DBA, Patch-For-Review: Gerrit 285208 broke eventlogging_sync.sh - https://phabricator.wikimedia.org/T133588#2237441 (Volans) I've also noticed (from icinga alarms) that periodically `eventlogging_sync.sh` crashes with this error: ``` mysqldump: Couldn't find table: "MobileV?b?lickTracking... 
[23:45:08] Operations, Services-next, User-mobrovac, codfw-rollout, codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2237443 (GWicke) [23:45:19] (PS2) Dereckson: noc: PoolCounterSettings-eqiad.php → PoolCounterSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324) [23:45:44] (CR) Dereckson: [C: 2 V: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324) (owner: Dereckson) [23:45:46] (CR) jenkins-bot: [V: -1] noc: PoolCounterSettings-eqiad.php → PoolCounterSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324) (owner: Dereckson) [23:45:59] (CR) Dereckson: "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/285062 (https://phabricator.wikimedia.org/T133324) (owner: Dereckson) [23:46:49] Operations, Beta-Cluster-Infrastructure, Deployment-Systems, Patch-For-Review, Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2237449 (mmodell) Ok, I discovered that ssh-add will reuse the same passphrase on multiple key... [23:47:34] (CR) BBlack: [C: 1] letsencrypt module guts + acme-setup script [puppet] - https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: BBlack) [23:48:13] (CR) BBlack: [C: 1] "Tested in compiler and looks "right"." 
[puppet] - https://gerrit.wikimedia.org/r/285196 (https://phabricator.wikimedia.org/T132450) (owner: BBlack) [23:48:15] !log dereckson@tin Synchronized docroot/noc/createTxtFileSymlinks.sh: noc: PoolCounterSettings-eqiad.php → PoolCounterSettings.php ([[Gerrit:285062]], no-op) (duration: 00m 30s) [23:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:49:27] (PS4) Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [23:51:55] !log dereckson@tin Synchronized docroot/noc/conf/: noc.wikimedia.org update ([[Gerrit:285061]], [[Gerrit:285062]], [[Gerrit:281977]]) (duration: 00m 26s) [23:52:02] Testing. [23:52:44] Works. [23:53:01] Dereckson: thank you :)) [23:53:06] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail [23:53:08] You're welcome. [23:59:15] (CR) Luke081515: [C: 1] Reconfigure interface editor group on ur.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/285209 (https://phabricator.wikimedia.org/T133564) (owner: Dereckson)