[00:26:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake: Connection refused
[00:27:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[00:27:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:27:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused
[00:31:36] sigh
[00:33:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[00:33:27] !log starting back cassandra on restbase1011
[00:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational
[00:35:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042
[00:38:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[00:38:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:38:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused
[00:41:43] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[00:42:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[00:42:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational
[00:42:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4255761 keys, up 48 days 16 hours - replication_delay is 47
[00:43:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042
[00:43:53] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days)
[00:46:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[00:46:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
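The cassandra-c instance on restbase1011 kept flapping through this whole window and was restarted by hand ("!log starting back cassandra" above). A minimal sketch of the triage an operator might run first, assuming only standard systemd tooling and the unit name from the alerts:

    # Why did the instance unit fail, and what do its last logs say?
    systemctl status cassandra-c
    sudo journalctl -u cassandra-c --since '30 min ago' | tail -n 50

    # Once the cause is understood (OOM kill, long GC pause, ...), bring it back
    sudo systemctl restart cassandra-c
    systemctl is-active cassandra-c   # should print "active"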
[00:46:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake: Connection refused
[00:46:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused
[00:56:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational
[00:57:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[00:57:33] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days)
[00:57:34] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042
[01:00:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake: Connection refused
[01:00:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused
[01:01:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[01:01:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:07:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[01:07:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational
[01:07:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042
[01:07:43] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days)
[01:18:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake: Connection refused
[01:18:34] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused
[01:19:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[01:19:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:21:24] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:24:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active
[01:24:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational
[01:24:43] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days)
[01:25:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.001 second response time on 10.64.0.119 port 9042
[01:50:24] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:18:55] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 06m 39s)
[02:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:18] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Dec 19 02:23:18 UTC 2016 (duration 4m 23s)
[02:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:01] (PS1) Legoktm: Run tests on Python 3.4 too for Jessie [software/hhvm_exporter] - https://gerrit.wikimedia.org/r/328116
[03:03:33] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:12:43] Should we be on .7
[03:14:03] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:16:03] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:16:54] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[03:17:03] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[03:19:33] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:20:20] Puppet, Labs: Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608#2885600 (scfc)
[03:20:34] Puppet, Labs: Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608#2885612 (scfc) p:Triage>Lowest
[03:20:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 797.78 seconds
[03:24:44] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:27:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.87 seconds
[03:30:53] PROBLEM - NTP on prometheus2003 is CRITICAL: NTP CRITICAL: Offset unknown
[03:33:33] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[03:48:33] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:52:43] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[04:09:57] Operations, Discovery, Traffic, Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2885616 (Smalyshev) @Esc3300 please see above about the difference between entity identifiers and URLs. @MZMcBride if you mean links t...
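Most of the "Catalog fetch fail" alerts in this log clear on their own a few minutes later. A minimal sketch of confirming a host has recovered without waiting for the next scheduled run (a standard Puppet agent invocation, nothing site-specific assumed):

    # Trigger a one-off run; --test implies --detailed-exitcodes, so
    # 0 means a clean run with no changes and 2 a clean run with changes,
    # while 1, 4 and 6 indicate failures such as a catalog compile error
    sudo puppet agent --test
    echo $?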
[04:15:13] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=642.40 Read Requests/Sec=408.80 Write Requests/Sec=25.70 KBytes Read/Sec=39434.80 KBytes_Written/Sec=196.00
[04:25:13] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=8.10 Read Requests/Sec=13.60 Write Requests/Sec=35.90 KBytes Read/Sec=71.60 KBytes_Written/Sec=3988.40
[05:15:40] (PS2) Tim Landscheidt: Tools: Disable automatic backups of aptly repositories [puppet] - https://gerrit.wikimedia.org/r/328031 (https://phabricator.wikimedia.org/T150726)
[05:33:43] PROBLEM - puppet last run on wtp1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:39:33] Puppet, Labs: Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612#2885673 (scfc)
[05:39:52] Puppet, Labs: Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612#2885686 (scfc)
[05:44:17] Puppet, Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#2885702 (scfc) Open>Resolved
[05:50:33] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:01:43] RECOVERY - puppet last run on wtp1017 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:08:53] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[06:18:33] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:31:53] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[06:36:53] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:44:38] !log Deploy innodb compression dbstore2001 on dewiki and wikidatawiki - T151552
[06:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:43] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552
[06:55:09] (PS7) Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - https://gerrit.wikimedia.org/r/327157
[06:56:09] (CR) jenkins-bot: [V: -1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - https://gerrit.wikimedia.org/r/327157 (owner: Yuvipanda)
[06:59:53] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[07:42:15] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2885801 (Joe) @aaron another possibility is to have the process call a special url on HHVM to...
[07:42:44] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, Patch-For-Review: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2885803 (Revent) To update, the backlog is now over 8000... when around, I have been kicking 'excess...
[07:50:12] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, Patch-For-Review: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (Joe) @Revent first of all thanks for all the work you're putting into this. We will try to m...
[07:50:53] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2885807 (Joe)
[07:53:35] Operations, hardware-requests: codfw: (2) servers request for ORES redis databases - https://phabricator.wikimedia.org/T142190#2885808 (akosiaris) @RobH We will need these for the next switchover to CODFW, as ORES is not operational in CODFW without those. So the Jan 2017-March 2017 quarter is the targ...
[07:53:58] (PS8) Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - https://gerrit.wikimedia.org/r/327157
[07:54:47] (CR) jenkins-bot: [V: -1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - https://gerrit.wikimedia.org/r/327157 (owner: Yuvipanda)
[08:03:52] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2885867 (Revent) I appreciate the thanks...really. I'd already been working on kicking failed transcodes back...
[08:04:37] (PS3) ArielGlenn: move table job info to a default config file and add setting for override [dumps] - https://gerrit.wikimedia.org/r/325844 (https://phabricator.wikimedia.org/T152679)
[08:06:58] Operations, Discovery, Traffic, Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2885870 (Esc3300) As Smalyshev mentions, traditionally http may be used, but there isn't really a rule against using https. Traditionall...
[08:07:04] (CR) ArielGlenn: [C: +2] move table job info to a default config file and add setting for override [dumps] - https://gerrit.wikimedia.org/r/325844 (https://phabricator.wikimedia.org/T152679) (owner: ArielGlenn)
[08:10:13] (PS2) ArielGlenn: document the new table jobs yaml file [dumps] - https://gerrit.wikimedia.org/r/325943 (https://phabricator.wikimedia.org/T152679)
[08:10:47] (PS9) Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - https://gerrit.wikimedia.org/r/327157
[08:10:49] (CR) ArielGlenn: [C: +2] document the new table jobs yaml file [dumps] - https://gerrit.wikimedia.org/r/325943 (https://phabricator.wikimedia.org/T152679) (owner: ArielGlenn)
[08:11:22] (PS2) ArielGlenn: remove unneeded dblists and references to them [dumps] - https://gerrit.wikimedia.org/r/325944 (https://phabricator.wikimedia.org/T152679)
[08:11:44] (CR) jenkins-bot: [V: -1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - https://gerrit.wikimedia.org/r/327157 (owner: Yuvipanda)
[08:12:03] (CR) ArielGlenn: [C: +2] remove unneeded dblists and references to them [dumps] - https://gerrit.wikimedia.org/r/325944 (https://phabricator.wikimedia.org/T152679) (owner: ArielGlenn)
[08:12:35] (PS2) ArielGlenn: cleanup of README for general configuration and sample config file [dumps] - https://gerrit.wikimedia.org/r/325945 (https://phabricator.wikimedia.org/T152679)
[08:14:00] (CR) ArielGlenn: [C: +2] cleanup of README for general configuration and sample config file [dumps] - https://gerrit.wikimedia.org/r/325945 (https://phabricator.wikimedia.org/T152679) (owner: ArielGlenn)
[08:14:24] (PS2) ArielGlenn: remove halt, last reference to forcenormal configuration settings [dumps] - https://gerrit.wikimedia.org/r/325946
[08:14:48] (PS10) Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - https://gerrit.wikimedia.org/r/327157
[08:15:55] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2885896 (akosiaris)
[08:17:13] (CR) ArielGlenn: [C: +2] remove halt, last reference to forcenormal configuration settings [dumps] - https://gerrit.wikimedia.org/r/325946 (owner: ArielGlenn)
[08:18:10] (PS2) ArielGlenn: fix up silly handling of table job names [dumps] - https://gerrit.wikimedia.org/r/325947 (https://phabricator.wikimedia.org/T152679)
[08:18:39] (CR) ArielGlenn: [C: +2] fix up silly handling of table job names [dumps] - https://gerrit.wikimedia.org/r/325947 (https://phabricator.wikimedia.org/T152679) (owner: ArielGlenn)
[08:25:14] (PS2) ArielGlenn: allow dumps of private tables to be skipped via config setting [dumps] - https://gerrit.wikimedia.org/r/324702 (https://phabricator.wikimedia.org/T152021)
[08:25:37] (CR) jenkins-bot: [V: -1] allow dumps of private tables to be skipped via config setting [dumps] - https://gerrit.wikimedia.org/r/324702 (https://phabricator.wikimedia.org/T152021) (owner: ArielGlenn)
[08:26:48] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2885950 (akosiaris) I'd like to add a hiera key formation clause please. 8. __hiera keys SHOULD try to reflect the most specific common shared names...
[08:29:13] (PS3) ArielGlenn: allow dumps of private tables to be skipped via config setting [dumps] - https://gerrit.wikimedia.org/r/324702 (https://phabricator.wikimedia.org/T152021)
[08:30:31] (CR) ArielGlenn: [C: +2] allow dumps of private tables to be skipped via config setting [dumps] - https://gerrit.wikimedia.org/r/324702 (https://phabricator.wikimedia.org/T152021) (owner: ArielGlenn)
[08:32:22] (PS2) ArielGlenn: move configuration of tables to be dumped out to a yaml file [puppet] - https://gerrit.wikimedia.org/r/325939
[08:33:59] (CR) ArielGlenn: [C: +2] move configuration of tables to be dumped out to a yaml file [puppet] - https://gerrit.wikimedia.org/r/325939 (owner: ArielGlenn)
[08:35:46] (CR) Alexandros Kosiaris: [C: +2] introduce dbmonitor, add dbmonitor[12]001, v4 and v6 [dns] - https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: Dzahn)
[08:35:49] (PS6) Alexandros Kosiaris: introduce dbmonitor, add dbmonitor[12]001, v4 and v6 [dns] - https://gerrit.wikimedia.org/r/327266 (https://phabricator.wikimedia.org/T149340) (owner: Dzahn)
[08:37:57] (PS2) ArielGlenn: remove some dblist paths from dump config settings, no longer needed [puppet] - https://gerrit.wikimedia.org/r/325941
[08:39:22] (CR) ArielGlenn: [C: +2] remove some dblist paths from dump config settings, no longer needed [puppet] - https://gerrit.wikimedia.org/r/325941 (owner: ArielGlenn)
[08:42:53] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:43:01] the snapshot whines are me, fixing
[08:43:33] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:44:32] (PS1) ArielGlenn: dumps: fix typo in setup of tables yaml file path [puppet] - https://gerrit.wikimedia.org/r/328140
[08:45:31] (CR) ArielGlenn: [C: +2] dumps: fix typo in setup of tables yaml file path [puppet] - https://gerrit.wikimedia.org/r/328140 (owner: ArielGlenn)
[08:47:33] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:50:04] Operations, vm-requests, Patch-For-Review: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#2886008 (akosiaris) VMs created. MACs for DHCP/PXE are ``` sudo gnt-instance list -o name,nic.mac/0 dbmonitor1001.wikimedia.org Instance NicMAC/0 dbmonitor1001.wik...
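A typo in the tables-YAML path was fixed above (change 328140) and another path fix follows just below (328141). A generic sketch of the kind of cheap pre-deploy sanity check that catches this class of error, assuming PyYAML is available (the file name here is an illustration, not the real path):

    # Fail fast if the file is not parseable YAML at all
    python -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' table_jobs.yaml \
        && echo 'YAML parses OK'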
[08:51:49] (PS1) ArielGlenn: dumps: fix path to tables yaml file [puppet] - https://gerrit.wikimedia.org/r/328141
[08:53:01] (CR) ArielGlenn: [C: +2] dumps: fix path to tables yaml file [puppet] - https://gerrit.wikimedia.org/r/328141 (owner: ArielGlenn)
[08:57:53] (PS1) ArielGlenn: turn off dumping of private tables [puppet] - https://gerrit.wikimedia.org/r/328142
[08:57:53] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:59:20] (CR) ArielGlenn: [C: +2] turn off dumping of private tables [puppet] - https://gerrit.wikimedia.org/r/328142 (owner: ArielGlenn)
[09:04:10] !log ariel@tin Starting deploy [dumps/dumps@c8fb9a1]: table jobs to yaml config; stop dumping private tables completely
[09:04:12] !log ariel@tin Finished deploy [dumps/dumps@c8fb9a1]: table jobs to yaml config; stop dumping private tables completely (duration: 00m 01s)
[09:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:42] (PS5) Niharika29: Deploy scholarships with scap3 [puppet] - https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134)
[09:14:19] (CR) Alexandros Kosiaris: [C: +2] "PCC happy at https://puppet-compiler.wmflabs.org/4892/ and cherry-picked in deployment-prep. Should be good throughout the fleet, merging" [puppet] - https://gerrit.wikimedia.org/r/313650 (owner: Alexandros Kosiaris)
[09:14:23] PROBLEM - HP RAID on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[09:14:26] (PS4) Alexandros Kosiaris: Rework network::subnets [puppet] - https://gerrit.wikimedia.org/r/313650
[09:14:35] (CR) Alexandros Kosiaris: [V: +2 C: +2] Rework network::subnets [puppet] - https://gerrit.wikimedia.org/r/313650 (owner: Alexandros Kosiaris)
[09:17:34] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:34] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:34] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:34] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:43] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:53] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:53] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:53] PROBLEM - puppet last run on graphite2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:54] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:33] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:34] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:34] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:34] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:34] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:34] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:34] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:35] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:35] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:36] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:36] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:43] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:43] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:44] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:53] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:53] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:53] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:53] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:53] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:53] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:53] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:54] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:54] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:55] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:55] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:56] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:06] (CR) ArielGlenn: "It's nice to know that this works. But let's not enable it til we know it's needed. We would want to verify that a slowdown incident is c" [puppet] - https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: Paladox)
[09:19:24] akosiaris ^ is that because of your merge?
[09:19:33] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:33] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:33] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:33] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:33] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:34] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:34] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:34] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:35] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:35] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:36] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:36] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:37] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:43] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:43] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:43] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:44] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:44] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:44] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:44] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:47] (PS1) Elukey: Repurpose two jobrunners to videoscalers in eqiad [puppet] - https://gerrit.wikimedia.org/r/328144 (https://phabricator.wikimedia.org/T153488)
[09:19:55] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:55] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:56] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:56] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:57] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:57] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:58] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:58] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:03] PROBLEM - puppet last run on elastic2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:04] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:04] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:04] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:04] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:20:09] !log killing irc-echo
[09:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:24] done
[09:21:24] Could not find data item network::subnets in any Hiera data file and no default supplied at /etc/puppet/modules/network/manifests/constants.pp:26
[09:21:28] akosiaris: --^
[09:23:47] !log Stop mysql db2048 (depooled) for maintenance - T149553
[09:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:51] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553
[09:26:18] damn... how on earth did this not fail in the compiler?
[09:26:45] unless it's transient ....
[09:26:55] let's see
[09:27:01] * akosiaris looking
[09:27:08] it's transient indeed!
[09:27:11] nice!
[09:27:18] lol @puppet
[09:27:27] so what happened is a race
[09:27:28] mmm mw2168.codfw.wmnet does not agree :(
[09:27:38] still erroring
[09:27:44] (random one that I picked)
[09:28:46] grrr
[09:29:13] the freaking code no longer even has a hiera lookup at that line... this is definitely a race or something
[09:29:23] ahahhah lol
[09:29:41] a puppet master may not have been correctly updated
[09:29:42] puppet races on Monday mornings
[09:29:57] jynus: yeah, looking at that now
[09:31:34] niah, all 5 are updated correctly
[09:31:53] Hey… anyone around familiar with TimedMediaHandler?
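On the "Could not find data item network::subnets" error above: a minimal sketch of how a lookup like this would typically be debugged on a Puppet 3-era master (the hiera.yaml path, scope-variable syntax and repo path here are assumptions about the local setup):

    # Does the key resolve at all, and does it resolve for the failing host?
    hiera -c /etc/puppet/hiera.yaml network::subnets
    hiera -c /etc/puppet/hiera.yaml network::subnets ::fqdn=mw2168.codfw.wmnet ::site=codfw

    # "all 5 are updated correctly" below refers to checking that every
    # puppetmaster serves the same revision of the code
    git -C /var/lib/git/operations/puppet rev-parse HEAD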
[09:32:30] hmm so cp1073 is now compiling as well
[09:32:46] maybe an apache reload (not a restart) will speed up this
[09:32:48] Revent: (I am working on T153488 FYI)
[09:32:49] T153488: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488
[09:32:53] Specifically, I’m interested in understanding the ‘exact’ logic it uses to decide what transcode to start next.
[09:33:02] elukey: Awesome.
[09:33:17] I am going to repurpose two "old" jobrunners
[09:33:41] but I need to depool and then reimage them
[09:33:46] elukey: Hopefully it’s clear exactly ‘what’ logic I’m using to manipulate the queue (and that it’s exploiting an obvious bug)
[09:33:47] (Debian to Trusty)
[09:33:53] elukey: mw2168 is now compiling as well
[09:33:59] \o/
[09:34:26] akosiaris: so you had to reload apache on the puppet masters?
[09:34:33] not sure that did anything
[09:34:43] hosts were starting to compile anyway
[09:35:33] this reminds me of a bug brandon was mentioning he had encountered
[09:35:56] that is hosts consistently getting an older version of the code until some timeout expired
[09:36:12] Revent: I am totally ignorant about the subject, can help only adding hw :(
[09:36:12] maybe apache children dying and new ones being spawned
[09:36:55] elukey: Yeah, my impression is that only brion really understands it, and that he had been ‘back burnering’ a rewrite for years.
[09:38:50] !log stopping jobrunner/jobchron daemons on mw116[89] as prep step for repurpose to videoscalers - T153488
[09:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:53] T153488: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488
[09:40:36] elukey: FWIW, when I started messing with the huge backlog of broken ones, it became obvious that the system was fragile (and I mentioned it at the time, a couple of months ago). It’s just that someone did (uploading a ton of huge videos) exactly what I was afraid of.
[09:42:35] Revent: thanks a lot for taking a look at this
[09:42:58] akosiaris: should I re-enable irc-echo or just wait a bit more to avoid the shower of recoveries?
[09:43:03] <_joe_> Revent: was that done manually with server-side uploads?
[09:43:27] elukey: no wait
[09:43:34] <_joe_> Revent: I'm a bit upset that happened without sending out any warning to ops, during the FR freeze, etc
[09:43:36] we got 650 criticals still
[09:43:43] yeah
[09:43:49] _joe_: Some were server side, a lot were done with chunked uploader.
[09:43:52] <_joe_> (I know it wasn't you)
[09:43:59] <_joe_> ok
[09:44:04] <_joe_> thanks
[09:44:12] _joe_: good to go? https://gerrit.wikimedia.org/r/#/c/328144/1
[09:44:18] https://commons.wikimedia.org/wiki/File:10-21-14-_White_House_Press_Briefing.webm
[09:44:21] ^ joe
[09:44:56] for hiera, that exploded on labs projects as well. So feel free to test a change on there
[09:45:01] eg the beta cluster puppetmaster
[09:45:09] (CR) Giuseppe Lavagetto: [C: +1] Repurpose two jobrunners to videoscalers in eqiad [puppet] - https://gerrit.wikimedia.org/r/328144 (https://phabricator.wikimedia.org/T153488) (owner: Elukey)
[09:45:18] The ‘negative’ encode times are a separate bug, they are due to me attempting to reset ‘long’ transcodes that ended up completing before they were killed…
[09:45:54] I have a quarry query that lists them, and will eventually shove them back through.
[09:46:36] _joe_: I just linked that one as an example of a ‘non-server-side’ one that caused drama.
[09:46:38] akosiaris: on labs I have restarted puppetmaster and puppet passes fine
[09:46:48] <_joe_> ok
[09:46:53] on deployment-prep that is
[09:46:58] hashar: yeah I had done that already
[09:47:15] I know from experience that some changes confuse puppet entirely
[09:47:21] yes
[09:47:35] there must be some cache that ends up not being pruned/refreshed properly
[09:47:36] :(
[09:47:36] on this one, old puppet code was trying to reference the removed file
[09:47:41] _joe_: https://quarry.wmflabs.org/query/14861 <- I will eventually kick these back through once the servers are sane.
[09:47:55] <_joe_> Revent: thanks a lot :)
[09:49:11] Operations, Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2886085 (Gehel) Doing a test with multiple bonnie instances in parallel (see script below) gives slightly higher numbers in terms of throughput (~300-400 Mb/s). Intermedi...
[09:51:13] TBH, if the ‘logic’ for how transcodes were started did not attempt to start more transcodes than the available number of CPUs, this would be far less trauma, as the ones ‘doomed to fail’ would not prevent other transcodes from successfully completing.
[09:52:01] I’ve seen examples of 5MB SD videos that completed successfully, but took 6-7 hours to do so because of server load…. they would normally run in a minute or two.
[09:53:11] (PS2) ArielGlenn: Move default config into a file [dumps] - https://gerrit.wikimedia.org/r/43156 (owner: Awight)
[09:54:13] (PS2) Elukey: Repurpose two jobrunners to videoscalers in eqiad [puppet] - https://gerrit.wikimedia.org/r/328144 (https://phabricator.wikimedia.org/T153488)
[09:54:25] _joe_: forgot site.pp --^ :/
[09:54:35] <_joe_> elukey: meh :P
[09:54:58] (CR) Zfilipin: "The last update was 6 months ago. Are you still working on this?" [puppet] - https://gerrit.wikimedia.org/r/178810 (owner: Hashar)
[09:55:50] Also, it would be better if somehow resetting a running transcode actually ‘killed’ the task immediately, and put it on the failed queue… fixing the bug w/o making it impossible to not abort a transcode.
[09:56:45] My impression is that when a running transcode is reset, it eventually fails because its working files have disappeared, after a significant time lag.
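The queue Revent is manipulating here can be inspected from a maintenance host; the exact invocation appears later in this log, when hashar runs it against Commons:

    # Summarize the webVideoTranscode job backlog, grouped by state
    mwscript showJobs.php --wiki=commonswiki --group --type=webVideoTranscode
    # sample output (from ~12:15 below):
    #   webVideoTranscode: 9563 queued; 1142 claimed (544 active, 598 abandoned); 0 delayed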
[09:57:06] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[09:57:07] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[09:57:07] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:57:08] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:57:08] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[09:57:09] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[09:57:09] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:57:10] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:57:10] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:57:11] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[09:57:33] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:57:34] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[09:57:34] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[09:57:34] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[09:57:34] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:57:34] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[09:57:34] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[09:57:35] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:57:35] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:57:36] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:57:36] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[09:57:43] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:57:43] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[09:57:43] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:57:43] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[09:57:43] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:57:44] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:57:49] I haven't re-enabled irc echo
[09:57:53] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[09:57:53] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:57:53] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[09:57:53] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[09:57:53] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[09:57:53] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[09:57:53] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[09:57:54] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:57:54] RECOVERY - puppet last run on restbase2008 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:57:55] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[09:57:55] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[09:57:56] maybe puppet?
[09:58:33] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[09:58:33] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[09:58:33] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:58:33] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:58:33] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[09:58:33] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[09:58:34] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[09:58:34] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[09:58:35] yeah
[09:58:35] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[09:58:35] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[09:58:36] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[09:58:48] stopped again
[10:00:33] puppetmaster issues?
[10:01:06] ema: puppet kept serving stale/wrong content after the last CR that Alex merged
[10:01:14] so we killed irc-echo to avoid spam
[10:01:17] BTW, that transcodes are shown as “Started [INVALID] ago. comma” is an (I hope) completely unrelated bug, that popped up when the queue went over about 5000 or so.
[10:02:13] I suspect that (because incredibly verbose error messages are added to the table) it’s just a memory issue.
[10:02:43] ^ incredibly verbose meaning messages that are 10s of k in length.
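The oversized error messages Revent describes live in TimedMediaHandler's transcode table. A sketch of how one might verify the "10s of k" claim from a maintenance host (connection details omitted; the table and column names are assumptions based on TMH's schema of the time):

    # Ten largest stored transcode error blobs on Commons
    mysql commonswiki -e "
      SELECT transcode_image_name, transcode_key,
             LENGTH(transcode_error) AS err_bytes
      FROM transcode
      ORDER BY err_bytes DESC
      LIMIT 10;"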
[10:03:48] (CR) Elukey: [C: +2] Repurpose two jobrunners to videoscalers in eqiad [puppet] - https://gerrit.wikimedia.org/r/328144 (https://phabricator.wikimedia.org/T153488) (owner: Elukey)
[10:05:24] elukey: yeah that was puppet
[10:05:37] so, 1 critical now in icinga
[10:05:46] we can reenable ircecho
[10:06:41] done :)
[10:09:23] (CR) Hashar: [C: +1] phabricator: delete labs role [puppet] - https://gerrit.wikimedia.org/r/327690 (https://phabricator.wikimedia.org/T139475) (owner: Dzahn)
[10:09:52] _joe_: elukey: FWIW, while adding more videoscalers will be a huge help, it’s going to probably have a quite slow effect on the queue, due to the other bugs… it’s not going to prevent the system from trying to start so many transcodes that the servers get locked up.
[10:10:06] <_joe_> yes
[10:10:08] <_joe_> we know
[10:10:17] (nods)
[10:10:20] <_joe_> it's still going to add capacity
[10:10:30] <_joe_> it's the quickest gain we can give you :)
[10:10:34] Yeah, I’m not (at all) saying it’s not worthwhile.
[10:13:20] (PS1) Jcrespo: labsdb: Block access to replicas' mysql from almost everywhere [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051)
[10:13:23] I’m very open, btw, to any criticism of how I’ve been trying to deal with this (kicking the big transcodes back off) or more info about how they are actually dying (I can’t ofc watch the actual running tasks on the server)
[10:14:32] (PS2) Jcrespo: labsdb: Block access to replicas' mysql from almost everywhere [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051)
[10:15:55] (PS3) Jcrespo: labsdb: Block access to replicas' mysql from almost everywhere [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051)
[10:16:38] !log reimaging mw1168 and mw1169 to Trusty - T153488
[10:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:40] T153488: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488
[10:16:49] Operations, Continuous-Integration-Config, Operations-Software-Development, Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2886184 (ArielGlenn) >>! In T148494#2880204, @Volans wrote: > @ArielGlenn it's surely depends on the specific cases, but I thi...
[10:20:33] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/endowment]
[10:21:47] Operations, Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2886206 (ArielGlenn) My .02€: At some point I'd like to put my few snapshot hosts and couple of dataset servers into a cluster for monitoring. Stacked graphs w...
[10:22:58] Operations, Collaboration-Team-Triage, DBA: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154#2886223 (Marostegui)
[10:28:30] (CR) Marostegui: "Question: is there any reason why db1095/db1069 should be allowed to connect to the labsdb10XX hosts?" [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: Jcrespo)
[10:39:59] Operations, GlobalRename, MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot on ru.wiki - https://phabricator.wikimedia.org/T153602#2886251 (Aklapper)
[10:41:33] (PS2) Alexandros Kosiaris: Move external_networks to network module data.yaml [puppet] - https://gerrit.wikimedia.org/r/302695
[10:43:42] (CR) Jcrespo: "> Question: is there any reason why db1095/db1069 should be allowed to connect to the labsdb10XX hosts?" [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: Jcrespo)
[10:43:52] (CR) Alexandros Kosiaris: "I've heavily reworked the patch on top of ec0f5594a56b80e9d90dd4ed18a6d462ae7472eb which should address most of the concerns here, making" [puppet] - https://gerrit.wikimedia.org/r/302695 (owner: Alexandros Kosiaris)
[10:45:01] (PS4) Ema: varnishrls: port to cachestats.CacheStatsSender [puppet] - https://gerrit.wikimedia.org/r/327733 (https://phabricator.wikimedia.org/T151643)
[10:45:07] (CR) Marostegui: "I was just double checking :)" [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: Jcrespo)
[10:45:22] (CR) Ema: [V: +2 C: +2] varnishrls: port to cachestats.CacheStatsSender [puppet] - https://gerrit.wikimedia.org/r/327733 (https://phabricator.wikimedia.org/T151643) (owner: Ema)
[10:47:13] (CR) Marostegui: [C: +1] labsdb: Block access to replicas' mysql from almost everywhere [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: Jcrespo)
[10:49:52] Operations, Datasets-Archiving, Datasets-General-or-Unknown: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001#2886278 (ArielGlenn) These files are in the Swift filesystem so there are multiple copies of each file that is uploaded. There are no extern...
[10:50:10] elukey: do you think https://gerrit.wikimedia.org/r/#/c/328037/ will work?
[10:50:52] zhuyifei1999_: I am adding two more videoscalers to the pool, let's see how it goes
[10:51:06] Operations, Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2886283 (Gehel) The read latency we see are coherent with the current configuration of the deadline scheduler: ``` gehel@elastic2006:~$ cat /sys/block/sda/queue/iosche...
[10:51:07] ETA: ~30 mins
[10:52:46] k
[10:55:34] <_joe_> zhuyifei1999_: that doesn't mean we're solving all the issues
[10:55:45] <_joe_> but well, doubling the capacity should do something
[10:56:05] hopefully
[11:07:09] (CR) Alex Monk: "Doesn't this class apply on 1001/1003? Can't check right now" [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: Jcrespo)
[11:10:52] _joe_: Until the backlog gets sane, I’m probably going to keep the same criteria for what I kick off the queue. “most” videos are far less than 200MB.
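How loaded a scaler really is comes down to how many concurrent ffmpeg processes it is running, which is exactly the check _joe_ runs on mw1259 just below; a small sketch generalizing it (the port for HHVM's check-health endpoint is an assumption about the local admin-server config):

    # How many transcodes is this box actually running right now?
    ps -ef | grep '[f]fmpeg' | wc -l

    # HHVM's own view of request pressure; the "load" and "queued"
    # figures quoted below come from this endpoint
    curl -s http://localhost:9002/check-health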
[11:21:40] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[11:24:21] Operations, GlobalRename, MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot on ru.wiki - https://phabricator.wikimedia.org/T153602#2886361 (MarcoAurelio) Andre, this is a global rename request :)
[11:27:40] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[11:28:59] Revent,zhuyifei1999_ - new videoscalers working
[11:29:16] https://ganglia.wikimedia.org/latest/?c=Video%20scalers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[11:33:29] <_joe_> elukey: it might be worth taking a look at what's going on on the "old" scalers, doing now
[11:33:58] (CR) Jcrespo: "> Doesn't this class apply on 1001/1003? Can't check right now" [puppet] - https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: Jcrespo)
[11:35:21] _joe_ yep!
[11:36:21] <_joe_> mw1259:~$ ps -ef | grep ffmpeg | wc -l
[11:36:21] <_joe_> 323
[11:36:23] <_joe_> sigh
[11:36:24] <_joe_> ok
[11:36:38] <_joe_> I think I have a pretty clear image of what's going on there
[11:37:47] <_joe_> :/
[11:38:24] <_joe_> elukey: also, sigh, "load":64
[11:38:24] <_joe_> , "queued":2234
[11:38:30] <_joe_> hhvm check-health
[11:39:09] :(
[11:39:58] _joe_: Hopefully you are clear on ‘exactly’ what I have been doing.. and tyvm for the added capacity.
[11:40:02] in the jobqueue error log I can see also proxy timeouts (afaiu we have 300s set)
[11:40:33] <_joe_> that's the apache timeout, I fear
[11:40:38] Wow, it just started a ‘ton’ more of the problematic ones.
[11:41:00] _joe_: yeah the ProxyTimeout is not set and defaults to Timeout
[11:41:09] <_joe_> we should have /no/ timeout on those machines
[11:41:13] Which is not itself a bad thing, if I can kick them in a way that makes them not be restarted.
[11:41:20] <_joe_> maybe that's part of what is going on
[11:43:23] (PS1) ArielGlenn: make (most) snapshot shell scripts files instead of templates [puppet] - https://gerrit.wikimedia.org/r/328158
[11:44:27] (CR) ArielGlenn: "I'm not sure I like the path for the little file of directory paths but what do you think of something like this?" [puppet] - https://gerrit.wikimedia.org/r/328158 (owner: ArielGlenn)
[11:47:38] <_joe_> !log disabling puppet, reconfiguring timeout on apache, restarting HHVM on mw1259
[11:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:49] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name <- I note the immediate climb toward 100% on the new servers (sigh)
[11:53:20] Revent: yes I noticed that too.. it might be the apache timeout that exacerbates the issue
[11:54:18] elukey: Well, being loaded at 100% is ‘reasonable’ considering the backlog… it’s just that with the other bugs they will also get overloaded soon.
[11:55:19] If nothing else, though, it will allow me to kick the huge ones off the queue faster, and thus bring it down.
[11:55:56] elukey: Do you happen to know if those machines have ‘more powerful’ cpus?
[11:56:14] yes they are
[11:56:18] Nice.
[11:57:01] That, in and of itself, will help lots.
[11:57:51] Revent: so mw116[89] have 32 nprocs and 64GB of ram
[11:58:00] elukey: thanks
[11:59:29] what happened to 1259 though?
[11:59:43] <_joe_> zhuyifei1999_: see my log up there [11:59:56] <_joe_> I'll bbiab [12:00:11] <_joe_> but I think me and elukey might be onto something here. no promises [12:00:20] ok [12:00:36] zhuyifei1999_: so apache times out after 300 seconds, and we are bumping it up [12:01:12] now mw1259 has 1 day of Timeout value [12:01:14] in httpd [12:01:33] _joe_: elukey: Again, please yell at me if you need me to stop manipulating the queue so you can see what happens. [12:01:36] let's watch it in Ganglia for the next 15/20 minutes [12:01:53] Revent: sure, thanks again for the support :) [12:03:59] Hey, I (massively) appreciate action taken on this. [12:05:21] so mw1259 looks really good now [12:05:54] what might happen is that apache times out after 300s but the job in hhvm does not, and the jobrunner retries submitting another job, and so on [12:06:07] (this with jobs that take more than 300s to complete) [12:06:37] on mw1259 I can see only 5 ffmpeg processes [12:06:40] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:06:44] that is what we have configured right? [12:07:24] ‘theoretically’ one per CPU would be optimum. [12:07:36] oh yes I meant what we have in puppet [12:08:21] Yeah, I cannot look at the code, I just mean… by understanding the basics, that multitasking transcoding is not helpful. [12:08:46] elukey: at some point we had the transcode job holding the database connection, having it killed after some time which failed the job [12:09:26] (as in, it’s a cpu intensive task that will load the particular cpu at 100% until it completes, so swapping tasks just slows it all down) [12:09:36] hashar: so the 300s were there on purpose? [12:09:52] but if so, it might be better to have them not in apache but in the php code [12:10:05] (or maybe I am not understanding it correctly) [12:10:16] the issue we had months ago was WebVideoTranscodeJob spurting a stacktrace because the db connection got closed https://phabricator.wikimedia.org/T127428#2043535 [12:10:38] got worked around by ignoring the database failure [12:10:40] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is inactive [12:10:57] ah! [12:11:14] so probably unrelated [12:11:41] I have no idea how the jobrunner behaves when hitting the RPC runJobs.php [12:13:14] but presumably, the jobrunner hits the RPC with an X timeout [12:13:26] which ends up being smaller than the time it takes to complete the job (eg video transcoding) [12:13:27] bails out [12:13:30] Revent: I think we fixed the issue, the cpu utilization looks stable on mw1259 [12:13:32] job fails, and is added back/retried [12:14:11] iirc the RPC hit is a synchronous operation. That gets returned a json blob of jobs that got completed and the ones in error [12:14:13] elukey: Do you want me to stop kicking transcodes off the queue?
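The 300 seconds discussed above is apache's stock Timeout directive, which ProxyTimeout falls back to when it is unset; the mw1259 hotfix simply raised it. A sketch of checking the value, assuming the stock Debian config path (an assumption):

    # Default shipped value: any transcode request running longer than
    # this is timed out by apache even though ffmpeg keeps going.
    grep -R '^Timeout' /etc/apache2/apache2.conf
    # Timeout 300      <- default; the hotfix set it to 86400 (one day)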
[12:14:34] in this particular case, I think that httpd times out but because of a known mod_proxy_fcgi bug it does not propagate the abort to the FCGI buddy (hhvm in this case) [12:14:52] so jobs requiring more than 300s get piled up in hhvm [12:14:59] and the magic command would be: mwscript showJobs.php --wiki=commonswiki --group --type=webVideoTranscode [12:15:20] webVideoTranscode: 9563 queued; 1142 claimed (544 active, 598 abandoned); 0 delayed [12:15:39] Revent: let's wait a bit to have the new config rolled out everywhere [12:16:00] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 381 bytes in 0.232 second response time [12:16:05] kk, I will stop and see what happens. [12:17:00] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:08] Frankly, even if it will reasonably ‘fail’ the huge ones, without overloading the servers so that other tasks fail, that will be an improvement. [12:17:31] <_joe_> elukey: you have only half the story right [12:17:48] buuuu [12:17:50] <_joe_> elukey: when a job is done by hhvm, it will be marked as such [12:18:05] elukey: Do you want me to go ahead and kick the ‘negative time’ one back on the queue? [12:18:07] <_joe_> the issue here is that requests will pile up [12:18:10] *ones [12:18:17] <_joe_> Revent: please wait [12:18:24] yeah they are retried a few times before being abandoned [12:18:25] (nods) [12:18:34] <_joe_> Revent: in ~ 1 hour we will be able to extend the fix to the other videoscalers [12:18:43] <_joe_> maybe do a bit more of tuning [12:18:48] a random example: webVideoTranscode File:The_President_Holds_a_Town_Hall_with_Service_Members.webm ... attempts=3) status=abandoned [12:18:59] Oh, that video... [12:19:34] hashar: I have repeatedly kicked transcodes of that video off the queue, because of its size. [12:19:53] (03PS11) 10Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [12:21:05] (03CR) 10jenkins-bot: [V: 04-1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [12:21:08] and we have neat errors about source not being found: webVideoTranscode File:8-8-14-_White_House_Press_Briefing.webm ... t=40784 error=File:8-8-14- White House Press Briefing.webm: Source not found /tmp/localcopy_f569073a5756.webm [12:21:13] _joe_: Just to be clear, can you give me a simple description of what the patch tweaked? [12:21:32] <_joe_> Revent: it's a hotfix at the moment, but [12:21:58] <_joe_> the jobrunner service submits jobs to an apache webserver, which is backed by HHVM [12:22:08] If you say ‘how many tasks are run at once’ I will be ecstatic. :P [12:22:24] <_joe_> now the timeout for a request in apache was set to 5 minutes [12:22:37] (03CR) 10Yuvipanda: [C: 031] "Minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: 10Jcrespo) [12:22:56] <_joe_> so when a request timed out in apache (quite easy), the jobrunner thought the transcoding ended (although in an error) [12:23:03] <_joe_> and submitted a new job [12:23:09] Oh gawd.
[12:23:20] <_joe_> the issue would've been limited if not for an apache bug [12:23:37] <_joe_> https://bz.apache.org/bugzilla/show_bug.cgi?id=56188 discovered by elukey [12:23:37] That would, indeed, seem to apply to the most important bug. [12:23:47] <_joe_> so basically apache wouldn' [12:23:51] <_joe_> t tell hhvm [12:24:07] <_joe_> "hey, I sent out a timeout, what about aborting the processing?" [12:24:17] <_joe_> and so the transcodes piled up [12:24:40] <_joe_> so let's still not chant victory, it's a working hypothesis by me and elukey [12:25:07] To be specific, we had videos being uploaded that would, if the server were ‘not’ loaded, take over 7 hours to process… 5 minutes? (lol) [12:26:44] <_joe_> yeah it was a misconfiguration probably when we moved the jobrunners to HHVM [12:26:46] ah so I'll abandon my changeset [12:27:12] <_joe_> which we would've noticed if not for that apache bug [12:27:34] (lunch! brb in a bit) [12:27:41] <_joe_> now, all this is our hypothesis, let's see if that's the case [12:28:14] _joe_: Yes, I will keep watching, I still have a huge backlog of broken stuff to kick back through [12:28:35] (03Abandoned) 10Zhuyifei1999: videoscaler: Reduce runners_transcode from 5 to 2 [puppet] - 10https://gerrit.wikimedia.org/r/328037 (https://phabricator.wikimedia.org/T153488) (owner: 10Zhuyifei1999) [12:29:53] _joe_: BTW, can you take a look at why times of running or queued transcodes are shown as ‘invalid’? [12:30:19] The behavior popped up when the queued backlog went over 5k or so. [12:30:34] <_joe_> uhm I'd need help from someone familiar with the tmh extension [12:31:00] <_joe_> can I ask you how you check the queue? [12:31:18] https://commons.wikimedia.org/wiki/Special:TimedMediaHandler [12:31:25] And… sec. [12:31:35] https://quarry.wmflabs.org/query/14838 [12:32:03] <_joe_> ok [12:32:17] <_joe_> because I usually look at the jobqueue instead [12:32:34] The query includes transcodes that are not shown by the check page, but were ‘broken’ by being reset while running. [12:32:34] <_joe_> I think yours is more accurate though [12:32:47] https://quarry.wmflabs.org/query/14842 [12:33:45] _joe_: I attempted, at one point, to remove jobs from the queue by deleting, and then undeleting, the files. [12:34:26] It did not work, they were shown as ‘uninitialized’ but were started anyhow. [12:34:29] Revent: fyi, https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/36a496fce0ceff59e24565488a05e1d65952a9a9/SpecialTimedMediaHandler.php#L16 [12:34:40] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:37:08] and the invalid thing should be bugging somewhere near https://github.com/wikimedia/mediawiki-extensions-TimedMediaHandler/blob/36a496fce0ceff59e24565488a05e1d65952a9a9/TranscodeStatusTable.php#L235 [12:37:14] elukey: _joe_ : not sure how relevant but on mw1168 /var/log/mediawiki/jobrunner.log shows apache emitting 503 [12:37:38] I'm not familiar enough with php to know what exactly went wrong [12:38:02] and the webVideoTranscode job outputs random stdout which is not json !!! [12:38:14] zhuyifei1999_: Yeah, a previous version of those queries was based on that. I futzed with it later, after finding that files not in that specific report were still being started [12:43:51] If the overloading problem proves to be fixed, I’ll kick the ‘fuckedup’ ones back on so they can run properly.
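Pulling _joe_'s pieces together, the suspected loop looks like this (a sketch of the working hypothesis stated above, not a confirmed trace; the ports are the ones named in the log):

    # 1. the jobrunner POSTs a transcode request to apache (localhost:9005)
    # 2. apache's 300s Timeout fires and an error goes back to the jobrunner
    # 3. per https://bz.apache.org/bugzilla/show_bug.cgi?id=56188 the abort
    #    never reaches the FCGI backend, so hhvm keeps the transcode running
    # 4. the jobrunner, seeing a failure, re-submits -> one more ffmpeg
    # The leak shows up as orphaned encoder processes:
    ps -ef | grep -c '[f]fmpeg'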
[12:45:00] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [12:48:57] <_joe_> Revent: some improvements seem to be there [12:49:03] <_joe_> but it's too early to tell [12:49:17] _joe_: Oh yes, it looks better, but yeah. [12:49:39] <_joe_> anyways, given now the jobs are returning results, I'll extend the manual hack to the other machines [12:50:43] _joe_: Even if tasks fail as taking too long to run, as long as they are not causing other ‘reasonable’ tasks to fail it will be a vast improvement. [12:50:54] <_joe_> yes [12:52:11] And I promise to look (and I have various queries) for transcodes that have a messed up status due to this, and send them back through. [12:54:18] _joe_ on mw1259 I can see a new one [12:54:19] Connection reset by peer: [client 127.0.0.1:44613] AH01075: Error dispatching request to :9005: [12:54:39] this will not stop hhvm jobs either, I am afraid [12:54:49] <_joe_> 9005 is the port apache is listening on [12:55:45] <_joe_> elukey: mw1259 is the only server on which I see no real errors [12:55:57] <_joe_> ok just saw one [12:56:06] <_joe_> but 1 in a lot of time [12:56:27] <_joe_> but yeah, not encouraging [12:56:41] sure sure, I wanted to say that the CPU is rising again and possibly issues like connection reset by peer hit the same bug [12:56:46] I see a ton of new ‘failed’ transcodes. [12:56:48] <_joe_> yup [12:56:58] <_joe_> Revent: from mw1168 I think [12:57:35] https://commons.wikimedia.org/wiki/File:IgA_Nephropathy.webm 480P, should have been easily run. [12:58:04] <_joe_> elukey: I think the two new videoscalers have some issues [12:58:23] <_joe_> I see a ton of failed runs there [12:59:56] <_joe_> elukey: so I'll extend my hack everywhere anyways [13:00:00] <_joe_> doesn't seem to hurt [13:00:08] yeah [13:00:11] +1 [13:00:41] https://quarry.wmflabs.org/query/14906 shows the error message [13:01:11] “Exitcode: 134” [13:02:17] <_joe_> Revent: seems like an ffmpeg2theora issue [13:05:03] <_joe_> elukey: one good thing on mw1259 is the hhvm load has remained almost constant [13:08:01] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.142 second response time [13:09:59] transcodes do seem to be successfully completing at a vastly improved rate [13:10:07] \o/ [13:10:18] <_joe_> Revent: that's me restarting the jobrunners I fear [13:10:36] <_joe_> elukey: so, one bad thing; it seems we're still leaking requests [13:10:43] <_joe_> but the rate seems better [13:10:47] <_joe_> let me find some numbers [13:10:52] Oh, doing at a lower lever what I had been doing? [13:10:57] *level [13:11:29] <_joe_> elukey: where do I find the stats we gather from hhvm in prometheus? [13:11:59] _joe_ this is a good question, I think that we don't have a dashboard yet [13:12:55] ah and also I don't think that we have the exporters on the video scalers :( [13:12:59] _joe_ --^ [13:13:10] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2886903 (10Tbayer) >>! In T107430#2882009, @fgiunchedi wrote: >>>! In T107430#2881955, @fgiunchedi wrote: >> I'm not sure how to change that and how it is set from though > > @Krenair p...
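The exporters elukey wants here are the prometheus-apache-exporter and prometheus-hhvm-exporter packages that show up (initially failing to install) later in this log. A sketch of sanity-checking one once it is running; the listen port is an assumption, check the package defaults:

    # Each exporter serves a /metrics endpoint for the prometheus server
    # to scrape, e.g. for the apache exporter:
    curl -s http://localhost:9117/metrics | grep '^apache_'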
[13:13:39] <_joe_> elukey: and we don't have the data on ganglia anymore either [13:13:41] <_joe_> sigh [13:14:00] 1260’s load just kicked waay down [13:14:59] <_joe_> that's because I restarted it [13:15:04] <_joe_> I forgot to log, meh [13:15:34] <_joe_> !log restarted hhvm, apache on mw1260, raised the apache timeout to 1 day, restarted the jobrunner, disabled puppet [13:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:58] (03PS5) 10Alexandros Kosiaris: kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 [13:24:00] (03PS5) 10Alexandros Kosiaris: k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 [13:24:02] (03PS5) 10Alexandros Kosiaris: k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 [13:24:04] (03PS14) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [13:24:06] (03PS13) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [13:24:15] (03CR) 10jenkins-bot: [V: 04-1] kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 (owner: 10Alexandros Kosiaris) [13:24:32] (03CR) 10jenkins-bot: [V: 04-1] k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 (owner: 10Alexandros Kosiaris) [13:24:59] (03CR) 10jenkins-bot: [V: 04-1] k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 (owner: 10Alexandros Kosiaris) [13:25:03] <_joe_> Revent: we're pretty far from being ok, but me and elukey are discussing solutions [13:25:33] (03CR) 10jenkins-bot: [V: 04-1] Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 (owner: 10Alexandros Kosiaris) [13:26:15] (03CR) 10jenkins-bot: [V: 04-1] Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 (owner: 10Alexandros Kosiaris) [13:26:27] (03PS12) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [13:26:44] _joe_: Hey, I’m thrilled that tech people are on it. [13:27:16] (03CR) 10jenkins-bot: [V: 04-1] labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [13:27:47] <_joe_> Revent: I might step away in a few, I am not feeling very well today, but elukey will be running point if I go AFK [13:28:20] (03PS13) 10Yuvipanda: labs: maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [13:28:31] _joe_: Sorry to hear that, hope you feel better. [13:29:06] <_joe_> Revent: thanks :) [13:31:41] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146#2886921 (10thiemowmde) @Krinkle: The `sites` table must be updated to reflect the change from https://gerrit.wikimedi... 
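The !log above compresses the per-host procedure into one line; spelled out it is roughly the following (a sketch; the jobrunner unit name is assumed from the standard setup on these hosts):

    puppet agent --disable 'T153488: raising apache timeout by hand'
    # bump Timeout from 300 to 86400 in the apache config, then:
    systemctl restart apache2 hhvm jobrunner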
[13:33:40] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - create-dbusers is active [13:35:48] (03PS1) 10ArielGlenn: add rest of the dump rsync modules that filter dirs on our side for [puppet] - 10https://gerrit.wikimedia.org/r/328167 (https://phabricator.wikimedia.org/T152954) [13:39:12] !log Manually raise hhvm.server.connection_timeout_seconds on mw1259 to one day [13:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:21] (03CR) 10ArielGlenn: [C: 032] add rest of the dump rsync modules that filter dirs on our side for [puppet] - 10https://gerrit.wikimedia.org/r/328167 (https://phabricator.wikimedia.org/T152954) (owner: 10ArielGlenn) [13:49:31] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:38] (03PS4) 10Jcrespo: labsdb: Block access to replicas' mysql from almost everywhere [puppet] - 10https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) [13:55:41] (03PS6) 10Alexandros Kosiaris: kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 [13:55:43] (03PS6) 10Alexandros Kosiaris: k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 [13:55:45] (03PS6) 10Alexandros Kosiaris: k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 [13:55:47] (03PS15) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [13:55:49] (03PS14) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [13:56:30] Revent: has anything changed in the queue? [13:56:55] I am trying to test a new setting but the jobrunner seems to be sitting there doing nothing :D [13:57:47] It’s actually gone up a bit, but that might be new uploads. [13:58:41] because I am seeing load going down on all the hosts too [13:58:58] (03PS5) 10Jcrespo: labsdb: Block access to replicas' mysql from almost everywhere [puppet] - 10https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) [13:59:00] (03PS1) 10Jcrespo: labsdb: Enable view creation on labsdb1009/10/11 [puppet] - 10https://gerrit.wikimedia.org/r/328168 (https://phabricator.wikimedia.org/T147052) [13:59:18] (03CR) 10Jcrespo: [C: 032] labsdb: Block access to replicas' mysql from almost everywhere [puppet] - 10https://gerrit.wikimedia.org/r/328150 (https://phabricator.wikimedia.org/T147051) (owner: 10Jcrespo) [13:59:27] (03PS1) 10Giuseppe Lavagetto: videoscaler: fix and harmonize timeouts [puppet] - 10https://gerrit.wikimedia.org/r/328169 (https://phabricator.wikimedia.org/T153488) [13:59:34] <_joe_> elukey: ^^ [14:00:41] The ‘running’ list is almost entirely full of those “While Hous Press Briefing” huge files.
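The knob named in that !log is an HHVM ini setting; on disk the manual change amounts to something like this (the exact file under /etc/hhvm/ is an assumption):

    grep -R 'connection_timeout_seconds' /etc/hhvm/
    # hhvm.server.connection_timeout_seconds = 86400   (raised from the default)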
[14:00:41] <_joe_> elukey: I need to rest a bit now, but for the queue size you can just go to terbium [14:00:52] *White House [14:01:04] (03CR) 10Jcrespo: [C: 032] labsdb: Enable view creation on labsdb1009/10/11 [puppet] - 10https://gerrit.wikimedia.org/r/328168 (https://phabricator.wikimedia.org/T147052) (owner: 10Jcrespo) [14:01:10] <_joe_> elukey: mwscript showJobs.php --wiki=commonswiki --group | grep webVideoTranscode [14:01:24] * elukey takes notes [14:02:40] !log deploying new firewall rules to labsdb1009/10/11 [14:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:45] <_joe_> elukey: I'm merging the change now [14:03:51] (03CR) 10Giuseppe Lavagetto: [C: 032] videoscaler: fix and harmonize timeouts [puppet] - 10https://gerrit.wikimedia.org/r/328169 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [14:03:58] (03PS2) 10Giuseppe Lavagetto: videoscaler: fix and harmonize timeouts [puppet] - 10https://gerrit.wikimedia.org/r/328169 (https://phabricator.wikimedia.org/T153488) [14:04:03] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] videoscaler: fix and harmonize timeouts [puppet] - 10https://gerrit.wikimedia.org/r/328169 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [14:06:51] _joe_ thanks, I was checking the official docs and the connection timeout setting is "The maximum number of seconds a connection is allowed to stand idle after its previous read or write.". Meanwhile, request_timeout_seconds is "The amount of time provided for a request to process before the server times out. If 0 (default), there is no explicit request timeout." [14:07:14] let's see how it goes [14:07:15] That the running queue would fill up with the huge files is rather inevitable… shorted oves will pass throught lasted, and leave the long transcodes. What matters is that ‘small’ transcodes are not beiling because of the long ones. [14:07:16] <_joe_> yeah request_timeout_seconds is also broken [14:07:28] *typos [14:07:42] ahahha ok [14:07:44] very nice [14:07:45] *shorter ones will pass throught faster [14:08:01] *not failing [14:09:53] <_joe_> elukey: I'm moderately confident this last round of changes will work [14:10:06] <_joe_> elukey: btw I've reenabled puppet everywhere and made it run [14:10:12] sure.. [14:10:40] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:10:53] _joe_ should we bump up the apache Timeout too? [14:11:02] <_joe_> elukey: it was done in my patch [14:11:08] I lost that part [14:11:11] re-checking [14:11:12] <_joe_> https://gerrit.wikimedia.org/r/#/c/328169/2/modules/role/manifests/mediawiki/videoscaler.pp [14:11:36] this is new magic for me [14:11:43] <_joe_> not the most elegant of solutions, but still [14:11:59] <_joe_> elukey: you mean you were never introduced to the wonderful world of augeas? [14:12:10] nopE! [14:18:30] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:26:43] elukey: prepare to run! Augeas is even scarier than puppet!
[14:27:41] :D [14:31:01] (03PS1) 10ArielGlenn: modify dumps index.html to clarify that cap is only on WMF servers [puppet] - 10https://gerrit.wikimedia.org/r/328172 [14:31:08] <_joe_> gehel: the httpd lens is quite nice, come on [14:31:28] <_joe_> gehel: the docs are horrible and people around suggest very very stupid patterns, though [14:32:02] (03CR) 10ArielGlenn: [C: 032] modify dumps index.html to clarify that cap is only on WMF servers [puppet] - 10https://gerrit.wikimedia.org/r/328172 (owner: 10ArielGlenn) [14:32:35] _joe_: I find the principle of Augeas itself quite disturbing... modifying pieces without a view on the whole file, that can work, but it is scary! [14:33:03] (and yes, I do agree that in a few cases it actually does make sense) [14:34:40] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:38:04] <_joe_> elukey: so checking that the load of HHVM and the number of apache busy workers should always be ~ equal [14:38:12] <_joe_> is what we must do [14:38:26] <_joe_> (now I'm going away for reals, finally, be back in a bit) [14:38:40] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [14:49:24] (03PS1) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174 [14:49:26] (03PS1) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 [14:50:15] Revent: the videoscalers look healthy now.. Did you say that you could have increased a bit the current load? [14:51:51] Umm… not sure what you mean. I have no way to directly affect the load. [14:52:09] ah ok sorry I thought you were holding off some jobs [14:53:04] I have stuff I need to throw back on the queue, yes, but they will just be queued. [14:54:37] (03CR) 10Dereckson: "Superseded by 86f0b35a94c3." [puppet] - 10https://gerrit.wikimedia.org/r/328037 (https://phabricator.wikimedia.org/T153488) (owner: 10Zhuyifei1999) [14:55:41] Revent: if you want we can see how it goes with those [14:55:58] Ok, I’ll kick them back in. [14:56:17] Revent: I transfer jason videos from Terbium to Commons? [14:56:20] Do you have an idea of the volume? [14:56:37] A couple hundred. [14:56:48] should be fine [14:56:57] (03PS1) 10Gehel: osm: install prerequisite packages for meddo [puppet] - 10https://gerrit.wikimedia.org/r/328176 (https://phabricator.wikimedia.org/T153289) [14:57:05] Ones that I had intentionally kicked off the queue earlier. [14:57:49] (that’s a couple of hundred ‘transcodes’, not files) [14:58:48] hashar: newbie question from me (as always) - what does claimed/active mean in the context of webVideoTranscode: 0 queued; 10589 claimed (9989 active, 600 abandoned); 0 delayed ? [14:59:14] nobody knows [14:59:38] or more correctly, I once looked at some documentation or terminology of those terms, but never found it [14:59:40] from the count [15:00:05] you can assume that claimed means the job runner has marked the jobs as being processed in the queue [15:00:21] active ones are those that are going to be run/running/pending retry (maybe)?
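For the uninitiated being teased here: the augeas httpd lens parses the apache config into a tree, so puppet (or the augtool CLI) can edit one directive without owning the whole file, which is what the videoscaler patch relies on. A small taste via augtool, with paths assuming a stock Debian apache layout:

    # Inspect, then raise, just the Timeout directive:
    augtool print '/files/etc/apache2/apache2.conf/directive[.="Timeout"]'
    augtool <<'EOF'
    set /files/etc/apache2/apache2.conf/directive[.="Timeout"]/arg 86400
    save
    EOF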
[15:00:24] One can expect "claimed" means "the job has been taken to be processed", "active" "the job is currently processed" "abandoned" "there was an issue during processing, it failed" [15:00:29] abandoned is that the jobrunner gave up after X retries [15:00:38] not sure what it does with the abandoned jobs [15:01:14] the active part is what puzzles me, because 9989 seems like a lot of work [15:01:23] and I don't see it on the videoscalers [15:01:32] but I am surely missing something trivial [15:02:05] Ok… something fucked. [15:02:11] https://commons.wikimedia.org/wiki/File:11-4-14-_White_House_Press_Briefing.webm [15:02:11] depends what "active" means [15:02:24] they are just jobs pending in the queue [15:02:36] Ones I kicked back on the queue, ‘failing’ immediately. [15:02:40] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:02:45] but really, beside reading the JobQueue + jobrunner source code, I don't think we have any documentation [15:03:18] ah so "queued" could mean "waiting to decide if we process them or not" [15:03:39] Revent: failing with a specific error? [15:03:57] elukey: Give me a sec, to write a query to look [15:04:38] “File:11-4-14- White House Press Briefing.webm: Source not found” [15:05:29] (that’s what’s in the SQL table) [15:07:43] okok no rush, I just wanted to know if it was a clear infrastructure issue or not [15:09:02] in the meantime, I am going to write a summary of what we have done in the task [15:15:52] elukey: Despite being shown on the file page as ‘error’, and having an entry in transcode_error, they are shown as queued on Special:TimedMediaHandler. [15:20:13] TMH really needs better docs [15:22:09] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10elukey) Summary: While investigating the high load on mw1259/1260 we discovered this apache httpd e... [15:24:35] I added the summary in --^ [15:24:48] hopefully it is clear enough but let me know if you have questions :) [15:25:17] elukey: the load looks in fact "underloaded". Any possibility to run 1 ffmpeg per core? [15:26:20] (03PS1) 10Ema: varnishxcache: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) [15:26:44] zhuyifei1999_: for the moment let's focus on making sure that we fixed the problem, afterwards I'll be more than happy to tune it up :) [15:26:53] ok [15:27:49] fwiw, video2commons on labs runs at a maximum of 2 ffmpeg per core [15:29:11] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884289 (10MisterSynergy) FWIW: https://www.mediawiki.org/wiki/Talk:Wikidata_query_service#Links_in_query_results_should_be_https.2C_not_http [15:30:09] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2887110 (10Ottomata) I think only @ellery can speak to this question. @ellery?
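Revent's quarry queries read TimedMediaHandler's transcode table directly; a hedged shell equivalent for spotting errored transcodes (table and column names follow the TMH schema, but treat them as assumptions):

    mysql commonswiki -e "
      SELECT transcode_image_name, transcode_key, transcode_error
      FROM transcode
      WHERE transcode_time_error IS NOT NULL
      ORDER BY transcode_time_error DESC
      LIMIT 10;"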
:) [15:32:57] <_joe_> elukey: seems like our work succeeded [15:33:36] (03PS1) 10Ema: varnish: remove varnishprocessor [puppet] - 10https://gerrit.wikimedia.org/r/328180 (https://phabricator.wikimedia.org/T151643) [15:33:44] <_joe_> I see no videoscaler with more than 5 running threads [15:34:03] <_joe_> OTOH, I keep seeing File:10-14-14- White House Press Briefing.webm: Source not found [15:34:09] <_joe_> as an error [15:34:23] <_joe_> but that has nothing to do with what we have fixed AFAICT [15:35:53] (03PS1) 10Cmjohnson: Addin Eric Evans to analytics-privatedata for stat1002 https://phabricator.wikimedia.org/T153375. Merge Date is Dec 21,2017. [puppet] - 10https://gerrit.wikimedia.org/r/328181 [15:36:09] _joe_: \o/ [15:36:44] (03CR) 10Elukey: [C: 031] Addin Eric Evans to analytics-privatedata for stat1002 https://phabricator.wikimedia.org/T153375. Merge Date is Dec 21,2017. [puppet] - 10https://gerrit.wikimedia.org/r/328181 (owner: 10Cmjohnson) [15:37:05] Who is this Eric Evans that dares to ask access to the Hadoop Cluster? [15:37:14] :P [15:37:25] _joe_: Dunno if you tweaked something, but that video just reset successfully, and actually started running. [15:38:02] 11-4-14-_White_House_Press_Briefing.webm I mean, the one that was showing ‘error’ right after reset [15:39:01] <_joe_> Revent: well let's see how things proceed now [15:39:44] * elukey does not wish to find another timeout being hit during the next hours [15:40:44] Hopes [15:41:18] <_joe_> elukey: actually, if everything ok, we might want to raise a bit the number of concurrent transcodes and aim at 90% utilization [15:41:27] <_joe_> but that's for laters [15:42:14] yesss [15:43:05] <_joe_> actually, if we want to, we can add a monitoring check to see if we're leaking transcode processes [15:44:03] let's also extend the prometheus exporters to all the hhvm/apache hosts, so we'll have a decent dashboard [15:44:09] <_joe_> yes [15:44:12] <_joe_> can you do that? [15:44:20] <_joe_> I'll write a tiny monitoring check [15:45:43] I had a chat with godog about this, not sure where it would be best to put the exporters (apache/hhvm::monitoring classes or directly in roles) [15:46:03] but yes I'll take care of it [15:46:05] <_joe_> roles [15:46:13] <_joe_> go read the rfc :P [15:46:37] hahahaha it is in my backlog! You are right, I was waiting for a summary but you have probably done it [15:46:57] <_joe_> I'm going to fix the wikitech pages on puppet coding "ASAP" [15:47:12] 2018? 
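The monitoring check _joe_ floats at 15:43 (and uploads as "mediawiki::scaler: check orphaned HHVM threads" below) boils down to comparing running encoder processes against the configured runner count. A minimal nagios-style sketch; the runner count and threshold are assumptions:

    #!/bin/bash
    runners=5                        # configured concurrent transcodes per host
    actual=$(pgrep -cf ffmpeg)
    if [ "$actual" -gt $((runners * 2)) ]; then
        echo "CRITICAL: $actual ffmpeg processes (expected ~$runners)"
        exit 2
    fi
    echo "OK: $actual ffmpeg processes"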
[15:47:25] * _joe_ larts Reedy [15:47:28] (03PS1) 10Hashar: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 [15:48:05] (03CR) 10jenkins-bot: [V: 04-1] rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar) [15:53:36] (03PS1) 10Elukey: Add the apache/hhvm prometheus exporter to all the mw hosts [puppet] - 10https://gerrit.wikimedia.org/r/328184 (https://phabricator.wikimedia.org/T147316) [16:04:28] !log upgrading to python-urllib3_1.19 on scb1001 [16:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 32 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:08:37] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07WorkType-Maintenance: Jenkins master / client ssh connection fails due to missing ssh algorithm - https://phabricator.wikimedia.org/T100509#2887180 (10hashar) [16:09:51] (03PS1) 10ArielGlenn: add Umeå University to public dumps mirrors, yay! [puppet] - 10https://gerrit.wikimedia.org/r/328185 [16:10:59] (03CR) 10Ema: [] "Note that this version of varnishxcache doesn't report values that haven't been incremented (eg: if there's been no int-remote, that value" [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:11:12] (03CR) 10ArielGlenn: [C: 032] add Umeå University to public dumps mirrors, yay! [puppet] - 10https://gerrit.wikimedia.org/r/328185 (owner: 10ArielGlenn) [16:15:38] (03PS1) 10ArielGlenn: (dumps mirrors) aaaand add the missing href close tag. [puppet] - 10https://gerrit.wikimedia.org/r/328186 [16:16:44] 06Operations, 10Traffic, 13Patch-For-Review: Strip query string in varnish upload - https://phabricator.wikimedia.org/T153336#2887188 (10BBlack) 05Open>03Resolved a:03BBlack [16:17:05] (03CR) 10ArielGlenn: [C: 032] (dumps mirrors) aaaand add the missing href close tag. [puppet] - 10https://gerrit.wikimedia.org/r/328186 (owner: 10ArielGlenn) [16:17:09] 06Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2887192 (10BBlack) [16:17:12] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2887190 (10BBlack) 05Open>03Resolved a:03BBlack [16:17:21] !log Run lots of small optimize tables on db1015 as it needs to get some space back urgently [16:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:21] 06Operations, 10Traffic, 13Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2887195 (10BBlack) [16:18:24] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2887196 (10BBlack) [16:18:26] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2887193 (10BBlack) 05Open>03declined We're going to leave this as-is and assume eventstream replacement (which will be HTTPS-only from the get-g... [16:19:30] (03CR) 10Hashar: [C: 031] ":-}" [puppet] - 10https://gerrit.wikimedia.org/r/327691 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [16:19:51] (03PS1) 10ArielGlenn: (dumps mirrors) third time's a charm?
[puppet] - 10https://gerrit.wikimedia.org/r/328187 [16:20:02] thank goodness there's a meeting soon. clearly I need a break [16:21:17] (03CR) 10ArielGlenn: [C: 032] (dumps mirrors) third time's a charm? [puppet] - 10https://gerrit.wikimedia.org/r/328187 (owner: 10ArielGlenn) [16:22:10] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 8 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:22:11] (03CR) 10Hashar: [C: 031] "Much easier to figure out what is happening now. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [16:27:27] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2887209 (10BBlack) @EdErhart-WMF - Any update on setting the appropriate Strict-Transport-Security header on this service? [16:30:00] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#2887212 (10BBlack) @PPena (or anyone) - who's responsible in the WMF for store.wikimedia.org? This is a pretty basic request and it's been outstanding for months. It's one of t... [16:33:17] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2887213 (10BBlack) [16:34:10] jouncebot: now [16:34:10] No deployments scheduled for the next 319 hour(s) and 25 minute(s) [16:35:16] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 2 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2887223 (10Joe) I might add we had another issue, once we fixed that timeout: hhvm had a setting for connectio... [16:35:53] that clean deployment calendar is what I like to see [16:36:17] robh: can you do your magic and make me clinic duty in the topic pretty please? [16:36:24] * apergos makes puppy-eyes [16:37:14] _joe_, the parsoid patch is merged. [16:41:46] (03PS1) 10Eevans: enable instance restbase1018-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328192 (https://phabricator.wikimedia.org/T151086) [16:42:38] (03CR) 10Eevans: [C: 031] "Ready." [puppet] - 10https://gerrit.wikimedia.org/r/328192 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [16:43:24] (03PS20) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [16:43:26] (03PS20) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [16:43:28] (03PS21) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [16:43:30] (03PS5) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) [16:43:32] (03PS1) 10BBlack: TLS: reduce scope of stream.wm.o redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/328193 (https://phabricator.wikimedia.org/T143925) [16:44:21] urandom: ready to go? [16:44:26] elukey: sure! 
[16:45:30] (03CR) 10Elukey: [C: 032] enable instance restbase1018-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328192 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [16:45:36] (03PS1) 10Giuseppe Lavagetto: mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) [16:46:29] urandom: done, you are free to run puppet :) [16:46:53] <_joe_> elukey: when you're done there, a review of ^^ will be appreciated. [16:47:02] I just opened it :) [16:47:05] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:47:15] elukey: yup, it's on its way; thanks! [16:47:29] 06Operations, 06Analytics-Kanban, 10EventBus, 10Traffic, and 2 others: Productionize and deploy Public EventStreams - https://phabricator.wikimedia.org/T143925#2887270 (10BBlack) [16:49:17] (03Abandoned) 10Hashar: Support flake8 with python3 [puppet] - 10https://gerrit.wikimedia.org/r/327746 (https://phabricator.wikimedia.org/T152950) (owner: 10Hashar) [16:50:20] (03PS2) 10Alexandros Kosiaris: kubernetes::master: Introduce the kubernetes profile [puppet] - 10https://gerrit.wikimedia.org/r/328174 [16:50:21] (03PS2) 10Alexandros Kosiaris: Create and assign the kubernetes::master role [puppet] - 10https://gerrit.wikimedia.org/r/328175 [16:51:47] (03CR) 10jenkins-bot: [V: 04-1] mediawiki::scaler: check orphaned HHVM threads [puppet] - 10https://gerrit.wikimedia.org/r/328194 (https://phabricator.wikimedia.org/T153488) (owner: 10Giuseppe Lavagetto) [16:53:57] Anyone, can you tell me why we're still on wmf.6 & not on .7? [16:54:32] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2887285 (10Volans) @ArielGlenn remember that if the file will still be a `.py.erb` it will be skipped by our tox checker as of n... [16:54:44] (03PS1) 10BBlack: sslcert: regenerate dhparam.pem [puppet] - 10https://gerrit.wikimedia.org/r/328195 [16:57:12] _joe_ except the line too long in the py file that makes jenkins upset, it looks great to me! [16:57:23] <_joe_> elukey: thanks [16:57:28] I tried it on mw1259, works nicely [16:59:10] PROBLEM - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused [16:59:35] got it ^^^ [17:00:00] _joe_: sorry to interrupt, but could you please answer my question if you look up in chat history a bit. [17:00:09] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused eevans Bootstrapping. [17:01:21] Zppix: Because that's the schedule? [17:01:27] Last week was .5 to .6 [17:01:39] Now it's a deployment freeze till January [17:01:51] So no .7 yet? [17:02:54] Not till 3/1 [17:03:17] Ok thank you Reedy merry christmas [17:04:00] <_joe_> Zppix: I'm in a meeting [17:04:45] _joe_: no worries, i was answered, sorry for interrupting.
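The cassandra-b CQL CRITICAL above is expected rather than alarming: a bootstrapping Cassandra instance streams its data before it starts serving clients, hence the "Bootstrapping." acknowledgement. A sketch of watching the join, assuming WMF's per-instance nodetool-b wrapper from the multi-instance setup:

    # UJ = Up/Joining while restbase1018-b streams its ranges;
    # it flips to UN (Up/Normal) once the bootstrap completes.
    nodetool-b status | grep -E '^.J'
    nodetool-b netstats | grep Mode   # Mode: JOINING during the bootstrap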
[17:04:49] Zppix, because https://www.mediawiki.org/wiki/MediaWiki_1.29/Roadmap [17:05:12] ...and https://lists.wikimedia.org/pipermail/wikitech-l/2016-December/087138.html [17:05:25] Ah, ok thanks Andre [17:18:21] 06Operations, 06Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2887378 (10Gehel) We are using software RAID on eqiad because the controllers we have are known to be not so good (INTEL C600/X79 and INTEL C610/X99). We are also using s... [17:27:26] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884289 (10DSGalaktos) FWIW, there was also a bit of discussion about this on PC following the initial WDQS Beta announcement: https://www... [17:45:50] (03PS2) 10BBlack: sslcert: regenerate dhparam.pem [puppet] - 10https://gerrit.wikimedia.org/r/328195 [17:45:52] (03PS1) 10BBlack: Remove expired unified certs (GS 2015) [puppet] - 10https://gerrit.wikimedia.org/r/328200 [17:45:54] (03PS1) 10BBlack: Add new digicert unified cert files [puppet] - 10https://gerrit.wikimedia.org/r/328201 [17:46:23] 06Operations, 10Ops-Access-Requests, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10RobH) Please note the groups requested: researchers statistics-privatedata-users... [17:46:43] (03CR) 10BBlack: [V: 032 C: 032] sslcert: regenerate dhparam.pem [puppet] - 10https://gerrit.wikimedia.org/r/328195 (owner: 10BBlack) [17:47:21] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2881052 (10Dzahn) Has been approved in ops meeting [17:47:26] (03PS2) 10Elukey: Add the new user fdans with basic Analytics group permissions [puppet] - 10https://gerrit.wikimedia.org/r/327730 (https://phabricator.wikimedia.org/T153303) [17:48:31] (03CR) 10BBlack: [V: 032 C: 032] Remove expired unified certs (GS 2015) [puppet] - 10https://gerrit.wikimedia.org/r/328200 (owner: 10BBlack) [17:50:28] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2887533 (10Joe) @akosiaris LGTM! There are a bunch of variables that are going to be global anyways, but I agree with your comment. [17:53:49] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2887554 (10Joe) [17:54:45] cmjohnson1: anything against me merging https://gerrit.wikimedia.org/r/#/c/327730/ ? [17:55:11] (03CR) 10Cmjohnson: [C: 031] Add the new user fdans with basic Analytics group permissions [puppet] - 10https://gerrit.wikimedia.org/r/327730 (https://phabricator.wikimedia.org/T153303) (owner: 10Elukey) [17:55:18] elukey: nope [17:55:22] thanks! 
[17:56:32] (03PS3) 10Elukey: Add the new user fdans with basic Analytics group permissions [puppet] - 10https://gerrit.wikimedia.org/r/327730 (https://phabricator.wikimedia.org/T153303) [17:57:48] (03CR) 10Elukey: [C: 032] Add the new user fdans with basic Analytics group permissions [puppet] - 10https://gerrit.wikimedia.org/r/327730 (https://phabricator.wikimedia.org/T153303) (owner: 10Elukey) [18:00:01] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:13] (03PS2) 10Dzahn: Trending Edits: Add the admin group (and add it to SCB) [puppet] - 10https://gerrit.wikimedia.org/r/327754 (https://phabricator.wikimedia.org/T153458) (owner: 10Mobrovac) [18:01:24] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2887639 (10Joe) [18:04:47] (03PS2) 10BBlack: Add new digicert unified cert files [puppet] - 10https://gerrit.wikimedia.org/r/328201 [18:04:55] (03CR) 10BBlack: [V: 032 C: 032] Add new digicert unified cert files [puppet] - 10https://gerrit.wikimedia.org/r/328201 (owner: 10BBlack) [18:05:30] (03PS3) 10Dzahn: Trending Edits: Add the admin group (and add it to SCB) [puppet] - 10https://gerrit.wikimedia.org/r/327754 (https://phabricator.wikimedia.org/T153458) (owner: 10Mobrovac) [18:08:35] (03CR) 10Dzahn: [C: 032] Trending Edits: Add the admin group (and add it to SCB) [puppet] - 10https://gerrit.wikimedia.org/r/327754 (https://phabricator.wikimedia.org/T153458) (owner: 10Mobrovac) [18:09:00] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:09:21] (03PS2) 10Dzahn: Add jdlrobson to the deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/327755 (https://phabricator.wikimedia.org/T153458) (owner: 10Mobrovac) [18:13:37] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2887682 (10Dzahn) ``` Info: Caching catalog for scb1001.eqiad.wmnet Info: Applying... [18:14:18] 07Puppet, 06Labs: Retire and remove module labs_debrepo - https://phabricator.wikimedia.org/T153612#2885673 (10Multichill) How exactly is this related to T153439 Tim? A bit more info than one line would be nice. [18:14:23] (03CR) 10Dzahn: [C: 032] Add jdlrobson to the deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/327755 (https://phabricator.wikimedia.org/T153458) (owner: 10Mobrovac) [18:19:54] 06Operations, 10Mobile-Content-Service, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 3 others: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2887695 (10Dzahn) [18:19:56] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2887692 (10Dzahn) 05Open>03Resolved a:03Dzahn after this second merge, on tin... 
[18:20:35] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 2 others: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2887697 (10Dzahn) @jdlrobson @bearND you should now be able to manage and deploy th... [18:21:06] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2887698 (10Dzahn) [18:24:23] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#2887702 (10Dzahn) found quote from a mail from Seddon //"Change in management.. Wikipedia Store project has moved under Michael Beattie.. Sandra Hust [2] will be the primary cont... [18:25:43] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2887703 (10Andrew) Thank you for all the rewrites -- I'm happy with the proposal as it stands today. If you have time to add snippets of sample code to... [18:29:00] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:30:29] (03CR) 10Pnorman: [C: 031] osm: install prerequisite packages for meddo [puppet] - 10https://gerrit.wikimedia.org/r/328176 (https://phabricator.wikimedia.org/T153289) (owner: 10Gehel) [18:36:48] (03PS1) 10BBlack: update-ocsp: fixups for Digicert deploy [puppet] - 10https://gerrit.wikimedia.org/r/328207 [18:37:48] (03CR) 10BBlack: [C: 032] update-ocsp: fixups for Digicert deploy [puppet] - 10https://gerrit.wikimedia.org/r/328207 (owner: 10BBlack) [18:40:00] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 10 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:00] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 20 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:00] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 52 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:01] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 51 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:01] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 20 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:01] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 20 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:01] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 20 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:02] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 20 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:02] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 20 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:10] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 28 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:20] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:40] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Exec[digicert-2016-rsa-unified-create-ocsp],Exec[digicert-2016-ecdsa-unified-create-ocsp] [18:40:48] that's me, but it's either the puppet check being silly (reporting old failure on fix), or it's a race that will fix itself with a second run [18:41:00] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:41:00] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:41:00] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:41:00] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:41:01] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:41:01] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:41:01] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:41:02] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:41:02] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [18:41:10] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:41:20] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:41:40] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:42:05] yeah I guess it was the former (reporting old failure when re-enabled->fixed) [18:50:30] (03CR) 10Dzahn: [] "@Paladox yes (interested in working on that?)" [puppet] - 10https://gerrit.wikimedia.org/r/327690 (https://phabricator.wikimedia.org/T139475) (owner: 10Dzahn) [18:51:00] (03PS3) 10Dzahn: contint: combine contint1001/2001 in a single node regex [puppet] - 10https://gerrit.wikimedia.org/r/327691
(https://phabricator.wikimedia.org/T150771) [18:52:51] (03CR) 10Volans: [C: 04-1] "I don't see the changes in the puppet files from content => template() to source =>... Plus a possible improvement inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328158 (owner: 10ArielGlenn) [18:58:05] (03PS4) 10Yuvipanda: labs: Clean out projects that don't exist anymore from mounts [puppet] - 10https://gerrit.wikimedia.org/r/327522 [18:58:26] 06Operations, 06Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2887780 (10Gehel) Having software RAID in codfw is to keep configuration uniform between DCs. It might make sense to experiment switching to hardware RAID and see if the... [19:00:06] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#2887787 (10GWicke) [19:04:23] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#2887809 (10GWicke) [19:04:30] (03PS2) 10Filippo Giunchedi: Add the apache/hhvm prometheus exporter to all the mw hosts [puppet] - 10https://gerrit.wikimedia.org/r/328184 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [19:05:16] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#2887787 (10GWicke) [19:05:39] (03CR) 10Dzahn: [C: 032] contint: combine contint1001/2001 in a single node regex [puppet] - 10https://gerrit.wikimedia.org/r/327691 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [19:05:44] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#2887787 (10GWicke) [19:07:10] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:07:14] (03CR) 10Dzahn: "no-op on both servers confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/327691 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [19:07:48] (03CR) 10Filippo Giunchedi: [C: 032] Add the apache/hhvm prometheus exporter to all the mw hosts [puppet] - 10https://gerrit.wikimedia.org/r/328184 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [19:07:54] (03PS3) 10Filippo Giunchedi: Add the apache/hhvm prometheus exporter to all the mw hosts [puppet] - 10https://gerrit.wikimedia.org/r/328184 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [19:08:35] !log nuria@tin Starting deploy [analytics/refinery@711a572]: (no message) [19:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:43] (03PS4) 10Dzahn: contint: fix/move 'backup'-includes, move from node to role [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) [19:13:07] !rebase-race start [19:13:08] hehehe [19:13:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [19:13:10] (03CR) 10Dzahn: [C: 032] contint: fix/move 'backup'-includes, move from node to role [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [19:13:10] (03PS5) 10Dzahn: contint: fix/move 'backup'-includes, move from node to role [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) [19:13:10] first attack blocked [19:14:20] PROBLEM - Disk space on analytics1027 is CRITICAL: DISK CRITICAL - free space: / 167 MB (0% inode=86%) [19:15:02] wah wah [19:15:36] uhm, i freed 2% [19:16:25] !log analytics1027 - out of disk, apt-get clean to free about 500M [19:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:40] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:17:11] ottomata or elukey, wanna take a look at that analytics server running out of disk? i saved a little bit [19:17:59] analytics1027? [19:18:02] yah looking now [19:18:07] / is only 20G [19:18:07] yes. thanks [19:18:13] and we deploy refinery (and jar artifacts there) [19:18:17] and scap keeps a cache of old deploys [19:18:22] so, i'm going to make a larger /srv partition [19:18:23] there is a var/lib with 100G or so [19:18:24] 06Operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001#2887871 (10Tgr) Swift copies are good for hardware errors but when there is a bug in the application code, all the copies get deleted (or, more... 
[19:18:24] to avoid this [19:18:27] sounds good [19:18:40] oh actually, that /var/lib can probably be reclaimed, it was used for mysql when we used to run mysql here [19:18:53] oh, then you have 200G, yea [19:18:59] 50% used [19:19:04] well, the vg that it's on is 1T, so we have room [19:19:07] but we should delete that anyway [19:19:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 24 probes of 407 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:19:12] cool [19:19:13] am on it, thanks mutante [19:19:21] yup, np [19:24:00] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[prometheus-apache-exporter],Package[prometheus-hhvm-exporter] [19:24:10] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 407 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:24:15] mutante: you still logged in there? [19:24:18] can you cd out of /srv? [19:24:44] OH [19:24:45] maybe that's me [19:24:46] sorry [19:24:53] it was, nm [19:24:54] doh! [19:25:16] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#2887902 (10GWicke) [19:25:41] 06Operations, 10Traffic, 07Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2887903 (10BBlack) Status update - Digicert unified certs (RSA+ECDSA) are now deployed and stapled alongside the GlobalSign ones on all cache terminators. They're not being used for user... [19:27:25] (03CR) 10BryanDavis: [] labs: maintain-dbusers.py for maintaining labsdb users (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [19:28:40] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[prometheus-apache-exporter],Package[prometheus-hhvm-exporter] [19:29:42] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2887911 (10DarTar) Note that @ellery is currently OoO and will respond when he's back. He's the primary driver for this request although other team me... [19:29:53] I think those failures will recover by themselves on the next puppet run [19:31:34] 06Operations, 10hardware-requests: codfw: (2) servers request for ORES redis databases - https://phabricator.wikimedia.org/T142190#2526394 (10fgiunchedi) My two cents: since hw requirements are so modest we could also imitate what we did for poolcounter, namely one VM and one bare metal on a different row [19:31:40] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures.
Failed resources (up to 3 shown): Package[prometheus-apache-exporter],Package[prometheus-hhvm-exporter] [19:34:20] RECOVERY - Disk space on analytics1027 is OK: DISK OK [19:35:45] !log nuria@tin Finished deploy [analytics/refinery@711a572]: (no message) (duration: 27m 10s) [19:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:02] !log nuria@tin Starting deploy [analytics/refinery@711a572]: (no message) [19:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:07] !log nuria@tin Finished deploy [analytics/refinery@711a572]: (no message) (duration: 00m 04s) [19:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:10] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:40:34] Krinkle: yt? [19:41:07] !log nuria@tin Starting deploy [analytics/refinery@711a572]: (no message) [19:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:11] !log nuria@tin Finished deploy [analytics/refinery@711a572]: (no message) (duration: 00m 04s) [19:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:39] 06Operations, 10Traffic, 07Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2887935 (10BBlack) In case such an incident happens before the changes in January and I'm not around, the procedure to switch GlobalSign to Digicert globally would be: 1) Commit a change... [19:42:50] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:40] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [19:44:21] !log otto@tin Starting deploy [analytics/refinery@711a572]: (no message) [19:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:28] !log otto@tin Finished deploy [analytics/refinery@711a572]: (no message) (duration: 00m 06s) [19:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:40] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [19:59:46] (03PS1) 10Ottomata: Properly point eventbus.svc to codfw endpoing in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 [20:00:33] (03PS2) 10Ottomata: Properly point eventbus.svc to codfw endpoint in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 [20:00:51] (03PS1) 10Eevans: enable instance restbase1018-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/328213 (https://phabricator.wikimedia.org/T151086) [20:01:01] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4226498 keys, up 49 days 11 hours - replication_delay is 619 [20:02:01] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4184951 keys, up 49 days 11 hours - replication_delay is 0 [20:02:28] (03CR) 10Eevans: [C: 04-1] "Not yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/328213 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [20:06:56] (03CR) 10Ppchelko: [C: 031] Properly point eventbus.svc to codfw endpoint in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 (owner: 10Ottomata) [20:07:05] (03PS3) 10Ottomata: Properly point eventbus.svc to codfw endpoint in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 [20:07:43] (03CR) 10jenkins-bot: [V: 04-1] Properly point eventbus.svc to codfw endpoint in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 (owner: 10Ottomata) [20:11:00] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. Failed resources (up to 3 shown): Exec[ip addr add 2620:0:860:102:10:192:16:30/64 dev eth0],Service[ferm],Service[diamond],Service[prometheus-node-exporter] [20:19:06] (03PS1) 10Awight: Make md5sums.txt files compatible with md5sum --check [dumps] - 10https://gerrit.wikimedia.org/r/328219 (https://phabricator.wikimedia.org/T69886) [20:24:07] (03Draft1) 10Paladox: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) [20:24:10] (03Draft2) 10Paladox: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) [20:25:26] (03CR) 10Paladox: [] "See http://askubuntu.com/questions/485856/how-do-i-downgrade-google-chrome" [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [20:25:39] (03PS1) 10Filippo Giunchedi: prometheus: sum node_procs_running across clusters [puppet] - 10https://gerrit.wikimedia.org/r/328220 [20:26:54] (03PS3) 10Paladox: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) [20:27:25] (03PS4) 10Paladox: Contint: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) [20:28:03] (03PS1) 10Filippo Giunchedi: package_builder: rebuild Packages only when needed [puppet] - 10https://gerrit.wikimedia.org/r/328221 [20:28:15] (03PS4) 10Ottomata: Properly point eventbus.svc to codfw endpoint in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328212 [20:28:48] (03PS1) 10Filippo Giunchedi: admin: add proxy/on-off for filippo [puppet] - 10https://gerrit.wikimedia.org/r/328222 [20:32:17] (03PS2) 10Filippo Giunchedi: prometheus: add aggregation rules for varnish_exporter [puppet] - 10https://gerrit.wikimedia.org/r/327873 (https://phabricator.wikimedia.org/T147424) [20:35:57] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services (done): establish new thresholds for cassandra alarms after switching restbase to dtcs - https://phabricator.wikimedia.org/T118976#2888171 (10Eevans) 05stalled>03Resolved >>! In T118976#2882711, @mobrovac wrote: > Could be time to close this o... 
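For context on the analytics1027 disk episode above (19:14 to 19:34): the quick win was apt-get clean, and the durable fix ottomata describes is carving a dedicated /srv logical volume out of the mostly free 1T volume group, so deploy artifacts stop filling the 20G root filesystem. A minimal sketch of that sequence, assuming a hypothetical volume group name analytics1027-vg, ext4, and an illustrative size:

    # find what is filling the root filesystem
    df -h /
    du -xsh /var/* | sort -h | tail

    # easy reclaim: cached package downloads (the ~500M freed above)
    apt-get clean

    # carve a dedicated /srv from the existing volume group
    lvcreate -L 200G -n srv analytics1027-vg
    mkfs.ext4 /dev/analytics1027-vg/srv
    mount /dev/analytics1027-vg/srv /srv    # plus an /etc/fstab entry to persist it

One operational detail visible in the log: any shell whose working directory sits inside /srv keeps the old mount point busy, hence the "can you cd out of /srv?" exchange at 19:24.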
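Context for the dumps patch at 20:19 ("Make md5sums.txt files compatible with md5sum --check"): md5sum --check only accepts a manifest whose lines consist of the digest, two spaces, then the file name, which is exactly what md5sum itself emits. A minimal sketch, with hypothetical dump file names:

    # write a manifest in the format md5sum understands
    md5sum enwiki-pages-articles.xml.bz2 enwiki-abstract.xml.gz > md5sums.txt

    # each line looks like:
    # 3adbf6987d3fd9bd9d1e026ed209bde4  enwiki-pages-articles.xml.bz2

    # verify later; exits non-zero if any file fails
    md5sum --check md5sums.txt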
[20:37:00] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:51:14] (03PS5) 10Paladox: Contint: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) [20:51:24] ottomata: sorry, logged out [20:52:08] (03CR) 10jenkins-bot: [V: 04-1] Contint: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [20:54:49] (03PS6) 10Paladox: Contint: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) [20:58:53] !log nuria@tin Starting deploy [analytics/refinery@ead5b8b]: (no message) [20:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:52] !log nuria@tin Finished deploy [analytics/refinery@ead5b8b]: (no message) (duration: 01m 59s) [21:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:08] (03CR) 10Filippo Giunchedi: [C: 032] admin: add proxy/on-off for filippo [puppet] - 10https://gerrit.wikimedia.org/r/328222 (owner: 10Filippo Giunchedi) [21:04:49] (03PS1) 10Ottomata: Ensure PYTHONPATH has refinery utils available for analytics cluster camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/328229 [21:05:29] (03PS2) 10Ottomata: Ensure PYTHONPATH has refinery utils available for analytics cluster camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/328229 [21:05:45] (03CR) 10Dzahn: [] "wait, what? "stretch" is a _down_grade? that's after jessie" [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [21:06:33] (03CR) 10Paladox: [] "Yep, but it currently shows stretch is using an older version, see https://packages.debian.org/stretch/chromium" [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [21:06:40] (03CR) 10Ottomata: [] "https://puppet-compiler.wmflabs.org/4936/analytics1027.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/328229 (owner: 10Ottomata) [21:06:42] (03CR) 10Ottomata: [C: 032] Ensure PYTHONPATH has refinery utils available for analytics cluster camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/328229 (owner: 10Ottomata) [21:06:48] (03PS3) 10Ottomata: Ensure PYTHONPATH has refinery utils available for analytics cluster camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/328229 [21:06:52] (03CR) 10Ottomata: [V: 032 C: 032] Ensure PYTHONPATH has refinery utils available for analytics cluster camus jobs [puppet] - 10https://gerrit.wikimedia.org/r/328229 (owner: 10Ottomata) [21:07:14] (03CR) 10Dzahn: [C: 04-1] "Package : chromium-browser" [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [21:08:06] (03CR) 10Paladox: [] "Oh, but ci is now broken for mw core and cirrus search (mw ext)" [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [21:08:53] (03CR) 10Dzahn: [C: 04-1] "https://www.debian.org/security/2016/dsa-3731" [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [21:10:00] (03CR) 10Paladox: [] "Yep, but we don't visit other websites using the browser. We test mw core in the browser."
[puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [21:10:48] (03PS1) 10Ottomata: Fix script path to camus [puppet] - 10https://gerrit.wikimedia.org/r/328231 [21:11:03] (03CR) 10Ottomata: [V: 032 C: 032] Fix script path to camus [puppet] - 10https://gerrit.wikimedia.org/r/328231 (owner: 10Ottomata) [21:22:47] (03CR) 10Filippo Giunchedi: [C: 04-1] "+1 on running on 3.4, though let's go with 3.4+2.7 and without skip_missing_environments, reason being that I want tox to fail if py2 and " [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/328116 (owner: 10Legoktm) [21:30:32] (03CR) 10Paladox: [C: 04-1] Contint: Downgrade Chromium to 53.0.2785 [puppet] - 10https://gerrit.wikimedia.org/r/328217 (https://phabricator.wikimedia.org/T153597) (owner: 10Paladox) [21:32:40] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused [21:34:00] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [21:34:50] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:35:18] (03PS2) 10Filippo Giunchedi: prometheus: sum node_procs_running across clusters [puppet] - 10https://gerrit.wikimedia.org/r/328220 [21:35:20] (03PS1) 10Filippo Giunchedi: admin: fix .bashrc for filippo [puppet] - 10https://gerrit.wikimedia.org/r/328237 [21:37:27] (03PS1) 10Ottomata: Alert on EventBus service HTTP error rate [puppet] - 10https://gerrit.wikimedia.org/r/328239 (https://phabricator.wikimedia.org/T153034) [21:38:51] (03PS2) 10Ottomata: Alert on EventBus service HTTP error rate [puppet] - 10https://gerrit.wikimedia.org/r/328239 (https://phabricator.wikimedia.org/T153034) [21:40:21] (03CR) 10Ppchelko: [C: 031] Alert on EventBus service HTTP error rate [puppet] - 10https://gerrit.wikimedia.org/r/328239 (https://phabricator.wikimedia.org/T153034) (owner: 10Ottomata) [21:43:33] (03CR) 10Filippo Giunchedi: [C: 032] admin: fix .bashrc for filippo [puppet] - 10https://gerrit.wikimedia.org/r/328237 (owner: 10Filippo Giunchedi) [21:43:49] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: sum node_procs_running across clusters [puppet] - 10https://gerrit.wikimedia.org/r/328220 (owner: 10Filippo Giunchedi) [21:50:14] 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2888316 (10Zppix) 05Open>03stalled >>! In T152132#2873547, @Dzahn wrote: > traffic team asked to wait a couple days because they were in the middle of... 
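On the Chromium downgrade thread running through here: with apt, installing an older release means naming the exact version, and that only works while the version is still published in a configured repository (or is available as a local .deb). A sketch; the full version string is a placeholder, since only 53.0.2785 appears in the patch subject:

    # list the candidate versions apt can currently see
    apt-cache policy chromium

    # install a specific older version (substitute the exact string printed above)
    apt-get install chromium=53.0.2785.<rest-of-version-string>

    # optionally hold it so routine upgrades don't bump it back up
    apt-mark hold chromium

The DSA-3731 link in Dzahn's -1 points at the trade-off: a held-back browser stops receiving security fixes.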
[21:57:40] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.079 second response time [21:59:57] (03PS2) 10Dzahn: lists/exim: move files from /files to role module [puppet] - 10https://gerrit.wikimedia.org/r/327138 [22:02:40] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:03:00] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:06:59] (03PS1) 10Andrew Bogott: Keystone: Give custom auth plugins entry points [puppet] - 10https://gerrit.wikimedia.org/r/328293 (https://phabricator.wikimedia.org/T150773) [22:08:02] (03CR) 10jenkins-bot: [V: 04-1] Keystone: Give custom auth plugins entry points [puppet] - 10https://gerrit.wikimedia.org/r/328293 (https://phabricator.wikimedia.org/T150773) (owner: 10Andrew Bogott) [22:09:19] (03PS2) 10Andrew Bogott: Keystone: Give custom auth plugins entry points [puppet] - 10https://gerrit.wikimedia.org/r/328293 (https://phabricator.wikimedia.org/T150773) [22:18:34] (03PS3) 10Andrew Bogott: Keystone: Give custom auth plugins entry points [puppet] - 10https://gerrit.wikimedia.org/r/328293 (https://phabricator.wikimedia.org/T150773) [22:19:48] (03CR) 10Andrew Bogott: [C: 032] Keystone: Give custom auth plugins entry points [puppet] - 10https://gerrit.wikimedia.org/r/328293 (https://phabricator.wikimedia.org/T150773) (owner: 10Andrew Bogott) [22:23:40] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:26:00] andrewbogott hi, login into horizon is failing [22:26:00] now [22:26:01] Unable to establish connection to keystone endpoint. [22:26:06] https://horizon.wikimedia.org/auth/login/ [22:26:12] paladox: yep, I'm looking [22:26:17] ok [22:26:19] thanks [22:28:10] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:28:43] andrewbogott you made a syntax error https://gerrit.wikimedia.org/r/#/c/328293/3/modules/openstack/manifests/keystone/service.pp [22:28:46] on ^^ [22:28:53] line 55 [22:29:10] recurse => true; should be recurse => true, [22:29:32] require => Package['keystone']; should also be require => Package['keystone'], [22:29:55] why?
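A note on the exchange that starts at 22:28: the flagged semicolons are valid Puppet, not a syntax error. One resource declaration may contain several bodies; attributes within a body are separated by commas, while the bodies themselves are terminated by semicolons. A minimal sketch of that shape (titles and attributes hypothetical):

    file {
        '/etc/keystone/plugins':
            ensure  => directory,
            recurse => true;            # ';' ends this resource body
        '/etc/keystone/plugins/__init__.py':
            ensure  => present,
            require => Package['keystone'];
    }

This is why andrewbogott points at the earlier lines of the same declaration just below, and why the actual breakage turns out to be missing __init__ files rather than punctuation.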
[22:30:00] Lines above have ; [22:30:00] paladox: I think you are incorrect [22:30:05] Oh [22:30:08] see 35, 41 [22:30:12] (03PS1) 10Andrew Bogott: Include some needed __init__ files [puppet] - 10https://gerrit.wikimedia.org/r/328297 [22:30:33] ok sorry [22:30:50] RECOVERY - NTP on prometheus2003 is OK: NTP OK: Offset 0.0001608729362 secs [22:31:26] paladox: login should be fixed [22:31:47] ok [22:31:48] thanks [22:32:09] (03CR) 10Andrew Bogott: [C: 032] Include some needed __init__ files [puppet] - 10https://gerrit.wikimedia.org/r/328297 (owner: 10Andrew Bogott) [22:32:10] yep [22:32:13] works now, thanks [22:35:53] (03PS1) 10Filippo Giunchedi: debian: ditch sysv/systemd on trusty, use upstart [software/hhvm_exporter] (debian/trusty) - 10https://gerrit.wikimedia.org/r/328298 [22:36:11] (03CR) 10jenkins-bot: [V: 04-1] debian: ditch sysv/systemd on trusty, use upstart [software/hhvm_exporter] (debian/trusty) - 10https://gerrit.wikimedia.org/r/328298 (owner: 10Filippo Giunchedi) [22:39:55] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/4938/" [puppet] - 10https://gerrit.wikimedia.org/r/327138 (owner: 10Dzahn) [22:40:01] (03PS3) 10Dzahn: lists/exim: move files from /files to role module [puppet] - 10https://gerrit.wikimedia.org/r/327138 [22:47:46] (03PS1) 10Andrew Bogott: Nova: Fix some deprecated settings in nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/328299 [22:50:05] (03CR) 10Andrew Bogott: [C: 032] Nova: Fix some deprecated settings in nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/328299 (owner: 10Andrew Bogott) [22:51:40] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:54:40] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [22:56:40] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:57:10] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:57:14] (03PS4) 10Dzahn: lists/exim: move files from /files to role module [puppet] - 10https://gerrit.wikimedia.org/r/327138 [23:03:28] (03PS1) 10Filippo Giunchedi: debian: move gdb.conf to gbp.conf [debs/prometheus-apache-exporter] - 10https://gerrit.wikimedia.org/r/328301 [23:05:12] (03PS1) 10Filippo Giunchedi: debian: ditch sysv/systemd on trusty, use upstart [debs/prometheus-apache-exporter] (debian/trusty) - 10https://gerrit.wikimedia.org/r/328302 [23:09:20] (03CR) 10Dzahn: "no-op on fermium, mx1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/327138 (owner: 10Dzahn) [23:09:40] (03PS2) 10Dzahn: installserver/CI: give shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327595 (https://phabricator.wikimedia.org/T148494) [23:15:58] (03PS1) 10Andrew Bogott: Glance: rename a deprecated setting [puppet] - 10https://gerrit.wikimedia.org/r/328303 [23:17:51] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/exim4/wikimedia_domains] [23:17:58] oh, heh, puppet.. well.. 
that error is acceptable: Error: Attempt to assign to a reserved variable name: 'trusted' on node [23:19:01] (03PS2) 10Andrew Bogott: Glance: rename a deprecated setting [puppet] - 10https://gerrit.wikimedia.org/r/328303 [23:19:40] mutante: I don't know where that error came from but it's breaking the puppet compiler… [23:19:47] Is it because of a package upgrade or something? [23:20:11] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [23:21:44] andrewbogott: i think it's a limitation of the compiler itself [23:21:54] mutante: but why today? [23:22:52] does the compiler fail hard on that? the same error happens in production sometimes too T153246 [23:22:52] T153246: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246 [23:23:17] that's the ticket i could not find [23:23:17] thanks [23:23:36] i compiled a couple other things just fine today [23:23:54] i thought it was just related to which node i test something on, now i dont know anymore [23:25:22] I couldn't get useful compiler runs at all today [23:25:29] but maybe coincidence, it was only one patch [23:26:03] (03CR) 10Andrew Bogott: [C: 032] Glance: rename a deprecated setting [puppet] - 10https://gerrit.wikimedia.org/r/328303 (owner: 10Andrew Bogott) [23:28:51] (03CR) 10Legoktm: [] "OK, but just to be clear, if the tests run and fail on any interpreter, then the overall job will fail. skip_missing_environments is just " [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/328116 (owner: 10Legoktm) [23:28:56] recompiles the same thing another time [23:33:46] andrewbogott: which host was it in your case, btw? for me it was carbon where it fails (can repeat it), but not on others [23:34:02] labcontrol1001 and labtestcontrol2001 [23:34:04] i think i had it before and it was where there is also ganglia [23:34:20] well there goes that theory [23:35:56] (03PS1) 10Filippo Giunchedi: Add apache/hhvm exporter to imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/328309 (https://phabricator.wikimedia.org/T147423) [23:37:37] godog: I commented on https://gerrit.wikimedia.org/r/#/c/328116/ [23:38:48] (03CR) 10Filippo Giunchedi: [C: 032] debian: move gdb.conf to gbp.conf [debs/prometheus-apache-exporter] - 10https://gerrit.wikimedia.org/r/328301 (owner: 10Filippo Giunchedi) [23:39:04] (03CR) 10Filippo Giunchedi: [C: 032] debian: ditch sysv/systemd on trusty, use upstart [debs/prometheus-apache-exporter] (debian/trusty) - 10https://gerrit.wikimedia.org/r/328302 (owner: 10Filippo Giunchedi) [23:41:45] legoktm: thanks! yeah I'm interested in making sure tox gets to run on at least python2 and 3 regardless of the minor version, it looks like if we skip environments it might happen that python2 or 3 will get skipped if missing [23:42:10] (03PS3) 10Dzahn: installserver/CI: give shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327595 (https://phabricator.wikimedia.org/T148494) [23:42:21] (03CR) 10Dzahn: [] "http://puppet-compiler.wmflabs.org/4940/" [puppet] - 10https://gerrit.wikimedia.org/r/327595 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [23:42:36] godog: I could change it to "py2,py3" then? 
[23:43:08] sure that'd work too I think, I'm assuming it means "any version" [23:43:17] yes [23:43:40] (03CR) 10Dzahn: [C: 032] installserver/CI: give shell scripts a file extension [puppet] - 10https://gerrit.wikimedia.org/r/327595 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [23:45:01] (03PS2) 10Legoktm: Run tests on any Python 2 & 3 version [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/328116 [23:48:59] (03PS3) 10Dzahn: Gerrit: Remove java 7 package [puppet] - 10https://gerrit.wikimedia.org/r/327756 (owner: 10Paladox) [23:50:25] (03CR) 10Filippo Giunchedi: [C: 032] Run tests on any Python 2 & 3 version [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/328116 (owner: 10Legoktm) [23:50:35] legoktm: awesome, thanks [23:51:21] np :) [23:52:22] (03CR) 10Filippo Giunchedi: [] "recheck" [software/hhvm_exporter] (debian/trusty) - 10https://gerrit.wikimedia.org/r/328298 (owner: 10Filippo Giunchedi) [23:52:28] (03CR) 10Filippo Giunchedi: [C: 032] Add apache/hhvm exporter to imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/328309 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [23:52:33] (03PS2) 10Filippo Giunchedi: Add apache/hhvm exporter to imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/328309 (https://phabricator.wikimedia.org/T147423) [23:52:40] (03CR) 10jenkins-bot: [V: 04-1] debian: ditch sysv/systemd on trusty, use upstart [software/hhvm_exporter] (debian/trusty) - 10https://gerrit.wikimedia.org/r/328298 (owner: 10Filippo Giunchedi) [23:55:15] (03PS1) 10Filippo Giunchedi: debian: ditch sysv/systemd on trusty, use upstart [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/328310 [23:56:19] (03Abandoned) 10Filippo Giunchedi: debian: ditch sysv/systemd on trusty, use upstart [software/hhvm_exporter] (debian/trusty) - 10https://gerrit.wikimedia.org/r/328298 (owner: 10Filippo Giunchedi) [23:56:28] (03CR) 10Filippo Giunchedi: [C: 032] debian: ditch sysv/systemd on trusty, use upstart [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/328310 (owner: 10Filippo Giunchedi)
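To close the loop on the tox thread (21:22 and 23:28 onward): major-version-only factors such as py2 and py3 make tox run against whichever python2/python3 interpreters the host provides, rather than pinning a minor version, and leaving out the skip option (tox's actual name for it is skip_missing_interpreters) means a missing interpreter fails the run instead of being silently skipped, which is the behavior godog wanted. A minimal tox.ini sketch of the merged approach; deps and the test command are assumptions:

    [tox]
    envlist = py2,py3

    [testenv]
    deps = -rrequirements.txt
    commands = python -m unittest discover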
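Likewise, the "ditch sysv/systemd on trusty, use upstart" changes above reflect that Ubuntu trusty boots with upstart, so a package shipping only a systemd unit leaves its daemon unmanaged there. With debhelper, an upstart job dropped in at debian/<package>.upstart is installed by dh_installinit; a rough sketch of such a job, with the file name and exec line entirely hypothetical:

    # debian/prometheus-hhvm-exporter.upstart
    description "Prometheus HHVM exporter"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn
    exec /usr/bin/prometheus-hhvm-exporter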